CN111128119B - Voice synthesis method and device - Google Patents
- Publication number: CN111128119B
- Application number: CN201911420316.5A
- Authority: CN (China)
- Legal status: Active (the status is an assumption, not a legal conclusion)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
Abstract
The invention discloses a voice synthesis method and device. The method comprises the following steps: sequentially recording N pieces of voice information from a user through a preset device; when the preset device finishes recording the first N/2 pieces of voice information, sending them to a server side; training a preset baseline model on the server side with the first N/2 pieces to obtain a first voice synthesis model; when the preset device finishes recording the last N/2 pieces, sending them to the server side; and training the first voice synthesis model with the last N/2 pieces to obtain a second voice synthesis model. With this technical scheme, a speech synthesis model that captures the user's speaking style and emotion can be built on demand. The synthesis results have high naturalness, i.e., high similarity to the user's speaking style, emotion, and timbre; model construction takes little time; and the user experience is greatly improved.
Description
Technical Field
The present invention relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method and apparatus.
Background
Speech synthesis, also known as text-to-speech (TTS) technology, converts arbitrary text information into standard, fluent speech in real time, which is equivalent to fitting a machine with an artificial mouth. It involves acoustics, linguistics, digital signal processing, computer science, and other disciplines, and is a leading-edge technology in the field of Chinese information processing. The main problem it solves is how to convert text information into audible sound information, that is, how to make a machine speak like a person. Personalized speech synthesis records a number of speech segments of a specific person through recording equipment so that a TTS system can synthesize speech with that person's voice, speaking style, and emotion.
At present, in personalized speech synthesis, a single speech synthesis model is used for users of all ages and sexes, and synthesis can only follow the specific speaking style and emotion embedded in that model. As a result, the similarity between the synthesis result and the user's speaking style, emotion, and timbre is low, which greatly degrades the user experience.
Disclosure of Invention
The invention provides a voice synthesis method and a voice synthesis device. The technical scheme is as follows:
according to a first aspect of the embodiments of the present invention, there is provided a speech synthesis method, including:
sequentially recording N pieces of voice information for a user through preset equipment, wherein N is a positive integer;
when the preset equipment finishes recording the first N/2 pieces of voice information, sending the first N/2 pieces of voice information to a server side;
training a preset baseline model of the server side through the first N/2 pieces of voice information to obtain a first voice synthesis model;
when the preset equipment finishes recording the last N/2 pieces of voice information, sending the last N/2 pieces of voice information to the server side;
and training the first voice synthesis model through the last N/2 pieces of voice information to obtain a second voice synthesis model, wherein the second voice synthesis model is used for voice synthesis.
In an embodiment, the training the preset baseline model of the server side through the first N/2 pieces of speech information to obtain a first speech synthesis model includes:
when the number of the N pieces of voice information is smaller than the preset number, determining that the first voice synthesis model reaches a convergence state;
acquiring a preset number of models generated in the process of training a preset baseline model of the server side through the first N/2 pieces of voice information;
and when the N pieces of voice information are more than or equal to the preset number, selecting a model meeting a preset standard from the preset number of models as the first voice synthesis model.
In one embodiment, the speech synthesis method further comprises:
before the voice information is sent to the server side, noise reduction processing and screening processing are carried out on the voice information, and the voice information after the noise reduction processing and the screening processing are completed is sent to the server side.
In one embodiment, the screening processing of the voice information includes:
acquiring first voiceprint information prestored by the user;
extracting second voiceprint information in the voice information to judge whether the first voiceprint information is matched with the second voiceprint information;
and when the first voiceprint information is matched with the second voiceprint information, screening the voice information according to a preset standard.
In one embodiment, the second speech synthesis model has reached a converged state.
According to a second aspect of the embodiments of the present invention, there is provided a speech synthesis apparatus including:
the recording module is used for sequentially recording N pieces of voice information for a user through preset equipment, wherein N is a positive integer;
the first sending module is used for sending the first N/2 pieces of voice information to a server side when the preset equipment finishes recording the first N/2 pieces of voice information;
the first training module is used for training a preset baseline model of the server end through the first N/2 pieces of voice information to obtain a first voice synthesis model;
the second sending module is used for sending the last N/2 pieces of voice information to the server side when the preset equipment records the last N/2 pieces of voice information;
and the second training module is used for training the first voice synthesis model through the last N/2 pieces of voice information to obtain a second voice synthesis model, wherein the second voice synthesis model is used for voice synthesis.
In one embodiment, the first training module comprises:
the determining submodule is used for determining that the first voice synthesis model reaches a convergence state when the N pieces of voice information are smaller than the preset number;
the first obtaining submodule is used for obtaining a preset number of models generated in the process of training the preset baseline model of the server side through the first N/2 pieces of voice information;
and the selecting submodule is used for selecting a model which meets a preset standard from the preset number of models as the first voice synthesis model when the N pieces of voice information are more than or equal to the preset number.
In one embodiment, the speech synthesis apparatus further includes:
and the processing module is used for carrying out noise reduction processing and screening processing on the voice information before the voice information is sent to the server side, and sending the voice information after the noise reduction processing and the screening processing are finished to the server side.
In one embodiment, the processing module includes:
the second obtaining submodule is used for obtaining first voiceprint information prestored by the user;
the extraction submodule is used for extracting second voiceprint information in the voice information so as to judge whether the first voiceprint information is matched with the second voiceprint information;
and the screening submodule is used for screening the voice information according to a preset standard when the first voiceprint information is matched with the second voiceprint information.
In one embodiment, the second speech synthesis model has reached a converged state.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
the method comprises the steps of recording N pieces of voice information for a user in sequence through preset equipment, sending the front N/2 pieces of voice information to a server end when the preset equipment records the front N/2 pieces of voice information, then training a preset baseline model of the server end through the front N/2 pieces of voice information to obtain a first voice synthesis model, sending the back N/2 pieces of voice information to the server end when the preset equipment records the back N/2 pieces of voice information, further training the first voice synthesis model through the back N/2 pieces of voice information to obtain a second voice synthesis model, and then performing personalized voice synthesis through the second voice synthesis model, wherein compared with the method that a single voice synthesis model is adopted for users of different ages and different sexes, in the technical scheme of the invention, the method can synthesize the voice synthesis model with the user speaking mode or emotion according to the requirements of the user, namely, when the user needs to perform personalized voice synthesis through the voice synthesis model, N pieces of voice information of the user can be recorded in sequence, a preset baseline model is trained through the front N \2 pieces of voice information to obtain a first voice synthesis model, when the user urgently needs to perform personalized voice synthesis, the first voice synthesis model can be directly synthesized through the first voice synthesis model, then the first voice model is trained through the rear N \2 pieces of voice information on the basis of the first voice synthesis model to obtain a second voice synthesis model meeting the requirements of the user, the result naturalness of the synthesis of the second voice synthesis model is high, namely, the result has high similarity with the user speaking mode, emotion and the tone in the user voice, because the number of the 
required voices is small during model training, the time spent on model construction is short, and the user experience is greatly improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart of a speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another speech synthesis method according to an embodiment of the present invention;
FIG. 3 is a block diagram of a speech synthesis apparatus according to an embodiment of the present invention;
FIG. 4 is a block diagram of another speech synthesis apparatus according to an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
Fig. 1 is a flowchart of a speech synthesis method according to an embodiment of the present invention, and as shown in fig. 1, the method can be implemented as the following steps S11-S15:
in step S11, sequentially recording N pieces of voice information to a user through a preset device, where N is a positive integer; the value of N may be, but is not limited to, tens, for example thirty.
In step S12, when the preset device finishes recording the first N/2 pieces of voice information, the first N/2 pieces are sent to the server side. When N is even, the first N/2 pieces are the first half of the N recordings and the last N/2 pieces are the second half. When N is odd, the first piece belongs to the first N/2 pieces, the first half of the remaining pieces also belongs to the first N/2 pieces, and the second half belongs to the last N/2 pieces; for example, if N is 31, the first 16 pieces belong to the first N/2 pieces and the remaining 15 pieces belong to the last N/2 pieces.
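The odd/even split described above (the first batch takes the extra clip when N is odd, e.g. N=31 gives 16 and 15) can be sketched as follows; `recordings` is a hypothetical list of recorded clips:

```python
import math

def split_recordings(recordings):
    """Split N recordings into a first batch of ceil(N/2) clips and a
    second batch of the remaining floor(N/2) clips, as described for
    step S12 (e.g. N=31 -> 16 and 15)."""
    n = len(recordings)
    cut = math.ceil(n / 2)  # the first batch gets the extra clip when N is odd
    return recordings[:cut], recordings[cut:]
```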
in step S13, training a preset baseline model of the server side through the first N/2 pieces of speech information to obtain a first speech synthesis model; the preset baseline model is a model obtained by training a large amount of voice data, can be used for voice synthesis, but cannot be used for personalized voice synthesis, and the first voice synthesis model can be used for personalized voice synthesis, but the voice synthesis effect does not reach the optimal state.
In step S14, when the preset device finishes recording the last N/2 pieces of voice information, sending the last N/2 pieces of voice information to the server;
in step S15, the first speech synthesis model is trained by the last N/2 pieces of speech information to obtain a second speech synthesis model, where the second speech synthesis model is used for speech synthesis, that is, the second speech synthesis model can perform personalized speech synthesis, and the speech synthesis effect is optimal.
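The two-stage protocol of steps S11-S15 can be sketched as follows. This is an illustrative outline only: `record_clip` and `train` are hypothetical helpers standing in for the recording device and the server-side training routine, which the patent does not specify at this level of detail.

```python
def personalized_tts(n_clips, baseline, record_clip, train):
    """Record clips in sequence and fine-tune in two stages, so that a
    usable (first) model exists after only half of the recordings."""
    cut = (n_clips + 1) // 2                          # ceil(N/2)
    first_batch = [record_clip(i) for i in range(cut)]
    first_model = train(baseline, first_batch)        # S13: usable early
    second_batch = [record_clip(i) for i in range(cut, n_clips)]
    second_model = train(first_model, second_batch)   # S15: final model
    return first_model, second_model
```

A user who urgently needs personalized synthesis can already use `first_model` while the second batch is still being recorded.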
In summary, by recording the N pieces of voice information in two batches and training in two stages, a speech synthesis model with the user's speaking style and emotion is built on demand: the first voice synthesis model is available for personalized synthesis as soon as the first N/2 recordings are made, and the second voice synthesis model, trained further on the last N/2 recordings, yields results with high naturalness, i.e., high similarity to the user's speaking style, emotion, and timbre. Because few recordings are required for training, model construction takes little time, and the user experience is greatly improved.
As shown in fig. 2, in one embodiment, the above step S13 can be implemented as the following steps S131-S133:
In step S131, when the N pieces of speech information are fewer than the preset number, the first speech synthesis model is determined to have reached a convergence state, i.e., the state in which the mean squared error is smallest.
In step S132, obtaining a preset number of models generated in a process of training a preset baseline model of the server side through the first N/2 pieces of voice information;
In step S133, when the N pieces of speech information are greater than or equal to the preset number, a model meeting a preset standard is selected from the preset number of models as the first speech synthesis model. Under the preset standard, a validation set is used as the test sample; the data in the validation set are voice recordings made by the user. The preset number of models are evaluated on the validation set, that is, the predicted mean squared error of each model is computed, the errors are compared, and the model with the smallest predicted mean squared error is selected as the first speech synthesis model.
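The selection in step S133 amounts to picking the checkpoint with the lowest validation mean squared error. A minimal sketch follows; it assumes each saved checkpoint exposes a `predict` method and that `validation_set` holds `(features, target)` pairs — the actual model interface is not specified by the patent.

```python
def select_best_checkpoint(checkpoints, validation_set):
    """Pick the saved model with the smallest mean squared error on a
    validation set of user recordings (the 'preset standard')."""
    def mse(model):
        errors = [(model.predict(x) - y) ** 2 for x, y in validation_set]
        return sum(errors) / len(errors)
    return min(checkpoints, key=mse)
```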
A preset number of models generated while training the preset baseline model on the server side with the first N/2 pieces of voice information are obtained, and when the N pieces of voice information are greater than or equal to the preset number, the model meeting the preset standard is selected from them as the first voice synthesis model. This ensures that the first voice synthesis model is the best-performing model produced during training and provides an optimal foundation for generating the second voice synthesis model.
In one embodiment, the speech synthesis method further comprises:
before the voice information is sent to the server side, noise reduction processing and screening processing are performed on the voice information, and the processed voice information is sent to the server side. For example, overly long silent sections in the voice information can be removed by the screening process.
By performing noise reduction and screening on the voice information, cleaner recordings are obtained; training the model on these processed recordings makes the final model's speech synthesis more accurate.
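One simple form of the screening step above — capping overly long silent sections — can be sketched as follows. The amplitude threshold and silence cap are illustrative assumptions; the patent does not specify the noise-reduction or screening algorithms.

```python
def trim_long_silence(samples, threshold=0.01, max_silence=1600):
    """Cap runs of near-silent samples at `max_silence` samples, removing
    overly long silent sections from a recording. Thresholds here are
    illustrative, not taken from the patent."""
    out, run = [], 0
    for s in samples:
        if abs(s) < threshold:
            run += 1
            if run > max_silence:
                continue  # drop silence beyond the cap
        else:
            run = 0
        out.append(s)
    return out
```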
In one embodiment, the screening processing of the voice information includes:
acquiring first voiceprint information prestored by the user;
extracting second voiceprint information in the voice information to judge whether the first voiceprint information is matched with the second voiceprint information;
and when the first voiceprint information is matched with the second voiceprint information, screening the voice information according to a preset standard.
The second voiceprint information is extracted from the voice information recorded by the user and matched against the pre-stored first voiceprint information, which ensures that only recordings actually made by the target user are used for training.
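The voiceprint matching above could, for example, compare embeddings by cosine similarity, as sketched below. Both the embedding extractor (e.g. an i-vector or d-vector model, not shown) and the acceptance threshold are assumptions, since the patent does not specify how the two voiceprints are compared.

```python
def voiceprints_match(enrolled, extracted, threshold=0.8):
    """Compare a pre-stored voiceprint embedding (first voiceprint
    information) with one extracted from a new recording (second
    voiceprint information) via cosine similarity; accept the clip
    only when the similarity reaches the threshold."""
    dot = sum(a * b for a, b in zip(enrolled, extracted))
    norm = (sum(a * a for a in enrolled) ** 0.5) * \
           (sum(b * b for b in extracted) ** 0.5)
    return norm > 0 and dot / norm >= threshold
```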
In one embodiment, the second speech synthesis model has reached a converged state.
When the second speech synthesis model reaches the convergence state, the speech synthesis effect is optimal.
As to the above speech synthesis method provided by the embodiment of the present invention, an embodiment of the present invention further provides a speech synthesis apparatus, as shown in fig. 3, the apparatus includes:
the recording module 31 is configured to record N pieces of voice information from a user in sequence through a preset device, where N is a positive integer;
The first sending module 32 is configured to send the first N/2 pieces of voice information to a server side when the preset device finishes recording the first N/2 pieces of voice information;
the first training module 33 is configured to train a preset baseline model of the server through the first N/2 pieces of speech information to obtain a first speech synthesis model;
the second sending module 34 is configured to, when N/2 pieces of voice information are recorded by the preset device, send the last N/2 pieces of voice information to the server;
a second training module 35, configured to train the first speech synthesis model through the last N/2 pieces of speech information to obtain a second speech synthesis model, where the second speech synthesis model is used for speech synthesis.
As shown in fig. 4, in one embodiment, the first training module 33 includes:
the determining submodule 331 is configured to determine that the first speech synthesis model reaches a convergence state when the N pieces of speech information are smaller than a preset number;
a first obtaining submodule 332, configured to obtain a preset number of models generated in a process of training a preset baseline model of the server through the first N/2 pieces of voice information;
the selecting submodule 333 is configured to select, when the N pieces of speech information are greater than or equal to a preset number, a model meeting a preset standard from among the preset number of models as the first speech synthesis model.
In one embodiment, the speech synthesis apparatus further includes:
and the processing module is used for carrying out noise reduction processing and screening processing on the voice information before the voice information is sent to the server side, and sending the voice information after the noise reduction processing and the screening processing are finished to the server side.
In one embodiment, the processing module includes:
the second obtaining submodule is used for obtaining first voiceprint information prestored by the user;
the extraction submodule is used for extracting second voiceprint information in the voice information so as to judge whether the first voiceprint information is matched with the second voiceprint information;
and the screening submodule is used for screening the voice information according to a preset standard when the first voiceprint information is matched with the second voiceprint information.
In one embodiment, the second speech synthesis model has reached a converged state.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (8)
1. A method of speech synthesis, comprising:
sequentially recording N pieces of voice information for a user through preset equipment, wherein N is a positive integer;
when the preset equipment finishes recording the first N/2 pieces of voice information, sending the first N/2 pieces of voice information to a server side;
training the preset baseline model of the server end through the first N/2 pieces of voice information to obtain a first voice synthesis model, which specifically comprises the following steps:
when the number of the N pieces of voice information is smaller than the preset number, determining that the first voice synthesis model reaches a convergence state;
acquiring a preset number of models generated in the process of training a preset baseline model of the server side through the first N/2 pieces of voice information;
when the number of the N pieces of voice information is larger than or equal to a preset number, selecting a model which meets a preset standard from the preset number of models as the first voice synthesis model;
when the preset equipment records the back N/2 pieces of voice information, sending the back N/2 pieces of voice information to the server side;
and training the first voice synthesis model through the last N/2 pieces of voice information to obtain a second voice synthesis model, wherein the second voice synthesis model is used for voice synthesis.
2. The method of claim 1, further comprising:
before the voice information is sent to the server side, noise reduction processing and screening processing are carried out on the voice information, and the voice information after the noise reduction processing and the screening processing are completed is sent to the server side.
3. The method of claim 2, wherein the screening processing of the voice information comprises:
acquiring first voiceprint information prestored by the user;
extracting second voiceprint information in the voice information to judge whether the first voiceprint information is matched with the second voiceprint information;
and when the first voiceprint information is matched with the second voiceprint information, screening the voice information according to a preset standard.
4. The method of claim 1, wherein the second speech synthesis model has reached a converged state.
5. A speech synthesis apparatus, comprising:
the recording module is used for sequentially recording N pieces of voice information for a user through preset equipment, wherein N is a positive integer;
the first sending module is used for sending the first N/2 pieces of voice information to a server side when the preset equipment finishes recording the first N/2 pieces of voice information;
the first training module is used for training the preset baseline model of the server end through the first N/2 pieces of voice information to obtain a first voice synthesis model, and comprises:
the determining submodule is used for determining that the first voice synthesis model has reached a convergence state when the number N of pieces of voice information is smaller than the preset number;
the first obtaining submodule is used for obtaining a preset number of models generated in the process of training the preset baseline model of the server side through the first N/2 pieces of voice information;
the selecting submodule is used for selecting a model which meets a preset standard from the preset number of models as the first voice synthesis model when the number N of pieces of voice information is greater than or equal to the preset number;
the second sending module is used for sending the last N/2 pieces of voice information to the server side when the preset equipment finishes recording the last N/2 pieces of voice information;
and the second training module is used for training the first voice synthesis model through the last N/2 pieces of voice information to obtain a second voice synthesis model, wherein the second voice synthesis model is used for voice synthesis.
6. The apparatus of claim 5, further comprising:
and the processing module is used for carrying out noise reduction processing and screening processing on the voice information before the voice information is sent to the server side, and sending the voice information after the noise reduction processing and the screening processing are finished to the server side.
7. The apparatus of claim 6, wherein the processing module comprises:
the second obtaining submodule is used for obtaining first voiceprint information prestored by the user;
the extraction submodule is used for extracting second voiceprint information in the voice information so as to judge whether the first voiceprint information is matched with the second voiceprint information;
and the screening submodule is used for screening the voice information according to a preset standard when the first voiceprint information is matched with the second voiceprint information.
8. The apparatus of claim 5, wherein the second voice synthesis model has reached a convergence state.
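The voiceprint-gated screening in claims 3 and 7 can be sketched in the same spirit. The claims only require judging whether the prestored first voiceprint information matches the second voiceprint information extracted from the recording; the vector voiceprints, cosine similarity, and the 0.8 threshold below are illustrative assumptions, as is the `passes_standard` callback standing in for the "preset standard".

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two voiceprint vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def screen_voice(first_voiceprint, second_voiceprint, utterance,
                 passes_standard, threshold=0.8):
    """Keep the utterance only if the prestored voiceprint matches the one
    extracted from the recording AND the utterance passes the preset
    screening standard; otherwise discard it (return None)."""
    if cosine_similarity(first_voiceprint, second_voiceprint) < threshold:
        return None  # speaker mismatch: not the enrolled user, discard
    return utterance if passes_standard(utterance) else None
```

The two-step gate mirrors the claim order: voiceprint matching first, so recordings from other speakers never reach the quality screening, and only utterances passing both checks are forwarded to the server side for training.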
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911420316.5A CN111128119B (en) | 2019-12-31 | 2019-12-31 | Voice synthesis method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911420316.5A CN111128119B (en) | 2019-12-31 | 2019-12-31 | Voice synthesis method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111128119A CN111128119A (en) | 2020-05-08 |
CN111128119B true CN111128119B (en) | 2022-04-22 |
Family
ID=70506935
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911420316.5A Active CN111128119B (en) | 2019-12-31 | 2019-12-31 | Voice synthesis method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111128119B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111862933A (en) * | 2020-07-20 | 2020-10-30 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating synthesized speech |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8175873B2 (en) * | 2008-12-12 | 2012-05-08 | At&T Intellectual Property I, L.P. | System and method for referring to entities in a discourse domain |
CN105185372B (en) * | 2015-10-20 | 2017-03-22 | 百度在线网络技术(北京)有限公司 | Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device |
US11238843B2 (en) * | 2018-02-09 | 2022-02-01 | Baidu Usa Llc | Systems and methods for neural voice cloning with a few samples |
CN110148398A (en) * | 2019-05-16 | 2019-08-20 | 平安科技(深圳)有限公司 | Training method, device, equipment and the storage medium of speech synthesis model |
- 2019-12-31 CN CN201911420316.5A patent/CN111128119B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN111128119A (en) | 2020-05-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110782872A (en) | Language identification method and device based on deep convolutional recurrent neural network | |
CN108847215B (en) | Method and device for voice synthesis based on user timbre | |
CN111667812A (en) | Voice synthesis method, device, equipment and storage medium | |
JP2013539558A (en) | Parameter speech synthesis method and system | |
CN104903954A (en) | Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination | |
US20210327446A1 (en) | Method and apparatus for reconstructing voice conversation | |
JP2021110943A (en) | Cross-lingual voice conversion system and method | |
CN112735371B (en) | Method and device for generating speaker video based on text information | |
CN111916054B (en) | Lip-based voice generation method, device and system and storage medium | |
WO2019119279A1 (en) | Method and apparatus for emotion recognition from speech | |
CN108091323A (en) | For identifying the method and apparatus of emotion from voice | |
CN111128119B (en) | Voice synthesis method and device | |
CN114283783A (en) | Speech synthesis method, model training method, device and storage medium | |
CN112580669B (en) | Training method and device for voice information | |
CN117275498A (en) | Voice conversion method, training method of voice conversion model, electronic device and storage medium | |
CN112185342A (en) | Voice conversion and model training method, device and system and storage medium | |
CN111383627B (en) | Voice data processing method, device, equipment and medium | |
US20230252971A1 (en) | System and method for speech processing | |
Yanagisawa et al. | Noise robustness in HMM-TTS speaker adaptation | |
CN112885326A (en) | Method and device for creating personalized speech synthesis model, method and device for synthesizing and testing speech | |
JP6000326B2 (en) | Speech synthesis model learning device, speech synthesis device, speech synthesis model learning method, speech synthesis method, and program | |
CN114005428A (en) | Speech synthesis method, apparatus, electronic device, storage medium, and program product | |
CN115700871A (en) | Model training and speech synthesis method, device, equipment and medium | |
Nthite et al. | End-to-End Text-To-Speech synthesis for under resourced South African languages | |
CN111429878A (en) | Self-adaptive speech synthesis method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||