CN111312210B - Image-text fused speech synthesis method and device - Google Patents

Image-text fused speech synthesis method and device

Info

Publication number
CN111312210B
CN111312210B CN202010145198.8A CN202010145198A
Authority
CN
China
Prior art keywords
text
features
visual information
module
synthesizing
Prior art date
Legal status
Active
Application number
CN202010145198.8A
Other languages
Chinese (zh)
Other versions
CN111312210A (en)
Inventor
张晋
Current Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority to CN202010145198.8A
Publication of CN111312210A
Application granted
Publication of CN111312210B
Active legal status: Current
Anticipated expiration legal status

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers

Abstract

The invention discloses an image-text fused speech synthesis method and device. The method comprises the following steps: acquiring text information features of a current sports text; determining a video picture corresponding to the current sports text and acquiring visual information features of the video picture; fusing the text information features and the visual information features to obtain multi-modal features; and synthesizing the target speech based on the multi-modal features. Because the text information features and the visual information features are fused into multi-modal features before speech synthesis, the synthesized speech is no longer uniform and monotonous: different speech can be synthesized for different scene states. This solves the problems in the prior art that, because only single-modal text features are used, accurate and flexible customized speech synthesis cannot be achieved for the synthesis scene, the synthesis result becomes monotonous, and special situations cannot be handled.

Description

Image-text fused speech synthesis method and device
Technical Field
The invention relates to the technical field of speech synthesis, and in particular to an image-text fused speech synthesis method and device.
Background
With the development of speech synthesis technology and the popularization of its applications, speech synthesis services are increasingly accepted and used by users. An existing speech synthesis system is mainly divided into a front end and a back end: the front end is responsible for extracting text feature information such as word segmentation, part of speech and polyphone labeling, and the back end generates the speech from the text features extracted by the front end. End-to-end synthesis systems have emerged in order to simplify the synthesis system as much as possible and to reduce manual intervention and the requirements on linguistic background knowledge: text or ZhuYin (phonetic) characters are input directly, and the system outputs the audio waveform. However, this approach has the following disadvantage: prior-art speech synthesis methods only use single-modal text features and neglect the importance of visual information, so accurate and flexible customized speech synthesis cannot be achieved for the synthesis scene. The synthesis result becomes monotonous, and special situations cannot be handled.
Disclosure of Invention
In view of the above problems, the present invention fuses text information features and visual information features into multi-modal features and synthesizes speech from them.
An image-text fused speech synthesis method comprises the following steps:
acquiring text information features of a current sports text;
determining a video picture corresponding to the current sports text according to the current sports text, and acquiring visual information features of the video picture;
fusing the text information features and the visual information features to obtain multi-modal features;
synthesizing the target speech according to the multi-modal features.
Preferably, the acquiring the text information features of the current sports text includes:
constructing a preset dictionary;
acquiring each character in the current sports text;
encoding each of the characters into a one-hot encoded vector;
and acquiring the text information features of the one-hot encoded vectors by using an embedding layer of the preset dictionary.
Preferably, the determining a video picture corresponding to the current sports text according to the current sports text and acquiring the visual information features of the video picture includes:
determining a sports event video corresponding to the current sports text according to the current sports text;
acquiring n frames corresponding to the sports event video;
acquiring n pieces of visual information from the n frames by using an image encoder;
and combining the n pieces of visual information to determine the visual information features of the video picture.
Preferably, the fusing the text information feature and the visual information feature to obtain a multi-modal feature includes:
generating a weight proportion by using the visual information characteristics;
carrying out weighting processing on the text information characteristics by utilizing the weight proportion;
and determining the text information features after the weighting processing as the multi-modal features.
Preferably, the synthesizing the target speech according to the multi-modal features includes:
decoding the multi-modal features with an attention-based decoder;
generating a spectrum from the decoded multi-modal features by using a post-processing module;
and synthesizing the target voice according to the frequency spectrum.
An image-text fused speech synthesis device, the device comprising:
the first acquisition module is used for acquiring the text information features of the current sports text;
the second acquisition module is used for determining a video picture corresponding to the current sports text according to the current sports text and acquiring visual information features of the video picture;
the fusion module is used for fusing the text information features and the visual information features to obtain multi-modal features;
and the synthesis module is used for synthesizing the target voice according to the multi-modal characteristics.
Preferably, the first obtaining module includes:
the construction submodule is used for constructing a preset dictionary;
the first acquisition sub-module is used for acquiring each character in the current sports text;
an encoding sub-module for encoding each of the characters into a one-hot encoded vector;
and the second obtaining submodule is used for obtaining the text information features of the one-hot encoded vectors by using the embedding layer of the preset dictionary.
Preferably, the second obtaining module includes:
the first determining submodule is used for determining the sports event video corresponding to the current sports text according to the current sports text;
the third acquisition submodule is used for acquiring n frames corresponding to the sports event video;
a fourth obtaining sub-module, configured to obtain n pieces of visual information of the n frames by using an image encoder;
and the combining sub-module is used for combining the n pieces of visual information to determine the visual information characteristics of the video pictures.
Preferably, the fusion module includes:
the first generation submodule is used for generating a weight proportion by utilizing the visual information characteristics;
the processing submodule is used for carrying out weighting processing on the text information characteristics by utilizing the weight proportion;
and the second determining sub-module is used for determining the text information features after the weighting processing as the multi-modal features.
Preferably, the synthesis module comprises:
a decoding sub-module for decoding the multi-modal features with an attention-based decoder;
the second generation submodule is used for generating a spectrum from the decoded multi-modal features by using the post-processing module;
and the synthesis sub-module is used for synthesizing the target voice according to the frequency spectrum.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
Fig. 1 is a workflow chart of the image-text fused speech synthesis method provided by the present invention;
fig. 2 is another workflow chart of the image-text fused speech synthesis method provided by the present invention;
fig. 3 is a further workflow chart of the image-text fused speech synthesis method provided by the present invention;
fig. 4 is a structural diagram of the image-text fused speech synthesis apparatus provided by the present invention;
fig. 5 is another structural diagram of the image-text fused speech synthesis apparatus provided by the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosure, as detailed in the appended claims.
With the development of speech synthesis technology and the popularization of its applications, speech synthesis services are increasingly accepted and used by users. An existing speech synthesis system is mainly divided into a front end and a back end: the front end is responsible for extracting text feature information such as word segmentation, part of speech and polyphone labeling, and the back end generates the speech from the text features extracted by the front end. End-to-end synthesis systems have emerged in order to simplify the synthesis system as much as possible and to reduce manual intervention and the requirements on linguistic background knowledge: text or ZhuYin (phonetic) characters are input directly, and the system outputs the audio waveform. However, this approach has the following disadvantage: prior-art speech synthesis methods only use single-modal text features and neglect the importance of visual information, so accurate and flexible customized speech synthesis cannot be achieved for the synthesis scene. The synthesis result becomes monotonous, and special situations cannot be handled. In order to solve the above problems, the present embodiment discloses a method that fuses text information features and visual information features into multi-modal features and synthesizes speech from them.
An image-text fused speech synthesis method, as shown in fig. 1, includes the following steps:
step S101, acquiring text information features of a current sports text;
step S102, determining a video picture corresponding to the current sports text according to the current sports text, and acquiring visual information features of the video picture;
step S103, fusing the text information features and the visual information features to obtain multi-modal features;
step S104, synthesizing the target speech according to the multi-modal features.
The working principle of the technical scheme is as follows: text information features of a current sports text are acquired; a video picture corresponding to the current sports text is determined according to the current sports text, and visual information features of the video picture are acquired; the text information features and the visual information features are then fused to obtain multi-modal features; and finally the target speech is synthesized according to the multi-modal features.
The beneficial effects of the above technical scheme are: because the text information features and the visual information features are fused into multi-modal features before speech synthesis, the synthesis result is no longer uniform and monotonous, and different speech can be synthesized for different scene states. This solves the problems in the prior art that, because only single-modal text features are used, accurate and flexible customized speech synthesis cannot be achieved for the synthesis scene, the synthesis result becomes monotonous, and special situations cannot be handled.
In one embodiment, as shown in fig. 2, acquiring the text information features of the current sports text includes:
step S201, constructing a preset dictionary;
step S202, acquiring each character in the current sports text;
step S203, encoding each character into a one-hot encoding vector;
step S204, acquiring the text information features of the one-hot encoded vectors by using an embedding layer of the preset dictionary.
The beneficial effects of the above technical scheme are: encoding the characters into one-hot encoded vectors makes the text information features easier and more accurate to obtain, and because every character is encoded, no important information is missed, which avoids inaccurate text information features.
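For illustration only, the following is a minimal sketch of steps S201 to S204 in PyTorch. The toy dictionary, the embedding size and all variable names are assumptions made for this example, not details taken from the patent; in particular, the embedding layer here realizes the one-hot lookup of steps S203 and S204 in a single operation.

```python
import torch
import torch.nn as nn

# S201: a toy preset dictionary mapping each character to an index (illustrative only).
chars = ["比", "赛", "开", "始", "进", "球"]
preset_dict = {c: i for i, c in enumerate(chars)}

class CharEncoder(nn.Module):
    """Turns character indices into text information features (S203 + S204)."""
    def __init__(self, vocab_size: int, embed_dim: int = 256):
        super().__init__()
        # An embedding lookup is equivalent to multiplying a one-hot vector
        # by a learned matrix, so S203 and S204 collapse into one layer.
        self.embedding = nn.Embedding(vocab_size, embed_dim)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, seq_len) indices of the characters from S202.
        return self.embedding(char_ids)                    # (batch, seq_len, embed_dim)

text = "比赛开始"                                            # S202: the current sports text
char_ids = torch.tensor([[preset_dict[c] for c in text]])
text_features = CharEncoder(len(preset_dict))(char_ids)    # text information features
```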
In one embodiment, determining a video picture corresponding to the current sports text according to the current sports text, and acquiring visual information features of the video picture includes:
determining a sports event video corresponding to the current sports text according to the current sports text;
acquiring n frames corresponding to a sports event video;
acquiring n visual information of n frames by using an image encoder;
combining the n pieces of visual information to determine the visual information characteristics of the video pictures;
in this embodiment, n is a positive integer of 2 or more.
The beneficial effects of the above technical scheme are: acquiring the n frames corresponding to the sports event video makes the n pieces of visual information more accurate, and the visual information features obtained by combining the n frames are clearer than features obtained from a single pass over the sports event video, which provides a good sample for fusion with the text information features.
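A minimal sketch of this visual-feature step follows, assuming torchvision's ResNet-18 as the image encoder and random tensors in place of real decoded frames; the embodiment only requires some image encoder, so the specific backbone, frame count and 224x224 input size are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Image encoder: ResNet-18 with its classification head removed (assumption;
# any sub-network producing a per-frame feature vector would do).
resnet = models.resnet18(weights=None)
image_encoder = nn.Sequential(*list(resnet.children())[:-1])

# n frames (n >= 2) corresponding to the sports event video, already decoded
# and resized to 3x224x224; random data stands in for real frames here.
n = 5
frames = torch.randn(n, 3, 224, 224)

with torch.no_grad():
    feats = image_encoder(frames)        # (n, 512, 1, 1) after global pooling
visual_features = feats.flatten(1)       # n pieces of visual information, (n, 512)
```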
In one embodiment, fusing the textual information features and visual information features to obtain multimodal features includes:
generating a weight proportion by using visual information characteristics;
carrying out weighting processing on the text information characteristics by using the weight proportion;
and determining the text information features after the weighting processing as multi-modal features.
The beneficial effects of the above technical scheme are: by fusing the visual information and the text information features into multi-modal information, personalized speech can be generated, giving users a more vivid and pleasant experience. For example, when broadcasting a football match, visual cues such as shots and goals captured in the video frames can supplement the text so that a more exciting speech effect is generated. Likewise, when a machine reads picture books, Tang poems and the like aloud, the emotional tone of the generated speech can be set according to the illustrations in the book; for example, if the colours in an illustration are dark, the generated speech may lean towards a sad emotion.
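A minimal sketch of the fusion step described above, in which the visual information features generate a weight proportion that is applied to the text information features; the pooling over frames, the sigmoid gating and all dimensions are illustrative assumptions rather than the patented implementation itself.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Weights the text features with a proportion generated from the visual features."""
    def __init__(self, text_dim: int = 256, visual_dim: int = 512):
        super().__init__()
        # Maps the pooled visual information to one weight per text-feature dimension.
        self.weight_gen = nn.Sequential(nn.Linear(visual_dim, text_dim), nn.Sigmoid())

    def forward(self, text_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:   (batch, seq_len, text_dim) -- text information features
        # visual_feats: (n_frames, visual_dim)     -- n pieces of visual information
        pooled = visual_feats.mean(dim=0)          # combine the n frames
        weights = self.weight_gen(pooled)          # weight proportion, (text_dim,)
        return text_feats * weights                # weighted text = multi-modal features

fusion = MultiModalFusion()
multimodal = fusion(torch.randn(1, 10, 256), torch.randn(5, 512))   # (1, 10, 256)
```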
In one embodiment, synthesizing target speech from multi-modal features includes:
decoding the multi-modal features with an attention-based decoder;
generating a spectrum from the decoded multi-modal features by using a post-processing module;
and synthesizing the target voice according to the frequency spectrum.
The beneficial effects of the above technical scheme are: the attention-based decoder focuses on the most relevant parts of the multi-modal features during decoding, so that the type and emotion of the synthesized speech can be adjusted appropriately according to what the attention attends to. The synthesized speech therefore has the advantage of diversity.
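A minimal sketch of the decoding and post-processing steps follows, assuming a Tacotron-style arrangement (a GRU decoder with a simple additive attention followed by a convolutional post-net that outputs a mel spectrum); the embodiment fixes only an attention-based RNN decoder and a post-processing module, so the concrete layers, frame count and mel dimension are assumptions.

```python
import torch
import torch.nn as nn

class AttentionDecoder(nn.Module):
    """Attention-based RNN decoder over the multi-modal feature sequence."""
    def __init__(self, feat_dim: int = 256, hidden_dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)   # simple additive-style attention
        self.rnn = nn.GRUCell(feat_dim, hidden_dim)
        self.to_mel = nn.Linear(hidden_dim, n_mels)

    def forward(self, memory: torch.Tensor, n_frames: int = 20) -> torch.Tensor:
        # memory: multi-modal features, (seq_len, feat_dim)
        h = memory.new_zeros(self.rnn.hidden_size)
        outputs = []
        for _ in range(n_frames):
            # attention weights over the multi-modal sequence
            scores = self.score(torch.cat([memory, h.expand(memory.size(0), -1)], dim=-1))
            alpha = torch.softmax(scores, dim=0)                  # (seq_len, 1)
            context = (alpha * memory).sum(dim=0)                 # attended context, (feat_dim,)
            h = self.rnn(context.unsqueeze(0), h.unsqueeze(0)).squeeze(0)
            outputs.append(self.to_mel(h))
        return torch.stack(outputs)                               # (n_frames, n_mels)

class PostNet(nn.Module):
    """Post-processing module: refines decoder output into the final spectrum."""
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.Tanh(),
            nn.Conv1d(256, n_mels, kernel_size=5, padding=2))

    def forward(self, mel: torch.Tensor) -> torch.Tensor:         # mel: (n_frames, n_mels)
        x = mel.t().unsqueeze(0)                                  # (1, n_mels, n_frames)
        return (x + self.conv(x)).squeeze(0).t()                  # residual refinement

spectrum = PostNet()(AttentionDecoder()(torch.randn(10, 256)))    # (20, 80)
```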
In one embodiment, as shown in fig. 3, the method includes:
Step 1: obtain the text information feature representation of the sports text to be broadcast through a character encoder. A dictionary is constructed, each character is encoded into a one-hot vector, and the text information feature representation is then obtained through an embedding layer;
Step 2: obtain the visual information feature representation of the picture frames corresponding to the sports event video through an image encoder. The visual information feature representation of the corresponding images can be extracted by a sub-network such as ResNet;
Step 3: the multi-modal feature fusion module fuses the text information features and the corresponding visual information features to obtain the multi-modal feature representation. Either concatenation along the feature dimension can be adopted, or weights can be generated from the image features to weight the text information;
Step 4: decode the multi-modal features with an attention-based decoder module. The decoder has an RNN structure and outputs a sequence feature representation based on the attention component;
Step 5: the post-processing module processes the decoder output to generate a spectral representation, from which the speech is obtained.
The beneficial effects of the above technical scheme are: single-modal text-to-speech synthesis ignores important visual information and, in some scenes, cannot generate speech that varies with the scene. The above method generates a personalized speech representation through multi-modal fusion of image and text information. For example, when broadcasting a football match, visual cues such as shots and goals captured in the video frames can supplement the text so that a more exciting speech effect is generated. Likewise, when a machine reads picture books, Tang poems and the like aloud, the emotional tone of the generated speech can be set according to the illustrations in the book; for example, if the colours in an illustration are dark, the generated speech may lean towards a sad emotion.
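For completeness, the sketches above can be wired together in the order of steps 1 to 5. This usage example assumes the CharEncoder, preset_dict, image_encoder, frames, MultiModalFusion, AttentionDecoder and PostNet definitions from the earlier sketches are in scope, and it recovers a waveform with librosa's Griffin-Lim mel inversion purely as a stand-in for whatever vocoder step 5 actually uses.

```python
import numpy as np
import librosa
import torch

# Step 1: text information features of the sports text to be broadcast.
char_ids = torch.tensor([[preset_dict[c] for c in "比赛开始"]])
text_feats = CharEncoder(len(preset_dict))(char_ids)            # (1, seq_len, 256)

# Step 2: visual information features of the corresponding video frames.
visual_feats = image_encoder(frames).flatten(1)                  # (n, 512)

# Step 3: multi-modal fusion.
multimodal = MultiModalFusion()(text_feats, visual_feats)[0]     # (seq_len, 256)

# Steps 4-5: attention-based decoding, post-processing to a mel spectrum,
# then waveform recovery (Griffin-Lim as an illustrative stand-in).
mel = PostNet()(AttentionDecoder()(multimodal))                  # (frames, 80)
mel_np = np.maximum(mel.detach().numpy().T, 1e-5)                # (80, frames), non-negative
wav = librosa.feature.inverse.mel_to_audio(mel_np, sr=22050)     # target speech waveform
```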
This embodiment also discloses an image-text fused speech synthesis device, as shown in fig. 4, the device including:
a first obtaining module 401, configured to obtain text information features of a current sports text;
a second obtaining module 402, configured to determine, according to the current sports text, a video picture corresponding to the current sports text, and obtain visual information features of the video picture;
the fusion module 403 is configured to fuse the text information features and the visual information features to obtain multi-modal features;
a synthesis module 404 for synthesizing the target speech based on the multi-modal features.
In one embodiment, the first obtaining module includes:
a construction sub-module 4011 configured to construct a preset dictionary;
the first obtaining sub-module 4012 is configured to obtain each character in the current sports text;
an encoding sub-module 4013, configured to encode each character into a one-hot encoded vector;
a second obtaining sub-module 4014, configured to obtain the text information features of the one-hot encoded vectors by using the embedding layer of the preset dictionary.
In one embodiment, the second obtaining module includes:
the first determining submodule is used for determining the sports event video corresponding to the current sports text according to the current sports text;
the third acquisition sub-module is used for acquiring n frames corresponding to the sports event video;
a fourth obtaining sub-module, configured to obtain n pieces of visual information of the n frames by using the image encoder;
and the combining sub-module is used for combining the n pieces of visual information to determine the visual information characteristics of the video pictures.
In one embodiment, a fusion module includes:
the first generation submodule is used for generating a weight proportion by utilizing the visual information characteristics;
the processing submodule is used for carrying out weighting processing on the text information characteristics by utilizing the weight proportion;
and the second determining sub-module is used for determining the text information features after the weighting processing as multi-modal features.
In one embodiment, the synthesis module comprises:
a decoding sub-module for decoding the multi-modal features with an attention-based decoder;
the second generation submodule is used for generating a spectrum from the decoded multi-modal features by using the post-processing module;
and the synthesis submodule is used for synthesizing the target voice according to the frequency spectrum.
It will be understood by those skilled in the art that the terms "first" and "second" in the present invention are used merely to distinguish between different modules and steps, and do not limit the invention.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (8)

1. An image-text fused speech synthesis method, characterized by comprising the following steps:
acquiring text information features of a current sports text;
determining a video picture corresponding to the current sports text according to the current sports text, and acquiring visual information features of the video picture;
fusing the text information features and the visual information features to obtain multi-modal features;
synthesizing target speech according to the multi-modal features;
the fusing the text information features and the visual information features to obtain multi-modal features comprises:
generating a weight proportion by using the visual information characteristics;
carrying out weighting processing on the text information characteristics by utilizing the weight proportion;
and determining the text information features after the weighting processing as the multi-modal features.
2. The image-text fused speech synthesis method according to claim 1, wherein the acquiring the text information features of the current sports text comprises:
constructing a preset dictionary;
acquiring each character in the current sports text;
encoding each of the characters into a one-hot encoded vector;
and acquiring the text information features of the one-hot encoded vectors by using an embedding layer of the preset dictionary.
3. The image-text fused speech synthesis method according to claim 1, wherein the determining a video picture corresponding to the current sports text according to the current sports text and acquiring the visual information features of the video picture comprises:
determining a sports event video corresponding to the current sports text according to the current sports text;
acquiring n frames corresponding to the sports event video;
acquiring n visual information of the n frames by using an image encoder;
and determining the n pieces of visual information as the visual information characteristics of the video pictures in combination.
4. The image-text fused speech synthesis method according to claim 1, wherein the synthesizing target speech according to the multi-modal features comprises:
decoding the multi-modal features with an attention-based decoder;
generating a spectrum from the decoded multi-modal features by using a post-processing module;
and synthesizing the target voice according to the frequency spectrum.
5. An image-text fused speech synthesis apparatus, comprising:
the first acquisition module is used for acquiring the text information features of the current sports text;
the second acquisition module is used for determining a video picture corresponding to the current sports text according to the current sports text and acquiring visual information characteristics of the video picture;
the fusion module is used for fusing the text information features and the visual information features to obtain multi-modal features;
a synthesis module for synthesizing a target speech according to the multi-modal features;
the fusion module comprises:
the first generation submodule is used for generating a weight proportion by utilizing the visual information characteristics;
the processing submodule is used for carrying out weighting processing on the text information characteristics by utilizing the weight proportion;
and the second determining sub-module is used for determining the text information features after the weighting processing as the multi-modal features.
6. The image-text fused speech synthesis device according to claim 5, wherein the first obtaining module comprises:
the construction submodule is used for constructing a preset dictionary;
the first obtaining sub-module is used for obtaining each character in the current sports text;
an encoding sub-module for encoding each of the characters into a one-hot encoded vector;
and the second obtaining submodule is used for obtaining the text information features of the one-hot encoded vectors by using the embedding layer of the preset dictionary.
7. The image-text fused speech synthesis device according to claim 5, wherein the second obtaining module comprises:
the first determining submodule is used for determining the sports event video corresponding to the current sports text according to the current sports text;
the third acquisition submodule is used for acquiring n frames corresponding to the sports event video;
a fourth obtaining sub-module, configured to obtain n pieces of visual information of the n frames by using an image encoder;
and the combining sub-module is used for combining the n pieces of visual information to determine the visual information characteristics of the video pictures.
8. The image-text fused speech synthesis device according to claim 5, wherein the synthesis module comprises:
a decoding sub-module for decoding the multi-modal features with an attention-based decoder;
the second generation submodule is used for generating a spectrum from the decoded multi-modal features by using the post-processing module;
and the synthesis sub-module is used for synthesizing the target voice according to the frequency spectrum.
CN202010145198.8A 2020-03-05 2020-03-05 Image-text fused speech synthesis method and device Active CN111312210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010145198.8A CN111312210B (en) 2020-03-05 2020-03-05 Image-text fused speech synthesis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010145198.8A CN111312210B (en) 2020-03-05 2020-03-05 Image-text fused speech synthesis method and device

Publications (2)

Publication Number Publication Date
CN111312210A CN111312210A (en) 2020-06-19
CN111312210B (en) 2023-03-21

Family

ID=71148105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010145198.8A Active CN111312210B (en) 2020-03-05 2020-03-05 Image-text fused speech synthesis method and device

Country Status (1)

Country Link
CN (1) CN111312210B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114945108A (en) * 2022-05-14 2022-08-26 云知声智能科技股份有限公司 Method and device for assisting vision-impaired person in understanding picture

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7117155B2 (en) * 1999-09-07 2006-10-03 At&T Corp. Coarticulation method for audio-visual text-to-speech synthesis
JP4762553B2 (en) * 2005-01-05 2011-08-31 三菱電機株式会社 Text-to-speech synthesis method and apparatus, text-to-speech synthesis program, and computer-readable recording medium recording the program
CN105096934B (en) * 2015-06-30 2019-02-12 百度在线网络技术(北京)有限公司 Construct method, phoneme synthesizing method, device and the equipment in phonetic feature library
CN105931631A (en) * 2016-04-15 2016-09-07 北京地平线机器人技术研发有限公司 Voice synthesis system and method
CN107220591A (en) * 2017-04-28 2017-09-29 哈尔滨工业大学深圳研究生院 Multi-modal intelligent mood sensing system
CN106910514A (en) * 2017-04-30 2017-06-30 上海爱优威软件开发有限公司 Method of speech processing and system
JP6619072B2 (en) * 2018-10-10 2019-12-11 日本電信電話株式会社 SOUND SYNTHESIS DEVICE, SOUND SYNTHESIS METHOD, AND PROGRAM THEREOF
CN109461435B (en) * 2018-11-19 2022-07-01 北京光年无限科技有限公司 Intelligent robot-oriented voice synthesis method and device
CN109754778B (en) * 2019-01-17 2023-05-30 平安科技(深圳)有限公司 Text speech synthesis method and device and computer equipment
CN110136692B (en) * 2019-04-30 2021-12-14 北京小米移动软件有限公司 Speech synthesis method, apparatus, device and storage medium
CN110211563A (en) * 2019-06-19 2019-09-06 平安科技(深圳)有限公司 Chinese speech synthesis method, apparatus and storage medium towards scene and emotion
KR20190104941A (en) * 2019-08-22 2019-09-11 엘지전자 주식회사 Speech synthesis method based on emotion information and apparatus therefor

Also Published As

Publication number Publication date
CN111312210A (en) 2020-06-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant