CN111312210B - Image-text fused speech synthesis method and device - Google Patents
- Publication number: CN111312210B (application CN202010145198.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- features
- visual information
- module
- synthesizing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
  - G10—MUSICAL INSTRUMENTS; ACOUSTICS
    - G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
      - G10L13/00—Speech synthesis; Text to speech systems
        - G10L13/02—Methods for producing synthetic speech; Speech synthesisers
        - G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Abstract
The invention discloses an image-text fused speech synthesis method and device. The method comprises the following steps: acquiring text information features of current sports text; determining a video picture corresponding to the current sports text and acquiring visual information features of the video picture; fusing the text information features and the visual information features to obtain multi-modal features; and synthesizing target speech based on the multi-modal features. Because the text information features and the visual information features are fused into multi-modal features before speech synthesis, the synthesis result is no longer monotonous: different speech can be synthesized for different scene states. This solves the prior-art problem that using only single-modality text features cannot achieve accurate and flexible customized speech synthesis for the synthesis scene, so that the synthesis result is monotonous and special situations cannot be handled.
Description
Technical Field
The invention relates to the technical field of speech synthesis, and in particular to an image-text fused speech synthesis method and device.
Background
With the development of speech synthesis technology and the popularization of its applications, speech synthesis services are increasingly accepted and used. An existing speech synthesis system is mainly divided into a front end and a back end: the front end is responsible for extracting text feature information such as word segmentation, part of speech, and polyphone labeling, and the back end generates speech from the text features extracted by the front end. End-to-end synthesis systems have emerged to simplify the synthesis system as much as possible, reducing manual intervention and the need for linguistic background knowledge: text or Zhuyin (phonetic) characters are input directly, and the system outputs the audio waveform. However, this approach has a drawback: prior-art speech synthesis methods use only single-modality text features, neglect the importance of visual information, and cannot achieve accurate and flexible customized speech synthesis for the synthesis scene. The synthesis result is monotonous, and special situations cannot be handled.
Disclosure of Invention
To address the above problems, the method forms multi-modal features by combining text information features with visual information features and synthesizes speech from them.
An image-text fused speech synthesis method comprises the following steps:
acquiring text information features of current sports text;
determining a video picture corresponding to the current sports text according to the current sports text, and acquiring visual information features of the video picture;
fusing the text information features and the visual information features to obtain multi-modal features;
synthesizing target speech according to the multi-modal features.
Preferably, the acquiring text information features of the current sports text includes:
constructing a preset dictionary;
acquiring each character in the current sports text;
encoding each of the characters into a one-hot coded vector;
and acquiring text information features of the one-hot coded vectors by utilizing an embedding layer with the preset dictionary.
Preferably, the determining, according to the current sports text, the video picture corresponding to the current sports text and acquiring the visual information features of the video picture includes:
determining a sports event video corresponding to the current sports text according to the current sports text;
acquiring n frames corresponding to the sports event video;
acquiring n pieces of visual information from the n frames by using an image encoder;
and combining the n pieces of visual information to determine the visual information features of the video picture.
Preferably, the fusing the text information features and the visual information features to obtain multi-modal features includes:
generating a weight proportion from the visual information features;
weighting the text information features by the weight proportion;
and determining the weighted text information features as the multi-modal features.
Preferably, the synthesizing target speech according to the multi-modal features includes:
decoding the multi-modal features with an attention-based decoder;
generating a spectrum from the decoded multi-modal features by a post-processing module;
and synthesizing the target speech according to the spectrum.
An image-text fused speech synthesis device, comprising:
a first acquisition module, configured to acquire text information features of current sports text;
a second acquisition module, configured to determine a video picture corresponding to the current sports text according to the current sports text and to acquire visual information features of the video picture;
a fusion module, configured to fuse the text information features and the visual information features to obtain multi-modal features;
and a synthesis module, configured to synthesize target speech according to the multi-modal features.
Preferably, the first acquisition module includes:
a construction sub-module, configured to construct a preset dictionary;
a first acquisition sub-module, configured to acquire each character in the current sports text;
an encoding sub-module, configured to encode each of the characters into a one-hot coded vector;
and a second acquisition sub-module, configured to acquire text information features of the one-hot coded vectors by utilizing an embedding layer with the preset dictionary.
Preferably, the second acquisition module includes:
a first determining sub-module, configured to determine the sports event video corresponding to the current sports text according to the current sports text;
a third acquisition sub-module, configured to acquire n frames corresponding to the sports event video;
a fourth acquisition sub-module, configured to acquire n pieces of visual information from the n frames by using an image encoder;
and a combining sub-module, configured to combine the n pieces of visual information to determine the visual information features of the video picture.
Preferably, the fusion module includes:
a first generation sub-module, configured to generate a weight proportion from the visual information features;
a processing sub-module, configured to weight the text information features by the weight proportion;
and a second determining sub-module, configured to determine the weighted text information features as the multi-modal features.
Preferably, the synthesis module comprises:
a decoding sub-module, configured to decode the multi-modal features with an attention-based decoder;
a second generation sub-module, configured to generate a spectrum from the decoded multi-modal features by a post-processing module;
and a synthesis sub-module, configured to synthesize the target speech according to the spectrum.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
Fig. 1 is a workflow diagram of the image-text fused speech synthesis method provided by the present invention;
Fig. 2 is another workflow diagram of the image-text fused speech synthesis method provided by the present invention;
Fig. 3 is a further workflow diagram of the image-text fused speech synthesis method provided by the present invention;
Fig. 4 is a structural diagram of the image-text fused speech synthesis device provided by the present invention;
Fig. 5 is another structural diagram of the image-text fused speech synthesis device provided by the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the disclosure, as detailed in the appended claims.
As noted in the Background, prior-art speech synthesis methods use only single-modality text features, neglect the importance of visual information, and therefore cannot achieve accurate and flexible customized speech synthesis for the synthesis scene; the result is monotonous, and special situations cannot be handled. To solve this problem, the present embodiment discloses a speech synthesis method based on fusing text information features and visual information features into multi-modal features.
An image-text fused speech synthesis method, as shown in Fig. 1, includes the following steps:
Step S101: acquiring text information features of current sports text;
Step S102: determining a video picture corresponding to the current sports text according to the current sports text, and acquiring visual information features of the video picture;
Step S103: fusing the text information features and the visual information features to obtain multi-modal features;
Step S104: synthesizing target speech according to the multi-modal features.
The working principle of the technical scheme is as follows: text information features of current sports text are acquired; a video picture corresponding to the current sports text is determined and its visual information features are acquired; the text information features and the visual information features are then fused to obtain multi-modal features; and finally target speech is synthesized according to the multi-modal features.
The beneficial effects of the above technical scheme are: because the text information features and the visual information features are fused into multi-modal features before speech synthesis, the synthesis result is no longer monotonous; different speech can be synthesized for different scene states. This solves the prior-art problem that using only single-modality text features cannot achieve accurate and flexible customized speech synthesis for the synthesis scene, leaving the result monotonous and unable to handle special situations.
In one embodiment, as shown in Fig. 2, acquiring the text information features of the current sports text includes:
Step S201: constructing a preset dictionary;
Step S202: acquiring each character in the current sports text;
Step S203: encoding each character into a one-hot coded vector;
Step S204: acquiring text information features of the one-hot coded vectors by using an embedding layer with the preset dictionary.
The beneficial effects of the above technical scheme are: encoding the characters as one-hot coded vectors makes acquiring the text information features more convenient and accurate, and because every character is encoded, no important information is missed, avoiding inaccurate text information features.
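The patent discloses no code, so the following is only an assumed minimal NumPy sketch of steps S201 to S204: a preset dictionary is built, each character becomes a one-hot vector, and an embedding-layer lookup (here a random matrix standing in for learned weights) yields the text information features. All function names and sizes are illustrative.

```python
import numpy as np

def build_dictionary(texts):
    """Construct a preset dictionary mapping each distinct character to an index."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i for i, ch in enumerate(chars)}

def one_hot(char, dictionary):
    """Encode a single character as a one-hot coded vector over the dictionary."""
    vec = np.zeros(len(dictionary))
    vec[dictionary[char]] = 1.0
    return vec

def text_features(text, dictionary, embedding):
    """Acquire text information features: multiplying each one-hot vector by
    the embedding matrix selects that character's embedding row."""
    return np.stack([one_hot(ch, dictionary) @ embedding for ch in text])

rng = np.random.default_rng(0)
dictionary = build_dictionary(["goal scored"])
embedding = rng.normal(size=(len(dictionary), 8))  # assumed embedding-layer weights
feats = text_features("goal", dictionary, embedding)
print(feats.shape)  # (4, 8): one 8-dimensional feature per character
```

In a trained system the embedding matrix would be learned jointly with the rest of the synthesizer; the one-hot multiplication above is mathematically equivalent to a table lookup.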
In one embodiment, determining the video picture corresponding to the current sports text and acquiring its visual information features includes:
determining a sports event video corresponding to the current sports text according to the current sports text;
acquiring n frames corresponding to the sports event video;
acquiring n pieces of visual information from the n frames by using an image encoder;
and combining the n pieces of visual information to determine the visual information features of the video picture.
In this embodiment, n is a positive integer of 2 or more.
The beneficial effects of the above technical scheme are: acquiring n frames of the sports event video makes the n pieces of visual information more accurate; at the same time, the visual information features obtained by combining the n frames are clearer than features obtained from a single pass over the video, providing a good sample for fusion with the text information features.
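A hedged sketch of this frame-sampling step follows. The embodiment elsewhere suggests a ResNet-style image encoder; since that is not specified here, a simple global-average-pooling function stands in for it, and the frame count and frame shape are illustrative assumptions.

```python
import numpy as np

def sample_frames(video, n):
    """Acquire n frames corresponding to the sports event video
    by sampling uniformly across its length."""
    indices = np.linspace(0, len(video) - 1, n).astype(int)
    return video[indices]

def image_encoder(frame):
    """Stand-in for a CNN image encoder such as ResNet: global average
    pooling over height and width yields one visual-information vector."""
    return frame.mean(axis=(0, 1))

def visual_features(video, n):
    """Combine the n pieces of visual information into the
    visual information features of the video picture."""
    return np.stack([image_encoder(f) for f in sample_frames(video, n)])

video = np.random.rand(120, 64, 64, 3)  # assumed: 120 RGB frames of 64x64 pixels
V = visual_features(video, n=5)
print(V.shape)  # (5, 3): one pooled vector per sampled frame
```

Using n >= 2 frames, as the embodiment requires, is what lets the combined features describe the scene more clearly than any single frame could.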
In one embodiment, fusing the text information features and the visual information features to obtain multi-modal features includes:
generating a weight proportion from the visual information features;
weighting the text information features by the weight proportion;
and determining the weighted text information features as the multi-modal features.
The beneficial effects of the above technical scheme are: through multi-modal fusion of visual information and text information features, personalized speech can be generated, providing a more vivid and pleasant experience for users. For example, when broadcasting a football match, visual information such as goals and shots captured in the video frames can supplement the text to generate an exciting speech effect. Likewise, during machine reading of picture books, Tang poetry, and the like, the emotional key of the generated speech can be set according to the illustrations in the book; for example, if the colors in an illustration are dark, the generated speech may emphasize sad emotions.
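One plausible reading of the fusion step, sketched below under stated assumptions: the n visual vectors are summarized, an assumed learned projection W maps the summary to a sigmoid gate in (0, 1) per text-feature dimension (the "weight proportion"), and the text features are scaled by that gate. The patent does not fix the exact gating function, so this is illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(text_feats, visual_feats, W):
    """Generate a weight proportion from the visual information features and
    weight the text information features by it. W is an assumed learned
    projection from the visual summary to one weight per text dimension."""
    visual_summary = visual_feats.mean(axis=0)   # summarize the n frames
    weight = sigmoid(W @ visual_summary)         # weight proportion in (0, 1)
    return text_feats * weight                   # weighted text features = multi-modal features

rng = np.random.default_rng(1)
T = rng.normal(size=(4, 8))   # text information features: 4 characters, 8 dims
V = rng.normal(size=(5, 3))   # visual information features: 5 frames, 3 dims
W = rng.normal(size=(8, 3))   # assumed projection matrix
M = fuse(T, V, W)
print(M.shape)  # (4, 8): multi-modal features, same shape as the text features
```

Because the gate lies in (0, 1), the visual content can only attenuate or preserve each text dimension here; a concatenation of feature dimensions, which the later embodiment also mentions, would be an equally valid fusion choice.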
In one embodiment, synthesizing target speech from the multi-modal features includes:
decoding the multi-modal features with an attention-based decoder;
generating a spectrum from the decoded multi-modal features by a post-processing module;
and synthesizing the target speech according to the spectrum.
The beneficial effects of the above technical scheme are: the attention-based decoder focuses on the salient parts of the multi-modal features, so the synthesis process can adjust the type and emotion of the synthesized speech accordingly. The synthesized speech therefore has the advantage of diversity.
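To make the "attention focusing" concrete, here is an assumed single attention step of such a decoder, using plain dot-product attention (the patent specifies only "an attention-based decoder", so the scoring function and dimensions are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(decoder_state, memory):
    """One step of an attention-based decoder: score each multi-modal
    feature against the decoder state and return the weighted context."""
    scores = memory @ decoder_state        # dot-product attention scores
    weights = softmax(scores)              # where the decoder focuses
    context = weights @ memory             # attended multi-modal context
    return context, weights

rng = np.random.default_rng(2)
memory = rng.normal(size=(4, 8))   # multi-modal features for 4 characters
state = rng.normal(size=8)         # assumed RNN decoder hidden state
context, weights = attention_step(state, memory)
# context has shape (8,); the attention weights sum to 1
```

At each output frame the decoder state changes, so the weights shift across the multi-modal features; that shifting focus is what lets scene-dependent visual content influence the emotion of specific parts of the utterance.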
In one embodiment, as shown in Fig. 3, the method includes:
Step 1: obtain the text information feature representation of the sports text to be broadcast through a character encoder. A dictionary is constructed, each character is encoded as a one-hot vector, and the feature representation of the text information is then obtained through an embedding layer;
Step 2: obtain the visual information feature representation of the picture frames corresponding to the sports event video through an image encoder. The visual information feature representation of each image can be extracted by a sub-network such as ResNet;
Step 3: the multi-modal feature fusion module fuses the text information features with the corresponding visual information features to obtain a multi-modal feature representation. Feature dimensions can be concatenated, or weights can be generated from the image features to weight the text information;
Step 4: decode the multi-modal features with an attention-based decoder module. The decoder is an RNN structure and outputs a sequence feature representation based on the attention component;
Step 5: the post-processing module processes the decoder output to generate a spectral representation and obtains the speech.
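Step 5 can be sketched as follows under assumptions: the decoder outputs are mapped to spectral frames by an assumed learned linear projection (80 mel bins is a common but here illustrative choice), after which a vocoder, such as Griffin-Lim or a neural vocoder, would reconstruct the waveform. The patent does not name a vocoder, so that stage is only noted in a comment.

```python
import numpy as np

def post_process(decoder_outputs, projection):
    """Post-processing sketch: map decoder output features to spectral
    frames. A vocoder (e.g. Griffin-Lim or a neural vocoder, not shown)
    would then reconstruct the audio waveform from this spectrum."""
    return decoder_outputs @ projection

rng = np.random.default_rng(3)
decoder_outputs = rng.normal(size=(10, 8))   # 10 decoded frames, 8 feature dims
projection = rng.normal(size=(8, 80))        # assumed projection to 80 mel bins
spectrum = post_process(decoder_outputs, projection)
print(spectrum.shape)  # (10, 80): one spectral frame per decoder step
```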
The beneficial effects of the above technical scheme are: single-modality text-to-speech synthesis ignores important visual information and, in some scenes, cannot generate speech that varies with the scene. The method can generate a personalized speech representation through multi-modal fusion of image and text. For example, when broadcasting a football match, visual information such as goals and shots captured in the video frames can supplement the text to generate an exciting speech effect; during machine reading of picture books, Tang poetry, and the like, the emotional key of the generated speech can be set according to the illustrations in the book. If the colors in an illustration are dark, for instance, the generated speech may emphasize sad emotions.
The embodiment also discloses an image-text fused speech synthesis device, as shown in Fig. 4. The device includes:
a first acquisition module 401, configured to acquire text information features of current sports text;
a second acquisition module 402, configured to determine a video picture corresponding to the current sports text and to acquire visual information features of the video picture;
a fusion module 403, configured to fuse the text information features and the visual information features to obtain multi-modal features;
a synthesis module 404, configured to synthesize target speech based on the multi-modal features.
In one embodiment, the first acquisition module includes:
a construction sub-module 4011, configured to construct a preset dictionary;
a first acquisition sub-module 4012, configured to acquire each character in the current sports text;
an encoding sub-module 4013, configured to encode each character into a one-hot coded vector;
a second acquisition sub-module 4014, configured to acquire text information features of the one-hot coded vectors by using an embedding layer with the preset dictionary.
In one embodiment, the second acquisition module includes:
a first determining sub-module, configured to determine the sports event video corresponding to the current sports text according to the current sports text;
a third acquisition sub-module, configured to acquire n frames corresponding to the sports event video;
a fourth acquisition sub-module, configured to acquire n pieces of visual information from the n frames by using the image encoder;
a combining sub-module, configured to combine the n pieces of visual information to determine the visual information features of the video picture.
In one embodiment, the fusion module includes:
a first generation sub-module, configured to generate a weight proportion from the visual information features;
a processing sub-module, configured to weight the text information features by the weight proportion;
a second determining sub-module, configured to determine the weighted text information features as the multi-modal features.
In one embodiment, the synthesis module comprises:
a decoding sub-module, configured to decode the multi-modal features with an attention-based decoder;
a second generation sub-module, configured to generate a spectrum from the decoded multi-modal features by the post-processing module;
a synthesis sub-module, configured to synthesize the target speech according to the spectrum.
It will be understood by those skilled in the art that the terms "first" and "second" in the present invention are used only to distinguish between similar elements and do not denote any order of application.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (8)
1. An image-text fused speech synthesis method, characterized by comprising the following steps:
acquiring text information features of current sports text;
determining a video picture corresponding to the current sports text according to the current sports text, and acquiring visual information features of the video picture;
fusing the text information features and the visual information features to obtain multi-modal features;
synthesizing target speech according to the multi-modal features;
wherein the fusing the text information features and the visual information features to obtain multi-modal features comprises:
generating a weight proportion from the visual information features;
weighting the text information features by the weight proportion;
and determining the weighted text information features as the multi-modal features.
2. The image-text fused speech synthesis method according to claim 1, wherein the acquiring text information features of the current sports text comprises:
constructing a preset dictionary;
acquiring each character in the current sports text;
encoding each of the characters into a one-hot coded vector;
and acquiring text information features of the one-hot coded vectors by utilizing an embedding layer with the preset dictionary.
3. The image-text fused speech synthesis method according to claim 1, wherein the determining a video picture corresponding to the current sports text according to the current sports text and acquiring the visual information features of the video picture comprises:
determining a sports event video corresponding to the current sports text according to the current sports text;
acquiring n frames corresponding to the sports event video;
acquiring n pieces of visual information from the n frames by using an image encoder;
and combining the n pieces of visual information to determine the visual information features of the video picture.
4. The image-text fused speech synthesis method according to claim 1, wherein the synthesizing target speech according to the multi-modal features comprises:
decoding the multi-modal features with an attention-based decoder;
generating a spectrum from the decoded multi-modal features by a post-processing module;
and synthesizing the target speech according to the spectrum.
5. An image-text fused speech synthesis device, characterized by comprising:
a first acquisition module, configured to acquire text information features of current sports text;
a second acquisition module, configured to determine a video picture corresponding to the current sports text according to the current sports text and to acquire visual information features of the video picture;
a fusion module, configured to fuse the text information features and the visual information features to obtain multi-modal features;
a synthesis module, configured to synthesize target speech according to the multi-modal features;
wherein the fusion module comprises:
a first generation sub-module, configured to generate a weight proportion from the visual information features;
a processing sub-module, configured to weight the text information features by the weight proportion;
and a second determining sub-module, configured to determine the weighted text information features as the multi-modal features.
6. The image-text fused speech synthesis device according to claim 5, wherein the first acquisition module comprises:
a construction sub-module, configured to construct a preset dictionary;
a first acquisition sub-module, configured to acquire each character in the current sports text;
an encoding sub-module, configured to encode each of the characters into a one-hot coded vector;
and a second acquisition sub-module, configured to acquire text information features of the one-hot coded vectors by utilizing an embedding layer with the preset dictionary.
7. The image-text fused speech synthesis device according to claim 5, wherein the second acquisition module comprises:
a first determining sub-module, configured to determine the sports event video corresponding to the current sports text according to the current sports text;
a third acquisition sub-module, configured to acquire n frames corresponding to the sports event video;
a fourth acquisition sub-module, configured to acquire n pieces of visual information from the n frames by using an image encoder;
and a combining sub-module, configured to combine the n pieces of visual information to determine the visual information features of the video picture.
8. The image-text fused speech synthesis device according to claim 5, wherein the synthesis module comprises:
a decoding sub-module, configured to decode the multi-modal features with an attention-based decoder;
a second generation sub-module, configured to generate a spectrum from the decoded multi-modal features by the post-processing module;
and a synthesis sub-module, configured to synthesize the target speech according to the spectrum.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010145198.8A (CN111312210B) | 2020-03-05 | 2020-03-05 | Image-text fused speech synthesis method and device |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN111312210A | 2020-06-19 |
| CN111312210B | 2023-03-21 |
Family ID: 71148105

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010145198.8A (CN111312210B, active) | Image-text fused speech synthesis method and device | 2020-03-05 | 2020-03-05 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN111312210B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
CN114945108A (en) * | 2022-05-14 | 2022-08-26 | 云知声智能科技股份有限公司 | Method and device for assisting vision-impaired person in understanding picture |
Family Cites Families (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
US7117155B2 (en) * | 1999-09-07 | 2006-10-03 | At&T Corp. | Coarticulation method for audio-visual text-to-speech synthesis |
JP4762553B2 (en) * | 2005-01-05 | 2011-08-31 | 三菱電機株式会社 | Text-to-speech synthesis method and apparatus, text-to-speech synthesis program, and computer-readable recording medium recording the program |
CN105096934B (en) * | 2015-06-30 | 2019-02-12 | 百度在线网络技术(北京)有限公司 | Construct method, phoneme synthesizing method, device and the equipment in phonetic feature library |
CN105931631A (en) * | 2016-04-15 | 2016-09-07 | 北京地平线机器人技术研发有限公司 | Voice synthesis system and method |
CN107220591A (en) * | 2017-04-28 | 2017-09-29 | 哈尔滨工业大学深圳研究生院 | Multi-modal intelligent mood sensing system |
CN106910514A (en) * | 2017-04-30 | 2017-06-30 | 上海爱优威软件开发有限公司 | Method of speech processing and system |
JP6619072B2 (en) * | 2018-10-10 | 2019-12-11 | 日本電信電話株式会社 | SOUND SYNTHESIS DEVICE, SOUND SYNTHESIS METHOD, AND PROGRAM THEREOF |
CN109461435B (en) * | 2018-11-19 | 2022-07-01 | 北京光年无限科技有限公司 | Intelligent robot-oriented voice synthesis method and device |
CN109754778B (en) * | 2019-01-17 | 2023-05-30 | 平安科技(深圳)有限公司 | Text speech synthesis method and device and computer equipment |
CN110136692B (en) * | 2019-04-30 | 2021-12-14 | 北京小米移动软件有限公司 | Speech synthesis method, apparatus, device and storage medium |
CN110211563A (en) * | 2019-06-19 | 2019-09-06 | 平安科技(深圳)有限公司 | Chinese speech synthesis method, apparatus and storage medium towards scene and emotion |
KR20190104941A (en) * | 2019-08-22 | 2019-09-11 | 엘지전자 주식회사 | Speech synthesis method based on emotion information and apparatus therefor |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |