CN109637518B - Virtual anchor implementation method and device - Google Patents
Virtual anchor implementation method and device
- Publication number
- CN109637518B (application CN201811320949.4A)
- Authority
- CN
- China
- Prior art keywords
- virtual anchor
- model
- voice
- image
- synthesis model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/431—Generation of visual interfaces for content selection or interaction; Content or additional data rendering
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/431—Generation of visual interfaces for content selection or interaction; Content or additional data rendering
- H04N21/4312—Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/265—Mixing
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L2013/021—Overlap-add techniques
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Studio Devices (AREA)
- Processing Or Creating Images (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The invention discloses a virtual anchor implementation method and device, the method comprising: receiving an input text; obtaining a voice sequence corresponding to the input text by using a pre-constructed voice synthesis model, and obtaining a virtual anchor image sequence corresponding to the input text by using a pre-constructed biological state synthesis model, wherein the biological state synthesis model and the voice synthesis model are constructed on the basis of the same duration model; and synchronously superposing the voice sequence and the virtual anchor image sequence to obtain virtual anchor audio and video data. The invention can greatly improve the visual effect of the virtual anchor.
Description
Technical Field
The invention relates to animation technology, in particular to a method and a device for realizing a virtual anchor.
Background
Currently, with the rapid rise of self-media and of short video platforms, network traffic is accelerating its shift from text to video. Some video platforms can provide richer presentation modes for users, but an anchor played by a real person is limited by conditions such as the anchor himself or herself, so the presentation form is single and the audience experience suffers. For this reason, the industry has introduced video products that replace real people with virtual images, namely virtual anchors, which, as the name implies, present relevant video content to users via a virtual image, for example a virtual image hosting a video column or broadcasting news. However, the virtual image in such products is usually an animated character, which not only has a long production period but also a poor visual effect.
Disclosure of Invention
The embodiments of the invention provide an online virtual anchor implementation method and apparatus, for improving the visual effect of the virtual anchor.
Therefore, the invention provides the following technical scheme:
a virtual anchor implementation method, the method comprising:
receiving an input text;
obtaining a voice sequence corresponding to the input text by using a pre-constructed voice synthesis model, and obtaining a virtual anchor image sequence corresponding to the input text by using a pre-constructed biological state synthesis model, wherein the biological state synthesis model and the voice synthesis model are constructed on the basis of the same duration model;
and synchronously superposing the voice sequence and the virtual anchor image sequence to obtain virtual anchor audio and video data.
Optionally, the method further comprises:
and constructing the duration model, the voice synthesis model and the biological state synthesis model by using the collected audio data and video data.
Optionally, the audio data and the video data comprise: synchronously recorded audio data and video data of the object corresponding to the virtual anchor.
Optionally, the audio data further comprises: pure audio data of the object corresponding to the virtual anchor.
Optionally, the constructing a biological state synthesis model by using the collected audio data and video data comprises:
the audio data and the video data which are synchronously collected are used as training data, biological characteristic parameter marking and category marking are carried out on the video data, and voice parameter marking is carried out on the audio data; the biological characteristic parameters and the voice parameters comprise duration parameters determined based on the duration model;
respectively extracting voice parameters of audio data and biological characteristic parameters of video data in the training data;
and training to obtain a biological state synthesis model by using the voice parameters, the biological characteristic parameters and the labeling information.
Optionally, the biological state synthesis model comprises: a lip model and/or an eye position model.
Optionally, the method further comprises:
acquiring a picture of the object corresponding to the virtual anchor;
matting out a specific biological region in the picture to obtain a specific biological region image and a matted image;
the obtaining of the virtual anchor image sequence corresponding to the input text using the biological state synthesis model comprises:
obtaining a virtual anchor specific biological state image sequence corresponding to the input text by using the biological state synthesis model;
and superposing the matted image onto each image in the virtual anchor specific biological state image sequence to obtain a virtual anchor image sequence corresponding to the input text.
Optionally, the method further comprises:
pre-recording a background image sequence;
the step of synchronously overlaying the voice sequence and the virtual anchor image sequence comprises:
and synchronously superposing the voice sequence, the background image sequence and the virtual anchor image sequence.
Optionally, the background image sequence comprises at least any one of:
a virtual anchor head action image sequence;
a virtual anchor hand motion image sequence.
A virtual anchor implementation apparatus, the apparatus comprising:
the receiving module is used for receiving an input text;
the voice synthesis module is used for obtaining a voice sequence corresponding to the input text by utilizing a pre-constructed voice synthesis model;
the image synthesis module is used for obtaining a virtual anchor image sequence corresponding to the input text by utilizing a pre-constructed biological state synthesis model; the biological state synthesis model and the voice synthesis model are constructed based on the same duration model;
and the superposition processing module is used for synchronously superposing the voice sequence and the virtual anchor image sequence to obtain virtual anchor audio and video data.
Optionally, the apparatus further comprises:
the model building module is used for building the duration model, the voice synthesis model and the biological state synthesis model by utilizing the collected audio data and video data;
the model building module comprises:
the data acquisition module is used for acquiring audio data and video data;
the duration model building module is used for building a duration model;
the voice synthesis model building module is used for building a voice synthesis model based on the duration model;
and the biological state synthesis model building module is used for building a biological state synthesis model based on the duration model.
Optionally, the audio data and the video data comprise: synchronously recorded audio data and video data of the object corresponding to the virtual anchor.
Optionally, the audio data further comprises:
pure audio data of the object corresponding to the virtual anchor.
Optionally, the biological state synthesis model building module comprises:
the information labeling unit is used for taking the synchronously acquired audio data and video data as training data and performing biological characteristic parameter labeling and category labeling on the video data; performing voice parameter marking on the audio data; the biological characteristic parameters and the voice parameters comprise duration parameters determined based on the duration model;
the feature extraction unit is used for respectively extracting voice parameters of audio data and biological feature parameters of video data in the training data;
and the training unit is used for training to obtain a biological state synthesis model by utilizing the voice parameters, the biological characteristic parameters and the labeling information.
Optionally, the biological state synthesis model comprises: a lip model and/or an eye position model.
Optionally, the apparatus further comprises:
the picture processing module is used for acquiring a picture of the object corresponding to the virtual anchor and matting out a specific biological region in the picture to obtain a specific biological region image and a matted image;
the image synthesis module includes:
a specific biological state image generation unit for obtaining a virtual anchor specific biological state image sequence corresponding to the input text by using the biological state synthesis model;
and the image superposition unit is used for superposing the matted image onto each image in the virtual anchor specific biological state image sequence to obtain a virtual anchor image sequence corresponding to the input text.
Optionally, the apparatus further comprises:
the background image acquisition module is used for prerecording a background image sequence;
and the superposition processing module is used for synchronously superposing the voice sequence, the background image sequence and the virtual anchor image sequence.
Optionally, the background image sequence comprises at least any one of:
a virtual anchor head action image sequence;
virtual anchor hand motion image sequence.
An electronic device, comprising: one or more processors, memory;
the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions to implement the method described above.
A readable storage medium having stored thereon instructions which are executed to implement the foregoing method.
According to the virtual anchor implementation method and device, after an input text is received, a voice sequence and a virtual anchor image sequence corresponding to the input text are respectively obtained by using a voice synthesis model and a biological state synthesis model that are constructed in advance based on the same duration model; the voice sequence and the virtual anchor image sequence are then synchronously superposed to obtain virtual anchor audio and video data. Because the same duration model is adopted, the voice and the virtual anchor image state are guaranteed to stay in correspondence and to better match the state of a real person, so the picture is natural and smooth and the visual effect is improved.
Furthermore, by matting out the local images, the amount of data in the image synthesis processing is greatly reduced and the processing speed is improved, so the scheme of the invention can not only implement a virtual anchor for offline text but also live-broadcast a virtual anchor for text input in real time, without picture stutter.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments described in the present invention, and those skilled in the art can derive other drawings from these drawings.
Fig. 1 is a flowchart of a virtual anchor implementation method according to an embodiment of the present invention;
fig. 2 is a block diagram of a virtual anchor implementation apparatus according to an embodiment of the present invention;
fig. 3 is another block diagram of a virtual anchor implementation apparatus according to an embodiment of the present invention;
fig. 4 is another block diagram of a virtual anchor implementation apparatus according to an embodiment of the present invention;
FIG. 5 is a block diagram illustrating an apparatus for implementing a virtual anchor, according to an exemplary embodiment;
fig. 6 is a schematic structural diagram of a server in an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings and the embodiments.
According to the virtual anchor implementation method and device, after an input text is received, a voice sequence and a virtual anchor image sequence corresponding to the input text are respectively obtained by using a voice synthesis model and a biological state synthesis model constructed in advance based on the same duration model; the voice sequence and the virtual anchor image sequence are then synchronously superposed to obtain virtual anchor audio and video data.
In the embodiments of the invention, the speech synthesis model and the biological state synthesis model are constructed based on the same duration model, so as to ensure better audio-video synchronization and improve the visual effect. Each model needs to be trained using pre-collected audio data and video data.
It should be noted that, in the embodiments of the present invention, the virtual anchor may be an avatar based on a real person. Therefore, in practical applications, the collected audio and video data may be synchronously recorded audio data and video data of the object corresponding to the virtual anchor, that is, the audio data and video data of a real person are recorded synchronously. Of course, when training the speech synthesis model, text data corresponding to the audio data also needs to be obtained. In order to further improve the voice synthesis effect, some voice data of the object corresponding to the virtual anchor can be collected separately to increase the amount of training data and ensure the voice synthesis effect.
The duration model is a model based on pronunciation units and predicts the duration of each pronunciation unit. For Chinese, the pronunciation unit may take a syllable, a phoneme, a state, or the like as its unit, which is not limited in the embodiments of the present invention. The duration model can be constructed using the prior art, such as a statistical method or a model-based method.
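For illustration only (not part of the original disclosure), a minimal statistical duration model in Python might look as follows; the mean-duration approach, the pinyin unit names, and the frame counts are all assumptions rather than the patent's actual design:

```python
from collections import defaultdict

class StatisticalDurationModel:
    """Toy duration model: predicts each pronunciation unit's duration as the
    mean duration observed for that unit in aligned training data. A real
    system would condition on richer context (tone, position, neighbors)."""

    def __init__(self):
        self._sums = defaultdict(float)
        self._counts = defaultdict(int)

    def train(self, aligned_units):
        # aligned_units: iterable of (unit, duration_in_frames) pairs,
        # obtained for example from forced alignment of the recorded audio
        for unit, duration in aligned_units:
            self._sums[unit] += duration
            self._counts[unit] += 1

    def predict(self, unit, default=10.0):
        n = self._counts[unit]
        return self._sums[unit] / n if n else default

# Hypothetical usage: pinyin syllables with frame counts from alignment
model = StatisticalDurationModel()
model.train([("ni3", 18), ("hao3", 22), ("ni3", 20)])
print(model.predict("ni3"))  # 19.0 frames
```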
In practical applications, speech synthesis for the input text may adopt a parametric mode, a concatenation (splicing) mode, or other existing modes. For parametric speech synthesis, in the modeling stage, speech or prosodic parameters (such as spectrum, fundamental frequency, duration, and the like) are modeled to obtain a speech synthesis model; in the synthesis stage, the speech synthesis model predicts the speech parameters of the input text, and the predicted parameters are then used to reconstruct a time-domain speech signal. Concatenative speech synthesis instead models pronunciation units (such as phonemes) in the modeling stage, that is, it establishes the audio segments corresponding to each pronunciation unit; in the synthesis stage, the target cost and connection cost of each candidate pronunciation unit for the input text are computed by certain algorithms or models, and the synthesized speech is then spliced together.
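As a concrete illustration of the splicing mode's cost computation, the following sketch performs a Viterbi search over candidate units using a Euclidean target cost and a boundary-mismatch connection cost; the tuple layout, weights, and cost definitions are illustrative assumptions, not the patent's actual algorithm:

```python
import numpy as np

def select_units(candidates, target_feats, w_target=1.0, w_concat=0.5):
    """Viterbi search over candidate audio units.
    candidates[t]  : list of (features, left_boundary, right_boundary) tuples
    target_feats[t]: desired acoustic features for position t."""
    T = len(candidates)
    cost = [np.full(len(candidates[t]), np.inf) for t in range(T)]
    back = [np.zeros(len(candidates[t]), dtype=int) for t in range(T)]
    for j, (feat, _, _) in enumerate(candidates[0]):
        cost[0][j] = w_target * np.linalg.norm(feat - target_feats[0])
    for t in range(1, T):
        for j, (feat, left, _) in enumerate(candidates[t]):
            tgt = w_target * np.linalg.norm(feat - target_feats[t])
            # connection cost: mismatch between this unit's left boundary
            # and each predecessor's right boundary at the splice point
            joins = [cost[t - 1][i]
                     + w_concat * np.linalg.norm(left - candidates[t - 1][i][2])
                     for i in range(len(candidates[t - 1]))]
            best = int(np.argmin(joins))
            cost[t][j] = joins[best] + tgt
            back[t][j] = best
    path = [int(np.argmin(cost[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Hypothetical usage: 2 target positions, 3 candidate units each
rng = np.random.default_rng(0)
cands = [[(rng.normal(size=4), rng.normal(size=4), rng.normal(size=4))
          for _ in range(3)] for _ in range(2)]
targets = [rng.normal(size=4) for _ in range(2)]
print(select_units(cands, targets))  # chosen unit index per position
```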
Accordingly, in the embodiment of the present invention, the speech synthesis model may be constructed by using the prior art, which is not described in detail.
The biological state synthesis model is a speech-associated model of biological states. It may cover a single local biological state, such as a lip model or an eye position model, or multiple local biological states, such as a facial expression model.
The image finally presented by the virtual anchor may be a half-body image, a whole-body image, a head image, and the like, and the posture may be a sitting posture, a standing posture, and the like, which is not limited in the embodiments of the present invention. Accordingly, when video data are collected, the pose of the real person in the recorded images, and the like, can be determined according to application requirements.
When a biological state synthesis model is constructed, audio data and video data which are synchronously acquired can be used as training data, biological characteristic parameter labeling (such as inner and outer lip lines, lip width, lip height, lip protrusion and the like) and category labeling are carried out on the video data, voice parameter labeling is carried out on the audio data, voice parameters of the audio data and biological characteristic parameters of the video data in the training data are respectively extracted, and the biological state synthesis model is obtained through training by utilizing the voice parameters, the biological characteristic parameters and the labeling information.
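A minimal sketch of this training step, assuming the audio and video have already been aligned into frame-level pairs of voice parameters and lip feature parameters, could use a regularized linear mapping; this is a deliberately simplified stand-in, since the patent does not specify the model form, and the feature dimensions below are hypothetical:

```python
import numpy as np

# Stand-ins for frame-aligned training data extracted from the recordings:
# X holds per-frame voice parameters, Y holds the labeled lip feature parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 24))   # hypothetical voice parameter dimension
Y = rng.normal(size=(1000, 6))    # hypothetical lip parameter dimension

# Ridge-regularized linear map as the simplest possible "biological state
# synthesis model": it predicts lip parameters from voice parameters.
lam = 1e-2
W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def synthesize_lip_params(voice_params):
    """Predict one lip-parameter vector per frame of voice parameters."""
    return voice_params @ W

print(synthesize_lip_params(rng.normal(size=(5, 24))).shape)  # (5, 6)
```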
It should be noted that the category of the biological status can be determined by means of statistics or clustering.
In addition, the video data can be subjected to dimensionality reduction processing to improve the model training speed.
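For example, the clustering-based category labeling and the dimensionality reduction mentioned above could be sketched as follows, assuming per-frame lip feature vectors; the feature dimensionality and cluster count are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
lip_features = rng.normal(size=(5000, 40))  # hypothetical per-frame lip parameters

# Dimensionality reduction to speed up subsequent model training
reduced = PCA(n_components=8).fit_transform(lip_features)

# Category labeling by clustering: each frame gets a biological-state category
labels = KMeans(n_clusters=16, n_init=10, random_state=0).fit_predict(reduced)
print(labels[:10])  # category label per frame, usable as annotation
```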
Based on the pre-constructed models, the virtual anchor implementation method and device provided by the embodiment of the invention can generate the virtual anchor video corresponding to the text in real time after receiving the input text.
As shown in fig. 1, which is a flowchart of a virtual anchor implementation method according to an embodiment of the present invention, the method includes the following steps:
Step 101, receiving an input text;
Step 102, obtaining a voice sequence corresponding to the input text by using a voice synthesis model;
Step 103, obtaining a virtual anchor image sequence corresponding to the input text by using a biological state synthesis model, wherein the biological state synthesis model and the voice synthesis model are constructed on the basis of the same duration model;
it should be noted that, the step 103 and the step 102 are performed synchronously, and there is no chronological order.
Step 104, synchronously superposing the voice sequence and the virtual anchor image sequence to obtain virtual anchor audio and video data.
In practical application, the virtual anchor audio and video data can be used for live broadcasting.
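To illustrate why sharing one duration model keeps the audio and the image sequence synchronized, the following sketch lays both outputs on a single timeline; all model components here are hypothetical stubs, not the patent's implementation:

```python
SAMPLE_RATE = 16000
FPS = 25

def realize(units, predict_frames, synth_audio, synth_frames):
    """Lay speech and anchor images on one timeline driven by shared durations."""
    audio, frames = bytearray(), []
    for unit in units:
        n_frames = predict_frames(unit)           # one duration model feeds both
        n_samples = int(n_frames / FPS * SAMPLE_RATE)
        audio += synth_audio(unit, n_samples)     # speech synthesis model
        frames += synth_frames(unit, n_frames)    # biological state synthesis model
    return bytes(audio), frames                   # mux downstream into audio/video

# Hypothetical stand-ins so the sketch runs end to end
audio, frames = realize(
    ["ni3", "hao3"],
    predict_frames=lambda u: 20,                  # 20 video frames per unit
    synth_audio=lambda u, n: bytes(2 * n),        # n samples of 16-bit silence
    synth_frames=lambda u, n: [u] * n,            # n placeholder images
)
print(len(audio), len(frames))  # equal time spans: 51200 bytes, 40 frames
```

Because both synthesis paths consume identical per-unit durations, the audio samples and image frames cover exactly the same span of time, which is what makes the synchronous superposition in step 104 straightforward.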
As mentioned above, the biological state synthesis model may be a model for a single local biological state, or may be a model for a plurality of local biological states.
If a biological state synthesis model for a single local biological state, such as a lip model, is adopted, then in order to make the virtual anchor image more vivid and improve the visual effect, the change states of other local biological states can be superimposed on the finally generated virtual anchor state images, for example randomly inserted eye-state change images. The finally presented virtual anchor then not only shows lip shapes changing with the audio but also appears to blink. Furthermore, some background images can be recorded in advance for the object corresponding to the virtual anchor, generating background image sequences such as a virtual anchor head action image sequence and a virtual anchor hand action image sequence; these background image sequences can be synchronously superposed into the virtual anchor image sequence, so that the live picture has the effect of a real person.
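A sketch of the randomly inserted eye-state changes described above might look like this, assuming the generated frames and blink images are NumPy arrays and the eye region's position is known; the exponential blink timing is an assumption:

```python
import numpy as np

def add_blinks(frames, blink_seq, eye_box, mean_gap=100, seed=0):
    """Randomly paste a short blink image sequence into the eye region of the
    generated frames, so the anchor blinks independently of the audio.
    frames: list of HxWx3 arrays; blink_seq: list of hxwx3 eye images;
    eye_box: (top, left) corner of the eye region."""
    rng = np.random.default_rng(seed)
    top, left = eye_box
    h, w, _ = blink_seq[0].shape
    out = [f.copy() for f in frames]
    t = int(rng.exponential(mean_gap))
    while t + len(blink_seq) < len(out):
        for k, eye in enumerate(blink_seq):
            out[t + k][top:top + h, left:left + w] = eye
        t += len(blink_seq) + int(rng.exponential(mean_gap))
    return out

# Hypothetical usage: 300 blank frames, a 3-frame blink pasted at random times
base = [np.zeros((120, 160, 3), np.uint8) for _ in range(300)]
blink = [np.full((10, 30, 3), 255, np.uint8)] * 3
result = add_blinks(base, blink, eye_box=(40, 60))
```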
In addition, if a biological state synthesis model for a single local biological state is adopted, then when the collected video data are processed, only the corresponding local biological characteristic parameters need to be labeled, which further reduces the amount of training data and improves the processing speed. Correspondingly, during image synthesis, a picture of the object corresponding to the virtual anchor can be obtained in advance; a specific biological region is matted out of the picture to obtain a specific biological region image and a matted image; then, a virtual anchor specific biological state image sequence corresponding to the input text is obtained by using the biological state synthesis model; and the matted image is superposed onto each image in the virtual anchor specific biological state image sequence to obtain the virtual anchor image sequence corresponding to the input text.
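The matting and recomposition could be sketched as follows; the region coordinates and image sizes are hypothetical. Note how only the small region image varies per frame, which is the source of the data-volume reduction discussed below:

```python
import numpy as np

def cut_region(picture, box):
    """Matting step: cut a specific biological region (e.g., the mouth) out
    of the anchor picture, returning the region image and the matted picture."""
    top, left, h, w = box
    region = picture[top:top + h, left:left + w].copy()
    matted = picture.copy()
    matted[top:top + h, left:left + w] = 0  # leave a hole where the region was
    return region, matted

def recompose(matted, state_images, box):
    """Paste each synthesized region-state image back into the matted picture,
    yielding the full virtual anchor image sequence."""
    top, left, h, w = box
    out = []
    for state in state_images:
        frame = matted.copy()
        frame[top:top + h, left:left + w] = state
        out.append(frame)
    return out

# Hypothetical usage with a blank picture and two synthesized mouth states
pic = np.zeros((480, 640, 3), dtype=np.uint8)
box = (300, 260, 60, 120)                 # top, left, height, width of the mouth
mouth, base_pic = cut_region(pic, box)
sequence = recompose(base_pic, [mouth, mouth], box)
print(len(sequence), sequence[0].shape)   # 2 frames of 480x640x3
```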
After an input text is received, a voice sequence and a virtual anchor image sequence corresponding to the input text are respectively obtained by using a voice synthesis model and a biological state synthesis model constructed in advance based on the same duration model; the voice sequence and the virtual anchor image sequence are then synchronously superposed to obtain virtual anchor audio and video data. Because the same duration model is adopted, the voice and the virtual anchor image state are guaranteed to stay in correspondence and to better match the state of a real person, so the picture is natural and smooth and the visual effect is improved.
Furthermore, by matting out the local images, the amount of data in the image synthesis processing is greatly reduced and the processing speed is improved, so the scheme of the invention can not only implement a virtual anchor for offline text but also live-broadcast a virtual anchor for text input in real time, without picture stutter.
Correspondingly, an embodiment of the present invention further provides an apparatus for implementing a virtual anchor, as shown in fig. 2, which is a block diagram of the apparatus, and includes the following modules:
a receiving module 202, configured to receive an input text;
the speech synthesis module 203 is configured to obtain a speech sequence corresponding to the input text by using a pre-established speech synthesis model;
the image synthesis module 204 is configured to obtain a virtual anchor image sequence corresponding to the input text by using a pre-constructed biological state synthesis model;
and the superposition processing module 205 is configured to synchronously superpose the voice sequence and the virtual anchor image sequence to obtain virtual anchor audio/video data.
The models may be constructed by a model construction module (not shown) in advance using the collected audio data and video data, and the model construction module may be integrated in the apparatus of the present invention or may be independent from the apparatus of the present invention, which is not limited thereto.
The model building module may specifically include the following modules:
the data acquisition module is used for acquiring audio data and video data;
the duration model building module is used for building a duration model;
the voice synthesis model building module is used for building a voice synthesis model based on the duration model;
and the biological state synthesis model building module is used for building a biological state synthesis model based on the duration model.
It should be noted that the duration model is a model based on pronunciation units, and is used to predict the duration of each pronunciation unit. The pronunciation unit may be a syllable, a phoneme, a state, etc., and the embodiment of the present invention is not limited thereto. The duration model can be constructed by adopting the prior art, such as a statistical method or a model method.
In addition, in the embodiment of the present invention, the speech synthesis model and the biological state synthesis model need to be constructed based on the same duration model, so as to ensure better synchronization of audio and video and improve visual effect.
In practical application, the audio and video data collected by the data collection module may be synchronously recorded audio data and video data of the object corresponding to the virtual anchor, that is, the audio data and video data of a real person are recorded synchronously. Of course, when training the speech synthesis model, text data corresponding to the audio data also needs to be obtained. In order to further improve the voice synthesis effect, additional voice data of the object corresponding to the virtual anchor can be collected separately to increase the amount of training data.
When the speech synthesis model is trained, only the synchronously recorded audio data can be used as training data, and the synchronously recorded audio data and the separately recorded audio data can also be used as training data, so that the number of the training data is increased, and more accurate model parameters are obtained. Likewise, the training of the speech synthesis model may also be performed using known techniques, which will not be described in detail.
The biological state synthesis model building module may specifically include the following units:
the information labeling unit is used for taking the synchronously acquired audio data and video data as training data and performing biological characteristic parameter labeling and category labeling on the video data; performing voice parameter marking on the audio data; the biological characteristic parameters and the voice parameters comprise duration parameters determined based on the duration model;
the feature extraction unit is used for respectively extracting voice parameters of audio data and biological feature parameters of video data in the training data;
and the training unit is used for training to obtain a biological state synthesis model by utilizing the voice parameters, the biological characteristic parameters and the labeling information.
It should be noted that the biological state synthesis model may be a model for a single local biological state, such as a lip model, an eye position model, etc.; it may also be a model for a plurality of local biological states, such as a facial expression model.
According to the virtual anchor implementation apparatus provided by the embodiment of the invention, a voice synthesis model and a biological state synthesis model constructed in advance based on the same duration model are used: after an input text is received, a voice sequence and a virtual anchor image sequence corresponding to the input text are respectively obtained, and the voice sequence and the virtual anchor image sequence are then synchronously superposed to obtain virtual anchor audio and video data. Because the same duration model is adopted, the voice and the virtual anchor image state are guaranteed to stay in correspondence and to better match the state of a real person, so the picture is natural and smooth and the visual effect is improved.
Fig. 3 is another block diagram of the virtual anchor implementation apparatus according to the present invention.
Unlike the embodiment shown in fig. 2, in this embodiment, the apparatus further includes:
the background image acquisition module 301 is configured to pre-record a background image sequence, such as a virtual anchor head motion image sequence, a virtual anchor hand motion image sequence, and the like.
Accordingly, in this embodiment, the overlay processing module 205 can overlay the voice sequence, the background image sequence, and the virtual anchor image sequence synchronously.
By superposing the background images, the vividness of the virtual anchor image is further increased, and the visual effect is improved.
Fig. 4 is another block diagram of the virtual anchor implementation apparatus according to the present invention.
Unlike the embodiment shown in fig. 2, in this embodiment, the apparatus further includes:
the image processing module 401 is configured to obtain a picture of an object corresponding to a virtual anchor, and extract a specific biological region in the picture to obtain a specific biological region image and an extracted image;
accordingly, in this embodiment, the image synthesis module 204 may include the following units:
a specific biological state image generation unit for obtaining a virtual anchor specific biological state image sequence corresponding to the input text by using the biological state synthesis model;
and the image superposition unit is used for superposing the matted image onto each image in the virtual anchor specific biological state image sequence to obtain a virtual anchor image sequence corresponding to the input text.
The above-mentioned picture processing module 401 can also be applied to the above-mentioned embodiment shown in fig. 3.
The virtual anchor implementation apparatus provided by this embodiment can ensure that the voice corresponds to the virtual anchor image state and better matches the state of a real person, so the picture is more natural and smooth; and by matting out the local image, the amount of data in the image synthesis process is greatly reduced and the processing speed is improved, so the scheme of the invention can implement a virtual anchor both for offline text and for text input in real time.
Fig. 5 is a block diagram illustrating an apparatus 800 for implementing a virtual anchor, according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing element 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various classes of data to support operations at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800. The sensor assembly 814 may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, is also provided; the instructions are executable by the processor 820 of the device 800 to perform the virtual anchor implementation method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
The present invention also provides a non-transitory computer readable storage medium having instructions which, when executed by a processor of a mobile terminal, enable the mobile terminal to perform all or part of the steps of the above-described method embodiments of the present invention.
Fig. 6 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1900, which may vary widely in configuration or performance, may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) that store applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (18)
1. A virtual anchor implementation method, the method comprising:
receiving an input text;
obtaining a voice sequence corresponding to the input text by using a pre-constructed voice synthesis model, and obtaining a virtual anchor image sequence corresponding to the input text by using a pre-constructed biological state synthesis model, wherein the biological state synthesis model and the voice synthesis model are constructed on the basis of the same duration model; synchronously superposing the voice sequence and the virtual anchor image sequence to obtain virtual anchor audio and video data;
the method further comprises the following steps:
constructing the biological state synthesis model by using the collected audio data and video data;
the method for constructing the biological state synthesis model by using the collected audio data and video data comprises the following steps:
the audio data and the video data which are synchronously collected are used as training data, biological characteristic parameter marking and category marking are carried out on the video data, and voice parameter marking is carried out on the audio data; the biological characteristic parameters and the voice parameters comprise duration parameters determined based on the duration model;
respectively extracting voice parameters of audio data and biological characteristic parameters of video data in the training data;
and training to obtain a biological state synthesis model by using the voice parameters, the biological characteristic parameters and the labeling information.
2. The method of claim 1, further comprising:
and constructing the duration model and the voice synthesis model by using the collected audio data and video data.
3. The method of claim 2, wherein the audio data and the video data comprise: synchronously recorded audio data and video data of the object corresponding to the virtual anchor.
4. The method of claim 3, wherein the audio data further comprises:
pure audio data of the object corresponding to the virtual anchor.
5. The method of claim 1, wherein the biological state synthesis model comprises: a lip model and/or an eye position model.
6. The method of claim 1, further comprising:
acquiring a picture of the object corresponding to the virtual anchor;
matting out a specific biological region in the picture to obtain a specific biological region image and a matted image;
the obtaining of the virtual anchor image sequence corresponding to the input text using the biological state synthesis model comprises:
obtaining a virtual anchor specific biological state image sequence corresponding to the input text by using the biological state synthesis model;
and superposing the matted image onto each image in the virtual anchor specific biological state image sequence to obtain a virtual anchor image sequence corresponding to the input text.
7. The method according to any one of claims 1 to 6, further comprising:
pre-recording a background image sequence;
the step of synchronously overlaying the voice sequence and the virtual anchor image sequence comprises:
and synchronously superposing the voice sequence, the background image sequence and the virtual anchor image sequence.
8. The method of claim 7, wherein the background image sequence comprises at least any one of:
a virtual anchor head action image sequence;
a virtual anchor hand motion image sequence.
9. An apparatus for implementing a virtual anchor, the apparatus comprising:
the receiving module is used for receiving an input text;
the voice synthesis module is used for obtaining a voice sequence corresponding to the input text by utilizing a pre-constructed voice synthesis model;
the image synthesis module is used for obtaining a virtual anchor image sequence corresponding to the input text by utilizing a pre-constructed biological state synthesis model; the biological state synthesis model and the voice synthesis model are constructed on the basis of the same duration model;
the superposition processing module is used for synchronously superposing the voice sequence and the virtual anchor image sequence to obtain virtual anchor audio and video data;
the device further comprises:
the model building module is used for building the biological state synthesis model by utilizing the collected audio data and video data;
the model building module comprises:
the biological state synthesis model building module is used for building a biological state synthesis model based on the duration model;
the biological state synthesis model construction module comprises:
the information labeling unit is used for taking the synchronously acquired audio data and video data as training data and performing biological characteristic parameter labeling and category labeling on the video data; performing voice parameter marking on the audio data; the biological characteristic parameters and the voice parameters comprise duration parameters determined based on the duration model;
the feature extraction unit is used for respectively extracting voice parameters of audio data and biological feature parameters of video data in the training data;
and the training unit is used for training to obtain a biological state synthesis model by utilizing the voice parameters, the biological characteristic parameters and the labeling information.
10. The apparatus of claim 9, wherein the model building module is further configured to build the duration model and the speech synthesis model using the collected audio data and video data;
the model building module further comprises:
the data acquisition module is used for acquiring audio data and video data;
the duration model building module is used for building a duration model;
and the voice synthesis model building module is used for building a voice synthesis model based on the duration model.
11. The apparatus of claim 10, wherein the audio data and the video data comprise: synchronously recorded audio data and video data of the object corresponding to the virtual anchor.
12. The apparatus of claim 11, wherein the audio data further comprises:
pure audio data of the object corresponding to the virtual anchor.
13. The apparatus of claim 9, wherein the biological state synthesis model comprises: a lip model and/or an eye position model.
14. The apparatus of claim 9, further comprising:
the picture processing module is used for acquiring a picture of the object corresponding to the virtual anchor and matting out a specific biological region in the picture to obtain a specific biological region image and a matted image;
the image synthesis module includes:
a specific biological state image generation unit for obtaining a virtual anchor specific biological state image sequence corresponding to the input text by using the biological state synthesis model;
and the image superposition unit is used for superposing the matted image onto each image in the virtual anchor specific biological state image sequence to obtain a virtual anchor image sequence corresponding to the input text.
15. The apparatus of any one of claims 9 to 14, further comprising:
the background image acquisition module is used for prerecording a background image sequence;
and the superposition processing module is used for synchronously superposing the voice sequence, the background image sequence and the virtual anchor image sequence.
16. The apparatus of claim 15, wherein the sequence of background images comprises at least any one of:
a virtual anchor head action image sequence;
a virtual anchor hand motion image sequence.
17. An electronic device, comprising: one or more processors, memory;
the memory is for storing computer-executable instructions, and the processor is for executing the computer-executable instructions to implement the method of any one of claims 1 to 8.
18. A readable storage medium having stored thereon instructions that are executed to implement the method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811320949.4A (CN109637518B) | 2018-11-07 | 2018-11-07 | Virtual anchor implementation method and device
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811320949.4A (CN109637518B) | 2018-11-07 | 2018-11-07 | Virtual anchor implementation method and device
Publications (2)
Publication Number | Publication Date |
---|---|
CN109637518A CN109637518A (en) | 2019-04-16 |
CN109637518B (en) | 2022-05-24 |
Family
ID=66067462
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811320949.4A | CN109637518B (en), Active | 2018-11-07 | 2018-11-07 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109637518B (en) |
Families Citing this family (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110347867B (en) * | 2019-07-16 | 2022-04-19 | 北京百度网讯科技有限公司 | Method and device for generating lip motion video |
CN110493613B (en) * | 2019-08-16 | 2020-05-19 | 江苏遨信科技有限公司 | Video lip synchronization synthesis method and system |
CN110534085B (en) * | 2019-08-29 | 2022-02-25 | 北京百度网讯科技有限公司 | Method and apparatus for generating information |
CN111050187B (en) * | 2019-12-09 | 2020-12-15 | 腾讯科技(深圳)有限公司 | Virtual video processing method, device and storage medium |
CN110913259A (en) * | 2019-12-11 | 2020-03-24 | 百度在线网络技术(北京)有限公司 | Video playing method and device, electronic equipment and medium |
CN111010589B (en) * | 2019-12-19 | 2022-02-25 | 腾讯科技(深圳)有限公司 | Live broadcast method, device, equipment and storage medium based on artificial intelligence |
CN111010586B (en) * | 2019-12-19 | 2021-03-19 | 腾讯科技(深圳)有限公司 | Live broadcast method, device, equipment and storage medium based on artificial intelligence |
CN111369967B (en) * | 2020-03-11 | 2021-03-05 | 北京字节跳动网络技术有限公司 | Virtual character-based voice synthesis method, device, medium and equipment |
CN111508467A (en) * | 2020-04-13 | 2020-08-07 | 湖南声广信息科技有限公司 | Audio splicing method for host of music broadcasting station |
CN113689879B (en) * | 2020-05-18 | 2024-05-14 | 北京搜狗科技发展有限公司 | Method, device, electronic equipment and medium for driving virtual person in real time |
CN111883107B (en) * | 2020-08-03 | 2022-09-16 | 北京字节跳动网络技术有限公司 | Speech synthesis and feature extraction model training method, device, medium and equipment |
CN112002005A (en) * | 2020-08-25 | 2020-11-27 | 成都威爱新经济技术研究院有限公司 | Cloud-based remote virtual collaborative host method |
CN112233210B (en) * | 2020-09-14 | 2024-06-07 | 北京百度网讯科技有限公司 | Method, apparatus, device and computer storage medium for generating virtual character video |
CN112820265B (en) * | 2020-09-14 | 2023-12-08 | 腾讯科技(深圳)有限公司 | Speech synthesis model training method and related device |
CN112333179B (en) * | 2020-10-30 | 2023-11-10 | 腾讯科技(深圳)有限公司 | Live broadcast method, device and equipment of virtual video and readable storage medium |
CN112420014A (en) * | 2020-11-17 | 2021-02-26 | 平安科技(深圳)有限公司 | Virtual face construction method and device, computer equipment and computer readable medium |
CN112560622B (en) * | 2020-12-08 | 2023-07-21 | 中国联合网络通信集团有限公司 | Virtual object action control method and device and electronic equipment |
CN112633110B (en) * | 2020-12-16 | 2024-02-13 | 中国联合网络通信集团有限公司 | Data processing method and device |
CN112770062B (en) * | 2020-12-22 | 2024-03-08 | 北京奇艺世纪科技有限公司 | Image generation method and device |
CN112887747B (en) * | 2021-01-25 | 2023-09-12 | 百果园技术(新加坡)有限公司 | Live broadcasting room control method and device and electronic equipment |
CN113570686A (en) * | 2021-02-07 | 2021-10-29 | 腾讯科技(深圳)有限公司 | Virtual video live broadcast processing method and device, storage medium and electronic equipment |
CN113178206B (en) * | 2021-04-22 | 2022-05-31 | 内蒙古大学 | AI (Artificial intelligence) composite anchor generation method, electronic equipment and readable storage medium |
CN113642394B (en) * | 2021-07-07 | 2024-06-11 | 北京搜狗科技发展有限公司 | Method, device and medium for processing actions of virtual object |
CN113891150B (en) * | 2021-09-24 | 2024-10-11 | 北京搜狗科技发展有限公司 | Video processing method, device and medium |
CN114630144B (en) * | 2022-03-03 | 2024-10-01 | 广州方硅信息技术有限公司 | Audio replacement method, system, device, computer equipment and storage medium in live broadcasting room |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8224652B2 (en) * | 2008-09-26 | 2012-07-17 | Microsoft Corporation | Speech and text driven HMM-based body animation synthesis |
- 2018-11-07: CN application CN201811320949.4A filed, granted as patent CN109637518B (status: Active)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103218842A (en) * | 2013-03-12 | 2013-07-24 | 西南交通大学 | Voice synchronous-drive three-dimensional face mouth shape and face posture animation method |
CN106653052A (en) * | 2016-12-29 | 2017-05-10 | Tcl集团股份有限公司 | Virtual human face animation generation method and device |
CN107170030A (en) * | 2017-05-31 | 2017-09-15 | 珠海金山网络游戏科技有限公司 | A kind of virtual newscaster's live broadcasting method and system |
CN107277599A (en) * | 2017-05-31 | 2017-10-20 | 珠海金山网络游戏科技有限公司 | A kind of live broadcasting method of virtual reality, device and system |
Also Published As
Publication number | Publication date |
---|---|
CN109637518A (en) | 2019-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109637518B (en) | Virtual anchor implementation method and device | |
US11503377B2 (en) | Method and electronic device for processing data | |
CN109446876B (en) | Sign language information processing method and device, electronic equipment and readable storage medium | |
CN108363706B (en) | Method and device for man-machine dialogue interaction | |
US20170304735A1 (en) | Method and Apparatus for Performing Live Broadcast on Game | |
CN109429078B (en) | Video processing method and device for video processing | |
CN112199016B (en) | Image processing method, image processing device, electronic equipment and computer readable storage medium | |
WO2019153925A1 (en) | Searching method and related device | |
CN113691833B (en) | Virtual anchor face changing method and device, electronic equipment and storage medium | |
WO2022198934A1 (en) | Method and apparatus for generating video synchronized to beat of music | |
EP3340077B1 (en) | Method and apparatus for inputting expression information | |
US20210029304A1 (en) | Methods for generating video, electronic device and storage medium | |
CN104574299A (en) | Face picture processing method and device | |
CN110490164B (en) | Method, device, equipment and medium for generating virtual expression | |
EP4300431A1 (en) | Action processing method and apparatus for virtual object, and storage medium | |
CN109033423A (en) | Simultaneous interpretation caption presentation method and device, intelligent meeting method, apparatus and system | |
CN110730360A (en) | Video uploading and playing methods and devices, client equipment and storage medium | |
WO2021232875A1 (en) | Method and apparatus for driving digital person, and electronic device | |
CN111954063A (en) | Content display control method and device for video live broadcast room | |
CN110990534A (en) | Data processing method and device and data processing device | |
CN113806570A (en) | Image generation method and generation device, electronic device and storage medium | |
CN111145080B (en) | Training method of image generation model, image generation method and device | |
KR20130096983A (en) | Method and apparatus for processing video information including face | |
CN105635573B (en) | Camera visual angle regulating method and device | |
CN110636377A (en) | Video processing method, device, storage medium, terminal and server |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |