CN109637518B - Virtual anchor implementation method and device - Google Patents
Virtual anchor implementation method and device
- Publication number
- CN109637518B (application CN201811320949.4A)
- Authority
- CN
- China
- Prior art keywords
- virtual anchor
- model
- voice
- image
- synthesis model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/431—Generation of visual interfaces for content selection or interaction; Content or additional data rendering
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/431—Generation of visual interfaces for content selection or interaction; Content or additional data rendering
- H04N21/4312—Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/265—Mixing
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L2013/021—Overlap-add techniques
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Studio Devices (AREA)
- Processing Or Creating Images (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The invention discloses a virtual anchor implementation method and device, the method comprising: receiving an input text; obtaining a voice sequence corresponding to the input text by using a pre-constructed voice synthesis model, and obtaining a virtual anchor image sequence corresponding to the input text by using a pre-constructed biological state synthesis model, wherein the biological state synthesis model and the voice synthesis model are constructed on the basis of the same duration model; and synchronously superposing the voice sequence and the virtual anchor image sequence to obtain virtual anchor audio and video data. The invention can greatly improve the visual effect of the virtual anchor.
Description
Technical Field
The invention relates to animation technology, in particular to a method and a device for realizing a virtual anchor.
Background
Currently, with the rapid rise of self-media and of short video platforms, network traffic is accelerating its shift from text to video. Some video platforms can provide richer presentation modes for users, but an anchor played by a real person is limited by conditions such as the anchor himself or herself, so the presentation form is single and the audience experience suffers. For this reason, the industry has introduced video products that replace real people with virtual images, namely virtual anchors, which, as the name implies, present relevant video content to users via a virtual image, for example a virtual image hosting a video column or broadcasting news. However, the virtual image in such products is usually an animated character, which not only has a long production period but also a poor visual effect.
Disclosure of Invention
The embodiments of the invention provide an online virtual anchor implementation method and apparatus, for improving the visual effect of the virtual anchor.
Therefore, the invention provides the following technical scheme:
a virtual anchor implementation method, the method comprising:
receiving an input text;
obtaining a voice sequence corresponding to the input text by using a pre-constructed voice synthesis model, and obtaining a virtual anchor image sequence corresponding to the input text by using a pre-constructed biological state synthesis model, wherein the biological state synthesis model and the voice synthesis model are constructed on the basis of the same duration model;
and synchronously superposing the voice sequence and the virtual anchor image sequence to obtain virtual anchor audio and video data.
Optionally, the method further comprises:
and constructing the duration model, the voice synthesis model and the biological state synthesis model by using the collected audio data and video data.
Optionally, the audio data and the video data comprise: synchronously recorded audio data and video data of the object corresponding to the virtual anchor.
Optionally, the audio data further comprises: pure audio data of the object corresponding to the virtual anchor.
Optionally, the constructing a biological state synthesis model by using the collected audio data and video data comprises:
the audio data and the video data which are synchronously collected are used as training data, biological characteristic parameter marking and category marking are carried out on the video data, and voice parameter marking is carried out on the audio data; the biological characteristic parameters and the voice parameters comprise duration parameters determined based on the duration model;
respectively extracting voice parameters of audio data and biological characteristic parameters of video data in the training data;
and training to obtain a biological state synthesis model by using the voice parameters, the biological characteristic parameters and the labeling information.
Optionally, the biological state synthesis model comprises: a lip model and/or an eye position model.
Optionally, the method further comprises:
acquiring a picture of the object corresponding to the virtual anchor;
matting out a specific biological region in the picture to obtain a specific biological region image and a matted image;
the obtaining of the virtual anchor image sequence corresponding to the input text using the biological state synthesis model comprises:
obtaining a virtual anchor specific biological state image sequence corresponding to the input text by using the biological state synthesis model;
and superposing the matted image onto each image in the virtual anchor specific biological state image sequence to obtain a virtual anchor image sequence corresponding to the input text.
Optionally, the method further comprises:
pre-recording a background image sequence;
the step of synchronously overlaying the voice sequence and the virtual anchor image sequence comprises:
and synchronously superposing the voice sequence, the background image sequence and the virtual anchor image sequence.
Optionally, the background image sequence comprises at least any one of:
a virtual anchor head action image sequence;
a virtual anchor hand motion image sequence.
A virtual anchor implementation apparatus, the apparatus comprising:
the receiving module is used for receiving an input text;
the voice synthesis module is used for obtaining a voice sequence corresponding to the input text by utilizing a pre-constructed voice synthesis model;
the image synthesis module is used for obtaining a virtual anchor image sequence corresponding to the input text by utilizing a pre-constructed biological state synthesis model; the biological state synthesis model and the voice synthesis model are constructed based on the same duration model;
and the superposition processing module is used for synchronously superposing the voice sequence and the virtual anchor image sequence to obtain virtual anchor audio and video data.
Optionally, the apparatus further comprises:
the model building module is used for building the duration model, the voice synthesis model and the biological state synthesis model by utilizing the collected audio data and video data;
the model building module comprises:
the data acquisition module is used for acquiring audio data and video data;
the duration model building module is used for building a duration model;
the voice synthesis model building module is used for building a voice synthesis model based on the duration model;
and the biological state synthesis model building module is used for building a biological state synthesis model based on the duration model.
Optionally, the audio data and the video data comprise: synchronously recorded audio data and video data of the object corresponding to the virtual anchor.
Optionally, the audio data further comprises:
pure audio data of the object corresponding to the virtual anchor.
Optionally, the biological state synthesis model building module comprises:
the information labeling unit is used for taking the synchronously acquired audio data and video data as training data and performing biological characteristic parameter labeling and category labeling on the video data; performing voice parameter marking on the audio data; the biological characteristic parameters and the voice parameters comprise duration parameters determined based on the duration model;
the feature extraction unit is used for respectively extracting voice parameters of audio data and biological feature parameters of video data in the training data;
and the training unit is used for training to obtain a biological state synthesis model by utilizing the voice parameters, the biological characteristic parameters and the labeling information.
Optionally, the biological state synthesis model comprises: a lip model and/or an eye position model.
Optionally, the apparatus further comprises:
the picture processing module is used for acquiring a picture of the object corresponding to the virtual anchor and matting out a specific biological region in the picture to obtain a specific biological region image and a matted image;
the image synthesis module includes:
a specific biological state image generation unit for obtaining a virtual anchor specific biological state image sequence corresponding to the input text by using the biological state synthesis model;
and the image superposition unit is used for superposing the matted image onto each image in the virtual anchor specific biological state image sequence to obtain a virtual anchor image sequence corresponding to the input text.
Optionally, the apparatus further comprises:
the background image acquisition module is used for prerecording a background image sequence;
and the superposition processing module is used for synchronously superposing the voice sequence, the background image sequence and the virtual anchor image sequence.
Optionally, the background image sequence comprises at least any one of:
a virtual anchor head action image sequence;
virtual anchor hand motion image sequence.
An electronic device, comprising: one or more processors, memory;
the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions to implement the method described above.
A readable storage medium having stored thereon instructions which are executed to implement the foregoing method.
According to the virtual anchor implementation method and device, after an input text is received, a voice sequence and a virtual anchor image sequence corresponding to the input text are respectively obtained by using a voice synthesis model and a biological state synthesis model that are constructed in advance based on the same duration model; the voice sequence and the virtual anchor image sequence are then synchronously superposed to obtain virtual anchor audio and video data. Because the same duration model is adopted, the voice and the virtual anchor image state are guaranteed to stay in correspondence and to better match the state of a real person, so the picture is natural and smooth and the visual effect is improved.
Furthermore, by matting out the local images, the amount of data in the image synthesis processing is greatly reduced and the processing speed is improved, so the scheme of the invention can not only implement a virtual anchor for offline text but also live-broadcast a virtual anchor for text input in real time, without picture stutter.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments described in the present invention, and those skilled in the art can derive other drawings from these drawings.
Fig. 1 is a flowchart of a virtual anchor implementation method according to an embodiment of the present invention;
fig. 2 is a block diagram of a virtual anchor implementation apparatus according to an embodiment of the present invention;
fig. 3 is another block diagram of a virtual anchor implementation apparatus according to an embodiment of the present invention;
fig. 4 is another block diagram of a virtual anchor implementation apparatus according to an embodiment of the present invention;
FIG. 5 is a block diagram illustrating an apparatus for implementing a virtual anchor, according to an exemplary embodiment;
fig. 6 is a schematic structural diagram of a server in an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings and the embodiments.
According to the virtual anchor implementation method and device, after an input text is received, a voice sequence and a virtual anchor image sequence corresponding to the input text are respectively obtained by using a voice synthesis model and a biological state synthesis model constructed in advance based on the same duration model; the voice sequence and the virtual anchor image sequence are then synchronously superposed to obtain virtual anchor audio and video data.
In the embodiments of the invention, the speech synthesis model and the biological state synthesis model are constructed based on the same duration model, so as to ensure better audio-video synchronization and improve the visual effect. Each model needs to be trained using pre-collected audio data and video data.
It should be noted that, in the embodiments of the present invention, the virtual anchor may be an avatar based on a real person. Therefore, in practical applications, the collected audio and video data may be synchronously recorded audio data and video data of the object corresponding to the virtual anchor, that is, the audio data and video data of a real person are recorded synchronously. Of course, when training the speech synthesis model, text data corresponding to the audio data also needs to be obtained. In order to further improve the voice synthesis effect, some voice data of the object corresponding to the virtual anchor can be collected separately to increase the amount of training data and ensure the voice synthesis effect.
The duration model is a model based on pronunciation units and predicts the duration of each pronunciation unit. For Chinese, the pronunciation unit may take a syllable, a phoneme, a state, or the like as its unit, which is not limited in the embodiments of the present invention. The duration model can be constructed using the prior art, such as a statistical method or a model-based method.
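For illustration only (not part of the original disclosure), a minimal statistical duration model in Python might look as follows; the mean-duration approach, the pinyin unit names, and the frame counts are all assumptions rather than the patent's actual design:

```python
from collections import defaultdict

class StatisticalDurationModel:
    """Toy duration model: predicts each pronunciation unit's duration as the
    mean duration observed for that unit in aligned training data. A real
    system would condition on richer context (tone, position, neighbors)."""

    def __init__(self):
        self._sums = defaultdict(float)
        self._counts = defaultdict(int)

    def train(self, aligned_units):
        # aligned_units: iterable of (unit, duration_in_frames) pairs,
        # obtained for example from forced alignment of the recorded audio
        for unit, duration in aligned_units:
            self._sums[unit] += duration
            self._counts[unit] += 1

    def predict(self, unit, default=10.0):
        n = self._counts[unit]
        return self._sums[unit] / n if n else default

# Hypothetical usage: pinyin syllables with frame counts from alignment
model = StatisticalDurationModel()
model.train([("ni3", 18), ("hao3", 22), ("ni3", 20)])
print(model.predict("ni3"))  # 19.0 frames
```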
In practical applications, speech synthesis for the input text may adopt a parametric mode, a concatenation (splicing) mode, or other existing modes. For parametric speech synthesis, in the modeling stage, speech or prosodic parameters (such as spectrum, fundamental frequency, duration, and the like) are modeled to obtain a speech synthesis model; in the synthesis stage, the speech synthesis model predicts the speech parameters of the input text, and the predicted parameters are then used to reconstruct a time-domain speech signal. Concatenative speech synthesis instead models pronunciation units (such as phonemes) in the modeling stage, that is, it establishes the audio segments corresponding to each pronunciation unit; in the synthesis stage, the target cost and connection cost of each candidate pronunciation unit for the input text are computed by certain algorithms or models, and the synthesized speech is then spliced together.
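As a concrete illustration of the splicing mode's cost computation, the following sketch performs a Viterbi search over candidate units using a Euclidean target cost and a boundary-mismatch connection cost; the tuple layout, weights, and cost definitions are illustrative assumptions, not the patent's actual algorithm:

```python
import numpy as np

def select_units(candidates, target_feats, w_target=1.0, w_concat=0.5):
    """Viterbi search over candidate audio units.
    candidates[t]  : list of (features, left_boundary, right_boundary) tuples
    target_feats[t]: desired acoustic features for position t."""
    T = len(candidates)
    cost = [np.full(len(candidates[t]), np.inf) for t in range(T)]
    back = [np.zeros(len(candidates[t]), dtype=int) for t in range(T)]
    for j, (feat, _, _) in enumerate(candidates[0]):
        cost[0][j] = w_target * np.linalg.norm(feat - target_feats[0])
    for t in range(1, T):
        for j, (feat, left, _) in enumerate(candidates[t]):
            tgt = w_target * np.linalg.norm(feat - target_feats[t])
            # connection cost: mismatch between this unit's left boundary
            # and each predecessor's right boundary at the splice point
            joins = [cost[t - 1][i]
                     + w_concat * np.linalg.norm(left - candidates[t - 1][i][2])
                     for i in range(len(candidates[t - 1]))]
            best = int(np.argmin(joins))
            cost[t][j] = joins[best] + tgt
            back[t][j] = best
    path = [int(np.argmin(cost[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Hypothetical usage: 2 target positions, 3 candidate units each
rng = np.random.default_rng(0)
cands = [[(rng.normal(size=4), rng.normal(size=4), rng.normal(size=4))
          for _ in range(3)] for _ in range(2)]
targets = [rng.normal(size=4) for _ in range(2)]
print(select_units(cands, targets))  # chosen unit index per position
```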
Accordingly, in the embodiment of the present invention, the speech synthesis model may be constructed by using the prior art, which is not described in detail.
The biological state synthesis model is a speech-associated model of biological states. It may cover a single local biological state, such as a lip model or an eye position model, or multiple local biological states, such as a facial expression model.
The image finally presented by the virtual anchor may be a half-body image, a whole-body image, a head image, and the like, and the posture may be a sitting posture, a standing posture, and the like, which is not limited in the embodiments of the present invention. Accordingly, when video data are collected, the pose of the real person in the recorded images, and the like, can be determined according to application requirements.
When a biological state synthesis model is constructed, audio data and video data which are synchronously acquired can be used as training data, biological characteristic parameter labeling (such as inner and outer lip lines, lip width, lip height, lip protrusion and the like) and category labeling are carried out on the video data, voice parameter labeling is carried out on the audio data, voice parameters of the audio data and biological characteristic parameters of the video data in the training data are respectively extracted, and the biological state synthesis model is obtained through training by utilizing the voice parameters, the biological characteristic parameters and the labeling information.
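A minimal sketch of this training step, assuming the audio and video have already been aligned into frame-level pairs of voice parameters and lip feature parameters, could use a regularized linear mapping; this is a deliberately simplified stand-in, since the patent does not specify the model form, and the feature dimensions below are hypothetical:

```python
import numpy as np

# Stand-ins for frame-aligned training data extracted from the recordings:
# X holds per-frame voice parameters, Y holds the labeled lip feature parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 24))   # hypothetical voice parameter dimension
Y = rng.normal(size=(1000, 6))    # hypothetical lip parameter dimension

# Ridge-regularized linear map as the simplest possible "biological state
# synthesis model": it predicts lip parameters from voice parameters.
lam = 1e-2
W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def synthesize_lip_params(voice_params):
    """Predict one lip-parameter vector per frame of voice parameters."""
    return voice_params @ W

print(synthesize_lip_params(rng.normal(size=(5, 24))).shape)  # (5, 6)
```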
It should be noted that the category of the biological status can be determined by means of statistics or clustering.
In addition, the video data can be subjected to dimensionality reduction processing to improve the model training speed.
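For example, the clustering-based category labeling and the dimensionality reduction mentioned above could be sketched as follows, assuming per-frame lip feature vectors; the feature dimensionality and cluster count are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
lip_features = rng.normal(size=(5000, 40))  # hypothetical per-frame lip parameters

# Dimensionality reduction to speed up subsequent model training
reduced = PCA(n_components=8).fit_transform(lip_features)

# Category labeling by clustering: each frame gets a biological-state category
labels = KMeans(n_clusters=16, n_init=10, random_state=0).fit_predict(reduced)
print(labels[:10])  # category label per frame, usable as annotation
```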
Based on the pre-constructed models, the virtual anchor implementation method and device provided by the embodiment of the invention can generate the virtual anchor video corresponding to the text in real time after receiving the input text.
As shown in fig. 1, which is a flowchart of a virtual anchor implementation method according to an embodiment of the present invention, the method includes the following steps:
Step 101, receiving an input text;
Step 102, obtaining a voice sequence corresponding to the input text by using a voice synthesis model;
Step 103, obtaining a virtual anchor image sequence corresponding to the input text by using a biological state synthesis model, wherein the biological state synthesis model and the voice synthesis model are constructed on the basis of the same duration model;
it should be noted that, the step 103 and the step 102 are performed synchronously, and there is no chronological order.
Step 104, synchronously superposing the voice sequence and the virtual anchor image sequence to obtain virtual anchor audio and video data.
In practical application, the virtual anchor audio and video data can be used for live broadcasting.
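To illustrate why sharing one duration model keeps the audio and the image sequence synchronized, the following sketch lays both outputs on a single timeline; all model components here are hypothetical stubs, not the patent's implementation:

```python
SAMPLE_RATE = 16000
FPS = 25

def realize(units, predict_frames, synth_audio, synth_frames):
    """Lay speech and anchor images on one timeline driven by shared durations."""
    audio, frames = bytearray(), []
    for unit in units:
        n_frames = predict_frames(unit)           # one duration model feeds both
        n_samples = int(n_frames / FPS * SAMPLE_RATE)
        audio += synth_audio(unit, n_samples)     # speech synthesis model
        frames += synth_frames(unit, n_frames)    # biological state synthesis model
    return bytes(audio), frames                   # mux downstream into audio/video

# Hypothetical stand-ins so the sketch runs end to end
audio, frames = realize(
    ["ni3", "hao3"],
    predict_frames=lambda u: 20,                  # 20 video frames per unit
    synth_audio=lambda u, n: bytes(2 * n),        # n samples of 16-bit silence
    synth_frames=lambda u, n: [u] * n,            # n placeholder images
)
print(len(audio), len(frames))  # equal time spans: 51200 bytes, 40 frames
```

Because both synthesis paths consume identical per-unit durations, the audio samples and image frames cover exactly the same span of time, which is what makes the synchronous superposition in step 104 straightforward.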
As mentioned above, the biological state synthesis model may be a model for a single local biological state, or may be a model for a plurality of local biological states.
If a biological state synthesis model for a single local biological state, such as a lip model, is adopted, then in order to make the virtual anchor image more vivid and improve the visual effect, the change states of other local biological states can be superimposed on the finally generated virtual anchor state images, for example randomly inserted eye-state change images. The finally presented virtual anchor then not only shows lip shapes changing with the audio but also appears to blink. Furthermore, some background images can be recorded in advance for the object corresponding to the virtual anchor, generating background image sequences such as a virtual anchor head action image sequence and a virtual anchor hand action image sequence; these background image sequences can be synchronously superposed into the virtual anchor image sequence, so that the live picture has the effect of a real person.
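A sketch of the randomly inserted eye-state changes described above might look like this, assuming the generated frames and blink images are NumPy arrays and the eye region's position is known; the exponential blink timing is an assumption:

```python
import numpy as np

def add_blinks(frames, blink_seq, eye_box, mean_gap=100, seed=0):
    """Randomly paste a short blink image sequence into the eye region of the
    generated frames, so the anchor blinks independently of the audio.
    frames: list of HxWx3 arrays; blink_seq: list of hxwx3 eye images;
    eye_box: (top, left) corner of the eye region."""
    rng = np.random.default_rng(seed)
    top, left = eye_box
    h, w, _ = blink_seq[0].shape
    out = [f.copy() for f in frames]
    t = int(rng.exponential(mean_gap))
    while t + len(blink_seq) < len(out):
        for k, eye in enumerate(blink_seq):
            out[t + k][top:top + h, left:left + w] = eye
        t += len(blink_seq) + int(rng.exponential(mean_gap))
    return out

# Hypothetical usage: 300 blank frames, a 3-frame blink pasted at random times
base = [np.zeros((120, 160, 3), np.uint8) for _ in range(300)]
blink = [np.full((10, 30, 3), 255, np.uint8)] * 3
result = add_blinks(base, blink, eye_box=(40, 60))
```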
In addition, if a biological state synthesis model for a single local biological state is adopted, then when the collected video data are processed, only the corresponding local biological characteristic parameters need to be labeled, which further reduces the amount of training data and improves the processing speed. Correspondingly, during image synthesis, a picture of the object corresponding to the virtual anchor can be obtained in advance; a specific biological region is matted out of the picture to obtain a specific biological region image and a matted image; then, a virtual anchor specific biological state image sequence corresponding to the input text is obtained by using the biological state synthesis model; and the matted image is superposed onto each image in the virtual anchor specific biological state image sequence to obtain the virtual anchor image sequence corresponding to the input text.
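The matting and recomposition could be sketched as follows; the region coordinates and image sizes are hypothetical. Note how only the small region image varies per frame, which is the source of the data-volume reduction discussed below:

```python
import numpy as np

def cut_region(picture, box):
    """Matting step: cut a specific biological region (e.g., the mouth) out
    of the anchor picture, returning the region image and the matted picture."""
    top, left, h, w = box
    region = picture[top:top + h, left:left + w].copy()
    matted = picture.copy()
    matted[top:top + h, left:left + w] = 0  # leave a hole where the region was
    return region, matted

def recompose(matted, state_images, box):
    """Paste each synthesized region-state image back into the matted picture,
    yielding the full virtual anchor image sequence."""
    top, left, h, w = box
    out = []
    for state in state_images:
        frame = matted.copy()
        frame[top:top + h, left:left + w] = state
        out.append(frame)
    return out

# Hypothetical usage with a blank picture and two synthesized mouth states
pic = np.zeros((480, 640, 3), dtype=np.uint8)
box = (300, 260, 60, 120)                 # top, left, height, width of the mouth
mouth, base_pic = cut_region(pic, box)
sequence = recompose(base_pic, [mouth, mouth], box)
print(len(sequence), sequence[0].shape)   # 2 frames of 480x640x3
```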
After an input text is received, a voice sequence and a virtual anchor image sequence corresponding to the input text are respectively obtained by using a voice synthesis model and a biological state synthesis model constructed in advance based on the same duration model; the voice sequence and the virtual anchor image sequence are then synchronously superposed to obtain virtual anchor audio and video data. Because the same duration model is adopted, the voice and the virtual anchor image state are guaranteed to stay in correspondence and to better match the state of a real person, so the picture is natural and smooth and the visual effect is improved.
Furthermore, by matting out the local images, the amount of data in the image synthesis processing is greatly reduced and the processing speed is improved, so the scheme of the invention can not only implement a virtual anchor for offline text but also live-broadcast a virtual anchor for text input in real time, without picture stutter.
Correspondingly, an embodiment of the present invention further provides an apparatus for implementing a virtual anchor, as shown in fig. 2, which is a block diagram of the apparatus, and includes the following modules:
a receiving module 202, configured to receive an input text;
the speech synthesis module 203 is configured to obtain a speech sequence corresponding to the input text by using a pre-established speech synthesis model;
the image synthesis module 204 is configured to obtain a virtual anchor image sequence corresponding to the input text by using a pre-constructed biological state synthesis model;
and the superposition processing module 205 is configured to synchronously superpose the voice sequence and the virtual anchor image sequence to obtain virtual anchor audio/video data.
The models may be constructed by a model construction module (not shown) in advance using the collected audio data and video data, and the model construction module may be integrated in the apparatus of the present invention or may be independent from the apparatus of the present invention, which is not limited thereto.
The model building module may specifically include the following modules:
the data acquisition module is used for acquiring audio data and video data;
the duration model building module is used for building a duration model;
the voice synthesis model building module is used for building a voice synthesis model based on the duration model;
and the biological state synthesis model building module is used for building a biological state synthesis model based on the duration model.
It should be noted that the duration model is a model based on pronunciation units, and is used to predict the duration of each pronunciation unit. The pronunciation unit may be a syllable, a phoneme, a state, etc., and the embodiment of the present invention is not limited thereto. The duration model can be constructed by adopting the prior art, such as a statistical method or a model method.
In addition, in the embodiment of the present invention, the speech synthesis model and the biological state synthesis model need to be constructed based on the same duration model, so as to ensure better synchronization of audio and video and improve visual effect.
In practical application, the audio and video data collected by the data collection module may be synchronously recorded audio data and video data of the object corresponding to the virtual anchor, that is, the audio data and video data of a real person are recorded synchronously. Of course, when training the speech synthesis model, text data corresponding to the audio data also needs to be obtained. In order to further improve the voice synthesis effect, additional voice data of the object corresponding to the virtual anchor can be collected separately to increase the amount of training data.
When the speech synthesis model is trained, only the synchronously recorded audio data can be used as training data, and the synchronously recorded audio data and the separately recorded audio data can also be used as training data, so that the number of the training data is increased, and more accurate model parameters are obtained. Likewise, the training of the speech synthesis model may also be performed using known techniques, which will not be described in detail.
The biological state synthesis model building module may specifically include the following units:
the information labeling unit is used for taking the synchronously acquired audio data and video data as training data and performing biological characteristic parameter labeling and category labeling on the video data; performing voice parameter marking on the audio data; the biological characteristic parameters and the voice parameters comprise duration parameters determined based on the duration model;
the feature extraction unit is used for respectively extracting voice parameters of audio data and biological feature parameters of video data in the training data;
and the training unit is used for training to obtain a biological state synthesis model by utilizing the voice parameters, the biological characteristic parameters and the labeling information.
It should be noted that the biological state synthesis model may be a model for a single local biological state, such as a lip model, an eye position model, etc.; it may also be a model for a plurality of local biological states, such as a facial expression model.
According to the virtual anchor implementation apparatus provided by the embodiment of the invention, a voice synthesis model and a biological state synthesis model constructed in advance based on the same duration model are used: after an input text is received, a voice sequence and a virtual anchor image sequence corresponding to the input text are respectively obtained, and the voice sequence and the virtual anchor image sequence are then synchronously superposed to obtain virtual anchor audio and video data. Because the same duration model is adopted, the voice and the virtual anchor image state are guaranteed to stay in correspondence and to better match the state of a real person, so the picture is natural and smooth and the visual effect is improved.
Fig. 3 is another block diagram of the virtual anchor implementation apparatus according to the present invention.
Unlike the embodiment shown in fig. 2, in this embodiment, the apparatus further includes:
the background image acquisition module 301 is configured to pre-record a background image sequence, such as a virtual anchor head motion image sequence, a virtual anchor hand motion image sequence, and the like.
Accordingly, in this embodiment, the overlay processing module 205 can overlay the voice sequence, the background image sequence, and the virtual anchor image sequence synchronously.
By superposing the background images, the vividness of the virtual anchor image is further increased, and the visual effect is improved.
Fig. 4 is another block diagram of the virtual anchor implementation apparatus according to the present invention.
Unlike the embodiment shown in fig. 2, in this embodiment, the apparatus further includes:
the image processing module 401 is configured to obtain a picture of an object corresponding to a virtual anchor, and extract a specific biological region in the picture to obtain a specific biological region image and an extracted image;
accordingly, in this embodiment, the image synthesis module 204 may include the following units:
a specific biological state image generation unit for obtaining a virtual anchor specific biological state image sequence corresponding to the input text by using the biological state synthesis model;
and the image superposition unit is used for superposing the matted image onto each image in the virtual anchor specific biological state image sequence to obtain a virtual anchor image sequence corresponding to the input text.
The above-mentioned picture processing module 401 can also be applied to the above-mentioned embodiment shown in fig. 3.
The virtual anchor implementation apparatus provided by this embodiment can ensure that the voice corresponds to the virtual anchor image state and better matches the state of a real person, so the picture is more natural and smooth; and by matting out the local image, the amount of data in the image synthesis process is greatly reduced and the processing speed is improved, so the scheme of the invention can implement a virtual anchor both for offline text and for text input in real time.
Fig. 5 is a block diagram illustrating an apparatus 800 for implementing a virtual anchor, according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing element 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various classes of data to support operations at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800. The sensor assembly 814 may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, is also provided; the instructions are executable by the processor 820 of the device 800 to perform the virtual anchor implementation method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
The present invention also provides a non-transitory computer readable storage medium having instructions which, when executed by a processor of a mobile terminal, enable the mobile terminal to perform all or part of the steps of the above-described method embodiments of the present invention.
Fig. 6 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1900, which may vary widely in configuration or performance, may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) that store applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (18)
1. A virtual anchor implementation method, the method comprising:
receiving an input text;
obtaining a voice sequence corresponding to the input text by using a pre-constructed voice synthesis model, and obtaining a virtual anchor image sequence corresponding to the input text by using a pre-constructed biological state synthesis model, wherein the biological state synthesis model and the voice synthesis model are constructed on the basis of the same duration model; synchronously superposing the voice sequence and the virtual anchor image sequence to obtain virtual anchor audio and video data;
the method further comprises the following steps:
constructing the biological state synthesis model by using the collected audio data and video data;
the method for constructing the biological state synthesis model by using the collected audio data and video data comprises the following steps:
the audio data and the video data which are synchronously collected are used as training data, biological characteristic parameter marking and category marking are carried out on the video data, and voice parameter marking is carried out on the audio data; the biological characteristic parameters and the voice parameters comprise duration parameters determined based on the duration model;
respectively extracting voice parameters of audio data and biological characteristic parameters of video data in the training data;
and training to obtain a biological state synthesis model by using the voice parameters, the biological characteristic parameters and the labeling information.
2. The method of claim 1, further comprising:
and constructing the duration model and the voice synthesis model by using the collected audio data and video data.
3. The method of claim 2, wherein the audio data and the video data comprise: synchronously recorded audio data and video data of the object corresponding to the virtual anchor.
4. The method of claim 3, wherein the audio data further comprises:
pure audio data of the object corresponding to the virtual anchor.
5. The method of claim 1, wherein the biological state synthesis model comprises: a lip model and/or an eye position model.
6. The method of claim 1, further comprising:
acquiring a picture of the object corresponding to the virtual anchor;
matting out a specific biological region in the picture to obtain a specific biological region image and a matted image;
the obtaining of the virtual anchor image sequence corresponding to the input text using the biological state synthesis model comprises:
obtaining a virtual anchor specific biological state image sequence corresponding to the input text by using the biological state synthesis model;
and superposing the matted image onto each image in the virtual anchor specific biological state image sequence to obtain a virtual anchor image sequence corresponding to the input text.
7. The method according to any one of claims 1 to 6, further comprising:
pre-recording a background image sequence;
the step of synchronously overlaying the voice sequence and the virtual anchor image sequence comprises:
and synchronously superposing the voice sequence, the background image sequence and the virtual anchor image sequence.
8. The method of claim 7, wherein the background image sequence comprises at least any one of:
a virtual anchor head action image sequence;
a virtual anchor hand motion image sequence.
9. An apparatus for implementing a virtual anchor, the apparatus comprising:
the receiving module is used for receiving an input text;
the voice synthesis module is used for obtaining a voice sequence corresponding to the input text by utilizing a pre-constructed voice synthesis model;
the image synthesis module is used for obtaining a virtual anchor image sequence corresponding to the input text by utilizing a pre-constructed biological state synthesis model; the biological state synthesis model and the voice synthesis model are constructed on the basis of the same duration model;
the superposition processing module is used for synchronously superposing the voice sequence and the virtual anchor image sequence to obtain virtual anchor audio and video data;
the device further comprises:
the model building module is used for building the biological state synthesis model by utilizing the collected audio data and video data;
the model building module comprises:
the biological state synthesis model building module is used for building a biological state synthesis model based on the duration model;
the biological state synthesis model construction module comprises:
the information labeling unit is used for taking the synchronously acquired audio data and video data as training data and performing biological characteristic parameter labeling and category labeling on the video data; performing voice parameter marking on the audio data; the biological characteristic parameters and the voice parameters comprise duration parameters determined based on the duration model;
the feature extraction unit is used for respectively extracting voice parameters of audio data and biological feature parameters of video data in the training data;
and the training unit is used for training to obtain a biological state synthesis model by utilizing the voice parameters, the biological characteristic parameters and the labeling information.
10. The apparatus of claim 9, wherein the model building module is further configured to build the duration model and the speech synthesis model using the collected audio data and video data;
the model building module further comprises:
the data acquisition module is used for acquiring audio data and video data;
the duration model building module is used for building a duration model;
and the voice synthesis model building module is used for building a voice synthesis model based on the duration model.
11. The apparatus of claim 10, wherein the audio data and the video data comprise: synchronously recorded audio data and video data of the object corresponding to the virtual anchor.
12. The apparatus of claim 11, wherein the audio data further comprises:
pure audio data of the object corresponding to the virtual anchor.
13. The apparatus of claim 9, wherein the biological state synthesis model comprises: a lip model and/or an eye position model.
14. The apparatus of claim 9, further comprising:
the picture processing module is used for acquiring a picture of the object corresponding to the virtual anchor and matting out a specific biological region in the picture to obtain a specific biological region image and a matted image;
the image synthesis module includes:
a specific biological state image generation unit for obtaining a virtual anchor specific biological state image sequence corresponding to the input text by using the biological state synthesis model;
and the image superposition unit is used for superposing the matted image onto each image in the virtual anchor specific biological state image sequence to obtain a virtual anchor image sequence corresponding to the input text.
15. The apparatus of any one of claims 9 to 14, further comprising:
the background image acquisition module is used for prerecording a background image sequence;
and the superposition processing module is used for synchronously superposing the voice sequence, the background image sequence and the virtual anchor image sequence.
16. The apparatus of claim 15, wherein the sequence of background images comprises at least any one of:
a virtual anchor head action image sequence;
a virtual anchor hand motion image sequence.
17. An electronic device, comprising: one or more processors, memory;
the memory is for storing computer-executable instructions, and the processor is for executing the computer-executable instructions to implement the method of any one of claims 1 to 8.
18. A readable storage medium having stored thereon instructions that are executed to implement the method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811320949.4A (CN109637518B) | 2018-11-07 | 2018-11-07 | Virtual anchor implementation method and device
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811320949.4A (CN109637518B) | 2018-11-07 | 2018-11-07 | Virtual anchor implementation method and device
Publications (2)
Publication Number | Publication Date |
---|---|
CN109637518A CN109637518A (en) | 2019-04-16 |
CN109637518B (en) | 2022-05-24 |
Family
ID=66067462
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811320949.4A | CN109637518B (en), Active | 2018-11-07 | 2018-11-07 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109637518B (en) |
Families Citing this family (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110347867B (en) * | 2019-07-16 | 2022-04-19 | 北京百度网讯科技有限公司 | Method and device for generating lip motion video |
CN110493613B (en) * | 2019-08-16 | 2020-05-19 | 江苏遨信科技有限公司 | Video lip synchronization synthesis method and system |
CN110534085B (en) * | 2019-08-29 | 2022-02-25 | 北京百度网讯科技有限公司 | Method and apparatus for generating information |
CN111050187B (en) * | 2019-12-09 | 2020-12-15 | 腾讯科技(深圳)有限公司 | Virtual video processing method, device and storage medium |
CN110913259A (en) * | 2019-12-11 | 2020-03-24 | 百度在线网络技术(北京)有限公司 | Video playing method and device, electronic equipment and medium |
CN111010589B (en) * | 2019-12-19 | 2022-02-25 | 腾讯科技(深圳)有限公司 | Live broadcast method, device, equipment and storage medium based on artificial intelligence |
CN111010586B (en) * | 2019-12-19 | 2021-03-19 | 腾讯科技(深圳)有限公司 | Live broadcast method, device, equipment and storage medium based on artificial intelligence |
CN111369967B (en) * | 2020-03-11 | 2021-03-05 | 北京字节跳动网络技术有限公司 | Virtual character-based voice synthesis method, device, medium and equipment |
CN111508467A (en) * | 2020-04-13 | 2020-08-07 | 湖南声广信息科技有限公司 | Audio splicing method for host of music broadcasting station |
CN113689879B (en) * | 2020-05-18 | 2024-05-14 | 北京搜狗科技发展有限公司 | Method, device, electronic equipment and medium for driving virtual person in real time |
CN111883107B (en) * | 2020-08-03 | 2022-09-16 | 北京字节跳动网络技术有限公司 | Speech synthesis and feature extraction model training method, device, medium and equipment |
CN112002005A (en) * | 2020-08-25 | 2020-11-27 | 成都威爱新经济技术研究院有限公司 | Cloud-based remote virtual collaborative host method |
CN112233210B (en) * | 2020-09-14 | 2024-06-07 | 北京百度网讯科技有限公司 | Method, apparatus, device and computer storage medium for generating virtual character video |
CN112820265B (en) * | 2020-09-14 | 2023-12-08 | 腾讯科技(深圳)有限公司 | Speech synthesis model training method and related device |
CN112333179B (en) * | 2020-10-30 | 2023-11-10 | 腾讯科技(深圳)有限公司 | Live broadcast method, device and equipment of virtual video and readable storage medium |
CN112420014A (en) * | 2020-11-17 | 2021-02-26 | 平安科技(深圳)有限公司 | Virtual face construction method and device, computer equipment and computer readable medium |
CN112560622B (en) * | 2020-12-08 | 2023-07-21 | 中国联合网络通信集团有限公司 | Virtual object action control method and device and electronic equipment |
CN112633110B (en) * | 2020-12-16 | 2024-02-13 | 中国联合网络通信集团有限公司 | Data processing method and device |
CN112770062B (en) * | 2020-12-22 | 2024-03-08 | 北京奇艺世纪科技有限公司 | Image generation method and device |
CN112887747B (en) * | 2021-01-25 | 2023-09-12 | 百果园技术(新加坡)有限公司 | Live broadcasting room control method and device and electronic equipment |
CN113570686A (en) * | 2021-02-07 | 2021-10-29 | 腾讯科技(深圳)有限公司 | Virtual video live broadcast processing method and device, storage medium and electronic equipment |
CN113178206B (en) * | 2021-04-22 | 2022-05-31 | 内蒙古大学 | AI (Artificial intelligence) composite anchor generation method, electronic equipment and readable storage medium |
CN113642394B (en) * | 2021-07-07 | 2024-06-11 | 北京搜狗科技发展有限公司 | Method, device and medium for processing actions of virtual object |
CN113891150B (en) * | 2021-09-24 | 2024-10-11 | 北京搜狗科技发展有限公司 | Video processing method, device and medium |
CN114630144B (en) * | 2022-03-03 | 2024-10-01 | 广州方硅信息技术有限公司 | Audio replacement method, system, device, computer equipment and storage medium in live broadcasting room |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8224652B2 (en) * | 2008-09-26 | 2012-07-17 | Microsoft Corporation | Speech and text driven HMM-based body animation synthesis |
- 2018-11-07: CN application CN201811320949.4A filed, granted as patent CN109637518B (status: Active)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103218842A (en) * | 2013-03-12 | 2013-07-24 | 西南交通大学 | Voice synchronous-drive three-dimensional face mouth shape and face posture animation method |
CN106653052A (en) * | 2016-12-29 | 2017-05-10 | Tcl集团股份有限公司 | Virtual human face animation generation method and device |
CN107170030A (en) * | 2017-05-31 | 2017-09-15 | 珠海金山网络游戏科技有限公司 | A kind of virtual newscaster's live broadcasting method and system |
CN107277599A (en) * | 2017-05-31 | 2017-10-20 | 珠海金山网络游戏科技有限公司 | A kind of live broadcasting method of virtual reality, device and system |
Also Published As
Publication number | Publication date |
---|---|
CN109637518A (en) | 2019-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109637518B (en) | Virtual anchor implementation method and device | |
US11503377B2 (en) | Method and electronic device for processing data | |
CN109446876B (en) | Sign language information processing method and device, electronic equipment and readable storage medium | |
CN108363706B (en) | Method and device for man-machine dialogue interaction | |
US20170304735A1 (en) | Method and Apparatus for Performing Live Broadcast on Game | |
CN109429078B (en) | Video processing method and device for video processing | |
CN112199016B (en) | Image processing method, image processing device, electronic equipment and computer readable storage medium | |
WO2019153925A1 (en) | Searching method and related device | |
CN113691833B (en) | Virtual anchor face changing method and device, electronic equipment and storage medium | |
WO2022198934A1 (en) | Method and apparatus for generating video synchronized to beat of music | |
EP3340077B1 (en) | Method and apparatus for inputting expression information | |
US20210029304A1 (en) | Methods for generating video, electronic device and storage medium | |
CN104574299A (en) | Face picture processing method and device | |
CN110490164B (en) | Method, device, equipment and medium for generating virtual expression | |
EP4300431A1 (en) | Action processing method and apparatus for virtual object, and storage medium | |
CN109033423A (en) | Simultaneous interpretation caption presentation method and device, intelligent meeting method, apparatus and system | |
CN110730360A (en) | Video uploading and playing methods and devices, client equipment and storage medium | |
WO2021232875A1 (en) | Method and apparatus for driving digital person, and electronic device | |
CN111954063A (en) | Content display control method and device for video live broadcast room | |
CN110990534A (en) | Data processing method and device and data processing device | |
CN113806570A (en) | Image generation method and generation device, electronic device and storage medium | |
CN111145080B (en) | Training method of image generation model, image generation method and device | |
KR20130096983A (en) | Method and apparatus for processing video information including face | |
CN105635573B (en) | Camera visual angle regulating method and device | |
CN110636377A (en) | Video processing method, device, storage medium, terminal and server |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |