CN116630485A - Virtual image driving method, virtual image rendering method and electronic device - Google Patents

Virtual image driving method, virtual image rendering method and electronic device

Info

Publication number
CN116630485A
Authority
CN
China
Prior art keywords
sequence
voice
time period
preset time
grid
Prior art date
Legal status
Pending
Application number
CN202310536810.8A
Other languages
Chinese (zh)
Inventor
张隆昊
王中坚
张鹏
张邦
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202310536810.8A
Publication of CN116630485A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/205 3D [Three Dimensional] animation driven by audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/005 General purpose rendering architectures
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60 Control of cameras or camera modules
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/272 Means for inserting a foreground image in a background image, i.e. inlay, outlay
    • H04N2005/2726 Means for inserting a foreground image in a background image, i.e. inlay, outlay for simulating a person's appearance, e.g. hair style, glasses, clothes

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Graphics (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a driving method of an avatar, a rendering method of an avatar and an electronic device. The method comprises the following steps: collecting a voice sequence sent by an entity object in an entity scene, and obtaining a multi-view image set obtained after multi-view shooting of the entity object, wherein the voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period; generating a target grid sequence of an avatar corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing action information of the entity object in the preset time period; dynamically reconstructing based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set; and driving the avatar to execute a corresponding action within the preset time period based on the reconstructed grid sequence and the rendering parameter set. The application is applied to the fields of digital humans and virtual humans, and solves the technical problem of poor avatar display effect in the related art.

Description

Virtual image driving method, virtual image rendering method and electronic device
Technical Field
The present application relates to the fields of digital persons, virtual persons, and the like, and more particularly, to a driving method of an avatar, a rendering method of an avatar, and an electronic apparatus.
Background
At present, avatars are applied in fields such as live streaming and chat, and are typically driven by key points generated from 2D images of entity objects. However, current avatars have various problems, such as the inability to restore the real appearance, unnatural expressions and inaccurate speaking mouth shapes, so that the display effect of the avatar is poor.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the application provides a driving method of an avatar, a rendering method of the avatar and electronic equipment, which are used for at least solving the technical problem of poor display effect of the avatar in the related technology.
According to an aspect of an embodiment of the present application, there is provided a driving method of an avatar, including: collecting a voice sequence sent by an entity object in an entity scene, and obtaining a multi-view image set obtained after multi-view shooting of the entity object, wherein the voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period; generating a target grid sequence of an virtual image corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing action information of the entity object in a preset time period; dynamically reconstructing based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, wherein the reconstructed grid sequence comprises reconstructed grids in a preset time period, and the rendering parameter set comprises rendering parameters in the preset time period; based on the reconstructed mesh sequence and the rendering parameter set, the avatar is driven to execute a corresponding action within a preset time period.
According to an aspect of the embodiment of the present application, there is also provided a method of rendering an avatar, including: collecting a voice sequence sent by an entity object in an entity scene, and obtaining a multi-view image set obtained after multi-view shooting of the entity object, wherein the voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period; generating a target grid sequence of an virtual image corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing action information of the entity object in a preset time period; dynamically reconstructing based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, wherein the reconstructed grid sequence comprises reconstructed grids in a preset time period, and the rendering parameter set comprises rendering parameters in the preset time period; and rendering the avatar in a preset time period based on the reconstructed grid sequence and the rendering parameter set.
According to an aspect of an embodiment of the present application, there is also provided a driving method of an avatar, including: responding to an input instruction acted on an operation interface, and displaying sequence information of a voice sequence and a multi-view image set on the operation interface, wherein the voice sequence is acquired by acquiring an entity object in an entity scene, the multi-view image set is acquired by shooting the entity object in a multi-view mode, the voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period; and responding to a driving instruction acting on the operation interface, and displaying an avatar on the operation interface, wherein the avatar is driven in a preset time period based on a reconstruction grid sequence and a rendering parameter set, the reconstruction grid sequence and the rendering parameter set are obtained by dynamic reconstruction based on a multi-view image set and a target grid sequence of the avatar, the target grid sequence is generated based on a voice sequence, the target grid sequence is used for representing action information of an entity object in the preset time period, the reconstruction grid sequence comprises reconstruction grids in the preset time period, and the rendering parameter set comprises rendering parameters in the preset time period.
According to an aspect of an embodiment of the present application, there is also provided a driving method of an avatar, including: driving a Virtual Reality (VR) device or an Augmented Reality (AR) device to acquire a voice sequence sent by an entity object in an entity scene, and displaying a multi-view image set on a display picture, wherein the voice sequence comprises voice data in a preset time period, the multi-view image set is obtained after the entity object is subjected to multi-view shooting, and the multi-view image set comprises images captured in the preset time period; generating a target grid sequence of an virtual image corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing action information of the entity object in a preset time period; dynamically reconstructing based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, wherein the reconstructed grid sequence comprises reconstructed grids in a preset time period, and the rendering parameter set comprises rendering parameters in the preset time period; based on the reconstructed grid sequence and the rendering parameter set, driving the virtual image to execute corresponding actions within a preset time period; the VR device or the AR device is driven to render the presentation avatar.
According to an aspect of an embodiment of the present application, there is also provided a driving method of an avatar, including: the method comprises the steps that a voice sequence and a multi-view image set are obtained by calling a first interface, wherein the first interface comprises a first parameter, parameter values of the first parameter are the voice sequence and the multi-view image set, the voice sequence is obtained by collecting an entity object in an entity scene, the multi-view image set is obtained by shooting the entity object at multiple angles, the voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period; generating a target grid sequence of an virtual image corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing action information of the entity object in a preset time period; dynamically reconstructing based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, wherein the reconstructed grid sequence comprises reconstructed grids in a preset time period, and the rendering parameter set comprises rendering parameters in the preset time period; based on the reconstructed grid sequence and the rendering parameter set, driving the virtual image to execute corresponding actions within a preset time period; and outputting the avatar by calling a second interface, wherein the second interface comprises a second parameter, and the parameter value of the second parameter is the avatar.
In the embodiment of the application, a voice sequence sent by an entity object in an entity scene is acquired, and a multi-view image set obtained after multi-view shooting is carried out on the entity object is acquired, wherein the voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period; generating a target grid sequence of an virtual image corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing action information of the entity object in a preset time period; dynamically reconstructing based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, wherein the reconstructed grid sequence comprises reconstructed grids in a preset time period, and the rendering parameter set comprises rendering parameters in the preset time period; based on the reconstructed grid sequence and the rendering parameter set, the virtual image is driven to execute corresponding actions within a preset time period, so that the display effect of the virtual image is improved. It is easy to note that, generating the target mesh sequence of the avatar corresponding to the entity object based on the voice sequence can make the avatar show the relevant actions of the entity object speaking, such as mouth shape, face state, etc. when the entity object speaks, the avatar is more close to the actual actions of the entity object, achieving the effect of high fidelity, and dynamically reconstructing based on the multi-view image set and the target mesh sequence can realize the effect of multi-view rendering, thereby improving the display effect of the avatar, and driving the avatar through the rendering parameter set and the reconstructed mesh sequence in the preset time period, achieving the effect of displaying the avatar corresponding to the entity object in real time, and further solving the technical problem of poor display effect of the avatar in the related art.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application, as claimed.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
fig. 1 is a schematic view of a hardware environment of a virtual reality device of a driving method of an avatar according to an embodiment of the present application;
fig. 2 is a block diagram of a computing environment of a driving method of an avatar according to an embodiment of the present application;
fig. 3 is a flowchart of a driving method of an avatar according to embodiment 1 of the present application;
fig. 4 is a diagram illustrating a driving process structure of an avatar according to an embodiment of the present application;
fig. 5 is a flowchart of a method of rendering an avatar according to embodiment 2 of the present application;
fig. 6 is a flowchart of a driving method of an avatar according to embodiment 3 of the present application;
fig. 7 is a flowchart of a driving method of an avatar according to embodiment 4 of the present application;
Fig. 8 is a flowchart of a driving method of an avatar according to embodiment 5 of the present application;
fig. 9 is a schematic view of an avatar driving apparatus according to embodiment 6 of the present application;
fig. 10 is a schematic view of an avatar rendering apparatus according to embodiment 7 of the present application;
fig. 11 is a schematic view of an avatar driving apparatus according to embodiment 8 of the present application;
fig. 12 is a schematic view of an avatar driving apparatus according to embodiment 9 of the present application;
fig. 13 is a schematic view of an avatar driving apparatus according to embodiment 10 of the present application;
fig. 14 is a block diagram of a computer terminal according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, a technical solution in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, partial terms or terminology appearing in the course of describing embodiments of the application are applicable to the following explanation:
Neural rendering (Neural Radiance Field, abbreviated as NeRF), a graphics rendering method based on deep learning and artificial intelligence techniques, which can automatically learn the geometric structure, illumination, material and other attributes of a 3D scene from image data by training a neural network, and use this information to render the 3D scene into 2D images;
Transformer, a classical natural language processing model that employs an attention mechanism to handle relationships between input and output sequences; the model is suitable for various natural language processing tasks such as machine translation, text classification, named entity recognition, etc.;
Variational Auto-Encoder (VAE), a generative neural network capable of mapping input data into a low-dimensional space and reconstructing the original data from the low-dimensional representation.
Example 1
According to an embodiment of the present application, there is provided a driving method of an avatar, it is to be noted that the steps shown in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in an order different from that herein.
Fig. 1 is a schematic view of a hardware environment of a virtual reality device according to a driving method of an avatar according to an embodiment of the present application. As shown in fig. 1, the virtual reality device 104 is connected to the terminal 106, and the terminal 106 is connected to the server 102 via a network; the form of the virtual reality device 104 is not limited; the terminal 106 is not limited to a PC, a mobile phone, a tablet computer, etc.; the server 102 may be a server corresponding to a media file operator; and the network includes, but is not limited to, a wide area network, a metropolitan area network, or a local area network.
Optionally, the virtual reality device 104 of this embodiment includes: memory, processor, and transmission means. The memory is used to store an application program that can be used to perform: collecting a voice sequence sent by an entity object in an entity scene, and obtaining a multi-view image set obtained after multi-view shooting of the entity object, wherein the voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period; generating a target grid sequence of an virtual image corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing action information of the entity object in a preset time period; dynamically reconstructing based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, wherein the reconstructed grid sequence comprises reconstructed grids in a preset time period, and the rendering parameter set comprises rendering parameters in the preset time period; based on the reconstructed grid sequence and the rendering parameter set, the virtual image is driven to execute corresponding actions within a preset time period, so that the technical problem of poor display effect of the virtual image in the related technology is solved.
The terminal of this embodiment may be configured to display, on a presentation screen of a Virtual Reality (VR) device or an Augmented Reality (AR) device, an avatar that executes a corresponding action within a preset time period, and to send the avatar to the virtual reality device 104, which displays the avatar at a target delivery location after receiving it.
Optionally, the HMD (Head Mount Display, head mounted display) head display and the eye tracking module of the virtual reality device 104 of this embodiment have the same functions as those of the above embodiment, that is, a screen in the HMD head display is used for displaying a real-time picture, and the eye tracking module in the HMD is used for acquiring a real-time motion track of an eyeball of a user. The terminal of the embodiment obtains the position information and the motion information of the user in the real three-dimensional space through the tracking system, and calculates the three-dimensional coordinates of the head of the user in the virtual three-dimensional space and the visual field orientation of the user in the virtual three-dimensional space.
The hardware architecture block diagram shown in fig. 1 may be used not only as an exemplary block diagram for an AR/VR device (or mobile device) as described above, but also as an exemplary block diagram for a server as described above, and in an alternative embodiment, fig. 2 shows in block diagram form one embodiment of a computing node in a computing environment 201 using an AR/VR device (or mobile device) as described above in fig. 1. Fig. 2 is a block diagram of a computing environment of a driving method of an avatar according to an embodiment of the present application, and as shown in fig. 2, the computing environment 201 includes a plurality of computing nodes (e.g., servers) running on a distributed network (shown by 210-1, 210-2, …). Different computing nodes contain local processing and memory resources and end user 202 may run applications or store data remotely in computing environment 201. The application may be provided as a plurality of services 220-1, 220-2, 220-3, and 220-4 in computing environment 201, representing services "A", "D", "E", and "H", respectively.
End user 202 may provide and access services through a web browser or other software application on a client, in some embodiments, provisioning and/or requests of end user 202 may be provided to portal gateway 230. Ingress gateway 230 may include a corresponding agent to handle provisioning and/or request for services (one or more services provided in computing environment 201).
Services are provided or deployed in accordance with various virtualization techniques supported by the computing environment 201. In some embodiments, services may be provided according to Virtual Machine (VM) based virtualization, container based virtualization, and/or the like. Virtual machine-based virtualization may be the emulation of a real computer by initializing a virtual machine, executing programs and applications without directly touching any real hardware resources. While the virtual machine virtualizes the machine, according to container-based virtualization, a container may be started to virtualize the entire Operating System (OS) so that multiple workloads may run on a single Operating System instance.
In one embodiment based on container virtualization, several containers of a service may be assembled into one Pod (e.g., a Kubernetes Pod). For example, as shown in FIG. 2, the service 220-2 may be equipped with one or more Pods 240-1, 240-2, …, 240-N (collectively referred to as Pods). A Pod may include an agent 245 and one or more containers 242-1, 242-2, …, 242-M (collectively referred to as containers). One or more containers in the Pod handle requests related to one or more corresponding functions of the service, and the agent 245 generally controls network functions related to the service, such as routing, load balancing, etc. Other services may also be equipped with similar Pods.
In operation, executing a user request from end user 202 may require invoking one or more services in computing environment 201, and executing one or more functions of one service may require invoking one or more functions of another service. As shown in FIG. 2, service "A"220-1 receives a user request of end user 202 from ingress gateway 230, service "A"220-1 may invoke service "D"220-2, and service "D"220-2 may request service "E"220-3 to perform one or more functions.
The computing environment may be a cloud computing environment, and the allocation of resources is managed by a cloud service provider, allowing the development of functions without considering the implementation, adjustment or expansion of the server. The computing environment allows developers to execute code that responds to events without building or maintaining a complex infrastructure. Instead of expanding a single hardware device to handle the potential load, the service may be partitioned to a set of functions that can be automatically scaled independently.
In the above-described operation environment, the present application provides a driving method of an avatar as shown in fig. 3. It should be noted that a driving method of an avatar of this embodiment may be performed by the mobile terminal of the embodiment shown in fig. 1. Fig. 3 is a flowchart of a driving method of an avatar according to embodiment 1 of the present application. As shown in fig. 3, the method may include the steps of:
Step S302, a voice sequence sent by an entity object in an entity scene is acquired, and a multi-view image set obtained after multi-view shooting is carried out on the entity object is acquired.
The voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period.
The physical object may be a person, an animal, an object, etc. in a real environment, and is not particularly limited herein, as long as the physical object exists in the real environment.
The above-mentioned entity scene may be used to represent the real scene in which the entity object is located.
The preset time period may be a preset time period. The voice sequence sent by the entity object in the entity scene can be acquired within a preset time period, and a multi-view image set obtained after multi-view shooting of the entity object is acquired.
In an alternative embodiment, the voice sequence sent by the entity object in the entity scene may be collected by a voice collecting device, where the voice sequence may be, for example, the communication voice of the entity object, singing, or news broadcasting; the specific type of the voice sequence is not limited herein and may be collected according to the actual situation, and the voice sequence may also cover various other sounds emitted by the entity object.
In another optional embodiment, the capturing device may capture the physical object from a plurality of different perspectives to obtain a multi-view image set, where the multi-view image set may include multi-view images of a face and a head of the physical object, and optionally, the multi-view image set may further include multi-view images of a whole body image of the physical object, where specific display content of the multi-view images included in the multi-view image set is not limited, and may be set according to an actual application scenario.
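For concreteness, a minimal sketch of how these two inputs could be organised in code is shown below; the container names and fields (VoiceSequence, MultiViewImageSet, camera_poses, etc.) are illustrative assumptions rather than structures defined by the application.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VoiceSequence:
    samples: np.ndarray       # mono waveform, shape (num_samples,)
    sample_rate: int          # e.g. 16000 Hz
    duration: float           # length of the preset time period, in seconds

@dataclass
class MultiViewImageSet:
    images: np.ndarray        # shape (num_views, H, W, 3), RGB frames of the entity object
    camera_poses: np.ndarray  # shape (num_views, 4, 4), per-view camera extrinsics
    timestamps: np.ndarray    # capture times within the preset time period

# Example: two seconds of audio plus four synchronized views.
voice = VoiceSequence(samples=np.zeros(32000), sample_rate=16000, duration=2.0)
views = MultiViewImageSet(images=np.zeros((4, 512, 512, 3)),
                          camera_poses=np.tile(np.eye(4), (4, 1, 1)),
                          timestamps=np.linspace(0.0, 2.0, 4))
```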
Step S304, based on the voice sequence, generating a target grid sequence of the virtual image corresponding to the entity object.
The target grid sequence is used for representing action information of the entity object in a preset time period.
The action information of the entity object in the target mesh sequence may be an action related to a voice sequence sent by the entity object, for example, a facial action, a mouth shape, etc. of the entity object, which are not limited herein, and may be set according to an actual scene.
In an alternative embodiment, the transformation of the mouth shape in the avatar may be adjusted in combination with the voice sequence, thereby ensuring consistency of the avatar with the voice sequence, and making the presentation effect of the avatar more natural. Optionally, when the target grid sequence is generated according to the voice sequence, preprocessing can be performed according to the voice characteristics corresponding to the voice sequence, so that the target voice characteristics with better continuity and integrity are obtained, and the target grid sequence with better effect can be obtained when the preset grid of the virtual image is processed according to the target voice characteristics.
Step S306, dynamic reconstruction is carried out based on the multi-view image set and the target grid sequence, and a reconstruction grid sequence and a rendering parameter set are generated.
The reconstruction grid sequence comprises reconstruction grids in a preset time period, and the rendering parameter set comprises rendering parameters in the preset time period.
The reconstructed mesh may be a mesh of an avatar including voice information, and the rendering parameters may be parameters including details of a texture, a hairstyle, etc. of the physical object.
In an alternative embodiment, the structure used for dynamic reconstruction may be a variational auto-encoder: the multi-view image set and the target mesh sequence may be mapped into a low-dimensional space, a reconstructed mesh sequence is constructed in the low-dimensional space by means of a neural rendering method, and a rendering parameter set is obtained.
Step S308, based on the reconstructed grid sequence and the rendering parameter set, the virtual image is driven to execute corresponding actions within a preset time period.
In an alternative embodiment, the reconstructed grid sequence and the rendering parameter set can be used for driving the avatar to execute corresponding actions within a preset time period in a neural rendering mode, alternatively, different frames of the reconstructed grid sequence can be rendered based on the rendering parameter set in a neural rendering mode so as to obtain image output of the avatar, and the purpose of driving the avatar to execute the actions corresponding to the voice sequence within the preset time period can be achieved by outputting multi-frame images.
In the interaction scene, the voice sequence and the multi-view image set can be input into an input box in the display interface, the voice sequence and the multi-view image set can be subjected to subsequent processing in the background, and the virtual image is driven to execute corresponding actions in a preset time period in an output box of the display interface, so that a user can conveniently check the effect of executing actions by the virtual image.
In the virtual anchor scene, the real anchor can send out voice behind the curtain, drive the virtual anchor image to be displayed for a user to watch, collect the voice sequence sent by the real anchor first, obtain the multi-view image set obtained after multi-view shooting the face of the real anchor, generate the target grid sequence of the virtual anchor image corresponding to the real anchor according to the voice sequence, dynamically reconstruct according to the multi-view image set and the target grid sequence, generate the reconstructed grid sequence and the rendering parameter set, drive the virtual anchor image to execute corresponding actions within a preset time period according to the reconstructed grid sequence and the rendering parameter set, display the virtual anchor to the user, and simultaneously output the sound of the real anchor.
In the virtual live broadcast scene, a real anchor can send out voice after a curtain, a virtual anchor image is driven to be displayed on a live broadcast interface, a voice sequence sent by the real anchor can be acquired first, a multi-view image set obtained after multi-view shooting is carried out on the face of the real anchor is acquired, a target grid sequence of the virtual anchor image corresponding to the real anchor can be generated according to the voice sequence, dynamic reconstruction can be carried out according to the multi-view image set and the target grid sequence, a reconstruction grid sequence and a rendering parameter set can be generated, the virtual anchor image can be driven to execute corresponding actions in a preset time period according to the reconstruction grid sequence and the rendering parameter set, and the virtual anchor is displayed to a user in the virtual live broadcast scene.
And when the multi-view images acquired in real time are input into the input frame of the display interface, the voice sequences sent out by the entity objects are acquired in real time, so that a multi-view image set corresponding to the voice sequences is obtained, the voice sequences and the multi-view image set are subjected to subsequent processing in the background, and the actions corresponding to the entity objects can be executed by driving the virtual images in real time in the output frame of the display interface.
Through the steps, a voice sequence sent by an entity object in an entity scene is acquired, and a multi-view image set obtained after multi-view shooting is carried out on the entity object is acquired, wherein the voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period; generating a target grid sequence of an virtual image corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing action information of the entity object in a preset time period; dynamically reconstructing based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, wherein the reconstructed grid sequence comprises reconstructed grids in a preset time period, and the rendering parameter set comprises rendering parameters in the preset time period; based on the reconstructed grid sequence and the rendering parameter set, the virtual image is driven to execute corresponding actions within a preset time period, so that the display effect of the virtual image is improved. It is easy to notice that generating the target mesh sequence of the virtual image corresponding to the entity object based on the voice sequence can make the virtual image more close to the actual action of the entity object, so as to achieve the effect of high fidelity, and the effect of multi-view rendering can be realized by dynamically reconstructing based on the multi-view image set and the target mesh sequence, so that the display effect of the virtual image can be improved, and the technical problem of poor display effect of the virtual image in the related technology is solved.
The realistic or hyper-realistic digital humans currently on the market have a number of problems: (1) the degree to which the avatar restores the entity object is low, and an uncanny valley effect exists; (2) the expression of the avatar is unnatural and the speaking mouth shape is inaccurate; (3) multi-view rendering cannot be achieved.
The framework of the application can reconstruct real face details with high fidelity based on methods such as a Transformer and NeRF, supports multi-view rendering, and has the advantages that the avatar can be directly driven by voice, the mouth shape is accurate and the expression is natural.
In the above embodiment of the present application, based on the voice sequence, generating the target mesh sequence of the avatar corresponding to the entity object includes: inputting the voice sequence and the preset grid into a motion prediction model, and acquiring an initial grid sequence output by the motion prediction model; and carrying out interpolation processing on target vertexes in the initial grid sequence to obtain the target grid sequence, wherein the target vertexes are used for representing vertexes corresponding to target parts of the entity objects.
The motion prediction model may be a Transformer, and its structure may reference FaceFormer.
The target portion may be an eye of a physical object, but is not limited thereto, and may be set according to an actual scene.
In an alternative embodiment, the motion prediction model can be utilized to perform position coding on the voice sequence and the preset grid by adopting a periodic coding strategy, an initial grid sequence is obtained through prediction, grid editing can be introduced to perform interpolation processing on target vertexes in the initial grid sequence to obtain a target grid sequence, and optionally, the grid with open eyes and the grid with closed eyes can be subjected to interpolation on the eye vertexes through grid editing to obtain the target grid sequence, so that blink control is simply and effectively realized without affecting mouth shape, and the problem of poor virtual image blink effect of voice driving in the related technology is solved.
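A minimal sketch of the eye-vertex interpolation described above, assuming the meshes are stored as per-frame vertex arrays: only the eye-region vertices are blended between the predicted (open-eye) meshes and a closed-eye reference mesh, so the speech-driven mouth shape is left untouched. The vertex count, eye-vertex indices and blink weight schedule are illustrative assumptions.

```python
import numpy as np

def apply_blink(mesh_sequence: np.ndarray,
                closed_eye_mesh: np.ndarray,
                eye_vertex_ids: np.ndarray,
                blink_weights: np.ndarray) -> np.ndarray:
    """Interpolate only the eye vertices between the predicted (open-eye) meshes
    and a closed-eye reference mesh.

    mesh_sequence:   (T, V, 3) predicted initial mesh sequence
    closed_eye_mesh: (V, 3)    reference mesh with eyes closed
    eye_vertex_ids:  (K,)      indices of vertices belonging to the eye region
    blink_weights:   (T,)      per-frame blend weight in [0, 1], 1 = fully closed
    """
    out = mesh_sequence.copy()
    open_eyes = mesh_sequence[:, eye_vertex_ids, :]       # (T, K, 3)
    closed_eyes = closed_eye_mesh[eye_vertex_ids, :]      # (K, 3)
    w = blink_weights[:, None, None]                      # broadcast to (T, 1, 1)
    out[:, eye_vertex_ids, :] = (1.0 - w) * open_eyes + w * closed_eyes
    return out

# Example: one short blink in the middle of a 30-frame sequence.
T, V = 30, 5023
seq = np.zeros((T, V, 3))
closed = np.zeros((V, 3))
eye_ids = np.arange(100, 160)
weights = np.zeros(T)
weights[14:17] = [0.5, 1.0, 0.5]   # close and reopen over three frames
target_seq = apply_blink(seq, closed, eye_ids, weights)
```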
In the above embodiment of the present application, the motion prediction model includes: the encoder module and the decoder module, wherein, input the speech sequence and the preset grid to the motion prediction model, and obtain the grid sequence that the motion prediction model outputs, include: extracting the characteristics of the voice sequence by utilizing an encoder module to obtain target voice characteristics of the voice sequence; and predicting the motion of the target voice characteristic and the preset grid by using a decoder module to obtain an initial grid sequence.
The Encoder module (Encoder) described above may be used to convert an input speech sequence into a feature form for subsequent better processing; the Decoder module (Decoder) may be configured to restore the target speech feature to the original lattice sequence form based on a predetermined lattice, thereby obtaining the initial lattice sequence.
In an alternative embodiment, the encoder module may be used to perform feature extraction on the voice sequence to obtain a target voice feature of the voice sequence, through which the avatar may be driven, the decoder module may be used to perform motion prediction on the target voice feature and the preset mesh, and by combining the target voice feature to predict the preset mesh, continuity and consistency of the mouth shape of the avatar in the initial mesh sequence may be ensured.
In the above embodiment of the present application, feature extraction is performed on a speech sequence by using an encoder module to obtain a target speech feature of the speech sequence, including: extracting features of the voice sequence to obtain initial voice features of the voice sequence; performing linear interpolation on the initial voice characteristics to obtain resampled voice characteristics; performing feature conversion on the resampled voice features to obtain converted voice features; and linearly projecting the converted voice characteristics to obtain target voice characteristics.
In an alternative embodiment, feature extraction may be performed on the voice sequence to obtain an initial voice feature of the voice sequence, in order to ensure continuity of the voice feature, linear interpolation may be performed on the initial voice feature to obtain a resampled voice feature with better continuity, feature conversion may be performed on the resampled voice feature to obtain a more complete converted voice feature, linear projection may be performed on the converted voice feature, and the converted voice feature may be projected to other dimensions to obtain a target voice feature corresponding to the voice sequence, and optionally, the linear projection may be weighted and processed on the converted voice feature to obtain the target voice feature of the voice sequence.
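A minimal PyTorch sketch of this encoder path under illustrative assumptions (log-mel input features, a small Transformer encoder as the feature-conversion stage, and specific dimensions): extract initial features, resample them by linear interpolation to align with the mesh frame rate, convert them, and linearly project them to the target voice features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    def __init__(self, in_dim=80, hidden_dim=256, out_dim=128, num_layers=2):
        super().__init__()
        self.frontend = nn.Conv1d(in_dim, hidden_dim, kernel_size=3, padding=1)     # initial feature extraction
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=4, batch_first=True)
        self.converter = nn.TransformerEncoder(layer, num_layers=num_layers)        # feature conversion
        self.proj = nn.Linear(hidden_dim, out_dim)                                   # linear projection

    def forward(self, speech_feats: torch.Tensor, num_mesh_frames: int) -> torch.Tensor:
        # speech_feats: (B, T_audio, in_dim), e.g. log-mel frames of the voice sequence
        x = self.frontend(speech_feats.transpose(1, 2)).transpose(1, 2)              # initial voice features
        # resample by linear interpolation so one feature aligns with each mesh frame
        x = F.interpolate(x.transpose(1, 2), size=num_mesh_frames,
                          mode="linear", align_corners=True).transpose(1, 2)
        x = self.converter(x)                                                        # converted voice features
        return self.proj(x)                                                          # target voice features

# Example: 100 audio frames mapped to 30 mesh frames.
encoder = SpeechEncoder()
target_feats = encoder(torch.randn(1, 100, 80), num_mesh_frames=30)                 # (1, 30, 128)
```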
In the above embodiment of the present application, the motion prediction is performed on the target speech feature and the preset mesh by using the decoder module to obtain an initial mesh sequence, including: performing periodic position coding on a preset grid and a predicted grid obtained by last prediction to obtain position coding features, wherein the position coding features are used for representing time information of a grid sequence; performing self-attention processing on the position coding features to obtain self-attention features; cross attention processing is carried out on the self attention feature and the target voice feature, so that a cross attention feature is obtained; performing motion prediction on the cross attention characteristics to obtain a prediction grid obtained by the prediction; generating a grid sequence based on the prediction grid obtained by historical prediction and the prediction grid obtained by current prediction.
The above-mentioned position coding feature can be used to characterize the position information of each element in the sequence, and the periodic coding strategy refers to converting the position information into a vector representation with a periodic structure, and by performing periodic position coding on the prediction grid and the preset grid, the final initial grid sequence can be presented according to time information so as to achieve the effect of real-time display.
In an alternative embodiment, the preset grid may be predicted based on the target voice feature to obtain a first prediction grid, after the first prediction grid is obtained, when the next prediction is performed, the position coding feature of the first prediction grid may be obtained according to the first prediction grid and the preset grid, the self-attention processing may be performed on the position coding feature to obtain a relatively important feature in the position coding feature, that is, the self-attention feature may be performed on the self-attention feature based on the target voice feature, so as to obtain a cross-attention feature related to the target voice feature in the self-attention feature, and the motion prediction may be performed on the cross-attention feature to obtain a second prediction grid, and the initial grid sequence may be generated according to the first prediction grid and the second prediction grid.
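A minimal PyTorch sketch of one autoregressive decoding step along these lines: the meshes predicted so far are embedded, combined with a periodic positional encoding, passed through self-attention, fused with the target voice features via cross-attention, and projected to the next predicted grid. The vertex count, feature size, encoding period and single-layer structure are illustrative assumptions, not the disclosed model.

```python
import torch
import torch.nn as nn

class MeshDecoderStep(nn.Module):
    def __init__(self, num_vertices=5023, feat_dim=128, period=30):
        super().__init__()
        self.period = period
        self.embed = nn.Linear(num_vertices * 3, feat_dim)
        self.self_attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(feat_dim, num_vertices * 3)        # motion prediction head

    def periodic_encoding(self, length: int, dim: int) -> torch.Tensor:
        # periodic positional encoding: frame positions are taken modulo a fixed period
        pos = torch.arange(length, dtype=torch.float32) % self.period
        i = torch.arange(dim, dtype=torch.float32)
        angles = pos[:, None] / torch.pow(10000.0, 2 * (i // 2) / dim)[None, :]
        return torch.where(i % 2 == 0, torch.sin(angles), torch.cos(angles))

    def forward(self, prev_meshes, template, speech_feats):
        # prev_meshes: (B, T, V, 3) meshes predicted so far (starts from the preset mesh)
        # template:    (B, V, 3) preset mesh; speech_feats: (B, T_speech, feat_dim)
        B, T, V, _ = prev_meshes.shape
        offsets = (prev_meshes - template[:, None]).reshape(B, T, V * 3)
        x = self.embed(offsets)
        x = x + self.periodic_encoding(T, x.size(-1))            # position coding features
        x, _ = self.self_attn(x, x, x)                           # self-attention features
        x, _ = self.cross_attn(x, speech_feats, speech_feats)    # cross-attention with speech
        next_offsets = self.head(x[:, -1])                       # predict the next frame
        return template + next_offsets.reshape(B, V, 3)

# Example: autoregressively grow three frames from the preset mesh.
decoder = MeshDecoderStep()
template = torch.zeros(1, 5023, 3)
speech_feats = torch.randn(1, 30, 128)
meshes = template[:, None]                                       # (1, 1, V, 3)
for _ in range(3):
    next_mesh = decoder(meshes, template, speech_feats)
    meshes = torch.cat([meshes, next_mesh[:, None]], dim=1)
```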
Fig. 4 is a diagram illustrating a driving process of an avatar according to an embodiment of the present application. As shown in fig. 4, a preset grid and a voice sequence may be input, where the preset grid may represent a standard gesture and expression; the voice sequence and the preset grid may be processed by a Transformer to predict a target grid sequence of the avatar; the multi-view image set and the target grid sequence of the avatar may be input into an encoder for dynamic reconstruction, and the reconstructed grid sequence and a rendering parameter set are output through a decoder; the reconstructed grid sequence is neural-rendered based on the rendering parameter set to improve the authenticity of the avatar. As shown in the graph after the decoder in fig. 4, different blocks are used to represent different views, and points in the graph are used to represent voxel points to be rendered; voxel points in the reconstructed grid sequence may be rendered based on the rendering parameter set in a neural rendering manner, thereby obtaining the avatar, and the avatar is driven to execute a corresponding action within a preset time period.
The application uses a Transformer to predict the grid sequence, which ensures that each prediction step in the processing is correlated and guarantees continuity and consistency of the mouth shape; explicit control of blinking during speech can be realized by editing the grid sequence; and a neural rendering method can be adopted to replace 2D image generation and traditional rendering, where neural rendering can realize multi-view rendering and photo-realistic face generation.
It should be noted that, the above embodiment is used to describe the periodic prediction process according to the prediction grid obtained by two predictions, but the number of predictions in actual prediction is not limited to the above two, and only an example is described here, and in a specific implementation process, the above embodiment may be executed according to the actual situation.
In the above embodiment of the present application, performing dynamic reconstruction based on a multi-view image set and a target mesh sequence to generate a reconstructed mesh sequence and a rendering parameter set, includes: performing feature extraction on the multi-view image set and the target grid sequence by using a coding network to obtain image features and second grid features; and dynamically reconstructing the image features and the second grid features by using a decoding network to generate a reconstructed grid sequence and a rendering parameter set.
The encoding network may be a variational auto-encoder, and the decoding network may be a decoder.
The multi-view image set may be a group of static multi-view three-channel images, which are not limited herein, and may be other image sets.
The image features described above may be used to represent presentation details of a physical object, including but not limited to textures, hairstyles, and the like.
The second mesh feature described above may be used to represent a feature that drives the avatar.
In an alternative embodiment, the encoding network may be used to map the multi-view image set and the target grid sequence into a low-dimensional space, and feature extraction is performed in the low-dimensional space, so as to obtain the image feature and the second grid feature; the decoding network can be utilized to reconstruct the image features in three dimensions so as to obtain a grid sequence corresponding to the entity object, the decoding network can be utilized to predict the second grid features so as to obtain a rendering parameter set, and the grid sequence is rendered through the rendering parameter set so as to obtain the final virtual image.
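A minimal sketch of the encoding side of this dynamic reconstruction, with an assumed CNN image branch, an MLP mesh branch and VAE-style mean/log-variance heads; the dimensions and aggregation choices are illustrative, not the disclosed architecture.

```python
import torch
import torch.nn as nn

class DynamicReconEncoder(nn.Module):
    """Maps the multi-view image set and the target mesh sequence into a
    low-dimensional space, producing image features and the second mesh features."""
    def __init__(self, num_vertices=5023, latent_dim=256):
        super().__init__()
        self.image_net = nn.Sequential(                           # per-view image feature extractor
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim))
        self.mesh_net = nn.Sequential(                            # per-frame mesh feature extractor
            nn.Linear(num_vertices * 3, 512), nn.ReLU(),
            nn.Linear(512, latent_dim))
        self.mu = nn.Linear(2 * latent_dim, latent_dim)           # VAE mean head
        self.logvar = nn.Linear(2 * latent_dim, latent_dim)       # VAE log-variance head

    def forward(self, views, meshes):
        # views: (num_views, 3, H, W) multi-view image set; meshes: (T, V, 3) target mesh sequence
        image_feat = self.image_net(views).mean(dim=0)            # image features (aggregated over views)
        mesh_feat = self.mesh_net(meshes.flatten(1)).mean(dim=0)  # second mesh features (aggregated over frames)
        h = torch.cat([image_feat, mesh_feat], dim=-1)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterised low-dimensional code
        return z, image_feat, mesh_feat

# Example: four views and a 30-frame target mesh sequence.
encoder = DynamicReconEncoder()
z, image_feat, mesh_feat = encoder(torch.randn(4, 3, 128, 128), torch.randn(30, 5023, 3))
```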
In the above embodiment of the present application, dynamically reconstructing the image feature and the second mesh feature by using the decoding network to generate a reconstructed mesh sequence and a rendering parameter set, including: performing three-dimensional reconstruction on the image features to generate a reconstruction grid sequence; a set of rendering parameters is generated based on the reconstructed mesh sequence and the second mesh feature.
In an alternative embodiment, the two-dimensional image features can be reconstructed into a three-dimensional reconstruction grid sequence, and the second grid features can be dynamically reconstructed on the basis of the reconstruction grid sequence, so that the second grid features can be attached to the reconstruction grid sequence to obtain a rendering parameter set which is more in line with the reconstruction grid sequence, and the driving image can be more natural when the virtual image is driven subsequently.
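And a matching sketch of the decoding side: per-frame vertex offsets are decoded from the image features to form the three-dimensional reconstructed mesh sequence, and the rendering parameter set is then predicted from that sequence together with the second mesh features. The per-frame conditioning on a normalised frame index and the output sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DynamicReconDecoder(nn.Module):
    def __init__(self, num_vertices=5023, latent_dim=256, num_frames=30, render_dim=64):
        super().__init__()
        self.num_frames, self.num_vertices = num_frames, num_vertices
        # three-dimensional reconstruction: image features (+ frame index) -> per-frame vertex offsets
        self.mesh_head = nn.Sequential(
            nn.Linear(latent_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, num_vertices * 3))
        # rendering parameters predicted from the reconstructed meshes and the second mesh features
        self.render_head = nn.Sequential(
            nn.Linear(num_vertices * 3 + latent_dim, 512), nn.ReLU(),
            nn.Linear(512, render_dim))

    def forward(self, image_feat, mesh_feat, template):
        # image_feat, mesh_feat: (latent_dim,); template: (V, 3) preset mesh
        t = torch.linspace(0.0, 1.0, self.num_frames)[:, None]           # normalised frame index
        frame_in = torch.cat([image_feat[None].expand(self.num_frames, -1), t], dim=-1)
        offsets = self.mesh_head(frame_in).view(self.num_frames, self.num_vertices, 3)
        recon_meshes = template[None] + offsets                           # reconstructed mesh sequence
        per_frame = torch.cat(
            [recon_meshes.flatten(1), mesh_feat[None].expand(self.num_frames, -1)], dim=-1)
        render_params = self.render_head(per_frame)                       # rendering parameter set
        return recon_meshes, render_params

# Example, continuing from the encoder sketch above.
decoder = DynamicReconDecoder()
recon_meshes, render_params = decoder(torch.randn(256), torch.randn(256), torch.zeros(5023, 3))
```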
In the above embodiment of the present application, based on the reconstructed mesh sequence and the rendering parameter set, driving the avatar to execute the corresponding action within the preset time period includes: initializing the action of the avatar based on texture unfolding of the reconstructed mesh sequence; and adjusting the action of the avatar based on the rendering parameter set within the preset time period.
In an alternative embodiment, the avatar may be initialized based on a UV unfolding of the reconstructed mesh sequence using an MVP (Mixture of Volumetric Primitives) style architecture: UV unfolding is performed according to the reconstructed mesh sequence, and voxel units (voxels) in space are then initialized using the texture information; further, the position, orientation, size and other properties of the avatar can be adjusted according to the predicted rendering parameter set, and multiple frames of the avatar can then be displayed and output in a neural rendering manner.
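A highly simplified sketch of this driving step, assuming an MVP-style setup in which voxel primitives are initialised from the UV-unfolded texture of the reconstructed mesh and then placed per frame according to the predicted rendering parameters; the orthographic splatting here is only a crude stand-in for the actual volumetric neural rendering, and all shapes and parameter layouts are illustrative.

```python
import numpy as np

def init_voxels_from_uv(texture: np.ndarray, uv_coords: np.ndarray) -> np.ndarray:
    """Initialise per-voxel colours by sampling the UV-unfolded texture of the
    reconstructed mesh (one voxel primitive per vertex in this simplified sketch)."""
    h, w, _ = texture.shape
    px = (uv_coords[:, 0] * (w - 1)).astype(int)
    py = (uv_coords[:, 1] * (h - 1)).astype(int)
    return texture[py, px]                                    # (num_voxels, 3) RGB colours

def drive_frame(voxel_centers, voxel_colors, render_params, image_size=256):
    """Place the voxel primitives for one frame according to the predicted rendering
    parameters (here: 3 translation values + 1 scale) and splat them orthographically
    into an image; a crude stand-in for the volumetric neural rendering step."""
    translation, scale = render_params[:3], render_params[3]
    centers = voxel_centers * scale + translation
    image = np.zeros((image_size, image_size, 3))
    xy = ((centers[:, :2] + 1.0) * 0.5 * (image_size - 1)).astype(int)
    xy = np.clip(xy, 0, image_size - 1)
    image[xy[:, 1], xy[:, 0]] = voxel_colors
    return image

# Example: drive one preset time period of 30 frames.
texture = np.random.rand(512, 512, 3)                         # UV-unfolded texture of the reconstructed mesh
uv_coords = np.random.rand(5023, 2)
voxel_centers = np.random.rand(5023, 3) * 2.0 - 1.0
voxel_colors = init_voxels_from_uv(texture, uv_coords)
frames = [drive_frame(voxel_centers, voxel_colors, params)
          for params in np.random.rand(30, 4)]                # one rendering-parameter vector per frame
```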
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus a necessary general hardware platform, but that it may also be implemented by means of hardware. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method of the various embodiments of the present application.
Example 2
There is also provided, in accordance with an embodiment of the present application, a method of rendering an avatar, it being noted that the steps shown in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in an order different from that herein.
Fig. 5 is a flowchart of a method of rendering an avatar according to embodiment 2 of the present application, as shown in fig. 5, the method including the steps of:
step S502, a voice sequence sent by an entity object in an entity scene is acquired, and a multi-view image set obtained after multi-view shooting is carried out on the entity object is acquired.
The voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period.
Step S504, based on the voice sequence, generating a target grid sequence of the virtual image corresponding to the entity object.
The target grid sequence is used for representing action information of the entity object in a preset time period.
Step S506, dynamic reconstruction is carried out based on the multi-view image set and the target grid sequence, and a reconstructed grid sequence and a rendering parameter set are generated.
The reconstruction grid sequence comprises reconstruction grids in a preset time period, and the rendering parameter set comprises rendering parameters in the preset time period.
And step S508, rendering the avatar in a preset time period based on the reconstructed grid sequence and the rendering parameter set.
Through the steps, a voice sequence sent by an entity object in an entity scene is acquired, and a multi-view image set obtained after multi-view shooting is carried out on the entity object is acquired, wherein the voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period; generating a target grid sequence of an virtual image corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing action information of the entity object in a preset time period; dynamically reconstructing based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, wherein the reconstructed grid sequence comprises reconstructed grids in a preset time period, and the rendering parameter set comprises rendering parameters in the preset time period; based on the reconstructed grid sequence and the rendering parameter set, the virtual image is rendered in a preset time period, so that the display effect of the virtual image is improved. It is easy to note that, generating the target mesh sequence of the avatar corresponding to the entity object based on the voice sequence can make the avatar show the relevant actions of the entity object speaking, such as mouth shape, face state, etc. when the entity object speaks, the avatar is more close to the actual actions of the entity object, achieving the effect of high fidelity, and dynamically reconstructing based on the multi-view image set and the target mesh sequence can realize the effect of multi-view rendering, thereby improving the display effect of the avatar, and driving the avatar through the rendering parameter set and the reconstructed mesh sequence in the preset time period, achieving the effect of displaying the avatar corresponding to the entity object in real time, and further solving the technical problem of poor display effect of the avatar in the related art.
It should be noted that, the preferred embodiment of the present application in the above examples is the same as the embodiment provided in example 1, the application scenario and the implementation process, but is not limited to the embodiment provided in example 1.
Example 3
According to an embodiment of the present application, there is also provided a method of driving an avatar. It should be noted that the steps shown in the flowchart of the drawings may be performed in a computer system, such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from the one described herein.
Fig. 6 is a flowchart of a driving method of an avatar according to embodiment 3 of the present application, and as shown in fig. 6, the method includes the steps of:
Step S602, in response to an input instruction acting on the operation interface, displaying sequence information of a voice sequence and a multi-view image set on the operation interface.
The voice sequence is acquired by acquiring an entity object in an entity scene, the multi-view image set is acquired by shooting the entity object in multiple views, the voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period.
The operation interface may be a display interface capable of displaying the sequence information of the voice sequence and the multi-view image set, and the sequence information of the voice sequence and the multi-view image set may be displayed by performing a related touch operation on the display interface.
Step S604, in response to a driving instruction acting on the operation interface, displaying an avatar on the operation interface.
The virtual image is driven based on a reconstruction grid sequence and a rendering parameter set in a preset time period, the reconstruction grid sequence and the rendering parameter set are obtained by dynamic reconstruction based on a multi-view image set and a target grid sequence of the virtual image, the target grid sequence is generated based on a voice sequence, the target grid sequence is used for representing action information of an entity object in the preset time period, the reconstruction grid sequence comprises a reconstruction grid in the preset time period, and the rendering parameter set comprises rendering parameters in the preset time period.
The driving instruction may be an instruction generated by performing touch control on the operation interface when the avatar is required to be driven, and according to the driving instruction, the avatar performing a corresponding action in a preset time period based on the reconstructed mesh sequence and the rendering parameter set may be displayed on the operation interface.
In an optional embodiment, the above operation interface may be an interface that can interact with a user. The sequence information of the voice sequence and the multi-view image set may be entered into an input box on the operation interface, for example by dragging or uploading. After the inputs are provided, the background may generate the target grid sequence of the avatar corresponding to the entity object based on the voice sequence, perform dynamic reconstruction based on the multi-view image set and the target grid sequence to generate the reconstructed grid sequence and the rendering parameter set, and, based on the reconstructed grid sequence and the rendering parameter set, drive the avatar in the input box on the operation interface to perform the corresponding actions within the preset time period. In this way, the avatar can be driven to perform the corresponding actions within the preset time period according to the voice sequence and the multi-view image set through user interaction.
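As a rough illustration of this interaction flow only: the handler and widget names below are assumptions, and the application does not prescribe any particular UI framework.

```python
class OperationInterface:
    """Hypothetical operation interface: shows the inputs on an input instruction
    and plays the driven avatar on a driving instruction."""

    def __init__(self, run_pipeline, display):
        self.run_pipeline = run_pipeline   # (voice, images) -> iterable of avatar frames
        self.display = display             # widget used to show inputs and frames
        self.voice = None
        self.images = None

    def on_input_instruction(self, voice_sequence, multi_view_images):
        # Step S602: display the sequence information and the multi-view image set
        self.voice, self.images = voice_sequence, multi_view_images
        self.display.show_inputs(voice_sequence, multi_view_images)

    def on_driving_instruction(self):
        # Step S604: display the avatar driven over the preset time period
        for frame in self.run_pipeline(self.voice, self.images):
            self.display.show_frame(frame)
```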
Through the above steps, in response to an input instruction acting on the operation interface, sequence information of a voice sequence and a multi-view image set are displayed on the operation interface, where the voice sequence is collected from an entity object in an entity scene, the multi-view image set is obtained by multi-view shooting of the entity object, the voice sequence includes voice data within a preset time period, and the multi-view image set includes images captured within the preset time period. In response to a driving instruction acting on the operation interface, an avatar is displayed on the operation interface, where the avatar is driven within the preset time period based on a reconstructed grid sequence and a rendering parameter set, the reconstructed grid sequence and the rendering parameter set are obtained by dynamic reconstruction based on the multi-view image set and a target grid sequence of the avatar, the target grid sequence is generated based on the voice sequence and represents action information of the entity object within the preset time period, the reconstructed grid sequence includes reconstructed grids within the preset time period, and the rendering parameter set includes rendering parameters within the preset time period. It is easy to note that generating the target grid sequence of the avatar based on the voice sequence allows the avatar to present the actions associated with the entity object's speech, such as the mouth shape and facial state while speaking, so that the avatar stays close to the actual actions of the entity object and a high-fidelity effect is achieved. Dynamic reconstruction based on the multi-view image set and the target grid sequence enables multi-view rendering, which improves the display effect of the avatar. Driving the avatar through the rendering parameter set and the reconstructed grid sequence within the preset time period makes it possible to display the avatar corresponding to the entity object in real time, thereby solving the technical problem of the poor display effect of avatars in the related art.
It should be noted that the preferred implementation, application scenario and implementation process of this embodiment are the same as those provided in Embodiment 1, but are not limited to the scheme provided in Embodiment 1.
Example 4
According to an embodiment of the present application, there is also provided a method of driving an avatar in a virtual reality scene, which is applicable to a virtual reality (VR) device, an augmented reality (AR) device and the like. It should be noted that the steps shown in the flowchart of the drawings may be performed in a computer system, such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from the one described herein.
Fig. 7 is a flowchart of a driving method of an avatar according to embodiment 4 of the present application, and as shown in fig. 7, the method includes the steps of:
Step S702, driving the virtual reality (VR) device or the augmented reality (AR) device to collect a voice sequence uttered by an entity object in the entity scene, and displaying the multi-view image set on the presentation screen.
The voice sequence comprises voice data in a preset time period, the multi-view image set is obtained by shooting the entity object in multiple views, and the multi-view image set comprises images captured in the preset time period.
Step S704, based on the voice sequence, generating a target grid sequence of the avatar corresponding to the entity object.
The target grid sequence is used for representing action information of the entity object in a preset time period.
Step S706, dynamically reconstructing based on the multi-view image set and the target mesh sequence, and generating a reconstructed mesh sequence and a rendering parameter set.
The reconstruction grid sequence comprises reconstruction grids in a preset time period, and the rendering parameter set comprises rendering parameters in the preset time period.
Step S708, based on the reconstructed mesh sequence and the rendering parameter set, the avatar is driven to perform a corresponding action within a preset time period.
Step S710, driving the VR device or the AR device to render and present the avatar.
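A minimal device-side sketch of steps S702 to S710 follows. The device handle and its method names (capture_audio, show_images, present) are illustrative assumptions; real VR/AR SDKs expose their own interfaces.

```python
def drive_avatar_on_device(device, multi_view_images, run_pipeline, duration_s):
    # S702: collect the voice sequence through the device and show the image set
    voice_sequence = device.capture_audio(duration_s)
    device.show_images(multi_view_images)
    # S704-S708: generate meshes, reconstruct and drive the avatar (run_pipeline
    # stands for the processing described above and yields rendered frames)
    for frame in run_pipeline(voice_sequence, multi_view_images):
        # S710: render and present each frame on the device
        device.present(frame)
```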
Through the above steps, a virtual reality (VR) device or an augmented reality (AR) device is driven to collect a voice sequence uttered by an entity object in an entity scene, and a multi-view image set is displayed on the presentation screen, where the voice sequence includes voice data within a preset time period, the multi-view image set is obtained by multi-view shooting of the entity object, and the multi-view image set includes images captured within the preset time period. A target grid sequence of the avatar corresponding to the entity object is generated based on the voice sequence, where the target grid sequence represents action information of the entity object within the preset time period. Dynamic reconstruction is performed based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, where the reconstructed grid sequence includes reconstructed grids within the preset time period and the rendering parameter set includes rendering parameters within the preset time period. The avatar is driven to perform the corresponding actions within the preset time period based on the reconstructed grid sequence and the rendering parameter set, and the VR device or the AR device is driven to render and present the avatar, which improves the display effect of the avatar. It is easy to note that generating the target grid sequence of the avatar based on the voice sequence allows the avatar to present the actions associated with the entity object's speech, such as the mouth shape and facial state while speaking, so that the avatar stays close to the actual actions of the entity object and a high-fidelity effect is achieved. Dynamic reconstruction based on the multi-view image set and the target grid sequence enables multi-view rendering, which further improves the display effect of the avatar. Driving the avatar through the rendering parameter set and the reconstructed grid sequence within the preset time period makes it possible to display the avatar corresponding to the entity object in real time, thereby solving the technical problem of the poor display effect of avatars in the related art.
It should be noted that the preferred implementation, application scenario and implementation process of this embodiment are the same as those provided in Embodiment 1, but are not limited to the scheme provided in Embodiment 1.
Example 5
According to an embodiment of the present application, there is also provided a method of driving an avatar. It should be noted that the steps shown in the flowchart of the drawings may be performed in a computer system, such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from the one described herein.
Fig. 8 is a flowchart of a driving method of an avatar according to embodiment 5 of the present application, and as shown in fig. 8, the method includes the steps of:
Step S802, acquiring a voice sequence and a multi-view image set by calling a first interface.
The first interface comprises a first parameter, the parameter value of the first parameter is a voice sequence and a multi-view image set, the voice sequence is obtained by collecting an entity object in an entity scene, the multi-view image set is obtained by shooting the entity object at multiple angles, the voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period.
The first interface may be an interface for performing data interaction between the server and the client, where the client may transmit the voice sequence and the multi-view image set into an interface function as a first parameter of the interface function, so as to achieve the purpose of uploading the voice sequence and the multi-view image set to the cloud server.
Step S804, based on the voice sequence, generating a target grid sequence of the avatar corresponding to the entity object.
The target grid sequence is used for representing action information of the entity object in a preset time period.
Step S806, dynamic reconstruction is performed based on the multi-view image set and the target mesh sequence, and a reconstructed mesh sequence and a rendering parameter set are generated.
The reconstruction grid sequence comprises reconstruction grids in a preset time period, and the rendering parameter set comprises rendering parameters in the preset time period.
Step S808, based on the reconstructed grid sequence and the rendering parameter set, the virtual image is driven to execute corresponding actions within a preset time period.
Step S810, outputting the avatar by calling a second interface.
The second interface comprises a second parameter, and the parameter value of the second parameter is the avatar.
The second interface in the above step may be an interface for data exchange between the cloud server and the client. The cloud server may pass the avatar that performs the corresponding actions within the preset time period into an interface function as the second parameter of that interface function, so as to deliver to the client the avatar that performs the corresponding actions within the preset time period.
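To illustrate the two-interface exchange of steps S802 to S810, the following sketch assumes a hypothetical HTTP service on the cloud server; the endpoint path, field names and response format are illustrative only and are not defined by this application.

```python
import requests

def call_first_interface(base_url, voice_path, image_paths):
    # First parameter: the voice sequence and the multi-view image set
    files = [("voice", open(voice_path, "rb"))]
    files += [("images", open(p, "rb")) for p in image_paths]
    return requests.post(f"{base_url}/avatar/drive", files=files, timeout=600)

def call_second_interface(response):
    # Second parameter: the driven avatar, returned here as rendered video bytes
    response.raise_for_status()
    return response.content
```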
Through the above steps, a voice sequence and a multi-view image set are acquired by calling a first interface, where the first interface includes a first parameter whose value is the voice sequence and the multi-view image set, the voice sequence is collected from an entity object in an entity scene, the multi-view image set is obtained by multi-view shooting of the entity object, the voice sequence includes voice data within a preset time period, and the multi-view image set includes images captured within the preset time period. A target grid sequence of the avatar corresponding to the entity object is generated based on the voice sequence, where the target grid sequence represents action information of the entity object within the preset time period. Dynamic reconstruction is performed based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, where the reconstructed grid sequence includes reconstructed grids within the preset time period and the rendering parameter set includes rendering parameters within the preset time period. The avatar is driven to perform the corresponding actions within the preset time period based on the reconstructed grid sequence and the rendering parameter set, and the avatar is output by calling a second interface, where the second interface includes a second parameter whose value is the avatar, which improves the display effect of the avatar. It is easy to note that generating the target grid sequence of the avatar based on the voice sequence allows the avatar to present the actions associated with the entity object's speech, such as the mouth shape and facial state while speaking, so that the avatar stays close to the actual actions of the entity object and a high-fidelity effect is achieved. Dynamic reconstruction based on the multi-view image set and the target grid sequence enables multi-view rendering, which further improves the display effect of the avatar. Driving the avatar through the rendering parameter set and the reconstructed grid sequence within the preset time period makes it possible to display the avatar corresponding to the entity object in real time, thereby solving the technical problem of the poor display effect of avatars in the related art.
It should be noted that the preferred implementation, application scenario and implementation process of this embodiment are the same as those provided in Embodiment 1, but are not limited to the scheme provided in Embodiment 1.
Example 6
According to an embodiment of the present application, there is also provided an avatar driving apparatus for implementing the above avatar driving method. Fig. 9 is a schematic view of an avatar driving apparatus according to Embodiment 6 of the present application. As shown in Fig. 9, the apparatus 900 includes: an acquisition module 902, a generation module 904, a reconstruction module 906 and a driving module 908.
The acquisition module is used for acquiring a voice sequence sent by an entity object in an entity scene and acquiring a multi-view image set obtained after multi-view shooting of the entity object, wherein the voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period; the generating module is used for generating a target grid sequence of the virtual image corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing the action information of the entity object in a preset time period; the reconstruction module is used for carrying out dynamic reconstruction based on the multi-view image set and the target grid sequence to generate a reconstruction grid sequence and a rendering parameter set, wherein the reconstruction grid sequence comprises reconstruction grids in a preset time period, and the rendering parameter set comprises rendering parameters in the preset time period; the driving module is used for driving the avatar to execute corresponding actions within a preset time period based on the reconstructed grid sequence and the rendering parameter set.
It should be noted that the above acquisition module 902, generation module 904, reconstruction module 906 and driving module 908 correspond to steps S302 to S308 in Embodiment 1. The examples and application scenarios implemented by the four modules are the same as those of the corresponding steps, but are not limited to what is disclosed in Embodiment 1. It should also be noted that the above modules or units may be hardware components, or may be software components stored in a memory (for example, the memory 104) and processed by one or more processors (for example, the processors 102a, 102b, ..., 102n), and the above modules may also run as a part of the apparatus in the computer terminal 10 provided in Embodiment 1.
In the above embodiment of the present application, the generating module is further configured to input a voice sequence and a preset mesh to the motion prediction model, obtain an initial mesh sequence output by the motion prediction model, and perform interpolation processing on a target vertex in the initial mesh sequence to obtain a target mesh sequence, where the target vertex is used to represent a vertex corresponding to a target location of the entity object.
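As an illustration of the interpolation step on the target vertices (for example, the mouth-region vertices), the sketch below linearly interpolates only those vertices between consecutive predicted frames to obtain a denser, smoother target sequence; the frame layout and the choice of temporal upsampling are assumptions made for illustration.

```python
import numpy as np

def interpolate_target_vertices(initial_meshes, target_idx, factor=2):
    # initial_meshes: (T, V, 3) vertex positions output by the motion prediction model
    # target_idx: indices of the vertices belonging to the target part of the entity object
    T = initial_meshes.shape[0]
    out = np.repeat(initial_meshes, factor, axis=0)[: (T - 1) * factor + 1]
    for t in range(T - 1):
        for k in range(1, factor):
            w = k / factor  # linear blend between frame t and frame t + 1
            out[t * factor + k, target_idx] = (
                (1 - w) * initial_meshes[t, target_idx] + w * initial_meshes[t + 1, target_idx]
            )
    return out  # target mesh sequence with interpolated target vertices
```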
In the above embodiment of the present application, the generating module is further configured to perform feature extraction on the voice sequence by using the encoder module to obtain a target voice feature of the voice sequence, and perform motion prediction on the target voice feature and a preset grid by using the decoder module to obtain an initial grid sequence.
In the above embodiment of the present application, the generating module is further configured to perform feature extraction on the voice sequence to obtain an initial voice feature of the voice sequence, perform linear interpolation on the initial voice feature to obtain a resampled voice feature, perform feature conversion on the resampled voice feature to obtain a converted voice feature, and perform linear projection on the converted voice feature to obtain a target voice feature.
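A possible PyTorch sketch of this speech-feature pipeline is shown below: feature extraction, linear-interpolation resampling to the mesh frame rate, feature conversion, and a final linear projection. The layer types and dimensions are assumptions for illustration, not the model actually used by this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechFeatureEncoder(nn.Module):
    def __init__(self, in_dim=80, hidden=256, out_dim=128, mesh_fps=30):
        super().__init__()
        self.extract = nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1)   # initial features
        self.convert = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True), num_layers=2)
        self.project = nn.Linear(hidden, out_dim)                            # linear projection
        self.mesh_fps = mesh_fps

    def forward(self, speech, duration_s):
        # speech: (B, T_audio, in_dim), e.g. spectrogram frames of the voice sequence
        x = self.extract(speech.transpose(1, 2))                 # (B, hidden, T_audio)
        n_frames = int(duration_s * self.mesh_fps)
        x = F.interpolate(x, size=n_frames, mode="linear")       # resample by linear interpolation
        x = self.convert(x.transpose(1, 2))                      # feature conversion
        return self.project(x)                                   # target voice features
```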
In the above embodiment of the present application, the generating module is further configured to perform periodic position encoding on the preset grid and the predicted grid obtained by the previous prediction to obtain a position encoding feature, where the position encoding feature is used to characterize the time information of the grid sequence; perform self-attention processing on the position encoding feature to obtain a self-attention feature; perform cross-attention processing on the self-attention feature and the target voice feature to obtain a cross-attention feature; perform motion prediction on the cross-attention feature to obtain the predicted grid of the current prediction; and generate the grid sequence based on the predicted grids obtained by historical predictions and the predicted grid of the current prediction.
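The decoder step described above can be sketched as follows. The periodic sinusoidal encoding, attention dimensions and the offset-based motion head are assumptions chosen to illustrate the sequence of operations (periodic position encoding, self-attention, cross-attention with the target voice feature, motion prediction), not the exact model of this application.

```python
import torch
import torch.nn as nn

class MeshDecoderStep(nn.Module):
    def __init__(self, n_vertices, dim=128, period=30):
        super().__init__()
        self.embed = nn.Linear(n_vertices * 3, dim)
        self.self_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.motion_head = nn.Linear(dim, n_vertices * 3)
        self.period = period

    def periodic_encoding(self, length, dim):
        t = torch.arange(length).unsqueeze(1) % self.period      # periodic time index
        i = torch.arange(dim).unsqueeze(0)
        angle = t / (10000 ** (2 * (i // 2) / dim))
        return torch.where(i % 2 == 0, torch.sin(angle), torch.cos(angle)).float()

    def forward(self, prev_meshes, speech_feats):
        # prev_meshes: (B, T_prev, V, 3) preset mesh plus meshes predicted so far
        # speech_feats: (B, T_speech, dim) target voice features
        B, T, V, _ = prev_meshes.shape
        x = self.embed(prev_meshes.reshape(B, T, V * 3))
        x = x + self.periodic_encoding(T, x.shape[-1]).to(x.device)  # position encoding feature
        x, _ = self.self_attn(x, x, x)                               # self-attention feature
        x, _ = self.cross_attn(x, speech_feats, speech_feats)        # cross-attention feature
        offsets = self.motion_head(x[:, -1])                         # motion prediction
        return prev_meshes[:, -1] + offsets.reshape(B, V, 3)         # predicted mesh of this step
```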
In the above embodiment of the present application, the reconstruction module is further configured to perform feature extraction on the multi-view image set and the target mesh sequence by using the encoding network, to obtain an image feature and a second mesh feature, and dynamically reconstruct the image feature and the second mesh feature by using the decoding network, to generate a reconstructed mesh sequence and a rendering parameter set.
In the above embodiment of the present application, the reconstruction module is further configured to perform three-dimensional reconstruction on the image feature, generate a reconstructed mesh sequence, and generate a rendering parameter set based on the reconstructed mesh sequence and the second mesh feature.
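The encode/decode split described for the reconstruction module could look roughly like the following sketch. The convolutional image encoder, the mean fusion of the views and all layer sizes are illustrative assumptions; the point is only the data flow from the multi-view image set and the target grid to the reconstructed grid and the rendering parameters.

```python
import torch
import torch.nn as nn

class DynamicReconstructor(nn.Module):
    def __init__(self, n_vertices, feat_dim=256, n_params=64):
        super().__init__()
        self.image_encoder = nn.Sequential(                     # encoding network (images)
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.mesh_encoder = nn.Linear(n_vertices * 3, feat_dim)  # encoding network (mesh)
        self.mesh_decoder = nn.Linear(feat_dim, n_vertices * 3)  # 3D reconstruction
        self.param_decoder = nn.Linear(feat_dim + n_vertices * 3, n_params)

    def forward(self, views, target_meshes):
        # views: (B, N_views, 3, H, W); target_meshes: (B, V, 3), one frame at a time
        B, N, C, H, W = views.shape
        img_feat = self.image_encoder(views.reshape(B * N, C, H, W))
        img_feat = img_feat.reshape(B, N, -1).mean(dim=1)             # fuse the views
        mesh_feat = self.mesh_encoder(target_meshes.reshape(B, -1))   # second mesh feature
        recon = self.mesh_decoder(img_feat).reshape(B, -1, 3)         # reconstructed mesh
        params = self.param_decoder(torch.cat([recon.reshape(B, -1), mesh_feat], dim=-1))
        return recon, params                                          # mesh + rendering parameters
```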
In the above embodiment of the present application, the driving module is further configured to initialize the motion of the avatar based on the texture expansion of the reconstructed mesh sequence, and adjust the motion of the avatar based on the rendering parameter set within a preset time period.
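Finally, the driving loop of the driving module can be sketched as below. The avatar object, its methods and the texture-unwrapping helper are hypothetical placeholders for whatever rendering backend is used.

```python
def drive_avatar(avatar, recon_meshes, render_params, unwrap_texture):
    # Initialize the avatar's appearance from the texture expansion (UV unwrap)
    # of the first reconstructed mesh of the preset time period.
    avatar.initialize(unwrap_texture(recon_meshes[0]))
    # Adjust the avatar's action frame by frame over the preset time period.
    for mesh, params in zip(recon_meshes, render_params):
        avatar.set_pose(mesh)
        avatar.apply_render_params(params)
```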
Example 7
According to an embodiment of the present application, there is also provided an avatar rendering apparatus for implementing the above avatar rendering method. Fig. 10 is a schematic diagram of an avatar rendering apparatus according to Embodiment 7 of the present application. As shown in Fig. 10, the apparatus 1000 includes: an acquisition module 1002, a generation module 1004, a reconstruction module 1006 and a rendering module 1008.
The acquisition module is used for acquiring a voice sequence sent by an entity object in an entity scene and acquiring a multi-view image set obtained after multi-view shooting of the entity object, wherein the voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period; the generating module is used for generating a target grid sequence of the virtual image corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing the action information of the entity object in a preset time period; the reconstruction module is used for carrying out dynamic reconstruction based on the multi-view image set and the target grid sequence to generate a reconstruction grid sequence and a rendering parameter set, wherein the reconstruction grid sequence comprises reconstruction grids in a preset time period, and the rendering parameter set comprises rendering parameters in the preset time period; the rendering module is used for rendering the avatar in a preset time period based on the reconstruction grid sequence and the rendering parameter set.
It should be noted that the above acquisition module 1002, generation module 1004, reconstruction module 1006 and rendering module 1008 correspond to steps S502 to S508 in Embodiment 2. The examples and application scenarios implemented by the four modules are the same as those of the corresponding steps, but are not limited to what is disclosed in Embodiment 1. It should also be noted that the above modules or units may be hardware components, or may be software components stored in a memory (for example, the memory 104) and processed by one or more processors (for example, the processors 102a, 102b, ..., 102n), and the above modules may also run as a part of the apparatus in the computer terminal 10 provided in Embodiment 1.
It should be noted that the preferred implementation, application scenario and implementation process of this embodiment are the same as those provided in Embodiment 1, but are not limited to the scheme provided in Embodiment 1.
Example 8
According to an embodiment of the present application, there is also provided an avatar driving apparatus for implementing the above avatar driving method. Fig. 11 is a schematic view of an avatar driving apparatus according to Embodiment 8 of the present application. As shown in Fig. 11, the apparatus 1100 includes: a first display module 1102 and a second display module 1104.
The first display module is used for responding to an input instruction acted on the operation interface, displaying sequence information of a voice sequence and a multi-view image set on the operation interface, wherein the voice sequence is acquired by acquiring an entity object in an entity scene, the multi-view image set is acquired by shooting the entity object in a multi-view mode, the voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period; the second display module is used for responding to a driving instruction acting on the operation interface and displaying an virtual image on the operation interface, wherein the virtual image is driven in a preset time period based on a reconstruction grid sequence and a rendering parameter set, the reconstruction grid sequence and the rendering parameter set are obtained by dynamic reconstruction based on a multi-view image set and a target grid sequence of the virtual image, the target grid sequence is generated based on a voice sequence, the target grid sequence is used for representing action information of a physical object in the preset time period, the reconstruction grid sequence comprises reconstruction grids in the preset time period, and the rendering parameter set comprises rendering parameters in the preset time period.
Here, it should be noted that the above first display module 1102 and second display module 1104 correspond to steps S602 to S604 in Embodiment 3. The examples and application scenarios implemented by the two modules are the same as those of the corresponding steps, but are not limited to what is disclosed in Embodiment 1. It should also be noted that the above modules or units may be hardware components, or may be software components stored in a memory (for example, the memory 104) and processed by one or more processors (for example, the processors 102a, 102b, ..., 102n), and the above modules may also run as a part of the apparatus in the computer terminal 10 provided in Embodiment 1.
It should be noted that the preferred implementation, application scenario and implementation process of this embodiment are the same as those provided in Embodiment 1, but are not limited to the scheme provided in Embodiment 1.
Example 9
According to an embodiment of the present application, there is also provided an avatar driving apparatus for implementing the above avatar driving method. Fig. 12 is a schematic view of an avatar driving apparatus according to Embodiment 9 of the present application. As shown in Fig. 12, the apparatus 1200 includes: a first driving module 1202, a generation module 1204, a reconstruction module 1206, a second driving module 1208 and a third driving module 1210.
The first driving module is used for driving the virtual reality VR device or the augmented reality AR device to acquire a voice sequence sent by an entity object in an entity scene and displaying a multi-view image set on a display picture, wherein the voice sequence comprises voice data in a preset time period, the multi-view image set is obtained after the entity object is subjected to multi-view shooting, and the multi-view image set comprises images captured in the preset time period; the generating module is used for generating a target grid sequence of the virtual image corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing the action information of the entity object in a preset time period; the reconstruction module is used for carrying out dynamic reconstruction based on the multi-view image set and the target grid sequence to generate a reconstruction grid sequence and a rendering parameter set, wherein the reconstruction grid sequence comprises reconstruction grids in a preset time period, and the rendering parameter set comprises rendering parameters in the preset time period; the second driving module is used for driving the virtual image to execute corresponding actions in a preset time period based on the reconstructed grid sequence and the rendering parameter set; the third driving module is used for driving the VR device or the AR device to render and display the virtual image.
It should be noted that the above first driving module 1202, generation module 1204, reconstruction module 1206, second driving module 1208 and third driving module 1210 correspond to steps S702 to S710 in Embodiment 4. The examples and application scenarios implemented by the five modules are the same as those of the corresponding steps, but are not limited to what is disclosed in Embodiment 1. It should also be noted that the above modules or units may be hardware components, or may be software components stored in a memory (for example, the memory 104) and processed by one or more processors (for example, the processors 102a, 102b, ..., 102n), and the above modules may also run as a part of the apparatus in the computer terminal 10 provided in Embodiment 1.
It should be noted that the preferred implementation, application scenario and implementation process of this embodiment are the same as those provided in Embodiment 1, but are not limited to the scheme provided in Embodiment 1.
Example 10
According to an embodiment of the present application, there is also provided an avatar driving apparatus for implementing the above avatar driving method. Fig. 13 is a schematic view of an avatar driving apparatus according to Embodiment 10 of the present application. As shown in Fig. 13, the apparatus 1300 includes: an acquisition module 1302, a first generation module 1304, a second generation module 1306 and a driving module 1308.
The acquisition module is used for acquiring a voice sequence and a multi-view image set by calling a first interface, wherein the first interface comprises a first parameter, the parameter value of the first parameter is the voice sequence and the multi-view image set, the voice sequence is acquired by acquiring an entity object in an entity scene, the multi-view image set is acquired by shooting the entity object at multiple angles, the voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period; the first generation module is used for generating a target grid sequence of the virtual image corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing action information of the entity object in a preset time period; the second generation module is used for carrying out dynamic reconstruction based on the multi-view image set and the target grid sequence to generate a reconstruction grid sequence and a rendering parameter set, wherein the reconstruction grid sequence comprises reconstruction grids in a preset time period, and the rendering parameter set comprises rendering parameters in the preset time period; the driving module is used for driving the virtual image to execute corresponding actions in a preset time period based on the reconstructed grid sequence and the rendering parameter set; the output module is used for outputting the avatar by calling a second interface, wherein the second interface comprises a second parameter, and the parameter value of the second parameter is the avatar.
Here, it should be noted that the above acquisition module 1302, first generation module 1304, second generation module 1306 and driving module 1308 correspond to steps S802 to S808 in Embodiment 5. The examples and application scenarios implemented by the four modules are the same as those of the corresponding steps, but are not limited to what is disclosed in Embodiment 1. It should also be noted that the above modules or units may be hardware components, or may be software components stored in a memory (for example, the memory 104) and processed by one or more processors (for example, the processors 102a, 102b, ..., 102n), and the above modules may also run as a part of the apparatus in the computer terminal 10 provided in Embodiment 1.
It should be noted that the preferred implementation, application scenario and implementation process of this embodiment are the same as those provided in Embodiment 1, but are not limited to the scheme provided in Embodiment 1.
Example 11
Embodiments of the present application may provide an AR/VR device that may be any one of a group of AR/VR devices. Alternatively, in this embodiment, the AR/VR device may be replaced by a terminal device such as a mobile terminal.
Alternatively, in this embodiment, the AR/VR device may be located in at least one network device among a plurality of network devices of the computer network.
In this embodiment, the above-mentioned AR/VR device may execute the program codes of the following steps in the avatar driving method: collecting a voice sequence sent by an entity object in an entity scene, and obtaining a multi-view image set obtained after multi-view shooting of the entity object, wherein the voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period; generating a target grid sequence of an virtual image corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing action information of the entity object in a preset time period; dynamically reconstructing based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, wherein the reconstructed grid sequence comprises reconstructed grids in a preset time period, and the rendering parameter set comprises rendering parameters in the preset time period; based on the reconstructed mesh sequence and the rendering parameter set, the avatar is driven to execute a corresponding action within a preset time period.
Alternatively, fig. 14 is a block diagram of a computer terminal according to an embodiment of the present application. As shown in fig. 14, the computer terminal a may include: one or more (only one is shown) processors 102, memory 104, memory controller, and peripheral interfaces, where the peripheral interfaces are connected to the radio frequency module, audio module, and display.
The memory may be used to store software programs and modules, such as program instructions/modules corresponding to the avatar driving method and apparatus in the embodiments of the present application. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby implementing the above avatar driving method. The memory may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic storage devices, a flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located with respect to the processor, which may be connected to the computer terminal A through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor may call the information and the application program stored in the memory through the transmission device to perform the following steps: collecting a voice sequence sent by an entity object in an entity scene, and obtaining a multi-view image set obtained after multi-view shooting of the entity object, wherein the voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period; generating a target grid sequence of an virtual image corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing action information of the entity object in a preset time period; dynamically reconstructing based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, wherein the reconstructed grid sequence comprises reconstructed grids in a preset time period, and the rendering parameter set comprises rendering parameters in the preset time period; based on the reconstructed mesh sequence and the rendering parameter set, the avatar is driven to execute a corresponding action within a preset time period.
Optionally, the above processor may further execute program code for: inputting the voice sequence and the preset grid into a motion prediction model, and acquiring an initial grid sequence output by the motion prediction model; and carrying out interpolation processing on target vertexes in the initial grid sequence to obtain the target grid sequence, wherein the target vertexes are used for representing vertexes corresponding to target parts of the entity objects.
Optionally, the above processor may further execute program code for: extracting the characteristics of the voice sequence by utilizing an encoder module to obtain target voice characteristics of the voice sequence; and predicting the motion of the target voice characteristic and the preset grid by using a decoder module to obtain an initial grid sequence.
Optionally, the above processor may further execute program code for: extracting features of the voice sequence to obtain initial voice features of the voice sequence; performing linear interpolation on the initial voice characteristics to obtain resampled voice characteristics; performing feature conversion on the resampled voice features to obtain converted voice features; and linearly projecting the converted voice characteristics to obtain target voice characteristics.
Optionally, the above processor may further execute program code for: performing periodic position coding on a preset grid and a predicted grid obtained by last prediction to obtain position coding features, wherein the position coding features are used for representing time information of a grid sequence; performing self-attention processing on the position coding features to obtain self-attention features; cross attention processing is carried out on the self attention feature and the target voice feature, so that a cross attention feature is obtained; performing motion prediction on the cross attention characteristics to obtain a prediction grid obtained by the prediction; generating a grid sequence based on the prediction grid obtained by historical prediction and the prediction grid obtained by current prediction.
Optionally, the above processor may further execute program code for: performing feature extraction on the multi-view image set and the target grid sequence by using a coding network to obtain image features and second grid features; and dynamically reconstructing the image features and the second grid features by using a decoding network to generate a reconstructed grid sequence and a rendering parameter set.
Optionally, the above processor may further execute program code for: performing three-dimensional reconstruction on the image features to generate a reconstruction grid sequence; a set of rendering parameters is generated based on the reconstructed mesh sequence and the second mesh feature.
Optionally, the above processor may further execute program code for: initializing the action of the avatar based on texture development of the reconstructed mesh sequence; and adjusting the action of the avatar based on the rendering parameter set in a preset time period.
By adopting the embodiments of the present application, a voice sequence uttered by an entity object in an entity scene is collected, and a multi-view image set obtained by multi-view shooting of the entity object is obtained, where the voice sequence includes voice data within a preset time period and the multi-view image set includes images captured within the preset time period. A target grid sequence of the avatar corresponding to the entity object is generated based on the voice sequence, where the target grid sequence represents action information of the entity object within the preset time period. Dynamic reconstruction is performed based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, where the reconstructed grid sequence includes reconstructed grids within the preset time period and the rendering parameter set includes rendering parameters within the preset time period. The avatar is driven to perform the corresponding actions within the preset time period based on the reconstructed grid sequence and the rendering parameter set, which improves the display effect of the avatar. It is easy to note that generating the target grid sequence of the avatar based on the voice sequence allows the avatar to present the actions associated with the entity object's speech, such as the mouth shape and facial state while speaking, so that the avatar stays close to the actual actions of the entity object and a high-fidelity effect is achieved. Dynamic reconstruction based on the multi-view image set and the target grid sequence enables multi-view rendering, which further improves the display effect of the avatar. Driving the avatar through the rendering parameter set and the reconstructed grid sequence within the preset time period makes it possible to display the avatar corresponding to the entity object in real time, thereby solving the technical problem of the poor display effect of avatars in the related art.
It will be appreciated by those skilled in the art that the structure shown in Fig. 14 is only illustrative, and the computer terminal may be a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), a PAD or the like. Fig. 14 does not limit the structure of the above electronic device. For example, the computer terminal A may also include more or fewer components (such as a network interface or a display device) than shown in Fig. 14, or have a configuration different from that shown in Fig. 14.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing hardware associated with a terminal device. The program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
Example 12
Embodiments of the present application also provide a computer-readable storage medium. Optionally, in this embodiment, the above computer-readable storage medium may be used to store the program code executed by the avatar driving method provided in the above Embodiment 1.
Alternatively, in this embodiment, the above-mentioned computer readable storage medium may be located in any one of the AR/VR device terminals in the AR/VR device network or in any one of the mobile terminals in the mobile terminal group.
Optionally, in this embodiment, the computer-readable storage medium is configured to store program code for performing the following steps: collecting a voice sequence uttered by an entity object in an entity scene, and obtaining a multi-view image set obtained by multi-view shooting of the entity object, wherein the voice sequence includes voice data within a preset time period, and the multi-view image set includes images captured within the preset time period; generating a target grid sequence of the avatar corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing action information of the entity object within the preset time period; performing dynamic reconstruction based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, wherein the reconstructed grid sequence includes reconstructed grids within the preset time period, and the rendering parameter set includes rendering parameters within the preset time period; and driving the avatar to perform corresponding actions within the preset time period based on the reconstructed grid sequence and the rendering parameter set.
Optionally, the above-mentioned storage medium is further configured to store program code for performing the steps of: inputting the voice sequence and the preset grid into a motion prediction model, and acquiring an initial grid sequence output by the motion prediction model; and carrying out interpolation processing on target vertexes in the initial grid sequence to obtain the target grid sequence, wherein the target vertexes are used for representing vertexes corresponding to target parts of the entity objects.
Optionally, the above-mentioned storage medium is further configured to store program code for performing the steps of: extracting the characteristics of the voice sequence by utilizing an encoder module to obtain target voice characteristics of the voice sequence; and predicting the motion of the target voice characteristic and the preset grid by using a decoder module to obtain an initial grid sequence.
Optionally, the above-mentioned storage medium is further configured to store program code for performing the steps of: extracting features of the voice sequence to obtain initial voice features of the voice sequence; performing linear interpolation on the initial voice characteristics to obtain resampled voice characteristics; performing feature conversion on the resampled voice features to obtain converted voice features; and linearly projecting the converted voice characteristics to obtain target voice characteristics.
Optionally, the above-mentioned storage medium is further configured to store program code for performing the steps of: performing periodic position coding on a preset grid and a predicted grid obtained by last prediction to obtain position coding features, wherein the position coding features are used for representing time information of a grid sequence; performing self-attention processing on the position coding features to obtain self-attention features; cross attention processing is carried out on the self attention feature and the target voice feature, so that a cross attention feature is obtained; performing motion prediction on the cross attention characteristics to obtain a prediction grid obtained by the prediction; generating a grid sequence based on the prediction grid obtained by historical prediction and the prediction grid obtained by current prediction.
Optionally, the above-mentioned storage medium is further configured to store program code for performing the steps of: performing feature extraction on the multi-view image set and the target grid sequence by using a coding network to obtain image features and second grid features; and dynamically reconstructing the image features and the second grid features by using a decoding network to generate a reconstructed grid sequence and a rendering parameter set.
Optionally, the above-mentioned storage medium is further configured to store program code for performing the steps of: performing three-dimensional reconstruction on the image features to generate a reconstruction grid sequence; a set of rendering parameters is generated based on the reconstructed mesh sequence and the second mesh feature.
Optionally, the above-mentioned storage medium is further configured to store program code for performing the steps of: initializing the action of the avatar based on texture development of the reconstructed mesh sequence; and adjusting the action of the avatar based on the rendering parameter set in a preset time period.
By adopting the embodiments of the present application, a voice sequence uttered by an entity object in an entity scene is collected, and a multi-view image set obtained by multi-view shooting of the entity object is obtained, where the voice sequence includes voice data within a preset time period and the multi-view image set includes images captured within the preset time period. A target grid sequence of the avatar corresponding to the entity object is generated based on the voice sequence, where the target grid sequence represents action information of the entity object within the preset time period. Dynamic reconstruction is performed based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, where the reconstructed grid sequence includes reconstructed grids within the preset time period and the rendering parameter set includes rendering parameters within the preset time period. The avatar is driven to perform the corresponding actions within the preset time period based on the reconstructed grid sequence and the rendering parameter set, which improves the display effect of the avatar. It is easy to note that generating the target grid sequence of the avatar based on the voice sequence allows the avatar to present the actions associated with the entity object's speech, such as the mouth shape and facial state while speaking, so that the avatar stays close to the actual actions of the entity object and a high-fidelity effect is achieved. Dynamic reconstruction based on the multi-view image set and the target grid sequence enables multi-view rendering, which further improves the display effect of the avatar. Driving the avatar through the rendering parameter set and the reconstructed grid sequence within the preset time period makes it possible to display the avatar corresponding to the entity object in real time, thereby solving the technical problem of the poor display effect of avatars in the related art.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technology may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and the division of the units, such as the division of the units, is merely a logical function division, and may be implemented in another manner, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application, in essence or in the part contributing to the related art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make several improvements and modifications without departing from the principles of the present application, and such improvements and modifications shall also fall within the protection scope of the present application.

Claims (14)

1. A driving method of an avatar, comprising:
collecting a voice sequence sent by an entity object in an entity scene, and obtaining a multi-view image set obtained after multi-view shooting is carried out on the entity object, wherein the voice sequence comprises voice data in a preset time period, and the multi-view image set comprises images captured in the preset time period;
generating a target grid sequence of an avatar corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing action information of the entity object in the preset time period;
dynamically reconstructing based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, wherein the reconstructed grid sequence comprises reconstructed grids in the preset time period, and the rendering parameter set comprises rendering parameters in the preset time period;
and driving the avatar to execute corresponding actions within the preset time period based on the reconstruction grid sequence and the rendering parameter set.
2. The method of claim 1, wherein generating a target mesh sequence of the avatar corresponding to the physical object based on the voice sequence comprises:
inputting the voice sequence and a preset grid into a motion prediction model, and acquiring an initial grid sequence output by the motion prediction model;
and carrying out interpolation processing on target vertexes in the initial grid sequence to obtain the target grid sequence, wherein the target vertexes are used for representing vertexes corresponding to target parts of the entity objects.
3. The method of claim 2, wherein the motion prediction model comprises an encoder module and a decoder module, and wherein inputting the voice sequence and the preset grid into the motion prediction model and acquiring the initial grid sequence output by the motion prediction model comprises:
extracting features of the voice sequence by utilizing the encoder module to obtain target voice features of the voice sequence;
and predicting the motion of the target voice feature and the preset grid by using the decoder module to obtain the initial grid sequence.
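Claim 3 splits the motion prediction model into an encoder module (voice sequence to voice features) and a decoder module (voice features plus preset grid to grid sequence). A minimal PyTorch-style composition under that assumption is shown below; the concrete encoder and decoder are only sketched in the examples after claims 4 and 5, and the class and argument names are illustrative.

    import torch
    from torch import nn

    class MotionPredictionModel(nn.Module):
        # Assumed two-part structure from claim 3: encoder for speech, decoder for motion.
        def __init__(self, encoder: nn.Module, decoder: nn.Module):
            super().__init__()
            self.encoder = encoder
            self.decoder = decoder

        def forward(self, voice_sequence, preset_grid):
            target_voice_feature = self.encoder(voice_sequence)       # claim 4
            return self.decoder(target_voice_feature, preset_grid)    # claim 5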
4. The method of claim 3, wherein performing feature extraction on the voice sequence by using the encoder module to obtain the target voice feature of the voice sequence comprises:
performing feature extraction on the voice sequence to obtain an initial voice feature of the voice sequence;
performing linear interpolation on the initial voice feature to obtain a resampled voice feature;
performing feature conversion on the resampled voice feature to obtain a converted voice feature; and
performing linear projection on the converted voice feature to obtain the target voice feature.
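The four sub-steps of claim 4 (feature extraction, linear-interpolation resampling, feature conversion, linear projection) map naturally onto a small speech encoder. The PyTorch sketch below is one assumed realization: a convolutional front end, interpolation of the feature rate to the animation frame rate, a transformer layer as the "conversion", and a final linear projection. All layer choices, dimensions, and names are assumptions.

    import torch
    from torch import nn
    import torch.nn.functional as F

    class SpeechEncoder(nn.Module):
        def __init__(self, feat_dim: int = 256, out_dim: int = 128):
            super().__init__()
            self.extractor = nn.Conv1d(1, feat_dim, kernel_size=400, stride=160)   # step 1
            self.converter = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                                        batch_first=True)          # step 3
            self.proj = nn.Linear(feat_dim, out_dim)                               # step 4

        def forward(self, audio: torch.Tensor, n_frames: int) -> torch.Tensor:
            # audio: (B, samples) raw waveform covering the preset time period.
            x = self.extractor(audio.unsqueeze(1))             # (B, C, T'): initial voice feature
            x = F.interpolate(x, size=n_frames, mode="linear",
                              align_corners=False)             # step 2: resampled voice feature
            x = self.converter(x.transpose(1, 2))              # (B, n_frames, C): converted feature
            return self.proj(x)                                # (B, n_frames, out_dim): target feature

    # Hypothetical usage: one second of 16 kHz audio mapped to 25 animation frames.
    feats = SpeechEncoder()(torch.zeros(1, 16000), n_frames=25)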
5. The method of claim 3, wherein performing motion prediction on the target voice feature and the preset grid by using the decoder module to obtain the initial grid sequence comprises:
performing periodic positional encoding on the preset grid and a predicted grid obtained in a previous prediction to obtain a positional encoding feature, wherein the positional encoding feature is used for representing time information of the initial grid sequence;
performing self-attention processing on the positional encoding feature to obtain a self-attention feature;
performing cross-attention processing on the self-attention feature and the target voice feature to obtain a cross-attention feature;
performing motion prediction on the cross-attention feature to obtain a predicted grid of the current prediction; and
generating the initial grid sequence based on predicted grids obtained in historical predictions and the predicted grid of the current prediction.
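Claim 5 reads like one step of an autoregressive transformer-style decoder: the preset grid (and, at later steps, the previously predicted grid) is embedded and given a periodic positional encoding, self-attention is applied over the predictions so far, cross-attention attends to the target voice feature, and a prediction head emits the next grid; repeating the step and collecting the outputs gives the initial grid sequence. The PyTorch sketch below follows that reading; the specific periodic encoding, the layer sizes, and all names are assumptions.

    import math
    import torch
    from torch import nn

    class MotionDecoder(nn.Module):
        def __init__(self, n_verts: int = 5023, d_model: int = 128, period: int = 25):
            super().__init__()
            self.period = period
            self.embed = nn.Linear(n_verts * 3, d_model)
            self.self_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
            self.head = nn.Linear(d_model, n_verts * 3)        # motion prediction head

        def periodic_encoding(self, t: int, d: int) -> torch.Tensor:
            # Assumed periodic positional encoding: the phase repeats every `period` frames.
            phase = 2 * math.pi * (t % self.period) / self.period
            enc = torch.empty(d)
            enc[0::2] = math.sin(phase)
            enc[1::2] = math.cos(phase)
            return enc

        def forward(self, voice_feats: torch.Tensor, preset_grid: torch.Tensor,
                    n_frames: int) -> torch.Tensor:
            # voice_feats: (B, T, d_model); preset_grid: (B, n_verts, 3)
            B = voice_feats.shape[0]
            prev = preset_grid.flatten(1)                      # first step uses the preset grid
            history, outputs = [], []
            for t in range(n_frames):
                tok = self.embed(prev) + self.periodic_encoding(t, self.embed.out_features)
                history.append(tok)
                seq = torch.stack(history, dim=1)              # predictions made so far
                x, _ = self.self_attn(seq, seq, seq)           # self-attention feature
                q = x[:, -1:, :]                               # query for the current frame
                x, _ = self.cross_attn(q, voice_feats, voice_feats)   # cross-attention feature
                prev = self.head(x.squeeze(1))                 # predicted grid for this step
                outputs.append(prev.view(B, -1, 3))
            return torch.stack(outputs, dim=1)                 # (B, n_frames, n_verts, 3)

    # Hypothetical usage with the encoder sketch after claim 4 (d_model = out_dim = 128):
    # grids = MotionDecoder()(feats, torch.zeros(1, 5023, 3), n_frames=25)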
6. The method of claim 1, wherein performing dynamic reconstruction based on the multi-view image set and the target grid sequence to generate the reconstructed grid sequence and the rendering parameter set comprises:
performing feature extraction on the multi-view image set and the target grid sequence by using an encoding network to obtain an image feature and a second grid feature; and
performing dynamic reconstruction on the image feature and the second grid feature by using a decoding network to generate the reconstructed grid sequence and the rendering parameter set.
7. The method of claim 6, wherein performing dynamic reconstruction on the image feature and the second grid feature by using the decoding network to generate the reconstructed grid sequence and the rendering parameter set comprises:
performing three-dimensional reconstruction on the image feature to generate the reconstructed grid sequence; and
generating the rendering parameter set based on the reconstructed grid sequence and the second grid feature.
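Claims 6 and 7 describe an encoder/decoder pair for dynamic reconstruction: an encoding network turns the multi-view images and the driven grids into features, and a decoding network turns those features into refined geometry plus per-frame rendering parameters. The PyTorch sketch below is one assumed realization under assumed shapes; the CNN/MLP choices and the content of the rendering parameters (for example texture and lighting codes) are illustrative assumptions.

    import torch
    from torch import nn

    class DynamicReconstructor(nn.Module):
        def __init__(self, n_verts: int = 5023, feat_dim: int = 128, param_dim: int = 64):
            super().__init__()
            # encoding network: image branch + grid branch
            self.image_enc = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                                           nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                           nn.Linear(16, feat_dim))
            self.grid_enc = nn.Linear(n_verts * 3, feat_dim)
            # decoding network: 3D reconstruction head + rendering-parameter head
            self.recon_head = nn.Linear(feat_dim, n_verts * 3)        # per-vertex offsets
            self.param_head = nn.Linear(2 * feat_dim, param_dim)      # e.g. texture/lighting code

        def forward(self, images: torch.Tensor, grids: torch.Tensor):
            # images: (T, V, 3, H, W) multi-view frames; grids: (T, n_verts, 3) target grids
            T, V = images.shape[:2]
            img_feat = self.image_enc(images.flatten(0, 1)).view(T, V, -1).mean(dim=1)  # (T, F)
            grid_feat = self.grid_enc(grids.flatten(1))                                 # (T, F)
            # claim 7, step 1: three-dimensional reconstruction from the image features
            recon_grids = grids + self.recon_head(img_feat).view(T, -1, 3)
            # claim 7, step 2: rendering parameters from reconstructed grids + grid features
            recon_feat = self.grid_enc(recon_grids.flatten(1))
            render_params = self.param_head(torch.cat([recon_feat, grid_feat], dim=-1))
            return recon_grids, render_params

    # Hypothetical usage: 5 frames, 4 views, 64x64 images.
    # recon, params = DynamicReconstructor()(torch.zeros(5, 4, 3, 64, 64), torch.zeros(5, 5023, 3))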
8. The method of claim 1, wherein driving the avatar to perform a corresponding action within the preset time period based on the reconstructed grid sequence and the rendering parameter set comprises:
initializing an action of the avatar based on texture unwrapping of the reconstructed grid sequence; and
adjusting the action of the avatar within the preset time period based on the rendering parameter set.
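Claim 8 splits driving into an initialization pass (the avatar's starting state is set up from the reconstructed grids and their unwrapped textures) and a per-frame adjustment pass governed by the rendering parameters. The sketch below only illustrates that two-phase control flow; the Renderer object and its methods are entirely hypothetical and do not correspond to any real rendering API.

    from dataclasses import dataclass, field
    from typing import Any, Dict, List

    @dataclass
    class Renderer:
        # Hypothetical stand-in for the rendering backend that displays the avatar.
        state: Dict[str, Any] = field(default_factory=dict)

        def set_mesh(self, mesh: Any) -> None:
            self.state["mesh"] = mesh

        def set_texture(self, texture: Any) -> None:
            self.state["texture"] = texture

        def apply_params(self, params: Dict[str, Any]) -> None:
            self.state.update(params)

    def drive_avatar(renderer: Renderer, recon_grids: List[Any],
                     unwrapped_textures: List[Any], render_params: List[Dict[str, Any]]) -> None:
        # Initialization: first reconstructed grid and its unwrapped texture set the initial action.
        renderer.set_mesh(recon_grids[0])
        renderer.set_texture(unwrapped_textures[0])
        # Adjustment: update the action frame by frame over the preset time period.
        for mesh, params in zip(recon_grids[1:], render_params[1:]):
            renderer.set_mesh(mesh)
            renderer.apply_params(params)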
9. A rendering method of an avatar, comprising:
collecting a voice sequence uttered by an entity object in an entity scene, and obtaining a multi-view image set captured by shooting the entity object from multiple views, wherein the voice sequence comprises voice data within a preset time period, and the multi-view image set comprises images captured within the preset time period;
generating a target grid sequence of an avatar corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing action information of the entity object within the preset time period;
performing dynamic reconstruction based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, wherein the reconstructed grid sequence comprises reconstructed grids within the preset time period, and the rendering parameter set comprises rendering parameters within the preset time period; and
rendering the avatar within the preset time period based on the reconstructed grid sequence and the rendering parameter set.
10. A driving method of an avatar, comprising:
in response to an input instruction acting on an operation interface, displaying sequence information of a voice sequence and a multi-view image set on the operation interface, wherein the voice sequence is collected from an entity object in an entity scene, the multi-view image set is obtained by shooting the entity object from multiple views, the voice sequence comprises voice data within a preset time period, and the multi-view image set comprises images captured within the preset time period; and
in response to a driving instruction acting on the operation interface, displaying the avatar on the operation interface, wherein the avatar is driven within the preset time period based on a reconstructed grid sequence and a rendering parameter set, the reconstructed grid sequence and the rendering parameter set are obtained by dynamic reconstruction based on the multi-view image set and a target grid sequence of the avatar, the target grid sequence is generated based on the voice sequence and is used for representing action information of the entity object within the preset time period, the reconstructed grid sequence comprises reconstructed grids within the preset time period, and the rendering parameter set comprises rendering parameters within the preset time period.
11. A driving method of an avatar, comprising:
driving a virtual reality (VR) device or an augmented reality (AR) device to collect a voice sequence uttered by an entity object in an entity scene, and displaying a multi-view image set on a display screen, wherein the voice sequence comprises voice data within a preset time period, the multi-view image set is obtained by shooting the entity object from multiple views, and the multi-view image set comprises images captured within the preset time period;
generating a target grid sequence of an avatar corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing action information of the entity object within the preset time period;
performing dynamic reconstruction based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, wherein the reconstructed grid sequence comprises reconstructed grids within the preset time period, and the rendering parameter set comprises rendering parameters within the preset time period;
driving the avatar to perform a corresponding action within the preset time period based on the reconstructed grid sequence and the rendering parameter set; and
driving the VR device or the AR device to render and display the avatar.
12. A driving method of an avatar, comprising:
acquiring a voice sequence and a multi-view image set by calling a first interface, wherein the first interface comprises a first parameter, a parameter value of the first parameter is the voice sequence and the multi-view image set, the voice sequence is collected from an entity object in an entity scene, the multi-view image set is obtained by shooting the entity object from multiple views, the voice sequence comprises voice data within a preset time period, and the multi-view image set comprises images captured within the preset time period;
generating a target grid sequence of an avatar corresponding to the entity object based on the voice sequence, wherein the target grid sequence is used for representing action information of the entity object within the preset time period;
performing dynamic reconstruction based on the multi-view image set and the target grid sequence to generate a reconstructed grid sequence and a rendering parameter set, wherein the reconstructed grid sequence comprises reconstructed grids within the preset time period, and the rendering parameter set comprises rendering parameters within the preset time period;
driving the avatar to perform a corresponding action within the preset time period based on the reconstructed grid sequence and the rendering parameter set; and
outputting the avatar by calling a second interface, wherein the second interface comprises a second parameter, and a parameter value of the second parameter is the avatar.
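Claim 12 places the pipeline behind two call interfaces: a first interface whose parameter value carries the inputs, and a second interface whose parameter value carries the driven avatar. The sketch below illustrates that service boundary as plain Python functions; the Avatar container, the function names, and passing the prediction and reconstruction steps in as callables are all assumptions (in practice the interfaces could be RPC or SDK entry points).

    from dataclasses import dataclass
    from typing import Any, Callable, List, Tuple

    @dataclass
    class Avatar:
        # Hypothetical container for the driven avatar handed to the caller.
        grids: List[Any]
        render_params: List[Any]

    def first_interface(first_parameter: Tuple[Any, Any]) -> Tuple[Any, Any]:
        # The first parameter's value is the voice sequence and the multi-view image set.
        voice_sequence, multiview_images = first_parameter
        return voice_sequence, multiview_images

    def second_interface(second_parameter: Avatar) -> Avatar:
        # The second parameter's value is the avatar; here it is simply returned to the caller.
        return second_parameter

    def drive_via_interfaces(payload: Tuple[Any, Any],
                             predict: Callable[[Any], Any],
                             reconstruct: Callable[[Any, Any], Tuple[Any, Any]]) -> Avatar:
        voice, images = first_interface(payload)
        grids = predict(voice)                         # speech-driven target grids (claims 2-5)
        recon, params = reconstruct(images, grids)     # dynamic reconstruction (claims 6-7)
        return second_interface(Avatar(grids=recon, render_params=params))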
13. An electronic device, comprising:
a memory storing an executable program; and
a processor configured to execute the executable program, wherein the executable program, when run, performs the method of any one of claims 1 to 12.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored executable program, wherein the executable program, when run, controls a device in which the computer-readable storage medium is located to perform the method of any one of claims 1 to 12.
CN202310536810.8A 2023-05-10 2023-05-10 Virtual image driving method, virtual image rendering method and electronic device Pending CN116630485A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310536810.8A CN116630485A (en) 2023-05-10 2023-05-10 Virtual image driving method, virtual image rendering method and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310536810.8A CN116630485A (en) 2023-05-10 2023-05-10 Virtual image driving method, virtual image rendering method and electronic device

Publications (1)

Publication Number Publication Date
CN116630485A true CN116630485A (en) 2023-08-22

Family

ID=87641061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310536810.8A Pending CN116630485A (en) 2023-05-10 2023-05-10 Virtual image driving method, virtual image rendering method and electronic device

Country Status (1)

Country Link
CN (1) CN116630485A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116993875A (en) * 2023-08-31 2023-11-03 荣耀终端有限公司 Digital person generation method and device, electronic equipment and storage medium
CN116993875B (en) * 2023-08-31 2024-02-27 荣耀终端有限公司 Digital person generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination