CN114998514A - Virtual role generation method and equipment - Google Patents

Virtual role generation method and equipment

Info

Publication number
CN114998514A
Authority
CN
China
Prior art keywords
model
target object
video frame
reference frame
parameters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210526179.9A
Other languages
Chinese (zh)
Inventor
许瀚誉
吴连朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Juhaokan Technology Co Ltd
Original Assignee
Juhaokan Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Juhaokan Technology Co Ltd
Priority to CN202210526179.9A
Publication of CN114998514A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 - Manipulating 3D models or images for computer graphics
    • G06T19/20 - Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/54 - Extraction of image or video features relating to texture
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The application relates to the technical field of three-dimensional reconstruction and provides a virtual character generation method and device. A video of a target object rotating through one full circle in a preset posture is acquired with a consumer-grade RGB camera. Because the front image of the target object is rich in detail, it is used as the first video frame to extract the target object's shape parameters, with which an initial parameterized human body model is voxelized to generate a skinned reference frame model. The reference frame model is then continuously optimized with the pose parameters extracted from each video frame to obtain a target geometric model, and after texture mapping the target geometric model with the RGB information of each video frame, a virtual character consistent with the appearance of the target object is obtained. The continuous optimization improves the realism of the virtual character; during the optimization, the loss of the reference frame model before and after deformation for the previous video frame is used to optimize the deformation parameters of the next video frame, so the model does not need to be adjusted manually and the generation efficiency of the virtual character is improved.

Description

Virtual role generation method and equipment
Technical Field
The present application relates to the field of three-dimensional reconstruction technologies, and in particular, to a method and an apparatus for generating a virtual character.
Background
With the development of three-dimensional reconstruction technology, a lifelike virtual character can carry out real-time remote three-dimensional interaction with other virtual characters in a virtual space, reproducing the face-to-face immersion of the first-generation social mode, and virtual/augmented reality is expected to become the fifth-generation social mode following the fourth-generation mobile internet era.
At present, there are two main ways to implement remote social interaction based on three-dimensional human body reconstruction. In the first, a three-dimensional human body reconstruction is performed on RGB or RGBD data collected in real time to obtain a virtual character, and the three-dimensional data of the virtual character (such as vertex coordinates, patch indices and textures) is transmitted to the other interactive terminals through the cloud. Because this method reconstructs from real acquired data, the body, clothes and hair of the virtual character are realistic, but the transmission volume is large: existing network bandwidth can hardly meet the real-time requirement, the interaction stutters, and the user's immersive experience is reduced. In the second, a virtual character is constructed in advance, and motion capture is then used to collect the user's motion data in real time during the interaction to drive the virtual character. This method reduces the amount of data transmitted during real-time interaction, but when the character is reconstructed with consumer-grade acquisition equipment, real-time interaction is only achievable for a virtual character that wears no clothes or preset clothes; if the clothes of the virtual character are required to resemble the user's and move with the user, they can only be drawn manually by an animator using offline, film-production methods, and real-time performance cannot be achieved.
Disclosure of Invention
The embodiment of the application provides a virtual character generation method and virtual character generation equipment, which are used for improving the authenticity and the generation efficiency of virtual characters under the condition of meeting real-time interaction.
On one hand, an embodiment of the present application provides a method for generating a virtual role, including:
acquiring a video of a target object which is acquired by an RGB camera and rotates for one circle in a preset posture, and taking a front image of the target object as a first video frame to extract a shape parameter of the target object;
initializing a voxel space with set resolution according to an initial human body parameterized model and the shape parameters, and generating a reference frame model with skin according to an initialization result;
according to the attitude parameters of the target object contained in each video frame and the three-dimensional coordinates of each vertex in the reference frame model in the corresponding voxel block, respectively deforming the reference frame model to obtain a target geometric model; in the deformation process, the loss before and after the deformation of the reference frame model corresponding to the previous video frame is used for optimizing the deformation parameter of the next video frame;
and performing texture mapping on the target geometric model according to the RGB information of each video frame to obtain the virtual role of the target object.
On the other hand, an interactive device is provided in an embodiment of the present application, and is configured to generate a virtual character for remote interaction, where the interactive device includes a processor, a memory, a display, and a communication interface, and the communication interface, the display, and the memory are connected to the processor through a bus;
the memory includes a data storage unit and a program storage unit, the program storage unit stores computer program instructions, and the processor executes the computer program instructions to perform the following operations:
acquiring a video of a target object which is acquired by an RGB camera and rotates for a circle at a preset posture through the communication interface, storing the video into the data storage unit, and taking a front image of the target object as a first video frame to extract a shape parameter of the target object;
initializing a voxel space with set resolution according to an initial human body parameterized model and the shape parameters of the target object, and generating a reference frame model with skin according to an initialization result;
according to the attitude parameters of the target object contained in each video frame and the three-dimensional coordinates of each vertex in the reference frame model in the corresponding voxel block, respectively deforming the reference frame model to obtain a target geometric model; in the deformation process, the loss before and after the deformation of the reference frame model corresponding to the previous video frame is used for optimizing the deformation parameter of the next video frame;
and performing texture mapping on the target geometric model according to the RGB information of each video frame to obtain the virtual role of the target object, and displaying the virtual role by the display.
In another aspect, the present application provides a computer-readable storage medium, where computer-executable instructions are stored, and the computer-executable instructions are configured to enable a computer to execute the method for generating a virtual character provided in the embodiments of the present application.
The embodiments of the application provide a virtual character generation method and device. A consumer-grade RGB camera collects and stores a video of a target object rotating through one full circle in a preset posture, and the front image, which is rich in detail, is used as the first video frame to extract the target object's shape parameters. After the shape parameters are used to voxelize an initial parameterized human body model with a consistent topology, a skinned reference frame model is generated. The reference frame model is then deformed separately according to the pose parameters contained in each video frame and the three-dimensional coordinates of each vertex of the reference frame model within its corresponding voxel block, yielding a target geometric model. During deformation, the loss of the reference frame model before and after deformation for the previous video frame is used to optimize the deformation parameters of the next video frame, so that the target geometric model better matches the target object and its realism is improved; after texture mapping the target geometric model with the RGB information of each video frame, a more realistic virtual character is obtained whose skin truly depicts the appearance of the target object. Throughout the generation process the deformation parameters are optimized automatically, without an animator's involvement, which saves manpower and material resources and improves the generation efficiency of the virtual character.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1A is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 1B is a schematic diagram of another application scenario provided in the embodiment of the present application;
FIG. 2 is a system framework diagram for generating virtual roles provided by an embodiment of the present application;
fig. 3 is a flowchart of a method for generating a virtual role according to an embodiment of the present application;
fig. 4 is a flowchart of a method for extracting a shape parameter and an attitude parameter according to an embodiment of the present application;
fig. 5 is a flowchart of a voxelization processing method provided in an embodiment of the present application;
FIG. 6 is a flowchart of a method for generating a reference frame model according to an embodiment of the present disclosure;
FIG. 7 is a flowchart of a method for transforming a reference frame model using current pose parameters according to an embodiment of the present disclosure;
fig. 8 is a flowchart of a method for optimizing a deformation parameter of a next video frame according to a loss before and after deformation of a previous video frame according to an embodiment of the present disclosure;
FIG. 9 is a diagram illustrating the effect of a geometric model of a target with skin according to an embodiment of the present application;
fig. 10 is a flowchart of a method for determining a correspondence between a vertex of a model and a pixel of a texture map according to an embodiment of the present application;
FIG. 11 is a flowchart of a method for texture mapping of a target geometric model according to an embodiment of the present application;
FIG. 12 is a diagram illustrating an effect of a generated virtual character according to an embodiment of the present application;
fig. 13 is a flowchart of a method for performing remote interaction by using a virtual character according to an embodiment of the present application;
fig. 14A is a structural diagram of an interaction device according to an embodiment of the present application;
fig. 14B is a block diagram of another interactive device provided in the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, the core technology of a remote three-dimensional interactive system is the real-time three-dimensional reconstruction of a human body, a human face, hair and clothes. The three-dimensional reconstruction relates to data such as shape, motion, material and the like, firstly, data of an object to be reconstructed are obtained from various acquisition devices and are used as input parameters, then, a non-rigid real-time three-dimensional reconstruction method is adopted to process the input parameters, and three-dimensional information of a human body is obtained, so that a virtual character of the object to be reconstructed is reconstructed, and the novelty and the interestingness of interaction are improved.
Acquisition devices can be classified into optical scanners (e.g., visible structured light or laser scanners), RGBD cameras, and RGB cameras, depending on the type of sensor. The virtual character generation method based on the optical scanner is often applied to scenes (such as movies) with high requirements on reality, three-dimensional point cloud data are obtained by scanning corresponding characters, the three-dimensional point cloud data are reorganized into human body grid data, and high-quality virtual characters are obtained by careful polishing and modification of animators.
In recent years, with the continuous development of imaging technology, methods for generating virtual characters based on RGBD cameras have developed rapidly; they perform three-dimensional reconstruction based on Simultaneous Localization and Mapping (SLAM) technology, which improves the generation efficiency of virtual characters. To generate virtual characters quickly and reduce reconstruction cost, methods for generating virtual characters based on consumer-grade RGB cameras have become a current research hotspot.
Some products on the market generate realistic or cartoon-style virtual characters from a single 2D image, but the small amount of model data leads to a poor simulation effect. Others use large camera arrays for 3D data acquisition and heavy computation for modelling, producing virtual characters with a good simulation effect, but they require wide network bandwidth and the system cost is high. Still others add a depth sensor to collect depth data of the person, perform motion capture in combination with RGB data, extract motion data for driving a three-dimensional model, and then drive a pre-built model to obtain the virtual character.
At present, in various methods for generating virtual characters, the accuracy of virtual characters obtained by a reconstruction method based on an optical scanner is the highest, because in the scanning process, an object to be reconstructed is required to keep still for a few seconds or minutes, a high-accuracy static three-dimensional human body model is reconstructed by splicing multi-angle high-accuracy three-dimensional scanning data, and then an animator performs targeted manual repair and skeleton information embedding to convert the static three-dimensional human body model into a drivable virtual character with clothes. Because the method needs manual participation, the generation efficiency of the virtual role is low.
Three-dimensional reconstruction based on data acquired in real time has difficulty generating three-dimensional models with a consistent topology, cannot guarantee that the models are closed, and struggles to guarantee reconstruction quality. Moreover, there is currently no mature compression scheme for transmitting three-dimensional data, and large amounts of three-dimensional data cannot be transmitted in real time with wide-area-network transmission capability, so the method cannot be widely applied; generally only one- or two-person bidirectional three-dimensional data transmission is performed on a private local area network. Therefore, this method is not feasible for remote three-dimensional interaction.
The method of driving a pre-built model to generate the virtual character only needs to transmit motion data with a small data volume to complete remote social interaction; the generation process is simple and real-time transmission is achievable under existing network capability. However, because the parameterized human body model with a consistent topology used in pre-modelling can hardly depict the real appearance of a person, and the person's clothes and hair are difficult to generate directly, real-time interaction is only achievable for a virtual character that wears no clothes or preset clothes. If the appearance of the virtual character is required to match the real appearance, an animator must add it manually afterwards, and different clothes must be added repeatedly by hand; if the clothes of the virtual character are required to resemble the user's and move with the user, the animator must repeatedly drive, simulate and manually refine them, which is inefficient.
In view of this, embodiments of the present application provide a virtual character generation method and device, which use a video, acquired by a consumer-grade RGB camera, of a target object rotating through one full circle in a preset posture to generate a skinned virtual character (with clothes, hair, hat, etc.) consistent with the target object. In the generation process, after the initial parameterized human body model is voxelized with the shape parameters of the target object contained in the first video frame, a skinned reference frame model is generated; the reference frame model is continuously optimized based on the vertex coordinates of the voxelized reference frame model and the pose parameters of the target object contained in each video frame, which improves the realism of the virtual character, and during the optimization the optimization loss of the previous video frame is used to adjust the optimization result of the next video frame, further improving the realism. Meanwhile, real and complete texture data can be obtained from the RGB information in each video frame, so that the textured virtual character is consistent with the target object. The whole optimization process requires no animator, saving manpower and material resources and improving the generation efficiency of the virtual character.
After the virtual character of the target object is obtained, only the motion data used to drive the virtual character needs to be transmitted during remote three-dimensional interaction to obtain a virtual character consistent with the target object's actions. This reduces the amount of data transmitted over the network, meets the real-time interaction requirement under existing network bandwidth, and improves the user's immersive experience.
Referring to fig. 1A, a schematic view of an application scenario provided in an embodiment of the present application, user A and user B interact with each other using a smartphone 101 and a laptop computer 102, respectively, as interaction devices. During the interaction, the three-dimensional model data of both users is uploaded to the server 200 through the same local area network, and each interaction device downloads the other party's three-dimensional model data from the server for rendering and display, so that the virtual characters of both users are placed in the same virtual space and the immersive experience of face-to-face interaction is achieved.
In the interaction process, the interaction device in the same local area network can also perform remote interaction with the interaction devices in other local area networks. For example, referring to fig. 1B, the smart phone 101 and the smart tv 103 are in the lan 1, the notebook computer 102 is in the lan 2, the smart phone 101 and the smart tv 103 upload their three-dimensional model data to the server 200 through the lan 1, and the notebook computer uploads their three-dimensional model data to the server 200 through the lan 2.
It should be noted that fig. 1A and fig. 1B are only examples, and the embodiment of the present application does not limit the interactive terminal; besides a smartphone, a smart television and a notebook computer, it may be an interaction device such as a tablet, a desktop computer or a VR head-mounted display. The server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.
Based on the above application scenarios, fig. 2 is a system framework diagram for generating a virtual character according to an embodiment of the present disclosure. The system includes a consumer-grade RGB camera and a host equipped with a high-performance Graphics Processing Unit (GPU), and can be divided into a data acquisition module, a geometric reconstruction module, a texture reconstruction module and a virtual character generation module.
The data acquisition module includes the consumer-grade RGB camera and is used to acquire images of the target object. During acquisition, the target object faces the RGB camera, adopts a T-pose or A-pose, and then rotates slowly through one full circle. After the rotation is finished, the video data acquired by the RGB camera is stored.
The geometric reconstruction module initializes a voxel space with fixed resolution using the parameterized human body model (human body geometric mesh) corresponding to the target object to obtain a reference frame model, then uses two deep learning networks to learn the transformation relationship between each stored video frame and the reference frame model, continuously optimizing the non-rigid deformation relationship and the skinning weights to enrich the details of the reference frame model, thereby obtaining a target geometric model with a consistent topology and further improving the realism of the virtual character.
And the texture reconstruction module is used for establishing the corresponding relation between the RGB information in each frame of video and the target geometric model so as to complement the texture of the reference frame (the first video frame in the embodiment of the application) and obtain a complete texture map of the target geometric model.
And the virtual character generation module is used for performing texture mapping on the target geometric model by using the complete texture map to obtain a virtual character consistent with the skin and the posture of the target object.
In the embodiment of the application, the skinning weights used to drive the human body target geometric model can be extracted during the continuous optimization of the reference frame model. Therefore, during remote interaction, after the motion data of the parameterized human body model is acquired from RGB images collected in real time, the skinning weights are used to drive the target geometric model, yielding a virtual character that matches the current action of the target object. The motion data transmitted throughout the interaction occupies little data volume, so the real-time interaction requirement can be met under existing network bandwidth.
Referring to fig. 3, a flowchart of a method for generating a virtual role according to an embodiment of the present application is executed by an interactive device equipped with a high-performance GPU, and mainly includes the following steps:
s301: and acquiring a video of a target object which is acquired by an RGB camera and rotates for one circle in a preset posture, and taking a front image of the target object as a first video frame to extract the body parameters of the target object.
The RGB camera may be a camera configured on the interactive device, or may be a camera independent from the interactive device.
During specific implementation, the RGB camera is fixed, the target object walks to the front of the RGB camera and faces the RGB camera in the forward direction, and the position of the target object ensures that the whole body data of the target object can be shot by the visual field of the RGB camera. Then, the RGB camera is started, the target object rotates for a circle at a preset posture (such as T-position or A-position) in the visual field of the RGB camera, the RGB camera is enabled to collect the target object from multiple angles, and a video of the target object is generated and stored. The video may be transmitted to a GPU of the interactive device for processing via the communication interface.
The front image of the target object contains richer detail information, so that the collected front image of the target object is used as a first video frame, which is also called a reference frame.
In an embodiment of the present application, in order to generate a high-quality avatar, the resolution of the RGB camera may be set to be greater than 720 p.
It should be noted that the embodiments of the present application do not limit the way the target object rotates; for example, the target object may stand on a mechanical turntable and rotate in a fixed posture. To ensure the completeness of the virtual character and the realism of subsequent driving, the target object may also perform movements of a certain amplitude during the rotation.
S302: initializing a voxel space with set resolution according to the initial human body parameterized model and the shape parameters of the target object, and generating a reference frame model with skin according to the initialization result, wherein the shape parameters are extracted from the first video frame.
After the video of the target object is obtained, each video frame is fitted with a preset initial human body parameterized model, and the shape parameters of the target object and the attitude parameters of the target object contained in each video frame are determined. The initial human body parameterized model may be an SMPL model, and its expression formula is as follows:
M(β, θ) = W(T_p(β, θ), J(β), θ, ω)    (formula 1)

T_p(β, θ) = T̄ + B_s(β) + B_p(θ)    (formula 2)

where M(β, θ) is the SMPL model construction function, T_p(β, θ) is the blend-shape function that corrects the human body for shape and pose, β is the shape parameter, θ is the pose parameter, W is the skinning function, B_s(β) is the linear shape function constructed from the shape parameters of people of different body types, B_p(θ) is the function describing the influence of the pose parameters on body shape, T̄ is the mean template over the different body shapes, J(β) is the function predicting the joint positions of different human bodies, and ω are the blend (skinning) weights.
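For illustration only (not part of the claimed method), the following minimal NumPy sketch shows how formulas 1 and 2 can be evaluated once the template mesh T̄, the shape and pose blend-shape bases, the per-joint rigid transforms derived from θ and J(β), and the skinning weights ω are available as arrays; all names and shapes are assumptions.

```python
import numpy as np

def blend_shapes(T_bar, S_basis, P_basis, betas, pose_feature):
    # Formula 2: T_p(beta, theta) = T_bar + B_s(beta) + B_p(theta)
    # T_bar:   (V, 3) mean template vertices
    # S_basis: (V, 3, num_betas) shape blend-shape basis
    # P_basis: (V, 3, num_pose)  pose blend-shape basis
    return T_bar + S_basis @ betas + P_basis @ pose_feature

def linear_blend_skinning(T_posed, joint_transforms, weights):
    # Formula 1: M(beta, theta) = W(T_p(beta, theta), J(beta), theta, omega)
    # joint_transforms: (J, 4, 4) rigid transform per joint, built from theta and J(beta)
    # weights:          (V, J)    skinning weights omega
    num_verts = T_posed.shape[0]
    homo = np.concatenate([T_posed, np.ones((num_verts, 1))], axis=1)   # (V, 4)
    per_vertex = np.einsum('vj,jrc->vrc', weights, joint_transforms)    # (V, 4, 4)
    skinned = np.einsum('vrc,vc->vr', per_vertex, homo)                 # (V, 4)
    return skinned[:, :3]
```

In use, blend_shapes would be evaluated first and its output fed to linear_blend_skinning together with the joint transforms computed from the pose parameters θ.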
Referring to fig. 4, a specific fitting process mainly includes the following steps:
s3020_ 1: and extracting a target skeleton and a first human body contour of the target object aiming at the first video frame.
In S3020_1, a 2D human skeleton extraction algorithm is used to extract a target skeleton of the target object from the first video frame, and a human image segmentation algorithm is used to extract a human region of the target object from the first video frame, so as to obtain a first human body contour of the target object.
S3020_ 2: initializing the shape parameters and the posture parameters of the target skeleton, and initializing a human body parameterized model through skin function deformation.
When executing S3020_2, a group of shape parameters and pose parameters is estimated from the extracted target skeleton, the estimates are taken as the initialized shape and pose parameters, and the initial parameterized human body model is deformed through the skinning function.
S3020_ 3: and extracting a reference framework of the deformed human body parameterized model, and determining framework loss values of the reference framework and the target framework.
A skeleton loss value between the reference skeleton extracted from the deformed parameterized human body model and the 2D target skeleton is obtained and recorded as loss1.
S3020_ 4: and according to the internal parameters of the RGB camera, projecting the reference skeleton into the first video frame to optimize the skeleton loss value and obtain the external parameters and new attitude parameters of the RGB camera.
Generally, the internal parameters of the RGB camera are known at the factory; to improve the accuracy of solving the shape and pose parameters, the internal parameters of the RGB camera can also be determined by pre-calibration.
In performing S3020_4, a reference skeleton is projected into the first video frame according to the internal parameters of the RGB camera, the reference skeleton is made to coincide with the target skeleton to optimize (reduce) loss1, and new pose parameters and external parameters of the RGB camera are obtained.
S3020_ 5: and projecting the vertex of the deformed human body parameterized model into a first video frame according to the new posture parameter and the external parameter to obtain a second human body contour.
And when S3020_5 is executed, fixing external parameters of the RGB camera, taking the new posture parameters as initial values, projecting the vertex of the deformed human body parameterized model into a first video frame, and obtaining a second human body contour corresponding to the human body parameterized model.
S3020_ 6: and updating the body parameters and the posture parameters according to the contour loss values of the first human body contour and the second human body contour.
And recording the contour loss values of the first human body contour and the second human body contour as loss2, updating the body parameters and the posture parameters of the target object contained in the first video frame by reducing loss2, taking the updated body parameters as the final body parameters of the target object, and taking the updated posture parameters as the posture parameters of the target object contained in the first video frame.
S3020_ 7: and fixing external parameters of the RGB camera and body parameters of the target object, and re-determining a skeleton loss value and a contour loss value aiming at the non-first video frame so as to determine the corresponding attitude parameters of the non-first video frame.
When executing S3020_7, for each non-first video frame, the determination of the skeleton loss value loss1 and the contour loss value loss2 is the same as for the first video frame and is not repeated here; the pose parameters of the target object contained in each non-first video frame are obtained through the skeleton loss value loss1 and the contour loss value loss2.
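A minimal PyTorch sketch of such a per-frame fitting loop is given below for illustration; it is not the patent's implementation. The helpers smpl_forward, project and render_silhouette are hypothetical, caller-supplied differentiable functions (model forward pass, camera projection and silhouette rasterization), and the loss weights and step counts are assumptions.

```python
import torch

def fit_frame(skeleton_2d, mask, betas, theta_init, cam_init,
              smpl_forward, project, render_silhouette, steps=200, lr=1e-2):
    # skeleton_2d: (J, 2) target skeleton detected in the video frame
    # mask:        (H, W) human body contour from portrait segmentation, values in [0, 1]
    theta = theta_init.clone().requires_grad_(True)
    cam = cam_init.clone().requires_grad_(True)
    optim = torch.optim.Adam([theta, cam], lr=lr)
    for _ in range(steps):
        verts, joints = smpl_forward(betas, theta)
        loss1 = torch.nn.functional.mse_loss(project(joints, cam), skeleton_2d)  # skeleton loss
        silhouette = render_silhouette(verts, cam).clamp(0.0, 1.0)               # second contour
        loss2 = torch.nn.functional.binary_cross_entropy(silhouette, mask)       # contour loss
        optim.zero_grad()
        (loss1 + loss2).backward()
        optim.step()
    return theta.detach(), cam.detach()
```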
After the shape parameters of the target object are obtained, in step S302 the initial parameterized human body model is voxelized according to a voxel space with a set resolution (e.g., 512 × 512 or higher), the T-pose or A-pose parameterized human body model is mapped into the voxel space, and a skinned reference frame model is generated based on the voxelization result, with reference to fig. 5:
s3021: and according to the shape parameters of the target object, deforming the initial human body parameterized model to obtain the target human body parameterized model in the preset posture.
And the shape parameters of the target object are extracted from the first video frame, and when S3021 is executed, the shape parameters of the target object are substituted into formula 1 and formula 2, and the SMPL model is deformed to obtain the target human body parameterized model with the preset posture corresponding to the target object.
S3022: the target human parametric model is aligned to the origin of coordinates of the voxel space.
And when S3022 is executed, aligning the coordinate origin of the target human body parameterized model with the coordinate origin of the voxel space, and after the alignment, dividing the whole target human body parameterized model into front and back connected voxel blocks by the voxel space, wherein each voxel block has the same volume and the same length, width and height.
S3023: and generating a reference frame model with skin according to the SDF value of each vertex in the voxel space in the target human body parameterized model by using the set first deep neural network.
After the target parameterized model is aligned with the coordinate origin of the voxel space, a Signed Distance Field (SDF) value in the voxel space can be determined for each vertex of the target parameterized model.
The SDF value on the real surface of the target object in the voxel space is 0; a positive SDF value indicates a point outside the target object, and a negative SDF value indicates a point inside the target object.
Assume a first deep neural network, denoted net1, composed of multi-layer perceptron (MLP) fully connected layers. When executing S3023, the three-dimensional coordinates P(x, y, z) of each vertex in the voxel space and a preset network parameter η are input into the first deep neural network; the network learns the isosurface with zero SDF value to form the implicit expression of the skinned reference frame model, which is rendered to obtain an explicitly expressed reference frame model for display.
Specifically, referring to fig. 6, the process of determining the reference frame model with skin mainly includes the following steps:
s3023_ 1: and determining the SDF value of each vertex in the voxel space by adopting a first deep neural network according to the three-dimensional coordinates of each vertex in the voxel space in the target human body parameterized model.
After the target parameterized model is aligned with the origin of coordinates in the voxel space, each vertex of the target parameterized model can be determined, and the three-dimensional coordinates in the corresponding voxel block in the voxel space are denoted as P (x, y, z). Each three-dimensional coordinate can reflect the spatial position relationship between the vertex and the voxel block, and the SDF value of the vertex in the voxel space is obtained according to the spatial position relationship.
S3023_ 2: and extracting a vertex set with the SDF value of 0 to obtain the implicit expression of the reference frame model with the skin.
The vertex set with the SDF value of zero forms a zero isosurface which can be used as the implicit expression of a reference frame model with skin, and the implicit expression formula is as follows:
S_η = {P ∈ R³ | f(P; η) = 0}    (formula 3)

where S_η denotes the vertex set whose SDF value is 0, P is the three-dimensional coordinate of a vertex in the voxel space, η denotes the network parameters of the first deep learning network (optimized through learning), and f denotes the first deep neural network.
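For illustration, a minimal PyTorch sketch of what such an MLP f(P; η) could look like is given below; the layer count, width and activation are assumptions, not the patent's actual architecture.

```python
import torch
import torch.nn as nn

class SDFNet(nn.Module):
    # Maps a 3D point P to a signed distance value; the zero level set
    # {P | f(P; eta) = 0} is the implicit expression of the skinned
    # reference frame model (formula 3).
    def __init__(self, hidden=256, depth=8):
        super().__init__()
        layers, in_dim = [], 3
        for _ in range(depth):
            layers += [nn.Linear(in_dim, hidden), nn.Softplus(beta=100)]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, 1))
        self.mlp = nn.Sequential(*layers)

    def forward(self, points):        # points: (N, 3) voxel-space coordinates
        return self.mlp(points)       # (N, 1) SDF values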
S3023_ 3: and rendering a zero isosurface formed by all the vertexes in the vertex set to obtain a reference frame model with skin.
In specific implementation, the Marching Cubes algorithm is used to render the zero isosurface formed by the vertex set with an SDF value of zero, extracting the three-dimensional mesh model of the target object and obtaining an explicitly expressed, skinned reference frame model for display.
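A minimal sketch of this extraction step, assuming the SDFNet-style network above and using scikit-image's Marching Cubes implementation, could look as follows; the grid resolution and bounds are placeholders.

```python
import numpy as np
import torch
from skimage import measure

def extract_reference_mesh(sdf_net, resolution=256, bound=1.0):
    # Sample the implicit SDF on a regular voxel grid, then run Marching Cubes on the
    # zero level set to obtain an explicit triangle mesh of the reference frame model.
    axis = np.linspace(-bound, bound, resolution, dtype=np.float32)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing='ij'), axis=-1).reshape(-1, 3)
    with torch.no_grad():
        sdf = sdf_net(torch.from_numpy(grid)).numpy()
    volume = sdf.reshape(resolution, resolution, resolution)
    spacing = (2.0 * bound / (resolution - 1),) * 3
    verts, faces, normals, _ = measure.marching_cubes(volume, level=0.0, spacing=spacing)
    return verts - bound, faces   # shift vertices back into the [-bound, bound] cube
```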
S303: and respectively deforming the reference frame model according to the attitude parameters of the target object contained in each video frame and the three-dimensional coordinates of each vertex in the reference frame model in the corresponding voxel block to obtain the target geometric model.
Taking a video frame as an example, the following specific deformation process is shown in fig. 7:
s3031: and mapping the reference frame model to the video frame according to the attitude parameters of the target object contained in the video frame.
In the embodiment of the application, the first deep neural network generates the skinned reference frame model from the initialization result, and this is used as the initial reference frame model. In step S3031, according to the pose parameters of the target object contained in the video frame, the implicit expression and the explicit expression of the initial reference frame model are mapped into the video frame, and the three-dimensional coordinates of each vertex of the mapped reference frame model within its corresponding voxel block are determined. These three-dimensional coordinates are input into the set second deep neural network as the condition variable for deforming the mapped reference frame model.
S3032: and determining the non-rigid deformation relation between the video frame and the reference frame model by using the set second deep neural network according to the shape parameters of the target object contained in the video frame and the three-dimensional coordinates of each vertex in the corresponding voxel block in the mapped reference frame model.
The second deep neural network may also be composed of MLP full connections, similar to the first deep neural network, and is denoted as net 2.
In step S3032, the pose parameters of the target object contained in the video frame and the three-dimensional coordinates of each vertex of the mapped reference frame model in the corresponding voxel block are input into the second deep neural network, which outputs the non-rigid deformation relationship d between the video frame and the reference frame model.
S3033: and deforming the reference frame model according to the non-rigid deformation relation.
When executing S3033, the skinning function W corresponding to the current pose parameters is used to determine the skinning deformation field D(·) corresponding to the video frame from the non-rigid deformation relationship, and the implicit expression and the explicit expression of the reference frame model are deformed according to the skinning deformation field D, respectively, to obtain the reference frame model corresponding to the current pose parameters.
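For illustration, a minimal sketch of such a per-frame warp is shown below: add the non-rigid offset d predicted by the second network to each reference vertex, then apply skinning with the frame's pose. The function and argument names are assumptions, not the patent's implementation.

```python
import torch

def warp_reference_to_frame(ref_vertices, offset_net, skin_weights, joint_transforms):
    # ref_vertices:     (V, 3) vertices of the reference frame model in voxel space
    # offset_net:       second deep neural network (net2), predicts the non-rigid offset d
    # skin_weights:     (V, J) skinning weights
    # joint_transforms: (J, 4, 4) per-joint rigid transforms built from the frame's pose parameters
    d = offset_net(ref_vertices)                                              # non-rigid deformation
    deformed = ref_vertices + d
    homo = torch.cat([deformed, torch.ones_like(deformed[:, :1])], dim=1)     # (V, 4)
    per_vertex = torch.einsum('vj,jrc->vrc', skin_weights, joint_transforms)  # (V, 4, 4)
    posed = torch.einsum('vrc,vc->vr', per_vertex, homo)
    return posed[:, :3]
```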
In the embodiment of the application, in the process of deforming the reference frame model by using each video frame, the loss before and after deformation of the reference frame model corresponding to the previous video frame is used for optimizing the deformation parameter of the next video frame, i.e. the non-rigid deformation relationship, and the specific process is as shown in fig. 8:
s3034: and projecting the deformed reference frame model into the video frame according to the parameters of the RGB camera.
And when the step S3034 is executed, projecting the reference frame model corresponding to the current attitude parameter to the video frame corresponding to the current attitude parameter according to the internal parameter and the external parameter of the RGB camera.
S3035: and optimizing the implicit expression of the reference frame model according to the loss value between the outer contour of the projected reference frame model and the outer contour of the target object obtained by segmentation in the video frame.
When executing S3035, adopting a portrait segmentation algorithm to extract the outer contour of the target object contained in the video frame, and extracting the outer contour of the projected reference frame model, calculating the loss values of the two, and adjusting the network parameters of the first deep neural network through the loss values so as to optimize the implicit expression of the reference frame model and improve the reality of the virtual character.
S3036: first intersection points of the non-rigid projection rays and the implicit expression of the optimized reference frame model and second intersection points of the non-rigid projection rays and the implicit expression of the optimized reference frame model are determined.
And when S3036 is executed, adjusting network parameters of the first deep neural network according to the loss value, obtaining the implicit expression of the optimized reference frame model, and then determining intersection points (denoted as first intersection points) of the non-rigid projection ray and the implicit expression of the optimized reference frame model and intersection points (denoted as second intersection points) of the non-rigid projection ray and the implicit expression of the reference frame model before optimization respectively. The SDF difference value between the first intersection point and the second intersection point can fully reflect the detail characteristics of the reference frame model, and the reality of the virtual role is improved.
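One common way to compute an intersection between a ray and an implicit SDF surface is sphere tracing; the patent does not specify the intersection method, so the sketch below is only an illustrative assumption.

```python
import torch

def sphere_trace(sdf_net, origins, directions, steps=64, eps=1e-4):
    # origins, directions: (N, 3) ray origins and normalized directions.
    # March each ray forward by the SDF value returned by the implicit network
    # until the surface f(P; eta) = 0 is (approximately) reached.
    t = torch.zeros(origins.shape[0], 1)
    for _ in range(steps):
        points = origins + t * directions
        dist = sdf_net(points)
        t = t + dist
        if bool((dist.abs() < eps).all()):
            break
    return origins + t * directions   # (N, 3) approximate intersection points
```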
S3037: and transforming the voxel block corresponding to the SDF difference value between the first intersection point and the second intersection point into a voxel space according to the attitude parameters of the target object contained in the video frame so as to optimize the non-rigid deformation relation of the next video frame for deforming the reference frame model.
When S3037 is executed, the voxel block corresponding to the SDF difference between the first and second intersection points is transformed into the voxel space according to the pose parameters of the target object contained in the video frame, and the network parameters of the second deep neural network are adjusted accordingly, so that when the second deep neural network determines the non-rigid deformation relationship used by the next video frame to deform the implicit expression and the explicit expression of the reference frame model, the accuracy of that relationship is improved, further improving the accuracy of the virtual character.
The method steps of fig. 7 and 8 are repeated until the deformation of the reference frame model by the last video frame is completed, and a final target geometric model is obtained, which has topological structure consistency, see fig. 9.
In the embodiment of the application, each time the reference frame model is deformed according to the pose parameters of the target object contained in a video frame, the network parameters of the first and second deep neural networks are adjusted; each adjustment helps improve the accuracy of the non-rigid deformation relationship when the reference frame model is deformed for the next frame, so that the deformed reference frame model is consistent with the pose of the target object and the realism of the virtual character is improved.
S304: and performing texture mapping on the target geometric model according to the RGB information of each video frame to obtain the virtual role of the target object.
In the embodiment of the application, the collected video of the target object rotating for one circle in the preset posture comprises all RGB information of the target object under multiple angles, when S304 is executed, the RGB information can be obtained based on each video frame, a complete texture map of the target object is obtained, after texture mapping is carried out on the target geometric model, a virtual character of the target object is obtained, the virtual character and the target object have the same skin, and the appearance of the target object can be truly reflected.
In S304, when texture mapping is performed on the target geometric model, the correspondence between the vertices of the target geometric model and the pixels of the texture map needs to be determined. Referring to fig. 10, the specific process is as follows:
s3041: a blank first texture map with a fixed resolution is generated in advance.
For example, the resolution of the first texture map is set to 2048 × 2048 px. Since the first texture map is blank, it can be rendered subsequently by the RGB information in each video frame to obtain a complete texture map that is consistent with the real skin of the target object.
S3042: rasterizing the target geometric model, and determining a first corresponding relation between a vertex included by each triangle patch after rasterizing and a pixel point in a prefabricated second texture map, wherein the resolution of the second texture map is the same as that of the first texture map.
In S3042, rasterizing the target geometric model yields a number of triangle patches, each containing three vertices; each vertex corresponds to an index, which can be characterized by the triangle patch it belongs to and its barycentric coordinates within that patch. The vertices contained in each triangle patch are upsampled, and a first correspondence between each vertex and a pixel in the second texture map is determined; this first correspondence can be characterized by the vertex index and the pixel coordinates.
S3043: and migrating the first corresponding relation to the first texture map to obtain a second corresponding relation between the vertex of the target geometric model and the pixel points of the first texture map.
In S3043, since the resolution of the first texture map and the resolution of the second texture map are the same, that is, the vertex of the target geometric model has a corresponding pixel in the first texture map and a corresponding pixel in the second texture map, the first corresponding relationship between the vertex of the target geometric model and the pixel in the second texture map may be migrated to the first texture map, and the second corresponding relationship between the vertex of the target geometric model and the pixel in the first texture map is obtained and stored.
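For illustration, the sketch below converts per-vertex UV coordinates into pixel coordinates of a texture map of a given resolution; because the first and second texture maps share the same resolution, the same vertex-to-pixel correspondence applies to both, which is the migration step. The UV source and rounding convention are assumptions.

```python
import numpy as np

def vertex_to_texel(uv_coords, resolution=2048):
    # uv_coords: (V, 2) per-vertex UV coordinates of the topology-consistent model,
    # assumed here to come from the prefabricated second texture map's UV layout.
    u = np.clip((uv_coords[:, 0] * (resolution - 1)).round().astype(int), 0, resolution - 1)
    v = np.clip(((1.0 - uv_coords[:, 1]) * (resolution - 1)).round().astype(int), 0, resolution - 1)
    return np.stack([u, v], axis=1)   # (V, 2) integer pixel coordinates
```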
Furthermore, after the first texture map is rendered according to the RGB information extracted from each video frame, complete texture data of the target object can be obtained, and then texture mapping of the target geometric model is completed by using a second corresponding relation between the vertex of the target geometric model and the pixel point of the first texture map. The specific process is shown in fig. 11, which mainly includes the following steps:
s3044: for each video frame, a third correspondence between pixel points in the video frame and vertices of the target geometric model is determined.
When S3044 is executed, a skeleton corresponding to the target geometric model is extracted, the target geometric model is mapped to the video frame through the internal parameters and the external parameters of the RGB camera, and a third correspondence between a pixel point in the video frame and a vertex of the target geometric model is determined.
S3045: and rendering a blank first texture map by using the RGB information of the pixel points in the video frame according to the third corresponding relation and the second corresponding relation.
In S3045, according to the third corresponding relationship between the pixel point in the video frame and the vertex of the target geometric model and the second corresponding relationship between the vertex of the target geometric model and the pixel point of the first texture map, the corresponding pixel point of the pixel point in the video frame in the first texture map may be determined, and the RGB information of the pixel point in the video frame is assigned to the corresponding pixel point in the first texture map, so as to complete the rendering of the first texture map.
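A minimal sketch of this rendering step, chaining the third correspondence (frame pixel per vertex) with the second correspondence (texture pixel per vertex), is shown below; the visibility mask and coordinate conventions are assumptions.

```python
import numpy as np

def paint_texture(texture, frame_rgb, frame_pixels, texel_coords, visible):
    # texture:      (R, R, 3) first texture map being rendered (initially blank)
    # frame_rgb:    (H, W, 3) RGB information of the current video frame
    # frame_pixels: (V, 2)    projection of each model vertex into the frame (third correspondence)
    # texel_coords: (V, 2)    texture pixel of each model vertex (second correspondence)
    # visible:      (V,) bool mask of vertices visible in this frame
    for vert in np.nonzero(visible)[0]:
        x, y = frame_pixels[vert]
        u, v = texel_coords[vert]
        texture[v, u] = frame_rgb[y, x]   # assign the frame's RGB value to the texel
    return texture
```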
S3046: and performing texture mapping on the target geometric model according to the first texture map rendered by each video frame to obtain the virtual role of the target object.
In S3046, after the RGB information of the pixels in every video frame has been assigned to the corresponding pixels of the first texture map, the first texture map contains the complete texture data of the target object. Therefore, after texture mapping the target geometric model with the first texture map, the virtual character of the target object is obtained (see fig. 12); because the first texture map contains the real skin data (clothes, hair, etc.) of the target object, the appearance of the virtual character is consistent with the target object, improving the realism of the virtual character.
In the above method for generating a virtual character, a consumer-grade RGB camera collects and stores a video of the target object rotating through one full circle in a preset posture, and the front image, which is rich in detail, is used as the first video frame to extract the target object's shape parameters; after the shape parameters are used to voxelize the initial parameterized human body model with a consistent topology, a skinned reference frame model is generated. The reference frame model is then deformed separately according to the pose parameters contained in each video frame and the three-dimensional coordinates of each vertex of the reference frame model within its corresponding voxel block, yielding the target geometric model. During deformation, the loss of the reference frame model before and after deformation for the previous video frame is used to optimize the deformation parameters of the next video frame, so that the target geometric model better matches the target object and its realism is improved; after texture mapping the target geometric model with the RGB information of each video frame, a more realistic virtual character is obtained whose skin truly depicts the appearance of the target object. Throughout the generation process the deformation parameters are optimized automatically, without an animator's involvement, saving manpower and material resources and improving the generation efficiency of the virtual character.
After the virtual character of the target object is obtained, the virtual character can be used for remote three-dimensional communication so as to improve the face-to-face immersion in the interaction process. Referring to fig. 13, a specific interaction process mainly includes the following steps:
s305: and acquiring an RGB image of the target object acquired by the RGB camera in the interaction process.
In the interaction process, the target object can move in any posture, and an RGB (red, green and blue) camera acquires an RGB image in the interaction process in real time and transmits the RGB image to a GPU (graphics processing unit) of the interaction equipment.
S306: and extracting the motion data of the target object according to the RGB image.
Through a motion capture algorithm, the skeleton nodes of the target object are extracted from the RGB image, and motion data of the skeleton nodes, such as the rotation angles of the skeleton nodes, the coordinates of the skeleton nodes and the like, are obtained.
S307: and driving the virtual character to move according to the movement data so as to match the real action of the target object.
The obtained motion data of the skeleton nodes is used, according to the driving relationship between the skeleton nodes and the model vertices, to drive the virtual character to move, so that the motion of the driven virtual character is consistent with the current real motion of the target object.
In the embodiment of the application, the motion data of the target object occupies a small data volume compared with three-dimensional model data; it meets the real-time requirement of remote three-dimensional communication under existing network bandwidth, reduces stuttering during communication, and improves the user's immersive experience.
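To illustrate the data-volume difference, the back-of-the-envelope comparison below contrasts transmitting the full mesh of the virtual character every frame with transmitting only skeleton motion data; the mesh and joint counts are illustrative assumptions, not figures from the patent.

```python
# Illustrative payload comparison per frame (assumed sizes).
num_vertices, num_faces, num_joints = 50_000, 100_000, 24

mesh_bytes = num_vertices * 3 * 4 + num_faces * 3 * 4   # float32 positions + int32 face indices
motion_bytes = num_joints * 3 * 4 + 3 * 4               # per-joint rotations + root translation

print(f"full mesh: {mesh_bytes / 1e6:.2f} MB per frame")
print(f"motion data: {motion_bytes} bytes per frame")
```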
Based on the same technical concept, embodiments of the present application provide an interactive device, which is capable of performing the steps of the method for generating a virtual character provided in the foregoing embodiments and achieving the same technical effects, and details are not repeated herein.
Referring to fig. 14A, the interaction device comprises a processor 1401, a memory 1402, a display 1403 and a communication interface 1404; the communication interface 1404, the display 1403, the memory 1402 and the processor 1401 are connected by a bus 1405;
the memory 1402 comprises a data storage unit and a program storage unit, the program storage unit storing computer program instructions, and the processor 1401 executing the computer program instructions to perform the following operations:
through the communication interface 1404, a video of a target object which is acquired by an RGB camera and rotates for a circle in a preset posture is acquired and stored in the data storage unit, and a front image of the target object is taken as a first video frame to extract a shape parameter of the target object;
initializing a voxel space with set resolution according to an initial human body parameterized model and the shape parameters of the target object, and generating a reference frame model with skin according to an initialization result;
according to the attitude parameters of the target object contained in each video frame and the three-dimensional coordinates of each vertex in the reference frame model in the corresponding voxel block, respectively deforming the reference frame model to obtain a target geometric model; in the deformation process, the loss before and after the deformation of the reference frame model corresponding to the previous video frame is used for optimizing the deformation parameter of the next video frame;
performing texture mapping on the target geometric model according to the RGB information of each video frame to obtain the virtual character of the target object, and displaying the virtual character on the display 1403.
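Taken together, the four operations above form the following pipeline. The sketch below fixes only the control flow; every step is delegated to a callable whose name is a placeholder for the corresponding component described in this embodiment.

```python
def generate_virtual_character(video_frames, camera, shape_extractor,
                               build_reference_model, deform_to_frame, apply_texture):
    # Step 1: shape parameters from the front-facing first video frame
    first_frame = video_frames[0]
    shape_params = shape_extractor(first_frame)

    # Step 2: voxelised, skinned reference frame model
    reference_model = build_reference_model(shape_params)

    # Step 3: per-frame non-rigid deformation; the previous frame's deformation
    # loss is carried forward to optimise the next frame's deformation parameters
    prev_loss = None
    for frame in video_frames:
        reference_model, prev_loss = deform_to_frame(reference_model, frame, camera, prev_loss)

    # Step 4: texture mapping from the RGB information of every video frame
    return apply_texture(reference_model, video_frames, camera)
```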
Optionally, the processor 1401 initializes a voxel space with a set resolution according to the initial human parametric model and the shape parameter, and generates a reference frame model with skin according to an initialization result, and specifically performs the following operations:
according to the shape parameters, deforming the initial human body parameterized model to obtain a target human body parameterized model in a preset posture;
aligning the target human parametric model with a coordinate origin of the voxel space;
and generating a reference frame model with skin according to the SDF value of each vertex in the voxel space in the target human body parameterized model by using the set first deep neural network.
Optionally, the processor 1401 generates a reference frame model with skin according to the SDF value of each vertex in the target human body parameterized model in the voxel space by using the set first deep neural network, and specifically operates as follows:
determining the SDF value of each vertex in the voxel space according to the three-dimensional coordinate of each vertex in the target human body parameterized model in the voxel space;
extracting a vertex set with an SDF value of 0 to obtain implicit expression of a reference frame model with skin;
and rendering the zero iso-surface formed by all the vertices in the vertex set, so as to generate a reference frame model with skin in an explicit, displayable expression.
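The last two steps, extracting the zero-value vertex set and rendering the zero iso-surface, can be pictured with a small example: given signed distance values sampled on a voxel grid, marching cubes recovers the explicit, displayable mesh from the implicit expression. The sketch below uses scikit-image's `marching_cubes` and a hypothetical `sdf_network`; it illustrates iso-surface extraction in general, not the claimed first deep neural network.

```python
import numpy as np
from skimage import measure

def sdf_grid_from_model(sdf_network, resolution=128, bound=1.0):
    """Evaluate an SDF predictor on a regular voxel grid centred on the origin."""
    axis = np.linspace(-bound, bound, resolution, dtype=np.float32)
    xs, ys, zs = np.meshgrid(axis, axis, axis, indexing='ij')
    points = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)
    sdf = sdf_network(points).reshape(resolution, resolution, resolution)
    return sdf, float(axis[1] - axis[0])

def extract_skinned_mesh(sdf_network, resolution=128, bound=1.0):
    sdf, spacing = sdf_grid_from_model(sdf_network, resolution, bound)
    # the zero level set of the SDF is the surface of the reference frame model
    verts, faces, normals, _ = measure.marching_cubes(sdf, level=0.0,
                                                      spacing=(spacing, spacing, spacing))
    verts -= bound                    # shift back so the grid is centred on the origin
    return verts, faces, normals
```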
Optionally, the processor 1401 deforms the reference frame model according to the pose parameter of the target object included in each video frame and the three-dimensional coordinates of each vertex in the reference frame model in the corresponding voxel block, and specifically:
for each video frame, mapping the reference frame model to the video frame according to the attitude parameters of the target object contained in the video frame;
determining a non-rigid deformation relation between the video frame and the reference frame model by using a set second deep neural network according to the attitude parameters and the three-dimensional coordinates of each vertex in the mapped reference frame model in the corresponding voxel block;
and deforming the reference frame model according to the non-rigid deformation relation.
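One way to picture the non-rigid deformation relation is as a network that, given a vertex's coordinate inside its voxel block and the frame's posture parameters, predicts a small per-vertex displacement. The PyTorch sketch below is an assumption about one possible form of such a network, not the architecture of the second deep neural network set in this embodiment.

```python
import torch
import torch.nn as nn

class NonRigidDeformationField(nn.Module):
    """Predicts a per-vertex displacement from (local voxel-block coordinate, posture parameters)."""
    def __init__(self, pose_dim: int = 72, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, local_coords: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
        # local_coords: (V, 3) vertex coordinates inside their corresponding voxel blocks
        # pose:         (pose_dim,) posture parameters of the current video frame
        pose_expanded = pose.unsqueeze(0).expand(local_coords.shape[0], -1)
        return self.mlp(torch.cat([local_coords, pose_expanded], dim=-1))

def deform_reference_model(vertices, local_coords, pose, field):
    """Apply the predicted non-rigid offsets to the mapped reference frame model."""
    with torch.no_grad():
        offsets = field(local_coords, pose)
    return vertices + offsets
```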
Optionally, after deforming the reference frame model for each video frame, the processor 1401 further performs:
projecting the deformed reference frame model into the video frame according to the parameters of the RGB camera;
optimizing the implicit expression of the reference frame model according to the projected outer contour of the reference frame model and the loss value between the outer contours of the target object obtained by segmentation in the video frame;
determining each first intersection point of the non-rigid projection rays with the implicit expression of the optimized reference frame model, and each second intersection point of the non-rigid projection rays with the implicit expression of the reference frame model before optimization;
and transforming the voxel block corresponding to the SDF difference value between the first intersection point and the second intersection point into the voxel space according to the attitude parameters of the target object contained in the video frame so as to optimize the non-rigid deformation relation of the next video frame for deforming the reference frame model.
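The contour comparison in the second of the steps above can be sketched as follows: the deformed model is projected with the camera parameters, rasterised to a binary silhouette (the rasteriser is a placeholder here), and scored against the segmented target mask with an IoU-style loss. This is a hedged illustration of one possible loss term, not the patented optimisation of the implicit expression.

```python
import numpy as np

def project_vertices(vertices, K, R, t):
    """Pinhole projection of (V, 3) vertices with intrinsics K and extrinsics [R | t]."""
    cam = vertices @ R.T + t                    # camera-space coordinates
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]             # (V, 2) pixel coordinates

def silhouette_loss(model_mask: np.ndarray, target_mask: np.ndarray) -> float:
    """1 - IoU between the projected model silhouette and the segmented target contour."""
    inter = np.logical_and(model_mask, target_mask).sum()
    union = np.logical_or(model_mask, target_mask).sum()
    return 1.0 - inter / max(union, 1)

def contour_loss_for_frame(vertices, K, R, t, target_mask, rasterize_silhouette):
    pixels = project_vertices(vertices, K, R, t)
    model_mask = rasterize_silhouette(pixels, target_mask.shape)   # placeholder rasteriser
    return silhouette_loss(model_mask, target_mask)
```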
Optionally, the processor 1401 performs texture mapping on the target geometric model according to RGB information of each video frame to obtain a virtual character of the target object, and specifically operates to:
rasterizing the target geometric model, and determining a first corresponding relation between a vertex included in each triangle patch after rasterizing and a pixel point in a prefabricated second texture map, wherein the resolution of the second texture map is the same as that of the first texture map, and the first texture map is a prefabricated blank texture map;
migrating the first corresponding relation to the first texture map to obtain a second corresponding relation between a vertex of the target geometric model and a pixel point of the first texture map;
for each video frame, determining a third corresponding relation between a pixel point in the video frame and a vertex of the target geometric model;
rendering the first texture map by using RGB information of pixel points in the video frame according to the third corresponding relation and the second corresponding relation;
and performing texture mapping on the target geometric model according to the first texture map rendered by each video frame to obtain the virtual role of the target object.
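The texture-filling idea in the steps above can be sketched as scattering per-vertex colours into a blank UV map: each vertex of the target geometric model has a UV coordinate in the first texture map (the second corresponding relation) and a pixel coordinate in the current frame (the third corresponding relation), and each video frame contributes RGB samples where the vertex is visible. The names and the visibility mask below are illustrative assumptions.

```python
import numpy as np

def fill_texture_map(texture, uv_coords, vertex_pixels, frame_rgb, visible):
    """
    texture:       (H, W, 3) first texture map being rendered (blank initially)
    uv_coords:     (V, 2) per-vertex UV coordinates in [0, 1]
    vertex_pixels: (V, 2) per-vertex pixel coordinates in the current video frame
    frame_rgb:     (h, w, 3) RGB video frame
    visible:       (V,) boolean mask of vertices visible in this frame
    """
    H, W, _ = texture.shape
    h, w, _ = frame_rgb.shape
    for v in np.nonzero(visible)[0]:
        u_tex = int(np.clip(uv_coords[v, 0] * (W - 1), 0, W - 1))
        v_tex = int(np.clip(uv_coords[v, 1] * (H - 1), 0, H - 1))
        x_img = int(np.clip(vertex_pixels[v, 0], 0, w - 1))
        y_img = int(np.clip(vertex_pixels[v, 1], 0, h - 1))
        texture[v_tex, u_tex] = frame_rgb[y_img, x_img]
    return texture
```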
Optionally, after obtaining the virtual role of the target object, the processor 1401 further performs:
acquiring an RGB image of the target object acquired by the RGB camera in the interaction process;
extracting motion data of the target object according to the RGB image;
and driving the virtual character to move according to the movement data so as to match the real action of the target object.
Optionally, the processor 1401 determines the shape parameters and the posture parameters as follows:
for the first video frame, extracting a target skeleton and a first human body contour of the target object;
initializing the shape parameters and the posture parameters of the target skeleton, and deforming the initial human body parameterized model through a skinning function;
extracting a reference skeleton from the deformed human body parameterized model, and determining the skeleton loss value between the reference skeleton and the target skeleton;
projecting the reference skeleton into the first video frame according to the internal parameters of the RGB camera to optimize the skeleton loss value, thereby obtaining the external parameters of the RGB camera and new posture parameters;
projecting the vertices of the deformed human body parameterized model into the first video frame according to the new posture parameters and the external parameters to obtain a second human body contour;
updating the shape parameters and the posture parameters according to the contour loss value between the first human body contour and the second human body contour;
and fixing the external parameters of the RGB camera and the shape parameters of the target object, and re-determining the skeleton loss value and the contour loss value for each non-first video frame so as to determine the posture parameters corresponding to that video frame.
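The alternation described above can be pictured as a small optimisation loop: the skeleton loss and the contour loss are summed and minimised over the shape and posture parameters for the first video frame, and over the posture parameters only for later frames. Every model and projection function in the sketch is a placeholder for the corresponding component of the embodiment, and the loop assumes those placeholders are differentiable.

```python
import torch
import torch.nn.functional as F

def fit_frame(params, frame_data, body_model, project_skeleton, project_contour,
              optimize_shape=True, steps=200, lr=1e-2):
    """params: dict with 'shape' and 'pose' tensors created with requires_grad=True."""
    variables = [params['pose']] + ([params['shape']] if optimize_shape else [])
    optimizer = torch.optim.Adam(variables, lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        joints3d, vertices = body_model(params['shape'], params['pose'])   # skinning function
        skeleton_loss = F.mse_loss(project_skeleton(joints3d), frame_data['target_skeleton'])
        contour_loss = F.mse_loss(project_contour(vertices), frame_data['target_contour'])
        loss = skeleton_loss + contour_loss
        loss.backward()
        optimizer.step()
    return params

# First video frame: optimise shape + posture; later frames: posture only,
# with the shape parameters and camera external parameters held fixed.
```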
It should be noted that fig. 14A shows only the hardware necessary for the interaction device to implement the virtual character generation method provided in the embodiments of the present application; optionally, the interaction device further includes conventional hardware such as a speaker and an audio processor.
In an alternative embodiment, the interactive device itself may incorporate the RGB camera 1406, see fig. 14B, with other hardware consistent with fig. 14A and not repeated here.
Embodiments of the present application also provide a computer-readable storage medium for storing instructions that, when executed, may implement the methods of the foregoing embodiments.
Embodiments of the present application further provide a computer program product comprising a computer program, where the computer program, when executed, performs the methods of the foregoing embodiments.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method for generating a virtual character, comprising:
acquiring a video of a target object which is acquired by an RGB camera and rotates for one circle in a preset posture, and taking a front image of the target object as a first video frame to extract a shape parameter of the target object;
initializing a voxel space with set resolution according to an initial human body parameterized model and the body parameters, and generating a reference frame model with skin according to an initialization result;
according to the attitude parameters of the target object contained in each video frame and the three-dimensional coordinates of each vertex in the reference frame model in the corresponding voxel block, respectively deforming the reference frame model to obtain a target geometric model; in the deformation process, the loss before and after the deformation of the reference frame model corresponding to the previous video frame is used for optimizing the deformation parameter of the next video frame;
and performing texture mapping on the target geometric model according to the RGB information of each video frame to obtain the virtual role of the target object.
2. The method of claim 1, wherein initializing a voxel space with a set resolution according to an initial human parametric model and the shape parameters, and generating a reference frame model with skin according to an initialization result comprises:
according to the shape parameters, deforming the initial human body parameterized model to obtain a target human body parameterized model in a preset posture;
aligning the target human parametric model with a coordinate origin of the voxel space;
and generating a reference frame model with skin according to the SDF value of each vertex in the voxel space in the target human body parameterized model by using the set first deep neural network.
3. The method of claim 2, wherein the generating a reference frame model with skin according to the SDF value of each vertex in the voxel space in the target human parametric model by using the set first deep neural network comprises:
determining the SDF value of each vertex in the voxel space according to the three-dimensional coordinate of each vertex in the target human body parameterized model in the voxel space;
extracting a vertex set with an SDF value of 0 to obtain implicit expression of a reference frame model with skin;
and rendering a zero isosurface formed by all the vertexes in the vertex set, and generating a reference frame model with skin for displaying expression.
4. The method according to claim 1, wherein said deforming the reference frame model according to the pose parameters of the target object included in each video frame and the three-dimensional coordinates of the vertices in the corresponding voxel block in the reference frame model comprises:
for each video frame, mapping the reference frame model to the video frame according to the attitude parameters of the target object contained in the video frame;
determining a non-rigid deformation relation between the video frame and the reference frame model by using a set second deep neural network according to the attitude parameters and the three-dimensional coordinates of each vertex in the mapped reference frame model in the corresponding voxel block;
and deforming the reference frame model according to the non-rigid deformation relation.
5. The method of claim 1, wherein after deforming the reference frame model for each video frame, the method further comprises:
projecting the deformed reference frame model into the video frame according to the parameters of the RGB camera;
optimizing the implicit expression of the reference frame model according to the projected outer contour of the reference frame model and the loss value between the outer contours of the target object obtained by segmentation in the video frame;
determining each first intersection point of the non-rigid projection ray and the implicit expression of the optimized reference frame model, and each second intersection point of the non-rigid projection ray and the implicit expression of the reference frame model before optimization;
and transforming a voxel block corresponding to the SDF difference value between the first intersection point and the second intersection point into the voxel space according to the attitude parameters of the target object contained in the video frame so as to optimize the non-rigid deformation relation of the next video frame for deforming the reference frame model.
6. The method of claim 1, wherein said texture mapping the target geometric model according to the RGB information of each video frame to obtain the virtual character of the target object comprises:
rasterizing the target geometric model, and determining a first corresponding relation between a vertex included by each rasterized triangular patch and a pixel point in a prefabricated second texture map, wherein the resolution of the second texture map is the same as that of a first texture map, and the first texture map is a prefabricated blank texture map;
migrating the first corresponding relation to the first texture map to obtain a second corresponding relation between a vertex of the target geometric model and a pixel point of the first texture map;
for each video frame, determining a third corresponding relation between a pixel point in the video frame and a vertex of the target geometric model;
rendering the first texture map by using RGB information of pixel points in the video frame according to the third corresponding relation and the second corresponding relation;
and performing texture mapping on the target geometric model according to the first texture map rendered by each video frame to obtain the virtual role of the target object.
7. The method of any of claims 1-6, wherein after obtaining the virtual character of the target object, the method further comprises:
acquiring an RGB image of the target object acquired by the RGB camera in the interaction process;
extracting motion data of the target object according to the RGB image;
and driving the virtual character to move according to the movement data so as to match the real action of the target object.
8. The method of any one of claims 1-6, wherein the morphological parameters and pose parameters are determined by:
extracting a target skeleton and a first human body contour of the target object aiming at a first video frame;
initializing the shape parameters and the posture parameters of the target skeleton, and deforming the initial human body parameterized model through a skin function;
extracting a reference framework of the deformed human body parameterized model, and determining framework loss values of the reference framework and the target framework;
projecting the reference skeleton into a first video frame according to the internal parameters of the RGB camera to optimize the skeleton loss value, and obtaining the external parameters and new attitude parameters of the RGB camera;
projecting the vertex of the deformed human body parameterized model into a first video frame according to the new posture parameter and the external parameter to obtain a second human body contour;
updating the body parameters and the posture parameters according to the contour loss values of the first human body contour and the second human body contour;
and fixing the external parameters of the RGB camera and the body parameters of the target object, and re-determining the skeleton loss value and the contour loss value aiming at the non-first video frame so as to determine the posture parameters corresponding to the non-first video frame.
9. An interaction device for generating a virtual character for remote interaction, the interaction device comprising a processor, a memory, a display and a communication interface, wherein the communication interface, the display, the memory and the processor are connected by a bus;
the memory includes a data storage unit and a program storage unit, the program storage unit storing computer program instructions, the processor executing according to the computer program to perform the following operations:
acquiring a video of a target object which is acquired by an RGB camera and rotates for one circle in a preset posture through the communication interface, storing the video into the data storage unit, and taking a front image of the target object as a first video frame to extract a shape parameter of the target object;
initializing a voxel space with set resolution according to an initial human body parameterized model and the shape parameters of the target object, and generating a reference frame model with skin according to an initialization result;
according to the attitude parameters of the target object contained in each video frame and the three-dimensional coordinates of each vertex in the reference frame model in the corresponding voxel block, respectively deforming the reference frame model to obtain a target geometric model; in the deformation process, the loss before and after the deformation of the reference frame model corresponding to the previous video frame is used for optimizing the deformation parameter of the next video frame;
and performing texture mapping on the target geometric model according to the RGB information of each video frame to obtain the virtual role of the target object, and displaying the virtual role by the display.
10. The interactive device of claim 9, wherein after the processor obtains the virtual character of the target object, the processor further performs:
acquiring an RGB image of the target object acquired by the RGB camera in an interaction process through the communication interface;
extracting motion data of the target object according to the RGB image;
and driving the virtual character to move to match the real posture of the target object according to the movement data, and displaying the driven virtual character by the display.
CN202210526179.9A 2022-05-16 2022-05-16 Virtual role generation method and equipment Pending CN114998514A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210526179.9A CN114998514A (en) 2022-05-16 2022-05-16 Virtual role generation method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210526179.9A CN114998514A (en) 2022-05-16 2022-05-16 Virtual role generation method and equipment

Publications (1)

Publication Number Publication Date
CN114998514A true CN114998514A (en) 2022-09-02

Family

ID=83026945

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210526179.9A Pending CN114998514A (en) 2022-05-16 2022-05-16 Virtual role generation method and equipment

Country Status (1)

Country Link
CN (1) CN114998514A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115393487A (en) * 2022-10-27 2022-11-25 科大讯飞股份有限公司 Virtual character model processing method and device, electronic equipment and storage medium
CN115393487B (en) * 2022-10-27 2023-05-12 科大讯飞股份有限公司 Virtual character model processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US10839585B2 (en) 4D hologram: real-time remote avatar creation and animation control
WO2021047396A1 (en) Image processing method and apparatus, electronic device and computer-readable storage medium
CN110889890B (en) Image processing method and device, processor, electronic equipment and storage medium
CN112837406B (en) Three-dimensional reconstruction method, device and system
CN113313818B (en) Three-dimensional reconstruction method, device and system
CN113628327B (en) Head three-dimensional reconstruction method and device
CN116109798B (en) Image data processing method, device, equipment and medium
CN107038745B (en) 3D tourist landscape roaming interaction method and device
CN112530005B (en) Three-dimensional model linear structure recognition and automatic restoration method
WO2021063271A1 (en) Human body model reconstruction method and reconstruction system, and storage medium
CN114821675B (en) Object processing method and system and processor
US11640687B2 (en) Volumetric capture and mesh-tracking based machine learning 4D face/body deformation training
AU2022231680B2 (en) Techniques for re-aging faces in images and video frames
CN114049464A (en) Reconstruction method and device of three-dimensional model
CN113421328A (en) Three-dimensional human body virtual reconstruction method and device
Li et al. 3d human avatar digitization from a single image
CN110458924B (en) Three-dimensional face model establishing method and device and electronic equipment
CN114998514A (en) Virtual role generation method and equipment
CN111754622B (en) Face three-dimensional image generation method and related equipment
EP3980975B1 (en) Method of inferring microdetail on skin animation
Hou et al. Real-time markerless facial motion capture of personalized 3D real human research
US20230274502A1 (en) Methods and systems for 3d modeling of a human subject having hair based on 2d imagery
Eisert et al. Hybrid human modeling: making volumetric video animatable
US20230386135A1 (en) Methods and systems for deforming a 3d body model based on a 2d image of an adorned subject
CN117274501B (en) Drivable digital person modeling method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination