CN116433826B - Avatar driving method, apparatus, device and medium - Google Patents

Avatar driving method, apparatus, device and medium

Info

Publication number
CN116433826B
Authority
CN
China
Prior art keywords
data
image
virtual
displacement
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310678166.8A
Other languages
Chinese (zh)
Other versions
CN116433826A (en)
Inventor
陈毅
郭紫垣
赵亚飞
范锡睿
张世昌
杜宗财
王志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310678166.8A priority Critical patent/CN116433826B/en
Publication of CN116433826A publication Critical patent/CN116433826A/en
Application granted granted Critical
Publication of CN116433826B publication Critical patent/CN116433826B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 - 3D [Three Dimensional] image rendering
    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63F - CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 13/00 - Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F 13/55 - Controlling game characters or game objects based on the game progress
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 - Manipulating 3D models or images for computer graphics
    • G06T 19/20 - Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 - Geometric image transformations in the plane of the image
    • G06T 3/06 - Topological mapping of higher dimensional structures onto lower dimensional surfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Multimedia (AREA)
  • Architecture (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The disclosure provides an avatar driving method, apparatus, device and medium, relating to the technical field of artificial intelligence, in particular to technologies such as computer vision, virtual reality and augmented reality, and usable in metaverse scenes. The specific implementation scheme is as follows: acquiring an original picture frame including a real image; determining an image area of the real image in the original picture frame and image feature data carried by the image area; determining, according to the region attribute information and the image feature data of the image area, virtual displacement data of the avatar corresponding to the real image in the virtual space under the perspective projection condition, wherein the global two-dimensional space in which the original picture frame is located is a perspective projection result of the virtual space; and driving and displaying the avatar according to the virtual displacement data. According to the technology of the present disclosure, the generalization of avatar displacement driving and the accuracy of the driving effect are improved.

Description

Avatar driving method, apparatus, device and medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to techniques of computer vision, virtual reality, augmented reality, and the like, which may be used in metaverse scenes.
Background
Motion capture places trackers at key parts of a moving object, obtains three-dimensional coordinate data by capturing the tracker positions and processing the captured position data, and transfers the motion of the real moving object to an avatar by applying the three-dimensional data to the avatar. It is therefore widely used in the animation and game fields.
Disclosure of Invention
The present disclosure provides an avatar driving method, apparatus, device, and medium.
According to an aspect of the present disclosure, there is provided an avatar driving method including:
acquiring an original picture frame comprising a real image;
determining an image area of the real image in an original picture frame and image characteristic data carried by the image area;
determining, according to the region attribute information and the image feature data of the image area, virtual displacement data of the avatar corresponding to the real image in the virtual space under the perspective projection condition; wherein the global two-dimensional space where the original picture frame is located is a perspective projection result of the virtual space;
and driving and displaying the virtual image according to the virtual displacement data.
According to another aspect of the present disclosure, there is also provided an electronic apparatus including:
At least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the avatar driving methods provided by the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform any one of the avatar driving methods provided by the embodiments of the present disclosure.
According to the technology of the present disclosure, the generalization of avatar displacement driving and the accuracy of the driving effect are improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flowchart of an avatar driving method provided in an embodiment of the present disclosure;
Fig. 2 is a flowchart of another avatar driving method provided in an embodiment of the present disclosure;
fig. 3 is a flowchart of another avatar driving method provided by an embodiment of the present disclosure;
fig. 4 is a flowchart of another avatar driving method provided by an embodiment of the present disclosure;
fig. 5A is a block diagram illustrating a configuration of an avatar driving system provided in an embodiment of the present disclosure;
fig. 5B is a flowchart of another avatar driving method provided in an embodiment of the present disclosure;
fig. 6 is a block diagram of an avatar driving apparatus provided in an embodiment of the present disclosure;
fig. 7 is a block diagram of an electronic device for implementing an avatar driving method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the prior art, when motion capture is performed, an original picture frame including a real image is generally collected, key points of the real image in the original picture frame are identified, actual displacement data of the key points are determined, and the actual displacement data in the original picture frame are converted, according to a preset scaling factor, into virtual displacement data of the avatar corresponding to the real image in the virtual space, so that the avatar is driven and displayed in the virtual space according to the virtual displacement data.
However, the preset scaling factor needs to be adjusted whenever the avatar or the virtual scene changes, so the displacement driving process of the avatar lacks generalization and universality, and the avatar may even clip through the ground or canvas of the virtual space (model penetration), which degrades the displacement driving effect.
The avatar driving method and apparatus provided by the embodiments of the present disclosure are suitable for solving the problems of poor generalization and poor driving effect in avatar displacement driving scenarios. The avatar driving method provided in the embodiments of the present disclosure may be performed by an avatar driving apparatus, where the apparatus may be implemented by software and/or hardware and is specifically configured in an electronic device having a certain computing capability; the electronic device may be a terminal device or a server, and the present disclosure does not limit this in any way.
For ease of understanding, an avatar driving method will be described in detail first.
Referring to fig. 1, the avatar driving method includes:
s101, acquiring an original picture frame comprising a real image.
The real image corresponds to the avatar and can be understood as the image of the actual moving object in the motion capture process. The avatar may be understood as a virtual figure that moves correspondingly in the virtual space according to the actions of the real image. The avatar may be a person, an animal, or another IP (Intellectual Property) character, etc., and the present disclosure does not limit this in any way.
Wherein the original picture frame can be understood as a video frame in the captured action video of the real figure. Wherein the number of original picture frames is at least one; and combining different original picture frames according to the capturing time sequence to obtain the captured action video. It should be noted that the present disclosure adopts the same avatar driving method for different original picture frames, so that the original picture frames in the subsequent description process are applicable to any original picture frame.
Alternatively, the original picture frame including the real avatar may be captured and acquired directly by an execution device executing the avatar driving method; or alternatively, capturing, by the motion capture device, an original picture frame including the real avatar, and storing the original picture frame in a storage device communicatively connected to an execution device executing the avatar driving method, and correspondingly, acquiring the stored original picture frame from the storage device.
S102, determining an image area of the real image in the original picture frame and image characteristic data carried by the image area.
Because the real image does not fully fill the original picture frame, that is, the original picture frame includes a background area in addition to the image area corresponding to the real image, and the background area carries no practically meaningful information, the image area in the original picture frame and the effective information it carries need to be extracted. The image area may be the area enclosed by a circumscribed rectangular frame of the real image. In an alternative embodiment, to ensure consistency of the determined bounding rectangular frame and ease of subsequent operations, the bounding rectangular frame is typically set to the largest bounding rectangular frame in the horizontal or vertical direction.
The image feature data may be understood as ROI (region of interest) features carried by the real image in the image area, and may include at least one of head features, torso features, limb features, and the like.
In an alternative embodiment, the avatar region of the real avatar in the original picture frame may be determined based on the object detection network, and the ROI feature in the avatar region may be extracted as the avatar feature data based on the feature extraction network. The target detection network and the feature extraction network can be realized based on at least one neural network in the prior art, and the specific network structure and training mode of the target detection network and the feature extraction network are not limited in the disclosure.
In order to improve the richness and comprehensiveness of the image feature data extracted by the feature extraction network, in an alternative embodiment, the head feature data in the image region may be extracted based on the head feature extraction network; extracting trunk feature data in the image area based on the trunk feature extraction network; extracting limb characteristic data in the image area based on the limb characteristic extraction network; image feature data including at least one of head feature data, torso feature data, and limb feature data is generated.
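For illustration only, the data flow of this step can be sketched as follows; the detector and the feature extractors are hypothetical placeholders (the disclosure does not fix their architectures), and splitting the region into head/torso/limb strips is an assumption standing in for dedicated sub-networks:

```python
import numpy as np

def detect_avatar_region(frame):
    """Hypothetical stand-in for the target detection network: returns (x0, y0, x1, y1)."""
    h, w = frame.shape[:2]
    return w // 4, h // 8, 3 * w // 4, h - 1            # dummy box, for illustration only

def extract_roi_features(crop, dim=64):
    """Hypothetical stand-in for a feature extraction network: returns a fixed-length vector."""
    pooled = crop.astype(np.float32).mean(axis=(0, 1))   # crude global pooling over the crop
    return np.resize(pooled, dim)

frame = np.zeros((720, 1280, 3), dtype=np.uint8)         # original picture frame
x0, y0, x1, y1 = detect_avatar_region(frame)             # image area of the real image
region = frame[y0:y1, x0:x1]

# Per-part features standing in for head, torso and limb feature extraction networks.
third = region.shape[0] // 3
head_feat = extract_roi_features(region[:third])
torso_feat = extract_roi_features(region[third:2 * third])
limb_feat = extract_roi_features(region[2 * third:])
image_feature_data = np.concatenate([head_feat, torso_feat, limb_feat])
```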
S103, determining virtual displacement data of the real image corresponding to the virtual image in the virtual space under the perspective projection condition according to the region attribute information and the image characteristic data of the image region.
The global two-dimensional space where the original picture frame is located is a perspective projection result of the virtual space. The global two-dimensional space is the two-dimensional space constructed by the original picture frame, whose boundary is constrained by the edges of the original picture frame. It is noted that the two-dimensional space in which the image area of the original picture frame is located may be referred to as a local two-dimensional space within the global two-dimensional space. Mapping the information of a three-dimensional object onto a two-dimensional plane based on a pinhole imaging model (also called a central perspective projection model) is called perspective projection.
The region attribute information is used for representing the basic attributes of the image area and can include region size information corresponding to the size attribute and region position information corresponding to the position attribute. By way of example, the region size information may include the region border size, the relative scale of the real image within the region, and the like, where the relative scale may be characterized by the diagonal length of the region border. For example, the region position information may include the position coordinates of the region center.
The virtual displacement data represent the displacement of the avatar in the virtual space and match the real displacement data of the real image, within the original picture frame, in the global two-dimensional space.
In an alternative embodiment, the region attribute information and the image feature data of the image region may be input to a pre-trained virtual displacement determination network to obtain virtual displacement data. Wherein the virtual displacement determination network may be implemented based on at least one existing machine learning or deep learning model, the present disclosure does not set any limitation on the specific network structure of the virtual displacement determination network. For example, the region attribute information of the image region in the sample picture frame and the image feature data of the image region corresponding to the sample picture frame may be used as training samples, input to the virtual displacement determination network to be trained, map the network output result back to the sample picture frame, and perform model training on the virtual displacement determination network constructed in advance according to the displacement difference between the mapping result and the original sample picture frame.
And S104, driving and displaying the virtual image according to the virtual displacement data.
The avatar model (which may be, for example, a skinned bone model) of the avatar is driven in accordance with the virtual displacement data, and the driving result is displayed.
Because the virtual displacement data may contain some noise interference, picture jitter may occur when the avatar is driven by a sequence of multiple original picture frames, which affects the driving and display effect of the avatar. To avoid this, the virtual displacement data may also be smoothed, and the avatar driven and displayed according to the smoothing result. Since smoothing removes at least part of the noise, driving and displaying the avatar based on the smoothing result makes the displacement in the picture smoother.
Optionally, at least one of Laplace smoothing, Kalman filtering or median filtering may be used to smooth the virtual displacement data, and the smoothing method used is not limited in this disclosure.
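As a minimal sketch of one of the options named above (median filtering), assuming the virtual displacement data form a (T, 3) per-frame sequence; the helper and example values are illustrative only:

```python
import numpy as np

def median_smooth(seq: np.ndarray, window: int = 5) -> np.ndarray:
    """Sliding-median smoothing of a (T, 3) virtual-displacement sequence."""
    half = window // 2
    padded = np.pad(seq, ((half, half), (0, 0)), mode="edge")
    return np.stack([np.median(padded[t:t + window], axis=0)
                     for t in range(seq.shape[0])])

# Example: a displacement track with one spurious spike.
track = np.zeros((10, 3))
track[:, 2] = np.linspace(0.0, 1.0, 10)    # steady motion along z
track[4] += [0.0, 0.0, 0.8]                # noisy frame
smoothed = median_smooth(track, window=3)  # spike suppressed, trend preserved
```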
According to the embodiments of the present disclosure, the virtual displacement data of the avatar in the virtual space are determined by introducing the region attribute information and the image feature data of the image area in the original picture frame, so that the displacement data in the local two-dimensional space where the image area is located are converted into displacement data in the virtual space. Because the global two-dimensional space is a perspective projection result of the virtual space, the determination of the virtual displacement data in the virtual space comprehensively considers the influence of the image feature data of the real image in both the global two-dimensional space and the local two-dimensional space, which improves the accuracy of the virtual displacement data, avoids model penetration (clipping) during avatar driving, and improves the displacement driving effect. In addition, the avatar driving process can adapt to various avatars and virtual scenes without being changed according to the type of avatar or the type of virtual scene, and therefore has generalization and universality.
Based on the above technical solutions, the present disclosure further provides an optional embodiment, in which the generation mechanism of the virtual displacement data is optimized and improved. It should be noted that, in the embodiments of the present disclosure, parts not described in detail may be referred to related expressions of other embodiments, which are not described herein.
Referring to fig. 2, an avatar driving method includes:
s201, acquiring an original picture frame comprising a real image.
S202, determining an image area of the real image in the original picture frame and image characteristic data carried by the image area.
S203, determining local projection parameters corresponding to the image areas according to the area size information and the image characteristic data.
The region size information is used for representing the size of the image area corresponding to the real image in the original picture frame. For example, the region size information may include at least one of a region border (edge) size and a region focal length size. In a specific implementation, if the image area is defined by a rectangular frame, the region border size may be understood as the size of the rectangular frame, such as its longest border; correspondingly, the region focal length size may be the diagonal size of the rectangular frame, which serves as a virtual focal length characterizing the relative distance between the real image and the device that acquired the original picture frame.
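A small sketch of how the region attribute information described above could be derived from an avatar bounding box; the function name and the box coordinates are illustrative assumptions:

```python
import math

def region_attributes(x0: float, y0: float, x1: float, y1: float):
    """Derive region attribute information from an avatar bounding box."""
    w, h = x1 - x0, y1 - y0
    border_size = max(w, h)                      # region border (edge) size: longest side
    focal_size = math.hypot(w, h)                # region "focal length" size: box diagonal
    center = ((x0 + x1) / 2.0, (y0 + y1) / 2.0)  # region position information: box center
    return border_size, focal_size, center

b, f_box, (cx, cy) = region_attributes(320, 90, 960, 690)
```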
The local projection parameters are the projection parameters under which the avatar corresponding to the real image is projected, by weak perspective, from the virtual space where the avatar is located into the local two-dimensional space where the image area is located. The global two-dimensional space where the original picture frame is located is a perspective projection result of the virtual space.
It should be noted that perspective projection is generally nonlinear and does not preserve most geometric properties of the original elements, such as parallelism and angles; its imaging characteristics are that near objects appear large and far objects appear small, and that parallel lines intersect. If, on top of perspective projection, the focal length is assumed to be infinite, the projection lines become parallel to one another and an orthographic projection onto the imaging plane is obtained, whose dimensions correspond to the size of the original object. Weak perspective projection is this orthographic projection with an additional scale factor applied.
Weak perspective projection may be understood as the projection process of a weak perspective projection model, which is essentially a further simplification of the pinhole imaging model underlying perspective projection: the distance between each projected point of the object and the pinhole is replaced by the average distance of the object. Since the near-large/far-small effect no longer applies once the same average distance is used for all points, the size of such an image is determined by the ratio between the imaging-plane-to-pinhole distance and the object-to-pinhole distance.
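For intuition, these relations can be written in the standard textbook form, added here only for illustration: a pinhole camera maps a point (X, Y, Z) to x = f·X/Z, y = f·Y/Z; replacing each point's depth Z by the object's average depth Z_avg yields the weak perspective form x ≈ s·X + t_x, y ≈ s·Y + t_y with s = f/Z_avg, which corresponds to the scale parameter and displacement parameter introduced below.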
The local projection parameters may include scale parameters and displacement parameters, among others. The scale parameter is used for representing the scaling scale in weak perspective projection; the displacement parameter is used for representing the position deviation condition of the key points in the image area in the virtual space. The key point may be a center point of the image area.
It can be understood that by introducing weak perspective projection of the virtual space to the local two-dimensional space, a mapping relationship between the local two-dimensional space corresponding to the image area and the virtual space can be established, so that the action details of the real image can be accurately controlled.
S204, determining virtual displacement data of the virtual image in the virtual space under the perspective projection condition according to the region position information, the region size information and the local projection parameters.
Because the weak perspective projection only represents the mapping relationship between the local two-dimensional space and the virtual space, and the local two-dimensional space cannot represent the relative position of the real image in the overall scene, determining the virtual displacement data directly from the weak perspective projection parameters would yield virtual displacement data of 0, i.e., the avatar would no longer move. To ensure that the avatar can move, the local weak perspective projection needs to be converted into the global perspective projection between the global two-dimensional space and the virtual space; the relative position relationship between the local two-dimensional space and the global two-dimensional space then endows the avatar with mobility in the virtual space.
In an alternative embodiment, the virtual displacement data of the avatar in the virtual space in the case of perspective projection may be determined based on the transformation relationship between the weak perspective projection and the perspective projection according to the region position information, the region size information and the local projection parameters.
Optionally, the virtual displacement data of the avatar in a first spatial dimension of the virtual space, parallel to the global two-dimensional space, may be determined according to the scale parameter in the local projection parameters, the displacement parameter in the local projection parameters, the region edge size in the region size information, and the region center coordinates of the image area; and the virtual displacement data of the avatar in the spatial dimension perpendicular to the first spatial dimension may be determined according to the scale parameter in the local projection parameters, the region focal length size in the region size information, the region edge size of the image area, and the standard edge size of the original picture frame.
Illustratively, determining the virtual displacement data of the avatar in the first spatial dimension may proceed as follows: a displacement variable of the avatar in the first spatial dimension is determined according to the scale parameter in the local projection parameters, the region edge size in the region size information, and the region center coordinates of the image area; the virtual displacement data in the first spatial dimension are then determined according to the displacement parameter in the local projection parameters and the displacement variable.
Specifically, the following formulas can be used to determine the virtual displacement data in the first spatial dimension (x, y), parallel to the global two-dimensional space:

t_x = t_x^c + 2·c_x/(s·b),    t_y = t_y^c + 2·c_y/(s·b)

wherein s is the scale parameter in the local projection parameters; (t_x^c, t_y^c) is the displacement parameter in the local projection parameters; (c_x, c_y) are the region center coordinates of the image area; b is the region edge size of the image area (for example, the maximum border length of the rectangular frame corresponding to the image area); and (t_x, t_y) is the virtual displacement data of the avatar in the first spatial dimension (x, y) of the virtual space.
Illustratively, determining the virtual displacement data of the avatar in the spatial dimension perpendicular to the first spatial dimension, according to the scale parameter in the local projection parameters, the region focal length size in the region size information, the region edge size of the image area, and the standard edge size of the original picture frame, may proceed as follows: an object-distance increase result in the perpendicular spatial dimension is determined according to the scale parameter in the local projection parameters and the standard edge size of the original picture frame; the virtual displacement data of the avatar in that dimension are then determined according to the object-distance increase result, the region focal length size in the region size information, the region edge size of the image area, and the standard edge size of the original picture frame.
Specifically, the following formula can be used to determine the virtual displacement data in the spatial dimension z, perpendicular to the first spatial dimension (x, y):

d = 2·f_ideal/(s·W),    t_z = d·f_r/b = 2·f_ideal·f_r/(s·b·W)

wherein f_ideal is an ideal focal length, which can be set according to an empirical value, for example 5000; W is the standard edge size of the original picture frame, i.e. the expected size under perspective projection, which can be set by a technician according to the actual situation; s is the scale parameter in the local projection parameters; d is the object-distance increase result; b is the region edge size of the image area (for example, the maximum border length of the rectangular frame corresponding to the image area); f_r is the region focal length size in the region size information (for example, the diagonal length of the rectangular frame corresponding to the image area); and t_z is the virtual displacement data in the spatial dimension z, perpendicular to the first spatial dimension (x, y).
It can be understood that the above manner realizes the conversion between the local weak perspective projection and the global perspective projection, so that the displacement of the avatar in the three-dimensional virtual space can be conveniently restored with a small amount of computation and high accuracy.
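The conversion can be collected into a single helper. The sketch below follows the variable roles and formulas given above; the default values for the standard edge size and ideal focal length, and the exact closed form, are assumptions rather than a verbatim reproduction of the patented computation:

```python
def local_to_global_translation(s, tx, ty, cx, cy, b, f_box,
                                W=224.0, f_ideal=5000.0):
    """Convert local weak-perspective parameters of the image area into a translation
    of the avatar in the virtual space under perspective projection.

    s, tx, ty : scale and displacement parameters of the weak perspective projection
    cx, cy    : region center coordinates in the original picture frame
    b         : region edge size (longest border of the avatar rectangle)
    f_box     : region focal length size (diagonal of the avatar rectangle)
    W         : standard edge size of the original picture frame (assumed default)
    f_ideal   : ideal focal length, e.g. 5000 (empirical value)
    """
    # First spatial dimension (x, y), parallel to the global two-dimensional space.
    Tx = tx + 2.0 * cx / (s * b)
    Ty = ty + 2.0 * cy / (s * b)
    # Object-distance increase from the scale parameter and the standard edge size,
    # then the component along the dimension perpendicular to (x, y).
    d = 2.0 * f_ideal / (s * W)
    Tz = d * f_box / b
    return Tx, Ty, Tz

# Usage with illustrative numbers only.
print(local_to_global_translation(s=0.8, tx=0.02, ty=-0.05,
                                  cx=640.0, cy=360.0, b=600.0, f_box=780.0))
```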
S205, driving and displaying the virtual image according to the virtual displacement data.
According to the embodiment of the disclosure, the weak perspective projection of the virtual space to the local two-dimensional space is introduced, so that the mapping relation between the local two-dimensional space corresponding to the image area and the virtual space can be established, and the action details of the real image can be accurately controlled; the conversion from the local weak perspective projection to the global perspective projection is realized by introducing the region position information and the region size information, so that the global displacement data hidden in the weak perspective projection is effectively restored, the accuracy of the determined virtual displacement data is improved, meanwhile, the loss of the action details of the real image is avoided, and the driving and displaying effects of the virtual image are improved.
Based on the technical schemes, the present disclosure also provides an alternative embodiment. In this alternative embodiment, the avatar's driving presentation process is optimized and improved. It should be noted that, in the embodiments of the present disclosure, parts not described in detail may be referred to related expressions of other embodiments, which are not described herein.
Referring to fig. 3, an avatar driving method includes:
s301, acquiring an original picture frame comprising a real image.
S302, determining an image area of the real image in the original picture frame and image characteristic data carried by the image area.
S303, determining virtual displacement data of the virtual image corresponding to the real image in the virtual space under the perspective projection condition according to the region attribute information and the image characteristic data of the image region.
The global two-dimensional space where the original picture frame is located is a perspective projection result of the virtual space.
S304, coding the virtual displacement data into a displacement hidden variable space to obtain displacement coding data.
The displacement hidden-variable space can be understood as a feature space constructed from displacement hidden variables that determine at least one dimension of the displacement attribute in the virtual displacement data. The displacement encoded data can be understood as a unified description of these displacement hidden variables obtained by encoding the virtual displacement data, thereby realizing variational inference over the virtual displacement data. The data space of the virtual displacement data is a relatively high-dimensional space, whereas the displacement hidden-variable space is a relatively low-dimensional space from which disturbance variables unrelated to the displacement attribute have been removed, achieving noise reduction. In one embodiment, the displacement encoded data may be a vector obtained by randomly sampling each displacement hidden variable according to its probability distribution.
S305, decoding the displacement encoding data into a standard displacement space to obtain displacement decoding data.
The standard displacement space can be understood as an ideal displacement space for the reconstructed virtual displacement data: in it, the data that determine the displacement attribute in the virtual displacement data are restored, while other data that carry no displacement attribute are eliminated.
The displacement decoded data can be understood as virtual displacement data that retain only the displacement attribute.
For example, the virtual displacement data can be reconstructed according to the displacement coding data, and the approximate probability distribution of the virtual displacement data is restored, so that the displacement decoding data corresponding to the virtual displacement data is obtained.
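A minimal sketch of this encode/decode idea using a small variational-style autoencoder in PyTorch; the class name, layer sizes, latent dimension and the absence of a training loop are assumptions for illustration, not the specific network of the disclosure:

```python
import torch
import torch.nn as nn

class DisplacementSmoother(nn.Module):
    """Project noisy virtual displacement data into a low-dimensional latent
    ("hidden variable") space and reconstruct it, discarding components that are
    unrelated to the displacement attribute."""

    def __init__(self, in_dim: int = 3, latent_dim: int = 2, hidden: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * latent_dim))  # mean and log-variance
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, in_dim))

    def forward(self, displacement: torch.Tensor) -> torch.Tensor:
        mu, logvar = self.encoder(displacement).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # sample the hidden variables
        return self.decoder(z)                                   # displacement decoded data

# Usage: smooth a (T, 3) sequence of per-frame virtual displacements.
smoother = DisplacementSmoother()
noisy = torch.randn(30, 3)
with torch.no_grad():
    denoised = smoother(noisy)   # in practice the smoother would be trained first
```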
S306, driving and displaying the avatar according to the displacement decoded data.
According to the embodiments of the present disclosure, the displacement decoded data, from which irrelevant interference information in the virtual displacement data has been removed, replace the virtual displacement data as the driving basis of the avatar, so that temporal jitter of the avatar can be avoided when the avatar is driven from a sequence of multiple consecutive original picture frames. At the same time, problems such as the avatar lying down, pitching over, or raising its hands and leaning backwards, caused by gimbal lock in the actions, can be avoided, and the fluency and naturalness of the avatar's displacement actions can be improved.
Based on the above technical solutions, the present disclosure further provides another optional embodiment in which, when the avatar is driven, the pose data of the real image are additionally introduced, so that the avatar in the virtual space can also map the pose of the real image. It should be noted that, in the embodiments of the present disclosure, parts not described in detail may be referred to related expressions of other embodiments, which are not described herein.
Referring to fig. 4, an avatar driving method includes:
s401, acquiring an original picture frame comprising a real image.
S402, determining an image area of the real image in the original picture frame and image characteristic data carried by the image area.
S403, according to the region attribute information and the image feature data of the image area, respectively determining, under the perspective projection condition, the virtual displacement data of the avatar corresponding to the real image in the virtual space and the image posture data of the real image in the image area.
The global two-dimensional space where the original picture frame is located is a perspective projection result of the virtual space.
The image gesture data are used for representing the gesture forms presented by different parts in the real image.
In an alternative embodiment, the region attribute information and the image feature data may be input to a pre-trained gesture recognition network to obtain the image gesture data. The gesture recognition network may be implemented based on at least one existing machine learning or deep learning model, and the present disclosure does not limit its specific network structure. For example, the region attribute information of the image area in a sample picture frame and the corresponding image feature data may be used as training samples and input to the gesture recognition network to be trained; the network output is mapped back to the sample picture frame, and the pre-constructed gesture recognition network is trained according to the posture difference between the mapping result and the original sample picture frame.
It should be noted that the gesture recognition network and the virtual displacement determination network used in the present disclosure may be trained independently or jointly, and the present disclosure does not limit their training manner in any way. To prevent the features learned by two independently trained networks from suppressing each other, the gesture recognition network and the virtual displacement determination network are generally trained jointly.
In addition, the gesture recognition network involved in the present disclosure may have the same network structure as the aforementioned virtual displacement determination network or a different one. In a specific implementation, the two can be integrated into a single comprehensive network that has both gesture recognition capability and virtual displacement determination capability, thereby reducing network complexity.
S404, driving and displaying the virtual image according to the virtual displacement data and the image posture data.
Jointly driving and displaying the avatar according to the virtual displacement data and the image posture data enables the avatar to present, in the virtual space, a position and posture consistent with those of the real image in the original picture frame, which improves the position and posture restoration capability of the avatar and further improves its display effect.
Because the image posture data may also contain some noise interference, picture jitter may occur when the avatar is driven by a sequence of multiple original picture frames, which affects the driving and display effect of the avatar. To avoid this, the image posture data may be further smoothed to update the image posture data. Since smoothing filters out at least part of the noise, driving and displaying the avatar based on the updated image posture data makes the motion picture smoother.
Optionally, at least one of Laplace smoothing, Kalman filtering or median filtering may be used to smooth the image posture data, and the smoothing method used is not limited in this disclosure.
It should be noted that, the smoothing manner adopted for smoothing the image posture data and the smoothing manner adopted for smoothing the virtual displacement data may be the same or different, which is not limited in this disclosure.
Because the smoothing process may run into gimbal lock, problems such as the avatar lying down, pitching over, or raising its hands and leaning backwards can occur, affecting the smoothness and naturalness of the avatar's motion. To avoid this, the avatar may be driven and displayed in the following manner: encoding the image gesture data into a gesture hidden-variable space to obtain gesture encoded data; decoding the gesture encoded data into a standard gesture space to obtain gesture decoded data; and driving and displaying the avatar according to the gesture decoded data.
The gesture hidden-variable space can be understood as a feature space constructed from gesture hidden variables that determine at least one dimension of the gesture attribute in the image gesture data. The gesture encoded data can be understood as a unified description of these gesture hidden variables obtained by encoding the image gesture data, thereby realizing variational inference over the image gesture data. The data space of the image gesture data is a relatively high-dimensional space, whereas the gesture hidden-variable space is a relatively low-dimensional space from which disturbance variables unrelated to the gesture attribute have been removed, achieving noise reduction. In a specific implementation, the gesture encoded data may be a vector obtained by randomly sampling each gesture hidden variable according to its probability distribution.
The standard gesture space can be understood as an ideal gesture space for the reconstructed image gesture data: in it, the data that determine the gesture attribute in the image gesture data are restored, while other data unrelated to the gesture attribute are eliminated.
The gesture decoded data can be understood as image gesture data that retain only the gesture attribute.
For example, the image gesture data can be reconstructed from the gesture encoded data, restoring the approximate probability distribution of the image gesture data, thereby obtaining the gesture decoded data corresponding to the image gesture data; correspondingly, the gesture decoded data replace the image gesture data as the driving basis of the avatar.
According to the embodiments of the present disclosure, the gesture decoded data, from which irrelevant interference information in the image gesture data has been removed, replace the image gesture data as the driving basis of the avatar, so that temporal jitter of the avatar can be avoided when the avatar is driven from a sequence of multiple consecutive original picture frames. At the same time, problems such as the avatar lying down, pitching over, or raising its hands and leaning backwards, caused by gimbal lock in the actions, can be avoided, and the fluency and naturalness of the avatar's gesture actions can be improved.
Based on the above technical solutions, the present disclosure provides a preferred embodiment for implementing avatar driving. The following will explain in detail a structural block diagram of the avatar driving system shown in fig. 5A and a flowchart of the avatar driving method shown in fig. 5B.
Referring to fig. 5B, the avatar driving method includes:
s501, acquiring an original picture frame comprising a real image.
S502, determining an image rectangular frame of the real image in the original picture frame through a target detection network.
The target detection network can multiplex a mature network structure in the prior art or can be obtained by combined training with a subsequent feature extraction network and/or a pose recognition network.
S503, based on the feature extraction network, extracting the ROI features in the image rectangular frame.
The feature extraction network can multiplex at least one mature network structure in the prior art or can be obtained by combined training with the target detection network and/or the pose recognition network.
In an alternative embodiment, the avatar rectangular frame includes an overall real avatar rectangular frame. The whole rectangular frame can be split into a head rectangular frame comprising a head area, a trunk rectangular frame comprising a trunk area and a limb rectangular frame comprising a limb area. Accordingly, the feature extraction network may include at least one of a head feature extraction network, a torso feature extraction network, and a limb feature extraction network; extracting head ROI features in the head rectangular frame based on the head feature extraction network; extracting trunk ROI features based on a trunk feature extraction network; and extracting the limb ROI features based on the limb feature extraction network.
S504, taking the longer frame length of the image rectangular frame as the area frame size, and taking the diagonal line size of the image rectangular frame as the area focal length size.
S505, inputting the ROI features, the region border size and the region focal length size into the pose recognition network to obtain the image pose data and the corresponding weak perspective projection parameters for the weak perspective projection of the virtual space onto the image rectangular frame.
The pose recognition network is constructed based on at least one existing machine learning model or deep learning model, and the specific network structure of the pose recognition network is not limited in any way.
S506, converting the local weak perspective projection into the global perspective projection onto the original picture frame, to obtain the virtual displacement data corresponding to the avatar when the virtual space is projected onto the original picture frame.
For example, the following formula may be used to determine virtual displacement data corresponding to the avatar:
t_x = t_x^c + 2·c_x/(s·b),    t_y = t_y^c + 2·c_y/(s·b),    t_z = 2·f_ideal·f_r/(s·b·W)

wherein s is the scale parameter in the local projection parameters; (t_x^c, t_y^c) is the displacement parameter in the local projection parameters; W is the standard edge size of the original picture frame, i.e. the expected size under perspective projection, which can be set by a technician according to the actual situation; (c_x, c_y) are the region center coordinates of the image area; b is the region edge size of the image area (for example, the maximum border length of the rectangular frame corresponding to the image area); f_ideal is an ideal focal length, which can be set according to an empirical value, for example 5000; d = 2·f_ideal/(s·W) is the object-distance increase result; f_r is the region focal length size in the region size information (for example, the diagonal length of the rectangular frame corresponding to the image area); and (t_x, t_y, t_z) is the virtual displacement data of the avatar in the virtual space.
S507, respectively smoothing the virtual displacement data and the image posture data, and driving and displaying the virtual image according to the smoothing result.
For example, the virtual displacement data can be encoded into the displacement hidden-variable space to obtain displacement encoded data, and the displacement encoded data decoded into the standard displacement space to obtain displacement decoded data; the displacement decoded data are used as the smoothing result of the virtual displacement data. The image pose data can be encoded into the pose hidden-variable space to obtain pose encoded data, and the pose encoded data decoded into the standard pose space to obtain pose decoded data; the pose decoded data are used as the smoothing result of the image pose data.
When training the pose recognition network, an image rectangular frame of the real image in a sample picture frame is determined through the target detection network; the ROI features in the image rectangular frame are extracted based on the feature extraction network; the longer border of the image rectangular frame is taken as the region border size, and the diagonal of the image rectangular frame is taken as the region focal length size. The ROI features, region border sizes and region focal length sizes of a large number of sample picture frames are used as training samples and input to the pose recognition network to obtain sample image pose data and sample weak perspective projection parameters; the local weak perspective projection is converted into the global perspective projection onto the sample picture frame to obtain the sample virtual displacement data corresponding to the avatar when the virtual space is projected onto the sample picture frame. The sample virtual displacement data are encoded into the displacement hidden-variable space to obtain sample displacement encoded data, which are decoded into the standard displacement space to obtain sample displacement decoded data; the sample image pose data are encoded into the pose hidden-variable space to obtain sample pose encoded data, which are decoded into the standard pose space to obtain sample pose decoded data. The decoded results are mapped back to the sample picture frame based on the region focal length size and the center coordinates of the image rectangular frame, and the network parameters of the pose recognition network are adjusted according to the difference between the mapping result and the original sample picture frame. The difference can be measured by at least one loss function in the prior art, and the present disclosure does not limit the specific loss function used.
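A sketch of the kind of reprojection-style difference such training could use: predicted 3D joints are mapped back into the picture frame with the recovered translation and compared against 2D annotations. The function name, tensor shapes and the mean-squared-error form are assumptions, not the loss specified by the disclosure:

```python
import torch

def reprojection_loss(joints_3d: torch.Tensor, cam_t: torch.Tensor,
                      keypoints_2d: torch.Tensor, focal: float = 5000.0) -> torch.Tensor:
    """Map predicted 3D joints back into the picture frame with the recovered
    perspective translation and penalise the difference to annotated 2D keypoints.
    Shapes: joints_3d (N, J, 3), cam_t (N, 3), keypoints_2d (N, J, 2)."""
    pts = joints_3d + cam_t[:, None, :]           # move joints into camera space
    proj = focal * pts[..., :2] / pts[..., 2:3]   # pinhole projection onto the frame
    return ((proj - keypoints_2d) ** 2).mean()

# Usage with random stand-in tensors (depth offset keeps points in front of the camera).
loss = reprojection_loss(torch.rand(4, 17, 3) + torch.tensor([0.0, 0.0, 5.0]),
                         torch.zeros(4, 3), torch.rand(4, 17, 2) * 100)
```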
It should be noted that the apparatus used to train the pose recognition network may be the same as or different from the apparatus that uses the pose recognition network to drive and display the avatar; this is not limited in the present disclosure.
As an implementation of the above-described respective avatar driving methods, the present disclosure also provides an alternative embodiment of an executing apparatus implementing the above-described respective avatar driving methods.
Referring to fig. 6, the avatar driving apparatus 600 includes: an original picture frame acquisition module 601, an image area determination module 602, a virtual displacement data determination module 603 and a drive presentation module 604. Wherein:
an original picture frame acquisition module 601, configured to acquire an original picture frame including a real image;
the image area determining module 602 is configured to determine an image area of the real image in the original picture frame and image feature data carried by the image area;
the virtual displacement data determining module 603 is configured to determine virtual displacement data of the real avatar corresponding to the avatar in the virtual space under the perspective projection condition according to the region attribute information of the avatar region and the avatar feature data; the global two-dimensional space where the original picture frame is located is a transmission projection result of the virtual space;
And the driving display module 604 is used for driving and displaying the avatar according to the virtual displacement data.
According to the embodiments of the present disclosure, the virtual displacement data of the avatar in the virtual space are determined by introducing the region attribute information and the image feature data of the image area in the original picture frame, so that the displacement data in the local two-dimensional space where the image area is located are converted into displacement data in the virtual space. Because the global two-dimensional space is a perspective projection result of the virtual space, the determination of the virtual displacement data in the virtual space comprehensively considers the influence of the image feature data of the real image in both the global two-dimensional space and the local two-dimensional space, which improves the accuracy of the virtual displacement data, avoids model penetration (clipping) during avatar driving, and improves the displacement driving effect. In addition, the avatar driving process can adapt to various avatars and virtual scenes without being changed according to the type of avatar or the type of virtual scene, and therefore has generalization and universality.
In an alternative embodiment, the region attribute information includes region size information and region position information;
The virtual displacement data determining module 603 includes:
the local projection parameter determining unit is used for determining local projection parameters corresponding to the image area according to the area size information and the image characteristic data; the local projection parameters are projection parameters of the virtual image corresponding to the real image, and the virtual image is projected from the virtual space where the virtual image is positioned to the local two-dimensional space where the image area is positioned in a weak perspective mode;
and the virtual displacement data determining unit is used for determining virtual displacement data of the virtual image in the virtual space under the condition of perspective projection according to the region position information, the region size information and the local projection parameters.
In an alternative embodiment, the virtual displacement data determining unit includes:
a first determining subunit, configured to determine virtual displacement data of the avatar in a first spatial dimension parallel to the global two-dimensional space in the virtual space according to a scale parameter in the local projection parameter, a displacement parameter in the local projection parameter, a region edge size in the region size information, and a region center coordinate of the avatar region;
And the second determining subunit is used for determining the virtual displacement data of the avatar in the spatial dimension perpendicular to the first spatial dimension according to the scale parameter in the local projection parameters, the region focal length size in the region size information, the region edge size of the image area, and the standard edge size of the original picture frame.
In an alternative embodiment, the region size information includes a region bezel size and a region focal length size.
In an alternative embodiment, the drive presentation module 604 includes:
the displacement coding unit is used for coding the virtual displacement data into a displacement hidden variable space to obtain displacement coding data;
the displacement decoding unit is used for decoding the displacement encoded data to a standard displacement space to obtain displacement decoded data;
and the displacement driving display unit is used for driving and displaying the virtual image according to the displacement decoding data.
In an alternative embodiment, the apparatus 600 further comprises:
the image posture data determining module is used for determining the image posture data of the real image in the image area according to the area attribute information and the image characteristic data;
The driving display module is also used for driving and displaying the virtual image according to the image posture data.
In an alternative embodiment, the driving display module 604 further includes:
the posture encoding unit is used for encoding the image posture data into a posture hidden variable space to obtain posture encoded data;
the posture decoding unit is used for decoding the posture encoded data to a standard posture space to obtain posture decoded data;
and the posture driving display unit is used for driving and displaying the virtual image according to the posture decoded data (a simple driving sketch follows).
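For illustration only: the disclosure leaves the avatar representation open, so the sketch below assumes a simple vertex-based avatar driven by a single root rotation (standing in for the decoded posture data) and a global translation (standing in for the decoded displacement data); a real avatar would typically use a full skeletal rig.

```python
import numpy as np

def drive_avatar(vertices: np.ndarray, root_rotation: np.ndarray,
                 translation: np.ndarray) -> np.ndarray:
    """Apply a root rotation and a virtual-space translation to avatar vertices."""
    return vertices @ root_rotation.T + translation

# Toy avatar: four vertices of a tetrahedron.
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=np.float64)
# Decoded posture reduced to a 90-degree rotation about the vertical axis.
c, s = np.cos(np.pi / 2), np.sin(np.pi / 2)
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
# Decoded displacement of the virtual image in the virtual space.
t = np.array([0.12, -0.05, 3.4])
posed = drive_avatar(verts, R, t)   # posed vertices, ready for rendering and display
```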
The avatar driving apparatus can execute the avatar driving method provided by any embodiment of the present disclosure, and has the functional modules and beneficial effects corresponding to the execution of the avatar driving method.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the data involved, including the original picture frames containing real images, comply with the relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above, for example, the avatar driving method. For example, in some embodiments, the avatar driving method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the avatar driving method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the avatar driving method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above can be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, and that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service expansibility of traditional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
Artificial intelligence is the discipline that studies how to make a computer mimic certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking and planning), and it covers both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning technology, big data processing technology, knowledge graph technology and the like.
Cloud computing refers to a technical system in which an elastically extensible pool of shared physical or virtual resources is accessed through a network, where the resources may include servers, operating systems, networks, software, applications, storage devices and the like, and can be deployed and managed in an on-demand, self-service manner. Cloud computing technology can provide efficient and powerful data processing capability for technical applications such as artificial intelligence and blockchain, as well as for model training.
It should be appreciated that steps may be reordered, added or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions provided by the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (14)

1. An avatar driving method, comprising:
acquiring an original picture frame comprising a real image;
determining an image area of the real image in the original picture frame and image characteristic data carried by the image area;
determining, according to the region attribute information of the image area and the image characteristic data, virtual displacement data, in a virtual space and under a perspective projection condition, of the virtual image corresponding to the real image; the global two-dimensional space where the original picture frame is located is a perspective projection result of the virtual space;
encoding the virtual displacement data into a displacement hidden variable space to obtain displacement encoded data;
decoding the displacement encoded data to a standard displacement space to obtain displacement decoded data;
driving and displaying the virtual image according to the displacement decoded data;
wherein the displacement hidden variable space is a feature space constructed from displacement hidden variables that determine a displacement attribute in the virtual displacement data; the standard displacement space is capable of restoring the data that determines the displacement attribute in the virtual displacement data, and data in the virtual displacement data that is irrelevant to the displacement attribute is not included in the corresponding restored result; and the displacement decoded data is the virtual displacement data in which only the part determining the displacement attribute is retained.
2. The method of claim 1, wherein the region attribute information includes region size information and region position information;
wherein the determining, according to the region attribute information of the image area and the image characteristic data, the virtual displacement data, in the virtual space and under the perspective projection condition, of the virtual image corresponding to the real image includes:
determining local projection parameters corresponding to the image area according to the region size information and the image characteristic data; the local projection parameters are the parameters with which the virtual image corresponding to the real image is projected, under weak perspective projection, from the virtual space where the virtual image is located to the local two-dimensional space where the image area is located;
and determining virtual displacement data of the virtual image in the virtual space under the perspective projection condition according to the region position information, the region size information and the local projection parameters.
3. The method of claim 2, wherein the determining virtual displacement data of the virtual image in the virtual space under the perspective projection condition according to the region position information, the region size information and the local projection parameters includes:
determining virtual displacement data of the virtual image in a first spatial dimension of the virtual space parallel to the global two-dimensional space, according to the scale parameter in the local projection parameters, the displacement parameter in the local projection parameters, the region edge size in the region size information, and the region center coordinates of the image area;
and determining virtual displacement data of the virtual image in the spatial dimension of the virtual space perpendicular to the first spatial dimension, according to the scale parameter in the local projection parameters, the region focal length size in the region size information, the region edge size of the image area, and the standard edge size of the original picture frame.
4. The method of claim 2, wherein the region size information includes a region edge size and a region focal length size.
5. The method of any of claims 1-4, wherein the method further comprises:
according to the region attribute information and the image characteristic data, determining image posture data of the real image in the image area;
and driving and displaying the virtual image according to the image posture data.
6. The method of claim 5, wherein the driving and displaying the virtual image according to the image posture data comprises:
encoding the image posture data into a posture hidden variable space to obtain posture encoded data;
decoding the posture encoded data to a standard posture space to obtain posture decoded data;
driving and displaying the virtual image according to the posture decoded data;
wherein the posture hidden variable space is a feature space constructed from posture hidden variables that determine a posture attribute in the image posture data; the standard posture space is capable of restoring the data that determines the posture attribute in the image posture data, and data in the image posture data that is irrelevant to the posture attribute is not included in the corresponding restored result; and the posture decoded data is the image posture data in which only the part determining the posture attribute is retained.
7. An avatar driving apparatus, comprising:
the original picture frame acquisition module is used for acquiring an original picture frame comprising a real image;
The image area determining module is used for determining an image area of the real image in the original picture frame and image characteristic data carried by the image area;
the virtual displacement data determining module is used for determining, according to the region attribute information of the image region and the image characteristic data, virtual displacement data, in a virtual space and under a perspective projection condition, of a virtual image corresponding to the real image; the global two-dimensional space where the original picture frame is located is a perspective projection result of the virtual space;
the driving display module is used for driving and displaying the virtual image according to the virtual displacement data;
wherein the driving display module includes:
the displacement encoding unit is used for encoding the virtual displacement data into a displacement hidden variable space to obtain displacement encoded data;
the displacement decoding unit is used for decoding the displacement encoded data to a standard displacement space to obtain displacement decoded data;
the displacement driving display unit is used for driving and displaying the virtual image according to the displacement decoded data;
wherein the displacement hidden variable space is a feature space constructed from displacement hidden variables that determine a displacement attribute in the virtual displacement data; the standard displacement space is capable of restoring the data that determines the displacement attribute in the virtual displacement data, and data in the virtual displacement data that is irrelevant to the displacement attribute is not included in the corresponding restored result; and the displacement decoded data is the virtual displacement data in which only the part determining the displacement attribute is retained.
8. The apparatus of claim 7, wherein the region attribute information includes region size information and region position information;
the virtual displacement data determining module includes:
the local projection parameter determining unit is used for determining local projection parameters corresponding to the image area according to the region size information and the image characteristic data; the local projection parameters are the parameters with which the virtual image corresponding to the real image is projected, under weak perspective projection, from the virtual space where the virtual image is located to the local two-dimensional space where the image area is located;
and the virtual displacement data determining unit is used for determining virtual displacement data of the virtual image in the virtual space under the condition of perspective projection according to the region position information, the region size information and the local projection parameters.
9. The apparatus of claim 8, wherein the virtual displacement data determination unit comprises:
a first determining subunit, configured to determine virtual displacement data of the virtual image in a first spatial dimension of the virtual space parallel to the global two-dimensional space, according to a scale parameter in the local projection parameters, a displacement parameter in the local projection parameters, a region edge size in the region size information, and region center coordinates of the image region;
and a second determining subunit, configured to determine virtual displacement data of the virtual image in the spatial dimension of the virtual space perpendicular to the first spatial dimension, according to the scale parameter in the local projection parameters, the region focal length size in the region size information, the region edge size of the image region, and the standard edge size of the original picture frame.
10. The apparatus of claim 8, wherein the region size information comprises a region edge size and a region focal length size.
11. The apparatus according to any one of claims 7-10, wherein the apparatus further comprises:
the image posture data determining module is used for determining the image posture data of the real image in the image area according to the area attribute information and the image characteristic data;
the driving display module is also used for driving and displaying the virtual image according to the image posture data.
12. The apparatus of claim 11, wherein the driving display module further comprises:
the posture encoding unit is used for encoding the image posture data into a posture hidden variable space to obtain posture encoded data;
the posture decoding unit is used for decoding the posture encoded data to a standard posture space to obtain posture decoded data;
the posture driving display unit is used for driving and displaying the virtual image according to the posture decoded data;
wherein the posture hidden variable space is a feature space constructed from posture hidden variables that determine a posture attribute in the image posture data; the standard posture space is capable of restoring the data that determines the posture attribute in the image posture data, and data in the image posture data that is irrelevant to the posture attribute is not included in the corresponding restored result; and the posture decoded data is the image posture data in which only the part determining the posture attribute is retained.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the avatar driving method of any one of claims 1-6.
14. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the avatar driving method according to any one of claims 1 to 6.
CN202310678166.8A 2023-06-08 2023-06-08 Avatar driving method, apparatus, device and medium Active CN116433826B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310678166.8A CN116433826B (en) 2023-06-08 2023-06-08 Avatar driving method, apparatus, device and medium

Publications (2)

Publication Number Publication Date
CN116433826A CN116433826A (en) 2023-07-14
CN116433826B true CN116433826B (en) 2023-09-29

Family

ID=87091019

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310678166.8A Active CN116433826B (en) 2023-06-08 2023-06-08 Avatar driving method, apparatus, device and medium

Country Status (1)

Country Link
CN (1) CN116433826B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103116902A (en) * 2011-11-16 2013-05-22 华为软件技术有限公司 Three-dimensional virtual human head image generation method, and method and device of human head image motion tracking
WO2017054421A1 (en) * 2015-09-30 2017-04-06 深圳多新哆技术有限责任公司 Method and device for tweaking virtual reality image
KR20210103435A (en) * 2020-09-14 2021-08-23 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Method and apparatus for synthesizing virtual object image, electronic device and storage medium
CN115460372A (en) * 2021-06-09 2022-12-09 广州视源电子科技股份有限公司 Virtual image construction method, device, equipment and storage medium
CN115841687A (en) * 2021-09-15 2023-03-24 博泰车联网(南京)有限公司 Virtual character generation method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sang Gaoli. Research on recognition technology based on real measured three-dimensional faces. Xidian University Press, 2019, pp. 28-31, 34-38, 113-116. *

Also Published As

Publication number Publication date
CN116433826A (en) 2023-07-14

Similar Documents

Publication Publication Date Title
CN111598998B (en) Three-dimensional virtual model reconstruction method, three-dimensional virtual model reconstruction device, computer equipment and storage medium
CN113420719B (en) Method and device for generating motion capture data, electronic equipment and storage medium
US11282257B2 (en) Pose selection and animation of characters using video data and training techniques
US20220358675A1 (en) Method for training model, method for processing video, device and storage medium
CN112001248B (en) Active interaction method, device, electronic equipment and readable storage medium
US11641446B2 (en) Method for video frame interpolation, and electronic device
WO2022237481A1 (en) Hand-raising recognition method and apparatus, electronic device, and storage medium
CN116309983B (en) Training method and generating method and device of virtual character model and electronic equipment
KR20220153667A (en) Feature extraction methods, devices, electronic devices, storage media and computer programs
CN111754431B (en) Image area replacement method, device, equipment and storage medium
CN117274491A (en) Training method, device, equipment and medium for three-dimensional reconstruction model
CN112561879A (en) Ambiguity evaluation model training method, image ambiguity evaluation method and device
CN115953468A (en) Method, device and equipment for estimating depth and self-movement track and storage medium
CN113313631B (en) Image rendering method and device
CN114187392A (en) Virtual even image generation method and device and electronic equipment
CN112714337A (en) Video processing method and device, electronic equipment and storage medium
CN116433826B (en) Avatar driving method, apparatus, device and medium
CN115359166B (en) Image generation method and device, electronic equipment and medium
CN114998490B (en) Virtual object generation method, device, equipment and storage medium
CN112200169B (en) Method, apparatus, device and storage medium for training a model
CN113920023A (en) Image processing method and device, computer readable medium and electronic device
CN113761965A (en) Motion capture method, motion capture device, electronic equipment and storage medium
CN115830640B (en) Human body posture recognition and model training method, device, equipment and medium
CN115083000B (en) Face model training method, face changing method, face model training device and electronic equipment
CN116382475B (en) Sight line direction control, sight line communication method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant