CN115494958A - Hand-object interaction image generation method, system, equipment and storage medium - Google Patents

Hand-object interaction image generation method, system, equipment and storage medium

Info

Publication number
CN115494958A
CN115494958A
Authority
CN
China
Prior art keywords
hand
image
space
source
pose
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211377250.8A
Other languages
Chinese (zh)
Inventor
Li Houqiang
Zhou Wengang
Hu Hezhen
Wang Weilun
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China (USTC)
Priority to CN202211377250.8A
Publication of CN115494958A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017Gesture based interaction, e.g. based on a set of recognized hand gestures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language

Abstract

The invention discloses a hand-object interaction image generation method, system, equipment and storage medium. Complex self-occlusion and mutual occlusion between the hand and the object are explicitly modeled, using the model-aware representation as the condition, to obtain a coarse hand image and the hand-object topology maps corresponding to a preliminary target pose; then, taking the appearance difference between the hand and the object into account, the corresponding images are generated separately in a divide-and-conquer manner and fused into the final hand-object interaction image. The method improves the generation quality of hand-object interaction images and has good application prospects.

Description

Hand-object interaction image generation method, system, equipment and storage medium
Technical Field
The present invention relates to the technical field of hand-object interaction image generation, and in particular to a hand-object interaction image generation method, system, equipment and storage medium.
Background
Pose-guided image synthesis is a conditional generation task that aims to generate images under a target pose condition while preserving the identity information of the source image. Existing work on this task focuses mainly on pose transfer of a single instance, mostly non-rigid subjects such as human bodies, faces and hands, and can serve scenarios such as image animation, face reenactment and sign language generation. In recent work, person images in a target pose are generated in a multi-scale manner using a differentiable global-local attention module. Dung et al. use a 3DMM face parametric model to decouple facial pose and expression for guiding face generation. Hu et al. attempt to add hand priors to the pose transfer task to improve the generation quality. Meanwhile, most current work on hand-object interaction focuses on estimating hand and object poses aligned with a given image. To better depict the state of hand-object interaction, existing approaches resort to dense triangular meshes with a predefined topology, generated by the MANO hand model and the modeled objects. Hasson et al. use physical constraints to better estimate the hand mesh. Cao et al. propose an optimization-based method that improves performance using two-dimensional image cues and 3D interaction priors. Liu et al. exploit a large number of external hand-object videos through semi-supervised learning to improve hand-object interaction estimation. However, the above solutions are designed only for pose transfer of a single instance and cannot handle the challenges brought by complex interaction relationships.
Furthermore, the effectiveness of generative adversarial networks (GANs) has been verified in generating realistic images of, for example, human bodies, faces and hands. These GAN-based synthesis methods can be conditioned on different input information, such as simple sketches, 2D sparse keypoints and dense semantic masks. GestureGAN addresses the generation of isolated hands: it uses 2D sparse hand keypoints as the condition and attempts to generate target hand images from optical flow learned between source and target. However, these works do not consider the generation of two interacting instances, and their methods cannot adapt to the challenges this brings; in particular, neither 3D information nor the occlusion relationship between hand and object is considered, resulting in generated images with poor layering and poor quality.
Disclosure of Invention
The present invention aims to provide a hand-object interaction image generation method, system, equipment and storage medium that take the complex occlusion between hand and object into account and generate the target hand-object image in a divide-and-conquer manner.
The purpose of the invention is achieved by the following technical solution:
a method for generating a hand-object interaction image comprises the following steps:
a data acquisition stage: acquiring a source image, a source posture, object information and a target posture; the source posture and the target posture comprise a hand model, an object model and interaction posture information of the hand model and the object model;
topology modeling stage of occlusion awareness: mapping a hand image in a source image to a pre-constructed unified space by using a source posture to obtain a conversion stream from the source image space to the unified space; positioning a visual texture of a hand by calculating a shielded part of the hand in a source image, and obtaining a texture image by combining a conversion stream from a source image space to a uniform space; calculating conversion flow from the unified space to a hand-object interaction image space by using the target posture, and combining the texture image to obtain a hand image corresponding to the primary target posture; mapping object textures contained in the object information to a unified space, and combining a conversion flow from the unified space to a hand-object interaction image space to obtain an object image corresponding to a primary target posture; generating a hand topological graph and an object topological graph corresponding to the target gesture by combining the hand-object interaction image plane;
and a hand-object interaction image generation stage: clipping the source image through a hand object foreground mask, obtaining an image without hands and objects, and generating a background image by filling; generating a hand image corresponding to the target posture by using the hand image corresponding to the preliminary target posture and the hand topological graph; generating an object image corresponding to the target posture by using the object image corresponding to the primary target posture and the object topological graph; and fusing the background image, the hand image corresponding to the target posture model and the object image corresponding to the target posture to generate a hand-object interaction image.
A hand-object interaction image generation system, comprising:
a data acquisition unit, applied in the data acquisition stage: acquiring a source image, a source pose, object information and a target pose; the source pose and the target pose each comprise a hand model, an object model and the interaction pose information of the two;
an occlusion-aware topology modeling unit, applied in the occlusion-aware topology modeling stage: mapping the hand image in the source image to a pre-constructed unified space using the source pose to obtain the transformation flow from the source image space to the unified space; locating the visible texture of the hand by computing the occluded part of the hand in the source image, and obtaining a texture image by combining the transformation flow from the source image space to the unified space; computing the transformation flow from the unified space to the hand-object interaction image space using the target pose, and combining it with the texture image to obtain a preliminary hand image corresponding to the target pose; mapping the object texture contained in the object information to the unified space, and combining the transformation flow from the unified space to the hand-object interaction image space to obtain a preliminary object image corresponding to the target pose; and generating, on the hand-object interaction image plane, a hand topology map and an object topology map corresponding to the target pose;
and a hand-object generator, applied in the hand-object interaction image generation stage: cropping the source image with the hand-object foreground mask to obtain an image without hand and object, and generating a background image by inpainting; generating the hand image corresponding to the target pose from the preliminary hand image and the hand topology map; generating the object image corresponding to the target pose from the preliminary object image and the object topology map; and fusing the background image, the hand image corresponding to the target pose and the object image corresponding to the target pose to generate the hand-object interaction image.
A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium, storing a computer program which, when executed by a processor, implements the aforementioned method.
According to the technical solution provided by the present invention, complex self-occlusion and mutual occlusion between the hand and the object are explicitly modeled, using the model-aware representation as the condition, to obtain a coarse hand image and the hand-object topology maps corresponding to a preliminary target pose; then, taking the appearance difference between the hand and the object into account, the corresponding images are generated separately in a divide-and-conquer manner and fused into the final hand-object interaction image. The method improves the generation quality of hand-object interaction images and has good application prospects.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 is a flowchart of a method for generating a hand-object interaction image according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a task for generating a hand-object interaction image according to an embodiment of the present invention;
FIG. 3 is an overall framework diagram of a method for generating a hand-object interaction image according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a hand-object interaction image generation system according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The terms that may be used herein are first described as follows:
the term "and/or" means that either or both can be achieved, for example, X and/or Y means that both cases include "X" or "Y" as well as three cases including "X and Y".
The terms "comprising", "including", "containing", "having" and similar terms are to be construed as open-ended (non-exclusive). For example, a recitation that includes a certain feature (e.g., a material, component, ingredient, carrier, formulation, dimension, part, mechanism, device, step, process, method, reaction condition, processing condition, parameter, algorithm, signal, data, product or article) should be interpreted as covering not only the specifically recited feature but also other features not specifically recited and known in the art.
The following describes in detail the hand-object interaction image generation method, system, device and storage medium provided by the present invention. Details not described in the embodiments of the invention belong to the prior art known to those skilled in the art. Where specific conditions are not specified in the examples of the present invention, they follow the conventional conditions in the art or the conditions suggested by the manufacturer.
Example one
The embodiment of the invention provides a hand-object interaction image generation method, which addresses a new hand-object interaction image generation task with rich application prospects, such as AR/VR (augmented reality/virtual reality) games and online shopping. For example, when a consumer shops online, visualizing the interaction can bring an immersive experience. Furthermore, if a consumer wants to add a name on an object (e.g., on a cell phone), the occlusion-aware topology modeling part of the present invention can accomplish this through object texture editing. Moreover, in an online shopping scenario, a consumer typically does not have the corresponding merchandise at hand; the user only needs to upload one hand picture, and a realistic interactive experience can be presented by generating a hand-object image that preserves the hand identity. The invention can also be used to synthesize hand-object interaction pictures to improve the performance of downstream tasks: current methods for estimating hand-object interaction poses are usually based on deep learning, but their performance is limited by the size of the training data due to annotation cost, and synthesized images can serve as additional training data.
The hand-object interaction image generation task in the embodiment of the invention is a conditional generation task. Its goal is to generate a hand-object interaction image under a target pose condition while preserving the identity of the source image. In particular, it is assumed that the object is well modeled and its texture is known. Solving this task is challenging, because understanding the complex interaction relationship between hand and object is not trivial for image generation. The main challenges are summarized as follows. First, the occlusion relationship between the two interacting instances (i.e., hand and object) must be modeled: in a hand-object interaction scene, complex self-occlusion and mutual occlusion usually occur, and since occlusion increases the complexity of the transformation between source and target, the occluded regions should be located and identified, which facilitates the final generation. Second, the different characteristics of the two instances need to be considered in the generation process: the hand is articulated and self-occlusion exists between its joints, while the object is usually rigid and has fine texture; the generated hand-object interaction image should therefore contain realistic-looking instances and a plausible interaction between the two.
To this end, the present invention proposes a framework (called the HOGAN framework) to address the challenges of this new task. The HOGAN framework comprises occlusion-aware topology modeling and hand-object generation. The occlusion-aware topology modeling uses the model-aware representation as the condition and exploits the inherent structural topology to construct a unified space, in which the complex self-occlusion and mutual occlusion between hand and object are considered explicitly. Specifically, the visible parts of the hand and the object, together with their corresponding fine-grained topology maps, are mapped to the target image plane. Meanwhile, with the unified space serving as an intermediate variable for constructing transformation flows, the flow between source and target can be computed directly, and the results provide rich information for the final image synthesis. In the hand-object generation part, the final hand-object interaction image is generated step by step, taking the appearance difference between hand and object into account. FIG. 1 shows the main flow of the hand-object interaction image generation method provided by the embodiment of the present invention, which includes the following steps:
Step 1: the data acquisition stage.
In the embodiment of the invention, a source image, a source pose, object information and a target pose are acquired, where both the source pose and the target pose are hand-object interaction poses.
FIG. 2 illustrates the definition of the hand-object interaction image generation task in the embodiment of the present invention: generating the hand-object interaction image under the target pose while maintaining the appearance of the source image. Each row in FIG. 2 represents an example of generating a hand-object interaction image from the acquired information, and the columns, from left to right, are: Source Image, Object Information (Object Info.), Source Pose, Target Pose, and the generated hand-object interaction image.
In the embodiment of the invention, the source pose and the target pose each comprise the hand model, the object model and their interaction pose information; the source pose is aligned with the source image, and the target pose is aligned with the hand-object interaction image to be produced. The object information mainly contains the object model and the object texture.
Step 2: the occlusion-aware topology modeling stage.
In this stage: the hand image in the source image is mapped to a pre-constructed unified space using the source pose, obtaining the transformation flow from the source image space to the unified space; the visible texture of the hand is located by computing the occluded part of the hand in the source image, and a texture image is obtained by combining the transformation flow from the source image space to the unified space; the transformation flow from the unified space to the hand-object interaction image space is computed using the target pose and combined with the texture image to obtain the preliminary hand image corresponding to the target pose; the object texture contained in the object information is mapped to the unified space and combined with the transformation flow from the unified space to the hand-object interaction image space to obtain the preliminary object image corresponding to the target pose; and the hand topology map and the object topology map corresponding to the target pose are generated on the hand-object interaction image plane. Specifically, the preferred embodiment of this stage is as follows:
All the information obtained in this stage belongs to the representation of the hand and the object; it is generated by models and has an inherent topological structure (a patch set consisting of numerous vertices with fixed connection relationships between them), and is therefore called model-aware representation. An overview of the model-aware representation used is given first. The hands in the source and target poses can be represented by two sets of MANO models, and the objects by two sets of YCB models. Both MANO and YCB provide a triangular mesh representation that densely depicts the structure of the hand and the object. Specifically, the hand model contains $N_v$ vertices and $N_f$ triangular faces (faces for short); its mesh is represented by the vertex set $V \in \mathbb{R}^{N_v \times 3}$, and the fixed, inherent topology information $P$, determined by the model information, is organized as vertex triplets of the $N_f$ faces, where each cell records the corresponding vertex coordinates aligned with a plane. In the following description, the symbols $s$, $t$ and $u$ refer to the source image space, the target space (the hand-object interaction image space) and the unified space, respectively.
First, the surfaces of the hand model and the object model are unwrapped according to the inherent topology information to construct a unified space. In the unified space, the same representation is bound to the same mesh face regardless of the pose state, i.e., the same face in the source-pose and target-pose hand model (or object model) is mapped to the same location in the unified space. The unified space thereby enables the mapping from source to target, and pre-known object textures can be inserted into it.
Then, the hand image in the source image is mapped to the unified space in an occlusion-aware manner using the source pose, obtaining the transformation flow $T_{u\leftarrow s}$ from the source image space to the unified space.

At position $(x, y)$ in the unified space, the transformation flow is expressed as:

$$T_{u\leftarrow s}(x,y) = W_u(x,y) \cdot P_s(F_u(x,y))$$

where $x$ denotes the horizontal coordinate, $y$ denotes the vertical coordinate, $F_u(x,y)$ denotes the face index at position $(x,y)$ in the unified space, $P_s(F_u(x,y))$ denotes the three vertex coordinates of the face with index $F_u(x,y)$ in the source-pose model (comprising the hand model and the object model, determined by the position), and $W_u(x,y)$ denotes the relative weights of the face at position $(x,y)$ in the unified space, which can be determined from the relationship between the pose and the corresponding image space.
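For illustration only, the following minimal NumPy sketch shows how such a flow can be assembled when $W_u$ is treated as per-pixel barycentric weights; all function and variable names and the array shapes are assumptions for this sketch, not taken from the patent.

```python
import numpy as np

def source_to_unified_flow(face_idx_u, bary_w_u, verts_src, faces):
    """Sketch of T_{u<-s}: for each unified-space pixel, interpolate the
    source-image coordinates of its face's three vertices with the pixel's
    relative (barycentric) weights.
      face_idx_u: (H, W) int   face index F_u per unified-space pixel (-1 = empty)
      bary_w_u:   (H, W, 3)    per-pixel face weights W_u
      verts_src:  (N_v, 2)     2D vertex coordinates in the source image plane
      faces:      (N_f, 3) int vertex triplets of the fixed topology
    Returns an (H, W, 2) flow mapping unified-space pixels to source coordinates."""
    valid = face_idx_u >= 0
    tri = faces[face_idx_u.clip(min=0)]                # (H, W, 3) vertex indices
    tri_xy = verts_src[tri]                            # (H, W, 3, 2) = P_s(F_u)
    flow = (bary_w_u[..., None] * tri_xy).sum(axis=2)  # W_u · P_s(F_u)
    flow[~valid] = 0.0                                 # pixels covered by no face
    return flow
```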
Occlusion (including self-occlusion and mutual occlusion) is computed synchronously, expressed as:

$$O_{u\leftarrow s}(x,y) = \big(F_u(x,y) \neq F_s(T_{u\leftarrow s}(x,y))\big)$$

where $O_{u\leftarrow s}(x,y)$ indicates whether position $(x,y)$ is occluded, and $F_s(T_{u\leftarrow s}(x,y))$ denotes the face index at the source-image position located by the transformation flow $T_{u\leftarrow s}(x,y)$; if the face visible at that source position does not match the face index $F_u(x,y)$ in the unified space, position $(x,y)$ is considered occluded.

The visible texture of the hand can thus be located as $1 - O_{u\leftarrow s}(x,y)$.

Combining the transformation flow $T_{u\leftarrow s}$ from the source image space to the unified space, the visible texture of the hand is mapped to the unified space to obtain an initial texture image, expressed as:

$$I_u = \mathrm{Warp}(T_{u\leftarrow s}, I_s) \odot (1 - O_{u\leftarrow s})$$

where $I_u$ denotes the initial texture image, which is aligned with the unified space, $I_s$ denotes the source image, and $\odot$ and $\mathrm{Warp}(\cdot)$ denote element-wise multiplication and the warping operation, respectively.
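Continuing the sketch above, the occlusion indicator and the initial texture image can be computed in the same spirit; nearest-neighbour sampling is an assumption made to keep the sketch short, and a bilinear warp would typically be used in practice.

```python
def occlusion_and_texture(flow, face_idx_u, face_idx_s, src_img):
    """Sketch of O_{u<-s} and I_u. A unified-space pixel is occluded when the
    face visible at its source location differs from its own face index.
      flow:       (H, W, 2)    unified -> source coordinates (from the flow sketch)
      face_idx_u: (H, W) int   face indices in the unified space
      face_idx_s: (Hs, Ws) int face indices rendered in the source image space
      src_img:    (Hs, Ws, 3)  source image I_s."""
    xs = flow[..., 0].round().astype(int).clip(0, face_idx_s.shape[1] - 1)
    ys = flow[..., 1].round().astype(int).clip(0, face_idx_s.shape[0] - 1)
    occluded = face_idx_u != face_idx_s[ys, xs]        # O_{u<-s}
    tex = src_img[ys, xs] * (~occluded)[..., None]     # I_u = Warp(T, I_s) ⊙ (1-O)
    return occluded, tex
```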
Since the hand in the source image inevitably contains occluded regions, this alone is not sufficient for target generation. Therefore, the missing texture in the initial texture image is completed with the pre-stored hand texture, yielding the final texture image $\hat{I}_u$.
Then, the transformation flow $T_{t\leftarrow u}$ from the unified space to the hand-object interaction image space is computed using the target pose. At position $(x', y')$ in the hand-object interaction image space, the transformation flow is expressed as:

$$T_{t\leftarrow u}(x',y') = W_t(x',y') \cdot P_u(F_t(x',y'))$$

where $x'$ denotes the horizontal coordinate, $y'$ denotes the vertical coordinate, $F_t(x',y')$ denotes the face index at position $(x',y')$ in the hand-object interaction image space, $P_u(F_t(x',y'))$ denotes the three vertex coordinates, in the unified space, of the face with index $F_t(x',y')$ (again covering both the hand model and the object model, determined by the position), and $W_t(x',y')$ denotes the relative weights of the face at position $(x',y')$ in the hand-object interaction image space.

Since the hand-object interaction image space and the target pose space are aligned, $F_t$ can be determined from the target pose.
Finally, the final texture image $\hat{I}_u$ is sampled through the transformation flow $T_{t\leftarrow u}$ from the unified space to the hand-object interaction image space, obtaining the preliminary hand image $I_t$ corresponding to the target pose, expressed as:

$$I_t = \mathrm{Warp}(T_{t\leftarrow u}, \hat{I}_u)$$

where $\mathrm{Warp}(\cdot)$ denotes the warping operation.
In addition, since the object information provides a rough object texture, the object texture is mapped to the unified space and, combined with the transformation flow from the unified space to the hand-object interaction image space, the preliminary object image corresponding to the target pose is obtained.
Meanwhile, to provide sufficient guidance for the next stage, a fine-grained topology map $Y_t$ is generated synchronously; at position $(x',y')$ it is computed as:

$$Y_t(x',y') = \mathrm{Bary}(P_u(F_t(x',y')))$$

where $\mathrm{Bary}(\cdot)$ denotes the barycenter of the corresponding face in the unified space. Combining this formula with the hand region and the object region under the target pose yields the corresponding hand topology map and object topology map.
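As a small illustration (same assumed data layout as the sketches above), the topology map can be obtained by averaging the unified-space vertex coordinates of each pixel's face:

```python
def topology_map(face_idx_t, verts_u, faces):
    """Sketch of Y_t(x',y') = Bary(P_u(F_t(x',y'))).
      face_idx_t: (H, W) int  face index F_t per target pixel (-1 = background)
      verts_u:    (N_v, 2)    vertex coordinates in the unified space
      faces:      (N_f, 3) int vertex triplets."""
    tri_xy = verts_u[faces[face_idx_t.clip(min=0)]]  # (H, W, 3, 2)
    y_t = tri_xy.mean(axis=2)                        # barycentre of each face
    y_t[face_idx_t < 0] = 0.0                        # zero out background pixels
    return y_t
```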
Step 3: the hand-object interaction image generation stage.
In this stage: the source image is cropped with the hand-object foreground mask to obtain an image without hand and object, and the background image is generated by inpainting; the hand image corresponding to the target pose is generated from the preliminary hand image and the hand topology map; the object image corresponding to the target pose is generated from the preliminary object image and the object topology map; and the background image, the hand image corresponding to the target pose and the object image corresponding to the target pose are fused to generate the hand-object interaction image. Specifically, the preferred embodiment of this stage is as follows:
Considering that the hand and the object exhibit different attributes, the hand-object generator in the embodiment of the present invention is designed to generate the hand-object interaction image (target image) in a divide-and-conquer manner. It comprises three branches: a background branch, an object branch and a hand branch.
1) The background branch is provided with a first generation network responsible for generating the background image (Inpainted Background). Specifically, the hand-object foreground mask, which can be obtained by projecting the source pose onto the 2D plane, is used to crop the hand and object out of the source image, and the cropped-out region is completed by inpainting.
2) The object branch is provided with a second generation network responsible for generating the object image. The preliminary object image corresponding to the target pose (Obj. Input) is taken as the input of the second generation network; meanwhile, the object topology map (Obj. Topo.) is injected into the second generation network through spatially-adaptive normalization (SPADE), so that the object branch is aware of the structural information of the object, and the second generation network generates the object image corresponding to the target pose.
3) The hand branch is provided with a third generation network responsible for generating the hand image corresponding to the target pose (Hand Foreground). The preliminary hand image (Hand Input) is taken as the input of the third generation network; meanwhile, the hand topology map (Hand Topo.) is injected into the third generation network through spatially-adaptive normalization, and the third generation network generates the hand image corresponding to the target pose.
The following are exemplary: the three generation networks may use a U-shaped structure network (Unet network), and the specific network structure and principle may refer to the conventional technology, which is not described in detail herein.
Meanwhile, during training, setting a part of source postures equal to corresponding target postures, processing a topology modeling stage and a hand-object interaction image generation stage according to occlusion perception, wherein the generated hand-object interaction image is a reconstructed source image, as shown in the upper half part of fig. 3, and at the moment, integrating the source postures and the target postures into a generation process through an attention sampler.
In the embodiment of the invention, the three branches handle three instances with different attributes, namely the background, the object and the hand, and their results are merged through a fusion module (Fusion). By extracting the last-layer features before the output layer of each generation network (the output layer being the layer that outputs the image), two fusion masks are learned with two separate convolution layers: a hand mask $M_h$ and a foreground mask $M_f$, referring to the non-occluded hand and the hand-object foreground respectively. The fusion module combines the results of the three branches into the final generated result, expressed as:

$$I = (I_h \odot M_h + I_o \odot (1 - M_h)) \odot M_f + I_b \odot (1 - M_f)$$

where $I$ denotes the hand-object interaction image, $I_h$ denotes the hand image corresponding to the target pose, $I_o$ denotes the object image, $I_b$ denotes the background image, and $\odot$ denotes element-wise multiplication.
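The fusion rule itself is a few lines in any tensor framework; the sketch below assumes soft masks in [0, 1] (e.g. sigmoid outputs of the two learned convolution layers).

```python
def fuse_branches(i_h, i_o, i_b, m_h, m_f):
    """I = (I_h ⊙ M_h + I_o ⊙ (1-M_h)) ⊙ M_f + I_b ⊙ (1-M_f).
    i_h, i_o, i_b: hand / object / background images of the same shape;
    m_h, m_f: hand mask and foreground mask, broadcastable to the images."""
    return (i_h * m_h + i_o * (1.0 - m_h)) * m_f + i_b * (1.0 - m_f)
```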
On the other hand, the hand product generator is used for training; the overall loss of training comprises three parts.
The first part is the perception loss on the generated hand-object interaction image, and is expressed as:
Figure BDA0003927213320000091
wherein f is i (. O) a feature extractor referring to layer i, x t And
Figure BDA0003927213320000092
respectively representing a real hand-object interaction image (known image) and a generated hand-object interaction image.
In the embodiment of the invention, the part of the perception loss uses a newly introduced pre-training network, and for example, the 2 nd, 7 th, 12 th, 21 th and 30 th layer feature extractors of the pre-trained VGG network can be used for extracting relevant image features.
The second part is reconstruction loss on the source image, the reconstruction loss is calculated when the target posture is the same as the source posture, specifically, the generated hand-object interaction image is the reconstructed source image, the reconstruction loss on the source image is calculated by using the reconstructed source image and the obtained source image, and the reconstruction loss is expressed as:
Figure BDA0003927213320000093
wherein x is s And
Figure BDA0003927213320000094
respectively representing the acquired source image and the reconstructed source image.
The third part is the adversarial loss, which constrains the distributions of the generated and real hand-object interaction images. Specifically, the invention designs a discriminator trained in an adversarial manner, thereby improving the visual quality of the generated hand-object interaction image. Denoting the discriminator by $D(\cdot)$, the adversarial losses are:

$$\mathcal{L}_{adv}^{G} = -\,\mathbb{E}\big[\log D(\hat{x}_t \mid c)\big]$$

$$\mathcal{L}_{adv}^{D} = -\,\mathbb{E}\big[\log D(x_t \mid c)\big] - \mathbb{E}\big[\log\big(1 - D(\hat{x}_t \mid c)\big)\big]$$

where $x_t$ and $\hat{x}_t$ denote the real hand-object interaction image (known image) and the generated hand-object interaction image respectively, $c$ denotes the combination of the generated object topology map and the hand topology map, $\mathcal{L}_{adv}^{G}$ denotes the adversarial loss of the hand-object generator, $\mathcal{L}_{adv}^{D}$ denotes the adversarial loss of the discriminator, $\mathbb{E}$ denotes the mathematical expectation, and $D(x_t \mid c)$ denotes the probability with which the discriminator judges the input image to be real given the condition $c$.
The final overall loss is:

$$\mathcal{L} = \mathcal{L}_{adv}^{G} + \lambda_1 \mathcal{L}_{perc} + \lambda_2 \mathcal{L}_{rec}$$

where $\lambda_1$ and $\lambda_2$ are weight factors used to balance the corresponding loss terms.
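Putting the three terms together, a generator-side training objective could look as follows; the standard non-saturating conditional-GAN form and the λ values are assumptions, since the patent only states that λ1 and λ2 balance the terms.

```python
import torch
import torch.nn.functional as F

def generator_loss(D, x_t, x_t_fake, x_s, x_s_rec, c, perc_loss,
                   lam1=1.0, lam2=1.0):
    """Sketch of L = L_adv + λ1·L_perc + λ2·L_rec.
      D: conditional discriminator taking (image, condition c), returning logits;
      c: combination of the generated hand and object topology maps;
      perc_loss: e.g. the VGGPerceptualLoss sketched above."""
    logits_fake = D(x_t_fake, c)
    # Generator adversarial term: -E[log D(x̂_t | c)]
    l_adv = F.binary_cross_entropy_with_logits(
        logits_fake, torch.ones_like(logits_fake))
    l_perc = perc_loss(x_t, x_t_fake)   # perceptual loss on the target image
    l_rec = F.l1_loss(x_s_rec, x_s)     # reconstruction loss on the source image
    return l_adv + lam1 * l_perc + lam2 * l_rec
```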
FIG. 3 illustrates the overall framework of the present invention. The dashed box at the bottom left represents the Occlusion-Aware Topology Modeling stage: it depicts the process of mapping the 3D hand-object model to the 2D plane, shows how the source pose and the target pose are associated through the unified space, and shows the outputs obtained from this association, including the rendered images of hand and object and their topology maps. The dashed box at the bottom right represents the hand-object interaction image generation stage (Hand-Object Generator) and illustrates the processing flow of the three branches. The upper half shows the overall training framework: the four inputs (i.e., the four kinds of information mentioned in the data acquisition stage) are on the left; they pass through the occlusion-aware topology modeling stage (Topology Modeling for short) and then the hand-object interaction image generation stage (HO-Gen for short) to generate the hand-object interaction image. In the training framework, a part of the source poses are set equal to the corresponding target poses, in which case the generated hand-object interaction image is a Reconstructed Source Image; in the remaining cases the source pose differs from the corresponding target pose, and the generated hand-object interaction image is a Generated Target Image.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
Example two
The invention further provides a hand-object interaction image generation system, implemented mainly based on the method provided by the foregoing embodiment. As shown in FIG. 4, the system mainly includes:

a data acquisition unit, applied in the data acquisition stage: acquiring a source image, a source pose, object information and a target pose; the source pose and the target pose each comprise a hand model, an object model and the interaction pose information of the two;

an occlusion-aware topology modeling unit, applied in the occlusion-aware topology modeling stage: mapping the hand image in the source image to a pre-constructed unified space using the source pose to obtain the transformation flow from the source image space to the unified space; locating the visible texture of the hand by computing the occluded part of the hand in the source image, and obtaining a texture image by combining the transformation flow from the source image space to the unified space; computing the transformation flow from the unified space to the hand-object interaction image space using the target pose, and combining it with the texture image to obtain a preliminary hand image corresponding to the target pose; mapping the object texture contained in the object information to the unified space, and combining the transformation flow from the unified space to the hand-object interaction image space to obtain a preliminary object image corresponding to the target pose; and generating, on the hand-object interaction image plane, a hand topology map and an object topology map corresponding to the target pose;

and a hand-object generator, applied in the hand-object interaction image generation stage: cropping the source image with the hand-object foreground mask to obtain an image without hand and object, and generating a background image by inpainting; generating the hand image corresponding to the target pose from the preliminary hand image and the hand topology map; generating the object image corresponding to the target pose from the preliminary object image and the object topology map; and fusing the background image, the hand image corresponding to the target pose and the object image corresponding to the target pose to generate the hand-object interaction image.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the system is divided into different functional modules to perform all or part of the above described functions.
Example three
The present invention also provides a processing apparatus, as shown in fig. 5, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Furthermore, the processing device comprises at least one input device and at least one output device; within the processing device, the processor, the memory, the input device and the output device are connected through a bus.
In the embodiment of the present invention, the specific types of the memory, the input device, and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical button or a mouse and the like;
the output device may be a display terminal;
the Memory may be a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as a disk Memory.
Example four
The present invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiments.
The readable storage medium in the embodiment of the present invention may be provided in the foregoing processing device as a computer readable storage medium, for example, as a memory in the processing device. The readable storage medium may be various media that can store program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for generating a hand-object interaction image, characterized by comprising:

a data acquisition stage: acquiring a source image, a source pose, object information and a target pose; the source pose and the target pose each comprising a hand model, an object model and the interaction pose information of the two;

an occlusion-aware topology modeling stage: mapping the hand image in the source image to a pre-constructed unified space using the source pose to obtain the transformation flow from the source image space to the unified space; locating the visible texture of the hand by computing the occluded part of the hand in the source image, and obtaining a texture image by combining the transformation flow from the source image space to the unified space; computing the transformation flow from the unified space to the hand-object interaction image space using the target pose, and combining it with the texture image to obtain a preliminary hand image corresponding to the target pose; mapping the object texture contained in the object information to the unified space, and combining the transformation flow from the unified space to the hand-object interaction image space to obtain a preliminary object image corresponding to the target pose; and generating, on the hand-object interaction image plane, a hand topology map and an object topology map corresponding to the target pose;

and a hand-object interaction image generation stage: cropping the source image with the hand-object foreground mask to obtain an image without hand and object, and generating a background image by inpainting; generating the hand image corresponding to the target pose from the preliminary hand image and the hand topology map; generating the object image corresponding to the target pose from the preliminary object image and the object topology map; and fusing the background image, the hand image corresponding to the target pose and the object image corresponding to the target pose to generate the hand-object interaction image.
2. The method for generating a hand-object interaction image according to claim 1, wherein the unified space is a space constructed by unwrapping the surfaces of the hand model and the object model; the hand image in the source image is mapped into the unified space using the source pose, and the transformation flow $T_{u\leftarrow s}$ from the source image space to the unified space is obtained;

wherein, at position $(x,y)$ in the unified space, the transformation flow is expressed as:

$$T_{u\leftarrow s}(x,y) = W_u(x,y) \cdot P_s(F_u(x,y))$$

where $x$ denotes the horizontal coordinate, $y$ denotes the vertical coordinate, $F_u(x,y)$ denotes the face index at position $(x,y)$ in the unified space, $P_s(F_u(x,y))$ denotes the three vertex coordinates of the face with index $F_u(x,y)$ in the source-pose model, and $W_u(x,y)$ denotes the relative weights of the face at position $(x,y)$ in the unified space.
3. The method for generating a hand-object interaction image according to claim 1, wherein locating the visible texture of the hand by computing the occluded part of the hand in the source image, and obtaining the texture image by combining the transformation flow from the source image space to the unified space, comprises:

computing the occlusion as:

$$O_{u\leftarrow s}(x,y) = \big(F_u(x,y) \neq F_s(T_{u\leftarrow s}(x,y))\big)$$

where $x$ denotes the horizontal coordinate, $y$ denotes the vertical coordinate, $O_{u\leftarrow s}(x,y)$ indicates whether position $(x,y)$ is occluded, $F_u(x,y)$ denotes the face index at position $(x,y)$ in the unified space, $T_{u\leftarrow s}(x,y)$ denotes the transformation flow from the source image space to the unified space at position $(x,y)$, and $F_s(T_{u\leftarrow s}(x,y))$ denotes the face index at the source-image position located by $T_{u\leftarrow s}(x,y)$; $F_u(x,y) \neq F_s(T_{u\leftarrow s}(x,y))$ means that the located source-image content has no corresponding face index in the unified space;

locating the visible texture of the hand as $1 - O_{u\leftarrow s}(x,y)$;

combining the transformation flow $T_{u\leftarrow s}$ from the source image space to the unified space, mapping the visible texture of the hand to the unified space to obtain an initial texture image, expressed as:

$$I_u = \mathrm{Warp}(T_{u\leftarrow s}, I_s) \odot (1 - O_{u\leftarrow s})$$

where $I_u$ denotes the initial texture image, and $\odot$ and $\mathrm{Warp}(\cdot)$ denote element-wise multiplication and the warping operation, respectively;

and completing the occluded texture in the initial texture image with the pre-stored hand texture to obtain the final texture image $\hat{I}_u$.
4. The method according to claim 1, wherein computing the transformation flow from the unified space to the hand-object interaction image space using the target pose, and combining it with the texture image to obtain the preliminary hand image corresponding to the target pose, comprises:

computing the transformation flow $T_{t\leftarrow u}$ from the unified space to the hand-object interaction image space using the target pose, where at position $(x',y')$ in the hand-object interaction image space the transformation flow is expressed as:

$$T_{t\leftarrow u}(x',y') = W_t(x',y') \cdot P_u(F_t(x',y'))$$

where $x'$ denotes the horizontal coordinate, $y'$ denotes the vertical coordinate, $F_t(x',y')$ denotes the face index at position $(x',y')$ in the hand-object interaction image space, $P_u(F_t(x',y'))$ denotes the three vertex coordinates, in the unified space, of the face with index $F_t(x',y')$, and $W_t(x',y')$ denotes the relative weights of the face at position $(x',y')$ in the hand-object interaction image space;

and sampling the texture image $\hat{I}_u$ through the transformation flow $T_{t\leftarrow u}$ to obtain the preliminary hand image $I_t$ corresponding to the target pose, expressed as:

$$I_t = \mathrm{Warp}(T_{t\leftarrow u}, \hat{I}_u)$$

where $\mathrm{Warp}(\cdot)$ denotes the warping operation.
5. The method according to claim 1, wherein generating the hand topology map and the object topology map corresponding to the target pose on the hand-object interaction image plane comprises:

synchronously generating the topology map at position $(x',y')$ as:

$$Y_t(x',y') = \mathrm{Bary}(P_u(F_t(x',y')))$$

where $\mathrm{Bary}(\cdot)$ denotes the barycenter of the corresponding face, $(x',y')$ is a position in the hand-object interaction image space, $x'$ denotes the horizontal coordinate, $y'$ denotes the vertical coordinate, $F_t(x',y')$ denotes the face index at position $(x',y')$ in the hand-object interaction image space, and $P_u(F_t(x',y'))$ denotes the three vertex coordinates, in the unified space, of the face with index $F_t(x',y')$;

and obtaining the corresponding hand topology map and object topology map by combining this formula with the hand region and the object region under the target pose.
6. The method for generating a hand-object interaction image according to claim 1, wherein the hand-object interaction image generation stage is realized by a hand-object generator, in which a background branch is provided with a first generation network responsible for generating the background image; an object branch is provided with a second generation network, with the preliminary object image corresponding to the target pose as its input and the object topology map injected into the second generation network through spatially-adaptive normalization, to generate the object image corresponding to the target pose; and a hand branch is provided with a third generation network, with the preliminary hand image corresponding to the target pose as its input and the hand topology map injected into the third generation network through spatially-adaptive normalization, to generate the hand image corresponding to the target pose;

the last-layer features before the output layers of the three generation networks are extracted, and two fusion masks are learned with two convolution layers: a hand mask $M_h$ and a foreground mask $M_f$, referring to the non-occluded hand and the hand-object foreground respectively; the fusion is expressed as:

$$I = (I_h \odot M_h + I_o \odot (1 - M_h)) \odot M_f + I_b \odot (1 - M_f)$$

where $I$ denotes the hand-object interaction image, $I_h$ denotes the hand image corresponding to the target pose, $I_o$ denotes the object image, $I_b$ denotes the background image, and $\odot$ denotes element-wise multiplication.
7. The method for generating a hand-object interaction image according to claim 1 or 6, characterized in that the hand-object interaction image generation stage is realized by a hand-object generator, and the hand-object generator is trained;

during training, a part of the source poses are set equal to the corresponding target poses and processed through the occlusion-aware topology modeling stage and the hand-object interaction image generation stage, the generated hand-object interaction image being a reconstructed source image; the overall training loss comprises three parts:

the first part is the perceptual loss on the generated hand-object interaction image;

the second part is the reconstruction loss on the source image: when the target pose is identical to the source pose, the generated hand-object interaction image is a reconstructed source image, and the reconstruction loss is computed between the reconstructed source image and the acquired source image;

and the third part is the adversarial loss, which constrains the distributions of the generated and real hand-object interaction images.
8. A hand-object interaction image generation system, realized based on the method of any one of claims 1 to 7, characterized by comprising:

a data acquisition unit, applied in the data acquisition stage: acquiring a source image, a source pose, object information and a target pose; the source pose and the target pose each comprising a hand model, an object model and the interaction pose information of the two;

an occlusion-aware topology modeling unit, applied in the occlusion-aware topology modeling stage: mapping the hand image in the source image to a pre-constructed unified space using the source pose to obtain the transformation flow from the source image space to the unified space; locating the visible texture of the hand by computing the occluded part of the hand in the source image, and obtaining a texture image by combining the transformation flow from the source image space to the unified space; computing the transformation flow from the unified space to the hand-object interaction image space using the target pose, and combining it with the texture image to obtain a preliminary hand image corresponding to the target pose; mapping the object texture contained in the object information to the unified space, and combining the transformation flow from the unified space to the hand-object interaction image space to obtain a preliminary object image corresponding to the target pose; and generating, on the hand-object interaction image plane, a hand topology map and an object topology map corresponding to the target pose;

and a hand-object generator, applied in the hand-object interaction image generation stage: cropping the source image with the hand-object foreground mask to obtain an image without hand and object, and generating a background image by inpainting; generating the hand image corresponding to the target pose from the preliminary hand image and the hand topology map; generating the object image corresponding to the target pose from the preliminary object image and the object topology map; and fusing the background image, the hand image corresponding to the target pose and the object image corresponding to the target pose to generate the hand-object interaction image.
9. A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
10. A readable storage medium, storing a computer program, characterized in that the computer program, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202211377250.8A 2022-11-04 2022-11-04 Hand-object interaction image generation method, system, equipment and storage medium Pending CN115494958A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211377250.8A CN115494958A (en) 2022-11-04 2022-11-04 Hand-object interaction image generation method, system, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN115494958A 2022-12-20

Family

ID=85115467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211377250.8A Pending CN115494958A (en) 2022-11-04 2022-11-04 Hand-object interaction image generation method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115494958A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination