CN116958405A - Double-hand reconstruction method, device, equipment and storage medium - Google Patents

Double-hand reconstruction method, device, equipment and storage medium Download PDF

Info

Publication number
CN116958405A
CN116958405A CN202310232179.2A CN202310232179A CN116958405A CN 116958405 A CN116958405 A CN 116958405A CN 202310232179 A CN202310232179 A CN 202310232179A CN 116958405 A CN116958405 A CN 116958405A
Authority
CN
China
Prior art keywords
hand
center
feature representation
graph
subgraph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310232179.2A
Other languages
Chinese (zh)
Inventor
余正谛
黄少立
单影
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310232179.2A priority Critical patent/CN116958405A/en
Publication of CN116958405A publication Critical patent/CN116958405A/en
Pending legal-status Critical Current

Links

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a two-hand reconstruction method, a device, equipment and a storage medium, and belongs to the field of artificial intelligence. The method comprises the following steps: acquiring images of both hands; coding to obtain a parameter diagram, a hand center diagram, a finger joint segmentation diagram and an interactive hand prior diagram of the two-hand image through a feature coding network; generating a global feature representation of the two-hand image based on the hand center graph and the parameter graph through the feature aggregation network; generating a local feature representation of the two-hand image based on the finger joint segmentation map and the parameter map; generating a dependent feature representation based on the hand centrogram and the interactive hand prior map; generating a hand feature representation of the two hands based on the global feature representation, the local feature representation, and the dependent feature representation; the hands are modeled from hand feature representations of the hands. The method supports two-hand reconstruction in any scene.

Description

Double-hand reconstruction method, device, equipment and storage medium
Technical Field
The application relates to the field of artificial intelligence, in particular to a two-hand reconstruction method, a device, equipment and a storage medium.
Background
Two-hand reconstruction plays an important role in various applications such as augmented reality and virtual reality, man-machine interaction, three-dimensional character animation of movies and games, and the like. One simple strategy for early handling of two-handed reconstruction is to locate each hand separately and then reduce the task to a one-handed reconstruction. However, since two hands are in an interactive state, which often causes mutual occlusion, the AI (Artificial Intelligence ) model has difficulty in accurately predicting individual individuals from the mutually occluded two hands.
In the related art, a solution for two-hand reconstruction based on a monocular RGB camera is further provided. Under such solutions, the hands are often considered as a whole, and are simultaneously and uniformly modeled by a highly coupled detection frame-level feature representation, which implicitly encodes the hands interaction state.
However, the solution of reconstructing two hands based on the monocular RGB camera in the related art is very fragile for the situation that the two hands are not fully interacted (such as the image contains the hands truncated by edges, the hands separated from each other or the shielding from the inside and the outside), and the reconstructed three-dimensional model of the two hands has obvious flaws.
Disclosure of Invention
The application provides a two-hand reconstruction method, a device, equipment and a storage medium, which support two-hand reconstruction under any scene. The technical scheme is as follows:
according to one aspect of the present application, there is provided a two-hand reconstruction method, the method comprising:
acquiring images of both hands;
coding to obtain a parameter diagram, a hand center diagram, a finger joint segmentation diagram and an interactive hand prior diagram of the two-hand image through a feature coding network; the parameter diagram at least comprises basic hand parameters of the two hands, and the hand center diagram is used for representing the positions of the hand centers of the two hands; the finger joint segmentation graph is at least used for representing the positions of a plurality of finger joints of the two hands, and the interactive hand prior graph is used for reasoning the interactive relation between the two hands;
Generating a global feature representation of the two-hand image based on the hand center graph and the parameter graph through the feature aggregation network; generating a local feature representation of the two-hand image based on the finger joint segmentation map and the parameter map; generating a dependent feature representation based on the hand centrogram and the interactive hand prior map;
generating a hand feature representation of the two hands based on the global feature representation, the local feature representation, and the dependent feature representation; the hands are modeled from hand feature representations of the hands.
According to one aspect of the present application, there is provided a two-hand reconstruction device, the device comprising:
the acquisition module is used for acquiring the images of the two hands;
the coding module is used for coding to obtain a parameter diagram, a hand center diagram, a finger joint segmentation diagram and an interactive hand prior diagram of the two-hand image through the feature coding network; the parameter diagram at least comprises basic hand parameters of the two hands, and the hand center diagram is used for representing the positions of the hand centers of the two hands; the finger joint segmentation graph is at least used for representing the positions of a plurality of finger joints of the two hands, and the interactive hand prior graph is used for reasoning the interactive relation between the two hands;
the aggregation module is used for generating a global feature representation of the two-hand image based on the hand center graph and the parameter graph through the feature aggregation network; generating a local feature representation of the two-hand image based on the finger joint segmentation map and the parameter map; generating a dependent feature representation based on the hand centrogram and the interactive hand prior map;
The reconstruction module is used for generating hand feature representations of the hands based on the global feature representations, the local feature representations and the dependent feature representations; the hands are modeled from hand feature representations of the hands.
In one embodiment, the aggregation module is further configured to generate a left-hand global feature representation based on the left-hand center subgraph in the hand center graph and the left-hand parameter subgraph in the parameter graph; generating a left-hand local feature representation based on the left-hand segmentation subgraph and the left-hand parameter subgraph in the finger joint segmentation graph; generating a left-hand dependent feature representation based on the right-hand center subgraph in the hand center graph and the left-hand prior subgraph in the interactive hand prior graph; a left hand feature representation is generated based on the left hand global feature representation, the left hand local feature representation, and the left hand dependent feature representation.
In one embodiment, the aggregation module is further configured to generate a right-hand global feature representation based on the right-hand center subgraph in the hand center graph and the right-hand parameter subgraph in the parameter graph; generating a right hand local feature representation based on the right hand segmentation subgraph and the right hand parameter subgraph in the finger joint segmentation graph; generating a right-hand dependent feature representation based on the left-hand center subgraph in the hand center graph and the right-hand prior subgraph in the interactive hand prior graph; a right hand feature representation is generated based on the right hand global feature representation, the right hand local feature representation, and the right hand dependent feature representation.
In one embodiment, the interactive hand prior graph comprises a left hand prior sub-graph comprising base hand parameters of the left hand and a right hand prior sub-graph comprising base hand parameters of the right hand, the left hand dependent feature representation is used to characterize the left hand prior knowledge derived from the right hand theory, and the right hand dependent feature representation is used to characterize the right hand prior knowledge derived from the left hand theory.
In one embodiment, the aggregation module is further configured to convert the left-hand center subgraph in the hand center graph into a left-hand center attention graph through a normalized exponential function; performing pixel level dot multiplication operation on the left-hand center attention diagram and a left-hand parameter subgraph in the parameter diagram; and fully connecting the calculation result of the dot multiplication operation to obtain the left-hand global feature representation.
In one embodiment, the aggregation module is further configured to convert the right-hand center subgraph in the hand center graph into a right-hand center attention graph through a normalized exponential function; performing pixel level dot multiplication operation on the right-hand central attention diagram and a right-hand parameter subgraph in the parameter diagram; and fully connecting the calculation result of the dot multiplication operation to obtain the right-hand global feature representation.
In one embodiment, the apparatus further comprises an update module. The updating module is used for generating an adjusting vector based on the Gaussian kernel size of the left hand center in the left hand center sub-graph, the Gaussian kernel size of the right hand center in the right hand center sub-graph, the position difference between the left hand center and the right hand center and the Euclidean distance between the left hand center and the right hand center; the adjustment vector characterizes the rejection of the left hand center to the right hand center; carrying out weighted summation operation on the position of the left hand center and the adjustment vector to obtain an updated position of the left hand center; the position of the right hand center and the adjustment vector are subjected to weighted difference calculation to obtain the updated position of the right hand center; generating an updated left-hand center subgraph based on the updated left-hand center position; and generating an updated right-hand center subgraph based on the updated right-hand center position.
In one embodiment, the aggregation module is further configured to convert the left-hand segmentation subgraph in the finger joint segmentation graph into a left-hand segmentation attention map through a normalized exponential function; and carrying out Hadamard product operation on the left-hand segmentation attention map and the left-hand parameter subgraph to generate a left-hand local feature representation.
In one embodiment, the aggregation module is further configured to convert the right-hand segmentation subgraph in the finger joint segmentation graph into a right-hand segmentation attention map through a normalized exponential function; and carrying out Hadamard product operation on the right-hand segmentation attention map and the right-hand parameter subgraph to generate a right-hand local feature representation.
In one embodiment, the aggregation module is further configured to convert the right-hand center subgraph in the hand center graph into a right-hand center attention graph through a normalized exponential function; performing pixel-level dot multiplication operation on the right hand center attention diagram and the left hand priori subgraph in the interactive hand priori diagram; and fully connecting the calculation result of the dot multiplication operation to obtain the left-hand dependency characteristic representation.
In one embodiment, the aggregation module is further configured to convert the left-hand center subgraph in the hand center graph into a left-hand center attention graph through a normalized exponential function; performing pixel-level dot multiplication operation on the left-hand center attention diagram and the right-hand prior subgraph in the interactive hand prior diagram; and fully connecting the calculation result of the dot multiplication operation to obtain right-hand dependent characteristic representation.
In one embodiment, the aggregation module is further configured to calculate a euclidean distance between a left-hand center in the left-hand center graph and a right-hand center in the right-hand center graph; generating an interaction threshold according to the Gaussian kernel size of the left hand center and the Gaussian kernel size of the right hand center; setting the interaction intensity coefficient to be zero under the condition that the Euclidean distance is larger than the interaction threshold value; and generating an interaction intensity coefficient according to the interaction threshold and the Euclidean distance under the condition that the Euclidean distance is not larger than the interaction threshold.
In one embodiment, the aggregation module is further configured to multiply the interaction strength coefficient by the left-hand dependent feature representation; splicing the multiplication calculation result with the left-hand global characteristic representation and the left-hand local characteristic representation; and (5) fully connecting the splicing results to obtain the hand characteristic representation of the left hand.
In one embodiment, the aggregation module is further configured to multiply the interaction strength coefficient by the right-hand dependent feature representation; splicing the multiplication calculation result with the right-hand global characteristic representation and the right-hand local characteristic representation; and fully connecting the splicing results to obtain the hand characteristic representation of the right hand.
In one embodiment, the left-hand parameter subgraph comprises a gesture parameter of the left hand, a morphological parameter of the left hand and a weak perspective camera parameter corresponding to the left hand; the right hand parameter subgraph comprises a right hand posture parameter, a right hand morphology parameter and a right hand corresponding weak perspective camera parameter.
In one embodiment, the finger segment segmentation map is a probabilistic segmenter comprising a left-hand probabilistic segmenter corresponding to the left-hand segmentation map, a right-hand probabilistic segmenter corresponding to the right-hand segmentation map, and a background dimension; a voxel on the left-hand probability segmentation body represents a probability logic channel of a plurality of knuckle categories corresponding to the left hand; one voxel on the right-hand probability segmentation body represents one probability logic channel of a plurality of knuckle categories corresponding to the right hand; pixels in the background dimension characterize the probability of being in the background region.
In one embodiment, the left-hand prior subgraph includes a pose parameter of the left hand, a morphology parameter of the left hand, and a weak perspective camera parameter corresponding to the left hand; the right-hand prior subgraph comprises a right-hand posture parameter, a right-hand morphological parameter and a right-hand corresponding weak perspective camera parameter.
In one embodiment, the reconstruction module is further configured to input the hand feature representation of the two hands into the parameterized model, and regress to obtain a reconstructed three-dimensional model of the two hands.
In one embodiment, the apparatus further comprises a training module. The training module is also used for training a feature coding network, a feature aggregation network and a parameterized model according to the loss of the hand center subgraph, the loss of the finger joint segmentation graph and the loss of the two-hand three-dimensional model.
In one embodiment, the training module is further configured to sum a loss between the left-hand center sub-graph and the tag left-hand center sub-graph with a loss between the right-hand center sub-graph and the tag right-hand center sub-graph to obtain a first sub-loss; calculating the loss between the finger segment segmentation map and the label finger segment segmentation map to obtain a second sub-loss; carrying out weighted summation on the loss of the attitude parameters and the loss of the morphological parameters in the two-hand three-dimensional model and the joint loss to obtain a third sub-loss; joint loss includes loss of position of a three-dimensional joint, loss of position of a two-dimensional joint, and loss of bone length; the first sub-loss, the second sub-loss and the third sub-loss are weighted and summed to obtain a target loss; training a feature encoding network, a feature aggregation network and a parameterized model according to the target loss.
According to one aspect of the present application, there is provided a computer apparatus comprising: a processor and a memory storing a computer program that is loaded and executed by the processor to implement the two-hand reconstruction method as described above.
According to another aspect of the present application, there is provided a computer readable storage medium storing a computer program loaded and executed by a processor to implement the two-hand reconstruction method as described above.
According to another aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium and executes the computer instructions to cause the computer device to perform the two-hand reconstruction method described above.
The technical scheme provided by the embodiment of the application has the beneficial effects that at least:
and obtaining a parameter map, a hand center map, a finger joint segmentation map and an interactive hand priori map through feature coding network coding, obtaining global feature representation, local feature representation and dependency feature representation through feature aggregation network aggregation, obtaining hand feature representation of hands according to the three feature representations, and reconstructing the hands according to the hand feature representation. In the reconstruction process, the dependency relationship between the hands is explicitly reduced by the hand center graph, the dependency relationship between a plurality of knuckles in the hands is explicitly reduced by the knuckle segmentation graph, and the reduction of the dependency relationship is beneficial to releasing input constraint, but also reduces interaction between the hands in an interaction state. Therefore, the application also designs the interactive hand priori graph which is used for reasoning and obtaining the interactive relation between the two hands in the interactive state. Based on the design of the hand center graph, the finger joint segmentation graph and the interactive hand prior graph, the two-hand reconstruction process provided by the application can support two-hand images in any scene.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram showing the comparison of the effects of the two-hand reconstruction method using the related art and the present application;
FIG. 2 is a schematic diagram of the principles provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a two-hand reconstruction method provided by an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of an effect contrast of whether or not an interactive hand prior map is used, provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of feature aggregation provided by an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of a two-hand reconstruction method provided by an exemplary embodiment of the present application;
FIG. 7 is a flowchart of a method for aggregating left-hand global feature representations provided by an exemplary embodiment of the present application;
FIG. 8 is a flowchart of a method for aggregating right-hand global feature representations provided by an exemplary embodiment of the present application;
FIG. 9 is a flowchart of a method for aggregating left-hand local feature representations provided by an exemplary embodiment of the present application;
FIG. 10 is a flowchart of a method for aggregating right-hand local feature representations provided by an exemplary embodiment of the present application;
FIG. 11 is a flowchart of a method for aggregating left-hand dependency feature representations provided in accordance with an exemplary embodiment of the present application;
FIG. 12 is a flowchart of a method for aggregating right-hand dependent feature representations provided by an exemplary embodiment of the present application;
FIG. 13 is a flowchart of a method of generating interaction strength coefficients provided by an exemplary embodiment of the present application;
FIG. 14 is a flowchart of a method of updating a hand center provided by an exemplary embodiment of the present application;
FIG. 15 is a flowchart of a training method provided by an exemplary embodiment of the present application;
FIG. 16 is a schematic diagram showing the comparison of the effects of the two-hand reconstruction method using the related art and the present application;
FIG. 17 is a schematic diagram showing the effect contrast of the two-hand reconstruction method using the related art and the present application;
FIG. 18 is a block diagram of a two-hand reconstruction device provided in accordance with an exemplary embodiment of the present application;
fig. 19 is a block diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
First, the terms involved in the embodiments of the present application will be briefly described:
artificial intelligence (Artificial Intelligence, AI): the system is a theory, a method, a technology and an application system which simulate, extend and extend human intelligence by using a digital computer or a machine controlled by the digital computer, sense environment, acquire knowledge and acquire an optimal result by using the knowledge. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Hand pose estimation and morphological reconstruction (Hand Pose Estimation and Shape Recovery): the hand gesture estimation and the morphological reconstruction based on computer vision are important links for realizing man-machine interaction, and aim to accurately recover the three-dimensional hand gesture and reconstruct a real and reasonable three-dimensional hand model.
End-to-End Learning: the input of the neural network is the original data, the output is the final result, and no additional intermediate result or preprocessing and post-processing process are needed.
Attention mechanism (Attention): the learning needs to pay attention to specific region information for information aggregation and purification, and the deep learning model is used for combined training to improve the high efficiency of feature representation, and the technology is widely applied to the fields of natural language processing (Natural Language Processing, NLP), computer Vision (CV) and the like.
The related art of the present application is described as follows:
two-hand three-dimensional pose estimation and shape reconstruction based on monocular RGB cameras play an important role in various emerging applications, such as augmented Reality (Augmented Reality, AR) and Virtual Reality (VR), human-machine interaction, three-dimensional character animation of movies and games, and the like. However, this task is very challenging due to limited marker data, occlusion, depth blur, etc. Wherein, because the mutual shielding and ambiguity problems are very easy to be generated in the interaction process of the two hands, the complexity and difficulty of the interactive two-hand reconstruction are far superior to those of the single-hand reconstruction.
One simple strategy for early handling of two-handed reconstruction is to locate each hand separately and then reduce the task to a one-handed reconstruction. This strategy is generally widely adopted in the whole body motion capture and reconstruction framework. However, this strategy of reconstructing two hands independently is very prone to failure in handling the case of two-hand interactions, because closely spaced hands often result in mutual occlusion, while models usually reconstruct only two hands each as an independent individual, so it is easy to predict with this confusion model, causing unavoidable ambiguity problems.
To better address the case of two-hand interactions, some early efforts have addressed the problem of interactive hand prediction by model fitting, multi-view cameras, or depth cameras. In addition, a proposal is also provided, which provides a hand capturing system based on a binocular RGBD camera and provides an implicit model of hand geometry so as to facilitate model optimization. Another approach has been proposed that further simplifies the system by a single depth camera, helping hand pose and morphology prediction by predicting the segmentation map of the hand and voxel-pixel pairs. Another approach has been proposed that uses a multi-view RGB camera system to compute hand keypoints and 3D scans for grid fitting. To handle interactions and occlusion, a physical-based deformable model is introduced that improves the robustness of vision-based reconstruction algorithms.
Recent research efforts have shifted to reconstructing the interacting hands directly based on monocular RGB cameras. Under one approach, two-hand modeling is performed by directly learning a feature representation based on the coupling at the detection box level. Under another scheme, a depth convolutional neural network based on multi-task learning is proposed, which predicts multi-source complementary information from RGB images to reconstruct two interacting hands. Under another approach, a two-stage framework is proposed that first obtains an initial prediction and then performs a factorization refinement procedure to prevent the occurrence of a two-hand through-mold collision. Under another approach, the initial pose and shape are predicted from deeper features and the regression with lower level features is gradually refined. Under another scheme, a regression network based on a graph convolution network is proposed, which uses pyramid features and learns implicit attention to solve occlusion and interaction problems. However, all existing interactive two-hand reconstruction methods mainly treat two hands as a whole, and implicitly learn the coupled representation to encode the two-hand interaction, such highly coupled characteristic representation would be very vulnerable to incomplete interaction situations, such as edge truncated hands, mutually separated hands or occlusion from inside and outside, where the reconstructed two-hand three-dimensional model has more obvious flaws.
Referring to fig. 1 in combination, fig. 1 shows a case where three two-hand images are input, the three two-hand images corresponding to the hands truncated by the edge, the two hands separated from each other, and the two hands blocked each other in order, respectively. FIG. 1 shows a two-hand three-dimensional model reconstructed using Intaghand (a method of performing two-hand reconstruction in the related art), from which it can be seen from FIG. 1 that Intaghand will generate an erroneous two-hand three-dimensional model for an edge truncated two-hand image; for the two-hand images which are separated from each other, the generated two-hand three-dimensional model has poor fitting degree between the outline and the correct outline; for a two-hand image with two hands shielded from each other, an inexistent finger will be generated between the three-dimensional models of the left hand and the right hand; that is, the intag hand has obvious flaws for the case of the hands cut by the edge, the hands separated from each other, and the hands blocked from each other. FIG. 1 also shows a two-hand three-dimensional model reconstructed by the method of the application, and compared with Intaghand, the method can be used for obviously improving flaws of the two-hand three-dimensional model reconstructed under three conditions. Fig. 1 also shows an effect diagram of a field demonstration and a natural scene demonstration by the method of the application.
FIG. 2 illustrates a computer system provided by an exemplary embodiment of the present application. The computer system includes a training device 201 for the AI model and a use device 202 for the AI model. The AI model is a neural network model for performing a two-hand reconstruction task, and in the present application includes at least a feature encoding network 21 and a feature aggregation network 22. The training device 201 of the AI model and the using device 202 of the AI model are connected in a wired or wireless manner, the training device 201 transmits the AI model obtained by training to the using device 202, and the using device 202 performs two-hand reconstruction by the AI model.
Referring in conjunction with fig. 2, fig. 2 illustrates a flow for performing a two-hand reconstruction with an AI model using device 202.
After acquiring the two-hand image, feature encoding network 21 will generate a hand center map 203, an interactive hand prior map 204, a parameter map 205, and a finger joint segmentation map 206 of the two hands. Each pixel on the hand center map 203 is used to characterize the likelihood that the hand center is at that pixel location, and the hand center map 203 is used to decouple the dependency between the hands; the interactive hand prior diagram 204 is used to perform mutual reasoning between the hands; the parameter map 205 at least contains basic hand parameters of the hands, such as posture parameters and morphological parameters of the hands; each pixel on the finger segment segmentation map 206 is used to characterize the likelihood that multiple fingers are located at that pixel location, alternatively, the finger segment map is a probabilistic segmenter, and each voxel corresponds to the likelihood that multiple fingers are located at that voxel location, i.e., the finger segment segmentation map 206 is used to decouple the dependency relationship between multiple fingers inside the hand.
After generating the four graphs described above, the feature aggregation network 22 will generate a global feature representation 207 from the hand center graph 203 and the parameter graph 205; generating a local feature representation 208 from the finger joint segmentation map 206 and the parameter map 205; from the hand center graph 203 and the interactive hand prior graph 204, a dependent feature representation 209 is generated. It will be appreciated that the hand center map 203 will be used as an attention map, focusing on the parameter region indicated by the hand center in the parameter map 205, and performing data refinement on the parameter map 205 to obtain the global feature representation 207; the finger joint segmentation map 206 will be used as an attention map, and the parameter areas indicated by the finger joints in the parameter map 205 are focused on, and the data of the parameter map 205 is purified to obtain a local feature representation 208. The hand center graph 203 also serves as an attention graph, focuses on the parameter area indicated by the hand center in the interactive hand prior graph 204 to obtain a dependency feature representation 209, for example, a left hand center graph is adopted, focuses on the parameter area indicated by the right hand in the interactive hand prior graph 204 to obtain a right hand dependency feature representation; otherwise, the same is true. That is, the interactive hand prior map 204 is used to perform mutual reasoning between the hands.
In the two-hand reconstruction flow shown in fig. 2, a hand feature representation 210 of the two hands is also obtained from the global feature representation 207, the local feature representation 208, and the dependent feature representation 209; a two-hand reconstruction model is reconstructed from the hand feature representation 210. Optionally, the hand feature representation 210 is input into a parameterized model (e.g., a MANO model) to obtain a two-hand reconstruction model.
Alternatively, the training device 201 and the using device 202 may be the same computer device, or the training device 201 and the using device 202 may be different computer devices, for example, the training device 201 is a server, and the using device 202 is a terminal. The training device 201 and the using device 202 may be the same type of device, e.g. the training device 201 and the using device 202 may both be servers or both terminals. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), basic cloud computing services such as big data and artificial intelligence platforms, and the like. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
Fig. 3 shows a flow chart of a two-hand reconstruction method provided by an exemplary embodiment of the present application, illustrated by the execution of the method by the use device 202 shown in fig. 2, the method comprising:
step 310, acquiring a two-hand image;
a two-hand image refers to an input image for performing the two-hand reconstruction method provided by the present application. Optionally, the two-hand image adopted by the application is a hand image shot by a monocular RGB camera. In the present application, a two-hand image refers to an image containing at least two hands, and hereinafter, a reconstructed scene containing only two hands will be described, and a person skilled in the art can similarly extend to a reconstructed scene greater than two hands according to a reconstruction scheme for two hands provided below, for example, three hands are in a state of being separated from each other, three hands are in an interaction state, and the like. Alternatively, the hands included in the two-hand image may be hands from the same object or hands from different objects.
In the application, the hands in the hand image can be in any state, for example, the hands in the interaction state (such as the mutual buckling of the two fingers), the hands in the separation state, the hands cut off by the edges (the hands are cut off by the edges of the image), and the hands blocked by the inner part and the outer part (such as the mutual blocking of the hands or the blocking of the hands by other objects).
The interaction state refers to a state that two hands are in contact with each other, for example, a handshake state, a boxing state, a clapping state, a ten-finger buckling state, and the like. The separated state refers to a state where the hands are not contacted, and the skin surface contact is not generated between the left hand and the right hand.
Step 320, coding to obtain a parameter map, a hand center map, a finger joint segmentation map and an interactive hand prior map of the two-hand image through a feature coding network;
and the parameter diagram at least comprises basic hand parameters of the two hands. The parameter map will be referred to in the present application as a map containing a basic feature representation of both hands. Alternatively, the parameter map may be divided into two parts, a left-hand parameter sub-map and a right-hand parameter sub-map. Optionally, a parameter map M P ∈R 218×H×W The first 109 parameter dimensions are left-hand parameter subgraphs, the last 109 parameter dimensions are right-hand parameter subgraphs, and H and W represent the height and width of the feature matrix. Each 109 parameter dimensions is used to describe the state of one hand, such as the hand holding a fist. Optionally, the base hand parameters include pose parameters and morphological parameters. Optionally, each 109 parameter dimensions also includes a weak perspective camera parameter (s, t x ,t y ) Where s denotes the scaled size of the two-dimensional projection of each hand onto the image, (t x ,t y ) The displacement of the two-dimensional projection of each hand in the X-direction (horizontal axis direction) and the Y-direction (vertical axis direction) of the pixel coordinate system is shown, respectively.
A hand center map for characterizing the position of the hand centers of both hands. Optionally, hand center diagram A C ∈R 2×H×W H and W represent the height and width of the feature matrix. Alternatively, the hand center graph may be divided into a left hand center sub-graph and a right hand center sub-graph. The center plot of each hand may be represented as A C ∈R 1×H×W . For either the left-hand center sub-graph or the right-hand center sub-graph, it will be learned during the training process as a two-dimensional Gaussian heat map, with each pixel on the two-dimensional Gaussian heat map representing the likelihood that the hand center is located at that two-dimensional pixel location. Alternatively, the hand center is a center determined according to a preset rule, such as a center defining the hand center as the center of all visible MCP joints (metacarpophalangeal joints). It can be appreciated that the hand centroids explicitly decouple features between the hands, reducing dependencies between the hands.
A finger joint segmentation map is used to characterize at least the locations of a plurality of finger joints of a hand. Schematically, a finger segment segmentation map A P ∈R 33×H×W H and W represent the height and width of the feature matrix. The finger segment segmentation map is learned as a segmentation probability volume, with each voxel on the segmentation probability volume being a probability logical channel over 33 categories. Alternatively, the knuckle segmentation graph may be divided into a left-hand probability segment corresponding to a left-hand segmentation sub-graph, a right-hand probability segment corresponding to a right-hand segmentation sub-graph, and a background dimension. A voxel on the left-hand probability segmentation body represents a probability logic channel of a plurality of knuckle categories (such as 16 knuckle categories) corresponding to the left hand; right hand A voxel on the probability segmentation body represents a probability logic channel of a plurality of knuckle categories (such as 16 knuckle categories) corresponding to the right hand; pixels in a background dimension (e.g., one background dimension) characterize the probability of being in a background region. Schematically, the left-hand segmentation probability volume is denoted as R 16×H×W Each voxel of the segmentation probability volume represents the probability of 16 knuckles at that voxel location. Alternatively, 16 knuckles correspond to 16 joints where the parameterized model (alternatively, MANO model) below supports input. Optionally, during the data production process, a mask of the finger joint segmentation map is obtained by rendering a real MANO hand grid by the microneural renderer. Illustratively, the 16 joints include one wrist joint and three inter-finger joints on each of the five fingers.
And the interactive hand priori graph is used for reasoning the interactive relation between the two hands. Optionally, when the distance between the hands is relatively close, the interaction relationship between the hands is relatively strong; when the distance between the hands is long, the interactive relationship between the hands is weak. Optionally, the interactive hand prior map is consistent with the data contained in the parameter map, i.e. the interactive hand prior map includes at least basic hand parameters of both hands. Alternatively, the interactive hand prior graph may be divided into a left hand prior sub-graph and a right hand prior sub-graph. The interactive hand prior diagram is used for executing feature query of the interactive hand, and provides strong mutual reasoning capability under the two-hand interactive scene. Optionally, the interactive hand prior map M C ∈R 218×H×W The front 109 dimension represents the left-hand prior subgraph and the rear 109 dimension represents the right-hand prior subgraph. Wherein a priori subgraph of a single handFurther comprising two parts: attitude parameter theta epsilon R 16*3 And morphological parameter beta E R 10 And weak perspective camera parameters (s, t x ,t y )。
Step 330, generating a global feature representation of the two-hand image based on the hand center map and the parameter map through the feature aggregation network;
it can be appreciated that the feature aggregation network takes the hand center map as an attention map (or referred to as an attention mask), focuses on the parameter region indicated by the hand center in the parameter map, and performs data purification on the parameter map to obtain a global feature representation of the two-hand image. Global feature representation for characterizing the features of both hands after de-reliance. Thus, the global feature represents the global feature that will be of interest to the two-hand image. The global feature representation obtained at this time decouples the dependency between the hands, but only has instability that the global feature representation can cause occlusion and does not have the ability to recover hand details.
Step 340, generating a local feature representation of the two-hand image based on the finger joint segmentation map and the parameter map through the feature aggregation network;
in order to solve the problem that only the global feature representation still exists, the feature aggregation network takes the finger joint segmentation map as another attention map (or called attention mask), focuses on the parameter areas indicated by a plurality of fingers in the parameter map, and performs data purification on the parameter map to obtain the local feature representation of the hand. A local feature representation for characterizing a plurality of knuckles inside the hand after de-reliance. Thus, the local feature represents the hand detail that will be of interest to the two-hand image. The local feature characterization obtained at this time further decouples the dependency between the multiple knuckles inside the hand.
Step 350, generating a dependent feature representation based on the hand center graph and the interactive hand prior graph through the feature aggregation network;
in the above steps, although the dependency between both hands and the dependency between the plural knuckles inside the hands are decoupled, the states of both hands are closely highly correlated when both hands are in a tightly interactive scene. Simply taking the decoupled hands and the multiple knuckles inside the hands as final characteristic representations will reduce the mutual reasoning relationship between the hands in the reconstructed interaction scene.
Therefore, the application also uses the hand centre graph as an attention graph again, and queries out the dependent characteristic representation through the interactive hand prior graph as the characteristic representation after mutual reasoning. For example, using a left-hand center graph as an attention graph, a right-hand dependent feature representation is queried by a right-hand prior sub-graph as right-hand prior knowledge that is processed from the left-hand graph. Alternatively, the right-hand center graph is used as an attention graph, and the left-hand dependent feature representation is queried through the left-hand prior subgraph as left-hand prior knowledge which is processed according to the right hand graph. The dependency feature representation is used to characterize features derived after mutual reasoning of the hands. Thus, the dependent feature represents the interaction relationship between the hands that will be focused on the two-hand image.
Fig. 4 shows a two-hand three-dimensional model reconstructed using the complete method provided by the application, and a two-hand three-dimensional model reconstructed by a method in which only the interactive hand prior map is missing compared with the complete method. It can be seen that for the two-hand image with the two hands in the interaction state, the complete method can accurately recover the independent individuals of the two hands, while for the method of missing the prior image of the interaction hand, the thumb of the left hand of the first image is displayed in front of the right hand, and the finger staggered in the second image has the phenomenon of penetrating through the mould. Thus, it can be determined that the interactive hand prior map explicitly helps infer and recover the correlation between the closely interacting hands.
Step 360, generating a hand feature representation of the two hands based on the global feature representation, the local feature representation, and the dependent feature representation;
and splicing the global feature representation, the local feature representation and the dependency feature representation of the two hands to obtain the hand feature representation of the two hands. The hand feature representation may be used for two-hand modeling in any two-hand scenario.
Step 370, modeling the hands based on the hand feature representations of the hands.
In one embodiment, the hand feature representation is input into a parameterized model (e.g., a MANO model) that will recover a three-dimensional model of the hands and the positions of the nodes. And the parameterized model is used for obtaining a two-hand three-dimensional model by regression according to the input hand parameters. Alternatively, the parameterized model has differentiable properties, thus supporting gradient back propagation to achieve model training.
In summary, the parameter map, the hand center map, the finger joint segmentation map and the interactive hand prior map are obtained through feature coding network coding, the global feature representation, the local feature representation and the dependent feature representation are obtained through feature aggregation network aggregation, and the hand feature representation of the two hands is obtained according to the three feature representations, so that the two-hand reconstruction can be performed according to the hand feature representation. In the reconstruction process, the dependency relationship between the hands is explicitly reduced by the hand center graph, the dependency relationship between a plurality of knuckles in the hands is explicitly reduced by the knuckle segmentation graph, and the reduction of the dependency relationship is beneficial to releasing input constraint, but also reduces interaction between the hands in an interaction state. Therefore, the application also designs the interactive hand priori graph which is used for reasoning and obtaining the interactive relation between the two hands in the interactive state. Based on the design of the hand center graph, the finger joint segmentation graph and the interactive hand prior graph, the two-hand reconstruction process provided by the application can support two-hand images in any scene.
Based on the alternative embodiment shown in fig. 3, the hand center graph generated by the feature encoding network includes a left hand center graph and a right hand center graph; the finger joint segmentation graph comprises a left hand segmentation subgraph and a right hand segmentation subgraph; the parameter diagram comprises a left-hand parameter sub-diagram and a right-hand parameter sub-diagram; the interactive hand prior graph comprises a left hand prior sub-graph and a right hand prior sub-graph. Step 330, step 340, step 350 and step 360 may be replaced with the following:
Referring in conjunction with fig. 5, fig. 5 shows a left hand feature representation and a right hand feature representation aggregated by a feature aggregation network. Generating a left-hand global feature representation through feature aggregation based on a left-hand center sub-graph in the hand center graph and a left-hand parameter sub-graph in the parameter graph; generating a left-hand local feature representation through feature aggregation based on the left-hand segmentation subgraph and the left-hand parameter subgraph in the finger joint segmentation graph; generating a left-hand dependent feature representation through feature aggregation based on a right-hand center subgraph in the hand center graph and a left-hand priori subgraph in the interactive hand priori graph; generating a left hand feature representation by feature stitching based on the left hand global feature representation, the left hand local feature representation and the left hand dependent feature representation;
referring to fig. 5 in combination, a right-hand global feature representation is generated by feature aggregation based on a right-hand center subgraph in the hand center graph and a right-hand parameter subgraph in the parameter graph; generating a right hand local feature representation through feature aggregation based on a right hand segmentation subgraph and a right hand parameter subgraph in the finger joint segmentation graph; generating right-hand dependent feature representation through feature aggregation based on a left-hand center subgraph in the hand center graph and a right-hand priori subgraph in the interactive hand priori graph; and generating a right hand feature representation through feature stitching based on the right hand global feature representation, the right hand local feature representation and the right hand dependent feature representation.
Fig. 6 shows a schematic diagram of a two-hand reconstruction method provided by an exemplary embodiment of the present application.
Acquiring a two-hand image, wherein the two hands in the two-hand image are in any scene (namely any scene of an interaction state, a separation state, an edge cut-off state and a blocked state, and in an interaction state in fig. 6).
Inputting the two-hand image into a main network 61, wherein the main network 61 is used for carrying out preliminary feature extraction to obtain initial features F E R CxHxW C is the feature dimension, H and W, representing the two-dimensional coordinates of the pixels in the two-hand image. Hand centrograms, interactive hand prior maps, parametric maps and finger segment segmentation maps are extracted from the initial features, respectively, by a feature encoding network 62 (specifically four convolution operations). The hand center graph includes left and right hand center subgraphs, the interactive hand prior graph includes left and right hand prior subgraphs (not shown in fig. 6), the parameter graph includes left and right hand parameter subgraphs (not shown in fig. 6), and the knuckle segmentation graph includes left and right hand segmentation subgraphs. It may be noted that the finger segment map is a probabilistic logic body, and includes a plurality of finger segment categories, and the finger segment map is used for indicating positions of a plurality of finger segments; while the hand center map contains only one hand center category, the hand center map is used to indicate the location of the hand center.
In the feature aggregation network 63, after the pixel level multiplication is performed on the left-hand center subgraph and the left-hand parameter subgraph, channel level summation is performed to obtain a left-hand global feature representation; carrying out pixel level multiplication on the right-hand center sub-graph and the right-hand parameter sub-graph, and then carrying out channel level summation to obtain a right-hand global feature representation; carrying out pixel level multiplication on the left-hand segmentation sub-graph and the left-hand parameter sub-graph, and then carrying out multichannel level summation to obtain a left-hand local feature representation; carrying out pixel level multiplication on the right-hand segmentation sub-graph and the right-hand parameter sub-graph, and then carrying out multichannel level summation to obtain a right-hand local feature representation; carrying out pixel level multiplication on the right hand center subgraph and the left hand priori subgraph, and then carrying out channel level summation to obtain left hand dependency characteristic representation; after the pixel level multiplication is carried out on the left hand center subgraph and the right hand priori subgraph, channel level summation is carried out, and right hand dependency characteristic representation (characteristic aggregation stage) is obtained;
The left-hand global feature representation, left-hand local feature representation and left-hand dependent feature representation are passed through a multi-layer perceptron to obtain the left-hand feature representation, which contains the left-hand pose parameters, left-hand morphology parameters and left-hand weak perspective camera parameters (feature stitching stage).
Likewise, the right-hand global feature representation, right-hand local feature representation and right-hand dependent feature representation are passed through a multi-layer perceptron to obtain the right-hand feature representation, which contains the right-hand pose parameters, right-hand morphology parameters and right-hand weak perspective camera parameters.
The left hand pose parameters, left hand morphology parameters, left hand weak perspective camera parameters, right hand pose parameters, right hand morphology parameters, and right hand weak perspective camera parameters are input into the MANO model 64 for modeling (reconstruction stage).
The process of the feature aggregation stage, the feature stitching stage and the reconstruction stage will be described in detail below.
The feature aggregation stage includes an aggregate global feature representation stage, an aggregate local feature representation stage, and an aggregate dependent feature representation stage.
Aggregate global feature representation phase: based on the alternative embodiment shown in fig. 3, the generation of the global feature representation of the left hand in step 330 may be replaced with the method steps in fig. 7. FIG. 7 illustrates a flow chart of an aggregation method of global feature representations of a left hand provided by an exemplary embodiment of the present application. The method comprises the following steps:
step 710, converting the left-hand center subgraph into a left-hand center attention map through a normalized exponential function;
Step 720, performing pixel level dot product operation on the left hand center attention map and the left hand parameter subgraph;
and step 730, fully connecting the calculation result of the dot multiplication operation to obtain a left-hand global feature representation.
Based on the alternative embodiment shown in fig. 3, the generation of the right-hand global feature representation in step 330 may be replaced with the method steps in fig. 8. FIG. 8 illustrates a flow chart of an aggregation method of right-hand global feature representations provided by an exemplary embodiment of the present application. The method comprises the following steps:
step 810, converting the right-hand center subgraph into a right-hand center attention map through a normalized exponential function;
step 820, performing pixel level dot product operation on the right hand center attention map and the right hand parameter sub-map;
and 830, fully connecting the calculation result of the dot multiplication operation to obtain a right-hand global feature representation.
The above-described global feature aggregation method for the left hand shown in fig. 7 or the global feature aggregation method for the right hand shown in fig. 8 may be expressed as:
$$F_h^G = f\left(\sigma(A_h^C) \odot F_h^{param}\right), \quad h \in \{L, R\};$$

wherein $h \in \{L, R\}$ denotes the left and right hands respectively, $A_h^C$ is the hand center map, $F_h^{param}$ is the parameter map, $F_h^G$ is the global feature representation, $\sigma$ is the spatial softmax function, $f(\cdot)$ is the fully connected operation, and $\odot$ is multiplication at the pixel level.
Thus, the global feature representation $F^G$ can be obtained by the above method.
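The aggregation of steps 710–730 (and, symmetrically, steps 810–830) can be sketched as follows; the tensor shapes and the caller-supplied fully connected layer are implementation assumptions, not details fixed by the text.

```python
import torch.nn.functional as F

def aggregate_global(center_map, param_map, fc):
    """Global feature aggregation sketch: F_h^G = fc(softmax(A_h^C) * F_h^param).

    center_map: (B, 1, H, W) one hand's center subgraph
    param_map:  (B, C, H, W) the same hand's parameter subgraph
    fc:         a torch.nn.Linear(C, C) supplied by the caller (assumed)
    """
    b, c, h, w = param_map.shape
    # spatial softmax turns the center subgraph into an attention map
    attn = F.softmax(center_map.flatten(2), dim=-1).view(b, 1, h, w)
    # pixel-level multiplication, then channel-wise spatial summation
    feat = (attn * param_map).flatten(2).sum(dim=-1)   # (B, C)
    return fc(feat)                                    # fully connected output
```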
Aggregate local feature representation phase: based on the alternative embodiment shown in fig. 3, the generation of the left-hand local feature representation in step 340 may be replaced with the method steps in fig. 9. FIG. 9 illustrates a flowchart of an aggregation method of the local feature representation of the left hand provided by an exemplary embodiment of the present application. The method comprises the following steps:
step 910, converting the left-hand segmentation subgraph in the finger joint segmentation graph into a left-hand segmentation attention map through a normalized exponential function; optionally, in the left hand segmentation subgraph, the left hand segmentation subgraph further includes a background dimension corresponding to the left hand in addition to 16 finger category dimensions.
Step 920, performing hadamard product operation on the left-hand segmentation attention map and the left-hand parameter subgraph, and generating a left-hand local feature representation.
Based on the alternative embodiment shown in fig. 3, the generation of the right-hand local feature representation in step 340 may be replaced with the method steps in fig. 10. FIG. 10 illustrates a flowchart of an aggregation method of the local feature representation of the right hand provided by an exemplary embodiment of the present application. The method comprises the following steps:
step 1010, converting the right hand segmentation subgraph in the finger joint segmentation graph into a right hand segmentation attention graph through a normalized exponential function; optionally, in the right hand segmentation subgraph, the right hand segmentation subgraph further includes a background dimension corresponding to the right hand in addition to 16 finger category dimensions.
And 1020, carrying out Hadamard product operation on the right-hand segmentation attention map and the right-hand parameter subgraph to generate a right-hand local feature representation.
Specifically, the left-hand local feature representation and the right-hand local feature representation follow the same formula:

$$F^P_{(h,w)} = \sigma(A^S)_{(h,w)} \odot F^{param}_{(h,w)};$$

wherein $A^S$ is the finger joint segmentation map, $F^{param}$ is the parameter map, $F^P$ is the final local feature representation, $(h, w)$ are the two-dimensional pixel coordinates of the two-hand image, $\odot$ denotes the Hadamard product operation, and $\sigma$ is a spatial softmax function.

The Hadamard product operation is realized by means of tensor reshaping, and the reshaped local feature representation $F^P$ is:

$$F^P = \sigma(\bar{A}^S)^{\top} F^{param} \in \mathbb{R}^{N \times C};$$

wherein $F^{param} \in \mathbb{R}^{C \times HW}$ is the reshaped parameter map, $\bar{A}^S \in \mathbb{R}^{N \times HW}$ is the finger joint segmentation map without the background dimension, $\top$ denotes the matrix transpose, and $\mathbb{R}$ denotes the real number domain (the reshaped dimensions are reconstructed from the surrounding definitions).
Thus, the local feature representation $F^P$ can be obtained by the above method.
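A sketch of this reshaping-based aggregation follows; realizing the softmax-weighted Hadamard product and spatial summation as one batched matrix multiplication is the implementation implied by the transpose formula above, with shapes assumed for illustration.

```python
import torch
import torch.nn.functional as F

def aggregate_local(seg_map, param_map):
    """Part-based aggregation via tensor reshaping: F^P = softmax(S)^T P.

    seg_map:   (B, N, H, W) knuckle segmentation subgraph, background removed
    param_map: (B, C, H, W) parameter subgraph
    returns:   (B, N, C) one aggregated feature per knuckle class
    """
    b, n, h, w = seg_map.shape
    attn = F.softmax(seg_map.flatten(2), dim=-1)       # spatial softmax, (B, N, HW)
    params = param_map.flatten(2)                      # (B, C, HW)
    # Hadamard product + spatial summation, realized as one batched mat-mul
    return torch.bmm(attn, params.transpose(1, 2))     # (B, N, C)
```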
Aggregation-dependent feature representation phase: based on the alternative embodiment shown in fig. 3, the generation of the left-hand dependency feature representation in step 350 may be replaced with the method steps in fig. 11. FIG. 11 illustrates a flowchart of an aggregation method for left-hand dependency feature representation provided by an exemplary embodiment of the present application. The method comprises the following steps:
step 1110, converting the right-hand center subgraph into a right-hand center attention map through a normalized exponential function;
Step 1120, performing pixel level dot product operation on the right hand center attention map and the left hand prior sub-map;
and 1130, fully connecting the calculation result of the dot multiplication operation to obtain a left-hand dependency characteristic representation.
Based on the alternative embodiment shown in fig. 3, the generation of the right-hand dependency feature representation in step 350 may be replaced with the method steps in fig. 12. FIG. 12 illustrates a flow chart of an aggregation method of right-hand dependent feature representations provided by an exemplary embodiment of the present application. The method comprises the following steps:
step 1210, converting the left-hand center subgraph into a left-hand center attention map through a normalized exponential function;
step 1220, performing pixel level dot product operation on the left hand center attention map and the right hand prior sub-map;
and step 1230, fully connecting the calculation result of the dot multiplication operation to obtain the right-hand dependent characteristic representation.
The steps of fig. 11 and 12 described above may be expressed as:
$$F_L^C = fc\left(\sigma(A_R^C) \odot F_L^{prior}\right), \qquad F_R^C = fc\left(\sigma(A_L^C) \odot F_R^{prior}\right);$$

wherein $A_R^C$ represents the right-hand center subgraph, $F_L^{prior}$ represents the left-hand prior subgraph, and $F_L^C$ represents the left-hand prior knowledge (left-hand dependent feature representation) inferred from the right hand; $A_L^C$ represents the left-hand center subgraph, $F_R^{prior}$ represents the right-hand prior subgraph, and $F_R^C$ represents the right-hand prior knowledge (right-hand dependent feature representation) inferred from the left hand. $\sigma$, $\odot$ and $fc$ are the spatial softmax function, pixel-by-pixel multiplication, and fully connected operation, respectively. Thus, this stage aggregates the dependent feature representation $F^C$.
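A corresponding sketch for the dependency aggregation is given below; it mirrors the global aggregation, except that the opposite hand's center subgraph provides the attention. The function signature is an assumption.

```python
import torch.nn.functional as F

def aggregate_dependency(other_center_map, prior_map, fc):
    """Dependency aggregation sketch: F_L^C = fc(softmax(A_R^C) * F_L^prior).

    other_center_map: (B, 1, H, W) center subgraph of the *opposite* hand
    prior_map:        (B, C, H, W) this hand's interactive prior subgraph
    fc:               a torch.nn.Linear(C, C) supplied by the caller (assumed)
    """
    b, c, h, w = prior_map.shape
    # the other hand's center attends over this hand's prior subgraph
    attn = F.softmax(other_center_map.flatten(2), dim=-1).view(b, 1, h, w)
    feat = (attn * prior_map).flatten(2).sum(dim=-1)   # (B, C)
    return fc(feat)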
The dependent feature representations of the two hands have now been aggregated. However, the two hands may be in an interaction state or a separation state, and the interaction state itself ranges from tight interaction to loose interaction. When the two hands move away from each other, their dependency on each other should be reduced. To this end, the present application further introduces an interaction intensity coefficient for adjusting the weight applied to the dependent feature representations.
And (5) calculating interaction intensity coefficients: FIG. 13 is a flowchart illustrating a method of generating interaction strength coefficients according to an exemplary embodiment of the present application. The method comprises the following steps:
step 1310, calculating the Euclidean distance between the left hand center in the left hand center graph and the right hand center in the right hand center graph;
step 1320, generating an interaction threshold according to the gaussian kernel size of the left hand center and the gaussian kernel size of the right hand center;
step 1330, setting the interaction intensity coefficient to zero if the Euclidean distance is greater than the interaction threshold;
in step 1340, in the case that the euclidean distance is not greater than the interaction threshold, an interaction intensity coefficient is generated according to the interaction threshold and the euclidean distance.
The above steps can be expressed as:

$$d = \left\| C_L - C_R \right\|_2, \qquad IF = \gamma\,(k_L + k_R + 1), \qquad \lambda = \begin{cases} 0, & d > IF \\ (IF - d)/IF, & d \le IF; \end{cases}$$

wherein $C_L$ is the left-hand center of the left-hand center graph, $C_R$ is the right-hand center of the right-hand center graph, $d$ is the Euclidean distance from the left-hand center to the right-hand center, IF (Interaction Field) is the two-hand interaction field (i.e., the interaction threshold), $k_L$ is the Gaussian kernel size of the left-hand center, $k_R$ is the Gaussian kernel size of the right-hand center, $\gamma$ is an adjustable amplitude, and $\lambda$ is the interaction intensity coefficient. (The text fixes only that $\lambda$ is zero beyond IF and is generated from IF and $d$ within it; the linear fall-off $(IF - d)/IF$ is a reconstruction consistent with steps 1330–1340.)
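The interaction field and coefficient can be sketched as below; the linear fall-off inside the field is an assumption consistent with steps 1330–1340, and the default amplitude gamma is illustrative.

```python
import torch

def interaction_strength(c_left, c_right, k_left, k_right, gamma=1.0):
    """Interaction intensity coefficient, following steps 1310-1340."""
    d = torch.linalg.norm(c_left - c_right)          # Euclidean center distance
    interaction_field = gamma * (k_left + k_right + 1)
    if d > interaction_field:
        return torch.tensor(0.0)                     # hands too far apart: no dependency
    # assumed linear fall-off inside the interaction field
    return (interaction_field - d) / interaction_field
```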
Feature stitching stage: in one embodiment, the interaction intensity coefficient ($\lambda$ above) is multiplied by the left-hand dependent feature representation $F_L^C$; the multiplication result is spliced with the left-hand global feature representation $F_L^G$ and the left-hand local feature representation $F_L^P$; and the splicing result is fully connected to obtain the hand feature representation of the left hand. Likewise, the interaction intensity coefficient is multiplied by the right-hand dependent feature representation $F_R^C$; the multiplication result is spliced with the right-hand global feature representation $F_R^G$ and the right-hand local feature representation $F_R^P$; and the splicing result is fully connected to obtain the hand feature representation of the right hand.
The above steps can be expressed as:

$$F_h = f\left(\mathrm{concat}\left(F_h^G,\, F_h^P,\, \lambda F_h^C\right)\right), \quad h \in \{L, R\};$$

wherein $h \in \{L, R\}$ represents the left and right hands respectively, $\mathrm{concat}(\cdot)$ represents the concatenation operation, $f(\cdot)$ represents the fully connected operation, and $F_h$ represents the hand feature representation.
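A sketch of the stitching follows; flattening the per-knuckle local features before concatenation, and the caller-supplied fully connected layer, are layout assumptions.

```python
import torch

def stitch_hand_features(f_global, f_local, f_dep, lam, fc):
    """Feature stitching sketch: F_h = fc(concat(F_h^G, F_h^P, lambda * F_h^C)).

    f_global: (B, C) global feature representation
    f_local:  (B, N, C) per-knuckle local features, flattened before splicing
    f_dep:    (B, C) dependent feature representation, scaled by lambda
    fc:       a torch.nn.Linear((N + 2) * C, out_dim) supplied by the caller
    """
    parts = [f_global, f_local.flatten(1), lam * f_dep]
    return fc(torch.cat(parts, dim=1))   # hand feature representation F_h
```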
Reconstruction stage: the hand feature representations are input into the MANO model 64 (a differentiable parametric model) and regressed to obtain a reconstructed model of the two hands. The left-hand pose parameters, left-hand morphology parameters, left-hand weak perspective camera parameters, right-hand pose parameters, right-hand morphology parameters, and right-hand weak perspective camera parameters are input into the MANO model 64 for modeling. The MANO model 64 contains pose parameters $\theta \in \mathbb{R}^{16 \times 3}$ and morphology parameters $\beta \in \mathbb{R}^{10}$; in addition, this embodiment uses the 6D representation form to better represent the pose parameters, which are then expressed as $\theta \in \mathbb{R}^{16 \times 6}$. Finally, the two-hand reconstruction model returns, through the skinning function $W$, a mesh $M = W(\theta, \beta) \in \mathbb{R}^{778 \times 3}$. Optionally, the positions of the three-dimensional joint points are obtained from the output of the two-hand three-dimensional model as $J_{3D} = L \cdot M \in \mathbb{R}^{21 \times 3}$, where $L$ is a pre-trained linear regressor. Optionally, the weak perspective projection camera model $(s, t_x, t_y)$ is used to obtain the positions of the two-dimensional joint points and to perform model rendering.
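For the optional two-dimensional joint step, the weak perspective projection reduces to a per-joint affine map, sketched below under the assumption that joints arrive as a (21, 3) tensor.

```python
import torch

def weak_perspective_project(joints_3d, s, tx, ty):
    """Project 3D joints with the weak-perspective camera (s, t_x, t_y):
    x' = s * x + t_x, y' = s * y + t_y, matching the reconstruction stage.

    joints_3d: (21, 3) MANO joint positions J_3D = L @ M
    returns:   (21, 2) two-dimensional joint positions
    """
    xy = joints_3d[:, :2]                       # drop depth, keep (x, y)
    return s * xy + torch.tensor([tx, ty])      # uniform scale plus translation
```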
In summary, the foregoing embodiments further illustrate a detailed two-hand reconstruction method. It may be noted that an interaction intensity coefficient is introduced above, which measures the interaction tightness of the two hands according to the distance between them. When the two hands are close (high tightness), the interaction intensity coefficient is large, so the adjusted dependent feature representation carries a large weight in the hand feature representation; when the two hands are far apart (low or zero tightness), the interaction intensity coefficient is small or zero, and the weight of the dependent feature representation in the hand feature representation is reduced or removed. The interaction intensity coefficient lets the interactive hand prior graph form an adaptive interaction field, so the correlation of the two hands can be modeled better; at the same time, the interaction field is sensitive to both close interaction and separation, avoiding unnecessary feature entanglement.
Based on the method embodiment shown in fig. 3, in order to generate a higher-quality hand center map that guides more thorough feature decoupling and avoids the feature ambiguity caused by hands being too close together, the following embodiment further adopts a collision-aware representation of the hand center and updates the hand center map accordingly. Fig. 14 shows a flowchart of a method for updating the hand center map according to an exemplary embodiment of the present application (the method flow corresponding to fig. 14 occurs after step 320 and before step 330 shown in fig. 3).
Step 1410, generating an adjustment vector based on a gaussian kernel size of a left hand center in the left hand center sub-graph, a gaussian kernel size of a right hand center in the right hand center sub-graph, a position difference between the left hand center and the right hand center, and euclidean distance between the left hand center and the right hand center; wherein the adjustment vector characterizes the rejection of the left hand center to the right hand center;
step 1420, performing weighted summation operation on the position of the left hand center and the adjustment vector to obtain an updated position of the left hand center;
step 1430, performing weighted difference calculation on the position of the right hand center and the adjustment vector to obtain an updated position of the right hand center;
step 1440, generating an updated left-hand center subgraph based on the updated left-hand center position;
At step 1450, an updated right-hand center subgraph is generated based on the updated right-hand center position.
The above steps can be expressed as:

$$R = \frac{k_L + k_R + 1 - d}{2} \cdot \frac{C_L - C_R}{d}, \qquad C_L' = C_L + \alpha R, \qquad C_R' = C_R - \alpha R;$$

wherein $C_L$ represents the left-hand center of the left-hand center subgraph and $C_R$ represents the right-hand center of the right-hand center subgraph; $C_L'$ and $C_R'$ represent the updated left-hand and right-hand centers; $k_L$ and $k_R$ represent the Gaussian kernel sizes of the left-hand and right-hand centers; $d$ is the Euclidean distance between the left-hand center and the right-hand center, and $\alpha$ is a controllable intensity coefficient. (The explicit form of the adjustment vector $R$ is not preserved in the text; the kernel-overlap repulsion above is a reconstruction consistent with step 1410, and the minus sign in the right-hand update follows the weighted difference of step 1430.) It will be appreciated that when the Euclidean distance between the two hand centers is less than $(k_L + k_R + 1)$, a repulsion of the left-hand center against the right-hand center occurs.
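A sketch of this collision-aware update follows; the kernel-overlap repulsion is one consistent choice, matching the condition that repulsion occurs only when d < k_L + k_R + 1.

```python
import torch

def collision_aware_update(c_left, c_right, k_left, k_right, alpha=0.5):
    """Collision-aware hand-center update (steps 1410-1450 sketch)."""
    diff = c_left - c_right                     # position difference
    d = torch.linalg.norm(diff)                 # Euclidean distance
    overlap = k_left + k_right + 1 - d
    if overlap <= 0:
        return c_left, c_right                  # centers far enough apart: no repulsion
    r = 0.5 * overlap * diff / d                # adjustment vector R (assumed form)
    # weighted summation for the left center, weighted difference for the right
    return c_left + alpha * r, c_right - alpha * r
```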
In summary, thanks to the powerful pixel-level representation, the originally obtained hand center maps can release the dependency between the two hands and construct well-separated feature representations for both hands. However, these feature representations may be blurred by overlapping Gaussians when the centers of the two hands are too close. This embodiment therefore further updates the hand centers of the hand center map using collision awareness, so as to release the dependency more thoroughly.
The two-hand reconstruction method (use procedure of the AI model) that has been described above corresponds to what is performed by the use device 202 of the AI model in fig. 2. Next, a training process of the AI model will be described, corresponding to what the training apparatus 201 of the AI model in fig. 2 performs. Specifically, the AI model includes at least a feature encoding network, a feature aggregation network, and a parameterized model (MANO model).
FIG. 15 illustrates a feature encoding network, feature aggregation network, and parameterized model training method provided by an exemplary embodiment of the present application, the method comprising:
step 1510, acquiring a two-hand image, wherein two hands in the two-hand image are in an interaction state or a separation state;
The two-hand image here is the input sample used to execute the training method of the feature encoding network, feature aggregation network and parameterized model provided by the application. Optionally, the two-hand image adopted by the application is a hand image captured by a monocular RGB camera.
Step 1520, coding to obtain a parameter map, a hand center map, a finger segment segmentation map and an interactive hand prior map of the two-hand image through a feature coding network;
please refer to step 320.
Step 1530, generating a global feature representation of the two-hand image based on the hand center map and the parameter map through the feature aggregation network;
please refer to step 330.
Step 1540, generating a local feature representation of the two-hand image based on the finger-joint segmentation map and the parameter map through the feature aggregation network;
please refer to step 340.
Step 1550, generating a dependent feature representation based on the hand centrograph and the interactive hand prior graph through the feature aggregation network;
Please refer to step 350.
Step 1560, generating a hand feature representation of the two hands based on the global feature representation, the local feature representation, and the dependent feature representation;
please refer to step 360.
Step 1570, inputting the hand feature representation of the two hands into a parameterized model, and regressing to obtain a reconstructed two-hand three-dimensional model;
please refer to step 370.
And step 1580, training a feature coding network, a feature aggregation network and a parameterized model according to the loss of the hand center graph, the loss of the finger joint segmentation graph and the loss of the two-hand three-dimensional model.
In one embodiment, the loss for training the AI model includes three parts: a first sub-loss (the loss of the hand center map), a second sub-loss (the loss of the finger joint segmentation map), and a third sub-loss (the loss of the reconstructed two-hand three-dimensional model).
Optionally, for the reconstruction method shown in fig. 6, the AI model further comprises a backbone network 61.
Loss of the hand center map ($L_C$): the loss between the left-hand center subgraph and the label left-hand center subgraph is summed with the loss between the right-hand center subgraph and the label right-hand center subgraph to obtain the first sub-loss. For both the left-hand and right-hand center subgraphs, the loss is calculated by the following formula:

$$L_C = \sum_{h \in \{L, R\}} \ell\left(A_h^C, A_h^{C\prime}\right);$$

wherein $A_h^C$ represents the predicted left-hand or right-hand center subgraph, $A_h^{C\prime}$ represents the corresponding label center subgraph, and $\ell(\cdot,\cdot)$ denotes the per-pair loss function, whose exact form is not preserved in the text.
Loss of the finger joint segmentation map ($L_P$): the loss between the finger joint segmentation map and the label finger joint segmentation map is calculated to obtain the second sub-loss, as shown in the following formula:

$$L_P = \sum_{(h,w)} \mathrm{CrossEntropy}\left(\sigma(A^S)_{(h,w)},\, A^{S\prime}_{(h,w)}\right);$$

wherein CrossEntropy is the cross entropy loss function, $\sigma(A^S)_{(h,w)}$ represents the probability segmentation of the finger joint segmentation map at position $(h, w)$, $A^{S\prime}$ represents the label finger joint segmentation map, and $\sigma$ here is a normalized exponential function over the channel dimension. $(h, w)$ are the two-dimensional pixel coordinates. The segmentation map used here retains the background dimension. Optionally, during training, only the first two epochs are supervised with the finger joint segmentation loss, while subsequent epochs are trained with the other losses until convergence.
Loss of the two-hand three-dimensional model ($L_{mesh}$): the loss of the pose parameters and the loss of the morphology parameters in the two-hand three-dimensional model are weighted and summed together with the joint loss to obtain the third sub-loss; the joint loss includes the loss of the positions of the three-dimensional joint points, the loss of the positions of the two-dimensional joint points, and the loss of the bone lengths. The calculation is as follows:

$$L_{mesh} = L_{MANO} + L_{joint}, \qquad L_{MANO} = w_\theta \left\| \theta - \theta' \right\|_2 + w_\beta \left\| \beta - \beta' \right\|_2;$$

wherein $L_{MANO}$ is a weighted sum of two-norm loss functions on the MANO parameters $\theta$ (pose parameters) and $\beta$ (morphology parameters), and $w_\theta$ and $w_\beta$ are weights; optionally, $w_\theta = 80$ and $w_\beta = 10$. $L_{joint}$ is a weighted sum of the losses of the positions of the three-dimensional joint points, the positions of the projected two-dimensional joint points, and the bone lengths.
$$L_{3D} = w_{j3d}\, L_{MPJPE} + w_{pa\text{-}j3d}\, L_{PA\text{-}MPJPE};$$

wherein $L_{MPJPE}$ is a two-norm loss function on the positions of the reconstructed metacarpophalangeal joints (calculated against the true three-dimensional joint positions), $L_{PA\text{-}MPJPE}$ is the three-dimensional joint position loss after the joints are further processed by Procrustes alignment (an alignment method; calculated from the position error before and after alignment), $w_{j3d}$ and $w_{pa\text{-}j3d}$ are set weights, and $L_{3D}$ is the loss of the positions of the three-dimensional joint points. Optionally, $w_{j3d} = 200$ and $w_{pa\text{-}j3d} = 360$.
$$L_{2D} = w_{pj2d} \left\| PJ_{2D} - J'_{2D} \right\|_2;$$

wherein $PJ_{2D}$ are the projected two-dimensional joint points generated by projecting the three-dimensional joint points with the weak perspective projection camera, $J'_{2D}$ are the true two-dimensional joint point positions, and $w_{pj2d}$ is a set weight coefficient; $L_{2D}$ is the loss of the positions of the two-dimensional joint points (its two-norm form is reconstructed from the neighboring losses). The coordinates of $PJ_{2D}$ obtained after projection are $x_{pj2D} = s \cdot x_{3D} + t_x$ and $y_{pj2D} = s \cdot y_{3D} + t_y$. Optionally, $w_{pj2d} = 400$.
$$L_{bone} = w_{bl} \sum_i \left\| b_i - b'_i \right\|_2;$$

wherein $b_i$ is the length of bone $i$, $b'_i$ is the true length of bone $i$, and $w_{bl}$ is a set weight coefficient; $L_{bone}$ is the loss of bone length. Optionally, $w_{bl} = 200$.
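The full training objective of step 1580 can be sketched as a weighted sum; the dict interface and the MSE choice for the per-pair losses are assumptions (the text does not fix the per-pair loss form), and the Procrustes-aligned 3D term (weight 360) is omitted for brevity.

```python
import torch.nn.functional as F

def total_loss(pred, gt, w_theta=80, w_beta=10, w_j3d=200, w_pj2d=400, w_bl=200):
    """Weighted training-loss sketch over assumed dict inputs."""
    l_center = F.mse_loss(pred['center'], gt['center'])         # L_C, both hands
    l_seg = F.cross_entropy(pred['seg_logits'], gt['seg'])      # L_P, channel softmax inside
    l_mano = (w_theta * F.mse_loss(pred['theta'], gt['theta'])  # pose parameter term
              + w_beta * F.mse_loss(pred['beta'], gt['beta']))  # morphology parameter term
    l_joint = (w_j3d * F.mse_loss(pred['j3d'], gt['j3d'])       # 3D joint positions
               + w_pj2d * F.mse_loss(pred['pj2d'], gt['j2d'])   # projected 2D joints
               + w_bl * F.mse_loss(pred['bone'], gt['bone']))   # bone lengths
    return l_center + l_seg + l_mano + l_joint                  # L = L_C + L_P + L_mesh
```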
In summary, the foregoing describes a loss function used in training a network, and specifically provides a training method for a feature encoding network, a feature aggregation network, and a parameterized model.
The experimental part of the two-hand reconstruction method provided by the present application will be described below.
Experimental details: the neural networks used in the present application run on PyTorch. For the backbone network 61, both ResNet-50 and HRNet-W32 are trained, for faster inference speed and better two-hand reconstruction respectively. Unlike related-art methods that require a hand detector, the present application can reconstruct hands in any scene with an end-to-end approach. Moreover, the method provided by the application places no restriction on the input two-hand image: for monocular raw RGB camera images that are neither cropped nor detected, all input original images and segmented images are resized to 512×512, with zero-padding so that all images keep the same aspect ratio. The backbone network 61 extracts the feature map $f \in \mathbb{R}^{(C+2) \times H \times W}$. The feature map $f$ then yields the hand center map, parameter map, interactive hand prior map and finger joint segmentation map through four convolution blocks.
Training process: for comparison on the InterHand2.6M dataset, the Adam optimizer with a learning rate of $5 \times 10^{-5}$ is used for eight epochs of training. When no valid MANO labels are present, no supervision is applied with the finger joint segmentation loss or the hand center map loss, because the ground-truth finger joint segmentation maps are derived by rendering the MANO hand mesh with a neural renderer. In all experiments, a pre-trained HRNet-W32 backbone is used for initialization to accelerate training. Two V100 GPUs with a batch size of 64 are used. The feature size output by the backbone network is 128×128, and the output maps, aligned at 4 pixels, are of size 64×64. Random scaling, rotation, flipping, and color jittering augmentations are applied during training.
The testing process: in all experiments, unless otherwise specified, the backbone network is HRNet-W32. For comparison with other methods, evaluation is performed on the complete official test set. Since every training and test sample contains exactly one left hand and one right hand, the confidence threshold is set to 0.25 and the maximum number of detections is one left hand and one right hand.
Evaluation indexes: to assess the accuracy of the two-hand reconstruction, the average joint position error (MPJPE) and the Procrustes-aligned average joint position error (PA-MPJPE) are first reported in millimeters. Both errors are calculated after aligning the joints following the prior art. The accuracy of the reconstructed hand shape is also tested on the FreiHAND dataset via the average vertex position error (MPVPE) and the Procrustes-aligned average vertex position error (PA-MPVPE).
Data set
InterHand2.6M is the first and, to date, only publicly available two-hand interaction dataset with accurate two-hand mesh annotations. This large-scale, real-captured dataset, with accurate three-dimensional pose and mesh annotations produced by humans (H) and machines (M), contains 1361062 frames for training, 849160 frames for testing, and 380125 frames for validation. These subsets are divided into two parts: interacting hands (IH) and single hand (SH). In the experiments, the 5 frames/second interacting-hand subset with H+M annotations is used.
FreiHAND is a single-hand three-dimensional pose estimation dataset. Each frame has a MANO annotation and a 3D keypoint annotation. There are 4×32560 frames for training and 3960 frames for evaluation and testing. The initial sequence of 32560 frames was captured against a green-screen background, allowing background removal.
Comparison with the related art
The two-hand reconstruction method of the present application attains the highest degree of freedom and practicality, has the highest reconstruction accuracy, and outperforms existing two-hand reconstruction methods under almost any condition. Experiments mainly compare results on the InterHand2.6M test set and the monocular reconstruction quality on videos taken from the Internet to verify the effectiveness of the method. As shown in fig. 16, compared with the existing interactive two-hand reconstruction method based on detection-box-level coupled features (IntagHand), the method provided by the application produces more reasonable and accurate reconstructions in more challenging scenes (for example, truncation and occlusion). The feature-decoupling-based method provided by the application is fundamentally robust to incomplete interaction scenes.
The present application also contemplates hand reconstruction with the inclusion of a single hand, a first person, hand-object interaction, and a truncated hand. As shown in fig. 17, in some other cases, the method of the present application has better effect than IntagHand, which proves the versatility and practicality of the method provided by the present application.
Table 1 below compares various mainstream single-hand and two-hand reconstruction methods with the present application in terms of average joint position error (MPJPE) and average vertex position error (MPVPE) on the InterHand2.6M test set. It can be seen that the error of the present application is significantly lower than that of all existing methods, giving higher accuracy, and it is the only method that requires no additional information.
TABLE 1
Table 2 below compares various mainstream single-hand reconstruction methods with the present application in terms of average joint position error (PA-MPJPE) and average vertex position error (PA-MPVPE) after Procrustes alignment on the FreiHAND test set. It can be seen that the present application achieves performance comparable to single-hand methods based on vertex regression, showing its potential for accurate single-hand reconstruction and achieving the best results among MANO-model-based methods, while possessing better generalization capability than vertex regression methods.
TABLE 2

Reconstruction method      PA-MPJPE   PA-MPVPE
MeshGraphormer             6.0        5.9
METRO                      6.8        6.7
I2L-MeshNet                7.4        7.6
HandTailor                 8.2        8.7
The present application    6.9        7.0
In summary, the application provides an arbitrary-interaction two-hand reconstruction method based on attention aggregation and feature decoupling. The approach exploits hand-center-based (global) and knuckle-based (local) feature representations to alleviate the interdependence and ambiguity between the two hands and between the knuckles of each hand, thereby releasing unnecessary input constraints. To better handle application scenes with interacting hands, an interactive hand prior reasoning module with an Interaction Field (IF) is provided to dynamically adjust the strength of the pose dependency between the two hands, further optimizing the reconstruction of interacting hands. The error of the application is significantly lower than that of all existing methods, with higher accuracy. Meanwhile, it is the only method that requires no additional information, can be well combined with human body pose estimation, and can be applied to whole-body motion capture.
Fig. 18 shows a block diagram of a two-hand reconstruction device according to an exemplary embodiment of the present application, the device comprising:
an acquiring module 1801, configured to acquire a two-hand image;
the encoding module 1802 is configured to encode, through a feature encoding network, a parameter map, a hand center map, a finger joint segmentation map, and an interactive hand prior map of the two-hand image; the parameter diagram at least comprises basic hand parameters of the two hands, and the hand center diagram is used for representing the positions of the hand centers of the two hands; the finger joint segmentation graph is at least used for representing the positions of a plurality of finger joints of the two hands, and the interactive hand prior graph is used for reasoning the interactive relation between the two hands;
an aggregation module 1803, configured to generate, through a feature aggregation network, a global feature representation of the two-hand image based on the hand center graph and the parameter graph; generating a local feature representation of the two-hand image based on the finger joint segmentation map and the parameter map; generating a dependent feature representation based on the hand centrogram and the interactive hand prior map;
a reconstruction module 1804 for generating a hand feature representation of the two hands based on the global feature representation, the local feature representation, and the dependent feature representation; the hands are modeled from hand feature representations of the hands.
In an alternative embodiment, the aggregation module 1803 is further configured to generate a left-hand global feature representation based on the left-hand center subgraph in the hand center graph and the left-hand parameter subgraph in the parameter graph; generating a left-hand local feature representation based on the left-hand segmentation subgraph and the left-hand parameter subgraph in the finger joint segmentation graph; generating a left-hand dependent feature representation based on the right-hand center subgraph in the hand center graph and the left-hand prior subgraph in the interactive hand prior graph; a left hand feature representation is generated based on the left hand global feature representation, the left hand local feature representation, and the left hand dependent feature representation.
In an alternative embodiment, the aggregation module 1803 is further configured to generate a right-hand global feature representation based on the right-hand center subgraph in the hand center graph and the right-hand parameter subgraph in the parameter graph; generating a right hand local feature representation based on the right hand segmentation subgraph and the right hand parameter subgraph in the finger joint segmentation graph; generating a right-hand dependent feature representation based on the left-hand center subgraph in the hand center graph and the right-hand prior subgraph in the interactive hand prior graph; a right hand feature representation is generated based on the right hand global feature representation, the right hand local feature representation, and the right hand dependent feature representation.
In an alternative embodiment, the interactive hand prior graph comprises a left-hand prior subgraph and a right-hand prior subgraph, the left-hand prior subgraph comprises the basic hand parameters of the left hand and the right-hand prior subgraph comprises the basic hand parameters of the right hand, the left-hand dependent feature representation is used to characterize the left-hand prior knowledge inferred from the right hand, and the right-hand dependent feature representation is used to characterize the right-hand prior knowledge inferred from the left hand.
In an alternative embodiment, the aggregation module 1803 is further configured to convert the left-hand center subgraph in the hand center graph into a left-hand center attention graph through a normalized exponential function; performing pixel level dot multiplication operation on the left-hand center attention diagram and a left-hand parameter subgraph in the parameter diagram; and fully connecting the calculation result of the dot multiplication operation to obtain the left-hand global feature representation.
In an alternative embodiment, the aggregation module 1803 is further configured to convert the right-hand center subgraph in the hand center graph into a right-hand center attention graph through a normalized exponential function; performing pixel level dot multiplication operation on the right-hand central attention diagram and a right-hand parameter subgraph in the parameter diagram; and fully connecting the calculation result of the dot multiplication operation to obtain the right-hand global feature representation.
In an alternative embodiment, the apparatus further comprises an update module 1805. The updating module 1805 is configured to generate an adjustment vector based on the Gaussian kernel size of the left hand center in the left-hand center subgraph, the Gaussian kernel size of the right hand center in the right-hand center subgraph, the position difference between the left hand center and the right hand center, and the Euclidean distance between the left hand center and the right hand center; the adjustment vector characterizes the repulsion of the left hand center against the right hand center; a weighted summation operation is performed on the position of the left hand center and the adjustment vector to obtain the updated position of the left hand center; a weighted difference calculation is performed on the position of the right hand center and the adjustment vector to obtain the updated position of the right hand center; an updated left-hand center subgraph is generated based on the updated left-hand center position; and an updated right-hand center subgraph is generated based on the updated right-hand center position.
In an alternative embodiment, the aggregation module 1803 is further configured to convert the left-hand segmentation subgraph in the finger joint segmentation map into a left-hand segmentation attention map through a normalized exponential function; carrying out Hadamard product operation on the left-hand segmentation attention map and the left-hand parameter subgraph to generate a left-hand local feature representation; converting the right hand segmentation subgraph in the finger joint segmentation graph into a right hand segmentation attention graph through a normalized exponential function; and carrying out Hadamard product operation on the right-hand segmentation attention map and the right-hand parameter subgraph to generate a right-hand local feature representation.
In an alternative embodiment, the aggregation module 1803 is further configured to convert the right-hand center subgraph in the hand center graph into a right-hand center attention graph through a normalized exponential function; performing pixel-level dot multiplication operation on the right hand center attention diagram and the left hand priori subgraph in the interactive hand priori diagram; fully connecting the calculation result of the dot multiplication operation to obtain a left-hand dependency characteristic representation; converting the left hand center subgraph in the hand center graph into a left hand center attention graph through a normalized exponential function; performing pixel-level dot multiplication operation on the left-hand center attention diagram and the right-hand prior subgraph in the interactive hand prior diagram; and fully connecting the calculation result of the dot multiplication operation to obtain right-hand dependent characteristic representation.
In an alternative embodiment, the aggregation module 1803 is further configured to calculate a euclidean distance between a left-hand center in the left-hand center graph and a right-hand center in the right-hand center graph; generating an interaction threshold according to the Gaussian kernel size of the left hand center and the Gaussian kernel size of the right hand center; setting the interaction intensity coefficient to be zero under the condition that the Euclidean distance is larger than the interaction threshold value; and generating an interaction intensity coefficient according to the interaction threshold and the Euclidean distance under the condition that the Euclidean distance is not larger than the interaction threshold.
In an alternative embodiment, the reconstruction module 1804 is further configured to multiply the interaction strength coefficient by the left-hand dependency feature representation; splicing the multiplication calculation result with the left-hand global characteristic representation and the left-hand local characteristic representation; and (5) fully connecting the splicing results to obtain the hand characteristic representation of the left hand.
In an alternative embodiment, the reconstruction module 1804 is further configured to multiply the interaction strength coefficient by the right-hand dependent feature representation; splicing the multiplication calculation result with the right-hand global characteristic representation and the right-hand local characteristic representation; and fully connecting the splicing results to obtain the hand characteristic representation of the right hand.
In an alternative embodiment, the left hand parameter subgraph includes a pose parameter of the left hand, a morphological parameter of the left hand, and a weak perspective camera parameter corresponding to the left hand; the right hand parameter subgraph comprises a right hand posture parameter, a right hand morphology parameter and a right hand corresponding weak perspective camera parameter.
In an alternative embodiment, the finger segment segmentation map is a probabilistic segmenter comprising a left-hand probabilistic segmenter corresponding to the left-hand segmentation map, a right-hand probabilistic segmenter corresponding to the right-hand segmentation map, and a background dimension; a voxel on the left-hand probability segmentation body represents a probability logic channel of a plurality of knuckle categories corresponding to the left hand; one voxel on the right-hand probability segmentation body represents one probability logic channel of a plurality of knuckle categories corresponding to the right hand; pixels in the background dimension characterize the probability of being in the background region.
In an alternative embodiment, the left-hand prior subgraph includes a pose parameter of the left hand, a morphology parameter of the left hand, and a weak perspective camera parameter corresponding to the left hand; the right-hand prior subgraph comprises a right-hand posture parameter, a right-hand morphological parameter and a right-hand corresponding weak perspective camera parameter.
In an alternative embodiment, the reconstruction module 1804 is further configured to input the hand feature representation of the two hands into a parameterized model, and to regress the hand feature representation to obtain a reconstructed three-dimensional model of the two hands.
In an alternative embodiment, the apparatus further comprises a training module 1806. A training module 1806, configured to train the feature encoding network, the feature aggregation network, and the parameterized model according to the loss of the hand center subgraph, the loss of the finger joint segmentation map, and the loss of the two-hand three-dimensional model.
In an alternative embodiment, training module 1806 is further configured to sum the loss between the left-hand center sub-graph and the tag left-hand center sub-graph with the loss between the right-hand center sub-graph and the tag right-hand center sub-graph to obtain a first sub-loss; calculating the loss between the finger segment segmentation map and the label finger segment segmentation map to obtain a second sub-loss; carrying out weighted summation on the loss of the attitude parameters and the loss of the morphological parameters in the two-hand three-dimensional model and the joint loss to obtain a third sub-loss; joint loss includes loss of position of a three-dimensional joint, loss of position of a two-dimensional joint, and loss of bone length; the first sub-loss, the second sub-loss and the third sub-loss are weighted and summed to obtain a target loss; training a feature encoding network, a feature aggregation network and a parameterized model according to the target loss.
In summary, the parameter map, the hand center map, the finger joint segmentation map and the interactive hand prior map are obtained through feature coding network coding, the global feature representation, the local feature representation and the dependent feature representation are obtained through feature aggregation network aggregation, and the hand feature representation of the two hands is obtained according to the three feature representations, so that the two-hand reconstruction can be performed according to the hand feature representation. In the reconstruction process, the dependence relationship between the hands is reduced by the hand center diagram, the dependence relationship between a plurality of knuckles in the hands is reduced by the knuckle segmentation diagram, and the reduction of the dependence relationship is beneficial to releasing input constraint, but the interaction between the hands in the interaction state is also reduced. Therefore, the application also designs the interactive hand priori graph which is used for reasoning and obtaining the interactive relation between the two hands in the interactive state. Based on the design of the hand center graph, the finger joint segmentation graph and the interactive hand prior graph, the two-hand reconstruction process provided by the application can support two-hand images in any scene.
Fig. 19 is a schematic diagram of a computer device according to an exemplary embodiment. The computer apparatus 1900 includes a central processing unit (Central Processing Unit, CPU) 1901, a system Memory 1904 including a random access Memory (Random Access Memory, RAM) 1902 and a Read-Only Memory (ROM) 1903, and a system bus 1905 connecting the system Memory 1904 and the central processing unit 1901. The computer device 1900 also includes a basic Input/Output system (I/O) 1906 that facilitates the transfer of information between various devices within the computer device, and a mass storage device 1907 for storing an operating system 1913, application programs 1914, and other program modules 1915.
The basic input/output system 1906 includes a display 1908 for displaying information and an input device 1909, such as a mouse, keyboard, etc., for inputting information by a user. Wherein the display 1908 and the input device 1909 are both coupled to the central processing unit 1901 through an input output controller 1919 coupled to the system bus 1905. The basic input/output system 1906 may also include an input/output controller 1910 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 1919 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1907 is connected to the central processing unit 1901 through a mass storage controller (not shown) connected to the system bus 1905. The mass storage device 1907 and its associated computer device readable media provide non-volatile storage for the computer device 1900. That is, the mass storage device 1907 may include a computer device readable medium (not shown) such as a hard disk or a compact disk-Only (CD-ROM) drive.
The computer device readable medium may include computer device storage media and communication media without loss of generality. Computer device storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer device readable instructions, data structures, program modules or other data. Computer device storage media includes RAM, ROM, erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), electrically erasable programmable read-Only Memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), CD-ROM, digital video disk (Digital Video Disc, DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer device storage medium is not limited to the ones described above. The system memory 1904 and mass storage device 1907 described above may be collectively referred to as memory.
According to various embodiments of the disclosure, the computer device 1900 may also operate through a network, such as the Internet, to remote computer devices on the network. I.e., the computer device 1900 may be connected to the network 1911 through a network interface unit 1912 coupled to the system bus 1905, or other types of networks or remote computer device systems (not shown) may also be coupled to the network interface unit 1912.
The memory further stores one or more programs, and the central processing unit 1901 implements all or part of the steps of the two-hand reconstruction method described above by executing the one or more programs. The present application also provides a computer-readable storage medium having stored therein at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the two-hand reconstruction method provided by the above method embodiments.
The present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the two-hand reconstruction method provided by the above-mentioned method embodiment.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing describes preferred embodiments of the present application and is not intended to limit the application; the scope of protection of the application is defined by the appended claims.

Claims (15)

1. A method of two-hand reconstruction, the method comprising:
acquiring images of both hands;
coding to obtain a parameter map, a hand center map, a finger joint segmentation map and an interactive hand prior map of the two-hand image through a feature coding network; the parameter map at least comprises basic hand parameters of the two hands, and the hand center map is used for representing the positions of the hand centers of the two hands; the finger joint segmentation graph is at least used for representing the positions of a plurality of finger joints of the two hands, and the interactive hand prior graph is used for reasoning the interactive relation between the two hands;
Generating, by a feature aggregation network, a global feature representation of the two-hand image based on the hand center map and the parameter map; generating a local feature representation of the two-hand image based on the finger joint segmentation map and the parameter map; generating a dependent feature representation based on the hand centrgram and the interactive hand prior;
generating a hand feature representation of the two hands based on the global feature representation, the local feature representation, and the dependent feature representation; modeling the hands according to the hand feature representation of the hands.
2. The method of claim 1, wherein the generating a global feature representation of the two-hand image is based on the hand center map and the parameter map; generating a local feature representation of the two-hand image based on the finger joint segmentation map and the parameter map; generating a dependent feature representation based on the hand centrgram and the interactive hand prior; generating a hand feature representation of the two hands based on the global feature representation, the local feature representation, and the dependent feature representation, comprising:
generating a left-hand global feature representation based on a left-hand center subgraph of the hand center graph and a left-hand parameter subgraph of the parameter graph; generating a left-hand local feature representation based on the left-hand segmentation subgraph and the left-hand parameter subgraph in the knuckle segmentation graph; generating a left-hand dependent feature representation based on a right-hand center subgraph in the hand center graph and a left-hand prior subgraph in the interactive hand prior graph; generating a left hand feature representation based on the left hand global feature representation, the left hand local feature representation, and the left hand dependent feature representation;
Generating a right-hand global feature representation based on a right-hand center subgraph in the hand center graph and a right-hand parameter subgraph in the parameter graph; generating a right-hand local feature representation based on the right-hand segmentation subgraph and the right-hand parameter subgraph in the finger joint segmentation graph; generating a right-hand dependent feature representation based on a left-hand center subgraph in the hand center graph and a right-hand prior subgraph in the interactive hand prior graph; generating a right hand feature representation based on the right hand global feature representation, the right hand local feature representation, and the right hand dependent feature representation;
the interactive hand prior graph comprises the left-hand prior subgraph and the right-hand prior subgraph, the left-hand prior subgraph comprises basic hand parameters of the left hand, the right-hand prior subgraph comprises basic hand parameters of the right hand, the left-hand dependent feature representation is used for characterizing left-hand prior knowledge inferred from the right hand, and the right-hand dependent feature representation is used for characterizing right-hand prior knowledge inferred from the left hand.
3. The method of claim 2, wherein the generating a left-hand global feature representation based on a left-hand center subgraph of the hand center graph and a left-hand parameter subgraph of the parameter graph comprises:
Converting a left hand center subgraph in the hand center graph into a left hand center attention graph through a normalized exponential function; performing pixel-level dot product operation on the left-hand center attention map and a left-hand parameter sub-graph in the parameter map; fully connecting the calculation result of the dot multiplication operation to obtain the left-hand global feature representation;
the generating a right-hand global feature representation based on the right-hand center subgraph in the hand center graph and the right-hand parameter subgraph in the parameter graph includes:
converting the right hand center subgraph in the hand center graph into a right hand center attention graph through a normalized exponential function; performing pixel-level dot product operation on the right-hand center attention map and a right-hand parameter subgraph in the parameter map; and fully connecting the calculation result of the dot multiplication operation to obtain the right-hand global feature representation.
4. The method according to claim 2, wherein the method further comprises:
generating an adjustment vector based on a Gaussian kernel size of a left hand center in the left-hand center subgraph, a Gaussian kernel size of a right hand center in the right-hand center subgraph, a position difference between the left hand center and the right hand center, and the Euclidean distance between the left hand center and the right hand center; the adjustment vector characterizes the repulsion of the left hand center against the right hand center;
Carrying out weighted summation operation on the position of the left hand center and the adjustment vector to obtain an updated position of the left hand center; carrying out weighted difference calculation on the position of the right hand center and the adjustment vector to obtain the updated position of the right hand center;
generating an updated left-hand center subgraph based on the updated left-hand center position; and generating an updated right-hand center subgraph based on the updated right-hand center position.
5. The method according to any of claims 2 to 4, wherein the generating a left-hand local feature representation based on a left-hand segmentation sub-graph in the finger joint segmentation map and the left-hand parameter sub-graph comprises:
converting the left-hand segmentation subgraph in the finger joint segmentation graph into a left-hand segmentation attention graph through a normalized exponential function; performing a Hadamard product operation on the left-hand segmentation attention graph and the left-hand parameter subgraph to generate the left-hand local feature representation;
the generating a right-hand local feature representation based on the right-hand segmentation subgraph in the finger joint segmentation graph and the right-hand parameter subgraph comprises:
converting the right-hand segmentation subgraph in the finger joint segmentation graph into a right-hand segmentation attention graph through a normalized exponential function; and performing a Hadamard product operation on the right-hand segmentation attention graph and the right-hand parameter subgraph to generate the right-hand local feature representation.
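A sketch of claim 5's local-feature path, assuming the segmentation and parameter subgraphs share a (B, K, H, W) layout. The claim does not state the softmax axis; normalizing over the knuckle-category channel is one plausible reading, and a spatial softmax would fit equally well.

```python
import torch

def local_features(seg_submap: torch.Tensor,
                   param_submap: torch.Tensor) -> torch.Tensor:
    """seg_submap, param_submap: (B, K, H, W) with matching shapes (assumed)."""
    # Normalized exponential function over the knuckle-category channel, so
    # each pixel distributes its weight across knuckle classes.
    attn = torch.softmax(seg_submap, dim=1)
    # Hadamard (element-wise) product keeps the spatial layout, yielding a
    # per-knuckle, per-pixel local feature representation.
    return attn * param_submap

feats = local_features(torch.randn(2, 16, 32, 32), torch.randn(2, 16, 32, 32))
```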
6. The method of any of claims 2 to 4, wherein the generating a left-hand dependency feature representation based on a right-hand center subgraph of the hand center graph and a left-hand prior subgraph of the interactive hand prior graph comprises:
converting the right-hand center subgraph in the hand center graph into a right-hand center attention graph through a normalized exponential function; performing a pixel-level dot product operation on the right-hand center attention graph and the left-hand prior subgraph in the interactive hand prior graph; and passing the result of the dot product operation through a fully connected layer to obtain the left-hand dependent feature representation;
the generating a right-hand dependent feature representation based on the left-hand center subgraph in the hand center graph and the right-hand prior subgraph in the interactive hand prior graph includes:
converting the left-hand center subgraph in the hand center graph into a left-hand center attention graph through a normalized exponential function; performing a pixel-level dot product operation on the left-hand center attention graph and the right-hand prior subgraph in the interactive hand prior graph; and passing the result of the dot product operation through a fully connected layer to obtain the right-hand dependent feature representation.
7. The method of claim 6, wherein the method further comprises:
Calculating the Euclidean distance between the left-hand center in the left-hand center subgraph and the right-hand center in the right-hand center subgraph; generating an interaction threshold according to the Gaussian kernel size of the left-hand center and the Gaussian kernel size of the right-hand center; setting an interaction strength coefficient to zero when the Euclidean distance is greater than the interaction threshold; and generating the interaction strength coefficient from the interaction threshold and the Euclidean distance when the Euclidean distance is not greater than the interaction threshold;
the generating a left hand feature representation based on the left hand global feature representation, the left hand local feature representation, and the left hand dependent feature representation, comprising:
multiplying the interaction strength coefficient by the left-hand dependent feature representation; concatenating the multiplication result with the left-hand global feature representation and the left-hand local feature representation; and passing the concatenated result through a fully connected layer to obtain the hand feature representation of the left hand;
the generating a right hand feature representation based on the right hand global feature representation, the right hand local feature representation, and the right hand dependent feature representation, comprising:
multiplying the interaction strength coefficient by the right-hand dependent feature representation; concatenating the multiplication result with the right-hand global feature representation and the right-hand local feature representation; and passing the concatenated result through a fully connected layer to obtain the hand feature representation of the right hand.
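Claim 7's gating and fusion can be sketched as below. The threshold rule (sum of the Gaussian kernel sizes), the linear ramp for the interaction strength coefficient, and the assumption that the local feature representation has already been flattened to a vector are all illustrative choices not fixed by the claim.

```python
import torch
import torch.nn as nn

def interaction_strength(c_left, c_right, k_left, k_right):
    dist = torch.linalg.norm(c_left - c_right)   # Euclidean distance
    threshold = k_left + k_right                 # assumed threshold rule
    if dist > threshold:
        return torch.tensor(0.0)                 # hands too far apart: no interaction
    return 1.0 - dist / threshold                # assumed ramp toward 1 as hands near

def fuse(global_f, local_f, dep_f, mu, fc: nn.Linear):
    gated = mu * dep_f                           # scale the dependent features
    cat = torch.cat([global_f, local_f, gated], dim=1)  # concatenation
    return fc(cat)                               # fully connected fusion

mu = interaction_strength(torch.tensor([10.0, 12.0]),
                          torch.tensor([11.0, 12.5]), 3.0, 3.0)
fc = nn.Linear(256 + 256 + 256, 512)             # illustrative dimensions
hand_feat = fuse(torch.randn(2, 256), torch.randn(2, 256),
                 torch.randn(2, 256), mu, fc)
```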
8. The method according to any one of claims 2 to 4, wherein the left-hand parameter subgraph comprises pose parameters of the left hand, shape parameters of the left hand, and weak-perspective camera parameters corresponding to the left hand; the right-hand parameter subgraph comprises pose parameters of the right hand, shape parameters of the right hand, and weak-perspective camera parameters corresponding to the right hand;
the finger joint segmentation graph is a probability segmentation volume comprising a left-hand probability segmentation volume corresponding to the left-hand segmentation subgraph, a right-hand probability segmentation volume corresponding to the right-hand segmentation subgraph, and a background dimension; a voxel in the left-hand probability segmentation volume holds the probability logits of a plurality of knuckle categories of the left hand; a voxel in the right-hand probability segmentation volume holds the probability logits of a plurality of knuckle categories of the right hand; and pixels in the background dimension characterize the probability of belonging to the background region;
the left-hand prior subgraph comprises pose parameters of the left hand, shape parameters of the left hand, and weak-perspective camera parameters corresponding to the left hand; the right-hand prior subgraph comprises pose parameters of the right hand, shape parameters of the right hand, and weak-perspective camera parameters corresponding to the right hand;
the modeling the two hands according to the hand feature representation of the two hands comprises:
inputting the hand feature representations of the two hands into a parameterized model and regressing a reconstructed three-dimensional model of the two hands.
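A sketch of the final regression step of claim 8. The patent says only that the hand feature representations are fed to a parameterized model; the head dimensions below follow a MANO-style hand model (48 axis-angle pose parameters and 10 shape parameters per hand), which is an assumption, and the mesh-generation call itself is omitted as implementation-specific.

```python
import torch
import torch.nn as nn

class HandRegressor(nn.Module):
    """Regresses parametric-model inputs from one hand feature vector."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.pose_head = nn.Linear(feat_dim, 48)    # axis-angle pose parameters
        self.shape_head = nn.Linear(feat_dim, 10)   # shape (beta) parameters
        self.cam_head = nn.Linear(feat_dim, 3)      # weak-perspective camera

    def forward(self, hand_feat: torch.Tensor):
        pose = self.pose_head(hand_feat)
        shape = self.shape_head(hand_feat)
        cam = self.cam_head(hand_feat)
        # A parametric hand model (e.g. MANO) would map (pose, shape) to a
        # 3D hand mesh here; that call is left out of this sketch.
        return pose, shape, cam

reg = HandRegressor()
pose, shape, cam = reg(torch.randn(2, 512))   # one regressor per hand
```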
9. The method of claim 8, wherein the method further comprises:
training the feature encoding network, the feature aggregation network, and the parameterized model according to the loss of the hand center subgraph, the loss of the finger joint segmentation graph, and the loss of the two-hand three-dimensional model.
10. The method of claim 9, wherein the training the feature encoding network, the feature aggregation network, and the parameterized model based on the loss of the hand center subgraph, the loss of the finger joint segmentation graph, and the loss of the two-hand three-dimensional model comprises:
summing the loss between the left-hand center subgraph and a ground-truth left-hand center subgraph and the loss between the right-hand center subgraph and a ground-truth right-hand center subgraph to obtain a first sub-loss;
calculating the loss between the finger joint segmentation graph and a ground-truth finger joint segmentation graph to obtain a second sub-loss;
performing a weighted summation of the pose parameter loss, the shape parameter loss, and the joint loss of the two-hand three-dimensional model to obtain a third sub-loss, the joint loss comprising a three-dimensional joint position loss, a two-dimensional joint position loss, and a bone length loss;
performing a weighted summation of the first sub-loss, the second sub-loss, and the third sub-loss to obtain a target loss; and training the feature encoding network, the feature aggregation network, and the parameterized model according to the target loss.
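The training objective of claim 10 can be sketched as a weighted sum of the three sub-losses. The specific loss functions (MSE for heat maps and parameters, cross-entropy for segmentation, L1 for joints) and all weights are assumptions; the claim fixes only the three-part structure and the final weighted summation.

```python
import torch.nn.functional as F

def target_loss(pred: dict, gt: dict,
                w1=1.0, w2=1.0, w3=1.0,
                w_pose=1.0, w_shape=0.1, w_joint=1.0):
    # First sub-loss: left- and right-hand center subgraphs vs. ground truth.
    sub1 = (F.mse_loss(pred["center_l"], gt["center_l"]) +
            F.mse_loss(pred["center_r"], gt["center_r"]))
    # Second sub-loss: finger joint segmentation graph vs. ground truth.
    sub2 = F.cross_entropy(pred["seg_logits"], gt["seg_labels"])
    # Joint loss: 3D position + 2D position + bone length terms.
    joint = (F.l1_loss(pred["j3d"], gt["j3d"]) +
             F.l1_loss(pred["j2d"], gt["j2d"]) +
             F.l1_loss(pred["bones"], gt["bones"]))
    # Third sub-loss: weighted pose, shape, and joint terms.
    sub3 = (w_pose * F.mse_loss(pred["pose"], gt["pose"]) +
            w_shape * F.mse_loss(pred["shape"], gt["shape"]) +
            w_joint * joint)
    # Target loss: weighted summation of the three sub-losses.
    return w1 * sub1 + w2 * sub2 + w3 * sub3
```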
11. A two-handed reconstruction device, the device comprising:
the acquisition module is used for acquiring the images of the two hands;
the encoding module is used for encoding, through a feature encoding network, a parameter graph, a hand center graph, a finger joint segmentation graph, and an interactive hand prior graph of the two-hand image; the parameter graph at least comprises basic hand parameters of the two hands, and the hand center graph is used for representing the positions of the hand centers of the two hands; the finger joint segmentation graph is at least used for representing the positions of a plurality of finger joints of the two hands, and the interactive hand prior graph is used for reasoning about the interactive relation between the two hands;
the aggregation module is used for generating a global feature representation of the two-hand image based on the hand center graph and the parameter graph through a feature aggregation network; generating a local feature representation of the two-hand image based on the finger joint segmentation graph and the parameter graph; and generating a dependent feature representation based on the hand center graph and the interactive hand prior graph;
the reconstruction module is used for generating a hand feature representation of the two hands based on the global feature representation, the local feature representation, and the dependent feature representation, and modeling the two hands according to the hand feature representation of the two hands.
12. A computer device, the computer device comprising: a processor and a memory storing a computer program that is loaded and executed by the processor to implement the two-hand reconstruction method as claimed in any one of claims 1 to 10.
13. A computer readable storage medium, characterized in that it stores a computer program, which is loaded and executed by a processor to implement the two-hand reconstruction method according to any one of claims 1 to 10.
14. A computer program product, characterized in that it stores a computer program that is loaded and executed by a processor to implement the two-hand reconstruction method according to any one of claims 1 to 10.
15. A computer program, characterized in that it is loaded and executed by a processor to implement the two-hand reconstruction method according to any one of claims 1 to 10.