WO2021155653A1

WO2021155653A1 - Human hand-object interaction process tracking method based on collaborative differential evolution filtering

Info

Publication number: WO2021155653A1
Application number: PCT/CN2020/101671
Authority: WO
Inventors: 李东年; 郭阳; 陈成军; 赵正旭; 温晋杰; 张庆海
Original assignee: 青岛理工大学
Priority date: 2020-02-06
Filing date: 2020-07-13
Publication date: 2021-08-12
Also published as: CN111311648A

Abstract

Disclosed is a human hand-object interaction process tracking method based on collaborative differential evolution filtering. The method comprises: extracting a foreground area corresponding to a human hand and an object in an image to be detected, and generating an observation depth map and a corresponding observation silhouette map; respectively obtaining a human hand motion posture and an object motion posture on the basis of a constructed human hand kinematic model and an object kinematic model, wherein the human hand motion posture and the object motion posture form a human hand-object posture vector, and generating a corresponding rendering depth map; by means of taking the image to be detected as an observation input, constructing a matching error function of the observation input and the human hand-object posture vector; and using a collaborative differential evolution filtering algorithm to respectively perform posture optimization on the human hand and the object by means of calculating the matching error function, so as to obtain motion tracking of the human hand and the object during the human hand-object interaction process. The robust tracking of human hand-object motion is performed by using a small number of particles.

Description

Human-hand-object interaction process tracking method based on cooperative differential evolution filtering

Technical field

The present disclosure relates to the technical field of three-dimensional human hand tracking, and in particular to a method for tracking a human hand-object interaction process based on collaborative differential evolution filtering.

Background technique

The statements in this section merely provide background technical information related to the present disclosure, and do not necessarily constitute prior art.

Computer vision-based 3D human hand tracking can be applied in fields such as robot teaching and learning, motion capture, human-computer interaction, gesture recognition, etc. However, the tracking of the human hand-object interaction process is trapped by many complex factors. First, because the human hand has multiple degrees of freedom, the problem is essentially a high-dimensional space problem; secondly, the human hand will cause frequent occlusion during the interaction with the object, including the mutual occlusion between the human hand and the object being manipulated. The self-occlusion of the human hand; in addition, the useful information carried by the object context will promote the recognition and estimation of human hand movement.

At present, vision-based hand-object tracking methods are generally divided into two categories: appearance-based methods and model-based methods.

The appearance-based method builds a mapping through learning, and maps the image feature space to the human hand-object state space, thereby directly estimating the human hand state and the object state from the image features. This type of method does not need to be initialized, and the tracking speed is fast, but its accuracy is affected by the training samples.

Et al. proposed a method to simultaneously recognize the human hand movement and the manipulated object, and express the time-varying relationship between the human hand movement and the object through a conditional random field model, but this method does not give detailed information about the hand movement posture. Romero et al. proposed a real-time appearance-based non-parametric method to reconstruct the three-dimensional pose of the human hand interacting with the object. The method uses a histogram of directional gradients (HoG) to describe the characteristics of the human hand and executes it in a large template database. Nearest neighbor search is used to find the hand pose that best matches the input image. However, because of the appearance-based method, this method cannot precisely track the hand movement in a high-dimensional space. Gupta et al. proposed a Bayesian method to integrate multiple perception tasks in the process of human-object interaction, and seek consistent semantic expression by imposing spatial constraints on the perception elements. This method can recognize objects and corresponding actions when the appearance is not sufficiently discernible, and can also recognize human actions from static images without using any motion information. However, this method does not give detailed information about the posture of the human body. Yao et al. used a new random field model to jointly model objects and human body poses, and used a structure learning method to estimate the degree of connection between the objects, human body poses, and various parts of the human body, and adopted a new The maximum edge algorithm calculates the parameters of the model. In this mode, object detection provides strong prior knowledge for human pose estimation, and the estimation of human pose allows the system to more accurately detect objects that interact with the human body. However, this method adopts a two-dimensional estimation method for human posture.

Model-based methods use pre-established human hand models and object models to generate hand-object posture hypotheses. The features extracted from the model are compared with those extracted from visual observations, and the similarity between the two is evaluated. A group of human hand-object states with the best similarity are searched in the model state space. This type of method can use more prior information (such as human hand shape, joint constraints, etc.), but its tracking process needs to be initialized, and it faces a difficult problem of searching in a high-dimensional space. Hamer et al. activated an independent local tracker for each part of the multi-joint human hand, used a paired Markov random field to connect two adjacent human hand parts, and used belief propagation (BP) to find the optimal human hand. State configuration, but this method does not model the manipulated object. Oikonomidis et al. proposed a model-based method to track the movement of the human hand and the manipulated object at the same time. This method establishes a three-dimensional model and a motion model for the human hand and the object at the same time. The tracking problem is regarded as a sequence optimization problem, and the search and input image matching error The smallest human hand pose parameters and model pose parameters, the system uses a multi-eye RGB image sequence as input. Kyriazis et al. used a depth camera to obtain observation input, and proposed a method of searching only the hand posture parameters. The state of the object is derived from the state of the hand and the force model between the hand and the object. However, the interaction between objects in the real world The role involves many factors, and it is difficult to accurately model it.

In summary, the inventor found that the prior art has at least the following problems: in the method based on appearance, the detailed information of the hand movement posture cannot be given, the hand movement cannot be tracked finely in the high-dimensional space, and the hand movement is limited to two dimensions. In the model-based method, there are fineness problems in the three-dimensional modeling of human hands and objects, and the particle filter framework is used to track the motion of human hands or human bodies. Due to the extreme sparsity of particle sampling in high-dimensional space, It is difficult to use a limited number of particles to effectively express the true posterior distribution of the human hand state, which can easily lead to tracking failure.

Summary of the invention

In order to solve the above problems, the present disclosure proposes a method for tracking the human hand-object interaction process based on cooperative differential evolution filtering, which uses a model-based method to simultaneously track human hands and objects in the human hand-object interaction process, and integrates the differential evolution algorithm In the particle filter framework, two coordinated particle filter trackers are used to track the movement of human hands and objects respectively, and differential evolution is used to optimize the matching error under the current observation to drive the particles to move to the high likelihood area and improve the particles. The sample distribution is filtered, so that a small number of particles can be used to achieve robust tracking of human hand-object movement.

In order to achieve the above objectives, the present disclosure adopts the following technical solutions:

In the first aspect, the present disclosure provides a method for tracking human hand-object interaction process based on cooperative differential evolution filtering, including:

Extract the foreground area corresponding to the human hand and the object in the image to be measured, and generate the observation depth map and the corresponding observation silhouette map;

Based on the constructed hand kinematics model and object kinematics model, the hand motion posture and the object motion posture are obtained respectively. The hand motion posture and the object motion posture form the hand-object posture vector and generate the corresponding rendering depth map;

Take the image to be measured as the observation input, and calculate the depth feature matching degree between the observation depth map and the rendered depth map and the silhouette feature matching degree between the observation silhouette map and the rendered depth map, and construct the observation input and the human hand-object pose vector The matching error function;

The collaborative differential evolution filtering algorithm is used to calculate the matching error function to optimize the posture of the human hand and the object respectively, and obtain the motion tracking results of the human hand and the object during the hand-object interaction process.

In the second aspect, the present disclosure provides a human-hand-object interaction process tracking system based on cooperative differential evolution filtering, including:

The image processing module to be measured is configured to extract the foreground area corresponding to the human hand and the object in the image to be measured, and generate an observation depth map and a corresponding observation silhouette map;

The hand-object movement posture module is configured to obtain the hand movement posture and the object movement posture based on the constructed hand kinematics model and the object kinematics model respectively. The hand movement posture and the object movement posture form the hand-object posture vector and generate the corresponding rendering Depth map

The matching error function building module is configured to take the image to be measured as the observation input, and to calculate the depth feature matching degree between the observation depth map and the rendered depth map and the silhouette feature matching degree between the observation silhouette map and the rendered depth map. Construct a matching error function between the observation input and the human hand-object pose vector;

The tracking module is configured to use the cooperative differential evolution filtering algorithm to optimize the posture of the human hand and the object by calculating the matching error function, and obtain the motion tracking result of the human hand and the object during the hand-object interaction process.

In a third aspect, the present disclosure provides an electronic device including a memory, a processor, and computer instructions stored in the memory and running on the processor. When the computer instructions are executed by the processor, a cooperative differential evolution filtering is completed. The steps described in the human-hand-object interaction process tracking method.

In a fourth aspect, the present disclosure provides a computer-readable storage medium for storing computer instructions that, when executed by a processor, complete the method for tracking a human hand-object interaction process based on collaborative differential evolution filtering. step.

Compared with the prior art, the beneficial effects of the present disclosure are:

A model-based method is used to simultaneously track human hands and objects in the process of human-object interaction, and the differential evolution algorithm is integrated into the particle filter framework, and a new improved particle filter algorithm-collaborative differential evolution filtering to track human hands-is proposed. For object movement, two coordinated particle filter trackers are used to track the movement of human hands and objects respectively, and differential evolution is used to optimize the matching error under the current observation to drive the particles to move to the high-likelihood area and improve the particle filter samples. Distribution to achieve robust tracking of human hand and object motion with a small number of particles.

Description of the drawings

The drawings of the specification forming a part of the present disclosure are used to provide a further understanding of the present disclosure. The schematic embodiments and descriptions of the present disclosure are used to explain the present disclosure, and do not constitute an improper limitation of the present disclosure.

FIG. 1 is a schematic diagram of a method for tracking a human hand-object interaction process based on cooperative differential evolution filtering according to Embodiment 1 of the disclosure;

2 is a schematic diagram of a kinematics model of a human hand provided in Embodiment 1 of the disclosure;

Fig. 3(a) is a schematic diagram of a human hand-spherical body model provided in Embodiment 1 of the present disclosure;

Fig. 3(b) is a schematic diagram of a human hand-pillar model provided in Embodiment 1 of the present disclosure;

4 is a flow chart of human hand-object tracking provided by Embodiment 1 of the present disclosure;

5(a)-(c) are diagrams of the tracking result of the interaction process between the human hand and the sphere provided in Embodiment 1 of the present disclosure;

6(a)-(c) are diagrams of the tracking results of the interaction process between the human hand and the cylinder provided in Embodiment 1 of the present disclosure.

Detailed ways:

The present disclosure will be further described below in conjunction with the drawings and embodiments.

It should be noted that the following detailed descriptions are all illustrative, and are intended to provide further descriptions of the present disclosure. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by those of ordinary skill in the technical field to which the present disclosure belongs.

It should be noted that the terms used here are only for describing specific embodiments, and are not intended to limit the exemplary embodiments according to the present disclosure. As used herein, unless the context clearly indicates otherwise, the singular form is also intended to include the plural form. In addition, it should also be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate There are features, steps, operations, devices, components, and/or combinations thereof.

Example 1

As shown in FIG. 1, this embodiment provides a method for tracking human hand-object interaction process based on cooperative differential evolution filtering, including:

S1: Extract the foreground area corresponding to the human hand and the object in the image to be measured, and generate the observation depth map and the corresponding observation silhouette map; based on the constructed human hand kinematics model and object kinematics model to obtain the hand movement posture and object movement posture, hand movement The posture and object motion posture compose the hand-object posture vector and generate the corresponding rendering depth map;

S2: Take the image to be measured as the observation input, and calculate the depth feature matching degree between the observation depth map and the rendered depth map, and the silhouette feature matching degree between the observation silhouette image and the rendered depth map, and construct the observation input and human hand-object. Matching error function of the pose vector;

S3: The coordinated differential evolution filtering algorithm is used to calculate the matching error function to optimize the posture of the human hand and the object respectively, and obtain the motion tracking of the human hand and the object during the hand-object interaction process.

In the step S1, this embodiment uses a method based on the human hand-object kinematics model to track the interaction process between the human hand and the object, establishes a three-dimensional model and a motion model for the human hand and the object, and simultaneously tracks the motion of the human hand and the object in the three-dimensional space.

In the tracking process, the human hand 3D model is used to generate the human hand posture hypothesis, the object 3D model is used to generate the object posture hypothesis, the matching error between the model feature group and the observation feature group obtained from the input image is calculated, and the tracking problem is regarded as a sequence optimization problem , Search for the state parameter that minimizes the matching error in the state space of the human hand and the object, that is, the optimal solution corresponding to the current frame of the input image.

In this embodiment, the hand motion state and the object motion state are composed of the hand-object posture vector x _ho = (x _h , x _o ). Figure 2 shows the hand kinematics model. The hand motion state x _h contains 29 freedoms in total. Degree variables include global palm motion with 6 degrees of freedom, local finger motion with 20 degrees of freedom, and 3 degrees of freedom for the wrist joint. The CMC joints of each finger are fixed, and the palm is modeled as a rigid body. Its motion corresponds to 6 global degrees of freedom (3 translation and 3 rotation) of the human hand; the motion of 5 fingers corresponds to 20 local degrees of freedom, and each finger is composed of Modeling with 4 degrees of freedom; except for the thumb, the MCP joint of each finger and the TM joint of the thumb both contain 2 degrees of freedom (1 flexion and extension and 1 abduction and adduction), while the PIP and DIP joints of each finger and the thumb’s The MCP joint and the IP joint only contain one degree of freedom in flexion and extension; the wrist joint contains one degree of freedom in flexion and extension, one degree of freedom in abduction and extension, and one degree of freedom in scale transformation.

The object motion state x _o contains the 6-degree-of-freedom pose state (3 translations and 3 rotations) of the object in the three-dimensional space.

This embodiment limits the value of the angles of the finger joints and the wrist joints of the human hand within a certain range based on human anatomical factors. The application of these motion constraints can not only ensure that the solution obtained by the posture estimation process is effective, but also greatly Compress the search range of the human state space and reduce the search difficulty.

In this embodiment, PTC Pro/Engineer and Multigen-Paradigm Creator are used to establish a unified three-dimensional model for the human hand and the manipulated object with parameterized geometric primitives, and a tree-like hierarchical organization structure is established for the human hand-object model in the Creator. Add local coordinate system and DOF (Degree of Freedom) motion nodes.

In addition, the three-dimensional human hand model established in this embodiment includes a part of the human forearm, so that the established three-dimensional model can describe the forearm pixels connected to the human hand pixels in the segmented depth image. The wrist joint has a scale transformation. Degree of freedom, capable of telescopic transformation of the forearm model.

This embodiment proposes the interaction process between the human hand and the following two types of objects: a sphere and a column. Figure 3(a) shows the three-dimensional model of the human hand and the sphere, and Figure 3(b) shows the three-dimensional model of the human hand and the column. Model; In addition, the method used is also suitable for tracking the interaction process between human hands and objects of other shapes.

In the step S2, when constructing the matching error function and the observation likelihood function, this embodiment combines the two types of feature information, the depth feature and the silhouette feature. Taking the depth image obtained by the Kinect depth camera as the observation input z, the foreground area corresponding to the human hand and the manipulated object is extracted through simple depth threshold segmentation, and the observation depth map z _d (z) is generated, and the depth map z obtained from the observation is _d (z) generate observation silhouette map z _s (z);

For each human hand-object posture vector x _ho = (x _h , x _o ), given the calibration information of the depth camera, the corresponding rendering depth map r _d (x _ho ) is generated by the rendering depth Figure r _d (x _ho ) generates a rendered silhouette image r _s (x _ho ); z _s (z) and r _s (x _ho ) are both binary images, which are taken at the foreground area corresponding to the human hand and the manipulated object It is 1, and the value is 0 in the background.

The matching error function is used to express the matching degree between the observation z and the human hand-object pose vector x _ho . A small matching error means a high matching degree. In this embodiment, the matching error function is defined as:

E(z,x _ho )=λ _d E _d (z,x _ho )+λ _s E _s (z,x _ho )+λ _p E _p (x _h ) (1)

In the above formula, E (z, x _ho) by the depth of feature items E _d, silhouette characteristic items E _s and a penalty term E _p of three parts, λ _d, λ _s and λ _p weighting factor is a constant weight of each portion.

Among them, in the matching error function,

S2.1: E _d measures the depth deviation between the observed depth map z _d (z) and the rendered depth map r _d (x _ho _{) corresponding to the attitude vector x ho} , which is defined as follows:

The depth deviation (in mm as the measurement unit) is calculated and accumulated pixel by pixel on the entire feature map, and the accumulated sum is normalized by dividing by the total area of the human hand and the pixel area of the operated object. Certain large depth deviations will cause large changes in the function value, thereby affecting the performance of the search method. For this reason, the maximum depth deviation constant T _{d is} introduced, and the range of the depth deviation on each pixel is limited to [0, T _d ].

S2.2: E _s describes the matching degree of silhouette features by calculating the size of the non-overlapping area between the observation silhouette image z _s (z) and the rendered silhouette image r _s (x _{ho ), which is defined as follows:}

The first part of the above formula calculates _{the pixel area belonging to the observation silhouette area z s} (z) but not the rendered silhouette area r _s (x _ho ), and the second part calculates the pixel area belonging to r _s (x _ho ) instead of z _s (z ) Of the pixel area, the two parts were standardized. The application of the regional feature item E _s has a smooth effect on the objective function, reducing the local minimum around the global minimum, so that the optimization process can better converge to the actual global minimum, and enhance the robustness of the optimization process sex.

S2.3: penalty for the mutually adjacent fingers penetrate, matching error function E (z, x _h) adds a priori portion, i.e., the third portion E _p (x _h), which is defined as follows:

Among them, J represents three pairs of adjacent fingers except the thumb,

Represents the deviation between the abduction and adduction angles of the MCP joints of a pair of fingers in the human hand posture hypothesis x _h.

S2.4: The observation likelihood function and the matching error function E(z, x _ho ) are in a monotonically decreasing relationship. The observation likelihood function is defined as follows:

p(z|x _ho )∝exp(-λ _e ·E(z,x _ho )) (5)

Among them, λ _e is a constant normalization factor, and its value is determined by the observation noise.

In the step S3, the present embodiment uses the cooperative differential evolution filter algorithm to optimize the poses of the human hand and the object by calculating the matching error function. This embodiment integrates the differential evolution algorithm into the particle filter framework, and proposes a new tracking method. The algorithm, that is, the cooperative differential evolution filter algorithm to track the human hand-object movement in the high-dimensional space. The algorithm uses two cooperative particle filter trackers to track the human hand and the object respectively, and uses differential evolution to track the current observations. The matching error is optimized to improve the particle filter sample distribution.

Specifically: Differential evolution algorithm is an efficient emerging swarm intelligence optimization algorithm, which can effectively solve the optimization problem of nonlinear and non-differentiable objective equations. After initialization, differential evolution passes through N D-dimensional vectors

Iterative evolution of to search for the global optimal solution in a continuous space. The evolution of the population is carried out through the three basic operations of mutation, crossover, and selection; mutation and crossover operations are used to generate new candidate individuals, and the selection operation is used to determine whether the newly generated candidate individuals can survive in the next generation.

During the mutation operation, for each individual index i in the population, differential evolution randomly selects 3 different individuals from the previous generation, and combines them to generate a mutant individual.

Among them, the individual indexes r ₁ , r ₂ , and r ₃ are randomly selected in the range of [1,2,...,N], they are different from each other and different from i; F is the difference vector

The scaling factor of to control the convergence speed of the search process;

The scale factor F of the standard differential evolution algorithm is a constant. In this embodiment, to improve the convergence of the algorithm, a "jitter" factor σ = 1.0 is used to adjust F in each dimension, then F = F _C · N (0 , 1), where, F _C is a constant, N (0,1) is a zero mean Gaussian random number variance. In this embodiment, F _{C is} taken as 0.5.

Then, through the crossover operation, the mutated individual

With the old individual

Combine to generate candidate individuals

Among them, rand ^j ~ U(0,1) is a random number uniformly distributed in the interval [0,1]; the cross parameter CR determines the probability of each element of the candidate individual being inherited from the variant individual. In this embodiment, CR Take 0.9;

It is a random number selected in the range [1,2,...,D] to ensure that the candidate individual obtains at least one element from the variant individual.

After mutation and crossover operations, perform a one-to-one greedy selection operation:

Comparison of newly generated candidates

And old individuals

To determine which individual will be retained for the next generation; if the candidate individual

Has an older individual

Better objective function value, it will replace

Be retained to the next generation; otherwise, the old individual will continue to be retained.

The basic steps of the differential evolution algorithm can be summarized as follows:

1) Initialization: to the population

Perform random initialization; evaluate individuals in the population according to the objective function, and record the corresponding target value;

The individual with the optimal target value in the population is copied to the global optimal b ₀ of the population, and its corresponding target value is recorded;

2) Mutation: Perform mutation operation on individuals in the population according to formula (6) to generate mutated individuals

3) Crossover: According to formula (7), the old individuals in the population

And its corresponding variant individuals

Perform crossover operations to generate candidate individuals

4) Evaluate each candidate individual: According to the objective function to generate each candidate individual

Perform evaluation and record the corresponding target value;

5) Selection: According to formula (8) in the old individual

And its corresponding candidate individuals

Perform a selection operation between them to determine which of the two will be retained in the new population;

6) Update the global optimal: compare all new individuals

And b _g global optimum target value, to generate a new globally optimal b _{g + 1;}

7) Determine whether it is over: If it is, output the global optimal b _g+1 and its corresponding target value, and exit the algorithm, otherwise go to step 2.

In this embodiment, a differential evolution population is allocated to the human hand and the manipulated object to perform pose optimization. The human hand motion posture x _h and the object motion posture x _o of the current frame are respectively optimized, and these two populations are denoted as the population h and the population o. ；

When the population h hand gesture motion of the current frame X _h iterative optimization, operational attitude of the object to be treated as a static X _o, is operated in the pose of the object X _o o optimization process starts by the population of the previous frame determining optimal results; the population at the time of moving object o X _o pose current frame iterative optimization, the hand gesture considered static _h X, X _h hand gesture at the beginning of the process optimization of h on a population of The optimization result is determined.

Particle filtering is a robust motion tracking framework. Through the propagation of multiple samples in time, it has the characteristic of expressing multimodal distribution. The basic idea is: according to the sampling value of the posterior probability distribution p(x _t-1 |z _{1:t-1) of the system state at time t-1}

Use the prediction model p(x _t |x _t-1 ) and the observation model p(z _t |x _t ) to find a set of sampling values that approximate the posterior probability distribution of the system state at time t p(x _t |z _1:t)

Among them, the superscript i is the particle number, and x _t is the system state vector at time t. In this embodiment, it represents the hand-object posture x _ho,t at time t, w _t is the _{corresponding weight of x t} , and z _1:t is the system The observation value from time 1 to time t is accumulated.

One of the main problems of the standard particle filter algorithm is that it uses the state transition prior model p(x _t |x _t-1 _{) that does not consider the latest observation z t} as the important density function, so the importance of the particle sampling process is sub-optimal . In the tracking process, in order to approximate the true posterior probability density distribution of the system state, the standard particle filter algorithm needs to collect a large number of samples. If the sample set is too small, it will cause sample poverty, reduce the estimation accuracy, and even lead to the divergence of the sample set and the failure of the estimation.

Differential evolution filtering integrates the differential evolution algorithm into the particle filter framework. After predicting the new particle position, using _{the matching error function under the latest observation z t} as the objective function, running the differential evolution algorithm to iteratively evolve the particles and move the particles Go to an area with greater observation likelihood in the state space. The optimization process of the particle position can be regarded as an importance sampling process, and the new particle swarm generated after the optimization process can be regarded as the optimal importance distribution p(x _t |x _t-1 ,z _t ) An approximation. Through the optimization process of the differential evolution algorithm, the particle filter sample distribution is improved, and the convergence of the particle set is accelerated, so that a small number of particles can be used to achieve robust tracking of human hand-object motion.

As shown in equation (9), differential evolution filtering defines the transfer prior p(x _t |x _t-1 ) as a first-order motion model, which is used to propagate particles in a time series.

in,

The final position where the particles converge to after the fixed algebra G is optimized for the differential evolution iteration at time t-1.

Multivariate Gaussian noise with 0 mean value, Σ is its covariance matrix, and the diagonal elements of Σ are determined by the maximum inter-frame angle or displacement difference of the sequence to be tracked; the obtained new particle set

Used to initialize the differential evolution population at time t.

The differential evolution filtering algorithm is summarized as follows:

For t>0:

1) Resampling: According to the weight of the particle set

Perform resampling to obtain a new set of equal-weight particles

2) Prediction: According to formula (9), predict the position of the particle at time t from the position of the particle at time t-1, and obtain a new particle set

3) Optimization: Take _{the matching error function under the latest observation z t} as the objective function, and run the differential evolution algorithm to analyze the particle set

optimize;

4) Weight update: use observation likelihood to update particle weights

Get the weighted particle set

And weight

Normalize to make

5) State estimation: output the system state estimation value based on the maximum a posteriori criterion.

In this embodiment, two coordinated differential evolution filter trackers are used to track the motion and posture of a human hand and an object respectively, and a collaborative differential evolution filter algorithm is proposed. By assigning a differential evolution filter tracker to the human hand and the manipulated object, respectively, the human hand movement posture x _h and the object movement posture x _o are tracked. The two trackers are not independent of each other, but constantly exchange information during the tracking process. When the human hand tracker is _{iteratively optimizing the human hand motion posture x h} in the current frame, the posture x _{o of the} manipulated object is regarded as static, and the posture x _{o of the} manipulated object is adjusted by the corresponding object tracker at the beginning of the optimization process. The tracking result of the previous frame is determined; while the object tracker considers the hand posture x _h _{as static when iteratively optimizes the object motion posture x o} of the current frame, and the hand posture x _h is tracked by the hand at the beginning of the optimization process To determine the tracking result of the previous frame.

After each tracker obtains the posture tracking result of the current frame, it immediately passes it to another tracker, and the corresponding posture value remains static during the iterative optimization process of the next frame of the other tracker. This collaborative tracking scheme not only models the occlusion by considering the human hand and the manipulated object, but also decomposes the joint pose space through the use of multiple trackers, and decomposes the high-dimensional problem into multiple relatively low-dimensional problems. The problem of dimensionality reduces the difficulty of optimizing the search and reduces the computational cost.

Experimental verification process: In this embodiment, the depth image obtained by the Kinect depth camera is used as the observation input, and the human hand-object tracking prototype system is developed based on the 3D graphics rendering technology, and the pre-configured 3D human hand-object model is loaded into the 3D graphics. In the rendering engine OpenSceneGraph (OSG), during the tracking process, the osgSim::DOFTransform class is used to control the movement of the human hand and the object, and the OSG off-screen rendering technology is used to render the depth image of the human hand-object model, which is used to communicate with the observation image. The matching error value and observation likelihood value of each particle are compared and calculated, and the state parameter that minimizes the matching error is searched in the state space of the human hand and the object through the collaborative differential evolution filtering algorithm.

OSG is an open source cross-platform graphics engine based on OpenGL. It uses a tree structure (scene node tree) to organize spatial data, and achieves high performance through a variety of scene cutting technologies, rendering state sorting, and multi-threaded rendering mechanisms. 3D graphics rendering. The rendering process of each frame of OSG can be broken down into three stages: update traversal, crop traversal, and drawing traversal. By default, multi-threaded mode is used to render the scene. A thread is created for each camera and its corresponding graphics device. The clipping operation is performed in the thread, and the drawing operation is performed in the graphics device thread. This multi-threaded mode will start a new frame of scene update and cropping operations before the end of the drawing work of the graphics device thread, thereby improving the operating efficiency of the system and maximizing the computing power of the system.

As shown in Figure 4, this embodiment is based on the proposed collaborative differential evolution filtering algorithm, by using OSG and off-screen rendering technology to develop a hand-object tracking prototype system, creating a virtual camera to render each hand-object posture hypothesis The corresponding depth image is calculated for matching error.

This camera has a scene model node as a child node and is bound to a device cache object at the same time. The scene model node contains three-dimensional models of human hands and objects, and the device cache object can be bound to the camera through a frame cache object (FBO). During the rendering process of each frame of OSG, the virtual camera will render the content of its scene model child nodes to its bound cache object.

The system of this embodiment uses a collaborative differential evolution filtering algorithm to iteratively calculate new human hand-object posture parameters. The system creates a node callback object (osg::NodeCallback) for the scene model node, which is used to update the posture parameters of the human hand and object model during the update phase of each frame of OSG. At the same time, the system also creates a drawing callback object (osg::Camera::DrawCallback) for the camera. When the camera renders the updated 3D human hand and object model to the cache object, the system calculates the rendered depth image in this callback object The matching error with the observed depth image. Since OSG runs in multi-threaded mode by default, each frame will start a thread for each camera and its associated graphics device. When the drawing phase of the previous frame has not ended, the update phase of the next frame will begin. . To avoid data conflicts between threads, the system creates an event object for the camera, and uses Win32 API's SetEvent() function and WaitForSingleObject() function to synchronize and communicate between threads. After the calculation of the matching error is completed in the graphics device thread, the corresponding event object is set to a signaled state through the SetEvent() function, and the main thread is notified. The main thread will perform the next calculation operation after receiving the event signal.

This embodiment conducts experiments on real sequences to verify the effectiveness of the human hand-object motion tracking method proposed in this embodiment. In all experiments, the collaborative differential evolution filtering algorithm proposed in this embodiment uses 32 particles for the human hand posture tracker, 8 particles for the object posture tracker, and both trackers for each frame of image input, the DE algorithm is iteratively optimized 60 Second-rate. The tracking experiment in this embodiment runs on a PC with a 4-core Core i5 2.9 GHz CPU, 4.0 GB memory and Nvidia GeForce GTX 950M GPU, and it takes an average of 5 seconds to track one frame of image.

In this embodiment, the real sequence is used to evaluate the tracking algorithm, and the depth image sequence captured by the Microsoft Kinect 1.0 Beta 2 SDK is used as the observation input, the image resolution is 640×480, and the frame rate is 30 frames/s.

The experiment is divided into two groups. The first experiment tracks the movement process of the human hand grasping the sphere. Figures 5(a)-(c) show the tracking results of this embodiment on some frames of the real sequence of the interaction process between the human hand and the sphere; Figure 5(a) is the RGB image captured by the Kinect RGB camera, Figure 5(b) is the depth image captured by the Kinect depth camera and subjected to simple depth segmentation, and Figure 5(c) is the collaborative differential evolution filtering algorithm The result of tracking the depth image sequence;

The second group of experiments tracked the movement process of human hands grasping the cylinder. Figures 6(a)-(c) show the tracking results of this embodiment on some frames of the real sequence of the interaction process between the human hand and the cylinder; Figure 6(a) is the RGB image captured by the Kinect RGB camera, and Figure 6 (b) is the depth image captured by the Kinect depth camera and after simple depth segmentation. Figure 6(c) is the result of tracking the depth image sequence using the collaborative differential evolution filtering algorithm. It can be seen from the experimental results that the collaborative differential evolution filtering algorithm can effectively track the interaction process between human hands and objects.

In other embodiments, it also provides:

A human-hand-object interaction process tracking system based on cooperative differential evolution filtering, including:

An electronic device, including a memory and a processor, and computer instructions stored on the memory and running on the processor. When the computer instructions are executed by the processor, it completes a human hand-object interaction process tracking based on cooperative differential evolution filtering The steps described in the method.

A computer-readable storage medium is used to store computer instructions that, when executed by a processor, complete the steps described in a method for tracking a human hand-object interaction process based on collaborative differential evolution filtering.

In the above embodiments, it is possible to realize simultaneous tracking of human hands and objects in the process of human hand-object interaction. The differential evolution algorithm is integrated into the particle filter framework, and two coordinated particle filter trackers are used to separately track the human hand and the object. Object motion tracking, using differential evolution to optimize the matching error under current observations to drive particles to move to high-likelihood regions, improve particle filter sample distribution, and achieve robust tracking of human hand and object motion with a small number of particles.

The above are only preferred embodiments of the present disclosure and are not used to limit the present disclosure. For those skilled in the art, the present disclosure may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure shall be included in the protection scope of the present disclosure.

Although the specific embodiments of the present disclosure are described above in conjunction with the accompanying drawings, they do not limit the scope of protection of the present disclosure. Those skilled in the art should understand that on the basis of the technical solutions of the present disclosure, those skilled in the art do not need to make creative efforts. Various modifications or deformations that can be made are still within the protection scope of the present disclosure.

Claims

A tracking method for human-hand-object interaction process based on cooperative differential evolution filtering, which is characterized in that it includes:

Extract the foreground area corresponding to the human hand and the object in the image to be measured, and generate the observation depth map and the corresponding observation silhouette map;

Based on the constructed hand kinematics model and object kinematics model, the hand motion posture and the object motion posture are obtained respectively. The hand motion posture and the object motion posture form the hand-object posture vector and generate the corresponding rendering depth map;

Take the image to be measured as the observation input, and calculate the depth feature matching degree between the observation depth map and the rendered depth map and the silhouette feature matching degree between the observation silhouette map and the rendered depth map, and construct the observation input and the human hand-object pose vector The matching error function;

The collaborative differential evolution filtering algorithm is used to calculate the matching error function to optimize the posture of the human hand and the object respectively, and obtain the motion tracking results of the human hand and the object during the hand-object interaction process.
The human-hand-object interaction process tracking method based on collaborative differential evolution filtering according to claim 1, wherein the depth feature item E d in the matching error function is defined as the depth between the calculated observation depth map and the rendered depth map deviation:

Among them, x ho is the hand-object pose vector, z is the observation input, z d (z) is the observation depth map, z s (z) is the observation silhouette map, r s (x ho ) is the rendered silhouette map, r d ( x ho ) is the rendering depth map, and T d is the maximum depth deviation constant.
The human-hand-object interaction process tracking method based on collaborative differential evolution filtering according to claim 1, wherein the silhouette feature item E s in the matching error function is defined as the difference between the observed silhouette image and the rendered depth image by calculation. The size of the overlapping area describes the matching degree of silhouette features:

Among them, x ho is the hand-object pose vector, z is the observation input, z s (z) is the observation silhouette image, and r s (x ho ) is the rendered silhouette image.
Differential Evolution based collaborative filtering according manpower claim 1 - interaction object tracking method, characterized in that, said matching error function added penalty term E p (x h), which is defined as follows:

Among them, x h is the hand movement posture, J represents three pairs of adjacent fingers except the thumb,
It represents the deviation between the abduction and adduction angles of a certain pair of fingers in the hand movement posture x h, and p is the observation likelihood function.
The human-hand-object interaction process tracking method based on cooperative differential evolution filtering according to claim 4, wherein the observation likelihood function and the matching error function E(z, x ho ) are in a monotonically decreasing relationship, and the observation The likelihood function is defined as follows:

p(z|x ho )∝exp(-λ e ·E(z,x ho ))

Among them, λ e is a constant normalization factor, whose value is determined by the observation noise, and x ho is the hand-object pose vector.
The human hand-object interaction process tracking method based on cooperative differential evolution filtering according to claim 1, characterized in that the cooperative differential evolution filtering algorithm is used to allocate differential evolution populations to human hands and objects respectively, and to determine the hand movement posture x h and object movement. The posture x o is optimized, and the two differential evolution populations are recorded as population h and population o;

When the population h performs iterative optimization on the human hand motion posture x h , the object motion posture x o is regarded as static, and the object motion posture x o is determined by the optimization result of the previous frame by the population o at the beginning of the optimization process;

When the population o performs iterative optimization on the object motion posture x o , the hand motion posture x h is regarded as static, and the hand motion posture x h is determined by the optimization result of the population h on the previous frame at the beginning of the optimization process.
The human-hand-object interaction process tracking method based on cooperative differential evolution filtering according to claim 1, wherein the matching error function E(z, x ho ) is:

E(z,x ho )=λ d E d (z,x ho )+λ s E s (z,x ho )+λ p E p (x h )

Wherein, E d is the depth of the feature item, E s is the sketch feature item, E p is the penalty term, x ho human hand - object pose vector, z is observed input, x h is hand motion gestures, λ d, λ s, and [lambda] p is the weighting factor;

The adoption of the collaborative differential evolution filtering algorithm includes: re-sampling the particle set according to the particle weight to obtain an equal-weight particle set;

Predict the position of the particle at time t from the position of the particle at time t-1, and obtain a new particle set;

Take the matching error function under the latest observation input as the objective function, and use the differential evolution algorithm to optimize the new particle set;

Use the observation likelihood function to update the particle weights to obtain the weighted particle set, and normalize the particle weights; use the maximum posterior criterion to output the state estimation value of the human hand-object interaction process.
A human-hand-object interaction process tracking system based on cooperative differential evolution filtering is characterized in that it includes:

The image processing module to be measured is configured to extract the foreground area corresponding to the human hand and the object in the image to be measured, and generate an observation depth map and a corresponding observation silhouette map;

The hand-object movement posture module is configured to obtain the hand movement posture and the object movement posture based on the constructed hand kinematics model and the object kinematics model respectively. The hand movement posture and the object movement posture form the hand-object posture vector and generate the corresponding rendering Depth map

The matching error function building module is configured to take the image to be measured as the observation input, and to calculate the depth feature matching degree between the observation depth map and the rendered depth map and the silhouette feature matching degree between the observation silhouette map and the rendered depth map. Construct a matching error function between the observation input and the human hand-object pose vector;

The tracking module is configured to use the cooperative differential evolution filtering algorithm to optimize the posture of the human hand and the object by calculating the matching error function, and obtain the motion tracking result of the human hand and the object during the hand-object interaction process.
An electronic device, characterized by comprising a memory and a processor, and computer instructions stored on the memory and running on the processor, and when the computer instructions are executed by the processor, they complete the method described in any one of claims 1-7. The steps described.
A computer-readable storage medium, characterized in that it is used to store computer instructions, and when the computer instructions are executed by a processor, the steps described in any one of the methods of claims 1-7 are completed.