CN116704587B - Multi-person head pose estimation method and system integrating texture information and depth information - Google Patents

Multi-person head pose estimation method and system integrating texture information and depth information

Info

Publication number
CN116704587B
CN116704587B (application CN202310959879.1A)
Authority
CN
China
Prior art keywords
head
texture
image
information
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310959879.1A
Other languages
Chinese (zh)
Other versions
CN116704587A (en)
Inventor
Li Chenglong
Zhang Ming
Liu Xinfeng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Jianzhu University
Original Assignee
Shandong Jianzhu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Jianzhu University filed Critical Shandong Jianzhu University
Priority to CN202310959879.1A priority Critical patent/CN116704587B/en
Publication of CN116704587A publication Critical patent/CN116704587A/en
Application granted granted Critical
Publication of CN116704587B publication Critical patent/CN116704587B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/54Extraction of image or video features relating to texture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/56Extraction of image or video features relating to colour
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention relates to the technical field of data identification and computer vision, and provides a multi-person head pose estimation method and system integrating texture information and depth information. Head features are detected from the acquired original color image, and 3D feature points for matching are obtained by combining the depth information of the image; texture features of the original color image are then extracted and processed with a sparse expression method to obtain texture feature descriptors; the extracted texture feature descriptors are matched against the target area in the current image to find the best matching position, from which the head pose is obtained. By using the depth information to extract 3D feature points, information from the foreground and background layers of the image can be exploited, the target occlusion problem is addressed, and the identified target is accurately located. Combining the depth information with the color texture information allows the target face to be identified accurately and further improves the accuracy of target face identification.

Description

Multi-person head pose estimation method and system integrating texture information and depth information
Technical Field
The invention relates to the technical field of data identification and computer vision, in particular to a multi-person head posture estimation method and system integrating texture information and depth information.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Head pose estimation is a process of estimating user head pose parameters from images using a computer, and has attracted attention in the fields of pattern recognition and computer vision in recent years. Accurate estimation of head orientation and position is important for many applications, such as driver state monitoring in driving assistance systems, human-machine interaction interfaces, facial expression analysis, etc.
There are various head pose estimation methods. Conventional methods are mainly based on 2D images, but are limited by factors such as illumination conditions, shadows, and lack of features. In recent years, with the development of depth sensing technology (such as the Microsoft Kinect), depth information has become an important means of overcoming the limitations of conventional 2D methods, and many methods therefore use depth information as a key information source for head pose estimation, employing it to address challenges such as pose change and occlusion. Some of these methods use geometric features to generate a number of candidate head poses and compare the input image with pre-rendered images. Other methods treat head pose estimation as a regression problem, detecting the head and estimating the pose by training classification and regression models.
Currently, methods for estimating the head pose based on depth information still have shortcomings. Combining the color image and the depth image for head pose estimation faces several challenges: the depth images produced by current consumer-grade RGBD cameras are usually noisy and cannot be accurately aligned with the color image, which makes feature matching difficult. Moreover, if the head is occluded by other objects or by the hands during pose estimation, the depth information cannot fully capture the accurate shape and position of the head; in multi-person scenes in particular, occlusion between people is common. In addition, the head position may change during movement, especially under rapid motion or changing viewing angles, which changes the depth information of the head and further makes the head pose estimation inaccurate or causes it to fail.
Disclosure of Invention
In order to solve the problems, the invention provides a multi-person head pose estimation method and system integrating texture information and depth information, which can improve the accuracy and stability of head pose estimation.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
one or more embodiments provide a multi-person head pose estimation method fusing texture information and depth information, including the steps of:
detecting head features by using the obtained original color image, and obtaining 3D feature points for matching by combining the depth information of the image;
extracting texture features of an original color image based on the obtained 3D feature points, and processing by adopting a sparse expression method to obtain texture feature descriptors;
matching the extracted texture feature descriptors with a target area in the acquired current image to find the best matched position, wherein the matched position is a new position of the head of the person, and the posture of the head is obtained based on the new position.
One or more embodiments provide a multi-person head pose estimation system fusing texture information and depth information, comprising:
a 3D feature point recognition module configured to detect a head feature using the acquired original color image, acquire 3D feature points for matching in combination with depth information of the image;
the feature extraction module is configured to extract texture features of the original color image based on the obtained 3D feature points, and process the texture features by adopting a sparse expression method to obtain texture feature descriptors;
the matching module is configured to match the extracted texture feature descriptors with a target area in the acquired current image to find a best matching position, wherein the matched position is a new position of the head of the person, and the posture of the head is obtained based on the new position.
An electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps in the multi-person head pose estimation method described above that fuses texture information and depth information.
A computer readable storage medium storing computer instructions that, when executed by a processor, perform the steps of the multi-person head pose estimation method described above that fuses texture information and depth information.
Compared with the prior art, the invention has the beneficial effects that:
the invention combines the color image and the depth image, and fully utilizes the information advantages of the color image and the depth image. Extracting head features through the color image, processing the head features through a computer vision algorithm, and simultaneously aligning head feature points extracted from the color image with depth information through 3D information of the depth image to obtain 3D feature points for matching. And simultaneously, extracting texture features from the head region in the color image, matching the texture features extracted before with the region in the current image when the head is detected again, finding the best matching position through calculating a similarity measure, and updating the best matching position as a new position of the head of the person. According to the method, the depth information is adopted to extract the information of the 3D feature points, the information of the front and back layers of the image can be extracted, the shielding problem is solved, and the identified target is accurately positioned. The depth information is combined with the color texture information, so that the target face can be accurately identified, the accuracy of the target face identification can be further improved, and the accuracy and the stability of head posture estimation can be improved.
The advantages of the present invention, as well as additional aspects of the invention, will be described in detail in the following detailed examples.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a schematic flow chart of the method of example 1 of the present invention;
FIG. 2 is a comparison of depth image before and after filtering in accordance with embodiment 1 of the present invention;
fig. 3 is a depth map including a head region selection frame according to embodiment 1 of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof. It should be noted that, in the case of no conflict, the embodiments of the present invention and features of the embodiments may be combined with each other. The embodiments will be described in detail below with reference to the accompanying drawings.
Aiming at the problems mentioned in the background data, the invention provides that the color image and the depth image are combined, and the information advantages of the color image and the depth image are fully utilized. Extracting head features through the color image, processing the head features through a computer vision algorithm, and simultaneously aligning head feature points extracted from the color image with depth information through 3D information of the depth image to obtain 3D feature points for matching. And simultaneously, extracting texture features from the head region in the color image, matching the texture features extracted before with the region in the current image when the head is detected again, finding the best matching position through calculating a similarity measure, and updating the best matching position as a new position of the head of the person. Specific examples are described below.
Example 1
In one or more embodiments, as shown in fig. 1 to 3, a multi-person head pose estimation method integrating texture information and depth information includes the following steps:
step 1, detecting head features by using an obtained original color image, and obtaining 3D feature points for matching by combining depth information of the image;
step 2, extracting texture features of an original color image based on the obtained 3D feature points, and processing the texture features by adopting a sparse expression method to obtain texture feature descriptors;
and step 3, matching the extracted texture feature descriptors with a target area in the acquired current image to find the best matching position, wherein the matched position is the new position of the person's head, and the Euler angles of the head, namely the head pose, are obtained based on the new position.
According to this embodiment, depth information is used to extract the 3D feature points, information from the foreground and background layers of the image can be exploited, the occlusion problem is addressed, and the identified target is accurately located. Combining the depth information with the color texture information allows the target face to be identified accurately and further improves the accuracy of target face identification.
In this embodiment, texture features are extracted from a head region in a color image, and when the head is detected again, the texture features extracted previously are matched with the region in the current image, and the best matching position is found and updated as a new position of the head. The multi-person head posture estimation method based on the color image texture information fully utilizes the advantages of the depth image and the color image, adopts sparse expression and filtering technology, realizes accurate and robust head posture estimation, is excellent in performance under the conditions that continuous multi-frame detection of the person head fails and the position depth information changes greatly, and improves the accuracy and stability of head posture estimation.
In step 1, the images to be processed are acquired; an RGBD camera (e.g., Kinect) may be used to acquire multiple frames of high-resolution color images together with noisy depth images.
In step 1, a method for detecting head features and extracting head feature points based on a color image is specifically:
step 11, foreground extraction: performing foreground extraction on the color image based on the depth information;
based on the foreground-background segmentation of the depth information, the image region is segmented into foreground and background by analyzing the depth information in the image. The distance value of different pixel points in the image from the camera is provided, and the distance relation of objects in the scene is reflected.
Step 12, detecting the head of a person: and performing human head detection in the image after foreground extraction through image processing and a computer vision algorithm to obtain a depth map containing a head region selection frame, as shown in fig. 3.
Specifically, the head is detected by an Adaboost cascade classifier based on Haar features, so that head regions can be efficiently detected in the image, providing accurate position information for subsequent head pose estimation and tracking.
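For illustration only (this sketch is not part of the original disclosure), such a cascade-based detection step could look as follows in Python with OpenCV; the cascade model file, function names and parameter values are assumptions, and OpenCV's shipped frontal-face cascade is used as a stand-in for a trained head cascade:

    import cv2

    def detect_heads(color_image):
        # Haar-feature Adaboost cascade; the frontal-face model is an illustrative stand-in.
        cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
        detector = cv2.CascadeClassifier(cascade_path)
        gray = cv2.cvtColor(color_image, cv2.COLOR_BGR2GRAY)
        # Returns (x, y, w, h) selection frames, one per detected head region.
        boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                          minSize=(40, 40))
        return boxes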
In this embodiment, detecting the head of the person to obtain the depth map containing the head region selection frame means framing the head region, i.e., adding a bounding box to the image;
in this embodiment, only the color image is used to detect the head features, and the image processing and the computer vision algorithm are fused to perform head detection, so that the head features can be quickly identified.
Step 13, filtering the depth map: filtering the obtained head region depth map to obtain a filtered head region depth map, so that head characteristic points can be extracted;
head feature points, such as eyes, nose, mouth, etc., are extracted from the filtered head region depth map using image processing and computer vision algorithms. These feature points can be used as position and posture information of the head.
Specifically, the filtering process includes median filtering and morphological processing, and specifically operates as follows: the image is median filtered to remove noise, and morphological operations such as dilation or opening are then used to enhance the edges and morphological features of the image to obtain a filtered head region map.
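A minimal sketch of this filtering stage, assuming Python with OpenCV and a single-channel depth map; the kernel sizes and function names are illustrative choices, not values from the disclosure:

    import cv2

    def filter_head_depth(depth_map, median_ksize=5):
        # Median filtering removes isolated depth noise; with a 5x5 kernel the input
        # may be 8-bit, 16-bit or float32 single-channel.
        denoised = cv2.medianBlur(depth_map, median_ksize)
        kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
        # Morphological opening suppresses small speckles, dilation reinforces edges
        # and morphological features of the head region.
        opened = cv2.morphologyEx(denoised, cv2.MORPH_OPEN, kernel)
        enhanced = cv2.dilate(opened, kernel, iterations=1)
        return enhanced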
Alternatively, outliers in the depth information can be detected and corrected using the rules of depth consistency, normal consistency and re-projection consistency, effectively eliminating the influence of depth noise. As shown in FIG. 2, the left image is the depth image before filtering and the right image is the image after filtering; removing noise and enhancing image features improves the image quality and the accuracy of subsequent processing.
The filtered depth information is used for calculating the head posture parameter, namely the head 3D characteristic point, in the subsequent steps. The filtering result of the depth information plays an important role in texture feature descriptor calculation after the feature extraction step, and helps to improve accuracy and stability of the head posture parameters. This filtering-before-application strategy can effectively eliminate the interference of depth noise to subsequent steps and maximally preserve useful depth information.
In some embodiments, the 3D feature points used for matching are obtained by combining the extracted head feature points with the depth information of the image, specifically, the spatial coordinates of the head feature points are calculated by aligning the depth information with the color image, so as to obtain the 3D feature points;
optionally, the method for obtaining the 3D feature point by combining the image depth information specifically includes the following steps:
step 1.1, aligning the depth information with the color image, and aligning the head region depth map filtered in step 13 with the original color image to ensure that the head region depth map and the original color image are in the same coordinate system;
in particular, this may be achieved by calibrating camera parameters or using an alignment function that the depth sensor is self-contained with.
Step 1.2, obtaining the head characteristic points extracted in the step 13:
step 1.3, calculating the space coordinates of the 3D feature points: with the aligned depth images, the coordinates of the head feature points in three-dimensional space can be calculated in combination with the pixel coordinates of the head feature points in the color image. The coordinates thus obtained are the spatial coordinates of the 3D feature points.
Through the steps, the depth information can be combined with the color image, and corresponding 3D feature points can be obtained.
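A minimal sketch of this back-projection, assuming a pinhole camera model with known intrinsics fx, fy, cx, cy (e.g., from calibration or the sensor SDK) and a depth map already aligned to the color image; the function name and units are assumptions:

    import numpy as np

    def backproject_feature_points(points_2d, depth_aligned, fx, fy, cx, cy):
        # Convert pixel feature points (u, v) plus aligned depth into 3D camera-frame coordinates.
        points_3d = []
        for (u, v) in points_2d:
            z = float(depth_aligned[int(v), int(u)])  # depth at the feature pixel, assumed in metres
            if z <= 0:   # skip invalid / missing depth
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points_3d.append((x, y, z))
        return np.array(points_3d)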
The present embodiment combines a color image and a depth image, making full use of their information advantages. Extracting head features through the color image, processing the head features through a computer vision algorithm, and simultaneously aligning head feature points extracted from the color image with depth information through 3D information of the depth image to obtain 3D feature points for matching.
Further, the 3D feature points of the head are subjected to parameter smoothing, and specifically, the 3D feature points of the head are smoothed by using a kalman filter, so as to reduce jitter and instability. The Kalman filter can predict and update the current head pose parameters according to the previous observed values and the dynamic model.
The present embodiment employs depth map filtering and parameter smoothing techniques. Outliers in the depth information are processed through outlier filtering of the depth map, and the influence of depth noise is eliminated, so that estimation errors are reduced. Meanwhile, the Kalman filter is used for smoothing the head posture parameters, so that jitter and instability are reduced.
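As a hedged illustration of the smoothing step (not part of the original disclosure), a constant-velocity Kalman filter for one 3D feature point could be set up as follows with OpenCV; the time step and noise covariances are illustrative assumptions:

    import cv2
    import numpy as np

    def make_point_smoother(dt=1.0 / 30.0):
        # State: (x, y, z, vx, vy, vz); measurement: (x, y, z).
        kf = cv2.KalmanFilter(6, 3)
        F = np.eye(6, dtype=np.float32)
        F[0, 3] = F[1, 4] = F[2, 5] = dt          # constant-velocity motion model
        kf.transitionMatrix = F
        kf.measurementMatrix = np.hstack([np.eye(3), np.zeros((3, 3))]).astype(np.float32)
        kf.processNoiseCov = np.eye(6, dtype=np.float32) * 1e-4
        kf.measurementNoiseCov = np.eye(3, dtype=np.float32) * 1e-2
        kf.errorCovPost = np.eye(6, dtype=np.float32)
        return kf

    def smooth_point(kf, measured_xyz):
        # Predict from the motion model, then correct with the new 3D measurement.
        kf.predict()
        state = kf.correct(np.asarray(measured_xyz, dtype=np.float32).reshape(3, 1))
        return state[:3].ravel()  # smoothed (x, y, z)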
Step 2, extracting texture features of the color image based on the obtained 3D feature points and the filtered depth information, and processing by adopting a sparse representation method to obtain texture feature descriptors;
the depth information after filtering refers to the depth information in the image after filtering in step 123.
In this embodiment, the texture features of the target area are extracted based on the color image of the previous frame as the original color image, and the head position and posture of the color image of the current frame are recognized based on the texture features of the original image.
Specifically, the texture feature extraction stage in step 2 combines the depth information, i.e. the filtered depth information, to process the texture features of the original color image. By fusing the depth information, the texture features of the head region can be extracted more accurately, so that better texture feature descriptors can be obtained. The matching and updating stage then uses the extracted texture feature descriptors to match regions in the current image and find the location of the best match. The matched position will be used as the new head position and thus for calculating the pose parameters of the head.
An implementation manner, based on the obtained 3D feature points, the method for extracting texture features of an original color image comprises the following steps:
step 2.1, acquiring an original color image: acquiring an original color image from a camera or other image source, which may be a historical frame image;
step 2.2, determining a head region of interest: the head region of interest is determined according to the 3D feature points obtained in step 1.
Wherein the 3D feature points may provide position and pose information of the head for locating the head region.
Step 2.3, cutting the head area: and cutting out the head region from the original color image according to the determined head region, so as to obtain an image region only containing the head.
Step 2.4, extracting texture features: and extracting texture features of the head image area obtained by clipping.
Specifically, the texture feature extraction method may employ local binary patterns (LBP), the histogram of oriented gradients (HOG), or similar descriptors. These methods can capture texture details and edge information in the image.
After the texture features in step 2 are extracted, a sparse expression method is used to obtain texture feature descriptors: the extracted texture features are represented as feature descriptors, and a sparse dictionary D is constructed from a plurality of descriptors.
In particular, the feature descriptors may be represented as a vector, where each dimension represents the value of a feature, or as a feature image, where each pixel represents the intensity of a feature. The manner in which the descriptors are generated depends on the feature extraction method employed.
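For example, a texture feature descriptor based on a uniform LBP histogram could be computed as sketched below; this is an illustrative assumption (the disclosure does not fix the descriptor choice), using scikit-image in Python:

    import numpy as np
    from skimage.feature import local_binary_pattern

    def lbp_descriptor(head_patch_gray, radius=1, n_points=8):
        # Uniform LBP over the cropped grayscale head region; the normalized histogram
        # of LBP codes serves as the texture feature descriptor vector.
        lbp = local_binary_pattern(head_patch_gray, n_points, radius, method="uniform")
        n_bins = n_points + 2  # 'uniform' LBP yields n_points + 2 distinct codes
        hist, _ = np.histogram(lbp.ravel(), bins=n_bins, range=(0, n_bins), density=True)
        return hist.astype(np.float32)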
In this embodiment, the texture descriptor refers to a feature vector for describing texture information obtained by extracting texture features of an image and applying a sparse representation method.
The method comprises the steps of processing texture features of an extracted color image by adopting a sparse expression method, specifically modeling a target through sparse expression in a tracking process, wherein the modeling aims at describing and representing the target through sparse expression.
In the step 2, the method for obtaining the texture feature descriptors by sparse expression of the extracted texture features comprises the following steps:
step 21, acquiring a historical image, taking the head of a person as a target, and extracting each variant and feature of the target to model as a dictionary;
Various head pose images are acquired and transformed, and head detection is performed on images in different poses to obtain the extracted features of the different images, i.e., the various variants and features of the different targets, forming a dictionary with rich poses.
In the tracking process, a template of the target is used as a basis of a dictionary, in the embodiment, the dictionary is modeled as a set formed by various variants and features of the target, and the modeling mode can extract key features of the target and realize tracking and positioning of the target by matching with observation data. The object of this embodiment is the head of the person in the picture.
Specifically, the dictionary D formed by templates of the targets in this embodiment is:
D = [T_1, T_2, ..., T_n]    (1)
wherein each T_i (i = 1, ..., n) is a template of the target, representing a different variant or feature of the target head;
step 22, obtaining a newly detected current frame image, and extracting texture features to obtain the texture information d of the newly detected target; the newly detected target is expressed based on the constructed dictionary, sparse coding is performed to minimize the reconstruction error, and the optimal weight coefficient w is obtained by updating.
The newly detected target is in the current frame image; the texture information d of the newly detected target is the information extracted from the actual image.
The newly detected targets have a certain similarity and commonality with the previous targets, which may belong to the same class or have similar features, and are expressed based on the constructed dictionary.
Specifically, the next frame new target is described based on sparse representation as: the product of the dictionary and the weight coefficient matrix. Namely, the newly detected target d in the next frame is sparsely expressed by a dictionary, and is approximated as:
d ≈ D·w    (2)
wherein w is the weight coefficient.
For the newly obtained input vector d, the dictionary D and the weight coefficient w are used to reconstruct d; the reconstruction error is minimized to obtain the optimal weight coefficient, and the updating function is as follows:
w* = argmin_w ||d - D·w||_2^2 + λ·||w||_1    (3)
wherein d is the newly detected target vector; D is the dictionary containing the dictionary elements used for sparse coding; w is the weight coefficient vector; λ is the regularization parameter controlling sparsity;
in this step, the reconstruction error is the difference between the texture information d of the actually detected image and the value D·w calculated by the sparse representation in formula (2);
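A minimal sketch of this sparse coding update, assuming the lasso solver from scikit-learn as a stand-in for the l1-regularized minimization of formula (3); the column layout of the dictionary and the value of the regularization parameter are assumptions:

    import numpy as np
    from sklearn.linear_model import Lasso

    def sparse_code(d_new, dictionary, lam=0.1):
        # Solve  min_w ||d_new - D w||_2^2 + lam * ||w||_1.
        # dictionary has shape (feature_dim, n_templates); columns are the templates T_i.
        # Lasso's alpha plays the role of λ up to a constant scaling of the data term.
        model = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
        model.fit(dictionary, d_new)
        w = model.coef_
        reconstruction_error = np.linalg.norm(d_new - dictionary @ w)
        return w, reconstruction_error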
step 23, calculating texture feature descriptors based on the updated weights and the dictionary;
and step 24, updating the dictionary D by the calculated historical frame texture feature descriptors to obtain rich dictionary data and matching the next frame of image.
In step 3, for the step of matching and updating, the matching method may adopt a method of calculating similarity measure, and find the best matching position, the process is as follows:
step 31, adopting a nearest neighbor algorithm to measure the similarity between the candidate target in the current frame image and the candidate target in the historical frame image;
the marked target is the head area of the person extracted from the original color image, and the candidate target is the head area of the person in the received current frame image.
Based on head detection, multi-target head tracking is modeled as a data association problem between marked targets and candidate targets. Since head motion is continuous, the nearest neighbor algorithm is the most direct way to associate a candidate target in the current frame with a marked target, and the similarity is measured by the following formula:
dist(T_i, C_j) = sqrt( (x_i^T - x_j^C)^2 + (y_i^T - y_j^C)^2 )    (4)
wherein T_i is the i-th marked target and C_j is the j-th candidate target; x_i^T and y_i^T are the x-axis and y-axis coordinates of the i-th marked target, and x_j^C and y_j^C are the x-axis and y-axis coordinates of the j-th candidate target;
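An illustrative implementation of the distance in formula (4); the function and field names are assumptions:

    import numpy as np

    def position_distance_2d(marked, candidate):
        # Euclidean image-plane distance between a marked target and a candidate target (formula (4)).
        return np.hypot(marked["x"] - candidate["x"], marked["y"] - candidate["y"])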
and step 32, adding the depth information into the similarity measurement, and converting the depth information into a feature similarity value by adopting an exponential function to obtain the similarity value based on the depth information.
When there is a partial overlap in projection of candidate objects on the image plane, in order to effectively process occlusion, depth information is added, and the metric formula (4) is improved to the following formula:
dist(T_i, C_j) = sqrt( (x_i^T - x_j^C)^2 + (y_i^T - y_j^C)^2 + (d_i^T - d_j^C)^2 )    (5)
wherein d_i^T is the depth position coordinate of the i-th marked target and d_j^C is the depth position coordinate of the j-th candidate target; in the formula, x, y and d represent the x-axis, y-axis and depth dimensions of the head texture features, i.e., three dimensions of information; the label T denotes marked-target information and C denotes candidate-target information; the similarity between a marked target and a candidate target is calculated from x, y and d.
Based on equation (5), the feature similarity of the tagged target T and the candidate target C is defined as:
S_depth(T_i, C_j) = exp( -dist(T_i, C_j) / σ )    (6)
wherein σ is the parameter controlling the decay rate of the similarity; the distance is converted by the exponential function into a value of feature similarity. Formula (6) is the similarity value based on the depth information.
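An illustrative implementation of formulas (5)-(6), with the decay parameter σ as a tunable assumption:

    import numpy as np

    def depth_similarity(marked, candidate, sigma=1.0):
        # Distance in (x, y, d) space mapped to a similarity value with exponential decay.
        dist = np.sqrt((marked["x"] - candidate["x"]) ** 2 +
                       (marked["y"] - candidate["y"]) ** 2 +
                       (marked["d"] - candidate["d"]) ** 2)
        return np.exp(-dist / sigma)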
Step 33, based on a target vector d obtained after modeling a newly detected target through sparse expression in the tracking process, taking a reconstruction error as similarity of texture information, and converting the position and depth difference into a value of characteristic similarity by adopting an exponential function to obtain color texture information similarity;
in this embodiment, d is a value calculated based on a dictionary.
This difference is converted into a value of feature similarity through an exponential function, with the conversion formula:
S_tex(T_i, C_j) = exp( -||d_j - D·w_i||_2 / σ )    (7)
wherein T denotes the marked-target information and C denotes the candidate-target information; i indexes the i-th target, w_i is the texture descriptor (sparse coefficient vector) of the i-th target in the dictionary, and d_j is the texture information of the newly detected j-th candidate target, i.e., the newly detected target vector.
The smaller the reconstruction error, the higher the similarity; based on the texture information alone, the matching target is:
j* = argmax_j exp( -||d_j - D·w||_2 / σ )    (8)
wherein d_j is the texture information of a candidate target in the newly detected image of the next frame, and D is the dictionary; w is the sparse weight coefficient vector, representing the weight of each basis vector in the sparse coding of the target; σ is the parameter controlling the speed of the similarity decay.
The above formula (6) represents the similarity of the depth information of the two image targets, and formula (7) represents the similarity of the color texture information.
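An illustrative implementation of the color texture similarity of formula (7); the dictionary layout and the decay parameter are assumptions:

    import numpy as np

    def texture_similarity(d_candidate, dictionary, w_target, sigma=1.0):
        # Reconstruction error of the candidate texture against the marked target's
        # sparse code, mapped to a similarity value by exponential decay.
        error = np.linalg.norm(d_candidate - dictionary @ w_target)
        return np.exp(-error / sigma)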
Step 34, obtaining the texture features of the candidate targets in the currently detected image; the multi-head tracking problem is abstracted into a data association problem, and the similarity based on the depth information is fused with the similarity of the color texture information to obtain the similarity value between a candidate target and a marked target, with the formula:
S(T_i, C_j) = α·S_depth(T_i, C_j) + β·S_tex(T_i, C_j)    (9)
wherein α and β are the fusion weights;
when the head is detected again, the previously extracted texture feature descriptors are matched with the marking targets in the current image. By calculating the similarity measure, the location of the best match is found. The matched position may be updated as a new position of the head.
In this embodiment, in order to improve accuracy and robustness of pose estimation, first, only a color image is used to detect head features, and then 3D feature points for matching are acquired by combining depth information. And filtering the depth map effectively eliminates the influence of depth noise. The parameters are smoothed using a kalman filter. In practical application, a situation that a head cannot be detected by continuous multiframes may occur, so that position and depth information of the head is changed greatly. In order to solve this problem, texture information of a color image is adopted and processed by a sparse representation method. Texture information of color images can provide useful information about object surface details and texture features, helping to distinguish between features of different areas. When the head is detected again, the previously extracted texture feature descriptors are used to match the target area in the current image to find the best matching location. This may be accomplished by calculating a similarity measure (e.g., euclidean distance or correlation). The matched position may be updated as a new position of the head. The multi-person head pose estimation method based on the color image texture information provided by the embodiment fully utilizes the advantages of the depth image and the color image, and simultaneously adopts sparse expression and filtering technology to realize accurate and robust head pose estimation. The method is excellent in performance under the conditions of continuous multi-frame detection failure of the human head and large change of position depth information, and improves accuracy and stability of head posture estimation.
Example 2
Based on embodiment 1, the multi-person head pose estimation system of the present embodiment that provides fusion of texture information and depth information includes:
a 3D feature point recognition module configured to detect a head feature using the acquired original color image, acquire 3D feature points for matching in combination with depth information of the image;
the feature extraction module is configured to extract texture features of the original color image based on the obtained 3D feature points, and process the texture features by adopting a sparse expression method to obtain texture feature descriptors;
the matching module is configured to match the extracted texture feature descriptors with a target area in the acquired current image to find a best matching position, wherein the matched position is a new position of the head of the person, and the posture of the head is obtained based on the new position.
Here, the modules in this embodiment are in one-to-one correspondence with the steps in embodiment 1, and the implementation process is the same, which is not described here.
Example 3
The present embodiment provides an electronic device including a memory and a processor, and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps in the multi-person head pose estimation method of fusing texture information and depth information described in embodiment 1.
Example 4
The present embodiment provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the steps in the multi-person head pose estimation method of fusing texture information and depth information described in embodiment 1.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. The multi-person head posture estimation method integrating texture information and depth information is characterized by comprising the following steps of:
detecting head features by using the obtained original color image, and obtaining 3D feature points for matching by combining the depth information of the image;
extracting texture features of an original color image based on the obtained 3D feature points, and processing by adopting a sparse expression method to obtain texture feature descriptors;
matching the extracted texture feature descriptors with a target area in the acquired current image to find the best matched position, wherein the matched position is a new position of the head of the person, and the posture of the head is obtained based on the new position;
the method for sparsely expressing the extracted texture features to obtain texture feature descriptors comprises the following steps: acquiring a historical image, taking the head of a person as a target, and extracting each variant and feature of the target to model as a dictionary; obtaining a new detection current frame image, extracting texture features to obtain texture information of a new detection target, expressing the new detection target based on a constructed dictionary, performing sparse coding to minimize reconstruction errors, and updating to obtain an optimal weight coefficient; calculating texture feature descriptors based on the updated weights and the dictionary; continuously updating the dictionary according to the calculated historical frame texture feature descriptors;
the matching method adopts a method for calculating similarity measurement, and finds the best matching position, and the process is as follows: adopting a nearest neighbor algorithm to measure the similarity of the candidate target in the current frame image and the candidate target in the historical frame image; adding the depth information into the similarity measurement, and converting the depth information into a feature similarity value by adopting an exponential function to obtain a similarity value based on the depth information; modeling a newly detected target through sparse expression to obtain a target vector, taking a reconstruction error as similarity of texture information, and converting the position and depth difference into a value of characteristic similarity by adopting an exponential function to obtain color texture information similarity; and obtaining texture characteristics of the candidate target in the currently detected image, fusing the similarity based on the depth information and the similarity of the color texture information to obtain a similarity value of the candidate target and the marked target, and determining the best matching position in the new detected image based on the similarity.
2. The multi-person head pose estimation method fusing texture information and depth information according to claim 1, wherein the method of detecting head features based on color images and extracting head feature points comprises:
performing foreground extraction on the color image based on the depth information;
performing head detection in the image after foreground extraction to obtain a depth map containing a head region selection frame;
and filtering the obtained head region depth map to obtain a filtered head region depth map, and extracting head characteristic points.
3. The multi-person head pose estimation method of claim 2, wherein the method for obtaining 3D feature points by combining image depth information comprises the following steps:
aligning the filtered head region depth map with the original color image so that the depth map and the color image are in the same coordinate system;
acquiring head characteristic points extracted from the filtered head region depth map;
and calculating the coordinates of the head characteristic points in the three-dimensional space by utilizing the aligned depth images and combining the pixel coordinates of the head characteristic points in the color images to obtain the space coordinates of the 3D characteristic points.
4. The multi-person head pose estimation method fusing texture information and depth information according to claim 1, wherein: and performing parameter smoothing processing on the head 3D characteristic points by using a Kalman filter.
5. The multi-person head pose estimation method of fusing texture information and depth information according to claim 1, wherein the method of extracting texture features of an original color image based on the obtained 3D feature points comprises:
acquiring an original color image;
determining a head region of interest according to the 3D feature points;
cutting out the head region from the original color image according to the determined head region to obtain an image region only containing the head;
and extracting texture features of the head image area obtained by clipping.
6. The multi-person head pose estimation system integrating texture information and depth information is characterized by comprising:
a 3D feature point recognition module configured to detect a head feature using the acquired original color image, acquire 3D feature points for matching in combination with depth information of the image;
the feature extraction module is configured to extract texture features of the original color image based on the obtained 3D feature points, and process the texture features by adopting a sparse expression method to obtain texture feature descriptors;
the matching module is configured to match the extracted texture feature descriptors with a target area in the acquired current image so as to find a best matched position, wherein the matched position is a new position of the head of the person, and the posture of the head is obtained based on the new position;
the method for sparsely expressing the extracted texture features to obtain texture feature descriptors comprises the following steps: acquiring a historical image, taking the head of a person as a target, and extracting each variant and feature of the target to model as a dictionary; obtaining a new detection current frame image, extracting texture features to obtain texture information of a new detection target, expressing the new detection target based on a constructed dictionary, performing sparse coding to minimize reconstruction errors, and updating to obtain an optimal weight coefficient; calculating texture feature descriptors based on the updated weights and the dictionary; continuously updating the dictionary according to the calculated historical frame texture feature descriptors;
the matching method adopts a method for calculating similarity measurement, and finds the best matching position, and the process is as follows: adopting a nearest neighbor algorithm to measure the similarity of the candidate target in the current frame image and the candidate target in the historical frame image; adding the depth information into the similarity measurement, and converting the depth information into a feature similarity value by adopting an exponential function to obtain a similarity value based on the depth information; modeling a newly detected target through sparse expression to obtain a target vector, taking a reconstruction error as similarity of texture information, and converting the position and depth difference into a value of characteristic similarity by adopting an exponential function to obtain color texture information similarity; and obtaining texture characteristics of the candidate target in the currently detected image, fusing the similarity based on the depth information and the similarity of the color texture information to obtain a similarity value of the candidate target and the marked target, and determining the best matching position in the new detected image based on the similarity.
7. An electronic device comprising a memory and a processor, and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the steps in the multi-person head pose estimation method of fusing texture information and depth information according to any of claims 1-5.
8. A computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps in the multi-person head pose estimation method of fusing texture information and depth information according to any of claims 1-5.
CN202310959879.1A 2023-08-02 2023-08-02 Multi-person head pose estimation method and system integrating texture information and depth information Active CN116704587B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310959879.1A CN116704587B (en) 2023-08-02 2023-08-02 Multi-person head pose estimation method and system integrating texture information and depth information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310959879.1A CN116704587B (en) 2023-08-02 2023-08-02 Multi-person head pose estimation method and system integrating texture information and depth information

Publications (2)

Publication Number Publication Date
CN116704587A CN116704587A (en) 2023-09-05
CN116704587B true CN116704587B (en) 2023-10-20

Family

ID=87839517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310959879.1A Active CN116704587B (en) 2023-08-02 2023-08-02 Multi-person head pose estimation method and system integrating texture information and depth information

Country Status (1)

Country Link
CN (1) CN116704587B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737235A (en) * 2012-06-28 2012-10-17 中国科学院自动化研究所 Head posture estimation method based on depth information and color image
US9720934B1 (en) * 2014-03-13 2017-08-01 A9.Com, Inc. Object recognition of feature-sparse or texture-limited subject matter
CN108154550A (en) * 2017-11-29 2018-06-12 深圳奥比中光科技有限公司 Face real-time three-dimensional method for reconstructing based on RGBD cameras
CN110415169A (en) * 2018-04-28 2019-11-05 深圳先进技术研究院 A kind of depth map super resolution ratio reconstruction method, system and electronic equipment
CN110930309A (en) * 2019-11-20 2020-03-27 武汉工程大学 Face super-resolution method and device based on multi-view texture learning
CN111709962A (en) * 2020-05-28 2020-09-25 淮阴工学院 Image contour and texture feature decomposition method based on anisotropic L0 gradient sparse expression and DCT (discrete cosine transformation)
CN113284176A (en) * 2021-06-04 2021-08-20 深圳积木易搭科技技术有限公司 Online matching optimization method combining geometry and texture and three-dimensional scanning system
CN116402978A (en) * 2023-03-28 2023-07-07 山东浪潮科学研究院有限公司 Dense three-dimensional reconstruction method based on binocular vision structural characteristics

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9529426B2 (en) * 2012-02-08 2016-12-27 Microsoft Technology Licensing, Llc Head pose tracking using a depth camera

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737235A (en) * 2012-06-28 2012-10-17 中国科学院自动化研究所 Head posture estimation method based on depth information and color image
US9720934B1 (en) * 2014-03-13 2017-08-01 A9.Com, Inc. Object recognition of feature-sparse or texture-limited subject matter
CN108154550A (en) * 2017-11-29 2018-06-12 深圳奥比中光科技有限公司 Face real-time three-dimensional method for reconstructing based on RGBD cameras
CN110415169A (en) * 2018-04-28 2019-11-05 深圳先进技术研究院 A kind of depth map super resolution ratio reconstruction method, system and electronic equipment
CN110930309A (en) * 2019-11-20 2020-03-27 武汉工程大学 Face super-resolution method and device based on multi-view texture learning
CN111709962A (en) * 2020-05-28 2020-09-25 淮阴工学院 Image contour and texture feature decomposition method based on anisotropic L0 gradient sparse expression and DCT (discrete cosine transformation)
CN113284176A (en) * 2021-06-04 2021-08-20 深圳积木易搭科技技术有限公司 Online matching optimization method combining geometry and texture and three-dimensional scanning system
CN116402978A (en) * 2023-03-28 2023-07-07 山东浪潮科学研究院有限公司 Dense three-dimensional reconstruction method based on binocular vision structural characteristics

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Accurate and fast 3D head pose estimation with noisy RGBD images; Chenglong Li et al.; Multimedia Tools and Applications; full text *
Li Yanfei; Research on the Improvement of Matching Methods for Weak-Texture Face Images; Computer Simulation; 2016, (Issue 07), full text. *
Research and Application of Head Pose Estimation Methods in Natural Environments; Liu Yuanyuan; China Doctoral Dissertations Full-text Database; full text *

Also Published As

Publication number Publication date
CN116704587A (en) 2023-09-05

Similar Documents

Publication Publication Date Title
US11727661B2 (en) Method and system for determining at least one property related to at least part of a real environment
Liu et al. Dense face alignment
US9330307B2 (en) Learning based estimation of hand and finger pose
US9189855B2 (en) Three dimensional close interactions
Dame et al. Dense reconstruction using 3D object shape priors
Zhu et al. Discriminative 3D morphable model fitting
US9008439B2 (en) Image processing method and system
Chang et al. Tracking Multiple People Under Occlusion Using Multiple Cameras.
Grewe et al. Fully automated and highly accurate dense correspondence for facial surfaces
KR20120138627A (en) A face tracking method and device
CN111160291B (en) Human eye detection method based on depth information and CNN
CN110895683B (en) Kinect-based single-viewpoint gesture and posture recognition method
CN112101208A (en) Feature series fusion gesture recognition method and device for elderly people
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
CN113393503A (en) Classification-driven shape prior deformation category-level object 6D pose estimation method
CN112200056A (en) Face living body detection method and device, electronic equipment and storage medium
Waheed et al. Exploiting Human Pose and Scene Information for Interaction Detection
Rekik et al. 3d face pose tracking using low quality depth cameras
CN113436251B (en) Pose estimation system and method based on improved YOLO6D algorithm
Kanaujia et al. Part segmentation of visual hull for 3d human pose estimation
CN111709269B (en) Human hand segmentation method and device based on two-dimensional joint information in depth image
Ahdid et al. A survey on facial feature points detection techniques and approaches
CN116704587B (en) Multi-person head pose estimation method and system integrating texture information and depth information
CN108694348B (en) Tracking registration method and device based on natural features
Xu et al. MultiView-based hand posture recognition method based on point cloud

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant