WO2021098441A1 - Hand posture estimation method, apparatus, device, and computer storage medium - Google Patents

Hand posture estimation method, apparatus, device, and computer storage medium

Info

Publication number
WO2021098441A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map, convolution, network, hand, processing
Application number
PCT/CN2020/122933
Other languages
English (en)
French (fr)
Inventor
周扬
Original Assignee
OPPO Guangdong Mobile Communications Co., Ltd.
Application filed by OPPO Guangdong Mobile Communications Co., Ltd.
Publication of WO2021098441A1
Priority to US17/747,837 (published as US20220358326A1)


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06F 18/253: Pattern recognition; Analysing; Fusion techniques of extracted features
    • G06T 7/73: Image analysis; Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/75: Determining position or orientation of objects or cameras using feature-based methods involving models
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 40/107: Human or animal bodies; Body parts, e.g. hands; Static hand or arm
    • G06V 40/11: Hand-related biometrics; Hand pose recognition
    • G06T 2207/20084: Indexing scheme for image analysis or image enhancement; Special algorithmic details; Artificial neural networks [ANN]
    • G06T 2207/30196: Subject of image; Human being; Person
    • G06V 2201/12: Indexing scheme relating to image or video recognition or understanding; Acquisition of 3D measurements of objects

Definitions

  • the embodiments of the present application relate to the field of image recognition technology, and in particular to a hand posture estimation method, apparatus, device, and computer storage medium.
  • the present application provides a hand posture estimation method, device, equipment, and computer storage medium, which can improve the accuracy of hand posture estimation and obtain high-precision hand posture estimation results.
  • an embodiment of the present application provides a hand posture estimation method, and the method includes:
  • acquiring an initial feature map corresponding to the hand region in an image to be processed; performing feature fusion processing on the initial feature map to obtain a fused feature map; the feature fusion processing is used to fuse features around multiple key points; the multiple key points represent the skeleton key nodes of the hand region;
  • performing deconvolution processing on the fused feature map to obtain a target feature map; the deconvolution processing is used to adjust the resolution of the fused feature map;
  • obtaining coordinate information of the multiple key points based on the target feature map to determine the posture estimation result of the hand region in the image to be processed.
  • a device for estimating hand posture includes:
  • An acquiring unit for acquiring an initial feature map corresponding to the hand region in the image to be processed
  • the first processing unit is configured to perform feature fusion processing on the initial feature map to obtain a fused feature map; the feature fusion processing is used to fuse features around multiple key points; the multiple key points represent the skeleton key nodes of the hand region;
  • the second processing unit is configured to perform deconvolution processing on the fused feature map to obtain a target feature map; the deconvolution processing is used to adjust the resolution of the fused feature map;
  • the posture estimation unit is configured to obtain the coordinate information of the multiple key points based on the target feature map to determine the posture estimation result of the hand region in the image to be processed.
  • in a third aspect, an electronic device includes a memory and a processor; wherein,
  • the memory is used to store executable instructions that can be run on the processor
  • the processor is configured to execute the method described in the first aspect when the executable instruction is executed.
  • an embodiment of the present application provides a computer storage medium that stores a hand posture estimation program, and when the hand posture estimation program is executed by a processor, the method as described in the first aspect is implemented.
  • the embodiments of the present application provide a hand posture estimation method, device, equipment, and computer storage medium.
  • coordinate information of the multiple key points is obtained based on the target feature map to determine the posture estimation result of the hand region in the image. In this way, by performing feature fusion and deconvolution processing on the feature map of the hand region in the image to be processed, the information of different key points can be fully integrated, the accuracy of hand posture estimation can be improved, and high-precision hand posture estimation results can be obtained.
  • Fig. 1 is a schematic diagram of an image taken by a TOF camera provided by a related technical solution
  • FIG. 2 is a schematic diagram of a detection result of a hand bounding box provided by related technical solutions
  • Fig. 3 is a schematic diagram of the key point positions of a hand skeleton provided by related technical solutions
  • FIG. 4 is a schematic diagram of a two-dimensional hand posture estimation result provided by related technical solutions
  • FIG. 5 is a schematic flow diagram of a traditional hand gesture detection provided by related technical solutions
  • Fig. 6 is a schematic diagram of a RoIAlign bilinear interpolation effect provided by related technical solutions
  • FIG. 7 is a schematic diagram of a non-maximum suppression (NMS) effect provided by related technical solutions.
  • Fig. 8 is a schematic diagram of a structure of union and intersection provided by related technical solutions.
  • FIG. 9 is a schematic flowchart of a hand posture estimation method provided by an embodiment of this application.
  • FIG. 10 is a schematic diagram of a network architecture of an exemplary hand posture estimation method provided by an embodiment of the application.
  • FIG. 11 is a schematic diagram of a corresponding architecture of a hand posture estimation head provided by an embodiment of this application.
  • FIG. 12 is a schematic diagram of a structure composition of a first convolutional network provided by an embodiment of this application.
  • FIG. 13 is an architecture diagram of a masked area convolutional neural network provided by an embodiment of this application.
  • FIG. 14 is a schematic diagram of a network architecture of another hand posture estimation method provided by an embodiment of the application.
  • FIG. 15 is an exemplary hourglass network characteristic diagram during hand posture estimation provided by an embodiment of this application.
  • FIG. 16 is a schematic diagram of the composition structure of a hand posture estimation device provided by an embodiment of the application.
  • FIG. 17 is a schematic diagram of a specific hardware structure of an electronic device provided by an embodiment of the application.
  • hand pose estimation has the ability to accurately estimate the three-dimensional coordinate position of the human hand skeleton node from the image, so as to accurately and effectively reconstruct the human hand movement from the image, so it is widely used in immersive virtual reality and augmented reality.
  • Fields such as robot control and sign language recognition have become a key issue in the field of computer vision and human-computer interaction. With the rise and development of commercial and low-cost depth cameras, hand gesture recognition has made great progress.
  • the depth camera includes structured light, laser scanning, and Time of Flight (TOF) cameras.
  • the depth camera refers to the TOF camera.
  • the so-called three-dimensional (Three Dimension, 3D) imaging of the time-of-flight method continuously sends light pulses to the target object, then uses a sensor to receive the light returned from the target object, and obtains the distance to the target object by detecting the flight (round-trip) time of the light pulses.
  • a TOF camera is a range imaging camera system. It uses the time-of-flight method to measure the round-trip time of an artificial light signal provided by a laser or light-emitting diodes (LEDs), and thereby calculates the distance between the TOF camera and each point on the subject in the image.
  • the TOF camera outputs an image with a size of H ⁇ W, and each pixel value on this two-dimensional (Two Dimension, 2D) image can represent the depth value of the pixel; where the pixel value ranges from 0 to 3000 mm (millimeter, mm).
  • Fig. 1 shows a schematic diagram of an image taken by a TOF camera provided by a related technical solution.
  • the image captured by the TOF camera may be referred to as a depth image.
  • the input of the hand detection is the depth image shown in Figure 1
  • the output can be the probability of the presence of the hand in the depth map (for example, a number between 0 and 1, where a larger value indicates greater confidence that a hand is present) and a hand bounding box, that is, a bounding box that indicates the position and size of the hand.
  • the bounding box can be expressed as (xmin, ymin, xmax, ymax), where (xmin, ymin) represents the position of the upper left corner of the bounding box, and (xmax, ymax) is the lower right corner of the bounding box.
  • FIG. 2 is a schematic diagram of the detection result of the hand bounding box in the related art.
  • the black rectangular box is the hand bounding box
  • the score of the hand bounding box is as high as 0.999884, that is, the confidence of the existence of the hand in the depth map is as high as 0.999884.
  • the two-dimensional hand posture estimation can be continued based on the target detection result.
  • the output is the two-dimensional key point position of the hand skeleton.
  • Fig. 3 is an example diagram of the key point positions of the hand skeleton in the related art. As shown in Fig. 3, the hand skeleton is provided with 20 key points, and the position of each key point is shown as 0-19 in Fig. 3. Among them, the position of each key point can be represented by 2D coordinate information (x, y). After determining the coordinate information of these 20 key points, a two-dimensional hand pose estimation result can be generated. Exemplarily, based on the two-dimensional coordinates of the key points of the hand shown in FIG. 3, FIG. 4 is the result of the two-dimensional hand posture estimation in the related art.
  • the three-dimensional hand posture estimation can also be continued based on the target detection result.
  • the output is the three-dimensional key point position of the hand skeleton, and an example of the key point position of the hand skeleton is still shown in FIG. 3.
  • each key point position can use 3D coordinate information (x, y, z), and z is the coordinate information in the depth direction.
  • a typical hand posture detection process can include a hand detection part and a hand posture estimation part.
  • the hand detection part can include a backbone feature extractor and a bounding box detection head module
  • the hand posture estimation part can include a backbone Feature extractor and pose estimation head module.
  • Figure 5 is a schematic diagram of the flow of hand gesture detection in the related technology.
  • first, hand detection can be performed, that is, detection processing is performed using the backbone feature extractor and the bounding box detection head module included in the hand detection part; at this time, the bounding box boundary can also be adjusted, and the adjusted bounding box is then used to crop the image; hand posture estimation is then performed on the cropped image, that is, posture estimation processing is performed using the backbone feature extractor and the posture estimation head module included in the hand posture estimation part.
  • the positions of the output bounding box can be adjusted to the centroid of the pixels in the bounding box, and the size of the bounding box can be slightly enlarged to include all the hand pixels. Further, the adjusted bounding box is used to crop the original depth image, and the cropped image is input into the task of hand pose estimation.
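  • As an illustration of the bounding-box adjustment just described, the following Python sketch recenters a detected box on the centroid of valid depth pixels, enlarges it slightly, and crops the depth image. The function name, the enlargement factor, and the 1-3000 mm validity range are illustrative assumptions and are not values prescribed by this application.

```python
import numpy as np

def adjust_and_crop(depth, box, enlarge=1.2, min_depth=1, max_depth=3000):
    """Illustrative sketch: re-center the detected box on the centroid of valid
    depth pixels, enlarge it slightly, and crop the depth image for the
    pose-estimation stage."""
    xmin, ymin, xmax, ymax = [int(v) for v in box]
    roi = depth[ymin:ymax, xmin:xmax]
    ys, xs = np.nonzero((roi >= min_depth) & (roi <= max_depth))
    if len(xs) == 0:                          # no hand pixels found, keep the box center
        cx, cy = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
    else:                                     # centroid of hand pixels inside the box
        cx, cy = xmin + xs.mean(), ymin + ys.mean()
    w = (xmax - xmin) * enlarge / 2.0
    h = (ymax - ymin) * enlarge / 2.0
    x0, x1 = int(max(cx - w, 0)), int(min(cx + w, depth.shape[1]))
    y0, y1 = int(max(cy - h, 0)), int(min(cy + h, depth.shape[0]))
    return depth[y0:y1, x0:x1]
```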
  • in this pipeline, the backbone feature extractor is used twice to extract image features, which causes repeated calculations and increases the amount of computation.
  • RoIAlign is a regional feature aggregation method, which can well solve the problem of regional mismatch caused by two quantizations in the ROI Pooling operation.
  • replacing ROI Pooling with RoIAlign can improve the accuracy of the detection results.
  • the RoIAlign layer eliminates the strict quantization of RoI Pooling, and correctly aligns the extracted features with the input.
  • FIG. 6 is a schematic diagram of the RoIAlign bilinear interpolation effect in the related technology. As shown in Figure 6, the dotted grid represents a feature map, the bold solid line represents an RoI (for example, 2×2 regions), and each region has 4 sampling points. RoIAlign can use adjacent grid points on the feature map to perform bilinear interpolation calculations to obtain the value of each sampling point.
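  • A minimal sketch of the bilinear interpolation used by RoIAlign for a single sampling point is shown below; it assumes a single-channel feature map stored as a NumPy array, and the function name is illustrative rather than the implementation of any particular framework.

```python
import numpy as np

def bilinear_sample(feature, x, y):
    """Sample a 2-D feature map at a fractional location (x, y) using the four
    neighbouring grid points, as RoIAlign does for each sampling point."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, feature.shape[1] - 1)
    y1 = min(y0 + 1, feature.shape[0] - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feature[y0, x0] + wx * feature[y0, x1]
    bottom = (1 - wx) * feature[y1, x0] + wx * feature[y1, x1]
    return (1 - wy) * top + wy * bottom
```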
  • FIG. 7 is a schematic diagram of the effect of non-maximum suppression (NMS) in the related art. As shown in Fig. 7, the purpose of NMS is to retain only one window (the bold gray rectangular frame in Fig. 7).
  • FIG. 8 is a schematic diagram of union and intersection in the related art.
  • two bounding boxes are given, denoted by BB1 and BB2, respectively.
  • the black area in (a) is the intersection of BB1 and BB2, denoted by BB1 ∩ BB2, that is, the overlapping area of BB1 and BB2;
  • the black area in (b) is the union of BB1 and BB2, denoted by BB1 ∪ BB2, that is, the combined area of BB1 and BB2.
  • the calculation formula of the intersection over union (denoted by IoU) is as follows: IoU = (BB1 ∩ BB2) / (BB1 ∪ BB2), that is, the ratio of the intersection area to the union area of the two bounding boxes.
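  • The IoU ratio above can be computed as in the following sketch, where boxes follow the (xmin, ymin, xmax, ymax) convention described earlier; the helper name is illustrative.

```python
def iou(bb1, bb2):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix0, iy0 = max(bb1[0], bb2[0]), max(bb1[1], bb2[1])
    ix1, iy1 = min(bb1[2], bb2[2]), min(bb1[3], bb2[3])
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)          # overlapping area
    area1 = (bb1[2] - bb1[0]) * (bb1[3] - bb1[1])
    area2 = (bb2[2] - bb2[0]) * (bb2[3] - bb2[1])
    union = area1 + area2 - inter                          # combined area
    return inter / union if union > 0 else 0.0
```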
  • the coordinates of each pixel in the image can be represented by the XYZ coordinate system or the UVD coordinate system.
  • (x, y, z) are the pixel coordinates in the XYZ coordinate system
  • (u, v, d) are the pixel coordinates in the UVD coordinate system.
  • Cx and Cy represent the principal point coordinates, which should ideally be located at the center of the image;
  • fx and fy are the focal lengths in the x direction and y direction, respectively.
  • the conversion relationship between the UVD coordinate system and the XYZ coordinate system is shown in the following formulas: x = (u - Cx) · d / fx, y = (v - Cy) · d / fy, z = d.
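  • A sketch of this conversion under the pinhole model defined by Cx, Cy, fx and fy is given below; the function names are illustrative, and continuous (sub-pixel) coordinates are assumed.

```python
def uvd_to_xyz(u, v, d, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with depth d into camera-space XYZ."""
    x = (u - cx) * d / fx
    y = (v - cy) * d / fy
    return x, y, d

def xyz_to_uvd(x, y, z, fx, fy, cx, cy):
    """Project a camera-space point back to (u, v, d)."""
    return fx * x / z + cx, fy * y / z + cy, z
```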
  • the hand pose estimation scheme either uses the fully connected layer to return the key point coordinates of the hand, or uses a classification-based method to predict the spatial position of the key point.
  • the regression-based method is to calculate the hand posture in a global manner, that is, to use all the information of the key point feature to predict each key point; in contrast, the classification-based method is biased towards a more local way, namely Gradually obtain the characteristics of adjacent key points. Due to the unconstrained global and local posture changes of the hand, frequent occlusion, local self-similarity, and high definition, it is a challenging task to estimate the hand posture more accurately.
  • embodiments of the present application provide a hand posture estimation method, device, equipment, and computer storage medium. Specifically, after acquiring the feature map of the hand region, the hand posture estimation device can perform feature fusion processing on the image feature map, and acquire deeper image information on the feature map of the hand region to fully integrate the different key points of the hand region. Point information, and then perform deconvolution processing on the feature map after feature fusion to enlarge the resolution of the image to further realize hand posture estimation; in this way, the hand posture estimation device of this application can fully fuse information of different key points , Thereby improving the efficiency and accuracy of hand posture estimation.
  • An embodiment of the present application provides a hand posture estimation method, which can be applied to a hand posture estimation device or an electronic device integrated with the device.
  • the electronic device may be a smart phone, a tablet computer, a notebook computer, a palmtop computer, a personal digital assistant (PDA), a navigation device, a wearable device, a desktop computer, etc., which are not limited in the embodiment of the present application.
  • PDA personal digital assistant
  • FIG. 9 is a schematic flowchart of a hand posture estimation method provided by an embodiment of the application. As shown in FIG. 9, the hand posture estimation method provided by the embodiment of the present application may include the following steps:
  • Step 910 Obtain an initial feature map corresponding to the hand region in the image to be processed.
  • the hand posture estimation device may first obtain the initial feature map corresponding to the hand region in the image to be processed.
  • the hand posture estimation device may pre-acquire a to-be-processed image containing the hand, and detect and recognize the image content of the to-be-processed image, and determine the hand in the to-be-processed image Region, and further feature extraction of the hand region in the image to be processed by a specific feature extraction method to obtain the initial feature map corresponding to the hand region in the image to be processed.
  • the initial feature map can be a feature map obtained by shallow feature extraction, such as a RoIAlign feature map or an RoI Pooling feature map.
  • the initial feature map is the RoIAlign feature map; that is, after the hand pose estimation device obtains the hand region of the image to be processed, it uses the RoIAlign feature extractor constructed based on the RoIAlign feature extraction method corresponding to Figure 6 to perform shallow feature extraction on the hand region of the image to be processed, including the outline and edge positions of the hand, so as to obtain the RoIAlign feature map corresponding to the hand target object.
  • the hand posture estimation device may further extract deeper image information based on the RoIAlign feature map.
  • Step 920 Perform feature fusion processing on the initial feature map to obtain a fused feature map; the feature fusion processing is used to fuse features around multiple key points; the multiple key points represent the skeleton key nodes of the hand region.
  • for the hand, there can be multiple skeleton key nodes, that is, key points.
  • the hand includes at least 20 key points; in the embodiment of the present application, 20 key points are used, and the specific positions of these key points on the hand are shown in Figure 3.
  • the hand pose estimation device may further perform in-depth image feature extraction on the initial feature map based on the initial feature map, and fuse the features around multiple key points in the hand region to obtain Feature map after fusion.
  • the feature fusion processing is a step-by-step abstraction process for the initial feature map.
  • the hand pose estimation device can perform multi-layer convolution processing on the initial feature map to extract the feature information in the initial feature map step by step. During the convolution processing of the initial feature map, the detailed information (i.e., local features) of the key points in the hand region and the context information (i.e., global features) of the key points can be fused layer by layer, realizing deep-level feature extraction of the initial feature map.
  • Step 930 Perform deconvolution processing on the fused feature map to obtain a target feature map; the deconvolution processing is used to adjust the resolution of the fused feature map.
  • after the fused feature map is obtained, the fused feature map can be further subjected to deconvolution processing to adjust its resolution. Specifically, through the deconvolution processing, the resolution of the fused feature map is increased, so that hand posture prediction can be performed based on a higher-resolution image, improving the accuracy of the hand posture estimation.
  • Step 940 Obtain coordinate information of multiple key points based on the target feature map to determine the posture estimation result of the hand region in the image to be processed.
  • the target feature map is a feature map obtained after feature fusion processing and deconvolution processing; that is to say, the target feature map can fully integrate the local detail information and the context information of each key point in the hand region of the original image to be processed. Therefore, estimating the hand posture based on the target feature map can improve the accuracy of the hand posture estimation.
  • the initial feature map corresponding to the hand region in the image to be processed is first acquired; feature fusion processing is performed on the initial feature map to obtain the fused feature map; the feature fusion processing is used to compare multiple The features around the key points are fused; the multiple key points represent the skeleton key nodes of the hand region; deconvolution processing is performed on the fused feature map to obtain the target feature map; the deconvolution processing Used to adjust the resolution of the fused feature map; obtain the coordinate information of the multiple key points based on the target feature map to determine the posture estimation result of the hand region in the image to be processed; in this way, Performing feature fusion and deconvolution processing on the feature map of the hand region in the image to be processed can fully fuse information of different key points, improve the accuracy of hand posture estimation, and obtain high-precision hand posture estimation results.
  • step 910 obtains the initial feature map corresponding to the hand region in the image to be processed, including:
  • RoIAlign feature extraction is performed on the hand region in the image to be processed to obtain the initial feature map.
  • the hand posture estimation device may first obtain the image to be processed containing the hand (for example, Figure 1), and then recognize and locate the hand region of the image to be processed through the bounding box detection method, that is, determine the position and size corresponding to the hand region, so as to obtain an image containing only the hand region (for example, Figure 2); further, the hand pose estimation device uses the RoIAlign feature extractor constructed based on the RoIAlign feature extraction method corresponding to Figure 6 to perform shallow feature extraction on the hand region, including the outline and edge positions of the hand, so as to obtain the RoIAlign feature map corresponding to the hand target object.
  • the network architecture mainly includes a hand region detection module (101) and a hand pose estimation module (102).
  • the hand region detection module 101 includes: a backbone feature extractor 1011, a bounding box detection head 1012, a bounding box selection head 1013, and a RoIAlign feature extractor 1014.
  • the hand posture estimation module 102 includes a hand posture estimation head 1021. Specifically, the backbone feature extractor 1011 and the bounding box detection head 1012 can be used to detect the hand area of the image to be processed that contains the hand area; then the bounding box selection head 1013 is used to perform the bounding box selection process, and the bounding box with the highest confidence is selected.
  • the RoIAlign feature extractor 1014 can perform RoIAlign feature extraction on the hand region image with the highest confidence to obtain the RoIAlign feature map (i.e., the initial feature map). Finally, hand posture estimation is performed on the RoIAlign feature map through the hand posture estimation head 1021.
  • the hand posture estimation head 1021 may further extract deeper image information based on the RoIAlign feature map to obtain The target feature map, and the hand pose estimation result is obtained based on the target feature map.
  • step 920 performs feature fusion processing on the initial feature map to obtain the fused feature map, which can be implemented through the following steps:
  • Step 9201 through the first convolution network, perform a first convolution process on the initial feature map to obtain a first feature map; the first convolution process is used to extract local detail information of the multiple key points.
  • the initial feature map may have a specific resolution and size.
  • the size of the initial feature map may be 8 ⁇ 8 ⁇ 256.
  • the hand posture estimation device may directly input the initial feature map into the first convolutional network to perform the first convolution processing.
  • the first convolutional network may be composed of two or more sub-convolutional networks with input and output superimposed on each other, and the sub-convolutional network may be a deep convolutional neural network.
  • the features of key points can be abstracted step by step to obtain the final first feature map.
  • the obtained first feature map has the same size as the initial feature map.
  • the resolution of the initial feature map is relatively high, so the detailed information of the key points in the initial feature map is richer. Through the first convolution processing, the local detail information of the key points in the initial feature map can be extracted to obtain the first feature map. That is to say, the first feature map is a feature map that incorporates the local detail information of the key points.
  • Step 9202 Perform a first down-sampling process on the first feature map to obtain a first down-sampling feature map.
  • the first down-sampling process may be 2 times down-sampling or 4 times down-sampling.
  • the embodiments of the application are not limited here.
  • the first down-sampling process can be implemented by the convolutional network, that is, the first feature map is input into the convolutional network for convolution processing, so as to reduce the resolution of the first feature map.
  • the size of the first feature map is 8x8x128, and a convolutional network with a convolution kernel of 3x3x128 (step size 2) is used to process the first feature map to obtain a 4x4x128 first down-sampling feature map.
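  • For example, the first down-sampling step can be realized with a stride-2 convolution, as in the following PyTorch sketch; PyTorch is used here only for illustration and is not prescribed by this application.

```python
import torch
import torch.nn as nn

# A 3x3 convolution with stride 2 halves the spatial resolution while keeping
# 128 channels, e.g. mapping an 8x8x128 feature map to 4x4x128
# (PyTorch uses channel-first layout).
downsample = nn.Conv2d(in_channels=128, out_channels=128,
                       kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 128, 8, 8)       # a batch with one 8x8x128 feature map
print(downsample(x).shape)           # torch.Size([1, 128, 4, 4])
```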
  • Step 9203 Perform a second convolution process on the first down-sampled feature map through the second convolution network to obtain a second feature map; the second convolution process is used to extract context information of the multiple key points.
  • the hand pose estimation device may input the first down-sampled feature map into the second convolutional network for convolution processing, and extract the context information of the multiple key points to obtain Obtain the second feature map.
  • the first down-sampling feature map is the feature map after the resolution is reduced.
  • after the resolution is reduced, the context information of the key points dominates the image, and performing the second convolution processing on this reduced-resolution feature map can fully fuse the context information of the key points in the first down-sampling feature map.
  • the second feature map is a feature map that combines local detail information of key points and context information.
  • Step 9204 Perform a second down-sampling process on the second feature map to obtain a fused feature map.
  • the down-sampling process is continued on the second feature map to fully fuse the global information of the key points in the second feature map to obtain the fused feature map.
  • the second down-sampling process and the first down-sampling process in step 9202 may be the same process or different processes, which is not limited in the embodiment of the present application.
  • the feature map after the fusion can contain the local detailed information of the key points, and can contain the context-related global information of the key points. That is to say, the fusion feature map can fully integrate the information of different key points, and then the hand posture estimation based on the fusion feature map can improve the accuracy of hand posture estimation and obtain high-precision hand posture estimation results.
  • step 9201 may perform the following processing on the initial feature map before performing the first convolution processing on the initial feature map through the first convolution network:
  • Dimensionality reduction processing is performed on the initial feature map to obtain a dimensionality reduction feature map; the dimensionality reduction processing is used to reduce the number of channels of the initial feature map;
  • the feature first convolution process is performed on the feature map after dimensionality reduction to obtain the first feature map, so that the first feature map is used to determine the fused feature map.
  • the initial feature map in the process of fusing the initial feature map, can be reduced in dimensionality to reduce the number of channels of the initial feature map.
  • the first convolution process is performed on the reduced feature map.
  • the first down-sampling process, the second convolution process, and the second down-sampling process to obtain the fused feature map.
  • the amount of calculation in the processing process can be reduced.
  • the hand posture estimation head may specifically include a feature fusion module 111 (which may also be referred to as a down-sampling module) and a deconvolution module 112 (which may also be referred to as an up-sampling module).
  • the feature fusion module 111 may include: a first convolutional network 1111, a first downsampling network 1112, a second convolutional network 1113, and a second downsampling network 1114.
  • the process of performing feature fusion processing on the initial feature map in step 920 can be applied to the network architecture shown in FIG. 11; specifically, after the initial feature map is obtained, the first convolutional network 1111 is used to perform the first convolution processing on the initial feature map.
  • the first convolutional network may include N sub-convolutional networks; where N is an integer greater than 1.
  • the output of the first subconvolutional network is connected to the input of the second subconvolutional network
  • the output of the second subconvolutional network is connected to the input of the third subconvolutional network
  • the N-1th subconvolutional network The output of the convolutional network is connected to the input of the Nth subconvolutional network.
  • step 9201 performs the first convolution processing on the initial feature map through the first convolutional network to obtain the first feature map, which can be implemented in the following manner:
  • the first convolution processing is performed on the initial feature map through the first sub-convolutional network, the first feature map is output, and the initial feature map and the output first feature map are weighted and summed to obtain the first weighted sum feature map;
  • the (i-1)-th weighted sum feature map is subjected to the i-th convolution processing through the i-th sub-convolutional network, the i-th feature map is output, and the (i-1)-th weighted sum feature map and the i-th feature map are weighted and summed to obtain the i-th weighted sum feature map; where i is an integer greater than 1 and less than or equal to N;
  • the hand posture estimation device first performs the first convolution processing on the initial feature map through the first sub-convolutional network and outputs the first feature map, and the initial feature map and the first feature map are weighted and summed to obtain the first weighted sum feature map; that is, through a skip connection, the input of the first sub-convolutional network is added to the output of the first sub-convolutional network to obtain the first weighted sum feature map, so that the obtained first weighted sum feature map is consistent with the size of the input initial feature map. In this way, the initial feature map is recognized and abstracted through the first sub-convolutional network, and the feature information between the pixels around each key point is merged to obtain the first weighted sum feature map.
  • the second sub-convolutional network performs further processing on the first weighted sum feature map; specifically, the second sub-convolutional network performs the second convolution processing on the first weighted sum feature map, and outputs the second feature map.
  • similarly, through a skip connection, the input of the second sub-convolutional network (i.e., the first weighted sum feature map) and the output of the second sub-convolutional network (i.e., the second feature map) are added to obtain the second weighted sum feature map.
  • the first weighted sum feature map output by the first subconvolutional network is further understood and abstracted through the second subconvolutional network, and the feature information of the surrounding pixels of each key point can be further integrated.
  • the third subconvolutional network continues to process the second weighted sum feature map to obtain the third weighted sum feature map, until the Nth subconvolutional network processes the N-1th weighted sum feature map to obtain the first N-weighted sum feature map, and use the N-th weighted sum feature map as the final first feature map.
  • multi-layer convolution processing is performed on the initial feature map through a multi-level sub-convolutional network, and the feature information around the key points can be integrated step by step at the current resolution.
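  • A minimal sketch of one such sub-convolutional network with a skip connection is given below; the ReLU activation and the use of PyTorch are assumptions made for illustration, since only the convolution and the weighted-sum (addition) of input and output are specified here.

```python
import torch.nn as nn

class SubConvBlock(nn.Module):
    """One sub-convolutional network with a skip connection: the block's input
    is added to its output, so the weighted-sum feature map keeps the input size."""
    def __init__(self, channels=128):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return x + self.relu(self.conv(x))   # skip connection (input + output)

# Stacking N such blocks abstracts the key-point features step by step
# at the current resolution, as described for the first convolutional network.
first_conv_net = nn.Sequential(SubConvBlock(128), SubConvBlock(128))
```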
  • step 9203 the second convolution process is performed on the first down-sampled feature map through the second convolutional network to obtain the second feature map, which can be implemented in the following manner:
  • Step 9203a Perform a second convolution process on the first down-sampled feature map through the second convolutional network, and output a second convolutional feature map;
  • Step 9203b Perform weighting and processing on the second convolution feature map and the first down-sampling feature map to obtain a second feature map.
  • the second convolution processing can be performed on the first down-sampled feature map through the second convolutional network, and the context information of the key points in the first down-sampled feature map (i.e., global feature information) can be further integrated.
  • a skip connection can then be used: the input of the second convolutional network (i.e., the first down-sampling feature map) and the output of the second convolutional network (i.e., the second convolution feature map) are added to obtain the second feature map.
  • step 930 performs deconvolution processing on the fused feature map to obtain the target feature map, which can be implemented through the following steps:
  • Step 9301 Perform a first up-sampling process on the fused feature map to obtain a first up-sampling feature map
  • Step 9302 through the third convolution network, perform third convolution processing on the first up-sampled feature map to obtain a third feature map;
  • Step 9303 Perform a second up-sampling process on the third feature map to obtain a second up-sampling feature map
  • Step 9304 Perform a fourth convolution process on the second up-sampled feature map through the fourth convolution network to obtain a fourth feature map
  • Step 9305 Perform a third up-sampling process on the fourth feature map to obtain the target feature map.
  • the resolution of the fused feature map is low, and the resolution of the fused feature map needs to be restored, so that hand pose estimation can be performed on a high-resolution feature map, improving the accuracy of the hand pose estimation.
  • the process of restoring the resolution of the fusion feature map can correspond to the process of performing feature fusion on the original feature map.
  • the first upsampling process corresponds to the second downsampling process. For example, if a feature map with a size of 4x4x128 undergoes the second downsampling process, the resulting feature map has a size of 2x2x256; then the first upsampling process The feature map of 2x2x256 can be mapped to 4x4x128.
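  • This correspondence can be illustrated with a transposed convolution, as in the sketch below; the 2x2 kernel with stride 2 as the up-sampling operator and the use of PyTorch are assumptions made for illustration, the point being only that the up-sampling mirrors the down-sampling it corresponds to.

```python
import torch
import torch.nn as nn

# A 2x2 transposed convolution with stride 2 doubles the spatial resolution,
# e.g. mapping a fused 2x2x256 feature map back to 4x4x128.
upsample = nn.ConvTranspose2d(in_channels=256, out_channels=128,
                              kernel_size=2, stride=2)

x = torch.randn(1, 256, 2, 2)
print(upsample(x).shape)             # torch.Size([1, 128, 4, 4])
```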
  • the third convolutional network corresponds to the second convolutional network, that is, the convolution kernel used by the third convolutional network is the same as that of the second convolutional network; likewise, the second up-sampling corresponds to the first down-sampling.
  • the deconvolution module 112 may include a first up-sampling network 1121, a third convolutional network 1122, a second up-sampling network 1123, a fourth convolutional network 1124, and a third up-sampling network 1125.
  • Step 930 Perform deconvolution processing on the fused feature map to obtain the target feature map, which can be applied to the network architecture shown in FIG. 11. Specifically, the first upsampling process is performed on the fused feature map through the first upsampling network 1121 ; Among them, the first up-sampling network 1121 corresponds to the second down-sampling network 1114.
  • a third convolution process is performed on the first up-sampled feature map through the third convolutional network 1122 to obtain a third feature map, where the third convolutional network 1122 corresponds to the second convolutional network 1113.
  • the second up-sampling process is performed on the third feature map through the second up-sampling network 1123 to obtain the second up-sampling feature map; wherein, the second up-sampling network 1123 corresponds to the first down-sampling network 1112.
  • through the fourth convolutional network 1124, the second up-sampled feature map is subjected to the fourth convolution processing to obtain the fourth feature map; wherein, the fourth convolutional network 1124 corresponds to the first convolutional network 1111.
  • the third up-sampling process is performed on the fourth feature map through the third up-sampling network 1125 to obtain the target feature map.
  • step 9302 performs third convolution processing on the first up-sampled feature map through the third convolutional network to obtain the third feature map, which can be implemented in the following manner:
  • Step 9302a Perform a third convolution process on the first up-sampled feature map through the third convolutional network, and output a third convolutional feature map;
  • Step 9302b Perform weighting and processing on the third convolution feature map and the second feature map to obtain a third feature map.
  • the third convolution process may be performed on the first up-sampling feature map through the third convolution network, and the third convolution feature map is output.
  • since the third convolutional network corresponds to the second convolutional network, the hand pose estimation device can perform weighted sum processing on the second feature map obtained by the second convolutional network and the third convolution feature map output by the third convolutional network to obtain the third feature map. In this way, it can be ensured that the size of the obtained third feature map is consistent with the size of the second feature map, so that the next processing step can be performed.
  • step 9304 performs fourth convolution processing on the second up-sampled feature map through the fourth convolutional network to obtain the fourth feature map, including:
  • Step 9304a Perform a fourth convolution process on the second up-sampling feature map through the fourth convolution network, and output a fourth convolution feature map;
  • Step 9304b Perform weighting and processing on the fourth convolution feature map and the first feature map to obtain a fourth feature map.
  • specifically, the fourth convolution processing may be performed on the second up-sampling feature map through the fourth convolutional network, and the fourth convolution feature map is output.
  • the fourth convolutional network corresponds to the first convolutional network. Therefore, in the embodiment provided in this application, the hand pose estimation device can perform weighted sum processing on the first feature map obtained by the first convolutional network and the fourth convolution feature map output by the fourth convolutional network to obtain the fourth feature map. In this way, it can be ensured that the size of the obtained fourth feature map is consistent with the size of the first feature map for the next processing step.
  • referring to the architecture diagram of the masked region convolutional neural network (Mask R-CNN) shown in Fig. 13, Mask R-CNN extends R-CNN by adding a mask segmentation head to each RoI, in parallel with the existing branches for classification and bounding box regression.
  • the mask segmentation head can be understood as a small Fully Convolutional Neural Network (FCN) applied to each RoI to estimate and predict in a pixel-to-pixel manner.
  • FCN Fully Convolutional Neural Network
  • Mask R-CNN is easy to implement and train, builds on the Faster R-CNN framework, and facilitates extensive and flexible architecture design.
  • the mask segmentation head only adds a small computational overhead, thus realizing a fast recognition system.
  • the hand posture estimation method provided in the embodiment of the present application can operate directly on the RoIAlign features extracted by the RoIAlign feature extractor.
  • the embodiment of the present application can reuse the RoIAlign feature map calculated from the hand region detection task, instead of starting from the original image. Therefore, the hand posture estimation method provided by the embodiments of the present application has a small amount of calculation, and can be deployed on a mobile device to estimate the user's hand posture. In addition, the hand posture estimation method provided by the embodiment of the present application adopts an hourglass network structure, which can fully integrate information of different key points, thereby realizing more accurate hand posture estimation.
  • the network architecture diagram includes a down-sampling block 141 (ie, a feature fusion module) and an up-sampling block 142 (ie, a deconvolution module).
  • the down-sampling block 141 includes Conv1 to Conv5; the up-sampling block 142 includes Conv5 to Conv10.
  • the hand posture estimation method includes the following steps:
  • Step a Perform convolution processing on the RoIAlign feature map 1501 (the initial feature map) with a size of 8x8x256 through Conv1 with a convolution kernel of 3x3x128 (that is, the convolution layer corresponding to the dimensionality reduction processing) to obtain a dimensionality-reduced feature map 1502 with a size of 8x8x128.
  • the convolution kernel (3x3x128) of Conv1 is preset, and the number of channels of the RoIAlign feature map 1501 can be reduced to 128 through Conv1, so that a dimensionality-reduced feature map 1502 with a size of 8x8x128 is obtained. In this way, the RoIAlign feature map 1501 is reduced in dimension before further processing, thereby reducing the amount of calculation in the hand posture estimation process.
  • Step b Perform convolution processing on the dimensionality-reduced feature map 1502 with a size of 8x8x128 through two end-to-end Conv2 layers (corresponding to the first convolutional network above), and add the feature map input to each Conv2 to the feature map output by that Conv2, to obtain the first feature map 1503 with the same size (i.e., 8x8x128) as the dimensionality-reduced feature map.
  • the feature map after dimensionality reduction can be processed twice through Conv2 to obtain the first feature map 1503 with the same size.
  • Step c Down-sampling the first feature map 1503 with a size of 8x8x128 through Conv3 with a convolution kernel of 3x3x128 and a step size of 2 (that is, the first downsampling network mentioned above), to obtain a size of The first down-sampling feature map 1504 of 4x4x128.
  • Step d Perform convolution on the first down-sampling feature map 1504 with a size of 4x4x128 through Conv4 with a convolution kernel of 3x3x128 (that is, the second convolutional network mentioned above), and add the first down-sampling feature map 1504 input to Conv4 to the feature map output by Conv4, to obtain a second feature map 1505 with the same size as the first down-sampling feature map, that is, the size of the second feature map 1505 is 4x4x128.
  • Step e Downsample the second feature map 1505 through Conv5 with a convolution kernel of 3x3x256 and a step size of 2 (ie, the second downsampling network mentioned above) to obtain a fused feature map 1506 with a size of 2x2x256 .
  • Step f Up-sampling the fused feature map 1506 through Conv6 with a convolution kernel of 2x2x128 (that is, the first up-sampling network mentioned above) to obtain a first up-sampling feature map 1507 with a size of 4x4x128.
  • Step g Process the first up-sampling feature map 1507 through Conv7 with a 3x3x128 convolution kernel (i.e., the third convolutional network mentioned above), and add the second feature map 1505 obtained from Conv4 to the feature map output by Conv7, to obtain a third feature map 1508 with a size of 4x4x128.
  • Step h Perform an upsampling process on the third feature map 1508 through Conv8 with a convolution kernel of 2x2x128 (ie, the second upsampling network mentioned above) to obtain a second upsampling feature map 1509 with a size of 8x8x128.
  • Step i Process the second up-sampling feature map 1509 through Conv9 with a 3x3x128 convolution kernel (that is, the fourth convolutional network mentioned above), and add the first feature map 1503 obtained from Conv2 to the feature map output by Conv9, to obtain the fourth feature map 1510 with a size of 8x8x128.
  • Step j Process the fourth feature map 1510 through Conv10 with a convolution kernel of 2x2x128 (that is, the third upsampling network mentioned above) to obtain a target feature map 1511 with a size of 16x16x128.
  • the target feature map 1511 is a feature map after feature fusion processing and deconvolution processing. It can be seen that the target feature map 1511 can fully integrate the detailed information and context information of each key point in the hand region of the original image to be processed. Then, the estimation of the hand posture based on the target feature map 1511 can improve the accuracy of the hand posture estimation.
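  • Putting steps a to j together, the following PyTorch sketch reproduces the Conv1 to Conv10 shapes described above. It is a minimal sketch: activation functions and the final key-point regression layer are omitted because they are not specified here, and the class name, the absence of non-linearities, and the use of PyTorch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HourglassPoseHead(nn.Module):
    """Minimal sketch of the Conv1-Conv10 hourglass head described in steps a-j."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(256, 128, 3, padding=1)            # 8x8x256 -> 8x8x128
        self.conv2a = nn.Conv2d(128, 128, 3, padding=1)            # skip-added
        self.conv2b = nn.Conv2d(128, 128, 3, padding=1)            # skip-added
        self.conv3 = nn.Conv2d(128, 128, 3, stride=2, padding=1)  # 8x8 -> 4x4
        self.conv4 = nn.Conv2d(128, 128, 3, padding=1)            # skip-added
        self.conv5 = nn.Conv2d(128, 256, 3, stride=2, padding=1)  # 4x4 -> 2x2x256
        self.conv6 = nn.ConvTranspose2d(256, 128, 2, stride=2)    # 2x2 -> 4x4
        self.conv7 = nn.Conv2d(128, 128, 3, padding=1)            # + second feature map
        self.conv8 = nn.ConvTranspose2d(128, 128, 2, stride=2)    # 4x4 -> 8x8
        self.conv9 = nn.Conv2d(128, 128, 3, padding=1)            # + first feature map
        self.conv10 = nn.ConvTranspose2d(128, 128, 2, stride=2)   # 8x8 -> 16x16

    def forward(self, roi_feat):                                  # roi_feat: (B, 256, 8, 8)
        x = self.conv1(roi_feat)                                   # dimensionality reduction
        x = x + self.conv2a(x)
        f1 = x + self.conv2b(x)                                    # first feature map (8x8x128)
        d1 = self.conv3(f1)                                        # first down-sampling (4x4x128)
        f2 = d1 + self.conv4(d1)                                   # second feature map (4x4x128)
        fused = self.conv5(f2)                                     # fused feature map (2x2x256)
        u1 = self.conv6(fused)                                     # first up-sampling (4x4x128)
        f3 = f2 + self.conv7(u1)                                   # third feature map (4x4x128)
        u2 = self.conv8(f3)                                        # second up-sampling (8x8x128)
        f4 = f1 + self.conv9(u2)                                   # fourth feature map (8x8x128)
        return self.conv10(f4)                                     # target feature map (16x16x128)

target = HourglassPoseHead()(torch.randn(1, 256, 8, 8))
print(target.shape)                                                # torch.Size([1, 128, 16, 16])
```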
  • FIG. 16 shows a schematic diagram of the composition structure of a hand posture estimation device 160 provided in an embodiment of the present application.
  • the hand posture estimation device 160 may include:
  • the acquiring unit 1601 is configured to acquire the initial feature map corresponding to the hand region in the image to be processed
  • the first processing unit 1602 is configured to perform feature fusion processing on the initial feature map to obtain a fused feature map; the feature fusion processing is used to fuse features around multiple key points; the multiple key points represent Key nodes of the skeleton of the hand region;
  • the second processing unit 1603 is configured to perform deconvolution processing on the fused feature map to obtain a target feature map; the deconvolution processing is used to adjust the resolution of the fused feature map;
  • the posture estimation unit 1604 is configured to obtain coordinate information of the multiple key points based on the target feature map to determine the posture estimation result of the hand region in the image to be processed.
  • the initial feature map is a region of interest alignment (RoIAlign) feature map.
  • the acquiring unit 1601 is specifically configured to perform recognition processing on the image content of the image to be processed, determine the hand region in the image to be processed, and perform RoIAlign feature extraction on the hand region in the image to be processed to obtain the initial feature map;
  • the first processing unit 1602 is specifically configured to perform a first convolution process on the initial feature map through a first convolution network to obtain a first feature map; the first convolution process is used for Extract the local detail information of the multiple key points; perform a first down-sampling process on the first feature map to obtain a first down-sampled feature map; use a second convolutional network to sample the first down-sampling feature map Perform a second convolution process to obtain a second feature map; the second convolution process is used to extract the context information of the multiple key points; perform a second down-sampling process on the second feature map to obtain the Feature map after fusion.
  • the first processing unit 1602 is further configured to perform dimensionality reduction processing on the initial feature map to obtain a dimensionality reduction feature map; the dimensionality reduction processing is used to reduce the number of channels of the initial feature map; Through the first convolutional network, the feature first convolution process is performed on the feature map after dimensionality reduction to obtain the first feature map, so as to determine the fused feature map by using the first feature map.
  • the first convolutional network includes N sub-convolutional networks, where N is an integer greater than 1; the first processing unit 1602 is further configured to: perform, through the i-th sub-convolutional network, the i-th convolution processing on the (i-1)-th weighted sum feature map, output the i-th feature map, and perform weighted sum processing on the (i-1)-th weighted sum feature map and the i-th feature map to obtain the i-th weighted sum feature map; continue to perform the (i+1)-th convolution processing on the i-th weighted sum feature map through the (i+1)-th sub-convolutional network, until the N-th sub-convolutional network performs the N-th convolution processing on the (N-1)-th weighted sum feature map and outputs the N-th feature map; and perform weighted sum processing on the (N-1)-th weighted sum feature map and the N-th feature map to obtain the N-th weighted sum feature map, which is used as the first feature map.
  • the first processing unit 1602 is configured to perform a second convolution process on the first down-sampled feature map through the second convolutional network, and output a second convolution feature map;
  • the second convolution feature map and the first down-sampling feature map are weighted and processed to obtain the second feature map.
  • the second processing unit 1603 is configured to perform a first up-sampling process on the fused feature map to obtain a first up-sampling feature map; use a third convolutional network to perform a first up-sampling process on the first up-sampling feature map; Perform the third convolution process on the up-sampling feature map to obtain the third feature map; perform the second up-sampling process on the third feature map to obtain the second up-sampling feature map; use the fourth convolution network to perform the second up-sampling feature map. 2. Up-sampling the feature map by performing a fourth convolution process to obtain a fourth feature map; performing a third up-sampling process on the fourth feature map to obtain the target feature map.
  • the second processing unit 1603 is configured to perform a third convolution process on the first upsampled feature map through the third convolutional network, and output a third convolution feature map;
  • the third convolution feature map and the second feature map are weighted and processed to obtain the third feature map.
  • the second processing unit 1603 is further configured to perform a fourth convolution process on the second up-sampling feature map through the fourth convolution network, and output a fourth convolution feature map;
  • the fourth convolution feature map and the first feature map are weighted and processed to obtain the fourth feature map.
  • a "unit” may be a part of a circuit, a part of a processor, a part of a program or software, etc., of course, it may also be a module, or it may also be non-modular.
  • the various components in this embodiment may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be realized in the form of hardware or software function module.
  • the integrated unit is implemented in the form of a software function module and is not sold or used as an independent product, it can be stored in a computer readable storage medium.
  • the technical solution of this embodiment, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the method described in this embodiment.
  • the aforementioned storage media include: U disk, mobile hard disk, read only memory (Read Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disk and other media that can store program codes.
  • this embodiment provides a computer storage medium that stores a hand posture estimation program, and the hand posture estimation program is executed by at least one processor to implement the procedure described in any one of the preceding embodiments. Method steps.
  • FIG. 17 shows a schematic diagram of a specific hardware structure of the electronic device 170 provided by an embodiment of the present application.
  • the electronic device 170 may include: a communication interface 1701, a memory 1702, and a processor 1703; various components are coupled together through a bus system 1704.
  • the bus system 1704 is used to implement connection and communication between these components.
  • in addition to a data bus, the bus system 1704 also includes a power bus, a control bus, and a status signal bus.
  • however, for clarity of illustration, the various buses are all labeled as the bus system 1704 in FIG. 17. Among them:
  • the communication interface 1701 is configured to receive and send signals in the process of sending and receiving information with other external network elements;
  • the memory 1702 is configured to store executable instructions that can be run on the processor 1703;
  • the processor 1703 is configured to, when running the executable instructions, execute the following:
  • acquiring an initial feature map corresponding to a hand region in an image to be processed;
  • performing feature fusion processing on the initial feature map to obtain a fused feature map; the feature fusion processing is used to fuse features around multiple key points, and the multiple key points represent skeleton key nodes of the hand region;
  • performing deconvolution processing on the fused feature map to obtain a target feature map; the deconvolution processing is used to adjust the resolution of the fused feature map;
  • obtaining, based on the target feature map, the coordinate information of the multiple key points to determine a posture estimation result of the hand region in the image to be processed.
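The text above does not fix how the coordinate information is decoded from the target feature map. Purely as an assumed illustration, one common decoding is to read each channel of the target feature map as a heat map for one key point and take its arg-max location; the function below is a sketch under that assumption.

```python
import torch

def decode_keypoints(target_feature_map: torch.Tensor) -> torch.Tensor:
    """Assumed decoding step: interpret each of the K channels as a heat map for
    one key point and return its (x, y) location as the arg-max position.

    target_feature_map: tensor of shape (K, H, W).
    Returns a tensor of shape (K, 2) holding (x, y) pixel coordinates.
    """
    k, h, w = target_feature_map.shape
    flat_idx = target_feature_map.view(k, -1).argmax(dim=1)
    ys = torch.div(flat_idx, w, rounding_mode="floor").float()
    xs = (flat_idx % w).float()
    return torch.stack([xs, ys], dim=1)

# e.g. 20 hand key points decoded from an assumed 16x16 target feature map
coords = decode_keypoints(torch.randn(20, 16, 16))
print(coords.shape)  # torch.Size([20, 2])
```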
  • the memory 1702 in the embodiment of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory can be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or a flash memory.
  • the volatile memory may be random access memory (Random Access Memory, RAM), which is used as an external cache.
  • by way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM).
  • the processor 1703 may be an integrated circuit chip with signal processing capabilities. In the implementation process, the steps of the foregoing method may be completed by an integrated logic circuit of hardware in the processor 1703 or instructions in the form of software.
  • the aforementioned processor 1703 may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application can be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 1702, and the processor 1703 reads the information in the memory 1702, and completes the steps of the foregoing method in combination with its hardware.
  • the embodiments described herein can be implemented by hardware, software, firmware, middleware, microcode, or a combination thereof.
  • for hardware implementation, the processing unit can be implemented in one or more application specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-programmable gate arrays (FPGA), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units for performing the functions described in this application, or combinations thereof.
  • the technology described herein can be implemented by modules (such as procedures, functions, etc.) that perform the functions described herein.
  • the software codes can be stored in the memory and executed by the processor.
  • the memory can be implemented in the processor or external to the processor.
  • the processor 1703 is further configured to execute the steps of the method described in any one of the foregoing embodiments when the computer program is running.
  • in the embodiments of the present application, an initial feature map corresponding to a hand region in an image to be processed is first acquired; feature fusion processing is performed on the initial feature map to obtain a fused feature map, where the feature fusion processing is used to fuse features around multiple key points and the multiple key points represent skeleton key nodes of the hand region; deconvolution processing is performed on the fused feature map to obtain a target feature map, where the deconvolution processing is used to adjust the resolution of the fused feature map; and, based on the target feature map, coordinate information of the multiple key points is obtained to determine a posture estimation result of the hand region in the image to be processed. In this way, by performing feature fusion and deconvolution processing on the feature map of the hand region in the image to be processed, the information of different key points can be fully fused, the accuracy of hand posture estimation is improved, and a high-precision hand posture estimation result is obtained.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose a hand posture estimation method, apparatus, device, and computer storage medium. The method includes: determining a classification logic map corresponding to each of multiple key points, where the multiple key points represent skeleton key nodes of a target hand and a first key point is any one of the multiple key points; determining, based on the classification logic map corresponding to the first key point, triplet information of each grid cell in a preset classification map; determining coordinate information of the first key point according to the triplet information of each grid cell in the preset classification map; and obtaining a posture estimation result of the target hand after the coordinate information of each of the multiple key points has been determined.

Description

手部姿态估计方法、装置、设备以及计算机存储介质
相关申请的交叉引用
本申请基于申请号为62/938,190、申请日为2019年11月20日、申请名称为“COMPACT SEGMENTATION HEAD FOR EFFICIENT 3D HAND POSE ESTIMATION FOR A MOBILE TOF CAMERA”的在先美国临时专利申请提出,并要求该在先美国临时专利申请的优先权,该在先美国临时专利申请的全部内容在此以全文引入的方式引入本申请作为参考。
技术领域
本申请实施例涉及图像识别技术领域,尤其涉及一种手部姿态估计方法、装置、设备以及计算机存储介质。
背景技术
从图像中准确有效地重建人手运动的能力,在沉浸式虚拟现实和增强现实、机器人控制和手语识别等领域有着令人兴奋的新应用。近年来,尤其是随着消费者深度相机的到来,重建手部的运动也取得了很大的进步。然而,由于不受约束的全局和局部姿态变化、频繁的遮挡、局部自相似性和高清晰度等特点,导致手部姿态估计结果并不准确。
发明内容
本申请提供一种手部姿态估计方法、装置、设备以及计算机存储介质,可以提高手部姿态估计的准确度,用以得到高精度的手部姿态估计结果。
本申请的技术方案可以如下实现:
第一方面,本申请实施例提供了一种手部姿态估计方法,所述方法包括:
获取待处理图像中手部区域对应的初始特征图;
对所述初始特征图进行特征融合处理,得到融合后特征图;所述特征融合处理用于对多个关键点周围的特征进行融合;所述多个关键点表示所述手部区域的骨架关键节点;
对所述融合后特征图进行反卷积处理,得到目标特征图;所述反卷积处理用于调整所述融合后特征图的分辨率;
基于所述目标特征图,得到所述多个关键点的坐标信息,以确定所述待处理图像中手部区域的姿态估计结果。
第二方面,提供一种手部姿态估计装置,所述装置包括:
获取单元,用于获取待处理图像中手部区域对应的初始特征图;
第一处理单元,对所述初始特征图进行特征融合处理,得到融合后特征图;所述特征融合处理用于对多个关键点周围的特征进行融合;所述多个关键点表示所述手部区域的骨架关键节点;
第二处理单元,用于对所述融合后特征图进行反卷积处理,得到目标特征图;所述反卷积处理用于调整所述融合后特征图的分辨率;
姿态估计单元,用于基于所述目标特征图,得到所述多个关键点的坐标信息,以确定所述待处理图像中手部区域的姿态估计结果。
第三方面,提供一种电子设备,所述电子设备包括存储器和处理器;其中,
所述存储器,用于存储能够在所述处理器上运行的可执行指令;
所述处理器,用于在运行所述可执行指令时,执行如第一方面所述的方法。
第四方面,本申请实施例提供了一种计算机存储介质,所述计算机存储介质存储有手部姿态估计程序,所述手部姿态估计程序被处理器执行时实现如第一方面所述的方法。
本申请实施例提供了一种手部姿态估计方法、装置、设备以及计算机存储介质,首先获取待处理图像中手部区域对应的初始特征图;对所述初始特征图进行特征融合处理,得到融合后特征图;所述特征融合处理用于对多个关键点周围的特征进行融合;所述多个关键点表示所述手部区域的骨架关键节点;对所述融合后特征图进行反卷积处理,得到目标特征图;所述反卷积处理用于调整所述融合后特征图的分辨率;基于所述目标特征图,得到所述多个关键点的坐标信息,以确定所述待处理图像中手部区域的姿态估计结果,这样,通过对待处理图像中手部区域的特征图进行特征融合和反卷积处理,能够充分融合不同关键点的信息,提高手部姿态估计的准确性,得到高精度的手部姿态估计结果。
附图说明
图1为相关技术方案提供的一种TOF相机所拍摄的图像示意图;
图2为相关技术方案提供的一种手部包围盒的检测结果示意图;
图3为相关技术方案提供的一种手部骨架的关键点位置示意图;
图4为相关技术方案提供的一种二维手部的姿态估计结果示意图;
图5为相关技术方案提供的一种传统手部姿态检测的流程示意图;
图6为相关技术方案提供的一种RoIAlign双线性差值效果示意图;
图7为相关技术方案提供的一种非最大值抑制的结构示意图;
图8为相关技术方案提供的一种并集与交集的结构示意图;
图9为本申请实施例提供的一种手部姿态估计方法的流程示意图;
图10为本申请实施例提供的一种示例性的手部姿态估计方法的网络架构示意图;
图11为本申请实施例提供的一种手部姿态估计头对应的架构示意图;
图12为本申请实施例提供的一种第一卷积网络结构组成示意图;
图13为本申请实施例提供的一种掩膜区域卷积神经网络架构图;
图14为本申请实施例提供的另一种手部姿态估计方法的网络架构示意图;
图15为本申请实施例提供的一种示例性的手部姿态估计期间的沙漏网络特征图;
图16为本申请实施例提供的一种手部姿态估计装置的组成结构示意图;
图17为本申请实施例提供的一种电子设备的具体硬件结构示意图。
具体实施方式
为了能够更加详尽地了解本申请实施例的特点与技术内容,下面结合附图对本申请实施例的实现进行详细阐述,所附附图仅供参考说明之用,并非用来限定本申请实施例。
需要说明的是,本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、 方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其他步骤或单元。
实际应用中,由于手部姿估计别具有能够从图像中准确估计出人手骨架节点的三维坐标位置,以从图像中准确有效地重建人手运动的能力,因此广泛应用于沉浸式虚拟现实和增强现实、机器人控制以及手语识别等领域,成为了计算机视觉和人机交互领域的一个关键问题,随着商用、低廉的深度相机的兴起和发展,手部姿态识别取得了巨大的进步。
尤其是近年来深度相机的成功研发,使得手部姿态估计的技术取得了更大的进步。其中,深度相机包括有结构光、激光扫描和飞行时间(Time of Fight,TOF)相机等几种,大多数情况下深度相机是指TOF相机。所谓飞行时间法的三维(Three Dimension,3D)成像,是通过向目标对象连续发送光脉冲,然后利用传感器接收从目标对象返回的光,通过探测光脉冲的飞行(往返)时间来得到与目标对象的距离。也就是说,TOF相机是一种距离成像相机系统,它利用飞行时间法,通过测量由激光或发光二极管(Light Emitting Diode,LED)提供的人工光信号的往返时间,从而计算出TOF相机和被摄对象之间在图像上每个点之间的距离。
具体的,TOF相机输出一个尺寸为H×W的图像,这个二维(Two Dimension,2D)图像上的每一个像素值可以代表该像素的深度值;其中,像素值的范围为0~3000毫米(millimeter,mm)。图1示出了相关技术方案提供的一种TOF相机所拍摄的图像示意图。在本申请实施例中,可以将TOF相机所拍摄的图像称为深度图像。
进一步的,可以对TOF相机所拍摄的深度图像进行目标检测,假定目标为人体的手部,那么手部检测的输入为图1所示的深度图像,然后输出可以为深度图中手部存在的概率(如0到1之间的数字,较大的值表示手部存在的置信度较大)和手部包围盒(即表示手的位置和大小的包围盒)。其中,包围盒(bounding box)即边界框。这里,包围盒可以表示为(xmin,ymin,xmax,ymax),其中,(xmin,ymin)表示包围盒的左上角位置,(xmax,ymax)是包围盒的右下角。
示例性的,图2为相关技术中手部包围盒的检测结果示意图。如图2所示,黑色矩形框即为手部包围盒,而且该手部包围盒的分数高达0.999884,即深度图中存在手部的置信度高达0.999884。
进一步的,一方面,可以基于目标检测结果继续进行二维手部姿态估计。具体的,输出为手部骨架的二维关键点位置。图3为相关技术中手部骨架的关键点位置示例图,如图3所示,手部骨架设置有20个关键点,每一个关键点位置如图3中的0~19所示。其中,每一个关键点位置可以用2D坐标信息(x,y)表示。在确定出这20个关键点的坐标信息之后,便可以生成二维手部姿态估计结果。示例性的,基于图3所示的手部关键点二维坐标,图4为相关技术中二维手部姿态估计结果。
另一方面,也可以基于目标检测结果继续进行三维手部姿态估计。具体的,输出为手部骨架的三维关键点位置,其手部骨架的关键点位置示例仍如图3所示。这其中,每一个关键点位置可以用3D坐标信息(x,y,z),z为深度方向的坐标信息。
目前,典型的手部姿态检测流程可以包括手部检测部分和手部姿态估计部分,其中,手部检测部分可以包括主干特征提取器和包围盒检测头部模块,手部姿态估计部分可以包括主干特征提取器和姿态估计头部模块。示例性地,图5为相关技术中手部姿态检测的流程示意图,如图5所示,在得到一张包括有手部的原始深度图像后,首先可以进行手部检测,即利用手部检测部分中所包括的主干特征提取器和包围盒检测头部模块进行检测处理;这时候还可以通过调整包围盒边界,然后利用调整后的包围盒进行图像裁剪, 并对裁剪后的图像进行手部姿态估计,即利用手部姿态估计部分中所包括的主干特征提取器和姿态估计头部模块进行姿态估计处理。
需要注意的是,相关技术中的手部姿态检测过程中,手部检测和手部姿势估计这两个部分的任务是完全分开的。为了连接这两个任务,可以将输出包围盒的位置调整为包围盒内像素的质心,并将包围盒的大小稍微放大以包含所有的手部像素。进一步的,调整后的包围盒用于裁剪原始深度图像,将裁剪后的图像输入到手部姿态估计这个任务中。需注意,图5所示的手部姿态检测过程中,两次使用主干特征提取器提取图像特征,将会存在重复计算的问题,增加了计算量。
为了解决上述计算量较大的问题,我们可以引入感兴趣区域匹配(RoIAlign)算法。具体的,RoIAlign是一种区域特征聚集方式,可以很好地解决ROI Pooling操作中两次量化造成的区域不匹配的问题。在检测任务中,将ROI Pooling替换为RoIAlign可以提升检测结果的准确性。也就是说,RoIAlign层消除了RoI Pooling的严格量化,将提取的特征与输入进行正确对齐。
可见,RoIAlign可以避免RoI边界或区域的任何量化,(如,使用x/16代替[x/16]。另外,还可以使用双线性插值的方式来计算每一个RoI区域中四个定期采样位置的输入特征的精确值,并汇总结果(使用最大值或平均值),图6为相关技术中RoIAlign双线性插值效果示意图,如图6所示,虚线网格表示一个特征图,加粗实线表示一个RoI(如2×2个区域),每个区域中有4个采样点。RoIAlign可以利用特征图上相邻网格点进行双线性插值计算,以得到每个采样点的值,且针对RoI、RoI区域或多个采样点,也不会对所涉及的任何坐标执行量化。这里,需要注意的是,只要不执行量化,检测结果对采样位置的准确度或采样点的数量均不敏感。
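As a concrete reference for the RoIAlign operation described above, torchvision ships an implementation of it; the sketch below shows pooling an 8×8 RoIAlign feature map without quantizing the RoI boundaries. The feature-map size, bounding box, and 1/16 spatial scale are assumed values for illustration only.

```python
import torch
from torchvision.ops import roi_align

# Assumed backbone feature map: batch 1, 256 channels, 40x60 spatial size.
features = torch.randn(1, 256, 40, 60)

# One hand bounding box as (batch_index, x1, y1, x2, y2), given in the
# coordinate frame of the original image (features assumed 16x down-scaled).
boxes = torch.tensor([[0.0, 120.0, 80.0, 280.0, 240.0]])

roi_feat = roi_align(
    features,
    boxes,
    output_size=(8, 8),      # 8x8 RoIAlign feature map
    spatial_scale=1.0 / 16,  # maps image coordinates onto the feature map
    sampling_ratio=2,        # 2x2 = 4 regularly placed samples per bin
    aligned=True,            # no quantization of the RoI boundaries
)
print(roi_feat.shape)  # torch.Size([1, 256, 8, 8])
```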
另外,在利用手部包围盒的目标检测方面,非最大值抑制(Non-Maximum Suppression,NMS)得到了广泛的应用,是边缘、角点或目标检测等多种检测方法的组成部分,能够克服原有检测检测算法对感兴趣概念的定位能力不完善,导致多个检测组出现在实际位置附近的缺陷。
具体的,在目标检测的背景下,基于滑动窗口的方法通常会产生多个得分较高的窗口,这些窗口靠近目标的正确位置,然而由于目标检测器的泛化能力、响应函数的光滑性和近窗视觉相关性的结果,导致这种相对密集的输出对于理解图像的内容通常是不令人满意的。也就是说,在这一步中,窗口假设的数量与图像中对象的实际数量不相关。因此,NMS的目标是每个检测组只保留一个窗口,对应于响应函数的精确局部最大值,理想情况下每个对象只获得一个检测。图7为相关技术中NMS的效果示意图,如图7所示,NMS的目的只是保留一个窗口(如图7中的加粗灰色矩形框)。
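As a concrete reference for the NMS step described above (keeping only the highest-scoring window per group of overlapping detections), the sketch below uses torchvision's implementation; the boxes and scores are assumed example values.

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[100., 80., 260., 240.],    # two heavily overlapping hand candidates
                      [105., 85., 265., 245.],
                      [400., 300., 500., 400.]])   # one separate candidate
scores = torch.tensor([0.999884, 0.91, 0.75])

keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) - one surviving box per detection group
```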
进一步的,图8为相关技术中并集与交集的示意图,如图8所示,给定了两个边界框,分别用BB1和BB2表示。这里,(a)中的黑色区域为BB1和BB2的交集,用BB1∩BB2表示,即BB1和BB2的重叠区域;(b)中的黑色区域为BB1和BB2的并集,用BB1∪BB2表示,即BB1和BB2的合并区域。具体地,交并比(用IoU表示)的计算公式如下所示,
  $$\mathrm{IoU}=\frac{\left|BB1\cap BB2\right|}{\left|BB1\cup BB2\right|}$$
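A minimal sketch of this IoU computation for two axis-aligned boxes in the (xmin, ymin, xmax, ymax) convention used earlier in this description; purely illustrative.

```python
def iou(bb1, bb2):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(bb1[0], bb2[0]), max(bb1[1], bb2[1])
    ix2, iy2 = min(bb1[2], bb2[2]), min(bb1[3], bb2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area1 = (bb1[2] - bb1[0]) * (bb1[3] - bb1[1])
    area2 = (bb2[2] - bb2[0]) * (bb2[3] - bb2[1])
    union = area1 + area2 - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = 0.1428...
```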
另外,手部姿态估计中,图像中每一个像素坐标可以用XYZ坐标系表示,也可以用UVD坐标系表示。这里,(x,y,z)是XYZ坐标系下的像素坐标,(u,v,d)是UVD坐标系下的像素坐标。假定C x和C y表示主点坐标,理想情况下应该位于图像的中心;f x和f y分别是x方向和y方向上的焦距,具体的,UVD坐标系与XYZ坐标系之间的换算关系如下式所示,
  $$x=\frac{(u-C_x)\,d}{f_x},\quad y=\frac{(v-C_y)\,d}{f_y},\quad z=d \qquad\Longleftrightarrow\qquad u=\frac{f_x\,x}{z}+C_x,\quad v=\frac{f_y\,y}{z}+C_y,\quad d=z$$
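Under the pinhole-camera reading of the quantities defined above (principal point (Cx, Cy), focal lengths fx, fy), the UVD/XYZ conversion can be sketched as below; the intrinsic values in the example are assumptions, not values taken from this document.

```python
def uvd_to_xyz(u, v, d, fx, fy, cx, cy):
    """Convert a (u, v, d) depth-image coordinate to camera-space (x, y, z)."""
    x = (u - cx) * d / fx
    y = (v - cy) * d / fy
    return x, y, d

def xyz_to_uvd(x, y, z, fx, fy, cx, cy):
    """Inverse mapping from camera-space (x, y, z) back to (u, v, d)."""
    u = fx * x / z + cx
    v = fy * y / z + cy
    return u, v, z

# Round trip with assumed intrinsics (fx = fy = 500, principal point at (320, 240)).
u, v, d = xyz_to_uvd(*uvd_to_xyz(350.0, 260.0, 800.0, 500, 500, 320, 240), 500, 500, 320, 240)
print(round(u, 3), round(v, 3), d)  # 350.0 260.0 800.0
```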
目前,手部姿态估计方案要么是利用全连接层回归手部的关键点坐标,要么是采用基于分类的方法预测关键点的空间位置。具体的,基于回归的方法是以全局的方式计算手部姿态,即利用关键点特征的所有信息来预测每个关键点;相比之下,基于分类的方法则偏向于更局部的方式,即逐步获取相邻关键点的特征。由于手部不受约束的全局和局部姿态变化、频繁的遮挡、局部自相似性以及高清晰度等特点,因此,如何更准确的进行手部姿态估计是一项具有挑战性的任务。
为了解决相关技术中手部姿态估计存在的问题,本申请实施例提供了一种手部姿态估计方法、装置、设备、及计算机存储介质。具体的,手部姿态估计装置在获取手部区域的特征图之后,可以通过对图像特征图进行特征融合处理,对手部区域的特征图进行更深层次图像信息获取,以充分融合手部区域不同关键点的信息,接着对特征融合后的特征图进行反卷积处理,来扩大图像的分辨率,以进一步实现手部姿态估计;如此,本申请手部姿态估计装置能够充分融合不同关键点的信息,从而提高手部姿态估计的效率和准确度。
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述。
本申请一实施例提供了一种手部姿态估计方法,该手部姿态估计的可以应用于手部姿态估计装置,或者集成有该装置的电子设备。其中,电子设备可以是智能手机、平板电脑、笔记本电脑、掌上电脑、个人数字助理(Personal Digital Assistant,PDA)、导航装置、可穿戴设备、台式计算机等,本申请实施例不作任何限定。
图9为本申请实施例提供的一种手部姿态估计方法的流程示意图。如图9所示,本申请实施例提供的手部姿态估计方法可以包括以下步骤:
步骤910、获取待处理图像中手部区域对应的初始特征图。
在本申请提供的实施例中,手部姿态估计装置可以先获取待处理图像中手部区域对应的初始特征图。
具体的,在本申请提供的实施例中,手部姿态估计装置可以预先获取包含手部的待处理图像,并对该待处理图像的图像内容进行检测和识别,确定待处理图像中的手部区域,进一步通过特定的特征提取方法对待处理图像中手部区域进行特征提取,得到待处理图像中手部区域对应的初始特征图,这里,初始特征图可以是进行浅层次特征提取的特征图,例如RoIAlign特征图,RoI Pooling特征图等。
在一可能的实现方式中,初始特征图为RoIAlign特征图;即手部姿态估计装置在获取到待处理图像的手部区域之后,使用如图6对应的基于RoIAlign特征提取方法构建的RoIAlign特征提取器,对待处理图像的手部区域进行浅层次的特征提取,包括手的大概轮廓、边缘位置,从而获取手部这一目标对象对应的RoIAlign特征图。
进一步地,在本申请的实施例中,手部姿态估计装置在获取待处理图像对应的RoIAlign特征图之后,可以进一步基于RoIAlign特征图进行更深层次图像信息地提取。
步骤920、对初始特征图进行特征融合处理,得到融合后特征图;特征融合处理用于对多个关键点周围的特征进行融合;多个关键点表示手部区域的骨架关键节点。
可以理解的是,针对人的手部来说,手部的骨架关键节点即关键点可以有多个,通常情况下手部至少包括有20个关键点,在本申请的实施例中,20个关键点在手部的具体位置如图3所示。
在本申请提供的实施例中,手部姿态估计装置可以在初始特征图的基础上,对初始特征图进一步进行深层次的图像特征提取,融合手部区域中多个关键点周围的特征,得到融合后的特征图。
可以理解的是,特征融合处理是对初始特征图进行一步步抽象的过程,本申请提供的实施例中,手部姿态估计装置可以对初始特征图进行多层的卷积处理,一步步提取初始特征图中的特征信息,这样,在对初始特征图进行卷积处理过程中,可以对逐层对手部区域关键点的细节信息(即局部特征),以及关键点的上下文信息(全局特征)进行融合,实现对初始特征图的深层次特征提取。
步骤930、对融合后特征图进行反卷积处理,得到目标特征图;反卷积处理用于调整融合后特征图的分辨率。
在本申请提供的实施例中,得到融合后特征图之后,可以进一步对融合后特征图进行反卷积处理,调整融合后特征图的分辨率。具体地,通过反卷积处理,提高融合后特征图的分辨率,以便基于较高分辨率的图像进行手部姿态预测,提高手部姿态估计的准确性。
步骤940、基于目标特征图,得到多个关键点的坐标信息,以确定待处理图像中手部区域的姿态估计结果。
可以理解的是,目标特征图是经过特征融合处理和反卷积处理后的特征图,也就是说,该目标特征图能够充分融合原始待处理图像的手部区域中各个关键点的局部细节信息以及上下文信息,那么,基于该目标特征图进行手部姿态的估计,能够提高手部姿态估计的准确性。
在本申请提供的实施例中,首先获取待处理图像中手部区域对应的初始特征图;对所述初始特征图进行特征融合处理,得到融合后特征图;所述特征融合处理用于对多个关键点周围的特征进行融合;所述多个关键点表示所述手部区域的骨架关键节点;对所述融合后特征图进行反卷积处理,得到目标特征图;所述反卷积处理用于调整所述融合后特征图的分辨率;基于所述目标特征图,得到所述多个关键点的坐标信息,以确定所述待处理图像中手部区域的姿态估计结果;这样,通过对待处理图像中手部区域的特征图进行特征融合和反卷积处理,能够充分融合不同关键点的信息,提高手部姿态估计的准确性,得到高精度的手部姿态估计结果。
在一种可能的实现方式中,步骤910获取待处理图像中手部区域对应的初始特征图,包括:
对待处理图像的图像内容进行识别处理,确定待处理图像中的手部区域;
对待处理图像中的手部区域进行RoIAlign特征提取,得到初始特征图。
具体地,手部姿态估计装置可以先获取包含手部的待处理图像(例如图1),然后通过包围盒检测方式对待处理图像的手部区域进行识别和定位,也就是确定手部区域对应的位置及大小,进而可以得到仅有手部区域的图像(例如图2);进一步地,手部姿态估计装置使用如图6对应的基于RoIAlign特征提取方法构建的RoIAlign特征提取器,来对上述手部区域进行浅层次的特征提取,包括手的大概轮廓、边缘位置,从而获取手部这一目标对象对应的RoIAlign特征图。
参考图10所示的一种示例性的手部姿态估计方法的网络架构示意图,如图10所示,该网络架构主要包括手部区域检测模块(101)和手部姿态估计模块(102)。其中,手部区域检测模块101包括:骨干特征提取器1011、包围盒检测头1012、包围盒选择头 1013、以及RoIAlign特征提取器1014。手部姿态估计模块102包括手部姿态估计头1021。具体的,可以先通过骨干特征提取器1011以及包围盒检测头1012,对包含手部区域的待处理图像进行手部区域检测;然后通过包围盒选择头1013进行包围盒选择处理,在挑选出置信度最高的包围盒,也就是置信度最高的手部区域图像之后,可以通过RoIAlign特征提取器1014对置信度最高的手部区域图像进行RoIAlign特征提取,从而获得RoIAlign特征图(即初始特征图),最后,通过手部姿态估计头1021对RoIAlign特征图进一步进行手部姿态估计。
进一步地,在本申请的实施例中,手部姿态估计头1021在获取到待处理图像的手部区域对应的RoIAlign特征图之后,可以进一步基于RoIAlign特征图进行更深层次图像信息地提取,以得到目标特征图,并基于目标特征图得到手部姿态估计结果。
在一种可能的实现方式中,步骤920对初始特征图进行特征融合处理,得到融合后特征图,可以通过以下步骤来实现:
步骤9201、通过第一卷积网络,对初始特征图进行第一卷积处理,得到第一特征图;第一卷积处理用于提取所述多个关键点的局部细节信息。
在本申请提供的实施例中,初始特征图可以具有特定的分辨率和大小,例如,初始特征图的大小可以是8×8×256。
这里,手部姿态估计装置可以将初始特征图直接输入至第一卷积网络中,进行第一卷积处理。
在本申请提供的实施例中,第一卷积网络可以由两个或者两个以上,且输入输出相互叠加的子卷积网络构成,子卷积网络可以是深度卷积神经网络。通过多层卷积处理,可以对关键点的特征进行一步一步的抽象,得到最终第一特征图。
值得注意的是,通过第一卷积网络对初始特征图进行处理后,得到的第一特征图与初始特征图的大小相同。
可以理解的是,初始特征图的分辨率较高,那么初始特征图中关键点的细节信息比较丰富,通过对初始特征图进行第一卷积处理,可以提取初始特征图中关键点的局部细节信息,得到第一特征图。也就是说,第一特征图是融合了关键点局部细节信息的特征图。
步骤9202、对第一特征图进行第一下采样处理,得到第一下采样特征图。
可以理解的是,通过第一下采样处理,可以将第一特征图的分辨进一步缩小。这里,第一下采样处理可以是2倍的下采样,也可以是4倍的下采样。本申请实施例这里不做限定。
在本申请提供的实施例中,可以通过卷积网络实现第一下采样处理,也就是,将第一特征图输入至卷积网络中进行卷积处理,来实现第一特征图分辨率的缩小。
例如,第一特征图的尺寸为8x8x128,采用卷积核为3x3x128(步长为2)的卷积网络对第一特征图进行处理,得到4x4x128的第一下采样特征图。
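A one-line check of the stride-2 convolution example above (3×3 kernel, 128 channels, stride 2), with padding of 1 assumed so that an 8×8 input maps exactly to a 4×4 output.

```python
import torch
import torch.nn as nn

downsample = nn.Conv2d(128, 128, kernel_size=3, stride=2, padding=1)
print(downsample(torch.randn(1, 128, 8, 8)).shape)  # torch.Size([1, 128, 4, 4])
```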
步骤9203、通过第二卷积网络,对第一下采样特征图进行第二卷积处理,得到第二特征图;第二卷积处理用于提取所述多个关键点的上下文信息。
这里,手部姿态估计装置在得到第一下采样特征图后,可以将第一下采样特征图输入至第二卷积网络中进行卷积处理,提取所述多个关键点的上下文信息,来得到第二特征图。
这里,第一下采样特征图是分辨率缩小后的特征图,当图像的分辨率较低时,图像中的上下文信息居多,通过对分辨率缩小后的第一下采样特征图进行第二卷积处理,可以充分第一下采样特征图中关键点的上下文信息。也就是说,第二特征图是融合了关键点的局部细节信息以及上下文信息的特征图。
步骤9204、对第二特征图进行第二下采样处理,得到融合后特征图。
进一步,在得到第二特征图后,继续对第二特征图进行下采样处理,充分融合第二特征图中关键点的全局信息,得到融合后特征图。
需要说明的是,第二下采样处理和步骤9202中的第一下采样处理可以是相同的处理,也可以是不同的处理,本申请实施例这里不做限定。
这样,融合后特征图能够包含关键点的局部细节信息,可以包含了关键点的上下文相关的全局信息。也就是说,融合后特征图能够充分融合不同关键点的信息,进而基于融合后特征图进行手部姿势估计,可以提高手部姿势估计的准确性,得到高精度的手部姿势估计结果。
在另一可能的实现方式中,步骤9201通过第一卷积网络,对所述初始特征图进行第一卷积处理之前,还可以对初始特征图进行以下处理:
对初始特征图进行降维处理,得到降维后特征图;降维处理用于降低所述初始特征图的通道数;
通过第一卷积网络,对降维后特征图进行特征第一卷积处理,得到第一特征图,以采用第一特征图确定所述融合后特征图。
可以理解的是,在对初始特征图进行融合的过程中,可以对初始特征图进行降维处理,以降低初始特征图的通道数,这样,通过对降维后特征图进行第一卷积处理、第一下采样处理、第二卷积处理、以及第二下采样处理,来得到融合后的特征图。如此,通过对降维后特征图进行处理,可以降低处理过程中的计算量。
下面,结合图11所示的手部姿态估计头对应的架构示意图,对上述特征融合的处理过程进行详细描述。
在本申请提供的实施例中,参考图11所示一种手部姿态估计头网络架构示意图,手部姿态估计头具体可以包括特征融合模块111(也可以称为下采样模块)和反卷积模块(也可以称为上采样模块)112。其中,特征融合模块111可以包括:第一卷积网络1111、第一下采样网络1112、第二卷积网络1113、第二下采样网络1114。
上述步骤920中对初始特征图进行特征融合处理的过程可以应用到图11所示的网络架构中;具体地,在得到初始特征图后,通过第一卷积网络1111,对初始特征图进行第一卷积处理,得到第一特征图;接着通过第一下采样网络1112对第一特征图进行第一下采样处理,得到第一下采样特征图;然后,通过第二卷积网络1113,对第一下采样特征图进行第二卷积处理,得到第二特征图;最后,通过第二下采样网络1114对第二特征图进行第二下采样处理,得到融合后特征图。
在一种可能的实现方式中,参考图12所示的一种第一卷积网络结构组成示意图,第一卷积网络可以包括N个子卷积网络;其中,N为大于1的整数。
其中,第1子卷积网络的输出,与第2子卷积网络的输入连接,第2个子卷积网络的输出与第3子卷积网络的输入连接,以此类推,第N-1子卷积网络的输出,与第N子卷积网络的输入连接。
基于此,本申请提供的实施例中,步骤9201通过第一卷积网络,对初始特征图进行第一卷积处理,得到第一特征图,可以通过以下方式实现:
在i=1的情况下,通过第i子卷积网络对初始特征图进行第i卷积处理,输出第i特征图,并将初始特征图与第i输出特征图进行加权和处理,得到第i加权和特征图;其中,i为大于等于1且小于N的整数;
在i不等于1的情况下,通过第i子卷积网络对第i-1加权和特征图进行第i卷积处理,输出第i特征图,并将第i-1加权和特征图与第i特征图进行加权和处理,得到第i加权和特征图;
继续通过第i+1子卷积网络对所述第i加权和特征图进行第i+1卷积处理,直至通过第N子卷积网络对第N-1加权和特征图进行第N卷积处理,输出第N加权和特征图;
将第N加权和特征图与所述第N-1特征图进行加权和处理,得到第一特征图。
也就是说,手部姿态估计装置获取到初始特征图之后,首先通过第1子卷积网络对初始特征图进行第1卷积处理,输出第1特征图。并且将初始特征图与第1特征图进行加权和处理,得到第1加权和特征图;也就是,跳过连接,将第1子卷积网络的输入,与第1子卷积网络的输出相加,得到第1加权和特征图,使得得到的第1加权和特征图与输入的初始特征图的尺寸大小一致。这样,通过第1子卷积网络对初始特征图进行认识和抽象,融合各个关键点周围像素之间的特征信息,得到第1加权和特征图。
接着,第2子卷积网络对第1加权和特征图进行进一步处理;具体地,通过第2子卷积网络对第1加权和特征图进行第2卷积处理,输出第2特征图,跳过连接将第2子卷积网络的输入(即第1加权和特征图),以及第2子卷积网络的输出(即第2特征图)进行加权和处理,得到第2加权和特征图。这样,通过第2子卷积网络对第1子卷积网络输出的第1加权和特征图进行进一步的认识和抽象,能够进一步融合各个关键点周围像素的特征信息。
以此类推,第3子卷积网络继续对第2加权和特征图进行处理,得到第3加权和特征图,直到第N子卷积网络对第N-1加权和特征图进行处理,得到第N加权和特征图,并将该第N加权和特征图作为最终的第一特征图。
如此,通过多层次的子卷积网络对初始特征图进行多层卷积处理,能够在当前分辨率下,一步步融合关键点周围的特征信息。
在一种可能的实现方式中,步骤9203中通过第二卷积网络,对第一下采样特征图进行第二卷积处理,得到第二特征图,可以通过以下方式实现:
步骤9203a、通过第二卷积网络,对第一下采样特征图进行第二卷积处理,输出第二卷积特征图;
步骤9203b、将第二卷积特征图和第一下采样特征图进行加权和处理,得到第二特征图。
在本申请提供的实施例中,可以通过第二卷积网络对第一下采样特征图进行第二卷积处理,可以进一步融合第一下采样特征图中关键点的上下文信息(即全局特征信息)。
进一步,可以跳过连接,将第二卷积网络的输入(即第一下采样特征图)和第二卷积网络的输出(即第二卷积特征图)相加,得到第二特征图。如此,可以保证得到的第二特征图与输入的第一下采样特征图的尺寸大小相同,以便进行下一步的处理。
在一种可能的实现方式中,步骤930对融合后特征图进行反卷积处理,得到目标特征图,可以通过以下步骤来实现:
步骤9301、对融合后特征图进行第一上采样处理,得到第一上采样特征图;
步骤9302、通过第三卷积网络,对第一上采样特征图进行第三卷积处理,得到第三特征图;
步骤9303、对第三特征图进行第二上采样处理,得到第二上采样特征图;
步骤9304、通过第四卷积网络,对第二上采样特征图进行第四卷积处理,得到第四特征图;
步骤9305、对第四特征图进行第三上采样处理,得到目标特征图。
在本申请提供的实施例中,融合后特征图的分辨率较小,需要恢复融合后特征图的分辨率,以便在高分辨率的特征图上进行手部姿态估计,提高手部姿态估计的准确度。
这里,对融合后特征图的分辨率恢复的过程,可以与对初始特征图进行特征融合的 过程相对应。具体地,第一上采样处理过程与第二下采样处理过程对应,例如,若尺寸大小为4x4x128的特征图经过第二下采样处理后,得到的特征图尺寸大小为2x2x256;则第一上采样可以将2x2x256的特征图映射至4x4x128。另外,第三卷积网络与第二卷积网络相对应,也就是,第三卷积网络使用的卷积核与第二卷积网络的卷积核相同;第二上采样与第一下采样相对应。
下面,结合图11所示的手部姿态估计头对应的架构示意图,对上述反卷积的处理过程进行详细描述。
具体地,参考图11所示的手部姿态估计头对应的架构示意图,反卷积模块112可以包括第一上采样网络1121,第三卷积网络1122,第二上采样网络1123,第四卷积网络1124,以及第三上采样网络1125。
步骤930对融合后特征图进行反卷积处理,得到目标特征图可以应用到图11所示的网络架构中,具体地,通过第一上采样网络1121对融合后特征图进行第一上采样处理;其中,第一上采样网络1121与第二下采样网络1114对应。
接着,通过第三卷积网络1122对第一上采样特征图进行第三卷积处理,得到第三特征图,其中,第三卷积网络1122与第二卷积网络1113相对应。进一步,通过第二上采样网络1123对第三特征图进行第二上采样处理,得到第二上采样特征图;其中,第二上采样网络1123与第一下采样1112对应。接着,通过第四卷积网络1124,对第二上采样特征图进行第四卷积处理,得到第四特征图;其中,第四卷积网络1124与第一卷积网络1111相对应。最后,通过第二上采样网络1125,对第四特征图进行第三上采样处理,得到目标特征图。
在一种可能的实现方式中,步骤9302通过第三卷积网络,对第一上采样特征图进行第三卷积处理,得到第三特征图,可以通过以下方式实现:
步骤9302a、通过第三卷积网络,对第一上采样特征图进行第三卷积处理,输出第三卷积特征图;
步骤9302b、将第三卷积特征图和第二特征图进行加权和处理,得到第三特征图。
在本申请提供的实施例中,可以通过第三卷积网络对第一上采样特征图进行第三卷积处理,输出第三卷积特征图。
应注意,第三卷积网络与第二卷积网络对应,因此,在本申请提供的实施例中,手部姿态估计装置可以将第二卷积网络得到的第二特征图,与第三卷积网络输出的第三卷积特征图进行加权和处理,得到第三特征图。如此,可以保证得到的第三特征图与第二特征图的尺寸大小一致,以便进行下一步的处理。
在一种可能的实现方式中,步骤9304通过第四卷积网络,对第二上采样特征图进行第四卷积处理,得到第四特征图,包括:
步骤9304a、通过第四卷积网络,对第二上采样特征图进行第四卷积处理,输出第四卷积特征图;
步骤9304b、将第四卷积特征图和第一特征图进行加权和处理,得到第四特征图。
在本申请提供的实施例中,可以通过第三卷积网络对第二上采样特征图进行第四卷积处理,输出第四卷积特征图。
应注意,第四卷积网络与第一卷积网络对应,因此,在本申请提供的实施例中,手部姿态估计装置可以将第一卷积网络得到的第一特征图,与第四卷积网络输出的第四卷积特征图进行加权和处理,得到第四特征图。如此,可以保证得到的第四特征图与第一特征图的尺寸大小一致,以便进行下一步的处理,
下面,结合实际应用场景对上述方案进行详细描述。
参考图13所示的一种掩膜区域卷积神经网络(Mask R-CNN)架构图,其中,与现 有的用于分类和边界盒回归的分支并行,Mask R-CNN可以在选择的每个RoI上添加一个掩膜分割头来扩展R-CNN。掩膜分割头可以理解为是应用于每个RoI的一个小的全卷积神经网络(Fully Convolutional Networks,FCN),以像素到像素的方式进行估计和预测。Mask R-CNN易于实现和训练,提供了更快的R-CNN框架,这有助于广泛灵活的体系结构设计。此外,掩膜分割头只增加了一个小的计算开销,从而实现了一个快速的识别系统。
基于Mask R-CNN架构,参考图10所示的一种示例性的手部姿态估计方法的网络架构示意图,本申请实施例提供的手部姿态估计方法,可以针对RoIAlign特征提取器提取的RoIAlign特征图进行手部姿态估计。
值得注意的是,本申请实施例能够复用从手部区域检测任务中计算出的RoIAlign特征图,而不是从原始图像开始。因此,本申请实施例提供的手部姿态估计方法计算量小,可以部署在移动设备上进行用户的手部姿态估计。并且本申请实施例提供的手部姿态估计方法采用沙漏网络结构,能够充分融合不同关键点的信息,从而实现更精确的手部姿态估计。
参考图14所示的一种示例性的手部姿态估计方法的网络架构示意图,该网络架构图包括下采样块141(即特征融合模块)和上采样块142(即反卷积模块)。其中,下采样块141包括Conv1至Conv5;上采样块142包括Conv5至Conv10。
一并参考图15所示的一种示例性的手部姿态估计期间的沙漏网络特征图。本申请实施例提供的手部姿态估计方法包括以下步骤:
步骤a、通过卷积核为3x3x128的Conv1(即降维处理对应的卷积层)对尺寸大小为8x8x256的RoIAlign特征图1501(即初始特征图)进行卷积处理,得到尺寸大小为8x8x128的降维后特征图1502。
在本申请提供的实施例中,Conv1的卷积核(3x3x128)为预先设置的,通过Conv1可以将RoIAlign特征图1501的通道数缩小到128,得到尺寸大小为8x8x128的降维后特征图1520。如此,降低RoIAlign特征图1501的维度进行处理,从而缩小手部姿态估计过程中的计算量。
步骤b、通过两个首尾相连的Conv2(对应上文中的第一卷积网络)对尺寸为8x8x128的降维后特征图1502进行卷积处理,并将每个Conv2输入的特征图与Conv2输出的特征图相加,得到与降维后特征图尺寸相同(即8x8x128)的第一特征图1503。
也就是说,可以通过Conv2对降维后特征图重复处理两次,来得到尺寸相同的第一特征图1503。
步骤c、通过卷积核为3x3x128,且步长为2的Conv3(即上文提到的第一下采样网络),对尺寸大小为8x8x128的第一特征图1503进行下采样,得到尺寸大小为4x4x128的第一下采样特征图1504。
步骤d、通过卷积核为3x3x128的Conv4(即上文提到的第二卷积网络),对尺寸大小为4x4x128的第一下采样特征图1504进行卷积处理,并将输入Conv4的第一下采样特征图1504和Conv4输出的特征图相加,得到与第一下采样特征图尺寸相同的第二特征图1505,即第二特征图1505的尺寸大小为4x4x128。
步骤e、通过卷积核为3x3x256,且步长为2的Conv5(即上文提到的第二下采样网络)对第二特征图1505进行下采样,得到尺寸为2x2x256的融合后特征图1506。
步骤f、通过卷积核为2x2x128的Conv6(即上文提到的第一上采样网络)对融合后特征图1506进行上采样,得到尺寸4x4x128的第一上采样特征图1507。
步骤g、通过卷积核为3x3x128的Conv7(即上文提到的第三卷积网络)对第一上采样特征图1507进行处理,并将通过Conv4处理得到的第二特征图1505与Conv7输 出的特征图相加,得到尺寸大小为4x4x128第三特征图1508。
如此,保证得到的第三特征图1508的尺寸大小与第二特征图1505的尺寸大小一致。
步骤h、通过卷积核为2x2x128的Conv8(即上文提到的第二上采样网络)对第三特征图1508进行上采样处理,得到尺寸大小为8x8x128的第二上采样特征图1509。
步骤i、通过卷积核为3x3x128的Conv9(即上文提到的第四卷积网络)对第二上采样特征图1509进行处理,并将通过Conv1处理得到的第一特征图1503与Conv9输出的特征图相加,得到尺寸大小为8x8x128的第四特征图1510。
步骤j、通过卷积核为2x2x128的Conv10(即上文提到的第三上采样网络)对第四特征图1510进行处理,得到尺寸大小为16x16x128目标特征图1511。
如此,目标特征图1511是经过特征融合处理和反卷积处理后的特征图,可见,该目标特征图1511能够充分融合原始待处理图像的手部区域中各个关键点的细节信息以及上下文信息,那么,基于该目标特征图1511进行手部姿态的估计,能够提高手部姿态估计的准确性。
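Putting the worked Conv1–Conv10 example above together, a minimal PyTorch sketch of the hourglass-style estimation head might look as follows. The padding choices, the ReLU activations, and the use of 2×2 transposed convolutions for the up-sampling layers are assumptions made so that the stated 8×8 → 2×2 → 16×16 shape progression works out; they are not details fixed by the text.

```python
import torch
import torch.nn as nn

class HourglassHead(nn.Module):
    """Sketch of the Conv1-Conv10 pipeline: 8x8x256 RoIAlign features ->
    down-sampling with skip additions -> 2x2x256 fused map -> up-sampling
    with skip additions -> 16x16x128 target feature map."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(256, 128, 3, padding=1)              # 8x8x256 -> 8x8x128
        self.conv2 = nn.ModuleList([nn.Conv2d(128, 128, 3, padding=1) for _ in range(2)])
        self.conv3 = nn.Conv2d(128, 128, 3, stride=2, padding=1)    # 8x8 -> 4x4
        self.conv4 = nn.Conv2d(128, 128, 3, padding=1)
        self.conv5 = nn.Conv2d(128, 256, 3, stride=2, padding=1)    # 4x4 -> 2x2
        self.conv6 = nn.ConvTranspose2d(256, 128, 2, stride=2)      # 2x2 -> 4x4
        self.conv7 = nn.Conv2d(128, 128, 3, padding=1)
        self.conv8 = nn.ConvTranspose2d(128, 128, 2, stride=2)      # 4x4 -> 8x8
        self.conv9 = nn.Conv2d(128, 128, 3, padding=1)
        self.conv10 = nn.ConvTranspose2d(128, 128, 2, stride=2)     # 8x8 -> 16x16
        self.relu = nn.ReLU(inplace=True)

    def forward(self, roi_feat: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.conv1(roi_feat))          # dimensionality reduction
        for conv in self.conv2:
            x = x + self.relu(conv(x))               # first feature map (8x8x128)
        first = x
        x = self.relu(self.conv3(x))                 # first down-sampling (4x4x128)
        second = x + self.relu(self.conv4(x))        # second feature map (4x4x128)
        fused = self.relu(self.conv5(second))        # fused feature map (2x2x256)
        x = self.relu(self.conv6(fused))             # first up-sampling (4x4x128)
        x = second + self.relu(self.conv7(x))        # third feature map (4x4x128)
        x = self.relu(self.conv8(x))                 # second up-sampling (8x8x128)
        x = first + self.relu(self.conv9(x))         # fourth feature map (8x8x128)
        return self.relu(self.conv10(x))             # target feature map (16x16x128)

head = HourglassHead()
print(head(torch.randn(1, 256, 8, 8)).shape)  # torch.Size([1, 128, 16, 16])
```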
本申请提供的实施例中,基于前述实施例相同的发明构思,参见图16,其示出了本申请实施例提供的一种手部姿态估计装置160的组成结构示意图。如图16所示,该手部姿态估计装置160可以包括:
获取单元1601,配置为获取待处理图像中手部区域对应的初始特征图;
第一处理单元1602,配置为对所述初始特征图进行特征融合处理,得到融合后特征图;所述特征融合处理用于对多个关键点周围的特征进行融合;所述多个关键点表示所述手部区域的骨架关键节点;
第二处理单元1603,配置为对所述融合后特征图进行反卷积处理,得到目标特征图;所述反卷积处理用于调整所述融合后特征图的分辨率;
姿态估计单元1604,配置为基于所述目标特征图,得到所述多个关键点的坐标信息,以确定所述待处理图像中手部区域的姿态估计结果。
在一些实施例中,所述初始特征图为感兴趣区域匹配RoIAlign特征图。
在一些实施例中,获取单元1601,具体配置为对所述待处理图像的图像内容进行识别处理,确定所述待处理图像中的手部区域;对所述待处理图像中的手部区域进行感兴趣区域匹配RoIAlign特征提取,得到所述初始特征图
在一些实施例中,第一处理单元1602,具体用于通过第一卷积网络,对所述初始特征图进行第一卷积处理,得到第一特征图;所述第一卷积处理用于提取所述多个关键点的局部细节信息;对所述第一特征图进行第一下采样处理,得到第一下采样特征图;通过第二卷积网络,对所述第一下采样特征图进行第二卷积处理,得到第二特征图;所述第二卷积处理用于提取所述多个关键点的上下文信息;对所述第二特征图进行第二下采样处理,得到所述融合后特征图。
在一些实施例中,第一处理单元1602,还配置为对所述初始特征图进行降维处理,得到降维后特征图;所述降维处理用于降低所述初始特征图的通道数;通过所述第一卷积网络,对所述降维后特征图进行特征第一卷积处理,得到所述第一特征图,以采用所述第一特征图确定所述融合后特征图。
在一些实施例中,所述第一卷积网络包括N个子卷积网络;N为大于1的整数;
所述第一处理单元1602,还被配置为在i=1的情况下,通过第i子卷积网络对所述初始特征图进行第i卷积处理,输出第i特征图,并将所述初始特征图与所述第i特征图进行加权和处理,得到第i加权和特征图;其中,i为大于等于1且小于N的整数;在i不等于1的情况下,通过第i子卷积网络对第i-1加权和特征图进行第i卷积处理,输出第i特征图,并将所述第i-1加权和特征图与所述第i特征图进行加 权和处理,得到第i加权和特征图;继续通过第i+1子卷积网络对所述第i加权和特征图进行第i+1卷积处理,直至通过第N子卷积网络对第N-1加权和特征图进行第N卷积处理,输出第N加权和特征图;将所述第N加权和特征图与所述第N-1特征图进行加权和处理,得到所述第一特征图。
在一些实施例中,所述第一处理单元1602,配置为通过所述第二卷积网络,对所述第一下采样特征图进行第二卷积处理,输出第二卷积特征图;将所述第二卷积特征图和所述第一下采样特征图进行加权和处理,得到所述第二特征图。
在一些实施例中,所述第二处理单元1603,配置为对所述融合后特征图进行第一上采样处理,得到第一上采样特征图;通过第三卷积网络,对所述第一上采样特征图进行第三卷积处理,得到第三特征图;对所述第三特征图进行第二上采样处理,得到第二上采样特征图;通过第四卷积网络,对所述第二上采样特征图进行第四卷积处理,得到第四特征图;对所述第四特征图进行第三上采样处理,得到所述目标特征图。
在一些实施例中,所述第二处理单元1603,配置为通过所述第三卷积网络,对所述第一上采样特征图进行第三卷积处理,输出第三卷积特征图;将所述第三卷积特征图和所述第二特征图进行加权和处理,得到所述第三特征图。
在一些实施例中,所述第二处理单元1603,还配置为通过所述第四卷积网络,对所述第二上采样特征图进行第四卷积处理,输出第四卷积特征图;
将所述第四卷积特征图和所述第一特征图进行加权和处理,得到所述第四特征图。
可以理解地,在本实施例中,“单元”可以是部分电路、部分处理器、部分程序或软件等等,当然也可以是模块,还可以是非模块化的。而且在本实施例中的各组成部分可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。
所述集成的单元如果以软件功能模块的形式实现并非作为独立的产品进行销售或使用时,可以存储在一个计算机可读取存储介质中,基于这样的理解,本实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或processor(处理器)执行本实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
因此,本实施例提供了一种计算机存储介质,该计算机存储介质存储有手部姿态估计程序,所述手部姿态估计程序被至少一个处理器执行时实现前述实施例中任一项所述的方法的步骤。
基于上述手部姿态估计装置160的组成以及计算机存储介质,参见图17,其示出了本申请实施例提供的电子设备170的具体硬件结构示意图。如图17所示,可以包括:通信接口1701、存储器1702和处理器1703;各个组件通过总线系统1704耦合在一起。可理解,总线系统1704用于实现这些组件之间的连接通信。总线系统1704除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图17中将各种总线都标为总线系统1704。其中,
通信接口1701,配置为在与其他外部网元之间进行收发信息过程中,信号的接收和发送;
存储器1702,配置为存储能够在处理器1703上运行的可执行指令;
处理器1703,配置为在运行所述可执行指令时,执行:
获取待处理图像中手部区域对应的初始特征图;
对所述初始特征图进行特征融合处理,得到融合后特征图;所述特征融合处理用于对多个关键点周围的特征进行融合;所述多个关键点表示所述手部区域的骨架关键节点;
对所述融合后特征图进行反卷积处理,得到目标特征图;所述反卷积处理用于调整所述融合后特征图的分辨率;
基于所述目标特征图,得到所述多个关键点的坐标信息,以确定所述待处理图像中手部区域的姿态估计结果。
可以理解,本申请实施例中的存储器1702可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDRSDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步链动态随机存取存储器(Synchronous link DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DRRAM)。本文描述的系统和方法的存储器1702旨在包括但不限于这些和任意其它适合类型的存储器。
而处理器1703可能是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器1703中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器1703可以是通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器1702,处理器1703读取存储器1702中的信息,结合其硬件完成上述方法的步骤。
可以理解的是,本文描述的这些实施例可以用硬件、软件、固件、中间件、微码或其组合来实现。对于硬件实现,处理单元可以实现在一个或多个专用集成电路(Application Specific Integrated Circuits,ASIC)、数字信号处理器(Digital Signal Processing,DSP)、数字信号处理设备(DSP Device,DSPD)、可编程逻辑设备(Programmable Logic Device,PLD)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)、通用处理器、控制器、微控制器、微处理器、用于执行本申请所述功能的其它电子单元或其组合中。
对于软件实现,可通过执行本文所述功能的模块(例如过程、函数等)来实现本文所述的技术。软件代码可存储在存储器中并通过处理器执行。存储器可以在处理器中或在处理器外部实现。
可选地,作为另一个实施例,处理器1703还配置为在运行所述计算机程序时,执 行前述实施例中任一项所述的方法的步骤。
需要说明的是,在本申请中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者装置不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者装置所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、方法、物品或者装置中还存在另外的相同要素。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
本申请所提供的几个方法实施例中所揭露的方法,在不冲突的情况下可以任意组合,得到新的方法实施例。
本申请所提供的几个产品实施例中所揭露的特征,在不冲突的情况下可以任意组合,得到新的产品实施例。
本申请所提供的几个方法或设备实施例中所揭露的特征,在不冲突的情况下可以任意组合,得到新的方法实施例或设备实施例。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。
工业实用性
本申请实施例中,首先获取待处理图像中手部区域对应的初始特征图;对所述初始特征图进行特征融合处理,得到融合后特征图;所述特征融合处理用于对多个关键点周围的特征进行融合;所述多个关键点表示所述手部区域的骨架关键节点;对所述融合后特征图进行反卷积处理,得到目标特征图;所述反卷积处理用于调整所述融合后特征图的分辨率;基于所述目标特征图,得到所述多个关键点的坐标信息,以确定所述待处理图像中手部区域的姿态估计结果,这样,通过对待处理图像中手部区域的特征图进行特征融合和反卷积处理,能够充分融合不同关键点的信息,提高手部姿态估计的准确性,得到高精度的手部姿态估计结果。

Claims (13)

  1. 一种手部姿态估计方法,所述方法包括:
    获取待处理图像中手部区域对应的初始特征图;
    对所述初始特征图进行特征融合处理,得到融合后特征图;所述特征融合处理用于对多个关键点周围的特征进行融合;所述多个关键点表示所述手部区域的骨架关键节点;
    对所述融合后特征图进行反卷积处理,得到目标特征图;所述反卷积处理用于调整所述融合后特征图的分辨率;
    基于所述目标特征图,得到所述多个关键点的坐标信息,以确定所述待处理图像中手部区域的姿态估计结果。
  2. 根据权利要求1所述的方法,其中,所述初始特征图为感兴趣区域匹配RoIAlign特征图。
  3. 根据权利要求1或2所述的方法,其中,所述获取待处理图像中手部区域对应的初始特征图,包括:
    对所述待处理图像的图像内容进行识别处理,确定所述待处理图像中的手部区域;
    对所述待处理图像中的手部区域进行感兴趣区域匹配RoIAlign特征提取,得到所述初始特征图。
  4. 根据权利要求1-3任一项所述的方法,其中,所述对所述初始特征图进行特征融合处理,得到融合后特征图,包括:
    通过第一卷积网络,对所述初始特征图进行第一卷积处理,得到第一特征图;所述第一卷积处理用于提取所述多个关键点的局部细节信息;
    对所述第一特征图进行第一下采样处理,得到第一下采样特征图;
    通过第二卷积网络,对所述第一下采样特征图进行第二卷积处理,得到第二特征图;所述第二卷积处理用于提取所述多个关键点的上下文信息;
    对所述第二特征图进行第二下采样处理,得到所述融合后特征图。
  5. 根据权利要求4所述的方法,其中,所述通过第一卷积网络,对所述初始特征图进行第一卷积处理之前,还包括:
    对所述初始特征图进行降维处理,得到降维后特征图;所述降维处理用于降低所述初始特征图的通道数;
    通过所述第一卷积网络,对所述降维后特征图进行特征第一卷积处理,得到所述第一特征图,以采用所述第一特征图确定所述融合后特征图。
  6. 根据权利要求4或5所述的方法,其中,所述第一卷积网络包括N个子卷积网络;N为大于1的整数;
    所述通过第一卷积网络,对所述初始特征图进行第一卷积处理,得到第一特征图,包括:
    在i=1的情况下,通过第i子卷积网络对所述初始特征图进行第i卷积处理,输出第i特征图,并将所述初始特征图与所述第i特征图进行加权和处理,得到第i加权和特征图;其中,i为大于等于1且小于N的整数;
    在i不等于1的情况下,通过第i子卷积网络对第i-1加权和特征图进行第i卷积处理,输出第i特征图,并将所述第i-1加权和特征图与所述第i特征图进行加权和处理,得到第i加权和特征图;
    继续通过第i+1子卷积网络对所述第i加权和特征图进行第i+1卷积处理,直至 通过第N子卷积网络对第N-1加权和特征图进行第N卷积处理,输出第N加权和特征图;
    将所述第N加权和特征图与所述第N-1特征图进行加权和处理,得到所述第一特征图。
  7. 根据权利要求4或5所述的方法,其中,所述通过第二卷积网络,对所述第一下采样特征图进行第二卷积处理,得到第二特征图,包括:
    通过所述第二卷积网络,对所述第一下采样特征图进行第二卷积处理,输出第二卷积特征图;
    将所述第二卷积特征图和所述第一下采样特征图进行加权和处理,得到所述第二特征图。
  8. 根据权利要求4或5所述的方法,其中,所述对所述融合后特征图进行反卷积处理,得到目标特征图,包括:
    对所述融合后特征图进行第一上采样处理,得到第一上采样特征图;
    通过第三卷积网络,对所述第一上采样特征图进行第三卷积处理,得到第三特征图;
    对所述第三特征图进行第二上采样处理,得到第二上采样特征图;
    通过第四卷积网络,对所述第二上采样特征图进行第四卷积处理,得到第四特征图;
    对所述第四特征图进行第三上采样处理,得到所述目标特征图。
  9. 根据权利要求8所述的方法,其中,所述通过第三卷积网络,对所述第一上采样特征图进行第三卷积处理,得到第三特征图,包括:
    通过所述第三卷积网络,对所述第一上采样特征图进行第三卷积处理,输出第三卷积特征图;
    将所述第三卷积特征图和所述第二特征图进行加权和处理,得到所述第三特征图。
  10. 根据权利要求8所述的方法,其中,所述通过第四卷积网络,对所述第二上采样特征图进行第四卷积处理,得到第四特征图,包括:
    通过所述第四卷积网络,对所述第二上采样特征图进行第四卷积处理,输出第四卷积特征图;
    将所述第四卷积特征图和所述第一特征图进行加权和处理,得到所述第四特征图。
  11. 一种手部姿态估计装置,所述装置包括:
    获取单元,配置为获取待处理图像中手部区域对应的初始特征图;
    第一处理单元,配置为对所述初始特征图进行特征融合处理,得到融合后特征图;所述特征融合处理用于对多个关键点周围的特征进行融合;所述多个关键点表示所述手部区域的骨架关键节点;
    第二处理单元,配置为对所述融合后特征图进行反卷积处理,得到目标特征图;所述反卷积处理用于调整所述融合后特征图的分辨率;
    姿态估计单元,配置为基于所述目标特征图,得到所述多个关键点的坐标信息,以确定所述待处理图像中手部区域的姿态估计结果。
  12. 一种电子设备,所述电子设备包括存储器和处理器;其中,
    所述存储器,配置为存储能够在所述处理器上运行的可执行指令;
    所述处理器,配置为在运行所述可执行指令时,执行如权利要求1至10任一项所述的方法。
  13. 一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被第一处理器执行实现权利要求1至10任一项所述方法的步骤。
PCT/CN2020/122933 2019-11-20 2020-10-22 手部姿态估计方法、装置、设备以及计算机存储介质 WO2021098441A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/747,837 US20220358326A1 (en) 2019-11-20 2022-05-18 Hand posture estimation method, apparatus, device, and computer storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962938190P 2019-11-20 2019-11-20
US62/938,190 2019-11-20

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/747,837 Continuation US20220358326A1 (en) 2019-11-20 2022-05-18 Hand posture estimation method, apparatus, device, and computer storage medium

Publications (1)

Publication Number Publication Date
WO2021098441A1 true WO2021098441A1 (zh) 2021-05-27

Family

ID=75981191

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/122933 WO2021098441A1 (zh) 2019-11-20 2020-10-22 手部姿态估计方法、装置、设备以及计算机存储介质

Country Status (2)

Country Link
US (1) US20220358326A1 (zh)
WO (1) WO2021098441A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115514885A (zh) * 2022-08-26 2022-12-23 燕山大学 基于单双目融合的远程增强现实随动感知系统及方法

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11763468B2 (en) * 2020-03-12 2023-09-19 Ping An Technology (Shenzhen) Co., Ltd. Structured landmark detection via topology-adapting deep graph learning
EP3965071A3 (en) * 2020-09-08 2022-06-01 Samsung Electronics Co., Ltd. Method and apparatus for pose identification
CN116486489B (zh) * 2023-06-26 2023-08-29 江西农业大学 基于语义感知图卷积的三维手物姿态估计方法及系统
CN117037215B (zh) * 2023-08-15 2024-03-22 匀熵智能科技(无锡)有限公司 人体姿态估计模型训练方法、估计方法、装置及电子设备

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104134061A (zh) * 2014-08-15 2014-11-05 上海理工大学 一种基于特征融合的支持向量机的数字手势识别方法
US20170147075A1 (en) * 2015-11-24 2017-05-25 Intel Corporation Determination of hand dimensions for hand and gesture recognition with a computing interface
US20170168586A1 (en) * 2015-12-15 2017-06-15 Purdue Research Foundation Method and System for Hand Pose Detection
CN107066935A (zh) * 2017-01-25 2017-08-18 网易(杭州)网络有限公司 基于深度学习的手部姿态估计方法及装置
CN110175566A (zh) * 2019-05-27 2019-08-27 大连理工大学 一种基于rgbd融合网络的手部姿态估计系统及方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104134061A (zh) * 2014-08-15 2014-11-05 上海理工大学 一种基于特征融合的支持向量机的数字手势识别方法
US20170147075A1 (en) * 2015-11-24 2017-05-25 Intel Corporation Determination of hand dimensions for hand and gesture recognition with a computing interface
US20170168586A1 (en) * 2015-12-15 2017-06-15 Purdue Research Foundation Method and System for Hand Pose Detection
CN107066935A (zh) * 2017-01-25 2017-08-18 网易(杭州)网络有限公司 基于深度学习的手部姿态估计方法及装置
CN110175566A (zh) * 2019-05-27 2019-08-27 大连理工大学 一种基于rgbd融合网络的手部姿态估计系统及方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115514885A (zh) * 2022-08-26 2022-12-23 燕山大学 基于单双目融合的远程增强现实随动感知系统及方法
CN115514885B (zh) * 2022-08-26 2024-03-01 燕山大学 基于单双目融合的远程增强现实随动感知系统及方法

Also Published As

Publication number Publication date
US20220358326A1 (en) 2022-11-10

Similar Documents

Publication Publication Date Title
WO2021098441A1 (zh) 手部姿态估计方法、装置、设备以及计算机存储介质
JP7305869B2 (ja) 歩行者検出方法及び装置、コンピュータ読み取り可能な記憶媒体並びにチップ
WO2019042426A1 (zh) 增强现实场景的处理方法、设备及计算机存储介质
WO2021098576A1 (zh) 手部姿态估计方法、装置及计算机存储介质
CN111968064B (zh) 一种图像处理方法、装置、电子设备及存储介质
KR20210025942A (ko) 종단간 컨볼루셔널 뉴럴 네트워크를 이용한 스테레오 매칭 방법
JP7091485B2 (ja) 運動物体検出およびスマート運転制御方法、装置、媒体、並びに機器
US8971634B2 (en) Approximate pyramidal search for fast displacement matching
CN113807361B (zh) 神经网络、目标检测方法、神经网络训练方法及相关产品
US11158077B2 (en) Disparity estimation
WO2021098545A1 (zh) 一种姿势确定方法、装置、设备、存储介质、芯片及产品
WO2021098573A1 (zh) 手部姿态估计方法、装置、设备以及计算机存储介质
US20210150679A1 (en) Using imager with on-purpose controlled distortion for inference or training of an artificial intelligence neural network
CN111914756A (zh) 一种视频数据处理方法和装置
US20220277595A1 (en) Hand gesture detection method and apparatus, and computer storage medium
US20220277475A1 (en) Feature extraction method and device, and pose estimation method using same
CN108229281B (zh) 神经网络的生成方法和人脸检测方法、装置及电子设备
CN116309705A (zh) 一种基于特征交互的卫星视频单目标跟踪方法及系统
Hussain et al. Rvmde: Radar validated monocular depth estimation for robotics
KR20220098895A (ko) 인체 포즈 추정 장치 및 방법
US20210279506A1 (en) Systems, methods, and devices for head pose determination
CN116051736A (zh) 一种三维重建方法、装置、边缘设备和存储介质
WO2022107548A1 (ja) 3次元骨格検出方法及び3次元骨格検出装置
Cho et al. Depth map up-sampling using cost-volume filtering
Zhao et al. ResFuseYOLOv4_Tiny: Enhancing detection accuracy for lightweight networks in infrared small object detection tasks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20889223

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20889223

Country of ref document: EP

Kind code of ref document: A1