WO2021098545A1 - Posture determination method, apparatus, device, storage medium, chip and product
Posture determination method, apparatus, device, storage medium, chip and product
- Publication number
- WO2021098545A1; PCT/CN2020/127607 (CN2020127607W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature
- interest
- key point
- depth
- unit
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/771—Feature selection, e.g. selecting representative features from a multi-dimensional feature space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2200/00—Indexing scheme for image data processing or generation, in general
- G06T2200/04—Indexing scheme for image data processing or generation, in general involving 3D image data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
Definitions
- the embodiments of the present application relate to, but are not limited to, machine learning technology, and particularly relate to a posture determination method, apparatus, device, storage medium, chip, and product.
- the hand is one of the most flexible parts of the human body, and compared with other interaction methods, using gestures as a means of human-computer interaction is more natural. Therefore, gesture recognition technology is a major research topic in human-computer interaction.
- the embodiments of the application provide a posture determination method, device, equipment, storage medium, chip, and product.
- a posture determination method, including: extracting plane features of key points from a first image; extracting depth features of the key points from the first image; determining plane coordinates of the key points based on the plane features of the key points; determining depth coordinates of the key points based on the depth features of the key points; and
- determining the posture of the region of interest corresponding to the key points based on the plane coordinates of the key points and the depth coordinates of the key points.
- a posture determination device including:
- the plane coding unit is used to extract the plane features of the key points from the first image
- a depth coding unit configured to extract the depth feature of the key point from the first image
- a plane coordinate determining unit configured to determine the plane coordinates of the key point based on the plane characteristics of the key point
- a depth coordinate determining unit configured to determine the depth coordinate of the key point based on the depth feature of the key point
- the posture determination unit is configured to determine the posture of the region of interest corresponding to the key point based on the plane coordinates of the key point and the depth coordinate of the key point.
- a posture determination device including: a memory and a processor,
- the memory stores a computer program that can run on the processor, and the processor implements the steps in the above method when executing the program.
- a computer storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps in the above method .
- a chip including a processor, configured to call and run a computer program from a memory, so that a device installed with the chip executes the steps in the above method.
- in a sixth aspect, a computer program product includes a computer storage medium, and the computer storage medium stores computer program code.
- the computer program code includes instructions that can be executed by at least one processor. When the instructions are executed by the at least one processor, the steps in the foregoing method are implemented.
- since the plane features of the key points and the depth features of the key points in the first image are extracted separately, and the plane coordinates of the key points and the depth coordinates of the key points are determined separately, the plane features and the depth features of the key points do not interfere with each other, which improves the accuracy of the determined plane coordinates and depth coordinates of the key points. Moreover, because the plane features and the depth features of the key points are extracted directly, the method of extracting features to determine the posture of the region of interest is simple.
- FIG. 1 is a schematic diagram of a depth image obtained by a TOF camera according to an embodiment of the application
- FIG. 2 is a schematic diagram of the output result of the probability value of hand existence and the hand bounding box provided by an embodiment of this application;
- FIG. 3 is a schematic diagram of a position of key points of a hand provided by an embodiment of the application.
- FIG. 4 is a schematic diagram of a hand posture estimation result provided by an embodiment of this application.
- FIG. 5 is a schematic diagram of a hand posture detection pipeline provided by an embodiment of the application.
- FIG. 6 is a schematic diagram of using a RoIAlign layer to determine a region of interest according to an embodiment of the application
- FIG. 7a is a schematic diagram of a detection window provided by an embodiment of the application.
- FIG. 7b is a schematic diagram of a method for determining the intersection-over-union ratio provided by an embodiment of the application.
- FIG. 8 is a schematic diagram of the architecture of a Pose-REN network provided by an embodiment of the application.
- FIG. 9 is a schematic structural diagram of a gesture determination device provided by an embodiment of the application.
- FIG. 10 is a schematic flowchart of a posture determination method provided by an embodiment of this application.
- FIG. 11 is a schematic flowchart of another posture determination method provided by an embodiment of this application.
- FIG. 12 is a schematic structural diagram of a feature extractor provided by an embodiment of this application.
- FIG. 13 is a schematic flowchart of another posture determination method provided by an embodiment of the application.
- FIG. 14a is a schematic diagram of the architecture of a posture determination network provided by an embodiment of this application.
- FIG. 14b is a schematic structural diagram of a planar encoder provided by an embodiment of the application.
- FIG. 14c is a schematic structural diagram of a depth encoder provided by an embodiment of this application.
- FIG. 15 is a schematic flowchart of another posture determination method provided by an embodiment of the application.
- FIG. 16 is a schematic diagram of the composition structure of a posture determination device provided by an embodiment of the application.
- FIG. 17 is a schematic diagram of a hardware entity of a posture determination device provided by an embodiment of this application.
- FIG. 18 is a schematic structural diagram of a chip provided by an embodiment of the present application.
- Gesture estimation starts from extracting image features from the original image, and it takes a lot of calculation time to determine the hand posture.
- A time-of-flight (TOF) camera is a range imaging camera system. It uses time-of-flight technology to determine the distance between the camera and the subject by measuring the round-trip time of a light signal.
- the light signal of the TOF camera is provided by a laser or an LED.
- the TOF camera outputs a two-dimensional (2D) image of height H × width W, and each pixel value on the 2D image represents the depth value of the object (the pixel value can range from 0 mm to 3000 mm).
- FIG. 1 is a schematic diagram of a depth image obtained by a TOF camera according to an embodiment of the application. In the embodiment of the present application, the depth image may be an image captured by a TOF camera.
- Hand detection takes the depth image as input and outputs the probability value of the existence of a hand (the probability value is a number from 0 to 1, and a larger value indicates a higher degree of confidence that a hand exists) and the hand bounding box or detection box (a bounding box representing the position and size of the hand).
- Figure 2 is a schematic diagram of the output result of the probability value of the hand existence and the hand bounding box provided by the embodiment of the application.
- "score: 0.999884" in Figure 2 indicates that the probability value of the hand's existence is 0.999884, and the bounding box can be expressed as (xmin, ymin, xmax, ymax). In one embodiment, (xmin, ymin) is the upper left corner of the bounding box, and (xmax, ymax) is the lower right corner of the bounding box.
- the 2D hand posture estimation method is: inputting a depth image and outputting the 2D key point positions of the hand bones.
- FIG. 3 is a schematic diagram of the key point positions of the hand provided in an embodiment of the application. Each key point is a 2D coordinate (2D coordinates such as xy coordinates, where x is on the horizontal image axis and y is on the vertical image axis).
- FIG. 3 shows that there are 20 key points in the hand in the embodiment of the present application, which are key points 0 to 19 respectively.
- Fig. 4 is a schematic diagram of a hand posture estimation result provided by an embodiment of the application. As shown in Fig. 4, the hand posture estimation result can be determined according to the position of each 2D key point corresponding to the depth image.
- 3D hand posture estimation is to input a depth image and output the 3D key point positions of the hand bones.
- Fig. 3 shows the key point positions of these hands.
- Each key point position can be 3D coordinates (for example, xyz coordinates, where x is on the horizontal image axis, y is on the vertical image axis, and z is on the depth direction).
- 3D gesture estimation is a hot spot of current research.
- the hand posture detection pipeline includes the processes of hand detection and hand posture estimation.
- Fig. 5 is a schematic diagram of a hand gesture detection pipeline provided by an embodiment of the application. As shown in Fig. 5, the part of hand detection 501 can be implemented by using a first backbone feature extractor 5011 and a bounding box detection head 5012. The part of the posture estimation 502 can be implemented by using the second backbone feature extractor 5021 and the posture estimation head 5022.
- the first backbone feature extractor can extract some features in the depth image and input these features into the bounding box detection head.
- the bounding box detection head can determine the bounding box of the hand.
- the bounding box detection head can be the bounding box detection layer.
- the bounding box adjustment 504 can be performed to obtain the adjusted bounding box.
- the pose estimation head 5022 may be a pose estimation layer, and the pose estimation layer can estimate the pose of the hand based on these features.
- the layers used by the first backbone feature extractor 5011 and the second backbone feature extractor 5021 to extract features may be the same.
- the tasks of hand detection and hand posture estimation are completely separated.
- the position of the output bounding box can be adjusted to the centroid of the pixels within the bounding box, and the size of the bounding box can be slightly enlarged to include all hand pixels.
- the adjusted bounding box is used to crop the original depth image.
- the cropped image is input into the hand pose estimation task.
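- As a minimal illustration of the bounding-box adjustment and cropping described above, the following sketch re-centers a detected box on the centroid of the hand pixels, slightly enlarges it, and crops the depth image; the margin value and the depth validity range are illustrative assumptions, not values from the application.

```python
import numpy as np

def adjust_and_crop(depth, box, margin=0.1, min_depth=1, max_depth=3000):
    """Re-center the detected hand box on the centroid of (assumed) valid hand
    pixels, slightly enlarge it, and crop the depth image."""
    xmin, ymin, xmax, ymax = [int(v) for v in box]
    roi = depth[ymin:ymax, xmin:xmax]
    ys, xs = np.nonzero((roi >= min_depth) & (roi <= max_depth))
    if len(xs) > 0:
        # shift the box center onto the pixel centroid
        cx, cy = xmin + xs.mean(), ymin + ys.mean()
    else:
        cx, cy = (xmin + xmax) / 2.0, (ymin + ymax) / 2.0
    w = (xmax - xmin) * (1 + margin)   # enlarge to include all hand pixels
    h = (ymax - ymin) * (1 + margin)
    x0, y0 = int(max(cx - w / 2, 0)), int(max(cy - h / 2, 0))
    x1 = int(min(cx + w / 2, depth.shape[1]))
    y1 = int(min(cy + h / 2, depth.shape[0]))
    return depth[y0:y1, x0:x1], (x0, y0, x1, y1)
```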
- a RoIAlign layer may be used to determine a region of interest (Region Of Interest, ROI), and the region of interest may be a region corresponding to the above-mentioned cropped image.
- ROI Region Of Interest
- the RoIAlign layer eliminates the harsh quantization of the RoIPool layer and aligns the extracted features with the input correctly.
- the RoIAlign layer can avoid any quantization of RoI boundaries or bins (ie, use x/16 instead of [x/16]). You can use bilinear interpolation to calculate the exact value of the input features at the four regularly sampled locations in each RoI box, and summarize the results (using the maximum value or the average value).
- FIG. 6 is a schematic diagram of using a RoIAlign layer to determine a region of interest according to an embodiment of the application.
- the dotted grid represents the feature map
- the solid line represents the region of interest RoI (2 × 2 grid in this example).
- the value of each sampling point (grid point) is computed by bilinear interpolation, and no quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points. As long as no quantization is performed, the result is not sensitive to the exact sampling positions or the number of sampled points.
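- A minimal sketch of this sampling scheme is shown below: each RoI bin averages a small number of regularly placed sampling points whose values are computed by bilinear interpolation, and no coordinate is quantized. The 2 × 2 output and 2 × 2 sampling pattern mirror the example in FIG. 6; the exact sampling layout in a given RoIAlign implementation may differ.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate feature map feat (H, W, C) at a continuous (y, x)."""
    h, w = feat.shape[:2]
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * feat[y0, x0] + (1 - dy) * dx * feat[y0, x1] +
            dy * (1 - dx) * feat[y1, x0] + dy * dx * feat[y1, x1])

def roi_align(feat, roi, out_size=2, samples=2):
    """RoIAlign over one RoI (x0, y0, x1, y1) given in feature-map coordinates:
    no quantization of the RoI or its bins; each bin averages samples x samples
    regularly placed bilinear samples (the maximum could be used instead)."""
    x0, y0, x1, y1 = roi
    bin_h, bin_w = (y1 - y0) / out_size, (x1 - x0) / out_size
    out = np.zeros((out_size, out_size, feat.shape[2]), dtype=feat.dtype)
    for i in range(out_size):
        for j in range(out_size):
            vals = [bilinear_sample(feat,
                                    y0 + (i + (si + 0.5) / samples) * bin_h,
                                    x0 + (j + (sj + 0.5) / samples) * bin_w)
                    for si in range(samples) for sj in range(samples)]
            out[i, j] = np.mean(vals, axis=0)
    return out

print(roi_align(np.random.rand(8, 8, 3), (1.3, 0.7, 5.9, 4.2)).shape)  # (2, 2, 3)
```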
- NMS Non-Maximum Suppression
- Fig. 7a is a schematic diagram of detection windows provided by an embodiment of the application. The left side of Fig. 7a shows the detection windows determined by a sliding-window-based method; it can be seen that the sliding-window-based method determines multiple detection windows. The right side of Fig. 7a shows the detection window determined by the NMS-based method; it can be seen that the NMS-based method determines only one detection window.
- the determined detection window may be the above-mentioned region of interest.
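- The following is a minimal sketch of greedy NMS over detection windows, assuming boxes in (xmin, ymin, xmax, ymax) form and an illustrative IoU threshold of 0.5: the highest-scoring window is kept, and any remaining window that overlaps it too strongly is suppressed.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; boxes is an (N, 4) array of
    (xmin, ymin, xmax, ymax) and scores an (N,) array of confidences."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # overlap (IoU) of the kept window with every remaining window
        xx1, yy1 = np.maximum(x1[i], x1[rest]), np.maximum(y1[i], y1[rest])
        xx2, yy2 = np.minimum(x2[i], x2[rest]), np.minimum(y2[i], y2[rest])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]    # suppress strongly overlapping windows
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 150, 150]], float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # [0, 2]
```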
- Intersection-over-Union IoU
- Detection Result: the detection result (the predicted bounding box).
- Ground Truth: the real (reference) data.
- FIG. 7b is a schematic diagram of a method for determining the intersection-over-union (IoU) ratio provided by an embodiment of the application.
- the intersection-over-union ratio can be obtained as the ratio of the area of overlap (Area of Overlap) to the area of union (Area of Union), i.e. IoU = Area of Overlap / Area of Union.
- the intersection of BB1 and BB2 (denoted as BB1 ∩ BB2) is defined as the overlapping area of BB1 and BB2,
- and the union of BB1 and BB2 (denoted as BB1 ∪ BB2) is defined as the union area of BB1 and BB2.
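- A small worked example of the intersection-over-union computation, assuming boxes in (xmin, ymin, xmax, ymax) form:

```python
def iou(bb1, bb2):
    """IoU = area(BB1 ∩ BB2) / area(BB1 ∪ BB2) for boxes (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(bb1[0], bb2[0]), max(bb1[1], bb2[1])
    ix2, iy2 = min(bb1[2], bb2[2]), min(bb1[3], bb2[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)            # Area of Overlap
    a1 = (bb1[2] - bb1[0]) * (bb1[3] - bb1[1])
    a2 = (bb2[2] - bb2[0]) * (bb2[3] - bb2[1])
    union = a1 + a2 - inter                                   # Area of Union
    return inter / union if union > 0 else 0.0

# two partially overlapping boxes: overlap 25, union 100 + 100 - 25 = 175
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))                    # ≈ 0.143
```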
- UVD coordinates (u, v, d) can be converted into coordinates in the XYZ format,
- where u, v, d are the coordinates in the UVD format (u and v are the pixel coordinates on the horizontal and vertical image axes, and d is the depth value).
- C x represents the x value of the principal point,
- C y represents the y value of the principal point,
- the principal point can be located at the center of the image (which can be the depth image or the cropped image),
- f x represents the focal length in the x direction,
- and f y represents the focal length in the y direction.
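- A minimal sketch of the UVD-to-XYZ conversion, assuming the standard pinhole-camera model with principal point (C x , C y ) and focal lengths (f x , f y ); the exact form of formula (1) referred to later in this application may differ, and the numeric values below are purely illustrative.

```python
def uvd_to_xyz(u, v, d, fx, fy, cx, cy):
    """Pinhole-model conversion of a key point from image coordinates plus
    depth (u, v, d) to camera-space coordinates (x, y, z)."""
    x = (u - cx) * d / fx
    y = (v - cy) * d / fy
    z = d
    return x, y, z

# illustrative intrinsics: principal point at the center of a 224 x 224 crop
print(uvd_to_xyz(u=120.0, v=100.0, d=450.0, fx=475.0, fy=475.0, cx=112.0, cy=112.0))
```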
- classification is the task of predicting discrete category labels
- regression is the task of predicting continuous quantities.
- classification algorithms can predict continuous values, but those continuous values take the form of probabilities for the category labels,
- and regression algorithms can predict discrete values, but only in the form of integer quantities.
- Convolutional Neural Networks (CNNs) are composed of an input layer, an output layer, and multiple hidden layers.
- the hidden layer of CNN usually consists of a series of convolutional layers convolved with multiplication or other dot products.
- the activation function is usually a Rectified Linear Unit (ReLU) layer, followed by additional layers such as pooling layers, fully connected layers, and normalization layers. They are called hidden layers because their inputs and outputs are masked by the activation function and the final convolution. In turn, the final convolution usually involves back-propagation in order to fit the final result more accurately.
- although these layers are colloquially called convolutions, this is only a convention; mathematically, the operation is a sliding dot product or cross-correlation. This has significance for the indices in the matrix, because it affects how the weight is determined at a specific index point.
- each convolutional layer in the neural network has the following properties: the input is a tensor with shape (number of images) × (image width) × (image height) × (image depth); the convolution kernels, whose width and height are hyperparameters, have a depth equal to the depth of the input. The convolutional layer convolves the input and passes the result to the next layer. This is similar to how neurons in the visual cortex respond to specific stimuli.
- each convolutional neuron processes data only for its receptive field.
- although a fully connected feedforward neural network can be used to learn features and classify data, it is not practical to apply this architecture to images. Due to the very large input size associated with images (each pixel is a relevant variable), a very large number of neurons is required even in a shallow (as opposed to deep) architecture. For example, for a (small) image of size 100 × 100, each neuron in the second layer of a fully connected network has 10,000 weights.
- the convolution operation provides a solution to this problem because it reduces the number of free parameters, allowing the network to be made deeper with fewer parameters.
- for example, tiled regions of size 5 × 5, each sharing the same weights, require only 25 learnable parameters. In this way, it alleviates the vanishing or exploding gradient problems encountered when training traditional deep multilayer neural networks with backpropagation.
- Convolutional networks can include local or global pooling layers to simplify basic calculations.
- the pooling layer reduces the size of the data by combining the output of one layer of neuron clusters into a single neuron of the next layer.
- Local pools combine small clusters that are usually 2 × 2.
- the global pool acts on all neurons in the convolutional layer.
- the combination can calculate the maximum or average value.
- the maximum pool uses the maximum value of each cluster in the neuron cluster of the previous layer.
- the average pool uses the average value of each cluster in the previous layer of neuron clusters.
- a fully connected layer connects every neuron in one layer to every neuron in another layer. In principle, it is the same as the traditional Multi-Layer Perceptron (MLP).
- MLP Multi-Layer Perceptron
- the flattened matrix passes through a fully connected layer to classify the image.
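- A minimal CNN with the ingredients described above (convolution + ReLU, local pooling, global pooling, flattening, and a fully connected classification layer) can be sketched as follows; the layer sizes are illustrative and are not those used in this application. Note that PyTorch orders the input tensor as (number of images) × (channels) × (height) × (width).

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # hidden convolutional layer
    nn.ReLU(),                                   # ReLU activation
    nn.MaxPool2d(2),                             # local pooling over 2 x 2 clusters
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                     # global (average) pooling
    nn.Flatten(),                                # flatten before the dense layer
    nn.Linear(32, 10),                           # fully connected classification layer
)

depth_batch = torch.randn(4, 1, 96, 96)          # 4 single-channel 96 x 96 images
print(model(depth_batch).shape)                  # torch.Size([4, 10])
```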
- the posture determination method in the embodiment of the present application can be applied to a posture guided structured region ensemble network (pose-guided structured region ensemble network, Pose-REN).
- Pose-REN: pose-guided structured region ensemble network.
- FIG. 8 is a schematic diagram of the architecture of a Pose-REN network provided by an embodiment of the application.
- the Pose-REN network can obtain a depth image and use a simple CNN (Init-CNN) to predict an initial hand posture pose 0 ,
- and pose 0 can be used as the initialization of the cascade structure (Initialize);
- then the two parts of pose-guided region extraction and structured region ensemble are used to generate the posture pose t from the posture pose t-1 .
- the value of t is a number greater than or equal to 1.
- the hierarchical integration of different joint features using structural connections can be achieved in the following way: a rectangular window is used to extract the features of multiple grid regions from the feature map, and the features of the multiple grid regions correspond to preset key points.
- the features of the multiple grid regions may include the feature of the tip point (T) of the thumb, the feature of the first joint point (R) of the thumb, the feature of the palm point, the feature of the tip point (T) of the little finger,
- and the feature of the first joint point (R) of the little finger. The features of the multiple grid regions are then sent one by one into multiple fully connected layers f c , with the feature of each grid region corresponding to one fully connected layer f c ; through the multiple fully connected layers f c , the feature of each key point can be obtained. Then, the features of the key points belonging to the same finger are concatenated using the concat function to obtain the feature of each finger. For example, after the features of these grid regions are sent to the fully connected layers f c , the feature of the key point corresponding to each grid region can be obtained through the fully connected layer f c .
- the key points are, for example, the tip point of the thumb and the first joint point of the thumb.
- the obtained features of the five fingers are input into five fully connected layers f c and concatenated using the concat function to obtain a fused feature including the five finger joints of the human hand.
- the fused feature of the five finger joints is input to a fully connected layer f c , and the hand posture pose t is regressed through the regression model of the fully connected layer f c .
- pose t may then be assigned to pose t-1 , so that in the next prediction the posture pose t-1 can be used to extract features from the feature map.
- a simple CNN predicts pose 0 as the initialization of the cascade framework.
- the feature region is extracted from the feature map generated by CNN under the guidance of pose t-1 , and the tree structure is used for hierarchical fusion.
- pose t is the refined hand posture obtained by Pose-REN, which will be used as the guide in the next stage.
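- The cascade described above can be summarized by the following schematic loop; init_cnn, backbone, extract_guided_regions and structured_region_ensemble are hypothetical placeholders standing in for the networks of FIG. 8, not components defined in this application.

```python
def pose_ren_cascade(depth_image, init_cnn, backbone,
                     extract_guided_regions, structured_region_ensemble,
                     num_stages=3):
    """Schematic Pose-REN refinement loop (a sketch, not the exact method)."""
    feature_map = backbone(depth_image)        # CNN feature map of the depth image
    pose = init_cnn(depth_image)               # pose 0: initialization of the cascade
    for _ in range(num_stages):                # t = 1 .. T
        # extract grid-region features from the feature map under the guidance
        # of pose t-1, then fuse them hierarchically (joints -> fingers -> hand)
        regions = extract_guided_regions(feature_map, pose)
        pose = structured_region_ensemble(regions)   # regressed pose t
    return pose
```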
- the terms gesture and hand gesture can be used interchangeably.
- FIG. 9 is a schematic structural diagram of a gesture determination device provided by an embodiment of the application.
- the gesture determination device includes: a first backbone feature extractor 901, a bounding box detection head 902, a bounding box selection unit 903, and a RoIAlign feature extractor 904.
- the first backbone feature extractor 901 may be the same as the first backbone feature extractor 5011 in FIG. 5
- the bounding box detection head 902 may be the same as the bounding box detection head 5012 in FIG. 5, and the bounding box selection unit 903 is used to crop the depth image to obtain the image enclosed by the bounding box.
- the ROIAlign feature extractor 904 may be the second backbone feature extractor described above.
- FIG. 10 is a schematic flowchart of a posture determination method provided by an embodiment of the application. As shown in FIG. 10, the method is applied to a posture determination device, and the method includes:
- the gesture determination device can be any device with image processing functions.
- the gesture determination device can be a server, a mobile phone, a tablet computer, a laptop, a handheld computer, a personal digital assistant, a portable media player, a smart speaker, a navigation device, a display device, a wearable device such as a smart bracelet, a virtual reality (VR) device, an augmented reality (AR) device, a pedometer, a digital TV, a desktop computer, a base station, a relay station, an access point, a vehicle-mounted device, a wearable device, or the like.
- VR virtual reality
- AR augmented reality
- the first image is an image including the hand area.
- the first image may be an original image obtained by shooting (for example, the aforementioned depth image).
- the first image may be a hand image obtained by cropping the hand from the original image obtained by shooting (for example, the image surrounded by the aforementioned bounding box).
- the depth image can also be referred to as a 3D image.
- the gesture determination device may include a camera (for example, a TOF camera), and a depth image is captured by the camera, so that the gesture determination device can obtain the depth image.
- the gesture determination device may receive depth images sent by other devices to obtain depth images.
- the plane feature of the key point may be the plane feature of the UV plane of the key point in the UVD coordinate system.
- the planar feature can be a feature located on the UV plane or projected onto the UV plane.
- the planar feature of the key point can be a deep feature in the neural network, so that the planar feature of the key point can be directly input to the first fully connected layer, and the first fully connected layer can perform regression calculation on the planar feature of the key point to obtain the UV coordinates of the key point.
- the coordinate form of the UV coordinates can be (u, v, 0).
- the depth feature of the key point can be the depth feature of the key point corresponding to the D axis in the UVD coordinate system.
- the depth feature can be the feature projected on the D axis.
- the depth feature of the key point can be the deep feature in the neural network.
- the depth feature of the key point can be directly input to the second fully connected layer, so that the second fully connected layer can regress the depth feature of the key point to obtain the D coordinate (also called the depth coordinate) of the key point;
- the coordinate form of the D coordinate can be (0, 0, d).
- the plane feature and/or depth feature of the key point may be a feature describing each finger, for example, at least one of length, bending degree, bending shape, relative position, thickness degree, joint position, and the like.
- the plane features and/or depth features of the key points in the embodiments of the present application may be features in a feature map with a width W of 1, a height H of 1, and a channel number of 128 (1 × 1 × 128).
- in other embodiments, the plane feature and/or depth feature of the key point can be features in a feature map whose width takes another value (for example, any integer from 2 to 7), whose height takes another value (for example, any integer from 2 to 7),
- and whose number of channels takes another value (for example, 32, 64, 128, 256, 512, etc.).
- the features of the key points may be the features corresponding to the key points.
- these features corresponding to the key points may be global or local features.
- for example, the feature corresponding to the key point at the tip of the thumb can be certain features of the fingertip part or region of the thumb, and/or certain features of the whole hand.
- the key points may include at least one of the following: finger joint points, fingertip points, palm root points, and palm center points.
- the key points may include the following 20 points: 3 joint points for each of the index finger, middle finger, ring finger, and little finger, 2 joint points of the thumb, the fingertip points of the five fingers, and the palm root point.
- the base of the palm is located at the part of the palm close to the wrist.
- the palm base point may also be replaced with the palm center point, or the key point may include not only the palm base point, but also the palm center point.
- S1003 Determine the plane coordinates of the key point based on the plane feature of the key point.
- the plane feature of the key point can be input to the first fully connected layer, and the first fully connected layer can perform regression calculation on the plane feature of the key point to obtain the plane coordinates of the key point.
- the first fully connected layer may be the first regression layer or the first regression head for estimating gestures. That is, the input of the first fully connected layer is the plane feature of the key point, and the output of the first fully connected layer is the plane coordinate of the key point.
- S1005 Determine the depth coordinates of the key point based on the depth feature of the key point.
- steps of S1003 and S1005 in the embodiment of the present application can be executed in parallel. In other embodiments, the steps of S1003 and S1005 may be executed sequentially.
- the depth feature of the key point may be input to the second fully connected layer, and the second fully connected layer may perform regression calculation on the depth feature of the key point to obtain the depth coordinates of the key point.
- the second fully connected layer may be a second regression layer or a second regression head for estimating gestures. That is, the input of the second fully connected layer is the depth feature of the key point, and the output of the second fully connected layer is the depth coordinate of the key point.
- S1007 Determine the posture of the region of interest corresponding to the key point based on the plane coordinates of the key point and the depth coordinate of the key point.
- the region of interest may include a hand region.
- the region of interest is the same as the first image.
- the region of interest may be a hand region determined from the original image.
- the region of interest may also be a region of other parts of the human body, for example, a face region, a head region, a leg region, or an arm region.
- the region of interest can also be a certain part of a robot or a certain part of another animal, and any region of interest that can be used with the posture determination method in the embodiment of this application should be within the protection scope of this application.
- the posture of the region of interest may be a hand posture
- the hand posture may be a posture corresponding to the position coordinates of each of the 20 key points listed above, that is, after the posture determination device determines the position coordinates of each key point, the posture corresponding to the position coordinates of each key point can be determined.
- the position coordinates of the key points can be UVD coordinates, or can be coordinates in other coordinate systems obtained through coordinate conversion of UVD coordinates.
- the posture determination device may determine the meaning represented by the hand posture according to the position coordinates of each key point, for example, the hand posture represents a fist gesture or a victory gesture.
- since the plane features of the key points and the depth features of the key points in the first image are extracted separately, and the plane coordinates of the key points and the depth coordinates of the key points are determined separately, the plane features and the depth features of the key points do not interfere with each other, which improves the accuracy of the determined plane coordinates and depth coordinates of the key points. Moreover, because the plane features and the depth features of the key points are extracted directly, the method of extracting features to determine the posture of the region of interest is simple.
- FIG. 11 is a schematic flowchart of another posture determination method provided by an embodiment of this application. As shown in FIG. 11, the method is applied to a posture determination device.
- step S1001 may include steps S1101 to S1107:
- the RoIAlign layer or the RoIAlign feature extractor in the posture determination device can determine the region of interest from the first image. After the posture determination device obtains the first image, it can input the first image to the RoIAlign layer, or, as in the embodiment corresponding to FIG. 9, pass the first image through the first backbone feature extractor, the bounding box detection head, and the bounding box selection unit in sequence to reach the RoIAlign feature extractor or the RoIAlign layer, output the features of the first image through the RoIAlign layer, and output the region of interest framed by the detection frame or bounding box.
- the posture determination device can crop the first image to obtain the region of interest.
- the RoIAlign layer can send the features in the region of interest framed by the bounding box to the feature extractor.
- S1103 Acquire a first feature of interest of a key point in the region of interest.
- the feature extractor of the posture determination device can extract the feature of interest of the key point from the second feature of interest in the region of interest.
- S1105 Extract the plane feature of the key point from the first feature of interest of the key point.
- the plane encoder (or UV encoder or UV encoder) of the posture determination device can extract the plane feature of the key point from the first feature of interest of the key point.
- S1107 Extract the depth feature of the key point from the first feature of interest of the key point.
- the steps S1105 and S1107 in the embodiment of the present application can be executed in parallel. In other embodiments, the steps of S1105 and S1107 may be executed sequentially.
- the depth encoder of the posture determination device can extract the depth feature of the key point from the first feature of interest of the key point.
- S1103 may include step A1 and step A3:
- Step A1 Obtain the second feature of interest of the key point; the second feature of interest has a lower level than the first feature of interest.
- Step A3 Determine the first feature of interest based on the second feature of interest and the first convolutional layer.
- the second feature of interest may be a relatively shallow feature, for example, features such as edges, corners, colors, and textures.
- the first feature of interest may be a relatively deep feature, for example, features such as shape, length, degree of curvature, and relative position.
- step A1 may include: step A11 and step A15:
- Step A11 Obtain a first feature map of the region of interest.
- the features in the first feature map of the region of interest may be features in the region of interest, or features extracted from features in the region of interest.
- Step A13 Extract a third feature of interest (7 × 7 × 256) of key points from the first feature map; the third feature of interest has a lower level than the second feature of interest.
- the second feature of interest is more abstract than the third feature of interest.
- Step A15 Determine a second feature of interest based on the third feature of interest and the second convolutional layer; the number of channels of the second convolutional layer is less than the number of channels of the first feature map.
- step A3 may include step A31 and step A33:
- Step A31 Input the second feature of interest to the first residual unit, and obtain the first specific feature through the first residual unit; wherein the first residual unit includes M first residual blocks connected in sequence; each first residual block includes a first convolutional layer, and a jump connection between the input of the first convolutional layer and the output of the first convolutional layer; M is greater than or equal to 2.
- Each first residual block is used to add the feature input to each first residual block and the feature output from each first residual block, and input to the next processing part.
- in the embodiment of the present application, the M first residual blocks are the same residual block, and the value of M is 4.
- in other embodiments, the M first residual blocks may be different residual blocks, and/or M may be another value, for example, 2, 3, 5, and so on.
- Step A33 Perform pooling processing on the first specific feature to obtain the first feature of interest.
- the first pooling layer may down-sample the feature map corresponding to the first specific feature at least once (for example, down-sampling twice), so as to obtain the first feature of interest.
- the third feature of interest is convolved with the second convolutional layer, and the number of channels of the obtained second feature of interest is greatly reduced compared with the number of channels of the third feature of interest. Therefore, the second convolutional layer can retain the key features of the third feature of interest while greatly reducing the amount of calculation for obtaining the plane coordinates and the depth coordinates.
- since the first residual unit is used to process the second feature of interest, the vanishing gradient problem can be effectively alleviated and the features of the key points can be extracted.
- through the pooling processing of the first pooling layer, the amount of calculation can be further reduced.
- FIG. 12 is a schematic structural diagram of a feature extractor provided by an embodiment of the application. As shown in FIG. 12, the feature extractor includes: a second feature of interest acquisition unit 1201, a first residual unit 1202, and a first pooling layer 1203.
- the input terminal of the second feature-of-interest acquisition unit 1201 can be connected to the RoIAlign layer; the RoIAlign layer inputs the third feature of interest of the key point to the second feature-of-interest acquisition unit 1201,
- the second feature-of-interest acquisition unit 1201 processes the third feature of interest of the key point to obtain the second feature of interest of the key point, and the second feature of interest of the key point is input to the first residual unit 1202.
- the first residual unit 1202 may include four first residual blocks 1204. The first of the four first residual blocks 1204 receives the second feature of interest of the key point input by the second feature-of-interest acquisition unit 1201. Each first convolutional layer is used to perform calculation on its input feature and output the calculated feature, and each first residual block 1204 is used to add its input feature and its output feature and input the summed feature to the next layer. For example, the input of the first first residual block 1204 is the second feature of interest of the key point; the second feature of interest of the key point is added to the feature obtained by convolving it with the first convolutional layer, and the summed feature is input to the second first residual block 1204. It should be noted that the fourth first residual block 1204 inputs its summation result to the first pooling layer 1203, and in the embodiment of the present application the summation result obtained by the fourth first residual block 1204 is the first specific feature.
- an embodiment of the present application provides an implementation manner for obtaining a first feature of interest of a key point.
- the second feature-of-interest acquisition unit 1201 receives the third feature of interest of the key point sent by the RoIAlign layer.
- the size of the third feature of interest of the key point is 7 × 7 × 256 (the size is 7 × 7, and the number of channels is 256); the second feature-of-interest acquisition unit 1201 performs convolution calculation on the third feature of interest of the key point with the second convolutional layer to reduce the number of channels and obtain the second feature of interest of the key point.
- the size of the second convolutional layer is 3 × 3 × 128, and the size of the second feature of interest of the key point is 7 × 7 × 128.
- the second feature-of-interest acquisition unit 1201 inputs the obtained 7 × 7 × 128 second feature of interest to the first residual unit 1202, and the first residual unit 1202 processes the second feature of interest and inputs the obtained first specific feature with a size of 7 × 7 × 128 to the first pooling layer 1203. The first pooling layer 1203 down-samples the first specific feature with a size of 7 × 7 × 128 twice; the first down-sampling obtains an intermediate feature, and the second down-sampling obtains the first feature of interest of the key point with a size of 3 × 3 × 128.
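- The feature extractor of FIG. 12 can be sketched as follows in PyTorch. The channel-reduction convolution, the four residual blocks, and the two down-sampling steps follow the sizes stated above (7 × 7 × 256 → 7 × 7 × 128 → 3 × 3 × 128); the pooling kernels/strides and the placement of the ReLU activation are assumptions chosen to reproduce those sizes, not details given in the application.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A 3 x 3, 128-channel convolution whose input is added to its output
    through a skip connection (one 'first residual block')."""
    def __init__(self, channels=128):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return x + torch.relu(self.conv(x))

class KeypointFeatureExtractor(nn.Module):
    """Sketch of FIG. 12: channel reduction (second convolutional layer),
    four residual blocks (first residual unit), and two pooling steps (first
    pooling layer) mapping 7 x 7 x 256 -> 3 x 3 x 128."""
    def __init__(self):
        super().__init__()
        self.reduce = nn.Conv2d(256, 128, kernel_size=3, padding=1)   # 7x7x256 -> 7x7x128
        self.res = nn.Sequential(*[ResidualBlock(128) for _ in range(4)])
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1),          # 7x7 -> 5x5
                                  nn.MaxPool2d(3, stride=1))          # 5x5 -> 3x3

    def forward(self, roi_feat):
        return self.pool(self.res(self.reduce(roi_feat)))

extractor = KeypointFeatureExtractor()
print(extractor(torch.randn(1, 256, 7, 7)).shape)   # torch.Size([1, 128, 3, 3])
```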
- FIG. 13 is a schematic flowchart of another posture determination method provided by an embodiment of this application. As shown in FIG. 13, the method is applied to a posture determination device.
- step S1003 may include steps S1301 and S1303, and step S1005 may include steps S1305 and S1307:
- the planar encoder can use the third convolutional layer to extract the planar features of the key points from the first feature of interest.
- the obtained first feature of interest may be input to the second residual unit of the planar encoder.
- the second pooling layer can input the plane features of the key points to the first fully connected layer, so that the first fully connected layer obtains the plane coordinates of the key points; the first fully connected layer is used to regress the plane coordinates of the key points.
- the third pooling layer may input the depth features of the key points to the second fully connected layer, so that the second fully connected layer obtains the depth coordinates of the key points; the second fully connected layer is used to regress the depth coordinates of the key points.
- S1301 may include steps B1 and B3:
- Step B1 Input the first feature of interest to the second residual unit, and obtain the second specific feature through the second residual unit; wherein the second residual unit includes N second residual blocks connected in sequence; each second residual block includes a third convolutional layer, and the input of the third convolutional layer and the output of the third convolutional layer are connected by a jump connection; N is greater than or equal to 2.
- the N second residual blocks in the embodiment of the present application may be the same residual block, and the value of N is 4. In other embodiments, the N second residual blocks may be different residual blocks, and/or, N is another value.
- Step B3 Perform pooling processing on the second specific feature to obtain the planar feature of the key point.
- the second pooling layer may downsample the feature map corresponding to the second specific feature at least once (for example, downsample twice), so as to obtain the planar feature of the key point.
- S1305 may include steps C1 and C3:
- Step C1 Input the first feature of interest to the third residual unit, and obtain the third specific feature through the third residual unit; wherein the third residual unit includes P third residual blocks connected in sequence; each third residual block includes a fourth convolutional layer, and a jump connection is made between the input of the fourth convolutional layer and the output of the fourth convolutional layer; P is greater than or equal to 2.
- the P third residual blocks in the embodiment of the present application may be the same residual block, and the value of P is 4. In other embodiments, the P third residual blocks may be different residual blocks, and/or P is another value.
- Step C3 Perform pooling processing on the third specific feature to obtain the depth feature of the key point.
- the third pooling layer may down-sample the feature map corresponding to the third specific feature at least once (for example, down-sampling twice), so as to obtain the depth feature of the key point.
- as the network deepens, the extracted features become more and more abstract.
- for example, the features extracted by the second convolutional layer are relatively shallow:
- the second convolutional layer extracts at least one of edge features, corner features, color features, texture features, and the like, while the features extracted by the first convolutional layer are of a deeper level.
- the features extracted by the third and fourth convolutional layers are deeper than those of the first convolutional layer:
- the third or fourth convolutional layer extracts at least one of features such as finger length, degree of curvature, and shape.
- the plane feature and the depth feature of the first feature of interest are respectively extracted by the UV encoder and the depth encoder, and the encoder that obtains the plane feature and the encoder that obtains the depth feature are separated so that they do not interfere with each other.
- the plane coordinates and the depth coordinates are obtained from the plane feature and the depth feature respectively, and two-dimensional coordinates are easier to obtain by regression than the three-dimensional coordinates in the prior art. Since the second residual unit and the third residual unit are used to process the first feature of interest, the vanishing gradient problem can be effectively alleviated and the features of the key points can be extracted. Through the pooling processing of the second and third pooling layers, the amount of calculation can be further reduced.
- Fig. 14a is a schematic diagram of the architecture of a posture determination network provided by an embodiment of the application.
- the posture determination network includes: a feature extractor 1410, a planar encoder 1420, a depth encoder 1430, and a first fully connected layer 1440 and the second fully connected layer 1450.
- FIG. 14b is a schematic structural diagram of a planar encoder provided by an embodiment of the application. As shown in FIGS. 14a and 14b, the planar encoder 1420 includes: a second residual unit 1421 and a second pooling layer 1422.
- the second residual unit 1421 may include four second residual blocks 1423.
- the process of obtaining the planar feature according to the first feature of interest by the four second residual blocks 1423 may correspond to the four first residual blocks in FIG. 12
- the process of obtaining the first feature of interest based on the second feature of interest is similar, and will not be repeated here.
- when the second residual unit 1421 obtains the first feature of interest of the key point with a size of 3 × 3 × 128, it can process the first feature of interest and input the obtained second specific feature of size 3 × 3 × 128 to the second pooling layer 1422. The second pooling layer 1422 down-samples the 3 × 3 × 128 second specific feature twice: the first down-sampling obtains a feature of size 2 × 2 × 128, and the second down-sampling obtains the plane feature of the key point with a size of 1 × 1 × 128.
- the second pooling layer 1422 inputs the obtained plane feature of the key points with a size of 1 × 1 × 128 to the first fully connected layer 1440, and the first fully connected layer 1440 performs regression calculation on the plane feature of the key points to obtain the plane coordinates (or UV coordinates) of each of the 20 key points.
- FIG. 14c is a schematic structural diagram of a depth encoder provided by an embodiment of the application. As shown in FIGS. 14a and 14c, the depth encoder 1430 includes: a third residual unit 1431 and a third pooling layer 1432.
- the third residual unit 1431 may include four third residual blocks 1433.
- the process of obtaining the depth feature according to the first feature of interest by the four third residual blocks 1433 may correspond to the four first residual blocks in FIG. 12
- the process of obtaining the first feature of interest based on the second feature of interest is similar, and will not be repeated here.
- when the third residual unit 1431 obtains the first feature of interest of the key point with a size of 3 × 3 × 128, it can process the first feature of interest and input the obtained third specific feature of size 3 × 3 × 128 to the third pooling layer 1432. The third pooling layer 1432 down-samples the 3 × 3 × 128 third specific feature twice: the first down-sampling obtains a feature of size 2 × 2 × 128, and the second down-sampling obtains the depth feature of the key point with a size of 1 × 1 × 128.
- the third pooling layer 1432 inputs the obtained depth feature of the key points with a size of 1 × 1 × 128 to the second fully connected layer 1450, and the second fully connected layer 1450 performs regression calculation on the depth feature of the key points to obtain the depth coordinate (or D coordinate) of each of the 20 key points.
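- One branch of the architecture of FIG. 14a (the planar encoder of FIG. 14b or the depth encoder of FIG. 14c, followed by its fully connected layer) can be sketched as follows; as above, the pooling kernels and activation placement are assumptions chosen to reproduce the stated 3 × 3 × 128 → 1 × 1 × 128 sizes, and the output dimensionality assumes 20 key points with 2 values (UV) or 1 value (D) each.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Same pattern as in the feature extractor: 3 x 3 conv with a skip connection."""
    def __init__(self, channels=128):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return x + torch.relu(self.conv(x))

class CoordinateHead(nn.Module):
    """Four residual blocks + two pooling steps (3x3x128 -> 2x2x128 -> 1x1x128)
    followed by a fully connected regression layer."""
    def __init__(self, num_keypoints=20, dims_per_point=2):
        super().__init__()
        self.res = nn.Sequential(*[ResidualBlock(128) for _ in range(4)])
        self.pool = nn.Sequential(nn.MaxPool2d(2, stride=1),   # 3x3 -> 2x2
                                  nn.MaxPool2d(2, stride=1))   # 2x2 -> 1x1
        self.fc = nn.Linear(128, num_keypoints * dims_per_point)

    def forward(self, feat):                      # feat: (N, 128, 3, 3)
        return self.fc(self.pool(self.res(feat)).flatten(1))

uv_head = CoordinateHead(dims_per_point=2)        # planar encoder + first FC layer
d_head = CoordinateHead(dims_per_point=1)         # depth encoder + second FC layer
feat = torch.randn(1, 128, 3, 3)
print(uv_head(feat).shape, d_head(feat).shape)    # torch.Size([1, 40]) torch.Size([1, 20])
```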
- FIG. 15 is a schematic flowchart of another posture determination method provided by an embodiment of this application. As shown in FIG. 15, the method is applied to a posture determination device.
- step S1007 may include steps S1501 and S1503:
- the plane coordinate is the UV coordinate
- the UV coordinate is the coordinate output by the first fully connected layer
- the D coordinate (or depth coordinate) is the coordinate output by the second fully connected layer.
- the U-axis coordinate and the V-axis coordinate can be converted into the X-axis coordinate and the Y-axis coordinate by formula (1).
- the UVD coordinates can be converted into XYZ coordinates, so as to obtain the hand posture according to the XYZ coordinates.
- a compact regression head for the task of 3D hand pose estimation is proposed.
- our regression head starts from the region of interest (RoI) features, focusing on the hand area.
- RoI region of interest
- the 3D key points, namely XYZ coordinates, are restored by combining the UV coordinates and the corresponding depth.
- this application also belongs to the category of using a fully connected layer as the final layer to regress the coordinates.
- the examples of this application start from the RoI features instead of the original image.
- the architecture of the regression head is different, that is, apart from the final regression layer (or fully connected layer), convolutional layers are mainly used instead of fully connected layers.
- the examples of this application regress UVD coordinates instead of XYZ coordinates.
- the basic feature extractor extracts the key point features from a 7 × 7 × 256 (height × width × channels) image feature map; the image feature map is first convolved with the 3 × 3 × 128 Conv1 (corresponding to the above-mentioned second convolutional layer) to reduce the number of channels from 256 to 128 (i.e., to save calculation).
- the obtained 7 ⁇ 7 ⁇ 128 feature map is convolved with Conv2 (3 ⁇ 3 ⁇ 128) (corresponding to the above-mentioned first convolutional layer) to further extract the basic key point features.
- Conv2 has a skip connection, so that the input of Conv2 can be added to the output of Conv2.
- Conv2 with skip connection (corresponding to the above-mentioned first residual block) is repeated 4 times.
- through max pooling with a 3 × 3 kernel, i.e. Pool1 (the first pooling layer), the 7 × 7 × 128 key point feature map is down-sampled twice to obtain a key point feature map of size 3 × 3 × 128.
- the UV encoder extracts key point features to perform UV coordinate regression.
- the UV encoder takes the 3 × 3 × 128 key point feature map as input and convolves it with Conv3 (corresponding to the third convolutional layer mentioned above) to output a key point feature map of the same size. Conv3 has a skip connection that adds the input of Conv3 to the output of Conv3.
- Conv3 with the corresponding skip connection (corresponding to the above-mentioned second residual block) is repeated 4 times. After that, through max pooling with a 3 × 3 kernel, i.e. Pool2 (the second pooling layer), the 3 × 3 × 128 key point feature map is down-sampled twice to a size of 1 × 1 × 128.
- the depth encoder extracts key point features to perform depth regression.
- the depth encoder takes the 3 × 3 × 128 key point feature map as input and convolves it with Conv4 (corresponding to the fourth convolutional layer mentioned above) to output a key point feature map of the same size. Conv4 has a skip connection that adds the input of Conv4 to the output of Conv4.
- Conv4 with the corresponding skip connection (corresponding to the above-mentioned third residual block) is repeated 4 times. After that, through max pooling with a 3 × 3 kernel, i.e. Pool3 (the third pooling layer), the 3 × 3 × 128 key point feature map is down-sampled twice to a size of 1 × 1 × 128.
- UV coordinates and depth coordinates are used to calculate XYZ coordinates to obtain hand posture.
- the embodiment of the present application provides a posture determination device, which includes each unit and each module included in each unit; these can be implemented by a processor in the posture determination device, and of course can also be realized by specific logic circuits.
- FIG. 16 is a schematic diagram of the composition structure of a posture determination device provided by an embodiment of the application. As shown in FIG. 16, the posture determination device 1600 includes:
- the plane coding unit 1610 is used to extract plane features of key points from the first image
- the depth encoding unit 1620 is configured to extract the depth features of key points from the first image
- the plane coordinate determining unit 1630 is used to determine the plane coordinates of the key point based on the plane characteristics of the key point;
- the depth coordinate determining unit 1640 is configured to determine the depth coordinate of the key point based on the depth feature of the key point;
- the posture determination unit 1650 is configured to determine the posture of the region of interest corresponding to the key point based on the plane coordinates of the key point and the depth coordinate of the key point.
- the plane encoding unit 1610 may be the above-mentioned plane encoder, and the depth encoding unit 1620 may be the above-mentioned depth encoder.
- the plane coordinate determining unit 1630 may be the aforementioned first fully connected layer, and the depth coordinate determining unit 1640 may be the aforementioned second fully connected layer.
- the posture determination device 1600 further includes: a region determination unit 1660 and a feature extraction unit 1670;
- An area determining unit 1660 configured to determine a region of interest from the first image
- the feature extraction unit 1670 is configured to obtain the first feature of interest of the key points in the region of interest;
- the plane coding unit 1610 is used to extract the plane feature of the key point from the first feature of interest of the key point;
- the depth encoding unit 1620 is used to extract the depth feature of the key point from the first feature of interest of the key point.
- the region determining unit 1660 may be the aforementioned RoIAlign layer, and the feature extraction unit 1670 may be the aforementioned feature extractor.
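A hedged sketch of how these units could be wired together, reusing the illustrative classes above; the module names and the way the RoI feature map is passed in are assumptions, not the exact implementation of the device.

```python
class PoseDeterminationNet(nn.Module):
    """Assumed wiring: feature extraction unit -> plane/depth encoding units
    -> coordinate determination units -> posture determination unit."""
    def __init__(self, fx: float, fy: float, cx: float, cy: float):
        super().__init__()
        self.feature_extraction = BaseKeypointExtractor()  # feature extraction unit 1670
        self.head = UVDRegressionHead()                     # encoders + FC1 / FC2
        self.intrinsics = (fx, fy, cx, cy)

    def forward(self, roi_feat: torch.Tensor) -> torch.Tensor:
        # roi_feat: (B, 256, 7, 7) feature map of the region of interest
        # produced by the RoIAlign layer (region determination unit 1660).
        base = self.feature_extraction(roi_feat)    # first feature of interest
        uv, d = self.head(base)                     # plane and depth coordinates
        return uvd_to_xyz(uv, d, *self.intrinsics)  # posture of the region of interest
```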
- the feature extraction unit 1670 includes: a feature acquisition unit 1671 and a feature determination unit 1672;
- the feature acquisition unit 1671 is configured to acquire a second feature of interest of a key point; the second feature of interest has a lower level than the first feature of interest;
- the feature determining unit 1672 is configured to determine the first feature of interest based on the second feature of interest and the first convolutional layer.
- the feature acquiring unit 1671 may be the second feature of interest acquiring unit in the foregoing embodiment.
- the feature determining unit 1672 includes: a first residual unit 1673 and a first pooling unit 1674;
- the feature acquisition unit 1671 is further configured to input the second feature of interest to the first residual unit;
- the first residual unit 1673 is configured to process the second feature of interest of the key point to obtain the first specific feature; wherein the first residual unit 1673 includes M first residual blocks connected in sequence; each first residual block includes a first convolutional layer, with a skip connection between the input of the first convolutional layer and the output of the first convolutional layer; M is greater than or equal to 2;
- the first pooling unit 1674 is configured to perform pooling processing on the first specific feature to obtain the first feature of interest.
- the first residual unit 1673 may be the aforementioned first residual unit 1202.
- the first pooling unit 1674 may be the aforementioned first pooling layer 1203.
- the feature acquisition unit 1671 is also configured to acquire a first feature map of the region of interest; extract a third feature of interest (7×7×256) of the key points from the first feature map, where the third feature of interest has a lower level than the second feature of interest; and determine the second feature of interest based on the third feature of interest and the second convolutional layer, where the number of channels of the second convolutional layer is less than the number of channels of the first feature map.
- the plane encoding unit 1610 is configured to extract plane features of key points from the first feature of interest using the third convolutional layer;
- the plane coordinate determination unit 1630 is configured to regress the plane coordinates of the key points based on the plane features of the key points.
- the plane encoding unit 1610 includes: a second residual unit 1611 and a second pooling unit 1612;
- the first pooling unit 1674 is also used to input the first feature of interest to the second residual unit 1611;
- the second residual unit 1611 is also configured to process the first feature of interest to obtain a second specific feature; wherein the second residual unit 1611 includes N second residual blocks connected in sequence; each second residual block includes a third convolutional layer, with a skip connection between the input of the third convolutional layer and the output of the third convolutional layer; N is greater than or equal to 2;
- the second pooling unit 1612 is also used to perform pooling processing on the second specific feature to obtain the planar feature of the key point.
- the second residual unit 1611 may be the second residual unit 1421 described above.
- the second pooling unit 1612 may be the second pooling layer 1422 in the above embodiment.
- the plane coordinate determining unit 1630 is also configured to input the plane features of the key points to the first fully connected layer to obtain the plane coordinates of the key points; the first fully connected layer is used to regress the plane coordinates of the key points.
- the depth encoding unit 1620 is configured to use the fourth convolutional layer to extract the depth features of key points from the first feature of interest;
- the depth coordinate determining unit 1640 is configured to regress the depth coordinates of the key points based on the depth features of the key points.
- the depth encoding unit 1620 includes: a third residual unit 1621 and a third pooling unit 1622;
- the first pooling unit 1674 is also used to input the first feature of interest to the third residual unit 1621;
- the third residual unit 1621 is configured to process the first feature of interest to obtain the third specific feature; wherein the third residual unit 1621 includes P third residual blocks connected in sequence; each third residual block includes a fourth convolutional layer, with a skip connection between the input of the fourth convolutional layer and the output of the fourth convolutional layer; P is greater than or equal to 2;
- the third pooling unit 1622 is configured to perform pooling processing on the third specific feature to obtain the depth feature of the key point.
- the third residual unit 1621 may be the aforementioned third residual unit 1431.
- the third pooling unit 1622 may be the third pooling layer 1432 in the above embodiment.
- the depth coordinate determining unit 1640 is also configured to input the depth features of the key points to the second fully connected layer to obtain the depth coordinates of the key points; the second fully connected layer is used to regress the depth coordinates of the key points.
- the posture determination unit 1650 is further configured to determine the X-axis coordinates and Y-axis coordinates in the XYZ coordinate system based on the plane coordinates and the depth coordinates, and to take the posture corresponding to the X-axis coordinates, the Y-axis coordinates, and the depth coordinates as the posture of the region of interest.
- the region of interest includes the hand region.
- the key points include at least one of the following: finger joint points, fingertip points, palm base points, and palm center points.
- In the embodiments of the present application, if the above posture determination method is implemented in the form of a software function module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
- Based on this understanding, the technical solutions of the embodiments of the present application, in essence or in the part that contributes to the related art, can be embodied in the form of a software product.
- The computer software product is stored in a storage medium and includes several instructions for enabling a posture determination device to execute all or part of the methods described in the embodiments of the present application.
- The aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, and other media that can store program code. In this way, the embodiments of the present application are not limited to any specific combination of hardware and software.
- FIG. 17 is a schematic diagram of the hardware entity of a gesture determination device provided by an embodiment of the application.
- the hardware entity of the gesture determination device 1700 includes: a processor 1701 and a memory 1702, where the memory 1702 stores a computer program that can run on the processor 1701, and the processor 1701 implements the steps of the method in any of the foregoing embodiments when executing the program.
- the memory 1702 may be configured to store instructions and applications executable by the processor 1701, and may also cache data to be processed or already processed by each module (for example, image data, audio data, voice communication data, and video communication data); the memory 1702 can be implemented by flash memory (FLASH) or random access memory (RAM).
- the processor 1701 and the memory 1702 may be packaged together.
- the embodiment of the present application provides a computer storage medium, and the computer storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps in any of the foregoing methods.
- FIG. 18 is a schematic structural diagram of a chip provided by an embodiment of the present application.
- the chip 1800 shown in FIG. 18 includes a processor 1801, and the processor 1801 can call and run a computer program from the memory to implement the steps of the method executed by the posture determination device in the embodiment of the present application.
- the chip 1800 may further include a memory 1802.
- the processor 1801 may call and run a computer program from the memory 1802 to implement the steps of the method executed by the posture determination device in the embodiment of the present application.
- the memory 1802 may be a separate device independent of the processor 1801, or may be integrated in the processor 1801.
- the chip 1800 may further include an input interface 1803.
- the processor 1801 can control the input interface 1803 to communicate with other devices or chips, and specifically, can obtain information or data sent by other devices or chips.
- the chip 1800 may further include an output interface 1804.
- the processor 1801 can control the output interface 1804 to communicate with other devices or chips, and specifically, can output information or data to other devices or chips.
- the chip 1800 can be applied to the posture determination device in the embodiments of the present application, and the chip 1800 can implement the corresponding processes implemented by the posture determination device in the various methods of the embodiments of the present application. For brevity, details are not repeated here.
- the chip 1800 may be a baseband chip in a posture determination device.
- the chip 1800 mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip, etc.
- the embodiments of the present application provide a computer program product.
- the computer program product includes a computer storage medium, and the computer storage medium stores computer program code.
- the computer program code includes instructions that can be executed by at least one processor; when the instructions are executed by the at least one processor, the steps of the method executed by the posture determination device in the above methods are implemented.
- the computer program product can be applied to the posture determination device in the embodiments of the present application, and the computer program instructions cause the computer to execute the corresponding processes implemented by the posture determination device in the various methods of the embodiments of the present application. For brevity, details are not repeated here.
- the processor of the embodiment of the present application may be an integrated circuit chip with signal processing capability.
- the steps of the foregoing method embodiments can be completed by hardware integrated logic circuits in the processor or instructions in the form of software.
- the above-mentioned processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
- the methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed.
- the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
- the steps of the method disclosed in the embodiments of the present application can be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
- the software module may be located in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers.
- the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
- the memory in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
- the non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or flash memory.
- the volatile memory may be random access memory (Random Access Memory, RAM), which is used as an external cache.
- the memory in the embodiments of the present application may also be static random access memory (static RAM, SRAM), dynamic random access memory (dynamic RAM, DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM), direct Rambus random access memory (Direct Rambus RAM, DR RAM), and so on.
- the sequence numbers of the above processes do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
- the serial numbers of the foregoing embodiments of the present application are only for description, and do not represent the advantages and disadvantages of the embodiments.
- unless otherwise specified, when the posture determination device executes any step in the embodiments of the present application, the step may be executed by the processor of the posture determination device.
- the embodiment of the present application does not limit the sequence in which the posture determination device executes the following steps.
- the methods used to process data in different embodiments may be the same method or different methods.
- any step in the embodiment of the present application can be independently executed by the posture determination device, that is, when the posture determination device executes any step in the foregoing embodiments, it may not rely on the execution of other steps.
- the disclosed device and method may be implemented in other ways.
- the device embodiments described above are merely illustrative.
- the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
- the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiments.
- the functional units in the embodiments of the present application may all be integrated into one processing unit, or each unit may be used individually as a unit, or two or more units may be integrated into one unit;
- the integrated unit can be implemented in the form of hardware, or in the form of hardware plus software functional units.
- all or part of the steps of the foregoing method embodiments can be implemented by program instructions and related hardware; the foregoing program can be stored in a computer-readable storage medium.
- when the program is executed, it performs the steps of the foregoing method embodiments; and the foregoing storage medium includes various media that can store program code, such as a removable storage device, a read-only memory (ROM), a magnetic disk, or an optical disk.
- alternatively, if the aforementioned integrated unit of the present application is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
- the computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the methods described in the various embodiments of the present application.
- the aforementioned storage media include: removable storage devices, ROMs, magnetic disks or optical discs and other media that can store program codes.
- in the embodiments of the present application, the term "and" does not impose an order on the steps; for example, "the gesture determination device executes A and executes B" may mean that the gesture determination device executes A first and then executes B, executes B first and then executes A, or executes A and B at the same time.
- the embodiments of the application provide a posture determination method, device, equipment, storage medium, chip, and product, which can improve the accuracy of the determined plane coordinates and depth coordinates of the key points, and make the way of extracting the features used for determining the posture of the region of interest simple.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Human Computer Interaction (AREA)
- General Engineering & Computer Science (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biodiversity & Conservation Biology (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Image Analysis (AREA)
Abstract
本申请实施例公开了一种姿势确定方法、装置、设备、存储介质、芯片及产品,该方法包括:包括:从第一图像中提取关键点的平面特征和所述关键点的深度特征;基于所述关键点的平面特征确定所述关键点的平面坐标;基于所述关键点的深度特征确定所述关键点的深度坐标;基于所述关键点的平面坐标和所述关键点的深度坐标,确定与所述关键点对应的感兴趣区域的姿势。
Description
相关申请的交叉引用
本申请基于申请号为62/938,187、申请日为2019年11月20日、申请名称为“COMPACT REGRESSION HEAD FOR EFFICIENT 3D HAND POSE ESTIMATION FOR MOBILE TOF CAMERA”的在先美国临时专利申请提出,并要求该在先美国临时专利申请的优先权,该在先美国临时专利申请的全部内容在此引入本申请作为参考。
本申请实施例涉及但不限于机器学习技术,尤其涉及一种姿势确定方法、装置、设备、存储介质、芯片及产品。
随着计算机视觉技术的不断发展,人们开始追求更加自然和谐的人机交互方式,手部运动是人类交互的重要渠道,人手不仅可以表达语义信息,还可以定量的表达空间方向和位置信息,这将有助于构建更自然、高效的人机交互环境。
手是人体较为灵活的部位,相较其他交互方式而言,将手势作为人机交互的手段显得更加自然,因此手势识别技术是人机交互的一大研究点。
发明内容
本申请实施例提供一种姿势确定方法、装置、设备、存储介质、芯片及产品。
第一方面,提供一种姿势确定方法,包括:
从第一图像中提取关键点的平面特征和所述关键点的深度特征;
基于所述关键点的平面特征确定所述关键点的平面坐标;
基于所述关键点的深度特征确定所述关键点的深度坐标;
基于所述关键点的平面坐标和所述关键点的深度坐标,确定与所述关键点对应的感兴趣区域的姿势。
第二方面,提供一种姿势确定装置,包括:
平面编码单元,用于从第一图像中提取关键点的平面特征;
深度编码单元,用于从所述第一图像中提取所述关键点的深度特征;
平面坐标确定单元,用于基于所述关键点的平面特征确定所述关键点的平面坐标;
深度坐标确定单元,用于基于所述关键点的深度特征确定所述关键点的深度坐标;
姿势确定单元,用于基于所述关键点的平面坐标和所述关键点的深度坐标,确定与所述关键点对应的感兴趣区域的姿势。
第三方面,提供一种姿势确定设备,包括:存储器和处理器,
所述存储器存储有可在处理器上运行的计算机程序,
所述处理器执行所述程序时实现上述方法中的步骤。
第四方面,提供一种计算机存储介质,其中,所述计算机存储介质存储有一个或者多个程序,所述一个或者多个程序可被一个或者多个处理器执行,以实现上述方法中的步骤。
第五方面,提供一种芯片,包括:处理器,用于从存储器中调用并运行计算机程序,使得安装有所述芯片的设备执行上述方法中的步骤。
第六方面,提供一种计算机程序产品,所述计算机程序产品包括计算机存储介质,所述计算机存储介质存储计算机程序代码,所述计算机程序代码包括能够由至少一个处理器执行的指令,当所述指令由所述至少一个处理器执行时实现上述方法中的步骤。
本申请实施例中,由于分别提取第一图像中关键点的平面特征和关键点的深度特征,并分别确定关 键点的平面坐标和关键点的深度坐标,从而关键点的平面特征与关键点的深度特征之间不存在干扰,进而提高了确定关键点的平面坐标和关键点的深度坐标的准确性,并且,由于提取的是关键点的平面特征和关键点的深度特征,能够使得提取用于确定感兴趣区域的姿势的特征的提取方式简单。
图1为本申请实施例提供的一种由TOF相机获取的深度图像的示意图;
图2为本申请实施例提供的一种手存在的概率值和手边界框的输出结果示意图;
图3为本申请实施例提供的一种手部的关键点位置的示意图;
图4为本申请实施例提供的一种手部姿势的估计结果示意图;
图5为本申请实施例提供的一种手部姿势检测流水线的示意图;
图6为本申请实施例提供的一种利用RoIAlign层确定感兴趣区域的示意图;
图7a为本申请实施例提供的一种检测窗口的示意图;
图7b为本申请实施例提供的一种交并比的确定方式的示意图;
图8为本申请实施例提供的一种Pose-REN网络的架构示意图;
图9为本申请实施例提供的一种手势确定装置的结构示意图;
图10为本申请实施例提供的一种姿势确定方法的流程示意图;
图11为本申请实施例提供的另一种姿势确定方法的流程示意图;
图12为本申请实施例提供的一种特征提取器的结构示意图;
图13为本申请实施例提供的另一种姿势确定方法的流程示意图;
图14a为本申请实施例提供的一种姿势确定网络的架构示意图;
图14b为本申请实施例提供的一种平面编码器的结构示意图;
图14c为本申请实施例提供的一种深度编码器的结构示意图;
图15为本申请实施例提供的另一种姿势确定方法的流程示意图;
图16为本申请实施例提供的一种姿势确定装置的组成结构示意图;
图17为本申请实施例提供的一种姿势确定设备的硬件实体示意图;
图18是本申请实施例提供的一种芯片的结构示意图。
下面将通过实施例并结合附图具体地对本申请的技术方案进行详细说明。下面这几个具体的实施例可以相互结合,对于相同或相似的概念或过程可能在某些实施例中不再赘述。
需要说明的是:在本申请实例中,“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。
另外,本申请实施例所记载的技术方案之间,在不冲突的情况下,可以任意组合。
从图像中准确有效地重建人手运动的能力,有望在沉浸式虚拟现实和增强现实中,机器人控制和手语识别中获得令人兴奋的新应用。近年来,尤其是随着深度相机的出现,手势识别取得了长足的进步。但是,由于不受约束的全局和局部姿势变化、频繁的遮挡、局部自相似性以及过多的关节运动,手势识别的准确性和容易性仍然是一项艰巨的任务。
手势估计从原始图像中提取图像特征开始,到确定手部姿势的过程中,会花费大量的计算时间。
在说明本申请实施例之前,先介绍相关技术:
飞行时间(Time of fligh,TOF)相机是一种范围成像相机系统,它采用飞行时间技术,通过测量物体的往返时间来确定相机与被摄物体之间的距离,TOF相机由激光或LED提供的光信号。TOF相机输出尺寸为高度H×宽度W的二维(2D)图像,该2D图像上的每个像素值表示对象的深度值(像素值的范围可以为0mm至3000mm)。图1为本申请实施例提供的一种由TOF相机获取的深度图像的示意图。在本申请实施例中,深度图像可以为TOF相机捕获的图像。
手检测是输入深度图像,然后输出手存在的概率值(概率值为从0到1的数字,较大的值表示手存在的置信度高)和手边界框或检测盒(代表手的位置和大小的边界框)。图2为本申请实施例提供的一种手存在的概率值和手边界框的输出结果示意图,图2中的score:0.999884表是手存在的概率值为0.999884,边界框可以表示为(xmin,ymin,xmax,ymax),在一种实施例中,(xmin,ymin)是边界框的左上角,(xmax,ymax)是边界框的右下角。
2D手姿势估计的方式为:输入深度图像,并输出手骨骼的2D关键点位置,图3为本申请实施例提供的一种手部的关键点位置的示意图。每个关键点都是一个2D坐标(2D坐标例如xy坐标,其中x在水平图像轴上,y在垂直图像轴上)。图3示出了本申请实施例中手部有20个关键点,分别为关键点0至19。
图4为本申请实施例提供的一种手部姿势的估计结果示意图,如图4所示,可以根据每个2D关键点对应在深度图像中的位置,来确定手部姿势的估计结果。
3D手姿势估计的方式为:输入深度图像,并输出手骨骼的3D关键点位置,图3示出了这些手部的关键点位置。每个关键点位置都可以是3D坐标(例如,xyz坐标,其中x在水平图像轴上,y在垂直图像轴上,z在深度方向上)。3D手势估计是当前研究的热点。
手部姿势检测流水线包括手部检测和手部姿势估计的过程。图5为本申请实施例提供的一种手部姿势检测流水线的示意图,如图5所示,手部检测501的部分可以采用第一主干特征提取器5011和边界框检测头5012来实现,手部姿势估计502的部分可以采用第二主干特征提取器5021和姿势估计头5022来实现。
首先,可以通过TOF相机拍摄深度图像,并将深度图像输入到第一主干特征提取器,第一主干特征提取器能够提取深度图像中的一些特征,并将该一些特征输入至边界框检测头,边界框检测头能够确定手部的边界框,边界框检测头可以是边界框检测层,在得到手部的边界框后,可以进行边界框调整504,得到调整后的边界框,并根据调整后的边界框对深度图像进行图像裁剪505,得到裁剪后的图像,将裁剪后的图像输入至第二主干特征提取器5021,第二主干特征提取器5021能够提取裁剪后的图像的某些特征,并将该某些特征输入到姿势估计头5022中,姿势估计头5022可以是姿势估计层,姿势估计层能够根据该某些特征对手部的姿势作出估计。在一些实施例中,第一主干特征提取器5011和第二主干特征提取器5021用于提取特征的卷计层可以相同。
手部检测和手部姿势估计的任务是完全分开的。为了连接这两个任务,可以将输出的边界框位置调整为边界框内像素的质心,并且将边界框的大小略微放大以包括所有手部像素。调整后的边界框用于裁剪原始深度图像。裁剪后的图像被输入到手部姿势估计任务中。当第一主干特征提取器5011和第二主干特征提取器5021提取图像特征时,发现重复的计算。
可以采用感兴趣区域对齐(RoIAlign)层确定感兴趣区域(Region Of Interest,ROI),感兴趣区域可以是上述的裁剪后的图像对应的区域。
RoIAlign层消除了感兴趣区域池化(RoIPool)层的苛刻量化,将提取的特征与输入正确对齐。RoIAlign层可以避免对RoI边界或格子(bin)进行任何量化(即,使用x/16代替[x/16])。可以使用双线性插值法来计算每个RoI箱中四个定期采样位置的输入要素的精确值,并汇总结果(使用最大值或平均值)。
图6为本申请实施例提供的一种利用RoIAlign层确定感兴趣区域的示意图。在图6中,虚线网格表示特征图,实线表示感兴趣区域RoI(在此示例中为2×2格),每个bin中有4个采样,RoIAlign层通过从特征图上附近的网格点通过双线性插值计算每个采样点的值,bin或采样点中涉及的任何坐标都不会进行量化。只要不执行量化,结果对精确的采样位置或采样的点数都不敏感。
非最大抑制(Non-Maximum Suppression,NMS)已广泛用于计算机视觉的多个关键方面,并且是许多提议的检测方法的组成部分,可能是边缘,拐角或对象检测。它的必要性源于检测算法无法完美定位感兴趣的概念,从而导致在实际位置附近出现多个检测组。
在物体检测的背景下,基于滑动窗口的方法通常会产生多个具有接近物体正确位置的高分的窗口。这是由于对象检测器的泛化能力,响应函数的平滑度和邻近窗口的视觉相关性的结果。对于理解图像的内容,这种相对密集的输出通常不令人满意。实际上,在此步骤中窗口假设的数量与图像中实际对象的数量完全不相关。因此,NMS的目标是每组仅保留一个窗口,对应于响应函数的精确局部最大值,理想情况下,每个对象仅获得一个检测。
图7a为本申请实施例提供的一种检测窗口的示意图,如图7a所示,图7a的左侧为基于滑动窗口的方法确定的检测窗口,可见基于滑动窗口的方法能够确定多个检测窗口,图7a的右侧为基于NMS的方法确定的检测窗口,可见基于NMS的方法能够仅确定一个检测窗口。
在一些实施例中,通过NMS的方法在手部检测时,确定的一个检测窗口可以是上述的感兴趣区域。
在目标检测的评价体系中,有一个参数叫做交并比(Intersection-over-Union,IoU),即模型产生的目标窗口与原来标记窗口的交叠率。可以简单理解为:检测结果(Detection Result)与真实数据(Ground Truth)的交集比上它们的并集,交并比可以理解为检测的准确率。
图7b为本申请实施例提供的一种交并比的确定方式的示意图。交并比可以根据重叠区域(Area of Overlap)与合并区域(Area of Union)的比值得到。例如,给定两个边界框BB1和BB2,将BB1和 BB2的交集(表示为BB1∩BB2)定义为BB1和BB2的重叠区域,BB1∪BB2定义为BB1和BB2的并集区域。
此处介绍UVD坐标和XYZ坐标之间的转换关系。
UVD坐标和XYZ坐标之间的关系如下公式(1)所示。(x,y,z)是XYZ格式的坐标,(u,v,d)是UVD格式的纵坐标。
在公式(1)中,C
x表示主要点的x值,C
y表示主要点的y值,主要点可以位于图像(可以是深度图像或者裁剪后的图像)的中心,f
x表示x方向上的焦距,f
y表示y方向上的焦距。
此处对分类和回归作出一些解释,分类预测建模问题与回归预测建模问题不同。分类是预测离散类别标签的任务,回归是预测连续量的任务。分类和回归算法之间存在一些重叠,例如:分类算法可以预测连续值,但是该连续值采用类别标签的概率形式,回归算法可以预测离散值,但离散值可以为整数形式。
此处对卷积神经网络(Convolutional Neural Networks,CNN)作出一些解释:卷积神经网络由一个输入层和一个输出层以及多个隐藏层组成。CNN的隐藏层通常由一系列与乘法或其他点积卷积的卷积层组成。激活函数通常是线性整流函数(Rectified Linear Unit,ReLU)层,随后是其他卷积,例如池化层,完全连接的层和归一化层,之所以称为隐藏层,因为它们的输入和输出被激活函数和最终卷积掩盖了。反过来,最后的卷积通常涉及反向传播,以便更准确地量化最终结果。尽管这些层通俗地称为卷积,但这仅是约定。从数学上讲,它是滑点积或互相关。这对矩阵中的索引具有重要意义,因为它会影响在特定索引点确定权重的方式。
对CNN进行编程时,神经网络中的每个卷积层都应具有以下属性:输入是具有形状(图像数量)×(图像宽度)×(图像高度)×(图像深度)的张量。卷积核,其宽度和高度是超参数,并且其深度等于图像的深度。卷积层对输入进行卷积,并将其结果传递到下一层。这类似于视觉皮层中神经元对特定刺激的反应。
每个卷积神经元仅针对其接受场处理数据。尽管可以使用完全连接的前馈神经网络来学习特征以及对数据进行分类,但是将这种体系结构应用于图像并不实际。由于与图像相关联的非常大的输入大小(每个像素都是一个相关变量),即使在浅(与深对面)的体系结构中,也需要非常大量的神经元。例如,大小为100×100的(小)图像的完全连接层在第二层中的每个神经元的权重为10000。卷积操作为该问题提供了解决方案,因为它减少了可用参数的数量,从而允许使用更少的参数来使网络更深入。例如,不管图像大小如何,大小为5×5的切片区域(每个区域具有相同的共享权重)仅需要25个可学习的参数。通过这种方式,它解决了在通过使用反向传播训练具有多层的传统多层神经网络时消失或爆炸梯度的问题。
卷积网络可以包括局部或全局池化层,以简化基础计算。池化层通过将一层神经元簇的输出组合到下一层的单个神经元中来减少数据的大小。局部池结合了通常为2×2的小簇。全局池作用于卷积层的所有神经元。另外,合并可以计算最大值或平均值。最大池使用上一层神经元簇中每个簇的最大值。平均池使用上一层神经元集群中每个集群的平均值。
完全连接的层将一层中的每个神经元连接到另一层中的每个神经元。它在原则上与传统的多层感知器神经网络(Multi-Layer Perceptron,MLP)相同。展平的矩阵经过一个完全连接的层以对图像进行分类。
本申请实施例中的姿势确定方法可以应用于姿势指导结构化区域集成网络(pose guided structured region ensemble network,Pose-REN)。
图8为本申请实施例提供的一种Pose-REN网络的架构示意图,如图8所示,Pose-REN网络可以获取深度图像,采用一个简单的CNN(Init-CNN)预测一个初始的手部姿势pose
0,pose
0可以作为级联结构的初始化(Initialize),并采用pose
0,姿态引导区域提取和结构化区域集成两部分用于根据手势pose
t-1生成手势pose
t。t的取值为大于或等于1的数。将估计的pose
t-1和深度图像作为输入,深度图像送进CNN中来生成特征图,根据输入的手势pose
t-1,来从特征图中提取特征区域,利用结构连接对不同的关节特征进行层次集成,回归细化后的手势pose
t。
利用结构连接对不同的关节特征进行层次集成,可以采用以下方式实现:利用矩形窗口从特征图中提取多个网格区域的特征,多个网格区域的特征可以与预先设定的关键点相对应,例如,多个网格区域 的特征可以包括与拇指指尖点(T)的特征、拇指第一个指节点(R)的特征、手掌点的特征、小指指尖点(T)的特征、小指第一个指节点(R)的特征等,然后将多个网格区域的特征一一送入多个全连接层f
c,每一个网格区域的特征对应一个全连接层f
c,通过多个全连接层f
c,可以得到每个关键点的特征。然后将每个关键点的特征中相同的手指采用contact函数进行拼接,得到每个手指的特征。例如,在将这些网格区域的特征送入到全连接层f
c之后,通过全连接层f
c可以得到每个网格区域对应的关键点的特征,关键点例如是拇指指尖点、拇指第一个指节点等,然后将这些关键点点中相同部位的特征进行拼接或特征融合,例如,将手掌的特征和大拇指的特征进行融合,将手掌的特征和小拇指的特征进行拼接等。之后,将得到的五个手指的特征分别输入到五个全连接层f
c后得到的五个特征,采用contact函数进行拼接,得到包括人手五根手指关节的融合特征,接着将五根手指关节的融合特征输入至全连接层f
c,通过全连接层f
c的回归模型,回归出手部姿势pose
t。在得到pose
t后,可以将pose
t赋予pose
t-1,以在下次进行预测时,可以采用姿势为pose
t-1从特征图中提取特征。
在图8的架构图中,一个简单的CNN(Init-CNN)会将pose
0预测为级联框架的初始化。特征区域从CNN在pose
t-1的指导下生成的特征图中提取,并使用树状结构进行分层融合。姿势是Pose-REN获得的精确手部姿势,将在下一阶段用作指导。
需要说明的是,术语手势和手部姿势可以互相替换。
图9为本申请实施例提供的一种手势确定装置的结构示意图,如图9所示,手势确定装置包括:第一主干特征提取器901、边界框检测头902、边界框选择单元903、ROIAlign特征提取器904和3D姿势估计头905。其中,第一主干特征提取器901可以与图5中的第一主干特征提取器5011相同,边界框检测头902可以与图5中的边界框检测头5012相同,边界框选择单元903用于裁剪深度图像,得到边界框包围的图像。ROIAlign特征提取器904可以是撒上述的第二主干特征提取器。
以下介绍本申请实施例中根据获取到的深度图像进行姿势确定的方式:
图10为本申请实施例提供的一种姿势确定方法的流程示意图,如图10所示,该方法应用于姿势确定设备,该方法包括:
S1001、从第一图像中提取关键点的平面特征和关键点的深度特征。
姿势确定设备可以为任一具有图像处理功能的设备,例如,姿势设备可以是服务器、手机、平板电脑、笔记本电脑、掌上电脑、个人数字助理、便捷式媒体播放器、智能音箱、导航装置、显示设备、智能手环等可穿戴设备、虚拟现实(Virtual Reality,VR)设备、增强现实(Augmented Reality,AR)设备、计步器、数字TV、台式计算机、基站、中继站、接入点、车载设备或可穿戴设备等。
第一图像为包括手部区域的图像。在一些实施例中,第一图像可以是拍摄得到的原始图像(例如是上述的深度图像)。在另一些实施例中,第一图像可以是对拍摄得到的原始图像对手部进行裁剪得到的手部图像(例如是上述的边界框包围的图像)。
深度图像也可以称为3D图像。在一些实施例中,姿势确定设备可以包括摄像头(例如TOF相机),通过摄像头拍摄深度图像,从而使得姿势确定设备获取到深度图像。在另一些实施例中,姿势确定设备可以接收其它设备发送的深度图像,从而得到深度图像。
关键点的平面特征可以是关键点在UVD坐标系下的UV平面的平面特征。平面特征可以是位于UV平面或者投影到UV平面的特征,关键点的平面特征可以是神经网络中深层次的特征,以使关键点的平面特征可以直接输入至第一全连接层,进而使第一全连接层可以对关键点的平面特征回归计算,从而得到关键点的UV坐标,UV坐标的坐标形式可以为(u,v,0)。
关键点的深度特征可以是关键点在UVD坐标系下与D轴对应的深度特征,深度特征可以是投影到D轴上的特征,关键点的深度特征可以是神经网络中深层次的特征,以使关键点的深度特征可以直接输入至第二全连接层,进而使第二全连接层可以对关键点的深度特征回归计算,从而得到关键点的D坐标(也可以称为深度坐标),D坐标的坐标形式可以为(0,0,d)。
关键点的平面特征和/或深度特征可以是描述每根手指的特征,例如,长度、弯曲程度、弯曲形状、相对位置、粗细程度、关节位置等中的至少一个。
本申请实施例中关键点的平面特征和/或深度特征,可以是宽度W为1,高度H为1,通道数为128(1×1×128)的特征图中的特征。在其它实施例中,关键点的平面特征和/或深度特征可以是从其它数值的宽度(例如2~7中的任一整数),其它数值的高度(例如2~7中的任一整数),以及其它数值的通道数(例如,32、64、128、256、512等)中的特征图中的特征。
本申请实施例中,关键点的特征(例如,关键点的平面特征、深度特征、下述的关键点的第一感兴 趣特征、第二感兴趣特征、第三感兴趣特征),可以是与关键点对应的特征,这些与关键点对应的特征所以是全局特征或局部特征,例如,大拇指的指尖处的关键点对应的特征可以是大拇指指尖部位或区域的某些特征,和/或,全部手部的某些特征。
关键点可以包括以下至少之一:手指关节点、指尖点、手掌根部点以及手掌中心点。在本申请实施例中,关键点可以包括以下20个点:食指、中指、无名指、小指中每个手指的3个关节点,大拇指的2个关节点,五个手指的指尖点、手掌根部点。手掌根部位于手掌靠近手腕的部位,具体的位置可以参照图3对应的实施例。在其它实施例中,还可以将手掌根部点用手掌中心点替换,或者,关键点不仅可以包括手掌根部点,还可以包括手掌中心点。
S1003、基于关键点的平面特征确定关键点的平面坐标。
在一些实施例中,可以将关键点的平面特征输入到第一全连接层,第一全连接层可以对关键点的平面特征进行回归计算,从而得到关键点的平面坐标。第一全连接层可以是第一回归层(regression layer)或者用于估计手势的第一回归头(regression head)。即第一全连接层的输入为关键点的平面特征,第一全连接层的输出为关键点的平面坐标。
S1005、基于关键点的深度特征确定关键点的深度坐标。
本申请实施例中的S1003和S1005的步骤可以并行执行。在其它实施例中,S1003和S1005的步骤可以先后执行。
在一些实施例中,可以将关键点的深度特征输入到第二全连接层,第二全连接层可以对关键点的深度特征进行回归计算,从而得到关键点的深度坐标。第二全连接层可以是第二回归层(regression layer)或者用于估计手势的第二回归头(regression head)。即第二全连接层的输入为关键点的深度特征,第二全连接层的输出为关键点的深度坐标。
S1007、基于关键点的平面坐标和关键点的深度坐标,确定与关键点对应的感兴趣区域的姿势。
在本申请实施例中,感兴趣区域可以包括手部区域,在第一图像为裁剪得到的手部图像时,感兴趣区域与第一图像相同。在第一图像为拍摄的原始图像时,感兴趣区域可以是从原始图像中确定的手部区域。从原始图像中确定手部区域的方式可以参照图5对应的实施例的说明。在其它实施例中,感兴趣区域可以是人体的其它部位的姿势,例如,人脸区域、头部区域、腿部区域或臂部区域等。感兴趣区域还可以是机器人的某一部分区域或其它动物的某一部分区域,任何能够用本申请实施例中的姿势确定方法所对应的感兴趣区域都应该在本申请的保护范围之内。
在本申请实施例中,感兴趣区域的姿势可以是手部姿势,手部姿势可以是上述列举的20个关键点的每个关键点的位置坐标所对应的姿势,即姿势确定设备在确定到每个关键点的位置坐标后,既可确定与每个关键点的位置坐标对应的姿势。关键点的位置坐标可以是UVD坐标,也可以是将UVD坐标进行坐标转换,得到的其他坐标系下的坐标。在其它实施例中,姿势确定设备可以根据每个关键点的位置坐标,确定手部姿势所表征的意义,例如,手部姿势表征拳头的手势或者胜利的手势等。
本申请实施例中,由于分别提取第一图像中关键点的平面特征和关键点的深度特征,并分别确定关键点的平面坐标和关键点的深度坐标,从而关键点的平面特征与关键点的深度特征之间不存在干扰,进而提高了确定关键点的平面坐标和关键点的深度坐标的准确性,并且,由于提取的是关键点的平面特征和关键点的深度特征,能够使得提取用于确定感兴趣区域的姿势的特征的提取方式简单。
图11为本申请实施例提供的另一种姿势确定方法的流程示意图,如图11所示,该方法应用于姿势确定设备,在本申请实施例对应的方法中,S1001的步骤,可以包括S1101至S1107的步骤:
S1101、从第一图像中确定感兴趣区域。
姿势确定设备中的RoIAlign层或者RoIAlign特征提取器可以从第一图像中确定感兴趣区域。姿势确定设备在得到第一图像后,可以将第一图像输入至RoIAlign层,或者可以通过图9对应的实施例将第一图像依次通过第一主干特征提取器、边界框检测头、边界框选择单元到达RoIAlign特征提取器或RoIAlign层,通过RoIAlign层输出第一图像的特征,并输出检测框或包围盒框住的感兴趣区域。姿势确定可以对第一图像进行裁剪得到感兴趣区域。RoIAlign层可以将包围盒框住的感兴趣区域中的特征发送给特征提取器。
S1103、获取感兴趣区域中关键点的第一感兴趣特征。
姿势确定设备的特征提取器可以从感兴趣区域中的第二感兴趣特征中,提取关键点的感兴趣特征。
S1105、从关键点的第一感兴趣特征中提取关键点的平面特征。
姿势确定设备的平面编码器(或者UV编码器或UV encoder)可以从关键点的第一感兴趣特征中提取关键点的平面特征。
S1107、从关键点的第一感兴趣特征中提取关键点的深度特征。
本申请实施例中的S1105和S1107的步骤可以并行执行。在其它实施例中,S1105和S1107的步骤可以先后执行。
姿势确定设备的深度编码器可以从关键点的第一感兴趣特征中提取关键点的深度特征。
在一些实施例中,S1103可以包括步骤A1和步骤A3:
步骤A1、获取关键点的第二感兴趣特征;第二感兴趣特征比第一感兴趣特征的层次低。
步骤A3、基于第二感兴趣特征和第一卷积层,确定第一感兴趣特征。
第二感兴趣特征可以是较为浅层的特征,例如,边、角、颜色、纹理等特征。第一感兴趣特征可以是比较深层的特征,例如,形状、长度、弯曲程度、相对位置等特征。
在一些实施例中,步骤A1可以包括:步骤A11和步骤A15:
步骤A11、获取感兴趣区域的第一特征图。感兴趣区域的第一特征图中的特征可以感兴趣区域中的特征,或者,是从感兴趣区域中的特征中提取的特征。
步骤A13、从第一特征图中提取关键点的第三感兴趣特征(7×7×256);第三感兴趣特征比第二感兴趣特征的层次低。
或者可以说第二感兴趣特征比第三感兴趣特征的特征性强。
步骤A15、基于第三感兴趣特征和第二卷积层,确定第二感兴趣特征;第二卷积层的通道数小于第一特征图的通道数。
在一些实施例中,步骤A3可以包括步骤A31和步骤A33:
步骤A31、将第二感兴趣特征输入至第一残差单元,通过第一残差单元得到第一特定特征;其中,第一残差单元包括顺序连接的M个第一残差块;每一第一残差块包括第一卷积层,第一卷积层的输入和第一卷积层的输出之间跳跃连接;M大于或等于2。
每一第一残差块用于将输入至所述每一第一残差块的特征和从所述每一第一残差块输出的特征相加,并输入至下一处理部分。
本申请实施例中的M个第一残差块为相同的残差块,M的值为4。在其它实施例中,M个第一残差块可以是不同的残差块,和/或,M为其它数值,例如,M为2、3、5等。M的数值越大,第一残差单元能够从第二感兴趣特征中提取的特征更充分,从而使得确定的手部姿势的准确度更高,但是M的数值越大,会导致计算量过大,从而由于M的值为4,能够在计算量和准确度之间取一个折中。
步骤A33、对第一特定特征进行池化处理,得到第一感兴趣特征。
第一池化层可以对第一特定特征对应的特征图下采样至少一次(例如,下采样两次),从而得到第一感兴趣特征。
在本申请实施例中,通过第二卷积层对第三感兴趣特征进行卷积处理,得到的第二感兴趣特征的通道数相较于第三感兴趣特征的通道数大大的减小,从而能够部件保留了第三感兴趣特征中的关键特征,还大大减少了得到平面坐标和深度坐标的计算量。通过将第二感兴趣特征输入至第一残差单元,通过第一残差单元得到第一特定特征,从而能够有效的消除梯度消失这一现象,且能够提取到关键点的特征。通过第一池化层的池化处理,能够进一步减少计算量。
图12为本申请实施例提供的一种特征提取器的结构示意图,如图12所示,特征提取器包括:第二感兴趣特征获取单元1201、第一残差单元1202和第一池化层1203。
第二感兴趣特征获取单元1201的输入端可以连接到RoIAlign层,RoIAlign层向第二感兴趣特征获取单元1201输入关键点的第三感兴趣特征,然后第二感兴趣特征对关键点的第三感兴趣特征进行处理,得到关键点的第二感兴趣特征,并将关键点的第二感兴趣特征输入至第一残差单元1202。
第一残差单元1202可以包括4个第一残差块1204,4个第一残差块1204的第一个第一残差块1204接收第二感兴趣特征获取单元1201输入的关键点的第二感兴趣特征,每一个第一卷积层用于对输入的特征进行计算,并将计算的特征输出,每一第一残差块1204用于将输入的特征和输出的特征进行相加,并将相加的特征输入至下一层,例如,第一个第一残差块1204对将关键点的第二感兴趣特征,与对关键点的第二感兴趣特征与第一卷积层卷积的特征进行相加,并将相加的特征输入至第二个第一残差块1204。应注意,第四个第一残差块1204将相加的结果输入至第一池化层1203,本申请实施例中第四个第一残差块1204得到的相加结果即为第一特定特征。
请继续参阅图12,本申请实施例提供一种获得关键点的第一感兴趣特征的实现方式,第二感兴趣特征获取单元1201接收到RoIAlign层发送的关键点的第三感兴趣特征,关键点的第三感兴趣特征的大小为7×7×256(尺寸为7×7,通道数为256),第二感兴趣特征获取单元1201将关键点的第三感兴趣特征和第二卷积层进行卷积计算,以对通道数进行降低,进而得到关键点的第二感兴趣特征,第二卷积层的大小为3×3×128,关键点的第二感兴趣特征的大小为7×7×128。第二感兴趣特征获取单元1201将得到的7×7×128的第二感兴趣特征输入至第一残差单元1202,第一残差单元1202对第二感兴趣特征进 行处理,并将得到的7×7×128大小的第一特定特征输入至第一池化层1203,第一池化层1203对7×7×128大小的第一特定特征进行两次下采样,第一次下采样得到5×5×128大小的特征,第二次下采样得到3×3×128大小的关键点的第一感兴趣特征。
图13为本申请实施例提供的另一种姿势确定方法的流程示意图,如图13所示,该方法应用于姿势确定设备,在本申请实施例对应的方法中,S1003的步骤,可以包括S1301和S1303的步骤,S1005的步骤,可以包括S1305和S1307的步骤:
S1301、从第一感兴趣特征中,利用第三卷积层提取关键点的平面特征。
平面编码器可以从第一感兴趣特征中,利用第三卷积层提取关键点的平面特征。第一回归层可以将得到的第一感兴趣特征输入至平面编码器的第二残差单元。
S1303、基于关键点的平面特征,回归得到关键点的平面坐标。
在一些实施例中,第二池化层可以将关键点的平面特征输入至第一全连接层,以使第一全连接层得到关键点的平面坐标;第一全连接层用于回归得到关键点的平面坐标。
S1305、从第一感兴趣特征中,利用第四卷积层提取关键点的深度特征。
S1307、基于关键点的深度特征,回归得到关键点的深度坐标。
在一些实施例中,第三池化层可以将关键点的深度特征输入至第二全连接层,以使第二全连接层得到关键点的深度坐标;第二全连接层用于回归得到关键点的深度坐标。
在一些实施例中,S1301可以包括步骤B1和B3:
步骤B1、将第一感兴趣特征输入至第二残差单元,通过第二残差单元得到第二特定特征;其中,第二残差单元包括顺序连接的N个第二残差块;每一第二残差块包括第三卷积层,第三卷积层的输入和第三卷积层的输出之间跳跃连接;N大于或等于2。
本申请实施例中的N个第二残差块可以为相同的残差块,N的值为4。在其它实施例中,N个第二残差块可以是不同的残差块,和/或,N为其它数值。
步骤B3、对第二特定特征进行池化处理,得到关键点的平面特征。
第二池化层可以对第二特定特征对应的特征图下采样至少一次(例如,下采样两次),从而得到关键点的平面特征。
在一些实施例中,S1305可以包括步骤C1和C3:
步骤C1、将第一感兴趣特征输入至第三残差单元,通过第三残差单元得到第三特定特征;其中,第三残差单元包括顺序连接的P个第三残差块;每一第三残差块包括第四卷积层,第四卷积层的输入和第四卷积层的输出之间跳跃连接;P大于或等于2。
本申请实施例中的P个第三残差块可以为相同的残差块,P的值为4。在其它实施例中,P个第三残差块可以是不同的残差块,和/或,P为其它数值。
步骤C3、对第三特定特征进行池化处理,得到关键点的深度特征。
第三池化层可以对第三特定特征对应的特征图下采样至少一次(例如,下采样两次),从而得到关键点的深度特征。
在本申请实施例中,第二卷积层、第一卷积层,以及第三或第四卷积层在提取特征时,提取的特征越来越抽象。第二卷积层提取的特征比较浅层次,例如,第二卷积层提取的是边特征、角特征、颜色特征、纹理特征等中的至少一个,第一卷积层提取的特征层次较第二卷积层的深,第三或第四卷积层提取的特征较第一卷积层的深,例如,第三或第四卷积层提取的是手指长度、弯曲程度、形状等至少之一的特征。
在本申请实施例中,通过UV编码器和深度编码器分别提取第一感兴趣特征的平面特征和深度特征,获取平面特征的编码器和深度特征的编码器分开,从而不会互相产生干扰,并且,在通过平面特征/深度特征得到平面坐标/深度坐标时,二维的坐标相对于现有技术中的三维坐标,更容易回归得到。由于采用第二残差单元和第三残差单元对第一感兴趣特征进行处理,能够有效的消除梯度消失这一现象,且能够提取到关键点的特征。通过第二和第三池化层的池化处理,能够进一步减少计算量。
图14a为本申请实施例提供的一种姿势确定网络的架构示意图,如图14a所示,该姿势确定网络包括:特征提取器1410、平面编码器1420、深度编码器1430、第一全连接层1440和第二全连接层1450。
图14b为本申请实施例提供的一种平面编码器的结构示意图,如图14a和14b所示,平面编码器1420包括:第二残差单元1421和第二池化层1422。
第二残差单元1421可以包括4个第二残差块1423,4个第二残差块1423根据第一感兴趣特征得到平面特征的过程,可以与图12对应的4个第一残差块根据第二感兴趣特征得到第一感兴趣特征的过程类似,此处不再赘述。
第二残差单元1421在得到3×3×128大小的关键点的第一感兴趣特征时,可以对第一感兴趣特征进行处理,并将得到的3×3×128的第二特定特征输入至第二池化层1422,第二池化层1422对3×3×128的第二特定特征进行两次下采样,第一次下采样得到2×2×128大小的特征,第二次下采样得到1×1×128大小的关键点的平面特征。第二池化层1422将得到的1×1×128大小的关键点的平面特征输入至第一全连接层1440,第一全连接层1440对关键点的平面特征进行回归计算,得到20个关键点中每个关键点的平面坐标(或UV坐标)。
图14c为本申请实施例提供的一种深度编码器的结构示意图,如图14a和14c所示,深度编码器1430包括:第三残差单元1431和第三池化层1432。
第三残差单元1431可以包括4个第三残差块1433,4个第三残差块1433根据第一感兴趣特征得到深度特征的过程,可以与图12对应的4个第一残差块根据第二感兴趣特征得到第一感兴趣特征的过程类似,此处不再赘述。
第三残差单元1431在得到3×3×128大小的关键点的第一感兴趣特征时,可以对第一感兴趣特征进行处理,并将得到的3×3×128的第三特定特征输入至第三池化层1432,第三池化层1432对3×3×128的第三特定特征进行两次下采样,第一次下采样得到2×2×128大小的特征,第二次下采样得到1×1×128大小的关键点的深度特征。第三池化层1432将得到的1×1×128大小的关键点的深度特征输入至第二全连接层1450,第二全连接层1450对关键点的深度特征进行回归计算,得到20个关键点中每个关键点的深度坐标(或D坐标)。
图15为本申请实施例提供的另一种姿势确定方法的流程示意图,如图15所示,该方法应用于姿势确定设备,在本申请实施例对应的方法中,S1007的步骤,可以包括S1501和S1503的步骤:
S1501、基于平面坐标和深度坐标,确定XYZ坐标系下的X轴坐标和Y轴坐标。
平面坐标为UV坐标,UV坐标是第一全连接层输出的坐标,D坐标(或深度坐标)是第二全连接层输出的坐标,可以通过公式(1)将U轴坐标和V轴坐标转换为X轴坐标和Y轴坐标。
S1503、将X轴坐标、Y轴坐标以及深度坐标对应的姿势,作为感兴趣区域的姿势。
在本申请实施例中,能够将UVD坐标转换为XYZ坐标,从而根据XYZ坐标得到手部姿势。
在申请实施例中,提出了用于3D手姿势估计任务的紧凑回归头。我们的回归头从感兴趣区域(RoI)功能开始,重点放在手部区域。我们使其紧凑,以包括用于3D手势估计的两个子任务:1)2D关键点,即UV坐标;2)UV坐标的深度估计。通过结合UV坐标和相应的深度来恢复3D关键点,即XYZ坐标。
本申请还属于使用全连接层作为最终层以使坐标回归的类别。但是,首先,本申请实施例的从RoI功能而不是原始图像开始,其次,回归头的体系结构是不同的,即除了最终回归层(或全连接层)之外,主要使用卷积层代替完全连接层)。本申请实施例回归的是UVD坐标而不是XYZ坐标。
在本申请实施例中,基本特征提取器提取7×7×256(高度x宽度x通道)图像特征图上的关键点特征,首先将图像特征图与3×3×128的Conv1(对应上述的第二卷积层)配合使用,以将通道从256缩小到128(即保存计算)。将得到的7×7×128特征图与Conv2(3×3×128)(对应上述的第一卷积层)卷积以进一步提取基本关键点特征,Conv2具有跳过连接(skip connection),使得可将Conv2的输入与Conv2的输出相加。带有跳过连接的Conv2(对应上述的第一残差块)重复4次。之后,通过内核3x3(即Pool1)(第一池化层)的最大池化,将7×7×128关键点特征图下采样2次,大小为3×3×128的关键点特征图。
UV编码器提取关键点特征,以进行UV坐标回归。UV编码器输入3×3×128的关键点特征图,并与Conv3(对应上述的第三卷积层)卷积以输出相同大小的关键点特征图。跳过连接将Conv3的输入与Conv3的输出相加。带有相应跳过连接的Conv3(对应上述的第二残差块)重复4次。之后,通过内核3x3(即Pool2)(第二池化层)的最大池化,将3×3×128关键点特征图下采样2次,大小为1×1×128。
使用完全连接的FC1层(对应上述的第一全连接层)对20个关键点的UV坐标进行回归。
深度编码器提取关键点特征,以进行深度回归。深度编码器输入3×3×128的关键点特征图,并与Conv4(对应上述的第四卷积层)卷积以输出相同大小的关键点特征图。跳过连接将Conv4的输入与Conv4的输出相加。带有相应跳过连接的Conv4(对应上述的第三残差块)重复4次。之后,通过内核3x3(即Pool3)(第三池化层)的最大池化,将3×3×128关键点特征图下采样2次,大小为1×1×128。
使用完全连接的FC2层(对应上述的第二全连接层)对20个关键点的深度坐标进行回归。
最后,UV坐标和深度坐标用于计算XYZ坐标,以得到手部姿势。
基于前述的实施例,本申请实施例提供一种姿势确定装置,该装置包括所包括的各单元、以及各单元所包括的各模块,可以通过姿势确定设备中的处理器来实现;当然也可通过具体的逻辑电路实现。
图16为本申请实施例提供的一种姿势确定装置的组成结构示意图,如图16所示,姿势确定装置1600包括:
平面编码单元1610,用于从第一图像中提取关键点的平面特征;
深度编码单元1620,用于从第一图像中提取关键点的深度特征;
平面坐标确定单元1630,用于基于关键点的平面特征确定关键点的平面坐标;
深度坐标确定单元1640,用于基于关键点的深度特征确定关键点的深度坐标;
姿势确定单元1650,用于基于关键点的平面坐标和关键点的深度坐标,确定与关键点对应的感兴趣区域的姿势。
平面编码单元1610可以是上述的平面编码器,深度编码单元1620可以是上述的深度编码器。平面坐标确定单元1630可以是上述的第一全连接层,深度坐标确定单元1640可以是上述的第二全连接层。
在一些实施例中,姿势确定装置1600还包括:区域确定单元1660和特征提取单元1670;
区域确定单元1660,用于从第一图像中确定感兴趣区域;
特征提取单元1670,用于获取感兴趣区域中关键点的第一感兴趣特征;
平面编码单元1610,用于从关键点的第一感兴趣特征中提取关键点的平面特征;
深度编码单元1620,用于从关键点的第一感兴趣特征中提取关键点的深度特征。
区域确定单元1660可以是上述的RoIAlign层,特征提取单元1670可以是上述的特征提取器。
在一些实施例中,特征提取单元1670包括:特征获取单元1671和特征确定单元1672;
特征获取单元1671,用于获取关键点的第二感兴趣特征;第二感兴趣特征比第一感兴趣特征的层次低;
特征确定单元1672,用于基于第二感兴趣特征和第一卷积层,确定第一感兴趣特征。
特征获取单元1671可以是上述实施例中的第二感兴趣特征获取单元。
在一些实施例中,特征确定单元1672包括:第一残差单元1673和第一池化单元1674;
特征获取单元1671,还用于将第二感兴趣特征输入至第一残差单元;
第一残差单元1673,用于对关键点的第二感兴趣特征进行处理,得到第一特定特征;其中,第一残差单元1673包括顺序连接的M个第一残差块;每一第一残差块包括第一卷积层,第一卷积层的输入和第一卷积层的输出之间跳跃连接;M大于或等于2;
第一池化单元1674,用于对第一特定特征进行池化处理,得到第一感兴趣特征。
第一残差单元1673可以是上述的第一残差单元1202。第一池化单元1674可以是上述的第一池化层1203。
在一些实施例中,特征获取单元1671,还用于获取感兴趣区域的第一特征图;从第一特征图中提取关键点的第三感兴趣特征(7×7×256);第三感兴趣特征比第二感兴趣特征的层次低;基于第三感兴趣特征和第二卷积层,确定第二感兴趣特征;第二卷积层的通道数小于第一特征图的通道数。
在一些实施例中,平面编码单元1610,用于从第一感兴趣特征中,利用第三卷积层提取关键点的平面特征;
平面坐标确定单元1630,用于基于关键点的平面特征,回归得到关键点的平面坐标。
在一些实施例中,平面编码单元1610包括:第二残差单元1611和第二池化单元1612;
第一池化单元1674,还用于将第一感兴趣特征输入至第二残差单元1611;
第二残差单元1611,还用于对第一感兴趣特征进行处理,得到第二特定特征;其中,第二残差单元1611包括顺序连接的N个第二残差块;每一第二残差块包括第三卷积层,第三卷积层的输入和第三卷积层的输出之间跳跃连接;N大于或等于2;
第二池化单元1612,还用于对第二特定特征进行池化处理,得到关键点的平面特征。
第二残差单元1611可以是上述的第二残差单元1421。第二池化单元1612可以是上述实施例中的第二池化层1422。
在一些实施例中,平面坐标确定单元1630,还用于将关键点的平面特征输入至第一全连接层,得到关键点的平面坐标;第一全连接层用于回归得到关键点的平面坐标。
在一些实施例中,深度编码单元1620,用于从第一感兴趣特征中,利用第四卷积层提取关键点的深度特征;
深度坐标确定单元1640,用于基于关键点的深度特征,回归得到关键点的深度坐标。
在一些实施例中,深度编码单元1620包括:第三残差单元1621和第三池化单元1622;
第一池化单元1674,还用于将第一感兴趣特征输入至第三残差单元1621;
第三残差单元1621,用于对第一感兴趣特征进行处理,得到第三特定特征;其中,第三残差单元1621包括顺序连接的P个第三残差块;每一第三残差块包括第四卷积层,第四卷积层的输入和第四卷积层的输出之间跳跃连接;P大于或等于2;
第三残差单元1621,用于对第三特定特征进行池化处理,得到关键点的深度特征。
第三残差单元1621可以是上述的第三残差单元1431。第三池化单元1622可以是上述实施例中的第三池化层1432。
在一些实施例中,深度坐标确定单元1640,还用于将关键点的深度特征输入至第二全连接层,得到关键点的深度坐标;第二全连接层用于回归得到关键点的深度坐标。
在一些实施例中,姿势确定单元1650,还用于基于平面坐标和深度坐标,确定XYZ坐标系下的X轴坐标和Y轴坐标;将X轴坐标、Y轴坐标以及深度坐标对应的姿势,作为感兴趣区域的姿势。
在一些实施例中,感兴趣区域包括手部区域。
在一些实施例中,关键点包括以下至少之一:手指关节点、指尖点、手掌根部点以及手掌中心点。
以上装置实施例的描述,与上述方法实施例的描述是类似的,具有同方法实施例相似的有益效果。对于本申请装置实施例中未披露的技术细节,请参照本申请方法实施例的描述而理解。
需要说明的是,本申请实施例中,如果以软件功能模块的形式实现上述的姿势确定方法,并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对相关技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台姿势确定设备执行本申请各个实施例所述方法的全部或部分。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read Only Memory,ROM)、磁碟或者光盘等各种可以存储程序代码的介质。这样,本申请实施例不限制于任何特定的硬件和软件结合。
需要说明的是,图17为本申请实施例提供的一种姿势确定设备的硬件实体示意图,如图17所示,该姿势确定设备1700的硬件实体包括:处理器1701和存储器1702,其中,存储器1702存储有可在处理器1701上运行的计算机程序,处理器1701执行程序时实现上述任一实施例的方法中的步骤。
其中,存储器1702可以存储有可在处理器上运行的计算机程序,存储器1702可以配置为存储由处理器1701可执行的指令和应用,还可以缓存各模块待处理或已经处理的数据(例如,图像数据、音频数据、语音通信数据和视频通信数据),可以通过闪存(FLASH)或随机访问存储器(Random Access Memory,RAM)实现。处理器1701和存储器1702可以封装在一起。
本申请实施例提供一种计算机存储介质,计算机存储介质存储有一个或者多个程序,一个或者多个程序可被一个或者多个处理器执行,以实现上述任一方法中的步骤。
图18是本申请实施例提供的一种芯片的结构示意图。图18所示的芯片1800包括处理器1801,处理器1801可以从存储器中调用并运行计算机程序,以实现本申请实施例中姿势确定设备执行的方法的步骤。
在一些实施中,如图18所示,芯片1800还可以包括存储器1802。其中,处理器1801可以从存储器1802中调用并运行计算机程序,以实现本申请实施例中姿势确定设备执行的方法的步骤。
其中,存储器1802可以是独立于处理器1801的一个单独的器件,也可以集成在处理器1801中。
在一些实施中,该芯片1800还可以包括输入接口1803。其中,处理器1801可以控制该输入接口1803与其他设备或芯片进行通信,具体地,可以获取其他设备或芯片发送的信息或数据。
在一些实施中,该芯片1800还可以包括输出接口1804。其中,处理器1801可以控制该输出接口1804与其他设备或芯片进行通信,具体地,可以向其他设备或芯片输出信息或数据。
在一些实施中,该芯片1800可应用于本申请实施例中的姿势确定设备,并且该芯片1800可以实现本申请实施例的各个方法中由姿势确定设备实现的相应流程,为了简洁,在此不再赘述。芯片1800可以是姿势确定设备中的基带芯片。
应理解,本申请实施例提到的芯片1800还可以称为系统级芯片,系统芯片,芯片系统或片上系统芯片等。
本申请实施例提供一种计算机程序产品,计算机程序产品包括计算机存储介质,计算机存储介质存储计算机程序代码,计算机程序代码包括能够由至少一个处理器执行的指令,当指令由至少一个处理器执行时实现上述方法中姿势确定设备执行的方法的步骤。
在一些实施中,该计算机程序产品可应用于本申请实施例中的姿势确定设备,并且该计算机程序指 令使得计算机执行本申请实施例的各个方法中由姿势确定设备实现的相应流程,为了简洁,在此不再赘述。
应理解,本申请实施例的处理器可能是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法实施例的各步骤可以通过处理器中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器可以是通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器,处理器读取存储器中的信息,结合其硬件完成上述方法的步骤。
可以理解,本申请实施例中的存储器可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(Synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DR RAM)。应注意,本文描述的系统和方法的存储器旨在包括但不限于这些和任意其它适合类型的存储器。
应理解,上述存储器为示例性但不是限制性说明,例如,本申请实施例中的存储器还可以是静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(dynamic RAM,DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synch link DRAM,SLDRAM)以及直接内存总线随机存取存储器(Direct Rambus RAM,DR RAM)等等。也就是说,本申请实施例中的存储器旨在包括但不限于这些和任意其它适合类型的存储器。
这里需要指出的是:以上姿势确定装置、设备、计算机存储介质、芯片和计算机程序产品实施例的描述,与上述方法实施例的描述是类似的,具有同方法实施例相似的有益效果。对于本申请存储介质和设备实施例中未披露的技术细节,请参照本申请方法实施例的描述而理解。
应理解,说明书通篇中提到的“一个实施例”或“一实施例”或“本申请实施例”或“前述实施例”意味着与实施例有关的特定特征、结构或特性包括在本申请的至少一个实施例中。因此,在整个说明书各处出现的“在一个实施例中”或“在一实施例中”或“本申请实施例”或“前述实施例”未必一定指相同的实施例。此外,这些特定的特征、结构或特性可以任意适合的方式结合在一个或多个实施例中。应理解,在本申请的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
在未做特殊说明的情况下,姿势确定设备执行本申请实施例中的任一步骤,可以是姿势确定设备的处理器执行该步骤。除非特殊说明,本申请实施例并不限定姿势确定设备执行下述步骤的先后顺序。另外,不同实施例中对数据进行处理所采用的方式可以是相同的方法或不同的方法。还需说明的是,本申请实施例中的任一步骤是姿势确定设备可以独立执行的,即姿势确定设备执行上述实施例中的任一步骤时,可以不依赖于其它步骤的执行。
在本申请所提供的几个实施例中,应该理解到,所揭露的设备和方法,可以通过其它的方式实现。以上所描述的设备实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,如:多个单元或组件可以结合,或可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的各组成部分相互之间的耦合、或直接耦合、或通信连接可以是通过一些接口,设备或单元的间接耦合或通信连接,可以是电性的、机械的或其它形式的。
上述作为分离部件说明的单元可以是、或也可以不是物理上分开的,作为单元显示的部件可以是、或也可以不是物理单元;既可以位于一个地方,也可以分布到多个网络单元上;可以根据实际的需要选 择其中的部分或全部单元来实现本实施例方案的目的。
另外,在本申请各实施例中的各功能单元可以全部集成在一个处理单元中,也可以是各单元分别单独作为一个单元,也可以两个或两个以上单元集成在一个单元中;上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能单元的形式实现。
本申请所提供的几个方法实施例中所揭露的方法,在不冲突的情况下可以任意组合,得到新的方法实施例。
本申请所提供的几个产品实施例中所揭露的特征,在不冲突的情况下可以任意组合,得到新的产品实施例。
本申请所提供的几个方法或设备实施例中所揭露的特征,在不冲突的情况下可以任意组合,得到新的方法实施例或设备实施例。
本领域普通技术人员可以理解:实现上述方法实施例的全部或部分步骤可以通过程序指令相关的硬件来完成,前述的程序可以存储于计算机可读取存储介质中,该程序在执行时,执行包括上述方法实施例的步骤;而前述的存储介质包括:移动存储设备、只读存储器(Read Only Memory,ROM)、磁碟或者光盘等各种可以存储程序代码的介质。
或者,本申请上述集成的单元如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实施例的技术方案本质上或者说对相关技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机、服务器、或者网络设备等)执行本申请各个实施例所述方法的全部或部分。而前述的存储介质包括:移动存储设备、ROM、磁碟或者光盘等各种可以存储程序代码的介质。
在本申请实施例中,不同实施例中相同步骤和相同内容的说明,可以互相参照。在本申请实施例中,术语“并”不对步骤的先后顺序造成影响,例如,姿势确定设备执行A,并执行B,可以是姿势确定设备先执行A,再执行B,或者是姿势确定设备先执行B,再执行A,或者是姿势确定设备执行A的同时执行B。
以上所述,仅为本申请的实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。
本申请实施例提供了一种姿势确定方法、装置、设备、存储介质、芯片及产品,能够提高确定的关键点的平面坐标和关键点的深度坐标的准确性,且使得提取用于确定感兴趣区域的姿势的特征的提取方式简单。
Claims (32)
- 一种姿势确定方法,包括:从第一图像中提取关键点的平面特征和所述关键点的深度特征;基于所述关键点的平面特征确定所述关键点的平面坐标;基于所述关键点的深度特征确定所述关键点的深度坐标;基于所述关键点的平面坐标和所述关键点的深度坐标,确定与所述关键点对应的感兴趣区域的姿势。
- 根据权利要求1所述的方法,其中,所述从第一图像中提取关键点的平面特征和所述关键点的深度特征,包括:从所述第一图像中确定所述感兴趣区域;获取所述感兴趣区域中所述关键点的第一感兴趣特征;从所述关键点的第一感兴趣特征中提取所述关键点的平面特征;从所述关键点的第一感兴趣特征中提取所述关键点的深度特征。
- 根据权利要求2所述的方法,其中,所述获取所述感兴趣区域中所述关键点的第一感兴趣特征,包括:获取所述关键点的第二感兴趣特征;所述第二感兴趣特征比所述第一感兴趣特征的层次低;基于所述第二感兴趣特征和第一卷积层,确定所述第一感兴趣特征。
- 根据权利要求3所述的方法,其中,所述基于所述第二感兴趣特征和第一卷积层,确定所述第一感兴趣特征,包括:将所述第二感兴趣特征输入至第一残差单元,通过所述第一残差单元得到第一特定特征;其中,所述第一残差单元包括顺序连接的M个第一残差块;每一第一残差块包括所述第一卷积层,所述第一卷积层的输入和所述第一卷积层的输出之间跳跃连接;M大于或等于2;对所述第一特定特征进行池化处理,得到所述第一感兴趣特征。
- 根据权利要求3或4所述的方法,其中,所述获取所述关键点的第二感兴趣特征,包括:获取所述感兴趣区域的第一特征图;从所述第一特征图中提取所述关键点的第三感兴趣特征;所述第三感兴趣特征比所述第二感兴趣特征的层次低;基于所述第三感兴趣特征和第二卷积层,确定所述第二感兴趣特征;所述第二卷积层的通道数小于所述第一特征图的通道数。
- 根据权利要求2至5任一项所述的方法,其中,所述基于所述关键点的平面特征确定所述关键点的平面坐标,包括:从所述第一感兴趣特征中,利用第三卷积层提取所述关键点的平面特征;基于所述关键点的平面特征,回归得到所述关键点的平面坐标。
- 根据权利要求6所述的方法,其中,所述从所述第一感兴趣特征中,利用第三卷积层提取所述关键点的平面特征,包括:将所述第一感兴趣特征输入至第二残差单元,通过所述第二残差单元得到第二特定特征;其中,所述第二残差单元包括顺序连接的N个第二残差块;每一第二残差块包括所述第三卷积层,所述第三卷积层的输入和所述第三卷积层的输出之间跳跃连接;N大于或等于2;对所述第二特定特征进行池化处理,得到所述关键点的平面特征。
- 根据权利要求6或7所述的方法,其中,所述基于所述关键点的平面特征,回归得到所述关键点的平面坐标,包括:将所述关键点的平面特征输入至第一全连接层,得到所述关键点的平面坐标;所述第一全连接层用于回归得到所述关键点的平面坐标。
- 根据权利要求2至8任一项所述的方法,其中,所述基于所述关键点的深度特征确定所述关键点的深度坐标,包括:从所述第一感兴趣特征中,利用第四卷积层提取所述关键点的深度特征;基于所述关键点的深度特征,回归得到所述关键点的深度坐标。
- 根据权利要求9所述的方法,其中,所述从所述第一感兴趣特征中,利用第四卷积层提取所述关键点的深度特征,包括:将所述第一感兴趣特征输入至第三残差单元,通过所述第三残差单元得到第三特定特征;其中,所述第三残差单元包括顺序连接的P个第三残差块;每一第三残差块包括所述第四卷积层,所述第四卷积层的输入和所述第四卷积层的输出之间跳跃连接;P大于或等于2;对所述第三特定特征进行池化处理,得到所述关键点的深度特征。
- 根据权利要求9或10所述的方法,其中,所述基于所述关键点的深度特征,回归得到所述关键点的深度坐标,包括:将所述关键点的深度特征输入至第二全连接层,得到所述关键点的深度坐标;所述第二全连接层用于回归得到所述关键点的深度坐标。
- 根据权利要求1至11任一项所述的方法,其中,所述基于所述关键点的平面坐标和所述关键点的深度坐标,确定所述感兴趣区域的姿势,包括:基于所述平面坐标和所述深度坐标,确定XYZ坐标系下的X轴坐标和Y轴坐标;将所述X轴坐标、Y轴坐标以及深度坐标对应的姿势,作为所述感兴趣区域的姿势。
- 根据权利要求2至12任一项所述的方法,其中,所述感兴趣区域包括手部区域。
- 根据权利要求1至13任一项所述的方法,其中,所述关键点包括以下至少之一:手指关节点、指尖点、手掌根部点以及手掌中心点。
- 一种姿势确定装置,包括:平面编码单元,用于从第一图像中提取关键点的平面特征;深度编码单元,用于从所述第一图像中提取所述关键点的深度特征;平面坐标确定单元,用于基于所述关键点的平面特征确定所述关键点的平面坐标;深度坐标确定单元,用于基于所述关键点的深度特征确定所述关键点的深度坐标;姿势确定单元,用于基于所述关键点的平面坐标和所述关键点的深度坐标,确定与所述关键点对应的感兴趣区域的姿势。
- 根据权利要求15所述的装置,其中,所述装置还包括:区域确定单元和特征提取单元;所述区域确定单元,用于从所述第一图像中确定所述感兴趣区域;所述特征提取单元,用于获取所述感兴趣区域中所述关键点的第一感兴趣特征;所述平面编码单元,用于从所述关键点的第一感兴趣特征中提取所述关键点的平面特征;所述深度编码单元,用于从所述关键点的第一感兴趣特征中提取所述关键点的深度特征。
- 根据权利要求16所述的装置,其中,所述特征提取单元包括:特征获取单元和特征确定单元;所述特征获取单元,用于获取所述关键点的第二感兴趣特征;所述第二感兴趣特征比所述第一感兴趣特征的层次低;所述特征确定单元,用于基于所述第二感兴趣特征和第一卷积层,确定所述第一感兴趣特征。
- 根据权利要求17所述的装置,其中,所述特征确定单元包括:第一残差单元和第一池化单元;所述特征获取单元,还用于将所述第二感兴趣特征输入至第一残差单元;所述第一残差单元,用于对所述关键点的第二感兴趣特征进行处理,得到第一特定特征;其中,所述第一残差单元包括顺序连接的M个第一残差块;每一第一残差块包括所述第一卷积层,所述第一卷积层的输入和所述第一卷积层的输出之间跳跃连接;M大于或等于2;所述第一池化单元,用于对所述第一特定特征进行池化处理,得到所述第一感兴趣特征。
- 根据权利要求17或18所述的装置,其中,所述特征获取单元,还用于获取所述感兴趣区域的第一特征图;从所述第一特征图中提取所述关键点的第三感兴趣特征;所述第三感兴趣特征比所述第二感兴趣特征的层次低;基于所述第三感兴趣特征和第二卷积层,确定所述第二感兴趣特征;所述第二卷积层的通道数小于所述第一特征图的通道数。
- 根据权利要求16至19任一项所述的装置,其中,所述平面编码单元,用于从所述第一感兴趣特征中,利用第三卷积层提取所述关键点的平面特征;所述平面坐标确定单元,用于基于所述关键点的平面特征,回归得到所述关键点的平面坐标。
- 根据权利要求20所述的装置,其中,所述平面编码单元包括:第二残差单元和第二池化单元;所述第一池化单元,还用于将所述第一感兴趣特征输入至所述第二残差单元;所述第二残差单元,还用于对所述第一感兴趣特征进行处理,得到第二特定特征;其中,所述第二残差单元包括顺序连接的N个第二残差块;每一第二残差块包括所述第三卷积层,所述第三卷积层的输入和所述第三卷积层的输出之间跳跃连接;N大于或等于2;所述第二池化单元,还用于对所述第二特定特征进行池化处理,得到所述关键点的平面特征。
- 根据权利要求20或21所述的装置,其中,所述平面坐标确定单元,还用于将所述关键点的平面特征输入至第一全连接层,得到所述关键点的平面坐标;所述第一全连接层用于回归得到所述关键点的平面坐标。
- 根据权利要求16至22任一项所述的装置,其中,所述深度编码单元,用于从所述第一感兴趣特征中,利用第四卷积层提取所述关键点的深度特征;所述深度坐标确定单元,用于基于所述关键点的深度特征,回归得到所述关键点的深度坐标。
- 根据权利要求23所述的装置,其中,所述深度编码单元包括:第三残差单元和第三池化单元;所述第一池化单元,还用于将所述第一感兴趣特征输入至所述第三残差单元;所述第三残差单元,用于对所述第一感兴趣特征进行处理,得到第三特定特征;其中,所述第三残差单元包括顺序连接的P个第三残差块;每一第三残差块包括所述第四卷积层,所述第四卷积层的输入和所述第四卷积层的输出之间跳跃连接;P大于或等于2;所述第三残差单元,用于对所述第三特定特征进行池化处理,得到所述关键点的深度特征。
- 根据权利要求23或24所述的装置,其中,所述深度坐标确定单元,还用于将所述关键点的深度特征输入至第二全连接层,得到所述关键点的深度坐标;所述第二全连接层用于回归得到所述关键点的深度坐标。
- 根据权利要求15至25任一项所述的装置,其中,所述姿势确定单元,还用于基于所述平面坐标和所述深度坐标,确定XYZ坐标系下的X轴坐标和Y轴坐标;将所述X轴坐标、Y轴坐标以及深度坐标对应的姿势,作为所述感兴趣区域的姿势。
- 根据权利要求16至26任一项所述的装置,其中,所述感兴趣区域包括手部区域。
- 根据权利要求15至27任一项所述的装置,其中,所述关键点包括以下至少之一:手指关节点、指尖点、手掌根部点以及手掌中心点。
- 一种姿势确定设备,包括:存储器和处理器,所述存储器存储有可在处理器上运行的计算机程序,所述处理器执行所述程序时实现权利要求1至14任一项所述方法中的步骤。
- 一种计算机存储介质,其中,所述计算机存储介质存储有一个或者多个程序,所述一个或者多个程序可被一个或者多个处理器执行,以实现权利要求1至14任一项所述方法中的步骤。
- 一种芯片,包括:处理器,用于从存储器中调用并运行计算机程序,使得安装有所述芯片的设备执行如权利要求1至14任一项所述方法中的步骤。
- 一种计算机程序产品,所述计算机程序产品包括计算机存储介质,所述计算机存储介质存储计算机程序代码,所述计算机程序代码包括能够由至少一个处理器执行的指令,当所述指令由所述至少一个处理器执行时实现权利要求1至14中任一项所述方法中的步骤。
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/746,933 US20220351405A1 (en) | 2019-11-20 | 2022-05-17 | Pose determination method and device and non-transitory storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201962938187P | 2019-11-20 | 2019-11-20 | |
US62/938,187 | 2019-11-20 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/746,933 Continuation US20220351405A1 (en) | 2019-11-20 | 2022-05-17 | Pose determination method and device and non-transitory storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021098545A1 true WO2021098545A1 (zh) | 2021-05-27 |
Family
ID=75980384
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/127607 WO2021098545A1 (zh) | 2019-11-20 | 2020-11-09 | 一种姿势确定方法、装置、设备、存储介质、芯片及产品 |
Country Status (2)
Country | Link |
---|---|
US (1) | US20220351405A1 (zh) |
WO (1) | WO2021098545A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114882114A (zh) * | 2022-05-31 | 2022-08-09 | 中国农业科学院农业质量标准与检测技术研究所 | 基于关键特征点定位的斑马鱼形态识别方法及相关装置 |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US12017116B2 (en) * | 2020-06-23 | 2024-06-25 | Electronics And Telecommunications Research Institute | Apparatus and method for evaluating human motion using mobile robot |
EP3965071A3 (en) * | 2020-09-08 | 2022-06-01 | Samsung Electronics Co., Ltd. | Method and apparatus for pose identification |
US20220189195A1 (en) * | 2020-12-15 | 2022-06-16 | Digitrack Llc | Methods and apparatus for automatic hand pose estimation using machine learning |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104182037A (zh) * | 2014-06-17 | 2014-12-03 | 惠州市德赛西威汽车电子有限公司 | 一种基于坐标变换的手势识别方法 |
CN105491425A (zh) * | 2014-09-16 | 2016-04-13 | 洪永川 | 识别手势及遥控电视的方法 |
US20180144530A1 (en) * | 2016-11-18 | 2018-05-24 | Korea Institute Of Science And Technology | Method and device for controlling 3d character using user's facial expressions and hand gestures |
-
2020
- 2020-11-09 WO PCT/CN2020/127607 patent/WO2021098545A1/zh active Application Filing
-
2022
- 2022-05-17 US US17/746,933 patent/US20220351405A1/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104182037A (zh) * | 2014-06-17 | 2014-12-03 | 惠州市德赛西威汽车电子有限公司 | 一种基于坐标变换的手势识别方法 |
CN105491425A (zh) * | 2014-09-16 | 2016-04-13 | 洪永川 | 识别手势及遥控电视的方法 |
US20180144530A1 (en) * | 2016-11-18 | 2018-05-24 | Korea Institute Of Science And Technology | Method and device for controlling 3d character using user's facial expressions and hand gestures |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114882114A (zh) * | 2022-05-31 | 2022-08-09 | 中国农业科学院农业质量标准与检测技术研究所 | 基于关键特征点定位的斑马鱼形态识别方法及相关装置 |
Also Published As
Publication number | Publication date |
---|---|
US20220351405A1 (en) | 2022-11-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021098545A1 (zh) | 一种姿势确定方法、装置、设备、存储介质、芯片及产品 | |
Romero et al. | Embodied hands: Modeling and capturing hands and bodies together | |
US11994377B2 (en) | Systems and methods of locating a control object appendage in three dimensional (3D) space | |
US10679046B1 (en) | Machine learning systems and methods of estimating body shape from images | |
Kumarapu et al. | Animepose: Multi-person 3d pose estimation and animation | |
Oberweger et al. | Hands deep in deep learning for hand pose estimation | |
CN113034652A (zh) | 虚拟形象驱动方法、装置、设备及存储介质 | |
WO2016123913A1 (zh) | 数据处理的方法和装置 | |
US11961266B2 (en) | Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture | |
US20220262093A1 (en) | Object detection method and system, and non-transitory computer-readable medium | |
US20230214458A1 (en) | Hand Pose Estimation for Machine Learning Based Gesture Recognition | |
WO2021098441A1 (zh) | 手部姿态估计方法、装置、设备以及计算机存储介质 | |
US10970849B2 (en) | Pose estimation and body tracking using an artificial neural network | |
CN109934183B (zh) | 图像处理方法及装置、检测设备及存储介质 | |
WO2021098554A1 (zh) | 一种特征提取方法、装置、设备及存储介质 | |
CN111709268B (zh) | 一种深度图像中的基于人手结构指导的人手姿态估计方法和装置 | |
CN111062263A (zh) | 手部姿态估计的方法、设备、计算机设备和存储介质 | |
WO2021098573A1 (zh) | 手部姿态估计方法、装置、设备以及计算机存储介质 | |
WO2023083030A1 (zh) | 一种姿态识别方法及其相关设备 | |
WO2021098587A1 (zh) | 手势分析方法、装置、设备及计算机可读存储介质 | |
WO2022179603A1 (zh) | 一种增强现实方法及其相关设备 | |
TW202309834A (zh) | 模型重建方法及電子設備和電腦可讀儲存介質 | |
CN115661336A (zh) | 一种三维重建方法及相关装置 | |
US11188787B1 (en) | End-to-end room layout estimation | |
CN113763536A (zh) | 一种基于rgb图像的三维重建方法 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20891171 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20891171 Country of ref document: EP Kind code of ref document: A1 |