WO2021098554A1 - Feature extraction method, apparatus, device and storage medium - Google Patents

Feature extraction method, apparatus, device and storage medium

Info

Publication number
WO2021098554A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
features
convolutional
depth image
layer
Prior art date
Application number
PCT/CN2020/127867
Other languages
English (en)
French (fr)
Inventor
周扬
刘杰
Original Assignee
Oppo广东移动通信有限公司
Priority date
Filing date
Publication date
Application filed by Oppo广东移动通信有限公司
Publication of WO2021098554A1
Priority to US17/745,565 (published as US20220277475A1)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/70: Determining position or orientation of objects or cameras
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T 3/00: Geometric image transformations in the plane of the image
    • G06T 3/40: Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V 10/225: Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition based on a marking or identifier characterising the area
    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/40: Extraction of image or video features
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/771: Feature selection, e.g. selecting representative features from a multi-dimensional feature space
    • G06V 10/7715: Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107: Static hand or arm
    • G06V 40/11: Hand-related biometrics; Hand pose recognition
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47: End-user applications
    • H04N 21/472: End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N 21/4728: End-user interface for requesting content, additional data or services for selecting a Region Of Interest [ROI], e.g. for requesting a higher resolution version of a selected region
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10028: Range image; Depth image; 3D point clouds
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20016: Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/30: Subject of image; Context of image processing
    • G06T 2207/30196: Human being; Person

Definitions

  • the present invention relates to image processing technology, and in particular to a feature extraction method, apparatus, device and storage medium.
  • gesture recognition technology has broad market application prospects in many fields such as immersive virtual reality, augmented reality, robot control and sign language recognition.
  • this technology has made considerable progress.
  • embodiments of the present invention provide a feature extraction method, apparatus, device, and storage medium.
  • an embodiment of the present application provides a feature extraction method, including: extracting features of the depth image to be recognized, and determining the basic features of the depth image; extracting multiple features of different scales from the basic features, and determining the multi-scale features of the depth image; and up-sampling the multi-scale features to determine target features, wherein the target features are used to determine the bounding box of the region of interest in the depth image.
  • an embodiment of the present application provides a feature extraction device, the feature extraction device includes: a first extraction part configured to extract features of a depth image to be recognized, and determine basic features of the depth image;
  • the second extraction part is configured to extract multiple different-scale features of the basic feature, and determine the multi-scale feature of the depth image;
  • the up-sampling part is configured to up-sampling the multi-scale features to determine target features; wherein, the target features are used to determine the bounding box of the region of interest in the depth image.
  • an embodiment of the present application provides a feature extraction device.
  • the feature extraction device includes a first processor and a first memory configured to store a computer program that can run on the first processor.
  • the first memory is used to store a computer program
  • the first processor is used to call and run the computer program stored in the first memory to execute the steps of the method in the first aspect described above.
  • embodiments of the present application provide a computer-readable storage medium, where the computer-readable storage medium is used to store a computer program that enables a computer to execute the steps of the method in the first aspect described above.
  • an embodiment of the present application provides a pose estimation method, including: extracting features of the depth image to be recognized, and determining the basic features of the depth image; extracting multiple features of different scales from the basic features, and determining the multi-scale features of the depth image; up-sampling the multi-scale features to determine target features; extracting the bounding box of the region of interest based on the target features; extracting the coordinate information of the key points in the region of interest based on the bounding box; and performing pose estimation of the detection object based on the coordinate information of the key points in the region of interest to determine the pose estimation result.
  • an embodiment of the present application provides a pose estimation device, the pose estimation device includes: a third extraction part, a bounding box detection part, a fourth extraction part, and a pose estimation part; wherein,
  • the third extraction part is configured to perform the steps of the method of the fifth aspect described above to determine the target feature of the depth image to be recognized;
  • the bounding box detection part is configured to extract the bounding box of the region of interest based on the target feature
  • the fourth extraction part is configured to extract position information of key points in the region of interest based on the bounding box;
  • the pose estimation part is used to estimate the pose of the detection object based on the position information of the key points in the region of interest.
  • an embodiment of the present application provides an attitude estimation device, the attitude estimation device includes: a second processor and a second memory configured to store a computer program that can run on the second processor,
  • the second memory is used to store a computer program
  • the second processor is used to call and run the computer program stored in the second memory to execute the steps of the method in the fifth aspect described above.
  • an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium is used to store a computer program that enables a computer to execute the steps of the method in the fifth aspect described above.
  • the embodiments of the present application provide a feature extraction method, apparatus, device, and storage medium, and a pose estimation method, apparatus, device, and storage medium using the feature extraction method.
  • in the feature extraction stage, the basic features of the depth image are determined by extracting features of the depth image to be recognized; multiple features of different scales are then extracted from the basic features to determine the multi-scale features of the depth image; finally, the multi-scale features are up-sampled to enrich the features further.
  • in this way, more kinds of features can be extracted from the depth image with this feature extraction method.
  • when pose estimation is performed based on this feature extraction method, the rich features can also improve the accuracy of the subsequent bounding box detection and pose estimation.
  • Fig. 1 is a schematic diagram of an image taken by a TOF camera provided by a related technical solution
  • FIG. 2 is a schematic diagram of a detection result of a hand bounding box provided by related technical solutions
  • Fig. 3 is a schematic diagram of the key point positions of a hand skeleton provided by related technical solutions
  • FIG. 4 is a schematic diagram of a two-dimensional hand posture estimation result provided by related technical solutions
  • FIG. 5 is a schematic flow diagram of a traditional hand gesture detection provided by related technical solutions
  • Fig. 6 is a schematic structural diagram of RoIAlign feature extraction provided by related technical solutions.
  • FIG. 7 is a schematic diagram of a non-maximum suppression structure provided by related technical solutions.
  • Fig. 8 is a schematic diagram of a structure of union and intersection provided by related technical solutions.
  • Figure 9 is a schematic diagram of the AlexNet network structure
  • FIG. 10 shows a schematic flow chart of hand posture estimation in an embodiment of the present application.
  • FIG. 11 shows a schematic flowchart of a feature extraction method in an embodiment of the present application.
  • FIG. 12 shows a schematic flowchart of another feature extraction method in an embodiment of the present application.
  • FIG. 13 shows a schematic diagram of the composition structure of a backbone feature extractor in an embodiment of the present application
  • FIG. 14 shows a schematic diagram of the composition structure of a feature extraction device in an embodiment of the present application.
  • FIG. 15 shows a schematic diagram of the composition structure of a feature extraction device in an embodiment of the present application.
  • FIG. 16 shows a schematic flowchart of a posture estimation method in an embodiment of the present application
  • FIG. 17 shows a schematic diagram of the composition structure of a posture estimation apparatus in an embodiment of the present application.
  • FIG. 18 shows a schematic diagram of the composition structure of a posture estimation device in an embodiment of the present application.
  • Hand posture estimation mainly refers to accurately estimating the three-dimensional coordinate position of the human hand skeleton node from the image. This is a key issue in the field of computer vision and human-computer interaction, and is of great significance in the fields of virtual reality, augmented reality, non-contact interaction, and gesture recognition. With the rise and development of commercial and inexpensive depth cameras, great progress has been made in hand posture estimation.
  • depth cameras include structured light, laser scanning and TOF cameras, among others; in most cases, the term refers to a TOF camera.
  • TOF is short for Time of Flight.
  • three-dimensional (3D) imaging by the time-of-flight method continuously sends light pulses to the target, uses a sensor to receive the light returned from the object, and obtains the distance to the object by detecting the round-trip flight time of the light pulses.
  • a TOF camera is a range imaging camera system; it uses the time-of-flight method to measure the round-trip time of an artificial light signal emitted by a laser or light-emitting diodes (LEDs), and from this computes the distance between the camera and each point of the photographed object in the image.
  • the TOF camera outputs an image with a size of H × W.
  • each pixel value in this two-dimensional (2D) image represents the depth value at that pixel; the pixel values range from 0 to 3000 millimeters (mm).
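  • As an illustrative sketch only (the handling of zero pixels, the normalisation, and the array values below are assumptions for illustration, not details of the embodiments), such a depth map can be represented and pre-processed as follows:

```python
import numpy as np

# A TOF depth image: a single-channel H x W array whose pixel values are
# depths in millimetres, here assumed to lie in the 0-3000 mm range above.
H, W = 180, 240                                   # illustrative 240 x 180 image
depth = np.random.randint(0, 3001, size=(H, W)).astype(np.float32)

# Zero-valued pixels are treated as invalid (no return signal) and masked out;
# the remaining depths are normalised to [0, 1] before being fed to a network.
valid = depth > 0
normalized = np.where(valid, depth / 3000.0, 0.0)
```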
  • Fig. 1 shows a schematic diagram of an image taken by a TOF camera provided by a related technical solution.
  • the image captured by the TOF camera may be referred to as a depth image.
  • TOF cameras provided by manufacturer O have the following differences: (1) the TOF camera can be installed inside a smartphone instead of being fixed on a static bracket; (2) compared with TOF cameras from other manufacturers (such as Microsoft Kinect or Intel RealSense), it has lower power consumption; (3) it has a lower image resolution, such as 240 × 180, whereas a typical value is 640 × 480.
  • the input of hand detection is a depth image
  • the output is the probability of the existence of the hand (that is, a number between 0 and 1, where a larger value indicates a greater confidence that a hand is present) and a hand bounding box (that is, the bounding box representing the position and size of the hand).
  • Fig. 2 shows a schematic diagram of a detection result of a hand bounding box provided by related technical solutions. As shown in Figure 2, the black rectangular box is the hand bounding box, and the score of the hand bounding box is as high as 0.999884.
  • in the embodiments of the present application, such a detection box is simply referred to as a bounding box.
  • the bounding box can be expressed as (xmin, ymin, xmax, ymax), where (xmin, ymin) represents the position of the upper left corner of the bounding box, and (xmax, ymax) represents the position of the lower right corner of the bounding box.
  • the input is a depth image
  • the output is the two-dimensional key point position of the hand skeleton.
  • An example of the key point position of the hand skeleton is shown in FIG. 3.
  • the hand skeleton can be provided with 20 key points, and the position of each key point is shown as 0-19 in Fig. 3.
  • the position of each key point can be represented by 2D coordinate information (x, y), where x is the coordinate information in the direction of the horizontal image axis, and y is the coordinate information in the direction of the vertical image axis.
  • the posture estimation result of a two-dimensional hand is shown in FIG. 4.
  • the input is still a depth image
  • the output is the 3D key point position of the hand skeleton.
  • An example of the key point position of the hand skeleton is still shown in Figure 3.
  • the position of each key point can be represented by 3D coordinate information (x, y, z), where x is the coordinate information in the horizontal image axis direction, y is the coordinate information in the vertical image axis direction, and z is the coordinate information in the depth direction.
  • the embodiment of the present application mainly solves the problem of the pose estimation of the three-dimensional hand.
  • a typical hand posture detection process can include a hand detection part and a hand posture estimation part.
  • the hand detection part can include a backbone feature extractor and a bounding box detection head module
  • the hand posture estimation part can include a backbone feature extractor and a pose estimation head module.
  • FIG. 5 shows a schematic flow chart of a traditional hand posture estimation provided by related technical solutions. As shown in Figure 5, after obtaining an original depth image including hands, hand detection is performed first, that is, the backbone feature extractor and bounding box detection head module included in the hand detection part are used for detection.
  • the tasks of hand detection and hand pose estimation are completely separate. In order to connect these two tasks, the position of the output bounding box is adjusted to the centroid of the pixels in the bounding box, and the size of the bounding box is slightly enlarged to include all the hand pixels.
  • the adjusted bounding box is used to crop the original depth image, and the cropped image is then input into the hand pose estimation task.
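  • A minimal sketch of this adjustment is given below, assuming that non-zero pixels inside the box are the hand pixels and using an enlargement factor of 1.2; the function name and these choices are illustrative assumptions rather than details from the related technical solutions.

```python
import numpy as np

def adjust_and_crop(depth, box, scale=1.2):
    """Recenter box = (xmin, ymin, xmax, ymax) on the centroid of the non-zero
    pixels it contains, enlarge it by `scale`, and crop the depth image."""
    xmin, ymin, xmax, ymax = [int(v) for v in box]
    roi = depth[ymin:ymax, xmin:xmax]
    ys, xs = np.nonzero(roi)                        # assumed "hand" pixels
    if len(xs) == 0:
        return roi                                  # nothing to recenter on
    cx, cy = xmin + xs.mean(), ymin + ys.mean()     # centroid in image coords
    w, h = (xmax - xmin) * scale, (ymax - ymin) * scale
    x0, x1 = int(max(cx - w / 2, 0)), int(min(cx + w / 2, depth.shape[1]))
    y0, y1 = int(max(cy - h / 2, 0)), int(min(cy + h / 2, depth.shape[0]))
    return depth[y0:y1, x0:x1]
```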
  • because the backbone feature extractor is used twice to extract image features, this causes repeated computation and increases the computational cost.
  • ROIAlign is a region of interest (region of interest, ROI) feature aggregation method, which can well solve the problem of region mismatch caused by two quantizations in the ROI Pool operation.
  • replacing ROI Pool with ROIAlign can improve the accuracy of the detection results.
  • the RoIAlign layer eliminates the strict quantization of the RoI Pool, and correctly aligns the extracted features with the input.
  • any quantization of the RoI boundary or area can be avoided, for example, x/16 can be used here instead of [x/16].
  • bilinear interpolation can also be used to calculate the exact values of the input features at four regularly spaced sampling positions in each RoI bin, and the results can be aggregated (using the maximum value or the average value), as shown in Figure 6.
  • the dashed grid represents a feature map;
  • the bold solid line represents a RoI (for example, with 2 × 2 bins);
  • 4 sampling points are marked in each bin.
  • RoIAlign uses adjacent grid points on the feature map to perform bilinear interpolation and obtain the value of each sampling point; no quantization is performed on any of the coordinates involved in the RoI bounding box or the sampling points. It should also be noted that, as long as no quantization is performed, the detection result is not sensitive to the exact sampling positions or to the number of sampling points.
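  • A simplified, single-channel sketch of this bilinear sampling (with average pooling over 4 sample points per bin) is shown below; it is meant to illustrate the idea rather than reproduce a library implementation such as torchvision's roi_align, and the helper names are assumptions.

```python
import numpy as np

def bilinear(feat, x, y):
    """Bilinearly interpolate feature map `feat` (H x W) at continuous (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, feat.shape[1] - 1), min(y0 + 1, feat.shape[0] - 1)
    dx, dy = x - x0, y - y0
    return (feat[y0, x0] * (1 - dx) * (1 - dy) + feat[y0, x1] * dx * (1 - dy)
            + feat[y1, x0] * (1 - dx) * dy + feat[y1, x1] * dx * dy)

def roi_align(feat, roi, bins=2):
    """Average 4 regularly spaced bilinear samples in each of bins x bins cells.
    `roi` = (xmin, ymin, xmax, ymax) in unquantised feature-map coordinates."""
    xmin, ymin, xmax, ymax = roi
    bw, bh = (xmax - xmin) / bins, (ymax - ymin) / bins
    out = np.zeros((bins, bins))
    for by in range(bins):
        for bx in range(bins):
            samples = [bilinear(feat,
                                xmin + (bx + sx) * bw,
                                ymin + (by + sy) * bh)
                       for sx in (0.25, 0.75) for sy in (0.25, 0.75)]
            out[by, bx] = np.mean(samples)
    return out
```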
  • NMS is short for Non-Maximum Suppression, which is used to select a single bounding box from multiple overlapping candidate boxes.
  • FIG. 8 shows a schematic diagram of union and intersection provided by related technical solutions.
  • two bounding boxes are given, denoted by BB1 and BB2, respectively.
  • the black area in (a) is the intersection of BB1 and BB2, denoted by BB1 ∩ BB2, that is, the overlapping area of BB1 and BB2;
  • the black area in (b) is the union of BB1 and BB2, denoted by BB1 ∪ BB2, that is, the combined area of BB1 and BB2.
  • the intersection over union (denoted by IoU) is calculated as IoU = area(BB1 ∩ BB2) / area(BB1 ∪ BB2).
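  • The IoU computation and the greedy non-maximum suppression built on it can be sketched as follows, assuming boxes in the (xmin, ymin, xmax, ymax) form described earlier; the helper names and the 0.5 threshold are illustrative assumptions.

```python
def iou(b1, b2):
    """Intersection over union of boxes given as (xmin, ymin, xmax, ymax)."""
    ix0, iy0 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix1, iy1 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / max(area1 + area2 - inter, 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop any remaining box overlapping it by
    more than `iou_threshold`, and repeat on the remainder (greedy NMS)."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```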
  • FIG. 9 is a schematic diagram of the AlexNet network structure.
  • as shown in FIG. 9, the input image passes sequentially through 5 convolutional layers (i.e. Conv1, Conv2, Conv3, Conv4, and Conv5) and then through 3 fully connected layers (i.e. FC6, FC7 and FC8).
  • an embodiment of the present application provides a feature extraction method, which can be applied to a backbone feature extractor.
  • FIG. 10 shows a schematic diagram of the flow of hand pose estimation in an embodiment of this application.
  • the backbone feature extractor provided by the embodiment of this application is placed after the input and before the bounding box detection head module, and it can extract more image features for hand detection and hand pose estimation.
  • the detection network is more compact and more suitable for deployment on mobile devices.
  • the method may include:
  • Step 111 Extract the features of the depth image to be recognized, and determine the basic features of the depth image
  • before this step, the method further includes: acquiring a depth image that is collected by a depth camera and contains the detection object.
  • the depth camera can exist independently or integrated on the electronic device.
  • the depth camera can be a TOF camera, a structured light depth camera, or a binocular stereo vision camera; at present, TOF cameras are widely used in mobile terminals.
  • the feature extraction network includes at least one convolutional layer and at least one pooling layer connected alternately, with the starting layer being a convolutional layer; the convolution kernels of the convolutional layers may be the same or different, and the kernel size of each pooling layer is the same.
  • the convolution kernel of a convolutional layer may be any one of 1 × 1, 3 × 3, 5 × 5, or 7 × 7, and the kernel of a pooling layer may likewise be any one of these sizes.
  • pooling can be Max pooling or Average pooling, which is not specifically limited in this application.
  • the basic features include at least one of color feature, texture feature, shape feature, spatial relationship feature, and contour feature.
  • the basic features have a higher resolution and contain more positional and detailed information, which provides more useful information for localization and segmentation; this allows the higher-level network to obtain the context information of the image more easily and comprehensively based on the basic features, so that the context information can be used to improve the accuracy of the subsequent ROI bounding box localization.
  • the basic features can also be referred to as low-level features of the image.
  • the expression form of the features in the embodiments of the present application may include, but is not limited to, for example, a feature map, a feature vector, or a feature matrix.
  • Step 112 Extract multiple different-scale features of the basic feature, and determine the multi-scale feature of the depth image
  • the multi-scale features are obtained by applying multiple convolutions of set sizes and combining the convolution outputs, which yields different image features at multiple scales.
  • image features at different scales can be extracted from basic features through the established multi-scale feature extraction network.
  • the multi-scale feature extraction network includes N convolutional networks connected sequentially, and N takes an integer greater than 1.
  • the N convolutional networks can be the same convolutional network or different convolutional networks.
  • the input of the first convolutional network is the basic features, the input of each subsequent convolutional network is the output of the previous convolutional network, and the output of the Nth convolutional network is the final output of the multi-scale feature extraction network.
  • the N convolutional networks are the same convolutional network, that is, N repeated convolutional networks are sequentially connected, which helps reduce the network complexity and the amount of computation.
  • the input features and the initial output features of each convolutional network are fused, and the fused features are used as the final output features of that convolutional network. For example, adding jump connections in each convolutional network and fusing the input features with the initial output features can alleviate the vanishing-gradient problem when the network is deep; it also helps the back propagation of gradients and speeds up the training process.
  • Step 113 Up-sampling the multi-scale feature to determine a target feature; wherein the target feature is used to determine the bounding box of the region of interest in the depth image.
  • up-sampling refers to any technique that converts an image to a higher resolution. Up-sampling the multi-scale features yields more detailed image features, which is beneficial for subsequent bounding box detection.
  • the simplest way is resampling and interpolation: the input is rescaled to the desired size, the value at each output point is calculated, and interpolation methods such as bilinear interpolation are used to fill in the remaining points to complete the up-sampling process.
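  • As one illustration of this resampling-and-interpolation option, PyTorch's built-in bilinear interpolation can rescale a feature map; the tensor sizes below are illustrative, not taken from the embodiments.

```python
import torch
import torch.nn.functional as F

# Spatial up-sampling by resampling + bilinear interpolation, as described
# above: a (N, C, H, W) feature map is rescaled to twice its resolution.
feat = torch.randn(1, 128, 15, 12)
up = F.interpolate(feat, scale_factor=2, mode="bilinear", align_corners=False)
print(up.shape)   # torch.Size([1, 128, 30, 24])
```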
  • the posture estimation of the detection object is performed based on the coordinate information of the key points in the region of interest to determine the posture estimation result.
  • the detection object includes a hand; the key points include at least one of the following: finger joint points, fingertip points, wrist key points, and palm center points.
  • the key nodes of the hand skeleton are the key points.
  • the hand includes 20 key points. The specific positions of these 20 key points in the hand are shown in Figure 3.
  • the detection object includes a human face; the key points include at least one of the following: eye points, eyebrow points, mouth points, nose points, and face contour points.
  • the key points of the face are specifically the key points of the facial features. There can be 5 key points, 21 key points, 68 key points, 98 key points, etc.
  • the detection object includes a human body; the key points include at least one of the following: head points, limb joint points, and torso points, and there may be 28 key points.
  • the feature extraction method provided in the embodiments of the present application is applied in a feature extraction device or an electronic device integrated with the device.
  • the electronic device may be a smart phone, a tablet computer, a notebook computer, a palmtop computer, a personal digital assistant (PDA), a navigation device, a wearable device, a desktop computer, etc., which are not limited in the embodiment of the present application.
  • PDA personal digital assistant
  • the feature extraction method of this application is applied in the field of image recognition.
  • the extracted image features can participate in the posture estimation of the entire human body or the local posture estimation.
  • the embodiments of this application mainly introduce how to estimate the pose of the hand; pose estimation of other parts that applies the feature extraction method of the embodiments of this application is likewise within the protection scope of this application.
  • in the feature extraction stage, the basic features of the depth image are determined by extracting features of the depth image to be recognized; multiple features of different scales are then extracted from the basic features to determine the multi-scale features of the depth image; finally, the multi-scale features are up-sampled to enrich the features further.
  • in this way, more kinds of features can be extracted from the depth image with this feature extraction method.
  • when pose estimation is performed based on this feature extraction method, the rich features can also improve the accuracy of the subsequent bounding box detection and pose estimation.
  • FIG. 12 shows a schematic flowchart of another feature extraction method provided by an embodiment of the present application. As shown in Figure 12, the method may include:
  • Step 121 Input the depth image to be recognized into the feature extraction network for multiple downsampling, and output the basic features of the depth image;
  • the feature extraction network includes at least one convolutional layer and at least one pooling layer connected alternately, and the starting layer is a convolutional layer.
  • the convolution kernel of the convolution layer close to the input end in the at least one convolution layer is greater than or equal to the convolution kernel of the convolution layer far away from the input end.
  • the convolution kernel may be any one of 1 × 1, 3 × 3, 5 × 5, or 7 × 7, and the kernel of the pooling layer may likewise be any one of these sizes.
  • by choosing the convolution kernels in this way, the embodiment of the present application reduces the feature map size and the amount of computation layer by layer, striking a good balance between image features and computational cost: on the basis of extracting more basic features, the amount of computation is kept suitable for the processing capability of a mobile terminal.
  • FIG. 13 shows a schematic diagram of the composition structure of the backbone feature extractor in an embodiment of the present application.
  • the backbone feature extractor includes a feature extraction network, a multi-scale feature extraction network, and an upsampling network.
  • a feature extraction network including 2 convolutional layers and 2 pooling layers is specifically given, consisting of: a 7 × 7 × 48 Conv1 (that is, the convolution kernel is 7 × 7 and the number of channels is 48; s2 indicates a stride of 2, which down-samples the two-dimensional data of the input depth image by a factor of two), a 3 × 3 Pool1, a 5 × 5 × 128 Conv2, and a 3 × 3 Pool2.
  • a 240 × 180 depth image is first input into the 7 × 7 × 48 Conv1, which outputs a 120 × 90 × 48 feature map; this is then input into the 3 × 3 Pool1, which outputs a 60 × 45 × 48 feature map; then into the 5 × 5 × 128 Conv2, which outputs a 30 × 23 × 128 feature map; and finally into the 3 × 3 Pool2, which outputs a 15 × 12 × 128 feature map.
  • each convolution or pooling operation down-samples by a factor of two, so the input depth image is down-sampled by a total factor of 16; this down-sampling greatly reduces the computational cost.
  • using large convolution kernels such as 7 × 7 and 5 × 5 can quickly expand the receptive field and extract more image features.
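  • A minimal PyTorch sketch of this basic feature extraction network is given below; the single input channel, the ReLU activations and the padding values are assumptions (they are not specified in the text) chosen so that a 240 × 180 input reduces to the 15 × 12 × 128 map described above.

```python
import torch
import torch.nn as nn

# Basic feature extractor sketch:
# Conv1 7x7x48 s2 -> Pool1 3x3 s2 -> Conv2 5x5x128 s2 -> Pool2 3x3 s2.
basic_extractor = nn.Sequential(
    nn.Conv2d(1, 48, kernel_size=7, stride=2, padding=3),    # Conv1: 120 x 90 x 48
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),         # Pool1: 60 x 45 x 48
    nn.Conv2d(48, 128, kernel_size=5, stride=2, padding=2),   # Conv2: 30 x 23 x 128
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),         # Pool2: 15 x 12 x 128
)

x = torch.randn(1, 1, 180, 240)           # one-channel 240 x 180 depth image
print(basic_extractor(x).shape)           # torch.Size([1, 128, 12, 15])
```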
  • before this step, the method further includes: acquiring a depth image that is collected by a depth camera and contains the detection object.
  • Depth cameras can exist independently or integrated in electronic devices. Depth cameras can be TOF cameras, structured light depth cameras, and binocular stereo vision. TOF cameras are currently widely used in mobile terminals.
  • Step 122 Input the basic features into a multi-scale feature extraction network, and output the multi-scale features of the depth image;
  • the multi-scale feature extraction network includes N convolutional networks connected in sequence, and N takes an integer greater than 1.
  • each convolutional network includes at least two convolutional branches and a fusion network, and different convolutional branches are used to extract features of different scales;
  • the inputting of the basic features into a multi-scale feature extraction network and outputting of the multi-scale features of the depth image includes:
  • inputting the output features of the (i-1)th convolutional network into the ith convolutional network, and outputting the features of the at least two branches of the ith convolutional network, where i is an integer from 1 to N and, when i equals 1, the input feature of the first convolutional network is the basic features; inputting the output features and input features of the ith convolutional network into the fusion network for feature fusion, and outputting the output features of the ith convolutional network; if i is less than N, continuing to input the output features of the ith convolutional network into the (i+1)th convolutional network;
  • if i is equal to N, the Nth convolutional network outputs the multi-scale features of the depth image.
  • the number of channels of the output feature of the convolutional network should be the same as the number of channels of the input feature in order to perform feature fusion.
  • each convolutional network is used to extract a different type of feature, and the later a feature is extracted, the more abstract it is. For example, an earlier convolutional network may extract more local features, such as finger features, while a later one extracts more global features, such as features of the entire hand; using N repeated convolution kernel groups therefore extracts more different kinds of features. Similarly, the different convolution branches within a convolutional network also extract different types of features: some branches extract more detailed features, and some branches extract more global features.
  • the convolutional network includes four convolutional branches; wherein,
  • the first convolution branch includes a first convolution layer
  • the second convolution branch includes a first pooling layer and a second convolution layer that are sequentially connected;
  • the third convolution branch includes a third convolution layer and a fourth convolution layer that are sequentially connected;
  • the fourth convolution branch includes a fifth convolution layer, a sixth convolution layer, and a seventh convolution layer that are sequentially connected;
  • the number of channels of the first convolutional layer, the second convolutional layer, the fourth convolutional layer, and the seventh convolutional layer are equal; the number of channels of the third convolutional layer and the fifth convolutional layer are the same and are smaller than the number of channels of the fourth convolutional layer.
  • the small number of channels of the third convolutional layer and the fifth convolutional layer is for channel down-sampling the input features to reduce the calculation amount of subsequent convolution processing, which is more suitable for mobile devices.
  • a good balance can be made between image features and calculations.
  • the calculations can be guaranteed to be suitable for the processing capabilities of mobile terminals.
  • the convolution kernels of the first convolutional layer, the second convolutional layer, the third convolutional layer, and the fifth convolutional layer are equal, and the fourth convolutional layer, the The convolution kernels of the sixth convolutional layer and the seventh convolutional layer are equal.
  • the convolution kernels of the first to seventh convolutional layers can be any one of 1 × 1, 3 × 3, or 5 × 5, and the kernel of the first pooling layer may likewise be any one of these sizes.
  • FIG. 13 shows a schematic diagram of the composition structure of the backbone feature extractor in an embodiment of the present application.
  • the backbone feature extractor includes a feature extraction network, a multi-scale feature extraction network, and an upsampling network.
  • a multi-scale feature extraction network including 3 repeated convolutional networks is specifically given, in which: the first convolution branch includes a 1 × 1 × 32 Conv, that is, the convolution kernel is 1 × 1 and the number of channels is 32;
  • the second convolution branch includes a sequentially connected 3 × 3 Pool and 1 × 1 × 32 Conv;
  • the third convolution branch includes a sequentially connected 1 × 1 × 24 Conv and 3 × 3 × 32 Conv;
  • the fourth convolution branch includes a sequentially connected 1 × 1 × 24 Conv, 3 × 3 × 32 Conv, and 3 × 3 × 32 Conv.
  • Each convolutional network also adds a jump connection (that is, a fusion network), which is used to fuse the input features and output features, so as to achieve a smoother gradient flow during the training process.
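  • A minimal PyTorch sketch of one such four-branch convolutional network with its jump connection is given below; the stride-1 pooling, 'same' padding and ReLU placements are assumptions made so that every branch preserves the 15 × 12 spatial size.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """One four-branch convolutional network: each branch outputs 32 channels,
    the concatenation gives 128 channels, and a jump connection adds the input."""

    def __init__(self, channels=128):
        super().__init__()
        self.branch1 = nn.Conv2d(channels, 32, 1)                       # 1x1x32
        self.branch2 = nn.Sequential(                                   # 3x3 Pool + 1x1x32
            nn.MaxPool2d(3, stride=1, padding=1), nn.Conv2d(channels, 32, 1))
        self.branch3 = nn.Sequential(                                   # 1x1x24 + 3x3x32
            nn.Conv2d(channels, 24, 1), nn.ReLU(inplace=True),
            nn.Conv2d(24, 32, 3, padding=1))
        self.branch4 = nn.Sequential(                                   # 1x1x24 + 3x3x32 + 3x3x32
            nn.Conv2d(channels, 24, 1), nn.ReLU(inplace=True),
            nn.Conv2d(24, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1))

    def forward(self, x):
        out = torch.cat([self.branch1(x), self.branch2(x),
                         self.branch3(x), self.branch4(x)], dim=1)      # 4 x 32 = 128 channels
        return out + x                                                  # jump (skip) connection
```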
  • Step 123 Up-sampling the multi-scale features to determine target features; wherein, the target features are used to determine the bounding box of the region of interest in the depth image.
  • the multi-scale feature is input to the eighth convolutional layer, and the target feature is output; wherein the number of channels of the eighth convolutional layer is M times the number of channels of the multi-scale feature, and M is greater than 1.
  • M may be an integer or a non-integer, as long as it is greater than 1.
  • FIG. 13 shows a schematic diagram of the composition structure of the backbone feature extractor in an embodiment of the present application.
  • the backbone feature extractor includes a feature extraction network, a multi-scale feature extraction network, and an upsampling network.
  • the upsampling network contains a 1 × 1 × 256 convolutional layer (Conv).
  • that is, a 1 × 1 × 256 convolution kernel is applied, which maps the input 15 × 12 × 128 feature map to a 15 × 12 × 256 feature map, thereby up-sampling the feature channels.
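  • Reusing the basic_extractor and MultiScaleBlock sketched earlier, the three components of FIG. 13 can be assembled as follows; the composition into a single nn.Sequential and the absence of extra activations are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Backbone sketch: basic feature extractor, three repeated multi-scale blocks,
# and the 1x1x256 channel up-sampling convolution.
backbone = nn.Sequential(
    basic_extractor,                        # 240 x 180 x 1  -> 15 x 12 x 128
    MultiScaleBlock(128),
    MultiScaleBlock(128),
    MultiScaleBlock(128),                   # 15 x 12 x 128  -> 15 x 12 x 128
    nn.Conv2d(128, 256, kernel_size=1),     # 15 x 12 x 128  -> 15 x 12 x 256
)

target_features = backbone(torch.randn(1, 1, 180, 240))
print(target_features.shape)                # torch.Size([1, 256, 12, 15])
```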
  • the posture estimation of the detection object is performed based on the coordinate information of the key points in the region of interest, and the posture estimation result is determined.
  • the feature extraction method mainly includes the following design rules:
  • the network pipeline of the present invention includes three main components, specifically including: a basic feature extractor, a multi-scale feature extractor, and a feature up-sampling network.
  • the network architecture is shown in Figure 13.
  • Rule #2: for rule #1, the basic feature extractor is used to extract low-level image features (i.e. basic features).
  • as shown in FIG. 13, a 240 × 180 depth image is first input into the 7 × 7 × 48 Conv1, which outputs a 120 × 90 × 48 feature map; this is then input into the 3 × 3 Pool1, which outputs a 60 × 45 × 48 feature map; then into the 5 × 5 × 128 Conv2, which outputs a 30 × 23 × 128 feature map; and finally into the 3 × 3 Pool2, which outputs a 15 × 12 × 128 feature map.
  • the input is down-sampled by a total factor of 16, which greatly reduces the computational cost.
  • large convolution kernels (such as 7 × 7 and 5 × 5) are used to quickly expand the receptive field.
  • the multi-scale feature extractor contains three repeated convolution kernel groups to extract more different features.
  • in each convolution kernel group there are four branches; each branch extracts one type of image feature, and the four branches (each outputting a 32-channel feature map) are combined into a 128-channel feature map.
  • Rule #4: for rule #3, an additional jump connection is added to the 128-channel feature map to achieve a smoother gradient flow during the training process.
  • Rule #5: for rule #1, a 1 × 1 × 256 convolution kernel is added, and the feature map is up-sampled from 15 × 12 × 128 to 15 × 12 × 256; by applying feature channel up-sampling, more features can be generated.
  • in the feature extraction stage, the basic features of the depth image are determined by extracting features of the depth image to be recognized; multiple features of different scales are then extracted from the basic features to determine the multi-scale features of the depth image; finally, the multi-scale features are up-sampled to enrich the features further.
  • in this way, more kinds of features can be extracted from the depth image with this feature extraction method.
  • when pose estimation is performed based on this feature extraction method, the rich features can also improve the accuracy of the subsequent bounding box detection and pose estimation.
  • the embodiment of the present application also provides a feature extraction device.
  • the feature extraction device includes:
  • the first extraction part 141 is configured to extract features of the depth image to be recognized, and determine the basic features of the depth image;
  • the second extraction part 142 is configured to extract multiple different-scale features of the basic feature, and determine the multi-scale feature of the depth image;
  • the up-sampling part 143 is configured to up-sample the multi-scale features to determine target features; wherein, the target features are used to determine the bounding box of the region of interest in the depth image.
  • the first extraction part 141 is configured to input the depth image to be recognized into a feature extraction network for multiple downsampling, and output basic features of the depth image; wherein the feature extraction network includes at least one convolutional layer and at least one pooling layer connected alternately, and the starting layer is a convolutional layer.
  • the convolution kernel of the convolution layer close to the input end in the at least one convolution layer is greater than or equal to the convolution kernel of the convolution layer far away from the input end.
  • the feature extraction network includes 2 convolutional layers and 2 pooling layers; wherein the convolution kernel of the first convolutional layer is 7 × 7, and the convolution kernel of the second convolutional layer is 5 × 5.
  • the second extraction part 142 is configured to input the basic features into a multi-scale feature extraction network and output the multi-scale features of the depth image; wherein the multi-scale feature extraction network includes sequential connections N convolutional networks, N takes an integer greater than 1.
  • each convolutional network includes at least two convolutional branches and a fusion network, and different convolutional branches are used to extract features of different scales;
  • the second extraction part 142 is configured to input the output features of the (i-1)th convolutional network into the ith convolutional network, and output the features of the at least two branches of the ith convolutional network; where i is an integer from 1 to N.
  • when i equals 1, the input feature of the first convolutional network is the basic features; the output features and the input features of the ith convolutional network are input to the fusion network for feature fusion, and the output features of the ith convolutional network are output; if i is less than N, the output features of the ith convolutional network continue to be input to the (i+1)th convolutional network; if i is equal to N, the Nth convolutional network outputs the multi-scale features of the depth image.
  • the convolutional network includes four convolutional branches; wherein,
  • the first convolution branch includes a first convolution layer
  • the second convolution branch includes a first pooling layer and a second convolution layer that are sequentially connected;
  • the third convolution branch includes a third convolution layer and a fourth convolution layer that are sequentially connected;
  • the fourth convolution branch includes a fifth convolution layer, a sixth convolution layer, and a seventh convolution layer that are sequentially connected;
  • the number of channels of the first convolutional layer, the second convolutional layer, the fourth convolutional layer, and the seventh convolutional layer are equal; the number of channels of the third convolutional layer and the fifth convolutional layer are the same and are smaller than the number of channels of the fourth convolutional layer.
  • the first convolutional layer is 1 × 1 × 32;
  • the first pooling layer is 3 × 3, and the second convolutional layer is 1 × 1 × 32;
  • the third convolutional layer is 1 × 1 × 24, and the fourth convolutional layer is 3 × 3 × 32;
  • the fifth convolutional layer is 1 × 1 × 24;
  • the sixth convolutional layer is 3 × 3 × 32;
  • the seventh convolutional layer is 3 × 3 × 32.
  • the up-sampling part 143 is configured to input the multi-scale features into the eighth convolutional layer and output the target features; wherein the number of channels of the eighth convolutional layer is M times the number of channels of the multi-scale features, and M is greater than 1.
  • an embodiment of the present application also provides a feature extraction device.
  • the device includes: a first processor 151 and a first memory 152 configured to store a computer program that can run on the first processor 151.
  • the first memory 152 is used to store a computer program, and the first processor 151 is used to call and run the computer program stored in the first memory 152 to execute the steps of the feature extraction method in the foregoing embodiment.
  • the various components in the device are coupled together through the first bus system 153.
  • the first bus system 153 is used to implement connection and communication between these components.
  • the first bus system 153 also includes a power bus, a control bus, and a status signal bus.
  • various buses are marked as the first bus system 153 in FIG. 15.
  • An embodiment of the present invention provides a computer storage medium, where the computer storage medium stores computer-executable instructions, and when the computer-executable instructions are executed, the method steps of the foregoing embodiments are implemented.
  • if the foregoing device in the embodiment of the present invention is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the various embodiments of the present invention.
  • the aforementioned storage media include: a USB flash disk, a mobile hard disk, a read-only memory (ROM), a magnetic disk, an optical disk, or other media that can store program code. In this way, the embodiments of the present invention are not limited to any specific combination of hardware and software.
  • an embodiment of the present invention also provides a computer storage medium in which a computer program is stored, and the computer program is configured to execute the feature extraction method of the embodiment of the present invention.
  • a pose estimation method using the feature extraction method is also provided. As shown in Fig. 16, the method may include:
  • Step 161 Extract the features of the depth image to be recognized, and determine the basic features of the depth image
  • the depth image to be recognized is input into the feature extraction network for multiple downsampling, and the basic features of the depth image are output; wherein the feature extraction network includes at least one convolutional layer and at least one pooling layer connected alternately, and the starting layer is a convolutional layer.
  • the convolution kernel of the convolution layer close to the input end in the at least one convolution layer is greater than or equal to the convolution kernel of the convolution layer far away from the input end.
  • the feature extraction network includes 2 convolutional layers and 2 pooling layers; wherein the convolution kernel of the first convolutional layer is 7 × 7, and the convolution kernel of the second convolutional layer is 5 × 5.
  • Step 162 Extract multiple different-scale features of the basic feature, and determine the multi-scale feature of the depth image
  • the basic features are input into a multi-scale feature extraction network, and the multi-scale features of the depth image are output; wherein the multi-scale feature extraction network includes N convolutional networks connected in sequence, and N is an integer greater than 1.
  • each convolutional network includes at least two convolutional branches and a fusion network, and different convolutional branches are used to extract features of different scales;
  • the inputting of the basic features into a multi-scale feature extraction network and outputting of the multi-scale features of the depth image includes: inputting the output features of the (i-1)th convolutional network into the ith convolutional network, and outputting the features of the at least two branches of the ith convolutional network; where i is an integer from 1 to N, and when i is equal to 1, the input feature of the first convolutional network is the basic features; and
  • the output features and input features of the ith convolutional network are input to the fusion network for feature fusion, and the output features of the ith convolutional network are output; if i is less than N, the output features of the ith convolutional network continue to be input to the (i+1)th convolutional network; if i is equal to N, the Nth convolutional network outputs the multi-scale features of the depth image.
  • the convolutional network includes four convolutional branches; wherein,
  • the first convolution branch includes a first convolution layer
  • the second convolution branch includes a first pooling layer and a second convolution layer that are sequentially connected;
  • the third convolution branch includes a third convolution layer and a fourth convolution layer that are sequentially connected;
  • the fourth convolution branch includes a fifth convolution layer, a sixth convolution layer, and a seventh convolution layer that are sequentially connected;
  • the number of channels of the first convolutional layer, the second convolutional layer, the fourth convolutional layer, and the seventh convolutional layer are equal; the number of channels of the third convolutional layer and the fifth convolutional layer are the same and are smaller than the number of channels of the fourth convolutional layer.
  • the first convolutional layer is 1 × 1 × 32;
  • the first pooling layer is 3 × 3, and the second convolutional layer is 1 × 1 × 32;
  • the third convolutional layer is 1 × 1 × 24, and the fourth convolutional layer is 3 × 3 × 32;
  • the fifth convolutional layer is 1 × 1 × 24;
  • the sixth convolutional layer is 3 × 3 × 32;
  • the seventh convolutional layer is 3 × 3 × 32.
  • Step 163 Up-sampling the multi-scale feature to determine the target feature
  • the multi-scale feature is input to the eighth convolutional layer, and the target feature is output; wherein the number of channels of the eighth convolutional layer is M times the number of channels of the multi-scale feature, and M is greater than 1.
  • M may be an integer or a non-integer, as long as it is greater than 1.
  • Step 164 Extract the bounding box of the region of interest based on the target feature
  • the target feature is input into the bounding box detection head model, multiple candidate bounding boxes of the region of interest are determined, and then a bounding box is selected from the multiple candidate bounding boxes as the bounding box surrounding the region of interest.
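  • As an illustration of this selection step, the nms helper sketched earlier can pick the final box from a set of scored candidates; the candidate boxes and scores below are made-up values, not outputs of the embodiments.

```python
# Hypothetical candidate boxes (xmin, ymin, xmax, ymax) and confidence scores
# as they might be produced by a bounding box detection head.
candidates = [(30, 40, 120, 150), (32, 42, 118, 152), (200, 10, 230, 60)]
scores = [0.98, 0.91, 0.15]

kept = nms(candidates, scores, iou_threshold=0.5)   # reuses nms() sketched above
best_box = candidates[kept[0]]                      # highest-scoring surviving box
print(best_box)                                     # (30, 40, 120, 150)
```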
  • Step 165 Extract coordinate information of key points in the region of interest based on the bounding box
  • the region of interest is an image region selected from the image, and this region is the focus of image analysis, including the detection object.
  • the area is delimited for further processing of the detection object; using an ROI to delineate the detection object can reduce processing time and increase accuracy.
  • the detection object includes a hand; the key points include at least one of the following: finger joint points, fingertip points, wrist key points, and palm center points.
  • the key nodes of the hand skeleton are the key points.
  • the hand includes 20 key points. The specific positions of these 20 key points in the hand are shown in Figure 3.
  • the detection object includes a human face; the key points include at least one of the following: eye points, eyebrow points, mouth points, nose points, and face contour points.
  • the key points of the face are specifically the key points of the facial features. There can be 5 key points, 21 key points, 68 key points, 98 key points, etc.
  • the detection object includes a human body; the key points include at least one of the following: head points, limb joint points, and torso points, and there may be 28 key points.
  • Step 166 Perform pose estimation on the detection object based on the coordinate information of the key points in the region of interest, and determine the pose estimation result.
  • in the feature extraction stage, the basic features of the depth image are determined by extracting features of the depth image to be recognized; multiple features of different scales are then extracted from the basic features to determine the multi-scale features of the depth image; finally, the multi-scale features are up-sampled to enrich the features further.
  • in this way, more kinds of features can be extracted from the depth image with this feature extraction method.
  • when pose estimation is performed based on this feature extraction method, the rich features can also improve the accuracy of the subsequent bounding box detection and pose estimation.
  • the embodiment of the present application also provides a pose estimation device.
  • the pose estimation device includes: a third extraction part 171, a bounding box detection part 172, a fourth extraction part 173, and a pose estimation part 174.
  • the third extraction part 171 is configured to perform the steps of the aforementioned feature extraction method, and determine the target feature of the depth image to be recognized;
  • the bounding box detection part 172 is configured to extract the bounding box of the region of interest based on the target feature
  • the fourth extraction part 173 is configured to extract position information of key points in the region of interest based on the bounding box;
  • the pose estimation part 174 is used to estimate the pose of the detection object based on the position information of the key points in the region of interest.
  • an embodiment of the present application also provides an attitude estimation device.
  • the device includes: a second processor 181 and a second memory 182 configured to store a computer program that can run on the second processor 181.
  • the second memory 182 is used to store a computer program, and the second processor 181 is used to call and run the computer program stored in the second memory 182 to execute the steps of the pose estimation method in the foregoing embodiment.
  • the various components in the device are coupled together through the second bus system 183.
  • the second bus system 183 is used to implement connection and communication between these components.
  • in addition to a data bus, the second bus system 183 also includes a power bus, a control bus, and a status signal bus.
  • for clarity of illustration, however, the various buses are all labeled as the second bus system 183 in FIG. 18.
  • the embodiments of the present application provide a feature extraction method, apparatus, device, and storage medium, and a pose estimation method, apparatus, device, and storage medium using the feature extraction method.
  • the basic feature of the depth image is determined by extracting features of the depth image to be recognized; multiple features of different scales are then extracted from the basic feature to determine the multi-scale feature of the depth image; finally, the multi-scale feature is up-sampled to further enrich the features.
  • in this way, the feature extraction method can extract more kinds of features from the depth image.
  • when pose estimation is performed based on this feature extraction method, the rich features also improve the accuracy of the subsequent bounding box detection and pose estimation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

本申请公开了一种特征提取方法、装置、设备及存储介质,以及应用该特征提取方法的姿态估计方法、装置、设备及存储介质。采用本申请这种特征提取方法,在特征提取阶段,通过提取待识别的深度图像的特征,确定深度图像的基本特征;再提取所述基本特征的多个不同尺度特征,确定所述深度图像的多尺度特征;最后对多尺度特征进行上采样再次丰富特征。如此,采用特征提取方法能够从深度图像中提取更多种特征,基于该特征提取方法进行姿态估计时,丰富的特征也能够提高后续边界框及姿态估计的准确性。

Description

一种特征提取方法、装置、设备及存储介质
相关申请的交叉引用
本申请基于申请号为62/938,183、申请日为2019年11月20日、申请名称为“COMPACT BACKBONE FEATURE EXTRACTOR FOR EFFICIENT 3D HAND POSE DETECTION FOR A MOBILE TOF CAMERA”的在先美国临时专利申请提出,并要求该在先美国临时专利申请的优先权,该在先美国临时专利申请的全部内容在此以全文引入的方式引入本申请作为参考。
技术领域
本发明涉及图像处理技术,尤其涉及一种特征提取方法、装置、设备及存储介质。
背景技术
目前,手势识别技术在沉浸式虚拟现实、增强现实、机器人控制和手语识别等多领域中有着广阔的市场应用前景。近年来,尤其是随着消费类深度相机的出现,该技术取得了长足的进步。但是,由于无约束的全局和局部姿态变化、频繁的遮挡、局部自相似性和高清晰度的影响,手势识别的准确性较低。因此,手势识别技术仍然具备很高的研究价值。
发明内容
为解决上述技术问题,本发明实施例提供了一种特征提取方法、装置、设备及存储介质。
第一方面,本申请实施例提供了一种特征提取方法,包括:提取待识别的深度图像的特征,确定所述深度图像的基本特征;提取所述基本特征的多个不同尺度特征,确定所述深度图像的多尺度特征;上采样所述多尺度特征,确定目标特征;其中,所述目标特征用于确定所述深度图像中感兴趣区域的边界框。
第二方面,本申请实施例提供了一种特征提取装置,所述特征提取装置包括:第一提取部分,配置为提取待识别的深度图像的特征,确定所述深度图像的基本特征;
第二提取部分,配置为提取所述基本特征的多个不同尺度特征,确定所述深度图像的多尺度特征;
上采样部分,配置为上采样所述多尺度特征,确定目标特征;其中,所述目标特征 用于确定所述深度图像中感兴趣区域的边界框。
第三方面,本申请实施例提供了一种特征提取设备,所述特征提取设备包括:第一处理器和用于存储能够在第一处理器上运行的计算机程序的第一存储器,其中,所述第一存储器用于存储计算机程序,所述第一处理器用于调用并运行所述第一存储器中存储的计算机程序,执行前述第一方面方法的步骤。
第四方面,本申请实施例提供了一种计算机可读存储介质,所述计算机可读存储介质用于存储计算机程序,所述计算机程序使得计算机执行前述第一方面方法的步骤。
第五方面,本申请实施例提供了一种姿态估计方法,包括:提取待识别的深度图像的特征,确定所述深度图像的基本特征;提取所述基本特征的多个不同尺度特征,确定所述深度图像的多尺度特征;上采样所述多尺度特征,确定目标特征;基于所述目标特征提取感兴趣区域的边界框;基于所述边界框提取感兴趣区域内关键点的坐标信息;基于所述感兴趣区域内关键点的坐标信息对检测对象进行姿态估计,确定姿态估计结果。
第六方面,本申请实施例提供了一种姿态估计装置,所述姿态估计装置包括:第三提取部分、边界框检测部分、第四提取部分和姿态估计部分;其中,
所述第三提取部分,配置为执行前述第五方面方法的步骤,确定待识别的深度图像的目标特征;
所述边界框检测部分,配置为基于所述目标特征提取感兴趣区域的边界框;
所述第四提取部分,配置为基于所述边界框提取感兴趣区域内关键点的位置信息;
所述姿态估计部分,用于基于所述感兴趣区域内关键点的位置信息对检测对象进行姿态估计。
第七方面,本申请实施例提供了一种姿态估计设备,所述姿态估计设备包括:第二处理器和用于存储能够在第二处理器上运行的计算机程序的第二存储器,
其中,所述第二存储器用于存储计算机程序,所述第二处理器用于调用并运行所述第二存储器中存储的计算机程序,执行前述第五方面方法的步骤。
第八方面,本申请实施例提供了一种计算机可读存储介质,所述计算机可读存储介质用于存储计算机程序,所述计算机程序使得计算机执行前述第五方面方法的步骤。
本申请实施例提供了一种特征提取方法、装置、设备及存储介质,以及应用该特征提取方法的姿态估计方法、装置、设备及存储介质。采用本申请这种特征提取方法,在特征提取阶段,通过提取待识别的深度图像的特征,确定深度图像的基本特征;再提取 所述基本特征的多个不同尺度特征,确定所述深度图像的多尺度特征;最后对多尺度特征进行上采样再次丰富特征。如此,采用特征提取方法能够从深度图像中提取更多种特征,基于该特征提取方法进行姿态估计时,丰富的特征也能够提高后续边界框及姿态估计的准确性。
附图说明
图1为相关技术方案提供的一种TOF相机所拍摄的图像示意图;
图2为相关技术方案提供的一种手部包围盒的检测结果示意图;
图3为相关技术方案提供的一种手部骨架的关键点位置示意图;
图4为相关技术方案提供的一种二维手部的姿态估计结果示意图;
图5为相关技术方案提供的一种传统手部姿态检测的流程示意图;
图6为相关技术方案提供的一种RoIAlign特征提取的结构示意图;
图7为相关技术方案提供的一种非最大值抑制的结构示意图;
图8为相关技术方案提供的一种并集与交集的结构示意图;
图9为Alexnet网络结构示意图;
图10示出了本申请实施例中手部姿态估计的流程示意图;
图11示出了本申请实施例中一种特征提取方法的流程示意图;
图12示出了本申请实施例中另一种特征提取方法的流程示意图;
图13示出了本申请实施例中主干特征提取器的组成结构示意图;
图14示出了本申请实施例中一种特征提取装置的组成结构示意图;
图15示出了本申请实施例中一种特征提取设备的组成结构示意图;
图16示出了本申请实施例中一种姿态估计方法的流程示意图;
图17示出了本申请实施例中一种姿态估计装置的组成结构示意图;
图18示出了本申请实施例中一种姿态估计设备的组成结构示意图。
具体实施方式
为了能够更加详尽地了解本发明实施例的特点与技术内容,下面结合附图对本发明实施例的实现进行详细阐述,所附附图仅供参考说明之用,并非用来限定本发明实施例。
手部姿态估计主要是指从图像中准确估计出人手骨架节点的三维坐标位置。这是计算机视觉和人机交互领域的一个关键问题,在虚拟现实、增强现实、非接触交互以及手 势识别等领域具有重要的意义。随着商用、低廉的深度相机的兴起和发展,手部姿态估计已经取得了很大的进步。
其中，深度相机包括有结构光、激光扫描和TOF等几种，大多数情况下是指TOF相机。这里，TOF是Time of Flight的简写，直译为飞行时间。所谓飞行时间法的三维（Three Dimension，3D）成像，是通过给目标连续发送光脉冲，然后利用传感器接收从物体返回的光，通过探测光脉冲的飞行（往返）时间来得到物体的目标距离。具体来讲，TOF相机是一种距离成像相机系统，它利用飞行时间法，通过测量由激光或发光二极管（Light Emitting Diode，LED）提供的人工光信号的往返时间，从而计算出图像上每个点对应的TOF相机与被摄物体之间的距离。
TOF相机输出一个尺寸为H×W的图像,这个二维(Two Dimension,2D)图像上的每一个像素值可以代表该像素的深度值;其中,像素值的范围为0~3000毫米(millimeter,mm)。图1示出了相关技术方案提供的一种TOF相机所拍摄的图像示意图。在本申请实施例中,可以将TOF相机所拍摄的图像称为深度图像。
示例性地,与其他厂商的TOF相机相比,O厂商提供的TOF相机存在有以下几个区别点:(1)TOF相机可以安装在智能手机内部,而非是固定在静态支架上;(2)与其他厂商的TOF相机(比如Microsoft Kinect或Intel Realsense等)相比,具有较低的功耗;(3)具有较低的图像分辨率,如240×180,而典型值为640×480。
可以理解地,手部检测的输入为深度图像,然后输出为手部存在的概率(即0到1之间的数字,较大的值表示手部存在的置信度较大)和手部包围盒(即表示手的位置和大小的包围盒)。图2示出了相关技术方案提供的一种手部包围盒的检测结果示意图。如图2所示,黑色矩形框即为手部包围盒,而且该手部包围盒的分数高达0.999884。
在本申请实施例中,包围盒(bounding box)也可以称为边界框。这里,包围盒可以表示为(xmin,ymin,xmax,ymax),其中,(xmin,ymin)表示包围盒的左上角位置,(xmax,ymax)是包围盒的右下角。
具体来说,在二维手部姿态估计的过程中,输入为深度图像,输出为手部骨架的二维关键点位置,其手部骨架的关键点位置示例如图3所示。在图3中,手部骨架可以设置有20个关键点,每一个关键点位置如图3中的0~19所示。这里,每一个关键点位置可以用2D坐标信息(x,y)表示,其中,x为水平图像轴方向的坐标信息,y为垂直图像轴方向的坐标信息。示例性地,在确定出这20个关键点的坐标信息之后,一个二维手部的姿态估计结果如图4所示。
在三维手部姿态估计的过程中,输入仍为深度图像,输出则为手部骨架的三维关键点位置,其手部骨架的关键点位置示例仍如图3所示。这里,每一个关键点位置可以用3D坐标信息(x,y,z)表示,其中,x为水平图像轴方向的坐标信息,y为垂直图像轴方向的坐标信息,z为深度方向的坐标信息。本申请实施例主要是解决三维手部的姿态估计问题。
目前,一种典型的手部姿态检测流程可以包括手部检测部分和手部姿态估计部分,其中,手部检测部分可以包括主干特征提取器和边界框检测头部模块,手部姿态估计部分可以包括主干特征提取器和姿态估计头部模块。示例性地,图5示出了相关技术方案提供的一种传统手部姿态估计的流程示意图。如图5所示,在得到一张包括有手部的原始深度图像后,首先可以进行手部检测,即利用手部检测部分中所包括的主干特征提取器和边界框检测头部模块进行检测处理;这时候还可以通过调整包围盒边界,然后利用调整后的包围盒进行图像裁剪,并对裁剪后的图像进行手部姿态估计,即利用手部姿态估计部分中所包括的主干特征提取器和姿态估计头部模块进行姿态估计处理。需要注意的是,手部检测和手部姿势估计这两个部分的任务是完全分开的。为了连接这两个任务,输出包围盒的位置调整为包围盒内像素的质心,并将包围盒的大小稍微放大以包含所有的手部像素。调整后的包围盒用于裁剪原始深度图像。将裁剪后的图像输入到手部姿态估计这个任务中。这里,在两次使用主干特征提取器提取图像特征时,将会导致重复计算,增加了计算量。
这时候可以引入RoIAlign。其中,ROIAlign是一种感兴趣区域(region of interest,ROI)特征聚集方式,可以很好地解决ROI Pool操作中两次量化造成的区域不匹配的问题。在检测任务中,将ROI Pool替换为ROIAlign可以提升检测结果的准确性。也就是说,RoIAlign层消除了RoI Pool的严格量化,将提取的特征与输入进行正确对齐。这里,可以避免RoI边界或区域的任何量化,例如,这里可以使用x/16而不是[x/16]。另外,还可以使用双线性插值的方式来计算每一个RoI区域中四个定期采样位置的输入特征的精确值,并汇总结果(使用最大值或平均值),如图6所示。在图6中,虚线网格表示一个特征图,加粗实线表示一个RoI(如2×2个区域),并在每个区域中点了4个采样点。RoIAlign利用特征图上相邻网格点进行双线性插值计算,以得到每个采样点的值。针对ROI边界框或采样点,不会对所涉及的任何坐标执行量化。还需要注意的是,只要不执行量化,检测结果对采样位置的准确度或采样点的数量均不敏感。
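To make the RoIAlign step above concrete, the following is a minimal sketch using the roi_align operator from torchvision; the feature-map shape, the example box coordinates, the 7×7 output size and the 1/16 spatial scale are illustrative assumptions rather than values taken from this application.

```python
# A minimal RoIAlign sketch, assuming PyTorch/torchvision are available;
# all shapes and coordinates below are illustrative only.
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 128, 12, 15)           # N x C x H x W backbone output (15x12x128 in W x H x C)
boxes = torch.tensor([[0., 40., 32., 150., 120.]])   # (batch_index, x1, y1, x2, y2) in 240x180 image coordinates
pooled = roi_align(
    feature_map,
    boxes,
    output_size=(7, 7),       # fixed-size feature per RoI
    spatial_scale=1.0 / 16,   # maps image coordinates onto the 16x-downsampled feature map without quantization
    sampling_ratio=2,         # 2x2 bilinearly interpolated sampling points per output bin, then averaged
    aligned=True,
)
print(pooled.shape)           # torch.Size([1, 128, 7, 7])
```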
另外,非最大值抑制(Non-Maximum Suppression,NMS)在计算机视觉的几个关 键方面得到了广泛的应用,是边缘、角点或目标检测等多种检测方法的组成部分。它的必要性源于检测算法对感兴趣概念的定位能力不完善,导致多个检测组出现在实际位置附近。
在目标检测的背景下,基于滑动窗口的方法通常会产生多个得分较高的窗口,这些窗口靠近目标的正确位置。这是由于目标检测器的泛化能力、响应函数的光滑性和近窗视觉相关性的结果。这种相对密集的输出对于理解图像的内容通常是不令人满意的。事实上,在这一步中,窗口假设的数量与图像中对象的实际数量不相关。因此,NMS的目标是每个检测组只保留一个窗口,对应于响应函数的精确局部最大值,理想情况下每个对象只获得一个检测。NMS的一个具体示例如图7所示,NMS的目的只是保留一个窗口(如图7中的加粗灰色矩形框)。
如图8所示,其示出了相关技术方案提供的一种并集与交集的示意图。在图8中,给定了两个边界框,分别用BB1和BB2表示。这里,(a)中的黑色区域为BB1和BB2的交集,用BB1∩BB2表示,即BB1和BB2的重叠区域;(b)中的黑色区域为BB1和BB2的并集,用BB1∪BB2表示,即BB1和BB2的合并区域。具体地,交并比(用IoU表示)的计算公式如下所示,
IoU = Area(BB1∩BB2) / Area(BB1∪BB2)，即两个边界框交集面积与并集面积之比。
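As an illustration of the IoU computation and the greedy non-maximum suppression described above, the following is a minimal NumPy sketch; the box format (xmin, ymin, xmax, ymax) follows the bounding-box definition given earlier, while the 0.5 overlap threshold and the example boxes are assumptions for demonstration only.

```python
# A minimal sketch of IoU and greedy NMS for axis-aligned boxes (xmin, ymin, xmax, ymax).
import numpy as np

def iou(bb1, bb2):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(bb1[0], bb2[0]), max(bb1[1], bb2[1])
    ix2, iy2 = min(bb1[2], bb2[2]), min(bb1[3], bb2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area1 = (bb1[2] - bb1[0]) * (bb1[3] - bb1[1])
    area2 = (bb2[2] - bb2[0]) * (bb2[3] - bb2[1])
    return inter / (area1 + area2 - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep only the highest-scoring box in each group of overlapping detections."""
    order = np.argsort(scores)[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))                # keep the current best window
        rest = order[1:]
        order = np.array([i for i in rest if iou(boxes[best], boxes[i]) < iou_threshold])
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 150, 150]], dtype=float)
scores = np.array([0.90, 0.80, 0.70])
print(nms(boxes, scores))                     # [0, 2]: the two overlapping windows collapse to the higher-scoring one
```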
在上述检测背景的基础上，目前的手部姿态估计方案有Alexnet。图9为Alexnet网络结构示意图：输入图像依次经过5个顺序连接的卷积层（即Conv1、Conv2、Conv3、Conv4和Conv5），再经过3个全连接层（即FC6、FC7和FC8）。但是Alexnet中存在大量的计算，且没有针对移动设备进行优化，难以在移动设备上实现。
针对该问题,本申请实施例提供了一种特征提取方法,可以应用在主干特征提取器中。不同于图5中的主干特征提取器的应用,图10示出了本申请实施例中手部姿态估计的流程示意图,如图10所示,本申请实施例提供的主干特征提取器放置在输入之后,边界框检测头模块之前,它可以提取更多的图像特征用于手部检测和手部姿态估计,相较于传统提取方法,检测网络更加紧凑,更加适合部署在移动设备上。
下面对本申请实施例提供的特征提取方法进行详细的举例说明。
本申请一实施例中,示出了本申请实施例提供的一种特征提取方法的流程示意图。如图11所示,该方法可以包括:
步骤111:提取待识别的深度图像的特征,确定所述深度图像的基本特征;
在一些实施例中,该步骤之前还包括:获取深度相机采集的包含检测对象的深度图像。深度相机可以独立存在或集成在电子设备上,深度相机可以为TOF相机、结构光深度相机、双目立体视觉,目前TOF相机在移动终端应用较多。
实际应用中,可以通过建立的特征提取网络来提取深度图像的基本特征。特征提取网络包括间隔连接的至少一个卷积层和至少一个池化层,且起始层为卷积层,每个卷积层的卷积核相同或不相同,每个池化层的卷积核相同。示例性地,卷积层卷积核可以为1×1、3×3、5×5或7×7中的任一种,池化层的卷积核也可以从中任选一种。
需要说明的是,池化可以为最大池化(Max pooling)或平均池化(Average pooling),本申请不做具体限定。
需要说明的是,基本特征包括颜色特征、纹理特征、形状特征、空间关系特征和轮廓特征中的至少一种,基本特征分辨率更高,包含更多位置、细节信息,能够提供对定位和分割较为有益的信息,可以让高层网络根据基本特征更容易、更全面地获取图像的上下文信息,从而可以利用上下文信息提升后续ROI区域边界框定位的准确性。
需要说明的是,基本特征也可以称为图像的低层特征。
需要说明的是,本申请各实施例中的特征的表现形式例如可以包括但不限于:特征图、特征向量或者特征矩阵等等。
步骤112:提取所述基本特征的多个不同尺度特征,确定所述深度图像的多尺度特征;
具体地,对多尺度特征进行多个设定尺寸的卷积,卷积后相加,得到多个尺度下的不同的图像特征。
实际应用中,可以通过建立的多尺度特征提取网络从基本特征中提起不同尺度下的图像特征。示例性地,多尺度特征提取网络包括顺序连接的N个卷积网络,N取大于1的整数。
需要说明的是,当N大于1时,N个卷积网络可以为相同卷积网络或不同卷积网络,第一个卷积网络的输入为基本特征,其他卷积网络的输入为上一个卷积网络的输出,第N个卷积网络的输出则为该多尺度特征提取网络最终输出的多尺度特征。
优选地,N个卷积网络为相同卷积网络,即重复的N个卷积网络顺序连接,有利于降低网络复杂度,减少计算量。
在一些实施例中,每个卷积网络的输入特征和初始输出特征进行融合,将融合后的特征作为卷积网络最终的输出特征。比如,在每个卷积网络中增加跳跃连接,将输入特 征和初始输出特征进行融合,可以解决网络层数较深的情况下梯度消失的问题,同时有助于梯度的反向传播,加快训练过程。
步骤113:上采样所述多尺度特征,确定目标特征;其中,所述目标特征用于确定所述深度图像中感兴趣区域的边界框。
实际应用中,上采样指的是任何可以让图像变成更高分辨率的技术,对多尺度特征进行上采样能够得到图像更多细节特征,有利于后续边界框的检测。最简单的方式是重采样和插值:将输入图片进行重设比例到一个想要的尺寸,而且计算每个点的像素点,使用如双线性插值等插值方法对其余点进行插值来完成上采样过程。
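As a small illustration of spatial upsampling by resampling and interpolation as described above, the following PyTorch sketch uses bilinear interpolation; the tensor shape and the 2x scale factor are assumptions chosen only for demonstration.

```python
# A minimal sketch of bilinear spatial upsampling, assuming an N x C x H x W feature map.
import torch
import torch.nn.functional as F

features = torch.randn(1, 128, 12, 15)                            # multi-scale feature map (N, C, H, W)
upsampled = F.interpolate(features, scale_factor=2,
                          mode="bilinear", align_corners=False)   # resample to twice the spatial resolution
print(upsampled.shape)                                            # torch.Size([1, 128, 24, 30])
```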
进一步地,将得到的图标特征用于姿态估计时,首先、基于目标特征确定所述深度图像中感兴趣区域的边界框;其次、基于所述边界框提取感兴趣区域内关键点的坐标信息;最后、基于所述感兴趣区域内关键点的坐标信息对检测对象进行姿态估计,确定姿态估计结果。
具体地,所述检测对象包括手部;所述关键点包括以下至少之一:手指关节点、指尖点、手腕关键点以及手掌中心点。在进行手部姿态估计时,手部的骨架关键节点即关键点,通常情况下手部包括有20个关键点,这20个关键点在手部的具体位置如图3所示。
所述检测对象包括人脸;所述关键点包括以下至少之一:眼睛点、眉毛点、嘴巴点、鼻子点以及人脸轮廓点。在进行人脸表情识别时,人脸关键点具体是人脸五官的关键点,可以有5个关键点、21个关键点、68个关键点、98个关键点等。
所述检测对象包括人体；所述关键点包括以下至少之一：头部点、四肢关节点、以及躯干点，可以有28个关键点。
实际应用中,本申请实施例提供的特征提取方法应用在特征提取装置中,或者集成有该装置的电子设备。其中,电子设备可以是智能手机、平板电脑、笔记本电脑、掌上电脑、个人数字助理(Personal Digital Assistant,PDA)、导航装置、可穿戴设备、台式计算机等,本申请实施例不作任何限定。
本申请特征提取方法应用在图像识别领域,所提取的图像特征可参与到整个人体的姿态估计或者局部的姿态估计,本申请实施例主要介绍了如何对手部姿态进行估计,其他部位的姿态估计凡是应用到本申请实施例这种特征提取方法,也均在本申请的保护范围内。
采用本申请这种特征提取方法,在特征提取阶段,通过提取待识别的深度图像的特 征,确定深度图像的基本特征;再提取所述基本特征的多个不同尺度特征,确定所述深度图像的多尺度特征;最后对多尺度特征进行上采样再次丰富特征。如此,采用特征提取方法能够从深度图像中提取更多种特征,基于该特征提取方法进行姿态估计时,丰富的特征也能够提高后续边界框及姿态估计的准确性。
本申请的另一实施例中,参见图12,其示出了本申请实施例提供的另一种特征提取方法的流程示意图。如图12所示,该方法可以包括:
步骤121:将待识别的深度图像输入到特征提取网络中进行多次下采样,输出所述深度图像的基本特征;
这里,所述特征提取网络包括间隔连接的至少一个卷积层和至少一个池化层,且起始层为卷积层。
在一些实施例中,所述至少一个卷积层中靠近输入端的卷积层的卷积核大于或者等于远离输入端的卷积层的卷积核。示例性地,卷积核可以为1×1、3×3、5×5或7×7中的任一种,池化层的卷积核也可以从中任选一种。
需要说明的是,大卷积核能够快速扩大感受野,提取到更多图像特征,但存在计算量大的问题,因此本申请实施例采用卷积核逐层递减的方式在图像特征和计算量之间进行了很好的平衡,在提取更多基本特征的基础上,保证计算量适用于移动终端的处理能力。
图13示出了本申请实施例中主干特征提取器的组成结构示意图，如图13所示，主干特征提取器中包括特征提取网络、多尺度特征提取网络以及上采样网络。其中，具体给出了一种包括2个卷积层和2个池化层的特征提取网络，具体包括：7×7×48的Conv1（即卷积核为7×7、通道数为48，s2表示步长为2，对输入深度图像的二维数据进行2倍下采样），以及3×3的Pool1、5×5×128的Conv2和3×3的Pool2。
示例性地，一个240×180深度图像首先输入到7×7×48的Conv1中，输出120×90×48的特征图；再输入到3×3的Pool1中，输出60×45×48的特征图；再输入到5×5×128的Conv2中，输出30×23×128的特征图；最后输入到3×3的Pool2中，输出15×12×128的特征图。其中，每一次卷积或池化操作均进行2倍下采样，输入深度图像总共被直接下采样16倍，通过下采样可以大大降低计算成本。这里，使用7×7和5×5这样的大卷积核能够快速扩大感受野，提取到更多图像特征。
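The layer stack described above can be sketched as follows in PyTorch. This is a minimal, non-authoritative sketch: the kernel sizes, channel counts and stride-2 settings follow the text, while the padding values (3, 1, 2, 1), the ReLU activations and the variable names are assumptions chosen so that the stated 240×180 → 120×90 → 60×45 → 30×23 → 15×12 size progression is reproduced.

```python
# A minimal sketch of the basic feature extraction network, under the assumptions stated above.
import torch
import torch.nn as nn

basic_extractor = nn.Sequential(
    nn.Conv2d(1, 48, kernel_size=7, stride=2, padding=3),    # Conv1: 7x7x48, s2
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),         # Pool1: 3x3, s2
    nn.Conv2d(48, 128, kernel_size=5, stride=2, padding=2),   # Conv2: 5x5x128, s2
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),         # Pool2: 3x3, s2
)

depth = torch.randn(1, 1, 180, 240)     # single-channel 240x180 depth image (N, C, H, W)
print(basic_extractor(depth).shape)     # torch.Size([1, 128, 12, 15]) -> the 15x12x128 basic feature
```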
在一些实施例中,该步骤之前还包括:获取深度相机采集的包含检测对象的深度图像。深度相机可以独立存在或集成在电子设备上,深度相机可以为TOF相机、结构光 深度相机、双目立体视觉,目前TOF相机在移动终端应用较多。
步骤122:将所述基本特征输入到多尺度特征提取网络中,输出所述深度图像的多尺度特征;
这里,所述多尺度特征提取网络包括顺序连接的N个卷积网络,N取大于1的整数。
具体地,每个卷积网络中包括至少两个卷积分支和融合网络,不同卷积分支用于提取不同尺度的特征;
所述将所述基本特征输入到多尺度特征提取网络中,输出所述深度图像的多尺度特征,包括:
将第i-1个卷积网络的输出特征输入到第i个卷积网络,输出第i个卷积网络的至少两个分支的特征;其中,i取1至N的整数,i等于1时,第1个卷积网络的输入特征为所述基本特征;
将所述第i个卷积网络输出特征和输入特征输入到所述融合网络进行特征融合,输出第i个卷积网络的输出特征;
若i小于N,则继续将第i个卷积网络的输出特征输入到第i+1个卷积网络;
若i等于N,则第N个卷积网络输出所述深度图像的多尺度特征。
需要说明的是,卷积网络的输出特征的通道数,应该与输入特征的通道数相同,才能进行特征融合。
还需要说明的是,每个卷积网络用于提取不同种类的特征,越靠后的提取出来的特征越抽象,比如前面卷积网络可以提取出来更加局部的特征,提取手指的特征,后面的提取更加全局的特征,提取整个手的特征,利用N个重复的卷积核群能够提取更多不同的特征。同样,卷积网络中不同卷积分支同样提取不同种类的特征,有些分支提取的更细节的特征,有些分支提取更全局的特征。
在一些实施例中,所述卷积网络包含四个卷积分支;其中,
第一卷积分支包括第一卷积层;
第二卷积分支包括顺序连接的第一池化层和第二卷积层;
第三卷积分支包括顺序连接的第三卷积层和第四卷积层;
第四卷积分支包括顺序连接的第五卷积层、第六卷积层和第七卷积层;
所述第一卷积层、第二卷积层、第四卷积层和第七卷积层的通道数相等;所述第三卷积层和所述第五卷积层的通道数相等,且小于所述第四卷积层的通道数。
需要说明的是,第三卷积层和第五卷积层的通道数较小是为了对输入特征进行通道 下采样,以减小后续卷积处理的计算量,更加适用于移动设备。通过设置4个卷积分支,可以在图像特征和计算量之间进行了很好的平衡,在提取更多尺度特征的基础上,保证计算量适用于移动终端的处理能力。
在一些实施例中,所述第一卷积层、所述第二卷积层、所述第三卷积层和第五卷积层的卷积核相等,所述第四卷积层、所述第六卷积层和所述第七卷积层的卷积核相等。
示例性地,第一至第七卷积层的卷积核可以为1×1、3×3或5×5中的任一种,第一池化层的卷积核也可以从中任选一种。
图13示出了本申请实施例中主干特征提取器的组成结构示意图，如图13所示，主干特征提取器中包括特征提取网络、多尺度特征提取网络以及上采样网络。其中，具体给出了一种包括3个重复卷积网络的多尺度特征提取网络，具体包括：第一卷积分支包括1×1×32的Conv（即卷积核为1×1、通道数为32）；第二卷积分支包括顺序连接的3×3的Pool和1×1×32的Conv；第三卷积分支包括顺序连接的1×1×24的Conv和3×3×32的Conv；第四卷积分支包括顺序连接的1×1×24的Conv、3×3×32的Conv和3×3×32的Conv。每个卷积网络中还增加了一个跳跃连接（即融合网络），用于对输入特征和输出特征进行融合，以便在训练过程中实现更平滑的梯度流。
需要说明的是,对于图13中示出的多尺度提取网络,卷积网络包括的四个卷积分支中最上面的分支提取的特征更细节,中间的两个分支提取的特征偏局部,最后一个分支提取的特征更全局。
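One of the repeated four-branch blocks described above can be sketched as follows in PyTorch. This is a minimal sketch under stated assumptions: stride-1 branches with padding that preserves the 15×12 spatial size, channel-wise concatenation of the four 32-channel branch outputs into 128 channels, and an additive skip connection for feature fusion; activation layers and the class/variable names are assumptions, since they are not specified in the text.

```python
# A minimal sketch of one multi-scale block with four branches and a skip connection.
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        self.branch1 = nn.Conv2d(channels, 32, kernel_size=1)       # 1x1x32
        self.branch2 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),        # 3x3 pool
            nn.Conv2d(channels, 32, kernel_size=1),                  # 1x1x32
        )
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, 24, kernel_size=1),                  # 1x1x24 channel reduction
            nn.Conv2d(24, 32, kernel_size=3, padding=1),             # 3x3x32
        )
        self.branch4 = nn.Sequential(
            nn.Conv2d(channels, 24, kernel_size=1),                  # 1x1x24 channel reduction
            nn.Conv2d(24, 32, kernel_size=3, padding=1),             # 3x3x32
            nn.Conv2d(32, 32, kernel_size=3, padding=1),             # 3x3x32
        )

    def forward(self, x):
        out = torch.cat([self.branch1(x), self.branch2(x),
                         self.branch3(x), self.branch4(x)], dim=1)   # 4 x 32 -> 128 channels
        return out + x                                               # skip connection (feature fusion)

blocks = nn.Sequential(*[MultiScaleBlock(128) for _ in range(3)])    # three repeated blocks
print(blocks(torch.randn(1, 128, 12, 15)).shape)                     # torch.Size([1, 128, 12, 15])
```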
步骤123:上采样所述多尺度特征,确定目标特征;其中,所述目标特征用于确定所述深度图像中感兴趣区域的边界框。
具体地,将所述多尺度特征输入到第八卷积层,输出目标特征;其中,所述第八卷积层的通道数是所述多尺度特征的通道数的M倍,M大于1。
也就是说,通过对多尺度特征进行特征通道上采样,可以生成更多种特征,M取大于1的整数或非整数。
图13示出了本申请实施例中主干特征提取器的组成结构示意图，如图13所示，主干特征提取器中包括特征提取网络、多尺度特征提取网络以及上采样网络。其中，上采样网络包含一个1×1×256的卷积层（Conv），可以将输入的15×12×128的特征图映射为15×12×256的特征图，即对特征通道进行向上采样。通过应用特征通道上采样，可以生成更多种特征。
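The feature-channel upsampling step can be sketched as a single 1×1 convolution; a stride of 1 is assumed here so that the 15×12 spatial size is preserved and only the channel count is increased (M = 2 in this example), and the variable names are illustrative only.

```python
# A minimal sketch of feature-channel upsampling with a 1x1x256 convolution, assuming stride 1.
import torch
import torch.nn as nn

channel_upsample = nn.Conv2d(128, 256, kernel_size=1)    # 1x1x256 Conv: doubles the channel count
target = channel_upsample(torch.randn(1, 128, 12, 15))   # multi-scale feature in, target feature out
print(target.shape)                                      # torch.Size([1, 256, 12, 15])
```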
进一步地,将得到的图标特征用于姿态估计时,首先、基于目标特征确定所述深度图像中感兴趣区域的边界框;其次、基于所述边界框提取感兴趣区域内关键点的坐标信息;最后、基于所述感兴趣区域内关键点的坐标信息对检测对象进行姿态估计,确定姿态估计结果。
简言之,在本申请实施例中,特征提取方法主要包括以下设计规则:
规则#1，本发明网络管线包括三个主要部件，具体包括：基本特征提取器、多尺度特征提取器和特征上采样网络。网络构架如图13所示。
规则#2，对于规则#1中，基本特征提取器用于提取低层图像特征（即基本特征）。一个240×180深度图像首先输入到7×7×48的Conv1中，输出120×90×48的特征图；再输入到3×3的Pool1中，输出60×45×48的特征图；再输入到5×5×128的Conv2中，输出30×23×128的特征图；最后输入到3×3的Pool2中，输出15×12×128的特征图。这里，输入被直接下采样16倍，以大大降低计算成本。使用大的卷积核（例如7×7和5×5）来快速扩大感受野。
规则#3,对于规则#1中,多尺度特征提取器包含三个重复卷积核群,来提取更多不同的特征。在每个卷积核群中,有四个分支,每个分支提取一种类型的图像特征,将四个分支(每个分支输出一个32通道的特征图)组合成128通道的特征图。
规则#4,对于规则#3中,额外添加了一个跳跃连接,以添加到128通道特征图中,以便在训练过程中实现更平滑的梯度流。
规则#5，对于规则#1中，特征上采样网络采用1×1×256的卷积层，将15×12×128的特征图映射为15×12×256的特征图，实现特征通道的向上采样。通过应用特征通道上采样，可以生成更多种特征。
采用本申请这种特征提取方法,在特征提取阶段,通过提取待识别的深度图像的特征,确定深度图像的基本特征;再提取所述基本特征的多个不同尺度特征,确定所述深度图像的多尺度特征;最后对多尺度特征进行上采样再次丰富特征。如此,采用特征提取方法能够从深度图像中提取更多种特征,基于该特征提取方法进行姿态估计时,丰富的特征也能够提高后续边界框及姿态估计的准确性。
为实现本申请实施例的特征提取方法,基于同一发明构思本申请实施例还提供了一种特征提取装置,如图14所示,该特征提取装置包括:
第一提取部分141,配置为提取待识别的深度图像的特征,确定所述深度图像的基本特征;
第二提取部分142,配置为提取所述基本特征的多个不同尺度特征,确定所述深度图像的多尺度特征;
上采样部分143,配置为上采样所述多尺度特征,确定目标特征;其中,所述目标特征用于确定所述深度图像中感兴趣区域的边界框。
在一些实施例中,第一提取部分141,配置为将待识别的深度图像输入到特征提取网络中进行多次下采样,输出所述深度图像的基本特征;其中,所述特征提取网络包括间隔连接的至少一个卷积层和至少一个池化层,且起始层为卷积层。
在一些实施例中,所述至少一个卷积层中靠近输入端的卷积层的卷积核大于或者等于远离输入端的卷积层的卷积核。
在一些实施例中,所述特征提取网络包括2个卷积层和2个池化层;其中,第一个卷积层的卷积核为7×7,第二个卷积层的卷积核为5×5。
在一些实施例中,第二提取部分142,配置为将所述基本特征输入到多尺度特征提取网络中,输出所述深度图像的多尺度特征;其中,所述多尺度特征提取网络包括顺序连接的N个卷积网络,N取大于1的整数。
在一些实施例中,每个卷积网络中包括至少两个卷积分支和融合网络,不同卷积分支用于提取不同尺度的特征;
相应的,第二提取部分142,配置为将第i-1个卷积网络的输出特征输入到第i个卷积网络,输出第i个卷积网络的至少两个分支的特征;其中,i取1至N的整数,i等于1时,第1个卷积网络的输入特征为所述基本特征;将所述第i个卷积网络输出特征和输入特征输入到所述融合网络进行特征融合,输出第i个卷积网络的输出特征;若i小于N,则继续将第i个卷积网络的输出特征输入到第i+1个卷积网络;若i等于N,则第N个卷积网络输出所述深度图像的多尺度特征。
在一些实施例中,所述卷积网络包含四个卷积分支;其中,
第一卷积分支包括第一卷积层;
第二卷积分支包括顺序连接的第一池化层和第二卷积层;
第三卷积分支包括顺序连接的第三卷积层和第四卷积层;
第四卷积分支包括顺序连接的第五卷积层、第六卷积层和第七卷积层;
所述第一卷积层、第二卷积层、第四卷积层和第七卷积层的通道数相等；所述第三卷积层和所述第五卷积层的通道数相等，且小于所述第四卷积层的通道数。
具体地,所述第一卷积层为1×1×32;
所述第一池化层为3×3,所述第二卷积层为1×1×32;
所述第三卷积层为1×1×24,所述第四卷积层为3×3×32;
所述第五卷积层为1×1×24,所述第六卷积层为3×3×32,所述第七卷积层为3×3×32。
在一些实施例中,所述上采样部分143,配置为将所述多尺度特征输入到第八卷积层,输出目标特征;其中,所述第八卷积层的通道数是所述多尺度特征的通道数的M倍,M大于1。
基于上述特征提取装置中各单元的硬件实现,本申请实施例还提供了一种特征提取设备,如图15所示,该设备包括:第一处理器151和配置为存储能够在第一处理器151上运行的计算机程序的第一存储器152;
其中,所述第一存储器152用于存储计算机程序,所述第一处理器151用于调用并运行所述第一存储器152中存储的计算机程序,执行前述实施例中的特征提取方法的步骤。
当然,实际应用时,如图15所示,该设备中的各个组件通过第一总线系统153耦合在一起。可理解,第一总线系统153用于实现这些组件之间的连接通信。第一总线系统153除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图15中将各种总线都标为第一总线系统153。
本发明实施例提供的一种计算机存储介质,所述计算机存储介质存储有计算机可执行指令,所述计算机可执行指令被执行时实施前述实施例的方法步骤。
本发明实施例上述装置如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明实施例的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机、服务器、或者网络设备等)执行本发明各个实施例所述方法的全部或部分。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read Only Memory)、磁碟或者光盘等各种可以存储程序代码的介质。这样,本发明实施例不限制于任何特定的硬件和软件结合。
相应地,本发明实施例还提供一种计算机存储介质,其中存储有计算机程序,该计算机程序配置为执行本发明实施例的特征提取方法。
基于本申请实施例提供的特征提取方法,还提供了应用该特征提取方法的一种姿态 估计方法,如图16所示,该方法可以包括:
步骤161:提取待识别的深度图像的特征,确定所述深度图像的基本特征;
具体地,将待识别的深度图像输入到特征提取网络中进行多次下采样,输出所述深度图像的基本特征;其中,所述特征提取网络包括间隔连接的至少一个卷积层和至少一个池化层,且起始层为卷积层。
在一些实施例中,所述至少一个卷积层中靠近输入端的卷积层的卷积核大于或者等于远离输入端的卷积层的卷积核。
在一些实施例中,所述特征提取网络包括2个卷积层和2个池化层;其中,第一个卷积层的卷积核为7×7,第二个卷积层的卷积核为5×5。
步骤162:提取所述基本特征的多个不同尺度特征,确定所述深度图像的多尺度特征;
具体地,将所述基本特征输入到多尺度特征提取网络中,输出所述深度图像的多尺度特征;其中,所述多尺度特征提取网络包括顺序连接的N个卷积网络,N取大于1的整数。
在一些实施例中,每个卷积网络中包括至少两个卷积分支和融合网络,不同卷积分支用于提取不同尺度的特征;
所述将所述基本特征输入到多尺度特征提取网络中,输出所述深度图像的多尺度特征,包括:将第i-1个卷积网络的输出特征输入到第i个卷积网络,输出第i个卷积网络的至少两个分支的特征;其中,i取1至N的整数,i等于1时,第1个卷积网络的输入特征为所述基本特征;将所述第i个卷积网络输出特征和输入特征输入到所述融合网络进行特征融合,输出第i个卷积网络的输出特征;若i小于N,则继续将第i个卷积网络的输出特征输入到第i+1个卷积网络;若i等于N,则第N个卷积网络输出所述深度图像的多尺度特征。
在一些实施例中,所述卷积网络包含四个卷积分支;其中,
第一卷积分支包括第一卷积层;
第二卷积分支包括顺序连接的第一池化层和第二卷积层;
第三卷积分支包括顺序连接的第三卷积层和第四卷积层;
第四卷积分支包括顺序连接的第五卷积层、第六卷积层和第七卷积层;
所述第一卷积层、第二卷积层、第四卷积层和第七卷积层的通道数相等；所述第三卷积层和所述第五卷积层的通道数相等，且小于所述第四卷积层的通道数。
示例性地,所述第一卷积层为1×1×32;
所述第一池化层为3×3,所述第二卷积层为1×1×32;
所述第三卷积层为1×1×24,所述第四卷积层为3×3×32;
所述第五卷积层为1×1×24,所述第六卷积层为3×3×32,所述第七卷积层为3×3×32。
步骤163:上采样所述多尺度特征,确定目标特征;
具体地,将所述多尺度特征输入到第八卷积层,输出目标特征;其中,所述第八卷积层的通道数是所述多尺度特征的通道数的M倍,M大于1。M取大于1的整数或非整数。
步骤164:基于所述目标特征提取感兴趣区域的边界框;
具体地,将目标特征输入到边界框检测头模型中,确定感兴趣区域的多个候选边界框,再从多个候选边界框中选择一个边界框,作为包围感兴趣区域的边界框。
步骤165:基于所述边界框提取感兴趣区域内关键点的坐标信息;
这里,感兴趣区域(ROI)是从图像中选择的一个图像区域,这个区域是图像分析所关注的重点,包括检测对象。圈定该区域以便对检测对象进行进一步处理。使用ROI圈定检测对象,可以减少处理时间,增加精度。
具体地,所述检测对象包括手部;所述关键点包括以下至少之一:手指关节点、指尖点、手腕关键点以及手掌中心点。在进行手部姿态估计时,手部的骨架关键节点即关键点,通常情况下手部包括有20个关键点,这20个关键点在手部的具体位置如图3所示。
所述检测对象包括人脸;所述关键点包括以下至少之一:眼睛点、眉毛点、嘴巴点、鼻子点以及人脸轮廓点。在进行人脸表情识别时,人脸关键点具体是人脸五官的关键点,可以有5个关键点、21个关键点、68个关键点、98个关键点等。
所述检测对象包括人体；所述关键点包括以下至少之一：头部点、四肢关节点、以及躯干点，可以有28个关键点。
步骤166:基于所述感兴趣区域内关键点的坐标信息对检测对象进行姿态估计,确定姿态估计结果。
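Putting steps 161 to 166 together, the overall flow can be sketched as follows; the backbone corresponds to the feature extraction method above, while the bounding-box head and keypoint head are hypothetical placeholder modules whose internal structure is not specified here, and all names are illustrative only.

```python
# A minimal, hypothetical sketch of the end-to-end pose-estimation flow (steps 161-166);
# backbone, bbox_head and keypoint_head are placeholder modules, not defined by this application.
import torch.nn as nn

class HandPoseEstimator(nn.Module):
    """Illustrative wiring of the pose-estimation flow around three placeholder sub-modules."""
    def __init__(self, backbone, bbox_head, keypoint_head):
        super().__init__()
        self.backbone = backbone            # steps 161-163: basic + multi-scale + up-sampled (target) features
        self.bbox_head = bbox_head          # step 164: candidate bounding boxes for the region of interest
        self.keypoint_head = keypoint_head  # step 165: key-point coordinates inside the selected box

    def forward(self, depth_image):
        target_feature = self.backbone(depth_image)
        boxes, scores = self.bbox_head(target_feature)             # candidate boxes and their confidences
        best_box = boxes[scores.argmax()]                          # keep the highest-scoring candidate box
        keypoints = self.keypoint_head(target_feature, best_box)   # e.g. 20 hand key points, each (x, y, z)
        return best_box, keypoints                                 # step 166: pose estimate from the key points
```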
采用本申请这种特征提取方法,在特征提取阶段,通过提取待识别的深度图像的特征,确定深度图像的基本特征;再提取所述基本特征的多个不同尺度特征,确定所述深度图像的多尺度特征;最后对多尺度特征进行上采样再次丰富特征。如此,采用特征提 取方法能够从深度图像中提取更多种特征,基于该特征提取方法进行姿态估计时,丰富的特征也能够提高后续边界框及姿态估计的准确性。
为实现本申请实施例的姿态估计方法,基于同一发明构思本申请实施例还提供了一种姿态估计装置,如图17所示,该姿态估计装置包括:第三提取部分171、边界框检测部分172、第四提取部分173和姿态估计部分174;其中,
所述第三提取部分171,配置为执行前述特征提取方法的步骤,确定待识别的深度图像的目标特征;
所述边界框检测部分172,配置为基于所述目标特征提取感兴趣区域的边界框;
所述第四提取部分173,配置为基于所述边界框提取感兴趣区域内关键点的位置信息;
所述姿态估计部分174,用于基于所述感兴趣区域内关键点的位置信息对检测对象进行姿态估计。
基于上述姿态估计装置中各单元的硬件实现,本申请实施例还提供了一种姿态估计设备,如图18所示,该设备包括:第二处理器181和用于存储能够在第二处理器181上运行的计算机程序的第二存储器182,
其中,所述第二存储器182用于存储计算机程序,所述第二处理器181用于调用并运行所述第二存储器182中存储的计算机程序,执行前述实施例中姿态估计方法的步骤。
当然,实际应用时,如图18所示,该设备中的各个组件通过第二总线系统183耦合在一起。可理解,第二总线系统183用于实现这些组件之间的连接通信。第二总线系统183除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图18中将各种总线都标为第二总线系统183。
需要说明的是,在本申请中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者装置不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者装置所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、方法、物品或者装置中还存在另外的相同要素。
需要说明的是:“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。
本申请所提供的几个方法实施例中所揭露的方法,在不冲突的情况下可以任意组合,得到新的方法实施例。
本申请所提供的几个产品实施例中所揭露的特征,在不冲突的情况下可以任意组合,得到新的产品实施例。
本申请所提供的几个方法或设备实施例中所揭露的特征,在不冲突的情况下可以任意组合,得到新的方法实施例或设备实施例。
以上所述,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以所述权利要求的保护范围为准。
工业实用性
本申请实施例提供了一种特征提取方法、装置、设备及存储介质,以及应用该特征提取方法的姿态估计方法、装置、设备及存储介质。采用本申请这种特征提取方法,在特征提取阶段,通过提取待识别的深度图像的特征,确定深度图像的基本特征;再提取所述基本特征的多个不同尺度特征,确定所述深度图像的多尺度特征;最后对多尺度特征进行上采样再次丰富特征。如此,采用特征提取方法能够从深度图像中提取更多种特征,基于该特征提取方法进行姿态估计时,丰富的特征也能够提高后续边界框及姿态估计的准确性。

Claims (17)

  1. 一种特征提取方法,其中,包括:
    提取待识别的深度图像的特征,确定所述深度图像的基本特征;
    提取所述基本特征的多个不同尺度特征,确定所述深度图像的多尺度特征;
    上采样所述多尺度特征,确定目标特征;其中,所述目标特征用于确定所述深度图像中感兴趣区域的边界框。
  2. 根据权利要求1所述的方法,其中,所述提取待识别的深度图像的特征,确定所述深度图像的基本特征,包括:
    将待识别的深度图像输入到特征提取网络中进行多次下采样,输出所述深度图像的基本特征。
  3. 根据权利要求2所述的方法,其中,所述特征提取网络包括间隔连接的至少一个卷积层和至少一个池化层,且起始层为卷积层;所述至少一个卷积层中靠近输入端的卷积层的卷积核大于或者等于远离输入端的卷积层的卷积核。
  4. 根据权利要求3所述的方法,其中,所述特征提取网络包括2个卷积层和2个池化层;
    其中,第一个卷积层的卷积核为7×7,第二个卷积层的卷积核为5×5。
  5. 根据权利要求1所述的方法,其中,所述提取所述基本特征的多个不同尺度特征,确定所述深度图像的多尺度特征,包括:
    将所述基本特征输入到多尺度特征提取网络中,输出所述深度图像的多尺度特征;
    其中,所述多尺度特征提取网络包括顺序连接的N个卷积网络,N取大于1的整数。
  6. 根据权利要求5所述的方法,其中,每个卷积网络中包括至少两个卷积分支和融合网络,不同卷积分支用于提取不同尺度的特征;
    所述将所述基本特征输入到多尺度特征提取网络中,输出所述深度图像的多尺度特征,包括:
    将第i-1个卷积网络的输出特征输入到第i个卷积网络,输出第i个卷积网络的至少两个分支的特征;其中,i取1至N的整数,i等于1时,第1个卷积网络的输入特征为所述基本特征;
    将所述第i个卷积网络输出特征和输入特征输入到所述融合网络进行特征融合,输出第i个卷积网络的输出特征;
    若i小于N,则继续将第i个卷积网络的输出特征输入到第i+1个卷积网络;
    若i等于N,则第N个卷积网络输出所述深度图像的多尺度特征。
  7. 根据权利要求5或6所述的方法,其中,所述卷积网络包含四个卷积分支;其中,
    第一卷积分支包括第一卷积层;
    第二卷积分支包括顺序连接的第一池化层和第二卷积层;
    第三卷积分支包括顺序连接的第三卷积层和第四卷积层;
    第四卷积分支包括顺序连接的第五卷积层、第六卷积层和第七卷积层;
    所述第一卷积层、第二卷积层、第四卷积层和第七卷积层的通道数相等；所述第三卷积层和所述第五卷积层的通道数相等，且小于所述第四卷积层的通道数。
  8. 根据权利要求7所述的方法,其中,
    所述第一卷积层为1×1×32;
    所述第一池化层为3×3,所述第二卷积层为1×1×32;
    所述第三卷积层为1×1×24,所述第四卷积层为3×3×32;
    所述第五卷积层为1×1×24,所述第六卷积层为3×3×32,所述第七卷积层为3×3×32。
  9. 根据权利要求1所述的方法,其中,
    所述上采样所述多尺度特征,确定目标特征,包括:
    将所述多尺度特征输入到第八卷积层,输出目标特征;其中,所述第八卷积层的通道数是所述多尺度特征的通道数的M倍,M大于1。
  10. 一种特征提取装置,其中,所述特征提取装置包括:
    第一提取部分,配置为提取待识别的深度图像的特征,确定所述深度图像的基本特征;
    第二提取部分,配置为提取所述基本特征的多个不同尺度特征,确定所述深度图像的多尺度特征;
    上采样部分,配置为上采样所述多尺度特征,确定目标特征;其中,所述目标特征用于确定所述深度图像中感兴趣区域的边界框。
  11. 一种特征提取设备,其中,所述特征提取设备包括:第一处理器和用于存储能够在第一处理器上运行的计算机程序的第一存储器,
    其中,所述第一存储器用于存储计算机程序,所述第一处理器用于调用并运行所 述第一存储器中存储的计算机程序,执行如权利要求1-9任一项所述方法的步骤。
  12. 一种计算机可读存储介质,所述计算机可读存储介质用于存储计算机程序,所述计算机程序使得计算机执行如权利要求1-9任一项所述方法的步骤。
  13. 一种姿态估计方法,其中,包括:
    提取待识别的深度图像的特征,确定所述深度图像的基本特征;
    提取所述基本特征的多个不同尺度特征,确定所述深度图像的多尺度特征;
    上采样所述多尺度特征,确定目标特征;
    基于所述目标特征提取感兴趣区域的边界框;
    基于所述边界框提取感兴趣区域内关键点的坐标信息;
    基于所述感兴趣区域内关键点的坐标信息对检测对象进行姿态估计,确定姿态估计结果。
  14. 根据权利要求13所述的方法,其中,
    所述检测对象包括手部;
    所述关键点包括以下至少之一:手指关节点、指尖点、手腕关键点以及手掌中心点。
  15. 一种姿态估计装置,其中,所述姿态估计装置包括:第三提取部分、边界框检测部分、第四提取部分和姿态估计部分;其中,
    所述第三提取部分,配置为执行权利要求1-9任一项所述方法的步骤,确定待识别的深度图像的目标特征;
    所述边界框检测部分,配置为基于所述目标特征提取感兴趣区域的边界框;
    所述第四提取部分,配置为基于所述边界框提取感兴趣区域内关键点的位置信息;
    所述姿态估计部分,用于基于所述感兴趣区域内关键点的位置信息对检测对象进行姿态估计。
  16. 一种姿态估计设备,其中,所述姿态估计设备包括:第二处理器和用于存储能够在第二处理器上运行的计算机程序的第二存储器,
    其中,所述第二存储器用于存储计算机程序,所述第二处理器用于调用并运行所述第二存储器中存储的计算机程序,执行如权利要求13-14任一项所述方法的步骤。
  17. 一种计算机可读存储介质,所述计算机可读存储介质用于存储计算机程序,所述计算机程序使得计算机执行如权利要求13-14所述方法的步骤。
PCT/CN2020/127867 2019-11-20 2020-11-10 一种特征提取方法、装置、设备及存储介质 WO2021098554A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/745,565 US20220277475A1 (en) 2019-11-20 2022-05-16 Feature extraction method and device, and pose estimation method using same

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962938183P 2019-11-20 2019-11-20
US62/938,183 2019-11-20

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/745,565 Continuation US20220277475A1 (en) 2019-11-20 2022-05-16 Feature extraction method and device, and pose estimation method using same

Publications (1)

Publication Number Publication Date
WO2021098554A1 true WO2021098554A1 (zh) 2021-05-27

Family

ID=75980386

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/127867 WO2021098554A1 (zh) 2019-11-20 2020-11-10 一种特征提取方法、装置、设备及存储介质

Country Status (2)

Country Link
US (1) US20220277475A1 (zh)
WO (1) WO2021098554A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115457101A (zh) * 2022-11-10 2022-12-09 武汉图科智能科技有限公司 面向无人机平台的边缘保持多视图深度估计及测距方法

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3965071A3 (en) * 2020-09-08 2022-06-01 Samsung Electronics Co., Ltd. Method and apparatus for pose identification

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956532A (zh) * 2016-04-25 2016-09-21 大连理工大学 一种基于多尺度卷积神经网络的交通场景分类方法
CN107368787A (zh) * 2017-06-16 2017-11-21 长安大学 一种面向深度智驾应用的交通标志识别算法
CN109214250A (zh) * 2017-07-05 2019-01-15 中南大学 一种基于多尺度卷积神经网络的静态手势识别方法
CN109800676A (zh) * 2018-12-29 2019-05-24 上海易维视科技股份有限公司 基于深度信息的手势识别方法及系统
US20190259284A1 (en) * 2018-02-20 2019-08-22 Krishna Khadloya Pedestrian detection for vehicle driving assistance
US10438082B1 (en) * 2018-10-26 2019-10-08 StradVision, Inc. Learning method, learning device for detecting ROI on the basis of bottom lines of obstacles and testing method, testing device using the same

Also Published As

Publication number Publication date
US20220277475A1 (en) 2022-09-01

Similar Documents

Publication Publication Date Title
CN110276316B (zh) 一种基于深度学习的人体关键点检测方法
CN111028330B (zh) 三维表情基的生成方法、装置、设备及存储介质
Memo et al. Head-mounted gesture controlled interface for human-computer interaction
US10832039B2 (en) Facial expression detection method, device and system, facial expression driving method, device and system, and storage medium
EP4307233A1 (en) Data processing method and apparatus, and electronic device and computer-readable storage medium
EP3942529A1 (en) Predicting three-dimensional articulated and target object pose
WO2021098441A1 (zh) 手部姿态估计方法、装置、设备以及计算机存储介质
KR20170014491A (ko) 움직임 인식 방법 및 움직임 인식 장치
KR20230156056A (ko) 포즈 추정을 위한 키포인트-기반 샘플링
US20220262093A1 (en) Object detection method and system, and non-transitory computer-readable medium
WO2021098554A1 (zh) 一种特征提取方法、装置、设备及存储介质
CN111062981A (zh) 图像处理方法、装置及存储介质
JP2022550948A (ja) 3次元顔モデル生成方法、装置、コンピュータデバイス及びコンピュータプログラム
WO2021098545A1 (zh) 一种姿势确定方法、装置、设备、存储介质、芯片及产品
CN111680550B (zh) 情感信息识别方法、装置、存储介质及计算机设备
WO2023015409A1 (zh) 物体姿态的检测方法、装置、计算机设备和存储介质
WO2021098576A1 (zh) 手部姿态估计方法、装置及计算机存储介质
CN110348359B (zh) 手部姿态追踪的方法、装置及系统
CN111652110A (zh) 一种图像处理方法、装置、电子设备及存储介质
WO2021098666A1 (zh) 手部姿态检测方法和装置、及计算机存储介质
US20230093827A1 (en) Image processing framework for performing object depth estimation
CN113822174B (zh) 视线估计的方法、电子设备及存储介质
CN116642490A (zh) 基于混合地图的视觉定位导航方法、机器人及存储介质
WO2023003642A1 (en) Adaptive bounding for three-dimensional morphable models
CN114820899A (zh) 一种基于多视角渲染的姿态估计方法和装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20890500

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20890500

Country of ref document: EP

Kind code of ref document: A1