WO2020093837A1 - Method, apparatus, electronic device, and storage medium for detecting human skeleton key points - Google Patents

Method, apparatus, electronic device, and storage medium for detecting human skeleton key points

Publication number
WO2020093837A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
key points
human
average
hourglass network
Prior art date
Application number
PCT/CN2019/110582
Other languages
English (en)
French (fr)
Inventor
谷继力
张雷
郑文
Original Assignee
北京达佳互联信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京达佳互联信息技术有限公司
Publication of WO2020093837A1
Priority to US17/085,214 (US11373426B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions

Definitions

  • The present application belongs to the field of computer software applications and relates in particular to a method, apparatus, electronic device, and storage medium for detecting human skeleton key points.
  • Human skeleton key points are very important for describing human posture and predicting human behavior. The detection of human skeleton key points is therefore the basis of many computer vision tasks, such as motion classification, abnormal behavior detection, and autonomous driving.
  • The detection of human skeleton key points mainly detects key points of the human body, such as joints, limbs, and facial features, and then describes human posture information through the detected key points.
  • The human body is highly flexible and can take on many postures and shapes, and a small change in any body part produces a new posture. At the same time, the visibility of the key points is strongly affected by clothing, posture, and viewing angle, and detection is further complicated by environmental factors such as occlusion, lighting, and fog.
  • In the related art, methods for detecting human skeleton key points are based on variants or improvements of the Stacked Hourglass algorithm.
  • Stacked Hourglass is an algorithm proposed by Newell et al. for human pose estimation, which predicts human body key points by generating heat maps.
  • The Stacked Hourglass algorithm in the related art applies the same degree of deep feature learning to the feature map of every human skeleton key point.
  • The inventors realized that, when the related art is used to detect human body key points, the accuracy of the heat maps of the human skeleton key points is low, which ultimately reduces the accuracy of key point detection.
  • The present application discloses a method, apparatus, electronic device, and storage medium for detecting human skeleton key points, so as to improve the accuracy of the learned heat maps of multiple human skeleton key points and thereby improve the accuracy of human skeleton key point detection.
  • A method for detecting human skeleton key points is provided, including:
  • wherein the stacked hourglass network includes a plurality of hourglass networks connected in series, and each hourglass network performs deep feature learning on the feature maps of the human skeleton key points based on the weight values corresponding to those feature maps.
  • An apparatus for detecting human skeleton key points is provided, including:
  • a data acquisition unit configured to acquire an original image, wherein the original image includes a plurality of human skeleton key points; and
  • a heat map acquisition module configured to perform skeleton key point recognition on the original image, based on a pre-trained stacked hourglass network for skeleton key point detection, to obtain heat maps of the plurality of human skeleton key points;
  • wherein the stacked hourglass network includes a plurality of hourglass networks connected in series, and each hourglass network performs deep feature learning on the feature maps of the human skeleton key points based on the weight values corresponding to those feature maps.
  • An electronic device is provided, including:
  • a memory for storing processor-executable instructions; and
  • a processor configured to:
  • wherein the stacked hourglass network includes a plurality of hourglass networks connected in series, and each hourglass network performs deep feature learning on the feature maps of the human skeleton key points based on the weight values corresponding to those feature maps.
  • A non-transitory computer-readable storage medium is provided, which stores computer instructions that, when executed, implement the method for detecting human skeleton key points described above.
  • In the solution provided by the embodiments of the present application, a stacked hourglass network is constructed in advance, and each hourglass network of the stacked hourglass network performs deep feature learning on the feature maps of multiple human skeleton key points based on the weight values corresponding to those feature maps. Skeleton key point recognition is then performed on the original image, based on the pre-trained stacked hourglass network for skeleton key point detection, to obtain heat maps of the multiple human skeleton key points.
  • This solution takes into account that different feature maps influence the heat maps of different human skeleton key points differently. It introduces an attention mechanism into the detection process, that is, it sets weight values for the feature maps of the human skeleton key points so that the degree of deep feature learning differs between feature maps. The solution can therefore improve the accuracy of the learned heat maps of multiple human skeleton key points and thereby improve the accuracy of human skeleton key point detection.
  • The original image is subjected in advance to multiple maximum pooling sampling passes and multiple average pooling sampling passes to obtain multiple maximum pooled images and multiple average pooled images.
  • Because pooling uses the information of multiple nearby pixels, context is introduced into the extracted feature maps of the multiple skeleton key points, which improves the accuracy of human skeleton key point detection.
  • Only the corresponding maximum pooled images and average pooled images need to be inserted, which further reduces the amount of computation and increases the computation speed.
  • Fig. 1 is a flowchart of a method for detecting key points of a human skeleton according to an exemplary embodiment
  • Fig. 2 is a flowchart of a method for detecting key points of a human skeleton according to an exemplary embodiment
  • FIG. 3a is a schematic structural diagram of an hourglass network provided by an embodiment of the present application.
  • FIG. 3b is a schematic diagram of the principle of a stacked hourglass network provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of an attention mechanism module provided by an embodiment of the present application.
  • Fig. 5 is a schematic structural diagram of a device for detecting key points of a human skeleton according to an exemplary embodiment
  • Fig. 6 is a schematic structural diagram of an electronic device for performing a method for detecting key points of a human skeleton according to an exemplary embodiment
  • Fig. 7 is a block diagram of an electronic device for performing a method for detecting key points of a human skeleton according to an exemplary embodiment.
  • Embodiments of the present application provide a method, apparatus, electronic device, and storage medium for detecting human skeleton key points.
  • The following first introduces the method for detecting human skeleton key points provided by the embodiments of the present application.
  • The method may be executed by an apparatus for detecting human skeleton key points.
  • The apparatus may run in an electronic device.
  • The electronic device may be a server or a terminal device.
  • Fig. 1 is a flowchart of a method for detecting key points of a human skeleton according to an exemplary embodiment. As shown in Fig. 1, the method provided by the embodiments of the present application includes the following steps:
  • In step S101, an original image is obtained, wherein the original image includes a plurality of human skeleton key points.
  • A picture of a human posture is extracted from a video or image file as the original image. The original image includes a plurality of human skeleton key points, such as joints, limbs, and facial features. After the key points in the original image have been learned, human posture information can be described by analyzing them.
  • The color mode of the original image may be RGB, that is, the red, green, and blue color mode, but is not limited to this.
  • The color mode of the original image may also be CMYK, HSB, or another mode.
  • CMYK is a printing color mode whose four letters refer to cyan, magenta, yellow, and black, the four ink colors used in printing. HSB is a mode whose color elements are hue, saturation, and brightness.
  • In step S102, skeleton key point recognition is performed on the original image, based on a pre-trained stacked hourglass network for skeleton key point detection, to obtain heat maps of the multiple human skeleton key points. The stacked hourglass network includes a plurality of hourglass networks connected in series, and each hourglass network performs deep feature learning on the feature maps of the human skeleton key points based on the weight values corresponding to those feature maps.
  • The output of each hourglass network is a set of heat maps of the human skeleton key points.
  • For every hourglass network except the last, the output heat maps are used as input to the next hourglass network; for the last hourglass network, the output heat maps are used as the final output for the original image.
  • The attention mechanism sets weight values for the feature maps of the human skeleton key points, so that the degree of deep feature learning differs between feature maps.
  • The topology of an hourglass network is symmetrical, and the network structure is shaped like an hourglass.
  • The number of hourglass networks included in the stacked hourglass network can be set according to actual conditions and is not limited by this application.
  • The order of each hourglass network can also be set according to actual conditions and is not limited by this application.
  • The hourglass network is usually a second-order hourglass network, that is, a network with two mutually symmetrical network structures, but is not limited to this.
  • The attention mechanism can be introduced through SENet-style learning, which automatically obtains the weight value corresponding to each feature map and adjusts the deep feature learning of each feature map according to the obtained weight value.
  • The core idea of SENet is to let the network learn the feature weights from the loss, so that effective feature maps receive large weights and ineffective or weakly effective feature maps receive small weights, and the trained model achieves better results.
  • Introducing the attention mechanism through SENet-style learning is only one specific implementation and should not limit the embodiments of the present application. Any specific implementation that can obtain the weight values corresponding to the feature maps of the multiple human skeleton key points can be applied to this application.
  • Each feature map can be regarded as a feature channel, so the weight value corresponding to each feature map can also be called the weight value of the corresponding feature channel.
  • After the heat maps are obtained, the coordinate position of each human skeleton key point can be derived from them, and the coordinate positions can then be used to analyze the human posture information in the original image.
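  • The step from heat maps to coordinate positions can be sketched as follows. This is a minimal illustration that assumes each key point is located at the peak of its heat map; the source does not state how the coordinates are derived, and the function name `heatmaps_to_keypoints` and the toy data are hypothetical.

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps):
    """Return one (x, y, score) row per key point from heat maps of shape (K, H, W).

    Assumption: the key point sits at the maximum of its heat map.
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    idx = flat.argmax(axis=1)                 # peak index in each flattened map
    ys, xs = np.unravel_index(idx, (h, w))    # convert back to row/column
    scores = flat[np.arange(k), idx]          # peak value, usable as a confidence
    return np.stack([xs, ys, scores], axis=1)

# Two toy 4x4 heat maps with known peaks.
hm = np.zeros((2, 4, 4))
hm[0, 1, 2] = 0.9   # key point 0 at x=2, y=1
hm[1, 3, 0] = 0.7   # key point 1 at x=0, y=3
keypoints = heatmaps_to_keypoints(hm)
```

  • Coordinates found this way are in heat map resolution; mapping them back to the original image would require scaling by the downsampling factor.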
  • A stacked hourglass network is constructed in advance, and each hourglass network of the stacked hourglass network performs deep feature learning on the feature maps of multiple human skeleton key points based on the weight values corresponding to those feature maps. Skeleton key point recognition is then performed on the original image, based on the pre-trained stacked hourglass network for skeleton key point detection, to obtain the heat maps of the multiple human skeleton key points.
  • Considering that different feature maps influence the heat maps of different human skeleton key points differently, an attention mechanism is introduced into the detection process, that is, weight values are set for the feature maps of the human skeleton key points so that the degree of deep feature learning differs between feature maps. The solution can therefore improve the accuracy of the learned heat maps of multiple human skeleton key points and thereby improve the accuracy of human skeleton key point detection.
  • The structure of each hourglass network may be as follows:
  • between two adjacent downsampling operations: multiple convolution modules and multiple attention mechanism modules;
  • between two adjacent upsampling operations: one or more convolution modules;
  • between an adjacent downsampling operation and upsampling operation: multiple convolution modules and multiple attention mechanism modules; and
  • on each skip path: multiple convolution modules and multiple attention mechanism modules, arranged alternately.
  • For the multiple convolution modules and multiple attention mechanism modules between two adjacent downsampling operations, the number of convolution modules is greater than the number of attention mechanism modules, and each attention mechanism module is located between two convolution modules. Similarly, for the modules between an adjacent downsampling operation and upsampling operation, the number of convolution modules is greater than the number of attention mechanism modules, and each attention mechanism module is located between two convolution modules. The same holds for the convolution modules and attention mechanism modules on each skip path.
  • An exemplary structure of an hourglass network is shown in FIG. 3a.
  • FIG. 3b shows a schematic diagram of the principle of a stacked hourglass network with two hourglass networks connected in series. In this figure, the topology of each hourglass network is symmetrical.
  • The convolution modules extract the feature maps of the multiple human skeleton key points.
  • The attention mechanism modules learn, with the SENet algorithm, a set of weight values corresponding to the feature maps of the multiple human skeleton key points.
  • The attention mechanism module includes a global pooling layer, multiple fully connected layers, and a nonlinear activation layer.
  • The global pooling layer reduces the dimension of the feature maps of the multiple human skeleton key points.
  • The fully connected (FC) layers synthesize the dimension-reduced feature maps to obtain a one-dimensional vector.
  • The nonlinear activation layer normalizes the one-dimensional vector into a feature vector.
  • The number of fully connected layers can be set arbitrarily. Any node in a fully connected layer is connected to every node of the preceding layer and of the following layer; that is, each node of a fully connected layer is connected to all nodes of the previous layer in order to synthesize the features extracted earlier.
  • In the nonlinear activation layer, values are normalized to a fixed interval by the sigmoid function.
  • The sigmoid function is often used as a threshold function in neural networks and maps variables into the interval between 0 and 1.
  • Where the attention mechanism module includes a global pooling layer, multiple fully connected layers, and a nonlinear activation layer, in a specific implementation the feature maps of the multiple human skeleton key points are passed through the attention mechanism module along two paths, a lower path and an upper path:
  • in the upper path, the feature maps of the multiple human skeleton key points are reduced in dimension by the global pooling layer;
  • the one-dimensional vector is normalized into a feature vector by the nonlinear activation layer; and
  • the feature maps of the multiple human skeleton key points in the lower path are fused with the feature vector.
  • Fig. 4 shows a schematic diagram of the structure of the attention mechanism module.
  • First, the feature maps of the multiple human skeleton key points are reduced in dimension by a global pooling layer (global pool).
  • Next, the dimension-reduced feature maps are synthesized to obtain a one-dimensional vector.
  • The one-dimensional vector is then normalized into a feature vector by a nonlinear activation layer.
  • In the notation c × h × w, c represents the number of channels, h represents the height of the feature map, and w represents the width of the feature map.
  • The number of channels can be specified directly in the fully connected layers of the neural network.
  • Finally, the feature maps of the multiple human skeleton key points in the lower path are fused with the feature vector.
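  • The attention mechanism module described above can be sketched in NumPy as follows. Several details are assumptions that the source leaves open: global pooling is taken to be global average pooling, two fully connected layers are used with a ReLU between them, and fusion is channel-wise multiplication as in SENet. The names `attention_module`, `w1`, and `w2` are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_module(features, w1, w2):
    """features: (C, H, W); w1: (C_mid, C); w2: (C, C_mid).

    Returns the re-weighted feature maps and the per-channel weights.
    """
    # Upper path: global pooling reduces (C, H, W) to a length-C vector.
    squeezed = features.mean(axis=(1, 2))       # assumption: average pooling
    # Fully connected layers synthesize the pooled features.
    hidden = np.maximum(w1 @ squeezed, 0.0)     # assumption: ReLU in between
    # Nonlinear activation layer: sigmoid maps each weight into (0, 1).
    weights = sigmoid(w2 @ hidden)
    # Fusion with the lower path: scale each channel by its weight (assumption).
    return features * weights[:, None, None], weights

rng = np.random.default_rng(0)
c, h, w = 8, 16, 16
features = rng.standard_normal((c, h, w))
w1 = rng.standard_normal((c // 2, c)) * 0.1
w2 = rng.standard_normal((c, c // 2)) * 0.1
reweighted, weights = attention_module(features, w1, w2)
```

  • Because every weight lies between 0 and 1, a channel with a small weight contributes less to subsequent layers, which is one way to realize different degrees of feature learning per feature map.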
  • Before skeleton key point recognition is performed on the original image, based on the pre-trained stacked hourglass network for skeleton key point detection, to obtain the heat maps of the multiple human skeleton key points, the method may also include:
  • Performing skeleton key point recognition on the original image, based on the pre-trained stacked hourglass network for skeleton key point detection, to obtain the heat maps of the multiple human skeleton key points includes:
  • inputting the first image into the stacked hourglass network to obtain heat maps of the multiple human skeleton key points.
  • The number of downsampling operations mentioned above is not limited.
  • For example, the original image may be downsampled twice to obtain the first image. Specifically, the original image passes through two convolutional layers with a stride of 2, each of which halves the resolution.
  • The method further includes:
  • Inputting the first image into the stacked hourglass network to obtain the heat maps of the multiple human skeleton key points includes:
  • inputting the first image, the image obtained by each maximum pooling sampling, and the image obtained by each average pooling sampling into the stacked hourglass network to obtain heat maps of the multiple human skeleton key points.
  • Pooling is used to reduce the number of features and the number of parameters.
  • It integrates the feature points in the neighborhood being processed to obtain new features.
  • Average pooling takes the average of the feature points in the neighborhood.
  • Maximum pooling takes the maximum of the feature points in the neighborhood.
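  • The two pooling operations can be illustrated with a small NumPy sketch. It assumes non-overlapping 2 × 2 neighborhoods, which the source does not specify; the function name `pool2x2` is hypothetical.

```python
import numpy as np

def pool2x2(image, mode):
    """Pool a (C, H, W) array over non-overlapping 2x2 neighborhoods.

    H and W must be even; mode is 'max' or 'avg'.
    """
    c, h, w = image.shape
    # Split each spatial axis into (blocks, 2) so every 2x2 neighborhood
    # occupies axes 2 and 4 of the reshaped array.
    blocks = image.reshape(c, h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(2, 4))
    if mode == "avg":
        return blocks.mean(axis=(2, 4))
    raise ValueError("mode must be 'max' or 'avg'")

x = np.array([[[1.0, 2.0, 5.0, 6.0],
               [3.0, 4.0, 7.0, 8.0]]])   # shape (1, 2, 4)
max_pooled = pool2x2(x, "max")           # neighborhoods {1,2,3,4} and {5,6,7,8}
avg_pooled = pool2x2(x, "avg")
```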
  • The original image is subjected in advance to multiple maximum pooling sampling passes and multiple average pooling sampling passes to obtain multiple maximum pooled images and multiple average pooled images. Because the pooling process uses the information of multiple nearby pixels, context is introduced into the extracted feature maps of the multiple human skeleton key points, which improves the accuracy of human skeleton key point detection.
  • The stacked hourglass network includes a first hourglass network and a second hourglass network, both of which are second-order hourglass networks;
  • performing multiple maximum pooling sampling and multiple average pooling sampling on the original image respectively includes:
  • inputting the first image, the image obtained by each maximum pooling sampling, and the image obtained by each average pooling sampling into the stacked hourglass network to obtain heat maps of the multiple human skeleton key points includes:
  • after each hourglass network performs the second downsampling, inserting the fourth maximum pooled image and the fourth average pooled image into the convolution path of that hourglass network.
  • The pre-trained stacked hourglass network includes a first hourglass network and a second hourglass network connected in series.
  • Fig. 2 is a flowchart of a method for detecting key points of a human skeleton according to an exemplary embodiment. The specific steps include:
  • S203: perform multiple maximum pooling sampling and multiple average pooling sampling on the original image respectively to obtain four maximum pooled images and four average pooled images.
  • The four maximum pooled images are the first, second, third, and fourth maximum pooled images; the four average pooled images are the first, second, third, and fourth average pooled images.
  • The heat maps of the multiple human skeleton key points are obtained.
  • Step S201 is the same as S101 in Fig. 1 and is not repeated here.
  • In step S202, the original image is downsampled twice to obtain the first image I₁. Specifically, the original image passes through two convolutional layers with a stride of 2. Each layer halves the resolution of its input, from m × n to (m/2) × (n/2), so after both layers the first image has a resolution of (m/4) × (n/4).
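  • The resolution arithmetic of step S202 can be checked with the standard convolution output-size formula. The kernel size of 3 and padding of 1 are assumptions chosen so that a stride of 2 exactly halves an even input; the source only states the stride. The function names are hypothetical.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution along one axis.

    kernel=3 and padding=1 are assumed values; only stride=2 is from the text.
    """
    return (size + 2 * padding - kernel) // stride + 1

def downsample_twice(height, width):
    """Apply two stride-2 convolutional layers and return the final resolution."""
    for _ in range(2):
        height = conv_output_size(height)
        width = conv_output_size(width)
    return height, width

# A 128 x 128 original image becomes 64 x 64, then 32 x 32.
first_image_size = downsample_twice(128, 128)
```

  • With a 128 × 128 original image, the first image therefore has the same spatial size as the second maximum pooled and second average pooled images.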
  • In step S203, the original image is subjected to multiple maximum pooling sampling passes and multiple average pooling sampling passes.
  • First, maximum pooling sampling and average pooling sampling are each performed on the original image to obtain the first maximum pooled image and the first average pooled image.
  • Next, maximum pooling sampling is performed on the first maximum pooled image and average pooling sampling is performed on the first average pooled image to obtain the second maximum pooled image and the second average pooled image.
  • In the same way, the third maximum pooled image and the third average pooled image are obtained from the second maximum pooled image and the second average pooled image, and the fourth maximum pooled image and the fourth average pooled image are obtained from the third maximum pooled image and the third average pooled image.
  • Pooling the first maximum pooled image and the first average pooled image is equivalent to performing maximum pooling sampling and average pooling sampling on the original image twice. Similarly, pooling the second maximum pooled image and the second average pooled image is equivalent to pooling the original image three times, and pooling the third maximum pooled image and the third average pooled image is equivalent to pooling the original image four times.
  • For example, the size of the original image is 3 × 128 × 128, where 3 is the number of channels in the RGB image (R is red, G is green, and B is blue) and 128 × 128 is the number of pixels in the RGB image.
  • Performing maximum pooling sampling and average pooling sampling once on the original image gives the first maximum pooled image AM₁ and the first average pooled image AE₁, each of size 3 × 64 × 64.
  • Pooling the first maximum pooled image and the first average pooled image gives the second maximum pooled image AM₂ and the second average pooled image AE₂, each of size 3 × 32 × 32.
  • Pooling the second maximum pooled image and the second average pooled image gives the third maximum pooled image AM₃ and the third average pooled image AE₃, each of size 3 × 16 × 16.
  • Pooling the third maximum pooled image and the third average pooled image gives the fourth maximum pooled image AM₄ and the fourth average pooled image AE₄, each of size 3 × 8 × 8.
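  • The size progression in this example can be reproduced with a short NumPy sketch. It assumes non-overlapping 2 × 2 pooling windows, which is consistent with each pass halving the height and width but is not stated in the source; `pool2x2` and `build_pyramids` are hypothetical names, and the input is random data standing in for an RGB image.

```python
import numpy as np

def pool2x2(image, reduce_fn):
    """Reduce each non-overlapping 2x2 neighborhood of a (C, H, W) array."""
    c, h, w = image.shape
    return reduce_fn(image.reshape(c, h // 2, 2, w // 2, 2), axis=(2, 4))

def build_pyramids(original, levels=4):
    """Repeatedly pool the previous level: max on the max branch, mean on the
    average branch, as described for step S203."""
    max_images, avg_images = [], []
    max_cur, avg_cur = original, original
    for _ in range(levels):
        max_cur = pool2x2(max_cur, np.max)
        avg_cur = pool2x2(avg_cur, np.mean)
        max_images.append(max_cur)
        avg_images.append(avg_cur)
    return max_images, avg_images

original = np.random.default_rng(0).random((3, 128, 128))
am, ae = build_pyramids(original)        # am[0] is AM1, ..., am[3] is AM4
shapes = [img.shape for img in am]
```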
  • step S204 the network structure of the first hourglass network and the second hourglass network in the pre-established stacked hourglass network is the same.
  • FIG. 3a exemplarily shows the network structure of the first hourglass network or the second hourglass network.
  • the specific structure of the hourglass network in the stacked hourglass network is described below with reference to the network structure of the first hourglass network or the second hourglass network as shown in FIG. 3a.
  • the details are as follows:
  • In each hourglass network, before each downsampling, a skip path is branched off from the convolution path to retain the feature maps of the multiple human skeleton key points at the original scale, that is, the scale of the feature maps before the downsampling.
  • After the first downsampling, the third max-pooled image AM3 and the third average-pooled image AE3 are inserted into the convolution path.
  • After the second downsampling, the fourth max-pooled image AM4 and the fourth average-pooled image AE4 are inserted into the convolution path.
  • After each upsampling, the feature maps of the multiple human skeleton key points of the convolution path are fused with the feature maps of the skip path at the previous scale. Between two downsamplings there are three convolution modules and two attention mechanism modules, arranged alternately; the first convolution module has M input channels and N output channels, and the other two convolution modules have N input channels and N output channels. Between two upsamplings there is one convolution module with N input channels and N output channels.
  • Between a downsampling and an upsampling there are four convolution modules and two attention mechanism modules, the two attention mechanism modules alternating with the first three convolution modules; the first convolution module has M input channels and N output channels, and the other three convolution modules have N input channels and N output channels. Each skip path likewise includes four convolution modules and two attention mechanism modules, the two attention mechanism modules alternating with the first three convolution modules; the first convolution module has M input channels and N output channels, and the other three convolution modules have N input channels and N output channels.
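The module ordering described above can be written down as a small layout builder. This is only a sketch of the ordering and channel counts, not of the modules themselves; `segment_layout` is a hypothetical helper, and M = 64 and N = 128 are placeholder values, since the text leaves M and N unspecified.

```python
def segment_layout(num_conv, num_attention, m, n):
    """List the modules of one segment in order, as (kind, in_ch, out_ch).

    The first convolution maps M -> N channels and the others map N -> N.
    An attention module follows each of the first `num_attention`
    convolutions, so every attention module sits between two convolutions.
    """
    layout = []
    for i in range(num_conv):
        in_channels = m if i == 0 else n
        layout.append(("conv", in_channels, n))
        if i < num_attention:
            layout.append(("attention", n, n))
    return layout


M, N = 64, 128  # placeholder channel counts
between_downsamplings = segment_layout(3, 2, M, N)
skip_path = segment_layout(4, 2, M, N)
print([kind for kind, _, _ in between_downsamplings])
print([kind for kind, _, _ in skip_path])
```

With three convolutions and two attention modules the order is conv, attention, conv, attention, conv; with four convolutions the extra convolution is appended at the end, which matches the statement that the attention modules alternate with the first three convolution modules.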
  • FIG. 3b exemplarily shows a schematic diagram of the principle of generating a heat map of a stacked hourglass network including a first hourglass network and a second hourglass network.
  • The process of obtaining the heat maps of the multiple human skeleton key points in step S204 is described below with reference to FIG. 3b:
  • The first image I1, the second max-pooled image AM2 and the second average-pooled image AE2 are input to the first hourglass network.
  • The heat map O1 output by the first hourglass network is input to the second hourglass network; at the same time, the first image I1, the second max-pooled image AM2 and the second average-pooled image AE2 are fed into the second hourglass network.
  • The output of the first hourglass network is the heat map O1 corresponding to the multiple human skeleton key points.
  • During training of the stacked hourglass network, the first hourglass network compares the heat map O1 with the ground truth to produce a loss, which is back-propagated.
  • Based on the deep feature learning of the convolution modules and attention mechanism modules in the first and second hourglass networks, the heat maps of the multiple human skeleton key points are obtained. Each heat map corresponds to one human skeleton key point.
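The data flow through the two stacked networks can be sketched as follows. The two networks are replaced by stand-in functions that return fixed values, and the mean squared error is an assumed loss; the text only says that O1 is compared with the ground truth to produce a loss.

```python
def mse(pred, target):
    """Mean squared error between two equally long lists (assumed loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def run_stack(first_net, second_net, i1, am2, ae2):
    """Run a two-network stack; each network can be any callable."""
    o1 = first_net(i1, am2, ae2)        # heat map O1 of the first network
    o2 = second_net(o1, i1, am2, ae2)   # O1 plus the same three images
    return o1, o2


# Stand-in networks with fixed outputs (hypothetical, for illustration only).
def first_net(i1, am2, ae2):
    return [0.1, 0.8, 0.1]


def second_net(o1, i1, am2, ae2):
    return [0.05, 0.9, 0.05]


o1, o2 = run_stack(first_net, second_net, "I1", "AM2", "AE2")
ground_truth = [0.0, 1.0, 0.0]
loss_o1 = mse(o1, ground_truth)  # loss used to supervise the first network
print(round(loss_o1, 4))
```

Supervising the intermediate output O1 gives the first network a training signal of its own, independent of what the second network does with O1.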
  • According to the embodiments of the present application, the original image is subjected in advance to multiple max-pooling operations and multiple average-pooling operations, yielding multiple max-pooled images and average-pooled images.
  • Because pooling uses the information of several neighboring pixels, context is introduced into the extracted feature maps of the skeleton key points, which improves the accuracy of human skeleton key point detection.
  • Each hourglass network only needs to take the corresponding max-pooled and average-pooled images and insert them, which further reduces the amount of computation and increases the computation speed.
  • In addition, the heat map produced by the first hourglass network is used as an input of the next hourglass network, so that the second hourglass network can use the relationships between joints learned by the first hourglass network; this enriches the input of the second hourglass network and further improves the accuracy of human skeleton key point detection.
  • FIG. 4 exemplarily shows the structure of the attention mechanism module.
  • the following is a detailed description of the process of obtaining the heat map of multiple key points of human bones based on the deep feature learning of the convolution module and the attention mechanism module in the first hourglass network and the second hourglass network in conjunction with FIG. 4:
  • In the first and second hourglass networks, the feature maps of the multiple human skeleton key points are extracted by the convolution modules.
  • In the attention mechanism module, learning with the SENet algorithm yields a set of weights corresponding to the feature maps of the multiple human skeleton key points, that is, the importance of each feature map. This set of weights then lets the network focus on the features with large weights.
  • Specifically, in the attention mechanism module, the feature maps of the multiple human skeleton key points are transmitted along two branches, a lower path and an upper path.
  • In the upper path, the feature maps of the multiple human skeleton key points are reduced in dimension by the global pooling layer.
  • In the upper path, through SENet learning in the two fully connected (FC) layers, the dimension-reduced feature maps are combined into a one-dimensional vector.
  • In the upper path, the one-dimensional vector is normalized into a feature vector by the non-linear activation layer.
  • In c×h×w, c denotes the number of channels, h the height of the feature map, and w the width of the feature map.
  • The number of channels can be specified directly in the fully connected layers of the neural network.
  • The feature maps of the multiple human skeleton key points in the lower path are fused with the feature vector.
  • According to the embodiments of the present application, in the attention mechanism module, learning with the SENet algorithm yields a set of weights corresponding to the feature maps of the multiple human skeleton key points, that is, the importance of each feature map. This set of weights then lets the network focus on the features with large weights.
  • Introducing an attention mechanism into the detection of human skeleton key points improves the accuracy of the learned feature maps of the multiple human skeleton key points, and thereby the accuracy of human skeleton key point detection.
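The upper and lower paths of the attention mechanism module can be sketched in plain Python as below. Several details are assumptions borrowed from the usual SENet formulation rather than taken from the text: global pooling is the per-channel mean, a ReLU sits between the two fully connected layers, and fusion is a channel-wise multiplication. The weight matrices `w1` and `w2` are made-up values, not learned ones.

```python
import math


def se_attention(feature_maps, w1, w2):
    """Re-weight C feature maps (each an H x W nested list)."""
    # Upper path, global pooling: reduce each feature map to one number.
    pooled = [sum(map(sum, fm)) / (len(fm) * len(fm[0])) for fm in feature_maps]
    # First fully connected layer, followed by an assumed ReLU.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, pooled))) for row in w1]
    # Second fully connected layer, then sigmoid to map each weight into (0, 1).
    weights = [
        1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, hidden))))
        for row in w2
    ]
    # Fusion with the lower path: scale every feature map by its weight.
    fused = [
        [[weight * x for x in row] for row in fm]
        for fm, weight in zip(feature_maps, weights)
    ]
    return weights, fused


# Two 2x2 feature maps (c = 2, h = w = 2) and made-up layer weights.
maps = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[0.5, 0.5]]        # 2 channels -> 1 hidden unit
w2 = [[2.0], [-2.0]]     # 1 hidden unit -> 2 channel weights
weights, fused = se_attention(maps, w1, w2)
print(weights)
```

In this toy run the first channel receives a weight close to 1 and the second a weight close to 0, so the fused output keeps the first feature map almost unchanged and suppresses the second.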
  • Fig. 5 is a block diagram of a device for detecting key points of a human skeleton according to an exemplary embodiment.
  • the apparatus for detecting key points of human bones provided in the embodiments of the present application may include:
  • the data obtaining unit 510 is configured to obtain an original image, wherein the original image includes a plurality of human bone key points;
  • the heat map generating unit 520 is configured to perform bone key point identification on the original image based on a pre-trained stacked hourglass network for bone key point detection to obtain heat maps of the multiple human bone key points;
  • wherein the stacked hourglass network includes multiple hourglass networks connected in series, and each hourglass network performs deep feature learning on the feature maps of the multiple human skeleton key points based on the weight values corresponding to those feature maps.
  • Optionally, in one implementation, in each hourglass network, before each downsampling, a skip path is branched off from the convolution path; the skip path retains the feature maps of the multiple human skeleton key points at the original scale, the original scale being the scale before this downsampling;
  • after each upsampling, the feature maps of the multiple human skeleton key points of the convolution path are fused with the feature maps of the multiple human skeleton key points of the skip path corresponding to this upsampling;
  • the skip path corresponding to this upsampling is the skip path branched off before the downsampling that is symmetric to this upsampling;
  • between two adjacent downsamplings there are multiple convolution modules and multiple attention mechanism modules;
  • between two adjacent upsamplings there are one or more convolution modules;
  • between an adjacent downsampling and upsampling there are multiple convolution modules and multiple attention mechanism modules; and each skip path includes multiple convolution modules and multiple attention mechanism modules.
  • Optionally, in one implementation, in each hourglass network, the feature maps of the multiple human skeleton key points are extracted by the convolution modules;
  • in the attention mechanism module, a set of weight values corresponding to the feature maps of the multiple human skeleton key points is obtained through learning with the SENet algorithm.
  • Optionally, in one implementation, the attention mechanism module includes a global pooling layer, multiple fully connected layers, and a non-linear activation layer;
  • correspondingly, in the attention mechanism module, the feature maps of the multiple human skeleton key points are transmitted along a lower path and an upper path;
  • in the upper path, the feature maps of the multiple human skeleton key points are reduced in dimension by the global pooling layer, and, through SENet learning in the multiple fully connected layers, the dimension-reduced feature maps are combined into a one-dimensional vector;
  • in the upper path, the one-dimensional vector is normalized into a feature vector by the non-linear activation layer; and the feature maps of the multiple human skeleton key points in the lower path are fused with the feature vector.
  • the device further includes:
  • a downsampling unit configured to downsample the original image multiple times to obtain a first image, before the heat map generating unit performs skeleton key point recognition on the original image based on the pre-trained stacked hourglass network for skeleton key point detection to obtain the heat maps of the multiple human skeleton key points;
  • correspondingly, the heat map generating unit is specifically configured to: input the first image to the stacked hourglass network to obtain the heat maps of the multiple human skeleton key points.
  • Optionally, the device further includes: a pooling unit configured to perform multiple max-pooling operations and multiple average-pooling operations on the original image before the heat map generating unit inputs the first image to the stacked hourglass network to obtain the heat maps of the multiple human skeleton key points;
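  • correspondingly, the heat map generating unit is specifically configured to: input the first image, the image obtained by each max pooling and the image obtained by each average pooling to the stacked hourglass network to obtain the heat maps of the multiple human skeleton key points.
  • Optionally, the stacked hourglass network includes a first hourglass network and a second hourglass network; the pooling unit is specifically configured to: perform max pooling and average pooling on the original image to obtain a first max-pooled image and a first average-pooled image; perform max pooling and average pooling on the first max-pooled image and the first average-pooled image, respectively, to obtain a second max-pooled image and a second average-pooled image; perform max pooling and average pooling on the second max-pooled image and the second average-pooled image, respectively, to obtain a third max-pooled image and a third average-pooled image; and perform max pooling and average pooling on the third max-pooled image and the third average-pooled image, respectively, to obtain a fourth max-pooled image and a fourth average-pooled image;
  • correspondingly, the heat map generating unit is specifically configured to: input the first image, the first max-pooled image and the first average-pooled image to the first hourglass network; input the heat map output by the first hourglass network to the second hourglass network while feeding the first image, the second max-pooled image and the second average-pooled image into the second hourglass network; and, after each hourglass network performs its first downsampling, insert the third max-pooled image and the third average-pooled image into the convolution path of that hourglass network, and after each hourglass network performs its second downsampling, insert the fourth max-pooled image and the fourth average-pooled image into the convolution path of that hourglass network.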
  • an embodiment of the present invention also provides an electronic device for the above method for detecting the key points of the human skeleton.
  • the electronic device includes: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute any method for detecting a key point of a human skeleton provided by an embodiment of the present application.
  • Fig. 6 is a block diagram of an electronic device 600 for a method for detecting key points of a human skeleton according to an exemplary embodiment.
  • the electronic device 600 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and so on.
  • the electronic device 600 may include one or more of the following components: a processing component 602, a memory 604, a power supply component 606, a multimedia component 608, an audio component 610, an input / output (I / O) interface 612, a sensor component 614, and a communication component 616.
  • the processing component 602 generally controls the overall operations of the electronic device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing component 602 may include one or more processors 620 to execute instructions to complete all or part of the steps of the method for detecting human bone key points described in the embodiments of the present application.
  • the processing component 602 may include one or more modules to facilitate interaction between the processing component 602 and other components.
  • for example, the processing component 602 may include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
  • the memory 604 is configured to store various types of data to support operation at the device 600. Examples of these data include instructions for any application or method operating on the device 600, contact data, phone book data, messages, pictures, videos, and so on.
  • the memory 604 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • the power supply component 606 provides power to various components of the electronic device 600.
  • the power component 606 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 600.
  • the multimedia component 608 includes a screen that provides an output interface between the electronic device 600 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touch, swipe, and gestures on the touch panel. The touch sensor may not only sense the boundary of the touch or sliding action, but also detect the duration and pressure related to the touch or sliding operation.
  • the multimedia component 608 includes a front camera and / or a rear camera. When the electronic device 600 is in an operation mode, such as a shooting mode or a video mode, the front camera and / or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
  • the audio component 610 is configured to output and / or input audio signals.
  • the audio component 610 includes a microphone (MIC).
  • When the electronic device 600 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive an external audio signal.
  • the received audio signal may be further stored in the memory 604 or transmitted via the communication component 616.
  • the audio component 610 further includes a speaker for outputting audio signals.
  • the I / O interface 612 provides an interface between the processing component 602 and a peripheral interface module.
  • the peripheral interface module may be a keyboard, a click wheel, or a button. These buttons may include, but are not limited to: home button, volume button, enable button, and lock button.
  • the sensor component 614 includes one or more sensors for providing the electronic device 600 with status assessment in various aspects.
  • for example, the sensor component 614 can detect the open / closed state of the device 600 and the relative positioning of components, for example the display and keypad of the device 600; the sensor component 614 can also detect a change in position of the electronic device 600 or of one of its components, the presence or absence of user contact with the electronic device 600, the orientation or acceleration / deceleration of the electronic device 600, and a temperature change of the electronic device 600.
  • the sensor assembly 614 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
  • the sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor component 614 may further include an acceleration sensor, a gyro sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • the communication component 616 is configured to facilitate wired or wireless communication between the electronic device 600 and other devices.
  • the electronic device 600 can access a wireless network based on a communication standard, such as WiFi, an operator network (such as 2G, 3G, 4G, or 5G), or a combination thereof.
  • the communication component 616 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 616 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • the electronic device 600 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.
  • Fig. 7 is a block diagram of an electronic device 700 for a method for detecting a human bone key point according to an exemplary embodiment.
  • the electronic device 700 may be provided as a server.
  • the electronic device 700 includes a processing component 722, which further includes one or more processors, and memory resources represented by the memory 732, for storing instructions executable by the processing component 722, such as application programs.
  • the application programs stored in the memory 732 may include one or more modules each corresponding to a set of instructions.
  • the processing component 722 is configured to execute the instructions so as to perform the above-mentioned method for detecting human skeleton key points.
  • the electronic device 700 may further include a power component 726 configured to perform power management of the electronic device 700, a wired or wireless network interface 750 configured to connect the electronic device 700 to the network, and an input output (I / O) interface 758 .
  • the electronic device 700 can operate based on an operating system stored in the memory 732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
  • an embodiment of the present application also provides a non-transitory computer-readable storage medium that stores computer instructions which, when executed, implement the method for detecting human skeleton key points provided by the embodiments of the present application.
  • the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

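A method and apparatus for detecting human skeleton key points, an electronic device, and a storage medium. The method includes: acquiring an original image, where the original image includes multiple human skeleton key points (S101); and performing skeleton key point recognition on the original image based on a pre-trained stacked hourglass network for skeleton key point detection, to obtain heat maps of the multiple human skeleton key points, where the stacked hourglass network includes multiple hourglass networks connected in series, and each hourglass network performs deep feature learning on the feature maps of the multiple human skeleton key points based on the weight values corresponding to those feature maps (S102). This improves the accuracy of the learned heat maps of the multiple human skeleton key points, and thereby the accuracy of human skeleton key point detection.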

Description

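Method and apparatus for detecting human skeleton key points, electronic device, and storage medium
This application claims priority to Chinese patent application No. 201811319932.7, filed with the China Patent Office on November 7, 2018 and entitled "Method and apparatus for detecting human skeleton key points, electronic device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application belongs to the field of computer software applications, and in particular relates to a method and apparatus for detecting human skeleton key points, an electronic device, and a storage medium.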
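Background
Human skeleton key points are essential for describing human posture and predicting human behavior. Their detection therefore underlies many computer vision tasks, such as action classification, abnormal behavior detection, and autonomous driving. Human skeleton key point detection mainly detects certain key points of the human body, such as joints, limbs, and facial features, and then describes human posture information through the detected key points. Because the human body is highly flexible, it takes on many postures and shapes, and a small change in any body part produces a new posture; at the same time, the visibility of the key points is strongly affected by clothing, posture, and viewing angle, and is further affected by environmental factors such as occlusion, lighting, and fog.
In the related art, methods for detecting human skeleton key points are variants or improvements of the Stacked Hourglass algorithm, an algorithm proposed by Newell et al. for human pose estimation that predicts the key points of the human body by generating heat maps (heatmaps). Specifically, during detection, the Stacked Hourglass algorithm of the related art applies the same degree of deep learning to the feature map of every human skeleton key point.
The inventors recognized that, when the related art is used for human key point detection, the accuracy of the heat maps of the human skeleton key points is low, which ultimately affects the accuracy of human skeleton key point detection.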
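Summary
To overcome the problems in the related art, the present application discloses a method and apparatus for detecting human skeleton key points, an electronic device, and a storage medium, so as to improve the accuracy of the learned heat maps of multiple human skeleton key points and thereby the accuracy of human skeleton key point detection.
According to one aspect of the embodiments of the present application, a method for detecting human skeleton key points is provided, including:
acquiring an original image, where the original image includes multiple human skeleton key points;
performing skeleton key point recognition on the original image based on a pre-trained stacked hourglass network for skeleton key point detection, to obtain heat maps of the multiple human skeleton key points;
where the stacked hourglass network includes multiple hourglass networks connected in series, and each hourglass network performs deep feature learning on the feature maps of the multiple human skeleton key points based on the weight values corresponding to those feature maps.
According to a second aspect of the embodiments of the present application, an apparatus for detecting human skeleton key points is provided, including:
a data acquisition unit configured to acquire an original image, where the original image includes multiple human skeleton key points;
a heat map acquisition module configured to perform skeleton key point recognition on the original image based on a pre-trained stacked hourglass network for skeleton key point detection, to obtain heat maps of the multiple human skeleton key points;
where the stacked hourglass network includes multiple hourglass networks connected in series, and each hourglass network performs deep feature learning on the feature maps of the multiple human skeleton key points based on the weight values corresponding to those feature maps.
According to a fourth aspect of the embodiments of the present application, an electronic device is provided, including:
a processor; and
a memory for storing instructions executable by the processor;
where the processor is configured to:
acquire an original image, where the original image includes multiple human skeleton key points;
perform skeleton key point recognition on the original image based on a pre-trained stacked hourglass network for skeleton key point detection, to obtain heat maps of the multiple human skeleton key points;
where the stacked hourglass network includes multiple hourglass networks connected in series, and each hourglass network performs deep feature learning on the feature maps of the multiple human skeleton key points based on the weight values corresponding to those feature maps.
According to a fifth aspect of the embodiments of the present application, a non-transitory computer-readable storage medium is provided, which stores computer instructions that, when executed, implement the method for detecting human skeleton key points described above.
The technical solutions provided by the embodiments of the present application may have the following beneficial effects:
In the solution provided by the embodiments of the present application, a stacked hourglass network is constructed in advance, and in each hourglass network of the constructed stacked hourglass network, deep feature learning is performed on the feature maps of multiple human skeleton key points based on the weight values corresponding to those feature maps; skeleton key point recognition is then performed on the original image based on the pre-trained stacked hourglass network for skeleton key point detection, to obtain the heat maps of the multiple human skeleton key points. Considering that different feature maps influence the heat maps of different human skeleton key points differently, this solution introduces an attention mechanism into the detection process, that is, weight values are set for the feature maps of the human skeleton key points, so that the degree of deep learning differs between feature maps. The solution can therefore improve the accuracy of the learned heat maps of the multiple human skeleton key points, and thereby the accuracy of human skeleton key point detection.
In addition, in the solution provided by the embodiments of the present application, the original image is subjected in advance to multiple max-pooling operations and multiple average-pooling operations, yielding multiple max-pooled images and average-pooled images. Pooling uses the information of several neighboring pixels, so context is introduced into the extracted feature maps of the skeleton key points, which improves the accuracy of human skeleton key point detection. Each hourglass network only needs to take the corresponding max-pooled and average-pooled images and insert them, which further reduces the amount of computation and increases the computation speed.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory and do not limit the present application.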
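Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present invention and of the prior art more clearly, the drawings needed for the embodiments and the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for detecting human skeleton key points according to an exemplary embodiment;
FIG. 2 is a flowchart of a method for detecting human skeleton key points according to an exemplary embodiment;
FIG. 3a is an exemplary schematic structural diagram of an hourglass network provided by an embodiment of the present application;
FIG. 3b is an exemplary schematic diagram of the principle of a stacked hourglass network provided by an embodiment of the present application;
FIG. 4 is an exemplary schematic structural diagram of an attention mechanism module provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus for detecting human skeleton key points according to an exemplary embodiment;
FIG. 6 is a schematic structural diagram of an electronic device for performing the method for detecting human skeleton key points according to an exemplary embodiment;
FIG. 7 is a block diagram of an electronic device for performing the method for detecting human skeleton key points according to an exemplary embodiment.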
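Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
To improve the accuracy of the learned heat maps of multiple human skeleton key points and thereby the accuracy of human skeleton key point detection, the embodiments of the present application provide a method and apparatus for detecting human skeleton key points, an electronic device, and a storage medium.
The method for detecting human skeleton key points provided by the embodiments of the present application is introduced first.
The method may be executed by an apparatus for detecting human skeleton key points, which may run in an electronic device. In a specific application, the electronic device may be a server or a terminal device.
FIG. 1 is a flowchart of a method for detecting human skeleton key points according to an exemplary embodiment. As shown in FIG. 1, the method provided by the embodiments of the present application includes the following steps:
In step S101, an original image is acquired, where the original image includes multiple human skeleton key points.
In this step, a human pose frame is extracted from a file such as a video or an image and used as the original image. The original image includes multiple human skeleton key points, for example joints, limbs, and facial features. Once the key points included in the original image are known, human pose information can be described by analyzing them.
There are several ways to extract a human pose frame from a file such as a video or an image. For example, a pre-trained neural network model for extracting human pose frames may be used.
In addition, the color model of the original image may be RGB, that is, the red-green-blue three-color model, but is not limited to this. For example, the color model of the original image may also be CMYK or HSB. CMYK is a printing model whose four letters stand for cyan, magenta, yellow, and black, the four ink colors used in printing; HSB is a model whose color components are hue, saturation, and brightness.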
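In step S102, skeleton key point recognition is performed on the original image based on a pre-trained stacked hourglass network for skeleton key point detection, to obtain heat maps of the multiple human skeleton key points; the stacked hourglass network includes multiple hourglass networks connected in series, and each hourglass network performs deep feature learning on the feature maps of the multiple human skeleton key points based on the weight values corresponding to those feature maps.
It should be noted that the output of every hourglass network is a set of heat maps of the human skeleton key points. For every hourglass network other than the last one in the stack, the output heat maps serve as input to the next hourglass network; for the last hourglass network in the stack, the output heat maps are the final output for the original image.
In this embodiment, considering that different feature maps influence the heat maps of different human skeleton key points differently, each hourglass network of the stacked hourglass network used in the embodiments of the present application adds an attention mechanism relative to the prior art, that is, weight values are set for the feature maps of the human skeleton key points, so that the degree of deep learning differs between feature maps.
It can be understood that the topology of any hourglass network is symmetric, and the network structure resembles an hourglass. The number of hourglass networks in the stacked hourglass network can be set according to the actual situation and is not limited by the present application. The order of each hourglass network can likewise be set according to the actual situation and is not limited by the present application; an hourglass network is usually a second-order hourglass network, that is, a network with two mutually symmetric network structures, but is not limited to this.
In this step, the attention mechanism can be introduced into the multiple hourglass networks through SENet learning: the weight value corresponding to each feature map is obtained automatically, and the deep feature learning of the feature maps is adjusted according to the obtained weight values. The core idea of SENet is to let the network learn feature weights from the loss, so that effective feature maps receive large weights and ineffective or weakly effective feature maps receive small weights, which lets the trained model achieve better results. It should be noted that introducing the attention mechanism through SENet learning is only one specific implementation and should not be construed as limiting the embodiments of the present application; any implementation capable of obtaining the weight values corresponding to the feature maps of the multiple human skeleton key points can be applied to the present application.
In addition, because each feature map can be regarded as a feature channel, the weight value corresponding to each feature map can also be called the weight value of the feature channel corresponding to that feature map. After the heat maps of the multiple human skeleton key points are obtained, the coordinate position of each human skeleton key point can be derived from them, and those coordinate positions can then be used to analyze the human pose information in the original image.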
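One common way to read a coordinate from a heat map is to take the location of its largest value. The text does not say how the coordinates are derived, so the argmax rule below is an assumption; `keypoint_from_heatmap` and the key point names are hypothetical.

```python
def keypoint_from_heatmap(heatmap):
    """Return (row, col, score) of the largest value in a 2-D heat map."""
    score, row, col = max(
        (
            (value, r, c)
            for r, line in enumerate(heatmap)
            for c, value in enumerate(line)
        ),
        key=lambda item: item[0],  # compare by score only; first maximum wins
    )
    return row, col, score


# One small heat map per key point (made-up values).
heatmaps = {
    "left_elbow": [[0.0, 0.1, 0.0], [0.2, 0.9, 0.1], [0.0, 0.3, 0.0]],
    "right_knee": [[0.7, 0.1, 0.0], [0.2, 0.1, 0.1], [0.0, 0.3, 0.0]],
}
coords = {name: keypoint_from_heatmap(hm)[:2] for name, hm in heatmaps.items()}
print(coords)
```

If the heat maps are smaller than the original image, for instance because the network runs on a downsampled first image, the coordinates would still need to be scaled back to the original resolution.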
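In the embodiments of the present application, a stacked hourglass network is constructed in advance, and in each hourglass network of the constructed stacked hourglass network, deep feature learning is performed on the feature maps of multiple human skeleton key points based on the weight values corresponding to those feature maps; skeleton key point recognition is then performed on the original image based on the pre-trained stacked hourglass network for skeleton key point detection, to obtain the heat maps of the multiple human skeleton key points. Considering that different feature maps influence the heat maps of different human skeleton key points differently, this solution introduces an attention mechanism into the detection process, that is, weight values are set for the feature maps of the human skeleton key points, so that the degree of deep learning differs between feature maps. The solution can therefore improve the accuracy of the learned heat maps of the multiple human skeleton key points, and thereby the accuracy of human skeleton key point detection.
Exemplarily, in one specific implementation, the structure of each hourglass network may be as follows:
in each hourglass network, before each downsampling, a skip path is branched off from the convolution path; the skip path retains the feature maps of the multiple human skeleton key points at the original scale, the original scale being the scale before this downsampling;
after each upsampling, the feature maps of the multiple human skeleton key points of the convolution path are fused with the feature maps of the multiple human skeleton key points of the skip path corresponding to this upsampling; the skip path corresponding to this upsampling is the skip path branched off before the downsampling that is symmetric to this upsampling;
between two adjacent downsamplings there are multiple convolution modules and multiple attention mechanism modules;
between two adjacent upsamplings there are one or more convolution modules;
between an adjacent downsampling and upsampling there are multiple convolution modules and multiple attention mechanism modules; and
each skip path includes multiple convolution modules and multiple attention mechanism modules, arranged alternately.
For the convolution modules and attention mechanism modules between two adjacent downsamplings, there are more convolution modules than attention mechanism modules, and each attention mechanism module lies between two convolution modules. The same holds for the modules between an adjacent downsampling and upsampling, and for the modules of each skip path.
An exemplary structure of one hourglass network is shown in FIG. 3a. FIG. 3b is a schematic diagram of the principle of a stacked hourglass network with two hourglass networks connected in series; in that figure, the topology of each hourglass network is symmetric.
Moreover, in each hourglass network, the feature maps of the multiple human skeleton key points are extracted by the convolution modules;
in the attention mechanism module, a set of weight values corresponding to the feature maps of the multiple human skeleton key points is obtained through learning with the SENet algorithm.
The convolution module may have various specific structures, which the present application does not limit.
Exemplarily, in one implementation, the attention mechanism module includes a global pooling layer, multiple fully connected layers, and a non-linear activation layer. The global pooling layer (global pool) reduces the dimension of the feature maps of the multiple human skeleton key points; the fully connected (FC) layers combine the dimension-reduced feature maps into a one-dimensional vector; and the non-linear activation layer normalizes that one-dimensional vector into a feature vector. The number of fully connected layers can be set arbitrarily. Every node of a fully connected layer is connected to every node of the preceding and following layers, that is, each node of a fully connected layer is connected to all nodes of the previous layer, so as to combine the features extracted earlier. The non-linear activation layer uses the sigmoid function to normalize the values into a fixed interval. The sigmoid function is often used as the threshold function of a neural network and maps a variable to a value between 0 and 1.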
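Given that the attention mechanism module includes a global pooling layer, multiple fully connected layers, and a non-linear activation layer, exemplarily, in one specific implementation, the feature maps of the multiple human skeleton key points are transmitted in the attention mechanism module along a lower path and an upper path;
in the upper path, the feature maps of the multiple human skeleton key points are reduced in dimension by the global pooling layer;
in the upper path, through SENet learning in the multiple fully connected layers, the dimension-reduced feature maps of the multiple human skeleton key points are combined into a one-dimensional vector;
in the upper path, the one-dimensional vector is normalized into a feature vector by the non-linear activation layer; and
the feature maps of the multiple human skeleton key points in the lower path are fused with the feature vector.
For clarity, FIG. 4 shows a schematic structural diagram of the attention mechanism module. As shown in FIG. 4, in the upper path, the global pooling layer (global pool) reduces the dimension of the feature maps of the multiple human skeleton key points. In the upper path, through SENet learning in two fully connected (FC) layers, the dimension-reduced feature maps are combined into a one-dimensional vector. In the upper path, the non-linear activation layer normalizes the one-dimensional vector into a feature vector. In c×h×w, c denotes the number of channels, h the height of the feature map, and w the width of the feature map. The number of channels can be specified directly in the fully connected layers of the neural network. The feature maps of the multiple human skeleton key points in the lower path are fused with the feature vector.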
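Optionally, to reduce the amount of computation, before skeleton key point recognition is performed on the original image based on the pre-trained stacked hourglass network for skeleton key point detection to obtain the heat maps of the multiple human skeleton key points, the method may further include:
downsampling the original image multiple times to obtain a first image;
correspondingly, performing skeleton key point recognition on the original image based on the pre-trained stacked hourglass network for skeleton key point detection, to obtain the heat maps of the multiple human skeleton key points, includes:
inputting the first image to the stacked hourglass network to obtain the heat maps of the multiple human skeleton key points.
The number of downsampling operations is not limited. Exemplarily, the original image may be downsampled twice to obtain the first image; specifically, it passes through two convolution layers with a stride of 2, each of which halves the resolution of its input.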
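The halving can be checked with the usual output-size formula for a convolution, floor((size + 2 × padding − kernel) / stride) + 1. The kernel size 3 and padding 1 used here are assumptions; the text only fixes the stride at 2. `conv_output_size` is a hypothetical helper.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


side = 128
sizes = [side]
for _ in range(2):  # two stride-2 convolution layers
    side = conv_output_size(side)
    sizes.append(side)
print(sizes)
```

Each layer halves the side length of its input, so two layers reduce a 128-pixel side to 32 pixels, a quarter of the original.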
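Optionally, on the basis of the multiple downsampling steps above, before the first image is input to the stacked hourglass network to obtain the heat maps of the multiple human skeleton key points, the method further includes:
performing multiple max-pooling operations and multiple average-pooling operations on the original image;
correspondingly, inputting the first image to the stacked hourglass network to obtain the heat maps of the multiple human skeleton key points includes:
inputting the first image, the image obtained by each max pooling, and the image obtained by each average pooling to the stacked hourglass network, to obtain the heat maps of the multiple human skeleton key points.
It can be understood that pooling is used to reduce the number of features and parameters; specifically, the feature points within a neighborhood are aggregated into a new feature. Average pooling takes the mean of the feature points in the neighborhood, and max pooling takes their maximum.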
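The two pooling rules can be illustrated on a small 4×4 array. The 2×2 window with a stride of 2 is an assumption, and `pool2x2` and `mean` are hypothetical helpers.

```python
def pool2x2(image, reduce_fn):
    """Apply `reduce_fn` to every non-overlapping 2x2 block of a 2-D list."""
    height, width = len(image), len(image[0])
    return [
        [
            reduce_fn([image[r][c], image[r][c + 1],
                       image[r + 1][c], image[r + 1][c + 1]])
            for c in range(0, width - 1, 2)
        ]
        for r in range(0, height - 1, 2)
    ]


def mean(values):
    return sum(values) / len(values)


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
max_pooled = pool2x2(image, max)    # largest value of each neighborhood
avg_pooled = pool2x2(image, mean)   # mean value of each neighborhood
print(max_pooled)
print(avg_pooled)
```

Both results are 2×2, half the side length of the input, and every output value summarizes four neighboring input values.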
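In this embodiment, the original image is subjected in advance to multiple max-pooling operations and multiple average-pooling operations, yielding multiple max-pooled images and average-pooled images. Pooling uses the information of several neighboring pixels, so context is introduced into the extracted feature maps of the multiple human skeleton key points, which improves the accuracy of human skeleton key point detection.
Optionally, in one implementation, the stacked hourglass network includes a first hourglass network and a second hourglass network, both of which are second-order hourglass networks;
performing multiple max-pooling operations and multiple average-pooling operations on the original image includes:
performing max pooling and average pooling on the original image to obtain a first max-pooled image and a first average-pooled image;
performing max pooling and average pooling on the first max-pooled image and the first average-pooled image, respectively, to obtain a second max-pooled image and a second average-pooled image;
performing max pooling and average pooling on the second max-pooled image and the second average-pooled image, respectively, to obtain a third max-pooled image and a third average-pooled image; and
performing max pooling and average pooling on the third max-pooled image and the third average-pooled image, respectively, to obtain a fourth max-pooled image and a fourth average-pooled image; correspondingly,
inputting the first image, the image obtained by each max pooling, and the image obtained by each average pooling to the stacked hourglass network, to obtain the heat maps of the multiple human skeleton key points, includes:
inputting the first image, the first max-pooled image, and the first average-pooled image to the first hourglass network;
inputting the heat map output by the first hourglass network to the second hourglass network, while feeding the first image, the second max-pooled image, and the second average-pooled image into the second hourglass network;
and, after each hourglass network performs its first downsampling, inserting the third max-pooled image and the third average-pooled image into the convolution path of that hourglass network;
after each hourglass network performs its second downsampling, inserting the fourth max-pooled image and the fourth average-pooled image into the convolution path of that hourglass network.
In this implementation, each hourglass network only needs to take the corresponding max-pooled and average-pooled images and insert them, which further reduces the amount of computation and increases the computation speed.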
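For clarity, the method for detecting human skeleton key points provided by the embodiments of the present application is described below with reference to a specific embodiment.
In this specific embodiment, the pre-trained stacked hourglass network includes a first hourglass network and a second hourglass network connected in series.
FIG. 2 is a flowchart of a method for detecting human skeleton key points according to an exemplary embodiment. The steps include:
S201: acquire an original image, where the original image includes multiple human skeleton key points.
S202: downsample the original image multiple times to obtain a first image.
S203: perform multiple max-pooling operations and multiple average-pooling operations on the original image to obtain four max-pooled images and four average-pooled images.
The four max-pooled images are the first, second, third, and fourth max-pooled images; the four average-pooled images are the first, second, third, and fourth average-pooled images.
S204: input the first image, the second max-pooled image, and the second average-pooled image to the first hourglass network of the pre-trained stacked hourglass network, and input the heat map output by the first hourglass network to the second hourglass network while feeding the first image, the second max-pooled image, and the second average-pooled image into the second hourglass network, to obtain the heat maps of the multiple human skeleton key points as the output of the stacked hourglass network.
In the stacked hourglass network, the heat maps of the multiple human skeleton key points are obtained based on the deep feature learning of the convolution modules and attention mechanism modules of the multiple hourglass networks.
Step S201 is the same as S101 in FIG. 1 and is not described again here.
In step S202, the original image is downsampled twice to obtain the first image I1. Specifically, it passes through two convolution layers with a stride of 2, each of which halves the resolution of its input, that is, reduces it from m×n to (m/2)×(n/2).
In step S203, multiple max-pooling operations and multiple average-pooling operations are performed on the original image. Specifically, max pooling and average pooling are performed on the original image to obtain the first max-pooled image and the first average-pooled image; max pooling and average pooling are performed on the first max-pooled image and the first average-pooled image, respectively, to obtain the second max-pooled image and the second average-pooled image; max pooling and average pooling are performed on the second max-pooled image and the second average-pooled image, respectively, to obtain the third max-pooled image and the third average-pooled image; and max pooling and average pooling are performed on the third max-pooled image and the third average-pooled image, respectively, to obtain the fourth max-pooled image and the fourth average-pooled image. Pooling the first max-pooled image and the first average-pooled image amounts to applying max pooling and average pooling to the original image twice; similarly, pooling the second max-pooled image and the second average-pooled image amounts to applying them three times, and pooling the third max-pooled image and the third average-pooled image amounts to applying them four times.
To make the specific implementation of the multiple max-pooling and average-pooling operations on the original image easier to understand, an example follows: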
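Suppose the size of the original image is 3×128×128, where 3 is the number of channels of an RGB image (R for red, G for green, B for blue) and 128×128 is the number of pixels of the RGB image. Applying max pooling and average pooling once to the original image then gives a first max-pooled image AM1 and a first average-pooled image AE1 of size 3×64×64. Applying max pooling and average pooling to the first max-pooled image and the first average-pooled image, respectively, gives a second max-pooled image AM2 and a second average-pooled image AE2 of size 3×32×32. Applying max pooling and average pooling to the second max-pooled image and the second average-pooled image, respectively, gives a third max-pooled image AM3 and a third average-pooled image AE3 of size 3×16×16. Applying max pooling and average pooling to the third max-pooled image and the third average-pooled image, respectively, gives a fourth max-pooled image AM4 and a fourth average-pooled image AE4 of size 3×8×8.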
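In step S204, the first hourglass network and the second hourglass network in the pre-established stacked hourglass network have the same network structure. FIG. 3a shows an example of the network structure of the first or second hourglass network. The structure of an hourglass network in the stacked hourglass network is described below with reference to FIG. 3a. In each hourglass network, before each downsampling, a skip path is branched off from the convolution path to retain the feature maps of the multiple human skeleton key points at the original scale, that is, the scale of the feature maps before the downsampling. After the first downsampling, the third max-pooled image AM3 and the third average-pooled image AE3 are inserted into the convolution path. After the second downsampling, the fourth max-pooled image AM4 and the fourth average-pooled image AE4 are inserted into the convolution path. After each upsampling, the feature maps of the multiple human skeleton key points of the convolution path are fused with the feature maps of the skip path at the previous scale. Between two downsamplings there are three convolution modules and two attention mechanism modules, arranged alternately; the first convolution module has M input channels and N output channels, and the other two convolution modules have N input channels and N output channels. Between two upsamplings there is one convolution module with N input channels and N output channels. Between a downsampling and an upsampling there are four convolution modules and two attention mechanism modules, the two attention mechanism modules alternating with the first three convolution modules; the first convolution module has M input channels and N output channels, and the other three convolution modules have N input channels and N output channels. Each skip path includes four convolution modules and two attention mechanism modules, the two attention mechanism modules alternating with the first three convolution modules; the first convolution module has M input channels and N output channels, and the other three convolution modules have N input channels and N output channels.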
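In addition, FIG. 3b shows an example of the principle by which a stacked hourglass network including a first hourglass network and a second hourglass network generates heat maps. The process of obtaining the heat maps of the multiple human skeleton key points in step S204 is described below with reference to FIG. 3b:
The first image I1, the second max-pooled image AM2, and the second average-pooled image AE2 are input to the first hourglass network. The heat map O1 output by the first hourglass network is input to the second hourglass network; at the same time, the first image I1, the second max-pooled image AM2, and the second average-pooled image AE2 are fed into the second hourglass network. The output of the first hourglass network is the heat map O1 corresponding to the multiple human skeleton key points; during training of the stacked hourglass network, the first hourglass network compares the heat map O1 with the ground truth to produce a loss, which is back-propagated.
Based on the deep feature learning of the convolution modules and attention mechanism modules in the first and second hourglass networks, the heat maps of the multiple human skeleton key points are obtained. Each heat map corresponds to one human skeleton key point.
According to the embodiments of the present application, the original image is subjected in advance to multiple max-pooling operations and multiple average-pooling operations, yielding multiple max-pooled images and average-pooled images. Pooling uses the information of several neighboring pixels, so context is introduced into the extracted feature maps of the skeleton key points, which improves the accuracy of human skeleton key point detection. Each hourglass network only needs to take the corresponding max-pooled and average-pooled images and insert them, which further reduces the amount of computation and increases the computation speed.
At the same time, the heat map produced by the first hourglass network is used as an input of the next hourglass network, so that the second hourglass network can use the relationships between joints learned by the first hourglass network; this enriches the input of the second hourglass network and further improves the accuracy of human skeleton key point detection.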
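In addition, the attention mechanism module includes a global pooling layer, two fully connected layers, and a non-linear activation layer. FIG. 4 shows an example of the structure of the attention mechanism module. The process of obtaining the heat maps of the multiple human skeleton key points based on the deep feature learning of the convolution modules and attention mechanism modules in the first and second hourglass networks is described below with reference to FIG. 4. In the first and second hourglass networks, the feature maps of the multiple human skeleton key points are extracted by the convolution modules. In the attention mechanism module, learning with the SENet algorithm yields a set of weights corresponding to the feature maps of the multiple human skeleton key points, that is, the importance of each feature map. This set of weights then lets the network focus on the features with large weights.
Specifically, in the attention mechanism module, the feature maps of the multiple human skeleton key points are transmitted along a lower path and an upper path. In the upper path, the global pooling layer (global pool) reduces the dimension of the feature maps. In the upper path, through SENet learning in the two fully connected (FC) layers, the dimension-reduced feature maps are combined into a one-dimensional vector. In the upper path, the non-linear activation layer normalizes the one-dimensional vector into a feature vector. In c×h×w, c denotes the number of channels, h the height of the feature map, and w the width of the feature map. The number of channels can be specified directly in the fully connected layers of the neural network. The feature maps of the multiple human skeleton key points in the lower path are fused with the feature vector.
According to the embodiments of the present application, in the attention mechanism module, learning with the SENet algorithm yields a set of weights corresponding to the feature maps of the multiple human skeleton key points, that is, the importance of each feature map. This set of weights then lets the network focus on the features with large weights. Introducing an attention mechanism into the detection of human skeleton key points improves the accuracy of the learned feature maps of the multiple human skeleton key points, and thereby the accuracy of human skeleton key point detection.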
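FIG. 5 is a block diagram of an apparatus for detecting human skeleton key points according to an exemplary embodiment. As shown in FIG. 5, the apparatus provided by the embodiments of the present application may include:
a data acquisition unit 510 configured to acquire an original image, where the original image includes multiple human skeleton key points;
a heat map generating unit 520 configured to perform skeleton key point recognition on the original image based on a pre-trained stacked hourglass network for skeleton key point detection, to obtain heat maps of the multiple human skeleton key points;
where the stacked hourglass network includes multiple hourglass networks connected in series, and each hourglass network performs deep feature learning on the feature maps of the multiple human skeleton key points based on the weight values corresponding to those feature maps.
Optionally, in one implementation, in each hourglass network, before each downsampling, a skip path is branched off from the convolution path; the skip path retains the feature maps of the multiple human skeleton key points at the original scale, the original scale being the scale before this downsampling;
after each upsampling, the feature maps of the multiple human skeleton key points of the convolution path are fused with the feature maps of the multiple human skeleton key points of the skip path corresponding to this upsampling; the skip path corresponding to this upsampling is the skip path branched off before the downsampling that is symmetric to this upsampling;
between two adjacent downsamplings there are multiple convolution modules and multiple attention mechanism modules;
between two adjacent upsamplings there are one or more convolution modules;
between an adjacent downsampling and upsampling there are multiple convolution modules and multiple attention mechanism modules; and each skip path includes multiple convolution modules and multiple attention mechanism modules.
Optionally, in one implementation, in each hourglass network, the feature maps of the multiple human skeleton key points are extracted by the convolution modules;
in the attention mechanism module, a set of weight values corresponding to the feature maps of the multiple human skeleton key points is obtained through learning with the SENet algorithm.
Optionally, in one implementation, the attention mechanism module includes a global pooling layer, multiple fully connected layers, and a non-linear activation layer;
correspondingly, in the attention mechanism module, the feature maps of the multiple human skeleton key points are transmitted along a lower path and an upper path;
in the upper path, the feature maps of the multiple human skeleton key points are reduced in dimension by the global pooling layer;
in the upper path, through SENet learning in the multiple fully connected layers, the dimension-reduced feature maps of the multiple human skeleton key points are combined into a one-dimensional vector;
in the upper path, the one-dimensional vector is normalized into a feature vector by the non-linear activation layer; and the feature maps of the multiple human skeleton key points in the lower path are fused with the feature vector.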
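Optionally, in one implementation, the apparatus further includes:
a downsampling unit, configured to downsample the original image multiple times to obtain a first image, before the heat map generation unit performs skeleton key point recognition on the original image based on the pre-trained stacked hourglass network for skeleton key point detection to obtain the heat maps of the plurality of human skeleton key points;
correspondingly, the heat map generation unit is specifically configured to input the first image into the stacked hourglass network to obtain the heat maps of the plurality of human skeleton key points.
Optionally, the apparatus further includes a pooling unit, configured to perform multiple rounds of max-pooling sampling and multiple rounds of average-pooling sampling on the original image respectively, before the heat map generation unit inputs the first image into the stacked hourglass network to obtain the heat maps of the plurality of human skeleton key points;
correspondingly, the heat map generation unit is specifically configured to input the first image, the image obtained from each max-pooling sampling and the image obtained from each average-pooling sampling into the stacked hourglass network, to obtain the heat maps of the plurality of human skeleton key points.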
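Optionally, the stacked hourglass network includes a first hourglass network and a second hourglass network, and the pooling unit is specifically configured to:
perform max-pooling sampling and average-pooling sampling on the original image respectively, to obtain a first max-pooled image and a first average-pooled image;
perform max-pooling sampling and average-pooling sampling on the first max-pooled image and the first average-pooled image respectively, to obtain a second max-pooled image and a second average-pooled image;
perform max-pooling sampling and average-pooling sampling on the second max-pooled image and the second average-pooled image respectively, to obtain a third max-pooled image and a third average-pooled image; and
perform max-pooling sampling and average-pooling sampling on the third max-pooled image and the third average-pooled image respectively, to obtain a fourth max-pooled image and a fourth average-pooled image;
correspondingly, the heat map generation unit is specifically configured to:
input the first image, the first max-pooled image and the first average-pooled image into the first hourglass network;
input the heat map output by the first hourglass network into the second hourglass network, and at the same time feed the first image, the second max-pooled image and the second average-pooled image into the second hourglass network;
and, after each hourglass network performs the first downsampling, insert the third max-pooled image and the third average-pooled image into the convolution path of that hourglass network; and after each hourglass network performs the second downsampling, insert the fourth max-pooled image and the fourth average-pooled image into the convolution path of that hourglass network.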
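Corresponding to the above method embodiments, an embodiment of the present application further provides an electronic device for the above method for detecting human skeleton key points. The electronic device includes: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform any of the methods for detecting human skeleton key points provided by the embodiments of the present application.
FIG. 6 is a block diagram of an electronic device 600 for the above method for detecting human skeleton key points, according to an exemplary embodiment. For example, the electronic device 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, or the like.
Referring to FIG. 6, the electronic device 600 may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.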
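The processing component 602 generally controls the overall operation of the electronic device 600, such as operations associated with display, telephone calls, data communication, camera operation and recording. The processing component 602 may include one or more processors 620 to execute instructions, so as to complete all or part of the steps of the method for detecting human skeleton key points described in the embodiments of the present application. In addition, the processing component 602 may include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 may include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operation of the electronic device 600. Examples of such data include instructions for any application or method operating on the electronic device 600, contact data, phone book data, messages, pictures, videos, and so on. The memory 604 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disk.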
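The power component 606 supplies power to the various components of the electronic device 600. The power component 606 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the electronic device 600.
The multimedia component 608 includes a screen that provides an output interface between the electronic device 600 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 608 includes a front camera and/or a rear camera. When the electronic device 600 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 600 is in an operating mode, such as a call mode, a recording mode or a voice recognition mode. The received audio signals may be further stored in the memory 604 or transmitted via the communication component 616. In some embodiments, the audio component 610 also includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. The buttons may include, but are not limited to, a home button, a volume button, a start button and a lock button.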
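The sensor component 614 includes one or more sensors for providing status assessments of various aspects of the electronic device 600. For example, the sensor component 614 can detect the on/off state of the electronic device 600 and the relative positioning of components, such as the display and the keypad of the electronic device 600. The sensor component 614 can also detect a change in position of the electronic device 600 or of one of its components, the presence or absence of user contact with the electronic device 600, the orientation or acceleration/deceleration of the electronic device 600, and temperature changes of the electronic device 600. The sensor component 614 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 616 is configured to facilitate wired or wireless communication between the electronic device 600 and other devices. The electronic device 600 can access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G or 5G), or a combination thereof. In an exemplary embodiment, the communication component 616 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In an exemplary embodiment, the electronic device 600 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the above method.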
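FIG. 7 is a block diagram of an electronic device 700 for the above method for detecting human skeleton key points, according to an exemplary embodiment. For example, the electronic device 700 may be provided as a server. Referring to FIG. 7, the electronic device 700 includes a processing component 722, which further includes one or more processors, and memory resources represented by a memory 732 for storing instructions executable by the processing component 722, such as application programs. The application programs stored in the memory 732 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 722 is configured to execute the instructions so as to perform the above method for detecting human skeleton key points.
The electronic device 700 may also include a power component 726 configured to perform power management of the electronic device 700, a wired or wireless network interface 750 configured to connect the electronic device 700 to a network, and an input/output (I/O) interface 758. The electronic device 700 can operate based on an operating system stored in the memory 732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
Corresponding to the above method embodiments, an embodiment of the present application further provides a non-transitory computer-readable storage medium storing computer instructions which, when executed, implement the method for detecting human skeleton key points provided by the embodiments of the present application. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.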

Claims (22)

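  1. A method for detecting human skeleton key points, comprising:
    acquiring an original image, wherein the original image includes a plurality of human skeleton key points;
    performing skeleton key point recognition on the original image based on a pre-trained stacked hourglass network for skeleton key point detection, to obtain heat maps of the plurality of human skeleton key points;
    wherein the stacked hourglass network comprises a plurality of hourglass networks connected in series, and each hourglass network performs deep feature learning on feature maps of the plurality of human skeleton key points based on weight values corresponding to the feature maps of the plurality of human skeleton key points.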
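  2. The method for detecting human skeleton key points according to claim 1, wherein in each hourglass network, before each downsampling, a skip path is branched off from a convolution path, the skip path being used to retain the feature maps of the plurality of human skeleton key points at an original scale, the original scale being the scale before the current downsampling;
    after each upsampling, the feature maps of the plurality of human skeleton key points on the convolution path are fused with the feature maps of the plurality of human skeleton key points on the skip path corresponding to the current upsampling, the skip path corresponding to the current upsampling being the skip path branched off before the downsampling that is symmetric to the current upsampling;
    a plurality of convolution modules and a plurality of attention mechanism modules are included between two adjacent downsamplings;
    one or more convolution modules are included between two adjacent upsamplings;
    a plurality of convolution modules and a plurality of attention mechanism modules are included between an adjacent downsampling and upsampling; and
    each skip path includes a plurality of convolution modules and a plurality of attention mechanism modules.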
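  3. The method for detecting human skeleton key points according to claim 2, wherein in each hourglass network, the feature maps of the plurality of human skeleton key points are extracted by the respective convolution modules;
    in the attention mechanism module, a set of weight values corresponding to the feature maps of the plurality of human skeleton key points is learned by the SENet algorithm.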
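  4. The method for detecting human skeleton key points according to claim 3, wherein the attention mechanism module includes a global pooling layer, a plurality of fully connected layers and a nonlinear activation layer;
    correspondingly, in the attention mechanism module, the feature maps of the plurality of human skeleton key points are split and transmitted along a lower path network and an upper path network;
    in the upper path network, the global pooling layer reduces the dimensionality of the feature maps of the plurality of human skeleton key points;
    in the upper path network, the plurality of fully connected layers, learned with the SENet algorithm, combine the dimension-reduced feature maps of the plurality of human skeleton key points into a one-dimensional vector;
    in the upper path network, the nonlinear activation layer normalizes the one-dimensional vector into a feature vector; and
    the feature maps of the plurality of human skeleton key points in the lower path network are fused with the feature vector.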
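  5. The method for detecting human skeleton key points according to claim 2, wherein before the performing of skeleton key point recognition on the original image based on the pre-trained stacked hourglass network for skeleton key point detection to obtain the heat maps of the plurality of human skeleton key points, the method further comprises:
    downsampling the original image multiple times to obtain a first image;
    and the performing of skeleton key point recognition on the original image based on the pre-trained stacked hourglass network for skeleton key point detection to obtain the heat maps of the plurality of human skeleton key points comprises:
    inputting the first image into the stacked hourglass network to obtain the heat maps of the plurality of human skeleton key points.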
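  6. The method for detecting human skeleton key points according to claim 5, wherein before the inputting of the first image into the stacked hourglass network to obtain the heat maps of the plurality of human skeleton key points, the method further comprises:
    performing multiple rounds of max-pooling sampling and multiple rounds of average-pooling sampling on the original image respectively;
    and the inputting of the first image into the stacked hourglass network to obtain the heat maps of the plurality of human skeleton key points comprises:
    inputting the first image, the image obtained from each max-pooling sampling and the image obtained from each average-pooling sampling into the stacked hourglass network, to obtain the heat maps of the plurality of human skeleton key points.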
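  7. The method for detecting human skeleton key points according to claim 6, wherein the stacked hourglass network comprises a first hourglass network and a second hourglass network;
    the performing of multiple rounds of max-pooling sampling and multiple rounds of average-pooling sampling on the original image respectively comprises:
    performing max-pooling sampling and average-pooling sampling on the original image respectively, to obtain a first max-pooled image and a first average-pooled image;
    performing max-pooling sampling and average-pooling sampling on the first max-pooled image and the first average-pooled image respectively, to obtain a second max-pooled image and a second average-pooled image;
    performing max-pooling sampling and average-pooling sampling on the second max-pooled image and the second average-pooled image respectively, to obtain a third max-pooled image and a third average-pooled image; and
    performing max-pooling sampling and average-pooling sampling on the third max-pooled image and the third average-pooled image respectively, to obtain a fourth max-pooled image and a fourth average-pooled image; correspondingly,
    the inputting of the first image, the image obtained from each max-pooling sampling and the image obtained from each average-pooling sampling into the stacked hourglass network to obtain the heat maps of the plurality of human skeleton key points comprises:
    inputting the first image, the first max-pooled image and the first average-pooled image into the first hourglass network;
    inputting the heat map output by the first hourglass network into the second hourglass network, and at the same time feeding the first image, the second max-pooled image and the second average-pooled image into the second hourglass network;
    and, after each hourglass network performs the first downsampling, inserting the third max-pooled image and the third average-pooled image into the convolution path of that hourglass network;
    after each hourglass network performs the second downsampling, inserting the fourth max-pooled image and the fourth average-pooled image into the convolution path of that hourglass network.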
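  8. An apparatus for detecting human skeleton key points, comprising:
    a data acquisition unit, configured to acquire an original image, wherein the original image includes a plurality of human skeleton key points;
    a heat map generation unit, configured to perform skeleton key point recognition on the original image based on a pre-trained stacked hourglass network for skeleton key point detection, to obtain heat maps of the plurality of human skeleton key points;
    wherein the stacked hourglass network comprises a plurality of hourglass networks connected in series, and each hourglass network performs deep feature learning on feature maps of the plurality of human skeleton key points based on weight values corresponding to the feature maps of the plurality of human skeleton key points.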
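  9. The apparatus for detecting human skeleton key points according to claim 8, wherein in each hourglass network, before each downsampling, a skip path is branched off from a convolution path, the skip path being used to retain the feature maps of the plurality of human skeleton key points at an original scale, the original scale being the scale before the current downsampling;
    after each upsampling, the feature maps of the plurality of human skeleton key points on the convolution path are fused with the feature maps of the plurality of human skeleton key points on the skip path corresponding to the current upsampling, the skip path corresponding to the current upsampling being the skip path branched off before the downsampling that is symmetric to the current upsampling;
    a plurality of convolution modules and a plurality of attention mechanism modules are included between two adjacent downsamplings;
    one or more convolution modules are included between two adjacent upsamplings;
    a plurality of convolution modules and a plurality of attention mechanism modules are included between an adjacent downsampling and upsampling; and
    each skip path includes a plurality of convolution modules and a plurality of attention mechanism modules.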
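  10. The apparatus for detecting human skeleton key points according to claim 9, wherein in each hourglass network, the feature maps of the plurality of human skeleton key points are extracted by the respective convolution modules;
    in the attention mechanism module, a set of weight values corresponding to the feature maps of the plurality of human skeleton key points is learned by the SENet algorithm.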
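  11. The apparatus for detecting human skeleton key points according to claim 10, wherein the attention mechanism module includes a global pooling layer, a plurality of fully connected layers and a nonlinear activation layer;
    correspondingly, in the attention mechanism module, the feature maps of the plurality of human skeleton key points are split and transmitted along a lower path network and an upper path network;
    in the upper path network, the global pooling layer reduces the dimensionality of the feature maps of the plurality of human skeleton key points;
    in the upper path network, the plurality of fully connected layers, learned with the SENet algorithm, combine the dimension-reduced feature maps of the plurality of human skeleton key points into a one-dimensional vector;
    in the upper path network, the nonlinear activation layer normalizes the one-dimensional vector into a feature vector; and
    the feature maps of the plurality of human skeleton key points in the lower path network are fused with the feature vector.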
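  12. The apparatus for detecting human skeleton key points according to claim 9, further comprising:
    a downsampling unit, configured to downsample the original image multiple times to obtain a first image, before the heat map generation unit performs skeleton key point recognition on the original image based on the pre-trained stacked hourglass network for skeleton key point detection to obtain the heat maps of the plurality of human skeleton key points;
    correspondingly, the heat map generation unit is specifically configured to:
    input the first image into the stacked hourglass network to obtain the heat maps of the plurality of human skeleton key points.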
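  13. The apparatus for detecting human skeleton key points according to claim 12, further comprising:
    a pooling unit, configured to perform multiple rounds of max-pooling sampling and multiple rounds of average-pooling sampling on the original image respectively, before the heat map generation unit inputs the first image into the stacked hourglass network to obtain the heat maps of the plurality of human skeleton key points;
    correspondingly, the heat map generation unit is specifically configured to input the first image, the image obtained from each max-pooling sampling and the image obtained from each average-pooling sampling into the stacked hourglass network, to obtain the heat maps of the plurality of human skeleton key points.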
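  14. The apparatus for detecting human skeleton key points according to claim 13, wherein the stacked hourglass network comprises a first hourglass network and a second hourglass network;
    the pooling unit is specifically configured to:
    perform max-pooling sampling and average-pooling sampling on the original image respectively, to obtain a first max-pooled image and a first average-pooled image;
    perform max-pooling sampling and average-pooling sampling on the first max-pooled image and the first average-pooled image respectively, to obtain a second max-pooled image and a second average-pooled image;
    perform max-pooling sampling and average-pooling sampling on the second max-pooled image and the second average-pooled image respectively, to obtain a third max-pooled image and a third average-pooled image; and
    perform max-pooling sampling and average-pooling sampling on the third max-pooled image and the third average-pooled image respectively, to obtain a fourth max-pooled image and a fourth average-pooled image;
    correspondingly, the heat map generation unit is specifically configured to:
    input the first image, the first max-pooled image and the first average-pooled image into the first hourglass network;
    input the heat map output by the first hourglass network into the second hourglass network, and at the same time feed the first image, the second max-pooled image and the second average-pooled image into the second hourglass network;
    and, after each hourglass network performs the first downsampling, insert the third max-pooled image and the third average-pooled image into the convolution path of that hourglass network;
    after each hourglass network performs the second downsampling, insert the fourth max-pooled image and the fourth average-pooled image into the convolution path of that hourglass network.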
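  15. An electronic device, comprising:
    a processor;
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to:
    acquire an original image, wherein the original image includes a plurality of human skeleton key points;
    perform skeleton key point recognition on the original image based on a pre-trained stacked hourglass network for skeleton key point detection, to obtain heat maps of the plurality of human skeleton key points;
    wherein the stacked hourglass network comprises a plurality of hourglass networks connected in series, and each hourglass network performs deep feature learning on feature maps of the plurality of human skeleton key points based on weight values corresponding to the feature maps of the plurality of human skeleton key points.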
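  16. The electronic device according to claim 15, wherein in each hourglass network, before each downsampling, a skip path is branched off from a convolution path, the skip path being used to retain the feature maps of the plurality of human skeleton key points at an original scale, the original scale being the scale before the current downsampling;
    after each upsampling, the feature maps of the plurality of human skeleton key points on the convolution path are fused with the feature maps of the plurality of human skeleton key points on the skip path corresponding to the current upsampling, the skip path corresponding to the current upsampling being the skip path branched off before the downsampling that is symmetric to the current upsampling;
    a plurality of convolution modules and a plurality of attention mechanism modules are included between two adjacent downsamplings;
    one or more convolution modules are included between two adjacent upsamplings;
    a plurality of convolution modules and a plurality of attention mechanism modules are included between an adjacent downsampling and upsampling; and
    each skip path includes a plurality of convolution modules and a plurality of attention mechanism modules.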
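  17. The electronic device according to claim 16, wherein in each hourglass network, the feature maps of the plurality of human skeleton key points are extracted by the respective convolution modules;
    in the attention mechanism module, a set of weight values corresponding to the feature maps of the plurality of human skeleton key points is learned by the SENet algorithm.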
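  18. The electronic device according to claim 17, wherein the attention mechanism module includes a global pooling layer, a plurality of fully connected layers and a nonlinear activation layer;
    correspondingly, in the attention mechanism module, the feature maps of the plurality of human skeleton key points are split and transmitted along a lower path network and an upper path network;
    in the upper path network, the global pooling layer reduces the dimensionality of the feature maps of the plurality of human skeleton key points;
    in the upper path network, the plurality of fully connected layers, learned with the SENet algorithm, combine the dimension-reduced feature maps of the plurality of human skeleton key points into a one-dimensional vector;
    in the upper path network, the nonlinear activation layer normalizes the one-dimensional vector into a feature vector; and
    the feature maps of the plurality of human skeleton key points in the lower path network are fused with the feature vector.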
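  19. The electronic device according to claim 16, wherein the processor is further configured to downsample the original image multiple times to obtain a first image, before performing skeleton key point recognition on the original image based on the pre-trained stacked hourglass network for skeleton key point detection to obtain the heat maps of the plurality of human skeleton key points;
    the processor performing skeleton key point recognition on the original image based on the pre-trained stacked hourglass network for skeleton key point detection to obtain the heat maps of the plurality of human skeleton key points comprises:
    inputting the first image into the stacked hourglass network to obtain the heat maps of the plurality of human skeleton key points.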
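  20. The electronic device according to claim 19, wherein the processor is further configured to perform multiple rounds of max-pooling sampling and multiple rounds of average-pooling sampling on the original image respectively, before inputting the first image into the stacked hourglass network to obtain the heat maps of the plurality of human skeleton key points;
    the processor inputting the first image into the stacked hourglass network to obtain the heat maps of the plurality of human skeleton key points comprises:
    inputting the first image, the image obtained from each max-pooling sampling and the image obtained from each average-pooling sampling into the stacked hourglass network, to obtain the heat maps of the plurality of human skeleton key points.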
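  21. The electronic device according to claim 20, wherein the stacked hourglass network comprises a first hourglass network and a second hourglass network;
    the processor performing multiple rounds of max-pooling sampling and multiple rounds of average-pooling sampling on the original image respectively comprises:
    performing max-pooling sampling and average-pooling sampling on the original image respectively, to obtain a first max-pooled image and a first average-pooled image;
    performing max-pooling sampling and average-pooling sampling on the first max-pooled image and the first average-pooled image respectively, to obtain a second max-pooled image and a second average-pooled image;
    performing max-pooling sampling and average-pooling sampling on the second max-pooled image and the second average-pooled image respectively, to obtain a third max-pooled image and a third average-pooled image; and
    performing max-pooling sampling and average-pooling sampling on the third max-pooled image and the third average-pooled image respectively, to obtain a fourth max-pooled image and a fourth average-pooled image;
    correspondingly, the processor inputting the first image, the image obtained from each max-pooling sampling and the image obtained from each average-pooling sampling into the stacked hourglass network to obtain the heat maps of the plurality of human skeleton key points comprises:
    inputting the first image, the first max-pooled image and the first average-pooled image into the first hourglass network;
    inputting the heat map output by the first hourglass network into the second hourglass network, and at the same time feeding the first image, the second max-pooled image and the second average-pooled image into the second hourglass network;
    and, after each hourglass network performs the first downsampling, inserting the third max-pooled image and the third average-pooled image into the convolution path of that hourglass network;
    after each hourglass network performs the second downsampling, inserting the fourth max-pooled image and the fourth average-pooled image into the convolution path of that hourglass network.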
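  22. A non-transitory computer-readable storage medium storing computer instructions which, when executed, implement the method for detecting human skeleton key points according to any one of claims 1 to 7.
PCT/CN2019/110582 2018-11-07 2019-10-11 Method and apparatus for detecting human skeleton key points, electronic device and storage medium WO2020093837A1 (zh)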

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/085,214 US11373426B2 (en) 2018-11-07 2020-10-30 Method for detecting key points in skeleton, apparatus, electronic device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811319932.7 2018-11-07
CN201811319932.7A CN109670397B (zh) 2018-11-07 2018-11-07 Method and apparatus for detecting human skeleton key points, electronic device and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/085,214 Continuation US11373426B2 (en) 2018-11-07 2020-10-30 Method for detecting key points in skeleton, apparatus, electronic device and storage medium

Publications (1)

Publication Number Publication Date
WO2020093837A1 true WO2020093837A1 (zh) 2020-05-14

Family

ID=66142071

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/110582 WO2020093837A1 (zh) 2018-11-07 2019-10-11 Method and apparatus for detecting human skeleton key points, electronic device and storage medium

Country Status (3)

Country Link
US (1) US11373426B2 (zh)
CN (1) CN109670397B (zh)
WO (1) WO2020093837A1 (zh)

Families Citing this family (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670397B (zh) 2018-11-07 2020-10-30 北京达佳互联信息技术有限公司 人体骨骼关键点的检测方法、装置、电子设备及存储介质
CN110084180A (zh) * 2019-04-24 2019-08-02 北京达佳互联信息技术有限公司 关键点检测方法、装置、电子设备及可读存储介质
CN110348335B (zh) * 2019-06-25 2022-07-12 平安科技(深圳)有限公司 行为识别的方法、装置、终端设备及存储介质
CN110348412B (zh) * 2019-07-16 2022-03-04 广州图普网络科技有限公司 一种关键点定位方法、装置、电子设备及存储介质
CN110532891B (zh) * 2019-08-05 2022-04-05 北京地平线机器人技术研发有限公司 目标对象状态识别方法、装置、介质和设备
WO2021056134A1 (en) * 2019-09-23 2021-04-01 Intel Corporation Scene retrieval for computer vision
CN110738654B (zh) * 2019-10-18 2022-07-15 中国科学技术大学 髋关节影像中的关键点提取及骨龄预测方法
CN110895809B (zh) * 2019-10-18 2022-07-15 中国科学技术大学 准确提取髋关节影像中关键点的方法
CN111414823B (zh) * 2020-03-12 2023-09-12 Oppo广东移动通信有限公司 人体特征点的检测方法、装置、电子设备以及存储介质
CN111753643B (zh) * 2020-05-09 2024-05-14 北京迈格威科技有限公司 人物姿态识别方法、装置、计算机设备和存储介质
CN111899235A (zh) * 2020-07-21 2020-11-06 北京灵汐科技有限公司 图像检测方法、装置、电子设备和存储介质
CN112417991B (zh) * 2020-11-02 2022-04-29 武汉大学 基于沙漏胶囊网络的双注意力人脸对齐方法
US11948271B2 (en) * 2020-12-23 2024-04-02 Netflix, Inc. Machine learning techniques for video downsampling
CN113033581B (zh) * 2021-05-07 2024-02-23 刘慧烨 髋关节图像中骨骼解剖关键点定位方法、电子设备及介质
CN113192043B (zh) * 2021-05-13 2022-07-01 杭州健培科技有限公司 基于多尺度拓扑图的医学关键点检测方法、装置及应用
CN113420604B (zh) * 2021-05-28 2023-04-18 沈春华 多人姿态估计方法、装置和电子设备
TWI828174B (zh) * 2021-06-04 2024-01-01 虹映科技股份有限公司 動態訓練動作與靜態訓練動作的偵測方法及裝置
CN113569756B (zh) * 2021-07-29 2023-06-09 西安交通大学 异常行为检测与定位方法、系统、终端设备及可读存储介质
CN113642471A (zh) * 2021-08-16 2021-11-12 百度在线网络技术(北京)有限公司 一种图像识别方法、装置、电子设备和存储介质
CN113870215B (zh) * 2021-09-26 2023-04-07 推想医疗科技股份有限公司 中线提取方法及装置
CN114155556B (zh) * 2021-12-07 2024-05-07 中国石油大学(华东) 一种基于加入通道混洗模块的堆叠沙漏网络的人体姿态估计方法及系统
CN114393575B (zh) * 2021-12-17 2024-04-02 重庆特斯联智慧科技股份有限公司 基于用户姿势高效能识别的机器人控制方法和系统
CN114492522B (zh) * 2022-01-24 2023-04-28 四川大学 基于改进堆叠沙漏神经网络的自动调制分类方法
CN115937793B (zh) * 2023-03-02 2023-07-25 广东汇通信息科技股份有限公司 基于图像处理的学生行为异常检测方法
CN116894844B (zh) * 2023-07-06 2024-04-02 北京长木谷医疗科技股份有限公司 一种髋关节图像分割与关键点联动识别方法及装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229490A (zh) * 2017-02-23 2018-06-29 北京市商汤科技开发有限公司 关键点检测方法、神经网络训练方法、装置和电子设备
CN108229497A (zh) * 2017-07-28 2018-06-29 北京市商汤科技开发有限公司 图像处理方法、装置、存储介质、计算机程序和电子设备
CN108280455A (zh) * 2018-01-19 2018-07-13 北京市商汤科技开发有限公司 人体关键点检测方法和装置、电子设备、程序和介质
CN108427927A (zh) * 2018-03-16 2018-08-21 深圳市商汤科技有限公司 目标再识别方法和装置、电子设备、程序和存储介质
CN108764133A (zh) * 2018-05-25 2018-11-06 北京旷视科技有限公司 图像识别方法、装置及系统
CN109670397A (zh) * 2018-11-07 2019-04-23 北京达佳互联信息技术有限公司 人体骨骼关键点的检测方法、装置、电子设备及存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239728B (zh) * 2017-01-04 2021-02-02 赛灵思电子科技(北京)有限公司 基于深度学习姿态估计的无人机交互装置与方法
CN108197633A (zh) * 2017-11-24 2018-06-22 百年金海科技有限公司 基于TensorFlow的深度学习图像分类与应用部署方法
EP3547211B1 (en) * 2018-03-30 2021-11-17 Naver Corporation Methods for training a cnn and classifying an action performed by a subject in an inputted video using said cnn
CN108596258B (zh) * 2018-04-27 2022-03-29 南京邮电大学 一种基于卷积神经网络随机池化的图像分类方法

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832526A (zh) * 2020-07-23 2020-10-27 浙江蓝卓工业互联网信息技术有限公司 一种行为检测方法及装置
CN111832526B (zh) * 2020-07-23 2024-06-11 浙江蓝卓工业互联网信息技术有限公司 一种行为检测方法及装置
CN112651333A (zh) * 2020-12-24 2021-04-13 世纪龙信息网络有限责任公司 静默活体检测方法、装置、终端设备和存储介质
CN112651333B (zh) * 2020-12-24 2024-02-09 天翼数字生活科技有限公司 静默活体检测方法、装置、终端设备和存储介质
CN113128383A (zh) * 2021-04-07 2021-07-16 杭州海宴科技有限公司 一种校园学生欺凌行为的识别方法
CN114154465A (zh) * 2021-10-29 2022-03-08 北京搜狗科技发展有限公司 结构图的结构重构方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
US20210049356A1 (en) 2021-02-18
US11373426B2 (en) 2022-06-28
CN109670397A (zh) 2019-04-23
CN109670397B (zh) 2020-10-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19882603

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19882603

Country of ref document: EP

Kind code of ref document: A1