WO2020108362A1 - Human posture detection method, apparatus, device, and storage medium - Google Patents

Human posture detection method, apparatus, device, and storage medium

Info

Publication number
WO2020108362A1
Authority
WO
WIPO (PCT)
Prior art keywords
image data
human
posture
module
input
Application number
PCT/CN2019/119633
Inventor
项伟
王毅峰
黄秋实
梁柱锦
Original Assignee
广州市百果园信息技术有限公司
Application filed by 广州市百果园信息技术有限公司
Priority to US17/297,882 (US11908244B2)
Publication of WO2020108362A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person

Definitions

  • Embodiments of the present application relate to human posture detection technology, such as a human posture detection method, device, equipment, and storage medium.
  • Human pose detection is the most challenging research direction in the field of computer vision. It is widely used in human-computer interaction, intelligent monitoring, virtual reality, and human behavior analysis. However, the local image features at the key points that make up a human pose undergo multi-scale and affine transformations, and the image is easily affected by factors such as the target person's clothing, the camera's shooting angle and distance, lighting changes, and local occlusion, so progress in human pose detection research has been slow.
  • Human posture detection is usually based on convolutional neural networks.
  • It is usually necessary to collect a large number of training samples and to perform long-term supervised learning on the human posture detection model.
  • The embodiments of the present application provide a human posture detection method, apparatus, device, and storage medium, so as to realize human posture detection on an embedded platform.
  • An embodiment of the present application provides a method for detecting human posture.
  • The method includes: acquiring multiple frames of image data; inputting the current frame of image data into a pre-trained human posture detection model and, with reference to the human posture confidence map of the previous frame of image data, outputting multiple human posture reference maps, where the human posture detection model is generated by training a convolutional neural network applicable to an embedded platform; identifying the human posture key point in each human posture reference map; generating the human posture confidence map of the current frame of image data according to the credibility of the human posture key points; determining whether the current frame of image data is the last frame of image data; when the current frame of image data is not the last frame of image data, inputting the human posture confidence map of the current frame of image data into the human posture detection model, where it participates in generating the human posture confidence map of the next frame of image data; and when the current frame of image data is the last frame of image data, ending the operation of generating a human posture confidence map for multi-
  • An embodiment of the present application provides a human posture detection apparatus, which includes:
  • an image data collection module, configured to collect multiple frames of image data;
  • a human posture reference map output module, configured to input the current frame of image data into a pre-trained human posture detection model and, with reference to the human posture confidence map of the previous frame of image data, output multiple human posture reference maps, where the human posture detection model is generated by training a convolutional neural network applicable to an embedded platform;
  • a human posture key point recognition module, configured to identify the human posture key point in each human posture reference map;
  • a human posture confidence map generation module, configured to generate a human posture confidence map according to the credibility of the human posture key points;
  • a judgment module, configured to judge whether the current frame of image data is the last frame of image data;
  • a first execution module, configured to, when the current frame of image data is not the last frame of image data, input the human posture confidence map of the current frame of image data into the human posture detection model, where it participates in generating the human posture confidence map of the next frame of image data;
  • a second execution module, configured to, when the current frame of image data is the last frame of image data, end the operation of generating human posture confidence maps for the multi-frame image data.
  • An embodiment of the present application further provides a device, the device including: at least one processor; and a memory configured to store at least one program. When the at least one program is executed by the at least one processor, the at least one processor implements the method described in the first aspect of the embodiments of the present application.
  • An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the method according to the first aspect of the embodiments of the present application is implemented.
  • FIG. 1 is a flowchart of a method for detecting human posture in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of the application of a convolutional neural network in an embodiment of the present application.
  • FIG. 3 is a flowchart of another human body posture detection method in an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a human body posture detection device in an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a device in an embodiment of the present application.
  • Computer vision lets a computer simulate the human visual function and, like a human, understand the objective world through observation.
  • Its main research content is how to use computer vision technology to solve human-centered problems, including object recognition, face recognition, human detection and tracking, human posture detection, and human motion analysis.
  • Human posture detection is an important part of human behavior recognition and an important research topic for human behavior recognition systems. Its ultimate purpose is to output the structural parameters of all or part of the human body, such as the human body contour, the position and orientation of the head, and the location or type of human body key points. It has important applications in many areas, such as athlete motion recognition, animated character production, and content-based image and video retrieval.
  • The human body can be regarded as composed of different parts connected by key points.
  • The human posture can be determined by obtaining the position information of each key point, where the position information of a key point can be expressed as two-dimensional coordinates.
  • Human posture detection usually requires obtaining 14 key points in total: the head, neck, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, and right ankle.
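  • As a minimal sketch, the 14 key points listed above can be written as an enumeration so that each key point has a stable index (for example, one reference map per index). The name `KeyPoint` and the index order are assumptions made for illustration; the source does not define an ordering.

```python
from enum import IntEnum


class KeyPoint(IntEnum):
    """The 14 human posture key points; the index order is an assumption."""
    HEAD = 0
    NECK = 1
    LEFT_SHOULDER = 2
    RIGHT_SHOULDER = 3
    LEFT_ELBOW = 4
    RIGHT_ELBOW = 5
    LEFT_WRIST = 6
    RIGHT_WRIST = 7
    LEFT_HIP = 8
    RIGHT_HIP = 9
    LEFT_KNEE = 10
    RIGHT_KNEE = 11
    LEFT_ANKLE = 12
    RIGHT_ANKLE = 13


# A posture is then one (x, y) coordinate pair per key point.
posture = {kp: (0, 0) for kp in KeyPoint}
```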
  • A human posture detection method based on a convolutional neural network can be used for human posture detection.
  • The core problem solved by a convolutional neural network is how to automatically extract and abstract features and then map those features to the task objective to solve a practical problem.
  • A convolutional neural network generally consists of three parts: the first part is the input layer; the second part is composed of convolutional layers, excitation layers, and pooling (downsampling) layers; and the third part is composed of a fully connected multi-layer perceptron classifier.
  • A convolutional neural network has the property of weight sharing: the same feature can be extracted at different positions across the entire image through the convolution operation of a single convolution kernel.
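  • Weight sharing can be illustrated with a small pure-Python sketch in which one kernel, with one set of weights, is slid over every position of an image. The function name `conv2d_valid`, the toy image, and the edge kernel are hypothetical; as in most deep learning frameworks, the operation shown is technically cross-correlation (the kernel is not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide one kernel over the image (stride 1, no padding).

    The same kernel weights are reused at every position, which is
    what weight sharing means.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


# Toy image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A 1x2 kernel that responds to a left-to-right increase in intensity.
edge_kernel = [[-1.0, 1.0]]

feature_map = conv2d_valid(image, edge_kernel)
```

  • In this toy case the single kernel responds at the edge in every row, so the same feature is detected at different positions without any position-specific weights.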
  • The convolutional layers extract low-level features and aggregate them into high-level features.
  • Low-level features are basic features, such as local texture and edges; high-level features, such as faces and object shapes, better represent the global attributes of a sample. This process is the hierarchical generalization of the target object by the convolutional neural network.
  • The convolutional neural network needs a small amount of computation, a fast running speed, and a prediction accuracy that meets actual requirements.
  • For this purpose, a lightweight convolutional neural network may be used.
  • The convolutional neural network referred to in the embodiments is this lightweight convolutional neural network.
  • A lightweight convolutional neural network is a convolutional neural network that can be applied to embedded platforms.
  • The human posture detection method is described below in conjunction with specific embodiments.
  • FIG. 1 is a flowchart of a method for detecting human posture provided by an embodiment of the present application. This embodiment is applicable to detecting human posture.
  • The method can be performed by a human posture detection apparatus, which can be implemented using software and hardware.
  • The apparatus may be configured in a device, such as a computer or a mobile terminal. As shown in FIG. 1, the method includes steps 110 to 170.
  • In step 110, multiple frames of image data are collected.
  • A video can be understood as being composed of at least one frame of image data. Therefore, to recognize the human posture in a video, the video can be split into image data frame by frame.
  • The multiple frames of image data here belong to the same video; in other words, the video includes multiple frames of image data, which can be named in chronological order. Exemplarily, if the video includes N frames of image data, N ≥ 1, the N frames may be referred to as the first frame of image data, the second frame of image data, ..., the (N-1)-th frame of image data, and the N-th frame of image data.
  • Each frame of image data may be processed sequentially in chronological order.
  • The frame currently being processed can be called the current frame of image data, the frame immediately before it the previous frame of image data, and the frame immediately after it the next frame of image data.
  • If the current frame of image data is the first frame, it has only a next frame and no previous frame; if it is the last frame, it has only a previous frame and no next frame; if it is neither the first nor the last frame, it has both a previous frame and a next frame.
  • The frames are processed in chronological order because, for human posture detection, two adjacent frames of image data may be correlated: if a key point is identified at a certain position in the previous frame of image data, the same key point may appear near that position in the current frame of image data.
  • Therefore, if the detection result of the previous frame of image data satisfies a preset condition, the current frame of image data can be processed with reference to the detection result of the previous frame.
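  • The chronological processing described above can be sketched as a loop that carries each frame's result forward as feedback for the next frame. The names `detect_video`, `detect_frame`, and `fake_detect` are hypothetical; `detect_frame` merely stands in for the per-frame steps and is not the model described in this application.

```python
def detect_video(frames, detect_frame):
    """Process frames in chronological order, feeding each result forward.

    detect_frame(frame, previous_maps) is a hypothetical callable that
    returns the confidence maps of `frame`; previous_maps is None for
    the first frame, which has no previous frame.
    """
    results = []
    previous_maps = None
    for frame in frames:
        current_maps = detect_frame(frame, previous_maps)
        results.append(current_maps)
        previous_maps = current_maps  # feedback for the next frame
    return results  # the loop ends once the last frame is processed


# Toy stand-in that only records which frame supplied the feedback.
def fake_detect(frame, previous_maps):
    source = previous_maps["frame"] if previous_maps else None
    return {"frame": frame, "feedback_from": source}


history = detect_video(["frame1", "frame2", "frame3"], fake_detect)
```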
  • In step 120, the current frame of image data is input into the pre-trained human posture detection model and, with reference to the human posture confidence map of the previous frame of image data, multiple human posture reference maps are output.
  • The human posture detection model is generated by training a convolutional neural network applicable to an embedded platform.
  • A human posture confidence map may refer to an image that includes a human posture key point; it can also be understood as an image generated from a human posture key point, such as an image generated centered on the key point.
  • The human posture key points mentioned here are the 14 key points described above: the head, neck, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, and right ankle.
  • A human posture reference map may include two kinds of information: the position information of multiple points that may be a human posture key point, and the probability value corresponding to each position. A point that may be a human posture key point is called a candidate point. Accordingly, a human posture reference map includes the position information of multiple candidate points and a probability value for each candidate point, where the position information can be expressed as coordinates. Which candidate point is the human posture key point can be determined from the probability values of the candidate points.
  • Exemplarily, a human posture reference map includes the position (x_A, y_A) of candidate point A with probability value P_A, the position (x_B, y_B) of candidate point B with probability value P_B, and the position (x_C, y_C) of candidate point C with probability value P_C, where P_A < P_B < P_C. In this case, candidate point C is determined to be the human posture key point.
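  • The A/B/C example above amounts to picking the candidate with the largest probability value. A minimal sketch follows; the `Candidate` type, the function name `pick_key_point`, and the coordinates and probabilities are made up for illustration.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    x: int
    y: int
    probability: float


def pick_key_point(candidates):
    """Return the candidate point with the maximum probability value."""
    return max(candidates, key=lambda c: c.probability)


# Illustrative values only, chosen so that P_A < P_B < P_C.
candidates = [
    Candidate("A", 10, 12, 0.20),
    Candidate("B", 11, 12, 0.55),
    Candidate("C", 12, 13, 0.90),
]
key_point = pick_key_point(candidates)
```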
  • Each human posture confidence map corresponds to one human posture key point.
  • Each human posture reference map includes multiple candidate points, all of which are candidates for one particular key point. For example, one human posture reference map includes multiple candidate points for the left elbow, while another human posture reference map includes multiple candidate points for the left knee.
  • The pre-trained human posture detection model can be generated from a set number of training samples by training a convolutional neural network applicable to an embedded platform.
  • The convolutional neural network applicable to an embedded platform is a lightweight convolutional neural network.
  • The human posture detection model may include a main path, a first branch, a second branch, and a third branch. The main path may include a residual module and an upsampling module, the first branch may include a refinement network module, and the second branch may include a feedback module. The residual module may include a first residual unit, a second residual unit, and a third residual unit. The components of the human posture detection model are described in detail later.
  • Case 1: The current frame of image data is input as an input variable into the pre-trained human posture detection model to obtain multiple first human posture reference maps, and multiple human posture reference maps are then output with reference to the multiple human posture confidence maps of the previous frame of image data. Each first human posture reference map refers to the corresponding human posture confidence map among those obtained from the previous frame of image data in order to output one human posture reference map of the current frame of image data.
  • The correspondence is determined by whether the key points are the same. Exemplarily, if the key point of a first human posture reference map of the current frame of image data is the left elbow, it refers to the human posture confidence map of the previous frame of image data whose key point is also the left elbow.
  • In Case 1, the human posture confidence map of the previous frame of image data is not input into the pre-trained human posture detection model as an input variable together with the current frame of image data. Instead, after the current frame of image data is input into the model and the multiple first human posture reference maps are obtained, the human posture confidence maps of the previous frame of image data are used to determine, one by one, whether each first human posture reference map is credible.
  • If a first human posture reference map is credible, it is used as a human posture reference map of the current frame; if it is not credible, the human posture confidence map of the previous frame of image data that corresponds to it is used as the human posture reference map of the current frame instead.
  • Case 2: The current frame of image data and the human posture confidence maps of the previous frame of image data are input together as input variables into the pre-trained human posture detection model, and multiple human posture reference maps are output.
  • In Case 2, the human posture confidence map of the previous frame of image data is also an input variable and is input into the pre-trained human posture detection model together with the current frame of image data.
  • For video, two adjacent frames of image data are correlated to some degree. Feeding the result of the previous frame of image data back into the pre-trained human posture detection model, so that it participates in predicting the output for the current frame of image data, can improve the prediction accuracy of the human posture detection model.
  • The following approach can be used. First, determine whether the human posture confidence map of the previous frame of image data is credible. If it is credible, input the current frame of image data and the human posture confidence map of the previous frame of image data into the pre-trained human posture detection model and output multiple human posture reference maps. If it is not credible, either input the current frame of image data and preset image data into the pre-trained human posture detection model, or input only the current frame of image data into the model; in both cases multiple human posture reference maps are output.
  • The preset image data is image data that contains no prior knowledge, such as an all-black image, which in matrix form is an all-zero matrix.
  • For the output result of the current frame of image data, the human posture confidence map of the previous frame of image data is image data containing prior knowledge; likewise, for the output result of the next frame of image data, the human posture confidence map of the current frame of image data is image data containing prior knowledge.
  • This approach improves the prediction accuracy of the human posture detection model for the following reason: if the human posture confidence map of the previous frame of image data is not credible and is nevertheless input into the pre-trained human posture detection model as an input variable, it will not improve the prediction accuracy of the model and may even reduce it.
  • The human posture confidence map of the previous frame of image data that is input as an input variable therefore needs to be credible. For this reason, before deciding whether to refer to the human posture confidence map of the previous frame of image data, its credibility is checked: if it is credible, it is input into the pre-trained human posture detection model as an input variable; if it is not credible, it is not used as an input variable.
  • The multiple human posture reference maps mentioned above are the output result for the current frame of image data; that is, the current frame of image data corresponds to multiple human posture reference maps.
  • Exemplarily, if N human posture reference maps are output for the current frame of image data, the human posture confidence maps of the previous frame of image data used as a reference also number N.
  • Judging whether the human posture confidence map of the previous frame of image data is credible means judging each of its human posture confidence maps separately. Because a human posture confidence map is an image that includes a key point, and different key points correspond to different human posture confidence maps, the conditions used to judge credibility for different key points may be the same or different; they can be set according to the actual situation and are not limited here.
  • When the human posture confidence map of the previous frame of image data is not credible, either the current frame of image data alone, or the current frame of image data together with the preset image data, can be input into the pre-trained human posture detection model.
  • When the current frame of image data is input together with the preset image data, the procedure is as follows: judge whether the human posture confidence map of the previous frame of image data is credible; if it is credible, input the current frame of image data and the human posture confidence map of the previous frame of image data into the pre-trained human posture detection model and output multiple human posture reference maps; if it is not credible, input the current frame of image data and the preset image data into the pre-trained human posture detection model and output multiple human posture reference maps.
  • Exemplarily, given N human posture confidence maps, each map is judged for credibility. If x human posture confidence maps are credible and (N - x) are not, the x credible human posture confidence maps, (N - x) pieces of preset image data, and the current frame of image data are input into the human posture detection model, and multiple human posture reference maps are output.
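  • The mixing of x credible confidence maps with (N - x) pieces of preset image data can be sketched as follows. The function names `zero_map` and `build_feedback`, the 2x2 map size, and the credibility flags are assumptions for illustration; how the flags are obtained is outside this sketch.

```python
def zero_map(height, width):
    """Preset image data: an all-zero matrix (an all-black image)."""
    return [[0.0] * width for _ in range(height)]


def build_feedback(previous_maps, credible_flags):
    """Keep credible confidence maps; replace the others with preset data."""
    feedback = []
    for conf_map, credible in zip(previous_maps, credible_flags):
        if credible:
            feedback.append(conf_map)
        else:
            feedback.append(zero_map(len(conf_map), len(conf_map[0])))
    return feedback


# N = 3 confidence maps from the previous frame; here x = 2 are credible.
previous_maps = [
    [[0.0, 0.9], [0.0, 0.0]],
    [[0.1, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.8, 0.0]],
]
credible_flags = [True, False, True]
feedback = build_feedback(previous_maps, credible_flags)
```

  • The resulting list still has N entries, so the model input keeps the same shape regardless of how many previous-frame maps were credible.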
  • Before the current frame of image data is input into the pre-trained human posture detection model and multiple human posture reference maps are output with reference to the human posture confidence map of the previous frame of image data, the method further includes: preprocessing the image data to obtain processed image data.
  • The preprocessing may include normalization and whitening. Normalization refers to a series of transformations: the invariant moments of the image are used to find a set of parameters that eliminates the influence of other transformation functions on the image, converting the original image to be processed into a corresponding unique standard form that is invariant to affine transformations such as translation, rotation, and scaling. Normalization usually includes the following steps: coordinate centering, x-shearing normalization, scaling normalization, and rotation normalization.
  • Because the human posture detection model is generated by neural network training, normalizing the image data before the current frame of image data is input into the pre-trained model unifies the statistical distribution of the samples, which speeds up network learning and ensures that small values in the output data are not swamped.
  • Because adjacent pixels in image data are strongly correlated, the raw image data is redundant when used as an input variable.
  • Whitening reduces the redundancy of the input. More precisely, after whitening the input variables have the following properties: the correlation between features is low, and all features have the same variance, usually set to unit variance.
  • The current frame of image data that is input into the pre-trained human posture detection model as an input variable is the processed image data.
  • Likewise, the previous frame of image data is also processed image data.
  • In step 130, the human posture key point is identified in each human posture reference map.
  • As described above, a human posture reference map includes two kinds of information: the position information of each point that may be a human posture key point, and the probability value corresponding to that position information. A point that may be a human posture key point is called a candidate point, and the point finally determined to be the key point is called the human posture key point.
  • Which candidate point is the human posture key point is determined from the probability values of the candidate points: the candidate point with the maximum probability value is selected as the human posture key point.
  • Each human posture reference map includes multiple candidate points for a human posture key point, and the coordinate position of each candidate point corresponds to a probability value. Identifying the human posture key point in a human posture reference map includes: determining the coordinate position corresponding to the maximum of the probability values of the candidate points in the reference map, and using the candidate point at that coordinate position as the human posture key point.
  • the human posture reference image includes position information of multiple points that may be key points of the human posture and the probability value corresponding to the position information
  • the probability corresponding to the position information of the multiple points can be used Value to determine which point to use as the key point for human posture.
  • the coordinate position of the maximum probability value is determined in the human pose reference map, and the coordinate position is used as a key point of the human pose.
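  • The selection rule described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the array layout (one H×W probability map per key point, stacked as K×H×W) and the function name `identify_keypoints` are hypothetical and do not come from the application.

```python
import numpy as np

def identify_keypoints(reference_maps):
    """Return (row, col, probability) of the peak of each reference map.

    reference_maps: array of shape (K, H, W), one probability map per key point
    (assumed layout, for illustration only).
    """
    keypoints = []
    for heatmap in reference_maps:
        # Position of the maximum probability value in this map.
        flat_index = int(np.argmax(heatmap))
        row, col = np.unravel_index(flat_index, heatmap.shape)
        keypoints.append((int(row), int(col), float(heatmap[row, col])))
    return keypoints

# Two toy reference maps with one clear peak each.
maps = np.zeros((2, 4, 5))
maps[0, 1, 3] = 0.9
maps[1, 2, 0] = 0.4
keypoints = identify_keypoints(maps)
```

  • The returned probability is kept alongside the coordinates because the credibility check in the following step compares it with a threshold.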
  • In step 140, a human posture confidence map is generated according to the credibility of the human posture key points.
  • A human posture key point is either credible or not credible.
  • The criterion for deciding between the two may be whether the probability value corresponding to the human posture key point is greater than a preset threshold: when the probability value is greater than the preset threshold, the human posture key point is credible; when the probability value is less than or equal to the preset threshold, the human posture key point is not credible.
  • When the human posture key point is not credible, the preset image data can be used as the human posture confidence map.
  • the preset image data described here is the same as the preset image data described above, and the preset image data may be an all-black image, and in the case of being expressed in the form of a matrix table, an all-zero matrix.
  • the following manners can be used to determine whether the key points of the human posture are credible: whether the probability value of the key points of the human posture is greater than a preset threshold. When the probability value of the human body pose key point is greater than the preset threshold, the human body pose key point is credible; when the probability value of the human body pose key point is less than or equal to the preset threshold, the human body pose The key point is not credible.
  • When a human posture key point is not credible, the corresponding human posture key point in the previous frame of image data is used as the human posture key point of the current frame.
  • For such a non-credible key point, however, the human posture confidence map is not generated from the corresponding key point in the previous frame of image data; it is generated from the preset image data.
  • generating a confidence map of the human posture according to the credibility of the key points of the human posture includes: determining whether the key points of the human posture are credible. When the key points of the human posture are credible, a mask image is generated with the key points of the human posture as the center, and used as a confidence map of the human posture. When the key points of the human posture are not reliable, the preset image data is used as the confidence map of the human posture.
  • the mask map refers to an image obtained after performing image mask processing on the image.
  • the image mask refers to using the selected image, figure or object to block the image to be processed (all or part) to control the image processing area or processing process.
  • the specific image or object used for covering is called a mask or template.
  • the mask can be a two-dimensional matrix array or a multi-value image.
  • An image mask serves the following purposes. First, extracting the region of interest: a pre-made region-of-interest mask is multiplied with the image to be processed to obtain the region-of-interest image, in which the image values inside the region of interest remain unchanged and the image values outside the region are all zero. Second, the shielding effect.
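  • The region-of-interest extraction described above is an element-wise multiplication. The following is a minimal sketch with a hypothetical 4×4 image and a hand-made binary mask; the sizes and values are chosen only for illustration.

```python
import numpy as np

# Toy 4x4 "image" with distinct values 1..16.
image = np.arange(1, 17, dtype=float).reshape(4, 4)

# Binary mask: 1 inside the region of interest, 0 outside.
roi_mask = np.zeros((4, 4))
roi_mask[1:3, 1:3] = 1.0

# Values inside the region are unchanged; values outside become zero.
roi_image = image * roi_mask
```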
  • Generating the human posture confidence map includes: when the human posture key point is credible, generating a mask map centered on the human posture key point and using it as the human posture confidence map.
  • the key points of the human posture are used as the center, and the mask map is generated using the Gaussian kernel as the confidence map of the human posture.
  • the area affected by the mask map can be determined by setting the parameters of the Gaussian kernel, where the parameters of the Gaussian kernel include the width and height of the filter window, and the Gaussian kernel can be a two-dimensional Gaussian kernel.
  • For example, when the Gaussian kernel is a two-dimensional Gaussian kernel
  • whose parameters are a filter window with a width of 7 and a height of 7, the area affected by the mask map is a 7×7 square area.
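  • A mask map of this kind could be built roughly as follows. The sketch assumes a 7×7 window as in the example above; the standard deviation `sigma`, the peak value of 1.0, and the clipping at the image border are assumptions, since the text only specifies the window width and height.

```python
import numpy as np

def gaussian_mask(height, width, center_row, center_col, window=7, sigma=1.0):
    """All-zero map with a window x window Gaussian patch centred on a key point."""
    half = window // 2
    offsets = np.arange(-half, half + 1)
    kernel_1d = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(kernel_1d, kernel_1d)  # 2-D kernel, value 1.0 at the centre

    mask = np.zeros((height, width), dtype=np.float32)
    # Clip the window so that key points near the border do not index outside the map.
    r0, r1 = max(center_row - half, 0), min(center_row + half + 1, height)
    c0, c1 = max(center_col - half, 0), min(center_col + half + 1, width)
    kr0 = r0 - (center_row - half)
    kc0 = c0 - (center_col - half)
    mask[r0:r1, c0:c1] = kernel[kr0:kr0 + (r1 - r0), kc0:kc0 + (c1 - c0)]
    return mask

mask = gaussian_mask(32, 64, center_row=10, center_col=20)
corner = gaussian_mask(8, 8, center_row=0, center_col=0)
```

  • Only the 7×7 neighbourhood of the key point is non-zero; everywhere else the map equals the all-zero preset image data.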
  • the preset image data may be used as a confidence map of the human posture, or the preset image data may be regarded as a mask image.
  • the preset image data described here is the same as the preset image data described above, and the preset image data may be an all-black image, and in the case of being expressed in the form of a matrix table, an all-zero matrix.
  • determining whether the key points of the human body pose are credible includes: determining whether the probability value corresponding to the key points of the human body is greater than a preset threshold. When the probability value corresponding to the key point of the human body is greater than a preset threshold, it is determined that the key point of the human body is credible. When the probability value corresponding to the key point of the human body is less than or equal to the preset threshold, it is determined that the key point of the human body is not credible.
  • the threshold can be set according to actual conditions, and is not limited herein.
  • The thresholds corresponding to different human posture key points can be the same or different, and can also be determined according to the actual situation, which is not limited here. For example, a larger threshold can be set for important human posture key points, and a smaller threshold can be set for less important human posture key points. For instance, when the human posture key point is the top of the head, the corresponding threshold is 0.9, and when the human posture key point is the left knee, the corresponding threshold is 0.5.
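  • The credibility test with per-key-point thresholds can be sketched as below. The two threshold values are the ones given in the example above; the key names, the default threshold, and the function names are hypothetical.

```python
import numpy as np

# Thresholds from the example in the text; other key points fall back to a default.
THRESHOLDS = {"head_top": 0.9, "left_knee": 0.5}
DEFAULT_THRESHOLD = 0.5  # assumption, not specified in the text

def is_credible(name, probability):
    # Credible only when strictly greater than the threshold.
    return probability > THRESHOLDS.get(name, DEFAULT_THRESHOLD)

def confidence_map(name, probability, mask_map):
    """Return the mask map for a credible key point, else the all-zero preset data."""
    if is_credible(name, probability):
        return mask_map
    return np.zeros_like(mask_map)

example_mask = np.ones((4, 4))
knee_map = confidence_map("left_knee", 0.6, example_mask)
head_map = confidence_map("head_top", 0.9, example_mask)
```

  • Note that a probability exactly equal to the threshold counts as not credible, matching the "less than or equal to" wording above.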
  • In step 150, it is determined whether the current frame of image data is the last frame of image data; when the current frame of image data is not the last frame of image data, the process goes to step 160; when the current frame of image data is the last frame of image data, the process goes to step 170.
  • In step 160, the human posture confidence map of the current frame of image data is input into the human posture detection model and is used to participate in generating the human posture confidence map of the next frame of image data.
  • In step 170, the operation of generating the human posture confidence maps of the multi-frame image data is ended.
  • When the current frame of image data is not the last frame of image data, the human posture confidence map of the current frame can be input into the human posture detection model as a reference for the output result of the next frame of image data, to improve the accuracy of that output. That is, the next frame of image data is input into the pre-trained human posture detection model, which references the human posture confidence map of the current frame and outputs multiple human posture reference maps for the next frame; human posture key points are identified in these reference maps, and a human posture confidence map is generated according to the credibility of the key points.
  • When the current frame of image data is the last frame of image data, its human posture confidence map does not need to be input into the human posture detection model.
  • It can be understood that, when the current frame of image data is the last frame of image data, only steps 120 and 130 need to be performed, together with the judgment of whether the human posture key points are credible. When a human posture key point is not credible, the corresponding human posture key point in the previous frame of image data is taken as the human posture key point.
  • steps 120-150 are all processing procedures for the image data of the current frame.
  • The human posture reference maps in steps 120 and 130 refer to the human posture reference maps corresponding to the current frame of image data,
  • the human posture key points in steps 130 and 140 refer to the human posture key points corresponding to the current frame of image data, and
  • the human posture confidence maps in step 140 and step 150 refer to the human posture confidence map corresponding to the current frame of image data.
  • When the first frame of image data is the frame currently being processed, the first frame of image data is used as the current frame of image data.
  • When the second frame of image data is the frame currently being processed, the second frame of image data is used as the current frame of image data, and so on.
  • the current frame image data may be the first frame image data, the second frame image data, the third frame image data, ..., the N-1th frame image data or the Nth frame image data.
  • When the current frame of image data is not the Nth frame of image data, steps 120-140 can be repeated to complete the processing of the first frame through the (N-1)th frame of image data; when the current frame of image data is determined to be the Nth frame of image data, steps 120-130 can be performed, and if a human posture key point is not credible, the corresponding human posture key point in the previous frame of image data can be used as the human posture key point.
  • The current frame of image data is input into a pre-trained human posture detection model, which references the human posture confidence map of the previous frame of image data and outputs multiple human posture reference maps.
  • The human posture detection model is generated by training a convolutional neural network applied to an embedded platform.
  • Human posture key points are identified in the human posture reference maps, and a human posture confidence map is generated according to the credibility of the human posture key points. It is then determined whether the current frame of image data is the last frame of image data.
  • When the current frame of image data is not the last frame of image data, its human posture confidence map is input into the human posture detection model to participate in generating the human posture confidence map of the next frame of image data.
  • When the current frame of image data is the last frame of image data, the operation of generating the human posture confidence maps of the multi-frame image data is ended. Human posture detection on an embedded platform is thereby realized.
  • In addition, the output result of the previous frame of image data is introduced into the prediction of the output result of the current frame of image data, which improves the prediction accuracy.
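  • The frame-by-frame feedback loop summarized above can be sketched as follows. The names `detect_sequence`, `toy_model`, and `toy_confidence` are hypothetical stand-ins so that the sketch runs; a trained network and the credibility-based map generation would take their place. Using the preset image data as the "previous" confidence map for the first frame is an assumption made here for illustration.

```python
import numpy as np

def detect_sequence(frames, model, to_confidence_map, preset):
    """Run the detector over frames, feeding each confidence map into the next frame."""
    previous_confidence = preset  # assumption: preset data stands in before the first frame
    confidence_maps = []
    for frame in frames:
        # The model sees the current frame plus the previous frame's confidence map.
        reference_maps = model(frame, previous_confidence)
        confidence = to_confidence_map(reference_maps)
        confidence_maps.append(confidence)
        previous_confidence = confidence  # fed back for the next frame
    return confidence_maps  # the loop ends after the last frame

# Stand-in callables, for illustration only.
def toy_model(frame, previous_confidence):
    return frame + previous_confidence

def toy_confidence(reference_maps):
    return 0.5 * reference_maps

frames = [np.full((2, 2), float(i)) for i in (1, 2, 3)]
confidence_maps = detect_sequence(frames, toy_model, toy_confidence, np.zeros((2, 2)))
```

  • With the stand-ins above, each output depends on all earlier frames, which is the effect the feedback is meant to have.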
  • The human posture detection model includes a main path, a first branch, and a second branch.
  • The main path includes a residual module and an upsampling module,
  • the first branch includes a refinement network module, and
  • the second branch includes a feedback module.
  • The first convolution result output by the residual module is input to the upsampling module for processing to obtain a second convolution result.
  • The first convolution result output by the residual module is input to the refinement network module for processing to obtain a third convolution result.
  • the second convolution result and the third convolution result are added to output a plurality of human body posture reference pictures.
  • the residual module is configured to extract features such as edges and contours of image data
  • the upsampling module is configured to extract context information of image data.
  • The refinement network module is set to process the first convolution result output by the residual module. The first convolution result can be regarded as intermediate-layer information of the network; that is, the refinement network module uses the intermediate-layer information of the network to increase the backpropagated gradient, thereby improving the prediction accuracy of the convolutional neural network.
  • The feedback module is set to introduce the human posture confidence map of the previous frame of image data into the convolutional neural network, to improve the accuracy of the output result for the current frame of image data.
  • Inputting the current frame of image data to the residual module for processing, with reference to the result obtained by inputting the human posture confidence map of the previous frame of image data to the feedback module, to obtain the first convolution result can be understood as follows:
  • the current frame of image data is input to the residual module, and the residual module processes it together with the result obtained by inputting the human posture confidence map of the previous frame of image data to the feedback module, to obtain the first convolution result.
  • the first convolution result output by the residual module is input to the upsampling module for processing to obtain a second convolution result
  • the first convolution result output by the residual module is input to the refining network module for processing to obtain a third convolution result
  • The upsampling module can use nearest-neighbor interpolation or another upsampling method, which can be set according to the actual situation and is not limited here.
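  • Nearest-neighbor upsampling simply repeats each value along the spatial axes. A minimal NumPy sketch follows; the factor of 2 and the H×W×K array layout are assumptions for illustration.

```python
import numpy as np

def upsample_nearest(features, factor=2):
    """Nearest-neighbour upsampling of an (H, W, K) array along H and W."""
    return features.repeat(factor, axis=0).repeat(factor, axis=1)

small = np.arange(4).reshape(2, 2, 1)
large = upsample_nearest(small)
```

  • Each input value becomes a factor×factor block in the output, so no new values are introduced and no parameters need to be learned.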
  • Through the refinement network module, the intermediate-layer information of the network is used to increase the backpropagated gradient, thereby improving the prediction accuracy of the convolutional neural network.
  • Through the feedback module, the human posture confidence map of the previous frame of image data is introduced into the convolutional neural network. It participates in the human posture detection model's prediction for the current frame of image data, which also improves the prediction accuracy of the convolutional neural network.
  • the residual module includes a first residual unit, a second residual unit, and a third residual unit.
  • Inputting the current frame of image data to the residual module for processing, with reference to the result obtained by inputting the human posture confidence map of the previous frame of image data to the feedback module, to obtain the first convolution result includes:
  • The current frame of image data is input to the first residual unit for processing to obtain the first intermediate result.
  • The first intermediate result and the result obtained by inputting the human posture confidence map of the previous frame of image data to the feedback module are added, and the sum is input to the second residual unit for processing to obtain the second intermediate result.
  • The second intermediate result is input to the third residual unit for processing, and the third intermediate result obtained is used as the first convolution result.
  • The numbers of channels of the first intermediate result, the second intermediate result, and the third intermediate result increase sequentially.
  • The residual module includes a first residual unit, a second residual unit, and a third residual unit, where each residual unit is composed of a ShuffleNet subunit and a ShuffleNet downsampling subunit.
  • The ShuffleNet subunit can operate on image data of any size and is controlled by two parameters, the input depth and the output depth. The input depth is the number of intermediate feature layers at the input, and the output depth is the number of intermediate feature layers output by the subunit; the number of layers corresponds to the number of channels.
  • The ShuffleNet subunit extracts higher-level features while retaining the information of the original level. It leaves the size of the image data unchanged and changes only the depth of the intermediate feature layers of the network, so it can be regarded as an advanced "convolutional layer" that keeps the size unchanged.
  • Each residual unit can contain only one ShuffleNet subunit. Compared with the original design, in which each residual unit includes three ShuffleNet subunits, this simplifies the network structure, reduces the amount of calculation, and improves processing efficiency.
  • The sizes of the first intermediate result, the second intermediate result, and the third intermediate result decrease in sequence, and, in order to keep the size of the network unchanged, the number of channels of the first intermediate result, the number of channels of the second intermediate result, and the number of channels of the third intermediate result increase in sequence. In addition, each channel corresponds to a feature map.
  • An intermediate result can be represented by W×H×K, where W is the width of the intermediate result, H is the length of the intermediate result, K is the number of channels, and W×H is the size of the intermediate result.
  • The input image data can be expressed as W×H×D, where W and H have the same meaning as described above, and D indicates the depth.
  • For example, the first intermediate result, the second intermediate result, and the third intermediate result are represented by W×H×K, where W, H, and K have the meanings described above: the first intermediate result is 64×32×32,
  • the second intermediate result is 32×16×64, and the third intermediate result is 16×8×128.
  • The size of the first intermediate result is 64×32,
  • the size of the second intermediate result is 32×16, and
  • the size of the third intermediate result is 16×8, which shows that the sizes of the first intermediate result, the second intermediate result, and the third intermediate result become smaller in turn.
  • the number of channels of the first intermediate result is 32
  • the number of channels of the second intermediate result is 64
  • the number of channels of the third intermediate result is 128.
  • the human pose detection model includes a third branch.
  • Inputting the first convolution result output by the residual module to the upsampling module for processing to obtain a second convolution result, and inputting the first convolution result output by the residual module to the refinement network module for processing to obtain a third convolution result, includes: inputting the first intermediate result to the third branch for processing to obtain a fourth intermediate result.
  • The second intermediate result is input to the third branch for processing to obtain a fifth intermediate result.
  • The third intermediate result and the fifth intermediate result are input to the upsampling module for processing to obtain the sixth intermediate result.
  • The fourth intermediate result and the sixth intermediate result are input to the upsampling module for processing to obtain the seventh intermediate result, which is used as the second convolution result.
  • The first convolution result output by the residual module is input to the refinement network module for processing to obtain a third convolution result. The numbers of channels of the sixth intermediate result and the seventh intermediate result decrease sequentially.
  • The human posture detection model includes a third branch. The role of the third branch is to move the convolution operation of the skip connection onto the main path, which improves the prediction accuracy of the human posture detection model.
  • The third branch includes a 1×1 convolution kernel module, a batch normalization module, and a linear activation function module. The 1×1 convolution kernel can play the following roles:
  • the 1×1 convolution kernel scales the input image data. This is because the 1×1 convolution kernel has only one parameter, so sliding this kernel over the input image data is equivalent to multiplying the input image data by a coefficient;
  • the 1×1 convolution kernel has the following two functions: first, it realizes cross-channel interaction and information integration; second, it performs dimension reduction and dimension raising.
  • The dimension reduction mentioned here refers to reducing the number of channels, and
  • the dimension raising refers to increasing the number of channels.
  • the nonlinear characteristics are greatly increased without loss of resolution.
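  • Both roles of the 1×1 convolution can be seen in a short NumPy sketch, offered only as an illustration. It treats the 1×1 convolution as a per-pixel linear map over channels; the function name `conv1x1`, the H×W×C layout, and the weight values are hypothetical, and bias, batch normalization, and activation are left out.

```python
import numpy as np

def conv1x1(features, weights):
    """1x1 convolution: features (H, W, C_in), weights (C_in, C_out) -> (H, W, C_out)."""
    return np.einsum("hwc,co->hwo", features, weights)

# Single channel in, single channel out: one parameter, so the result is a pure scaling.
single = np.arange(6, dtype=float).reshape(2, 3, 1)
scaled = conv1x1(single, np.array([[2.0]]))

# Four channels reduced to two: every output channel mixes all input channels.
multi = np.ones((2, 3, 4))
reduced = conv1x1(multi, np.full((4, 2), 0.25))
```

  • The spatial size (2×3 here) is untouched in both cases; only the channel dimension changes, which is why the operation does not lose resolution.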
  • The batch normalization module is set to perform batch normalization. Batch normalization addresses the problem that, as a neural network becomes deeper, convergence slows down and gradients may vanish or explode. It regulates the input of some or all layers, fixing the mean and variance of the input signal of each layer so that the input of each layer has a stable distribution.
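  • The normalization step itself can be sketched as follows, using the statistics of the current batch. The N×H×W×C layout, the scalar `gamma` and `beta`, and the `eps` value are assumptions for illustration; deployed networks typically use running statistics at inference time and learn per-channel scale and shift parameters.

```python
import numpy as np

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise each channel of an (N, H, W, C) batch to zero mean and unit variance."""
    mean = batch.mean(axis=(0, 1, 2), keepdims=True)
    var = batch.var(axis=(0, 1, 2), keepdims=True)
    return gamma * (batch - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
raw = rng.standard_normal((8, 4, 4, 3)) * 5.0 + 3.0  # shifted, widely spread input
normalized = batch_norm(raw)
```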
  • the weight matrix refers to the convolution kernel, that is, W represents the convolution kernel.
  • the size of the seventh intermediate result is larger than the size of the sixth intermediate result.
  • The number of channels of the sixth intermediate result and the number of channels of the seventh intermediate result decrease sequentially.
  • The convolution operation of the skip connection is moved onto the main path, thereby improving the prediction accuracy of the human posture detection model.
  • the first intermediate result, the second intermediate result, and the third intermediate result can be understood as the encoding part
  • the sixth intermediate result and the seventh intermediate result can be understood as the decoding part.
  • the convolutional neural network provided by the embodiments of the present application is an asymmetric encoding-decoding structure.
  • The method further includes: adding the first convolution result and the second convolution result to obtain a target result, and adding the multiple human posture reference maps and the target result to output new multiple human posture reference maps. The target result is used to improve the accuracy of the human posture detection model when the human posture detection model is trained.
  • Intermediate supervision refers to calculating the loss at the output of each stage, which can ensure that the underlying parameters are updated normally.
  • the operation of adding the first convolution result and the second convolution result may not be performed, that is, in the prediction stage, the output result includes only a plurality of body pose reference pictures.
  • Since the second residual unit and the third residual unit are composed of a ShuffleNet subunit and a ShuffleNet downsampling subunit, the original size information is retained on the main path before each downsampling. That is, the first intermediate result is input to the second residual unit before the ShuffleNet downsampling subunit of the second residual unit performs downsampling, and the second intermediate result is input to the third residual unit before the ShuffleNet downsampling subunit of the third residual unit performs downsampling.
  • A single ShuffleNet subunit is used to extract features between the first residual unit and the second residual unit; this ShuffleNet subunit is the ShuffleNet subunit of the first residual unit.
  • Likewise, a single ShuffleNet subunit is used to extract features between the second residual unit and the third residual unit; this ShuffleNet subunit is the ShuffleNet subunit of the second residual unit.
  • The convolutional neural network provided in the embodiment of the present application introduces a refinement network module and a feedback module, and moves the convolution operation of the skip connection onto the main path, which improves the prediction accuracy of the convolutional neural network.
  • The asymmetric encoding-decoding structure is used to ensure that the network size is basically unchanged. Since each residual unit contains only one ShuffleNet subunit, compared with the original design in which each residual unit includes three ShuffleNet subunits, the network structure is simplified; correspondingly, the amount of calculation is reduced and the processing efficiency is improved.
  • the human posture detection method based on the convolutional neural network can be applied to an embedded platform, such as the embedded platform of a smart phone, and it runs in real time and the prediction accuracy can meet the requirements.
  • The convolutional neural network may include: a main path, a first branch, a second branch, and a third branch.
  • The main path includes a first convolution module 21, a first residual unit 22, a second residual unit 23, a third residual unit 24, a second convolution module 25, an upsampling module 26, a bitwise addition module 27, and a third convolution module 28.
  • The accompanying figure is a schematic diagram of an application of the convolutional neural network.
  • the first residual unit 22, the second residual unit 23, and the third residual unit 24 all include a ShuffleNet down-sampling subunit 221 and a ShuffleNet subunit 222.
  • The first branch includes a refinement network module 29, wherein the refinement network module 29 includes a ShuffleNet subunit 222, an upsampling module 26, and a bitwise addition module 27; the second branch includes a feedback module 30; and the third branch includes a second convolution module 25.
  • W×H×K marked on a module, unit, or subunit represents the result obtained after processing by that module, unit, or subunit, where W represents the width of the result, H represents the length of the result, and K indicates the number of channels.
  • The first convolution module 21 includes the following processing operations: first step, a convolution operation with a 3×3 convolution kernel; second step, batch normalization; third step, a linear activation function.
  • The second convolution module 25 includes the following processing operations: first step, a convolution operation with a 1×1 convolution kernel; second step, batch normalization; third step, a linear activation function.
  • The third convolution module 28 includes the following processing operations: first step, a convolution operation with a 1×1 convolution kernel; second step, batch normalization; third step, a linear activation function; fourth step, a convolution operation with a 3×3 convolution kernel.
  • The current frame of image data is a 256×128×3 RGB image.
  • It is input as an input variable to the convolutional neural network and passes through the first convolution module 21 and the first residual unit 22 in turn to obtain the first intermediate result.
  • The first intermediate result is 64×32×32.
  • The result processed by the bitwise addition module 27 on the main path is input to the second residual unit 23 for processing to obtain a second intermediate result.
  • The second intermediate result is 32×16×64.
  • The second intermediate result is input to the third residual unit 24 for processing to obtain a third intermediate result, and the third intermediate result is used as the first convolution result.
  • The first convolution result is 16×8×128.
  • The feedback module 30 may include a 1×1 convolution kernel, which is set to raise the dimension. This is because the human posture confidence map of the previous frame of image data is 64×32×14 while the first intermediate result is 64×32×32, so the dimension needs to be raised to ensure the same number of output channels.
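  • The channel-raising role of the feedback module can be sketched with the shapes given above. This is an illustration under assumptions: the weights are random placeholders, the 1×1 convolution is written as a matrix product over the channel axis, and bias, batch normalization, and activation are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes from the text: confidence map 64x32x14, first intermediate result 64x32x32.
previous_confidence = np.zeros((64, 32, 14))        # all-zero preset image data
first_intermediate = rng.standard_normal((64, 32, 32))

# Hypothetical 1x1 convolution weights raising 14 channels to 32.
feedback_weights = rng.standard_normal((14, 32))
feedback = previous_confidence @ feedback_weights   # shape (64, 32, 32)

# Element-wise addition is possible only because the channel counts now match.
fused = first_intermediate + feedback
```

  • Under these assumptions (in particular, no bias term), an all-zero confidence map contributes nothing to the sum, so a non-credible previous frame leaves the first intermediate result unchanged.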
  • The first intermediate result is input to the second convolution module 25 of the third branch for processing to obtain a fourth intermediate result, which is 64×32×32.
  • The second intermediate result is input to the second convolution module 25 of the third branch for processing to obtain a fifth intermediate result, which is 32×16×32.
  • The result obtained by inputting the third intermediate result to the second convolution module 25 and the upsampling module 26 on the main path, together with the fifth intermediate result, is input to the bitwise addition module 27 on the main path for processing to obtain the sixth intermediate result.
  • The result obtained by inputting the sixth intermediate result to the upsampling module 26 on the main path, together with the fourth intermediate result, is input to the bitwise addition module 27 on the main path for processing to obtain the seventh intermediate result.
  • The seventh intermediate result is used as the second convolution result, which is 64×32×32.
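  • The shape bookkeeping of this decoding path can be checked with a small NumPy sketch. It is a rough illustration only: random arrays stand in for the encoder outputs, `conv1x1` and `upsample2x` are hypothetical helpers with random weights, and batch normalization and the activation function are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(features, out_channels):
    """1x1 convolution as a per-pixel linear map over channels (random weights)."""
    weights = rng.standard_normal((features.shape[-1], out_channels))
    return features @ weights

def upsample2x(features):
    """Nearest-neighbour upsampling by a factor of 2 along both spatial axes."""
    return features.repeat(2, axis=0).repeat(2, axis=1)

# Stand-ins for the encoder outputs, with the W x H x K shapes given in the text.
first = rng.standard_normal((64, 32, 32))
second = rng.standard_normal((32, 16, 64))
third = rng.standard_normal((16, 8, 128))

fourth = conv1x1(first, 32)                     # third branch
fifth = conv1x1(second, 32)                     # third branch
sixth = upsample2x(conv1x1(third, 32)) + fifth  # main path, bitwise addition
seventh = upsample2x(sixth) + fourth            # main path, second convolution result
```

  • Each addition requires both operands to share the same width, length, and channel count, which is why the skip connections pass through a 1×1 convolution before being added.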
  • The third intermediate result is input to the second convolution module 25 on the main path for processing and then input to the ShuffleNet subunit 222 on the first branch for processing to obtain the eighth intermediate result. The eighth intermediate result is input to the upsampling module 26 on the first branch to obtain the ninth intermediate result; the ninth intermediate result is then input to the ShuffleNet subunit 222 on the first branch to obtain the tenth intermediate result; and the tenth intermediate result is input to the upsampling module 26 on the first branch to obtain the eleventh intermediate result.
  • The sixth intermediate result is input to the ShuffleNet subunit 222 on the first branch for processing to obtain the twelfth intermediate result, and the twelfth intermediate result is input to the upsampling module 26 on the first branch for processing to obtain the thirteenth intermediate result.
  • The eleventh intermediate result and the thirteenth intermediate result are input to the bitwise addition module 27 on the first branch for processing, and a third convolution result is obtained.
  • The third convolution result is 64×32×32.
  • The second convolution result and the third convolution result are input to the bitwise addition module 27 on the main path to obtain the fourteenth intermediate result. The fourteenth intermediate result is input to the ShuffleNet subunit 222 on the main path to obtain the fifteenth intermediate result, which is 64×32×32. The fifteenth intermediate result is input to the third convolution module 28 on the main path, and multiple human posture reference maps are output.
  • The first convolution result and the second convolution result are added to obtain the target result, which is 64×32×14. The multiple human posture reference maps and the target result are added to output new multiple human posture reference maps. The target result is used to improve the accuracy of the human posture detection model when the human posture detection model is trained.
  • The human posture confidence map of the previous frame of image data is not input into the convolutional neural network as an input variable together with the current frame of image data; instead, it is input at an intermediate layer of the network together with the first intermediate result, which reduces the amount of data processing.
  • FIG. 3 is a flowchart of another method for detecting human posture according to an embodiment of the present application. This embodiment can be applied to the case of detecting human posture.
  • The method can be performed by a human posture detection apparatus, which can be implemented in at least one of software and hardware.
  • The apparatus can be configured in a device, such as a computer or a mobile terminal. As shown in FIG. 3, the method includes steps 301 to 311.
  • in step 301, multi-frame image data is collected.
  • in step 302, it is judged whether the human posture confidence map of the previous frame of image data is credible; if it is credible, step 303 is performed; if it is not credible, step 304 is performed.
  • in step 303, the current frame image data and the human posture confidence map of the previous frame image data are input into a pre-trained human posture detection model, a plurality of human posture reference maps are output, and the process proceeds to step 305.
  • in step 304, the current frame image data and the preset image data are input into a pre-trained human posture detection model, a plurality of human posture reference maps are output, and the process proceeds to step 305.
  • each human posture reference map includes multiple candidate points of a human posture key point, and the coordinate position of each candidate point corresponds to a probability value; in the human posture reference map, the coordinate position corresponding to the maximum of the probability values of the candidate points is determined, and the candidate point at that coordinate position is used as the human posture key point.
  • in step 306, it is determined whether the probability value corresponding to the human posture key point is greater than a preset threshold; if it is greater than the preset threshold, step 307 is performed; if it is less than or equal to the preset threshold, step 308 is performed.
  • in step 307, a mask map centered on the human posture key point is generated as the human posture confidence map of the current frame of image data, and the process proceeds to step 309.
  • in step 308, the preset image data is used as the human posture confidence map of the current frame of image data, and the process proceeds to step 309.
  • in step 309, it is determined whether the current frame of image data is the last frame of image data; if it is not the last frame, step 310 is performed; if it is the last frame, step 311 is performed.
  • in step 310, the human posture confidence map of the current frame of image data is input into the human posture detection model to participate in generating the human posture confidence map of the next frame of image data.
  • in step 311, the operation of generating the human posture confidence maps of the multi-frame image data is ended.
  • the human posture detection model provided by the embodiments of the present application is generated by training a convolutional neural network applied to an embedded platform.
  • the current frame of image data is input into a pre-trained human posture detection model, and multiple human posture reference maps are output with reference to the human posture confidence map of the previous frame of image data.
  • the human posture detection model is generated by training a convolutional neural network applied to an embedded platform.
  • the human posture key points are identified in the human posture reference maps, the human posture confidence map is generated according to the credibility of the human posture key points, and it is determined whether the current frame of image data is the last frame of image data.
  • when the current frame of image data is not the last frame of image data, the human posture confidence map of the current frame is input into the human posture detection model to participate in generating the human posture confidence map of the next frame of image data.
  • when the current frame of image data is the last frame of image data, the operation of generating the human posture confidence maps of the multi-frame image data is ended, and human posture detection on an embedded platform is realized.
  • the output result of the image data of the previous frame is introduced into the prediction process of the output result of the image data of the current frame, which improves the prediction accuracy.
  • FIG. 4 is a schematic structural diagram of a human body posture detection device provided by an embodiment of the present application. This embodiment can be applied to detect human body posture.
  • the device can be implemented in at least one of software and hardware.
  • the device can be configured in a device, typically a computer or a mobile terminal.
  • the device includes an image data collection module 410, a human posture reference map output module 420, a human posture key point recognition module 430, a human posture confidence map generation module 440, a judgment module 450, a first execution module 460, and a second execution module 470.
  • the image data collection module 410 is configured to collect multiple frames of image data.
  • the human posture reference map output module 420 is configured to input the image data of the current frame into the pre-trained human posture detection model and, with reference to the human posture confidence map of the previous frame of image data, output multiple human posture reference maps; the human posture detection model is generated by training a convolutional neural network applied to an embedded platform.
  • the human body posture key point recognition module 430 is configured to recognize the human posture key point in each human body posture reference picture.
  • the human pose confidence map generation module 440 is configured to generate a human pose confidence map of the current frame of image data according to the credibility of key points of the human pose.
  • the judgment module 450 is set to judge whether the image data of the current frame is the last frame of image data.
  • the first execution module 460 is configured to input the human posture confidence map of the current frame of image data into the human posture detection model when the current frame of image data is not the last frame of image data, to participate in generating the human posture confidence map of the next frame of image data.
  • the second execution module 470 is configured to end the operation of generating a human body pose confidence map of multiple frames of image data when the current frame of image data is the last frame of image data.
  • the current frame of image data is input into a pre-trained human posture detection model, and multiple human posture reference maps are output with reference to the human posture confidence map of the previous frame of image data.
  • the human posture detection model is generated by training a convolutional neural network applied to an embedded platform; the human posture key point is identified in each human posture reference map, the human posture confidence map of the current frame of image data is generated according to the credibility of the human posture key points, and it is determined whether the current frame of image data is the last frame of image data.
  • when the current frame of image data is not the last frame of image data, the human posture confidence map of the current frame of image data is input into the human posture detection model.
  • when the current frame of image data is the last frame of image data, the operation of generating the human posture confidence maps of the multi-frame image data is ended.
  • human posture detection is thereby carried out on an embedded platform.
  • the output result of the image data of the previous frame is introduced into the prediction process of the output result of the image data of the current frame, which improves the prediction accuracy.
  • the human posture reference map output module 420 includes a confidence map credibility judgment unit, a first human posture reference map output unit, and a second human posture reference map output unit.
  • the confidence map credibility judgment unit is set to judge whether the human posture confidence map of the previous frame of image data is credible.
  • the first human posture reference map output unit is set to input the current frame image data and the human posture confidence map of the previous frame image data into the pre-trained human posture detection model and output multiple human posture reference maps when the human posture confidence map of the previous frame image data is credible.
  • the second human posture reference map output unit is set to input the current frame image data and the preset image data into the pre-trained human posture detection model and output multiple human posture reference maps when the human posture confidence map of the previous frame image data is not credible.
  • each human posture reference map includes multiple candidate points of human posture key points, the coordinate position of each candidate point corresponds to a probability value, and the human posture key point identification module 430 includes a human posture key point identification unit.
  • the human posture key point recognition unit is set to determine the coordinate position corresponding to the maximum probability value among the multiple probability values corresponding to the coordinate positions of multiple candidate points in the human posture reference picture, and use the candidate point corresponding to the coordinate position as the human posture key point.
  • the human pose confidence map generation module 440 includes a human pose key point credibility judgment unit, a first human pose confidence map generation unit, and a second human pose confidence map generation unit.
  • the credibility judgment unit of the key points of the human posture is set to judge whether the key points of the human posture are credible.
  • the first human posture confidence map generation unit is set to generate, when the human posture key point is credible, a mask map centered on the human posture key point as the human posture confidence map.
  • the second human posture confidence map generating unit is set to use the preset image data as the human posture confidence map when the key points of the human posture are not credible.
  • the credibility judgment unit of the human posture key point is set as follows:
  • when the probability value corresponding to the human posture key point is greater than the preset threshold, it is determined that the human posture key point is credible.
  • when the probability value corresponding to the human posture key point is less than or equal to the preset threshold, it is determined that the human posture key point is not credible.
  • the human posture detection model includes a main path, a first branch, and a second branch; the main path includes a residual module and an upsampling module, the first branch includes a refining network module, and the second branch includes a feedback module.
  • the image data of the current frame is input to the residual module for processing, with reference to the result obtained by inputting the human posture confidence map of the previous frame of image data to the feedback module for processing, to obtain the first convolution result.
  • the first convolution result output by the residual module is input to the upsampling module for processing to obtain a second convolution result, and the first convolution result output by the residual module is input to the refining network module for processing to obtain a third convolution result .
  • the second convolution result and the third convolution result are added to output a plurality of human body posture reference pictures.
  • the residual module includes a first residual unit, a second residual unit, and a third residual unit.
  • the image data of the current frame is input to the first residual unit for processing to obtain a first intermediate result.
  • the result of inputting the first intermediate result to the second residual unit for processing and the result of inputting the human posture confidence map of the previous frame of image data to the feedback module for processing are added to obtain the second intermediate result.
  • the second intermediate result is input to the third residual unit for processing, and the third intermediate result is obtained as the first convolution result.
  • the number of channels of the first intermediate result, the second intermediate result, and the third intermediate result increase in sequence.
  • the human pose detection model includes a third branch.
  • the first convolution result output by the residual module is input to the upsampling module for processing to obtain a second convolution result
  • the first convolution result output by the residual module is input to the refining network module for processing to obtain a third convolution result
  • the first intermediate result is input to the third branch for processing to obtain a fourth intermediate result.
  • the second intermediate result is input to the third branch for processing to obtain a fifth intermediate result.
  • the third intermediate result and the fifth intermediate result are input to the up-sampling module for processing to obtain the sixth intermediate result.
  • the fourth intermediate result and the sixth intermediate result are input to the up-sampling module for processing to obtain the seventh intermediate result as the second convolution result.
  • the first convolution result output by the residual module is input to the refining network module for processing to obtain a third convolution result.
  • the number of channels of the sixth intermediate result and the seventh intermediate result decrease sequentially.
  • the method further includes:
  • the first convolution result and the second convolution result are added to obtain the target result.
  • the target result is used to improve the accuracy of the human posture detection model when training the human posture detection model.
  • the human body posture detection device provided by the embodiment of the present application may execute the human body posture detection method provided by any embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a device provided by an embodiment of the present application.
  • FIG. 5 shows a block diagram of an exemplary device 512 suitable for implementing embodiments of the present application.
  • the components of the device 512 may include, but are not limited to: at least one processor 516, a system memory 528, and a bus 518 connected to different system components (including the system memory 528 and the processor 516).
  • the system memory 528 may include a computer system readable medium in the form of volatile memory, such as at least one of a random access memory (Random Access Memory, RAM) 530 and a cache memory 532.
  • the device 512 may include other removable/non-removable, volatile/nonvolatile computer system storage media.
  • the storage system 534 may provide a hard disk drive, a magnetic disk drive, and an optical disk drive. In these cases, each drive may be connected to the bus 518 through at least one data medium interface.
  • a program/utility tool 540 having a set of (at least one) program modules 542 may be stored in, for example, the memory 528, and the program modules 542 generally perform functions and/or methods in the embodiments described in this application.
  • the device 512 may also communicate with at least one external device 514 (e.g., keyboard, pointing device, display 524, etc.), with at least one device that enables a user to interact with the device 512, and/or with any device (e.g., network card, modem, etc.) that enables the device 512 to communicate with at least one other computing device. Such communication may be performed through an input/output (I/O) interface 522.
  • the device 512 can also communicate with at least one network (such as a local area network (Local Area Network, LAN), a wide area network (Wide Area Network, WAN), and/or a public network, such as the Internet) through the network adapter 520.
  • the processor 516 runs programs stored in the system memory 528 to execute various functional applications and data processing, for example, to implement the human body posture detection method provided by any embodiment of the present application.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored.
  • when the computer program is executed by a processor, a human body posture detection method as provided in any embodiment of the present application is implemented.

Abstract

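The present application discloses a human posture detection method, apparatus, device and storage medium. The method includes: collecting multiple frames of image data; inputting the current frame of image data into a pre-trained human posture detection model and, with reference to the human posture confidence map of the previous frame of image data, outputting multiple human posture reference maps, where the human posture detection model is generated by training a convolutional neural network applicable to an embedded platform; identifying a human posture key point in each human posture reference map; generating the human posture confidence map of the current frame of image data according to the credibility of the human posture key points; determining whether the current frame of image data is the last frame of image data; if the current frame is not the last frame, inputting the human posture confidence map of the current frame into the human posture detection model to participate in generating the human posture confidence map of the next frame of image data; and if the current frame is the last frame, ending the operation of generating human posture confidence maps for the multiple frames of image data.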

Description

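Human posture detection method, apparatus, device and storage medium
This application claims priority to Chinese patent application No. 201811427578.X, filed with the China Patent Office on November 27, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present application relate to human posture detection technology, for example, to a human posture detection method, apparatus, device and storage medium.
Background
Human posture detection is among the most challenging research directions in computer vision and is widely used in human-computer interaction, intelligent surveillance, virtual reality, human behavior analysis and other fields. However, the local image features around the key points that make up a human posture undergo multi-scale affine transformations, and images are easily affected by the subject's clothing, the camera angle and distance, illumination changes, partial occlusion and other factors, so research on human posture detection has progressed slowly.
In the related art, human posture detection is usually performed with convolutional neural networks. To obtain high recognition accuracy, a large number of training samples usually have to be collected and the human posture detection model has to undergo lengthy supervised learning.
In the course of implementing the present application, the applicant found at least the following problem in the related art: because embedded platforms have no graphics processing unit (GPU) to optimize the convolution operations, which account for the largest share of computation in a convolutional neural network, a large number of human posture detection methods based on convolutional neural networks cannot be applied to embedded platforms.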
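Summary
Embodiments of the present application provide a human posture detection method, apparatus, device and storage medium, so as to realize human posture detection on an embedded platform.
In a first aspect, an embodiment of the present application provides a human posture detection method, including: collecting multiple frames of image data; inputting the current frame of image data into a pre-trained human posture detection model and, with reference to the human posture confidence map of the previous frame of image data, outputting multiple human posture reference maps, where the human posture detection model is generated by training a convolutional neural network applicable to an embedded platform; identifying a human posture key point in each human posture reference map; generating the human posture confidence map of the current frame of image data according to the credibility of the human posture key points; determining whether the current frame of image data is the last frame of image data; if the current frame is not the last frame, inputting the human posture confidence map of the current frame into the human posture detection model to participate in generating the human posture confidence map of the next frame of image data; and if the current frame is the last frame, ending the operation of generating human posture confidence maps for the multiple frames of image data.
In a second aspect, an embodiment of the present application provides a human posture detection apparatus, including: an image data collection module, set to collect multiple frames of image data; a human posture reference map output module, set to input the current frame of image data into a pre-trained human posture detection model and, with reference to the human posture confidence map of the previous frame of image data, output multiple human posture reference maps, where the human posture detection model is generated by training a convolutional neural network applicable to an embedded platform; a human posture key point recognition module, set to identify a human posture key point in each human posture reference map; a human posture confidence map generation module, set to generate a human posture confidence map according to the credibility of the human posture key points; a judgment module, set to determine whether the current frame of image data is the last frame of image data; a first execution module, set to input the human posture confidence map of the current frame into the human posture detection model when the current frame is not the last frame, to participate in generating the human posture confidence map of the next frame of image data; and a second execution module, set to end the operation of generating human posture confidence maps for the multiple frames of image data when the current frame is the last frame.
In a third aspect, an embodiment of the present application further provides a device, including: at least one processor; and a memory set to store at least one program; when the at least one program is executed by the at least one processor, the at least one processor implements the method described in the first aspect of the embodiments of the present application.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the method described in the first aspect of the embodiments of the present application is implemented.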
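Brief Description of the Drawings
FIG. 1 is a flowchart of a human posture detection method in an embodiment of the present application;
FIG. 2 is a schematic diagram of the application of a convolutional neural network in an embodiment of the present application;
FIG. 3 is a flowchart of another human posture detection method in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a human posture detection apparatus in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a device in an embodiment of the present application.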
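Detailed Description
Computer vision means having a computer simulate human visual functions so that it can understand the objective world by observation, as a person does. Its main research topic is how to use computer vision techniques to solve human-centered problems, including object recognition, face recognition, human detection and tracking, human posture detection and human motion analysis. Human posture detection is an important component of human behavior recognition and an important research topic for human behavior recognition systems. Its ultimate goal is to output structural parameters of the whole body or of individual limbs, such as the body contour, the position and orientation of the head, and the positions or part categories of human key points. It has important applications in many areas, for example athlete action recognition, animated character production, and content-based image and video retrieval.
For human posture detection, the human body can be regarded as a set of parts connected at key points, and the posture can be determined by obtaining the position of each key point, where the position of a key point can be represented by two-dimensional plane coordinates. Human posture detection usually needs to locate 14 key points: the head, neck, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle and right ankle.
In the conventional art, human posture detection can be performed with methods based on convolutional neural networks. The core problem a convolutional neural network solves is how to extract and abstract features automatically and then map those features to the task objective to solve a practical problem. A convolutional neural network generally consists of three parts: the first is the input layer; the second is a combination of convolutional layers, activation layers and pooling (or downsampling) layers; the third is a fully connected multilayer perceptron classifier. Convolutional neural networks share weights: the convolution operation of a single convolution kernel extracts the same feature at different positions across the whole image. In other words, identical targets at different positions in a frame of image data have essentially the same local features. One convolution kernel yields only one kind of feature, so multiple kernels can be configured, each learning a different feature, to extract the features of the image data. In image processing, the role of the convolutional layers is to extract low-level features and aggregate them into high-level features. Low-level features are basic, local features such as textures and edges; high-level features, such as the shapes of faces and objects, better express the global attributes of a sample. This process is the hierarchical generalization of the target object by the convolutional neural network.
For a human posture detection method based on a convolutional neural network to run on an embedded platform, the network must have a small computational cost, run fast, and have prediction accuracy that meets practical requirements.
To avoid the situation in which such a method cannot run on an embedded platform, the convolutional neural network can be improved, for example by using a lightweight convolutional neural network. The convolutional neural network provided by the embodiments of the present application is such a lightweight convolutional neural network, that is, a convolutional neural network that can be applied to an embedded platform.
The human posture detection method is described below with reference to specific embodiments.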
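FIG. 1 is a flowchart of a human posture detection method provided by an embodiment of the present application. This embodiment is applicable to detecting human posture. The method can be performed by a human posture detection apparatus, which can be implemented in at least one of software and hardware and can be configured in a device, typically a computer or a mobile terminal. As shown in FIG. 1, the method includes steps 110 to 170.
In step 110, multiple frames of image data are collected.
In the embodiments of the present application, a video can be understood as consisting of at least one frame of image data. To recognize the human posture in a video, the video can therefore be split into individual frames of image data and each frame analyzed separately. Here the multiple frames of image data belong to the same video; in other words, the video includes multiple frames of image data. The frames can be named in chronological order. For example, if a video includes N frames of image data, N≥1, the N frames can be called, in chronological order, the first frame of image data, the second frame of image data, ..., the (N-1)-th frame of image data and the N-th frame of image data.
Once the video has been split into multiple frames of image data, the frames can be processed one by one in chronological order. The frame being processed is called the current frame of image data, the frame before it the previous frame of image data, and the frame after it the next frame of image data. If the current frame is the first frame, it has a next frame but no previous frame; if the current frame is the last frame, it has a previous frame but no next frame; if the current frame is neither the first nor the last frame, it has both a previous frame and a next frame.
The frames are processed in chronological order because, for human posture detection, two adjacent frames of image data may be correlated: if a key point was identified at a certain position in the previous frame, the same key point is likely to appear near that position in the current frame. In other words, if the detection result of the previous frame satisfies a preset condition, the current frame can be processed with reference to the detection result of the previous frame.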
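In step 120, the current frame of image data is input into a pre-trained human posture detection model and, with reference to the human posture confidence map of the previous frame of image data, multiple human posture reference maps are output. The human posture detection model is generated by training a convolutional neural network applicable to an embedded platform.
In the embodiments of the present application, a human posture confidence map may be an image that includes a human posture key point, or may be understood as an image generated from a human posture key point, such as an image generated with the key point at its center. The human posture key points here may be the 14 key points described above: the head, neck, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle and right ankle.
A human posture reference map may contain two kinds of information: the positions of multiple points that may be the human posture key point, and the probability value corresponding to each position. A point that may be the human posture key point is called a candidate point. Accordingly, a human posture reference map may include the positions of multiple candidate points and the probability value for each position, that is, each candidate point corresponds to one probability value, and a position can be expressed as coordinates. The probability values determine which candidate point is taken as the human posture key point, for example the candidate point with the largest probability value. Suppose a human posture reference map includes candidate point A at position (x_A, y_A) with probability value P_A, candidate point B at position (x_B, y_B) with probability value P_B, and candidate point C at position (x_C, y_C) with probability value P_C, where P_A < P_B < P_C. Candidate point C is then taken as the human posture key point.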
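The selection rule above can be illustrated with a minimal sketch. It assumes, for illustration only, that a reference map is stored as a 2D list of probabilities indexed `[y][x]`; the function name `pick_key_point` and the probability values are hypothetical and not part of the application.

```python
from typing import List, Tuple


def pick_key_point(reference_map: List[List[float]]) -> Tuple[int, int, float]:
    """Return (x, y, probability) of the candidate point with the largest probability."""
    best_x, best_y, best_p = 0, 0, float("-inf")
    for y, row in enumerate(reference_map):
        for x, p in enumerate(row):
            if p > best_p:  # strict '>' keeps the first maximum if values tie
                best_x, best_y, best_p = x, y, p
    return best_x, best_y, best_p


# Candidates A, B and C with P_A < P_B < P_C (made-up values).
reference_map = [
    [0.0, 0.2, 0.0],  # A at (x=1, y=0)
    [0.5, 0.0, 0.0],  # B at (x=0, y=1)
    [0.0, 0.0, 0.9],  # C at (x=2, y=2)
]
key_point = pick_key_point(reference_map)  # candidate C is selected
```

Ties are resolved here by scan order, which is a simplification; the description elsewhere suggests resolving ties by checking whether the resulting joint connections are plausible.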
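It should be noted that each human posture confidence map corresponds to one human posture key point, and each human posture reference map includes multiple candidate points for one particular key point. For example, one reference map includes multiple candidate points for the left elbow, and another includes multiple candidate points for the left knee. It follows that if N key points are to be determined from a frame of image data, there are N corresponding human posture reference maps and N human posture confidence maps.
The pre-trained human posture detection model can be generated by training a convolutional neural network applicable to an embedded platform, that is, a lightweight convolutional neural network, on a set number of groups of training samples. The human posture detection model may include a main path, a first branch, a second branch and a third branch. The main path may include a residual module and an upsampling module, the first branch may include a refining network module, and the second branch may include a feedback module. The residual module may include a first residual unit, a second residual unit and a third residual unit. The components of the human posture detection model are described in detail later.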
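Inputting the current frame of image data into the pre-trained human posture detection model and, with reference to the human posture confidence map of the previous frame, outputting multiple human posture reference maps can be done in the following two ways:
Case 1: the current frame of image data is input as an input variable into the pre-trained human posture detection model to obtain multiple first human posture reference maps, and multiple human posture reference maps are output according to the multiple human posture confidence maps obtained from the previous frame of image data. For each first human posture reference map, one human posture reference map of the current frame is output according to the corresponding confidence map among those obtained from the previous frame, where the correspondence is determined by whether the key points are the same. For example, if a first human posture reference map of the current frame targets the left elbow, it refers to the confidence map of the previous frame whose key point is the left elbow.
In Case 1, the human posture confidence maps of the previous frame are not input into the pre-trained model together with the current frame as input variables. Instead, after the current frame has been input into the model and the first human posture reference maps have been obtained, each first reference map is checked for credibility in turn against the confidence maps of the previous frame. If a first reference map is credible, it is used as the human posture reference map of the current frame; if it is not credible, the previous frame's confidence map corresponding to that first reference map is used as the human posture reference map of the current frame.
Case 2: the current frame of image data and the human posture confidence maps of the previous frame of image data are input as input variables into the pre-trained human posture detection model, and multiple human posture reference maps are output.
In Case 2, the human posture confidence maps of the previous frame are also input variables and are fed into the pre-trained model together with the current frame. In a video, two adjacent frames of image data are correlated, so feeding the result of the previous frame into the model as feedback, where it takes part in predicting the output for the current frame, can improve the prediction accuracy of the human posture detection model.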
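It should be noted that, in Case 2, the prediction accuracy of the human posture detection model can be improved as follows: determine whether the human posture confidence map of the previous frame is credible; if it is credible, input the current frame of image data and the confidence map of the previous frame into the pre-trained model and output multiple human posture reference maps; if it is not credible, input the current frame of image data and preset image data into the pre-trained model and output multiple human posture reference maps, or alternatively input only the current frame of image data into the pre-trained model and output multiple human posture reference maps. The preset image data is image data that contains no prior knowledge, such as an all-black image, which in matrix form is an all-zero matrix. For the output of the current frame, the confidence map of the previous frame is the image data that carries prior knowledge; for the output of the next frame, the confidence map of the current frame is the image data that carries prior knowledge.
This improves prediction accuracy for the following reason. If the confidence map of the previous frame is not credible, it is unreliable, and feeding it into the pre-trained model anyway would not improve the prediction accuracy and might reduce it. The confidence map of the previous frame that is fed into the model must therefore be credible, so its credibility is checked before deciding whether to refer to it: if it is credible, it is input into the pre-trained model as an input variable; if it is not credible, it is not used as an input variable. Whether the confidence map of the previous frame is credible can be determined as follows: the human posture key point is identified in the reference map of the previous frame; if the probability value of the key point is greater than a preset threshold, a mask map centered on the key point is generated as the confidence map of the previous frame, and that confidence map is deemed credible; if the probability value is less than or equal to the preset threshold, the preset image data is used as the confidence map, and the confidence map of the previous frame is deemed not credible.
It should also be noted that the multiple human posture reference maps are the output for the current frame of image data, that is, the current frame corresponds to multiple reference maps. For example, if N key points are determined in the current frame, N reference maps are output, and the confidence maps of the previous frame used as reference also number N.
In addition, determining whether the confidence map of the previous frame is credible means checking each confidence map of the previous frame separately. Because a confidence map may be an image that includes a key point, and different key points correspond to different confidence maps, the credibility condition may be the same or different for different key points; it can be set according to the actual situation and is not limited here.
Furthermore, if the current frame is the first frame of image data and therefore has no previous frame, the current frame alone can be input into the pre-trained human posture detection model, or the current frame and the preset image data can be input into the pre-trained model.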
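In some embodiments, inputting the current frame of image data into the pre-trained human posture detection model and, with reference to the human posture confidence map of the previous frame, outputting multiple human posture reference maps includes: determining whether the confidence map of the previous frame is credible; if it is credible, inputting the current frame of image data and the confidence map of the previous frame into the pre-trained model and outputting multiple human posture reference maps; and if it is not credible, inputting the current frame of image data and the preset image data into the pre-trained model and outputting multiple human posture reference maps.
In the embodiments of the present application, this check is used to improve the prediction accuracy of the human posture detection model: a credible confidence map of the previous frame is input together with the current frame, whereas a confidence map that is not credible is replaced by the preset image data.
This ensures that any confidence map of the previous frame that is fed into the pre-trained model as an input variable is credible, so that the prior knowledge it provides improves the accuracy of the model's prediction for the current frame.
For example, suppose the previous frame has N human posture confidence maps and each is checked separately, with the result that x confidence maps are credible and (N-x) are not. The x credible confidence maps, (N-x) copies of the preset image data and the current frame of image data are then input into the human posture detection model, which outputs multiple human posture reference maps.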
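The substitution of preset image data for non-credible confidence maps can be sketched as follows. This is a minimal illustration, assuming each map is a 2D list of floats and that the credibility of each map has already been decided; `build_feedback_inputs` and `zero_map` are hypothetical helper names.

```python
from typing import List

Map = List[List[float]]


def zero_map(height: int, width: int) -> Map:
    """Preset image data: an all-black image, i.e. an all-zero matrix."""
    return [[0.0] * width for _ in range(height)]


def build_feedback_inputs(prev_maps: List[Map], credible: List[bool]) -> List[Map]:
    """Keep the credible confidence maps and replace the others with zero maps."""
    if len(prev_maps) != len(credible):
        raise ValueError("one credibility flag is required per confidence map")
    inputs = []
    for conf_map, ok in zip(prev_maps, credible):
        if ok:
            inputs.append(conf_map)
        else:
            inputs.append(zero_map(len(conf_map), len(conf_map[0])))
    return inputs


# N = 3 maps from the previous frame, x = 2 of them credible.
prev_maps = [[[1.0, 0.2]], [[0.4, 0.1]], [[0.0, 0.9]]]
feedback = build_feedback_inputs(prev_maps, [True, False, True])
```

The returned list always has N entries, so the model input keeps a fixed shape regardless of how many maps were judged credible.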
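In some embodiments, before inputting the current frame of image data into the pre-trained human posture detection model and outputting multiple human posture reference maps with reference to the confidence map of the previous frame, the method includes: preprocessing each frame of image data to obtain processed image data.
In the embodiments of the present application, preprocessing may include normalization and whitening. Normalization converts the original image to be processed into a corresponding unique standard form through a series of transformations, that is, it uses the invariant moments of the image to find a set of parameters that eliminate the influence of other transformation functions on the image; the standard-form image is invariant to affine transformations such as translation, rotation and scaling. Normalization usually includes the following steps: coordinate centering, x-shearing normalization, scaling normalization and rotation normalization. The human posture detection model can be generated by training a neural network; normalizing the image data before the current frame is input into the model unifies the statistical distribution of the samples, which speeds up network learning and ensures that small values in the output data are not swamped.
Because adjacent pixels in image data are strongly correlated, the raw image is redundant as an input variable. Whitening reduces this redundancy. More precisely, after whitening the input variables have the following properties: the features are weakly correlated with each other, and all features have the same variance, which in image processing is usually set to unit variance.
After preprocessing, the current frame of image data that is input into the pre-trained human posture detection model is the processed image data. The previous frame of image data is likewise processed image data.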
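In step 130, a human posture key point is identified in each human posture reference map.
In the embodiments of the present application, as described above, a human posture reference map may contain two kinds of information: the position of each point that may be the human posture key point, and the probability value corresponding to that position. The human posture key point is the point that is finally determined to be the key point, and the points that may be the key point are called candidate points.
A human posture reference map thus includes the positions of multiple candidate points and their probability values, and the probability values determine which candidate point is taken as the human posture key point, for example the candidate point with the largest probability value.
In some embodiments, each human posture reference map includes multiple candidate points of a human posture key point, and the coordinate position of each candidate point corresponds to a probability value. Identifying the human posture key point in the reference map includes: determining, in the reference map, the coordinate position corresponding to the largest of the probability values of the candidate points, and taking the candidate point at that coordinate position as the human posture key point.
In the embodiments of the present application, because the reference map includes the positions of the points that may be the key point and the corresponding probability values, the probability values determine which point is taken as the key point. For example, the coordinate position with the largest probability value in the reference map is determined and taken as the human posture key point.
It should be noted that each human posture reference map has only one human posture key point. When the key point is determined by probability value, at least two probability values in the reference map may be equal and larger than all others. In that case, the coordinate position to use can be decided according to the actual situation, for example whether the resulting joint connections are plausible. Suppose two probability values in a reference map are equal and larger than all others, at coordinate positions A and B. Each of A and B is taken as the key point in turn and the plausibility of the joint connections is checked. If the joint connections are implausible with A as the key point and plausible with B as the key point, B is determined to be the human posture key point.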
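In step 140, a human posture confidence map is generated according to the credibility of the human posture key point.
In the embodiments of the present application, a key point is either credible or not credible. The criterion can be whether the probability value of the key point is greater than a preset threshold: if the probability value is greater than the preset threshold, the key point is credible; if it is less than or equal to the preset threshold, the key point is not credible.
On this basis, if the key point is credible, a mask map centered on the key point is generated as the human posture confidence map; if the key point is not credible, the preset image data can be used as the human posture confidence map. The preset image data here is the same as that described above: it can be an all-black image, which in matrix form is an all-zero matrix.
It should be noted that when a key point is determined not to be credible, the corresponding key point of the previous frame of image data is used as the key point of the current frame. However, the confidence map for such a key point is not generated from the corresponding key point of the previous frame; the preset image data is used as the confidence map instead.
In some embodiments, generating the human posture confidence map according to the credibility of the human posture key point includes: determining whether the key point is credible; if it is credible, generating a mask map centered on the key point as the human posture confidence map; and if it is not credible, using the preset image data as the human posture confidence map.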
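In the embodiments of the present application, a mask map is the image obtained by applying image masking to an image. Image masking means occluding all or part of the image to be processed with a selected image, shape or object in order to control the region or the process of image processing. The specific image or object used for covering is called a mask or template. In digital image processing, a mask can be a two-dimensional matrix or a multi-valued image. Image masks are used for: first, extracting a region of interest, where a prepared region-of-interest mask is multiplied with the image to be processed so that values inside the region are unchanged and values outside it are zero; second, shielding, where a mask shields certain regions of the image so that they are excluded from processing or from the computation of processing parameters, or so that only the masked region is processed or counted; third, structural feature extraction, where similarity templates or image matching are used to detect and extract structural features in the image that are similar to the mask; and fourth, producing images of special shapes.
Generating the human posture confidence map from the reference map according to the credibility of the key point includes: if the key point is credible, generating a mask map centered on the key point as the confidence map. For example, if the key point is credible, a Gaussian kernel is used to generate a mask map centered on the key point as the confidence map. The region affected by the mask map can be set through the parameters of the Gaussian kernel, which include the width and height of the filter window; the Gaussian kernel can be a two-dimensional Gaussian kernel. For example, for a two-dimensional Gaussian kernel whose filter window has width 7 and height 7, the region affected by the mask map is a 7×7 square.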
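A Gaussian mask of this kind can be sketched as below. The 7×7 window follows the example in the text; the standard deviation `sigma`, the peak value of 1.0 and the function name `gaussian_mask` are assumptions made for illustration, since the description does not specify them.

```python
import math
from typing import List


def gaussian_mask(height: int, width: int, cx: int, cy: int,
                  window: int = 7, sigma: float = 1.5) -> List[List[float]]:
    """Build a height x width map that is zero except for a Gaussian bump
    inside a window x window square centered on the key point (cx, cy)."""
    half = window // 2
    mask = [[0.0] * width for _ in range(height)]
    # Clip the window to the image borders.
    for y in range(max(0, cy - half), min(height, cy + half + 1)):
        for x in range(max(0, cx - half), min(width, cx + half + 1)):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            mask[y][x] = math.exp(-d2 / (2.0 * sigma ** 2))
    return mask


confidence_map = gaussian_mask(16, 16, cx=8, cy=8)
affected = sum(1 for row in confidence_map for v in row if v > 0.0)  # 49 pixels
```

Pixels outside the window stay at zero, so a map for a non-credible key point (the all-zero preset image) is simply the degenerate case with no bump at all.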
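It should be noted that when the key point is not credible, the preset image data can be used as the human posture confidence map, and the preset image data can itself be regarded as a kind of mask map. The preset image data here is the same as that described above: it can be an all-black image, which in matrix form is an all-zero matrix.
In some embodiments, determining whether the human posture key point is credible includes: determining whether the probability value of the key point is greater than a preset threshold; if it is greater than the preset threshold, determining that the key point is credible; and if it is less than or equal to the preset threshold, determining that the key point is not credible.
In the embodiments of the present application, the threshold can be set according to the actual situation and is not limited here. The thresholds for different key points may be the same or different and can likewise be set according to the actual situation: a larger threshold can be set for an important key point and a smaller one for a less important key point. For example, the threshold may be 0.9 when the key point is the top of the head and 0.5 when the key point is the left knee.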
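Per-key-point thresholds can be kept in a simple lookup table, as in the following sketch. The two threshold values come from the example above; the dictionary keys, the default value and the function name `is_credible` are illustrative assumptions.

```python
# Thresholds from the example in the text; the key names are hypothetical.
THRESHOLDS = {"head": 0.9, "left_knee": 0.5}
DEFAULT_THRESHOLD = 0.5  # assumed fallback for key points not listed above


def is_credible(key_point_name: str, probability: float) -> bool:
    """A key point is credible only if its probability strictly exceeds its threshold."""
    return probability > THRESHOLDS.get(key_point_name, DEFAULT_THRESHOLD)


# The same probability can be credible for one key point and not for another.
head_ok = is_credible("head", 0.85)       # False: 0.85 <= 0.9
knee_ok = is_credible("left_knee", 0.85)  # True: 0.85 > 0.5
```

A probability exactly equal to the threshold counts as not credible, matching the "less than or equal to" branch of the rule.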
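In step 150, it is determined whether the current frame of image data is the last frame of image data; if it is not the last frame, step 160 is performed; if it is the last frame, step 170 is performed.
In step 160, the human posture confidence map of the current frame of image data is input into the human posture detection model to participate in generating the human posture confidence map of the next frame of image data.
In step 170, the operation of generating human posture confidence maps for the multiple frames of image data is ended.
In the embodiments of the present application, it is determined whether the current frame is the last frame. If it is not, the confidence map of the current frame can be input into the human posture detection model as a reference for the output of the next frame, to improve the accuracy of that output. That is, the next frame of image data is input into the pre-trained model and, with reference to the confidence map of the current frame, multiple human posture reference maps of the next frame are output; the key points are identified in those reference maps, and confidence maps are generated according to the credibility of the key points.
It should be noted that if the current frame is the last frame, the operation of generating confidence maps for the multiple frames can be ended, and the confidence map obtained no longer needs to be input into the human posture detection model. On this basis, when the current frame is the last frame, it is sufficient to perform step 120 and step 130 and to determine whether the key point is credible, using the corresponding key point of the previous frame as the key point if it is not credible. More generally, each pass through step 120, step 130 and the credibility check, with the corresponding key point of the previous frame substituted when the key point is not credible, yields the key points of the current frame of image data.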
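It should also be noted that steps 120 to 150 all process the current frame of image data. Accordingly, the human posture reference maps in steps 120 and 130 are those of the current frame, the human posture key points in steps 130 and 140 are those of the current frame, and the human posture confidence maps in steps 140 and 150 are those of the current frame.
Because the current frame of image data is whichever frame is being processed, the first frame is the current frame while it is being processed, the second frame is the current frame while it is being processed, and so on. In other words, the current frame can be the first frame, the second frame, the third frame, ..., the (N-1)-th frame or the N-th frame of image data.
Suppose a video includes N frames of image data, N≥1. While the current frame is not the N-th frame, steps 120 to 140 can be repeated, which completes the processing of the first to (N-1)-th frames. When the current frame is the N-th frame, it is sufficient to perform steps 120 and 130 and, if the key point is not credible, to use the corresponding key point of the previous frame as the key point.
In the technical solution of this embodiment, multiple frames of image data are collected; the current frame is input into the pre-trained human posture detection model and, with reference to the confidence map of the previous frame, multiple human posture reference maps are output, the model being generated by training a convolutional neural network applicable to an embedded platform; the key points are identified in the reference maps; confidence maps are generated according to the credibility of the key points; and it is determined whether the current frame is the last frame. If it is not, the confidence map of the current frame is input into the model to participate in generating the confidence map of the next frame; if it is, the operation of generating confidence maps for the multiple frames is ended. This realizes human posture detection on an embedded platform, and introducing the output of the previous frame into the prediction of the output of the current frame improves the prediction accuracy.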
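The per-frame loop of steps 110 to 170 can be sketched as follows. This is a simplified illustration under several assumptions: the model is any callable `model(frame, feedback)` returning one reference map per key point (a hypothetical signature), a single threshold is used for all key points, and a one-hot mask stands in for the Gaussian mask map. None of the function names come from the application.

```python
from typing import Callable, List, Optional, Tuple

Map = List[List[float]]
KeyPoint = Tuple[int, int, float]  # (x, y, probability)


def argmax2d(m: Map) -> KeyPoint:
    best: KeyPoint = (0, 0, float("-inf"))
    for y, row in enumerate(m):
        for x, p in enumerate(row):
            if p > best[2]:
                best = (x, y, p)
    return best


def one_hot_mask(m: Map, kp: KeyPoint) -> Map:
    mask = [[0.0] * len(row) for row in m]
    mask[kp[1]][kp[0]] = 1.0  # simplified stand-in for a Gaussian mask
    return mask


def zeros_like(m: Map) -> Map:
    return [[0.0] * len(row) for row in m]  # preset image data


def detect_video(frames: List[Map],
                 model: Callable[[Map, Optional[List[Map]]], List[Map]],
                 threshold: float = 0.5) -> List[List[KeyPoint]]:
    results: List[List[KeyPoint]] = []
    feedback: Optional[List[Map]] = None           # the first frame has no previous frame
    for frame in frames:                            # frames are processed in time order
        reference_maps = model(frame, feedback)     # step 120
        key_points, next_feedback = [], []
        for i, ref in enumerate(reference_maps):
            kp = argmax2d(ref)                      # step 130
            if kp[2] > threshold:                   # step 140: credible key point
                next_feedback.append(one_hot_mask(ref, kp))
            else:                                   # not credible: preset image data
                next_feedback.append(zeros_like(ref))
                if results:                         # reuse the previous frame's key point
                    kp = results[-1][i]
            key_points.append(kp)
        results.append(key_points)
        feedback = next_feedback                    # step 160; the loop ends at step 170
    return results
```

In this sketch the confidence maps of the last frame are still computed but never consumed, which mirrors the remark that they need not be fed back once the last frame has been processed.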
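In some embodiments, the human posture detection model includes a main path, a first branch and a second branch. The main path includes a residual module and an upsampling module, the first branch includes a refining network module, and the second branch includes a feedback module.
Inputting the current frame of image data into the pre-trained human posture detection model and, with reference to the confidence map of the previous frame, outputting multiple human posture reference maps includes: inputting the current frame of image data into the residual module for processing, with reference to the result obtained by inputting the confidence map of the previous frame into the feedback module, to obtain a first convolution result; inputting the first convolution result output by the residual module into the upsampling module for processing to obtain a second convolution result, and inputting the first convolution result into the refining network module for processing to obtain a third convolution result; and adding the second convolution result and the third convolution result to output multiple human posture reference maps.
In the embodiments of the present application, the residual module is set to extract features of the image data such as edges and contours, and the upsampling module is set to extract context information of the image data. The refining network module is set to process the first convolution result output by the residual module. The first convolution result can be regarded as intermediate-layer information of the network; the refining network module makes use of this information and increases the gradient that is propagated back, which improves the prediction accuracy of the convolutional neural network. The feedback module is set to introduce the confidence map of the previous frame into the convolutional neural network, improving the accuracy of the output for the current frame.
Inputting the current frame into the residual module and inputting the confidence map of the previous frame into the feedback module to obtain the first convolution result is to be understood as follows: the current frame of image data is processed by the residual module with reference to the result of processing the confidence map of the previous frame in the feedback module, and the first convolution result is obtained.
The first convolution result output by the residual module is input into the upsampling module to obtain the second convolution result and into the refining network module to obtain the third convolution result; the second and third convolution results are then added to output multiple human posture reference maps. The upsampling module can use nearest-neighbor interpolation or another upsampling method, chosen according to the actual situation and not limited here.
The refining network module uses intermediate-layer information of the network and increases the back-propagated gradient, which improves the prediction accuracy of the convolutional neural network. The feedback module introduces the confidence map of the previous frame into the network, where it takes part in the model's prediction for the current frame, which also improves the prediction accuracy.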
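In some embodiments, the residual module includes a first residual unit, a second residual unit and a third residual unit.
Inputting the current frame of image data into the residual module for processing, with reference to the result obtained by inputting the confidence map of the previous frame into the feedback module, to obtain the first convolution result includes: inputting the current frame of image data into the first residual unit for processing to obtain a first intermediate result; adding the result of processing the first intermediate result in the second residual unit and the result of processing the confidence map of the previous frame in the feedback module to obtain a second intermediate result; and inputting the second intermediate result into the third residual unit for processing to obtain a third intermediate result, which is the first convolution result. The numbers of channels of the first, second and third intermediate results increase in that order.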
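The data flow through the main path, the refining branch and the feedback branch can be sketched with placeholder callables. This only illustrates the wiring described in the text: flat lists of floats stand in for feature maps, every module is passed in as a function, and shape changes (downsampling, upsampling, channel counts) are deliberately ignored. All names are hypothetical.

```python
from typing import Callable, List

Tensor = List[float]          # flat stand-in for a feature map
Module = Callable[[Tensor], Tensor]


def add(a: Tensor, b: Tensor) -> Tensor:
    """Element-wise ("bitwise") addition of two equally sized tensors."""
    if len(a) != len(b):
        raise ValueError("tensors must have the same size")
    return [x + y for x, y in zip(a, b)]


def forward(frame: Tensor, prev_confidence: Tensor,
            unit1: Module, unit2: Module, unit3: Module,
            feedback: Module, upsample: Module, refine: Module) -> Tensor:
    first_intermediate = unit1(frame)
    # The feedback branch output is added to the second residual unit's output.
    second_intermediate = add(unit2(first_intermediate), feedback(prev_confidence))
    first_conv = unit3(second_intermediate)   # third intermediate result
    second_conv = upsample(first_conv)        # main path
    third_conv = refine(first_conv)           # first branch (refining network)
    return add(second_conv, third_conv)       # human posture reference maps


# Toy modules, chosen only so the arithmetic is easy to follow.
out = forward(
    [1.0, 2.0], [0.5, 0.5],
    unit1=lambda t: [v + 1.0 for v in t],
    unit2=lambda t: [v * 2.0 for v in t],
    unit3=lambda t: list(t),
    feedback=lambda t: list(t),
    upsample=lambda t: list(t),
    refine=lambda t: [v * 2.0 for v in t],
)
```

Passing the modules in as arguments keeps the sketch independent of any particular deep-learning framework; in a real network each callable would be a trained layer stack.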
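In the embodiments of the present application, the residual module includes a first residual unit, a second residual unit and a third residual unit, each consisting of a ShuffleNet subunit and a ShuffleNet downsampling subunit. A ShuffleNet subunit can operate on image data of any size and is controlled by two parameters, the input depth and the output depth. The input depth is the number of intermediate feature layers fed into the subunit, and the output depth is the number of intermediate feature layers it outputs; the number of layers corresponds to the number of channels. A ShuffleNet subunit extracts higher-level features while retaining the information of the original level. It changes only the depth of the intermediate feature layers and leaves the size of the image data unchanged, so it can be regarded as an advanced "convolutional layer" that preserves size. In a convolutional neural network, the number of channels is the number of convolution kernels in each convolutional layer. In addition, each residual unit may contain only one ShuffleNet subunit; compared with the original design in which each residual unit contains three ShuffleNet subunits, this simplifies the network structure, reduces the amount of computation and improves processing efficiency.
Through successive processing by the ShuffleNet downsampling subunits in the first, second and third residual units, the sizes of the first, second and third intermediate results decrease in that order. To keep the network size unchanged, the numbers of channels of the first, second and third intermediate results increase in that order. Each channel corresponds to one feature map.
It should be noted that an intermediate result can be written as W×H×K, where W is the width of the intermediate result, H is its length, K is the number of channels, and W×H is the size of the intermediate result. Input image data can be written as W×H×D, where W and H have the same meaning as above and D is the depth; for example, D=3 for an RGB image and D=1 for a grayscale image.
For example, with the first, second and third intermediate results written as W×H×K as above, the first intermediate result is 64×32×32, the second is 32×16×64 and the third is 16×8×128. The size of the first intermediate result is therefore 64×32, that of the second 32×16 and that of the third 16×8, so the sizes decrease in that order. The numbers of channels are 32, 64 and 128 respectively, so the numbers of channels increase in that order.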
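The shapes in this example follow a simple pattern, sketched below. The sketch assumes that every downsampling stage halves both W and H and doubles K, which is consistent with the three shapes quoted above but is not stated as a general rule in the description; `encoder_shapes` is a hypothetical helper.

```python
from typing import List, Tuple


def encoder_shapes(w: int, h: int, k: int, stages: int) -> List[Tuple[int, int, int]]:
    """List the W x H x K shape of each intermediate result, assuming each
    stage halves the spatial size and doubles the number of channels."""
    shapes = [(w, h, k)]
    for _ in range(stages - 1):
        w, h, k = w // 2, h // 2, k * 2
        shapes.append((w, h, k))
    return shapes


shapes = encoder_shapes(64, 32, 32, stages=3)
# shapes == [(64, 32, 32), (32, 16, 64), (16, 8, 128)]
```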
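In some embodiments, the human posture detection model includes a third branch.
Inputting the first convolution result output by the residual module into the upsampling module for processing to obtain the second convolution result, and inputting the first convolution result output by the residual module into the refine network module for processing to obtain the third convolution result, includes: inputting the first intermediate result into the third branch for processing to obtain a fourth intermediate result; inputting the second intermediate result into the third branch for processing to obtain a fifth intermediate result; inputting the third intermediate result and the fifth intermediate result into the upsampling module for processing to obtain a sixth intermediate result; inputting the fourth intermediate result and the sixth intermediate result into the upsampling module for processing to obtain a seventh intermediate result, which serves as the second convolution result; and inputting the first convolution result output by the residual module into the refine network module for processing to obtain the third convolution result. The numbers of channels of the sixth and seventh intermediate results decrease successively.
In some embodiments of the present application, the human posture detection model includes a third branch. By means of the third branch, the convolution operation of the skip connection is moved onto the main path, which improves the prediction accuracy of the human posture detection model. The third branch includes a 1×1 convolution kernel module, a batch normalization module and a linear activation function module. The 1×1 convolution kernel serves the following purposes:
Case 1: for a single channel and a single convolution kernel, the 1×1 convolution kernel scales the input image data. A 1×1 convolution kernel has only one parameter, so sliding it over the input image data is equivalent to multiplying the input image data by a coefficient;
Case 2: for multiple channels and multiple convolution kernels, the 1×1 convolution kernel serves three purposes: first, it enables cross-channel interaction and information integration; second, it reduces or raises dimensionality and reduces the number of network parameters, where reducing dimensionality means reducing the number of channels and raising dimensionality means increasing it; third, it greatly increases non-linearity without loss of resolution.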
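The two cases above can be made concrete with a small pure-Python sketch of a 1×1 convolution. The data layout (`feature_map[y][x]` is a list of channel values) and the function name `conv1x1` are assumptions chosen for readability, not the application's implementation.

```python
def conv1x1(feature_map, weights):
    """Apply a 1x1 convolution.

    feature_map[y][x] is a list of C_in channel values for one pixel.
    weights is a list of C_out kernels, each a list of C_in coefficients.
    Each output channel is a weighted sum of the input channels at the
    same pixel, so width and height are unchanged.
    """
    return [
        [
            [sum(w * v for w, v in zip(kernel, pixel)) for kernel in weights]
            for pixel in row
        ]
        for row in feature_map
    ]


# Case 1: one channel, one kernel: a plain scaling by the single parameter.
scaled = conv1x1([[[2.0], [4.0]]], [[0.5]])
# scaled == [[[1.0], [2.0]]]

# Case 2: three channels mixed down to two (dimensionality reduction).
reduced = conv1x1([[[1.0, 2.0, 3.0]]], [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
# reduced == [[[1.0, 5.0]]]
```

Using more kernels than input channels would raise the number of channels in the same way.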
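The batch normalization module is configured to perform batch normalization. Batch normalization addresses the slower convergence and the vanishing or exploding gradients that arise as a neural network gets deeper: it normalizes the inputs of some or all layers, fixing the mean and variance of each layer's input signal so that the input of every layer has a stable distribution. It is typically applied before the activation function to normalize x=Wu+b so that the output has a mean of 0 and a variance of 1, where W is the weight matrix, u is the layer input and b is the bias. In a convolutional neural network, the weight matrix is the convolution kernel, that is, W denotes the convolution kernel.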
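The normalization step described here can be sketched as follows for a batch of scalar pre-activation values. This is a simplified illustration: the learnable scale and shift that batch normalization layers usually apply afterwards are omitted, and the small constant `eps` is an assumed value for numerical stability.

```python
import math


def batch_normalize(values, eps=1e-5):
    """Normalize a batch of pre-activation values to mean 0 and variance ~1."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


normalized = batch_normalize([1.0, 2.0, 3.0, 4.0])
# The normalized values have a mean of (approximately) 0 and a variance
# slightly below 1 because of eps.
```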
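Since the seventh intermediate result is obtained by inputting the sixth intermediate result and the fourth intermediate result into the upsampling module, the size of the seventh intermediate result is larger than that of the sixth; meanwhile, to keep the network size unchanged, the numbers of channels of the sixth and seventh intermediate results decrease successively.
By means of the third branch, the convolution operation of the skip connection is moved onto the main path, which improves the prediction accuracy of the human posture detection model. In addition, the first, second and third intermediate results can be regarded as the encoding part, and the sixth and seventh intermediate results as the decoding part. To keep the network size unchanged, the number of channels of the intermediate results is increased step by step in the encoding part as their size decreases, and decreased step by step in the decoding part as their size increases. The convolutional neural network provided by the embodiments of the present application is thus an asymmetric encoder-decoder structure.
In some embodiments, after adding the second convolution result and the third convolution result and outputting the plurality of human posture reference maps, the method further includes: adding the first convolution result and the second convolution result to obtain a target result; and adding the plurality of human posture reference maps and the target result to output a new plurality of human posture reference maps. The target result is used to improve the accuracy of the human posture detection model when the model is trained.
In some embodiments of the present application, intermediate supervision may be added to improve the accuracy of the human posture detection model in the training stage. Intermediate supervision means computing a loss on the output of every stage, which ensures that the parameters of the lower layers are updated properly.
The first convolution result and the second convolution result are added to obtain the target result, and the target result is then added to the plurality of human posture reference maps to obtain the new plurality of human posture reference maps. The target result provides the intermediate supervision, that is, it also takes part in the loss computation.
It should be noted that, in the prediction stage, the operation of adding the first convolution result and the second convolution result may be skipped, so the output in the prediction stage includes only the plurality of human posture reference maps.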
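The idea of intermediate supervision, computing a loss on the output of every stage during training and using only the final output at prediction time, can be sketched as below. The application does not state which loss function is used; mean squared error and the names `mse` and `supervised_loss` are assumptions made purely for illustration, and the maps are flattened to plain lists.

```python
def mse(prediction, target):
    """Mean squared error between two flattened maps of equal length."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


def supervised_loss(stage_outputs, target):
    """Sum the loss over every supervised stage output (training only)."""
    return sum(mse(output, target) for output in stage_outputs)


target = [0.0, 1.0, 0.0]
intermediate_output = [0.5, 0.5, 0.5]   # hypothetical mid-network output
final_output = [0.0, 1.0, 0.0]          # hypothetical final reference map
total = supervised_loss([intermediate_output, final_output], target)
# total == 0.25: the intermediate output contributes 0.25, the final one 0.0
```

Because the intermediate output has its own loss term, gradients reach the earlier layers directly rather than only through the final output.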
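It should also be noted that, after multiple frames of image data are acquired, the technical solution of the embodiments of the present application does not need to detect whether the image data contains a human face and, if so, locate the face in the image data and extract it. These operations are omitted because they are time-consuming and their detection results have large errors; omitting them greatly improves data processing efficiency.
It should further be noted that, since the second and third residual units each consist of a ShuffleNet sub-unit and a ShuffleNet downsampling sub-unit, the main path retains the original-size information before each downsampling: the first intermediate result is input to the second residual unit before the ShuffleNet downsampling sub-unit of the second residual unit performs downsampling, and the second intermediate result is input to the third residual unit before the ShuffleNet downsampling sub-unit of the third residual unit performs downsampling. Between two downsampling operations, one ShuffleNet sub-unit is used to extract features: between the first and second residual units, features are extracted by the ShuffleNet sub-unit of the first residual unit; between the second and third residual units, features are extracted by the ShuffleNet sub-unit of the second residual unit.
The convolutional neural network provided by the embodiments of the present application introduces the refine network module and the feedback module and moves the convolution operation of the skip connection onto the main path, all of which improve its prediction accuracy. In addition, the asymmetric encoder-decoder structure keeps the network size essentially unchanged, and since each residual unit contains only one ShuffleNet sub-unit, rather than three as in the original design, the network structure is simplified, the amount of computation is reduced and processing efficiency is improved. As a result, the human posture detection method based on this convolutional neural network can be applied to an embedded platform, such as that of a smartphone, and can run in real time with prediction accuracy that meets requirements.
For a better understanding of the convolutional neural network provided by the embodiments of the present application, a specific example is described below: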
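FIG. 2 is a schematic diagram of an application of a convolutional neural network. The convolutional neural network may include a main path, a first branch, a second branch and a third branch. The main path includes a first convolution module 21, a first residual unit 22, a second residual unit 23, a third residual unit 24, a second convolution module 25, an upsampling module 26, an element-wise addition module 27 and a third convolution module 28.
The first residual unit 22, the second residual unit 23 and the third residual unit 24 each include a ShuffleNet downsampling sub-unit 221 and a ShuffleNet sub-unit 222. The first branch includes a refine network module 29, which includes the ShuffleNet sub-unit 222, the upsampling module 26 and the element-wise addition module 27; the second branch includes a feedback module 30; and the third branch includes the second convolution module 25.
It should be noted that the W×H×K marked on a module, unit or sub-unit denotes the result obtained after processing by that module, unit or sub-unit, where W is the width of the result, H is its height and K is the number of channels.
It should also be noted that the first convolution module 21 performs the following operations: first, a convolution with a 3×3 convolution kernel; second, batch normalization; third, a linear activation function. The second convolution module 25 performs the following operations: first, a convolution with a 1×1 convolution kernel; second, batch normalization; third, a linear activation function. The third convolution module 28 performs the following operations: first, a convolution with a 1×1 convolution kernel; second, batch normalization; third, a linear activation function; fourth, a convolution with a 3×3 convolution kernel.
Assume the current frame of image data is a 256×128×3 RGB image. It is fed into the convolutional neural network as an input variable and passes through the first convolution module 21 and the first residual unit 22 in turn to give the first intermediate result, which is 64×32×32. The first intermediate result and the result of processing the human posture confidence map of the previous frame of image data in the feedback module 30 are input together to the element-wise addition module 27 on the main path; the output of that module is input to the second residual unit 23 for processing to give the second intermediate result, which is 32×16×64; and the second intermediate result is input to the third residual unit 24 for processing to give the third intermediate result, which serves as the first convolution result and is 16×8×128. The feedback module 30 may include a 1×1 convolution kernel configured to raise dimensionality: the human posture confidence map of the previous frame of image data is 64×32×14 whereas the first intermediate result is 64×32×32, so the number of channels must be raised for the two to match.
The first intermediate result is input to the second convolution module 25 of the third branch for processing to give the fourth intermediate result, which is 64×32×32.
The second intermediate result is input to the second convolution module 25 of the third branch for processing to give the fifth intermediate result, which is 32×16×32.
The result of processing the third intermediate result in the second convolution module 25 and the upsampling module 26 on the main path is input, together with the fifth intermediate result, to the element-wise addition module 27 on the main path to give the sixth intermediate result. The result of processing the sixth intermediate result in the upsampling module 26 on the main path is input, together with the fourth intermediate result, to the element-wise addition module 27 on the main path to give the seventh intermediate result, which serves as the second convolution result and is 64×32×32.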
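The decoder shapes in this walk-through can be checked with a short shape-only sketch. It assumes that upsampling doubles the width and height, and that the second convolution module 25 projects the 16×8×128 third intermediate result down to 32 channels; the text does not state that channel count, but element-wise addition with the 32×16×32 fifth intermediate result requires it. The helper names are hypothetical.

```python
def upsample_shape(shape, factor=2):
    """Shape after upsampling: width and height grow, channels stay."""
    width, height, channels = shape
    return (width * factor, height * factor, channels)


def add_shapes(a, b):
    """Element-wise addition is only defined for identical shapes."""
    if a != b:
        raise ValueError(f"shape mismatch: {a} vs {b}")
    return a


fourth = (64, 32, 32)            # first intermediate result after the 1x1 convolution
fifth = (32, 16, 32)             # second intermediate result after the 1x1 convolution
third_projected = (16, 8, 32)    # assumption: 128 channels reduced to 32

sixth = add_shapes(upsample_shape(third_projected), fifth)
seventh = add_shapes(upsample_shape(sixth), fourth)
# sixth == (32, 16, 32) and seventh == (64, 32, 32), the second convolution result
```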
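The result of processing the third intermediate result in the second convolution module 25 on the main path is input to the ShuffleNet sub-unit 222 on the first branch to give the eighth intermediate result; the eighth intermediate result is input to the upsampling module 26 on the first branch to give the ninth intermediate result; the ninth intermediate result is input to the ShuffleNet sub-unit 222 on the first branch to give the tenth intermediate result; and the tenth intermediate result is input to the upsampling module 26 on the first branch to give the eleventh intermediate result. The sixth intermediate result is input to the ShuffleNet sub-unit 222 on the first branch to give the twelfth intermediate result, which is input to the upsampling module 26 on the first branch to give the thirteenth intermediate result. The eleventh and thirteenth intermediate results are input together to the element-wise addition module 27 on the first branch to give the third convolution result, which is 64×32×32.
The second convolution result and the third convolution result are input to the element-wise addition module 27 on the main path to give the fourteenth intermediate result; the fourteenth intermediate result is input to the ShuffleNet sub-unit 222 on the main path to give the fifteenth intermediate result, which is 64×32×32; and the fifteenth intermediate result is input to the third convolution module 28 on the main path, which outputs the plurality of human posture reference maps.
The first convolution result and the second convolution result are added to obtain the target result, which is 64×32×14. The plurality of human posture reference maps and the target result are added to output a new plurality of human posture reference maps. The target result is used to improve the accuracy of the human posture detection model when the model is trained.
It should be noted that the human posture confidence map of the previous frame of image data is not fed into the convolutional neural network together with the current frame of image data at the start; instead, it is fed in at an intermediate layer of the network together with the first intermediate result, which reduces the amount of data to be processed.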
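FIG. 3 is a flowchart of another human posture detection method provided by an embodiment of the present application. This embodiment is applicable to detecting human posture. The method may be performed by a human posture detection apparatus, which may be implemented in at least one of software and hardware and may be configured in a device, typically a computer or a mobile terminal. As shown in FIG. 3, the method includes steps 301 to 311.
In step 301, multiple frames of image data are acquired.
In step 302, it is determined whether the human posture confidence map of the previous frame of image data is credible; if it is credible, step 303 is performed; if it is not credible, step 304 is performed.
In step 303, the current frame of image data and the human posture confidence map of the previous frame of image data are input into the pre-trained human posture detection model, a plurality of human posture reference maps are output, and the method proceeds to step 305.
In step 304, the current frame of image data and preset image data are input into the pre-trained human posture detection model, a plurality of human posture reference maps are output, and the method proceeds to step 305.
In step 305, each human posture reference map includes a plurality of candidate points for a human posture key point, and the coordinate position of each candidate point corresponds to a probability value; in each human posture reference map, the coordinate position corresponding to the largest of the probability values of the candidate points is determined, and the candidate point at that coordinate position is taken as the human posture key point.
In step 306, it is determined whether the probability value corresponding to the human posture key point is greater than a preset threshold; if it is greater than the preset threshold, step 307 is performed; if it is less than or equal to the preset threshold, step 308 is performed.
In step 307, a mask map centered on the human posture key point is generated and used as the human posture confidence map of the current frame of image data, and the method proceeds to step 309.
In step 308, the preset image data is used as the human posture confidence map of the current frame of image data, and the method proceeds to step 309.
In step 309, it is determined whether the current frame of image data is the last frame of image data; if it is not the last frame, step 310 is performed; if it is the last frame, step 311 is performed.
In step 310, the human posture confidence map of the current frame of image data is input into the human posture detection model to participate in generating the human posture confidence map of the next frame of image data.
In step 311, the operation of generating human posture confidence maps for the multiple frames of image data ends.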
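The control flow of steps 301 to 311 can be summarized as a loop in which each frame's confidence map is fed back for the next frame. In the sketch below, `model` and `make_confidence_map` are hypothetical callables standing in for the trained human posture detection model and for steps 305 to 308; returning `None` from `make_confidence_map` is this example's own convention for "key point not credible", in which case the preset data is used for both the current map and the next frame's input.

```python
def run_detection(frames, model, make_confidence_map, preset_map):
    """Generate one confidence map per frame, feeding each map forward.

    model(frame, prior_map) -> reference maps for the frame (steps 303/304).
    make_confidence_map(reference_maps) -> confidence map, or None when the
    key point is not credible (steps 305 to 308).
    """
    confidence_maps = []
    previous_map = preset_map  # no credible previous map before the first frame
    for frame in frames:
        reference_maps = model(frame, previous_map)
        current_map = make_confidence_map(reference_maps)
        if current_map is None:
            current_map = preset_map       # step 308: fall back to preset data
        confidence_maps.append(current_map)
        previous_map = current_map         # step 310: feed forward to next frame
    return confidence_maps                 # step 311: all frames processed
```

Treating a non-credible map as identical to the preset data merges the checks in steps 302 and 306 into one fallback, which keeps the loop short.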
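It should be noted that the human posture detection model provided by the embodiments of the present application is generated by training a convolutional neural network applied to an embedded platform.
In the technical solution of this embodiment, multiple frames of image data are acquired; the current frame of image data is input into the pre-trained human posture detection model, with reference to the human posture confidence map of the previous frame of image data, and a plurality of human posture reference maps are output, the human posture detection model being generated by training a convolutional neural network applied to an embedded platform; human posture key points are identified in the human posture reference maps; a human posture confidence map is generated according to the credibility of the human posture key points; and it is determined whether the current frame of image data is the last frame of image data. If it is not the last frame, the human posture confidence map of the current frame of image data is input into the human posture detection model to participate in generating the human posture confidence map of the next frame of image data; if it is the last frame, the operation of generating human posture confidence maps for the multiple frames of image data ends. Human posture detection is thereby performed on an embedded platform, and introducing the output for the previous frame of image data into the prediction for the current frame of image data improves prediction accuracy.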
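FIG. 4 is a schematic structural diagram of a human posture detection apparatus provided by an embodiment of the present application. This embodiment is applicable to detecting human posture. The apparatus may be implemented in at least one of software and hardware and may be configured in a device, typically a computer or a mobile terminal. As shown in FIG. 4, the apparatus includes an image data acquisition module 410, a human posture reference map output module 420, a human posture key point identification module 430, a human posture confidence map generation module 440, a determination module 450, a first execution module 460 and a second execution module 470.
The image data acquisition module 410 is configured to acquire multiple frames of image data.
The human posture reference map output module 420 is configured to input the current frame of image data into the pre-trained human posture detection model, with reference to the human posture confidence map of the previous frame of image data, and to output a plurality of human posture reference maps, the human posture detection model being generated by training a convolutional neural network applied to an embedded platform.
The human posture key point identification module 430 is configured to identify a human posture key point in each human posture reference map.
The human posture confidence map generation module 440 is configured to generate the human posture confidence map of the current frame of image data according to the credibility of the human posture key points.
The determination module 450 is configured to determine whether the current frame of image data is the last frame of image data.
The first execution module 460 is configured to, if the current frame of image data is not the last frame of image data, input the human posture confidence map of the current frame of image data into the human posture detection model to participate in generating the human posture confidence map of the next frame of image data.
The second execution module 470 is configured to, if the current frame of image data is the last frame of image data, end the operation of generating human posture confidence maps for the multiple frames of image data.
In the technical solution of this embodiment, multiple frames of image data are acquired; the current frame of image data is input into the pre-trained human posture detection model, with reference to the human posture confidence map of the previous frame of image data, and a plurality of human posture reference maps are output, the human posture detection model being generated by training a convolutional neural network applied to an embedded platform; a human posture key point is identified in each human posture reference map; the human posture confidence map of the current frame of image data is generated according to the credibility of the human posture key points; and it is determined whether the current frame of image data is the last frame of image data. If it is not the last frame, the human posture confidence map of the current frame of image data is input into the human posture detection model to participate in generating the human posture confidence map of the next frame of image data; if it is the last frame, the operation of generating human posture confidence maps for the multiple frames of image data ends. Human posture detection is thereby performed on an embedded platform, and introducing the output for the previous frame of image data into the prediction for the current frame of image data improves prediction accuracy.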
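In some embodiments, the human posture reference map output module 420 includes a confidence map credibility determination unit, a first human posture reference map output unit and a second human posture reference map output unit.
The confidence map credibility determination unit is configured to determine whether the human posture confidence map of the previous frame of image data is credible.
The first human posture reference map output unit is configured to, if the human posture confidence map of the previous frame of image data is credible, input the current frame of image data and the human posture confidence map of the previous frame of image data into the pre-trained human posture detection model and output a plurality of human posture reference maps.
The second human posture reference map output unit is configured to, if the human posture confidence map of the previous frame of image data is not credible, input the current frame of image data and preset image data into the pre-trained human posture detection model and output a plurality of human posture reference maps.
In some embodiments, each human posture reference map includes a plurality of candidate points for a human posture key point, the coordinate position of each candidate point corresponds to a probability value, and the human posture key point identification module 430 includes a human posture key point identification unit.
The human posture key point identification unit is configured to determine, in each human posture reference map, the coordinate position corresponding to the largest of the probability values of the candidate points, and to take the candidate point at that coordinate position as the human posture key point.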
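Selecting the candidate point with the largest probability value amounts to an argmax over the reference map. The sketch below assumes the map is stored as `reference_map[y][x]`, a nested list of probabilities; the function name `find_key_point` is hypothetical.

```python
def find_key_point(reference_map):
    """Return ((x, y), probability) of the most probable candidate point.

    reference_map[y][x] holds the probability that the key point lies at (x, y).
    """
    best_position = (0, 0)
    best_probability = reference_map[0][0]
    for y, row in enumerate(reference_map):
        for x, probability in enumerate(row):
            if probability > best_probability:
                best_position = (x, y)
                best_probability = probability
    return best_position, best_probability


heatmap = [[0.1, 0.2, 0.1],
           [0.0, 0.9, 0.3]]
position, probability = find_key_point(heatmap)
# position == (1, 1) and probability == 0.9
```

One such lookup is performed per reference map, so a model that outputs one map per key point yields one coordinate per key point.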
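In some embodiments, the human posture confidence map generation module 440 includes a human posture key point credibility determination unit, a first human posture confidence map generation unit and a second human posture confidence map generation unit.
The human posture key point credibility determination unit is configured to determine whether the human posture key point is credible.
The first human posture confidence map generation unit is configured to, if the human posture key point is credible, generate a mask map centered on the human posture key point and use it as the human posture confidence map.
The second human posture confidence map generation unit is configured to, if the human posture key point is not credible, use the preset image data as the human posture confidence map.
In some embodiments, the human posture key point credibility determination unit is configured to:
determine whether the probability value corresponding to the human posture key point is greater than a preset threshold;
if the probability value corresponding to the human posture key point is greater than the preset threshold, determine that the human posture key point is credible; and
if the probability value corresponding to the human posture key point is less than or equal to the preset threshold, determine that the human posture key point is not credible.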
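The threshold test and the two ways of producing a confidence map can be sketched together. Several details here are assumptions, since this passage does not specify them: the mask is modeled as a Gaussian bump centered on the key point, the preset image data is modeled as an all-zero map, and the threshold of 0.5 and `sigma` of 1.0 are placeholder values.

```python
import math


def confidence_map(key_point, probability, width, height,
                   threshold=0.5, sigma=1.0, preset_value=0.0):
    """Build a confidence map for one key point.

    If the probability does not exceed the threshold, the key point is not
    credible and a preset (here: constant) map is returned. Otherwise a mask
    centered on the key point is generated (assumed Gaussian in this sketch).
    """
    if probability <= threshold:
        return [[preset_value] * width for _ in range(height)]
    cx, cy = key_point
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


credible = confidence_map((1, 1), 0.9, width=3, height=3)
not_credible = confidence_map((1, 1), 0.5, width=3, height=3)
# credible peaks at 1.0 in the center; not_credible is the all-zero preset map
```

A probability exactly equal to the threshold is treated as not credible, matching the "less than or equal to" condition above.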
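In some embodiments, the human posture detection model includes a main path, a first branch and a second branch; the main path includes a residual module and an upsampling module, the first branch includes a refine network module, and the second branch includes a feedback module.
Inputting the current frame of image data into the pre-trained human posture detection model, with reference to the human posture confidence map of the previous frame of image data, and outputting a plurality of human posture reference maps includes:
inputting the current frame of image data into the residual module for processing, with reference to the result obtained by inputting the human posture confidence map of the previous frame of image data into the feedback module, to obtain a first convolution result;
inputting the first convolution result output by the residual module into the upsampling module for processing to obtain a second convolution result, and inputting the first convolution result output by the residual module into the refine network module for processing to obtain a third convolution result; and
adding the second convolution result and the third convolution result, and outputting the plurality of human posture reference maps.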
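In some embodiments, the residual module includes a first residual unit, a second residual unit and a third residual unit.
Inputting the current frame of image data into the residual module for processing, with reference to the result obtained by inputting the human posture confidence map of the previous frame of image data into the feedback module, to obtain the first convolution result includes:
inputting the current frame of image data into the first residual unit for processing to obtain a first intermediate result;
adding the result of processing the first intermediate result in the second residual unit to the result of processing the human posture confidence map of the previous frame of image data in the feedback module to obtain a second intermediate result; and
inputting the second intermediate result into the third residual unit for processing to obtain a third intermediate result, which serves as the first convolution result.
The numbers of channels of the first, second and third intermediate results increase successively.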
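In some embodiments, the human posture detection model includes a third branch.
Inputting the first convolution result output by the residual module into the upsampling module for processing to obtain the second convolution result, and inputting the first convolution result output by the residual module into the refine network module for processing to obtain the third convolution result, includes:
inputting the first intermediate result into the third branch for processing to obtain a fourth intermediate result;
inputting the second intermediate result into the third branch for processing to obtain a fifth intermediate result;
inputting the third intermediate result and the fifth intermediate result into the upsampling module for processing to obtain a sixth intermediate result;
inputting the fourth intermediate result and the sixth intermediate result into the upsampling module for processing to obtain a seventh intermediate result, which serves as the second convolution result; and
inputting the first convolution result output by the residual module into the refine network module for processing to obtain the third convolution result.
The numbers of channels of the sixth and seventh intermediate results decrease successively.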
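In some embodiments, after adding the second convolution result and the third convolution result and outputting the plurality of human posture reference maps, the method further includes:
adding the first convolution result and the second convolution result to obtain a target result; and
adding the plurality of human posture reference maps and the target result, and outputting a new plurality of human posture reference maps.
The target result is used to improve the accuracy of the human posture detection model when the model is trained.
The human posture detection apparatus provided by the embodiments of the present application can perform the human posture detection method provided by any embodiment of the present application.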
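FIG. 5 is a schematic structural diagram of a device provided by an embodiment of the present application. FIG. 5 shows a block diagram of an exemplary device 512 suitable for implementing the embodiments of the present application. As shown in FIG. 5, the components of the device 512 may include, but are not limited to, at least one processor 516, a system memory 528, and a bus 518 connecting the different system components (including the system memory 528 and the processor 516). The system memory 528 may include computer-system-readable media in the form of volatile memory, such as at least one of a random access memory (RAM) 530 and a cache memory 532. The device 512 may include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 534 may provide a hard disk drive, a magnetic disk drive and an optical disk drive; in these cases, each drive may be connected to the bus 518 through at least one data media interface. A program/utility 540 having a set of (at least one) program modules 542 may be stored, for example, in the memory 528; the program modules 542 generally perform the functions and/or methods of the embodiments described in the present application. The device 512 may also communicate with at least one external device 514 (such as a keyboard, a pointing device or a display 524), with at least one device that enables a user to interact with the device 512, and/or with any device (such as a network card or a modem) that enables the device 512 to communicate with at least one other computing device. Such communication may take place through an input/output (I/O) interface 522. The device 512 may also communicate with at least one network (for example a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 520. The processor 516 runs programs stored in the system memory 528 to execute various functional applications and data processing, for example to implement the human posture detection method provided by any embodiment of the present application.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the human posture detection method provided by any embodiment of the present application.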

Claims (12)

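  1. A human posture detection method, comprising:
    acquiring multiple frames of image data;
    inputting a current frame of image data into a pre-trained human posture detection model, with reference to a human posture confidence map of a previous frame of image data, and outputting a plurality of human posture reference maps, wherein the human posture detection model is generated by training a convolutional neural network applied to an embedded platform;
    identifying a human posture key point in each human posture reference map;
    generating a human posture confidence map of the current frame of image data according to the credibility of the human posture key point;
    determining whether the current frame of image data is the last frame of image data;
    in a case where the current frame of image data is not the last frame of image data, inputting the human posture confidence map of the current frame of image data into the human posture detection model to participate in generating a human posture confidence map of a next frame of image data; and
    in a case where the current frame of image data is the last frame of image data, ending the operation of generating human posture confidence maps for the multiple frames of image data.
  2. The method according to claim 1, wherein inputting the current frame of image data into the pre-trained human posture detection model, with reference to the human posture confidence map of the previous frame of image data, and outputting the plurality of human posture reference maps comprises:
    determining whether the human posture confidence map of the previous frame of image data is credible;
    in a case where the human posture confidence map of the previous frame of image data is credible, inputting the current frame of image data and the human posture confidence map of the previous frame of image data into the pre-trained human posture detection model, and outputting the plurality of human posture reference maps; and
    in a case where the human posture confidence map of the previous frame of image data is not credible, inputting the current frame of image data and preset image data into the pre-trained human posture detection model, and outputting the plurality of human posture reference maps.
  3. The method according to claim 1, wherein each human posture reference map comprises a plurality of candidate points for a human posture key point, and the coordinate position of each candidate point corresponds to a probability value;
    identifying the human posture key point in each human posture reference map comprises:
    determining, in each human posture reference map, the coordinate position corresponding to the largest of the probability values corresponding to the coordinate positions of the plurality of candidate points, and taking the candidate point at that coordinate position as the human posture key point.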
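  4. The method according to claim 2, wherein generating the human posture confidence map of the current frame of image data according to the credibility of the human posture key point comprises:
    determining whether the human posture key point is credible;
    in a case where the human posture key point is credible, generating a mask map centered on the human posture key point as the human posture confidence map; and
    in a case where the human posture key point is not credible, using the preset image data as the human posture confidence map.
  5. The method according to claim 4, wherein determining whether the human posture key point is credible comprises:
    determining whether the probability value corresponding to the human posture key point is greater than a preset threshold;
    in a case where the probability value corresponding to the human posture key point is greater than the preset threshold, determining that the human posture key point is credible; and
    in a case where the probability value corresponding to the human posture key point is less than or equal to the preset threshold, determining that the human posture key point is not credible.
  6. The method according to any one of claims 1-5, wherein the human posture detection model comprises a main path, a first branch and a second branch, the main path comprises a residual module and an upsampling module, the first branch comprises a refine network module, and the second branch comprises a feedback module;
    inputting the current frame of image data into the pre-trained human posture detection model, with reference to the human posture confidence map of the previous frame of image data, and outputting the plurality of human posture reference maps comprises:
    inputting the current frame of image data into the residual module for processing, with reference to the result obtained by inputting the human posture confidence map of the previous frame of image data into the feedback module, to obtain a first convolution result;
    inputting the first convolution result output by the residual module into the upsampling module for processing to obtain a second convolution result; inputting the first convolution result output by the residual module into the refine network module for processing to obtain a third convolution result; and
    adding the second convolution result and the third convolution result, and outputting the plurality of human posture reference maps.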
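  7. The method according to claim 6, wherein the residual module comprises a first residual unit, a second residual unit and a third residual unit;
    inputting the current frame of image data into the residual module for processing, with reference to the result obtained by inputting the human posture confidence map of the previous frame of image data into the feedback module, to obtain the first convolution result comprises:
    inputting the current frame of image data into the first residual unit for processing to obtain a first intermediate result;
    inputting the first intermediate result into the second residual unit for processing, inputting the human posture confidence map of the previous frame of image data into the feedback module for processing, and adding the result output by the second residual unit and the result output by the feedback module to obtain a second intermediate result; and
    inputting the second intermediate result into the third residual unit for processing to obtain a third intermediate result, which serves as the first convolution result;
    wherein the numbers of channels of the first intermediate result, the second intermediate result and the third intermediate result increase successively.
  8. The method according to claim 7, wherein the human posture detection model further comprises a third branch;
    inputting the first convolution result output by the residual module into the upsampling module for processing to obtain the second convolution result, and inputting the first convolution result output by the residual module into the refine network module for processing to obtain the third convolution result, comprises:
    inputting the first intermediate result into the third branch for processing to obtain a fourth intermediate result;
    inputting the second intermediate result into the third branch for processing to obtain a fifth intermediate result;
    inputting the third intermediate result and the fifth intermediate result into the upsampling module for processing to obtain a sixth intermediate result;
    inputting the fourth intermediate result and the sixth intermediate result into the upsampling module for processing to obtain a seventh intermediate result, which serves as the second convolution result; and
    inputting the first convolution result output by the residual module into the refine network module for processing to obtain the third convolution result;
    wherein the numbers of channels of the sixth intermediate result and the seventh intermediate result decrease successively.
  9. The method according to claim 6, wherein, after adding the second convolution result and the third convolution result and outputting the plurality of human posture reference maps, the method further comprises:
    adding the first convolution result and the second convolution result to obtain a target result; and
    adding the plurality of human posture reference maps and the target result, and outputting a new plurality of human posture reference maps.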
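  10. A human posture detection apparatus, comprising:
    an image data acquisition module, configured to acquire multiple frames of image data;
    a human posture reference map output module, configured to input a current frame of image data into a pre-trained human posture detection model, with reference to a human posture confidence map of a previous frame of image data, and to output a plurality of human posture reference maps, wherein the human posture detection model is generated by training a convolutional neural network applied to an embedded platform;
    a human posture key point identification module, configured to identify a human posture key point in each human posture reference map;
    a human posture confidence map generation module, configured to generate a human posture confidence map of the current frame of image data according to the credibility of the human posture key point;
    a determination module, configured to determine whether the current frame of image data is the last frame of image data;
    a first execution module, configured to, in a case where the current frame of image data is not the last frame of image data, input the human posture confidence map of the current frame of image data into the human posture detection model to participate in generating a human posture confidence map of a next frame of image data; and
    a second execution module, configured to, in a case where the current frame of image data is the last frame of image data, end the operation of generating human posture confidence maps for the multiple frames of image data.
  11. A device, comprising:
    at least one processor; and
    a memory, configured to store at least one program;
    wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method according to any one of claims 1-9.
  12. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-9.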
PCT/CN2019/119633 2018-11-27 2019-11-20 Human posture detection method, apparatus, device and storage medium WO2020108362A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/297,882 US11908244B2 (en) 2018-11-27 2019-11-20 Human posture detection utilizing posture reference maps

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811427578.XA CN109558832B (zh) 2018-11-27 2018-11-27 Human posture detection method, apparatus, device and storage medium
CN201811427578.X 2018-11-27

Publications (1)

Publication Number Publication Date
WO2020108362A1 true WO2020108362A1 (zh) 2020-06-04

Family

ID=65867635

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/119633 WO2020108362A1 (zh) 2018-11-27 2019-11-20 Human posture detection method, apparatus, device and storage medium

Country Status (3)

Country Link
US (1) US11908244B2 (zh)
CN (1) CN109558832B (zh)
WO (1) WO2020108362A1 (zh)

Also Published As

Publication number Publication date
CN109558832A (zh) 2019-04-02
US11908244B2 (en) 2024-02-20
US20220004744A1 (en) 2022-01-06
CN109558832B (zh) 2021-03-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19890948

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19890948

Country of ref document: EP

Kind code of ref document: A1