CN116682170A - Human body action detection method, device and storage medium based on deep learning

Info

Publication number
CN116682170A
Authority
CN
China
Prior art keywords
human body
image
target
key point
target image
Prior art date
Legal status
Pending
Application number
CN202310481601.8A
Other languages
Chinese (zh)
Inventor
吉祥
Current Assignee
Jitter Technology Shenzhen Co ltd
Shenzhen Instant Construction Technology Co ltd
Original Assignee
Jitter Technology Shenzhen Co ltd
Shenzhen Instant Construction Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Jitter Technology Shenzhen Co ltd and Shenzhen Instant Construction Technology Co ltd
Priority to CN202310481601.8A
Publication of CN116682170A
Legal status: Pending

Classifications

    • G06V 40/20 - Recognition of biometric, human-related or animal-related patterns in image or video data: movements or behaviour, e.g. gesture recognition
    • G06N 3/08 - Computing arrangements based on biological models: neural networks; learning methods
    • G06T 7/246 - Image analysis: analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06V 10/26 - Image preprocessing: segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V 10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning: classification, e.g. of video objects

Abstract

The application provides a human body action detection method, device and storage medium based on deep learning. The method includes: acquiring an original image from an image sequence corresponding to a target video; cropping a target image from the original image, and acquiring the corresponding preset position information; establishing a target coordinate system based on the target image; inputting the target image into a preset human body key point detection model to obtain first human body key point coordinates; obtaining second human body key point coordinates according to the first human body key point coordinates and the preset position information; determining a target bounding box of the human body based on the second human body key point coordinates; acquiring third human body key point coordinates in all images following the original image in the image sequence based on the target bounding box; and inputting the acquired second human body key point coordinates and third human body key point coordinates into an inverse kinematics model to obtain the rotation data that drives a parameterized human model, thereby obtaining the action of the human body. The application can effectively detect human body actions.

Description

Human body action detection method, device and storage medium based on deep learning
Technical Field
The present application relates to the field of human body detection technology, and in particular to a human body action detection method, device, and storage medium based on deep learning.
Background
Human motion capture (Mocap) technology captures the pose or motion data of a human body moving in video and uses the motion pose data to drive an avatar model (e.g., a parameterized human model) or to perform behavior analysis. In the related art, to make the collected motion data drive the avatar model well, the rotation information of each joint point is generally calculated with an inverse kinematics (IK) algorithm; however, the rotation angle obtained directly by the IK algorithm lacks one degree of freedom, so human motion cannot be effectively detected.
Disclosure of Invention
The embodiments of the application disclose a human body action detection method, device, and medium based on deep learning, which solve the technical problem that human body actions cannot be effectively detected.
The application provides a human body action detection method based on deep learning, which includes: acquiring an image sequence corresponding to a target video, and acquiring an original image from the image sequence; cropping a target image from the original image, and acquiring preset position information corresponding to the target image in the original image; establishing a target coordinate system based on the target image; inputting the target image into a preset human body key point detection model, and detecting first human body key point coordinates, in the target coordinate system, corresponding to a human body in the target image; obtaining second human body key point coordinates according to the first human body key point coordinates and the preset position information; determining a target bounding box of the human body based on the second human body key point coordinates; acquiring third human body key point coordinates in all images following the original image in the image sequence based on the target bounding box; inputting the acquired second human body key point coordinates and third human body key point coordinates into an inverse kinematics model to obtain corresponding rotation data; and driving a parameterized human model based on the rotation data to obtain the action of the human body.
In some embodiments of the present application, before the target image is input into the preset human body key point detection model, the method further includes: calculating a tracking confidence of the target image; inputting the target image into the human body key point detection model when the tracking confidence of the target image meets a preset tracking confidence; and, when the tracking confidence of the target image does not meet the preset tracking confidence, discarding the target image and returning to the step of cropping a target image from the original image, so as to acquire an updated target image.
In some embodiments of the application, the calculating a tracking confidence of the target image includes: inputting the target image into a preset human body tracking model to obtain a confidence score of the target image; determining that the target image meets the preset tracking confidence when the confidence score is greater than or equal to a preset confidence score; and determining that the target image does not meet the preset tracking confidence when the confidence score is smaller than the preset confidence score.
In some embodiments of the application, the method further includes: acquiring an original bounding box corresponding to the original image; enlarging the original bounding box based on a preset enlargement ratio; and establishing an original coordinate system corresponding to the original image based on the enlarged original bounding box.
In some embodiments of the present application, the cropping a target image from the original image and acquiring preset position information corresponding to the target image in the original image includes: detecting a human body within the enlarged original bounding box based on an instance segmentation algorithm, and determining a preset bounding box of the human body; and cropping the target image based on the preset bounding box, and acquiring preset position information corresponding to the preset bounding box in the original coordinate system.
In some embodiments of the application, the method further includes: acquiring an abscissa and an ordinate of a preset position from the preset position information; taking the abscissa of the preset position as a first offset; and taking the ordinate of the preset position as a second offset.
In some embodiments of the present application, the obtaining second human body key point coordinates according to the first human body key point coordinates and the preset position information includes: calculating a first sum of the abscissa of the first human body key point coordinates and the first offset; calculating a second sum of the ordinate of the first human body key point coordinates and the second offset; and obtaining the second human body key point coordinates in the original coordinate system based on the first sum and the second sum.
The application also provides a human body action detection device based on deep learning, which includes: an acquisition module, configured to acquire an image sequence corresponding to a target video and acquire an original image from the image sequence; a first cropping module, configured to crop a target image from the original image and acquire preset position information corresponding to the target image in the original image; a construction module, configured to establish a target coordinate system based on the target image; a detection module, configured to input the target image into a preset human body key point detection model and detect first human body key point coordinates, in the target coordinate system, corresponding to a human body in the target image; a calculation module, configured to obtain second human body key point coordinates according to the first human body key point coordinates and the preset position information; a determining module, configured to determine a target bounding box of the human body based on the second human body key point coordinates; a second cropping module, configured to acquire third human body key point coordinates in all images following the original image in the image sequence based on the target bounding box; an output module, configured to input the acquired second human body key point coordinates and third human body key point coordinates into an inverse kinematics model to obtain corresponding rotation data; and a driving module, configured to drive a parameterized human model based on the rotation data to obtain the action of the human body.
The application also provides an electronic device, which includes a processor and a memory, the processor being configured to execute a computer program stored in the memory to implement the human body action detection method based on deep learning.
The application also provides a computer-readable storage medium storing at least one instruction which, when executed by a processor, implements the human body action detection method based on deep learning.
In the human body action detection method based on deep learning, an original image is first acquired from the image sequence and a target image is cropped from it, which reduces detection over the background area of the original image and streamlines the detection flow. Second, human body key point detection is performed on the target image to obtain the first human body key point coordinates, which are combined with the preset position information in the original image to obtain the second human body key point coordinates of the human body in the original image; a target bounding box of the human body is then obtained based on the second human body key point coordinates, providing a detection box for the subsequent images in the image sequence, which simplifies the detection flow while improving detection precision. Finally, the rotation data that drives the parameterized human model is obtained with the inverse kinematics model, so that human body actions are effectively detected.
Drawings
Fig. 1 is a block diagram of an electronic device according to an embodiment of the present application.
Fig. 2 is a flowchart of a human body action detection method based on deep learning according to an embodiment of the present application.
Fig. 3 is a schematic diagram of an original coordinate system according to an embodiment of the present application.
Fig. 4 is a schematic diagram of an original coordinate system and a target coordinate system according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of a human body action detection device based on deep learning according to an embodiment of the present application.
Detailed Description
For ease of understanding, descriptions of some concepts related to the embodiments of the application are given below by way of example for reference.
It should be noted that the terms "first" and "second" in the description, claims, and accompanying drawings of the present application are used to distinguish similar objects and are not used to describe a specific order or sequence.
It should be further noted that the method disclosed in the embodiments of the present application, or the method shown in the flowchart, includes one or more steps for implementing the method; the execution order of the steps may be interchanged, and some steps may be deleted, without departing from the scope of the claims.
Some embodiments will be described below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
As noted in the background, the rotation angle obtained directly by the IK algorithm lacks one degree of freedom, so human motion cannot be effectively detected. In order to improve the accuracy of human body action detection, the embodiments of the application provide a human body action detection method, device, and storage medium based on deep learning. The method is applied to an electronic device, whose relevant structure is described below with reference to Fig. 1.
Fig. 1 is a block diagram of an electronic device according to an embodiment of the present application. As shown in Fig. 1, the electronic device 1 includes, but is not limited to, a memory 11 and at least one processor 12 communicatively coupled to each other via a communication bus 10.
The electronic device 1 may be any electronic device such as a mobile phone, a tablet computer, or a notebook computer. In some embodiments, the electronic device 1 may further include a photographing device for capturing images or videos containing a human body; in other embodiments, the electronic device 1 may instead establish a communication connection with one or more external photographing devices to obtain such images or videos.
Fig. 1 is merely an exemplary illustration of the electronic device 1 and is not meant to be limiting. In other embodiments, the electronic device 1 may include more or fewer components than illustrated, may combine certain components, or may use different components; for example, the electronic device 1 may also include input-output devices, network access devices, and the like.
Fig. 2 is a flowchart of a human body action detection method based on deep learning according to an embodiment of the present application; as shown in Fig. 2, the method is applied to an electronic device (e.g., the electronic device 1 of Fig. 1). The order of the steps in the flowchart may be changed, and some steps may be omitted, according to various needs.
As shown in fig. 2, the method comprises the following steps:
Step 201: an image sequence corresponding to a target video is acquired, and an original image is acquired from the image sequence.
In some embodiments of the present application, an electronic device may be used to capture a video of human motion; the video may record a specified human action during a specified time period, or any human action during any time period. Images may be extracted from the video frame by frame and combined into an image sequence; alternatively, since consecutive frames are highly similar, images may be extracted at intervals to form the image sequence. The application does not limit the image selection mode.
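By way of illustration only (outside the application's own disclosure), the following Python sketch samples an image sequence from a video with OpenCV; the function name and the sampling interval are assumptions made for the example.
```python
# Illustrative sketch: extracting an image sequence from a target video with
# OpenCV, sampling every `step` frames to avoid near-duplicate consecutive frames.
import cv2

def extract_image_sequence(video_path: str, step: int = 5) -> list:
    """Return a list of BGR frames sampled from the video at fixed intervals."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```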
In one embodiment, an image acquired from the image sequence may be referred to as an original image. After the image sequence is determined, the timestamp of each original image in the image sequence is acquired, and the original image with the earliest timestamp is selected from the acquired original images as the original image for subsequent processing. Alternatively, an original image with any timestamp may be selected; for example, the N-th original image is taken as the original image for subsequent processing, and the original images before it are discarded based on their timestamps, so that the retained original image carries the earliest timestamp among the remaining images, laying a foundation for smooth subsequent processing.
Step 202: a target image is cropped from the original image, and preset position information corresponding to the target image in the original image is acquired.
In some embodiments of the present application, after the original image is determined, a human body in the original image may be detected by an object detection network to obtain an original bounding box of the original image, where the object detection network is a neural network model trained from any one or a combination of a Long Short-Term Memory (LSTM) network, a Recurrent Neural Network (RNN), and a Convolutional Neural Network (CNN). The object detection network detects a target object (e.g., a human body) in the image and gives its position range, classification, and probability: the position range may be marked in the form of a detection box, the classification represents the specific class of the target object, and the probability represents the probability that the target object in the detection box belongs to that class.
In some embodiments of the present application, the detection box labeled by the object detection network is taken as the original bounding box of the original image. To tolerate errors in the box labeled by the object detection network, after the original bounding box is obtained, it is enlarged based on a preset enlargement ratio, which ensures that the original bounding box completely contains the human body. For example, with an enlargement ratio of 5%, the selected area is increased by 5% relative to the original bounding box. The enlargement applies only to the bounding box; the original image itself is not processed, e.g., it is not stretched.
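For illustration, a minimal sketch of this enlargement, assuming the box is given as corner coordinates and the ratio is split evenly on each side while clamping to the image borders:
```python
# Illustrative sketch: enlarging a detected bounding box by a preset ratio
# (e.g. 5%). Only the box is enlarged; the image itself is never resized.
def enlarge_bbox(x1, y1, x2, y2, img_w, img_h, ratio=0.05):
    dw = (x2 - x1) * ratio / 2
    dh = (y2 - y1) * ratio / 2
    return (max(0, x1 - dw), max(0, y1 - dh),
            min(img_w, x2 + dw), min(img_h, y2 + dh))
```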
In some embodiments of the present application, after the enlarged original bounding box is obtained, an instance segmentation algorithm may be used to detect the human body within the enlarged original bounding box and determine a preset bounding box of the human body. Instance segmentation combines the characteristics of semantic segmentation and object detection: it classifies at the pixel level and locates individual instances, so, for example, the position of a human body in an image can be located and marked with a mask. A mask, also called masking, means blocking the image to be processed (fully or partially) with a selected image, graphic, or object, thereby controlling the area or process of image processing; the blocked area is called the mask.
After the human body area marked by the instance segmentation algorithm is obtained, a rectangular box framing the human body according to this area, i.e., the preset bounding box, can be determined. Based on the preset bounding box, the target image is cropped from the original image, and the preset position information of the target image within the original bounding box is acquired. For example, an original coordinate system is established based on the enlarged original bounding box of the original image, and coordinate points of the preset bounding box in the original coordinate system are obtained: the preset position point corresponding to the upper-left corner of the preset bounding box, whose abscissa and ordinate serve as the preset position information, and, if needed, the preset position point corresponding to the lower-right corner of the preset bounding box.
For example, Fig. 3 is a schematic diagram of an original coordinate system provided by an embodiment of the present application. As shown in Fig. 3, the coordinates of the preset position point at the upper-left corner of the preset bounding box in the original coordinate system are A (X1, Y1), and the coordinates of the preset position point at the lower-right corner are B (X2, Y2); the height of the target image is then |Y2-Y1| and the width of the target image is |X2-X1|.
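A minimal sketch of this cropping step, assuming the original image is a NumPy array as produced by OpenCV and taking point A as the preset position information:
```python
# Illustrative sketch: cropping the target image with the preset bounding
# box and recording its top-left corner A(X1, Y1) as the preset position
# information in the original coordinate system.
def crop_target_image(original_img, preset_bbox):
    x1, y1, x2, y2 = (int(v) for v in preset_bbox)
    target_img = original_img[y1:y2, x1:x2]   # height |Y2-Y1|, width |X2-X1|
    preset_position = (x1, y1)                # point A in the original frame
    return target_img, preset_position
```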
Step 203: a target coordinate system is established based on the target image.
In some embodiments of the present application, after the target image is cropped, a target coordinate system may be established based on the target image in order to express the human body key point coordinates of the human body in the target image (for example, the first human body key point coordinates determined in step 204 below); for example, in the diagram of Fig. 3, point A is taken as the origin of the target coordinate system.
Step 204: the target image is input into a preset human body key point detection model, and the first human body key point coordinates, in the target coordinate system, corresponding to the human body in the target image are detected.
In some embodiments of the present application, because the target image may be affected by factors such as partial occlusion, deformation, motion blur, rapid motion, illumination change, background clutter, and scale change, whether the target image contains a human body may be checked before the target image is input into the human body key point detection model; for example, a tracking confidence of the target image may be calculated, specifically measuring whether the human body in the target image is complete.
In some embodiments of the present application, the cropped target image is input into a human body tracking model. The tracking technique in the human body tracking model may be any one of kernel-based structured output tracking (Struck), Multiple Instance Learning (MIL) tracking, and Tracking-Learning-Detection (TLD). A typical tracking procedure is as follows: first, the human body and the background are taken as positive and negative samples, respectively, to train a classifier; then, detection is performed with the classifier over a search area, and the position with the maximum response value is taken as the estimate of the center position of the human body, thereby realizing tracking.
In some embodiments of the present application, the confidence score output by the human body tracking model for the target image is obtained; the confidence score evaluates the completeness of the human body contained in the target image. If the confidence score is greater than or equal to a preset confidence score, the target image is determined to meet the preset tracking confidence and is retained for subsequent input into the human body key point detection model. If the confidence score is smaller than the preset confidence score, the target image is determined not to meet the preset tracking confidence and is discarded, and no key point detection is performed on it; that is, a target image that does not meet the preset tracking confidence is not input into the human body key point detection model. The preset confidence score may be a preset value, e.g., 95%, which the application does not limit.
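A minimal sketch of this gating logic, assuming a hypothetical `tracking_model` callable that returns a confidence score and the 95% threshold mentioned above:
```python
# Illustrative sketch: keep the target image only if its tracking confidence
# meets the preset confidence score; otherwise the caller re-crops an updated
# target image from the original image.
def filter_target_image(tracking_model, target_img, threshold=0.95):
    score = tracking_model(target_img)  # confidence that a complete body is present
    return target_img if score >= threshold else None
```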
In some embodiments of the application, the human body key point detection model may include a feature extraction network, which may be any of various existing networks for extracting features from images, e.g., a residual network (ResNet) or a convolutional neural network (CNN). In addition, the human body key point detection model may further include at least one key point detection network, each corresponding to a body part. Parts of the human body may include, but are not limited to, the head, the upper limbs, and the lower limbs, and the key points of different parts may differ: for the head, the positions of the left ear, right ear, left eye, right eye, mouth, etc. may be determined as key point positions; for the upper limbs, the positions of the wrist, elbow, etc. may be determined as key point positions. The number and positions of the key points of each part can be determined flexibly according to actual needs; the application does not limit this.
In some embodiments of the present application, after the target image is input into the human body key point detection model, the first human body key point coordinates of the human body in the target image are output, for example, the coordinates corresponding to an elbow.
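By way of illustration only, the sketch below shows one publicly available network that could stand in for such a preset human body key point detection model: torchvision's Keypoint R-CNN, pretrained on COCO with 17 human key points. The choice of network, the function name, and the handling of the highest-scoring detection are assumptions of this sketch, not requirements of the application.
```python
# Illustrative sketch: detecting first human body key point coordinates in the
# cropped target image with torchvision's pretrained Keypoint R-CNN.
import torch
from torchvision.models.detection import keypointrcnn_resnet50_fpn

# Newer torchvision versions prefer the `weights=...` argument instead.
model = keypointrcnn_resnet50_fpn(pretrained=True).eval()

def detect_first_keypoints(target_img_tensor):
    # target_img_tensor: float tensor (3, H, W) scaled to [0, 1], i.e. the
    # cropped target image; coordinates come back in the target coordinate
    # system, whose origin is the upper-left corner of the crop.
    with torch.no_grad():
        out = model([target_img_tensor])[0]
    if len(out["keypoints"]) == 0:
        return None                      # no human detected in the crop
    return out["keypoints"][0, :, :2]    # (17, 2) (x, y) of the top detection
```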
Step 205: the second human body key point coordinates are obtained according to the first human body key point coordinates and the preset position information.
In some embodiments of the present application, the first human body key point coordinates are expressed in the target coordinate system while the preset position information is expressed in the original coordinate system, and the second human body key point coordinates are the coordinates, in the original coordinate system, of the key points detected in the target image.
Fig. 4 is a schematic diagram of an original coordinate system and a target coordinate system provided in an embodiment of the present application. As shown in Fig. 4, the coordinate system established with O as the origin is the original coordinate system, and the coordinate system established with point A as the origin is the target coordinate system. The preset position information in the original coordinate system may be the coordinates of point A in the original coordinate system, e.g., A (X1, Y1), and the first key point coordinates are the coordinates of a key point in the target coordinate system, e.g., point C (X3, Y3). The abscissa X1 of point A is taken as the first offset and the ordinate Y1 as the second offset; the first sum X1+X3 of the abscissa X3 of point C and the first offset X1 is calculated, and the second sum Y1+Y3 of the ordinate Y3 of point C and the second offset Y1 is calculated, so that the second human body key point coordinates are (X1+X3, Y1+Y3). The coordinates in the original coordinate system of every key point detected in the target image, i.e., the second human body key point coordinates, are calculated in this way.
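A minimal sketch of this coordinate mapping, using the notation of Fig. 4:
```python
# Illustrative sketch: mapping a first human key point coordinate C(X3, Y3),
# expressed in the target coordinate system, into the original coordinate
# system using the preset position A(X1, Y1) of the crop.
def to_original_coords(first_keypoints, preset_position):
    x1, y1 = preset_position
    # second key point = (X1 + X3, Y1 + Y3) for each detected key point
    return [(x1 + x3, y1 + y3) for (x3, y3) in first_keypoints]
```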
Step 206: a target bounding box of the human body is determined based on the second human body key point coordinates.
In some embodiments of the present application, after the coordinates of each key point of the human body in the original coordinate system are obtained, the coordinate positions of the boundary of the human body are determined from the second human body key point coordinates of the key points in the original coordinate system. For example, when the human body stands in a T pose, the boundary positions of the human body may include the positions of the head, the hands, and the feet. The target bounding box of the human body can then be drawn according to these boundary coordinate positions; that is, the preset bounding box obtained in step 202 is updated.
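A minimal sketch of this step; the fixed pixel margin is an assumption added so that the box is not tangent to the extreme key points:
```python
# Illustrative sketch: deriving the target bounding box of the human body
# from the second human key point coordinates in the original coordinate system.
def bbox_from_keypoints(keypoints, margin=10):
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    return (min(xs) - margin, min(ys) - margin,
            max(xs) + margin, max(ys) + margin)
```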
Step 207: based on the target bounding box, the third human body key point coordinates are acquired in all images following the original image in the image sequence.
In some embodiments of the present application, assume that there are M original images in the image sequence. Based on steps 201 to 206, the second human body key point coordinates and the target bounding box of the first original image are calculated. When the second original image is processed, the target bounding box of the first original image may be used as the preset bounding box of the second original image to obtain the target image corresponding to the second original image. In general, the target bounding box corresponding to the N-th original image is obtained, and the target image corresponding to the (N+1)-th original image is cropped with it, where 1 ≤ N ≤ M-1, so that the third human body key point coordinates corresponding to the (N+1)-th original image can be calculated. The human body key point coordinates (including the second human body key point coordinates and the third human body key point coordinates) of each original image in the image sequence are obtained iteratively in this way, each key point receiving its own coordinates, and the key point coordinates obtained on this basis have higher accuracy.
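A minimal sketch of this iteration, where the hypothetical `detect_keypoints` stands for steps 202 to 205 applied to a single image and `bbox_from_keypoints` is the helper sketched above:
```python
# Illustrative sketch: the iterative detection loop over the image sequence.
# The target bounding box of the N-th original image is reused as the preset
# bounding box of the (N+1)-th image, so detection only runs inside the box.
def detect_sequence(images, initial_bbox, detect_keypoints):
    all_keypoints = []
    bbox = initial_bbox
    for image in images:                            # images 1..M of the sequence
        keypoints = detect_keypoints(image, bbox)   # coords in the original frame
        bbox = bbox_from_keypoints(keypoints)       # updated target bounding box
        all_keypoints.append(keypoints)
    return all_keypoints
```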
Step 208: the acquired second human body key point coordinates and third human body key point coordinates are input into an inverse kinematics model to obtain the corresponding rotation data.
In some embodiments of the application, the inverse kinematics (IK) model reflects, for example, a motion chain from the hand to the shoulder: the motion starts at the free end, the hand, and when the hand moves it naturally drives the motion of the fixed end, the shoulder; for example, the key point of the elbow can be used to determine the arm motion and thus the shoulder motion.
In some embodiments of the present application, the second human body key point coordinates and the third human body key point coordinates corresponding to the image sequence are input into the inverse kinematics model to obtain the rotation data corresponding to each joint point of the human body, for example, the rotation data corresponding to the elbow.
Step 209: the parameterized human model is driven based on the rotation data to obtain the action of the human body.
In some embodiments of the application, the parameterized human model may be a body geometry model represented as a three-dimensional mesh; it may be used for automatic body stature measurement, for designing corresponding garments, or as the user's avatar in an electronic application (e.g., a video game).
In some embodiments of the present application, after the rotation data is obtained, the rotation data is input into the parameterized human model to drive the parameterized human model to move, so that the action of the human body is obtained.
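As a final illustration, the sketch below drives one well-known parameterized human model, SMPL, through the third-party smplx package; the package choice, the model path, and the zero-initialized pose are assumptions of this sketch (SMPL model files must be obtained separately), and the application itself does not prescribe any particular parameterized model.
```python
# Illustrative sketch: driving a parameterized human model (here SMPL via the
# third-party `smplx` package, an assumption of this sketch) with per-joint
# rotation data such as that produced by the inverse kinematics model.
import torch
import smplx

model = smplx.create("models/", model_type="smpl")  # path to SMPL files (assumed)
body_pose = torch.zeros(1, 69)     # axis-angle rotations of the 23 body joints
global_orient = torch.zeros(1, 3)  # root (pelvis) rotation
# ... fill body_pose / global_orient with the rotation data from step 208 ...
output = model(body_pose=body_pose, global_orient=global_orient)
vertices = output.vertices         # posed 3D mesh representing the human action
```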
In the embodiments of the application, first, continuously updating the preset bounding box simplifies the image detection process and improves the precision of key point detection; second, after the target image is obtained, calculating its confidence ensures that every target image input into the human body key point detection model contains a complete human body, avoiding certain key point detection errors; finally, all obtained human body key point coordinates are input into the inverse kinematics model to obtain the rotation data for driving the parameterized model, which improves the precision of human body motion detection and realizes the capture of human motion from video.
Fig. 5 is a schematic structural diagram of a human body action detection device 5 based on deep learning according to an embodiment of the present application. As shown in Fig. 5, the device 5 may be divided into a plurality of functional modules according to the functions it performs, including: an acquisition module 51, a first cropping module 52, a construction module 53, a detection module 54, a calculation module 55, a determining module 56, a second cropping module 57, an output module 58, and a driving module 59.
The acquisition module 51 is configured to acquire an image sequence corresponding to the target video and acquire an original image from the image sequence.
The first cropping module 52 is configured to crop a target image from the original image and acquire preset position information corresponding to the target image in the original image.
The construction module 53 is configured to establish a target coordinate system based on the target image.
The detection module 54 is configured to input the target image into a preset human body key point detection model and detect the first human body key point coordinates, in the target coordinate system, corresponding to the human body in the target image.
The calculation module 55 is configured to obtain the second human body key point coordinates according to the first human body key point coordinates and the preset position information.
The determining module 56 is configured to determine a target bounding box of the human body based on the second human body key point coordinates.
The second cropping module 57 is configured to acquire, based on the target bounding box, the third human body key point coordinates in all images following the original image in the image sequence.
The output module 58 is configured to input the acquired second human body key point coordinates and third human body key point coordinates into the inverse kinematics model to obtain the corresponding rotation data.
The driving module 59 is configured to drive the parameterized human model based on the rotation data to obtain the action of the human body.
In some embodiments of the present application, the acquiring, based on the target bounding box, third human body key point coordinates in all images following the original image in the image sequence includes: acquiring a target bounding box corresponding to the N-th original image; cropping a target image corresponding to the (N+1)-th original image based on the target bounding box corresponding to the N-th original image, where the image sequence contains M original images and 1 ≤ N ≤ M-1; and obtaining third human body key point coordinates corresponding to the (N+1)-th original image based on the target image corresponding to the (N+1)-th original image.
In some embodiments of the present application, before the target image is input into the preset human body key point detection model, the method further includes: calculating a tracking confidence of the target image; inputting the target image into the human body key point detection model when the tracking confidence of the target image meets a preset tracking confidence; and, when the tracking confidence of the target image does not meet the preset tracking confidence, discarding the target image and returning to the step of cropping a target image from the original image, so as to acquire an updated target image.
In some embodiments of the application, the calculating a tracking confidence of the target image includes: inputting the target image into a preset human body tracking model to obtain a confidence score of the target image; determining that the target image meets the preset tracking confidence when the confidence score is greater than or equal to a preset confidence score; and determining that the target image does not meet the preset tracking confidence when the confidence score is smaller than the preset confidence score.
In some embodiments of the application, the method further includes: acquiring an original bounding box corresponding to the original image; enlarging the original bounding box based on a preset enlargement ratio; and establishing an original coordinate system corresponding to the original image based on the enlarged original bounding box.
In some embodiments of the present application, the cropping a target image from the original image and acquiring preset position information corresponding to the target image in the original image includes: detecting a human body within the enlarged original bounding box based on an instance segmentation algorithm, and determining a preset bounding box of the human body; and cropping the target image based on the preset bounding box, and acquiring preset position information corresponding to the preset bounding box in the original coordinate system.
In some embodiments of the application, the method further includes: acquiring an abscissa and an ordinate of a preset position from the preset position information; taking the abscissa of the preset position as a first offset; and taking the ordinate of the preset position as a second offset.
In some embodiments of the present application, the obtaining second human body key point coordinates according to the first human body key point coordinates and the preset position information includes: calculating a first sum of the abscissa of the first human body key point coordinates and the first offset; calculating a second sum of the ordinate of the first human body key point coordinates and the second offset; and obtaining the second human body key point coordinates in the original coordinate system based on the first sum and the second sum.
The human body action detection device based on deep learning provided in this embodiment can execute the above method embodiments; its implementation principles and technical effects are similar and are not repeated here.
With continued reference to fig. 1, in the present embodiment, the memory 11 may be an internal memory of the electronic device 1, that is, a memory built into the electronic device 1. In other embodiments, the memory 11 may also be an external memory of the electronic device 1, i.e. a memory external to the electronic device 1.
In some embodiments, the memory 11 is used to store program code and various data and to enable high-speed, automatic access to programs or data during operation of the electronic device 1.
The memory 11 may include random access memory, and may also include non-volatile memory such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.
In one embodiment, the processor 12 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any other conventional processor.
If implemented in the form of software functional units and sold or used as an independent product, the program code and data in the memory 11 may be stored in a computer-readable storage medium. Based on this understanding, the present application may implement all or part of the flow of the methods in the above embodiments, for example the human body action detection method based on deep learning, by instructing the relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each method embodiment described above. The computer program includes computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), or the like.
It will be appreciated that the above division into modules is a division by logical function, and other divisions may be adopted in practice. In addition, each functional module in the embodiments of the present application may be integrated in the same processing unit, each module may exist alone physically, or two or more modules may be integrated in the same unit. The integrated modules may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
Finally, it should be noted that the above embodiments are merely intended to illustrate, not to limit, the technical solution of the present application. Although the present application has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications and equivalent substitutions may be made to the technical solution of the present application without departing from its spirit and scope.

Claims (10)

1. A human body action detection method based on deep learning, characterized by comprising:
acquiring an image sequence corresponding to a target video, and acquiring an original image from the image sequence;
cropping a target image from the original image, and acquiring preset position information corresponding to the target image in the original image;
establishing a target coordinate system based on the target image;
inputting the target image into a preset human body key point detection model, and detecting first human body key point coordinates, in the target coordinate system, corresponding to a human body in the target image;
obtaining second human body key point coordinates according to the first human body key point coordinates and the preset position information;
determining a target bounding box of the human body based on the second human body key point coordinates;
acquiring third human body key point coordinates in all images following the original image in the image sequence based on the target bounding box;
inputting the acquired second human body key point coordinates and third human body key point coordinates into an inverse kinematics model to obtain corresponding rotation data; and
driving a parameterized human model based on the rotation data to obtain the action of the human body.
2. The method of claim 1, wherein the acquiring third human body key point coordinates in all images following the original image in the image sequence based on the target bounding box comprises:
acquiring a target bounding box corresponding to an N-th original image;
cropping a target image corresponding to an (N+1)-th original image based on the target bounding box corresponding to the N-th original image, wherein the image sequence contains M original images and 1 ≤ N ≤ M-1; and
obtaining third human body key point coordinates corresponding to the (N+1)-th original image based on the target image corresponding to the (N+1)-th original image.
3. The method of claim 1, wherein before inputting the target image into the preset human body key point detection model, the method further comprises:
calculating a tracking confidence of the target image;
inputting the target image into the human body key point detection model when the tracking confidence of the target image meets a preset tracking confidence; and
when the tracking confidence of the target image does not meet the preset tracking confidence, discarding the target image and returning to the step of cropping a target image from the original image, so as to acquire an updated target image.
4. The method of claim 3, wherein the calculating a tracking confidence of the target image comprises:
inputting the target image into a preset human body tracking model to obtain a confidence score of the target image;
determining that the target image meets the preset tracking confidence when the confidence score is greater than or equal to a preset confidence score; and
determining that the target image does not meet the preset tracking confidence when the confidence score is smaller than the preset confidence score.
5. The method according to claim 1, further comprising:
acquiring an original bounding box corresponding to the original image;
enlarging the original bounding box corresponding to the original image based on a preset enlargement ratio; and
establishing an original coordinate system corresponding to the original image based on the enlarged original bounding box.
6. The method according to claim 5, wherein the cropping a target image from the original image and acquiring preset position information corresponding to the target image in the original image comprises:
detecting a human body within the enlarged original bounding box based on an instance segmentation algorithm, and determining a preset bounding box of the human body; and
cropping the target image based on the preset bounding box, and acquiring preset position information corresponding to the preset bounding box in the original coordinate system.
7. The method of claim 6, further comprising:
acquiring an abscissa and an ordinate of a preset position from the preset position information;
taking the abscissa of the preset position as a first offset; and
taking the ordinate of the preset position as a second offset.
8. The method of claim 7, wherein the obtaining second human body key point coordinates according to the first human body key point coordinates and the preset position information comprises:
calculating a first sum of the abscissa of the first human body key point coordinates and the first offset;
calculating a second sum of the ordinate of the first human body key point coordinates and the second offset; and
obtaining the second human body key point coordinates in the original coordinate system based on the first sum and the second sum.
9. An electronic device, comprising a processor and a memory, wherein the processor is configured to execute a computer program stored in the memory to implement the human body action detection method based on deep learning of any one of claims 1 to 8.
10. A computer-readable storage medium storing at least one instruction which, when executed by a processor, implements the human body action detection method based on deep learning of any one of claims 1 to 8.
CN202310481601.8A 2023-04-27 2023-04-27 Human body action detection method, device and storage medium based on deep learning Pending CN116682170A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310481601.8A CN116682170A (en) 2023-04-27 2023-04-27 Human body action detection method, device and storage medium based on deep learning

Publications (1)

Publication Number Publication Date
CN116682170A 2023-09-01

Family

ID=87786175

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310481601.8A Pending CN116682170A (en) 2023-04-27 2023-04-27 Human body action detection method, device and storage medium based on deep learning

Country Status (1)

Country Link
CN (1) CN116682170A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination