CN116189290A - Method and device for detecting waving hand - Google Patents

Method and device for detecting waving hand

Info

Publication number
CN116189290A
Authority
CN
China
Prior art keywords
human, frame, image, detected, detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211735182.8A
Other languages
Chinese (zh)
Inventor
薛若晨
朱媛媛
丁美玉
刘文庭
董鹏宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Fullhan Microelectronics Co ltd
Original Assignee
Shanghai Fullhan Microelectronics Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Fullhan Microelectronics Co ltd filed Critical Shanghai Fullhan Microelectronics Co ltd
Priority to CN202211735182.8A priority Critical patent/CN116189290A/en
Publication of CN116189290A publication Critical patent/CN116189290A/en
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75: Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a method and a device for detecting hand waving. A real-time video sequence containing continuous multi-frame images to be detected is processed frame by frame, and a plurality of portraits in each image to be detected are selected with humanoid detection frames; the humanoid detection frames in front and rear frames are matched, so that the frames representing the same portrait in adjacent frames correspond to each other; a plurality of human body key points representing the posture of the portrait are acquired in each humanoid detection frame; and whether the corresponding portrait performs a hand waving motion is judged according to the position changes of the human body key points within the selected humanoid detection frame over a set number of consecutive frames. By establishing the correspondence between the humanoid detection frames of the same portrait across continuous frames, the invention realizes continuous multi-person hand waving detection in a real-time video sequence, improving detection sensitivity and reducing the false detection rate.

Description

Method and device for detecting waving hand
Technical Field
The invention relates to the technical field of hand waving detection, in particular to a hand waving detection method and device.
Background
Waving is a communication gesture widely used in daily life and has a strong ideographic function. With the development of computer technology, hand waving detection, being natural, convenient and contact-free to operate, has gradually been applied to scenes such as household appliance control, interactive display, game control, dialing and alarming on home and commercial cameras, and unmanned aerial vehicle search and rescue.
Fig. 1 is a flowchart of an existing method for detecting hand waving. Referring to fig. 1, a video stream is split by a video processor into a series of single-frame pictures arranged in time order, and each frame is sent in sequence to a pedestrian detection device. The pedestrian detection device extracts histogram-of-oriented-gradients (HOG) features from positive and negative samples and trains an SVM classifier on them. The pictures sent to the pedestrian detection device are normalized to the same image size (for example, 108×36 pixels); the detected candidate frames are sorted by confidence, and frames with a score greater than 0.7 are taken as pedestrian detection frames (if several candidates exist in a picture, the one with the highest confidence is used as the object of the first round of waving detection). After a pedestrian detection frame is determined, a waving detection window (36×36 pixels) is placed at its upper left, with the top-left vertices of the two windows offset by 12 pixels on both the x axis and the y axis. All pixel values outside the waving detection window are set to 0. Starting from the nth frame image (n ≥ 2), a three-frame difference is used to detect the motion of the nth frame image, calculated as follows:
D_n(x, y) = P_{n−1}(x, y) − 2·P_n(x, y) + P_{n+1}(x, y); (1)
wherein D_n(x, y) represents the difference information corresponding to the nth frame image, and P_{n−1}(x, y), P_n(x, y) and P_{n+1}(x, y) represent the image information of the (n−1)th, nth and (n+1)th frame images, respectively.
Subsequently, D_n(x, y) is binarized using Otsu's method, obtaining the binary motion information A_n(x, y) of the image. Next, a motion history image H_n(x, y) is defined, and the angle matrix of the image (i.e., the gradient direction of each pixel) is calculated with the Sobel operator. The angle matrix is traversed to accumulate a gradient histogram, and the angle corresponding to the maximum of the gradient histogram is taken as the main direction of motion of the image. If the main direction of motion lies within a first threshold range (e.g., 46 to 134 degrees), a right-hand wave is determined; if it lies within a second threshold range (e.g., 226 to 314 degrees), a left-hand wave is determined. When the number of left-hand or right-hand waves judged over the current N images reaches N/2, a hand waving motion is determined to exist; otherwise it is not. If no wave is found for the pedestrian detection frame with the highest confidence, the frame with the second-highest confidence is put through the same flow again, and so on until a wave is found or all pedestrian detection frames have been traversed.
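The three-frame difference of equation (1) and the subsequent binarization can be sketched in numpy as follows; a fixed threshold stands in for Otsu's method here, and all function and parameter names are illustrative:

```python
import numpy as np

def three_frame_difference(prev_frame, cur_frame, next_frame):
    """Compute D_n(x, y) = P_{n-1} - 2*P_n + P_{n+1} per equation (1)."""
    prev_frame = prev_frame.astype(np.int32)
    cur_frame = cur_frame.astype(np.int32)
    next_frame = next_frame.astype(np.int32)
    return prev_frame - 2 * cur_frame + next_frame

def binarize(diff, threshold=20):
    """Fixed-threshold stand-in for Otsu binarization, yielding A_n(x, y)."""
    return (np.abs(diff) > threshold).astype(np.uint8)

# Tiny example: a bright pixel appears in the middle frame only.
p0 = np.zeros((4, 4), dtype=np.uint8)
p1 = np.zeros((4, 4), dtype=np.uint8); p1[1, 1] = 200
p2 = np.zeros((4, 4), dtype=np.uint8)
motion = binarize(three_frame_difference(p0, p1, p2))
```

Only the pixel that changed between frames survives the difference and threshold, which is what makes the subsequent gradient-direction statistics focus on moving regions.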
However, the above waving judgment is based on a historical image sequence and cannot detect waving in real-time video. In addition, the existing waving detection method only performs a simple waving judgment, so the detection sensitivity is low and there is a risk of false detection. The waving detection in the Microsoft Kinect motion-sensing games adopts a 3D key point model, so the predicted key points are more accurate and stable, jump less, and yield higher waving detection sensitivity; however, the 3D model has a large number of parameters and is unsuitable for model deployment on the hardware side.
In view of this, there is a need for a method to continuously perform multi-person waving detection in a real-time video sequence and to reduce the false detection rate while improving the detection sensitivity.
Disclosure of Invention
The invention aims to provide a method and a device for detecting waving hands, which are used for continuously detecting waving hands of multiple persons in a real-time video sequence, so that the detection sensitivity is improved, and the false detection rate is reduced.
In order to achieve the above object, the present invention provides a method for detecting waving hands, comprising:
acquiring a real-time video sequence, wherein the real-time video sequence comprises continuous multi-frame images to be detected;
processing the image to be detected frame by frame, and respectively selecting a plurality of portraits in the image to be detected by using a human-shaped detection frame;
matching human form detection frames in the front and back frame images to be detected, so that human form detection frames used for representing the same human form in the front and back frame images to be detected correspond to each other;
acquiring a plurality of human body key points in the human body detection frame, wherein the human body key points are used for representing the gesture of a human figure; the method comprises the steps of,
and judging whether the corresponding portrait has waving motion or not according to the position change condition of the key points of the human body in the selected portrait detection frame in the images to be detected within the continuous set frame number.
Optionally, a simplified VGG network model is adopted to process the image to be detected frame by frame, the processing comprising:
two convolutions with two layers of 64×3×3 convolution kernels, followed by ReLU activation and a max pooling layer, changing the output size of the image to 224×224×64;
three convolutions with three layers of 128×3×3 convolution kernels, followed by ReLU activation and a max pooling layer, changing the output size of the image to 112×112×128; and,
three convolutions with three layers of 512×3×3 convolution kernels, followed by ReLU activation and a max pooling layer, changing the output size of the image to 56×56×512.
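The stage-by-stage sizes listed above can be sanity-checked with a small shape-tracking sketch. It assumes stride-1, padding-1 3×3 convolutions (which preserve spatial size) and 2×2 max pooling (which halves it), an assumption consistent with the stated 448 → 224 → 112 → 56 progression:

```python
def track_shapes(size=448, stages=((2, 64), (3, 128), (3, 512))):
    """Return the (H, W, C) output of each conv-block + max-pool stage.

    Each stage applies `n_convs` 3x3 convolutions (stride 1, padding 1,
    so spatial size is preserved) and one 2x2 max pool halving H and W.
    """
    shapes = []
    for n_convs, out_channels in stages:
        # The convolutions set the channel count; pooling halves the size.
        size //= 2
        shapes.append((size, size, out_channels))
    return shapes

print(track_shapes())
```

Starting from a 448×448×3 input, the three stages reproduce the 224×224×64, 112×112×128 and 56×56×512 feature layers named in the text.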
Optionally, the loss function of the simplified VGG network model includes a classification loss function and a positioning loss function, wherein the classification loss function is:
L_cls = −(1/N) · Σ_{i=1}^{N} [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ]
wherein N represents the number of training samples input into the simplified VGG network model, y_i represents the label of the ith training sample (this being a binary classification problem, the positive class is 1 and the negative class is 0), and p_i represents the probability that the ith sample is predicted to be positive;
the positioning loss function is:
L_loc = Σ_j smoothL1(x_j), where smoothL1(t) = 0.5t² if |t| < 1 and |t| − 0.5 otherwise,
wherein x is a row vector, x = [Δx, Δy, Δw, Δh]; Δx and Δy represent the differences, in the two coordinate directions, between the position coordinates of the actual humanoid detection frame in the training set and the position coordinates of the humanoid detection frame predicted by the simplified VGG network model; Δw represents the difference between the width of the actual frame in the training set and the width of the predicted frame; and Δh represents the difference between the height of the actual frame and the height of the predicted frame.
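A numpy sketch of the two losses as described: binary cross-entropy for classification, and a smooth-L1 penalty over the row vector x = [Δx, Δy, Δw, Δh] for localization (the smooth-L1 form is an assumption consistent with common detection losses, not confirmed by the patent text):

```python
import numpy as np

def classification_loss(y, p, eps=1e-12):
    """Binary cross-entropy over N samples: labels y in {0, 1}, probs p."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def localization_loss(x):
    """Smooth-L1 over the row vector x = [dx, dy, dw, dh]."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.sum(np.where(x < 1.0, 0.5 * x ** 2, x - 0.5))

cls = classification_loss([1, 0], [0.9, 0.1])
loc = localization_loss([0.5, -0.5, 2.0, 0.0])
```

Smooth-L1 behaves quadratically for small coordinate errors and linearly for large ones, which keeps large box mismatches from dominating the gradient.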
Optionally, whether the position of the humanoid detection frame is accurate is determined according to the intersection over union between the position of the humanoid detection frame output by the simplified VGG network model and the actual position of the humanoid detection frame.
Optionally, the process of matching the human form detection frames in the front and rear frame images to be detected so that the human form detection frames used for representing the same human form in the front and rear frame images to be detected correspond to each other includes:
selecting a humanoid detection frame in the current frame of the image to be detected, and obtaining the predicted position of the selected frame in the next frame of the image to be detected according to its current position;
respectively obtaining the intersection over union between the predicted position and the actual positions of all candidate humanoid detection frames in the next frame of the image to be detected; if all these values are smaller than an association threshold, the selected frame has no corresponding humanoid detection frame in the next image; otherwise, the candidate frame with the largest intersection over union corresponds to the same portrait as the selected frame.
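The association step above can be sketched as the intersection over union between the predicted box and each candidate box in the next frame, keeping the best match only if it exceeds an association threshold; the (x1, y1, x2, y2) box format and the threshold value are illustrative:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def match_box(predicted, candidates, assoc_threshold=0.3):
    """Return the index of the candidate matching `predicted`, or None."""
    scores = [iou(predicted, c) for c in candidates]
    best = max(range(len(scores)), key=lambda i: scores[i]) if scores else None
    if best is None or scores[best] < assoc_threshold:
        return None  # no corresponding frame in the next image
    return best

# The second candidate overlaps the prediction; the first is disjoint.
idx = match_box((10, 10, 50, 90), [(200, 10, 240, 90), (12, 8, 52, 88)])
```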
Optionally, the human body keypoints comprise one or more of an elbow keypoint, a wrist keypoint, a neck keypoint, a left shoulder keypoint, a right shoulder keypoint, a hip keypoint, a knee keypoint, and an ankle keypoint.
Optionally, an improved VoVNet network model is adopted to acquire the plurality of human body key points in the humanoid detection frame, the specific process including:
two convolutions with 3×3×64 kernels and one convolution with a 3×3×128 kernel, followed by max-pooling downsampling, changing the output size of the image to 112×112×128;
five convolutions with five layers of 3×3×64 kernels, the result of each convolution being concatenated along the last dimension, followed by one 1×1×128 convolution and a max pooling layer with stride 2, changing the output size of the image to 56×56×128;
five convolutions with five layers of 3×3×80 kernels, the result of each convolution being concatenated along the last dimension, followed by one 1×1×256 convolution and a max pooling layer with stride 2, changing the output size of the image to 28×28×256;
five convolutions with five layers of 3×3×96 kernels, the result of each convolution being concatenated along the last dimension, followed by one 1×1×384 convolution and a max pooling layer with stride 2, changing the output size of the image to 14×14×384;
five convolutions with five layers of 3×3×112 kernels, the result of each convolution being concatenated along the last dimension, followed by one 1×1×512 convolution and a max pooling layer with stride 2, changing the output size of the image to 7×7×512; and,
outputting the position coordinates and the visibility of the human body key points through 7×7×26 and 7×7×13 convolution kernels.
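The two output heads (7×7×26 and 7×7×13) are consistent with 13 key points, each having an (x, y) coordinate pair plus one visibility value; a small decoding sketch under that assumption, with the flat-vector layout being illustrative:

```python
import numpy as np

def decode_keypoints(coords_head, vis_head, num_keypoints=13):
    """Split head outputs into per-keypoint (x, y) coordinates and visibility.

    coords_head: flat vector of length 2 * num_keypoints (x1, y1, x2, y2, ...)
    vis_head:    flat vector of length num_keypoints
    """
    coords = np.asarray(coords_head, dtype=float).reshape(num_keypoints, 2)
    visibility = np.asarray(vis_head, dtype=float)
    return coords, visibility

coords, vis = decode_keypoints(np.arange(26), np.ones(13))
```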
Optionally, the loss function of the improved vovnet network model is:
L_kp = Σ_j smoothL1(x_j), where smoothL1(t) = 0.5t² if |t| < 1 and |t| − 0.5 otherwise,
wherein x is a row vector containing the differences, in the two coordinate directions, between the actual position coordinates of all the human body key points and the position coordinates predicted by the improved vovnet network model.
Optionally, the prediction index of the improved vovnet network model is:
OKS_p = Σ_i [ exp(−d_{pi}² / (2·S_p·σ_i²)) · δ(v_{pi} > 0) ] / Σ_i δ(v_{pi} > 0)
wherein p represents the serial number of the humanoid detection frame in the image to be detected, i represents the serial number of a human body key point in the pth humanoid detection frame, d_{pi} represents the distance between the predicted and actual positions of the ith key point in the pth frame, v_{pi} represents the visibility of the ith human body key point in the pth humanoid detection frame, S_p represents the area size of the pth humanoid detection frame, and σ_i represents the normalization factor of the ith human body key point.
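The prediction index described here matches the standard object keypoint similarity (OKS) used in pose estimation; a numpy sketch under that assumption, where `area` stands for the detection-frame area and `sigmas` for the per-keypoint normalization factors:

```python
import numpy as np

def oks(pred, gt, visibility, area, sigmas):
    """Object keypoint similarity for one humanoid detection frame.

    pred, gt:   (K, 2) predicted and ground-truth keypoint coordinates
    visibility: (K,) visibility flags (> 0 means the keypoint is labeled)
    area:       area of the humanoid detection frame
    sigmas:     (K,) per-keypoint normalization factors
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    d2 = np.sum((pred - gt) ** 2, axis=1)          # squared distances
    e = np.exp(-d2 / (2.0 * area * np.asarray(sigmas) ** 2))
    mask = np.asarray(visibility) > 0              # only labeled keypoints count
    return float(e[mask].mean()) if mask.any() else 0.0

# Perfect predictions give an OKS of 1 over the visible keypoints.
score = oks([[0, 0], [5, 5]], [[0, 0], [5, 5]], [1, 1], area=100.0, sigmas=[0.1, 0.1])
```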
Optionally, judging whether the corresponding portrait performs a hand waving motion according to the position changes of the human body key points in the selected humanoid detection frame, over the images to be detected within a set number of consecutive frames, involves three judgment conditions. The first condition is whether the wrist key point stays continuously higher than the elbow key point in the vertical direction; the second is whether the included angle formed at the elbow by the wrist key point, the elbow key point, and the near-side shoulder key point (left or right) lies within a set angle threshold; the third is whether the horizontal positions of the elbow and wrist key points move periodically over time. If all three judgment conditions hold, a hand waving motion exists in the images to be detected within the set number of consecutive frames.
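The three judgment conditions can be sketched over a short per-frame key point track. The coordinate convention assumes image y grows downward (so "wrist above elbow" means a smaller y), and the angle range and the periodicity test via direction reversals are illustrative choices:

```python
import math

def is_waving(track, angle_range=(30.0, 150.0), min_reversals=2):
    """track: list of dicts with 'wrist', 'elbow', 'shoulder' as (x, y)."""
    # Condition 1: wrist continuously above the elbow (smaller y).
    if not all(f['wrist'][1] < f['elbow'][1] for f in track):
        return False
    # Condition 2: elbow angle (wrist-elbow-shoulder) within the threshold.
    for f in track:
        v1 = (f['wrist'][0] - f['elbow'][0], f['wrist'][1] - f['elbow'][1])
        v2 = (f['shoulder'][0] - f['elbow'][0], f['shoulder'][1] - f['elbow'][1])
        cos = ((v1[0] * v2[0] + v1[1] * v2[1]) /
               (math.hypot(*v1) * math.hypot(*v2)))
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos))))
        if not angle_range[0] <= angle <= angle_range[1]:
            return False
    # Condition 3: periodic horizontal wrist motion (direction reversals).
    xs = [f['wrist'][0] for f in track]
    deltas = [b - a for a, b in zip(xs, xs[1:]) if b != a]
    reversals = sum(1 for a, b in zip(deltas, deltas[1:]) if a * b < 0)
    return reversals >= min_reversals

# Wrist oscillating left and right above the elbow, arm bent near 90 degrees.
track = [{'wrist': (10 + dx, 0), 'elbow': (10, 10), 'shoulder': (0, 10)}
         for dx in (-5, 5, -5, 5, -5)]
waving = is_waving(track)
```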
Optionally, before applying the judgment conditions, the method further comprises estimating the distance between the corresponding portrait and the camera according to the types and number of human body key points contained in the selected humanoid detection frame, and setting the angle threshold according to that distance.
Correspondingly, the invention also provides a hand waving detection device that performs detection using the above hand waving detection method, the device comprising:
the human shape detection module is used for setting a human shape detection frame on the human shape in the input image to be detected;
the humanoid tracking module, comprising a motion estimation unit, a data association unit, and a tracking-target creation and destruction unit; the motion estimation unit obtains the predicted position of a selected humanoid detection frame in the next frame of the image to be detected according to its position in the current frame; the data association unit matches the predicted position against the actual positions of all candidate humanoid detection frames in the next frame, so that frames representing the same portrait in adjacent frames correspond to each other; and the tracking-target creation and destruction unit marks portraits that appear in the image to be detected and destroys the marks of portraits that leave it;
The key point detection module is used for acquiring position coordinates and visibility of key points of a human body in the humanoid detection frame;
the hand waving detection module is used for judging whether corresponding figures have hand waving movement according to the position change condition of the key points of the human body in the selected human body detection frame.
In summary, the present invention provides a method and a device for detecting hand waving. A real-time video sequence containing continuous multi-frame images to be detected is processed frame by frame, and a plurality of portraits in each image to be detected are selected with humanoid detection frames; the humanoid detection frames in front and rear frames are matched so that frames representing the same portrait in adjacent frames correspond to each other; a plurality of human body key points representing the posture of the portrait are acquired in each humanoid detection frame; and whether the corresponding portrait performs a hand waving motion is judged according to the position changes of the human body key points within the selected humanoid detection frame over a set number of consecutive frames. By establishing the correspondence between the humanoid detection frames of the same portrait across continuous frames, the invention realizes continuous multi-person hand waving detection in a real-time video sequence, improving detection sensitivity and reducing the false detection rate.
Drawings
FIG. 1 is a flow chart of a method of detecting a waving hand;
FIG. 2 is a flowchart of a method for detecting waving hands according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a simplified VGG network model in the swing detection method according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a matching process of a humanoid detection frame in the waving detection method according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an improved vovnet network model in the waving detection method according to an embodiment of the present invention;
fig. 6 is a schematic diagram of determining whether there is a waving motion in the waving detection method according to an embodiment of the present invention;
fig. 7 is a schematic diagram of a hand waving detection device according to an embodiment of the present invention;
wherein, the reference numerals are as follows:
1-a waving detection device; 11-a humanoid detection module; 12-a humanoid tracking module; 13-a key point detection module; 14-a waving detection module.
Detailed Description
Specific embodiments of the present invention will be described in more detail below with reference to the drawings. The advantages and features of the present invention will become more apparent from the following description. It should be noted that the drawings are in a very simplified form and are all to a non-precise scale, merely for convenience and clarity in aiding in the description of embodiments of the invention.
Fig. 2 is a flowchart of a method for detecting waving hands according to an embodiment of the present invention. Referring to fig. 2, the method for detecting waving hand according to the present embodiment includes:
step S01: acquiring a real-time video sequence, wherein the real-time video sequence comprises continuous multi-frame images to be detected;
step S02: processing the image to be detected frame by frame, and respectively selecting a plurality of portraits in the image to be detected by using a human-shaped detection frame;
step S03: matching human form detection frames in the front and back frame images to be detected, so that human form detection frames used for representing the same human form in the front and back frame images to be detected correspond to each other;
step S04: acquiring a plurality of human body key points in the human body detection frame, wherein the human body key points are used for representing the gesture of a human figure; the method comprises the steps of,
step S05: and judging whether the corresponding portrait has waving motion or not according to the position change condition of the key points of the human body in the selected portrait detection frame in the images to be detected within the continuous set frame number.
Fig. 3 to fig. 6 are corresponding methods or schematic diagrams of each step in the hand swing detection method according to an embodiment of the present invention. The following describes the waving detection method according to the present embodiment in detail with reference to fig. 3 to 6.
Firstly, step S01 is executed to obtain a real-time video sequence, where the real-time video sequence includes a plurality of continuous frames of images to be detected. In this embodiment, the image to be measured is an RGB image of 448×448×3, and in other embodiments of the present invention, the size of the image to be measured may be adjusted according to practical situations, which is not limited in the present invention.
Next, referring to fig. 3, step S02 is executed to process the image to be tested frame by frame, and a plurality of portraits in the image to be tested are respectively frame-selected by using a human detection frame.
Specifically, a humanoid detection model (namely, the simplified VGG network model) processes the image to be detected frame by frame. The processing comprises: two convolutions with two layers of 64×3×3 kernels, followed by ReLU activation and max-pooling downsampling, changing the output size of the image to 224×224×64; three convolutions with three layers of 128×3×3 kernels, followed by ReLU activation and max-pooling downsampling, changing the output size to 112×112×128; and three convolutions with three layers of 512×3×3 kernels, followed by ReLU activation and max-pooling downsampling, changing the output size to 56×56×512. It should be noted that the number of feature channels increases after each downsampling; the network extracts three feature layers for generating the subsequent humanoid detection frames, and the simplified VGG network model finally outputs a series of position coordinates of humanoid detection frames larger than a set size threshold (the coordinates being those of the upper-left corner of each frame), together with the width and height of each frame.
Optionally, before processing the image to be detected with the simplified VGG network model, the method further comprises: inputting 448×448×3 RGB training images and outputting, through the simplified VGG network model, a series of position coordinates (together with widths and heights) of humanoid detection frames larger than the set size threshold. The training images are cropped from a large image processing data set (such as Microsoft COCO), and the robustness of the simplified VGG network model is enhanced by augmenting the training images with random rotation, random cropping, random occlusion and Gaussian blur.
In this embodiment, the loss function of the simplified VGG network model includes a classification loss function and a positioning loss function, where the classification loss function is a common cross entropy loss function, that is:
L_cls = −(1/N) · Σ_{i=1}^{N} [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ]
wherein N represents the number of training samples input into the simplified VGG network model, y_i represents the label of the ith training sample (this being a binary classification problem, the positive class is 1 and the negative class is 0), and p_i represents the probability that the ith sample is predicted to be positive;
the positioning loss function is:
L_loc = Σ_j smoothL1(x_j), where smoothL1(t) = 0.5t² if |t| < 1 and |t| − 0.5 otherwise,
wherein x is a row vector, x = [Δx, Δy, Δw, Δh]; Δx and Δy represent the differences, in the two coordinate directions, between the position coordinates of the actual humanoid detection frame in the training set and the position coordinates of the humanoid detection frame predicted by the simplified VGG network model; Δw represents the difference between the width of the actual frame in the training set and the width of the predicted frame; and Δh represents the difference between the height of the actual frame and the height of the predicted frame.
In this embodiment, the prediction index of the simplified VGG network model is calculated using the intersection over union (IoU), and whether the position of the humanoid detection frame is accurate is determined according to the IoU between the frame position output by the simplified VGG network model and the actual frame position. For example, if this IoU is greater than a threshold (e.g., 0.7), the position output by the simplified VGG network model is determined to be accurate.
Subsequently, referring to fig. 4, step S03 is performed to match the human form detection frames in the front and rear frame images to be detected, so that the human form detection frames used for representing the same human form in the front and rear frame images to be detected correspond to each other.
Specifically, a humanoid detection frame is first selected in the current frame of the image to be detected, and its predicted position in the next frame of the image to be detected is obtained according to its current position. At this point, the state of each humanoid detection frame in the image to be detected is:
x = [u, v, s, r, u′, v′, s′]ᵀ
wherein u and v represent the center position coordinates of the humanoid detection frame in the current frame of the image to be detected, s represents the area of the frame, r represents the aspect ratio of the frame, and u′, v′ and s′ represent the predicted center-position velocities and the predicted area velocity of the frame in the next frame of the image to be detected (namely, the rate of change from the current frame to the next frame).
It should be noted that the human-shaped detection box (Bounding Box) is used for updating the state of the corresponding human figure, and the velocity components are solved by using a Kalman filter. If the next frame of image to be detected has no humanoid detection frame associated with the portrait, a linear prediction model can be used to acquire the predicted center position of the real-time humanoid detection frame without correction.
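A minimal sketch of the constant-velocity prediction step implied by the state above (only the prediction half of the Kalman filter is shown; the measurement-update step and noise covariances are omitted, and the function names are illustrative):

```python
# State: [u, v, s, r, du, dv, ds] — center coordinates, area, aspect ratio,
# plus the velocities of u, v and s (r is assumed constant between frames).
def predict_state(state):
    u, v, s, r, du, dv, ds = state
    # Constant-velocity motion model, one frame ahead (dt = 1).
    return [u + du, v + dv, s + ds, r, du, dv, ds]

def state_to_box(state):
    """Convert [u, v, s, r, ...] back to (x, y, w, h) with top-left corner."""
    u, v, s, r = state[:4]
    w = (s * r) ** 0.5   # s = w * h and r = w / h  =>  w = sqrt(s * r)
    h = s / w
    return (u - w / 2, v - h / 2, w, h)
```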
Then, the intersection ratios between the predicted position (namely the predicted center position) and the actual positions (namely the actual center positions) of all the human form detection frames to be matched in the next frame of image to be detected are respectively acquired. If all the intersection ratios are smaller than an association threshold, the selected human form detection frame has no corresponding human form detection frame in the next frame of image to be detected; otherwise, the human form detection frame to be matched corresponding to the maximum intersection ratio corresponds to the same human figure as the selected human form detection frame. Optionally, the intersection ratios between the actual positions of all the human form detection frames to be matched in the next frame of image to be detected and the predicted position are obtained by calculating an assignment cost matrix, and multi-target matching between those actual positions and the predicted position is performed by means of the Hungarian algorithm.
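The association step can be sketched as follows; for brevity, an exhaustive search over assignments stands in for the Hungarian algorithm, which gives the same result for the small numbers of detection frames shown here (the function name and threshold value are illustrative assumptions):

```python
from itertools import permutations

def match_boxes(ious, threshold=0.3):
    """Match tracked frames (rows) to new detections (columns) by maximizing
    total IoU, rejecting any pair below the association threshold.
    `ious` is a precomputed IoU matrix.  The exhaustive search stands in for
    the Hungarian algorithm and is only practical for small matrices; the
    sketch assumes there are no more tracks than detections.
    Returns a list of (track_index, detection_index) pairs."""
    n_tracks, n_dets = len(ious), len(ious[0])
    best, best_score = [], -1.0
    for perm in permutations(range(n_dets), min(n_tracks, n_dets)):
        pairs = [(t, d) for t, d in zip(range(n_tracks), perm)
                 if ious[t][d] >= threshold]
        score = sum(ious[t][d] for t, d in pairs)
        if score > best_score:
            best, best_score = pairs, score
    return best
```

An unmatched column would then receive a new identity, and an unmatched row would eventually have its identity destroyed, as described below.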
It should be noted that, when a portrait appears in the image to be detected for the first time, the newly appearing portrait needs to be marked so as to obtain the corresponding humanoid detection frames of the portrait in the multi-frame images to be detected, and when the portrait leaves the image to be detected, the mark of the portrait needs to be destroyed. Optionally, if an unassociated humanoid detection frame exists in the image to be detected after a mark is destroyed, all humanoid detection frames whose intersection ratio is smaller than the association threshold need to be re-detected, so as to avoid an untracked portrait existing in the image to be detected.
Subsequently, referring to fig. 5, step S04 is performed to obtain a plurality of human body key points in the human body detection frame, where the human body key points are used to characterize the pose of the human figure. In this embodiment, the human body key points include one or more of an elbow key point, a wrist key point, a neck key point, a left shoulder key point, a right shoulder key point, a hip key point, a knee key point, and an ankle key point.
Specifically, referring to fig. 5, an improved VoVNet network model is used to obtain the plurality of human body key points in the humanoid detection frame, where the improved VoVNet network model includes four OSA (One-Shot Aggregation) modules, and the specific process includes: performing two 3×3×64 convolutions and one 3×3×128 convolution, then downsampling with a maximum pooling layer to change the output size of the image to 112×112×128; performing five 3×3×64 convolutions, concatenating the result of each convolution along the last dimension, and then changing the output size of the image to 56×56×128 through one 1×1×128 convolution and a maximum pooling layer with a stride of 2; performing five 3×3×80 convolutions, concatenating the result of each convolution along the last dimension, and then changing the output size of the image to 28×28×256 through one 1×1×256 convolution and a maximum pooling layer with a stride of 2; performing five 3×3×96 convolutions, concatenating the result of each convolution along the last dimension, and then changing the output size of the image to 14×14×384 through one 1×1×384 convolution and a maximum pooling layer with a stride of 2; performing five 3×3×112 convolutions, concatenating the result of each convolution along the last dimension, and then changing the output size of the image to 7×7×512 through one 1×1×512 convolution and a maximum pooling layer with a stride of 2; and outputting the position coordinates and the visibility of the human body key points through 7×7×26 and 7×7×13 convolution kernels.
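The layer-by-layer output sizes stated above can be checked with a small shape-tracing sketch (pure Python; it assumes stride-1 "same"-padded 3×3 convolutions and stride-2 pooling, which is what the stated sizes imply):

```python
def trace_shapes(size=224):
    """Trace (H, W, C) through the improved VoVNet stages described above."""
    shapes = []
    h = size
    c = 128                      # stem: two 3x3x64 convs, one 3x3x128 conv
    h //= 2                      # stem max-pool halves the spatial size
    shapes.append((h, h, c))     # expected 112x112x128
    # Four OSA stages: five 3x3 convs (concatenated along channels),
    # a 1x1xout_c convolution, then a stride-2 max-pool.
    for out_c in (128, 256, 384, 512):
        c = out_c                # channels set by the 1x1xout_c convolution
        h //= 2                  # stride-2 max-pool
        shapes.append((h, h, c))
    return shapes

print(trace_shapes())
# The stage outputs should match the sizes in the text:
# 112x112x128, 56x56x128, 28x28x256, 14x14x384, 7x7x512
```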
Optionally, before the improved VoVNet network model is used to obtain the plurality of human body key points in the human body detection frame, the method further includes: inputting 224×224×3 RGB training images, outputting the human body key points contained in the training images through the improved VoVNet network model, and judging whether each human body key point is visible. The training pictures are obtained by cropping from large image processing data sets (such as MPII, LSP and AI Challenger), and the robustness of the improved VoVNet network model is enhanced by augmenting the training images with random rotation, random cropping, random occlusion and Gaussian blur.
In this embodiment, the loss function of the improved VoVNet network model is:

L(x) = Σ_i smoothL1(x_i), where smoothL1(a) = 0.5a² if |a| < 1, and |a| − 0.5 otherwise,

wherein x is a row vector containing the differences, in different directions, between the actual position coordinates of all the human body key points and the position coordinates predicted by the improved VoVNet network model.
Optionally, the prediction index of the improved VoVNet network model is:

OKS_p = Σ_i [ exp(−d_pi² / (2·S_p·σ_i²)) · δ(v_pi > 0) ] / Σ_i δ(v_pi > 0)

wherein p represents the serial number of the human detection frame in the image to be detected, i represents the serial number of the human body key point in the p-th human detection frame, d_pi represents the distance between the predicted position and the actual position of the i-th human body key point, v_pi represents the visibility of the i-th human body key point in the p-th human detection frame, δ(·) is an indicator that equals 1 when its condition holds and 0 otherwise, S_p represents the area size of the p-th human-shaped detection frame, and σ_i represents the normalization factor of the i-th human body key point.
The normalization factor σ_i is obtained by calculating the standard deviation over all human shape detection frames in an existing data set, and reflects the degree of influence of the current human body key point on the whole. The larger the value of σ_i, the worse the labeling effect of the whole data set on the current human body key point; the smaller the value of σ_i, the better the labeling effect of the whole data set on the current human body key point.
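A sketch of this prediction index for a single humanoid detection frame (the function name and argument layout are illustrative assumptions):

```python
import math

def oks(pred, actual, visibility, area, sigmas):
    """Object-keypoint-similarity-style prediction index, as described above.
    pred/actual: lists of (x, y) key-point coordinates; visibility: 0/1 flags;
    area: the detection-frame area S_p; sigmas: per-keypoint factors sigma_i."""
    num, den = 0.0, 0
    for (px, py), (ax, ay), v, s in zip(pred, actual, visibility, sigmas):
        if v > 0:                       # only visible key points contribute
            d2 = (px - ax) ** 2 + (py - ay) ** 2
            num += math.exp(-d2 / (2 * area * s ** 2))
            den += 1
    return num / den if den else 0.0
```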
Subsequently, referring to fig. 6, step S05 is executed, and in the image to be detected within the continuous set frame number, whether the corresponding portrait has the waving motion is determined according to the position change condition of the human body key points in the selected portrait detection frame.
In this embodiment, whether the waving motion exists in the images to be measured within the continuous set frame number is mainly determined by three judgment conditions. First, if the height of the wrist key point is higher than the height of the elbow key point, this indicates that the arm of the figure is in a lifted state, and the arm is always in a lifted state when the hand waving motion is performed. Thus, the first judgment condition is whether the wrist key point is continuously higher than the elbow key point in the vertical direction.
Then, in the images to be measured within the continuous set frame number, when the wrist key point moves to the position nearest to the human body in the horizontal direction, the included angle formed by the wrist key point, the elbow key point and the left shoulder key point (or the right shoulder key point) on the inner side of the human body needs to be within a set angle threshold (for example, 180 degrees ± 15 degrees), and when the wrist key point moves to the position farthest from the human body in the horizontal direction, that included angle needs to be within another set angle threshold (for example, 90 degrees ± 15 degrees). Therefore, the second judgment condition is whether the included angle formed by the wrist key point, the elbow key point and the left shoulder key point (or the right shoulder key point), measured on the inner side of the figure's body, is within the set angle threshold.
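The inner-side included angle at the elbow used by the second judgment condition can, for example, be computed from the key-point coordinates as follows (function names and the tolerance default are illustrative):

```python
import math

def elbow_angle(wrist, elbow, shoulder):
    """Angle (degrees) at the elbow between the elbow->wrist and
    elbow->shoulder vectors, given (x, y) key-point coordinates."""
    v1 = (wrist[0] - elbow[0], wrist[1] - elbow[1])
    v2 = (shoulder[0] - elbow[0], shoulder[1] - elbow[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    return math.degrees(math.acos(dot / (n1 * n2)))

def within_threshold(angle, center, tolerance=15.0):
    # e.g. center=180 at the nearest point, center=90 at the farthest point
    return abs(angle - center) <= tolerance

# A straight arm gives an angle near 180 degrees:
print(elbow_angle((2, 0), (1, 0), (0, 0)))
```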
Then, in the images to be measured within the continuous set frame number, since the waving motion is a periodic motion, the horizontal coordinate values of the wrist key point and the elbow key point follow a sine-like function in the time domain; if the motion of these horizontal coordinate values over time matches a partial curve of a sine function to a certain extent, the motion can be judged as waving. Thus, the third judgment condition is whether the positions of the elbow key point and the wrist key point in the horizontal direction move periodically over time, that is, whether the hand waving action frame sequence follows a sine-like function. The number of points taken from the images to be measured of the hand waving action frame sequence equals the set frame number; the same number of points is sampled evenly from the sine function, the corresponding points in the two curves are matched to calculate Euclidean distances, and the Euclidean distances of all the points are summed. If the summed value is smaller than a set threshold, the current hand waving action frame sequence is judged to follow a sine-like function.
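The curve-matching step of the third judgment condition can be sketched as follows; the normalization of the observed sequence, the single-period reference curve and the distance threshold are illustrative assumptions:

```python
import math

def is_sine_like(xs, threshold=1.5):
    """Compare the normalized horizontal-coordinate sequence of a key point
    against equally spaced samples of one period of a sine curve, by summing
    point-wise distances, as described above."""
    n = len(xs)
    # Normalize the observed sequence to [-1, 1].
    lo, hi = min(xs), max(xs)
    span = (hi - lo) or 1.0
    norm = [2 * (x - lo) / span - 1 for x in xs]
    # Sample the same number of points from one period of sin.
    ref = [math.sin(2 * math.pi * k / (n - 1)) for k in range(n)]
    total = sum(abs(a - b) for a, b in zip(norm, ref))
    return total < threshold

# A noiseless sampled sine wave matches the reference curve:
samples = [math.sin(2 * math.pi * k / 15) for k in range(16)]
print(is_sine_like(samples))
```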
If the results of all three judgment conditions are yes, a hand waving motion exists in the images to be detected within the continuous set frame number; otherwise, no hand waving motion exists in the images to be detected within the continuous set frame number.
Optionally, before the judging condition is adopted for judging, the method further comprises judging the distance between the corresponding portrait and the camera according to the type and the number of the key points of the human body contained in the selected portrait detection frame, and setting a corresponding set angle threshold according to the distance between the portrait and the camera.
For example, when the portrait substantially faces the camera: if the distance between the portrait and the camera is less than 0.5 meter, the hip key points in the portrait detection frame, and the other key points below them in the vertical direction, are invisible; if the distance is within 0.5–1 meter, the knee key points, and the other key points below them in the vertical direction, are invisible; if the distance is within 1–2 meters, only the ankle key points in the portrait detection frame are invisible; and if the distance is more than 2 meters, all the human body key points in the portrait detection frame are visible. As the distance between the portrait and the camera changes, the judgment thresholds of the three judgment conditions can be adjusted, and by setting different parameters (such as the set angle threshold), the negative influence of the detection distance on the waving detection result is reduced as much as possible.
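The distance-dependent visibility described above can be expressed as a small lookup; the function name, key-point labels and range boundaries are taken from the example for illustration only:

```python
def invisible_keypoints(distance_m):
    """Key points expected to be invisible when the figure roughly faces
    the camera, per the distance ranges described above (illustrative)."""
    if distance_m < 0.5:
        return {"hip", "knee", "ankle"}   # hip and everything below it
    if distance_m < 1.0:
        return {"knee", "ankle"}          # knee and everything below it
    if distance_m < 2.0:
        return {"ankle"}
    return set()                          # beyond 2 m, all points visible
```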
Correspondingly, referring to fig. 7, the present invention further provides a hand waving detection device 1, which uses the hand waving detection method to detect, including:
the human shape detection module 11 is used for setting a human shape detection frame on the human shape in the input image to be detected;
the human shape tracking module 12 comprises a motion estimation unit, a data association unit and a tracking target establishing and destroying unit, wherein the motion estimation unit is used for acquiring the predicted position of a human shape detection frame selected in a next frame of image to be detected according to the position of the human shape detection frame selected in the current frame of image to be detected, the data association unit is used for matching the predicted position with the actual positions of all human shape detection frames to be matched in the next frame of image to be detected, so that human shape detection frames used for representing the same human image in the previous and later frames of image to be detected correspond to each other, and the tracking target establishing and destroying unit is used for marking the human image appearing in the image to be detected and destroying the mark corresponding to the human image leaving the image to be detected;
the key point detection module 13 is used for acquiring position coordinates and visibility of key points of a human body in the humanoid detection frame;
the waving detection module 14 is configured to determine whether waving motion exists in the corresponding portrait according to the position change condition of the human body key points in the selected humanoid detection frame.
In this embodiment, the human shape detection module 11 sequentially inputs the consecutive multi-frame pictures of a real-time video sequence (for example, RGB pictures with an input size of 448×448×3) into the portrait detection model (i.e., the simplified VGG network model) in time order, and outputs the position coordinates (x, y) of all humanoid detection frames above the set size threshold, together with the widths and heights of those frames.
In the human tracking module 12, information of all human detection frames of the current frame of image to be detected and the next frame of image to be detected is input, and human detection frames in the current frame of image to be detected and all human detection frames in the next frame of image to be detected are paired one by one. If the intersection ratio of the two human-shaped detection frames is greater than or equal to the association threshold value, successful pairing is achieved, and the same ID is distributed to the two human-shaped detection frames in the two frames of images to be detected; and if the intersection ratio of the two human-shaped detection frames is smaller than the association threshold value, selecting one of the residual human-shaped detection frames in the current frame to-be-detected image for pairing again. If the intersection ratio of one of the human-shaped detection frames in the next frame of the image to be detected and all human-shaped detection frames in the current frame is smaller than the association threshold value, a new ID is allocated to the human-shaped detection frames in the next frame of the image to be detected. If the human detection frame of the image to be detected in the current frame is not matched with the corresponding human detection frame in the next frame of the image to be detected, destroying the ID of the current human detection frame.
In the keypoint detection module 13, a processed image to be detected (i.e., an RGB picture with an input size of 224×224×3) is input, the image to be detected includes a human-shaped detection frame that is tracked, and then, a plurality of (for example, 13) human-body keypoint coordinates included in the human-shaped detection frame and a determination as to whether the human-body keypoint coordinates are visible are output.
The human body key point coordinates and visibility of a human detection frame in the images to be detected within the continuous set frame number are input into the hand waving detection module 14, and the corresponding hand waving judgment calculation is carried out according to the human body key point coordinates. Specifically, the approximate distance between the portrait corresponding to the human detection frame and the camera is first determined, different threshold parameters are selected for the subsequent judgment conditions according to that distance, and if all three judgment conditions are met, it is judged that a hand waving motion exists in the images to be detected within the continuous set frame number.
Compared with the existing waving detection method, the waving detection method and device of the embodiment detect the continuous to-be-detected images with the set frame number, and correspond a plurality of corresponding humanoid detection frames of the same portrait in the continuous multi-frame to-be-detected images, so that continuous multi-person waving detection in a real-time video sequence is realized, the detection sensitivity is improved, and meanwhile, the false detection rate is reduced. In addition, according to the hand waving detection method, the distance between the corresponding portrait and the camera is judged according to the types and the number of the key points of the human body contained in the selected humanoid detection frame, so that the influence of distance factors on the hand waving detection result is reduced, and the adaptability of the hand waving detection method to different environments is improved.
In summary, the present invention provides a method and an apparatus for detecting waving hands, which process a real-time video sequence including continuous multi-frame images to be detected frame by frame, and select a plurality of portraits in the images to be detected by using a human detection frame; matching human form detection frames in the front and back frame images to be detected, so that human form detection frames used for representing the same human form in the front and back frame images to be detected correspond to each other; acquiring a plurality of human body key points in the human body detection frame, wherein the human body key points are used for representing the gesture of a human figure; and judging whether the corresponding portrait has waving motion or not according to the position change condition of the key points of the human body in the selected portrait detection frame in the images to be detected within the continuous set frame number. According to the invention, the plurality of corresponding humanoid detection frames of the same portrait in the continuous multi-frame image to be detected are correspondingly arranged, so that continuous multi-person waving detection in a real-time video sequence is realized, the detection sensitivity is improved, and the false detection rate is reduced.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Any equivalent substitution or modification made by a person skilled in the art to the technical solution and technical content disclosed in the present invention, without departing from the scope of the technical solution of the present invention, still falls within the scope of protection of the present invention.

Claims (12)

1. A method of detecting a waving hand, comprising:
acquiring a real-time video sequence, wherein the real-time video sequence comprises continuous multi-frame images to be detected;
processing the image to be detected frame by frame, and respectively selecting a plurality of portraits in the image to be detected by using a human-shaped detection frame;
matching human form detection frames in the front and back frame images to be detected, so that human form detection frames used for representing the same human form in the front and back frame images to be detected correspond to each other;
acquiring a plurality of human body key points in the human body detection frame, wherein the human body key points are used for representing the gesture of a human figure; the method comprises the steps of,
and judging whether the corresponding portrait has waving motion or not according to the position change condition of the key points of the human body in the selected portrait detection frame in the images to be detected within the continuous set frame number.
2. The method for detecting waving as defined in claim 1, wherein the step of processing the image to be detected frame by frame using a simplified VGG network model includes:
performing two convolutions with a two-layer 64×3×3 convolution kernel, followed by ReLU activation and a maximum pooling layer, so that the output size of the image is changed to 224×224×64;
performing three convolutions with a three-layer 128×3×3 convolution kernel, followed by ReLU activation and a maximum pooling layer, so that the output size of the image is changed to 112×112×128; and
performing three convolutions with a three-layer 512×3×3 convolution kernel, followed by ReLU activation and a maximum pooling layer, so that the output size of the image is changed to 56×56×512.
3. The waving detection method of claim 2, wherein the loss functions of the simplified VGG network model include a classification loss function and a positioning loss function, wherein the classification loss function is:
L_cls = −(1/N) · Σ_{i=1}^{N} [ y_i·log(p_i) + (1 − y_i)·log(1 − p_i) ]
wherein N represents the number of training samples input into the simplified VGG network model, y_i represents the label of the i-th training sample (this being a binary classification problem, the positive class is 1 and the negative class is 0), and p_i represents the probability that the i-th sample is predicted to be a positive class;
the positioning loss function is:
L_loc(x) = Σ_i smoothL1(x_i), where smoothL1(a) = 0.5a² if |a| < 1, and |a| − 0.5 otherwise,
wherein x is a row vector, x = [Δx, Δy, Δw, Δh]; Δx and Δy respectively represent the differences, in the horizontal and vertical directions, between the position coordinates of the actual human detection frame in the training set and the position coordinates of the human detection frame predicted by the simplified VGG network model; Δw represents the difference between the width of the actual human detection frame in the training set and the width of the model-predicted human detection frame; and Δh represents the difference between the height of the actual human detection frame in the training set and the height of the model-predicted human detection frame.
4. The waving detection method according to claim 3, wherein whether the position of the human-shaped detection frame is accurate is determined according to an intersection ratio between the position of the human-shaped detection frame output by the simplified VGG network model and an actual position of the human-shaped detection frame.
5. The method for detecting waving hands according to claim 1 or 4, wherein the process of matching human-shaped detection frames in the images to be detected of the front and rear frames so that human-shaped detection frames for representing the same human figure in the images to be detected of the front and rear frames correspond to each other includes:
selecting a human-shaped detection frame from the current frame of image to be detected, and acquiring the predicted position of the human-shaped detection frame selected in the next frame of image to be detected according to the position of the human-shaped detection frame selected;
and respectively acquiring the intersection ratios between the predicted position and the actual positions of all the human form detection frames to be matched in the next frame of image to be detected; if all the intersection ratios are smaller than an association threshold, the selected human form detection frame has no corresponding human form detection frame in the next frame of image to be detected; otherwise, the human form detection frame to be matched corresponding to the maximum intersection ratio corresponds to the same human figure as the selected human form detection frame.
6. The method of claim 5, wherein the human body keypoints comprise one or more of an elbow keypoint, a wrist keypoint, a neck keypoint, a left shoulder keypoint, a right shoulder keypoint, a hip keypoint, a knee keypoint, and an ankle keypoint.
7. The method of claim 6, wherein acquiring the plurality of human body key points in the human-shaped detection frame by using an improved VoVNet network model comprises the following steps:
performing two 3×3×64 convolutions and one 3×3×128 convolution, then downsampling with a maximum pooling layer to change the output size of the image to 112×112×128;
performing five 3×3×64 convolutions, concatenating the result of each convolution along the last dimension, and then changing the output size of the image to 56×56×128 through one 1×1×128 convolution and a maximum pooling layer with a stride of 2;
performing five 3×3×80 convolutions, concatenating the result of each convolution along the last dimension, and then changing the output size of the image to 28×28×256 through one 1×1×256 convolution and a maximum pooling layer with a stride of 2;
performing five 3×3×96 convolutions, concatenating the result of each convolution along the last dimension, and then changing the output size of the image to 14×14×384 through one 1×1×384 convolution and a maximum pooling layer with a stride of 2;
performing five 3×3×112 convolutions, concatenating the result of each convolution along the last dimension, and then changing the output size of the image to 7×7×512 through one 1×1×512 convolution and a maximum pooling layer with a stride of 2; and
outputting the position coordinates and the visibility of the human body key points through 7×7×26 and 7×7×13 convolution kernels.
8. The waving detection method of claim 7, wherein the loss function of the improved VoVNet network model is:

L(x) = Σ_i smoothL1(x_i), where smoothL1(a) = 0.5a² if |a| < 1, and |a| − 0.5 otherwise,

wherein x is a row vector containing the differences, in different directions, between the actual position coordinates of all the human body key points and the position coordinates predicted by the improved VoVNet network model.
9. The waving detection method of claim 8, wherein the prediction index of the improved VoVNet network model is:

OKS_p = Σ_i [ exp(−d_pi² / (2·S_p·σ_i²)) · δ(v_pi > 0) ] / Σ_i δ(v_pi > 0)

wherein p represents the serial number of the human detection frame in the image to be detected, i represents the serial number of the human body key point in the p-th human detection frame, d_pi represents the distance between the predicted position and the actual position of the i-th human body key point, v_pi represents the visibility of the i-th human body key point in the p-th human detection frame, δ(·) is an indicator that equals 1 when its condition holds and 0 otherwise, S_p represents the area size of the p-th human-shaped detection frame, and σ_i represents the normalization factor of the i-th human body key point.
10. The method according to claim 5, wherein in the image to be measured within a continuous set frame number, the judging conditions for judging whether the corresponding figure has a waving motion according to the position change condition of the human body key point in the selected human body detection frame include three, wherein the first judging condition is whether the wrist key point is continuously higher than the elbow key point in the vertical direction; the second judgment condition is whether an included angle formed by the wrist key point, the elbow key point and the left shoulder key point (or the right shoulder key point) and close to the human body is within a set angle threshold; a third judgment condition is whether the positions of the elbow key point and the wrist key point in the horizontal direction are periodically moved along with the change of time; if the three judging conditions are yes, the hand waving motion exists in the images to be detected within the continuous set frame number.
11. The method of claim 10, further comprising, before determining using the determining condition, determining a distance between a corresponding portrait and a camera according to a type and number of key points of a human body included in the selected portrait detection frame, and setting the set angle threshold according to the distance between the portrait and the camera.
12. A waving detection device for detecting waving according to any one of claims 1 to 11, comprising:
the human shape detection module is used for setting a human shape detection frame on the human shape in the input image to be detected;
the human shape tracking module comprises a motion estimation unit, a data association unit and a tracking target establishing and destroying unit, wherein the motion estimation unit is used for acquiring the predicted position of a human shape detection frame selected in a next frame of image to be detected according to the position of the human shape detection frame selected in the current frame of image to be detected, the data association unit is used for matching the predicted position with the actual positions of all human shape detection frames to be matched in the next frame of image to be detected, so that human shape detection frames used for representing the same human image in the previous and later frames of image to be detected correspond to each other, and the tracking target establishing and destroying unit is used for marking the human image appearing in the image to be detected and destroying the mark corresponding to the human image leaving the image to be detected;
The key point detection module is used for acquiring position coordinates and visibility of key points of a human body in the humanoid detection frame;
the hand waving detection module is used for judging whether corresponding figures have hand waving movement according to the position change condition of the key points of the human body in the selected human body detection frame.
CN202211735182.8A (filed 2022-12-30, priority 2022-12-30) — Method and device for detecting waving hand — Pending — CN116189290A (en)

Publications (1)

Publication Number: CN116189290A — Publication Date: 2023-05-30

Family

ID=86433678

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211735182.8A Pending CN116189290A (en) 2022-12-30 2022-12-30 Method and device for detecting waving hand

Country Status (1)

Country Link
CN (1) CN116189290A (en)

Similar Documents

Publication Publication Date Title
Nadeem et al. Human actions tracking and recognition based on body parts detection via Artificial neural network
Ibrahim et al. An automatic Arabic sign language recognition system (ArSLRS)
Chang et al. Tracking Multiple People Under Occlusion Using Multiple Cameras.
US8578299B2 (en) Method and computing device in a system for motion detection
US8706663B2 (en) Detection of people in real world videos and images
Lim et al. Block-based histogram of optical flow for isolated sign language recognition
Guo et al. Improved hand tracking system
CN108364302B (en) Unmarked augmented reality multi-target registration tracking method
Mikolajczyk et al. Face detection in a video sequence-a temporal approach
CN110222572A (en) Tracking, device, electronic equipment and storage medium
Shen et al. Adaptive pedestrian tracking via patch-based features and spatial–temporal similarity measurement
CN106909890A (en) A kind of Human bodys' response method based on position cluster feature
CN114445853A (en) Visual gesture recognition system recognition method
Zoidi et al. Stereo object tracking with fusion of texture, color and disparity information
Waheed et al. Exploiting Human Pose and Scene Information for Interaction Detection
Kim et al. Real-time facial feature extraction scheme using cascaded networks
Al-Faris et al. Multi-view region-adaptive multi-temporal DMM and RGB action recognition
CN111382606A (en) Tumble detection method, tumble detection device and electronic equipment
CN113327269A (en) Unmarked cervical vertebra movement detection method
Wang et al. Hand motion and posture recognition in a network of calibrated cameras
Mohan Object detection in images by components
Polat et al. A nonparametric adaptive tracking algorithm based on multiple feature distributions
Kölsch An appearance-based prior for hand tracking
Tang et al. Fusion of local appearance with stereo depth for object tracking
Huang et al. Whole-body detection, recognition and identification at altitude and range

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination