CN111104816B - Object gesture recognition method and device and camera


Info

Publication number
CN111104816B
CN111104816B
Authority
CN
China
Prior art keywords
preset
target
human body
key point
frame
Prior art date
Legal status
Active
Application number
CN201811247103.2A
Other languages
Chinese (zh)
Other versions
CN111104816A (en)
Inventor
吕瑞
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201811247103.2A
Publication of CN111104816A
Application granted
Publication of CN111104816B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition

Abstract

The application discloses a gesture recognition method for a target object, comprising: acquiring a current video frame; detecting preset key points of target objects in the current video frame, and obtaining preset key point information of the target objects in the current frame; judging, according to the preset key point information, whether the position change of the preset key points of the current target object between the current frame and the previous f frames meets a first preset posture condition, and/or whether the positional relationship between the preset key points of the current target object within the current frame meets a second preset posture condition; and, if the preset posture condition is met, recognizing the current posture of the target object as a preset posture, where f is a preset natural number and the preset posture condition is set according to the positional features among the key points of the posture to be recognized of the target object. The embodiment of the application can accurately recognize small changes in posture, applies widely, places low demands on the images in the video frames, recognizes postures with high accuracy, and produces few false detections and missed detections.

Description

Object gesture recognition method and device and camera
Technical Field
The present application relates to the field of image analysis, and in particular, to a method and an apparatus for identifying a gesture of a target object, and a camera.
Background
With the development of image acquisition and analysis technologies, analysis based on video or image data is being applied increasingly widely, for example for detecting or recognizing the gesture of an object.
The existing method for detecting or recognizing the gesture of a target object through video or image analysis mainly proceeds as follows: the frame difference image between the current frame and the previous frame is analyzed, the target object pixel points exhibiting motion are obtained from the frame difference image, the graph formed by those pixel points is taken as a contour, and whether a specific gesture exists is judged according to how the contour changes.
However, this method extracts target object pixel points through frame difference image analysis. When the target's motion changes only slightly, the pixel difference between adjacent frames is small and the moving pixel points cannot be obtained from the frame difference image; the target gesture is therefore easily missed and the recognition accuracy is low.
Disclosure of Invention
The embodiment of the invention aims to provide a method and a device for recognizing the gesture of a target object, and a camera, so as to improve the accuracy of recognizing the gesture of a target object in an image.
The invention provides a gesture recognition method for a target object, comprising:
acquiring a current video frame;
detecting preset key points of the target objects in the current video frame, and obtaining preset key point information of the target objects in the current frame;
judging, according to the preset key point information, whether the position change of the preset key points of the current target object between the current frame and the previous f frames meets a first preset posture condition, and/or whether the positional relationship between the preset key points of the current target object within the current frame meets a second preset posture condition;
if the preset posture condition is met, recognizing the current posture of the target object as the preset posture,
wherein f is a preset natural number, and the preset posture condition is set according to the positional features among the key points of the gesture to be recognized of the target object.
Preferably, the method further comprises:
inputting the current frame in which the current target object gesture has been recognized into a trained machine learning model, and, if the machine learning model recognizes the object gesture in the current frame as the preset gesture, taking the preset gesture as the recognition result.
Wherein inputting the current frame in which the current target object gesture has been recognized into a trained machine learning model, and taking the preset gesture as the recognition result if the machine learning model recognizes the object gesture in the current frame as the preset gesture, comprises:
collecting picture data containing the gesture of the target object,
calibrating a first target frame of the target object in the picture data, extracting the first target frame image from the picture data, and preparing two-class samples of the gesture to be recognized and other gestures,
inputting the two-class samples into a machine learning model, training the model, and saving the trained model;
generating, based on the preset key points, a second target frame for the current target object recognized in the current frame, extracting the second target frame image from the current frame, inputting it into the trained model in real time for classification, and, if the machine learning model classifies it as the gesture to be recognized, taking the classification result as the recognition result.
Wherein obtaining the preset key point information of the target objects in the current frame further comprises:
judging, according to the obtained preset key point information, whether the current frame includes two or more target objects; if so, tracking the target objects according to the preset key point information to obtain locked target objects; otherwise, taking the target object in the current frame as the locked target;
and the method may further comprise:
traversing the locked target objects in the current frame, and performing gesture recognition according to the preset key point information of the current locked target object, until the gestures of all target objects in the current frame have been recognized.
Preferably, the method further comprises:
judging whether the change in position of the preset key points of the locked target object between the current frame and the previous f frames is larger than a first displacement threshold;
if so, executing the step of judging whether the position change of the preset key points of the current target object between the current frame and the previous f frames meets the first preset posture condition;
otherwise, executing the step of judging whether the positional relationship between the preset key points of the current target object within the current frame meets the second preset posture condition.
Wherein the first preset posture condition is a sitting posture condition,
the determining whether the position change of the preset key points of the current target object between the current frame and the previous f frames meets the first preset posture condition comprises:
determining, according to preset human body key point information, the longitudinal position change of the same preset human body key point between the current frame and the previous f frames, and determining whether the sitting posture condition is met according to the position change.
Wherein determining the longitudinal position change of the same preset human body key point between the current frame and the previous f frames according to the preset human body key point information, and determining whether the sitting posture condition is met according to the position change, comprises:
determining, according to preset left-shoulder and right-shoulder human body key point information, the longitudinal position changes of the left-shoulder and right-shoulder key points between the current frame and the previous f frames, and determining whether the sitting posture condition is met according to the position changes.
Wherein determining the longitudinal position changes of the left-shoulder and right-shoulder key points between the current frame and the previous f frames according to the preset left-shoulder and right-shoulder key point information, and determining whether the sitting posture condition is met according to the position changes, comprises:
judging whether the sum of the displacement of the left-shoulder key point between the current frame and the previous f frames and the displacement of the right-shoulder key point between the current frame and the previous f frames is larger than a human body key point time-sequence positional relation judgment threshold; if so, recognizing a suspected standing posture; if the sum is smaller than the negative of the judgment threshold, recognizing a suspected sitting posture; and if the sum is equal to the judgment threshold, recognizing a no-action posture;
wherein the human body key point time-sequence positional relation judgment threshold is proportional to the distance between the left shoulder and the right shoulder in the current frame.
Wherein inputting the current frame in which the current target object posture has been recognized into a trained machine learning model, and taking the preset posture as the recognition result if the machine learning model recognizes the object posture in the current frame as the preset posture, comprises:
collecting picture data containing standing postures and/or sitting postures of target persons,
calibrating a first target frame of the target human body in the picture data, extracting the first target frame image from the picture data, and preparing two-class samples of standing postures and sitting postures,
inputting the two-class samples into a machine learning model, training the model, and saving the trained model;
and generating a second target frame based on the preset key points in the current frame, extracting the second target frame image from the current frame, inputting it into the trained model in real time for classification, recognizing a standing posture if the machine learning model classifies the target frame image as standing, and recognizing a sitting posture if the machine learning model classifies it as sitting.
Preferably, the method further comprises:
judging whether the sitting posture of the target human body recognized in the current frame has persisted for M consecutive frames, and if so, controlling the camera lens to capture a far view;
judging whether the standing posture of the target human body recognized in the current frame has persisted for T consecutive frames; if so, counting whether the number of standing target human bodies recognized in the current frame equals 1; if it equals 1, controlling the camera lens to capture a close view of that target human body, otherwise controlling the camera lens to capture a far view;
wherein M and T are preset natural numbers.
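The claimed M/T persistence logic maps directly to a small per-frame check. Below is a minimal Python sketch; the posture labels, the `recent_postures` history structure and the `control_lens()` callback are illustrative assumptions, not names from the patent.

```python
# Per-frame camera control, assuming recent_postures is a list of
# per-frame posture-label lists ('sit'/'stand'), newest frame last.
def control_camera(recent_postures, M, T, control_lens):
    # Sitting persisted for M consecutive frames -> capture a far view.
    if len(recent_postures) >= M and all(
            'sit' in frame for frame in recent_postures[-M:]):
        control_lens('far')
    # Standing persisted for T consecutive frames.
    if len(recent_postures) >= T and all(
            'stand' in frame for frame in recent_postures[-T:]):
        standing = sum(1 for p in recent_postures[-1] if p == 'stand')
        # Exactly one standing target -> close-up; otherwise far view.
        control_lens('close' if standing == 1 else 'far')
```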
Preferably, the second preset posture condition is a blackboard writing posture condition,
and the determining whether the positional relationship between the preset key points of the current target object within the current frame meets the second preset posture condition comprises:
determining the relative positional relationship between preset human body key points according to the preset human body key point information, and determining whether the blackboard writing posture condition is met according to the relative positional relationship.
Wherein determining the relative positional relationship between preset human body key points according to the preset human body key point information, and determining whether the blackboard writing posture condition is met according to the relative positional relationship, comprises:
determining the relative positional relationship among the right-wrist, right-elbow and right-shoulder key points according to preset right-wrist, right-elbow and right-shoulder human body key point information, and determining whether the blackboard writing posture condition is met according to the relative positional relationship.
Wherein determining the relative positional relationship among the right-wrist, right-elbow and right-shoulder key points according to the human body key point information, and determining whether the blackboard writing posture condition is met according to the relative positional relationship, comprises:
judging whether the position of the right-wrist key point is higher than the position of the right-elbow key point, whether the horizontal distance between the right-wrist key point and the right-elbow key point is smaller than a first spacing threshold, and whether the vertical distance between the right-elbow key point and the right-shoulder key point is smaller than a second spacing threshold;
if so, recognizing that the target human body is in the blackboard writing posture; otherwise, recognizing a non-writing posture.
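The blackboard writing condition combines one vertical ordering test with two distance thresholds. A minimal Python sketch, assuming image coordinates with y increasing downward; the threshold parameter names are illustrative:

```python
# Keypoints are (x, y) tuples in image coordinates.
def is_writing(right_wrist, right_elbow, right_shoulder,
               first_spacing_threshold, second_spacing_threshold):
    # "Higher" in the image means a smaller y coordinate.
    wrist_above_elbow = right_wrist[1] < right_elbow[1]
    horizontal_ok = abs(right_wrist[0] - right_elbow[0]) < first_spacing_threshold
    vertical_ok = abs(right_elbow[1] - right_shoulder[1]) < second_spacing_threshold
    return wrist_above_elbow and horizontal_ok and vertical_ok
```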
The embodiment of the invention provides a gesture recognition device for a target object, comprising:
an image acquisition module, which acquires a current video frame;
a key point detection module, which detects preset key points of the target objects in the current video frame and obtains preset key point information of the target objects in the current frame;
a recognition module, which judges, according to the preset key point information, whether the position change of the preset key points of the current target object between the current frame and the previous f frames meets the first preset posture condition, and/or whether the positional relationship between the preset key points of the current target object within the current frame meets the second preset posture condition;
if the preset posture condition is met, the current posture of the target object is recognized as the preset posture;
wherein f is a natural number, and the preset posture condition is set according to the positional features among the key points of the gesture to be recognized of the target object.
Preferably, the apparatus further comprises:
a detection classification module, which inputs the current frame in which the current target object gesture has been recognized into the trained machine learning model and, if the machine learning model recognizes the object gesture in the current frame as the preset gesture, takes the preset gesture as the recognition result.
Wherein the detection classification module comprises:
a sample preparation unit, which calibrates a first target frame of the target object in picture data containing the target object gesture, extracts the first target frame image from the picture data, prepares two-class samples of the gesture to be recognized and other gestures, and inputs the two-class samples into the machine learning model unit;
a machine learning model unit, which trains on the input two-class samples and saves the trained model, and which classifies a second target frame image input in real time through the trained model, where the second target frame is generated, based on the preset key points, for the current target object recognized in the current frame, and the second target frame image is the image within that frame extracted from the current frame.
Wherein the device further comprises:
a target tracking module, which determines the number of target objects included in the current frame according to the obtained preset key point information; when the number of target objects is greater than or equal to 2, tracks the target objects according to the preset key point information to obtain locked targets; and when the number of target objects equals 1, takes the target object in the current frame as the locked target.
Preferably, the apparatus further comprises:
an inter-frame preset key point recognition module, which judges whether the change of the preset key points of the locked target object between the current frame and the previous f frames is larger than the first displacement threshold; when the position change is larger than the first displacement threshold, determines whether the position change meets the first preset posture condition; and when the position change is not larger than the first displacement threshold, determines whether the positional relationship between the preset key points of the current target object meets the second preset posture condition.
Wherein the first preset posture condition is the sitting posture condition, and the recognition module comprises:
a first recognition unit, which determines the longitudinal position change of the same preset human body key point between the current frame and the previous f frames according to the preset human body key point information, and determines whether the sitting posture condition is met according to the position change.
Wherein the first recognition unit comprises:
a first calculating subunit, which calculates the sum of the displacement of the left-shoulder key point between the current frame and the previous f frames and the displacement of the right-shoulder key point between the current frame and the previous f frames, and calculates the human body key point time-sequence positional relation judgment threshold as proportional to the distance between the left-shoulder and right-shoulder key points in the current frame;
a first comparing subunit, which compares whether the sum of the left-shoulder and right-shoulder key point displacements between the current frame and the previous f frames is larger than the human body key point time-sequence positional relation judgment threshold; if so, recognizes a suspected standing posture; if the sum is smaller than the negative of the judgment threshold, recognizes a suspected sitting posture; and if the sum is equal to the judgment threshold, recognizes no action.
Preferably, the apparatus further comprises:
a camera control module, which controls the camera lens to capture a far view when the sitting posture of the target human body recognized in the current frame has persisted for M consecutive frames; and, when the standing posture of the target human body recognized in the current frame has persisted for T consecutive frames, controls the camera lens to capture a close view of the target human body if the number of standing target human bodies counted in the current frame equals 1, and controls the camera lens to capture a far view if that number is greater than 1; wherein M and T are preset natural numbers.
Wherein the second preset posture condition is the blackboard writing posture condition, and the recognition module comprises:
a second recognition unit, which determines the relative positional relationship between preset human body key points according to the preset human body key point information, and determines whether the blackboard writing posture condition is met according to the relative positional relationship.
Wherein the second recognition unit comprises:
a second calculating subunit, which calculates the horizontal distance between the right-wrist key point and the right-elbow key point, and the vertical distance between the right-elbow key point and the right-shoulder key point;
a second comparing subunit, which compares whether the right-wrist key point position is higher than the right-elbow key point position, whether the calculated horizontal distance is smaller than the first spacing threshold, and whether the calculated vertical distance is smaller than the second spacing threshold; if so, recognizes that the target human body is in the blackboard writing posture, and otherwise recognizes a non-writing posture.
The embodiment of the invention provides a camera device, which comprises a camera, a memory and a processor, wherein,
the camera is used for shooting images;
the memory is used for storing a computer program;
the processor is configured to execute the program stored in the memory so as to implement the object gesture recognition method described above.
According to the gesture recognition method for a target object provided by the embodiments, gesture recognition is performed by detecting the preset key points of the target object and judging the position changes and/or positional relationships of those key points on the basis of the positional features among the key points of the gesture to be recognized. The embodiment of the invention can accurately recognize small changes in posture, places no specific requirements on the target object or the gesture, applies widely, places low demands on the images in the video frames, recognizes gestures with high accuracy, and produces few false detections and missed detections.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of the present invention for realizing gesture recognition of a target object.
Fig. 2 is a schematic flow chart of identifying an object in an acquired video frame according to the present invention.
Fig. 3 is a schematic diagram of key points of a human body according to an embodiment.
Fig. 4 is a flow chart of a method for recognizing the sitting posture of a target human body.
Fig. 5 is a schematic flow chart of controlling recording equipment based on sitting posture recognition of a plurality of target human bodies in a classroom recording scene.
Fig. 6 is a schematic diagram of a camera mounting position in a second embodiment of the present invention.
Fig. 7 is a schematic flow chart of a method for identifying writing behaviors.
Fig. 8 is a schematic flow chart of a method for recognizing the blackboard writing postures of multiple target human bodies in a video.
Fig. 9 is a schematic view of an apparatus according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The present application will be described in further detail with reference to the accompanying drawings, in order to make the objects, technical means and advantages of the present application more apparent.
With the rapid development of cloud computing, big data, artificial intelligence and related fields, intelligent products are widely applied across industries. Intelligent video analysis, as one such product, mainly comprises target key point detection, target tracking and key point gesture detection. Key point detection detects the positions of all key points in an image using the information of an image frame, and provides the coordinates of each key point and the connection relations between them; target tracking generates a target detection frame from the key point detection result and tracks that detection frame; key point gesture detection recognizes the gesture according to the spatio-temporal relationships of the key points of the tracked target.
Using key point detection of the target object, and combining the positional relations among the key points of the gesture to be recognized and/or the positions of those key points across different time instants, the detected preset target key points are judged according to the positional relations among the preset key points and/or the position change of the same preset key point across different time instants. Further, to avoid misjudgment and missed detection, the recognized gesture is detected and classified by a trained deep learning algorithm model.
Referring to fig. 1, fig. 1 is a schematic flow chart of the present invention for implementing gesture recognition of a target object. The schematic diagram shows the basic flow of the technical scheme of the invention.
Acquiring a current video frame;
step 101, detecting preset key points of the target objects in the current video frame, and obtaining preset key point information of the target objects in the current frame;
step 102, judging, according to the preset key point information, whether the position change of the preset key points of the current target object between the current frame and the previous f frames meets a first preset posture condition, and/or whether the positional relationship between the preset key points of the current target object within the current frame meets a second preset posture condition;
if the preset posture condition is met, the current target object posture is recognized as the preset posture; if the preset posture condition is not met, it is not recognized as the preset posture.
Here f is a preset natural number, and the preset posture condition is set according to the positional features among the key points of the gesture to be recognized of the target object; for example, it may be set based on the regular motion trajectory of the key points of the gesture to be recognized.
Referring to fig. 2, fig. 2 is a schematic flow chart of the present invention for identifying an object in an acquired video frame.
Step 201, after obtaining a current video frame, recording a current frame number f;
step 202, scaling the current video frame according to a fixed size to obtain a picture with a consistent size;
step 203, detecting preset key points of the targets, and obtaining and storing information of each preset key point of each target in the current frame;
Step 204, judging, according to the obtained preset key point information, whether the current frame contains two or more target objects, that is, obtaining the number of target objects contained in the current frame. If the current frame includes multiple target objects, step 205 is executed to obtain locked target objects through target tracking, so that each target object in the current video frame is matched to the same target object in the previous video frame. If there is only one target object in the current frame, step 206 is executed and the target object in the current frame is taken as the locked target; or, preferably, the target object in the current frame is also tracked in this step, to avoid misjudgment after the target object changes between frames: for example, when a first target object leaves in the previous frame or f frames earlier and a second target object enters the video in the current frame, target tracking ensures that the target object of the current video frame and the target object of the previous video frame are the same target, which helps improve the accuracy of gesture recognition;
Step 207, judging whether the change in position of the preset key points of the locked target object between the current frame and the previous f frames is larger than the first displacement threshold, so as to determine whether the position of the same preset key point has changed across different time instants (intervals);
if the position has changed, step 208 is executed: this indicates that the gesture to be recognized changes between frames and can be regarded as mainly a dynamic action, and the gesture of the locked target object in the current frame is pre-judged by whether the inter-frame displacement of the same preset key point meets the first preset posture condition;
if the change of the preset key points of the locked target object between the current frame and the previous f frames is smaller than or equal to the first displacement threshold, step 209 is executed: this indicates that the position of the gesture to be recognized does not change noticeably between frames and can be regarded as mainly static, and the gesture of the locked target object in the current frame is pre-judged by whether the positional relationship between the preset key points within the frame meets the second preset posture condition;
in step 210, to further improve the accuracy of gesture recognition, the suspected gesture of the target object pre-judged in step 208 or step 209 is further detected and classified by a trained depth model, so as to obtain the final recognition result for the currently locked target in the current frame.
Step 211, judging whether unrecognized locked targets remain in the current frame; if so, recognizing the next locked target, until the gestures of all locked targets have been recognized; otherwise, the target objects in the current frame have all been processed and processing continues with the next frame.
Through the above steps, the recognition process for a current frame containing multiple targets is handled, and different judgment strategies for the preset key points are adopted based on the characteristics of the gesture to be recognized.
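The per-frame flow of FIG. 2 can be condensed into one driver function. In the Python sketch below, every helper callable is a placeholder for a module described in steps 202-211 and is injected as an argument; none of these names come from the patent.

```python
# Condensed sketch of the FIG. 2 per-frame flow (steps 202-211).
def recognize_frame(frame, history, f, threshold1,
                    resize_fixed, detect_keypoints, track_targets,
                    displacement, check_condition1, check_condition2, confirm):
    frame = resize_fixed(frame)                    # step 202: fixed-size scaling
    keypoints = detect_keypoints(frame)            # step 203: keypoint detection
    locked = track_targets(keypoints, history)     # steps 204-206: lock targets
    results = {}
    for target_id, points in locked.items():       # step 211: traverse all targets
        if displacement(points, history, f) > threshold1:    # step 207
            suspected = check_condition1(points, history, f)  # step 208: dynamic
        else:
            suspected = check_condition2(points)              # step 209: static
        # step 210: confirm the suspected posture with the trained classifier
        results[target_id] = confirm(frame, points, suspected)
    return results
```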
In practical application, gesture recognition of a human body as a target object has a wide application requirement. The following description will be made with reference to the gesture recognition of a human body as an example.
When the target object is a human body and the acquired data is video, the human body parts relevant to gesture recognition can be marked as key points. As shown in fig. 3, fig. 3 is a schematic diagram of human body key points according to an embodiment, where the head key point index is 0, the neck key point index is 1, and so on; the numerals in the figure represent the key point indices. When the gesture to be recognized is a standing and/or sitting posture, since the main changing parts are in the upper half of the body, the head, neck, left and right shoulders, left and right hips, left and right elbows, left and right wrists and chest can be set as key points. For another example, when the gesture to be recognized is a blackboard writing gesture, since writing is usually performed with the right hand driven by the right arm, the right wrist, right arm and right shoulder can be set as key points. For another example, when the gesture to be recognized is a certain exercise posture, such as sit-ups or planks, key points may be set according to the body parts involved in that posture.
Embodiment one: the following takes the recognition of a human sitting posture as an example.
Referring to fig. 4, fig. 4 is a flow chart of a method for recognizing the sitting posture of a target human body.
Step 401, starting from collecting a frame of image, recording the frame number f of the current frame;
Step 402, preprocessing the image: the acquired images are scaled to a fixed size w × h so that all images have the same size. This firstly avoids differences in image size caused by different image pickup devices, and secondly improves the accuracy of image analysis in the subsequent steps.
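As a concrete illustration of this preprocessing step, a minimal sketch using OpenCV; w and h stand for the preset size, whose actual values the text does not specify:

```python
import cv2

def preprocess(image, w, h):
    # Scale every acquired image to the same w x h size so that frames
    # from different cameras can be analyzed uniformly (step 402).
    return cv2.resize(image, (w, h))
```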
Step 403, detecting human body key points according to the human body key points preset for the body parts relevant to the gesture to be recognized, and acquiring and storing the information J_{n,f} of each preset human body key point in the frame image. For example, the coordinate information of a human body key point is denoted J_{n,f}.x, J_{n,f}.y, where n is the index of the human body key point, f is the current frame number, x is the abscissa and y is the ordinate; thus J_{n,f}.x represents the abscissa of human body key point n in frame f, and J_{n,f}.y represents its ordinate.
In this step, the current video frame is analyzed with a preset machine learning algorithm to obtain a plurality of preset human body key point heat maps corresponding to the current video frame. The preset machine learning algorithm is obtained by training on sample video frames annotated with human body key points and on the sample human body key point heat maps corresponding to those sample video frames. Each human body key point heat map comprises a body part identifier and, for that identifier, the human body key point heat values of all pixel points of the current video frame. The human body key point information of the target human body corresponding to the current video frame is then determined from the plurality of heat maps. Wherein:
The human body key point annotations can be produced by manually marking the positions of the human body key points in the pictures. In this embodiment, to facilitate the recognition of other gestures as well, the annotated key points comprise 15 human body key points: head, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, chest, right hip, right knee, right ankle, left hip, left knee and left ankle.
As the machine learning algorithm, this embodiment adopts a YPN (YOLO Pyramid Networks) model, obtained by combining and optimizing the network structure designs of the open-source target detection framework YOLOv2 and the feature pyramid network (FPN, Feature Pyramid Networks). The YOLOv2 structure quickly extracts target features while reducing the amount of computation; compared with the convolutional neural network structure adopted by OpenPose, YPN requires less computation and can detect human body key points in real time without losing precision. The FPN structure helps improve the multi-scale adaptability of the features and guarantees the detection performance of the network model.
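For illustration, one possible in-memory layout for the per-frame key point records J_{n,f} of step 403; the dict-based scheme is an assumption made for this sketch, not taken from the patent:

```python
# (n, f) -> (x, y): key point n of the single target in frame f.
keypoints = {}

def store(n, f, x, y):
    keypoints[(n, f)] = (x, y)

def x_of(n, f):
    return keypoints[(n, f)][0]    # J_{n,f}.x

def y_of(n, f):
    return keypoints[(n, f)][1]    # J_{n,f}.y
```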
Step 404: the positional relationship between the preset human body key points relevant to the sitting posture does not change much across video frames, while the position of the same key point of the same target does change across frames; that is, the positional relationship between key points within a frame stays relatively unchanged, while the position of the same key point differs between frames. To better capture the target human body, preferably, a preset tracking algorithm is used to select, among the target human bodies corresponding to the current video frame, the one that is the same target as a target human body in the previous video frame, as the captured action target human body.
The preset tracking algorithm may be the multi-target tracking algorithm CMOT (Robust Online Multi-object Tracking Based on Tracklet Confidence), or another tracking algorithm; the embodiment of the present invention places no particular limitation on this.
Specifically, a preset tracking algorithm is used to judge whether a first preset graph, corresponding to preset human body key points of a target human body in the current video frame, and a second preset graph, corresponding to preset human body key points of a target human body in the previous video frame, meet a preset overlapping condition.
The preset human body key points may be several key points that together form a highly recognizable target human body contour. For example, they may include all key points of the upper body, or just the key points marked head, left shoulder and right shoulder; the contour formed by these three key points identifies the target human body well. Those skilled in the art can select other human body key points as the preset key points according to the actual situation; the embodiment of the invention is not particularly limited in this respect.
If the first preset graph and the second preset graph meet the preset overlapping condition, the target human body in the current video frame whose key points correspond to the graph meeting the overlapping condition with the previous video frame is taken as the same target as the target human body of the previous video frame.
In practical applications, this step is not necessary when the accuracy of human keypoint detection is sufficient.
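One concrete way to realize the overlap test: take the preset graph to be the axis-aligned bounding box of the preset key points and the overlapping condition to be an IoU threshold. Both choices are assumptions made for this sketch; the patent leaves the graph and the condition open.

```python
def bbox(points):
    # Axis-aligned bounding box of a list of (x, y) key points.
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)

def iou(a, b):
    # Intersection-over-union of two boxes (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def same_target(curr_points, prev_points, overlap_threshold=0.5):
    return iou(bbox(curr_points), bbox(prev_points)) >= overlap_threshold
```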
Step 405: on the image, the sitting posture manifests as the positional relationship between human body key points within a frame staying relatively unchanged while the position of the same key point differs between frames. Based on this, the human body key point information of the target human body in the current frame must be compared with the same key point information from f frames earlier, so it is first judged whether human body key point information exists for the f frames preceding the current frame. If so, step 407 can be executed; otherwise the number of acquired frames is insufficient, which may affect the accuracy of posture recognition, so step 406 is executed and the next frame image is acquired;
step 407, pre-judging the sitting posture of the target human body based on the target human body obtained by human body tracking:
After human body tracking has locked onto the target human body, the coordinates of its preset human body key points in the current video frame are determined and compared with the coordinates of the same key points of the same target human body f frames earlier, so as to pre-judge the posture of the target human body.
When a human body stands up or sits down, the displacement of the shoulders bears a certain proportional relation to the shoulder width; concretely, the ordinates of the shoulder key points on the image differ between the standing and sitting postures, and the ordinate of the same key point changes over time between the current frame and the previous f frames. Therefore, preferably, the changes in the ordinates of the left-shoulder and right-shoulder key points between the current frame and the previous f frames are calculated, together with the positional relation of these two key points within the current frame, and the ordinate changes are compared against that positional relation to judge whether a suspected sitting posture occurs. The specific formula is:
d = |J_{2,1}.x - J_{5,1}.x| × α

wherein, following the labeling of FIG. 3, J_{2,1}.x, J_{2,1}.y and J_{5,1}.x, J_{5,1}.y denote the abscissa and ordinate of the left shoulder and of the right shoulder in the current frame, respectively; J_{2,f}.y and J_{5,f}.y denote the ordinates of the left and right shoulders of the target student f frames earlier; and d denotes the human body key point time-sequence positional relation judgment threshold, taken as the product of the target student's shoulder width and the scaling factor α.
In the rectangular coordinate system used in the image field, the default coordinate origin is usually at the upper-left corner of the image, with the x axis pointing rightward and the y axis pointing downward. According to the above formula:
if the sum of the left-shoulder and right-shoulder displacements is larger than the judgment threshold, the key points of the current frame have moved in the positive y direction relative to the previous f frames; let W=1 and record a suspected sitting posture;
if the sum of the left-shoulder and right-shoulder displacements is smaller than the negative of the judgment threshold, the key points of the current frame have moved in the negative y direction relative to the previous f frames; let W=2 and record a suspected standing posture;
if the sum of the left-shoulder and right-shoulder displacements is equal to the judgment threshold, the displacement of the key points of the current frame relative to the previous f frames is negligible; let W=0 and record a suspected no-action. In a specific implementation, the relevant parameter takes the value α=0.8.
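The three-way test can be written out directly. A sketch with α = 0.8; the tuple-based interface is an assumption, and the band strictly between -d and d is treated here as no action, reading the text's "equal to the judgment threshold" as "otherwise":

```python
ALPHA = 0.8

def prejudge_sit_stand(l_curr, r_curr, l_prev, r_prev):
    # Shoulder coordinates as (x, y) tuples; "curr" is the current frame,
    # "prev" the frame f frames earlier. The image y axis points downward.
    # Threshold d: shoulder width in the current frame times alpha.
    d = abs(l_curr[0] - r_curr[0]) * ALPHA
    # Sum of the vertical displacements of both shoulders.
    s = (l_curr[1] - l_prev[1]) + (r_curr[1] - r_prev[1])
    if s > d:
        return 1    # W=1: moved down (+y), suspected sitting
    if s < -d:
        return 2    # W=2: moved up (-y), suspected standing
    return 0        # W=0: negligible movement, suspected no action
```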
Recognizing the standing and sitting postures in this way achieves an accuracy of roughly 80-90%.
Because actions such as leaning the chest forward or leaning back give the shoulders a trajectory similar to that of standing up, such motions can be misjudged as postures. Preferably, the suspected judgment results are therefore classified and calibrated with a classification network algorithm to improve the accuracy of standing posture recognition, as described in step 408.
Step 408, for the targets judged as suspected sitting or standing postures, performing sitting detection classification with a machine learning algorithm:
First, picture data containing sitting postures is collected; preferably, picture data containing sitting and standing postures is screened from the currently collected set of image frames as training sample data, and first target frames are calibrated for all sitting and standing targets in the sample data. Since the sitting and standing postures mainly involve the key points of the upper half of the body, preferably the upper bodies of the sitting and standing targets are calibrated. The first target frames in the calibration data are then expanded, the first target frame image is extracted from each piece of picture data, and the sitting/standing two-class samples are produced. Preferably, the calibrated target frames are regular shapes so that the expansion ratios can be set flexibly and simply; for example, with a rectangular first target frame, the left and right expansion ratios of the frame width may be set to 0.2, the upward expansion ratio of the frame height to 0.1, and the downward expansion ratio to 0.5.
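For illustration, the frame generation and expansion with the quoted ratios (0.2 left/right, 0.1 up, 0.5 down); the clamping to the image bounds is an added practical detail, not from the text:

```python
def target_frame(points, img_w, img_h,
                 left=0.2, right=0.2, up=0.1, down=0.5):
    # Bounding box of the key points, expanded by the given ratios of
    # its own width/height and clamped to the image.
    xs, ys = zip(*points)
    x1, y1, x2, y2 = min(xs), min(ys), max(xs), max(ys)
    w, h = x2 - x1, y2 - y1
    return (max(0, int(x1 - left * w)),
            max(0, int(y1 - up * h)),
            min(img_w, int(x2 + right * w)),
            min(img_h, int(y2 + down * h)))
```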
The two-class samples are input into a convolutional neural network (CNN) model, the model is trained, and the trained CNN model is saved when training finishes.
The collection of the sample data and the training of the CNN model may be performed in parallel with this process, or performed in advance on the basis of the sample data, separately from this process.
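The text only specifies "a CNN network model" for the two-class task, so the concrete architecture, input handling and training step below are assumptions; a minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

class SitStandNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(32 * 4 * 4, 2)  # sitting vs standing

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = SitStandNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    # One optimization step over a batch of cropped target-frame images
    # with 0/1 sitting/standing labels.
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```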
When a suspected sitting or standing posture is recognized in the current frame, a second target frame is generated based on the human body key points and expanded by a certain ratio; the second target frame pre-judged as suspected is then cropped out, that is, the second target frame image is extracted from the current frame, and the extracted image is sent to the CNN classification network in real time for classification. If a target frame image suspected of standing is classified as standing, a standing posture is recognized; if a target frame image suspected of sitting is classified as sitting, a sitting posture is recognized.
The shapes of the first target frame and the second target frame can be the same or different; the second target frame may be generated based on all key points of the target human body, or may be generated based on preset key points related to the gesture to be recognized.
In the above method for recognizing the sitting posture of a single human body from images, the suspected posture is further verified through tracking of the human body key points, the trajectory changes of those key points across multiple frames, and a trained deep learning model, thereby improving posture recognition accuracy and reducing missed or false detections.
Embodiment two: the following takes as an example the recognition of students' sitting postures while recording classroom videos, in the application scene of classroom video resources for distance education.
In the field of education and teaching, the appearance of distance education, smart classrooms, classroom video monitoring and the like has made modern education more convenient and efficient. At present there are two ways of acquiring course video resources in distance education: recording by manual shooting, and controlling the shooting equipment to record automatically with methods based on traditional image processing. The former is reliable but costly; the latter is relatively cheap but not reliable. In the following embodiment, the switching and adjustment between far-view and close-view shooting of the recording device are controlled based on the sitting posture recognition result, while classroom monitoring is achieved at the same time.
Referring to fig. 5, fig. 5 is a schematic flow chart of controlling a recording device based on sitting posture recognition of a plurality of target human bodies in a classroom recording scene.
Step 500, a current video frame is acquired.
Before recording, a camera may be installed in the classroom, for example at the top of the lecture-platform side, with an imaging view angle covering at least all desk areas. As shown in fig. 6, a panoramic camera is installed at a fixed position on the lecture-platform side, with its view angle covering the platform and all desk areas. In the embodiment of the invention, the camera can be started manually, or automatically according to a preset start time; specifically, the preset lecture time of the teacher may be taken as the preset start time.
After the camera is started, images can be shot through its lens, and the current video frame is obtained from the shot images, so as to judge whether a target human body in the current video frame shows a sitting posture.
Step 501, recording the current frame number f;
Step 502, preprocessing the image: the acquired images are scaled to a fixed size w × h so that all images have the same size, which avoids size differences caused by different image pickup devices and improves the accuracy of image analysis in the subsequent steps.
Step 503, human body key point detection:
first, collecting a data set of a student scene, and then marking the position of each human body key point of each target human body (student) in the picture. For other applications that facilitate student behavior and gesture recognition, for example, monitoring whether there is a long-term low head gesture, etc., preferably, the noted human body keypoints include 15 human body keypoints such as head, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, chest, right hip, right knee, right ankle, left hip, left knee, left ankle, etc. If the amount of data to be noted is reduced, only the key points of the upper body, i.e., the head, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, chest, right hip, left hip, may be noted for the recognition of the sitting posture.
In one embodiment, after the human body key point annotation data are obtained, they are used as training samples for the YPN network model; the YPN model is trained and the trained model is saved. The collected current video frame is then input into the trained YPN model, which extracts the human body key features from it, generating the human body key point information in real time and yielding, for each human body (student) target in the frame image, the key point information J_{k,n,f}. For example, the coordinate information of human body key point n of target human body (student) k in frame f is denoted J_{k,n,f}.x, J_{k,n,f}.y, where n is the key point index, f the current frame number, x the abscissa and y the ordinate; thus J_{k,n,f}.x represents the abscissa of key point n of student k in frame f, and J_{k,n,f}.y its ordinate.
Step 504: when the video frames recorded by the camera include multiple target human bodies, in order that the current and previous video frames perform sitting posture recognition on the same target human body, so that the recognition process continuously targets the same person, this step again uses the preset tracking algorithm to select, among the target human bodies of the current video frame, the one that is the same target as a target human body of the previous video frame, as the captured action target human body.
Unlike embodiment one, which has only one target, classroom video typically involves multiple target persons, so each target person in the current video frame needs to be traversed while tracking.
Specifically,
step 504a, using the preset tracking algorithm, judging whether a first preset graph corresponding to several preset human body key points of a target human body in the current video frame and a second preset graph corresponding to several preset human body key points of a target human body in the previous video frame meet the preset overlapping condition. If they do, the target human body corresponding to the first preset graph is taken as the same target as the target human body of the previous video frame, i.e. as the locked target.
Step 504b, traversing the human body key point coordinates of each target human body, calculating the preset graphs corresponding to the several preset key points of each target, and then returning to step 504a, until the tracking of all target human bodies in the current video frame is completed.
The preset graph corresponding to several preset human body key points may be the figure formed by connecting those key points, or a circumscribed polygon of them, such as a circumscribed quadrilateral or pentagon; the embodiment of the invention places no particular limitation on the specific shape of the preset graph. Preferably, the preset graph is the circumscribed frame of the upper body, and the preset algorithm may be the CMOT algorithm.
Step 505, judging whether a previous f frame including the current frame exists, if yes, executing step 506, otherwise, indicating that the acquired frame quantity is insufficient and the accuracy of gesture recognition is affected, executing step 507, and acquiring the next frame image;
step 506, traversing the coordinates of the key points of the human body of all the locking targets, and pre-judging the sitting posture of each locking target:
In this step, unlike the first embodiment, a classroom video generally includes a plurality of target human bodies. Therefore, after a plurality of locking targets are obtained through human body tracking, each locking target needs to be traversed: the human body key point coordinates of the locking target in the current video frame are obtained and compared with the human body key point coordinates of the same target human body in the previous f frames, so as to pre-judge the gesture of that target human body.
Specifically,
In step 506a, the change in the ordinates of the two human body key points of the left and right shoulders between the current frame and the previous f-th frame of the locking target k (for example, student k) is calculated, the positional relationship of these two key points over the f-frame time sequence is compared, and whether a suspected sitting or standing posture exists is judged by comparing the ordinate change against a judgment threshold; the specific formula is as follows:
d_k = |J_{k,2,1}·x - J_{k,5,1}·x| × α

wherein J_{k,2,1}·x, J_{k,2,1}·y, J_{k,5,1}·x and J_{k,5,1}·y respectively denote the abscissa and ordinate of the right shoulder and of the left shoulder of the locking target k in the current frame; J_{k,2,f}·y and J_{k,5,f}·y denote the ordinates of the right and left shoulders of the locking target k in the previous f-th frame; and d_k, the judgment threshold for the time-sequence positional relationship of the human body key points of the locking target k, takes the value of the product of the shoulder width of the locking target k and the proportionality coefficient α.
According to the formula:
if the sum of the left shoulder displacement and the right shoulder displacement is larger than the judgment threshold, indicating that the human body key points of the current frame are displaced in the positive direction of the y coordinate axis relative to the corresponding human body key points of the previous f-th frame, let W=1 and record the locking target k as a suspected sitting posture;
if the sum of the left shoulder displacement and the right shoulder displacement is smaller than the negative value of the judgment threshold, indicating that the human body key points of the current frame are displaced in the negative direction relative to the corresponding human body key points of the previous f-th frame, let W=2 and record the locking target k as a suspected standing posture;
if the sum of the left and right shoulder displacements falls between the negative value of the judgment threshold and the judgment threshold, indicating that the displacement of the human body key points of the current frame relative to the corresponding key points of the previous f-th frame is very limited, let W=0 and record that the locking target k has no action. In a specific implementation, the relevant parameter takes the value α=0.8.
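A minimal illustrative sketch of this pre-judgment follows; the key point indices (2 for the right shoulder, 5 for the left shoulder) and α = 0.8 follow the text above, while the function name, the data layout of J, and the frame indexing are assumptions of this sketch.

```python
def prejudge_posture(J, k, f, alpha=0.8):
    """Pre-judge the sitting/standing posture of locking target k.

    J[(k, n, frame)] holds (x, y); frame 1 is the current frame and
    frame f is the previous f-th frame; n=2 right shoulder, n=5 left shoulder.
    Returns W: 1 suspected sitting, 2 suspected standing, 0 no action.
    """
    # Judgment threshold d_k: shoulder width in the current frame times alpha
    d_k = abs(J[(k, 2, 1)][0] - J[(k, 5, 1)][0]) * alpha

    # Sum of ordinate displacements of both shoulders (current minus previous)
    dy = (J[(k, 2, 1)][1] - J[(k, 2, f)][1]) + (J[(k, 5, 1)][1] - J[(k, 5, f)][1])

    if dy > d_k:        # moved along the positive y axis: suspected sitting
        return 1
    if dy < -d_k:       # moved along the negative y axis: suspected standing
        return 2
    return 0            # displacement very limited: no action
```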
Step 506b: traverse the next locking target and return to step 506a, until all the locking targets in the current video frame have completed the sitting/standing pre-judgment.
Step 508: for the targets pre-judged as suspected sitting or standing postures, perform sitting/standing detection classification using a machine learning algorithm:
First, picture data containing student scenes is collected. Preferably, picture data containing the sitting postures and/or standing postures of a plurality of targets (students) is screened from the currently collected video frame set as sample data for training, and the first target frames of the sitting targets and standing targets of all students in the sample data are calibrated. Since the sitting and standing postures mainly involve the upper-body key points of the human body, preferably the upper-body target frames of the sitting and standing targets can be calibrated. The first target frames in the calibration data are then expanded, the first target frame image of each student in each video frame is extracted, and two-class samples of sitting posture and standing posture are produced.
And inputting the two classification samples into a convolutional neural network CNN network model, training the model, and storing the current trained CNN network model after the training is finished.
The acquisition of the sample data and the training of the CNN network model may be performed in parallel with the present process or performed in advance on the basis of the sample data, separately from the present process.
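The patent does not specify the training framework or network architecture; the following is a minimal sketch of such two-class training under the assumption of PyTorch, with an arbitrary file name for the saved model.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_two_class_cnn(model: nn.Module, dataset, epochs=10, lr=1e-3, device="cpu"):
    """Train a CNN on two-class samples (e.g. sitting vs. standing crops).

    dataset yields (image_tensor, label) pairs with label 0 or 1.
    """
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    model = model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)  # logits of shape [B, 2]
            loss.backward()
            optimizer.step()
    torch.save(model.state_dict(), "sit_stand_cnn.pt")  # save the trained model
    return model
```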
When a plurality of suspected sitting or standing posture targets are identified in the current frame, a second target frame is generated for each suspected posture target based on the human body key points. After the second target frame is expanded by a certain proportion, each suspected posture target is cropped out, i.e., a second target frame image is extracted from the current frame, and each extracted image is then fed into the CNN classification network in real time for classification. If a target frame image suspected of standing is classified as standing, it is recognized as a standing posture; if a target frame image suspected of sitting is classified as sitting, it is recognized as a sitting posture. The above steps are repeated, traversing all suspected posture targets in the current frame, until all suspected posture targets have completed the sitting/standing detection classification.
The shapes of the first target frame and the second target frame can be the same or different; the second target frame may be generated based on all key points of the target human body, or may be generated based on preset key points related to the gesture to be recognized.
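For illustration, a minimal sketch of the crop-and-classify step; `model` stands for the trained two-class CNN, its `predict` method, the expansion ratio of 0.2, and the label strings are all assumptions of this sketch.

```python
def expand_box(box, ratio, frame_w, frame_h):
    """Expand a key-point-derived box (x1, y1, x2, y2) by a proportion, clipped to the frame."""
    x1, y1, x2, y2 = box
    dw, dh = (x2 - x1) * ratio, (y2 - y1) * ratio
    return (max(0, x1 - dw), max(0, y1 - dh),
            min(frame_w, x2 + dw), min(frame_h, y2 + dh))

def classify_suspects(frame, suspects, model, ratio=0.2):
    """Crop each suspected-posture target and let the CNN confirm or reject it.

    frame: an H x W x C image array.
    suspects: list of (target_id, box, suspected_label), labels 'sitting'/'standing'.
    model.predict is assumed to return 'sitting' or 'standing' for a crop.
    """
    h, w = frame.shape[:2]
    results = {}
    for target_id, box, suspected in suspects:
        x1, y1, x2, y2 = map(int, expand_box(box, ratio, w, h))
        crop = frame[y1:y2, x1:x2]           # second target frame image
        label = model.predict(crop)          # CNN two-class result
        if label == suspected:               # classification confirms the pre-judgment
            results[target_id] = label
    return results
```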
Step 509: for each sitting posture target identified in step 508, traverse the previous M frames of the current frame to determine whether the sitting posture of the target persists, i.e., whether the sitting posture lasts for M consecutive frames. If so, the posture of the target is currently unchanged, and camera lens restoration is triggered to control the camera lens to capture a distant view; otherwise, the next video frame is processed;
Step 510: for each standing posture target identified in step 508, traverse the previous T frames of the current frame to determine whether the standing posture of the target persists, i.e., whether the standing posture lasts for T consecutive frames.
If so, further count whether the number of standing posture targets identified in the current frame equals 1. If only one person in the current frame is standing, camera lens stretching is triggered to control the camera lens to capture a close view of that target; otherwise, i.e., when at least two persons in the current frame are standing, camera lens restoration is triggered to control the lens to capture a distant view.
If the standing posture of the target does not last for T consecutive frames, the next video frame is processed.
T and M are natural numbers and can be set as required.
In the second embodiment, the head and shoulders of each student are accurately positioned through human body key point detection, so that the standing posture of each student can be effectively detected, and actions that easily cause false alarms, such as raising a hand, raising the head, leaning back, or bending forward, can be distinguished. This completes a more accurate standing snapshot process: the students' actions are tracked and photographed, and the positioning, tracking, and key-action pre-judgment and recognition of students are performed automatically. When a student stands up, the camera focuses on that student and clearly captures the student's expression and body movements; when the student sits down, the camera resumes panoramic shooting, more truly reproducing and recording the process of classroom teaching. A further application is that, when a suspected standing or sitting posture is identified in step 506, an alarm target for the suspected posture can be output to an intelligent video analysis system for corresponding analysis or alarm output.
Embodiment three: based on classroom video resources for distance education, this embodiment takes the recognition of the blackboard writing gesture during classroom video recording as its application scenario.
When recording course video, the stretching multiple of the camera is usually required to be adjusted according to whether the main speaker has the blackboard writing action, so that the blackboard writing content of the main speaker can be shot.
A known automatic recognition method for blackboard writing behavior comprises the following steps: and analyzing a frame difference image of the current video frame and the previous video frame, obtaining human body pixel points with motion according to the frame difference image, taking a graph formed by the human body pixel points with motion as a contour, and judging whether the writing on board exists according to the change condition of the contour.
However, this method extracts the target human body pixel points through frame difference image analysis. When the change in the target's action is small, the pixel difference between adjacent frames is small, and pixel points with motion may not be obtained from the frame difference image; the target human body is therefore easily missed, and the detection accuracy of the blackboard writing action is low.
Analysis of the blackboard writing gesture shows that the human body key points related to it are mainly the right wrist, the right elbow and the right shoulder. Based on this, the blackboard writing gesture is identified through the detection and tracking of these human body key points and the changes in their positional relationship in the video frames.
Referring to fig. 7, fig. 7 is a schematic flow chart of a method for identifying blackboard writing behaviors.
Step 700, a current video frame is acquired.
Before video recording, a camera can be installed in the lecturer's classroom, for example at the top of the classroom, with its lens aimed at the blackboard area. A person skilled in the art can set the vertical distance between the camera and the blackboard according to the actual situation; the specific choice is related to the camera's pixel count, the required recording quality, and so on. For example, any distance from 3 meters to 6 meters may be selected as the vertical distance between the camera and the blackboard; the embodiment of the present invention does not specifically limit this distance. In the embodiment of the invention, the camera can be started manually or automatically according to a preset start time; specifically, the preset lecture time may be taken as the preset start time.
After the camera is started, an image can be shot through a camera of the camera, and a current video frame is obtained from the shot image, so that whether the target human body in the current video frame has the writing on blackboard or not is judged.
Step 701, image preprocessing: the acquired images are scaled to a fixed size w×h to obtain images of the same size. This firstly avoids differences in image size caused by differences between camera devices, and secondly improves the accuracy of image analysis in the subsequent steps.
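A minimal sketch of this preprocessing, assuming OpenCV as the image library and 512×512 as an example value for the fixed size w×h:

```python
import cv2

def preprocess(frame, w=512, h=512):
    """Scale an acquired frame to a fixed w x h so frames from different
    camera devices enter the analysis pipeline at the same size."""
    return cv2.resize(frame, (w, h), interpolation=cv2.INTER_LINEAR)
```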
Step 702: detect human body key points according to the human body key points preset on the human body parts related to the gesture to be identified, and acquire and store the human body key point information J_n in the current frame.
Similar to step 403 in the first embodiment, in this step the current video frame is analyzed by a preset machine learning algorithm to obtain a plurality of human body key point heat maps corresponding to the current video frame; the preset machine learning algorithm is obtained through training on sample video frames marked with human body key points and the sample human body key point heat maps corresponding to those sample video frames; any human body key point heat map comprises a part identifier and the human body key point heat values of all pixel points of the current video frame corresponding to that part identifier; the human body key point information of the target human body corresponding to the current video frame is determined according to the plurality of human body key point heat maps; wherein,
The human body key point mark can be formed by manually marking the positions of the human body key points in the picture. In this embodiment, the key points of the human body are the right wrist, the right elbow, and the right shoulder.
In this step, unlike step 403 in the first embodiment, the blackboard writing gesture is characterized by the positional relationship among the preset human body key points within a frame rather than by a significant change in the position of the same preset human body key point between video frames; preferably, therefore, only the coordinates of the preset human body key points in the current frame are detected.
Step 703: determine the relative positional relationship between the preset human body key points according to the preset human body key point information obtained and stored in step 702, and pre-judge whether the blackboard writing behavior exists according to the relative positional relationship:
in the embodiment of the invention, whether the target human body has the writing on board behavior can be judged by judging whether the target human body has the preset action. But whether there is a preset action can be judged by the relative positional relationship between preset key points of the target human body. Therefore, in the embodiment of the invention, the camera can determine the relative position relation between preset key points according to the key point information of the target human body, and determine whether the target human body has the writing on blackboard or not according to the relative position relation.
According to the method provided by the embodiment of the invention, whether the blackboard writing gesture exists in the current video frame is determined through the relative position relation of the human body key points which belong to the same target human body and are marked with the preset position marks.
In one implementation of the embodiment of the present invention, the target human body generally writes on the blackboard with the right hand, driven by the right arm. Therefore, according to the human body key point information of the target human body, the relative positional relationship among the key point marked with the right wrist mark, the key point marked with the right elbow mark, and the key point marked with the right shoulder mark is determined, and whether the blackboard writing gesture exists is judged according to this relative positional relationship. By judging the relative positional relationship of the human body key points corresponding to the right arm, whether the target human body has the blackboard writing gesture can be judged more accurately, and with less key point information, which improves the judgment speed.
Specifically, judging whether the position of a key point marked with a right wrist mark is higher than the position of a key point marked with a right elbow mark in the target human body, whether the horizontal distance between the key point marked with the right wrist mark and the key point marked with the right elbow mark is smaller than a first distance threshold value, and whether the vertical distance between the key point marked with the right elbow mark and the key point marked with the right shoulder mark is smaller than a second distance threshold value; if so, determining that the target human body has the blackboard writing gesture.
The first distance threshold and the second distance threshold may be determined according to the specific action when writing on a blackboard in a real environment and the size of the human arm, and the two thresholds may be equal or unequal. For example, both the first distance threshold and the second distance threshold may be within the range of 18 cm to 22 cm. The specific values of the first distance threshold and the second distance threshold may be set by those skilled in the art according to the actual situation, and the embodiment of the present invention does not specifically limit them.
In one embodiment, the camera may determine whether the target human body has an blackboard writing gesture by using the coordinates of the key points marked with the right wrist marks, the coordinates of the key points marked with the right elbow marks, and the coordinates of the key points marked with the right shoulder marks as parameters of a blackboard writing gesture determination formula according to whether the gesture determination value obtained by the gesture determination formula is a preset value.
In a specific embodiment, the above gesture determination formula may be the following formula:

W = 1, if J_4·y < J_3·y and |J_4·x - J_3·x| < d_j and |J_3·y - J_2·y| < d_i; otherwise W = 0

wherein, in combination with FIG. 3, J_4·x and J_3·x denote the abscissas of the key point marked with the right wrist mark and of the key point marked with the right elbow mark; J_4·y, J_3·y and J_2·y denote the ordinates of the key point marked with the right wrist mark, the key point marked with the right elbow mark and the key point marked with the right shoulder mark; and d_j and d_i denote the first distance threshold and the second distance threshold, respectively. The condition J_4·y < J_3·y expresses that the right wrist is higher than the right elbow in image coordinates, where y increases downward. In the formula, W=1 denotes a suspected blackboard writing gesture and W=0 denotes a non-blackboard-writing gesture.
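For illustration, a minimal sketch of this determination; the key point indices (2 right shoulder, 3 right elbow, 4 right wrist) follow FIG. 3 as described above, and the pixel threshold values are assumptions (the 18 cm to 22 cm range above would require a calibration from centimeters to pixels).

```python
def writing_gesture_value(J, d_j=20.0, d_i=20.0):
    """Blackboard writing gesture determination: W = 1 (suspected) or 0.

    J maps key point index to (x, y) in image coordinates (y grows downward);
    2 = right shoulder, 3 = right elbow, 4 = right wrist. d_j, d_i are the
    first and second distance thresholds (here in pixels, assumed values).
    """
    wrist, elbow, shoulder = J[4], J[3], J[2]
    wrist_above_elbow = wrist[1] < elbow[1]             # smaller y means higher
    horizontal_close = abs(wrist[0] - elbow[0]) < d_j   # wrist-elbow horizontal spacing
    vertical_close = abs(elbow[1] - shoulder[1]) < d_i  # elbow-shoulder vertical spacing
    return 1 if (wrist_above_elbow and horizontal_close and vertical_close) else 0
```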
Because the actions involving the right arm are rich, the pre-judgment may produce false or missed judgments. Preferably, the suspected judgment results are therefore classified and calibrated by a classification network algorithm to improve the accuracy of blackboard writing gesture recognition; see step 704.
Step 704: for the targets pre-judged as suspected blackboard writing gestures, perform blackboard writing detection classification using a machine learning algorithm:
First, picture data containing the blackboard writing gesture is collected. Preferably, sample data containing blackboard writing picture data for training is screened from the currently collected image frame set, and the first target frames of all blackboard writing targets in the sample data are calibrated. Since the blackboard writing gesture mainly involves the key points of the right upper limb of the human body, the right upper limb target frames of the blackboard writing targets can be calibrated. The first target frames in the calibration data are then expanded, the first target frame image in each piece of picture data is extracted, and two-class samples of blackboard writing gesture and non-blackboard-writing gesture are produced. Preferably, the calibrated first target frame is a regular graph, which makes setting the expansion ratio convenient, flexible and simple.
And inputting the two classification samples into a convolutional neural network CNN network model, training the model, and storing the current trained CNN network model after the training is finished.
The acquisition of the sample data and the training of the CNN network model may be performed in parallel with the present process or performed in advance on the basis of the sample data, separately from the present process.
When a suspected blackboard writing gesture is identified in the current frame, a second target frame is generated based on the human body key points and expanded by a certain proportion; the target pre-judged as a suspected blackboard writing gesture is then cropped out, i.e., a second target frame image is extracted from the current frame, and the extracted image is fed into the CNN classification network in real time for classification. If the target frame image of a suspected blackboard writing gesture is classified as blackboard writing, it is recognized as a blackboard writing gesture; if it is classified as non-blackboard-writing, it is recognized as a non-blackboard-writing gesture.
The shapes of the first target frame and the second target frame can be the same or different; the second target frame may be generated based on all key points of the target human body, or may be generated based on preset key points related to the gesture to be recognized.
According to the embodiment of the invention, whether the blackboard writing behavior exists in the current video frame is determined through the relative position relation of the human body key points which belong to the same human body target and are marked with the preset position marks. Of course, it is not necessary for any one product or method of practicing the invention to achieve all of the advantages set forth above at the same time.
Embodiment four: the following takes the recognition of the blackboard writing gestures of a plurality of presenters contained in a video as an example.
In practical applications, the video frame may include not only the presenter but also other persons such as teaching assistants.
Referring to fig. 8, fig. 8 is a schematic flow chart of a method for recognizing a blackboard writing gesture of a human body with multiple targets in a video.
Step 801, recording the current frame number f;
Step 802: image preprocessing, the same as in embodiment three;
step 803, human body key point detection:
First, a scene data set is collected, and then the positions of all the human body key points of each target human body in the pictures are labeled. To facilitate gesture recognition in other applications, preferably the labeled human body key points include 15 points: head, neck, right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, chest, right hip, right knee, right ankle, left hip, left knee and left ankle. To reduce the amount of labeled data, only the key points of the right upper limb, namely the right shoulder, the right elbow and the right wrist, may be labeled for blackboard writing gesture recognition.
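For reference, the index mapping implied by the formulas in this document (indices 2 to 5 appear in the shoulder and blackboard-writing formulas above; the remaining indices are assumed to follow the listing order):

```python
# Assumed key point index mapping, following the listing order above;
# indices 2 (right shoulder), 3 (right elbow), 4 (right wrist) and
# 5 (left shoulder) match the formulas used in this document.
KEYPOINTS = {
    0: "head", 1: "neck", 2: "right shoulder", 3: "right elbow",
    4: "right wrist", 5: "left shoulder", 6: "left elbow", 7: "left wrist",
    8: "chest", 9: "right hip", 10: "right knee", 11: "right ankle",
    12: "left hip", 13: "left knee", 14: "left ankle",
}
```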
One of the embodiments may be that, after the labeling data of the human body key points is obtained, the labeling data is used as training samples for the YPN network model; the YPN network model is trained and the trained YPN network model is saved. The collected current video frame is input into the trained YPN network model, which extracts human body key features from the current video frame, thereby generating human body key point information in real time and obtaining the human body key point information J_{k,n,f} of each human body target in the frame image, where J_{k,n,f} denotes the information of human body key point n of target human body k in the f-th frame, so as to distinguish the key point information of different target human bodies in different video frames.
Step 804: as one of the embodiments, in order to capture the gesture to be recognized of the same target human body among multiple target human bodies, the key point information of the same target must be acquired across frames, and a frame count of at least greater than 1 is needed for this acquisition; therefore, judge whether the currently accumulated frame count reaches a preset threshold; if yes, execute step 805, otherwise execute step 806 and acquire the next video frame;
In the embodiment of the present invention, when the video frames recorded by the camera include a plurality of target human bodies, in order that the current video frame and the previous video frame perform blackboard writing gesture recognition on the same target human body, so that the recognition process continuously targets the same person, this step adopts a preset tracking algorithm: among the target human bodies corresponding to the current video frame, the target human body that is the same target as a target human body corresponding to the previous video frame is selected as the locking target; each target human body is traversed until the tracking of all target human bodies in the current video frame is completed.
Step 807: traverse the preset human body key point coordinates of all the locking targets, and pre-judge the blackboard writing gesture of each locking target to obtain the locking targets with suspected blackboard writing gestures. The specific pre-judgment method may be the judgment method of step 703 in embodiment three. Because the blackboard writing gesture is defined by the positional relationship between human body key points within a frame, the pre-judgment can be performed based on the preset human body key point coordinate relationship of the current frame alone.
Step 808: traverse all the suspected blackboard writing target human bodies and perform blackboard writing and non-blackboard-writing detection classification using a machine learning algorithm; a suspected target classified as a blackboard writing gesture is recognized as a blackboard writing gesture, and a suspected target classified as a non-blackboard-writing gesture is recognized as a non-blackboard-writing gesture.
This embodiment realizes pre-judgment of the blackboard writing gesture through human body key point detection and human body tracking combined with the positional relationship among the human body key points of the blackboard writing gesture, performs detection classification based on the pre-judged suspected blackboard writing gestures, recognizes the gestures of a plurality of target human bodies in the current video frame, and can further trigger shooting control based on the recognized blackboard writing gestures.
Fig. 9 is a schematic view of an apparatus according to an embodiment of the present invention. The apparatus comprises:
The image acquisition module acquires a current video frame;
the key point detection module is used for detecting preset key points of the target object in the current video frame and obtaining preset key point information of the target object in the current frame;
the identification module is used for judging whether the position change of the preset key points of the current object in the current frame and the previous f frames meets the first preset posture condition or not and/or judging whether the position relation between the preset key points of the current object in the current frame meets the second preset posture condition or not according to the preset key point information;
if the preset gesture condition is met, the current gesture of the target object is recognized as the preset gesture;
and f is a natural number, and the preset gesture condition is set according to the position characteristics among key points of the gesture to be recognized of the target object.
And the target tracking module is used for determining the number of targets included in the current frame according to the obtained preset key point information, tracking the targets according to the preset key point information when the number of the targets is more than or equal to 2 to obtain a locking target, and taking the targets in the current frame as the locking targets when the number of the targets is equal to 1.
The inter-preset-key-point frame space identification module is used for judging whether the change of preset key points of the locking target object in the current frame and the previous f frames is larger than a first displacement threshold value or not; when the position change is larger than a first displacement threshold value, determining whether the position change meets a first preset posture condition or not; and when the position change is not greater than the first displacement threshold, determining whether the position relation between preset key points of the current target object meets a second preset posture condition.
The detection classification module inputs the current frame with the recognized current object gesture into the trained machine learning model, and if the machine learning model recognizes the object gesture in the current frame as the preset gesture, the preset gesture is used as a recognition result.
The camera control module is used for controlling the camera lens to capture a distant view when the seating gesture of the target human body recognized by the current frame is continuously provided with M frames; when the standing posture of the target human body identified by the current frame is continuously provided with T frames, if the number of the standing posture target human bodies identified in the current frame is counted to be equal to 1, controlling the camera lens to capture a close view of the target human body, and if the number of the standing posture target human bodies identified in the current frame is counted to be greater than 1, controlling the camera lens to capture a distant view; wherein M, T is a natural number set in advance.
Wherein the detection classification module comprises a detection classification module,
the sample preparation unit is used for calibrating a first target frame of a target object in the picture data containing the gesture of the target object, extracting a first target frame image in the picture data, preparing a classification sample of the recognition gesture and the non-recognition gesture, and inputting the classification sample into the machine learning model unit;
The machine learning model unit is used for training based on the input two classification samples and storing a current trained model; and classifying a second target frame image input in real time through the trained model, wherein the second target frame image is a second target frame of the current target object identified in the current frame generated based on the preset key point, and is an image in the target frame extracted from the current frame.
The identification module may comprise a plurality of identification modules,
the first identification unit is used for determining longitudinal position changes of the same preset human body key points of the current frame and the previous f frames according to preset human body key point information and determining whether the sitting posture condition is met according to the relative position changes;
the second recognition unit is used for determining the relative positional relationship between preset human body key points according to the preset human body key point information, and determining whether the blackboard writing gesture condition is met according to the relative positional relationship.
The first recognition unit may comprise a first recognition unit,
a first calculating subunit, for calculating the sum of the left shoulder key point displacement between the current frame and the previous f-th frame and the right shoulder key point displacement between the current frame and the previous f-th frame, and for calculating the human body key point time-sequence positional relationship judgment threshold, whose value is proportional to the distance between the left shoulder and right shoulder human body key points in the current frame;
a first comparing subunit, for comparing whether the sum of the left shoulder key point displacement between the current frame and the previous f-th frame and the right shoulder key point displacement between the current frame and the previous f-th frame is larger than the human body key point time-sequence positional relationship judgment threshold: if yes, a suspected standing posture is identified; if the sum is smaller than the negative value of the judgment threshold, a suspected sitting posture is identified; and if the sum falls between the negative value of the judgment threshold and the judgment threshold, a no-action gesture is identified.
The second identification unit may comprise a processor configured to,
a second calculation subunit calculating a horizontal distance between the right wrist key point and the right elbow key point, and a vertical distance between the right elbow key point and the right shoulder key point;
a second comparing subunit comparing whether the right wrist keypoint location is higher than the right elbow keypoint location, and whether the calculated horizontal spacing is less than a first spacing threshold, and whether the calculated vertical spacing is less than a second spacing threshold; if yes, recognizing that the target human body has the writing gesture, otherwise, recognizing that the target human body is in a non-writing gesture.
The image pickup device provided by the embodiment of the invention comprises a camera, a memory and a processor, wherein,
the camera is used for shooting images;
The memory is used for storing a computer program;
the processor is used for executing the program stored in the memory to realize the target object gesture recognition method.
The Memory may include random access Memory (Random Access Memory, RAM) or may include Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processing, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
The embodiment of the invention also provides a computer readable storage medium, wherein the storage medium stores a computer program, and the computer program realizes the following steps when being executed by a processor:
Acquiring a current video frame;
detecting preset key points of the target objects in the current video frame, and obtaining preset key point information of the target objects in the current frame;
judging whether the position change of the preset key points of the current object in the current frame and the previous f frames meets a first preset posture condition or not according to the preset key point information, and/or judging whether the position relation between the preset key points of the current object in the current frame meets a second preset posture condition or not;
if the preset posture condition is met, the current object posture is recognized as the preset posture,
and f is a preset natural number, and the preset gesture condition is set according to the position characteristics among key points of the gesture to be recognized of the target object.
The storage medium provided by the embodiment of the invention can accurately identify the tiny change of the gesture, has no specific requirements on the target object and the gesture, has wide application range, has low requirements on the image in the video frame, has high accuracy of identifying the gesture, and has small false detection and omission detection of gesture identification.
For the device/camera/storage medium embodiments, the description is relatively simple as it is substantially similar to the method embodiments, as relevant see the section description of the method embodiments.
It should be noted that the object gesture recognition embodiments provided by the present invention are not limited to the above embodiments; the present invention may be applied to gesture recognition of other objects. For example, gesture recognition and correction may be performed by capturing video of a body builder during exercise; for another example, for behavior tracking and capture of an animal, the preset gesture condition may be set according to the positional features and/or motion features between the key points of the gesture to be recognized of the object.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a description of preferred embodiments of the invention and is not intended to limit the invention; any modification, equivalent replacement, improvement or the like made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (19)

1. A method for recognizing the gesture of a target object, characterized by comprising:
acquiring a current video frame;
detecting preset key points of the target objects in the current video frame, and obtaining preset key point information of the target objects in the current frame;
judging whether the position change of the preset key points of the current object in the current frame and the previous f frames meets a first preset posture condition or not according to the preset key point information, and/or judging whether the position relation between the preset key points of the current object in the current frame meets a second preset posture condition or not;
if the preset posture condition is met, the current object posture is recognized as the preset posture,
wherein,
f is a preset natural number, and the number of the natural numbers is equal to the preset natural number,
the preset gesture conditions are set according to the position characteristics among key points of the gesture to be recognized of the target object,
the first preset posture condition is a sitting posture condition,
the determining whether the position change of the preset key point of the current target object in the current frame and the previous f frame meets the first preset posture condition includes:
According to the preset human body key point information, determining the longitudinal position change of the same preset human body key point of the current frame and the previous f frame, and determining whether the sitting posture condition is met according to the relative position change;
the step of determining the longitudinal position change of the same preset human body key point of the current frame and the previous f frames according to the preset human body key point information and determining whether the sitting posture condition is met according to the relative position change comprises the following steps:
according to preset left shoulder and right shoulder human body key point information, determining longitudinal position changes of left shoulder and right shoulder human body key points in the current frame and the previous f frame, and determining whether the sitting posture condition is met according to the position changes;
the second preset posture condition is a blackboard writing posture condition,
the determining whether the position relationship between the preset key points of the current target object in the current frame meets the second preset posture condition comprises:
determining the relative position relation between preset human body key points according to the preset human body key point information, and determining whether the board writing posture condition is met according to the relative position relation;
determining the relative position relation between preset human body key points according to the preset human body key point information, and determining whether the board writing posture condition is met according to the relative position relation, wherein the method comprises the following steps:
And determining the relative position relation among the right wrist key point, the right elbow key point and the right shoulder key point according to preset right wrist, right elbow and right shoulder human body key point information, and determining whether the gesture condition of the blackboard writing is met according to the relative position relation.
2. The identification method of claim 1, further comprising,
inputting the current frame with the recognized current object gesture into a trained machine learning model, and taking the preset gesture as a recognition result if the machine learning model recognizes the object gesture in the current frame as the preset gesture.
3. The recognition method of claim 2, wherein the inputting the current frame recognizing the posture of the current object into the trained machine learning model, if the machine learning model recognizes the posture of the object in the current frame as the preset posture, takes the preset posture as the recognition result, includes,
collecting picture data containing the pose of the object,
calibrating a first target frame of the target object in the picture data, extracting a first target frame image in the picture data, preparing a classification sample of the identification gesture and the non-identification gesture,
Inputting the two classification samples into a machine learning model, training the model, and storing the model after the current training;
generating a second target frame of the identified current target object in the current frame based on the preset key points, extracting a second target frame image from the current frame, inputting the second target frame image into the trained model in real time for classification, and taking the classification result as an identification result if the machine learning model classifies the model into the identified gesture.
4. The method of claim 1, wherein the step of obtaining the predetermined key point information of the object in the current frame further comprises,
judging whether the current frame comprises two or more target objects according to the obtained preset key point information, and if so, tracking the target objects according to the preset key point information to obtain a locked target object; otherwise, the target in the current frame is used as a locking target;
the method may further comprise the steps of,
traversing the locking target object in the current frame, and carrying out gesture recognition according to preset key point information of the current locking target object until all target object gestures of the current frame are recognized.
5. The identification method of any one of claims 1 to 4, further comprising,
Judging whether the change of the preset key point position of the locking target object in the current frame and the previous f frames is larger than a first displacement threshold value,
if yes, executing the step of judging whether the position change of the preset key point of the current target object in the current frame and the previous f frame meets the first preset posture condition,
otherwise, executing the step of judging whether the position relation between the preset key points of the current target object in the current frame meets the second preset posture condition.
6. The method of claim 2, wherein determining longitudinal position changes of left and right shoulder human body key points in the current frame and the previous f-frame according to the preset left and right shoulder human body key point information, and determining whether the sitting posture condition is met according to the position changes, comprises,
judging whether the sum of the left shoulder key point displacement between the current frame and the previous f-th frame and the right shoulder key point displacement between the current frame and the previous f-th frame is larger than a human body key point time-sequence positional relationship judgment threshold; if yes, identifying a suspected standing posture; if the sum is smaller than the negative value of the judgment threshold, identifying a suspected sitting posture; and if the sum falls between the negative value of the judgment threshold and the judgment threshold, identifying a no-action gesture;
The human body key point time sequence position relation judging threshold value is in proportional relation with the distance between the left shoulder and the right shoulder in the current frame.
7. The recognition method of claim 6, wherein the inputting the current frame recognizing the posture of the current object into the trained machine learning model, if the machine learning model recognizes the posture of the object in the current frame as the preset posture, the preset posture is taken as the recognition result, comprising,
collect picture data containing the standing posture and/or sitting posture of the target person,
calibrating a first target frame of the target human body in the picture data, extracting a first target frame image in the picture data, manufacturing a two-class sample of the standing posture of the human body and the sitting posture of the human body,
inputting the two classification samples into a machine learning model, training the model, and storing the model after the current training;
and generating a second target frame based on a preset key point in the current frame, extracting a second target frame image from the current frame, inputting the second target frame image into a trained model in real time for classification, identifying the target frame image as the standing posture if the target frame image is classified as the standing posture by the machine learning model, and identifying the target frame image as the sitting posture if the target frame image is classified as the sitting posture by the machine learning model.
8. The identification method of claim 7, further comprising,
judging whether the seating posture of the target human body identified by the current frame is continuously provided with M frames, and if so, controlling a camera lens to capture a distant view;
judging whether the standing posture of the target human body identified by the current frame is continuously provided with T frames, if so, counting whether the number of the standing posture target human bodies identified in the current frame is equal to 1, if so, controlling a camera lens to capture a close range of the target human body, otherwise, controlling the camera lens to capture a distant range;
wherein M, T is a natural number set in advance.
9. The recognition method of claim 1, wherein the determining the relative positional relationship among the right wrist key, the right elbow key and the right shoulder key based on the preset right wrist, right elbow and right shoulder human key information, and determining whether the writing gesture condition is met based on the relative positional relationship, comprises,
judging whether the position of the right wrist key point is higher than the position of the right elbow key point, whether the horizontal distance between the right wrist key point and the right elbow key point is smaller than a first distance threshold value, and whether the vertical distance between the right elbow key point and the right shoulder key point is smaller than a second distance threshold value;
If yes, recognizing that the target human body has the writing gesture, otherwise, recognizing that the target human body is in the non-writing gesture.
10. A gesture recognition apparatus for an object, characterized in that the apparatus comprises,
the image acquisition module acquires a current video frame;
the key point detection module is used for detecting preset key points of the target object in the current video frame and obtaining preset key point information of the target object in the current frame;
the identification module is used for judging whether the position change of the preset key points of the current object in the current frame and the previous f frames meets the first preset posture condition or not and/or judging whether the position relation between the preset key points of the current object in the current frame meets the second preset posture condition or not according to the preset key point information;
if the preset gesture condition is met, the current gesture of the target object is recognized as the preset gesture;
wherein,
f is a preset natural number, and the number of the natural numbers is equal to the preset natural number,
the preset gesture conditions are set according to the position characteristics among key points of the gesture to be recognized of the target object,
the first preset posture condition is a sitting posture condition, and the identification module comprises:
a first identification unit for determining the longitudinal position change of the same preset human body key point of the current frame and the previous f frame according to the preset human body key point information, determining whether the sitting posture condition is met according to the relative position change,
The step of determining the longitudinal position change of the same preset human body key point of the current frame and the previous f frames according to the preset human body key point information and determining whether the sitting posture condition is met according to the relative position change comprises the following steps:
according to preset left shoulder and right shoulder human body key point information, determining longitudinal position changes of left shoulder and right shoulder human body key points in the current frame and the previous f frame, and determining whether the sitting posture condition is met according to the position changes;
the second preset posture condition is a blackboard writing posture condition, and the identification module comprises:
a second recognition unit for determining the relative position relation between the preset human body key points according to the preset human body key point information and determining whether the condition of the blackboard writing gesture is met according to the relative position relation,
the determining the relative position relation between the preset human body key points according to the preset human body key point information and determining whether the condition of the board writing posture is met according to the relative position relation comprises the following steps:
and determining the relative position relation among the right wrist key point, the right elbow key point and the right shoulder key point according to preset right wrist, right elbow and right shoulder human body key point information, and determining whether the gesture condition of the blackboard writing is met according to the relative position relation.
11. The apparatus of claim 10, further comprising,
the detection classification module inputs the current frame with the recognized current object gesture into the trained machine learning model, and if the machine learning model recognizes the object gesture in the current frame as the preset gesture, the preset gesture is used as a recognition result.
12. The apparatus of claim 11, wherein the detection classification module comprises,
the sample preparation unit is used for calibrating a first target frame of a target object in the picture data containing the gesture of the target object, extracting a first target frame image in the picture data, preparing a classification sample of the recognition gesture and the non-recognition gesture, and inputting the classification sample into the machine learning model unit;
the machine learning model unit is used for training based on the input two classification samples and storing a current trained model; and classifying a second target frame image input in real time through the trained model, wherein the second target frame image is a second target frame of the current target object identified in the current frame generated based on the preset key point, and is an image in the target frame extracted from the current frame.
13. The apparatus of claim 10, further comprising,
and the target tracking module is used for determining the number of targets included in the current frame according to the obtained preset key point information, tracking the targets according to the preset key point information when the number of the targets is more than or equal to 2 to obtain a locking target, and taking the targets in the current frame as the locking targets when the number of the targets is equal to 1.
14. The apparatus of any one of claims 10 to 13, further comprising,
the inter-preset-key-point frame space identification module is used for judging whether the change of preset key points of the locking target object in the current frame and the previous f frames is larger than a first displacement threshold value or not; when the position change is larger than a first displacement threshold value, determining whether the position change meets a first preset posture condition or not; and when the position change is not greater than the first displacement threshold, determining whether the position relation between preset key points of the current target object meets a second preset posture condition.
15. The apparatus of claim 10, wherein the first identification unit comprises,
a first calculating subunit, for calculating the sum of the left shoulder key point displacement between the current frame and the previous f-th frame and the right shoulder key point displacement between the current frame and the previous f-th frame, and for calculating the human body key point time-sequence positional relationship judgment threshold, whose value is proportional to the distance between the left shoulder and right shoulder human body key points in the current frame;
a first comparing subunit, for comparing whether the sum of the left shoulder key point displacement between the current frame and the previous f-th frame and the right shoulder key point displacement between the current frame and the previous f-th frame is larger than the human body key point time-sequence positional relationship judgment threshold: if yes, a suspected standing posture is identified; if the sum is smaller than the negative value of the judgment threshold, a suspected sitting posture is identified; and if the sum falls between the negative value of the judgment threshold and the judgment threshold, a no-action gesture is identified.
16. The apparatus of claim 15, further comprising,
the camera control module is used for controlling the camera lens to capture a distant view when the seating gesture of the target human body recognized by the current frame is continuously provided with M frames; when the standing posture of the target human body identified by the current frame is continuously provided with T frames, if the number of the standing posture target human bodies identified in the current frame is counted to be equal to 1, controlling the camera lens to capture a close view of the target human body, and if the number of the standing posture target human bodies identified in the current frame is counted to be greater than 1, controlling the camera lens to capture a distant view; wherein M, T is a natural number set in advance.
17. The apparatus of claim 10, wherein the second identification unit comprises,
a second calculating subunit, configured to calculate the horizontal distance between the right wrist key point and the right elbow key point, and the vertical distance between the right elbow key point and the right shoulder key point;
a second comparing subunit, configured to check whether the right wrist key point is higher than the right elbow key point, whether the calculated horizontal distance is less than a first distance threshold, and whether the calculated vertical distance is less than a second distance threshold; if all three conditions hold, the target human body is recognized as being in the writing posture, otherwise it is recognized as being in the non-writing posture.
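Claim 17 reduces to three geometric checks on the right arm; the sketch below assumes (x, y) pixel coordinates with y increasing downward, and the thresholds d1 and d2 are illustrative placeholders.

```python
def is_writing(wrist_r, elbow_r, shoulder_r, d1=20.0, d2=30.0):
    """Writing-posture test on right wrist/elbow/shoulder key points (cf. claim 17)."""
    wrist_higher = wrist_r[1] < elbow_r[1]          # smaller y = higher in the image
    horiz_ok = abs(wrist_r[0] - elbow_r[0]) < d1    # wrist close to elbow horizontally
    vert_ok = abs(elbow_r[1] - shoulder_r[1]) < d2  # elbow close to shoulder vertically
    return wrist_higher and horiz_ok and vert_ok
```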
18. A camera device, characterized by comprising a camera, a memory and a processor, wherein,
the camera is used for shooting images;
the memory is used for storing a computer program;
the processor is configured to execute the program stored in the memory so as to implement the steps of the object gesture recognition method according to any one of claims 1 to 9.
19. A computer storage medium storing a computer program which, when executed by a processor, implements the steps of the object gesture recognition method according to any one of claims 1 to 9.
CN201811247103.2A 2018-10-25 2018-10-25 Object gesture recognition method and device and camera Active CN111104816B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811247103.2A CN111104816B (en) 2018-10-25 2018-10-25 Object gesture recognition method and device and camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811247103.2A CN111104816B (en) 2018-10-25 2018-10-25 Object gesture recognition method and device and camera

Publications (2)

Publication Number Publication Date
CN111104816A CN111104816A (en) 2020-05-05
CN111104816B true CN111104816B (en) 2023-11-03

Family

ID=70417519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811247103.2A Active CN111104816B (en) 2018-10-25 2018-10-25 Object gesture recognition method and device and camera

Country Status (1)

Country Link
CN (1) CN111104816B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111275031B (en) * 2020-05-07 2020-09-08 西南交通大学 Flat plate support detection method, device, equipment and medium based on human body key points
CN111836072B (en) * 2020-05-21 2022-09-13 北京嘀嘀无限科技发展有限公司 Video processing method, device, equipment and storage medium
CN111626211B (en) * 2020-05-27 2023-09-26 大连成者云软件有限公司 Sitting posture identification method based on monocular video image sequence
CN111553326B (en) * 2020-05-29 2023-04-18 上海依图网络科技有限公司 Hand motion recognition method and device, electronic equipment and storage medium
CN111814587A (en) * 2020-06-18 2020-10-23 浙江大华技术股份有限公司 Human behavior detection method, teacher behavior detection method, and related system and device
JP7459679B2 (en) * 2020-06-23 2024-04-02 富士通株式会社 BEHAVIOR RECOGNITION METHOD, BEHAVIOR RECOGNITION PROGRAM, AND BEHAVIOR RECOGNITION DEVICE
CN112200088A (en) * 2020-10-10 2021-01-08 普联技术有限公司 Sitting posture monitoring method, device, equipment and system
CN112287840B (en) * 2020-10-30 2022-07-22 焦点科技股份有限公司 Method and system for intelligently acquiring exercise capacity analysis data
CN112308073B (en) * 2020-11-06 2023-08-25 中冶赛迪信息技术(重庆)有限公司 Method, system, equipment and medium for identifying loading and unloading and transferring states of scrap steel train
CN112487964B (en) * 2020-11-27 2023-08-01 深圳市维海德技术股份有限公司 Gesture detection and recognition method, gesture detection and recognition equipment and computer-readable storage medium
CN112580584A (en) * 2020-12-28 2021-03-30 苏州科达科技股份有限公司 Method, device and system for detecting standing behavior and storage medium
CN112966571B (en) * 2021-02-09 2022-07-12 安徽一视科技有限公司 Standing long jump flight height measurement method based on machine vision
CN113191216B (en) * 2021-04-13 2023-02-10 复旦大学 Multi-user real-time action recognition method and system based on posture recognition and C3D network
CN113052147B (en) * 2021-04-30 2023-04-25 北京邮电大学 Behavior recognition method and device
CN113392776B (en) * 2021-06-17 2022-07-12 深圳日海物联技术有限公司 Seat leaving behavior detection method and storage device combining seat information and machine vision
CN113688667A (en) * 2021-07-08 2021-11-23 华中科技大学 Deep learning-based luggage taking and placing action recognition method and system
CN113255622B (en) * 2021-07-14 2021-09-21 北京壹体科技有限公司 System and method for intelligently identifying sit-up action posture completion condition
CN113657163A (en) * 2021-07-15 2021-11-16 浙江大华技术股份有限公司 Behavior recognition method, electronic device, and storage medium
CN113591642A (en) * 2021-07-20 2021-11-02 广州市奥威亚电子科技有限公司 Classroom personnel posture judgment method and device
CN114111816B (en) * 2021-11-16 2022-10-04 北京长隆讯飞科技有限公司 Low-cost lane-level high-precision map method based on artificial intelligence
CN114187666B (en) * 2021-12-23 2022-09-02 中海油信息科技有限公司 Identification method and system for watching mobile phone while walking
CN115205982B (en) * 2022-09-08 2023-01-31 深圳市维海德技术股份有限公司 Standing tracking detection method, electronic device, and medium
CN115240278B (en) * 2022-09-23 2023-01-06 东莞先知大数据有限公司 Fishing behavior detection method

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004042548A1 (en) * 2002-11-07 2004-05-21 Olympus Corporation Movement detection device
CN102096812A (en) * 2011-01-30 2011-06-15 吴柯维 Teacher blackboard writing action detection method for intelligent teaching recording and playing system
KR20110121343A (en) * 2010-04-30 2011-11-07 현대중공업 주식회사 Method for measuring 3d pose information using virtual plane information
WO2013056311A1 (en) * 2011-10-20 2013-04-25 The University Of Sydney Keypoint based keyframe selection
CN107895244A (en) * 2017-12-26 2018-04-10 重庆大争科技有限公司 Classroom teaching quality assessment method
CN108062526A (en) * 2017-12-15 2018-05-22 厦门美图之家科技有限公司 A kind of estimation method of human posture and mobile terminal
CN108197534A (en) * 2017-12-19 2018-06-22 迈巨(深圳)科技有限公司 A kind of head part's attitude detecting method, electronic equipment and storage medium
CN108229355A (en) * 2017-12-22 2018-06-29 北京市商汤科技开发有限公司 Activity recognition method and apparatus, electronic equipment, computer storage media, program
CN108256472A (en) * 2018-01-17 2018-07-06 清华大学 A kind of sequence of video images segmenting system and method
CN108304819A (en) * 2018-02-12 2018-07-20 北京易真学思教育科技有限公司 Gesture recognition system and method, storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7809159B2 (en) * 2003-10-30 2010-10-05 Nec Corporation Estimation system, estimation method, and estimation program for estimating object state
US7613324B2 (en) * 2005-06-24 2009-11-03 ObjectVideo, Inc Detection of change in posture in video
CN107239728B (en) * 2017-01-04 2021-02-02 赛灵思电子科技(北京)有限公司 Unmanned aerial vehicle interaction device and method based on deep learning attitude estimation

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004042548A1 (en) * 2002-11-07 2004-05-21 Olympus Corporation Movement detection device
KR20110121343A (en) * 2010-04-30 2011-11-07 현대중공업 주식회사 Method for measuring 3d pose information using virtual plane information
CN102096812A (en) * 2011-01-30 2011-06-15 吴柯维 Teacher blackboard writing action detection method for intelligent teaching recording and playing system
WO2013056311A1 (en) * 2011-10-20 2013-04-25 The University Of Sydney Keypoint based keyframe selection
CN108062526A (en) * 2017-12-15 2018-05-22 厦门美图之家科技有限公司 A kind of estimation method of human posture and mobile terminal
CN108197534A (en) * 2017-12-19 2018-06-22 迈巨(深圳)科技有限公司 A kind of head part's attitude detecting method, electronic equipment and storage medium
CN108229355A (en) * 2017-12-22 2018-06-29 北京市商汤科技开发有限公司 Activity recognition method and apparatus, electronic equipment, computer storage media, program
CN107895244A (en) * 2017-12-26 2018-04-10 重庆大争科技有限公司 Classroom teaching quality assessment method
CN108256472A (en) * 2018-01-17 2018-07-06 清华大学 A kind of sequence of video images segmenting system and method
CN108304819A (en) * 2018-02-12 2018-07-20 北京易真学思教育科技有限公司 Gesture recognition system and method, storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Xu Weiwei; Li Jun. A robust real-time method for tracking facial key points. Computer Engineering. 2017, (04), full text. *
Luo Zi'an. Research and simulation of target tracking based on feature matching. China Master's Theses Full-text Database, Information Science and Technology. 2017, full text. *
Zhu Lin. Research on human action recognition methods in video. China Master's Theses Full-text Database, Information Science and Technology. Full text. *
Zhong Bineng. Research on moving object detection and tracking algorithms in complex dynamic scenes. China Master's Theses Full-text Database, Information Science and Technology. 2011, full text. *

Also Published As

Publication number Publication date
CN111104816A (en) 2020-05-05

Similar Documents

Publication Publication Date Title
CN111104816B (en) Object gesture recognition method and device and camera
WO2022002150A1 (en) Method and device for constructing visual point cloud map
CN109684924B (en) Face living body detection method and device
CN109919974B (en) Online multi-target tracking method based on R-FCN frame multi-candidate association
CN108038452B (en) Household appliance gesture rapid detection and identification method based on local image enhancement
CN110765814B (en) Blackboard writing behavior recognition method and device and camera
CN105740780B (en) Method and device for detecting living human face
CN101406390B (en) Method and apparatus for detecting part of human body and human, and method and apparatus for detecting objects
CN105930822A (en) Human face snapshot method and system
CN108875730B (en) Deep learning sample collection method, device, equipment and storage medium
CN107392182B (en) Face acquisition and recognition method and device based on deep learning
CN109886245A (en) A kind of pedestrian detection recognition methods based on deep learning cascade neural network
WO2020252974A1 (en) Method and device for tracking multiple target objects in motion state
CN104573617B (en) A kind of camera shooting control method
CN105898107B (en) A kind of target object grasp shoot method and system
CN112001219B (en) Multi-angle multi-face recognition attendance checking method and system
CN113361542B (en) Local feature extraction method based on deep learning
CN110717392B (en) Sitting posture detection and correction method and device
CN110555408B (en) Single-camera real-time three-dimensional human body posture detection method based on self-adaptive mapping relation
CN107808376A (en) A kind of detection method of raising one's hand based on deep learning
CN103065163B (en) Fast target detection and recognition system and method based on static images
CN113592911B (en) Apparent enhanced depth target tracking method
CN106780565A (en) Stand-up and sit-down detection method for multiple students based on optical flow and k-means clustering
JP2009230703A (en) Object detection method, object detection device, and object detection program
CN112364805A (en) Rotary palm image detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant