CN112446240A - Action recognition method and device
- Publication number: CN112446240A
- Application number: CN201910807508.5A
- Authority: CN (China)
- Prior art keywords: target, type, video frame, sequence corresponding, key point
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; no legal analysis has been performed as to the accuracy of the status listed)
Classifications

- G—PHYSICS
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
      - G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
        - G06V40/20—Movements or behaviour, e.g. gesture recognition
          - G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
      - G06V20/00—Scenes; Scene-specific elements
        - G06V20/40—Scenes; Scene-specific elements in video content
    - G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
      - G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
        - G06Q50/10—Services
          - G06Q50/20—Education
Abstract
An embodiment of the invention provides an action recognition method and device in the field of computer vision. The method comprises: detecting key points of first type targets; obtaining key point sequences corresponding to second type targets; generating a key point sequence for each first type target from the detected key points and the key point sequences of the second type targets that belong to the same objects as the first type targets; and performing action recognition on each first type target according to its key point sequence. The scheme provided by the embodiment of the invention can thus recognize the actions of targets in a video.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a motion recognition method and device.
Background
In some video processing scenarios, the actions of targets in the video need to be recognized. For example, in a classroom teaching scenario, to help a teacher or parent understand each student's learning state, the surveillance video of the classroom may be analyzed to recognize students' hand actions, such as raising a hand, lowering the head, resting the chin on a hand, and clapping.
Therefore, a motion recognition method is needed to recognize the motion of the target in the video.
Disclosure of Invention
An object of the embodiments of the present invention is to provide an action recognition method and device for recognizing the actions of targets in a video. The specific technical solutions are as follows:
In a first aspect, an embodiment of the present invention provides an action recognition method, where the method includes:
detecting key points of a first type target, wherein a first type target is a target in a current video frame;
obtaining a key point sequence corresponding to each second type target, wherein a second type target is a target in the video frame previous to the current video frame, and the key point sequence corresponding to each second type target is a sequence formed from the key points of that target in the previous video frame and in the video frames before the previous video frame, arranged by the playing order of the video frames and the arrangement order of the key points;
generating a key point sequence corresponding to each first type target according to the detected key points and the key point sequences corresponding to the targets among the second type targets that belong to the first type targets;
and performing action recognition on each first type target according to the key point sequence corresponding to each first type target.
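Before the detailed embodiments, the four steps above can be pictured end to end with a minimal Python sketch. Everything in it is an illustrative assumption rather than part of the claimed method: the helper callables detect_keypoints, match_target, and recognize_action stand in for the concrete techniques described later, and the dictionary bookkeeping is just one way to hold per-target sequences.

```python
def process_frame(frame, prev_sequences, detect_keypoints, match_target,
                  recognize_action, next_id):
    """Hedged sketch of the four-step flow; helper callables are assumed.

    prev_sequences maps a target id to the key point sequence accumulated
    over earlier frames (the second type targets)."""
    sequences = {}
    for kps in detect_keypoints(frame):          # step 1: per-target key points
        tid = match_target(kps, prev_sequences)  # steps 2-3: associate targets
        if tid is None:                          # target newly appeared
            tid = next_id()
        sequences[tid] = prev_sequences.get(tid, []) + list(kps)
    # Step 4: recognize each first type target's action from its sequence.
    return {tid: recognize_action(seq) for tid, seq in sequences.items()}
```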
In an embodiment of the present invention, generating the key point sequence corresponding to each first type target according to the detected key points and the key point sequences corresponding to the targets among the second type targets that belong to the first type targets includes:
determining the area of each first type target in the current video frame as a first type target area according to the detected key points;
obtaining the area where each second type target in the previous video frame is located as a second type target area;
calculating the similarity between each first type target area and each second type target area, and determining a second type target which belongs to the same target with each first type target according to the calculated similarity;
and generating a key point sequence corresponding to each first class of target according to the key point sequence corresponding to the determined target and the detected key points.
In an embodiment of the present invention, the performing motion recognition on each first-class target according to the key point sequence corresponding to each first-class target includes:
inputting the key point sequence corresponding to each first type target into a pre-trained action recognition model to recognize the action of each first type target, wherein the action recognition model is a model for recognizing target actions, obtained by training a preset neural network model with the key point sequences corresponding to targets in the video frames of a sample video and the labeled actions of those targets.
In an embodiment of the present invention, after performing motion recognition on each first-class object according to a key point sequence corresponding to each first-class object, the method further includes:
determining, among the first type targets, the targets whose recognition results represent a preset action as third type targets;
obtaining the areas of the third type targets in the current video frame and the video frame before the current video frame respectively to obtain the area sequence corresponding to each third type target;
and performing action recognition on each third type target according to the region sequence corresponding to each third type target.
In an embodiment of the present invention, obtaining the areas in which the third type targets are located in the current video frame and in the video frames before the current video frame, to obtain the area sequence corresponding to each third type target, includes:
obtaining the minimum area of each third type target in each video frame before the current video frame and the current video frame respectively to obtain the minimum area sequence corresponding to each third type target;
and performing region expansion on each region in the minimum region sequence corresponding to each third type of target to obtain a region sequence corresponding to each third type of target.
In a second aspect, an embodiment of the present invention provides an action recognition apparatus, where the apparatus includes:
the key point detection module is used for detecting key points of first type targets, where a first type target is a target in the current video frame;
a key point sequence obtaining module, configured to obtain a key point sequence corresponding to each second type target, where a second type target is a target in the video frame previous to the current video frame, and the key point sequence corresponding to each second type target is a sequence formed from the key points of that target in the previous video frame and in the video frames before the previous video frame, arranged by the playing order of the video frames and the arrangement order of the key points;
the sequence generation module is used for generating a key point sequence corresponding to each first type target according to the detected key points and the key point sequences corresponding to the targets among the second type targets that belong to the first type targets;
and the first action identification module is used for identifying the action of each first type of target according to the key point sequence corresponding to each first type of target.
In an embodiment of the present invention, the sequence generating module is specifically configured to:
determining the area of each first type target in the current video frame as a first type target area according to the detected key points;
obtaining the area where each second type target in the previous video frame is located as a second type target area;
calculating the similarity between each first type target area and each second type target area, and determining a second type target which belongs to the same target with each first type target according to the calculated similarity;
and generating a key point sequence corresponding to each first class of target according to the key point sequence corresponding to the determined target and the detected key points.
In an embodiment of the present invention, the first action recognition module is specifically configured to:
inputting the key point sequence corresponding to each first type target into a pre-trained action recognition model to recognize the action of each first type target, wherein the action recognition model is a model for recognizing target actions, obtained by training a preset neural network model with the key point sequences corresponding to targets in the video frames of a sample video and the labeled actions of those targets.
In an embodiment of the present invention, for processing after the first action recognition module performs action recognition on each first type target, the apparatus further includes:
the target determining module is used for determining, among the first type targets, the targets whose recognition results represent a preset action as third type targets;
the region sequence obtaining module is used for obtaining the regions of the third type targets in the current video frame and the video frames before the current video frame respectively to obtain the region sequence corresponding to each third type target;
and the second action identification module is used for identifying the action of each third type of target according to the region sequence corresponding to each third type of target.
In an embodiment of the present invention, the region sequence obtaining module is specifically configured to:
obtaining the minimum area of each third type target in each video frame before the current video frame and the current video frame respectively to obtain the minimum area sequence corresponding to each third type target;
and performing region expansion on each region in the minimum region sequence corresponding to each third type of target to obtain a region sequence corresponding to each third type of target.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor adapted to perform the method steps of any of the first aspects when executing a program stored in the memory.
In a fourth aspect, the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method steps of any one of the first aspects.
According to the technical solutions above, when the scheme provided by the embodiment of the invention is applied to action recognition, the key points of the first type targets are detected and the key point sequences corresponding to the second type targets are obtained; a key point sequence is then generated for each first type target from the detected key points and the key point sequences of the second type targets that belong to the same objects, and action recognition is performed on each first type target according to its key point sequence. A first type target is a target in the current video frame, a second type target is a target in the previous video frame, and the key point sequence corresponding to each second type target is formed from that target's key points in the previous video frame and in earlier video frames, arranged by the playing order of the video frames and the arrangement order of the key points. The scheme therefore recognizes the actions of targets in the video while taking into account the targets' key points in each video frame before the current one; the information on which action recognition is based is thus very rich, and the recognized actions are more accurate.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a motion recognition method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of another motion recognition method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an action recognition device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of another motion recognition device according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a method and a device for recognizing actions, which are respectively described in detail below.
Referring to fig. 1, fig. 1 is a schematic flow chart of an action recognition method according to an embodiment of the present invention. The method includes the following steps 101-104.

Step 101: detecting key points of a first type of target.

Wherein the first type of target is: a target in the current video frame.
The target refers to an object whose action is to be recognized. The key points of a target can be understood as its feature points. Specifically, the target may differ depending on the application scenario.
For example, in a classroom teaching scene, if hand actions of students such as raising a hand, lowering the head, and resting the chin on a hand need to be recognized, the target may be a student's hand, and correspondingly the key points of the target may be feature points representing parts such as the elbow, wrist, and palm. In a traffic monitoring scene, if actions of a car such as changing lanes or turning need to be recognized, the target is the whole car, and correspondingly the key points of the target may be feature points representing positions such as the car's head, wheels, and windows.
The above is only an example; in other application scenarios the target may also be a person's leg, arm, head, neck, or crotch, an animal's limb, and so on. The present invention does not limit what the target may be.
In one embodiment of the present invention, because a video frame contains rich content while the objects of interest may appear only in a specific area of the frame, key point detection may be performed only on targets within that specific area. Since the specific area is only part of the whole video frame, this can effectively improve detection efficiency. The specific area may be set in advance, or may be determined by analyzing the content of the video frames.
For example, in a classroom teaching scene a video frame contains both student-related and teacher-related content, and the positions where the two appear are relatively fixed; say, student-related content appears in the left half of the frame and teacher-related content in the right half. If the objects of interest are the students' hands, key point detection can be performed only on the left half of each video frame.
Because a video frame may contain noise or have low contrast, in an embodiment of the present invention the current video frame may be preprocessed before detecting the key points of the first type targets, so that the key points can be detected more reliably. For example, the preprocessing may be noise reduction of the current video frame with a median filtering algorithm, to suppress noise in the frame, or histogram equalization of the current video frame, to balance its contrast.
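One possible realization of this preprocessing is sketched below, assuming OpenCV; the kernel size and the choice to equalize only the luma channel are illustrative, not prescribed by the embodiment.

```python
import cv2

def preprocess_frame(frame_bgr):
    # Suppress noise with a median filter (kernel size 3 is an assumption).
    denoised = cv2.medianBlur(frame_bgr, 3)
    # Equalize contrast on the luma channel only, so colors are preserved.
    ycrcb = cv2.cvtColor(denoised, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
```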
Step 102: acquiring a key point sequence corresponding to the second type targets.
Wherein a second type target is a target in the video frame previous to the current video frame, and the key point sequence corresponding to each second type target is a sequence formed from the key points of that target in the previous video frame and in the video frames before the previous video frame, arranged by the playing order of the video frames and the arrangement order of the key points.
Specifically, the keypoint may be represented by coordinates of the keypoint in the video frame, and in this case, the sequence of keypoints may be considered as a sequence generated from the coordinates of the keypoint.
The arrangement order of the key points may follow the order of the parts the key points represent. For example, if the target is a hand and the key points represent the elbow, wrist, and palm, the key points may be arranged as [elbow, wrist, palm] or [palm, wrist, elbow].
The arrangement order of the key points may also be an order set for each key point after each key point is identified in the video frame in which the target corresponding to the key point appears for the first time. In this case, when the above-described respective key points in the subsequent video frames are added to the key point sequence, the key points are also added in the order set forth above.
The video frames before the previous video frame may be all earlier video frames that continuously contain the second type target, or a preset number of earlier video frames that continuously contain the second type target.
For example, suppose the video frames before the previous video frame are a preset number of consecutive frames containing the second type target, the current video frame is the 10th frame, and the preset number is 4. The key point sequence corresponding to the previous video frame is then:

{X7, Y7, Z7, X8, Y8, Z8, X9, Y9, Z9, X10, Y10, Z10}

where X, Y, and Z denote key points of the target and the subscripts 7, 8, 9, and 10 denote video frame numbers: X7 denotes key point X of the target in the 7th video frame, Y7 denotes key point Y of the target in the 7th video frame, Z7 denotes key point Z of the target in the 7th video frame, and the other entries are analogous.
The preset number may be determined by the actual application scene and/or the playing speed of the video. For example, in a classroom teaching scene where students' hand actions are recognized, a student takes about 1 second to complete a hand action; if the video plays at 25 frames per second, the preset number may be 25.
When the current video frame is the first frame, it has no previous video frame, so no key point sequence corresponding to a second type target is obtained.
When the current video frame is not the first frame of the video, a previous video frame exists. The actions of targets are detected frame by frame in playing order, so detection for the previous video frame has finished before detection for the current video frame begins. As described in step 101, detecting a target's action in a video frame involves detecting the target's key points in that frame, and key point detection on each frame produces the key point sequences of the targets in that frame. Therefore, by the time the targets in the current video frame are processed, the key point sequence corresponding to each target in the previous video frame has already been obtained, and the key point sequences of the second type targets can be read directly.
Step 103: generating a key point sequence corresponding to each first type target according to the detected key points and the key point sequences corresponding to the targets among the second type targets that belong to the first type targets.
If a target belongs to both the first type and the second type, it appears in both the current video frame and the previous video frame. Its key points in the current video frame can then be appended, in the agreed arrangement order, to the tail of the key point sequence of the corresponding second type target, producing the key point sequence of the first type target.
If a target belongs to the first type but not to the second type, it appears in the current video frame but not in the previous one; the key point sequence of the first type target is then generated directly from its key points in the current video frame.

If a target belongs to the second type but not to the first type, it appeared in the previous video frame but not in the current one; the target has disappeared, and the key point sequence of that second type target no longer participates in action recognition for subsequent video frames.
Step 104: performing action recognition on each first type target according to the key point sequence corresponding to each first type target.
The key point sequence of each first type target reflects not only the positions of its key points in the current video frame but also their positions in the video frames before it, so the sequence captures how the key points change over a period of time rather than their instantaneous distribution. Because an action is usually completed over a process rather than in an instant, this change pattern allows the action of each first type target in the current video frame to be recognized effectively.
In an embodiment of the present invention, after action recognition is performed on each first type target, the recognition result may include the probability, also called the confidence, that the action of each first type target is recognized as a preset action. For each first type target, when the confidence with which its action is recognized as a preset action is greater than or equal to a preset threshold, that preset action is taken as the action of the target. The preset actions are set in advance according to the usage scene.
For example, in a classroom teaching scene where students' hand actions are recognized, the preset actions may be raising a hand, scratching the head, and the like. The confidence with which a target's action is recognized as each preset action, together with the preset threshold, may be as shown in table 1 below.
TABLE 1

| Preset action    | Raising hand | Scratching head |
| Confidence       | 0.865        | 0.105           |
| Preset threshold | 0.5          | 0.5             |
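The thresholding rule behind Table 1 is simple enough to state in a few lines; the action names and values below are just the table's example.

```python
def select_actions(confidences, thresholds):
    """Keep each preset action whose confidence reaches its threshold."""
    return [action for action, conf in confidences.items()
            if conf >= thresholds[action]]

# Values from Table 1: only raising a hand clears the 0.5 threshold.
print(select_actions({"raising hand": 0.865, "scratching head": 0.105},
                     {"raising hand": 0.5, "scratching head": 0.5}))
# -> ['raising hand']
```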
It can be seen that applying the scheme of the above embodiment not only makes action recognition of targets in a video possible, but also takes the targets' key points in each video frame before the current one into account, so recognition draws on rich information and the recognized actions are more accurate.
In an embodiment of the present invention, in step 103 the key point sequence corresponding to each first type target can be obtained through the following steps A to D.
Step A: determining, according to the detected key points, the area in which each first type target is located in the current video frame, as the first type target area.
Specifically, for each first-class object, a horizontally-placed minimum bounding rectangular region including the key points of the first-class object may be used as the first-class object region.
In addition, after the minimum circumscribed rectangular region is obtained, the minimum circumscribed rectangular region may be expanded, and the expanded region may be used as the first-class target region.
Specifically, the minimum circumscribed rectangular region may be expanded in any manner that can enlarge the minimum circumscribed rectangular region, which is not limited in the present invention.
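A minimal sketch of such a region computation follows; the symmetric expansion by a fixed ratio is one assumed scheme among the many the embodiment allows.

```python
def target_region(keypoints, expand_ratio=0.0):
    """Axis-aligned minimum bounding rectangle of a target's key points,
    optionally enlarged; keypoints is a list of (x, y) coordinates."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    dx = (x_max - x_min) * expand_ratio / 2  # widen symmetrically
    dy = (y_max - y_min) * expand_ratio / 2
    return (x_min - dx, y_min - dy, x_max + dx, y_max + dy)
```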
Step B: obtaining the area in which each second type target is located in the previous video frame, as the second type target area.
For each second type of target, the second type target area is a minimum bounding rectangle area which comprises the key points of the second type of target and is horizontally placed.
In addition, the second type target area may be: and an area obtained by expanding the minimum circumscribed rectangular area.
In addition, the actions of targets in a video are generally detected in the playing order of the frames, so detection for the previous video frame finishes before detection for the current one begins. As seen in step A, detecting a target's action in a frame involves determining the area in which each target is located in that frame; therefore, when recognizing the actions of targets in the current video frame, the areas of the targets in the previous video frame have already been obtained, and the second type target areas can be read directly here.
Step C: calculating the similarity between each first type target area and each second type target area, and determining, according to the calculated similarities, the second type target that belongs to the same object as each first type target.
Specifically, the similarity between each first-class target area and each second-class target area is calculated, and when the similarity is greater than or equal to a preset threshold, it can be considered that the target corresponding to the second-class target area and the first-class target belong to the same target.
In an embodiment of the present invention, the similarity may be the intersection-over-union of a first type target area and a second type target area, that is, the ratio of the area of their intersection to the area of their union when both regions are placed in the same coordinate system. Since both regions are parts of video frames, that is, parts of images, the similarity may instead be the image similarity between the two regions. And since the key points in the two regions may be displaced relative to one another, which can be regarded as key point motion, the similarity may also be the motion similarity of the key points in the two regions, which may be obtained by measuring the speeds of the key points.
In addition to the above, the similarity may also be a motion similarity between each first type target area and each second type target area, and the like.
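For the intersection-over-union variant, a self-contained sketch, assuming regions are axis-aligned (x1, y1, x2, y2) rectangles in the same coordinate system:

```python
def iou(a, b):
    """Cross-over ratio: intersection area over union area of two regions."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```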
Step D: generating a key point sequence corresponding to each first type target according to the key point sequence corresponding to the determined target and the detected key points.
In an embodiment of the present invention, the detected key points may be directly added to the tail of the key point sequence corresponding to the determined target according to the arrangement order of the key points.
In another embodiment of the present invention, when generating the key point sequence for a first type target, the number of video frames contributing key points to the determined target's sequence may first be counted. If that number is smaller than a preset frame count, the detected key points are simply appended to the tail of the sequence in the arrangement order of the key points. If the number equals the preset frame count, the key points from the earliest video frame are deleted from the sequence, the remaining key points are shifted toward the head of the sequence until the first of them reaches the head, and the detected key points are then appended to the tail in the arrangement order of the key points.
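This fixed-length update can be written compactly, assuming the sequence is a flat list holding a fixed number of key points per frame:

```python
def extend_sequence(sequence, new_keypoints, kps_per_frame, max_frames):
    """Append the current frame's key points (already in the agreed
    arrangement order); once max_frames frames are stored, drop the key
    points of the earliest frame so the rest shift toward the head."""
    if len(sequence) // kps_per_frame >= max_frames:
        sequence = sequence[kps_per_frame:]
    return sequence + list(new_keypoints)
```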
In an embodiment of the present invention, when the number of video frames in which the determined target appears is smaller than the preset frame count, the target can be considered to have existed only briefly and to be insufficiently stable in the scene; in that case only the key point sequence is generated for the target, and action recognition for it is deferred.
In an embodiment of the present invention, referring to fig. 2, which is a schematic flow chart of another action recognition method, compared with the embodiment shown in fig. 1 the method further includes the following steps after step 104 performs action recognition on each first type target:
and 105, determining the target of which the recognition result represents the preset action in each first-class target as a third-class target.
The preset action is set in advance according to the usage scene. For example, in a classroom teaching scene where students' hand actions are recognized, the preset actions may be raising a hand, scratching the head, resting the chin on a hand, clapping, and the like. In a traffic monitoring scene where the driving behaviour of cars is recognized, the preset actions may be changing lanes, making a U-turn, and the like.
Step 106: obtaining the regions in which the third type targets are located in the current video frame and in the video frames before the current video frame, to obtain the region sequence corresponding to each third type target.
In an embodiment of the present invention, because the actions of targets are detected in the playing order of the frames, detection for the previous video frame has finished before detection for the current one begins. If step 106 was performed on the previous video frame, the region sequence of each third type target in the previous video frame has already been generated; the third type target region in the current video frame can then simply be appended to that sequence, yielding the region sequence corresponding to each third type target.
In another embodiment of the present invention, based on the key point sequence corresponding to each third type target, the region where the third type target is located may be determined in the current video frame and each video frame before the current video frame, so as to obtain the region sequence corresponding to each third type target.
In yet another embodiment of the present invention, the region sequence corresponding to each of the third type objects can be obtained by the following steps.
Step E: obtaining the minimum region in which each third type target is located in the current video frame and in each video frame before it, to obtain the minimum region sequence corresponding to each third type target.
Step F: performing region expansion on each region in the minimum region sequence corresponding to each third type target, to obtain the region sequence corresponding to each third type target.
Specifically, the minimum circumscribed rectangular region may be expanded in any manner that can enlarge the minimum circumscribed rectangular region, which is not limited in the present invention.
Expanding the minimum region corresponding to each third type target makes the information contained in the region richer, which improves the accuracy of action recognition.
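Steps E and F compose directly; the sketch below reuses the fixed-ratio expansion idea from the earlier region sketch, again as an illustrative assumption.

```python
def region_sequence(min_regions, expand_ratio=0.2):
    """Expand every region in a third type target's minimum-region
    sequence; min_regions is a list of (x1, y1, x2, y2) rectangles."""
    expanded = []
    for x1, y1, x2, y2 in min_regions:
        dx = (x2 - x1) * expand_ratio / 2
        dy = (y2 - y1) * expand_ratio / 2
        expanded.append((x1 - dx, y1 - dy, x2 + dx, y2 + dy))
    return expanded
```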
Step 107: performing action recognition on each third type target according to the region sequence corresponding to each third type target.
In an embodiment of the present invention, after the action of each third type target is recognized, the recognition result may include the probability, also called the confidence, that the action of each third type target is recognized as a preset action. For each third type target, when the confidence with which its action is recognized as a preset action is greater than or equal to a preset threshold, that preset action is taken as the action of the target. Here the preset action is the action that was identified from the key point sequence.
For example, suppose the action identified from the key point sequence is raising a hand. The confidence with which the target's action is recognized as the preset action, and the preset threshold, may be as shown in table 2 below.
TABLE 2

| Preset action        | Raising hand |
| Confidence           | 0.58         |
| Confidence threshold | 0.3          |
It can be seen that, when the scheme provided by this embodiment is applied to perform motion recognition, after performing motion recognition based on the key point sequence to obtain a third type target whose motion is a preset motion, further performing motion recognition based on a region sequence corresponding to the third type target. This can improve the accuracy of motion recognition of the target.
Machine learning can be applied in the above process of recognizing target actions; this is described below through three examples.
Example one
When detecting the keypoints of the first type of target in step 101, the current video frame may be input into a pre-trained keypoint detection model, and the keypoints of the target in the current video frame or in a preset region of the current video frame may be detected.
The key point detection model may be a model obtained by training the first neural network model in advance and used for identifying key points of targets in video frames. For example, the first neural network model may be a model implementing a key point detection network such as RTPose under Caffe (a convolutional neural network framework).
Specifically, key points of the targets in each video frame of a sample video may be labeled to obtain labeled key points. Each video frame of the sample video is then fed to the neural network model as input, and key point detection is performed on the targets in each frame or in a preset area of the frame. The detection results are supervised with the labeled key points of the targets in each frame, thereby training the neural network model and obtaining the key point detection model.
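The inference side of example one might look like the sketch below; keypoint_model stands in for whatever trained network is loaded (Caffe or otherwise), and the cropping logic and output format are assumptions, not the patent's specification.

```python
def detect_frame_keypoints(keypoint_model, frame, roi=None):
    """Run the pre-trained key point detection model on the current frame,
    optionally restricted to a preset region of interest.

    roi is (x1, y1, x2, y2) in pixels, or None for the full frame; the
    model is assumed to return one key point list per detected target."""
    if roi is None:
        return keypoint_model(frame)
    x1, y1, x2, y2 = roi
    crops = keypoint_model(frame[y1:y2, x1:x2])
    # Map key points from crop coordinates back into full-frame coordinates.
    return [[(x + x1, y + y1) for x, y in target] for target in crops]
```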
Example two
When the actions of the first type targets are recognized in step 104, the key point sequence corresponding to each first type target may be input into a pre-trained first action recognition model to recognize the action of each first type target.
Wherein the first action recognition model is a model for recognizing target actions, obtained by training a preset second neural network model with the key point sequences corresponding to targets in the video frames of a sample video and the labeled actions of those targets. For example, the second neural network model may be a network containing a plurality of convolutional layers implemented under Caffe.
Specifically, video frames of the sample video in which a preset action occurs may be selected first; the key point sequences of those frames are obtained and manually labeled to serve as positive samples. Video frames in which no preset action occurs are then selected; their key point sequences are obtained and manually labeled to serve as negative samples. The second neural network model is trained with the positive and negative samples to obtain the first action recognition model.
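The sample construction described here amounts to splitting labeled key point sequences into two classes; a sketch under the assumption that labels are per-frame booleans:

```python
def build_training_samples(frame_sequences, labels):
    """frame_sequences: key point sequence per sample-video frame;
    labels: parallel True/False flags for 'preset action occurs'."""
    positives = [s for s, hit in zip(frame_sequences, labels) if hit]
    negatives = [s for s, hit in zip(frame_sequences, labels) if not hit]
    return positives, negatives
```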
In an embodiment of the present invention, the output of the first action recognition model may include, for each first type target, the confidence with which its action is recognized as each preset action. For each first type target, when the confidence corresponding to a preset action is greater than or equal to a preset threshold, that preset action is taken as the action of the target. The preset actions are set in advance according to the usage scene.
Example three
When the actions of the third type targets are recognized in step 107, the region sequence corresponding to each third type target may be input into a pre-trained second action recognition model to recognize the actions of the third type targets.
The second action recognition model outputs, from the region sequence, a confidence for the recognized action of the third type target; when the confidence is greater than or equal to a preset threshold, the preset action corresponding to that confidence is taken as the action of the target. The preset threshold is set in advance according to the usage scene and may be, for example, 0.3, 0.4, or 0.5.
Wherein the second action recognition model is a model for recognizing target actions, obtained by training a preset neural network model with the region sequences corresponding to targets in the video frames of a sample video and the labeled actions of those targets. For example, the neural network model may be a model implementing an action recognition network such as Two-Stream under Caffe.
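Putting examples two and three together, the two models form a cascade. The sketch below assumes both models return an {action: confidence} mapping; the threshold values are the examples from tables 1 and 2.

```python
def cascade_recognize(seq_model, region_model, keypoint_seq, region_seq,
                      threshold_stage1=0.5, threshold_stage2=0.3):
    """Stage 1 recognizes from the key point sequence; only an action that
    clears its stage-1 threshold is re-checked in stage 2 on the region
    sequence, which confirms or rejects it."""
    stage1 = seq_model(keypoint_seq)
    if not stage1:
        return None
    action = max(stage1, key=stage1.get)
    if stage1[action] < threshold_stage1:
        return None  # no preset action recognized in stage 1
    conf2 = region_model(region_seq).get(action, 0.0)
    return action if conf2 >= threshold_stage2 else None
```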
Corresponding to the motion recognition method, the embodiment of the invention also provides a motion recognition device.
Fig. 3 is a diagram of an action recognition device according to an embodiment of the present invention, where the action recognition device includes:
a key point detecting module 301, configured to detect a key point of a first type of target, where the first type of target is: a target in a current video frame;
a key point sequence obtaining module 302, configured to obtain a key point sequence corresponding to each second type target, where a second type target is a target in the video frame previous to the current video frame, and the key point sequence corresponding to each second type target is a sequence formed from the key points of that target in the previous video frame and in the video frames before the previous video frame, arranged by the playing order of the video frames and the arrangement order of the key points;
a sequence generating module 303, configured to generate a key point sequence corresponding to each first type target according to the detected key points and the key point sequences corresponding to the targets among the second type targets that belong to the first type targets;
the first action recognition module 304 is configured to perform action recognition on each first-class object according to the key point sequence corresponding to each first-class object.
In an embodiment of the present invention, the sequence generating module 303 is specifically configured to:
determining the area of each first type target in the current video frame as a first type target area according to the detected key points;
obtaining the area of each second type target in the previous video frame as a second type target area;
calculating the similarity between each first type target area and each second type target area, and determining a second type target which belongs to the same target with each first type target according to the calculated similarity;
and generating a key point sequence corresponding to each first class of target according to the key point sequence corresponding to the determined target and the detected key points.
In an embodiment of the invention, the first action recognition module 304 is specifically configured to:
inputting the key point sequence corresponding to each first type target into a pre-trained action recognition model to recognize the action of each first type target, wherein the action recognition model is a model for recognizing target actions, obtained by training a preset neural network model with the key point sequences corresponding to targets in the video frames of a sample video and the labeled actions of those targets.
It can be seen that applying the device provided by the above embodiments not only enables action recognition of targets in a video, but also takes the targets' key points in each video frame before the current one into account, so recognition draws on rich information and the recognized actions are more accurate.
Referring to fig. 4, in an embodiment of the present invention, compared with the embodiment shown in fig. 3, the apparatus further includes the following modules for processing after the first action recognition module 304 performs action recognition on each first type target:
a target determination module 305, configured to determine, among the first type targets, the targets whose recognition results represent a preset action as third type targets;
a region sequence obtaining module 306, configured to obtain regions of each third type object in the current video frame and each video frame before the current video frame, respectively, to obtain a region sequence corresponding to each third type object;
and a second action recognition module 307, configured to perform action recognition on each third type of object according to the region sequence corresponding to each third type of object.
In an embodiment of the present invention, the region sequence obtaining module 306 is specifically configured to:
obtaining the minimum area of each third type target in each video frame before the current video frame and the current video frame respectively to obtain the minimum area sequence corresponding to each third type target;
and performing region expansion on each region in the minimum region sequence corresponding to each third type of target to obtain a region sequence corresponding to each third type of target.
Expanding the minimum region corresponding to each third type target makes the information contained in the region richer, which improves the accuracy of action recognition.
As can be seen, when the device provided in this embodiment is used to perform motion recognition, after performing motion recognition based on the key point sequence to obtain a third type target whose motion is a preset motion, further performing motion recognition based on the region sequence corresponding to the third type target. This can improve the accuracy of motion recognition of the target.
An embodiment of the present invention further provides an electronic device, as shown in fig. 5, which includes a processor 501, a communication interface 502, a memory 503 and a communication bus 504, where the processor 501, the communication interface 502 and the memory 503 complete mutual communication through the communication bus 504,
a memory 503 for storing a computer program;
the processor 501 is configured to implement the motion recognition method according to the embodiment of the present invention when executing the program stored in the memory 503.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program realizes the steps of any one of the above-mentioned motion recognition methods when executed by a processor.
In yet another embodiment, a computer program product containing instructions is provided, which, when run on a computer, causes the computer to perform any of the above-described action recognition methods.
It can be seen that when action recognition is performed by the electronic device provided by the above embodiment, by executing a computer program stored in the computer-readable storage medium provided by the above embodiment, or by running the computer program product provided by the above embodiment on a computer, the key points of the first type targets are detected and the key point sequences corresponding to the second type targets are obtained; a key point sequence is generated for each first type target from the detected key points and the key point sequences of the second type targets belonging to the same objects; and action recognition is performed on each first type target according to its key point sequence.
Therefore, the scheme provided by the embodiment of the invention can be used for identifying the action of the target in the video.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one web site, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), a semiconductor medium (e.g., Solid State Disk (SSD)), or the like.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, apparatus embodiments, electronic device embodiments, computer-readable storage medium embodiments, and computer program product embodiments are described with relative simplicity as they are substantially similar to method embodiments, where relevant only as described in portions of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (12)
1. A method of motion recognition, the method comprising:
detecting key points of a first type of target, wherein the first type of target is a target in a current video frame;
obtaining a key point sequence corresponding to a second type of target, wherein the second type of target is a target in the previous video frame of the current video frame, and the key point sequence corresponding to each second type target is a sequence formed by the key points of that target in the previous video frame and its key points in the video frames before the previous video frame, arranged according to the playing order of the video frames and the arrangement order of the key points;
generating a key point sequence corresponding to each first type target according to the detected key points and the key point sequences corresponding to the second type targets that belong to the same targets as the first type targets;
and performing motion recognition on each first type target according to the key point sequence corresponding to each first type target.
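For concreteness, a key point sequence as defined in claim 1 could be laid out as sketched below; the 17-point skeleton is an assumed example, since the claim fixes only that some arrangement order of key points is used, not which one.

```python
NUM_KEYPOINTS = 17  # assumed skeleton size; the claim does not fix one

# sequence[t][k] is the (x, y) position of key point k in the t-th
# frame: frames follow the playing order of the video and, within a
# frame, key points follow a fixed arrangement order. The coordinate
# values here are illustrative placeholders only.
sequence = [
    [(412.0, 103.5)] * NUM_KEYPOINTS,  # earlier video frames ...
    [(414.7, 104.2)] * NUM_KEYPOINTS,  # ... up to the current frame
]
```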
2. The method according to claim 1, wherein the generating a key point sequence corresponding to each first type target according to the detected key points and the key point sequences corresponding to the second type targets that belong to the same targets as the first type targets comprises:
determining, according to the detected key points, the region where each first type target is located in the current video frame as a first type target region;
obtaining the region where each second type target is located in the previous video frame as a second type target region;
calculating the similarity between each first type target region and each second type target region, and determining, according to the calculated similarities, the second type target that belongs to the same target as each first type target;
and generating the key point sequence corresponding to each first type target according to the key point sequence corresponding to the determined target and the detected key points.
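Claim 2 leaves the similarity measure between target regions open. A sketch under the assumption that intersection-over-union (IoU) of bounding regions is used, with greedy matching, might look like this; the 0.3 threshold is likewise an assumed value:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).
    IoU is only an assumed similarity measure; claim 2 does not fix one."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_targets(first_regions, second_regions, threshold=0.3):
    """Greedily pair each first type target region with the most similar
    second type target region; the threshold is an assumed value."""
    matches = {}
    for i, fr in first_regions.items():
        best_id, best_sim = None, threshold
        for j, sr in second_regions.items():
            sim = iou(fr, sr)
            if sim > best_sim:
                best_id, best_sim = j, sim
        matches[i] = best_id  # None => a newly appearing target
    return matches
```

A first type target with no match above the threshold can be treated as newly appearing, starting a fresh key point sequence.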
3. The method according to claim 1 or 2, wherein the performing motion recognition on each first type target according to the key point sequence corresponding to each first type target comprises:
inputting the key point sequence corresponding to each first type target into a pre-trained motion recognition model to recognize the motion of each first type target, wherein the motion recognition model is a model for recognizing the motions of targets, obtained by training a preset neural network model with the key point sequences corresponding to the targets in the video frames of a sample video and the labeled actions of those targets.
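Claim 3 does not fix the structure of the "preset neural network model". Purely as an assumption, a minimal PyTorch sketch of training such a model on labeled key point sequences could be:

```python
import torch
import torch.nn as nn

class ActionRecognizer(nn.Module):
    """A minimal stand-in for the 'preset neural network model' of
    claim 3. The LSTM-over-key-points architecture is an assumption;
    the claim does not fix a network structure."""
    def __init__(self, num_keypoints=17, hidden=128, num_actions=10):
        super().__init__()
        self.lstm = nn.LSTM(num_keypoints * 2, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, seqs):  # seqs: (batch, frames, num_keypoints * 2)
        _, (h, _) = self.lstm(seqs)
        return self.head(h[-1])  # one action score vector per sequence

# One training step on key point sequences and their labeled actions.
model = ActionRecognizer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

seqs = torch.randn(8, 30, 34)        # stand-in sample sequences
labels = torch.randint(0, 10, (8,))  # stand-in labeled actions
loss = loss_fn(model(seqs), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```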
4. The method according to claim 1 or 2, wherein after the performing motion recognition on each first type target according to the key point sequence corresponding to each first type target, the method further comprises:
determining, among the first type targets, each target whose recognition result represents a preset action as a third type target;
obtaining the regions where each third type target is located in the current video frame and in the video frames before the current video frame, to obtain a region sequence corresponding to each third type target;
and performing motion recognition on each third type target according to the region sequence corresponding to each third type target.
5. The method according to claim 4, wherein the obtaining the regions where each third type target is located in the current video frame and in the video frames before the current video frame, to obtain the region sequence corresponding to each third type target, comprises:
obtaining the minimum region where each third type target is located in the current video frame and in each video frame before the current video frame, to obtain a minimum region sequence corresponding to each third type target;
and performing region expansion on each region in the minimum region sequence corresponding to each third type target, to obtain the region sequence corresponding to each third type target.
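Claims 4 and 5 refine the recognition of targets whose first-stage result indicates a preset action, using expanded minimum regions. A minimal sketch follows; the 20% expansion ratio and the 1920x1080 frame size are assumed values, not fixed by the claims:

```python
def min_region(keypoints):
    """Tightest axis-aligned box enclosing a target's key points."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    return (min(xs), min(ys), max(xs), max(ys))

def expand_region(box, ratio=0.2, width=1920, height=1080):
    """Grow the minimum region on all sides (claim 5); the ratio and
    frame size are assumed values, clipped to the frame bounds."""
    x1, y1, x2, y2 = box
    dx, dy = (x2 - x1) * ratio, (y2 - y1) * ratio
    return (max(0.0, x1 - dx), max(0.0, y1 - dy),
            min(float(width), x2 + dx), min(float(height), y2 + dy))

# Region sequence for one third type target across frames (claims 4-5);
# the key point coordinates are illustrative placeholders.
keypoint_track = [
    [(410.0, 100.0), (430.0, 140.0), (420.0, 180.0)],
    [(412.0, 101.0), (433.0, 141.0), (422.0, 182.0)],
]
region_sequence = [expand_region(min_region(kps)) for kps in keypoint_track]
```

The resulting region sequence can then feed the second-stage recognition of claim 4, for example by cropping those regions from the corresponding frames.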
6. An action recognition device, characterized in that the device comprises:
a key point detection module, configured to detect key points of a first type of target, wherein the first type of target is a target in a current video frame;
a key point sequence obtaining module, configured to obtain a key point sequence corresponding to a second type of target, wherein the second type of target is a target in the previous video frame of the current video frame, and the key point sequence corresponding to each second type target is a sequence formed by the key points of that target in the previous video frame and its key points in the video frames before the previous video frame, arranged according to the playing order of the video frames and the arrangement order of the key points;
a sequence generation module, configured to generate a key point sequence corresponding to each first type target according to the detected key points and the key point sequences corresponding to the second type targets that belong to the same targets as the first type targets;
and a first motion recognition module, configured to perform motion recognition on each first type target according to the key point sequence corresponding to each first type target.
7. The device of claim 6, wherein the sequence generation module is specifically configured to:
determine, according to the detected key points, the region where each first type target is located in the current video frame as a first type target region;
obtain the region where each second type target is located in the previous video frame as a second type target region;
calculate the similarity between each first type target region and each second type target region, and determine, according to the calculated similarities, the second type target that belongs to the same target as each first type target;
and generate the key point sequence corresponding to each first type target according to the key point sequence corresponding to the determined target and the detected key points.
8. The device according to claim 6 or 7, wherein the first motion recognition module is specifically configured to:
input the key point sequence corresponding to each first type target into a pre-trained motion recognition model to recognize the motion of each first type target, wherein the motion recognition model is a model for recognizing the motions of targets, obtained by training a preset neural network model with the key point sequences corresponding to the targets in the video frames of a sample video and the labeled actions of those targets.
9. The device according to claim 6 or 7, further comprising:
a target determining module, configured to determine, after the first motion recognition module has performed motion recognition on each first type target, each first type target whose recognition result represents a preset action as a third type target;
a region sequence obtaining module, configured to obtain the regions where each third type target is located in the current video frame and in the video frames before the current video frame, to obtain a region sequence corresponding to each third type target;
and a second motion recognition module, configured to perform motion recognition on each third type target according to the region sequence corresponding to each third type target.
10. The device according to claim 9, wherein the region sequence obtaining module is specifically configured to:
obtain the minimum region where each third type target is located in the current video frame and in each video frame before the current video frame, to obtain a minimum region sequence corresponding to each third type target;
and perform region expansion on each region in the minimum region sequence corresponding to each third type target, to obtain the region sequence corresponding to each third type target.
11. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
and the processor is configured to implement the method steps of any one of claims 1-5 when executing the program stored in the memory.
12. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, implements the method steps of any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910807508.5A CN112446240A (en) | 2019-08-29 | 2019-08-29 | Action recognition method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112446240A (en) | 2021-03-05 |
Family
ID=74740809
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910807508.5A Pending CN112446240A (en) | 2019-08-29 | 2019-08-29 | Action recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112446240A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10096125B1 (en) * | 2017-04-07 | 2018-10-09 | Adobe Systems Incorporated | Forecasting multiple poses based on a graphical image |
CN107180226A (en) * | 2017-04-28 | 2017-09-19 | 华南理工大学 | Dynamic gesture recognition method based on a combined neural network |
CN108803874A (en) * | 2018-05-30 | 2018-11-13 | 广东省智能制造研究所 | Human-computer behavior interaction method based on machine vision |
CN109325412A (en) * | 2018-08-17 | 2019-02-12 | 平安科技(深圳)有限公司 | Pedestrian recognition method and device, computer equipment, and storage medium |
CN110070029A (en) * | 2019-04-17 | 2019-07-30 | 北京易达图灵科技有限公司 | Gait recognition method and device |
CN110134241A (en) * | 2019-05-16 | 2019-08-16 | 珠海华园信息技术有限公司 | Dynamic gesture interaction method based on a monocular camera |
Non-Patent Citations (1)
Title |
---|
Yin Jianqin et al., "Human Action Recognition Based on Key Point Sequences" (基于关键点序列的人体动作识别), Robot (机器人), no. 02 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||