CN110135246B - Human body action recognition method and device - Google Patents

Human body action recognition method and device

Info

Publication number
CN110135246B
CN110135246B (application number CN201910264883.XA)
Authority
CN
China
Prior art keywords
action
key
human body
candidate
video image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910264883.XA
Other languages
Chinese (zh)
Other versions
CN110135246A (en)
Inventor
叶明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910264883.XA priority Critical patent/CN110135246B/en
Publication of CN110135246A publication Critical patent/CN110135246A/en
Priority to PCT/CN2019/103161 priority patent/WO2020199479A1/en
Application granted granted Critical
Publication of CN110135246B publication Critical patent/CN110135246B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training

Abstract

The invention is applicable to the technical field of image recognition, and provides a human body action recognition method and device. The method comprises the following steps: acquiring a video file of a target object; parsing each video image frame, extracting the human body region image of the target object from the video image frame, and determining the interactable objects contained in the video image frame; marking each key part from a preset human body key part list in the human body region image, and acquiring the feature coordinates of each key part; generating a key feature sequence from the feature coordinates of each key part across the video image frames; determining candidate actions of the target object through the key feature sequences of the key parts; and calculating the matching degree between each candidate action and the interactable objects, and determining the action type of the target object according to the matching degree. Because the method uses the interaction actions to determine whether the target user performs an interaction behavior, several similar postures can be distinguished, which further improves the accuracy of action recognition.

Description

Human body action recognition method and device
Technical Field
The invention belongs to the technical field of image recognition, and particularly relates to a human body action recognition method and device.
Background
With the continuous development of image recognition technology, computers can automatically recognize more and more information from image files and video files, for example determining the type of human body action performed by a user in a picture, and performing operations such as object tracking and object behavior analysis based on the recognized action information. The accuracy and recognition rate of the image recognition step therefore directly influence the effectiveness of the subsequent processing. Existing human body action recognition technology generally relies on a convolutional neural network for recognition; however, this approach needs to carry out time-sequence recursion operations many times with the help of optical flow information, so the recognition speed is slow and the accuracy is low. In particular, for some similar posture behaviors, such as sitting and squatting, the accuracy of action recognition is reduced further, because the human postures are similar and cannot be accurately distinguished by a convolutional neural network.
Disclosure of Invention
In view of the above, embodiments of the present invention provide a human body action recognition method and device, so as to solve the problems of the existing human body action recognition methods, namely slow recognition and low accuracy, in particular for similar posture behaviors such as sitting and squatting, where the accuracy of action recognition is further reduced because the human postures are similar and cannot be accurately distinguished by a convolutional neural network.
A first aspect of an embodiment of the present invention provides a method for identifying a human motion, including:
acquiring a video file of a target object; the video file includes a plurality of video image frames;
analyzing each video image frame, extracting a human body area image related to the target object in the video image frame, and determining an interactable object contained in the video image frame;
marking each key part in a preset key part list of the human body in the human body area image, and acquiring feature coordinates of each key part;
generating a key feature sequence related to the key part according to the feature coordinates of the key part corresponding to each video image frame;
determining at least one candidate action of the target object through the key feature sequences of the key parts;
and respectively calculating the matching degree between each candidate action and the interactable object, and determining the action type of the target object from the candidate actions according to the matching degree.
A second aspect of an embodiment of the present invention provides an apparatus for recognizing a human motion, including:
A video file acquisition unit for acquiring a video file of a target object; the video file includes a plurality of video image frames;
a human body region image extraction unit, configured to parse each video image frame, extract a human body region image related to the target object in the video image frame, and determine an interactable object contained in the video image frame;
the key part identification unit is used for marking each key part in a preset human body key part list in the human body area image and acquiring the characteristic coordinates of each key part;
a key feature sequence generating unit, configured to generate a key feature sequence related to the key location according to the feature coordinates corresponding to the key location in each video image frame;
a candidate action recognition unit, configured to determine at least one candidate action of the target object through the key feature sequences of the respective key parts;
and the action type recognition unit is used for respectively calculating the matching degree between each candidate action and the interactable object and determining the action type of the target object from the candidate actions according to the matching degree.
A third aspect of the embodiments of the present invention provides a terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the first aspect when executing the computer program.
A fourth aspect of the embodiments of the present invention provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the first aspect.
The human body action recognition method and the human body action recognition device provided by the embodiment of the invention have the following beneficial effects:
according to the embodiment of the invention, the video file of the target user needing to be subjected to action analysis is obtained, each video image frame of the video file is analyzed, the human body region image contained in each video image frame is determined, the interactable object which can have interaction action with the target user in the video image frame is identified, each key part is marked in the human body region image, the change condition of each part of the target object is determined according to the feature coordinates of each key part, so that the candidate action of the target object is determined, the candidate actions with similar multiple postures are further screened according to the matching degree between the candidate actions and the interactable object, the action type of the target object is determined, and the human body action of the target object is automatically identified. Compared with the existing human body motion recognition technology, the method and the device have the advantages that the motion type of the video image is not required to be recognized by the aid of the neural network, recognition time delay caused by time sequence recursion is avoided, recognition efficiency is improved, on the other hand, the terminal equipment can determine the interactive object in the video image frame, and whether the interactive behavior exists in the target user or not is determined by means of the interactive motion, so that multiple approximate gestures can be distinguished, and accuracy of motion recognition is further improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained from these drawings by a person skilled in the art without inventive effort.
Fig. 1 is a flowchart of an implementation of a method for identifying human actions according to a first embodiment of the present invention;
fig. 2 is a flowchart of a specific implementation of step S106 of the human body action recognition method according to a second embodiment of the present invention;
fig. 3 is a flowchart of a specific implementation of step S104 of the human body action recognition method according to a third embodiment of the present invention;
fig. 4 is a flowchart of a specific implementation of step S102 of the human body action recognition method according to a fourth embodiment of the present invention;
fig. 5 is a flowchart of a specific implementation of step S105 of the human body action recognition method according to a fifth embodiment of the present invention;
FIG. 6 is a block diagram of a human motion recognition device according to an embodiment of the present invention;
fig. 7 is a schematic diagram of a terminal device according to another embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
According to the embodiment of the invention, the video file of the target user whose actions need to be analyzed is obtained, each video image frame of the video file is parsed, the human body region image contained in each video image frame is determined, and the interactable objects that may interact with the target user in the video image frame are identified. Each key part is marked in the human body region image, and the change of each part of the target object is determined according to the feature coordinates of the key parts, so that candidate actions of the target object are determined. The candidate actions with similar postures are then further screened according to the matching degree between the candidate actions and the interactable objects, the action type of the target object is determined, and the human body action of the target object is automatically recognized. This solves the problems of the existing human body action recognition methods, namely slow recognition and low accuracy, in particular for similar posture behaviors such as sitting and squatting, where the accuracy of action recognition is further reduced because the human postures are similar and cannot be accurately distinguished by a convolutional neural network.
In the embodiment of the present invention, the execution subject of the flow is a terminal device. The terminal device includes, but is not limited to: and the server, the computer, the smart phone, the tablet personal computer and other equipment capable of executing the identification operation of human body actions. Fig. 1 shows a flowchart of an implementation of a method for identifying a human motion according to a first embodiment of the present invention, which is described in detail below:
in S101, a video file of a target object is acquired; the video file includes a plurality of video image frames.
In this embodiment, the administrator may designate a video file containing a target object as the target video file, in which case the terminal device downloads the video file about the target object from the video database according to the file identification of the target video file, and identifies the action behavior of the target object. Preferably, the terminal device is a video monitoring device, and obtains a video file in a current scene; in this case, the terminal device recognizes each object captured in the current scene as a target object, configures an object number for each object based on face images of different captured objects, determines an action type of each monitored object in real time according to a video file generated in the monitoring process, and if the action type of a certain target object is detected to be in an abnormal action list, generates warning information to inform the monitored object executing the abnormal action to stop the abnormal action, thereby achieving the purpose of warning the abnormal action of the monitored object in real time.
Alternatively, the user may transmit face information of the target object to the terminal device. The terminal device then performs a face search over the video files in the video database based on this face information, and takes the video files containing the face information as target video files. The specific search operation may be: the terminal device identifies candidate faces in each video image frame of each video file in the video database, extracts face feature values of the key regions of the candidate faces, and matches the face feature values of each candidate face with the face information of the target face; if the matching degree between the two is greater than a preset matching threshold, it indicates that the candidate face and the target face correspond to the same real person, and the video file is identified as containing a face image of the target object.
In this embodiment, the video file includes a plurality of video image frames, each video image frame corresponds to a frame number, and each video image frame is arranged and packaged based on a positive sequence of the frame numbers to generate the video file. The frame number may be determined based on the time the video image frame was played in the video file.
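As a concrete illustration of this frame numbering, the sketch below decodes a video file into numbered frames; the use of OpenCV (cv2) and the function name are assumptions for illustration, not part of the patent.

```python
import cv2

def load_video_frames(video_path):
    """Decode a video file into (frame_number, image) pairs; frame numbers
    follow the playback order, matching the positive-sequence arrangement
    described above."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    frame_number = 0
    while True:
        ok, image = capture.read()
        if not ok:
            break  # end of file
        frames.append((frame_number, image))
        frame_number += 1
    capture.release()
    return frames
```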
In S102, each of the video image frames is parsed, a human body region image of the video image frame with respect to the target object is extracted, and an interactable object contained in the video image frame is determined.
In this embodiment, the terminal device parses the video file, performs human body recognition on each video image frame in the video file, and extracts the human body region image of each video image frame with respect to the target object. The human body region image may be extracted as follows: the terminal device judges, through a face recognition algorithm, whether the video image frame contains a face region image; if not, the frame is determined not to contain a human body region image of the target object. Otherwise, if the video image frame contains a face image, contour recognition is performed on the region around the coordinates of the face image, the human body region image corresponding to the face image is extracted based on the recognized contour information, and the face image is matched against the face template of the target object to judge whether the extracted region is the human body region image of the target object.
Optionally, if the number of the target objects is multiple, that is, the behaviors of the multiple objects need to be monitored, after determining the human body area image of the human face image contained in the video image frame, the terminal device matches the human face image with the human face templates of the target objects, so as to determine the target objects corresponding to the human face image, mark the object identifiers of the associated target objects on the human body area image, and then can quickly determine the human body area image corresponding to each target object in the video image frame, thereby facilitating the motion tracking of multiple objects.
Optionally, in this embodiment, the terminal device may obtain, according to the object identifier of the target object, an object human body template associated with the object identifier. The object human body template can represent the human body characteristics of the target object, such as body shape information, gender information and/or hairstyle information. The terminal device may perform sliding framing in the video image frame according to the object human body template and calculate the matching degree between the framed candidate region and the object human body template; if the matching degree is greater than a preset matching threshold, the candidate region is identified as a human body region image of the target object. Otherwise, if the matching degree between the two is smaller than or equal to the matching threshold, the candidate region is identified as not being a human body region image of the target object, and the sliding framing continues. If none of the candidate regions in the video image frame contains a human body region image, the above operation is repeated on the next video image frame to identify the human body region image of the target object.
In this embodiment, in addition to the human body region image of the target object, the terminal device may extract from the image the interactable objects that can interact with the user. The specific identification may be as follows: contour information contained in the video image frame is determined through a contour recognition algorithm, the subject type of each shooting subject is determined based on the contour information, and the interactable objects are determined according to the subject types. Since the contour characteristics of different types of subjects differ, the subject type of a shooting subject can be determined by identifying its contour information, and the shooting subjects that can interact with the target object are selected as interactable objects according to their subject types. For example, shooting subjects such as a chair, a table or a knife may interact with the target object, whereas shooting subjects such as a cloud or the sun have a low probability of interacting with the target object. Thus, by identifying the subject type, a large portion of the invalid interactable objects can be filtered out.
Optionally, after identifying the shooting subjects, the terminal device calculates the distance value between each shooting subject and the human body region image, and selects the shooting subjects whose distance value is smaller than a preset threshold as interactable objects. Preferably, the terminal device may select the shooting subjects whose contour boundary is adjacent to the human body region image as interactable objects: when the target object interacts with an object, the two are in contact with each other, so the contour boundary of the interactable object is adjacent to the target user.
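A minimal sketch of this distance-based filtering of shooting subjects, assuming axis-aligned bounding boxes and a pixel distance threshold (both illustrative assumptions):

```python
def select_interactable_objects(human_box, subject_boxes, distance_threshold):
    """Keep the shooting subjects whose bounding-box centre lies close enough
    to the human body region to plausibly be interacted with.  Boxes are
    (x_min, y_min, x_max, y_max) in image coordinates; the threshold is in
    pixels and would be tuned per scene."""
    def centre(box):
        x_min, y_min, x_max, y_max = box
        return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0

    hx, hy = centre(human_box)
    interactable = []
    for box in subject_boxes:
        sx, sy = centre(box)
        distance = ((sx - hx) ** 2 + (sy - hy) ** 2) ** 0.5  # Euclidean distance
        if distance < distance_threshold:
            interactable.append(box)
    return interactable
```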
In S103, each key part in a preset key part list of the human body is marked in the human body area image, and feature coordinates of each key part are obtained.
In this embodiment, the terminal device stores a human body key part list, which includes a plurality of human body key parts. Preferably, the list includes 17 key parts: the nose, eyes, ears, shoulders, wrists, hands, waist, knees and feet. By positioning a plurality of key parts of the human body and tracking their movement changes, the accuracy of human body action recognition can be improved.
In this embodiment, the terminal device marks each key part in the image of the human body area in the specific marking manner: based on the contour information of the human body region image, determining the current gesture type of the target object, wherein the gesture type is specifically: standing type, walking type, lying type, sitting type, etc., and then marking each key position on the human body region image according to the corresponding relation between different key positions and gesture type. Optionally, the correspondence records a distance value and a relative direction vector of the key part and a contour center point of the human body region image, and the terminal device may locate each key part based on the distance value and the relative direction vector and perform the marking operation.
In this embodiment, the terminal device establishes an image coordinate axis based on the video image frame, and determines feature coordinates of each key location according to the location of each key location on the video image frame. Optionally, the terminal device may use the end point of the lower left corner of the video image frame as the origin of coordinates, or may use the center point of the image as the origin of coordinates, which is determined according to the default settings of the administrator or the device.
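For illustration, the key part list and the coordinate-origin conversion might be represented as in the sketch below; the exact part names and the choice of a lower-left origin are assumptions drawn from the description above.

```python
# Illustrative key part list mirroring the 17 parts named above, with the
# paired parts expanded into left/right instances; the naming is an assumption.
KEY_PARTS = [
    "nose",
    "left_eye", "right_eye",
    "left_ear", "right_ear",
    "left_shoulder", "right_shoulder",
    "left_wrist", "right_wrist",
    "left_hand", "right_hand",
    "left_waist", "right_waist",
    "left_knee", "right_knee",
    "left_foot", "right_foot",
]

def to_feature_coordinates(pixel_point, image_height):
    """Convert an (x, y) pixel location with the usual top-left image origin
    into feature coordinates whose origin is the lower-left corner of the
    frame, one of the two origin choices mentioned above."""
    x, y = pixel_point
    return (x, image_height - y)
```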
In S104, a key feature sequence regarding the key location is generated according to the feature coordinates of the key location corresponding to each of the video image frames.
In this embodiment, the terminal device needs to determine the motion trajectory of each key part. Therefore, based on the part identifier of the key part, the terminal device extracts the feature coordinates corresponding to that identifier from each video image frame, encapsulates all the feature coordinates related to the key part, and generates the key feature sequence of the key part. The order of the elements in the key feature sequence is consistent with the frame numbers of the video image frames, that is, the elements of the key feature sequence have a time-sequence relationship, so that how the key part changes over time can be determined from the key feature sequence.
Optionally, if a key part is occluded in some video image frames and therefore has no corresponding feature coordinates, the terminal device may establish a feature curve of the key part on a preset coordinate axis according to the frame numbers of the video image frames, connect the available feature coordinates in frame-number order, and fill in the feature coordinates of the missing video image frames through a smoothing algorithm.
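A sketch of assembling a key feature sequence and filling occluded frames, assuming NumPy and using linear interpolation as a stand-in for the unspecified smoothing algorithm:

```python
import numpy as np

def build_key_feature_sequence(coords_by_frame):
    """coords_by_frame: list of (x, y) tuples or None, ordered by frame
    number; None marks frames in which the key part was occluded.  Missing
    entries are filled by linear interpolation over frame numbers, a simple
    stand-in for the smoothing step described above."""
    frames = np.arange(len(coords_by_frame))
    known = [i for i, c in enumerate(coords_by_frame) if c is not None]
    if not known:
        raise ValueError("the key part is occluded in every frame")
    xs = np.interp(frames, known, [coords_by_frame[i][0] for i in known])
    ys = np.interp(frames, known, [coords_by_frame[i][1] for i in known])
    return list(zip(frames.tolist(), xs.tolist(), ys.tolist()))
```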
In S105, at least one candidate action of the target object is determined by the key feature sequence of each key part.
In this embodiment, according to the key feature sequences of the plurality of key parts, the terminal device may determine the motion trajectories of the different key parts, and then take the action types that conform to these motion trajectories as candidate actions. Specifically, the terminal device may determine the movement direction of each key part from its key feature sequence, match the movement directions of the key parts one by one against the key-part movement directions of each action template in an action type library, and select, based on the number of matched key parts, the action templates whose number of matched key parts is greater than a preset matching threshold as candidate actions of the target object.
Optionally, the terminal device may be provided with a maximum frame number, and then the terminal device divides the key feature sequence of the key part into a plurality of feature subsequences based on the maximum frame number, and determines action types of different feature subsequences respectively.
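The direction-matching step described above might look like the following sketch; the coarse four-direction labels and the template format are assumptions for illustration.

```python
def movement_direction(sequence):
    """Summarise a key feature sequence, a list of (frame, x, y) tuples, as a
    coarse direction label based on the net displacement from the first to
    the last frame."""
    _, x0, y0 = sequence[0]
    _, x1, y1 = sequence[-1]
    dx, dy = x1 - x0, y1 - y0
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "up" if dy > 0 else "down"

def candidate_actions(part_sequences, action_templates, min_matched_parts):
    """part_sequences: {part_name: key feature sequence};
    action_templates: {action_name: {part_name: expected direction}}.
    An action template becomes a candidate action when more than
    min_matched_parts of its key parts move in the expected direction."""
    directions = {part: movement_direction(seq)
                  for part, seq in part_sequences.items()}
    candidates = []
    for action, expected in action_templates.items():
        matched = sum(1 for part, direction in expected.items()
                      if directions.get(part) == direction)
        if matched > min_matched_parts:
            candidates.append(action)
    return candidates
```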
In S106, a matching degree between each candidate action and the interactable object is calculated, and an action type of the target object is determined from the candidate actions according to the matching degree.
In this embodiment, the terminal device may obtain an interaction behavior list of the interactable object, detect similarities between the candidate actions and respective interaction behaviors in the interaction behavior list, select the maximum similarity as a matching degree between the candidate actions and the interactable object, and then determine an action type of the target object according to the matching degree of the respective candidate actions. It should be noted that, the identified action types may be multiple, for example, the user may cut the fruit with the fruit knife while holding the fruit, that is, the method includes two interactive actions of "holding" and "cutting", so the number of action types finally identified by the terminal device may be multiple. Based on the above, the terminal device may select the candidate action with the matching degree larger than the preset matching threshold as the action type currently executed by the target object.
For another example, the terminal device may determine the action type of the video file obtained by video monitoring, specifically, the video file may be a video file related to a security check area, determine the interaction behavior of personnel in the security check area, and detect whether there is an abnormal behavior of the user. The method comprises the steps of locating a target object to be identified in a video monitoring file, judging action types between the target object and each interactable object, wherein the interactable object can be a suitcase or a certificate to be authenticated, judging whether a user submits the suitcase to carry out security check operation according to a rule or takes dangerous goods from the suitcase to avoid the security check operation, and accordingly accuracy of a security check process can be improved.
Optionally, the terminal device may identify a distance value between each interactable object and the image of the human body region, select one interactable object with the lowest distance value as a target interaction object, and calculate a matching degree between the target interaction object and each candidate action, thereby determining an action type of the target object.
As can be seen from the above, in the human body action recognition method provided by the embodiment of the present invention, the video file of the target user whose actions need to be analyzed is obtained, each video image frame of the video file is parsed, the human body region image contained in each video image frame is determined, each key part is marked in the human body region image, and the change of each part of the target object is determined according to the feature coordinates of the key parts, so that the action type of the target object is determined and the human body action of the target object is automatically recognized. Compared with existing human body action recognition technology, the embodiment of the present invention does not need to rely on a neural network to recognize the action type in the video image and does not use optical flow information, which avoids the recognition delay caused by time-sequence recursion and improves recognition efficiency. On the other hand, the action of the target object is determined by locating a plurality of key parts and determining how these key parts change, which further improves accuracy, thereby improving the image recognition effect and the efficiency of object behavior analysis.
Fig. 2 shows a flowchart of a specific implementation of step S106 of the human body action recognition method according to a second embodiment of the present invention. Referring to fig. 2, relative to the embodiment described in fig. 1, the step S106 provided in this embodiment includes S1061 to S1066, which are described in detail below:
further, the calculating the matching degree between each candidate action and the interactable object, and determining the action type of the target object from the candidate actions according to the matching degree, includes:
in S1061, a distance value between the interactable object and the human body region image is acquired, and an interaction confidence of the interactable object is determined based on the distance value.
In this embodiment, the terminal device may mark an area image in which the interactable object is located on a video image frame, use a central coordinate of the area image as a feature coordinate of the interactable object, calculate an euclidean distance between the feature coordinate and a central coordinate of the human body area, and use the euclidean distance as a distance value between the interactable object and the human body area image. If the distance value is smaller, the interaction probability between the two is larger; conversely, if the distance value is larger, the interaction probability between the two is smaller. Therefore, the terminal device can calculate the interaction confidence between the interactable object and the target human body according to the distance value.
In S1062, the similarity between the key feature sequence and the standard feature sequence of each candidate action is calculated, and the similarity is identified as the action confidence of the candidate action.
In this embodiment, the terminal device needs to determine the probability that the identified candidate action is correct, so it obtains the standard feature sequence of the candidate action and calculates the similarity between the key feature sequence over the multiple video image frames and the standard feature sequence. The similarity may be calculated as follows: the terminal device plots a standard curve of the standard feature sequence and a behavior curve of the key feature sequence on a preset coordinate axis, calculates the area of the closed region enclosed by the two curves, and determines the similarity between the key feature sequence and the standard feature sequence based on that area. A larger area means a larger difference between the two actions and thus a smaller similarity; conversely, a smaller area means a smaller difference and a greater similarity.
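A sketch of this area-based similarity, assuming NumPy, curves sampled once per frame, and an exponential mapping from enclosed area to a confidence in (0, 1]; the mapping and its scale are assumptions.

```python
import numpy as np

def action_confidence(key_curve, standard_curve, scale=100.0):
    """Both curves are 1-D arrays sampled at the same frame numbers, for
    example the vertical coordinate of one key part per frame.  The enclosed
    area is approximated by summing the absolute per-frame differences, and a
    larger area maps to a lower confidence; the exponential mapping and the
    scale value are assumptions."""
    key_curve = np.asarray(key_curve, dtype=float)
    standard_curve = np.asarray(standard_curve, dtype=float)
    area = np.abs(key_curve - standard_curve).sum()
    return float(np.exp(-area / scale))
```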
In S1063, based on the object type of the interactable object, a probability of interaction of the candidate action with the object type is determined.
In this embodiment, the terminal device determines, according to the contour information of the interactable object, the object type of the interactable object, that is, which kind of article the interactable object is, and determines the interaction probability between that object type and the candidate action. For example, an object of type "basketball" can serve as the action recipient of candidate actions such as "shooting" and "kicking", so the interaction probability is high; for candidate actions such as "sitting" and "standing", the "basketball" object type cannot be interacted with, so the interaction probability is small. The terminal device may obtain the action recipient objects of each candidate action from an action record library, count the number of action records corresponding to the object type, and determine the interaction probability between the object type and the candidate action based on that number.
In S1064, an object region image of the interactable object is extracted from the video image frame, and an object confidence level of the interactable object is determined according to the object region image and a standard image preset by the object type.
In this embodiment, the terminal device further needs to determine the accuracy of identifying the interactable object, so that an object region image of the interactable object is obtained, similarity comparison is performed between the object region image and a standard image matched with the object type, and the object confidence of the interactable object is determined according to the similarity between the two images.
In S1065, the interaction confidence, the action confidence, the object confidence and the interaction probability are imported into a matching degree calculation model, and the matching degree of the candidate action is determined.
In the matching degree calculation model, s_a denotes the matching degree of the candidate action a; s_i denotes the interaction confidence; s_h denotes the action confidence; s_o denotes the object confidence; p_i denotes the interaction probability; and p_a denotes the preset trigger probability of the candidate action a.
In this embodiment, the terminal device imports the four calculated parameters into the matching degree calculation model to determine the matching degree between the candidate action and the interactable object, so that action types can be screened and identified with the help of the interaction object. Specifically, the trigger probability of a candidate action can be calculated from the action type corresponding to the previous image frame and the action type of the following image frame; because actions have a certain continuity, the trigger probability of the current action can be determined from the actions already triggered and the subsequent actions.
In S1066, the candidate action with the matching degree greater than the matching threshold is selected as the action type of the target object.
In this embodiment, since there may be a plurality of interactions with the interactable object, the terminal device may select, as the type of the action of the target object, a candidate action having a matching degree greater than a preset matching threshold.
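Because the matching degree calculation model itself is published as an image in the original patent, the sketch below simply assumes a multiplicative combination of the five factors; it illustrates S1065 and S1066 under that assumption and is not the patent's exact formula.

```python
def matching_degree(interaction_conf, action_conf, object_conf,
                    interaction_prob, trigger_prob):
    """Combine the five factors into a single matching degree for one
    candidate action; a plain product is assumed here, since the patent's
    own formula is only available as an image."""
    return (interaction_conf * action_conf * object_conf
            * interaction_prob * trigger_prob)

def select_action_types(candidates, threshold):
    """candidates: {action_name: (s_i, s_h, s_o, p_i, p_a)}.  Returns the
    candidate actions whose matching degree exceeds the matching threshold,
    mirroring S1066."""
    return [action for action, factors in candidates.items()
            if matching_degree(*factors) > threshold]
```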
In the embodiment of the invention, the confidence degrees of the candidate actions and the interactable objects in a plurality of dimensions are determined, so that the matching degree of each candidate action is calculated, the accuracy of the matching degree calculation can be improved, and the accuracy of human action recognition is improved.
Fig. 3 shows a flowchart of a specific implementation of step S104 of the human body action recognition method according to a third embodiment of the present invention. Referring to fig. 3, relative to the embodiment described in fig. 1, the step S104 provided in this embodiment includes S1041 to S1045, which are described in detail below:
further, the generating a key feature sequence about the key location according to the feature coordinates of the key location corresponding to each video image frame includes:
in S1041, a first feature coordinate and a second feature coordinate of the same key part in two video image frames with adjacent frames are obtained, and an image distance value between the first feature coordinate and the second feature coordinate is calculated.
In this embodiment, the terminal device needs to track the key parts of the human body, and if the displacement of the same key part in two adjacent image frames is detected to be too large, the two key parts are identified to belong to different human bodies, so that re-tracking can be quickly performed, and the accuracy of motion recognition is improved. Based on the above, the terminal device obtains the first feature coordinates and the second feature coordinates of the same key part in two video image frames adjacent to each other in the frame number, and introduces the two feature coordinates into the euclidean distance calculation formula to calculate the distance value between the two coordinate points, namely the image distance value. The image distance value specifically refers to the distance between two coordinate points on the video image frame, and is not the moving distance of the key part in the actual scene, so that the image distance value needs to be subjected to numerical conversion.
In S1042, an image area of the human body region image is calculated, and a photographing focal length between the target object and a photographing module is determined based on the image area.
In this embodiment, the terminal device acquires the area occupied by the human body region image in the video image frame, that is, the image area. The terminal device is configured with a standard human body area and the standard shooting focal length corresponding to that area. The terminal device may calculate the ratio between the current image area and the standard human body area to determine a scaling ratio, and calculate the actual shooting focal length between the target object and the shooting module, that is, the shooting focal length described above, based on the scaling ratio and the standard shooting focal length.
In S1043, the shooting focal length, the image distance value and the shooting frame rate of the video file are imported into a distance conversion model, and the actual moving distances of the key parts in the two video image frames are calculated.
In the distance conversion model, Dist is the actual moving distance; StandardDist is the image distance value; FigDist is the shooting focal length; BaseDist is a preset reference focal length; ActFrame is the shooting frame rate; and BaseFrame is the reference frame rate.
In this embodiment, the shooting focal length corresponding to the video image frame, the image distance values of the two key parts and the shooting frame rate of the video file are imported into the distance conversion model by the terminal device, so that the actual moving distance of the key parts in the scene can be calculated.
In S1044, identifying two feature coordinates of which the actual moving distance is smaller than a preset distance threshold as feature coordinates associated with each other.
In this embodiment, if the terminal device detects that the actual moving distance is greater than or equal to the preset distance threshold, it indicates that the movement of the key part exceeds a normal moving distance; in that case the key parts in the two video image frames are identified as belonging to different target objects, and the two feature coordinates are judged to be non-associated feature coordinates. Otherwise, if the actual moving distance is smaller than the preset distance threshold, the key parts in the two video image frames belong to the same target object, and the two feature coordinates are judged to be associated feature coordinates. This achieves the purpose of tracking the target object, avoids the situation in which the trajectory of user A is being tracked but the trajectory of user B is followed instead, and improves the accuracy of action recognition.
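The distance conversion model is likewise published as an image; the sketch below assumes one plausible form, rescaling the image distance by the focal-length ratio and the frame-rate ratio, and applies the association check of S1044 under that assumption.

```python
def actual_moving_distance(image_dist, fig_dist, base_dist,
                           act_frame_rate, base_frame_rate):
    """One plausible instantiation of the distance conversion model: the
    image distance is rescaled by the ratio of the shooting focal length to
    the reference focal length and by the ratio of the reference frame rate
    to the shooting frame rate.  The exact combination is an assumption."""
    return image_dist * (fig_dist / base_dist) * (base_frame_rate / act_frame_rate)

def are_associated(image_dist, fig_dist, base_dist,
                   act_frame_rate, base_frame_rate, distance_threshold):
    """Two feature coordinates are treated as associated (same target object)
    only when the converted moving distance stays below the threshold."""
    dist = actual_moving_distance(image_dist, fig_dist, base_dist,
                                  act_frame_rate, base_frame_rate)
    return dist < distance_threshold
```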
At S1045, the key feature sequence for the key location is generated according to all the feature coordinates associated with each other.
In this embodiment, the terminal device filters all the feature coordinates that are not associated, encapsulates the feature coordinates that are associated with each other, and generates a key feature sequence related to the key location.
In the embodiment of the invention, the abnormal characteristic coordinate points can be filtered by calculating the actual moving distance of the key parts under different frames, so that the accuracy of motion recognition is improved.
Fig. 4 shows a flowchart of a specific implementation of step S102 of the human body action recognition method according to a fourth embodiment of the present invention. Referring to fig. 4, relative to the embodiments described in fig. 1 to fig. 3, the step S102 provided in this embodiment includes S1021 to S1024, which are described in detail below:
further, the analyzing each video image frame separately, extracting a human body area image about the target object in the video image frame includes:
in S1021, a contour curve of the video image frame is acquired by a contour recognition algorithm, and an area surrounded by each contour curve is calculated.
In this embodiment, the terminal device determines the contour curve in the video image frame by a contour recognition algorithm. The specific way of identifying the contour line can be as follows: and the terminal equipment calculates the difference value of pixel values between two adjacent coordinate points, if the difference value is larger than a preset contour threshold value, the coordinate point is identified as the coordinate point where the contour line is located, and all the coordinate points on the contour line obtained by identification are connected to form a continuous contour curve. Each closed contour curve corresponds to a subject.
In this embodiment, the terminal device marks all contour curves on the video image frame, and integrates the contour curves and/or the areas enclosed between the boundaries of the video image frame, so as to obtain the area corresponding to each contour curve.
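A minimal sketch of the neighbour-difference rule for locating contour points, assuming a NumPy grayscale image; connecting the marked points into curves and integrating the enclosed areas are omitted.

```python
import numpy as np

def contour_mask(gray_image, contour_threshold):
    """Mark pixels whose value differs from the right-hand or lower neighbour
    by more than the contour threshold, following the pixel-difference rule
    described above (a single-channel grayscale image is assumed)."""
    img = np.asarray(gray_image, dtype=np.int32)  # avoid uint8 wrap-around
    mask = np.zeros(img.shape, dtype=bool)
    mask[:, :-1] |= np.abs(img[:, 1:] - img[:, :-1]) > contour_threshold
    mask[:-1, :] |= np.abs(img[1:, :] - img[:-1, :]) > contour_threshold
    return mask
```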
In S1022, a human body recognition window of the video image frame is generated according to each of the area areas.
In this embodiment, because of different scaling ratios, the size of the human body recognition window needs to be adjusted accordingly. Based on this, the terminal device may calculate the scaling ratio corresponding to the video image frame according to the area of each shooting subject, query the size of the human body recognition window associated with that scaling ratio, and then generate the human body recognition window matched with the video image frame.
Optionally, in this embodiment, the terminal device adopts a yolov3 human body recognition algorithm, and yolov3 requires three human body recognition windows to be configured. Based on this, the terminal device generates the distribution of region areas from the areas enclosed by the contour curves, selects the three area values with the highest distribution density as characteristic areas, and generates the human body recognition windows corresponding to these three characteristic areas, namely three feature maps.
In S1023, sliding framing is performed on the video image frame based on the human body recognition window, and a plurality of candidate region images are generated.
In this embodiment, after generating the human body recognition window corresponding to the scaling ratio of the video image frame, the terminal device may perform sliding framing on the video image frame through the human body recognition window, and use the area image framed each time as the candidate area image. If a plurality of human body recognition windows with different sizes exist, concurrent threads corresponding to the number of the human body recognition windows are created, the plurality of video image frames are copied, and the human body recognition windows are respectively controlled to slide and frame on different video image frames through the plurality of concurrent threads, namely, the sliding and frame operation of the human body recognition windows with different sizes are mutually independent and do not affect each other, and candidate region images with different sizes are generated.
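Sliding framing with a single human body recognition window might be enumerated as in the sketch below; the stride is an assumption, since the patent does not specify a step size.

```python
def sliding_candidate_regions(frame_width, frame_height,
                              window_width, window_height, stride):
    """Enumerate the regions framed by one human body recognition window as
    it slides over a video image frame; each region is returned as
    (x, y, window_width, window_height)."""
    regions = []
    for y in range(0, frame_height - window_height + 1, stride):
        for x in range(0, frame_width - window_width + 1, stride):
            regions.append((x, y, window_width, window_height))
    return regions
```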
In S1024, the coincidence ratios between the candidate region images and the standard human body template are calculated, respectively, and the candidate region image with the coincidence ratio greater than the preset coincidence ratio threshold is selected as the human body region image.
In this embodiment, the terminal device calculates the coincidence rate between each candidate region image and the standard human body template. If the coincidence rate between a candidate region image and the standard human body template is high, the shooting object corresponding to that region image is highly similar to a human body, so the candidate region can be identified as a human body region image; conversely, if the coincidence rate between the two is low, the form of the region image has a low similarity to a human body, and the region image is identified as a non-human-body region image. Because the video image frame may contain several different users, the terminal device may identify all candidate regions whose coincidence rate exceeds the preset coincidence rate threshold as human body region images; in this case, the terminal device may locate the face image of each human body region image and match the face images against the standard face of the target object, thereby selecting the human body region image that matches the standard face as the human body region image of the target object.
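Reading the coincidence rate as an intersection-over-union between binary masks is one possible interpretation; the following sketch computes it under that assumption.

```python
import numpy as np

def coincidence_rate(candidate_mask, template_mask):
    """Overlap between a binary candidate-region mask and a binary standard
    human body template of the same shape, computed as intersection over
    union; reading the 'coincidence rate' this way is an assumption."""
    candidate_mask = np.asarray(candidate_mask, dtype=bool)
    template_mask = np.asarray(template_mask, dtype=bool)
    union = np.logical_or(candidate_mask, template_mask).sum()
    if union == 0:
        return 0.0
    intersection = np.logical_and(candidate_mask, template_mask).sum()
    return float(intersection) / float(union)
```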
In the embodiment of the invention, the contour curves in the video image frames are acquired, so that the scaling ratio of the video image frames is determined based on the area of each contour curve, and the human body recognition window corresponding to the scaling ratio is generated to perform the recognition operation of the human body area images, thereby improving the recognition accuracy.
Fig. 5 shows a flowchart of a specific implementation of step S105 of the human body action recognition method according to a fifth embodiment of the present invention. Referring to fig. 5, relative to the embodiments described in fig. 1 to fig. 3, the step S105 provided in this embodiment includes S1051 to S1052, which are described in detail below:
further, the determining at least one candidate action of the target object by the key feature sequence of each key part includes:
in S1051, feature coordinates of each of the key feature sequences are marked in a preset coordinate axis, and a part change curve for each of the key parts is generated.
In this embodiment, the terminal device marks each feature coordinate on a preset coordinate axis according to the coordinate values of each feature coordinate in each key feature sequence and the frame number of the corresponding video image frame, and connects each feature coordinate to generate a location change curve about the key location. The coordinate axis may be a coordinate axis established based on the video image frame, with the horizontal axis corresponding to the length of the video image frame and the vertical axis corresponding to the width of the video image frame.
In S1052, the part change curve is matched with a standard action curve of each candidate action in a preset action library, and the candidate action of the target object is determined based on the matching result.
In this embodiment, the terminal device matches the part change curves of all the key parts against the standard action curves of each candidate action in the preset action library, calculates the coincidence rate of the two curves, and selects the action with the highest coincidence rate as a candidate action of the target object.
In the embodiment of the invention, the action type of the target object can be intuitively determined by drawing the part change curve of the key part, and the accuracy of the action type is improved.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
Fig. 6 is a block diagram of a human motion recognition apparatus according to an embodiment of the present invention, where the human motion recognition apparatus includes units for performing the steps in the embodiment corresponding to fig. 1. Please refer to fig. 1 and the related description of the embodiment corresponding to fig. 1. For convenience of explanation, only the portions related to the present embodiment are shown.
Referring to fig. 6, the human motion recognition apparatus includes:
a video file acquisition unit 61 for acquiring a video file of a target object; the video file includes a plurality of video image frames;
a human body region image extracting unit 62, configured to parse each of the video image frames, extract a human body region image related to the target object in the video image frames, and determine an interactable object contained in the video image frames;
a key part identification unit 63, configured to mark each key part in a preset human body key part list in the human body area image, and obtain feature coordinates of each key part;
a key feature sequence generating unit 64, configured to generate a key feature sequence related to the key part according to the feature coordinates corresponding to the key part in each of the video image frames;
a candidate action recognition unit 65 for determining at least one candidate action of the target object by the key feature sequences of the respective key parts;
an action type recognition unit 66, configured to calculate a degree of matching between each candidate action and the interactable object, and determine an action type of the target object from the candidate actions according to the degree of matching.
Optionally, the action type recognition unit 66 includes:
the interactive confidence calculation unit is used for acquiring a distance value between the interactable object and the human body region image and determining the interactive confidence of the interactable object based on the distance value;
the action confidence degree identification unit is used for respectively calculating the similarity between the key feature sequence and the standard feature sequence of each candidate action and identifying the similarity as the action confidence degree of the candidate action;
an interaction probability determining unit, configured to determine an interaction probability of the candidate action with the object type based on the object type of the interactable object;
the object confidence degree identification unit is used for extracting an object region image of the interactable object from the video image frame and determining the object confidence degree of the interactable object according to the object region image and a standard image preset by the object type;
a matching degree calculation unit, configured to import the interaction confidence, the action confidence, the object confidence and the interaction probability into a matching degree calculation model and determine the matching degree of the candidate action. In the matching degree calculation model, s_a denotes the matching degree of the candidate action a; s_i denotes the interaction confidence; s_h denotes the action confidence; s_o denotes the object confidence; p_i denotes the interaction probability; and p_a denotes the preset trigger probability of the candidate action a;
and the candidate action selecting unit is used for selecting the candidate actions with the matching degree larger than a matching threshold value and identifying the candidate actions as the action types of the target objects.
Optionally, the key feature sequence generating unit 64 includes:
the image distance value calculation unit is used for acquiring first feature coordinates and second feature coordinates of the same key part in two adjacent video image frames of the frame number, and calculating an image distance value between the first feature coordinates and the second feature coordinates;
a shooting focal length determining unit for calculating an image area of the human body region image and determining a shooting focal length between the target object and a shooting module based on the image area;
an actual moving distance calculating unit, configured to import the shooting focal length, the image distance value and the shooting frame rate of the video file into a distance conversion model and calculate the actual moving distances of the key parts in the two video image frames. In the distance conversion model, Dist is the actual moving distance; StandardDist is the image distance value; FigDist is the shooting focal length; BaseDist is a preset reference focal length; ActFrame is the shooting frame rate; and BaseFrame is the reference frame rate;
the associated coordinate recognition unit is used for recognizing the two feature coordinates with the actual moving distance smaller than a preset distance threshold as feature coordinates which are associated with each other;
and the associated coordinate packaging unit is used for generating the key feature sequence related to the key part according to all the feature coordinates which are associated with each other.
Optionally, the human body region image extraction unit 62 includes:
the contour curve acquisition unit is used for acquiring contour curves of the video image frames through a contour recognition algorithm and calculating the area surrounded by each contour curve;
the human body identification window generation unit is used for generating a human body identification window of the video image frame according to the area of each region;
a candidate region image extraction unit, configured to perform sliding frame extraction on the video image frame based on the human body recognition window, and generate a plurality of candidate region images;
and the human body region image matching unit is used for respectively calculating the coincidence rate between each candidate region image and the standard human body template, and selecting the candidate region image with the coincidence rate larger than a preset coincidence rate threshold value as the human body region image.
Optionally, the action type recognition unit 65 includes:
the part change curve generating unit is used for marking the feature coordinates of each key feature sequence on a preset coordinate axis and generating a part change curve for each key part;
and the candidate action selecting unit is used for matching the part change curve with the standard action curve of each candidate action in a preset action library, and determining the action type of the target object based on the matching result.
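Purely as an illustration of this curve matching, the Python sketch below represents each part change curve as a sequence of (x, y) feature coordinates and scores it against stored standard action curves with cosine similarity; the similarity measure, the crude length alignment and the threshold are assumptions rather than the patented matching rule:

```python
# Sketch of matching a part change curve against standard action curves in a preset
# action library. Cosine similarity over crudely aligned curves and the 0.8 threshold
# are assumptions standing in for the matching rule used by the patent.
import numpy as np


def part_change_curve(key_feature_sequence):
    """Stack the (x, y) feature coordinates of one key part into a T x 2 trajectory."""
    return np.asarray(key_feature_sequence, dtype=float)


def match_action(curve, action_library, min_similarity=0.8):
    """Return the best-matching candidate action name, or None if nothing matches."""
    best_name, best_sim = None, min_similarity
    for name, standard_curve in action_library.items():
        a = curve.ravel()
        b = np.asarray(standard_curve, dtype=float).ravel()
        n = min(a.size, b.size)  # crude alignment of curves of different length
        sim = float(np.dot(a[:n], b[:n])
                    / (np.linalg.norm(a[:n]) * np.linalg.norm(b[:n]) + 1e-9))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name
```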
Therefore, the human body action recognition device provided by this embodiment of the invention can recognize the action type in a video image without relying on a neural network or on optical flow information, which avoids the recognition delay caused by time-sequence recursion and improves recognition efficiency. On the other hand, the terminal device can determine the interactable object in the video image frame and use the interaction behaviour to decide whether the target user is interacting, so that several similar gestures can be distinguished and the accuracy of action recognition is further improved.
Fig. 7 is a schematic diagram of a terminal device according to another embodiment of the present invention. As shown in Fig. 7, the terminal device 7 of this embodiment includes: a processor 70, a memory 71 and a computer program 72, for example a human body action recognition program, stored in the memory 71 and executable on the processor 70. When the processor 70 executes the computer program 72, the steps in each of the above human body action recognition method embodiments are implemented, for example S101 to S106 shown in Fig. 1. Alternatively, when executing the computer program 72, the processor 70 implements the functions of the units in the above device embodiments, for example the functions of the modules 61 to 66 shown in Fig. 6.
By way of example, the computer program 72 may be divided into one or more units, which are stored in the memory 71 and executed by the processor 70 to implement the present invention. The one or more units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution of the computer program 72 in the terminal device 7. For example, the computer program 72 may be divided into a video file acquisition unit, a human body region image extraction unit, a key part recognition unit, a key feature sequence generation unit, a candidate action recognition unit and an action type recognition unit, whose specific functions are as described above.
The terminal device 7 may be a computing device such as a desktop computer, a notebook computer, a palm computer or a cloud server. The terminal device may include, but is not limited to, the processor 70 and the memory 71. It will be appreciated by those skilled in the art that Fig. 7 is merely an example of the terminal device 7 and does not constitute a limitation of the terminal device 7, which may include more or fewer components than illustrated, combine certain components, or use different components; for example, the terminal device may further include an input-output device, a network access device, a bus, and the like.
The processor 70 may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 71 may be an internal storage unit of the terminal device 7, such as a hard disk or a memory of the terminal device 7. The memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card or a Flash memory Card (Flash Card) provided on the terminal device 7. Further, the memory 71 may include both an internal storage unit and an external storage device of the terminal device 7. The memory 71 is used to store the computer program as well as other programs and data required by the terminal device. The memory 71 may also be used to temporarily store data that has been output or is to be output.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (8)

1. A method for recognizing human motion, comprising:
acquiring a video file of a target object; the video file includes a plurality of video image frames;
analyzing each video image frame, extracting a human body area image related to the target object in the video image frame, and determining an interactable object contained in the video image frame;
Marking each key part in a preset key part list of the human body in the human body area image, and acquiring feature coordinates of each key part;
generating a key feature sequence related to the key part according to the feature coordinates of the key part corresponding to each video image frame;
determining at least one candidate action of the target object through the key feature sequences of the key parts;
calculating the matching degree between each candidate action and the interactable object respectively, and determining the action type of the target object from the candidate actions according to the matching degree;
the calculating the matching degree between each candidate action and the interactable object, and determining the action type of the target object from the candidate actions according to the matching degree, includes:
acquiring a distance value between the interactable object and the human body region image, and determining interaction confidence of the interactable object based on the distance value;
respectively calculating the similarity between the key feature sequence and the standard feature sequence of each candidate action, and identifying the similarity as the action confidence of the candidate action;
Determining an interaction probability of the candidate action with the object type based on the object type of the interactable object;
extracting an object region image of the interactable object from the video image frame, and determining the object confidence of the interactable object according to the object region image and a standard image preset by the object type;
importing the interaction confidence, the action confidence, the object confidence and the interaction probability into a matching degree calculation model to determine the matching degree of the candidate action; the matching degree calculation model is specifically as follows:
wherein the output of the model is the matching degree of the candidate action a, and the inputs are the interaction confidence, the action confidence s_h, the object confidence s_o, the interaction probability, and the preset triggering probability of the candidate action a;
and selecting the candidate actions with the matching degree larger than a matching threshold as action types of the target objects.
2. The method of claim 1, wherein generating a key feature sequence for the key location based on the feature coordinates of the key location in each of the video image frames, comprises:
Acquiring first feature coordinates and second feature coordinates of the same key part in two adjacent video image frames, and calculating an image distance value between the first feature coordinates and the second feature coordinates;
calculating the image area of the human body area image, and determining the shooting focal length between the target object and the shooting module based on the image area;
importing the shooting focal length, the image distance value and the shooting frame rate of the video file into a distance conversion model, and calculating the actual moving distance of the key part in the two video image frames; the distance conversion model is specifically as follows:
wherein Dist is the actual movement distance; StandardDist is the image distance value; figDist is the shooting focal length; baseDist is a preset reference focal length; actFrame is the shooting frame rate; baseFrame is the reference frame rate;
identifying two feature coordinates of which the actual moving distance is smaller than a preset distance threshold as feature coordinates which are associated with each other;
and generating the key feature sequence related to the key part according to all the feature coordinates which are mutually associated.
3. The method of any of claims 1-2, wherein the parsing each of the video image frames, extracting a human body region image of the video image frames with respect to the target object, and determining an interactable object contained in the video image frames, comprises:
Acquiring contour curves of the video image frames through a contour recognition algorithm, and calculating the area surrounded by each contour curve;
according to the area of each region, generating a human body identification window of the video image frame;
sliding frame extraction is carried out on the video image frame based on the human body recognition window, and a plurality of candidate area images are generated;
and respectively calculating the coincidence rate between each candidate region image and the standard human body template, and selecting the candidate region image with the coincidence rate larger than a preset coincidence rate threshold value as the human body region image.
4. The method of any of claims 1-2, wherein said determining at least one candidate action of the target object from the key feature sequence of each of the key locations comprises:
marking feature coordinates of each key feature sequence in a preset coordinate axis to generate a part change curve about each key part;
and matching the part change curve with a standard action curve of each candidate action in a preset action library, and determining the candidate action of the target object based on a matching result.
5. An apparatus for recognizing human motion, comprising:
a video file acquisition unit for acquiring a video file of a target object; the video file includes a plurality of video image frames;
a human body region image extraction unit, configured to parse each video image frame, extract a human body region image related to the target object in the video image frame, and determine an interactable object contained in the video image frame;
the key part identification unit is used for marking each key part in a preset human body key part list in the human body area image and acquiring the characteristic coordinates of each key part;
a key feature sequence generating unit, configured to generate a key feature sequence related to the key location according to the feature coordinates corresponding to the key location in each video image frame;
a candidate action recognition unit, configured to determine at least one candidate action of the target object through the key feature sequences of the respective key parts;
the action type recognition unit is used for respectively calculating the matching degree between each candidate action and the interactable object and determining the action type of the target object from the candidate actions according to the matching degree;
The action type recognition unit includes:
the interaction confidence calculation unit is used for acquiring a distance value between the interactable object and the human body region image and determining the interaction confidence of the interactable object based on the distance value;
the action confidence degree identification unit is used for respectively calculating the similarity between the key feature sequence and the standard feature sequence of each candidate action and identifying the similarity as the action confidence degree of the candidate action;
an interaction probability determining unit, configured to determine an interaction probability of the candidate action with the object type based on the object type of the interactable object;
the object confidence degree identification unit is used for extracting an object region image of the interactable object from the video image frame and determining the object confidence degree of the interactable object according to the object region image and a standard image preset by the object type;
a matching degree calculation unit, configured to import the interaction confidence, the action confidence, the object confidence and the interaction probability into a matching degree calculation model and determine the matching degree of the candidate action; the matching degree calculation model is specifically as follows:
wherein the output of the model is the matching degree of the candidate action a, and the inputs are the interaction confidence, the action confidence s_h, the object confidence s_o, the interaction probability, and the preset triggering probability of the candidate action a;
and the candidate action selecting unit is used for selecting the candidate actions with the matching degree larger than a matching threshold value and identifying the candidate actions as the action types of the target objects.
6. The apparatus according to claim 5, wherein the key feature sequence generating unit includes:
the image distance value calculation unit is used for acquiring first feature coordinates and second feature coordinates of the same key part in two adjacent video image frames, and calculating an image distance value between the first feature coordinates and the second feature coordinates;
a shooting focal length determining unit for calculating an image area of the human body region image and determining a shooting focal length between the target object and a shooting module based on the image area;
an actual moving distance calculating unit, configured to import the shooting focal length, the image distance value and the shooting frame rate of the video file into a distance conversion model, and calculate the actual moving distance of the key part in the two video image frames; the distance conversion model is specifically as follows:
wherein Dist is the actual movement distance; StandardDist is the image distance value; figDist is the shooting focal length; baseDist is a preset reference focal length; actFrame is the shooting frame rate; baseFrame is the reference frame rate;
the associated coordinate recognition unit is used for recognizing the two feature coordinates with the actual moving distance smaller than a preset distance threshold as feature coordinates which are associated with each other;
and the associated coordinate packaging unit is used for generating the key feature sequence related to the key part according to all the feature coordinates which are associated with each other.
7. A terminal device, characterized in that it comprises a memory, a processor and a computer program stored in the memory and executable on the processor, the processor executing the steps of the method according to any one of claims 1 to 4.
8. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 4.
CN201910264883.XA 2019-04-03 2019-04-03 Human body action recognition method and device Active CN110135246B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910264883.XA CN110135246B (en) 2019-04-03 2019-04-03 Human body action recognition method and device
PCT/CN2019/103161 WO2020199479A1 (en) 2019-04-03 2019-08-29 Human motion recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910264883.XA CN110135246B (en) 2019-04-03 2019-04-03 Human body action recognition method and device

Publications (2)

Publication Number Publication Date
CN110135246A CN110135246A (en) 2019-08-16
CN110135246B true CN110135246B (en) 2023-10-20

Family

ID=67569223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910264883.XA Active CN110135246B (en) 2019-04-03 2019-04-03 Human body action recognition method and device

Country Status (2)

Country Link
CN (1) CN110135246B (en)
WO (1) WO2020199479A1 (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417205A (en) * 2019-08-20 2021-02-26 富士通株式会社 Target retrieval device and method and electronic equipment
CN110738588A (en) * 2019-08-26 2020-01-31 恒大智慧科技有限公司 Intelligent community toilet management method and computer storage medium
CN111288986B (en) * 2019-12-31 2022-04-12 中科彭州智慧产业创新中心有限公司 Motion recognition method and motion recognition device
CN113496143A (en) * 2020-03-19 2021-10-12 北京市商汤科技开发有限公司 Action recognition method and device, and storage medium
CN111539352A (en) * 2020-04-27 2020-08-14 支付宝(杭州)信息技术有限公司 Method and system for judging human body joint motion direction
CN111814775B (en) * 2020-09-10 2020-12-11 平安国际智慧城市科技股份有限公司 Target object abnormal behavior identification method, device, terminal and storage medium
CN112528785A (en) * 2020-11-30 2021-03-19 联想(北京)有限公司 Information processing method and device
CN112418137B (en) * 2020-12-03 2022-10-25 杭州云笔智能科技有限公司 Operation identification method and system for target object
CN112528823B (en) * 2020-12-04 2022-08-19 燕山大学 Method and system for analyzing batcharybus movement behavior based on key frame detection and semantic component segmentation
CN112364835B (en) * 2020-12-09 2023-08-11 武汉轻工大学 Video information frame taking method, device, equipment and storage medium
CN112580499A (en) * 2020-12-17 2021-03-30 上海眼控科技股份有限公司 Text recognition method, device, equipment and storage medium
CN112529943B (en) * 2020-12-22 2024-01-16 深圳市优必选科技股份有限公司 Object detection method, object detection device and intelligent equipment
CN112712906A (en) * 2020-12-29 2021-04-27 安徽科大讯飞医疗信息技术有限公司 Video image processing method and device, electronic equipment and storage medium
CN112784760B (en) 2021-01-25 2024-04-12 北京百度网讯科技有限公司 Human behavior recognition method, device, equipment and storage medium
CN112883816A (en) * 2021-01-26 2021-06-01 百度在线网络技术(北京)有限公司 Information pushing method and device
CN113288087B (en) * 2021-06-25 2022-08-16 成都泰盟软件有限公司 Virtual-real linkage experimental system based on physiological signals
CN113553951B (en) * 2021-07-23 2024-04-16 北京市商汤科技开发有限公司 Object association method and device, electronic equipment and computer readable storage medium
CN113784059B (en) * 2021-08-03 2023-08-18 阿里巴巴(中国)有限公司 Video generation and splicing method, equipment and storage medium for clothing production
CN113657278A (en) * 2021-08-18 2021-11-16 成都信息工程大学 Motion gesture recognition method, device, equipment and storage medium
CN113869274B (en) * 2021-10-13 2022-09-06 深圳联和智慧科技有限公司 Unmanned aerial vehicle intelligent tracking monitoring method and system based on city management
CN114157526B (en) * 2021-12-23 2022-08-12 广州新华学院 Digital image recognition-based home security remote monitoring method and device
CN115171217B (en) * 2022-07-27 2023-03-03 北京拙河科技有限公司 Action recognition method and system under dynamic background
CN115620392A (en) * 2022-09-26 2023-01-17 珠海视熙科技有限公司 Action counting method, device, medium and fitness equipment
CN116704405A (en) * 2023-05-22 2023-09-05 阿里巴巴(中国)有限公司 Behavior recognition method, electronic device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930767A (en) * 2016-04-06 2016-09-07 南京华捷艾米软件科技有限公司 Human body skeleton-based action recognition method
CN106022251A (en) * 2016-05-17 2016-10-12 沈阳航空航天大学 Abnormal double-person interaction behavior recognition method based on vision co-occurrence matrix sequence
WO2017000917A1 (en) * 2015-07-01 2017-01-05 乐视控股(北京)有限公司 Positioning method and apparatus for motion-stimulation button
CN107423721A (en) * 2017-08-08 2017-12-01 珠海习悦信息技术有限公司 Interactive action detection method, device, storage medium and processor
CN108197589A (en) * 2018-01-19 2018-06-22 北京智能管家科技有限公司 Semantic understanding method, apparatus, equipment and the storage medium of dynamic human body posture
WO2018113405A1 (en) * 2016-12-19 2018-06-28 广州虎牙信息科技有限公司 Live broadcast interaction method based on video stream, and corresponding apparatus thereof
CN108304762A (en) * 2017-11-30 2018-07-20 腾讯科技(深圳)有限公司 A kind of human body attitude matching process and its equipment, storage medium, terminal
CN109325456A (en) * 2018-09-29 2019-02-12 佳都新太科技股份有限公司 Target identification method, device, target identification equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101817583B1 (en) * 2015-11-30 2018-01-12 한국생산기술연구원 System and method for analyzing behavior pattern using depth image
CN107335192A (en) * 2017-05-26 2017-11-10 深圳奥比中光科技有限公司 Move supplemental training method, apparatus and storage device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017000917A1 (en) * 2015-07-01 2017-01-05 乐视控股(北京)有限公司 Positioning method and apparatus for motion-stimulation button
CN105930767A (en) * 2016-04-06 2016-09-07 南京华捷艾米软件科技有限公司 Human body skeleton-based action recognition method
CN106022251A (en) * 2016-05-17 2016-10-12 沈阳航空航天大学 Abnormal double-person interaction behavior recognition method based on vision co-occurrence matrix sequence
WO2018113405A1 (en) * 2016-12-19 2018-06-28 广州虎牙信息科技有限公司 Live broadcast interaction method based on video stream, and corresponding apparatus thereof
CN107423721A (en) * 2017-08-08 2017-12-01 珠海习悦信息技术有限公司 Interactive action detection method, device, storage medium and processor
CN108304762A (en) * 2017-11-30 2018-07-20 腾讯科技(深圳)有限公司 A kind of human body attitude matching process and its equipment, storage medium, terminal
CN108197589A (en) * 2018-01-19 2018-06-22 北京智能管家科技有限公司 Semantic understanding method, apparatus, equipment and the storage medium of dynamic human body posture
CN109325456A (en) * 2018-09-29 2019-02-12 佳都新太科技股份有限公司 Target identification method, device, target identification equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
View-invariant action recognition method based on motion direction; Mei Xue; Zhang Jifa; Xu Songsong; Hu Shi; Computer Engineering; Vol. 38, No. 15; pp. 159-165 *

Also Published As

Publication number Publication date
WO2020199479A1 (en) 2020-10-08
CN110135246A (en) 2019-08-16

Similar Documents

Publication Publication Date Title
CN110135246B (en) Human body action recognition method and device
CN110147717B (en) Human body action recognition method and device
WO2021114892A1 (en) Environmental semantic understanding-based body movement recognition method, apparatus, device, and storage medium
WO2019184749A1 (en) Trajectory tracking method and apparatus, and computer device and storage medium
Yan et al. Learning the change for automatic image cropping
CN109284729B (en) Method, device and medium for acquiring face recognition model training data based on video
US10534957B2 (en) Eyeball movement analysis method and device, and storage medium
JP5554984B2 (en) Pattern recognition method and pattern recognition apparatus
US9465992B2 (en) Scene recognition method and apparatus
WO2019071664A1 (en) Human face recognition method and apparatus combined with depth information, and storage medium
Baskan et al. Projection based method for segmentation of human face and its evaluation
CN110472491A (en) Abnormal face detecting method, abnormality recognition method, device, equipment and medium
US9626552B2 (en) Calculating facial image similarity
CN110348331B (en) Face recognition method and electronic equipment
WO2021017286A1 (en) Facial recognition method and apparatus, electronic device and non-volatile computer readable storage medium
JP2017033547A (en) Information processing apparatus, control method therefor, and program
US10650234B2 (en) Eyeball movement capturing method and device, and storage medium
JP7107598B2 (en) Authentication face image candidate determination device, authentication face image candidate determination method, program, and recording medium
Seo et al. Effective and efficient human action recognition using dynamic frame skipping and trajectory rejection
JP2009157767A (en) Face image recognition apparatus, face image recognition method, face image recognition program, and recording medium recording this program
CN111695462A (en) Face recognition method, face recognition device, storage medium and server
JP2022542199A (en) KEYPOINT DETECTION METHOD, APPARATUS, ELECTRONICS AND STORAGE MEDIA
Shaikh et al. Gait recognition using partial silhouette-based approach
CN110363790A (en) Target tracking method, device and computer readable storage medium
CN108875488B (en) Object tracking method, object tracking apparatus, and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant