CN110135246A - A method and device for recognizing human body actions

A method and device for recognizing human body actions

Info

Publication number
CN110135246A
Authority
CN
China
Prior art keywords
action
human body
candidate
video image
key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910264883.XA
Other languages
Chinese (zh)
Other versions
CN110135246B (en)
Inventor
叶明�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201910264883.XA priority Critical patent/CN110135246B/en
Publication of CN110135246A publication Critical patent/CN110135246A/en
Priority to PCT/CN2019/103161 priority patent/WO2020199479A1/en
Application granted granted Critical
Publication of CN110135246B publication Critical patent/CN110135246B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172: Classification, e.g. identification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06V40/23: Recognition of whole body movements, e.g. for sport training

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

The present invention, which belongs to the technical field of image recognition, provides a method and device for recognizing human body actions, comprising: acquiring a video file of a target object; parsing each video image frame, extracting the human body region image of the target object from the video image frame, and determining the interactable objects contained in the video image frame; marking each key part in a preset human body key part list in the human body region image and obtaining the feature coordinates of each key part; generating a key feature sequence from the feature coordinates of each key part across the video image frames; determining candidate actions of the target object from the key feature sequences of the key parts; and calculating the matching degree between each candidate action and the interactable objects, and determining the action type of the target object according to the matching degrees. By considering interactive actions, the present invention can distinguish between several similar postures and determine whether the target user exhibits an interaction behavior, further improving the accuracy of action recognition.

Description

Human body action recognition method and device
Technical Field
The invention belongs to the technical field of image recognition, and particularly relates to a human body action recognition method and device.
Background
With the continuous development of image recognition technology, computers can automatically recognize more and more information from image and video files, for example determining the type of human action performed by a user in a picture, and then carrying out operations such as object tracking and behavior analysis based on the recognized motion information. The accuracy and speed of image recognition therefore directly affect the effect of these subsequent steps. Existing human motion recognition technology generally uses a convolutional neural network; however, this approach requires optical flow information and multiple time-sequence recursion operations, so recognition is slow and the accuracy is limited. In particular, for behaviors with similar postures, such as sitting and squatting, the similarity of the human poses means the actions cannot be accurately distinguished by a convolutional neural network, which further reduces the accuracy of action recognition.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and a device for recognizing human body actions, so as to solve the problem that actions with similar postures, such as sitting and squatting, cannot be accurately distinguished by a convolutional neural network because the human poses are so alike, which reduces the accuracy of action recognition.
The first aspect of the embodiments of the present invention provides a method for recognizing human body actions, including:
acquiring a video file of a target object; the video file comprises a plurality of video image frames;
respectively analyzing each video image frame, extracting a human body area image of the target object in the video image frame, and determining an interactive object contained in the video image frame;
marking each key part in a preset human body key part list in the human body area image, and acquiring the characteristic coordinates of each key part;
generating a key feature sequence related to the key part according to the feature coordinates corresponding to the key part in each video image frame;
determining at least one candidate action of the target object through the key feature sequence of each key part;
and respectively calculating the matching degree between each candidate action and the interactive object, and determining the action type of the target object from the candidate actions according to the matching degree.
A second aspect of an embodiment of the present invention provides an apparatus for recognizing a human body motion, including:
a video file acquisition unit for acquiring a video file of a target object; the video file comprises a plurality of video image frames;
the human body area image extracting unit is used for respectively analyzing each video image frame, extracting a human body area image related to the target object in the video image frames and determining an interactive object contained in the video image frames;
the key part identification unit is used for marking each key part in a preset human body key part list in the human body region image and acquiring the characteristic coordinates of each key part;
a key feature sequence generating unit, configured to generate a key feature sequence related to the key portion according to the feature coordinates of the key portion corresponding to each of the video image frames;
a candidate action recognition unit, configured to determine at least one candidate action of the target object through the key feature sequence of each key portion;
and the action type identification unit is used for respectively calculating the matching degree between each candidate action and the interactive object and determining the action type of the target object from the candidate actions according to the matching degree.
A third aspect of embodiments of the present invention provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the first aspect when executing the computer program.
A fourth aspect of embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of the first aspect.
The method and device for recognizing human body actions provided by the embodiments of the present invention have the following beneficial effects:
The method acquires the video file of a target user whose action behavior needs to be analyzed, parses each video image frame of the video file, determines the human body region image contained in each frame, and identifies the interactable objects with which the target user may interact in each frame. It then marks each key part in the human body region image and determines how each part of the target object changes from the feature coordinates of those key parts, thereby determining candidate actions of the target object. Finally, candidate actions with similar postures are screened according to their matching degree with the interactable objects, the action type of the target object is determined, and the human body action of the target object is recognized automatically. Compared with existing human motion recognition technology, the embodiments of the present invention do not need a neural network to recognize the action type in the video images, do not rely on optical flow information, and avoid the recognition delay caused by time-sequence recursion, thereby improving recognition efficiency.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the embodiments or the prior-art description are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be derived from them by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of an implementation of a human body action recognition method according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a specific implementation of step S106 of the human body action recognition method according to a second embodiment of the present invention;
Fig. 3 is a flowchart of a specific implementation of step S104 of the human body action recognition method according to a third embodiment of the present invention;
Fig. 4 is a flowchart of a specific implementation of step S102 of the human body action recognition method according to a fourth embodiment of the present invention;
Fig. 5 is a flowchart of a specific implementation of step S105 of the human body action recognition method according to a fifth embodiment of the present invention;
Fig. 6 is a block diagram of a human body action recognition device according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of a terminal device according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiments of the present invention acquire the video file of a target user whose action behavior needs to be analyzed, parse each video image frame of the video file, determine the human body region image contained in each frame, and identify the interactable objects with which the target user may interact in each frame. Each key part is then marked in the human body region image, and how each part of the target object changes is determined from the feature coordinates of the key parts, so that candidate actions of the target object can be determined. Candidate actions with similar postures are further screened according to their matching degree with the interactable objects, the action type of the target object is determined, and the human body action of the target object is recognized automatically. This solves the problems of existing human body action recognition methods, namely a low recognition speed and limited accuracy; in particular, behaviors with similar postures, such as sitting and squatting, cannot be accurately distinguished by a convolutional neural network because the human poses are so alike, which further reduces the accuracy of action recognition.
In the embodiments of the present invention, the process is executed by a terminal device. Terminal devices include, but are not limited to, servers, computers, smart phones and tablet computers capable of performing the human body action recognition operation. Fig. 1 shows a flowchart of the implementation of the human body action recognition method according to the first embodiment of the present invention, detailed as follows:
in S101, a video file of a target object is acquired; the video file includes a plurality of video image frames.
In this embodiment, an administrator may designate a video file containing the target object as the target video file; in this case the terminal device downloads the video file of the target object from a video database according to the file identifier of the target video file and recognizes the action behavior of the target object. Preferably, the terminal device is a video monitoring device that captures a video file of the current scene. In that case the terminal device may treat every object captured in the scene as a target object and assign each one an object number based on its face image. The terminal device then judges the action types of all monitored objects in real time from the video file generated during monitoring; if the action type of a target object is found in an abnormal-action list, warning information is generated to notify the monitored object to stop the abnormal action, so that abnormal actions of monitored objects are warned about in real time.
Alternatively, the user may send the face information of the target object to the terminal device. The terminal device then performs a face search over the video files in the video database based on this face information and takes the video files containing the face as target video files. The search may proceed as follows: the terminal device identifies the candidate faces in each video image frame of each video file in the database, extracts the face feature values of the key areas of each candidate face, and matches these feature values against the face information of the target face; if the matching degree is greater than a preset matching threshold, the candidate face and the target face correspond to the same person, and the video file is identified as containing a face image of the target object.
In this embodiment, the video file includes a plurality of video image frames; each frame corresponds to a frame number, and the frames are arranged and encapsulated in ascending order of frame number to generate the video file. The frame number may be determined from the playing time of the frame within the video file.
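For illustration only, the following Python sketch shows one way such numbered frames could be obtained from a video file; the use of OpenCV (`cv2`) and the helper name are assumptions, since the embodiment does not prescribe a particular decoding library.

```python
import cv2

def extract_numbered_frames(video_path):
    """Decode a video file into (frame_number, timestamp_ms, image) tuples.

    The frame number is the positive-order index used to keep the frames in
    playback order, as described in S101/S103. Illustrative sketch only.
    """
    capture = cv2.VideoCapture(video_path)
    frames = []
    frame_number = 0
    while True:
        ok, image = capture.read()
        if not ok:
            break
        timestamp_ms = capture.get(cv2.CAP_PROP_POS_MSEC)  # playing time of this frame
        frames.append((frame_number, timestamp_ms, image))
        frame_number += 1
    capture.release()
    return frames
```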
In S102, each of the video image frames is parsed, a human body region image of the video image frame about the target object is extracted, and an interactable object included in the video image frame is determined.
In this embodiment, the terminal device parses the video file, performs human body recognition on each video image frame, and extracts the human body region image of the target object from each frame. The human body region image may be extracted as follows: the terminal device judges, through a face recognition algorithm, whether the video image frame contains a face region image; if not, the frame contains no human body region image. Otherwise, contour recognition is performed around the coordinates of the detected face image, the human body region image corresponding to that face is extracted from the recognized contour information, and the face image is matched against the face template of the target object to judge whether the extracted region is a human body region image of the target object.
Optionally, if there are multiple target objects, that is, the behaviors of several objects need to be monitored, then after the terminal device determines the human body region image associated with a face image in the video image frame, it matches the face image against the face template of each target object to determine which target object the face belongs to, and marks the human body region image with the object identifier of the associated target object. In this way the human body region image corresponding to each target object can be located quickly in the video image frame, which makes it convenient to track the actions of multiple objects.
Optionally, in this embodiment, the terminal device may obtain the object human body template associated with the object identifier of the target object. The object human body template can represent human body characteristics of the target object, such as body type, gender and/or hair style information. The terminal device slides a frame over the video image frame according to the object human body template and calculates the matching degree between each framed candidate region and the template: if the matching degree is greater than a preset matching threshold, the candidate region is identified as a human body region image of the target object; if the matching degree is less than or equal to the threshold, the candidate region is identified as not being a human body region image of the target object and sliding framing continues. If none of the candidate regions in the video image frame contains a human body region image, the operation is repeated on the next video image frame to identify the human body region image of the target object.
In this embodiment, besides acquiring the human body region image of the target object, the terminal device also extracts from the image the interactable objects that the user could interact with. The identification may proceed as follows: contour information contained in the video image frame is determined through a contour recognition algorithm, the subject type of each photographic subject is determined from the contour information, and the interactable objects are determined according to the subject types. Since different types of interactive subjects have different contour characteristics, the subject type of a photographic subject can be determined by recognizing its contour information, and the photographic subjects capable of interacting with the target object are selected as interactable objects according to their subject types. For example, subjects such as a chair, a table or a knife may interact with the target object, while subjects such as a cloud or the sun have a very low probability of doing so. By identifying the subject type, most invalid interactable objects can therefore be filtered out.
Optionally, after identifying the photographic subjects, the terminal device calculates the distance value between each photographic subject and the human body region image, and selects the subjects whose distance value is smaller than a preset threshold as interactable objects. Preferably, the terminal device may select photographic subjects whose contour boundary is adjacent to the human body region image, because when the target object interacts with a subject the two are in contact, so the contour boundary of the interactable object is adjacent to the target user.
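A minimal sketch of this distance-based selection, assuming each photographic subject and the human body region are represented by bounding boxes; the data layout and the threshold parameter are illustrative assumptions, not part of the disclosed embodiment.

```python
import math

def select_interactable_objects(subjects, body_box, max_distance):
    """Keep photographic subjects whose bounding box lies close to the human
    body region, as in S102. `subjects` is a list of
    (subject_type, (x, y, w, h)) tuples; names are illustrative."""
    bx, by, bw, bh = body_box
    body_center = (bx + bw / 2.0, by + bh / 2.0)
    selected = []
    for subject_type, (x, y, w, h) in subjects:
        center = (x + w / 2.0, y + h / 2.0)
        dist = math.hypot(center[0] - body_center[0], center[1] - body_center[1])
        if dist < max_distance:
            selected.append((subject_type, (x, y, w, h), dist))
    # closer subjects are more likely interaction partners
    return sorted(selected, key=lambda item: item[2])
```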
In S103, each key part in a preset human body key part list is marked in the human body region image, and a feature coordinate of each key part is obtained.
In this embodiment, the terminal device stores a human body key part list containing a plurality of human body key parts, preferably 17 key parts: the nose, eyes, ears, shoulders, wrists, hands, waist, knees and feet. By locating multiple human body key parts and tracking how these parts move, the accuracy of human body action recognition can be improved.
In this embodiment, the terminal device marks each key part in the human body region image, and the specific marking manner is as follows: determining the current posture type of the target object based on the contour information of the human body area image, wherein the posture type specifically comprises the following steps: standing type, walking type, lying type, sitting type and the like, and then marking each key part on the human body region image according to the corresponding relation between different key parts and posture types. Optionally, the corresponding relationship records a distance value and a relative direction vector between the key part and a contour center point of the human body region image, and the terminal device may locate each key part based on the distance value and the relative direction vector and perform a marking operation.
In the embodiment, the terminal device establishes an image coordinate axis based on the video image frame, and determines the feature coordinates of each key part according to the position of each key part on the video image frame. Optionally, the terminal device may use an endpoint of a lower left corner of the video image frame as the origin of coordinates, or may use an image center point as the origin of coordinates, which is determined according to default settings of an administrator or the device.
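The following sketch illustrates how the marked key parts and their feature coordinates could be represented, assuming a hypothetical keypoint locator and the lower-left-corner origin mentioned above; the part names are illustrative rather than the patent's exact 17-part list.

```python
# Illustrative part names only; the exact 17-part list and the keypoint
# locator are not given as code in the patent, so `pose_estimator` is a
# hypothetical helper object.
KEY_PART_LIST = ["nose", "left_eye", "right_eye", "left_ear", "right_ear",
                 "left_shoulder", "right_shoulder", "left_wrist", "right_wrist",
                 "left_hand", "right_hand", "waist",
                 "left_knee", "right_knee", "left_foot", "right_foot"]

def feature_coordinates(body_region_image, frame_height, pose_estimator):
    """Mark each key part of the preset list and return its feature
    coordinates in an image coordinate system whose origin is the lower-left
    corner of the video image frame (one of the options described in S103)."""
    coords = {}
    for part_id in KEY_PART_LIST:
        x, y = pose_estimator.locate(body_region_image, part_id)  # pixel position, y grows downward
        coords[part_id] = (x, frame_height - y)                   # flip to a bottom-left origin
    return coords
```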
In S104, a key feature sequence related to the key part is generated according to the feature coordinates corresponding to the key part in each video image frame.
In this embodiment, the terminal device needs to determine the motion trajectory of each key part. Therefore, based on the part identifier of the key part, it extracts the feature coordinates corresponding to that identifier from each video image frame, encapsulates all feature coordinates of the key part, and generates the key feature sequence of that key part. The order of the elements in the key feature sequence follows the frame numbers of the video image frames they belong to, i.e. the elements have a temporal order, so how the key part changes over time can be determined from the key feature sequence.
Optionally, if the key part is occluded in some video image frames and therefore has no corresponding feature coordinates, the terminal device may establish a feature curve for the key part on a preset coordinate axis by connecting the available feature coordinates in frame-number order, and fill in the feature coordinates of the missing video image frames with a smoothing algorithm.
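As a rough illustration of assembling the key feature sequence and filling occluded frames, the sketch below uses linear interpolation over frame numbers as a stand-in for the smoothing algorithm, which the embodiment does not specify.

```python
import numpy as np

def build_key_feature_sequence(per_frame_coords, frame_numbers):
    """Assemble the ordered key feature sequence for one key part (S104).

    `per_frame_coords` maps frame_number -> (x, y) and may omit frames in
    which the part was occluded; missing coordinates are filled by linear
    interpolation over the frame number (an assumed stand-in for the
    smoothing described in the text).
    """
    if not per_frame_coords:
        return []
    known = sorted(per_frame_coords.items())
    known_frames = np.array([f for f, _ in known], dtype=float)
    xs = np.array([c[0] for _, c in known], dtype=float)
    ys = np.array([c[1] for _, c in known], dtype=float)
    all_frames = np.array(sorted(frame_numbers), dtype=float)
    filled_x = np.interp(all_frames, known_frames, xs)
    filled_y = np.interp(all_frames, known_frames, ys)
    return list(zip(all_frames.astype(int), filled_x, filled_y))
```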
In S105, at least one candidate motion of the target object is determined according to the key feature sequence of each key part.
In this embodiment, the terminal device may determine the motion trajectories of the different key parts from their key feature sequences, and the action types consistent with those trajectories are taken as candidates. Specifically, the terminal device may determine the movement direction of each key part from its key feature sequence, match these movement directions one by one against the key-part movement directions of the action templates in the action library, and, based on the number of matched key parts, select for example the action templates whose number of matched key parts exceeds a preset matching threshold as candidate actions of the target object.
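A simplified sketch of this template matching, assuming each action template records an expected movement direction per key part and that agreement is tested with a cosine-style threshold; both assumptions go beyond what the embodiment states.

```python
import math

def movement_direction(sequence):
    """Net movement direction (unit vector) of one key part over its key
    feature sequence of (frame, x, y) entries."""
    _, x0, y0 = sequence[0]
    _, x1, y1 = sequence[-1]
    dx, dy = x1 - x0, y1 - y0
    norm = math.hypot(dx, dy) or 1.0
    return (dx / norm, dy / norm)

def candidate_actions(sequences, action_templates, min_matched_parts):
    """Pick action templates whose per-part directions agree with the
    observed ones for enough key parts (S105). `action_templates` maps an
    action name to {part_id: expected_direction}; this structure and the
    dot-product test are illustrative assumptions."""
    observed = {part: movement_direction(seq) for part, seq in sequences.items()}
    candidates = []
    for action, expected in action_templates.items():
        matched = 0
        for part, (ex, ey) in expected.items():
            if part not in observed:
                continue
            ox, oy = observed[part]
            if ox * ex + oy * ey > 0.7:   # directions roughly agree (assumed threshold)
                matched += 1
        if matched >= min_matched_parts:
            candidates.append(action)
    return candidates
```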
Optionally, the terminal device may set a maximum frame number, and then the terminal device divides the key feature sequence of the key portion into a plurality of feature subsequences based on the maximum frame number, and determines the action types of different feature subsequences, respectively.
In S106, a matching degree between each candidate action and the interactive object is respectively calculated, and an action type of the target object is determined from the candidate actions according to the matching degree.
In this embodiment, the terminal device may obtain the interaction behavior list of the interactable object, compute the similarity between each candidate action and every interaction behavior in the list, take the maximum similarity as the matching degree between that candidate action and the interactable object, and determine the action type of the target object from the matching degrees of all candidate actions. It should be noted that several action types may be recognized at once: for example, a user may hold a fruit and cut it with a fruit knife, i.e. the two interactive actions of "holding" and "cutting" both occur, so the terminal device may finally recognize more than one action type. For this reason, the terminal device may select every candidate action whose matching degree is greater than a preset matching threshold as an action type currently performed by the target object.
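The selection step could look roughly like the following, where `similarity` stands in for whatever behavior-similarity measure is used; the function shape and threshold handling are assumptions for illustration.

```python
def action_types(candidate_actions, interaction_behaviors, similarity, threshold):
    """Select the final action types (S106): the matching degree of a
    candidate is its best similarity against the interactable object's
    behavior list, and every candidate above the threshold is kept, since
    several interactions (e.g. "holding" and "cutting") may co-occur.
    `similarity(a, b)` is an assumed scoring callable returning a value in [0, 1]."""
    if not interaction_behaviors:
        return []
    selected = []
    for action in candidate_actions:
        matching_degree = max(similarity(action, behavior)
                              for behavior in interaction_behaviors)
        if matching_degree > threshold:
            selected.append((action, matching_degree))
    return selected
```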
As another example, the terminal device may determine action types in a video file obtained by video monitoring. Specifically, the video file may cover a security check area, and the interaction behaviors of the people in that area are determined in order to detect whether a user behaves abnormally. The target object to be identified is located in the monitoring video, and the action type between the target object and each interactable object is judged, for example whether the user hands over the luggage for security inspection as required or takes dangerous goods out of the luggage to evade inspection, thereby improving the accuracy of the security check process.
Optionally, the terminal device may identify a distance value between each interactable object and the human body region image, select an interactable object with the lowest distance value as a target interactive object, and calculate a matching degree between the target interactive object and each candidate action, thereby determining an action type of the target object.
As can be seen from the above, the method for recognizing human body actions according to the embodiments of the present invention acquires the video file of a target user whose action behavior needs to be analyzed, parses each video image frame of the video file, determines the human body region image contained in each frame, marks each key part in the human body region image, and determines how each part of the target object changes from the feature coordinates of the key parts, thereby determining the action type of the target object and automatically recognizing the human body action of the target object. Compared with existing human motion recognition technology, the embodiments of the present invention do not need a neural network to recognize the action type in the video images, do not rely on optical flow information, and avoid the recognition delay caused by time-sequence recursion, thereby improving recognition efficiency.
Fig. 2 shows a flowchart of a specific implementation of step S106 of the human body action recognition method according to the second embodiment of the present invention. Referring to fig. 2, compared with the embodiment described in fig. 1, step S106 of the method provided in this embodiment includes steps S1061 to S1066, detailed as follows:
further, the calculating a matching degree between each candidate action and the interactive object, and determining an action type of the target object from the candidate actions according to the matching degree includes:
in S1061, a distance value between the interactable object and the human body region image is obtained, and an interaction confidence of the interactable object is determined based on the distance value.
In this embodiment, the terminal device may mark the region image where the interactable object is located on the video image frame, take the center coordinate of that region image as the feature coordinate of the interactable object, calculate the Euclidean distance between this feature coordinate and the center coordinate of the human body region, and use that Euclidean distance as the distance value between the interactable object and the human body region image. The smaller the distance value, the larger the probability of interaction between the two; conversely, the larger the distance value, the smaller the probability of interaction. The terminal device can therefore calculate the interaction confidence between the interactable object and the target human body from the distance value.
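One possible mapping from distance to interaction confidence is sketched below; the exponential decay and the `scale` constant are assumptions, since the embodiment only states that a smaller distance should yield a larger confidence.

```python
import math

def interaction_confidence(object_center, body_center, scale=100.0):
    """Map the Euclidean distance between the interactable object and the
    human body region to a confidence in (0, 1]: the smaller the distance,
    the larger the confidence (S1061). The decay form and `scale` are
    illustrative choices, not the patent's formula."""
    distance = math.hypot(object_center[0] - body_center[0],
                          object_center[1] - body_center[1])
    return math.exp(-distance / scale)
```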
In S1062, the similarity between the key feature sequence and the standard feature sequence of each candidate action is calculated, and the similarity is identified as the action confidence of the candidate action.
In this embodiment, the terminal device needs to determine how likely the identified candidate action is to be correct, so it obtains the standard feature sequence of the candidate action and calculates the similarity between the key feature sequence over the video image frames and that standard feature sequence. The similarity may be calculated as follows: the terminal device draws a standard curve for the standard feature sequence on a preset coordinate axis, draws a behavior curve for the key feature sequence, calculates the area of the closed region enclosed between the two curves, and determines the similarity between the two sequences from that area. The larger the area, the larger the difference between the two actions and the smaller the similarity; conversely, the smaller the area, the smaller the difference and the larger the similarity.
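A numerical sketch of this area-based similarity, assuming both curves are sampled at the same frame indices; the trapezoidal integration and the mapping from area to a value in (0, 1] are illustrative choices rather than the patent's exact rule.

```python
import numpy as np

def curve_similarity(standard_curve, behavior_curve, area_scale=1.0):
    """Approximate S1062: the smaller the area enclosed between the standard
    curve and the behavior curve, the larger the similarity. Both curves are
    1-D arrays of a coordinate value sampled at the same frame indices;
    trapezoidal integration of |difference| approximates the enclosed area."""
    standard = np.asarray(standard_curve, dtype=float)
    behavior = np.asarray(behavior_curve, dtype=float)
    area = np.trapz(np.abs(standard - behavior))
    return 1.0 / (1.0 + area / area_scale)  # assumed area-to-similarity mapping
```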
In S1063, based on the object type of the interactable object, the interaction probability of the candidate action with the object type is determined.
In this embodiment, the terminal device determines the object type of the interactable object, that is, determines which type of article the interactable object is from its contour information, and then determines the interaction probability between that object type and the candidate action. For example, the object type "basketball" can be the action recipient of candidate actions such as "throwing" and "kicking", so the interaction probability is large; whereas candidate actions such as "sitting" and "standing" do not interact with the object type "basketball", so the interaction probability is small. The terminal device may obtain the action recipient object of each candidate action from an action record library, count the number of action records corresponding to the object type, and determine the interaction probability between the object type and the candidate action from that number.
In S1064, an object area image of the interactive object is extracted from the video image frame, and an object confidence of the interactive object is determined according to the object area image and a standard image preset by the object type.
In this embodiment, the terminal device further needs to determine the accuracy of identification of the interactive object, so that an object region image of the interactive object is obtained, similarity comparison is performed between the object region image and a standard image matched with the object type, and the object confidence of the interactive object is determined according to the similarity between the two images.
In S1065, importing the interaction confidence, the action confidence, the object confidence, and the interaction probability into a matching degree calculation model, and determining the matching degree of the candidate action; the matching degree calculation model specifically comprises the following steps:
where the inputs are the interaction confidence, the action confidence Sh, the object confidence So, the interaction probability, and the preset trigger probability of the candidate action a, and the output is the matching degree of the candidate action a. (The symbols other than Sh and So, and the formula itself, are rendered only as images in the original publication.)
In this embodiment, the terminal device imports the four calculated parameters into a matching degree calculation model, and determines the matching degree between the candidate motion and the interactive object, so that the motion type can be further screened and identified by the interactive object. In particular, the trigger probability of the candidate action may be calculated according to the action type corresponding to the previous image frame and the action type of the next image frame, and since the actions have a certain continuity, the trigger probability of the current action may be determined by the triggered action and the subsequent action.
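Since the matching degree formula is published only as an image, the sketch below simply multiplies the five quantities to show how they could be combined; this product form is an assumption for illustration, not the patent's formula.

```python
def matching_degree(interaction_conf, action_conf, object_conf,
                    interaction_prob, trigger_prob):
    """Combine the quantities listed in S1065 into one matching degree.
    The patent's actual formula is not reproduced here; a plain product is
    used purely to illustrate how the terms could be combined."""
    return (interaction_conf * action_conf * object_conf
            * interaction_prob * trigger_prob)
```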
In S1066, the candidate action with the matching degree greater than the matching threshold is selected as the action type of the target object.
In this embodiment, since there may be a plurality of interaction actions with the interactable object, the terminal device may select a candidate action with a matching degree greater than a preset matching threshold as the action type of the target object.
In the embodiment of the invention, the matching degree of each candidate action is calculated by determining the confidence degrees of the candidate action and the interactive object in multiple dimensions, so that the accuracy of calculating the matching degree can be improved, and the accuracy of identifying the human body action is improved.
Fig. 3 shows a flowchart of a specific implementation of step S104 of the human body action recognition method according to a third embodiment of the present invention. Referring to fig. 3, compared with the embodiment described in fig. 1, step S104 of the method provided in this embodiment includes steps S1041 to S1045, detailed as follows:
further, the generating a key feature sequence about the key part according to the feature coordinates of the key part corresponding to each video image frame includes:
in S1041, a first feature coordinate and a second feature coordinate of the same key portion in two video image frames adjacent to each other are obtained, and an image distance value between the first feature coordinate and the second feature coordinate is calculated.
In this embodiment, the terminal device needs to track the key parts of a human body: if the displacement of the same key part between two adjacent image frames is too large, the two detections are identified as belonging to different human bodies, so that mistaken tracking can be excluded quickly and the accuracy of motion recognition is improved. Based on this, the terminal device obtains the first feature coordinate and the second feature coordinate of the same key part in two video image frames with adjacent frame numbers, substitutes the two feature coordinates into the Euclidean distance formula, and calculates the distance between the two coordinate points, i.e. the image distance value. The image distance value is the distance between two coordinate points on the video image frame, not the moving distance of the key part in the real scene, so it still has to be numerically converted.
In S1042, an image area of the human body region image is calculated, and a photographing focal length between the target object and a photographing module is determined based on the image area.
In this embodiment, the terminal device obtains the area occupied by the human body region image in the video image frame, i.e. the image area. The terminal device is configured with a standard human body area and the standard shooting focal length corresponding to that area. It can calculate the ratio between the current image area and the standard human body area to determine a zoom ratio, and then calculate the actual shooting focal length between the target object and the shooting module from the zoom ratio and the standard shooting focal length.
In S1043, importing the shooting focal length, the image distance value, and the shooting frame rate of the video file into a distance conversion model, and calculating an actual moving distance of the key part in the two video image frames; the distance conversion model specifically comprises:
where Dist is the actual moving distance; StandardDist is the image distance value; FigDist is the shooting focal length; BaseDist is a preset reference focal length; ActFrame is the shooting frame rate; and BaseFrame is a preset reference frame rate. (The formula itself is rendered as an image in the original publication.)
In this embodiment, the shooting focal length corresponding to the video image frames, the image distance value of the key part between the two frames, and the shooting frame rate of the video file are imported into the distance conversion model, so that the actual moving distance of the key part in the scene can be calculated.
In S1044, the two feature coordinates of which the actual movement distance is smaller than the preset distance threshold are identified as feature coordinates that are associated with each other.
In this embodiment, if the terminal device detects that the actual moving distance is greater than or equal to the preset distance threshold, the key part has moved further than a normal movement allows, so the key part detections in the two video image frames are recognized as belonging to different target objects and the two feature coordinates are judged to be non-associated. Conversely, if the actual moving distance is smaller than the preset distance threshold, the key part in the two video image frames belongs to the same target object and the two feature coordinates are judged to be associated. This achieves the purpose of tracking the target object, avoids switching from tracking the motion trajectory of user A to tracking the motion trajectory of user B, and improves the accuracy of motion recognition.
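A sketch of this association filter follows; the distance conversion model is left as a pluggable `to_actual_distance` callable because its exact formula is published as an image, and the data layout is an assumption.

```python
import math

def associated_coordinates(coords_by_frame, to_actual_distance, max_move):
    """Filter the feature coordinates of one key part so that only mutually
    associated coordinates remain (S1041 to S1044). `coords_by_frame` is a
    list of (frame_number, (x, y)) sorted by frame number;
    `to_actual_distance(image_distance, frame_a, frame_b)` stands in for the
    distance conversion model, which is not reproduced here."""
    if not coords_by_frame:
        return []
    kept = [coords_by_frame[0]]
    for frame, (x, y) in coords_by_frame[1:]:
        prev_frame, (px, py) = kept[-1]
        image_distance = math.hypot(x - px, y - py)
        if to_actual_distance(image_distance, prev_frame, frame) < max_move:
            kept.append((frame, (x, y)))
        # otherwise the coordinate likely belongs to a different person and is dropped
    return kept
```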
In S1045, the key feature sequence related to the key part is generated from all the feature coordinates that are associated with each other.
In this embodiment, the terminal device filters all non-associated feature coordinates, encapsulates feature coordinates that are associated with each other, and generates a key feature sequence related to a key location.
In the embodiment of the invention, the actual moving distance of the key part under different frame numbers is calculated, so that the abnormal characteristic coordinate points can be filtered, and the accuracy of action identification is improved.
Fig. 4 shows a flowchart of a specific implementation of step S102 of the human body action recognition method according to a fourth embodiment of the present invention. Referring to fig. 4, compared with the embodiments described in fig. 1 to 3, step S102 of the method provided in this embodiment includes steps S1021 to S1024, detailed as follows:
further, the parsing each video image frame respectively and extracting a human body region image of the video image frame about the target object includes:
in S1021, contour curves of the video image frame are acquired through a contour recognition algorithm, and an area surrounded by each contour curve is calculated.
In the embodiment, the terminal device determines the contour curve in the video image frame through a contour recognition algorithm. The specific way of identifying the contour lines may be: and the terminal equipment calculates the difference value of the pixel values between two adjacent coordinate points, if the difference value is greater than a preset contour threshold value, the coordinate point is identified as the coordinate point where the contour line is located, and all identified coordinate points on the contour line are connected to form a continuous contour curve. Each closed contour curve corresponds to a photographic subject.
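For illustration, the sketch below approximates the described pixel-difference rule with a gradient threshold followed by OpenCV contour extraction; treating `cv2.findContours` as a practical stand-in for the connected-contour step is an assumption.

```python
import cv2
import numpy as np

def contour_areas(video_frame, contour_threshold=30):
    """Find closed contour curves in a frame and the area each encloses
    (S1021). The embodiment describes thresholding pixel-value differences
    between neighbouring points; here the gradient magnitude plus
    cv2.findContours is used as a practical stand-in for that rule."""
    gray = cv2.cvtColor(video_frame, cv2.COLOR_BGR2GRAY)
    grad_x = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    grad_y = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    magnitude = cv2.magnitude(grad_x, grad_y)
    edges = (magnitude > contour_threshold).astype(np.uint8) * 255
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # each closed contour corresponds to one photographic subject
    return [(curve, cv2.contourArea(curve)) for curve in contours]
```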
In this embodiment, the terminal device marks all contour curves on the video image frame and integrates over the region enclosed by each contour curve and/or between a curve and the boundary of the video image frame, so as to obtain the region area corresponding to each contour curve.
In S1022, a human body recognition window of the video image frame is generated according to each of the area areas.
In this embodiment, because of the different scaling ratios, the size of the human body identification window needs to be adjusted accordingly, and based on this, the terminal device may calculate the scaling ratio corresponding to the video image frame according to the area of each shooting object, and query the size of the human body identification window associated with the scaling ratio, and then generate the human body identification window matched with the video image frame.
Optionally, in this embodiment, the terminal device adopts the YOLOv3 human body recognition algorithm, which requires three human body recognition windows to be configured. Based on this, the terminal device builds the distribution of the region areas enclosed by the contour curves, selects the three region areas with the highest distribution density as feature areas, and generates the human body recognition windows corresponding to these three areas, i.e. three feature maps based on the three feature areas.
In S1023, a plurality of candidate area images are generated by performing a sliding frame on the video image frame based on the human body recognition window.
In this embodiment, after generating the human body recognition window corresponding to the scaling of the video image frame, the terminal device slides this window over the video image frame and takes the region framed at each step as a candidate region image. If there are several human body recognition windows of different sizes, concurrent threads corresponding to the number of windows are created, the video image frame is copied, and the windows are slid over different copies by these concurrent threads; that is, the sliding framing operations of differently sized windows are independent of each other, producing candidate region images of different sizes.
In S1024, the coincidence rate between each candidate region image and the standard human body template is calculated respectively, and the candidate region image with the coincidence rate larger than a preset coincidence rate threshold value is selected as the human body region image.
In this embodiment, the terminal device calculates the coincidence rate between each candidate region image and the standard human body template. A higher coincidence rate means the photographic object in that region is more similar to a human body, so the candidate region can be identified as a human body region image; conversely, a lower coincidence rate means the shape in the region differs from a human body, and the region is recognized as a non-human-body region image. Since a video image frame may contain several different users, the terminal device may identify every candidate region whose coincidence rate exceeds the preset threshold as a human body region image. In this case the terminal device locates the face image in each human body region image, matches it against the standard face of the target object, and selects the human body region images matching the standard face as the human body region images of the target object.
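A compact sketch of the sliding framing and overlap test follows, using an intersection-over-union style measure as a stand-in for the coincidence rate, which the embodiment does not define precisely; box layouts and the stride parameter are assumptions.

```python
def sliding_candidate_regions(frame_shape, window_size, stride):
    """Generate candidate region boxes by sliding a human recognition window
    over the frame (S1023). Boxes are (left, top, width, height)."""
    height, width = frame_shape[:2]
    win_w, win_h = window_size
    boxes = []
    for top in range(0, max(height - win_h, 0) + 1, stride):
        for left in range(0, max(width - win_w, 0) + 1, stride):
            boxes.append((left, top, win_w, win_h))
    return boxes

def coincidence_rate(box, template_box):
    """Intersection-over-union style overlap used here as a stand-in for the
    coincidence rate between a candidate region and the standard human body
    template (S1024)."""
    ax, ay, aw, ah = box
    bx, by, bw, bh = template_box
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0
```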
In the embodiment of the invention, the contour curves in the video image frame are obtained, so that the scaling of the video image frame is determined based on the area of each contour curve, and the human body identification window corresponding to the video image frame is generated to perform the identification operation of the human body area image, thereby improving the identification accuracy.
Fig. 5 shows a flowchart of a specific implementation of step S105 of the human body action recognition method according to a fifth embodiment of the present invention. Referring to fig. 5, compared with the embodiments described in fig. 1 to 3, step S105 of the method provided in this embodiment includes steps S1051 and S1052, detailed as follows:
further, the determining at least one candidate action of the target object through the key feature sequences of the respective key parts includes:
In S1051, the feature coordinates of each key feature sequence are marked on a preset coordinate axis, and a part variation curve is generated for each key part.
In this embodiment, the terminal device marks each feature coordinate on a preset coordinate axis according to the coordinate value of each feature coordinate in each key feature sequence and the frame number of the corresponding video image frame, and connects each feature coordinate to generate a part variation curve about the key part. The coordinate axis can be a coordinate axis established on the basis of the video image frame, the horizontal axis corresponds to the length of the video image frame, and the vertical axis corresponds to the width of the video image frame.
In S1052, the part variation curve is matched with a standard motion curve of each candidate motion in a preset motion library, and the candidate motion of the target object is determined based on a matching result.
In this embodiment, the terminal device matches the part variation curves of all the key parts with the standard action curves of each candidate action in the preset action library, calculates the coincidence rate of the two variation curves, and selects one candidate action with the highest coincidence rate as the action type of the target object.
In the embodiment of the invention, the action type of the target object can be intuitively determined by drawing the part change curve of the key part, so that the accuracy of the action type is improved.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Fig. 6 shows a block diagram of a human body action recognition device according to an embodiment of the present invention; the device includes units for executing the steps of the embodiment corresponding to fig. 1. Please refer to the related description of the embodiment corresponding to fig. 1. For convenience of explanation, only the parts related to this embodiment are shown.
Referring to fig. 6, the human motion recognition apparatus includes:
a video file acquisition unit 61 for acquiring a video file of a target object; the video file comprises a plurality of video image frames;
a human body region image extracting unit 62, configured to parse each of the video image frames, respectively, extract a human body region image of the video image frames about the target object, and determine an interactable object included in the video image frames;
a key part identification unit 63, configured to mark each key part in a preset human body key part list in the human body region image, and obtain a feature coordinate of each key part;
a key feature sequence generating unit 64, configured to generate a key feature sequence related to the key part according to the feature coordinates of the key part in each of the video image frames;
a candidate motion recognition unit 65, configured to determine at least one candidate motion of the target object through the key feature sequence of each of the key portions;
and an action type identification unit 66, configured to calculate matching degrees between each candidate action and the interactable object, respectively, and determine an action type of the target object from the candidate actions according to the matching degrees.
Optionally, the action type recognition unit 66 includes:
the interaction confidence coefficient calculation unit is used for acquiring a distance value between the interactive object and the human body region image and determining the interaction confidence coefficient of the interactive object based on the distance value;
the action confidence coefficient identification unit is used for respectively calculating the similarity between the key feature sequence and the standard feature sequence of each candidate action and identifying the similarity as the action confidence coefficient of the candidate action;
an interaction probability determination unit, configured to determine, based on an object type of the interactable object, an interaction probability of the candidate action with the object type;
the object confidence coefficient identification unit is used for extracting an object region image of the interactive object from the video image frame and determining the object confidence coefficient of the interactive object according to the object region image and a standard image preset by the object type;
the matching degree calculation unit is used for importing the interaction confidence degree, the action confidence degree, the object confidence degree and the interaction probability into a matching degree calculation model to determine the matching degree of the candidate action; the matching degree calculation model specifically comprises the following steps:
where the inputs are the interaction confidence, the action confidence Sh, the object confidence So, the interaction probability, and the preset trigger probability of the candidate action a, and the output is the matching degree of the candidate action a;
and the candidate action selecting unit is used for selecting the candidate action with the matching degree larger than a matching threshold value and identifying the candidate action as the action type of the target object.
Optionally, the key feature sequence generating unit 64 includes:
the image distance value calculating unit is used for acquiring a first characteristic coordinate and a second characteristic coordinate of the same key part in two video image frames with adjacent frame numbers and calculating an image distance value between the first characteristic coordinate and the second characteristic coordinate;
the shooting focal length determining unit is used for calculating the image area of the human body region image and determining the shooting focal length between the target object and the shooting module based on the image area;
the actual moving distance calculation unit is used for importing the shooting focal length, the image distance value and the shooting frame rate of the video file into a distance conversion model and calculating the actual moving distance of the key part in the two video image frames; the distance conversion model specifically comprises:
where Dist is the actual moving distance; StandardDist is the image distance value; FigDist is the shooting focal length; BaseDist is a preset reference focal length; ActFrame is the shooting frame rate; and BaseFrame is a preset reference frame rate;
the associated coordinate identification unit is used for identifying the two feature coordinates of which the actual moving distance is smaller than a preset distance threshold value as feature coordinates which are associated with each other;
and the associated coordinate packaging unit is used for generating the key feature sequence related to the key part according to all the feature coordinates which are associated with each other.
Optionally, the human body region image extracting unit 62 includes:
the contour curve acquisition unit is used for acquiring contour curves of the video image frames through a contour recognition algorithm and calculating the area of a region surrounded by each contour curve;
the human body identification window generating unit is used for generating a human body identification window of the video image frame according to the area of each region;
a candidate region image extraction unit, configured to perform sliding framing on the video image frame based on the human body recognition window, and generate a plurality of candidate region images;
and the human body area image matching unit is used for respectively calculating the coincidence rate between each candidate area image and a standard human body template, and selecting the candidate area image with the coincidence rate larger than a preset coincidence rate threshold value as the human body area image.
Optionally, the action type recognition unit 65 includes:
the part variation curve generating unit is used for marking the characteristic coordinates of each key characteristic sequence in a preset coordinate axis and generating part variation curves related to each key part;
and the candidate action selection unit is used for matching the part change curve with a standard action curve of each candidate action in a preset action library and determining the action type of the target object based on a matching result.
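The sketch below shows, under stated assumptions, how part change curves could be matched against the standard action curves of a preset action library: each key part's feature coordinates are stacked into a curve, resampled to a common length, and compared point-wise. The mean-squared-difference measure and every name here are illustrative; the patent does not specify the matching procedure.

```python
import numpy as np


def part_change_curve(feature_coords):
    """Stack a key part's per-frame feature coordinates into an (N, 2) curve of
    its x and y positions over the frame index."""
    return np.asarray(feature_coords, dtype=float)


def curve_distance(curve, standard_curve, samples=100):
    # Resample both curves to a common length and accumulate the per-axis
    # mean squared difference (an assumed similarity measure).
    t = np.linspace(0.0, 1.0, samples)
    dist = 0.0
    for axis in range(curve.shape[1]):
        a = np.interp(t, np.linspace(0.0, 1.0, len(curve)), curve[:, axis])
        b = np.interp(t, np.linspace(0.0, 1.0, len(standard_curve)), standard_curve[:, axis])
        dist += float(np.mean((a - b) ** 2))
    return dist


def best_candidate_action(part_curves, action_library):
    """Pick the library action whose standard curves, summed over all key
    parts, are closest to the observed part change curves."""
    scores = {name: sum(curve_distance(part_curves[part], standard[part])
                        for part in part_curves)
              for name, standard in action_library.items()}
    return min(scores, key=scores.get)
```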
Therefore, the human body motion recognition device provided by this embodiment of the invention can recognize the action type in a video image without depending on a neural network or on optical flow information, and avoids the recognition delay caused by time-sequence recursion, thereby improving recognition efficiency.
Fig. 7 is a schematic diagram of a terminal device according to another embodiment of the present invention. As shown in fig. 7, the terminal device 7 of this embodiment includes: a processor 70, a memory 71 and a computer program 72, such as a human motion recognition program, stored in said memory 71 and executable on said processor 70. The processor 70 implements the steps in the above-mentioned embodiments of the human body motion recognition method when executing the computer program 72, for example, S101 to S106 shown in fig. 1. Alternatively, the processor 70, when executing the computer program 72, implements the functions of the units in the above-described device embodiments, such as the functions of the modules 61 to 66 shown in fig. 6.
Illustratively, the computer program 72 may be divided into one or more units, which are stored in the memory 71 and executed by the processor 70 to carry out the present invention. The one or more units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 72 in the terminal device 7. For example, the computer program 72 may be divided into a video file acquisition unit, a human body region image extraction unit, a key part recognition unit, a key feature sequence generation unit, a candidate action recognition unit, and an action type recognition unit, each of which functions as described above.
The terminal device 7 may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or another computing device. The terminal device may include, but is not limited to, the processor 70 and the memory 71. Those skilled in the art will appreciate that Fig. 7 is merely an example of the terminal device 7 and does not constitute a limitation on it; the terminal device may include more or fewer components than shown, combine certain components, or use different components, and may further include, for example, input/output devices, network access devices, and buses.
The Processor 70 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 71 may be an internal storage unit of the terminal device 7, such as a hard disk or a memory of the terminal device 7. The memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 7. Further, the memory 71 may also include both an internal storage unit and an external storage device of the terminal device 7. The memory 71 is used for storing the computer program and other programs and data required by the terminal device. The memory 71 may also be used to temporarily store data that has been output or is to be output.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A human body action recognition method is characterized by comprising the following steps:
acquiring a video file of a target object; the video file comprises a plurality of video image frames;
respectively analyzing each video image frame, extracting a human body area image of the target object in the video image frame, and determining an interactive object contained in the video image frame;
marking each key part in a preset human body key part list in the human body area image, and acquiring the characteristic coordinates of each key part;
generating a key feature sequence related to the key part according to the feature coordinates corresponding to the key part in each video image frame;
determining at least one candidate action of the target object through the key feature sequence of each key part;
and respectively calculating the matching degree between each candidate action and the interactive object, and determining the action type of the target object from the candidate actions according to the matching degree.
2. The identification method according to claim 1, wherein said respectively calculating a matching degree between each of the candidate actions and the interactable object, and determining the action type of the target object from the candidate actions according to the matching degree comprises:
acquiring a distance value between the interactive object and the human body region image, and determining an interaction confidence coefficient of the interactive object based on the distance value;
respectively calculating the similarity between the key feature sequence and the standard feature sequence of each candidate action, and identifying the similarity as the action confidence of the candidate action;
determining an interaction probability of the candidate action with the object type based on the object type of the interactable object;
extracting an object region image of the interactive object from the video image frame, and determining an object confidence coefficient of the interactive object according to the object region image and a standard image preset for the object type;
importing the interaction confidence level, the action confidence level, the object confidence level and the interaction probability into a matching degree calculation model, and determining the matching degree of the candidate action; wherein the matching degree calculation model appears only as a formula image in the original publication and is not reproduced here; in the model, the matching degree of the candidate action a is calculated from the interaction confidence, the action confidence s_h, the object confidence s_o, the interaction probability, and the preset trigger probability of the candidate action a;
and selecting the candidate action with the matching degree larger than a matching threshold value as the action type of the target object.
3. The method according to claim 1, wherein the generating a key feature sequence about the key part according to the feature coordinates corresponding to the key part in each video image frame comprises:
acquiring a first characteristic coordinate and a second characteristic coordinate of the same key part in two video image frames with adjacent frame numbers, and calculating an image distance value between the first characteristic coordinate and the second characteristic coordinate;
calculating the image area of the human body region image, and determining the shooting focal distance between the target object and a shooting module based on the image area;
importing the shooting focal length, the image distance value and the shooting frame rate of the video file into a distance conversion model, and calculating the actual moving distance of the key part in the two video image frames; wherein the distance conversion model appears only as a formula image in the original publication and is not reproduced here; in the model, Dist is the actual moving distance, StandardDist is the image distance value, FigDist is the shooting focal length, BaseDist is a preset reference focal length, ActFrame is the shooting frame rate, and BaseFrame is the reference frame rate;
identifying the two feature coordinates of which the actual moving distance is smaller than a preset distance threshold value as feature coordinates which are related to each other;
and generating the key feature sequence related to the key part according to all the feature coordinates which are mutually associated.
4. The identification method according to any one of claims 1 to 3, wherein the respectively analyzing each of the video image frames, extracting the human body region image of the target object in the video image frame, and determining the interactable object contained in the video image frame comprises:
acquiring contour curves of the video image frames through a contour recognition algorithm, and calculating the area of a region surrounded by each contour curve;
generating a human body identification window of the video image frame according to the area of each region;
performing sliding framing on the video image frame based on the human body identification window to generate a plurality of candidate area images;
respectively calculating the coincidence rate between each candidate region image and a standard human body template, and selecting the candidate region image with the coincidence rate larger than a preset coincidence rate threshold value as the human body region image.
5. The identification method according to any one of claims 1 to 3, wherein the determining at least one candidate action of the target object by the key feature sequence of each key part comprises:
marking the feature coordinates of each key feature sequence in a preset coordinate axis, and generating a part change curve about each key part;
and matching the part change curve with a standard action curve of each candidate action in a preset action library, and determining the candidate action of the target object based on a matching result.
6. An apparatus for recognizing a human body motion, comprising:
a video file acquisition unit for acquiring a video file of a target object; the video file comprises a plurality of video image frames;
the human body area image extracting unit is used for respectively analyzing each video image frame, extracting a human body area image related to the target object in the video image frames and determining an interactive object contained in the video image frames;
the key part identification unit is used for marking each key part in a preset human body key part list in the human body region image and acquiring the characteristic coordinates of each key part;
a key feature sequence generating unit, configured to generate a key feature sequence related to the key portion according to the feature coordinates of the key portion corresponding to each of the video image frames;
a candidate action recognition unit, configured to determine at least one candidate action of the target object through the key feature sequence of each key portion;
and the action type identification unit is used for respectively calculating the matching degree between each candidate action and the interactive object and determining the action type of the target object from the candidate actions according to the matching degree.
7. The apparatus according to claim 6, wherein the action type identifying unit includes:
the interaction confidence coefficient calculation unit is used for acquiring a distance value between the interactive object and the human body region image and determining the interaction confidence coefficient of the interactive object based on the distance value;
the action confidence coefficient identification unit is used for respectively calculating the similarity between the key feature sequence and the standard feature sequence of each candidate action and identifying the similarity as the action confidence coefficient of the candidate action;
an interaction probability determination unit, configured to determine, based on an object type of the interactable object, an interaction probability of the candidate action with the object type;
the object confidence coefficient identification unit is used for extracting an object region image of the interactive object from the video image frame and determining the object confidence coefficient of the interactive object according to the object region image and a standard image preset for the object type;
the matching degree calculation unit is used for importing the interaction confidence degree, the action confidence degree, the object confidence degree and the interaction probability into a matching degree calculation model to determine the matching degree of the candidate action; the matching degree calculation model appears only as a formula image in the original publication and is not reproduced here; in the model, the matching degree of the candidate action a is calculated from the interaction confidence, the action confidence s_h, the object confidence s_o, the interaction probability, and the preset trigger probability of the candidate action a;
and the candidate action selecting unit is used for selecting the candidate action with the matching degree larger than a matching threshold value and identifying the candidate action as the action type of the target object.
8. The apparatus according to claim 6, wherein the key feature sequence generating unit includes:
the image distance value calculating unit is used for acquiring a first characteristic coordinate and a second characteristic coordinate of the same key part in two video image frames with adjacent frame numbers and calculating an image distance value between the first characteristic coordinate and the second characteristic coordinate;
the shooting focal length determining unit is used for calculating the image area of the human body region image and determining the shooting focal length between the target object and the shooting module based on the image area;
the actual moving distance calculation unit is used for importing the shooting focal length, the image distance value and the shooting frame rate of the video file into a distance conversion model and calculating the actual moving distance of the key part in the two video image frames; the distance conversion model appears only as a formula image in the original publication and is not reproduced here; in the model, Dist is the actual moving distance, StandardDist is the image distance value, FigDist is the shooting focal length, BaseDist is a preset reference focal length, ActFrame is the shooting frame rate, and BaseFrame is the reference frame rate;
the associated coordinate identification unit is used for identifying the two feature coordinates of which the actual moving distance is smaller than a preset distance threshold value as feature coordinates which are associated with each other;
and the associated coordinate packaging unit is used for generating the key feature sequence related to the key part according to all the feature coordinates which are associated with each other.
9. A terminal device, characterized in that the terminal device comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 5.
10. A computer-readable storage medium, in which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
CN201910264883.XA 2019-04-03 2019-04-03 Human body action recognition method and device Active CN110135246B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910264883.XA CN110135246B (en) 2019-04-03 2019-04-03 Human body action recognition method and device
PCT/CN2019/103161 WO2020199479A1 (en) 2019-04-03 2019-08-29 Human motion recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910264883.XA CN110135246B (en) 2019-04-03 2019-04-03 Human body action recognition method and device

Publications (2)

Publication Number Publication Date
CN110135246A true CN110135246A (en) 2019-08-16
CN110135246B CN110135246B (en) 2023-10-20

Family

ID=67569223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910264883.XA Active CN110135246B (en) 2019-04-03 2019-04-03 Human body action recognition method and device

Country Status (2)

Country Link
CN (1) CN110135246B (en)
WO (1) WO2020199479A1 (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738588A (en) * 2019-08-26 2020-01-31 恒大智慧科技有限公司 Intelligent community toilet management method and computer storage medium
CN111288986A (en) * 2019-12-31 2020-06-16 中科彭州智慧产业创新中心有限公司 Motion recognition method and motion recognition device
CN111539352A (en) * 2020-04-27 2020-08-14 支付宝(杭州)信息技术有限公司 Method and system for judging human body joint motion direction
CN111814775A (en) * 2020-09-10 2020-10-23 平安国际智慧城市科技股份有限公司 Target object abnormal behavior identification method, device, terminal and storage medium
CN112418137A (en) * 2020-12-03 2021-02-26 杭州云笔智能科技有限公司 Operation identification method and system for target object
CN112417205A (en) * 2019-08-20 2021-02-26 富士通株式会社 Target retrieval device and method and electronic equipment
CN112528823A (en) * 2020-12-04 2021-03-19 燕山大学 Striped shark movement behavior analysis method and system based on key frame detection and semantic component segmentation
CN112712906A (en) * 2020-12-29 2021-04-27 安徽科大讯飞医疗信息技术有限公司 Video image processing method and device, electronic equipment and storage medium
CN112784760A (en) * 2021-01-25 2021-05-11 北京百度网讯科技有限公司 Human behavior recognition method, device, equipment and storage medium
CN112883816A (en) * 2021-01-26 2021-06-01 百度在线网络技术(北京)有限公司 Information pushing method and device
CN113496143A (en) * 2020-03-19 2021-10-12 北京市商汤科技开发有限公司 Action recognition method and device, and storage medium
CN113657278A (en) * 2021-08-18 2021-11-16 成都信息工程大学 Motion gesture recognition method, device, equipment and storage medium
CN113850160A (en) * 2021-09-08 2021-12-28 橙狮体育(北京)有限公司 Method and device for counting repeated actions
CN114157526A (en) * 2021-12-23 2022-03-08 广州新华学院 Digital image recognition-based home security remote monitoring method and device
CN115171217A (en) * 2022-07-27 2022-10-11 北京拙河科技有限公司 Action recognition method and system under dynamic background
CN116704405A (en) * 2023-05-22 2023-09-05 阿里巴巴(中国)有限公司 Behavior recognition method, electronic device and storage medium
WO2024125622A1 (en) * 2022-12-15 2024-06-20 华为技术有限公司 Movement matching method and apparatus

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528785A (en) * 2020-11-30 2021-03-19 联想(北京)有限公司 Information processing method and device
CN112364835B (en) * 2020-12-09 2023-08-11 武汉轻工大学 Video information frame taking method, device, equipment and storage medium
CN112507875A (en) * 2020-12-10 2021-03-16 上海连尚网络科技有限公司 Method and equipment for detecting video repetition
CN112580499A (en) * 2020-12-17 2021-03-30 上海眼控科技股份有限公司 Text recognition method, device, equipment and storage medium
CN112529943B (en) * 2020-12-22 2024-01-16 深圳市优必选科技股份有限公司 Object detection method, object detection device and intelligent equipment
CN112906660A (en) * 2021-03-31 2021-06-04 浙江大华技术股份有限公司 Security check early warning method and device, storage medium and electronic equipment
CN113553562B (en) * 2021-05-14 2024-10-08 浙江大华技术股份有限公司 Test method, test device and storage medium for supporting multi-person constitution
CN113288087B (en) * 2021-06-25 2022-08-16 成都泰盟软件有限公司 Virtual-real linkage experimental system based on physiological signals
CN113569693A (en) * 2021-07-22 2021-10-29 南京华捷艾米软件科技有限公司 Motion state identification method, device and equipment
CN113553951B (en) * 2021-07-23 2024-04-16 北京市商汤科技开发有限公司 Object association method and device, electronic equipment and computer readable storage medium
CN113537121B (en) * 2021-07-28 2024-06-21 浙江大华技术股份有限公司 Identity recognition method and device, storage medium and electronic equipment
CN113784059B (en) * 2021-08-03 2023-08-18 阿里巴巴(中国)有限公司 Video generation and splicing method, equipment and storage medium for clothing production
CN113869274B (en) * 2021-10-13 2022-09-06 深圳联和智慧科技有限公司 Unmanned aerial vehicle intelligent tracking monitoring method and system based on city management
CN115620392B (en) * 2022-09-26 2024-08-20 珠海视熙科技有限公司 Action counting method, device, medium and body-building equipment
CN118015710A (en) * 2024-04-09 2024-05-10 浙江深象智能科技有限公司 Intelligent sports identification method and device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930767A (en) * 2016-04-06 2016-09-07 南京华捷艾米软件科技有限公司 Human body skeleton-based action recognition method
CN106022251A (en) * 2016-05-17 2016-10-12 沈阳航空航天大学 Abnormal double-person interaction behavior recognition method based on vision co-occurrence matrix sequence
WO2017000917A1 (en) * 2015-07-01 2017-01-05 乐视控股(北京)有限公司 Positioning method and apparatus for motion-stimulation button
CN107423721A (en) * 2017-08-08 2017-12-01 珠海习悦信息技术有限公司 Interactive action detection method, device, storage medium and processor
CN108197589A (en) * 2018-01-19 2018-06-22 北京智能管家科技有限公司 Semantic understanding method, apparatus, equipment and the storage medium of dynamic human body posture
WO2018113405A1 (en) * 2016-12-19 2018-06-28 广州虎牙信息科技有限公司 Live broadcast interaction method based on video stream, and corresponding apparatus thereof
CN108304762A (en) * 2017-11-30 2018-07-20 腾讯科技(深圳)有限公司 A kind of human body attitude matching process and its equipment, storage medium, terminal
CN109325456A (en) * 2018-09-29 2019-02-12 佳都新太科技股份有限公司 Target identification method, device, target identification equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101817583B1 (en) * 2015-11-30 2018-01-12 한국생산기술연구원 System and method for analyzing behavior pattern using depth image
CN107335192A (en) * 2017-05-26 2017-11-10 深圳奥比中光科技有限公司 Move supplemental training method, apparatus and storage device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017000917A1 (en) * 2015-07-01 2017-01-05 乐视控股(北京)有限公司 Positioning method and apparatus for motion-stimulation button
CN105930767A (en) * 2016-04-06 2016-09-07 南京华捷艾米软件科技有限公司 Human body skeleton-based action recognition method
CN106022251A (en) * 2016-05-17 2016-10-12 沈阳航空航天大学 Abnormal double-person interaction behavior recognition method based on vision co-occurrence matrix sequence
WO2018113405A1 (en) * 2016-12-19 2018-06-28 广州虎牙信息科技有限公司 Live broadcast interaction method based on video stream, and corresponding apparatus thereof
CN107423721A (en) * 2017-08-08 2017-12-01 珠海习悦信息技术有限公司 Interactive action detection method, device, storage medium and processor
CN108304762A (en) * 2017-11-30 2018-07-20 腾讯科技(深圳)有限公司 A kind of human body attitude matching process and its equipment, storage medium, terminal
CN108197589A (en) * 2018-01-19 2018-06-22 北京智能管家科技有限公司 Semantic understanding method, apparatus, equipment and the storage medium of dynamic human body posture
CN109325456A (en) * 2018-09-29 2019-02-12 佳都新太科技股份有限公司 Target identification method, device, target identification equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MEI XUE; ZHANG JIFA; XU SONGSONG; HU SHI: "View-independent action recognition method based on motion direction", Computer Engineering (计算机工程), vol. 38, no. 15, pages 159 - 165 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112417205A (en) * 2019-08-20 2021-02-26 富士通株式会社 Target retrieval device and method and electronic equipment
CN110738588A (en) * 2019-08-26 2020-01-31 恒大智慧科技有限公司 Intelligent community toilet management method and computer storage medium
CN111288986A (en) * 2019-12-31 2020-06-16 中科彭州智慧产业创新中心有限公司 Motion recognition method and motion recognition device
CN111288986B (en) * 2019-12-31 2022-04-12 中科彭州智慧产业创新中心有限公司 Motion recognition method and motion recognition device
CN113496143A (en) * 2020-03-19 2021-10-12 北京市商汤科技开发有限公司 Action recognition method and device, and storage medium
CN111539352A (en) * 2020-04-27 2020-08-14 支付宝(杭州)信息技术有限公司 Method and system for judging human body joint motion direction
CN111814775B (en) * 2020-09-10 2020-12-11 平安国际智慧城市科技股份有限公司 Target object abnormal behavior identification method, device, terminal and storage medium
CN111814775A (en) * 2020-09-10 2020-10-23 平安国际智慧城市科技股份有限公司 Target object abnormal behavior identification method, device, terminal and storage medium
CN112418137A (en) * 2020-12-03 2021-02-26 杭州云笔智能科技有限公司 Operation identification method and system for target object
CN112418137B (en) * 2020-12-03 2022-10-25 杭州云笔智能科技有限公司 Operation identification method and system for target object
CN112528823A (en) * 2020-12-04 2021-03-19 燕山大学 Striped shark movement behavior analysis method and system based on key frame detection and semantic component segmentation
CN112528823B (en) * 2020-12-04 2022-08-19 燕山大学 Method and system for analyzing batcharybus movement behavior based on key frame detection and semantic component segmentation
CN112712906A (en) * 2020-12-29 2021-04-27 安徽科大讯飞医疗信息技术有限公司 Video image processing method and device, electronic equipment and storage medium
CN112784760A (en) * 2021-01-25 2021-05-11 北京百度网讯科技有限公司 Human behavior recognition method, device, equipment and storage medium
CN112784760B (en) * 2021-01-25 2024-04-12 北京百度网讯科技有限公司 Human behavior recognition method, device, equipment and storage medium
US11823494B2 (en) 2021-01-25 2023-11-21 Beijing Baidu Netcom Science Technology Co., Ltd. Human behavior recognition method, device, and storage medium
CN112883816A (en) * 2021-01-26 2021-06-01 百度在线网络技术(北京)有限公司 Information pushing method and device
CN113657278A (en) * 2021-08-18 2021-11-16 成都信息工程大学 Motion gesture recognition method, device, equipment and storage medium
CN113850160A (en) * 2021-09-08 2021-12-28 橙狮体育(北京)有限公司 Method and device for counting repeated actions
CN114157526B (en) * 2021-12-23 2022-08-12 广州新华学院 Digital image recognition-based home security remote monitoring method and device
CN114157526A (en) * 2021-12-23 2022-03-08 广州新华学院 Digital image recognition-based home security remote monitoring method and device
CN115171217A (en) * 2022-07-27 2022-10-11 北京拙河科技有限公司 Action recognition method and system under dynamic background
CN115171217B (en) * 2022-07-27 2023-03-03 北京拙河科技有限公司 Action recognition method and system under dynamic background
WO2024125622A1 (en) * 2022-12-15 2024-06-20 华为技术有限公司 Movement matching method and apparatus
CN116704405A (en) * 2023-05-22 2023-09-05 阿里巴巴(中国)有限公司 Behavior recognition method, electronic device and storage medium
CN116704405B (en) * 2023-05-22 2024-06-25 阿里巴巴(中国)有限公司 Behavior recognition method, electronic device and storage medium

Also Published As

Publication number Publication date
WO2020199479A1 (en) 2020-10-08
CN110135246B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN110135246B (en) Human body action recognition method and device
CN110147717B (en) Human body action recognition method and device
CN109117803B (en) Face image clustering method and device, server and storage medium
CN110532984B (en) Key point detection method, gesture recognition method, device and system
US10616475B2 (en) Photo-taking prompting method and apparatus, an apparatus and non-volatile computer storage medium
US10534957B2 (en) Eyeball movement analysis method and device, and storage medium
WO2019128508A1 (en) Method and apparatus for processing image, storage medium, and electronic device
Yan et al. Learning the change for automatic image cropping
JP5554984B2 (en) Pattern recognition method and pattern recognition apparatus
US20180211104A1 (en) Method and device for target tracking
US9465992B2 (en) Scene recognition method and apparatus
Wang et al. Background-driven salient object detection
CN110287772B (en) Method and device for extracting palm and palm center area of plane palm
US10650234B2 (en) Eyeball movement capturing method and device, and storage medium
CN110610127B (en) Face recognition method and device, storage medium and electronic equipment
US8417026B2 (en) Gesture recognition methods and systems
CN111626163B (en) Human face living body detection method and device and computer equipment
JP6351243B2 (en) Image processing apparatus and image processing method
CN114677607A (en) Real-time pedestrian counting method and device based on face recognition
CN110363790A (en) Target tracking method, device and computer readable storage medium
Tsitsoulis et al. A methodology for extracting standing human bodies from single images
JP2024098114A (en) Information processing apparatus, information processing method, and program
CN108875488B (en) Object tracking method, object tracking apparatus, and computer-readable storage medium
CN113837006A (en) Face recognition method and device, storage medium and electronic equipment
CN111368674B (en) Image recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant