CN113536857A - Target action recognition method, device, server and storage medium - Google Patents

Target action recognition method, device, server and storage medium

Info

Publication number
CN113536857A
CN113536857A (Application No. CN202010313131.0A)
Authority
CN
China
Prior art keywords
human body
key point
frame
key points
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010313131.0A
Other languages
Chinese (zh)
Inventor
陈小强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Lumi United Technology Co Ltd
Lumi United Technology Co Ltd
Original Assignee
Lumi United Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lumi United Technology Co Ltd filed Critical Lumi United Technology Co Ltd
Priority to CN202010313131.0A priority Critical patent/CN113536857A/en
Publication of CN113536857A publication Critical patent/CN113536857A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the application disclose a target action identification method, apparatus, server and storage medium. The method comprises: selecting a plurality of frames of images from video data; extracting human body key point information from each frame of image, the information comprising human body key points and their corresponding coordinates; and identifying a target action in the video data according to the human body key points and the corresponding coordinates. Because the human body key points and their coordinates are obtained for each frame and the target action is identified directly from them, a large amount of computing power is not consumed and the complexity of action recognition is reduced, so that action recognition can be carried out normally on electronic devices with low computing power.

Description

Target action recognition method, device, server and storage medium
Technical Field
The present application relates to the field of image processing, and more particularly, to a target action recognition method, apparatus, server, and storage medium.
Background
With the development of science and technology, human-machine interaction has also evolved. Traditional human-machine interaction relies on keyboard and mouse input with feedback through a display, and the popularization of intelligent electronic devices has added modes such as touch interaction, voice interaction and gesture interaction.
However, gesture interaction requires the gesture to be recognized; the recognition process usually consumes a large amount of computing power, and the recognition logic is also complex.
Disclosure of Invention
The embodiment of the application provides a target action identification method, a target action identification device, a server and a storage medium, so as to solve the problems.
In a first aspect, an embodiment of the present application provides a target action identification method, where the method includes: selecting a plurality of frames of images from the video data; extracting human body key point information in each frame of image, wherein the human body key point information comprises human body key points and corresponding coordinates; and identifying target actions in the video data according to the human body key points and the corresponding coordinates.
In a second aspect, an embodiment of the present application provides a target motion recognition apparatus, including: the selecting module is used for selecting a plurality of frames of images from the video data; the extraction module is used for extracting human key point information in each frame of image, wherein the human key point information comprises human key points and corresponding coordinates; and the identification module is used for identifying the target action in the video data according to the human body key point and the corresponding coordinate.
In a third aspect, the present application provides an electronic device, which includes one or more processors, a memory, and a computer program stored on the memory and executable on the processors, and when executed by the processors, the computer program implements the method applied to the electronic device as described above.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the method described above.
The target action identification method, the target action identification device, the server and the storage medium provided by the embodiment of the application select a plurality of frames of images from video data; extracting human body key point information in each frame of image, wherein the human body key point information comprises human body key points and corresponding coordinates; and identifying target actions in the video data according to the human body key points and the corresponding coordinates. The human body key points and the corresponding coordinates in each frame of image are obtained, and the target action is identified by the human body key points and the corresponding coordinates, so that a large amount of calculation power is not required to be consumed, the complexity of action identification is reduced, and the action identification can be normally carried out on electronic equipment with low calculation power.
These and other aspects of the present application will be more readily apparent from the following description of the embodiments.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 shows a schematic application environment applicable to the target action recognition method provided by the embodiment of the present application.
Fig. 2 shows a flowchart of a target action recognition method according to an embodiment of the present application.
Fig. 3 shows a flowchart of a target action recognition method according to another embodiment of the present application.
Fig. 4 shows a flowchart of a target action recognition method according to still another embodiment of the present application.
Fig. 5 shows a schematic diagram of human body key points provided by the embodiment of the present application.
Fig. 6 shows a flow chart of step S330 in the embodiment provided in fig. 4.
Fig. 7 shows a flow chart of step S340 in the embodiment provided in fig. 4.
Fig. 8 shows a flowchart of a target action recognition method according to still another embodiment of the present application.
Fig. 9 is a functional block diagram of a target motion recognition apparatus according to an embodiment of the present application.
Fig. 10 shows a block diagram of an electronic device for executing a target action recognition method according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
With the development of science and technology, human-machine interaction has also evolved. Traditional human-machine interaction relies on keyboard and mouse input with feedback through a display, and the popularization of intelligent electronic devices has added modes such as touch interaction, voice interaction and gesture interaction.
Gesture control is a natural mode of interaction that everyone can use, so the cost of learning it is low: a user only needs to learn the gestures to interact with a machine. In a smart home, various home devices generally need to be controlled, and the control modes can be diverse, for example voice control or gesture control. In gesture control, the gesture must first be recognized before the corresponding control can be carried out. Gesture control is realized by recognizing the hand motion, in real time or step by step, through various sensor devices and converting the result into commands that the computer host device can recognize, so as to control the corresponding controlled device.
The inventor found in research that gesture recognition generally inputs human body key point information into a convolutional neural network and obtains the result from a classification network. However, this approach needs a large amount of data to train the model, the training process is complex, and running the model during recognition also occupies a large amount of computing power; the computational requirements on the electronic device are therefore high, and ordinary electronic devices can hardly perform such gesture recognition. In some methods with lower computational requirements, the gesture is determined from a 3D image, but an additional sensor must be added to generate the 3D image, which increases the cost.
Therefore, the inventor proposes the target action recognition method of the embodiments of the application: select a plurality of frames of images from video data; extract human body key point information from each frame of image, the information comprising human body key points and their corresponding coordinates; and identify a target action in the video data according to the human body key points and the corresponding coordinates. Because the human body key points and their coordinates are obtained for each frame and the target action is identified directly from them, a large amount of computing power is not consumed and the complexity of action recognition is reduced, so that action recognition can be carried out normally on low-compute electronic devices using ordinary images.
The embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 illustrates an application environment 10 of a target motion recognition method according to an embodiment of the present application.
The application environment 10 includes: a local server/cloud server 11, a gateway 12, a user terminal 13, an electronic device 14, and a controlled device 15. The controlled device 15 may be an air conditioner, a television, a motorized curtain, or the like. The electronic device 14 is a device capable of performing motion recognition; it may be a separate device or may be integrated with the gateway 12 into a single device. If the electronic device 14 is a separate device, it can be connected to the controlled device 15 through the gateway 12; if the electronic device 14 and the gateway 12 are integrated into one device, it can be connected to the controlled device 15 directly, and the connection may be Bluetooth, WiFi, ZigBee, or the like.
The user terminal 13 may be a mobile phone, a tablet computer, a PC (personal computer), a notebook computer, a smart TV, an in-vehicle terminal, or the like. The user terminal 13 may be connected to the local server or cloud server 11 through a network such as 2G, 3G, 4G, 5G or WiFi. The gateway 12 may be connected to the local server or cloud server 11 through a router so as to reach the user terminal 13, and the gateway 12 may also be connected to the controlled device 15 and the electronic device 14. The connections between the gateway 12 and the controlled device 15, and between the gateway 12 and the electronic device 14, may likewise be Bluetooth, WiFi, ZigBee, or the like. Of course, the connection modes between devices and the network connection modes of the devices are not limited in the embodiments of the present application.
Referring to fig. 2, an embodiment of the present application provides a target action recognition method, which may be applied to a local server/cloud server, and the method may include:
step S110: a plurality of frames of images are selected from video data.
In gesture recognition, recognition is usually performed in a piece of video data, or in real time. Therefore, the video data can be acquired through the image acquisition equipment, and the video data can be real-time video data or video data of a past period of time. The image capturing device may be integrated on the electronic device, or may be in communication connection with the electronic device, and sends the captured video data to the electronic device, so that the electronic device may capture the video data, and the video data captured by the electronic device may be an encoded H264 or H265 data stream.
When the video data is acquired, the video data can be decoded to obtain a real-time image sequence, and the image sequence can be an image acquired by the same image acquisition device at continuous time, that is, the content in the video data is converted into a frame-by-frame continuous image. Thus, selecting a plurality of frames of images from the video data allows selecting a plurality of frames of images from the sequence of images for subsequent steps.
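As a minimal illustrative sketch of this decoding step (assuming OpenCV is used to read the stream, which the embodiments do not mandate), the encoded video data can be turned into a frame-by-frame image sequence as follows:

```python
import cv2  # assumed decoding library; the embodiments only require a decoded image sequence


def decode_to_frames(source, max_frames=None):
    """Decode a video file path or stream URL into a list of frames."""
    cap = cv2.VideoCapture(source)  # opens H.264/H.265 sources supported by the build
    frames = []
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:  # end of stream or read error
            break
        frames.append(frame)
        if max_frames is not None and len(frames) >= max_frames:
            break
    cap.release()
    return frames
```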
As an embodiment, the selection of a plurality of images from the image sequence may select a plurality of consecutive images.
As an embodiment, selecting the multi-frame images from the image sequence may extract part of the images at a preset frame interval. For example, if the preset frame interval is 2 and the image sequence contains 10 frames in total, the 1st, 3rd, 5th, 7th and 9th frame images may be extracted as the multi-frame images.
As an implementation manner, selecting the multi-frame images from the image sequence may mean selecting the images that contain a target object. For example, if the target object is a human body, the image sequence contains 10 frames in total, and every image from the 3rd frame onward contains a human body, then the 3rd to 10th frame images may be taken as the multi-frame images.
As an implementation manner, a test may be performed for the target action to determine how many frames the action typically spans, so that the corresponding number of frames can be extracted. For example, if action 1 normally spans 10 frames, 10 frames of images are extracted.
Specifically, the number and the mode of extracting the multi-frame images may be set according to actual needs, and are not specifically limited herein.
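The following sketch illustrates two of the selection strategies above (interval sampling and selecting frames that contain the target object); the contains_person callback is a hypothetical detector, not something defined by the embodiments:

```python
def select_by_interval(frames, interval=2):
    """Keep every `interval`-th frame; with interval=2 and 10 frames this keeps
    the 1st, 3rd, 5th, 7th and 9th frames, matching the example above."""
    return frames[::interval]


def select_with_target(frames, contains_person):
    """Keep only the frames for which the (hypothetical) detector reports the target object."""
    return [frame for frame in frames if contains_person(frame)]
```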
Step S120: and extracting the human body key point information in each frame of image, wherein the human body key point information comprises human body key points and corresponding coordinates.
After selecting multiple frames of images from the video data, each frame of image can be processed, and the human body key point information in each frame of image is extracted. The human body key point information comprises human body key points and corresponding coordinates.
The human body key points refer to main joint parts such as the nose, the neck and the four limbs. When the human body key point information is extracted from each frame of image, the same human body key points can be detected in every frame, and the same coordinate system is established in every image to determine the coordinates of the detected key points in that coordinate system. Because the same coordinate system is established in every frame, coordinates from different frames can be compared with each other.
Specifically, the method for extracting the human body key point information in each frame of image can utilize a neural network model to sequentially input each frame of image into the neural network model to obtain the corresponding human body key point and the corresponding coordinate in each frame of image.
Step S130: and identifying target actions in the video data according to the human body key points and the corresponding coordinates.
After the human body key points and the corresponding coordinates in each frame of image are obtained, target actions in the video data can be identified according to the human body key points and the corresponding coordinates. Due to the fact that the human body key points and the corresponding coordinates in each frame of image are obtained, the target action in the video data can be identified through the human body key points and the corresponding coordinates in each frame of image.
Defining the target action as an action needing to be identified, wherein the target action can comprise a plurality of human key points, and the positions of the human key points meet a preset position relationship. Wherein, in different target actions, a plurality of selected human body key points can be the same or different; if the human body key points selected by different target actions are the same, the preset position relations met by the positions of the human body key points can be different. Therefore, when the target action is identified according to the human body key points in each frame of image and the corresponding coordinates, whether the human body key points in each frame of image are in the preset position relation corresponding to the target action or not can be judged. If the number of images which do not meet the preset position relationship in the multi-frame images is smaller than the preset number, which indicates that the action contained in the multi-frame images is the target action, the action in the video data can be identified as the target action.
According to the target action identification method provided by the embodiment of the application, a plurality of frames of images are selected from video data; extracting human body key point information in each frame of image, wherein the human body key point information comprises human body key points and corresponding coordinates; and identifying target actions in the video data according to the human body key points and the corresponding coordinates, searching an instruction comparison table after identifying the target actions in the video data to obtain control instructions and controlled equipment corresponding to the target actions, and controlling the controlled equipment according to the control instructions. The target action is recognized by using the key points of the human body and the corresponding coordinates, a large amount of calculation force is not needed to be consumed, the complexity of action recognition is reduced, and therefore action recognition can be normally performed on electronic equipment with low calculation force through a common image.
Referring to fig. 3, another embodiment of the present application provides a target motion recognition method, where a process of extracting key point information of a human body of each frame of image is described in detail on the basis of the previous embodiment, and the method may include:
step S210: a plurality of frames of images are selected from video data.
Step S210 may refer to corresponding parts of the foregoing embodiments, and will not be described herein again.
Step S220: and inputting the multi-frame images in the video data into an extraction network, and obtaining the key points of the human body and the coordinates of the key points of the human body corresponding to each frame of image according to the output of the extraction network.
After selecting multiple frames of images from the video data, human key points in each frame of image can be extracted, and when the human key points in each frame of image are extracted, the multiple frames of images can be input into an extraction network, and the human key points and the coordinates of the key points corresponding to each frame of image are obtained according to the output of the extraction network. And the extraction network is used for outputting corresponding human key points and coordinates of the input image.
Specifically, the extraction network may be a convolutional neural network for extracting human body key points, for example OpenPose or a similar pose-estimation network. The extraction network does not need to be trained by the user: a trained network can be selected directly, and when the key points in each frame of image are extracted, each frame is input into the extraction network to obtain the human body key points of that frame and their coordinates.
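A minimal sketch of this extraction step is given below; the estimator object and its infer method are assumptions standing in for whichever pre-trained network (e.g. OpenPose) is used, and the only requirement carried over from the text is that every frame yields named key points with (x, y) coordinates in a shared coordinate system:

```python
def extract_keypoints(frames, estimator):
    """Run the pre-trained extraction network on each frame.

    `estimator.infer(frame)` is assumed to return a dict {keypoint_name: (x, y)}
    for the key points detected in that frame.
    """
    keypoints_per_frame = []
    for frame in frames:
        keypoints_per_frame.append(estimator.infer(frame))
    return keypoints_per_frame
```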
The included human body key points may be different in the images of different frames.
As an implementation manner, inputting each frame of image into the extraction network may be outputting all human body key points included in the corresponding image, and establishing the same coordinate system for each frame of image to output the extracted coordinates corresponding to the human body key points.
As another embodiment, inputting each frame of image into the extraction network may be outputting fixed human key points. For example, if only a hand key point is required, only the hand key point and the corresponding coordinates are output for each frame of image, and if no hand key point is included in a certain frame of image, the human body key point and the coordinates of the human body key point corresponding to the frame of image are not output.
Step S230: and judging whether the plurality of human body key points in each frame of image meet the preset position relation or not according to the coordinates of each human body key point.
After the human body key points and their coordinates have been extracted from each frame of image, the key points and coordinates required in each frame can be obtained. For example, if the required human body key points are the hand key points, the hand key points and their coordinates in each frame can be acquired, and the target action can be identified from them.
The target action can comprise a plurality of human body key points, and the positions of the human body key points meet a preset position relation. When the target actions are different, the preset position relations which need to be met corresponding to the human body key points in the target actions are different. For example, when the target motion is motion 1, the preset position relationship corresponding to the plurality of human body key points may be relationship 1, and when the target motion is motion 2, the preset position relationship corresponding to the plurality of human body key points may be relationship 2. Therefore, the preset position relation required to be met by the human key points in each frame of image can be determined according to the target action to be identified.
Therefore, whether the human body key points in each frame of image meet the preset position relation corresponding to the target action or not can be judged.
The human body key points are not limited to hand key points; in some embodiments they may be leg key points, or a combination of hand and leg key points. Specifically, the key points used when judging whether the human body key points in each frame satisfy the preset position relationship are the key points involved when the target action is executed. In some embodiments, a correspondence table between target actions and key points may be established, where the key points listed for an action are those that can be used to judge it; the key points for judging the target action can then be determined from the target action that needs to be identified.
For example, when the target motion is a motion in which two arms are closed to open, the human body key points that can be determined are the hand key points, when the target motion is a leg kicking motion, the human body key points that can be determined are the leg key points, and when the target motion is a twisting motion, the human body key points that can be determined are the waist and neck key points. Therefore, when different target actions are recognized, the key points which can be judged can be correspondingly arranged.
In addition, in the correspondence table, the target action may also correspond to a preset positional relationship, which indicates a positional relationship that the key point corresponding to the target action needs to satisfy.
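A sketch of such a correspondence table is shown below; the action names, key-point names and relation identifiers are illustrative assumptions chosen to mirror the examples above:

```python
# Maps each target action to the key points used to judge it and the preset
# positional relation those key points must satisfy (all names are assumed).
ACTION_TABLE = {
    "arms_closed_to_open": {"keypoints": ["hand"], "relation": "hands_separating"},
    "kick": {"keypoints": ["leg"], "relation": "leg_raised"},
    "twist": {"keypoints": ["waist", "neck"], "relation": "torso_rotating"},
}


def keypoints_for(target_action):
    """Look up which key points (and relation) are used to judge a target action."""
    return ACTION_TABLE[target_action]
```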
Step S240: and if the number of the images of which the plurality of human key points do not meet the preset position relationship is less than the preset number, judging that the action in the video data is the target action.
When judging whether the human body key points in each frame of image meet the preset position relationship corresponding to the target action, counting the number of images which do not meet the preset position relationship. And presetting a preset number, wherein the preset number represents the number of the allowed maximum images which do not meet the preset position relation corresponding to the target action in the multi-frame images corresponding to the target action. For example, when a target action is performed, it is detected that at most 3 images in the corresponding multi-frame images do not satisfy the preset position relationship corresponding to the target action, and when the action is recognized, it is detected that 4 images in the multi-frame images do not satisfy the preset position relationship, which indicates that the target action is not being performed.
The determination of the preset number may be obtained through testing, for example, multiple frames of images in the target action are obtained for multiple times, the number of images which do not satisfy the preset position relationship in the images is determined, and an average value is obtained according to results of the multiple tests, where the average value is the preset number. Of course, the preset number corresponding to different target actions may be different, and specifically, the value of the preset number may be set according to actual needs, which is not limited herein.
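The averaging procedure described above can be sketched as follows; satisfies_relation is an assumed callback implementing the preset position check for one frame:

```python
def estimate_preset_number(test_runs, satisfies_relation):
    """Estimate the preset number from several recorded performances of the target action.

    Each run is a list of per-frame key-point data; the miss count of each run is
    computed and the average over all runs is taken as the preset number.
    """
    miss_counts = [
        sum(1 for frame_kp in run if not satisfies_relation(frame_kp))
        for run in test_runs
    ]
    return round(sum(miss_counts) / len(miss_counts))
```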
After the multi-frame images are detected, the number of images of which the human key points do not meet the preset position relationship in the multi-frame images can be obtained, and if the number of images of which the human key points do not meet the preset position relationship is less than the preset number, the target action is executed, so that the action in the video data can be judged to be the target action; if the number of the images of which the human body key points do not meet the preset position relationship is larger than or equal to the preset number, the target action is not executed, and therefore the action in the video data can be judged not to be the target action.
The target action identification method provided by the embodiment of the application selects a plurality of frames of images from video data; inputting the multi-frame images in the video data into an extraction network, and obtaining human key points corresponding to each frame of image and coordinates of the human key points according to the output of the extraction network; judging whether a plurality of human body key points in each frame of image meet a preset position relation or not according to the coordinates of each human body key point; and if the number of the images of which the plurality of human key points do not meet the preset position relationship is less than the preset number, judging that the action in the video data is the target action. Judging whether the preset position relation of a plurality of corresponding human body key points when the target action is executed is met or not through the human body key points and the corresponding coordinates in each frame of image, counting the number of images which do not meet the preset position relation, and determining whether the target action is identified or not according to the counted number, so that a large amount of calculation power is not consumed, the complexity of action identification is reduced, and the action identification can be normally carried out on electronic equipment with low calculation power through a common image.
Referring to fig. 4, a further embodiment of the present application provides a method for identifying a target action, where a process of identifying a target action in the video data according to the human body key points and corresponding coordinates is described in detail on the basis of the previous embodiment, and the method may include:
step S310: a plurality of frames of images are selected from video data.
Step S320: and inputting the multi-frame images in the video data into an extraction network, and obtaining the key points of the human body and the coordinates of the key points of the human body corresponding to each frame of image according to the output of the extraction network.
The steps S310 to S320 refer to corresponding parts of the foregoing embodiments, and are not described herein again.
Since the human body has many key points, only the arm key points are taken as an example here when describing motion recognition. The acquired human body key points may thus be a first key point, a second key point and a third key point, where the first key point may be a hand key point, the second key point an elbow key point, and the third key point a shoulder key point. Referring to fig. 5, a schematic diagram of human body key points is shown; the black dots in the figure represent the individual key points of the human body.
In some embodiments, the first keypoint, the second keypoint, and the third keypoint may refer to three keypoints, i.e., the first keypoint is a hand keypoint, i.e., 4 or 7 in fig. 5, the second keypoint is an elbow keypoint, i.e., 3 or 6 in fig. 5, and the third keypoint is a shoulder keypoint, i.e., 2 or 5 in fig. 5.
In other embodiments, the first keypoints may be one or more of the symmetric keypoints in the human body, the second keypoints may be one or more of the symmetric keypoints in the human body, and the third keypoints may be one or more of the symmetric keypoints in the human body. The first key point, the second key point and the third key point respectively correspond to 3 groups of different symmetrical key points. If the first key point represents all the key points of the hand, the second key point represents all the key points of the elbow, and the third key point represents all the key points of the shoulder. For example, a human body has two arms, then the first keypoint represents the keypoint of two hands, i.e. 4 or 7 in fig. 5, the second keypoint represents the keypoint of two elbows, i.e. 3 or 6 in fig. 5, and the third keypoint represents the keypoint of two shoulders, i.e. 2 or 5 in fig. 5.
In the following description, the first key point represents key points of all hands, the second key point represents key points of all elbows, and the third key point represents key points of all shoulders, as an example. When judging whether a plurality of human key points in each frame of image meet a preset position relationship according to the coordinates of each human key point, the preset position relationship may include a height relationship and a distance relationship, and therefore, it is necessary to judge whether the human key points meet the height relationship and whether the human key points meet the distance relationship.
Step S330: and judging whether the height positions among all the human body key points in each frame of image meet the height relation or not according to each human body key coordinate.
When the first key point, the second key point and the third key point are obtained, whether the height relationship is satisfied or not among the human body key points in each frame of image can be judged according to the corresponding coordinates. The height position relationship may be the height of each human body key point when the target action is executed.
Specifically, referring to fig. 6, the following steps may be included:
step S331: and judging whether each human body key point in each frame of image meets the condition that the first key point is higher than the second key point and the second key point is lower than the third key point or not according to the coordinate of each human body key point.
And if the target action is executed, the satisfied high-low position relation is that the first key point is higher than the second key point, and the second key point is lower than the third key point. Whether the height relationship is satisfied can be judged according to the first key point, the second key point, the third key point and the corresponding coordinates extracted from each frame of image.
For example, the coordinates of the first key point of the same arm are (x1, y1), the coordinates of the second key point are (x2, y2), the coordinates of the third key point are (x3, y3), the coordinates of the first key point of the corresponding other arm are (x4, y4), the coordinates of the second key point are (x5, y5), and the coordinates of the third key point are (x6, y 6).
For the height relation, only the y coordinates matter. For one arm, it is necessary to determine whether y1 > y2 and y2 < y3 hold; for the other arm, whether y4 > y5 and y5 < y6 hold. In this way each frame of image can be checked in turn for the height relation.
Step S332: and if so, judging that the height positions among all the human body key points in each frame of image meet the height relation.
In some embodiments, when the first key point is higher than the second key point and the second key point is lower than the third key point in both the arm key points, it may be determined that the height positions between the human body key points in the frame image satisfy the height relationship.
In other embodiments, if the key point in any one of the two arms satisfies that the first key point is higher than the second key point, and the second key point is lower than the third key point, it may be determined that the height position between the key points of the human body in the frame image satisfies the height relationship.
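The two embodiments of the height check can be sketched as follows; points are assumed to be (x, y) tuples and, following the comparisons above, a larger y value is treated as higher:

```python
def arm_height_ok(hand, elbow, shoulder):
    """One arm: hand (first key point) higher than elbow (second key point),
    and elbow lower than shoulder (third key point)."""
    return hand[1] > elbow[1] and elbow[1] < shoulder[1]


def frame_height_ok(left_arm, right_arm, require_both=True):
    """Apply the check to both arms, or to either arm, per the two embodiments above."""
    left_ok = arm_height_ok(*left_arm)
    right_ok = arm_height_ok(*right_arm)
    return (left_ok and right_ok) if require_both else (left_ok or right_ok)
```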
Step S340: and judging whether the Euclidean distance between two hands in each frame of image meets the distance relation or not according to the coordinates of each human body key point.
When the first key point, the second key point and the third key point are obtained, whether the Euclidean distance between the two hands in each frame of image satisfies the distance relation can be judged according to the corresponding coordinates. The distance relation describes how the distances between the human body key points should behave when the target action is executed.
Specifically, referring to fig. 7, the following steps may be included:
step S341: and calculating the Euclidean distance between two hands in each frame of image according to the coordinates of the first key point, the second key point and the third key point.
When the euclidean distance between the two hands is calculated, the first keypoint, the second keypoint, and the third keypoint may be 3 groups of symmetric keypoints in the human body, and may be calculated by calculating a euclidean distance between two keypoints in the first keypoint, a euclidean distance between two keypoints in the second keypoint, and a euclidean distance between two keypoints in the third keypoint.
For example, the coordinates of the first key point of the same arm are (x1, y1), the coordinates of the second key point are (x2, y2), the coordinates of the third key point are (x3, y3), the coordinates of the first key point of the corresponding other arm are (x4, y4), the coordinates of the second key point are (x5, y5), and the coordinates of the third key point are (x6, y 6).
When the Euclidean distance between the two hands is calculated, the Euclidean distance between the first key points can be obtained as p1 = sqrt((x1 - x4)^2 + (y1 - y4)^2), the Euclidean distance between the second key points as p2 = sqrt((x2 - x5)^2 + (y2 - y5)^2), and the Euclidean distance between the third key points as p3 = sqrt((x3 - x6)^2 + (y3 - y6)^2). Thus p1, p2 and p3 all describe the distance between the two hands in the frame image. By this method, the Euclidean distance between the two hands in each frame of image can be calculated.
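A sketch of this calculation for one frame is given below; the dictionary keys used to address the six key points are naming assumptions:

```python
import math


def hand_distances(kp):
    """Compute p1, p2, p3 for one frame from the six arm key points."""
    x1, y1 = kp["hand_l"]
    x2, y2 = kp["elbow_l"]
    x3, y3 = kp["shoulder_l"]
    x4, y4 = kp["hand_r"]
    x5, y5 = kp["elbow_r"]
    x6, y6 = kp["shoulder_r"]
    p1 = math.hypot(x1 - x4, y1 - y4)  # distance between the two hand key points
    p2 = math.hypot(x2 - x5, y2 - y5)  # distance between the two elbow key points
    p3 = math.hypot(x3 - x6, y3 - y6)  # distance between the two shoulder key points
    return p1, p2, p3
```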
Step S342: and when the Euclidean distance between the two hands of the next frame is larger than that of the current frame, judging that the next frame meets the distance relation.
After the euclidean distance between the first key point, the second key point and the third key point in each frame of image is calculated, whether the euclidean distance between the first key point, the second key point and the third key point in the next frame meets the distance relationship can be judged according to the calculated euclidean distances between the first key point, the second key point and the third key point.
In some embodiments, in the distance relationship, the distance between the two hands may be represented by three sets of key points, and the change in the euclidean distance between the first key points, the change in the euclidean distance between the second key points, and the change in the euclidean distance between the third key points may be different.
For example, when the target action is executed, the euclidean distance between the first key points may be gradually increased, the euclidean distance between the second key points may be gradually decreased, and the euclidean distance between the third key points may be substantially maintained. And only when the three key points simultaneously meet the change condition of the Euclidean distance, judging that the next frame meets the distance relation.
Therefore, the euclidean distances between the first key point, the second key point and the third key point in the current frame image and the first key point, the second key point and the third key point in the next frame image need to be sequentially judged. The Euclidean distance between first key points in the current frame image and the Euclidean distance between first key points in the next frame image are compared to obtain the change of the Euclidean distance between the first key points; comparing the Euclidean distance between second key points in the current frame image with the Euclidean distance between second key points in the next frame image to obtain the change of the Euclidean distance between the second key points; and comparing the Euclidean distance between the third key points in the current frame image with the Euclidean distance between the third key points in the next frame image to obtain the change of the Euclidean distance between the third key points. When the euclidean distance between first key points in the next frame image is greater than the euclidean distance between first key points in the current frame image, the euclidean distance between second key points in the next frame image is less than the euclidean distance between second key points in the current frame image, and the euclidean distance between third key points in the next frame image is approximately equal to the euclidean distance between third key points in the current frame image, it may be determined that the next frame satisfies the distance relationship.
It is to be understood that "approximately equal" may be defined as the difference between the two distances not exceeding a preset value; for example, if the preset value is 0.5 and the difference is 0.4, the two distances may be considered approximately equal. The preset value can be set according to actual requirements and is not specifically limited here.
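A sketch of the frame-to-frame comparison for the arms-opening case follows; eps is the assumed tolerance for "approximately equal":

```python
def distance_relation_ok(prev, curr, eps=0.5):
    """Compare the (p1, p2, p3) distances of the current frame with those of the next frame:
    hand distance grows, elbow distance shrinks, shoulder distance stays within eps."""
    p1_prev, p2_prev, p3_prev = prev
    p1_next, p2_next, p3_next = curr
    return (
        p1_next > p1_prev
        and p2_next < p2_prev
        and abs(p3_next - p3_prev) <= eps
    )
```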
In other embodiments, in the distance relationship, the euclidean distance between two hands may be represented by one of the groups of key points, which may be the first key point, the second key point, or the third key point.
For example, when the target action is executed, the euclidean distance between the first keypoints is gradually increased, the euclidean distance between the first keypoints in the current frame image and the euclidean distance between the first keypoints in the next frame image may be compared to obtain a change in the euclidean distance between the first keypoints, and when the euclidean distance between the first keypoints in the next frame image is greater than the euclidean distance between the first keypoints in the current frame image, the next frame may be considered to satisfy the distance relationship.
Step S350: and when the plurality of human body key points in each frame of image do not meet the height relationship or the distance relationship, judging that the plurality of human body key points in the frame of image do not meet the preset position relationship.
When the human key points in a certain frame of image do not satisfy the height relationship or the distance relationship, the human key points in the frame of image can be considered not to satisfy the preset position relationship.
For example, if the 3rd frame image does not satisfy the height relation, the 7th frame image does not satisfy the distance relation, the 10th frame image satisfies both the height relation and the distance relation, and the 11th frame image satisfies neither, then it may be determined that the human body key points in the 3rd, 7th and 11th frame images do not satisfy the preset position relationship, while those in the 10th frame image do.
Step S360: and if the number of the images of which the plurality of human body key points do not meet the preset position relationship is less than the preset number, judging that the action in the video frequency is the target action.
Wherein the number of images that do not satisfy the preset positional relationship may be represented by action_miss; whenever one frame of image is judged not to satisfy the preset positional relationship, action_miss is incremented (action_miss = action_miss + 1), so the number of images that do not satisfy the preset positional relationship is accumulated. Step S360 may refer to the description of the corresponding parts of the foregoing embodiments and is not repeated here.
It should be noted that, in some embodiments, step S330 and step S340 may not both be performed. For example, only one of the steps may be executed, and the selection may be performed according to a preset positional relationship that needs to be satisfied by an actual target motion that needs to be identified, which is not particularly limited.
In other embodiments, if the number of images that do not satisfy the preset position relationship and are greater than or equal to the preset number after step S330 is performed, step S340 may not be performed any more to speed up the recognition of the target action.
According to the target action identification method provided by the embodiment of the application, whether the height position between every two human key points in every frame of image meets the height relationship is judged according to the coordinate of every human key point, whether the Euclidean distance between two hands in every frame of image meets the distance relationship is judged, and when a plurality of human key points in every frame of image do not meet the height relationship or the distance relationship, the plurality of human key points in the frame of image do not meet the preset position relationship is judged; and if the number of the images of which the plurality of human key points do not meet the preset position relationship is less than the preset number, judging that the action in the video data is the target action. Corresponding to different target actions, the height relation and the distance relation which need to be met by the human key points are different, so that simple data operation can be directly carried out according to the extracted coordinates of the human key points, the target actions in the video data are identified, the complexity of action identification is reduced, and the action identification can be normally carried out on electronic equipment with low calculation capacity through a common image.
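Putting the pieces together, an end-to-end sketch for the arms closed-to-open action (reusing the hypothetical helpers sketched above) could look like this:

```python
def recognise_arms_open(frames_kp, preset_number, eps=0.5):
    """Count frames that fail the height or distance relation and compare with the preset number."""
    action_miss = 0
    prev_dist = None
    for kp in frames_kp:
        dist = hand_distances(kp)
        height_ok = frame_height_ok(
            (kp["hand_l"], kp["elbow_l"], kp["shoulder_l"]),
            (kp["hand_r"], kp["elbow_r"], kp["shoulder_r"]),
        )
        dist_ok = prev_dist is None or distance_relation_ok(prev_dist, dist, eps)
        if not (height_ok and dist_ok):
            action_miss += 1  # this frame does not satisfy the preset position relation
        prev_dist = dist
    return action_miss < preset_number
```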
Referring to fig. 8, a further embodiment of the present application provides a target action identification method, where a process of controlling a controlled device by using an identified target action is mainly described on the basis of the previous embodiment, and the method may include:
step S410: a plurality of frames of images are selected from video data.
Step S420: and extracting the human body key point information in each frame of image, wherein the human body key point information comprises human body key points and corresponding coordinates.
Step S430: and identifying target actions in the video data according to the human body key points and the corresponding coordinates.
The steps S410 to S430 can refer to corresponding parts of the foregoing embodiments, and are not described herein again.
After identifying the target action in the video data, further control may be performed on the controlled device according to the identified target action.
Step S440: and searching an instruction comparison table to obtain a control instruction corresponding to the target action and the controlled equipment.
An instruction comparison table is stored in advance; it records the correspondence between gesture actions and the corresponding control instructions and controlled devices. Specifically, the contents of the instruction comparison table can be referred to in Table 1.
TABLE 1
Gesture motion      | Action 1      | Action 2
Control instruction | Instruction 1 | Instruction 2
Controlled device   | Device 1      | Device 2
In Table 1, the control instruction and controlled device corresponding to gesture action 1 are instruction 1 and device 1 respectively, and those corresponding to gesture action 2 are instruction 2 and device 2 respectively. The recognized target action can therefore be looked up among the gesture actions to obtain its control instruction and controlled device.
Step S450: and controlling the controlled equipment according to the control instruction.
And when the control instruction corresponding to the target action is found, the controlled equipment can be controlled according to the control instruction.
In some embodiments, if the electronic device and the gateway are two different devices, the electronic device may send the acquired control instruction and the controlled device to the gateway, and the gateway sends the control instruction to the controlled device, so that the controlled device may execute the control instruction to implement a corresponding function.
In other embodiments, if the electronic device is integrated on a gateway, the acquired control instruction may be directly sent to the controlled device, so that the controlled device may execute the control instruction to implement a corresponding function.
For example, the motorized curtain may be controlled to open automatically when it is detected that the user moves both arms from closed to open. The target action to be identified is then the arms closed-to-open action: a plurality of frames of images can be selected from the video data in real time, the human body key points and corresponding coordinates extracted from each frame, and the closed-to-open arm action identified from them. After that action is identified, the instruction comparison table is searched, the corresponding control instruction is obtained, and the controlled device is the motorized curtain, so the motorized curtain can be controlled to open.
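A minimal sketch of steps S440 and S450 follows; the table contents and the send_to_gateway callback are illustrative assumptions rather than part of the described system:

```python
INSTRUCTION_TABLE = {
    "arms_closed_to_open": {"instruction": "open", "device": "motorized_curtain"},
}


def dispatch(target_action, send_to_gateway):
    """Look up the control instruction and controlled device for the recognised action
    and forward them (for example via the gateway) to the controlled device."""
    entry = INSTRUCTION_TABLE.get(target_action)
    if entry is not None:
        send_to_gateway(entry["device"], entry["instruction"])
```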
The target action identification method provided by the embodiment of the application selects a plurality of frames of images from video data; extracting human body key point information in each frame of image, wherein the human body key point information comprises human body key points and corresponding coordinates; and identifying target actions in the video data according to the human body key points and the corresponding coordinates. The method comprises the steps of obtaining human key points and corresponding coordinates in each frame of image, identifying target actions by using the human key points and the corresponding coordinates, consuming no large amount of computing power, and reducing complexity of action identification, so that the electronic equipment with low computing power can normally identify the actions through common images, and complexity of controlling the smart home by using the actions is reduced.
Referring to fig. 9, a target action recognition apparatus 500 according to an embodiment of the present application is shown, which can be applied to an electronic device, where the target action recognition apparatus 500 includes a selecting module 510, an extracting module 520, and a recognition module 530. The selecting module 510 is configured to select a plurality of frames of images from the video data; the extracting module 520 is configured to extract human key point information in each frame of image, where the human key point information includes human key points and corresponding coordinates; the identifying module 530 is configured to identify a target action in the video data according to the human body key point and the corresponding coordinate.
The human body key points and the corresponding coordinates in each frame of image are obtained, and the target action is identified by the human body key points and the corresponding coordinates, so that a large amount of calculation power is not required to be consumed, the complexity of action identification is reduced, and the action identification can be normally carried out on electronic equipment with low calculation power.
Further, the extracting module 520 is further configured to input the multiple frames of images in the video data into an extracting network, obtain the key points of the human body and the coordinates of the key points of the human body corresponding to each frame of image according to the output of the extracting network, and the extracting network is configured to output the corresponding key points of the human body and the coordinates according to the input image.
The human body key point information in the image is extracted with the neural network model so that the target action can conveniently be identified based on it. Moreover, the neural network is used only to extract the key point information, so no additional training of a network is required.
Further, the target action includes a plurality of human body key points, positions of the plurality of human body key points satisfy a preset position relationship, and the identification module 520 is further configured to determine whether the plurality of human body key points in each frame of image satisfy the preset position relationship according to coordinates of each human body key point; and if the number of the images of which the plurality of human key points do not meet the preset position relationship is less than the preset number, judging that the action in the video data is the target action.
The extracted human body key point information in each frame of image is used to judge whether the preset position relationship is satisfied, the number of frames that do not satisfy the preset position relationship is counted, and when this number is less than the preset number it indicates that the target action has been identified.
Further, the human body key points are human body arm key points, the preset position relationship includes a height relationship and a distance relationship, and the identification module 530 is further configured to determine whether the height position between the human body key points in each frame of image satisfies the height relationship according to the coordinates of each human body key point; judging whether the Euclidean distance between two hands in each frame of image meets the distance relation or not according to the coordinates of each human body key point; and when the plurality of human body key points in each frame of image do not meet the height relationship or the distance relationship, judging that the plurality of human body key points in the frame of image do not meet the preset position relationship.
Further, the human body key points include a first key point, a second key point and a third key point, the first key point is higher than the second key point in the target action, the second key point is lower than the third key point, and the identification module 530 is further configured to determine whether each human body key point in each frame of image meets the condition that the first key point is higher than the second key point and the second key point is lower than the third key point according to the coordinate of each human body key point; and if so, judging that the height positions among all the human body key points in each frame of image meet the height relation.
Further, the identification module 530 is further configured to calculate the Euclidean distance between the two hands in each frame of image according to the coordinates of the first key point, the second key point and the third key point; and when the Euclidean distance between the two hands in the next frame is larger than that in the current frame, determine that the next frame satisfies the distance relationship.
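The distance check compares the hand-to-hand Euclidean distance of consecutive frames; which key points stand for the two hands is an assumption made here, and the names `left_hand` and `right_hand` are placeholders.

```python
import math


def hand_distance(kps):
    """Euclidean distance between the two hand key points of one frame."""
    (x1, y1), (x2, y2) = kps["left_hand"], kps["right_hand"]
    return math.hypot(x1 - x2, y1 - y2)


def satisfies_distance_relation(current_kps, next_kps):
    """The next frame satisfies the distance relationship when the distance
    between the two hands has grown relative to the current frame."""
    return hand_distance(next_kps) > hand_distance(current_kps)
```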
The preset position relationship may include both a height relationship and a distance relationship; when either of them is not satisfied, the preset position relationship is considered not satisfied. Setting multiple conditions in this way improves the accuracy of action recognition.
Further, the target action recognition device 500 further includes a control module, and the control module is configured to search the instruction comparison table to obtain the control instruction and the controlled device corresponding to the target action, and to control the controlled device according to the control instruction.
After the target action is recognized, the control instruction and the controlled device corresponding to the target action are obtained, and the controlled device is controlled according to the control instruction, so that devices can be controlled through user actions.
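A minimal sketch of the lookup step under stated assumptions: the comparison-table entries and the `send_command` transport function are made up for illustration and do not come from the disclosure.

```python
# Hypothetical instruction comparison table mapping a recognized target action
# to a control instruction and the controlled device; entries are examples only.
INSTRUCTION_TABLE = {
    "raise_both_hands": ("turn_on", "living_room_light"),
    "spread_hands_apart": ("turn_off", "bedroom_fan"),
}


def dispatch_action(action_name, send_command):
    """Look up the recognized action and forward the instruction to the device.

    `send_command(device, instruction)` is a placeholder for whatever transport
    the hub actually uses (for example Zigbee or MQTT).
    """
    entry = INSTRUCTION_TABLE.get(action_name)
    if entry is None:
        return False  # no instruction configured for this action
    instruction, device = entry
    send_command(device, instruction)
    return True
```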
The target action recognition device 500 provided in this embodiment of the application can implement each process of the method for recognizing the target action implemented by the server in the method embodiments of fig. 2 to fig. 7, and is not described herein again to avoid repetition.
An embodiment of the present application provides an electronic device, which includes a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the target action recognition method provided in the above method embodiment.
The memory may be used to store software programs and modules, and the processor executes various functional applications and information feedback by running the software programs and modules stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, application programs required by functions, and the like, and the data storage area may store data created according to the use of the apparatus, and the like. Further, the memory may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory may also include a memory controller to provide the processor with access to the memory.
Fig. 10 is a block diagram of a hardware structure of an electronic device for the target action recognition method provided in an embodiment of the present application. The electronic device may be a local server or a cloud server. As shown in fig. 10, the electronic device 600 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 610 (the processor 610 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 630 for storing data, and one or more storage media 620 (e.g., one or more mass storage devices) for storing applications 623 or data 622. The memory 630 and the storage medium 620 may be transient or persistent storage. The program stored in the storage medium 620 may include one or more modules, and each module may include a series of instruction operations for the electronic device. Further, the processor 610 may be configured to communicate with the storage medium 620 to execute the series of instruction operations in the storage medium 620 on the electronic device 600. The electronic device 600 may also include one or more power supplies 660, one or more wired or wireless network interfaces 650, one or more input/output interfaces 640, and/or one or more operating systems 621, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The input/output interface 640 may be used to receive or transmit data via a network. A specific example of the network described above may include a wireless network provided by a communication provider of the electronic device 600. In one example, the input/output interface 640 includes a network interface controller (NIC), which may be connected to other network devices through a base station so as to communicate with the Internet. In another example, the input/output interface 640 may be a radio frequency (RF) module, which is used to communicate with the Internet in a wireless manner.
It will be understood by those skilled in the art that the structure shown in fig. 10 is merely an illustration and is not intended to limit the structure of the electronic device. For example, electronic device 600 may also include more or fewer components than shown in FIG. 10, or have a different configuration than shown in FIG. 10.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the above target action identification method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A target action recognition method, characterized in that the method comprises:
selecting a plurality of frames of images from the video data;
extracting human body key point information in each frame of image, wherein the human body key point information comprises human body key points and corresponding coordinates;
and identifying target actions in the video data according to the human body key points and the corresponding coordinates.
2. The method according to claim 1, wherein the extracting of the human body key point information in each frame of image comprises:
and inputting the multi-frame images in the video data into an extraction network, and obtaining the human key points and the coordinates of the human key points corresponding to each frame of image according to the output of the extraction network, wherein the extraction network is used for outputting the corresponding human key points and the coordinates according to the input images.
3. The method according to claim 1, wherein the target action includes a plurality of human body key points, positions of the plurality of human body key points satisfy a preset position relationship, and identifying the target action in the video data according to the human body key points and corresponding coordinates includes:
judging whether the plurality of human body key points in each frame of image meet the preset position relation or not according to the coordinates of each human body key point;
and if the number of the images of which the plurality of human key points do not meet the preset position relationship is less than the preset number, judging that the action in the video data is the target action.
4. The method according to claim 3, wherein the human body key points are human body arm key points, the preset position relationship includes a height relationship and a distance relationship, and the determining whether the plurality of human body key points in each frame of image satisfy the preset position relationship according to the coordinates of each human body key point includes:
judging whether the height positions among the human body key points in each frame of image meet the height relation or not according to the coordinates of the human body key points;
or judging whether the Euclidean distance between two hands in each frame of image meets the distance relation or not according to the coordinates of each human body key point;
and when the plurality of human body key points in each frame of image do not meet the height relationship or the distance relationship, judging that the plurality of human body key points in the frame of image do not meet the preset position relationship.
5. The method according to claim 4, wherein the human body key points include a first key point, a second key point and a third key point, the first key point is higher than the second key point in the target action, the second key point is lower than the third key point, and whether the height relationship is satisfied by the height positions between the human body key points in each frame of image is judged according to the coordinates of each human body key point, including:
judging whether each human body key point in each frame of image meets the condition that the first key point is higher than the second key point and the second key point is lower than the third key point or not according to the coordinate of each human body key point;
and if so, judging that the height positions among all the human body key points in each frame of image meet the height relation.
6. The method according to claim 4, wherein the human body key points include a first key point, a second key point and a third key point, the first key point is higher than the second key point, the second key point is lower than the third key point in the target action, and whether the Euclidean distance between two hands in each frame of image satisfies the distance relationship is judged according to the coordinates of each human body key point, including:
calculating the Euclidean distance between two hands in each frame of image according to the coordinates of the first key point, the second key point and the third key point;
and when the Euclidean distance between the two hands of the next frame is larger than that of the current frame, judging that the next frame meets the distance relation.
7. The method according to any one of claims 1 to 6, wherein an instruction comparison table is preset, the instruction comparison table includes a corresponding relationship between gesture actions and control instructions and controlled devices, and after identifying target actions in the video data according to the human key points and corresponding coordinates, the method further includes:
searching the instruction comparison table to obtain a control instruction corresponding to the target action and the controlled equipment;
and controlling the controlled equipment according to the control instruction.
8. A target action recognition apparatus, characterized in that the apparatus comprises:
the selecting module is used for selecting a plurality of frames of images from the video data;
the extraction module is used for extracting human key point information in each frame of image, wherein the human key point information comprises human key points and corresponding coordinates;
and the identification module is used for identifying the target action in the video data according to the human body key point and the corresponding coordinate.
9. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a memory electrically connected with the one or more processors;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the method of any of claims 1-7.
10. A computer-readable storage medium, having stored thereon program code that can be invoked by a processor to perform the method according to any one of claims 1 to 7.
CN202010313131.0A 2020-04-20 2020-04-20 Target action recognition method, device, server and storage medium Pending CN113536857A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010313131.0A CN113536857A (en) 2020-04-20 2020-04-20 Target action recognition method, device, server and storage medium

Publications (1)

Publication Number Publication Date
CN113536857A true CN113536857A (en) 2021-10-22

Family

ID=78123629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010313131.0A Pending CN113536857A (en) 2020-04-20 2020-04-20 Target action recognition method, device, server and storage medium

Country Status (1)

Country Link
CN (1) CN113536857A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117701A (en) * 2018-06-05 2019-01-01 东南大学 Pedestrian's intension recognizing method based on picture scroll product
CN109814717A (en) * 2019-01-29 2019-05-28 珠海格力电器股份有限公司 Household equipment control method and device, control equipment and readable storage medium
CN109934111A (en) * 2019-02-12 2019-06-25 清华大学深圳研究生院 A kind of body-building Attitude estimation method and system based on key point
CN110471526A (en) * 2019-06-28 2019-11-19 广东工业大学 A kind of human body attitude estimates the unmanned aerial vehicle (UAV) control method in conjunction with gesture identification
CN110765814A (en) * 2018-07-26 2020-02-07 杭州海康威视数字技术股份有限公司 Blackboard writing behavior recognition method and device and camera
CN110781843A (en) * 2019-10-29 2020-02-11 首都师范大学 Classroom behavior detection method and electronic equipment
CN110858295A (en) * 2018-08-24 2020-03-03 广州汽车集团股份有限公司 Traffic police gesture recognition method and device, vehicle control unit and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114475577A (en) * 2021-12-17 2022-05-13 斑马网络技术有限公司 Vehicle control method, device and storage medium
CN114475577B (en) * 2021-12-17 2023-11-03 斑马网络技术有限公司 Vehicle control method, device and storage medium

Similar Documents

Publication Publication Date Title
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN107633203A (en) Facial emotions recognition methods, device and storage medium
CN110109541B (en) Multi-modal interaction method
CN109961041B (en) Video identification method and device and storage medium
CN113378770B (en) Gesture recognition method, device, equipment and storage medium
CN103677251A (en) Gesture recognition apparatus, control method thereof, display instrument
WO2021208617A1 (en) Method and apparatus for recognizing station entering and exiting, terminal, and storage medium
CN111027403A (en) Gesture estimation method, device, equipment and computer readable storage medium
CN111161726B (en) Intelligent voice interaction method, device, medium and system
CN112036261A (en) Gesture recognition method and device, storage medium and electronic device
CN112364799A (en) Gesture recognition method and device
CN111722700A (en) Man-machine interaction method and man-machine interaction equipment
CN113254491A (en) Information recommendation method and device, computer equipment and storage medium
CN112949689A (en) Image recognition method and device, electronic equipment and storage medium
CN111428666A (en) Intelligent family accompanying robot system and method based on rapid face detection
CN113536857A (en) Target action recognition method, device, server and storage medium
CN113220828B (en) Method, device, computer equipment and storage medium for processing intention recognition model
CN113542855A (en) Video processing method and device, electronic equipment and readable storage medium
CN113537122A (en) Motion recognition method and device, storage medium and electronic equipment
CN111210824A (en) Voice information processing method and device, electronic equipment and storage medium
CN116403199B (en) Screen icon semantic recognition method and system based on deep learning
CN109871128B (en) Question type identification method and device
WO2022161025A1 (en) Voiceprint recognition method and apparatus, electronic device, and readable storage medium
CN113591865B (en) Loop detection method and device and electronic equipment
CN115016641A (en) Conference control method, device, conference system and medium based on gesture recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination