CN115620398A - Target action detection method and device - Google Patents

Target action detection method and device

Info

Publication number
CN115620398A
CN115620398A (application CN202211391722.5A)
Authority
CN
China
Prior art keywords
image
target
human body
detection
human
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211391722.5A
Other languages
Chinese (zh)
Inventor
王青天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aibee Beijing Intelligent Technology Co Ltd
Original Assignee
Beijing Aibee Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Aibee Technology Co Ltd filed Critical Beijing Aibee Technology Co Ltd
Priority to CN202211391722.5A priority Critical patent/CN115620398A/en
Publication of CN115620398A publication Critical patent/CN115620398A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Psychiatry (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a target action detection method and apparatus. A human body detection algorithm identifies the human body region in an image to be detected, yielding a human body region detection frame, and a human skeleton point detection algorithm identifies human skeleton points within that frame; a target hand region image is then cropped from the image to be detected according to the skeleton points; finally, the hand region image is input into a pre-trained action classification model, which outputs the target action detection result. Because only the hand region image is cropped, background interference from the task scene image is reduced and detection accuracy improves. Moreover, since only the hand region image is used to detect whether a related object is present, the gap between simulated and real scenes shrinks, so simulated scenes with fine-tuning can serve as training samples to train a model with stronger generalization, which also improves model training efficiency.

Description

Target action detection method and device
Technical Field
The present application relates to the field of action recognition, and in particular to a target action detection method and apparatus.
Background
Existing service scenarios often require predicting human actions, and many of the required actions depend strongly on objects, such as signing a document or taking a photo with a mobile phone. In a real business scenario these objects are usually small and resemble other objects, and clients typically have strict confidentiality requirements, so it is difficult to collect large amounts of real-scene data; direct target detection therefore performs poorly. For example, detecting mobile-phone photography in a bank scenario requires detecting the phone itself, and direct target detection often mistakes cards, walkie-talkies and the like for a phone, producing many false photography alarms.
In the prior art, direct target detection is usually performed on the target image with a generic intelligent recognition method, but the cluttered image background in a real business scenario lowers the accuracy of the detection result. In addition, detecting a target action requires a classification model trained in advance, and because of the cluttered background a conventional classification model needs a large amount of training data, making model training inefficient.
Disclosure of Invention
Based on this, the present application provides a target action detection method and apparatus aimed at improving the accuracy of detection results.
To solve the above problems, the technical solution provided by the present application is as follows:
a target action detection method, the method comprising:
acquiring an image to be detected;
identifying a human body region from the image to be detected by using a human body detection algorithm to obtain a human body region detection frame;
identifying human skeleton points from the human body region detection frame by using a human skeleton point detection algorithm;
intercepting a target hand area image from the image to be detected according to the human skeleton points;
and inputting the target hand area image into a pre-trained motion classification model, and acquiring a target motion detection result in the target hand area image.
In a possible implementation, identifying human skeleton points in the human body region detection frame with a human skeleton point detection algorithm includes:
identifying the human skeleton points in the detection frame with a skeleton detection model.
In a possible implementation, cropping a target hand region image from the image to be detected according to the human skeleton points includes:
obtaining the center point of the region to be cropped by extending a line from the elbow skeleton point toward the wrist skeleton point, where the extension length is the product of a preset coefficient and the elbow-to-wrist distance;
determining the side length of the region to be cropped;
and cropping the target hand region image according to the center point and the side length of the region to be cropped.
In a possible implementation, determining the side length of the region to be cropped includes:
using a preset fixed value as the side length.
In a possible implementation, determining the side length of the region to be cropped includes:
computing the side length with an adaptive algorithm.
In a possible implementation, computing the side length with an adaptive algorithm includes:
setting a fixed coefficient;
and multiplying the fixed coefficient by the length and the width of the human body region detection frame respectively, taking the minimum of the two products as the side length.
In a possible implementation, training the pre-trained action classification model includes:
determining a target action according to the task scene;
constructing a positive sample set of images that include the target action and a negative sample set of images that do not;
and training the action classification model with the positive and negative sample sets.
A target action detection apparatus, the apparatus comprising:
an image acquisition module for acquiring an image to be detected;
a first identification module for identifying a human body region in the image to be detected with a human body detection algorithm to obtain a human body region detection frame;
a second identification module for identifying human skeleton points in the human body region detection frame with a human skeleton point detection algorithm;
an image cropping module for cropping a target hand region image from the image to be detected according to the human skeleton points;
and a detection result acquisition module for inputting the target hand region image into a pre-trained action classification model and obtaining the target action detection result.
A target action detection apparatus, the apparatus comprising:
a memory for storing instructions or code for target action detection;
and a processor for executing the instructions or code to implement the target action detection method described above.
A computer storage medium having code stored therein which, when executed, causes the executing apparatus to implement the target action detection method described above.
Compared with the prior art, the method has the following beneficial effects:
an image to be detected is first acquired; a human body detection algorithm then identifies the human body region to obtain a human body region detection frame, and a human skeleton point detection algorithm identifies human skeleton points within the frame; a target hand region image is then cropped from the image to be detected according to the skeleton points; finally, the hand region image is input into a pre-trained action classification model to obtain the target action detection result. Because only the hand region image is cropped, background interference from the task scene image is reduced and detection accuracy improves; and because only the hand region image is used to detect whether a related object is present, the gap between simulated and real scenes shrinks, so simulated scenes with fine-tuning can serve as training samples to train a model with stronger generalization, which also improves model training efficiency.
Drawings
To describe the technical solutions in the present embodiment or the prior art more clearly, the drawings needed for that description are briefly introduced below. Evidently the drawings described below show only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a target action detection method according to an embodiment of the present application;
Fig. 2 is a flowchart of a target action detection method in a specific application scenario according to an embodiment of the present application;
Fig. 3 is a schematic view of an application scenario of a target action detection method according to an embodiment of the present application;
Fig. 4 is a schematic diagram of human skeleton point detection provided by an embodiment of the present application;
Fig. 5 is a schematic diagram of positive and negative sample images provided by an embodiment of the present application;
Fig. 6 is a schematic structural diagram of a target action detection apparatus according to an embodiment of the present application.
Detailed Description
In the prior art, direct target detection is usually performed on a target image with an intelligent target recognition method, such as artificial-intelligence-based image recognition or object detection; this mainly involves first acquiring the image to be recognized and then detecting the target object or action with the above method.
Research shows, however, that the cluttered image background in real business scenarios lowers the accuracy of the detection result. In addition, detecting a target action requires a classification model trained in advance, and because of the cluttered background a conventional classification model needs a large amount of training data, making model training inefficient.
On this basis, the embodiment of the present application provides a target action detection method based on a human body topology prior: first acquire an image to be detected; then identify the human body region with a human body detection algorithm to obtain a human body region detection frame, and identify human skeleton points within the frame with a human skeleton point detection algorithm; then crop a target hand region image from the image to be detected according to the skeleton points; finally, input the hand region image into a pre-trained action classification model to obtain the target action detection result. Because only the hand region image is cropped, background interference from the task scene image is reduced and detection accuracy improves; and because only the hand region image is used to detect whether a related object is present, the gap between simulated and real scenes shrinks, so simulated scenes with fine-tuning can serve as training samples to train a model with stronger generalization, which also improves model training efficiency.
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present application; all other embodiments that a person skilled in the art can derive from them without creative effort fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a flowchart of the target action detection method provided in an embodiment of the present application, which includes:
S101: acquire an image to be detected.
The image to be detected is an image on which target action detection is to be performed; it is typically a frame captured from a video, and each frame of the video can be processed in turn for dynamic recognition.
The purpose of this step is to acquire a task image from the task scene.
S102: identify a human body region in the image to be detected with a human body detection algorithm to obtain a human body region detection frame.
The human body detection algorithm identifies the human body region in the image to be detected and frames it; in brief, the identified region is marked with a virtual frame, and that virtual frame is the human body region detection frame.
In the embodiment of the present application, a target detection method serves as the human body detection algorithm for identifying human body regions. A target detection task generally means finding target objects in an image or video and determining their positions and sizes; here, the specific position of the human body region is marked by the human body region detection frame. Of course, other human body detection algorithms can also be used for human body region identification.
S103: identify human skeleton points in the human body region detection frame with a human skeleton point detection algorithm.
Human skeleton point detection is one of the basic algorithms of computer vision and underpins related research such as behaviour recognition, person tracking and gait recognition. It detects key skeleton points of the human body, such as the joints and facial features, and describes the skeletal information of the body through these key points.
In a possible implementation, identifying human skeleton points in the human body region detection frame with a human skeleton point detection algorithm includes:
identifying the human skeleton points in the detection frame with a skeleton detection model.
The skeleton detection model marks the human skeleton points; examples include the OpenPose algorithm and the high-resolution network HRNet, and the embodiment of the present application does not limit the specific model.
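As a concrete illustration of how the skeleton points are consumed downstream, the sketch below assumes the COCO 17-keypoint layout that HRNet-family models commonly output (an assumption for illustration; OpenPose's BODY_25 format orders points differently), so the elbow and wrist points used in the cropping step can be looked up by name:

```python
# COCO 17-keypoint order, as commonly output by HRNet-style models.
# This layout is an assumption for illustration; check your model's docs.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def keypoint(skeleton, name):
    """Return the (x, y) entry for a named point from a 17-element
    list of (x, y) pairs produced by the skeleton model."""
    return skeleton[COCO_KEYPOINTS.index(name)]
```

With this layout, the right elbow and right wrist used in S104 are entries 8 and 10 of the model output.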
S104: crop the target hand region image from the image to be detected according to the human skeleton points.
The specific application scenario of the embodiment is hand action detection, and the target hand image is cropped in order to detect the object held in the hand; for example, a pen is held when signing, and a mobile phone is held when taking a photo.
In a possible implementation, cropping the target hand region image from the image to be detected according to the human skeleton points includes:
obtaining the center point of the region to be cropped by extending a line through the elbow and wrist skeleton points;
determining the side length of the region to be cropped;
and cropping the target hand region image according to the center point and the side length of the region to be cropped.
The line extends from the elbow toward the wrist, and the extension length is the product of a preset coefficient and the elbow-to-wrist distance.
The center point of the region to be cropped is the center of the final hand crop. Since the human skeleton points inside the human body region detection frame have already been obtained by the skeleton point detection algorithm, the hand position can be estimated from them and set as the center point of the region to be cropped.
After the center point of the region to be cropped is determined, its side length must also be determined; the side length fixes the size of the final crop frame. The crop frame is then centered on the center point and intersected with the image, which finally yields the target hand region image.
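The center-point and cropping computation just described can be sketched in a few lines. The coefficient value, and the convention that the extension is measured starting from the elbow, are illustrative guesses, since the application specifies neither:

```python
def crop_center(elbow, wrist, k=1.3):
    """Walk from the elbow toward the wrist for k times the
    elbow-to-wrist distance; with k > 1 the endpoint lands past the
    wrist, roughly at the hand. k = 1.3 is an illustrative guess."""
    (ex, ey), (wx, wy) = elbow, wrist
    return (ex + k * (wx - ex), ey + k * (wy - ey))

def crop_box(center, side, img_w, img_h):
    """Axis-aligned square of the given side length around `center`,
    intersected with the image bounds so the crop never leaves the
    frame, as described in the text."""
    cx, cy = center
    half = side / 2.0
    return (max(0, int(cx - half)), max(0, int(cy - half)),
            min(img_w, int(cx + half)), min(img_h, int(cy + half)))
```

For an elbow at (0, 0) and a wrist at (10, 0), the estimated hand center is (13, 0), and the square crop around it is clipped wherever it crosses the image border.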
In a possible implementation, determining the side length of the region to be cropped includes:
computing the side length with an adaptive algorithm.
Here, an adaptive algorithm means computing the value from the size of the target region in the image or from preset conditions.
In a possible implementation, computing the side length with an adaptive algorithm includes:
setting a fixed coefficient;
and multiplying the length and the width of the human body region detection frame by the fixed coefficient respectively, taking the minimum of the two products as the side length.
The fixed coefficient can be estimated from information such as the camera viewpoint.
Multiplying the length and width of the detection frame by the fixed coefficient and taking the minimum yields a side length adapted to the size of the human body region detection frame, which reduces the risk that an oversized crop brings in too many interfering items.
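A minimal sketch of this adaptive rule (the coefficient 0.25 is an illustrative guess, not a value from the application):

```python
def adaptive_side(frame_w, frame_h, k=0.25):
    """Scale the person detection frame's width and height by a fixed
    coefficient and keep the smaller of the two products, so the hand
    crop tracks the apparent person size without growing past the
    frame's narrower axis."""
    return min(k * frame_w, k * frame_h)
```

For a 200 by 400 detection frame with k = 0.25, the crop side length is 50 pixels.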
In a possible implementation, determining the side length of the region to be cropped includes:
using a preset fixed value as the side length.
Setting a fixed value is simpler and cheaper to compute, but it has drawbacks. For the same scene there may be images from two viewpoints, one very close to the human body or hand region and one far away; in the close-up image the hand occupies a relatively large area, so a single fixed side length cannot suit both views, and in one of them the hand crop ends up containing a large amount of background.
Therefore, compared with a fixed side length, the adaptive algorithm generalizes better at the cost of more computation.
S105: input the target hand region image into the pre-trained action classification model and obtain the target action detection result for the target hand region image.
The action classification model classifies the input hand images; for example, a hand holding a pen is classified as a writing action, and a hand holding a mobile phone as a photographing action.
In a possible implementation, training the pre-trained action classification model includes:
determining the target action according to the task scene;
constructing a positive sample set of images that include the target action and a negative sample set of images that do not;
and training the action classification model with the positive and negative sample sets.
The target action is a hand action expected in the task scene; for example, in a bank scenario the target action can be set to signing, and the model then judges whether a signing pen is held in the hand in the image to be classified.
The positive sample set contains images that include the target action, and the negative sample set contains images that do not.
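The sample-set construction reduces to pairing cropped hand images with binary labels; a minimal sketch (the list-of-paths interface is hypothetical, chosen only for illustration):

```python
def build_samples(positive_paths, negative_paths):
    """Label each cropped hand image path: 1 if the target action is
    present (e.g. a pen held for signing), 0 otherwise. The resulting
    pairs feed a standard classifier training loop."""
    return ([(path, 1) for path in positive_paths]
            + [(path, 0) for path in negative_paths])
```

Because the crops exclude most of the scene background, simulated-scene images can populate both sets with a smaller gap to real deployment data, which is the training-efficiency argument made above.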
The embodiment of the present application thus provides a target action detection method: first acquire an image to be detected; then identify the human body region with a human body detection algorithm to obtain a human body region detection frame, and identify human skeleton points within the frame with a human skeleton point detection algorithm; then crop a target hand region image from the image to be detected according to the skeleton points; finally, input the hand region image into a pre-trained action classification model to obtain the target action detection result. Because only the hand region image is cropped, background interference from the task scene image is reduced and detection accuracy improves; and because only the hand region image is used to detect whether a related object is present, the gap between simulated and real scenes shrinks, so simulated scenes with fine-tuning can serve as training samples to train a model with stronger generalization, which also improves model training efficiency.
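The five steps S101 to S105 can be wired together as follows. The three model callables are hypothetical stand-ins, since the application does not fix concrete models, and `k` and `side` are illustrative values:

```python
def detect_target_actions(image, person_detector, skeleton_model,
                          classifier, k=1.3, side=64):
    """Sketch of steps S101-S105 with hypothetical model interfaces:
      person_detector(image)       -> list of person detection frames
      skeleton_model(image, frame) -> dict with 'elbow'/'wrist' (x, y)
      classifier(crop)             -> action label for one hand crop
    `image` is any object the callables accept, carrying a
    (width, height) pair under 'size' for clamping the crop."""
    (img_w, img_h), labels = image["size"], []
    for frame in person_detector(image):                      # S102
        points = skeleton_model(image, frame)                 # S103
        (ex, ey), (wx, wy) = points["elbow"], points["wrist"]
        cx, cy = ex + k * (wx - ex), ey + k * (wy - ey)       # S104 center
        crop = (max(0, cx - side / 2), max(0, cy - side / 2),
                min(img_w, cx + side / 2), min(img_h, cy + side / 2))
        labels.append(classifier(crop))                       # S105
    return labels
```

Any person detector, skeleton model and classifier with these shapes can be plugged in; the pipeline itself carries no model-specific logic.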
The target action detection method provided in the embodiment of the present application has been introduced above; it is now illustrated with a specific application scenario.
Referring to fig. 2, fig. 2 is a flowchart of a scenario embodiment in a specific application scenario provided in the embodiment of the present application. Assuming the target action in the task scene is signing, the scenario embodiment specifically includes:
s201: and setting a target action as a signature, and acquiring an image to be detected.
The target action is set as a signature for exemplary illustration only, and the target action can be set as smoking, taking a picture, etc. according to the task scene.
S202: identify a human body region in the image to be detected with a human body detection algorithm to obtain a human body region detection frame.
Referring to fig. 3, fig. 3 is a schematic view of an application scenario of the target action detection method, in which the larger frame is the human body region detection frame. The human body detection algorithm used is not specifically limited.
S203: identify human skeleton points in the human body region detection frame with a human skeleton point detection algorithm.
Referring to fig. 4, fig. 4 is a schematic diagram of the human skeleton points obtained by the skeleton point algorithm; the lines in the human body region show the distribution of the skeleton points.
S204: crop the target hand region image.
In a possible implementation, cropping the target hand region image from the image to be detected according to the human skeleton points includes:
obtaining the center point of the region to be cropped by extending a line through the elbow and wrist skeleton points;
determining the side length of the region to be cropped;
and cropping the target hand region image according to the center point and the side length of the region to be cropped.
The line extends from the elbow toward the wrist (the arrow in fig. 3 shows the extension direction), and the extension length is the product of a preset coefficient and the elbow-to-wrist distance.
In a possible implementation, determining the side length of the region to be cropped includes:
computing the side length with an adaptive algorithm, or using a preset fixed value.
In a possible implementation, computing the side length with an adaptive algorithm includes:
setting a fixed coefficient;
and multiplying the length and the width of the human body region detection frame by the fixed coefficient respectively, taking the minimum of the two products as the side length.
The final target hand region image obtained with the adaptive algorithm is shown as the smaller frame in fig. 3.
S205: input the target hand region image into the pre-trained action classification model and obtain the target action detection result for the target hand region image.
In a possible implementation, training the pre-trained action classification model includes:
determining the target action according to the task scene;
constructing a positive sample set of images that include the target action and a negative sample set of images that do not;
and training the action classification model with the positive and negative sample sets.
Determining the target action according to the task scene fixes the action to be detected; positive and negative samples can be constructed in a targeted way only after the target action, here signing, is determined.
Fig. 5 is a schematic diagram of positive and negative sample images: the positive samples are the two images on the right in which a pen is held in the hand, and the negative samples are the two images on the left without a pen.
The trained action classification model can identify and classify the current action from the input picture.
This scenario embodiment demonstrates, in combination with an actual application scenario, how to use the target action detection method provided by the present application; practical tests show that the method generalizes well and is practical in real application scenarios.
The foregoing describes some specific implementations of the target action detection method provided in the embodiments of the present application. On this basis, the present application also provides a corresponding apparatus, which is described below in terms of functional modules.
Referring to fig. 6, the target action detection apparatus includes:
the to-be-detected image acquisition module 601 is used for acquiring an image to be detected;
a first identification module 602, configured to identify a human body region from the image to be detected by using a human body detection algorithm, so as to obtain a human body region detection frame;
a second identifying module 603, configured to identify human bone points from the human body region detection frame by using a human bone point detection algorithm;
an image capture module 604, configured to capture an image of a target hand region from the image to be detected according to the human skeleton point;
a detection result obtaining module 605, configured to input the target hand region image into a pre-trained motion classification model, and obtain a target motion detection result in the target hand region image.
In a possible implementation, the first identification module 602 is specifically configured to:
identify a human body region in the image to be detected with a target detection method to obtain a human body region detection frame.
In a possible implementation, the second identification module 603 is specifically configured to:
identify human skeleton points in the human body region detection frame with a skeleton detection model.
In one possible implementation, the image intercepting module 604 includes:
the central point acquisition unit is used for obtaining the central point of the part to be intercepted by constructing an extension line through the skeleton points at the elbow and the wrist, wherein the extension line extends in the direction from the elbow to the wrist, and the extension length is the product of a preset coefficient and the distance from the elbow to the wrist;
the side length determining unit of the part to be intercepted is used for determining the side length of the part to be intercepted;
the image intercepting unit is used for intercepting the target hand area image according to the position of the central point and the side length of the part to be intercepted.
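As a concrete illustration of the central-point construction above, the geometry can be sketched as follows; the coefficient and side-length values are assumed for illustration and are not fixed by the application.

```python
def hand_crop_box(elbow, wrist, coef=1.3, side=64):
    """Compute the square hand crop from elbow and wrist skeleton points.

    The crop centre lies on the straight line through the elbow and
    wrist, extended in the elbow-to-wrist direction; the extension
    length is coef (the preset coefficient; 1.3 is an assumed value)
    times the elbow-to-wrist distance.  Returns (x0, y0, x1, y1).
    """
    ex, ey = elbow
    wx, wy = wrist
    cx = ex + coef * (wx - ex)   # centre of the part to be intercepted
    cy = ey + coef * (wy - ey)
    half = side / 2.0
    return (cx - half, cy - half, cx + half, cy + half)
```

For example, with the elbow at (100, 100), the wrist at (150, 120), and a coefficient of 1.2, the centre falls at (160, 124), and a 64-pixel square crop spans (128, 92) to (192, 156).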
In a possible implementation manner, the side length determining unit of the part to be intercepted is specifically configured to:
obtain the side length of the part to be intercepted by using an adaptive algorithm, or set a fixed value as the side length of the part to be intercepted.
In a possible implementation manner, the side length determining unit of the portion to be intercepted includes:
a coefficient setting subunit for setting a fixed coefficient;
the side length determining subunit is used for multiplying the fixed coefficient by the length and the width of the human body region detection frame respectively, and taking the smaller of the two products as the side length of the part to be intercepted.
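The adaptive side-length rule above can be sketched in a few lines; the coefficient value 0.25 is an assumption for illustration.

```python
def adaptive_side_length(box_width, box_height, coef=0.25):
    # Multiply the fixed coefficient (0.25 is an assumed value) by the
    # width and the height of the human body region detection frame,
    # and take the smaller product as the side of the square crop.
    return min(coef * box_width, coef * box_height)
```

Taking the smaller product keeps the square crop inside the detection frame regardless of whether the person appears tall and narrow or short and wide.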
In one possible implementation manner, the detection result acquisition module 605 includes:
the model pre-training unit is used for pre-training the action classification model;
the input unit is used for inputting the target hand area image into a pre-trained action classification model;
the detection result acquisition unit is used for acquiring the target action detection result in the target hand area image.
In one possible implementation, the model pre-training unit includes:
the target action determining subunit is used for determining a target action according to the task scene;
a sample set constructing subunit, configured to construct a positive sample set from images including the target action and a negative sample set from images not including the target action;
the model training subunit is used for training the action classification model by taking the positive sample set and the negative sample set as training samples.
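The sample-set construction described by the subunits above can be sketched as follows; the function name and labels are illustrative, and the classifier itself (e.g. a small convolutional network) is outside the scope of this sketch.

```python
import random

def build_training_samples(positive_images, negative_images, seed=0):
    """Assemble training samples for the binary action classifier.

    Images containing the target action form the positive set (label 1);
    images without it form the negative set (label 0).  Only the
    sample-set construction described above is shown here.
    """
    samples = ([(img, 1) for img in positive_images]
               + [(img, 0) for img in negative_images])
    random.Random(seed).shuffle(samples)  # mix classes before training
    return samples
```

The resulting list of (image, label) pairs can then be fed to whatever training loop the chosen classification model requires.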
According to the method and apparatus, only the target hand area image needs to be intercepted, which reduces background interference in the task scene image and improves detection accuracy. In addition, because only the target hand area image is intercepted to detect whether related objects are present, the difference between the simulated scene and the real scene is reduced, so that simulated scenes, with minor fine-tuning, can be used as training samples to train a model with stronger generalization, which also improves model training efficiency.
The embodiment of the application also provides corresponding equipment and a computer storage medium, which are used for realizing the scheme provided by the embodiment of the application.
The device comprises a memory and a processor, wherein the memory is used for storing instructions or codes, and the processor is used for executing the instructions or codes so as to enable the device to execute the target action detection method in any embodiment of the application.
The computer storage medium has code stored therein, and when the code is executed, an apparatus for executing the code implements the target action detection method according to any embodiment of the present application.
In the embodiments of the present application, the terms "first" and "second" (if any) are used merely for identification and do not denote any order.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the method of the above embodiments may be implemented by software plus a general hardware platform. Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a read-only memory (ROM)/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a router) to execute the method according to the embodiments or some parts of the embodiments of the present application.
All the embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only an exemplary embodiment of the present application, and is not intended to limit the scope of the present application.

Claims (10)

1. A target motion detection method, the method comprising:
acquiring an image to be detected;
identifying a human body region from the image to be detected by using a human body detection algorithm to obtain a human body region detection frame;
identifying human skeleton points from the human body region detection frame by using a human skeleton point detection algorithm;
intercepting a target hand area image from the image to be detected according to the human skeleton point;
inputting the target hand area image into a pre-trained action classification model, and acquiring a target action detection result in the target hand area image.
2. The method of claim 1, wherein the identifying human skeletal points from the human region detection box using a human skeletal point detection algorithm comprises:
identifying human skeleton points from the human body region detection frame by using a skeleton detection model.
3. The method according to claim 1, wherein said intercepting an image of a target hand region from said image to be detected based on said human skeletal points comprises:
obtaining a central point of a part to be intercepted by constructing an extension line through the skeleton points at the elbow and the wrist, wherein the extension line extends in the direction from the elbow to the wrist, and the extension length is the product of a preset coefficient and the distance from the elbow to the wrist;
determining the side length of a part to be intercepted;
intercepting the target hand area image according to the position of the central point and the side length of the part to be intercepted.
4. The method of claim 3, wherein determining the side length of the portion to be truncated comprises:
setting a fixed value as the side length of the part to be intercepted.
5. The method of claim 4, wherein determining the side length of the portion to be truncated comprises:
obtaining the side length of the part to be intercepted by using an adaptive algorithm.
6. The method of claim 5, wherein the obtaining the side length of the portion to be truncated by using an adaptive algorithm comprises:
setting a fixed coefficient;
multiplying the fixed coefficient by the length and the width of the human body region detection frame respectively, and taking the smaller of the two products as the side length of the part to be intercepted.
7. The method of claim 1, wherein the training method of the pre-trained action classification model comprises:
determining a target action according to a task scene;
constructing a positive sample set from images including the target action and a negative sample set from images not including the target action;
training an action classification model by taking the positive sample set and the negative sample set as training samples.
8. An object motion detection apparatus, characterized in that the apparatus comprises:
the image acquisition module to be detected is used for acquiring an image to be detected;
the first identification module is used for identifying a human body region from the image to be detected by using a human body detection algorithm to obtain a human body region detection frame;
the second identification module is used for identifying human skeleton points from the human body region detection frame by using a human skeleton point detection algorithm;
the image intercepting module is used for intercepting a target hand area image from the image to be detected according to the human skeleton points;
the detection result acquisition module is used for inputting the target hand area image into a pre-trained action classification model and acquiring a target action detection result in the target hand area image.
9. A target motion detection apparatus, characterized in that the apparatus comprises:
a memory for storing instructions or code for the target action detection;
a processor for executing the instructions or codes of the target action detection to implement the method of target action detection of any one of claims 1-7.
10. A computer storage medium having code stored therein, wherein when the code is executed, an apparatus executing the code implements the method of target action detection of any one of claims 1-7.
CN202211391722.5A 2022-11-08 2022-11-08 Target action detection method and device Pending CN115620398A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211391722.5A CN115620398A (en) 2022-11-08 2022-11-08 Target action detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211391722.5A CN115620398A (en) 2022-11-08 2022-11-08 Target action detection method and device

Publications (1)

Publication Number Publication Date
CN115620398A true CN115620398A (en) 2023-01-17

Family

ID=84877953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211391722.5A Pending CN115620398A (en) 2022-11-08 2022-11-08 Target action detection method and device

Country Status (1)

Country Link
CN (1) CN115620398A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116664819A (en) * 2023-05-17 2023-08-29 武汉大学中南医院 Medical staff hand recognition positioning method, device, equipment and storage medium
CN116664819B (en) * 2023-05-17 2024-01-09 武汉大学中南医院 Medical staff hand recognition positioning method, device, equipment and storage medium
CN117523665A (en) * 2023-11-13 2024-02-06 书行科技(北京)有限公司 Training method of human motion prediction model, related method and related product

Similar Documents

Publication Publication Date Title
CN111242088B (en) Target detection method and device, electronic equipment and storage medium
CN109934065B (en) Method and device for gesture recognition
CN115620398A (en) Target action detection method and device
CN110688929B (en) Human skeleton joint point positioning method and device
Sepas-Moghaddam et al. Light field-based face presentation attack detection: reviewing, benchmarking and one step further
CN110059642B (en) Face image screening method and device
EP2956891B1 (en) Segmenting objects in multimedia data
CN111008935B (en) Face image enhancement method, device, system and storage medium
CN109241888B (en) Neural network training and object recognition method, device and system and storage medium
CN113298158B (en) Data detection method, device, equipment and storage medium
CN108229375B (en) Method and device for detecting face image
CN111563490B (en) Face key point tracking method and device and electronic equipment
JP2014032623A (en) Image processor
CN110287848A (en) The generation method and device of video
CN111881740B (en) Face recognition method, device, electronic equipment and medium
CN111126254A (en) Image recognition method, device, equipment and storage medium
CN108875500B (en) Pedestrian re-identification method, device and system and storage medium
CN115482556A (en) Method for key point detection model training and virtual character driving and corresponding device
CN114299365B (en) Method and system for detecting hidden back door of image model, storage medium and terminal
CN108229281B (en) Neural network generation method, face detection device and electronic equipment
CN109711287A (en) Face acquisition method and Related product
CN111062019A (en) User attack detection method and device and electronic equipment
CN111860057A (en) Face image blurring and living body detection method and device, storage medium and equipment
KR102542683B1 (en) Method and apparatus for classifying action based on hand tracking
CN115018886A (en) Motion trajectory identification method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20231024

Address after: Room 1201, 12 / F, building 1, zone 2, No. 81, Beiqing Road, Haidian District, Beijing 100094

Applicant after: AIBEE (BEIJING) INTELLIGENT TECHNOLOGY Co.,Ltd.

Address before: 100094 room 1202, 12 / F, 13 / F, building 1, zone 2, 81 Beiqing Road, Haidian District, Beijing

Applicant before: Beijing Aibi Technology Co.,Ltd.