CN111783665A - Action recognition method and device, storage medium and electronic equipment - Google Patents

Action recognition method and device, storage medium and electronic equipment

Info

Publication number
CN111783665A
CN111783665A
Authority
CN
China
Prior art keywords
detection
detection frame
motion recognition
sample
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010624706.0A
Other languages
Chinese (zh)
Inventor
黄泽
张泽覃
陈冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Innovation Qizhi Xi'an Technology Co ltd
Original Assignee
Innovation Qizhi Xi'an Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innovation Qizhi Xi'an Technology Co ltd filed Critical Innovation Qizhi Xi'an Technology Co ltd
Priority to CN202010624706.0A priority Critical patent/CN111783665A/en
Publication of CN111783665A publication Critical patent/CN111783665A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides a motion recognition method and device, a storage medium and electronic equipment. The motion recognition method comprises the following steps: acquiring an image to be processed; inputting the image to be processed into a pre-trained detection model, and acquiring a first detection frame containing an overlapped portion, wherein the first detection frame represents an area where a first object in the image to be processed is located; performing edge detection on the first detection frame to obtain gradient values of object edge pixels in the overlapped portion; deleting the overlapped portion from the first detection frame to obtain a target detection frame in the case that the gradient values of the object edge pixels within the overlapped portion are concave; and inputting the target detection frame into a pre-trained motion recognition model to obtain a motion recognition result of the first object. By deleting the pixels of the overlapped portion from the first detection frame, the embodiment of the application avoids interference from objects other than the first object.

Description

Action recognition method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for motion recognition, a storage medium, and an electronic device.
Background
Action recognition is a research hotspot in the field of computer vision in recent years, and is widely applied to multiple fields such as intelligent monitoring and the like.
Currently, existing motion recognition methods include a local search detection method, in which the key points of all objects (e.g., pedestrians) in an image are detected through a skeletal key-point recognition algorithm, and the key points are then combined through a skeletal-point pairing and matching algorithm, so that the motion postures of the objects are estimated.
In the process of implementing the invention, the inventors found that the following problem exists in the prior art: in dense-crowd scenes, occlusion of objects is unavoidable, and when an object is occluded, the local search detection method may incorrectly splice body parts belonging to different people, so that action recognition errors occur.
Disclosure of Invention
An object of the embodiments of the present application is to provide a motion recognition method, a motion recognition device, a storage medium, and an electronic device, so as to solve a problem of motion recognition errors caused by an object being blocked.
In a first aspect, an embodiment of the present application provides an action recognition method, where the action recognition method includes: acquiring an image to be processed; inputting the image to be processed into a pre-trained detection model, and acquiring a first detection frame containing an overlapped portion, wherein the first detection frame represents an area where a first object in the image to be processed is located; performing edge detection on the first detection frame to obtain gradient values of object edge pixels in the overlapped portion; in the case that the gradient values of the object edge pixels within the overlapped portion are concave, deleting the overlapped portion from the first detection frame to obtain a target detection frame; and inputting the target detection frame into a pre-trained motion recognition model to obtain a motion recognition result of the first object.
Therefore, in the embodiment of the application, in the case that the gradient value of the edge pixel of the object in the overlapped part is concave, the overlapped part is determined to be the blocked part, and then the pixel in the overlapped part is deleted from the first detection frame, so that the interference of other objects except the first object is avoided, and the problem of motion recognition error caused by the blocked object is solved.
In addition, the target detection frame is detected by using the pre-trained motion recognition model in the embodiment of the application, so that the embodiment of the application can accurately recognize the motion of the first object and is therefore suitable for dense crowd scenes.
In a possible embodiment, the overlap portion is an overlap portion of a second detection frame and the first detection frame, the second detection frame represents an area where a second object is located in the image to be processed, and the overlap portion is deleted from the first detection frame in a case where gradient values of object edge pixels in the overlap portion are concave, including: calculating the coincidence degree of the first detection frame and the second detection frame; and deleting the overlapped part from the first detection frame under the condition that the overlapping degree is less than the preset overlapping degree.
Therefore, according to the embodiment of the application, the overlapped part in the detection frame with the smaller overlapping degree is deleted, so that the interference of other object information is avoided.
In one possible embodiment, the motion recognition method further includes: and deleting the first detection frame under the condition that the contact ratio is greater than or equal to the preset contact ratio and the confidence coefficient of the first detection frame is smaller than that of the second detection frame.
Therefore, the detection frames which are seriously overlapped can be filtered out.
In a possible embodiment, before the inputting the image to be processed into the pre-trained detection model and acquiring the first detection frame containing the overlapped part, the motion recognition method further includes: inputting a sample image used for training an initial detection model into the initial detection model, and obtaining a prediction frame, wherein the prediction frame represents a prediction area where a sample object in the sample image is located; determining a first loss value according to the prediction frame and a first sample detection frame corresponding to the sample image, wherein the first sample detection frame represents an area where a sample object in the sample image is located; and adjusting parameters of the initial detection model by using the first loss value to obtain the pre-trained detection model.
Therefore, the detection model is trained in advance, so that the detection frame can be directly obtained, and a new model does not need to be established before the process of obtaining the detection frame each time.
In one possible embodiment, the first loss value comprises a regression loss value, and the motion recognition method further comprises calculating the regression loss value according to the following formula:
L_CIoU = 1 - IoU + ρ^2(b, b^gt)/c^2 + a·v
v = (4/π^2)·(arctan(w^gt/h^gt) - arctan(w/h))^2
a = v/((1 - IoU) + v)
wherein L_CIoU represents the regression loss value, IoU represents the intersection over union of the prediction box and the first sample detection box, ρ(b, b^gt) represents the distance between the center point of the prediction box and the center point of the first sample detection box, c represents the diagonal length of the minimum bounding rectangle of the prediction box and the first sample detection box, a represents a first hyper-parameter, v represents the aspect-ratio consistency term, w^gt represents the width of the first sample detection box, h^gt represents the length of the first sample detection box, w represents the width of the prediction box, and h represents the length of the prediction box.
Therefore, in order to fit the regression frame more accurately, a regression Loss function CIoU Loss is introduced into the loss function used to calculate the first loss value in the embodiment of the present application, so that the overlapping area, the center-point distance, and the aspect-ratio difference between the first sample detection frame and the prediction frame are used as penalty terms, thereby avoiding additional interference pixels.
In a possible embodiment, before the inputting the target detection box into a pre-trained motion recognition model and obtaining a motion recognition result of the first object, the motion recognition method further includes: inputting a second sample detection box for training an initial motion recognition model into the initial motion recognition model to obtain a predicted motion recognition result; determining a second loss value according to the predicted action recognition result and the sample action recognition result corresponding to the second sample detection frame; and adjusting parameters of the initial motion recognition model by using the second loss value to obtain the pre-trained motion recognition model.
Therefore, the motion recognition result can be directly obtained by training the motion recognition model in advance, and a new model does not need to be established before the process of obtaining the motion recognition result each time.
In one possible embodiment, the determining a second loss value according to the predicted motion recognition result and the sample motion recognition result corresponding to the second sample detection box includes calculating the second loss value according to the following formula:
L_1 = μ·L_fl + L_softmax
wherein L_1 represents the second loss value, μ represents a second hyper-parameter regulating the ratio of the classification Loss function Focal Loss to the cross-entropy Loss function Softmax Loss, L_fl represents the classification loss value calculated by Focal Loss, and L_softmax represents the cross-entropy loss value calculated by Softmax Loss.
In a second aspect, an embodiment of the present application provides a motion recognition apparatus, including: the acquisition module is used for acquiring an image to be processed; the input module is used for inputting the image to be processed into a pre-trained detection model and acquiring a first detection frame containing a superposition part, wherein the first detection frame represents an area where a first object in the image to be processed is located; an edge detection module, configured to perform edge detection on the first detection frame to obtain a gradient value of an object edge pixel in the overlapped portion; a deleting module, configured to delete the overlapped portion from the first detection frame to obtain a target detection frame, if a gradient value of an object edge pixel within the overlapped portion is concave; the input module is further configured to input the target detection box into a pre-trained motion recognition model, and obtain a motion recognition result of the first object.
In a third aspect, an embodiment of the present application provides a storage medium, where a computer program is stored on the storage medium, and when the computer program is executed by a processor, the computer program performs the method according to the first aspect or any optional implementation manner of the first aspect.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the method of the first aspect or any of the alternative implementations of the first aspect.
In a fifth aspect, the present application provides a computer program product which, when run on a computer, causes the computer to perform the method of the first aspect or any possible implementation manner of the first aspect.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a flowchart illustrating a method for motion recognition provided by an embodiment of the present application;
FIG. 2 is a flow chart illustrating a method for training an initial detection model according to an embodiment of the present disclosure;
FIG. 3 is a flow chart illustrating a method for training an initial motion recognition model according to an embodiment of the present application;
fig. 4 shows a specific flowchart of an action recognition method provided in an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating an image to be processed according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating an annotated image to be processed according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an object detection box in an embodiment of the present application;
FIG. 8 is a schematic diagram of another object detection box in the embodiment of the present application;
fig. 9 is a block diagram illustrating a structure of a motion recognition apparatus according to an embodiment of the present application;
fig. 10 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Currently, conventional motion recognition methods also include a two-stage detection method. The two-stage detection method predicts a detection frame containing an object to be detected (e.g., a person) through a detection model, then calculates the positions of the skeletal key points of the object by performing key-point recognition on the detection frame, and further performs posture matching of standard actions through established rules based on the predicted skeletal key points, thereby predicting the action of the object.
In addition, in dense-crowd scenes an object is inevitably occluded, so the detection frames obtained during the processing of the two-stage detection method may contain interference from other objects, and the prediction of the skeletal points is correspondingly affected.
Moreover, the effect of the two-stage detection method is easily influenced by the quality of the detection frame; if the predicted detection frame is over-expanded or missing, the action recognition effect is disturbed to varying degrees.
In addition, when the two-stage detection method and the local search detection method face dense-crowd scenes (for example, vehicles such as subways), the detection effect decreases linearly as the number of objects in the image increases, and the real-time requirement may not be met.
Based on this, the embodiment of the present application provides an action recognition scheme: an image to be processed is acquired; the image to be processed is input into a pre-trained detection model to acquire a first detection frame containing an overlapped portion, where the first detection frame indicates the area in which a first object in the image to be processed is located; edge detection is then performed on the first detection frame to obtain the gradient values of the object edge pixels in the overlapped portion; in the case where the gradient values of the object edge pixels in the overlapped portion are concave, the overlapped portion is deleted from the first detection frame to obtain a target detection frame; and finally, the target detection frame is input into the pre-trained action recognition model to acquire the action recognition result of the first object.
Therefore, the embodiment of the present application avoids interference of objects other than the first object by determining that the overlapped part is a blocked part in a case where the gradient value of the object edge pixel in the overlapped part is concave, and then deleting the pixel in the overlapped part from the first detection frame.
In addition, the target detection frame is detected by using the pre-trained motion recognition model in the embodiment of the application, so that the embodiment of the application can accurately recognize the motion of the first object and is therefore suitable for dense crowd scenes.
Referring to fig. 1, fig. 1 shows a flowchart of an action recognition method provided in an embodiment of the present application, and it should be understood that the method shown in fig. 1 may be executed by an action recognition device, which may correspond to the device shown in fig. 9 below, and the device may be various devices capable of executing the method, such as a personal computer, a server, or a network device, for example, and the present application is not limited thereto. The motion recognition method shown in fig. 1 includes:
step S110, an image to be processed is acquired.
Specifically, the image to be processed can be acquired through the image acquisition device, so that the motion recognition device can perform motion recognition on the acquired image to be processed subsequently.
It should be understood that the specific device of the image capturing device may be set according to actual requirements, and the embodiments of the present application are not limited thereto.
For example, the image capturing device may be a camera or the like provided in a subway.
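For instance (a minimal sketch; OpenCV and the device index are assumptions, since the embodiment does not specify how the image acquisition device is read), frames could be pulled from such a camera as follows:

import cv2

def acquire_images(device_index=0):
    """Read frames from a camera (e.g., one installed in a subway car) as images to be processed."""
    capture = cv2.VideoCapture(device_index)  # device_index is an assumed placeholder
    try:
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            yield frame  # each frame is one image to be processed
    finally:
        capture.release()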
Step S120, inputting the image to be processed into a detection model trained in advance, and acquiring a first detection frame containing the overlapped part. The first detection frame represents an area where a first object in the image to be processed is located.
It should be understood that the overlapped portion refers to an overlapped portion between the current detection frame and the other detection frame other than the current detection frame. The overlapped part can include not only the object in the current detection frame, but also objects in other detection frames (for example, part of limbs of the objects in other detection frames, etc.); the overlapped part may also only contain the object in the current detection frame.
It is also understood that the first object may be a pedestrian, an animal, or the like.
In order to facilitate understanding of the embodiments of the present application, the following description will be given by way of specific examples.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for training an initial detection model according to an embodiment of the present disclosure. The method shown in fig. 2 comprises:
step S210, inputting the sample image into the initial detection model, and obtaining a prediction frame. The sample image belongs to sample data, the sample image is used for training an initial detection model, and the prediction box represents a prediction area where a sample object in the sample image is located. That is, the prediction box is the output result of the initial detection model.
It should be understood that the sample image may be pre-processed before being input into the initial detection model, and then the pre-processed sample image may be input into the initial detection model, which is not limited to this embodiment of the application.
For example, the pre-processing of the sample image may include randomly erasing pixels in the sample image, so that the detection performance of the initial detection model can be improved in the case where a pedestrian is occluded.
As another example, the pre-processing of the sample image may also include processing the sample image with a random color disturbance strategy, so as to increase the data volume and improve the robustness of the trained model.
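By way of a non-authoritative sketch (the embodiment does not name a toolkit; torchvision and all numeric values below are assumptions), the two pre-processing strategies could be composed as follows:

import torchvision.transforms as T

# A sketch of the described pre-processing: random color disturbance to enlarge the
# data volume, plus random pixel erasing to simulate occluded pedestrians.
# All numeric values are illustrative assumptions, not values from the patent.
sample_preprocess = T.Compose([
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),  # random color disturbance
    T.ToTensor(),                                                           # RandomErasing expects a tensor image
    T.RandomErasing(p=0.5, scale=(0.02, 0.2), value='random'),              # random pixel erasing
])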
In addition, it should be noted that, in addition to outputting the prediction box, the initial detection model may also output the confidence of the prediction box.
Correspondingly, for the pre-trained detection model, in addition to outputting the first detection box, the pre-trained detection model may also output the confidence level of the first detection box.
In step S220, a first loss value is determined according to the prediction frame and a first sample detection frame corresponding to the sample image. Wherein, the first sample detection box represents the area of the sample object in the sample image.
It should be understood that the first sample detection box represents the real area where the sample object in the sample image is located.
It should also be understood that the calculation formula of the first loss value may be set according to actual requirements, and the embodiment of the present application is not limited thereto.
For example, the first loss value may be the sum of a first classification loss value and a regression loss value, where the first classification loss value is calculated through the first classification Loss function Focal Loss, and the regression loss value is calculated through the regression Loss function CIoU Loss.
It should also be understood that the calculation formula corresponding to the first classification Loss function Focal Loss and the calculation formula corresponding to the regression Loss function CIoU Loss may be set according to actual requirements, and the embodiment of the present application is not limited thereto.
For example, the regression Loss function CIoU Loss may be calculated according to the following formula, specifically:
L_CIoU = 1 - IoU + ρ^2(b, b^gt)/c^2 + a·v
v = (4/π^2)·(arctan(w^gt/h^gt) - arctan(w/h))^2
a = v/((1 - IoU) + v)
wherein L_CIoU represents the regression loss value, IoU represents the intersection over union of the prediction box and the first sample detection box, ρ(b, b^gt) represents the distance between the center point of the prediction box and the center point of the first sample detection box, c represents the diagonal length of the minimum bounding rectangle of the prediction box and the first sample detection box, a represents a first hyper-parameter for adjusting the weight value, v represents the aspect-ratio consistency term, w^gt represents the width of the first sample detection box, h^gt represents the length of the first sample detection box, w represents the width of the prediction box, and h represents the length of the prediction box.
Therefore, in order to fit the regression frame more accurately, a regression Loss function CIoU Loss is introduced into the loss function used to calculate the first loss value in the embodiment of the present application, so that the overlapping area, the center-point distance, and the aspect-ratio difference between the first sample detection frame and the prediction frame are used as penalty terms, thereby avoiding additional interference pixels.
For another example, the calculation formula of the first classification Loss function Focal Loss may be an existing function, and the embodiment of the present application is not limited thereto.
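As a hedged illustration, the regression loss described above could be computed as in the following Python sketch for a single prediction box and first sample detection box, each given as (x1, y1, x2, y2); the function follows the standard CIoU formulation that the symbol definitions above describe, and the epsilon terms are assumptions added for numerical stability.

import math

def ciou_loss(pred, gt, eps=1e-7):
    """CIoU regression loss for two boxes given as (x1, y1, x2, y2).

    Implements L_CIoU = 1 - IoU + rho^2(b, b_gt)/c^2 + a*v, matching the symbols in the text.
    """
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt

    # Intersection over union.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter + eps
    iou = inter / union

    # Squared center-point distance (rho^2) and squared diagonal of the
    # minimum bounding rectangle of both boxes (c^2).
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v and its weight a.
    v = (4 / math.pi ** 2) * (math.atan((gx2 - gx1) / (gy2 - gy1 + eps))
                              - math.atan((px2 - px1) / (py2 - py1 + eps))) ** 2
    a = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + a * v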
Step S230, adjusting parameters of the initial detection model by using the first loss value to obtain a pre-trained detection model.
Further, it should be noted that, although fig. 2 shows a training process of the initial detection model, it should be understood by those skilled in the art that, in the case that the detection model is a pre-trained detection model, the process of fig. 2 may be omitted, i.e., the initial detection model does not need to be trained each time.
In addition, after the image to be processed is input to the pre-trained detection model, the pre-trained detection model can output the first detection frame and the confidence of the first detection frame.
Step S130, performing edge detection on the first detection frame, and obtaining a gradient value of an object edge pixel in the first detection frame.
It is to be understood that the object edge pixels may be pixels at the edge of the object (including the first object) within the first detection frame.
It should also be understood that the specific algorithm used for performing edge detection on the first detection frame may be set according to actual requirements, and the embodiment of the present application is not limited thereto.
For example, the algorithm used for edge detection of the first detection box may be a Canny edge detection algorithm.
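As a hedged sketch (OpenCV is one possible implementation choice, and the Canny thresholds are assumptions), the gradient values of the edge pixels inside a detection-frame crop could be obtained as follows; Sobel derivatives supply the gradient magnitude at the pixels that Canny marks as edges.

import cv2

def edge_gradients(crop, low_threshold=50, high_threshold=150):
    """Return the gradient magnitude of the pixels inside a detection-frame crop
    together with a mask of the object edge pixels found by Canny.

    Threshold values are illustrative assumptions.
    """
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low_threshold, high_threshold)   # binary edge map
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    magnitude = cv2.magnitude(gx, gy)                         # gradient value per pixel
    edge_mask = edges > 0
    return magnitude, edge_mask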
In step S140, gradient values of the object edge pixels within the overlapped portion in the first detection frame are acquired according to the gradient values of the object edge pixels within the first detection frame.
Specifically, since the positions of the pixels within the overlapped portion are known, the gradient values of the object edge pixels within the overlapped portion can be selected, according to those positions, from the gradient values of all the object edge pixels within the first detection frame.
In step S150, in the case where the gradient value of the object edge pixel within the overlapped portion has a concave shape, the overlapped portion is deleted from the first detection frame to obtain a target detection frame.
That is, in the case where the overlapping portion has a concave shape, it is determined that the overlapping portion is a blocked image, and pixels in the overlapping portion can be erased.
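The patent does not spell out a concrete numerical test for the gradient values being "concave", so the sketch below only shows the fully specified mechanics, namely selecting the edge-pixel gradients inside the overlapped region and erasing that region when a caller-supplied concavity predicate fires; the predicate itself is an assumption left abstract.

import numpy as np

def erase_if_concave(crop, magnitude, edge_mask, overlap_mask, is_concave):
    """Erase the overlapped part of a detection-frame crop when its edge-pixel
    gradients are judged concave.

    overlap_mask marks the pixels of the overlapped part inside the crop;
    is_concave is a caller-supplied predicate on the 1-D array of gradient
    values (the concavity test is not specified in the text, so it is left abstract here).
    """
    overlap_edge_gradients = magnitude[edge_mask & overlap_mask]
    if is_concave(overlap_edge_gradients):
        crop = crop.copy()
        crop[overlap_mask] = 0   # erase the occluded (overlapped) pixels
    return crop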
It should be understood that, although steps S130 and S150 show the process of erasing the pixels in the overlapped part, those skilled in the art will understand that the processing manner of the first detection frame may also be set according to actual requirements.
Alternatively, since the overlapped part may be an overlapped part of the first detection frame and the second detection frame for identifying the region in which the second object is located in the image to be processed, the degree of overlap of the first detection frame and the second detection frame may be calculated.
And, in the case that the coincidence degree is greater than or equal to the preset coincidence degree, filtering out the detection frame with a smaller confidence degree, for example, in the case that the confidence degree of the first detection frame is smaller than the confidence degree of the second detection frame, filtering out the first detection frame, and using the second detection frame as the target detection frame; and under the condition that the contact ratio is less than the preset contact ratio, performing edge detection on the first detection frame.
And under the condition that the coincidence degree is less than the preset coincidence degree and the gradient value of the edge pixel of the object in the coincidence part is convex, determining that the first detection frame is an unobstructed image, and taking the first detection frame as a target detection frame without processing the first detection frame; and in the case that the gradient value of the object edge pixel in the overlapped part is concave, determining that the overlapped part is a blocked image, and erasing the pixel in the overlapped part in the first detection frame to obtain the target detection frame.
It should be understood that the preset overlap ratio may be set according to actual requirements, and the embodiment of the present application is not limited thereto.
Therefore, the detection frames can be sorted according to the confidence degree of the detection frames, the detection frames which are seriously overlapped are filtered according to the preset overlapping degree, the detection frames to which the overlapped parts belong can be judged by edge contour detection of the detection frames with lower overlapping degree, the overlapped parts of the detection frames which are shielded are deleted, and the interference of information of other objects is avoided.
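A hedged sketch of this per-pair decision logic is given below; the preset coincidence-degree threshold of 0.65 and the helper names are assumptions, and the 'edge_check' branch is the point where the edge-detection and concavity handling sketched earlier would be applied.

def box_iou(a, b, eps=1e-7):
    """Coincidence degree (IoU) of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter + eps
    return inter / union

def resolve_overlap(first_box, first_conf, second_box, second_conf, preset_overlap=0.65):
    """Decide how to handle the first detection frame when it overlaps the second one.

    Returns 'delete' when the overlap is severe and the first frame has the lower
    confidence, 'keep' when it is severe but the first frame wins, otherwise
    'edge_check' (proceed to edge detection and the concavity test).
    The preset_overlap value is an illustrative assumption.
    """
    overlap = box_iou(first_box, second_box)
    if overlap >= preset_overlap:
        return 'delete' if first_conf < second_conf else 'keep'
    return 'edge_check'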
Step S160, inputting the target detection box into a pre-trained motion recognition model, and obtaining a motion recognition result of the first object.
In order to facilitate understanding of the embodiments of the present application, the following description will be given by way of specific examples.
Specifically, referring to fig. 3, fig. 3 is a flowchart illustrating a method for training an initial motion recognition model according to an embodiment of the present application. The method shown in fig. 3 comprises:
step S310, inputting the second sample detection box into the initial motion recognition model to obtain a predicted motion recognition result.
It should be understood that the second sample detection frame may be a detection frame remaining after the overlapped portion is erased, or may be a detection frame without the overlapped portion being erased. That is, the motion recognition model in the embodiment of the present application may recognize a detection frame in an erased overlapped portion, or may recognize a detection frame in which an overlapped portion is not erased.
It should be understood that the predicted motion recognition result may be a specific motion, or may be vector data including probabilities of different motion identifiers, and the embodiment of the present application is not limited thereto.
Further, after the vector data is acquired, the corresponding action can be determined from the vector data.
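For example (a minimal sketch; the action labels listed are hypothetical, since the text does not enumerate the recognizable actions), the action can be obtained by taking the index of the largest probability in the output vector:

import numpy as np

# Hypothetical action labels for illustration only.
ACTIONS = ['standing', 'walking', 'sitting', 'falling']

def decode_action(probabilities):
    """Map the model's output probability vector to a concrete action and its probability."""
    index = int(np.argmax(probabilities))
    return ACTIONS[index], float(probabilities[index])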
Step S320 determines a second loss value according to the predicted motion recognition result and the sample motion recognition result corresponding to the second sample detection frame.
It should be understood that the calculation formula of the second loss value may be set according to actual requirements, and the embodiment of the present application is not limited thereto.
For example, the second loss value may be calculated according to the following formula:
L_1 = μ·L_fl + L_softmax
wherein L_1 represents the second loss value, μ represents a second hyper-parameter regulating the ratio of the two loss functions (i.e., the second classification Loss function Focal Loss and the cross-entropy Loss function Softmax Loss), L_fl represents the classification loss value calculated by the second classification Loss function Focal Loss, and L_softmax represents the cross-entropy loss value calculated by the cross-entropy Loss function Softmax Loss.
It should also be understood that the calculation formula of the cross-entropy Loss function Softmax Loss and the calculation formula of the second classification Loss function Focal Loss may be set according to actual requirements, and the embodiment of the present application is not limited thereto.
Alternatively, the classification loss function value may be calculated according to the following formula:
L_fl = -λ·y·(1 - y')^β·log(y') - (1 - λ)·(1 - y)·(y')^β·log(1 - y')
where λ represents a third hyper-parameter for adjusting the weights of the positive and negative samples, and λ ∈ [0, 1], wherein the negative sample refers to a detection frame in the image and the positive sample refers to the environment in the image, and the like; y' represents the predicted action recognition result, namely the predicted value output by the initial action recognition model; y represents the sample action recognition result, namely the true value of the action recognition result corresponding to the second sample detection box; and β represents a fourth hyper-parameter for accelerating the convergence process of the cross-entropy Loss function Softmax Loss.
Therefore, if the target detection frame is directly identified and classified in the motion recognition process, a number of false labels may occur, because the target detection frame may not satisfy the original data distribution. For this situation, the embodiment of the present application uses a linear combination of the second classification Loss function Focal Loss and the cross-entropy Loss function Softmax Loss as the final loss function, so that the interference caused by the large proportion of negative samples is reduced.
Alternatively, the cross entropy loss value may be calculated according to the following formula:
L_softmax = -Σ_{j=1}^{T} y_j·log(S_j)
wherein T represents the length of the output vector of the motion recognition model; y_j represents the value of the label at position j, which identifies the class (for example, in the case where the motion recognition model corresponds to four output classes, the label for the first class may be 1000, the label for the second class may be 0100, the label for the third class may be 0010, the label for the fourth class may be 0001, and so on; if the label is of class j, y_j is equal to 1 and is 0 in the other positions); and S_j represents the value of the output vector at position j (i.e., the probability of being of the j-th class).
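A hedged Python sketch of the second loss value built from the two formulas above is given below; the focal term follows the standard published form that the symbol descriptions appear to match, and the values of μ, λ and β are illustrative placeholders rather than values from the patent.

import numpy as np

def focal_loss(y_true, y_pred, lam=0.25, beta=2.0, eps=1e-7):
    """Focal loss L_fl; lam (λ) balances positive and negative samples, beta (β)
    down-weights easy samples. Values are illustrative assumptions."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    pos = -lam * y_true * (1 - y_pred) ** beta * np.log(y_pred)
    neg = -(1 - lam) * (1 - y_true) * y_pred ** beta * np.log(1 - y_pred)
    return float(np.sum(pos + neg))

def softmax_cross_entropy(label_onehot, probs, eps=1e-7):
    """L_softmax = -sum_j y_j * log(S_j) over the length-T output vector."""
    return float(-np.sum(label_onehot * np.log(np.clip(probs, eps, 1.0))))

def second_loss(label_onehot, probs, mu=1.0):
    """L_1 = mu * L_fl + L_softmax; mu (μ) is the second hyper-parameter (placeholder value)."""
    return mu * focal_loss(label_onehot, probs) + softmax_cross_entropy(label_onehot, probs)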
And step S330, adjusting parameters of the initial motion recognition model by using the second loss value to obtain a pre-trained motion recognition model.
It should be noted that, although fig. 3 shows a training process of the initial motion recognition model, it should be understood by those skilled in the art that, in the case where the motion recognition model is a motion recognition model trained in advance, the process of fig. 3 may be omitted, that is, the initial motion recognition model does not need to be trained every time.
In addition, it should be noted that, although the above steps S120 to S160 show the motion recognition process for the first detection frame, those skilled in the art should understand that, in the case where the image to be processed corresponds to a plurality of detection frames, the motion recognition processes for all the detection frames can be realized by the processes of steps S120 to S160.
Therefore, the embodiment of the present application avoids interference of objects other than the first object by determining that the overlapped part is a blocked part in a case where the gradient value of the object edge pixel in the overlapped part is concave, and then deleting the pixel in the overlapped part from the first detection frame.
In addition, the target detection frame is detected by using the pre-trained motion recognition model in the embodiment of the application, so that the embodiment of the application can accurately recognize the motion of the first object and is therefore suitable for dense crowd scenes.
In order to facilitate understanding of the embodiments of the present application, the following description will be given by way of specific examples.
Referring to fig. 4, fig. 4 is a specific flowchart illustrating an action recognition method according to an embodiment of the present application. The motion recognition method shown in fig. 4 includes:
step S410, acquiring an image to be processed.
For example, please refer to fig. 5, and fig. 5 shows a schematic diagram of an image to be processed according to an embodiment of the present application. The image to be processed as shown in fig. 5 includes a pedestrian 510 and a pedestrian 520.
And step S420, carrying out object detection on the image to be processed through a pre-trained detection model to obtain a detection frame.
In step S430, pixels in the overlapped portion in the detection frame are erased to obtain a target detection frame.
For example, please refer to fig. 6, fig. 6 shows a schematic diagram of an annotated to-be-processed image provided in an embodiment of the present application. As shown in fig. 6, in the image to be processed, the pedestrian 510 and the pedestrian 520 are labeled, and at this time, the detection frame containing the pedestrian 510 and the detection frame containing the pedestrian 520 have an overlapping portion.
And, please refer to fig. 7, fig. 7 shows a schematic diagram of a target detection frame in the embodiment of the present application. The object detection frame shown in fig. 7 corresponds to the pedestrian 510.
And, please refer to fig. 8, fig. 8 shows a schematic diagram of another target detection frame in the embodiment of the present application. The object detection frame shown in fig. 8 corresponds to the pedestrian 520.
Step S440, obtaining the action recognition result corresponding to the target detection frame through the pre-trained action recognition model.
It should be understood that after the target detection frames are obtained, all the target detection frames may be preprocessed to adjust all the target detection frames to a preset size. And then, respectively inputting the target detection frames with preset sizes into the pre-trained motion recognition model to obtain motion recognition results.
It should be understood that the preset size may be set according to actual requirements, and the embodiments of the present application are not limited thereto.
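For instance (a minimal sketch; the 224×224 preset size and the use of OpenCV are assumptions), this pre-processing step could be written as:

import cv2

def resize_detections(crops, preset_size=(224, 224)):
    """Resize every target detection-frame crop to the preset size before it is fed
    to the pre-trained motion recognition model. preset_size is (width, height) and
    its value is an illustrative assumption."""
    return [cv2.resize(crop, preset_size, interpolation=cv2.INTER_LINEAR) for crop in crops]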
It should be understood that the above-described motion recognition method is only exemplary, and those skilled in the art can make various changes, modifications or alterations according to the above-described method and still fall within the scope of the present application.
For example, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowcharts may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
Referring to fig. 9, fig. 9 shows a structural block diagram of a motion recognition device 900 provided in an embodiment of the present application, it should be understood that the motion recognition device 900 corresponds to the above method embodiment and is capable of performing various steps related to the above method embodiment, specific functions of the motion recognition device 900 may be referred to in the foregoing description, and detailed descriptions are appropriately omitted herein to avoid repetition. The motion recognition device 900 includes at least one software function module that can be stored in a memory in the form of software or firmware (firmware) or is fixed in an Operating System (OS) of the motion recognition device 900. Specifically, the motion recognition device 900 includes:
an obtaining module 910, configured to obtain an image to be processed; an input module 920, configured to input an image to be processed into a pre-trained detection model, and acquire a first detection frame including a coincidence portion, where the first detection frame indicates an area where a first object in the image to be processed is located; an edge detection module 930, configured to perform edge detection on the first detection frame to obtain a gradient value of an object edge pixel in the overlapped portion; a deleting module 940, configured to delete the overlapped portion from the first detection frame to obtain a target detection frame in a case where the gradient value of the object edge pixel within the overlapped portion is concave; the input module 920 is further configured to input the target detection box into a pre-trained motion recognition model, and obtain a motion recognition result of the first object.
In a possible embodiment, the overlapped part is an overlapped part of a second detection frame and the first detection frame, the second detection frame represents an area where a second object in the image to be processed is located, and the deleting module 940 includes: a calculating module (not shown) for calculating the coincidence ratio of the first detection frame and the second detection frame; a deletion submodule (not shown) for deleting the overlapped part from the first detection frame in a case where the degree of overlap is smaller than a preset degree of overlap.
In a possible embodiment, the deleting submodule is further configured to delete the first detection frame when the coincidence degree is greater than or equal to a preset coincidence degree and the confidence degree of the first detection frame is smaller than the confidence degree of the second detection frame.
In a possible embodiment, the input module 920 is further configured to input a sample image used for training the initial detection model into the initial detection model, and obtain a prediction box, where the prediction box represents a prediction region where a sample object in the sample image is located; a first determining module (not shown) configured to determine a first loss value according to the prediction frame and a first sample detection frame corresponding to the sample image, where the first sample detection frame represents a region where the sample object in the sample image is located; and a first adjusting module (not shown) for adjusting parameters of the initial detection model by using the first loss value to obtain a pre-trained detection model.
In one possible embodiment, the first loss value comprises a regression loss value, and the motion recognition device 900 further comprises a first calculation module for calculating the regression loss value according to the following formula:
L_CIoU = 1 - IoU + ρ^2(b, b^gt)/c^2 + a·v
v = (4/π^2)·(arctan(w^gt/h^gt) - arctan(w/h))^2
a = v/((1 - IoU) + v)
wherein L_CIoU represents the regression loss value, IoU represents the intersection over union of the prediction box and the first sample detection box, ρ(b, b^gt) represents the distance between the center point of the prediction box and the center point of the first sample detection box, c represents the diagonal length of the minimum bounding rectangle of the prediction box and the first sample detection box, a represents the first hyper-parameter, v represents the aspect-ratio consistency term, w^gt represents the width of the first sample detection box, h^gt represents the length of the first sample detection box, w represents the width of the prediction box, and h represents the length of the prediction box.
In a possible embodiment, the input module 920 is further configured to input a second sample detection box for training the initial motion recognition model into the initial motion recognition model to obtain a predicted motion recognition result; a second determining module (not shown) for determining a second loss value according to the predicted motion recognition result and the sample motion recognition result corresponding to the second sample detection frame; and a second adjusting module (not shown) for adjusting parameters of the initial motion recognition model by using the second loss value to obtain a pre-trained motion recognition model.
In a possible embodiment, the motion recognition device 900 further comprises a second calculation module for calculating a second loss value according to the following formula:
L_1 = μ·L_fl + L_softmax
wherein L_1 represents the second loss value, μ represents a second hyper-parameter regulating the ratio of the classification Loss function Focal Loss to the cross-entropy Loss function Softmax Loss, L_fl represents the classification loss value calculated by Focal Loss, and L_softmax represents the cross-entropy loss value calculated by Softmax Loss.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method, and will not be described in too much detail herein.
Fig. 10 shows a block diagram of an electronic device 1000 according to an embodiment of the present application. The electronic device 1000 may include a processor 1010, a communication interface 1020, a memory 1030, and at least one communication bus 1040. Wherein the communication bus 1040 is used for realizing direct connection communication of these components. The communication interface 1020 in the embodiment of the present application is used for communicating signaling or data with other devices. Processor 1010 may be an integrated circuit chip having signal processing capabilities. The Processor 1010 may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor 1010 may be any conventional processor or the like.
The Memory 1030 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 1030 stores computer-readable instructions, and when the instructions are executed by the processor 1010, the electronic device 1000 may perform the steps of the above-described method embodiments.
The electronic device 1000 may further include a memory controller, an input-output unit, an audio unit, a display unit.
The memory 1030, the memory controller, the processor 1010, the peripheral interface, the input/output unit, the audio unit, and the display unit are electrically connected to each other directly or indirectly, so as to implement data transmission or interaction. For example, these elements may be electrically connected to each other via one or more communication buses 1040. The processor 1010 is configured to execute executable modules stored in the memory 1030. Also, the electronic device 1000 is configured to perform the following method: acquiring an image to be processed; inputting the image to be processed into a pre-trained detection model, and acquiring a first detection frame containing a superposition part, wherein the first detection frame represents an area where a first object in the image to be processed is located; performing edge detection on the first detection frame to obtain gradient values of object edge pixels in the overlapped part; deleting the overlapped part from the first detection frame to obtain a target detection frame in the case that gradient values of object edge pixels within the overlapped part are concave; and inputting the target detection frame into a pre-trained motion recognition model to obtain a motion recognition result of the first object.
The input and output unit is used for providing input data for a user to realize the interaction of the user and the server (or the local terminal). The input/output unit may be, but is not limited to, a mouse, a keyboard, and the like.
The audio unit provides an audio interface to the user, which may include one or more microphones, one or more speakers, and audio circuitry.
The display unit provides an interactive interface (e.g. a user interface) between the electronic device and a user or for displaying image data to a user reference. In this embodiment, the display unit may be a liquid crystal display or a touch display. In the case of a touch display, the display can be a capacitive touch screen or a resistive touch screen, which supports single-point and multi-point touch operations. The support of single-point and multi-point touch operations means that the touch display can sense touch operations simultaneously generated from one or more positions on the touch display, and the sensed touch operations are sent to the processor for calculation and processing.
It is to be understood that the configuration shown in fig. 10 is merely exemplary, and that the electronic device 1000 may include more or fewer components than shown in fig. 10, or have a different configuration than shown in fig. 10. The components shown in fig. 10 may be implemented in hardware, software, or a combination thereof.
The present application also provides a storage medium having a computer program stored thereon, which, when executed by a processor, performs the method of the method embodiments.
The present application also provides a computer program product which, when run on a computer, causes the computer to perform the method of the method embodiments.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing method, and will not be described in too much detail herein.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes. It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present application and is not intended to limit it; those skilled in the art may make various modifications and changes to the present application. Any modification, equivalent replacement, improvement, or the like made within the spirit and principle of the present application shall fall within its protection scope. It should be noted that like reference numbers and letters refer to like items in the figures; once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
The above description covers only specific embodiments of the present application, and the scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed by the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A motion recognition method, comprising:
acquiring an image to be processed;
inputting the image to be processed into a pre-trained detection model, and acquiring a first detection frame containing an overlapped part, wherein the first detection frame represents an area where a first object in the image to be processed is located;
performing edge detection on the first detection frame to obtain gradient values of object edge pixels in the overlapped part;
deleting the overlapped part from the first detection frame to obtain a target detection frame in the case that the gradient values of the object edge pixels within the overlapped part are concave;
and inputting the target detection frame into a pre-trained motion recognition model to obtain a motion recognition result of the first object.
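For illustration only (not part of the claimed method): a minimal Python sketch of the claim-1 pipeline. The detection_model and action_model callables, the Sobel-gradient edge detection, and the mean-based concavity test are assumptions standing in for whatever pre-trained models and concavity criterion the application actually uses.

import cv2
import numpy as np

def recognize_action(image, detection_model, action_model):
    # Steps 1-2: obtain a first detection frame (x1, y1, x2, y2) and the
    # overlapped part it contains from the pre-trained detection model.
    first_box, overlap = detection_model(image)
    x1, y1, x2, y2 = first_box

    # Step 3: edge detection inside the first detection frame; Sobel gradients
    # stand in for the "gradient values of object edge pixels".
    crop = cv2.cvtColor(image[y1:y2, x1:x2], cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(crop, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(crop, cv2.CV_32F, 0, 1)
    grad = np.sqrt(gx ** 2 + gy ** 2)

    # Step 4: if the gradients inside the overlapped part are "concave"
    # (here: their mean dips below the frame-wide mean, a stand-in test),
    # delete the overlapped part to obtain the target detection frame.
    ox1, oy1, ox2, oy2 = overlap
    inside = grad[oy1 - y1:oy2 - y1, ox1 - x1:ox2 - x1]
    if inside.size and inside.mean() < grad.mean():
        crop[oy1 - y1:oy2 - y1, ox1 - x1:ox2 - x1] = 0  # blank out the overlap

    # Step 5: feed the resulting target detection frame to the pre-trained
    # motion recognition model.
    return action_model(crop)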
2. The motion recognition method according to claim 1, wherein the overlapped part is the part where a second detection frame overlaps the first detection frame, the second detection frame representing an area where a second object in the image to be processed is located, and the deleting the overlapped part from the first detection frame in the case that the gradient values of the object edge pixels within the overlapped part are concave comprises:
calculating the coincidence degree of the first detection frame and the second detection frame;
and deleting the overlapped part from the first detection frame under the condition that the coincidence degree is less than a preset coincidence degree.
3. The motion recognition method according to claim 2, further comprising:
and deleting the first detection frame under the condition that the coincidence degree is greater than or equal to the preset coincidence degree and the confidence of the first detection frame is lower than that of the second detection frame.
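For illustration only: a sketch of the suppression logic of claims 2 and 3, assuming the coincidence degree is computed as intersection over union and that the preset value of 0.5 is merely illustrative.

def coincidence_degree(box_a, box_b):
    """Intersection over union, used here as the 'coincidence degree' of two frames."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def resolve_overlap(first, second, conf_first, conf_second, preset=0.5):
    """Claims 2-3: below the preset coincidence degree, strip the overlapped part
    from the first frame; otherwise drop the first frame if its confidence is lower."""
    degree = coincidence_degree(first, second)
    if degree < preset:
        return "delete_overlapped_part_from_first_frame"
    if conf_first < conf_second:
        return "delete_first_frame"
    return "keep_first_frame"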
4. The motion recognition method according to claim 1, wherein before the inputting the image to be processed into a pre-trained detection model and acquiring a first detection frame containing an overlapped part, the motion recognition method further comprises:
inputting a sample image used for training an initial detection model into the initial detection model, and obtaining a prediction frame, wherein the prediction frame represents a prediction area where a sample object in the sample image is located;
determining a first loss value according to the prediction frame and a first sample detection frame corresponding to the sample image, wherein the first sample detection frame represents an area where a sample object in the sample image is located;
and adjusting parameters of the initial detection model by using the first loss value to obtain the pre-trained detection model.
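For illustration only: a generic PyTorch-style training step consistent with claim 4. All names are placeholders; loss_fn would compute the first loss value, for example the regression loss described in claim 5.

import torch

def detection_training_step(initial_model, optimizer, sample_image, sample_frame, loss_fn):
    # Predict a frame for the sample image, compare it with the labelled first
    # sample detection frame, and use the first loss value to update the model.
    predicted_frame = initial_model(sample_image)        # prediction frame
    first_loss = loss_fn(predicted_frame, sample_frame)  # first loss value
    optimizer.zero_grad()
    first_loss.backward()
    optimizer.step()
    return first_loss.item()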
5. The motion recognition method according to claim 4, wherein the first loss value comprises a regression loss value, the motion recognition method further comprising calculating the regression loss value according to the following formula:
L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v
v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2
\alpha = \frac{v}{(1 - IoU) + v}
wherein L_{CIoU} denotes the regression loss value, IoU denotes the intersection over union of the prediction frame and the first sample detection frame, ρ(b, b^{gt}) denotes the distance between the center point b of the prediction frame and the center point b^{gt} of the first sample detection frame, c denotes the diagonal length of the minimum bounding rectangle enclosing the prediction frame and the first sample detection frame, α denotes a first hyper-parameter, v measures the consistency of the aspect ratios of the two frames, w^{gt} and h^{gt} denote the width and height of the first sample detection frame, and w and h denote the width and height of the prediction frame.
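For illustration only: a PyTorch sketch of the regression loss in the standard CIoU form given above, matching the symbols defined in claim 5. The (x1, y1, x2, y2) box layout and the eps guard against division by zero are assumptions.

import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    # Boxes are tensors of shape (..., 4) in (x1, y1, x2, y2) layout.
    px1, py1, px2, py2 = pred.unbind(-1)
    tx1, ty1, tx2, ty2 = target.unbind(-1)

    # IoU of the prediction frame and the first sample detection frame.
    inter = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(min=0) * \
            (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(min=0)
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)

    # Squared centre distance rho^2(b, b_gt) over squared diagonal c^2 of the
    # minimum bounding rectangle of the two frames.
    rho2 = ((px1 + px2 - tx1 - tx2) ** 2 + (py1 + py2 - ty1 - ty2) ** 2) / 4
    cw = torch.max(px2, tx2) - torch.min(px1, tx1)
    ch = torch.max(py2, ty2) - torch.min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v and its weight alpha.
    v = (4 / math.pi ** 2) * (torch.atan((tx2 - tx1) / (ty2 - ty1 + eps)) -
                              torch.atan((px2 - px1) / (py2 - py1 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v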
6. The motion recognition method according to claim 1, wherein before the inputting the target detection frame into a pre-trained motion recognition model and obtaining the motion recognition result of the first object, the motion recognition method further comprises:
inputting a second sample detection frame used for training an initial motion recognition model into the initial motion recognition model to obtain a predicted motion recognition result;
determining a second loss value according to the predicted motion recognition result and a sample motion recognition result corresponding to the second sample detection frame;
and adjusting parameters of the initial motion recognition model by using the second loss value to obtain the pre-trained motion recognition model.
7. The motion recognition method according to claim 6, wherein the determining a second loss value according to the predicted motion recognition result and the sample motion recognition result corresponding to the second sample detection frame comprises calculating the second loss value according to the following formula:
L_1 = \mu L_{fl} + L_{softmax}
wherein L_1 denotes the second loss value, μ denotes a second hyper-parameter for adjusting the ratio between the classification loss function Focal Loss and the cross-entropy loss function Softmax Loss, L_{fl} denotes the classification loss value calculated by the Focal Loss, and L_{softmax} denotes the cross-entropy loss value calculated by the Softmax Loss.
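For illustration only: a minimal PyTorch sketch of the claim-7 combination μ·L_fl + L_softmax. The values of the second hyper-parameter mu and the focal exponent gamma are illustrative, not taken from the claim.

import torch
import torch.nn.functional as F

def combined_action_loss(logits, labels, mu=0.25, gamma=2.0):
    # Cross-entropy (Softmax Loss) term, kept per-sample.
    ce = F.cross_entropy(logits, labels, reduction="none")
    # Focal term re-weights the same cross-entropy by (1 - p_t)^gamma.
    p_t = torch.exp(-ce)
    focal = ((1 - p_t) ** gamma) * ce
    # Second loss value: mu * Focal Loss + Softmax Loss.
    return (mu * focal + ce).mean()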
8. An action recognition device, comprising:
the acquisition module is used for acquiring an image to be processed;
the input module is used for inputting the image to be processed into a pre-trained detection model and acquiring a first detection frame containing an overlapped portion, wherein the first detection frame represents an area where a first object in the image to be processed is located;
an edge detection module, configured to perform edge detection on the first detection frame to obtain a gradient value of an object edge pixel in the overlapped portion;
a deleting module, configured to delete the overlapped portion from the first detection frame to obtain a target detection frame, if a gradient value of an object edge pixel within the overlapped portion is concave;
the input module is further configured to input the target detection frame into a pre-trained motion recognition model, and obtain a motion recognition result of the first object.
9. A storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, performs the motion recognition method according to any one of claims 1 to 7.
10. An electronic device, characterized in that the electronic device comprises: a processor, a memory, and a bus, the memory storing machine-readable instructions executable by the processor; when the electronic device is operating, the processor and the memory communicate via the bus, and the machine-readable instructions, when executed by the processor, perform the motion recognition method according to any one of claims 1 to 7.
CN202010624706.0A 2020-06-30 2020-06-30 Action recognition method and device, storage medium and electronic equipment Pending CN111783665A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010624706.0A CN111783665A (en) 2020-06-30 2020-06-30 Action recognition method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010624706.0A CN111783665A (en) 2020-06-30 2020-06-30 Action recognition method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN111783665A true CN111783665A (en) 2020-10-16

Family

ID=72757777

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010624706.0A Pending CN111783665A (en) 2020-06-30 2020-06-30 Action recognition method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111783665A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215189A (en) * 2020-10-21 2021-01-12 南京智慧航空研究院有限公司 Accurate detecting system for illegal building
CN112308045A (en) * 2020-11-30 2021-02-02 深圳集智数字科技有限公司 Detection method and device for dense crowd and electronic equipment
CN112308045B (en) * 2020-11-30 2023-11-24 深圳集智数字科技有限公司 Method and device for detecting dense crowd and electronic equipment
CN112488230A (en) * 2020-12-07 2021-03-12 中国农业大学 Crop water stress degree judging method and device based on machine learning
CN112733671A (en) * 2020-12-31 2021-04-30 新大陆数字技术股份有限公司 Pedestrian detection method, device and readable storage medium
CN112785561A (en) * 2021-01-07 2021-05-11 天津狮拓信息技术有限公司 Second-hand commercial vehicle condition detection method based on improved Faster RCNN prediction model
CN116563521A (en) * 2023-04-14 2023-08-08 依未科技(北京)有限公司 Detection frame processing method and device for target detection and electronic equipment
CN116563521B (en) * 2023-04-14 2024-04-23 依未科技(北京)有限公司 Detection frame processing method and device for target detection and electronic equipment

Similar Documents

Publication Publication Date Title
US11062123B2 (en) Method, terminal, and storage medium for tracking facial critical area
CN111783665A (en) Action recognition method and device, storage medium and electronic equipment
CN108875676B (en) Living body detection method, device and system
US9008365B2 (en) Systems and methods for pedestrian detection in images
US9547800B2 (en) System and a method for the detection of multiple number-plates of moving cars in a series of 2-D images
US8750614B2 (en) Method and system for classifying features in a video sequence
CN109727275B (en) Object detection method, device, system and computer readable storage medium
WO2009109127A1 (en) Real-time body segmentation system
Cheng et al. A hybrid background subtraction method with background and foreground candidates detection
CN110490171B (en) Dangerous posture recognition method and device, computer equipment and storage medium
JP2007156655A (en) Variable region detection apparatus and its method
CN105095835A (en) Pedestrian detection method and system
US20170053172A1 (en) Image processing apparatus, and image processing method
Zhang et al. Pedestrian detection based on hierarchical co-occurrence model for occlusion handling
CN107832732B (en) Lane line detection method based on treble traversal
CN113052019A (en) Target tracking method and device, intelligent equipment and computer storage medium
US20200394802A1 (en) Real-time object detection method for multiple camera images using frame segmentation and intelligent detection pool
CN111027482B (en) Behavior analysis method and device based on motion vector segmentation analysis
CN109657577B (en) Animal detection method based on entropy and motion offset
CN111667419A (en) Moving target ghost eliminating method and system based on Vibe algorithm
CN113762027B (en) Abnormal behavior identification method, device, equipment and storage medium
CN112950687B (en) Method and device for determining tracking state, storage medium and electronic equipment
JP5241687B2 (en) Object detection apparatus and object detection program
JP2024516642A (en) Behavior detection method, electronic device and computer-readable storage medium
JP7315022B2 (en) Machine learning device, machine learning method, and machine learning program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201016