CN112149602B - Action counting method and device, electronic equipment and storage medium - Google Patents

Action counting method and device, electronic equipment and storage medium

Info

Publication number
CN112149602B
CN112149602B
Authority
CN
China
Prior art keywords
video
processed
action
key point
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011059856.8A
Other languages
Chinese (zh)
Other versions
CN112149602A (en)
Inventor
祁雷
王雷
张波
陈广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Cubesili Information Technology Co Ltd
Original Assignee
Guangzhou Cubesili Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Cubesili Information Technology Co Ltd filed Critical Guangzhou Cubesili Information Technology Co Ltd
Priority to CN202011059856.8A
Publication of CN112149602A
Application granted
Publication of CN112149602B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an action counting method and device, an electronic device, and a storage medium. With the method, the number of actions of a target object in a video to be processed can be obtained by analyzing only the skeletal key points of the target object in the video images, without analyzing the whole picture of each video image, thereby reducing computational complexity and improving the accuracy of action counting.

Description

Action counting method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for counting actions, an electronic device, and a storage medium.
Background
With the development of computer vision technology, action counting has been widely applied in fields such as behavior monitoring, sports, and game design. Existing action counting methods need to analyze the whole picture of each video image, which inevitably introduces a large amount of interference information irrelevant to action counting, increases computational complexity, and reduces the accuracy of action counting.
Disclosure of Invention
In view of the above problems, the present application provides an action counting method, an action counting device, an electronic device, and a storage medium to address the above problems.
In a first aspect, an embodiment of the present application provides an action counting method, which is applied to an electronic device, and the method includes: acquiring a video to be processed; obtaining skeleton key points of a target object in the video to be processed; acquiring target key point characteristics based on the skeleton key points; calculating a similarity matrix corresponding to the video to be processed based on the target key point characteristics; and acquiring the action times of the target object in the video to be processed based on the similarity matrix.
In a second aspect, an embodiment of the present application provides an action counting apparatus, which is operated in an electronic device, and includes: the video data acquisition module is used for acquiring a video to be processed; the skeleton key point acquisition module is used for acquiring the skeleton key points of the target object in the video to be processed; a key point feature obtaining module, configured to obtain a target key point feature based on the bone key point; the calculation module is used for calculating a similarity matrix corresponding to the video to be processed based on the target key point characteristics; and the counting module is used for acquiring the action times of the target object in the video to be processed based on the similarity matrix.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory and one or more processors; one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of the first aspect described above.
In a fourth aspect, the present application provides a computer-readable storage medium, in which a program code is stored, wherein when the program code is executed by a processor, the method according to the first aspect is performed.
According to the action counting method and device, the electronic device and the storage medium, the video to be processed is obtained, then the skeleton key points of the target object in the video to be processed are obtained, then the target key point characteristics are obtained based on the skeleton key points, then the similarity matrix corresponding to the video to be processed is calculated based on the target key point characteristics, and then the action times of the target object in the video to be processed are obtained based on the similarity matrix. Therefore, by the method, under the condition that the bone key points of the target object in the video to be processed are obtained, the target key point characteristics can be obtained based on the bone key points, the similarity matrix corresponding to the video to be processed is calculated based on the target key point characteristics, and then the action times of the target object in the video to be processed are obtained based on the similarity matrix, so that the bone key points of the target object in the video image can be analyzed without analyzing the whole picture of the video image, the calculation complexity is reduced, and the accuracy of action counting is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
Fig. 1 shows a schematic diagram of an application environment provided by an embodiment of the present application.
Fig. 2 shows a flowchart of a method of an action counting method according to an embodiment of the present application.
Fig. 3 shows a method flowchart of step S120 in fig. 2.
Fig. 4 shows a method flowchart of step S122 in fig. 3.
Fig. 5 is a schematic diagram illustrating a position of a skeletal key point of a target object according to an embodiment of the present application.
Fig. 6 shows a method flowchart of step S130 in fig. 2.
Fig. 7 shows a flowchart of the method of step S140 in fig. 2.
Fig. 8 shows a method flowchart of step S150 in fig. 2.
Fig. 9 shows a network structure diagram of an action counting neural network provided in an embodiment of the present application.
Fig. 10 shows an example diagram for counting the number of actions of a target object according to an embodiment of the present application.
Fig. 11 shows a block diagram of a motion counting apparatus according to an embodiment of the present application.
Fig. 12 shows a block diagram of an electronic device according to an embodiment of the present application.
Fig. 13 illustrates a storage unit for storing or carrying program code implementing the action counting method according to the embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The counting of actions may be simply understood as counting repeated actions. The current motion counting methods can be roughly classified into two types according to the difference of the emphasis points: a frequency domain analysis based motion counting method and a matching based motion counting method. The motion counting method based on frequency domain analysis is to find the motion period and the category of a target under the condition of a known target track, and the motion counting method based on matching is to find repeated pictures in a time sequence by adding geometric constraint so as to identify the motion period. However, the conventional motion counting method needs to analyze the whole picture of the video image, so that a large amount of interference information irrelevant to the motion counting is inevitably introduced, the complexity of calculation is increased, and the accuracy of the motion counting is reduced.
In view of the above problems, the inventors have found, through long-term research, that the accuracy of motion counting can be improved while reducing the computational complexity if the motion counting model is focused on an effective motion region. Specifically, the video to be processed is acquired, then skeleton key points of the target object in the video to be processed are acquired, then the characteristics of the target key points are acquired based on the skeleton key points, then the similarity matrix corresponding to the video to be processed is calculated based on the characteristics of the target key points, and then the action times of the target object in the video to be processed is acquired based on the similarity matrix. Therefore, by the method, under the condition that the bone key points of the target object in the video to be processed are obtained, the target key point characteristics can be obtained based on the bone key points, the similarity matrix corresponding to the video to be processed is calculated based on the target key point characteristics, and then the action times of the target object in the video to be processed are obtained based on the similarity matrix, so that the bone key points of the target object in the video image can be analyzed without analyzing the whole picture of the video image, the calculation complexity is reduced, and the accuracy of action counting is improved. Therefore, an action counting method, an action counting device, an electronic device and a storage medium provided by the embodiment of the application are provided.
For the convenience of describing the scheme of the present application in detail, an application environment in the embodiment of the present application is described below with reference to the accompanying drawings.
Referring to fig. 1, a schematic application environment diagram of an action counting method according to an embodiment of the present application is shown. The application environment may be understood as a network system 10 of the embodiment of the present application, where the network system 10 includes a user terminal 11 and a server 12. Optionally, the user terminal 11 may be any device having communication and storage functions, including but not limited to a PC (Personal Computer), a tablet computer, a smart television, a smart phone, a smart wearable device, or other smart communication devices having a network connection function. The server 12 may be a single server (a network access server), a server cluster composed of a plurality of servers (a cloud server), or a cloud computing center (a database server).
In this embodiment, the user terminal 11 may be configured to record or shoot a short video and count the number of actions (optionally, actions such as squatting, head turning, and running) of a target object in the recorded or shot video images. The target object may be, for example, a person; optionally, in some other embodiments, the target object may also be another living being, such as a cat, a dog, or a monkey, which is not specifically limited. To increase the speed of computing the number of actions of the target object, the user terminal 11 may send the counting result to the server 12 through a network for storage, which reduces occupation of the storage space of the user terminal 11 and further increases the computation speed for the number of actions of the target object.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 2, a flowchart of an action counting method according to an embodiment of the present application is shown, where the action counting method is applicable to an electronic device, and the method includes:
step S110: and acquiring a video to be processed.
In this embodiment, the video to be processed may include a plurality of video segments, each video segment may include a plurality of frames of images, each video segment may include at least one complete action of the target object, and the action may include squat, head rotation, running, and other actions. In some embodiments, if the action of the target object is "squat", at least one action of "standing-squat-standing" or the like of the target object may be included in one video clip.
Optionally, in a live broadcast scenario, when two anchor users are connected for a PK (battle) session, the anchor user who wins the PK may assign a penalty to the losing anchor user. The penalty may include asking the losing anchor user to sing a song, or to squat, turn the head, or run 10 times, and the specific penalty is not limited. As one way, a video that includes connected-PK state images in a live scene may be used as the video to be processed. When the video includes connected-PK state images, it may include two different user account IDs (different anchor users correspond to different user account IDs).
Alternatively, when a video image is detected to include a motion that occurs repeatedly, the video image may be identified as a video to be processed. For example, when it is detected that a complete squat course of action "standing-squat-standing" is included in the video image, if it is detected that the user "squats" again, the video image may be identified as the video to be processed.
Step S120: and acquiring bone key points of the target object in the video to be processed.
In the process of counting the number of actions of the target object, the counting accuracy may be affected by abnormal action postures or by external environment changes such as background and illumination. For example, when some users perform a squat action, if the squat amplitude is small (for example, only a half squat), the squat may not be counted, thereby affecting the counting accuracy.
As a way to improve the above problem, the present embodiment may acquire the bone key points of the target object in the video to be processed, so that the counting of the number of actions of the target object may be completed subsequently by means of the bone key points. The specific acquisition process of the skeletal key points of the target object is described as follows.
Referring to fig. 3, as an alternative, step S120 may include:
step S121: and inputting the video to be processed into a target attitude estimation network, and acquiring a plurality of reference key points output by the target attitude estimation network.
In this embodiment, the target posture estimation network may be used to estimate skeletal key points of the target object, the target posture estimation network is obtained by training based on a posture estimation network model, and optionally, the posture estimation network model may be a MobileNetV2 network model. In this embodiment, the convolution part of the MobileNetV2 network model may be used as the backbone network of the target attitude estimation network, and two attitude estimation losses are added as the loss layers of the target attitude estimation network. Wherein, the two posture estimation losses can be respectively a key point position prediction loss and a limb prediction loss. The calculation rule corresponding to the predicted loss of the key point position can be expressed as:
$$F_S = \sum_{j=1}^{J} \sum_{p} W(p) \cdot \left\| S_j(p) - S_j^*(p) \right\|_2^2$$

wherein j represents the j-th key point, and there are J key points in total; p represents the p-th pixel of the video frame; W is a binary mask, and W(p) = 0 indicates that the p-th pixel does not participate in the final loss calculation; $S_j(p)$ characterizes the probability score output by the network that the p-th pixel belongs to the j-th key point, and $S_j^*(p)$ characterizes whether the p-th pixel really belongs to the j-th key point.
Correspondingly, the calculation rule corresponding to the limb prediction loss can be expressed as:

$$F_L = \sum_{c=1}^{C} \sum_{p} W(p) \cdot \left\| L_c(p) - L_c^*(p) \right\|_2^2$$

wherein c represents the c-th limb segment, and there are C limb segments in total; p represents the p-th pixel of the video frame; W is a binary mask, and W(p) = 0 indicates that the p-th pixel does not participate in the final loss calculation; $L_c(p)$ characterizes the probability score output by the network that the p-th pixel belongs to the c-th limb segment, and $L_c^*(p)$ characterizes whether the p-th pixel really belongs to the c-th limb segment.
Optionally, the sum of the key point position prediction loss and the limb prediction loss may be used as the training loss of the target posture estimation network, that is, the training loss of the target posture estimation network in this embodiment may be expressed as F = F_L + F_S. On the basis of the determined backbone network and training loss function, the backbone network can be trained based on the training loss function; optionally, the backbone network can be trained by error back-propagation, so that the target posture estimation network is obtained.
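For illustration only (this sketch is not part of the patent disclosure), the combined training loss described above could be computed as follows, assuming the key point score maps are stored as arrays of shape (J, H, W), the limb segment score maps as (C, H, W), and the binary mask as (H, W):

```python
import numpy as np

def pose_estimation_loss(S_pred, S_true, L_pred, L_true, W):
    """Sum of the key point position prediction loss F_S and the limb prediction loss F_L.

    S_pred, S_true: (J, H, W) predicted / ground-truth key point score maps.
    L_pred, L_true: (C, H, W) predicted / ground-truth limb segment score maps.
    W:              (H, W) binary mask; pixels with W == 0 do not contribute to the loss.
    """
    F_S = np.sum(W[None] * (S_pred - S_true) ** 2)   # key point position prediction loss
    F_L = np.sum(W[None] * (L_pred - L_true) ** 2)   # limb prediction loss
    return F_L + F_S                                  # F = F_L + F_S
```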
As an implementation manner, a video to be processed may be input into a target pose estimation network, and a plurality of reference key points output by the target pose estimation network may be obtained. It should be noted that the plurality of reference keypoints are potential keypoints of the target object in the video image to be processed.
Step S122: bone keypoints are obtained from the plurality of reference keypoints.
Optionally, the bone key points may be obtained from the multiple potential key points based on a key point screening rule, and the key point screening rule in this embodiment may be expressed as:

$$\operatorname{set}(j_1, j_2) = \arg\max_{m \in D_{j_1},\, n \in D_{j_2}} \; z_{j_1 j_2} \cdot E_{mn}$$

wherein $D_{j_1}$ and $D_{j_2}$ are respectively the sets of potential positions of the j1-th and j2-th key points on the video image to be processed, and m and n respectively index two positions in these sets. set(j1, j2) characterizes the screened key point pair, $z_{j_1 j_2}$ represents whether the j1-th and j2-th key points are connected (1 if connected, 0 if not connected), and $E_{mn}$ characterizes the weight of the two points in $D_{j_1}$ and $D_{j_2}$ under the limb segment constraint.
Optionally, the process of screening according to the key point screening rule is described as follows:
referring to fig. 4, as an alternative, step S122 may include:
step S1221: and acquiring reference position associated parameters respectively corresponding to the plurality of reference key points.
The reference position associated parameter is used to represent whether two key points of the plurality of key points are connected, optionally, if the two key points are connected, the reference position associated parameter is 1, and if the two key points are not connected, the reference position associated parameter is 0.
Step S1222: and acquiring weight parameters of the plurality of key points under limb joint constraint.
Optionally, the weight parameter is used to characterize the weight of any two of the plurality of key points under the limb segment constraint. The weight parameter may be expressed as:

$$E_{mn} = \int_{u=0}^{1} L_c\big(p(u)\big) \cdot \frac{d_n - d_m}{\left\| d_n - d_m \right\|_2} \, du$$

wherein $d_m$ and $d_n$ are defined as the coordinates of the m-th and n-th points, and p(u) is the interpolated coordinate, which can be expressed as:

$$p(u) = (1 - u)\, d_{j_1} + u\, d_{j_2}$$
step S1223: and acquiring the product of the reference position correlation parameter and the weight parameter.
Step S1224: and taking the key point corresponding to the product with the maximum value as the bone key point.
As one approach, the key point corresponding to the product of the largest median values of the plurality of key points may be used as the bone key point of the target object.
For example, in a specific application scenario, a video frame I may be input into the trained target pose estimation network to obtain a key point score map S and a limb segment score map L corresponding to the video frame I, where $S_j(p)$ characterizes the likelihood that the p-th pixel in video frame I belongs to the j-th key point, and $L_c(p)$ characterizes the likelihood that the p-th pixel in video frame I belongs to the c-th limb segment. Through a non-maximum suppression algorithm, the potential key points of the target object can be obtained from the key point score map S, and the potential key points can then be screened by the key point screening rule described above, so that the key points closest to the real situation (i.e., the skeletal key points of the target object) are selected. Optionally, if the target object is a human, the skeletal key point position schematic diagram shown in fig. 5 can be obtained according to the skeletal key point acquisition method of this embodiment. As shown in fig. 5, the electronic device may mark the skeletal key points of the target object with serial numbers; optionally, during marking, the trunk (for example, the limbs of the human body) may be marked first, and then the other parts of the target object.
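As a rough, hypothetical sketch of the screening step (the scalar sampling of the limb score map along p(u) is a simplification assumed here, not quoted from the patent), the candidate pair with the largest product of the connection flag z and the weight E_mn could be selected as follows:

```python
import numpy as np

def limb_weight(L_c, d_m, d_n, num_samples=10):
    """Approximate E_mn by averaging the limb segment score L_c along the
    interpolated points p(u) = (1 - u) * d_m + u * d_n."""
    d_m, d_n = np.asarray(d_m, float), np.asarray(d_n, float)
    score = 0.0
    for u in np.linspace(0.0, 1.0, num_samples):
        x, y = np.rint((1 - u) * d_m + u * d_n).astype(int)  # pixel nearest to p(u)
        score += L_c[y, x]
    return score / num_samples

def screen_keypoints(D_j1, D_j2, L_c, connected=True):
    """Return the candidate pair whose product z * E_mn is largest."""
    z = 1 if connected else 0                                # reference position associated parameter
    best, best_pair = -np.inf, None
    for d_m in D_j1:
        for d_n in D_j2:
            value = z * limb_weight(L_c, d_m, d_n)
            if value > best:
                best, best_pair = value, (d_m, d_n)
    return best_pair
```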
Step S130: and acquiring target key point characteristics based on the skeleton key points.
Referring to fig. 6, as an alternative, step S130 may include:
step S131: and acquiring the space related characteristics corresponding to the bone key points.
The spatial correlation features may be obtained by calculating the spatial relative positions of different key points. Specifically, the position set of the key points of the target object in the k-th video frame may be defined as

$$P^k = \{p_1^k, p_2^k, \dots, p_J^k\},$$

and the spatial correlation feature corresponding to the skeletal key points of the target object (which may be denoted by $V_S$) can be constructed in the following way:

$$V_S^k(i, j) = \left\| p_i^k - p_j^k \right\|_2,$$

wherein $V_S^k(i, j)$ characterizes the Euclidean distance between the i-th and the j-th skeletal key points in the k-th frame. By adopting relative positions, the influence of changes such as the camera viewing angle on the count of the number of actions of the target object can be reduced.
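Assuming the key point positions of a frame are given as a (J, 2) coordinate array (an assumption made only for this illustration), the spatial correlation feature could be computed as a pairwise distance matrix:

```python
import numpy as np

def spatial_feature(P_k):
    """V_S^k: Euclidean distances between every pair of the J key points of frame k.

    P_k: (J, 2) array of key point coordinates in the k-th frame.
    """
    P_k = np.asarray(P_k, float)
    diff = P_k[:, None, :] - P_k[None, :, :]   # (J, J, 2) relative positions
    return np.linalg.norm(diff, axis=-1)       # (J, J) distance matrix
```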
Step S132: and acquiring time-related features corresponding to the bone key points.
Optionally, the time-related feature may reflect a time sequence of the action of the target object, and as a manner, the time-sequence feature of the current video frame may be obtained by calculating a change of a position of the same key point between the current video frame and an adjacent video frame, and the time-sequence feature may be used as a time-related feature corresponding to a skeletal key point of the target object.
Specifically, if $V_T$ is used to represent the time-related features, the time-related features may be constructed by:

$$V_T^k(i) = \left\| p_i^{k+1} - p_i^k \right\|_2,$$

wherein $p_i^{k+1}$ characterizes the position of the i-th skeletal key point in the (k+1)-th video frame.
Step S133: and splicing the space correlation characteristics and the time correlation characteristics to obtain target key point characteristics.
Optionally, the target key point features may be obtained by splicing the spatial correlation features and the time correlation features. Assuming that the target key point features are computed for the k-th video frame, the specific stitching principle can be expressed as follows:

$$f^k = \operatorname{Concat}\big(\operatorname{vec}(V_S^k), \operatorname{vec}(V_T^k)\big),$$

wherein $f^k$ characterizes the features of the k-th video frame, vec(·) represents the vectorization operation, and Concat(·) represents the concatenation operation (the spatial correlation features can be spliced with the time correlation features through the concatenation operation).
Step S140: and calculating a similarity matrix corresponding to the video to be processed based on the target key point characteristics.
Optionally, the video to be processed in this embodiment may include multiple frames of images.
Referring to fig. 7, as an alternative, step S140 may include:
step S141: and acquiring the distance between the key point characteristics of a plurality of random two-frame images in the multi-frame image.
Assuming that the video clip has N frames in total, a matrix M' ∈ R^{N×N} can be defined, where the numbers of rows and columns of the matrix M' are both N. Let M'(i, j) denote the element in the i-th row and the j-th column of the matrix; M'(i, j) can be used to characterize the distance between the key point feature of the i-th frame image and the key point feature of the j-th frame image, and is defined as:

$$M'(i, j) = -\frac{\left\| f^i - f^j \right\|_2^2}{\tau},$$

where τ characterizes the scale control factor. The values of all the elements of the matrix M' can be obtained by the above formula, that is, the distances between the key point features of any two frames of the multiple frames of images can be obtained.
Step S142: and normalizing the distances according to a specified calculation rule to obtain a plurality of elements, and taking a matrix formed by combining the elements as a similarity matrix corresponding to the video to be processed.
Wherein, the specified calculation rule may be:

$$M(i, j) = \frac{e^{M'(i, j)}}{\sum_{q=1}^{D} e^{M'(i, q)}},$$

where M(i, j) characterizes the similarity between the key point feature of the i-th frame image of the video to be processed and the key point feature of the j-th frame image, i characterizes the i-th frame of the video to be processed, j characterizes the j-th frame of the video to be processed, q characterizes an auxiliary variable, D characterizes a positive integer, e characterizes the natural constant, M'(i, j) characterizes the distance between the key point feature of the i-th frame image and the key point feature of the j-th frame image, and M'(i, q) characterizes the distance between the key point feature of the i-th frame image and the key point feature of the q-th frame image.
By normalizing the plurality of distances, a plurality of elements with a value range of 0 to 1 can be obtained, and as a mode, a matrix formed by combining the plurality of elements can be used as a similarity matrix M corresponding to a video to be processed. It should be noted that, the number of rows and the number of columns of the similarity matrix M are both N, and the similarity matrix may be used to represent the similarity between the ith frame image and the jth frame image of the video to be processed.
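Purely as an illustrative sketch (the exact form of the distance M'(i, j) and the handling of the scale control factor τ are assumptions), the similarity matrix could be built as follows:

```python
import numpy as np

def similarity_matrix(features, tau=1.0):
    """Build the N x N similarity matrix M from the per-frame target key point features.

    features: (N, d) array, one feature vector f^k per frame.
    tau:      scale control factor (assumed here to scale a negative squared distance).
    """
    features = np.asarray(features, float)
    diff = features[:, None, :] - features[None, :, :]     # (N, N, d) pairwise differences
    M_prime = -np.sum(diff ** 2, axis=-1) / tau            # distances M'(i, j)
    M_prime -= M_prime.max(axis=1, keepdims=True)          # for numerical stability
    expM = np.exp(M_prime)
    return expM / expM.sum(axis=1, keepdims=True)          # row-normalized similarities M(i, j)
```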
Step S150: and acquiring the action times of the target object in the video to be processed based on the similarity matrix.
Referring to fig. 8, as an alternative, step S150 may include:
step S151: and inputting the similarity matrix into an action counting neural network, and acquiring an action cycle output by the action counting neural network and corresponding to each frame of image in the video to be processed.
Alternatively, the action counting neural network in this embodiment may be constructed from a convolutional neural network and a classifier, where the convolutional neural network may be formed by combining a plurality of repeated convolution layers (e.g., the convolution layer, the activation layer, and the pooling layer shown in fig. 9) and adding a classification layer (i.e., the softmax classification shown in fig. 9). Alternatively, the combined computation of the convolution layers from the l-th layer to the (l+1)-th layer in this embodiment can be obtained by the following formulas:

$$\hat{Z}_k^{l+1} = w_k^{l+1} * Z^l + b_k^{l+1},$$
$$A^{l+1} = \max\big(0, \hat{Z}^{l+1}\big),$$
$$Z^{l+1} = \operatorname{pooling}\big(A^{l+1}\big),$$

wherein $\hat{Z}_k^{l+1}$ characterizes the output of the convolution operation in the (l+1)-th layer, $w_k^{l+1}$ characterizes the k-th filter in the (l+1)-th layer, $b_k^{l+1}$ characterizes the weight bias of the k-th filter in the (l+1)-th layer, and $Z^l$ characterizes the output of the l-th layer; $A^{l+1}$ characterizes the output of the activation operation in the (l+1)-th layer, and max characterizes the max operation; $Z^{l+1}$ characterizes the overall output of the (l+1)-th layer, and pooling characterizes the pooling operation.
Optionally, the classification layer in this embodiment may adopt a SoftMax classifier, and the specific implementation manner is:

$$p(k, t) = \frac{e^{w_t^{\top} z_k}}{\sum_{t'=1}^{T} e^{w_{t'}^{\top} z_k}},$$

wherein p(k, t) characterizes the probability that the action period of the k-th video frame is t, w characterizes the parameters of the SoftMax classifier, $w_t$ characterizes its t-th column, and $z_k$ denotes the feature of the k-th frame output by the convolution layers.
Optionally, the input to the first layer of the convolutional neural network is the similarity matrix M, that is, Z^1 = M. Through layer-by-layer forward propagation, the output of the last layer of the network is obtained as P ∈ R^{K×T}, wherein P(k, t) characterizes the probability that the action period corresponding to the k-th frame is t unit times, and each unit time characterizes the duration of one frame. The corresponding loss function is:

$$F_p = -\log P(k, t^*),$$

wherein $t^*$ characterizes the actual action period of the k-th frame.
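The following PyTorch sketch is only one possible reading of this architecture; the number of blocks, the channel width, and the pooling arrangement (pooling only across columns so that one feature vector per frame remains) are assumptions, not the patent's disclosed configuration:

```python
import torch
import torch.nn as nn

class ActionCountingNet(nn.Module):
    """Convolution / activation / pooling blocks over the similarity matrix, followed by a
    per-frame SoftMax classification over candidate periods 1..t_max."""

    def __init__(self, t_max, channels=32):
        super().__init__()
        self.blocks = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),   # convolution layer
            nn.ReLU(),                                          # activation layer: max(0, .)
            nn.MaxPool2d((1, 2)),                               # pooling layer (columns only)
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        self.classifier = nn.Linear(channels, t_max)            # SoftMax classification layer

    def forward(self, M):
        # M: (batch, N, N) similarity matrix with N >= 4; returns (batch, N, t_max) = P(k, t).
        x = self.blocks(M.unsqueeze(1))                         # (batch, C, N, N // 4)
        x = x.mean(dim=3)                                       # average remaining columns
        x = x.permute(0, 2, 1)                                  # (batch, N, C): one feature per frame
        return torch.softmax(self.classifier(x), dim=-1)        # period probabilities P(k, t)
```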
As one way, the constructed convolutional neural network can be trained to obtain the action counting neural network. Optionally, a labeled data set may be constructed, and the loss function may be optimized by using an error back-propagation algorithm. The labeling process of the data set is as follows: several video segments each containing a single action are collected, and then each segment is repeated multiple times to form a new video. For a video segment V containing a single action with a length of T frames, the segment V can be repeated N times to form a video segment V' with a length of NT frames; the action period in V' is T, and the number of actions is N. In this way, the trained action counting neural network can be used to predict the action period of each frame in a video segment.
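A minimal sketch of this labeling procedure, assuming a clip is represented as an array of frames (the frame contents do not matter for the labels):

```python
import numpy as np

def make_labeled_sample(V, N):
    """Repeat a single-action clip V of T frames N times: the new clip V' has N*T frames,
    a per-frame action period label of T, and a total action count of N."""
    V = np.asarray(V)
    T = V.shape[0]
    V_prime = np.concatenate([V] * N, axis=0)
    period_labels = np.full(V_prime.shape[0], T)   # ground-truth period t* for every frame
    return V_prime, period_labels, N
```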
Step S152: and acquiring the action times of the target object in the video to be processed based on the action period with the maximum period duration.
As an implementation manner, the action times of the target object in the video to be processed may be obtained based on the action period with the maximum period duration according to an action time calculation rule, where the action time calculation rule may include:
$$C_L = \sum_{i=1}^{L} \frac{1}{T_i},$$

wherein $C_L$ characterizes the number of actions of the target object up to the L-th frame of the video to be processed, $T_i$ characterizes the action period of the i-th frame of the video to be processed, and $\sum_{i=1}^{L} \frac{1}{T_i}$ characterizes the sum of the per-frame action counts of the target object from the 1st frame to the L-th frame of the video to be processed.
Optionally, the similarity matrix M is input into the action counting neural network, and the output of the last network layer, P ∈ R^{K×T}, is obtained through layer-by-layer forward propagation, where the number of rows of the matrix is the number of frames K of the video segment and the number of columns is the maximum period T. Optionally, if P(k, t) characterizes the probability that the period corresponding to the action of the k-th frame is t unit times, the period with the maximum probability value can be selected as the actual action period of the k-th frame, that is, the period $T_k$ of the action of the k-th frame can be obtained by the following formula:

$$T_k = \arg\max_{t} P(k, t).$$

The number of actions in the k-th frame can then be obtained by taking the reciprocal of the period; with $c_k$ characterizing the number of actions of the k-th frame:

$$c_k = \frac{1}{T_k}.$$

Optionally, the total number of actions in the video segment can be obtained by accumulating the number of actions of each frame of the video; recording the total number of actions as C, then:

$$C = \sum_{i=1}^{K} c_i,$$
wherein $c_i$ characterizes the number of actions of the i-th frame.
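As a final illustrative sketch (assuming the column index t starts at a period of one frame), the count can be read off the network output P as follows:

```python
import numpy as np

def count_actions(P):
    """Total action count from per-frame period probabilities P of shape (K, T),
    where P[k, t] is the probability that frame k's action period is (t + 1) frames."""
    T_k = np.argmax(P, axis=1) + 1     # T_k = argmax_t P(k, t)
    c_k = 1.0 / T_k                    # per-frame count c_k = 1 / T_k
    return float(np.sum(c_k))          # C = sum_i c_i
```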
With the action counting method of this embodiment, the number of actions of the target object in the video images can be counted automatically. For example, in a specific application scenario in which the target object is a person and the action to be counted is a squat, as shown in fig. 10, the current number of squats of the user can be counted and the counting result can be updated in real time. Optionally, the action frequency of the user can also be displayed; in fig. 10, the current number of squats (i.e., Count) of the user is 5, and the squat frequency (i.e., Rate) is 0.1923 Hz.
In the action counting method provided by this embodiment, a video to be processed is acquired, then skeletal key points of a target object in the video to be processed are acquired, then target key point features are acquired based on the skeletal key points, then a similarity matrix corresponding to the video to be processed is calculated based on the target key point features, and then action times of the target object in the video to be processed are acquired based on the similarity matrix. Therefore, by the method, under the condition that the bone key points of the target object in the video to be processed are obtained, the target key point characteristics can be obtained based on the bone key points, the similarity matrix corresponding to the video to be processed is calculated based on the target key point characteristics, and then the action times of the target object in the video to be processed are obtained based on the similarity matrix, so that the bone key points of the target object in the video image can be analyzed without analyzing the whole picture of the video image, the calculation complexity is reduced, and the accuracy of action counting is improved.
Referring to fig. 11, which is a block diagram of a motion counting apparatus according to an embodiment of the present disclosure, in this embodiment, a motion counting apparatus 200 is provided, which can be operated in an electronic device, where the apparatus 200 includes: the video data acquisition module 210, the bone key point acquisition module 220, the key point feature acquisition module 230, the calculation module 240, and the counting module 250:
the video data obtaining module 210 is configured to obtain a video to be processed.
A bone key point obtaining module 220, configured to obtain bone key points of a target object in the video to be processed.
As one mode, the skeleton key point obtaining module 220 may be specifically configured to input the video to be processed into a target pose estimation network, and obtain a plurality of reference key points output by the target pose estimation network; bone keypoints are obtained from the plurality of reference keypoints. Wherein the step of obtaining bone keypoints from the plurality of reference keypoints may comprise: acquiring reference position associated parameters respectively corresponding to the plurality of reference key points; acquiring weight parameters of the plurality of key points under limb joint constraint; obtaining a product of the reference position correlation parameter and the weight parameter; and taking the key point corresponding to the product with the maximum value as the bone key point.
A key point feature obtaining module 230, configured to obtain a target key point feature based on the bone key point.
As a manner, the key point feature obtaining module 230 may be specifically configured to obtain a spatial correlation feature corresponding to the bone key point; acquiring time-related features corresponding to the bone key points; and splicing the space correlation characteristics and the time correlation characteristics to obtain target key point characteristics.
A calculating module 240, configured to calculate, based on the target keypoint feature, a similarity matrix corresponding to the video to be processed.
Optionally, the video to be processed in this embodiment may include multiple frames of images. In this way, the calculating module 240 may be specifically configured to obtain a distance between keypoint features of a plurality of any two frames of images in the multi-frame image; and normalizing the distances according to a specified calculation rule to obtain a plurality of elements, and taking a matrix formed by combining the elements as a similarity matrix corresponding to the video to be processed. Optionally, the specified calculation rule is:
$$M(i, j) = \frac{e^{M'(i, j)}}{\sum_{q=1}^{K} e^{M'(i, q)}}$$
the method comprises the steps that M (i, j) represents the similarity between the key point feature of the ith frame image of the video to be processed and the key point feature of the jth frame image of the video to be processed, i represents the ith frame of the video to be processed, j represents the jth frame of the video to be processed, q represents an auxiliary variable, K represents a positive integer, e represents a natural constant, M '(i, j) represents the distance between the key point feature of the ith frame image of the video to be processed and the key point feature of the jth frame image of the video to be processed, and M' (i, q) represents the distance between the key point feature of the ith frame image of the video to be processed and the key point feature of the qth frame image of the video to be processed.
A counting module 250, configured to obtain, based on the similarity matrix, the number of actions of the target object in the video to be processed.
As one mode, the counting module 250 may be specifically configured to input the similarity matrix into an action counting neural network, and obtain an action cycle output by the action counting neural network and corresponding to each frame of image in the video to be processed; and acquiring the action times of the target object in the video to be processed based on the action period with the maximum period duration. Optionally, the obtaining the number of actions of the target object in the video to be processed based on the action cycle with the maximum cycle duration includes: acquiring the action times of the target object in the video to be processed based on the action period with the maximum period duration according to an action time calculation rule, wherein the action time calculation rule comprises the following steps:
$$C_L = \sum_{i=1}^{L} \frac{1}{T_i},$$

wherein $C_L$ characterizes the number of actions of the target object up to the L-th frame of the video to be processed, $T_i$ characterizes the action period of the i-th frame of the video to be processed, and $\sum_{i=1}^{L} \frac{1}{T_i}$ characterizes the sum of the action counts of the target object from the 1st frame to the L-th frame of the video to be processed.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, the coupling or direct coupling or communication connection between the modules shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or modules may be in an electrical, mechanical or other form.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
Referring to fig. 12, based on the motion counting method and apparatus, an electronic device 100 capable of performing the motion counting method is further provided in the embodiment of the present application. The electronic device 100 includes a memory 102 and one or more processors 104 (only one shown) coupled to each other, the memory 102 and the processors 104 being communicatively coupled to each other. The memory 102 stores therein a program that can execute the contents of the foregoing embodiments, and the processor 104 can execute the program stored in the memory 102.
The processor 104 may include one or more processing cores, among other things. The processor 104 interfaces with various components throughout the electronic device 100 using various interfaces and circuitry to perform various functions of the electronic device 100 and process data by executing or executing instructions, programs, code sets, or instruction sets stored in the memory 102 and invoking data stored in the memory 102. Alternatively, the processor 104 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 104 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. Wherein, the CPU mainly processes an operating system, a user interface, an application program and the like; the GPU is used for rendering and drawing display content; the modem is used to handle wireless communications. It is understood that the modem may not be integrated into the processor 104, but may be implemented by a communication chip.
The Memory 102 may include a Random Access Memory (RAM) or a Read-Only Memory (Read-Only Memory). The memory 102 may be used to store instructions, programs, code sets, or instruction sets. The memory 102 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the foregoing embodiments, and the like. The data storage area may also store data created by the electronic device 100 during use (e.g., phone book, audio-video data, chat log data), and the like.
Referring to fig. 13, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable storage medium 300 has stored therein program code that can be called by a processor to execute the method described in the above-described method embodiments.
The computer-readable storage medium 300 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Alternatively, the computer-readable storage medium 300 includes a non-transitory computer-readable storage medium. The computer readable storage medium 300 has storage space for program code 310 for performing any of the method steps of the method described above. The program code can be read from or written to one or more computer program products. The program code 310 may be compressed, for example, in a suitable form.
To sum up, according to the action counting method, the action counting device, the electronic device, and the storage medium provided by the embodiments of the present application, the video to be processed is obtained, then the bone key points of the target object in the video to be processed are obtained, then the target key point features are obtained based on the bone key points, then the similarity matrix corresponding to the video to be processed is calculated based on the target key point features, and then the action times of the target object in the video to be processed are obtained based on the similarity matrix. Therefore, by the method, under the condition that the bone key points of the target object in the video to be processed are obtained, the target key point characteristics can be obtained based on the bone key points, the similarity matrix corresponding to the video to be processed is calculated based on the target key point characteristics, and then the action times of the target object in the video to be processed are obtained based on the similarity matrix, so that the bone key points of the target object in the video image can be analyzed without analyzing the whole picture of the video image, the calculation complexity is reduced, and the accuracy of action counting is improved.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not necessarily depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A method of motion counting, the method comprising:
acquiring a video to be processed;
obtaining skeleton key points of a target object in the video to be processed;
acquiring target key point characteristics based on the skeleton key points;
calculating a similarity matrix corresponding to the video to be processed based on the target key point characteristics;
inputting the similarity matrix into an action counting neural network, and acquiring an action cycle output by the action counting neural network and corresponding to each frame of image in the video to be processed;
and acquiring the action times of the target object in the video to be processed based on the action period with the maximum period duration.
2. The method according to claim 1, wherein the obtaining of skeletal key points of a target object in the video to be processed comprises:
inputting the video to be processed into a target attitude estimation network, and acquiring a plurality of reference key points output by the target attitude estimation network;
bone keypoints are obtained from the plurality of reference keypoints.
3. The method of claim 2, wherein said obtaining skeletal keypoints from said plurality of reference keypoints comprises:
acquiring reference position associated parameters respectively corresponding to the plurality of reference key points, wherein the reference position associated parameters are used for representing whether a certain two key points in the plurality of key points are connected or not;
acquiring weight parameters of the plurality of key points under limb joint constraint;
obtaining a product of the reference position correlation parameter and the weight parameter;
and taking the key point corresponding to the product with the maximum value as the bone key point.
4. The method of claim 1, wherein said obtaining target keypoint features based on said bone keypoints comprises:
acquiring space related characteristics corresponding to the bone key points;
acquiring time-related features corresponding to the bone key points;
and splicing the space correlation characteristics and the time correlation characteristics to obtain target key point characteristics.
5. The method according to claim 1, wherein the video to be processed comprises a plurality of frames of images, and the calculating a similarity matrix corresponding to the video to be processed based on the target keypoint features comprises:
acquiring the distance between the key point characteristics of a plurality of any two frames of images in the multi-frame image;
and normalizing the plurality of distances to obtain a plurality of elements, and taking a matrix formed by combining the plurality of elements as a similarity matrix corresponding to the video to be processed.
6. The method of claim 5, wherein the normalizing the plurality of distances is:
$$M(i, j) = \frac{e^{M'(i, j)}}{\sum_{q=1}^{D} e^{M'(i, q)}}$$
the method comprises the steps that M (i, j) represents the similarity between the key point feature of the ith frame image of the video to be processed and the key point feature of the jth frame image of the video to be processed, i represents the ith frame of the video to be processed, j represents the jth frame of the video to be processed, q represents an auxiliary variable, D represents a positive integer, e represents a natural constant, M '(i, j) represents the distance between the key point feature of the ith frame image of the video to be processed and the key point feature of the jth frame image of the video to be processed, and M' (i, q) represents the distance between the key point feature of the ith frame image of the video to be processed and the key point feature of the qth frame image of the video to be processed.
7. The method according to claim 1, wherein the obtaining the number of times of actions of the target object in the video to be processed based on the action period with the maximum period duration comprises:
acquiring the action times of the target object in the video to be processed based on the action period with the maximum period duration according to an action time calculation rule, wherein the action time calculation rule comprises the following steps:
$$C_L = \sum_{i=1}^{L} \frac{1}{T_i}$$

wherein $C_L$ characterizes the number of actions of the target object up to the L-th frame of the video to be processed, $T_i$ characterizes the action period of the i-th frame of the video to be processed, and $\sum_{i=1}^{L} \frac{1}{T_i}$ characterizes the sum of the action counts of the target object from the 1st frame to the L-th frame of the video to be processed.
8. An action counting device, characterized in that the device comprises:
the video data acquisition module is used for acquiring a video to be processed;
the skeleton key point acquisition module is used for acquiring the skeleton key points of the target object in the video to be processed;
a key point feature obtaining module, configured to obtain a target key point feature based on the bone key point;
the calculation module is used for calculating a similarity matrix corresponding to the video to be processed based on the target key point characteristics;
the counting module is used for inputting the similarity matrix into an action counting neural network and acquiring an action cycle output by the action counting neural network and corresponding to each frame of image in the video to be processed; and acquiring the action times of the target object in the video to be processed based on the action period with the maximum period duration.
9. An electronic device comprising one or more processors and memory;
one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of any of claims 1-7.
10. A computer-readable storage medium, having program code stored therein, wherein the program code when executed by a processor performs the method of any of claims 1-7.
CN202011059856.8A 2020-09-30 2020-09-30 Action counting method and device, electronic equipment and storage medium Active CN112149602B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011059856.8A CN112149602B (en) 2020-09-30 2020-09-30 Action counting method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011059856.8A CN112149602B (en) 2020-09-30 2020-09-30 Action counting method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112149602A CN112149602A (en) 2020-12-29
CN112149602B true CN112149602B (en) 2022-03-25

Family

ID=73951262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011059856.8A Active CN112149602B (en) 2020-09-30 2020-09-30 Action counting method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112149602B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818801B (en) * 2021-01-26 2024-04-26 每步科技(上海)有限公司 Motion counting method, recognition device, recognition system and storage medium
CN114764946B (en) * 2021-09-18 2023-08-11 北京甲板智慧科技有限公司 Action counting method and system based on time sequence standardization and intelligent terminal

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020018469A1 (en) * 2018-07-16 2020-01-23 The Board Of Trustees Of The Leland Stanford Junior University System and method for automatic evaluation of gait using single or multi-camera recordings
CN111507301B (en) * 2020-04-26 2021-06-08 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN111275032B (en) * 2020-05-07 2020-09-15 西南交通大学 Deep squatting detection method, device, equipment and medium based on human body key points
CN111282248A (en) * 2020-05-12 2020-06-16 西南交通大学 Pull-up detection system and method based on skeleton and face key points
CN111368810B (en) * 2020-05-26 2020-08-25 西南交通大学 Sit-up detection system and method based on human body and skeleton key point identification

Also Published As

Publication number Publication date
CN112149602A (en) 2020-12-29


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210114

Address after: 511442 3108, 79 Wanbo 2nd Road, Nancun Town, Panyu District, Guangzhou City, Guangdong Province

Applicant after: GUANGZHOU CUBESILI INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 511400 24th floor, building B-1, North District, Wanda Commercial Plaza, Wanbo business district, No.79 Wanbo 2nd Road, Nancun Town, Panyu District, Guangzhou, Guangdong Province

Applicant before: GUANGZHOU HUADUO NETWORK TECHNOLOGY Co.,Ltd.

EE01 Entry into force of recordation of patent licensing contract
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20201229

Assignee: GUANGZHOU HUADUO NETWORK TECHNOLOGY Co.,Ltd.

Assignor: GUANGZHOU CUBESILI INFORMATION TECHNOLOGY Co.,Ltd.

Contract record no.: X2021440000054

Denomination of invention: Action counting method, device, electronic equipment and storage medium

License type: Common License

Record date: 20210208

GR01 Patent grant
GR01 Patent grant