CN112149602B - Action counting method and device, electronic equipment and storage medium - Google Patents

Action counting method and device, electronic equipment and storage medium

Info

Publication number
CN112149602B
CN112149602B
Authority
CN
China
Prior art keywords
video
processed
action
key point
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011059856.8A
Other languages
Chinese (zh)
Other versions
CN112149602A (en)
Inventor
祁雷
王雷
张波
陈广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Cubesili Information Technology Co Ltd
Original Assignee
Guangzhou Cubesili Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Cubesili Information Technology Co Ltd filed Critical Guangzhou Cubesili Information Technology Co Ltd
Priority to CN202011059856.8A
Publication of CN112149602A
Application granted
Publication of CN112149602B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an action counting method and device, an electronic device, and a storage medium. With the method, the number of actions of a target object in a video to be processed can be obtained by analyzing only the skeletal key points of the target object in the video images, without analyzing the whole picture of each video image, thereby reducing computational complexity and improving the accuracy of action counting.

Description

Action counting method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for counting actions, an electronic device, and a storage medium.
Background
With the development of computer vision technology, action counting has been widely applied in fields such as behavior monitoring, sports, and game design. Existing action counting methods need to analyze the whole picture of each video image, which inevitably introduces a large amount of interference information irrelevant to action counting, increases computational complexity, and reduces the accuracy of action counting.
Disclosure of Invention
In view of the above problems, the present application provides an action counting method, an action counting device, an electronic device, and a storage medium to address the above problems.
In a first aspect, an embodiment of the present application provides an action counting method, which is applied to an electronic device, and the method includes: acquiring a video to be processed; obtaining skeleton key points of a target object in the video to be processed; acquiring target key point characteristics based on the skeleton key points; calculating a similarity matrix corresponding to the video to be processed based on the target key point characteristics; and acquiring the action times of the target object in the video to be processed based on the similarity matrix.
In a second aspect, an embodiment of the present application provides an action counting apparatus, which is operated in an electronic device, and includes: the video data acquisition module is used for acquiring a video to be processed; the skeleton key point acquisition module is used for acquiring the skeleton key points of the target object in the video to be processed; a key point feature obtaining module, configured to obtain a target key point feature based on the bone key point; the calculation module is used for calculating a similarity matrix corresponding to the video to be processed based on the target key point characteristics; and the counting module is used for acquiring the action times of the target object in the video to be processed based on the similarity matrix.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory and one or more processors; one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of the first aspect described above.
In a fourth aspect, the present application provides a computer-readable storage medium, in which a program code is stored, wherein when the program code is executed by a processor, the method according to the first aspect is performed.
According to the action counting method and device, the electronic device and the storage medium, the video to be processed is obtained, then the skeleton key points of the target object in the video to be processed are obtained, then the target key point characteristics are obtained based on the skeleton key points, then the similarity matrix corresponding to the video to be processed is calculated based on the target key point characteristics, and then the action times of the target object in the video to be processed are obtained based on the similarity matrix. Therefore, by the method, under the condition that the bone key points of the target object in the video to be processed are obtained, the target key point characteristics can be obtained based on the bone key points, the similarity matrix corresponding to the video to be processed is calculated based on the target key point characteristics, and then the action times of the target object in the video to be processed are obtained based on the similarity matrix, so that the bone key points of the target object in the video image can be analyzed without analyzing the whole picture of the video image, the calculation complexity is reduced, and the accuracy of action counting is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
Fig. 1 shows a schematic diagram of an application environment provided by an embodiment of the present application.
Fig. 2 shows a flowchart of a method of an action counting method according to an embodiment of the present application.
Fig. 3 shows a method flowchart of step S120 in fig. 2.
Fig. 4 shows a method flowchart of step S122 in fig. 3.
Fig. 5 is a schematic diagram illustrating a position of a skeletal key point of a target object according to an embodiment of the present application.
Fig. 6 shows a method flowchart of step S130 in fig. 2.
Fig. 7 shows a flowchart of the method of step S140 in fig. 2.
Fig. 8 shows a method flowchart of step S150 in fig. 2.
Fig. 9 shows a network structure diagram of an action counting neural network provided in an embodiment of the present application.
Fig. 10 shows an example diagram for counting the number of actions of a target object according to an embodiment of the present application.
Fig. 11 shows a block diagram of a motion counting apparatus according to an embodiment of the present application.
Fig. 12 shows a block diagram of an electronic device according to an embodiment of the present application.
Fig. 13 illustrates a storage unit for storing or carrying program code implementing the action counting method according to the embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The counting of actions may be simply understood as counting repeated actions. The current motion counting methods can be roughly classified into two types according to the difference of the emphasis points: a frequency domain analysis based motion counting method and a matching based motion counting method. The motion counting method based on frequency domain analysis is to find the motion period and the category of a target under the condition of a known target track, and the motion counting method based on matching is to find repeated pictures in a time sequence by adding geometric constraint so as to identify the motion period. However, the conventional motion counting method needs to analyze the whole picture of the video image, so that a large amount of interference information irrelevant to the motion counting is inevitably introduced, the complexity of calculation is increased, and the accuracy of the motion counting is reduced.
In view of the above problems, the inventors have found, through long-term research, that the accuracy of motion counting can be improved while reducing the computational complexity if the motion counting model is focused on an effective motion region. Specifically, the video to be processed is acquired, then skeleton key points of the target object in the video to be processed are acquired, then the characteristics of the target key points are acquired based on the skeleton key points, then the similarity matrix corresponding to the video to be processed is calculated based on the characteristics of the target key points, and then the action times of the target object in the video to be processed is acquired based on the similarity matrix. Therefore, by the method, under the condition that the bone key points of the target object in the video to be processed are obtained, the target key point characteristics can be obtained based on the bone key points, the similarity matrix corresponding to the video to be processed is calculated based on the target key point characteristics, and then the action times of the target object in the video to be processed are obtained based on the similarity matrix, so that the bone key points of the target object in the video image can be analyzed without analyzing the whole picture of the video image, the calculation complexity is reduced, and the accuracy of action counting is improved. Therefore, an action counting method, an action counting device, an electronic device and a storage medium provided by the embodiment of the application are provided.
For the convenience of describing the scheme of the present application in detail, an application environment in the embodiment of the present application is described below with reference to the accompanying drawings.
Referring to fig. 1, a schematic application environment diagram of an action counting method according to an embodiment of the present application is shown. The application environment may be understood as a network system 10 of the embodiment of the present application, where the network system 10 includes a user terminal 11 and a server 12. Optionally, the user terminal 11 may be any device having communication and storage functions, including but not limited to a PC (Personal Computer), a tablet computer, a smart television, a smart phone, a smart wearable device, or other smart communication devices having a network connection function. The server 12 may be a single server (a network access server), a server cluster composed of a plurality of servers (a cloud server), or a cloud computing center (a database server).
In this embodiment, the user terminal 11 may be configured to record or shoot a short video and count the number of actions (optionally, actions such as squatting, head turning, and running) of a target object in the recorded or shot video images. The target object may be, for example, a person; optionally, in some other embodiments, the target object may also be another living being, such as a cat, a dog, or a monkey, which is not specifically limited. To increase the speed of computing the number of actions of the target object, the user terminal 11 may send the counting result to the server 12 through a network for storage, which reduces occupation of the storage space of the user terminal 11 and further increases the computation speed for the number of actions of the target object.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 2, a flowchart of an action counting method according to an embodiment of the present application is shown, where the action counting method is applicable to an electronic device, and the method includes:
step S110: and acquiring a video to be processed.
In this embodiment, the video to be processed may include a plurality of video segments, each video segment may include a plurality of frames of images, each video segment may include at least one complete action of the target object, and the action may include squat, head rotation, running, and other actions. In some embodiments, if the action of the target object is "squat", at least one action of "standing-squat-standing" or the like of the target object may be included in one video clip.
Optionally, in a live broadcast scenario, when two anchor users are connected for a PK (battle) session, the anchor user who wins the PK may assign a penalty to the losing anchor user. The penalty may include asking the losing anchor user to sing a song, or to squat, turn the head, or run 10 times, and the specific penalty is not limited. As one way, a video that includes connected-PK state images in a live scene may be used as the video to be processed. When the video includes connected-PK state images, it may include two different user account IDs (different anchor users correspond to different user account IDs).
Alternatively, when a video image is detected to include a motion that occurs repeatedly, the video image may be identified as a video to be processed. For example, when it is detected that a complete squat course of action "standing-squat-standing" is included in the video image, if it is detected that the user "squats" again, the video image may be identified as the video to be processed.
Step S120: and acquiring bone key points of the target object in the video to be processed.
In the process of counting the number of actions of the target object, the counting accuracy may be affected by abnormal action postures or by external environment changes such as background and illumination. For example, when some users perform a squat action, if the squat amplitude is small (for example, only a half squat), the squat may not be counted, thereby affecting the counting accuracy.
As a way to improve the above problem, the present embodiment may acquire the bone key points of the target object in the video to be processed, so that the counting of the number of actions of the target object may be completed subsequently by means of the bone key points. The specific acquisition process of the skeletal key points of the target object is described as follows.
Referring to fig. 3, as an alternative, step S120 may include:
step S121: and inputting the video to be processed into a target attitude estimation network, and acquiring a plurality of reference key points output by the target attitude estimation network.
In this embodiment, the target posture estimation network may be used to estimate skeletal key points of the target object, the target posture estimation network is obtained by training based on a posture estimation network model, and optionally, the posture estimation network model may be a MobileNetV2 network model. In this embodiment, the convolution part of the MobileNetV2 network model may be used as the backbone network of the target attitude estimation network, and two attitude estimation losses are added as the loss layers of the target attitude estimation network. Wherein, the two posture estimation losses can be respectively a key point position prediction loss and a limb prediction loss. The calculation rule corresponding to the predicted loss of the key point position can be expressed as:
$$F_S = \sum_{j=1}^{J} \sum_{p} W(p) \cdot \left\| S_j(p) - S_j^*(p) \right\|_2^2$$

wherein j represents the j-th key point, and there are J key points in total; p represents the p-th pixel of the video frame; W is a binary mask, and W(p) = 0 indicates that the p-th pixel does not participate in the final loss calculation; $S_j(p)$ characterizes the probability score output by the network that the p-th pixel belongs to the j-th key point, and $S_j^*(p)$ characterizes whether the p-th pixel really belongs to the j-th key point.
Correspondingly, the calculation rule corresponding to the limb prediction loss can be expressed as:

$$F_L = \sum_{c=1}^{C} \sum_{p} W(p) \cdot \left\| L_c(p) - L_c^*(p) \right\|_2^2$$

wherein c represents the c-th limb segment, and there are C limb segments in total; p represents the p-th pixel of the video frame; W is a binary mask, and W(p) = 0 indicates that the p-th pixel does not participate in the final loss calculation; $L_c(p)$ characterizes the probability score output by the network that the p-th pixel belongs to the c-th limb segment, and $L_c^*(p)$ characterizes whether the p-th pixel really belongs to the c-th limb segment.
Optionally, the sum of the key point position prediction loss and the limb prediction loss may be used as the training loss of the target posture estimation network, that is, the training loss of the target posture estimation network in this embodiment may be expressed as F = F_L + F_S. On the basis of the determined backbone network and training loss function, the backbone network can be trained based on the training loss function; optionally, the backbone network can be trained by error back-propagation, so that the target posture estimation network is obtained.
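For illustration only (this sketch is not part of the patent disclosure), the combined training loss described above could be computed as follows, assuming the key point score maps are stored as arrays of shape (J, H, W), the limb segment score maps as (C, H, W), and the binary mask as (H, W):

```python
import numpy as np

def pose_estimation_loss(S_pred, S_true, L_pred, L_true, W):
    """Sum of the key point position prediction loss F_S and the limb prediction loss F_L.

    S_pred, S_true: (J, H, W) predicted / ground-truth key point score maps.
    L_pred, L_true: (C, H, W) predicted / ground-truth limb segment score maps.
    W:              (H, W) binary mask; pixels with W == 0 do not contribute to the loss.
    """
    F_S = np.sum(W[None] * (S_pred - S_true) ** 2)   # key point position prediction loss
    F_L = np.sum(W[None] * (L_pred - L_true) ** 2)   # limb prediction loss
    return F_L + F_S                                  # F = F_L + F_S
```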
As an implementation manner, a video to be processed may be input into a target pose estimation network, and a plurality of reference key points output by the target pose estimation network may be obtained. It should be noted that the plurality of reference keypoints are potential keypoints of the target object in the video image to be processed.
Step S122: bone keypoints are obtained from the plurality of reference keypoints.
Optionally, the bone key points may be obtained from the multiple potential key points based on a key point screening rule, and the key point screening rule in this embodiment may be expressed as:

$$\operatorname{set}(j_1, j_2) = \arg\max_{m \in D_{j_1},\, n \in D_{j_2}} \; z_{j_1 j_2} \cdot E_{mn}$$

wherein $D_{j_1}$ and $D_{j_2}$ are respectively the sets of potential positions of the j1-th and j2-th key points on the video image to be processed, and m and n respectively index two positions in these sets. set(j1, j2) characterizes the screened key point pair, $z_{j_1 j_2}$ represents whether the j1-th and j2-th key points are connected (1 if connected, 0 if not connected), and $E_{mn}$ characterizes the weight of the two points in $D_{j_1}$ and $D_{j_2}$ under the limb segment constraint.
Optionally, the process of screening according to the key point screening rule is described as follows:
referring to fig. 4, as an alternative, step S122 may include:
step S1221: and acquiring reference position associated parameters respectively corresponding to the plurality of reference key points.
The reference position associated parameter is used to represent whether two key points of the plurality of key points are connected, optionally, if the two key points are connected, the reference position associated parameter is 1, and if the two key points are not connected, the reference position associated parameter is 0.
Step S1222: and acquiring weight parameters of the plurality of key points under limb joint constraint.
Optionally, the weight parameter is used to characterize the weight of any two of the plurality of key points under the limb segment constraint. The weight parameter may be expressed as:

$$E_{mn} = \int_{u=0}^{1} L_c\big(p(u)\big) \cdot \frac{d_n - d_m}{\left\| d_n - d_m \right\|_2} \, du$$

wherein $d_m$ and $d_n$ are defined as the coordinates of the m-th and n-th points, and p(u) is the interpolated coordinate, which can be expressed as:

$$p(u) = (1 - u)\, d_{j_1} + u\, d_{j_2}$$
step S1223: and acquiring the product of the reference position correlation parameter and the weight parameter.
Step S1224: and taking the key point corresponding to the product with the maximum value as the bone key point.
As one approach, the key point corresponding to the product of the largest median values of the plurality of key points may be used as the bone key point of the target object.
For example, in a specific application scenario, a video frame I may be input into the trained target pose estimation network to obtain a key point score map S and a limb segment score map L corresponding to the video frame I, where $S_j(p)$ characterizes the likelihood that the p-th pixel in video frame I belongs to the j-th key point, and $L_c(p)$ characterizes the likelihood that the p-th pixel in video frame I belongs to the c-th limb segment. Through a non-maximum suppression algorithm, the potential key points of the target object can be obtained from the key point score map S, and the potential key points can then be screened by the key point screening rule described above, so that the key points closest to the real situation (i.e., the skeletal key points of the target object) are selected. Optionally, if the target object is a human, the skeletal key point position schematic diagram shown in fig. 5 can be obtained according to the skeletal key point acquisition method of this embodiment. As shown in fig. 5, the electronic device may mark the skeletal key points of the target object with serial numbers; optionally, during marking, the trunk (for example, the limbs of the human body) may be marked first, and then the other parts of the target object.
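As a rough, hypothetical sketch of the screening step (the scalar sampling of the limb score map along p(u) is a simplification assumed here, not quoted from the patent), the candidate pair with the largest product of the connection flag z and the weight E_mn could be selected as follows:

```python
import numpy as np

def limb_weight(L_c, d_m, d_n, num_samples=10):
    """Approximate E_mn by averaging the limb segment score L_c along the
    interpolated points p(u) = (1 - u) * d_m + u * d_n."""
    d_m, d_n = np.asarray(d_m, float), np.asarray(d_n, float)
    score = 0.0
    for u in np.linspace(0.0, 1.0, num_samples):
        x, y = np.rint((1 - u) * d_m + u * d_n).astype(int)  # pixel nearest to p(u)
        score += L_c[y, x]
    return score / num_samples

def screen_keypoints(D_j1, D_j2, L_c, connected=True):
    """Return the candidate pair whose product z * E_mn is largest."""
    z = 1 if connected else 0                                # reference position associated parameter
    best, best_pair = -np.inf, None
    for d_m in D_j1:
        for d_n in D_j2:
            value = z * limb_weight(L_c, d_m, d_n)
            if value > best:
                best, best_pair = value, (d_m, d_n)
    return best_pair
```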
Step S130: and acquiring target key point characteristics based on the skeleton key points.
Referring to fig. 6, as an alternative, step S130 may include:
step S131: and acquiring the space related characteristics corresponding to the bone key points.
The spatial correlation features may be obtained by calculating the spatial relative positions of different key points. Specifically, the position set of the key points of the target object in the k-th video frame may be defined as

$$P^k = \{p_1^k, p_2^k, \dots, p_J^k\},$$

and the spatial correlation feature corresponding to the skeletal key points of the target object (which may be denoted by $V_S$) can be constructed in the following way:

$$V_S^k(i, j) = \left\| p_i^k - p_j^k \right\|_2,$$

wherein $V_S^k(i, j)$ characterizes the Euclidean distance between the i-th and the j-th skeletal key points in the k-th frame. By adopting relative positions, the influence of changes such as the camera viewing angle on the count of the number of actions of the target object can be reduced.
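Assuming the key point positions of a frame are given as a (J, 2) coordinate array (an assumption made only for this illustration), the spatial correlation feature could be computed as a pairwise distance matrix:

```python
import numpy as np

def spatial_feature(P_k):
    """V_S^k: Euclidean distances between every pair of the J key points of frame k.

    P_k: (J, 2) array of key point coordinates in the k-th frame.
    """
    P_k = np.asarray(P_k, float)
    diff = P_k[:, None, :] - P_k[None, :, :]   # (J, J, 2) relative positions
    return np.linalg.norm(diff, axis=-1)       # (J, J) distance matrix
```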
Step S132: and acquiring time-related features corresponding to the bone key points.
Optionally, the time-related feature may reflect a time sequence of the action of the target object, and as a manner, the time-sequence feature of the current video frame may be obtained by calculating a change of a position of the same key point between the current video frame and an adjacent video frame, and the time-sequence feature may be used as a time-related feature corresponding to a skeletal key point of the target object.
Specifically, if $V_T$ is used to represent the time-related features, the time-related features may be constructed by:

$$V_T^k(i) = \left\| p_i^{k+1} - p_i^k \right\|_2,$$

wherein $p_i^{k+1}$ characterizes the position of the i-th skeletal key point in the (k+1)-th video frame.
Step S133: and splicing the space correlation characteristics and the time correlation characteristics to obtain target key point characteristics.
Optionally, the target key point features may be obtained by splicing the spatial correlation features and the time correlation features. Assuming that the target key point features are computed for the k-th video frame, the specific stitching principle can be expressed as follows:

$$f^k = \operatorname{Concat}\big(\operatorname{vec}(V_S^k), \operatorname{vec}(V_T^k)\big),$$

wherein $f^k$ characterizes the features of the k-th video frame, vec(·) represents the vectorization operation, and Concat(·) represents the concatenation operation (the spatial correlation features can be spliced with the time correlation features through the concatenation operation).
Step S140: and calculating a similarity matrix corresponding to the video to be processed based on the target key point characteristics.
Optionally, the video to be processed in this embodiment may include multiple frames of images.
Referring to fig. 7, as an alternative, step S140 may include:
step S141: and acquiring the distance between the key point characteristics of a plurality of random two-frame images in the multi-frame image.
Assuming that the video clip has N frames in total, a matrix M' ∈ R^{N×N} can be defined, where the numbers of rows and columns of the matrix M' are both N. Let M'(i, j) denote the element in the i-th row and the j-th column of the matrix; M'(i, j) can be used to characterize the distance between the key point feature of the i-th frame image and the key point feature of the j-th frame image, and is defined as:

$$M'(i, j) = -\frac{\left\| f^i - f^j \right\|_2^2}{\tau},$$

where τ characterizes the scale control factor. The values of all the elements of the matrix M' can be obtained by the above formula, that is, the distances between the key point features of any two frames of the multiple frames of images can be obtained.
Step S142: and normalizing the distances according to a specified calculation rule to obtain a plurality of elements, and taking a matrix formed by combining the elements as a similarity matrix corresponding to the video to be processed.
Wherein, the specified calculation rule may be:

$$M(i, j) = \frac{e^{M'(i, j)}}{\sum_{q=1}^{D} e^{M'(i, q)}},$$

where M(i, j) characterizes the similarity between the key point feature of the i-th frame image of the video to be processed and the key point feature of the j-th frame image, i characterizes the i-th frame of the video to be processed, j characterizes the j-th frame of the video to be processed, q characterizes an auxiliary variable, D characterizes a positive integer, e characterizes the natural constant, M'(i, j) characterizes the distance between the key point feature of the i-th frame image and the key point feature of the j-th frame image, and M'(i, q) characterizes the distance between the key point feature of the i-th frame image and the key point feature of the q-th frame image.
By normalizing the plurality of distances, a plurality of elements with a value range of 0 to 1 can be obtained, and as a mode, a matrix formed by combining the plurality of elements can be used as a similarity matrix M corresponding to a video to be processed. It should be noted that, the number of rows and the number of columns of the similarity matrix M are both N, and the similarity matrix may be used to represent the similarity between the ith frame image and the jth frame image of the video to be processed.
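Purely as an illustrative sketch (the exact form of the distance M'(i, j) and the handling of the scale control factor τ are assumptions), the similarity matrix could be built as follows:

```python
import numpy as np

def similarity_matrix(features, tau=1.0):
    """Build the N x N similarity matrix M from the per-frame target key point features.

    features: (N, d) array, one feature vector f^k per frame.
    tau:      scale control factor (assumed here to scale a negative squared distance).
    """
    features = np.asarray(features, float)
    diff = features[:, None, :] - features[None, :, :]     # (N, N, d) pairwise differences
    M_prime = -np.sum(diff ** 2, axis=-1) / tau            # distances M'(i, j)
    M_prime -= M_prime.max(axis=1, keepdims=True)          # for numerical stability
    expM = np.exp(M_prime)
    return expM / expM.sum(axis=1, keepdims=True)          # row-normalized similarities M(i, j)
```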
Step S150: and acquiring the action times of the target object in the video to be processed based on the similarity matrix.
Referring to fig. 8, as an alternative, step S150 may include:
step S151: and inputting the similarity matrix into an action counting neural network, and acquiring an action cycle output by the action counting neural network and corresponding to each frame of image in the video to be processed.
Alternatively, the action counting neural network in this embodiment may be constructed from a convolutional neural network and a classifier, where the convolutional neural network may be formed by combining a plurality of repeated convolution layers (e.g., the convolution layer, the activation layer, and the pooling layer shown in fig. 9) and adding a classification layer (i.e., the softmax classification shown in fig. 9). Alternatively, the combined computation of the convolution layers from the l-th layer to the (l+1)-th layer in this embodiment can be obtained by the following formulas:

$$\hat{Z}_k^{l+1} = w_k^{l+1} * Z^l + b_k^{l+1},$$
$$A^{l+1} = \max\big(0, \hat{Z}^{l+1}\big),$$
$$Z^{l+1} = \operatorname{pooling}\big(A^{l+1}\big),$$

wherein $\hat{Z}_k^{l+1}$ characterizes the output of the convolution operation in the (l+1)-th layer, $w_k^{l+1}$ characterizes the k-th filter in the (l+1)-th layer, $b_k^{l+1}$ characterizes the weight bias of the k-th filter in the (l+1)-th layer, and $Z^l$ characterizes the output of the l-th layer; $A^{l+1}$ characterizes the output of the activation operation in the (l+1)-th layer, and max characterizes the max operation; $Z^{l+1}$ characterizes the overall output of the (l+1)-th layer, and pooling characterizes the pooling operation.
Optionally, the classification layer in this embodiment may adopt a SoftMax classifier, and the specific implementation manner is:

$$p(k, t) = \frac{e^{w_t^{\top} z_k}}{\sum_{t'=1}^{T} e^{w_{t'}^{\top} z_k}},$$

wherein p(k, t) characterizes the probability that the action period of the k-th video frame is t, w characterizes the parameters of the SoftMax classifier, $w_t$ characterizes its t-th column, and $z_k$ denotes the feature of the k-th frame output by the convolution layers.
Optionally, the input to the first layer of the convolutional neural network is the similarity matrix M, that is, Z^1 = M. Through layer-by-layer forward propagation, the output of the last layer of the network is obtained as P ∈ R^{K×T}, wherein P(k, t) characterizes the probability that the action period corresponding to the k-th frame is t unit times, and each unit time characterizes the duration of one frame. The corresponding loss function is:

$$F_p = -\log P(k, t^*),$$

wherein $t^*$ characterizes the actual action period of the k-th frame.
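The following PyTorch sketch is only one possible reading of this architecture; the number of blocks, the channel width, and the pooling arrangement (pooling only across columns so that one feature vector per frame remains) are assumptions, not the patent's disclosed configuration:

```python
import torch
import torch.nn as nn

class ActionCountingNet(nn.Module):
    """Convolution / activation / pooling blocks over the similarity matrix, followed by a
    per-frame SoftMax classification over candidate periods 1..t_max."""

    def __init__(self, t_max, channels=32):
        super().__init__()
        self.blocks = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),   # convolution layer
            nn.ReLU(),                                          # activation layer: max(0, .)
            nn.MaxPool2d((1, 2)),                               # pooling layer (columns only)
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        self.classifier = nn.Linear(channels, t_max)            # SoftMax classification layer

    def forward(self, M):
        # M: (batch, N, N) similarity matrix with N >= 4; returns (batch, N, t_max) = P(k, t).
        x = self.blocks(M.unsqueeze(1))                         # (batch, C, N, N // 4)
        x = x.mean(dim=3)                                       # average remaining columns
        x = x.permute(0, 2, 1)                                  # (batch, N, C): one feature per frame
        return torch.softmax(self.classifier(x), dim=-1)        # period probabilities P(k, t)
```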
As one way, the constructed convolutional neural network can be trained to obtain the action counting neural network. Optionally, a labeled data set may be constructed, and the loss function may be optimized by using an error back-propagation algorithm. The labeling process of the data set is as follows: several video segments each containing a single action are collected, and then each segment is repeated multiple times to form a new video. For a video segment V containing a single action with a length of T frames, the segment V can be repeated N times to form a video segment V' with a length of NT frames; the action period in V' is T, and the number of actions is N. In this way, the trained action counting neural network can be used to predict the action period of each frame in a video segment.
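A minimal sketch of this labeling procedure, assuming a clip is represented as an array of frames (the frame contents do not matter for the labels):

```python
import numpy as np

def make_labeled_sample(V, N):
    """Repeat a single-action clip V of T frames N times: the new clip V' has N*T frames,
    a per-frame action period label of T, and a total action count of N."""
    V = np.asarray(V)
    T = V.shape[0]
    V_prime = np.concatenate([V] * N, axis=0)
    period_labels = np.full(V_prime.shape[0], T)   # ground-truth period t* for every frame
    return V_prime, period_labels, N
```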
Step S152: and acquiring the action times of the target object in the video to be processed based on the action period with the maximum period duration.
As an implementation manner, the action times of the target object in the video to be processed may be obtained based on the action period with the maximum period duration according to an action time calculation rule, where the action time calculation rule may include:
$$C_L = \sum_{i=1}^{L} \frac{1}{T_i},$$

wherein $C_L$ characterizes the number of actions of the target object up to the L-th frame of the video to be processed, $T_i$ characterizes the action period of the i-th frame of the video to be processed, and $\sum_{i=1}^{L} \frac{1}{T_i}$ characterizes the sum of the per-frame action counts of the target object from the 1st frame to the L-th frame of the video to be processed.
Optionally, the similarity matrix M is input into the action counting neural network, and the output of the last network layer, P ∈ R^{K×T}, is obtained through layer-by-layer forward propagation, where the number of rows of the matrix is the number of frames K of the video segment and the number of columns is the maximum period T. Optionally, if P(k, t) characterizes the probability that the period corresponding to the action of the k-th frame is t unit times, the period with the maximum probability value can be selected as the actual action period of the k-th frame, that is, the period $T_k$ of the action of the k-th frame can be obtained by the following formula:

$$T_k = \arg\max_{t} P(k, t).$$

The number of actions in the k-th frame can then be obtained by taking the reciprocal of the period; with $c_k$ characterizing the number of actions of the k-th frame:

$$c_k = \frac{1}{T_k}.$$

Optionally, the total number of actions in the video segment can be obtained by accumulating the number of actions of each frame of the video; recording the total number of actions as C, then:

$$C = \sum_{i=1}^{K} c_i,$$
wherein $c_i$ characterizes the number of actions of the i-th frame.
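As a final illustrative sketch (assuming the column index t starts at a period of one frame), the count can be read off the network output P as follows:

```python
import numpy as np

def count_actions(P):
    """Total action count from per-frame period probabilities P of shape (K, T),
    where P[k, t] is the probability that frame k's action period is (t + 1) frames."""
    T_k = np.argmax(P, axis=1) + 1     # T_k = argmax_t P(k, t)
    c_k = 1.0 / T_k                    # per-frame count c_k = 1 / T_k
    return float(np.sum(c_k))          # C = sum_i c_i
```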
With the action counting method of this embodiment, the number of actions of the target object in the video images can be counted automatically. For example, in a specific application scenario in which the target object is a person and the action to be counted is a squat, as shown in fig. 10, the current number of squats of the user can be counted and the counting result can be updated in real time. Optionally, the action frequency of the user can also be displayed; in fig. 10, the current number of squats (i.e., Count) of the user is 5, and the squat frequency (i.e., Rate) is 0.1923 Hz.
In the action counting method provided by this embodiment, a video to be processed is acquired, then skeletal key points of a target object in the video to be processed are acquired, then target key point features are acquired based on the skeletal key points, then a similarity matrix corresponding to the video to be processed is calculated based on the target key point features, and then action times of the target object in the video to be processed are acquired based on the similarity matrix. Therefore, by the method, under the condition that the bone key points of the target object in the video to be processed are obtained, the target key point characteristics can be obtained based on the bone key points, the similarity matrix corresponding to the video to be processed is calculated based on the target key point characteristics, and then the action times of the target object in the video to be processed are obtained based on the similarity matrix, so that the bone key points of the target object in the video image can be analyzed without analyzing the whole picture of the video image, the calculation complexity is reduced, and the accuracy of action counting is improved.
Referring to fig. 11, which is a block diagram of a motion counting apparatus according to an embodiment of the present disclosure, in this embodiment, a motion counting apparatus 200 is provided, which can be operated in an electronic device, where the apparatus 200 includes: the video data acquisition module 210, the bone key point acquisition module 220, the key point feature acquisition module 230, the calculation module 240, and the counting module 250:
the video data obtaining module 210 is configured to obtain a video to be processed.
A bone key point obtaining module 220, configured to obtain bone key points of a target object in the video to be processed.
As one mode, the skeleton key point obtaining module 220 may be specifically configured to input the video to be processed into a target pose estimation network, and obtain a plurality of reference key points output by the target pose estimation network; bone keypoints are obtained from the plurality of reference keypoints. Wherein the step of obtaining bone keypoints from the plurality of reference keypoints may comprise: acquiring reference position associated parameters respectively corresponding to the plurality of reference key points; acquiring weight parameters of the plurality of key points under limb joint constraint; obtaining a product of the reference position correlation parameter and the weight parameter; and taking the key point corresponding to the product with the maximum value as the bone key point.
A key point feature obtaining module 230, configured to obtain a target key point feature based on the bone key point.
As a manner, the key point feature obtaining module 230 may be specifically configured to obtain a spatial correlation feature corresponding to the bone key point; acquiring time-related features corresponding to the bone key points; and splicing the space correlation characteristics and the time correlation characteristics to obtain target key point characteristics.
A calculating module 240, configured to calculate, based on the target keypoint feature, a similarity matrix corresponding to the video to be processed.
Optionally, the video to be processed in this embodiment may include multiple frames of images. In this way, the calculating module 240 may be specifically configured to obtain a distance between keypoint features of a plurality of any two frames of images in the multi-frame image; and normalizing the distances according to a specified calculation rule to obtain a plurality of elements, and taking a matrix formed by combining the elements as a similarity matrix corresponding to the video to be processed. Optionally, the specified calculation rule is:
$$M(i, j) = \frac{e^{M'(i, j)}}{\sum_{q=1}^{K} e^{M'(i, q)}}$$
the method comprises the steps that M (i, j) represents the similarity between the key point feature of the ith frame image of the video to be processed and the key point feature of the jth frame image of the video to be processed, i represents the ith frame of the video to be processed, j represents the jth frame of the video to be processed, q represents an auxiliary variable, K represents a positive integer, e represents a natural constant, M '(i, j) represents the distance between the key point feature of the ith frame image of the video to be processed and the key point feature of the jth frame image of the video to be processed, and M' (i, q) represents the distance between the key point feature of the ith frame image of the video to be processed and the key point feature of the qth frame image of the video to be processed.
A counting module 250, configured to obtain, based on the similarity matrix, the number of actions of the target object in the video to be processed.
As one mode, the counting module 250 may be specifically configured to input the similarity matrix into an action counting neural network, and obtain an action cycle output by the action counting neural network and corresponding to each frame of image in the video to be processed; and acquiring the action times of the target object in the video to be processed based on the action period with the maximum period duration. Optionally, the obtaining the number of actions of the target object in the video to be processed based on the action cycle with the maximum cycle duration includes: acquiring the action times of the target object in the video to be processed based on the action period with the maximum period duration according to an action time calculation rule, wherein the action time calculation rule comprises the following steps:
$$C_L = \sum_{i=1}^{L} \frac{1}{T_i},$$

wherein $C_L$ characterizes the number of actions of the target object up to the L-th frame of the video to be processed, $T_i$ characterizes the action period of the i-th frame of the video to be processed, and $\sum_{i=1}^{L} \frac{1}{T_i}$ characterizes the sum of the action counts of the target object from the 1st frame to the L-th frame of the video to be processed.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, the coupling or direct coupling or communication connection between the modules shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or modules may be in an electrical, mechanical or other form.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
Referring to fig. 12, based on the motion counting method and apparatus, an electronic device 100 capable of performing the motion counting method is further provided in the embodiment of the present application. The electronic device 100 includes a memory 102 and one or more processors 104 (only one shown) coupled to each other, the memory 102 and the processors 104 being communicatively coupled to each other. The memory 102 stores therein a program that can execute the contents of the foregoing embodiments, and the processor 104 can execute the program stored in the memory 102.
The processor 104 may include one or more processing cores, among other things. The processor 104 interfaces with various components throughout the electronic device 100 using various interfaces and circuitry to perform various functions of the electronic device 100 and process data by executing or executing instructions, programs, code sets, or instruction sets stored in the memory 102 and invoking data stored in the memory 102. Alternatively, the processor 104 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 104 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. Wherein, the CPU mainly processes an operating system, a user interface, an application program and the like; the GPU is used for rendering and drawing display content; the modem is used to handle wireless communications. It is understood that the modem may not be integrated into the processor 104, but may be implemented by a communication chip.
The Memory 102 may include a Random Access Memory (RAM) or a Read-Only Memory (Read-Only Memory). The memory 102 may be used to store instructions, programs, code sets, or instruction sets. The memory 102 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the foregoing embodiments, and the like. The data storage area may also store data created by the electronic device 100 during use (e.g., phone book, audio-video data, chat log data), and the like.
Referring to fig. 13, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable storage medium 300 has stored therein program code that can be called by a processor to execute the method described in the above-described method embodiments.
The computer-readable storage medium 300 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Alternatively, the computer-readable storage medium 300 includes a non-transitory computer-readable storage medium. The computer readable storage medium 300 has storage space for program code 310 for performing any of the method steps of the method described above. The program code can be read from or written to one or more computer program products. The program code 310 may be compressed, for example, in a suitable form.
To sum up, according to the action counting method, the action counting device, the electronic device, and the storage medium provided by the embodiments of the present application, the video to be processed is obtained, then the bone key points of the target object in the video to be processed are obtained, then the target key point features are obtained based on the bone key points, then the similarity matrix corresponding to the video to be processed is calculated based on the target key point features, and then the action times of the target object in the video to be processed are obtained based on the similarity matrix. Therefore, by the method, under the condition that the bone key points of the target object in the video to be processed are obtained, the target key point characteristics can be obtained based on the bone key points, the similarity matrix corresponding to the video to be processed is calculated based on the target key point characteristics, and then the action times of the target object in the video to be processed are obtained based on the similarity matrix, so that the bone key points of the target object in the video image can be analyzed without analyzing the whole picture of the video image, the calculation complexity is reduced, and the accuracy of action counting is improved.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not necessarily depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A method of motion counting, the method comprising:
acquiring a video to be processed;
obtaining skeleton key points of a target object in the video to be processed;
acquiring target key point characteristics based on the skeleton key points;
calculating a similarity matrix corresponding to the video to be processed based on the target key point characteristics;
inputting the similarity matrix into an action counting neural network, and acquiring an action cycle output by the action counting neural network and corresponding to each frame of image in the video to be processed;
and acquiring the action times of the target object in the video to be processed based on the action period with the maximum period duration.
2. The method according to claim 1, wherein the obtaining of skeletal key points of a target object in the video to be processed comprises:
inputting the video to be processed into a target attitude estimation network, and acquiring a plurality of reference key points output by the target attitude estimation network;
bone keypoints are obtained from the plurality of reference keypoints.
3. The method of claim 2, wherein said obtaining skeletal keypoints from said plurality of reference keypoints comprises:
acquiring reference position associated parameters respectively corresponding to the plurality of reference key points, wherein the reference position associated parameters are used for representing whether a certain two key points in the plurality of key points are connected or not;
acquiring weight parameters of the plurality of key points under limb joint constraint;
obtaining a product of the reference position correlation parameter and the weight parameter;
and taking the key point corresponding to the product with the maximum value as the bone key point.
4. The method of claim 1, wherein said obtaining target keypoint features based on said bone keypoints comprises:
acquiring space related characteristics corresponding to the bone key points;
acquiring time-related features corresponding to the bone key points;
and splicing the space correlation characteristics and the time correlation characteristics to obtain target key point characteristics.
5. The method according to claim 1, wherein the video to be processed comprises a plurality of frames of images, and the calculating a similarity matrix corresponding to the video to be processed based on the target keypoint features comprises:
acquiring the distance between the key point characteristics of a plurality of any two frames of images in the multi-frame image;
and normalizing the plurality of distances to obtain a plurality of elements, and taking a matrix formed by combining the plurality of elements as a similarity matrix corresponding to the video to be processed.
6. The method of claim 5, wherein the normalizing the plurality of distances is:
$$M(i, j) = \frac{e^{M'(i, j)}}{\sum_{q=1}^{D} e^{M'(i, q)}}$$
the method comprises the steps that M (i, j) represents the similarity between the key point feature of the ith frame image of the video to be processed and the key point feature of the jth frame image of the video to be processed, i represents the ith frame of the video to be processed, j represents the jth frame of the video to be processed, q represents an auxiliary variable, D represents a positive integer, e represents a natural constant, M '(i, j) represents the distance between the key point feature of the ith frame image of the video to be processed and the key point feature of the jth frame image of the video to be processed, and M' (i, q) represents the distance between the key point feature of the ith frame image of the video to be processed and the key point feature of the qth frame image of the video to be processed.
7. The method according to claim 1, wherein the obtaining the number of times of actions of the target object in the video to be processed based on the action period with the maximum period duration comprises:
acquiring the action times of the target object in the video to be processed based on the action period with the maximum period duration according to an action time calculation rule, wherein the action time calculation rule comprises the following steps:
$$C_L = \sum_{i=1}^{L} \frac{1}{T_i}$$

wherein $C_L$ characterizes the number of actions of the target object up to the L-th frame of the video to be processed, $T_i$ characterizes the action period of the i-th frame of the video to be processed, and $\sum_{i=1}^{L} \frac{1}{T_i}$ characterizes the sum of the action counts of the target object from the 1st frame to the L-th frame of the video to be processed.
8. An action counting device, characterized in that the device comprises:
the video data acquisition module is used for acquiring a video to be processed;
the skeleton key point acquisition module is used for acquiring the skeleton key points of the target object in the video to be processed;
a key point feature obtaining module, configured to obtain a target key point feature based on the bone key point;
the calculation module is used for calculating a similarity matrix corresponding to the video to be processed based on the target key point characteristics;
the counting module is used for inputting the similarity matrix into an action counting neural network and acquiring an action cycle output by the action counting neural network and corresponding to each frame of image in the video to be processed; and acquiring the action times of the target object in the video to be processed based on the action period with the maximum period duration.
9. An electronic device comprising one or more processors and memory;
one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of any of claims 1-7.
10. A computer-readable storage medium, having program code stored therein, wherein the program code when executed by a processor performs the method of any of claims 1-7.
CN202011059856.8A 2020-09-30 2020-09-30 Action counting method and device, electronic equipment and storage medium Active CN112149602B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011059856.8A CN112149602B (en) 2020-09-30 2020-09-30 Action counting method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011059856.8A CN112149602B (en) 2020-09-30 2020-09-30 Action counting method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112149602A CN112149602A (en) 2020-12-29
CN112149602B true CN112149602B (en) 2022-03-25

Family

ID=73951262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011059856.8A Active CN112149602B (en) 2020-09-30 2020-09-30 Action counting method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112149602B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818801B (en) * 2021-01-26 2024-04-26 每步科技(上海)有限公司 Motion counting method, recognition device, recognition system and storage medium
CN114764946B (en) * 2021-09-18 2023-08-11 北京甲板智慧科技有限公司 Action counting method and system based on time sequence standardization and intelligent terminal

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020018469A1 (en) * 2018-07-16 2020-01-23 The Board Of Trustees Of The Leland Stanford Junior University System and method for automatic evaluation of gait using single or multi-camera recordings
CN111507301B (en) * 2020-04-26 2021-06-08 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN111275032B (en) * 2020-05-07 2020-09-15 西南交通大学 Deep squatting detection method, device, equipment and medium based on human body key points
CN111282248A (en) * 2020-05-12 2020-06-16 西南交通大学 Pull-up detection system and method based on skeleton and face key points
CN111368810B (en) * 2020-05-26 2020-08-25 西南交通大学 Sit-up detection system and method based on human body and skeleton key point identification

Also Published As

Publication number Publication date
CN112149602A (en) 2020-12-29


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210114

Address after: 511442 3108, 79 Wanbo 2nd Road, Nancun Town, Panyu District, Guangzhou City, Guangdong Province

Applicant after: GUANGZHOU CUBESILI INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 511400 24th floor, building B-1, North District, Wanda Commercial Plaza, Wanbo business district, No.79 Wanbo 2nd Road, Nancun Town, Panyu District, Guangzhou, Guangdong Province

Applicant before: GUANGZHOU HUADUO NETWORK TECHNOLOGY Co.,Ltd.

EE01 Entry into force of recordation of patent licensing contract
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20201229

Assignee: GUANGZHOU HUADUO NETWORK TECHNOLOGY Co.,Ltd.

Assignor: GUANGZHOU CUBESILI INFORMATION TECHNOLOGY Co.,Ltd.

Contract record no.: X2021440000054

Denomination of invention: Action counting method, device, electronic equipment and storage medium

License type: Common License

Record date: 20210208

GR01 Patent grant
GR01 Patent grant