CN115100745A - Swin Transformer model-based motion real-time counting method and system - Google Patents

Swin Transformer model-based motion real-time counting method and system

Info

Publication number
CN115100745A
CN115100745A
Authority
CN
China
Prior art keywords
motion
swin
layer
matrix
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210784218.5A
Other languages
Chinese (zh)
Other versions
CN115100745B (en)
Inventor
Li Changlin (李长霖)
Li Haiyang (李海洋)
Hou Yongdi (侯永弟)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Deck Intelligent Technology Co ltd
Original Assignee
Beijing Deck Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Deck Intelligent Technology Co ltd filed Critical Beijing Deck Intelligent Technology Co ltd
Priority to CN202210784218.5A priority Critical patent/CN115100745B/en
Publication of CN115100745A publication Critical patent/CN115100745A/en
Application granted granted Critical
Publication of CN115100745B publication Critical patent/CN115100745B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the invention discloses a Swin Transformer model-based method and system for counting motions in real time. The method comprises: acquiring a motion video within a target time period, determining the target exerciser in the motion video, and calculating a motion pose vector of the target exerciser for each frame image of the motion video; arranging the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix; and analyzing the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action. The Swin Transformer model is trained on a training data set composed of motion pose matrix samples; the motion pose matrix samples are computed from video data samples of various types of motion, and each video data sample contains exactly one complete action of one target motion. This solves the technical problem of poor action recognition and counting accuracy.

Description

Swin Transformer model-based motion real-time counting method and system
Technical Field
The invention relates to the technical field of motion monitoring, in particular to a method and a system for counting motions in real time based on a Swin Transformer model.
Background
With the rise of emerging forms of exercise such as smart fitness, cloud-based competitions, and virtual sports, AI fitness has become widely popular, and to ensure the effectiveness of remote workouts, a motion counting module is increasingly embedded in AI fitness software. In the prior art, motion counting captures human body poses with a camera and then recognizes and counts actions with an AI recognition algorithm. However, for movements performed faster or slower than usual, the existing methods recognize and count actions with poor accuracy.
Disclosure of Invention
Therefore, an embodiment of the invention provides a Swin Transformer model-based method and system for counting motions in real time, so as to at least partially solve the technical problem of poor action recognition and counting accuracy in the prior art.
In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:
A Swin Transformer model-based motion real-time counting method, comprising:
acquiring human motion video data in real time through a camera device;
detecting the exerciser located at the center of the video image through a human body detection algorithm, taking that exerciser as the target exerciser, and calculating a motion pose vector of the target exerciser for each frame image of the motion video;
arranging the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix;
analyzing the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action;
wherein the Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, and each video data sample contains exactly one complete action of one target motion; and
the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer.
Further, calculating the motion pose vector of the target exerciser for each frame image of the motion video specifically comprises:
detecting the three-dimensional coordinates of the skeletal key points of the target exerciser in each frame of the motion video to obtain a pose graph of the target exerciser for each frame;
acquiring a plurality of target skeletal key points based on the pose graph, and taking any three target skeletal key points as a skeletal key point sequence to obtain a plurality of skeletal key point sequences;
and calculating the angle of each skeletal key point sequence to obtain the sequence angles, all of which together form the motion pose vector.
Further, calculating the angle of each skeletal key point sequence to obtain the sequence angles, and forming the motion pose vector from all sequence angles, specifically comprises:
letting a skeletal key point n be described by its three-dimensional coordinates (x_n, y_n, z_n), and supposing a skeletal key point sequence [w, p, q] of three key points with coordinates (x_w, y_w, z_w), (x_p, y_p, z_p), and (x_q, y_q, z_q), where points w and p form a line segment l_1 and points q and p form a line segment l_2;
calculating the angle between l_1 and l_2, which is the sequence angle formed by the three skeletal key points w, p, and q;
calculating the sequence angles of the remaining skeletal key point sequences to obtain all sequence angles;
the values of all sequence angles constituting the motion pose vector [θ_1, θ_2, …, θ_n].
Further, analyzing the motion pose matrix based on the pre-trained Swin Transformer model to obtain the counting result for the target action specifically comprises:
inputting the motion pose matrix into the pre-trained Swin Transformer model, and calculating the output probability of the motion pose matrix with respect to any target action;
if the output probability is greater than or equal to a preset threshold, incrementing the count of the target action by 1 and sliding a window w forward by p frames;
wherein p is the length of the window w, the value of p lies in the range [l, r], l denotes the minimum number of video frames of the target action in the training data set, and r denotes the maximum number of video frames of the target action in the training data set.
Further, after inputting the motion pose matrix into the pre-trained Swin Transformer model and calculating the output probability of the motion pose matrix with respect to any target action, the method further comprises:
if the output probability is less than the preset threshold, sliding the window w forward by 1 frame.
Further,
the matrix partitioning layer divides an input motion pose matrix of dimension m × t × 1 into a q × c matrix through a convolution function, wherein t denotes the number of frames of the video corresponding to the motion pose matrix and q denotes the patch size;
the Embedding layer is used for converting the dimension of the q × c matrix into a dimension acceptable to the Swin Transformer module;
the input of the Swin Transformer module is the matrix processed by the Embedding layer, and the Swin Transformer module applies a sliding-window-based self-attention mechanism;
the patch merging layer is used for compressing the dimensionality of the matrix output by the Swin Transformer module;
the global pooling layer is used for reducing the dimensionality of the matrix output by the Swin Transformer module by computing an average;
the input of the multi-layer perceptron layer is the matrix processed by the global pooling layer; the multi-layer perceptron layer consists of m fully connected linear layers, and the output dimensionality of the final fully connected layer is the number of action categories;
the input of the Softmax layer is the output of the multi-layer perceptron layer, and the probability of each action category is calculated through the Softmax layer.
The invention also provides a Swin Transformer model-based motion real-time counting system, comprising:
a data acquisition unit, configured to acquire human motion video data in real time through a camera device;
a pose vector calculation unit, configured to detect the exerciser located at the center of the video image through a human body detection algorithm, take that exerciser as the target exerciser, and calculate a motion pose vector of the target exerciser for each frame image of the motion video;
a pose matrix generation unit, configured to arrange the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix;
a counting result output unit, configured to analyze the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action;
wherein the Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, each video data sample contains exactly one complete action of one target motion, and the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer.
The invention also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method described above when executing the program.
The invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the method described above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method described above.
In the Swin Transformer model-based motion real-time counting method provided by the invention, human motion video data are collected in real time through a camera device; the exerciser located at the center of the video image is detected through a human body detection algorithm, that exerciser is taken as the target exerciser, and a motion pose vector of the target exerciser is calculated for each frame image of the motion video; the motion pose vectors obtained from the frame images are then arranged in time order to obtain a motion pose matrix; and the motion pose matrix is analyzed based on a pre-trained Swin Transformer model to obtain a counting result for the target action. The Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, each video data sample contains exactly one complete action of one target motion, and the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer. The motion real-time counting method thus takes a sequence of video frames as input and counts various sports actions by analyzing the motion in real time with a pre-trained Swin Transformer model; it can be conveniently applied to a variety of sports, achieves better accuracy in action recognition and counting, and solves the technical problem of poor action recognition and counting accuracy in the prior art.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below are merely exemplary, and those of ordinary skill in the art can derive other drawings from them without inventive effort.
The structures, ratios, sizes, and the like shown in this specification are used only to complement the content disclosed in the specification, so that it can be understood and read by those skilled in the art; they are not intended to limit the conditions under which the invention can be implemented and thus carry no substantive technical significance. Any structural modification, change of ratio, or adjustment of size that does not affect the effects or objectives achievable by the invention shall still fall within the scope covered by the technical content disclosed herein.
Fig. 1 is a flowchart of an embodiment of a Swin Transformer model-based motion real-time counting method according to the present invention;
fig. 2 is a second flowchart of an embodiment of a Swin Transformer model-based real-time motion counting method according to the present invention;
fig. 3 is a third flowchart of an embodiment of a Swin Transformer model-based real-time motion counting method according to the present invention;
FIG. 4 is a flow chart of one embodiment of the Swin Transformer model provided by the present invention;
FIG. 5 is a diagram of a Swin transform model according to the present invention;
FIG. 6 is a block diagram of an embodiment of a Swin Transformer model-based sports real-time counting system according to the present invention;
fig. 7 is a schematic physical structure diagram of an electronic device provided in the present invention.
Detailed Description
The present invention is described below by way of specific embodiments; other advantages and effects of the invention will be readily apparent to those skilled in the art from the disclosure herein. It is to be understood that the described embodiments are merely some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments given herein without inventive effort shall fall within the protection scope of the present invention.
For the same exercise, when different people move too fast or too slow, the counting performance of the algorithm is affected. To solve this problem, the invention provides a Swin Transformer model-based method for counting motions in real time, which uses a motion pose matrix arranged in time order together with a pre-trained Swin Transformer model to obtain a more accurate motion count within a target time period.
Referring to fig. 1, fig. 1 is a flowchart of a Swin Transformer model-based motion real-time counting method according to an embodiment of the present invention.
In a specific embodiment, the method for counting motions in real time based on the Swin Transformer model provided by the invention comprises the following steps:
s101: human motion video data are collected in real time through the camera equipment.
S102: and detecting a sporter positioned in the center of the video image through a human body detection algorithm, and calculating the motion attitude vector of the target sporter in each frame image of the motion video by taking the sporter as a target sporter. The motion video may include a plurality of frames of images, each frame of image may obtain one motion gesture vector, and the motion video may obtain a plurality of motion gesture vectors.
S103: and arranging the motion attitude vectors obtained from the frame images in a time sequence to obtain a motion attitude matrix. Taking a 1-minute motion video as an example, in the motion video, a plurality of motion attitude vectors are obtained, the motion attitude vectors respectively correspond to each frame image in the motion video, the frame images have a time sequence in the motion video, and the motion attitude vectors are arranged in the time sequence of each frame image in the motion video, so that a motion attitude matrix can be obtained.
S104: analyzing the motion attitude matrix based on a previously trained Swin Transformer model to obtain a counting result of the target action; the Swin Transformer model is obtained by training a training data set formed by motion posture matrix samples, the motion posture matrix samples are obtained by calculating video data samples of various types of motion, each video data sample only contains one complete action of one target motion sample, and the model structure of the Swin Transformer model comprises a matrix blocking layer, an Embedding layer, a Swin Transformer module, blocking combination, a global pooling layer, a multi-layer perceptron layer and a Softmax layer.
In some embodiments, as shown in fig. 2, calculating the motion pose vector of the target exerciser for each frame image of the motion video specifically comprises the following steps:
S201: the three-dimensional coordinates of the skeletal key points of the target exerciser are detected in each frame of the motion video to obtain a pose graph of the target exerciser for each frame. In a practical scenario, the recorded motion video generally consists of 2D video frames; after analysis with a 3D human skeletal key point detection algorithm, the three-dimensional coordinates of the human skeletal key points in each frame can be detected, and each frame becomes a pose graph formed by the 3D human skeletal key points.
S202: and acquiring a plurality of target bone key points based on the attitude map, and taking any three target bone key points as a bone key point sequence to obtain a plurality of bone key point sequences.
The kinematic posture of the human body can be described by the angle formed between the different skeletal joint points. A skeletal key point n may be represented by a three-dimensional coordinate (x) n ,y n ,z n ) To describe. Suppose [ w, p, q)]Three skeletal key point sequences, the coordinates of key points are: (x) w ,y w ,z w ),(x p ,y p ,z p ),(x q ,y q ,z q ) Wherein points w and p may form line segment l 1 Q and p may form a line segment l 2 。l 1 And l 2 The included angle between the two skeleton key points is the included angle formed by the three skeleton key points of w, p and q. In this embodiment, there are 18 skeletal key point sequences defined for describing the human motion pose: [ left ankle joint, left knee joint, left hip joint][ Right ankle joint, right knee joint, right hip joint ]][ left Knee joint, left hip joint, pelvis](Right Knee joint, Right hip joint, pelvis)]The left wrist, the left elbow joint and the left shoulder joint]The right wrist, the right elbow joint and the right shoulder joint]The right elbow joint, the right shoulder joint and the left shoulder joint]The left elbow joint, the left shoulder joint and the right shoulder joint][ head, neck, pelvis bone][ right wrist, crown of head, neck ]][ left wrist, crown of head, neck]The left elbow joint, the vertex and the neck]The right elbow joint, the vertex of the head and the neck]Head, left ear, neck]Head, right ear, neck][ left ear, neck, right shoulder joint ]]The right ear, neck and left shoulder joint](left hip joint, pelvis, right hip joint)]。
S203: and calculating included angles among all the skeleton key point sequences to obtain sequence included angles, and forming motion attitude vectors by all the sequence included angles.
Specifically, it is known to set a skeletal key point n by three-dimensional coordinates (x) n ,y n ,z n ) Description, suppose there is [ w, p, q ]]Three skeletal key point sequences, the coordinates of key points are: (x) w ,y w ,z w ),(x p ,y p ,z p ),(x q ,x q ,z q ) Wherein points w and p may form line segment l 1 Q and p may form a line segment l 2 (ii) a Calculating l 1 And l 2 The included angle between the two skeleton key points is a sequence included angle formed by the three skeleton key points of w, p and q; calculating sequence included angles of other skeleton key point sequences, and obtaining all sequence included angles; the values of all sequence angles constitute a motion attitude vector: [ theta ] of 12 ,…,θ n ]。
That is, the values of all the sequence angles can be constructedA vector can be used for describing the motion gesture, and is called a motion gesture vector: [ theta ] of 12 ,…,θ n ]. Each frame in the motion video corresponds to a motion attitude vector, and the motion attitude vectors of all frames in the video are arranged according to a time sequence to form a motion attitude matrix.
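A minimal sketch of this calculation, following the formulas above (the joint names and dictionary layout are illustrative, not mandated by the patent):

```python
import numpy as np

def sequence_angle(w, p, q):
    """Angle at vertex p between segments l1 = (w, p) and l2 = (q, p),
    where w, p, q are 3D key points given as (x, y, z) triples."""
    v1 = np.asarray(w, dtype=float) - np.asarray(p, dtype=float)  # along l1
    v2 = np.asarray(q, dtype=float) - np.asarray(p, dtype=float)  # along l2
    cos_theta = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def pose_vector(keypoints, sequences):
    """Motion pose vector [theta_1, ..., theta_n] for one frame.

    keypoints: dict mapping joint name -> (x, y, z);
    sequences: the defined skeletal key point sequences, e.g.
    ('left ankle joint', 'left knee joint', 'left hip joint').
    """
    return [sequence_angle(keypoints[w], keypoints[p], keypoints[q])
            for (w, p, q) in sequences]
```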
In some embodiments, as shown in fig. 3, for the user's online motion video data recorded in real time, the algorithm slides a window w from left to right and constructs the motion pose matrix corresponding to the video inside the window. Analyzing the motion pose matrix based on the pre-trained Swin Transformer model to obtain the counting result for the target action specifically comprises the following steps:
S301: the motion pose matrix is input into the pre-trained Swin Transformer model, and the output probability of the motion pose matrix with respect to any target action is calculated.
S302: if the output probability is greater than or equal to a preset threshold, the count of the target action is incremented by 1, and the window w slides forward by p frames.
S303: if the output probability is less than the preset threshold, the window w slides forward by 1 frame.
Here p is the length of the window w; the value of p lies in the range [l, r], where l denotes the minimum number of video frames of the target action in the training data set and r denotes the maximum number of video frames of the target action in the training data set.
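A sketch of this windowed counting loop as read from steps S301 to S303 (model_probability stands in for the Swin Transformer inference of S301, and the threshold value is illustrative):

```python
def count_actions(pose_matrix, model_probability, p, threshold=0.8):
    """Slide a window of length p over a t x m pose matrix and count actions.

    model_probability: callable returning the model's output probability
    that the windowed sub-matrix contains one complete target action.
    """
    count, start = 0, 0
    t = pose_matrix.shape[0]
    while start + p <= t:
        window = pose_matrix[start:start + p]
        if model_probability(window) >= threshold:
            count += 1    # S302: one complete target action recognized
            start += p    # slide the window forward by p frames
        else:
            start += 1    # S303: slide the window forward by 1 frame
    return count
```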
The offline training and online detection processes and the model structure of the Swin Transformer model are briefly introduced below; the Swin Transformer model obtained through training enables accurate action counting.
Specifically, in the offline training phase, video data of the different types of sports to be counted in real time are first collected, where each video clip contains exactly one action of one sport; for example, a push-up clip contains exactly one push-up. The sport category of each video is then labeled. Finally, the motion pose matrix corresponding to each video clip is calculated; all motion pose matrices form the training data, which are input into the model of fig. 3 for training, finally producing a trained model, as shown in fig. 4.
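For the offline phase, a compact supervised-training sketch under common assumptions (the batch size, optimizer, learning rate, and logits-producing model are choices of this sketch, not requirements stated in the patent):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train(model, pose_matrices, labels, epochs=10, lr=1e-4):
    """pose_matrices: float tensor of shape (N, 1, m, t), one motion pose
    matrix per labeled single-action clip; labels: int tensor of class ids.
    The model is assumed to output unnormalized class logits."""
    loader = DataLoader(TensorDataset(pose_matrices, labels),
                        batch_size=32, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    return model
```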
As shown in fig. 5, the model structure of the Swin Transformer model comprises seven parts: a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer. Specifically:
the matrix partitioning layer divides an input motion pose matrix of dimension m × t × 1 into a q × c matrix through a convolution function, wherein t denotes the number of frames of the video corresponding to the motion pose matrix and q denotes the patch size;
the Embedding layer corresponds to the Linear Embedding layer in the Swin Transformer algorithm and converts the dimension of the q × c matrix into a dimension acceptable to the Swin Transformer module;
the Swin Transformer module corresponds to the Swin Transformer block in the Swin Transformer algorithm. Its input is the matrix processed by the Embedding layer. The module applies a sliding-window-based self-attention mechanism in two steps: self-attention is first applied to the matrix within the initial window, and then to the matrix within the sliding window;
the patch merging layer corresponds to the Patch Merging module in the Swin Transformer algorithm and functions like a pooling layer in a convolutional neural network, compressing the dimensionality of the matrix output by the Swin Transformer module. To extract features at different scales through the Swin Transformer module, the patch merging layer and the Swin Transformer module may be stacked as a unit N times, as shown by the dashed box in fig. 5;
the global pooling layer corresponds to the global pooling layer in the Swin Transformer algorithm and reduces the dimensionality of the matrix output by the Swin Transformer module by computing an average;
the input of the multi-layer perceptron layer is the matrix processed by the global pooling layer; the multi-layer perceptron layer uses m fully connected linear layers, and the output dimensionality of the final fully connected layer is the number of action categories;
the input of the Softmax layer is the output of the multi-layer perceptron layer, from which the probability of each action category is finally calculated.
During online detection, for the user's online motion video data recorded in real time, the algorithm slides a window w from left to right, 1 frame at a time. The length p of w may take a value in the interval [l, r], where l denotes the minimum number of motion video frames in the training data and r denotes the maximum number of motion video frames in the training data. In this embodiment, the window length p is chosen as the average number of video frames for that type of motion in the training data. The motion pose matrix of the video segment inside the window w is then calculated. Finally, the motion pose matrix is input into the model of fig. 4, and the output probability of the video segment is calculated:
if the probability that the video segment belongs to a certain type of action is greater than or equal to the threshold, the count for that type of action is incremented by 1, and the window w slides forward by p frames;
if the probability that the video segment belongs to a certain type of action is less than the threshold, the window w slides forward by 1 frame.
In the above embodiment, the Swin Transformer model-based motion real-time counting method provided by the invention collects human motion video data in real time through a camera device; detects the exerciser located at the center of the video image through a human body detection algorithm, takes that exerciser as the target exerciser, and calculates a motion pose vector of the target exerciser for each frame image of the motion video; then arranges the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix; and analyzes the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action. The Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, each video data sample contains exactly one complete action of one target motion, and the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer. The method thus takes a sequence of video frames as input, counts various sports actions by analyzing the motion in real time with the pre-trained Swin Transformer model, can be conveniently applied to a variety of sports, achieves better accuracy in action recognition and counting, and solves the technical problem of poor action recognition and counting accuracy in the prior art.
In addition to the above method, the invention also provides a Swin Transformer model-based motion real-time counting system. As shown in fig. 6, the system comprises:
a data acquisition unit 601, configured to acquire human motion video data in real time through a camera device;
a pose vector calculation unit 602, configured to detect the exerciser located at the center of the video image through a human body detection algorithm, take that exerciser as the target exerciser, and calculate a motion pose vector of the target exerciser for each frame image of the motion video;
a pose matrix generation unit 603, configured to arrange the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix;
a counting result output unit 604, configured to analyze the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action;
wherein the Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, each video data sample contains exactly one complete action of one target motion, and the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer.
In the above embodiment, the Swin Transformer model-based motion real-time counting system provided by the invention detects the exerciser located at the center of the video image through a human body detection algorithm, takes that exerciser as the target exerciser, and calculates a motion pose vector of the target exerciser for each frame image of the motion video; then arranges the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix; and analyzes the motion pose matrix based on the pre-trained Swin Transformer model to obtain a counting result for the target action. The system thus takes a sequence of video frames as input, counts various sports actions by analyzing the motion in real time with the pre-trained Swin Transformer model, can be conveniently applied to a variety of sports, achieves better accuracy in action recognition and counting, and solves the technical problem of poor action recognition and counting accuracy in the prior art.
Fig. 7 illustrates the physical structure of an electronic device. As shown in fig. 7, the electronic device may comprise: a processor (processor) 710, a communications interface (Communications Interface) 720, a memory (memory) 730, and a communication bus 740, wherein the processor 710, the communications interface 720, and the memory 730 communicate with one another via the communication bus 740. The processor 710 may invoke logic instructions in the memory 730 to perform the Swin Transformer model-based motion real-time counting method, the method comprising: detecting the exerciser located at the center of the video image through a human body detection algorithm, taking that exerciser as the target exerciser, and calculating a motion pose vector of the target exerciser for each frame image of the motion video; arranging the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix; analyzing the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action; wherein the Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, each video data sample contains exactly one complete action of one target motion, and the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer.
In addition, the logic instructions in the memory 730 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The processor 710 in the electronic device provided in this embodiment of the application may invoke the logic instructions in the memory 730; its implementation is consistent with the implementation of the motion real-time counting method provided in this application, achieves the same beneficial effects, and is not described in detail here.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the motion real-time counting method provided by the above methods, the method comprising: detecting the exerciser located at the center of the video image through a human body detection algorithm, taking that exerciser as the target exerciser, and calculating a motion pose vector of the target exerciser for each frame image of the motion video; arranging the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix; analyzing the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action; wherein the Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, each video data sample contains exactly one complete action of one target motion, and the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer.
When executed, the computer program product provided in this embodiment of the application implements the motion real-time counting method described above; its specific implementation is consistent with the implementation described in the method embodiments, achieves the same beneficial effects, and is not described in detail here.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the motion real-time counting method provided above, the method comprising: detecting the exerciser located at the center of the video image through a human body detection algorithm, taking that exerciser as the target exerciser, and calculating a motion pose vector of the target exerciser for each frame image of the motion video; arranging the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix; analyzing the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action; wherein the Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, each video data sample contains exactly one complete action of one target motion, and the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer.
When executed by a processor, the computer program stored on the non-transitory computer-readable storage medium provided in this embodiment of the application implements the motion real-time counting method described above; its specific implementation is consistent with the implementation described in the method embodiments, achieves the same beneficial effects, and is not repeated here.
The apparatus embodiments described above are merely illustrative; units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which a person of ordinary skill in the art can understand and implement without inventive effort.
Those skilled in the art will appreciate that the functionality described in the present invention may, in the examples above, be implemented in hardware, software, or a combination of the two. When implemented in software, the corresponding functionality may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates the transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
The above embodiments are only intended to illustrate the embodiments of the present invention and are not to be construed as limiting its scope; any modification, equivalent substitution, improvement, or the like made on the basis of the embodiments of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A Swin Transformer model-based motion real-time counting method, characterized by comprising:
acquiring human motion video data in real time through a camera device;
detecting the exerciser located at the center of the video image through a human body detection algorithm, taking that exerciser as the target exerciser, and calculating a motion pose vector of the target exerciser for each frame image of the motion video;
arranging the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix;
analyzing the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action;
wherein the Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, and each video data sample contains exactly one complete action of one target motion; and
the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer.
2. The method according to claim 1, wherein calculating the motion pose vector of the target exerciser for each frame image of the motion video specifically comprises:
detecting the three-dimensional coordinates of the skeletal key points of the target exerciser in each frame of the motion video to obtain a pose graph of the target exerciser for each frame;
acquiring a plurality of target skeletal key points based on the pose graph, and taking any three target skeletal key points as a skeletal key point sequence to obtain a plurality of skeletal key point sequences;
and calculating the angle of each skeletal key point sequence to obtain the sequence angles, all of which together form the motion pose vector.
3. The method according to claim 2, wherein calculating the angle of each skeletal key point sequence to obtain the sequence angles, and forming the motion pose vector from all sequence angles, specifically comprises:
letting a skeletal key point n be described by its three-dimensional coordinates (x_n, y_n, z_n), and supposing a skeletal key point sequence [w, p, q] of three key points with coordinates (x_w, y_w, z_w), (x_p, y_p, z_p), and (x_q, y_q, z_q), where points w and p form a line segment l_1 and points q and p form a line segment l_2;
calculating the angle between l_1 and l_2, which is the sequence angle formed by the three skeletal key points w, p, and q;
calculating the sequence angles of the remaining skeletal key point sequences to obtain all sequence angles;
the values of all sequence angles constituting the motion pose vector [θ_1, θ_2, …, θ_n].
4. The method of claim 1, wherein analyzing the motion pose matrix based on the pre-trained Swin Transformer model to obtain the counting result for the target action specifically comprises:
inputting the motion pose matrix into the pre-trained Swin Transformer model, and calculating the output probability of the motion pose matrix with respect to any target action;
if the output probability is greater than or equal to a preset threshold, incrementing the count of the target action by 1 and sliding a window w forward by p frames;
wherein p is the length of the window w, the value of p lies in the range [l, r], l denotes the minimum number of video frames of the target action in the training data set, and r denotes the maximum number of video frames of the target action in the training data set.
5. The method of claim 4, wherein after inputting the motion pose matrix into the pre-trained Swin Transformer model and calculating the output probability of the motion pose matrix with respect to any target action, the method further comprises:
if the output probability is less than the preset threshold, sliding the window w forward by 1 frame.
6. The motion real-time counting method according to claim 1, wherein
the matrix partitioning layer divides an input motion pose matrix of dimension m × t × 1 into a q × c matrix through a convolution function, wherein t denotes the number of frames of the video corresponding to the motion pose matrix and q denotes the patch size;
the Embedding layer is used for converting the dimension of the q × c matrix into a dimension acceptable to the Swin Transformer module;
the input of the Swin Transformer module is the matrix processed by the Embedding layer, and the Swin Transformer module applies a sliding-window-based self-attention mechanism;
the patch merging layer is used for compressing the dimensionality of the matrix output by the Swin Transformer module;
the global pooling layer is used for reducing the dimensionality of the matrix output by the Swin Transformer module by computing an average;
the input of the multi-layer perceptron layer is the matrix processed by the global pooling layer; the multi-layer perceptron layer consists of m fully connected linear layers, and the output dimensionality of the final fully connected layer is the number of action categories;
the input of the Softmax layer is the output of the multi-layer perceptron layer, and the probability of each action category is calculated through the Softmax layer.
7. A Swin Transformer model-based motion real-time counting system, characterized by comprising:
a data acquisition unit, configured to acquire human motion video data in real time through a camera device;
a pose vector calculation unit, configured to detect the exerciser located at the center of the video image through a human body detection algorithm, take that exerciser as the target exerciser, and calculate a motion pose vector of the target exerciser for each frame image of the motion video;
a pose matrix generation unit, configured to arrange the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix;
a counting result output unit, configured to analyze the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action;
wherein the Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, and each video data sample contains exactly one complete action of one target motion; and
the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 6 are implemented when the processor executes the program.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, wherein the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 6.
CN202210784218.5A 2022-07-05 2022-07-05 Swin Transformer model-based motion real-time counting method and system Active CN115100745B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210784218.5A CN115100745B (en) 2022-07-05 Swin Transformer model-based motion real-time counting method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210784218.5A CN115100745B (en) 2022-07-05 Swin Transformer model-based motion real-time counting method and system

Publications (2)

Publication Number Publication Date
CN115100745A (en) 2022-09-23
CN115100745B (en) 2023-06-20

Family

ID=83296140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210784218.5A Active CN115100745B (en) 2022-07-05 Swin Transformer model-based motion real-time counting method and system

Country Status (1)

Country Link
CN (1) CN115100745B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163038A (en) * 2018-03-15 2019-08-23 南京硅基智能科技有限公司 A kind of human motion method of counting based on depth convolutional neural networks
US20220004744A1 (en) * 2018-11-27 2022-01-06 Bigo Technology Pte. Ltd. Human posture detection method and apparatus, device and storage medium
US20200211154A1 (en) * 2018-12-30 2020-07-02 Altumview Systems Inc. Method and system for privacy-preserving fall detection
CN110711374A (en) * 2019-10-15 2020-01-21 石家庄铁道大学 Multi-modal dance action evaluation method
WO2021243561A1 (en) * 2020-06-02 2021-12-09 中国科学院深圳先进技术研究院 Behaviour identification apparatus and method
CN112464808A (en) * 2020-11-26 2021-03-09 成都睿码科技有限责任公司 Rope skipping posture and number identification method based on computer vision
US20220203165A1 (en) * 2020-12-29 2022-06-30 NEX Team Inc. Video-based motion counting and analysis systems and methods for virtual fitness application
CN112966597A (en) * 2021-03-04 2021-06-15 山东云缦智能科技有限公司 Human motion action counting method based on skeleton key points
CN113392742A (en) * 2021-06-04 2021-09-14 北京格灵深瞳信息技术股份有限公司 Abnormal action determination method and device, electronic equipment and storage medium
CN113705540A (en) * 2021-10-09 2021-11-26 长三角信息智能创新研究院 Method and system for recognizing and counting non-instrument training actions
CN113920583A (en) * 2021-10-14 2022-01-11 根尖体育科技(北京)有限公司 Fine-grained behavior recognition model construction method and system
CN114581945A (en) * 2022-02-21 2022-06-03 中国科学院大学 Monocular three-dimensional human body posture estimation method and system integrating space-time characteristics

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
太阳花的小绿豆: "Swin-Transformer网络结构详解" [Detailed explanation of the Swin-Transformer network structure], pages 1-12, retrieved from the Internet <URL:https://blog.csdn.net/qq_37541097/article/details/121119988> *
ZENG Wenxian et al.: "Research on classifying soldier training actions based on skeletal key point detection", Journal of The Hebei Academy of Sciences, vol. 39, no. 1 *
LIANG Jingbo: "Research and application of action recognition methods based on skeletal key point information", China Masters' Theses Full-text Database (Basic Sciences) *
蓝翔技校的码农: "Swin Transformer 详解" [Swin Transformer explained in detail], pages 1-7, retrieved from the Internet <URL:https://blog.csdn.net/qq_43349542/article/details/118585880> *

Also Published As

Publication number Publication date
CN115100745B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN110570455B (en) Whole body three-dimensional posture tracking method for room VR
WO2021169839A1 (en) Action restoration method and device based on skeleton key points
CN107239728A (en) Unmanned plane interactive device and method based on deep learning Attitude estimation
CN111414797B (en) System and method for estimating pose and pose information of an object
CN111476097A (en) Human body posture assessment method and device, computer equipment and storage medium
WO2023071964A1 (en) Data processing method and apparatus, and electronic device and computer-readable storage medium
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
CN110633004B (en) Interaction method, device and system based on human body posture estimation
CN104821010A (en) Binocular-vision-based real-time extraction method and system for three-dimensional hand information
CN110751039A (en) Multi-view 3D human body posture estimation method and related device
CN112200157A (en) Human body 3D posture recognition method and system for reducing image background interference
WO2021098545A1 (en) Pose determination method, apparatus, and device, storage medium, chip and product
CN115205737B Motion real-time counting method and system based on Transformer model
CN112149531B (en) Human skeleton data modeling method in behavior recognition
CN114882493A (en) Three-dimensional hand posture estimation and recognition method based on image sequence
KR102333768B1 (en) Hand recognition augmented reality-intraction apparatus and method
CN115223240B (en) Motion real-time counting method and system based on dynamic time warping algorithm
CN115100745A (en) Swin transform model-based motion real-time counting method and system
CN113894779B (en) Multi-mode data processing method applied to robot interaction
CN115471863A (en) Three-dimensional posture acquisition method, model training method and related equipment
CN115205750B (en) Motion real-time counting method and system based on deep learning model
CN116612495B (en) Image processing method and device
WO2023185241A1 (en) Data processing method and apparatus, device and medium
CN117275089A (en) Character recognition method, device and equipment for monocular camera and storage medium
CN115620394A (en) Behavior identification method, system and device based on skeleton and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant