CN115100745A - Swin Transformer model-based motion real-time counting method and system - Google Patents

Swin Transformer model-based motion real-time counting method and system

Info

Publication number
CN115100745A
CN115100745A
Authority
CN
China
Prior art keywords
motion
swin
layer
matrix
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210784218.5A
Other languages
Chinese (zh)
Other versions
CN115100745B (en)
Inventor
Li Changlin (李长霖)
Li Haiyang (李海洋)
Hou Yongdi (侯永弟)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Deck Intelligent Technology Co ltd
Original Assignee
Beijing Deck Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Deck Intelligent Technology Co ltd filed Critical Beijing Deck Intelligent Technology Co ltd
Priority to CN202210784218.5A priority Critical patent/CN115100745B/en
Publication of CN115100745A publication Critical patent/CN115100745A/en
Application granted granted Critical
Publication of CN115100745B publication Critical patent/CN115100745B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the invention discloses a Swin Transformer model-based method and system for counting motions in real time. The method comprises: acquiring a motion video within a target time period, determining the target exerciser in the motion video, and calculating a motion pose vector of the target exerciser for each frame image of the motion video; arranging the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix; and analyzing the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action. The Swin Transformer model is trained on a training data set composed of motion pose matrix samples; the motion pose matrix samples are computed from video data samples of various types of motion, and each video data sample contains exactly one complete action of one target motion. This solves the technical problem of poor action recognition and counting accuracy.

Description

Swin Transformer model-based motion real-time counting method and system
Technical Field
The invention relates to the technical field of motion monitoring, in particular to a method and a system for counting motions in real time based on a Swin Transformer model.
Background
With the rise of emerging forms of exercise such as smart fitness, cloud-based competitions, and virtual sports, AI fitness has become widely popular, and to ensure the effectiveness of remote workouts, a motion counting module is increasingly embedded in AI fitness software. In the prior art, motion counting captures human body poses with a camera and then recognizes and counts actions with an AI recognition algorithm. However, for movements performed faster or slower than usual, the existing methods recognize and count actions with poor accuracy.
Disclosure of Invention
Therefore, an embodiment of the invention provides a Swin Transformer model-based method and system for counting motions in real time, so as to at least partially solve the technical problem of poor action recognition and counting accuracy in the prior art.
In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:
A Swin Transformer model-based motion real-time counting method, comprising:
acquiring human motion video data in real time through a camera device;
detecting the exerciser located at the center of the video image through a human body detection algorithm, taking that exerciser as the target exerciser, and calculating a motion pose vector of the target exerciser for each frame image of the motion video;
arranging the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix;
analyzing the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action;
wherein the Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, and each video data sample contains exactly one complete action of one target motion; and
the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer.
Further, calculating the motion pose vector of the target exerciser for each frame image of the motion video specifically comprises:
detecting the three-dimensional coordinates of the skeletal key points of the target exerciser in each frame of the motion video to obtain a pose graph of the target exerciser for each frame;
acquiring a plurality of target skeletal key points based on the pose graph, and taking any three target skeletal key points as a skeletal key point sequence to obtain a plurality of skeletal key point sequences;
and calculating the angle of each skeletal key point sequence to obtain the sequence angles, all of which together form the motion pose vector.
Further, calculating the angle of each skeletal key point sequence to obtain the sequence angles, and forming the motion pose vector from all sequence angles, specifically comprises:
letting a skeletal key point n be described by its three-dimensional coordinates (x_n, y_n, z_n), and supposing a skeletal key point sequence [w, p, q] of three key points with coordinates (x_w, y_w, z_w), (x_p, y_p, z_p), and (x_q, y_q, z_q), where points w and p form a line segment l_1 and points q and p form a line segment l_2;
calculating the angle between l_1 and l_2, which is the sequence angle formed by the three skeletal key points w, p, and q;
calculating the sequence angles of the remaining skeletal key point sequences to obtain all sequence angles;
the values of all sequence angles constituting the motion pose vector [θ_1, θ_2, …, θ_n].
Further, analyzing the motion pose matrix based on the pre-trained Swin Transformer model to obtain the counting result for the target action specifically comprises:
inputting the motion pose matrix into the pre-trained Swin Transformer model, and calculating the output probability of the motion pose matrix with respect to any target action;
if the output probability is greater than or equal to a preset threshold, incrementing the count of the target action by 1 and sliding a window w forward by p frames;
wherein p is the length of the window w, the value of p lies in the range [l, r], l denotes the minimum number of video frames of the target action in the training data set, and r denotes the maximum number of video frames of the target action in the training data set.
Further, after inputting the motion pose matrix into the pre-trained Swin Transformer model and calculating the output probability of the motion pose matrix with respect to any target action, the method further comprises:
if the output probability is less than the preset threshold, sliding the window w forward by 1 frame.
Further,
the matrix partitioning layer divides an input motion pose matrix of dimension m × t × 1 into a q × c matrix through a convolution function, wherein t denotes the number of frames of the video corresponding to the motion pose matrix and q denotes the patch size;
the Embedding layer is used for converting the dimension of the q × c matrix into a dimension acceptable to the Swin Transformer module;
the input of the Swin Transformer module is the matrix processed by the Embedding layer, and the Swin Transformer module applies a sliding-window-based self-attention mechanism;
the patch merging layer is used for compressing the dimensionality of the matrix output by the Swin Transformer module;
the global pooling layer is used for reducing the dimensionality of the matrix output by the Swin Transformer module by computing an average;
the input of the multi-layer perceptron layer is the matrix processed by the global pooling layer; the multi-layer perceptron layer consists of m fully connected linear layers, and the output dimensionality of the final fully connected layer is the number of action categories;
the input of the Softmax layer is the output of the multi-layer perceptron layer, and the probability of each action category is calculated through the Softmax layer.
The invention also provides a Swin Transformer model-based motion real-time counting system, comprising:
a data acquisition unit, configured to acquire human motion video data in real time through a camera device;
a pose vector calculation unit, configured to detect the exerciser located at the center of the video image through a human body detection algorithm, take that exerciser as the target exerciser, and calculate a motion pose vector of the target exerciser for each frame image of the motion video;
a pose matrix generation unit, configured to arrange the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix;
a counting result output unit, configured to analyze the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action;
wherein the Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, each video data sample contains exactly one complete action of one target motion, and the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer.
The invention also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method described above when executing the program.
The invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the method described above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method described above.
In the Swin Transformer model-based motion real-time counting method provided by the invention, human motion video data are collected in real time through a camera device; the exerciser located at the center of the video image is detected through a human body detection algorithm, that exerciser is taken as the target exerciser, and a motion pose vector of the target exerciser is calculated for each frame image of the motion video; the motion pose vectors obtained from the frame images are then arranged in time order to obtain a motion pose matrix; and the motion pose matrix is analyzed based on a pre-trained Swin Transformer model to obtain a counting result for the target action. The Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, each video data sample contains exactly one complete action of one target motion, and the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer. The motion real-time counting method thus takes a sequence of video frames as input and counts various sports actions by analyzing the motion in real time with a pre-trained Swin Transformer model; it can be conveniently applied to a variety of sports, achieves better accuracy in action recognition and counting, and solves the technical problem of poor action recognition and counting accuracy in the prior art.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below are merely exemplary, and those of ordinary skill in the art can derive other drawings from them without inventive effort.
The structures, ratios, sizes, and the like shown in this specification are used only to complement the content disclosed in the specification, so that it can be understood and read by those skilled in the art; they are not intended to limit the conditions under which the invention can be implemented and thus carry no substantive technical significance. Any structural modification, change of ratio, or adjustment of size that does not affect the effects or objectives achievable by the invention shall still fall within the scope covered by the technical content disclosed herein.
Fig. 1 is a flowchart of an embodiment of a Swin Transformer model-based motion real-time counting method according to the present invention;
fig. 2 is a second flowchart of an embodiment of a Swin Transformer model-based real-time motion counting method according to the present invention;
fig. 3 is a third flowchart of an embodiment of a Swin Transformer model-based real-time motion counting method according to the present invention;
FIG. 4 is a flow chart of one embodiment of the Swin Transformer model provided by the present invention;
FIG. 5 is a diagram of a Swin transform model according to the present invention;
FIG. 6 is a block diagram of an embodiment of a Swin Transformer model-based sports real-time counting system according to the present invention;
fig. 7 is a schematic physical structure diagram of an electronic device provided in the present invention.
Detailed Description
The present invention is described below by way of specific embodiments; other advantages and effects of the invention will be readily apparent to those skilled in the art from the disclosure herein. It is to be understood that the described embodiments are merely some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments given herein without inventive effort shall fall within the protection scope of the present invention.
For the same exercise, when different people move too fast or too slow, the counting performance of the algorithm is affected. To solve this problem, the invention provides a Swin Transformer model-based method for counting motions in real time, which uses a motion pose matrix arranged in time order together with a pre-trained Swin Transformer model to obtain a more accurate motion count within a target time period.
Referring to fig. 1, fig. 1 is a flowchart of a Swin Transformer model-based motion real-time counting method according to an embodiment of the present invention.
In a specific embodiment, the method for counting motions in real time based on the Swin Transformer model provided by the invention comprises the following steps:
s101: human motion video data are collected in real time through the camera equipment.
S102: and detecting a sporter positioned in the center of the video image through a human body detection algorithm, and calculating the motion attitude vector of the target sporter in each frame image of the motion video by taking the sporter as a target sporter. The motion video may include a plurality of frames of images, each frame of image may obtain one motion gesture vector, and the motion video may obtain a plurality of motion gesture vectors.
S103: and arranging the motion attitude vectors obtained from the frame images in a time sequence to obtain a motion attitude matrix. Taking a 1-minute motion video as an example, in the motion video, a plurality of motion attitude vectors are obtained, the motion attitude vectors respectively correspond to each frame image in the motion video, the frame images have a time sequence in the motion video, and the motion attitude vectors are arranged in the time sequence of each frame image in the motion video, so that a motion attitude matrix can be obtained.
S104: analyzing the motion attitude matrix based on a previously trained Swin Transformer model to obtain a counting result of the target action; the Swin Transformer model is obtained by training a training data set formed by motion posture matrix samples, the motion posture matrix samples are obtained by calculating video data samples of various types of motion, each video data sample only contains one complete action of one target motion sample, and the model structure of the Swin Transformer model comprises a matrix blocking layer, an Embedding layer, a Swin Transformer module, blocking combination, a global pooling layer, a multi-layer perceptron layer and a Softmax layer.
In some embodiments, as shown in fig. 2, calculating the motion pose vector of the target exerciser for each frame image of the motion video specifically comprises the following steps:
S201: the three-dimensional coordinates of the skeletal key points of the target exerciser are detected in each frame of the motion video to obtain a pose graph of the target exerciser for each frame. In a practical scenario, the recorded motion video generally consists of 2D video frames; after analysis with a 3D human skeletal key point detection algorithm, the three-dimensional coordinates of the human skeletal key points in each frame can be detected, and each frame becomes a pose graph formed by the 3D human skeletal key points.
S202: and acquiring a plurality of target bone key points based on the attitude map, and taking any three target bone key points as a bone key point sequence to obtain a plurality of bone key point sequences.
The kinematic posture of the human body can be described by the angle formed between the different skeletal joint points. A skeletal key point n may be represented by a three-dimensional coordinate (x) n ,y n ,z n ) To describe. Suppose [ w, p, q)]Three skeletal key point sequences, the coordinates of key points are: (x) w ,y w ,z w ),(x p ,y p ,z p ),(x q ,y q ,z q ) Wherein points w and p may form line segment l 1 Q and p may form a line segment l 2 。l 1 And l 2 The included angle between the two skeleton key points is the included angle formed by the three skeleton key points of w, p and q. In this embodiment, there are 18 skeletal key point sequences defined for describing the human motion pose: [ left ankle joint, left knee joint, left hip joint][ Right ankle joint, right knee joint, right hip joint ]][ left Knee joint, left hip joint, pelvis](Right Knee joint, Right hip joint, pelvis)]The left wrist, the left elbow joint and the left shoulder joint]The right wrist, the right elbow joint and the right shoulder joint]The right elbow joint, the right shoulder joint and the left shoulder joint]The left elbow joint, the left shoulder joint and the right shoulder joint][ head, neck, pelvis bone][ right wrist, crown of head, neck ]][ left wrist, crown of head, neck]The left elbow joint, the vertex and the neck]The right elbow joint, the vertex of the head and the neck]Head, left ear, neck]Head, right ear, neck][ left ear, neck, right shoulder joint ]]The right ear, neck and left shoulder joint](left hip joint, pelvis, right hip joint)]。
S203: and calculating included angles among all the skeleton key point sequences to obtain sequence included angles, and forming motion attitude vectors by all the sequence included angles.
Specifically, it is known to set a skeletal key point n by three-dimensional coordinates (x) n ,y n ,z n ) Description, suppose there is [ w, p, q ]]Three skeletal key point sequences, the coordinates of key points are: (x) w ,y w ,z w ),(x p ,y p ,z p ),(x q ,x q ,z q ) Wherein points w and p may form line segment l 1 Q and p may form a line segment l 2 (ii) a Calculating l 1 And l 2 The included angle between the two skeleton key points is a sequence included angle formed by the three skeleton key points of w, p and q; calculating sequence included angles of other skeleton key point sequences, and obtaining all sequence included angles; the values of all sequence angles constitute a motion attitude vector: [ theta ] of 12 ,…,θ n ]。
That is, the values of all the sequence angles can be constructedA vector can be used for describing the motion gesture, and is called a motion gesture vector: [ theta ] of 12 ,…,θ n ]. Each frame in the motion video corresponds to a motion attitude vector, and the motion attitude vectors of all frames in the video are arranged according to a time sequence to form a motion attitude matrix.
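A minimal sketch of this calculation, following the formulas above (the joint names and dictionary layout are illustrative, not mandated by the patent):

```python
import numpy as np

def sequence_angle(w, p, q):
    """Angle at vertex p between segments l1 = (w, p) and l2 = (q, p),
    where w, p, q are 3D key points given as (x, y, z) triples."""
    v1 = np.asarray(w, dtype=float) - np.asarray(p, dtype=float)  # along l1
    v2 = np.asarray(q, dtype=float) - np.asarray(p, dtype=float)  # along l2
    cos_theta = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def pose_vector(keypoints, sequences):
    """Motion pose vector [theta_1, ..., theta_n] for one frame.

    keypoints: dict mapping joint name -> (x, y, z);
    sequences: the defined skeletal key point sequences, e.g.
    ('left ankle joint', 'left knee joint', 'left hip joint').
    """
    return [sequence_angle(keypoints[w], keypoints[p], keypoints[q])
            for (w, p, q) in sequences]
```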
In some embodiments, as shown in fig. 3, for the user's online motion video data recorded in real time, the algorithm slides a window w from left to right and constructs the motion pose matrix corresponding to the video inside the window. Analyzing the motion pose matrix based on the pre-trained Swin Transformer model to obtain the counting result for the target action specifically comprises the following steps:
S301: the motion pose matrix is input into the pre-trained Swin Transformer model, and the output probability of the motion pose matrix with respect to any target action is calculated.
S302: if the output probability is greater than or equal to a preset threshold, the count of the target action is incremented by 1, and the window w slides forward by p frames.
S303: if the output probability is less than the preset threshold, the window w slides forward by 1 frame.
Here p is the length of the window w; the value of p lies in the range [l, r], where l denotes the minimum number of video frames of the target action in the training data set and r denotes the maximum number of video frames of the target action in the training data set.
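A sketch of this windowed counting loop as read from steps S301 to S303 (model_probability stands in for the Swin Transformer inference of S301, and the threshold value is illustrative):

```python
def count_actions(pose_matrix, model_probability, p, threshold=0.8):
    """Slide a window of length p over a t x m pose matrix and count actions.

    model_probability: callable returning the model's output probability
    that the windowed sub-matrix contains one complete target action.
    """
    count, start = 0, 0
    t = pose_matrix.shape[0]
    while start + p <= t:
        window = pose_matrix[start:start + p]
        if model_probability(window) >= threshold:
            count += 1    # S302: one complete target action recognized
            start += p    # slide the window forward by p frames
        else:
            start += 1    # S303: slide the window forward by 1 frame
    return count
```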
The offline training and online detection processes and the model structure of the Swin Transformer model are briefly introduced below; the Swin Transformer model obtained through training enables accurate action counting.
Specifically, in the offline training phase, video data of the different types of sports to be counted in real time are first collected, where each video clip contains exactly one action of one sport; for example, a push-up clip contains exactly one push-up. The sport category of each video is then labeled. Finally, the motion pose matrix corresponding to each video clip is calculated; all motion pose matrices form the training data, which are input into the model of fig. 3 for training, finally producing a trained model, as shown in fig. 4.
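For the offline phase, a compact supervised-training sketch under common assumptions (the batch size, optimizer, learning rate, and logits-producing model are choices of this sketch, not requirements stated in the patent):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train(model, pose_matrices, labels, epochs=10, lr=1e-4):
    """pose_matrices: float tensor of shape (N, 1, m, t), one motion pose
    matrix per labeled single-action clip; labels: int tensor of class ids.
    The model is assumed to output unnormalized class logits."""
    loader = DataLoader(TensorDataset(pose_matrices, labels),
                        batch_size=32, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    return model
```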
As shown in fig. 5, the model structure of the Swin Transformer model comprises seven parts: a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer. Specifically:
the matrix partitioning layer divides an input motion pose matrix of dimension m × t × 1 into a q × c matrix through a convolution function, wherein t denotes the number of frames of the video corresponding to the motion pose matrix and q denotes the patch size;
the Embedding layer corresponds to the Linear Embedding layer in the Swin Transformer algorithm and converts the dimension of the q × c matrix into a dimension acceptable to the Swin Transformer module;
the Swin Transformer module corresponds to the Swin Transformer block in the Swin Transformer algorithm. Its input is the matrix processed by the Embedding layer. The module applies a sliding-window-based self-attention mechanism in two steps: self-attention is first applied to the matrix within the initial window, and then to the matrix within the sliding window;
the patch merging layer corresponds to the Patch Merging module in the Swin Transformer algorithm and functions like a pooling layer in a convolutional neural network, compressing the dimensionality of the matrix output by the Swin Transformer module. To extract features at different scales through the Swin Transformer module, the patch merging layer and the Swin Transformer module may be stacked as a unit N times, as shown by the dashed box in fig. 5;
the global pooling layer corresponds to the global pooling layer in the Swin Transformer algorithm and reduces the dimensionality of the matrix output by the Swin Transformer module by computing an average;
the input of the multi-layer perceptron layer is the matrix processed by the global pooling layer; the multi-layer perceptron layer uses m fully connected linear layers, and the output dimensionality of the final fully connected layer is the number of action categories;
the input of the Softmax layer is the output of the multi-layer perceptron layer, from which the probability of each action category is finally calculated.
During online detection, for the user's online motion video data recorded in real time, the algorithm slides a window w from left to right, 1 frame at a time. The length p of w may take a value in the interval [l, r], where l denotes the minimum number of motion video frames in the training data and r denotes the maximum number of motion video frames in the training data. In this embodiment, the window length p is chosen as the average number of video frames for that type of motion in the training data. The motion pose matrix of the video segment inside the window w is then calculated. Finally, the motion pose matrix is input into the model of fig. 4, and the output probability of the video segment is calculated:
if the probability that the video segment belongs to a certain type of action is greater than or equal to the threshold, the count for that type of action is incremented by 1, and the window w slides forward by p frames;
if the probability that the video segment belongs to a certain type of action is less than the threshold, the window w slides forward by 1 frame.
In the above embodiment, the Swin Transformer model-based motion real-time counting method provided by the invention collects human motion video data in real time through a camera device; detects the exerciser located at the center of the video image through a human body detection algorithm, takes that exerciser as the target exerciser, and calculates a motion pose vector of the target exerciser for each frame image of the motion video; then arranges the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix; and analyzes the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action. The Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, each video data sample contains exactly one complete action of one target motion, and the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer. The method thus takes a sequence of video frames as input, counts various sports actions by analyzing the motion in real time with the pre-trained Swin Transformer model, can be conveniently applied to a variety of sports, achieves better accuracy in action recognition and counting, and solves the technical problem of poor action recognition and counting accuracy in the prior art.
In addition to the above method, the invention also provides a Swin Transformer model-based motion real-time counting system. As shown in fig. 6, the system comprises:
a data acquisition unit 601, configured to acquire human motion video data in real time through a camera device;
a pose vector calculation unit 602, configured to detect the exerciser located at the center of the video image through a human body detection algorithm, take that exerciser as the target exerciser, and calculate a motion pose vector of the target exerciser for each frame image of the motion video;
a pose matrix generation unit 603, configured to arrange the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix;
a counting result output unit 604, configured to analyze the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action;
wherein the Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, each video data sample contains exactly one complete action of one target motion, and the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer.
In the above embodiment, the Swin Transformer model-based motion real-time counting system provided by the invention detects the exerciser located at the center of the video image through a human body detection algorithm, takes that exerciser as the target exerciser, and calculates a motion pose vector of the target exerciser for each frame image of the motion video; then arranges the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix; and analyzes the motion pose matrix based on the pre-trained Swin Transformer model to obtain a counting result for the target action. The system thus takes a sequence of video frames as input, counts various sports actions by analyzing the motion in real time with the pre-trained Swin Transformer model, can be conveniently applied to a variety of sports, achieves better accuracy in action recognition and counting, and solves the technical problem of poor action recognition and counting accuracy in the prior art.
Fig. 7 illustrates the physical structure of an electronic device. As shown in fig. 7, the electronic device may comprise: a processor (processor) 710, a communications interface (Communications Interface) 720, a memory (memory) 730, and a communication bus 740, wherein the processor 710, the communications interface 720, and the memory 730 communicate with one another via the communication bus 740. The processor 710 may invoke logic instructions in the memory 730 to perform the Swin Transformer model-based motion real-time counting method, the method comprising: detecting the exerciser located at the center of the video image through a human body detection algorithm, taking that exerciser as the target exerciser, and calculating a motion pose vector of the target exerciser for each frame image of the motion video; arranging the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix; analyzing the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action; wherein the Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, each video data sample contains exactly one complete action of one target motion, and the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer.
In addition, the logic instructions in the memory 730 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The processor 710 in the electronic device provided in this embodiment of the application may invoke the logic instructions in the memory 730; its implementation is consistent with the implementation of the motion real-time counting method provided in this application, achieves the same beneficial effects, and is not described in detail here.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the motion real-time counting method provided by the above methods, the method comprising: detecting the exerciser located at the center of the video image through a human body detection algorithm, taking that exerciser as the target exerciser, and calculating a motion pose vector of the target exerciser for each frame image of the motion video; arranging the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix; analyzing the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action; wherein the Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, each video data sample contains exactly one complete action of one target motion, and the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer.
When executed, the computer program product provided in this embodiment of the application implements the motion real-time counting method described above; its specific implementation is consistent with the implementation described in the method embodiments, achieves the same beneficial effects, and is not described in detail here.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the motion real-time counting method provided above, the method comprising: detecting the exerciser located at the center of the video image through a human body detection algorithm, taking that exerciser as the target exerciser, and calculating a motion pose vector of the target exerciser for each frame image of the motion video; arranging the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix; analyzing the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action; wherein the Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, each video data sample contains exactly one complete action of one target motion, and the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer.
When executed by a processor, the computer program stored on the non-transitory computer-readable storage medium provided in this embodiment of the application implements the motion real-time counting method described above; its specific implementation is consistent with the implementation described in the method embodiments, achieves the same beneficial effects, and is not repeated here.
The apparatus embodiments described above are merely illustrative; units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which a person of ordinary skill in the art can understand and implement without inventive effort.
Those skilled in the art will appreciate that the functionality described in the present invention may, in the examples above, be implemented in hardware, software, or a combination of the two. When implemented in software, the corresponding functionality may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates the transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
The above embodiments are only intended to illustrate the embodiments of the present invention and are not to be construed as limiting its scope; any modification, equivalent substitution, improvement, or the like made on the basis of the embodiments of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A Swin Transformer model-based motion real-time counting method, characterized by comprising:
acquiring human motion video data in real time through a camera device;
detecting the exerciser located at the center of the video image through a human body detection algorithm, taking that exerciser as the target exerciser, and calculating a motion pose vector of the target exerciser for each frame image of the motion video;
arranging the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix;
analyzing the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action;
wherein the Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, and each video data sample contains exactly one complete action of one target motion; and
the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer.
2. The method according to claim 1, wherein calculating the motion pose vector of the target exerciser for each frame image of the motion video specifically comprises:
detecting the three-dimensional coordinates of the skeletal key points of the target exerciser in each frame of the motion video to obtain a pose graph of the target exerciser for each frame;
acquiring a plurality of target skeletal key points based on the pose graph, and taking any three target skeletal key points as a skeletal key point sequence to obtain a plurality of skeletal key point sequences;
and calculating the angle of each skeletal key point sequence to obtain the sequence angles, all of which together form the motion pose vector.
3. The method according to claim 2, wherein calculating the angle of each skeletal key point sequence to obtain the sequence angles, and forming the motion pose vector from all sequence angles, specifically comprises:
letting a skeletal key point n be described by its three-dimensional coordinates (x_n, y_n, z_n), and supposing a skeletal key point sequence [w, p, q] of three key points with coordinates (x_w, y_w, z_w), (x_p, y_p, z_p), and (x_q, y_q, z_q), where points w and p form a line segment l_1 and points q and p form a line segment l_2;
calculating the angle between l_1 and l_2, which is the sequence angle formed by the three skeletal key points w, p, and q;
calculating the sequence angles of the remaining skeletal key point sequences to obtain all sequence angles;
the values of all sequence angles constituting the motion pose vector [θ_1, θ_2, …, θ_n].
4. The method of claim 1, wherein analyzing the motion pose matrix based on the pre-trained Swin Transformer model to obtain the counting result for the target action specifically comprises:
inputting the motion pose matrix into the pre-trained Swin Transformer model, and calculating the output probability of the motion pose matrix with respect to any target action;
if the output probability is greater than or equal to a preset threshold, incrementing the count of the target action by 1 and sliding a window w forward by p frames;
wherein p is the length of the window w, the value of p lies in the range [l, r], l denotes the minimum number of video frames of the target action in the training data set, and r denotes the maximum number of video frames of the target action in the training data set.
5. The method of claim 4, wherein after inputting the motion pose matrix into the pre-trained Swin Transformer model and calculating the output probability of the motion pose matrix with respect to any target action, the method further comprises:
if the output probability is less than the preset threshold, sliding the window w forward by 1 frame.
6. The motion real-time counting method according to claim 1, wherein
the matrix partitioning layer divides an input motion pose matrix of dimension m × t × 1 into a q × c matrix through a convolution function, wherein t denotes the number of frames of the video corresponding to the motion pose matrix and q denotes the patch size;
the Embedding layer is used for converting the dimension of the q × c matrix into a dimension acceptable to the Swin Transformer module;
the input of the Swin Transformer module is the matrix processed by the Embedding layer, and the Swin Transformer module applies a sliding-window-based self-attention mechanism;
the patch merging layer is used for compressing the dimensionality of the matrix output by the Swin Transformer module;
the global pooling layer is used for reducing the dimensionality of the matrix output by the Swin Transformer module by computing an average;
the input of the multi-layer perceptron layer is the matrix processed by the global pooling layer; the multi-layer perceptron layer consists of m fully connected linear layers, and the output dimensionality of the final fully connected layer is the number of action categories;
the input of the Softmax layer is the output of the multi-layer perceptron layer, and the probability of each action category is calculated through the Softmax layer.
7. A Swin Transformer model-based motion real-time counting system, characterized by comprising:
a data acquisition unit, configured to acquire human motion video data in real time through a camera device;
a pose vector calculation unit, configured to detect the exerciser located at the center of the video image through a human body detection algorithm, take that exerciser as the target exerciser, and calculate a motion pose vector of the target exerciser for each frame image of the motion video;
a pose matrix generation unit, configured to arrange the motion pose vectors obtained from the frame images in time order to obtain a motion pose matrix;
a counting result output unit, configured to analyze the motion pose matrix based on a pre-trained Swin Transformer model to obtain a counting result for the target action;
wherein the Swin Transformer model is trained on a training data set composed of motion pose matrix samples, the motion pose matrix samples are computed from video data samples of various types of motion, and each video data sample contains exactly one complete action of one target motion; and
the model structure of the Swin Transformer model comprises a matrix partitioning layer, an Embedding layer, a Swin Transformer module, a patch merging layer, a global pooling layer, a multi-layer perceptron layer, and a Softmax layer.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 6 are implemented when the processor executes the program.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, wherein the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 6.
CN202210784218.5A 2022-07-05 2022-07-05 Swin Transformer model-based motion real-time counting method and system Active CN115100745B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210784218.5A CN115100745B (en) 2022-07-05 Swin Transformer model-based motion real-time counting method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210784218.5A CN115100745B (en) 2022-07-05 Swin Transformer model-based motion real-time counting method and system

Publications (2)

Publication Number Publication Date
CN115100745A (en) 2022-09-23
CN115100745B (en) 2023-06-20

Family

ID=83296140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210784218.5A Active CN115100745B (en) 2022-07-05 Swin Transformer model-based motion real-time counting method and system

Country Status (1)

Country Link
CN (1) CN115100745B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163038A (en) * 2018-03-15 2019-08-23 南京硅基智能科技有限公司 A kind of human motion method of counting based on depth convolutional neural networks
US20220004744A1 (en) * 2018-11-27 2022-01-06 Bigo Technology Pte. Ltd. Human posture detection method and apparatus, device and storage medium
US20200211154A1 (en) * 2018-12-30 2020-07-02 Altumview Systems Inc. Method and system for privacy-preserving fall detection
CN110711374A (en) * 2019-10-15 2020-01-21 石家庄铁道大学 Multi-modal dance action evaluation method
WO2021243561A1 (en) * 2020-06-02 2021-12-09 中国科学院深圳先进技术研究院 Behaviour identification apparatus and method
CN112464808A (en) * 2020-11-26 2021-03-09 成都睿码科技有限责任公司 Rope skipping posture and number identification method based on computer vision
US20220203165A1 (en) * 2020-12-29 2022-06-30 NEX Team Inc. Video-based motion counting and analysis systems and methods for virtual fitness application
CN112966597A (en) * 2021-03-04 2021-06-15 山东云缦智能科技有限公司 Human motion action counting method based on skeleton key points
CN113392742A (en) * 2021-06-04 2021-09-14 北京格灵深瞳信息技术股份有限公司 Abnormal action determination method and device, electronic equipment and storage medium
CN113705540A (en) * 2021-10-09 2021-11-26 长三角信息智能创新研究院 Method and system for recognizing and counting non-instrument training actions
CN113920583A (en) * 2021-10-14 2022-01-11 根尖体育科技(北京)有限公司 Fine-grained behavior recognition model construction method and system
CN114581945A (en) * 2022-02-21 2022-06-03 中国科学院大学 Monocular three-dimensional human body posture estimation method and system integrating space-time characteristics

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
太阳花的小绿豆: "Swin-Transformer网络结构详解" [Detailed explanation of the Swin-Transformer network structure], pages 1-12, retrieved from the Internet <URL:https://blog.csdn.net/qq_37541097/article/details/121119988> *
ZENG Wenxian et al.: "Research on classifying soldier training actions based on skeletal key point detection", Journal of The Hebei Academy of Sciences, vol. 39, no. 1 *
LIANG Jingbo: "Research and application of action recognition methods based on skeletal key point information", China Masters' Theses Full-text Database (Basic Sciences) *
蓝翔技校的码农: "Swin Transformer 详解" [Swin Transformer explained in detail], pages 1-7, retrieved from the Internet <URL:https://blog.csdn.net/qq_43349542/article/details/118585880> *

Also Published As

Publication number Publication date
CN115100745B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
CN110570455B (en) Whole body three-dimensional posture tracking method for room VR
WO2021169839A1 (en) Action restoration method and device based on skeleton key points
CN107239728A (en) Unmanned plane interactive device and method based on deep learning Attitude estimation
CN111414797B (en) System and method for estimating pose and pose information of an object
CN111476097A (en) Human body posture assessment method and device, computer equipment and storage medium
WO2023071964A1 (en) Data processing method and apparatus, and electronic device and computer-readable storage medium
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
CN110633004B (en) Interaction method, device and system based on human body posture estimation
CN104821010A (en) Binocular-vision-based real-time extraction method and system for three-dimensional hand information
CN110751039A (en) Multi-view 3D human body posture estimation method and related device
CN112200157A (en) Human body 3D posture recognition method and system for reducing image background interference
WO2021098545A1 (en) Pose determination method, apparatus, and device, storage medium, chip and product
CN115205737B Motion real-time counting method and system based on Transformer model
CN112149531B (en) Human skeleton data modeling method in behavior recognition
CN114882493A (en) Three-dimensional hand posture estimation and recognition method based on image sequence
KR102333768B1 (en) Hand recognition augmented reality-intraction apparatus and method
CN115223240B (en) Motion real-time counting method and system based on dynamic time warping algorithm
CN115100745A (en) Swin transform model-based motion real-time counting method and system
CN113894779B (en) Multi-mode data processing method applied to robot interaction
CN115471863A (en) Three-dimensional posture acquisition method, model training method and related equipment
CN115205750B (en) Motion real-time counting method and system based on deep learning model
CN116612495B (en) Image processing method and device
WO2023185241A1 (en) Data processing method and apparatus, device and medium
CN117275089A (en) Character recognition method, device and equipment for monocular camera and storage medium
CN115620394A (en) Behavior identification method, system and device based on skeleton and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant