CN115393963A - Motion action correcting method, system, storage medium, computer equipment and terminal - Google Patents

Motion action correcting method, system, storage medium, computer equipment and terminal

Info

Publication number
CN115393963A
CN115393963A
Authority
CN
China
Prior art keywords
action
image
training
motion
frame
Prior art date
Legal status
Pending
Application number
CN202211070820.9A
Other languages
Chinese (zh)
Inventor
贺王鹏
刘伟
周悦
胡德顺
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202211070820.9A
Publication of CN115393963A

Classifications

    • G06V 40/20 — Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G06N 3/02, G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/761 — Image or video pattern matching; proximity, similarity or dissimilarity measures
    • G06V 10/764 — Image or video recognition or understanding using classification, e.g. of video objects
    • G06V 10/82 — Image or video recognition or understanding using neural networks


Abstract

The invention belongs to the technical field of computer vision and discloses a motion action correction method, system, storage medium, computer equipment and terminal. The method comprises: collecting human motion posture recognition data; designing a motion action classification neural network model and training the model; and comparing single-frame image poses and comparing Dynamic Time Warping (DTW) distances of action time series. Human motion images are acquired with an ordinary USB camera, the human posture recognition neural network model is run on a notebook computer, and erroneous actions are compared and corrected through skeleton extraction and posture classification; the approach has high generality, a small amount of computation and high accuracy, and satisfies the needs of daily life well. The invention uses a dual comparison strategy of key-frame action comparison and time-series comparison: key actions drive part-level action correction, while complete action segments are compared as a whole. Such a correction strategy is more accurate and reasonable than that of a general system.

Description

Motion action correcting method, system, storage medium, computer equipment and terminal
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a motion correction method, a motion correction system, a storage medium, computer equipment and a terminal.
Background
With the vigorous development of the social economy, more and more people need professional exercise guidance, but personal coaching is expensive and of uneven quality, so a number of motion action correction systems have appeared on the market.
Conventional motion-assistance systems generally use one of two technologies. The first is inertial motion capture: the system consists of posture sensors, a signal receiver and a data processing system. The posture sensors are fixed on the main limb segments of the human body, and their signals are transmitted to the data processing system over wireless links such as Bluetooth for motion computation. Each posture sensor integrates elements such as an inertial sensor, a gravity sensor and an accelerometer to obtain the posture of its limb, and the spatial positions of the joint points are computed by combining the bone lengths with the skeletal hierarchy. The second is optical motion capture, which is based on computer vision: multiple high-speed cameras monitor and track target feature points from different angles, and a skeleton-solving algorithm completes the motion capture. In theory, for any point in space, as long as it is seen by two or more cameras at the same time, its 3D position at that moment can be determined. When the cameras shoot continuously at a high frame rate, the motion trajectory of the point can be obtained from the image sequence, together with other meaningful indices.
However, the optical motion capture system has many disadvantages, which make it hard for daily users to use and popularize and prevent convenient, rapid deployment: 1) Setting up multiple camera positions is difficult, with high cost and large space requirements. 2) The frame-synchronization technique is complicated, and the redundant hardware it requires further increases the system's complexity and maintenance cost. 3) The computer vision and machine learning equipment demands substantial computing power, which compromises the portability and real-time processing of the whole system. 4) Transferability is poor: current professional motion correction software is designed for specific sports, and porting it to new motions is difficult. Therefore, it is desirable to design a new motion action correction method and system.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) The existing optical motion capture system requires a difficult multi-camera setup, has high cost and a large space requirement, is difficult for daily users to use and popularize, and cannot be deployed conveniently and rapidly.
(2) The frame-synchronization technology of the existing optical motion capture system is complex, and the redundant hardware it requires further increases the system's complexity and maintenance cost.
(3) The existing computer vision and machine learning equipment has high computing-power requirements, so the portability and real-time processing of the whole system cannot be guaranteed.
(4) The existing optical motion capture system transfers poorly: professional motion correction software is currently designed for specific sports, and porting it to new motions is difficult.
The difficulty in solving the above problems and defects is as follows: the scheme described above is the mainstream scheme of existing human motion action correction systems, but its usage scenarios are concentrated in professional settings. Professional sports training can rely on professional guidance and well-provisioned hardware; guaranteeing the effectiveness and adaptability of the algorithm while reducing hardware and site costs, as the method and system of the invention do, is difficult.
The significance of solving these problems and defects is as follows: the development cost of the correction system is controlled, the constraints of sites and hardware equipment are reduced, the usage scenarios and services are rich, the system is simple to use, and the iteration cycle is short.
Disclosure of Invention
The invention provides a motion action correction method, system, storage medium, computer equipment and terminal, and in particular a motion action correction method, system, storage medium, computer equipment and terminal based on Cascade PoseNet human posture recognition.
The present invention is achieved as follows. A motion action correction method comprises the following steps:
Step one, collecting human motion posture recognition data in a targeted manner;
the purpose of this step is to image capture the motion of the person. The image characteristic variability of the sports motion is large, and the method is different from general human body posture image acquisition. So in addition to acquiring normal motion gestures, some special cases need to be considered, such as: key point occlusion, complex background, motion blur, poor lighting conditions, etc. The human body movement image scene in the data set collected in the step is more comprehensive, the characteristics are more diversified, the considered posture category is single, the characteristics of the data sets are different from those of the traditional human body posture estimation data set, and the model training of human body posture detection and identification in motion is facilitated.
Step two, designing a motion action classification neural network model and training the model;
The aim of this step is to design a lightweight human posture detection and recognition network with two functions: skeleton key-point position detection and human posture classification. The lightweight network is the core of the whole algorithm and ensures that the whole system runs in real time when deployed on embedded equipment. Skeleton key-point position detection provides the 2D spatial coordinates of the human skeleton key points; because the spatial coordinates of standard motions are consistent, this information can reflect whether an action is correct or wrong. Human posture classification determines the motion action category from the key-point coordinates and is used to match standard actions with the actions to be corrected.
Step three, comparing single-frame image poses and comparing DTW distances of action time series.
The person-posture comparison on single-frame images in this step refers to comparing, in a static single-frame image, the standard action of an athlete with the action to be corrected of an ordinary person. Its main function is to correct certain key actions in an ordinary person's motion by comparing the 2D positions of the relevant skeleton key points between the standard and erroneous poses in the static image. DTW distance comparison of action time series refers to comparing a continuous action of the athlete with that of an ordinary person over consecutive image frames. Its main function is to form time-ordered sequences from the 2D positions of the human skeleton key points in consecutive frames, compare the similarity between the standard action sequence and the action sequence to be corrected with the DTW algorithm, and judge and score the action according to the similarity of the sequences.
Further, the collecting of the human motion gesture recognition data in the first step comprises:
collecting data set information in daily images, collecting video data of athletes from a network, and capturing a certain action of the athletes in the obtained video in a video frame capturing mode; wherein the video data comprises the motion process of a single tennis player.
All static posture pictures are three-channel RGB color images in jpg format. A key tennis action image data set with action tags is obtained from videos of suitable duration.
In training, an image data set is divided into a training set, a verification set and a test set; wherein, the training set is used for inputting a neural network to be trained in a training process; the verification set is used for periodically verifying the rationality of the method in the training process; the test set is used to evaluate the performance of the method at completion.
Further, after the human motion posture recognition data in the step one is collected, image input and image enhancement are also included; the image input and image enhancement method comprises the following steps:
and finally, enhancing the training image of the input model in a pairwise combination mode by utilizing seven image enhancement strategies including horizontal image turning, vertical image turning, random image rotation of 0-10 degrees, random image brightness change, random image contrast change, image distortion and image scaling.
Further, the design of the neural network for human body posture recognition in the second step includes:
and training the collected human body motion attitude data set by adopting a Cascade Posenet neural network, inputting a single-person motion attitude image, and outputting coordinates of human body attitude key points in the 2D image and the category of actions in the image.
The network consists of a Posenet human posture estimation network as a backbone and a classification network, wherein the Posenet part in the whole network outputs 17 human posture key points comprising a nose, a left eye, a right eye, a left ear, a right ear, a left shoulder, a right shoulder, a left elbow, a right elbow, a left wrist, a right wrist, a left hip, a right hip, a left knee, a right knee, a left ankle and a right ankle; meanwhile, the 17 points are used as characteristics and input into a subsequent classification network part for classification, and the category of the action in the image is obtained.
The PoseNet part of Cascade PoseNet is a human motion posture estimation network whose core architecture is MobileNetV1, composed of 28 layers; the first layer uses a standard convolution kernel, and all remaining convolution layers use depthwise separable convolutions.
The convolution layers perform convolution operations on the input image data; while analyzing and training on the input images, the neural network extracts features of the posture image data set progressively from shallow to deep for analysis. The basic convolution operation is
y(m, n) = Σ_i Σ_j x(i, j) · h(m − i, n − j),
where x is the input image, h is the convolution kernel and y is the convolution result. Convolution is a basic operation in deep-learning-based image processing; the features of the input image are extracted by updating the parameters of the convolution kernel.
The batch normalization layer uses a normalization step to forcibly pull the distribution of each neuron's inputs to the nonlinear activation, which would otherwise drift toward the saturated ends of the activation's range, back to a standard normal distribution with mean 0 and variance 1, so that the inputs to the nonlinear transformation fall in a region where the function is sensitive to its input.
A dropout layer is used for regularization at the end of the network, and a fully connected layer performs classification to produce the network's prediction: for the 17 human posture key points, vectors giving the x and y coordinates and confidence of each predicted key point in the test image, together with the category of the action.
Further, the training of the motor action classification neural network model in the second step includes:
After the posture and action estimation neural network is designed, the training set of the data set is input into the network for computation. Training uses 5-fold cross-validation: the training set is divided into 5 mutually exclusive subsets of equal size, with the numbers of images of the 7 action categories in each subset kept close to a 1:1:...:1 ratio.
During training, one of the 5 subsets is selected at random as the validation set and the remaining four are used as the training set; proceeding in this way, 5 batches are trained in total, each subset serving once as the validation set, and each batch is trained for 20 rounds.
The batch size used in training is 32, the optimizer is Adam with momentum parameters 0.9 and 0.99, and the initial learning rate is 0.01; the learning rate is decayed in every round so that the final-round learning rate is 0.00001.
An early stopping strategy is added to training: the model's error on the validation set is evaluated every 15 epochs, training stops when the validation error is worse than at the previous evaluation, and the parameters from the last iteration are used as the final model parameters.
Further, the comparison of the single-frame image poses and the DTW distance comparison of the action time sequence in step three comprise:
(1) Action fragment interception in video streams
The video stream captured in real time from the user's camera is input into the Cascade PoseNet network, which extracts the human posture key-point coordinates for each frame and completes classification, yielding classified single action frames. The obtained single action frames are matched to the corresponding frames in the video stream, and the time series between matched action frames is intercepted.
The matched action combinations are forehand backswing with forehand swing completion, and backhand backswing with backhand swing completion; the other actions are mainly corrected from single-frame images.
The standard image data set is likewise input into the Cascade PoseNet network, action frames are captured and action-frame time series are matched, and frame interception is performed on the video stream read from the user's camera, yielding a standard comparison group and a user data group to be examined.
(2) Single frame action comparison and DTW action fragment distance comparison
Both the single-frame image comparison correction and the time-series DTW comparison correction adopt a vote mode, that is, a voting scheme, which comprises the following steps:
and determining key action points, comparing the coordinate values of the key points one by one, judging the position part which can not move, selecting the suggestion with the largest proportion, and voting to obtain the most reasonable correction suggestion.
For action comparison with DTW time series, the time series is explained first. The time series is a one-dimensional signal in which the horizontal axis is time in milliseconds and the vertical axis is the x value or y value of a given human key point; such a spatio-temporal signal reflects how a given body part changes over a complete single action. Its two elements are spatial information and temporal information: the spatial information comprises the values of the coordinate points and reflects whether indices such as the amplitude and range of an action and the relative positions of the relevant joint points are in place; the temporal information reflects whether the duration of an action segment is too long or too short, and the degree of similarity with the standard action time series is measured with the DTW algorithm.
Position comparison of the key joint points in a single action frame is completed by fuzzily comparing the (x, y) coordinates of the key points in the two-dimensional images of the reference data set and the user data set, to judge whether the movement amplitude of a given part is in place; DTW distance comparison of the action time series between the reference data set and the user data set judges whether the duration of an action segment is appropriate.
Another object of the present invention is to provide a motion correction system for implementing the motion correction method, the motion correction system including:
the motion gesture recognition data acquisition module is used for collecting human motion gesture recognition data;
the network model building and training module is used for designing a motion action classification neural network model and training the model;
and the action sequence intercepting, comparing and correcting module is used for respectively comparing the postures of the single-frame images and the DTW distance of the action time sequence.
It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
collecting human body motion posture identification data; designing a motion action classification neural network model and training the model; and carrying out comparison of single-frame image postures and DTW distance comparison of action time sequences.
It is another object of the present invention to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
collecting human motion posture recognition data; designing a motion action classification neural network model and training the model; and carrying out comparison of the single-frame image pose and DTW distance comparison of the action time sequence.
Another object of the present invention is to provide an information data processing terminal for implementing the exercise action correcting system.
Traditional motion action correction systems are mostly used in professional settings such as national-team training, and their usage scenarios are highly restricted. These systems are built from wearable equipment, multi-camera high-definition high-frame-rate rigs, high-performance hardware and large-scale algorithms. These factors directly lead to many problems: the software and hardware needed for motion action correction are costly, the workflow is complex, the site constraints are large, and getting started is difficult. Because of these problems, existing motion correction systems are hard to popularize in daily life; moreover, under the influence of the epidemic, more and more household users urgently need a reasonable motion correction system to help them complete physical exercise. The motion action correction system provided by the invention uses a single camera, a single camera position, an edge-side embedded device and a simple, standardized algorithm flow; it is easy to get started with, has few space constraints, and aims to fill the current gap in such products on the market.
Combining all the technical schemes above, the advantages and positive effects of the invention are as follows. The motion action correction method is improved and perfected on the basis of an optical motion capture system, and provides a machine learning algorithm based on Cascade PoseNet human posture recognition and DTW distance comparison to complete correction feedback of motion actions; the algorithm requires less hardware equipment and less computing power, processes quickly and achieves high accuracy. The whole system is portable and can complete motion action correction and timely feedback in most household and outdoor scenes, meeting the needs of most users.
Experimental tests of the Cascade PoseNet cascade network designed by the invention and of the single-frame image and DTW distance comparison algorithm show that the proportion of correct human motion corrections reaches 85% and the frame rate exceeds 20 fps. Achieving this accuracy on a palm-sized host running the full algorithm stack basically meets the usage needs of daily users. The invention can therefore complete correction feedback for different motion postures with high accuracy and a high recognition rate.
Meanwhile, the invention also has the following beneficial effects:
(1) Human motion images are acquired with an ordinary USB camera, the human posture recognition neural network model is run on a notebook computer, and erroneous actions are compared and corrected through skeleton extraction and posture classification; the approach has high universality, a small amount of computation and high accuracy, and satisfies the needs of daily life well.
(2) The invention uses a dual comparison strategy of key-frame action comparison and time-series comparison: first, part-level action correction is performed on key actions; second, complete action segments are compared as a whole. Such a correction strategy is more accurate and reasonable than that of a general system.
(3) The invention provides the Cascade algorithm idea: PoseNet first completes posture recognition, and a cascaded posture classification network then recognizes the human posture category in the current state. This cascade concept reduces the computing resources required to run the system.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a method for correcting exercise according to an embodiment of the present invention.
FIG. 2 is a block diagram of a motion correction system according to an embodiment of the present invention;
in the figure: 1. a motion gesture recognition data acquisition module; 2. a network model construction and training module; 3. and an action sequence intercepting, comparing and correcting module.
Fig. 3 is a schematic diagram of the improvement of the basic depth separable convolution in the PoseNet part according to the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In view of the problems in the prior art, the present invention provides a method, a system, a storage medium, a computer device and a terminal for correcting a movement, which are described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the exercise movement correction method provided by the embodiment of the present invention includes the following steps:
s101, collecting human motion posture recognition data;
s102, designing a motion action classification neural network model and training the model;
s103, comparing the postures of the single-frame images and the DTW distance of the action time sequence.
As shown in fig. 2, the exercise movement correction system provided in the embodiment of the present invention includes:
the motion gesture recognition data acquisition module 1 is used for collecting human motion gesture recognition data;
the network model building and training module 2 is used for designing a motion action classification neural network model and training the model;
and the action sequence intercepting, comparing and correcting module 3 is used for respectively comparing the postures of the single-frame images and the DTW distance of the action time sequence.
The technical solution of the present invention is further described below with reference to specific examples.
The technical problem to be solved by the invention is as follows: a machine learning algorithm based on Cascade PoseNet human posture recognition and DTW distance comparison is provided to complete correction and feedback of motion actions; the algorithm requires less hardware equipment and less computing power, processes quickly and achieves high accuracy. The whole system is portable and can complete motion action correction and timely feedback tasks in most household and outdoor scenes, meeting the needs of most users.
The motion action correction method based on human body posture recognition and machine learning comprises the following two stages: building and training a motion action classification neural network, and intercepting, comparing and correcting an action sequence.
The first stage comprises the following steps: collecting human motion posture recognition data, designing a neural network model and training the model.
1. Standard athletic performance data set collection
The data set used by the invention is collected from everyday images; the required video data of 5 athletes are collected from the network, ensuring as far as possible different backgrounds and different lighting conditions for the athletes. The video data contain the motion process of a single tennis player.
Specific actions of the athletes are captured from the obtained videos by video frame capture. Seven actions are specifically exemplified: forehand backswing, forehand swing completion, backhand backswing, backhand swing completion, volley, serve-and-smash, and the ready (waiting) stance.
5 professional athletes are selected and 2000 pictures are taken from each athlete's sports video, giving 10000 pictures in total; the 2000 pictures of each athlete cover the 7 static action categories in equal proportion.
All static posture pictures are three-channel RGB color images in jpg format. In total, a key tennis action image data set with 7 action tags is obtained from videos of suitable duration.
In actual training, the image data set is divided into three parts: a training set, a validation set and a test set. The training set is input into the neural network during training, the validation set periodically verifies the soundness of the method during training, and the test set evaluates the performance of the method on completion. Of the 10000 images, 7500 are selected as training data, 500 as validation data and the remaining 2000 as test data, giving an overall split ratio of 15:1:4, which meets deep-learning training practice.
2. Image input and image enhancement
The main task of this step is to prepare the static motion picture recognition data set for training with the human posture recognition neural network of the previous step. Before training, image enhancement is applied to the training images; this increases the difficulty of the features the network must learn and reasonably expands the data set. The final effect is that the whole network mines the feature information of the images more deeply and achieves an accurate classification effect.
For the characteristics of the collected motion posture image data set, the invention adopts seven image enhancement strategies: horizontal flipping, vertical flipping, random rotation by 0-10 degrees, random brightness change, random contrast change, image distortion and image scaling; the training images input to the model are then enhanced using pairwise combinations of these strategies.
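As an illustrative, non-limiting sketch of such an enhancement pipeline (assuming Python with torchvision; the actual library is not specified by the invention, and the jitter, distortion and scaling magnitudes below are assumptions, only the 0-10 degree rotation range is stated above), the seven strategies can be combined pairwise as follows:

    import random
    from torchvision import transforms

    # The seven enhancement strategies named above; magnitudes other than the
    # 0-10 degree rotation range are illustrative assumptions.
    STRATEGIES = [
        transforms.RandomHorizontalFlip(p=1.0),
        transforms.RandomVerticalFlip(p=1.0),
        transforms.RandomRotation(degrees=(0, 10)),
        transforms.ColorJitter(brightness=0.3),                      # random brightness change
        transforms.ColorJitter(contrast=0.3),                        # random contrast change
        transforms.RandomPerspective(distortion_scale=0.2, p=1.0),   # image distortion
        transforms.RandomAffine(degrees=0, scale=(0.8, 1.2)),        # image scaling
    ]

    def augment_pairwise(image):
        """Enhance one training image with a random pairwise combination of strategies."""
        first, second = random.sample(STRATEGIES, 2)
        return transforms.Compose([first, second])(image)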
3. Neural network structure for recognizing human body posture
The invention trains a Cascade PoseNet neural network on the collected human motion posture data set; the input is a single-person motion posture image, and the output is the coordinates of the 17 human posture key points in the 2D image and the category of the action in the image.
The network consists of a PoseNet human posture estimation network as the backbone and a classification network. The PoseNet part of the whole network outputs 17 human posture key points (nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle and right ankle), as shown in Table 1; these 17 points are also input, as features, into the subsequent classification network to obtain the category of the action in the image.
Table 1. Human posture key points: nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle and right ankle (17 points in total).
The PoseNet part of Cascade PoseNet is a human motion posture estimation network whose core architecture is MobileNetV1; its structure is shown in Table 2. It is composed of 28 layers (not counting the average-pooling and fully connected layers, and counting depthwise and pointwise convolutions separately); except for the first layer, which uses a standard convolution kernel, all remaining convolution layers use depthwise separable convolutions. Fig. 3 illustrates the improvement brought by the basic depthwise separable convolution in the PoseNet part, which reduces the amount of computation by a factor of about 9 compared with standard convolution.
Table 2. Structure of the human motion posture estimation network (28-layer MobileNetV1 backbone).
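As an illustrative PyTorch sketch of the depthwise separable convolution just described (a standard MobileNetV1-style block under stated assumptions, not the exact patented layer configuration), one standard convolution is replaced by a per-channel depthwise convolution followed by a 1x1 pointwise convolution, which is the source of the roughly 9-fold reduction in computation noted above:

    import torch.nn as nn

    class DepthwiseSeparableConv(nn.Module):
        """Depthwise 3x3 convolution followed by pointwise 1x1 convolution,
        each with batch normalization and ReLU, as in MobileNetV1-style blocks."""
        def __init__(self, in_ch, out_ch, stride=1):
            super().__init__()
            self.block = nn.Sequential(
                # depthwise: one 3x3 filter per input channel (groups=in_ch)
                nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
                nn.BatchNorm2d(in_ch),
                nn.ReLU(inplace=True),
                # pointwise: 1x1 convolution mixes channels
                nn.Conv2d(in_ch, out_ch, 1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )

        def forward(self, x):
            return self.block(x)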
The convolution layers perform convolution operations on the input image data, in a manner similar to a traditional filter; while analyzing and training on the input images, the neural network extracts features of the posture image data set progressively from shallow to deep for analysis. The basic convolution operation is
y(m, n) = Σ_i Σ_j x(i, j) · h(m − i, n − j),
where x is the input image, h is the convolution kernel and y is the convolution result. Convolution is a basic operation in deep-learning-based image processing; the features of the input image are extracted by updating the parameters of the convolution kernel.
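For reference, a direct NumPy implementation of the convolution formula above (a didactic sketch for a single-channel image with "valid" boundaries; the network itself uses learned multi-channel kernels with padding and stride):

    import numpy as np

    def conv2d(x, h):
        """y(m, n) = sum_i sum_j x(i, j) * h(m - i, n - j), valid region only."""
        kh, kw = h.shape
        hh = h[::-1, ::-1]  # true convolution flips the kernel
        out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
        for m in range(out.shape[0]):
            for n in range(out.shape[1]):
                out[m, n] = np.sum(x[m:m + kh, n:n + kw] * hh)
        return out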
The batch normalization layer uses a normalization step to forcibly pull the distribution of each neuron's inputs to the nonlinear activation, which would otherwise drift toward the saturated ends of the activation's range, back to a standard normal distribution with mean 0 and variance 1. The inputs to the nonlinear transformation then fall in a region that is sensitive to the input, which avoids the vanishing-gradient problem; the larger gradients correspond to faster learning convergence, so training time is greatly reduced.
To avoid overfitting in the deep neural network, a dropout layer is used for regularization at the end of the network, and a fully connected layer performs classification to produce the network's prediction: for the 17 human posture key points, vectors giving the x and y coordinates and confidence of each predicted key point in the test image, together with the category of the action.
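The overall cascade can be sketched as follows (an illustrative PyTorch reconstruction; the backbone channel count, hidden size and head layout are assumptions rather than the patented configuration). The point of the sketch is the cascade itself: the backbone predicts the 17 key points, and those key points, not the raw image, are the input features of the classification head:

    import torch.nn as nn

    NUM_KEYPOINTS = 17   # nose through right ankle, as listed in Table 1
    NUM_ACTIONS = 7      # the seven tennis action categories

    class CascadePoseNet(nn.Module):
        def __init__(self, backbone, backbone_channels=1024):
            super().__init__()
            self.backbone = backbone                          # MobileNetV1-style feature extractor
            self.pool = nn.AdaptiveAvgPool2d(1)
            # x, y and confidence for each of the 17 key points
            self.keypoint_head = nn.Linear(backbone_channels, NUM_KEYPOINTS * 3)
            # cascaded posture classification head fed by the key points
            self.classifier = nn.Sequential(
                nn.Linear(NUM_KEYPOINTS * 3, 64),
                nn.ReLU(inplace=True),
                nn.Dropout(p=0.5),                            # dropout regularization
                nn.Linear(64, NUM_ACTIONS),                   # fully connected classifier
            )

        def forward(self, image):
            feat = self.pool(self.backbone(image)).flatten(1)
            keypoints = self.keypoint_head(feat).view(-1, NUM_KEYPOINTS, 3)
            action_logits = self.classifier(keypoints.flatten(1))
            return keypoints, action_logits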
4. Cross-validation training
After the posture and action estimation neural network is designed, the training set of the data set is input into the network for computation. Training uses 5-fold cross-validation: the training set is divided into 5 mutually exclusive subsets of equal size, each containing 1600 posture images, with the numbers of images of the 7 action categories in each subset kept close to a 1:1:...:1 ratio.
First, one of the 5 subsets is selected at random as the validation set and the remaining four are used as the training set. Proceeding in this way, 5 batches are trained in total, each subset serving once as the validation set, and each batch is trained for 20 rounds. Cross-validation training makes full use of the data set, so that the neural network fully learns the feature information of the images while effectively avoiding overfitting.
The batch size used in training is 32, the optimizer is Adam with momentum parameters 0.9 and 0.99, and the initial learning rate is 0.01. The learning rate is decayed in every round so that the final-round learning rate is 0.00001.
Meanwhile, to prevent overfitting, an early stopping strategy is added to training: the model's error on the validation set is evaluated every 15 epochs, training stops when the validation error is worse than at the previous evaluation, and the parameters from the last iteration are used as the final model parameters.
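A condensed sketch of this training procedure is given below (hyperparameter values are taken from the description above; the loss function, the exponential form of the decay and the fold shuffling are assumptions):

    import torch
    from sklearn.model_selection import KFold
    from torch.utils.data import DataLoader, Subset

    def train_cross_validated(dataset, build_model, loss_fn, epochs=20, device="cpu"):
        folds = KFold(n_splits=5, shuffle=True).split(list(range(len(dataset))))
        for fold, (train_idx, val_idx) in enumerate(folds):
            train_loader = DataLoader(Subset(dataset, train_idx), batch_size=32, shuffle=True)
            val_loader = DataLoader(Subset(dataset, val_idx), batch_size=32)
            model = build_model().to(device)
            # Adam with momentum parameters 0.9 / 0.99 and initial learning rate 0.01
            optimizer = torch.optim.Adam(model.parameters(), lr=0.01, betas=(0.9, 0.99))
            # decay toward a final-round learning rate of 1e-5
            scheduler = torch.optim.lr_scheduler.ExponentialLR(
                optimizer, gamma=(1e-5 / 0.01) ** (1.0 / epochs))
            best_val = float("inf")
            for epoch in range(epochs):
                model.train()
                for images, targets in train_loader:
                    optimizer.zero_grad()
                    loss = loss_fn(model(images.to(device)), targets.to(device))
                    loss.backward()
                    optimizer.step()
                scheduler.step()
                if (epoch + 1) % 15 == 0:        # early-stopping check every 15 epochs
                    model.eval()
                    with torch.no_grad():
                        val = sum(loss_fn(model(x.to(device)), y.to(device)).item()
                                  for x, y in val_loader)
                    if val > best_val:           # worse than the previous check: stop
                        break
                    best_val = val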
Finally, after the whole neural network model is trained for 180 rounds, the model accuracy reaches 98.9%. The inference speed of the model on the palm-sized computer is 23 ms per frame, and the whole model occupies 5 MB.
The second stage comprises the comparison of single-frame image poses and the DTW distance comparison of action time series.
1. Action fragment interception in video streams
The video stream captured in real time from the user's camera is input into the Cascade PoseNet network, which extracts the human posture key-point coordinates for each frame and completes classification, yielding classified single action frames. Each obtained single action frame is matched with the corresponding frame in the video stream (for example, forehand backswing and forehand swing completion form a pair of matched actions), and the time series between the matched action frames is intercepted.
The action combinations to be matched are forehand backswing with forehand swing completion, and backhand backswing with backhand swing completion. Because these two groups of movements are the most basic strokes in tennis, they have great correction value. The other actions mainly undergo single-frame image correction.
The standard image data set is likewise input into the Cascade PoseNet network, action frames are captured and action-frame time series are matched, and frame interception is performed on the video stream read from the user's camera. Once these processes are complete, a standard comparison group and a user data group to be examined are obtained.
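An illustrative sketch of this interception step follows (the classify_frame interface and the label strings are hypothetical names introduced only for illustration):

    # start-frame label -> matching end-frame label
    MATCHED_PAIRS = {
        "forehand_backswing": "forehand_swing_complete",
        "backhand_backswing": "backhand_swing_complete",
    }

    def intercept_segments(frames, classify_frame):
        """classify_frame(frame) -> (keypoints, label); returns the key-point
        time series intercepted between each matched start/end action frame."""
        segments, open_starts, keypoint_stream = [], {}, []
        for idx, frame in enumerate(frames):
            keypoints, label = classify_frame(frame)
            keypoint_stream.append(keypoints)
            if label in MATCHED_PAIRS:                 # a start frame opens a segment
                open_starts[MATCHED_PAIRS[label]] = idx
            elif label in open_starts:                 # its matching end frame closes it
                start = open_starts.pop(label)
                segments.append((label, keypoint_stream[start:idx + 1]))
        return segments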
2. Single frame action comparison and DTW action fragment distance comparison
Both the single-frame image comparison correction and the time-series DTW comparison correction adopt a vote mode, that is, a voting scheme. The standard data set for a given user action contains more than 1000 standard action frames, most of which are essentially correct, although even professional athletes occasionally perform an action imperfectly. The invention therefore performs single-frame action comparison as follows. First, the key action points are determined; for example, the main points of a swing action are the right wrist, right elbow and right shoulder, assisted by the left-arm system, so the 12 coordinate values of these 6 key points are compared one by one to determine which part's movement is not in place. The vote idea appears here: among the more than 1000 comparison images, each yields a correction suggestion for a given part, and the suggestion with the largest share is selected by voting as the most reasonable correction suggestion, which makes the result more reasonable.
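An illustrative sketch of the voting step follows (the pose representation, joint names and tolerance are hypothetical assumptions; coordinates are taken to be normalized image coordinates with y increasing downward):

    from collections import Counter

    SWING_JOINTS = ["right_wrist", "right_elbow", "right_shoulder",
                    "left_wrist", "left_elbow", "left_shoulder"]

    def vote_correction(user_pose, standard_poses, tolerance=0.05):
        """user_pose / standard_poses: dicts mapping joint name -> (x, y).
        Every standard frame contributes one suggestion per out-of-place joint;
        the (joint, direction) suggestion with the largest share wins the vote."""
        votes = Counter()
        for std in standard_poses:                 # > 1000 standard frames in practice
            for joint in SWING_JOINTS:
                dx = user_pose[joint][0] - std[joint][0]
                dy = user_pose[joint][1] - std[joint][1]
                if abs(dx) > tolerance or abs(dy) > tolerance:
                    suggestion = ("move left" if dx > 0 else "move right",
                                  "move up" if dy > 0 else "move down")
                    votes[(joint, suggestion)] += 1
        return votes.most_common(1)[0] if votes else None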
For action comparison with DTW time series, the time series is explained first. Such a time series is a one-dimensional signal in which the horizontal axis is time in milliseconds and the vertical axis is the x value (or y value) of a given human key point; such a spatio-temporal signal reflects how a given body part changes over a complete single action. Its two elements are its spatial information and its temporal information. The spatial information comprises the values of the coordinate points and reflects whether indices such as the amplitude and range of an action and the relative positions of the joint points are in place. The temporal information reflects whether the duration of an action segment is too long or too short; the degree of similarity with the standard action time series is measured with the DTW algorithm, and in particular the x or y coordinate of a given part can be assessed more accurately and finely.
Position comparison of the key joint points in a single action frame is embodied as a fuzzy comparison of the (x, y) coordinates of the key points in the two-dimensional images of the reference data set and the user data set, judging whether the movement amplitude of a given part is in place; DTW distance comparison of the action time series between the reference data set and the user data set judges whether the duration of an action segment is appropriate.
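A minimal sketch of the DTW distance used for this time-series comparison is given below (the classic dynamic-programming formulation; window constraints and normalization, if any are used by the invention, are not shown):

    import numpy as np

    def dtw_distance(seq_a, seq_b):
        """Dynamic time warping distance between two 1-D sequences, e.g. the x
        (or y) coordinate of one key point over a standard and a user action segment."""
        n, m = len(seq_a), len(seq_b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(seq_a[i - 1] - seq_b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    # A small distance means the amplitude and timing of the user's action segment
    # are close to the standard; a large distance flags the segment for correction.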
The invention acquires human motion images with an ordinary USB camera, runs inference of the human posture recognition neural network model on a notebook computer, and completes erroneous-action comparison and correction through skeleton extraction and posture classification; it has high universality, a small amount of computation and high accuracy, and satisfies the needs of daily life well.
The invention uses a dual comparison strategy of key-frame action comparison and time-series comparison: part-level action correction is performed on key actions, and complete action segments are compared as a whole. Such a correction strategy is more accurate and reasonable than that of a general system. The invention also provides the Cascade algorithm idea: PoseNet first completes posture recognition, and a cascaded posture classification network then recognizes the human posture category in the current state. This cascade concept reduces the computing resources required to run the system.
The technical effects of the present invention will be described in detail with reference to simulation experiments.
1. The experimental conditions are as follows:
The test platform of the invention is a LattePanda Delta configured with an Intel 8th-generation Celeron N4100 processor, a board that offers a good balance of price and performance when used as a robot controller, interactive project core, Internet-of-Things edge device or AI brain. The operating system used was Ubuntu 16.04;
the camera is a Logitech Webcam C270 with 720p/30 fps resolution and frame rate, a fixed focus and a 60-degree field of view;
a three-channel RGB image data set with a resolution of 640 px × 480 px is tested, using VS Code, OpenCV and Electron as the software platform.
2. The experimental results are as follows:
Experimental tests of the Cascade PoseNet cascade network designed by the invention and of the single-frame image and DTW distance comparison algorithm show that the proportion of correct human motion corrections reaches 85% and the frame rate exceeds 20 fps. Achieving this accuracy on a palm-sized host running the full algorithm stack basically meets the usage needs of daily users.
In summary, the invention can complete correction feedback for different motion postures and achieves high accuracy and a high recognition rate.
In the above embodiments, the invention may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented in whole or in part in software, it takes the form of a computer program product comprising one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the procedures or functions described in accordance with the embodiments of the invention are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server or data center to another by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD) or a semiconductor medium (e.g., solid-state drive (SSD)), among others.
The above description is only for the purpose of illustrating the embodiments of the present invention, and the scope of the present invention should not be limited thereto, and any modifications, equivalents and improvements made by those skilled in the art within the technical scope of the present invention as disclosed in the present invention should be covered by the scope of the present invention.

Claims (10)

1. A motion action correction method, characterized in that the method comprises:
step one, collecting human motion posture recognition data;
step two, designing a motion action classification neural network model and training the model;
step three, comparing single-frame image poses and comparing DTW distances of action time series.
2. The motion action correction method according to claim 1, wherein the collecting of the human motion posture recognition data in step one comprises:
collecting data set information in daily images, collecting video data of athletes from a network, and capturing a certain action of the athletes in the obtained video in a video frame capturing mode; wherein the video data comprises the motion process of a single tennis player;
all static posture pictures are three-channel RGB color images in jpg format; a key tennis action image data set with action tags is acquired from videos of suitable duration;
in training, an image data set is divided into a training set, a verification set and a test set; wherein, the training set is used for inputting a neural network to be trained in a training process; the verification set is used for periodically verifying the rationality of the method in the training process; the test set is used to evaluate the performance of the method at completion.
3. The method for correcting exercise motions according to claim 1, wherein the step one, after collecting the data of recognizing the human body exercise gesture, further comprises image input and image enhancement; the image input and image enhancement method comprises the following steps:
and finally, enhancing the training image of the input model in a pairwise combination mode by utilizing seven image enhancement strategies including horizontal image turning, vertical image turning, random image rotation of 0-10 degrees, random image brightness change, random image contrast change, image distortion and image scaling.
4. The motion action correction method according to claim 1, wherein the designing of the neural network for human posture recognition in step two comprises:
training the collected human body motion attitude data set by adopting a Cascade Posenet neural network, inputting a single-person motion attitude image, and outputting coordinates of human body attitude key points in a 2D image and the category of actions in the image;
the network consists of a PoseNet human body posture estimation network as a backbone and a classification network, wherein the PoseNet part in the whole network outputs 17 human body posture key points comprising a nose, a left eye, a right eye, a left ear, a right ear, a left shoulder, a right shoulder, a left elbow, a right elbow, a left wrist, a right wrist, a left hip, a right hip, a left knee, a right knee, a left ankle and a right ankle; meanwhile, inputting 17 points as features into a subsequent classification network part for classification to obtain the category of the action in the image;
the PoseNet part of the Cascade PoseNet is a human motion posture estimation network whose core architecture is MobileNetV1, composed of 28 layers; the first layer uses a standard convolution kernel, and all remaining convolution layers use depthwise separable convolutions;
the convolution layers perform convolution operations on the input image data; while analyzing and training on the input images, the neural network extracts features of the posture image data set progressively from shallow to deep for analysis; the basic convolution operation is
y(m, n) = Σ_i Σ_j x(i, j) · h(m − i, n − j),
where x is the input image, h is the convolution kernel and y is the convolution result; convolution is a basic operation in deep-learning-based image processing, and the features of the input image are extracted by updating the parameters of the convolution kernel;
the batch normalization layer uses a normalization step to forcibly pull the distribution of each neuron's inputs to the nonlinear activation, which would otherwise drift toward the saturated ends of the activation's range, back to a standard normal distribution with mean 0 and variance 1, so that the inputs to the nonlinear transformation fall in a region that is sensitive to the input;
and a dropout layer is used for regularization at the end of the neural network, a fully connected layer performs classification to obtain the network's prediction, and the final output is, for the 17 human posture key points, vectors giving the x coordinate, y coordinate and confidence of each predicted key point in the test image, together with the category of the action.
5. The motion action correction method according to claim 1, wherein the training of the motion action classification neural network model in step two comprises:
after the posture and action estimation neural network is designed, the training set of the data set is input into the network for computation; training uses 5-fold cross-validation, with the training set divided into 5 mutually exclusive subsets of equal size and the numbers of images of the 7 action categories in each subset kept close to a 1:1:...:1 ratio;
one of the 5 subsets is selected at random as the validation set during training and the remaining four are used as the training set; proceeding in this way, 5 batches are trained in total, each subset serving once as the validation set, and each batch is trained for 20 rounds;
the batch size used in training is 32, the optimizer is Adam with momentum parameters 0.9 and 0.99, and the initial learning rate is 0.01; the learning rate is decayed in every round so that the final-round learning rate is 0.00001;
and an early stopping strategy is added to training: the model's error on the validation set is evaluated every 15 epochs, training stops when the validation error is worse than at the previous evaluation, and the parameters from the last iteration are used as the final model parameters.
6. The method for correcting exercise movement according to claim 1, wherein the comparison of the pose of the single-frame image and the DTW distance comparison of the movement time series in step three comprise:
(1) Action fragment interception in video streams
Inputting the video stream collected when a user uses a camera in real time into a Cascade PoseNet network to extract the coordinates of key points of human body postures for each frame of image, and finishing classification to obtain a classified single action frame; matching the obtained single action frame with the corresponding frame in the video stream, and intercepting a period of time sequence between the matched action frames;
the matched action combinations are forehand backswing with forehand swing completion, and backhand backswing with backhand swing completion; the other actions are mainly corrected from single-frame images;
inputting a standard image data set into a Cascade PoseNet network, capturing by using an action frame and matching an action frame time sequence, and finishing frame interception by aiming at a video stream read by a user through a camera to obtain a standard comparison group and a user data group to be detected;
(2) Single-frame action comparison and DTW action-segment distance comparison
both the single-frame image comparison correction and the time-series DTW comparison correction are carried out in a vote mode, i.e. a voting scheme, which comprises the following steps:
the key action points are determined and their coordinate values are compared one by one to judge which positions the movement has failed to reach; the suggestion with the largest share is selected, so that voting yields the most reasonable correction suggestion;
for the action comparison of a DTW time series, the action is first described as a time series: a one-dimensional signal whose horizontal axis is time in milliseconds and whose vertical axis is the x value or y value of a given human key point; this spatio-temporal signal reflects how a given body part changes over a complete single action and contains two kinds of information: spatial information, namely the values of the coordinate points, which reflects the amplitude and range of an action, the relative positions of the joint points and other indicators of whether the action is in place; and temporal information, namely whether the duration of an action segment is too long or too short, whose similarity to the standard action time series is tested with the DTW algorithm;
the position comparison of key joint points in a single action frame is completed by fuzzily comparing the (x, y) coordinates of key points in the two-dimensional images of the reference data set and the user data set to judge the movement amplitude of a given body part; and the DTW distance between the action time series of the reference data set and of the user data set is compared to judge whether the duration of an action segment is appropriate.
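A minimal illustration of the DTW distance used above to compare a user's action time series with the standard time series is sketched below; this is the textbook dynamic-programming formulation of DTW, shown only to make the distance concrete, and the example series are invented.

import numpy as np

def dtw_distance(reference, user):
    # reference, user: 1-D time series of a key point's x (or y) value over one action segment.
    n, m = len(reference), len(user)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(reference[i - 1] - user[j - 1])
            # the warp path may match, stretch or compress the series in time
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# Example: a user segment performed more slowly than the reference still aligns with a small distance.
reference = np.array([0.0, 0.2, 0.6, 1.0, 0.6, 0.2, 0.0])
user = np.array([0.0, 0.1, 0.2, 0.6, 0.9, 1.0, 0.6, 0.3, 0.0])
print(dtw_distance(reference, user))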
7. A motion action correcting system that implements the motion action correcting method according to any one of claims 1 to 6, wherein the motion action correcting system comprises:
a motion posture recognition data acquisition module, used for collecting human motion posture recognition data;
a network model construction and training module, used for designing the motion action classification neural network model and training the model;
and an action sequence interception, comparison and correction module, used for performing the single-frame image pose comparison and the DTW distance comparison of the action time series, respectively.
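For illustration only, the three modules of claim 7 could be organized as in the following sketch; the class and method names are assumptions and do not correspond to any implementation disclosed in the claims.

class MotionActionCorrectionSystem:
    # Sketch of the three-module structure: data acquisition, model training, comparison/correction.
    def __init__(self, data_acquisition, model_trainer, comparator):
        self.data_acquisition = data_acquisition   # collects human motion posture recognition data
        self.model_trainer = model_trainer         # designs and trains the action classification network
        self.comparator = comparator               # single-frame pose comparison and DTW distance comparison

    def run(self, video_stream, standard_dataset):
        frames = self.data_acquisition.collect(video_stream)
        model = self.model_trainer.train(standard_dataset)
        return self.comparator.compare(model, frames, standard_dataset)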
8. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of:
collecting human motion posture recognition data; designing a motion action classification neural network model and training the model; and performing single-frame image pose comparison and DTW distance comparison of action time series.
9. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
collecting human motion posture recognition data; designing a motion action classification neural network model and training the model; and performing single-frame image pose comparison and DTW distance comparison of action time series.
10. An information data processing terminal characterized by being configured to implement the motion action correcting system according to claim 7.
CN202211070820.9A 2022-09-02 2022-09-02 Motion action correcting method, system, storage medium, computer equipment and terminal Pending CN115393963A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211070820.9A CN115393963A (en) 2022-09-02 2022-09-02 Motion action correcting method, system, storage medium, computer equipment and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211070820.9A CN115393963A (en) 2022-09-02 2022-09-02 Motion action correcting method, system, storage medium, computer equipment and terminal

Publications (1)

Publication Number Publication Date
CN115393963A true CN115393963A (en) 2022-11-25

Family

ID=84124795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211070820.9A Pending CN115393963A (en) 2022-09-02 2022-09-02 Motion action correcting method, system, storage medium, computer equipment and terminal

Country Status (1)

Country Link
CN (1) CN115393963A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118016242A (en) * 2024-04-09 2024-05-10 南京康尼机电股份有限公司 Method and system for generating human body movement function correction training scheme

Similar Documents

Publication Publication Date Title
WO2021129064A1 (en) Posture acquisition method and device, and key point coordinate positioning model training method and device
US20220101654A1 (en) Method for recognizing actions, device and storage medium
CN110135249B Human behavior identification method based on time attention mechanism and LSTM
WO2021098616A1 (en) Motion posture recognition method, motion posture recognition apparatus, terminal device and medium
WO2020042542A1 (en) Method and apparatus for acquiring eye movement control calibration data
CN111723707B (en) Gaze point estimation method and device based on visual saliency
CN110458235B (en) Motion posture similarity comparison method in video
CN112329525A (en) Gesture recognition method and device based on space-time diagram convolutional neural network
Zhang et al. Multimodal spatiotemporal networks for sign language recognition
CN111046734A (en) Multi-modal fusion sight line estimation method based on expansion convolution
CN110633004A (en) Interaction method, device and system based on human body posture estimation
CN112258555A (en) Real-time attitude estimation motion analysis method, system, computer equipment and storage medium
Gao et al. A novel multiple-view adversarial learning network for unsupervised domain adaptation action recognition
CN115131879B (en) Action evaluation method and device
CN112906520A (en) Gesture coding-based action recognition method and device
CN113269013B (en) Object behavior analysis method, information display method and electronic equipment
CN115393964A (en) Body-building action recognition method and device based on BlazePose
CN115393963A (en) Motion action correcting method, system, storage medium, computer equipment and terminal
Ait-Bennacer et al. Applying Deep Learning and Computer Vision Techniques for an e-Sport and Smart Coaching System Using a Multiview Dataset: Case of Shotokan Karate.
Liu et al. Key algorithm for human motion recognition in virtual reality video sequences based on hidden markov model
JP2022095332A (en) Learning model generation method, computer program and information processing device
CN112069943A (en) Online multi-person posture estimation and tracking method based on top-down framework
CN111222459A (en) Visual angle-independent video three-dimensional human body posture identification method
Almasi Human movement analysis from the egocentric camera view
Tsai et al. Temporal-variation skeleton point correction algorithm for improved accuracy of human action recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination