CN115937895B - Speed and strength feedback system based on depth camera - Google Patents

Publication number
CN115937895B
Authority
CN
China
Prior art keywords
depth
hand
frame
module
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211414614.5A
Other languages
Chinese (zh)
Other versions
CN115937895A (en
Inventor
张堃
涂鑫涛
张鹏程
刘志诚
徐沛霞
林鹏程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University
Priority to CN202211414614.5A
Publication of CN115937895A
Application granted
Publication of CN115937895B
Legal status: Active

Abstract

The invention relates to the technical field of electronic information, and in particular to a speed and strength feedback system based on a depth camera. The system comprises an image acquisition module, a human body capturing module, a motion monitoring module and a speed and strength calculation module. The image acquisition module uses its built-in vision sensors to monitor both the athlete and the athlete's surroundings, and can acquire color images and depth images simultaneously; it consists of two depth cameras fixed in a cross arrangement, one horizontal and one vertical. The human body capturing module efficiently locates 16 key points of the human body; its core is the Exc-Pose algorithm, which comprises a lightweight E-shaped encoding layer and a decoding layer trained by regression-based supervised learning. By combining this structure with the related algorithms, the invention captures, without contact, core technical indexes such as posture, speed, strength and power generated by athletes during physical training, and digitizes them to guide scientific training.

Description

Speed and strength feedback system based on depth camera
Technical Field
The invention relates to the technical field of electronic information, in particular to a speed and strength feedback system based on a depth camera.
Background
In athletic training practice, quantitative analysis and evaluation of how an athlete's competitive performance changes is the main way for a coach to understand the training effect, revise the training plan and scientifically control the training process. In the era of big data, how high-level sports teams can use digital equipment to carry out physical training and monitoring, and so help athletes reliably reach their target performance at a set point in time, is an important problem for making athletic training more scientific and more precise. At present, digital monitoring in the physical training of high-level sports teams focuses mainly on physical training platforms, physical state monitoring platforms and physical big data management platforms. Physical training is the foundation of all competitive sports, and speed and strength training is its core. However, current speed and strength training relies either on visual observation by the coach or on the athlete's own trial and error by feel. Alternatively, the athlete's speed and strength are monitored with auxiliary equipment such as GymAware, but such equipment must be attached to the athlete or to the load by a cord or similar means, which easily disturbs the athlete during exercise. How to digitize core technical indexes such as posture, speed, strength and power generated during speed and strength training without attaching any equipment, and thereby scientifically guide athletes' daily training to improve training efficiency and reduce sports injuries, has become a primary problem for coaches.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a speed and strength feedback system based on a depth camera. By combining a suitable structure with the related algorithms, the system captures, without contact, core technical indexes such as posture, speed, strength and power generated by athletes during physical training, and digitizes them to guide scientific training.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A speed and strength feedback system based on a depth camera comprises an image acquisition module, a human body capturing module, a motion monitoring module and a speed and strength calculation module;
the image acquisition module uses its built-in vision sensors to monitor both the athlete and the athlete's surroundings, and can acquire color images and depth images simultaneously; the image acquisition module consists of two depth cameras fixed in a cross arrangement, one horizontal and one vertical;
the human body capturing module efficiently locates 16 key points of the human body; its core is the Exc-Pose algorithm, which comprises a lightweight E-shaped encoding layer and an RLE-Decoder decoding layer trained by regression-based supervised learning;
the motion monitoring module distinguishes genuine movements from false ones: a hand region is generated from the hand key points located by the human body capturing module, and an Exc binary classifier applied to that region decides whether the movement is genuine;
the speed and strength calculation module computes the relevant training information using a multi-frame-fusion Kalman filtering algorithm;
the speed and strength feedback system operates in the following steps:
S1: from the color and depth videos acquired by two Intel RealSense D435 high-definition cameras, images containing the athlete are obtained frame by frame and enhanced;
S2: human body detection is performed with a non-contact, lightweight human body detection algorithm based on log-likelihood estimation and a regression model, and the human body region in the image is segmented;
S3: coordinate values of the hand key points are extracted from the detected human body region and converted into movement indexes such as speed, power, average speed and average power.
Preferably, in the human body capturing module, the Exc-Pose algorithm comprises target detection and pose detection, and consists mainly of a PD-ShuffleNet encoding layer and an RLE-Decoder decoding layer. PD-ShuffleNet extracts finer low-level features through its three-branch structure, and the RLE-Decoder applies regression-based supervised learning to the extracted features, using residual log-likelihood estimation and re-parameterization, to regress the coordinate values of the target key points directly.
Preferably, from the 16 key points and their coordinates produced by pose detection, the motion monitoring module selects the hand key point coordinates $\mathrm{Hand}_n(x_n, y_n)$ and generates, frame by frame, an adaptive bounding rectangle around the hand key points. The region adjusts automatically as the athlete moves, so that the barbell always stays inside it. With the Exc classifier added, the motion monitoring module judges for any frame whether the athlete is in a hand-held barbell movement state.
Preferably, after an image is input, a convolution operation and a max pooling operation are performed, followed by three stages, Stage2, Stage3 and Stage4, each formed by stacking a different number of shuffle units. The first unit of each stage is a shuffle unit with stride=2, which implements downsampling; Stage2 and Stage4 each consist of one stride=2 shuffle unit followed by 3 stride=1 shuffle units. The PD-ShuffleNet designed here divides the network into three branches that form an "E" structure, and each branch learns the low-level features of the target separately. Specifically, after Stage2 ends, Stage3 and Stage4 are replicated to form a 3-branch E-shaped structure: after their stride=2 shuffle unit, Stage3-1, Stage3-2 and Stage3-3 stack 5, 7 and 9 shuffle units with stride=1 respectively, and each is then connected to a Stage4, so that the network learns low-level features at different depths.
Preferably, the RLE-Decoder learns the coordinates of the target joint points directly by regression-based supervised learning. For the image feature $I$ learned by the PD-ShuffleNet encoding layer, the decoding layer uses a regression model to predict the probability that the target joint point lies at position $x$; this probability distribution is written $P_\Theta(x \mid I)$, where $\Theta$ denotes the parameters learned by the model. Supervision consists of optimizing the model parameters $\Theta$ so that the predicted probability is greatest at the label $\mu_g$. Estimated by maximum likelihood, the loss function of the model can be described as:

$$L_{mle} = -\log P_\Theta(x \mid I)\big|_{x=\mu_g} \propto \log\hat{\sigma} + \frac{(\mu_g - \hat{\mu})^2}{2\hat{\sigma}^2}$$

where $L_{mle}$ is the loss function, $\Theta$ denotes the model parameters being optimized, $P_\Theta(x \mid I)$ is the probability distribution, $I$ is the image feature, $\mu_g$ is the label, $\hat{\sigma}^2$ is the variance of the distribution and $\hat{\mu}$ is the sample mean.
A simple distribution can be transformed into an arbitrarily complex one with a normalizing flow model, which applies a learnable function $x = f_\phi(z)$ to a simple distribution $P(z)$ in order to represent a complex distribution $P_\phi(x)$. In the normalizing flow, the initial function $T(z_i)$ is passed through K successive transformations, producing the distribution functions $p_1(x_i), \ldots, p_{K-1}(x_i), p_K(x_i)$, with one transformation matrix for each of steps 1, 2, ..., K;
during model transformation, in order for the flow model $P_\phi(\bar{x})$ to fit the optimal underlying distribution $P_{opt}(\bar{x})$, the log-likelihood is divided into three terms: a simple distribution term $Q(\bar{x})$, for example the Gaussian distribution $N(0, I)$; a residual log-likelihood estimation term $G_\phi(\bar{x})$; and a constant term $s$, as shown in the following formula:

$$\log P_\phi(\bar{x}) = \log Q(\bar{x}) + \log G_\phi(\bar{x}) + \log s$$

During training, $Q(\bar{x})$ does not depend on the flow model, which speeds up training. When training is finished, the regression model has learned the translation and scaling parameters $\hat{\mu}$ and $\hat{\sigma}$; the invention applies this transformation to the fixed standard distribution $N(0, I)$, so that in the inference stage the translation coefficient $\hat{\mu}$ can be taken directly as the final predicted coordinate value.
Preferably, the confidences of the coordinate values of the 16 human joint points, learned separately by the three PD-ShuffleNet branches, are passed as input to the feature aggregation unit. A Concat operation combines the three results into a 3×16 matrix, in which each row corresponds to one branch and each column holds the confidences of one joint's coordinates across the branches. A Split operation then divides this matrix into 16 matrices of size 1×3, each stacking the three confidences for one part of the human pose; for example, $A_0$ holds the confidences of the head joint coordinates learned by the three branches. A max function then outputs, for $A_0, A_1, A_2, \ldots, A_{15}$, the joint coordinates $a_0, a_1, a_2, \ldots, a_{15}$ that correspond to the maximum confidence. Finally, a Concat operation yields a new set of coordinates for the 16 key points of the human pose. The results produced by the branches are integrated by weight-based channel fusion; for the three-branch structure, the integration formula is as follows:
where $i$ denotes the $i$-th channel, $T_S$ is the final integrated feature, $T_i$ is the feature value to be fused in the $i$-th channel, $C_i$ is the $i$-th channel, and $K_i\%$ is the proportion of the features generated by each channel relative to all channels.
Preferably, after the 16 joint points are obtained by the pose detection algorithm, the hand key points in consecutive frames are defined as $\mathrm{Hand}_0(x_0, y_0), \mathrm{Hand}_1(x_1, y_1), \mathrm{Hand}_2(x_2, y_2), \ldots, \mathrm{Hand}_n(x_n, y_n)$, where $\mathrm{Hand}_0$ is the athlete's hand position at the start of the movement and $\mathrm{Hand}_n$ is the hand key point position at the final moment of the movement. The depth values of the recorded key points, $z_0(x_0, y_0), z_1(x_1, y_1), z_2(x_2, y_2), \ldots, z_n(x_n, y_n)$, are acquired frame by frame through the vision sensor and infrared sensor carried by the depth camera. Because nearer objects appear larger and farther objects smaller, the side length of the hand-based adaptive ROI square is treated as linear in the depth value at that moment, which can be expressed by the following set of equations:

$$\begin{cases} L_0 = a z_0 + b \\ L_1 = a z_1 + b \\ L_2 = a z_2 + b \\ \quad\vdots \\ L_n = a z_n + b \end{cases}$$

where $L_0, L_1, L_2, \ldots, L_n$ are the side lengths of the hand-based adaptive ROI square acquired frame by frame, $a$ is the rate of change of the side length $L$ with respect to the depth value, $z_0, z_1, z_2, \ldots, z_n$ are the depth values of the hand key point at each moment, and $b$ is the offset correction caused by position changes;
the best-fitting pair $(a_0, b_0)$ is obtained from this set of equations. For any frame, the side length $L$ of the adaptive ROI square is then related to the corresponding hand depth value $z$ by:

$$L = a_0 z + b_0$$

The size of the adaptive ROI square is adjusted according to the change in $L$, so that the barbell bar is tracked and captured in real time. Once the bar is captured, a ResNet network with an added attention mechanism judges whether the athlete is in a hand-held barbell movement state.
Preferably, the Exc classifier is added on the basis of the hand key points, and a traversal frame is constructed for the classification module to judge whether the barbell is being gripped; residual blocks are used to counter degradation and vanishing gradients during image processing. Because the distance from the camera changes the size of the target region and therefore the accuracy of the classification network, a multi-scale pyramid module is introduced to improve the model's ability to extract multi-scale features, especially for small targets. The dilated pyramid structure markedly improves both the capture of multi-scale information and dense feature extraction: dilated convolutions with dilation rates 1, 2 and 4 create new multi-scale features for the lower layers of ResNet. A convolution applied before the multi-scale pyramid module yields a block module, and the 3 resulting convolution features are fused pairwise. When two convolution features are combined, a channel attention mechanism produces their weights; each weight matrix is multiplied with its corresponding convolution feature and the results are concatenated to give the final fused feature, which improves classification of distant hands.
Preferably, the depth values of the hand key points recorded frame by frame in consecutive frames, $z_0(x_0, y_0), z_1(x_1, y_1), z_2(x_2, y_2), \ldots, z_n(x_n, y_n)$, are screened one by one against an upper depth limit and a lower depth limit. At any moment, if the depth obtained from the hand key point in a frame is no smaller than the depth minimum $Z_{Min}$ and no larger than the depth maximum $Z_{Max}$, the athlete is considered to be within a reasonable exercise area:

$$Z_{Min} \le z_n(x_n, y_n) \le Z_{Max}$$

where $Z_{Min}$ and $Z_{Max}$ are the depth minimum and depth maximum, $z_n(x_n, y_n)$ is the depth value of the hand key point in a given frame, and $x_n, y_n$ are the horizontal and vertical coordinates of the hand key point in that frame.
Preferably, the images obtained by the image processing module are taken ten frames at a time as a group, and the change in height is read in real time. At a given time step there are two matrices, the best estimate $A_k$ and the covariance matrix $B_k$:

$$A_k = \begin{bmatrix} Height \\ Velocity \end{bmatrix}, \qquad B_k = \begin{bmatrix} \Sigma_{hh} & \Sigma_{hv} \\ \Sigma_{vh} & \Sigma_{vv} \end{bmatrix}$$

where Height and Velocity are the height and velocity at that moment and are abbreviated as $h$ and $v$ in the following formulas; $\Sigma_{ij}$ is the covariance between vector elements, with $ij$ ranging over the pairwise combinations of $h$ and $v$, 4 combinations in total;
assuming the previous estimate is correct, the prediction of height and velocity at time K can be expressed as:

$$A_k = T_k A_{k-1}, \qquad T_k = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}$$

where $A_k$ and $A_{k-1}$ are the state information at time K and time K-1, $T_k$ is the transformation matrix, and $\Delta t$ is the elapsed time;
since both the measured value and the predicted value are disturbed by external factors, the Gaussian distribution of the prediction and the Gaussian distribution of the measurement are each obtained and then multiplied, giving the final predicted Gaussian distribution, whose covariance is:

$$B'_k = B_k - K' H_k B_k$$

where $A_k$ is the best estimate after external interference is introduced, $B_k$ is the covariance matrix after external interference is introduced, $K'$ is the corrected Kalman gain, $A'_k$ is the best estimate of the predicted value at the next moment, $B'_k$ is the covariance matrix of the predicted value at the next moment, $R_k$ is the mean matrix of the sensor measurements, $Z_k$ is the covariance matrix of the sensor measurements, $H_k$ is the measurement correction matrix, Velocity' is the velocity of the barbell at the next moment, and Height' is the height of the barbell at the next moment.
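The predict and update cycle described above can be sketched in plain Python for the two-element state (height, velocity). This is a minimal sketch under stated assumptions, not the patented implementation: it assumes a constant-velocity transition, that only the height is measured (a measurement matrix of [1, 0]), and illustrative noise variances. The function name `kalman_step` and the parameters `process_var` and `meas_var` are hypothetical and do not appear in the patent.

```python
def kalman_step(state, cov, measured_height, dt, process_var=1e-3, meas_var=1e-2):
    """One predict/update cycle of a 2-state (height, velocity) Kalman filter.

    state: (height, velocity) best estimate from the previous step.
    cov:   2x2 covariance as ((p00, p01), (p10, p11)).
    Assumption: only height is measured, i.e. H = [1, 0].
    """
    h, v = state
    (p00, p01), (p10, p11) = cov

    # Predict: A_k = T A_{k-1} and B_k = T B_{k-1} T^T + Q, with T = [[1, dt], [0, 1]].
    h_pred = h + dt * v
    v_pred = v
    q00 = p00 + dt * (p01 + p10) + dt * dt * p11 + process_var
    q01 = p01 + dt * p11
    q10 = p10 + dt * p11
    q11 = p11 + process_var

    # Kalman gain K' = B H^T (H B H^T + R)^-1 for H = [1, 0].
    s = q00 + meas_var
    k0 = q00 / s
    k1 = q10 / s

    # Update the estimate with the measurement residual.
    residual = measured_height - h_pred
    h_new = h_pred + k0 * residual
    v_new = v_pred + k1 * residual

    # Covariance update B' = B - K' H B.
    n00 = q00 - k0 * q00
    n01 = q01 - k0 * q01
    n10 = q10 - k1 * q00
    n11 = q11 - k1 * q01
    return (h_new, v_new), ((n00, n01), (n10, n11))


if __name__ == "__main__":
    state, cov = (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))
    dt = 0.1
    # Synthetic bar heights rising at 1 m/s, one sample per frame.
    for k in range(1, 101):
        state, cov = kalman_step(state, cov, measured_height=k * dt, dt=dt)
    print("estimated height and velocity:", state)
```

Measuring only the height keeps the innovation scalar, so no matrix inversion is needed; a filter that also measures velocity would need a 2×2 inverse in the gain.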
Compared with the prior art, the invention has the following beneficial effects:
1. The invention takes the color image and the depth image together as the input of the algorithm and processes the data on a computer, which removes the need to wear peripheral devices in order to collect training data.
2. The training mode is velocity-based strength training: the athlete's explosive power and strength are judged from the speed, which prevents injuries caused by blindly increasing the load.
3. The invention provides a self-developed Exc-Pose pose detection algorithm that introduces a regression-model-based detection scheme on top of a lightweight encoding layer and adopts the E-shaped structure proposed by the invention, so that the detected pose is accurate, the amount of computation is reduced and detection is faster.
4. The invention provides an algorithm that delimits the region of interest from human key points, and adds a multi-scale pyramid to a ResNet network to form multi-scale features, so that distant regions of interest can also be delimited.
5. The human body capturing module and the motion monitoring module adopted by the invention monitor the human body in real time and record data, which helps raise the athlete's strength level and improve performance in sport.
6. The invention reduces the burden on coaches, acquires athletes' movement indexes accurately and efficiently through the sensors, and promotes the development of intelligent sports in China.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the Exc-Pose pose detection algorithm in the present invention;
FIG. 3 is a diagram of the PD-ShuffleNet network in accordance with the present invention;
FIG. 4 is a diagram of the "E" network configuration in accordance with the present invention;
FIG. 5 is a detailed block diagram of the WFU module;
FIG. 6 is a block diagram of the RLE-Decoder according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings, so that those skilled in the art can better understand the advantages and features of the invention and the scope of protection is more clearly defined. The described embodiments are only some, not all, of the embodiments of the present invention; all other embodiments obtained by one of ordinary skill in the art without inventive effort fall within the scope of the present invention.
Referring to FIG. 1, a speed and strength feedback system based on a depth camera comprises an image acquisition module, a human body capturing module, a motion monitoring module and a speed and strength calculation module;
the image acquisition module uses its built-in vision sensors to monitor both the athlete and the athlete's surroundings, and can acquire color images and depth images simultaneously; the image acquisition module consists of two depth cameras fixed in a cross arrangement, one horizontal and one vertical;
the human body capturing module efficiently locates 16 key points of the human body; its core is the Exc-Pose algorithm, which comprises a lightweight E-shaped encoding layer and an RLE-Decoder decoding layer trained by regression-based supervised learning;
the motion monitoring module distinguishes genuine movements from false ones: a hand region is generated from the hand key points located by the human body capturing module, and an Exc binary classifier applied to that region decides whether the movement is genuine;
the speed and strength calculation module computes the relevant training information using a multi-frame-fusion Kalman filtering algorithm;
the speed and strength feedback system operates in the following steps:
S1: from the color and depth videos acquired by two Intel RealSense D435 high-definition cameras, images containing the athlete are obtained frame by frame and enhanced;
S2: human body detection is performed with a non-contact, lightweight human body detection algorithm based on log-likelihood estimation and a regression model, and the human body region in the image is segmented;
S3: coordinate values of the hand key points are extracted from the detected human body region and converted into movement indexes such as speed, power, average speed and average power.
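Step S3 does not spell out how key point coordinates become speed and power. The sketch below shows one conventional way to do it, offered only as an illustration: velocity by finite differences of the bar height, force as F = m(g + a), and power as P = F·v. The function name `speed_and_power`, its parameters, and the assumption that heights are already converted to metres are all hypothetical and not taken from the patent.

```python
G = 9.81  # gravitational acceleration in m/s^2


def speed_and_power(heights_m, dt_s, load_kg):
    """Return per-interval velocities, average speed and average power.

    heights_m: vertical position of the hand key point (bar) per frame, in metres.
    dt_s:      time between consecutive frames, in seconds.
    load_kg:   mass of the barbell, in kilograms.
    """
    if len(heights_m) < 3:
        raise ValueError("need at least three height samples")
    # Finite-difference velocity between consecutive frames.
    velocities = [(h1 - h0) / dt_s for h0, h1 in zip(heights_m, heights_m[1:])]
    # Finite-difference acceleration between consecutive velocities.
    accels = [(v1 - v0) / dt_s for v0, v1 in zip(velocities, velocities[1:])]
    # Force applied to the bar is m * (g + a); instantaneous power is F * v.
    powers = [load_kg * (G + a) * v for a, v in zip(accels, velocities[1:])]
    avg_speed = sum(velocities) / len(velocities)
    avg_power = sum(powers) / len(powers)
    return velocities, avg_speed, avg_power


if __name__ == "__main__":
    # A 100 kg bar rising 0.5 m every 0.5 s, i.e. a steady 1 m/s.
    _, avg_speed, avg_power = speed_and_power([0.0, 0.5, 1.0, 1.5], 0.5, 100.0)
    print(avg_speed, avg_power)
```

In practice the raw heights would be smoothed first, for example by the Kalman filter of the speed and strength calculation module, because differencing amplifies measurement noise.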
Specifically, referring to FIG. 2, in the human body capturing module the Exc-Pose algorithm comprises target detection and pose detection, and consists mainly of a PD-ShuffleNet encoding layer and an RLE-Decoder decoding layer. PD-ShuffleNet extracts finer low-level features through its three-branch structure, and the RLE-Decoder applies regression-based supervised learning to the extracted features, using residual log-likelihood estimation and re-parameterization, to regress the coordinate values of the target key points directly.
Referring to FIG. 3, FIG. 3(a) and FIG. 3(b) show the basic module and the downsampling module respectively. When the stride is 1, a channel split operation first divides the channels into two branches, replacing the original group convolution structure; the 1×1 group convolution is replaced by an ordinary 1×1 convolution and the channel shuffle that followed it is removed. After the 3×3 depthwise separable convolution and the ordinary 1×1 convolution, the two branches are joined by concatenation instead of the original direct addition, and a final channel shuffle fuses the information between groups. When the stride is 2, the channel split operation is removed, and the original 3×3 average pooling is replaced by a 3×3 depthwise separable convolution and an ordinary 1×1 convolution.
Referring to FIG. 4, after an image is input, a convolution operation and a max pooling operation are performed, followed by three stages, Stage2, Stage3 and Stage4, each formed by stacking a different number of shuffle units. The first unit of each stage is a shuffle unit with stride=2, which implements downsampling; Stage2 and Stage4 each consist of one stride=2 shuffle unit followed by 3 stride=1 shuffle units. The PD-ShuffleNet designed here divides the network into three branches that form an "E" structure, and each branch learns the low-level features of the target separately. Specifically, after Stage2 ends, Stage3 and Stage4 are replicated to form a 3-branch E-shaped structure: after their stride=2 shuffle unit, Stage3-1, Stage3-2 and Stage3-3 stack 5, 7 and 9 shuffle units with stride=1 respectively, and each is then connected to a Stage4, so that the network learns low-level features at different depths.
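The stage layout described above can be written down as a small data structure, which makes the unit counts per branch easy to check. This is only a descriptive sketch: the names `stage` and `pd_shufflenet_layout` and the string labels are hypothetical, and no actual network layers are built.

```python
def stage(num_stride1_units):
    """One stage: a stride=2 shuffle unit (downsampling) followed by stride=1 units."""
    return ["shuffle_unit(stride=2)"] + ["shuffle_unit(stride=1)"] * num_stride1_units


def pd_shufflenet_layout(branch_depths=(5, 7, 9)):
    """Describe the E-shaped layout: shared stem and Stage2, then one
    Stage3/Stage4 pair per branch with a different Stage3 depth."""
    return {
        "stem": ["conv", "max_pool"],
        "stage2": stage(3),
        "branches": [
            {"stage3": stage(depth), "stage4": stage(3)} for depth in branch_depths
        ],
    }


if __name__ == "__main__":
    layout = pd_shufflenet_layout()
    for index, branch in enumerate(layout["branches"], start=1):
        print(f"Stage3-{index}: {len(branch['stage3'])} units,",
              f"Stage4: {len(branch['stage4'])} units")
```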
Referring to FIG. 5, the WFU module passes the confidences of the coordinate values of the 16 human joint points, learned separately by the three PD-ShuffleNet branches, as input to the feature aggregation unit. A Concat operation combines the three results into a 3×16 matrix, in which each row corresponds to one branch and each column holds the confidences of one joint's coordinates across the branches. A Split operation then divides this matrix into 16 matrices of size 1×3, each stacking the three confidences for one part of the human pose; for example, $A_0$ holds the confidences of the head joint coordinates learned by the three branches. A max function then outputs, for $A_0, A_1, A_2, \ldots, A_{15}$, the joint coordinates $a_0, a_1, a_2, \ldots, a_{15}$ that correspond to the maximum confidence. Finally, a Concat operation yields a new set of coordinates for the 16 key points of the human pose. The results produced by the branches are integrated by weight-based channel fusion; for the three-branch structure, the integration formula is as follows:
where $i$ denotes the $i$-th channel, $T_S$ is the final integrated feature, $T_i$ is the feature value to be fused in the $i$-th channel, $C_i$ is the $i$-th channel, and $K_i\%$ is the proportion of the features generated by each channel relative to all channels.
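The Concat, Split and max steps of the WFU module amount to choosing, for each joint, the coordinate from whichever branch is most confident. The sketch below illustrates only that selection in plain Python; the function name `fuse_by_max_confidence` and the nested-list data layout are assumptions made for illustration, and the weight-based channel integration is not covered.

```python
def fuse_by_max_confidence(branch_coords, branch_confs):
    """Pick, per joint, the coordinate from the most confident branch.

    branch_coords[b][j] is the (x, y) coordinate of joint j from branch b.
    branch_confs[b][j]  is the confidence of that coordinate.
    Returns one (x, y) per joint.
    """
    num_branches = len(branch_confs)
    num_joints = len(branch_confs[0])
    fused = []
    for j in range(num_joints):
        # "Split": the confidences of joint j across all branches.
        column = [branch_confs[b][j] for b in range(num_branches)]
        # "max": index of the branch with the highest confidence.
        best = max(range(num_branches), key=lambda b: column[b])
        fused.append(branch_coords[best][j])
    # "Concat": the list of selected coordinates, one per joint.
    return fused


if __name__ == "__main__":
    coords = [[(0, 0), (1, 1)], [(2, 2), (3, 3)], [(4, 4), (5, 5)]]
    confs = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.3]]
    print(fuse_by_max_confidence(coords, confs))
```

With three branches and 16 joints the inputs are 3×16, matching the matrix described above; the example uses two joints only to stay short.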
Referring to FIG. 6, the RLE-Decoder learns the coordinates of the target joint points directly by regression-based supervised learning. For the image feature $I$ learned by the PD-ShuffleNet encoding layer, the decoding layer uses a regression model to predict the probability that the target joint point lies at position $x$; this probability distribution is written $P_\Theta(x \mid I)$, where $\Theta$ denotes the parameters learned by the model. Supervision consists of optimizing the model parameters $\Theta$ so that the predicted probability is greatest at the label $\mu_g$. Estimated by maximum likelihood, the loss function of the model can be described as:

$$L_{mle} = -\log P_\Theta(x \mid I)\big|_{x=\mu_g} \propto \log\hat{\sigma} + \frac{(\mu_g - \hat{\mu})^2}{2\hat{\sigma}^2}$$

where $L_{mle}$ is the loss function, $\Theta$ denotes the model parameters being optimized, $P_\Theta(x \mid I)$ is the probability distribution, $I$ is the image feature, $\mu_g$ is the label, $\hat{\sigma}^2$ is the variance of the distribution and $\hat{\mu}$ is the sample mean.
A simple distribution can be transformed into an arbitrarily complex one with a normalizing flow model, which applies a learnable function $x = f_\phi(z)$ to a simple distribution $P(z)$ in order to represent a complex distribution $P_\phi(x)$. In the normalizing flow, the initial function $T(z_i)$ is passed through K successive transformations, producing the distribution functions $p_1(x_i), \ldots, p_{K-1}(x_i), p_K(x_i)$, with one transformation matrix for each of steps 1, 2, ..., K;
during model transformation, in order for the flow model $P_\phi(\bar{x})$ to fit the optimal underlying distribution $P_{opt}(\bar{x})$, the log-likelihood is divided into three terms: a simple distribution term $Q(\bar{x})$, for example the Gaussian distribution $N(0, I)$; a residual log-likelihood estimation term $G_\phi(\bar{x})$; and a constant term $s$, as shown in the following formula:

$$\log P_\phi(\bar{x}) = \log Q(\bar{x}) + \log G_\phi(\bar{x}) + \log s$$

During training, $Q(\bar{x})$ does not depend on the flow model, which speeds up training. When training is finished, the regression model has learned the translation and scaling parameters $\hat{\mu}$ and $\hat{\sigma}$; the invention applies this transformation to the fixed standard distribution $N(0, I)$, so that in the inference stage the translation coefficient $\hat{\mu}$ can be taken directly as the final predicted coordinate value.
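To make the maximum-likelihood supervision concrete, the sketch below evaluates the negative log-likelihood of a label coordinate under a one-dimensional Gaussian with predicted mean and standard deviation. It is a deliberate simplification: it covers only the simple Gaussian term and omits the residual term learned by the flow model. The function name `gaussian_nll` and its argument names are hypothetical.

```python
import math


def gaussian_nll(mu_hat, sigma_hat, mu_g):
    """Negative log-likelihood of label mu_g under N(mu_hat, sigma_hat^2).

    mu_hat:    predicted coordinate (mean of the distribution).
    sigma_hat: predicted standard deviation, must be positive.
    mu_g:      ground-truth label coordinate.
    """
    if sigma_hat <= 0:
        raise ValueError("sigma_hat must be positive")
    return (
        math.log(sigma_hat)
        + (mu_g - mu_hat) ** 2 / (2.0 * sigma_hat ** 2)
        + 0.5 * math.log(2.0 * math.pi)
    )


if __name__ == "__main__":
    # The loss is smallest when the prediction coincides with the label.
    print(gaussian_nll(mu_hat=0.50, sigma_hat=0.1, mu_g=0.50))
    print(gaussian_nll(mu_hat=0.40, sigma_hat=0.1, mu_g=0.50))
```

Minimizing this quantity pulls the predicted mean toward the label while letting the predicted deviation express uncertainty, which is why the mean can serve as the coordinate estimate at inference.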
A multi-scale pyramid module is introduced to improve the model's ability to extract multi-scale features, especially for small targets. The dilated pyramid structure markedly improves both the capture of multi-scale information and dense feature extraction: dilated convolutions with dilation rates 1, 2 and 4 create new multi-scale features for the lower layers of ResNet. A convolution applied before the multi-scale pyramid module yields a block module, and the 3 resulting convolution features are fused pairwise. When two convolution features are combined, a channel attention mechanism produces their weights; each weight matrix is multiplied with its corresponding convolution feature and the results are concatenated to give the final fused feature, which improves classification of distant hand key point regions.
EXC screening: the depth values of the hand key points recorded frame by frame in consecutive frames, $z_0(x_0, y_0), z_1(x_1, y_1), z_2(x_2, y_2), \ldots, z_n(x_n, y_n)$, are screened one by one against an upper depth limit and a lower depth limit. At any moment, if the depth obtained from the hand key point in a frame is no smaller than the depth minimum $Z_{Min}$ and no larger than the depth maximum $Z_{Max}$, the athlete is considered to be within a reasonable exercise area:

$$Z_{Min} \le z_n(x_n, y_n) \le Z_{Max}$$

where $Z_{Min}$ and $Z_{Max}$ are the depth minimum and depth maximum, $z_n(x_n, y_n)$ is the depth value of the hand key point in a given frame, and $x_n, y_n$ are the horizontal and vertical coordinates of the hand key point in that frame.
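The depth screening rule is a simple range check per frame. A minimal sketch follows; the function name `screen_frames`, the use of `None` for frames without a valid depth reading, and the example limits are assumptions made for illustration.

```python
def screen_frames(depths, z_min, z_max):
    """Return the indices of frames whose hand depth lies in [z_min, z_max].

    depths: hand key point depth per frame; None marks a missing reading.
    """
    if z_min > z_max:
        raise ValueError("z_min must not exceed z_max")
    return [
        index
        for index, z in enumerate(depths)
        if z is not None and z_min <= z <= z_max
    ]


if __name__ == "__main__":
    # Depths in metres; frames 0 and 3 fall outside the 1.0 to 3.0 m zone.
    print(screen_frames([0.5, 1.2, None, 3.5, 2.0], z_min=1.0, z_max=3.0))
```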
The hand key point region is acquired as follows. After the 16 joint points are obtained by the posture detection algorithm, the hand key points in consecutive frames are defined as $Hand_0(x_0, y_0), Hand_1(x_1, y_1), Hand_2(x_2, y_2), \ldots, Hand_n(x_n, y_n)$, where the coordinates of $Hand_0$ correspond to the athlete's hand position at the start of the movement and the coordinates of $Hand_n$ correspond to the hand key point position at the final moment of the movement. The depth values $z_0(x_0, y_0), z_1(x_1, y_1), z_2(x_2, y_2), \ldots, z_n(x_n, y_n)$ of the recorded key points are acquired frame by frame by the visual sensor and the infrared sensor carried by the depth camera. Because nearer objects appear larger and farther objects smaller, the side length of the hand-based adaptive square ROI frame is in a linear relation with the depth value at that moment, which can be expressed by the following set of equations:
$$L_i = a z_i + b, \qquad i = 0, 1, 2, \ldots, n$$
where $L_0, L_1, L_2, \ldots, L_n$ are the side lengths of the hand-based adaptive square ROI frame acquired frame by frame, $a$ is the rate of change of the side length $L$ relative to the depth value, $z_0, z_1, z_2, \ldots, z_n$ are the depth values of the hand key point at the corresponding moments, and $b$ is the offset correction caused by position change.
Solving this set of equations yields the best-matching pair $(a_0, b_0)$. For any frame, the side length $L$ of the adaptive square ROI frame is then related to the corresponding hand depth value $z$ by:
$$L = a_0 z + b_0$$
As $L$ changes, the size of the adaptive square ROI frame is adjusted flexibly so that the barbell bar is tracked and captured in real time. Once the bar is captured, a ResNet network with an added attention mechanism judges whether the athlete is in a hand-held barbell movement state.
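One way to obtain a best-matching pair from the per-frame side lengths and depths is an ordinary least-squares line fit. The text does not say which fitting method is used, so least squares is an assumption here, as are the function names and the sample numbers.

```python
def fit_side_length(depths, sides):
    """Least-squares fit of L = a0 * z + b0 over the recorded frames."""
    n = len(depths)
    mean_z = sum(depths) / n
    mean_l = sum(sides) / n
    var_z = sum((z - mean_z) ** 2 for z in depths)
    if var_z == 0:
        raise ValueError("depths must not all be equal")
    cov_zl = sum((z - mean_z) * (l - mean_l) for z, l in zip(depths, sides))
    a0 = cov_zl / var_z
    b0 = mean_l - a0 * mean_z
    return a0, b0


def roi_side(depth, a0, b0):
    """Side length of the adaptive square ROI frame for a new depth value."""
    return a0 * depth + b0


# Hypothetical calibration frames: depth in metres, side length in pixels.
a0, b0 = fit_side_length([1.0, 2.0, 3.0], [200.0, 150.0, 100.0])
side_at_2_5 = roi_side(2.5, a0, b0)
```

With the fitted pair in hand, each new frame only needs one multiplication and one addition to resize the ROI, which suits real-time tracking of the bar.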
The technical indexes (speed, power, average speed and average power) are measured and calculated as follows. The real-time height change is read with every ten frames of images obtained by the image processing module taken as a group. At time $k-1$ there are two matrices, the best estimate $A_{k-1}$ and the covariance matrix $B_{k-1}$:
$$A_{k-1} = \begin{bmatrix} Height \\ Velocity \end{bmatrix}, \qquad B_{k-1} = \begin{bmatrix} \Sigma_{hh} & \Sigma_{hv} \\ \Sigma_{vh} & \Sigma_{vv} \end{bmatrix}$$
where Height and Velocity are the height and velocity at that moment, $h$ and $v$ designate these two variables in the subsequent formulas, and $\Sigma_{ij}$ is the covariance between the state elements, $ij$ being any pairwise combination of $h$ and $v$ (4 combinations in total).
Provided that the original estimate is correct, the height and velocity at time $k$ can be expressed as:
$$A_k = T_k A_{k-1}, \qquad T_k = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}$$
where $A_k$ and $A_{k-1}$ represent the state at time $k$ and at time $k-1$, $T_k$ is the transformation matrix, and $\Delta t$ is the elapsed time.
When external interference is present, additional uncertainty arises around the original predicted value, so the formula must be corrected: the untracked interference is treated as noise with covariance $Q_k$.
$$A_k = T_k A_{k-1}$$
$$B_k = T_k B_{k-1} T_k^{T} + Q_k$$
where $A_k$ and $B_k$ denote the position matrix and the covariance matrix at time $k$, $T_k$ is the transformation matrix, and $Q_k$ is the interference correction matrix.
Once actual sensor measurements are added, the sensor readings themselves follow a Gaussian distribution with covariance $Z_k$ and mean $R_k$.
The prediction formula corrected to take the measured value into account is:
$$A_{exp} = H_k A_k$$
$$B_{exp} = H_k B_k H_k^{T}$$
where $A_{exp}$ and $B_{exp}$ denote the position matrix and the covariance matrix at time $k$ expressed in terms of the measured quantities, and $H_k$ is the measured value correction matrix.
Multiplying the Gaussian distribution of the predicted value by the Gaussian distribution of the measured value gives the final predicted Gaussian distribution:
$$K' = B_k H_k^{T} \left( H_k B_k H_k^{T} + Z_k \right)^{-1}$$
$$A'_k = A_k + K' \left( R_k - H_k A_k \right)$$
$$B'_k = B_k - K' H_k B_k$$
where $A_k$ is the best estimate after external interference is introduced, $B_k$ is the covariance matrix after external interference is introduced, and $K'$ is the corrected Kalman coefficient. $A'_k$ is the best estimate of the predicted value at the next moment, $B'_k$ is the covariance matrix of the predicted value at the next moment, $R_k$ is the mean matrix of the sensor measurements, $Z_k$ is the covariance matrix of the sensor measurements, and $H_k$ is the measured value correction matrix. The components of $A'_k$ are Height' and Velocity', the height and the speed of the barbell at the next moment.
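The predict and update steps described above can be sketched for a two-element state (height, velocity) in plain Python. The sketch assumes a constant-velocity transformation matrix, a height-only measurement (H = [1, 0]), a diagonal interference matrix, and illustrative noise values; none of these specifics are stated in the text, and the function names are hypothetical.

```python
def kalman_predict(state, cov, dt, q):
    """A_k = T A_{k-1}, B_k = T B_{k-1} T^T + Q, with T = [[1, dt], [0, 1]], Q = diag(q, q)."""
    h, v = state
    (b00, b01), (b10, b11) = cov
    p00 = b00 + dt * (b01 + b10) + dt * dt * b11 + q
    p01 = b01 + dt * b11
    p10 = b10 + dt * b11
    p11 = b11 + q
    return (h + dt * v, v), ((p00, p01), (p10, p11))


def kalman_update(state, cov, measured_height, z_var):
    """Fuse one height reading (mean R_k, variance Z_k) using H = [1, 0]."""
    h, v = state
    (p00, p01), (p10, p11) = cov
    s = p00 + z_var                    # H B H^T + Z
    k0, k1 = p00 / s, p10 / s          # Kalman coefficient K'
    residual = measured_height - h     # R_k - H A_k
    new_state = (h + k0 * residual, v + k1 * residual)
    new_cov = ((p00 - k0 * p00, p01 - k0 * p01),
               (p10 - k1 * p00, p11 - k1 * p01))   # B - K' H B
    return new_state, new_cov


def track(heights, dt, q=1e-4, z_var=1e-4):
    """Run the filter over a sequence of measured barbell heights."""
    state, cov = (heights[0], 0.0), ((1.0, 0.0), (0.0, 1.0))
    for reading in heights[1:]:
        state, cov = kalman_predict(state, cov, dt, q)
        state, cov = kalman_update(state, cov, reading, z_var)
    return state


# Hypothetical bar rising at a steady 0.5 m/s, sampled every 0.1 s.
heights = [0.5 * 0.1 * i for i in range(51)]
final_height, final_velocity = track(heights, dt=0.1)
```

Since only height is measured, the velocity is never observed directly; it is inferred through the off-diagonal covariance terms, which is what lets the filter report speed from depth readings alone.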
In summary, by combining a well-designed structure with the associated algorithms, the invention captures, without contact, the core technical indexes generated by athletes during physical training, such as posture, speed, strength and power, and digitizes them to guide scientific training.
The description and application of the invention given here will be readily understood by those skilled in the art, who may make several modifications and adaptations without departing from the principles of the invention. Such modifications or improvements, made without departing from the spirit of the invention, are likewise to be considered within its scope.

Claims (7)

1. A speed and strength feedback system based on a depth camera, characterized by comprising an image acquisition module, a human body capturing module, a motion monitoring module and a speed and strength calculation module;
the image acquisition module uses built-in visual sensors to monitor the athlete and the athlete's external environment respectively, and can acquire color images and depth images simultaneously; the image acquisition module consists of two depth cameras, which are installed and fixed horizontally and vertically in a cross shape;
the human body capturing module is used for locating 16 human body joint points, takes the Exc-Pose lightweight human body posture detection algorithm as its core, comprises target detection and posture detection, and specifically comprises an E-ShuffleNet encoding layer, an RLE-Decoder decoding layer and a WFU module; the E-ShuffleNet encoding layer extracts fine low-level features through a three-branch structure, and the RLE-Decoder decoding layer performs regression-based supervised learning on the extracted low-level features through residual log-likelihood estimation and re-parameterization, directly regressing the coordinate values of the target joint points;
after an image is input, one convolution operation and one max pooling operation are performed, followed by three stages, Stage2, Stage3 and Stage4, each formed by stacking a different number of shuffle units, wherein the first module of each stage uses a shuffle unit with stride=2 to perform downsampling, and Stage2 and Stage4 are each formed by stacking one shuffle unit with stride=2 and 3 shuffle units with stride=1; E-ShuffleNet divides the network into three branches to form an E-shaped structure, and each branch learns the low-level features of the target; the specific operation is as follows: after Stage2, Stage3 and Stage4 are duplicated to form a 3-branch E-shaped structure, wherein Stage3-1, Stage3-2 and Stage3-3 each use a shuffle unit with stride=2 and then stack 5, 7 and 9 shuffle units with stride=1 respectively before connecting to Stage4, so that the network learns image features at different levels;
the shuffle unit with stride=1 uses a channel split operation to divide the channels into two branches in place of the original group convolution structure; the 1×1 group convolution is changed to an ordinary 1×1 convolution and the channel shuffle operation that followed it is removed; after the 3×3 depthwise separable convolution and the 1×1 ordinary convolution, the two branches are joined by concatenation in place of the original direct addition, and finally a channel shuffle operation is performed to fuse information between groups; in the shuffle unit with stride=2, the channel split operation is removed, and the original 3×3 average pooling operation is replaced by a 3×3 depthwise separable convolution and a 1×1 ordinary convolution;
for the image features learned by the E-ShuffleNet encoding layer, the RLE-Decoder decoding layer predicts, through a regression model, the confidences corresponding to the coordinate values of the 16 human body joint points;
the WFU module passes the confidences corresponding to the coordinate values of the 16 joint points as input to the feature aggregation unit and then performs a Concat operation, merging the results into a 3×16 matrix in which each row corresponds to one branch and each column holds the confidences of the coordinates of one joint point across the branches; a Split operation is then performed, that is, the matrix is divided by columns into 16 1×3 matrices, each of which stacks the three different confidences of one part of the human body posture; for example, $A_0$ holds the confidences of the head joint point coordinates learned by the three branches; a max function then outputs the joint point coordinates $a_0, a_1, a_2, \ldots, a_{15}$ corresponding to the maximum confidence in each of $A_0, A_1, A_2, \ldots, A_{15}$; finally a Concat operation is performed to obtain the new coordinates of the 16 joint points of the human body posture; the result generated by each branch undergoes weight-based intelligent channel integration, and because of the three-branch structure the integration process is given by the following formula:
where $j$ denotes the $j$-th channel, $T_S$ is the final integrated feature, $T_j$ is the feature value to be fused in the $j$-th channel, $C_j$ pertains to the $j$-th channel, and $K_j\%$ represents the proportion of the features generated by each channel relative to all channels;
the motion monitoring module is used for distinguishing genuine movements from false ones: it screens the coordinate values of the 16 human body joint points detected by the human body capturing module to obtain the hand joint points, generates a hand region on the basis of the hand joint points, and judges genuine and false movements within that region using the Exc classification algorithm; the speed and strength calculation module is used for calculating the motion technical indexes, using a multi-frame-fusion Kalman filtering algorithm;
the working method of the speed and strength feedback system comprises the following specific steps:
S1: from the color and depth videos acquired by the two depth cameras, extracting frame by frame the images that contain the athlete, and applying enhancement processing to the images;
S2: performing human body detection using the Exc-Pose lightweight human body posture detection algorithm based on log-likelihood estimation and a regression model, and segmenting the human body region in the image;
S3: extracting the coordinate values of the hand joint points from the detected human body region, and converting them in the speed and strength calculation module into motion technical indexes such as speed, power, average speed and average power.
2. The speed and strength feedback system based on a depth camera according to claim 1, wherein the motion monitoring module screens the coordinate values of the 16 joint points detected by the Exc-Pose lightweight human body posture detection algorithm to obtain the hand joint point coordinates $Hand_n(x_n, y_n)$, and generates, frame by frame, an adaptive bounding rectangular region of the hand on the basis of the hand joint point; this region adjusts automatically as the athlete moves closer or farther away, so that the barbell always remains within the region; the motion monitoring module judges, through the Exc classification algorithm, whether the athlete in each frame is in a hand-held barbell state.
3. The speed and strength feedback system based on a depth camera according to claim 2, wherein the RLE-Decoder serves as the decoding layer and, for the image feature $I$ learned by the E-ShuffleNet encoding layer, predicts through a regression model the confidence that the target joint point lies at position $x$, the confidence distribution being denoted $P_\Theta(x|I)$, where $\Theta$ represents the model parameters to be optimized; the whole supervision process learns and optimizes the model parameters $\Theta$ so that the confidence of the model prediction at the annotated label $\mu_g$ is maximal, and the loss function of the model under maximum likelihood estimation is described as:
where $L_{mle}$ is the loss function, $\Theta$ denotes the model parameters to be optimized, $P_\Theta(x|I)$ is the confidence distribution, $I$ is the image feature, $\mu_g$ is the label, [symbol missing] is the variance of the distribution, and [symbol missing] is the sample mean;
a normalizing flow model is used to transform a simple distribution into an arbitrarily complex distribution: the flow maps a simple distribution $P(z)$ through a learnable function $m = f_\phi(z)$ to represent a complex distribution $P_\phi(x)$; the normalizing flow is illustrated as follows:
where $p_1(x_i), p_2(x_i), \ldots, p_{K-1}(x_i), p_K(x_i)$ represent the distribution functions, $T(z_i)$ is the origin function, and [symbols missing] are the transformation matrices of the 1st, 2nd, ... steps;
in the model transformation process, the flow model $P_\phi(x)$ is made to fit the optimal underlying distribution [symbol missing] by the following formula, whose terms fall into three types: a simple distribution term [symbol missing], e.g. a Gaussian distribution [symbol missing]; a residual log-likelihood estimation term [symbol missing]; and a constant term $\log s$; the process is expressed by the formula:
during training, the term [symbol missing] does not depend on the flow model, which speeds up training of the model.
4. The speed and strength feedback system based on a depth camera according to claim 3, wherein after the 16 joint points are obtained by the Exc-Pose lightweight human body posture detection algorithm, the hand joint points in consecutive frames are defined as $Hand_0(x_0, y_0), Hand_1(x_1, y_1), Hand_2(x_2, y_2), \ldots, Hand_n(x_n, y_n)$, where the coordinates of $Hand_0$ correspond to the athlete's hand position at the start of the movement and the coordinates of $Hand_n$ correspond to the hand joint point position at the final moment of the movement; the depth values $z_0(x_0, y_0), z_1(x_1, y_1), z_2(x_2, y_2), \ldots, z_n(x_n, y_n)$ of the recorded joint points are acquired frame by frame by the visual sensor and the infrared sensor carried by the depth camera; because nearer objects appear larger and farther objects smaller, the side length of the hand-based adaptive square ROI frame is in a linear relation with the depth value at that moment, expressed by the following set of equations:
$$L_i = c z_i + d, \qquad i = 0, 1, 2, \ldots, n$$
where $L_0, L_1, L_2, \ldots, L_n$ are the side lengths of the hand-based adaptive square ROI frame acquired frame by frame, $c$ is the rate of change of the side length $L$ relative to the depth value, $z_0, z_1, z_2, \ldots, z_n$ are the depth values of the hand joint point at the corresponding moments, and $d$ is the offset correction caused by position change;
solving the set of equations yields the best-matching pair $(c_0, d_0)$; for any frame, the side length $L$ of the adaptive square ROI frame is related to the corresponding hand depth value $z$ by:
$$L = c_0 z + d_0$$
as $L$ changes, the size of the adaptive square ROI frame is adjusted flexibly so that the barbell bar is tracked and captured in real time; once the bar is captured, a ResNet network with an added attention mechanism judges whether the athlete is in a hand-held barbell movement state.
5. The speed and strength feedback system based on a depth camera according to claim 4, wherein the Exc classification algorithm is added on the basis of the hand joint points, constructing a traversal frame for the Exc classification algorithm that judges whether the barbell is being grasped, and residual blocks are used to solve the degradation and vanishing gradient problems during image processing; because the distance from the camera affects the size of the target region and thereby the accuracy of the classification network, a multi-scale pyramid module is introduced to improve the model's ability to extract multi-scale features, especially for small targets; in the dilated pyramid structure, dilated convolutions with dilation rates 1, 2 and 4 create new multi-scale features for the bottom layer of ResNet; the block module is obtained by convolution ahead of the multi-scale pyramid module; the 3 resulting convolution features are fused pairwise, a channel attention mechanism provides the weights of any two convolution features when they are combined, and the final fused feature is obtained by multiplying each weight matrix with its corresponding convolution feature and concatenating the results, which improves hand classification at long range.
6. The speed and strength feedback system based on a depth camera according to claim 5, wherein the depth values $z_0(x_0, y_0), z_1(x_1, y_1), z_2(x_2, y_2), \ldots, z_n(x_n, y_n)$ of the hand joint point, recorded frame by frame over consecutive frames, are screened one by one against a preset depth upper limit and a preset depth lower limit; at any moment, if the depth obtained from the hand joint point in a frame is both no smaller than the depth minimum $Z_{Min}$ and no larger than the depth maximum $Z_{Max}$, the athlete is considered to be within a reasonable exercise area:
$$Z_{Min} \le z_n(x_n, y_n) \le Z_{Max}$$
where $Z_{Min}$ and $Z_{Max}$ denote the depth minimum and the depth maximum, $z_n(x_n, y_n)$ denotes the depth value of the hand joint point in a given frame, and $x_n, y_n$ denote the horizontal and vertical coordinates of the hand joint point in that frame.
7. The speed and strength feedback system based on a depth camera according to claim 1, wherein the real-time height change is read with every ten frames of images obtained by the image processing module taken as a group, and at time $k-1$ there are two matrices, the best estimate $A_{k-1}$ and the covariance matrix $B_{k-1}$:
$$A_{k-1} = \begin{bmatrix} Height \\ Velocity \end{bmatrix}, \qquad B_{k-1} = \begin{bmatrix} \Sigma_{hh} & \Sigma_{hv} \\ \Sigma_{vh} & \Sigma_{vv} \end{bmatrix}$$
in the best estimate $A_{k-1}$, Height and Velocity are the height and velocity at that moment; in the covariance matrix $B_{k-1}$, $h$ is the height distribution, $v$ is the velocity distribution, $\Sigma_{hh}$ is the covariance of the height distribution, $\Sigma_{hv}$ and $\Sigma_{vh}$ are the covariances between the height distribution and the velocity distribution, and $\Sigma_{vv}$ is the covariance of the velocity distribution;
provided that the original estimate is correct, the height and velocity at time $k$ can be expressed as:
$$A_k = T_k A_{k-1}, \qquad T_k = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}$$
where $A_k$ and $A_{k-1}$ represent the best estimates at time $k$ and at time $k-1$ respectively, $T_k$ is the transformation matrix, and $\Delta t$ is the elapsed time;
considering that both the measured value and the predicted value are disturbed by external factors, the Gaussian distribution of the predicted value and the Gaussian distribution of the measured value are obtained separately and multiplied, giving the final predicted Gaussian distribution:
$$K' = B_k H_k^{T} \left( H_k B_k H_k^{T} + Z_k \right)^{-1}$$
$$A'_k = A_k + K' \left( R_k - H_k A_k \right)$$
$$B'_k = B_k - K' H_k B_k$$
where $A_k$ is the best estimate at time $k$, $B_k$ is the covariance matrix at time $k$, $K'$ is the corrected Kalman coefficient, $A'_k$ is the best estimate of the predicted value at the next moment, $B'_k$ is the covariance matrix of the predicted value at the next moment, $R_k$ is the mean matrix of the sensor measurements, $Z_k$ is the covariance matrix of the sensor measurements, $H_k$ is the measured value correction matrix, and Velocity' and Height', the components of $A'_k$, are the speed and the height of the barbell at the next moment.
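The Concat, Split and max steps recited for the WFU module in claim 1 can be read as picking, for each joint point, the coordinate predicted by the branch with the highest confidence. The sketch below follows that reading; the function name, the data layout and the sample values are assumptions, and the weight-based channel integration is not modelled.

```python
def wfu_select(branch_coords, branch_conf):
    """For each joint point, keep the coordinate from the most confident branch.

    branch_coords: one list of (x, y) coordinates per branch.
    branch_conf:   matching confidences, one row per branch (3 x 16 in the claim).
    """
    n_branches = len(branch_conf)
    n_joints = len(branch_conf[0])
    fused = []
    for j in range(n_joints):
        # Split step: confidences of joint j across all branches.
        column = [branch_conf[b][j] for b in range(n_branches)]
        # max step: index of the branch with the highest confidence.
        best = max(range(n_branches), key=column.__getitem__)
        fused.append(branch_coords[best][j])
    return fused  # final Concat step: one coordinate per joint point


# Hypothetical example with 3 branches and 2 joint points.
coords = [[(1, 1), (2, 2)], [(3, 3), (4, 4)], [(5, 5), (6, 6)]]
conf = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.3]]
selected = wfu_select(coords, conf)
```

Selecting per joint point rather than per branch lets a branch that is strong on some joints and weak on others still contribute its best predictions.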
CN202211414614.5A 2022-11-11 2022-11-11 Speed and strength feedback system based on depth camera Active CN115937895B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211414614.5A CN115937895B (en) 2022-11-11 2022-11-11 Speed and strength feedback system based on depth camera

Publications (2)

Publication Number Publication Date
CN115937895A CN115937895A (en) 2023-04-07
CN115937895B (en) 2023-09-19

Family

ID=86696737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211414614.5A Active CN115937895B (en) 2022-11-11 2022-11-11 Speed and strength feedback system based on depth camera

Country Status (1)

Country Link
CN (1) CN115937895B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant