CN115937895B

CN115937895B - Speed and strength feedback system based on depth camera

Info

Publication number: CN115937895B
Application number: CN202211414614.5A
Authority: CN
Inventors: 张堃; 涂鑫涛; 张鹏程; 刘志诚; 徐沛霞; 林鹏程
Original assignee: Nantong University
Current assignee: Nantong University
Priority date: 2022-11-11
Filing date: 2022-11-11
Publication date: 2023-09-19
Anticipated expiration: 2042-11-11
Also published as: CN115937895A

Abstract

The invention relates to the technical field of electronic information, and in particular to a speed and strength feedback system based on a depth camera. The system comprises an image acquisition module, a human body capturing module, a motion monitoring module and a speed and strength calculation module. The image acquisition module uses built-in visual sensors to monitor the athlete and the athlete's surroundings, and can acquire color images and depth images simultaneously; it consists of two depth cameras, one mounted horizontally and one vertically and fixed in a cross shape. The human body capturing module efficiently locates 16 key points of the human body; it takes the Exc-Pose algorithm as its core and comprises a lightweight E-shaped encoding layer and a decoding layer based on supervised learning of a regression model. Through a reasonable combination of structure and algorithms, the invention captures, in a non-contact manner, core technical indexes such as posture, speed, strength and power generated by athletes during physical training, and digitizes them to guide scientific training.

Description

Speed and strength feedback system based on depth camera

Technical Field

The invention relates to the technical field of electronic information, in particular to a speed and strength feedback system based on a depth camera.

Background

In athletic training practice, quantitative analysis and evaluation of how an athlete's competitive performance changes over time is the main way for a coach to understand the training effect, adjust the training plan and scientifically control the training process. In the era of big data, how high-level sports teams can use digital equipment to carry out and monitor physical training, and so help athletes reliably reach their performance targets at a given point in time, is an important problem for making athletic training more scientific and precise. At present, digital monitoring in the physical training of high-level sports teams focuses mainly on physical training platforms, physical state monitoring platforms and physical big-data management platforms. Physical training is the foundation of all competitive sports, with speed and strength training at its core. However, current speed and strength training relies either on visual observation by a coach or on the athlete's own feel through trial and error. Alternatively, the athlete's speed and strength are monitored with auxiliary equipment such as GymAware, but such equipment must be attached to the athlete or to the load by a cable or similar, which easily disturbs the athlete during exercise. How to digitize core technical indexes such as posture, speed, strength and power generated during speed and strength training without attaching any equipment, and thereby scientifically guide athletes' daily training to improve training efficiency and reduce sports injuries, has become a primary problem faced by coaches.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a speed and strength feedback system based on a depth camera. Through a reasonable combination of structure and algorithms, it captures, in a contactless manner, core technical indexes such as posture, speed, strength and power generated by athletes during physical training, digitizes them, and uses them to guide scientific training.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

a speed and strength feedback system based on a depth camera comprises an image acquisition module, a human body capturing module, a motion monitoring module and a speed and strength calculation module;

the image acquisition module uses built-in visual sensors to monitor the athlete and the athlete's surroundings, and can acquire color images and depth images simultaneously; the image acquisition module consists of two depth cameras, one mounted horizontally and one vertically and fixed in a cross shape;

the human body capturing module is used for efficiently locating 16 key points of the human body, takes the Exc-Pose algorithm as its core, and specifically comprises a lightweight E-shaped encoding layer and an RLE-Decoder decoding layer based on supervised learning of a regression model;

the motion monitoring module is used for judging whether a motion is genuine: a hand region is generated from the hand key points located by the human body capturing module, and the Exc binary classifier is applied within that region to judge whether the motion is genuine;

the speed and strength calculation module is used for calculating the relevant training information, using a Kalman filtering algorithm with multi-frame fusion;

the speed and force feedback system comprises the following specific steps:

S1: from the color and depth videos acquired by two Intel RealSense D435 high-definition cameras, images containing the athlete are acquired frame by frame and enhanced;

S2: human body detection is performed using a non-contact, lightweight human body detection algorithm based on log-likelihood estimation and a regression model, and the human body region in the image is segmented;

S3: coordinate values of the hand key points are extracted from the detected human body region and converted into technical motion indexes such as speed, power, average speed and average power.

Preferably, in the human body capturing module, the Exc-Pose algorithm comprises target detection and pose detection, and mainly consists of a PD-ShuffleNet encoding layer and an RLE-Decoder decoding layer. PD-ShuffleNet extracts finer low-level features through a three-branch structure, and the RLE-Decoder performs regression-based supervised learning on the extracted low-level features through residual log-likelihood estimation and re-parameterization, directly regressing the coordinate values of the target key points.

Preferably, from the 16 key points and their coordinate information obtained by pose detection, the motion detection module selects the hand key point coordinates $Hand_n(x_n, y_n)$ and, on that basis, generates an adaptive bounding rectangle around the hand frame by frame. The region adjusts automatically as the athlete moves, so that the barbell always remains inside it. With the Exc classifier added, the motion detection module judges whether any given frame shows a hand-held barbell motion state.

Preferably, after an image is input, one convolution operation and one max pooling operation are performed, followed by three stages, Stage2, Stage3 and Stage4, each formed by stacking a different number of shuffle units. The first module of each stage is a stride=2 shuffle unit that implements downsampling, and Stage2 and Stage4 are each formed by stacking one stride=2 shuffle unit and 3 stride=1 shuffle units. The PD-ShuffleNet designed herein divides the network into three branches, forming an "E" structure, and each branch learns the low-level features of the target. Specifically, after Stage2 ends, Stage3 and Stage4 are replicated to form the 3-branch E-shaped structure: Stage3-1, Stage3-2 and Stage3-3 each start with a stride=2 shuffle unit and then stack 5, 7 and 9 stride=1 shuffle units respectively, after which each branch connects to Stage4, so that the network learns low-level features at different depths.
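The stage layout described above can be sketched in plain Python as a list of unit strides per stage. This is only an illustration of the unit counts stated in the text; the function names `stage_strides` and `pd_shufflenet_layout` are hypothetical, and no convolution is actually performed.

```python
def stage_strides(num_stride1_units):
    """Strides of the shuffle units in one stage: one stride=2 unit, then stride=1 units."""
    return [2] + [1] * num_stride1_units


def pd_shufflenet_layout(branch_depths=(5, 7, 9)):
    """Return the shared Stage2 and the per-branch Stage3/Stage4 layouts.

    branch_depths holds the number of stride=1 units stacked in
    Stage3-1, Stage3-2 and Stage3-3 (5, 7 and 9 in the text).
    """
    shared = {"Stage2": stage_strides(3)}
    branches = []
    for idx, depth in enumerate(branch_depths, start=1):
        branches.append({
            f"Stage3-{idx}": stage_strides(depth),
            "Stage4": stage_strides(3),  # each branch gets its own copy of Stage4
        })
    return shared, branches


shared, branches = pd_shufflenet_layout()
print("Stage2:", shared["Stage2"])
for branch in branches:
    for name, strides in branch.items():
        print(name, "->", len(strides), "units")
```

Each stage contains exactly one stride=2 unit, so every branch downsamples by the same factor; only the number of stride=1 units in Stage3 differs between branches.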

Preferably, the RLE-Decoder directly learns the coordinates of the target joint points using regression-based supervised learning. For the image feature $I$ learned by the E-ShuffleNet encoding layer, the decoding layer uses a regression model to predict the probability that the target joint point lies at position $x$; this probability distribution is denoted $P_\Theta(x \mid I)$, where $\Theta$ represents the parameters learned by the model. The supervision process learns and optimizes the model parameters $\Theta$ so that the predicted probability is greatest at the labeled position $\mu_g$. The maximum-likelihood loss function of the model can be described as:

where $L_{mle}$ is the loss function, $\Theta$ the model parameters being optimized, $P_\Theta(x \mid I)$ the probability distribution, $I$ the image feature and $\mu_g$ the label; the two remaining quantities in the formula are the variance of the distribution and the sample mean.
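The loss formula itself is not reproduced in this text. As a minimal sketch, assuming the decoder predicts a one-dimensional Gaussian with mean $\hat{\mu}$ and variance $\hat{\sigma}^{2}$ (both symbols are assumptions, chosen to match the "sample mean" and "variance" named in the definitions), the negative log-likelihood evaluated at the label would take the form:

```latex
L_{mle} = -\log P_{\Theta}(x \mid I)\,\Big|_{x=\mu_g}
        = \frac{1}{2}\log\hat{\sigma}^{2}
        + \frac{(\mu_g-\hat{\mu})^{2}}{2\hat{\sigma}^{2}}
        + \frac{1}{2}\log 2\pi
```

The first equality only restates maximum-likelihood estimation; the second holds under the Gaussian assumption and is not quoted from the source.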

Using a normalizing flow model, a simple distribution can be transformed into an arbitrarily complex one: the flow passes a simple distribution $P(z)$ through a learnable function $x = f_\phi(z)$ to represent a complex distribution $P_\phi(x)$. The normalizing flow can be visualized as follows:

…

where $p_1(x^i), p_2(x^i), \ldots, p_{K-1}(x^i), p_K(x^i)$ represent the distribution functions, $T(z^i)$ is the initial function, and the remaining symbols are the transformation matrices of the 1st, 2nd, ... steps;

in the model transformation process, so that the flow model fits the optimal underlying distribution, the distribution is divided into three terms: a simple distribution term (for example, a Gaussian distribution), a residual log-likelihood estimation term, and a constant term $s$, as shown in the following formula:

During training, the simple distribution term speeds up the training of the model because it does not depend on the flow model. When training is finished, the regression model has learned the translation and scaling parameters. Because the transformation is applied to the fixed standard distribution $N(0, I)$, the translation coefficient can be taken directly as the final predicted coordinate value in the inference stage.
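The flow formulas are not reproduced in this text. As a hedged sketch, the standard change-of-variables identity for a normalizing flow, together with a translate-and-scale re-parameterization of the kind described above, can be written as follows. The symbols $\hat{\mu}$ (translation), $\hat{\sigma}$ (scale) and $\bar{x}$ (standardized variable) are assumptions introduced for illustration.

```latex
x = f_{\phi}(z), \quad z \sim P(z), \qquad
\log P_{\phi}(x) = \log P(z) - \log\left|\det\frac{\partial f_{\phi}}{\partial z}\right|

x = \hat{\mu} + \hat{\sigma}\,\bar{x}, \qquad \bar{x} \sim N(0, I)
```

Under the second relation, $\bar{x}$ has zero mean, so the expected value of $x$ is $\hat{\mu}$, which is why a translation parameter can serve directly as the predicted coordinate at inference time.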

Preferably, the coordinate values of the 16 human joint points learned by each of the three PD-ShuffleNet branches, together with their corresponding confidences, are passed as input to the feature aggregation unit. A Concat operation then combines the three results into a 3×16 matrix, in which each row represents a branch and each column holds the confidences of one joint point's coordinates in the different branches. A Split operation then divides the matrix into 16 matrices of size 1×3, each of which stacks the three confidences for one part of the human pose; for example, $A_0$ holds the confidences of the head joint point coordinates learned by the three branches. A max function then outputs, for $A_0, A_1, A_2, \ldots, A_{17}$, the joint point coordinates $a_0, a_1, a_2, \ldots, a_{17}$ that correspond to the maximum confidence. Finally, a Concat operation yields the new coordinates of the 16 joint points of the human pose. The results generated by each branch are integrated across channels on the basis of weights; because of the three-branch structure, the integration is given by the following formula:

where $i$ denotes the $i$-th channel, $T_S$ is the final integrated feature, $T_i$ is the feature value to be fused in the $i$-th channel, $C_i$ refers to the $i$-th channel, and $K_i\%$ is the proportion of the features generated by each channel relative to all channels.
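The Concat, Split and max steps described above can be sketched as follows. This is an illustration only: the function name `wfu_select` and the input layout are assumptions, and the weight-based integration formula is not implemented because it is not reproduced in the text.

```python
def wfu_select(branch_coords, branch_conf):
    """Pick, for every joint, the coordinate from the most confident branch.

    branch_coords[b][j] is the (x, y) estimate of joint j from branch b.
    branch_conf[b][j] is the confidence of that estimate.
    Together the confidence rows play the role of the 3 x 16 matrix in the text.
    """
    num_joints = len(branch_conf[0])
    fused = []
    for j in range(num_joints):
        # "Split": the confidences of joint j across all branches
        column = [conf[j] for conf in branch_conf]
        # "max": index of the branch with the highest confidence
        best = max(range(len(column)), key=column.__getitem__)
        # final "Concat": collect the winning coordinate
        fused.append(branch_coords[best][j])
    return fused


# Toy example with 3 branches and 2 joints (a real pose would have 16 joints).
coords = [[(0, 0), (1, 1)], [(2, 2), (3, 3)], [(4, 4), (5, 5)]]
conf = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.5]]
print(wfu_select(coords, conf))
```

When two branches tie on confidence, this sketch keeps the first one; the source does not say how ties are resolved.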

Preferably, after the 16 joint points are obtained by the pose detection algorithm, the hand key points in consecutive frames are defined as $Hand_0(x_0, y_0), Hand_1(x_1, y_1), Hand_2(x_2, y_2), \ldots, Hand_n(x_n, y_n)$, where $Hand_0$ is the hand position of the athlete at the start of the movement and $Hand_n$ is the position of the hand key point at the final moment of the movement. The depth values $z_0(x_0, y_0), z_1(x_1, y_1), z_2(x_2, y_2), \ldots, z_n(x_n, y_n)$ of the recorded key points are acquired frame by frame through the visual sensor and infrared sensor carried by the depth camera. Because nearer objects appear larger and farther objects smaller, the side length of the hand-based adaptive square ROI and the depth value at that moment show a linear relation, which can be expressed by the following set of equations:

where $L_0, L_1, L_2, \ldots, L_n$ are the side lengths of the hand-based adaptive square ROI acquired frame by frame, $a$ is the rate of change of the side length $L$ relative to the depth value, $z_0, z_1, z_2, \ldots, z_n$ are the depth values of the hand key point at each moment, and $b$ is the offset correction caused by position change;

a best-matching pair $(a_0, b_0)$ is obtained from the set of equations. For any frame, the side length $L$ of the adaptive square ROI is related to the corresponding hand depth value $z$ by the following formula:

$L = a_0 z + b_0$

Based on the change of $L$, the size of the adaptive square ROI is adjusted flexibly so that the barbell bar is tracked and captured in real time. Once the bar is captured, a ResNet network with an added attention mechanism is used to judge whether the athlete is in a hand-held barbell motion state.
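The text does not say how the best-matching pair $(a_0, b_0)$ is obtained. As one plausible sketch, assuming an ordinary least-squares fit over the recorded (depth, side length) pairs, the coefficients and the resulting ROI side length could be computed like this. The function names and the sample numbers are hypothetical.

```python
def fit_side_length(depths, side_lengths):
    """Least-squares fit of L = a0 * z + b0 (the fitting method is an assumption)."""
    n = len(depths)
    if n < 2 or n != len(side_lengths):
        raise ValueError("need at least two (depth, side length) pairs")
    mean_z = sum(depths) / n
    mean_l = sum(side_lengths) / n
    var_z = sum((z - mean_z) ** 2 for z in depths)
    if var_z == 0:
        raise ValueError("depth values must not all be identical")
    cov_zl = sum((z - mean_z) * (l - mean_l) for z, l in zip(depths, side_lengths))
    a0 = cov_zl / var_z
    b0 = mean_l - a0 * mean_z
    return a0, b0


def roi_side_length(a0, b0, depth):
    """Side length of the adaptive square ROI for a given hand depth value."""
    return a0 * depth + b0


# Made-up calibration data: depth in metres, side length in pixels.
a0, b0 = fit_side_length([1.0, 2.0, 3.0], [190.0, 140.0, 90.0])
print(a0, b0, roi_side_length(a0, b0, 2.5))
```

With these sample values the fitted slope is negative, which matches the idea that the hand region shrinks in the image as the athlete moves away from the camera.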

Preferably, the Exc classifier is added on the basis of the hand key points, and a traversal box is constructed for the classification module to judge whether the barbell is being gripped; residual blocks are used to address degradation and vanishing gradients during image processing. Because the distance from the camera affects the size of the target region and therefore the accuracy of the classification network, a multi-scale pyramid module is introduced to improve the model's ability to extract multi-scale features, especially for small targets. The dilated pyramid structure substantially improves the capture of multi-scale information and dense feature extraction: dilated convolutions with dilation rates 1, 2 and 4 create new multi-scale features for the lower layers of ResNet. The block module is obtained by a convolution placed before the multi-scale pyramid module. The 3 resulting convolution features are fused pairwise; for each pair, a channel attention mechanism yields the weights of the two features, and the final fused feature is obtained by multiplying each weight matrix with its corresponding convolution feature and concatenating the results, which improves the classification of distant hands.

Preferably, the depth values $z_0(x_0, y_0), z_1(x_1, y_1), z_2(x_2, y_2), \ldots, z_n(x_n, y_n)$ of the hand key points, recorded frame by frame in consecutive frames, are screened one by one against a preset depth upper limit and a preset depth lower limit. At any moment, if the depth obtained at the hand key point in a frame is both no smaller than the depth minimum $Z_{Min}$ and no larger than the depth maximum $Z_{Max}$, the athlete is considered to be within a reasonable exercise area:

$Z_{Min} \le z_n(x_n, y_n) \le Z_{Max}$

where $Z_{Min}$ and $Z_{Max}$ are the depth minimum and the depth maximum, $z_n(x_n, y_n)$ is the depth value of the hand key point in a given frame, and $x_n$, $y_n$ are the horizontal and vertical coordinates of the hand key point in that frame.
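The depth screening condition can be sketched as a simple range check. The function names and the threshold values below are hypothetical; the source does not state concrete limits or units.

```python
def in_exercise_area(depth, z_min, z_max):
    """True when Z_Min <= z_n(x_n, y_n) <= Z_Max holds for one frame."""
    return z_min <= depth <= z_max


def screen_frames(depths, z_min, z_max):
    """Indices of the frames whose hand key point depth passes the screening."""
    return [i for i, z in enumerate(depths) if in_exercise_area(z, z_min, z_max)]


# Made-up per-frame hand depths (metres) with assumed limits of 0.5 m and 3.0 m.
valid = screen_frames([0.4, 1.2, 2.5, 3.1], z_min=0.5, z_max=3.0)
print(valid)
```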

Preferably, the images obtained by the image processing module are grouped ten frames at a time and the height change is read in real time. At time K-1 there are two matrices, the best estimate $A_k$ and the covariance matrix $B_k$, as follows:

where Height and Velocity are the height and velocity at that moment, abbreviated $h$ and $v$ in the subsequent formulas; $\Sigma_{ij}$ denotes the covariance between vector elements, with $ij$ being any pairwise combination of $h$ and $v$, giving 4 combinations;

assuming the original estimate is correct, the height and velocity at time K can be expressed by the following formula:

where $A_k$ and $A_{k-1}$ are the states at time K and time K-1, $T_k$ is the transformation matrix, and $\Delta t$ is the elapsed time;

since both the measured value and the predicted value are disturbed by external factors, the Gaussian distribution of the predicted value and the Gaussian distribution of the measured value are obtained separately and multiplied to give the final predicted Gaussian distribution:

$B'_k = B_k - K' H_k B_k$

where $A_k$ is the best estimate after external interference is introduced, $B_k$ is the covariance matrix after external interference is introduced, $K'$ is the corrected Kalman coefficient, $A'_k$ is the best estimate of the predicted value at the next moment, $B'_k$ is the covariance matrix of the predicted value at the next moment, $R_k$ is the mean matrix of the sensor measurements, $Z_k$ is the covariance matrix of the sensor measurements, $H_k$ is the measurement correction matrix, Velocity' is the speed of the barbell at the next moment, and Height' is the height of the barbell at the next moment.
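Several formulas in this passage are not reproduced in the text. As a sketch only, assuming they follow the standard Kalman filter predict and update equations (which is consistent with the surviving expression for $B'_k$), the full chain in the document's symbols would read as follows. The explicit form of $T_k$ assumes a constant-velocity model, and the expressions for $K'$ and $A'_k$ are the textbook forms rather than formulas quoted from the source.

```latex
A_k = \begin{bmatrix} \mathrm{Height} \\ \mathrm{Velocity} \end{bmatrix}, \qquad
B_k = \begin{bmatrix} \Sigma_{hh} & \Sigma_{hv} \\ \Sigma_{vh} & \Sigma_{vv} \end{bmatrix}

T_k = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}, \qquad
A_k = T_k A_{k-1}, \qquad
B_k = T_k B_{k-1} T_k^{\mathsf{T}} + Q_k

A_{exp} = H_k A_k, \qquad
B_{exp} = H_k B_k H_k^{\mathsf{T}}

K' = B_k H_k^{\mathsf{T}} \left( H_k B_k H_k^{\mathsf{T}} + Z_k \right)^{-1}

A'_k = A_k + K' \left( R_k - H_k A_k \right), \qquad
B'_k = B_k - K' H_k B_k
```

Here $R_k$ is the measurement mean and $Z_k$ the measurement covariance, following the definitions given in the text, and $Q_k$ is the interference correction (process noise) matrix.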

Compared with the prior art, the invention has the following beneficial effects:

1. The invention uses the combined color image and depth image as the input of the algorithm and processes the data algorithmically on an intelligent computer, which effectively removes the need to wear peripheral devices in order to acquire training data.

2. The training mode is velocity-based strength training: the athlete's explosive power and strength are judged from speed, which prevents injuries caused by blindly increasing the load.

3. The invention provides an original Exc-Pose pose detection algorithm, which introduces a regression-model-based detection scheme on top of a lightweight encoding layer and builds the network with the E-shaped structure proposed by the invention, so that the detected pose is accurate, the computation required is reduced and the detection speed is improved.

4. The invention provides an algorithm that delimits a region of interest from human key points, and adds a multi-scale pyramid to the ResNet network to form multi-scale features so that distant regions of interest can be delimited.

5. The human body capturing module and the motion detection module adopted by the invention monitor the human body in real time and record data, helping to improve athletes' strength levels and sports performance.

6. The invention reduces the burden on coaches, acquires athletes' sports indexes accurately and efficiently through sensors, and promotes the development of intelligent sports in China.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a schematic diagram of the Exc-Pose pose detection algorithm in the present invention;

FIG. 3 is a diagram of the PD-ShuffleNet network in accordance with the present invention;

FIG. 4 is a diagram of an "E" network configuration in accordance with the present invention;

FIG. 5 is a detailed block diagram of a WFU module;

FIG. 6 is a block diagram of an RLE-Decoder according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings, so that those skilled in the art can better understand the advantages and features of the invention and the scope of protection is more clearly defined. The described embodiments are only some, not all, of the embodiments of the present invention; all other embodiments obtained by a person of ordinary skill in the art without inventive effort fall within the scope of protection of the present invention.

Referring to fig. 1, a depth camera based speed and force feedback system includes an image acquisition module, a human body capture module, a motion monitoring module, and a speed and force calculation module;

the speed and force feedback system comprises the following specific steps:

Specifically, referring to FIG. 2, in the human body capturing module, the Exc-Pose algorithm comprises target detection and pose detection, and mainly consists of a PD-ShuffleNet encoding layer and an RLE-Decoder decoding layer. PD-ShuffleNet extracts finer low-level features through a three-branch structure, and the RLE-Decoder performs regression-based supervised learning on the extracted low-level features through residual log-likelihood estimation and re-parameterization, directly regressing the coordinate values of the target key points.

Referring to FIG. 3, FIG. 3(a) and FIG. 3(b) show the basic module and the downsampling module. When the stride is 1, a channel split operation first divides the channels into two branches, replacing the original group convolution structure; the 1×1 group convolution is changed to an ordinary 1×1 convolution and the channel shuffle operation that followed it is removed. After the 3×3 depthwise separable convolution and the 1×1 ordinary convolution, the two branches are joined by concatenation instead of the original direct addition, and finally a channel shuffle operation is performed to fuse information between the groups. When the stride is 2, the channel split operation is omitted, and the original 3×3 average pooling operation is replaced by a 3×3 depthwise separable convolution and a 1×1 ordinary convolution.
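The split, concatenate and shuffle flow of the stride-1 unit can be illustrated on a plain list of channel labels. This is a toy sketch: `branch_fn` is a hypothetical stand-in for the convolutions, which are not implemented here, and it is assumed that one half of the channels passes through unchanged.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups: view as (groups, n // groups), transpose, flatten."""
    n = len(channels)
    if n % groups:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]


def stride1_unit(channels, branch_fn, groups=2):
    """Channel split -> process one half -> concatenate -> channel shuffle."""
    half = len(channels) // 2
    left, right = channels[:half], channels[half:]   # channel split
    processed = [branch_fn(c) for c in right]        # stand-in for the conv branch
    return channel_shuffle(left + processed, groups) # concat, then shuffle


print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))   # [0, 3, 1, 4, 2, 5]
print(stride1_unit(["a", "b", "c", "d"], str.upper))   # ['a', 'C', 'b', 'D']
```

After the shuffle, processed and unprocessed channels alternate, which is how information from the two branches is mixed before the next unit.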

Referring to FIG. 4, after an image is input, one convolution operation and one max pooling operation are performed, followed by three stages, Stage2, Stage3 and Stage4, each formed by stacking a different number of shuffle units. The first module of each stage is a stride=2 shuffle unit that implements downsampling, and Stage2 and Stage4 are each formed by stacking one stride=2 shuffle unit and 3 stride=1 shuffle units. The PD-ShuffleNet designed herein divides the network into three branches, forming an "E" structure, and each branch learns the low-level features of the target. Specifically, after Stage2 ends, Stage3 and Stage4 are replicated to form the 3-branch E-shaped structure: Stage3-1, Stage3-2 and Stage3-3 each start with a stride=2 shuffle unit and then stack 5, 7 and 9 stride=1 shuffle units respectively, after which each branch connects to Stage4, so that the network learns low-level features at different depths.

Referring to FIG. 5, the WFU module passes the coordinate values of the 16 human joint points learned by each of the three PD-ShuffleNet branches, together with their corresponding confidences, as input to the feature aggregation unit. A Concat operation then combines the three results into a 3×16 matrix, in which each row represents a branch and each column holds the confidences of one joint point's coordinates in the different branches. A Split operation then divides the matrix into 16 matrices of size 1×3, each of which stacks the three confidences for one part of the human pose; for example, $A_0$ holds the confidences of the head joint point coordinates learned by the three branches. A max function then outputs, for $A_0, A_1, A_2, \ldots, A_{17}$, the joint point coordinates $a_0, a_1, a_2, \ldots, a_{17}$ that correspond to the maximum confidence. Finally, a Concat operation yields the new coordinates of the 16 joint points of the human pose. The results generated by each branch are integrated across channels on the basis of weights; because of the three-branch structure, the integration is given by the following formula:

Referring to FIG. 6, the RLE-Decoder directly learns the coordinates of the target joint points using regression-based supervised learning. For the image feature $I$ learned by the E-ShuffleNet encoding layer, the decoding layer uses a regression model to predict the probability that the target joint point lies at position $x$; this probability distribution is denoted $P_\Theta(x \mid I)$, where $\Theta$ represents the parameters learned by the model. The supervision process learns and optimizes the model parameters $\Theta$ so that the predicted probability is greatest at the labeled position $\mu_g$. The maximum-likelihood loss function of the model can be described as:

…

The multi-scale pyramid module is introduced to improve the model's ability to extract multi-scale features, especially for small targets. The dilated pyramid structure substantially improves the capture of multi-scale information and dense feature extraction: dilated convolutions with dilation rates 1, 2 and 4 create new multi-scale features for the lower layers of ResNet. The block module is obtained by a convolution placed before the multi-scale pyramid module. The 3 resulting convolution features are fused pairwise; for each pair, a channel attention mechanism yields the weights of the two features, and the final fused feature is obtained by multiplying each weight matrix with its corresponding convolution feature and concatenating the results, which improves the classification of distant hand key point regions.
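To see why dilation rates 1, 2 and 4 yield features at different scales, the effective extent of a dilated kernel can be computed directly. The sketch below assumes a 3×3 kernel, which the source does not state, and also lists the pairs that a pairwise fusion of the three features would cover.

```python
from itertools import combinations


def effective_kernel(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


rates = (1, 2, 4)
# Effective extent of an assumed 3x3 kernel at each dilation rate.
extents = {d: effective_kernel(3, d) for d in rates}
# The three features fused two at a time give three pairs.
pairs = list(combinations(rates, 2))

print(extents)  # {1: 3, 2: 5, 4: 9}
print(pairs)    # [(1, 2), (1, 4), (2, 4)]
```

The same number of weights therefore covers a 3, 5 or 9 pixel window, which is what lets the module respond to hands of different apparent sizes.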

Exc screening: the depth values $z_0(x_0, y_0), z_1(x_1, y_1), z_2(x_2, y_2), \ldots, z_n(x_n, y_n)$ of the hand key points, recorded frame by frame in consecutive frames, are screened one by one against a preset depth upper limit and a preset depth lower limit. At any moment, if the depth obtained at the hand key point in a frame is both no smaller than the depth minimum $Z_{Min}$ and no larger than the depth maximum $Z_{Max}$, the athlete is considered to be within a reasonable exercise area:

$Z_{Min} \le z_n(x_n, y_n) \le Z_{Max}$

The hand key point region is acquired as follows: after the 16 joint points are obtained by the pose detection algorithm, the hand key points in consecutive frames are defined as $Hand_0(x_0, y_0), Hand_1(x_1, y_1), Hand_2(x_2, y_2), \ldots, Hand_n(x_n, y_n)$, where $Hand_0$ is the hand position of the athlete at the start of the movement and $Hand_n$ is the position of the hand key point at the final moment of the movement. The depth values $z_0(x_0, y_0), z_1(x_1, y_1), z_2(x_2, y_2), \ldots, z_n(x_n, y_n)$ of the recorded key points are acquired frame by frame through the visual sensor and infrared sensor carried by the depth camera. Because nearer objects appear larger and farther objects smaller, the side length of the hand-based adaptive square ROI and the depth value at that moment show a linear relation, which can be expressed by the following set of equations:

A best-matching pair $(a_0, b_0)$ is obtained from the set of equations. For any frame, the side length $L$ of the adaptive square ROI is related to the corresponding hand depth value $z$ by the following formula:

$L = a_0 z + b_0$

The technical indexes (speed, power, average speed and average power) are measured and calculated as follows: the images obtained by the image processing module are grouped ten frames at a time, and the height change is read in real time. At time K-1 there are two matrices, the best estimate $A_k$ and the covariance matrix $B_k$, as follows:

where Height and Velocity are the height and velocity at that moment, abbreviated $h$ and $v$ in the subsequent formulas; $\Sigma_{ij}$ denotes the covariance between vector elements, with $ij$ being any pairwise combination of $h$ and $v$, giving 4 combinations.

where $A_k$ and $A_{k-1}$ are the states at time K and time K-1, $T_k$ is the transformation matrix, and $\Delta t$ is the elapsed time.

When external interference exists, additional uncertainty is introduced on top of the original predicted value and the formula must be corrected: the untracked interference is treated as noise with covariance $Q_k$.

$A_k = T_k A_{k-1}$

where $A_k$ and $B_k$ are the position matrix and the covariance matrix at time K, $T_k$ is the transformation matrix, and $Q_k$ is the interference correction matrix.

After the actual measurement data of the sensor is added, the sensor itself introduces noise with covariance $Z_k$, and the mean of the corresponding Gaussian distribution is denoted $R_k$.

The corrected prediction formula that takes the measured value into account is as follows:

$A_{exp} = H_k A_k$

where $A_{exp}$ and $B_{exp}$ are the position matrix and the covariance matrix at time K, and $H_k$ is the measurement correction matrix.

Multiplying the Gaussian distribution of the predicted value by the Gaussian distribution of the measured value gives the final predicted Gaussian distribution, as follows:

$B'_k = B_k - K' H_k B_k$

where $A_k$ is the best estimate after external interference is introduced, $B_k$ is the covariance matrix after external interference is introduced, and $K'$ is the corrected Kalman coefficient. $A'_k$ is the best estimate of the predicted value at the next moment, $B'_k$ is the covariance matrix of the predicted value at the next moment, $R_k$ is the mean matrix of the sensor measurements, $Z_k$ is the covariance matrix of the sensor measurements, and $H_k$ is the measurement correction matrix. Velocity' is the speed of the barbell at the next moment, and Height' is the height of the barbell at the next moment.
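The predict and update cycle described above can be sketched in pure Python for a two-element state (height, velocity). Several choices here are assumptions rather than details from the source: only the height is measured (so the measurement matrix is [1, 0]), the motion model is constant-velocity, and the noise values `q` and `r` are placeholders.

```python
def predict(state, cov, dt, q=(1e-4, 1e-2)):
    """Prediction step: A_k = T_k A_{k-1}, B_k = T_k B_{k-1} T_k^T + Q_k."""
    h, v = state
    (p00, p01), (p10, p11) = cov
    new_state = (h + dt * v, v)
    new_cov = (
        (p00 + dt * (p10 + p01) + dt * dt * p11 + q[0], p01 + dt * p11),
        (p10 + dt * p11, p11 + q[1]),
    )
    return new_state, new_cov


def update(state, cov, measured_height, r=1e-3):
    """Update step with a height-only measurement: B'_k = B_k - K' H_k B_k."""
    h, v = state
    (p00, p01), (p10, p11) = cov
    s = p00 + r                      # innovation variance H B H^T + measurement noise
    k0, k1 = p00 / s, p10 / s        # Kalman gain K'
    residual = measured_height - h
    new_state = (h + k0 * residual, v + k1 * residual)
    new_cov = (
        (p00 - k0 * p00, p01 - k0 * p01),
        (p10 - k1 * p00, p11 - k1 * p01),
    )
    return new_state, new_cov


def track(heights, dt):
    """Filter one group of frames and return (height, velocity) per frame."""
    state = (heights[0], 0.0)
    cov = ((1.0, 0.0), (0.0, 1.0))
    estimates = [state]
    for z in heights[1:]:
        state, cov = predict(state, cov, dt)
        state, cov = update(state, cov, z)
        estimates.append(state)
    return estimates


# Ten made-up frames at 30 fps with the bar rising 1 cm per frame (0.3 m/s).
heights = [0.5 + 0.01 * i for i in range(10)]
estimates = track(heights, dt=1 / 30)
print(estimates[-1])
```

The velocity estimate starts at zero and moves toward the true bar speed as more frames are fused, which is the benefit of filtering over differencing two consecutive heights.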

In summary, through a reasonable combination of structure and algorithms, the invention captures, in a non-contact manner, core technical indexes such as posture, speed, strength and power generated by athletes during physical training, digitizes them, and uses them to guide scientific training.

The description and practice of the invention disclosed herein will be readily apparent to those skilled in the art, and several modifications and adaptations may be made without departing from the principles of the invention. Such modifications or improvements made without departing from the spirit of the invention are also considered to be within its scope of protection.

Claims

1. The speed and strength feedback system based on the depth camera is characterized by comprising an image acquisition module, a human body capturing module, a motion monitoring module and a speed and strength calculation module;

the human body capturing module is used for locating 16 human body joint points, takes the Exc-Pose lightweight human body posture detection algorithm as its core, comprises target detection and posture detection, and specifically comprises an E-ShuffleNet encoding layer, an RLE-Decoder decoding layer and a WFU module; the E-ShuffleNet encoding layer extracts fine low-level features through a three-branch structure, and the RLE-Decoder decoding layer performs regression-based supervised learning on the extracted low-level features through residual log-likelihood estimation and re-parameterization, directly regressing the coordinate values of the target joint points;

after an image is input, one convolution operation and one max pooling operation are performed, followed by three stages, Stage2, Stage3 and Stage4, each formed by stacking a different number of shuffle units, wherein the first module of each stage is a shuffle unit with stride=2 that implements downsampling, and Stage2 and Stage4 are each formed by stacking one shuffle unit with stride=2 and 3 shuffle units with stride=1; E-ShuffleNet divides the network into three branches to form an E-shaped structure, and each branch separately learns the low-level features of the target; specifically: after Stage2, Stage3 and Stage4 are duplicated to form a 3-branch E-shaped structure, wherein Stage3-1, Stage3-2 and Stage3-3 each begin with a shuffle unit with stride=2 and then stack 5, 7 and 9 shuffle units with stride=1 respectively, after which each is connected to Stage4, so that the network learns image features to different degrees;

the shuffle unit with stride=1 uses a channel split operation to divide the channels into two branches in place of the original group convolution structure; the 1×1 group convolution is changed to an ordinary 1×1 convolution and the subsequent channel shuffle (rearrangement) operation is removed; after the 3×3 depthwise separable convolution and the 1×1 ordinary convolution, the two branches are joined by concatenation instead of the original direct addition, and finally a channel shuffle operation is performed to fuse information between groups; in the shuffle unit with stride=2, the channel split operation is removed, and the original 3×3 average pooling operation is replaced by a 3×3 depthwise separable convolution and a 1×1 ordinary convolution;

for the image features learned by the E-ShuffleNet encoding layer, the RLE-Decoder decoding layer predicts, through a regression model, the confidence corresponding to the coordinate values of the 16 human body joint points;

the WFU module passes the confidences corresponding to the coordinate values of the 16 joint points to the feature aggregation unit as input, then performs a Concat operation and merges the results into a 3×16 matrix, in which each row represents a different branch and each column represents the confidence of the coordinates of a different joint point in that branch; a Split operation is then performed, dividing the matrix by columns into 16 matrices of size 1×3, each of which is a stack of the three different confidences for one part of the human body posture, for example $A_0$ is the confidence of the head joint point coordinates learned by the three branches; a max function is then applied to $A_0, A_1, A_2, \ldots, A_{15}$ to output the joint point coordinates $a_0, a_1, a_2, \ldots, a_{15}$ corresponding to the maximum confidence of the model; finally, a Concat operation is performed to obtain the new coordinates of the 16 joint points of the human body posture; weight-based intelligent channel integration is performed on the result generated by each branch and, because of the three-branch structure, the integration is given by the following formula:

wherein $j$ denotes the $j$-th channel, $T_S$ is the finally integrated feature, $T_j$ is the feature value to be fused in the $j$-th channel, $C_j$ corresponds to the $j$-th channel, and $K_j\%$ represents the proportion of the features generated by each channel relative to all channels;

the motion monitoring module is used for distinguishing true from false movements: it screens the coordinate values of the 16 human body joint points detected by the human body capturing module to obtain the hand joint points, generates a hand region on the basis of the hand joint points, and uses the Exc classification algorithm to judge whether the movement in the region is true or false; the speed and strength calculation module is used for calculating the motion technical indexes, using a multi-frame fusion Kalman filtering algorithm;

the working method of the speed and strength feedback system comprises the following specific steps:

S1: from the color and depth videos acquired by the two depth cameras, acquiring the images containing the athlete frame by frame, and applying enhancement processing to the images;

S2: performing human body detection using the Exc-Pose lightweight human body posture detection algorithm, which is based on log-likelihood estimation and a regression model, and segmenting the human body region in the image;

S3: extracting the coordinate values of the hand joint points from the detected human body region, and converting them in the speed and strength calculation module into motion technical indexes such as speed, power, average speed and average power.
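Two of the operations recited in claim 1 can be sketched in plain Python: the channel shuffle that closes a stride=1 shuffle unit, and the WFU step that keeps, for each joint point, the coordinates from the branch with the highest confidence. This is an illustrative sketch only. The function names `channel_shuffle` and `wfu_select` and the list-based data layout are assumptions, and the weight-based channel integration formula is not modelled.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups (reshape, transpose, flatten).

    `channels` is any list standing in for the channel axis of a feature map.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


def wfu_select(branch_coords, branch_conf):
    """Pick, per joint point, the coordinates of the most confident branch.

    branch_coords[b][j] is the (x, y) of joint j predicted by branch b.
    branch_conf[b][j] is the matching confidence (a branches-by-joints
    matrix, 3 x 16 in the claim).
    """
    n_joints = len(branch_conf[0])
    fused = []
    for j in range(n_joints):
        # Split: one column of the confidence matrix per joint point.
        column = [conf[j] for conf in branch_conf]
        # max: index of the branch with the highest confidence.
        best = max(range(len(column)), key=column.__getitem__)
        fused.append(branch_coords[best][j])
    # Concat: the fused list holds one coordinate pair per joint point.
    return fused
```

With three branches and 16 joint points, `wfu_select` returns 16 coordinate pairs, each taken from whichever branch was most confident about that joint.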

2. The speed and strength feedback system based on a depth camera according to claim 1, wherein, from the coordinate values of the 16 joint points detected by the Exc-Pose lightweight human body posture detection algorithm, the motion monitoring module screens out the hand joint point coordinates $Hand_n(x_n, y_n)$ and, on the basis of the hand joint point, generates frame by frame an adaptive bounding rectangle region around the hand, wherein the region adjusts automatically as the athlete moves closer or farther away, so that the barbell always lies within the region; the motion monitoring module uses the Exc classification algorithm to judge whether the athlete in each frame is holding the barbell.

3. The depth camera-based speed and strength feedback system of claim 2, wherein the RLE-Decoder, used as the decoding layer, predicts through a regression model the confidence that the target joint point lies at position $x$ for the image feature $I$ learned by the E-ShuffleNet encoding layer, the confidence distribution being denoted $P_\Theta(x \mid I)$, where $\Theta$ denotes the model parameters to be optimized; the whole supervision process learns and optimizes the model parameters $\Theta$ so that the model prediction has the greatest confidence at the label $\mu_g$, and the loss function of the model under maximum likelihood estimation is described as:

wherein $L_{mle}$ is the loss function, $\Theta$ denotes the model parameters to be optimized, $P_\Theta(x \mid I)$ is the confidence distribution, $I$ is the image feature, $\mu_g$ is the label, and the remaining two symbols denote the variance of the distribution and the sample mean, respectively;

a normalizing flow model is used to transform the simple distribution into an arbitrarily complex distribution: the normalizing flow passes a simple distribution $P(z)$ through a learnable function $x = f_\phi(z)$ to represent a complex distribution $P_\phi(x)$; the normalizing flow is illustrated as follows:

…

wherein p is ₁ (x _i ),p ₂ (x _i ),...,p _K-1 (x _i ),p _K (x _i ) Represents the distribution function, T (z _i ) As a function of the origin of the function,a transform matrix for the 1 st, 2 nd,;

in the model transformation process, the flow model $P_\phi(x)$ is made to fit the optimal underlying distribution by the following formula, whose terms fall into three types: a simple distribution term, for example a Gaussian distribution; a residual log-likelihood estimation term; and a constant term $\log s$; the process is expressed by the formula:

during training, the simple distribution term does not depend on the flow model, which speeds up training of the model.
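As a hedged illustration of the maximum-likelihood supervision in this claim, suppose the simple distribution is a Gaussian whose mean $\hat{\mu}$ and standard deviation $\hat{\sigma}$ are predicted by the decoder. The symbol names $\hat{\mu}$ and $\hat{\sigma}$ are assumptions introduced for this example. Evaluating the negative log-likelihood at the label $\mu_g$ then gives:

```latex
L_{mle}
  = -\log P_\Theta(x \mid I)\,\Big|_{x=\mu_g}
  = \log\hat{\sigma}
  + \frac{(\mu_g - \hat{\mu})^{2}}{2\hat{\sigma}^{2}}
  + \frac{1}{2}\log 2\pi
```

Minimizing this expression pulls the predicted mean toward the label while letting the predicted spread shrink only where the prediction is reliable. It covers the simple Gaussian case alone and does not include the residual term learned by the flow model.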

4. The depth camera based speed and strength feedback system according to claim 3, wherein, after the 16 joint points are obtained by the Exc-Pose lightweight human body posture detection algorithm, the hand joint points in consecutive frames are defined as $Hand_0(x_0, y_0), Hand_1(x_1, y_1), Hand_2(x_2, y_2), \ldots, Hand_n(x_n, y_n)$, wherein the coordinates of $Hand_0$ are based on the position of the athlete's hand joint point at the initial moment of the movement, and the coordinates of $Hand_n$ are based on the position of the athlete's hand joint point at the final moment of the movement; the depth values of the recorded joint points, $z_0(x_0, y_0), z_1(x_1, y_1), z_2(x_2, y_2), \ldots, z_n(x_n, y_n)$, are acquired frame by frame through the visual sensor and the infrared sensor carried by the depth camera; based on the principle that near objects appear large and far objects appear small, the side length of the hand ROI adaptive square frame has a linear relationship with the depth value at that moment, expressed by the following system of equations:

$$\begin{cases} L_0 = c\,z_0 + d \\ L_1 = c\,z_1 + d \\ L_2 = c\,z_2 + d \\ \quad\vdots \\ L_n = c\,z_n + d \end{cases}$$

wherein $L_0, L_1, L_2, \ldots, L_n$ denote the side lengths, acquired frame by frame, of the hand-based ROI adaptive square frame, $c$ denotes the rate of change of the side length $L$ relative to the depth value, $z_0, z_1, z_2, \ldots, z_n$ denote the depth value of the hand joint point at each moment, and $d$ denotes the offset correction caused by position change;

a best-matching pair $(c_0, d_0)$ is obtained from the system of equations, so that in any frame the side length $L$ of the ROI adaptive square frame is related to the corresponding hand depth value $z$ by the following formula:

$$L = c_0 z + d_0$$

based on the change of $L$, the size of the generated ROI adaptive square frame is adjusted flexibly so that the barbell bar is tracked and captured in real time, and, on the basis of capturing the barbell bar, a ResNet network with an added attention mechanism is introduced to judge whether the athlete is in the barbell-holding movement state.
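One way to obtain the best-matching pair $(c_0, d_0)$ is an ordinary least-squares line fit over the recorded depth and side-length pairs. The claim does not say which fitting method is used, so least squares is an assumption here, and the function names `fit_roi_side` and `roi_side` are hypothetical.

```python
def fit_roi_side(depths, sides):
    """Least-squares fit of L = c0 * z + d0 from per-frame (z, L) pairs."""
    n = len(depths)
    if n < 2 or n != len(sides):
        raise ValueError("need at least two matching (depth, side) pairs")
    mean_z = sum(depths) / n
    mean_l = sum(sides) / n
    s_zz = sum((z - mean_z) ** 2 for z in depths)
    if s_zz == 0:
        raise ValueError("depth values must not all be identical")
    s_zl = sum((z - mean_z) * (l - mean_l) for z, l in zip(depths, sides))
    c0 = s_zl / s_zz            # change of side length per unit depth
    d0 = mean_l - c0 * mean_z   # offset term
    return c0, d0


def roi_side(depth, c0, d0):
    """Side length of the adaptive square ROI at a given hand depth."""
    return c0 * depth + d0
```

Because nearer hands appear larger, a fitted `c0` would normally be negative: the ROI shrinks as the depth value grows.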

5. The depth camera based speed and strength feedback system of claim 4, wherein the Exc classification algorithm is added on the basis of the hand joint points, a traversal frame is constructed for the Exc classification algorithm to judge whether the barbell is being gripped, and residual blocks are used to address the degradation and vanishing-gradient problems that arise during image processing; to address the problem that the distance from the camera changes the size of the target region and thereby affects the accuracy of the classification network, a multi-scale pyramid module is introduced to improve the model's multi-scale feature extraction capability for small targets; the dilated pyramid structure creates new multi-scale features for the lower layers of ResNet using dilated convolutions with dilation rates of 1, 2 and 4, the input to the multi-scale pyramid module being obtained from the preceding convolution block module; the 3 resulting convolution features are fused pairwise, a channel attention mechanism is used to obtain the weights of any two convolution features when they are combined, and the final fused feature is obtained by multiplying each weight matrix by its corresponding convolution feature and concatenating the results, thereby improving classification performance for distant hands.
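The effect of the three dilation rates can be illustrated with a small Python sketch. It assumes a kernel of size 3, which the claim does not state, and uses a 1-D signal for brevity. The names `effective_kernel_size` and `dilated_conv1d` are hypothetical, and the pairwise fusion with channel attention is not modelled.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation):
    """Valid 1-D dilated correlation: taps are `dilation` samples apart."""
    span = (len(kernel) - 1) * dilation
    return [
        sum(kernel[j] * signal[i + j * dilation] for j in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


# Receptive span of a size-3 kernel at the three dilation rates in the claim.
spans = [effective_kernel_size(3, rate) for rate in (1, 2, 4)]
```

With a size-3 kernel the three branches cover 3, 5 and 9 samples with the same number of weights, which is how the pyramid sees the hand region at several scales.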

6. The depth camera based speed and strength feedback system of claim 5, wherein depth screening is applied one by one to the hand joint point depth values $z_0(x_0, y_0), z_1(x_1, y_1), z_2(x_2, y_2), \ldots, z_n(x_n, y_n)$ recorded frame by frame in consecutive frames, with an upper depth limit and a lower depth limit being set; at any moment, if the depth obtained from the hand joint point in a frame is both no less than the depth minimum $Z_{Min}$ and no greater than the depth maximum $Z_{Max}$, the athlete is considered to be in a reasonable exercise area, as follows:

$$Z_{Min} \le z_n(x_n, y_n) \le Z_{Max}$$

wherein $Z_{Min}$ and $Z_{Max}$ denote the depth minimum and the depth maximum, $z_n(x_n, y_n)$ denotes the depth value of the hand joint point in a given frame, and $x_n$, $y_n$ denote the horizontal and vertical coordinates of the hand joint point in that frame.
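The depth screening in claim 6 amounts to a range check per frame. A minimal Python sketch follows, with hypothetical function names and depth values assumed to be in metres:

```python
def in_exercise_area(depth, z_min, z_max):
    """True when Z_Min <= depth <= Z_Max for one frame's hand depth."""
    return z_min <= depth <= z_max


def screen_frames(hand_depths, z_min, z_max):
    """Indices of the frames whose hand depth lies in the valid range."""
    return [i for i, z in enumerate(hand_depths)
            if in_exercise_area(z, z_min, z_max)]
```

Frames outside the range can then be skipped before any speed or power calculation is attempted.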

7. The depth camera based speed and strength feedback system of claim 1, wherein the images obtained by the image processing module are grouped into sets of ten frames, the height change is read in real time, and at time $k-1$ there are two matrices, the best estimate $A_{k-1}$ and the covariance matrix $B_{k-1}$, as follows:

$$A_{k-1} = \begin{bmatrix} Height \\ Velocity \end{bmatrix}, \qquad B_{k-1} = \begin{bmatrix} \Sigma_{hh} & \Sigma_{hv} \\ \Sigma_{vh} & \Sigma_{vv} \end{bmatrix}$$

in the best estimate $A_{k-1}$, Height and Velocity are the height and velocity at that moment; in the covariance matrix $B_{k-1}$, $h$ is the height distribution, $v$ is the velocity distribution, $\Sigma_{hh}$ is the covariance of the height distribution with itself, $\Sigma_{hv}$ and $\Sigma_{vh}$ are the covariances between the height distribution and the velocity distribution, and $\Sigma_{vv}$ is the covariance of the velocity distribution with itself;

assuming the original estimate is correct, the height and velocity at time $k$ can be expressed by the following formula:

wherein $A_k$ and $A_{k-1}$ denote the best estimates at time $k$ and time $k-1$ respectively, $T_k$ denotes the transformation matrix, and $\Delta t$ is the elapsed time;

$$B'_k = B_k - K' H_k B_k$$

wherein $A_k$ denotes the best estimate at time $k$, $B_k$ denotes the covariance matrix at time $k$, $K'$ is the corrected Kalman gain, $A'_k$ is the best estimate of the predicted value at the next moment, $B'_k$ is the covariance matrix of the predicted value at the next moment, $R_k$ is the mean matrix of the sensor measurements, $Z_k$ is the covariance matrix of the sensor measurements, $H_k$ is the measurement correction matrix, Velocity′ is the speed of the barbell at the next moment, and Height′ is the height of the barbell at the next moment.