CN114612526A - Joint point tracking method, and Parkinson auxiliary diagnosis method and device - Google Patents

Joint point tracking method, and Parkinson auxiliary diagnosis method and device Download PDF

Info

Publication number
CN114612526A
CN114612526A (Application CN202210203296.1A)
Authority
CN
China
Prior art keywords
joint
joint point
frame
tracking
parkinson
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210203296.1A
Other languages
Chinese (zh)
Inventor
高浩
李奕
徐枫
宗睿
余新光
潘隆盛
凌至培
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority claimed from application CN202210203296.1A
Publication of CN114612526A
Legal status: Pending

Links

Images

Classifications

    • G06T 7/269: Analysis of motion using gradient-based methods
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06T 7/254: Analysis of motion involving subtraction of images
    • G06T 2207/10016: Video; Image sequence
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/30196: Human being; Person

Abstract

The invention discloses a joint point tracking method and a Parkinson auxiliary diagnosis method and device. The Parkinson auxiliary diagnosis method comprises the following steps: feeding a human body video to be detected into a human skeleton extraction model and obtaining two-dimensional joint points by multi-scale prediction; tracking the next frame of the video to obtain target two-dimensional joint points; for frames in which two-dimensional joint point tracking fails, obtaining corrected two-dimensional joint points by a front-back frame intersection method, then continuing tracking with the corrected points as new two-dimensional joint points; assembling an accurate joint point sequence from the tracked and corrected two-dimensional joint points; and inputting the joint point sequence into a Parkinson rating model to obtain Parkinson auxiliary diagnosis information. The method is simple in procedure, further improves the prediction precision of the joint point sequence, and yields effective Parkinson auxiliary diagnosis information.

Description

Joint point tracking method, and Parkinson auxiliary diagnosis method and device
Technical Field
The invention relates to a joint point tracking method and a Parkinson auxiliary diagnosis method and device, and belongs to the technical field of artificial-intelligence video image processing.
Background
Parkinson's disease is a common degenerative disease of the nervous system in the elderly. In 50 to 80 percent of cases, the disease is discovered only after the patient has developed obvious symptoms, often already affecting quality of life, so the best window for early treatment is missed. Moderate and severe Parkinson's disease is more difficult to treat and imposes a heavier medical burden.
Parkinson's disease has an insidious onset, and most of its indicators show no obvious changes under existing auxiliary examination techniques. Parkinson's disease is therefore currently diagnosed by experienced neurologists, usually on the basis of medical history, clinical symptoms and complex physical examinations. Owing to differences in the skill level of medical institutions and the practical experience of doctors, Parkinson's disease often cannot be correctly identified and diagnosed at an early stage.
In clinical diagnosis, a doctor often observes the tremor of a particular joint with the naked eye. Because the movement distance of the joint cannot be accurately estimated by eye, such diagnosis is highly subjective and prone to large errors. Human pose estimation technology can accurately extract the joint point positions of a Parkinson patient and analyze joint tremor far more precisely than visual estimation in clinical diagnosis.
Most existing human pose detection algorithms use confidence maps to predict joint point probabilities, and their accuracy largely depends on the relative size of the confidence map and of the person in the image, which limits prediction accuracy. Current single-image pose detection algorithms produce obvious jitter on video streams, while multi-image algorithms require huge computation and large video memory, neither of which is conducive to accurate joint point analysis.
CPM stands for Convolutional Pose Machines, a top-down human pose detection method. CPM is a sequential structure composed of fully convolutional networks, in which each stage operates directly on the confidence maps of the previous stage so that progressively refined joint points are output. CPM can learn feature representations of image and spatial information simultaneously and does not require constructing any explicit inter-joint relationship model. However, because it is a top-down method, its detection performance may degrade as the number of people increases when estimating multi-person poses.
The OpenPose human pose recognition method is an important method in the field of human pose estimation; it can estimate body actions, facial expressions, finger motions and the like, and has good robustness. The predecessor of OpenPose is the CPM algorithm; in multi-person scenes, CPM can also perform multi-person recognition through heat maps. OpenPose is a bottom-up human pose estimation method. It uses confidence maps (Part Confidence Maps) to represent the joint points of the human body, which suits single-person pose estimation, and uses limb affinity vector fields (Part Affinity Fields) to represent the skeleton connections of the human body, which better solves multi-person pose estimation. However, OpenPose only detects single frames of image; on video, its detections exhibit an obvious jitter problem. If native OpenPose alone is used to extract joint points, a large error is introduced into Parkinson diagnosis. Therefore, to diagnose Parkinson's disease accurately, an accurate joint point extraction method, together with a tremor analysis method based on the extracted joint points, is essential.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a joint point tracking method and a Parkinson auxiliary diagnosis method and device. They overcome the low detection precision, low detection speed and large computation of existing approaches, and compute a Parkinson tremor rating through accurate joint point tracking and detection, thereby providing effective auxiliary information for Parkinson diagnosis.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme:
In a first aspect, the present invention provides a joint point tracking method, comprising the following steps:
Step A: acquiring the frame sequence of a video to be detected;
Step B: inputting the current frame of the video into a human skeleton extraction model to obtain a joint confidence map and a limb affinity vector field, and obtaining the two-dimensional joint points of the current frame from them;
Step C: for the next frame, selecting the needed joint points as the joint points to be tracked, and tracking to obtain the target two-dimensional joint points of the next frame;
Step D: taking the two-dimensional joint points tracked in the next frame as the joint points to be tracked, and tracking backward to the previous frame to obtain backward-tracked two-dimensional joint points;
Step E: calculating the Euclidean distance between the backward-tracked two-dimensional joint points and the actual two-dimensional joint points of the previous frame;
Step F: if the Euclidean distance is too large, the joint point tracked in the next frame cannot be tracked back to the joint point of the previous frame, and tracking of that joint point is judged to have failed; otherwise, the tracked joint point is taken as the actual joint point of the next frame;
Step G: if tracking fails for the next frame, superposing the joint confidence maps of the previous and next frames by the front-back frame intersection method to narrow the joint detection area, obtaining a corrected target joint point of the next frame and taking it as the actual joint point of the next frame;
Step H: advancing the current frame index to the next frame;
Step I: repeating Steps B to H to extract the complete two-dimensional joint point sequence of the video to be detected.
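The forward-backward consistency check of Steps C to F can be sketched as follows. Here `track` is a placeholder for any point tracker (e.g. the Lucas-Kanade optical flow method used later in the description), and `thresh` is a hypothetical threshold, not a value from the patent:

```python
import numpy as np

def fb_check(track, prev_frame, next_frame, points, thresh=2.0):
    """Steps C-F: forward-backward consistency check for tracked joints.

    `track(a, b, pts)` is assumed to track `pts` from frame `a` to frame `b`.
    Returns the forward-tracked points and a per-joint success mask.
    """
    fwd = track(prev_frame, next_frame, points)            # Step C: track forward
    back = track(next_frame, prev_frame, fwd)              # Step D: track backward
    err = np.linalg.norm(np.asarray(points, float) -
                         np.asarray(back, float), axis=-1)  # Step E: distance
    return fwd, err < thresh                                # Step F: success mask
```

A joint whose mask entry is False would then be handed to the front-back frame intersection correction of Step G.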
Further, in the Parkinson auxiliary diagnosis method based on accurate joint point tracking of the present invention, the human skeleton extraction model is based on OpenPose, and its structure comprises a feature extraction module, a limb affinity vector field estimation module and a joint confidence map estimation module; the feature extraction module comprises the first ten layers of the VGG-19 network plus two convolutional layers.
the method for acquiring the human skeleton extraction model comprises the following steps:
calibrating to obtain a training data set based on the MSCOCO data set and the data of the Parkinson rating video;
inputting the training data set into a human body skeleton extraction model, and obtaining a final human body skeleton extraction model through multiple iterations and training.
Further, inputting the current frame of the video into the human skeleton extraction model to obtain the joint confidence map and limb affinity vector field further comprises multi-scale fusion estimation:
scaling the current frame proportionally to obtain images of different sizes;
inputting the images of different sizes into the human skeleton extraction model to obtain a joint confidence map for each size;
and superposing the joint confidence maps of the different sizes and averaging them to obtain the output joint confidence map.
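The multi-scale fusion idea can be sketched as below. `model` is a stand-in for the skeleton extraction network (anything that maps an image to a confidence map of the same size), and nearest-neighbor resizing is used only to keep the sketch dependency-free:

```python
import numpy as np

def resize_nn(img, out_h, out_w):
    """Nearest-neighbor resize (kept minimal for illustration)."""
    ys = (np.arange(out_h) * img.shape[0] // out_h).astype(int)
    xs = (np.arange(out_w) * img.shape[1] // out_w).astype(int)
    return img[np.ix_(ys, xs)]

def multiscale_confidence(frame, model, scales=(0.5, 1.0, 1.5, 2.0)):
    """Run `model` at each scale, resize every confidence map back to
    the original frame size, then average the superposed maps."""
    h, w = frame.shape[:2]
    acc = np.zeros((h, w))
    for s in scales:
        sh, sw = max(1, int(round(h * s))), max(1, int(round(w * s)))
        cmap = model(resize_nn(frame, sh, sw))   # confidence map at this scale
        acc += resize_nn(cmap, h, w)             # back to original resolution
    return acc / len(scales)
```

The scale factors 0.5, 1, 1.5 and 2 match those used in the embodiments below.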
Further, the method for inputting an image into the human skeleton extraction model to obtain the joint confidence map and the limb affinity vector field comprises the following steps:
Step 1: passing one frame of image through the first 10 layers of the VGG-19 network to obtain the feature map of the first stage;
Step 2: passing the feature map through convolutional layers to obtain the limb affinity vector field of the first stage;
Step 3: inputting the first-stage limb affinity vector field together with the feature map into the network, and obtaining the limb affinity vector field of the second stage through convolutional layers;
Step 4: repeating Steps 2 and 3 a specified number of times to obtain the final limb affinity vector field;
Step 5: inputting the final limb affinity vector field together with the feature map into the network, and obtaining the joint confidence map of the first stage through convolutional layers;
Step 6: inputting the first-stage joint confidence map together with the feature map into the network, and obtaining the joint confidence map of the second stage through convolutional layers;
Step 7: repeating Steps 5 and 6 a preset number of times to obtain the output joint confidence map.
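The two-part, multi-stage arrangement of Steps 1 to 7 can be sketched in PyTorch. The channel counts, stage count `t` and the simplified `Stage` block are illustrative assumptions, not the patent's exact configuration:

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """One refinement stage (simplified; the description specifies 5 blocks of
    serially connected 3x3 convolutions followed by two 1x1 convolutions)."""
    def __init__(self, in_ch, out_ch, mid=32):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(5):
            layers += [nn.Conv2d(ch, mid, 3, padding=1), nn.ReLU()]
            ch = mid
        layers += [nn.Conv2d(mid, mid, 1), nn.ReLU(), nn.Conv2d(mid, out_ch, 1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class PoseHead(nn.Module):
    """Steps 1-7: feature map F -> t affinity-field stages -> t confidence stages."""
    def __init__(self, feat_ch=128, n_pafs=38, n_joints=19, t=3, mid=32):
        super().__init__()
        self.paf_stages = nn.ModuleList(
            [Stage(feat_ch, n_pafs, mid)] +
            [Stage(feat_ch + n_pafs, n_pafs, mid) for _ in range(t - 1)])
        self.cmap_stages = nn.ModuleList(
            [Stage(feat_ch + n_pafs, n_joints, mid)] +
            [Stage(feat_ch + n_joints, n_joints, mid) for _ in range(t - 1)])

    def forward(self, feat):
        paf = self.paf_stages[0](feat)                             # Step 2
        for stage in self.paf_stages[1:]:                          # Steps 3-4
            paf = stage(torch.cat([feat, paf], dim=1))
        cmap = self.cmap_stages[0](torch.cat([feat, paf], dim=1))  # Step 5
        for stage in self.cmap_stages[1:]:                         # Steps 6-7
            cmap = stage(torch.cat([feat, cmap], dim=1))
        return cmap, paf
```

Every stage after the first re-ingests the shared feature map together with the previous stage's output, which is the refinement pattern the steps describe.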
Further, the method for tracking the target two-dimensional joint points of the next frame is the Lucas-Kanade optical flow method; the method for tracking backward to the previous frame of the video is likewise the Lucas-Kanade optical flow method.
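A minimal single-point Lucas-Kanade step, solving the 2x2 normal equations of the brightness-constancy constraint over a local window, is sketched below. This is a one-iteration sketch without an image pyramid; practical trackers use pyramidal, iterated variants:

```python
import numpy as np

def lk_track_point(prev_gray, next_gray, point, win=7):
    """Track one point from prev_gray to next_gray with a single LK step."""
    x, y = int(round(point[0])), int(round(point[1]))
    Iy, Ix = np.gradient(prev_gray)            # gradients along rows, cols
    It = next_gray - prev_gray                 # temporal derivative
    r = win // 2
    sl = (slice(y - r, y + r + 1), slice(x - r, x + r + 1))
    ix, iy, it = Ix[sl].ravel(), Iy[sl].ravel(), It[sl].ravel()
    G = np.array([[ix @ ix, ix @ iy], [ix @ iy, iy @ iy]])  # structure tensor
    b = -np.array([ix @ it, iy @ it])
    vx, vy = np.linalg.solve(G, b)             # estimated flow at the point
    return point[0] + vx, point[1] + vy
```

Running it with the frame roles swapped gives the backward tracking used by the consistency check.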
Further, the method of superposing the joint confidence maps of the previous and next frames by the front-back frame intersection method and narrowing the joint detection area to obtain the corrected target joint point of the next frame comprises:
superposing the confidence maps of the adjacent frames and taking their intersection to obtain the corrected target joint point of the next frame.
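The intersection idea can be sketched as follows, under the assumption (not stated explicitly in the patent) that the corrected joint is read off as the peak of the averaged map:

```python
import numpy as np

def intersect_correct(cmap_prev, cmap_next):
    """Superpose the adjacent frames' joint confidence maps and take the
    peak of the combined map as the corrected joint location."""
    fused = 0.5 * (cmap_prev + cmap_next)           # superposed / averaged map
    y, x = np.unravel_index(np.argmax(fused), fused.shape)
    return float(x), float(y)
```

Because a joint in consecutive frames lies in the high-confidence region of both maps, the combined peak narrows the detection area to that intersection.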
In a second aspect, the present invention provides a Parkinson auxiliary diagnosis method based on accurate joint point tracking, which builds on the joint point tracking method of the first aspect and comprises the following steps:
acquiring the complete extracted two-dimensional joint point sequence of the video to be detected;
and inputting that sequence into a Parkinson rating model to obtain the Parkinson rating information of the video and generate Parkinson auxiliary diagnosis information.
Further, the input of the Parkinson rating model is a two-dimensional joint point sequence and its output is the Parkinson tremor level;
the structure of the Parkinson rating model is based on the action classification network DD-Net, and the network comprises:
a first branch: composed of three convolutional layers plus one convolutional layer, used to predict the relations of the skeleton points within the current frame and capture the spatial-domain features of the picture;
a second branch: composed of three convolutional layers, used to compute difference data between joint points at a 1-frame interval and learn fast temporal information across frames;
a third branch: composed of three convolutional layers, used to compute difference data between joint points at a 2-frame interval and learn slow temporal information across frames;
a classification module: composed of three convolutional layers and two fully connected layers, used to fuse the features extracted by the three branches and obtain the Parkinson tremor level.
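A DD-Net-style three-branch classifier can be sketched in PyTorch as below. The layer counts, channel widths and the single-convolution head are illustrative simplifications; the 1-frame and 2-frame joint differences stand in for the fast and slow temporal branches described above:

```python
import torch
import torch.nn as nn

def branch(in_ch, mid=32):
    """Three temporal (1-D) convolutional layers, as each branch specifies."""
    return nn.Sequential(
        nn.Conv1d(in_ch, mid, 3, padding=1), nn.ReLU(),
        nn.Conv1d(mid, mid, 3, padding=1), nn.ReLU(),
        nn.Conv1d(mid, mid, 3, padding=1), nn.ReLU())

class TremorNet(nn.Module):
    """Joint sequence (B, T, J, 2) -> tremor-level logits (B, n_levels)."""
    def __init__(self, n_joints=14, n_levels=5, mid=32):
        super().__init__()
        in_ch = n_joints * 2
        self.spatial = branch(in_ch, mid)    # branch 1: per-frame skeleton
        self.fast = branch(in_ch, mid)       # branch 2: 1-frame differences
        self.slow = branch(in_ch, mid)       # branch 3: 2-frame differences
        self.head = nn.Sequential(           # simplified classification module
            nn.Conv1d(3 * mid, mid, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(mid, mid), nn.ReLU(), nn.Linear(mid, n_levels))

    def forward(self, joints):
        b, t, j, c = joints.shape
        x = joints.reshape(b, t, j * c).transpose(1, 2)  # (B, C, T)
        fast = torch.diff(x, n=1, dim=2)                 # motion at 1-frame gap
        slow = x[:, :, 2:] - x[:, :, :-2]                # motion at 2-frame gap
        T = slow.shape[2]                                # common temporal length
        feats = torch.cat([self.spatial(x)[:, :, :T],
                           self.fast(fast)[:, :, :T],
                           self.slow(slow)], dim=1)
        return self.head(feats)
```

The three feature streams are cropped to a common length and concatenated before the fusion head, mirroring the branch-then-fuse structure of the rating model.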
Further, the parkinson rating model is obtained in the following manner:
extracting a joint point sequence of a human body of the Parkinson rating video by using a human body skeleton extraction model, and calibrating the joint point sequence of the human body to obtain a rating training data set;
and inputting the rating training data set into a Parkinson rating model, and performing repeated iterative training to obtain the trained Parkinson rating model.
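The repeated iterative training mentioned above reduces to a standard supervised loop. A single step is sketched here; `model`, the optimizer settings and the cross-entropy loss against calibrated integer ratings are all assumptions, since the patent does not name a loss:

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, batch, labels):
    """One training iteration: forward pass on a batch of joint sequences,
    cross-entropy against the calibrated tremor ratings, backward pass, update."""
    optimizer.zero_grad()
    logits = model(batch)
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    return float(loss)
```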
In a third aspect, the present invention provides a joint tracking device, comprising:
an acquisition module: for acquiring the frame sequence of a video to be detected;
an extraction module: for inputting the current frame of the video into the human skeleton extraction model to obtain the joint confidence map and limb affinity vector field, and obtaining the two-dimensional joint points of the current frame from them;
a tracking module: for selecting, for the next frame, the needed joint points as the joint points to be tracked, and tracking to obtain the target two-dimensional joint points of the next frame;
a backward tracking module: for taking the two-dimensional joint points tracked in the next frame as the joint points to be tracked, and tracking backward to the previous frame to obtain backward-tracked two-dimensional joint points;
a joint point calculation module: for calculating the Euclidean distance between the backward-tracked two-dimensional joint points and the actual two-dimensional joint points of the previous frame; if the distance is too large, the joint point tracked in the next frame cannot be tracked back to the joint point of the previous frame and its tracking is judged to have failed, otherwise the tracked joint point is taken as the actual joint point of the next frame; and, if tracking fails, for superposing the joint confidence maps of the previous and next frames by the front-back frame intersection method to narrow the joint detection area, obtaining a corrected target joint point of the next frame and taking it as the actual joint point of the next frame;
a skip module: for advancing the current frame index to the next frame.
In a fourth aspect, the present invention provides a Parkinson auxiliary diagnosis device based on accurate joint point tracking, comprising:
the device of the third aspect: for extracting the complete two-dimensional joint point sequence of a video to be detected;
a joint point sequence module: for acquiring the extracted complete two-dimensional joint point sequence of the video;
an output module: for inputting the extracted sequence into the Parkinson rating model to obtain the Parkinson rating information of the video and generate Parkinson auxiliary diagnosis information.
Compared with the prior art, the invention has the following beneficial effects:
1. For the next frame of a human body video, target two-dimensional joint points are obtained by tracking; for frames in which tracking fails, corrected two-dimensional joint points are obtained by the front-back frame intersection method and used as new joint points to continue tracking; an accurate joint point sequence is assembled from the tracked and corrected joint points. This further improves the prediction precision of the joint point sequence, reduces the computation and raises the detection speed;
2. Multi-scale estimation is applied to the human skeleton extraction model, improving its detection precision without retraining the model;
3. The joint points extracted by the human skeleton extraction model are tracked with the Lucas-Kanade optical flow method, improving the extraction precision of joint estimation on video streams;
4. Taking the intersection of the joint confidence maps and limb affinity vector fields of two consecutive frames narrows the estimation range and corrects joint points whose Lucas-Kanade tracking failed, removing the need for manual correction after a tracking failure;
5. Through a Parkinson rating model that extracts both spatial-domain and temporal-domain information, the Parkinson tremor can be rated from the input joint points, avoiding the subjectivity of naked-eye observation in clinical diagnosis.
Drawings
FIG. 1 is a flow chart of the Parkinson auxiliary diagnosis method based on accurate joint point tracking according to the invention;
FIG. 2 is a network structure diagram of the human skeleton extraction model of the method;
FIG. 3 is a network structure diagram of the Parkinson rating model of the method.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Embodiment one:
the present embodiment provides a method for tracking a joint point of a human body, the method comprising the following steps:
Step 1: training the human skeleton extraction model with the MSCOCO public data set and a data set calibrated from Parkinson rating videos;
Step 2: inputting the first frame of the video to be detected into the human skeleton extraction model, and obtaining accurate two-dimensional joint points of the first frame using a multi-scale fused confidence map and limb affinity vector field;
Step 3: for the next frame of the video, taking the joint points needed for Parkinson rating as the joint points to be tracked, and tracking with the Lucas-Kanade optical flow method to obtain the target two-dimensional joint points of the next frame;
Step 4: taking the two-dimensional joint points tracked in the next frame as the joint points to be tracked, and tracking backward to the previous frame with the Lucas-Kanade optical flow method to obtain backward-tracked two-dimensional joint points;
Step 5: calculating the Euclidean distance between the backward-tracked two-dimensional joint points and the actual two-dimensional joint points of the previous frame;
Step 6: if the Euclidean distance is too large, the joint point tracked in the next frame cannot be tracked back to the joint point of the previous frame and its tracking is judged to have failed; otherwise the tracking succeeds;
Step 7: if tracking fails for the next frame, superposing the confidence maps of the previous and next frames by the front-back frame intersection method and narrowing the joint detection area to obtain the corrected target joint point of the next frame;
Step 8: taking the tracked or corrected joint point of the next frame as the actual joint point of the next frame;
Step 9: repeating Steps 2 to 8 to extract the complete two-dimensional joint point sequence of the video.
Specifically, the human skeleton extraction model is based on OpenPose. First, the model is pre-trained on the COCO data set to obtain a pre-trained model; the purpose of pre-training is to improve the generalization performance of the model.
The training method is as follows: videos of Parkinson patients are collected and the human joint points in the videos are labeled to obtain a new data set; this data set is then fed into the pre-trained model for re-training, making the human skeleton extraction model better suited to the application scenario of Parkinson skeleton extraction.
The specific steps of joint point extraction, tracking and correction are as follows:
scale the first frame of the video to factors of 0.5, 1, 1.5 and 2 and input the results into the trained human skeleton extraction model to obtain multi-scale joint confidence maps and limb affinity vector fields;
then superpose and average the multi-scale joint confidence maps and limb affinity vector fields to obtain the multi-scale fused versions, yielding more accurate two-dimensional joint points for the first frame;
for the second frame of the video, take the joint points needed for Parkinson rating in the previous step as the joint points to be tracked, and track their two-dimensional positions in the second frame image with the Lucas-Kanade optical flow method;
to verify whether the joint points tracked in the previous step are accurate, take those second-frame joint points as the points to be tracked, track backward to the first frame image with the Lucas-Kanade optical flow method to obtain backward-tracked two-dimensional joint points, and compute the Euclidean distance between these points and the joint points estimated by the model in the first frame;
if that Euclidean distance is smaller than a set threshold, the joint point tracked in the second frame can be tracked back to the first-frame joint point and is judged accurate; otherwise it needs correction;
in a video with continuous frames, a person's joint points always lie in the intersection of the joint confidence maps of two consecutive frames, so a front-back frame intersection method is designed to narrow the detection range of joint points across frames. To correct a second-frame joint point, first extract the joint confidence maps of the first and second frames with the human skeleton extraction model, then superpose and average the maps, narrowing the joint detection area and thereby correcting the second-frame joint point;
repeating the above steps finally yields the two-dimensional joint point sequence of the video.
Multi-scale fusion estimation (multi-scale testing) is an important idea in the field of object detection, because the size of the input picture significantly affects detection performance. In deep learning, a convolution over an enlarged input effectively sees a larger object, while over a shrunken input it sees only a smaller picture. Taking human pose estimation as an example, when the person to be detected is small, the areas of the extracted confidence map and affinity vector field are small, which reduces the model's prediction accuracy to some extent. In object detection, the final detection layers of YOLOv5 fuse feature maps at multiple scales, improving detection precision for both very large and very small objects.
As shown in fig. 2, fig. 2 is a network structure diagram of a human skeleton recognition model. The human skeleton recognition model mainly predicts a joint confidence map and a limb affinity vector field. The joint confidence map represents the probability value of a certain joint point appearing on each pixel point in the image, the generated joint confidence map generates a probability area in Gaussian distribution at the origin point by taking the certain joint point as the origin coordinate, the maximum probability value of the origin point is 1, and the probability value is smaller towards the periphery by taking the origin point, namely the joint point, as the center. The limb affinity vector field represents the position and orientation of the limb in the image, and is generated in a manner similar to the joint confidence map. Firstly, the human body skeleton recognition model adds two convolutional layers on the first ten layers of a VGG-19 network, and extracts a characteristic diagram F of an image. The VGG then feeds the feature map F into the first part of the network to predict the limb affinity vector field. The network module of the first stage of the first part consists of 7 convolution blocks, the first 5 convolution blocks are 3 layers of 3 × 3 convolutions, each 3 × 3 convolution is connected in series, and the last two convolution blocks are 1 layer of 1 × 1 convolution, so that the limb affinity vector field L1 of the first stage is finally obtained. The feature map F of the image and the limb affinity vector field L1 obtained by the first-stage prediction constitute the input of the second part of the first stage, and the first part of the limb affinity vector field Lt is obtained by repeating the steps t times. The VGG then feeds the feature map F and the limb affinity vector field Lt into the second part of the network to predict a joint confidence map. 
The network structure of each stage of the second part is the same as that of the first part, and after repeating t times the joint confidence map St of the second part is obtained. Finally, according to the joint confidence map and the limb affinity vector field, the network outputs the joint point coordinates of the current frame.
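The Gaussian joint confidence map described above can be sketched directly: the map peaks at 1 on the joint's pixel and decays with distance from it. In the sketch below, `sigma` is an assumed spread parameter, not a value taken from the source:

```python
import numpy as np

def joint_confidence_map(h, w, joint_xy, sigma=2.0):
    """Build an (h, w) confidence map with a Gaussian peak of 1 at the joint.
    joint_xy = (x, y) pixel coordinates; sigma is an assumed spread."""
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    d2 = (xs - joint_xy[0]) ** 2 + (ys - joint_xy[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))
```

One such map is generated per joint type; the network's training target is this map, and at inference the joint is recovered as the map's maximum.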
Example two:
as shown in fig. 1, the Parkinson auxiliary diagnosis method based on accurate joint tracking provided by this embodiment includes the following steps:
the human skeleton extraction model is a human skeleton extraction model based on Openpos. Firstly, pre-training a human skeleton extraction model based on a COCO data set to obtain a pre-training model of the human skeleton extraction model, wherein the pre-training aims to improve the generalization performance of the human skeleton extraction model;
then, collecting videos of people with Parkinson's disease and labeling the human body joint points in the videos to obtain a new data set; this data set is fed into the pre-trained human skeleton extraction model for further training, making the model better suited to the application scenario of Parkinson skeleton extraction;
secondly, scaling the first frame image of the video by factors of 0.5, 1, 1.5 and 2, and inputting the scaled images into the trained human skeleton extraction model to obtain multi-scale joint confidence maps and limb affinity vector fields;
then superimposing the multi-scale joint confidence maps and limb affinity vector fields and averaging them to obtain the multi-scale fused joint confidence map and limb affinity vector field, from which more accurate two-dimensional joint points of the first frame are obtained;
for the second frame of the video, taking the joint points that the Parkinson rating needs to evaluate as the joint points to be tracked, and using the Lucas-Kanade optical flow method to track the two-dimensional joint points in the second frame image of the video;
in order to verify whether the joint points tracked in the previous step are accurate, taking those second-frame joint points as the joint points to be tracked and reverse-tracking them back to the first frame image of the video with the Lucas-Kanade optical flow method to obtain reverse-tracked two-dimensional joint points, then calculating the Euclidean distance between these joint points and the first-frame joint points estimated by the model;
if the Euclidean distance in the previous step is smaller than the set threshold, the joint points tracked in the second frame can be reverse-tracked back to the joint points of the first frame, and the second-frame tracking is judged accurate; otherwise the joint points need to be corrected;
for a video of continuous frames, the joint points of a person always lie in the intersection of the joint confidence maps of two consecutive frames, so a front-and-back frame intersection method is designed to narrow the detection range of the joint points across frames. First, the joint confidence maps of the first and second frames are extracted with the human skeleton extraction model; these maps are then superimposed and averaged, narrowing the detection area of the joint point so as to correct the joint point of the second frame;
repeating the steps to finally obtain a two-dimensional joint point sequence of the video;
extracting a joint point sequence of a human body of the Parkinson rating video by using a human body skeleton extraction model, and calibrating to obtain a rated training data set;
inputting a group of joint point sequences from the rated training data set into the convolutional layers, extracting the spatial-domain features of the current frame and the temporal-domain features of consecutive frames, inputting these spatial and temporal features into the classification module, and training to obtain the Parkinson rating model;
inputting the video of the Parkinson's disease person to be detected into a human body skeleton extraction model to obtain a two-dimensional joint point sequence, inputting the two-dimensional joint point sequence into a Parkinson rating model, and finally obtaining the rating of the Parkinson's disease person.
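The forward-then-backward consistency test used in the steps above can be written generically. Here `track_fwd` and `track_bwd` stand in for the actual Lucas-Kanade calls (e.g. OpenCV's `cv2.calcOpticalFlowPyrLK`), and the 2-pixel threshold is an assumed value, since the source only speaks of "a set threshold":

```python
import numpy as np

def forward_backward_check(track_fwd, track_bwd, pts_prev, threshold=2.0):
    """Track points to the next frame, track them back again, and accept
    only those whose round-trip Euclidean error stays under threshold.
    track_fwd / track_bwd map an (N, 2) point array to tracked positions."""
    pts_next = track_fwd(pts_prev)          # frame t -> t+1
    pts_back = track_bwd(pts_next)          # frame t+1 -> t
    err = np.linalg.norm(pts_back - pts_prev, axis=1)
    return pts_next, err < threshold        # tracked points + per-point validity
```

Points that fail the check would then be handed to the correction step described above rather than accepted as the next frame's joints.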
Specifically, the input of the Parkinson rating model is a two-dimensional joint point sequence, and the output is the tremor level of the Parkinson;
the structure of the Parkinson rating model is based on an action classification network DD-Net, and the network structure comprises:
the first branch is: composed of three convolution layers, and is used for predicting the relation of skeleton points on the current frame picture and capturing the spatial domain characteristics of the picture;
the second branch is: composed of three convolution layers, and is used for calculating difference data between joint points at an interval of 1 frame and learning the fast time sequence information of different frames;
the third branch is: composed of three convolution layers, and is used for calculating difference data between joint points at an interval of 2 frames and learning the slow time sequence information of different frames;
a classification module: composed of three convolution layers and two fully connected layers, and is used for fusing the features extracted by the three branches to obtain the Parkinson tremor level.
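The frame-difference inputs of the second and third branches can be sketched as follows; `stride=1` produces the fast-motion features and `stride=2` the slow ones. The (T, J, 2) layout is an assumption about how the joint sequence is arranged:

```python
import numpy as np

def temporal_differences(seq, stride):
    """seq: (T, J, 2) array of J joint coordinates over T frames.
    Returns the (T - stride, J, 2) differences between frames that are
    `stride` apart: stride=1 feeds the fast branch, stride=2 the slow one."""
    return seq[stride:] - seq[:-stride]
```

Each branch then convolves its own difference stream before the classification module fuses all three.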
The training method of the Parkinson rating model comprises the following steps:
extracting a joint point sequence of a human body of the Parkinson rating video by using a human body skeleton extraction model, and calibrating the joint point sequence of the human body to obtain a rating training data set;
inputting a group of joint point sequences in a training data set into a Parkinson rating model, and extracting spatial domain characteristics of a current frame and time domain characteristics of continuous frames through three branches in the Parkinson rating model;
inputting the spatial features and the time features into a classification module to obtain the tremor rating of the Parkinson;
and obtaining the trained Parkinson rating model through repeated iterative training.
In computer vision, optical flow is the pattern of apparent motion of objects, surfaces, and edges in a scene caused by the relative motion between an observer (e.g., an eye or a camera) and the scene. The ideal output of an optical flow algorithm is an estimate of the velocity of each pixel between two images, or equivalently a displacement vector for each pixel of one image indicating its relative position in the other image. When this is computed for every pixel in the image, it is commonly called "dense optical flow"; an alternative class of algorithms, called "sparse optical flow", tracks only a subset of points in the image. The Lucas-Kanade optical flow algorithm is a two-frame differential sparse optical flow estimation algorithm commonly used for tracking points.
Then, to improve prediction accuracy, each image is scaled by factors of 0.5, 1, 1.5 and 2; the resulting images of different scales are input into the human skeleton recognition model to obtain joint confidence maps and limb affinity vector fields at the corresponding scales, and these are fused, which improves the prediction precision of the joint points and extracts more accurate joint point coordinates.
In order to effectively track the coordinates of the joint points, a joint tracking module is designed. The tracking module mainly adopts the Lucas-Kanade optical flow method, a sparse, feature-point-based tracking method built on two-frame differences, whose basic idea rests on the following three assumptions.
Constant brightness: the pixels of a target appear the same from frame to frame as it moves through the scene.
Temporal persistence (small motion): the image motion changes slowly over time, i.e. displacements between consecutive frames are small.
Spatial coherence: neighboring points on the same surface in the scene have similar motion and project to nearby points on the image plane.
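Under these assumptions, Lucas-Kanade reduces to one small least-squares problem per tracked point: the brightness-constancy constraint Ix·u + Iy·v = -It is stacked over all pixels of a window around the point and solved for the flow (u, v). A minimal sketch with precomputed gradients (real implementations such as cv2.calcOpticalFlowPyrLK add image pyramids and iterative refinement):

```python
import numpy as np

def lk_flow(Ix, Iy, It):
    """Solve the windowed Lucas-Kanade system in least squares:
    stack Ix*u + Iy*v = -It over every pixel of the window."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # (N, 2) gradient matrix
    b = -It.ravel()                                  # (N,) temporal gradients
    flow, *_ = np.linalg.lstsq(A, b, rcond=None)
    return flow                                      # (u, v)
```

The spatial-coherence assumption is what justifies sharing one (u, v) across the whole window; when it breaks, the system becomes inconsistent and the estimate degrades, which is exactly when the backward-tracking check below catches the failure.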
The Lucas-Kanade optical flow method can effectively track feature points such as joint points, but when two consecutive frames do not satisfy the above assumptions, tracking fails. To determine whether the tracking of a joint point has failed, the joint point tracked in the next frame can be reverse-tracked back to the previous frame and compared with the actual joint point of that frame. It therefore suffices to calculate the Euclidean distance between the back-tracked joint point and the actual joint coordinates of the previous frame: if this distance is too large, the joint point tracked in the next frame cannot be traced back to the previous frame's joint point, and the tracking of the next frame is judged to have failed.
To compensate for occasional tracking failures, a joint correction module is designed, which uses a front-and-back frame intersection method to promptly correct joint points whose tracking failed. Because parkinsonian tremor has a small motion amplitude, the joint points tend to lie in the intersection of the joint confidence maps of two consecutive frames. The front-and-back frame intersection method in the joint correction module exploits this property to narrow the estimation range of the joint points and correct joints whose tracking failed. Finally, through accurate joint tracking, an accurate two-dimensional joint point sequence of a Parkinson patient can be extracted from the patient's video.
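The correction itself can be sketched as fusing the two consecutive confidence maps and re-reading the joint at the fused peak; this is a simplification of the intersection idea, which in the source narrows the detection area before re-estimating:

```python
import numpy as np

def correct_joint(heat_prev, heat_next):
    """Fuse the confidence maps of two consecutive frames and return
    the (row, col) of the highest fused confidence as the corrected joint."""
    fused = 0.5 * (heat_prev + heat_next)   # averaging keeps only jointly
    return np.unravel_index(np.argmax(fused), fused.shape)  # supported peaks
```

Averaging suppresses spurious peaks that appear in only one frame, which is why the corrected joint tends to land inside the two frames' intersection region.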
As shown in fig. 3, fig. 3 is a network structure diagram of the Parkinson rating model. The network has three branches of incoming data streams. The first branch predicts the skeleton point relations of each frame of the picture; this information captures the static information in the picture. The second and third branches learn the fast and slow optical-flow information of different frames by differencing frames at different intervals; this information captures the temporal information of the action. Finally, the features obtained by the three branches are fused, and inputting the joint point sequence into the network yields the Parkinson rating result.
Example three:
the present embodiment provides a joint point tracking device, comprising:
an acquisition module: the method comprises the steps of obtaining a frame sequence of a video to be tested;
an extraction module: used for inputting the current frame of the video to be detected into the human skeleton extraction model to obtain a joint confidence map and a limb affinity vector field, and obtaining the two-dimensional joint points of the current frame according to the joint confidence map and the limb affinity vector field;
a tracking module: used for selecting the needed joint points as the joint points to be tracked for the next frame of the current frame, and tracking to obtain the target two-dimensional joint points of the next frame;
an anti-tracking module: the two-dimensional joint point tracking device is used for reversely tracking the previous frame of the video by taking the two-dimensional joint point tracked by the next frame of the video as the joint point to be tracked to obtain the two-dimensional joint point tracked by the reverse direction of the previous frame;
the joint point calculation module: used for calculating the Euclidean distance between the reverse-tracked two-dimensional joint point of the previous frame and the actual two-dimensional joint point of the previous frame; if the Euclidean distance is too large, the joint point tracked in the next frame cannot be reverse-tracked back to the joint point of the previous frame, and the tracking of the next frame's joint point is judged to have failed; otherwise, the tracked joint point is taken as the actual joint point of the next frame; if the tracking fails, the joint confidence maps of the previous and next frames are superimposed using the front-and-back frame intersection method to narrow the detection area of the joint point, obtaining a corrected target joint point of the next frame, which is taken as the actual joint point of the next frame;
a skip module: the sequence number for skipping the current frame is the next frame;
an output sequence module: and the two-dimensional joint point sequence is used for outputting a complete two-dimensional joint point sequence of the video to be detected.
The apparatus of the present embodiment can be used to implement the method of the first embodiment.
Example four:
this embodiment provides a Parkinson auxiliary diagnosis device based on accurate joint tracking, comprising:
an acquisition module: the method comprises the steps of obtaining a frame sequence of a video to be tested;
an extraction module: used for inputting the current frame of the video to be detected into the human skeleton extraction model to obtain a joint confidence map and a limb affinity vector field;
a tracking module: used for selecting the needed joint points as the joint points to be tracked for the next frame of the current frame, and tracking to obtain the target two-dimensional joint points of the next frame;
an anti-tracking module: the two-dimensional joint point tracking device is used for reversely tracking the previous frame of the video by taking the two-dimensional joint point tracked by the next frame of the video as the joint point to be tracked to obtain the two-dimensional joint point tracked by the reverse direction of the previous frame;
the joint point calculation module: used for calculating the Euclidean distance between the reverse-tracked two-dimensional joint point of the previous frame and the actual two-dimensional joint point of the previous frame; if the Euclidean distance is too large, the joint point tracked in the next frame cannot be reverse-tracked back to the joint point of the previous frame, and the tracking of the next frame's joint point is judged to have failed; otherwise, the tracked joint point is taken as the actual joint point of the next frame; if the tracking fails, the joint confidence maps of the previous and next frames are superimposed using the front-and-back frame intersection method to narrow the detection area of the joint point, obtaining a corrected target joint point of the next frame, which is taken as the actual joint point of the next frame;
a skip module: the sequence number for skipping the current frame is the next frame;
an output sequence module: and the two-dimensional joint point sequence is used for outputting a complete two-dimensional joint point sequence of the video to be detected.
Joint point sequence module: the method comprises the steps of obtaining a complete two-dimensional joint point sequence of an extracted video to be detected;
an output module: and the method is used for inputting the extracted complete two-dimensional joint point sequence of the video to be detected into the Parkinson rating model to obtain the Parkinson rating information of the video to be detected and generate the Parkinson auxiliary diagnosis information.
This embodiment can be used to implement the method described in embodiment two.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A method for tracking a joint, the method comprising the steps of:
step A: acquiring a frame sequence of a video to be detected;
and B: inputting a current frame of a video to be detected into a human body skeleton extraction model to obtain a joint confidence map and a limb affinity vector field, and obtaining two-dimensional joint points of the current frame according to the joint confidence map and the limb affinity vector field;
and C: selecting a needed joint point as a joint point to be tracked for the next frame of the current frame, and tracking to obtain a target two-dimensional joint point of the next frame;
step D: taking a two-dimensional joint point obtained by tracking the next frame of the video as a joint point to be tracked, and reversely tracking the previous frame of the video to obtain a two-dimensional joint point obtained by reversely tracking the previous frame;
step E: calculating the Euclidean distance between the two-dimensional joint points tracked reversely in the previous frame and the actual two-dimensional joint points in the previous frame;
step F: if the Euclidean distance is too large, the joint point tracked in the next frame cannot be reverse-tracked back to the joint point of the previous frame, and the tracking of the next frame's joint point is judged to have failed; otherwise, the tracked joint point is taken as the actual joint point of the next frame;
step G: if the joint point tracking of the next frame fails, overlapping the joint confidence maps of the previous frame and the next frame by using a front-and-back frame intersection method, reducing the detection area of the joint point to obtain a corrected target joint point of the next frame, and using the corrected joint point of the next frame as the actual joint point of the next frame;
step H: the sequence number of the current frame is skipped to be the next frame;
step I: and D, repeating the step B to the step H, and extracting to obtain a complete two-dimensional joint point sequence of the video to be detected.
2. The joint point tracking method according to claim 1, wherein the human skeleton extraction model is based on OpenPose, and the structure of the human skeleton extraction model comprises: a feature extraction module, a limb affinity vector field estimation module and a joint confidence map estimation module; the feature extraction module comprises the first ten layers of the VGG-19 network plus two convolutional layers;
the method for acquiring the human skeleton extraction model comprises the following steps:
calibrating to obtain a training data set based on the MSCOCO data set and the data of the Parkinson rating video;
and inputting the training data set into a human body skeleton extraction model, and obtaining a final human body skeleton extraction model through multiple iterations and training.
3. The joint point tracking method according to claim 1, wherein the method of inputting the current frame of the video to be detected into the human skeleton extraction model to obtain the joint confidence map and the limb affinity vector field comprises:
using multi-scale fusion estimation to obtain images with different sizes by scaling the current frame by equal ratio;
inputting the images with different sizes into a human body skeleton extraction model to obtain joint confidence maps of the images with different sizes;
and superposing the joint confidence maps of the images with different sizes to obtain an average value, and obtaining an output joint confidence map.
4. The joint point tracking method according to claim 3, wherein the method of inputting the image into the human skeleton extraction model to obtain the joint confidence map and the limb affinity vector field comprises:
step 1: transmitting a frame of image into a first 10 layers of a VGG-19 network to obtain a feature map of a first stage;
step 2: obtaining a limb affinity vector field of a first stage by the feature map through a convolutional layer;
and step 3: inputting the limb affinity vector field and the characteristic diagram of the first stage into a network, and acquiring a limb affinity vector field of a second stage through a convolutional layer;
and 4, step 4: repeating the steps 2 and 3 to the specified times to obtain a final limb affinity vector field;
and 5: inputting the final limb affinity vector field and the feature map into a network, and acquiring a joint confidence map of a first stage through a convolutional layer;
step 6: inputting the joint confidence coefficient diagram and the feature diagram of the first stage into a network, and acquiring a joint confidence coefficient diagram of a second stage through a convolutional layer;
and 7: and repeating the steps 5 and 6 to a preset number of times to obtain an output joint confidence map.
5. The joint point tracking method according to claim 1, wherein the method for tracking the target two-dimensional joint point of the next frame is Lucas-Kanade optical flow method; the method for tracking the last frame of the video backward is the Lucas-Kanade optical flow method.
6. The joint point tracking method according to claim 1, wherein the method for obtaining the modified target joint point of the next frame by reducing the detection area of the joint point by superimposing the joint confidence maps of the previous frame and the next frame using the intersection method of the previous frame and the next frame comprises:
and overlapping the confidence maps of the adjacent frames to obtain an intersection to obtain a corrected target joint point of the next frame.
7. A Parkinson auxiliary diagnosis method based on accurate joint tracking, characterized in that, based on the joint point tracking method of any one of claims 1-6, it comprises the following steps:
acquiring a complete two-dimensional joint point sequence of the extracted video to be detected;
and inputting the extracted complete two-dimensional joint point sequence of the video to be detected into a Parkinson rating model to obtain the Parkinson rating information of the video to be detected and generate the Parkinson auxiliary diagnosis information.
8. The parkinson's assisted diagnosis method of claim 7, wherein the input of the parkinson rating model is a two-dimensional joint point sequence, and the output is a tremor level of the parkinson;
the structure of the Parkinson rating model is based on an action classification network DD-Net, and the network structure comprises:
the first branch is: composed of three convolution layers, and is used for predicting the relation of skeleton points on the current frame picture and capturing the spatial domain characteristics of the picture;
the second branch is as follows: the system comprises three convolution layers, wherein the convolution layers are used for calculating difference data between joint points and learning quick time sequence information of different frames at intervals of 1 frame;
the third branch is as follows: the system comprises three convolution layers, wherein the convolution layers are used for calculating difference data among joint points by spacing 2 frames and learning the slow time sequence information of different frames;
a classification module: the device comprises three convolution layers and two full-connection layers, and is used for fusing the characteristics extracted by the three branches to obtain the tremor grade of the Parkinson;
the acquisition mode of the Parkinson rating model is as follows:
extracting a joint point sequence of a human body of the Parkinson rating video by using a human body skeleton extraction model, and calibrating the joint point sequence of the human body to obtain a rating training data set;
inputting the rating training data set into a Parkinson rating model, and performing repeated iterative training to obtain the trained Parkinson rating model.
9. A joint point tracking device, comprising:
an acquisition module: the method comprises the steps of obtaining a frame sequence of a video to be tested;
an extraction module: used for inputting the current frame of the video to be detected into the human skeleton extraction model to obtain a joint confidence map and a limb affinity vector field;
a tracking module: the method comprises the following steps of selecting needed joint points as joint points to be tracked for the next frame of a current frame, and tracking to obtain target two-dimensional joint points of the next frame;
an anti-tracking module: the system comprises a video processing unit, a tracking unit and a tracking unit, wherein the video processing unit is used for reversely tracking a previous frame of a video by taking a two-dimensional joint point obtained by tracking the next frame of the video as a joint point to be tracked to obtain a two-dimensional joint point obtained by reversely tracking the previous frame;
the joint point calculation module: the Euclidean distance calculation module is used for calculating the Euclidean distance between the two-dimensional joint points tracked reversely in the previous frame and the actual two-dimensional joint points in the previous frame; if the Euclidean distance is too large, the joint point tracked by the next frame cannot be reversely tracked to obtain the joint point of the previous frame, the joint point tracked by the next frame is judged to be failed to track, otherwise, the joint point tracked by the next frame is used as the actual joint point of the next frame; if the joint point tracking of the next frame fails, overlapping the joint confidence maps of the previous frame and the next frame by using a front-and-back frame intersection method, reducing the detection area of the joint point to obtain a corrected target joint point of the next frame, and using the corrected joint point of the next frame as the actual joint point of the next frame;
a skip module: the sequence number for skipping the current frame is the next frame.
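The forward-backward consistency check and the confidence-map intersection fallback described in the joint point calculation module can be sketched as below. This is an illustrative sketch only: the distance threshold, function names, and toy confidence maps are assumptions, not values from the patent.

```python
import numpy as np

DIST_THRESHOLD = 5.0  # pixels; illustrative value, not specified in the claim

def verify_tracked_joint(prev_joint, back_tracked_joint, threshold=DIST_THRESHOLD):
    """Return True when reverse tracking lands close enough to the actual
    joint of the previous frame, i.e. forward tracking is considered valid."""
    dist = np.linalg.norm(np.asarray(back_tracked_joint) - np.asarray(prev_joint))
    return dist <= threshold

def intersect_confidence_maps(conf_prev, conf_next):
    """Fallback on tracking failure: overlap (element-wise product) the
    previous/next joint confidence maps and take the peak of the overlap,
    which narrows the joint point detection area."""
    overlap = conf_prev * conf_next
    iy, ix = np.unravel_index(np.argmax(overlap), overlap.shape)
    return float(ix), float(iy)

# Forward-track frame t -> t+1, back-track t+1 -> t, then compare.
prev = (120.0, 80.0)       # actual joint in frame t
back_ok = (121.5, 79.0)    # back-tracked estimate, small drift -> accept
back_bad = (150.0, 95.0)   # back-tracked estimate, large drift -> reject
print(verify_tracked_joint(prev, back_ok))   # True
print(verify_tracked_joint(prev, back_bad))  # False

# Toy 5x5 confidence maps: only the location supported by BOTH frames survives.
conf_prev = np.zeros((5, 5)); conf_prev[2, 3] = 1.0
conf_next = np.zeros((5, 5)); conf_next[2, 3] = 0.8; conf_next[0, 0] = 0.9
print(intersect_confidence_maps(conf_prev, conf_next))  # (3.0, 2.0)
```

The product of the two maps suppresses spurious peaks that appear in only one frame (like the `[0, 0]` response above), which is one plausible reading of "overlapping the joint confidence maps to reduce the detection area".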
10. A Parkinson auxiliary diagnosis device based on accurate joint point tracking, comprising:
the device of claim 9: configured to extract the complete two-dimensional joint point sequence of the video to be tested;
a joint point sequence module: configured to obtain the extracted complete two-dimensional joint point sequence of the video to be tested;
an output module: configured to input the extracted complete two-dimensional joint point sequence of the video to be tested into the Parkinson rating model, obtaining Parkinson rating information of the video to be tested and generating Parkinson auxiliary diagnosis information.
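How the modules of claims 9 and 10 compose end-to-end can be sketched as a simple pipeline. All names here are hypothetical stand-ins: `extract_joints` plays the role of the claim-9 tracking device applied per frame, and `rate_parkinson` plays the role of the trained rating model.

```python
def diagnose(video_frames, extract_joints, rate_parkinson):
    """Run joint extraction over every frame to build the complete
    two-dimensional joint point sequence, then feed that sequence to the
    rating model to produce auxiliary diagnosis information."""
    joint_sequence = [extract_joints(frame) for frame in video_frames]
    grade = rate_parkinson(joint_sequence)
    return {"tremor_grade": grade, "n_frames": len(joint_sequence)}

# Stub components, for demonstration only.
frames = ["f0", "f1", "f2"]
result = diagnose(frames,
                  extract_joints=lambda f: (0.0, 0.0),
                  rate_parkinson=lambda seq: 1)
print(result)  # {'tremor_grade': 1, 'n_frames': 3}
```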
CN202210203296.1A 2022-03-02 2022-03-02 Joint point tracking method, and Parkinson auxiliary diagnosis method and device Pending CN114612526A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210203296.1A CN114612526A (en) 2022-03-02 2022-03-02 Joint point tracking method, and Parkinson auxiliary diagnosis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210203296.1A CN114612526A (en) 2022-03-02 2022-03-02 Joint point tracking method, and Parkinson auxiliary diagnosis method and device

Publications (1)

Publication Number Publication Date
CN114612526A true CN114612526A (en) 2022-06-10

Family

ID=81861828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210203296.1A Pending CN114612526A (en) 2022-03-02 2022-03-02 Joint point tracking method, and Parkinson auxiliary diagnosis method and device

Country Status (1)

Country Link
CN (1) CN114612526A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114898471A (en) * 2022-07-12 2022-08-12 华中科技大学 Behavior detection method based on human skeleton characteristics and storage medium

Similar Documents

Publication Publication Date Title
CN111881887A (en) Multi-camera-based motion attitude monitoring and guiding method and device
US8636361B2 (en) Learning-based visual attention prediction system and method thereof
US20050201594A1 (en) Movement evaluation apparatus and method
CN110599421B (en) Model training method, video fuzzy frame conversion method, device and storage medium
Chen et al. Mvhm: A large-scale multi-view hand mesh benchmark for accurate 3d hand pose estimation
JP7422456B2 (en) Image processing device, image processing method and program
Thang et al. Estimation of 3-D human body posture via co-registration of 3-D human model and sequential stereo information
Jaroensri et al. A video-based method for automatically rating ataxia
CN113989928B (en) Motion capturing and redirecting method
CN114612526A (en) Joint point tracking method, and Parkinson auxiliary diagnosis method and device
Li et al. Learning laparoscope actions via video features for proactive robotic field-of-view control
CN116958872A (en) Intelligent auxiliary training method and system for badminton
CN117137435A (en) Rehabilitation action recognition method and system based on multi-mode information fusion
CN115841602A (en) Construction method and device of three-dimensional attitude estimation data set based on multiple visual angles
Min et al. COEB-SLAM: A Robust VSLAM in Dynamic Environments Combined Object Detection, Epipolar Geometry Constraint, and Blur Filtering
CN114463334A (en) Inner cavity vision SLAM method based on semantic segmentation
Valenzuela et al. A spatio-temporal hypomimic deep descriptor to discriminate parkinsonian patients
Ravichandran Biopose-3D and PressNet-KL: A path to understanding human pose stability from video
Lessa et al. SoccerKicks: a Dataset of 3D dead ball kicks reference movements for humanoid robots
Shao et al. Smudlp: Self-teaching multi-frame unsupervised endoscopic depth estimation with learnable patchmatch
Wang et al. 3D-2D spatiotemporal registration for sports motion analysis
Palanimeera et al. Yoga Posture Recognition by Learning Spatial-Temporal Feature with Deep Learning Techniques
CN117409485B (en) Gait recognition method and system based on posture estimation and definite learning
Patil et al. Early Detection of Hemiplegia by Analyzing the Gait Characteristics and Walking Patterns Using
CN114359328B (en) Motion parameter measuring method utilizing single-depth camera and human body constraint

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination