CN117497129A - Game rehabilitation scene participation degree behavior recognition method based on vision - Google Patents
- Publication number
- CN117497129A CN117497129A CN202310634168.7A CN202310634168A CN117497129A CN 117497129 A CN117497129 A CN 117497129A CN 202310634168 A CN202310634168 A CN 202310634168A CN 117497129 A CN117497129 A CN 117497129A
- Authority
- CN
- China
- Prior art keywords
- feature
- behavior recognition
- participation degree
- extraction unit
- feature extraction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H20/00—ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
- G16H20/30—ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to physical therapies or activities, e.g. physiotherapy, acupressure or exercising
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/18—Eye characteristics, e.g. of the iris
- G06V40/197—Matching; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/70—Multimodal biometrics, e.g. combining information from different biometric modalities
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a vision-based participation degree behavior recognition method for game rehabilitation scenes. The method comprises the following steps: constructing a vision-based participation degree behavior recognition model for game rehabilitation scenes, wherein the participation degree behavior recognition model comprises an eye feature extraction unit, a gesture feature extraction unit and a time sequence feature extraction unit; training the participation degree behavior recognition model: using a server, the model is trained on a training sample video data set collected in a game rehabilitation scene, and its parameters are optimized by reducing a network loss function until the model converges, so as to obtain a trained participation degree behavior recognition model; and identifying participation degree behavior in a new game rehabilitation scene by using the trained participation degree behavior recognition model. The method is simple, effective and highly accurate, and can provide real-time supervision of and feedback on a patient's rehabilitation training.
Description
Technical Field
The invention relates to the fields of rehabilitation medicine and computer vision, in particular to a game rehabilitation scene participation degree behavior identification method based on vision.
Background
Surveys have shown that stroke is the leading cause of disability in adults. Numerous clinical studies show that rehabilitation training is an effective way to improve the motor function of stroke patients and to promote recovery. Rehabilitation participation is defined as a state in which the patient is motivated and actively strives to take part in rehabilitation training. It reflects the patient's attitude towards rehabilitation, understanding of the task requirements, and the effectiveness of the entire training process. Previous studies have shown that high patient involvement is an important factor in promoting neural reorganization. Even if a patient is unable to perform an action, the willingness to exercise actively is necessary for rehabilitation. However, the repetitive nature of rehabilitation exercises tends to make patients bored and tired. In practice, patients often exhibit low levels of participation during rehabilitation training, resulting in poor rehabilitation outcomes and safety hazards. In traditional rehabilitation training, supervision and correction by therapists can reduce the impact of low patient engagement. However, rehabilitation physicians are in short supply, and patients often perform training tasks without supervision or correction. Evaluating a patient's participation in rehabilitation training is therefore very important, as it benefits the assessment of rehabilitation outcomes and the adjustment of training tasks.
Researchers have developed a number of methods for assessing patient engagement, mainly including score-based scales and physiological-signal-based methods. Scale-based evaluation requires a rehabilitation physician to observe the patient during rehabilitation training and score participation indicators. This approach inevitably introduces the physician's subjective judgment and increases the physician's workload. Physiological signals generated during rehabilitation training are also often used to assess patient engagement.
In Chinese published patent application CN105054927A, Zhang Jinhua et al. provide a method for quantitatively evaluating the degree of active participation in a lower limb rehabilitation system by detecting the patient's EEG and EMG signals in real time. The drawback of this method is that the acquisition of physiological signals depends on wearable sensors, which are inconvenient to wear and liable to cause the patient discomfort during rehabilitation training.
The low-engagement behavior of stroke patients in virtual-game rehabilitation training is often accompanied by characteristic facial and postural behaviors. Inspired by this finding, the present invention seeks to use vision techniques to capture changes in facial behavior, such as eyelid movement, pupil movement, degree of eye opening, head posture and fatigued facial expressions, in order to intuitively detect the patient's subjective participation. In contrast to physiological-signal-based methods, vision-based methods evaluate in a non-invasive manner and do not interfere with the patient's rehabilitation training. However, no previous study has applied vision techniques to detect the engagement of stroke patients.
The matters in this background section are only intended to aid understanding of the invention and do not necessarily constitute prior art in the field.
Disclosure of Invention
To address the technical shortcomings of the prior art, the invention provides a vision-based participation degree behavior recognition method for game rehabilitation scenes, which applies vision techniques to detect the participation of stroke patients and uses a deep learning model to extract participation-related features from visual images and recognize participation degree behaviors.
The object of the invention is achieved by at least one of the following technical solutions.
A game rehabilitation scene participation degree behavior recognition method based on vision comprises the following steps:
s1, constructing a vision-based game rehabilitation scene participation degree behavior recognition model, wherein the participation degree behavior recognition model comprises an eye feature extraction unit, a gesture feature extraction unit and a time sequence feature extraction unit;
the eye feature extraction unit is used for obtaining eye feature vectors according to the intercepted eye images in the original video frames; the gesture feature extraction unit is used for extracting feature points related to the head gesture from the original image and carrying out normalization operation so as to obtain a head gesture feature vector; in the time sequence feature extraction unit, eye feature vectors and head gesture feature vectors extracted from each video frame are spliced into fusion feature vectors, and a fusion feature vector group of one video is input into a time sequence neural network to obtain the classification of participation degree;
s2, training a participation behavior recognition model: training the participation degree behavior recognition model by utilizing a server in a training sample video data set collected in a game rehabilitation scene, and optimizing parameters of the participation degree behavior recognition model by reducing a network loss function until the participation degree behavior recognition model converges to obtain a trained participation degree behavior recognition model;
s3, identifying the participation degree behavior in the new game rehabilitation scene by using the participation degree behavior identification model.
Further, in order to unify the input to the eye feature extraction unit, the eye images cropped from the original video frames are resized to a uniform size.
Further, the eye feature extraction unit comprises a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer, a third convolution layer and a full connection layer which are connected in sequence;
the method comprises the steps of intercepting an eye image from an original video frame to serve as input, and extracting features through a first convolution layer to obtain a first feature map;
the first feature map output by the first convolution layer is input into the first pooling layer for feature dimension reduction, and a second feature map is obtained;
the second feature map output by the first pooling layer is input into a second convolution layer for further feature extraction to obtain a third feature map;
the third feature map output by the second convolution layer is input into the second pooling layer for feature dimension reduction, and a fourth feature map is obtained;
the fourth feature map output by the second pooling layer is input into a third convolution layer for further feature extraction to obtain a fifth feature map;
and inputting the fifth feature map into the full-connection layer for feature dimension reduction to obtain an output vector.
Further, in the gesture feature extraction unit, feature points related to the head gesture are extracted from the original image; the feature points comprise a left shoulder point, a right shoulder point, a trunk vertex, a nose tip point, a left eye positioning point and a right eye positioning point, each given as two-dimensional coordinates;
the trunk top point is respectively connected with the left shoulder point, the right shoulder point and the nose tip point, and the nose tip point is respectively connected with the left eye positioning point and the right eye positioning point.
Further, in the gesture feature extraction unit, the normalization of the 6 head gesture feature points proceeds as follows: first, the two-dimensional coordinates of the trunk vertex are subtracted from the two-dimensional coordinates of each of the 6 head gesture feature points to obtain relative coordinates, and the resulting 6 relative coordinates are then subjected to normal-distribution normalization.
Further, in the time sequence feature extraction unit, the eye feature vector and the head gesture feature vector extracted from each video frame are spliced into a fusion feature vector, and a frame-level feature representation of one video is constructed.
Further, in the time sequence feature extraction unit, the fusion feature vectors serve as the inputs of the time steps of the TCN time sequence neural network, and the output of the last time step of the TCN is passed to a full connection layer and a softmax function to obtain the class prediction Y′_V of the query video V and the loss function L.
Further, the step S2 specifically includes the following steps:
s2.1, constructing a training video sample library in a server, and selecting a sample video fragment from the training video sample library as input of a participation degree behavior recognition model;
s2.2, executing the eye feature extraction unit on the server: the training sample video segment is processed frame by frame through the eye feature extraction unit to extract eye feature vectors, and the output eye feature vectors are stored in a feature library;
s2.3, executing the gesture feature extraction unit on the server: the training sample video segment is processed frame by frame through the gesture feature extraction unit to extract head gesture feature vectors, and the output head gesture feature vectors are stored in a feature library;
s2.4, executing the time sequence feature extraction unit on the server: the eye feature vector and the head gesture feature vector of each video frame are spliced and used as input to the TCN time sequence neural network for time sequence analysis and class prediction, obtaining the class prediction Y′_V of the query video V and the loss function L;
s2.5, performing end-to-end network training on the server; the participation degree behavior recognition task loss function L measures the distance between the predicted value and the true value, i.e. between the predicted class of the query video and its true class, and is minimized via a standard cross-entropy loss;
s2.6, optimizing the objective function on the server: using the loss function L of step S2.5 as the objective function, locally optimal network parameters are obtained as the network weights of the participation degree behavior recognition model.
Further, the step S3 specifically includes the following steps:
s3.1, executing the test sample video segment generation unit on the server: all frames of a test sample video segment are uniformly sampled, and the resulting set of video frames is used as input;
s3.2, performing participation behavior classification on the obtained video frame set by using the participation behavior recognition model.
Compared with the prior art, the invention has the following advantages and technical effects:
the game rehabilitation scene participation degree behavior recognition method based on vision is provided, eye features and gesture features are fused to realize recognition of participation degree behaviors: an eye feature extraction unit is provided to automatically extract advanced spatial features of an eye image; a pose feature extraction unit is proposed to extract head space pose features; a TCN time sequence feature extraction unit is introduced, time sequence modeling is carried out based on the extracted eye and gesture space features, the identification of participation degree behaviors is realized, and excellent identification accuracy is obtained; the vision-based participation behavior detection method is convenient and quick, does not need wearing articles and complex experimental settings, and is favorable for better supervision and feedback for rehabilitation training of patients.
Drawings
FIG. 1 is a diagram of an eye feature extraction unit constructed in an embodiment of the invention;
FIG. 2 is a schematic view of extracted head pose feature points according to an embodiment of the present invention;
fig. 3 is an algorithm frame diagram of a vision-based game rehabilitation scene participation degree behavior recognition method in an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
To address the problems and shortcomings of the prior art, the invention provides a vision-based participation degree behavior recognition method for game rehabilitation scenes, which mainly comprises five stages: constructing an eye feature extraction unit, constructing a gesture feature extraction unit, constructing a time sequence feature extraction unit, model training, and model inference.
Examples:
a game rehabilitation scene participation degree behavior recognition method based on vision comprises the following steps:
s1, constructing a vision-based game rehabilitation scene participation degree behavior recognition model;
s2, training a participation degree behavior recognition model;
s3, identifying the participation degree behavior in a new game rehabilitation scene by using the participation degree behavior identification model;
each step is described in detail below.
S1, constructing a vision-based game rehabilitation scene participation degree behavior recognition model, wherein the participation degree behavior recognition model comprises an eye feature extraction unit, a gesture feature extraction unit and a time sequence feature extraction unit;
the eye feature extraction unit obtains an eye feature vector from an eye image cropped out of an original video frame. Fig. 1 shows the eye feature extraction unit constructed by the invention: an eye image is cropped from the original video frame using the dlib facial-keypoint detection tool, and a convolutional-neural-network-based eye feature extraction model is designed to obtain an 84-dimensional eye feature vector. The gesture feature extraction unit extracts 6 head-pose-related feature points from the original image using the OpenPose human-body keypoint detection toolbox and performs a normalization operation to obtain a 12-dimensional head gesture feature vector. In the time sequence feature extraction unit, the eye feature vector and head gesture feature vector extracted from each video frame are spliced into a fusion feature vector, and the group of fusion feature vectors of one video is input into a time sequence neural network to obtain the participation degree classification. The details are as follows:
in one embodiment, to unify the input to the eye feature extraction unit, the eye images cropped from the original video frames are resized to a uniform size of 32×32×3.
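The cropping and resizing step can be sketched as follows. This is a minimal numpy-only illustration, not the patent's implementation: the landmark coordinates are hypothetical stand-ins for the output of a face-landmark detector such as dlib, and the resize uses simple nearest-neighbour indexing.

```python
import numpy as np

def crop_and_resize_eye(frame, landmarks, out_size=32, margin=4):
    """Crop a bounding box around eye landmarks and resize it to out_size x out_size.

    frame:     H x W x 3 image array
    landmarks: (N, 2) array of (x, y) eye keypoints (hypothetical values here)
    """
    xs, ys = landmarks[:, 0], landmarks[:, 1]
    x0 = max(int(xs.min()) - margin, 0)
    x1 = min(int(xs.max()) + margin, frame.shape[1])
    y0 = max(int(ys.min()) - margin, 0)
    y1 = min(int(ys.max()) + margin, frame.shape[0])
    patch = frame[y0:y1, x0:x1]
    # Nearest-neighbour resize to the unit's fixed input size.
    ri = np.arange(out_size) * patch.shape[0] // out_size
    ci = np.arange(out_size) * patch.shape[1] // out_size
    return patch[ri][:, ci]

frame = np.random.rand(240, 320, 3)  # synthetic stand-in for a video frame
eye_pts = np.array([[100, 80], [130, 78], [160, 82],
                    [100, 95], [130, 98], [160, 96]])  # hypothetical landmarks
eye_img = crop_and_resize_eye(frame, eye_pts)
print(eye_img.shape)  # (32, 32, 3)
```

In a real pipeline the landmark coordinates would come from the keypoint detector per frame; only the fixed 32×32×3 output size is taken from the text.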
The eye feature extraction unit comprises a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer, a third convolution layer and a full connection layer which are sequentially connected;
an eye image is cropped from the original video frame and input into the first convolution layer for feature extraction; the first convolution layer uses 32 convolution kernels of size 5×5×3, with no zero-padding and a stride of 1; each kernel produces feature values through the convolution operation;
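A "valid" convolution of this kind (no padding, stride 1) can be sketched in numpy as below. The kernel weights are random and purely illustrative; as is usual for CNNs, the operation is implemented as cross-correlation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_valid(x, kernels):
    """'Valid' convolution (cross-correlation), stride 1, no zero-padding.
    x: H x W x C input; kernels: K x k x k x C filter bank."""
    k = kernels.shape[1]
    # Windows of shape (H-k+1, W-k+1, 1, k, k, C); drop the singleton axis.
    windows = sliding_window_view(x, (k, k, x.shape[2])).squeeze(2)
    # Contract each k x k x C window with each of the K kernels.
    return np.tensordot(windows, kernels, axes=([2, 3, 4], [1, 2, 3]))

x = np.random.rand(32, 32, 3)           # unified 32x32x3 eye-image input
w = np.random.rand(32, 5, 5, 3) * 0.01  # 32 random 5x5x3 kernels (illustrative)
fmap1 = conv2d_valid(x, w)
print(fmap1.shape)  # (28, 28, 32)
```

The output spatial size 28 = 32 − 5 + 1 matches the first feature map described next.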
in one embodiment, the 28×28×32 first feature map output by the first convolution layer is input into the first pooling layer for feature dimension reduction; the first pooling layer uses a 2×2 pooling kernel with a stride of 2: all elements in each 2×2 region of the input first feature map are summed, the sum is multiplied by a trainable first coefficient w_1, a first bias term b_1 is added, and the result is passed through a sigmoid function to obtain the output of the first pooling layer; the sigmoid is the nonlinear activation function s(x) = 1/(1 + e^(−x));
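This LeNet-style pooling (sum over each 2×2 region, scaled by a trainable coefficient, shifted by a bias, then squashed by a sigmoid) can be sketched as follows; the coefficient and bias values are illustrative, not trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sum_pool_2x2(x, w, b):
    """2x2 sum pooling with stride 2, scaled by trainable coefficient w,
    shifted by bias b, then passed through a sigmoid."""
    h, wd, c = x.shape
    pooled = x.reshape(h // 2, 2, wd // 2, 2, c).sum(axis=(1, 3))
    return sigmoid(w * pooled + b)

fmap1 = np.random.rand(28, 28, 32)  # stand-in for the first feature map
out = sum_pool_2x2(fmap1, w=0.25, b=0.0)
print(out.shape)  # (14, 14, 32)
```

The halved spatial size (28 → 14) matches the second feature map described next, and the sigmoid keeps every output strictly between 0 and 1.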
in one embodiment, the 14×14×32 second feature map output by the first pooling layer is input into the second convolution layer for further feature extraction; the second convolution layer uses 64 convolution kernels of size 5×5×32, with no zero-padding and a stride of 1, and the convolution yields a 10×10×64 third feature map;
in one embodiment, the third feature map is input into the second pooling layer for feature dimension reduction; the second pooling layer uses a 2×2 pooling kernel with a stride of 2: all elements in each 2×2 region of the input third feature map are summed, the sum is multiplied by a trainable second coefficient w_2, a second bias term b_2 is added, and the output of the second pooling layer is finally obtained through a sigmoid function;
in one embodiment, the 5×5×64 fourth feature map output by the second pooling layer is input into the third convolution layer for further feature extraction; the third convolution layer uses 128 convolution kernels of size 5×5×64, with no zero-padding and a stride of 1, and the convolution yields a 1×1×128 fifth feature map;
in one embodiment, the fifth feature map is input into the fully connected layer for feature dimension reduction; the fully connected layer contains 84 nodes: the input vector is multiplied by a trainable third weight w_3, a third bias term b_3 is added, and an 84-dimensional output is finally obtained through a sigmoid function.
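The spatial sizes quoted through this stack follow directly from the valid-convolution formula (n − k)/stride + 1; a few lines of arithmetic confirm the chain 32 → 28 → 14 → 10 → 5 → 1:

```python
def conv_out(n, k, stride=1):
    """Spatial output size of a 'valid' convolution or pooling layer."""
    return (n - k) // stride + 1

sizes = [32]                             # 32x32x3 eye-image input
sizes.append(conv_out(sizes[-1], 5))     # conv1: 5x5, stride 1 -> 28
sizes.append(conv_out(sizes[-1], 2, 2))  # pool1: 2x2, stride 2 -> 14
sizes.append(conv_out(sizes[-1], 5))     # conv2: 5x5, stride 1 -> 10
sizes.append(conv_out(sizes[-1], 2, 2))  # pool2: 2x2, stride 2 -> 5
sizes.append(conv_out(sizes[-1], 5))     # conv3: 5x5, stride 1 -> 1
print(sizes)  # [32, 28, 14, 10, 5, 1]
```

The final 1×1×128 map is flattened to a 128-dimensional vector, which the 84-node fully connected layer maps to the 84-dimensional eye feature vector.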
In one embodiment, in the gesture feature extraction unit, as shown in fig. 2, the schematic diagrams of 6 head gesture feature points extracted from the original image are respectively: the left shoulder point 1, the right shoulder point 2, the trunk vertex 3, the nose tip point 4, the left eye positioning point 5 and the right eye positioning point 6 are two-dimensional coordinates;
the trunk vertex 3 is respectively connected with the left shoulder point 1, the right shoulder point 2 and the nose tip point 4, and the nose tip point 4 is respectively connected with the left eye positioning point 5 and the right eye positioning point 6.
In one embodiment, in the gesture feature extraction unit, for the normalization operation of 6 head gesture feature points, first, two-dimensional coordinates of the torso vertex 3 are subtracted from two-dimensional coordinates of the 6 head gesture feature points to obtain relative coordinates, and the obtained 6 relative coordinates are subjected to normal distribution normalization processing, which specifically includes:
wherein μ and σ are the mean and standard deviation of the data, respectively; n is the number of data samples; p (P) k Is the original input vector of the kth data sample; p'. k Is the normalized input vector for the kth data sample.
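The relative-coordinate subtraction and z-score normalization described here can be sketched in numpy as follows; the keypoint coordinates are hypothetical, and the statistics are computed over all relative coordinate values, which is one reasonable reading of the description.

```python
import numpy as np

# Hypothetical 2-D coordinates of the 6 head-pose feature points
# (left shoulder, right shoulder, trunk vertex, nose tip, left eye, right eye).
points = np.array([[210., 260.], [350., 262.], [280., 255.],
                   [282., 180.], [262., 160.], [300., 158.]])

trunk_vertex = points[2]
relative = points - trunk_vertex      # coordinates relative to the trunk vertex

mu = relative.mean()
sigma = relative.std()
normalized = (relative - mu) / sigma  # normal-distribution (z-score) normalization

pose_vector = normalized.flatten()    # 12-dimensional head gesture feature vector
print(pose_vector.shape)  # (12,)
```

After normalization the 6 relative points flatten into the 12-dimensional pose vector with zero mean and unit standard deviation.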
In the time sequence feature extraction unit, 84-dimensional eye feature vectors and 12-dimensional head gesture feature vectors extracted from each video frame are spliced into 96-dimensional fusion feature vectors, and a frame-level feature representation of one video is constructed.
In the time sequence feature extraction unit, the fusion feature vectors serve as the inputs of the time steps of the TCN time sequence neural network, and the output of the last time step of the TCN is passed to a full connection layer and a softmax function to obtain the class prediction Y′_V of the query video V and the loss function L.
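The fusion, softmax class prediction and cross-entropy loss can be sketched as below. The TCN itself is replaced by a random linear layer acting as a stand-in for its last-time-step output, and the two-class setup (e.g. engaged / not engaged) is an assumption; the patent does not state the number of classes.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def cross_entropy(probs, true_class):
    return -np.log(probs[true_class])

eye_vec = np.random.rand(84)   # per-frame eye feature vector
pose_vec = np.random.rand(12)  # per-frame head gesture feature vector
fused = np.concatenate([eye_vec, pose_vec])  # 96-dim fusion feature vector

# Stand-in for the TCN's last-time-step output followed by the full
# connection layer; weights are random, purely illustrative.
n_classes = 2  # assumed number of participation classes
W = np.random.rand(n_classes, 96) * 0.01
b = np.zeros(n_classes)
logits = W @ fused + b
probs = softmax(logits)                    # class prediction Y'_V
loss = cross_entropy(probs, true_class=1)  # loss L against the true class
print(fused.shape, probs.sum())
```

During training, minimizing this cross-entropy loss pulls the predicted class distribution toward the ground-truth class, as described in step S2.5.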
Step S2, training the participation degree behavior recognition model: using a server, the model is trained on a training sample video data set collected in a game rehabilitation scene, and its parameters are optimized by reducing a network loss function until the model converges, so as to obtain a trained participation degree behavior recognition model. In one embodiment, as shown in fig. 3, the algorithm framework of the vision-based game rehabilitation scene participation degree behavior recognition method is implemented as follows:
s2.1, constructing a training video sample library in a server, and selecting a sample video fragment from the training video sample library as input of a participation degree behavior recognition model;
s2.2, executing the eye feature extraction unit on the server: the training sample video segment is processed frame by frame through the eye feature extraction unit to extract eye feature vectors, and the output eye feature vectors are stored in a feature library;
s2.3, executing the gesture feature extraction unit on the server: the training sample video segment is processed frame by frame through the gesture feature extraction unit to extract head gesture feature vectors, and the output head gesture feature vectors are stored in a feature library;
s2.4, executing the time sequence feature extraction unit on the server: the eye feature vector and the head gesture feature vector of each video frame are spliced and fed as input to the TCN temporal neural network for time sequence analysis and category prediction, obtaining the category prediction Y'_V of the query video V and the loss function L;
s2.5, performing end-to-end network training on the server; the participation degree behavior recognition task loss function L is the distance between the predicted value and the true value, i.e., the distance from the predicted category of the query video to its true category, which is minimized through a standard cross entropy loss function;
s2.6, optimizing the objective function on the server: taking the loss function L of step S2.5 as the objective function, a locally optimal set of network parameters is obtained as the network weights of the participation degree behavior recognition model.
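For a one-hot true label, the standard cross entropy loss of step S2.5 reduces to the negative log-probability assigned to the true category. A minimal sketch (the probability vectors below are made-up examples, not values from the patent):

```python
import math

def cross_entropy(pred_probs, true_idx):
    # L = -log p(true class): small when the predicted distribution
    # concentrates on the true category, large otherwise.
    return -math.log(pred_probs[true_idx])

confident_loss = cross_entropy([0.9, 0.05, 0.05], 0)  # good prediction
mistaken_loss = cross_entropy([0.1, 0.1, 0.8], 0)     # wrong prediction
```

Minimizing this loss over the training sample library is exactly the "distance from predicted category to true category" reduction described in step S2.5.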
Step S3, identifying the participation degree behavior in a new game rehabilitation scene by using the trained participation degree behavior recognition model. The specific implementation process is as follows:
s3.1, executing the test sample video segment generating unit on the server: frames are uniformly sampled from one test sample video segment, and the obtained video frame set is taken as input;
s3.2, classifying the participation degree behavior of the obtained video frame set by using the participation degree behavior recognition model.
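The uniform sampling of step S3.1 can be sketched as picking evenly spaced frame indices; the index formula below is an assumption, since the patent does not fix one:

```python
def uniform_sample_indices(num_frames, num_samples):
    # Choose num_samples frame indices evenly spaced across the segment;
    # if the segment is shorter than that, keep every frame.
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

indices = uniform_sample_indices(100, 10)  # e.g. a 100-frame test segment
```

The resulting frame set is then pushed through the same per-frame feature extraction and TCN classification pipeline used in training.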
The preferred embodiment of the invention provides a vision-based game rehabilitation scene participation degree behavior recognition method that requires no complex wearable equipment or experimental setup, offers both usability and accuracy, and can help rehabilitation physicians obtain timely feedback and provide suitable training prescriptions for patients.
While the invention has been described with reference to specific embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It is to be understood that the features described in the different dependent claims and in the invention may be combined in ways different from those described in the original claims. It is also to be understood that features described in connection with separate embodiments may be used in other described embodiments.
Claims (10)
1. The game rehabilitation scene participation degree behavior recognition method based on vision is characterized by comprising the following steps of:
s1, constructing a vision-based game rehabilitation scene participation degree behavior recognition model, wherein the participation degree behavior recognition model comprises an eye feature extraction unit, a gesture feature extraction unit and a time sequence feature extraction unit;
s2, training the participation degree behavior recognition model: using a server, the participation degree behavior recognition model is trained on a training sample video data set collected in a game rehabilitation scene, and its parameters are optimized by reducing the network loss function until the model converges, yielding a trained participation degree behavior recognition model;
s3, identifying the participation degree behavior in the new game rehabilitation scene by using the participation degree behavior identification model.
2. The vision-based game rehabilitation scene participation degree behavior recognition method according to claim 1, wherein in the participation degree behavior recognition model, an eye feature extraction unit is used for obtaining eye feature vectors according to an eye image intercepted from an original video frame; the gesture feature extraction unit is used for extracting feature points related to the head gesture from the original image and carrying out normalization operation so as to obtain a head gesture feature vector; in the time sequence feature extraction unit, eye feature vectors and head gesture feature vectors extracted from each video frame are spliced into fusion feature vectors, and a fusion feature vector group of one video is input into a time sequence neural network to obtain the classification of participation degree.
3. The vision-based game rehabilitation scene participation degree behavior recognition method according to claim 1, wherein, to unify the input of the eye feature extraction unit, the eye images cut out from the original video frames are unified in size.
4. The vision-based game rehabilitation scene participation degree behavior recognition method according to claim 1, wherein the eye feature extraction unit comprises a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer, a third convolution layer and a full connection layer which are sequentially connected;
the method comprises the steps of intercepting an eye image from an original video frame to serve as input, and extracting features through a first convolution layer to obtain a first feature map;
the first feature map output by the first convolution layer is input into the first pooling layer for feature dimension reduction, and a second feature map is obtained;
the second feature map output by the first pooling layer is input into a second convolution layer for further feature extraction to obtain a third feature map;
the third feature map output by the second convolution layer is input into the second pooling layer for feature dimension reduction, and a fourth feature map is obtained;
the fourth feature map output by the second pooling layer is input into a third convolution layer for further feature extraction to obtain a fifth feature map;
and inputting the fifth feature map into the full-connection layer for feature dimension reduction to obtain an output vector.
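Claim 4 fixes only the layer order (convolution, pooling, convolution, pooling, convolution, full connection), not kernel sizes, strides, or input resolution. Assuming common 3x3 convolutions with padding 1 and 2x2 stride-2 pooling on a hypothetical 32x32 eye crop (all assumptions), the spatial sizes of the five feature maps work out as:

```python
def out_size(size, kernel, stride=1, pad=0):
    # Standard output-size formula for a convolution or pooling layer.
    return (size + 2 * pad - kernel) // stride + 1

s = 32                        # assumed eye-crop side length
s = out_size(s, 3, pad=1)     # first convolution layer  -> first feature map, 32
s = out_size(s, 2, stride=2)  # first pooling layer      -> second feature map, 16
s = out_size(s, 3, pad=1)     # second convolution layer -> third feature map, 16
s = out_size(s, 2, stride=2)  # second pooling layer     -> fourth feature map, 8
s = out_size(s, 3, pad=1)     # third convolution layer  -> fifth feature map, 8
# The full connection layer then flattens this map and reduces it to the
# 84-dim eye feature vector described in the specification.
```

Any other kernel/stride choice consistent with the claimed layer order would work equally well; the sketch only shows how the two pooling layers perform the claimed feature dimension reduction.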
5. The vision-based game rehabilitation scene participation degree behavior recognition method according to claim 1, wherein in the gesture feature extraction unit, feature points related to head gestures extracted from an original image comprise a left shoulder point (1), a right shoulder point (2), a trunk vertex (3), a nose tip point (4), a left eye positioning point (5) and a right eye positioning point (6), which are two-dimensional coordinates;
the trunk vertex (3) is respectively connected with the left shoulder point (1), the right shoulder point (2) and the nose tip point (4), and the nose tip point (4) is respectively connected with the left eye positioning point (5) and the right eye positioning point (6).
6. The vision-based game rehabilitation scene participation degree behavior recognition method according to claim 1, wherein in the gesture feature extraction unit, normalization operation is performed on 6 head gesture feature points, first two-dimensional coordinates of the 6 head gesture feature points are respectively subtracted from two-dimensional coordinates of a trunk vertex (3) to obtain relative coordinates, and the obtained 6 relative coordinates are subjected to normal distribution normalization processing.
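A minimal sketch of the claim-6 normalization. The point ordering (trunk vertex at index 2, per the numbering in claim 5) and the z-score variant of "normal distribution normalization" are assumptions; the claim specifies only trunk-vertex-relative coordinates followed by normal-distribution processing, and the sample coordinates are invented for illustration:

```python
import math

def normalize_head_points(points):
    # points: six (x, y) head gesture feature points, with the trunk
    # vertex assumed at index 2 (claim-5 numbering).
    tx, ty = points[2]
    # Step 1: coordinates relative to the trunk vertex.
    rel = [(x - tx, y - ty) for x, y in points]
    # Step 2: z-score normalization over all 12 coordinate values.
    vals = [v for p in rel for v in p]
    mean = sum(vals) / len(vals)
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals)) or 1.0
    return [((x - mean) / std, (y - mean) / std) for x, y in rel]

pts = [(1.0, 2.0), (3.0, 2.0), (2.0, 3.0),   # shoulders, trunk vertex
       (2.0, 1.5), (1.5, 1.0), (2.5, 1.0)]   # nose tip, eye points
norm = normalize_head_points(pts)
```

The six normalized (x, y) pairs flatten to the 12-dim head gesture feature vector consumed by the time sequence feature extraction unit.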
7. The vision-based game rehabilitation scene participation degree behavior recognition method according to claim 1, wherein in the time sequence feature extraction unit, an eye feature vector and a head posture feature vector extracted from each video frame are spliced into a fusion feature vector, and a frame-level feature representation of one video is constructed.
8. The vision-based game rehabilitation scene participation degree behavior recognition method according to claim 7, wherein in the time sequence feature extraction unit, the fusion feature vector is used as the input of a time step of the TCN temporal neural network, and the output of the last time step of the TCN temporal neural network is passed to a full connection layer and a softmax function to obtain the category prediction Y'_V of the query video V and a loss function L.
9. The vision-based game rehabilitation scene participation degree behavior recognition method according to claim 1, wherein the step S2 specifically comprises the following steps:
s2.1, constructing a training video sample library in the server, and selecting a sample video segment from the training video sample library as the input of the participation degree behavior recognition model;
s2.2, executing the eye feature extraction unit on the server: the training sample video segment is processed frame by frame through the eye feature extraction unit to extract eye feature vectors, and the output eye feature vectors are stored in a feature library;
s2.3, executing the gesture feature extraction unit on the server: the training sample video segment is processed frame by frame through the gesture feature extraction unit to extract head gesture feature vectors, and the output head gesture feature vectors are stored in the feature library;
s2.4, executing the time sequence feature extraction unit on the server: the eye feature vector and the head gesture feature vector of each video frame are spliced and fed as input to the TCN temporal neural network for time sequence analysis and category prediction, obtaining the category prediction Y'_V of the query video V and the loss function L;
s2.5, performing end-to-end network training on the server; the participation degree behavior recognition task loss function L is the distance between the predicted value and the true value, i.e., the distance from the predicted category of the query video to its true category, which is minimized through a standard cross entropy loss function;
s2.6, optimizing the objective function on the server: taking the loss function L of step S2.5 as the objective function, a locally optimal set of network parameters is obtained as the network weights of the participation degree behavior recognition model.
10. The vision-based game rehabilitation scene participation degree behavior recognition method according to claim 1, wherein the step S3 specifically comprises the following steps:
s3.1, executing the test sample video segment generating unit on the server: frames are uniformly sampled from one test sample video segment, and the obtained video frame set is taken as input;
s3.2, classifying the participation degree behavior of the obtained video frame set by using the participation degree behavior recognition model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310634168.7A CN117497129A (en) | 2023-05-31 | 2023-05-31 | Game rehabilitation scene participation degree behavior recognition method based on vision |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117497129A true CN117497129A (en) | 2024-02-02 |
Family
ID=89671339
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310634168.7A Pending CN117497129A (en) | 2023-05-31 | 2023-05-31 | Game rehabilitation scene participation degree behavior recognition method based on vision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117497129A (en) |
- 2023-05-31 CN CN202310634168.7A patent/CN117497129A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021143353A1 (en) | Gesture information processing method and apparatus, electronic device, and storage medium | |
Yadav et al. | Real-time Yoga recognition using deep learning | |
EP4101371A1 (en) | Electroencephalogram signal classifying method and apparatus, electroencephalogram signal classifying model training method and apparatus, and medium | |
CN112861624A (en) | Human body posture detection method, system, storage medium, equipment and terminal | |
de San Roman et al. | Saliency Driven Object recognition in egocentric videos with deep CNN: toward application in assistance to Neuroprostheses | |
Loureiro et al. | Using a skeleton gait energy image for pathological gait classification | |
US20050105768A1 (en) | Manipulation of image data | |
CN111176447A (en) | Augmented reality eye movement interaction method fusing depth network and geometric model | |
CN112420141A (en) | Traditional Chinese medicine health assessment system and application thereof | |
CN110503636B (en) | Parameter adjustment method, focus prediction method, parameter adjustment device and electronic equipment | |
Wang et al. | A deep learning approach using attention mechanism and transfer learning for electromyographic hand gesture estimation | |
CN114424941A (en) | Fatigue detection model construction method, fatigue detection method, device and equipment | |
CN114420299A (en) | Eye movement test-based cognitive function screening method, system, equipment and medium | |
CN113974612A (en) | Automatic assessment method and system for upper limb movement function of stroke patient | |
CN110192860A (en) | A kind of the Brian Imaging intelligent test analyzing method and system of network-oriented information cognition | |
Wang | Simulation of sports movement training based on machine learning and brain-computer interface | |
CN115154828B (en) | Brain function remodeling method, system and equipment based on brain-computer interface technology | |
CN116645346A (en) | Processing method of rotator cuff scanning image, electronic equipment and storage medium | |
CN116543455A (en) | Method, equipment and medium for establishing parkinsonism gait damage assessment model and using same | |
CN117497129A (en) | Game rehabilitation scene participation degree behavior recognition method based on vision | |
CN115067934A (en) | Hand motion function analysis system based on machine intelligence | |
Wang et al. | Rehabilitation system for children with cerebral palsy based on body vector analysis and GMFM-66 standard | |
Ni et al. | A remote free-head pupillometry based on deep learning and binocular system | |
Zhao et al. | A Tongue Color Classification Method in TCM Based on Transfer Learning | |
Zheng et al. | Sports Biology Seminar of Three-dimensional Movement Characteristics of Yoga Standing Based on Image Recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |