CN116580832A - Auxiliary diagnosis system and method for senile dementia based on video data - Google Patents
- Publication number
- CN116580832A (application number CN202310497550.8A)
- Authority
- CN
- China
- Prior art keywords
- training
- data
- video data
- network
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses an auxiliary diagnosis system and method for senile dementia based on video data, relating in particular to the field of video analysis. The invention provides a novel intelligent auxiliary diagnosis and early-warning means for senile dementia: it achieves high diagnostic accuracy, reduces the workload of doctors, and facilitates early warning and diagnosis of cognitive abnormalities for users in communities or at home.
Description
Technical Field
The invention relates to the field of video analysis, in particular to an auxiliary diagnosis system and method for senile dementia based on video data.
Background
As society ages, the prevalence of dementia increases. According to the latest global dementia prevalence report issued by the World Health Organization, more than 50 million people were living with dementia in 2020, and this figure doubles every 20 years. China is the country with the largest number of dementia patients, and the medical treatment, care, and management of elderly people with dementia have become an important national public health problem.
At present, early senile dementia is mainly diagnosed by comprehensive evaluation of clinical symptoms using cognitive tests and clinical questionnaires. However, the psychometric scales that serve as the main diagnostic standard have technical shortcomings, such as being influenced by the subjective psychological state of patients. In addition to clinical cognitive, psychological, and other questionnaires, biomarker detection and neuroimaging techniques are also increasingly used in clinical diagnostic practice. However, such detection relies on specialized equipment and medical personnel and can only be performed at specific locations.
With the accumulation of medical big data and the development of artificial intelligence technology, applications of artificial intelligence in the medical field have made considerable progress. Intelligent auxiliary diagnosis and treatment is one of the most important and core application scenarios of artificial intelligence in medicine: through large amounts of image data and diagnostic data, deep learning continuously trains neural networks until they master diagnostic capability, which greatly helps reduce the medical burden. The invention provides an auxiliary diagnosis system and method for senile dementia based on video data.
Disclosure of Invention
Therefore, the present invention provides an auxiliary diagnosis system and method for senile dementia based on video data with high diagnostic accuracy, which reduces the workload of doctors and facilitates early warning and diagnosis of cognitive abnormalities for users in communities or at home.
In order to achieve the above object, the present invention provides the following technical solutions: the senile dementia auxiliary diagnosis system based on the video data comprises a terminal, a network, a server and a database;
the terminal is communicatively connected with the server through the network and is used for collecting video of the subject's tea-making task operation process and uploading it to the server through the network; the server performs character interaction behavior recognition and senile dementia health-state diagnosis on the images to be detected, and the related data and original video data information are stored in the database;
the server comprises a sample acquisition module and a model training generation module;
the sample acquisition module is used for acquiring the original video data, converting and preprocessing the data, and generating the sample data and text labels used for model training;
the model training generation module is used for inputting training sample data and text labels into the neural training network for training.
Further, the sample acquisition module comprises an original video data acquisition unit, a video data key frame interception unit, an image sample generation unit and an image sample preprocessing unit;
the original video data acquisition unit is an electronic device with a camera shooting function;
the video data key frame interception unit reads the original video data stream, extracts video frames at equal intervals determined by the video frame rate (fps) and a sampling interval t, and stores them as static images in png or jpg format;
the image sample generation unit manually screens and labels, via interactive labeling, the static original images stored by the video data key frame interception unit; the manual screening randomly samples images of key character interaction behaviors during the subject's tea-making operation, where the key interaction behaviors include turning on the power switch, boiling water, adding tea leaves, pouring hot water, and pouring tea;
the image sample preprocessing unit resizes all samples to a consistent size after the data are loaded, converts them into the Tensor form required by the neural network, and divides the data set into a training set and a test set according to a set proportion; data enhancement techniques are also used to reduce the influence of an insufficient amount of data and improve the robustness of the model.
Further, the model training generation module comprises a character interaction behavior recognition model training module and a character interaction behavior recognition model prediction module;
the character interaction behavior recognition model training module adopts a CLIP model trained on a large number of image-caption pairs; the features of the images and texts are obtained through an Image Encoder and a Text Encoder respectively, and the similarity between the texts and images in a batch is computed by dot product to obtain a batch_size x batch_size similarity matrix, where the values on the diagonal are the similarities of the positive samples, so the optimization target during training is to make the positive-sample similarity values as large as possible. The model architecture is divided into two parts, an image encoder and a text encoder. The image encoder considers two different architectures, a residual network (ResNet) or a Vision Transformer (ViT): the residual-network models are ResNet-50, ResNet-101, and, following the EfficientNet scaling idea, three variants of ResNet-50 scaled to roughly 4x, 16x, and 64x the compute of ResNet-50: ResNet-50x4, ResNet-50x16, and ResNet-50x64; for ViT, the three pre-trained models ViT-B/32, ViT-B/16, and ViT-L/14 are used. The text encoder uses a Transformer encoder with depth 12 and width 512, with eight attention heads, whose weights come from the pre-trained CLIP text encoder;
when the character interaction behavior recognition model prediction module performs inference for the character interaction behavior classification task on an input picture, the model first converts the category labels into sentences of the same form as those used in pre-training, i.e., the sentence corresponding to each category is obtained through a prompt operation; finally, the similarity between the input picture and the sentence of each category is computed, and the category whose sentence has the highest similarity is the predicted category;
the category labels include: turning on a power switch, boiling water, adding tea leaves, pouring hot water, and pouring tea; the prompt template uses "A photo of a [mask]", with the mask replaced by the category label.
Further, the neural network training method comprises the following steps:
S210: preprocess the character interaction behavior information sequence;
S220: perform feature vectorization on the preprocessing result of the character interaction behavior information sequence;
S230: input the feature vector samples into the neural network for training;
S240: save the trained network model and perform inference prediction of cognitive dysfunction.
Further, S210 performs a deduplication operation on the sequence, retaining only the key data points at which two adjacent values differ, i.e., retaining the character interaction behavior change information during the subject's tea-making task; a sequence element is retained only when it differs from the immediately preceding element;
S220 performs feature vectorization conversion on the deduplicated sequence of key operation behaviors from the subject's tea-making task, generating samples that meet the input data format requirements of the neural network;
S230 completes training and verification of the neural network model. The candidate network models are MLP, 1D-CNN, RNN, LSTM, GRU, CNN+LSTM, TextCNN, BiLSTM, Attention, MultiHeadAttention, Attention+BiLSTM, BiGRU+Attention, Transformer, and PositionalEmbedding+Transformer. The optimizer is Adam, the batch size is 64, the number of training epochs is 10, the initial learning rate is set to 0.001, and the training/test split ratio is 9:1.
The invention also comprises a diagnosis method of the senile dementia auxiliary diagnosis system based on the video data, which comprises the following specific steps:
step S110: collecting video of the operation process of the tea making task of the subject through the terminal;
step S120: uploading video data of the tea making operation of the subject to a server through a network;
step S130: the server converts the video data frames of the tea making operation of the subject into static pictures;
step S140: the server preprocesses the converted static pictures to generate sample data and text labels for model training, and then inputs the training sample data and text labels into the neural network for training;
step S150: the server performs character interaction behavior recognition on the images to be detected and diagnoses the senile dementia health state;
step S160: the server stores the relevant data and the original video data information in a database.
The invention has the following advantages:
1. The invention provides a novel intelligent auxiliary diagnosis and early-warning means for senile dementia: it achieves high diagnostic accuracy, reduces the workload of doctors, and facilitates early warning and diagnosis of cognitive abnormalities for users in communities or at home. The pre-trained CLIP model adopted by the invention has strong transfer-learning capability on small downstream data sets, which greatly reduces training cost;
2. The tea-making task adopted by the invention is derived from daily life, is easy for the elderly to understand and accept, and is simple to operate; it avoids long and tedious evaluations, produces objective evaluation results, and avoids result differences caused by differences among evaluators.
Drawings
FIG. 1 is an application scenario diagram of the auxiliary diagnosis system and method for senile dementia based on video data provided by the invention;
fig. 2 is a diagram of an auxiliary diagnosis system for senile dementia based on video data provided by the invention;
FIG. 3 is a flowchart of an auxiliary diagnosis method for senile dementia based on video data provided by the invention;
FIG. 4 is a flowchart of a training and predicting method for identifying a neural network for human interactive behavior provided by the invention;
FIG. 5 is a flow chart of a method for training and predicting a cognitive dysfunction neural network provided by the present invention;
in the figure: 101 terminals, 102 networks, 103 servers and 104 databases;
the system comprises a 310 sample acquisition module, a 311 original video data acquisition unit, a 312 video data key frame interception unit, a 313 image sample generation unit and a 314 image sample preprocessing unit;
320 model training generation module, 321 human interaction behavior recognition model training module, 322 human interaction behavior recognition model prediction module.
Detailed Description
Other advantages and benefits of the present invention will become apparent to those skilled in the art from the following detailed description, which describes certain specific embodiments by way of illustration, but not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
Referring to fig. 1 of the specification, an embodiment of the invention provides an application scenario diagram of the auxiliary diagnosis system and method for senile dementia based on video data. In this scenario, the interaction behavior between the subject and the tea-making objects is captured by a camera. With the character interaction behavior recognition method provided by the embodiment of the invention, the subject's operation behavior, information about the tea-making objects, and the interaction between the subject and the objects can be detected. The objects in the tea-making scene may include a kettle, a teapot, teacups, a tea caddy filled with tea leaves, mineral water, cola, a socket with an indicator lamp, a table, chairs, and the like. A staff member asks the elderly subject to select the proper articles to complete the tea-making task. The specific steps are as follows: 1. the staff member asks the subject to sit down and informs them: the tools required for making tea in daily life are placed on the table in front of you; please operate in the way you consider correct, with the goal of making a cup of hot tea; 2. when the subject is ready, recording starts after the staff member calls "begin"; the recording covers the subject's entire operation process and stops after the staff member calls "end"; no prompts are given during the process.
Referring to fig. 2 of the specification, an embodiment of the present invention provides an auxiliary diagnosis system diagram for senile dementia based on video data, which includes a terminal 101, a network 102, a server 103 and a database 104;
the terminal 101 and the server 103 are connected to each other by a network 102. The terminal 101 may be various forms of image capturing devices, such as a video camera, a still camera, a mobile phone, etc. The server 103 may be an independent server, deployed with the senile dementia auxiliary diagnosis system platform provided by the invention, or may be a server group formed by a plurality of servers, where each server is deployed with a module of the senile dementia auxiliary diagnosis system and the method thereof provided by the invention. Of course, the server 103 may also be a cloud server, and the senile dementia auxiliary diagnosis system platform provided by the invention is deployed on the cloud server. The terminal 101 collects the video of the operation process of the tea making task of the subject, uploads the video to the server 103 through the network 102, the server 103 performs interactive character recognition and diagnosis of the disease health state of the senile dementia on the image to be detected, and information such as related data, original video data and the like is stored in the database 104.
Referring to fig. 3 of the specification, an embodiment of the present invention provides a flowchart of an auxiliary diagnosis method for senile dementia based on video data, including:
step S110: collecting video of the operation process of the tea making task of the subject through the terminal 101;
step S120: uploading the subject tea making operation video data to the server 103 through the network 102;
step S130: the server 103 converts the frames of the subject's tea-making video data into static pictures;
step S140: the server 103 preprocesses the converted static pictures to generate sample data and text labels for model training, and then inputs the training sample data and text labels into the neural network for training;
step S150: the server 103 performs character interaction behavior recognition on the images to be detected and diagnoses the senile dementia health state;
step S160: the server 103 saves the relevant data and the original video data information to the database 104.
Referring to fig. 4 of the drawings, an embodiment of the present invention provides a flowchart of the neural network training and prediction method of the auxiliary diagnosis system for senile dementia based on video data; the method involves a sample acquisition module 310 and a model training generation module 320;
the sample acquiring module 310 is used for acquiring original video data, converting the data, preprocessing the data, and generating sample data and text labels for training the module;
the model training generation module 320 is configured to input training sample data and text labels into a neural training network for training.
On the basis of the above embodiment, the sample acquisition module 310 includes an original video data acquisition unit 311, a video data key frame capture unit 312, an image sample generation unit 313, and an image sample preprocessing unit 314;
the original video data acquisition unit 311 is an electronic device with a camera function, such as a video camera, mobile phone, or camera; in the acquisition scene, video of the subject's tea-making operation should as far as possible be collected under stable lighting conditions, so that the shooting environment is stable and the captured resolution and color fidelity are high, improving the accuracy of the subsequent analysis;
the video data key frame interception unit 312 reads the original video data stream, extracts video frames at equal intervals determined by the video frame rate (fps) and a sampling interval t, and stores them as static images in png or jpg format;
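As a concrete illustration of the interception rule just described (a minimal sketch under assumed details — the patent names no library, and `key_frame_indices` is a hypothetical helper), the indices of the frames to save can be derived from the frame rate and the interval t; decoding the video and writing the images, e.g. with OpenCV, is omitted:

```python
# Hypothetical helper: given a video's total frame count, its frame rate (fps),
# and a sampling interval t in seconds, return the indices of the frames that
# the key-frame interception unit would save as static png/jpg images.
def key_frame_indices(total_frames: int, fps: float, t: float) -> list[int]:
    step = max(1, round(fps * t))  # number of frames between two saved frames
    return list(range(0, total_frames, step))
```

For a 25 fps video sampled once per second, for instance, every 25th frame is kept.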
the image sample generation unit 313 manually screens and labels, via interactive labeling, the static original images stored by the video data key frame interception unit 312; the manual screening randomly samples images of key character interaction behaviors during the subject's tea-making operation, where the key interaction behaviors include turning on the power switch, boiling water, adding tea leaves, pouring hot water, pouring tea, and the like;
the image sample preprocessing unit 314 resizes all samples to a consistent size after the data are loaded, converts them into the Tensor form required by the neural network, and divides the data set into a training set and a test set according to a set proportion; data enhancement techniques are also used to reduce the influence of an insufficient amount of data and improve the robustness of the model. The available data enhancement means include size changes, pixel-value changes, and viewing-angle changes, i.e., rotation, saturation and brightness adjustment, flipping, center cropping, and the like.
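Two of the steps above can be sketched as follows (an illustrative reconstruction, not the patent's code: a horizontal flip stands in for the augmentation operations, and nested lists stand in for image Tensors; a real implementation would typically use torchvision transforms):

```python
import random

# Sketch of two preprocessing steps: a horizontal flip used as a data
# enhancement, and the split of the data set into training and test portions
# according to a set ratio.
def hflip(image_rows):
    # flip each pixel row left-to-right
    return [row[::-1] for row in image_rows]

def train_test_split(samples, labels, train_ratio=0.9, seed=0):
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_ratio)
    train = [(samples[i], labels[i]) for i in idx[:cut]]
    test = [(samples[i], labels[i]) for i in idx[cut:]]
    return train, test
```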
On the basis of the above embodiment, the model training generating module 320 includes a character interaction behavior recognition model training module 321 and a character interaction behavior recognition model prediction module 322;
the character interaction behavior recognition model training module 321 adopts a CLIP model trained on a large number of image-caption pairs; for the collected image sample-text label pairs, the features of the images and texts are obtained through an Image Encoder and a Text Encoder respectively, and the similarity between the texts and images in a batch is computed by dot product to obtain a batch_size x batch_size similarity matrix, where the values on the diagonal are the similarities of the positive samples, so the optimization target during training is to make the positive-sample similarity values as large as possible. The model architecture is divided into two parts, an image encoder and a text encoder. The image encoder considers two different architectures, a residual network (ResNet) or a Vision Transformer (ViT): the residual-network models are ResNet-50, ResNet-101, and, following the EfficientNet scaling idea, three variants of ResNet-50 scaled to roughly 4x, 16x, and 64x the compute of ResNet-50: ResNet-50x4, ResNet-50x16, and ResNet-50x64; for ViT, the three pre-trained models ViT-B/32, ViT-B/16, and ViT-L/14 are used. The text encoder uses a Transformer encoder with depth 12 and width 512, with eight attention heads, whose weights come from the pre-trained CLIP text encoder.
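The batch-level objective described above can be sketched as follows (a simplified, assumption-laden illustration of a CLIP-style loss, not the patent's implementation: features are plain Python lists supplied directly, and no learned encoders appear):

```python
import math

# Dot-product similarity between every image feature and every text feature in
# a batch: the result is a batch_size x batch_size matrix whose diagonal holds
# the positive (matched) pairs.
def similarity_matrix(img_feats, txt_feats):
    return [[sum(a * b for a, b in zip(img, txt)) for txt in txt_feats]
            for img in img_feats]

# Symmetric cross-entropy over rows (image->text) and columns (text->image);
# minimizing it pushes the diagonal similarities to be as large as possible.
def contrastive_loss(sim):
    n = len(sim)
    def row_loss(row, target):
        m = max(row)
        logsum = m + math.log(sum(math.exp(v - m) for v in row))
        return logsum - row[target]
    img2txt = sum(row_loss(sim[i], i) for i in range(n)) / n
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    txt2img = sum(row_loss(cols[j], j) for j in range(n)) / n
    return (img2txt + txt2img) / 2
```

A batch whose diagonal dominates more strongly yields a lower loss, which is exactly the stated optimization target.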
When the character interaction behavior recognition model prediction module 322 performs inference for the character interaction behavior classification task on an input picture, the model first converts the category labels into sentences of the same form as those used in pre-training, i.e., the sentence corresponding to each category is obtained through a prompt operation; finally, the similarity between the input picture and the sentence of each category is computed, and the category whose sentence has the highest similarity is the predicted category.
The category labels include: turning on a power switch, boiling water, adding tea leaves, pouring hot water, pouring tea, and the like; the prompt template uses "A photo of a [mask]", with the mask replaced by the category label.
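The prompt-and-match inference described above can be sketched as below (illustrative only: the CLIP encoders are stubbed out, feature vectors are supplied directly, and the label wording is taken from this translated text):

```python
# Build "A photo of a [mask]" sentences from the category labels, then predict
# the category whose (pre-computed) text feature is most similar to the image
# feature by dot product.
LABELS = ["turning on a power switch", "boiling water", "adding tea leaves",
          "pouring hot water", "pouring tea"]

def build_prompts(labels, template="A photo of a {}"):
    return [template.format(lab) for lab in labels]

def predict(image_feat, text_feats, labels):
    sims = [sum(a * b for a, b in zip(image_feat, t)) for t in text_feats]
    return labels[max(range(len(sims)), key=sims.__getitem__)]
```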
In this embodiment, the dementia recognition system may be integrated into an electronic device such as a computer, mobile phone, tablet computer, or traditional Chinese medicine four-diagnosis instrument.
In this embodiment, the model training generation module 320 may use the pre-trained model as-is, train it separately, or train it jointly to optimize the overall process. A linear-probe approach may be used, fixing (freezing) the pre-trained network to extract features and adding a trainable linear classifier on top to complete training; alternatively, fine-tuning may be used to tune the entire network, so that all learnable parameter weights in the network are updated.
Referring to fig. 5 of the specification, an embodiment of the present invention provides a flowchart of a training and prediction method for the cognitive dysfunction neural network of the auxiliary diagnosis system for senile dementia based on video data.
On the basis of the above embodiment, the neural network training method further includes:
S210: preprocessing the character interaction behavior information sequence;
S220: performing feature vectorization on the preprocessing result of the character interaction behavior information sequence;
S230: inputting the feature vector samples into the neural network for training;
S240: saving the trained network model and performing inference-based prediction of cognitive dysfunction.
Specifically, S210 performs a deduplication operation on the sequence, retaining only the key data points where two adjacent values differ; that is, it retains the character interaction behavior change information from the operation process of the subject's tea-making task. The algorithm for judging whether a sequence element is retained is as follows:
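The patent's retention-judgment algorithm itself is not reproduced above; a minimal sketch of such a keep-on-change rule might look like:

```python
def deduplicate(sequence):
    # Keep an element only when it differs from the previously kept one,
    # so the result records behavior *changes* during the tea-making task.
    kept = []
    for item in sequence:
        if not kept or item != kept[-1]:
            kept.append(item)
    return kept
```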
S220 performs feature vectorization on the deduplicated sequence of key operation behaviors from the subject's tea-making task, generating samples that meet the input data format required by the neural network. For convenience of processing, the sequence data can be converted into a fixed-length feature vector of length L_seq: sequences shorter than this length are zero-padded at the tail, and longer sequences are truncated.
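The pad/truncate step can be sketched as follows (the helper name and pad value are illustrative):

```python
def to_fixed_length(sequence, l_seq, pad_value=0):
    # Truncate sequences longer than l_seq; zero-pad shorter ones at the tail
    if len(sequence) >= l_seq:
        return list(sequence[:l_seq])
    return list(sequence) + [pad_value] * (l_seq - len(sequence))
```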
S230 completes training and validation of the neural network model. Candidate network models include MLP, 1DCNN, RNN, LSTM, GRU, CNN+LSTM, TextCNN, BiLSTM, Attention, MultiHeadAttention, Attention+BiLSTM, BiGRU+Attention, Transformer, PositionalEmbedding+Transformer, and the like. The optimizer is Adam, the batch size is 64, the number of training epochs is 10, the initial learning rate is set to 0.001, and the training set and test set are divided in a 9:1 ratio.
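A sketch of the stated setup (batch size 64, 10 epochs, initial learning rate 0.001, 9:1 train/test split); the config dictionary and split helper are illustrative rather than the patent's code:

```python
import numpy as np

# Hyperparameters as stated in the text; names are illustrative
CONFIG = {"optimizer": "Adam", "batch_size": 64,
          "epochs": 10, "learning_rate": 1e-3}

def train_test_split(n_samples, train_ratio=0.9, seed=0):
    # Shuffle sample indices and split 9:1 into training and test sets
    idx = np.random.default_rng(seed).permutation(n_samples)
    cut = int(n_samples * train_ratio)
    return idx[:cut], idx[cut:]
```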
According to the auxiliary diagnosis system and method for senile dementia based on video data, the adopted tea-making task is derived from daily life, so elderly people can easily understand and accept it. The operation is simple and convenient, avoiding long and tedious evaluations, and the evaluation result is objective, avoiding result differences caused by subjective differences between evaluators.
In addition, the invention has high diagnosis accuracy, reduces the workload of doctors, and facilitates early cognitive-abnormality warning and diagnosis for users in communities or at home. The pre-trained CLIP model adopted by the invention has strong transfer learning capability on small downstream datasets, greatly reducing training cost.
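The CLIP contrastive setup described in this document (in-batch dot-product similarities forming a batch_size x batch_size matrix with positive pairs on the diagonal) can be sketched as:

```python
import numpy as np

def similarity_matrix(img_feats, txt_feats):
    # L2-normalize, then dot products give a batch_size x batch_size matrix;
    # entry (i, j) is the similarity of image i with text j, and the
    # diagonal holds the matched (positive) pairs that training maximizes.
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    return img @ txt.T
```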
While the invention has been described in detail in the foregoing general description and specific examples, it will be apparent to those skilled in the art that modifications and improvements can be made thereto. Accordingly, such modifications or improvements may be made without departing from the spirit of the invention and are intended to be within the scope of the invention as claimed.
Claims (6)
1. The auxiliary diagnosis system for senile dementia based on video data is characterized in that:
comprises a terminal (101), a network (102), a server (103) and a database (104);
the terminal (101) is in communication connection with the server (103) through the network (102); the terminal (101) is used for collecting video of the operation process of the subject's tea-making task and uploading it to the server (103) through the network (102); the server (103) performs character interaction recognition and senile dementia health-state diagnosis on the images to be detected, and stores the relevant data and original video data information in the database (104);
the server (103) comprises a sample acquisition module (310) and a model training generation module (320);
the sample acquisition module (310) is used for acquiring original video data, converting the data and preprocessing the data, and generating sample data and text labels for training the module;
the model training generation module (320) is used for inputting training sample data and text labels into a neural training network for training.
2. The auxiliary diagnosis system for senile dementia based on video data according to claim 1, wherein: the sample acquisition module (310) comprises an original video data acquisition unit (311), a video data key frame interception unit (312), an image sample generation unit (313) and an image sample preprocessing unit (314);
the original video data acquisition unit (311) is an electronic device with a camera shooting function;
the video data key frame intercepting unit (312) is used for reading the original video data stream, intercepting video frames at equal intervals according to the video fps and an interception time interval t, and saving them as static images in png or jpg format;
the image sample generation unit (313) manually screens and labels, via interactive labeling, the static original images saved by the video data key frame interception unit (312); the manual screening randomly samples images of key character interaction behaviors during the operation process of the subject's tea-making task, the key behaviors including turning on a power switch, boiling water, adding tea leaves, pouring hot water, and pouring tea;
the image sample preprocessing unit (314) resizes all samples to a consistent size after the data are loaded, converts the data into the Tensor form required by the neural training network, and divides the data set into a training set and a test set according to a set ratio; meanwhile, data enhancement techniques are used to reduce the impact of insufficient data volume and improve the robustness of the model.
3. The auxiliary diagnosis system for senile dementia based on video data according to claim 1, wherein: the model training generation module (320) comprises a character interaction behavior recognition model training module (321) and a character interaction behavior recognition model prediction module (322);
the character interaction behavior recognition model training module (321) adopts a CLIP model trained on a large number of image-caption pairs; image and text features are obtained through an Image-Encoder and a Text-Encoder respectively, and the similarity between the texts and images in a batch is computed via dot products, yielding a batch_size x batch_size similarity matrix whose diagonal entries are the similarity values of the positive samples, so the optimization target during training is to make the positive-sample similarity values as large as possible; the model architecture is divided into two parts, an image encoder and a text encoder; the image encoder considers 2 different architectures, namely a residual network (ResNet) or a Vision Transformer (ViT); the residual network variants are ResNet-50, ResNet-101, and, following the EfficientNet scaling idea, models scaled to 4x, 16x, and 64x the compute of ResNet-50: ResNet-50x4, ResNet-50x16, ResNet-50x64; for ViT, 3 pre-trained models are used: ViT-B/32, ViT-B/16, and ViT-L/14; the text encoder uses a Transformer encoder with depth 12 and width 512, with eight attention heads, whose weights are taken from the pre-trained CLIP text encoder;
when the character interaction behavior recognition model prediction module (322) performs inference for the character interaction behavior classification task on an input picture, the model first converts each category label into a sentence of the same form used in pre-training, i.e., the sentence corresponding to each category is obtained through the Prompt operation; the similarity between the input picture and the sentence for each category is then computed, and the category whose sentence has the highest similarity is the predicted category;
the category labels include: turning on a power switch, boiling water, adding tea leaves, pouring hot water, and pouring tea; the Prompt template is "A photo of a [mask]", where [mask] is replaced by the category label.
4. The auxiliary diagnosis system for senile dementia based on video data according to claim 1, wherein the training method of the neural training network comprises the following steps:
S210: preprocessing the character interaction behavior information sequence;
S220: performing feature vectorization on the preprocessing result of the character interaction behavior information sequence;
S230: inputting the feature vector samples into the neural network for training;
S240: saving the trained network model and performing inference-based prediction of cognitive dysfunction.
5. The auxiliary diagnosis system for senile dementia based on video data according to claim 4, wherein: S210 performs a deduplication operation on the sequence, retaining only the key data points where two adjacent values differ, that is, retaining the character interaction behavior change information from the operation process of the subject's tea-making task; the algorithm for judging whether a sequence element is retained is as follows:
S220 performs feature vectorization on the deduplicated sequence of key operation behaviors from the subject's tea-making task, generating samples that meet the input data format required by the neural network;
S230 completes training and validation of the neural network model; candidate network models include MLP, 1DCNN, RNN, LSTM, GRU, CNN+LSTM, TextCNN, BiLSTM, Attention, MultiHeadAttention, Attention+BiLSTM, BiGRU+Attention, Transformer, and PositionalEmbedding+Transformer; the optimizer is Adam, the batch size is 64, the number of training epochs is 10, the initial learning rate is set to 0.001, and the training set and test set are divided in a 9:1 ratio.
6. A diagnostic method of the auxiliary diagnosis system for senile dementia based on video data as claimed in any one of claims 1 to 5, characterized in that the method comprises the following steps:
Step S110: collecting video of the operation process of the tea making task of the subject through a terminal (101);
step S120: uploading the subject tea making operation video data to a server (103) through a network (102);
step S130: the server (103) converts frames of the subject's tea-making operation video data into static pictures;
step S140: the server (103) preprocesses the converted static picture to generate sample data and text labels for module training, and then inputs the training sample data and the text labels into a neural training network for training;
step S150: the server (103) performs character interaction recognition on the images to be detected and diagnoses the health state regarding senile dementia;
step S160: the server (103) stores the relevant data and the original video data information in a database (104).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310497550.8A CN116580832A (en) | 2023-05-05 | 2023-05-05 | Auxiliary diagnosis system and method for senile dementia based on video data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116580832A true CN116580832A (en) | 2023-08-11 |
Family
ID=87543720
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310497550.8A Pending CN116580832A (en) | 2023-05-05 | 2023-05-05 | Auxiliary diagnosis system and method for senile dementia based on video data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116580832A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109473173A (en) * | 2018-09-30 | 2019-03-15 | 华中科技大学 | A kind of the elderly's Cognitive deficiency assessment system and device based on video |
CN110674773A (en) * | 2019-09-29 | 2020-01-10 | 燧人(上海)医疗科技有限公司 | Dementia recognition system, device and storage medium |
KR20200137161A (en) * | 2019-05-29 | 2020-12-09 | 주식회사 허그케어앤테라퓨틱스 | Method for conitive therapy based on artifical intelligence |
CN113239869A (en) * | 2021-05-31 | 2021-08-10 | 西安电子科技大学 | Two-stage behavior identification method and system based on key frame sequence and behavior information |
CN114388143A (en) * | 2021-12-27 | 2022-04-22 | 国家康复辅具研究中心附属康复医院 | Method and device for acquiring facial data of Alzheimer's disease based on game interaction |
CN114511043A (en) * | 2022-04-18 | 2022-05-17 | 苏州浪潮智能科技有限公司 | Image understanding method, device, equipment and medium |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117649718A (en) * | 2024-01-29 | 2024-03-05 | 四川大学华西医院 | Intelligent arrival reporting method, device, apparatus and medium for hospital |
CN117649718B (en) * | 2024-01-29 | 2024-04-23 | 四川大学华西医院 | Intelligent arrival reporting method, device, apparatus and medium for hospital |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chen et al. | Chinesefoodnet: A large-scale image dataset for chinese food recognition | |
CN110210542B (en) | Picture character recognition model training method and device and character recognition system | |
Rajashekar et al. | GAFFE: A gaze-attentive fixation finding engine | |
CN107798653B (en) | Image processing method and device | |
US20140270431A1 (en) | Characterizing pathology images with statistical analysis of local neural network responses | |
WO2017036092A1 (en) | Super-resolution method and system, server, user equipment, and method therefor | |
CN108596046A (en) | A kind of cell detection method of counting and system based on deep learning | |
CN110135461B (en) | Hierarchical attention perception depth measurement learning-based emotion image retrieval method | |
CN111599438B (en) | Real-time diet health monitoring method for diabetics based on multi-mode data | |
CN110490242B (en) | Training method of image classification network, fundus image classification method and related equipment | |
CN111950528B (en) | Graph recognition model training method and device | |
CN112818975A (en) | Text detection model training method and device and text detection method and device | |
CN110135242B (en) | Emotion recognition device and method based on low-resolution infrared thermal imaging depth perception | |
CN106845434B (en) | Image type machine room water leakage monitoring method based on support vector machine | |
CN111954250B (en) | Lightweight Wi-Fi behavior sensing method and system | |
CN116580832A (en) | Auxiliary diagnosis system and method for senile dementia based on video data | |
WO2023004546A1 (en) | Traditional chinese medicine constitution recognition method and apparatus, and electronic device, storage medium and program | |
CN114528411A (en) | Automatic construction method, device and medium for Chinese medicine knowledge graph | |
CN114332911A (en) | Head posture detection method and device and computer equipment | |
Burkapalli et al. | TRANSFER LEARNING: INCEPTION-V3 BASED CUSTOM CLASSIFICATION APPROACH FOR FOOD IMAGES. | |
CN109359543B (en) | Portrait retrieval method and device based on skeletonization | |
CN117115505A (en) | Emotion enhancement continuous training method combining knowledge distillation and contrast learning | |
Nahar et al. | A robust model for translating arabic sign language into spoken arabic using deep learning | |
CN115019367A (en) | Genetic disease face recognition device and method | |
CN114998252A (en) | Image quality evaluation method based on electroencephalogram signals and memory characteristics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||