WO2020107847A1 - Bone point-based fall detection method and fall detection device therefor - Google Patents


Info

Publication number
WO2020107847A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
neural network
points
layer
behavior
Prior art date
Application number
PCT/CN2019/089500
Other languages
French (fr)
Chinese (zh)
Inventor
周涛涛
周宝
陈远旭
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020107847A1 publication Critical patent/WO2020107847A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/23Recognition of whole body movements, e.g. for sport training
    • AHUMAN NECESSITIES
    • A61MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61BDIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00Measuring for diagnostic purposes; Identification of persons
    • A61B5/103Detecting, measuring or recording devices for testing the shape, pattern, colour, size or movement of the body or parts thereof, for diagnostic purposes
    • A61B5/11Measuring movement of the entire body or parts thereof, e.g. head or hand tremor, mobility of a limb
    • A61B5/1116Determining posture transitions
    • A61B5/1117Fall detection
    • AHUMAN NECESSITIES
    • A61MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61BDIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00Measuring for diagnostic purposes; Identification of persons
    • A61B5/103Detecting, measuring or recording devices for testing the shape, pattern, colour, size or movement of the body or parts thereof, for diagnostic purposes
    • A61B5/11Measuring movement of the entire body or parts thereof, e.g. head or hand tremor, mobility of a limb
    • A61B5/1126Measuring movement of the entire body or parts thereof, e.g. head or hand tremor, mobility of a limb using a particular sensing technique
    • A61B5/1128Measuring movement of the entire body or parts thereof, e.g. head or hand tremor, mobility of a limb using a particular sensing technique using image analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Definitions

  • the present application relates to the field of machine vision deep learning technology, and in particular to a fall detection method, device, computer equipment, and storage medium based on bone points.
  • fall detection based on wearable devices, fall detection based on depth cameras, and fall detection based on ordinary cameras.
  • the method based on wearable devices requires the device to be carried at all times, which greatly inconveniences users and has little practical value; the method based on depth cameras is expensive and difficult to promote in practice; and the method based on ordinary cameras is cheap and convenient to use, but places higher demands on the algorithm.
  • the purpose of this application is to provide a fall detection method, device, computer equipment and storage medium based on bone points, which are used to solve the problems in the prior art.
  • the present application provides a bone point-based fall detection method, including the following steps:
  • training a first feature extraction neural network through a first picture sample, where the first feature extraction neural network is used to extract a plurality of first feature points in the first picture sample, and the first feature points represent key bone points on the human body;
  • Input a second video sample into the trained first feature extraction neural network to obtain a plurality of second feature points characterizing key bone points of the human body in the second video sample;
  • the video data of the monitored object is sequentially input into the trained first feature extraction neural network and the second behavior classification neural network to output the behavior category of the monitored object.
  • the present application also proposes a fall detection device based on bone points, including:
  • the first neural network training module is adapted to train a first feature extraction neural network through a first picture sample, the first feature extraction neural network is used to extract multiple first feature points in the first picture sample, the The first feature point represents the key bone point on the human body;
  • the feature point extraction module is adapted to input the second video sample into the trained first feature extraction neural network to obtain a plurality of second feature points characterizing key bone points of the human body in the second video sample;
  • a feature map generation module adapted to encode the multiple second feature points to generate a predicted feature map characterizing the distribution of the multiple second feature points;
  • a second neural network training module adapted to train a second behavior classification neural network through the predicted feature map, and the second behavior classification neural network is used to classify the behavior represented in the predicted feature map;
  • the classification module is adapted to sequentially input the video data of the monitored object into the trained first feature extraction neural network and the second behavior classification neural network to output the behavior category of the monitored object.
  • the present application also provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the following steps are realized:
  • training a first feature extraction neural network through a first picture sample, where the first feature extraction neural network is used to extract a plurality of first feature points in the first picture sample, and the first feature points represent key bone points on the human body;
  • Input a second video sample into the trained first feature extraction neural network to obtain a plurality of second feature points characterizing key bone points of the human body in the second video sample;
  • the video data of the monitored object is sequentially input into the trained first feature extraction neural network and the second behavior classification neural network to output the behavior category of the monitored object.
  • the present application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by the processor, the following steps are realized:
  • training a first feature extraction neural network through a first picture sample, where the first feature extraction neural network is used to extract a plurality of first feature points in the first picture sample, and the first feature points represent key bone points on the human body;
  • Input a second video sample into the trained first feature extraction neural network to obtain a plurality of second feature points characterizing key bone points of the human body in the second video sample;
  • the video data of the monitored object is sequentially input into the trained first feature extraction neural network and the second behavior classification neural network to output the behavior category of the monitored object.
  • This application addresses the problem of insufficient fall detection data in the prior art by using other data to train the human bone point feature extraction neural network; and, rather than detecting falls from bounding-box information, it classifies fall behavior using bone point information.
  • This application trains the first feature extraction neural network through the image sample library to extract key bone point information of the human body, and trains the second behavior classification neural network through the video sample library to judge, based on the extracted key bone point information, whether the human movement in the video is a fall.
  • the bone point information of the monitored object can be accurately extracted, and according to the bone point information, it can be judged in time whether the monitored object has fallen down.
  • providing timely and effective care for elderly and disabled persons in this way is conducive to improving people's quality of life.
  • FIG. 1 is a flowchart of Embodiment 1 of a fall detection method based on bone points of the present application
  • FIG. 2 is a schematic structural diagram of a first feature extraction neural network in Embodiment 1 of the present application.
  • FIG. 3 is a schematic structural diagram of a second behavior classification neural network in Embodiment 1 of the present application.
  • FIG. 4 is a schematic diagram of a program module of a first embodiment of a fall detection device based on a bone point according to this application;
  • FIG. 5 is a schematic diagram of the hardware structure of the computer device of the first embodiment of the present application.
  • the fall detection method, device, computer equipment and storage medium provided by the present application are applicable to the field of machine vision technology, and provide a fall detection method and device for the elderly or disabled persons living alone to detect fall behavior in time.
  • This application trains the first feature extraction neural network through the image sample library to extract key bone point information of the human body, and trains the second behavior classification neural network through the video sample library to judge, based on the extracted key bone point information, whether the human movement in the video is a fall.
  • the first feature extraction neural network and the second behavior classification neural network trained by this application can accurately extract the bone point information of the monitored object and determine in time, according to that bone point information, whether the monitored object has fallen, which helps greatly improve people's quality of life.
  • a fall detection method based on bone points in this embodiment includes the following steps:
  • the first picture sample is selected from the picture sample library to train the first feature extraction neural network.
  • the first picture sample is preferably a full-body picture of the person.
  • the first picture sample is divided into training picture samples and test picture samples, where the training picture samples are used to train the first feature extraction neural network, and the test picture samples are used to verify how well the trained first feature extraction neural network extracts feature information from pictures.
  • the above training picture samples and test picture samples may be subjected to data-enhancement preprocessing, such as applying contrast and brightness transformations to each sample, adding local random Gaussian noise, and performing uniform normalization, thereby obtaining data-enhanced training picture samples and test picture samples.
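The preprocessing listed above can be sketched as follows; the jitter ranges and noise level are illustrative assumptions, not values taken from the patent.

```python
import random

def augment(image, rng=random):
    """Data-enhancement preprocessing sketch: contrast and brightness
    transformation, local random Gaussian noise, and uniform normalization
    to [0, 1]. `image` is a nested list of grayscale pixel values in
    [0, 255]; all jitter ranges here are assumptions."""
    contrast = rng.uniform(0.8, 1.2)       # contrast transformation
    brightness = rng.uniform(-20.0, 20.0)  # brightness transformation
    out = []
    for row in image:
        new_row = []
        for px in row:
            px = contrast * px + brightness
            px += rng.gauss(0.0, 5.0)      # local random Gaussian noise
            px = min(max(px, 0.0), 255.0)
            new_row.append(px / 255.0)     # uniform normalization
        out.append(new_row)
    return out
```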
  • the structure of the first feature extraction neural network in this step will be described in detail below with a test picture sample as an example, as shown in FIG. 2.
  • the test picture sample first enters the feature extraction module to extract the features in the test picture sample.
  • the feature extraction module in this embodiment uses a ResNet residual network to ensure better feature extraction performance.
  • the test picture sample passes through the ResNet residual network to obtain first extracted data D1; D1 then enters four convolution modules with different expansion coefficients, respectively, to obtain four second extracted data D2 with different feature channels.
  • the four second extracted data D2 with different feature channels are merged and fed into a first convolutional layer stacked from residual modules to obtain four third extracted data D3 with different receptive fields.
  • after the four third extracted data D3 with different receptive fields are fused, the result enters a second convolutional layer stacked from residual modules, and finally multiple first feature points representing key bone points on the human body are output.
  • the convolution module includes the following layers in sequence: a convolution layer, a batch normalization layer, a ReLU activation layer, a convolution layer, a batch normalization layer, a ReLU activation layer, and a pooling layer; the convolution layers in different modules have different expansion coefficients.
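As a sketch, the convolution module described above (two convolution–batch-normalization–ReLU stages followed by pooling, with a per-module expansion/dilation coefficient) might look like this in PyTorch. The channel counts, kernel size, pooling type, and the specific dilation values for this first network are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Convolution module per the text: convolution, batch normalization,
    ReLU, convolution, batch normalization, ReLU, pooling. The expansion
    (dilation) coefficient differs between the four parallel modules."""
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.block(x)

# Four parallel modules with different dilation coefficients (values assumed).
branches = nn.ModuleList(ConvModule(64, 32, d) for d in (1, 2, 4, 8))
```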
  • the feature information is the bone feature points on the human body, including the feature points at the main joints of the body, such as the elbow joint, shoulder joint, knee joint, hip joint, etc.
  • the target feature point associated with the preset behavior can be further selected from the bone feature points.
  • the preset behavior may be squatting, bending over, standing up, falling, etc.
  • the feature points that undergo displacement differ between behaviors, so the target feature points that best reflect the characteristics of the behavior to be detected can be selected accordingly.
  • in this embodiment, a total of 14 bone points, covering the head, neck, shoulders, elbows, hands, hips, knees, and feet, are selected as target feature points.
  • on the one hand, this selection method of the present application keeps the number of bone feature points as small as possible to reduce the amount of calculation in the subsequent behavior analysis; on the other hand, the above 14 target feature points are evenly distributed at the major joints of the human body and can reflect the overall trend of human behavior.
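For reference, the 14 target feature points listed above can be enumerated as follows; the left/right split for the paired joints is an assumption about how the text's "shoulders, elbows, hands, hips, knees, and feet" expand to 14 points.

```python
# 14 target bone feature points named in the text; left/right naming assumed.
TARGET_KEYPOINTS = [
    "head", "neck",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_hand", "right_hand",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_foot", "right_foot",
]

# 2 single points + 6 left/right pairs = 14 target feature points in total.
assert len(TARGET_KEYPOINTS) == 14
```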
  • the positions of the bone points listed above are only used as examples, and are not used to limit specific feature point information.
  • the above bone point information may also be reduced or expanded, or the positions of specific feature points on the human body may be changed; this application does not limit this.
  • the plurality of first feature points in this embodiment may preferably take the form of the above bone point distribution information map marked on the human body.
  • x p and y p represent the predicted coordinates of the first feature point extracted by the first feature extraction neural network
  • x g and y g represent the actual coordinates of the first feature point
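The loss expression itself is not reproduced in this text; a common choice consistent with the variables defined above (predicted coordinates x p, y p versus ground-truth coordinates x g, y g) is the mean Euclidean distance over all feature points, sketched here as an assumption.

```python
import math

def keypoint_loss(pred, gt):
    """Mean Euclidean distance between predicted (x_p, y_p) and ground-truth
    (x_g, y_g) coordinates over all first feature points. The exact loss used
    in the patent is not shown in this text; this form is an assumption."""
    assert len(pred) == len(gt)
    total = sum(math.hypot(xp - xg, yp - yg)
                for (xp, yp), (xg, yg) in zip(pred, gt))
    return total / len(pred)
```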
  • S2 Input the second video sample into the trained first feature extraction neural network to obtain a plurality of second feature points that represent key bone points of the human body in the second video sample.
  • this step uses the trained first feature extraction neural network to extract the second feature point in the video sample.
  • the second feature points are preferably the 14 bone feature points mentioned above.
  • the present application performs fall detection based on video of the monitored person collected by an ordinary camera. The object of feature point extraction in this step is therefore a continuous video rather than a single picture. Since a video is a series of picture frames changing over time, the video must first be sampled to extract target pictures; for example, frames are extracted at 20 frames per second, with 3 seconds taken as one sample. At the same time, to generate diverse samples, the starting frame can be chosen randomly near the starting point of the behavior in the video.
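The sampling scheme described above (20 frames per second, 3-second samples, start frame randomly chosen near the behavior's start) might be sketched as follows; the size of the random jitter window is an assumption.

```python
import random

def sample_clip(num_frames, behavior_start, fps=20, seconds=3, jitter=10,
                rng=random):
    """Return the frame indices of one 3-second, 20-fps training sample whose
    start frame is randomly jittered around the labelled behavior start."""
    clip_len = fps * seconds                           # 60 frames per sample
    start = behavior_start + rng.randint(-jitter, jitter)
    start = max(0, min(start, num_frames - clip_len))  # keep clip inside video
    return list(range(start, start + clip_len))
```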
  • the feature point information in the target pictures can be extracted through the first feature extraction neural network, preferably the 14 bone feature points mentioned above.
  • S3 Encoding the plurality of second feature points to generate a predicted feature map.
  • This step is used to process the extracted second feature points to obtain a predicted feature map. Taking the above 14 bone feature points as an example, the following processing steps are included:
  • any two feature points from the 14 bone feature points are paired, and the calculation formula is as follows:
  • x it and y it respectively represent the horizontal and vertical coordinates of the i-th second feature point at time t; l ijt represents the Euclidean distance between the i-th and j-th second feature points at time t; v xit represents the velocity of the i-th second feature point in the x direction at time t; and v yit represents its velocity in the y direction.
  • the resulting matrix map is the predicted feature map.
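Since the formula itself is not reproduced in this text, the per-frame encoding can be sketched from the variable definitions above: pairwise Euclidean distances l ijt between the 14 points, plus per-point velocities (v xit, v yit) estimated against the previous frame. Stacking these vectors frame by frame yields the matrix used as the predicted feature map. Feature ordering and normalization are assumptions.

```python
import math

def encode_frame(points_t, points_prev, dt=1.0 / 20):
    """One row of the predicted feature map: all pairwise Euclidean
    distances between keypoints at time t, followed by each point's x/y
    velocity relative to the previous frame (dt = 1/20 s at 20 fps)."""
    n = len(points_t)
    dists = [math.hypot(points_t[i][0] - points_t[j][0],
                        points_t[i][1] - points_t[j][1])
             for i in range(n) for j in range(i + 1, n)]    # l_ijt
    vels = []
    for (x, y), (xp, yp) in zip(points_t, points_prev):
        vels += [(x - xp) / dt, (y - yp) / dt]              # v_xit, v_yit
    return dists + vels
```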
  • the purpose of this step is to train a second behavior classification neural network to classify the behavior represented in the prediction feature map to determine whether a fall behavior has occurred.
  • the structure of the second behavior classification neural network in this application is shown in FIG. 3, which will be described in detail below.
  • the predicted feature map first passes through a conventional convolution module to obtain first classification data R1. The first classification data R1 then passes through four convolution modules with different expansion coefficients, respectively, to obtain four second classification data R2 with different feature channels; preferably, the expansion coefficients of the four convolution modules are 1, 3, 6, and 12. Next, the four second classification data R2 with different feature channels are merged and passed sequentially through three conventional convolution modules, finally outputting the behavior classification, which is used to judge which behavior category the behavior represented in the above predicted feature map belongs to.
  • the convolution module includes the following layers in sequence: a convolution layer, a batch normalization layer, a ReLU activation layer, a convolution layer, a batch normalization layer, a ReLU activation layer, and a pooling layer.
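A minimal PyTorch sketch of the classification network described above: one conventional convolution module (producing R1), four parallel modules with dilation rates 1, 3, 6, and 12 (producing the four R2), merging, three further conventional modules, and a classification head. The channel counts, input size, number of behavior classes, and the use of global average pooling before the final layer are assumptions.

```python
import torch
import torch.nn as nn

def conv_module(in_ch, out_ch, dilation=1):
    # Conv-BN-ReLU, Conv-BN-ReLU, pooling, per the module layout in the text.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class BehaviorClassifier(nn.Module):
    def __init__(self, num_classes=6):  # e.g. squat, stand, wave, bend, fall, lie
        super().__init__()
        self.stem = conv_module(1, 32)                       # -> R1
        self.branches = nn.ModuleList(
            conv_module(32, 16, d) for d in (1, 3, 6, 12))   # -> four R2
        self.tail = nn.Sequential(                           # three conventional modules
            conv_module(64, 64), conv_module(64, 64), conv_module(64, 64))
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        x = self.stem(x)
        x = torch.cat([b(x) for b in self.branches], dim=1)  # merge feature channels
        x = self.tail(x)
        return self.head(x.mean(dim=(2, 3)))                 # logits per behavior
```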
  • the second behavior classification neural network is trained with the loss function L H (X, Y); its specific expression is as follows:
  • x k represents the parameter value of the kth behavior category
  • z k represents the predicted probability of the kth behavior category.
  • the second behavior classification neural network can recognize categories such as squatting, standing, waving, bending, falling, and lying down; each behavior corresponds to its own parameter value. For example, when the monitored person is falling, x k represents the parameter value of the fall behavior and z k represents the predicted probability of the fall behavior.
  • this embodiment adds an L2 regularization term to the loss function to prevent overfitting.
  • the resulting cost function is as follows:
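Neither the loss expression nor the resulting cost function is reproduced in this text. A standard form consistent with the description — softmax cross-entropy over the behavior categories (with z k the predicted probabilities) plus an L2 penalty on the network weights — is sketched here as an assumption; the regularization strength `lam` is illustrative.

```python
import math

def cost(logits, true_class, weights, lam=1e-4):
    """Softmax cross-entropy over behavior categories plus an L2 penalty.
    This reconstructs the described objective under standard assumptions;
    the patent's exact expressions are not reproduced in this text."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = [e / sum(exps) for e in exps]        # predicted probabilities z_k
    ce = -math.log(z[true_class])            # cross-entropy loss term
    l2 = lam * sum(w * w for w in weights)   # L2 term against overfitting
    return ce + l2
```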
  • S5 Input the video data of the monitored object into the trained first feature extraction neural network and the second behavior classification neural network in order to output the behavior category of the monitored object.
  • the present application can detect the fall behavior of the actual surveillance video.
  • the video information of the monitored object is photographed in real time through a common camera, and the video information is sampled to extract a certain number of target images.
  • the target image first undergoes a trained first feature extraction neural network to extract multiple feature points in the target image, such as bone feature points.
  • the multiple bone feature points are then calculated and combined: for example, the Euclidean distance between every two bone feature points and the velocities in the x and y directions are computed, and the resulting vectors are arranged in the order of the image frames to finally obtain the predicted feature map.
  • the category to which the behavior contained in the predicted feature map belongs, for example whether it is a fall, can then be obtained.
  • the fall detection device 10 may include or be divided into one or more program modules, which are stored in a storage medium and executed by one or more processors to complete this application and implement the above fall detection method.
  • the program modules referred to in this application are a series of computer program instruction segments capable of performing specific functions, and are more suitable than the program itself for describing the execution process of the fall detection device 10 in the storage medium. The following description specifically introduces the functions of the program modules of this embodiment:
  • the first neural network training module 11 is adapted to train a first feature extraction neural network through a first picture sample.
  • the first feature extraction neural network is used to extract multiple first feature points in the first picture sample.
  • the first feature point represents the key bone point on the human body;
  • the feature point extraction module 12 is adapted to input the second video sample into the trained first feature extraction neural network to obtain a plurality of second feature points characterizing key bone points of the human body in the second video sample;
  • the feature map generation module 13 is adapted to encode the multiple second feature points to generate a predicted feature map characterizing the distribution of the multiple second feature points;
  • the second neural network training module 14 is adapted to train a second behavior classification neural network through the predicted feature map, and the second behavior classification neural network is used to classify the behavior represented in the predicted feature map;
  • the classification module 15 is adapted to sequentially input the video data of the monitored object into the trained first feature extraction neural network and the second behavior classification neural network to output the behavior category of the monitored object.
  • This embodiment also provides a computer device, such as a smartphone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server, or a server cluster composed of multiple servers), etc.
  • the computer device 20 of this embodiment includes at least but not limited to: a memory 21 and a processor 22 that can be communicatively connected to each other through a system bus, as shown in FIG. 5. It should be noted that FIG. 5 only shows the computer device 20 having components 21-22, but it should be understood that it is not required to implement all the components shown, and that more or fewer components may be implemented instead.
  • the memory 21 (i.e., readable storage medium) includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc.
  • the memory 21 may be an internal storage unit of the computer device 20, such as a hard disk or memory of the computer device 20.
  • the memory 21 may also be an external storage device of the computer device 20, such as a plug-in hard disk equipped on the computer device 20, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc.
  • the memory 21 may also include both the internal storage unit of the computer device 20 and its external storage device.
  • the memory 21 is generally used to store the operating system and various application software installed in the computer device 20, such as the program code of the fall detection device 10 of the first embodiment.
  • the memory 21 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 22 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments.
  • the processor 22 is generally used to control the overall operation of the computer device 20.
  • the processor 22 is used to run the program code or process data stored in the memory 21, for example, to run the fall detection device 10, so as to implement the fall detection method of the first embodiment.
  • This embodiment also provides a computer-readable storage medium, such as flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, server, App store, etc., which stores a computer program that realizes the corresponding function when executed by a processor.
  • the computer-readable storage medium of this embodiment is used to store the fall detection device 10, and when executed by the processor, the fall detection method of the first embodiment is implemented.
  • Any process or method description in a flowchart or otherwise described herein may be understood to represent a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present application belong.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Surgery (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Veterinary Medicine (AREA)
  • Public Health (AREA)
  • Animal Behavior & Ethology (AREA)
  • General Engineering & Computer Science (AREA)
  • Physiology (AREA)
  • Dentistry (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biophysics (AREA)
  • Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Medical Informatics (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Image Analysis (AREA)

Abstract

The present application provides a bone point-based fall detection method and a fall detection device therefor. The method comprises: training a first feature extraction neural network by means of a first picture sample, the first feature extraction neural network being used for extracting a plurality of first feature points representing key bone points of a human body; inputting a second video sample to the trained first feature extraction neural network to obtain a plurality of second feature points representing the key bone points of the human body in the second video sample; encoding the plurality of second feature points to generate a prediction feature map; training a second behavior classification neural network by means of the prediction feature map, the second behavior classification neural network being used for classifying behaviors represented in the prediction feature map; and sequentially inputting video data of a monitored object into the trained first feature extraction neural network and the trained second behavior classification neural network to output a behavior category of the monitored object.

Description

基于骨骼点的跌倒检测方法及其跌倒检测装置Skeletal point-based fall detection method and fall detection device
相关申请的交叉引用Cross-reference of related applications
本申请申明享有2018年11月28日递交的申请号为CN201811433808.3、名称为“基于骨骼点的跌倒检测方法及其跌倒检测装置”的中国专利申请的优先权,该中国专利申请的整体内容以参考的方式结合在本申请中。This application declares to enjoy the priority of the Chinese patent application filed on November 28, 2018 with the application number CN201811433808.3 and titled "skeletal point-based fall detection method and fall detection device". The entire content of the Chinese patent application Incorporated by reference in this application.
技术领域Technical field
本申请涉及机器视觉深度学习技术领域,尤其涉及一种基于骨骼点的跌倒检测方法、装置、计算机设备及存储介质。The present application relates to the field of machine vision deep learning technology, and in particular to a fall detection method, device, computer equipment, and storage medium based on bone points.
背景技术Background technique
As China enters an aging society, the problem of elderly care is becoming increasingly serious. The physical functions of the elderly decline and their mobility decreases; in particular, deficits in balance, reaction time, and coordination can lead to accidental falls. After a fall, an elderly person who does not receive timely assistance may even die at home. Fall detection for the elderly in the home and other environments is therefore a meaningful research problem in computer vision and machine learning.

There are currently three main approaches to fall detection: detection based on wearable devices, detection based on depth cameras, and detection based on ordinary cameras. Wearable devices must be carried at all times, which is highly inconvenient for users and limits their practical value; depth cameras are expensive and therefore difficult to deploy widely; ordinary cameras are cheap and convenient but place higher demands on the algorithm.

Because ordinary cameras already cover all kinds of places, the hardware foundation for this approach is mature, and many camera-based fall detection methods have been proposed. For example, the information in an image sequence can be used directly to classify fall behavior, or a detection algorithm can be used to classify changes in a person's bounding box. However, fall detection data are currently scarce and limited to a single scene, so such methods do not transfer to the variety of real-world scenarios. Classification methods that operate directly on image sequences cannot train a strong network with so little data. Methods that classify a person's bounding box draw on large amounts of information from other datasets and can detect people effectively, but the bounding-box information itself is so limited that a well-generalizing classifier cannot be obtained from it.
Summary
The purpose of the present application is to provide a bone point-based fall detection method, device, computer equipment, and storage medium, so as to solve the problems in the prior art.

To achieve the above objective, the present application provides a bone point-based fall detection method, comprising the following steps:
training a first feature extraction neural network with first picture samples, the first feature extraction neural network being used to extract a plurality of first feature points from the first picture samples, the first feature points representing key bone points of the human body;

inputting a second video sample into the trained first feature extraction neural network to obtain a plurality of second feature points representing the key bone points of the human body in the second video sample;

encoding the plurality of second feature points to generate a predicted feature map;

training a second behavior classification neural network with the predicted feature map, the second behavior classification neural network being used to classify the behavior represented in the predicted feature map; and

inputting video data of a monitored subject sequentially into the trained first feature extraction neural network and the trained second behavior classification neural network to output the behavior category of the monitored subject.
To achieve the above objective, the present application further provides a bone point-based fall detection device, comprising:
a first neural network training module, configured to train a first feature extraction neural network with first picture samples, the first feature extraction neural network being used to extract a plurality of first feature points from the first picture samples, the first feature points representing key bone points of the human body;

a feature point extraction module, configured to input a second video sample into the trained first feature extraction neural network to obtain a plurality of second feature points representing the key bone points of the human body in the second video sample;

a feature map generation module, configured to encode the plurality of second feature points to generate a predicted feature map representing the distribution of the plurality of second feature points;

a second neural network training module, configured to train a second behavior classification neural network with the predicted feature map, the second behavior classification neural network being used to classify the behavior represented in the predicted feature map; and

a classification module, configured to input video data of a monitored subject sequentially into the trained first feature extraction neural network and the trained second behavior classification neural network to output the behavior category of the monitored subject.
To achieve the above objective, the present application further provides a computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
training a first feature extraction neural network with first picture samples, the first feature extraction neural network being used to extract a plurality of first feature points from the first picture samples, the first feature points representing key bone points of the human body;

inputting a second video sample into the trained first feature extraction neural network to obtain a plurality of second feature points representing the key bone points of the human body in the second video sample;

encoding the plurality of second feature points to generate a predicted feature map;

training a second behavior classification neural network with the predicted feature map, the second behavior classification neural network being used to classify the behavior represented in the predicted feature map; and

inputting video data of a monitored subject sequentially into the trained first feature extraction neural network and the trained second behavior classification neural network to output the behavior category of the monitored subject.
To achieve the above objective, the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:
training a first feature extraction neural network with first picture samples, the first feature extraction neural network being used to extract a plurality of first feature points from the first picture samples, the first feature points representing key bone points of the human body;

inputting a second video sample into the trained first feature extraction neural network to obtain a plurality of second feature points representing the key bone points of the human body in the second video sample;

encoding the plurality of second feature points to generate a predicted feature map;

training a second behavior classification neural network with the predicted feature map, the second behavior classification neural network being used to classify the behavior represented in the predicted feature map; and

inputting video data of a monitored subject sequentially into the trained first feature extraction neural network and the trained second behavior classification neural network to output the behavior category of the monitored subject.
To address the shortage of fall detection data in the prior art, the present application trains the human bone point feature extraction neural network on other data; to address the inadequacy of bounding-box information for detecting falls, it classifies fall behavior using bone point information. The present application trains a first feature extraction neural network on a picture sample library to extract key bone point information of the human body, and trains a second behavior classification neural network on a video sample library to judge, on the basis of the extracted key bone point information, whether the human movement in a video is a fall. The first feature extraction neural network and the second behavior classification neural network trained by the present application can accurately extract the bone point information of a monitored subject and determine in time, from that information, whether the subject has fallen, providing timely and effective care for the elderly with limited mobility, the disabled, and others, and helping improve people's quality of life.
Brief Description of the Drawings
FIG. 1 is a flowchart of Embodiment 1 of the bone point-based fall detection method of the present application;

FIG. 2 is a schematic structural diagram of the first feature extraction neural network in Embodiment 1 of the present application;

FIG. 3 is a schematic structural diagram of the second behavior classification neural network in Embodiment 1 of the present application;

FIG. 4 is a schematic diagram of the program modules of Embodiment 1 of the bone point-based fall detection device of the present application;

FIG. 5 is a schematic diagram of the hardware structure of Embodiment 1 of the computer device of the present application.
Detailed Description
In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present application and are not intended to limit it. Based on the embodiments in the present application, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present application.

The fall detection method, device, computer equipment, and storage medium provided by the present application are applicable to the field of machine vision and provide elderly or disabled persons living alone with a fall detection method and device that can detect a fall in time. The present application trains a first feature extraction neural network on a picture sample library to extract key bone point information of the human body, and trains a second behavior classification neural network on a video sample library to judge, on the basis of the extracted key bone point information, whether the human movement in a video is a fall. The first feature extraction neural network and the second behavior classification neural network trained by the present application can accurately extract the bone point information of a monitored subject and determine in time, from that information, whether the subject has fallen, which helps greatly improve people's quality of life.
Embodiment 1
Referring to FIG. 1, the bone point-based fall detection method of this embodiment comprises the following steps:

S1: Train a first feature extraction neural network with first picture samples, the first feature extraction neural network being used to extract a plurality of first feature points from the first picture samples, the first feature points representing key bone points of the human body.
In this step, first picture samples are selected from a picture sample library to train the first feature extraction neural network; the first picture samples are preferably full-body pictures of people. In practice, the first picture samples are divided into training picture samples and test picture samples: the training picture samples are used to train the first feature extraction neural network, and the test picture samples are used to verify how well the trained network extracts feature information from pictures. Preferably, both the training and test picture samples may be preprocessed with data augmentation, for example by applying contrast and brightness transformations to each sample, adding local random Gaussian noise, and performing uniform normalization, thereby obtaining augmented training and test picture samples.
The structure of the first feature extraction neural network in this step is described in detail below using a test picture sample as an example, as shown in FIG. 2. The test picture sample first enters a feature extraction module that extracts its features; in this embodiment the feature extraction module uses a ResNet residual network to ensure good feature extraction performance. Passing the test picture sample through the ResNet residual network yields first extracted data D1. D1 then enters four convolution modules with different dilation coefficients, yielding four sets of second extracted data D2 with different feature channels. Next, the four sets of D2 are combined and enter a first convolutional layer built from stacked residual modules, yielding four sets of third extracted data D3 with different receptive fields. Finally, the four sets of D3 are fused and enter a second convolutional layer built from stacked residual modules, which outputs the plurality of first feature points representing the key bone points of the human body.
It should be noted that the number of convolution modules and the dilation coefficient values disclosed in this embodiment are merely illustrative and are not limiting. A person of ordinary skill in the art may change the number of convolution modules and the dilation coefficient values according to actual needs, and such changes fall within the protection scope of the present application.
Preferably, each convolution module consists of the following layers in sequence: a convolutional layer, a batch normalization layer, a ReLU activation layer, a convolutional layer, a batch normalization layer, a ReLU activation layer, and a pooling layer, where the convolutional layers in different modules have different dilation coefficients.
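To make the module structure concrete, the following is a minimal PyTorch sketch of one such convolution module and of the four parallel dilated branches. The channel counts, kernel sizes, input shape, and the dilation values (1, 2, 4, 8) are illustrative assumptions; the application fixes only the layer order and the fact that each branch uses a different dilation coefficient.

```python
import torch
import torch.nn as nn

class DilatedConvModule(nn.Module):
    """One convolution module as described above:
    Conv -> BN -> ReLU -> Conv -> BN -> ReLU -> Pool.
    Channel counts, kernel size, and the dilation values used below
    are assumptions; only the layer order comes from the text."""

    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=dilation, dilation=dilation),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.block(x)

# D1 from the ResNet backbone (shape assumed) feeds four parallel
# branches with different dilation rates; their outputs D2 are
# concatenated on the channel axis before the residual layers.
d1 = torch.randn(1, 64, 56, 56)
branches = nn.ModuleList(DilatedConvModule(64, 32, d) for d in (1, 2, 4, 8))
d2 = torch.cat([branch(d1) for branch in branches], dim=1)
print(tuple(d2.shape))  # (1, 128, 28, 28)
```

Because `padding=dilation` with a 3x3 kernel preserves the spatial size, all four branches produce tensors of identical height and width and can be concatenated directly.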
In this step, the feature information consists of bone feature points on the human body, including points at the major joints such as the elbows, shoulders, knees, and hips. On this basis, target feature points associated with a preset behavior can be further selected from the bone feature points. The preset behavior may be squatting, bending over, standing up, falling, and so on; the feature points that undergo displacement may differ from behavior to behavior, so the target feature points that best reflect the behavior to be detected can be chosen accordingly. In a preferred embodiment of the present application, balancing detection accuracy against the amount of data to be processed, 14 bone points are selected as target feature points: the head, the neck, the two shoulders, the two elbows, the two hands, the two hips, the two knees, and the two feet. On the one hand, this selection keeps the number of bone feature points as small as possible, reducing the amount of computation in the subsequent behavior analysis; on the other hand, these 14 target feature points are distributed evenly over the major joints of the human body and can reflect the overall trend of a person's movement. Those skilled in the art will understand that the bone point positions listed above are merely examples and do not limit the specific feature point information; depending on the specific circumstances, bone points may be removed, added, or relocated (for example, acupuncture point information of the human body may also be obtained), and the present application imposes no limitation in this respect. On this basis, the plurality of first feature points in this embodiment may preferably be a distribution map of the above bone points marked on the human body.
This step trains the first feature extraction neural network using stochastic gradient descent with momentum; the loss function measures the distance between the predicted and actual feature point coordinates:

L = Σᵢ √((x_p,i − x_g,i)² + (y_p,i − y_g,i)²)

where x_p and y_p denote the predicted coordinates of the i-th first feature point extracted by the first feature extraction neural network, and x_g and y_g denote the actual coordinates of that first feature point.
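A minimal NumPy sketch of this coordinate loss follows. The summed Euclidean-distance form and the (N, 2) array layout are assumptions inferred from the variables x_p, y_p, x_g, y_g defined above.

```python
import numpy as np

def keypoint_loss(pred, gt):
    """Summed Euclidean distance between predicted and ground-truth
    keypoint coordinates. `pred` and `gt` are (N, 2) arrays of (x, y)
    pairs; the exact reduction (sum vs. mean) is an assumption."""
    return float(np.sqrt(((pred - gt) ** 2).sum(axis=1)).sum())

pred = np.array([[3.0, 4.0], [1.0, 1.0]])  # x_p, y_p per keypoint
gt = np.array([[0.0, 0.0], [1.0, 1.0]])    # x_g, y_g per keypoint
print(keypoint_loss(pred, gt))  # 5.0: sqrt(3^2 + 4^2) for the first point, 0 for the second
```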
S2: Input a second video sample into the trained first feature extraction neural network to obtain a plurality of second feature points representing the key bone points of the human body in the second video sample.

On the basis of the first feature extraction neural network trained in step S1, this step uses the trained network to extract second feature points from video samples; preferably, the second feature points are the 14 bone feature points mentioned above.
The present application performs fall detection on video of the monitored person captured by an ordinary camera, so the object of feature point extraction in this step is continuous video rather than a single picture. Because a video is formed by a sequence of picture frames changing over time, the video must first be sampled to extract target pictures. For example, pictures may be extracted from the video at 20 frames per second, with 3 seconds of frames forming one sample. To diversify the samples, the starting frame can be chosen at random near the starting point of the behavior in the video.
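The sampling scheme just described (20 frames per second, 3-second samples, a start frame jittered around the behavior onset) can be sketched as follows; the function name, the jitter width, and the handling of clip boundaries are illustrative assumptions.

```python
import random

def sample_clip(total_frames, fps=20, seconds=3, behavior_start=None, jitter=10):
    """Return the frame indices of one training sample: a 3-second
    window (60 frames at 20 fps). When the behavior's starting frame
    is known, the window start is jittered around it so that repeated
    sampling of the same video yields diverse samples."""
    clip_len = fps * seconds
    last_start = max(0, total_frames - clip_len)
    if behavior_start is None:
        start = random.randint(0, last_start)
    else:
        lo = max(0, behavior_start - jitter)
        hi = min(last_start, behavior_start + jitter)
        start = random.randint(lo, max(lo, hi))
    return list(range(start, start + clip_len))

frames = sample_clip(total_frames=600, behavior_start=200)
print(len(frames), frames[0])
```

Each call yields a 60-frame window whose start lies within 10 frames of the behavior onset, so repeated sampling of one video produces several distinct training samples.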
After a sufficient number of target pictures have been extracted, the feature point information in each target picture, preferably the 14 bone feature points mentioned above, can be extracted by the first feature extraction neural network.
S3: Encode the plurality of second feature points to generate a predicted feature map.

This step processes the extracted second feature points to obtain the predicted feature map. Taking the 14 bone feature points above as an example, the processing comprises the following steps:
S31: Pair the bone feature points two by two.

In this embodiment, any two of the 14 bone feature points are paired; the number of pairs is:

C(14, 2) = 14!/(12!·2!) = 91
S32: Compute the Euclidean distance l_ijt between every two bone feature points, and the velocities v_xit and v_yit of each point:

l_ijt = √((x_it − x_jt)² + (y_it − y_jt)²)

v_xit = x_it − x_i(t−1)

v_yit = y_it − y_i(t−1)

In the above formulas, x_it and y_it denote the horizontal and vertical coordinates of the i-th second feature point at time t; l_ijt denotes the Euclidean distance between the i-th and j-th second feature points at time t; v_xit denotes the velocity of the i-th second feature point in the x direction at time t; and v_yit denotes its velocity in the y direction.
S33: Combine all of the computed Euclidean distances and velocities to form the predicted feature map.

For any sample image, pairing the 14 bone feature points two by two yields 91 combinations, so 91 Euclidean distances can be computed. Each bone feature point also has one velocity in the x direction and one in the y direction, for 14 x-velocities and 14 y-velocities, giving 91 + 14 + 14 = 119 feature values per frame in total.

Assuming that 60 frames of images are processed in this step, arranging each frame's feature vector in order yields a 60 × 119 matrix. This matrix is the predicted feature map.
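Steps S31 to S33 can be sketched end to end in a few lines of Python. The synthetic coordinates below are placeholders, but the 91 + 14 + 14 = 119 encoding follows the computation described above.

```python
import itertools
import math

def frame_features(pts, prev_pts):
    """Encode the 14 keypoints of one frame into a 119-dimensional
    vector: 91 pairwise Euclidean distances plus 14 x-velocities and
    14 y-velocities relative to the previous frame."""
    dists = [math.dist(pts[i], pts[j])
             for i, j in itertools.combinations(range(len(pts)), 2)]
    vx = [p[0] - q[0] for p, q in zip(pts, prev_pts)]
    vy = [p[1] - q[1] for p, q in zip(pts, prev_pts)]
    return dists + vx + vy

# 61 frames of 14 (x, y) keypoints yield velocities for 60 frames,
# i.e. a 60 x 119 predicted feature map (synthetic coordinates here).
clip = [[(float(t + i), float(2 * i)) for i in range(14)] for t in range(61)]
feature_map = [frame_features(clip[t], clip[t - 1]) for t in range(1, 61)]
print(len(feature_map), len(feature_map[0]))  # 60 119
```

In the synthetic clip every keypoint moves one unit per frame along x, so the 14 x-velocity entries of each row equal 1.0 and the y-velocities equal 0.0.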
S4: Train a second behavior classification neural network with the predicted feature map, the second behavior classification neural network being used to classify the behavior represented in the predicted feature map.

With the predicted feature map obtained, the purpose of this step is to train a second behavior classification neural network that classifies the behavior represented in the predicted feature map, thereby determining whether a fall has occurred. The structure of the second behavior classification neural network of the present application is shown in FIG. 3 and described in detail below.
Take a predicted feature map obtained in step S3 as an example. The predicted feature map first passes through an ordinary convolution module, yielding first classification data R1. R1 then passes through four convolution modules with different dilation coefficients, yielding four sets of second classification data R2 with different feature channels; preferably, the dilation coefficients of the four modules are 1, 3, 6, and 12, respectively. Next, the four sets of R2 are combined and passed sequentially through three ordinary convolution modules, and the network finally outputs the behavior classification, which indicates the behavior category of the behavior represented in the predicted feature map.
It should be noted that the number of convolution modules and the dilation coefficient values disclosed in this embodiment are merely illustrative and are not limiting. A person of ordinary skill in the art may change the number of convolution modules and the dilation coefficient values according to actual needs, and such changes fall within the protection scope of the present application.
Preferably, each convolution module consists of the following layers in sequence: a convolutional layer, a batch normalization layer, a ReLU activation layer, a convolutional layer, a batch normalization layer, a ReLU activation layer, and a pooling layer.
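A minimal PyTorch sketch of this classification network follows. The channel counts, the number of behavior categories, and the pooling and head details are assumptions; the branch dilations (1, 3, 6, 12) and the stem/branches/head layout follow the description above.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, dilation=1):
    """Conv -> BN -> ReLU -> Conv -> BN -> ReLU -> Pool (sizes assumed)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class BehaviorClassifier(nn.Module):
    """Ordinary module -> four dilated branches (1, 3, 6, 12) -> concat
    -> three ordinary modules -> class logits, as described above."""

    def __init__(self, num_classes=6):
        super().__init__()
        self.stem = conv_block(1, 16)                      # produces R1
        self.branches = nn.ModuleList(
            conv_block(16, 16, d) for d in (1, 3, 6, 12))  # produces R2
        self.head = nn.Sequential(
            conv_block(64, 32),
            conv_block(32, 32),
            conv_block(32, 32),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        x = self.stem(x)
        x = torch.cat([b(x) for b in self.branches], dim=1)
        return self.head(x)

# The 60 x 119 predicted feature map is fed as a one-channel "image".
logits = BehaviorClassifier()(torch.randn(2, 1, 60, 119))
print(tuple(logits.shape))  # (2, 6)
```

The adaptive average pooling at the end makes the head independent of the exact clip length, so the same network could accept feature maps built from clips of other durations.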
In this step, the second behavior classification neural network is trained with the loss function L_H(X, Y), expressed as:

L_H(X, Y) = −Σ_k y_k · log(z_k),  with z_k = e^{x_k} / Σ_j e^{x_j}

where x_k denotes the network output value for the k-th behavior category, z_k denotes the predicted probability of the k-th behavior category, and y_k is 1 for the labeled category and 0 otherwise. For example, the categories that the second behavior classification neural network can recognize may include squatting, standing up, waving, bending over, falling, and lying flat, each corresponding to its own output value. When the video shows the monitored person falling, x_k denotes the output value for the fall category and z_k denotes the predicted probability that the monitored person is falling.
To prevent overfitting, this embodiment adds an L2 regularization term to the loss function. The resulting cost function is:

L(X, Y) = L_H(X, Y) + L2.
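The cost function above can be sketched in NumPy. The softmax form of z_k and the regularization strength `lam` are assumptions; only the decomposition L = L_H + L2 comes from the text.

```python
import numpy as np

def total_loss(logits, label, weights, lam=1e-4):
    """Cost function L(X, Y) = L_H(X, Y) + L2: softmax cross-entropy
    for the labeled behavior category plus an L2 penalty on the
    network weights."""
    z = np.exp(logits - logits.max())
    z = z / z.sum()                     # z_k: predicted probabilities
    l_h = -float(np.log(z[label]))      # cross-entropy with one-hot label
    l2 = lam * sum(float((w ** 2).sum()) for w in weights)
    return l_h + l2, z

logits = np.array([2.0, 0.5, 0.1, -1.0, 3.0, 0.0])  # e.g. 6 behavior categories
loss, z = total_loss(logits, label=4, weights=[np.ones((3, 3))])
print(int(z.argmax()), loss > 0)  # 4 True
```

Subtracting the maximum logit before exponentiating is the usual numerically stable way to evaluate the softmax and does not change z_k.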
S5: Input the video data of the monitored subject sequentially into the trained first feature extraction neural network and the trained second behavior classification neural network to output the behavior category of the monitored subject.

With the training of the first feature extraction neural network and the second behavior classification neural network complete, the present application can detect falls in actual surveillance video. Specifically, an ordinary camera captures video of the monitored subject in real time, and a number of target images are extracted from this video by sampling. The target images first pass through the trained first feature extraction neural network, which extracts multiple feature points, for example the bone feature points, from each image. The bone feature points are then computed on and combined, for example by calculating the Euclidean distance between every two bone feature points and the velocity of each point in the x and y directions, and the resulting vectors are arranged in frame order to form the predicted feature map. Finally, feeding the predicted feature map into the second behavior classification neural network yields the category of the behavior contained in the predicted feature map, for example whether it is a fall.
Referring to FIG. 4, a fall detection device is shown. In this embodiment, the fall detection device 10 may comprise, or be divided into, one or more program modules, which are stored in a storage medium and executed by one or more processors to complete the present application and implement the fall detection method described above. A program module, as referred to in the present application, is a series of computer program instruction segments capable of performing specific functions, and is better suited than the program itself to describe the execution of the fall detection device 10 in the storage medium. The following description introduces the functions of the program modules of this embodiment:
a first neural network training module 11, configured to train a first feature extraction neural network with first picture samples, the first feature extraction neural network being used to extract a plurality of first feature points from the first picture samples, the first feature points representing key bone points of the human body;

a feature point extraction module 12, configured to input a second video sample into the trained first feature extraction neural network to obtain a plurality of second feature points representing the key bone points of the human body in the second video sample;

a feature map generation module 13, configured to encode the plurality of second feature points to generate a predicted feature map representing the distribution of the plurality of second feature points;

a second neural network training module 14, configured to train a second behavior classification neural network with the predicted feature map, the second behavior classification neural network being used to classify the behavior represented in the predicted feature map; and

a classification module 15, configured to input video data of a monitored subject sequentially into the trained first feature extraction neural network and the trained second behavior classification neural network to output the behavior category of the monitored subject.
This embodiment also provides a computer device capable of executing programs, such as a smartphone, tablet computer, notebook computer, desktop computer, rack server, blade server, tower server, or cabinet server (including an independent server, or a server cluster composed of multiple servers). The computer device 20 of this embodiment includes at least, but is not limited to, a memory 21 and a processor 22 that can communicate with each other through a system bus, as shown in FIG. 5. It should be noted that FIG. 5 shows only the computer device 20 with components 21-22; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
In this embodiment, the memory 21 (i.e., a readable storage medium) includes flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 20, such as its hard disk or internal memory. In other embodiments, the memory 21 may also be an external storage device of the computer device 20, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device 20. Of course, the memory 21 may also include both the internal storage unit of the computer device 20 and its external storage device. In this embodiment, the memory 21 is generally used to store the operating system and the application software installed on the computer device 20, such as the program code of the fall detection device 10 of the first embodiment. In addition, the memory 21 can also be used to temporarily store various types of data that have been or will be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is generally used to control the overall operation of the computer device 20. In this embodiment, the processor 22 runs the program code or processes the data stored in the memory 21, for example running the fall detection device 10 to implement the fall detection method of the first embodiment.
This embodiment also provides a computer-readable storage medium, such as flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, a server, or an app store, on which a computer program is stored that, when executed by a processor, realizes the corresponding function. The computer-readable storage medium of this embodiment is used to store the fall detection device 10 and, when executed by a processor, implements the fall detection method of the first embodiment.
The sequence numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present application belong.
A person of ordinary skill in the art can understand that all or part of the steps carried out by the method of the above embodiment can be completed by instructing the relevant hardware through a program, and that the program can be stored in a computer-readable medium. When executed, the program includes one of, or a combination of, the steps of the method embodiment.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in conjunction with that embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or alternatively by hardware, although in many cases the former is the better implementation.
The above are only preferred embodiments of the present application and do not limit the scope of its patent. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included in the patent protection scope of this application.

Claims (27)

  1. A fall detection method based on bone points, characterized by comprising the following steps:
    training a first feature extraction neural network on first picture samples, the first feature extraction neural network being used to extract multiple first feature points representing key bone points on the human body;
    inputting a second video sample into the trained first feature extraction neural network to obtain multiple second feature points representing the key bone points of the human body in the second video sample;
    encoding the multiple second feature points to generate a predicted feature map;
    training a second behavior classification neural network on the predicted feature map, the second behavior classification neural network being used to classify the behavior represented in the predicted feature map; and
    inputting the video data of a monitored subject sequentially into the trained first feature extraction neural network and the trained second behavior classification neural network, and outputting the behavior category of the monitored subject.
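The two-stage pipeline in claim 1 (per-frame keypoint extraction, feature-map encoding, behavior classification) can be sketched end to end. This is an illustrative sketch only: `extract_keypoints`, `encode`, and `classify` are hypothetical stand-ins for the trained networks and the encoder, not components defined by this publication.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]

def detect_behavior(
    frames: Sequence,
    extract_keypoints: Callable[[object], List[Point]],
    encode: Callable[[List[List[Point]]], object],
    classify: Callable[[object], str],
) -> str:
    """Claim-1 flow: keypoints per frame -> predicted feature map -> label."""
    keypoints_per_frame = [extract_keypoints(f) for f in frames]
    feature_map = encode(keypoints_per_frame)
    return classify(feature_map)

# Toy stand-ins (illustrative only): one keypoint per frame, a trivial
# encoder measuring vertical drop, and a threshold "classifier".
frames = [0, 1, 2]
kps = {0: [(0.0, 2.0)], 1: [(0.0, 1.0)], 2: [(0.0, 0.1)]}
label = detect_behavior(
    frames,
    extract_keypoints=lambda f: kps[f],
    encode=lambda seq: seq[0][0][1] - seq[-1][0][1],  # drop = 2.0 - 0.1
    classify=lambda drop: "fall" if drop > 1.5 else "normal",
)
print(label)  # -> fall
```

In the real system the two lambdas would be replaced by forward passes through the trained first and second networks.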
  2. The fall detection method according to claim 1, wherein training the first feature extraction neural network on the first picture samples comprises:
    inputting the first picture samples into a Resnet residual network to obtain first extracted data;
    passing the first extracted data through multiple convolution modules with different dilation coefficients to obtain multiple sets of second extracted data with different feature channels;
    combining the multiple sets of second extracted data and feeding them into a first convolutional layer stacked from residual convolutions, to obtain multiple sets of third extracted data with different receptive fields;
    fusing the multiple sets of third extracted data, feeding the result into a second convolutional layer stacked from residual modules, and finally outputting multiple first feature points representing key bone points on the human body; and
    reverse-training the first feature extraction network through a first loss function.
  3. The fall detection method according to claim 1, wherein training the second behavior classification neural network on the predicted feature map comprises:
    passing the predicted feature map through a conventional convolution module to obtain first classification data;
    passing the first classification data through multiple convolution modules with different dilation coefficients to obtain multiple sets of second classification data with different feature channels; and
    combining the multiple sets of second classification data and passing them sequentially through three conventional convolution modules, finally outputting the behavior classification.
  4. The fall detection method according to claim 2, wherein the first loss function F is:
    Figure PCTCN2019089500-appb-100001
    where x_p and y_p denote the predicted coordinates of a first feature point extracted by the first feature extraction neural network, and x_g and y_g denote the actual coordinates of that first feature point.
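The equation for F is reproduced in this publication only as an image. A common keypoint loss consistent with the symbols defined here (predicted coordinates x_p, y_p against ground-truth x_g, y_g) is the mean Euclidean distance; the sketch below uses that form as an assumption, not as the claimed equation.

```python
import math

def keypoint_loss(pred, actual):
    """Mean Euclidean distance between predicted and ground-truth keypoints.

    Assumed form: the patent's actual figure for F is not reproduced here.
    """
    dists = [
        math.hypot(xp - xg, yp - yg)
        for (xp, yp), (xg, yg) in zip(pred, actual)
    ]
    return sum(dists) / len(dists)

pred = [(1.0, 2.0), (3.0, 4.0)]
actual = [(1.0, 2.0), (0.0, 0.0)]
result = keypoint_loss(pred, actual)
print(result)  # (0 + 5) / 2 = 2.5
```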
  5. The fall detection method according to claim 3, wherein the second loss function L is:
    Figure PCTCN2019089500-appb-100002
    where x_k denotes the parameter value of the k-th behavior category, z_k denotes the predicted probability of the k-th behavior category, and L2 denotes a regularization term that prevents overfitting.
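The equation for L is likewise only available as an image. From the description (per-class parameter value x_k, predicted probability z_k, and an L2 anti-overfitting term), it matches the standard softmax cross-entropy with L2 regularization; the sketch below implements that standard form under this assumption, with `l2_coeff` as an illustrative hyperparameter.

```python
import math

def classification_loss(logits, true_k, weights, l2_coeff=0.01):
    """Softmax cross-entropy plus an L2 weight penalty (assumed form)."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    z = [e / total for e in exps]          # z_k: predicted probabilities
    ce = -math.log(z[true_k])              # cross-entropy for the true class
    l2 = l2_coeff * sum(w * w for w in weights)
    return ce + l2

# Three behavior categories, true class 0, two illustrative weights.
loss = classification_loss([2.0, 1.0, 0.1], true_k=0, weights=[0.5, -0.5])
print(loss)
```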
  6. The fall detection method according to claim 2, wherein the convolution module is composed of the following layers connected in series: a convolutional layer, a batch normalization layer, a ReLU activation function layer, a convolutional layer, a batch normalization layer, a ReLU activation function layer, and a pooling layer.
  7. The fall detection method according to claim 3, wherein the convolution module is composed of the following layers connected in series: a convolutional layer, a batch normalization layer, a ReLU activation function layer, a convolutional layer, a batch normalization layer, a ReLU activation function layer, and a pooling layer.
  8. The fall detection method according to claim 1, wherein encoding the multiple second feature points to generate the predicted feature map comprises:
    pairing the multiple second feature points two by two;
    calculating the distance and velocity between every two second feature points:
    Figure PCTCN2019089500-appb-100003
    where x_it and y_it denote the horizontal and vertical coordinates of the i-th second feature point at time t; l_ijt denotes the Euclidean distance between the i-th and j-th second feature points at time t; v_xit denotes the velocity of the i-th second feature point in the x direction at time t; and v_yit denotes the velocity of the i-th second feature point in the y direction; and
    combining all the calculated distance and velocity data to form the predicted feature map.
  9. The fall detection method according to claim 8, wherein the step of encoding the multiple second feature points to generate the predicted feature map comprises:
    selecting, from the multiple second feature points, multiple target feature points associated with a preset behavior; and
    encoding the multiple target feature points to generate the predicted feature map.
  10. The fall detection method according to claim 1, further comprising, before the step of training the first feature extraction neural network on the first picture samples:
    preprocessing the first picture samples with data augmentation.
  11. A fall detection device based on bone points, characterized by comprising:
    a first neural network training module, adapted to train a first feature extraction neural network on first picture samples, the first feature extraction neural network being used to extract multiple first feature points from the first picture samples, the first feature points representing key bone points on the human body;
    a feature point extraction module, adapted to input a second video sample into the trained first feature extraction neural network to obtain multiple second feature points representing the key bone points of the human body in the second video sample;
    a feature map generation module, adapted to encode the multiple second feature points to generate a predicted feature map;
    a second neural network training module, adapted to train a second behavior classification neural network on the predicted feature map, the second behavior classification neural network being used to classify the behavior represented in the predicted feature map; and
    a classification module, adapted to input the video data of a monitored subject sequentially into the trained first feature extraction neural network and the trained second behavior classification neural network, to output the behavior category of the monitored subject.
  12. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer program:
    training a first feature extraction neural network on first picture samples, the first feature extraction neural network being used to extract multiple first feature points from the first picture samples, the first feature points representing key bone points on the human body;
    inputting a second video sample into the trained first feature extraction neural network to obtain multiple second feature points representing the key bone points of the human body in the second video sample;
    encoding the multiple second feature points to generate a predicted feature map;
    training a second behavior classification neural network on the predicted feature map, the second behavior classification neural network being used to classify the behavior represented in the predicted feature map; and
    inputting the video data of a monitored subject sequentially into the trained first feature extraction neural network and the trained second behavior classification neural network, to output the behavior category of the monitored subject.
  13. The computer device according to claim 12, wherein training the first feature extraction neural network on the first picture samples comprises:
    inputting the first picture samples into a Resnet residual network to obtain first extracted data;
    passing the first extracted data through multiple convolution modules with different dilation coefficients to obtain multiple sets of second extracted data with different feature channels;
    combining the multiple sets of second extracted data and feeding them into a first convolutional layer stacked from residual convolutions, to obtain multiple sets of third extracted data with different receptive fields;
    fusing the multiple sets of third extracted data, feeding the result into a second convolutional layer stacked from residual modules, and finally outputting multiple first feature points representing key bone points on the human body; and
    reverse-training the first feature extraction network through a first loss function.
  14. The computer device according to claim 12, wherein training the second behavior classification neural network on the predicted feature map comprises:
    passing the predicted feature map through a conventional convolution module to obtain first classification data;
    passing the first classification data through multiple convolution modules with different dilation coefficients to obtain multiple sets of second classification data with different feature channels; and
    combining the multiple sets of second classification data and passing them sequentially through three conventional convolution modules, finally outputting the behavior classification.
  15. The computer device according to claim 13, wherein the first loss function F is:
    Figure PCTCN2019089500-appb-100004
    where x_p and y_p denote the predicted coordinates of a first feature point extracted by the first feature extraction neural network, and x_g and y_g denote the actual coordinates of that first feature point.
  16. The computer device according to claim 13, wherein the second loss function L is:
    Figure PCTCN2019089500-appb-100005
    where x_k denotes the parameter value of the k-th behavior category, z_k denotes the predicted probability of the k-th behavior category, and L2 denotes a regularization term that prevents overfitting.
  17. The computer device according to claim 13, wherein the convolution module is composed of the following layers connected in series: a convolutional layer, a batch normalization layer, a ReLU activation function layer, a convolutional layer, a batch normalization layer, a ReLU activation function layer, and a pooling layer.
  18. The computer device according to claim 14, wherein the convolution module is composed of the following layers connected in series: a convolutional layer, a batch normalization layer, a ReLU activation function layer, a convolutional layer, a batch normalization layer, a ReLU activation function layer, and a pooling layer.
  19. The computer device according to claim 12, wherein encoding the multiple second feature points to generate the predicted feature map comprises:
    pairing the multiple second feature points two by two;
    calculating the distance and velocity between every two second feature points:
    Figure PCTCN2019089500-appb-100006
    where x_it and y_it denote the horizontal and vertical coordinates of the i-th second feature point at time t; l_ijt denotes the Euclidean distance between the i-th and j-th second feature points at time t; v_xit denotes the velocity of the i-th second feature point in the x direction at time t; and v_yit denotes the velocity of the i-th second feature point in the y direction; and
    combining all the calculated distance and velocity data to form the predicted feature map.
  20. A computer-readable storage medium on which a computer program is stored, characterized in that the following steps are realized when the computer program is executed by a processor:
    training a first feature extraction neural network on first picture samples, the first feature extraction neural network being used to extract multiple first feature points from the first picture samples, the first feature points representing key bone points on the human body;
    inputting a second video sample into the trained first feature extraction neural network to obtain multiple second feature points representing the key bone points of the human body in the second video sample;
    encoding the multiple second feature points to generate a predicted feature map;
    training a second behavior classification neural network on the predicted feature map, the second behavior classification neural network being used to classify the behavior represented in the predicted feature map; and
    inputting the video data of a monitored subject sequentially into the trained first feature extraction neural network and the trained second behavior classification neural network, to output the behavior category of the monitored subject.
  21. The computer-readable storage medium according to claim 20, wherein training the first feature extraction neural network on the first picture samples comprises:
    inputting the first picture samples into a Resnet residual network to obtain first extracted data;
    passing the first extracted data through multiple convolution modules with different dilation coefficients to obtain multiple sets of second extracted data with different feature channels;
    combining the multiple sets of second extracted data and feeding them into a first convolutional layer stacked from residual convolutions, to obtain multiple sets of third extracted data with different receptive fields;
    fusing the multiple sets of third extracted data, feeding the result into a second convolutional layer stacked from residual modules, and finally outputting multiple first feature points representing key bone points on the human body; and
    reverse-training the first feature extraction network through a first loss function.
  22. The computer-readable storage medium according to claim 20, wherein training the second behavior classification neural network on the predicted feature map comprises:
    passing the predicted feature map through a conventional convolution module to obtain first classification data;
    passing the first classification data through multiple convolution modules with different dilation coefficients to obtain multiple sets of second classification data with different feature channels; and
    combining the multiple sets of second classification data and passing them sequentially through three conventional convolution modules, finally outputting the behavior classification.
  23. The computer-readable storage medium according to claim 21, wherein the first loss function F is:
    Figure PCTCN2019089500-appb-100007
    where x_p and y_p denote the predicted coordinates of a first feature point extracted by the first feature extraction neural network, and x_g and y_g denote the actual coordinates of that first feature point.
  24. The computer-readable storage medium according to claim 22, wherein the second loss function L is:
    Figure PCTCN2019089500-appb-100008
    where x_k denotes the parameter value of the k-th behavior category, z_k denotes the predicted probability of the k-th behavior category, and L2 denotes a regularization term that prevents overfitting.
  25. The computer-readable storage medium according to claim 21, wherein the convolution module is composed of the following layers connected in series: a convolutional layer, a batch normalization layer, a ReLU activation function layer, a convolutional layer, a batch normalization layer, a ReLU activation function layer, and a pooling layer.
  26. The computer-readable storage medium according to claim 22, wherein the convolution module is composed of the following layers connected in series: a convolutional layer, a batch normalization layer, a ReLU activation function layer, a convolutional layer, a batch normalization layer, a ReLU activation function layer, and a pooling layer.
  27. The computer-readable storage medium according to claim 20, wherein encoding the multiple second feature points to generate the predicted feature map comprises:
    pairing the multiple second feature points two by two;
    calculating the distance and velocity between every two second feature points:
    Figure PCTCN2019089500-appb-100009
    Figure PCTCN2019089500-appb-100010
    where x_it and y_it denote the horizontal and vertical coordinates of the i-th second feature point at time t; l_ijt denotes the Euclidean distance between the i-th and j-th second feature points at time t; v_xit denotes the velocity of the i-th second feature point in the x direction at time t; and v_yit denotes the velocity of the i-th second feature point in the y direction; and
    combining all the calculated distance and velocity data to form the predicted feature map.
PCT/CN2019/089500 2018-11-28 2019-05-31 Bone point-based fall detection method and fall detection device therefor WO2020107847A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811433808.3 2018-11-28
CN201811433808.3A CN109492612A (en) 2018-11-28 2018-11-28 Fall detection method and its falling detection device based on skeleton point

Publications (1)

Publication Number Publication Date
WO2020107847A1 true WO2020107847A1 (en) 2020-06-04

Family

ID=65698053

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/089500 WO2020107847A1 (en) 2018-11-28 2019-05-31 Bone point-based fall detection method and fall detection device therefor

Country Status (2)

Country Link
CN (1) CN109492612A (en)
WO (1) WO2020107847A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492612A (en) * 2018-11-28 2019-03-19 平安科技(深圳)有限公司 Fall detection method and its falling detection device based on skeleton point
CN110378245B (en) * 2019-06-26 2023-07-21 平安科技(深圳)有限公司 Football match behavior recognition method and device based on deep learning and terminal equipment
CN110276332B (en) * 2019-06-28 2021-12-24 北京奇艺世纪科技有限公司 Video feature processing method and device
CN110532874B (en) * 2019-07-23 2022-11-11 深圳大学 Object attribute recognition model generation method, storage medium and electronic device
CN110633736A (en) * 2019-08-27 2019-12-31 电子科技大学 Human body falling detection method based on multi-source heterogeneous data fusion
SE1951443A1 (en) * 2019-12-12 2021-06-13 Assa Abloy Ab Improving machine learning for monitoring a person
CN111209848B (en) * 2020-01-03 2023-07-21 北京工业大学 Real-time falling detection method based on deep learning
CN113792595A (en) * 2021-08-10 2021-12-14 北京爱笔科技有限公司 Target behavior detection method and device, computer equipment and storage medium
CN113712538A (en) * 2021-08-30 2021-11-30 平安科技(深圳)有限公司 Fall detection method, device, equipment and storage medium based on WIFI signal
CN115661943B (en) * 2022-12-22 2023-03-31 电子科技大学 Fall detection method based on lightweight attitude assessment network
CN117238026B (en) * 2023-07-10 2024-03-08 中国矿业大学 Gesture reconstruction interactive behavior understanding method based on skeleton and image features

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220604A (en) * 2017-05-18 2017-09-29 清华大学深圳研究生院 A kind of fall detection method based on video
US20170316578A1 (en) * 2016-04-29 2017-11-02 Ecole Polytechnique Federale De Lausanne (Epfl) Method, System and Device for Direct Prediction of 3D Body Poses from Motion Compensated Sequence
CN107392131A (en) * 2017-07-14 2017-11-24 天津大学 A kind of action identification method based on skeleton nodal distance
CN107784654A (en) * 2016-08-26 2018-03-09 杭州海康威视数字技术股份有限公司 Image partition method, device and full convolutional network system
CN108647776A (en) * 2018-05-08 2018-10-12 济南浪潮高新科技投资发展有限公司 A kind of convolutional neural networks convolution expansion process circuit and method
CN109492612A (en) * 2018-11-28 2019-03-19 平安科技(深圳)有限公司 Fall detection method and its falling detection device based on skeleton point

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105590099B (en) * 2015-12-22 2019-02-01 中国石油大学(华东) A kind of more people's Activity recognition methods based on improvement convolutional neural networks
CN108294759A (en) * 2017-01-13 2018-07-20 天津工业大学 A kind of Driver Fatigue Detection based on CNN Eye state recognitions
CN108280455B (en) * 2018-01-19 2021-04-02 北京市商汤科技开发有限公司 Human body key point detection method and apparatus, electronic device, program, and medium
CN108717569B (en) * 2018-05-16 2022-03-22 中国人民解放军陆军工程大学 Expansion full-convolution neural network device and construction method thereof
CN108776775B (en) * 2018-05-24 2020-10-27 常州大学 Old people indoor falling detection method based on weight fusion depth and skeletal features

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860312A (en) * 2020-07-20 2020-10-30 上海汽车集团股份有限公司 Driving environment adjusting method and device
CN112364695A (en) * 2020-10-13 2021-02-12 杭州城市大数据运营有限公司 Behavior prediction method and device, computer equipment and storage medium
CN112541576A (en) * 2020-12-14 2021-03-23 四川翼飞视科技有限公司 Biological living body recognition neural network of RGB monocular image and construction method thereof
CN112541576B (en) * 2020-12-14 2024-02-20 四川翼飞视科技有限公司 Biological living body identification neural network construction method of RGB monocular image
CN114882596A (en) * 2022-07-08 2022-08-09 深圳市信润富联数字科技有限公司 Behavior early warning method and device, electronic equipment and storage medium
CN115546491A (en) * 2022-11-28 2022-12-30 中南财经政法大学 Fall alarm method, system, electronic equipment and storage medium
CN115546491B (en) * 2022-11-28 2023-03-10 中南财经政法大学 Fall alarm method, system, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN109492612A (en) 2019-03-19

Similar Documents

Publication Publication Date Title
WO2020107847A1 (en) Bone point-based fall detection method and fall detection device therefor
Yang et al. Uncertainty-guided transformer reasoning for camouflaged object detection
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
Wang et al. A deep network solution for attention and aesthetics aware photo cropping
WO2021129064A1 (en) Posture acquisition method and device, and key point coordinate positioning model training method and device
CN108205655B (en) Key point prediction method and device, electronic equipment and storage medium
CN110728209B (en) Gesture recognition method and device, electronic equipment and storage medium
WO2021227726A1 (en) Methods and apparatuses for training face detection and image detection neural networks, and device
WO2021017303A1 (en) Person re-identification method and apparatus, computer device and storage medium
WO2021218238A1 (en) Image processing method and image processing apparatus
WO2021051547A1 (en) Violent behavior detection method and system
CN111368672A (en) Construction method and device for genetic disease facial recognition model
WO2021238548A1 (en) Region recognition method, apparatus and device, and readable storage medium
CN112766186B (en) Real-time face detection and head posture estimation method based on multitask learning
CN110738650B (en) Infectious disease infection identification method, terminal device and storage medium
CN111754453A (en) Pulmonary tuberculosis detection method and system based on chest radiography image and storage medium
WO2022111387A1 (en) Data processing method and related apparatus
CN111738074B (en) Pedestrian attribute identification method, system and device based on weak supervision learning
CN116897012A (en) Method and device for identifying physique of traditional Chinese medicine, electronic equipment, storage medium and program
CN111340213B (en) Neural network training method, electronic device, and storage medium
Zhou et al. A study on attention-based LSTM for abnormal behavior recognition with variable pooling
Gupta et al. Advanced security system in video surveillance for COVID-19
CN114724251A (en) Old people behavior identification method based on skeleton sequence under infrared video
CN112381118B (en) College dance examination evaluation method and device
CN117037244A (en) Face security detection method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19889552

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19889552

Country of ref document: EP

Kind code of ref document: A1