CN116092179A - Improved Yolox fall detection system - Google Patents

Improved Yolox fall detection system

Info

Publication number
CN116092179A
CN116092179A (application CN202211500031.4A)
Authority
CN
China
Prior art keywords
feature
yolox
attention
simam
improved
Prior art date
Legal status
Pending
Application number
CN202211500031.4A
Other languages
Chinese (zh)
Inventor
周蕾
钟海莲
陈冠宇
马冰娅
殷哲文
董妍妍
Current Assignee
Huaiyin Institute of Technology
Original Assignee
Huaiyin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN202211500031.4A
Publication of CN116092179A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

An improved Yolox fall detection system comprises a backbone network, a feature fusion module, a prediction head, a simAM attention module introduced into the backbone network, and an ECA channel attention module added in the feature fusion module; the loss function of the system adopts the EIoU loss function, so that the difference between the prediction frame and the real frame can be calculated. The specific steps of detecting a picture are as follows: the input picture is uniformly scaled as preprocessing, passed through a backbone network built from basic convolutions and CSPLayers with a residual structure and a simAM attention mechanism, feature information is then further extracted by a feature fusion module with an ECA attention mechanism, and finally the detection result is output by the YOLOHead and displayed on the detected picture. Through the improved algorithm, the invention obviously improves mAP, recall rate, precision, F1 and the detection effect on targets in complex environments; complete and occluded fall targets can be effectively detected under blurred background and dim light conditions, and the precision of target detection is improved.

Description

Improved Yolox fall detection system
Technical Field
The invention relates to the technical field of image processing, in particular to an improved Yolox fall detection system.
Background
In recent years, machine learning and deep learning algorithms have been widely used in the fall detection field. A currently common fall detection architecture is the convolutional neural network (CNN). A CNN is a feedforward neural network with convolution operations and a deep structure; its artificial neurons respond to surrounding units within a local receptive field, giving it excellent performance for large-scale image processing, and it is one of the representative algorithms of deep learning. Deep-learning-based target detection algorithms are divided into two-stage and one-stage detectors. A two-stage detector generates region candidate boxes in the first stage and classifies and regresses the content of those candidate regions in the second stage; representative algorithms include R-CNN, SPP-Net, Fast R-CNN and the like. A one-stage detector does not generate candidate regions but treats target detection as a regression task over the whole image; representative algorithms include the YOLO series and the like. Among current convolutional-neural-network-based methods, the YOLO series performs well for object detection and is used for fall detection by many researchers.
In order to reduce injuries caused by falls, many researchers at home and abroad have combined information technology research hotspots to develop various fall detection systems. Commonly used fall detection systems include remote systems based on smart homes, systems based on signals such as video images and acoustics, and various designs based on sensors. However, these systems have many external limiting factors that hinder their popularization in practical work, and their detection accuracy in complex scenes is not high enough; in some cases the target cannot be detected at all.
Disclosure of Invention
In order to solve the problems of missed detections and low detection precision of existing pedestrian fall detection algorithms in complex scenes, the invention provides an improved YOLOX fall detection system: a simAM attention module is introduced into the backbone network, an ECA channel attention module is added to the Bottleneck and the feature fusion module to further extract key information of the feature layers, and EIoU is adopted as the loss function, so that the difference between the prediction frame and the real frame can be calculated more effectively and the precision of the model is improved; the above technical problems can thereby be effectively solved.
The invention is realized by the following technical scheme:
An improved Yolox fall detection system comprises a backbone network, a feature fusion module, a prediction head and a simAM attention module introduced into the backbone network; an ECA channel attention module is added in the feature fusion module to further extract key information of the feature layers, and EIoU is adopted as the loss function, so that the difference between the prediction frame and the real frame can be calculated more effectively and the accuracy of the model is improved. The specific steps of detecting a picture are as follows: step one: data acquisition: falling videos from cameras in public places are collected, different pictures of each frame are taken as the picture data set, and the information is annotated with labelimg software; the real frames are divided into 5 categories, namely stand, fall, sit, squat and run, and a training set, a verification set and a test set are divided;
step two: text preprocessing: preprocessing the collected picture data set, uniformly putting picture path information of the training set and the verification set into a TXT document, and formulating correct reading path text information and category information so as to facilitate model reading;
step three: building a training model: a YOLOX fall detection system is constructed with a CSPDarknet backbone network having a residual structure and a simAM attention mechanism; the uniformly scaled pictures are taken as network input, the output of the backbone network is put into a feature fusion module with an ECA attention mechanism to further extract feature information, and finally the detection result is output through the YOLOHead;
step four: model training: dividing the processed data set into a training set, a verification set and a test set, and training by using the constructed model to obtain an optimal weight model;
step five: and (3) model detection: inputting the pictures or the monitoring video into the trained model for detection, marking and positioning the detected falling target, and sending out alarm information.
Further, in the fall detection system of step three, built on a CSPDarknet backbone network with a residual structure and a simAM attention mechanism, the simAM attention mechanism is introduced into the non-residual CSP parts of the backbone network, namely the stacked Bottleneck parts and the final output part, so that the network structure can be deepened and deeper feature information can be extracted;
the simAM attention mechanism is a parameter-free attention module, starting from neuroscience theory, an energy function is constructed to mine the importance of neurons, so that each neuron is assigned with a unique weight, and an analytic solution is derived for accelerating calculation, and the final energy function is as follows:
e_t(w_t, b_t, y, x_i) = \frac{1}{M-1}\sum_{i=1}^{M-1}\left(-1-(w_t x_i + b_t)\right)^2 + \left(1-(w_t t + b_t)\right)^2 + \lambda w_t^2

wherein,

w_t = -\frac{2(t-\mu_t)}{(t-\mu_t)^2 + 2\sigma_t^2 + 2\lambda}

b_t = -\frac{1}{2}(t+\mu_t)w_t

\mu_t = \frac{1}{M-1}\sum_{i=1}^{M-1} x_i

\sigma_t^2 = \frac{1}{M-1}\sum_{i=1}^{M-1}(x_i-\mu_t)^2

In the above, t and x_i are the target neuron and the other neurons in a single channel of the input feature X \in \mathbb{R}^{C\times H\times W}; i is the index over the spatial dimension and M = H\times W is the number of neurons on the channel; w_t and b_t are the weight and bias of the transformation; \mu_t and \sigma_t^2 are the mean and variance of all neurons except t.

Assuming that all pixels in a single channel follow the same distribution, the mean and variance can be computed once over all neurons of the channel and reused for every neuron on that channel, so the minimum energy can be found by the following formula:

e_t^* = \frac{4(\hat{\sigma}^2+\lambda)}{(t-\hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}

The above formula means that the lower the energy, the more the neuron t is distinguished from the surrounding neurons and the higher its importance; thus, the importance of each neuron can be obtained as 1/e_t^*. The feature is then enhanced with a Sigmoid scaling operator, according to the formula:

\tilde{X} = \mathrm{Sigmoid}\left(\frac{1}{E}\right)\odot X

where E groups all the energies e_t^* across the channel and spatial dimensions; the Sigmoid function restricts values of E that are too large and does not affect the relative importance of each neuron.
Furthermore, the stacked Bottleneck parts introduce the simAM attention mechanism, that is, the input feature map is first passed through two convolutions, the simAM then extracts three-dimensional attention weights and maps them onto the feature layer, the feature layers of the two branches are added, and finally channel information is extracted by ECA attention and mapped onto the feature layer.
Further, the final output part introduces the simAM attention mechanism by dividing the input feature map into two branches and convolving each branch; the left branch is convolved and then the two branches are spliced into one, the spliced feature layer is convolved, and finally three-dimensional weight information is extracted by the simAM and mapped onto the finally output feature map.
Further, each branch is convolved respectively, and the convolution process is as follows: the input feature layer first undergoes a basic convolution; three branches are pooled with 5×5, 9×9 and 13×13 average pooling respectively, together with an identity branch; the results of the four branches are added and then convolved, three-dimensional weight information is extracted by simAM attention, and the important information is further extracted and mapped onto the output feature layer.
Furthermore, in the feature fusion module with the ECA attention mechanism in step three, a simAM mechanism is introduced after the feature fusion operations that follow up-sampling and down-sampling, and the ECA attention mechanism is introduced after the three output feature layers at the output end of the feature fusion module.
Furthermore, the ECA attention mechanism learns effective channel attention with low model complexity: the module generates channel attention through a fast 1-dimensional convolution whose kernel size is determined adaptively through a nonlinear mapping of the channel dimension; compared with other attention mechanisms, ECA avoids dimensionality reduction, efficiently realizes local cross-channel interaction with a 1-dimensional convolution, and extracts the dependency relations among channels.
Further, in the third step, the output of the backbone network is put into a feature fusion module with an ECA attention mechanism, so as to further extract feature information; the specific operation steps are as follows: firstly, carrying out global average pooling operation on an input feature map; and then carrying out 1-dimensional convolution operation with the convolution kernel of k, and obtaining the weight w of each channel through a Sigmoid activation function, wherein the formula is as follows:
w = Sigmoid(C1D_k(X))    (8);
wherein C1D represents one-dimensional convolution, k is the convolution kernel size, and X is the input feature map;
and finally multiplying the weight with the corresponding element of the original input feature map to obtain a final output feature map.
Further, the EIoU loss function includes three parts: overlap loss, center-distance loss and width-height loss, the first two continuing the method in CIoU; EIoU adds a loss term that directly penalizes the predicted width and height, directly minimizing the difference in width and height between the real frame and the prediction frame, so the convergence speed is faster; the specific formula is as follows:
L_{EIoU} = L_{IoU} + L_{dis} + L_{asp} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \frac{\rho^2(w, w^{gt})}{C_w^2} + \frac{\rho^2(h, h^{gt})}{C_h^2}

wherein C_w and C_h are the width and height of the smallest enclosing box covering the two frames; \rho denotes the Euclidean distance between two points; b and b^{gt} denote the center points of the prediction frame and the real frame respectively; w and w^{gt} denote the widths of the prediction frame and the real frame respectively; h and h^{gt} denote the heights of the prediction frame and the real frame respectively; c denotes the diagonal distance of the minimum closure region that can contain both the prediction frame and the real frame.
Further, the specific operation of training with the constructed model in step four is as follows: mosaic data enhancement is applied to the pictures of the training set, that is, one picture is read, three other pictures are read at random, the four pictures are stitched together in a 2×2 grid (the Chinese character '田' layout), and the label information of the stitched mosaic picture is adjusted, which further enhances small-target detection performance; the pictures are uniformly scaled to 640×640×3, with the remainder filled by padding to avoid image distortion.
Advantageous effects
Compared with the prior art, the improved Yolox falling detection system provided by the invention has the following beneficial effects:
(1) According to the technical scheme, the simAM attention module is introduced into the backbone network, the ECA channel attention module is added to the Bottleneck and the feature fusion module, so that key information of a feature layer is further extracted, the EIoU is adopted as a loss function, the difference between a prediction frame and a real frame can be calculated more effectively, and the accuracy of a model is improved.
(2) In the technical scheme, when the model is trained, mosaic data enhancement is applied to the pictures of the training set: one picture is read, three other pictures are read at random, the four pictures are stitched together in a 2×2 grid (the Chinese character '田' layout), and the label information of the stitched mosaic picture is adjusted accordingly, which further enhances small-target detection performance; the pictures are uniformly scaled to 640×640×3, with the remainder filled by padding to avoid image distortion.
Drawings
Fig. 1 is a diagram of a model architecture of the improved YOLOX target detection method of the present invention.
FIG. 2 is a flow chart of an improved Yolox target detection method of the present invention.
Fig. 3 is a diagram of an improved backbone network architecture in the present invention.
Fig. 4 is a diagram of an improved feature fusion module architecture in accordance with the present invention.
FIG. 5 is a graph of Loss curves for experiments of the present invention using different Loss functions.
Fig. 6 is a graph of the detection result of a falling picture under a complex background in the experiment of the present invention.
FIG. 7 is a graph showing the comparison of the detection effect under the weak light condition in the experiment of the present invention.
FIG. 8 is a graph showing the comparison of the detection effect on the shielding target in the experiment of the present invention.
FIG. 9 is a graph showing the comparison of the detection effect on small targets in the experiment of the present invention.
FIG. 10 is a graph showing the comparison of the effect of detecting a part of targets in the experiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the examples described are only some, but not all examples of the invention.
Example 1:
an improved YOLOX fall detection system is an improvement to a YOLOX network architecture, wherein the YOLOX network architecture comprises a backbone network, a feature fusion module, and a predictive head.
Backbone network: the YOLOX backbone network is CSPDarknet53. A 640×640×3 picture is input; the Focus network structure takes a value at every other pixel to obtain four independent feature layers, which are stacked into a 320×320×12 feature layer. After convolution, normalization and the SiLU activation function, a 320×320×64 feature layer is obtained, which then passes through four more ResBlocks, each consisting of a base convolution and a CSPLayer. In the last ResBlock, an SPP structure is used to increase the receptive field of the network.
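As an illustration of the Focus slicing and stacking described above, a minimal PyTorch sketch is given below; the module and parameter names are illustrative and not taken from the patent, while the channel counts (3 to 12 to 64) follow the description.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Take a value at every other pixel to form four slices, stack them on the
    channel axis (3 -> 12 channels), then conv + BN + SiLU (12 -> 64 channels)."""
    def __init__(self, in_channels=3, out_channels=64, kernel_size=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4 * in_channels, out_channels, kernel_size,
                      stride=1, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(),
        )

    def forward(self, x):
        # (B, 3, 640, 640) -> (B, 12, 320, 320) -> (B, 64, 320, 320)
        patches = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                             x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(patches)

# Focus()(torch.randn(1, 3, 640, 640)).shape  # torch.Size([1, 64, 320, 320])
```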
Enhanced feature fusion part: the feature fusion part of YOLOX adopts the PAN algorithm. The three feature layers finally extracted by the backbone network are 80×80×256, 40×40×512 and 20×20×1024 respectively. The deep feature layers are up-sampled and fused with the shallow feature layers; the fused shallow feature layers are then down-sampled and fused with the deep feature layers to obtain richer feature information, and finally three feature layers are output.
Prediction head part: unlike previous YOLO versions (which perform classification and regression in one convolution), the prediction head of YOLOX is a simple decoupled head that separates classification and regression: the input feature layer first passes through a 1×1 convolution to reduce the channel dimension, then through two parallel branches, each containing two 3×3 convolutions, and finally the classification result and the regression result are output separately.
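A minimal sketch of such a decoupled head for one feature layer might look as follows; the hidden channel width and the use of the five fall-behavior classes are illustrative assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Decoupled YOLOX-style head: a 1x1 conv reduces channels, then two
    parallel stacks of two 3x3 convs output classification and regression."""
    def __init__(self, in_channels, num_classes=5, hidden=256):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_channels, hidden, 1), nn.SiLU())

        def branch():
            return nn.Sequential(
                nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
                nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
            )

        self.cls_branch, self.reg_branch = branch(), branch()
        self.cls_pred = nn.Conv2d(hidden, num_classes, 1)  # stand/fall/sit/squat/run scores
        self.reg_pred = nn.Conv2d(hidden, 4, 1)            # box regression (x, y, w, h)

    def forward(self, x):
        x = self.stem(x)
        return self.cls_pred(self.cls_branch(x)), self.reg_pred(self.reg_branch(x))

# cls, reg = DecoupledHead(256)(torch.randn(1, 256, 80, 80))  # (1,5,80,80), (1,4,80,80)
```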
As shown in fig. 1, the improved YOLOX fall detection system comprises a backbone network, a feature fusion module, a prediction head, and a simAM attention module introduced in the backbone network, wherein an ECA channel attention module is added in the feature fusion module to further extract key information of a feature layer; the loss function adopts EIoU, so that the difference between the prediction frame and the real frame can be calculated more effectively, and the precision of the model is improved.
As shown in fig. 2, the specific steps of the improved YOLOX fall detection system for detecting pictures are as follows: step one: collecting data;
collecting falling videos of cameras in public places, taking different pictures of each frame as picture data sets, and marking information by labelimg software, wherein the types of real frames are divided into 5 types, namely: stand, fall, sit, squat, run, and the training set, validation set and test set are partitioned.
The videos mainly come from surveillance of various public places and traffic roads, and more than 3000 were collected. In the experiment, pictures of the five postures of falling, standing, sitting, running and squatting can also be collected from the Internet to serve as part of the data set.
Step two: preprocessing a text;
preprocessing the collected picture data set, uniformly putting the picture path information of the training set and the verification set into the TXT document, and formulating correct reading path text information and category information so as to facilitate model reading.
When the picture is marked, a frame suitable for the target size is selected, and the frame cannot be too large or too small. The data distribution consistency is kept as much as possible by dividing the training set and the verification set, for example, the proportion of positive and negative samples of the training set is 2:1, and the verification set is also kept 2:1.
Step three: constructing a training model;
and constructing a fall detection system of the YolOX of the CSPDarknet backbone network with a residual structure and a simaM attention mechanism, taking the uniformly scaled picture as the input of the network, putting the output of the backbone network into a feature fusion module with an ECA attention mechanism, further extracting feature information, and finally outputting a detection result through the YolOHEAD.
The construction of the fall detection system of YOLOX comprises the following steps:
the first step: introducing a simAM attention mechanism into a CSPDarknet backbone network to obtain an improved backbone network architecture; the improved backbone network architecture is that a simAM attention mechanism is introduced into a CSP non-residual part of the backbone network, namely a plurality of botleneck superposition parts and a final output part, so that a network structure can be deepened, and deep characteristic information can be further extracted.
The simAM attention mechanism is a parameter-free attention module, starting from neuroscience theory, an energy function is constructed to mine the importance of neurons, so that each neuron is assigned with a unique weight, and an analytic solution is derived for accelerating calculation, and the final energy function is as follows:
e_t(w_t, b_t, y, x_i) = \frac{1}{M-1}\sum_{i=1}^{M-1}\left(-1-(w_t x_i + b_t)\right)^2 + \left(1-(w_t t + b_t)\right)^2 + \lambda w_t^2

wherein,

w_t = -\frac{2(t-\mu_t)}{(t-\mu_t)^2 + 2\sigma_t^2 + 2\lambda}

b_t = -\frac{1}{2}(t+\mu_t)w_t

\mu_t = \frac{1}{M-1}\sum_{i=1}^{M-1} x_i

\sigma_t^2 = \frac{1}{M-1}\sum_{i=1}^{M-1}(x_i-\mu_t)^2

In the above, t and x_i are the target neuron and the other neurons in a single channel of the input feature X \in \mathbb{R}^{C\times H\times W}; i is the index over the spatial dimension and M = H\times W is the number of neurons on the channel; w_t and b_t are the weight and bias of the transformation; \mu_t and \sigma_t^2 are the mean and variance of all neurons except t.

Assuming that all pixels in a single channel follow the same distribution, the mean and variance can be computed once over all neurons of the channel and reused for every neuron on that channel, so the minimum energy can be found by the following formula:

e_t^* = \frac{4(\hat{\sigma}^2+\lambda)}{(t-\hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}

The above formula means that the lower the energy, the more the neuron t is distinguished from the surrounding neurons and the higher its importance; thus, the importance of each neuron can be obtained as 1/e_t^*. The feature is then enhanced with a Sigmoid scaling operator, according to the formula:

\tilde{X} = \mathrm{Sigmoid}\left(\frac{1}{E}\right)\odot X

where E groups all the energies e_t^* across the channel and spatial dimensions; the Sigmoid function restricts values of E that are too large and does not affect the relative importance of each neuron.
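A compact, parameter-free PyTorch rendering of this simAM weighting might look like the sketch below; the regularization constant lambda is an assumed hyper-parameter (commonly 1e-4), and the class name is illustrative.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Compute a quantity proportional to 1/e_t* per position from the
    closed-form minimal energy, then rescale the feature map with Sigmoid."""
    def __init__(self, e_lambda=1e-4):
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w - 1                                        # M - 1 neurons besides t
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)    # (t - mu)^2 at each position
        v = d.sum(dim=(2, 3), keepdim=True) / n              # channel variance estimate
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5          # importance term
        return x * torch.sigmoid(e_inv)

# y = SimAM()(torch.randn(2, 64, 40, 40))  # output has the same shape as the input
```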
The attention mechanism is applied to CSPlayer and SPP of backbone network to further extract key information in the network, thereby achieving the purpose of improving detection accuracy, and the specific improvement is shown in FIG. 3.
Fig. 3 (a) shows a Bottleneck in the CSPLayer; the stacked Bottleneck parts introduce the simAM attention mechanism: the input feature map is first passed through two convolutions, the simAM then extracts three-dimensional attention weights and maps them onto the feature layer, the feature layers of the two branches are added, and finally channel information is extracted by ECA attention and mapped onto the feature layer.
Fig. 3 (b) shows the CSPLayer; in its final output part the input feature map is divided into two branches and each branch is convolved; the left branch is convolved and then passes through n Bottlenecks, the two branches are then spliced into one, the spliced feature layer is convolved, and finally three-dimensional weight information is extracted by the simAM and mapped onto the finally output feature map.
Fig. 3 (c) shows the SPP module, which is used to increase the receptive field of the network. Each branch is convolved respectively, and the process is as follows: the input feature layer first undergoes a basic convolution; three branches are pooled with 5×5, 9×9 and 13×13 average pooling respectively, together with an identity branch; the results of the four branches are added and then convolved, three-dimensional weight information is extracted by simAM attention, and the important information is further extracted and mapped onto the output feature layer.
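A sketch of this SPP variant is given below. It assumes stride-1 pooling with kernels 5, 9 and 13, padded so that all branches keep the same spatial size and can be added, and it repeats a compact SimAM for self-containment; whether average or max pooling is used and the exact channel widths are assumptions rather than details taken from the patent figure.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):                     # compact repeat of the earlier sketch
    def __init__(self, e_lambda=1e-4):
        super().__init__()
        self.e_lambda = e_lambda
    def forward(self, x):
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        return x * torch.sigmoid(d / (4 * (d.sum(dim=(2, 3), keepdim=True) / n + self.e_lambda)) + 0.5)

def conv_bn_silu(c_in, c_out, k=1):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class SPPSimAM(nn.Module):
    """Base conv, three stride-1 pooling branches (5/9/13) plus an identity branch;
    the four results are added, convolved, and re-weighted by SimAM."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.stem = conv_bn_silu(c_in, c_out // 2)
        self.pools = nn.ModuleList(
            [nn.AvgPool2d(kernel_size=k, stride=1, padding=k // 2) for k in (5, 9, 13)])
        self.fuse = conv_bn_silu(c_out // 2, c_out)
        self.simam = SimAM()

    def forward(self, x):
        x = self.stem(x)
        x = x + sum(pool(x) for pool in self.pools)   # identity + three pooled branches
        return self.simam(self.fuse(x))

# y = SPPSimAM(1024, 1024)(torch.randn(1, 1024, 20, 20))  # (1, 1024, 20, 20)
```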
The second step: a simAM attention module is inserted in the middle of the feature fusion module to extract three-dimensional weight information; its flexibility and effectiveness improve the expressive capacity of the stacked convolutions. An ECA attention module is added at the end to further extract the channel information of the feature layers; the idea and operation of the ECA attention mechanism are simple and convenient, and its influence on network processing speed is minimal, as shown in fig. 4.
The ECA attention mechanism learns effective channel attention with low model complexity: the module generates channel attention through a fast 1-dimensional convolution whose kernel size is determined adaptively through a nonlinear mapping of the channel dimension; compared with other attention mechanisms, ECA avoids dimensionality reduction, efficiently realizes local cross-channel interaction with a 1-dimensional convolution, and extracts the dependency relations among channels.
In the feature fusion module with the ECA attention mechanism, a simAM mechanism is introduced after the feature fusion operations that follow up-sampling and down-sampling, and the ECA attention mechanism is introduced after the three output feature layers at the output end of the feature fusion module.
In fig. 4, the backbone network outputs three feature layers, feat3, feat2 and feat1 from bottom to top, which serve as the input layers of the feature fusion module.
The feat3 = (20, 20, 1024) feature layer is adjusted by one 1×1 convolution to obtain P5; P5 is up-sampled (UpSampling) and combined with the feat2 = (40, 40, 512) feature layer, and feature extraction is then performed with CSPLayer_simAM to obtain P5_upsample, a (40, 40, 512) feature layer.
P5_upsample = (40, 40, 512) is adjusted by one 1×1 convolution of the channels to obtain P4; P4 is up-sampled and combined with the feat1 = (80, 80, 256) feature layer, and feature extraction is then performed with CSPLayer_simAM to obtain P3_out, an (80, 80, 256) feature layer.
The P3_out = (80, 80, 256) feature layer is down-sampled by one 3×3 convolution, stacked with P4, and then feature extraction is performed with CSPLayer_simAM to obtain P4_out, a (40, 40, 512) feature layer.
The P4_out = (40, 40, 512) feature layer is down-sampled by one 3×3 convolution, stacked with P5, and then feature extraction is performed with CSPLayer to obtain P5_out, a (20, 20, 1024) feature layer.
Finally, the outputs P3_out, P4_out and P5_out each pass through ECA attention to extract channel attention information and obtain better features.
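The shape flow of this improved neck can be sketched as follows. The CSPLayer_simAM blocks and the final ECA modules are stood in for by plain convolutions purely to check the tensor shapes, so this is a structural sketch under those assumptions rather than the actual fusion blocks.

```python
import torch
import torch.nn as nn

def conv(c_in, c_out, k=1, s=1):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class PanNeckSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.reduce5 = conv(1024, 512, 1)     # feat3 -> P5
        self.csp_p5_up = conv(1024, 512, 3)   # stands in for CSPLayer_simAM
        self.reduce4 = conv(512, 256, 1)      # P5_upsample -> P4
        self.csp_p3 = conv(512, 256, 3)       # stands in for CSPLayer_simAM
        self.down3 = conv(256, 256, 3, s=2)   # 3x3 stride-2 down-sampling
        self.csp_p4 = conv(512, 512, 3)       # stands in for CSPLayer_simAM
        self.down4 = conv(512, 512, 3, s=2)
        self.csp_p5 = conv(1024, 1024, 3)     # stands in for CSPLayer
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, feat1, feat2, feat3):   # feat1: 80x80x256, feat2: 40x40x512, feat3: 20x20x1024
        p5 = self.reduce5(feat3)                                      # 20x20x512
        p5_up = self.csp_p5_up(torch.cat([self.up(p5), feat2], 1))    # 40x40x512
        p4 = self.reduce4(p5_up)                                      # 40x40x256
        p3_out = self.csp_p3(torch.cat([self.up(p4), feat1], 1))      # 80x80x256
        p4_out = self.csp_p4(torch.cat([self.down3(p3_out), p4], 1))  # 40x40x512
        p5_out = self.csp_p5(torch.cat([self.down4(p4_out), p5], 1))  # 20x20x1024
        return p3_out, p4_out, p5_out          # each would then pass through ECA

feats = (torch.randn(1, 256, 80, 80), torch.randn(1, 512, 40, 40), torch.randn(1, 1024, 20, 20))
outs = PanNeckSketch()(*feats)   # shapes (1,256,80,80), (1,512,40,40), (1,1024,20,20)
```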
Putting the output of the backbone network into a feature fusion module with an ECA attention mechanism, and further extracting feature information; the specific operation steps are as follows: firstly, carrying out global average pooling operation on an input feature map; and then carrying out 1-dimensional convolution operation with the convolution kernel of k, and obtaining the weight w of each channel through a Sigmoid activation function, wherein the formula is as follows:
w = Sigmoid(C1D_k(X))    (8);
wherein C1D represents one-dimensional convolution, k is the convolution kernel size, and X is the input feature map;
and finally multiplying the weight with the corresponding element of the original input feature map to obtain a final output feature map.
The EIoU loss function comprises three parts: overlap loss, center-distance loss and width-height loss, the first two continuing the method in CIoU; EIoU adds a loss term that directly penalizes the predicted width and height, directly minimizing the difference in width and height between the real frame and the prediction frame, so the convergence speed is faster; the specific formula is as follows:
L_{EIoU} = L_{IoU} + L_{dis} + L_{asp} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \frac{\rho^2(w, w^{gt})}{C_w^2} + \frac{\rho^2(h, h^{gt})}{C_h^2}

wherein C_w and C_h are the width and height of the smallest enclosing box covering the two frames; \rho denotes the Euclidean distance between two points; b and b^{gt} denote the center points of the prediction frame and the real frame respectively; w and w^{gt} denote the widths of the prediction frame and the real frame respectively; h and h^{gt} denote the heights of the prediction frame and the real frame respectively; c denotes the diagonal distance of the minimum closure region that can contain both the prediction frame and the real frame.
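A minimal implementation of this EIoU loss for axis-aligned boxes in (x1, y1, x2, y2) format might look like the following sketch; the epsilon terms are numerical-stability assumptions.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """EIoU = 1 - IoU + center-distance term + width term + height term.
    pred, target: (N, 4) tensors of boxes as (x1, y1, x2, y2)."""
    # widths, heights and centers of both boxes
    pw, ph = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    tw, th = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    tcx, tcy = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2

    # intersection and union -> IoU
    iw = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    ih = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = iw * ih
    union = pw * ph + tw * th - inter + eps
    iou = inter / union

    # smallest enclosing box: width Cw, height Ch, squared diagonal c^2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    rho2_center = (pcx - tcx) ** 2 + (pcy - tcy) ** 2     # squared center distance
    loss = (1 - iou
            + rho2_center / c2
            + (pw - tw) ** 2 / (cw ** 2 + eps)
            + (ph - th) ** 2 / (ch ** 2 + eps))
    return loss.mean()
```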
Step four: training a model; dividing the processed data set into a training set, a verification set and a test set, and training by using the constructed model to obtain an optimal weight model.
The specific operation of training with the constructed model is as follows: mosaic data enhancement is applied to the pictures of the training set, that is, one picture is read, three other pictures are read at random, the four pictures are stitched together in a 2×2 grid (the Chinese character '田' layout), and the label information of the stitched mosaic picture is adjusted, which further enhances small-target detection performance; the pictures are uniformly scaled to 640×640×3, with the remainder filled by padding to avoid image distortion.
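A simplified sketch of such 2×2 mosaic stitching is shown below; each tile is resized to half the output size instead of using random scales and crops, which is a simplification, the pixel box format (x1, y1, x2, y2) is an assumption, and OpenCV is used only for resizing.

```python
import cv2
import numpy as np

def mosaic(images, boxes_list, out_size=640):
    """Stitch 4 images into a 2x2 grid and shift/scale their boxes accordingly.
    images: list of 4 HxWx3 uint8 arrays; boxes_list: list of 4 (N_i, 4) arrays
    of pixel boxes (x1, y1, x2, y2)."""
    half = out_size // 2
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # gray padding
    offsets = [(0, 0), (half, 0), (0, half), (half, half)]          # top-left of each tile
    out_boxes = []
    for img, boxes, (ox, oy) in zip(images, boxes_list, offsets):
        h, w = img.shape[:2]
        canvas[oy:oy + half, ox:ox + half] = cv2.resize(img, (half, half))
        if len(boxes):
            b = boxes.astype(np.float32)
            b[:, [0, 2]] = b[:, [0, 2]] * (half / w) + ox   # rescale and shift x coords
            b[:, [1, 3]] = b[:, [1, 3]] * (half / h) + oy   # rescale and shift y coords
            out_boxes.append(b)
    return canvas, (np.concatenate(out_boxes) if out_boxes else np.zeros((0, 4), np.float32))
```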
Step five: detecting a model;
inputting pictures or monitoring videos into a trained model for detection, marking and positioning the detected falling target, drawing a prediction frame on an original picture by using matplotlib according to the predicted two coordinate point information, displaying the behavior of the target by using characters, and outputting falling warning information if the falling target is detected.
In order to verify the performance advantages of the system, the inventor performs experimental verification on the system and analyzes the experimental result; specific experiments and analyses were as follows:
experimental environment
Experiments were trained and tested under the Windows 10 operating system, using an RTX 2080Ti GPU with 11 GB of video memory and 64 GB of system memory. The learning framework of the YOLOX network is PyTorch; the CUDA 11.1 parallel architecture is used to improve computational throughput; the initial learning rate is 0.01 and the weight decay is 0.0005.
Data set
The experimental data set is mainly captured from data published on the Internet and from news video recordings; the pictures are annotated with Labelme. A total of 4576 pictures were collected, of which 3706 form the training set; each training batch comprises 926 pictures, each training period comprises 4 training batches, and the experiment is trained for a total of 50 periods and 200 batches. The learning rate is adjusted by a cosine annealing algorithm (CDWR) each time a training batch is completed.
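The training schedule described above might be set up as in the following sketch; SGD with momentum and the CosineAnnealingWarmRestarts scheduler are assumptions, since the text only specifies the initial learning rate, the weight decay, and that a cosine annealing algorithm adjusts the rate after each training batch.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)     # stand-in for the YOLOX-s-EsE network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=4, T_mult=1)       # restart period of 4 batches is an assumption

for period in range(50):              # 50 training periods
    for batch in range(4):            # 4 training batches per period
        optimizer.zero_grad()
        loss = model(torch.randn(2, 3, 64, 64)).mean()   # placeholder forward/loss
        loss.backward()
        optimizer.step()
        scheduler.step()              # adjust the learning rate after each batch
```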
Evaluation index
Factors reflecting the performance of a network model mainly include its detection precision, detection speed and model weight size. The experiment uses the recall rate R, the precision P, F1 and the mean average precision mAP to evaluate the performance of the model.
Recall rate R
The recall rate R is expressed by a proportion of positive prediction to total actual positive, and the calculation formula is as follows:
R = \frac{TP}{TP + FN}

where TP denotes the number of positive samples predicted correctly and FN denotes the number of positive samples incorrectly predicted as negative.
Precision P
The precision P is the proportion of correct positive predictions among all positive predictions, and the calculation formula is as follows:

P = \frac{TP}{TP + FP}

where FP denotes the number of negative samples incorrectly predicted as positive.
Average F1
F1 is the harmonic mean of the precision and recall, calculated as follows:
F1 = \frac{2PR}{P + R} = \frac{2TP}{2TP + FP + FN}

where FN, as above, denotes the number of positive samples incorrectly predicted as negative.
mAP
The average precision of one category is represented by an index AP, mAP is the average value of all categories of APs, and the average precision average value of the model on all categories is measured.
With R on the horizontal axis and P on the vertical axis, a line segment is drawn leftward from each point until it meets the vertical line through the previous point; the area enclosed by the segments drawn in this way and the coordinate axes is the AP value. The calculation formula of AP is as follows:
AP = \int_0^1 P(R)\,dR
the mAP is calculated as follows:
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i
where N represents the number of all categories. The AP value is used for evaluating the performance of the model on a single detection category, the average value of APs in all categories is the mAP value, and the higher the mAP value of the model is, the better the detection performance is.
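For a single category, these metrics can be computed from the TP/FP/FN counts and a precision-recall curve as in the sketch below; the all-point interpolation used for AP is an assumption about how the enclosed area described above is accumulated.

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    r = tp / (tp + fn) if tp + fn else 0.0        # recall
    p = tp / (tp + fp) if tp + fp else 0.0        # precision
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(recalls, precisions):
    """Area under the P-R curve with all-point interpolation; recalls are
    assumed sorted in ascending order with matching precisions."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    for i in range(len(p) - 2, -1, -1):            # precision envelope, right to left
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP is then the mean of average_precision over all N categories.
```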
Through the improved algorithm, the detection effect of the target in mAP, recall rate, precision, F1 and complex environment is obviously improved.
Experimental effect analysis of different loss functions:
the SIoU loss function considers vector angle, center point distance, shape, overlapping area, normalized coordinate scale and the like between the required regressions; ioU takes into account the overlap area and normalized coordinate scale; EIoU considers the real difference of overlapping area, center point distance and length and width, solves the fuzzy definition of aspect ratio based on CIoU, and adds Focal Loss to solve the problem of sample unbalance in bounding box regression.
To compare the model effects using different Loss functions, experiments were performed with YOLOX-s using the SIoU Loss function as the baseline model, with different Loss functions, and the final Loss graph is shown in fig. 5. Where (a) of fig. 5 is a SIoU loss function, (b) of fig. 5 is a IoU loss function, and (c) of fig. 5 is an EIoU loss function.
As can be seen from the experimental results in fig. 5, there is a significant difference among the Loss curves of the baseline model with different loss functions. In (b), the training loss and the validation loss fit each other more closely with a smaller gap; they also converge faster than in (a), and the gap gradually shrinks, indicating that the learning ability of the model is better.
Although the SIoU loss function takes the most loss aspects into account among the three loss functions, the experimental effect of the EIoU loss function is considerably better than that of SIoU, so EIoU is finally chosen as the loss function of the YOLOX-s-EsE model presented herein.
Ablation experiment result analysis of the added attention modules:
to verify the effect of each attention module on YOLOX-s detection effect, ablation experiments were performed on the dataset. The results of the model-added simAM attention and ECA attention module ablation experiments are shown in table 1, with YOLOX-s-EsE model experimental results identified in bold, based on the loss function using EIoU.
Table 1 experimental results with attention module added
As can be seen from the data in Table 1, after adding simAM to the YOLOX-s model the mAP increased by 0.48%, the recall rate by 6.77% and the precision by 0.31%; after adding ECA on this basis, the mAP improved by a further 0.11%, the recall rate by 2.84% and the precision by 0.77%.
Compared with YOLOX-s, YOLOX-s-EsE improves mAP by 0.59%, recall rate by 9.61% and precision by 1.08%, indicating that the inserted attention modules reasonably assign weights to the feature points of each layer, improving the network's attention to the target object and obtaining the category information and position information of key areas.
1) Detection effect contrast under complex background
The results of detection of the original baseline model (YOLOX-SIoU) and YOLOX-s-EsE on the falling picture in the complex background are shown in fig. 6. FIG. 6 (a) shows the baseline model test results, and FIG. 6 (b) shows the YOLOX-s-EsE test results.
In fig. 6 (a) no target is detected, while in (b) the falling target after rain is detected with a confidence of 0.7; the improved algorithm can therefore still detect targets normally against a complex background.
2) Comparison of detection effects under dim light conditions
The results of original baseline model (YOLOX-SIoU) and YOLOX-s-EsE for images of falls under dim light conditions are shown in fig. 7. Fig. 7 (a) shows the detection result of the baseline model picture 1, fig. 7 (b) shows the detection result of the YOLOX-s-EsE picture 1, fig. 7 (c) shows the detection result of the baseline model picture 2, and fig. 7 (d) shows the detection result of the YOLOX-s-EsE picture 2.
In FIG. 7, (a) and (c) are the detection results of the original baseline model, and (b) and (d) are the YOLOX-s-EsE results. The improved algorithm has a better detection effect under weak illumination: it can still accurately detect the fallen target under dim light, and with higher confidence.
Comparative experiment result analysis of different models:
to examine the performance of the models presented herein, the YOLOv3-s, YOLOv4-s, YOLOv5-s and YOLOX-s-EsE models were selected, respectively, and the results of the comparison experiments are shown in table 2, wherein the results of the YOLOX-s-EsE model experiments are indicated in bold.
TABLE 2 results of comparative experiments on different models
Method mAP(%) F1 Recall(%) Precision(%)
YOLOv3-s 82.98 0.75 67.69 82.89
YOLOv4-s 76.58 0.60 46.72 82.31
YOLOv5-s 77.96 0.58 44.32 85.29
YOLOX-s-EsE 89.23 0.84 82.10 91.79
It can be seen that the YOLOX-s-EsE model performs better than the other models in this experiment: YOLOv4 has the smallest mAP and relatively poor detection, and the recall rate of YOLOv5 only reaches 44.32%, indicating that YOLOv5 can detect fall behavior but has difficulty detecting all instances. The mAP of the YOLOX-s-EsE model reaches 89.23%, its recall rate reaches 82.10% and its precision reaches 91.79%, so its detection performance surpasses the other models.
The comparison of the model's detection effect on occluded targets, small targets and partially visible targets is as follows:
1) Comparison of occlusion target detection effects
The detection results of the fall picture of the occlusion target by YOLOv3-s and YOLOX-s-EsE are shown in FIG. 8. FIG. 8 (a) shows the result of YOLOv3-s detection, and FIG. 8 (b) shows the result of YOLOX-s-EsE detection.
In the figure, (a) is the YOLOv3-s detection result and (b) is the YOLOX-s-EsE detection result; it is clear that YOLOv3-s cannot detect the occluded target, while YOLOX-s-EsE can.
2) Comparison of small target detection effects
The results of the detection of the small target fall pictures by Yolov4-s and Yolox-s-EsE are shown in FIG. 9. FIG. 9 (a) shows the result of YOLOv4-s detection, and FIG. 9 (b) shows the result of YOLOX-s-EsE detection.
In the figure, (a) is the YOLOv4-s detection result and (b) is the YOLOX-s-EsE detection result; it is clear that YOLOv4-s cannot detect the smaller target, while YOLOX-s-EsE can.
3) Partial target detection effect contrast
The results of detection of partial target fall pictures by Yolov5-s and Yolox-s-EsE are shown in FIG. 10. FIG. 10 (a) shows the result of the YOLOv5-s detection, and FIG. 10 (b) shows the result of the YOLOX-s-EsE detection
In the figure, the detection result of the YOLOv5-s is (a), the detection result of the YOLOX-s-EsE is (b), and it is obvious that the YOLOv5-s can not detect partial targets, and the YOLOX-s-EsE can.
The ablation and comparative experiment results show that the recall rate, precision, F1 and mAP indexes of the algorithm are obviously improved, both in complex environments and under dim light conditions, and for the detection of small targets, occluded targets and partially visible targets.
Experimental results show that the improved algorithm obviously improves mAP, recall rate, precision, F1 and the detection effect on targets in complex environments; it can effectively detect complete and occluded fall targets under blurred background and dim light conditions, and improves the precision of target detection.

Claims (10)

1. An improved YOLOX fall detection system, characterized by: the improved Yolox fall detection system comprises a backbone network, a feature fusion module, a prediction head, and a simAM attention module introduced into the backbone network, wherein an ECA channel attention module is added in the feature fusion module, and the loss function of the system adopts an EIoU loss function; the specific steps of detecting the picture are as follows:
step one: and (3) data acquisition: collecting falling videos of cameras in public places, taking different pictures of each frame as picture data sets, and marking information by labelimg software, wherein the types of real frames are divided into 5 types, namely: stand, fall, sit, squat, run, and dividing a training set, a verification set and a test set;
step two: text preprocessing: preprocessing the collected picture data set, uniformly putting picture path information of the training set and the verification set into a TXT document, and formulating correct reading path text information and category information so as to facilitate model reading;
step three: building a training model: constructing a YOLOX fall detection system with a CSPDarknet backbone network having a residual structure and a simAM attention mechanism, taking the uniformly scaled pictures as network input, putting the output of the backbone network into a feature fusion module with an ECA attention mechanism to further extract feature information, and finally outputting the detection result through the YOLOHead;
step four: model training: dividing the processed data set into a training set, a verification set and a test set, and training by using the constructed model to obtain an optimal weight model;
step five: and (3) model detection: inputting the pictures or the monitoring video into the trained model for detection, marking and positioning the detected falling target, and sending out alarm information.
2. An improved YOLOX fall detection system according to claim 1, wherein: in the fall detection system of step three, constructed with a CSPDarknet backbone network having a residual structure and a simAM attention mechanism, the simAM attention mechanism is introduced into the non-residual CSP parts of the backbone network, namely the stacked Bottleneck parts and the final output part, so that the network structure can be deepened and deeper feature information can be extracted;
the simAM attention mechanism is a parameter-free attention module, starting from neuroscience theory, an energy function is constructed to mine the importance of neurons, so that each neuron is assigned with a unique weight, and an analytic solution is derived for accelerating calculation, and the final energy function is as follows:
e_t(w_t, b_t, y, x_i) = \frac{1}{M-1}\sum_{i=1}^{M-1}\left(-1-(w_t x_i + b_t)\right)^2 + \left(1-(w_t t + b_t)\right)^2 + \lambda w_t^2

wherein,

w_t = -\frac{2(t-\mu_t)}{(t-\mu_t)^2 + 2\sigma_t^2 + 2\lambda}

b_t = -\frac{1}{2}(t+\mu_t)w_t

\mu_t = \frac{1}{M-1}\sum_{i=1}^{M-1} x_i

\sigma_t^2 = \frac{1}{M-1}\sum_{i=1}^{M-1}(x_i-\mu_t)^2

In the above, t and x_i are the target neuron and the other neurons in a single channel of the input feature X \in \mathbb{R}^{C\times H\times W}; i is the index over the spatial dimension and M = H\times W is the number of neurons on the channel; w_t and b_t are the weight and bias of the transformation; \mu_t and \sigma_t^2 are the mean and variance of all neurons except t;

assuming that all pixels in a single channel follow the same distribution, the mean and variance can be computed once over all neurons of the channel and reused for every neuron on that channel, so the minimum energy can be found by the following formula:

e_t^* = \frac{4(\hat{\sigma}^2+\lambda)}{(t-\hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}

The above formula means that the lower the energy, the more the neuron t is distinguished from the surrounding neurons and the higher its importance; thus, the importance of each neuron can be obtained as 1/e_t^*. The feature is then enhanced with a Sigmoid scaling operator, according to the formula:

\tilde{X} = \mathrm{Sigmoid}\left(\frac{1}{E}\right)\odot X

where E groups all the energies e_t^* across the channel and spatial dimensions; the Sigmoid function restricts values of E that are too large and does not affect the relative importance of each neuron.
3. An improved YOLOX fall detection system according to claim 2, wherein: the stacked Bottleneck parts introduce the simAM attention mechanism, that is, the input feature map is first passed through two convolutions, the simAM then extracts three-dimensional attention weights and maps them onto the feature layer, the feature layers of the two branches are added, and finally channel information is extracted by ECA attention and mapped onto the feature layer.
4. An improved YOLOX fall detection system according to claim 2, wherein: the final output part introduces the simAM attention mechanism by dividing the input feature map into two branches and convolving each branch; the left branch is convolved and then the two branches are spliced into one, the spliced feature layer is convolved, and finally three-dimensional weight information is extracted by the simAM and mapped onto the finally output feature map.
5. An improved YOLOX fall detection system according to claim 4, wherein: each branch is convolved respectively, and the convolution process is as follows: the input feature layer first undergoes a basic convolution; three branches are pooled with 5×5, 9×9 and 13×13 average pooling respectively, together with an identity branch; the results of the four branches are added and then convolved, three-dimensional weight information is extracted by simAM attention, and the important information is further extracted and mapped onto the output feature layer.
6. An improved YOLOX fall detection system according to claim 1, wherein: the feature fusion module of the ECA attention mechanism is characterized in that a simAM mechanism is introduced behind the feature fusion algorithm after up-sampling and down-sampling in the feature fusion module, and the ECA attention mechanism is introduced behind three output feature layers at the output end of the feature fusion module.
7. An improved YOLOX fall detection system according to claim 6, wherein: the ECA attention mechanism learns effective channel attention with low model complexity: the module generates channel attention through a fast 1-dimensional convolution whose kernel size is determined adaptively through a nonlinear mapping of the channel dimension; compared with other attention mechanisms, ECA avoids dimensionality reduction, efficiently realizes local cross-channel interaction with a 1-dimensional convolution, and extracts the dependency relations among channels.
8. An improved YOLOX fall detection system according to claim 1 or 6 or 7, characterized in that: putting the output of the backbone network into a feature fusion module with an ECA attention mechanism, and further extracting feature information; the specific operation steps are as follows: firstly, carrying out global average pooling operation on an input feature map; and then carrying out 1-dimensional convolution operation with the convolution kernel of k, and obtaining the weight w of each channel through a Sigmoid activation function, wherein the formula is as follows:
w = Sigmoid(C1D_k(X))    (8);
wherein C1D represents one-dimensional convolution, k is the convolution kernel size, and X is the input feature map;
and finally multiplying the weight with the corresponding element of the original input feature map to obtain a final output feature map.
9. An improved YOLOX fall detection system according to claim 1, wherein: the EIoU loss function comprises three parts: overlap loss, center-distance loss and width-height loss, the first two continuing the method in CIoU; EIoU adds a loss term that directly penalizes the predicted width and height, directly minimizing the difference in width and height between the real frame and the prediction frame, so the convergence speed is faster; the specific formula is as follows:
L_{EIoU} = L_{IoU} + L_{dis} + L_{asp} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \frac{\rho^2(w, w^{gt})}{C_w^2} + \frac{\rho^2(h, h^{gt})}{C_h^2}

wherein C_w and C_h are the width and height of the smallest enclosing box covering the two frames; \rho denotes the Euclidean distance between two points; b and b^{gt} denote the center points of the prediction frame and the real frame respectively; w and w^{gt} denote the widths of the prediction frame and the real frame respectively; h and h^{gt} denote the heights of the prediction frame and the real frame respectively; c denotes the diagonal distance of the minimum closure region that can contain both the prediction frame and the real frame.
10. An improved YOLOX fall detection system according to claim 1, wherein: the specific operation of training with the constructed model is as follows: mosaic data enhancement is applied to the pictures of the training set, that is, one picture is read, three other pictures are read at random, the four pictures are stitched together in a 2×2 grid (the Chinese character '田' layout), and the label information of the stitched mosaic picture is adjusted, which further enhances small-target detection performance; the pictures are uniformly scaled to 640×640×3, with the remainder filled by padding to avoid image distortion.
CN202211500031.4A 2022-11-28 2022-11-28 Improved Yolox fall detection system Pending CN116092179A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211500031.4A CN116092179A (en) 2022-11-28 2022-11-28 Improved Yolox fall detection system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211500031.4A CN116092179A (en) 2022-11-28 2022-11-28 Improved Yolox fall detection system

Publications (1)

Publication Number Publication Date
CN116092179A true CN116092179A (en) 2023-05-09

Family

ID=86199957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211500031.4A Pending CN116092179A (en) 2022-11-28 2022-11-28 Improved Yolox fall detection system

Country Status (1)

Country Link
CN (1) CN116092179A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117557787A (en) * 2024-01-11 2024-02-13 安徽农业大学 Lightweight multi-environment tomato detection method based on improved yolov8
CN117557787B (en) * 2024-01-11 2024-04-05 安徽农业大学 Lightweight multi-environment tomato detection method based on improved yolov8

Similar Documents

Publication Publication Date Title
CN110287960B (en) Method for detecting and identifying curve characters in natural scene image
Tian et al. A dual neural network for object detection in UAV images
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN112733749A (en) Real-time pedestrian detection method integrating attention mechanism
CN111079739B (en) Multi-scale attention feature detection method
CN112270347A (en) Medical waste classification detection method based on improved SSD
CN111027576B (en) Cooperative significance detection method based on cooperative significance generation type countermeasure network
CN114241548A (en) Small target detection algorithm based on improved YOLOv5
CN111898432B (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN113920107A (en) Insulator damage detection method based on improved yolov5 algorithm
CN109670555B (en) Instance-level pedestrian detection and pedestrian re-recognition system based on deep learning
CN112529090B (en) Small target detection method based on improved YOLOv3
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
CN112364754A (en) Bolt defect detection method and system
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
CN113628297A (en) COVID-19 deep learning diagnosis system based on attention mechanism and transfer learning
CN116092179A (en) Improved Yolox fall detection system
CN111767919B (en) Multilayer bidirectional feature extraction and fusion target detection method
CN111339950B (en) Remote sensing image target detection method
CN115953744A (en) Vehicle identification tracking method based on deep learning
CN111127400A (en) Method and device for detecting breast lesions
CN115797970A (en) Dense pedestrian target detection method and system based on YOLOv5 model
CN115937736A (en) Small target detection method based on attention and context awareness
CN111950586B (en) Target detection method for introducing bidirectional attention
CN114662605A (en) Flame detection method based on improved YOLOv5 model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination