CN115578664A - Video monitoring-based emergency event judgment method and device - Google Patents

Video monitoring-based emergency event judgment method and device

Info

Publication number
CN115578664A
Authority
CN
China
Prior art keywords
emergency
module
emergency event
image
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211043197.8A
Other languages
Chinese (zh)
Inventor
陈卫强
倪春
姚天一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Half Cloud Technology Co ltd
Original Assignee
Hangzhou Half Cloud Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Half Cloud Technology Co ltd filed Critical Hangzhou Half Cloud Technology Co ltd
Priority to CN202211043197.8A priority Critical patent/CN115578664A/en
Publication of CN115578664A publication Critical patent/CN115578664A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 - Television systems
    • H04N7/18 - Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an emergency event judgment method and device based on video monitoring, comprising the following steps: S1, collecting video streams; S2, training emergency event models that provide different emergency event identification functions; and S3, selecting a digital signal camera A of a certain emergency scene X and configuring a corresponding emergency event X identification function B for the digital signal camera A from the emergency event function list. According to the invention, firstly, the conventional super-brain video-stream recognition mode is replaced by an image-unit snapshot mode using a general network architecture and ordinary front-end digital signal cameras, and the frequency setting greatly reduces the data computing resources required without affecting the final event recognition result; secondly, the low-frequency sample queue is passed through each functional algorithm pool to achieve multifunctional recognition; and finally, a new emergency-scene recognition function can be added, when a region requires it, by adding and training a new functional algorithm pool, which improves deployment efficiency.

Description

Video monitoring-based emergency event judgment method and device
Technical Field
The invention relates to the technical field of emergency event management, in particular to an emergency event judgment method and device based on video monitoring.
Background
With the rapid development of information technology, artificial intelligence technology represented by machine vision has entered a period of rapid development and practical deployment. Using machine vision instead of human vision to perceive and discover problems has become a technical means widely applied in many fields, and in the technical field of emergency management, using machine vision equipment to recognize the classic image signal of an emergency event as the entry point for detecting and classifying the event has gradually become the mainstream scheme.
However, firstly, because the internal hardware performance of edge-computing machine vision devices is limited, event processing is highly specialized: a given device can only detect one problem or one class of problems, so multiple devices may be needed when several kinds of emergency event have to be detected. Secondly, center super-brain machine vision systems place low requirements on the edge devices but impose heavy demands on the network architecture and computing cost; for example, when 8 recognition functions are deployed on a 16-channel super-brain, although only one type of front-end machine vision device is needed, only 2 front-end devices can be deployed in total, so large-area coverage deployment still cannot be achieved in most emergency scenes.
In summary, although the scheme of using machine vision equipment to recognize the classic image signal of an emergency event as the entry point for detecting and classifying its occurrence has become mainstream in the emergency management field, it is limited by cost, and most projects cannot achieve all-round, high-density sensing coverage.
Disclosure of Invention
In order to solve the technical problems mentioned in the background art, a method and an apparatus for determining an emergency event based on video monitoring are provided.
In order to achieve the purpose, the invention adopts the following technical scheme:
an emergency event judgment method based on video monitoring comprises the following steps:
s1, installing corresponding digital signal cameras in different emergency scenes and collecting video streams;
s2, training emergency event models aiming at different emergency scenes, wherein the emergency event models have different emergency event identification functions, and constructing an emergency event function list;
s3, selecting a digital signal camera A of a certain emergency scene X, and configuring a corresponding emergency event X identification function B for the digital signal camera A from the emergency event function list;
s4, processing the video stream A0 acquired by the digital signal camera A, configuring the effective acquisition time and acquisition frequency of the video stream A0, and acquiring a video stream A1;
s5, collecting identification samples a0 from the video stream A1, and simultaneously taking undifferentiated snapshots of the video stream A1 at the set frequency to obtain snapshot images a1;
s6, screening the identification sample a0 through a functional algorithm pool B, judging that an emergency event X occurs in the emergency scene X when the identification sample a0 meets a screening condition, and outputting a snapshot image a1, wherein the functional algorithm pool B is an identification algorithm of an identification function B of the emergency event X;
and S7, displaying the snapshot image a1 together with the corresponding snapshot time, the snapshot position and the name of emergency event x through a visual human-computer interaction interface.
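A minimal sketch of the S4-S7 loop follows, assuming OpenCV (cv2) is used for stream capture; `algorithm_pool_b` is a hypothetical callable standing in for functional algorithm pool B, and the stream address is illustrative:

```python
import time
import cv2  # assumed dependency for reading the camera stream


def monitor(rtsp_url, algorithm_pool_b, snapshot_hz=0.25):
    """Steps S4-S7: sample video stream A0 at the configured frequency and
    screen each identification sample through functional algorithm pool B."""
    cap = cv2.VideoCapture(rtsp_url)      # video stream A0 from digital signal camera A
    period = 1.0 / snapshot_hz            # 0.25 Hz -> one undifferentiated snapshot every 4 s
    while cap.isOpened():
        ok, frame = cap.read()            # identification sample a0 / snapshot image a1
        if not ok:
            break
        if algorithm_pool_b(frame):       # screening condition met -> emergency event x
            yield frame, time.time()      # hand the snapshot and its time to the display layer
        time.sleep(period)                # low-frequency sampling instead of full-rate analysis
    cap.release()
```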
As a further description of the above technical solution:
in step S2, the emergency event recognition function is realized by a functional algorithm pool, and the functional algorithm pool is constructed through the following steps:
s21, loading pictures of the emergency event in the emergency event model with the deep learning annotation tool labelImg, selecting the yolo labeling format, drawing and labeling the real boxes of the target detection objects in the pictures, and, after the labeling is completed, saving the result as a folder T, wherein different folders T correspond to different emergency events and are configured with different emergency event functions;
s22, applying Mosaic enhancement to the pictures in the folder T, adaptively scaling them, and then inputting them into the YOLOV5 neural network model;
s23, obtaining the position and the size of a prediction frame of a target detection object in the picture and the type of an included emergency area through forward propagation, wherein the forward propagation comprises three parts, namely feature extraction, feature fusion and detection head;
s24, calculating the difference between the prediction frame and the real frame by using a loss function;
s25, iteratively updating a weight matrix and a bias in forward propagation through gradient descent to reduce the loss between the prediction frame and the real frame;
s26, solving for the weight matrix and bias at which the loss function reaches its minimum value within the set number of iterations;
s27, the weight matrix and the bias are used as parameters of forward propagation in the detection stage to obtain prediction information for identifying a target detection object in a sample;
s28, repeating the steps S21-S27, training different emergency events through the YOLOV5 neural network model, generating corresponding algorithm files after training, and packaging and deploying the algorithm files to the algorithm pool to obtain the functional algorithm pool.
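To make step S28 concrete, here is a hedged sketch of how one functional algorithm could be trained, assuming the publicly available Ultralytics YOLOv5 repository is used (the patent does not name a specific codebase) and assuming the labelImg annotations in folder T have been exported into the usual images/labels layout; the paths, class names and epoch count are illustrative.

```python
import subprocess

# Illustrative dataset description in the YOLOv5 data.yaml format;
# nc corresponds to C, the total number of target categories.
DATA_YAML = """\
train: datasets/fire/images/train
val: datasets/fire/images/val
nc: 2
names: ["flame", "dense_smoke"]
"""

with open("fire.yaml", "w") as f:
    f.write(DATA_YAML)

# Train one emergency-event recognition function (steps S22-S27); Mosaic
# enhancement and adaptive 640x640 letterboxing are applied by the YOLOv5
# training pipeline itself.
subprocess.run(
    ["python", "train.py",
     "--img", "640",
     "--data", "fire.yaml",
     "--weights", "yolov5s.pt",
     "--epochs", "300"],
    check=True,
)

# The resulting weights would then be packaged and deployed into the
# algorithm pool as one functional algorithm (step S28).
```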
As a further description of the above technical solution:
in step S23, the position and the category of the target detection object are predicted by using an anchor point mechanism, and the anchor point mechanism prediction specifically includes the following steps:
after feature extraction and feature fusion are performed on an input picture, three down-sampled feature maps are obtained; each grid cell in a feature map has 3 × (1 + 4 + C) channels and is used to predict targets at three different sizes, where 3 is the number of anchor boxes, 1 is the anchor-box confidence, 4 is the offsets (t_x, t_y, t_w, t_h) of the anchor-box coordinates relative to the prior anchor-box coordinates obtained after training the YOLOV5 neural network model, the prior anchor-box sizes are the target anchor-box sizes obtained by the YOLOV5 neural network model by running the k-means algorithm on the folder T, and C is the total number of target categories;
Confidence = Pr(Object) × IoU(pred, truth);
where Confidence is the anchor-box confidence, Pr(Object) indicates whether the center-point coordinate of the target detection object falls within the anchor box (1 if it does, 0 otherwise), and IoU(pred, truth) is the intersection-over-union of the anchor box and the real box, i.e. the intersection area of the two boxes divided by their union area:
IoU(pred, truth) = area(B ∩ B^gt) / area(B ∪ B^gt);
if the center point of a target detection object lies in a grid cell, that cell is responsible for predicting the position and size of the anchor box for that target and the class confidence of the anchor box;
b_x = factor × σ(t_x) + c_x;
b_y = factor × σ(t_y) + c_y;
b_w = p_w · e^(t_w);
b_h = p_h · e^(t_h);
Class-Specific Confidence Score = Pr(class_i | object) × Confidence;
where b_x and b_y are the center-point coordinates of the prediction box, b_w and b_h are its width and height, c_x and c_y are the coordinates of the grid cell relative to the top-left corner (0, 0) of the picture, p_w and p_h are the width and height of the prior anchor box, t_x and t_y are the predicted offsets of the box center (and t_w, t_h of its width and height relative to the prior box), σ(·) is the Sigmoid function that maps t_x and t_y into the interval (0, 1), factor is a scaling factor with a value greater than 1.0, Class-Specific Confidence Score is the class confidence of the anchor box, and Pr(class_i | object) is the probability that the target belongs to the i-th class.
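A small numerical sketch of the decoding formulas above, following them exactly as written (including b_w = p_w·e^(t_w)); the concrete value of `factor` is an assumption, since the text only states that it is greater than 1.0:

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h, factor=2.0):
    """Apply b_x = factor*sigma(t_x)+c_x, b_y = factor*sigma(t_y)+c_y,
    b_w = p_w*e^(t_w), b_h = p_h*e^(t_h) for one anchor box."""
    b_x = factor * sigmoid(t_x) + c_x   # box center x, in grid-cell units
    b_y = factor * sigmoid(t_y) + c_y   # box center y, in grid-cell units
    b_w = p_w * np.exp(t_w)             # width scaled from the prior anchor width p_w
    b_h = p_h * np.exp(t_h)             # height scaled from the prior anchor height p_h
    return b_x, b_y, b_w, b_h
```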
As a further description of the above technical solution:
in step S23, target detection objects of different sizes are predicted using a multi-scale network: after feature extraction and feature fusion, the input image yields three feature maps down-sampled by 8, 16 and 32 times, and each grid cell in a feature map has 3 anchor boxes of different sizes, so that target detection objects of 9 different sizes can be predicted;
the multi-scale prediction is realized by combining a feature pyramid network (FPN) and a path aggregation network (PAN), wherein the FPN layers pass the rich semantic features of the top layers down to the bottom layers, and the PAN layers pass the accurate localization information of the bottom layers up to the top layers.
As a further description of the above technical solution:
the YOLOV5 neural network model comprises a Backbone network Backbone, a Neck network Neck and a Head network Head;
the Backbone network Backbone comprises a Focus module, a CBL module, a CSP module and an SPP module and is responsible for feature extraction of a target detection object;
the Focus module first makes four copies of the input image, applies interval pixel slicing to each copy, and finally channel-fuses the four slices to obtain a two-fold down-sampled image with no loss of information (see the sketch after this list);
the CBL module refers to convolution, batch normalization and Leaky _ ReLU function activation of the image;
the CSP module comprises a CSP1 module and a CSP2 module: the CSP1 module splits the input feature map into two branches, passes one branch through a residual structure followed by a convolution and then channel-fuses it with the convolved output of the other branch; the CSP2 module splits the input feature map into two branches, passes one branch through two CBL modules followed by a convolution and then channel-fuses it with the convolved output of the other branch;
the SPP module applies max pooling with several different filter sizes to the input feature map and then channel-fuses the original feature map with the three pooled results;
the Neck network Neck adopts a PANet polymerization structure to fuse the features extracted by the Backbone network Backbone;
in the Head network Head, three detection heads are adopted to perform downsampling on an input image by 8 times, 16 times and 32 times respectively, and three feature vectors with different sizes are generated respectively and are used for detecting target detection objects with different sizes.
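The Focus slicing and CBL structure described above can be sketched in PyTorch as follows; this is an illustrative reconstruction in line with common YOLOv5 implementations, not code taken from the patent, and the channel counts are assumptions.

```python
import torch
import torch.nn as nn


class Focus(nn.Module):
    """Slice the input into four pixel-interleaved copies, channel-fuse them,
    then apply a CBL block; yields a 2x down-sampled map with no pixels lost."""

    def __init__(self, in_ch=3, out_ch=64, k=3):
        super().__init__()
        self.cbl = nn.Sequential(
            nn.Conv2d(in_ch * 4, out_ch, k, stride=1, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),   # CBL: Conv + BatchNorm + Leaky ReLU
        )

    def forward(self, x):                         # x: (N, C, H, W)
        sliced = torch.cat([x[..., ::2, ::2],     # even rows, even cols
                            x[..., 1::2, ::2],    # odd rows, even cols
                            x[..., ::2, 1::2],    # even rows, odd cols
                            x[..., 1::2, 1::2]],  # odd rows, odd cols
                           dim=1)                 # 4*C channels, H/2 x W/2
        return self.cbl(sliced)
```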
As a further description of the above technical solution:
in step S24, the loss function calculation includes a confidence loss, a classification loss and a bounding-box loss; the confidence loss and the classification loss use the binary cross-entropy loss function, and the bounding-box loss uses the GIoU loss function;
BCELoss = -log P', y = 1;
BCELoss = -log(1 - P'), y = 0;
where BCELoss is the binary cross-entropy loss, P' is the predicted value for an identification sample, y is its true value, y = 1 means the sample is a target of the class, and y = 0 means it is not;
L_GIoU = 1 - IoU + area(C \ (B ∪ B^gt)) / area(C);
where L_GIoU is the GIoU loss, B is the prediction box, B^gt is the real box, and C is the smallest rectangle containing both the prediction box and the real box.
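For concreteness, here is a sketch of the GIoU bounding-box loss defined above, in PyTorch, for boxes given in (x1, y1, x2, y2) format; it is an illustrative implementation rather than code from the patent.

```python
import torch


def giou_loss(pred, target):
    """L_GIoU = 1 - IoU + area(C \\ (B ∪ B_gt)) / area(C) for (x1, y1, x2, y2) boxes."""
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    union = area_p + area_t - inter
    iou = inter / union.clamp(min=1e-7)

    # C: smallest rectangle enclosing both the prediction box and the real box
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c_area = (cw * ch).clamp(min=1e-7)

    return 1.0 - iou + (c_area - union) / c_area
```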
As a further description of the above technical solution:
an emergency event judgment device based on video monitoring comprises an image acquisition module, a communication module, a function configuration module, an image processing module and an event display module;
the image acquisition module is used for setting an image acquisition environment state, and comprises the steps of determining the installation position of a digital signal camera and configuring parameters of a video stream, wherein the parameters comprise video resolution, a main code stream, a sub code stream and a compression format;
the communication module is used for transmitting the video stream acquired by the digital signal camera between the image acquisition module and the image processing module, the communication module adopts Ethernet communication, the Ethernet communication comprises local area network communication and wide area network communication, and the local area network communication or the wide area network communication is adapted according to an emergency scene;
the function configuration module is used for configuring the position information of the digital signal camera, configuring the emergency event identification function of the digital signal camera and configuring the acquisition parameters of the video stream to obtain a compressed video stream;
the image processing module collects samples from compressed video streams by adopting timing snapshot, inputs the samples into a functional algorithm pool to judge whether an emergency event occurs in an emergency scene or not, and outputs undifferentiated snapshot images for identifying the samples;
the event display module is used for displaying the snapshot image, the snapshot time, the snapshot position and the emergency event name.
In summary, by adopting the above technical scheme, the invention has the following beneficial effects: firstly, the conventional super-brain video-stream recognition mode is replaced by an image-unit snapshot mode using a general network architecture and ordinary front-end digital signal cameras, and the frequency setting greatly reduces the data computing resources required without affecting the final event recognition result; secondly, passing the low-frequency sample queue through each functional algorithm pool enables multifunctional recognition; and finally, a new emergency-scene recognition function can be added, when a region requires it, simply by adding and training a new functional algorithm pool, which improves deployment efficiency.
Drawings
Fig. 1 is a schematic flowchart illustrating an emergency event determination method based on video surveillance according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram illustrating an emergency event determination device based on video surveillance according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1 and fig. 2, the present invention provides a technical solution: an emergency event judgment method based on video monitoring comprises the following steps:
s1, installing corresponding digital signal cameras in different emergency scenes and collecting video streams;
s2, training emergency event models aiming at different emergency scenes, wherein the emergency event models have different emergency event recognition functions, and constructing an emergency event function list;
s21, loading pictures of the emergency event in the emergency event model with the deep learning annotation tool labelImg, selecting the yolo labeling format, drawing and labeling the real boxes of the target detection objects in the pictures, and, after the labeling is completed, saving the result as a folder T, wherein different folders T correspond to different emergency events and are configured with different emergency event functions;
specifically, taking a fire event as an example, pictures of various types of fire and flame are prepared and loaded with the deep learning annotation tool labelImg, the yolo labeling format is selected, the fire and flame regions in the pictures are box-selected and labeled, and after the labeling is completed the result is saved as a folder T, which contains the pictures and the label texts;
s22, applying Mosaic enhancement to the pictures in the folder T, adaptively scaling them to 640 × 640, and then inputting them into the YOLOV5 neural network model;
the YOLOV5 neural network model comprises a Backbone network Backbone, a Neck network Neck and a Head network Head; compared with the other models of the YOLO series its network structure is more complex, and several techniques are used in the data enhancement and training strategies to improve the detection accuracy and speed of the model;
the Backbone network Backbone comprises a Focus module, a CBL module, a CSP module and an SPP module and is responsible for feature extraction of a target detection object;
the Focus module first makes four copies of the input image, applies interval pixel slicing to each copy, and finally channel-fuses the four slices to obtain a two-fold down-sampled image with no loss of information;
the CBL module refers to convolution, batch normalization and Leaky _ ReLU function activation of the image;
the CSP module comprises a CSP1 module and a CSP2 module: the CSP1 module splits the input feature map into two branches, passes one branch through a residual structure followed by a convolution and then channel-fuses it with the convolved output of the other branch; the CSP2 module splits the input feature map into two branches, passes one branch through two CBL modules followed by a convolution and then channel-fuses it with the convolved output of the other branch;
the SPP module applies 5 × 5, 9 × 9 and 13 × 13 max pooling to the input feature map respectively, and then channel-fuses the original feature map with the three pooled results;
the Neck network Neck adopts a PANet polymerization structure to fuse the features extracted by the Backbone network Backbone;
in the Head network Head, three detection heads are adopted to respectively carry out 8-time, 16-time and 32-time down sampling on an input image, and three feature vectors with different sizes are respectively generated and used for detecting target detection objects with different sizes;
s23, obtaining the position and the size of a prediction frame of a target detection object in the picture and the type of an included emergency area through forward propagation, wherein the forward propagation comprises three parts, namely feature extraction, feature fusion and detection head;
in step S23, the anchor point mechanism is used to predict the position and the category of the target detection object, and the anchor point mechanism prediction specifically comprises the following steps:
after feature extraction and feature fusion are performed on an input picture, three down-sampled feature maps are obtained, with down-sampling factors of 8, 16 and 32 in turn; for example, a 640 × 640 input emergency picture yields three feature maps of 80 × 80, 40 × 40 and 20 × 20. Each grid cell in a feature map has 3 × (1 + 4 + C) channels and is used to predict targets at three different sizes, where 3 is the number of anchor boxes, 1 is the anchor-box confidence, 4 is the offsets (t_x, t_y, t_w, t_h) of the anchor-box coordinates relative to the prior anchor-box coordinates obtained after training the YOLOV5 neural network model, the prior anchor-box sizes are the target anchor-box sizes obtained by the YOLOV5 neural network model by running the k-means algorithm on the folder T, and C is the total number of target categories;
Confidence = Pr(Object) × IoU(pred, truth);
where Confidence is the anchor-box confidence, Pr(Object) indicates whether the center-point coordinate of the target detection object falls within the anchor box (1 if it does, 0 otherwise), and IoU(pred, truth) is the intersection-over-union of the anchor box and the real box, i.e. the intersection area of the two boxes divided by their union area:
IoU(pred, truth) = area(B ∩ B^gt) / area(B ∪ B^gt);
if the center point of a target detection object lies in a grid cell, that cell is responsible for predicting the position and size of the anchor box for that target and the class confidence of the anchor box;
b_x = factor × σ(t_x) + c_x;
b_y = factor × σ(t_y) + c_y;
b_w = p_w · e^(t_w);
b_h = p_h · e^(t_h);
Class-Specific Confidence Score = Pr(class_i | object) × Confidence;
where b_x and b_y are the center-point coordinates of the prediction box, b_w and b_h are its width and height, c_x and c_y are the coordinates of the grid cell relative to the top-left corner (0, 0) of the picture, p_w and p_h are the width and height of the prior anchor box, t_x and t_y are the predicted offsets of the box center (and t_w, t_h of its width and height relative to the prior box), and σ(·) is the Sigmoid function that maps t_x and t_y into the open interval (0, 1). Because σ(·) never reaches exactly 0 or 1, no grid cell could be responsible for predicting a target whose center point lies exactly on a grid boundary; for this reason a scaling factor is introduced, which is generally greater than 1.0. Class-Specific Confidence Score is the class confidence of the anchor box, and Pr(class_i | object) is the probability that the target belongs to the i-th class;
in step S23, target detection objects of different sizes are predicted using a multi-scale network: after feature extraction and feature fusion, the input image yields three feature maps down-sampled by 8, 16 and 32 times, and each grid cell in a feature map has 3 anchor boxes of different sizes, so that target detection objects of 9 different sizes can be predicted;
the multi-scale prediction is realized by combining a feature pyramid network (FPN) and a path aggregation network (PAN), wherein the FPN layers pass the rich semantic features of the top layers down to the bottom layers and the PAN layers pass the accurate localization information of the bottom layers up to the top layers; the two complement each other, which improves both the accuracy of the predicted box positions and the accuracy of the predicted box categories;
to address the problem that the FPN only passes high-level semantic information back down to the shallow layers from top to bottom, leaving the position information of target detection objects insufficient, the YOLOV5 neural network model adds a bottom-up pyramid on top of the PAN structure as a supplement to the original FPN structure, forming the PANet network; the bottom-up path fuses information by mapping and superimposing the rich position information of the shallow layers onto the deep-layer features, passing bottom-layer position information back up to the higher layers, which further strengthens information transfer between different feature maps, accurately preserves spatial information, and effectively improves the network's ability to detect medium and large targets;
for the anchor boxes, if the model applied no suppression it would finally generate (80 × 80 + 40 × 40 + 20 × 20) × 3 = 25200 prediction boxes for one image; the final anchor box of a target is obtained by computing the intersection-over-union of each anchor box with the anchor box having the highest class confidence, and any anchor box whose intersection-over-union exceeds a preset threshold is discarded; this process is called Non-Maximum Suppression (NMS);
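A minimal sketch of the NMS step described above; the 0.45 IoU threshold is an assumption, since the text only speaks of "a preset threshold":

```python
import numpy as np


def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression over (x1, y1, x2, y2) boxes."""
    order = scores.argsort()[::-1]          # highest class confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-7)
        order = order[1:][iou <= iou_thresh]  # discard boxes overlapping the kept one
    return keep
```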
s24, calculating the difference between the prediction frame and the real frame by using a loss function;
the loss function calculation includes a confidence loss, a classification loss and a bounding-box loss; the confidence loss and the classification loss use the binary cross-entropy loss function, and the bounding-box loss uses the GIoU loss function;
BCELoss=-logP',y=1;
BCELoss=-log(1-P'),y=0;
where BCELoss is the binary cross-entropy loss, P' is the predicted value for an identification sample, y is its true value, y = 1 means the sample is a target of the class, and y = 0 means it is not;
L_GIoU = 1 - IoU + area(C \ (B ∪ B^gt)) / area(C);
where L_GIoU is the GIoU loss, B is the prediction box, B^gt is the real box, and C is the smallest rectangle containing both the prediction box and the real box;
s25, iteratively updating a weight matrix and a bias in forward propagation through gradient descent to reduce loss between the prediction frame and the real frame;
s26, solving for the weight matrix and bias at which the loss function reaches its minimum value within the set number of iterations;
s27, the weight matrix and the bias are used as parameters of forward propagation in the detection stage to obtain prediction information for identifying a target detection object in a sample;
s28, repeating the steps S21-S27, training different emergency events through a YOLOV5 neural network model, generating corresponding algorithm files after training, and packaging and deploying the algorithm files to an algorithm pool to obtain a functional algorithm pool;
s3, selecting a digital signal camera A of a certain emergency scene X, and configuring a corresponding emergency event X identification function B for the digital signal camera A from the emergency event function list;
s4, processing the video stream A0 acquired by the digital signal camera A: the effective acquisition time of the video stream A0 is configured as 7 × 24 h and the acquisition frequency as 0.25 Hz to obtain the video stream A1; the acquisition frequency can be set in the range 0.0001-60 Hz according to the application scene and cost requirements, and 0.25 Hz is preferred in this embodiment;
s5, collecting identification samples a0 from the video stream A1, and simultaneously taking undifferentiated snapshots of the video stream A1 at 0.25 Hz to obtain snapshot images a1;
s6, screening the identification sample a0 through the functional algorithm pool B: when the identification sample a0 meets the screening condition, it is judged that emergency event X has occurred in emergency scene X and the snapshot image a1 is output, wherein the functional algorithm pool B is the identification algorithm of identification function B of emergency event X; the screening condition is determined by the specific emergency scene, for example, in a scene where motor vehicles are prohibited from entering, the screening condition is that a motor vehicle is recognized in the image: the image is processed by the motor-vehicle recognition algorithm, and when the recognition result contains a motor vehicle the screening condition is judged to be met;
and S7, displaying the snapshot image a1 together with the corresponding snapshot time, the snapshot position and the name of emergency event x through a visual human-computer interaction interface.
The method realizes emergency event judgment through "snapshot + picture recognition", which, compared with the video-stream recognition adopted in the existing emergency management field, meets the usage requirements while greatly reducing cost.
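The multifunctional recognition described above (each low-frequency sample passed through every configured functional algorithm pool) can be sketched as follows; the `matches()` interface and the pool names are hypothetical:

```python
def screen_sample(sample, algorithm_pools):
    """Pass one identification sample through every configured functional
    algorithm pool and return the names of events whose screening condition
    is met. `algorithm_pools` maps event names to objects exposing a
    hypothetical matches(image) -> bool method (e.g. flame, dense smoke,
    motor vehicle)."""
    triggered = []
    for event_name, pool in algorithm_pools.items():
        if pool.matches(sample):          # screening condition for this event
            triggered.append(event_name)
    return triggered
```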
Referring to fig. 2, an emergency event determination apparatus based on video monitoring includes an image acquisition module, a communication module, a function configuration module, an image processing module, and an event display module;
the image acquisition module is used for setting an image acquisition environment state, and comprises the steps of determining the installation position of a digital signal camera and configuring parameters of a video stream, wherein the parameters comprise video resolution, a main code stream, a sub code stream and a compression format;
the communication module is used for transmitting the video stream acquired by the digital signal camera between the image acquisition module and the image processing module, the communication module adopts Ethernet communication, the Ethernet communication comprises local area network communication and wide area network communication, and the local area network communication or the wide area network communication is adapted according to an emergency scene;
the function configuration module is used for configuring the position information of the digital signal camera, configuring the emergency event identification function of the digital signal camera, and configuring the acquisition parameters of the video stream to obtain a compressed video stream (see the configuration sketch after this list);
the image processing module carries out sample acquisition from the compressed video stream by adopting timing snapshot, inputs the sample into a functional algorithm pool to judge whether an emergency event occurs in an emergency scene, and outputs an undifferentiated snapshot image for identifying the sample, for example, the video stream is snapshot once every 4s at the acquisition frequency of 0.25 Hz;
the event display module is used for displaying the snapshot image, the snapshot time, the snapshot position and the emergency event name.
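As an illustration of the per-camera record the function configuration module might keep (referenced above), here is a sketch; all field names and defaults are assumptions, and only the 7 × 24 h effective time and 0.25 Hz frequency come from the embodiment.

```python
from dataclasses import dataclass


@dataclass
class CameraConfig:
    """Illustrative per-camera record kept by the function configuration module."""
    camera_id: str                 # e.g. digital signal camera "A"
    position: str                  # installation position, shown later by the event display module
    event_function: str            # configured emergency event identification function
    resolution: str = "1920x1080"  # assumed main-stream resolution
    effective_time: str = "7x24h"  # acquisition effective time from the embodiment
    snapshot_hz: float = 0.25      # acquisition frequency: one snapshot every 4 s


# Example: camera A in the fire-monitoring scene of the embodiment
config_a = CameraConfig(camera_id="A",
                        position="parking lot corner, 2 m height",
                        event_function="fire monitoring")
```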
According to the invention, firstly, the conventional super-brain video-stream recognition mode is replaced by an image-unit snapshot mode using a general network architecture and ordinary front-end digital signal cameras, and the frequency setting greatly reduces the data computing resources required without affecting the final event recognition result; secondly, the low-frequency sample queue is passed through each functional algorithm pool to achieve multifunctional recognition; and finally, a new emergency-scene recognition function can be added, when a region requires it, by adding and training a new functional algorithm pool, which improves deployment efficiency.
When a forest fire occurs, a large amount of dense smoke is produced. The video monitoring transmits the video stream to the central machine room in real time, and the image processing module of the central machine room takes an image snapshot of the video stream every 4 s. When an image containing flame or heavy smoke passes through the flame recognition algorithm pool or the dense-smoke recognition algorithm pool, it is output as a snapshot image of a flame event or a dense-smoke event and a prompt is generated for the emergency duty officer, who can manually start the emergency response after judging the event image and viewing the on-site video.
The specific determination steps are as follows:
s1, collecting images: s11, installing video monitoring in the parking lot, selecting corners at a height of 2 m and key fire-prevention positions so that the area is monitored with full coverage, and providing usable power and information channels;
s12, configuring parameters such as video resolution, main code stream, sub code stream and compression format on the video monitoring configuration interface to ensure that the image acquisition function is complete, continuous and effective;
s2, building the communication environment: s21, building a network environment according to the emergency scene, connecting equipment such as switches and routers through network cables into a network system that can transmit signals stably;
s22, after the network is built, performing network configuration in the router, configuring information such as IP addresses, subnet masks and gateways; the network configuration information must cover the image acquisition equipment;
s3, configuring functions: s31, after the network is built, performing function configuration by clicking the 'function configuration' button on the toolbar and selecting the monitoring point to be configured;
s32, after the monitoring point is selected, selecting the fire monitoring event from the emergency event function list, ticking the check box, clicking 'configuration' and confirming the configuration result, so as to configure the fire-monitoring event judgment function of the corresponding camera;
s33, after the function configuration is finished, clicking 'image unit configuration' on the toolbar to enter the configuration of the image acquisition unit, and configuring the effective acquisition time as 7 × 24 h and the acquisition frequency as 0.25 Hz, which compresses the computing requirement to less than five thousandths of that of the original 60 Hz scenario;
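As a quick arithmetic check of the "five thousandths" figure (taking the conventional full-rate stream to run at 60 Hz, as stated above): 0.25 Hz ÷ 60 Hz ≈ 0.0042, i.e. about 0.42%, which is indeed below five thousandths.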
s4, image processing: s41, training a flame recognition function and a dense smoke recognition function through a YOLOV5 neural network model;
s42, after the configuration of the acquisition module is completed, the image processing module starts to acquire identification samples, and the video stream is received and simultaneously the undifferentiated snapshot is carried out according to the frequency of 0.25 Hz;
s43, after the samples are collected, screening the samples one by one through a flame recognition algorithm pool and a dense smoke recognition algorithm pool, and outputting the samples as snapshot images of suspected fire events when a certain sample meets screening conditions;
and S5, after the image processing is finished, displaying, through a visual human-computer interaction interface, the image, snapshot time, snapshot position, emergency event name and other information of the samples that meet the screening condition.
The above description covers only the preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any modification or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention, based on the technical solutions and the inventive concept thereof, shall fall within the protection scope of the present invention.

Claims (7)

1. An emergency event judgment method based on video monitoring is characterized by comprising the following steps:
s1, installing corresponding digital signal cameras in different emergency scenes and collecting video streams;
s2, training emergency event models aiming at different emergency scenes, wherein the emergency event models have different emergency event identification functions, and constructing an emergency event function list;
s3, selecting a digital signal camera A of a certain emergency scene X, and configuring a corresponding emergency event X identification function B for the digital signal camera A from the emergency event function list;
s4, processing the video stream A0 acquired by the digital signal camera A, and configuring the effective acquisition time and the acquisition frequency of the video stream A0 to obtain a video stream A1;
s5, collecting identification samples a0 from the video stream A1, and simultaneously taking undifferentiated snapshots of the video stream A1 at a set frequency to obtain snapshot images a1;
s6, screening the identification sample a0 through a functional algorithm pool B, judging that an emergency event X occurs in the emergency scene X when the identification sample a0 meets the screening condition, and outputting a snapshot image a1, wherein the functional algorithm pool B is an identification algorithm of an identification function B of the emergency event X;
and S7, displaying the snapshot image a1 together with the corresponding snapshot time, the snapshot position and the name of emergency event x through a visual human-computer interaction interface.
2. The video monitoring-based emergency event determination method according to claim 1, wherein in step S2, the emergency event identification function is implemented by a functional algorithm pool, and the functional algorithm pool constructing step includes:
s21, loading pictures of the emergency event in the emergency event model with the deep learning annotation tool labelImg, selecting the yolo labeling format, drawing and labeling boxes around the target detection objects in the pictures, and, after the labeling is completed, saving the result as a folder T, wherein different folders T correspond to different emergency events and are configured with different emergency event functions;
s22, the pictures in the folder T are subjected to adaptive scaling after being subjected to Mosaic enhancement and then are input into a YOLOV5 neural network model;
s23, obtaining the position and the size of a prediction frame of a target detection object in the picture and the type of an included emergency area through forward propagation, wherein the forward propagation comprises three parts, namely feature extraction, feature fusion and detection head;
s24, calculating the difference between the prediction frame and the real frame by using a loss function;
s25, iteratively updating a weight matrix and a bias in forward propagation through gradient descent to reduce loss between the prediction frame and the real frame;
s26, solving for the weight matrix and bias at which the loss function reaches its minimum value within the set number of iterations;
s27, the weight matrix and the bias are used as parameters of forward propagation in the detection stage to obtain prediction information for identifying a target detection object in a sample;
s28, repeating the steps S21-S27, training different emergency events through the YOLOV5 neural network model, generating corresponding algorithm files after training, and packaging and deploying the algorithm files to the algorithm pool to obtain the functional algorithm pool.
3. The method of claim 2, wherein in step S23, the anchor point mechanism is used to predict the location and the type of the target detection object, and the anchor point mechanism predicting step comprises:
after feature extraction and feature fusion are performed on an input picture, three down-sampled feature maps are obtained; each grid cell in a feature map has 3 × (1 + 4 + C) channels and is used to predict targets at three different sizes, where 3 is the number of anchor boxes, 1 is the anchor-box confidence, 4 is the offsets (t_x, t_y, t_w, t_h) of the anchor-box coordinates relative to the prior anchor-box coordinates obtained after training the YOLOV5 neural network model, the prior anchor-box sizes are the target anchor-box sizes obtained by the YOLOV5 neural network model by running the k-means algorithm on the folder T, and C is the total number of target categories;
Confidence = Pr(Object) × IoU(pred, truth);
where Confidence is the anchor-box confidence, Pr(Object) indicates whether the center-point coordinate of the target detection object falls within the anchor box (1 if it does, 0 otherwise), and IoU(pred, truth) is the intersection-over-union of the anchor box and the real box, i.e. the intersection area of the two boxes divided by their union area:
IoU(pred, truth) = area(B ∩ B^gt) / area(B ∪ B^gt);
if the center point of a target detection object lies in a grid cell, that cell is responsible for predicting the position and size of the anchor box for that target and the class confidence of the anchor box;
b_x = factor × σ(t_x) + c_x;
b_y = factor × σ(t_y) + c_y;
b_w = p_w · e^(t_w);
b_h = p_h · e^(t_h);
Class-Specific Confidence Score = Pr(class_i | object) × Confidence;
where b_x and b_y are the center-point coordinates of the prediction box, b_w and b_h are its width and height, c_x and c_y are the coordinates of the grid cell relative to the top-left corner (0, 0) of the picture, p_w and p_h are the width and height of the prior anchor box, t_x and t_y are the predicted offsets of the box center (and t_w, t_h of its width and height relative to the prior box), σ(·) is the Sigmoid function that maps t_x and t_y into the interval (0, 1), factor is a scaling factor with a value greater than 1.0, Class-Specific Confidence Score is the class confidence of the anchor box, and Pr(class_i | object) is the probability that the target belongs to the i-th class.
4. The method for determining emergency events based on video surveillance as claimed in claim 3, wherein in step S23, target detection objects of different sizes are predicted using a multi-scale network: after feature extraction and feature fusion, the input image yields three feature maps down-sampled by 8, 16 and 32 times, and each grid cell in a feature map has 3 anchor boxes of different sizes, so that target detection objects of 9 different sizes can be predicted;
the multi-scale prediction is realized by combining a feature pyramid network (FPN) and a path aggregation network (PAN), wherein the FPN layers pass the rich semantic features of the top layers down to the bottom layers, and the PAN layers pass the accurate localization information of the bottom layers up to the top layers.
5. The video surveillance-based emergency event determination method of claim 4, wherein the YOLOV5 neural network model comprises a Backbone network Backbone, a Neck network Neck and a Head network Head;
the Backbone network Backbone comprises a Focus module, a CBL module, a CSP module and an SPP module and is responsible for feature extraction of a target detection object;
the Focus module first makes four copies of the input image, applies interval pixel slicing to each copy, and finally channel-fuses the four slices to obtain a two-fold down-sampled image with no loss of information;
the CBL module refers to convolution, batch normalization and Leaky _ ReLU function activation of the image;
the CSP module comprises a CSP1 module and a CSP2 module: the CSP1 module splits the input feature map into two branches, passes one branch through a residual structure followed by a convolution and then channel-fuses it with the convolved output of the other branch; the CSP2 module splits the input feature map into two branches, passes one branch through two CBL modules followed by a convolution and then channel-fuses it with the convolved output of the other branch;
the SPP module applies max pooling with several different filter sizes to the input feature map and then channel-fuses the original feature map with the three pooled results;
the Neck network Neck adopts a PANet polymerization structure to fuse the features extracted by the Backbone network Backbone;
in the Head network Head, three detection heads are adopted to perform downsampling on an input image by 8 times, 16 times and 32 times respectively, and three feature vectors with different sizes are generated respectively and are used for detecting target detection objects with different sizes.
6. The method for determining an emergency event based on video surveillance as claimed in claim 5, wherein in step S24, the loss function calculation includes a confidence loss, a classification loss and a bounding-box loss, the confidence loss and the classification loss use the binary cross-entropy loss function, and the bounding-box loss uses the GIoU loss function;
BCELoss=-logP',y=1;
BCELoss=-log(1-P'),y=0;
where BCELoss is the binary cross-entropy loss, P' is the predicted value for an identification sample, y is its true value, y = 1 means the sample is a target of the class, and y = 0 means it is not;
L_GIoU = 1 - IoU + area(C \ (B ∪ B^gt)) / area(C);
where L_GIoU is the GIoU loss, B is the prediction box, B^gt is the real box, and C is the smallest rectangle containing both the prediction box and the real box.
7. An emergency event judgment device based on video monitoring is characterized by comprising an image acquisition module, a communication module, a function configuration module, an image processing module and an event display module;
the image acquisition module is used for setting an image acquisition environment state, and comprises the steps of determining the installation position of a digital signal camera and configuring parameters of a video stream, wherein the parameters comprise video resolution, a main code stream, a subcode stream and a compression format;
the communication module is used for transmitting the video stream acquired by the digital signal camera between the image acquisition module and the image processing module, the communication module adopts Ethernet communication, the Ethernet communication comprises local area network communication and wide area network communication, and the local area network communication or the wide area network communication is adapted according to an emergency scene;
the function configuration module is used for configuring the position information of the digital signal camera, configuring the emergency event identification function of the digital signal camera and configuring the acquisition parameters of the video stream to obtain a compressed video stream;
the image processing module collects samples from compressed video streams by adopting timing snapshot, inputs the samples into a functional algorithm pool to judge whether an emergency event occurs in an emergency scene or not, and outputs undifferentiated snapshot images for identifying the samples;
the event display module is used for displaying the snapshot image, the snapshot time, the snapshot position and the emergency event name.
CN202211043197.8A 2022-08-29 2022-08-29 Video monitoring-based emergency event judgment method and device Pending CN115578664A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211043197.8A CN115578664A (en) 2022-08-29 2022-08-29 Video monitoring-based emergency event judgment method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211043197.8A CN115578664A (en) 2022-08-29 2022-08-29 Video monitoring-based emergency event judgment method and device

Publications (1)

Publication Number Publication Date
CN115578664A true CN115578664A (en) 2023-01-06

Family

ID=84580046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211043197.8A Pending CN115578664A (en) 2022-08-29 2022-08-29 Video monitoring-based emergency event judgment method and device

Country Status (1)

Country Link
CN (1) CN115578664A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116886991A (en) * 2023-08-21 2023-10-13 珠海嘉立信发展有限公司 Method, apparatus, terminal device and readable storage medium for generating video data
CN116886991B (en) * 2023-08-21 2024-05-03 珠海嘉立信发展有限公司 Method, apparatus, terminal device and readable storage medium for generating video data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination