CN115578664A - Video monitoring-based emergency event judgment method and device - Google Patents

Video monitoring-based emergency event judgment method and device

Info

Publication number
CN115578664A
Authority
CN
China
Prior art keywords
emergency
module
emergency event
image
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211043197.8A
Other languages
Chinese (zh)
Inventor
陈卫强
倪春
姚天一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Half Cloud Technology Co ltd
Original Assignee
Hangzhou Half Cloud Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Half Cloud Technology Co ltd filed Critical Hangzhou Half Cloud Technology Co ltd
Priority to CN202211043197.8A priority Critical patent/CN115578664A/en
Publication of CN115578664A publication Critical patent/CN115578664A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 - Television systems
    • H04N7/18 - Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an emergency event judgment method and device based on video monitoring, comprising the following steps: S1, collecting video streams; S2, training emergency event models that provide different emergency event identification functions; and S3, selecting a digital signal camera A of a certain emergency scene X and configuring a corresponding emergency event X identification function B for the digital signal camera A from the emergency event function list. According to the invention, firstly, the conventional super-brain video-stream recognition mode is replaced by an image-unit snapshot mode using a general network architecture and ordinary front-end digital signal cameras, and the frequency setting greatly reduces the data computing resources required without affecting the final event recognition result; secondly, the low-frequency sample queue is passed through each functional algorithm pool to achieve multifunctional recognition; and finally, a new emergency-scene recognition function can be added, when a region requires it, by adding and training a new functional algorithm pool, which improves deployment efficiency.

Description

Video monitoring-based emergency event judgment method and device
Technical Field
The invention relates to the technical field of emergency event management, in particular to an emergency event judgment method and device based on video monitoring.
Background
With the rapid development of information technology, artificial intelligence technology represented by machine vision has entered a period of rapid development and practical deployment. Using machine vision instead of human vision to perceive and discover problems has become a technical means widely applied in many fields, and in the technical field of emergency management, using machine vision equipment to recognize the classic image signal of an emergency event as the entry point for detecting and classifying the event has gradually become the mainstream scheme.
However, firstly, because the internal hardware performance of edge-computing machine vision devices is limited, event processing is highly specialized: a given device can only detect one problem or one class of problems, so multiple devices may be needed when several kinds of emergency event have to be detected. Secondly, center super-brain machine vision systems place low requirements on the edge devices but impose heavy demands on the network architecture and computing cost; for example, when 8 recognition functions are deployed on a 16-channel super-brain, although only one type of front-end machine vision device is needed, only 2 front-end devices can be deployed in total, so large-area coverage deployment still cannot be achieved in most emergency scenes.
In summary, although the scheme of using machine vision equipment to recognize the classic image signal of an emergency event as the entry point for detecting and classifying its occurrence has become mainstream in the emergency management field, it is limited by cost, and most projects cannot achieve all-round, high-density sensing coverage.
Disclosure of Invention
In order to solve the technical problems mentioned in the background art, a method and an apparatus for determining an emergency event based on video monitoring are provided.
In order to achieve the purpose, the invention adopts the following technical scheme:
an emergency event judgment method based on video monitoring comprises the following steps:
s1, installing corresponding digital signal cameras in different emergency scenes and collecting video streams;
s2, training emergency event models aiming at different emergency scenes, wherein the emergency event models have different emergency event identification functions, and constructing an emergency event function list;
s3, selecting a digital signal camera A of a certain emergency scene X, and configuring a corresponding emergency event X identification function B for the digital signal camera A from the emergency event function list;
s4, processing the video stream A0 acquired by the digital signal camera A, configuring the effective acquisition time and acquisition frequency of the video stream A0, and acquiring a video stream A1;
s5, collecting identification samples a0 from the video stream A1, and simultaneously taking undifferentiated snapshots of the video stream A1 at the set frequency to obtain snapshot images a1;
s6, screening the identification sample a0 through a functional algorithm pool B, judging that an emergency event X occurs in the emergency scene X when the identification sample a0 meets a screening condition, and outputting a snapshot image a1, wherein the functional algorithm pool B is an identification algorithm of an identification function B of the emergency event X;
and S7, displaying the snapshot image a1 together with the corresponding snapshot time, the snapshot position and the name of emergency event x through a visual human-computer interaction interface.
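A minimal sketch of the S4-S7 loop follows, assuming OpenCV (cv2) is used for stream capture; `algorithm_pool_b` is a hypothetical callable standing in for functional algorithm pool B, and the stream address is illustrative:

```python
import time
import cv2  # assumed dependency for reading the camera stream


def monitor(rtsp_url, algorithm_pool_b, snapshot_hz=0.25):
    """Steps S4-S7: sample video stream A0 at the configured frequency and
    screen each identification sample through functional algorithm pool B."""
    cap = cv2.VideoCapture(rtsp_url)      # video stream A0 from digital signal camera A
    period = 1.0 / snapshot_hz            # 0.25 Hz -> one undifferentiated snapshot every 4 s
    while cap.isOpened():
        ok, frame = cap.read()            # identification sample a0 / snapshot image a1
        if not ok:
            break
        if algorithm_pool_b(frame):       # screening condition met -> emergency event x
            yield frame, time.time()      # hand the snapshot and its time to the display layer
        time.sleep(period)                # low-frequency sampling instead of full-rate analysis
    cap.release()
```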
As a further description of the above technical solution:
in step S2, the emergency event recognition function is realized by a functional algorithm pool, and the functional algorithm pool is constructed through the following steps:
s21, loading pictures of the emergency event in the emergency event model with the deep learning annotation tool labelImg, selecting the yolo labeling format, drawing and labeling the real boxes of the target detection objects in the pictures, and, after the labeling is completed, saving the result as a folder T, wherein different folders T correspond to different emergency events and are configured with different emergency event functions;
s22, applying Mosaic enhancement to the pictures in the folder T, adaptively scaling them, and then inputting them into the YOLOV5 neural network model;
s23, obtaining the position and the size of a prediction frame of a target detection object in the picture and the type of an included emergency area through forward propagation, wherein the forward propagation comprises three parts, namely feature extraction, feature fusion and detection head;
s24, calculating the difference between the prediction frame and the real frame by using a loss function;
s25, iteratively updating a weight matrix and a bias in forward propagation through gradient descent to reduce the loss between the prediction frame and the real frame;
s26, solving for the weight matrix and bias at which the loss function reaches its minimum value within the set number of iterations;
s27, the weight matrix and the bias are used as parameters of forward propagation in the detection stage to obtain prediction information for identifying a target detection object in a sample;
s28, repeating the steps S21-S27, training different emergency events through the YOLOV5 neural network model, generating corresponding algorithm files after training, and packaging and deploying the algorithm files to the algorithm pool to obtain the functional algorithm pool.
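To make step S28 concrete, here is a hedged sketch of how one functional algorithm could be trained, assuming the publicly available Ultralytics YOLOv5 repository is used (the patent does not name a specific codebase) and assuming the labelImg annotations in folder T have been exported into the usual images/labels layout; the paths, class names and epoch count are illustrative.

```python
import subprocess

# Illustrative dataset description in the YOLOv5 data.yaml format;
# nc corresponds to C, the total number of target categories.
DATA_YAML = """\
train: datasets/fire/images/train
val: datasets/fire/images/val
nc: 2
names: ["flame", "dense_smoke"]
"""

with open("fire.yaml", "w") as f:
    f.write(DATA_YAML)

# Train one emergency-event recognition function (steps S22-S27); Mosaic
# enhancement and adaptive 640x640 letterboxing are applied by the YOLOv5
# training pipeline itself.
subprocess.run(
    ["python", "train.py",
     "--img", "640",
     "--data", "fire.yaml",
     "--weights", "yolov5s.pt",
     "--epochs", "300"],
    check=True,
)

# The resulting weights would then be packaged and deployed into the
# algorithm pool as one functional algorithm (step S28).
```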
As a further description of the above technical solution:
in step S23, the position and the category of the target detection object are predicted by using an anchor point mechanism, and the anchor point mechanism prediction specifically includes the following steps:
after feature extraction and feature fusion are performed on an input picture, three down-sampled feature maps are obtained; each grid cell in a feature map has 3 × (1 + 4 + C) channels and is used to predict targets at three different sizes, where 3 is the number of anchor boxes, 1 is the anchor-box confidence, 4 is the offsets (t_x, t_y, t_w, t_h) of the anchor-box coordinates relative to the prior anchor-box coordinates obtained after training the YOLOV5 neural network model, the prior anchor-box sizes are the target anchor-box sizes obtained by the YOLOV5 neural network model by running the k-means algorithm on the folder T, and C is the total number of target categories;
Confidence = Pr(Object) × IoU(pred, truth);
where Confidence is the anchor-box confidence, Pr(Object) indicates whether the center-point coordinate of the target detection object falls within the anchor box (1 if it does, 0 otherwise), and IoU(pred, truth) is the intersection-over-union of the anchor box and the real box, i.e. the intersection area of the two boxes divided by their union area:
IoU(pred, truth) = area(B ∩ B^gt) / area(B ∪ B^gt);
if the center point of a target detection object lies in a grid cell, that cell is responsible for predicting the position and size of the anchor box for that target and the class confidence of the anchor box;
b_x = factor × σ(t_x) + c_x;
b_y = factor × σ(t_y) + c_y;
b_w = p_w · e^(t_w);
b_h = p_h · e^(t_h);
Class-Specific Confidence Score = Pr(class_i | object) × Confidence;
where b_x and b_y are the center-point coordinates of the prediction box, b_w and b_h are its width and height, c_x and c_y are the coordinates of the grid cell relative to the top-left corner (0, 0) of the picture, p_w and p_h are the width and height of the prior anchor box, t_x and t_y are the predicted offsets of the box center (and t_w, t_h of its width and height relative to the prior box), σ(·) is the Sigmoid function that maps t_x and t_y into the interval (0, 1), factor is a scaling factor with a value greater than 1.0, Class-Specific Confidence Score is the class confidence of the anchor box, and Pr(class_i | object) is the probability that the target belongs to the i-th class.
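A small numerical sketch of the decoding formulas above, following them exactly as written (including b_w = p_w·e^(t_w)); the concrete value of `factor` is an assumption, since the text only states that it is greater than 1.0:

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h, factor=2.0):
    """Apply b_x = factor*sigma(t_x)+c_x, b_y = factor*sigma(t_y)+c_y,
    b_w = p_w*e^(t_w), b_h = p_h*e^(t_h) for one anchor box."""
    b_x = factor * sigmoid(t_x) + c_x   # box center x, in grid-cell units
    b_y = factor * sigmoid(t_y) + c_y   # box center y, in grid-cell units
    b_w = p_w * np.exp(t_w)             # width scaled from the prior anchor width p_w
    b_h = p_h * np.exp(t_h)             # height scaled from the prior anchor height p_h
    return b_x, b_y, b_w, b_h
```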
As a further description of the above technical solution:
in step S23, target detection objects of different sizes are predicted using a multi-scale network: after feature extraction and feature fusion, the input image yields three feature maps down-sampled by 8, 16 and 32 times, and each grid cell in a feature map has 3 anchor boxes of different sizes, so that target detection objects of 9 different sizes can be predicted;
the multi-scale prediction is realized by combining a feature pyramid network (FPN) and a path aggregation network (PAN), wherein the FPN layers pass the rich semantic features of the top layers down to the bottom layers, and the PAN layers pass the accurate localization information of the bottom layers up to the top layers.
As a further description of the above technical solution:
the YOLOV5 neural network model comprises a Backbone network Backbone, a Neck network Neck and a Head network Head;
the Backbone network Backbone comprises a Focus module, a CBL module, a CSP module and an SPP module and is responsible for feature extraction of a target detection object;
the Focus module first makes four copies of the input image, applies interval pixel slicing to each copy, and finally channel-fuses the four slices to obtain a two-fold down-sampled image with no loss of information (see the sketch after this list);
the CBL module refers to convolution, batch normalization and Leaky _ ReLU function activation of the image;
the CSP module comprises a CSP1 module and a CSP2 module: the CSP1 module splits the input feature map into two branches, passes one branch through a residual structure followed by a convolution and then channel-fuses it with the convolved output of the other branch; the CSP2 module splits the input feature map into two branches, passes one branch through two CBL modules followed by a convolution and then channel-fuses it with the convolved output of the other branch;
the SPP module applies max pooling with several different filter sizes to the input feature map and then channel-fuses the original feature map with the three pooled results;
the Neck network Neck adopts a PANet polymerization structure to fuse the features extracted by the Backbone network Backbone;
in the Head network Head, three detection heads are adopted to perform downsampling on an input image by 8 times, 16 times and 32 times respectively, and three feature vectors with different sizes are generated respectively and are used for detecting target detection objects with different sizes.
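The Focus slicing and CBL structure described above can be sketched in PyTorch as follows; this is an illustrative reconstruction in line with common YOLOv5 implementations, not code taken from the patent, and the channel counts are assumptions.

```python
import torch
import torch.nn as nn


class Focus(nn.Module):
    """Slice the input into four pixel-interleaved copies, channel-fuse them,
    then apply a CBL block; yields a 2x down-sampled map with no pixels lost."""

    def __init__(self, in_ch=3, out_ch=64, k=3):
        super().__init__()
        self.cbl = nn.Sequential(
            nn.Conv2d(in_ch * 4, out_ch, k, stride=1, padding=k // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1, inplace=True),   # CBL: Conv + BatchNorm + Leaky ReLU
        )

    def forward(self, x):                         # x: (N, C, H, W)
        sliced = torch.cat([x[..., ::2, ::2],     # even rows, even cols
                            x[..., 1::2, ::2],    # odd rows, even cols
                            x[..., ::2, 1::2],    # even rows, odd cols
                            x[..., 1::2, 1::2]],  # odd rows, odd cols
                           dim=1)                 # 4*C channels, H/2 x W/2
        return self.cbl(sliced)
```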
As a further description of the above technical solution:
in step S24, the loss function calculation includes a confidence loss, a classification loss and a bounding-box loss; the confidence loss and the classification loss use the binary cross-entropy loss function, and the bounding-box loss uses the GIoU loss function;
BCELoss = -log P', y = 1;
BCELoss = -log(1 - P'), y = 0;
where BCELoss is the binary cross-entropy loss, P' is the predicted value for an identification sample, y is its true value, y = 1 means the sample is a target of the class, and y = 0 means it is not;
L_GIoU = 1 - IoU + area(C \ (B ∪ B^gt)) / area(C);
where L_GIoU is the GIoU loss, B is the prediction box, B^gt is the real box, and C is the smallest rectangle containing both the prediction box and the real box.
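For concreteness, here is a sketch of the GIoU bounding-box loss defined above, in PyTorch, for boxes given in (x1, y1, x2, y2) format; it is an illustrative implementation rather than code from the patent.

```python
import torch


def giou_loss(pred, target):
    """L_GIoU = 1 - IoU + area(C \\ (B ∪ B_gt)) / area(C) for (x1, y1, x2, y2) boxes."""
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    union = area_p + area_t - inter
    iou = inter / union.clamp(min=1e-7)

    # C: smallest rectangle enclosing both the prediction box and the real box
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c_area = (cw * ch).clamp(min=1e-7)

    return 1.0 - iou + (c_area - union) / c_area
```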
As a further description of the above technical solution:
an emergency event judgment device based on video monitoring comprises an image acquisition module, a communication module, a function configuration module, an image processing module and an event display module;
the image acquisition module is used for setting an image acquisition environment state, and comprises the steps of determining the installation position of a digital signal camera and configuring parameters of a video stream, wherein the parameters comprise video resolution, a main code stream, a sub code stream and a compression format;
the communication module is used for transmitting the video stream acquired by the digital signal camera between the image acquisition module and the image processing module, the communication module adopts Ethernet communication, the Ethernet communication comprises local area network communication and wide area network communication, and the local area network communication or the wide area network communication is adapted according to an emergency scene;
the function configuration module is used for configuring the position information of the digital signal camera, configuring the emergency event identification function of the digital signal camera and configuring the acquisition parameters of the video stream to obtain a compressed video stream;
the image processing module collects samples from compressed video streams by adopting timing snapshot, inputs the samples into a functional algorithm pool to judge whether an emergency event occurs in an emergency scene or not, and outputs undifferentiated snapshot images for identifying the samples;
the event display module is used for displaying the snapshot image, the snapshot time, the snapshot position and the emergency event name.
In summary, by adopting the above technical scheme, the invention has the following beneficial effects: firstly, the conventional super-brain video-stream recognition mode is replaced by an image-unit snapshot mode using a general network architecture and ordinary front-end digital signal cameras, and the frequency setting greatly reduces the data computing resources required without affecting the final event recognition result; secondly, passing the low-frequency sample queue through each functional algorithm pool enables multifunctional recognition; and finally, a new emergency-scene recognition function can be added, when a region requires it, simply by adding and training a new functional algorithm pool, which improves deployment efficiency.
Drawings
Fig. 1 is a schematic flowchart illustrating an emergency event determination method based on video surveillance according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram illustrating an emergency event determination device based on video surveillance according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1 and fig. 2, the present invention provides a technical solution: an emergency event judgment method based on video monitoring comprises the following steps:
s1, installing corresponding digital signal cameras in different emergency scenes and collecting video streams;
s2, training emergency event models aiming at different emergency scenes, wherein the emergency event models have different emergency event recognition functions, and constructing an emergency event function list;
s21, loading pictures of the emergency event in the emergency event model with the deep learning annotation tool labelImg, selecting the yolo labeling format, drawing and labeling the real boxes of the target detection objects in the pictures, and, after the labeling is completed, saving the result as a folder T, wherein different folders T correspond to different emergency events and are configured with different emergency event functions;
specifically, taking a fire event as an example, pictures of various types of fire and flame are prepared and loaded with the deep learning annotation tool labelImg, the yolo labeling format is selected, the fire and flame regions in the pictures are box-selected and labeled, and after the labeling is completed the result is saved as a folder T, which contains the pictures and the label texts;
s22, applying Mosaic enhancement to the pictures in the folder T, adaptively scaling them to 640 × 640, and then inputting them into the YOLOV5 neural network model;
the YOLOV5 neural network model comprises a Backbone network Backbone, a Neck network Neck and a Head network Head; compared with the other models of the YOLO series its network structure is more complex, and several techniques are used in the data enhancement and training strategies to improve the detection accuracy and speed of the model;
the Backbone network Backbone comprises a Focus module, a CBL module, a CSP module and an SPP module and is responsible for feature extraction of a target detection object;
the Focus module first makes four copies of the input image, applies interval pixel slicing to each copy, and finally channel-fuses the four slices to obtain a two-fold down-sampled image with no loss of information;
the CBL module refers to convolution, batch normalization and Leaky _ ReLU function activation of the image;
the CSP module comprises a CSP1 module and a CSP2 module: the CSP1 module splits the input feature map into two branches, passes one branch through a residual structure followed by a convolution and then channel-fuses it with the convolved output of the other branch; the CSP2 module splits the input feature map into two branches, passes one branch through two CBL modules followed by a convolution and then channel-fuses it with the convolved output of the other branch;
the SPP module applies 5 × 5, 9 × 9 and 13 × 13 max pooling to the input feature map respectively, and then channel-fuses the original feature map with the three pooled results;
the Neck network Neck adopts a PANet polymerization structure to fuse the features extracted by the Backbone network Backbone;
in the Head network Head, three detection heads are adopted to respectively carry out 8-time, 16-time and 32-time down sampling on an input image, and three feature vectors with different sizes are respectively generated and used for detecting target detection objects with different sizes;
s23, obtaining the position and the size of a prediction frame of a target detection object in the picture and the type of an included emergency area through forward propagation, wherein the forward propagation comprises three parts, namely feature extraction, feature fusion and detection head;
in step S23, the anchor point mechanism is used to predict the position and the category of the target detection object, and the anchor point mechanism prediction specifically comprises the following steps:
after feature extraction and feature fusion are performed on an input picture, three down-sampled feature maps are obtained, with down-sampling factors of 8, 16 and 32 in turn; for example, a 640 × 640 input emergency picture yields three feature maps of 80 × 80, 40 × 40 and 20 × 20. Each grid cell in a feature map has 3 × (1 + 4 + C) channels and is used to predict targets at three different sizes, where 3 is the number of anchor boxes, 1 is the anchor-box confidence, 4 is the offsets (t_x, t_y, t_w, t_h) of the anchor-box coordinates relative to the prior anchor-box coordinates obtained after training the YOLOV5 neural network model, the prior anchor-box sizes are the target anchor-box sizes obtained by the YOLOV5 neural network model by running the k-means algorithm on the folder T, and C is the total number of target categories;
Confidence = Pr(Object) × IoU(pred, truth);
where Confidence is the anchor-box confidence, Pr(Object) indicates whether the center-point coordinate of the target detection object falls within the anchor box (1 if it does, 0 otherwise), and IoU(pred, truth) is the intersection-over-union of the anchor box and the real box, i.e. the intersection area of the two boxes divided by their union area:
IoU(pred, truth) = area(B ∩ B^gt) / area(B ∪ B^gt);
if the center point of a target detection object lies in a grid cell, that cell is responsible for predicting the position and size of the anchor box for that target and the class confidence of the anchor box;
b_x = factor × σ(t_x) + c_x;
b_y = factor × σ(t_y) + c_y;
b_w = p_w · e^(t_w);
b_h = p_h · e^(t_h);
Class-Specific Confidence Score = Pr(class_i | object) × Confidence;
where b_x and b_y are the center-point coordinates of the prediction box, b_w and b_h are its width and height, c_x and c_y are the coordinates of the grid cell relative to the top-left corner (0, 0) of the picture, p_w and p_h are the width and height of the prior anchor box, t_x and t_y are the predicted offsets of the box center (and t_w, t_h of its width and height relative to the prior box), and σ(·) is the Sigmoid function that maps t_x and t_y into the open interval (0, 1). Because σ(·) never reaches exactly 0 or 1, no grid cell could be responsible for predicting a target whose center point lies exactly on a grid boundary; for this reason a scaling factor is introduced, which is generally greater than 1.0. Class-Specific Confidence Score is the class confidence of the anchor box, and Pr(class_i | object) is the probability that the target belongs to the i-th class;
in step S23, target detection objects of different sizes are predicted using a multi-scale network: after feature extraction and feature fusion, the input image yields three feature maps down-sampled by 8, 16 and 32 times, and each grid cell in a feature map has 3 anchor boxes of different sizes, so that target detection objects of 9 different sizes can be predicted;
the multi-scale prediction is realized by combining a feature pyramid network (FPN) and a path aggregation network (PAN), wherein the FPN layers pass the rich semantic features of the top layers down to the bottom layers and the PAN layers pass the accurate localization information of the bottom layers up to the top layers; the two complement each other, which improves both the accuracy of the predicted box positions and the accuracy of the predicted box categories;
to address the problem that the FPN only passes high-level semantic information back down to the shallow layers from top to bottom, leaving the position information of target detection objects insufficient, the YOLOV5 neural network model adds a bottom-up pyramid on top of the PAN structure as a supplement to the original FPN structure, forming the PANet network; the bottom-up path fuses information by mapping and superimposing the rich position information of the shallow layers onto the deep-layer features, passing bottom-layer position information back up to the higher layers, which further strengthens information transfer between different feature maps, accurately preserves spatial information, and effectively improves the network's ability to detect medium and large targets;
for the anchor boxes, if the model applied no suppression it would finally generate (80 × 80 + 40 × 40 + 20 × 20) × 3 = 25200 prediction boxes for one image; the final anchor box of a target is obtained by computing the intersection-over-union of each anchor box with the anchor box having the highest class confidence, and any anchor box whose intersection-over-union exceeds a preset threshold is discarded; this process is called Non-Maximum Suppression (NMS);
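A minimal sketch of the NMS step described above; the 0.45 IoU threshold is an assumption, since the text only speaks of "a preset threshold":

```python
import numpy as np


def nms(boxes, scores, iou_thresh=0.45):
    """Greedy non-maximum suppression over (x1, y1, x2, y2) boxes."""
    order = scores.argsort()[::-1]          # highest class confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-7)
        order = order[1:][iou <= iou_thresh]  # discard boxes overlapping the kept one
    return keep
```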
s24, calculating the difference between the prediction frame and the real frame by using a loss function;
the loss function calculation includes a confidence loss, a classification loss and a bounding-box loss; the confidence loss and the classification loss use the binary cross-entropy loss function, and the bounding-box loss uses the GIoU loss function;
BCELoss=-logP',y=1;
BCELoss=-log(1-P'),y=0;
where BCELoss is the binary cross-entropy loss, P' is the predicted value for an identification sample, y is its true value, y = 1 means the sample is a target of the class, and y = 0 means it is not;
L_GIoU = 1 - IoU + area(C \ (B ∪ B^gt)) / area(C);
where L_GIoU is the GIoU loss, B is the prediction box, B^gt is the real box, and C is the smallest rectangle containing both the prediction box and the real box;
s25, iteratively updating a weight matrix and a bias in forward propagation through gradient descent to reduce loss between the prediction frame and the real frame;
s26, solving for the weight matrix and bias at which the loss function reaches its minimum value within the set number of iterations;
s27, the weight matrix and the bias are used as parameters of forward propagation in the detection stage to obtain prediction information for identifying a target detection object in a sample;
s28, repeating the steps S21-S27, training different emergency events through a YOLOV5 neural network model, generating corresponding algorithm files after training, and packaging and deploying the algorithm files to an algorithm pool to obtain a functional algorithm pool;
s3, selecting a digital signal camera A of a certain emergency scene X, and configuring a corresponding emergency event X identification function B for the digital signal camera A from the emergency event function list;
s4, processing the video stream A0 acquired by the digital signal camera A: the effective acquisition time of the video stream A0 is configured as 7 × 24 h and the acquisition frequency as 0.25 Hz to obtain the video stream A1; the acquisition frequency can be set in the range 0.0001-60 Hz according to the application scene and cost requirements, and 0.25 Hz is preferred in this embodiment;
s5, collecting identification samples a0 from the video stream A1, and simultaneously taking undifferentiated snapshots of the video stream A1 at 0.25 Hz to obtain snapshot images a1;
s6, screening the identification sample a0 through the functional algorithm pool B: when the identification sample a0 meets the screening condition, it is judged that emergency event X has occurred in emergency scene X and the snapshot image a1 is output, wherein the functional algorithm pool B is the identification algorithm of identification function B of emergency event X; the screening condition is determined by the specific emergency scene, for example, in a scene where motor vehicles are prohibited from entering, the screening condition is that a motor vehicle is recognized in the image: the image is processed by the motor-vehicle recognition algorithm, and when the recognition result contains a motor vehicle the screening condition is judged to be met;
and S7, displaying the snapshot image a1 together with the corresponding snapshot time, the snapshot position and the name of emergency event x through a visual human-computer interaction interface.
The method realizes emergency event judgment through "snapshot + picture recognition", which, compared with the video-stream recognition adopted in the existing emergency management field, meets the usage requirements while greatly reducing cost.
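The multifunctional recognition described above (each low-frequency sample passed through every configured functional algorithm pool) can be sketched as follows; the `matches()` interface and the pool names are hypothetical:

```python
def screen_sample(sample, algorithm_pools):
    """Pass one identification sample through every configured functional
    algorithm pool and return the names of events whose screening condition
    is met. `algorithm_pools` maps event names to objects exposing a
    hypothetical matches(image) -> bool method (e.g. flame, dense smoke,
    motor vehicle)."""
    triggered = []
    for event_name, pool in algorithm_pools.items():
        if pool.matches(sample):          # screening condition for this event
            triggered.append(event_name)
    return triggered
```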
Referring to fig. 2, an emergency event determination apparatus based on video monitoring includes an image acquisition module, a communication module, a function configuration module, an image processing module, and an event display module;
the image acquisition module is used for setting an image acquisition environment state, and comprises the steps of determining the installation position of a digital signal camera and configuring parameters of a video stream, wherein the parameters comprise video resolution, a main code stream, a sub code stream and a compression format;
the communication module is used for transmitting the video stream acquired by the digital signal camera between the image acquisition module and the image processing module, the communication module adopts Ethernet communication, the Ethernet communication comprises local area network communication and wide area network communication, and the local area network communication or the wide area network communication is adapted according to an emergency scene;
the function configuration module is used for configuring the position information of the digital signal camera, configuring the emergency event identification function of the digital signal camera, and configuring the acquisition parameters of the video stream to obtain a compressed video stream (see the configuration sketch after this list);
the image processing module carries out sample acquisition from the compressed video stream by adopting timing snapshot, inputs the sample into a functional algorithm pool to judge whether an emergency event occurs in an emergency scene, and outputs an undifferentiated snapshot image for identifying the sample, for example, the video stream is snapshot once every 4s at the acquisition frequency of 0.25 Hz;
the event display module is used for displaying the snapshot image, the snapshot time, the snapshot position and the emergency event name.
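As an illustration of the per-camera record the function configuration module might keep (referenced above), here is a sketch; all field names and defaults are assumptions, and only the 7 × 24 h effective time and 0.25 Hz frequency come from the embodiment.

```python
from dataclasses import dataclass


@dataclass
class CameraConfig:
    """Illustrative per-camera record kept by the function configuration module."""
    camera_id: str                 # e.g. digital signal camera "A"
    position: str                  # installation position, shown later by the event display module
    event_function: str            # configured emergency event identification function
    resolution: str = "1920x1080"  # assumed main-stream resolution
    effective_time: str = "7x24h"  # acquisition effective time from the embodiment
    snapshot_hz: float = 0.25      # acquisition frequency: one snapshot every 4 s


# Example: camera A in the fire-monitoring scene of the embodiment
config_a = CameraConfig(camera_id="A",
                        position="parking lot corner, 2 m height",
                        event_function="fire monitoring")
```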
According to the invention, firstly, the conventional super-brain video-stream recognition mode is replaced by an image-unit snapshot mode using a general network architecture and ordinary front-end digital signal cameras, and the frequency setting greatly reduces the data computing resources required without affecting the final event recognition result; secondly, the low-frequency sample queue is passed through each functional algorithm pool to achieve multifunctional recognition; and finally, a new emergency-scene recognition function can be added, when a region requires it, by adding and training a new functional algorithm pool, which improves deployment efficiency.
When a forest fire occurs, a large amount of dense smoke is produced. The video monitoring transmits the video stream to the central machine room in real time, and the image processing module of the central machine room takes an image snapshot of the video stream every 4 s. When an image containing flame or heavy smoke passes through the flame recognition algorithm pool or the dense-smoke recognition algorithm pool, it is output as a snapshot image of a flame event or a dense-smoke event and a prompt is generated for the emergency duty officer, who can manually start the emergency response after judging the event image and viewing the on-site video.
The specific determination steps are as follows:
s1, collecting images: s11, installing video monitoring in the parking lot, selecting corners at a height of 2 m and key fire-prevention positions so that the area is monitored with full coverage, and providing usable power and information channels;
s12, configuring parameters such as video resolution, main code stream, sub code stream and compression format on the video monitoring configuration interface to ensure that the image acquisition function is complete, continuous and effective;
s2, building the communication environment: s21, building a network environment according to the emergency scene, connecting equipment such as switches and routers through network cables into a network system that can transmit signals stably;
s22, after the network is built, performing network configuration in the router, configuring information such as IP addresses, subnet masks and gateways; the network configuration information must cover the image acquisition equipment;
s3, configuring functions: s31, after the network is built, performing function configuration by clicking the 'function configuration' button on the toolbar and selecting the monitoring point to be configured;
s32, after the monitoring point is selected, selecting the fire monitoring event from the emergency event function list, ticking the check box, clicking 'configuration' and confirming the configuration result, so as to configure the fire-monitoring event judgment function of the corresponding camera;
s33, after the function configuration is finished, clicking 'image unit configuration' on the toolbar to enter the configuration of the image acquisition unit, and configuring the effective acquisition time as 7 × 24 h and the acquisition frequency as 0.25 Hz, which compresses the computing requirement to less than five thousandths of that of the original 60 Hz scenario;
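As a quick arithmetic check of the "five thousandths" figure (taking the conventional full-rate stream to run at 60 Hz, as stated above): 0.25 Hz ÷ 60 Hz ≈ 0.0042, i.e. about 0.42%, which is indeed below five thousandths.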
s4, image processing: s41, training a flame recognition function and a dense smoke recognition function through a YOLOV5 neural network model;
s42, after the configuration of the acquisition module is completed, the image processing module starts to acquire identification samples, and the video stream is received and simultaneously the undifferentiated snapshot is carried out according to the frequency of 0.25 Hz;
s43, after the samples are collected, screening the samples one by one through a flame recognition algorithm pool and a dense smoke recognition algorithm pool, and outputting the samples as snapshot images of suspected fire events when a certain sample meets screening conditions;
and S5, after the image processing is finished, displaying, through a visual human-computer interaction interface, the image, snapshot time, snapshot position, emergency event name and other information of the samples that meet the screening condition.
The above description covers only the preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any modification or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention, based on the technical solutions and the inventive concept thereof, shall fall within the protection scope of the present invention.

Claims (7)

1. An emergency event judgment method based on video monitoring is characterized by comprising the following steps:
s1, installing corresponding digital signal cameras in different emergency scenes and collecting video streams;
s2, training emergency event models aiming at different emergency scenes, wherein the emergency event models have different emergency event identification functions, and constructing an emergency event function list;
s3, selecting a digital signal camera A of a certain emergency scene X, and configuring a corresponding emergency event X identification function B for the digital signal camera A from the emergency event function list;
s4, processing the video stream A0 acquired by the digital signal camera A, and configuring the effective acquisition time and the acquisition frequency of the video stream A0 to obtain a video stream A1;
s5, collecting identification samples a0 from the video stream A1, and simultaneously taking undifferentiated snapshots of the video stream A1 at a set frequency to obtain snapshot images a1;
s6, screening the identification sample a0 through a functional algorithm pool B, judging that an emergency event X occurs in the emergency scene X when the identification sample a0 meets the screening condition, and outputting a snapshot image a1, wherein the functional algorithm pool B is an identification algorithm of an identification function B of the emergency event X;
and S7, displaying the snapshot image a1 together with the corresponding snapshot time, the snapshot position and the name of emergency event x through a visual human-computer interaction interface.
2. The video monitoring-based emergency event determination method according to claim 1, wherein in step S2, the emergency event identification function is implemented by a functional algorithm pool, and the functional algorithm pool constructing step includes:
s21, loading pictures of the emergency event in the emergency event model with the deep learning annotation tool labelImg, selecting the yolo labeling format, drawing and labeling boxes around the target detection objects in the pictures, and, after the labeling is completed, saving the result as a folder T, wherein different folders T correspond to different emergency events and are configured with different emergency event functions;
s22, the pictures in the folder T are subjected to adaptive scaling after being subjected to Mosaic enhancement and then are input into a YOLOV5 neural network model;
s23, obtaining the position and the size of a prediction frame of a target detection object in the picture and the type of an included emergency area through forward propagation, wherein the forward propagation comprises three parts, namely feature extraction, feature fusion and detection head;
s24, calculating the difference between the prediction frame and the real frame by using a loss function;
s25, iteratively updating a weight matrix and a bias in forward propagation through gradient descent to reduce loss between the prediction frame and the real frame;
s26, solving for the weight matrix and bias at which the loss function reaches its minimum value within the set number of iterations;
s27, the weight matrix and the bias are used as parameters of forward propagation in the detection stage to obtain prediction information for identifying a target detection object in a sample;
s28, repeating the steps S21-S27, training different emergency events through the YOLOV5 neural network model, generating corresponding algorithm files after training, and packaging and deploying the algorithm files to the algorithm pool to obtain the functional algorithm pool.
3. The method of claim 2, wherein in step S23, the anchor point mechanism is used to predict the location and the type of the target detection object, and the anchor point mechanism predicting step comprises:
after feature extraction and feature fusion are performed on an input picture, three down-sampled feature maps are obtained; each grid cell in a feature map has 3 × (1 + 4 + C) channels and is used to predict targets at three different sizes, where 3 is the number of anchor boxes, 1 is the anchor-box confidence, 4 is the offsets (t_x, t_y, t_w, t_h) of the anchor-box coordinates relative to the prior anchor-box coordinates obtained after training the YOLOV5 neural network model, the prior anchor-box sizes are the target anchor-box sizes obtained by the YOLOV5 neural network model by running the k-means algorithm on the folder T, and C is the total number of target categories;
Confidence = Pr(Object) × IoU(pred, truth);
where Confidence is the anchor-box confidence, Pr(Object) indicates whether the center-point coordinate of the target detection object falls within the anchor box (1 if it does, 0 otherwise), and IoU(pred, truth) is the intersection-over-union of the anchor box and the real box, i.e. the intersection area of the two boxes divided by their union area:
IoU(pred, truth) = area(B ∩ B^gt) / area(B ∪ B^gt);
if the center point of a target detection object lies in a grid cell, that cell is responsible for predicting the position and size of the anchor box for that target and the class confidence of the anchor box;
b_x = factor × σ(t_x) + c_x;
b_y = factor × σ(t_y) + c_y;
b_w = p_w · e^(t_w);
b_h = p_h · e^(t_h);
Class-Specific Confidence Score = Pr(class_i | object) × Confidence;
where b_x and b_y are the center-point coordinates of the prediction box, b_w and b_h are its width and height, c_x and c_y are the coordinates of the grid cell relative to the top-left corner (0, 0) of the picture, p_w and p_h are the width and height of the prior anchor box, t_x and t_y are the predicted offsets of the box center (and t_w, t_h of its width and height relative to the prior box), σ(·) is the Sigmoid function that maps t_x and t_y into the interval (0, 1), factor is a scaling factor with a value greater than 1.0, Class-Specific Confidence Score is the class confidence of the anchor box, and Pr(class_i | object) is the probability that the target belongs to the i-th class.
4. The method for determining emergency events based on video surveillance as claimed in claim 3, wherein in step S23, target detection objects of different sizes are predicted using a multi-scale network: after feature extraction and feature fusion, the input image yields three feature maps down-sampled by 8, 16 and 32 times, and each grid cell in a feature map has 3 anchor boxes of different sizes, so that target detection objects of 9 different sizes can be predicted;
the multi-scale prediction is realized by combining a feature pyramid network (FPN) and a path aggregation network (PAN), wherein the FPN layers pass the rich semantic features of the top layers down to the bottom layers, and the PAN layers pass the accurate localization information of the bottom layers up to the top layers.
5. The video surveillance-based emergency event determination method of claim 4, wherein the YOLOV5 neural network model comprises a Backbone network Backbone, a Neck network Neck and a Head network Head;
the Backbone network Backbone comprises a Focus module, a CBL module, a CSP module and an SPP module and is responsible for feature extraction of a target detection object;
the Focus module first makes four copies of the input image, applies interval pixel slicing to each copy, and finally channel-fuses the four slices to obtain a two-fold down-sampled image with no loss of information;
the CBL module refers to convolution, batch normalization and Leaky _ ReLU function activation of the image;
the CSP module comprises a CSP1 module and a CSP2 module: the CSP1 module splits the input feature map into two branches, passes one branch through a residual structure followed by a convolution and then channel-fuses it with the convolved output of the other branch; the CSP2 module splits the input feature map into two branches, passes one branch through two CBL modules followed by a convolution and then channel-fuses it with the convolved output of the other branch;
the SPP module applies max pooling with several different filter sizes to the input feature map and then channel-fuses the original feature map with the three pooled results;
the Neck network Neck adopts a PANet polymerization structure to fuse the features extracted by the Backbone network Backbone;
in the Head network Head, three detection heads are adopted to perform downsampling on an input image by 8 times, 16 times and 32 times respectively, and three feature vectors with different sizes are generated respectively and are used for detecting target detection objects with different sizes.
6. The method for determining an emergency event based on video surveillance as claimed in claim 5, wherein in step S24, the loss function calculation includes a confidence loss, a classification loss and a bounding-box loss, the confidence loss and the classification loss use the binary cross-entropy loss function, and the bounding-box loss uses the GIoU loss function;
BCELoss=-logP',y=1;
BCELoss=-log(1-P'),y=0;
where BCELoss is the binary cross-entropy loss, P' is the predicted value for an identification sample, y is its true value, y = 1 means the sample is a target of the class, and y = 0 means it is not;
L_GIoU = 1 - IoU + area(C \ (B ∪ B^gt)) / area(C);
where L_GIoU is the GIoU loss, B is the prediction box, B^gt is the real box, and C is the smallest rectangle containing both the prediction box and the real box.
7. An emergency event judgment device based on video monitoring is characterized by comprising an image acquisition module, a communication module, a function configuration module, an image processing module and an event display module;
the image acquisition module is used for setting an image acquisition environment state, and comprises the steps of determining the installation position of a digital signal camera and configuring parameters of a video stream, wherein the parameters comprise video resolution, a main code stream, a subcode stream and a compression format;
the communication module is used for transmitting the video stream acquired by the digital signal camera between the image acquisition module and the image processing module, the communication module adopts Ethernet communication, the Ethernet communication comprises local area network communication and wide area network communication, and the local area network communication or the wide area network communication is adapted according to an emergency scene;
the function configuration module is used for configuring the position information of the digital signal camera, configuring the emergency event identification function of the digital signal camera and configuring the acquisition parameters of the video stream to obtain a compressed video stream;
the image processing module collects samples from compressed video streams by adopting timing snapshot, inputs the samples into a functional algorithm pool to judge whether an emergency event occurs in an emergency scene or not, and outputs undifferentiated snapshot images for identifying the samples;
the event display module is used for displaying the snapshot image, the snapshot time, the snapshot position and the emergency event name.
CN202211043197.8A 2022-08-29 2022-08-29 Video monitoring-based emergency event judgment method and device Pending CN115578664A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211043197.8A CN115578664A (en) 2022-08-29 2022-08-29 Video monitoring-based emergency event judgment method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211043197.8A CN115578664A (en) 2022-08-29 2022-08-29 Video monitoring-based emergency event judgment method and device

Publications (1)

Publication Number Publication Date
CN115578664A true CN115578664A (en) 2023-01-06

Family

ID=84580046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211043197.8A Pending CN115578664A (en) 2022-08-29 2022-08-29 Video monitoring-based emergency event judgment method and device

Country Status (1)

Country Link
CN (1) CN115578664A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116886991A (en) * 2023-08-21 2023-10-13 珠海嘉立信发展有限公司 Method, apparatus, terminal device and readable storage medium for generating video data
CN116886991B (en) * 2023-08-21 2024-05-03 珠海嘉立信发展有限公司 Method, apparatus, terminal device and readable storage medium for generating video data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination