CN112216049B - Construction warning area monitoring and early warning system and method based on image recognition


Publication number
CN112216049B
CN112216049B (application CN202011026889.2A)
Authority
CN
China
Prior art keywords
module
early warning
area
image
images
Prior art date
Legal status
Active
Application number
CN202011026889.2A
Other languages
Chinese (zh)
Other versions
CN112216049A (en)
Inventor
刘伟 (Liu Wei)
李春阳 (Li Chunyang)
李伟 (Li Wei)
陈磊 (Chen Lei)
杨弘卿 (Yang Hongqing)
Current Assignee
Research Institute of Highway Ministry of Transport
Original Assignee
Research Institute of Highway Ministry of Transport
Priority date
Application filed by Research Institute of Highway Ministry of Transport
Priority to CN202011026889.2A
Publication of CN112216049A
Application granted
Publication of CN112216049B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G08 SIGNALLING
    • G08B SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B13/00 Burglar, theft or intruder alarms
    • G08B13/18 Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength
    • G08B13/189 Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength using passive radiation detection systems
    • G08B13/194 Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength using passive radiation detection systems using image scanning and comparing systems
    • G08B13/196 Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength using passive radiation detection systems using image scanning and comparing systems using television cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G08 SIGNALLING
    • G08B SIGNALLING OR CALLING SYSTEMS; ORDER TELEGRAPHS; ALARM SYSTEMS
    • G08B13/00 Burglar, theft or intruder alarms
    • G08B13/18 Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength
    • G08B13/189 Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength using passive radiation detection systems
    • G08B13/194 Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength using passive radiation detection systems using image scanning and comparing systems
    • G08B13/196 Actuation by interference with heat, light, or radiation of shorter wavelength; Actuation by intruding sources of heat, light, or radiation of shorter wavelength using passive radiation detection systems using image scanning and comparing systems using television cameras
    • G08B13/19639 Details of the system layout
    • G08B13/19645 Multiple cameras, each having view on one of a plurality of scenes, e.g. multiple cameras for multi-room surveillance or for tracking an object by view hand-over

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Closed-Circuit Television Systems (AREA)

Abstract

The invention provides a construction warning area monitoring and early warning system and method based on image recognition. The system comprises an image input module, an image splicing module, an interactive calibration module, a pedestrian detection module, a visual intrusion module, a feature extraction module and a decision module, and combines visual intrusion detection, target detection and re-identification technologies. Cameras arranged around a construction warning area or dangerous construction equipment acquire surrounding environment information and information on persons entering and leaving in real time; an operator can change the early warning trigger area and operator registration information at any time; whether a personnel intrusion signal exists in the early warning trigger area is updated in real time, and when such a signal exists an alarm is activated to warn unauthorized persons that entry to the area is prohibited, thereby ensuring the safety of the construction warning area.

Description

Construction warning area monitoring and early warning system and method based on image recognition
Technical Field
The invention relates to the technical field of information monitoring and early warning, in particular to a construction warning area monitoring and early warning system and method based on image recognition.
Background
In engineering construction areas, warning zones are required by national and industry standards or guidelines for hoisting operations (including bridge girder erection machines and tower cranes), mechanical operations, hydraulic slip forms, blasting operations, main tower construction, the periphery of bin walls, tensioning operations and the like, as well as below high-altitude operations such as cradle construction and movable formwork construction. A wire netting fence is usually erected to keep unauthorized personnel out of the construction area and prevent unsafe events. However, for a large construction area, directly erecting a fence not only hinders the entry and exit of constructors but can also leave gaps in coverage. Moreover, a large construction area must change as the project progresses, so the wire netting fence approach wastes resources and labor. A safer, more effective and more convenient monitoring and early warning method is therefore needed.
Disclosure of Invention
The invention aims to provide a construction warning area monitoring and early warning system and method based on image recognition, so as to solve the problems in the prior art.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
a construction warning area monitoring and early warning system based on image recognition comprises a plurality of cameras arranged outside a construction warning area, the Internet and terminal equipment, and further comprises an image input module, an image splicing module, an interactive calibration module, a pedestrian detection module, a visual intrusion module, a feature extraction module and a decision module,
the image input module is used for acquiring synchronized video images from all cameras around the highway engineering construction warning area and sending the collected synchronized video images to the image splicing module, and the image splicing module is used for splicing the multiple synchronized video images to obtain a panoramic image of the operation area;
the interactive calibration module is used for interactively calibrating an early warning trigger area on the panoramic image and simultaneously activating the visual intrusion module;
after the visual intrusion module is activated, acquiring a calibrated early warning trigger area, observing video image information in the early warning trigger area in real time, and if intrusion information exists in the early warning trigger area, sending early warning to a pedestrian detection module;
after receiving the early warning sent by the visual intrusion module, the pedestrian detection module detects an intruded pedestrian target in the early warning trigger area and determines basic information of the intruded pedestrian;
the feature extraction module is used for extracting the features of intruding pedestrians in the early warning trigger area and transmitting the extracted feature information to the decision module, and the decision module compares the acquired feature information with the feature information of the operating personnel recorded in the system and judges whether warning information needs to be sent.
Preferably, the image input module comprises an input process in an initialization state and an input process in an operation state;
the input processing of the initialization state means that while the video images of each camera are transmitted directly to the image splicing module, the input module extracts video frames of different camera positions at the same moment according to the camera serial numbers, so that the contents of video frames with adjacent numbers can be spliced;
the input processing in the operating state means that after an early warning trigger area is set, video frames covering the boundary of the early warning trigger area are transmitted to the visual intrusion module in real time; for an area where an intrusion response has been activated, all video frames of the area are directly transmitted into the pedestrian detection module.
Preferably, the image splicing module comprises a video feature extraction submodule, a video feature matching submodule and a matrix regression submodule; the video feature extraction submodule adopts a high-resolution network to extract features of the video images input by two adjacent cameras at the same moment; the video feature matching submodule first performs L2 normalization on the two extracted video image features and then performs feature matching on the normalized features to obtain a similarity score matrix; the matrix regression submodule processes the similarity score matrix with a convolutional neural network to obtain a global homography matrix and, according to the global homography matrix, visually aligns the images through a mapping transformation to complete the splicing of the two images.
Preferably, the interactive calibration module is configured to map the multiple vertex coordinates calibrated by the user into the original video frames according to the homography matrix calculated by the image splicing module, and to use the region enclosed by the vertex connecting lines as the early warning trigger region.
Preferably, the visual intrusion module realizes the visual intrusion detection by calling a vibi function, and specifically includes:
1) a GetImMask module is designed to obtain the early warning trigger area; the area can be set according to actual needs and includes regions formed by horizontal lines, vertical lines, oblique lines, rectangular frames and trapezoids;
2) through the ViBe class and its member functions, resource initialization, dynamic background modeling, background updating, real-time foreground acquisition and other functions are implemented on the video data in the early warning trigger area;
3) filtering a detection frame which is not adjacent to a boundary line or a region of the early warning trigger region through an isoverLapWithBorder module to remove false detection;
4) through the dup _ rect _ eliminate module, duplicate or overlapping detection frames are eliminated when the detection frames are drawn.
Preferably, the feature extraction module trains and generates a convolutional neural network for feature extraction by constructing a twin (Siamese) neural network, specifically comprising triplet data construction, loss design and a person feature extraction network;
the triplet data construction builds a triplet data training set of operator characteristics; each group of triplet data comprises a pair of similar images and a dissimilar image, namely, the acquired images of the same operator at different camera positions and different moments are recorded as samples a_i, and the acquired images of other operators are recorded as samples a_j; each time data are selected to construct triplet data, two images are randomly extracted from a_i and one image from a_j to construct a triplet, and the cos similarity measurement distance is calculated;
the loss design is specifically as follows:
selecting a group of triplet training data, including an anchor picture a and a positive sample picture p extracted from samples a_i and a negative sample picture n extracted from samples a_j, and calculating the triplet loss:

$$L_t = \left(D(a,p) - D(a,n) + \mathrm{margin}\right)_+$$

wherein margin is a boundary hyperparameter, D(a, p) represents the similarity distance between picture a and picture p, and D(a, n) represents the similarity distance between picture a and picture n;
the person feature extraction network submodule adopts a three-branch input structure network to input the sample feature data, unifies the sizes of the input sample feature maps, and performs class division and sample identification on the sample data to obtain person features.
The invention also aims to provide a construction warning area monitoring and early warning method based on image recognition, which specifically comprises the following steps:
S1, deploying a camera set to cover the engineering construction operation and the surrounding warning area; inputting the images collected by the multiple cameras into the image splicing module for splicing to obtain a panoramic image of the operation area;
S2, marking an early warning trigger area on the obtained panoramic image through the interactive calibration module, and simultaneously recording the characteristics and number of the operators allowed to enter the early warning trigger area;
S3, the visual intrusion module monitors the video images in the early warning trigger area in real time, and when an intruder enters the early warning trigger area, an early warning signal is sent out to activate the pedestrian detection module;
S4, after receiving the early warning from the visual intrusion module, the pedestrian detection module detects the pedestrian targets in the early warning trigger area, counts the number of pedestrians, intercepts the areas where the pedestrians are located from the video image, and determines their specific positions according to the warning position information;
and S5, the feature extraction module performs feature extraction on the obtained intruding pedestrian images and measures the distance between the obtained features and the operator features recorded in step S2, thereby determining whether non-operators exist in the early warning trigger area and giving a corresponding warning signal.
Preferably, step S3 specifically includes:
S31, the visual intrusion module acquires a real-time video stream, splits the video data stream to obtain a single-frame image, and acquires the early warning trigger area through the GetImMask module;
S32, performing border-crossing detection and area detection on the obtained early warning trigger area respectively;
the out-of-range detection is to detect whether personnel intrusion signals exist at the upper/lower/left/right sides of the boundary line of the early warning trigger area, and if so, an alarm signal is sent out; if not, returning to the step S31, and acquiring the single-frame image again;
the area detection means detecting whether a personnel intrusion signal exists within the early warning trigger area; if so, an alarm signal is sent out; if not, the process returns to step S31 to acquire a single-frame image again.
Preferably, step S4 further includes: when the number of intruding pedestrians is larger than the number of recorded operators, warning information is sent out directly; steps S4 and S5 may also be initiated on a timed schedule or at specific times.
Preferably, step S5 specifically includes:
S51, constructing a triplet data training set: each group of triplet data comprises a pair of similar images and a dissimilar image; that is, the acquired images of the same operator at different camera positions and different moments are recorded as samples a_i, and the acquired images of other operators as samples a_j; each time data are selected to construct triplet data, two images a and p are randomly extracted from a_i and one image n from a_j to construct a triplet, and the cos similarity measurement distance of each triplet is calculated;
S52, training on the triplet data training set using the triplet loss:
in the training process, the number of training images read in at a time is set to P × K, that is, images of P categories are randomly selected each time, and K images are randomly selected from each category to train the network; the triplet loss of each batch of read-in training images is calculated with the following formula:
$$L_t=\sum_{i=1}^{P}\sum_{a=1}^{K}\left[\max_{p=1,\dots,K}D\left(x_a^i,x_p^i\right)-\min_{\substack{j=1,\dots,P\\n=1,\dots,K\\j\neq i}}D\left(x_a^i,x_n^j\right)+\mathrm{margin}\right]_+$$

wherein $\max_{p}D\left(x_a^i,x_p^i\right)$ refers to the same-class sample with the largest similarity distance and $\min_{j,n}D\left(x_a^i,x_n^j\right)$ refers to the different-class sample with the smallest similarity distance; i and j respectively represent different categories, subscripts a and p represent picture labels within the same category, and subscript n represents a picture label from a different category;
S53, inputting the sample feature data through a three-branch input structure network, aggregating input sample feature maps of different sizes into feature maps of uniform size via ROI Align in a feature aggregation mode, and retaining effective features while compressing the images;
and S54, using a multi-task learning method, performing class division and sample identification on the uniform-size sample feature maps respectively: images of persons in the same area at different camera positions in the same time period are grouped into one class and numbered, the triplet loss is used for modeling, the distance between images of different persons is measured by cos similarity, and sample pairs are finally identified through the measured similarity distance between them.
The invention has the beneficial effects that:
The invention provides a construction warning area monitoring and early warning system and method based on image recognition. The system and method combine visual intrusion detection, target detection and re-identification technologies; cameras are arranged around the construction warning area or dangerous construction equipment to acquire surrounding environment information and information on persons entering and leaving in real time; the operator can change the early warning trigger area and operator registration information at any time; whether a personnel intrusion signal exists in the early warning trigger area is updated in real time, and when such a signal exists an alarm is sent to warn unauthorized persons that entry to the area is prohibited, thereby ensuring the safety of the construction warning area.
In addition, to save resources, generally only the visual intrusion module remains active, and the other modules are activated only when the visual intrusion module sends a signal; however, all modules are activated during important time periods, such as noon, evening or periods when people may appear, and all modules can also be activated at regular intervals to prevent missed alarms by the visual intrusion module.
Drawings
FIG. 1 is a diagram of a construction warning area monitoring and early warning system based on image recognition;
FIG. 2 is a flowchart of the overall algorithm of the visual intrusion module;
FIG. 3 is a functional relationship diagram of the visual intrusion module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
Example 1
This embodiment provides a construction warning area monitoring and early warning system based on image recognition, which comprises a plurality of cameras arranged outside a construction warning area, the Internet and terminal equipment, and further comprises an image input module, an image splicing module, an interactive calibration module, a pedestrian detection module, a visual intrusion module, a feature extraction module and a decision module, as shown in FIG. 1;
the image input module is used for acquiring synchronized video images from all cameras around the highway engineering construction warning area and sending the collected synchronized video images to the image splicing module, and the image splicing module is used for splicing the multiple synchronized video images to obtain a panoramic image of the operation area;
the interactive calibration module is used for interactively calibrating an early warning trigger area on the panoramic image and simultaneously activating the visual intrusion module;
after the visual intrusion module is activated, acquiring a calibrated early warning trigger area, observing video image information in the early warning trigger area in real time, and if intrusion information exists in the early warning trigger area, sending early warning to a pedestrian detection module;
after receiving the early warning sent by the visual intrusion module, the pedestrian detection module detects an intruded pedestrian target in the early warning trigger area and determines basic information of the intruded pedestrian;
the feature extraction module is used for extracting the features of intruding pedestrians in the early warning trigger area and transmitting the extracted feature information to the decision module, and the decision module compares the acquired feature information with the feature information of the operating personnel recorded in the system and judges whether warning information needs to be sent.
Specifically, the input module is mainly divided into two parts: the input processing in the initialization state and the input processing in the operating state.
Initialization state: in the initialization state, human-computer interaction experience is the main consideration, and the panoramic image of the operation area needs to be displayed using image stitching. The input module extracts video frames of different camera positions at the same moment from the input data and numbers the camera positions, so that the contents of video frames with adjacent numbers can be stitched.
Operating state: in the operating state, video frames covering the boundary area are provided to the visual intrusion module in real time, and for areas where an intrusion response has been activated, all video frames are sent to the pedestrian detection module.
The focus of the image stitching task is image registration. The image registration part uses a convolutional neural network composed of a feature extraction module, a feature matching module and a matrix regression module. End-to-end neural network training and optimization overcomes the divergence of optimization targets between the separate stages of traditional methods and makes image registration more robust and stable. The input of the whole network is two images and the output is 8 regression values, from which a homography matrix is obtained. Specifically, the feature extraction module extracts features from the two input images, the feature matching module calculates the similarity relations among the features, and the matrix regression module finally predicts the 8 regression values.
Because the image stitching task needs to retain more spatial detail information to perceive small differences between the two images, the feature extraction module adopts an HRNet high-resolution network to ensure that the feature maps retain sufficient spatial detail.
The feature matching module mainly calculates correlation coefficients between the two sets of features. In this module, the features obtained for both images are first L2-normalized, and feature matching then yields a similarity score matrix.
The matrix regression module uses a convolutional neural network to estimate the homography matrix. A ReLU operation is first applied to the similarity score matrix produced by the feature matching module to remove the negatively correlated part; features are then extracted through several stacked convolution + ReLU + BatchNorm modules, and two fully connected layers finally output the 8 regression values used to generate the homography matrix, yielding the global homography matrix. Finally, according to the obtained global homography matrix, the two images are visually aligned through a mapping transformation and stitched together.
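For illustration, a minimal PyTorch sketch of this three-stage registration network follows; the small convolutional encoder stands in for the HRNet backbone, and all layer sizes, the pooling choice and fixing the ninth homography element to 1 are assumptions rather than the patent's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegistrationNet(nn.Module):
    """Sketch: feature extraction -> L2-normalised feature matching -> matrix regression."""
    def __init__(self):
        super().__init__()
        # Stand-in encoder; the patent uses an HRNet backbone to keep spatial detail.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Convolution + ReLU + BatchNorm blocks, then two FC layers -> 8 regression values.
        self.regressor = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(16),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(), nn.BatchNorm2d(32),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(32 * 8 * 8, 256), nn.ReLU(), nn.Linear(256, 8),
        )

    def forward(self, img_a, img_b):
        fa, fb = self.encoder(img_a), self.encoder(img_b)
        b, c, h, w = fa.shape
        fa = F.normalize(fa.reshape(b, c, h * w), dim=1)  # L2 normalisation
        fb = F.normalize(fb.reshape(b, c, h * w), dim=1)
        scores = torch.bmm(fa.transpose(1, 2), fb)        # similarity score matrix
        scores = F.relu(scores).unsqueeze(1)              # drop negatively correlated part
        vals = self.regressor(scores)                     # 8 regression values
        ones = torch.ones(b, 1, device=vals.device)       # fix the ninth element to 1
        return torch.cat([vals, ones], dim=1).reshape(b, 3, 3)

H = RegistrationNet()(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
print(H.shape)  # torch.Size([1, 3, 3])
```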
The interactive calibration module mainly provides the user with a panoramic picture of the operation area for interactive labeling; workers can select the early warning area range on a notebook computer, tablet computer or other device. The calibration module maps the 4 vertex coordinates calibrated by the user into the original video frames according to the homography matrix, producing the calibration information for the visual intrusion module. Meanwhile, after the calibration process is finished, the program automatically starts the pedestrian detection and feature extraction modules and records the feature information of the operating personnel on site.
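As a sketch of this mapping step, assuming each camera's homography H maps its frame into panorama coordinates, the user-calibrated vertices can be pushed back into an original frame with OpenCV; the function name and the example homography are illustrative only.

```python
import cv2
import numpy as np

def panorama_region_to_frame(vertices_pano, H):
    """Map warning-region vertices calibrated on the panorama back into one
    camera's original frame via the inverse of that camera's homography H."""
    pts = np.asarray(vertices_pano, dtype=np.float32).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, np.linalg.inv(H)).reshape(-1, 2)

# 4 vertices selected by the worker on the stitched panorama (pixel coordinates).
region = [(120, 80), (420, 90), (400, 300), (110, 290)]
H = np.array([[1.0, 0.02, 30.0],
              [0.0, 1.00, 12.0],
              [0.0, 0.00,  1.0]])  # illustrative homography for one camera
print(panorama_region_to_frame(region, H))
```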
The overall algorithm flow of the visual intrusion module is shown in fig. 2, and the alarm signal refers to a signal for activating a subsequent module, and the specific contents are as follows:
Firstly, the visual intrusion module acquires a real-time video stream, splits the video data stream to obtain a single-frame image, and acquires the early warning trigger area through the GetImMask function;
then, performing border crossing detection and area detection on the obtained early warning trigger area respectively;
the out-of-range detection is to detect whether personnel intrusion signals exist at the upper/lower/left/right sides of the boundary line of the early warning trigger area, and if so, an alarm signal is sent out; if not, the real-time video stream is acquired again and a single-frame image is re-extracted;
the area detection means detecting whether a personnel intrusion signal exists within the early warning trigger area; if so, an alarm signal is sent out; if not, the single-frame image is acquired again, and the steps are repeated.
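A minimal sketch of this loop follows, assuming OpenCV: cv2.createBackgroundSubtractorMOG2 stands in for the patent's ViBe-based model, get_im_mask only loosely mirrors the GetImMask module, and the stream URL and pixel threshold are illustrative.

```python
import cv2
import numpy as np

def get_im_mask(shape, polygon):
    """Rasterise the calibrated early warning trigger region into a binary mask."""
    mask = np.zeros(shape[:2], dtype=np.uint8)
    cv2.fillPoly(mask, [np.asarray(polygon, dtype=np.int32)], 255)
    return mask

cap = cv2.VideoCapture("rtsp://camera-01/stream")             # illustrative stream
bg = cv2.createBackgroundSubtractorMOG2(detectShadows=False)  # stand-in for ViBe
region = [(100, 100), (500, 100), (500, 400), (100, 400)]
mask = None

while True:
    ok, frame = cap.read()              # split the stream into single frames
    if not ok:
        break
    if mask is None:
        mask = get_im_mask(frame.shape, region)
    fg = bg.apply(frame)                # background model -> real-time foreground
    fg = cv2.bitwise_and(fg, mask)      # area detection inside the trigger region
    if cv2.countNonZero(fg) > 500:      # illustrative alarm threshold (pixels)
        print("alarm: possible intrusion in the trigger region")
```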
The relationship of the functions in the visual intrusion detection module is shown in FIG. 3. First, the main program main calls the vibi function to realize visual intrusion detection. There are four main functional blocks in the vibi function: 1) a monitoring area is obtained through the GetImMask module, supporting monitoring areas of various shapes such as horizontal lines, vertical lines, oblique lines, rectangular frames and irregular quadrilaterals; 2) resource initialization, dynamic background modeling, background updating, real-time foreground acquisition and other functions are realized through the ViBe class and its member functions; 3) detection frames not adjacent to the monitoring line or monitoring area are filtered out through the isoverLapWithBorder module to remove false detections; 4) duplicate or overlapping detection frames are eliminated through the dup _ rect _ eliminate module when the detection frames are drawn.
It is worth noting that ViBe is a pixel-level video background modeling and foreground detection algorithm that performs better than comparable known algorithms while occupying little hardware memory. Its main distinction is the update strategy of the background model: the pixel samples to be replaced are selected at random, and neighborhood pixels are randomly selected for updating. When the model of pixel change cannot be determined, this random update strategy can, to a certain extent, simulate the uncertainty of pixel change.
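The update strategy just described can be sketched compactly; the following grayscale-only NumPy class is illustrative (parameter defaults follow the original ViBe paper) and is not the patent's vibi implementation.

```python
import numpy as np

class ViBe:
    """Sketch of ViBe: per-pixel sample models with random, conservative updates."""
    def __init__(self, first_frame, n_samples=20, radius=20, min_matches=2, subsample=16):
        self.n, self.r, self.k, self.phi = n_samples, radius, min_matches, subsample
        h, w = first_frame.shape
        noise = np.random.randint(-10, 11, (self.n, h, w))
        self.samples = np.clip(first_frame.astype(int) + noise, 0, 255)

    def apply(self, frame):
        f = frame.astype(int)
        matches = (np.abs(self.samples - f) < self.r).sum(axis=0)
        fg = matches < self.k                 # too few matching samples -> foreground
        # Randomly subsampled, conservative update of background pixels only.
        update = (~fg) & (np.random.randint(0, self.phi, f.shape) == 0)
        ys, xs = np.nonzero(update)
        self.samples[np.random.randint(0, self.n, ys.shape), ys, xs] = f[ys, xs]
        # Spatial propagation: also refresh a random sample of a random neighbour.
        h, w = f.shape
        ny = np.clip(ys + np.random.randint(-1, 2, ys.shape), 0, h - 1)
        nx = np.clip(xs + np.random.randint(-1, 2, xs.shape), 0, w - 1)
        self.samples[np.random.randint(0, self.n, ys.shape), ny, nx] = f[ys, xs]
        return (fg * 255).astype(np.uint8)

frames = [np.random.randint(0, 256, (120, 160), dtype=np.uint8) for _ in range(5)]
vibe = ViBe(frames[0])
for f in frames[1:]:
    foreground = vibe.apply(f)
```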
The pedestrian detection module in this embodiment uses the classic SSD (Single Shot MultiBox Detector) network to quickly locate the positions and number of pedestrians in the video frames captured by each camera position. SSD achieves real-time detection speed without significantly sacrificing detection accuracy; its three major characteristics are multi-scale feature maps, anchor boxes with various aspect ratios, and data enhancement strategies. It effectively combines ideas from Faster R-CNN, YOLO and multi-scale convolutional features, reaching detection accuracy comparable to the most advanced two-stage detection methods of its time while meeting real-time requirements.
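An inference sketch using torchvision's COCO-pretrained SSD300-VGG16 as a stand-in for the patent's detector; the weight name, the score threshold and the COCO "person" label index 1 are assumptions of that checkpoint, not details from the patent.

```python
import torch
from torchvision.models.detection import ssd300_vgg16

model = ssd300_vgg16(weights="DEFAULT").eval()  # COCO-pretrained SSD

def detect_pedestrians(frame, score_thresh=0.5):
    """frame: float image tensor in [0, 1], shape (3, H, W)."""
    with torch.no_grad():
        out = model([frame])[0]
    keep = (out["labels"] == 1) & (out["scores"] > score_thresh)  # label 1 = person
    boxes = out["boxes"][keep]
    return boxes, len(boxes)  # pedestrian boxes and headcount

boxes, count = detect_pedestrians(torch.rand(3, 480, 640))
print(count, boxes.shape)
```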
The feature extraction module in this embodiment mainly trains and generates a convolutional neural network for feature extraction by constructing a triplet-loss-based twin (Siamese) neural network, and mainly includes triplet data construction, loss design and a person feature extraction network. The main purpose of the triplet data construction is to provide high-quality triplet data as training data for the subsequent high-resolution feature learning network. Each group of triplet data must contain three pieces of image data during training, of which one pair is "similar" and one image is "dissimilar". Specifically, the collected images of the same operator at different camera positions and different times are recorded as samples a_i, and the collected images of other operators as samples a_j; each time data are selected to construct triplet data, two images are randomly extracted from a_i and one image from a_j (i not equal to j) to construct a triplet, and distance is measured through cos similarity.
The triplet loss is a widely applied metric learning loss; compared with other losses (classification loss, contrastive loss), it offers end-to-end training, clustering properties and strong feature embedding. Each group of triplet-loss training data requires three input pictures. An input Triplet includes a pair of positive samples and a pair of negative samples; the three pictures are named the fixed picture (Anchor) a, the positive sample picture (Positive) p and the negative sample picture (Negative) n. Picture a and picture p form a positive pair, while picture a and picture n form a negative pair. The triplet loss is expressed as:
$$L_t = \left(D(a,p) - D(a,n) + \mathrm{margin}\right)_+$$
where margin is a boundary hyperparameter, D(a, p) represents the distance between picture a and picture p, and D(a, n) represents the distance between picture a and picture n.
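A direct transcription of this loss, assuming D is the cosine distance implied by the patent's "cos similarity measurement distance"; the margin value is illustrative.

```python
import torch
import torch.nn.functional as F

def triplet_loss(a, p, n, margin=0.3):
    """L_t = (D(a, p) - D(a, n) + margin)_+ with D = 1 - cosine similarity."""
    d_ap = 1.0 - F.cosine_similarity(a, p, dim=-1)
    d_an = 1.0 - F.cosine_similarity(a, n, dim=-1)
    return F.relu(d_ap - d_an + margin).mean()

anchor, positive, negative = (torch.randn(4, 128) for _ in range(3))
print(triplet_loss(anchor, positive, negative))  # 128-d embeddings are illustrative
```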
However, in the training process of the triplet loss network, a large number of negative sample pairs are generated combinatorially, so that the numbers of positive and negative sample pairs become unbalanced, training stalls and the convergence result is poor; the design of the training strategy for personnel images therefore directly influences the performance of deep network learning. Thus, during training, the Batch size (number of images read in at a time) is set to P × K, i.e., P classes of images are randomly selected each time, and K images are randomly selected from each class for training the network. The triplet loss within each Batch is calculated using the following formula:
$$L_t=\sum_{i=1}^{P}\sum_{a=1}^{K}\left[\max_{p=1,\dots,K}D\left(x_a^i,x_p^i\right)-\min_{\substack{j=1,\dots,P\\n=1,\dots,K\\j\neq i}}D\left(x_a^i,x_n^j\right)+\mathrm{margin}\right]_+$$

wherein $\max_{p}D\left(x_a^i,x_p^i\right)$ refers to the same-class sample with the largest similarity distance and $\min_{j,n}D\left(x_a^i,x_n^j\right)$ refers to the different-class sample with the smallest similarity distance; i and j respectively represent different categories, subscripts a and p represent picture labels within the same category, and subscript n represents a picture label from a different category.
Through this training mode, the least similar positive sample pair and the most similar, least distinguishable negative sample pair in each Batch are selected each time to calculate the loss, which reduces the training data, alleviates the problem of training sample imbalance, and makes the feature representations learned by the network stronger.
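The batch-hard selection just described can be sketched as follows, again using cosine distance; the margin and the P = 4, K = 3 batch layout are illustrative.

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """For each anchor in a P*K batch, pick the least similar positive and the
    most similar negative, then apply the hinge as in the formula above."""
    e = F.normalize(embeddings, dim=1)
    dist = 1.0 - e @ e.t()                                # pairwise cosine distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = dist.masked_fill(~same, float("-inf")).max(1).values
    hardest_neg = dist.masked_fill(same, float("inf")).min(1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()

labels = torch.arange(4).repeat_interleave(3)   # P=4 identities, K=3 images each
print(batch_hard_triplet_loss(torch.randn(12, 128), labels))
```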
This part targets the characteristics of the operation scene and, combined with the feature representation learning task, designs a reasonable deep network structure to extract features with strong representation capability and strong robustness.
The feature extraction network comprises three parts: feature representation, feature aggregation and multi-task construction. For feature representation: according to the characteristics of the triplet loss, the network in this embodiment is designed as a three-branch input structure corresponding to a set of positive sample pairs (X_i, X_j) and a set of negative sample pairs (X_j, X_l); a backbone network with a shared set of parameters simultaneously produces the feature maps (Feature Map) of the operating personnel.
Feature aggregation: considering the speed and computation limits of early warning and the feature-dimension requirements of computing measurement distances, the number of feature map channels extracted by the backbone network should not be too high; for example, VGG produces 512-dimensional feature maps, ResNet 1024-dimensional and Inception 1024-dimensional, which affect image retrieval speed and the usability of computed measurement distances, so a parameter-shared 1 × 1 convolution is added after the backbone network to compress the channel count of the three image feature maps. Meanwhile, because the input image sizes are not consistent, the feature maps of the three images obtained through the backbone network are only consistent along the channel dimension, while the feature map sizes must be kept consistent for distance computation and classification feature extraction; ROI Align is therefore used to unify the sizes. ROI Align obtains the image value at pixel points with floating-point coordinates by bilinear interpolation, turning the whole feature aggregation process into a continuous operation. Feature maps of different sizes are aggregated into feature maps of the same size through the ROI Align operation, and effective features are retained while the feature map size is compressed.
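A sketch of this aggregation step with torchvision's roi_align, where one whole-map box pools each feature map to a uniform grid; the channel count, map sizes and 7 × 7 output are illustrative.

```python
import torch
from torchvision.ops import roi_align

# Three feature maps of different spatial sizes, already compressed to 64
# channels (standing in for the shared 1x1 convolution after the backbone).
maps = [torch.randn(1, 64, h, w) for h, w in [(48, 24), (80, 40), (56, 60)]]

pooled = []
for fm in maps:
    _, _, h, w = fm.shape
    box = torch.tensor([[0.0, 0.0, 0.0, float(w), float(h)]])  # (batch_idx, x1, y1, x2, y2)
    # Bilinear sampling gathers the whole map into a uniform 7x7 grid.
    pooled.append(roi_align(fm, box, output_size=(7, 7), aligned=True))

features = torch.cat(pooled)
print(features.shape)  # torch.Size([3, 64, 7, 7]): uniform-size feature maps
```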
Multi-task construction: Multi-Task Learning is a derivative transfer learning method in which a deep network learns several related tasks together; through a shallow shared representation, the learning processes share and complement domain-related information, promoting each other and improving generalization. The feature expression learning network simultaneously adopts a classification task and a sample-pair identification task; the classification task serves as an auxiliary task that benefits feature learning and network convergence. In the classification task, all sample data are coarsely classified: images of the same person are assigned one class and numbered in the range 1-N (N being the number of classes). The feature map extracted by ROI Align for each image is activated by a ReLU function and connected to a Fully Connected Layer; after the fully connected layer the feature map is stretched into a one-dimensional feature vector, activated by ReLU, input to the classification layer, and classified with softmax logistic regression (softmax regression). In the sample-pair identification task, the triplet loss guides network learning and is the core idea of the feature representation modeling: the distance between images of different persons is first measured by cos similarity in the feature space, and sample pairs are finally identified through the measured distance between them.
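A minimal sketch of this shared two-task head, assuming 64 × 7 × 7 pooled maps, a 128-d embedding and 50 identities; softmax is applied implicitly by CrossEntropyLoss during training.

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """ReLU -> flatten -> FC gives the embedding for the triplet loss; a second
    FC branch classifies identities as the auxiliary task."""
    def __init__(self, in_dim=64 * 7 * 7, emb_dim=128, num_ids=50):
        super().__init__()
        self.fc = nn.Sequential(nn.ReLU(), nn.Flatten(), nn.Linear(in_dim, emb_dim))
        self.classifier = nn.Linear(emb_dim, num_ids)  # auxiliary ID classification

    def forward(self, feat_map):
        emb = self.fc(feat_map)                    # embedding for the triplet loss
        logits = self.classifier(torch.relu(emb))  # identity logits for softmax
        return emb, logits

emb, logits = MultiTaskHead()(torch.randn(12, 64, 7, 7))
print(emb.shape, logits.shape)  # torch.Size([12, 128]) torch.Size([12, 50])
```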
The decision module in this embodiment mainly decides whether to send out warning information or to register the information of a newly appearing operator. In addition, the decision module activates all modules during key time intervals (noon, evening, or periods when people may appear; in general only the visual intrusion module stays active) and also activates all modules at regular intervals to prevent missed reports by the visual intrusion module.
Specifically, the decision module calculates the cos similarity distances between the obtained features of persons in the area and all stored personnel features; when the similarity distance is smaller than a preset threshold no warning is issued, and when it exceeds the threshold an early warning is sent to remind the person to leave.
Meanwhile, when the decision module finds that the number of detected personnel in the monitoring area is larger than the number of recorded personnel, early warning can be directly triggered.
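Both decision rules can be sketched in a few lines; the distance threshold of 0.4 and the 128-d features are illustrative assumptions.

```python
import numpy as np

def decide(detected_feats, registered_feats, dist_thresh=0.4):
    """Alarm when more people are detected than registered, or when a detected
    person's smallest cosine distance to all registered workers exceeds the threshold."""
    if len(detected_feats) > len(registered_feats):
        return "alarm: headcount exceeds registered operators"
    reg = registered_feats / np.linalg.norm(registered_feats, axis=1, keepdims=True)
    for f in detected_feats:
        f = f / np.linalg.norm(f)
        if np.min(1.0 - reg @ f) > dist_thresh:   # distance to closest registered worker
            return "alarm: unregistered person in the trigger region"
    return "ok"

print(decide(np.random.rand(2, 128), np.random.rand(3, 128)))
```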
Example 2
This embodiment provides a highway engineering construction warning area monitoring and early warning method based on image recognition, which specifically comprises the following steps:
S1, deploying a camera set to cover the engineering construction operation and the surrounding warning area; inputting the images collected by the multiple cameras into the image splicing module for splicing to obtain a panoramic image of the operation area;
S2, marking an early warning trigger area on the obtained panoramic image through the interactive calibration module, and simultaneously recording the characteristics and number of the operators allowed to enter the early warning trigger area;
S3, the visual intrusion module monitors the video images in the early warning trigger area in real time, and when an intruder enters the early warning trigger area, an early warning signal is sent out to activate the pedestrian detection module;
S4, after receiving the early warning from the visual intrusion module, the pedestrian detection module detects the pedestrian targets in the early warning trigger area, counts the number of pedestrians, intercepts the areas where the pedestrians are located from the video image, and determines their specific positions according to the warning position information;
and S5, the feature extraction module performs feature extraction on the obtained intruding pedestrian images and measures the distance between the obtained features and the operator features recorded in step S2, thereby determining whether non-operators exist in the early warning trigger area and giving a corresponding warning signal.
Step S3 specifically includes:
S31, the visual intrusion module acquires a real-time video stream, splits the video data stream to obtain a single-frame image, and acquires the early warning trigger area through the GetImMask module;
S32, performing border-crossing detection and area detection on the obtained early warning trigger area respectively;
the out-of-range detection is to detect whether personnel intrusion signals exist at the upper/lower/left/right sides of the boundary line of the early warning trigger area, and if so, an alarm signal is sent out; if not, returning to the step S31, and acquiring the single-frame image again;
the area detection means detecting whether a personnel intrusion signal exists within the early warning trigger area; if so, an alarm signal is sent out; if not, the process returns to step S31 to acquire a single-frame image again.
Step S4 further includes: when the number of intruding pedestrians is larger than the number of recorded operators, warning information is sent out directly; steps S4 and S5 may also be initiated on a timed schedule or at specific times.
Step S5 specifically includes:
S51, constructing a triplet data training set: each group of triplet data comprises a pair of similar images and a dissimilar image; that is, the acquired images of the same operator at different camera positions and different moments are recorded as samples a_i, and the acquired images of another operator as samples a_j; each time data are selected to construct triplet data, two images are randomly extracted from a_i and one image from a_j to construct a triplet, and the cos similarity measurement distance is calculated;
S52, training on the triplet data training set using the triplet loss:
during training, the trained Batch size (the number of images read in at a time) is set to be P × K, that is, P classes of images are randomly selected at a time, and K images are randomly selected for training the network for each class. The triple penalty within each Batch size is calculated using the following formula:
$$L_t=\sum_{i=1}^{P}\sum_{a=1}^{K}\left[\max_{p=1,\dots,K}D\left(x_a^i,x_p^i\right)-\min_{\substack{j=1,\dots,P\\n=1,\dots,K\\j\neq i}}D\left(x_a^i,x_n^j\right)+\mathrm{margin}\right]_+$$
S53, inputting the sample feature data through a three-branch input structure network, aggregating input sample feature maps of different sizes into feature maps of uniform size via ROI Align in a feature aggregation mode, and retaining effective features while compressing the images;
and S54, using a multi-task learning method, performing class division and sample identification on the uniform-size sample feature maps respectively: images of persons in the same area at different camera positions in the same time period are grouped into one class and numbered, the triplet loss is used for modeling, the distance between images of different persons is measured by cos similarity, and sample pairs are finally identified through the measured distance between them.
By adopting the technical scheme disclosed by the invention, the following beneficial effects are obtained:
The invention provides a construction warning area monitoring and early warning system and method based on image recognition. The system and method combine visual intrusion detection, target detection and re-identification technologies; cameras are arranged around the construction warning area or dangerous construction equipment to acquire surrounding environment information and information on persons entering and leaving in real time; the operator can change the early warning trigger area and operator registration information at any time; whether a personnel intrusion signal exists in the early warning trigger area is updated in real time, and when such a signal exists an alarm is sent to warn unauthorized persons that entry to the area is prohibited, thereby ensuring the safety of the construction warning area.
In addition, to save resources, generally only the visual intrusion module remains active, and the other modules are activated only when the visual intrusion module sends a signal; however, all modules are activated during important time periods, such as noon, evening or periods when people may appear, and all modules can also be activated at regular intervals to prevent missed alarms by the visual intrusion module.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be considered within the scope of the present invention.

Claims (5)

1. A construction warning area monitoring and early warning system based on image recognition, comprising a plurality of cameras arranged outside a construction warning area, the Internet and terminal equipment, characterized by further comprising an image input module, an image splicing module, an interactive calibration module, a pedestrian detection module, a visual intrusion module, a feature extraction module and a decision module,
the image input module is used for acquiring synchronous video images of all cameras around a highway engineering construction warning area and sending the collected synchronous video images to the image splicing module, and the image splicing module is used for splicing a plurality of synchronous video images to obtain a panoramic image in an operation area;
the interactive calibration module is used for interactively calibrating an early warning trigger area on the panoramic image and simultaneously activating the visual intrusion module;
after the visual intrusion module is activated, acquiring a calibrated early warning trigger area, observing video image information in the early warning trigger area in real time, and if intrusion information exists in the early warning trigger area, sending early warning to a pedestrian detection module;
the pedestrian detection module adopts an SSD detection network; after receiving the early warning sent by the visual intrusion module, it detects intruding pedestrian targets in the early warning trigger area, rapidly locates the positions and number of pedestrians in the video frames captured by each camera position, and determines the intruding pedestrians' position information;
the feature extraction module is used for extracting features of intruding pedestrians in the early warning trigger area and transmitting the extracted feature information to the decision module, and the decision module compares the acquired feature information with the feature information of operating personnel recorded in the system and judges whether warning information needs to be sent;
the decision module mainly decides whether to send out warning information or to register newly appearing operator information; in addition, the decision module activates all modules during key time intervals, including noon, evening or periods when people are likely to appear, and can also activate all modules at regular intervals to prevent missed reports by the visual intrusion module;
the specific steps of judging whether warning information needs to be sent are as follows: the decision module calculates the cos similarity distances between the obtained features of persons in the area and all stored personnel features; when the similarity distance is smaller than a preset threshold no warning is issued, and when it exceeds the threshold an early warning is sent to remind the person to leave; meanwhile, when the decision module finds that the number of detected persons in the monitoring area is larger than the number of recorded persons, the early warning is triggered directly;
the operation process of the visual intrusion module is specifically as follows:
firstly, the vision invasion module acquires a real-time video stream, splits the video data stream to obtain a single-frame image, and acquires an early warning trigger area through a GetImMask function;
then, performing border crossing detection and area detection on the obtained early warning trigger area respectively;
the out-of-range detection is to detect whether personnel intrusion signals exist at the upper/lower/left/right sides of the boundary line of the early warning trigger area, and if so, an alarm signal is sent out; if not, the real-time video stream is obtained again, and the single-frame image is obtained again;
the area detection means detecting whether a personnel intrusion signal exists within the early warning trigger area; if so, an alarm signal is sent out; if not, the single-frame image is acquired again, and the steps are repeated;
the visual intrusion module realizes visual intrusion detection by calling a vibi function, and specifically comprises the following steps:
1) acquiring an early warning trigger area through a designed GetImMask module, wherein the early warning trigger area includes, but is not limited to, monitoring areas formed by horizontal lines, vertical lines, oblique lines, rectangular frames and trapezoids;
2) realizing resource initialization, dynamic background modeling, background updating and real-time foreground acquisition on the video data in the early warning trigger area through the ViBe class and its member functions;
3) filtering a detection frame which is not adjacent to a boundary line or a region of the early warning trigger region through an isoverLapWithBorder module to remove false detection;
4) eliminating the detection frames which are repeated or overlapped when the detection frames are drawn through a dup _ rect _ eliminate module;
the image input module comprises input processing in an initialization state and input processing in an operation state;
the input processing of the initialization state means that while the video images of each camera are transmitted directly to the image splicing module, the input module extracts video frames of different camera positions at the same moment according to the camera serial numbers, so that the contents of video frames with adjacent numbers can be spliced;
the input processing in the operating state means that after an early warning trigger area is set, video frames covering the boundary of the early warning trigger area are transmitted to the visual intrusion module in real time; for the area with activated intrusion response, directly transmitting all video frames of the area into the pedestrian detection module;
the image splicing module comprises a video feature extraction submodule, a video feature matching submodule and a matrix regression submodule; the video feature extraction submodule adopts a high-resolution network to extract features of the video images input by two adjacent cameras at the same moment; the video feature matching submodule first performs L2 normalization on the two extracted video image features and then performs feature matching on the normalized features to obtain a similarity score matrix; the matrix regression submodule processes the similarity score matrix with a convolutional neural network to obtain a global homography matrix and, according to the global homography matrix, visually aligns the images through a mapping transformation to complete the splicing of the two images;
the interactive calibration module is used for mapping the 4 vertex coordinates calibrated by the user into the original video frames through the homography matrix calculated by the image splicing module, taking the area enclosed by the 4 vertex connecting lines as the early warning trigger area; meanwhile, after the calibration process is finished, the program automatically starts the pedestrian detection and feature extraction modules to record the feature information of operators on site;
the feature extraction module is used for training and generating a convolutional neural network for feature extraction by constructing a twin neural network, and specifically comprises triplet data construction, loss design and a person feature extraction network;
the triplet data construction builds a triplet data training set of operator characteristics; each group of triplet data comprises a pair of similar images and a dissimilar image, namely, the acquired images of the same operator at different camera positions and different moments are recorded as samples a_i, and the acquired images of other operators are recorded as samples a_j; each time data are selected to construct triplet data, two images are randomly extracted from a_i and one image from a_j to construct a triplet, and the cos similarity measurement distance is calculated;
the loss design is specifically as follows:
selecting a set of triplet training data, including from sample aiA positive sample picture a and a positive sample picture p extracted from the image data, from the sample ajExtracting a negative sample picture n, and calculating the loss of the triplet:
L_t = [D(a, p) - D(a, n) + margin]_+
wherein margin is a boundary hyperparameter, D(a, p) denotes the similarity distance between picture a and picture p, D(a, n) denotes the similarity distance between picture a and picture n, and [·]_+ denotes max(·, 0);
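A direct transcription of this loss, taking D as 1 - cosine similarity (one choice consistent with the cosine similarity distance above; the margin value is an assumption):

```python
import torch
import torch.nn.functional as F

def cosine_distance(u, v):
    """Similarity distance D defined as 1 - cos(u, v)."""
    return 1.0 - F.cosine_similarity(u, v, dim=-1)

def triplet_loss(a, p, n, margin=0.3):
    """L_t = [D(a, p) - D(a, n) + margin]_+ for anchor/positive/negative
    embedding tensors."""
    return F.relu(cosine_distance(a, p) - cosine_distance(a, n) + margin)
```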
the person feature extraction network submodule adopts a three-branch input network; samples are scaled to a uniform input size, and the sample data undergo class division and sample identification to obtain the person features.
2. A construction warning area monitoring and early warning method based on image recognition, characterized in that the early warning system of claim 1 is adopted and the method specifically comprises the following steps:
S1, deploying a camera set covering the construction operation and the surrounding warning area; inputting the images collected by the cameras into the image stitching module for stitching to obtain a panoramic image of the work area;
S2, marking the early warning trigger area on the obtained panoramic image through the interactive calibration module, and at the same time recording the features and the number of the operators allowed to enter the early warning trigger area;
S3, the visual intrusion module monitors the video images of the early warning trigger area in real time, and when an intruder enters the early warning trigger area, it sends an early warning signal to activate the pedestrian detection module;
S4, after receiving the early warning from the visual intrusion module, the pedestrian detection module detects pedestrian targets in the early warning trigger area, counts the number of pedestrians, crops the area where each pedestrian is located from the video image, and determines the specific position of that area according to the warning position information;
and S5, the feature extraction module extracts features from the obtained images of intruding pedestrians and measures the distance between the extracted features and the operator features recorded in step S2, thereby determining whether non-operators are present in the early warning trigger area and issuing a corresponding warning signal.
3. The construction warning area monitoring and early warning method based on image recognition as claimed in claim 2, wherein step S3 specifically comprises:
S31, the visual intrusion module acquires the real-time video stream, splits the video data stream into single-frame images, and obtains the early warning trigger area through a GetImMask module;
S32, performing out-of-range detection and area detection on the obtained early warning trigger area;
out-of-range detection checks whether a personnel intrusion signal exists on the upper/lower/left/right sides of the boundary line of the early warning trigger area; if so, an alarm signal is sent out; if not, the process returns to step S31 to acquire a single-frame image again;
area detection checks whether a personnel intrusion signal exists inside the early warning trigger area; if so, an alarm signal is sent out; if not, the process returns to step S31 to acquire a single-frame image again.
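One plausible realization of the area check, using frame differencing inside the trigger-area mask (the thresholds are assumptions, not taken from the patent); applying the same routine to a dilated mask of the boundary line gives the out-of-range check:

```python
import cv2
import numpy as np

def intrusion_signal(prev_gray, cur_gray, zone_mask, min_area=500):
    """Return True if a moving region larger than min_area pixels appears
    inside the masked zone between two consecutive grayscale frames."""
    diff = cv2.absdiff(prev_gray, cur_gray)
    _, moving = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    moving = cv2.bitwise_and(moving, zone_mask)
    contours, _ = cv2.findContours(moving, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return any(cv2.contourArea(c) >= min_area for c in contours)
```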
4. The construction warning area monitoring and early warning method based on image recognition as claimed in claim 2, wherein step S4 further comprises: when the number of intruding pedestrians is larger than the number of recorded operators, warning information is sent out directly; steps S4 and S5 may be initiated on a timed schedule or at a specified time.
5. The construction warning area monitoring and early warning method based on image recognition as claimed in claim 2, wherein step S5 specifically comprises:
S51, constructing a triplet training set: each group of triplet data comprises a pair of similar images and one dissimilar image; the collected images of the same operator at different camera positions at different moments are recorded as class sample a_i, and the collected images of other operators are recorded as class sample a_j; each time triplet data are constructed, an anchor image a and a positive image p are drawn at random from a_i and a negative image n is drawn from a_j to form a triplet, and the cosine similarity distance of each triplet is calculated;
S52, training on the triplet training set with the triplet loss:
in the training process, the number of training images read in at a time is set to P×K, i.e., P classes are randomly selected each time, and K images are randomly selected from each class to train the network; the triplet loss of each batch of training images is calculated with the following formula:
L = Σ_{i=1..P} Σ_{a=1..K} [ max_{p=1..K} D(x_a^i, x_p^i) - min_{j=1..P, j≠i; n=1..K} D(x_a^i, x_n^j) + margin ]_+
wherein x_a^i denotes the a-th image of class i; max_{p} D(x_a^i, x_p^i) selects the same-class sample with the largest similarity distance (the hardest positive), and min_{j≠i, n} D(x_a^i, x_n^j) selects the different-class sample with the smallest similarity distance (the hardest negative); i and j denote different classes, subscripts a and p denote picture indices within the same class, and subscript n denotes a picture index in a different class;
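A batch-hard mining sketch of this loss for one P×K batch, again taking D = 1 - cosine similarity (margin value assumed):

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(emb, labels, margin=0.3):
    """For each anchor in a P*K batch, pick the same-class sample with the
    largest distance and the different-class sample with the smallest one."""
    emb = F.normalize(emb, dim=1)
    dist = 1.0 - emb @ emb.t()                     # pairwise cosine distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = dist.masked_fill(~same, float('-inf')).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).sum()
```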
S53, inputting the sample feature data through the three-branch input network, and aggregating input sample feature maps of different sizes into feature maps of uniform size using ROI Align, so that effective features are aggregated and retained while the features are compressed;
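A sketch of the aggregation step, using torchvision's roi_align to pool variable-size person regions from a feature map into uniform-size feature maps (the box format and output size are assumptions):

```python
import torch
from torchvision.ops import roi_align

def gather_uniform_features(feature_map, boxes, out_size=(7, 7)):
    """Pool each (x1, y1, x2, y2) region of a (1, C, H, W) feature map into a
    fixed out_size grid, retaining effective features while compressing."""
    idx = torch.zeros(boxes.shape[0], 1)           # all boxes from batch image 0
    rois = torch.cat([idx, boxes], dim=1)          # (N, 5): (batch_idx, x1, y1, x2, y2)
    return roi_align(feature_map, rois, output_size=out_size)
```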
and S54, performing class division and sample identification on the uniform-size sample feature maps with a multi-task learning method: images of persons appearing in the same area at different camera positions within the same time period are grouped into one class and numbered, the model is trained with the triplet loss, the distance between images of different persons is measured by cosine similarity, and sample pairs are finally identified by the measured similarity distance between them.
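The final pair identification can then reduce to a simple distance threshold (the threshold value is an assumption):

```python
import torch.nn.functional as F

def same_person(feat_a, feat_b, threshold=0.4):
    """Identify a sample pair by its cosine similarity distance."""
    return (1.0 - F.cosine_similarity(feat_a, feat_b, dim=-1)) < threshold
```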

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011026889.2A CN112216049B (en) 2020-09-25 2020-09-25 Construction warning area monitoring and early warning system and method based on image recognition


Publications (2)

Publication Number Publication Date
CN112216049A CN112216049A (en) 2021-01-12
CN112216049B (en) 2022-04-29

Family

ID=74051245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011026889.2A Active CN112216049B (en) 2020-09-25 2020-09-25 Construction warning area monitoring and early warning system and method based on image recognition


Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113128340B (en) * 2021-03-16 2022-09-02 广州华微明天软件技术有限公司 Personnel intrusion detection method and device
CN112990711B (en) * 2021-03-19 2023-11-17 云南建投第九建设有限公司 Aluminum alloy template construction monitoring method and system based on site construction
CN113115225B (en) * 2021-04-09 2022-07-15 国能智慧科技发展(江苏)有限公司 Electronic fence area generation system based on dangerous source monitoring and personnel positioning
CN113052137B (en) * 2021-04-25 2022-11-01 烟台大迈物联科技有限公司 Identification and judgment method for construction site environment
CN113340390B (en) * 2021-06-28 2023-09-05 广东韶钢松山股份有限公司 Anti-cheating weighing system and method
CN113673046B (en) * 2021-07-20 2023-06-06 杭州大杰智能传动科技有限公司 Internet of things communication system and method for intelligent tower crane emergency early warning
CN113762171A (en) * 2021-09-09 2021-12-07 赛思沃德(武汉)科技有限公司 Method and device for monitoring safety of railway construction site
CN113888825A (en) * 2021-09-16 2022-01-04 无锡湖山智能科技有限公司 Monitoring system and method for driving safety
WO2023060405A1 (en) * 2021-10-11 2023-04-20 深圳市大疆创新科技有限公司 Unmanned aerial vehicle monitoring method and apparatus, and unmanned aerial vehicle and monitoring device
CN114445393B (en) * 2022-02-07 2023-04-07 无锡雪浪数制科技有限公司 Bolt assembly process detection method based on multi-vision sensor
CN114411561B (en) * 2022-02-11 2023-05-12 中交第二公路工程局有限公司 Control method of prestress tension control system based on voice and image recognition technology
CN114639214B (en) * 2022-05-23 2022-08-12 安徽送变电工程有限公司 Intelligent safety distance early warning system and method for electric power hoisting operation
CN115206091B (en) * 2022-06-07 2024-06-07 西安电子科技大学广州研究院 Road condition and event monitoring system and method based on multiple cameras and millimeter wave radar
CN115147770B (en) * 2022-08-30 2022-12-02 山东千颐科技有限公司 Belt foreign matter vision recognition system based on image processing
CN115691040A (en) * 2022-09-07 2023-02-03 保利长大工程有限公司 Safe distance judgment type bridge construction early warning system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107483889A (en) * 2017-08-24 2017-12-15 北京融通智慧科技有限公司 The tunnel monitoring system of wisdom building site control platform
CN108875505B (en) * 2017-11-14 2022-01-21 北京旷视科技有限公司 Pedestrian re-identification method and device based on neural network
CN111163286A (en) * 2018-11-08 2020-05-15 北京航天长峰科技工业集团有限公司 Panoramic monitoring system based on mixed reality and video intelligent analysis technology
JP7232650B2 (en) * 2019-01-25 2023-03-03 富士古河E&C株式会社 Intrusion detection system
CN111598787B (en) * 2020-04-01 2023-06-02 西安电子科技大学 Biological radar image denoising method and device, electronic equipment and storage medium thereof



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant