CN112613668A - Scenic spot dangerous area management and control method based on artificial intelligence - Google Patents

Scenic spot dangerous area management and control method based on artificial intelligence

Info

Publication number
CN112613668A
Authority
CN
China
Prior art keywords
target
image
dangerous area
activity
scenic spot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011570349.0A
Other languages
Chinese (zh)
Inventor
Liu Yu (刘瑜)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Cresun Innovation Technology Co Ltd
Original Assignee
Xian Cresun Innovation Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Cresun Innovation Technology Co Ltd filed Critical Xian Cresun Innovation Technology Co Ltd
Priority to CN202011570349.0A
Publication of CN112613668A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/14 Travel agencies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services
    • G06Q50/265 Personal security, identity or safety
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/44 Event detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Data Mining & Analysis (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Development Economics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Primary Health Care (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Educational Administration (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an artificial-intelligence-based method for managing and controlling dangerous areas in scenic spots, comprising the following steps: collecting video of a scenic spot dangerous area, where the targets in the video include tourists and dangerous area identifications; generating, from the video, a spatial and-or graph model of the spatial position relationships between the tourists and the dangerous area identifications; extracting from the spatial and-or graph model a sub-activity label set characterizing the tourists' activity state; inputting the sub-activity label set into a pre-obtained temporal and-or graph model to predict the tourists' future activities, the temporal and-or graph model having been built from a pre-established corpus of target activities in scenic spot dangerous areas; and managing and controlling the dangerous area based on the prediction result. By using the spatio-temporal and-or graph model, the embodiments of the invention can predict tourist activity in dangerous areas accurately and quickly, enabling timely and effective danger warnings and prevention.

Description

Scenic spot dangerous area management and control method based on artificial intelligence
Technical Field
The invention belongs to the technical field of monitoring, and particularly relates to a scenic spot dangerous area management and control method based on artificial intelligence.
Background
Tourist attractions usually contain dangerous areas, such as cliff edges and deep lakesides; tourists who enter them may suffer physical injury or even lose their lives. At present, the common means of reminding tourists are danger signboards and fences. With a signboard alone, a tourist may still have an accident simply because the signboard was not noticed in time; combining signboards with fences is safer, but a permanently installed fence spoils the scenery in tourists' photographs.
Disclosure of Invention
In order to solve the technical problem, the embodiment of the invention provides a scenic spot dangerous area management and control method based on artificial intelligence. The specific technical scheme is as follows:
collecting videos of a scenic spot dangerous area, wherein targets in the videos of the scenic spot dangerous area comprise tourists and dangerous area identifications;
generating a spatial and-or graph model of the spatial position relationships between the tourists and the dangerous area identifications based on the video of the scenic spot dangerous area;
extracting a sub-activity label set characterizing the tourists' activity state from the spatial and-or graph model;
inputting the sub-activity label set into a pre-obtained temporal and-or graph model to obtain a prediction result of the tourists' future activity; the temporal and-or graph model is obtained by utilizing a pre-established corpus of target activities in scenic spot dangerous areas;
and managing and controlling the scenic spot dangerous area based on the prediction result.
In one embodiment of the present invention, the dangerous area identification includes a first identification and a second identification, and a distance between the first identification and the dangerous edge is greater than a distance between the second identification and the dangerous edge.
In one embodiment of the present invention, the dangerous area identification includes a warning board standing on the ground and a warning line laid on the ground.
In one embodiment of the present invention, generating the spatial and-or graph model of the spatial position relationships between the tourists and the dangerous area identifications based on the video of the scenic spot dangerous area comprises:
detecting the targets in the video of the scenic spot dangerous area by using a target detection network obtained by pre-training to obtain attribute information corresponding to each target in each frame of image of the video; wherein the attribute information includes position information of a bounding box containing the target;
matching the same target in each frame of image of the video of the scenic spot dangerous area by utilizing a preset multi-target tracking algorithm based on the attribute information corresponding to each target in each frame of image;
determining an actual spatial distance between the guest and the danger area identifier in each frame of image;
and generating a spatial and-or graph model of the scenic spot dangerous area by using the matched attribute information of the targets corresponding to each frame of image and the actual spatial distances.
In an embodiment of the present invention, the backbone of the target detection network is based on the YOLO_v3 backbone network, with its residual modules replaced by dense connection modules.
In one embodiment of the invention, the target detection network comprises a plurality of dense connection modules and transition modules connected alternately in series, the number of dense connection modules being at least three. Each dense connection module comprises a convolutional network module and a dense connection unit group connected in series; the convolutional network module comprises a convolutional layer, a BN layer and a Leaky ReLU layer connected in series; the dense connection unit group comprises m dense connection units, each of which comprises a plurality of convolutional network modules connected in the dense-connection manner, the feature maps they output being fused by concatenation; m is a natural number greater than or equal to 4.
In one embodiment of the present invention, the determining the actual spatial distance between the guest and the danger area identification in each frame of image comprises:
in each frame image, determining the pixel coordinate of each target;
aiming at each target, calculating the corresponding actual coordinate of the pixel coordinate of the target in a world coordinate system by using a monocular vision positioning and ranging technology;
and aiming at each frame of image, obtaining the actual spatial distance between the tourist and the dangerous area identifier in the frame of image by using the actual coordinates of the tourist and the dangerous area identifier in the frame of image.
In one embodiment of the present invention, extracting the sub-activity label set characterizing the tourists' activity state from the spatial and-or graph model includes:
determining, in the spatial and-or graph model, the dangerous area identifications and the tourists whose actual spatial distance from a dangerous area identification is smaller than a preset distance threshold as attention targets;
determining, for each frame of image, the actual spatial distance of each pair of attention targets and the speed value of each attention target;
obtaining, by sequentially comparing each frame of image with the previous one, distance change information representing how the actual spatial distance of each pair of attention targets changes and speed change information representing how the speed value of each attention target changes;
and describing the sequentially obtained distance change information and speed change information of each attention target with semantic labels, generating a sub-activity label set representing the activity states of the tourists and the dangerous area identifications.
In an embodiment of the present invention, inputting the sub-activity label set into the pre-obtained temporal and-or graph model to obtain a prediction result of the tourists' future activity comprises:
inputting the sub-activity label set into the temporal and-or graph model and obtaining a prediction result of the tourists' future activity in the environment with an online symbolic prediction algorithm based on an Earley parser, wherein the prediction result comprises the tourists' future sub-activity labels and their occurrence probability values.
In an embodiment of the present invention, managing and controlling the scenic spot dangerous area based on the prediction result includes:
when the prediction result indicates that a tourist is approaching the first identification, playing a danger warning voice to the tourist;
and when the prediction result indicates that a tourist is approaching the second identification, sending a control signal to a dangerous area fence controller to raise the dangerous area fence and form an enclosure.
The embodiment of the invention provides an artificial-intelligence-based method for managing and controlling scenic spot dangerous areas, which introduces spatio-temporal and-or graphs into the field of target activity prediction for the first time. First, a spatial and-or graph model is generated by analyzing the spatial position relationships between tourists and dangerous area identifications in the video of a scenic spot dangerous area. Second, activity states are extracted from the spatial and-or graph model to obtain a sub-activity label set covering each attention target (tourists and dangerous area identifications), realizing high-level semantic extraction from the video of the dangerous area. The sub-activity label set is then fed into the pre-obtained temporal and-or graph model, whose temporal grammar yields the prediction of each attention tourist's next sub-activity. Using the spatio-temporal and-or graph model, the embodiment of the invention can predict tourist activity in dangerous areas accurately and quickly, enabling timely and effective danger warnings and prevention.
Drawings
Fig. 1 is a flowchart of a method for controlling a dangerous area in a scenic spot based on artificial intelligence according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a prior art YOLO _ v3 network;
fig. 3 is a schematic structural diagram of a scene object detection network according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a transition module according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating the results of an exemplary traffic-intersection temporal grammar (T-AOG) according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a scenic spot dangerous area management and control method based on artificial intelligence according to an embodiment of the present invention, where the scenic spot dangerous area management and control method based on artificial intelligence according to the embodiment of the present invention includes:
s11, collecting videos of the dangerous areas of the scenic spot, wherein the targets in the videos of the dangerous areas of the scenic spot comprise tourists and dangerous area identifications.
In the embodiment of the invention, a scenic spot dangerous area refers to a cliff edge, a deep lakeside, a riverside, or the like. Video of the dangerous area can be collected by cameras arranged around it. It can be understood that the collected video contains many frames of images, each of which may contain tourists and dangerous area identifications; the embodiment of the invention takes the tourists and the dangerous area identifications as the targets to be recognized.
In the embodiment of the present invention, the size of each frame of image of the video is required to be 416 × 416 × 3.
Thus, at this step, in one embodiment, video satisfying the 416 × 416 × 3 image size can be directly obtained; in another embodiment, a video with any image size may be obtained, and the obtained video is subjected to a certain size scaling process to obtain a video satisfying the 416 × 416 × 3 image size.
In addition, in the two embodiments, image enhancement operations such as smoothing, filtering, edge filling and the like can be performed on the acquired video, so as to achieve the purposes of reducing video distortion and improving image quality.
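Purely as an illustration (the patent specifies no code), the following Python/OpenCV sketch scales an arbitrary frame to the required 416 × 416 × 3 size and applies one of the enhancement operations mentioned above; the function name and the choice of a Gaussian filter are assumptions:

    import cv2
    import numpy as np

    def preprocess_frame(frame: np.ndarray) -> np.ndarray:
        # Scale an arbitrary-size BGR frame to the 416 x 416 x 3 network input.
        resized = cv2.resize(frame, (416, 416), interpolation=cv2.INTER_LINEAR)
        # One of the enhancement operations mentioned above: smoothing to reduce noise.
        return cv2.GaussianBlur(resized, (3, 3), 0)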
Optionally, the dangerous area identifier includes a first identifier and a second identifier, and a distance between the first identifier and the dangerous edge is greater than a distance between the second identifier and the dangerous edge.
By setting up two identifications, each identification is associated with the prediction of tourist activity, and each can be paired with a corresponding danger countermeasure, for example (see the sketch after this list):
when the prediction result indicates that a tourist is approaching the first identification, a danger warning voice is played to the tourist;
and when the prediction result indicates that a tourist is approaching the second identification, a control signal is sent to the dangerous area fence controller to raise the fence and form an enclosure.
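A minimal sketch of how such countermeasures could be dispatched from a prediction result; the label strings, the speaker routine and the fence-controller interface are hypothetical stand-ins, since the patent does not define an API:

    def play_danger_warning_voice():
        print("Warning: you are approaching a dangerous area!")  # stand-in for a loudspeaker

    class FenceController:
        def raise_fence(self):
            print("Raising the dangerous area fence")  # stand-in for the control signal

    def dispatch_control(label: str, fence: FenceController):
        # label is the predicted future sub-activity of a tourist, e.g. from the T-AOG.
        if label == "approach_first_identification":
            play_danger_warning_voice()
        elif label == "approach_second_identification":
            fence.raise_fence()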
Optionally, the danger area indicator includes a warning board standing on the ground and a warning line laid on the ground.
And S12, generating a spatial and-or graph model of the spatial position relationships between the tourists and the dangerous area identifications based on the video of the dangerous area.
By way of example, this step may include:
s121, detecting the targets in the video of the scenic spot dangerous area by using a target detection network obtained through pre-training to obtain attribute information corresponding to each target in each frame of image of the video; wherein the attribute information includes position information of a bounding box containing the object.
In the embodiment of the invention, in the backbone of the target detection network based on the YOLO_v3 network, the residual modules are replaced by dense connection modules. The target detection network is trained with videos of sample scenic spot dangerous areas and the attribute information of each target in each frame of image of those videos; the attribute information includes the position information of the bounding box containing the target and the category information of the target.
The target detection in the embodiment of the invention is implemented with a representative regression-based neural-network detection algorithm, the YOLO_v3 network. The YOLO_v3 network comprises a backbone network and three prediction branches; the backbone is the darknet-53 network. YOLO_v3 is a fully convolutional network that makes extensive use of residual skip connections; to reduce the negative gradient effects of pooling, pooling layers are abandoned and downsampling is realized with the stride of the convolutions, here stride-2 convolutions. To improve the accuracy of detecting small targets, YOLO_v3 adopts upsampling and feature fusion similar to FPN (Feature Pyramid Network) and performs detection on feature maps of multiple scales. The three prediction branches adopt a fully convolutional structure.
To facilitate understanding of the network structure of the target detection network according to the embodiment of the present invention, first, a network structure of a YOLO _ v3 network in the prior art is described, please refer to fig. 2, and fig. 2 is a schematic structural diagram of a YOLO _ v3 network in the prior art. In fig. 2, the part inside the dashed box is the YOLO _ v3 network. Wherein the part in the dotted line frame is a backbone (backbone) network of the YOLO _ v3 network, namely a darknet-53 network; the other part is an FPN (Feature Pyramid network) network, the FPN network is divided into a Y1 prediction branch, a Y2 prediction branch and a Y3 prediction branch, and Y1, Y2 and Y3 represent prediction results of different scales.
The backbone network of the YOLO_v3 network is formed by connecting a CBL module and a plurality of resn modules in series. The CBL module is a convolutional network module comprising, in series, a conv layer (convolutional layer), a BN (Batch Normalization) layer and a Leaky ReLU activation layer; CBL stands for conv + BN + Leaky ReLU. The resn module is a residual module, where n is a natural number (res1, res2, …, res8, etc.); it comprises, in series, a zero padding layer, a CBL module and a residual unit group. The residual unit group, denoted res unit × n, contains n residual units; each residual unit comprises several CBL modules connected in the residual network (ResNet) manner, with feature fusion in parallel (add) form.
Each prediction branch of the FPN network includes a convolutional network module group, specifically includes 5 convolutional network modules, that is, CBL × 5 in fig. 2. In addition, the US (up sampling) module is an up sampling module; the concat module represents that the feature fusion adopts a cascade mode, and concat is short for concatemate.
For the specific structure of each main module in the YOLO _ v3 network, please refer to the schematic diagram below the dashed box in fig. 2.
Compared with the YOLO_v3 network, the main improvement of the target detection network provided by the embodiment of the invention is to borrow the connection pattern of the densely connected convolutional network DenseNet: a purpose-built dense connection module replaces each residual module (resn module) in the YOLO_v3 backbone. ResNet combines features by summation before passing them to later layers, i.e., feature fusion in parallel form. The dense connection pattern instead connects all layers (with matching feature-map sizes) directly to one another to ensure maximal information flow: each layer takes the feature maps of all preceding layers as input and supplies its own feature map to all subsequent layers, i.e., feature fusion in cascade (concatenation) form. Therefore, compared with the residual modules of YOLO_v3, the dense connection modules give the target detection network feature maps with more information, strengthening feature propagation and improving detection accuracy. At the same time, because redundant feature maps need not be relearned, the number of parameters and the amount of computation are greatly reduced, and the vanishing-gradient problem is alleviated.
Optionally, please refer to fig. 3 for a structure of a target detection network adopted in the embodiment of the present invention, and fig. 3 is a schematic structural diagram of the target detection network provided in the embodiment of the present invention. As shown in fig. 3, the portion within the dashed box is the target detection network. Wherein the part in the dotted line frame is the backbone network of the target detection network.
The backbone network of the target detection network comprises a plurality of dense connection modules and transition modules connected alternately in series; a dense connection module is denoted denm.
Because there are at least three prediction branches, the number of the dense connection modules is at least three, so that the feature maps output by the dense connection modules are correspondingly fused into the prediction branches.
The dense connection module comprises a convolution network module (as before, denoted as CBL module) and a dense connection unit group which are connected in series; the dense connecting unit group is represented as den unit x m, and the meaning of the dense connecting unit group is that the dense connecting unit group comprises m dense connecting units, and m is a natural number which is more than or equal to 4;
Each dense connection unit, denoted den unit, comprises a plurality of convolutional network modules connected in the dense-connection manner, and the feature maps they output are fused by concatenation (concat). Concat is tensor concatenation: unlike the add operation in the residual module, concat expands the tensor dimensionality, whereas add leaves it unchanged. Thus, when the backbone of the target detection network extracts features, the dense connection modules change feature fusion from parallel to cascade form: early feature maps feed directly into every later layer, strengthening feature propagation, while reusing shallow feature-map parameters reduces the parameter count and the amount of computation.
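For illustration, a PyTorch-style sketch of the CBL module and one dense connection unit with cascade (concat) fusion as described above; the channel counts and growth rate are assumptions, not values fixed by the patent:

    import torch
    import torch.nn as nn

    class CBL(nn.Module):
        # conv + batch normalization + Leaky ReLU, as in the CBL module above
        def __init__(self, c_in, c_out, k=3, s=1):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                nn.BatchNorm2d(c_out),
                nn.LeakyReLU(0.1),
            )

        def forward(self, x):
            return self.block(x)

    class DenseUnit(nn.Module):
        # Each CBL receives the concatenation of all earlier feature maps,
        # and all outputs are fused by concatenation (cascade form).
        def __init__(self, c_in, growth=32, n_layers=4):
            super().__init__()
            self.layers = nn.ModuleList(
                CBL(c_in + i * growth, growth) for i in range(n_layers)
            )

        def forward(self, x):
            feats = [x]
            for layer in self.layers:
                feats.append(layer(torch.cat(feats, dim=1)))
            return torch.cat(feats, dim=1)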
For a general dense convolutional network DenseNet structure, a transition layer between dense connections is included to adjust the feature map of the dense connections. Therefore, in the embodiment of the invention, transition modules can be arranged among the added densely-connected modules.
In a first optional embodiment, the transition module is a convolutional network module, i.e., a CBL module serves as the transition module. When building the backbone of the target detection network, the residual modules then only need to be replaced by dense connection modules connected in series with the original CBL modules, so the network can be built quickly and the resulting structure is simple. However, using only a convolutional layer for transition, i.e., reducing the feature-map dimensions simply by increasing the stride, attends only to local region features and cannot integrate information across the whole feature map, so more feature-map information may be lost.
In a second optional embodiment, the transition module comprises a convolutional network module and a max pooling layer; the two share the same input, and the feature map output by the convolutional network module is fused with that of the max pooling layer by concatenation. Referring to fig. 4, a schematic structural diagram of the transition module according to an embodiment of the present invention, the transition module is denoted tran and the MP layer is a max pooling (Maxpool) layer; the stride of the MP layer may be chosen as 2. The introduced MP layer reduces the feature-map dimensions with a larger receptive field; it uses few parameters, so it adds little computation, weakens the possibility of overfitting and improves the generalization ability of the network model. Combined with the original CBL module, the feature map is in effect dimension-reduced from different receptive fields, so more information is retained.
For the second embodiment, optionally, the number of the convolution network modules included in the transition module is two or three, and a serial connection manner is adopted between each convolution network module. Compared with the method using one convolution network module, the method using two or three convolution network modules connected in series can increase the complexity of the model and fully extract the features.
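Reusing the CBL class from the previous sketch, one possible reading of this second embodiment in PyTorch; the kernel sizes and the choice of two serial CBL modules are assumptions within the options the text allows:

    class Transition(nn.Module):
        # The CBL branch and the stride-2 max-pooling branch share the input;
        # their output feature maps are fused by concatenation.
        def __init__(self, c_in, c_out):
            super().__init__()
            self.cbl = nn.Sequential(
                CBL(c_in, c_out, k=3, s=2),   # downsampling convolution branch
                CBL(c_out, c_out, k=1, s=1),
            )
            self.mp = nn.MaxPool2d(kernel_size=2, stride=2)  # MP layer, stride 2

        def forward(self, x):
            return torch.cat([self.cbl(x), self.mp(x)], dim=1)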
The original YOLO _ v3 network contains many convolutional layers, and when the number of target categories is small, a large number of convolutional layers are not necessary, which wastes network resources and reduces processing speed.
Therefore, optionally, m may be set to 4 for the dense connection modules in the target detection network; compared with the number of convolutional layers contained in the residual modules of the original YOLO_v3 backbone, setting four dense connection units per dense connection module reduces the number of convolutional layers in the backbone without affecting network accuracy.
Optionally, the target detection network may also reduce the number k of convolutional network modules in the module group of each prediction branch of the FPN network from the original 5 to 4 or 3, i.e., change CBL × 5 to CBL × 4 or CBL × 3; this reduces the number of convolutional layers in the FPN network, simplifying the layer count overall without affecting network accuracy and improving processing speed.
Of course, a preferred embodiment is to use both of the above two approaches to reducing the number of convolutional layers.
Through the pre-trained target detection network, the attribute information corresponding to each target in each frame of image of the video of the scenic spot dangerous area can be obtained. The attribute information includes the position information of the bounding box containing the target, expressed as (x, y, w, h), where (x, y) are the coordinates of the center of the bounding box and w and h are its width and height. As those skilled in the art will understand, besides the position information of the bounding box, the attribute information also includes the confidence of the bounding box, which reflects both the confidence that the box contains a target and the accuracy of the box prediction. The confidence is defined as:

confidence = Pr(Object) × IOU_pred^truth    (1)

If the box contains no target, Pr(Object) = 0 and the confidence is 0; if it contains a target, Pr(Object) = 1, so the confidence equals IOU_pred^truth, the intersection-over-union of the real bounding box and the predicted bounding box.
As will be understood by those skilled in the art, the attribute information also includes category information of the object. The categories include guests and hazardous area identifications.
It should be noted that one frame of a video image often contains several targets, some of which are far away or too small, or do not belong to the "targets of interest" of the preset scene; such targets are not detection targets. For example, in the scene of the present invention, the moving tourists and the fixed dangerous area identifications are of interest, while other small objects in the area, such as rest stools and landscape stones, are not. Thus, in a preferred embodiment, by configuring the target detection network in the pre-training stage, a preset number of targets can be detected in one frame of image, for example 30 or 40. Training the network with labelled samples that carry the detection purpose gives the network autonomous learning ability, so that the trained target detection network, applied to an unknown video of a scenic spot dangerous area as a test sample, can output the attribute information of the preset number of purposeful targets in each frame of image, improving both detection efficiency and detection purposefulness.
As mentioned above, before S12 the target detection network needs to be pre-trained for the preset scene. As those skilled in the art will understand, the sample data used in pre-training are videos of sample scenic spot dangerous areas in the scene together with labelled attribute information, i.e., the attribute information of each target in each frame of image obtained by annotation, comprising the category information of the target and the position information of the bounding box containing it.
The pre-training process can be briefly described as the following steps:
1) and taking the attribute information of each frame of image of the video of the dangerous area of the sample scenic spot corresponding to the target as the corresponding true value of the frame of image, and training each frame of image and the corresponding true value through the network shown in the figure 3 to obtain the training result of each frame of image.
2) And comparing the training result of each frame of image with the true value corresponding to the frame of image to obtain the output result corresponding to the frame of image.
3) And calculating the loss value of the network according to the output result corresponding to each frame of image.
4) adjusting the parameters of the network according to the loss value and repeating steps 1)-3) until the loss value reaches a convergence condition, i.e., the loss value reaches its minimum, meaning that the training result of each frame of image is consistent with the true value corresponding to that frame; training is then complete, yielding the pre-trained target detection network.
For the scene of the invention, a large number of videos of sample scenic spot dangerous areas need to be obtained in advance and labelled manually or by machine to obtain the category information and position information of the targets in each frame of image, so that through the above pre-training process the target detection network acquires target detection capability in this scene.
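A generic sketch of the training loop in steps 1)-4); the optimizer, learning rate and the concrete convergence test are assumptions, since the patent names none, and loss_fn stands for an unspecified detection loss:

    import torch

    def pretrain(network, loader, loss_fn, epochs=100, lr=1e-3, tol=1e-4):
        # loader yields (frames, true_values) pairs built from the labelled sample videos.
        opt = torch.optim.Adam(network.parameters(), lr=lr)
        for _ in range(epochs):
            total = 0.0
            for frames, true_values in loader:
                opt.zero_grad()
                loss = loss_fn(network(frames), true_values)  # steps 1)-3): result vs. truth
                loss.backward()
                opt.step()                                    # step 4): adjust parameters
                total += loss.item()
            if total / max(len(loader), 1) < tol:             # crude convergence condition
                break
        return network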
And S122, matching the same target in each frame of image of the video of the scenic spot dangerous area by using a preset multi-target tracking algorithm based on the attribute information corresponding to each target in each frame of image.
Early target detection and tracking mainly addressed pedestrian detection, typically using traditional feature-point detection followed by filtering and matching of the feature points, such as pedestrian detection based on the histogram of oriented gradients (HOG); these early approaches suffered from missed detections, false alarms and repeated detections. With the development of deep convolutional neural networks in recent years, various methods have appeared that track targets using high-precision detection results.
Since a plurality of targets exist in the preset scene, the target tracking needs to be realized by using a multi-target tracking algorithm. The multi-target tracking problem can be regarded as a data association problem, aiming at associating cross-frame detection results in a sequence of video frames. By tracking and detecting the targets in the scene video by using a preset multi-target tracking algorithm, the bounding box of each target in different frame images in the scene video and the ID (Identity document) of the target can be obtained.
In an optional implementation, the preset multi-target tracking algorithm may be the SORT (Simple Online and Realtime Tracking) algorithm.
The SORT algorithm follows the tracking-by-detection (TBD) paradigm: Kalman filter tracking estimates the motion state of each target, and the Hungarian assignment algorithm performs position matching. SORT uses no appearance features of the targets during tracking, only the position and size of the bounding boxes for motion estimation and data association. Its complexity is therefore low and the tracker can run at 260 Hz, a tracking speed high enough to meet the real-time requirement of the scene video in the embodiment of the invention.
The SORT algorithm does not consider occlusion and does not re-identify targets through their appearance features, so it is better suited to preset scenes in which the targets are not occluded.
In an optional implementation manner, the preset multi-target tracking algorithm may include: deepsort (simple online and real time tracking with a deep association metric) algorithm.
DeepSort is an improvement on the basis of SORT target tracking, a Kalman filtering algorithm is used for carrying out track preprocessing and state estimation and is associated with a Hungarian algorithm, a deep learning model trained on a line re-identification data set is introduced into the algorithm on the basis of improving the SORT algorithm, and nearest neighbor matching is carried out by extracting depth appearance features of targets in order to improve the shielding condition of the targets in a video and the problem of frequent switching of target IDs when the targets are tracked on a real-time video. The core idea of deep sort is to use recursive kalman filtering and data correlation between several frames for tracking. Deep Association Metric (Deep Association Metric) is added to the Deep SORT on the basis of the SORT, and the purpose is to distinguish different pedestrians. Appearance Information (Appearance Information) is added to realize target tracking of longer-time occlusion. The algorithm is faster and more accurate than the SORT speed in real-time multi-target tracking.
For the specific tracking procedure of the SORT algorithm and the DeepSort algorithm, please refer to the related prior art for understanding, and the detailed description is omitted here.
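A sketch of the data-association step shared by SORT and DeepSort: Hungarian assignment on an IoU cost matrix. The Kalman prediction and (for DeepSort) the appearance features are omitted, and the IoU threshold is illustrative:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def iou(a, b):
        # Boxes given as (x1, y1, x2, y2).
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / (union + 1e-9)

    def associate(tracks, detections, iou_min=0.3):
        # Match predicted track boxes to new detections, frame by frame.
        if not tracks or not detections:
            return []
        cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
        rows, cols = linear_sum_assignment(cost)
        return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_min]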
And S123, determining the actual spatial distance between the tourist and the dangerous area identifier in each frame of image.
This step determines the actual spatial distance between each tourist and the dangerous area identification in each frame of image and uses the actual spatial distance between two targets to define their spatial composition, so that accurate results can be obtained when the constructed spatial and-or graph model is subsequently used for target recognition, analysis and prediction.
In an alternative embodiment, the actual distance between the guest and the danger area indicator in the image may be determined using the principle of equal scaling. Specifically, the actual spatial distance between the two test targets may be measured in a preset scene, a frame of image including the two test targets is captured, and then the pixel distance between the two test targets in the image is calculated, so as to obtain the actual number of pixels corresponding to a unit length, for example, the actual number of pixels corresponding to 1 meter. Then, for two new targets needing to detect the actual spatial distance, the pixel number corresponding to the unit length in the actual scene is used as a factor, and the pixel distance of the two targets in one frame of image shot in the scene can be scaled by using a formula, so as to obtain the actual spatial distance of the two targets.
It will be appreciated that this approach is simple to implement but only suits undistorted images. When an image is distorted, pixel coordinates and physical coordinates no longer correspond one to one, and the distortion must first be corrected, for example by rectifying the picture with cvInitUndistortMap and cvRemap. The implementation of such scaling and the specific process of distortion correction can be understood from the related art and are not repeated here.
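A sketch of the equal-proportion scaling idea under the stated assumption of an undistorted image, with a one-off calibration from two test targets whose real separation was measured:

    def pixels_per_meter(ref_pixel_dist: float, ref_actual_dist_m: float) -> float:
        # Calibrate once: pixel distance of the two test targets over their measured distance.
        return ref_pixel_dist / ref_actual_dist_m

    def actual_distance_m(pixel_dist: float, ppm: float) -> float:
        # Scale any new pixel distance by the calibration factor.
        return pixel_dist / ppm

    # e.g. ppm = pixels_per_meter(240.0, 2.0); actual_distance_m(120.0, ppm) -> 1.0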
Alternatively, monocular distance measurement may be used to determine the actual spatial distance between the guest and the danger area indicator in the image.
The monocular camera model may be considered approximately as a pinhole model. Namely, the distance measurement is realized by using the pinhole imaging principle. Optionally, a similar triangle may be constructed through a spatial position relationship between the camera and the actual object and a position relationship of the target in the image, and then an actual spatial distance between the targets is calculated.
Alternatively, using a correlation algorithm of the prior-art monocular ranging method, the horizontal distance d_x and the vertical distance d_y between the actual position of a pixel point of a target and the video capture device (video camera/camera) can be calculated from the pixel coordinates of that point, realizing monocular ranging. Then, from the known actual coordinates of the video capture device together with d_x and d_y, the actual coordinates of the pixel point can be derived. Finally, for two targets in the image, their actual spatial distance can be calculated from their actual coordinates.
In an optional implementation, the actual spatial distance between two targets in the image may be determined by calculating the actual coordinate points corresponding to the targets' pixel points.
Optionally, a monocular visual positioning and ranging technique may be employed to obtain the actual coordinates of the pixels.
The monocular vision positioning distance measuring technology has the advantages of low cost and fast calculation. Specifically, two modes can be included:
1) and obtaining the actual coordinates of each pixel by utilizing positioning measurement interpolation.
Considering the equal-proportion magnification of the pinhole imaging model, the measurement can be performed by directly printing paper covered with an equidistant array of dots. The equidistant array points (e.g., on a calibration board) are measured at a given distance and interpolated, and the result is magnified in equal proportion to obtain the actual ground coordinates corresponding to each pixel point; this removes the need to manually measure graphic marks on the ground. After the dot pitch on the paper has been measured, magnifying by the height ratio H/h yields the coordinates of each pixel on the actual ground. To prevent the keystone distortion at the upper edge of the image from becoming so severe that the marks on the printing paper cannot be recognized, equidistant dot-array maps for several different distances need to be prepared.
2) And calculating the actual coordinates of the pixel points according to the similar triangular proportion.
The main idea of this approach is still the pinhole imaging model, but it places higher demands on calibrating the video capture device (video camera/still camera/camera) and on keeping lens distortion small; in return, the method is more portable and practical. The camera may be calibrated with MATLAB or OpenCV, for example, after which the conversion of the pixel coordinates in the image is computed.
An optional implementation of this mode is described below; S123 may include S1231 to S1233:
s1231, determining the pixel coordinate of each target in each frame of image;
For example, the bounding box containing the target and the pixel coordinates of all pixel points inside it may be taken as the pixel coordinates of the target; or one pixel point on or inside the bounding box may be selected to represent the target, for example the coordinates of the center of the bounding box, and so on.
S1232, aiming at each target, calculating the corresponding actual coordinate of the pixel coordinate of the target in a world coordinate system by using a monocular vision positioning and ranging technology;
the pixel coordinates of any pixel in the image are known. The imaging process of the camera involves four coordinate systems: a world coordinate system, a camera coordinate system, an image physical coordinate system (also called an imaging plane coordinate system), a pixel coordinate system, and a transformation of these four coordinate systems. The transformation relationships between these four coordinate systems are known and derivable in the prior art. Then, the actual coordinates of the pixel points in the image in the world coordinate system can be calculated by using a coordinate system transformation formula, for example, the actual coordinates in the world coordinate system can be obtained from the pixel coordinates by using many public algorithm programs in OPENCV language. Specifically, for example, the corresponding world coordinates are obtained by inputting the camera parameters, rotation vectors, translation vectors, pixel coordinates, and the like in some OPENCV programs, using a correlation function. The actual coordinates of the center position of the bounding box representing the target A in the world coordinate system are assumed to be (X)A,YA) The center of the bounding box representing the object B is obtainedThe actual coordinate of the position coordinate in the world coordinate system is (X)B,YB). Further, if the object A has an actual height, the actual coordinates of the object A are
Figure BDA0002862304390000131
Where H is the actual height of the object a and H is the height of the video capture device.
And S1233, aiming at each frame of image, obtaining the actual space distance between every two targets in the frame of image by using the actual coordinates of every two targets in the frame of image.
The method of finding the distance between two points from their actual coordinates belongs to the prior art. For the above example, without considering the actual heights of the targets, the actual spatial distance D between targets A and B is:

D = √((X_A − X_B)² + (Y_A − Y_B)²)    (2)

The case where the actual height of a target is considered is handled similarly.
Optionally, if multiple pixel coordinates of targets A and B were obtained in S1231, it is also reasonable to calculate multiple actual distances between A and B from those coordinates and then select one as the actual spatial distance according to some criterion, for example selecting the minimum actual distance as the actual spatial distance between A and B.
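A sketch of S1231-S1233 under the standard pinhole model: a pixel is back-projected onto the ground plane Z = 0 and two ground points are compared with eq. (2). The calibration inputs (intrinsic matrix K, rotation R, translation t) are assumed to come from an OpenCV/MATLAB-style calibration; the derivation is the textbook one, not code given by the patent:

    import numpy as np

    def ground_coords(u, v, K_inv, R_inv, t):
        # Back-project pixel (u, v) onto the ground plane Z = 0 in world coordinates.
        ray = R_inv @ (K_inv @ np.array([u, v, 1.0]))  # viewing ray in world frame
        cam = -R_inv @ t                               # camera centre in world frame
        s = -cam[2] / ray[2]                           # scale so the ray meets Z = 0
        p = cam + s * ray
        return p[0], p[1]

    def spatial_distance(pa, pb):
        # Eq. (2): Euclidean distance between two ground points.
        return float(np.hypot(pa[0] - pb[0], pa[1] - pb[1]))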
In an optional implementation, determining the actual spatial distance between different targets in each frame of image may also be implemented by using a binocular camera optical image ranging method.
A binocular camera works like a pair of human eyes: because the two cameras photograph the same object from different angles and positions, the two images differ, and this difference is called parallax. The magnitude of the parallax is related to the distance between the object and the cameras, so the target can be located on this principle. Binocular optical-image ranging is realized by calculating the parallax between the images captured by the left and right cameras; the method is similar to monocular optical-image ranging but provides more accurate ranging and positioning information. For the specific ranging process of the binocular method, refer to the related prior art, which is not repeated here.
In an alternative embodiment, determining the actual spatial distance between different objects in each frame of image may also include:
and aiming at each frame of image, obtaining the actual spatial distance between the two targets in the frame of image by using a depth camera ranging method.
The depth-camera ranging method obtains the depth information of a target directly from the image, so the actual spatial distance between a target and the video capture device is obtained accurately and quickly without coordinate calculation; the actual spatial distance between two targets is then determined with higher accuracy and timeliness. For the specific ranging process of the depth-camera method, refer to the related prior art, which is not repeated here.
And S124, generating a spatial and-or graph model of the scenic spot dangerous area by using the matched attribute information of the targets corresponding to each frame of image and the actual spatial distances.
The spatial relationships between the tourists and the dangerous area identifications in each frame of image are decomposed to obtain the spatial and-or graph of that frame, and the spatial and-or graphs corresponding to all frames of the video of the scenic spot dangerous area are integrated to obtain the spatial and-or graph model of the environment.
Specifically, in this step, for each frame of image, the detected targets and their attribute information serve as the leaf nodes of the spatial and-or graph, and the actual spatial distance between a tourist and a dangerous area identification serves as the spatial constraint of the graph, generating the spatial and-or graph of that frame. The spatial and-or graphs of all frames together form the spatial and-or graph model of the environment.
And S13, extracting a sub-activity label set representing the activity state of the tourists from the spatial and-or graph model.
S12 realizes the detection of the leaf nodes of the spatial and-or graph. In this step, the sub-activities are extracted to obtain the event sequence composed of the sub-activities, expressing the whole event represented by the video of the scenic spot dangerous area. Note that the sub-activities extracted here are in fact the activities of the targets, described in terms of the leaf nodes of the and-or graph.
In an alternative embodiment, the step may include:
s131, in the space and OR graph model, the dangerous area identifier and the tourists of which the actual space distance from the dangerous area identifier is smaller than a preset distance threshold are determined as the attention targets.
Optionally, in the space and or map model, the dangerous region identifier, and the target whose actual spatial distance from the dangerous region identifier is smaller than the preset distance threshold are determined as the attention target.
And S132, determining the actual spatial distance of each pair of attention targets and the speed value of each attention target for each frame image.
In this step, starting from the first frame of image, the actual spatial distance d of attention targets closer than the preset distance threshold minDis may be stored in Distance_x, a multi-dimensional array that holds the actual spatial distances d between different targets; x denotes the serial number of the image, x = 1 denoting the first frame, for example.
Meanwhile, the speed value of each attention target in every frame of image can be calculated; the speed value refers to the speed of the attention target in the current frame of the video of the scenic spot dangerous area.
The calculation method of the velocity value of the target is briefly described below:
and calculating the speed value of an object, wherein the moving distance s and the moving time t of the object in the front frame image and the rear frame image are required to be obtained. The frame rate FPS of the camera is first calculated. Specifically, in the development software OpenCV, the number of frames per second FPS of the video can be calculated by using the self-contained get (CAP _ PROP _ FPS) and get (CV _ CAP _ PROP _ FPS) methods.
Taking one image every k frames, the elapsed time is:

t = k / FPS (s)    (3)

Thus, the speed value v of the target can be calculated by:

v = √((X_2 − X_1)² + (Y_2 − Y_1)²) / t    (4)

where (X_1, Y_1) and (X_2, Y_2) are the actual coordinates of the target in the previous and the current frame of image respectively, obtained as in S1232. Since calculating the speed value for the current frame requires both the previous frame and the current frame, it can be understood that speed values are available from the second frame of image onward.
The speed of each attention target in the video can be calculated in this way, and the corresponding speed value (e.g., 0.8 m/s) can be displayed beside the bounding box of each attention target. In the scenario of the embodiment of the present invention, movement is mainly produced by people, so the calculated velocity values are mainly those of the persons in the image.
For the same object of interest, the velocity value in the first frame image may be denoted by v1, the velocity value in the second frame image may be denoted by v2, …, and so on.
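Equations (3) and (4) translate directly into OpenCV/Python; cv2.VideoCapture.get(cv2.CAP_PROP_FPS) is the standard call for reading the frame rate, and the world coordinates are assumed to come from the monocular positioning step:

```python
import math
import cv2

def frame_rate(video_path: str) -> float:
    """Read the frames-per-second of the video, as used in equation (3)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    cap.release()
    return fps

def speed_value(p1, p2, k: int, fps: float) -> float:
    """Speed of a target sampled every k frames, equations (3) and (4).

    p1 = (X1, Y1), p2 = (X2, Y2): actual world coordinates (metres) of the
    target in the previous and the current sampled frame."""
    t = k / fps                                   # equation (3): elapsed time in seconds
    s = math.hypot(p2[0] - p1[0], p2[1] - p1[1])  # moving distance s
    return s / t                                  # equation (4)

# e.g. a tourist moving from (1.0, 2.0) to (1.6, 2.8) over k = 25 frames
# at 25 FPS gives a speed value of 1.0 m/s.
```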
S133, sequentially comparing each frame image with its previous frame image to obtain distance change information representing how the actual spatial distance of each pair of attention targets changes, and speed change information representing how the speed value of each attention target changes.
For example, for two attention targets E and F, if their actual spatial distance is 3 meters in the previous frame image and 2 meters in the next frame image, the actual spatial distance between them has decreased; this is their distance change information. Similarly, if the velocity value of E is 8 m/s in the previous frame image and 10 m/s in the next, E has sped up; this is its velocity change information.
The distance change information and speed change information of every attention target, generated in sequence for each frame image, are obtained in this way until all frames have been traversed.
S134, describing the distance change information and speed change information obtained in sequence for each attention target with semantic labels, and generating a sub-activity label set representing the activity states of the tourists and the dangerous area identifier.
This step describes the distance change information and speed change information in textual form according to their meaning, such as accelerating, decelerating, approaching, moving away, and so on, to obtain the sub-activity labels representing the activity state of each attention target; the sub-activity labels occurring in sequence for each frame image finally form the sub-activity label set. The sub-activity label set represents the sub-event sequence of the scenic spot dangerous area video. The embodiment of the invention thus describes the scenic spot dangerous area video by means of the sub-activity label set: the semantic description of the whole video is obtained by combining the different sub-activities of each target in the video, realizing semantic extraction from the scenic spot dangerous area video.
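A minimal sketch of S133 and S134 together, mapping the numeric changes between consecutive frames to semantic labels; the eps tolerances and the exact label mapping are illustrative assumptions:

```python
def distance_change_label(d_prev: float, d_curr: float, eps: float = 0.05) -> str:
    """Describe how the actual spatial distance of a pair of attention
    targets changed between the previous and the current frame."""
    if d_curr < d_prev - eps:
        return "closing"        # targets approaching each other
    if d_curr > d_prev + eps:
        return "away"           # targets moving apart
    return "person_stopping"    # essentially unchanged, e.g. person not moving

def speed_change_label(v_prev: float, v_curr: float, eps: float = 0.05) -> str:
    """Describe how the speed value of one attention target changed."""
    if v_curr > v_prev + eps:
        return "accelerate"
    if v_curr < v_prev - eps:
        return "decelerate"
    return "moving_uniform"

# Traversing every frame in order yields the sub-activity label set, e.g.
# ["closing", "closing", "person_stopping", "away"].
```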
The sub-activity definitions in embodiments of the present invention may follow the way sub-activity labels are defined in the CAD-120 dataset; a shorter label schema helps generalize the nodes of the and-or graph. The sub-activity labels of interest can be defined specifically for different environments.
Through the above steps, the complete sub-activity label set (subActivity) can be obtained.
In the embodiment of the invention, when analyzing target activities (events) in different environments, the sub-activities (i.e., sub-events) of the scene can be defined, and each sub-activity obtains a sub-activity label through target detection, tracking, and speed calculation. Different environments have different sub-activity labels.
Taking the scenic spot dangerous area as an example, the following sub-activity labels can be defined:
nobody (None), person motionless (person_stopping), person approaching (closing), person crossing (across), and the like.
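In code these can simply be fixed as a small vocabulary; the representation as string constants is an assumption, while the label names mirror the list above:

```python
# Sub-activity vocabulary for the scenic spot dangerous area, mirroring the
# labels defined above; other environments would define their own set.
SCENIC_SUB_ACTIVITIES = (
    "None",             # nobody near the danger-area identifiers
    "person_stopping",  # person motionless
    "closing",          # person approaching an identifier
    "across",           # person crossing the warning line
)
```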
S14, inputting the sub-activity label set into a pre-obtained time and-or graph model to obtain a prediction result of the tourists' future activities; the time and-or graph model is obtained by utilizing a pre-established activity corpus of the targets of the scenic spot dangerous area.
The time and-or graph model construction process comprises the following steps:
First, videos of sample scenic spot dangerous areas of the environment are observed, corpora of the various events involving the targets in those videos are extracted, and an activity corpus of the targets of the environment is established.
The targets include the tourists and the dangerous area identifiers. In this activity corpus, the activity state of a target is represented by a sub-activity label, and an event is composed of a set of sub-activities.
By analyzing videos of different sample scenic spot dangerous areas in the environment, a corpus of events is obtained, each event being a possible combination of leaf nodes appearing in time sequence. For example, for a traffic intersection scene, the defined sub-activity labels may include: vehicle parked (car_stopping), person motionless (person_stopping), person and vehicle moving apart (away), vehicle accelerating (accelerate), vehicle decelerating (decelerate), vehicle at uniform speed (moving_uniform), person and vehicle approaching (closing), no person or vehicle (None), person crossing the zebra crossing (walking, running), and collision (blast). A corpus entry such as "closing person_stopping moving_uniform walking away" then represents a video that reads: the person and the vehicle approach each other, the person stands still, the vehicle passes at a uniform speed, the person crosses, and the person and the vehicle move apart.
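In this representation the activity corpus is nothing more than a list of time-ordered label sentences, one per sample video; the entries below are illustrative, echoing the traffic-intersection example above:

```python
# Each entry is one observed event: a time-ordered sentence of sub-activity
# labels extracted from one sample video of the environment.
activity_corpus = [
    ["closing", "person_stopping", "moving_uniform", "walking", "away"],
    ["closing", "car_stopping", "walking", "away"],
    ["None"],
    ["closing", "accelerate", "blast"],   # a collision event
]
```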
The embodiment of the invention requires the scene corpus to cover the events of the scene as completely as possible, so that target activity prediction is more accurate.
For the activity corpus, the symbolic grammar structure of each event is learned using an ADIOS-based grammar induction algorithm, and the sub-activities are taken as the terminal nodes of the time and-or graph, yielding the time and-or graph model.
Specifically, the ADIOS-based grammar induction algorithm learns And nodes and Or nodes by generating significant patterns and equivalence classes. The algorithm first loads the activity corpus onto a graph whose vertices are the sub-activities, extended by two special symbols (start and end). Each event sample is represented by a separate path on the graph. Candidate patterns are then generated by traversing the different search paths. At each iteration, each sub-path is tested for statistical significance according to a context-sensitive criterion, and the significant patterns are identified as And nodes. The algorithm then finds equivalence classes by looking for units that are interchangeable in a given context; an equivalence class is identified as an Or node. At the end of an iteration, the significant pattern is added to the graph as a new node, replacing the sub-paths it contains. The raw sequence data of symbolic sub-activities is obtained from the activity corpus of the targets of the environment, and the symbolic grammar structure of each event is learned from it with the ADIOS-based grammar induction algorithm. Embodiments of the present invention tend to use shorter significance patterns so that the basic grammar elements can be captured. As an example, a T-AOG generated from a traffic intersection corpus is shown in FIG. 5, a diagram of an exemplary traffic intersection temporal grammar (T-AOG) according to an embodiment of the present invention; the double-circle and single-circle nodes are the And nodes and Or nodes, respectively, the number on a branch edge of an Or node (a fraction less than 1) represents the branch probability, and the numbers on the edges of an And node represent the temporal expansion order.
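The full ADIOS algorithm is considerably more involved, but its two core operations, promoting statistically significant sub-sequences to And nodes and grouping context-interchangeable symbols into Or nodes, can be caricatured in a few lines; this is a toy sketch under strong simplifications, not the published algorithm:

```python
from collections import Counter, defaultdict

def significant_bigrams(corpus, min_count=2):
    """Toy stand-in for ADIOS significant-pattern detection: any bigram that
    repeats at least min_count times across the corpus becomes an And node."""
    counts = Counter()
    for sentence in corpus:
        counts.update(zip(sentence, sentence[1:]))
    return {bigram for bigram, c in counts.items() if c >= min_count}

def equivalence_classes(corpus):
    """Toy stand-in for ADIOS equivalence classes: symbols occurring between
    the same left and right context are interchangeable, i.e. an Or node."""
    by_context = defaultdict(set)
    for sentence in corpus:
        for left, mid, right in zip(sentence, sentence[1:], sentence[2:]):
            by_context[(left, right)].add(mid)
    return [symbols for symbols in by_context.values() if len(symbols) > 1]
```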
After the time and-or graph model is obtained, the method may include the following steps:
The sub-activity label set is input into the time and-or graph model, and the prediction result of the tourists' future activities in the environment is obtained using the online symbolic prediction algorithm of an Earley parser; the prediction result includes the tourists' future sub-activity labels and their occurrence probability values.
These sub-activity labels represent the positional relationship or motion state of a pair of attention targets at a future moment. In this step, the sub-activity label set covering every pair of attention targets may be input into the time and-or graph model, so that the prediction result includes the future sub-activity labels and occurrence probability values of every pair; it is equally reasonable to input the sub-activity label set of one particular pair and obtain the future sub-activity labels and occurrence probability values of that pair. Note that a pair of attention targets means an attended tourist together with an attended dangerous area identifier, which may specifically be the tourist paired with the first identifier or with the second identifier.
The embodiment of the invention constructs the T-AOG from the activity corpus of the tourists in the environment, uses the sub-activity label set obtained via the S-AOG as the input of the T-AOG, and then predicts the next possible sub-activity on the T-AOG with an online symbolic prediction algorithm based on an Earley parser.
In an embodiment of the present invention, using the symbolic prediction algorithm of the Earley parser, the current sentence of sub-activities is given as input to the parser, and all pending states are scanned to find the next possible terminal nodes (sub-activities).
For details of the symbolic prediction algorithm of the Earley parser, refer to the description of the related art.
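The flavor of that prediction step can be conveyed with a toy stand-in: represent the T-AOG as a small probabilistic grammar, enumerate the derivations consistent with the observed prefix, and read off the next terminals with normalized probabilities. The grammar and its probabilities below are illustrative assumptions, and a real implementation would use the incremental Earley states rather than brute-force enumeration:

```python
# T-AOG as a probabilistic grammar: an Or node maps to weighted alternatives,
# an And node to an ordered sequence of children.
GRAMMAR = {
    "EVENT":         [(0.7, ["closing", "APPROACH_TAIL"]), (0.3, ["None"])],
    "APPROACH_TAIL": [(0.6, ["person_stopping", "away"]), (0.4, ["across"])],
}

def expand(symbol):
    """Yield (sentence, probability) for every terminal string the symbol derives."""
    if symbol not in GRAMMAR:        # a terminal sub-activity label
        yield [symbol], 1.0
        return
    for p, rhs in GRAMMAR[symbol]:
        seqs = [([], p)]
        for child in rhs:
            seqs = [(s + cs, sp * cp) for s, sp in seqs for cs, cp in expand(child)]
        yield from seqs

def predict_next(prefix):
    """Brute-force stand-in for Earley prediction: probability of each
    possible next sub-activity given the observed prefix."""
    scores = {}
    for sentence, p in expand("EVENT"):
        if sentence[:len(prefix)] == prefix and len(sentence) > len(prefix):
            nxt = sentence[len(prefix)]
            scores[nxt] = scores.get(nxt, 0.0) + p
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()} if total else {}

# predict_next(["closing"]) -> {"person_stopping": 0.6, "across": 0.4}
```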
In summary, in the embodiment of the present invention, the target activity is represented by a space-time and-or graph (ST-AOG), which is composed of a spatial and-or graph (S-AOG) and a temporal and-or graph (T-AOG). The ST-AOG can be understood as taking the root nodes of the spatial and-or graph as the terminal nodes of the temporal and-or graph. The S-AOG represents the state of the scene: it hierarchically expresses the spatial relationships among the targets through the targets and their attributes, and represents the minimal sub-events (sub-activity labels such as the tourist standing still, approaching, moving away, and so on) through the spatial positional relationships obtained by target detection. The root node of the S-AOG is a sub-activity label, and its terminal nodes are the targets and the relationships between them. The T-AOG is a stochastic temporal grammar that represents the hierarchical decomposition of an event into sub-events and models the target activity; its root node is the activity (event) and its terminal nodes are the sub-activities (sub-events).
The embodiment of the invention verifies the prediction effect with a confusion matrix of predicted versus actual sub-activities. The prediction accuracy reaches about 90%, higher than that of predicting from sub-activity labels obtained with a conventional target detection method, which shows that the sub-activity prediction of the embodiment of the invention is highly accurate.
S15, managing and controlling the scenic spot dangerous area based on the prediction result.
Optionally, the step may include:
and S151, when the prediction result shows that the tourist approaches the first identifier, sending danger warning voice to the tourist.
The prediction result comprises sub-activity labels such as the tourist being motionless, approaching, or moving away; the sub-activity label that triggers the danger reminder or activates the danger prevention device is always "approaching", i.e., the prediction that the tourist is approaching the first identifier or the second identifier, each starting a different protection plan accordingly.
In this step, when the prediction result indicates that the tourist is approaching the first identifier, a danger warning voice is sent to the tourist. This can be realized through a broadcasting device arranged in the scenic spot dangerous area: when the prediction result indicates that the tourist is approaching the first identifier, the controller is triggered to make the broadcasting device play the preset danger warning voice.
S152, when the prediction result indicates that the tourist is approaching the second identifier, a control signal is sent to the dangerous area fence controller to pull out or raise the dangerous area fence and establish the enclosure.
Specifically, when the prediction result indicates that the tourist is approaching the second identifier, the dangerous area fence controller is triggered to pull out or raise the dangerous area fence and establish the enclosure.
It should be noted that the dangerous area fence can be an electrically controlled fence. In one mode, the fence sits on the ground, folded to one side when idle; on receiving a control signal it is pulled out from that side to the other to establish the enclosure. In another mode, it is located under the ground and is raised from the ground to establish the enclosure when the control signal is received.
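Putting S151 and S152 together, the control logic reduces to a small dispatcher; play_warning_audio and raise_enclosure are hypothetical placeholders for whatever broadcast and fence-controller interfaces are actually installed:

```python
def control_danger_area(prediction, broadcaster, fence_controller):
    """Dispatch a protective action from one predicted sub-activity.

    prediction: (label, marker) pair, e.g. ("closing", "first") meaning the
    tourist is predicted to approach the first (outer) identifier.
    """
    label, marker = prediction
    if label != "closing":
        return  # only an approaching sub-activity triggers a protection plan
    if marker == "first":
        broadcaster.play_warning_audio()     # hypothetical broadcast-device API
    elif marker == "second":
        fence_controller.raise_enclosure()   # hypothetical fence-controller API
```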
The embodiment of the invention provides a scenic spot dangerous area management and control method based on artificial intelligence that introduces the space-time and-or graph into target activity prediction for the first time. First, a spatial and-or graph model is generated by analyzing the spatial positional relationship between the tourists and the dangerous area identifiers in the scenic spot dangerous area video. Next, activity states are extracted from the spatial and-or graph model to obtain a sub-activity label set covering every attention target, tourists and dangerous area identifiers alike, realizing high-level semantic extraction from the scenic spot dangerous area video. The sub-activity label set is then used as the input of the pre-obtained time and-or graph model, and the prediction of the attended tourist's next sub-activity is obtained through the temporal grammar of the time and-or graph. With the space-time and-or graph model, the embodiment of the invention can accurately and quickly predict the activities of tourists in the dangerous area, realizing timely and effective danger reminding and prevention.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A scenic spot dangerous area management and control method based on artificial intelligence is characterized by comprising the following steps:
collecting videos of a scenic spot dangerous area, wherein targets in the videos of the scenic spot dangerous area comprise tourists and dangerous area identifications;
generating a spatial and-or graph model of the spatial positional relationship between the tourists and the dangerous area identification based on the video of the scenic spot dangerous area;
extracting a sub-activity label set characterizing the tourists' activity state from the spatial and-or graph model;
inputting the sub-activity label set into a time and-or graph model obtained in advance to obtain a prediction result of the future activity of the tourist; the time and-or graph model is obtained by utilizing a pre-established activity corpus of targets of the scenic spot dangerous area;
and managing and controlling the scenic spot dangerous area based on the prediction result.
2. The artificial intelligence based scenic spot dangerous area management and control method of claim 1, wherein the dangerous area identification comprises a first identification and a second identification, and the distance between the first identification and the danger edge is greater than the distance between the second identification and the danger edge.
3. The artificial intelligence based scenic spot dangerous area management and control method according to claim 1 or 2, wherein the dangerous area identification comprises a warning board standing on the ground and a warning line laid on the ground.
4. The artificial intelligence based scenic spot dangerous area management and control method of claim 1 or 3, wherein the generating of the spatial and-or graph model of the spatial positional relationship between the tourists and the dangerous area identification based on the video of the scenic spot dangerous area comprises:
detecting the targets in the video of the scenic spot dangerous area by using a target detection network obtained by pre-training to obtain attribute information corresponding to each target in each frame of image of the video; wherein the attribute information includes position information of a bounding box containing the target;
matching the same target in each frame of image of the video of the scenic spot dangerous area by utilizing a preset multi-target tracking algorithm based on the attribute information corresponding to each target in each frame of image;
determining an actual spatial distance between the tourists and the dangerous area identifier in each frame of image;
and generating the spatial and-or graph model of the scenic spot dangerous area by using the attribute information of the targets in each frame of image after matching and the actual spatial distance.
5. The artificial intelligence based scenic spot dangerous area management and control method of claim 4, wherein the target detection network is based on the YOLO_v3 network, with the residual modules of its backbone replaced by densely connected modules.
6. The artificial intelligence based scenic spot dangerous area management and control method of claim 5, wherein the target detection network comprises a plurality of dense connection modules and transition modules connected in series at intervals; the number of dense connection modules is at least three; each dense connection module comprises a convolutional network module and a dense connection unit group connected in series; the convolutional network module comprises a convolutional layer, a BN layer and a Leaky ReLU layer connected in series; the dense connection unit group comprises m dense connection units; each dense connection unit comprises a plurality of convolutional network modules connected in a dense connection mode, the feature maps output by the convolutional network modules being fused in cascade; wherein m is a natural number of 4 or more.
7. The artificial intelligence based scenic spot dangerous area management and control method of claim 2 or 6, wherein the determining an actual spatial distance between the tourists and the dangerous area identification in each frame of image comprises:
in each frame image, determining the pixel coordinate of each target;
aiming at each target, calculating the corresponding actual coordinate of the pixel coordinate of the target in a world coordinate system by using a monocular vision positioning and ranging technology;
and aiming at each frame of image, obtaining the actual spatial distance between the tourist and the dangerous area identifier in the frame of image by using the actual coordinates of the tourist and the dangerous area identifier in the frame of image.
8. The artificial intelligence based scenic spot dangerous area management and control method according to claim 2 or 6, wherein the extracting of the sub-activity label set characterizing the activity states of the tourists from the spatial and-or graph model comprises:
determining, in the spatial and-or graph model, the dangerous area identifier and the tourists whose actual spatial distance from the dangerous area identifier is smaller than a preset distance threshold as attention targets;
determining, for each frame image, the actual spatial distance of each pair of attention targets and the speed value of each attention target;
obtaining distance change information representing the actual spatial distance change of each pair of attention targets and speed change information representing the speed value change of each attention target by sequentially comparing each frame image with its previous frame image;
and describing the distance change information and speed change information obtained in sequence for each attention target with semantic labels, and generating a sub-activity label set representing the activity states of the tourists and the dangerous area identifier.
9. The artificial intelligence based scenic spot dangerous area management and control method of claim 8, wherein the inputting the sub-activity label set into a pre-obtained time and-or graph model to obtain the prediction result of the future activity of the tourist comprises:
inputting the sub-activity label set into the time and-or graph model, and obtaining the prediction result of the future activities of the tourist in the environment by using an online symbolic prediction algorithm of an Earley parser, wherein the prediction result comprises future sub-activity labels of the tourist and occurrence probability values.
10. The method for managing and controlling the hazardous area of the scenic spot based on artificial intelligence as claimed in claim 2, wherein the managing and controlling the hazardous area of the scenic spot based on the prediction result comprises:
when the prediction result indicates that the tourist approaches the first identifier, sending a danger warning voice to the tourist;
and when the prediction result indicates that the tourist approaches the second identifier, sending a control signal to a dangerous area fence controller to pull or lift the dangerous area fence to establish the enclosure.
CN202011570349.0A 2020-12-26 2020-12-26 Scenic spot dangerous area management and control method based on artificial intelligence Withdrawn CN112613668A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011570349.0A CN112613668A (en) 2020-12-26 2020-12-26 Scenic spot dangerous area management and control method based on artificial intelligence

Publications (1)

Publication Number Publication Date
CN112613668A true CN112613668A (en) 2021-04-06

Family

ID=75247949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011570349.0A Withdrawn CN112613668A (en) 2020-12-26 2020-12-26 Scenic spot dangerous area management and control method based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN112613668A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392263A (en) * 2021-06-24 2021-09-14 上海商汤科技开发有限公司 Data labeling method and device, electronic equipment and storage medium
CN113627405A (en) * 2021-10-12 2021-11-09 环球数科集团有限公司 Scenic spot danger monitoring method and device and computer equipment
CN117253193A (en) * 2023-10-13 2023-12-19 济南瑞源智能城市开发有限公司 Intelligent security monitoring method and equipment comprising scenic spot of large water area
CN117253193B (en) * 2023-10-13 2024-04-23 济南瑞源智能城市开发有限公司 Intelligent security monitoring method and equipment comprising scenic spot of large water area
CN117689207A (en) * 2023-12-25 2024-03-12 暗物质(北京)智能科技有限公司 Risk assessment table generation method, system, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
WW01 Invention patent application withdrawn after publication (Application publication date: 20210406)