CN117237867A - Self-adaptive field monitoring video target detection method and system based on feature fusion - Google Patents
- Publication number
- CN117237867A (application CN202311191030.0A)
- Authority
- CN
- China
- Prior art keywords
- current frame
- network
- feature
- frame
- key frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a self-adaptive field monitoring video target detection method and system based on feature fusion, which solve the problems of insufficient fusion of temporal context information and slow detection speed in video target detection for airport surface surveillance scenes. The technical solution is as follows. Step 1: calculate the optical flow motion field of the current frame and the temporal feature consistency judgment matrix Qk2i. Step 2: judge whether the current frame is a key frame according to a temporally adaptive key-frame dynamic scheduling strategy and the judgment matrix Qk2i. Step 3: if the current frame is a key frame, input it into the feature extraction network to obtain a feature map, then input the feature maps of the current frame and the previous key frame into an adaptive weight network for weighted fusion to obtain the final feature map of the current frame. Step 4: if the current frame is not a key frame, adopt a spatially adaptive local feature update method: extract features of part of the regions through a convolutional network, compute the features of the remaining regions from the optical flow motion field and the feature map of the previous key frame, and then fuse the two parts to obtain the feature map of the current frame. Step 5: input the feature map of the current frame into the classification and localization network to obtain the detection result. The method improves both the accuracy and the detection speed of airport surface surveillance video target detection for aircraft and vehicle position data in complex scenes.
Description
Technical Field
The invention belongs to the technical field of video target detection in computer vision, and particularly relates to a feature-fusion-based self-adaptive field monitoring video target detection method and system for airport environments, used for identifying and detecting aircraft and vehicle targets on the airport surface.
Background
Target detection is a very important research topic in the field of computer vision. The target detection model not only can be applied to the fields of various security monitoring systems, automatic driving systems, unmanned aerial vehicles and the like, but also has wide commercial application, such as face recognition, license plate recognition, medical image analysis and the like. With the continuous development of deep learning technology, more and more excellent target detection algorithms are proposed, which enable target detection with stronger accuracy, faster speed and more efficient capability of processing large amounts of data.
Deep learning is an artificial intelligence technique that implements machine learning by simulating the structure and function of the neural networks of the human brain. In image object detection, deep learning has many advantages, such as high accuracy, high speed, and the ability to process large amounts of data, and is therefore widely used in that field. Video object detection, however, faces more complex difficulties and challenges. First, the volume of video data is enormous and time-consuming to process, which demands faster and more efficient algorithms. Second, objects in video may undergo motion, occlusion, deformation, and other changes, all of which make object detection harder. Furthermore, video object detection must run in real time, which places still higher demands on algorithm speed and efficiency.
Before the advent of deep learning, conventional target detection methods could generally be divided into three parts: region selection (sliding window), feature extraction (SIFT, HOG, etc.), and classification (SVM, AdaBoost, etc.). For example, the Viola-Jones detector uses a sliding-window approach to check whether the target is present in each window. The main problems are as follows: on one hand, the sliding-window selection strategy is untargeted, has high time complexity, and produces redundant windows; on the other hand, manually designed features have poor robustness. In the deep learning era, object detection can be divided into two categories: "two-stage detection" and "one-stage detection". The former formulates detection as a "coarse to fine" process, while the latter is "one step in place". For example, RCNN is a typical two-stage deep-learning target detection algorithm. It first selects candidate object boxes through a selective search algorithm, then resizes the image within each selected box to a fixed size and feeds it to a CNN model to extract features, and finally feeds the extracted features to a classifier to predict whether the box contains a target to be detected and, further, to which category the target belongs.
However, these video object detection methods often use the same processing method for all frames, or perform object detection for selecting a fixed frame as a key frame, which results in extremely slow detection and recognition speeds. At the same time, these methods tend to ignore timing information between different frames in the video, which is critical to improving detection accuracy and speed. Therefore, how to accurately capture timing information between different contexts in video and to adopt different processing strategies is worthy of intensive research, and can make an important contribution to improving the accuracy and speed of video object detection.
Disclosure of Invention
The invention aims to solve the problems in the prior art, and provides a self-adaptive field monitoring video target detection method and a self-adaptive field monitoring video target detection system based on feature fusion in an airport-oriented environment, which achieve the balance of detection speed while fully fusing the features of time sequence contexts.
The technical solution for realizing the purpose of the invention is as follows: the feature-fusion-based self-adaptive field monitoring video target detection method and system comprise the following steps:
step 1: determining a video stream comprising an object to be detected, wherein the video stream comprises a multi-frame image sequence, and the image comprises the object to be detected;
step 2: adopting a ResNet network as a feature extraction network Nfeat, adopting an RFCN network as a classification positioning network Ntask, designing a convolutional neural network as a weight network Nw, and designing an optical flow network FlowNet based on the convolutional neural network;
step 3: if the current frame is the first frame of the video stream, selecting the current frame as a key frame, extracting the features of the current frame image by using the feature extraction network Nfeat, and then directly inputting the feature map into the classification and positioning network Ntask for classification and positioning to obtain a target detection result;
step 4: if the current frame is not the first frame of the video stream, calculating an optical flow motion field and a characteristic time sequence consistency judging matrix Qk2i of the current frame and the previous key frame according to a time sequence self-adaptive key frame dynamic scheduling strategy, and judging that the current frame is a key frame or a non-key frame;
step 5: if the current frame is a key frame, extracting the characteristics of the current frame image by utilizing the characteristic extraction network Nfeat, calculating a fusion characteristic image obtained by aggregating the characteristic images of the current frame and the previous key frame through a weight network Nw, and then classifying and positioning to obtain a target detection result;
step 6: if the current frame is a non-key frame, a space self-adaptive local feature updating method is adopted, a feature map of the current frame is calculated according to the consistency judging matrix Qk2i obtained in the step 4 and the feature map of the previous key frame, and then classification and positioning are carried out to obtain a target detection result.
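As a minimal sketch, the per-frame dispatch of steps 3-6 can be written as follows. All names here (the state dictionary, the toy intensity-difference "flow", the threshold values) are illustrative assumptions, not the patent's implementation; the real method uses the Nfeat, FlowNet, Nw networks and the Qk2i matrix described below.

```python
import numpy as np

def detect_frame(frame, state, motion_thresh=1.0, ratio_thresh=0.2):
    """Dispatch one frame along the key / non-key paths of steps 3-6.

    `state` carries the previous key frame and its feature map. The feature
    extraction, optical flow and fusion calls are placeholder identity ops
    standing in for Nfeat, FlowNet and Nw; only the control flow is real.
    """
    if state.get("key_frame") is None:                    # step 3: first frame
        state["key_frame"] = frame
        state["key_feat"] = frame.copy()                  # stand-in for Nfeat(frame)
        return "first-key"
    # Step 4: toy "optical flow" = per-pixel intensity change vs. the key frame.
    displacement = np.abs(frame - state["key_frame"])
    q = (displacement > motion_thresh).astype(np.uint8)   # Qk2i stand-in
    if q.mean() > ratio_thresh:                           # step 5: new key frame
        state["key_frame"] = frame
        state["key_feat"] = frame.copy()                  # extract + fuse with Nw
        return "key"
    return "non-key"                                      # step 6: partial update

frames = [np.zeros((8, 8)), np.zeros((8, 8)), np.full((8, 8), 5.0)]
state = {}
labels = [detect_frame(f, state) for f in frames]
print(labels)   # ['first-key', 'non-key', 'key']
```

The first frame is always a key frame; a static second frame takes the cheap non-key path, and a frame with large motion triggers a new key frame.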
Further, step 1: determining a video stream comprising an object to be detected, wherein the video stream comprises a multi-frame image sequence, and the image comprises the object to be detected;
further, in the step 2, a ResNet network is used as the feature extraction network Nfeat, an RFCN is used as the classification and positioning network Ntask, a convolutional neural network is designed as the weight network Nw, and an optical flow network FlowNet based on the convolutional neural network is designed, which specifically includes:
step 2-1: the ResNet-based feature extraction network, nfeat, is constructed for computing a feature map of the image. The modified ResNet-101 model is used herein.
Wherein ResNet-101 discards the final classification layer, modifies the stride of the first block of conv5 to 1, and applies the hole (atrous) algorithm, i.e., dilated convolution, to all 3x3 convolution kernels in conv5, thereby keeping the receptive field of the model unchanged. The overall stride of Nfeat is 16, i.e., the spatial resolution of the Nfeat output is 1/16 of the input.
At the end of conv5, a 3x3 convolutional layer is also added to reduce the feature channel dimension to 1024. The modified ResNet-101 model requires pre-training from the dataset. And inputting the image of the current key frame into a Nfeat network, and outputting the feature map of the key frame.
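The claim that replacing stride 2 with stride 1 plus dilation 2 leaves the receptive field unchanged can be checked with standard receptive-field arithmetic. The two-layer configuration below is an illustrative simplification, not the exact ResNet-101 conv5 block:

```python
def receptive_field(layers):
    """Output receptive field for a stack of (kernel, stride, dilation) layers."""
    rf, jump = 1, 1          # jump = input-pixel distance between adjacent outputs
    for k, s, d in layers:
        rf += (k - 1) * d * jump
        jump *= s
    return rf

# Original conv5 entry: a 3x3 stride-2 conv followed by a 3x3 stride-1 conv.
orig = receptive_field([(3, 2, 1), (3, 1, 1)])
# Modified: stride 1 throughout, with dilation 2 on the subsequent 3x3 kernels.
mod = receptive_field([(3, 1, 1), (3, 1, 2)])
print(orig, mod)   # 7 7 -- same receptive field, but no extra downsampling
```

Both stacks see a 7-pixel window, which is why the modification keeps the model's receptive field while raising the output resolution.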
Step 2-2: the RFCN-based classification positioning network Ntask is constructed for calculating the classification and positioning of the current frame. Firstly, the feature map of the current frame image needs to be extracted, and then the feature map is input into an Ntask network for processing.
In the Ntask network, a series of convolution and pooling operations are first performed on the input feature map to extract features therein. And then, by means of structures similar to the full connection layer, the extracted features are subjected to dimension reduction and adjustment, and finally, the target classification and the position information corresponding to the current frame are output. This position information includes parameters such as the upper left corner coordinates of the target frame, the width and height of the frame.
Step 2-3: adopting an adaptive-weight feature fusion mode. A weight network based on a convolutional neural network is constructed, comprising 3 convolutional layers and 3 activation layers. The current frame feature map and the previous key frame feature map are input into the network together; after the final pooling layer, importance weights are obtained through a cosine similarity function. According to the weights, the two feature maps are weighted and summed to obtain the feature map of the new key frame;
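A simplified stand-in for this adaptive-weight fusion is sketched below: per-pixel cosine similarity across channels yields the importance weight, which then combines the two feature maps. The patent learns an embedding with the 3-conv/3-activation network Nw before the similarity; here the raw feature maps are used directly, and the mapping of similarity to a [0, 1] weight is an assumption.

```python
import numpy as np

def cosine_weight_fuse(feat_cur, feat_key, eps=1e-8):
    """Fuse two (C, H, W) feature maps with per-pixel cosine-similarity weights."""
    num = (feat_cur * feat_key).sum(axis=0)
    den = np.linalg.norm(feat_cur, axis=0) * np.linalg.norm(feat_key, axis=0) + eps
    sim = num / den                  # cosine similarity in [-1, 1]
    w = (sim + 1.0) / 2.0            # weight of the key-frame feature, in [0, 1]
    return w * feat_key + (1.0 - w) * feat_cur

a = np.ones((4, 2, 2))
fused = cosine_weight_fuse(a, a)
print(np.allclose(fused, a))   # True: identical features fuse to themselves
```

Where the two maps agree, the key-frame feature dominates; where they disagree, the current frame's fresh feature takes over.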
step 2-4: and constructing an optical flow network FlowNet based on FlowNetCorr for calculating an optical flow motion field, reducing the number of convolution kernels in the FlowNetCorr network by half, and improving the running speed. The current frame and the previous key frame are input into the FlowNet together, and the optical flow motion field of the current frame is obtained by output. Optical flow motion field is an important concept in computer vision for describing the motion of objects in a sequence of images. It represents the motion vector of each pixel in an image, i.e. the direction and speed of movement of each pixel in the image sequence. The calculation of the optical flow motion field is based on one assumption: between adjacent image frames, the pixels on the same object have the same motion. Thus, by comparing pixel values in adjacent image frames, a motion vector for each pixel can be calculated, resulting in an optical flow motion field.
Further, step 3: if the current frame is the first frame of the video stream, selecting the current frame as a key frame, extracting the features of the current frame image by using the feature extraction network Nfeat, and then directly inputting the feature map into the classification and positioning network Ntask for classification and positioning to obtain a target detection result;
further, in step 4, if the current frame is not the first frame of the video stream, according to the timing-adaptive key frame dynamic scheduling policy, calculating the optical flow motion field and the feature timing consistency judging matrix Qk2i of the current frame and the previous key frame, and judging whether the current frame is a key frame or a non-key frame, specifically including:
step 4-1: inputting the current frame and the previous key frame into the optical flow network FlowNet in the step 2-4 together, and outputting to obtain an optical flow motion field between the current frame and the previous key frame;
step 4-2: according to the optical flow motion field obtained in the step 4-1, calculating a characteristic time sequence consistency judging matrix Qk2i, wherein the specific calculation mode is as follows: setting a threshold value, setting a corresponding element of a matrix Qk2i to be 1 if the position offset of each pixel point on the optical flow motion field exceeds the threshold value, otherwise setting the corresponding element to be 0; if the offset exceeds the threshold, we consider that the motion amplitude of the object at the pixel point is larger.
Step 4-3: judging whether the current frame is a key frame or not according to the characteristic time sequence consistency judging matrix Qk2i; the method comprises the following steps: setting a threshold value, if the proportion of the element values 1 in the Qk2i exceeds the threshold value, selecting the current frame as a key frame, otherwise, selecting the current frame as a non-key frame. This is because when the ratio of 1 exceeds the threshold, the offset of more pixels in the image also exceeds the threshold; we consider that in the current image, the motion amplitude of the object is large, so it needs to be selected as a key frame.
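Steps 4-2 and 4-3 translate almost directly into code. The two threshold values below are illustrative assumptions; the patent does not fix their numeric values.

```python
import numpy as np

def consistency_matrix(flow, motion_thresh=1.0):
    """Qk2i (step 4-2): 1 where the per-pixel flow displacement exceeds the threshold."""
    mag = np.linalg.norm(flow, axis=-1)
    return (mag > motion_thresh).astype(np.uint8)

def is_key_frame(q, ratio_thresh=0.3):
    """Step 4-3: key frame when the fraction of '1' entries exceeds the threshold."""
    return float(q.mean()) > ratio_thresh

# 10x10 flow field where the top 4 rows moved by 2 px -> 40% ratio -> key frame.
flow = np.zeros((10, 10, 2))
flow[:4, :, 0] = 2.0
q = consistency_matrix(flow, motion_thresh=1.0)
print(q.sum(), is_key_frame(q, ratio_thresh=0.3))   # 40 True
```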
Further, if the current frame is a key frame in step 5, extracting features of the current frame image by using the feature extraction network Nfeat, calculating a fusion feature map obtained by aggregating feature maps of the current frame and a previous key frame through a weight network Nw, and then classifying and positioning to obtain a target detection result; the method specifically comprises the following steps:
step 5-1: inputting the current frame into the feature extraction network Nfeat in the step 2-1 to obtain a feature map;
step 5-2: and (3) inputting the feature images of the current frame and the previous key frame into the weight network Nw in the step (2-3) together to obtain a fused feature image obtained by aggregating the feature images of the two frames, wherein the fused feature image represents the feature image of the current frame. In this way, different adjacent key frames are iterated continuously, so that important information is transferred in the whole video.
Step 5-3: and (3) inputting the fused feature map into the classification and positioning network Ntask in the step (2-2) to classify and position the feature map so as to obtain a target detection result.
Further, if the current frame is a non-key frame in step 6, a space self-adaptive local feature updating method is adopted, and a feature map of the current frame is calculated according to the consistency judging matrix Qk2i obtained in step 4 and the feature map of the previous key frame, and then classification and positioning are performed to obtain a target detection result; the method specifically comprises the following steps:
step 6-1: according to the feature time sequence consistency judging matrix Qk2i obtained in the step 4-2, dividing the pixel point of the current frame corresponding to the area with the element value of 0 of the matrix Qk2i into an area A, dividing the pixel point of the current frame corresponding to the area with the element value of 1 of the matrix Qk2i into an area B, and jointly forming the current frame by the area A and the area B;
step 6-2: calculating a feature map of the area A of the current frame by combining the feature maps of the optical flow motion field and the previous key frame obtained in the step 4-1; since we consider the pixel offset of the a region to be small, for the purpose of detection speed, the feature map of the a region of the current frame is predicted using the feature map of the previous key frame and the optical flow motion field in between.
Step 6-3: and carrying out convolution extraction on the local area B by utilizing a characteristic extraction network for the area B of the current frame. Since we consider that the pixel shift amount of the B region is large, if the manner of step 6-2 is continued, the detection accuracy is adversely affected. Taking this into consideration, we use a convolution to extract features for the B region.
Step 6-4: recombining the features of region A and region B of the current frame to obtain the feature map of the current frame, and then inputting it into the classification and positioning network Ntask for classification and positioning to obtain the detection result. The recombined feature map of the current frame reduces the computational cost while preserving the accuracy of feature extraction. This approach improves overall performance compared with purely global convolutional feature extraction or purely optical-flow-based feature prediction.
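The recombination of steps 6-1 to 6-4 is a masked selection driven by Qk2i. In the sketch below, the Nfeat convolution over region B and the flow-warped region-A features are stood in for by precomputed arrays; only the partition-and-merge logic is shown.

```python
import numpy as np

def partial_update(fresh_feat, warped_feat, q):
    """Recombine region A (flow-propagated) and region B (re-extracted).

    q is the Qk2i matrix of shape (H, W): 0 -> region A, take the feature
    warped from the previous key frame; 1 -> region B, take the feature
    freshly extracted by convolution. Both feature maps are (C, H, W).
    """
    mask = q[None, ...].astype(bool)          # broadcast the mask over channels
    return np.where(mask, fresh_feat, warped_feat)

warped = np.zeros((2, 3, 3))                  # region-A (propagated) features
fresh = np.ones((2, 3, 3))                    # region-B (re-extracted) features
q = np.array([[0, 0, 1], [0, 1, 1], [0, 0, 0]])
feat = partial_update(fresh, warped, q)
print(feat[0].sum())   # 3 pixels lie in region B per channel -> 3.0
```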
The invention also discloses a self-adaptive field monitoring video target detection system based on feature fusion, which is characterized in that the system comprises:
the feature map acquisition module adopts a feature extraction network Nfeat based on ResNet as a backbone network and is used for extracting feature maps of key frames in the video;
the classification and positioning acquisition module adopts a classification and positioning network Ntask based on RFCN, and is used for calculating the classification and positioning of the image content according to the feature map of each frame of image in the video;
and the feature fusion module, which constructs a weight network based on the convolutional neural network. The current frame feature map and the previous key frame feature map are input into the network; after the final pooling layer, the result is processed by a cosine similarity function to obtain the importance weights. According to the weights, the two feature maps are weighted and fused to obtain the feature map of the new key frame;
and the optical flow motion field calculation module is used for constructing an optical flow network FlowNet based on the convolutional neural network for calculating the optical flow motion field. Inputting the current frame and the previous key frame into the FlowNet together, and outputting an optical flow motion field of the current frame;
and the key frame selection module, which adopts an adaptive decision mode: it calculates the feature timing consistency judgment matrix Qk2i of the current frame, and then adaptively judges according to the matrix whether the current frame is selected as a key frame.
The feature map calculation module is used for calculating the feature map of the current frame: for key frames, the feature map is calculated in the manner of step 5; for non-key frames, in the manner of step 6.
According to the embodiment of the feature-fusion-based adaptive scene monitoring video target detection system, the feature map acquisition module adopts a ResNet-based feature extraction network Nfeat as the backbone network for extracting the feature maps of key frames in the video, wherein ResNet-101 discards the final classification layer, the stride of the first block of conv5 is modified to 1, and the hole (atrous) algorithm, i.e., dilated convolution, is applied to all 3x3 convolution kernels in conv5, so that the receptive field of the model remains unchanged. The overall stride of Nfeat is 16, i.e., the spatial resolution of the Nfeat output is 1/16 of the input. At the end of conv5, a 3x3 convolutional layer is also added to reduce the feature channel dimension to 1024. The modified ResNet-101 model requires pre-training on a dataset. The image of the current key frame is input into the Nfeat network, and the feature map of the key frame is output.
According to one embodiment of the self-adaptive field monitoring video target detection system based on feature fusion, the feature fusion module adopts an adaptive-weight feature fusion mode. A weight network based on a convolutional neural network is constructed, comprising 3 convolutional layers and 3 activation layers. The current frame feature map and the previous key frame feature map are input into the network together; after the final pooling layer, importance weights are obtained through a cosine similarity function. According to the weights, the two feature maps are weighted and summed to obtain the feature map of the new key frame;
according to an embodiment of the adaptive scene monitoring video object detection system based on feature fusion of the present invention, the key frame selection module is further configured to perform the following processing:
inputting the current frame and the previous key frame into the optical flow network FlowNet in the step 2-4 together, and outputting to obtain an optical flow motion field between the current frame and the previous key frame;
according to the optical flow motion field obtained in the step 4-1, calculating a characteristic time sequence consistency judging matrix Qk2i in the following calculation mode: setting a threshold value, setting a corresponding element of a matrix Qk2i to be 1 if the position offset of each pixel point on the optical flow motion field exceeds the threshold value, otherwise setting the corresponding element to be 0;
and judging whether the current frame is a key frame or not according to the characteristic time sequence consistency judging matrix Qk2 i.
According to an embodiment of the adaptive scene-based surveillance video object detection system based on feature fusion of the present invention, the feature map calculation module is further configured to perform the following processing:
if the current frame is a key frame, respectively inputting the current frame and the previous key frame into the feature extraction network Nfeat in the step 2-1 to obtain feature graphs corresponding to the two frames;
inputting the feature images of the two frames into a weight network Nw together to obtain a fusion feature image after feature image aggregation of the two frames;
and inputting the fused feature images into a classification positioning network Ntask, classifying and positioning the feature images, and obtaining a target detection result.
If the current frame is a non-key frame, dividing the pixel point of the current frame corresponding to the area with the element value of 0 of the matrix Qk2i into an area A according to the judging matrix Qk2i, dividing the pixel point of the current frame corresponding to the area with the element value of 1 of the matrix Qk2i into an area B, and jointly forming the current frame by the area A and the area B;
calculating a feature map of the area A of the current frame by combining the feature maps of the optical flow motion field and the previous key frame obtained in the step 4-1;
and carrying out convolution extraction on the local area B by utilizing a characteristic extraction network for the area B of the current frame.
And recombining the characteristics of the area A and the area B of the current frame to obtain a characteristic diagram of the current frame, and then inputting a classification positioning network Ntask to classify and position to obtain a detection result.
Compared with the prior art, the invention has the following beneficial effects. Firstly, in the feature-fusion-based self-adaptive field monitoring video target detection method, during feature fusion the adaptive weights are calculated from the features of the previous key frame and of the current frame before fusion, so that target detection on the current frame combines the semantic information of all previous key frames. Secondly, the method adopts a temporally adaptive key-frame dynamic scheduling strategy in key frame selection: during the polling of each frame, it dynamically computes and decides whether the current frame is selected as a key frame, and is thus adaptive to temporal changes. Thirdly, in calculating the feature maps of non-key frames, the method accounts for the fact that feature propagation along the optical flow is prone to error where a non-key frame has large local changes relative to the key frame, and avoids this problem by adopting the spatially adaptive local feature update method.
The three points enable the method to be different from other methods in video target detection, achieve excellent balance in detection precision and speed, and obtain stable detection results in complex environments.
Drawings
The above-described features and advantages of the present invention will be better understood upon reading the detailed description of embodiments of the present disclosure in conjunction with the following drawings. In the drawings, components are not necessarily drawn to scale, and components having similar related features or characteristics may share the same or similar reference numerals.
FIG. 1 is a schematic diagram of an overall framework of an embodiment of the adaptive scene surveillance video object detection method based on feature fusion of the present invention.
Fig. 2 is a schematic diagram showing a weight network Nw and a feature fusion manner in the overall framework shown in fig. 1.
Fig. 3 shows a network structure of the optical flow network FlowNet in the overall framework shown in fig. 1.
Detailed Description
The invention is described in detail below with reference to the drawings and the specific embodiments. It is noted that the aspects described below in connection with the drawings and the specific embodiments are merely exemplary and should not be construed as limiting the scope of the invention in any way.
It should be noted that, if descriptions of "first", "second", etc. appear in the embodiments of the present invention, they are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, provided that the combination can be realized by those skilled in the art; when technical solutions are contradictory or cannot be realized, their combination should be considered absent and outside the scope of protection claimed by the present invention.
The invention aims to solve the above problems in the prior art by providing a feature-fusion-based adaptive scene monitoring video target detection method and system for airport environments, which fully fuse temporal context features while maintaining a balanced detection speed.
FIG. 1 illustrates an overall framework of one embodiment of the adaptive scene surveillance video object detection method based on feature fusion of the present invention. Referring to fig. 1, the implementation steps of the method of the present embodiment are described in detail below.
Step 1: determining a video stream comprising an object to be detected, wherein the video stream comprises a multi-frame image sequence, and the image comprises the object to be detected;
In step 2, a ResNet network is adopted as the feature extraction network Nfeat, an RFCN network is adopted as the classification and positioning network Ntask, a convolutional neural network is designed as the weight network Nw, and a convolutional-neural-network-based optical flow network FlowNet is designed. This specifically comprises the following steps:
Step 2-1: the ResNet-based feature extraction network Nfeat is constructed for computing a feature map of the image. A modified ResNet-101 model is used here.
The modification discards ResNet-101's final classification layer, changes the stride of the first block of conv5 to 1, and applies the hole (atrous) algorithm to all 3x3 convolution kernels in conv5, so that the model's receptive field is unchanged. The overall stride of Nfeat is 16, i.e. the spatial resolution of Nfeat's output is 1/16 of the input image.
At the end of conv5, a 3x3 convolutional layer is also added to reduce the feature channel dimension to 1024. The modified ResNet-101 model is pre-trained on a dataset. The current key frame image is input into the Nfeat network, which outputs the key frame's feature map.
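The stride and receptive-field bookkeeping above can be sketched with simple arithmetic. This is an illustrative aside, not the patent's implementation: keeping the conv5 stride at 1 while dilating its 3x3 kernels by 2 preserves the effective kernel extent that a stride-2 stage would have produced, and the overall stride of 16 fixes the output resolution.

```python
def output_size(input_size, total_stride=16):
    # With conv5's first block stride changed to 1, the network's
    # overall stride is 16, so a feature map side is 1/16 of the
    # input (assumed divisible, for simplicity).
    return input_size // total_stride

def effective_kernel(kernel=3, dilation=2):
    # A 3x3 kernel dilated by 2 spans the same 5x5 extent as a plain
    # 3x3 kernel applied after an extra stride-2 downsampling, which
    # is how the hole (atrous) algorithm preserves the receptive field.
    return dilation * (kernel - 1) + 1

print(output_size(1024))   # feature map side for a 1024-px input
print(effective_kernel())  # effective extent of a dilated 3x3 kernel
```
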
Step 2-2: the RFCN-based classification and positioning network Ntask is constructed for computing the classification and positioning of the current frame. The feature map of the current frame image is first extracted and then input into the Ntask network for processing.
In the Ntask network, a series of convolution and pooling operations is first performed on the input feature map to extract its features. Then, through fully-connected-like structures, the extracted features are reduced in dimension and adjusted, and finally the target classification and position information of the current frame are output. The position information includes parameters such as the top-left corner coordinates of the bounding box and its width and height.
Step 2-3: a feature fusion scheme with adaptive weights is adopted. A weight network based on a convolutional neural network is constructed, comprising 3 convolutional layers and 3 activation layers. The current frame feature map and the previous key frame feature map are input into the network together; after the last pooling layer, importance weights are obtained through a cosine similarity function. According to these weights, the two feature maps are combined by weighted summation to obtain a new key frame feature map;
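A rough numpy sketch of this adaptive-weight fusion follows. It is hedged: the patent's Nw is a trained 3-layer CNN whose embeddings feed the cosine similarity, which is omitted here, and the mapping from similarity to weights below is an illustrative assumption.

```python
import numpy as np

def fuse_feature_maps(feat_cur, feat_key, eps=1e-8):
    """Aggregate current-frame and previous-key-frame feature maps,
    both of shape (C, H, W), with weights derived from per-position
    cosine similarity between the two C-dimensional feature vectors."""
    num = (feat_cur * feat_key).sum(axis=0)
    den = (np.linalg.norm(feat_cur, axis=0)
           * np.linalg.norm(feat_key, axis=0) + eps)
    sim = num / den                 # (H, W), in [-1, 1]
    w_key = (sim + 1.0) / 2.0       # map to [0, 1]: illustrative choice
    w_cur = 1.0 - w_key             # weights sum to 1 per position
    return w_cur[None] * feat_cur + w_key[None] * feat_key

feats = np.ones((4, 2, 2))
fused = fuse_feature_maps(feats, feats)
print(fused.shape)  # identical inputs fuse back to themselves
```
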
Step 2-4: an optical flow network FlowNet based on FlowNetCorr is constructed for computing the optical flow motion field; the number of convolution kernels in the FlowNetCorr network is halved to improve running speed. The current frame and the previous key frame are input into FlowNet together, which outputs the optical flow motion field of the current frame. The optical flow motion field is an important concept in computer vision for describing the motion of objects in an image sequence. It represents the motion vector of each pixel, i.e. the direction and speed with which each pixel moves across the sequence. Its calculation rests on one assumption: between adjacent frames, pixels on the same object share the same motion. Thus, by comparing pixel values in adjacent frames, a motion vector for each pixel can be computed, yielding the optical flow motion field.
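The assumption above can be made concrete with a toy example. This is not FlowNet (a learned CNN predicting a dense per-pixel field); it recovers a single global motion vector by exhaustive shift search, purely to illustrate how comparing pixel values across frames reveals motion.

```python
import numpy as np

def flow_by_search(prev, cur, max_shift=2):
    """Find the (dx, dy) shift that best aligns two small grayscale
    frames -- a toy stand-in for dense optical flow estimation."""
    best, best_err = (0, 0), np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(prev, dy, axis=0), dx, axis=1)
            err = np.abs(shifted - cur).sum()  # brightness-constancy residual
            if err < best_err:
                best, best_err = (dx, dy), err
    return best

prev = np.zeros((8, 8)); prev[2, 2] = 1.0
cur = np.zeros((8, 8));  cur[3, 4] = 1.0   # the bright pixel moved right 2, down 1
print(flow_by_search(prev, cur))
```
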
Step 3: if the current frame is the first frame of the video stream, it is selected as a key frame; the features of the current frame image are extracted with the feature extraction network Nfeat, and the feature map is then input directly into the classification and positioning network Ntask for classification and positioning to obtain a target detection result;
If the current frame is not the first frame of the video stream, in step 4, according to the temporally adaptive key frame dynamic scheduling policy, the optical flow motion field and the feature temporal consistency judgment matrix Qk2i between the current frame and the previous key frame are computed, and the current frame is judged to be a key frame or not. This specifically includes:
Step 4-1: the current frame and the previous key frame are input together into the optical flow network FlowNet of step 2-4, which outputs the optical flow motion field between the current frame and the previous key frame;
Step 4-2: according to the optical flow motion field obtained in step 4-1, the feature temporal consistency judgment matrix Qk2i is calculated as follows: a threshold is set; for each pixel, if its position offset on the optical flow motion field exceeds the threshold, the corresponding element of matrix Qk2i is set to 1, otherwise to 0. An offset exceeding the threshold means the motion amplitude of the object at that pixel is considered large.
Step 4-3: whether the current frame is a key frame is judged according to the feature temporal consistency judgment matrix Qk2i, as follows: a threshold is set; if the proportion of elements equal to 1 in Qk2i exceeds the threshold, the current frame is selected as a key frame, otherwise as a non-key frame. The rationale is that when the proportion of 1s exceeds the threshold, many pixels in the image have large offsets; the motion amplitude of objects in the current image is then considered large, so the frame needs to be selected as a key frame.
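Steps 4-2 and 4-3 can be sketched as below. The threshold values are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def consistency_matrix(flow, motion_thresh):
    """Build the binary matrix Qk2i from an optical flow field of
    shape (H, W, 2): 1 where the per-pixel displacement magnitude
    exceeds the threshold (large motion), 0 elsewhere."""
    magnitude = np.linalg.norm(flow, axis=2)
    return (magnitude > motion_thresh).astype(np.uint8)

def is_key_frame(qk2i, ratio_thresh):
    """Select the current frame as a key frame when the fraction of
    large-motion positions exceeds the ratio threshold."""
    return qk2i.mean() > ratio_thresh

flow = np.zeros((4, 4, 2))
flow[:2] = 3.0                      # top half moves ~4.24 px, bottom is static
q = consistency_matrix(flow, motion_thresh=1.0)
print(q.sum())                      # 8 of 16 positions flagged
print(is_key_frame(q, ratio_thresh=0.3))
```
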
Step 5: if the current frame is a key frame, the features of the current frame image are extracted with the feature extraction network Nfeat, the fused feature map obtained by aggregating the feature maps of the current frame and the previous key frame through the weight network Nw is computed, and classification and positioning are then performed to obtain a target detection result. This specifically includes:
Step 5-1: the current frame is input into the feature extraction network Nfeat of step 2-1 to obtain a feature map;
Step 5-2: the feature maps of the current frame and the previous key frame are input together into the weight network Nw of step 2-3 to obtain a fused feature map aggregating the two frames' feature maps; this fused map serves as the feature map of the current frame. In this way, successive adjacent key frames are fused iteratively, so that important information propagates through the whole video.
Step 5-3: the fused feature map is input into the classification and positioning network Ntask of step 2-2 for classification and positioning to obtain a target detection result.
Step 6: if the current frame is a non-key frame, a spatially adaptive local feature updating method is adopted: the feature map of the current frame is computed from the consistency judgment matrix Qk2i obtained in step 4 and the feature map of the previous key frame, and classification and positioning are then performed to obtain a target detection result. This specifically includes:
Step 6-1: according to the feature temporal consistency judgment matrix Qk2i obtained in step 4-2, the pixels of the current frame corresponding to elements of Qk2i equal to 0 are assigned to region A, and the pixels corresponding to elements equal to 1 are assigned to region B; regions A and B together make up the current frame;
Step 6-2: the feature map of region A of the current frame is computed by combining the optical flow motion field obtained in step 4-1 with the feature map of the previous key frame. Since the pixel offsets of region A are considered small, for the sake of detection speed the feature map of region A is predicted from the previous key frame's feature map and the optical flow motion field between the two frames.
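A minimal numpy sketch of this flow-guided feature propagation follows. It uses nearest-neighbour sampling for brevity (bilinear sampling is the usual choice) and assumes the flow has already been downscaled to the feature map's resolution.

```python
import numpy as np

def warp_features(feat_key, flow):
    """Predict the current frame's features by warping the previous
    key frame's feature map (C, H, W) along a per-position flow
    field (H, W, 2) of (dx, dy) displacements."""
    c, h, w = feat_key.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Each output position pulls from its source location in the key frame.
    src_x = np.clip(np.round(xs - flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys - flow[..., 1]).astype(int), 0, h - 1)
    return feat_key[:, src_y, src_x]

feat = np.zeros((1, 4, 4)); feat[0, 1, 1] = 7.0
flow = np.ones((4, 4, 2))            # everything moved by (dx=1, dy=1)
warped = warp_features(feat, flow)
print(warped[0, 2, 2])               # the activation follows the motion
```
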
Step 6-3: for region B of the current frame, features are extracted by convolution over the local region B using the feature extraction network. Since the pixel offsets of region B are considered large, continuing in the manner of step 6-2 would harm detection accuracy. For this reason, convolution is used to re-extract features for region B.
Step 6-4: the features of regions A and B of the current frame are recombined to obtain the feature map of the current frame, which is then input into the classification and positioning network Ntask for classification and positioning to obtain a detection result. The recombined feature map preserves the accuracy of feature extraction while reducing the computational cost. This approach improves overall performance compared with purely global convolutional feature extraction or purely optical-flow-based prediction.
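The recombination of steps 6-1 to 6-4 amounts to a masked selection between the two feature sources. A sketch, under the assumption that Qk2i has been downscaled to the feature map's resolution:

```python
import numpy as np

def recombine(feat_warped, feat_conv, qk2i):
    """Spatially adaptive local feature update: take the flow-warped
    key-frame features where Qk2i is 0 (region A, small motion) and
    the freshly convolved features where Qk2i is 1 (region B, large
    motion). feat_warped, feat_conv: (C, H, W); qk2i: (H, W) of 0/1."""
    mask = qk2i[None].astype(bool)      # broadcast over channels
    return np.where(mask, feat_conv, feat_warped)

a = np.zeros((2, 3, 3))                 # cheap, flow-propagated features
b = np.ones((2, 3, 3))                  # expensive, re-extracted features
q = np.zeros((3, 3), dtype=np.uint8); q[0] = 1   # first row is region B
out = recombine(a, b, q)
print(out[:, 0].sum(), out[:, 1:].sum())  # region B from b, region A from a
```
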
For the specific details of the feature-fusion-based adaptive scene monitoring video target detection system of the present invention, reference may be made to the description above, which is not repeated here. The modules of the system may be implemented in whole or in part by software, hardware, or combinations thereof. The above modules may be embedded in hardware, independent of the processor in the computer device, or stored as software in the memory of the computer device so that the processor can call and execute the operations corresponding to the above modules.
The foregoing has outlined the basic principles, features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the above embodiments and descriptions merely illustrate the principles of the invention, and various changes and modifications may be made without departing from its spirit and scope. The scope of the invention is defined by the appended claims and their equivalents.
Claims (6)
1. The self-adaptive field monitoring video target detection method based on feature fusion is characterized by comprising the following steps of:
step 1: determining a video stream comprising an object to be detected, wherein the video stream comprises a multi-frame image sequence, and the image comprises the object to be detected;
step 2: adopting a ResNet network as a feature extraction network Nfeat, adopting an RFCN network as a classification positioning network Ntask, designing a convolutional neural network as a weight network Nw, and designing an optical flow network FlowNet based on the convolutional neural network;
step 3: if the current frame is the first frame of the video stream, selecting the current frame as a key frame, extracting the characteristics of the current frame image by utilizing the characteristic extraction network Nfeat, and then directly inputting the characteristic image into the classification and positioning network Ntask for classification and positioning to obtain a target detection result;
step 4: if the current frame is not the first frame of the video stream, calculating an optical flow motion field and a characteristic time sequence consistency judging matrix Qk2i of the current frame and the previous key frame according to a time sequence self-adaptive key frame dynamic scheduling strategy, and judging whether the current frame is a key frame or a non-key frame;
step 5: if the current frame is a key frame, extracting the characteristics of the current frame image by utilizing the characteristic extraction network Nfeat, calculating a fusion characteristic image obtained by aggregating the characteristic images of the current frame and the previous key frame through a weight network Nw, and then classifying and positioning to obtain a target detection result;
step 6: if the current frame is a non-key frame, a space self-adaptive local feature updating method is adopted, a feature map of the current frame is calculated according to the consistency judging matrix Qk2i obtained in the step 4 and the feature map of the previous key frame, and then classification and positioning are carried out to obtain a target detection result.
2. The method for detecting the target of the self-adaptive scene monitoring video based on feature fusion according to claim 1, wherein in the step 2, a ResNet network is adopted as a feature extraction network Nfeat, an RFCN is adopted as a classification positioning network Ntask, a convolutional neural network is designed as a weight network Nw, and an optical flow network FlowNet based on the convolutional neural network is designed, which comprises the following steps:
step 2-1: the ResNet-based feature extraction network, Nfeat, is constructed for computing a feature map of the image. Here a modified ResNet-101 model is used, where ResNet-101 discards the last classification layer, modifies the stride of the first block of conv5 to 1, applies the hole (atrous) algorithm on all 3x3 convolution kernels in conv5, and adds a 3x3 convolution layer after conv5, reducing the feature channel dimension to 1024. The image of the current key frame is input into the Nfeat network, which outputs the feature map of the key frame.
Step 2-2: the RFCN-based classification positioning network Ntask is constructed for calculating the classification and positioning of the current frame. And inputting the feature map obtained by calculating the current frame into an Ntask network, and outputting to obtain the classification and the positioning of the current frame.
Step 2-3: and adopting a characteristic fusion mode of self-adaptive weight. A weighting network based on a convolutional neural network is constructed. And inputting the current frame characteristic diagram and the previous key frame characteristic diagram into a network, and after the treatment of the last pooling layer is finished, obtaining importance weights through cosine similarity function treatment. According to the weight, carrying out weighted fusion on the two feature images to obtain a feature image of a new key frame;
step 2-4: an optical flow network FlowNet based on a convolutional neural network is constructed for calculating an optical flow motion field. The current frame and the previous key frame are input into the FlowNet together, and the optical flow motion field of the current frame is obtained by output.
3. The method for detecting a video object of adaptive scene monitoring based on feature fusion according to claim 1, wherein in step 4, if the current frame is not the first frame of the video stream, according to a dynamic scheduling policy of a time sequence adaptive key frame, an optical flow motion field and a feature time sequence consistency judging matrix Qk2i of the current frame and a previous key frame are calculated, and the current frame is judged to be a key frame or a non-key frame, which specifically comprises:
step 4-1: inputting the current frame and the previous key frame into the optical flow network FlowNet in the step 2-4 together, and outputting to obtain an optical flow motion field between the current frame and the previous key frame;
step 4-2: according to the optical flow motion field obtained in the step 4-1, calculating a characteristic time sequence consistency judging matrix Qk2i in the following calculation mode: setting a threshold value, setting a corresponding element of a matrix Qk2i to be 1 if the position offset of each pixel point on the optical flow motion field exceeds the threshold value, otherwise setting the corresponding element to be 0;
step 4-3: and judging whether the current frame is a key frame or not according to the characteristic time sequence consistency judging matrix Qk2i.
4. The method for detecting the target of the self-adaptive field monitoring video based on feature fusion according to claim 1, wherein in step 5, if the current frame is a key frame, features of the current frame image are extracted by using the feature extraction network Nfeat, a fusion feature map obtained by aggregating feature maps of the current frame and a previous key frame through a weight network Nw is calculated, and then classification and positioning are performed to obtain a target detection result; the method specifically comprises the following steps:
step 5-1: inputting the current frame into the feature extraction network Nfeat in the step 2-1 to obtain a feature map;
step 5-2: inputting the feature images of the current frame and the previous key frame into the weight network Nw in the step 2-3 together to obtain a fusion feature image after feature image aggregation of the two frames;
step 5-3: and (3) inputting the fused feature map into the classification and positioning network Ntask in the step (2-2) to classify and position the feature map so as to obtain a target detection result.
5. The method for detecting the target of the self-adaptive field monitoring video based on feature fusion according to claim 1, wherein in step 6, if the current frame is a non-key frame, a space self-adaptive local feature updating method is adopted, a feature map of the current frame is calculated according to the consistency discrimination matrix Qk2i obtained in step 4 and a feature map of a previous key frame, and then classification and positioning are performed to obtain a target detection result; the method specifically comprises the following steps:
step 6-1: according to the feature time sequence consistency judging matrix Qk2i obtained in the step 4-2, dividing the pixel point of the current frame corresponding to the area with the element value of 0 of the matrix Qk2i into an area A, dividing the pixel point of the current frame corresponding to the area with the element value of 1 of the matrix Qk2i into an area B, and jointly forming the current frame by the area A and the area B;
step 6-2: calculating a feature map of the area A of the current frame by combining the feature maps of the optical flow motion field and the previous key frame obtained in the step 4-1;
step 6-3: and carrying out convolution extraction on the local area B by utilizing a characteristic extraction network for the area B of the current frame.
Step 6-4: and recombining the characteristics of the area A and the area B of the current frame to obtain a characteristic diagram of the current frame, and then inputting a classification positioning network Ntask to classify and position to obtain a detection result.
6. The self-adaptive field monitoring video target detection system based on feature fusion is characterized in that the system comprises:
the feature map acquisition module adopts a feature extraction network Nfeat based on ResNet as a backbone network and is used for extracting feature maps of key frames in the video;
the classification and positioning acquisition module adopts a classification and positioning network Ntask based on RFCN, and is used for calculating the classification and positioning of the image content according to the feature map of each frame of image in the video;
and the characteristic fusion module is used for constructing a weight network based on the convolutional neural network. The current frame feature map and the previous key frame feature map are input into the network; after the last pooling layer, they are processed through a cosine similarity function to obtain importance weights. According to the weights, the two feature maps are fused by weighted summation to obtain a new key frame feature map;
and the optical flow motion field calculation module is used for constructing an optical flow network FlowNet based on the convolutional neural network for calculating the optical flow motion field. Inputting the current frame and the previous key frame into the FlowNet together, and outputting an optical flow motion field of the current frame;
and the key frame selection module adopts an adaptive decision mode to calculate the characteristic time sequence consistency judging matrix Qk2i of the current frame, and then adaptively judges whether the current frame is selected as a key frame according to the matrix.
The feature map calculating module is used for calculating the feature map of the current frame: for key frames, the feature map is calculated in the manner of step 5; for non-key frames, the feature map is calculated in the manner of step 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311191030.0A CN117237867A (en) | 2023-09-15 | 2023-09-15 | Self-adaptive field monitoring video target detection method and system based on feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117237867A true CN117237867A (en) | 2023-12-15 |
Family
ID=89096118
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311191030.0A Pending CN117237867A (en) | 2023-09-15 | 2023-09-15 | Self-adaptive field monitoring video target detection method and system based on feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117237867A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117475358A (en) * | 2023-12-27 | 2024-01-30 | 广东南方电信规划咨询设计院有限公司 | Collision prediction method and device based on unmanned aerial vehicle vision |
CN117475358B (en) * | 2023-12-27 | 2024-04-23 | 广东南方电信规划咨询设计院有限公司 | Collision prediction method and device based on unmanned aerial vehicle vision |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||