CN112434618A - Video target detection method based on sparse foreground prior, storage medium and equipment - Google Patents

Video target detection method based on sparse foreground prior, storage medium and equipment

Info

Publication number
CN112434618A
Authority
CN
China
Prior art keywords
foreground
video
sparse
prior
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011357082.7A
Other languages
Chinese (zh)
Other versions
CN112434618B (en)
Inventor
古晶
巨小杰
马文萍
孙新凯
刘芳
杨淑媛
焦李成
冯婕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202011357082.7A priority Critical patent/CN112434618B/en
Publication of CN112434618A publication Critical patent/CN112434618A/en
Application granted granted Critical
Publication of CN112434618B publication Critical patent/CN112434618B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video target detection method, storage medium and device based on sparse foreground priors. A foreground extraction method based on orthogonal subspace learning is used to compute a sparse foreground prior map for each frame of a video. A ResNet feature extraction network and a feature pyramid structure are used to obtain semantic-enhanced feature maps of each video frame and of its sparse foreground map. The semantic-enhanced feature map of the sparse foreground prior map is concatenated with that of the current frame, and a convolutional fusion operation yields the foreground prior fusion features of the current frame. Candidate anchor boxes are generated at each pixel of the foreground prior fusion feature map, and the fusion features together with all anchor boxes are fed into a trained classification and regression sub-network to obtain the category and position coordinates of the target objects. The method fully exploits the sparse foreground prior of video data and improves target detection accuracy.

Description

Video target detection method based on sparse foreground prior, storage medium and equipment
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a video target detection method based on sparse foreground prior, a storage medium and equipment.
Background
Computer vision is an important area of artificial intelligence in which computers are trained to perceive and understand the visual world. With images, videos and deep learning models, objects of interest can be accurately classified and recognized for further processing. Computer vision is generally divided into major tasks such as image recognition, target detection and instance segmentation. A classification task gives a content description of the whole picture, whereas a detection task focuses on specific objects of interest and requires both a recognition result and a localization result for each of them. Compared with classification, detection requires an understanding of both foreground and background: the objects of interest must be separated from the background and their categories and locations determined.
Target detection is a popular direction in computer vision research and is widely applied in robot navigation, video surveillance, industrial inspection, face recognition and other fields. Image target detection has made great progress in recent years, and detection performance has improved significantly. However, in fields such as video surveillance and vehicle-assisted driving, there is a broad demand for video-based target detection, and directly applying image detection techniques to the video detection task raises new challenges. First, running an image target detection network on every frame of a video incurs a huge computational cost. Second, conventional image target detection methods cannot effectively exploit the temporal continuity of video data or the sparse foreground prior, and have difficulty mining the temporal characteristics of video data.
A video is composed of images, so video target detection and image target detection are closely related. To improve video detection accuracy, existing approaches first detect each frame with an image target detector and then post-process the results using the temporal characteristics of the video. To exploit the temporal continuity and redundancy of video data, some recent methods adopt optical flow, attention mechanisms, sequence models and the like to mine the temporal characteristics of video.
Disclosure of Invention
In view of the defects in the prior art, the technical problem to be solved by the present invention is to provide a video target detection method, a storage medium and a device based on sparse foreground prior, so as to improve the detection performance of video target detection.
The invention adopts the following technical scheme:
the video target detection method based on sparse foreground prior comprises the following steps:
S1, dividing the video V into m video segments C_i, i = 1, 2, …, m, and for each video segment C_i, obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the segment by a foreground extraction algorithm based on orthogonal subspace learning;
S2, inputting the video frame I^(t) and its sparse foreground map E^(t) into a ResNet feature extraction network, each layer of which outputs the feature map F^(t) of the corresponding layer and the corresponding sparse foreground prior feature map, thereby computing the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of its sparse foreground map E^(t);
S3, combining, through a feature pyramid structure, each layer's feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature with the upsampled features of the higher layer to compute the semantic-enhanced features and the foreground semantic-enhanced features of the video frame I^(t);
S4, fusing the semantic-enhanced features of the video frame I^(t) with the corresponding foreground semantic-enhanced features to obtain the foreground prior fusion feature map of the video frame I^(t);
S5, generating anchor boxes on the foreground prior fusion feature map of the video frame I^(t);
S6, inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor boxes into the trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t), completing target detection.
Specifically, in step S1, each frame image I^(t) of the video segment C_i is converted to grayscale and reshaped into a column vector; the column vectors are combined into a two-dimensional matrix X, and the sparse foreground prior E of all frames in the video segment C_i is obtained by optimizing the following objective function; E is then split by columns, and each column is reshaped back into the sparse foreground map E^(t) of the corresponding frame I^(t). The objective function is:

min_{D,θ,E} (1/2)‖X − Dθ − E‖_F² + α‖θ‖_{row,1} + β‖E‖_1,  s.t.  DᵀD = I_k

where D is the orthogonal subspace, θ is the orthogonal subspace coefficient, α and β are adjusting parameters, ‖·‖_{row,1} denotes the row-wise ℓ1 norm of a matrix, and I_k is the identity matrix of order k.
Further, the objective function is solved by an alternating direction method: D and θ are solved by a block coordinate descent method, with a residual term defined and used to solve for and update D and θ; the solved D and θ are then used to update E through the shrinkage (soft-thresholding) function S_λ[x] = sign(x) ⊙ max(|x| − λ, 0), where ⊙ denotes element-by-element multiplication and sign(·) is the sign function; the updates are iterated until a convergence condition or the maximum number of iterations is reached, yielding the sparse foreground prior E of all frames in the video segment C_i.
Specifically, in step S3, in the process of obtaining the feature map F^(t) and the sparse foreground prior feature map of the video frame I^(t) and its sparse foreground map E^(t) through the ResNet feature extraction network, 5 features of different scales are extracted from intermediate layers of the ResNet feature extraction network, each scale being a multiple of the lowest-layer feature scale. The 5 features of different scales form a feature pyramid, whose bottom is a high-resolution feature map and whose top is a low-resolution feature map. The strong semantic features of the higher pyramid layers are upsampled by nearest-neighbor interpolation, added to the lower-layer features, and passed through a 3×3 convolution kernel, outputting the semantic-enhanced features and the foreground prior features.
Specifically, in step S4, the semantic-enhanced features of the video frame I^(t) and the corresponding foreground semantic-enhanced features are concatenated, and the foreground prior fusion feature map is obtained through a 1×1 convolution operation.
Specifically, in step S5, a base anchor box of size 16×16 is set at each pixel of every layer of the foreground prior fusion feature map; with the area kept unchanged, aspect ratios of 0.5, 1 and 2 are applied, and the anchor boxes of different aspect ratios are each enlarged by scale factors of 8, 16 and 32, so that a total of 9 anchor boxes are generated at each pixel of every layer of the foreground prior fusion feature map.
Specifically, in step S6, training the classification and regression sub-networks specifically comprises:
S6011, randomly initializing the weight parameters of the classification and regression networks;
s6012, for each candidate region, calculating the probability that the candidate region belongs to each category by using the initialized classification network, and calculating the position coordinates of the candidate region by using the initialized regression network;
s6013, constructing a target detection loss function L;
S6014, iteratively updating the classification and regression network parameters by back propagation using the target detection loss function L until the network converges, obtaining the trained classification and regression sub-networks.
Further, in step S6013, the loss function L is:

L = Σ_i [ −(1 − p_i^z)^γ log(p_i^z) + ω · SmoothL1(a_i, a_i*) ]

where z is the true label of the i-th candidate region, p_i^z is the probability that the i-th candidate region belongs to a class-z object, γ is the focusing parameter, −(1 − p_i^z)^γ log(p_i^z) is the focal loss for target classification, a_i is the position coordinates of the i-th candidate region, a_i* is the coordinate vector of the real target box corresponding to the i-th candidate region, SmoothL1(a_i, a_i*) is the Smooth L1 regression loss of the target box, and ω is the balance weight.
Another aspect of the invention is a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods described.
Another aspect of the present invention is a computing device, including:
one or more processors, memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods.
Compared with the prior art, the invention has at least the following beneficial effects:
On the basis of an image target detection method, the sparse prior of the foreground and the spatio-temporal continuity prior of video data are used to extract a motion foreground prior map and obtain a foreground semantic-enhanced feature map, which is concatenated with the semantic-enhanced features of the current frame to obtain the foreground prior fusion features of the current frame. After this foreground prior feature fusion, video frames with motion blur, object occlusion and large size changes can still be detected, and detection accuracy is improved. The method makes full use of the relationship between the features of adjacent frames and does not require further processing of the detection results after each frame is detected; compared with existing video target detection methods based on post-processing of image detection results, the detection speed is improved.
Furthermore, the foreground extraction algorithm based on orthogonal subspace learning obtains the moving foreground objects of interest: all video frames in a video segment are treated as a whole, the foreground maps of all frames are obtained by the orthogonal subspace learning algorithm, and the sparse foreground prior of the video data is thus better utilized.
Further, an alternating direction method is used to solve the objective function, in which the unconstrained parts are optimized by a block coordinate descent method; a large global optimization problem is thereby decomposed into several easily solved sub-problems, and solving these sub-problems yields the solution of the global optimization problem.
Furthermore, the features extracted from the ResNet network are constructed into a feature pyramid, multi-scale features of the video frame and the foreground prior image are obtained through the feature pyramid structure, wherein low-resolution high-level features with rich semantic information in the feature pyramid are used for enhancing the low-level features, and therefore the semantic information of the obtained semantic enhanced features is richer.
Furthermore, the semantic enhancement features of the foreground image and the semantic enhancement features of the current video frame are subjected to cascade convolution fusion to obtain a feature image with enhanced foreground priori, foreground sparse priori information is added in the detection process of the foreground target on the video frame, the feature information of the foreground target is enhanced, and the detection performance is further enhanced.
Further, anchor boxes are generated on the feature map, each anchor box is classified, and the anchor boxes judged to be positive samples are then regressed to obtain accurate target positions. Generating anchor boxes on the feature map limits the number of candidate regions to a controllable range and greatly reduces the amount of computation.
Furthermore, training of the video data is completed by constructing a classification sub-network and a regression sub-network, wherein the classification sub-network can obtain a fine target classification result, and the regression sub-network can further correct a positioning result of a target, so that the finally obtained recognition results and positions of different targets in the video frame are more accurate.
Furthermore, the loss function L is set mainly to solve the problem of imbalance of the proportion of positive and negative samples in the one-stage target detection task. The loss function reduces the proportion of the large number of redundant negative samples in the training process.
In summary, the present invention fully utilizes the sparse prior of the foreground and the relationship between the adjacent frame features to solve the problems of motion blur, object occlusion, large size change, etc. existing in the video data, so that the present invention can effectively detect the targets with different scales and blurs in the video data, and improve the detection accuracy.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a diagram illustrating an effect of video object detection according to the present invention, wherein (a) is a detection result of one frame in a video sequence with a ship object, and (b) is a detection result of another frame in the video sequence with the ship object;
FIG. 3 is a diagram illustrating a second effect of detecting a video object according to the present invention, wherein (a) is a detection result of one frame in a video sequence targeted to a dog, and (b) is a detection result of another frame in the video sequence targeted to the dog;
fig. 4 is a diagram illustrating a third effect of video object detection according to the present invention, wherein (a) is the detection result of one frame in a video sequence with an elephant and a car as targets, and (b) is the detection result of another frame in the same video sequence.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a video target detection method based on sparse foreground priors. A foreground extraction method based on orthogonal subspace learning is used to obtain a sparse motion foreground prior map for each frame of the video; a ResNet feature extraction network and a feature pyramid structure are then used to extract multi-scale semantic-enhanced features of the video frame and its sparse foreground map; the foreground semantic-enhanced features are concatenated and fused with the semantic-enhanced features of the current frame to obtain foreground prior fusion features; anchor boxes are generated at each pixel of the foreground prior fusion feature map; the category and position coordinates of all targets are then obtained through a classification and regression network. The sparse foreground prior of the video data is fully mined and target detection accuracy is improved.
Referring to fig. 1, the video target detection method based on sparse foreground prior of the present invention includes two parts, namely training and testing, wherein in the training process, network parameters are updated by calculating a loss function of a network model and then using back propagation; in the testing process, trained network parameters are used, the semantic enhancement features and the foreground semantic enhancement features of the current frame are fused to obtain foreground prior fusion features of the video frame, and then the category and the position of an interested target in the video frame are obtained based on the foreground prior fusion features; the method comprises the following specific steps:
S1, dividing the video V into m video segments C_i, i = 1, 2, …, m, and for each video segment C_i, obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the segment by a foreground extraction algorithm based on orthogonal subspace learning.
Each frame image I^(t) in the video segment C_i is converted to grayscale and reshaped into a column vector; the column vectors are combined into a two-dimensional matrix X, and the corresponding foreground prior E of all frames is calculated from an objective function.
The objective function is:

min_{D,θ,E} (1/2)‖X − Dθ − E‖_F² + α‖θ‖_{row,1} + β‖E‖_1,  s.t.  DᵀD = I_k

where D is the orthogonal subspace, θ is the orthogonal subspace coefficient, α and β are adjusting parameters, ‖·‖_{row,1} denotes the row-wise ℓ1 norm of a matrix, and I_k is the identity matrix of order k.
In a specific implementation, the objective function may be solved by an inexact alternating direction method, repeatedly performing the following steps:

S101, solving D and θ by a block coordinate descent method: a residual term is defined from the current estimates and used to solve for and update D and θ.

S102, using the solved D and θ to update E through the shrinkage (soft-thresholding) function S_λ[x] = sign(x) ⊙ max(|x| − λ, 0), where ⊙ denotes element-by-element multiplication and sign(·) is the sign function.

The updates are iterated until a convergence condition is reached, i.e. the maximum number of iterations is reached, yielding the sparse foreground prior E of all frames in the video segment C_i; E is then split by columns, and each column is reshaped back into the sparse foreground map E^(t) of the corresponding frame I^(t).
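As an illustration of the alternating scheme in steps S101 and S102, the following is a minimal NumPy sketch under stated assumptions: the D/θ update is a generic least-squares fit with QR re-orthogonalization rather than the patent's exact block coordinate descent rules, and the threshold beta and subspace dimension k are placeholder values, not values from the patent.

```python
import numpy as np

def soft_threshold(x, lam):
    # Shrinkage: S_lam[x] = sign(x) * max(|x| - lam, 0), applied element-wise
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_foreground_prior(X, k=3, beta=0.05, max_iter=50, tol=1e-4):
    """Alternating updates for X ~= D @ theta + E with an orthogonal D (D^T D = I_k).

    X: (pixels, frames) matrix of vectorized grayscale frames of one video segment.
    Returns the sparse foreground prior E (same shape as X).
    The D/theta updates below are a stand-in least-squares step followed by QR
    re-orthogonalization, not the patent's exact block coordinate descent rules.
    """
    d, n = X.shape
    rng = np.random.default_rng(0)
    D, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random orthogonal subspace
    theta = D.T @ X
    E = np.zeros_like(X, dtype=float)
    for _ in range(max_iter):
        E_old = E.copy()
        R = X - E                                      # residual without the foreground
        theta = D.T @ R                                # coefficient update
        D, _ = np.linalg.qr(R @ theta.T)               # subspace update, keeps D^T D = I_k
        theta = D.T @ R
        E = soft_threshold(X - D @ theta, beta)        # sparse foreground via shrinkage
        if np.linalg.norm(E - E_old) <= tol * (np.linalg.norm(E_old) + 1e-12):
            break
    return E
```

Splitting the returned E by columns and reshaping each column to the frame size then gives the per-frame sparse foreground maps E^(t).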
S2, calculating the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of its sparse foreground map E^(t).

The video frame I^(t) and its sparse foreground map E^(t) are each input into a ResNet feature extraction network, and each layer of the network outputs the feature map F^(t) of that layer and the corresponding sparse foreground prior feature map.

The ResNet feature extraction network consists of one 7×7 convolutional layer, one max-pooling layer and 16 residual blocks, where each residual block combines a 1×1 convolutional layer, a 3×3 convolutional layer, a 1×1 convolutional layer, batch normalization layers and activation function layers. The residual blocks are divided into 5 stages, and the output of each stage serves as a feature of the input image at a different semantic level.
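As a sketch of step S2, the snippet below uses torchvision's ResNet-50 and its feature-extraction utility to tap multi-scale stage outputs for both the video frame and its foreground map. The choice of ResNet-50, the tapped stage names, and replicating the single-channel foreground map to three channels are assumptions for illustration; the patent only specifies a ResNet backbone with 16 residual blocks.

```python
import torch
import torchvision
from torchvision.models.feature_extraction import create_feature_extractor

# ResNet-50 backbone; tap the outputs of its residual stages as multi-scale features.
backbone = torchvision.models.resnet50(weights=None)
body = create_feature_extractor(
    backbone, return_nodes={"layer1": "c2", "layer2": "c3",
                            "layer3": "c4", "layer4": "c5"})

def extract_features(frame, foreground):
    """frame: (1, 3, H, W) RGB tensor; foreground: (1, 1, H, W) sparse foreground map."""
    fg3 = foreground.repeat(1, 3, 1, 1)   # replicate the single channel to 3 (assumption)
    frame_feats = body(frame)             # multi-scale feature maps F^(t)
    fg_feats = body(fg3)                  # sparse foreground prior feature maps
    return frame_feats, fg_feats

frame, fg = torch.randn(1, 3, 512, 512), torch.rand(1, 1, 512, 512)
f, e = extract_features(frame, fg)
print({name: tuple(t.shape) for name, t in f.items()})
```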
S3, calculating the semantic-enhanced features and the foreground semantic-enhanced features of the video frame I^(t).

Through the feature pyramid structure, each layer's feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature are combined with the upsampled features of the higher layer to obtain semantic-enhanced features with rich semantic information and the foreground semantic-enhanced features.
In the process of obtaining the feature map F^(t) and the sparse foreground prior feature map of the video frame I^(t) and its sparse foreground map E^(t) through the ResNet feature extraction network, 5 features of different scales are extracted from intermediate layers of ResNet, each scale being a multiple of the lowest-layer feature scale, and the 5 features of different scales form a feature pyramid. The bottom of the feature pyramid is a high-resolution feature map, while the top is a low-resolution feature map: the higher the level, the smaller the feature map and the lower the resolution.

The low-resolution, high-semantic features of the higher pyramid layers, which carry abstract information, are upsampled by nearest-neighbor interpolation, added to the lower-layer features, and passed through a 3×3 convolution kernel, outputting the semantically rich semantic-enhanced features and the foreground prior features.
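The top-down merge of step S3 can be sketched for one pyramid level as below; the 256 output channels and the 1×1 lateral convolutions used to align channel counts are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownMerge(nn.Module):
    """One FPN-style merge step: upsample the higher level, add, then 3x3 conv."""
    def __init__(self, low_channels, high_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(low_channels, out_channels, kernel_size=1)
        self.reduce_high = nn.Conv2d(high_channels, out_channels, kernel_size=1)
        self.smooth = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, low_feat, high_feat):
        high = self.reduce_high(high_feat)
        # Nearest-neighbor upsampling of the higher (coarser) level
        high_up = F.interpolate(high, size=low_feat.shape[-2:], mode="nearest")
        merged = self.lateral(low_feat) + high_up
        return self.smooth(merged)   # semantic-enhanced feature of this level

# Example: merge a 1/16-scale level into a 1/8-scale level
merge = TopDownMerge(low_channels=512, high_channels=1024)
p = merge(torch.randn(1, 512, 64, 64), torch.randn(1, 1024, 32, 32))
print(p.shape)  # torch.Size([1, 256, 64, 64])
```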
S4, calculating the foreground prior fusion feature map of the video frame I^(t).

The semantic-enhanced features of the video frame I^(t) and the corresponding foreground semantic-enhanced features are concatenated, and the foreground prior fusion feature map is obtained through a 1×1 convolution operation.
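A minimal sketch of the concatenation and 1×1 convolution fusion of step S4, assuming 256-channel inputs per branch (the channel count is an assumption):

```python
import torch
import torch.nn as nn

class ForegroundPriorFusion(nn.Module):
    """Concatenate frame features with foreground features, then fuse by 1x1 conv."""
    def __init__(self, channels=256):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, frame_feat, fg_feat):
        x = torch.cat([frame_feat, fg_feat], dim=1)   # cascade along channels
        return self.fuse(x)                            # foreground prior fusion feature map

fusion = ForegroundPriorFusion()
out = fusion(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))
print(out.shape)  # torch.Size([1, 256, 64, 64])
```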
S5, generating anchor boxes on the foreground prior fusion feature map of the video frame I^(t).

A base anchor box of size 16×16 is set at each pixel of every layer of the foreground prior fusion feature map; with the area kept unchanged, aspect ratios of 0.5, 1 and 2 are applied, and the three anchor boxes of different aspect ratios are each enlarged by scale factors of 8, 16 and 32, so that a total of 9 anchor boxes are generated at each pixel of every layer of the foreground prior fusion feature map.
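The anchor generation of step S5 (base size 16×16, aspect ratios 0.5/1/2, scale factors 8/16/32, 9 anchors per pixel) can be sketched as follows; the (cx, cy, w, h) output layout, the per-level stride handling, and the convention that the ratio means h/w are assumptions for illustration.

```python
import itertools
import torch

def generate_anchors(feat_h, feat_w, stride, base_size=16,
                     ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Return (feat_h * feat_w * 9, 4) anchors as (cx, cy, w, h) in image pixels."""
    shapes = []
    for r, s in itertools.product(ratios, scales):
        area = (base_size * s) ** 2          # enlarge the base anchor, keep area per ratio
        w = (area / r) ** 0.5
        h = w * r                            # ratio r = h / w (assumption)
        shapes.append((w, h))
    shapes = torch.tensor(shapes)            # (9, 2)

    ys = (torch.arange(feat_h) + 0.5) * stride
    xs = (torch.arange(feat_w) + 0.5) * stride
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")
    centers = torch.stack([cx, cy], dim=-1).reshape(-1, 1, 2)    # (HW, 1, 2)
    wh = shapes.view(1, -1, 2).expand(centers.shape[0], -1, -1)  # (HW, 9, 2)
    anchors = torch.cat([centers.expand_as(wh), wh], dim=-1)
    return anchors.reshape(-1, 4)

print(generate_anchors(4, 4, stride=8).shape)  # torch.Size([144, 4])
```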
S6, inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor boxes into the trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t).
S601, training classification and regression sub-network:
s6011, randomly initializing classification and weight parameters of a regression network;
s6012, for each candidate region, calculating the probability that the candidate region belongs to each category by using the initialized classification network, and calculating the position coordinates of the candidate region by using the initialized regression network;
S6013, constructing the target detection loss function L:

L = Σ_i [ −(1 − p_i^z)^γ log(p_i^z) + ω · SmoothL1(a_i, a_i*) ]

where z is the true label of the i-th candidate region, p_i^z is the probability that the i-th candidate region belongs to a class-z object, γ is the focusing parameter, −(1 − p_i^z)^γ log(p_i^z) is the focal loss for target classification, a_i is the position coordinates of the i-th candidate region, a_i* is the coordinate vector of the real target box corresponding to the i-th candidate region, SmoothL1(a_i, a_i*) is the Smooth L1 regression loss of the target box, and ω is the balance weight (a code sketch of a loss of this form is given after step S602 below);
S6014, iteratively updating the classification and regression network parameters by back propagation using the target detection loss function L until the network converges, obtaining the trained classification and regression sub-networks;

S602, inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor boxes into the trained classification and regression network to obtain the object categories and object box positions in the video frame I^(t).
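As referenced in step S6013, the following is a sketch of a loss of this form, combining a focal classification term with an ω-weighted Smooth L1 regression term over candidate regions; the softmax probabilities, the normalization by the number of positive candidates, and the default values γ = 2, ω = 1 are assumptions, not values specified by the patent.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, labels, box_preds, box_targets,
                   gamma=2.0, omega=1.0):
    """cls_logits: (N, C) class scores; labels: (N,) class indices (0 = background);
    box_preds / box_targets: (N, 4) coordinates, regression applied to positives only."""
    # Focal loss: -(1 - p_z)^gamma * log(p_z) for the true class z of each candidate
    probs = cls_logits.softmax(dim=-1)
    p_z = probs.gather(1, labels.unsqueeze(1)).squeeze(1).clamp(min=1e-6)
    focal = -((1.0 - p_z) ** gamma) * p_z.log()

    # Smooth L1 regression loss on the box coordinates of positive candidates
    pos = labels > 0
    if pos.any():
        reg = F.smooth_l1_loss(box_preds[pos], box_targets[pos], reduction="sum")
    else:
        reg = box_preds.sum() * 0.0
    return (focal.sum() + omega * reg) / max(pos.sum().item(), 1)

logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
boxes, targets = torch.randn(8, 4), torch.randn(8, 4)
print(detection_loss(logits, labels, boxes, targets))
```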
In yet another embodiment of the present invention, a terminal device is provided that includes a processor and a memory for storing a computer program comprising program instructions, the processor being configured to execute the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.; it is the computing and control core of the terminal and is adapted to load and execute one or more instructions to implement the corresponding method flow or function. The processor of this embodiment of the invention can be used for the operations of video target detection based on sparse foreground prior, comprising: dividing a video V into m video segments C_i, i = 1, 2, …, m, and for each video segment C_i, obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the segment by a foreground extraction algorithm based on orthogonal subspace learning; inputting the video frame I^(t) and its sparse foreground map E^(t) into a ResNet feature extraction network, each layer of which outputs the feature map F^(t) of the corresponding layer and the corresponding sparse foreground prior feature map, thereby computing the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of its sparse foreground map E^(t); combining, through a feature pyramid structure, each layer's feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature with the upsampled features of the higher layer to compute the semantic-enhanced features and the foreground semantic-enhanced features of the video frame I^(t); fusing the semantic-enhanced features of the video frame I^(t) with the corresponding foreground semantic-enhanced features to obtain the foreground prior fusion feature map of the video frame I^(t); generating anchor boxes on the foreground prior fusion feature map of the video frame I^(t); and inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor boxes into the trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t), completing target detection.
In still another embodiment, the present invention further provides a storage medium, specifically a computer-readable storage medium (memory), which is a memory device in a terminal device used to store programs and data. It is understood that the computer-readable storage medium here may include a built-in storage medium of the terminal device and may also include an extended storage medium supported by the terminal device. The computer-readable storage medium provides storage space storing the operating system of the terminal. Also, one or more instructions, which may be one or more computer programs (including program code), are stored in the storage space and are adapted to be loaded and executed by the processor. It should be noted that the computer-readable storage medium may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory.
The processor can load and execute one or more instructions stored in the computer-readable storage medium to implement the corresponding steps of the sparse-foreground-prior-based video target detection method of the above embodiments; the one or more instructions in the computer-readable storage medium are loaded by the processor and perform the following steps: dividing a video V into m video segments C_i, i = 1, 2, …, m, and for each video segment C_i, obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the segment by a foreground extraction algorithm based on orthogonal subspace learning; inputting the video frame I^(t) and its sparse foreground map E^(t) into a ResNet feature extraction network, each layer of which outputs the feature map F^(t) of the corresponding layer and the corresponding sparse foreground prior feature map, thereby computing the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of its sparse foreground map E^(t); combining, through a feature pyramid structure, each layer's feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature with the upsampled features of the higher layer to compute the semantic-enhanced features and the foreground semantic-enhanced features of the video frame I^(t); fusing the semantic-enhanced features of the video frame I^(t) with the corresponding foreground semantic-enhanced features to obtain the foreground prior fusion feature map of the video frame I^(t); generating anchor boxes on the foreground prior fusion feature map of the video frame I^(t); and inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor boxes into the trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t), completing target detection.
The effects of the present invention can be further illustrated by the following simulations:
1. Simulation conditions
A workstation with an RTX 2080 Ti graphics card was used, and the software framework was PyTorch.
A video sequence with ships as targets and large scale differences is selected as the first group of detected video sequences, as shown in fig. 2;
a video sequence with dogs as targets and large pose differences is selected as the second group of detected video sequences, as shown in fig. 3;
a video sequence containing two kinds of targets, elephants and cars, with object occlusion is selected as the third group of detected video sequences, as shown in fig. 4.
2. Simulation content
Simulation 1, performing video target detection on a first group of detected video sequences by using the method of the present invention to obtain detection results of two frames, as shown in fig. 2.
Simulation 2, performing video target detection on the second group of detected video sequences by using the method of the present invention to obtain detection results of two frames, as shown in fig. 3.
Simulation 3, performing video target detection on the third group of detected video sequences by using the method of the present invention to obtain the detection results of two frames, as shown in fig. 4.
3. Analysis of simulation results
Fig. 2(a) and fig. 2(b) are detection results of two frames of the video sequence with ships as targets; it can be seen that the invention accurately detects the categories and positions of targets of different sizes in the video even when the target sizes differ greatly. Fig. 3(a) and fig. 3(b) are detection results of two frames of the video sequence with dogs as targets; it can be seen that the invention accurately detects the category and position of the target even with blurred frames and large pose differences. Fig. 4(a) and fig. 4(b) are detection results of two frames of the video sequence containing elephant and car targets; it can be seen that the invention accurately detects the categories and positions of occluded targets of different classes, in particular the left elephant in fig. 4(b), which is almost completely occluded.
In summary, the video target detection method based on sparse foreground priors can effectively detect the categories and positions of targets of different scales, as well as targets in video sequences with motion blur and occlusion.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical solution according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. A video target detection method based on sparse foreground prior, characterized by comprising the following steps:

S1, dividing the video V into m video segments C_i, i = 1, 2, …, m, and for each video segment C_i, obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the segment by a foreground extraction algorithm based on orthogonal subspace learning;

S2, inputting the video frame I^(t) and its sparse foreground map E^(t) into a ResNet feature extraction network, each layer of which outputs the feature map F^(t) of the corresponding layer and the corresponding sparse foreground prior feature map, thereby computing the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of its sparse foreground map E^(t);

S3, combining, through a feature pyramid structure, each layer's feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature with the upsampled features of the higher layer to compute the semantic-enhanced features and the foreground semantic-enhanced features of the video frame I^(t);

S4, fusing the semantic-enhanced features of the video frame I^(t) with the corresponding foreground semantic-enhanced features to obtain the foreground prior fusion feature map of the video frame I^(t);

S5, generating anchor boxes on the foreground prior fusion feature map of the video frame I^(t);

S6, inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor boxes into the trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t), completing target detection.
2. The video target detection method based on sparse foreground prior according to claim 1, characterized in that in step S1, each frame image I^(t) in the video segment C_i is converted to grayscale and reshaped into a column vector; the column vectors are combined into a two-dimensional matrix X, and the sparse foreground prior E of all frames in the video segment C_i is obtained by optimizing the following objective function; E is then split by columns, and each column is reshaped back into the sparse foreground map E^(t) of the corresponding frame I^(t); the objective function is:

min_{D,θ,E} (1/2)‖X − Dθ − E‖_F² + α‖θ‖_{row,1} + β‖E‖_1,  s.t.  DᵀD = I_k

where D is the orthogonal subspace, θ is the orthogonal subspace coefficient, α and β are adjusting parameters, ‖·‖_{row,1} denotes the row-wise ℓ1 norm of a matrix, and I_k is the identity matrix of order k.
3. The video target detection method based on sparse foreground prior according to claim 2, characterized in that the objective function is solved by an alternating direction method: D and θ are solved by a block coordinate descent method, with a residual term defined and used to solve for and update D and θ; the solved D and θ are then used to update E through the shrinkage (soft-thresholding) function S_λ[x] = sign(x) ⊙ max(|x| − λ, 0), where ⊙ denotes element-by-element multiplication and sign(·) is the sign function; the updates are iterated until a convergence condition or the maximum number of iterations is reached, yielding the sparse foreground prior E of all frames in the video segment C_i.
4. The video target detection method based on sparse foreground prior according to claim 1, characterized in that in step S3, in the process of obtaining the feature map F^(t) and the sparse foreground prior feature map of the video frame I^(t) and its sparse foreground map E^(t) through the ResNet feature extraction network, 5 features of different scales are extracted from intermediate layers of the ResNet feature extraction network, each scale being a multiple of the lowest-layer feature scale; the 5 features of different scales form a feature pyramid, whose bottom is a high-resolution feature map and whose top is a low-resolution feature map; the strong semantic features of the higher pyramid layers are upsampled by nearest-neighbor interpolation, added to the lower-layer features, and passed through a 3×3 convolution kernel, outputting the semantic-enhanced features and the foreground prior features.
5. The video target detection method based on sparse foreground prior according to claim 1, characterized in that in step S4, the semantic-enhanced features of the video frame I^(t) and the corresponding foreground semantic-enhanced features are concatenated, and the foreground prior fusion feature map is obtained through a 1×1 convolution operation.
6. The video target detection method based on sparse foreground prior according to claim 1, characterized in that in step S5, a base anchor box of size 16×16 is set at each pixel of every layer of the foreground prior fusion feature map; with the area kept unchanged, aspect ratios of 0.5, 1 and 2 are applied, and the anchor boxes of different aspect ratios are each enlarged by scale factors of 8, 16 and 32, so that a total of 9 anchor boxes are generated at each pixel of every layer of the foreground prior fusion feature map.
7. The video target detection method based on sparse foreground prior according to claim 1, characterized in that in step S6, training the classification and regression sub-networks specifically comprises:
S6011, randomly initializing the weight parameters of the classification and regression networks;
S6012, for each candidate region, calculating the probability that the candidate region belongs to each category using the initialized classification network, and calculating the position coordinates of the candidate region using the initialized regression network;
S6013, constructing a target detection loss function L;
S6014, iteratively updating the classification and regression network parameters by back propagation using the target detection loss function L until the network converges, obtaining the trained classification and regression sub-networks.
8. The video target detection method based on sparse foreground prior according to claim 7, characterized in that in step S6013, the loss function L is:

L = Σ_i [ −(1 − p_i^z)^γ log(p_i^z) + ω · SmoothL1(a_i, a_i*) ]

where z is the true label of the i-th candidate region, p_i^z is the probability that the i-th candidate region belongs to a class-z object, γ is the focusing parameter, −(1 − p_i^z)^γ log(p_i^z) is the focal loss for target classification, a_i is the position coordinates of the i-th candidate region, a_i* is the coordinate vector of the real target box corresponding to the i-th candidate region, SmoothL1(a_i, a_i*) is the Smooth L1 regression loss of the target box, and ω is the balance weight.
9. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-8.
10. A computing device, comprising:
one or more processors, memory, and one or more programs stored in the memory and configured for execution by the one or more processors, the one or more programs including instructions for performing any of the methods of claims 1-8.
CN202011357082.7A 2020-11-26 2020-11-26 Video target detection method, storage medium and device based on sparse foreground priori Active CN112434618B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011357082.7A CN112434618B (en) 2020-11-26 2020-11-26 Video target detection method, storage medium and device based on sparse foreground priori

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011357082.7A CN112434618B (en) 2020-11-26 2020-11-26 Video target detection method, storage medium and device based on sparse foreground priori

Publications (2)

Publication Number Publication Date
CN112434618A true CN112434618A (en) 2021-03-02
CN112434618B CN112434618B (en) 2023-06-23

Family

ID=74699279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011357082.7A Active CN112434618B (en) 2020-11-26 2020-11-26 Video target detection method, storage medium and device based on sparse foreground priori

Country Status (1)

Country Link
CN (1) CN112434618B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861830A (en) * 2021-04-13 2021-05-28 北京百度网讯科技有限公司 Feature extraction method, device, apparatus, storage medium, and program product
CN112966697A (en) * 2021-03-17 2021-06-15 西安电子科技大学广州研究院 Target detection method, device and equipment based on scene semantics and storage medium
CN113505737A (en) * 2021-07-26 2021-10-15 浙江大华技术股份有限公司 Foreground image determination method and apparatus, storage medium, and electronic apparatus
CN113743249A (en) * 2021-08-16 2021-12-03 北京佳服信息科技有限公司 Violation identification method, device and equipment and readable storage medium
CN114708531A (en) * 2022-03-18 2022-07-05 南京大学 Method and device for detecting abnormal behavior in elevator and storage medium
CN116630334A (en) * 2023-04-23 2023-08-22 中国科学院自动化研究所 Method, device, equipment and medium for real-time automatic segmentation of multi-segment blood vessel

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680106A (en) * 2017-10-13 2018-02-09 南京航空航天大学 A kind of conspicuousness object detection method based on Faster R CNN
CN108898145A (en) * 2018-06-15 2018-11-27 西南交通大学 A kind of image well-marked target detection method of combination deep learning
CN109447018A (en) * 2018-11-08 2019-03-08 天津理工大学 A kind of road environment visual perception method based on improvement Faster R-CNN
CN111310609A (en) * 2020-01-22 2020-06-19 西安电子科技大学 Video target detection method based on time sequence information and local feature similarity
CN111523439A (en) * 2020-04-21 2020-08-11 苏州浪潮智能科技有限公司 Method, system, device and medium for target detection based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680106A (en) * 2017-10-13 2018-02-09 南京航空航天大学 A kind of conspicuousness object detection method based on Faster R CNN
CN108898145A (en) * 2018-06-15 2018-11-27 西南交通大学 A kind of image well-marked target detection method of combination deep learning
CN109447018A (en) * 2018-11-08 2019-03-08 天津理工大学 A kind of road environment visual perception method based on improvement Faster R-CNN
CN111310609A (en) * 2020-01-22 2020-06-19 西安电子科技大学 Video target detection method based on time sequence information and local feature similarity
CN111523439A (en) * 2020-04-21 2020-08-11 苏州浪潮智能科技有限公司 Method, system, device and medium for target detection based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIANFENG OU et al.: "Moving Object Detection Method via ResNet-18 With Encoder–Decoder Structure in Complex Scenes", IEEE ACCESS *
赵永强 et al.: "A Survey of Deep Learning Object Detection Methods" (深度学习目标检测方法综述), Journal of Image and Graphics (中国图象图形学报) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966697A (en) * 2021-03-17 2021-06-15 西安电子科技大学广州研究院 Target detection method, device and equipment based on scene semantics and storage medium
CN112861830A (en) * 2021-04-13 2021-05-28 北京百度网讯科技有限公司 Feature extraction method, device, apparatus, storage medium, and program product
CN112861830B (en) * 2021-04-13 2023-08-25 北京百度网讯科技有限公司 Feature extraction method, device, apparatus, storage medium, and program product
CN113505737A (en) * 2021-07-26 2021-10-15 浙江大华技术股份有限公司 Foreground image determination method and apparatus, storage medium, and electronic apparatus
CN113743249A (en) * 2021-08-16 2021-12-03 北京佳服信息科技有限公司 Violation identification method, device and equipment and readable storage medium
CN113743249B (en) * 2021-08-16 2024-03-26 北京佳服信息科技有限公司 Method, device and equipment for identifying violations and readable storage medium
CN114708531A (en) * 2022-03-18 2022-07-05 南京大学 Method and device for detecting abnormal behavior in elevator and storage medium
CN116630334A (en) * 2023-04-23 2023-08-22 中国科学院自动化研究所 Method, device, equipment and medium for real-time automatic segmentation of multi-segment blood vessel
CN116630334B (en) * 2023-04-23 2023-12-08 中国科学院自动化研究所 Method, device, equipment and medium for real-time automatic segmentation of multi-segment blood vessel

Also Published As

Publication number Publication date
CN112434618B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN112434618B (en) Video target detection method, storage medium and device based on sparse foreground priori
CN108647585B (en) Traffic identifier detection method based on multi-scale circulation attention network
Xie et al. Multilevel cloud detection in remote sensing images based on deep learning
CN110929577A (en) Improved target identification method based on YOLOv3 lightweight framework
CN110910391B (en) Video object segmentation method for dual-module neural network structure
US20180114071A1 (en) Method for analysing media content
CN114202672A (en) Small target detection method based on attention mechanism
CN110782420A (en) Small target feature representation enhancement method based on deep learning
Fu et al. Camera-based basketball scoring detection using convolutional neural network
JP2012511756A (en) Apparatus having a data stream pipeline architecture for recognizing and locating objects in an image by detection window scanning
CN111274981B (en) Target detection network construction method and device and target detection method
CN112966659B (en) Video image small target detection method based on deep learning
CN111767962A (en) One-stage target detection method, system and device based on generation countermeasure network
US11809523B2 (en) Annotating unlabeled images using convolutional neural networks
CN112801047B (en) Defect detection method and device, electronic equipment and readable storage medium
CN112800955A (en) Remote sensing image rotating target detection method and system based on weighted bidirectional feature pyramid
CN112766170B (en) Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image
JP2024513596A (en) Image processing method and apparatus and computer readable storage medium
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN111179272B (en) Rapid semantic segmentation method for road scene
CN113516053A (en) Ship target refined detection method with rotation invariance
Zhu et al. Spatial hierarchy perception and hard samples metric learning for high-resolution remote sensing image object detection
Yildirim et al. Ship detection in optical remote sensing images using YOLOv4 and Tiny YOLOv4
CN113963333A (en) Traffic sign board detection method based on improved YOLOF model
Yang et al. Real-Time object detector based MobileNetV3 for UAV applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant