CN112434618B - Video target detection method, storage medium and device based on sparse foreground priori - Google Patents


Info

Publication number
CN112434618B
CN112434618B
Authority
CN
China
Prior art keywords
foreground
video
sparse
feature
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011357082.7A
Other languages
Chinese (zh)
Other versions
CN112434618A (en
Inventor
古晶
巨小杰
马文萍
孙新凯
刘芳
杨淑媛
焦李成
冯婕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202011357082.7A
Publication of CN112434618A
Application granted
Publication of CN112434618B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems


Abstract

The invention discloses a video target detection method, a storage medium and a device based on a sparse foreground prior. A foreground extraction method based on orthogonal subspace learning is used to compute a sparse foreground prior map for each frame of a video; a ResNet feature extraction network and a feature pyramid structure are used to obtain semantic enhancement feature maps of each video frame and of its sparse foreground prior map; the semantic enhancement feature map of the sparse foreground prior map is cascaded with the semantic enhancement feature map of the current frame and fused by a convolution operation to obtain the foreground prior fusion feature of the current frame; candidate anchor frames are generated at every pixel of the foreground prior fusion feature map; and the foreground prior fusion features together with all anchor frames are input into a trained classification and regression sub-network to obtain the categories and position coordinates of the target objects. The method fully exploits the sparse foreground prior of the video data and improves target detection accuracy.

Description

Video target detection method, storage medium and device based on sparse foreground prior
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a video target detection method, a storage medium and a device based on a sparse foreground prior.
Background
Computer vision is an important area of artificial intelligence in which a computer is trained to learn and understand the visual world. With pictures, videos and deep learning models, objects of interest can be accurately classified and identified, and further judgments can be made. Computer vision is generally divided into major tasks such as image recognition, object detection and instance segmentation. A classification task gives a content description of the whole picture, whereas a detection task focuses on specific objects of interest and must produce both a recognition result and a localization result for each such object. Compared with classification, detection requires understanding the foreground and background of a picture, separating the objects of interest from the background, and determining their identities and locations.
Target detection is a popular direction in computer vision research and is widely used in robot navigation, video surveillance, industrial inspection, face recognition and other fields. Image target detection has progressed greatly in the past few years, with markedly improved detection performance. However, in fields such as video surveillance and vehicle assisted driving there is a broader demand for video-based target detection, and directly transferring image detection techniques to video detection tasks raises new challenges. First, applying an image target detection network to every frame of a video incurs a huge computational cost; second, conventional image target detection methods cannot effectively exploit the temporal continuity of video data or the sparse foreground prior, and struggle to mine the temporal characteristics of video data.
Video is composed of images, and video object detection is closely related to image object detection. To improve the accuracy of video detection, after each frame is detected by an image target detector, the detection results are often further processed using the temporal characteristics specific to video. To take advantage of the temporal continuity and redundancy of video data, some recent approaches employ optical flow, attention mechanisms, sequence models and the like to mine the temporal characteristics of the video.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a video target detection method, a storage medium and a device based on a sparse foreground prior, aiming at the defects in the prior art, so as to improve the detection performance of video target detection.
The invention adopts the following technical scheme:
The video target detection method based on a sparse foreground prior comprises the following steps:
S1, dividing a video V into m video clips C_i, i=1,2,…,m, and for each video clip C_i obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the clip by a foreground extraction algorithm based on orthogonal subspace learning;
S2, inputting the video frame I^(t) and the sparse foreground map E^(t) respectively into a ResNet feature extraction network, each layer of which outputs the feature map of the corresponding layer, i.e. computing the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of the sparse foreground map E^(t);
S3, through the feature pyramid structure, combining each layer feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature with the features obtained by up-sampling the higher layers, and computing the semantic enhancement feature of the video frame I^(t) and the foreground semantic enhancement feature;
S4, fusing the semantic enhancement feature of the video frame I^(t) with the corresponding foreground semantic enhancement feature to obtain the foreground prior fusion feature map of the video frame I^(t);
S5, generating anchor frames on the foreground prior fusion feature map of the video frame I^(t);
S6, inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor frames into a trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t), thereby completing target detection.
Specifically, in step S1, each frame image I^(t) in the video clip C_i is converted to grayscale and reshaped into a column vector; the column vectors are combined into a two-dimensional matrix X. The sparse foreground prior E of all frames in the video clip C_i is obtained by solving the objective function, E is split by columns, and the sparse foreground map E^(t) of each frame I^(t) is recovered by reshaping. The objective function is:

min_{D,θ,E} (1/2)·||X − Dθ − E||_F^2 + α·||θ||_F^2 + β·||E||_{row,1},  s.t. D^T D = I_k,

wherein D is an orthogonal subspace, θ is the orthogonal subspace coefficient, α and β are regularization parameters, ||·||_{row,1} denotes the row-wise 1-norm of a matrix, and I_k is the identity matrix of order k.
Further, the objective function is solved by an alternating direction method. D and θ are solved by a block coordinate descent method: a residual term is defined, and D and θ are solved and updated using the residual term; the solved and updated D and θ are then used to update E through the shrinkage function S_ε(x) = sign(x) ⊙ max(|x| − ε, 0), where ⊙ denotes element-by-element multiplication and sign(·) is the sign function. The updates are iterated until the convergence condition or the maximum number of iterations is reached, yielding the sparse foreground prior E of all frames in the video clip C_i.
Specifically, in step S3, in the process of obtaining the feature map F^(t) and the sparse foreground prior feature map of the video frame I^(t) and the sparse foreground map E^(t) through the ResNet feature extraction network, 5 features of different scales are extracted from the intermediate layers of the ResNet feature extraction network. The 5 features of different scales form a feature pyramid; the bottom of the feature pyramid is a high-resolution feature map and the top feature map is a low-resolution feature map. The strong semantic features of the higher layer of the feature pyramid are nearest-neighbour up-sampled and added to the features of the lower layer, and after a 3×3 convolution kernel, the features with semantic information and the foreground prior features are output.
Specifically, in step S4, the semantic enhancement feature of the video frame I^(t) and the corresponding foreground semantic enhancement feature are cascaded, and a 1×1 convolution operation yields the foreground prior fusion feature map.
Specifically, in step S5, on each pixel of each layer of the foreground prior fusion feature map, a basic anchor frame of size 16×16 is set; keeping the area unchanged, aspect ratios of 0.5, 1 and 2 are used, and the anchor frames of the three aspect ratios are then enlarged by scales of 8, 16 and 32 respectively, so that a total of 9 anchor frames are generated for each pixel on each layer of the foreground prior fusion feature map.
Specifically, in step S6, training the classification and regression sub-networks includes:
s6011, randomly initializing weight parameters of classification and regression networks;
s6012, calculating the probability that the candidate areas belong to each category by using the initialized classification network for each candidate area, and calculating the position coordinates of the candidate areas by using the initialized regression network;
s6013, constructing a target detection loss function L;
s6014, updating the learning classification and regression network parameters through back propagation iteration by utilizing the target detection loss function L until the network converges, and obtaining the trained classification and regression sub-network.
Further, in step S6013, the loss function L is:

L = Σ_i [ −(1 − p_i^z)^γ · log(p_i^z) + ω · smooth_L1(a_i, a_i^*) ],

where z is the true label of the i-th candidate region, p_i^z is the probability that the i-th candidate region belongs to the class-z object, γ is the focusing parameter, −(1 − p_i^z)^γ·log(p_i^z) is the focal loss for object classification, a_i is the position coordinate vector of the i-th candidate region, a_i^* is the coordinate vector of the real target frame corresponding to the i-th candidate region, smooth_L1(a_i, a_i^*) is the Smooth L1 regression loss of the target frame, and ω is the balance weight.
Another aspect of the invention is a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods.
Another aspect of the present invention is a computing device, including:
one or more processors, memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods.
Compared with the prior art, the invention has at least the following beneficial effects:
According to the video target detection method based on the sparse foreground prior, on the basis of an image target detection method, the sparse prior of the foreground and the spatio-temporal continuity prior of video data are used to extract a motion foreground prior map, from which a foreground semantic enhancement feature map is obtained; this map is cascaded with the semantic enhancement feature of the current frame to obtain the foreground prior fusion feature of the current frame. After the foreground prior features are fused, video frames with motion blur, object occlusion and large size changes can still be detected, which improves detection accuracy. The relation between the features of adjacent frames is fully utilized, and the detection results do not need further processing after each frame is detected; compared with existing video target detection methods based on post-processing of image detection results, the detection speed is improved.
Furthermore, the foreground extraction algorithm based on orthogonal subspace learning yields the moving foreground targets of greater interest: all video frames in a video clip are treated as a whole and the foreground maps of all frames are obtained by the orthogonal subspace learning algorithm, so the sparse foreground prior of the video data is better utilized.
Further, the objective function is solved by the alternating direction method, in which the unconstrained optimization parts are optimized by a block coordinate descent method; a large global optimization problem is decomposed into several easily solved sub-problems, and the solution of the global problem is obtained by solving these sub-problems.
Furthermore, the features extracted by the ResNet network are built into a feature pyramid, through which the multi-scale features of the video frame and of the foreground prior map are obtained; the low-resolution, semantically rich high-level features of the pyramid are used to enhance the low-level features, so the resulting semantic enhancement features carry richer semantic information.
Furthermore, the semantic enhancement features of the foreground image and the semantic enhancement features of the current video frame are subjected to cascade convolution fusion to obtain a feature image with foreground prior enhancement, foreground sparse prior information is added in the detection process of the foreground object on the video frame, the feature information of the foreground object is enhanced, and the detection performance is further enhanced.
Further, by generating anchor frames on the feature map and classifying each anchor frame, regression is performed on the anchor frames which are judged to be positive samples, and accurate target positions are obtained. Generating anchor boxes on the feature map can limit the number of candidate regions to a controllable range, and the calculation amount is greatly reduced.
Furthermore, training of the video data is completed by constructing a classification sub-network and a regression sub-network, wherein the classification sub-network can obtain a fine target classification result, and the regression sub-network can further correct a target positioning result, so that the recognition result and the position of different targets in the finally obtained video frame are more accurate.
Further, the loss function L is set mainly to address the imbalance between positive and negative samples in the one-stage target detection task: during training it down-weights the abundant, easy negative samples.
In summary, the invention fully utilizes the sparse prior of the foreground and the relation between the adjacent frame features aiming at the phenomena of motion blur, object shielding, large size change and the like in the video data, so that targets with different scales and blur in the video data can be effectively detected, and the detection accuracy is improved.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is an effect diagram of the present invention for video object detection, wherein (a) is the detection result of one frame in a video sequence targeted to a ship, and (b) is the detection result of another frame in a video sequence targeted to a ship;
FIG. 3 is a second effect diagram of the video object detection of the present invention, wherein (a) is the detection result of one frame of the video sequence targeted to the dog, and (b) is the detection result of another frame of the video sequence targeted to the dog;
fig. 4 is a third effect diagram of video object detection according to the present invention, where (a) is the detection result of one frame in a video sequence whose targets are an elephant and an automobile, and (b) is the detection result of another frame of the same sequence.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a video target detection method based on a sparse foreground prior. First, a motion sparse foreground prior map of each frame of the video is obtained by a foreground extraction method based on orthogonal subspace learning; then, multi-scale semantic enhancement features of the video frames and of their sparse foreground maps are extracted with a ResNet feature extraction network and a feature pyramid structure; the foreground semantic enhancement features are cascade-fused with the semantic enhancement features of the current frame to obtain foreground prior fusion features; anchor frames are generated at every pixel of the foreground prior fusion feature map; and the categories and position coordinates of all targets are obtained through a classification and regression network. The sparse foreground prior of the video data is fully mined, and target detection accuracy is improved.
Referring to fig. 1, the video target detection method based on sparse foreground priori is divided into two parts, namely training and testing, wherein the loss function of a network model is calculated in the training process, and then the network parameters are updated by using back propagation; in the test process, the trained network parameters are used for fusing the semantic enhancement features of the current frame with the foreground semantic enhancement features to obtain foreground priori fusion features of the video frame, and then the category and the position of the interested target in the video frame are obtained based on the foreground priori fusion features; the method comprises the following specific steps:
S1, dividing a video V into m video clips C_i, i=1,2,…,m, and for each video clip C_i obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the clip by a foreground extraction algorithm based on orthogonal subspace learning.
Each frame image I^(t) in the video clip C_i is converted to grayscale and reshaped into a column vector, the column vectors are combined into a two-dimensional matrix X, and the foreground prior E corresponding to all frames is obtained by solving the objective function.
The objective function is:

min_{D,θ,E} (1/2)·||X − Dθ − E||_F^2 + α·||θ||_F^2 + β·||E||_{row,1},  s.t. D^T D = I_k,

wherein D is an orthogonal subspace, θ is the orthogonal subspace coefficient, α and β are regularization parameters, ||·||_{row,1} denotes the row-wise 1-norm of a matrix, and I_k is the identity matrix of order k.
In a specific implementation, the objective function can be solved by an inexact alternating direction method, repeatedly executing the following steps:
S101, D and θ are solved by a block coordinate descent method: a residual term is defined, and D and θ are solved and updated using the residual term.
S102, the solved and updated D and θ are used to update E through the shrinkage function S_ε(x) = sign(x) ⊙ max(|x| − ε, 0), where "⊙" denotes element-by-element multiplication and "sign(·)" is the sign function.
The updates are iterated until the convergence condition is met, i.e. after the maximum number of iterations is reached, yielding the sparse foreground prior E of all frames in the video clip C_i; E is split by columns and reshaped to recover the sparse foreground map E^(t) of each frame I^(t).
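By way of illustration only, a minimal NumPy sketch of an alternating update of this kind is given below. It assumes the objective form written out above (a Frobenius data term, a soft-thresholded sparse term E, and an orthogonality constraint on D handled by a Procrustes-style projection); the closed-form D and θ updates used here are standard choices for that surrogate objective rather than the exact update formulas of the embodiment, and all function and parameter names are illustrative.

```python
import numpy as np

def shrink(x, eps):
    """Soft-thresholding: S_eps(x) = sign(x) * max(|x| - eps, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)

def sparse_foreground(X, k=3, alpha=0.01, beta=0.1, n_iter=50, tol=1e-6):
    """Alternating estimation of a background subspace (D, theta) and a sparse foreground E.

    X : (n_pixels, n_frames) matrix whose columns are the grayed, vectorised frames of one clip.
    Returns E of the same shape; column t is the sparse foreground of frame t.
    Assumed surrogate objective:
        0.5*||X - D @ theta - E||_F**2 + alpha*||theta||_F**2 + beta*||E||_1,
        subject to D.T @ D = I_k.
    """
    n, m = X.shape
    rng = np.random.default_rng(0)
    D, _ = np.linalg.qr(rng.standard_normal((n, k)))      # random orthonormal initialisation
    theta = D.T @ X
    E = np.zeros_like(X)
    for _ in range(n_iter):
        E_old = E
        G = X - E                                         # part of X to be explained by the subspace
        U, _, Vt = np.linalg.svd(G @ theta.T, full_matrices=False)
        D = U @ Vt                                        # D-step: orthogonal Procrustes projection
        theta = (D.T @ G) / (1.0 + alpha)                 # theta-step: ridge-regularised coefficients
        E = shrink(X - D @ theta, beta)                   # E-step: soft-threshold the residual
        if np.linalg.norm(E - E_old) <= tol * max(np.linalg.norm(E_old), 1.0):
            break
    return E
```

Each column of the returned E is reshaped back to the frame's height and width, which corresponds to the column-splitting and recovery step described above.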
S2, computing the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of the sparse foreground map E^(t).
The video frame I^(t) and the sparse foreground map E^(t) are respectively input into the ResNet feature extraction network, and each layer of the ResNet feature extraction network outputs the feature map F^(t) and the sparse foreground prior feature map of that layer.
The ResNet feature extraction network consists of one 7×7 convolution layer, one max pooling layer and 16 residual blocks, where each residual block combines a 1×1 convolution layer, a 3×3 convolution layer, a 1×1 convolution layer, batch normalization layers and activation function layers. The 16 residual blocks are divided into 5 stages, and the output of each stage is taken as a feature of the input image at a different semantic level.
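For reference, one way to take five stage outputs from a standard torchvision ResNet-50, used here only as a stand-in for the feature extraction network described above, is sketched below; how the single-channel sparse foreground map is fed into the 3-channel stem is not specified in this text, and replicating its channel is one simple assumption.

```python
import torch
import torchvision

def resnet_stage_features(x, backbone=None):
    """Return the outputs of the 5 stages of a ResNet-50 backbone for an input batch x of shape (B, 3, H, W)."""
    if backbone is None:
        backbone = torchvision.models.resnet50()              # stock torchvision model as a stand-in
    backbone.eval()
    with torch.no_grad():
        c1 = backbone.relu(backbone.bn1(backbone.conv1(x)))   # stem output, stride 2
        c2 = backbone.layer1(backbone.maxpool(c1))            # stride 4
        c3 = backbone.layer2(c2)                              # stride 8
        c4 = backbone.layer3(c3)                              # stride 16
        c5 = backbone.layer4(c4)                              # stride 32
    return [c1, c2, c3, c4, c5]

# usage: features of a video frame and of its sparse foreground map (channel replicated to 3)
frame = torch.rand(1, 3, 224, 224)
foreground = torch.rand(1, 1, 224, 224).repeat(1, 3, 1, 1)
frame_feats = resnet_stage_features(frame)
foreground_feats = resnet_stage_features(foreground)
```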
S3, computing the semantic enhancement feature of the video frame I^(t) and the foreground semantic enhancement feature.
Through the feature pyramid structure, each layer feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature are respectively combined with the features obtained by up-sampling the higher layers, yielding the semantic enhancement feature and the foreground semantic enhancement feature with rich semantic information.
In the process of obtaining the feature map F^(t) and the sparse foreground prior feature map of the video frame I^(t) and the sparse foreground map E^(t) through the ResNet feature extraction network, 5 features of different scales are extracted from the intermediate layers of the ResNet network, and the feature pyramid is composed of these 5 features of different scales. The bottom of the feature pyramid is a high-resolution feature map and the top feature map is a low-resolution feature map; the higher the level, the smaller the feature map and the lower the resolution.
The low-resolution, high-semantic features of the higher pyramid levels, which carry abstract information, are nearest-neighbour up-sampled and added to the lower-level features; after a 3×3 convolution kernel, the features with rich semantic information and the foreground prior features are output.
S4, computing the foreground prior fusion feature map of the video frame I^(t).
The semantic enhancement feature of the video frame I^(t) and the corresponding foreground semantic enhancement feature are cascaded, and a 1×1 convolution operation yields the foreground prior fusion feature map.
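An illustrative sketch of the cascade-and-1×1-convolution fusion of step S4 follows, assuming both semantic enhancement features share the same channel width (256 here is only an example):

```python
import torch
import torch.nn as nn

class ForegroundPriorFusion(nn.Module):
    """Concatenates the frame's semantic enhancement feature with the foreground semantic
    enhancement feature and fuses them with a 1x1 convolution (step S4)."""

    def __init__(self, channels=256):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, frame_feat, foreground_feat):
        # cascade (channel-wise concatenation), then 1x1 convolution fusion
        return self.fuse(torch.cat([frame_feat, foreground_feat], dim=1))

# usage, per pyramid level:
fusion = ForegroundPriorFusion(channels=256)
fused = fusion(torch.rand(1, 256, 56, 56), torch.rand(1, 256, 56, 56))
```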
S5, generating anchor frames on the foreground prior fusion feature map of the video frame I^(t).
On each pixel of each layer of the foreground prior fusion feature map, a basic anchor frame of size 16×16 is set; keeping the area unchanged, aspect ratios of 0.5, 1 and 2 are used, and the anchor frames of the three aspect ratios are then enlarged by scales of 8, 16 and 32 respectively, so that a total of 9 anchor frames are generated for each pixel on each layer of the foreground prior fusion feature map.
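A sketch of the anchor generation of step S5 is given below: a 16×16 base anchor, aspect ratios 0.5, 1 and 2 at constant area, and scale factors 8, 16 and 32, giving 9 anchors per pixel. Placing the anchor centres at pixel positions multiplied by the feature stride is an assumed convention not spelled out above.

```python
import numpy as np

def base_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Nine anchors (x1, y1, x2, y2) centred at the origin: 3 aspect ratios x 3 scales."""
    area = float(base_size * base_size)
    anchors = []
    for r in ratios:
        w = np.sqrt(area / r)            # keep the area constant while the ratio h/w = r
        h = w * r
        for s in scales:
            ws, hs = w * s, h * s
            anchors.append([-ws / 2.0, -hs / 2.0, ws / 2.0, hs / 2.0])
    return np.array(anchors)

def anchors_for_level(feat_h, feat_w, stride):
    """Tile the 9 base anchors over every pixel of one layer of the fusion feature map."""
    base = base_anchors()
    xs = (np.arange(feat_w) + 0.5) * stride            # assumed anchor centres in image coordinates
    ys = (np.arange(feat_h) + 0.5) * stride
    shift_x, shift_y = np.meshgrid(xs, ys)
    shifts = np.stack([shift_x, shift_y, shift_x, shift_y], axis=-1).reshape(-1, 1, 4)
    return (shifts + base.reshape(1, -1, 4)).reshape(-1, 4)

# e.g. a 56x56 level with stride 4 yields 56*56*9 = 28224 anchors:
print(anchors_for_level(56, 56, 4).shape)
```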
S6, inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor frames into a trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t).
S601, training classification and regression sub-network:
s6011, randomly initializing weight parameters of classification and regression networks;
s6012, calculating the probability that the candidate areas belong to each category by using the initialized classification network for each candidate area, and calculating the position coordinates of the candidate areas by using the initialized regression network;
S6013, constructing the target detection loss function L:

L = Σ_i [ −(1 − p_i^z)^γ · log(p_i^z) + ω · smooth_L1(a_i, a_i^*) ],

where z is the true label of the i-th candidate region, p_i^z is the probability that the i-th candidate region belongs to the class-z object, γ is the focusing parameter, −(1 − p_i^z)^γ·log(p_i^z) is the focal loss for object classification, a_i is the position coordinate vector of the i-th candidate region, a_i^* is the coordinate vector of the real target frame corresponding to the i-th candidate region, smooth_L1(a_i, a_i^*) is the Smooth L1 regression loss of the target frame, and ω is the balance weight;
s6014, updating learning classification and regression network parameters through back propagation iteration by utilizing a target detection loss function L until the network converges, and obtaining a trained classification and regression sub-network;
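By way of illustration, the loss written out in step S6013 (a focal classification term with focusing parameter γ plus an ω-weighted Smooth L1 regression term) can be sketched in PyTorch as follows; the default values gamma=2.0 and omega=1.0, the class count, and the sum reduction over candidate regions are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_prob, labels, box_pred, box_target, gamma=2.0, omega=1.0):
    """Focal classification loss plus an omega-weighted Smooth L1 regression loss (step S6013).

    cls_prob   : (N, num_classes) predicted class probabilities of N candidate regions
    labels     : (N,) true class index z of each candidate region
    box_pred   : (N, 4) predicted position coordinates a_i
    box_target : (N, 4) coordinates a_i* of the matched real target frames
    """
    # focal loss per candidate region: -(1 - p_i^z)^gamma * log(p_i^z)
    p_z = cls_prob.gather(1, labels.unsqueeze(1)).squeeze(1).clamp(min=1e-8)
    focal = (-(1.0 - p_z) ** gamma * torch.log(p_z)).sum()

    # Smooth L1 regression loss of the target frames
    reg = F.smooth_l1_loss(box_pred, box_target, reduction="sum")

    return focal + omega * reg

# usage with random values (31 = 30 illustrative classes + background):
probs = torch.softmax(torch.randn(8, 31), dim=1)
loss = detection_loss(probs, torch.randint(0, 31, (8,)), torch.randn(8, 4), torch.randn(8, 4))
```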
S602, inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor frames into the trained classification and regression network to obtain the target category and target frame location of the video frame I^(t).
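For completeness, a minimal sketch of the classification and regression sub-networks applied to one layer of the foreground prior fusion feature map is given below; the depth and width of these heads and the number of classes are not specified above, so a single 3×3 convolution followed by an output convolution is assumed for each head, and the class count is illustrative.

```python
import torch
import torch.nn as nn

class ClsRegHeads(nn.Module):
    """Predicts class logits and 4 box-regression offsets for every anchor at every pixel."""

    def __init__(self, channels=256, num_anchors=9, num_classes=30):
        super().__init__()
        self.num_classes = num_classes
        self.cls_head = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, num_anchors * num_classes, 3, padding=1),
        )
        self.reg_head = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, num_anchors * 4, 3, padding=1),
        )

    def forward(self, fused_feat):
        b = fused_feat.shape[0]
        # reorder to one row per anchor: (B, H*W*num_anchors, num_classes) and (B, H*W*num_anchors, 4)
        cls = self.cls_head(fused_feat).permute(0, 2, 3, 1).reshape(b, -1, self.num_classes)
        reg = self.reg_head(fused_feat).permute(0, 2, 3, 1).reshape(b, -1, 4)
        return cls, reg

# usage on one 56x56 fusion level, which carries 56*56*9 anchors:
heads = ClsRegHeads()
scores, offsets = heads(torch.rand(1, 256, 56, 56))
print(scores.shape, offsets.shape)   # torch.Size([1, 28224, 30]) torch.Size([1, 28224, 4])
```

Class probabilities follow from a softmax (or per-class sigmoid) over the returned logits, and the offsets are applied to the anchors generated in step S5 to obtain the final target frame locations.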
In yet another embodiment of the present invention, a terminal device is provided, including a processor and a memory, the memory being used to store a computer program comprising program instructions, and the processor being used to execute the program instructions stored in the computer storage medium. The processor may be a central processing unit (CPU), or another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.; it is the computational and control core of the terminal, adapted to implement one or more instructions, and in particular adapted to load and execute one or more instructions to realise the corresponding method flow or function. The processor according to the embodiment of the invention can be used for video target detection based on a sparse foreground prior, including: dividing a video V into m video clips C_i, i=1,2,…,m, and for each video clip C_i obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the clip by a foreground extraction algorithm based on orthogonal subspace learning; inputting the video frame I^(t) and the sparse foreground map E^(t) respectively into a ResNet feature extraction network, each layer of which outputs the feature map of the corresponding layer, i.e. computing the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of the sparse foreground map E^(t); through the feature pyramid structure, combining each layer feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature with the features obtained by up-sampling the higher layers, and computing the semantic enhancement feature of the video frame I^(t) and the foreground semantic enhancement feature; fusing the semantic enhancement feature of the video frame I^(t) with the corresponding foreground semantic enhancement feature to obtain the foreground prior fusion feature map of the video frame I^(t); generating anchor frames on the foreground prior fusion feature map of the video frame I^(t); and inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor frames into a trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t), thereby completing target detection.
In a further embodiment of the present invention, a storage medium is also provided, in particular a computer readable storage medium (memory), which is a memory device in a terminal device for storing programs and data. It will be appreciated that the computer readable storage medium here may include both a built-in storage medium in the terminal device and an extended storage medium supported by the terminal device. The computer-readable storage medium provides a storage space storing the operating system of the terminal. One or more instructions, which may be one or more computer programs (including program code), are also stored in the storage space and are adapted to be loaded and executed by the processor. The computer readable storage medium here may be a high-speed RAM memory or a non-volatile memory, such as at least one magnetic disk memory.
One or more instructions stored in the computer-readable storage medium may be loaded and executed by a processor to implement the corresponding steps of the video target detection method based on a sparse foreground prior in the above embodiments; one or more instructions in the computer-readable storage medium are loaded by the processor and perform the following steps: dividing a video V into m video clips C_i, i=1,2,…,m, and for each video clip C_i obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the clip by a foreground extraction algorithm based on orthogonal subspace learning; inputting the video frame I^(t) and the sparse foreground map E^(t) respectively into a ResNet feature extraction network, each layer of which outputs the feature map of the corresponding layer, i.e. computing the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of the sparse foreground map E^(t); through the feature pyramid structure, combining each layer feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature with the features obtained by up-sampling the higher layers, and computing the semantic enhancement feature of the video frame I^(t) and the foreground semantic enhancement feature; fusing the semantic enhancement feature of the video frame I^(t) with the corresponding foreground semantic enhancement feature to obtain the foreground prior fusion feature map of the video frame I^(t); generating anchor frames on the foreground prior fusion feature map of the video frame I^(t); and inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor frames into a trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t), thereby completing target detection.
The effect of the invention can be further illustrated by the following simulations:
1. simulation conditions
Using a workstation equipped with an RTX 2080TI graphics card, the software framework was PyTorch.
Selecting a video sequence with a large scale difference as a first group of detected video sequences, wherein the target is a ship, as shown in fig. 2;
selecting a video sequence with a large gesture difference as a second group of detected video sequences, wherein the target is a dog, as shown in fig. 3;
A video sequence with object occlusion is selected as the third group of detected video sequences, as shown in fig. 4, where the targets are an elephant and an automobile.
2. Emulation content
Simulation 1, the method of the invention is used for detecting video targets of a first group of detected video sequences, and the detection results of two frames are obtained, as shown in fig. 2.
Simulation 2, the method of the invention is used for detecting video targets of a second group of detected video sequences, and the detection results of two frames are obtained, as shown in fig. 3.
Simulation 3, the method of the invention is used for detecting video targets of a third group of detected video sequences, and the detection results of two frames are shown in fig. 4.
3. Simulation result analysis
Fig. 2 (a) shows the detection result of one frame of the video sequence whose target is a ship, and fig. 2 (b) shows the detection result of another frame of the same sequence; it can be seen that, when the sizes of the targets differ greatly, the invention can accurately detect the categories and positions of targets of different sizes in the video. Fig. 3 (a) shows the detection result of one frame of the video sequence whose target is a dog, and fig. 3 (b) shows the detection result of another frame of the same sequence; the invention can accurately detect the category and position of the target under blurred pictures and large pose differences. Fig. 4 (a) shows the detection result of one frame of the video sequence whose targets are an elephant and an automobile, and fig. 4 (b) shows the detection result of another frame of the same sequence; the invention can accurately detect the categories and positions of occluded targets when different kinds of targets occlude each other, especially the left elephant in fig. 4 (b), which is almost completely occluded.
In summary, the video target detection method based on a sparse foreground prior can effectively detect the categories and positions of targets in video sequences containing targets of different scales, motion blur and occlusion.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above is only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited by this, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. The video target detection method based on a sparse foreground prior is characterized by comprising the following steps:
S1, dividing a video V into m video clips C_i, i=1,2,…,m, and for each video clip C_i obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the clip by a foreground extraction algorithm based on orthogonal subspace learning;
S2, inputting the video frame I^(t) and the sparse foreground map E^(t) respectively into a ResNet feature extraction network, each layer of which outputs the feature map of the corresponding layer, i.e. computing the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of the sparse foreground map E^(t);
S3, through the feature pyramid structure, combining each layer feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature with the features obtained by up-sampling the higher layers, and computing the semantic enhancement feature of the video frame I^(t) and the foreground semantic enhancement feature;
S4, fusing the semantic enhancement feature of the video frame I^(t) with the corresponding foreground semantic enhancement feature to obtain the foreground prior fusion feature map of the video frame I^(t);
S5, generating anchor frames on the foreground prior fusion feature map of the video frame I^(t);
S6, inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor frames into a trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t), thereby completing target detection.
2. The method for detecting video objects based on sparse foreground prior of claim 1, wherein in step S1, each frame image I^(t) in the video clip C_i is converted to grayscale and reshaped into a column vector, the column vectors are combined into a two-dimensional matrix X, the sparse foreground prior E of all frames in the video clip C_i is obtained by solving the objective function, E is split by columns, and the sparse foreground map E^(t) of each frame I^(t) is recovered by reshaping, the objective function being:

min_{D,θ,E} (1/2)·||X − Dθ − E||_F^2 + α·||θ||_F^2 + β·||E||_{row,1},  s.t. D^T D = I_k,

wherein D is an orthogonal subspace, θ is the orthogonal subspace coefficient, α and β are regularization parameters, ||·||_{row,1} denotes the row-wise 1-norm of a matrix, and I_k is the identity matrix of order k.
3. The video object detection method based on sparse foreground prior of claim 2, wherein an alternating direction method is used to solve the objective function; D and θ are solved by a block coordinate descent method, a residual term is defined, and D and θ are solved and updated using the residual term; the solved and updated D and θ are used to update E through the shrinkage function S_ε(x) = sign(x) ⊙ max(|x| − ε, 0), where ⊙ denotes element-by-element multiplication and sign(·) is the sign function; the updates are iterated until the convergence condition and the maximum number of iterations are reached, yielding the sparse foreground prior E of all frames in the video clip C_i.
4. The method for detecting a video object based on sparse foreground prior of claim 1, wherein in step S3, in the process of obtaining the feature map F^(t) and the sparse foreground prior feature map of the video frame I^(t) and the sparse foreground map E^(t) through the ResNet feature extraction network, 5 features of different scales are extracted from the intermediate layers of the ResNet feature extraction network; the 5 features of different scales form a feature pyramid, the bottom of the feature pyramid is a high-resolution feature map, and the top feature map is a low-resolution feature map; the strong semantic features of the higher layer of the feature pyramid are nearest-neighbour up-sampled and added to the features of the lower layer, and after a 3×3 convolution kernel, the features with semantic information and the foreground prior features are output.
5. The method for detecting a video object based on sparse foreground prior of claim 1, wherein in step S4, the semantic enhancement feature of the video frame I^(t) and the corresponding foreground semantic enhancement feature are cascaded, and a 1×1 convolution operation yields the foreground prior fusion feature map.
6. The method for detecting a video object based on sparse foreground prior as claimed in claim 1, wherein in step S5, on each pixel of each layer of the foreground prior fusion feature map, a basic anchor frame of size 16×16 is set; keeping the area unchanged, aspect ratios of 0.5, 1 and 2 are used, and the anchor frames of the three aspect ratios are then enlarged by scales of 8, 16 and 32 respectively, so that a total of 9 anchor frames are generated for each pixel on each layer of the foreground prior fusion feature map.
7. The method for detecting a video object based on sparse foreground prior of claim 1, wherein in step S6, training the classification and regression sub-network specifically comprises:
s6011, randomly initializing weight parameters of classification and regression networks;
s6012, calculating the probability that the candidate areas belong to each category by using the initialized classification network for each candidate area, and calculating the position coordinates of the candidate areas by using the initialized regression network;
s6013, constructing a target detection loss function L;
s6014, updating the learning classification and regression network parameters through back propagation iteration by utilizing the target detection loss function L until the network converges, and obtaining the trained classification and regression sub-network.
8. The sparse foreground prior-based video object detection method of claim 7, wherein in step S6013, the loss function L is:

L = Σ_i [ −(1 − p_i^z)^γ · log(p_i^z) + ω · smooth_L1(a_i, a_i^*) ],

where z is the true label of the i-th candidate region, p_i^z is the probability that the i-th candidate region belongs to the class-z object, γ is the focusing parameter, −(1 − p_i^z)^γ·log(p_i^z) is the focal loss for object classification, a_i is the position coordinate vector of the i-th candidate region, a_i^* is the coordinate vector of the real target frame corresponding to the i-th candidate region, smooth_L1(a_i, a_i^*) is the Smooth L1 regression loss of the target frame, and ω is the balance weight.
9. A computer readable storage medium storing one or more programs, wherein the one or more programs comprise instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-8.
10. A computing device, comprising:
one or more processors, memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods of claims 1-8.
CN202011357082.7A 2020-11-26 2020-11-26 Video target detection method, storage medium and device based on sparse foreground priori Active CN112434618B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011357082.7A CN112434618B (en) 2020-11-26 2020-11-26 Video target detection method, storage medium and device based on sparse foreground priori

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011357082.7A CN112434618B (en) 2020-11-26 2020-11-26 Video target detection method, storage medium and device based on sparse foreground priori

Publications (2)

Publication Number Publication Date
CN112434618A CN112434618A (en) 2021-03-02
CN112434618B true CN112434618B (en) 2023-06-23

Family

ID=74699279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011357082.7A Active CN112434618B (en) 2020-11-26 2020-11-26 Video target detection method, storage medium and device based on sparse foreground priori

Country Status (1)

Country Link
CN (1) CN112434618B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966697B (en) * 2021-03-17 2022-03-11 西安电子科技大学广州研究院 Target detection method, device and equipment based on scene semantics and storage medium
CN112861830B (en) * 2021-04-13 2023-08-25 北京百度网讯科技有限公司 Feature extraction method, device, apparatus, storage medium, and program product
CN113505737A (en) * 2021-07-26 2021-10-15 浙江大华技术股份有限公司 Foreground image determination method and apparatus, storage medium, and electronic apparatus
CN113743249B (en) * 2021-08-16 2024-03-26 北京佳服信息科技有限公司 Method, device and equipment for identifying violations and readable storage medium
CN116630334B (en) * 2023-04-23 2023-12-08 中国科学院自动化研究所 Method, device, equipment and medium for real-time automatic segmentation of multi-segment blood vessel

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680106A (en) * 2017-10-13 2018-02-09 南京航空航天大学 A kind of conspicuousness object detection method based on Faster R CNN
CN108898145A (en) * 2018-06-15 2018-11-27 西南交通大学 A kind of image well-marked target detection method of combination deep learning
CN109447018A (en) * 2018-11-08 2019-03-08 天津理工大学 A kind of road environment visual perception method based on improvement Faster R-CNN
CN111310609A (en) * 2020-01-22 2020-06-19 西安电子科技大学 Video target detection method based on time sequence information and local feature similarity
CN111523439A (en) * 2020-04-21 2020-08-11 苏州浪潮智能科技有限公司 Method, system, device and medium for target detection based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680106A (en) * 2017-10-13 2018-02-09 南京航空航天大学 A kind of conspicuousness object detection method based on Faster R CNN
CN108898145A (en) * 2018-06-15 2018-11-27 西南交通大学 A kind of image well-marked target detection method of combination deep learning
CN109447018A (en) * 2018-11-08 2019-03-08 天津理工大学 A kind of road environment visual perception method based on improvement Faster R-CNN
CN111310609A (en) * 2020-01-22 2020-06-19 西安电子科技大学 Video target detection method based on time sequence information and local feature similarity
CN111523439A (en) * 2020-04-21 2020-08-11 苏州浪潮智能科技有限公司 Method, system, device and medium for target detection based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xianfeng Ou et al., "Moving Object Detection Method via ResNet-18 With Encoder–Decoder Structure in Complex Scenes," IEEE Access, vol. 7, 2019, pp. 108152-108160 *
Zhao Yongqiang et al., "A Survey of Deep Learning Object Detection Methods," Journal of Image and Graphics (中国图象图形学报), vol. 25, no. 4, April 2020, pp. 629-653 *

Also Published As

Publication number Publication date
CN112434618A (en) 2021-03-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant