CN112434618B - Video target detection method, storage medium and device based on sparse foreground priori - Google Patents


Info

Publication number
CN112434618B
CN112434618B
Authority
CN
China
Prior art keywords
foreground
video
sparse
feature
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011357082.7A
Other languages
Chinese (zh)
Other versions
CN112434618A (en
Inventor
古晶
巨小杰
马文萍
孙新凯
刘芳
杨淑媛
焦李成
冯婕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202011357082.7A
Publication of CN112434618A
Application granted
Publication of CN112434618B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems


Abstract

The invention discloses a video target detection method, a storage medium and a device based on a sparse foreground prior. A foreground extraction method based on orthogonal subspace learning is used to compute a sparse foreground prior map for each frame of a video; a ResNet feature extraction network and a feature pyramid structure are used to obtain semantic enhancement feature maps of each video frame and of its sparse foreground prior map; the semantic enhancement feature map of the sparse foreground prior map is cascaded with the semantic enhancement feature map of the current frame and fused by a convolution operation to obtain the foreground prior fusion feature of the current frame; candidate anchor frames are generated at every pixel of the foreground prior fusion feature map; and the foreground prior fusion features together with all anchor frames are input into a trained classification and regression sub-network to obtain the categories and position coordinates of the target objects. The method fully exploits the sparse foreground prior of the video data and improves target detection accuracy.

Description

Video target detection method, storage medium and device based on sparse foreground prior
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a video target detection method, a storage medium and a device based on a sparse foreground prior.
Background
Computer vision is an important area of artificial intelligence in which a computer is trained to learn and understand the visual world. With pictures, videos and deep learning models, objects of interest can be accurately classified and identified, and further judgments can be made. Computer vision is generally divided into major tasks such as image recognition, object detection and instance segmentation. A classification task gives a content description of the whole picture, whereas a detection task focuses on specific objects of interest and must produce both a recognition result and a localization result for each such object. Compared with classification, detection requires understanding the foreground and background of a picture, separating the objects of interest from the background, and determining their identities and locations.
Target detection is a popular direction in computer vision research and is widely used in robot navigation, video surveillance, industrial inspection, face recognition and other fields. Image target detection has progressed greatly in the past few years, with markedly improved detection performance. However, in fields such as video surveillance and vehicle assisted driving there is a broader demand for video-based target detection, and directly transferring image detection techniques to video detection tasks raises new challenges. First, applying an image target detection network to every frame of a video incurs a huge computational cost; second, conventional image target detection methods cannot effectively exploit the temporal continuity of video data or the sparse foreground prior, and struggle to mine the temporal characteristics of video data.
Video is composed of images, and video object detection is closely related to image object detection. To improve the accuracy of video detection, after each frame is detected by an image target detector, the detection results are often further processed using the temporal characteristics specific to video. To take advantage of the temporal continuity and redundancy of video data, some recent approaches employ optical flow, attention mechanisms, sequence models and the like to mine the temporal characteristics of the video.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a video target detection method, a storage medium and a device based on a sparse foreground prior, aiming at the defects in the prior art, so as to improve the detection performance of video target detection.
The invention adopts the following technical scheme:
The video target detection method based on a sparse foreground prior comprises the following steps:
S1, dividing a video V into m video clips C_i, i=1,2,…,m, and for each video clip C_i obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the clip by a foreground extraction algorithm based on orthogonal subspace learning;
S2, inputting the video frame I^(t) and the sparse foreground map E^(t) respectively into a ResNet feature extraction network, each layer of which outputs the feature map of the corresponding layer, i.e. computing the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of the sparse foreground map E^(t);
S3, through the feature pyramid structure, combining each layer feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature with the features obtained by up-sampling the higher layers, and computing the semantic enhancement feature of the video frame I^(t) and the foreground semantic enhancement feature;
S4, fusing the semantic enhancement feature of the video frame I^(t) with the corresponding foreground semantic enhancement feature to obtain the foreground prior fusion feature map of the video frame I^(t);
S5, generating anchor frames on the foreground prior fusion feature map of the video frame I^(t);
S6, inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor frames into a trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t), thereby completing target detection.
Specifically, in step S1, each frame image I^(t) in the video clip C_i is converted to grayscale and reshaped into a column vector; the column vectors are combined into a two-dimensional matrix X. The sparse foreground prior E of all frames in the video clip C_i is obtained by solving the objective function, E is split by columns, and the sparse foreground map E^(t) of each frame I^(t) is recovered by reshaping. The objective function is:

min_{D,θ,E} (1/2)·||X − Dθ − E||_F^2 + α·||θ||_F^2 + β·||E||_{row,1},  s.t. D^T D = I_k,

wherein D is an orthogonal subspace, θ is the orthogonal subspace coefficient, α and β are regularization parameters, ||·||_{row,1} denotes the row-wise 1-norm of a matrix, and I_k is the identity matrix of order k.
Further, the objective function is solved by an alternating direction method. D and θ are solved by a block coordinate descent method: a residual term is defined, and D and θ are solved and updated using the residual term; the solved and updated D and θ are then used to update E through the shrinkage function S_ε(x) = sign(x) ⊙ max(|x| − ε, 0), where ⊙ denotes element-by-element multiplication and sign(·) is the sign function. The updates are iterated until the convergence condition or the maximum number of iterations is reached, yielding the sparse foreground prior E of all frames in the video clip C_i.
Specifically, in step S3, in the process of obtaining the feature map F^(t) and the sparse foreground prior feature map of the video frame I^(t) and the sparse foreground map E^(t) through the ResNet feature extraction network, 5 features of different scales are extracted from the intermediate layers of the ResNet feature extraction network. The 5 features of different scales form a feature pyramid; the bottom of the feature pyramid is a high-resolution feature map and the top feature map is a low-resolution feature map. The strong semantic features of the higher layer of the feature pyramid are nearest-neighbour up-sampled and added to the features of the lower layer, and after a 3×3 convolution kernel, the features with semantic information and the foreground prior features are output.
Specifically, in step S4, the semantic enhancement feature of the video frame I^(t) and the corresponding foreground semantic enhancement feature are cascaded, and a 1×1 convolution operation yields the foreground prior fusion feature map.
Specifically, in step S5, on each pixel of each layer of the foreground prior fusion feature map, a basic anchor frame of size 16×16 is set; keeping the area unchanged, aspect ratios of 0.5, 1 and 2 are used, and the anchor frames of the three aspect ratios are then enlarged by scales of 8, 16 and 32 respectively, so that a total of 9 anchor frames are generated for each pixel on each layer of the foreground prior fusion feature map.
Specifically, in step S6, training the classification and regression sub-networks includes:
s6011, randomly initializing weight parameters of classification and regression networks;
s6012, calculating the probability that the candidate areas belong to each category by using the initialized classification network for each candidate area, and calculating the position coordinates of the candidate areas by using the initialized regression network;
s6013, constructing a target detection loss function L;
s6014, updating the learning classification and regression network parameters through back propagation iteration by utilizing the target detection loss function L until the network converges, and obtaining the trained classification and regression sub-network.
Further, in step S6013, the loss function L is:

L = Σ_i [ −(1 − p_i^z)^γ · log(p_i^z) + ω · smooth_L1(a_i, a_i^*) ],

where z is the true label of the i-th candidate region, p_i^z is the probability that the i-th candidate region belongs to the class-z object, γ is the focusing parameter, −(1 − p_i^z)^γ·log(p_i^z) is the focal loss for object classification, a_i is the position coordinate vector of the i-th candidate region, a_i^* is the coordinate vector of the real target frame corresponding to the i-th candidate region, smooth_L1(a_i, a_i^*) is the Smooth L1 regression loss of the target frame, and ω is the balance weight.
Another aspect of the invention is a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods.
Another aspect of the present invention is a computing device, including:
one or more processors, memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods.
Compared with the prior art, the invention has at least the following beneficial effects:
According to the video target detection method based on the sparse foreground prior, on the basis of an image target detection method, the sparse prior of the foreground and the spatio-temporal continuity prior of video data are used to extract a motion foreground prior map, from which a foreground semantic enhancement feature map is obtained; this map is cascaded with the semantic enhancement feature of the current frame to obtain the foreground prior fusion feature of the current frame. After the foreground prior features are fused, video frames with motion blur, object occlusion and large size changes can still be detected, which improves detection accuracy. The relation between the features of adjacent frames is fully utilized, and the detection results do not need further processing after each frame is detected; compared with existing video target detection methods based on post-processing of image detection results, the detection speed is improved.
Furthermore, the foreground extraction algorithm based on orthogonal subspace learning yields the moving foreground targets of greater interest: all video frames in a video clip are treated as a whole and the foreground maps of all frames are obtained by the orthogonal subspace learning algorithm, so the sparse foreground prior of the video data is better utilized.
Further, the objective function is solved by the alternating direction method, in which the unconstrained optimization parts are optimized by a block coordinate descent method; a large global optimization problem is decomposed into several easily solved sub-problems, and the solution of the global problem is obtained by solving these sub-problems.
Furthermore, the features extracted by the ResNet network are built into a feature pyramid, through which the multi-scale features of the video frame and of the foreground prior map are obtained; the low-resolution, semantically rich high-level features of the pyramid are used to enhance the low-level features, so the resulting semantic enhancement features carry richer semantic information.
Furthermore, the semantic enhancement features of the foreground image and the semantic enhancement features of the current video frame are subjected to cascade convolution fusion to obtain a feature image with foreground prior enhancement, foreground sparse prior information is added in the detection process of the foreground object on the video frame, the feature information of the foreground object is enhanced, and the detection performance is further enhanced.
Further, by generating anchor frames on the feature map and classifying each anchor frame, regression is performed on the anchor frames which are judged to be positive samples, and accurate target positions are obtained. Generating anchor boxes on the feature map can limit the number of candidate regions to a controllable range, and the calculation amount is greatly reduced.
Furthermore, training of the video data is completed by constructing a classification sub-network and a regression sub-network, wherein the classification sub-network can obtain a fine target classification result, and the regression sub-network can further correct a target positioning result, so that the recognition result and the position of different targets in the finally obtained video frame are more accurate.
Further, the loss function L is set mainly to address the imbalance between positive and negative samples in the one-stage target detection task: during training it down-weights the abundant, easy negative samples.
In summary, the invention fully utilizes the sparse prior of the foreground and the relation between the adjacent frame features aiming at the phenomena of motion blur, object shielding, large size change and the like in the video data, so that targets with different scales and blur in the video data can be effectively detected, and the detection accuracy is improved.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is an effect diagram of the present invention for video object detection, wherein (a) is the detection result of one frame in a video sequence targeted to a ship, and (b) is the detection result of another frame in a video sequence targeted to a ship;
FIG. 3 is a second effect diagram of the video object detection of the present invention, wherein (a) is the detection result of one frame of the video sequence targeted to the dog, and (b) is the detection result of another frame of the video sequence targeted to the dog;
fig. 4 is a third effect diagram of video object detection according to the present invention, where (a) is the detection result of one frame in a video sequence whose targets are an elephant and an automobile, and (b) is the detection result of another frame of the same sequence.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides a video target detection method based on a sparse foreground prior. First, a motion sparse foreground prior map of each frame of the video is obtained by a foreground extraction method based on orthogonal subspace learning; then, multi-scale semantic enhancement features of the video frames and of their sparse foreground maps are extracted with a ResNet feature extraction network and a feature pyramid structure; the foreground semantic enhancement features are cascade-fused with the semantic enhancement features of the current frame to obtain foreground prior fusion features; anchor frames are generated at every pixel of the foreground prior fusion feature map; and the categories and position coordinates of all targets are obtained through a classification and regression network. The sparse foreground prior of the video data is fully mined, and target detection accuracy is improved.
Referring to fig. 1, the video target detection method based on sparse foreground priori is divided into two parts, namely training and testing, wherein the loss function of a network model is calculated in the training process, and then the network parameters are updated by using back propagation; in the test process, the trained network parameters are used for fusing the semantic enhancement features of the current frame with the foreground semantic enhancement features to obtain foreground priori fusion features of the video frame, and then the category and the position of the interested target in the video frame are obtained based on the foreground priori fusion features; the method comprises the following specific steps:
S1, dividing a video V into m video clips C_i, i=1,2,…,m, and for each video clip C_i obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the clip by a foreground extraction algorithm based on orthogonal subspace learning.
Each frame image I^(t) in the video clip C_i is converted to grayscale and reshaped into a column vector, the column vectors are combined into a two-dimensional matrix X, and the foreground prior E corresponding to all frames is obtained by solving the objective function.
The objective function is:

min_{D,θ,E} (1/2)·||X − Dθ − E||_F^2 + α·||θ||_F^2 + β·||E||_{row,1},  s.t. D^T D = I_k,

wherein D is an orthogonal subspace, θ is the orthogonal subspace coefficient, α and β are regularization parameters, ||·||_{row,1} denotes the row-wise 1-norm of a matrix, and I_k is the identity matrix of order k.
In a specific implementation, the objective function can be solved by an inexact alternating direction method, repeatedly executing the following steps:
S101, D and θ are solved by a block coordinate descent method: a residual term is defined, and D and θ are solved and updated using the residual term.
S102, the solved and updated D and θ are used to update E through the shrinkage function S_ε(x) = sign(x) ⊙ max(|x| − ε, 0), where "⊙" denotes element-by-element multiplication and "sign(·)" is the sign function.
The updates are iterated until the convergence condition is met, i.e. after the maximum number of iterations is reached, yielding the sparse foreground prior E of all frames in the video clip C_i; E is split by columns and reshaped to recover the sparse foreground map E^(t) of each frame I^(t).
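By way of illustration only, a minimal NumPy sketch of an alternating update of this kind is given below. It assumes the objective form written out above (a Frobenius data term, a soft-thresholded sparse term E, and an orthogonality constraint on D handled by a Procrustes-style projection); the closed-form D and θ updates used here are standard choices for that surrogate objective rather than the exact update formulas of the embodiment, and all function and parameter names are illustrative.

```python
import numpy as np

def shrink(x, eps):
    """Soft-thresholding: S_eps(x) = sign(x) * max(|x| - eps, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)

def sparse_foreground(X, k=3, alpha=0.01, beta=0.1, n_iter=50, tol=1e-6):
    """Alternating estimation of a background subspace (D, theta) and a sparse foreground E.

    X : (n_pixels, n_frames) matrix whose columns are the grayed, vectorised frames of one clip.
    Returns E of the same shape; column t is the sparse foreground of frame t.
    Assumed surrogate objective:
        0.5*||X - D @ theta - E||_F**2 + alpha*||theta||_F**2 + beta*||E||_1,
        subject to D.T @ D = I_k.
    """
    n, m = X.shape
    rng = np.random.default_rng(0)
    D, _ = np.linalg.qr(rng.standard_normal((n, k)))      # random orthonormal initialisation
    theta = D.T @ X
    E = np.zeros_like(X)
    for _ in range(n_iter):
        E_old = E
        G = X - E                                         # part of X to be explained by the subspace
        U, _, Vt = np.linalg.svd(G @ theta.T, full_matrices=False)
        D = U @ Vt                                        # D-step: orthogonal Procrustes projection
        theta = (D.T @ G) / (1.0 + alpha)                 # theta-step: ridge-regularised coefficients
        E = shrink(X - D @ theta, beta)                   # E-step: soft-threshold the residual
        if np.linalg.norm(E - E_old) <= tol * max(np.linalg.norm(E_old), 1.0):
            break
    return E
```

Each column of the returned E is reshaped back to the frame's height and width, which corresponds to the column-splitting and recovery step described above.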
S2, computing the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of the sparse foreground map E^(t).
The video frame I^(t) and the sparse foreground map E^(t) are respectively input into the ResNet feature extraction network, and each layer of the ResNet feature extraction network outputs the feature map F^(t) and the sparse foreground prior feature map of that layer.
The ResNet feature extraction network consists of one 7×7 convolution layer, one max pooling layer and 16 residual blocks, where each residual block combines a 1×1 convolution layer, a 3×3 convolution layer, a 1×1 convolution layer, batch normalization layers and activation function layers. The 16 residual blocks are divided into 5 stages, and the output of each stage is taken as a feature of the input image at a different semantic level.
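For reference, one way to take five stage outputs from a standard torchvision ResNet-50, used here only as a stand-in for the feature extraction network described above, is sketched below; how the single-channel sparse foreground map is fed into the 3-channel stem is not specified in this text, and replicating its channel is one simple assumption.

```python
import torch
import torchvision

def resnet_stage_features(x, backbone=None):
    """Return the outputs of the 5 stages of a ResNet-50 backbone for an input batch x of shape (B, 3, H, W)."""
    if backbone is None:
        backbone = torchvision.models.resnet50()              # stock torchvision model as a stand-in
    backbone.eval()
    with torch.no_grad():
        c1 = backbone.relu(backbone.bn1(backbone.conv1(x)))   # stem output, stride 2
        c2 = backbone.layer1(backbone.maxpool(c1))            # stride 4
        c3 = backbone.layer2(c2)                              # stride 8
        c4 = backbone.layer3(c3)                              # stride 16
        c5 = backbone.layer4(c4)                              # stride 32
    return [c1, c2, c3, c4, c5]

# usage: features of a video frame and of its sparse foreground map (channel replicated to 3)
frame = torch.rand(1, 3, 224, 224)
foreground = torch.rand(1, 1, 224, 224).repeat(1, 3, 1, 1)
frame_feats = resnet_stage_features(frame)
foreground_feats = resnet_stage_features(foreground)
```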
S3, computing the semantic enhancement feature of the video frame I^(t) and the foreground semantic enhancement feature.
Through the feature pyramid structure, each layer feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature are respectively combined with the features obtained by up-sampling the higher layers, yielding the semantic enhancement feature and the foreground semantic enhancement feature with rich semantic information.
In the process of obtaining the feature map F^(t) and the sparse foreground prior feature map of the video frame I^(t) and the sparse foreground map E^(t) through the ResNet feature extraction network, 5 features of different scales are extracted from the intermediate layers of the ResNet network, and the feature pyramid is composed of these 5 features of different scales. The bottom of the feature pyramid is a high-resolution feature map and the top feature map is a low-resolution feature map; the higher the level, the smaller the feature map and the lower the resolution.
The low-resolution, high-semantic features of the higher pyramid levels, which carry abstract information, are nearest-neighbour up-sampled and added to the lower-level features; after a 3×3 convolution kernel, the features with rich semantic information and the foreground prior features are output.
S4, computing the foreground prior fusion feature map of the video frame I^(t).
The semantic enhancement feature of the video frame I^(t) and the corresponding foreground semantic enhancement feature are cascaded, and a 1×1 convolution operation yields the foreground prior fusion feature map.
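An illustrative sketch of the cascade-and-1×1-convolution fusion of step S4 follows, assuming both semantic enhancement features share the same channel width (256 here is only an example):

```python
import torch
import torch.nn as nn

class ForegroundPriorFusion(nn.Module):
    """Concatenates the frame's semantic enhancement feature with the foreground semantic
    enhancement feature and fuses them with a 1x1 convolution (step S4)."""

    def __init__(self, channels=256):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, frame_feat, foreground_feat):
        # cascade (channel-wise concatenation), then 1x1 convolution fusion
        return self.fuse(torch.cat([frame_feat, foreground_feat], dim=1))

# usage, per pyramid level:
fusion = ForegroundPriorFusion(channels=256)
fused = fusion(torch.rand(1, 256, 56, 56), torch.rand(1, 256, 56, 56))
```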
S5, generating anchor frames on the foreground prior fusion feature map of the video frame I^(t).
On each pixel of each layer of the foreground prior fusion feature map, a basic anchor frame of size 16×16 is set; keeping the area unchanged, aspect ratios of 0.5, 1 and 2 are used, and the anchor frames of the three aspect ratios are then enlarged by scales of 8, 16 and 32 respectively, so that a total of 9 anchor frames are generated for each pixel on each layer of the foreground prior fusion feature map.
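A sketch of the anchor generation of step S5 is given below: a 16×16 base anchor, aspect ratios 0.5, 1 and 2 at constant area, and scale factors 8, 16 and 32, giving 9 anchors per pixel. Placing the anchor centres at pixel positions multiplied by the feature stride is an assumed convention not spelled out above.

```python
import numpy as np

def base_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Nine anchors (x1, y1, x2, y2) centred at the origin: 3 aspect ratios x 3 scales."""
    area = float(base_size * base_size)
    anchors = []
    for r in ratios:
        w = np.sqrt(area / r)            # keep the area constant while the ratio h/w = r
        h = w * r
        for s in scales:
            ws, hs = w * s, h * s
            anchors.append([-ws / 2.0, -hs / 2.0, ws / 2.0, hs / 2.0])
    return np.array(anchors)

def anchors_for_level(feat_h, feat_w, stride):
    """Tile the 9 base anchors over every pixel of one layer of the fusion feature map."""
    base = base_anchors()
    xs = (np.arange(feat_w) + 0.5) * stride            # assumed anchor centres in image coordinates
    ys = (np.arange(feat_h) + 0.5) * stride
    shift_x, shift_y = np.meshgrid(xs, ys)
    shifts = np.stack([shift_x, shift_y, shift_x, shift_y], axis=-1).reshape(-1, 1, 4)
    return (shifts + base.reshape(1, -1, 4)).reshape(-1, 4)

# e.g. a 56x56 level with stride 4 yields 56*56*9 = 28224 anchors:
print(anchors_for_level(56, 56, 4).shape)
```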
S6, inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor frames into a trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t).
S601, training classification and regression sub-network:
s6011, randomly initializing weight parameters of classification and regression networks;
s6012, calculating the probability that the candidate areas belong to each category by using the initialized classification network for each candidate area, and calculating the position coordinates of the candidate areas by using the initialized regression network;
S6013, constructing the target detection loss function L:

L = Σ_i [ −(1 − p_i^z)^γ · log(p_i^z) + ω · smooth_L1(a_i, a_i^*) ],

where z is the true label of the i-th candidate region, p_i^z is the probability that the i-th candidate region belongs to the class-z object, γ is the focusing parameter, −(1 − p_i^z)^γ·log(p_i^z) is the focal loss for object classification, a_i is the position coordinate vector of the i-th candidate region, a_i^* is the coordinate vector of the real target frame corresponding to the i-th candidate region, smooth_L1(a_i, a_i^*) is the Smooth L1 regression loss of the target frame, and ω is the balance weight;
s6014, updating learning classification and regression network parameters through back propagation iteration by utilizing a target detection loss function L until the network converges, and obtaining a trained classification and regression sub-network;
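By way of illustration, the loss written out in step S6013 (a focal classification term with focusing parameter γ plus an ω-weighted Smooth L1 regression term) can be sketched in PyTorch as follows; the default values gamma=2.0 and omega=1.0, the class count, and the sum reduction over candidate regions are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_prob, labels, box_pred, box_target, gamma=2.0, omega=1.0):
    """Focal classification loss plus an omega-weighted Smooth L1 regression loss (step S6013).

    cls_prob   : (N, num_classes) predicted class probabilities of N candidate regions
    labels     : (N,) true class index z of each candidate region
    box_pred   : (N, 4) predicted position coordinates a_i
    box_target : (N, 4) coordinates a_i* of the matched real target frames
    """
    # focal loss per candidate region: -(1 - p_i^z)^gamma * log(p_i^z)
    p_z = cls_prob.gather(1, labels.unsqueeze(1)).squeeze(1).clamp(min=1e-8)
    focal = (-(1.0 - p_z) ** gamma * torch.log(p_z)).sum()

    # Smooth L1 regression loss of the target frames
    reg = F.smooth_l1_loss(box_pred, box_target, reduction="sum")

    return focal + omega * reg

# usage with random values (31 = 30 illustrative classes + background):
probs = torch.softmax(torch.randn(8, 31), dim=1)
loss = detection_loss(probs, torch.randint(0, 31, (8,)), torch.randn(8, 4), torch.randn(8, 4))
```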
S602, inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor frames into the trained classification and regression network to obtain the target category and target frame location of the video frame I^(t).
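For completeness, a minimal sketch of the classification and regression sub-networks applied to one layer of the foreground prior fusion feature map is given below; the depth and width of these heads and the number of classes are not specified above, so a single 3×3 convolution followed by an output convolution is assumed for each head, and the class count is illustrative.

```python
import torch
import torch.nn as nn

class ClsRegHeads(nn.Module):
    """Predicts class logits and 4 box-regression offsets for every anchor at every pixel."""

    def __init__(self, channels=256, num_anchors=9, num_classes=30):
        super().__init__()
        self.num_classes = num_classes
        self.cls_head = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, num_anchors * num_classes, 3, padding=1),
        )
        self.reg_head = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, num_anchors * 4, 3, padding=1),
        )

    def forward(self, fused_feat):
        b = fused_feat.shape[0]
        # reorder to one row per anchor: (B, H*W*num_anchors, num_classes) and (B, H*W*num_anchors, 4)
        cls = self.cls_head(fused_feat).permute(0, 2, 3, 1).reshape(b, -1, self.num_classes)
        reg = self.reg_head(fused_feat).permute(0, 2, 3, 1).reshape(b, -1, 4)
        return cls, reg

# usage on one 56x56 fusion level, which carries 56*56*9 anchors:
heads = ClsRegHeads()
scores, offsets = heads(torch.rand(1, 256, 56, 56))
print(scores.shape, offsets.shape)   # torch.Size([1, 28224, 30]) torch.Size([1, 28224, 4])
```

Class probabilities follow from a softmax (or per-class sigmoid) over the returned logits, and the offsets are applied to the anchors generated in step S5 to obtain the final target frame locations.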
In yet another embodiment of the present invention, a terminal device is provided, including a processor and a memory, the memory being used to store a computer program comprising program instructions, and the processor being used to execute the program instructions stored in the computer storage medium. The processor may be a central processing unit (CPU), or another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.; it is the computational and control core of the terminal, adapted to implement one or more instructions, and in particular adapted to load and execute one or more instructions to realise the corresponding method flow or function. The processor according to the embodiment of the invention can be used for video target detection based on a sparse foreground prior, including: dividing a video V into m video clips C_i, i=1,2,…,m, and for each video clip C_i obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the clip by a foreground extraction algorithm based on orthogonal subspace learning; inputting the video frame I^(t) and the sparse foreground map E^(t) respectively into a ResNet feature extraction network, each layer of which outputs the feature map of the corresponding layer, i.e. computing the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of the sparse foreground map E^(t); through the feature pyramid structure, combining each layer feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature with the features obtained by up-sampling the higher layers, and computing the semantic enhancement feature of the video frame I^(t) and the foreground semantic enhancement feature; fusing the semantic enhancement feature of the video frame I^(t) with the corresponding foreground semantic enhancement feature to obtain the foreground prior fusion feature map of the video frame I^(t); generating anchor frames on the foreground prior fusion feature map of the video frame I^(t); and inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor frames into a trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t), thereby completing target detection.
In a further embodiment of the present invention, a storage medium is also provided, in particular a computer readable storage medium (memory), which is a memory device in a terminal device for storing programs and data. It will be appreciated that the computer readable storage medium here may include both a built-in storage medium in the terminal device and an extended storage medium supported by the terminal device. The computer-readable storage medium provides a storage space storing the operating system of the terminal. One or more instructions, which may be one or more computer programs (including program code), are also stored in the storage space and are adapted to be loaded and executed by the processor. The computer readable storage medium here may be a high-speed RAM memory or a non-volatile memory, such as at least one magnetic disk memory.
One or more instructions stored in the computer-readable storage medium may be loaded and executed by a processor to implement the corresponding steps of the video target detection method based on a sparse foreground prior in the above embodiments; one or more instructions in the computer-readable storage medium are loaded by the processor and perform the following steps: dividing a video V into m video clips C_i, i=1,2,…,m, and for each video clip C_i obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the clip by a foreground extraction algorithm based on orthogonal subspace learning; inputting the video frame I^(t) and the sparse foreground map E^(t) respectively into a ResNet feature extraction network, each layer of which outputs the feature map of the corresponding layer, i.e. computing the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of the sparse foreground map E^(t); through the feature pyramid structure, combining each layer feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature with the features obtained by up-sampling the higher layers, and computing the semantic enhancement feature of the video frame I^(t) and the foreground semantic enhancement feature; fusing the semantic enhancement feature of the video frame I^(t) with the corresponding foreground semantic enhancement feature to obtain the foreground prior fusion feature map of the video frame I^(t); generating anchor frames on the foreground prior fusion feature map of the video frame I^(t); and inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor frames into a trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t), thereby completing target detection.
The effect of the invention can be further illustrated by the following simulations:
1. simulation conditions
Using a workstation equipped with an RTX 2080TI graphics card, the software framework was PyTorch.
Selecting a video sequence with a large scale difference as a first group of detected video sequences, wherein the target is a ship, as shown in fig. 2;
selecting a video sequence with a large gesture difference as a second group of detected video sequences, wherein the target is a dog, as shown in fig. 3;
A video sequence with object occlusion is selected as the third group of detected video sequences, as shown in fig. 4, where the targets are an elephant and an automobile.
2. Emulation content
Simulation 1, the method of the invention is used for detecting video targets of a first group of detected video sequences, and the detection results of two frames are obtained, as shown in fig. 2.
Simulation 2, the method of the invention is used for detecting video targets of a second group of detected video sequences, and the detection results of two frames are obtained, as shown in fig. 3.
Simulation 3, the method of the invention is used for detecting video targets of a third group of detected video sequences, and the detection results of two frames are shown in fig. 4.
3. Simulation result analysis
Fig. 2 (a) shows the detection result of one frame of the video sequence whose target is a ship, and fig. 2 (b) shows the detection result of another frame of the same sequence; it can be seen that, when the sizes of the targets differ greatly, the invention can accurately detect the categories and positions of targets of different sizes in the video. Fig. 3 (a) shows the detection result of one frame of the video sequence whose target is a dog, and fig. 3 (b) shows the detection result of another frame of the same sequence; the invention can accurately detect the category and position of the target under blurred pictures and large pose differences. Fig. 4 (a) shows the detection result of one frame of the video sequence whose targets are an elephant and an automobile, and fig. 4 (b) shows the detection result of another frame of the same sequence; the invention can accurately detect the categories and positions of occluded targets when different kinds of targets occlude each other, especially the left elephant in fig. 4 (b), which is almost completely occluded.
In summary, the video target detection method based on a sparse foreground prior can effectively detect the categories and positions of targets in video sequences containing targets of different scales, motion blur and occlusion.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above is only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited by this, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. The video target detection method based on a sparse foreground prior is characterized by comprising the following steps:
S1, dividing a video V into m video clips C_i, i=1,2,…,m, and for each video clip C_i obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the clip by a foreground extraction algorithm based on orthogonal subspace learning;
S2, inputting the video frame I^(t) and the sparse foreground map E^(t) respectively into a ResNet feature extraction network, each layer of which outputs the feature map of the corresponding layer, i.e. computing the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of the sparse foreground map E^(t);
S3, through the feature pyramid structure, combining each layer feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature with the features obtained by up-sampling the higher layers, and computing the semantic enhancement feature of the video frame I^(t) and the foreground semantic enhancement feature;
S4, fusing the semantic enhancement feature of the video frame I^(t) with the corresponding foreground semantic enhancement feature to obtain the foreground prior fusion feature map of the video frame I^(t);
S5, generating anchor frames on the foreground prior fusion feature map of the video frame I^(t);
S6, inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor frames into a trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t), thereby completing target detection.
2. The method for detecting video objects based on sparse foreground prior of claim 1, wherein in step S1, each frame image I^(t) in the video clip C_i is converted to grayscale and reshaped into a column vector, the column vectors are combined into a two-dimensional matrix X, the sparse foreground prior E of all frames in the video clip C_i is obtained by solving the objective function, E is split by columns, and the sparse foreground map E^(t) of each frame I^(t) is recovered by reshaping, the objective function being:

min_{D,θ,E} (1/2)·||X − Dθ − E||_F^2 + α·||θ||_F^2 + β·||E||_{row,1},  s.t. D^T D = I_k,

wherein D is an orthogonal subspace, θ is the orthogonal subspace coefficient, α and β are regularization parameters, ||·||_{row,1} denotes the row-wise 1-norm of a matrix, and I_k is the identity matrix of order k.
3. The video object detection method based on sparse foreground prior of claim 2, wherein an alternating direction method is used to solve the objective function; D and θ are solved by a block coordinate descent method, a residual term is defined, and D and θ are solved and updated using the residual term; the solved and updated D and θ are used to update E through the shrinkage function S_ε(x) = sign(x) ⊙ max(|x| − ε, 0), where ⊙ denotes element-by-element multiplication and sign(·) is the sign function; the updates are iterated until the convergence condition and the maximum number of iterations are reached, yielding the sparse foreground prior E of all frames in the video clip C_i.
4. The method for detecting a video object based on sparse foreground prior of claim 1, wherein in step S3, in the process of obtaining the feature map F^(t) and the sparse foreground prior feature map of the video frame I^(t) and the sparse foreground map E^(t) through the ResNet feature extraction network, 5 features of different scales are extracted from the intermediate layers of the ResNet feature extraction network; the 5 features of different scales form a feature pyramid, the bottom of the feature pyramid is a high-resolution feature map, and the top feature map is a low-resolution feature map; the strong semantic features of the higher layer of the feature pyramid are nearest-neighbour up-sampled and added to the features of the lower layer, and after a 3×3 convolution kernel, the features with semantic information and the foreground prior features are output.
5. The method for detecting a video object based on sparse foreground prior of claim 1, wherein in step S4, the semantic enhancement feature of the video frame I^(t) and the corresponding foreground semantic enhancement feature are cascaded, and a 1×1 convolution operation yields the foreground prior fusion feature map.
6. The method for detecting a video object based on sparse foreground prior as claimed in claim 1, wherein in step S5, on each pixel of each layer of the foreground prior fusion feature map, a basic anchor frame of size 16×16 is set; keeping the area unchanged, aspect ratios of 0.5, 1 and 2 are used, and the anchor frames of the three aspect ratios are then enlarged by scales of 8, 16 and 32 respectively, so that a total of 9 anchor frames are generated for each pixel on each layer of the foreground prior fusion feature map.
7. The method for detecting a video object based on sparse foreground prior of claim 1, wherein in step S6, training the classification and regression sub-network specifically comprises:
s6011, randomly initializing weight parameters of classification and regression networks;
s6012, calculating the probability that the candidate areas belong to each category by using the initialized classification network for each candidate area, and calculating the position coordinates of the candidate areas by using the initialized regression network;
s6013, constructing a target detection loss function L;
s6014, updating the learning classification and regression network parameters through back propagation iteration by utilizing the target detection loss function L until the network converges, and obtaining the trained classification and regression sub-network.
8. The sparse foreground prior-based video object detection method of claim 7, wherein in step S6013, the loss function L is:

L = Σ_i [ −(1 − p_i^z)^γ · log(p_i^z) + ω · smooth_L1(a_i, a_i^*) ],

where z is the true label of the i-th candidate region, p_i^z is the probability that the i-th candidate region belongs to the class-z object, γ is the focusing parameter, −(1 − p_i^z)^γ·log(p_i^z) is the focal loss for object classification, a_i is the position coordinate vector of the i-th candidate region, a_i^* is the coordinate vector of the real target frame corresponding to the i-th candidate region, smooth_L1(a_i, a_i^*) is the Smooth L1 regression loss of the target frame, and ω is the balance weight.
9. A computer readable storage medium storing one or more programs, wherein the one or more programs comprise instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-8.
10. A computing device, comprising:
one or more processors, memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any of the methods of claims 1-8.
CN202011357082.7A 2020-11-26 2020-11-26 Video target detection method, storage medium and device based on sparse foreground priori Active CN112434618B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011357082.7A CN112434618B (en) 2020-11-26 2020-11-26 Video target detection method, storage medium and device based on sparse foreground priori

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011357082.7A CN112434618B (en) 2020-11-26 2020-11-26 Video target detection method, storage medium and device based on sparse foreground priori

Publications (2)

Publication Number Publication Date
CN112434618A CN112434618A (en) 2021-03-02
CN112434618B true CN112434618B (en) 2023-06-23

Family

ID=74699279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011357082.7A Active CN112434618B (en) 2020-11-26 2020-11-26 Video target detection method, storage medium and device based on sparse foreground priori

Country Status (1)

Country Link
CN (1) CN112434618B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966697B (en) * 2021-03-17 2022-03-11 西安电子科技大学广州研究院 Target detection method, device and equipment based on scene semantics and storage medium
CN112861830B (en) * 2021-04-13 2023-08-25 北京百度网讯科技有限公司 Feature extraction method, device, apparatus, storage medium, and program product
CN113505737A (en) * 2021-07-26 2021-10-15 浙江大华技术股份有限公司 Foreground image determination method and apparatus, storage medium, and electronic apparatus
CN113743249B (en) * 2021-08-16 2024-03-26 北京佳服信息科技有限公司 Method, device and equipment for identifying violations and readable storage medium
CN116630334B (en) * 2023-04-23 2023-12-08 中国科学院自动化研究所 Method, device, equipment and medium for real-time automatic segmentation of multi-segment blood vessel

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680106A (en) * 2017-10-13 2018-02-09 南京航空航天大学 A kind of conspicuousness object detection method based on Faster R CNN
CN108898145A (en) * 2018-06-15 2018-11-27 西南交通大学 A kind of image well-marked target detection method of combination deep learning
CN109447018A (en) * 2018-11-08 2019-03-08 天津理工大学 A kind of road environment visual perception method based on improvement Faster R-CNN
CN111310609A (en) * 2020-01-22 2020-06-19 西安电子科技大学 Video target detection method based on time sequence information and local feature similarity
CN111523439A (en) * 2020-04-21 2020-08-11 苏州浪潮智能科技有限公司 Method, system, device and medium for target detection based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680106A (en) * 2017-10-13 2018-02-09 南京航空航天大学 A kind of conspicuousness object detection method based on Faster R CNN
CN108898145A (en) * 2018-06-15 2018-11-27 西南交通大学 A kind of image well-marked target detection method of combination deep learning
CN109447018A (en) * 2018-11-08 2019-03-08 天津理工大学 A kind of road environment visual perception method based on improvement Faster R-CNN
CN111310609A (en) * 2020-01-22 2020-06-19 西安电子科技大学 Video target detection method based on time sequence information and local feature similarity
CN111523439A (en) * 2020-04-21 2020-08-11 苏州浪潮智能科技有限公司 Method, system, device and medium for target detection based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xianfeng Ou et al., "Moving Object Detection Method via ResNet-18 With Encoder–Decoder Structure in Complex Scenes," IEEE Access, vol. 7, 2019, pp. 108152-108160 *
Zhao Yongqiang et al., "A Survey of Deep Learning Object Detection Methods," Journal of Image and Graphics (中国图象图形学报), vol. 25, no. 4, April 2020, pp. 629-653 *

Also Published As

Publication number Publication date
CN112434618A (en) 2021-03-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant