CN112434618A - Video target detection method based on sparse foreground prior, storage medium and equipment - Google Patents

Video target detection method based on sparse foreground prior, storage medium and equipment

Info

Publication number
CN112434618A
Authority
CN
China
Prior art keywords
foreground
video
sparse
prior
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011357082.7A
Other languages
Chinese (zh)
Other versions
CN112434618B (en)
Inventor
古晶
巨小杰
马文萍
孙新凯
刘芳
杨淑媛
焦李成
冯婕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202011357082.7A priority Critical patent/CN112434618B/en
Publication of CN112434618A publication Critical patent/CN112434618A/en
Application granted granted Critical
Publication of CN112434618B publication Critical patent/CN112434618B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video target detection method, storage medium and device based on sparse foreground priors. A foreground extraction method based on orthogonal subspace learning is used to compute a sparse foreground prior map for each frame of a video. A ResNet feature extraction network and a feature pyramid structure are used to obtain semantic-enhanced feature maps of each video frame and of its sparse foreground map. The semantic-enhanced feature map of the sparse foreground prior map is concatenated with that of the current frame, and a convolutional fusion operation yields the foreground prior fusion features of the current frame. Candidate anchor boxes are generated at each pixel of the foreground prior fusion feature map, and the fusion features together with all anchor boxes are fed into a trained classification and regression sub-network to obtain the category and position coordinates of the target objects. The method fully exploits the sparse foreground prior of video data and improves target detection accuracy.

Description

Video target detection method based on sparse foreground prior, storage medium and equipment
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a video target detection method based on sparse foreground prior, a storage medium and equipment.
Background
Computer vision is an important area of artificial intelligence in which computers are trained to perceive and understand the visual world. With images, videos and deep learning models, objects of interest can be accurately classified and recognized for further processing. Computer vision is generally divided into major tasks such as image recognition, target detection and instance segmentation. A classification task gives a content description of the whole picture, whereas a detection task focuses on specific objects of interest and requires both a recognition result and a localization result for each of them. Compared with classification, detection requires an understanding of both foreground and background: the objects of interest must be separated from the background and their categories and locations determined.
Target detection is a popular direction in computer vision research and is widely applied in robot navigation, video surveillance, industrial inspection, face recognition and other fields. Image target detection has made great progress in recent years, and detection performance has improved significantly. However, in fields such as video surveillance and vehicle-assisted driving, there is a broad demand for video-based target detection, and directly applying image detection techniques to the video detection task raises new challenges. First, running an image target detection network on every frame of a video incurs a huge computational cost. Second, conventional image target detection methods cannot effectively exploit the temporal continuity of video data or the sparse foreground prior, and have difficulty mining the temporal characteristics of video data.
A video is composed of images, so video target detection and image target detection are closely related. To improve video detection accuracy, existing approaches first detect each frame with an image target detector and then post-process the results using the temporal characteristics of the video. To exploit the temporal continuity and redundancy of video data, some recent methods adopt optical flow, attention mechanisms, sequence models and the like to mine the temporal characteristics of video.
Disclosure of Invention
In view of the defects in the prior art, the technical problem to be solved by the present invention is to provide a video target detection method, a storage medium and a device based on sparse foreground prior, so as to improve the detection performance of video target detection.
The invention adopts the following technical scheme:
the video target detection method based on sparse foreground prior comprises the following steps:
S1, dividing the video V into m video segments C_i, i = 1, 2, …, m, and for each video segment C_i, obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the segment by a foreground extraction algorithm based on orthogonal subspace learning;
S2, inputting the video frame I^(t) and its sparse foreground map E^(t) into a ResNet feature extraction network, each layer of which outputs the feature map F^(t) of the corresponding layer and the corresponding sparse foreground prior feature map, thereby computing the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of its sparse foreground map E^(t);
S3, combining, through a feature pyramid structure, each layer's feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature with the upsampled features of the higher layer to compute the semantic-enhanced features and the foreground semantic-enhanced features of the video frame I^(t);
S4, fusing the semantic-enhanced features of the video frame I^(t) with the corresponding foreground semantic-enhanced features to obtain the foreground prior fusion feature map of the video frame I^(t);
S5, generating anchor boxes on the foreground prior fusion feature map of the video frame I^(t);
S6, inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor boxes into the trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t), completing target detection.
Specifically, in step S1, each frame image I^(t) of the video segment C_i is converted to grayscale and reshaped into a column vector; the column vectors are combined into a two-dimensional matrix X, and the sparse foreground prior E of all frames in the video segment C_i is obtained by optimizing the following objective function; E is then split by columns, and each column is reshaped back into the sparse foreground map E^(t) of the corresponding frame I^(t). The objective function is:

min_{D,θ,E} (1/2)‖X − Dθ − E‖_F² + α‖θ‖_{row,1} + β‖E‖_1,  s.t.  DᵀD = I_k

where D is the orthogonal subspace, θ is the orthogonal subspace coefficient, α and β are adjusting parameters, ‖·‖_{row,1} denotes the row-wise ℓ1 norm of a matrix, and I_k is the identity matrix of order k.
Further, the objective function is solved by an alternating direction method: D and θ are solved by a block coordinate descent method, with a residual term defined and used to solve for and update D and θ; the solved D and θ are then used to update E through the shrinkage (soft-thresholding) function S_λ[x] = sign(x) ⊙ max(|x| − λ, 0), where ⊙ denotes element-by-element multiplication and sign(·) is the sign function; the updates are iterated until a convergence condition or the maximum number of iterations is reached, yielding the sparse foreground prior E of all frames in the video segment C_i.
Specifically, in step S3, in the process of obtaining the feature map F^(t) and the sparse foreground prior feature map of the video frame I^(t) and its sparse foreground map E^(t) through the ResNet feature extraction network, 5 features of different scales are extracted from intermediate layers of the ResNet feature extraction network, each scale being a multiple of the lowest-layer feature scale. The 5 features of different scales form a feature pyramid, whose bottom is a high-resolution feature map and whose top is a low-resolution feature map. The strong semantic features of the higher pyramid layers are upsampled by nearest-neighbor interpolation, added to the lower-layer features, and passed through a 3×3 convolution kernel, outputting the semantic-enhanced features and the foreground prior features.
Specifically, in step S4, the semantic-enhanced features of the video frame I^(t) and the corresponding foreground semantic-enhanced features are concatenated, and the foreground prior fusion feature map is obtained through a 1×1 convolution operation.
Specifically, in step S5, a base anchor box of size 16×16 is set at each pixel of every layer of the foreground prior fusion feature map; with the area kept unchanged, aspect ratios of 0.5, 1 and 2 are applied, and the anchor boxes of different aspect ratios are each enlarged by scale factors of 8, 16 and 32, so that a total of 9 anchor boxes are generated at each pixel of every layer of the foreground prior fusion feature map.
Specifically, in step S6, training the classification and regression sub-networks specifically comprises:
S6011, randomly initializing the weight parameters of the classification and regression networks;
s6012, for each candidate region, calculating the probability that the candidate region belongs to each category by using the initialized classification network, and calculating the position coordinates of the candidate region by using the initialized regression network;
s6013, constructing a target detection loss function L;
S6014, iteratively updating the classification and regression network parameters by back propagation using the target detection loss function L until the network converges, obtaining the trained classification and regression sub-networks.
Further, in step S6013, the loss function L is:

L = Σ_i [ −(1 − p_i^z)^γ log(p_i^z) + ω · SmoothL1(a_i, a_i*) ]

where z is the true label of the i-th candidate region, p_i^z is the probability that the i-th candidate region belongs to a class-z object, γ is the focusing parameter, −(1 − p_i^z)^γ log(p_i^z) is the focal loss for target classification, a_i is the position coordinates of the i-th candidate region, a_i* is the coordinate vector of the real target box corresponding to the i-th candidate region, SmoothL1(a_i, a_i*) is the Smooth L1 regression loss of the target box, and ω is the balance weight.
Another aspect of the invention is a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods described.
Another aspect of the present invention is a computing device, including:
one or more processors, memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods.
Compared with the prior art, the invention has at least the following beneficial effects:
On the basis of an image target detection method, the sparse prior of the foreground and the spatio-temporal continuity prior of video data are used to extract a motion foreground prior map and obtain a foreground semantic-enhanced feature map, which is concatenated with the semantic-enhanced features of the current frame to obtain the foreground prior fusion features of the current frame. After this foreground prior feature fusion, video frames with motion blur, object occlusion and large size changes can still be detected, and detection accuracy is improved. The method makes full use of the relationship between the features of adjacent frames and does not require further processing of the detection results after each frame is detected; compared with existing video target detection methods based on post-processing of image detection results, the detection speed is improved.
Furthermore, the foreground extraction algorithm based on orthogonal subspace learning obtains the moving foreground objects of interest: all video frames in a video segment are treated as a whole, the foreground maps of all frames are obtained by the orthogonal subspace learning algorithm, and the sparse foreground prior of the video data is thus better utilized.
Further, an alternating direction method is used to solve the objective function, in which the unconstrained parts are optimized by a block coordinate descent method; a large global optimization problem is thereby decomposed into several easily solved sub-problems, and solving these sub-problems yields the solution of the global optimization problem.
Furthermore, the features extracted from the ResNet network are constructed into a feature pyramid, multi-scale features of the video frame and the foreground prior image are obtained through the feature pyramid structure, wherein low-resolution high-level features with rich semantic information in the feature pyramid are used for enhancing the low-level features, and therefore the semantic information of the obtained semantic enhanced features is richer.
Furthermore, the semantic enhancement features of the foreground image and the semantic enhancement features of the current video frame are subjected to cascade convolution fusion to obtain a feature image with enhanced foreground priori, foreground sparse priori information is added in the detection process of the foreground target on the video frame, the feature information of the foreground target is enhanced, and the detection performance is further enhanced.
Further, anchor boxes are generated on the feature map, each anchor box is classified, and the anchor boxes judged to be positive samples are then regressed to obtain accurate target positions. Generating anchor boxes on the feature map limits the number of candidate regions to a controllable range and greatly reduces the amount of computation.
Furthermore, training of the video data is completed by constructing a classification sub-network and a regression sub-network, wherein the classification sub-network can obtain a fine target classification result, and the regression sub-network can further correct a positioning result of a target, so that the finally obtained recognition results and positions of different targets in the video frame are more accurate.
Furthermore, the loss function L is set mainly to solve the problem of imbalance of the proportion of positive and negative samples in the one-stage target detection task. The loss function reduces the proportion of the large number of redundant negative samples in the training process.
In summary, the present invention fully utilizes the sparse prior of the foreground and the relationship between the adjacent frame features to solve the problems of motion blur, object occlusion, large size change, etc. existing in the video data, so that the present invention can effectively detect the targets with different scales and blurs in the video data, and improve the detection accuracy.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a diagram illustrating an effect of video object detection according to the present invention, wherein (a) is a detection result of one frame in a video sequence with a ship object, and (b) is a detection result of another frame in the video sequence with the ship object;
FIG. 3 is a diagram illustrating a second effect of detecting a video object according to the present invention, wherein (a) is a detection result of one frame in a video sequence targeted to a dog, and (b) is a detection result of another frame in the video sequence targeted to the dog;
fig. 4 is a diagram illustrating a third effect of video object detection according to the present invention, wherein (a) is the detection result of one frame in a video sequence with an elephant and a car as targets, and (b) is the detection result of another frame in the same video sequence.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a video target detection method based on sparse foreground priors. A foreground extraction method based on orthogonal subspace learning is used to obtain a sparse motion foreground prior map for each frame of the video; a ResNet feature extraction network and a feature pyramid structure are then used to extract multi-scale semantic-enhanced features of the video frame and its sparse foreground map; the foreground semantic-enhanced features are concatenated and fused with the semantic-enhanced features of the current frame to obtain foreground prior fusion features; anchor boxes are generated at each pixel of the foreground prior fusion feature map; the category and position coordinates of all targets are then obtained through a classification and regression network. The sparse foreground prior of the video data is fully mined and target detection accuracy is improved.
Referring to fig. 1, the video target detection method based on sparse foreground prior of the present invention includes two parts, namely training and testing, wherein in the training process, network parameters are updated by calculating a loss function of a network model and then using back propagation; in the testing process, trained network parameters are used, the semantic enhancement features and the foreground semantic enhancement features of the current frame are fused to obtain foreground prior fusion features of the video frame, and then the category and the position of an interested target in the video frame are obtained based on the foreground prior fusion features; the method comprises the following specific steps:
S1, dividing the video V into m video segments C_i, i = 1, 2, …, m, and for each video segment C_i, obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the segment by a foreground extraction algorithm based on orthogonal subspace learning.
Each frame image I^(t) in the video segment C_i is converted to grayscale and reshaped into a column vector; the column vectors are combined into a two-dimensional matrix X, and the corresponding foreground prior E of all frames is calculated from an objective function.
The objective function is:

min_{D,θ,E} (1/2)‖X − Dθ − E‖_F² + α‖θ‖_{row,1} + β‖E‖_1,  s.t.  DᵀD = I_k

where D is the orthogonal subspace, θ is the orthogonal subspace coefficient, α and β are adjusting parameters, ‖·‖_{row,1} denotes the row-wise ℓ1 norm of a matrix, and I_k is the identity matrix of order k.
In a specific implementation, the objective function may be solved by an inexact alternating direction method, repeatedly performing the following steps:

S101, solving D and θ by a block coordinate descent method: a residual term is defined from the current estimates and used to solve for and update D and θ.

S102, using the solved D and θ to update E through the shrinkage (soft-thresholding) function S_λ[x] = sign(x) ⊙ max(|x| − λ, 0), where ⊙ denotes element-by-element multiplication and sign(·) is the sign function.

The updates are iterated until a convergence condition is reached, i.e. the maximum number of iterations is reached, yielding the sparse foreground prior E of all frames in the video segment C_i; E is then split by columns, and each column is reshaped back into the sparse foreground map E^(t) of the corresponding frame I^(t).
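As an illustration of the alternating scheme in steps S101 and S102, the following is a minimal NumPy sketch under stated assumptions: the D/θ update is a generic least-squares fit with QR re-orthogonalization rather than the patent's exact block coordinate descent rules, and the threshold beta and subspace dimension k are placeholder values, not values from the patent.

```python
import numpy as np

def soft_threshold(x, lam):
    # Shrinkage: S_lam[x] = sign(x) * max(|x| - lam, 0), applied element-wise
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_foreground_prior(X, k=3, beta=0.05, max_iter=50, tol=1e-4):
    """Alternating updates for X ~= D @ theta + E with an orthogonal D (D^T D = I_k).

    X: (pixels, frames) matrix of vectorized grayscale frames of one video segment.
    Returns the sparse foreground prior E (same shape as X).
    The D/theta updates below are a stand-in least-squares step followed by QR
    re-orthogonalization, not the patent's exact block coordinate descent rules.
    """
    d, n = X.shape
    rng = np.random.default_rng(0)
    D, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random orthogonal subspace
    theta = D.T @ X
    E = np.zeros_like(X, dtype=float)
    for _ in range(max_iter):
        E_old = E.copy()
        R = X - E                                      # residual without the foreground
        theta = D.T @ R                                # coefficient update
        D, _ = np.linalg.qr(R @ theta.T)               # subspace update, keeps D^T D = I_k
        theta = D.T @ R
        E = soft_threshold(X - D @ theta, beta)        # sparse foreground via shrinkage
        if np.linalg.norm(E - E_old) <= tol * (np.linalg.norm(E_old) + 1e-12):
            break
    return E
```

Splitting the returned E by columns and reshaping each column to the frame size then gives the per-frame sparse foreground maps E^(t).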
S2, calculating the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of its sparse foreground map E^(t).

The video frame I^(t) and its sparse foreground map E^(t) are each input into a ResNet feature extraction network, and each layer of the network outputs the feature map F^(t) of that layer and the corresponding sparse foreground prior feature map.

The ResNet feature extraction network consists of one 7×7 convolutional layer, one max-pooling layer and 16 residual blocks, where each residual block combines a 1×1 convolutional layer, a 3×3 convolutional layer, a 1×1 convolutional layer, batch normalization layers and activation function layers. The residual blocks are divided into 5 stages, and the output of each stage serves as a feature of the input image at a different semantic level.
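As a sketch of step S2, the snippet below uses torchvision's ResNet-50 and its feature-extraction utility to tap multi-scale stage outputs for both the video frame and its foreground map. The choice of ResNet-50, the tapped stage names, and replicating the single-channel foreground map to three channels are assumptions for illustration; the patent only specifies a ResNet backbone with 16 residual blocks.

```python
import torch
import torchvision
from torchvision.models.feature_extraction import create_feature_extractor

# ResNet-50 backbone; tap the outputs of its residual stages as multi-scale features.
backbone = torchvision.models.resnet50(weights=None)
body = create_feature_extractor(
    backbone, return_nodes={"layer1": "c2", "layer2": "c3",
                            "layer3": "c4", "layer4": "c5"})

def extract_features(frame, foreground):
    """frame: (1, 3, H, W) RGB tensor; foreground: (1, 1, H, W) sparse foreground map."""
    fg3 = foreground.repeat(1, 3, 1, 1)   # replicate the single channel to 3 (assumption)
    frame_feats = body(frame)             # multi-scale feature maps F^(t)
    fg_feats = body(fg3)                  # sparse foreground prior feature maps
    return frame_feats, fg_feats

frame, fg = torch.randn(1, 3, 512, 512), torch.rand(1, 1, 512, 512)
f, e = extract_features(frame, fg)
print({name: tuple(t.shape) for name, t in f.items()})
```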
S3, calculating the semantic-enhanced features and the foreground semantic-enhanced features of the video frame I^(t).

Through the feature pyramid structure, each layer's feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature are combined with the upsampled features of the higher layer to obtain semantic-enhanced features with rich semantic information and the foreground semantic-enhanced features.
In the process of obtaining the feature map F^(t) and the sparse foreground prior feature map of the video frame I^(t) and its sparse foreground map E^(t) through the ResNet feature extraction network, 5 features of different scales are extracted from intermediate layers of ResNet, each scale being a multiple of the lowest-layer feature scale, and the 5 features of different scales form a feature pyramid. The bottom of the feature pyramid is a high-resolution feature map, while the top is a low-resolution feature map: the higher the level, the smaller the feature map and the lower the resolution.

The low-resolution, high-semantic features of the higher pyramid layers, which carry abstract information, are upsampled by nearest-neighbor interpolation, added to the lower-layer features, and passed through a 3×3 convolution kernel, outputting the semantically rich semantic-enhanced features and the foreground prior features.
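The top-down merge of step S3 can be sketched for one pyramid level as below; the 256 output channels and the 1×1 lateral convolutions used to align channel counts are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownMerge(nn.Module):
    """One FPN-style merge step: upsample the higher level, add, then 3x3 conv."""
    def __init__(self, low_channels, high_channels, out_channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(low_channels, out_channels, kernel_size=1)
        self.reduce_high = nn.Conv2d(high_channels, out_channels, kernel_size=1)
        self.smooth = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, low_feat, high_feat):
        high = self.reduce_high(high_feat)
        # Nearest-neighbor upsampling of the higher (coarser) level
        high_up = F.interpolate(high, size=low_feat.shape[-2:], mode="nearest")
        merged = self.lateral(low_feat) + high_up
        return self.smooth(merged)   # semantic-enhanced feature of this level

# Example: merge a 1/16-scale level into a 1/8-scale level
merge = TopDownMerge(low_channels=512, high_channels=1024)
p = merge(torch.randn(1, 512, 64, 64), torch.randn(1, 1024, 32, 32))
print(p.shape)  # torch.Size([1, 256, 64, 64])
```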
S4, calculating the foreground prior fusion feature map of the video frame I^(t).

The semantic-enhanced features of the video frame I^(t) and the corresponding foreground semantic-enhanced features are concatenated, and the foreground prior fusion feature map is obtained through a 1×1 convolution operation.
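A minimal sketch of the concatenation and 1×1 convolution fusion of step S4, assuming 256-channel inputs per branch (the channel count is an assumption):

```python
import torch
import torch.nn as nn

class ForegroundPriorFusion(nn.Module):
    """Concatenate frame features with foreground features, then fuse by 1x1 conv."""
    def __init__(self, channels=256):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, frame_feat, fg_feat):
        x = torch.cat([frame_feat, fg_feat], dim=1)   # cascade along channels
        return self.fuse(x)                            # foreground prior fusion feature map

fusion = ForegroundPriorFusion()
out = fusion(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))
print(out.shape)  # torch.Size([1, 256, 64, 64])
```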
S5, generating anchor boxes on the foreground prior fusion feature map of the video frame I^(t).

A base anchor box of size 16×16 is set at each pixel of every layer of the foreground prior fusion feature map; with the area kept unchanged, aspect ratios of 0.5, 1 and 2 are applied, and the three anchor boxes of different aspect ratios are each enlarged by scale factors of 8, 16 and 32, so that a total of 9 anchor boxes are generated at each pixel of every layer of the foreground prior fusion feature map.
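The anchor generation of step S5 (base size 16×16, aspect ratios 0.5/1/2, scale factors 8/16/32, 9 anchors per pixel) can be sketched as follows; the (cx, cy, w, h) output layout, the per-level stride handling, and the convention that the ratio means h/w are assumptions for illustration.

```python
import itertools
import torch

def generate_anchors(feat_h, feat_w, stride, base_size=16,
                     ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Return (feat_h * feat_w * 9, 4) anchors as (cx, cy, w, h) in image pixels."""
    shapes = []
    for r, s in itertools.product(ratios, scales):
        area = (base_size * s) ** 2          # enlarge the base anchor, keep area per ratio
        w = (area / r) ** 0.5
        h = w * r                            # ratio r = h / w (assumption)
        shapes.append((w, h))
    shapes = torch.tensor(shapes)            # (9, 2)

    ys = (torch.arange(feat_h) + 0.5) * stride
    xs = (torch.arange(feat_w) + 0.5) * stride
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")
    centers = torch.stack([cx, cy], dim=-1).reshape(-1, 1, 2)    # (HW, 1, 2)
    wh = shapes.view(1, -1, 2).expand(centers.shape[0], -1, -1)  # (HW, 9, 2)
    anchors = torch.cat([centers.expand_as(wh), wh], dim=-1)
    return anchors.reshape(-1, 4)

print(generate_anchors(4, 4, stride=8).shape)  # torch.Size([144, 4])
```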
S6, inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor boxes into the trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t).
S601, training classification and regression sub-network:
s6011, randomly initializing classification and weight parameters of a regression network;
s6012, for each candidate region, calculating the probability that the candidate region belongs to each category by using the initialized classification network, and calculating the position coordinates of the candidate region by using the initialized regression network;
S6013, constructing the target detection loss function L:

L = Σ_i [ −(1 − p_i^z)^γ log(p_i^z) + ω · SmoothL1(a_i, a_i*) ]

where z is the true label of the i-th candidate region, p_i^z is the probability that the i-th candidate region belongs to a class-z object, γ is the focusing parameter, −(1 − p_i^z)^γ log(p_i^z) is the focal loss for target classification, a_i is the position coordinates of the i-th candidate region, a_i* is the coordinate vector of the real target box corresponding to the i-th candidate region, SmoothL1(a_i, a_i*) is the Smooth L1 regression loss of the target box, and ω is the balance weight (a code sketch of a loss of this form is given after step S602 below);
S6014, iteratively updating the classification and regression network parameters by back propagation using the target detection loss function L until the network converges, obtaining the trained classification and regression sub-networks;

S602, inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor boxes into the trained classification and regression network to obtain the object categories and object box positions in the video frame I^(t).
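As referenced in step S6013, the following is a sketch of a loss of this form, combining a focal classification term with an ω-weighted Smooth L1 regression term over candidate regions; the softmax probabilities, the normalization by the number of positive candidates, and the default values γ = 2, ω = 1 are assumptions, not values specified by the patent.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, labels, box_preds, box_targets,
                   gamma=2.0, omega=1.0):
    """cls_logits: (N, C) class scores; labels: (N,) class indices (0 = background);
    box_preds / box_targets: (N, 4) coordinates, regression applied to positives only."""
    # Focal loss: -(1 - p_z)^gamma * log(p_z) for the true class z of each candidate
    probs = cls_logits.softmax(dim=-1)
    p_z = probs.gather(1, labels.unsqueeze(1)).squeeze(1).clamp(min=1e-6)
    focal = -((1.0 - p_z) ** gamma) * p_z.log()

    # Smooth L1 regression loss on the box coordinates of positive candidates
    pos = labels > 0
    if pos.any():
        reg = F.smooth_l1_loss(box_preds[pos], box_targets[pos], reduction="sum")
    else:
        reg = box_preds.sum() * 0.0
    return (focal.sum() + omega * reg) / max(pos.sum().item(), 1)

logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
boxes, targets = torch.randn(8, 4), torch.randn(8, 4)
print(detection_loss(logits, labels, boxes, targets))
```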
In yet another embodiment of the present invention, a terminal device is provided that includes a processor and a memory for storing a computer program comprising program instructions, the processor being configured to execute the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.; it is the computing and control core of the terminal and is adapted to load and execute one or more instructions to implement the corresponding method flow or function. The processor of this embodiment of the invention can be used for the operations of video target detection based on sparse foreground prior, comprising: dividing a video V into m video segments C_i, i = 1, 2, …, m, and for each video segment C_i, obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the segment by a foreground extraction algorithm based on orthogonal subspace learning; inputting the video frame I^(t) and its sparse foreground map E^(t) into a ResNet feature extraction network, each layer of which outputs the feature map F^(t) of the corresponding layer and the corresponding sparse foreground prior feature map, thereby computing the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of its sparse foreground map E^(t); combining, through a feature pyramid structure, each layer's feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature with the upsampled features of the higher layer to compute the semantic-enhanced features and the foreground semantic-enhanced features of the video frame I^(t); fusing the semantic-enhanced features of the video frame I^(t) with the corresponding foreground semantic-enhanced features to obtain the foreground prior fusion feature map of the video frame I^(t); generating anchor boxes on the foreground prior fusion feature map of the video frame I^(t); and inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor boxes into the trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t), completing target detection.
In still another embodiment, the present invention further provides a storage medium, specifically a computer-readable storage medium (memory), which is a memory device in a terminal device used to store programs and data. It is understood that the computer-readable storage medium here may include a built-in storage medium of the terminal device and may also include an extended storage medium supported by the terminal device. The computer-readable storage medium provides storage space storing the operating system of the terminal. Also, one or more instructions, which may be one or more computer programs (including program code), are stored in the storage space and are adapted to be loaded and executed by the processor. It should be noted that the computer-readable storage medium may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory.
The processor can load and execute one or more instructions stored in the computer-readable storage medium to implement the corresponding steps of the sparse-foreground-prior-based video target detection method of the above embodiments; the one or more instructions in the computer-readable storage medium are loaded by the processor and perform the following steps: dividing a video V into m video segments C_i, i = 1, 2, …, m, and for each video segment C_i, obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the segment by a foreground extraction algorithm based on orthogonal subspace learning; inputting the video frame I^(t) and its sparse foreground map E^(t) into a ResNet feature extraction network, each layer of which outputs the feature map F^(t) of the corresponding layer and the corresponding sparse foreground prior feature map, thereby computing the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of its sparse foreground map E^(t); combining, through a feature pyramid structure, each layer's feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature with the upsampled features of the higher layer to compute the semantic-enhanced features and the foreground semantic-enhanced features of the video frame I^(t); fusing the semantic-enhanced features of the video frame I^(t) with the corresponding foreground semantic-enhanced features to obtain the foreground prior fusion feature map of the video frame I^(t); generating anchor boxes on the foreground prior fusion feature map of the video frame I^(t); and inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor boxes into the trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t), completing target detection.
The effects of the present invention can be further illustrated by the following simulations:
1. Simulation conditions
A workstation with an RTX 2080 Ti graphics card was used, and the software framework was PyTorch.
A video sequence with ships as targets and large scale differences is selected as the first group of detected video sequences, as shown in fig. 2;
a video sequence with dogs as targets and large pose differences is selected as the second group of detected video sequences, as shown in fig. 3;
a video sequence containing two kinds of targets, elephants and cars, with object occlusion is selected as the third group of detected video sequences, as shown in fig. 4.
2. Simulation content
Simulation 1, performing video target detection on a first group of detected video sequences by using the method of the present invention to obtain detection results of two frames, as shown in fig. 2.
Simulation 2, performing video target detection on the second group of detected video sequences by using the method of the present invention to obtain detection results of two frames, as shown in fig. 3.
Simulation 3, performing video target detection on the third group of detected video sequences by using the method of the present invention to obtain the detection results of two frames, as shown in fig. 4.
3. Analysis of simulation results
Fig. 2(a) and fig. 2(b) are detection results of two frames of the video sequence with ships as targets; it can be seen that the invention accurately detects the categories and positions of targets of different sizes in the video even when the target sizes differ greatly. Fig. 3(a) and fig. 3(b) are detection results of two frames of the video sequence with dogs as targets; it can be seen that the invention accurately detects the category and position of the target even with blurred frames and large pose differences. Fig. 4(a) and fig. 4(b) are detection results of two frames of the video sequence containing elephant and car targets; it can be seen that the invention accurately detects the categories and positions of occluded targets of different classes, in particular the left elephant in fig. 4(b), which is almost completely occluded.
In summary, the video target detection method based on sparse foreground priors can effectively detect the categories and positions of targets of different scales, as well as targets in video sequences with motion blur and occlusion.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical solution according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. A video target detection method based on sparse foreground prior, characterized by comprising the following steps:

S1, dividing the video V into m video segments C_i, i = 1, 2, …, m, and for each video segment C_i, obtaining the sparse foreground map E^(t) of the t-th video frame I^(t) in the segment by a foreground extraction algorithm based on orthogonal subspace learning;

S2, inputting the video frame I^(t) and its sparse foreground map E^(t) into a ResNet feature extraction network, each layer of which outputs the feature map F^(t) of the corresponding layer and the corresponding sparse foreground prior feature map, thereby computing the feature map F^(t) of the video frame I^(t) and the sparse foreground prior feature map of its sparse foreground map E^(t);

S3, combining, through a feature pyramid structure, each layer's feature F^(t) of the video frame I^(t) and the corresponding sparse foreground prior feature with the upsampled features of the higher layer to compute the semantic-enhanced features and the foreground semantic-enhanced features of the video frame I^(t);

S4, fusing the semantic-enhanced features of the video frame I^(t) with the corresponding foreground semantic-enhanced features to obtain the foreground prior fusion feature map of the video frame I^(t);

S5, generating anchor boxes on the foreground prior fusion feature map of the video frame I^(t);

S6, inputting the foreground prior fusion feature map of the video frame I^(t) and all anchor boxes into the trained classification and regression network to obtain the classification and localization results of all targets in the video frame I^(t), completing target detection.
2. The video target detection method based on sparse foreground prior according to claim 1, characterized in that in step S1, each frame image I^(t) in the video segment C_i is converted to grayscale and reshaped into a column vector; the column vectors are combined into a two-dimensional matrix X, and the sparse foreground prior E of all frames in the video segment C_i is obtained by optimizing the following objective function; E is then split by columns, and each column is reshaped back into the sparse foreground map E^(t) of the corresponding frame I^(t); the objective function is:

min_{D,θ,E} (1/2)‖X − Dθ − E‖_F² + α‖θ‖_{row,1} + β‖E‖_1,  s.t.  DᵀD = I_k

where D is the orthogonal subspace, θ is the orthogonal subspace coefficient, α and β are adjusting parameters, ‖·‖_{row,1} denotes the row-wise ℓ1 norm of a matrix, and I_k is the identity matrix of order k.
3. The video target detection method based on sparse foreground prior according to claim 2, characterized in that the objective function is solved by an alternating direction method: D and θ are solved by a block coordinate descent method, with a residual term defined and used to solve for and update D and θ; the solved D and θ are then used to update E through the shrinkage (soft-thresholding) function S_λ[x] = sign(x) ⊙ max(|x| − λ, 0), where ⊙ denotes element-by-element multiplication and sign(·) is the sign function; the updates are iterated until a convergence condition or the maximum number of iterations is reached, yielding the sparse foreground prior E of all frames in the video segment C_i.
4. The video target detection method based on sparse foreground prior according to claim 1, characterized in that in step S3, in the process of obtaining the feature map F^(t) and the sparse foreground prior feature map of the video frame I^(t) and its sparse foreground map E^(t) through the ResNet feature extraction network, 5 features of different scales are extracted from intermediate layers of the ResNet feature extraction network, each scale being a multiple of the lowest-layer feature scale; the 5 features of different scales form a feature pyramid, whose bottom is a high-resolution feature map and whose top is a low-resolution feature map; the strong semantic features of the higher pyramid layers are upsampled by nearest-neighbor interpolation, added to the lower-layer features, and passed through a 3×3 convolution kernel, outputting the semantic-enhanced features and the foreground prior features.
5. The video target detection method based on sparse foreground prior according to claim 1, characterized in that in step S4, the semantic-enhanced features of the video frame I^(t) and the corresponding foreground semantic-enhanced features are concatenated, and the foreground prior fusion feature map is obtained through a 1×1 convolution operation.
6. The video target detection method based on sparse foreground prior according to claim 1, characterized in that in step S5, a base anchor box of size 16×16 is set at each pixel of every layer of the foreground prior fusion feature map; with the area kept unchanged, aspect ratios of 0.5, 1 and 2 are applied, and the anchor boxes of different aspect ratios are each enlarged by scale factors of 8, 16 and 32, so that a total of 9 anchor boxes are generated at each pixel of every layer of the foreground prior fusion feature map.
7. The video target detection method based on sparse foreground prior according to claim 1, characterized in that in step S6, training the classification and regression sub-networks specifically comprises:
S6011, randomly initializing the weight parameters of the classification and regression networks;
S6012, for each candidate region, calculating the probability that the candidate region belongs to each category using the initialized classification network, and calculating the position coordinates of the candidate region using the initialized regression network;
S6013, constructing a target detection loss function L;
S6014, iteratively updating the classification and regression network parameters by back propagation using the target detection loss function L until the network converges, obtaining the trained classification and regression sub-networks.
8. The video target detection method based on sparse foreground prior according to claim 7, characterized in that in step S6013, the loss function L is:

L = Σ_i [ −(1 − p_i^z)^γ log(p_i^z) + ω · SmoothL1(a_i, a_i*) ]

where z is the true label of the i-th candidate region, p_i^z is the probability that the i-th candidate region belongs to a class-z object, γ is the focusing parameter, −(1 − p_i^z)^γ log(p_i^z) is the focal loss for target classification, a_i is the position coordinates of the i-th candidate region, a_i* is the coordinate vector of the real target box corresponding to the i-th candidate region, SmoothL1(a_i, a_i*) is the Smooth L1 regression loss of the target box, and ω is the balance weight.
9. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-8.
10. A computing device, comprising:
one or more processors, memory, and one or more programs stored in the memory and configured for execution by the one or more processors, the one or more programs including instructions for performing any of the methods of claims 1-8.
CN202011357082.7A 2020-11-26 2020-11-26 Video target detection method, storage medium and device based on sparse foreground priori Active CN112434618B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011357082.7A CN112434618B (en) 2020-11-26 2020-11-26 Video target detection method, storage medium and device based on sparse foreground priori

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011357082.7A CN112434618B (en) 2020-11-26 2020-11-26 Video target detection method, storage medium and device based on sparse foreground priori

Publications (2)

Publication Number Publication Date
CN112434618A true CN112434618A (en) 2021-03-02
CN112434618B CN112434618B (en) 2023-06-23

Family

ID=74699279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011357082.7A Active CN112434618B (en) 2020-11-26 2020-11-26 Video target detection method, storage medium and device based on sparse foreground priori

Country Status (1)

Country Link
CN (1) CN112434618B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861830A (en) * 2021-04-13 2021-05-28 北京百度网讯科技有限公司 Feature extraction method, device, apparatus, storage medium, and program product
CN112966697A (en) * 2021-03-17 2021-06-15 西安电子科技大学广州研究院 Target detection method, device and equipment based on scene semantics and storage medium
CN113505737A (en) * 2021-07-26 2021-10-15 浙江大华技术股份有限公司 Foreground image determination method and apparatus, storage medium, and electronic apparatus
CN113743249A (en) * 2021-08-16 2021-12-03 北京佳服信息科技有限公司 Violation identification method, device and equipment and readable storage medium
CN114708531A (en) * 2022-03-18 2022-07-05 南京大学 Method and device for detecting abnormal behavior in elevator and storage medium
CN116630334A (en) * 2023-04-23 2023-08-22 中国科学院自动化研究所 Method, device, equipment and medium for real-time automatic segmentation of multi-segment blood vessel

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680106A (en) * 2017-10-13 2018-02-09 南京航空航天大学 A kind of conspicuousness object detection method based on Faster R CNN
CN108898145A (en) * 2018-06-15 2018-11-27 西南交通大学 A kind of image well-marked target detection method of combination deep learning
CN109447018A (en) * 2018-11-08 2019-03-08 天津理工大学 A kind of road environment visual perception method based on improvement Faster R-CNN
CN111310609A (en) * 2020-01-22 2020-06-19 西安电子科技大学 Video target detection method based on time sequence information and local feature similarity
CN111523439A (en) * 2020-04-21 2020-08-11 苏州浪潮智能科技有限公司 Method, system, device and medium for target detection based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680106A (en) * 2017-10-13 2018-02-09 南京航空航天大学 A kind of conspicuousness object detection method based on Faster R CNN
CN108898145A (en) * 2018-06-15 2018-11-27 西南交通大学 A kind of image well-marked target detection method of combination deep learning
CN109447018A (en) * 2018-11-08 2019-03-08 天津理工大学 A kind of road environment visual perception method based on improvement Faster R-CNN
CN111310609A (en) * 2020-01-22 2020-06-19 西安电子科技大学 Video target detection method based on time sequence information and local feature similarity
CN111523439A (en) * 2020-04-21 2020-08-11 苏州浪潮智能科技有限公司 Method, system, device and medium for target detection based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIANFENG OU et al.: "Moving Object Detection Method via ResNet-18 With Encoder–Decoder Structure in Complex Scenes", IEEE ACCESS *
赵永强 et al.: "A Survey of Deep Learning Object Detection Methods" (深度学习目标检测方法综述), Journal of Image and Graphics (中国图象图形学报) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966697A (en) * 2021-03-17 2021-06-15 西安电子科技大学广州研究院 Target detection method, device and equipment based on scene semantics and storage medium
CN112861830A (en) * 2021-04-13 2021-05-28 北京百度网讯科技有限公司 Feature extraction method, device, apparatus, storage medium, and program product
CN112861830B (en) * 2021-04-13 2023-08-25 北京百度网讯科技有限公司 Feature extraction method, device, apparatus, storage medium, and program product
CN113505737A (en) * 2021-07-26 2021-10-15 浙江大华技术股份有限公司 Foreground image determination method and apparatus, storage medium, and electronic apparatus
CN113743249A (en) * 2021-08-16 2021-12-03 北京佳服信息科技有限公司 Violation identification method, device and equipment and readable storage medium
CN113743249B (en) * 2021-08-16 2024-03-26 北京佳服信息科技有限公司 Method, device and equipment for identifying violations and readable storage medium
CN114708531A (en) * 2022-03-18 2022-07-05 南京大学 Method and device for detecting abnormal behavior in elevator and storage medium
CN116630334A (en) * 2023-04-23 2023-08-22 中国科学院自动化研究所 Method, device, equipment and medium for real-time automatic segmentation of multi-segment blood vessel
CN116630334B (en) * 2023-04-23 2023-12-08 中国科学院自动化研究所 Method, device, equipment and medium for real-time automatic segmentation of multi-segment blood vessel

Also Published As

Publication number Publication date
CN112434618B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN112434618B (en) Video target detection method, storage medium and device based on sparse foreground priori
CN108647585B (en) Traffic identifier detection method based on multi-scale circulation attention network
Xie et al. Multilevel cloud detection in remote sensing images based on deep learning
CN110929577A (en) Improved target identification method based on YOLOv3 lightweight framework
CN110910391B (en) Video object segmentation method for dual-module neural network structure
US20180114071A1 (en) Method for analysing media content
CN114202672A (en) Small target detection method based on attention mechanism
CN110782420A (en) Small target feature representation enhancement method based on deep learning
Fu et al. Camera-based basketball scoring detection using convolutional neural network
JP2012511756A (en) Apparatus having a data stream pipeline architecture for recognizing and locating objects in an image by detection window scanning
CN111274981B (en) Target detection network construction method and device and target detection method
CN112966659B (en) Video image small target detection method based on deep learning
CN111767962A (en) One-stage target detection method, system and device based on generation countermeasure network
US11809523B2 (en) Annotating unlabeled images using convolutional neural networks
CN112801047B (en) Defect detection method and device, electronic equipment and readable storage medium
CN112800955A (en) Remote sensing image rotating target detection method and system based on weighted bidirectional feature pyramid
CN112766170B (en) Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image
JP2024513596A (en) Image processing method and apparatus and computer readable storage medium
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN111179272B (en) Rapid semantic segmentation method for road scene
CN113516053A (en) Ship target refined detection method with rotation invariance
Zhu et al. Spatial hierarchy perception and hard samples metric learning for high-resolution remote sensing image object detection
Yildirim et al. Ship detection in optical remote sensing images using YOLOv4 and Tiny YOLOv4
CN113963333A (en) Traffic sign board detection method based on improved YOLOF model
Yang et al. Real-Time object detector based MobileNetV3 for UAV applications

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant