CN111310609A - Video target detection method based on time sequence information and local feature similarity - Google Patents

Video target detection method based on time sequence information and local feature similarity

Info

Publication number
CN111310609A
Authority
CN
China
Prior art keywords
frame
target
feature
layer
hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010075005.6A
Other languages
Chinese (zh)
Other versions
CN111310609B (en)
Inventor
古晶
刘芳
赵柏宇
焦李成
卞月林
巨小杰
张向荣
陈璞花
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202010075005.6A
Publication of CN111310609A
Application granted
Publication of CN111310609B
Legal status: Active
Anticipated expiration

Classifications

    • G06V 20/40: Scenes; scene-specific elements in video content
    • G06F 18/2415: Classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045: Combinations of networks
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 2201/07: Target detection (indexing scheme relating to image or video recognition or understanding)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video target detection method based on time sequence information and local feature similarity, which mainly addresses the low accuracy and mismatched feature positions of video target detection in the prior art. The implementation scheme is as follows: a feature map of each frame of the video is extracted with a ResNet network; the similarity between feature maps is computed with the local feature hash similarity measure, and the hash similarity score represents the change of the feature at the current position; the feature maps of the adjacent frames are weighted and added to the features of the current frame to obtain the corrected features of the current frame; candidate target frames are obtained from the corrected features with a region candidate network based on sparse classification; region-of-interest pooling then yields features of uniform size, which are input into the trained classification and regression networks to obtain the detection result. The invention improves the detection accuracy and reduces the computational complexity.

Description

Video target detection method based on time sequence information and local feature similarity
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a video target detection method which can be used for target identification and positioning in a video.
Background
Computer vision is an important field of artificial intelligence: it is the science of enabling computers and software systems to recognize and understand images and scenes, and it comprises branch fields such as image recognition, target detection, image generation, and image super-resolution reconstruction. Visual understanding has three main levels: classification, detection, and segmentation. The classification task is concerned with the picture as a whole and gives a description of its overall content, whereas detection is concerned with a specific object target and must provide the recognition result and the localization result of that target at the same time. In contrast to classification, detection requires an understanding of the foreground and background of a picture: the object of interest must be separated from the background, and its identity and location must be determined.
Target detection is an important research subject in the field of computer vision: it is the key to video analysis technologies such as moving target tracking, target recognition, and behavior understanding, and its quality directly influences subsequent work. Image target detection has advanced greatly in recent years, with markedly improved detection performance, and video-based target detection is in ever wider demand in fields such as video surveillance and driver assistance. However, applying image detection techniques directly to video detection raises new challenges. First, applying a deep network to every video frame incurs a huge computational cost; second, directly detecting video frames affected by motion blur, video defocus, or rare poses with an image detection technique yields low accuracy.
To improve video detection accuracy, most earlier methods focus on post-processing: after each frame is detected by an image target detector, the detection results are further processed using the temporal characteristics specific to video, as in the tubelet-based convolutional neural network T-CNN and the sequence non-maximum suppression method Seq-NMS. However, such post-processing undoubtedly increases the computation required for detection, reduces the detection speed, and cannot meet real-time requirements.
Disclosure of Invention
The present invention is directed to providing a video target detection method based on time sequence information and local feature similarity, so as to improve the detection speed and meet the real-time requirement.
The technical scheme of the invention is realized as follows:
the technical idea of the invention is to fully utilize the time sequence information of a video sequence and mine the change of target characteristics in adjacent frame images, and the scheme is as follows: firstly, extracting a feature map of each frame of a video by using a ResNet network; then, the characteristics of the current frame are corrected by utilizing the time sequence information of the adjacent preorder frames in a self-adaptive mode; obtaining a candidate target frame of the correction characteristics through a regional candidate network based on sparse classification; then pooling the interested region to obtain the characteristics with uniform size, and then obtaining the final detection result through classification and regression network, wherein the specific implementation steps comprise the following steps:
1. The video target detection method based on the time sequence information and the local feature similarity is characterized by comprising the following steps:
(1) for the t-th video frame I(t) in the video V and its preceding k frames I(t-k), ..., I(t-1), passing them through a ResNet network to obtain the feature map F(t) of I(t) and the feature maps F(t-k), ..., F(t-1) of I(t-k), ..., I(t-1);
(2) calculating the local feature hash similarity scores s(t,t-k), ..., s(t,t-1) between F(t) and F(t-k), ..., F(t-1);
(3) computing the corrected feature map F'(t) of the video frame I(t) based on the time sequence information:
(3.1) applying a softmax operation to the local feature hash similarity scores s(t,t-k), ..., s(t,t-1) at each spatial position to obtain the weights α(t-k), ..., α(t-1) corresponding to the feature maps F(t-k), ..., F(t-1);
(3.2) forming the weighted sum of the feature maps F(t-k), ..., F(t-1) with their corresponding weights α(t-k), ..., α(t-1) at each spatial position and adding it to F(t) to obtain the corrected feature map F'(t) of the video frame I(t);
(4) using the corrected feature map F'(t) of the video frame I(t) to select candidate target regions of I(t):
(4.1) passing the corrected feature map F'(t) of frame I(t) through 3×3 and 1×1 convolution kernels in sequence to obtain the intermediate-layer feature map F"(t) of frame I(t);
(4.2) generating 9 anchor frames of different scales at each position of the feature map, namely first setting a base anchor frame of size 16×16, keeping its area unchanged while setting its aspect ratio to (0.5, 1, 2), and then enlarging each of the three anchor frames of different aspect ratios by the scales (8, 16, 32) to obtain 9 anchor frames in total;
(4.3) training the parameters of the softmax layer and the target frame regression layer to obtain the trained softmax layer and target frame regression layer;
(4.4) for each anchor frame on the intermediate-layer feature map F"(t) of frame I(t), judging whether it contains a target using the trained softmax layer:
if it contains a target, fine-tuning the coordinates of the anchor frame with the trained target frame regression layer to obtain a plurality of candidate target regions of frame I(t), and performing (5);
if it does not contain a target, discarding the anchor frame;
(5) on the corrected feature map F'(t) of the video frame I(t), applying region-of-interest pooling to each candidate target region to extract candidate region features of uniform size;
(6) obtaining the target category and target frame position of the video frame from the features of each candidate region:
(6.1) training the classification and regression networks to obtain the trained classification and regression networks;
(6.2) inputting the features of each candidate region of the video frame I(t) into the trained classification and regression networks to obtain the target category and target frame position of the video frame I(t), respectively.
Further, the local feature hash similarity scores s(t,t-k), ..., s(t,t-1) between F(t) and F(t-k), ..., F(t-1) in (2) are calculated as follows:
(2.1) calculating the local feature hash similarity score between the t-th frame feature map F(t) and the (t-k)-th frame feature map F(t-k):
(2.1a) for the feature map F(t) of the t-th frame I(t), taking the eight-neighborhood at any position (i, j) to form a neighborhood feature block B(t)(i,j) centered at (i, j), and averaging B(t)(i,j) to obtain the feature mean m(t)(i,j) at position (i, j);
(2.1b) for the feature map F(t-k) of the (t-k)-th frame I(t-k), taking the eight-neighborhood at the same position (i, j) to form a neighborhood feature block B(t-k)(i,j) centered at (i, j), and averaging B(t-k)(i,j) to obtain the feature mean m(t-k)(i,j) at position (i, j);
(2.1c) comparing each value in the neighborhood feature block B(t)(i,j) of the t-th frame I(t) with its mean m(t)(i,j), setting the hash value of every entry of B(t)(i,j) that is greater than or equal to the mean m(t)(i,j) to 1 and the hash value of every entry that is less than the mean to 0, and thereby obtaining the hash representation H(t)(i,j) of B(t)(i,j) consisting of 0s and 1s;
(2.1d) comparing each value in the neighborhood feature block B(t-k)(i,j) of the (t-k)-th frame I(t-k) with its mean m(t-k)(i,j), setting the hash value of every entry that is greater than or equal to the mean to 1 and the hash value of every entry that is less than the mean to 0, and thereby obtaining the hash representation H(t-k)(i,j) of B(t-k)(i,j) consisting of 0s and 1s;
(2.1e) calculating the Hamming distance d(t,t-k)(i,j) between the hash representation H(t)(i,j) of B(t)(i,j) and the hash representation H(t-k)(i,j) of B(t-k)(i,j);
(2.1f) subtracting the Hamming distance d(t,t-k)(i,j) from the number of values contained in the neighborhood feature block to obtain the hash similarity score s(t,t-k)(i,j) of the t-th frame feature map and the (t-k)-th frame feature map at position (i, j);
(2.1g) repeating (2.1a)-(2.1f) to calculate the hash similarity scores of the t-th frame feature map F(t) and the (t-k)-th frame feature map F(t-k) at all positions, and combining them according to their spatial positions to obtain the local feature hash similarity score s(t,t-k) of the t-th frame feature map and the (t-k)-th frame feature map;
(2.2) repeating (2.1) to calculate the local feature hash similarity scores s(t,t-k+1), ..., s(t,t-1) between F(t) and F(t-k+1), ..., F(t-1), thereby obtaining the local feature hash similarity scores s(t,t-k), ..., s(t,t-1) between the t-th video frame and its preceding k frames.
Compared with the prior art, the invention has the following advantages:
1) Building on the two-stage image target detection framework, the invention exploits the relation between adjacent frames through the time sequence information: over a video sequence formed by several frames, the features of the adjacent frames are weighted and added to the features of the current frame to adaptively obtain the corrected features of the current frame. With the corrected features, video frames affected by motion blur, video defocus, and rare poses can be detected, which improves the detection accuracy.
2) When correcting the features with the time sequence information, the method computes the feature similarity with the local feature hash similarity measure and represents the change of the feature at the current position with the hash similarity score. This solves the feature-position mismatch caused by the position change of a moving target in the video and, compared with common similarity measures, reduces the computational complexity and improves the operating efficiency.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a sub-flow diagram of computing the local feature hash similarity score in the present invention;
FIG. 3 is a sub-flow diagram of the calculation of a revised feature in the present invention;
fig. 4 and 5 are diagrams illustrating the effect of video object detection using the present invention.
Detailed Description
The embodiments and effects of the present invention will be described in further detail below with reference to the accompanying drawings.
The implementation of the method mainly comprises two parts, training and testing. The training process updates the model parameters by computing the model loss function and performing back propagation; the testing process fixes the parameters, first computes the corrected features of the video frame using the time sequence information, and then obtains the target category and target frame position of the video frame from the corrected features.
Referring to fig. 1, the implementation steps of this example are as follows:
step 1, calculating a characteristic diagram of a t frame video frame and a preamble frame thereof.
For the t frame video frame I in the video V(t)With its first k frames I(t-k),...,I(t-1)Through ResNet network, obtain I(t)Characteristic diagram F of(t)And I(t-k),...,I(t-1)Characteristic diagram F of(t-k),...,F(t-1)
The ResNet network is a feature extraction network consisting of 1 7 × 7 convolutional layer, 1 3 × 3 maximum pooling layer and 16 residual blocks, wherein each residual block is formed by combining 1 × 1 convolutional layer, 1 3 × 3 convolutional layer, 1 × 1 convolutional layer and identity mapping.
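For illustration only, the following is a minimal PyTorch sketch of this per-frame feature extraction. It assumes the torchvision ResNet-50 backbone (whose 7×7 stem, 3×3 max pooling layer, and 16 bottleneck residual blocks match the structure described above), drops the global pooling and fully connected layers so that a spatial feature map is returned for each frame, and uses an arbitrary input size; these choices are assumptions of the sketch, not limitations of the method.

```python
import torch
import torchvision

# Sketch: per-frame feature extraction with a ResNet-50 backbone
# (7x7 conv stem, 3x3 max pooling, 16 bottleneck residual blocks).
backbone = torchvision.models.resnet50(weights=None)
# Drop the global average pooling and fully connected layers so the
# network outputs a spatial feature map F for each input frame I.
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

# Current frame I(t) together with its k = 4 preceding frames I(t-4), ..., I(t-1).
frames = torch.randn(5, 3, 224, 224)
with torch.no_grad():
    feats = feature_extractor(frames)   # (5, 2048, 7, 7): F(t-4), ..., F(t)
```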
Step 2, calculate the local feature hash similarity scores between the t-th video frame and its preceding k frames.
2.1) Calculate the local feature hash similarity score between the t-th frame feature map F(t) and the (t-k)-th frame feature map F(t-k):
referring to fig. 2, the specific implementation of this step is as follows:
2.1a) For the feature map F(t) of the t-th frame I(t), take the eight-neighborhood at any position (i, j) to form a neighborhood feature block B(t)(i,j) centered at (i, j), and average B(t)(i,j) to obtain the feature mean m(t)(i,j) at position (i, j);
2.1b) For the feature map F(t-k) of the (t-k)-th frame I(t-k), take the eight-neighborhood at the same position (i, j) to form a neighborhood feature block B(t-k)(i,j) centered at (i, j), and average B(t-k)(i,j) to obtain the feature mean m(t-k)(i,j) at position (i, j);
2.1c) Compare each value in the neighborhood feature block B(t)(i,j) of the t-th frame I(t) with its mean m(t)(i,j): set the hash value of every entry of B(t)(i,j) that is greater than or equal to the mean m(t)(i,j) to 1, and set the hash value of every entry that is less than the mean to 0, yielding the hash representation H(t)(i,j) of B(t)(i,j) consisting of 0s and 1s;
2.1d) Compare each value in the neighborhood feature block B(t-k)(i,j) of the (t-k)-th frame I(t-k) with its mean m(t-k)(i,j): set the hash value of every entry that is greater than or equal to the mean to 1, and set the hash value of every entry that is less than the mean to 0, yielding the hash representation H(t-k)(i,j) of B(t-k)(i,j) consisting of 0s and 1s;
2.1e) Calculate the Hamming distance d(t,t-k)(i,j) between the hash representation H(t)(i,j) of B(t)(i,j) and the hash representation H(t-k)(i,j) of B(t-k)(i,j):

d(t,t-k)(i,j) = Σ_l |H(t)(i,j)(l) - H(t-k)(i,j)(l)|,

where H(t)(i,j)(l) and H(t-k)(i,j)(l) are the values of the l-th elements of H(t)(i,j) and H(t-k)(i,j), respectively;
2.1f) Subtract the Hamming distance d(t,t-k)(i,j) from the number n of values contained in the neighborhood feature block to obtain the hash similarity score of the t-th frame feature map and the (t-k)-th frame feature map at position (i, j):

s(t,t-k)(i,j) = n - d(t,t-k)(i,j);
2.1g) Repeat 2.1a)-2.1f) to calculate the hash similarity scores of the t-th frame feature map F(t) and the (t-k)-th frame feature map F(t-k) at all positions, and combine them according to their spatial positions to obtain the local feature hash similarity score s(t,t-k) of the t-th frame feature map and the (t-k)-th frame feature map;
2.2) Repeat step 2.1) to calculate the local feature hash similarity scores s(t,t-k+1), ..., s(t,t-1) between F(t) and F(t-k+1), ..., F(t-1), thereby obtaining the local feature hash similarity scores s(t,t-k), ..., s(t,t-1) between the t-th video frame and its preceding k frames.
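For illustration only, a minimal PyTorch sketch of the hash similarity score of step 2 is given below for a pair of single-channel feature maps. How the channel dimension is handled, whether the block includes the center value, and how borders are padded are assumptions of the sketch; the binarization against the local mean and the score defined as the block size minus the Hamming distance follow steps 2.1a)-2.1f).

```python
import torch
import torch.nn.functional as F


def local_hash_similarity(feat_t, feat_tk):
    """Hash similarity score between two single-channel feature maps of shape (H, W).

    At every position (i, j) the 3x3 block (the position and its eight neighbors)
    is binarized against its own mean; the score is the number of values in the
    block minus the Hamming distance between the two binary codes.
    """
    def hash_codes(x):
        # (1, 9, H*W): the 3x3 neighborhood around every position, zero-padded at borders.
        blocks = F.unfold(x[None, None], kernel_size=3, padding=1)
        mean = blocks.mean(dim=1, keepdim=True)      # per-position neighborhood mean
        return (blocks >= mean).float()              # binary hash representation

    h_t, h_tk = hash_codes(feat_t), hash_codes(feat_tk)
    hamming = (h_t != h_tk).float().sum(dim=1)       # Hamming distance per position
    score = h_t.shape[1] - hamming                   # block size minus Hamming distance
    return score.view(feat_t.shape)                  # s(t,t-k), shape (H, W)


# Usage sketch: one score map per preceding frame.
# scores = [local_hash_similarity(feat_t, f) for f in prev_feats]
```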
Step 3, calculate the corrected feature map of the t-th video frame.
Referring to fig. 3, this step is implemented as follows:
3.1) Apply a softmax operation to the local feature hash similarity scores s(t,t-k), ..., s(t,t-1) at each spatial position to obtain the weights α(t-k), ..., α(t-1) corresponding to the feature maps F(t-k), ..., F(t-1);
3.2) Form the weighted sum of the feature maps F(t-k), ..., F(t-1) with their corresponding weights α(t-k), ..., α(t-1) at each spatial position and add it to F(t) to obtain the corrected feature map F'(t) of the video frame I(t):

F'(t)(i,j) = F(t)(i,j) + β · Σ_{τ=t-k}^{t-1} α(τ)(i,j) · F(τ)(i,j),

where β is the weight factor and, owing to the softmax operation in 3.1), the weights α(t-k)(i,j), ..., α(t-1)(i,j) at each spatial position (i, j) sum to 1.
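For illustration only, a minimal PyTorch sketch of this correction step is given below: the hash score maps are normalized with a softmax over the k preceding frames at every spatial position, the preceding feature maps are summed with these weights, and the result is added to F(t). The single-channel tensor shapes and the default weight factor β are assumptions of the sketch.

```python
import torch


def corrected_feature(feat_t, prev_feats, score_maps, beta=1.0):
    """Corrected feature map F'(t) of step 3.

    feat_t:     current-frame feature map F(t), shape (H, W)
    prev_feats: list of k preceding feature maps F(t-k), ..., F(t-1), each (H, W)
    score_maps: list of k hash score maps s(t,t-k), ..., s(t,t-1), each (H, W)
    """
    scores = torch.stack(score_maps)                  # (k, H, W)
    alpha = torch.softmax(scores, dim=0)              # per-position weights over the k frames
    weighted = (alpha * torch.stack(prev_feats)).sum(dim=0)
    return feat_t + beta * weighted                   # F'(t)
```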
and 4, selecting a candidate target area by using the modified feature map of the t-th frame video frame.
4.1) pairs of I(t)Modified feature map F of frame'(t)This is passed through convolution kernels of 3X 3 and 1X 1 in sequence to give I(t)Intermediate layer feature map F of a frame”(t)
4.2) in the intermediate layer feature map F”(t)Generating 9 anchor frames with different scales at each position, namely firstly setting a base anchor frame with the size of 16 multiplied by 16, keeping the area unchanged to enable the length-width ratio of the base anchor frame to be (0.5,1,2), and then respectively amplifying the three anchor frames with different length-width ratios to obtain (8,16,32) scales to obtain 9 anchor frames;
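For illustration only, the following sketch enumerates the 9 anchor frames of step 4.2): a 16×16 base anchor frame reshaped to the aspect ratios (0.5, 1, 2) at constant area and then enlarged by the scales (8, 16, 32). The exact centering and rounding conventions are assumptions of the sketch.

```python
import numpy as np


def base_anchor_frames(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Return the 9 anchor frames (x1, y1, x2, y2) centered on one feature-map cell."""
    cx = cy = (base_size - 1) / 2.0
    area = float(base_size * base_size)
    anchors = []
    for ratio in ratios:
        # Keep the area fixed while changing the aspect ratio (height / width).
        w = np.sqrt(area / ratio)
        h = w * ratio
        for scale in scales:
            ws, hs = w * scale, h * scale
            anchors.append([cx - 0.5 * (ws - 1), cy - 0.5 * (hs - 1),
                            cx + 0.5 * (ws - 1), cy + 0.5 * (hs - 1)])
    return np.array(anchors)          # shape (9, 4)
```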
4.3) training the parameters of the softmax layer and the target frame regression layer:
4.3a) randomly initializing parameters of a softmax layer and a target frame regression layer;
4.3b) for each anchor frame, calculating the probability of the anchor frame containing the target by using the initialized softmax layer, and calculating the parameterized coordinate of the anchor frame by using the initialized target frame regression;
4.3c) Construct the region candidate loss function with an L1 regular term that constrains the softmax layer parameters:

L_rpn = (1/N_cls) · Σ_i L_cls(e_i, e_i*) + λ1 · (1/N_reg) · Σ_i e_i* · L_reg(o_i, o_i*) + λ2 · ||w||_1,

where e_i is the probability, computed by the softmax layer, that the i-th anchor frame A_i contains a target; e_i* is the ground-truth label indicating whether the anchor frame A_i contains a target; o_i is the parameterized coordinates of the anchor frame A_i; o_i* is the coordinates of the ground-truth target frame corresponding to the anchor frame A_i; L_cls(e_i, e_i*) is the logarithmic loss over containing or not containing a target; L_reg(o_i, o_i*) is the Smooth L1 loss of the target frame regression; w is the parameter of the softmax layer; ||w||_1 is the L1 regular term constraining the softmax layer parameters; N_cls is the number of training batches; N_reg is the number of anchor frames; and λ1 and λ2 are balance weights;
4.3d) updating parameters of the softmax layer and the target frame regression layer by using the regional candidate loss function through a back propagation algorithm until the regional candidate loss function is converged to obtain the trained softmax layer and the target frame regression layer;
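For illustration only, a minimal PyTorch sketch of the region candidate loss of step 4.3c) is given below: an objectness cross-entropy term, a Smooth L1 regression term counted only for anchor frames labelled as containing a target, and an L1 penalty on the softmax layer weights. The default values of N_cls, λ1, and λ2 and the label encoding are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F


def region_candidate_loss(obj_logits, obj_labels, box_pred, box_target,
                          softmax_weight, lambda1=1.0, lambda2=1e-4, n_cls=256):
    """Region candidate loss of step 4.3c).

    obj_logits:     (N, 2) objectness scores of the N sampled anchor frames
    obj_labels:     (N,)  long tensor of ground-truth labels e* (1: target, 0: background)
    box_pred:       (N, 4) parameterized coordinates o from the target frame regression layer
    box_target:     (N, 4) parameterized coordinates o* of the matched ground-truth frames
    softmax_weight: weight tensor w of the softmax (objectness) layer
    """
    # Logarithmic loss over "contains a target / does not contain a target".
    cls_loss = F.cross_entropy(obj_logits, obj_labels, reduction="sum") / n_cls

    # Smooth L1 regression loss, counted only for anchor frames that contain a target.
    pos = obj_labels == 1
    n_reg = max(obj_labels.numel(), 1)
    reg_loss = F.smooth_l1_loss(box_pred[pos], box_target[pos], reduction="sum") / n_reg

    # L1 regular term constraining the softmax layer parameters.
    l1_term = softmax_weight.abs().sum()

    return cls_loss + lambda1 * reg_loss + lambda2 * l1_term
```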
4.4) For each anchor frame on the intermediate-layer feature map F"(t) of frame I(t), calculate the probability p that the anchor frame contains a target using the trained softmax layer, and compare it with a set threshold q:
if p is larger than q, the anchor frame contains a target; the coordinates of the anchor frame are fine-tuned with the trained target frame regression layer to obtain a plurality of candidate target regions of frame I(t), and step 5 is executed;
if p is less than or equal to q, the anchor frame does not contain the target, and the anchor frame is discarded.
Step 5, extract candidate region features of uniform size for each candidate target region.
On the corrected feature map F'(t) of the video frame I(t), region-of-interest pooling is applied to each candidate target region to extract candidate region features of uniform size: the portion of the corrected feature map F'(t) covered by each candidate target region is divided into a w_r × h_r grid, and a max pooling operation is performed within each grid cell, yielding a candidate region feature of uniform size w_r × h_r.
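For illustration only, this pooling can be sketched with the roi_pool operator of torchvision; the output grid size w_r × h_r = 7 × 7, the channel count, and the spatial_scale relating image coordinates to feature-map coordinates are assumptions of the sketch.

```python
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 256, 38, 50)           # corrected feature map F'(t), batch of 1
# Candidate target regions as (batch_index, x1, y1, x2, y2) in image coordinates.
rois = torch.tensor([[0, 10.0, 20.0, 200.0, 180.0],
                     [0, 50.0, 60.0, 300.0, 240.0]])
# Max pooling inside a w_r x h_r grid laid over each candidate region.
pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)                           # (2, 256, 7, 7): uniform-size candidate region features
```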
Step 6, obtain the target category and target frame position of the video frame from the features of each candidate region.
6.1) training classification and regression network:
6.1a) randomly initializing the parameters of the classification and regression networks;
6.1b) for each candidate region feature, calculating the probability of the candidate region belonging to each category by using the initialized classification network, and calculating the parameterized coordinate of the candidate region by using the initialized regression network;
6.1c) Construct the target detection loss function:

L_det = Σ_i L_cls(p_i, z) + λ · Σ_i L_reg(o_i, o_i*),  with  L_cls(p_i, z) = -(1 - p_i,z)^γ · log(p_i,z),

where z is the true category of the i-th candidate region; p_i,z is the probability that the i-th candidate region belongs to category z; γ is the concentration parameter; L_cls(p_i, z) is the focal loss of the target category; o_i is the parameterized coordinates of the i-th candidate region; o_i* is the coordinate vector of the ground-truth target frame corresponding to the i-th candidate region; L_reg(o_i, o_i*) is the Smooth L1 regression loss of the target frame; and λ is the balance weight;
6.1d) updating the classification and regression network parameters by a back propagation algorithm by using a target detection loss function until the target detection loss function is converged to obtain a trained classification and regression network;
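For illustration only, a minimal PyTorch sketch of the target detection loss of step 6.1c) is given below, combining a focal classification loss with concentration parameter γ and a Smooth L1 regression loss on the target frame coordinates. The default values of γ and λ and the assumption that the classifier outputs per-class probabilities are choices of the sketch.

```python
import torch
import torch.nn.functional as F


def target_detection_loss(cls_probs, labels, box_pred, box_target, gamma=2.0, lam=1.0):
    """Target detection loss of step 6.1c).

    cls_probs:  (N, C) per-class probabilities of the N candidate regions
    labels:     (N,)   long tensor with the true category z of each candidate region
    box_pred:   (N, 4) parameterized coordinates o of the candidate regions
    box_target: (N, 4) coordinate vectors o* of the matched ground-truth target frames
    """
    # Focal loss of the target category: -(1 - p_z)^gamma * log(p_z).
    p_z = cls_probs.gather(1, labels[:, None]).squeeze(1).clamp(min=1e-6)
    focal_loss = (-((1.0 - p_z) ** gamma) * p_z.log()).sum()

    # Smooth L1 regression loss of the target frame.
    reg_loss = F.smooth_l1_loss(box_pred, box_target, reduction="sum")

    return focal_loss + lam * reg_loss
```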
6.2) Input the features of each candidate region of the video frame I(t) into the trained classification and regression networks to obtain the target category and target frame position of the video frame I(t), respectively.
The effects of the present invention can be further illustrated by the following simulations:
1. simulation conditions
A workstation with an RTX 2080TI graphics card was used, using a PyTorch software framework.
Four consecutive frames with blurred pictures are selected as the first group of detected video sequences, as shown in fig. 4(a)-4(d);
four consecutive frames of a fast-moving object, a dog, are selected as the second group of detected video sequences, as shown in fig. 5(a)-5(d).
2. Emulated content
Simulation 1, performing video target detection on the first group of detected video sequences by using the method of the present invention, and obtaining a detection result of the fourth frame, as shown in fig. 4 (d).
Simulation 2, performing video target detection on the second group of detected video sequences by using the method of the present invention to obtain a detection result of the fourth frame, as shown in fig. 5 (d).
3. Analysis of simulation results
It can be seen from fig. 4(d) that the present invention can accurately detect the category and position of the target in the video when the picture is blurred, and it can be seen from fig. 5(d) that the present invention can accurately detect a target whose appearance changes greatly under fast and violent motion.

Claims (6)

1. The video target detection method based on the time sequence information and the local feature similarity is characterized by comprising the following steps:
(1) for the t-th video frame I(t) in the video V and its preceding k frames I(t-k), ..., I(t-1), passing them through a ResNet network to obtain the feature map F(t) of I(t) and the feature maps F(t-k), ..., F(t-1) of I(t-k), ..., I(t-1);
(2) calculating the local feature hash similarity scores s(t,t-k), ..., s(t,t-1) between F(t) and F(t-k), ..., F(t-1);
(3) computing the corrected feature map F'(t) of the video frame I(t) based on the time sequence information:
(3.1) applying a softmax operation to the local feature hash similarity scores s(t,t-k), ..., s(t,t-1) at each spatial position to obtain the weights α(t-k), ..., α(t-1) corresponding to the feature maps F(t-k), ..., F(t-1);
(3.2) forming the weighted sum of the feature maps F(t-k), ..., F(t-1) with their corresponding weights α(t-k), ..., α(t-1) at each spatial position and adding it to F(t) to obtain the corrected feature map F'(t) of the video frame I(t);
(4) using the corrected feature map F'(t) of the video frame I(t) to select candidate target regions of I(t):
(4.1) passing the corrected feature map F'(t) of frame I(t) through 3×3 and 1×1 convolution kernels in sequence to obtain the intermediate-layer feature map F"(t) of frame I(t);
(4.2) generating 9 anchor frames of different scales at each position of the feature map, namely first setting a base anchor frame of size 16×16, keeping its area unchanged while setting its aspect ratio to (0.5, 1, 2), and then enlarging each of the three anchor frames of different aspect ratios by the scales (8, 16, 32) to obtain 9 anchor frames in total;
(4.3) training the parameters of the softmax layer and the target frame regression layer to obtain the trained softmax layer and target frame regression layer;
(4.4) for each anchor frame on the intermediate-layer feature map F"(t) of frame I(t), judging whether it contains a target using the trained softmax layer:
if it contains a target, fine-tuning the coordinates of the anchor frame with the trained target frame regression layer to obtain a plurality of candidate target regions of frame I(t), and performing (5);
if it does not contain a target, discarding the anchor frame;
(5) on the corrected feature map F'(t) of the video frame I(t), applying region-of-interest pooling to each candidate target region to extract candidate region features of uniform size;
(6) obtaining the target category and target frame position of the video frame from the features of each candidate region:
(6.1) training the classification and regression networks to obtain the trained classification and regression networks;
(6.2) inputting the features of each candidate region of the video frame I(t) into the trained classification and regression networks to obtain the target category and target frame position of the video frame I(t), respectively.
2. The method of claim 1, wherein the local feature hash similarity scores s(t,t-k), ..., s(t,t-1) between F(t) and F(t-k), ..., F(t-1) in (2) are calculated as follows:
(2.1) calculating the local feature hash similarity score between the t-th frame feature map F(t) and the (t-k)-th frame feature map F(t-k):
(2.1a) for the feature map F(t) of the t-th frame I(t), taking the eight-neighborhood at any position (i, j) to form a neighborhood feature block B(t)(i,j) centered at (i, j), and averaging B(t)(i,j) to obtain the feature mean m(t)(i,j) at position (i, j);
(2.1b) for the feature map F(t-k) of the (t-k)-th frame I(t-k), taking the eight-neighborhood at the same position (i, j) to form a neighborhood feature block B(t-k)(i,j) centered at (i, j), and averaging B(t-k)(i,j) to obtain the feature mean m(t-k)(i,j) at position (i, j);
(2.1c) comparing each value in the neighborhood feature block B(t)(i,j) of the t-th frame I(t) with its mean m(t)(i,j), setting the hash value of every entry of B(t)(i,j) that is greater than or equal to the mean m(t)(i,j) to 1 and the hash value of every entry that is less than the mean to 0, and thereby obtaining the hash representation H(t)(i,j) of B(t)(i,j) consisting of 0s and 1s;
(2.1d) comparing each value in the neighborhood feature block B(t-k)(i,j) of the (t-k)-th frame I(t-k) with its mean m(t-k)(i,j), setting the hash value of every entry that is greater than or equal to the mean to 1 and the hash value of every entry that is less than the mean to 0, and thereby obtaining the hash representation H(t-k)(i,j) of B(t-k)(i,j) consisting of 0s and 1s;
(2.1e) calculating the Hamming distance d(t,t-k)(i,j) between the hash representation H(t)(i,j) of B(t)(i,j) and the hash representation H(t-k)(i,j) of B(t-k)(i,j);
(2.1f) subtracting the Hamming distance d(t,t-k)(i,j) from the number of values contained in the neighborhood feature block to obtain the hash similarity score s(t,t-k)(i,j) of the t-th frame feature map and the (t-k)-th frame feature map at position (i, j);
(2.1g) repeating (2.1a)-(2.1f) to calculate the hash similarity scores of the t-th frame feature map F(t) and the (t-k)-th frame feature map F(t-k) at all positions, and combining them according to their spatial positions to obtain the local feature hash similarity score s(t,t-k) of the t-th frame feature map and the (t-k)-th frame feature map;
(2.2) repeating (2.1) to calculate the local feature hash similarity scores s(t,t-k+1), ..., s(t,t-1) between F(t) and F(t-k+1), ..., F(t-1), thereby obtaining the local feature hash similarity scores s(t,t-k), ..., s(t,t-1) between the t-th video frame and its preceding k frames.
3. The method of claim 1, wherein the ResNet network in (1) is a feature extraction network consisting of one 7×7 convolutional layer, one 3×3 max pooling layer, and 16 residual blocks, where each residual block is composed of a 1×1 convolutional layer, a 3×3 convolutional layer, a 1×1 convolutional layer, and an identity mapping.
4. The method of claim 1, wherein the training of the parameters of the softmax layer and the target frame regression layer in (4.3) is implemented as follows:
(4.3a) randomly initializing parameters of a softmax layer and a target frame regression layer;
(4.3b) for each anchor frame, calculating the probability that the anchor frame contains the target by using the initialized softmax layer, and calculating the parameterized coordinate of the anchor frame by using the initialized target frame regression;
(4.3c) constructing the region candidate loss function with an L1 regular term that constrains the softmax layer parameters:

L_rpn = (1/N_cls) · Σ_i L_cls(e_i, e_i*) + λ1 · (1/N_reg) · Σ_i e_i* · L_reg(o_i, o_i*) + λ2 · ||w||_1,

wherein e_i is the probability, computed by the softmax layer, that the i-th anchor frame A_i contains a target; e_i* is the ground-truth label indicating whether the anchor frame A_i contains a target; o_i is the parameterized coordinates of the anchor frame A_i; o_i* is the coordinates of the ground-truth target frame corresponding to the anchor frame A_i; L_cls(e_i, e_i*) is the logarithmic loss over containing or not containing a target; L_reg(o_i, o_i*) is the Smooth L1 loss of the target frame regression; w is the parameter of the softmax layer; ||w||_1 is the L1 regular term constraining the softmax layer parameters; N_cls is the number of training batches; N_reg is the number of anchor frames; and λ1 and λ2 are balance weights;
and (4.3d) updating parameters of the softmax layer and the target frame regression layer by using the regional candidate loss function through a back propagation algorithm until the regional candidate loss function is converged to obtain the trained softmax layer and the target frame regression layer.
5. The method according to claim 1, wherein in (4.4) the trained softmax layer is used to judge whether the anchor frame contains the target by calculating the probability p that the anchor frame contains the target and comparing it with a set threshold q:
if p is more than q, the anchor frame contains the target;
if p is less than or equal to q, the anchor frame does not contain the target.
6. The method of claim 1, wherein the training of the classification and regression networks in (6.1) is implemented as follows:
(6.1a) randomly initializing the parameters of the classification and regression networks;
(6.1b) for each candidate region feature, calculating the probability of the candidate region belonging to each category by using the initialized classification network, and calculating the parameterized coordinates of the candidate region by using the initialized regression network;
(6.1c) constructing the target detection loss function:

L_det = Σ_i L_cls(p_i, z) + λ · Σ_i L_reg(o_i, o_i*),  with  L_cls(p_i, z) = -(1 - p_i,z)^γ · log(p_i,z),

wherein z is the true category of the i-th candidate region; p_i,z is the probability that the i-th candidate region belongs to category z; γ is the concentration parameter; L_cls(p_i, z) is the focal loss of the target category; o_i is the parameterized coordinates of the i-th candidate region; o_i* is the coordinate vector of the ground-truth target frame corresponding to the i-th candidate region; L_reg(o_i, o_i*) is the Smooth L1 regression loss of the target frame; and λ is the balance weight;
and (6.1d) updating the classification and regression network parameters by using the target detection loss function through a back propagation algorithm until the target detection loss function is converged to obtain the trained classification and regression network.
CN202010075005.6A 2020-01-22 2020-01-22 Video target detection method based on time sequence information and local feature similarity Active CN111310609B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010075005.6A CN111310609B (en) 2020-01-22 2020-01-22 Video target detection method based on time sequence information and local feature similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010075005.6A CN111310609B (en) 2020-01-22 2020-01-22 Video target detection method based on time sequence information and local feature similarity

Publications (2)

Publication Number Publication Date
CN111310609A true CN111310609A (en) 2020-06-19
CN111310609B CN111310609B (en) 2023-04-07

Family

ID=71148862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010075005.6A Active CN111310609B (en) 2020-01-22 2020-01-22 Video target detection method based on time sequence information and local feature similarity

Country Status (1)

Country Link
CN (1) CN111310609B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN109829398A (en) * 2019-01-16 2019-05-31 北京航空航天大学 A kind of object detection method in video based on Three dimensional convolution network
CN110287826A (en) * 2019-06-11 2019-09-27 北京工业大学 A kind of video object detection method based on attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李玺 et al.: "Survey of target tracking algorithms based on deep learning", Journal of Image and Graphics *
杨其睿: "Pedestrian detection model based on an improved deep residual network for oilfield security", Computer Measurement & Control *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380970A (en) * 2020-11-12 2021-02-19 常熟理工学院 Video target detection method based on local area search
CN112383821A (en) * 2020-11-17 2021-02-19 有米科技股份有限公司 Intelligent combination method and device for similar videos
CN112434618A (en) * 2020-11-26 2021-03-02 西安电子科技大学 Video target detection method based on sparse foreground prior, storage medium and equipment
CN112434618B (en) * 2020-11-26 2023-06-23 西安电子科技大学 Video target detection method, storage medium and device based on sparse foreground priori
CN113436188A (en) * 2021-07-28 2021-09-24 北京计算机技术及应用研究所 Method for calculating image hash value by convolution

Also Published As

Publication number Publication date
CN111310609B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN110136154B (en) Remote sensing image semantic segmentation method based on full convolution network and morphological processing
CN111310609B (en) Video target detection method based on time sequence information and local feature similarity
CN112052886B (en) Intelligent human body action posture estimation method and device based on convolutional neural network
CN108416266B (en) Method for rapidly identifying video behaviors by extracting moving object through optical flow
CN110032925B (en) Gesture image segmentation and recognition method based on improved capsule network and algorithm
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
CN107633226B (en) Human body motion tracking feature processing method
CN109033945B (en) Human body contour extraction method based on deep learning
CN113642390B (en) Street view image semantic segmentation method based on local attention network
CN111640125A (en) Mask R-CNN-based aerial photograph building detection and segmentation method and device
CN112750140A (en) Disguised target image segmentation method based on information mining
CN111027493A (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN111259906A (en) Method for generating and resisting remote sensing image target segmentation under condition containing multilevel channel attention
CN111612008A (en) Image segmentation method based on convolution network
CN111523553A (en) Central point network multi-target detection method based on similarity matrix
CN113706581B (en) Target tracking method based on residual channel attention and multi-level classification regression
CN113870157A (en) SAR image synthesis method based on cycleGAN
CN112084952B (en) Video point location tracking method based on self-supervision training
CN112329771A (en) Building material sample identification method based on deep learning
CN111985488B (en) Target detection segmentation method and system based on offline Gaussian model
CN111738099B (en) Face automatic detection method based on video image scene understanding
CN115861595B (en) Multi-scale domain self-adaptive heterogeneous image matching method based on deep learning
CN116597275A (en) High-speed moving target recognition method based on data enhancement
CN115862119A (en) Human face age estimation method and device based on attention mechanism
CN115294424A (en) Sample data enhancement method based on generation countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant