CN111310609B - Video target detection method based on time sequence information and local feature similarity - Google Patents

Video target detection method based on time sequence information and local feature similarity

Info

Publication number
CN111310609B
Authority
CN
China
Prior art keywords
frame
target
feature
regression
hash
Prior art date
Legal status
Active
Application number
CN202010075005.6A
Other languages
Chinese (zh)
Other versions
CN111310609A (en)
Inventor
古晶
刘芳
赵柏宇
焦李成
卞月林
巨小杰
张向荣
陈璞花
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University
Priority to CN202010075005.6A
Publication of CN111310609A
Application granted
Publication of CN111310609B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Abstract

The invention discloses a video target detection method based on time sequence information and local feature similarity, which mainly addresses the low accuracy and mismatched feature positions of video target detection in the prior art. The implementation scheme is as follows: extract a feature map of each frame of the video with a ResNet network; compute the similarity of feature maps with a local feature hash similarity measure, using the hash similarity score to represent the change of the feature at the current position; weight the feature maps of the adjacent frames and add them to the features of the current frame to obtain the corrected features of the current frame; obtain candidate target boxes on the corrected features with a region candidate network based on sparse classification; and apply region-of-interest pooling to obtain features of uniform size, which are input into the trained classification and regression network to obtain the detection result. The invention improves detection accuracy and reduces computational complexity.

Description

Video target detection method based on time sequence information and local feature similarity
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a video target detection method which can be used for target identification and positioning in a video.
Background
Computer vision is an important field of artificial intelligence. It is the science of enabling computers and software systems to recognize and understand images and scenes, and it includes branch fields such as image recognition, target detection, image generation and image super-resolution reconstruction. Visual understanding has three main levels: classification, detection and segmentation. The classification task is concerned with the whole image and gives a content description of the whole picture, whereas detection is concerned with a specific object target and requires both the recognition result and the localization result of that target. In contrast to classification, detection gives an understanding of the foreground and background of a picture: the object of interest must be separated from the background, and its recognition and localization results must be determined.
Target detection is an important research subject in the field of computer vision. It is the key to video analysis technologies such as moving target tracking, target recognition and behavior understanding, and its effect directly influences subsequent work. Image target detection has advanced greatly in recent years, with markedly improved detection performance. Especially in fields such as video surveillance and vehicle driving assistance, there is a broad demand for video-based target detection. However, applying image detection techniques directly to video detection raises new challenges. First, applying a deep network to all video frames incurs a huge computational cost; second, directly detecting video frames that contain motion blur, video defocus or rare poses with an image detection technique yields low accuracy.
To improve video detection accuracy, most earlier methods focus on post-processing: after each frame is detected by image target detection, the detection results are further processed using the temporal characteristics specific to video, as in the tubelet convolutional neural network T-CNN and the sequence non-maximum suppression Seq-NMS method. However, such post-processing inevitably increases the computation required for detection, reduces the detection speed, and cannot meet real-time requirements.
Disclosure of Invention
The object of the present invention is to provide a video target detection method based on time sequence information and local feature similarity, so as to improve the detection speed and meet real-time requirements.
The technical scheme of the invention is realized as follows:
The technical idea of the invention is to make full use of the time sequence information of a video sequence and to mine the change of target features in adjacent frame images. The scheme is as follows: first, extract a feature map of each frame of the video with a ResNet network; then adaptively correct the features of the current frame using the time sequence information of the adjacent preceding frames; obtain candidate target boxes on the corrected features through a region candidate network based on sparse classification; then apply region-of-interest pooling to obtain features of uniform size, and finally obtain the detection result through the classification and regression network. The specific implementation comprises the following steps:
The video target detection method based on time sequence information and local feature similarity comprises the following steps:
(1) Pass the t-th video frame I^(t) of a video V and its preceding k frames I^(t-k), ..., I^(t-1) through a ResNet network to obtain the feature map F^(t) of I^(t) and the feature maps F^(t-k), ..., F^(t-1) of I^(t-k), ..., I^(t-1);
(2) Calculate the local feature hash similarity scores s^(t,t-k), ..., s^(t,t-1) between F^(t) and F^(t-k), ..., F^(t-1);
(3) Compute the corrected feature map F'^(t) of video frame I^(t) based on the time sequence information:
(3.1) Apply a softmax operation to the local feature hash similarity scores s^(t,t-k), ..., s^(t,t-1) at each spatial position to obtain the weights α^(t-k), ..., α^(t-1) corresponding to the feature maps F^(t-k), ..., F^(t-1);
(3.2) Weight and sum the feature maps F^(t-k), ..., F^(t-1) with the corresponding weights α^(t-k), ..., α^(t-1) at each spatial position, and add the result to F^(t) to obtain the corrected feature map F'^(t) of video frame I^(t);
(4) Use the corrected feature map F'^(t) of video frame I^(t) to select candidate target regions of I^(t):
(4.1) Pass the corrected feature map F'^(t) of frame I^(t) through 3×3 and 1×1 convolution kernels in sequence to obtain the intermediate-layer feature map F''^(t) of frame I^(t);
(4.2) Generate 9 anchor boxes of different scales at each position of the feature map: first set a base anchor box of size 16×16, keep its area unchanged while setting the aspect ratios to (0.5, 1, 2), and then enlarge each of the three anchor boxes of different aspect ratios by scales (8, 16, 32), giving 9 anchor boxes in total;
(4.3) Train the parameters of the softmax layer and the target box regression layer to obtain the trained softmax layer and target box regression layer;
(4.4) For each anchor box on the intermediate-layer feature map F''^(t) of frame I^(t), judge with the trained softmax layer whether it contains a target:
if it contains a target, fine-tune the anchor box coordinates with the trained target box regression layer to obtain the candidate target regions of frame I^(t), and execute (5);
if it does not contain a target, discard the anchor box;
(5) On the corrected feature map F'^(t) of video frame I^(t), apply region-of-interest pooling to each candidate target region to extract candidate region features of uniform size;
(6) Obtain the target category and target box position of the video frame from the features of each candidate region:
(6.1) Train the classification and regression network to obtain the trained classification and regression network;
(6.2) Input the features of each candidate region of video frame I^(t) into the trained classification and regression network to obtain the target category and the target box position of video frame I^(t).
Further, the calculation in (2) of the local feature hash similarity scores s^(t,t-k), ..., s^(t,t-1) between F^(t) and F^(t-k), ..., F^(t-1) is implemented as follows:
(2.1) Calculate the local feature hash similarity score between the t-th frame feature map F^(t) and the (t-k)-th frame feature map F^(t-k):
(2.1a) For any position (i, j) of the feature map F^(t) of the t-th frame I^(t), take its eight neighbors to form the neighborhood feature block B_(i,j)^(t) centered at (i, j), and average the values of B_(i,j)^(t) to obtain the feature mean m_(i,j)^(t);
(2.1b) For position (i, j) of the feature map F^(t-k) of the (t-k)-th frame I^(t-k), take its eight neighbors to form the neighborhood feature block B_(i,j)^(t-k) centered at (i, j), and average the values of B_(i,j)^(t-k) to obtain the feature mean m_(i,j)^(t-k);
(2.1c) Compare each value in the neighborhood feature block B_(i,j)^(t) of the t-th frame I^(t) with its mean m_(i,j)^(t): set the hash value of the entries of B_(i,j)^(t) that are greater than or equal to the mean m_(i,j)^(t) to 1 and of the entries that are less than the mean to 0, obtaining the hash representation h_(i,j)^(t) of B_(i,j)^(t) consisting of 0s and 1s;
(2.1d) Compare each value in the neighborhood feature block B_(i,j)^(t-k) of the (t-k)-th frame I^(t-k) with its mean m_(i,j)^(t-k): set the hash value of the entries that are greater than or equal to the mean to 1 and of the entries that are less than the mean to 0, obtaining the hash representation h_(i,j)^(t-k) of B_(i,j)^(t-k) consisting of 0s and 1s;
(2.1e) Calculate the Hamming distance d_(i,j)^(t,t-k) between the hash representation h_(i,j)^(t) of B_(i,j)^(t) and the hash representation h_(i,j)^(t-k) of B_(i,j)^(t-k);
(2.1f) Subtract the Hamming distance d_(i,j)^(t,t-k) from the number of values contained in the neighborhood feature block B_(i,j)^(t) to obtain the hash similarity score s_(i,j)^(t,t-k) of the t-th frame feature map and the (t-k)-th frame feature map at position (i, j);
(2.1g) Repeat (2.1a)-(2.1f) to calculate the hash similarity scores of the t-th frame feature map F^(t) and the (t-k)-th frame feature map F^(t-k) at all positions, and combine them by spatial position to obtain the local feature hash similarity score s^(t,t-k) of the t-th and (t-k)-th frame feature maps;
(2.2) Repeat (2.1) to calculate the local feature hash similarity scores s^(t,t-k+1), ..., s^(t,t-1) between F^(t) and F^(t-k+1), ..., F^(t-1), thereby obtaining the local feature hash similarity scores s^(t,t-k), ..., s^(t,t-1) of the t-th video frame and its preceding k frames.
Compared with the prior art, the invention has the following advantages:
1) On the basis of the two-stage image target detection method, the invention takes into account the relation between adjacent frames based on time sequence information: on a video sequence formed by several frames, the features of adjacent frames are weighted and added to the features of the current frame to obtain the corrected features of the current frame adaptively. After feature correction, video frames with motion blur, video defocus and rare poses can be detected, which improves the detection accuracy.
2) When correcting features with time sequence information, the invention measures feature similarity with the local feature hash similarity measure and represents the change of the feature at the current position with the hash similarity score. This alleviates the feature position mismatch caused by position changes of moving targets in a video, and compared with common similarity measures it reduces computational complexity and improves operating efficiency.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a sub-flowchart of computing the local feature hash similarity score in the present invention;
FIG. 3 is a sub-flowchart of computing the corrected features in the present invention;
FIG. 4 and FIG. 5 are diagrams illustrating the effect of video target detection using the present invention.
Detailed Description
The embodiments and effects of the present invention will be described in further detail below with reference to the accompanying drawings.
The implementation of the method mainly comprises two parts, training and testing. The training process updates the model parameters by computing the model loss function and back-propagating; the testing process fixes the parameters, first computes the corrected features of the video frame using the time sequence information, and then obtains the target category and target box position of the video frame from the corrected features.
Referring to fig. 1, the implementation steps of this example are as follows:
step 1, calculating a characteristic diagram of a t frame video frame and a preamble frame thereof.
For the t frame video frame I in the video V (t) With its first k frames I (t-k) ,...,I (t-1) Through ResNet network, obtain I (t) Characteristic diagram F of (t) And I (t-k) ,...,I (t-1) Characteristic diagram F of (t-k) ,...,F (t-1)
The ResNet network is a characteristic extraction network consisting of 1 7 × 7 convolutional layer, 1 3 × 3 maximum pooling layer and 16 residual blocks, wherein each residual block is formed by combining 1 × 1 convolutional layer, 1 3 × 3 convolutional layer, 1 × 1 convolutional layer and identity mapping.
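For illustration only, the following is a minimal PyTorch-style sketch of this per-frame feature extraction, assuming a torchvision ResNet-50 trunk (whose 16 bottleneck blocks match the structure described above); the tensor names and the use of torchvision are assumptions, not the patented implementation.

```python
import torch
import torchvision

# ResNet-50 trunk (one 7x7 conv, one 3x3 max pool, 16 residual blocks),
# with the classification head removed, used as the per-frame feature extractor.
_resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(_resnet.children())[:-2])
backbone.eval()

@torch.no_grad()
def extract_feature_maps(frames):
    """frames: tensor of shape (k+1, 3, H, W) holding I^(t-k), ..., I^(t).
    Returns a tensor of shape (k+1, C, H', W') holding F^(t-k), ..., F^(t)."""
    return backbone(frames)
```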
Step 2, compute the local feature hash similarity scores between the t-th video frame and its preceding k frames.
2.1) Calculate the local feature hash similarity score between the t-th frame feature map F^(t) and the (t-k)-th frame feature map F^(t-k).
Referring to fig. 2, this step is implemented as follows:
2.1a) For any position (i, j) of the feature map F^(t) of the t-th frame I^(t), take its eight neighbors to form the neighborhood feature block B_(i,j)^(t) centered at (i, j), and average the values of B_(i,j)^(t) to obtain the feature mean m_(i,j)^(t);
2.1b) For position (i, j) of the feature map F^(t-k) of the (t-k)-th frame I^(t-k), take its eight neighbors to form the neighborhood feature block B_(i,j)^(t-k) centered at (i, j), and average the values of B_(i,j)^(t-k) to obtain the feature mean m_(i,j)^(t-k) at position (i, j);
2.1c) Compare each value in the neighborhood feature block B_(i,j)^(t) of the t-th frame I^(t) with its mean m_(i,j)^(t): set the hash value of the entries of B_(i,j)^(t) that are greater than or equal to the mean m_(i,j)^(t) to 1 and of the entries that are less than the mean to 0, obtaining the hash representation h_(i,j)^(t) of B_(i,j)^(t) consisting of 0s and 1s;
2.1d) Compare each value in the neighborhood feature block B_(i,j)^(t-k) of the (t-k)-th frame I^(t-k) with its mean m_(i,j)^(t-k): set the hash value of the entries that are greater than or equal to the mean to 1 and of the entries that are less than the mean to 0, obtaining the hash representation h_(i,j)^(t-k) of B_(i,j)^(t-k) consisting of 0s and 1s;
2.1e) Calculate the Hamming distance d_(i,j)^(t,t-k) between the hash representation h_(i,j)^(t) of B_(i,j)^(t) and the hash representation h_(i,j)^(t-k) of B_(i,j)^(t-k):

d_(i,j)^(t,t-k) = Σ_l | h_(i,j),l^(t) − h_(i,j),l^(t-k) |,

where h_(i,j),l^(t) and h_(i,j),l^(t-k) are respectively the values of the l-th element of h_(i,j)^(t) and h_(i,j)^(t-k);
2.1f) Subtract the Hamming distance d_(i,j)^(t,t-k) from the number of values contained in the neighborhood feature block B_(i,j)^(t) to obtain the hash similarity score s_(i,j)^(t,t-k) of the t-th frame feature map and the (t-k)-th frame feature map at position (i, j);
2.1g) Repeat 2.1a)-2.1f) to calculate the hash similarity scores of the t-th frame feature map F^(t) and the (t-k)-th frame feature map F^(t-k) at all positions, and combine them by spatial position to obtain the local feature hash similarity score s^(t,t-k) of the t-th and (t-k)-th frame feature maps.
2.2) Repeat step 2.1) to calculate the local feature hash similarity scores s^(t,t-k+1), ..., s^(t,t-1) between F^(t) and F^(t-k+1), ..., F^(t-1), thereby obtaining the local feature hash similarity scores s^(t,t-k), ..., s^(t,t-1) of the t-th video frame and its preceding k frames.
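Purely as an illustrative sketch of steps 2.1a)-2.2) above, the local feature hash similarity could be computed as follows; the helper name, the treatment of the feature map as a single 2-D (e.g. channel-averaged) map, and the zero-padded borders are assumptions.

```python
import torch
import torch.nn.functional as F


def local_hash_similarity(feat_t, feat_prev, block=3):
    """Hash similarity score s^(t,t-k) between two 2-D feature maps.

    feat_t, feat_prev: tensors of shape (H, W), here assumed to be a single
    channel (or a channel-averaged map) of F^(t) and F^(t-k).
    Returns a tensor of shape (H, W) with values in [0, block*block].
    """
    def neighborhood_hash(x):
        # Unfold the 3x3 neighborhood around every position (zero-padded border).
        patches = F.unfold(x[None, None], kernel_size=block, padding=block // 2)
        patches = patches[0].T.reshape(x.shape[0], x.shape[1], block * block)
        mean = patches.mean(dim=-1, keepdim=True)
        # Entries >= local mean get hash bit 1, entries < mean get 0.
        return (patches >= mean).to(torch.uint8)

    h_t, h_p = neighborhood_hash(feat_t), neighborhood_hash(feat_prev)
    hamming = (h_t ^ h_p).sum(dim=-1)            # d_(i,j)^(t,t-k)
    return block * block - hamming               # s_(i,j)^(t,t-k)
```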
Step 3, compute the corrected feature map of the t-th video frame.
Referring to fig. 3, this step is implemented as follows:
3.1) Apply a softmax operation to the local feature hash similarity scores s^(t,t-k), ..., s^(t,t-1) at each spatial position to obtain the weights α^(t-k), ..., α^(t-1) corresponding to the feature maps F^(t-k), ..., F^(t-1);
3.2) Weight and sum the feature maps F^(t-k), ..., F^(t-1) with the corresponding weights α^(t-k), ..., α^(t-1) at each spatial position, and add the result to F^(t) to obtain the corrected feature map F'^(t) of video frame I^(t):

F'^(t) = F^(t) + β · Σ_{k'=1}^{k} α^(t-k') ⊙ F^(t-k'),

where β is a weight factor and ⊙ denotes multiplication by the weight at each spatial position.
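A minimal sketch of steps 3.1)-3.2), assuming a local_hash_similarity helper such as the one sketched above and a hypothetical weight factor beta; how the per-position weights broadcast over feature channels is an assumption.

```python
import torch


def correct_features(feat_t, prev_feats, sim_scores, beta=1.0):
    """feat_t: (C, H, W) feature map F^(t).
    prev_feats: list of k tensors (C, H, W) for F^(t-k), ..., F^(t-1).
    sim_scores: list of k tensors (H, W), the scores s^(t,t-k), ..., s^(t,t-1).
    beta: weight factor balancing the temporal correction term.
    """
    scores = torch.stack(sim_scores, dim=0).float()   # (k, H, W)
    alpha = torch.softmax(scores, dim=0)              # softmax over frames at each position
    prev = torch.stack(prev_feats, dim=0)             # (k, C, H, W)
    weighted = (alpha[:, None] * prev).sum(dim=0)     # weighted sum over preceding frames
    return feat_t + beta * weighted                   # corrected feature map F'^(t)
```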
and 4, selecting a candidate target area by using the modified feature map of the t-th frame video frame.
4.1 Pair I) (t) Modified feature map F of frame '(t) This is passed through convolution kernels of 3X 3 and 1X 1 in sequence to give I (t) Intermediate layer feature map F of a frame ”(t)
4.2 In the intermediate layer feature map F ”(t) Generating 9 anchor frames with different scales at each position, namely firstly setting a base anchor frame with the size of 16 multiplied by 16, keeping the area unchanged to ensure that the length-width ratio of the base anchor frame is (0.5,1,2), and then respectively amplifying the three anchor frames with different length-width ratios (8,16,32) by scales to obtain 9 anchor frames in total;
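As an illustration of the anchor construction in step 4.2), under the stated 16×16 base size, (0.5, 1, 2) aspect ratios and (8, 16, 32) scales (the helper name and corner-coordinate convention are assumptions):

```python
import numpy as np


def generate_base_anchors(base_size=16, ratios=(0.5, 1, 2), scales=(8, 16, 32)):
    """Return the 9 anchor boxes (x1, y1, x2, y2) centered on one feature-map cell."""
    anchors = []
    for ratio in ratios:
        # Keep the base area constant while changing the aspect ratio h/w.
        w = base_size * np.sqrt(1.0 / ratio)
        h = base_size * np.sqrt(ratio)
        for scale in scales:
            sw, sh = w * scale, h * scale
            anchors.append([-sw / 2, -sh / 2, sw / 2, sh / 2])
    return np.array(anchors)  # shape (9, 4)
```

These 9 base anchors would then be translated to every spatial position of the intermediate-layer feature map F''^(t).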
4.3) Train the parameters of the softmax layer and the target box regression layer:
4.3a) Randomly initialize the parameters of the softmax layer and the target box regression layer;
4.3b) For each anchor box, calculate the probability that it contains a target with the initialized softmax layer, and calculate its parameterized coordinates with the initialized target box regression layer;
4.3c) Construct the region candidate loss function L_rpn with an L1 regularization term constraining the softmax layer parameters:

L_rpn = (1/N_cls) Σ_i L_cls(e_i, e_i*) + λ1 · (1/N_reg) Σ_i e_i* · L_reg(o_i, o_i*) + λ2 · ||W||_1,

where e_i is the probability, computed by the softmax layer, that the i-th anchor box A_i contains a target; e_i* is the ground-truth label of whether the anchor box A_i contains a target; o_i is the parameterized coordinates of the anchor box A_i; o_i* is the coordinates of the ground-truth target box corresponding to the anchor box A_i; L_cls(e_i, e_i*) is the log loss over whether a target is present; L_reg(o_i, o_i*) is the Smooth L1 loss of the target box regression; W is the softmax layer parameters and ||W||_1 is the L1 regularization term constraining them; N_cls is the training batch size, N_reg is the number of anchor boxes, and λ1 and λ2 are balance weights;
4.3d) Update the parameters of the softmax layer and the target box regression layer with the region candidate loss function through the back-propagation algorithm until the region candidate loss function converges, obtaining the trained softmax layer and target box regression layer;
4.4) For each anchor box on the intermediate-layer feature map F''^(t) of frame I^(t), calculate the probability p that it contains a target with the trained softmax layer, and compare p with a set threshold q:
if p > q, the anchor box contains a target; fine-tune its coordinates with the trained target box regression layer to obtain the candidate target regions of frame I^(t), and execute step 5;
if p ≤ q, the anchor box does not contain a target and is discarded.
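For illustration only, the region candidate loss of step 4.3c) could be sketched as follows; the function signature, the reduction details, and treating the L1 term as acting on a single softmax-layer weight tensor are assumptions.

```python
import torch
import torch.nn.functional as F


def region_candidate_loss(obj_prob, obj_label, box_pred, box_target,
                          softmax_weight, lambda1=1.0, lambda2=1e-4):
    """obj_prob:   (N,) probability e_i that anchor A_i contains a target.
    obj_label:  (N,) ground-truth label e_i* in {0, 1}.
    box_pred:   (N, 4) parameterized anchor coordinates o_i.
    box_target: (N, 4) ground-truth box coordinates o_i*.
    softmax_weight: weight tensor W of the softmax (objectness) layer.
    """
    n_cls = obj_prob.shape[0]                 # training batch size N_cls
    n_reg = box_pred.shape[0]                 # number of anchor boxes N_reg
    cls_loss = F.binary_cross_entropy(obj_prob, obj_label.float(),
                                      reduction="sum") / n_cls
    # Regression term counted only for anchors whose label marks a target.
    reg_loss = (obj_label[:, None].float()
                * F.smooth_l1_loss(box_pred, box_target, reduction="none")).sum() / n_reg
    sparsity = softmax_weight.abs().sum()     # L1 term constraining the softmax layer
    return cls_loss + lambda1 * reg_loss + lambda2 * sparsity
```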
Step 5, extract candidate region features of uniform size for each candidate target region.
On the corrected feature map F'^(t) of video frame I^(t), apply region-of-interest pooling to each candidate target region to extract candidate region features of uniform size: divide the part of the corrected feature map F'^(t) covered by each candidate target region into a w_r × h_r grid, and perform a max pooling operation within each grid cell to obtain a candidate region feature of uniform size w_r × h_r.
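A sketch of this region-of-interest pooling step using the torchvision roi_pool operator as an off-the-shelf stand-in for the grid max pooling described above; the 7×7 output size, the unit spatial scale, and the box format are placeholders.

```python
import torch
from torchvision.ops import roi_pool


def pool_candidate_regions(corrected_feat, proposals, w_r=7, h_r=7):
    """corrected_feat: (1, C, H, W) corrected feature map F'^(t).
    proposals: (M, 4) candidate target regions as (x1, y1, x2, y2)
    in feature-map coordinates.
    Returns (M, C, h_r, w_r) candidate region features of uniform size."""
    batch_idx = torch.zeros(len(proposals), 1, dtype=proposals.dtype)
    boxes = torch.cat([batch_idx, proposals], dim=1)  # prepend batch index
    return roi_pool(corrected_feat, boxes, output_size=(h_r, w_r), spatial_scale=1.0)
```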
Step 6, obtain the target category and target box position of the video frame from the features of each candidate region.
6.1) Train the classification and regression network:
6.1a) Randomly initialize the parameters of the classification and regression network;
6.1b) For each candidate region feature, calculate the probability that the candidate region belongs to each category with the initialized classification network, and calculate the parameterized coordinates of the candidate region with the initialized regression network;
6.1c) Construct the target detection loss function L_det:

L_det = Σ_i [ −(1 − p_(i,z))^γ · log(p_(i,z)) + λ · L_reg(o_i, o_i*) ],

where z is the true category of the i-th candidate region; p_(i,z) is the probability that the i-th candidate region belongs to category z; γ is a focusing parameter and −(1 − p_(i,z))^γ · log(p_(i,z)) is the focal loss of the target category; o_i is the parameterized coordinates of the i-th candidate region; o_i* is the coordinate vector of the ground-truth target box corresponding to the i-th candidate region; L_reg(o_i, o_i*) is the Smooth L1 regression loss of the target box; and λ is a balance weight;
6.1d) Update the classification and regression network parameters with the target detection loss function through the back-propagation algorithm until the target detection loss function converges, obtaining the trained classification and regression network;
6.2) Input the features of each candidate region of video frame I^(t) into the trained classification and regression network to obtain the target category and the target box position of video frame I^(t).
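An illustrative sketch of the target detection loss of step 6.1c), combining a focal classification term with a Smooth L1 box regression term; the function signature, the softmax over class logits, and averaging over candidate regions are assumptions.

```python
import torch
import torch.nn.functional as F


def detection_loss(class_logits, labels, box_pred, box_target, gamma=2.0, lam=1.0):
    """class_logits: (N, num_classes) scores for each candidate region.
    labels:      (N,) true category z of each candidate region.
    box_pred:    (N, 4) parameterized coordinates o_i.
    box_target:  (N, 4) ground-truth box coordinates o_i*.
    """
    probs = torch.softmax(class_logits, dim=1)
    p_z = probs[torch.arange(len(labels)), labels]     # p_(i,z): prob of the true class
    focal = -((1.0 - p_z) ** gamma) * torch.log(p_z.clamp_min(1e-8))  # focal loss term
    reg = F.smooth_l1_loss(box_pred, box_target, reduction="none").sum(dim=1)
    return (focal + lam * reg).mean()
```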
The effects of the present invention can be further illustrated by the following simulations:
1. Simulation conditions
The simulations were run on a workstation with an RTX 2080 Ti graphics card, using the PyTorch software framework.
Four consecutive frames with blurred pictures were selected as the first group of test video sequences, as shown in fig. 4(a)-4(d);
four consecutive frames of a fast-moving object (a dog) were selected as the second group of test video sequences, fig. 5(a)-5(d).
2. Simulation content
Simulation 1: video target detection was performed on the first group of test video sequences with the method of the present invention, and the detection result of the fourth frame was obtained, as shown in fig. 4(d).
Simulation 2: video target detection was performed on the second group of test video sequences with the method of the present invention, and the detection result of the fourth frame was obtained, as shown in fig. 5(d).
3. Analysis of simulation results
As can be seen from fig. 4(d), the invention can accurately detect the category and position of a target in the video when the picture is blurred; as can be seen from fig. 5(d), the invention can accurately detect a target in the video whose appearance changes greatly under fast and intense motion.

Claims (6)

1. A video target detection method based on time sequence information and local feature similarity, characterized by comprising the following steps:
(1) Passing the t-th video frame I^(t) of a video V and its preceding k frames I^(t-k), ..., I^(t-1) through a ResNet network to obtain the feature map F^(t) of I^(t) and the feature maps F^(t-k), ..., F^(t-1) of I^(t-k), ..., I^(t-1);
(2) Calculating the local feature hash similarity scores s^(t,t-k), ..., s^(t,t-1) between F^(t) and F^(t-k), ..., F^(t-1);
(3) Computing the corrected feature map F'^(t) of video frame I^(t) based on the time sequence information:
(3.1) applying a softmax operation to the local feature hash similarity scores s^(t,t-k), ..., s^(t,t-1) at each spatial position to obtain the weights α^(t-k), ..., α^(t-1) corresponding to the feature maps F^(t-k), ..., F^(t-1);
(3.2) weighting and summing the feature maps F^(t-k), ..., F^(t-1) with the corresponding weights α^(t-k), ..., α^(t-1) at each spatial position, and adding the result to F^(t) to obtain the corrected feature map F'^(t) of video frame I^(t);
(4) Using the corrected feature map F'^(t) of video frame I^(t) to select candidate target regions of I^(t):
(4.1) passing the corrected feature map F'^(t) of frame I^(t) through 3×3 and 1×1 convolution kernels in sequence to obtain the intermediate-layer feature map F''^(t) of frame I^(t);
(4.2) generating 9 anchor boxes of different scales at each position of the feature map: first setting a base anchor box of size 16×16, keeping its area unchanged while setting the aspect ratios to (0.5, 1, 2), and then enlarging each of the three anchor boxes of different aspect ratios by scales (8, 16, 32), giving 9 anchor boxes in total;
(4.3) training the parameters of the softmax layer and the target box regression layer to obtain the trained softmax layer and target box regression layer;
(4.4) for each anchor box on the intermediate-layer feature map F''^(t) of frame I^(t), judging with the trained softmax layer whether it contains a target:
if it contains a target, fine-tuning the anchor box coordinates with the trained target box regression layer to obtain the candidate target regions of frame I^(t), and executing (5);
if it does not contain a target, discarding the anchor box;
(5) On the corrected feature map F'^(t) of video frame I^(t), applying region-of-interest pooling to each candidate target region to extract candidate region features of uniform size;
(6) Obtaining the target category and target box position of the video frame from the features of each candidate region:
(6.1) training a classification and regression network to obtain the trained classification and regression network;
(6.2) inputting the features of each candidate region of video frame I^(t) into the trained classification and regression network to obtain the target category and the target box position of video frame I^(t).
2. The method of claim 1, wherein the calculation in (2) of the local feature hash similarity scores s^(t,t-k), ..., s^(t,t-1) between F^(t) and F^(t-k), ..., F^(t-1) is implemented as follows:
(2.1) calculating the local feature hash similarity score between the t-th frame feature map F^(t) and the (t-k)-th frame feature map F^(t-k):
(2.1a) for any position (i, j) of the feature map F^(t) of the t-th frame I^(t), taking its eight neighbors to form the neighborhood feature block B_(i,j)^(t) centered at (i, j), and averaging the values of B_(i,j)^(t) to obtain the feature mean m_(i,j)^(t);
(2.1b) for position (i, j) of the feature map F^(t-k) of the (t-k)-th frame I^(t-k), taking its eight neighbors to form the neighborhood feature block B_(i,j)^(t-k) centered at (i, j), and averaging the values of B_(i,j)^(t-k) to obtain the feature mean m_(i,j)^(t-k);
(2.1c) comparing each value in the neighborhood feature block B_(i,j)^(t) of the t-th frame I^(t) with its mean m_(i,j)^(t), setting the hash value of the entries of B_(i,j)^(t) that are greater than or equal to the mean m_(i,j)^(t) to 1 and of the entries that are less than the mean to 0, obtaining the hash representation h_(i,j)^(t) of B_(i,j)^(t) consisting of 0s and 1s;
(2.1d) comparing each value in the neighborhood feature block B_(i,j)^(t-k) of the (t-k)-th frame I^(t-k) with its mean m_(i,j)^(t-k), setting the hash value of the entries that are greater than or equal to the mean to 1 and of the entries that are less than the mean to 0, obtaining the hash representation h_(i,j)^(t-k) of B_(i,j)^(t-k) consisting of 0s and 1s;
(2.1e) calculating the Hamming distance d_(i,j)^(t,t-k) between the hash representation h_(i,j)^(t) of B_(i,j)^(t) and the hash representation h_(i,j)^(t-k) of B_(i,j)^(t-k);
(2.1f) subtracting the Hamming distance d_(i,j)^(t,t-k) from the number of values contained in the neighborhood feature block B_(i,j)^(t) to obtain the hash similarity score s_(i,j)^(t,t-k) of the t-th frame feature map and the (t-k)-th frame feature map at position (i, j);
(2.1g) repeating (2.1a)-(2.1f) to calculate the hash similarity scores of the t-th frame feature map F^(t) and the (t-k)-th frame feature map F^(t-k) at all positions, and combining them by spatial position to obtain the local feature hash similarity score s^(t,t-k) of the t-th and (t-k)-th frame feature maps;
(2.2) repeating (2.1) to calculate the local feature hash similarity scores s^(t,t-k+1), ..., s^(t,t-1) between F^(t) and F^(t-k+1), ..., F^(t-1), thereby obtaining the local feature hash similarity scores s^(t,t-k), ..., s^(t,t-1) of the t-th video frame and its preceding k frames.
3. The method of claim 1, wherein the ResNet network in (1) is a feature extraction network consisting of one 7×7 convolutional layer, one 3×3 max pooling layer and 16 residual blocks, wherein each residual block combines a 1×1 convolutional layer, a 3×3 convolutional layer, a 1×1 convolutional layer and an identity mapping.
4. The method of claim 1, wherein the training of the parameters of the softmax layer and the target box regression layer in (4.3) is implemented as follows:
(4.3a) randomly initializing the parameters of the softmax layer and the target box regression layer;
(4.3b) for each anchor box, calculating the probability that it contains a target with the initialized softmax layer, and calculating its parameterized coordinates with the initialized target box regression layer;
(4.3c) constructing the region candidate loss function L_rpn with an L1 regularization term constraining the softmax layer parameters:

L_rpn = (1/N_cls) Σ_i L_cls(e_i, e_i*) + λ1 · (1/N_reg) Σ_i e_i* · L_reg(o_i, o_i*) + λ2 · ||W||_1,

where e_i is the probability, computed by the softmax layer, that the i-th anchor box A_i contains a target; e_i* is the ground-truth label of whether the anchor box A_i contains a target; o_i is the parameterized coordinates of the anchor box A_i; o_i* is the coordinates of the ground-truth target box corresponding to the anchor box A_i; L_cls(e_i, e_i*) is the log loss over whether a target is present; L_reg(o_i, o_i*) is the Smooth L1 loss of the target box regression; W is the softmax layer parameters and ||W||_1 is the L1 regularization term constraining them; N_cls is the training batch size, N_reg is the number of anchor boxes, and λ1 and λ2 are balance weights;
(4.3d) updating the parameters of the softmax layer and the target box regression layer with the region candidate loss function through the back-propagation algorithm until the region candidate loss function converges, obtaining the trained softmax layer and target box regression layer.
5. The method of claim 1, wherein in (4.4) the trained softmax layer is used to judge whether an anchor box contains a target by calculating the probability p that the anchor box contains a target with the trained softmax layer and comparing it with a set threshold q:
if p > q, the anchor box contains a target;
if p ≤ q, the anchor box does not contain a target.
6. The method of claim 1, wherein the training of the classification and regression network in (6.1) is implemented as follows:
(6.1a) randomly initializing the parameters of the classification and regression network;
(6.1b) for each candidate region feature, calculating the probability that the candidate region belongs to each category with the initialized classification network, and calculating the parameterized coordinates of the candidate region with the initialized regression network;
(6.1c) constructing the target detection loss function L_det:

L_det = Σ_i [ −(1 − p_(i,z))^γ · log(p_(i,z)) + λ · L_reg(o_i, o_i*) ],

where z is the true category of the i-th candidate region; p_(i,z) is the probability that the i-th candidate region belongs to category z; γ is a focusing parameter and −(1 − p_(i,z))^γ · log(p_(i,z)) is the focal loss of the target category; o_i is the parameterized coordinates of the i-th candidate region; o_i* is the coordinate vector of the ground-truth target box corresponding to the i-th candidate region; L_reg(o_i, o_i*) is the Smooth L1 regression loss of the target box; and λ is a balance weight;
(6.1d) updating the classification and regression network parameters with the target detection loss function through the back-propagation algorithm until the target detection loss function converges, obtaining the trained classification and regression network.
CN202010075005.6A 2020-01-22 2020-01-22 Video target detection method based on time sequence information and local feature similarity Active CN111310609B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010075005.6A CN111310609B (en) 2020-01-22 2020-01-22 Video target detection method based on time sequence information and local feature similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010075005.6A CN111310609B (en) 2020-01-22 2020-01-22 Video target detection method based on time sequence information and local feature similarity

Publications (2)

Publication Number Publication Date
CN111310609A CN111310609A (en) 2020-06-19
CN111310609B true CN111310609B (en) 2023-04-07

Family

ID=71148862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010075005.6A Active CN111310609B (en) 2020-01-22 2020-01-22 Video target detection method based on time sequence information and local feature similarity

Country Status (1)

Country Link
CN (1) CN111310609B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380970B (en) * 2020-11-12 2022-02-11 常熟理工学院 Video target detection method based on local area search
CN112383821B (en) * 2020-11-17 2023-03-24 有米科技股份有限公司 Intelligent combination method and device for similar videos
CN112434618B (en) * 2020-11-26 2023-06-23 西安电子科技大学 Video target detection method, storage medium and device based on sparse foreground priori
CN113436188B (en) * 2021-07-28 2023-02-03 北京计算机技术及应用研究所 Method for calculating image hash value by convolution

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549893B (en) * 2018-04-04 2020-03-31 华中科技大学 End-to-end identification method for scene text with any shape
CN109829398B (en) * 2019-01-16 2020-03-31 北京航空航天大学 Target detection method in video based on three-dimensional convolution network
CN110287826B (en) * 2019-06-11 2021-09-17 北京工业大学 Video target detection method based on attention mechanism

Also Published As

Publication number Publication date
CN111310609A (en) 2020-06-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant