CN111291679B - Target specific response attention target tracking method based on twin network


Info

Publication number
CN111291679B
CN111291679B
Authority
CN
China
Prior art keywords
target
response
search area
convolution
attention
Prior art date
Legal status
Active
Application number
CN202010081733.8A
Other languages
Chinese (zh)
Other versions
CN111291679A (en)
Inventor
Hanzi Wang (王菡子)
Penghui Zhao (赵鹏辉)
Haosheng Chen (陈昊升)
Yanjie Liang (梁艳杰)
Yan Yan (严严)
Current Assignee
Xiamen University
Original Assignee
Xiamen University
Priority date
Filing date
Publication date
Application filed by Xiamen University
Priority to CN202010081733.8A
Publication of CN111291679A
Application granted
Publication of CN111291679B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

A target-specific response attention target tracking method based on a twin network, relating to computer vision technology. The method is proposed to address the deficiency that the original twin-network-based target tracking methods are not robust enough to complex tracking scenes such as fast target motion, occlusion, rotation, and background clutter. The proposed target response attention module effectively weakens the influence of noise information on tracking performance during tracking, while strengthening the feature information that is discriminative for appearance changes of the target object, so that the twin network generates a better target response map for target position prediction and thus achieves more robust tracking performance. It comprises five main parts: CNN feature extraction; channel-wise cross-correlation to generate response maps; generating weights with an attention network and weighting each channel response map; determining the target position on the final response map; and the training method of the proposed model.

Description

Target specific response attention target tracking method based on twin network
Technical Field
The invention relates to a computer vision technology, in particular to a target specific response attention target tracking method based on a twin network.
Background
Target tracking is a basic task in computer vision, with wide application in video surveillance, vehicle navigation, augmented reality, and other fields. Given a video sequence, target tracking selects an object of interest in the first frame and predicts the position of that object in subsequent frames by computer vision algorithms. In recent years, twin-network-based tracking methods have attracted researchers' attention because they offer both tracking accuracy and real-time speed, but the performance of these algorithms is easily affected by fast motion of the target or camera, appearance changes of the target, cluttered backgrounds, and similar conditions that are unavoidable in practice.
In deep learning, an attention mechanism enables a model to selectively capture the important parts of its input for a particular task, and it can improve model performance without excessive computation and storage cost. Attention mechanisms have therefore been widely used in image captioning, machine translation, speech recognition, and other fields. For target tracking, DAVT applies discriminative spatial attention to identify specific regions on the target. ACFN attempts to use an attention mechanism to select a subset of correlation filters for tracking. CSR-DCF uses a foreground spatial reliability map to constrain correlation filter learning. RTT uses a multi-directional recurrent neural network to generate saliency and select reliable regions belonging to the target object. These pioneering algorithms verify the superiority of the attention mechanism for target tracking and point to the prospect of improving tracking performance by exploiting attention.
Disclosure of Invention
The invention aims to provide a target-specific response attention target tracking method based on a twin network, which can effectively cope with complex tracking scenes such as fast target motion, occlusion, rotation, and background clutter.
The invention comprises the following steps:
1) A video sequence is given, whose first frame contains an annotated target. A target template region Z and a target search region X are defined: the target template region Z is cropped in the first frame based on the given annotation and then kept unchanged, while the target search region X is obtained from the current video frame under test by cropping, around the target position obtained in the previous frame, an image block larger than the target template region Z;
2) Inputting the target template region and the target search region of step 1) into a fully convolutional twin network to obtain the CNN feature F_z of the target template region Z and the CNN feature F_x of the target search region X;
3) Inputting the CNN features F_x and F_z obtained in step 2) into the target-specific response attention model to obtain a multi-channel response map S_multi weighted by the attention model; summing S_multi channel by channel gives the final response map S, and the position with the maximum response value in S is determined as the initial target position;
4) Taking the target position obtained in the previous frame in step 1) as the center, a search scale pyramid is constructed for the target search region; step 3) is executed for the search region at each estimated scale in the pyramid, the scale of the target search region with the highest response value is selected as the scale corresponding to the current frame, and the target position and scale are combined to obtain the actual size and position of the target, thereby realizing target tracking;
5) Training the model: model training is independent of the tracking process; after the model is trained offline, the trained model is used in the tracking steps above.
In step 1), the specific steps of obtaining the target search region X from the current video frame under test are as follows:
(1) In the initial frame, the target template region is cropped, according to the ground-truth annotation, slightly larger than the actual target. The actual crop size S_z of the template is calculated according to the following formula:
S_z = sqrt((w_z + c)(h_z + c))
where c = (w_z + h_z)/2, w_z denotes the width of the target template region Z, and h_z denotes its height. The cropped template image block is then resized to 125 × 125, and the scaling factor scale = 125/S_z is saved; this size is used to calculate the crop size of the search region.
(2) The search region is resized to 255 × 255. To ensure that the target scale in the search region is consistent with the scale of the template, the actual crop size of the search region is: S_x = 255/scale.
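As a concrete illustration of the crop-size arithmetic above, here is a minimal Python sketch; the function name crop_sizes and its return convention are our own, not taken from the patent:

```python
import math

def crop_sizes(w_z, h_z):
    """Template and search-area crop sizes from step 1.

    w_z, h_z: width and height of the annotated target box in pixels.
    Returns (S_z, scale, S_x) following the formulas in the text.
    """
    c = (w_z + h_z) / 2.0                   # context margin c = (w_z + h_z) / 2
    s_z = math.sqrt((w_z + c) * (h_z + c))  # S_z = sqrt((w_z + c)(h_z + c))
    scale = 125.0 / s_z                     # template crop is resized to 125 x 125
    s_x = 255.0 / scale                     # search crop size keeping the target scale consistent
    return s_z, scale, s_x

# Example: a 50 x 80 target gives S_z = sqrt(115 * 145), roughly 129 pixels.
print(crop_sizes(50, 80))
```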
In step 2), the fully convolutional twin network adopts a five-layer network structure similar to AlexNet, with two branches: one for extracting the features of the target template region and one for extracting the features of the target search region, the two branches sharing the same parameters; the process is described as follows:
F_x = ψ_p(X)
F_z = ψ_p(Z)
The fully convolutional twin network and the proposed target attention model are pre-trained on the visual recognition dataset ILSVRC2015_VID. The twin network and the target-specific response attention model are combined into a unified framework: a target template region sampled from one frame of a video sequence and a target search region sampled from a subsequent frame of the same video sequence are input, and a response map is output. The loss function adopted during training is the cross-entropy loss:
L(y, v) = log(1 + e^(-yv))
where y is the target label and v is the score at the corresponding position of the response map.
In step 2), inputting the target template region and the target search region of step 1) into the fully convolutional twin network to obtain the CNN feature F_z of the target template region Z and the CNN feature F_x of the target search region X comprises the following specific steps:
(1) After the input images of the fully convolutional twin network are resized, the network, a fully convolutional network of five convolutional layers, extracts the coarse-grained features of the input image;
(2) network details, each convolution layer is followed by a batch normalization layer and an activation layer ReLU; 1. a first convolution layer having a convolution kernel size of 11 × 11, a convolution step of 2, an input channel number of 3, and an output channel number of 96; 2. a first pooling layer, pooling size 3 x 3, maximal pooling of step size 2; 3. a second convolution layer, the convolution kernel is 5 multiplied by 5, the convolution step length is 1, the number of input channels is 96, and the number of output channels is 256; 4. the second pooling layer is the same as the first pooling layer, the pooling size is 3 × 3, and the step size is 2; 5. the third convolution layer, the convolution kernel size is 3 multiplied by 3, the convolution step size is 1, the input channel number is 256, and the output channel number is 384; 6. a fourth convolution layer having a convolution kernel size of 3 × 3, a convolution step size of 1, an input channel number 384, and an output channel number 384; 7. the fifth convolutional layer has a convolutional kernel size of 3 × 3, convolution step 1, input channel number 384, and output channel number 256.
(3) The whole tracking algorithm model has two branch networks, a template branch and a search-region branch; both branches use the network described in step (2) with identical parameters. The proposed target-specific response attention model immediately follows the search-region branch.
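The backbone specified in steps (1) and (2) can be transcribed directly; the following PyTorch sketch follows the layer list above (grouping conv-BN-ReLU into a helper and using nn.Sequential are our implementation choices, not prescribed by the text):

```python
import torch
import torch.nn as nn

class SiameseBackbone(nn.Module):
    """Five-layer AlexNet-like fully convolutional backbone of step 2.

    Each convolution is followed by batch normalization and ReLU,
    as stated in the text; the same module serves both branches.
    """
    def __init__(self):
        super().__init__()
        def conv_bn_relu(cin, cout, k, s):
            return [nn.Conv2d(cin, cout, k, stride=s),
                    nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
        layers = []
        layers += conv_bn_relu(3, 96, 11, 2)    # conv1: 11x11, stride 2, 3 -> 96
        layers += [nn.MaxPool2d(3, 2)]          # pool1: 3x3 max pooling, stride 2
        layers += conv_bn_relu(96, 256, 5, 1)   # conv2: 5x5, stride 1, 96 -> 256
        layers += [nn.MaxPool2d(3, 2)]          # pool2: 3x3 max pooling, stride 2
        layers += conv_bn_relu(256, 384, 3, 1)  # conv3: 3x3, stride 1, 256 -> 384
        layers += conv_bn_relu(384, 384, 3, 1)  # conv4: 3x3, stride 1, 384 -> 384
        layers += conv_bn_relu(384, 256, 3, 1)  # conv5: 3x3, stride 1, 384 -> 256
        self.features = nn.Sequential(*layers)

    def forward(self, x):
        return self.features(x)

backbone = SiameseBackbone()
f_z = backbone(torch.randn(1, 3, 125, 125))  # template features F_z: (1, 256, 5, 5)
f_x = backbone(torch.randn(1, 3, 255, 255))  # search features F_x: (1, 256, 22, 22)
print(f_z.shape, f_x.shape)
```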
In step 3), the specific steps of obtaining the final response map S may be:
(1) performing channel-by-channel cross-correlation on the CNN features F_x and F_z of the target search region X and the target template region Z to obtain a multi-channel response map, represented as:
S_multi = Corr_cw(F_x, F_z);
(2) f is to bexInputting an attention network H (·), obtaining an attention weight omega of a channel, wherein the attention network is composed of a global mean pooling and a three-layer multilayer perceptron and is represented as follows:
ω=H(Fx)
(3) weighting the calculated channel attention weights ω onto the multi-channel response map S_multi to obtain the weighted multi-channel response map Ŝ_multi = ω ⊗ S_multi; Ŝ_multi and S_multi are then added in a residual structure and summed over channels to obtain the final response map S_final. The overall process is represented by the following equation:
S_final = Σ_{c=1}^{C} (Ŝ_multi + S_multi)^(c),  where Ŝ_multi = ω ⊗ S_multi
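A minimal PyTorch sketch of step 3 follows. Realizing the channel-wise cross-correlation with a grouped convolution, the hidden width of the three-layer perceptron, and the final sigmoid on the weights are all our assumptions; the patent fixes none of them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TargetResponseAttention(nn.Module):
    """Target-specific response attention of step 3 (a sketch)."""

    def __init__(self, channels=256, hidden=64):
        super().__init__()
        # H(.): global average pooling followed by a three-layer MLP.
        self.mlp = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels), nn.Sigmoid())

    def forward(self, f_x, f_z):
        b, c, hx, wx = f_x.shape
        # S_multi = Corr_cw(F_x, F_z): a grouped convolution correlates
        # channel i of F_x with channel i of F_z.
        s_multi = F.conv2d(f_x.reshape(1, b * c, hx, wx),
                           f_z.reshape(b * c, 1, *f_z.shape[-2:]),
                           groups=b * c)
        s_multi = s_multi.reshape(b, c, *s_multi.shape[-2:])
        # omega = H(F_x): one attention weight per channel.
        omega = self.mlp(f_x.mean(dim=(2, 3))).reshape(b, c, 1, 1)
        # Residual weighting, then channel-wise sum -> final response map.
        return (omega * s_multi + s_multi).sum(dim=1, keepdim=True)
```

With the backbone features above (F_z of size 256 × 5 × 5 and F_x of size 256 × 22 × 22), the returned response map has spatial size 18 × 18.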
In step 4), the specific steps of selecting the scale of the target search region with the highest response value as the scale corresponding to the current frame may be:
(1) In order to balance tracking accuracy and speed, three scales are selected for the multi-scale search, with the three scale factors λ_i taking the specific values [0.96385, 1, 1.0375]. Let the actual crop size of the search region in the previous frame be S_x; the actual crop size at each scale is then
S_x^(i) = λ_i × S_x,  i = 1, 2, 3
(2) The image blocks at the three scales are all resized to 255 × 255 and the scale search is performed: first the maximum value of the response map at each scale is found, then the maximum response values of the three scales are compared, and the scale containing the overall maximum response is the scale corresponding to the current frame.
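The scale selection can be sketched as below; select_scale is our own name, and track_one_scale is a hypothetical helper standing in for "crop at the given size around the previous position, resize to 255 × 255, and run step 3)":

```python
import numpy as np

SCALE_FACTORS = (0.96385, 1.0, 1.0375)  # the three scale factors lambda_i

def select_scale(track_one_scale, frame, center, s_x):
    """Multi-scale search of step 4 (a sketch).

    track_one_scale(frame, center, size) is assumed to return the final
    response map for one crop size as a 2-D numpy array.
    """
    responses = [track_one_scale(frame, center, lam * s_x)
                 for lam in SCALE_FACTORS]      # crop size per scale: lambda_i * S_x
    peaks = [r.max() for r in responses]        # maximum response at each scale
    best = int(np.argmax(peaks))                # the scale with the overall maximum wins
    dy, dx = np.unravel_index(responses[best].argmax(),
                              responses[best].shape)  # peak -> target position
    return best, (dy, dx), SCALE_FACTORS[best] * s_x
```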
In step 5), the specific steps of training the model may be:
(1) 53200 training pairs are sampled from the visual recognition dataset ILSVRC2015_VID; each training pair consists of a template and a search-region block, the template and search-region block of the same pair belong to the same video sequence, and the template's frame precedes that of the search-region block;
(2) Since the target of the original search region lies at the center of the region, the target center is randomly shifted by 0-8 pixels during training, improving the network's generalization to target displacement;
(3) The model is iteratively trained for 50 epochs with a batch size of 8, and the learning rate decays exponentially from 10^-2 to 10^-5. The training loss function L is a weighted cross-entropy loss, as follows:
L = (1/n) Σ_{i=1}^{n} w_i log(1 + e^(-y_i v_i))
where w represents the weight of each sample, y represents the true label value, v represents the value predicted by the model, and n represents the number of samples taken in a batch.
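A direct transcription of this loss in PyTorch; softplus(-y·v) equals log(1 + e^(-y·v)) and is numerically stable, and the labels are assumed to take values in {-1, +1}:

```python
import torch
import torch.nn.functional as F

def weighted_logistic_loss(v, y, w):
    """Weighted cross-entropy loss of step 5 (a sketch).

    v: predicted response-map scores, y: labels in {-1, +1},
    w: per-sample weights; all tensors of the same shape.
    Computes (1/n) * sum_i w_i * log(1 + exp(-y_i * v_i)).
    """
    return (w * F.softplus(-y * v)).mean()
```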
Aiming at the deficiency that the original twin-network-based target tracking methods are not robust enough to complex tracking scenes such as fast target motion, occlusion, rotation, and background clutter, the invention provides a target-specific response attention target tracking method based on a twin network. The method comprises five main parts: CNN feature extraction; channel-wise cross-correlation to generate response maps; generating weights with an attention network and weighting each channel response map; determining the target position on the final response map; and the training method of the proposed model.
Compared with existing attention mechanisms, the attention network of the invention is learned in an end-to-end manner. By combining the target-specific response attention model and the fully convolutional twin network in a unified framework, complex tracking scenes in actual tracking, such as fast target motion, occlusion, rotation, and background clutter, are handled effectively, and real-time operation is achieved.
Drawings
FIG. 1 is a block diagram of an embodiment of the present invention.
Fig. 2 is a response diagram of each channel obtained by the channel-by-channel cross-correlation operation in the present invention.
Fig. 3 is a comparison of the resulting response map of the present invention and a baseline algorithm.
Detailed Description
The following examples, given on the premise of the technical solution of the invention, describe the embodiments and specific operation processes of the invention in detail; the scope of protection of the invention, however, is not limited to these examples.
Referring to fig. 1, an embodiment of the present invention includes the steps of:
A. given a video sequence, the first frame contains a marked object. And defining a target template area Z and a target search area X, wherein the target template area is kept unchanged after being intercepted based on a given mark in a first frame, and the target search area is obtained in a current video frame to be tested and is an image block which is larger than the target template area and is intercepted by using the target position obtained from the previous frame.
B. Inputting the target template region and the target search region described in step A into the fully convolutional twin network yields the CNN feature F_z of the target template region Z and the CNN feature F_x of the target search region X. The fully convolutional twin network adopts a five-layer network structure similar to AlexNet, with two branches: one extracts the features of the target template region and the other extracts the features of the target search region. Both branches share the same parameters. The process is described as follows:
F_x = ψ_p(X)
F_z = ψ_p(Z)
C. These two CNN features F_x and F_z are input into the target-specific response attention model, yielding a multi-channel response map S_multi weighted by the attention mechanism; S_multi is summed channel by channel to obtain the final response map S, and the position with the maximum response value in the response map is determined as the initial target position.
D. For scale estimation, a search scale pyramid is constructed in advance for the target search region of step A, centered on the target of the previous frame; step C is executed for the search region at each scale in the pyramid, the scale of the search region with the highest response value is selected as the current scale, and the target position and scale are combined to obtain the actual size and position of the target, realizing target tracking.
E. Model training: the twin network and the proposed target attention model are pre-trained on the visual recognition dataset ILSVRC2015_VID, whose training set contains 3862 video segments totalling more than 1.12 million frames, all annotated with category information and target positions. With the twin network and the target-specific response attention model combined into a unified framework, a target template region sampled from one frame of a video sequence and a target search region sampled from a subsequent frame of the same sequence are input, and a response map is output. The loss function adopted during training is the cross-entropy loss L(y, v) = log(1 + e^(-yv)), where y is the target label and v is the score at the corresponding position of the response map.
The parameters of the single-frame target tracking process in step A are further described as follows:
A1. In the initial frame, the target template region cropped according to the ground-truth annotation is slightly larger than the actual target in order to capture some semantic information. The actual template crop size S_z is calculated according to the following formula:
S_z = sqrt((w_z + c)(h_z + c))
where c = (w_z + h_z)/2, w_z denotes the width of the target template region Z, and h_z denotes its height. The cropped template image block is then resized to 125 × 125, and the rescaling factor scale = 125/S_z is saved; this size is used to calculate the crop size of the search region.
A2. The search region is resized to 255 × 255. To ensure that the target scale in the search region is consistent with the scale of the template, the actual crop size of the search region is:
S_x = 255/scale
the deep full convolution neural network adopted in the step B comprises the following substeps:
B1. After the input images are resized, the network, a fully convolutional twin network of five convolutional layers, extracts the coarse-grained features of the input image. A structure similar to AlexNet is adopted (A. Krizhevsky, I. Sutskever, G. E. Hinton: ImageNet classification with deep convolutional neural networks. In: NIPS, pp. 1106-1114 (2012)).
B2. Network details: each convolutional layer is followed by a batch normalization layer and a ReLU activation layer. 1. The first convolutional layer: kernel size 11 × 11, stride 2, 3 input channels, 96 output channels; 2. a 3 × 3 max pooling layer with stride 2; 3. the second convolutional layer: kernel size 5 × 5, stride 1, 96 input channels, 256 output channels; 4. the second pooling layer: pooling size 3 × 3, stride 2; 5. the third convolutional layer: kernel size 3 × 3, stride 1, 256 input channels, 384 output channels; 6. the fourth convolutional layer: kernel size 3 × 3, stride 1, 384 input channels, 384 output channels; 7. the fifth convolutional layer: kernel size 3 × 3, stride 1, 384 input channels, 256 output channels.
B3. The whole tracking algorithm model has two branch networks, a template branch and a search-region branch; both branches use the network described in B2 with identical parameters. The attention model of the target-specific response immediately follows the search-region branch.
The target-specific response attention model in step C further comprises the following sub-steps:
C1. First, channel-by-channel cross-correlation is performed on the CNN features F_x and F_z of X and Z to obtain a multi-channel response map, expressed by the following formula:
S_multi = Corr_cw(F_x, F_z)
C2. F_x is input into an attention network H(·) to obtain the channel attention weights ω; the attention network consists of a global average pooling layer and a three-layer multilayer perceptron. This process can be expressed as:
ω = H(F_x)
C3. Weighting the calculated weights ω onto the multi-channel response map S_multi yields the weighted multi-channel response map Ŝ_multi = ω ⊗ S_multi; Ŝ_multi and S_multi are then added in a residual structure and summed over channels to obtain the final response map S_final. The overall process is represented by the following equation:
S_final = Σ_{c=1}^{C} (Ŝ_multi + S_multi)^(c)
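To make the wiring of steps B and C concrete, here is a minimal sketch that reuses the SiameseBackbone and TargetResponseAttention classes sketched earlier; the class name TrackerModel is our own:

```python
import torch
import torch.nn as nn

class TrackerModel(nn.Module):
    """Unified model of steps B and C: one shared backbone for both
    branches, with the target-specific response attention head placed
    after the search-region branch (a sketch)."""

    def __init__(self):
        super().__init__()
        self.backbone = SiameseBackbone()                # shared parameters
        self.attention = TargetResponseAttention(channels=256)

    def forward(self, z, x):
        f_z = self.backbone(z)  # template branch: F_z
        f_x = self.backbone(x)  # search-region branch: F_x
        return self.attention(f_x, f_z)

model = TrackerModel()
resp = model(torch.randn(1, 3, 125, 125), torch.randn(1, 3, 255, 255))
print(resp.shape)  # final response map, e.g. (1, 1, 18, 18)
```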
the parameters and flow in the multi-scale strategy of the single frame in the step D are further described as follows:
D1. During tracking, three scales are selected for the multi-scale search in order to balance tracking accuracy and speed, with the three scale factors λ_i taking the specific values [0.96385, 1, 1.0375]. Let the crop size of the search region in the previous frame be S_x; the actual crop size at each scale is then
S_x^(i) = λ_i × S_x,  i = 1, 2, 3
D2. The image blocks at the three scales are all resized to 255 × 255 and the scale search is performed: the maximum value of the response map at each scale is found first, the maximum response values of the three scales are compared, and the scale containing the overall maximum response is the scale corresponding to the current frame.
The parameters of the model training process in step E and the flow thereof are further described as follows:
E1. 53200 training pairs are sampled from ILSVRC2015_VID; each training pair consists of a template and a search-region block, the template and search-region block of the same pair belong to the same video sequence, and the template's frame precedes that of the search-region block.
E2. Since the target of the original search region lies at the center of the region, the target center is randomly shifted by 0 to 8 pixels during training, improving the network's generalization to target displacement.
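A sketch of this augmentation, assuming the 0-8 pixel shift is sampled independently per axis (the text does not fix the exact sampling scheme):

```python
import random

def jitter_center(cx, cy, max_shift=8):
    """Random target-center jitter of step E2 (a sketch).

    Shifts the search-area crop center by up to max_shift pixels per
    axis so the target is not always exactly centered during training.
    """
    return (cx + random.randint(-max_shift, max_shift),
            cy + random.randint(-max_shift, max_shift))
```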
E3. The model is iteratively trained for 50 epochs with a batch size of 8, and the learning rate decays exponentially from 10^-2 to 10^-5. The training loss function is a weighted cross-entropy loss, expressed as:
L = (1/n) Σ_{i=1}^{n} w_i log(1 + e^(-y_i v_i))
where w represents the weight of each sample, y represents the true label value, v represents the value predicted by the model, and n represents the number of samples taken in a batch.
Table 1 shows the video attribute analysis results of the method of the invention compared with other methods on OTB100.
(Table 1 and its continuation are provided as images in the original publication and are not reproduced here.)
SiamFC corresponds to the method proposed by Bertinetto, L. et al. (Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: Fully-convolutional siamese networks for object tracking. In: European Conference on Computer Vision Workshops (ECCV Workshops), pp. 850-865 (2016));
SiamTri corresponds to the method proposed by Dong, X. et al. (Dong, X., Shen, J.: Triplet loss in siamese network for object tracking. In: European Conference on Computer Vision (ECCV), pp. 472-488 (2018));
SRDCF corresponds to the method proposed by Danelljan, M. et al. (Danelljan, M., Häger, G., Khan, F.S., Felsberg, M.: Learning spatially regularized correlation filters for visual tracking. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4310-4318 (2015));
CSR-DCF corresponds to the method proposed by Lukežič, A. et al. (Lukežič, A., Vojíř, T., Zajc, L.Č., Matas, J., Kristan, M.: Discriminative correlation filter with channel and spatial reliability. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4847-4856 (2017));
TRACA corresponds to the method proposed by Choi, J. et al. (Choi, J., Chang, H.J., Fischer, T., Yun, S., Lee, K., Jeong, J., Demiris, Y., Choi, J.Y.: Context-aware deep feature compression for high-speed visual tracking. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 479-488 (2018));
CFNet corresponds to the method proposed by Valmadre, J. et al. (Valmadre, J., Bertinetto, L., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: End-to-end representation learning for correlation filter based tracking. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5000-5008 (2017));
ACFN corresponds to the method proposed by Choi, J. et al. (Choi, J., Chang, H.J., Yun, S., Fischer, T., Demiris, Y., Choi, J.Y.: Attentional correlation filter network for adaptive visual tracking. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4828-4837 (2017));
Staple corresponds to the method proposed by Bertinetto, L. et al. (Bertinetto, L., Valmadre, J., Golodetz, S., Miksik, O., Torr, P.H.S.: Staple: Complementary learners for real-time tracking. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1401-1409 (2016));
KCF corresponds to the method proposed by Henriques, J.F. et al. (Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 583-596 (2015)).
As can be seen from Table 1, the method can effectively handle complex tracking scenes in actual tracking, such as fast target motion, occlusion, rotation, and background clutter, and its performance is superior to that of the other trackers, demonstrating the effectiveness of the proposed method.

Claims (6)

1. A target-specific response attention target tracking method based on a twin network, characterized by comprising the following steps:
1) a video sequence is given, whose first frame contains an annotated target; a target template region Z and a target search region X are defined, where the target template region Z is cropped in the first frame based on the given annotation and then kept unchanged, and the target search region X is obtained from the current video frame under test by cropping, around the target position obtained in the previous frame, an image block larger than the target template region Z;
2) inputting the target template region and the target search region of step 1) into a fully convolutional twin network to obtain the CNN feature F_z of the target template region Z and the CNN feature F_x of the target search region X;
3) inputting the CNN features F_x and F_z obtained in step 2) into the target-specific response attention model to obtain a multi-channel response map S_multi weighted by the attention model; summing S_multi channel by channel gives the final response map S, and the position with the maximum response value in S is determined as the initial target position;
the specific steps of obtaining the final response map S are as follows:
(1) performing channel-by-channel cross-correlation on the CNN features F_x and F_z of the target search region X and the target template region Z to obtain a multi-channel response map, represented as:
S_multi = Corr_cw(F_x, F_z);
(2) f is to bexInputting an attention network H (·), obtaining an attention weight omega of a channel, wherein the attention network is composed of a global mean pooling and a three-layer multilayer perceptron and is represented as follows:
ω=H(Fx)
(3) weighting the calculated channel attention weights ω onto the multi-channel response map S_multi to obtain the weighted multi-channel response map Ŝ_multi = ω ⊗ S_multi; Ŝ_multi and S_multi are then added in a residual structure and summed over channels to obtain the final response map S_final, the overall process being represented by the following equation:
S_final = Σ_{c=1}^{C} (Ŝ_multi + S_multi)^(c)
4) taking the target position obtained in the previous frame in step 1) as the center, constructing a search scale pyramid for the target search region, executing step 3) for the search region at each estimated scale in the scale pyramid, selecting the scale of the target search region with the highest response value as the scale corresponding to the current frame, and combining the target position and scale to obtain the actual size and position of the target, thereby realizing target tracking;
5) training the model: model training is independent of the tracking process; after the model is trained offline, the trained model is used in the tracking steps.
2. The twin-network-based target-specific response attention target tracking method as claimed in claim 1, wherein in step 1) the specific steps of obtaining the target search region X from the current video frame under test are as follows:
1.1 in the initial frame the target template region is cropped according to the ground-truth annotation slightly larger than the actual target in order to capture some semantic information; its size S_z is calculated according to the following formula:
S_z = sqrt((w_z + c)(h_z + c))
where c = (w_z + h_z)/2, w_z denotes the width of the target template region Z, and h_z denotes its height; the cropped template image block is then resized to 125 × 125, and the scaling factor scale = 125/S_z is saved, which is used to calculate the crop size of the search region;
1.2 the search region is resized to 255 × 255; to ensure that the target scale in the search region is consistent with the scale of the template, the actual crop size of the search region is: S_x = 255/scale.
3. The twin-network-based target-specific response attention target tracking method as claimed in claim 1, characterized in that in step 2) the fully convolutional twin network adopts a five-layer network structure similar to AlexNet, with two branches: one for extracting the features of the target template region and one for extracting the features of the target search region, the two branches sharing the same parameters; the process is described as follows:
F_x = ψ_p(X)
F_z = ψ_p(Z)
the fully convolutional twin network and the proposed target attention model are pre-trained on the visual recognition dataset ILSVRC2015_VID by combining the twin network and the target-specific response attention model into a unified framework, inputting a target template region sampled from one frame of a video sequence and a target search region sampled from a subsequent frame of the same video sequence, and outputting a response map, the loss function adopted during training being the cross-entropy loss:
L(k, u) = log(1 + e^(-ku))
where k is the target label and u is the score at the corresponding position of the response map.
4. The twin-network-based target-specific response attention target tracking method as claimed in claim 1, characterized in that in step 2) the target template region and the target search region of step 1) are input into the fully convolutional twin network to obtain the CNN feature F_z of the target template region Z and the CNN feature F_x of the target search region X through the following specific steps:
2.1 after the input images of the fully convolutional twin network are resized, the network, a fully convolutional network of five convolutional layers, extracts the coarse-grained features of the input image;
2.2 network details: each convolutional layer is followed by a batch normalization layer and a ReLU activation layer; ① the first convolutional layer: kernel size 11 × 11, stride 2, 3 input channels, 96 output channels; ② the first pooling layer: 3 × 3 max pooling with stride 2; ③ the second convolutional layer: kernel size 5 × 5, stride 1, 96 input channels, 256 output channels; ④ the second pooling layer, identical to the first: pooling size 3 × 3, stride 2; ⑤ the third convolutional layer: kernel size 3 × 3, stride 1, 256 input channels, 384 output channels; ⑥ the fourth convolutional layer: kernel size 3 × 3, stride 1, 384 input channels, 384 output channels; ⑦ the fifth convolutional layer: kernel size 3 × 3, stride 1, 384 input channels, 256 output channels;
2.3 the whole tracking algorithm model has two branch networks, a template branch and a search-region branch; both branches use the network described in step 2.2 with identical parameters; the proposed target-specific response attention model immediately follows the search-region branch.
5. The twin-network-based target-specific response attention target tracking method as claimed in claim 1, wherein in step 4) the specific steps of selecting the scale of the target search region with the highest response value as the scale corresponding to the current frame are:
4.1 in order to balance tracking accuracy and speed, three scales are selected for the multi-scale search, with the three scale factors λ_i taking the specific values [0.96385, 1, 1.0375]; let the actual crop size of the search region in the previous frame be S_x, then the actual crop size at each scale is
S_x^(i) = λ_i × S_x,  i = 1, 2, 3
4.2 the image blocks at the three scales are all resized to 255 × 255 and the scale search is performed: first the maximum value of the response map at each scale is found, then the maximum response values of the three scales are compared, the scale containing the overall maximum response being the scale corresponding to the current frame.
6. The twin network-based target specific response attention target tracking method as claimed in claim 1, wherein in step 5), the specific steps of training the model are:
5.1 sampling 53200 training pairs from the visual recognition dataset ILSVRC2015_VID, each training pair consisting of a template and a search-region block, where the template and search-region block of the same pair belong to the same video sequence and the template's frame precedes that of the search-region block;
5.2 for the original search region, the target lies at the center of the region; during training the target center is randomly shifted by 0 to 8 pixels to improve the network's generalization to target displacement;
5.3 the model is iteratively trained for 50 epochs with a batch size of 8, the learning rate decaying exponentially from 10^-2 to 10^-5; the training loss function L is a weighted cross-entropy loss, as follows:
L = (1/n) Σ_{i=1}^{n} w_i log(1 + e^(-y_i v_i))
where w represents the weight of each sample, y represents the true label value, v represents the value predicted by the model, and n represents the number of samples sampled in a batch.
CN202010081733.8A 2020-02-06 2020-02-06 Target specific response attention target tracking method based on twin network Active CN111291679B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010081733.8A CN111291679B (en) 2020-02-06 2020-02-06 Target specific response attention target tracking method based on twin network


Publications (2)

Publication Number Publication Date
CN111291679A CN111291679A (en) 2020-06-16
CN111291679B true CN111291679B (en) 2022-05-27

Family

ID=71026700

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010081733.8A Active CN111291679B (en) 2020-02-06 2020-02-06 Target specific response attention target tracking method based on twin network

Country Status (1)

Country Link
CN (1) CN111291679B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860248B (en) * 2020-07-08 2021-06-25 上海蠡图信息科技有限公司 Visual target tracking method based on twin gradual attention-guided fusion network
CN111787227A (en) * 2020-07-22 2020-10-16 苏州臻迪智能科技有限公司 Style migration method and device based on tracking shooting
CN111899283B (en) * 2020-07-30 2023-10-17 北京科技大学 Video target tracking method
CN112150504A (en) * 2020-08-03 2020-12-29 上海大学 Visual tracking method based on attention mechanism
CN112085718B (en) * 2020-09-04 2022-05-10 厦门大学 NAFLD ultrasonic video diagnosis system based on twin attention network
CN112288772B (en) * 2020-10-14 2022-06-07 武汉大学 Channel attention target tracking method based on online multi-feature selection
CN112348849B (en) * 2020-10-27 2023-06-20 南京邮电大学 Twin network video target tracking method and device
CN112215872B (en) * 2020-11-04 2024-03-22 上海海事大学 Multi-full convolution fusion single-target tracking method based on twin network
CN112560695B (en) * 2020-12-17 2023-03-24 中国海洋大学 Underwater target tracking method, system, storage medium, equipment, terminal and application
CN112598739B (en) * 2020-12-25 2023-09-01 哈尔滨工业大学(深圳) Mobile robot infrared target tracking method, system and storage medium based on space-time characteristic aggregation network
CN112750148B (en) * 2021-01-13 2024-03-22 浙江工业大学 Multi-scale target perception tracking method based on twin network
CN112785626A (en) * 2021-01-27 2021-05-11 安徽大学 Twin network small target tracking method based on multi-scale feature fusion
CN113192124A (en) * 2021-03-15 2021-07-30 大连海事大学 Image target positioning method based on twin network
CN113158881B (en) * 2021-04-19 2022-06-14 电子科技大学 Cross-domain pedestrian re-identification method based on attention mechanism
CN113205544B (en) * 2021-04-27 2022-04-29 武汉大学 Space attention reinforcement learning tracking method based on cross-over ratio estimation
CN113362373B (en) * 2021-06-01 2023-12-15 北京首都国际机场股份有限公司 Double-twin-network-based aircraft tracking method in complex apron area
CN113658218B (en) * 2021-07-19 2023-10-13 南京邮电大学 Dual-template intensive twin network tracking method, device and storage medium
CN113379806B (en) * 2021-08-13 2021-11-09 南昌工程学院 Target tracking method and system based on learnable sparse conversion attention mechanism
CN113628249B (en) * 2021-08-16 2023-04-07 电子科技大学 RGBT target tracking method based on cross-modal attention mechanism and twin structure
CN113793359B (en) * 2021-08-25 2024-04-05 西安工业大学 Target tracking method integrating twin network and related filtering
CN113888590B (en) * 2021-09-13 2024-04-16 华南理工大学 Video target tracking method based on data enhancement and twin network
CN113870312B (en) * 2021-09-30 2023-09-22 四川大学 Single target tracking method based on twin network
CN113936040B (en) * 2021-10-15 2023-09-15 哈尔滨工业大学 Target tracking method based on capsule network and natural language query
CN113822233B (en) * 2021-11-22 2022-03-22 青岛杰瑞工控技术有限公司 Method and system for tracking abnormal fishes cultured in deep sea
CN114241003B (en) * 2021-12-14 2022-08-19 成都阿普奇科技股份有限公司 All-weather lightweight high-real-time sea surface ship detection and tracking method
CN116645399B (en) * 2023-07-19 2023-10-13 山东大学 Residual network target tracking method and system based on attention mechanism

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108898620A (en) * 2018-06-14 2018-11-27 厦门大学 Method for tracking target based on multiple twin neural network and regional nerve network
CN109978921A (en) * 2019-04-01 2019-07-05 南京信息工程大学 A kind of real-time video target tracking algorithm based on multilayer attention mechanism
CN110335290A (en) * 2019-06-04 2019-10-15 大连理工大学 Twin candidate region based on attention mechanism generates network target tracking method
CN110223324A (en) * 2019-06-05 2019-09-10 东华大学 A kind of method for tracking target of the twin matching network indicated based on robust features
CN110675423A (en) * 2019-08-29 2020-01-10 电子科技大学 Unmanned aerial vehicle tracking method based on twin neural network and attention model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Twofold Siamese Network for Real-Time Object Tracking; Anfeng He et al.; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018-12-17; full text *
Multi-granularity Hierarchical Attention Siamese Network for Visual Tracking; Xing Chen et al.; 2018 International Joint Conference on Neural Networks (IJCNN); 2018-10-15; full text *
Target tracking method based on twin network and attention mechanism; Zhou Diya et al.; Information & Communications; 2019-12-15 (No. 12); full text *


Similar Documents

Publication Publication Date Title
CN111291679B (en) Target specific response attention target tracking method based on twin network
CN108549839B (en) Adaptive feature fusion multi-scale correlation filtering visual tracking method
CN111914664A (en) Vehicle multi-target detection and track tracking method based on re-identification
CN111460968B (en) Unmanned aerial vehicle identification and tracking method and device based on video
CN108647694B (en) Context-aware and adaptive response-based related filtering target tracking method
CN110569723A (en) Target tracking method combining feature fusion and model updating
CN108961308B (en) Residual error depth characteristic target tracking method for drift detection
CN110942471B (en) Long-term target tracking method based on space-time constraint
CN113591968A (en) Infrared weak and small target detection method based on asymmetric attention feature fusion
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN111738344A (en) Rapid target detection method based on multi-scale fusion
CN110992401A (en) Target tracking method and device, computer equipment and storage medium
CN111640138A (en) Target tracking method, device, equipment and storage medium
CN115171165A (en) Pedestrian re-identification method and device with global features and step-type local features fused
CN113192124A (en) Image target positioning method based on twin network
CN115375737B (en) Target tracking method and system based on adaptive time and serialized space-time characteristics
CN114898403A (en) Pedestrian multi-target tracking method based on Attention-JDE network
CN111340842A (en) Correlation filtering target tracking algorithm based on joint model
CN111507215A (en) Video target segmentation method based on space-time convolution cyclic neural network and cavity convolution
Li et al. Real-time pedestrian detection with deep supervision in the wild
Chen et al. Norm-aware embedding for efficient person search and tracking
Zhou et al. Discriminative attention-augmented feature learning for facial expression recognition in the wild
CN108257148B (en) Target suggestion window generation method of specific object and application of target suggestion window generation method in target tracking
CN117011655A (en) Adaptive region selection feature fusion based method, target tracking method and system
CN111275694A (en) Attention mechanism guided progressive division human body analytic model and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant