CN111723632B - Ship tracking method and system based on twin network

Info

Publication number
CN111723632B
CN111723632B (application CN201911087711.6A)
Authority
CN
China
Prior art keywords
frame
target
frames
prediction
tracking
Prior art date
Legal status
Active
Application number
CN201911087711.6A
Other languages
Chinese (zh)
Other versions
CN111723632A (en
Inventor
单云霄
Current Assignee
Zhuhai Dagama Technology Co., Ltd.
Original Assignee
Zhuhai Dagama Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Zhuhai Dagama Technology Co., Ltd.
Priority to CN201911087711.6A
Publication of CN111723632A
Application granted
Publication of CN111723632B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/20 - Analysis of motion
    • G06T7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 - Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 - Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching

Abstract

The application discloses a ship tracking method and a ship tracking system based on a twin network. Trained on a large amount of data, the model can mine features of the target at different depths and can accurately and efficiently track ships of various types under different weather conditions. On the collected maritime tracking data, the average tracking accuracy is 58% and the average frame rate reaches 124.21 FPS.

Description

Ship tracking method and system based on twin network
Technical Field
The application relates to the technical field of image processing, in particular to a ship tracking method and system based on a twin network.
Background
In recent years, more and more maritime platforms are used in scenarios such as maritime transportation, ecological monitoring, and marine safety, and ocean exploration is receiving wide attention. For platform safety, visual target tracking techniques are widely used to track potential targets of interest such as vessels and buoys. However, achieving accurate and stable tracking in complex harbor environments, particularly for ships, is not easy. A visual tracking algorithm developed for ship tracking must therefore be sufficiently intelligent and adaptable to maritime applications.
Developing a ship vision tracking algorithm poses several challenges. Unlike ground vehicles, a vessel moves on a floating surface, and its motion is complex and cannot be predicted accurately. Furthermore, the marine environment is sensitive to weather and light: the sea is often foggy or rainy, and sea wind exacerbates the swaying of the ship. Adjacent frames can therefore differ greatly from each other because of the vessel's unstable motion. In addition, image quality may be degraded by sunlight reflected from the water surface.
Generative tracking algorithms suffer from high computational complexity, poor real-time performance, and low tracking accuracy. The main reason is that a generative algorithm typically traverses the image under detection and takes the region most similar to the target as the target; to improve accuracy, more complex features such as texture and gradient must be introduced, all of which incur considerable computational cost. For background subtraction, when the contrast between the ship and the background is small, detecting and tracking the target becomes difficult and tracking accuracy drops sharply. Tracking based on level-set segmentation can track a target dynamically, but it requires a prior target contour, which is hard to obtain in practical scenes. The above tracking methods share a common shortcoming: the detection process focuses only on the feature information of the target and ignores background information. Such trackers therefore struggle to distinguish the target from distractors in more complex scenes, leading to false detections.
Disclosure of Invention
The application provides a ship tracking method and a ship tracking system based on a twin network, to solve the prior-art problem that a target tracker has difficulty distinguishing the target from distractors in more complex scenes, leading to false detections.
The specific technical scheme is as follows:
a twin network-based vessel tracking method, the training method comprising:
acquiring a target position of a tracking target in a first frame image, and taking the target position as a reference frame;
preprocessing an input image to generate an anchor block, and calculating a first relative offset between the anchor block and a template frame reference frame, wherein the first relative offset is used as an input of a tracking network;
determining a second relative offset between the anchor block frame and the detected frame prediction frame, wherein the second relative offset is used as the output of a tracking network, and calculating a confidence value corresponding to each prediction frame, wherein the confidence is the reliability of the prediction frame as a target frame;
punishment is carried out on the predicted frames according to the influence of the historical track and the change of the size and the shape, all the predicted frames are reordered according to the confidence value, and the first K predicted frames with the largest confidence value are taken as target candidate frames;
and merging repeated prediction frames by adopting a non-maximum suppression algorithm, taking the prediction frame with the maximum confidence value as a target frame of the detection and taking the prediction frame as a reference frame of the next frame.
Optionally, before taking the target position as the reference box, the method comprises:
extracting template features from the template frame;
extracting detection features from the detection frame;
obtaining the classification values and regression values of the region proposal network from the template features and detection features;
and obtaining the regression boxes from the classification values and regression values.
Optionally, reordering all prediction boxes by confidence value and taking the top K prediction boxes with the largest confidence values as target candidate boxes comprises:
selecting the previous M frame images of the historical trajectory and predicting the position of the target in the detection frame by a least squares method;
calculating the size and shape difference between two adjacent frames based on the position of the target in the template frame, and penalizing prediction boxes whose difference from the previous frame exceeds a threshold;
calculating the confidence value of each candidate box as the target box according to a specified function;
and reordering all prediction boxes by confidence value and taking the top K prediction boxes with the largest confidence values as target candidate boxes.
Optionally, the training method comprises:
A1, calibrating the position of the tracking target;
A2, randomly extracting a pair of images as the template frame and the detection frame with an inter-frame interval of at most 10, preprocessing the input images to generate anchor boxes, and calculating the third relative offset between the template frame and its corresponding ground-truth box;
A3, feeding the preprocessed third relative offset into the tracking network, which outputs the fourth relative offset between the anchor boxes and the prediction boxes of the detection frame;
A4, calculating the cross entropy from the third and fourth relative offsets, and calculating the total loss between the prediction boxes of the detection frame and their ground-truth boxes;
A5, calculating the gradient from the total loss, backpropagating the gradient, and updating the weights;
A6, repeating steps A1 to A5 until the total loss falls within a preset range.
Optionally, calculating the total loss of the prediction boxes and the ground-truth boxes comprises:
obtaining the classification loss by a specified formula;
normalizing the ground-truth box and calculating the regression loss;
and obtaining the total loss from the classification loss, the regression loss, and the loss function.
A twin network-based vessel tracking system, the system comprising:
an initialization module, which requires the position of the tracking target in the first frame image as prior knowledge and takes the position of the target as the reference box;
a preprocessing module, configured to preprocess the input image to generate the template frame, the detection frame, and the anchor boxes; a network processing module, which calculates the first relative offset between the reference box provided by the template frame and the anchor boxes as the input of the tracking network, outputs the second relative offset between the prediction boxes and the anchor boxes, and calculates all prediction boxes from the result;
a selection module, which applies penalties for the historical trajectory and for changes in size and shape, reorders all prediction boxes by confidence value, and takes the top K prediction boxes with the largest confidence values as target candidate boxes; duplicate prediction boxes are merged by non-maximum suppression, and the prediction box with the largest confidence value is taken as the target box of the current detection and as the reference box for the next frame.
Optionally, the network processing module is used for extracting template features from the template frame; extracting detection features from the detection frame; obtaining the classification values and regression values of the region proposal network from the template features and detection features; and obtaining the regression boxes from the classification values and regression values.
Optionally, the selection module is used for selecting the previous M frame images of the historical trajectory and predicting the position of the target in the detection frame by a least squares method; calculating the size and shape difference between two adjacent frames based on the position of the target, and penalizing prediction boxes whose difference from the previous frame exceeds a threshold; calculating the confidence value of each candidate box as the target box according to a specified function; and reordering all prediction boxes by confidence value and taking the top K prediction boxes with the largest confidence values as target candidate boxes.
The method provided by the application adapts the Siamese RPN tracking model, which performs well on land, to offshore tracking scenes. Trained on a large amount of data, the model can mine features of the target at different depths and can accurately and efficiently track ships of various types under different weather conditions. On the collected maritime tracking data, the average tracking accuracy is 58% and the average frame rate reaches 124.21 FPS.
Drawings
FIG. 1 is a flow chart of a twin network-based ship tracking method in an embodiment of the application;
FIG. 2 is a schematic diagram of a twin network system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an application result of a twin network-based ship tracking method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a ship tracking system based on a twin network according to an embodiment of the present application.
Detailed Description
The technical solutions of the present application are described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the embodiments and the specific technical features therein are merely illustrative of the technical solutions of the present application and not limiting, and that, absent conflict, the embodiments and their specific technical features may be combined with one another.
Fig. 1 is a flowchart of a ship tracking method based on a twin network according to an embodiment of the present application, where the method includes:
s1, acquiring a target position of a tracking target in a first frame image, and taking the target position as a reference frame;
s2, preprocessing an input image to generate an anchor block, calculating a first relative offset between the anchor block and a template frame reference frame, and taking the first relative offset as the input of a tracking network;
s3, determining second relative offset between the anchor point frame and a prediction frame of the detection frame, and calculating a confidence value corresponding to each prediction frame;
s4, punishing the predicted frames according to the influence of the historical track and the change of the size and the shape, reordering all the predicted frames according to the confidence values, and taking the first K predicted frames with the largest confidence values as target candidate frames;
and S5, merging repeated prediction frames by adopting a non-maximum suppression algorithm, taking the prediction frame with the maximum confidence value as a target frame of the current detection and taking the prediction frame as a reference frame of the next frame.
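For illustration, steps S1 to S5 can be assembled into a short tracking loop. The Python sketch below is a minimal rendering of that loop under assumed interfaces: the callables make_anchors, net, penalize, and nms stand in for the modules described in the remainder of this description and are not the patented implementation.

```python
import numpy as np

def relative_offset(anchors, box):
    """Offset (dx, dy, dw, dh) of a reference box relative to each anchor."""
    d = np.empty_like(anchors)
    d[:, :2] = (box[:2] - anchors[:, :2]) / anchors[:, 2:]
    d[:, 2:] = np.log(box[2:] / anchors[:, 2:])
    return d

def track(frames, init_box, net, make_anchors, penalize, nms, K=16):
    """Sketch of S1-S5. Assumed interfaces: make_anchors(frame) -> (N, 4)
    float anchor boxes as (cx, cy, w, h); net(frame, offset_in) ->
    (offset_out, scores); penalize(boxes, scores, history) -> new scores;
    nms(boxes, scores) -> the single best box."""
    reference = np.asarray(init_box, dtype=float)        # S1: prior knowledge
    history = [reference]
    for frame in frames:
        anchors = make_anchors(frame)                    # S2: anchor boxes
        offset_in = relative_offset(anchors, reference)  # S2: first offset
        offset_out, scores = net(frame, offset_in)       # S3: second offset
        boxes = anchors.copy()
        boxes[:, :2] += offset_out[:, :2] * anchors[:, 2:]  # decode centers
        boxes[:, 2:] *= np.exp(offset_out[:, 2:])           # decode sizes
        scores = penalize(boxes, scores, history)        # S4: apply penalties
        top = np.argsort(scores)[::-1][:K]               # S4: top-K candidates
        reference = nms(boxes[top], scores[top])         # S5: merge, keep best
        history.append(reference)                        # S5: next reference box
        yield reference
```

Each iteration consumes one detection frame and emits one target box, which becomes the reference box of the next iteration, exactly as S5 prescribes.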
First, the technical solution of the application is based on a twin network whose feature extraction module is a modified AlexNet. As shown in figure 2, this patent removes the padding in AlexNet and modifies the depth of the convolutional layers to suit our scene. The twin network has two branches, a template branch and a detection branch, which extract features from the template frame and the detection frame respectively, denoted φ(z) and φ(x), where z denotes the template frame and x denotes the detection frame.
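As a concrete illustration of the padding-free backbone, the PyTorch sketch below shows one plausible layer stack. The channel widths and layer count are assumptions: the description states only that the padding is removed and the convolutional depth is modified.

```python
import torch.nn as nn

class Backbone(nn.Module):
    """AlexNet-style feature extractor with all padding removed (sketch).
    With these illustrative settings, a 127x127 template crop produces a
    6x6 feature map."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=2),   # no padding anywhere
            nn.BatchNorm2d(96), nn.ReLU(inplace=True), nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True), nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3),
            nn.BatchNorm2d(384), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3),
            nn.BatchNorm2d(384), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3),           # raw features, no ReLU
        )

    def forward(self, x):
        return self.features(x)

phi = Backbone()   # one shared instance: the two "twin" branches share weights
# feat_z = phi(z_crop); feat_x = phi(x_crop)  -> template / detection features
```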
Region proposal network:
Corresponding to the twin network, the region proposal network also has two branches: a classification branch, which judges whether an anchor box belongs to the target or the background, and a corresponding regression branch, which calculates the position offset of the anchor box. Let the number of anchor boxes be k; the output size of the classification branch is then 2k and that of the regression branch is 4k. The features φ(z) and φ(x) are the inputs from which the region proposal network computes the corresponding classification and regression values.
Define the anchor boxes as ANC_* and the correspondingly obtained regression boxes as REG_*. From the computed regression boxes, the K candidate boxes with the highest confidence are screened out, and constraints such as size and deformation are then applied to screen out the optimal target bounding box.
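The sketch below illustrates, under SiamRPN-style assumptions, how the 2k classification maps and 4k regression maps can be produced by cross-correlating lifted template features with detection features, and how the regression boxes REG_* are recovered from ANC_* and the predicted offsets. The layer shapes are illustrative, not the patented values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPNHead(nn.Module):
    """Two-branch region proposal head (sketch, batch size 1 assumed)."""
    def __init__(self, C=256, k=5):
        super().__init__()
        self.k = k
        self.kernel_cls = nn.Conv2d(C, 2 * k * C, 3)   # lift template features
        self.kernel_reg = nn.Conv2d(C, 4 * k * C, 3)
        self.search_cls = nn.Conv2d(C, C, 3)           # adjust detection features
        self.search_reg = nn.Conv2d(C, C, 3)

    @staticmethod
    def correlate(kernel, search, out_ch):
        # template features become the correlation kernel over the search area
        C = search.size(1)
        kernel = kernel.view(out_ch, C, kernel.size(2), kernel.size(3))
        return F.conv2d(search, kernel)

    def forward(self, feat_z, feat_x):
        cls = self.correlate(self.kernel_cls(feat_z), self.search_cls(feat_x), 2 * self.k)
        reg = self.correlate(self.kernel_reg(feat_z), self.search_reg(feat_x), 4 * self.k)
        return cls, reg   # 2k classification maps and 4k regression maps

def regression_boxes(anchors, offsets):
    """REG_*: apply predicted (dx, dy, dw, dh) to anchors ANC_* = (cx, cy, w, h)."""
    x = anchors[:, 0] + offsets[:, 0] * anchors[:, 2]
    y = anchors[:, 1] + offsets[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * torch.exp(offsets[:, 2])
    h = anchors[:, 3] * torch.exp(offsets[:, 3])
    return torch.stack([x, y, w, h], dim=1)
```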
Loss function design:
the output of this model consists of classification results and regression results, so our loss function consists of two parts. First, the calculation method of the classification loss is as follows:
wherein y is i Tags representing classifications, S i Indicating the probability that the classification is correct.
For the regression branch, let the prediction box be denoted (A_x, A_y, A_w, A_h) and the corresponding ground-truth box (G_x, G_y, G_w, G_h). The ground-truth box is first normalized (the standard anchor parameterization, as in SiamRPN, is assumed here):

δ[0] = (G_x - A_x) / A_w,   δ[1] = (G_y - A_y) / A_h,   δ[2] = ln(G_w / A_w),   δ[3] = ln(G_h / A_h)

The regression loss is then calculated with the smooth-L1 function:

L_reg = Σ_{i=0..3} smooth_L1(δ[i], σ)

where the smooth_L1 function is calculated as follows:

smooth_L1(x, σ) = 0.5 · σ² · x²  if |x| < 1/σ²,  and  |x| - 1/(2σ²)  otherwise.
the total loss function can be expressed as:
Loss = L_cls + γ · L_reg
where γ is a hyperparameter that balances the classification and regression losses.
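The three loss terms can be sketched compactly as follows; the cross-entropy and smooth-L1 forms match the formulas above, while the default σ and the tensor layouts are assumptions.

```python
import torch

def classification_loss(S, y):
    """L_cls = -sum_i y_i * log S_i over one-hot labels y and probabilities S."""
    return -(y * torch.log(S + 1e-12)).sum(dim=-1).mean()

def smooth_l1(x, sigma):
    """0.5 * sigma^2 * x^2 if |x| < 1/sigma^2, else |x| - 1/(2 sigma^2)."""
    cond = x.abs() < 1.0 / sigma ** 2
    return torch.where(cond, 0.5 * (sigma * x) ** 2, x.abs() - 0.5 / sigma ** 2)

def regression_loss(pred, anchors, gt, sigma=3.0):
    """Normalize the ground-truth box against the anchors, then smooth-L1."""
    delta = torch.stack([
        (gt[:, 0] - anchors[:, 0]) / anchors[:, 2],
        (gt[:, 1] - anchors[:, 1]) / anchors[:, 3],
        torch.log(gt[:, 2] / anchors[:, 2]),
        torch.log(gt[:, 3] / anchors[:, 3]),
    ], dim=1)
    return smooth_l1(pred - delta, sigma).sum(dim=1).mean()

def total_loss(L_cls, L_reg, gamma):
    return L_cls + gamma * L_reg   # gamma balances the two terms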
Candidate box selection:
to select the optimal target bounding box from the K candidate boxes, we use two strategies. First, the first strategy is to add historical voyage trajectories as constraints. Considering that the motion of the target has continuity, the first 5 frames of the historical track are selected, the position of the target in the detection frame is predicted by adopting a least square method, and the Manhattan distance is selected for calculating the distance.
Wherein x is predicted For target position predicted using historical trajectories, x is the target position predicted by the neural network, distance (x, x predicted ) Representing the manhattan distance between two predicted points. The second strategy is to take into account the size and shape variations of the target. In video streaming, the target can consider that the size and shape in two adjacent frames do not change much, and penalize the prediction frame with larger size change from the previous frame:
where k is a manually set super parameter, r is the aspect ratio of the target frame of the previous frame, and r' is the candidate frame aspect ratio of the current predicted frame. s and s' are the areas of the target frames of the previous frame and the current frame, respectively. The present network uses a Softmax function to calculate the confidence Score for each candidate box as the target box. After adding the constraint, the target boxes are reordered according to the Score value size:
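The two strategies can be sketched as follows: a degree-1 least-squares fit over the previous M tracked centers extrapolates x_predicted, the Manhattan distance compares it with the network's prediction, and an exponential penalty of the assumed form above down-weights boxes whose aspect ratio or area jumps between frames.

```python
import numpy as np

def predict_position(history, M=5):
    """Least-squares linear fit over the last M centers (x, y) versus time."""
    pts = np.asarray(history[-M:], dtype=float)
    t = np.arange(len(pts))
    fx = np.polyfit(t, pts[:, 0], 1)          # x(t) fit
    fy = np.polyfit(t, pts[:, 1], 1)          # y(t) fit
    return np.polyval(fx, len(pts)), np.polyval(fy, len(pts))

def manhattan(p, q):
    """distance(x, x_predicted): Manhattan distance between two points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def shape_penalty(prev_box, box, k):
    """Penalty is 1 when aspect ratio r and area s are unchanged, and decays
    as either jumps relative to the previous frame (assumed SiamRPN form)."""
    r, rp = prev_box[2] / prev_box[3], box[2] / box[3]
    s, sp = prev_box[2] * prev_box[3], box[2] * box[3]
    return np.exp(-k * (max(r / rp, rp / r) * max(s / sp, sp / s) - 1.0))
```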
Score = Score · (1 - distance_influence - size_influence) + Penalty · size_influence + Candidate · distance_influence

where size_influence and distance_influence are manually set hyperparameters and Candidate is the trajectory-distance term derived from distance(x, x_predicted). A non-maximum suppression algorithm then merges duplicate candidate boxes among the reordered candidates, and the candidate box with the highest confidence is finally selected as the current target box.
Before executing the above process, the tracking network must first be trained, which specifically includes the following steps:
A1, calibrating the position of the tracking target;
calibrating the positions (x_i, y_i, w_i, h_i) of the tracked targets, i = 1, ..., N, where N is the number of images;
A2, randomly extracting a pair of images as the template frame and the detection frame with an inter-frame interval of at most 10, preprocessing the input images to generate anchor boxes Anchor_j (j = 1, ..., m, where m is the number of anchors), and calculating the relative offsets offset_in_j against the reference box;
A3, feeding the preprocessed relative offsets offset_in_j into the network, which outputs the relative offsets offset_out_j between the anchor boxes and the prediction boxes;
A4, calculating the losses loss_j between the prediction boxes and the ground-truth box using the Softmax cross entropy;
A5, backpropagating the gradient and updating the weights W;
A6, repeating steps A2 to A5 until the loss tends to be stable or the prediction accuracy reaches a peak value.
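Steps A1 to A6 amount to a standard supervised training loop. The sketch below strings them together using the loss sketches given earlier; sample_pair, make_anchors, relative_offset, and assign_labels are illustrative helper names, and the stopping threshold is an assumed stand-in for "the loss tends to be stable".

```python
def train(dataset, net, optimizer, gamma=1.0, max_interval=10, steps=100000):
    """Sketch of A1-A6; `dataset` is assumed to yield sequences calibrated
    with ground-truth boxes (A1)."""
    for _ in range(steps):
        # A2: random template/detection pair at most max_interval frames apart
        z, x, box_z, box_x = sample_pair(dataset, max_interval)
        anchors = make_anchors(x)                        # A2: anchor boxes
        offset_in = relative_offset(anchors, box_z)      # A2: offset_in_j
        offset_out, cls_scores = net(z, x, offset_in)    # A3: offset_out_j
        y = assign_labels(anchors, box_x)                # target vs. background
        loss = classification_loss(cls_scores, y) \
             + gamma * regression_loss(offset_out, anchors, box_x)  # A4
        optimizer.zero_grad()
        loss.backward()                                  # A5: backpropagate
        optimizer.step()                                 # A5: update weights W
        if loss.item() < 0.05:                           # A6: illustrative stop
            break
```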
Based on this method, the application also adopts precision, recall, F-value, and frame rate as the performance measures of the experiments. Let Ω_t be the intersection of the prediction box and the truth box of the target at time t, G_t the truth box of the target at time t, θ_t the confidence of the prediction at time t, τ_θ the classification threshold, N_g the number of frames with a non-empty truth box, N_t the number of frames in which the prediction box is non-empty, N the number of network predictions, and PT(im(i)) the processing time of the detection frame im numbered i. The four measures are computed from these quantities.
by the method provided by the application, the marine tracking model can achieve 58% of average tracking accuracy, the running frame rate can achieve 124.21FPS, and real-time accurate tracking of the marine ship can be realized. In the experiment, the original AlexNet and vgg16 are used as the feature extraction modules for experiments, and the results show that the tracking accuracy of the improved AlexNet is improved by 3% relative to that of the original model and is 1% higher than that of vgg, and the high frame rate of 124.21 is obtained.
Specifically, Table 1 shows the tracking performance of the different CNN models: the feature extraction module of Model_AlexNet is the original AlexNet, that of Model_vgg16 is the original VGG16, and Model adopts the improved AlexNet.
Table 1
To verify the robustness of the model in tracking different ships under different weather conditions, experiments were set up by weather condition and ship type. The results are shown in Tables 2, 3, and 4, where Table 2 gives the precision of the model on the data set, Table 3 the recall, and Table 4 the F-value:
Table 2
Table 3
Table 4
Tables 2 and 3 show the accuracy and recall of our tracking model for different weather conditions and vessel types. As the tables show, the tracker reaches an average accuracy of 56% and an average recall of 53% under severe weather conditions. Fog has the greatest impact on tracking because it obscures most of the target's features. Across vessel types, appearance and speed affect tracking. Passenger vessels generally resemble one another closely, are very similar in size, and follow fixed routes at steady speeds, so they are tracked well. Yachts, in contrast, are tracked poorly: they are small and fast, and their drivers change course at will, so their positional deviation and appearance changes are significant, which degrades tracking. In Table 4, the F-measure combines precision and recall and confirms the good overall performance of the proposed tracker.
The method provided by the application adapts the Siamese RPN tracking model, which performs well on land, to offshore tracking scenes. Trained on a large amount of data, the model can mine features of the target at different depths and can accurately and efficiently track ships of various types under different weather conditions. On the collected maritime tracking data, the average tracking accuracy is 58% and the average frame rate reaches 124.21 FPS; figure 3 shows the result of applying the method to actual ship tracking.
Corresponding to the method provided by the application, an embodiment of the application further provides a twin network-based ship tracking system. Fig. 4 is a schematic structural diagram of the twin network-based ship tracking system in an embodiment of the application; the system comprises:
the system initialization module 401 is required to provide the position of the tracking target in the first frame image as priori knowledge, and takes the position of the target as a reference frame;
the preprocessing module 402 is configured to preprocess an input image to generate a template frame, a detection frame and an anchor frame;
the network processing module 403 calculates a first relative offset from the anchor frame by the reference frame provided by the template frame, and outputs a second offset from the predicted frame as an input to the trace network. And calculating all prediction frames according to the calculation result;
the selection module 404 adds punishments of the historical track and the change of the size and the shape, reorders all the prediction frames according to the confidence values, and takes the first K prediction frames with the largest confidence values as target candidate frames; combining repeated prediction frames by using a non-maximum value inhibition method, taking the prediction frame with the maximum confidence value as a target frame of the detection and taking the prediction frame as a reference frame of the next frame.
Further, in the embodiment of the application, the network processing module 403 is configured to extract template features from the template frame; extract detection features from the detection frame; obtain the classification values and regression values of the region proposal network from the template features and detection features; and obtain the regression boxes from the classification values and regression values.
Further, in the embodiment of the application, the selection module 404 is configured to select the previous M frame images of the historical trajectory and predict the position of the target in the detection frame by a least squares method; calculate the size and shape difference between two adjacent frames based on the position of the target, and penalize prediction boxes whose difference from the previous frame exceeds a threshold; calculate the confidence value of each candidate box as the target box according to a specified function; and reorder all prediction boxes by confidence value, taking the top K prediction boxes with the largest confidence values as target candidate boxes.
While preferred embodiments of the present application have been described, those skilled in the art may make additional variations and modifications to these embodiments once they learn of the basic inventive concept. It is therefore intended that the appended claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (8)

1. A twin network-based ship tracking method, comprising:
acquiring the target position of the tracking target in the first frame image and taking the target position as the reference box;
preprocessing the input image to generate anchor boxes, and calculating the first relative offset between the anchor boxes and the reference box of the template frame, the first relative offset serving as the input of the tracking network;
determining the second relative offset between the anchor boxes and the prediction boxes of the detection frame, the second relative offset serving as the output of the tracking network, and calculating the confidence value of each prediction box, the confidence being the reliability of the prediction box as the target box;
penalizing the prediction boxes according to the influence of the historical trajectory and of changes in size and shape, reordering all prediction boxes by confidence value, and taking the top K prediction boxes with the largest confidence values as target candidate boxes;
and merging duplicate prediction boxes with a non-maximum suppression algorithm, taking the prediction box with the largest confidence value as the target box of the current detection and as the reference box for the next frame.
2. The method of claim 1, wherein, before taking the target position as the reference box, the method comprises:
extracting template features from the template frame;
extracting detection features from the detection frame;
obtaining the classification values and regression values of the region proposal network from the template features and detection features;
and obtaining the regression boxes from the classification values and regression values.
3. The method of claim 1, wherein reordering all prediction boxes by confidence value and taking the top K prediction boxes with the largest confidence values as target candidate boxes comprises:
selecting the previous M frame images of the historical trajectory and predicting the position of the target in the detection frame by a least squares method;
calculating the size and shape difference between two adjacent frames based on the position of the target in the template frame, and penalizing prediction boxes whose difference from the previous frame exceeds a threshold;
calculating the confidence value of each candidate box as the target box according to a specified function;
and reordering all prediction boxes by confidence value and taking the top K prediction boxes with the largest confidence values as target candidate boxes.
4. The method of claim 1, wherein, before executing the above process, target tracking training is performed as follows:
A1, calibrating the position of the tracking target;
A2, randomly extracting a pair of images as the template frame and the detection frame with an inter-frame interval of at most 10, preprocessing the input images to generate anchor boxes, and calculating the third relative offset between the template frame and its corresponding ground-truth box;
A3, feeding the preprocessed third relative offset into the tracking network, which outputs the fourth relative offset between the anchor boxes and the prediction boxes of the detection frame;
A4, calculating the cross entropy from the third and fourth relative offsets, and calculating the total loss between the prediction boxes of the detection frame and their ground-truth boxes;
A5, calculating the gradient from the total loss, backpropagating the gradient, and updating the weights;
A6, repeating steps A1 to A5 until the total loss falls within a preset range.
5. The method of claim 4, wherein calculating the total loss of the prediction boxes and the ground-truth boxes comprises:
obtaining the classification loss by a specified formula;
normalizing the ground-truth box and calculating the regression loss;
and obtaining the total loss from the classification loss, the regression loss, and the loss function.
6. A twin network-based vessel tracking system, the system comprising:
an initialization module, which requires the position of the tracking target in the first frame image as prior knowledge and takes the position of the target as the reference box;
a preprocessing module, configured to preprocess the input image to generate the template frame, the detection frame, and the anchor boxes; a network processing module, which calculates the first relative offset between the reference box provided by the template frame and the anchor boxes as the input of the tracking network, outputs the second relative offset between the prediction boxes and the anchor boxes, and calculates all prediction boxes from the second relative offset;
a selection module, which applies penalties for the historical trajectory and for changes in size and shape, reorders all prediction boxes by confidence value, and takes the top K prediction boxes with the largest confidence values as target candidate boxes; duplicate prediction boxes are merged by non-maximum suppression, and the prediction box with the largest confidence value is taken as the target box of the current detection and as the reference box for the next frame.
7. The system of claim 6, wherein the network processing module is configured to extract template features from the template frame; extract detection features from the detection frame; obtain the classification values and regression values of the region proposal network from the template features and detection features; and obtain the regression boxes from the classification values and regression values.
8. The system of claim 6, wherein the selection module is configured to select the previous M frame images of the historical trajectory and predict the position of the target in the detection frame by a least squares method; calculate the size and shape difference between two adjacent frames based on the position of the target, and penalize prediction boxes whose difference from the previous frame exceeds a threshold; calculate the confidence value of each candidate box as the target box according to a specified function; and reorder all prediction boxes by confidence value, taking the top K prediction boxes with the largest confidence values as target candidate boxes.
CN201911087711.6A 2019-11-08 2019-11-08 Ship tracking method and system based on twin network Active CN111723632B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911087711.6A CN111723632B (en) 2019-11-08 2019-11-08 Ship tracking method and system based on twin network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911087711.6A CN111723632B (en) 2019-11-08 2019-11-08 Ship tracking method and system based on twin network

Publications (2)

Publication Number Publication Date
CN111723632A CN111723632A (en) 2020-09-29
CN111723632B (en) 2023-09-15

Family

ID=72563964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911087711.6A Active CN111723632B (en) 2019-11-08 2019-11-08 Ship tracking method and system based on twin network

Country Status (1)

Country Link
CN (1) CN111723632B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488061B (en) * 2020-12-18 2022-04-29 电子科技大学 Multi-aircraft detection and tracking method combined with ADS-B information
CN112686326B (en) * 2021-01-05 2022-09-06 中国科学技术大学 Target tracking method and system for intelligent sorting candidate frame
CN113538509B (en) * 2021-06-02 2022-09-27 天津大学 Visual tracking method and device based on adaptive correlation filtering feature fusion learning
CN116186907B (en) * 2023-05-04 2023-09-15 中国人民解放军海军工程大学 Method, system and medium for analyzing navigable state based on state of marine subsystem

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0942395A2 (en) * 1998-03-13 1999-09-15 Siemens Corporate Research, Inc. Method for digital video processing
CN108764164A (en) * 2018-05-30 2018-11-06 华中科技大学 A kind of method for detecting human face and system based on deformable convolutional network
CN109583483A (en) * 2018-11-13 2019-04-05 中国科学院计算技术研究所 A kind of object detection method and system based on convolutional neural networks
CN109636829A (en) * 2018-11-24 2019-04-16 华中科技大学 A kind of multi-object tracking method based on semantic information and scene information
CN109829934A (en) * 2018-12-20 2019-05-31 北京以萨技术股份有限公司 A kind of novel image tracking algorithm based on twin convolutional network
CN110084829A (en) * 2019-03-12 2019-08-02 上海阅面网络科技有限公司 Method for tracking target, device, electronic equipment and computer readable storage medium
CN110096929A (en) * 2018-01-30 2019-08-06 微软技术许可有限责任公司 Target detection neural network based
CN110120064A (en) * 2019-05-13 2019-08-13 南京信息工程大学 A kind of depth related objective track algorithm based on mutual reinforcing with the study of more attention mechanisms
CN110210551A (en) * 2019-05-28 2019-09-06 北京工业大学 A kind of visual target tracking method based on adaptive main body sensitivity
CN110244734A (en) * 2019-06-20 2019-09-17 中山大学 A kind of automatic driving vehicle paths planning method based on depth convolutional neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10579897B2 (en) * 2017-10-02 2020-03-03 Xnor.ai Inc. Image based object detection

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0942395A2 (en) * 1998-03-13 1999-09-15 Siemens Corporate Research, Inc. Method for digital video processing
CN110096929A (en) * 2018-01-30 2019-08-06 微软技术许可有限责任公司 Target detection neural network based
CN108764164A (en) * 2018-05-30 2018-11-06 华中科技大学 A kind of method for detecting human face and system based on deformable convolutional network
CN109583483A (en) * 2018-11-13 2019-04-05 中国科学院计算技术研究所 A kind of object detection method and system based on convolutional neural networks
CN109636829A (en) * 2018-11-24 2019-04-16 华中科技大学 A kind of multi-object tracking method based on semantic information and scene information
CN109829934A (en) * 2018-12-20 2019-05-31 北京以萨技术股份有限公司 A kind of novel image tracking algorithm based on twin convolutional network
CN110084829A (en) * 2019-03-12 2019-08-02 上海阅面网络科技有限公司 Method for tracking target, device, electronic equipment and computer readable storage medium
CN110120064A (en) * 2019-05-13 2019-08-13 南京信息工程大学 A kind of depth related objective track algorithm based on mutual reinforcing with the study of more attention mechanisms
CN110210551A (en) * 2019-05-28 2019-09-06 北京工业大学 A kind of visual target tracking method based on adaptive main body sensitivity
CN110244734A (en) * 2019-06-20 2019-09-17 中山大学 A kind of automatic driving vehicle paths planning method based on depth convolutional neural networks

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
High Performance Visual Tracking with Siamese Region Proposal Network; Bo Li; 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019-02-19; pp. 8971-8980 *
A moving target detection algorithm fusing prediction and oversampling; 曾婧; Telecommunication Engineering (电讯技术); 2017-11; Vol. 57, No. 11; pp. 1283-1288 *
Detection and recognition of specific targets based on optical remote sensing imagery; 洪韬; China Master's Theses Full-text Database, Information Science and Technology; 2018-09-15; Vol. 2018, No. 09; pp. I140-152 *
Small target detection based on deep convolutional neural networks; 郭之先; China Master's Theses Full-text Database, Information Science and Technology; 2018-08-15; Vol. 2018, No. 08; pp. I138-849 *
Research on target detection in remote sensing images based on deep learning; 李星悦; China Master's Theses Full-text Database, Engineering Science and Technology II; 2019-08-15; Vol. 2019, No. 08; pp. C028-42 *

Also Published As

Publication number Publication date
CN111723632A (en) 2020-09-29

Similar Documents

Publication Publication Date Title
CN111723632B (en) Ship tracking method and system based on twin network
Schöller et al. Assessing deep-learning methods for object detection at sea from LWIR images
Lee et al. Image-based ship detection and classification for unmanned surface vehicle using real-time object detection neural networks
CN112560671B (en) Ship detection method based on rotary convolution neural network
CN109859202B (en) Deep learning detection method based on USV water surface optical target tracking
CN114612769A (en) Integrated sensing infrared imaging ship detection method integrated with local structure information
CN115471746A (en) Ship target identification detection method based on deep learning
CN115147594A (en) Ship image trajectory tracking and predicting method based on ship bow direction identification
Sun et al. IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes
Zhou et al. Collision-free waterway segmentation for inland unmanned surface vehicles
CN114565824A (en) Single-stage rotating ship detection method based on full convolution network
Alla et al. Vision-based Deep Learning algorithm for Underwater Object Detection and Tracking
CN113837924A (en) Water bank line detection method based on unmanned ship sensing system
Shi et al. Obstacle type recognition in visual images via dilated convolutional neural network for unmanned surface vehicles
Zhou et al. A real-time algorithm for visual detection of high-speed unmanned surface vehicle based on deep learning
Zhou et al. A real-time scene parsing network for autonomous maritime transportation
Li et al. Research on ROI algorithm of ship image based on improved YOLO
CN114862904A (en) Twin network target continuous tracking method of underwater robot
CN111709308B (en) Unmanned aerial vehicle-based maritime distress personnel detection and tracking method and system
CN114445572A (en) Deeplab V3+ based method for instantly positioning obstacles and constructing map in unfamiliar sea area
CN114255385A (en) Optical remote sensing image ship detection method and system based on sensing vector
CN110895680A (en) Unmanned ship water surface target detection method based on regional suggestion network
Qu et al. Multi-Task Learning-Enabled Automatic Vessel Draft Reading for Intelligent Maritime Surveillance
Jiang et al. A machine vision method for the evaluation of ship-to-ship collision risk
CN115471729B (en) Ship target identification method and system based on improved YOLOv5

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant