CN113033098B

CN113033098B - Ocean target detection deep learning model training method based on AdaRW algorithm

Info

Publication number: CN113033098B
Application number: CN202110324328.9A
Authority: CN
Inventors: 柳林; 李万武; 张继贤
Original assignee: Shandong University of Science and Technology
Current assignee: Shandong University of Science and Technology
Priority date: 2021-03-26
Filing date: 2021-03-26
Publication date: 2022-05-17
Anticipated expiration: 2041-03-26
Also published as: CN113033098A

Abstract

The invention discloses a training method for a marine target detection deep learning model based on the AdaRW algorithm, and belongs to the field of marine target detection. It first proposes the AdaRW adaptive gradient training algorithm, which addresses the learning-rate decay caused by accumulating the full gradient history in the AdaGrad algorithm. It also designs an optimal interleaved parallel architecture (OIPA) consisting of a plurality of PServer processes and Worker_DS processes. When the marine target detection deep learning model is trained, the AdaRW algorithm is run in multi-core parallel on the OIPA framework, which increases the training speed. The trained OceanTDA9_AdaRW model is then used to detect suspected targets in the research area, improving the efficiency of polarimetric SAR marine target detection.

Description

Ocean target detection deep learning model training method based on AdaRW algorithm

Technical Field

The invention belongs to the field of marine target detection, and particularly relates to a marine target detection deep learning model training method based on an AdaRW algorithm.

Background

Training algorithms for deep learning models include gradient descent, the least squares method, Newton's method, quasi-Newton methods and the like. Gradient descent is an iterative solution, whereas the least squares method computes an analytic solution. Newton's method and quasi-Newton methods are also iterative and solve the problem using the inverse or pseudo-inverse of the second-order Hessian matrix. Gradient descent is the algorithm most commonly used for model training. It is not guaranteed to find the globally optimal solution and may converge to a locally optimal one.

Batch Gradient Descent (BGD) is the most common form of gradient descent. It updates the parameters using all samples, which gives high accuracy but slow training. Stochastic Gradient Descent (SGD) works on a similar principle, except that the gradient is computed from one sample instead of all the sample data. Training is therefore much faster, but because each iteration uses a single sample, the update direction varies widely and convergence is poor. The Momentum gradient descent algorithm (Momentum optimization) considers the current gradient when updating the parameters and also adds an accumulated term, the impulse, whose magnitude is controlled by the parameter γ. This reduces oscillation during model training and accelerates convergence compared with the traditional gradient descent algorithm.

The AdaGrad gradient descent algorithm, proposed by Duchi in 2011, is a gradient descent algorithm with an adaptive learning rate, and its convergence speed is faster in the direction with large gradient. However, its learning rate decreases steadily, so that late in training it becomes so small that training stops prematurely. The RMSprop algorithm, proposed by Hinton, improves on AdaGrad by introducing a hyper-parameter that decays the accumulated squared-gradient term, which solves the problem of the learning rate decaying too fast. The Adam (adaptive moment estimation) algorithm, proposed by Kingma et al. in 2015, combines the ideas of the Momentum and RMSprop algorithms. Compared with Momentum, its learning rate is adaptive; compared with RMSprop, it adds a momentum term.
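For reference, the three update rules mentioned above can be sketched in a few lines of NumPy. These are the commonly published forms of Momentum, RMSprop and Adam; the function names, default hyper-parameter values and the state-passing style are illustrative choices and are not taken from this document.

```python
import numpy as np

def momentum_step(theta, v, g, lr=0.01, gamma=0.9):
    """Momentum: keep an impulse term v whose magnitude is controlled by gamma."""
    v = gamma * v + lr * g
    return theta - v, v

def rmsprop_step(theta, s, g, lr=0.01, rho=0.9, eps=1e-8):
    """RMSprop: decayed average of squared gradients instead of a full sum."""
    s = rho * s + (1.0 - rho) * g * g
    return theta - lr * g / (np.sqrt(s) + eps), s

def adam_step(theta, m, v, g, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: momentum term m plus RMSprop-style term v, both bias-corrected."""
    m = b1 * m + (1.0 - b1) * g
    v = b2 * v + (1.0 - b2) * g * g
    m_hat = m / (1.0 - b1 ** t)   # bias correction, t counts from 1
    v_hat = v / (1.0 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

theta0 = np.array([1.0, -2.0])
grad = np.array([0.5, -4.0])
zeros = np.zeros_like(theta0)

theta_mom, _ = momentum_step(theta0, zeros, grad)
theta_rms, _ = rmsprop_step(theta0, zeros, grad)
theta_adam, _, _ = adam_step(theta0, zeros, zeros, grad, t=1)
```

On the first step Adam moves each coordinate by roughly `lr` regardless of the gradient magnitude, which is one way to see that its step size is adaptive per parameter.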

When models trained with the above algorithms are used to detect suspected targets in a marine research area, the efficiency of marine target detection is relatively low.

Disclosure of Invention

In order to solve the problems, the invention provides an AdaRW algorithm-based marine target detection deep learning model training method, and the AdaRW algorithm is adopted for carrying out multi-core parallel training on the marine target detection deep learning model, so that the model training speed is increased, and the marine target detection efficiency is improved.

The technical scheme of the invention is as follows:

An AdaRW adaptive gradient training algorithm accumulates historical gradients over a limited window and uses the square root of Δθ_t in place of the hyper-parameter η of the AdaGrad algorithm. The gradient accumulation sub-window is defined as the window running from the current time t back to the historical time t_m, rather than an arbitrary sub-window of the historical gradient accumulation, which overcomes the decay of the deep learning rate. Meanwhile, a multi-core parallel OIPA framework is designed, and the AdaRW algorithm is used for parallel training of the marine target detection deep learning model. Finally, suspected marine targets are detected using the trained OceanTDA9_AdaRW model.

Preferably, the method comprises the steps of:

S1, proposing the AdaRW adaptive gradient training algorithm and deriving its update formula;

S2, designing the optimal interleaved parallel architecture OIPA;

S3, training with the proposed AdaRW algorithm in parallel on the designed OIPA framework to obtain the marine target detection deep learning model OceanTDA9_AdaRW;

S4, detecting suspected targets in the marine area using the trained OceanTDA9_AdaRW model.

Preferably, the iterative update formula of the AdaRW adaptive gradient training algorithm is as follows:

$$\Delta\theta_t = \lambda\,\Delta\theta_{t-1} + (1-\lambda)\, g'_t \odot g'_t \tag{5}$$

where θ is the parameter, t is the current time, t_m is the m-th historical moment, and λ is a hyper-parameter with 0 ≤ λ < 1; ε takes a small value to prevent the denominator from being 0; g_t is the mini-batch stochastic gradient of the loss function J(θ).

Preferably, the AdaRW algorithm comprises the steps of:

(1) determining a loss function, and adopting a cross entropy loss function as follows:

where θ is the parameter, y_i is the input value of the i-th sample, and h_θ(x_i) is the output value of the i-th sample x_i;

(2) initializing the algorithm-related parameters: the hyper-parameter λ, the gradient accumulation window size m, and the values of θ₀, θ₁, ..., θ_n;

(3) calculating the gradient of the loss function at the current position and saving the gradient at time t_m;

(4) calculating the descent distance d_i at the current position, obtained by multiplying the step size by the gradient;

(5) judging whether the gradient descent distance is less than the algorithm termination distance r or whether the number of training iterations n has been reached; if so, terminating the algorithm, otherwise going to step (6);

(6) updating all θ and returning to step (1), the update function being as follows;

(7) ending the algorithm and outputting the result.

Preferably, the OIPA architecture is composed of 1 central node Chief and several child nodes Node; the central node Chief is connected to each child node in a star topology, and all the child nodes are logically connected in a closed loop.

Preferably, each node of the OIPA architecture consists of 1 parameter service unit PServer and 1 computation service unit Worker; the data sets in the child-node Workers differ from one another and are denoted Worker_DS0, Worker_DS1, Worker_DS2 and Worker_DS3; the union of the data sets in all child-node Worker_DS units equals the training data set, and the central-node data set Worker_DS is the complete data set; the parameter service unit PServer consists of a plurality of CPUs and is responsible only for transmitting and storing data, not for computation; the computation service unit Worker_DS consists of a plurality of GPUs and is responsible only for computation, not for data transmission.

Preferably, the training process based on the OIPA architecture is:

(1) CPU0 in the parameter service unit of child node Node0 takes the data set DS0 allocated to that node out of the full data set according to the total number of nodes in the cluster, prepares 2 batches of training data according to the node's number of GPUs and training batch size, and distributes them to the 2 GPUs in Worker_DS0 for training; the resulting gradient ΔP is transmitted to CPU1 in the node's parameter service unit PServer, and CPU1 updates the model parameters with the aggregated gradient, after which training continues;

(2) after completing the specified number of training steps, Worker_DS0 of the node transmits the gradient of the last step to the model parameter folder of the master node Chief0 in the cluster, extracts the optimal model parameters from that folder, stores them in the node's own model parameter folder, updates its parameters for a new iteration, and then distributes the optimal model parameters to the parameter service units of the upstream and downstream nodes Node1 and Node2 connected to it;

(3) after the parameter service units of Node1 and Node2 verify the freshness of the parameters, they store the latest model parameters in their own model parameter folders for use in their next round of training;

(4) CPU1 in the PServer of the master node Chief0 monitors the model parameter folder, promptly reads the parameters transmitted by each node in the cluster, passes them to the master node's Worker_DS for test evaluation, and stores the optimal model parameters in the model parameter folder so that each node can read the model from it and continue training.

The invention has the following beneficial technical effects:

The AdaRW algorithm accumulates gradients by a window accumulation method, which solves the problem of the low learning rate caused by accumulating the entire history when the AdaGrad algorithm updates; that is, a subset of the historical gradients is selected by a window and accumulated, and the learning rate is adjusted accordingly. Meanwhile, to reflect the current gradient trend, the square root of Δθ_t replaces the hyper-parameter η of the AdaGrad algorithm, and the gradient accumulation sub-window is defined as the window running from the current time t back to the historical time t_m, rather than an arbitrary subset window of the historical gradient accumulation.

The AdaRW algorithm is an adaptive gradient training algorithm that overcomes the deep learning rate decay caused by historical gradient accumulation, slows that decay, and increases the deep learning training speed. Meanwhile, a multi-core parallel architecture, OIPA, is designed, and the marine target detection deep learning model is trained in multi-core parallel using the AdaRW algorithm, which increases the training speed. Finally, the trained model is used to detect suspected targets in the research area, improving the efficiency of polarimetric SAR marine target detection.

Drawings

FIG. 1 is a schematic diagram of the AdaRW algorithm of the present invention;

FIG. 2 is a flow chart of the AdaRW algorithm of the present invention;

FIG. 3 is an OIPA architecture diagram of the present invention;

FIG. 4 is a diagram of a single-machine multiple GPU deployment of the present invention;

FIG. 5 is a diagram of a multi-machine multi-GPU deployment of the present invention.

Detailed Description

The invention is described in further detail below with reference to the following figures and detailed description:

The invention provides an AdaRW adaptive gradient training algorithm. The AdaRW algorithm accumulates gradients by a window accumulation method and uses the square root of Δθ_t in place of the hyper-parameter η of the AdaGrad algorithm; the gradient accumulation sub-window is defined as the window running from the current time t back to the historical time t_m. The AdaRW algorithm overcomes the defect of the AdaGrad algorithm that the learning rate decreases steadily, becoming very low late in training and making the learning time too long.

Meanwhile, a multi-core parallel architecture, the Optimal Interleaved Parallel Architecture (OIPA), is designed, and the AdaRW algorithm is used for multi-core parallel training of the marine target detection deep learning model.

Finally, the trained OceanTDA9_AdaRW model is used to detect suspected targets in the research area, improving the efficiency of polarimetric SAR marine target detection.

One, AdaRW algorithm principle

Assume that the loss function of the parameter θ is J(θ). The gradient with respect to the parameter is the direction in which the function rises fastest, and the SGD parameter update expression is as follows:

where η denotes the learning rate and h_θ denotes the optimization function.

In theory the SGD algorithm converges to the global optimum on a convex optimization problem, but a neural network model has a complex nonlinear structure with many local optima, and training it is mostly a non-convex optimization problem. A gradient descent algorithm may therefore become trapped in a local optimum, and convergence to the global optimum cannot be guaranteed.

The AdaGrad algorithm makes the learning rate adaptive and solves the problem that the learning rate in the SGD method never changes. The updating process is as follows:

where s denotes the accumulated squared gradient; when the parameters are updated, the learning rate η is divided by the square root of this accumulation, and ε ensures that the denominator is not 0. Because the historical gradients are accumulated throughout the training iterations, the learning rate gradually decays towards 0. If the initial gradient is large, the learning rate stays small for the whole training process and the learning time is prolonged.
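The decay described above can be made concrete with a minimal sketch that tracks only the effective learning rate η / (√s + ε) of standard AdaGrad for a single scalar parameter. The function name, the values of η and ε, and the synthetic gradient sequences are assumptions chosen for illustration.

```python
import numpy as np

def adagrad_rates(grads, lr=0.01, eps=1e-8):
    """Return the effective learning rate lr / (sqrt(s) + eps) after each step."""
    s = 0.0
    rates = []
    for g in grads:
        s += g * g                      # the full history is accumulated, never forgotten
        rates.append(lr / (np.sqrt(s) + eps))
    return rates

# Constant unit gradients: the rate falls as 1 / sqrt(step count).
rates = adagrad_rates([1.0] * 100)

# One large initial gradient keeps the rate small for the rest of training.
rates_big = adagrad_rates([10.0] + [1.0] * 99)
```

With these inputs the rate after 100 steps is about one tenth of `lr` in the first run and lower still in the second, which matches the observation that a large initial gradient depresses the learning rate for the entire run.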

In order to solve the defects of the AdaGrad algorithm, the AdaGrad algorithm is improved in two aspects:

(1) To solve the problem of the low learning rate caused by accumulating the entire history when the AdaGrad algorithm updates, gradients are accumulated by a window accumulation method; that is, a subset of the historical gradients is selected by a window and accumulated, and the learning rate is adjusted accordingly.

(2) To reflect the current gradient trend, the square root of Δθ_t replaces the hyper-parameter η of the AdaGrad algorithm. In addition, the gradient accumulation sub-window is defined as the window running from the current time t back to the historical time t_m, rather than an arbitrary sub-window of the historical gradient accumulation.

The improved algorithm is called the AdaRW (Adagrad corrected by Windows) algorithm and is likewise an adaptive gradient training algorithm. The iterative update formula is as follows:

$$\Delta\theta_t = \lambda\,\Delta\theta_{t-1} + (1-\lambda)\, g'_t \odot g'_t \tag{5}$$

where t is the current time, t_m is the m-th historical moment, and λ is a hyper-parameter with 0 ≤ λ < 1, generally set to 0.9. A small value ε is added to the denominator of equation (4) to prevent the denominator from being 0. g_t is the mini-batch stochastic gradient of the loss function J(θ), expressed as:

The accumulation window of the proposed AdaRW algorithm is adjustable to suit different training data. The window size is controlled by m, which in turn adjusts the accumulated amount. The smaller m is, the larger the accumulation window. When m is 1, the entire history is accumulated, which corresponds to the accumulated squared gradient in the AdaGrad algorithm. The principle is shown in FIG. 1.
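One possible reading of this description is sketched below. It is an illustration under stated assumptions, not a verified reproduction of the update formulas: only equation (5) is available in the text, so the way the windowed accumulation and √Δθ_t are combined in `step` (an AdaGrad-style ratio with √Δθ_t in place of η) is inferred from the prose. The class name `AdaRWSketch` and the parameter `window` (the number of most recent steps retained, with `None` meaning the whole history, i.e. the m = 1 case) are hypothetical.

```python
from collections import deque
import numpy as np

class AdaRWSketch:
    """Illustrative reading of the AdaRW description; names and details are assumptions."""

    def __init__(self, dim, window, lam=0.9, eps=1e-8):
        self.sq_grads = deque(maxlen=window)  # squared gradients for times t_m .. t
        self.delta = np.zeros(dim)            # running term of equation (5)
        self.lam = lam
        self.eps = eps

    def step(self, theta, g):
        self.sq_grads.append(g * g)
        # Windowed accumulation: sum only the squared gradients still in the window.
        s = np.sum(list(self.sq_grads), axis=0)
        # Equation (5): delta_t = lam * delta_{t-1} + (1 - lam) * g (.) g
        self.delta = self.lam * self.delta + (1.0 - self.lam) * g * g
        # ASSUMPTION: sqrt(delta_t) takes the place of AdaGrad's eta.
        rate = np.sqrt(self.delta) / (np.sqrt(s) + self.eps)
        return theta - rate * g, rate

opt_win = AdaRWSketch(dim=1, window=5)      # accumulate the last 5 steps only
opt_all = AdaRWSketch(dim=1, window=None)   # accumulate the whole history
theta_w = np.zeros(1)
theta_a = np.zeros(1)
for _ in range(200):
    g = np.ones(1)                          # synthetic constant gradient
    theta_w, rate_w = opt_win.step(theta_w, g)
    theta_a, rate_a = opt_all.step(theta_a, g)
```

Under a constant gradient the windowed rate settles near 1/√5 while the full-history rate keeps shrinking with the step count, which is the qualitative behaviour the window is meant to provide.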

Two, AdaRW algorithm process

The AdaRW algorithm flow is shown in fig. 2, and includes the following steps:

(1) Determine a loss function; the algorithm adopts a cross-entropy loss function.

where θ is the parameter, y_i is the input value of the i-th sample, and h_θ(x_i) is the output value of the i-th sample x_i.

(2) Initialize the algorithm-related parameters: the hyper-parameter λ, the gradient accumulation window size m, and the values of θ₀, θ₁, ..., θ_n.

(3) Calculate the gradient of the loss function at the current position and save the gradient at time t_m.

(4) Calculate the descent distance d_i at the current position, obtained by multiplying the step size by the gradient.

(5) Judge whether the gradient descent distance is less than the algorithm termination distance r or whether the number of training iterations n has been reached; if so, terminate the algorithm, otherwise go to step (6).

(6) Update all θ and return to step (1); the update function is as follows.

(7) End the algorithm and output the result.
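The control flow of steps (1) to (7) can be sketched as a short loop. To keep the sketch self-contained it makes several assumptions that are not in the text: the model is a logistic regression, the loss is the binary cross-entropy, the data are synthetic, and the update in step (6) is a plain fixed-step gradient update standing in for the AdaRW update function. Only the loop structure and the two stopping conditions (descent distance below r, or n iterations reached) follow the steps above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(theta, X, y):
    """Step (1): binary cross-entropy of a logistic model h_theta(x) = sigmoid(x . theta)."""
    p = np.clip(sigmoid(X @ theta), 1e-12, 1.0 - 1e-12)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def gradient(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def train(X, y, step=0.5, r=1e-4, n=5000):
    theta = np.zeros(X.shape[1])          # step (2): initialise parameters
    for _ in range(n):                    # step (5): stop after n iterations
        g = gradient(theta, X, y)         # step (3): gradient at the current position
        d = step * g                      # step (4): descent distance = step size * gradient
        if np.linalg.norm(d) < r:         # step (5): stop when the distance is below r
            break
        theta = theta - d                 # step (6): update all theta (stand-in update)
    return theta                          # step (7): output the result

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(size=200) > 0).astype(float)
theta_fit = train(X, y)
```

Swapping the line marked step (6) for an adaptive update would change the per-parameter step size but leave the termination logic untouched.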

Three, parallel distributed architecture design

The invention also designs an optimal interleaved parallel architecture, OIPA, as shown in FIG. 3. The OIPA architecture consists of 1 central node Chief and a plurality of child nodes Node. The central node Chief is connected to each child node in a star topology, and all the child nodes are logically connected in a closed loop. Unlike the traditional centralized architecture, each node is composed of 1 parameter service unit PServer and 1 computation service unit Worker. The data sets in the child-node Workers differ from one another and are denoted Worker_DS0, Worker_DS1 and so on; the union of the data sets in all child-node Worker_DS units equals the training data set, and the central-node data set Worker_DS is the complete data set. The parameter service unit PServer consists of a plurality of CPUs and is responsible only for transmitting and storing data, not for computation; the computation service unit Worker_DS consists of a plurality of GPUs and is responsible only for computation, not for data transmission.

When all nodes are ready, model training begins. In one iteration, the Worker of each child node completes its own batch of training, calculates the gradient, transmits it to the PServer, and reads the model from the model folder to continue training. The child-node PServer transmits the gradient and other parameters calculated by its Worker to the central node Chief, receives the model parameters from the central node, updates the model in the node's model folder for the Worker to call, and forwards the model parameters to the upstream and downstream nodes connected to it. The upstream and downstream nodes check the model parameters received from the child node and compare them with their own to decide whether to update their models. The PServer in the central node Chief monitors and receives the model parameters transmitted by each child node, passes them to the Worker_DS in the central node for test evaluation, and updates the model parameters in the model folder for the child nodes to call.
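The topology and data layout just described can be sketched as plain data structures. The helper names `build_oipa_topology` and `partition_dataset` are hypothetical, and the interleaved split used to form the shards is an assumption; the text only requires that the child-node data sets be distinct and that their union equal the training set.

```python
def build_oipa_topology(num_nodes):
    """Star links from the Chief to every child node, plus a closed ring among the children."""
    nodes = [f"Node{i}" for i in range(num_nodes)]
    star = [("Chief0", name) for name in nodes]
    # Each child knows its upstream and downstream neighbour on the ring.
    ring = {
        name: (nodes[(i - 1) % num_nodes], nodes[(i + 1) % num_nodes])
        for i, name in enumerate(nodes)
    }
    return star, ring

def partition_dataset(samples, num_nodes):
    """Split the training set into disjoint shards DS0, DS1, ... (assumed interleaved split)."""
    return [samples[i::num_nodes] for i in range(num_nodes)]

star_links, ring_links = build_oipa_topology(4)
shards = partition_dataset(list(range(10)), 4)   # child nodes hold disjoint shards
chief_dataset = list(range(10))                  # the Chief holds the complete data set
```

With four child nodes, `ring_links["Node0"]` is `("Node3", "Node1")`, so parameters forwarded by Node0 reach both of its ring neighbours as well as the Chief.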

Compared with other architectures, the OIPA architecture designed by the invention has the following advantages:

(1) The data set on which each child node trains is fixed and unique. The union of all child-node data sets equals the training data set, so the data used for training within one mini-batch round are never repeated, and in the limiting case the whole data set can be trained within a single mini-batch round.

(2) Multiple paths ensure that the model parameters used by each child node in every training round are always optimal. The training result of each child node is transmitted to the central node through the node's PServer; the central node aggregates the gradients and other model parameters calculated by all Workers and, after test evaluation, transmits the optimal model parameters to each child node. In addition, once a child node obtains the optimal model parameters, it promptly forwards them to its upstream and downstream nodes so that they can update their models in time, which ensures that the model parameters called by each node's Worker are optimal.

(3) The computing units of each child node train in an interleaved manner without interruption. After finishing a mini-batch, each child node reads the model directly from its own model folder and continues training without waiting for the transmission of model parameters, so the computing unit of each node trains without stopping and different child nodes train in an interleaved manner.

Four, OIPA architecture deployment

The OIPA framework designed by the invention consists of a plurality of PServer processes and Worker_DS processes. It is primarily a distributed framework for multiple machines with multiple cards; it is deployed first in a single-machine multi-card environment and then in a multi-machine multi-card environment. The marine target detection model and the data set are optimized for GPUs with the limited memory of a single machine, and with slight modification OIPA can be deployed in a single-machine multi-card environment. The advantage of single-machine multi-card (parallel) deployment is reduced communication overhead between tasks, whereas multi-machine multi-card (distributed) deployment uses multiple servers to separate parameter updating from graph computation, reducing the load on the servers as a whole.

1. Deployment of stand-alone multiple GPUs

The GPU (graphics processing unit) is one of the important factors affecting the training time of a deep neural network, and deploying multiple GPUs in parallel on a single machine allows a training task to be completed efficiently. OIPA supports assigning operations to specified devices, so task allocation is critical. Because the GPU excels at large-scale computation, the whole inference and gradient computation is allocated to the GPUs, while parameter updating is allocated to the CPU. The single-machine dual-GPU deployment is shown in FIG. 4. Two batches are processed at a time, with each GPU handling the computation for one batch. Model parameters or computation graphs can be split and placed on different devices; parameters are shared through variable names, and the variables (parameters) are stored on the CPU. First, the CPU distributes data to the 2 GPUs, and each GPU computes the gradient to be updated for its batch. Second, the gradients from the 2 GPUs are collected on the CPU, the average gradient is calculated, and the parameters are updated with the average gradient. Third, these steps are repeated until training is complete. Note that gradient collection is synchronous: the CPU must wait for all GPUs to finish before it starts averaging the gradients, so the training speed of the whole model depends on the slowest GPU.
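The synchronous collect-and-average step can be illustrated with a small NumPy simulation in which two function calls play the role of the two GPUs and the surrounding code plays the role of the CPU. The linear model, squared-error loss, learning rate and synthetic data are all assumptions made for the sake of a runnable sketch; no real device placement is performed.

```python
import numpy as np

def batch_gradient(theta, X, y):
    """Gradient of 0.5 * mean((X @ theta - y) ** 2) on one batch (one simulated GPU)."""
    return X.T @ (X @ theta - y) / len(y)

def synchronous_step(theta, batches, lr=0.1):
    # Every simulated GPU computes the gradient of its own batch.
    grads = [batch_gradient(theta, X, y) for X, y in batches]
    # The simulated CPU proceeds only once all gradients are available, then averages them.
    avg = np.mean(grads, axis=0)
    return theta - lr * avg, avg

rng = np.random.default_rng(1)
X_full = rng.normal(size=(8, 3))
y_full = rng.normal(size=8)
theta = np.zeros(3)

# Two equally sized batches, one per simulated GPU.
batches = [(X_full[:4], y_full[:4]), (X_full[4:], y_full[4:])]
theta_next, avg_grad = synchronous_step(theta, batches)
```

Because the two batches have the same size, the averaged gradient equals the gradient computed on the combined batch of 8 samples, which is why this scheme behaves like training with a doubled batch size.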

2. Deployment of multiple machines and multiple GPUs

Multi-machine multi-card means that multiple servers each have multiple GPUs, so that the performance of several computers is fully used and different working nodes are assigned. The OIPA multi-machine multi-card experimental deployment is shown in FIG. 5: clusters are formed from 2 to 5 machines respectively, and each machine is equipped with 2 GPUs. The OIPA distributed machine learning framework divides the work into parameter jobs (Parameter Job) and work jobs (Worker Job). The Parameter Server (PS) runs the parameter job and is responsible for storing, updating and managing the parameters, while the work job is responsible for model computation. The distributed OIPA implements data transfer between jobs, namely forward propagation from parameter jobs to work jobs and backward propagation from work jobs to parameter jobs.

(1) Building a distributed environment

A Cluster is established, jobs (Job) and tasks (Task) are assigned, a host address is allocated to each Task, and a Server is created for each Task. The Cluster must be passed in when a Server is created, so that each Server knows which Hosts its Cluster contains and the Servers can communicate with one another. Each Server must be created on its own Host. Once all Servers have been created on their respective Hosts, the whole Cluster is established and all Servers in the Cluster can communicate with each other. Each Server contains two components, Master and Worker. The Master provides the Master Service, which mainly provides remote access (via the RPC protocol) to each device in the Cluster; another important function is to serve as the Target to create tf. The Worker provides the Worker Service, which executes computation subgraphs on local devices.
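A cluster description of this kind is usually a mapping from job names to lists of task addresses. The sketch below shows such a mapping as a plain Python dictionary and enumerates the (job, task index, host) triples for which one Server would be created. The host names, port number and the helper `enumerate_tasks` are hypothetical. If TensorFlow is the framework in use (an assumption; the text does not name it in full), a dictionary of this shape is what `tf.train.ClusterSpec` accepts.

```python
# Hypothetical cluster layout: one parameter-server task and two worker tasks.
cluster = {
    "ps": ["host0.example:2222"],
    "worker": ["host1.example:2222", "host2.example:2222"],
}

def enumerate_tasks(cluster_def):
    """List every (job, task_index, host) triple; one Server is created per triple."""
    return [
        (job, index, host)
        for job, hosts in cluster_def.items()
        for index, host in enumerate(hosts)
    ]

tasks = enumerate_tasks(cluster)
```

Each process would be started on the host named in its triple and given the full `cluster` mapping, so that every Server knows the addresses of all the others.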

(2) Initiating a service

One node is designated as the master node (Chief), which is responsible for managing the nodes, coordinating training among them, and performing common operations such as model initialization, model saving and model recovery.

(3) Begin training

In one iteration, the process is as follows. First, CPU0 in the parameter service unit of child node Node0 takes the data set DS0 allocated to that node out of the full data set according to the total number of nodes in the cluster, prepares 2 batches of training data according to the node's number of GPUs and training batch size, and distributes them to the 2 GPUs in Worker_DS0 for training; the resulting gradient ΔP is transmitted to CPU1 in the node's parameter service unit PServer, and CPU1 updates the model parameters with the aggregated gradient, after which training continues. Second, after completing the specified number of training steps, Worker_DS0 of the node transmits the gradient of the last step to the model parameter folder of the master node Chief0 in the cluster, extracts the optimal model parameters from that folder, stores them in the node's own model parameter folder, updates its parameters for a new iteration, and then distributes the optimal model parameters to the parameter service units of the upstream and downstream nodes Node1 and Node2 connected to it. Third, after verifying the freshness of the parameters, the parameter service units of Node1 and Node2 store the latest model parameters in their own model parameter folders for use in their next round of training. Fourth, CPU1 in the PServer of the master node Chief0 monitors the model parameter folder, promptly reads the parameters transmitted by each node in the cluster, passes them to the master node's Worker_DS for test evaluation, and stores the optimal model parameters in the model parameter folder so that each node can read the model from it and continue training.
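The two decisions in this exchange, namely the master node keeping the best evaluated parameters and a neighbouring node accepting parameters only if they are fresher than its own, can be sketched as follows. The record type `ModelParams`, its fields `version` and `score`, and both helper functions are hypothetical; the text does not specify how freshness or optimality is measured, so a step counter and an evaluation score are assumed here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelParams:
    version: int    # assumed freshness marker, e.g. a global step counter
    score: float    # assumed evaluation result on the master node's full data set
    source: str     # node that produced the parameters

def chief_select(current_best, submissions):
    """Master node: keep the highest-scoring parameters seen so far."""
    return max([current_best, *submissions], key=lambda p: p.score)

def accept_if_fresher(local, incoming):
    """Neighbouring node: replace local parameters only when the incoming set is newer."""
    return incoming if incoming.version > local.version else local

best = ModelParams(version=10, score=0.91, source="Node1")
submitted = [
    ModelParams(version=12, score=0.93, source="Node0"),
    ModelParams(version=11, score=0.90, source="Node2"),
]
best = chief_select(best, submitted)

node1_local = ModelParams(version=10, score=0.91, source="Node1")
node1_local = accept_if_fresher(node1_local, best)   # accepted: version 12 > 10
stale = ModelParams(version=9, score=0.95, source="Node3")
node1_after = accept_if_fresher(node1_local, stale)  # rejected: version 9 < 12
```

Separating the freshness check from the quality check mirrors the description, where neighbours only verify freshness and evaluation is left to the master node.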

Five, parallel training experiment of the AdaRW algorithm

The proposed optimized training algorithm AdaRW was trained in parallel for 8250 iterations on the designed OIPA framework, with the learning rate set to 0.01 and the other parameters set to the algorithm defaults. The training results were compared with those of the existing SGD, Adagrad and Adam algorithms to obtain the loss_batch curve and the accuracy_batch curve.

In the loss_batch curve, the loss of the AdaRW algorithm is 0.0766, taking 750 seconds, with a standard deviation of 0.00015 and a mean of 0.2178; the loss of the Adam algorithm is 0.0594, taking 660 seconds, with a standard deviation of 0.00010 and a mean of 0.2164; the loss of the SGD algorithm is 0.0869, taking 668 seconds, with a standard deviation of 0.00031 and a mean of 0.2743; the loss of the Adagrad algorithm is 0.0535, taking 638 seconds, with a standard deviation of 0.00026 and a mean of 0.2407. Overall, the AdaRW algorithm is superior to the Adagrad and SGD algorithms and comparable to the Adam algorithm.

In the accuracy_batch curve, the accuracy of the AdaRW algorithm is 0.9983, taking 750 seconds, with a standard deviation of 0.00009 and a mean of 0.9187; the accuracy of the Adam algorithm is 0.9992, taking 660 seconds, with a standard deviation of 0.00006 and a mean of 0.9194; the accuracy of the SGD algorithm is 0.9947, taking 668 seconds, with a standard deviation of 0.00013 and a mean of 0.8925; the accuracy of the Adagrad algorithm is 0.9977, taking 638 seconds, with a standard deviation of 0.00010 and a mean of 0.9107. Overall, the AdaRW algorithm and the Adam training algorithm are the best suited to marine targets, and the standard deviation of the AdaRW algorithm is superior to that of the Adagrad algorithm and the Adam algorithm.

The proposed AdaRW algorithm and the other three algorithms were used to train the marine target detection deep learning model OceanTDA9. Each optimized training algorithm was trained for 8250 iterations with 100 samples per iteration. The results of the experiment are shown in Table 1. In the table, the test accuracy and test loss are the accuracy and loss calculated on the test data set after model training is finished, and the average accuracy and average loss are the averages of the accuracy and loss over the last 20 training iterations. The experimental results show that, apart from time consumption, the test accuracy, average accuracy and average loss of the AdaRW algorithm are superior to those of Adagrad and SGD, its test loss lies between those of Adagrad and SGD, its overall performance is comparable to that of the Adam optimization algorithm, and its standard deviation is superior to that of the Adam optimization algorithm. In addition, the AdaRW algorithm provided by the invention is simpler.

Table 1 Comparison of AdaRW with other algorithms

Experimental data selects dual-polarized SAR data of an IW mode of Sentinel-1 in a Bohai sea area (between 37 degrees 07-40 degrees 56 'of north latitude and 117 degrees 33-122 degrees 08' of east longitude). The total 20 scenes, the time span is 2016 months 1 to 6 months.

After the advanced AdaRW algorithm is adopted to train the OceanTDA9 deep learning model, suspected targets are detected, and 90 suspected targets (each containing 28 × 28 pixels) such as a drilling platform are detected. The target missed detection number is 0, the detection rate is 100%, the false alarm rate is 10.9%, the time is 2.36 seconds, and the SAR image detection capability of 10 m resolution is about 58.5km 2/s. The results are shown in Table 2.

Table 2. Detection results of marine targets

Experiments show that the test accuracy, average accuracy and average loss of the algorithm are better than those of the Adagrad and SGD algorithms, that its test loss lies between those of Adagrad and SGD, and that the standard deviation of its curve is better than that of the Adam optimization algorithm. The constructed ocean target detection deep learning model is trained with the algorithm, and the trained model is used to detect suspected targets in the study area, which improves the efficiency of polarized SAR ocean target detection. The experimental results verify the effectiveness of the method.

It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.

Claims

1. A marine target detection deep learning model training method based on the AdaRW algorithm, characterized in that the AdaRW adaptive gradient training algorithm accumulates historical gradients over a limited window and uses the square root of Δθ_t in place of the hyperparameter η of the AdaGrad algorithm, the gradient accumulation window being defined as the window extending back from the current time t to the historical time t_m; meanwhile, a multi-core parallel OIPA framework is designed, and the ocean target detection deep learning model is trained in parallel with the AdaRW algorithm; finally, the trained OceanTDA9_AdaRW model is used to detect suspected ocean targets;

the iterative update formula of the AdaRW adaptive gradient training algorithm is as follows:

Δθ_t = λΔθ_{t-1} + (1-λ) g_t′ ⊙ g_t′ (5)

where θ is the parameter, t is the current time, t_m is the m-th historical time, and λ is a hyperparameter with 0 ≤ λ < 1; ε takes a small value to prevent the denominator from being 0; g_t is the mini-batch stochastic gradient of the loss function J(θ);

the OIPA framework consists of 1 central node Chief and a plurality of child nodes Node, wherein the central node Chief is connected to each child node in a star topology, and all the child nodes are logically connected in a closed loop;

each node of the OIPA framework consists of 1 parameter service unit PServer and 1 calculation service unit Worker; the data sets in the child node Workers differ from one another and are distinguished as Worker_DS0, Worker_DS1, Worker_DS2 and Worker_DS3; the sum of the data sets in all the child node Worker_DS units is equal to the training data set, and the data set in the central node Worker_DS is the complete data set; the parameter service unit PServer consists of a plurality of CPUs and is responsible only for transmitting and storing data, not for calculation; the calculation service unit Worker_DS consists of a plurality of GPUs and is responsible only for calculation, not for data transmission;
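The node layout and data distribution described above can be pictured with the following Python sketch. It is only an illustration under stated assumptions: the function name `build_oipa_layout`, the use of the Worker_DS names as child node identifiers, and the round-robin split into disjoint shards are hypothetical choices; the claim states only that the child data sets differ and together equal the training data set.

```python
def build_oipa_layout(samples, num_children=4):
    """Return (star links, ring links, per-node data sets) for an OIPA-like layout."""
    children = [f"Worker_DS{i}" for i in range(num_children)]
    # Star: the central node Chief is linked to every child node.
    star = [("Chief", child) for child in children]
    # Closed loop: each child is linked to the next, and the last back to the first.
    ring = [(children[i], children[(i + 1) % num_children])
            for i in range(num_children)]
    # Assumption: a round-robin split, giving disjoint shards whose union
    # is the whole training data set.
    data = {child: samples[i::num_children] for i, child in enumerate(children)}
    # The central node keeps the complete data set.
    data["Chief"] = list(samples)
    return star, ring, data
```

With four child nodes, every training sample lands in exactly one child shard, while the Chief entry holds a full copy.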

the training process based on the OIPA framework is as follows:

2. The method for training the ocean target detection deep learning model based on the AdaRW algorithm is characterized by comprising the following steps:

S2, designing the optimal interleaved parallel architecture OIPA;

3. The method for training the ocean target detection deep learning model based on the AdaRW algorithm is characterized in that the AdaRW algorithm comprises the following steps:

(2) initializing the algorithm-related parameters: the hyperparameter λ, the gradient accumulation window size, and the values of θ₀, θ₁, ..., θ_n;

(4) calculating the descent distance d_i at the current position, obtained by multiplying the step size by the gradient;

(7) ending the algorithm and outputting the result.
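The surviving steps and equation (5) can be combined into a rough Python sketch. This is not the patented algorithm: only equation (5), the limited gradient accumulation window, and the use of the square root of Δθ_t in place of η come from the text. The rest are assumptions, labeled in the code: the windowed accumulation is taken as a plain sum of g ⊙ g, the step has an AdaDelta-like form, and g_t′ is taken to be the descent distance just applied. The names `accumulate_update` and `AdaRWSketch` are hypothetical.

```python
import math
from collections import deque


def accumulate_update(prev_delta, g_prime, lam):
    """Equation (5): delta_t = lam * delta_{t-1} + (1 - lam) * g' (.) g'."""
    return [lam * d + (1.0 - lam) * g * g for d, g in zip(prev_delta, g_prime)]


class AdaRWSketch:
    """Illustrative optimizer with a limited gradient accumulation window."""

    def __init__(self, theta, lam=0.9, window=10, eps=1e-6):
        # Step (2): initialize lam, the window size and the parameter values.
        self.theta = list(theta)
        self.lam = lam
        self.eps = eps  # small value that keeps the denominator away from 0
        self.sq_grads = deque(maxlen=window)  # only the last `window` entries are kept
        self.delta = [0.0] * len(self.theta)

    def step(self, grad):
        self.sq_grads.append([g * g for g in grad])
        # Assumption: the windowed accumulation is a plain sum of g (.) g.
        acc = [sum(column) for column in zip(*self.sq_grads)]
        # Step (4): descent distance = step size * gradient.
        # Assumption: step size = sqrt(delta + eps) / sqrt(acc + eps).
        distance = [math.sqrt(d + self.eps) / math.sqrt(a + self.eps) * g
                    for d, a, g in zip(self.delta, acc, grad)]
        self.theta = [t - s for t, s in zip(self.theta, distance)]
        # Assumption: g' in equation (5) is the distance just applied.
        self.delta = accumulate_update(self.delta, distance, self.lam)
        return self.theta
```

For example, minimizing J(θ) = θ² from θ = 1 amounts to calling `opt.step([2.0 * opt.theta[0]])` repeatedly on `opt = AdaRWSketch([1.0])`. Because `deque(maxlen=window)` discards the oldest squared gradient automatically, the accumulated term stays bounded rather than growing without limit as in AdaGrad.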