CN113033098B - Ocean target detection deep learning model training method based on AdaRW algorithm

Publication number
CN113033098B
CN113033098B CN202110324328.9A CN202110324328A CN113033098B CN 113033098 B CN113033098 B CN 113033098B CN 202110324328 A CN202110324328 A CN 202110324328A CN 113033098 B CN113033098 B CN 113033098B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110324328.9A
Other languages
Chinese (zh)
Other versions
CN113033098A (en
Inventor
柳林
李万武
张继贤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Science and Technology
Original Assignee
Shandong University of Science and Technology
Application filed by Shandong University of Science and Technology
Priority to CN202110324328.9A
Publication of CN113033098A
Application granted
Publication of CN113033098B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00: Computer-aided design [CAD]
    • G06F30/20: Design optimisation, verification or simulation
    • G06F30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061: Partitioning or combining of resources
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a marine target detection deep learning model training method based on the AdaRW algorithm and belongs to the field of marine target detection. It first proposes the AdaRW adaptive gradient training algorithm, which addresses the learning-rate decay caused by historical gradient accumulation in the AdaGrad algorithm. It also designs an optimal interleaved parallel architecture (OIPA) consisting of PServer and Worker_DS processes. When the marine target detection deep learning model is trained, the AdaRW algorithm is run in multi-core parallel fashion on the OIPA architecture, which increases training speed. The trained OceanTDA9_AdaRW model is then used to detect suspected targets in the research area, improving the efficiency of polarimetric SAR marine target detection.

Description

Ocean target detection deep learning model training method based on AdaRW algorithm
Technical Field
The invention belongs to the field of marine target detection, and particularly relates to a marine target detection deep learning model training method based on an AdaRW algorithm.
Background
Training algorithms for deep learning models include gradient descent, the least squares method, Newton's method, quasi-Newton methods and others. Gradient descent is an iterative solution, whereas the least squares method computes an analytic solution. Newton's method and quasi-Newton methods are also iterative and solve using the inverse or pseudo-inverse of the second-order Hessian matrix. Gradient descent is the algorithm most commonly used for model training. It does not necessarily find the globally optimal solution and may instead find a locally optimal one.
Batch Gradient Descent (BGD) is the most common form of gradient descent; it updates the parameters using all samples, which gives high accuracy but slow training. Stochastic Gradient Descent (SGD) works on a similar principle, except that the gradient is obtained from a single sample instead of all the sample data. Training is therefore much faster, but because each iteration uses only one sample, the update direction varies widely and convergence is poor. The momentum gradient descent algorithm (Momentum optimization) considers the current gradient when updating the parameters and adds an accumulation term, the impulse, whose magnitude is controlled by the parameter γ. This reduces oscillation during model training and accelerates convergence compared with traditional gradient descent.
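As a rough illustration of the momentum update described above, the following sketch compares plain gradient descent with a momentum step on a toy one-dimensional loss. The loss J(θ) = θ², the function names, and the values of `eta` and `gamma` are assumptions chosen for the example; they do not come from the patent.

```python
def grad(theta):
    # Gradient of the assumed toy loss J(theta) = theta**2.
    return 2.0 * theta


def plain_step(theta, eta):
    # Ordinary gradient descent: move against the current gradient only.
    return theta - eta * grad(theta)


def momentum_step(theta, velocity, eta, gamma):
    # The velocity (impulse) accumulates past gradients; gamma controls its size.
    velocity = gamma * velocity + eta * grad(theta)
    return theta - velocity, velocity


theta_plain = 5.0
theta_mom, velocity = 5.0, 0.0
for _ in range(200):
    theta_plain = plain_step(theta_plain, eta=0.01)
    theta_mom, velocity = momentum_step(theta_mom, velocity, eta=0.01, gamma=0.9)

print(abs(theta_plain), abs(theta_mom))
```

With these assumed settings the momentum variant ends much closer to the minimum at θ = 0 after the same number of steps.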
The AdaGrad gradient descent algorithm, proposed by Duchi in 2011, is a gradient descent algorithm with an adaptive learning rate, and it converges faster in directions with large gradients. However, its learning rate decreases steadily, becoming so small late in training that training stops prematurely. The RMSprop algorithm, proposed by Hinton, improves on AdaGrad by introducing a hyperparameter that decays the accumulated squared-gradient term, which solves the problem of the learning rate decaying too quickly. The Adam (adaptive moment estimation) algorithm, proposed by Kingma et al. in 2015, combines the ideas of the Momentum and RMSprop algorithms. Compared with Momentum, its learning rate is adaptive; compared with RMSprop, it adds a momentum term.
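To make the "Momentum plus RMSprop" description of Adam concrete, here is a minimal scalar sketch of the commonly published Adam update with bias correction. The toy loss J(θ) = θ² and the chosen `eta` are assumptions for illustration only.

```python
import math


def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: momentum-style running average of the gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    # Second moment: RMSprop-style running average of the squared gradient.
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction compensates for the zero initialisation of m and v.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    # Gradient of the assumed toy loss J(theta) = theta**2 is 2 * theta.
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, eta=0.01)

print(theta)
```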
When models trained with these algorithms are used to detect suspected targets in an ocean research area, the efficiency of marine target detection is relatively low.
Disclosure of Invention
In order to solve the problems, the invention provides an AdaRW algorithm-based marine target detection deep learning model training method, and the AdaRW algorithm is adopted for carrying out multi-core parallel training on the marine target detection deep learning model, so that the model training speed is increased, and the marine target detection efficiency is improved.
The technical scheme of the invention is as follows:
An AdaRW adaptive gradient training algorithm accumulates historical gradients over a limited window and uses the square root of $\Delta\theta_t$ in place of the hyperparameter η of the AdaGrad algorithm. The gradient accumulation sub-window is defined as the window running from the current time t back to the historical time $t_m$, rather than an arbitrary sub-window of the gradient history, which overcomes the learning-rate decay problem in deep learning. In addition, a multi-core parallel OIPA architecture is designed, and the AdaRW algorithm is used to train the marine target detection deep learning model in parallel. Finally, suspected marine targets are detected using the trained OceanTDA9_AdaRW model.
Preferably, the method comprises the steps of:
S1, proposing the AdaRW adaptive gradient training algorithm and deriving its update formula;
S2, designing the optimal interleaved parallel architecture OIPA;
S3, training with the proposed AdaRW algorithm in parallel on the designed OIPA architecture to obtain the marine target detection deep learning model OceanTDA9_AdaRW;
S4, detecting suspected targets in the ocean area using the trained OceanTDA9_AdaRW model.
Preferably, the iterative update formula of the AdaRW adaptive gradient training algorithm is as follows:
Figure BDA0002993986670000021
$\Delta\theta_t = \lambda\,\Delta\theta_{t-1} + (1-\lambda)\, g_t' \odot g_t'$ (5)
where θ is the parameter, t is the current time, $t_m$ is the m-th historical moment, and λ is a hyperparameter with 0 ≤ λ < 1; ε takes a small value to prevent the denominator from being 0; $g_t$ is the mini-batch stochastic gradient of the loss function J(θ).
Preferably, the AdaRW algorithm comprises the steps of:
(1) determining a loss function; the cross-entropy loss function is adopted:
Figure BDA0002993986670000022
where θ is the parameter, $y_i$ is the input value of the i-th sample, and $h_\theta(x_i)$ is the output value for the i-th sample x;
(2) initializing the algorithm parameters: the hyperparameter λ, the gradient accumulation window size m, and the values of $\theta_0, \theta_1, \ldots, \theta_n$;
(3) calculating the gradient of the loss function at the current position and saving the gradient at time $t_m$;
Figure BDA0002993986670000023
(4) calculating the descent distance $d_i$ at the current position, obtained by multiplying the step size by the gradient;
Figure BDA0002993986670000031
(5) judging whether the gradient descent distance is less than the algorithm termination distance r or the number of training iterations n has been reached; if so, terminating the algorithm, otherwise going to step (6);
(6) updating all θ and returning to step (1), where the update function is as follows;
Figure BDA0002993986670000032
(7) ending the algorithm and outputting the result.
Preferably, the OIPA architecture is composed of 1 central Node Chief and several sub-nodes Node, the central Node Chief is linked with each sub-Node in a star shape, and all the sub-nodes are logically linked in a closed loop.
Preferably, each node of the OIPA architecture consists of 1 parameter service unit PServer and 1 computation service unit Worker; the data sets in the child-node Workers differ from one another and are distinguished as Worker_DS0, Worker_DS1, Worker_DS2 and Worker_DS3; the union of the data sets in all child-node Worker_DS units equals the training data set, and the central-node data set Worker_DS is the complete data set; the parameter service unit PServer consists of several CPUs and is responsible only for transmitting and storing data, not for computation; the computation service unit Worker_DS consists of several GPUs and is responsible only for computation, not for data transmission.
Preferably, the training process based on the OIPA architecture is:
(1) CPU0 in the parameter service unit of child node Node0 takes the data set DS0 assigned to this node out of the full data set according to the total number of nodes in the cluster, prepares 2 batches of training data according to the node's GPU count and training batch size, and distributes them to the 2 GPUs in Worker_DS0 for training; the resulting gradient ΔP is transmitted to CPU1 in the node's parameter service unit PServer, and CPU1 updates the model parameters with the aggregated gradient, after which training continues;
(2) after completing the specified number of training steps, Worker_DS0 of the node transmits the gradient of the last step to the model parameter folder of the master node Chief0 in the cluster, fetches the optimal model parameters from the model parameter folder of Chief0, stores them in the node's own model parameter folder, and updates its parameters for a new round of iteration; it then distributes the optimal model parameters to the parameter service units of the upstream and downstream nodes Node1 and Node2 connected to it;
(3) after the parameter service units of Node1 and Node2 verify the freshness of the parameters, they store the latest model parameters in their own model parameter folders, providing the latest model parameters for their next round of training;
(4) CPU1 in the PServer of the master node Chief0 monitors the model parameter folder, promptly reads the parameters transmitted by each node in the cluster, passes them to the node's Worker_DS to test and evaluate the model parameters, and stores the optimal model parameters in the model parameter folder so that each node can read the model from the model folder and continue training.
The invention has the following beneficial technical effects:
the AdaRW algorithm accumulates gradients using a window accumulation method, which solves the low learning rate caused by AdaGrad accumulating the entire history at every update: a subset of the historical gradient accumulation is taken according to a window, and the learning rate is adjusted accordingly; meanwhile, to reflect the current gradient trend, the square root of $\Delta\theta_t$ replaces the hyperparameter η of the AdaGrad algorithm, and the gradient accumulation sub-window is defined as the window running from the current time t back to the historical time $t_m$, rather than an arbitrary subset window of the gradient history;
the AdaRW algorithm is an adaptive gradient training algorithm that overcomes the learning-rate decay caused by historical gradient accumulation, slows the decay of the learning rate, and increases deep learning training speed; meanwhile, a multi-core parallel architecture OIPA is designed, and the marine target detection deep learning model using the AdaRW algorithm is trained in multi-core parallel fashion, which increases training speed; finally, the trained model is used to detect suspected targets in the research area, improving the efficiency of polarimetric SAR marine target detection.
Drawings
FIG. 1 is a schematic diagram of the AdaRW algorithm of the present invention;
FIG. 2 is a flow chart of the AdaRW algorithm of the present invention;
FIG. 3 is an OIPA architecture diagram of the present invention;
FIG. 4 is a diagram of a single-machine multiple GPU deployment of the present invention;
FIG. 5 is a diagram of a multi-machine multi-GPU deployment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
The invention provides an AdaRW adaptive gradient training algorithm that accumulates gradients using a window accumulation method. It uses the square root of $\Delta\theta_t$ in place of the hyperparameter η of the AdaGrad algorithm, and the gradient accumulation sub-window is defined as the window running from the current time t back to the historical time $t_m$. The AdaRW algorithm overcomes the drawback of AdaGrad, whose learning rate decreases steadily until it is very low late in training, making the learning time too long.
In addition, a multi-core parallel architecture, the Optimal Interleaved Parallel Architecture (OIPA), is designed, and the AdaRW algorithm is used for multi-core parallel training of the marine target detection deep learning model.
Finally, the trained OceanTDA9_AdaRW model is used to detect suspected targets in the research area, improving the efficiency of polarimetric SAR marine target detection.
One, AdaRW algorithm principle
Assuming that the loss function of the parameter theta is J (theta), the gradient of the parameter is the direction in which the function rises most quickly, and the SGD algorithm parameter updating expression is as follows:
Figure BDA0002993986670000041
where η represents the learning rate and $h_\theta$ represents the optimization function.
The SGD algorithm can converge to global optimization on a convex optimization problem theoretically, but a neural network model belongs to a complex nonlinear structure, has a plurality of local optimal points and mostly belongs to a non-convex optimization problem. Therefore, the adoption of the gradient descent algorithm may fall into local optimization, and convergence to global optimization cannot be guaranteed.
The AdaGrad algorithm realizes the self-adaptation of the learning rate and solves the problem that the learning rate in the SGD method is invariable all the time. The updating process is as follows:
Figure BDA0002993986670000051
Figure BDA0002993986670000052
wherein s represents the accumulated amount of the gradient squared; when the parameters are updated, the learning rate η is divided by the square root of this accumulation, and ε is used to ensure that the denominator is not 0. Since the historical gradients are accumulated all the time during the training iteration, the learning rate is gradually decaying to 0. If the initial gradient is large, the learning rate of the whole training process is always small, and the learning time is prolonged.
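The decay described here can be seen numerically. The sketch below tracks the effective learning rate of a scalar AdaGrad update in its commonly published form, η / (√s + ε); the exact placement of ε in the patent's formula is not recoverable from the text, so that detail, the function name and the constant gradient sequence are assumptions.

```python
import math


def adagrad_effective_rates(grads, eta=0.01, eps=1e-8):
    """Return the effective learning rate after each gradient is accumulated."""
    s = 0.0  # running sum of squared gradients over the whole history
    rates = []
    for g in grads:
        s += g * g
        rates.append(eta / (math.sqrt(s) + eps))
    return rates


# Assumed input: 100 steps with a constant gradient of 1.0.
rates = adagrad_effective_rates([1.0] * 100)
print(rates[0], rates[-1])
```

Even with a constant gradient, the effective rate after 100 steps is one tenth of its initial value, because the accumulated sum only grows.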
In order to solve the defects of the AdaGrad algorithm, the AdaGrad algorithm is improved in two aspects:
(1) in order to solve the problem of low learning rate caused by all accumulation during updating of the AdaGrad algorithm, a window accumulation method is adopted for gradient accumulation, namely, subsets are taken from historical gradient accumulation according to windows for accumulation, and the learning rate is adjusted.
(2) To reflect the current gradient trend, the square root of $\Delta\theta_t$ replaces the hyperparameter η of the AdaGrad algorithm. In addition, the gradient accumulation sub-window is defined as the window running from the current time t back to the historical time $t_m$, rather than an arbitrary sub-window of the gradient history.
The improved algorithm is called AdaRW (Adagrad corrected by Windows) algorithm, and is also an adaptive gradient training algorithm. The iterative update formula is as follows:
Figure BDA0002993986670000053
$\Delta\theta_t = \lambda\,\Delta\theta_{t-1} + (1-\lambda)\, g_t' \odot g_t'$ (5)
where t is the current time, $t_m$ is the m-th historical moment, and λ is a hyperparameter with 0 ≤ λ < 1, generally set to 0.9. The small value ε is added to the denominator of equation (4) to prevent the denominator from being 0. $g_t$ is the mini-batch stochastic gradient of the loss function J(θ), expressed as:
Figure BDA0002993986670000054
The accumulation window of the proposed AdaRW algorithm can be adjusted to suit different training data. The window size is controlled by m, which in turn adjusts the amount accumulated: the smaller m is, the larger the accumulation window. When m is 1, the entire history is accumulated, which corresponds to the accumulated squared gradient of the AdaGrad algorithm. The principle is shown in FIG. 1.
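The effect of the window can be sketched for a scalar parameter. Equation (5) is taken from the text, but equation (4) is not reproduced in this copy, so the ratio used below, √(Δθ_t + ε) divided by the square root of the windowed sum of squared gradients, is only an assumed reading of "the square root of Δθ_t replaces η". Treating g′ as the plain gradient g, the function name and the test gradients are likewise assumptions.

```python
import math


def adarw_scale_factors(grads, m=1, lam=0.9, eps=1e-8):
    """Per-step multiplier applied to the gradient under the assumed form of (4)."""
    delta = 0.0
    factors = []
    for t, g in enumerate(grads, start=1):
        # Equation (5), assuming g' is simply the current gradient g.
        delta = lam * delta + (1.0 - lam) * g * g
        # Window from historical moment t_m up to the current time t
        # (clamped so the window is never empty while t < m).
        start = min(m, t)
        acc = sum(x * x for x in grads[start - 1:t])
        factors.append(math.sqrt(delta + eps) / math.sqrt(acc + eps))
    return factors


grads = [1.0] * 50
full_history = adarw_scale_factors(grads, m=1)   # window covers all 50 steps
late_window = adarw_scale_factors(grads, m=40)   # window covers steps 40..50
print(full_history[-1], late_window[-1])
```

Under these assumptions, m = 1 reproduces AdaGrad-style accumulation over the full history, while a later starting moment keeps the denominator smaller and the step larger.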
Two, AdaRW algorithm process
The AdaRW algorithm flow is shown in fig. 2, and includes the following steps:
(1) Determine the loss function; the algorithm adopts the cross-entropy loss function.
Figure BDA0002993986670000061
where θ is the parameter, $y_i$ is the input value of the i-th sample, and $h_\theta(x_i)$ is the output value for the i-th sample x.
(2) Initialize the algorithm parameters: the hyperparameter λ, the gradient accumulation window size m, and the values of $\theta_0, \theta_1, \ldots, \theta_n$.
(3) Calculate the gradient of the loss function at the current position and save the gradient at time $t_m$.
Figure BDA0002993986670000062
(4) Calculate the descent distance $d_i$ at the current position, obtained by multiplying the step size by the gradient.
Figure BDA0002993986670000063
(5) Judge whether the gradient descent distance is less than the algorithm termination distance r or the number of training iterations n has been reached; if so, terminate the algorithm, otherwise go to step (6).
(6) Update all θ and return to step (1), where the update function is as follows.
Figure BDA0002993986670000064
(7) End the algorithm and output the result.
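The control flow of steps (1) to (7) can be sketched as a small loop. To stay within what the text states, this sketch uses a fixed step size rather than the AdaRW step of equation (4), which is not reproduced here; the one-parameter logistic model, the toy data and all names are assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(theta, xs, ys):
    # Step (1): mean binary cross-entropy for the assumed model h(x) = sigmoid(theta * x).
    total = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(theta * x)
        total -= y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return total / len(xs)


def gradient(theta, xs, ys):
    # Derivative of the mean cross-entropy with respect to theta.
    return sum((sigmoid(theta * x) - y) * x for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, theta=0.0, step=0.5, r=1e-4, n=1000):
    for iteration in range(1, n + 1):
        g = gradient(theta, xs, ys)   # step (3): gradient at the current position
        d = step * g                  # step (4): descent distance
        if abs(d) < r:                # step (5): stop when the distance is below r
            break
        theta -= d                    # step (6): update theta, then loop again
    return theta, iteration           # step (7): output the result


# Assumed toy data that is not linearly separable, so theta stays finite.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 1, 0, 1]
theta_final, iterations = train(xs, ys)
print(theta_final, iterations, cross_entropy(theta_final, xs, ys))
```

The loop ends either because the descent distance drops below `r` or because `n` iterations have run, which mirrors the two termination conditions in step (5).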
Three, parallel distributed architecture design
The invention also designs an optimal interleaved parallel architecture OIPA, as shown in FIG. 3. The OIPA architecture consists of 1 central node Chief and several child nodes Node; the central node Chief is connected to each child node in a star topology, and all child nodes are logically connected in a closed loop. Unlike a traditional centralized architecture, each node consists of 1 parameter service unit PServer and 1 computation service unit Worker. The data sets in the child-node Workers differ from one another and are distinguished as Worker_DS0, Worker_DS1 and so on; the union of the data sets in all child-node Worker_DS units equals the training data set, and the central-node data set Worker_DS is the complete data set. The parameter service unit PServer consists of several CPUs and is responsible only for transmitting and storing data, not for computation; the computation service unit Worker_DS consists of several GPUs and is responsible only for computation, not for data transmission.
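The topology and data layout described above can be sketched in a few lines. The node naming follows the text (Chief0, Node0, Node1, ...), but the helper functions and the striding scheme used to split the data are assumptions; the patent only requires that the child shards be disjoint and together cover the training set.

```python
def build_oipa_topology(num_children):
    """Star links from the chief to every child, plus a closed ring among children."""
    star = [("Chief0", f"Node{i}") for i in range(num_children)]
    ring = [(f"Node{i}", f"Node{(i + 1) % num_children}") for i in range(num_children)]
    return star, ring


def partition_dataset(samples, num_children):
    """Split the training set into disjoint shards (assumed strided split)."""
    return [samples[i::num_children] for i in range(num_children)]


star, ring = build_oipa_topology(4)
shards = partition_dataset(list(range(10)), 4)
print(star)
print(ring)
print(shards)
```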
When all nodes are ready, the model begins to train. In one iteration, each child-node Worker completes its own batch of training, computes the gradient, transmits the gradient to the PServer, and reads the model in the model folder to continue training. The child-node PServer transmits parameters such as the gradient computed by the node's Worker to the central node Chief, receives the model parameters of the central node, updates the model in the node's model folder for the Worker to call, and transmits the model parameters to the upstream and downstream nodes connected to it. The upstream and downstream nodes check the model parameters received from the child node and compare them with their own model parameters to decide whether to update their own model. The PServer in the central node Chief monitors and receives the model parameters transmitted by each child node, passes them to the Worker_DS in the node to test and evaluate the model parameters, and updates the model parameters in the model folder for the child nodes to call.
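The chief's "test, evaluate, keep the best" role might be sketched as follows. The function name, the representation of a model as a single number, and the use of a loss value as the evaluation criterion are assumptions; the patent does not specify how the optimal parameters are chosen beyond testing and evaluating them on the chief's Worker_DS.

```python
def chief_select_best(current_best, submissions, evaluate):
    """Keep whichever candidate parameters score lowest under evaluate()."""
    best_params, best_loss = current_best
    for params in submissions:
        loss = evaluate(params)  # stands in for evaluation on the chief's Worker_DS
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


def toy_loss(p):
    # Hypothetical evaluation: distance of a one-number "model" from the value 3.0.
    return (p - 3.0) ** 2


best = chief_select_best((0.0, toy_loss(0.0)), [1.0, 2.5, 5.0], toy_loss)
print(best)
```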
Compared with other architectures, the OIPA architecture designed by the invention has the following advantages:
(1) The data set with which each child node participates in training is fixed and unique. The union of all child-node data sets equals the training data set, so the data participating in training within a mini-batch are never repeated, and in the limiting case the whole data set can be trained within one mini-batch.
(2) Multiple paths ensure that the model parameters each child node trains with are always optimal. The training result of each child node is transmitted to the central node through the node's PServer; the central node aggregates all model parameters, such as the gradients computed by the Workers, and after testing and evaluation transmits the optimal model parameters to each child node. In addition, after a child node obtains the optimal model parameters, it promptly passes them to its upstream and downstream nodes so that they can update their own models in time, ensuring that the model parameters called by each node's Worker are optimal.
(3) The computation units of the child nodes train in an interleaved fashion without interruption. After a child node finishes training one mini-batch, it reads the model directly from its own model folder and continues training without waiting for model parameters to be transmitted. This keeps each node's computation unit training continuously and realizes interleaved training across the child nodes.
Four, OIPA architecture deployment
The OIPA architecture designed by the invention consists of PServer and Worker_DS processes. It is a distributed architecture designed mainly for multi-machine multi-card environments; it is deployed first in a single-machine multi-card environment and then in a multi-machine multi-card environment. The marine target detection model and data set are optimized for a single machine whose GPUs have limited memory, and with slight modification OIPA can be deployed in a single-machine multi-card environment. The advantage of single-machine multi-card (parallel) deployment is reduced communication overhead between tasks, while multi-machine multi-card (distributed) deployment uses multiple servers to separate parameter updating from graph computation, reducing the load on the servers as a whole.
1. Deployment of stand-alone multiple GPUs
The GPU (graphics processing unit) is one of the important factors affecting the training time of a deep neural network; deploying several GPUs in parallel on a single machine allows the training task to be completed efficiently. OIPA supports assigning operations to specific devices, so how tasks are allocated is critical. GPUs are good at large volumes of computation, so the whole inference and gradient computation is assigned to the GPUs and the parameter update is assigned to the CPU. The single-machine dual-GPU deployment is shown in FIG. 4: 2 batches are processed at a time, with each GPU handling the computation for one batch. Model parameters or computation graphs can be split across different devices, parameters are shared through variable names, and the variables (parameters) are stored on the CPU. First, the CPU distributes data to the 2 GPUs, and each GPU computes the gradient to be applied for its batch. Second, the CPU collects the gradients from the 2 GPUs, computes the average gradient, and updates the parameters with it. Third, these steps are repeated until training is complete. Note that gradient collection is synchronous: the CPU must wait for all GPUs to finish before averaging the gradients, so the training speed of the whole model depends on the slowest GPU card.
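The synchronous collect-average-update cycle can be sketched without any GPU code. The two gradient vectors below stand in for what each GPU would return for its batch; the function names, the learning rate and the numbers are assumptions for illustration.

```python
def average_gradients(per_gpu_grads):
    """Element-wise mean of the gradient vectors returned by each GPU."""
    num_gpus = len(per_gpu_grads)
    return [sum(values) / num_gpus for values in zip(*per_gpu_grads)]


def apply_update(params, avg_grad, lr):
    """Parameter update performed on the CPU with the averaged gradient."""
    return [p - lr * g for p, g in zip(params, avg_grad)]


# Assumed gradients from GPU 0 and GPU 1 for one pair of batches.
grads_gpu0 = [1.0, 2.0]
grads_gpu1 = [3.0, 6.0]

avg = average_gradients([grads_gpu0, grads_gpu1])
params = apply_update([1.0, 1.0], avg, lr=0.1)
print(avg, params)
```

Because the average needs every GPU's result, the update cannot start until the slowest device has finished, which is the synchronisation cost noted in the text.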
2. Deployment of multiple machines and multiple GPUs
Multi-machine multi-card means that several servers each have several GPU (graphics processing unit) devices; the performance of the machines is fully used by dividing the work among different working nodes. The deployment of the OIPA multi-machine multi-card experiment is shown in FIG. 5: clusters are formed from 2 to 5 machines, each equipped with 2 GPUs. The OIPA distributed machine learning framework divides the work into a parameter job (Parameter Job) and a work job (Worker Job). A Parameter Server (PS) runs the parameter job and is responsible for storing, updating and managing the parameters, while the work job is responsible for model computation. The distributed OIPA enables data transfer between jobs, namely forward propagation from the parameter job to the work job and backward propagation from the work job to the parameter job.
(1) Building a distributed environment
A Cluster is established, the jobs (Job) and tasks (Task) are assigned, a host address is allocated to each Task, and a service Server is created for each Task. When a Server is created, the Cluster must be passed to it so that each Server knows which Hosts its Cluster contains, which allows the Servers to communicate with one another. Each Server must be created on its own Host; once all Servers have been created on their respective Hosts, the whole Cluster is established and all Servers in the Cluster can communicate with each other. Each Server contains two components: Master and Worker. The Master provides the Master Service, which mainly provides remote access (RPC protocol) to each device in the Cluster; another important function is as Target to create tf. The Worker provides the Worker Service, which executes computation subgraphs on the local devices.
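The Cluster, Job, Task and Host relationship can be written down as a plain mapping from job names to task addresses. The host names, ports and helper functions below are hypothetical; the job names "ps" and "worker" follow the parameter job and work job of the text. A dictionary of this shape is also the form accepted by TensorFlow's `tf.train.ClusterSpec`, although this sketch deliberately avoids importing TensorFlow.

```python
# Hypothetical host names and ports; the source gives no real addresses.
cluster = {
    "ps": ["chief0.example:2222", "node0.example:2222", "node1.example:2222"],
    "worker": ["chief0.example:2223", "node0.example:2223", "node1.example:2223"],
}


def list_tasks(cluster_spec):
    """Enumerate (job, task_index, address) for every task in the cluster."""
    return [
        (job, index, address)
        for job, addresses in cluster_spec.items()
        for index, address in enumerate(addresses)
    ]


def peers_of(cluster_spec, job, index):
    """Addresses one server must know about: every task except itself."""
    return [
        address
        for other_job, other_index, address in list_tasks(cluster_spec)
        if (other_job, other_index) != (job, index)
    ]


print(list_tasks(cluster))
print(peers_of(cluster, "worker", 1))
```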
(2) Initiating a service
One Node is designated as the master node (Chief), which is responsible for managing the nodes, coordinating training among them, and performing common operations such as model initialization, saving and recovery.
(3) Begin training
One iteration proceeds as follows.
First, CPU0 in the parameter service unit of child node Node0 takes the data set DS0 assigned to this node out of the full data set according to the total number of nodes in the cluster, prepares 2 batches of training data according to the node's GPU count and training batch size, and distributes them to the 2 GPUs in Worker_DS0 for training. The resulting gradient ΔP is transmitted to CPU1 in the node's parameter service unit PServer, and CPU1 updates the model parameters with the aggregated gradient, after which training continues.
Second, after completing the specified number of training steps, Worker_DS0 of the node transmits the gradient of the last step to the model parameter folder of the master node Chief0 in the cluster, fetches the optimal model parameters from the model parameter folder of Chief0, stores them in the node's own model parameter folder, and updates its parameters for a new round of iteration. It then distributes the optimal model parameters to the parameter service units of the upstream and downstream nodes Node1 and Node2 connected to it.
Third, after verifying the freshness of the parameters, the parameter service units of Node1 and Node2 store the latest model parameters in their own model parameter folders, providing the latest model parameters for their next round of training.
Fourth, CPU1 in the PServer of the master node Chief0 monitors the model parameter folder, promptly reads the parameters transmitted by each node in the cluster, passes them to the node's Worker_DS to test and evaluate the model parameters, and stores the optimal model parameters in the model parameter folder so that each node can read the model from the model folder and continue training.
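The freshness verification performed by the neighbouring nodes could look like the sketch below. The patent does not say how freshness is measured, so using a training-step counter is an assumption, as are the `ModelSnapshot` type and the function name.

```python
from dataclasses import dataclass, field


@dataclass
class ModelSnapshot:
    step: int                      # assumed freshness marker: global training step
    params: list = field(default_factory=list)


def accept_if_fresher(local, incoming):
    """Return the snapshot a neighbour node should keep in its model folder."""
    if incoming.step > local.step:
        return incoming            # newer parameters replace the local copy
    return local                   # stale or duplicate parameters are ignored


node1_local = ModelSnapshot(step=120, params=[0.10, 0.20])
from_node0 = ModelSnapshot(step=150, params=[0.12, 0.18])
stale = ModelSnapshot(step=90, params=[0.30, 0.40])

kept = accept_if_fresher(node1_local, from_node0)
kept_again = accept_if_fresher(kept, stale)
print(kept.step, kept_again.step)
```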
Five, parallel training experiment of AdaRW algorithm
The designed OIPA framework is adopted to carry out parallel training 8250 times on the proposed optimized training algorithm AdaRW, the learning rate is set to be 0.01, and other parameters are set according to algorithm default values. And comparing the training result with the existing SGD algorithm, Adagarad algorithm and Adam algorithm to obtain a loss _ batch curve and an accuracy _ batch curve.
In the loss_batch curve, the AdaRW algorithm reaches a loss of 0.0766 in 750 seconds, with a standard deviation of 0.00015 and a mean of 0.2178; the Adam algorithm reaches a loss of 0.0594 in 660 seconds, with a standard deviation of 0.00010 and a mean of 0.2164; the SGD algorithm reaches a loss of 0.0869 in 668 seconds, with a standard deviation of 0.00031 and a mean of 0.2743; the Adagrad algorithm reaches a loss of 0.0535 in 638 seconds, with a standard deviation of 0.00026 and a mean of 0.2407. Taken together, the AdaRW algorithm is superior to the Adagrad and SGD algorithms and comparable to the Adam algorithm.
In the accuracy_batch curve, the AdaRW algorithm reaches an accuracy of 0.9983 in 750 seconds, with a standard deviation of 0.00009 and a mean of 0.9187; the Adam algorithm reaches an accuracy of 0.9992 in 660 seconds, with a standard deviation of 0.00006 and a mean of 0.9194; the SGD algorithm reaches an accuracy of 0.9947 in 668 seconds, with a standard deviation of 0.00013 and a mean of 0.8925; the Adagrad algorithm reaches an accuracy of 0.9977 in 638 seconds, with a standard deviation of 0.00010 and a mean of 0.9107. Taken together, AdaRW and Adam are the training algorithms best suited to ocean targets, and the standard deviation of the AdaRW algorithm is stated to be superior to that of the Adagrad algorithm and the Adam algorithm.
The proposed AdaRW algorithm and the other three algorithms were used to train the ocean target detection deep learning model OceanTDA9. Each optimized training algorithm was run for 8250 iterations with a batch of 100 samples per iteration. The experimental results are shown in Table 1. In the table, the test accuracy and test loss are the accuracy and loss computed on the test data set after model training has finished, and the average accuracy and average loss are the averages of the accuracy and loss over the last 20 training iterations. The results show that, apart from time consumption, the test accuracy, average accuracy and average loss of AdaRW are better than those of Adagrad and SGD, its test loss lies between those of Adagrad and SGD, its overall performance is comparable to that of the Adam optimization algorithm, and its standard deviation is better than that of Adam. In addition, the AdaRW algorithm proposed by the invention is simpler.
Table 1. Comparison of AdaRW with the other algorithms
Figure BDA0002993986670000101
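The curve statistics discussed above (final value, mean, and the average over the last 20 iterations) can be computed from a recorded per-batch curve roughly as follows. This is only a sketch: the function name `curve_summary` and the input list are hypothetical, and because the source does not state over which span its standard deviation is computed, the sketch computes it over the same final window purely as an assumption.

```python
import numpy as np


def curve_summary(values, tail=20):
    """Summarise a per-batch loss or accuracy curve.

    `tail` is the number of final iterations to average (20 in the text).
    The span used for the standard deviation is an assumption.
    """
    values = np.asarray(values, dtype=float)
    last = values[-tail:]
    return {
        "final": float(values[-1]),        # value at the last iteration
        "mean": float(values.mean()),      # mean over the whole curve
        "tail_mean": float(last.mean()),   # mean over the last `tail` iterations
        "tail_std": float(last.std()),     # population std over the same window
    }


# Hypothetical curve used only to exercise the function.
summary = curve_summary(list(range(1, 101)))
```

A smaller `tail_std` indicates a curve that has settled more steadily by the end of training, which is the sense in which the standard deviations of the algorithms are compared.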
The experimental data are dual-polarized Sentinel-1 SAR data in IW mode covering the Bohai Sea area (37°07′ to 40°56′ north latitude, 117°33′ to 122°08′ east longitude), 20 scenes in total, acquired from January to June 2016.
After the OceanTDA9 deep learning model was trained with the proposed AdaRW algorithm, it was used to detect suspected targets, and 90 suspected targets such as drilling platforms (each covering 28 × 28 pixels) were detected. The number of missed targets is 0, the detection rate is 100%, the false alarm rate is 10.9%, and the detection took 2.36 seconds, corresponding to a detection capability of about 58.5 km²/s on SAR images of 10 m resolution. The results are shown in Table 2.
Table 2. Detection results of marine targets
Figure BDA0002993986670000102
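The detection figures quoted above rest on a few simple ratios, which can be sketched as follows. The definitions are assumptions, since the source does not spell them out: the detection rate is taken as true detections over true detections plus misses, the false alarm rate as false alarms over all detections, and the detection capability as imaged area divided by processing time. The function names and the sample inputs are hypothetical.

```python
def detection_rate(true_detections, missed):
    """Fraction of real targets that were found (assumed definition)."""
    return true_detections / (true_detections + missed)


def false_alarm_rate(false_alarms, true_detections):
    """Fraction of all detections that were false (one common definition)."""
    return false_alarms / (false_alarms + true_detections)


def throughput_km2_per_s(pixels, resolution_m, seconds):
    """Imaged area in km^2 processed per second at a given pixel size."""
    area_km2 = pixels * resolution_m ** 2 / 1e6
    return area_km2 / seconds


# With no missed targets the detection rate is 1.0, whatever the target count.
rate = detection_rate(true_detections=90, missed=0)
# Hypothetical inputs: one million 10 m pixels processed in 2 seconds.
speed = throughput_km2_per_s(pixels=1_000_000, resolution_m=10, seconds=2.0)
```

Under the area-over-time reading, a capability of about 58.5 km²/s sustained for 2.36 seconds would correspond to roughly 138 km² of imagery.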
Experiments show that the test accuracy, average accuracy and average loss of the algorithm are better than those of the Adagrad and SGD algorithms, that its test loss lies between those of Adagrad and SGD, and that the standard deviation of its curve is better than that of the Adam optimization algorithm. The constructed ocean target detection deep learning model was trained with the algorithm, and the trained model was used to detect suspected targets in the study area, improving the efficiency of ocean target detection in polarimetric SAR images. The experimental results verify the effectiveness of the method.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.

Claims (3)

1. A method for training an ocean target detection deep learning model based on the AdaRW algorithm, characterized in that the AdaRW adaptive gradient training algorithm accumulates historical gradients over a finite window and uses the square root of $\Delta\theta_t$ in place of the hyperparameter η of the AdaGrad algorithm, the gradient accumulation window being the window obtained by counting back from the current time t to the historical time $t_m$; meanwhile, a multi-core parallel OIPA framework is designed, and the AdaRW algorithm is used for parallel training of the ocean target detection deep learning model; finally, the trained OceanTDA9_AdaRW model is used to detect suspected ocean targets;
the iterative update formula of the AdaRW adaptive gradient training algorithm is as follows:
Figure FDA0003574649380000011
$$\Delta\theta_t = \lambda\,\Delta\theta_{t-1} + (1-\lambda)\, g_t' \odot g_t' \tag{5}$$
where θ is a parameter, t is the current time, $t_m$ is the m-th historical time, and λ is a hyperparameter with 0 ≤ λ < 1; ε takes a small value to prevent the denominator from being 0; $g_t$ is the mini-batch stochastic gradient of the loss function J(θ);
the OIPA framework consists of 1 central Node Chief and a plurality of sub-nodes Node, wherein the central Node Chief is connected with each sub-Node in a star shape, and all the sub-nodes are logically connected in a closed loop;
each node of the OIPA framework consists of 1 parameter service unit PServer and 1 computation service unit Worker; the data sets in the child nodes' Workers differ from one another and are denoted Worker_DS0, Worker_DS1, Worker_DS2 and Worker_DS3; the data sets in all child nodes' Worker_DS together make up the training data set, and the data set in the central node's Worker_DS is the complete data set; the parameter service unit PServer consists of several CPUs and is responsible only for transmitting and storing data, not for computation; the computation service unit Worker_DS consists of several GPUs and is responsible only for computation, not for data transmission;
the training process based on the OIPA framework is as follows:
(1) CPU0 in the parameter service unit of child node Node0 takes the data set DS0 assigned to this node out of the full data set according to the total number of nodes in the cluster, prepares 2 batches of training data according to the node's GPU count and training batch size, and distributes them to the 2 GPUs in Worker_DS0 for training; the resulting gradients ΔP are sent to CPU1 in the node's parameter service unit PServer, and CPU1 updates the model parameters with the aggregated gradient, after which training continues;
(2) after completing the specified number of training steps, Worker_DS0 of the node sends the gradient of the last step to the model parameter folder of the master node Chief0 in the cluster, retrieves the optimal model parameters from the model parameter folder of Chief0, stores them in the node's own model parameter folder, updates its parameters for a new iteration, and then distributes the optimal model parameters to the parameter service units of the upstream and downstream nodes Node1 and Node2 connected to it;
(3) after verifying the freshness of the parameters, the parameter service units of Node1 and Node2 store the latest model parameters in their own model parameter folders, making them available for the next round of training on those nodes;
(4) CPU1 in the PServer of the master node Chief0 monitors the model parameter folder, promptly reads the parameters sent by each node in the cluster, passes them to the node's Worker_DS to test and evaluate the model parameters, and stores the optimal model parameters in the model parameter folder, from which each node reads the model to continue training.
2. The method for training the ocean target detection deep learning model based on the AdaRW algorithm is characterized by comprising the following steps:
S1, proposing the AdaRW adaptive gradient training algorithm and deriving its update formula;
S2, designing the optimal staggered parallel architecture OIPA;
S3, training in parallel with the proposed AdaRW algorithm on the designed OIPA framework to obtain the ocean target detection deep learning model OceanTDA9_AdaRW;
S4, detecting suspected targets in the ocean area with the trained OceanTDA9_AdaRW model.
3. The method for training the ocean target detection deep learning model based on the AdaRW algorithm is characterized in that the AdaRW algorithm comprises the following steps:
(1) determining a loss function, and adopting a cross entropy loss function as follows:
Figure FDA0003574649380000021
where θ is a parameter, $y_i$ is the input value of the i-th sample, and $h_\theta(x_i)$ is the output value of the i-th sample x;
(2) initializing the algorithm-related parameters: the hyperparameter λ, the gradient accumulation window size, and the values of $\theta_0, \theta_1, \ldots, \theta_n$;
(3) calculating the gradient of the loss function at the current position and saving the gradients at the times $t_m$;
Figure FDA0003574649380000022
(4) calculating the descent distance $d_i$ at the current position, obtained by multiplying the step size by the gradient;
Figure FDA0003574649380000023
(5) determining whether the gradient descent distance is less than the algorithm termination distance r or the number of training iterations has reached n; if so, terminating the algorithm; otherwise, going to step (6);
(6) updating all θ and going to step (1), the update function being as follows:
Figure FDA0003574649380000024
(7) terminating the algorithm and outputting the result.
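Two ingredients named in the claims, the finite gradient accumulation window and the recursion of equation (5), can be illustrated with the short Python sketch below. It is deliberately partial: the full AdaRW update formula is given in the source only as a figure and is not reproduced here, so the sketch does not combine the two quantities into a parameter update. The function names are hypothetical, accumulating squared gradients in the window follows AdaGrad and is an assumption, and the vector written $g_t'$ in equation (5) is simply whatever the caller passes as `g_prime`.

```python
from collections import deque

import numpy as np


def windowed_square_sum(window, grad):
    """Add the newest squared gradient to a finite window and return the sum.

    `window` is a deque created with maxlen=m, so the oldest entry is dropped
    automatically once more than m gradients have been seen.
    """
    window.append(np.asarray(grad, dtype=float) ** 2)
    return np.sum(list(window), axis=0)


def ema_square(prev_delta, g_prime, lam):
    """Equation (5): delta_t = lam * delta_{t-1} + (1 - lam) * g' * g' (elementwise)."""
    g_prime = np.asarray(g_prime, dtype=float)
    return lam * prev_delta + (1.0 - lam) * g_prime * g_prime


# Window of size 2: after three gradients only the last two contribute.
window = deque(maxlen=2)
for g in ([1.0, 2.0], [3.0, 0.0], [0.0, 1.0]):
    accum = windowed_square_sum(window, g)

# One step of equation (5), starting from a zero vector (assumed initial value).
delta = ema_square(np.zeros(2), [2.0, 0.0], lam=0.5)
```

The bounded deque is what distinguishes this accumulation from AdaGrad's unbounded sum: old gradients leave the window instead of shrinking the effective step size indefinitely.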
CN202110324328.9A 2021-03-26 2021-03-26 Ocean target detection deep learning model training method based on AdaRW algorithm Active CN113033098B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110324328.9A CN113033098B (en) 2021-03-26 2021-03-26 Ocean target detection deep learning model training method based on AdaRW algorithm

Publications (2)

Publication Number Publication Date
CN113033098A CN113033098A (en) 2021-06-25
CN113033098B true CN113033098B (en) 2022-05-17

Family

ID=76474298

Country Status (1)

Country Link
CN (1) CN113033098B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant