CN116452951B - Remote sensing information extraction model distributed training method based on central data pool

Remote sensing information extraction model distributed training method based on central data pool

Info

Publication number
CN116452951B
CN116452951B
Authority
CN
China
Prior art keywords
data
training
key
model
gpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310424978.XA
Other languages
Chinese (zh)
Other versions
CN116452951A (en)
Inventor
赫晓慧
李盼乐
程淅杰
乔梦佳
高亚军
李加冕
周涛
赵辉杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University
Original Assignee
Zhengzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou University filed Critical Zhengzhou University
Priority to CN202310424978.XA
Publication of CN116452951A
Application granted
Publication of CN116452951B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/098Distributed learning, e.g. federated learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/94Hardware or software architectures specially adapted for image or video understanding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/94Hardware or software architectures specially adapted for image or video understanding
    • G06V10/95Hardware or software architectures specially adapted for image or video understanding structured as a network, e.g. client-server architectures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/94Hardware or software architectures specially adapted for image or video understanding
    • G06V10/955Hardware or software architectures specially adapted for image or video understanding using specific electronic processors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a distributed training method for a remote sensing information extraction model based on a central data pool. Instead of TensorFlow's native distributed training, the method uses the Horovod framework to run ring-all-reduce synchronous gradient updates on a single-machine TensorFlow program, combines MPI cross-host communication with LMDB memory-mapped data access, and designs a data loading optimization algorithm built around a central data pool. By partitioning the data and controlling each process to dynamically load its data blocks from disk into memory, the method solves the problem of excessive memory consumption by each process of the distributed training framework.

Description

Remote sensing information extraction model distributed training method based on central data pool
Technical Field
The invention relates to the technical field of data training, in particular to a remote sensing information extraction model distributed training method based on a central data pool.
Background
Remote sensing image extraction is widely applied in fields such as meteorology, smart cities and the military. In recent years, with the continuous progress of remote sensing technologies such as high-resolution and wide-coverage imaging, remote sensing satellites are used ever more frequently and the volume of remote sensing image data has grown rapidly: a single frame can reach the GB level, and a whole orbit of data can exceed hundreds of GB or even the TB level, which brings abundant data to application fields but also increases the processing difficulty. In practice, when a remote sensing image extraction model is trained on a single machine, large-scale training sets and high-complexity models often cannot be stored completely, and their training time is intolerable. For example, placing TB-scale remote sensing data on one computer makes model training extremely slow under current data transfer and computer performance constraints, so the data are generally distributed. Likewise, deep learning models generate a large number of matrix operations during training, whose results and intermediate values are usually kept in main memory or GPU memory; some deep neural networks have so many parameters that the storage occupied by the CNN weight matrices may exceed the memory capacity of a single GPU. In that case the oversized CNN matrices must be partitioned into blocks and split across different graphics cards for cooperative computation. Distributed techniques for multi-node, multi-GPU cooperative parallel training are therefore particularly important for deep learning network training.
TensorFlow is the most mature deep learning framework with an industrial background. Since its release, backed by Google's strong engineering resources, its relatively stable performance and its comprehensive functionality, it has dominated the deep learning framework market in the artificial intelligence field; it is promoted and maintained by many open-source teams and has been updated to version 2.11. Research on distributed extraction of remote sensing image information based on TensorFlow therefore has significant practical value.
TensorFlow is a symbolic mathematics system based on dataflow programming whose core idea, as the name implies, is the flow of tensors (Tensor). It is also a library designed for fast numerical computation and automatic parallelization: parallelism is handled automatically by its execution engine, usually works well without much tuning, and supports executing and deploying parallel code on one or more CPUs and GPUs. Like Theano, TensorFlow builds mathematical expressions, but instead of compiling them to machine code it executes them in an external engine written in C++; the expression is essentially a static graph before the program runs and becomes a dataflow graph once the program starts and data are fed in. TensorFlow performs numerical computation on a dataflow graph in which each node represents a mathematical operation and the edges between nodes represent the multidimensional arrays passed between them. A static graph must be defined in advance before use; this computation mode gives TensorFlow strong portability, a complete ecosystem and good performance, and its flexible architecture lets developers run computation tasks on a variety of platforms. However, when training a neural network, every node loads the complete training data onto the computing node. This reduces the latency of transferring training data to some extent and speeds up training, but it places a heavy memory burden on the computing nodes and limits their ability to process large-scale data.
The existing TensorFlow distributed deep learning framework uses gRPC as the cross-host communication layer and implements an all-reduce algorithm with synchronous ring gradient updates through Nvidia's NCCL. (1) gRPC: the concept of RPC remote services is widely used in distributed systems, and its performance plays a vital role in the communication between nodes during distributed training. RPC implementations are generally based on transport protocols such as XML or JSON serialization and HTTP. gRPC is a particular RPC protocol framework that is cross-language and cross-platform; its data messages are custom-designed on the HTTP/2.0 protocol standard and support bidirectional streaming, header compression, request multiplexing and other features. (2) Data allocation: in multi-node distributed training, the data set must be sharded to guarantee convergence and reproducibility of the model. Data set sharding based on remote calls is shown in fig. 1. Suppose the cluster has 4 computing nodes, each equipped with 4 graphics cards, and each computing node holds a data copy representing the complete data set. If the training data of each batch has length L, then when the current batch is trained it is divided evenly among all nodes, i.e. each node receives data of length L/4; because each thread in a node is responsible for training one GPU, the L/4 portion is subdivided among the threads of the node, so each thread handles L/16 of the current batch (a small sketch of this split follows this paragraph). After each GPU in a node performs forward and backward propagation of the model with the training data it is responsible for, 4 gradients of the model update parameters are obtained, and these are summed and averaged on the CPU. Finally, every node in the cluster receives and updates the model parameters through the all-reduce algorithm. When the computing cluster uses the original TensorFlow distributed framework for deep learning training, the training process on every computing node repeatedly loads a complete copy of the training data into its own memory, so the copy size is limited by the memory capacity of each node, and computer memory is usually expensive and of limited capacity. This training approach, with its obvious data redundancy, restricts the size of the model training set, so the neural network model as a whole cannot learn sufficiently, resulting in poor performance.
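A minimal sketch of this even split, with node/GPU indices and a batch length chosen only for illustration (the function name and values are not from the patent):

# Hypothetical illustration of the even batch split described above: a batch of
# length L is divided across the nodes, then across the GPU threads of each node.
def shard_bounds(batch_len, num_nodes, gpus_per_node, node_id, gpu_id):
    per_node = batch_len // num_nodes        # L/4 in the 4-node example
    per_gpu = per_node // gpus_per_node      # L/16 with 4 GPUs per node
    start = node_id * per_node + gpu_id * per_gpu
    return start, start + per_gpu            # [start, end) slice handled by one GPU thread

# With L = 64, 4 nodes and 4 GPUs per node, every GPU thread receives 4 samples.
print(shard_bounds(64, 4, 4, node_id=1, gpu_id=2))   # -> (24, 28)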
Therefore, a distributed training method for remote sensing information extraction models based on a central data pool, which partitions the data and controls each process to dynamically load data blocks from disk into memory, thereby solving the problem of excessive memory consumption by each process of the distributed training framework and improving the space efficiency of the distributed training algorithm, is a problem worth researching.
Disclosure of Invention
The invention aims to provide a distributed training method for remote sensing information extraction models based on a central data pool, which solves the problem of excessive memory consumption by each process of the distributed training framework by partitioning the data and controlling the processes to dynamically load data from disk into memory.
The purpose of the invention is realized in the following way:
the remote sensing information extraction model distributed training method based on the central data pool comprises the following steps:
step S1, establishing an LMDB data set: LMDB data are introduced for distributed deep learning model training; the LMDB data set is produced from the original data through LMDB database primitives and provides the data source for data parallelism;
step S2, generating a data block pointer set: based on the LMDB data set established in step S1, the data set is cut in real time at training initialization by a data segmentation method built on the central data pool, generating a data block pointer set for network training and providing training data addresses for the naive distributed training;
step S3, carrying out naive distributed training: based on the data block pointer set generated in step S2, training data are mapped from disk to memory through the central data pool data loading method, and the cluster then performs multi-machine distributed deep learning training of the model under naive gradient descent using the mapped data.
The specific steps of step S1 are as follows: the image data are cut into a number of small data blocks and standardized; each data block is numbered using the primitive operations, and a unique address value is assigned to each data block through the set<key, value> key-value pair relation; finally the whole data set is obtained, where each set variable forms a single data block entity and the whole data set is a set collection;
Set collection: {set1, set2, ..., setn-1, setn}, where n represents the number of the n-th data block.
The specific steps of step S2 are as follows: based on the set collection generated in step S1, the key values in the collection are extracted to form a key set, which is then randomly shuffled to form an out-of-order key set; the shuffled data effectively improve the training precision of the model;
Out-of-order key set: {key3, key1, key4, ..., keyn, keyn-1}, where n represents the number of the n-th data block.
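A minimal sketch of this key handling in step S2, assuming the keys of step S1 were written as zero-padded block numbers (the exact key format is not specified here):

import random

# Collect the keys assigned in step S1 and shuffle them into the out-of-order key set
# that serves as the data block pointer set; the key naming is an illustrative assumption.
n_blocks = 1024
keys = [f"set{i:08d}" for i in range(1, n_blocks + 1)]
random.seed(42)        # a fixed seed keeps the shuffled order identical on every process
random.shuffle(keys)   # out-of-order key set, e.g. [key3, key1, key4, ...]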
The specific steps of the step S3 are as follows:
step S3.1, mapping training sample data from the file system into memory space: based on the out-of-order key set generated in step S2, a method getValue(key) is provided to obtain a data block through its address value, with getValue(key_n) = D_n; the batch training sample set is {(x_p, y_p)}, p = 1, ..., M, where x_p denotes the training data, y_p the corresponding label, M the number of graphics cards, and p the process number of the graphics card holding the current sample data block;
let the number of data set iterations be N, the mini-batch size be b, the current step be s, and the global step count step_per_epoch be the key set length divided by the batch size, i.e. len(key)/b; let the process number be rank, the total number of GPU processes be j, and the data block pointer address (key) be pointer; the training process and steps are as follows:
first, the CPU side initializes the model parameters w;
for each iteration:
(1) each GPU training task process computes its block pointer as pointer = rank + (j × s) and obtains the corresponding sample through getValue(key) at that pointer;
(2) the j GPUs train in parallel;
(3) after training, each GPU completes forward and backward propagation of the neural network in its computing node based on the obtained data block and performs gradient descent by mini-batch stochastic gradient descent, finally obtaining the updated model parameters, as shown in the following formula:
w_s^rank = w_{s-1} - η · D_rank
where w_{s-1} denotes the global model parameters of the previous step, D_rank the gradient of the model variables at the current step, η the learning rate, and w_s^rank the model update parameters that the process of graphics card number rank is responsible for at the current step;
step S3.2, each GPU completes the model parameter update through the all-reduce communication architecture: based on its result w_s^rank from step (3) of step S3.1, each GPU performs a global model parameter update through the all-reduce communication architecture.
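A minimal runnable sketch of this per-rank loop, using mpi4py's Allreduce in place of the Horovod/NCCL all-reduce detailed next; the key set, the getValue stand-in, the model and the learning rate are toy placeholders, not the patent's actual components:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, j = comm.Get_rank(), comm.Get_size()     # process number and total number of GPU processes

keys = [f"set{i:08d}" for i in range(1, 65)]   # shuffled key set from step S2 (placeholder)
def get_value(key):                            # stands in for getValue(key): the LMDB block read
    rng = np.random.default_rng(int(key[3:]))  # deterministic toy data per block
    x = rng.normal(size=(8, 4))
    y = rng.normal(size=(8,))
    return x, y

eta, w = 0.01, np.zeros(4)                     # learning rate and model parameters w
comm.Bcast(w, root=0)                          # the root (CPU side) initializes w for every rank

for s in range(len(keys) // j):                # assume one data block per process per step
    pointer = rank + j * s                     # block pointer unique to this rank and step
    x, y = get_value(keys[pointer])            # map the assigned block from disk into memory
    grad = 2 * x.T @ (x @ w - y) / len(y)      # gradient D_rank of a toy least-squares loss
    avg = np.empty_like(grad)
    comm.Allreduce(grad, avg, op=MPI.SUM)      # all-reduce: sum the j per-rank gradients
    w -= eta * avg / j                         # averaged mini-batch SGD update of w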
The all-reduce in step S3.2 specifically comprises the following steps:
(1) the model parameters are divided equally into j parts, where j is the number of GPUs;
(2) scatter-reduce: the GPUs exchange data j-1 times, so that each GPU ends up holding the final (fully reduced) result for one part of the model;
(3) all-gather: the GPUs exchange the partial final results they hold j-1 times, so that every GPU finally obtains the complete final result;
(4) the parameters stored on each GPU are averaged to complete the model parameter update of the current step (a simulation sketch of these phases follows).
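A self-contained simulation of these four steps under the stated assumptions (j simulated GPUs, each holding its parameter vector already split into j chunks); the function name and data are illustrative:

import numpy as np

def ring_allreduce(chunks_per_gpu):
    # chunks_per_gpu[g][c] is chunk c of the parameter vector held by GPU g.
    j = len(chunks_per_gpu)
    data = [[np.array(c, dtype=float) for c in gpu] for gpu in chunks_per_gpu]
    # (2) scatter-reduce: j-1 exchanges; afterwards GPU g holds the full sum of chunk (g+1) % j.
    for step in range(j - 1):
        for g in range(j):
            c = (g - step) % j                       # chunk GPU g passes to its right neighbour
            data[(g + 1) % j][c] = data[(g + 1) % j][c] + data[g][c]
    # (3) all-gather: j-1 exchanges; every GPU ends up with every fully reduced chunk.
    for step in range(j - 1):
        for g in range(j):
            c = (g + 1 - step) % j                   # fully reduced chunk GPU g forwards
            data[(g + 1) % j][c] = data[g][c]
    # (4) average the stored parameters on each GPU to finish the update of the current step.
    return [[chunk / j for chunk in gpu] for gpu in data]

# Example: 4 GPUs, each holding the constant vector g split into 4 chunks of length 2;
# every chunk on every GPU ends up as the average (0+1+2+3)/4 = 1.5.
result = ring_allreduce([[np.full(2, g) for _ in range(4)] for g in range(4)])

In each phase every GPU sends j-1 chunks of size N/j, which is where the per-GPU communication volume 2(K-1)N/K quoted later in the description comes from.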
The beneficial effects of the invention are as follows: the method is analyzed in terms of data parallelism and communication strategy, pointing out its bottleneck in data loading. On the basis of the Horovod high-level parallel framework and with reference to the MPI message-passing model, one graphics card is bound to each MPI process, and all randomly shuffled training data are abstracted into a central data pool that continuously supplies training data to every process (single graphics card). Combined with the LMDB memory-mapped data technique, the data pool is divided sequentially into numbered data blocks; by constructing the central data pool and the data numbering, the data reading and transfer procedure of each process is redefined, and each process obtains from the central data pool the data uniquely matched to it by combining its number with the current training batch number. This saves memory resources and achieves efficient, stable multi-node, multi-GPU computation on large-scale data.
Drawings
FIG. 1 is an exemplary image of a Huiji road dataset tested in accordance with the present invention and corresponding labels;
FIG. 2 is an MPI programming process of the present invention;
FIG. 3 is a distributed training program start-up flowchart of the present invention;
FIG. 4 is a schematic diagram of a central data pool configuration of the present invention;
FIG. 5 is a pseudo code of a central data pool distributed training algorithm of the present invention;
FIG. 6 is a diagram of the all-reduce communication mode of the present invention;
FIG. 7 is a diagram of the all-reduce gradient update model of the present invention;
fig. 8 is a diagram of an algorithm architecture of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and examples.
A distributed training method for remote sensing information extraction models based on a central data pool. Fig. 1 shows a remote sensing image in tif format, i.e. the original image of the LMDB data in step S1. First, in order to reduce the hardware requirements of model training, the image needs to be cut into image blocks with a smaller side length and renamed according to the rule "file name + row number + column number". The image data are then converted into a sample data file in LMDB format using the database operation primitives shown below.
env = lmdb.open(): create an LMDB environment
txn = env.begin(): start a transaction
txn.put(key, value): insert or modify a record
txn.delete(key): delete a record
txn.get(key): query a record
txn.cursor(): traverse the records
txn.commit(): commit the changes
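A minimal sketch of this conversion with the Python lmdb binding, using the primitives listed above; the path, key format, tile shape and map_size are illustrative assumptions, not values from the patent:

import lmdb
import numpy as np

tiles = [np.zeros((256, 256, 3), dtype=np.uint8) for _ in range(4)]   # placeholder image tiles
env = lmdb.open("train_lmdb", map_size=1 << 30)   # create the LMDB environment (1 GB map)
with env.begin(write=True) as txn:                # start a write transaction
    for i, tile in enumerate(tiles, start=1):
        key = f"set{i:08d}".encode()              # unique address value (the key)
        txn.put(key, tile.tobytes())              # insert the <key, value> record
env.sync()                                        # flush the memory-mapped file to disk
env.close()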
Each record in the LMDB-format data has the form set<key, value>, where key is the sample name (address value) and value is the sample value. A unique address value is assigned to each image according to the set<key, value> key-value pair relation, so that the image value (value) and the address value (key) are in one-to-one correspondence; finally the whole image sample data set is obtained, where each set variable forms a single image block entity and the whole data set is a set collection.
Set collection: {set_1, set_2, ..., set_{n-1}, set_n}, where n represents the number of the n-th image block.
After the set image collection is obtained, the images need to be standardized. Image standardization centers the data on the mean; according to convex optimization theory and the probability distribution of the data, centered data better conform to the data distribution law, which improves the ability of the data to express features and accelerates the convergence of the model. The normalization formula is as follows:
the above formula shows the data normalization process under the background of m images, wherein mu β Mean, sigma β As a function of the variance of the values,the normal standardized result of the image is obtained.
As shown in fig. 2, MPI program writing begins after the LMDB sample data set is obtained. MPI commands are then embedded into the debugged single-machine program, and the computing nodes are started and monitored fully automatically through MPI library operations. Finally, distributed deep learning training is launched through a Bash script.
The Bash script is shown below
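As a purely illustrative example of such a launch (the host names and process counts are assumptions, not the patent's actual script), a Horovod job spanning two 4-GPU nodes can be started with a command of the form: horovodrun -np 8 -H node1:4,node2:4 python train.py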
The result of successful initiation of the distributed training task is shown in fig. 3.
After the task is successfully started, each computing node loads data in a distributed manner for distributed training. First, the out-of-order sample image data block address set is generated according to step S2, and then each GPU obtains data through the image block address according to the rule of step S3 to train the model. The core idea of step S3 is shown in fig. 4: all data used for training are abstracted into a central data pool in the cluster; the local memory of the computing nodes no longer stores data copies, and instead the data blocks allocated to each computing node are fetched on demand from the globally unique central data pool according to the training progress. This data loading idea releases the memory of the computing nodes to a great extent, although the extra data block transfers cause additional time overhead, which the memory-mapped file technique of LMDB can reduce.
The invention ignores the node boundaries in the cluster and regards each GPU as an independently trainable process in the distributed system. Each process is first numbered individually; the training data are then cut according to the preset batch data length (mini-batch) used by each GPU for training, and the arrangement order of the cut data blocks is shuffled. Meanwhile, at training initialization, the update step of the global parameter gradient of the model is calculated from the total length of the data blocks consumed by the cluster in each step, so that the data blocks of the central data pool and the training processes can be placed in one-to-one correspondence through the step and the process number. Each LMDB data block has a corresponding pointer, and each GPU uses memory mapping to read its data block directly and quickly through the pointer address from the disk where the central data pool resides into memory (see the read-side sketch below). Finally, the GPUs, as basic units, form an all-reduce structure for gradient scattering and aggregation to complete model training.
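A minimal sketch of the read side of the central data pool, mirroring the illustrative writer sketch above (the key format and tile shape are the same assumptions); every GPU process opens the same LMDB file read-only and maps blocks straight from disk into memory by key:

import lmdb
import numpy as np

env = lmdb.open("train_lmdb", readonly=True, lock=False)    # shared, memory-mapped, read-only

def get_value(txn, key):
    buf = txn.get(key.encode())                             # lookup in the memory-mapped file
    return np.frombuffer(buf, dtype=np.uint8).reshape(256, 256, 3)

with env.begin(buffers=True) as txn:                        # buffers=True avoids extra copies
    block = get_value(txn, "set00000042")                   # key derived from rank and step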
Furthermore, during data loading the key values in the set collection are extracted to form a key set, which is then randomly shuffled to form an out-of-order key set. The shuffled data effectively improve the training precision of the model.
Out-of-order key set: {key_3, key_1, key_4, ..., key_n, key_{n-1}}, where n represents the number of the n-th image block.
Based on the out-of-order key set generated in step S2, each MPI process is trained in a distributed manner using the algorithm shown in fig. 5.
Further, when each GPU performs distributed training, the gradient exchange follows the all-reduce model parameter update method shown in fig. 6.
The communication volume per GPU in the all-reduce method is given by:
Data Transferred = 2(K-1)N/K
Communication is divided into two phases: scatter-reduce and all-gather. Each of the K GPUs sends and receives K-1 times in the scatter-reduce phase and K-1 times in the all-gather phase, and each time a GPU sends N/K values, where N is the total number of values to be summed across the GPUs (i.e. the total size of the array that must be stored on each GPU). For example, with K = 8 GPUs each GPU transfers 2 × (8-1)/8 × N = 1.75N values in total, a figure that approaches 2N as K grows.
Furthermore, the communication topology of the all-reduce method consists of decentralized nodes: it omits the parameter server and distributes the communication load evenly over the computing nodes, ensuring load balance across all nodes, so the all-reduce method has no central bottleneck at a parameter server. As shown in fig. 7, L_i (0 ≤ i < n) represents the i-th layer of the deep neural network. Back-propagation computation is performed serially with the synchronous gradient communication, so the all-reduce gradient synchronization waits until the gradient computation of all network layers (L_0, ..., L_{n-1}) is completed before it starts.
Further, regarding the gradient update concept: for a given data set D, f(x, W) denotes the output of the neural network with input x and parameters W, and l(·) is defined as the loss function; the final objective of training is to minimize the loss l(f(x, W), y), where y is the label. The update formula of the neural network weights is:
W_{t+1} = W_t - (η/K) · Σ_{i=1}^{K} ∇l(f(x_i, W_t), y_i)
where K is the number of samples per batch in batch training (in this method, the number of GPUs), η is the learning rate, and ∇l(f(x_i, W_t), y_i) is the gradient of the i-th sample.
The architecture of the present algorithm is shown in fig. 8. A single distributed training program written in pure Keras semantics is wrapped with Horovod commands, and each training process is then started through an MPI command that refers to the cluster configuration file.
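A minimal sketch of such a Horovod-wrapped Keras program; the model, data and optimizer settings are placeholders, not the patent's actual extraction network:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                                   # one process per GPU, started by MPI
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")   # pin each process to its GPU

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])      # placeholder for the extraction model
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="mse", optimizer=opt)

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]    # sync initial weights from rank 0
x, y = tf.random.normal((64, 4)), tf.random.normal((64, 1))  # stand-in for central data pool blocks
model.fit(x, y, batch_size=8, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)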

Claims (1)

1. A distributed training method for a remote sensing information extraction model based on a central data pool, characterized by comprising the following steps:
step S1, establishing an LMDB data set: LMDB data are introduced for distributed deep learning model training; the LMDB data set is produced from the original data through LMDB database primitives and provides the data source for data parallelism; the specific steps of step S1 are as follows: the image data are cut into a number of small data blocks and standardized; each data block is numbered using the primitive operations, and a unique address value is assigned to each data block through the set<key, value> key-value pair relation, so that the data block value and the address value key are in one-to-one correspondence; finally the whole data set is obtained, where each set variable forms a single data block entity and the whole data set is a set collection;
the set collection is {set1, set2, ..., setn-1, setn}, where n represents the number of the n-th data block;
step S2, generating a data block pointer set: based on the LMDB data set established in step S1, the data set is cut in real time at training initialization by a data segmentation method built on the central data pool, generating a data block pointer set for network training and providing training data addresses for the naive distributed training; the specific steps of step S2 are as follows: based on the set collection generated in step S1, the key values in the collection are extracted to form a key set, which is then randomly shuffled to form an out-of-order key set, the shuffled data effectively improving the training precision of the model; the out-of-order key set is {key3, key1, key4, ..., keyn, keyn-1}, where n represents the number of the n-th data block;
step S3, carrying out naive distributed training: mapping training data from a disk to a memory through a central data pool data loading method based on the data block pointer set generated in the step S2, and then carrying out multi-machine distributed deep learning training on the model under the condition of naive gradient descent by the cluster according to the mapping data; the specific steps of the step S3 are as follows:
step S3.1, mapping training sample data from the file system into memory space: based on the out-of-order key set generated in step S2, a method getValue(key) is provided to obtain a data block through its address value, with getValue(key_n) = D_n; the batch training sample set is {(x_p, y_p)}, p = 1, ..., M, where x_p denotes the training data, y_p the corresponding label, M the number of graphics cards, and p the process number of the graphics card holding the current sample data block;
let the number of data set iterations be N, the mini-batch size be b, the current step be s, and the global step count step_per_epoch be the key set length divided by the batch size, i.e. len(key)/b; let the process number be rank, the total number of GPU processes be j, and the data block pointer address key be pointer; the training process and steps are as follows:
first, the CPU side initializes the model parameters w;
for each iteration:
(1) each GPU training task process computes its block pointer as pointer = rank + (j × s) and obtains the corresponding sample through getValue(key) at that pointer;
(2) j GPUs perform parallel training;
(3) after training, each GPU completes forward and backward propagation of the neural network in its computing node based on the obtained data block and performs gradient descent by mini-batch stochastic gradient descent, finally obtaining the updated model parameters, as shown in the following formula:
w_s^rank = w_{s-1} - η · D_rank
where w_{s-1} denotes the global model parameters of the previous step, D_rank the gradient of the model variables at the current step, η the learning rate, and w_s^rank the model update parameters that the process of graphics card number rank is responsible for at the current step;
step S3.2, each GPU completes the model parameter update through the all-reduce communication architecture: based on its result w_s^rank from step (3) of step S3.1, each GPU performs a global model parameter update through the all-reduce communication architecture; the all-reduce in step S3.2 specifically comprises the following steps:
(1) the model parameters are divided equally into j parts, where j is the number of GPUs;
(2) scatter-reduce: the GPUs exchange data j-1 times, so that each GPU ends up holding the final (fully reduced) result for one part of the model;
(3) all-gather: the GPUs exchange the partial final results they hold j-1 times, so that every GPU finally obtains the complete final result;
(4) the parameters stored on each GPU are averaged to complete the model parameter update of the current step.
CN202310424978.XA 2023-04-18 2023-04-18 Remote sensing information extraction model distributed training method based on central data pool Active CN116452951B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310424978.XA CN116452951B (en) 2023-04-18 2023-04-18 Remote sensing information extraction model distributed training method based on central data pool

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310424978.XA CN116452951B (en) 2023-04-18 2023-04-18 Remote sensing information extraction model distributed training method based on central data pool

Publications (2)

Publication Number Publication Date
CN116452951A CN116452951A (en) 2023-07-18
CN116452951B 2023-11-21

Family

ID=87119865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310424978.XA Active CN116452951B (en) 2023-04-18 2023-04-18 Remote sensing information extraction model distributed training method based on central data pool

Country Status (1)

Country Link
CN (1) CN116452951B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059066A (en) * 2019-02-26 2019-07-26 中科遥感(深圳)卫星应用创新研究院有限公司 The method of spark combination tensorflow progress remote sensing image information extraction
CN111797833A (en) * 2020-05-21 2020-10-20 中国科学院软件研究所 Automatic machine learning method and system oriented to remote sensing semantic segmentation
US10817392B1 (en) * 2017-11-01 2020-10-27 Pure Storage, Inc. Ensuring resiliency to storage device failures in a storage system that includes a plurality of storage devices
CN112464784A (en) * 2020-11-25 2021-03-09 西安烽火软件科技有限公司 Distributed training method based on hybrid parallel
CN112463056A (en) * 2020-11-28 2021-03-09 苏州浪潮智能科技有限公司 Multi-node distributed training method, device, equipment and readable medium
CN113515370A (en) * 2021-04-28 2021-10-19 之江实验室 Distributed training method for large-scale deep neural network
CN113971455A (en) * 2020-07-24 2022-01-25 腾讯科技(深圳)有限公司 Distributed model training method and device, storage medium and computer equipment
CN115906999A (en) * 2023-01-05 2023-04-04 中国科学技术大学 Management platform of large-scale reinforcement learning training task based on Kubernetes cluster

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220334990A1 (en) * 2014-07-03 2022-10-20 Pure Storage, Inc. Zone drive data format
WO2021163333A1 (en) * 2020-02-12 2021-08-19 Feedzai - Consultadoria E Inovacão Tecnológica, S.A. Interleaved sequence recurrent neural networks for fraud detection
US20210303164A1 (en) * 2020-03-25 2021-09-30 Pure Storage, Inc. Managing host mappings for replication endpoints
CN111709533B (en) * 2020-08-19 2021-03-30 腾讯科技(深圳)有限公司 Distributed training method and device of machine learning model and computer equipment
US11869668B2 (en) * 2021-05-28 2024-01-09 Tempus Labs, Inc. Artificial intelligence based cardiac event predictor systems and methods

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10817392B1 (en) * 2017-11-01 2020-10-27 Pure Storage, Inc. Ensuring resiliency to storage device failures in a storage system that includes a plurality of storage devices
CN110059066A (en) * 2019-02-26 2019-07-26 中科遥感(深圳)卫星应用创新研究院有限公司 The method of spark combination tensorflow progress remote sensing image information extraction
CN111797833A (en) * 2020-05-21 2020-10-20 中国科学院软件研究所 Automatic machine learning method and system oriented to remote sensing semantic segmentation
CN113971455A (en) * 2020-07-24 2022-01-25 腾讯科技(深圳)有限公司 Distributed model training method and device, storage medium and computer equipment
CN112464784A (en) * 2020-11-25 2021-03-09 西安烽火软件科技有限公司 Distributed training method based on hybrid parallel
CN112463056A (en) * 2020-11-28 2021-03-09 苏州浪潮智能科技有限公司 Multi-node distributed training method, device, equipment and readable medium
WO2022111042A1 (en) * 2020-11-28 2022-06-02 苏州浪潮智能科技有限公司 Multi-node distributed training method and apparatus, device and readable medium
CN113515370A (en) * 2021-04-28 2021-10-19 之江实验室 Distributed training method for large-scale deep neural network
CN115906999A (en) * 2023-01-05 2023-04-04 中国科学技术大学 Management platform of large-scale reinforcement learning training task based on Kubernetes cluster

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MemcachedGPU: Scaling-up Scale-out Key-value Stores; Tayler H. Hetherington et al.; Symposium on Cloud Computing; pp. 1-12 *
Research on key technologies for massive spatial data processing in cloud environments; 黄伟; China Doctoral Dissertations Full-text Database, Basic Sciences (No. 6); pp. A008-12 *

Also Published As

Publication number Publication date
CN116452951A (en) 2023-07-18

Similar Documents

Publication Publication Date Title
Zheng et al. Distdgl: distributed graph neural network training for billion-scale graphs
Gunarathne et al. Scalable parallel computing on clouds using Twister4Azure iterative MapReduce
US9619491B2 (en) Streamlined system to restore an analytic model state for training and scoring
US9953003B2 (en) Systems and methods for in-line stream processing of distributed dataflow based computations
Zou et al. Mariana: Tencent deep learning platform and its applications
CN111258744A (en) Task processing method based on heterogeneous computation and software and hardware framework system
US11630864B2 (en) Vectorized queues for shortest-path graph searches
US8373710B1 (en) Method and system for improving computational concurrency using a multi-threaded GPU calculation engine
WO2023179415A1 (en) Machine learning computation optimization method and platform
Song et al. Bridging the semantic gaps of GPU acceleration for scale-out CNN-based big data processing: Think big, see small
Ekanayake et al. Dryadlinq for scientific analyses
CN111444134A (en) Parallel PME (pulse-modulated emission) accelerated optimization method and system of molecular dynamics simulation software
US11222070B2 (en) Vectorized hash tables
CN115437760A (en) Computing resource allocation method, electronic device, storage medium, and program product
Kim et al. Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster
Moreno-Alvarez et al. Heterogeneous model parallelism for deep neural networks
Xia et al. Redundancy-free high-performance dynamic GNN training with hierarchical pipeline parallelism
CN116452951B (en) Remote sensing information extraction model distributed training method based on central data pool
Wang et al. Auto-map: A DQN framework for exploring distributed execution plans for DNN workloads
Anthony et al. Efficient training of semantic image segmentation on summit using horovod and mvapich2-gdr
Anwar et al. Recommender system for optimal distributed deep learning in cloud datacenters
Marques et al. A cloud computing based framework for general 2D and 3D cellular automata simulation
US11194625B2 (en) Systems and methods for accelerating data operations by utilizing native memory management
Li et al. Optimizing machine learning on apache spark in HPC environments
Xu et al. Efficient supernet training using path parallelism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant