CN116452951B - Remote sensing information extraction model distributed training method based on central data pool

Remote sensing information extraction model distributed training method based on central data pool

Info

Publication number
CN116452951B
CN116452951B
Authority
CN
China
Prior art keywords
data
training
key
model
gpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310424978.XA
Other languages
Chinese (zh)
Other versions
CN116452951A (en)
Inventor
赫晓慧
李盼乐
程淅杰
乔梦佳
高亚军
李加冕
周涛
赵辉杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University
Original Assignee
Zhengzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou University filed Critical Zhengzhou University
Priority to CN202310424978.XA
Publication of CN116452951A
Application granted
Publication of CN116452951B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/098Distributed learning, e.g. federated learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/94Hardware or software architectures specially adapted for image or video understanding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/94Hardware or software architectures specially adapted for image or video understanding
    • G06V10/95Hardware or software architectures specially adapted for image or video understanding structured as a network, e.g. client-server architectures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/94Hardware or software architectures specially adapted for image or video understanding
    • G06V10/955Hardware or software architectures specially adapted for image or video understanding using specific electronic processors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a distributed training method for a remote sensing information extraction model based on a central data pool. Instead of TensorFlow's native distributed training, the method uses the Horovod framework to run ring-all-reduce synchronous gradient updates on a single-machine TensorFlow program, combines MPI cross-host communication with LMDB memory-mapped data access, and designs a data loading optimization algorithm built around a central data pool. By partitioning the data and controlling each process to dynamically load its data blocks from disk into memory, the method solves the problem of excessive memory consumption by each process of the distributed training framework.

Description

Remote sensing information extraction model distributed training method based on central data pool
Technical Field
The invention relates to the technical field of data training, in particular to a remote sensing information extraction model distributed training method based on a central data pool.
Background
Remote sensing image extraction is widely applied in fields such as meteorology, smart cities and the military. In recent years, with the continuous progress of remote sensing technologies such as high-resolution and wide-coverage imaging, remote sensing satellites are used ever more frequently and the volume of remote sensing image data has grown rapidly: a single frame can reach the GB level, and a whole orbit of data can exceed hundreds of GB or even the TB level, which brings abundant data to application fields but also increases the processing difficulty. In practice, when a remote sensing image extraction model is trained on a single machine, large-scale training sets and high-complexity models often cannot be stored completely, and their training time is intolerable. For example, placing TB-scale remote sensing data on one computer makes model training extremely slow under current data transfer and computer performance constraints, so the data are generally distributed. Likewise, deep learning models generate a large number of matrix operations during training, whose results and intermediate values are usually kept in main memory or GPU memory; some deep neural networks have so many parameters that the storage occupied by the CNN weight matrices may exceed the memory capacity of a single GPU. In that case the oversized CNN matrices must be partitioned into blocks and split across different graphics cards for cooperative computation. Distributed techniques for multi-node, multi-GPU cooperative parallel training are therefore particularly important for deep learning network training.
TensorFlow is the most mature deep learning framework with an industrial background. Since its release, backed by Google's strong engineering resources, its relatively stable performance and its comprehensive functionality, it has dominated the deep learning framework market in the artificial intelligence field; it is promoted and maintained by many open-source teams and has been updated to version 2.11. Research on distributed extraction of remote sensing image information based on TensorFlow therefore has significant practical value.
TensorFlow is a symbolic mathematics system based on dataflow programming whose core idea, as the name implies, is the flow of tensors (Tensor). It is also a library designed for fast numerical computation and automatic parallelization: parallelism is handled automatically by its execution engine, usually works well without much tuning, and supports executing and deploying parallel code on one or more CPUs and GPUs. Like Theano, TensorFlow builds mathematical expressions, but instead of compiling them to machine code it executes them in an external engine written in C++; the expression is essentially a static graph before the program runs and becomes a dataflow graph once the program starts and data are fed in. TensorFlow performs numerical computation on a dataflow graph in which each node represents a mathematical operation and the edges between nodes represent the multidimensional arrays passed between them. A static graph must be defined in advance before use; this computation mode gives TensorFlow strong portability, a complete ecosystem and good performance, and its flexible architecture lets developers run computation tasks on a variety of platforms. However, when training a neural network, every node loads the complete training data onto the computing node. This reduces the latency of transferring training data to some extent and speeds up training, but it places a heavy memory burden on the computing nodes and limits their ability to process large-scale data.
The existing TensorFlow distributed deep learning framework uses gRPC as the cross-host communication layer and implements an all-reduce algorithm with synchronous ring gradient updates through Nvidia's NCCL. (1) gRPC: the concept of RPC remote services is widely used in distributed systems, and its performance plays a vital role in the communication between nodes during distributed training. RPC implementations are generally based on transport protocols such as XML or JSON serialization and HTTP. gRPC is a particular RPC protocol framework that is cross-language and cross-platform; its data messages are custom-designed on the HTTP/2.0 protocol standard and support bidirectional streaming, header compression, request multiplexing and other features. (2) Data allocation: in multi-node distributed training, the data set must be sharded to guarantee convergence and reproducibility of the model. Data set sharding based on remote calls is shown in fig. 1. Suppose the cluster has 4 computing nodes, each equipped with 4 graphics cards, and each computing node holds a data copy representing the complete data set. If the training data of each batch has length L, then when the current batch is trained it is divided evenly among all nodes, i.e. each node receives data of length L/4; because each thread in a node is responsible for training one GPU, the L/4 portion is subdivided among the threads of the node, so each thread handles L/16 of the current batch (a small sketch of this split follows this paragraph). After each GPU in a node performs forward and backward propagation of the model with the training data it is responsible for, 4 gradients of the model update parameters are obtained, and these are summed and averaged on the CPU. Finally, every node in the cluster receives and updates the model parameters through the all-reduce algorithm. When the computing cluster uses the original TensorFlow distributed framework for deep learning training, the training process on every computing node repeatedly loads a complete copy of the training data into its own memory, so the copy size is limited by the memory capacity of each node, and computer memory is usually expensive and of limited capacity. This training approach, with its obvious data redundancy, restricts the size of the model training set, so the neural network model as a whole cannot learn sufficiently, resulting in poor performance.
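A minimal sketch of this even split, with node/GPU indices and a batch length chosen only for illustration (the function name and values are not from the patent):

# Hypothetical illustration of the even batch split described above: a batch of
# length L is divided across the nodes, then across the GPU threads of each node.
def shard_bounds(batch_len, num_nodes, gpus_per_node, node_id, gpu_id):
    per_node = batch_len // num_nodes        # L/4 in the 4-node example
    per_gpu = per_node // gpus_per_node      # L/16 with 4 GPUs per node
    start = node_id * per_node + gpu_id * per_gpu
    return start, start + per_gpu            # [start, end) slice handled by one GPU thread

# With L = 64, 4 nodes and 4 GPUs per node, every GPU thread receives 4 samples.
print(shard_bounds(64, 4, 4, node_id=1, gpu_id=2))   # -> (24, 28)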
Therefore, a distributed training method for remote sensing information extraction models based on a central data pool, which partitions the data and controls each process to dynamically load data blocks from disk into memory, thereby solving the problem of excessive memory consumption by each process of the distributed training framework and improving the space efficiency of the distributed training algorithm, is a problem worth researching.
Disclosure of Invention
The invention aims to provide a distributed training method for remote sensing information extraction models based on a central data pool, which solves the problem of excessive memory consumption by each process of the distributed training framework by partitioning the data and controlling the processes to dynamically load data from disk into memory.
The purpose of the invention is realized in the following way:
the remote sensing information extraction model distributed training method based on the central data pool comprises the following steps:
step S1, establishing an LMDB data set: LMDB data are introduced for distributed deep learning model training; the LMDB data set is produced from the original data through LMDB database primitives and provides the data source for data parallelism;
step S2, generating a data block pointer set: based on the LMDB data set established in step S1, the data set is cut in real time at training initialization by a data segmentation method built on the central data pool, generating a data block pointer set for network training and providing training data addresses for the naive distributed training;
step S3, carrying out naive distributed training: based on the data block pointer set generated in step S2, training data are mapped from disk to memory through the central data pool data loading method, and the cluster then performs multi-machine distributed deep learning training of the model under naive gradient descent using the mapped data.
The specific steps of step S1 are as follows: the image data are cut into a number of small data blocks and standardized; each data block is numbered using the primitive operations, and a unique address value is assigned to each data block through the set<key, value> key-value pair relation; finally the whole data set is obtained, where each set variable forms a single data block entity and the whole data set is a set collection;
Set collection: {set1, set2, ..., setn-1, setn}, where n represents the number of the n-th data block.
The specific steps of step S2 are as follows: based on the set collection generated in step S1, the key values in the collection are extracted to form a key set, which is then randomly shuffled to form an out-of-order key set; the shuffled data effectively improve the training precision of the model;
Out-of-order key set: {key3, key1, key4, ..., keyn, keyn-1}, where n represents the number of the n-th data block.
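A minimal sketch of this key handling in step S2, assuming the keys of step S1 were written as zero-padded block numbers (the exact key format is not specified here):

import random

# Collect the keys assigned in step S1 and shuffle them into the out-of-order key set
# that serves as the data block pointer set; the key naming is an illustrative assumption.
n_blocks = 1024
keys = [f"set{i:08d}" for i in range(1, n_blocks + 1)]
random.seed(42)        # a fixed seed keeps the shuffled order identical on every process
random.shuffle(keys)   # out-of-order key set, e.g. [key3, key1, key4, ...]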
The specific steps of the step S3 are as follows:
step S3.1, mapping training sample data from the file system into memory space: based on the out-of-order key set generated in step S2, a method getValue(key) is provided to obtain a data block through its address value, with getValue(key_n) = D_n; the batch training sample set is {(x_p, y_p)}, p = 1, ..., M, where x_p denotes the training data, y_p the corresponding label, M the number of graphics cards, and p the process number of the graphics card holding the current sample data block;
let the number of data set iterations be N, the mini-batch size be b, the current step be s, and the global step count step_per_epoch be the key set length divided by the batch size, i.e. len(key)/b; let the process number be rank, the total number of GPU processes be j, and the data block pointer address (key) be pointer; the training process and steps are as follows:
first, the CPU side initializes the model parameters w;
for each iteration:
(1) each GPU training task process computes its block pointer as pointer = rank + (j × s) and obtains the corresponding sample through getValue(key) at that pointer;
(2) the j GPUs train in parallel;
(3) after training, each GPU completes forward and backward propagation of the neural network in its computing node based on the obtained data block and performs gradient descent by mini-batch stochastic gradient descent, finally obtaining the updated model parameters, as shown in the following formula:
w_s^rank = w_{s-1} - η · D_rank
where w_{s-1} denotes the global model parameters of the previous step, D_rank the gradient of the model variables at the current step, η the learning rate, and w_s^rank the model update parameters that the process of graphics card number rank is responsible for at the current step;
step S3.2, each GPU completes the model parameter update through the all-reduce communication architecture: based on its result w_s^rank from step (3) of step S3.1, each GPU performs a global model parameter update through the all-reduce communication architecture.
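A minimal runnable sketch of this per-rank loop, using mpi4py's Allreduce in place of the Horovod/NCCL all-reduce detailed next; the key set, the getValue stand-in, the model and the learning rate are toy placeholders, not the patent's actual components:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, j = comm.Get_rank(), comm.Get_size()     # process number and total number of GPU processes

keys = [f"set{i:08d}" for i in range(1, 65)]   # shuffled key set from step S2 (placeholder)
def get_value(key):                            # stands in for getValue(key): the LMDB block read
    rng = np.random.default_rng(int(key[3:]))  # deterministic toy data per block
    x = rng.normal(size=(8, 4))
    y = rng.normal(size=(8,))
    return x, y

eta, w = 0.01, np.zeros(4)                     # learning rate and model parameters w
comm.Bcast(w, root=0)                          # the root (CPU side) initializes w for every rank

for s in range(len(keys) // j):                # assume one data block per process per step
    pointer = rank + j * s                     # block pointer unique to this rank and step
    x, y = get_value(keys[pointer])            # map the assigned block from disk into memory
    grad = 2 * x.T @ (x @ w - y) / len(y)      # gradient D_rank of a toy least-squares loss
    avg = np.empty_like(grad)
    comm.Allreduce(grad, avg, op=MPI.SUM)      # all-reduce: sum the j per-rank gradients
    w -= eta * avg / j                         # averaged mini-batch SGD update of w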
The all-reduce in step S3.2 specifically comprises the following steps:
(1) the model parameters are divided equally into j parts, where j is the number of GPUs;
(2) scatter-reduce: the GPUs exchange data j-1 times, so that each GPU ends up holding the final (fully reduced) result for one part of the model;
(3) all-gather: the GPUs exchange the partial final results they hold j-1 times, so that every GPU finally obtains the complete final result;
(4) the parameters stored on each GPU are averaged to complete the model parameter update of the current step (a simulation sketch of these phases follows).
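A self-contained simulation of these four steps under the stated assumptions (j simulated GPUs, each holding its parameter vector already split into j chunks); the function name and data are illustrative:

import numpy as np

def ring_allreduce(chunks_per_gpu):
    # chunks_per_gpu[g][c] is chunk c of the parameter vector held by GPU g.
    j = len(chunks_per_gpu)
    data = [[np.array(c, dtype=float) for c in gpu] for gpu in chunks_per_gpu]
    # (2) scatter-reduce: j-1 exchanges; afterwards GPU g holds the full sum of chunk (g+1) % j.
    for step in range(j - 1):
        for g in range(j):
            c = (g - step) % j                       # chunk GPU g passes to its right neighbour
            data[(g + 1) % j][c] = data[(g + 1) % j][c] + data[g][c]
    # (3) all-gather: j-1 exchanges; every GPU ends up with every fully reduced chunk.
    for step in range(j - 1):
        for g in range(j):
            c = (g + 1 - step) % j                   # fully reduced chunk GPU g forwards
            data[(g + 1) % j][c] = data[g][c]
    # (4) average the stored parameters on each GPU to finish the update of the current step.
    return [[chunk / j for chunk in gpu] for gpu in data]

# Example: 4 GPUs, each holding the constant vector g split into 4 chunks of length 2;
# every chunk on every GPU ends up as the average (0+1+2+3)/4 = 1.5.
result = ring_allreduce([[np.full(2, g) for _ in range(4)] for g in range(4)])

In each phase every GPU sends j-1 chunks of size N/j, which is where the per-GPU communication volume 2(K-1)N/K quoted later in the description comes from.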
The beneficial effects of the invention are as follows: the method is analyzed in terms of data parallelism and communication strategy, pointing out its bottleneck in data loading. On the basis of the Horovod high-level parallel framework and with reference to the MPI message-passing model, one graphics card is bound to each MPI process, and all randomly shuffled training data are abstracted into a central data pool that continuously supplies training data to every process (single graphics card). Combined with the LMDB memory-mapped data technique, the data pool is divided sequentially into numbered data blocks; by constructing the central data pool and the data numbering, the data reading and transfer procedure of each process is redefined, and each process obtains from the central data pool the data uniquely matched to it by combining its number with the current training batch number. This saves memory resources and achieves efficient, stable multi-node, multi-GPU computation on large-scale data.
Drawings
FIG. 1 is an exemplary image of a Huiji road dataset tested in accordance with the present invention and corresponding labels;
FIG. 2 is an MPI programming process of the present invention;
FIG. 3 is a distributed training program start-up flowchart of the present invention;
FIG. 4 is a schematic diagram of a central data pool configuration of the present invention;
FIG. 5 is a pseudo code of a central data pool distributed training algorithm of the present invention;
FIG. 6 is a diagram of the all-reduce communication mode of the present invention;
FIG. 7 is a diagram of the all-reduce gradient update model of the present invention;
fig. 8 is a diagram of an algorithm architecture of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and examples.
A distributed training method for remote sensing information extraction models based on a central data pool. Fig. 1 shows a remote sensing image in tif format, i.e. the original image of the LMDB data in step S1. First, in order to reduce the hardware requirements of model training, the image needs to be cut into image blocks with a smaller side length and renamed according to the rule "file name + row number + column number". The image data are then converted into a sample data file in LMDB format using the database operation primitives shown below.
env = lmdb.open(): create an LMDB environment
txn = env.begin(): start a transaction
txn.put(key, value): insert or modify a record
txn.delete(key): delete a record
txn.get(key): query a record
txn.cursor(): traverse the records
txn.commit(): commit the changes
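A minimal sketch of this conversion with the Python lmdb binding, using the primitives listed above; the path, key format, tile shape and map_size are illustrative assumptions, not values from the patent:

import lmdb
import numpy as np

tiles = [np.zeros((256, 256, 3), dtype=np.uint8) for _ in range(4)]   # placeholder image tiles
env = lmdb.open("train_lmdb", map_size=1 << 30)   # create the LMDB environment (1 GB map)
with env.begin(write=True) as txn:                # start a write transaction
    for i, tile in enumerate(tiles, start=1):
        key = f"set{i:08d}".encode()              # unique address value (the key)
        txn.put(key, tile.tobytes())              # insert the <key, value> record
env.sync()                                        # flush the memory-mapped file to disk
env.close()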
Each record in the LMDB-format data has the form set<key, value>, where key is the sample name (address value) and value is the sample value. A unique address value is assigned to each image according to the set<key, value> key-value pair relation, so that the image value (value) and the address value (key) are in one-to-one correspondence; finally the whole image sample data set is obtained, where each set variable forms a single image block entity and the whole data set is a set collection.
Set collection: {set_1, set_2, ..., set_{n-1}, set_n}, where n represents the number of the n-th image block.
After the set image collection is obtained, the images need to be standardized. Image standardization centers the data on the mean; according to convex optimization theory and the probability distribution of the data, centered data better conform to the data distribution law, which improves the ability of the data to express features and accelerates the convergence of the model. The normalization formula is as follows:
the above formula shows the data normalization process under the background of m images, wherein mu β Mean, sigma β As a function of the variance of the values,the normal standardized result of the image is obtained.
As shown in fig. 2, MPI program writing begins after the LMDB sample data set is obtained. MPI commands are then embedded into the debugged single-machine program, and the computing nodes are started and monitored fully automatically through MPI library operations. Finally, distributed deep learning training is launched through a Bash script.
The Bash script is shown below
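As a purely illustrative example of such a launch (the host names and process counts are assumptions, not the patent's actual script), a Horovod job spanning two 4-GPU nodes can be started with a command of the form: horovodrun -np 8 -H node1:4,node2:4 python train.py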
The result of successful initiation of the distributed training task is shown in fig. 3.
After the task is successfully started, each computing node loads data in a distributed manner for distributed training. First, the out-of-order sample image data block address set is generated according to step S2, and then each GPU obtains data through the image block address according to the rule of step S3 to train the model. The core idea of step S3 is shown in fig. 4: all data used for training are abstracted into a central data pool in the cluster; the local memory of the computing nodes no longer stores data copies, and instead the data blocks allocated to each computing node are fetched on demand from the globally unique central data pool according to the training progress. This data loading idea releases the memory of the computing nodes to a great extent, although the extra data block transfers cause additional time overhead, which the memory-mapped file technique of LMDB can reduce.
The invention ignores the node boundaries in the cluster and regards each GPU as an independently trainable process in the distributed system. Each process is first numbered individually; the training data are then cut according to the preset batch data length (mini-batch) used by each GPU for training, and the arrangement order of the cut data blocks is shuffled. Meanwhile, at training initialization, the update step of the global parameter gradient of the model is calculated from the total length of the data blocks consumed by the cluster in each step, so that the data blocks of the central data pool and the training processes can be placed in one-to-one correspondence through the step and the process number. Each LMDB data block has a corresponding pointer, and each GPU uses memory mapping to read its data block directly and quickly through the pointer address from the disk where the central data pool resides into memory (see the read-side sketch below). Finally, the GPUs, as basic units, form an all-reduce structure for gradient scattering and aggregation to complete model training.
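A minimal sketch of the read side of the central data pool, mirroring the illustrative writer sketch above (the key format and tile shape are the same assumptions); every GPU process opens the same LMDB file read-only and maps blocks straight from disk into memory by key:

import lmdb
import numpy as np

env = lmdb.open("train_lmdb", readonly=True, lock=False)    # shared, memory-mapped, read-only

def get_value(txn, key):
    buf = txn.get(key.encode())                             # lookup in the memory-mapped file
    return np.frombuffer(buf, dtype=np.uint8).reshape(256, 256, 3)

with env.begin(buffers=True) as txn:                        # buffers=True avoids extra copies
    block = get_value(txn, "set00000042")                   # key derived from rank and step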
Furthermore, during data loading the key values in the set collection are extracted to form a key set, which is then randomly shuffled to form an out-of-order key set. The shuffled data effectively improve the training precision of the model.
Out-of-order key set: {key_3, key_1, key_4, ..., key_n, key_{n-1}}, where n represents the number of the n-th image block.
Based on the out-of-order key set generated in step S2, each MPI process is trained in a distributed manner using the algorithm shown in fig. 5.
Further, when each GPU performs distributed training, the gradient exchange follows the all-reduce model parameter update method shown in fig. 6.
The communication volume per GPU in the all-reduce method is given by:
Data Transferred = 2(K-1)N/K
Communication is divided into two phases: scatter-reduce and all-gather. Each of the K GPUs sends and receives K-1 times in the scatter-reduce phase and K-1 times in the all-gather phase, and each time a GPU sends N/K values, where N is the total number of values to be summed across the GPUs (i.e. the total size of the array that must be stored on each GPU). For example, with K = 8 GPUs each GPU transfers 2 × (8-1)/8 × N = 1.75N values in total, a figure that approaches 2N as K grows.
Furthermore, the communication topology of the all-reduce method consists of decentralized nodes: it omits the parameter server and distributes the communication load evenly over the computing nodes, ensuring load balance across all nodes, so the all-reduce method has no central bottleneck at a parameter server. As shown in fig. 7, L_i (0 ≤ i < n) represents the i-th layer of the deep neural network. Back-propagation computation is performed serially with the synchronous gradient communication, so the all-reduce gradient synchronization waits until the gradient computation of all network layers (L_0, ..., L_{n-1}) is completed before it starts.
Further, regarding the gradient update concept: for a given data set D, f(x, W) denotes the output of the neural network with input x and parameters W, and l(·) is defined as the loss function; the final objective of training is to minimize the loss l(f(x, W), y), where y is the label. The update formula of the neural network weights is:
W_{t+1} = W_t - (η/K) · Σ_{i=1}^{K} ∇l(f(x_i, W_t), y_i)
where K is the number of samples per batch in batch training (in this method, the number of GPUs), η is the learning rate, and ∇l(f(x_i, W_t), y_i) is the gradient of the i-th sample.
The architecture of the present algorithm is shown in fig. 8. A single distributed training program written in pure Keras semantics is wrapped with Horovod commands, and each training process is then started through an MPI command that refers to the cluster configuration file.
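A minimal sketch of such a Horovod-wrapped Keras program; the model, data and optimizer settings are placeholders, not the patent's actual extraction network:

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                                   # one process per GPU, started by MPI
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")   # pin each process to its GPU

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])      # placeholder for the extraction model
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="mse", optimizer=opt)

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]    # sync initial weights from rank 0
x, y = tf.random.normal((64, 4)), tf.random.normal((64, 1))  # stand-in for central data pool blocks
model.fit(x, y, batch_size=8, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)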

Claims (1)

1. A distributed training method for a remote sensing information extraction model based on a central data pool, characterized by comprising the following steps:
step S1, establishing an LMDB data set: LMDB data are introduced for distributed deep learning model training; the LMDB data set is produced from the original data through LMDB database primitives and provides the data source for data parallelism; the specific steps of step S1 are as follows: the image data are cut into a number of small data blocks and standardized; each data block is numbered using the primitive operations, and a unique address value is assigned to each data block through the set<key, value> key-value pair relation, so that the data block value and the address value key are in one-to-one correspondence; finally the whole data set is obtained, where each set variable forms a single data block entity and the whole data set is a set collection;
the set collection is {set1, set2, ..., setn-1, setn}, where n represents the number of the n-th data block;
step S2, generating a data block pointer set: based on the LMDB data set established in step S1, the data set is cut in real time at training initialization by a data segmentation method built on the central data pool, generating a data block pointer set for network training and providing training data addresses for the naive distributed training; the specific steps of step S2 are as follows: based on the set collection generated in step S1, the key values in the collection are extracted to form a key set, which is then randomly shuffled to form an out-of-order key set, the shuffled data effectively improving the training precision of the model; the out-of-order key set is {key3, key1, key4, ..., keyn, keyn-1}, where n represents the number of the n-th data block;
step S3, carrying out naive distributed training: mapping training data from a disk to a memory through a central data pool data loading method based on the data block pointer set generated in the step S2, and then carrying out multi-machine distributed deep learning training on the model under the condition of naive gradient descent by the cluster according to the mapping data; the specific steps of the step S3 are as follows:
step S3.1, mapping training sample data from the file system into memory space: based on the out-of-order key set generated in step S2, a method getValue(key) is provided to obtain a data block through its address value, with getValue(key_n) = D_n; the batch training sample set is {(x_p, y_p)}, p = 1, ..., M, where x_p denotes the training data, y_p the corresponding label, M the number of graphics cards, and p the process number of the graphics card holding the current sample data block;
let the number of data set iterations be N, the mini-batch size be b, the current step be s, and the global step count step_per_epoch be the key set length divided by the batch size, i.e. len(key)/b; let the process number be rank, the total number of GPU processes be j, and the data block pointer address key be pointer; the training process and steps are as follows:
first, the CPU side initializes the model parameters w;
for each iteration:
(1) each GPU training task process computes its block pointer as pointer = rank + (j × s) and obtains the corresponding sample through getValue(key) at that pointer;
(2) j GPUs perform parallel training;
(3) after training, each GPU completes forward and backward propagation of the neural network in its computing node based on the obtained data block and performs gradient descent by mini-batch stochastic gradient descent, finally obtaining the updated model parameters, as shown in the following formula:
w_s^rank = w_{s-1} - η · D_rank
where w_{s-1} denotes the global model parameters of the previous step, D_rank the gradient of the model variables at the current step, η the learning rate, and w_s^rank the model update parameters that the process of graphics card number rank is responsible for at the current step;
step S3.2, each GPU completes the model parameter update through the all-reduce communication architecture: based on its result w_s^rank from step (3) of step S3.1, each GPU performs a global model parameter update through the all-reduce communication architecture; the all-reduce in step S3.2 specifically comprises the following steps:
(1) the model parameters are divided equally into j parts, where j is the number of GPUs;
(2) scatter-reduce: the GPUs exchange data j-1 times, so that each GPU ends up holding the final (fully reduced) result for one part of the model;
(3) all-gather: the GPUs exchange the partial final results they hold j-1 times, so that every GPU finally obtains the complete final result;
(4) the parameters stored on each GPU are averaged to complete the model parameter update of the current step.
CN202310424978.XA 2023-04-18 2023-04-18 Remote sensing information extraction model distributed training method based on central data pool Active CN116452951B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310424978.XA CN116452951B (en) 2023-04-18 2023-04-18 Remote sensing information extraction model distributed training method based on central data pool

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310424978.XA CN116452951B (en) 2023-04-18 2023-04-18 Remote sensing information extraction model distributed training method based on central data pool

Publications (2)

Publication Number Publication Date
CN116452951A CN116452951A (en) 2023-07-18
CN116452951B 2023-11-21

Family

ID=87119865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310424978.XA Active CN116452951B (en) 2023-04-18 2023-04-18 Remote sensing information extraction model distributed training method based on central data pool

Country Status (1)

Country Link
CN (1) CN116452951B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059066A (en) * 2019-02-26 2019-07-26 中科遥感(深圳)卫星应用创新研究院有限公司 The method of spark combination tensorflow progress remote sensing image information extraction
CN111797833A (en) * 2020-05-21 2020-10-20 中国科学院软件研究所 Automatic machine learning method and system oriented to remote sensing semantic segmentation
US10817392B1 (en) * 2017-11-01 2020-10-27 Pure Storage, Inc. Ensuring resiliency to storage device failures in a storage system that includes a plurality of storage devices
CN112464784A (en) * 2020-11-25 2021-03-09 西安烽火软件科技有限公司 Distributed training method based on hybrid parallel
CN112463056A (en) * 2020-11-28 2021-03-09 苏州浪潮智能科技有限公司 Multi-node distributed training method, device, equipment and readable medium
CN113515370A (en) * 2021-04-28 2021-10-19 之江实验室 Distributed training method for large-scale deep neural network
CN113971455A (en) * 2020-07-24 2022-01-25 腾讯科技(深圳)有限公司 Distributed model training method and device, storage medium and computer equipment
CN115906999A (en) * 2023-01-05 2023-04-04 中国科学技术大学 Management platform of large-scale reinforcement learning training task based on Kubernetes cluster

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220334990A1 (en) * 2014-07-03 2022-10-20 Pure Storage, Inc. Zone drive data format
WO2021163333A1 (en) * 2020-02-12 2021-08-19 Feedzai - Consultadoria E Inovacão Tecnológica, S.A. Interleaved sequence recurrent neural networks for fraud detection
US20210303164A1 (en) * 2020-03-25 2021-09-30 Pure Storage, Inc. Managing host mappings for replication endpoints
CN111709533B (en) * 2020-08-19 2021-03-30 腾讯科技(深圳)有限公司 Distributed training method and device of machine learning model and computer equipment
US11869668B2 (en) * 2021-05-28 2024-01-09 Tempus Labs, Inc. Artificial intelligence based cardiac event predictor systems and methods

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10817392B1 (en) * 2017-11-01 2020-10-27 Pure Storage, Inc. Ensuring resiliency to storage device failures in a storage system that includes a plurality of storage devices
CN110059066A (en) * 2019-02-26 2019-07-26 中科遥感(深圳)卫星应用创新研究院有限公司 The method of spark combination tensorflow progress remote sensing image information extraction
CN111797833A (en) * 2020-05-21 2020-10-20 中国科学院软件研究所 Automatic machine learning method and system oriented to remote sensing semantic segmentation
CN113971455A (en) * 2020-07-24 2022-01-25 腾讯科技(深圳)有限公司 Distributed model training method and device, storage medium and computer equipment
CN112464784A (en) * 2020-11-25 2021-03-09 西安烽火软件科技有限公司 Distributed training method based on hybrid parallel
CN112463056A (en) * 2020-11-28 2021-03-09 苏州浪潮智能科技有限公司 Multi-node distributed training method, device, equipment and readable medium
WO2022111042A1 (en) * 2020-11-28 2022-06-02 苏州浪潮智能科技有限公司 Multi-node distributed training method and apparatus, device and readable medium
CN113515370A (en) * 2021-04-28 2021-10-19 之江实验室 Distributed training method for large-scale deep neural network
CN115906999A (en) * 2023-01-05 2023-04-04 中国科学技术大学 Management platform of large-scale reinforcement learning training task based on Kubernetes cluster

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MemcachedGPU: Scaling-up Scale-out Key-value Stores; Tayler H. Hetherington et al.; Symposium on Cloud Computing; pp. 1-12 *
Research on key technologies for massive spatial data processing in cloud environments; 黄伟; China Doctoral Dissertations Full-text Database, Basic Sciences (No. 6); pp. A008-12 *

Also Published As

Publication number Publication date
CN116452951A (en) 2023-07-18

Similar Documents

Publication Publication Date Title
Zheng et al. Distdgl: distributed graph neural network training for billion-scale graphs
Gunarathne et al. Scalable parallel computing on clouds using Twister4Azure iterative MapReduce
US9619491B2 (en) Streamlined system to restore an analytic model state for training and scoring
US9953003B2 (en) Systems and methods for in-line stream processing of distributed dataflow based computations
Zou et al. Mariana: Tencent deep learning platform and its applications
CN111258744A (en) Task processing method based on heterogeneous computation and software and hardware framework system
US11630864B2 (en) Vectorized queues for shortest-path graph searches
US8373710B1 (en) Method and system for improving computational concurrency using a multi-threaded GPU calculation engine
WO2023179415A1 (en) Machine learning computation optimization method and platform
Song et al. Bridging the semantic gaps of GPU acceleration for scale-out CNN-based big data processing: Think big, see small
Ekanayake et al. Dryadlinq for scientific analyses
CN111444134A (en) Parallel PME (pulse-modulated emission) accelerated optimization method and system of molecular dynamics simulation software
US11222070B2 (en) Vectorized hash tables
CN115437760A (en) Computing resource allocation method, electronic device, storage medium, and program product
Kim et al. Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster
Moreno-Alvarez et al. Heterogeneous model parallelism for deep neural networks
Xia et al. Redundancy-free high-performance dynamic GNN training with hierarchical pipeline parallelism
CN116452951B (en) Remote sensing information extraction model distributed training method based on central data pool
Wang et al. Auto-map: A DQN framework for exploring distributed execution plans for DNN workloads
Anthony et al. Efficient training of semantic image segmentation on summit using horovod and mvapich2-gdr
Anwar et al. Recommender system for optimal distributed deep learning in cloud datacenters
Marques et al. A cloud computing based framework for general 2D and 3D cellular automata simulation
US11194625B2 (en) Systems and methods for accelerating data operations by utilizing native memory management
Li et al. Optimizing machine learning on apache spark in HPC environments
Xu et al. Efficient supernet training using path parallelism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant