CN117875362A - Distributed training method and device for large model and electronic equipment


Info

Publication number
CN117875362A
Authority
CN
China
Prior art keywords
resource
model
segmentation result
current
configuration information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410276489.9A
Other languages
Chinese (zh)
Inventor
田楷
晏文仲
陈立名
代文静
黄金
曹彬
胡江洪
方超群
王凯
陈运泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fitow Tianjin Detection Technology Co Ltd
Original Assignee
Fitow Tianjin Detection Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fitow Tianjin Detection Technology Co Ltd filed Critical Fitow Tianjin Detection Technology Co Ltd
Priority to CN202410276489.9A priority Critical patent/CN117875362A/en
Publication of CN117875362A publication Critical patent/CN117875362A/en
Pending legal-status Critical Current

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a distributed training method and device for a large model and electronic equipment, and relates to the technical field of artificial intelligence. The method includes: based on resource demand configuration information and model configuration information, sequentially carrying out pipeline parallel and tensor parallel segmentation on the model structure and the application data of the large model; evaluating the current segmentation result in the time dimension and the resource dimension to obtain a current evaluation index value; and optimizing the current segmentation result based on the current evaluation index value to obtain a target segmentation result of the large model, so as to perform distributed training on the large model based on the target segmentation result. In this way, a user can dynamically form the training network simply by providing the resource requirements and the model configuration information of the large model, and because both the time dimension and the resource dimension are considered during optimization, training precision can be improved and parallel performance maximized.

Description

Distributed training method and device for large model and electronic equipment
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a distributed training method and device for a large model and electronic equipment.
Background
The vertical large model (large model for short) refers to a large pre-trained model based on deep learning that is used to process large-scale data sets of a specific field or task. Such models are typically composed of multiple sub-models, each of which is responsible for handling a different task or feature. By pre-training on a large data set, a vertical large model can learn rich semantic and contextual information and therefore performs excellently on specific tasks. Vertical large models are currently scarce in the field of industrial vision, mainly because the large number of classification categories and the huge data volume place great pressure on hardware and GPU memory, so that training has to be carried out with distributed parallel techniques across multiple graphics cards.
There are generally two types of parallel techniques: tensor parallelism and pipeline parallelism. Tensor parallelism parallelizes the computation within a single operation, such as a matrix-matrix multiplication, while pipeline parallelism parallelizes the computation between layers. Thus, from another perspective, tensor parallelism may be regarded as intra-layer parallelism and pipeline parallelism as inter-layer parallelism.
Whether tensor parallelism or pipeline parallelism is used, factors such as computation cost, communication cost, memory cost and the final training effect all need to be considered. Existing model distributed training methods are difficult for engineering staff to apply and involve many dimensions, and the internal design logic of their tensor parallelism and pipeline parallelism can be problematic in several respects. The first problem is that using the distributed technology requires engineering personnel to have a very deep understanding of network design and distributed training design, and corresponding parameters must be manually proportioned and related code written each time, which entails a large amount of debugging and design work. The second problem is that distributed training often involves a very large amount of hardware, such as hundreds or thousands of graphics cards and dozens of servers, and during start-up the states of the hardware resources and the coordination and allocation among them usually have to be specified manually, which is very cumbersome, inflexible (a configuration is bound one-to-one to a fixed set of hardware) and poorly fault-tolerant (the failure of a hardware node during training interrupts the training). The third problem is that manually set model parameters are often not the most suitable parameters for the parallel scheme, may contain a lot of design redundancy, and their unreasonable design can affect training accuracy.
Disclosure of Invention
The invention aims to provide a distributed training method and device for a large model and electronic equipment, so as to alleviate at least one of the above problems.
In a first aspect, an embodiment of the present invention provides a distributed training method for a large model, including:
acquiring resource demand configuration information of a target computing cluster and model configuration information of a large model to be trained;
based on the resource demand configuration information and the model configuration information, sequentially carrying out pipeline parallel and tensor parallel segmentation on the model structure and the application data of the large model to obtain a current segmentation result;
evaluating the time dimension and the resource dimension of the current segmentation result to obtain a current evaluation index value;
and optimizing the current segmentation result based on the current evaluation index value to obtain a target segmentation result of the large model, so as to perform distributed training on the large model based on the target segmentation result.
Further, the sequentially performing pipeline parallel and tensor parallel segmentation on the model structure and the application data of the large model based on the resource demand configuration information and the model configuration information to obtain a current segmentation result, including:
Searching the target computing cluster to obtain current global available resources, and determining available resource information of the target computing cluster based on the global available resources and the resource demand configuration information;
operator splitting is carried out on the large model based on the model configuration information, and operator information of the large model is obtained;
carrying out pipeline parallelism and tensor parallelism segmentation on the weight parameters of each operator in the large model based on the operator information to obtain an initial segmentation result;
and carrying out tensor parallel segmentation on the application data of the large model based on the initial segmentation result and the available resource information to obtain a current segmentation result.
Further, the performing operator splitting on the large model based on the model configuration information to obtain operator information of the large model includes:
generating a network structure overall diagram of the large model according to the model configuration information;
and carrying out operator splitting on the large model based on the network structure integral graph and the registered operators to obtain operator information of the large model.
Further, the performing pipeline parallel and tensor parallel segmentation on the weight parameter of each operator in the large model based on the operator information to obtain an initial segmentation result includes:
Based on the operator information and a preset first factor, carrying out pipeline parallelism and tensor parallelism segmentation on the weight parameter of each operator in the large model to obtain an initial segmentation result; wherein the first factor includes a resource occupancy and forward and reverse times.
Further, based on the initial segmentation result and the available resource information, performing tensor parallel segmentation on the application data of the large model to obtain a current segmentation result, where the method includes:
performing tensor parallel segmentation on the application data of the large model based on the initial segmentation result, the available resource information and a preset second factor to obtain a current segmentation result; wherein the second factor comprises a batch size.
Further, the step of performing the evaluation on the time dimension and the resource dimension on the current segmentation result to obtain a current evaluation index value includes:
according to the current segmentation result, calculating to obtain resource occupation amount data and time occupation amount data;
determining a resource score and a time score based on the resource occupancy data and the time occupancy data;
based on preset weight data, carrying out weighted summation calculation on the resource scores and the time scores to obtain current evaluation index values; wherein the weight data includes a resource weight corresponding to the resource score and a time weight corresponding to the time score.
Further, the resource occupation amount data comprises first resource occupation data of an operator side and second resource occupation data of an application data side, and the time occupation amount data comprises first time occupation data of the operator side and second time occupation data of the application data side; the determining a resource score and a time score based on the resource occupancy data and the time occupancy data includes:
respectively carrying out score quantization on the resource occupation amount data and the time occupation amount data to obtain a first resource sub-score corresponding to the first resource occupation amount data, a second resource sub-score corresponding to the second resource occupation amount data, a first time sub-score corresponding to the first time occupation amount data and a second time sub-score corresponding to the second time occupation amount data;
summing the first resource sub-score and the second resource sub-score to obtain a resource score;
and carrying out summation calculation on the first time sub-score and the second time sub-score to obtain a time score.
Further, the optimizing the current segmentation result based on the current evaluation index value to obtain a target segmentation result of the large model includes:
Judging whether the current evaluation index value reaches a preset index threshold value or not;
if not, acquiring the loss value of each node in the target computing cluster under the current segmentation result; calculating the current parallel cost based on each loss value corresponding to the current segmentation result; updating the model configuration information based on the current parallel cost, and re-executing, based on the updated model configuration information, the step of sequentially carrying out pipeline parallel and tensor parallel segmentation on the model structure and the application data of the large model based on the resource demand configuration information and the model configuration information to obtain a current segmentation result;
and if so, determining the current segmentation result as a target segmentation result of the large model.
In a second aspect, an embodiment of the present invention further provides a distributed training apparatus for a large model, including:
the acquisition module is used for acquiring resource demand configuration information of the target computing cluster and model configuration information of a large model to be trained;
the segmentation module is used for sequentially carrying out pipeline parallel and tensor parallel segmentation on the model structure and the application data of the large model based on the resource demand configuration information and the model configuration information to obtain a current segmentation result;
The evaluation module is used for evaluating the time dimension and the resource dimension of the current segmentation result to obtain a current evaluation index value;
and the optimization module is used for optimizing the current segmentation result based on the current evaluation index value to obtain a target segmentation result of the large model so as to perform distributed training on the large model based on the target segmentation result.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory, and a processor, where the memory stores a computer program that can be run on the processor, and the processor implements the method of the first aspect when executing the computer program.
The distributed training method and device for a large model and the electronic equipment provided by the invention can acquire the resource demand configuration information of a target computing cluster and the model configuration information of the large model to be trained; based on the resource demand configuration information and the model configuration information, sequentially carry out pipeline parallel and tensor parallel segmentation on the model structure and the application data of the large model to obtain a current segmentation result; evaluate the current segmentation result in the time dimension and the resource dimension to obtain a current evaluation index value; and optimize the current segmentation result based on the current evaluation index value to obtain a target segmentation result of the large model, so as to perform distributed training on the large model based on the target segmentation result. In this way, users can dynamically form the training network simply by providing the resource requirements and the model configuration information of the large model, without explicitly defining a distributed strategy; that is, automatic segmentation of the large model combining pipeline parallelism with tensor parallelism and dynamic optimization of the current segmentation result are realized, a target segmentation result matched with the model configuration information is obtained, and training precision is improved. Moreover, the time dimension and the resource dimension are both considered during optimization, so that parallel performance can be maximized.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic illustration of tensor parallel segmentation;
FIG. 2 is a schematic illustration of pipeline parallel segmentation;
FIG. 3 is a schematic flow chart of a large model distributed training method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a large-model distributed training device according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be clearly and completely described in connection with the embodiments, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Tensor parallel training is to divide a tensor into N blocks along a specific dimension, and each device holds only 1/N of the whole tensor, while not affecting the correctness of the computational graph. This requires additional communication to ensure the correctness of the results.
Taking a general matrix multiplication as an example, as shown in fig. 1, assume that there is C = AB. B may be split along its columns into [B1 B2], with each device holding one column block, so that C is also split along its columns into C1 and C2. Multiplying A with each column block of B on each device then yields [AB1 AB2], where AB1 equals C1 in C and AB2 equals C2 in C. At this point each device still holds only a portion of the result, e.g., device 1 holds AB1 and device 2 holds AB2. To ensure the correctness of the result, all the partial results must be gathered and the tensors concatenated along the column dimension. In this way, tensors can be distributed across the devices while ensuring that the computation flow remains correct.
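To make the column-wise split concrete, the following is a minimal single-process sketch in PyTorch-style Python; the tensor shapes and the two-way split are assumptions chosen for illustration and are not taken from the application.

import torch

# Hypothetical shapes chosen for illustration only.
A = torch.randn(4, 6)          # replicated on every device
B = torch.randn(6, 8)          # full weight matrix; C = A @ B is the reference result

# Split B along the column dimension into two shards, one per device.
B1, B2 = torch.chunk(B, chunks=2, dim=1)

# Each "device" computes its local partial result (simulated here on one process).
C1 = A @ B1                    # would live on device 0
C2 = A @ B2                    # would live on device 1

# A gather step concatenates the shards along the column dimension.
C = torch.cat([C1, C2], dim=1)
assert torch.allclose(C, A @ B)  # the split does not change the result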
The core idea of pipeline parallelism is that a model is divided into a plurality of blocks in layers, each block is handed to one device, and as shown in fig. 2, the model is divided into a backbone network, a neck network, a head network and the like in layers, and the backbone network, the neck network and the head network are respectively held by a GPU0, a GPU1 and a GPU 2. During the forward propagation, each device passes the intermediate activation to the next stage. In the backward propagation process, each device passes the gradient of the input tensor back to the previous pipeline stage. This allows the devices to perform calculations simultaneously, thereby increasing the throughput of training.
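A minimal sketch of such layer-wise placement, assuming a toy backbone/neck/head decomposition and three visible GPUs; the module definitions below are placeholders and not the components of the defect detection large model.

import torch
import torch.nn as nn

# Placeholder sub-networks standing in for backbone, neck and head.
backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()).to("cuda:0")
neck     = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU()).to("cuda:1")
head     = nn.Conv2d(32, 8, 1).to("cuda:2")

def forward_pipeline(x: torch.Tensor) -> torch.Tensor:
    # Each stage computes on its own device and hands the activation to the next stage.
    x = backbone(x.to("cuda:0"))
    x = neck(x.to("cuda:1"))
    return head(x.to("cuda:2"))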
Existing model distributed training methods generally require engineering personnel to manually proportion corresponding parameters, write related code, carry out a large amount of debugging and design, manually designate hardware resources, and so on, so they suffer from problems such as complex work, poor fault tolerance, low parallel performance and reduced training precision. Based on this, the distributed training method and device for a large model and the electronic equipment provided by the embodiments of the invention can realize automatic tensor parallel and pipeline parallel segmentation: after the model structure is rearranged and analyzed by an automatic search algorithm, a tensor-parallel combined with pipeline-parallel segmentation scheme matched with the current large model is obtained.
For the convenience of understanding the present embodiment, a detailed description will be first given of a large-model distributed training method disclosed in the present embodiment.
The embodiment of the invention provides a distributed training method of a large model, which can be executed by electronic equipment with data processing capability; the large model may be a large model in the field of target detection, such as a defect detection model, a target recognition model, etc., and the large model is taken as a defect detection large model for the following description, but the protection scope of the present invention is not limited thereto. Referring to fig. 3, a flow chart of a large model distributed training method mainly includes steps S310 to S340 as follows:
Step S310, obtaining resource demand configuration information of a target computing cluster and model configuration information of a defect detection large model to be trained.
The target computing cluster is used for training the defect detection large model, and each node in the target computing cluster is a computer. The resource requirement configuration information is used to characterize the resources required to train the defect detection large model and can be stored in a configuration file entered by the user. The resource requirement configuration information may include Nodes (i.e., the number of nodes needed), nproc_per_node (i.e., the number of processes needed for each node), the number of GPU (graphics processing unit) resources needed, the amount of memory needed for each process, and the like. The model configuration information is used to characterize all parameters of the defect detection large model and may be stored in a user-specified configuration file. The model configuration information may include the model to be trained and the specific parameters of that model (e.g., InChannel, OutChannel and other parameters of a certain head, etc.).
It should be noted that the resource requirement configuration information and the model configuration information may be stored in the same configuration file, or may be stored in different configuration files.
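For illustration only, the two kinds of configuration information might be organized roughly as follows; all field names and values here are assumptions rather than a format defined by this application.

# Hypothetical resource requirement configuration of the target computing cluster.
resource_config = {
    "nodes": 4,                 # number of nodes needed
    "nproc_per_node": 8,        # processes per node
    "gpus": 32,                 # total GPU resources needed
    "mem_per_proc_gb": 16,      # memory needed by each process
}

# Hypothetical model configuration of the defect detection large model.
model_config = {
    "model": "defect_detection_large",
    "backbone": {"type": "Resnet101"},
    "head": {"in_channel": 256, "out_channel": 1024, "num_class": 80},
    "batch_size": 64,
    "input_shape": [1024, 1024],
}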
Step S320, based on the resource demand configuration information and the model configuration information, sequentially carrying out pipeline parallel and tensor parallel segmentation on the model structure and the application data of the defect detection large model to obtain a current segmentation result.
In this embodiment, the available resources may be dynamically obtained by searching the cluster based on the resource requirement configuration information, and the model structure and the application data of the defect detection large model may then be segmented on those available resources in a hybrid parallel manner based on lower-layer-to-upper-layer reverse conduction.
In some possible embodiments, the step S320 may be implemented by the following sub-steps 1 to 4:
and 1, searching the target computing cluster to obtain the current global available resources, and determining the available resource information of the target computing cluster based on the global available resources and the resource demand configuration information.
The current global available resources can be obtained by searching available resources in the target computing cluster, and the global available resources can comprise memory resources and GPU resources. After the global available resources are obtained, whether the global available resources are larger than the number of applied resources corresponding to the resource demand configuration information can be judged, if the global available resources are larger than the number of applied resources, the control of the resources can be performed in a name ordering mode (in other embodiments, other ordering modes can also be adopted), and the available resource information corresponding to the applied resources is obtained. If the global available resource is smaller than or equal to the number of the applied resources, the resource information corresponding to the global available resource can be directly used as the available resource information of the target computing cluster.
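A rough sketch of the comparison logic in sub-step 1, assuming the global available resources are represented as a per-node list of free GPU counts; the data structure and field names are hypothetical.

def select_available_resources(global_resources: list[dict], requested: int) -> list[dict]:
    """Compare the global available resources with the applied amount (sub-step 1).

    `global_resources` is an assumed list of per-node records such as
    {"name": "node-01", "gpus_free": 8}; the real structure is not specified here.
    """
    total_free = sum(node["gpus_free"] for node in global_resources)
    if total_free > requested:
        # More resources than applied for: control them by name ordering and
        # take only as many as were requested.
        picked, acc = [], 0
        for node in sorted(global_resources, key=lambda n: n["name"]):
            if acc >= requested:
                break
            picked.append(node)
            acc += node["gpus_free"]
        return picked
    # Otherwise the whole global available resource list is used directly.
    return global_resources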
And 2, carrying out operator splitting on the defect detection large model based on the model configuration information to obtain operator information of the defect detection large model.
Each network component involved in the large model can be subjected to operator splitting in advance and registered into a distributed algorithm framework, and the registered operators can be subsequently called to obtain operator information of the large model to be trained. Wherein the operator is the encapsulation result of a function having a specific input and a specific output. Based on this, sub-step 2 can be realized by the following procedure: generating a network structure overall diagram of the defect detection large model according to the model configuration information; and carrying out operator splitting on the defect detection large model based on the network structure integral graph and the registered operators to obtain operator information of the defect detection large model.
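The registration-then-lookup idea of sub-step 2 could be sketched as follows; the decorator-based registry API and the component names are assumptions, not the interface of the actual distributed algorithm framework.

# Illustrative operator registry; the decorator-based API is an assumption.
OPERATOR_REGISTRY: dict[str, type] = {}

def register_operator(name: str):
    def wrapper(cls):
        OPERATOR_REGISTRY[name] = cls
        return cls
    return wrapper

@register_operator("Backbone/Resnet50")
class Resnet50Backbone:
    def __init__(self, in_channel: int = 3):
        self.in_channel = in_channel

def split_into_operators(structure_graph: list[str]) -> list[type]:
    # Look each node of the network structure overall diagram up among the
    # registered operators to obtain the operator information of the model.
    return [OPERATOR_REGISTRY[name] for name in structure_graph if name in OPERATOR_REGISTRY]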
And 3, carrying out pipeline parallel and tensor parallel segmentation on the weight parameters of each operator in the defect detection large model based on the operator information to obtain an initial segmentation result.
The weight parameter of each operator can be segmented firstly by considering interlayer parallelism based on lower layer factors (the parameter quantity of the operator, the memory occupation quantity, the dimension of the feature map and the like). Inter-layer parallelism, also known as pipeline parallelism, is the distribution of different layers of a model to different devices. For example, if there is a two-layer model and two graphics cards, the first layer of the model may be placed on one graphics card and the second layer on the other graphics card. This allows for efficient and rapid processing of large amounts of data.
In some possible embodiments, based on operator information and a preset first factor, carrying out pipeline parallelism and tensor parallelism segmentation on the weight parameter of each operator in the defect detection large model to obtain an initial segmentation result; the first factor includes the resource occupation amount and the forward and reverse time, and the resource occupation amount may include the parameter amount and the memory occupation amount of the operator. In addition, in other embodiments, the first factor may further include a dimension of the feature map, so as to improve the matching degree with the model configuration information, thereby improving the training accuracy of the large model.
And step 4, performing tensor parallel segmentation on the application data of the defect detection large model based on the initial segmentation result and the available resource information to obtain a current segmentation result.
The initial segmentation result can be conducted to an upper layer, and based on upper layer factors (network structure overall diagram, batch processing size, node number, display card number and the like), intra-layer parallelism is considered, and segmentation is conducted on application data of the upper layer, so that a mixed parallel mode based on lower layer-upper layer reverse conduction is obtained. The intra-layer parallelism, also called tensor model parallelism (i.e., tensor parallelism), is to perform parameter segmentation on a certain layer of the model, and place the segmented parameters on different devices for calculation. One typical application of intra-layer parallelism is Megatron, which performs a segmentation of one of the dimensions of an input matrix or parameter matrix and places the segmentation on a different device.
In some possible embodiments, tensor parallel segmentation can be performed on the application data of the defect detection large model based on the initial segmentation result, the available resource information and a preset second factor, so as to obtain a current segmentation result; wherein the second factor includes batch size.
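As a small illustration of the data-side segmentation in sub-step 4, the application data can be divided along the batch dimension according to the number of available shards; the shapes below are assumed for illustration only.

import torch

def split_batch(batch: torch.Tensor, num_shards: int) -> list[torch.Tensor]:
    # Data-side segmentation: divide the application data along the batch
    # dimension according to the available resources (second factor: batch size).
    return list(torch.chunk(batch, chunks=num_shards, dim=0))

shards = split_batch(torch.randn(64, 3, 1024, 1024), num_shards=8)  # 8 shards of 8 images each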
Step S330, the current segmentation result is evaluated in time dimension and resource dimension to obtain the current evaluation index value.
After the weight parameters of the operators are segmented, the resource occupation amount and the operator parameter amount of a single card (namely a single graphics card) can be calculated, and its forward and reverse times can be estimated, so as to obtain first resource occupation data of the operator side and first time occupation data of the operator side. After the application data is segmented, resource-time accounting can be performed; this process needs to consider the time dimension and the resource dimension, and calculate the gradient all-gather time (i.e., the gradient aggregation time) of each graphics card under NVLink+NCCL, so as to obtain second resource occupation data of the application data side and second time occupation data of the application data side. Based on the above information, the current segmentation result can be evaluated in the time dimension and the resource dimension to obtain the current evaluation index value.
In some possible embodiments, the step S330 may be implemented as follows: according to the current segmentation result, calculating to obtain resource occupation amount data and time occupation amount data; determining a resource score and a time score based on the resource occupancy data and the time occupancy data; based on preset weight data, carrying out weighted summation calculation on the resource scores and the time scores to obtain current evaluation index values; wherein the weight data includes a resource weight corresponding to the resource score and a time weight corresponding to the time score, and a sum of the resource weight and the time weight may be equal to 1.
Optionally, the resource occupation amount data includes first resource occupation data of an operator side and second resource occupation data of an application data side, and the time occupation amount data includes first time occupation data of the operator side and second time occupation data of the application data side; based on this, the resource score and the time score may be calculated by the following process: respectively carrying out score quantization on the resource occupation amount data and the time occupation amount data to obtain a first resource sub-score corresponding to the first resource occupation amount data, a second resource sub-score corresponding to the second resource occupation amount data, a first time sub-score corresponding to the first time occupation amount data and a second time sub-score corresponding to the second time occupation amount data; summing the first resource sub-score and the second resource sub-score to obtain a resource score; and summing the first time sub-score and the second time sub-score to obtain the time score. It should be noted that, the specific manner of quantifying the score may be set according to the actual requirement, which is not limited herein; the calculation of the resource score and the time score has no sequence, and the resource score can be calculated first and then the time score can be calculated, or the time score can be calculated first and then the resource score can be calculated.
In specific implementation, the current evaluation index value A can be calculated according to the following formula:

A = k1 × M + k2 × N

where k1 is the time weight, k2 is the resource weight, M is the time score, N is the resource score, and k1 + k2 = 1.

The weight data used in calculating the current evaluation index value may be set according to actual requirements and is not limited herein; for example, the resource weight k2 is 0.3 and the time weight k1 is 0.7.
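A direct transcription of the weighted-sum formula above, using the example weights k1 = 0.7 and k2 = 0.3; the score values in the usage line are made up for illustration.

def evaluation_index(time_score: float, resource_score: float,
                     k1: float = 0.7, k2: float = 0.3) -> float:
    """Current evaluation index A = k1 * M + k2 * N, with k1 + k2 = 1."""
    assert abs(k1 + k2 - 1.0) < 1e-9
    return k1 * time_score + k2 * resource_score

# Example: time score 0.9, resource score 0.6 -> A = 0.7*0.9 + 0.3*0.6 = 0.81
A = evaluation_index(0.9, 0.6)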
Step S340, optimizing the current segmentation result based on the current evaluation index value to obtain a target segmentation result of the large defect detection model, so as to perform distributed training on the large defect detection model based on the target segmentation result.
The current segmentation result can be evaluated based on the current evaluation index value: when the current evaluation index value reaches a preset index threshold value, the current segmentation result is considered feasible and can be directly used for distributed training of the large model; when the current evaluation index value is smaller than the index threshold value, the current segmentation result is considered infeasible and needs to be optimized until the index threshold value is reached. It should be noted that the index threshold may be set according to actual requirements and is not limited herein; for example, the index threshold is 0.8.
In some possible embodiments, step S340 may be implemented by the following sub-steps a to c:
And a substep a, judging whether the current evaluation index value reaches a preset index threshold value.
If not, executing the sub-step b; if so, sub-step c is performed.
Step b, obtaining loss values of all nodes in the target computing cluster under the current segmentation result; calculating to obtain current parallel cost based on each loss value corresponding to the current segmentation result; updating the model configuration information based on the current parallel cost, and re-executing step S320 based on the updated model configuration information.
Optionally, the current parallel cost K can be calculated by the following formula:

K = -∑_{i,j} lg(x_ij)

where x_ij denotes the loss value of the i-th node in dimension j; j takes the value 1 or 2, with j = 1 representing the time dimension and j = 2 representing the resource dimension.
Optionally, network structure parameters corresponding to the current parallel cost and the current model configuration information can be input into a search optimization equation for solving, so that updated model configuration information is obtained. The specific process of solving the search optimization equation may refer to the related art, and will not be described herein.
It should be noted that, the specific calculation manner of the loss value of each node may refer to related prior art, which is not described herein.
And c, determining the current segmentation result as a target segmentation result of the defect detection large model.
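Putting sub-steps a to c together, the optimization loop might look like the following sketch; segment, evaluate, node_losses and update_config are hypothetical callables standing in for step S320, step S330, the per-node loss collection and the search optimization equation, respectively.

import math
from typing import Callable

def parallel_cost(losses: dict) -> float:
    # K = -sum_{i,j} lg(x_ij), where x_ij is the loss value of node i in
    # dimension j (j = 1: time dimension, j = 2: resource dimension).
    return -sum(math.log10(x) for x in losses.values())

def optimize_segmentation(model_config: dict,
                          resource_config: dict,
                          segment: Callable, evaluate: Callable,
                          node_losses: Callable, update_config: Callable,
                          threshold: float = 0.8) -> dict:
    """Sketch of sub-steps a-c; the callables are placeholders, not the patent's API."""
    while True:
        result = segment(model_config, resource_config)    # step S320
        A = evaluate(result)                               # step S330
        if A >= threshold:                                 # sub-steps a and c
            return result                                  # target segmentation result
        K = parallel_cost(node_losses(result))             # sub-step b
        model_config = update_config(model_config, K)      # search optimization equation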
In this way, the above process realizes dynamic acquisition of resources and dynamic optimization of the segmentation result that combines tensor parallelism with pipeline parallelism, and the time dimension and the resource dimension are considered simultaneously during optimization so as to maximize parallel performance without losing precision.
The distributed training method of a large model provided by the embodiment of the invention can acquire the resource demand configuration information of the target computing cluster and the model configuration information of the defect detection large model to be trained; based on the resource demand configuration information and the model configuration information, sequentially carry out pipeline parallel and tensor parallel segmentation on the model structure and the application data of the defect detection large model to obtain a current segmentation result; evaluate the current segmentation result in the time dimension and the resource dimension to obtain a current evaluation index value; and optimize the current segmentation result based on the current evaluation index value to obtain a target segmentation result of the defect detection large model, so as to perform distributed training on the defect detection large model based on the target segmentation result. In this way, users do not need to explicitly define a distributed strategy: they only need to provide the resource requirements and the model configuration information of the defect detection large model and the training network can be formed dynamically; that is, automatic segmentation of the defect detection large model combining pipeline parallelism with tensor parallelism and dynamic optimization of the current segmentation result are realized, a target segmentation result matched with the model configuration information is obtained, and training precision is improved. Moreover, the time dimension and the resource dimension are considered simultaneously during optimization, so that parallel performance can be maximized.
For ease of understanding, the distributed training method of the large model is described in detail below.
Before describing the distributed training method of the large model, the main characteristics of the large model (such as a defect detection large model) in the target detection field are described.
1. Batch size:
Parameter number (used to describe the size of the model): increasing the batch size increases the memory requirement, because more images need to be processed at the same time; this may require more storage for intermediate activation values and gradients.
Computing resource requirements: larger batch sizes may require more computing resources, such as GPUs and memory, and may therefore call for more parallel computing power.
Training speed: larger batch sizes generally increase training speed because computing resources may be more efficiently utilized, but more memory may also be required.
Stability: larger batch sizes typically introduce less noisy gradient estimates, thereby improving training stability.
2. Network input size (input shape):
parameter number: increasing the size of the input image results in a larger feature map, which may require more parameters to process. The number of parameters of the convolutional layer may increase, especially the number of parameters of the fully connected layer.
Computing resource requirements: larger input images may require more computing resources to process because convolution operations may become more time consuming. This may require a larger GPU, memory, and computing power.
Feature extraction quality: larger input images may provide more rich feature information, helping to improve the performance of object detection, especially for small objects, but may also lead to more computational complexity.
3. Target category number (cls number):
parameter number: increasing the number of target classes increases the number of output channels of the classifier, as one output channel is required for each class. This increases the number of parameters of the fully connected layer.
Data requirements: more target categories typically require more training data to ensure that each category has enough examples to train.
Computing resource requirements: a larger number of categories would result in an increased computational effort for the classifier, as more category classification would be required.
4. Internal parameters
Parameter number: depending on the network structure, such as dual-stage Anchor based object detection, single-stage Anchor based object detection, anchor-free networks have special structures that increase the parameter dimension of the node.
Computing resource requirements: an increase in the parameter amount of the node results in an increase in the computation amount of the classifier because more resource computation is required.
The distributed parallel strategy corresponding to the distributed training method of the large model is described in detail below.
For the above-described defect detection large model, intra-layer parallelism (dimension one) and inter-layer parallelism (dimension two) are performed according to the following rules.
In this scheme, for the weights in the defect detection large model, a set of algorithms based on automatic strategy searching is used for dynamic networking before training starts, and different strategies are applied to the different structures inside the network. The specific steps are as follows:
step 1: and (5) registering operators. And (3) carrying out operator disassembly on all network components needing dynamic networking, registering into a distributed algorithm framework, and carrying out detailed disassembly and recombination on each network component in the defect detection large model to be trained. Some of the network components are shown in table 1 below.
TABLE 1
Wherein VGG refers to Visual Geometry Group; Resnet50 refers to the 50-layer residual network; Resnet101 refers to the 101-layer residual network; CSPnet refers to Cross Stage Partial Network, i.e., a cross-stage partial network; Backbone refers to the backbone network; Image Preprocess refers to image preprocessing; Anchor Generator refers to the anchor box generator; Scales, Ratios and Strides refer to the anchor scales, aspect ratios and strides, respectively; Region Proposal Net refers to the region proposal network; InChannel and FeaChannel refer to the input dimension and the feature dimension, respectively; RoIExtractor refers to the region feature extractor; RoiLayer and OutChannel refer to the region feature layer and the output dimension, respectively; BBoxHead refers to the bounding-box head (network head) structure; RoiFeatSize and NumClass refer to the region feature size and the number of classes, respectively; Classification Layer refers to the classification layer; Spatial Pyramid Pooling-Fast refers to a single-stage network feature extractor; PANet refers to the Path Aggregation Network.
Step 2: the environment depends on the search. Before training is started, the configuration file transmits Nodes (the number of needed Nodes), nproc_per_node (the number of processes needed by each node), the number of needed GPU resources and the size of memory needed by each process to an elastencv module in an algorithm in a parameter transmission mode, the elastencv module is used for allocating environment dependence in a cluster, acquiring a resource list needed by elastic training in a local area network by searching available resources in the cluster, and automatically generating a corresponding parameter queue (namely an environment dependence list of elastencv), as shown in the following table 2.
TABLE 2
After the ElasticEnv module acquires the global resource information, it applies to the cluster resource controller for access to all cluster resources, including memory resources and GPU resources. If the available resources are greater than the number of resources to be applied for, the resources are controlled by name ordering, and the nodes and information of the available resources are returned to the ElasticEnv module. If the available resources are less than or equal to the number of resources to be applied for, the whole available resource list is returned. ElasticEnv then returns the information to the Master node via the cluster controller (which includes the cluster resource controller) for Master node scheduling.
Step 3: networking strategies. After obtaining the available resources in the cluster through step 2, the available resources are transferred to a Elastic NetControl module, and the Elastic NetControl module performs dynamic networking through an automatic searching strategy. The specific steps are as follows:
step 3.1: the user designates a certain configuration file, all parameters are defined in detail in the configuration file, including the model to be selected, specific parameters of the model (such as Inchannel, outChannel of a head and other parameters, etc.), and the elastic netcontrol module obtains specific subdivision (i.e. specific parameter data) of the overall diagram of the network structure to be started currently according to the configuration file.
Step 3.2: according to the overall network structure diagram, the operators therein are split from the registrar (reg unit), under this step, the registrar (reg unit) provides the components and the operator structures of the components to the elastic netcontrol module, which includes detailed network parameters, and parameters, memory, and display memory estimated by the network parameters, and after the elastic netcontrol module, a reverse derivation-upper layer verification cross search algorithm is used to formulate a parallel mode, which specifically includes:
the factors are two layers, the first layer is the upper layer factors (network structure diagram, batch size, node number, display card number), and the second layer is the lower layer factors (parameter number, display memory occupation, feature map dimension of operators). Firstly, a lower layer is assigned with a strategy, interlayer parallelism is considered, the weight parameter of each operator is segmented, the weight parameter is segmented into the slice components of the global resource number obtained from the elastic env module, the resource occupation quantity (comprising the operator parameter quantity of a single card) is calculated independently, and the forward and reverse time of the lower layer is estimated. And then conducting the strategy to an upper layer, applying data parallelism to the upper layer, and further segmenting according to the batch size to obtain a mixed parallel mode based on the lower layer-upper layer reverse conduction. The upper layer elastic env module performs resource-time accounting on the parallel mode, and the process needs to consider the time dimension and the resource dimension and calculate the gradient all other time of each graphic card under NVLink+NCCL. After the upper layer component calculates the overall index, an index related to the current parallel mode is obtained and is marked as A.
Regarding NVLink and NCCL:
NVLink is a high-speed, high-bandwidth inter-GPU connection technology. It is an interface for connecting multiple GPUs that can provide higher bandwidth and lower latency than a conventional PCIe bus, and it enables direct communication between multiple GPUs. NVLink is mainly used for communication between multiple graphics cards on a single machine.
NCCL (NVIDIA Collective Communications Library) is a library for accelerating communication between multiple GPUs. NCCL provides high-performance collective communication primitives for efficient data exchange and communication between multiple GPUs; it can speed up data transmission in multi-GPU computing and improve the efficiency of parallel computing tasks. NCCL is mainly used for communication between graphics cards on different servers.
Here, NVLink+NCCL mainly refers to multi-card communication within the same computer together with communication across graphics cards on different computers.
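As background only (not a step of the method described here), gradient aggregation across graphics cards under NVLink+NCCL is commonly expressed with a collective call such as the PyTorch sketch below; the process-group environment (rank, world size, master address) is assumed to be provided by the launcher.

import torch
import torch.distributed as dist

# Each training process joins an NCCL process group (rank/world size are
# assumed to come from the launcher's environment variables).
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

grad = torch.randn(1024, device="cuda")
# All-reduce aggregates gradients across all graphics cards; NCCL routes the
# traffic over NVLink inside a machine and over the network between machines.
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
grad /= dist.get_world_size()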
The calculation mode of A is composed of two parts, namely time occupation and resource occupation, wherein:
the time occupation comprises the following components: parallel strategy, model complexity, data batch size on single card, data preprocessing method, optimizer and learning rate.
The parallel strategy and the per-card data batch size are obtained by the ElasticNetControl module from the parameters transmitted by the ElasticEnv module; the parallel strategy includes the tensor parallel and pipeline parallel subdivision parameters and the per-card data batch size, each parameter has a corresponding score, and these scores are summed to obtain the parallel strategy score. The model complexity, the data preprocessing method, the optimizer and the learning rate are calculated from the values in the configuration file. Finally, the scores of all components are summed to obtain the time score M.
The resource occupation comprises the following components: model size, batch size, network structure and layer number, parameter number, data type and data accuracy, optimizer and gradient calculation. Each component has a corresponding unit score, and specific model types, networks, layers and parameter numbers can be obtained according to the configuration file of the model, and the data types and data precision used in training and the types of optimizers are obtained. And calculating the resource score according to the parameter information, namely N.
Finally, M and N are summed with the weight ratio A = 0.7 × M + 0.3 × N to obtain A.
Step 3.3: and (3) evaluating the current parallel mode according to the index A obtained in the step 3.2, and if the A is more than 0.8, transmitting specific parameters (namely a result obtained by a parallel strategy, including parameter information of a segmentation strategy and required resource information) of the parallel mode into a Master node. If the evaluation index is less than 0.8, the current parallel cost is calculatedKKThe calculation mode of (2) is as follows:
K=-∑ ij lg(xij)
wherein,xijrepresent the firstiIndividual nodes are in dimensionjLower loss value, dimensionjThe value of (2) may be 1 or 2,j=1 represents the time dimension and,j=2 represents the resource dimension.
Obtaining parallel costs KThereafter, the cost is reducedKThe current parameter combination (the parameter of the network structure overall diagram obtained in the step 3.2) is input into a search optimization equation to be solved, a new parameter combination is obtained, and the steps 3.2 to 3 are repeated3.3, until the index A reaches 0.8, returning the final result and outputting the result to the Master node.
Step 4: and (5) node scheduling. After the Master node receives the parameter scheduling scheme, the distributed component Elastic pod will be started, and the distributed policy will be broadcasted to the Elastic pod on each node. And starting a corresponding operator on each node according to the distributed strategy.
The distributed training method of the large model provided by the embodiment of the invention has the following beneficial effects:
1. With the distributed scheme based on a search strategy, the user does not need to explicitly define the distributed strategy; instead, the user gives some desired indexes, such as the number of graphics cards required and the required batch size, and the component automatically performs search optimization.
2. Performance optimization for the specific target detection network: the target detection network is disassembled into operators in the registration stage, and each operator is analyzed in the strategy stage to obtain the optimal parallel indexes.
Corresponding to the above-mentioned large model distributed training method, the embodiment of the present invention further provides a large model distributed training device, referring to a schematic structural diagram of a large model distributed training device shown in fig. 4, where the device includes:
An obtaining module 401, configured to obtain resource requirement configuration information of a target computing cluster and model configuration information of a defect detection large model to be trained;
the segmentation module 402 is configured to sequentially perform pipeline parallel and tensor parallel segmentation on the model structure and the application data of the defect detection large model based on the resource requirement configuration information and the model configuration information, so as to obtain a current segmentation result;
the evaluation module 403 is configured to perform evaluation on the time dimension and the resource dimension on the current segmentation result, so as to obtain a current evaluation index value;
and the optimizing module 404 is configured to optimize the current segmentation result based on the current evaluation index value, and obtain a target segmentation result of the large defect detection model, so as to perform distributed training on the large defect detection model based on the target segmentation result.
The distributed training device of a large model provided by the embodiment of the invention can acquire the resource demand configuration information of the target computing cluster and the model configuration information of the defect detection large model to be trained; based on the resource demand configuration information and the model configuration information, sequentially carry out pipeline parallel and tensor parallel segmentation on the model structure and the application data of the defect detection large model to obtain a current segmentation result; evaluate the current segmentation result in the time dimension and the resource dimension to obtain a current evaluation index value; and optimize the current segmentation result based on the current evaluation index value to obtain a target segmentation result of the defect detection large model, so as to perform distributed training on the defect detection large model based on the target segmentation result. In this way, users do not need to explicitly define a distributed strategy: they only need to provide the resource requirements and the model configuration information of the defect detection large model and the training network can be formed dynamically; that is, automatic segmentation of the defect detection large model combining pipeline parallelism with tensor parallelism and dynamic optimization of the current segmentation result are realized, a target segmentation result matched with the model configuration information is obtained, and training precision is improved. Moreover, the time dimension and the resource dimension are considered simultaneously during optimization, so that parallel performance can be maximized.
Further, the segmentation module 402 is specifically configured to:
searching the target computing cluster to obtain current global available resources, and determining available resource information of the target computing cluster based on the global available resources and the resource demand configuration information;
performing operator splitting on the defect detection large model based on the model configuration information to obtain operator information of the defect detection large model;
carrying out pipeline parallel and tensor parallel segmentation on weight parameters of each operator in the defect detection large model based on the operator information to obtain an initial segmentation result;
and carrying out tensor parallel segmentation on the application data of the defect detection large model based on the initial segmentation result and the available resource information to obtain a current segmentation result.
Further, the segmentation module 402 is further configured to:
generating a network structure overall diagram of the defect detection large model according to the model configuration information;
and carrying out operator splitting on the defect detection large model based on the network structure integral graph and the registered operators to obtain operator information of the defect detection large model.
Further, the segmentation module 402 is further configured to:
Based on the operator information and a preset first factor, carrying out pipeline parallelism and tensor parallelism segmentation on the weight parameter of each operator in the defect detection large model to obtain an initial segmentation result; wherein the first factor includes a resource occupancy and forward and reverse times.
Further, the segmentation module 402 is further configured to:
performing tensor parallel segmentation on the application data of the defect detection large model based on the initial segmentation result, the available resource information and a preset second factor to obtain a current segmentation result; wherein the second factor comprises a batch size.
Further, the evaluation module 403 is specifically configured to:
according to the current segmentation result, calculating to obtain resource occupation amount data and time occupation amount data;
determining a resource score and a time score based on the resource occupancy data and the time occupancy data;
based on preset weight data, carrying out weighted summation calculation on the resource scores and the time scores to obtain current evaluation index values; wherein the weight data includes a resource weight corresponding to the resource score and a time weight corresponding to the time score.
Further, the resource occupation amount data comprises first resource occupation data of an operator side and second resource occupation data of an application data side, and the time occupation amount data comprises first time occupation data of the operator side and second time occupation data of the application data side; the evaluation module 403 is further configured to:
perform score quantization on the resource occupation amount data and the time occupation amount data respectively to obtain a first resource sub-score corresponding to the first resource occupation data, a second resource sub-score corresponding to the second resource occupation data, a first time sub-score corresponding to the first time occupation data, and a second time sub-score corresponding to the second time occupation data;
sum the first resource sub-score and the second resource sub-score to obtain the resource score;
and sum the first time sub-score and the second time sub-score to obtain the time score.
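A small sketch of how the operator-side and application-data-side sub-scores might be quantized and summed into the resource score and time score; the per-side budgets and the normalization rule are assumptions made only for illustration:

```python
def quantize(occupancy, budget):
    # Score quantization: lower occupancy relative to its budget scores higher.
    return max(0.0, 1.0 - occupancy / budget)

def aggregate_scores(op_resource, data_resource, op_time, data_time, budgets):
    resource_score = (quantize(op_resource, budgets["op_resource"]) +
                      quantize(data_resource, budgets["data_resource"]))
    time_score = (quantize(op_time, budgets["op_time"]) +
                  quantize(data_time, budgets["data_time"]))
    return resource_score, time_score

budgets = {"op_resource": 40, "data_resource": 40, "op_time": 100, "data_time": 100}
print(aggregate_scores(op_resource=30, data_resource=20,
                       op_time=60, data_time=50, budgets=budgets))
```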
Further, the optimization module 404 is specifically configured to:
determine whether the current evaluation index value reaches a preset index threshold;
if not, acquire the loss value of each node in the target computing cluster under the current segmentation result, calculate the current parallel cost based on the loss values corresponding to the current segmentation result, update the model configuration information based on the current parallel cost, and re-execute, based on the resource demand configuration information and the updated model configuration information, the step of sequentially performing pipeline parallel and tensor parallel segmentation on the model structure and the application data of the defect detection large model to obtain a new current segmentation result;
and if so, determine the current segmentation result as the target segmentation result of the defect detection large model.
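For illustration only, the iterative optimization just described might be organized as in the following Python sketch; every callable, the round limit, and the use of the mean node loss as the parallel cost are assumptions rather than the claimed implementation:

```python
def optimize(model_cfg, demand_cfg, threshold, max_rounds=10,
             build_plan=None, evaluate=None, node_losses=None, update_cfg=None):
    plan = build_plan(demand_cfg, model_cfg)
    for _ in range(max_rounds):
        score = evaluate(plan)
        if score >= threshold:
            return plan                       # target segmentation result reached
        losses = node_losses(plan)            # loss value of each node in the cluster
        parallel_cost = sum(losses) / len(losses)
        model_cfg = update_cfg(model_cfg, parallel_cost)
        plan = build_plan(demand_cfg, model_cfg)  # re-execute the segmentation step
    return plan

# Toy usage with trivial stand-in callables:
result = optimize(
    model_cfg={"layers": 8}, demand_cfg={"gpus": 8}, threshold=0.9,
    build_plan=lambda d, m: {"cfg": dict(m)},
    evaluate=lambda p: 0.95,
    node_losses=lambda p: [0.1, 0.2],
    update_cfg=lambda m, c: m,
)
print(result)
```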
The implementation principle and technical effects of the distributed training device for a large model provided in this embodiment are the same as those of the foregoing embodiment of the distributed training method for a large model; for brevity, where this device embodiment does not mention a detail, reference may be made to the corresponding content of the method embodiment.
As shown in fig. 5, an electronic device 500 according to an embodiment of the present invention includes a processor 501, a memory 502, and a bus. The memory 502 stores a computer program executable on the processor 501. When the electronic device 500 runs, the processor 501 and the memory 502 communicate via the bus, and the processor 501 executes the computer program to implement the distributed training method for a large model described above.
Specifically, the memory 502 and the processor 501 may be a general-purpose memory and a general-purpose processor, respectively, which are not specifically limited herein.
Embodiments of the present invention further provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the distributed training method for a large model described in the foregoing method embodiments. The computer-readable storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Any specific values in the examples shown and described herein are merely illustrative and not limiting; other exemplary embodiments may therefore use different values.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical functional division, and there may be other divisions in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, devices, or units, and may be electrical, mechanical, or in another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (10)

1. A distributed training method for a large model, comprising:
acquiring resource demand configuration information of a target computing cluster and model configuration information of a large model to be trained;
based on the resource demand configuration information and the model configuration information, sequentially carrying out pipeline parallel and tensor parallel segmentation on the model structure and the application data of the large model to obtain a current segmentation result;
evaluating the time dimension and the resource dimension of the current segmentation result to obtain a current evaluation index value;
and optimizing the current segmentation result based on the current evaluation index value to obtain a target segmentation result of the large model, so as to perform distributed training on the large model based on the target segmentation result.
2. The method according to claim 1, wherein the sequentially performing pipeline parallel and tensor parallel segmentation on the model structure and the application data of the large model based on the resource requirement configuration information and the model configuration information to obtain a current segmentation result includes:
searching the target computing cluster to obtain current global available resources, and determining available resource information of the target computing cluster based on the global available resources and the resource demand configuration information;
performing operator splitting on the large model based on the model configuration information to obtain operator information of the large model;
carrying out pipeline parallelism and tensor parallelism segmentation on the weight parameters of each operator in the large model based on the operator information to obtain an initial segmentation result;
and carrying out tensor parallel segmentation on the application data of the large model based on the initial segmentation result and the available resource information to obtain a current segmentation result.
3. The method according to claim 2, wherein the performing operator splitting on the large model based on the model configuration information to obtain operator information of the large model includes:
generating an overall network structure diagram of the large model according to the model configuration information;
and performing operator splitting on the large model based on the overall network structure diagram and the registered operators to obtain the operator information of the large model.
4. The method according to claim 2, wherein the performing pipeline parallel and tensor parallel segmentation on the weight parameter of each operator in the large model based on the operator information to obtain an initial segmentation result includes:
based on the operator information and a preset first factor, carrying out pipeline parallelism and tensor parallelism segmentation on the weight parameter of each operator in the large model to obtain an initial segmentation result; wherein the first factor includes a resource occupancy and forward and reverse times.
5. The method according to claim 2, wherein performing tensor parallel segmentation on the application data of the large model based on the initial segmentation result and the available resource information to obtain a current segmentation result includes:
performing tensor parallel segmentation on the application data of the large model based on the initial segmentation result, the available resource information and a preset second factor to obtain a current segmentation result; wherein the second factor comprises a batch size.
6. The method according to claim 1, wherein said evaluating the current segmentation result in a time dimension and a resource dimension to obtain a current evaluation index value comprises:
according to the current segmentation result, calculating to obtain resource occupation amount data and time occupation amount data;
determining a resource score and a time score based on the resource occupancy data and the time occupancy data;
based on preset weight data, carrying out weighted summation calculation on the resource scores and the time scores to obtain current evaluation index values; wherein the weight data includes a resource weight corresponding to the resource score and a time weight corresponding to the time score.
7. The method of claim 6, wherein the resource occupancy data comprises first resource occupancy data on an operator side and second resource occupancy data on an application data side, and wherein the time occupancy data comprises first time occupancy data on the operator side and second time occupancy data on the application data side; the determining a resource score and a time score based on the resource occupancy data and the time occupancy data includes:
respectively performing score quantization on the resource occupation amount data and the time occupation amount data to obtain a first resource sub-score corresponding to the first resource occupancy data, a second resource sub-score corresponding to the second resource occupancy data, a first time sub-score corresponding to the first time occupancy data, and a second time sub-score corresponding to the second time occupancy data;
summing the first resource sub-score and the second resource sub-score to obtain a resource score;
and carrying out summation calculation on the first time sub-score and the second time sub-score to obtain a time score.
8. The method according to claim 1, wherein optimizing the current segmentation result based on the current evaluation index value to obtain a target segmentation result of the large model comprises:
judging whether the current evaluation index value reaches a preset index threshold value or not;
if not, acquiring the loss value of each node in the target computing cluster under the current segmentation result; calculating the current parallel cost based on the loss values corresponding to the current segmentation result; updating the model configuration information based on the current parallel cost; and re-executing, based on the resource demand configuration information and the updated model configuration information, the step of sequentially performing pipeline parallel and tensor parallel segmentation on the model structure and the application data of the large model to obtain a new current segmentation result;
and if so, determining the current segmentation result as the target segmentation result of the large model.
9. A large model distributed training apparatus, comprising:
the acquisition module is used for acquiring resource demand configuration information of the target computing cluster and model configuration information of a large model to be trained;
the segmentation module is used for sequentially carrying out pipeline parallel and tensor parallel segmentation on the model structure and the application data of the large model based on the resource demand configuration information and the model configuration information to obtain a current segmentation result;
the evaluation module is used for evaluating the time dimension and the resource dimension of the current segmentation result to obtain a current evaluation index value;
and the optimization module is used for optimizing the current segmentation result based on the current evaluation index value to obtain a target segmentation result of the large model so as to perform distributed training on the large model based on the target segmentation result.
10. An electronic device, comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor, when executing the computer program, implements the method of any one of claims 1-8.
CN202410276489.9A 2024-03-12 2024-03-12 Distributed training method and device for large model and electronic equipment Pending CN117875362A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410276489.9A CN117875362A (en) 2024-03-12 2024-03-12 Distributed training method and device for large model and electronic equipment

Publications (1)

Publication Number Publication Date
CN117875362A true CN117875362A (en) 2024-04-12

Family

ID=90595266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410276489.9A Pending CN117875362A (en) 2024-03-12 2024-03-12 Distributed training method and device for large model and electronic equipment

Country Status (1)

Country Link
CN (1) CN117875362A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113961351A (en) * 2021-10-28 2022-01-21 北京百度网讯科技有限公司 Distributed training method, device, equipment and storage medium for deep learning model
CN115543639A (en) * 2022-12-01 2022-12-30 阿里云计算有限公司 Optimization method for distributed execution of deep learning task and distributed system
CN117156492A (en) * 2023-09-05 2023-12-01 山东大学 Deep reinforcement learning-based dual-time-scale resource allocation method for joint service caching, communication and calculation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination