CN115481729A - Hybrid operator model parallel training method, device, equipment and storage medium - Google Patents

Hybrid operator model parallel training method, device, equipment and storage medium

Info

Publication number
CN115481729A
Authority
CN
China
Prior art keywords
training
model
branch
output result
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211143025.8A
Other languages
Chinese (zh)
Inventor
任智祥
任一铭
田永鸿
高文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peng Cheng Laboratory
Priority to CN202211143025.8A
Publication of CN115481729A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of image processing, in particular to a parallel training method, device, equipment and storage medium for a hybrid operator model. The method first divides the hybrid operator model into branch models, i.e., it breaks the whole model into parts; a branch model is then trained on each training node, so that the training nodes train all branch models in parallel at the same time; finally, the output results of the branch models are aggregated to obtain the overall output result of the hybrid operator model, and the model parameters are adjusted according to the overall output result to complete model training. As this analysis shows, parallel training is realized by training one branch model on each training node; parallel training saves training time, which improves training speed and allows the hybrid operator model to converge quickly.

Description

Hybrid operator model parallel training method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of image processing, in particular to a parallel training method, a device, equipment and a storage medium for a hybrid operator model.
Background
The hybrid operator model is trained with a large amount of training data, and the trained model can be used to extract image features. For example, convolution and the self-attention mechanism are two basic modules in a deep neural network (the hybrid operator model): the former extracts local features of an image in a linear manner, while the latter encodes high-order context through non-local features. Because the two are complementary in nature, a hybrid approximation operator consisting of a global convolution and a self-attention mechanism can be derived that unifies local and non-local feature interaction, achieving highly competitive performance on both vision and language tasks. The visual Transformer model, by using a hybrid operator combining convolution and the self-attention mechanism, achieves performance exceeding that of convolutional neural networks in traditional vision tasks such as image classification and object recognition. However, as the scale of the model increases, higher requirements are placed on the nodes that train the model. In the prior art, the hybrid operator model is placed in its entirety on a single node for training, so training the model takes a large amount of time.
In summary, the prior art method of training the hybrid operator model results in a slower training speed.
Thus, there is a need for improvements and enhancements in the art.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a parallel training method, device, equipment and storage medium for a hybrid operator model, solving the problem of slow training speed caused by the prior-art method of training the hybrid operator model.
In order to achieve the purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a parallel training method for a hybrid operator model, wherein the method includes:
dividing the mixed operator model into branch models;
controlling each training node to run each branch model in parallel to obtain a branch output result of each branch model, wherein the training node is computer hardware used for training the hybrid operator model;
and finishing the training of the hybrid operator model according to the branch output result of each branch model.
In one implementation, the dividing the hybrid operator model into branch models includes:
according to the hybrid operator model, obtaining the sequentially connected convolutional layer, self-attention network and fully-connected network covered by the hybrid operator model, wherein the convolutional layer comprises a plurality of convolution kernels, the self-attention network comprises a plurality of self-attention layers, and the fully-connected network comprises a plurality of fully-connected layers;
dividing a plurality of convolution kernels of the convolution layer, a plurality of self-attention layers of the self-attention network and a plurality of full-connection layers of the full-connection network into branch models, wherein each branch model respectively comprises the convolution kernels, the self-attention layers and the full-connection layers.
In an implementation manner, the controlling each training node to run each branch model in parallel to obtain a branch output result of each branch model, where the training node is computer hardware used for training the hybrid operator model, includes:
inputting training input data into the convolution kernels of the branch models, and controlling the training nodes to run the convolution kernels in parallel to obtain output results of the convolution kernels;
controlling each training node to run the self-attention layer of each branch model in parallel according to the output result of each convolution kernel to obtain the output result of each self-attention layer;
and controlling each training node to run the full-connection layer of each branch model in parallel according to the output result of each self-attention layer to obtain the output result of each full-connection layer in each branch output result, wherein the full-connection layer, the self-attention layer and the convolution kernel on one branch model are positioned on the same training node.
In one implementation, the inputting training input data to convolution kernels of each branch model and controlling each training node to run each convolution kernel in parallel to obtain an output result of each convolution kernel includes:
obtaining a training GPU in the training nodes according to the training nodes;
controlling the training GPU to capture the training input data from a data storage node, wherein the data storage node is used for storing the training input data;
fixing the operation attribute of each convolution kernel;
and inputting the training input data to each convolution kernel after the operational attribute is fixed, and controlling each training GPU to run each convolution kernel in parallel to obtain an output result of each convolution kernel.
In one implementation, the controlling, according to the output result of each convolution kernel, each training node to run a self-attention layer of each branch model in parallel to obtain the output result of each self-attention layer includes:
multiplying the output result of each convolution kernel by parameter matrixes with set quantity to obtain construction matrixes with set quantity, wherein the set quantity is the quantity which is required by the self-attention layer in each branch model and is used as an input parameter;
and inputting the construction matrixes with the set number corresponding to the convolution kernels into the self-attention layers, controlling the training GPUs in the training nodes to run the self-attention layers after the construction matrixes are input in parallel, and obtaining output results of the self-attention layers, wherein one convolution kernel in each convolution kernel and the column component in each self-attention layer are located in the same branch model.
In an implementation manner, the inputting the construction matrices of the set number corresponding to each convolution kernel to each self-attention layer, and controlling a training GPU in each training node to run each self-attention layer after the construction matrices are input in parallel to obtain an output result of each self-attention layer, where one of the convolution kernels in each convolution kernel and a column component in each self-attention layer are located in a same branch model, includes:
obtaining a query matrix, a key matrix and a value matrix in each construction matrix according to each construction matrix;
inputting the query matrix, the key matrix, and the value matrix to each of the self-attention layers;
performing transposition operation on the key matrix on the training GPU to obtain a transposition matrix of the key matrix;
performing multiplication operation on the query matrix and the transpose matrix of the key matrix on the training GPU to obtain a multiplication matrix;
normalizing the multiplication matrix on the training GPU, obtaining a normalized matrix;
and performing multiplication operation of each self-attention layer on the normalization matrix and the value matrix in each training GPU in parallel to obtain an output result of each self-attention layer.
In one implementation, the controlling, according to the output result of each self-attention layer, each training node to run a full connection layer of each branch model in parallel to obtain an output result of each full connection layer in each branch output result, where the full connection layer, the self-attention layer, and the convolution kernel on one branch model are located on the same training node, includes:
controlling a training GPU in each training node to segment the weight matrix row of each full-connection layer to obtain a weight row matrix;
and controlling each training GPU to run multiplication of the output result of each self-attention layer on each branch model and the weight row matrix of each full-connection layer in parallel to obtain the output result of each full-connection layer in each branch output result.
In an implementation manner, each training node is a secondary node, a set number of secondary nodes are located in the same primary node, and the training of the hybrid operator model is completed according to a branch output result of each branch model, including:
aggregating branch output results of each branch model located on each secondary node to obtain a total output result corresponding to each primary node, wherein the total output result is an output result of the hybrid operator model;
and finishing the training of the hybrid operator model according to the total output result corresponding to each primary node.
In an implementation manner, the completing training of the hybrid operator model according to the total output result corresponding to each of the primary nodes includes:
determining the primary nodes which are communicated with each other in each primary node, and recording as a node cluster;
sharing the total output result corresponding to each primary node in the node cluster among the primary nodes;
controlling each primary node in the node cluster to calculate an average value of the total output results after sharing, and obtaining a final output result corresponding to each primary node in the node cluster;
comparing the final output result corresponding to each primary node in the node cluster with a set output result to obtain a comparison result;
and training the hybrid operator model according to the comparison result until the training of the hybrid operator model is completed.
In one implementation, the hybrid operator model is an image classification model.
In a second aspect, an embodiment of the present invention further provides a hybrid operator model parallel training apparatus, where the apparatus includes the following components:
the model segmentation module is used for dividing the mixed operator model into branch models;
the node control module is used for controlling each training node to run each branch model in parallel to obtain a branch output result of each branch model, and the training nodes are computer hardware used for training the hybrid operator model;
and the training module is used for finishing the training of the hybrid operator model according to the branch output result of each branch model.
In a third aspect, an embodiment of the present invention further provides a terminal device, where the terminal device includes a memory, a processor, and a hybrid operator model parallel training program that is stored in the memory and is executable on the processor, and when the processor executes the hybrid operator model parallel training program, the method for training the hybrid operator model in parallel is implemented.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where a hybrid operator model parallel training program is stored on the computer-readable storage medium, and when the hybrid operator model parallel training program is executed by a processor, the steps of the hybrid operator model parallel training method are implemented.
Beneficial effects: the method first divides the hybrid operator model into branch models, i.e., it breaks the whole model into parts; a branch model is then trained on each training node, so that the training nodes train all branch models in parallel at the same time; finally, the output results of the branch models are aggregated to obtain the overall output result of the hybrid operator model, and the model parameters are adjusted according to the overall output result to complete model training. As this analysis shows, parallel training is realized by training one branch model on each training node; parallel training saves training time, which improves training speed and allows the hybrid operator model to converge quickly.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a hybrid parallel flow diagram in an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating secondary node resource selection in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a decentralized architecture according to an embodiment of the present invention;
FIG. 5 is a flow chart of model training in an embodiment of the present invention;
fig. 6 is a schematic block diagram of an internal structure of a terminal device according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is clearly and completely described below by combining the embodiment and the attached drawings of the specification. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Research shows that a hybrid operator model is trained with a large amount of training data, and the trained model can be used to extract image features. For example, convolution and the self-attention mechanism are two basic modules in a deep neural network (the hybrid operator model): the former extracts local features of an image in a linear manner, while the latter encodes high-order context through non-local features. Because the two are complementary in nature, a hybrid approximation operator consisting of a global convolution and a self-attention mechanism can be derived that unifies local and non-local feature interaction, achieving highly competitive performance on both vision and language tasks. The visual Transformer model, by using a hybrid operator combining convolution and the self-attention mechanism, achieves performance exceeding that of convolutional neural networks in traditional vision tasks such as image classification and object recognition. However, as the scale of the model increases, higher requirements are placed on the nodes that train the model. In the prior art, the hybrid operator model is placed in its entirety on a single node for training, so training the model takes a large amount of time.
In order to solve the above technical problems, the invention provides a parallel training method, device, equipment and storage medium for a hybrid operator model, solving the problem of slow training speed caused by the prior-art method of training the hybrid operator model. In a specific implementation, the hybrid operator model is first divided into branch models; each training node is then controlled to run one branch model in parallel to obtain a branch output result for each branch model; finally, the training of the hybrid operator model is completed according to the branch output results of the branch models. The training method of this embodiment can improve the training speed.
For example, the hybrid operator model is a deep neural network model composed of a convolutional layer, a self-attention network, and a fully-connected network, as shown in fig. 2, the convolutional layer includes a convolutional kernel 1, a convolutional kernel 2, and a convolutional kernel 3, the self-attention network includes a self-attention layer 1, a self-attention layer 2, and a self-attention layer 3, and the fully-connected network includes a fully-connected layer 1, a fully-connected layer 2, and a fully-connected layer 3.
The training nodes include a training node 1 (the training node is a computer), a training node 2, and a training node 3. The training model composed of the convolution kernel 1, the self-attention layer 1 and the full connection layer 1 is placed on the training node 1 for training, the training model composed of the convolution kernel 2, the self-attention layer 2 and the full connection layer 2 is placed on the training node 2 for training, the training model composed of the convolution kernel 3, the self-attention layer 3 and the full connection layer 3 is placed on the training node 3 for training, and the training node 1, the training node 2 and the training node 3 respectively train the three branch models in parallel at the same time, so that the model training speed can be improved.
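As an illustration of this example, the following sketch (in PyTorch-style Python; the Branch class, the layer sizes and the three-GPU layout are assumptions for illustration, not the patent's implementation) places one branch model per training node so that the three branches can run concurrently:

import torch
import torch.nn as nn

class Branch(nn.Module):
    """One branch model: a convolution kernel, a self-attention layer and a fully-connected layer."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, dim, kernel_size=3, padding=1)            # convolution kernel
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)  # self-attention layer
        self.fc = nn.Linear(dim, dim)                                           # fully-connected layer

    def forward(self, x):
        feat = self.conv(x)                        # B x dim x H x W
        seq = feat.flatten(2).transpose(1, 2)      # B x (H*W) x dim, sequence form for attention
        out, _ = self.attn(seq, seq, seq)
        return self.fc(out)

# One branch per training node (modeled here as one GPU per branch).
devices = ["cuda:0", "cuda:1", "cuda:2"]
branches = [Branch().to(d) for d in devices]
x = torch.randn(2, 3, 32, 32)
# CUDA kernel launches are asynchronous, so the three branches overlap in time.
branch_outputs = [b(x.to(d)) for b, d in zip(branches, devices)]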
Exemplary method
The hybrid operator model parallel training method of the embodiment can be applied to terminal equipment, and the terminal equipment can be a terminal product with a computing function, such as a computer. In this embodiment, as shown in fig. 1, the parallel training method for hybrid operator models specifically includes the following steps S100 to S400:
s100, training nodes are preprocessed, and communication among the training nodes is established.
In this embodiment, the training node is a GPU in a computer, the training node is used as a secondary node, several secondary nodes are combined together to form a primary node (a sub-cluster), each secondary node inside each sub-cluster is used for training each branch model in parallel, and after each sub-cluster finishes training the branch models, model parameters obtained by training are shared among the sub-clusters to complete the training of the models. And the model training is simultaneously carried out among the sub-clusters, so that the training speed of the model is further improved.
Step S100 includes the steps of:
s101, creating a cluster configuration file, wherein the cluster comprises each sub-cluster, and each sub-cluster comprises a plurality of secondary nodes.
In this embodiment, the cluster configuration file includes the addresses of the nodes (training nodes), the number of CPUs and GPUs contained in each node, the size of the memory allocated to each node, and so on. The working node (a node independent of the training nodes) generates a predefined decentralized network topology according to the cluster configuration file; the decentralized network topology records the communication information between every two training nodes, and two training nodes can communicate directly without relying on a third party.
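A hypothetical cluster configuration of the kind described above might look like the following Python dictionary; all field names, addresses and values are illustrative assumptions, not mandated by the patent:

cluster_config = {
    "nodes": [
        {"address": "10.0.0.1", "num_cpus": 16, "num_gpus": 2, "memory_gb": 64},
        {"address": "10.0.0.2", "num_cpus": 16, "num_gpus": 2, "memory_gb": 64},
        {"address": "10.0.0.3", "num_cpus": 16, "num_gpus": 2, "memory_gb": 64},
        {"address": "10.0.0.4", "num_cpus": 16, "num_gpus": 2, "memory_gb": 64},
    ],
    # How many secondary nodes are grouped into one primary node (sub-cluster).
    "secondary_nodes_per_primary": 2,
}
# The working node would read this configuration and derive the predefined decentralized topology from it.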
S102, grading the training nodes.
The GPU or NPU cluster is divided into two levels of nodes through strong-scaling grouping. Each primary node is a sub-cluster composed of several secondary nodes and serves as a working node inside the network (for example, with ten training nodes, each containing a GPU and a CPU, all ten training nodes are secondary nodes; every two training nodes are grouped into one primary node, and each primary node formed from two training nodes is called a sub-cluster). A secondary node is a mixed CPU-GPU/NPU resource group, as shown in FIG. 3, and the system automatically allocates a CPU + GPU resource set according to the communication load and the logical distance; by default, the CPU and GPU resources on the same physical machine are grouped together. The GPU is used for gradient computation (the gradient is an intermediate quantity when optimizing the model: it is the partial derivative of the current round's loss function with respect to the current node parameters, and the neural network back-propagation process computes the gradient of each neuron's parameters and optimizes the parameter values to update the network), while the CPU is used for initialization, inter-node communication, data loading, copying between host memory and device memory, launching the GPU's kernel functions (the GPU is assigned multiple threads to support parallel functions, i.e., a kernel function is a parallel function), and so on. The secondary nodes each store a model slice (branch model) and cooperate to realize multi-way parallelism of the model.
S103, determining the adjacent cluster of the sub-cluster.
A sub-cluster's adjacent cluster is defined in terms of primary nodes (sub-clusters) and secondary nodes: several sub-clusters that can communicate with each other form an adjacent cluster. Taking each sub-cluster as a node, the topology information of adjacent nodes in the network is collected and a communication graph is generated. The nodes of the communication graph are primary nodes; each primary node stores the address information of the n primary nodes physically closest to it (i.e., the farther apart two nodes are, the longer communication between them takes) together with information about the secondary nodes they contain. The distance defaults to the physical machine sequence, and this guides the local gradient updates of the subsequent asynchronous decentralized algorithm.
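A minimal sketch of building such a communication graph, assuming the default case where physical distance is approximated by the position of a primary node in the machine sequence (the node names and the neighbor count n are illustrative):

def build_communication_graph(primary_nodes, n_neighbors=2):
    """For each primary node, keep the n primary nodes closest to it in the machine sequence."""
    graph = {}
    for i, node in enumerate(primary_nodes):
        others = sorted(
            (j for j in range(len(primary_nodes)) if j != i),
            key=lambda j: abs(j - i),   # distance defaults to position in the physical machine sequence
        )
        graph[node] = [primary_nodes[j] for j in others[:n_neighbors]]
    return graph

comm_graph = build_communication_graph(["subcluster-0", "subcluster-1", "subcluster-2", "subcluster-3"])
# e.g. comm_graph["subcluster-0"] == ["subcluster-1", "subcluster-2"]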
And S104, loading the training input data to each sub-cluster.
The data storage node (a node independent of the training nodes) randomly extracts some data slices from the data set through a specific data pipeline (customized according to user requirements) and broadcasts the data slices to each sub-cluster, i.e., training input data is selected from the data storage node and transmitted to each sub-cluster, where it is used to train the model inside the sub-cluster. Inside the sub-cluster, the model-parallel work is handled: CPU resources are bound to initialize the model parameters, the data slices are loaded, and kernel functions are called to distribute the data to each GPU for training, realizing data parallelism.
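A minimal sketch of this data-loading step, assuming the stored data set can be treated as an in-memory tensor and modeling the broadcast as a plain Python list (a real system would use a storage service and collective communication):

import torch

def sample_data_slices(dataset, num_subclusters, slice_size):
    """Randomly extract one data slice per sub-cluster from the data stored on the storage node."""
    slices = []
    for _ in range(num_subclusters):
        idx = torch.randperm(dataset.size(0))[:slice_size]  # random extraction keeps the trained model generalizable
        slices.append(dataset[idx])
    return slices

dataset = torch.randn(10_000, 3, 32, 32)    # stand-in for the stored training data
subcluster_batches = sample_data_slices(dataset, num_subclusters=4, slice_size=256)
# Each element of subcluster_batches would be transmitted to one sub-cluster.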
And S200, dividing the mixed operator model into branch models.
In this embodiment, the overall hybrid operator model is divided into branch models; the division places the hybrid operator model on different training nodes (servers) for operation. Step S200 includes the following steps:
s201, obtaining a convolution layer, a self-attention network and a full-connection network which are covered by the hybrid operator model and are connected in sequence according to the hybrid operator model, wherein the convolution layer comprises a plurality of convolution kernels, the self-attention network comprises a plurality of self-attention layers, and the full-connection network comprises a plurality of full-connection layers.
The hybrid operator model of this embodiment is a deep neural network model, and as shown in fig. 2, the deep neural network model includes a convolutional layer, a self-attention network, and a fully-connected network, which are connected in sequence, where an output end of the convolutional layer is connected to an input end of the self-attention network, and an output end of the self-attention network is connected to an input end of the fully-connected network. The number of the convolution kernels, the self-attention layer and the full-connection layer in each branch model can be one or more than one. In this embodiment, the number of the convolution kernel, the self-attention layer, and the full-connection layer in each branch model is one, and in this embodiment, the convolution layer includes three convolution kernels, the self-attention network also includes three self-attention layers, and the full-connection network also includes three full-connection layers.
In another embodiment, the hybrid operator model is a neural network model that includes only convolutional layers and a self-attention network.
S202, dividing a plurality of convolution kernels of the convolution layer, a plurality of self-attention layers of the self-attention network and a plurality of full-connection layers of the full-connection network into branch models, wherein each branch model comprises the convolution kernels, the self-attention layers and the full-connection layers respectively.
As shown in fig. 2, the solid line box is a branch model, i.e. each branch model includes a convolution kernel, a self-attention layer, and a full-connection layer. A branch model is put on a training node to run.
In one embodiment, the convolution kernel matrix corresponding to the convolutional layer is segmented into a plurality of two-dimensional sub-convolution kernels according to the number of channels; the segmented two-dimensional convolution kernels are the convolution kernels in fig. 2. The convolution kernels are spatially separated according to the number of GPUs in the group (a primary node is a group: the convolution kernels in the convolutional layer are divided into as many groups as there are GPUs in the primary node and then placed on different GPUs, realizing spatial separation), so the convolution and pooling operations are completed using only local channels.
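The channel-wise separation can be illustrated with the following sketch, which assumes the split is taken over the input channels of a single convolutional layer; summing the partial results recovers the original convolution, so the split is mathematically equivalent (the sizes and group count are assumptions):

import torch
import torch.nn as nn

in_ch, out_ch, groups = 12, 24, 3
full_conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)

# Split the kernel tensor (out_ch x in_ch x k x k) and the input into channel groups.
weight_groups = full_conv.weight.chunk(groups, dim=1)    # split along input channels
x = torch.randn(2, in_ch, 32, 32)
x_groups = x.chunk(groups, dim=1)

# Each group keeps the original padding and stride, so the output spatial size stays consistent.
partials = [torch.nn.functional.conv2d(xg, wg, padding=1)
            for xg, wg in zip(x_groups, weight_groups)]
y = sum(partials)    # summing the local-channel results recovers the full convolution
assert torch.allclose(y, full_conv(x), atol=1e-4)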
S300, controlling each training node to run each branch model in parallel to obtain a branch output result of each branch model, wherein the training node is computer hardware used for training the hybrid operator model.
The training nodes run all branch models in parallel, i.e., all branch models run on the training nodes simultaneously. For example, three branch models are trained on three training nodes respectively, which saves the time needed to train the whole hybrid operator model.
Step S300 includes steps S301 to S3012 as follows:
s301, obtaining a training GPU in the training nodes according to the training nodes.
As shown in fig. 4, one circle is a primary node, the primary node further includes a plurality of training GPUs as secondary nodes, and the primary node further includes a CPU for initialization, inter-node communication, data loading, and copying from a host memory to a device memory. The number of training GPUs in each level one node is equal to the number of branch models, so that one training GPU runs one branch model.
S302, controlling the training GPU to capture the training input data from a data storage node, wherein the data storage node is used for storing the training input data.
The training GPU randomly extracts a data set from the data storage nodes through specific data pipeline (pipeline is a data generation processing function) to serve as training input data. The training input data is randomly extracted to ensure that the mixed operator model trained by the data has generalization.
In one embodiment, the captured training input data is not input directly to the convolution kernel. The input data first undergoes Embedding (vector embedding, i.e., converting input data such as pictures or text into a matrix): the original data is multiplied by a parameter matrix to convert it into a matrix of a specified size for subsequent operations, and the result is input to the convolutional layer.
And S303, fixing the operation attribute of each convolution kernel.
In this embodiment, in order to ensure that the convolution kernels in the branch models are consistent, before training, the operation attribute (padding, stride) of the convolution kernels in the branch models is first fixed.
S304, inputting the training input data to each convolution kernel after the operation attribute is fixed, and controlling each training GPU to run each convolution kernel in parallel to obtain an output result of each convolution kernel.
As shown in fig. 2, training data are input into a convolution kernel 1, a convolution kernel 2, and a convolution kernel 3, and then three training GPUs in the same primary node run the three convolution kernels in parallel, so that the three convolution kernels output results respectively.
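A minimal sketch of steps S303 and S304, assuming one convolution kernel (group) per training GPU; fixing padding and stride as described keeps every branch's output dimensions identical (the sizes and the three-GPU layout are assumptions):

import torch
import torch.nn as nn

devices = ["cuda:0", "cuda:1", "cuda:2"]
# Fixed operation attributes (kernel size, stride, padding) keep the branch outputs consistent.
convs = [nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1).to(d) for d in devices]

x = torch.randn(8, 3, 32, 32)    # training input captured from the data storage node
# CUDA kernel launches are asynchronous, so the three convolutions overlap in time.
conv_outputs = [conv(x.to(d)) for conv, d in zip(convs, devices)]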
And S305, multiplying the output result of each convolution kernel by a set number of parameter matrices to obtain a set number of construction matrices, where the set number is the number required by the self-attention layer in each branch model as an input parameter.
The acquired convolutional-layer output components are input into the split self-attention network. The self-attention network first multiplies the convolutional-layer output by three parameter matrices to construct Q, K and V, and then performs the subsequent operations. The splitting divides the three parameter matrices by columns and multiplies each of them with the convolutional-layer output, which, by the properties of matrix multiplication, yields the column components of the sequence in the three dimensions Q, K and V.
S306, inputting the query matrix, the key matrix and the value matrix to each self-attention layer.
The output result of each convolution kernel is multiplied by three different parameter matrices to obtain a query matrix Q (query), a key matrix K (key) and a value matrix V (value). The output result of the convolution kernel is thus converted into the query matrix Q, the key matrix K and the value matrix V to meet the input-format requirements of the self-attention layer; the parameter matrices are designated matrices matched to that layer.
The query matrix, the key matrix and the value matrix take the form of column vectors, and data is input to the self-attention layer in the form of column vectors to meet the self-attention layer's requirements on the form of input data. The query matrix, key matrix and value matrix in column-vector form are obtained as follows: the parameter matrices are first split into column vectors, and the output result of the convolution kernel is then multiplied by the parameter matrices in column-vector form to obtain the query matrix, the key matrix and the value matrix.
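A sketch of this construction for one branch, assuming the three parameter matrices are split by columns so that each branch holds one column block of W_Q, W_K and W_V (all dimensions are illustrative):

import torch

dim, n_branches = 64, 2
conv_out = torch.randn(8, 49, dim)    # a convolution output reshaped into sequence form

W_q, W_k, W_v = (torch.randn(dim, dim) for _ in range(3))   # the three parameter matrices

# Column split: branch i holds one column block of each parameter matrix.
Wq_cols, Wk_cols, Wv_cols = (W.chunk(n_branches, dim=1) for W in (W_q, W_k, W_v))

i = 0                          # the column components held by branch i
Q_i = conv_out @ Wq_cols[i]    # 8 x 49 x (dim / n_branches)
K_i = conv_out @ Wk_cols[i]
V_i = conv_out @ Wv_cols[i]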
S307, performing transposition operation on the key matrix on the training GPU to obtain a transposition matrix of the key matrix;
s308, performing multiplication operation on the query matrix and the transposed matrix of the key matrix on the training GPU to obtain a multiplication matrix;
s309, performing normalization processing on the multiplication matrix on the training GPU to obtain a normalization matrix.
S3010, performing multiplication operation of the self-attention layers on the normalization matrix and the value matrix in each training GPU in parallel to obtain output results of the self-attention layers.
Steps S307 to S3010 obtain the output result Attention(Q, K, V) of the self-attention layer based on the following formulas:

Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k)) · V

softmax(x_i) = exp(x_i) / Σ_{j=1..C} exp(x_j)

where d_k is a constant representing the dimension of the matrix K, K^T is the transpose of K, x_i is the output value of the i-th node, and C is the total number of labels.
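The same computation, written as a plain tensor sketch of steps S307 to S3010 (the shapes are assumptions):

import math
import torch

def self_attention(Q, K, V):
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # transpose the key matrix, multiply, scale
    weights = torch.softmax(scores, dim=-1)             # normalization
    return weights @ V                                  # multiply by the value matrix

Q, K, V = (torch.randn(8, 49, 32) for _ in range(3))
attn_out = self_attention(Q, K, V)    # run independently on each training GPU in the parallel scheme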
S3011, controlling a training GPU in each training node to segment the weight matrix A of each fully-connected layer by rows to obtain the weight row matrices A_1, A_2, ..., A_n.
And S3012, controlling each training GPU to run multiplication of the output result of each self-attention layer on each branch model and the weight row matrix of each full-connection layer in parallel to obtain the output result of each full-connection layer in each branch output result.
To ensure that the whole flow of the hybrid operator is mathematically equivalent, note that, by the rules of matrix multiplication, the sum over the branches of the products of the column components of the tensor X (the result output by each self-attention layer) with the corresponding row components of the tensor A equals the full product X·A:

X·A = X_1·A_1 + X_2·A_2 + ... + X_n·A_n

The fully-connected layer is split according to this rule, so that the hybrid operators form a closed loop in parallel.
The segmented fully-connected-layer weights are multiplied with the upper-layer output as matrices, a dropout function is called over the whole network, and the computation results on each GPU are aggregated and summed, realizing the aggregation of the whole hybrid operator. To eliminate the influence of coefficients and other dimensions, LayerNorm is used to normalize the aggregated result, ensuring its generalization; the parameters are saved and a local gradient is computed to support subsequent operation of the system.
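A sketch of steps S3011 and S3012 together with the aggregation described here, assuming the attention output X is split by columns and the fully-connected weight A by rows so that summing the per-GPU partial products X_i·A_i reproduces X·A (the dropout probability, LayerNorm placement details and all sizes are assumptions):

import torch
import torch.nn as nn

n, d_in, d_out = 2, 64, 64
X = torch.randn(8, 49, d_in)    # full attention output, kept here only to verify equivalence
A = torch.randn(d_in, d_out)    # fully-connected weight matrix

X_cols = X.chunk(n, dim=-1)     # column components, one per branch/GPU
A_rows = A.chunk(n, dim=0)      # weight row matrices A_1, ..., A_n

partials = [Xi @ Ai for Xi, Ai in zip(X_cols, A_rows)]   # in practice computed on separate GPUs
Y = sum(partials)                                        # aggregation of the whole hybrid operator
assert torch.allclose(Y, X @ A, atol=1e-4)

Y = nn.functional.dropout(Y, p=0.1, training=True)       # dropout over the whole network
Y = nn.LayerNorm(d_out)(Y)                               # normalize the aggregated result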
S400, finishing the training of the hybrid operator model according to the branch output result of each branch model.
The step S200 and the step S300 obtain the branch output result of each branch model (i.e. the branch output result on one of the secondary nodes in one primary node), aggregate the output results of each secondary node to obtain the output result of the whole hybrid operator model corresponding to one primary node, and finally aggregate the output results of each primary node to obtain the final output result of the hybrid operator model.
Step S400 includes the following steps:
s401, aggregating branch output results of each branch model on each secondary node to obtain a total output result corresponding to each primary node, wherein the total output result is an output result of the hybrid operator model.
For example, a circle in fig. 4 is a primary node, and averaging the output results of a branch model operated by each secondary node in the primary node is the output result of the entire hybrid operator model operated by the primary node.
S402, determining the primary nodes which are communicated with each other in each primary node, and marking as a node cluster (adjacent sub-cluster).
And S403, sharing the total output result corresponding to each primary node in the node cluster among the primary nodes.
Each primary node acquires the total output result of other primary nodes, so that the sharing of the total output result is realized.
S404, controlling each primary node in the node cluster to calculate an average value of the total output results after sharing, and obtaining a final output result corresponding to each primary node in the node cluster.
When any sub-cluster finishes training, adjacent sub-clusters can be randomly selected from the communication graph to form a set; the parameters of all sub-clusters in the set at the current time point are obtained and averaged, the new averaged parameters are transmitted back to the sub-clusters, losses are calculated using the averaged parameters and the local gradients, the model is updated asynchronously, and the iteration is repeated until the whole network topology reaches global convergence.
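A minimal sketch of steps S402 to S404, assuming each primary node's parameters are held in a dictionary keyed by parameter name and that the exchange between neighbors has already happened (in practice it would go over the network):

import torch

def average_with_neighbors(local_params, neighbor_params_list):
    """Average the current parameters of a node cluster (a primary node and its selected neighbors)."""
    averaged = {}
    for name, value in local_params.items():
        stacked = torch.stack([value] + [p[name] for p in neighbor_params_list])
        averaged[name] = stacked.mean(dim=0)
    return averaged

node_a = {"fc.weight": torch.randn(4, 4)}
node_b = {"fc.weight": torch.randn(4, 4)}
node_c = {"fc.weight": torch.randn(4, 4)}
new_a = average_with_neighbors(node_a, [node_b, node_c])   # node A's averaged (final) parameters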
S405, comparing the final output result corresponding to each primary node in the node cluster with a set output result (namely, a correct result which should be output by the hybrid operator model aiming at training input data) to obtain a comparison result.
S406, training the hybrid operator model according to the comparison result until the training of the hybrid operator model is completed.
In one embodiment, the comparison result is a loss function of the final output result relative to the set output result. In each round of training, the partial derivative of the loss function with respect to the model parameters is computed to obtain a gradient (the loss function here is the current round's loss function, and the model parameters are the model parameters updated after the previous round), and the model parameters are updated before the next round of training according to the magnitude of the gradient.
In another embodiment, the algorithms used to calculate the gradient update include the DGD algorithm, the EXTRA algorithm, the DIGing algorithm and the AD-PSGD algorithm:
DGD algorithm:

x^{k+1} = W·x^k - α_k·∇f(x^k)

where x is the parameter, W is the adjacency matrix of the graph, α_k is a user-defined step size, and ∇f(x^k) is the gradient.
EXTRA algorithm:

x^{k+2} = (I + W)·x^{k+1} - W̃·x^k - α·[∇f(x^{k+1}) - ∇f(x^k)],  with W̃ = (I + W) / 2

where W̃ is the adjacency matrix correction term and the other parameters have the same meanings as above.
The DIGing algorithm:

x^{k+1} = W·x^k - α·y^k
y^{k+1} = W·y^k + ∇f(x^{k+1}) - ∇f(x^k)

The meanings of the parameters are the same as above, with y^k the gradient-tracking variable.
AD-PSGD algorithm: the original model parameters are replaced by the average with a randomly selected neighbor,

x̂_i^k = (x_i^k + x_j^k) / 2

then the loss is calculated and the model is updated:

x_i^{k+1} = x̂_i^k - γ·∇f(x_i^k; ξ^k)
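As an illustration, one DGD iteration over a small node cluster could be sketched as follows, directly following the DGD formula above; the mixing matrix and the gradients are illustrative stand-ins:

import torch

def dgd_step(x, W, grads, alpha_k):
    """x: (num_nodes, dim) stacked node parameters; W: (num_nodes, num_nodes) mixing/adjacency matrix."""
    return W @ x - alpha_k * grads

num_nodes, dim = 4, 8
x = torch.randn(num_nodes, dim)
grads = torch.randn(num_nodes, dim)                       # local gradients, one row per node
W = torch.full((num_nodes, num_nodes), 1.0 / num_nodes)   # a doubly-stochastic mixing matrix
x_next = dgd_step(x, W, grads, alpha_k=0.01)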
in this embodiment, as shown in fig. 5, the overall process from step S100 to step S400 is to perform spatial separation on the convolution and pooling layers according to the GPU strongly-expanded packet, and perform feature extraction on the input tensor only through the local channel. By keeping the original padding and stride parameters, the output dimension of each grouped convolution layer is ensured to be consistent. And directly obtaining the output of each pipeline convolution layer from the attention network layer, and performing tensor multiplication after performing row segmentation on the Q, K and V weights so as to obtain the row component of the non-local features under the attention network. And after parameter aggregation is carried out on the weight matrix of the transversely-split full-connection layer and the upper-layer column component, carrying out only one-time global communication, calculating network loss, and optimizing the network until convergence. By the method, all modules can be aggregated, process communication is eliminated, global multipath parallelism is realized, original n × m times of communication is reduced to n times, and model communication overhead is greatly reduced. By adopting a centerless node topology, each node only shares parameters with specific adjacent nodes, the common convergence of a full cluster is realized, the communication load is balanced to all nodes, the network robustness is improved, and the data security is ensured.
The overall process of model training of the present invention is illustrated below in a specific example:
the framework is a general framework aiming at computer vision tasks, can be effectively applied to mainstream tasks such as image classification, target detection and image segmentation, and is particularly suitable for CvT (visual transformer model of mixed convolution). Specifically, the convolution layer is input as a matrix with a picture size of H × W × C (H, W is the picture length and width, C is the number of channels, C =1 is a grayscale, and C =3 is a color map), firstly, the picture features are extracted through convolution operation, and the picture matrix is stretched into a Sequence form required by an Attention layer, and the dimension is Sequence × token (here, a transducer concept, such a model is mainly used for text processing, each word in a sentence is processed into a vector with a Sequence length, token is the number of words, that is, a sentence is processed into a matrix with a Sequence × token size, for example, a translation of a sentence "i love in machine learning" needs to be disassembled into several words "i am, me, machine, learning" and converted into word vectors with a Sequence length, where token = 4), and the Transformer model obtains functional information such as translation context and the like mainly by means of the relationship between encoded word vectors. For the visual Transformer model, the concept of a word vector is abstracted into a local block of a picture, and the position relation and the characteristics among various human and objects on the picture are obtained by coding the relation among the blocks. The Attention layer outputs a matrix of a defined size (hidden size token), hidden _ size being a parameter of a predefined size with no practical significance, which can be understood as an intermediate variable. The final MLP (linear layer) acts to enhance the output effect of the Attention layer (again to enhance the inter-vector relationship coding), with the same output dimension as the Attention layer.
In summary, the hybrid operator model is first divided into branch models, i.e., the whole model is broken into parts; a branch model is then trained on each training node so that the training nodes train all branch models in parallel at the same time; finally, the output results of the branch models are aggregated to obtain the overall output result of the hybrid operator model, and the model parameters are adjusted according to the overall output result to complete model training. As this analysis shows, parallel training is realized by training one branch model on each training node; parallel training saves training time, which improves training speed and allows the hybrid operator model to converge quickly.
In addition, the invention provides a new framework of continuous parallelism for multi-source hybrid operators. Through spatial separation and transverse/longitudinal matrix segmentation, it realizes, in a mathematically equivalent way, continuous parallelism across multiple operators of different types such as convolution, self-attention and full connection, which greatly reduces communication traffic and ensures high-performance training and inference of models under limited bandwidth.
The invention has good expansibility and can be split or adjusted according to actual conditions.
The invention ensures that the number of parameters is consistent before and after the hybrid operator is split; it only performs equivalent splits of the original basic operators such as weights and convolutions, improving network performance without increasing the amount of computation.
The invention introduces decentralized communication topology, balances the communication load of each node, eliminates the system performance reduction caused by overlarge communication traffic of the central node in the low-bandwidth environment of the traditional parameter server architecture, and improves the robustness of the system.
The combination of decentralized communication with hybrid-operator parallelism offsets the larger number of communication rounds of the decentralized network by reducing the communication volume of the sub-clusters, ensuring the global convergence speed in a low-bandwidth state.
The method separates the whole system data parallel part from the model parallel part, the primary node only executes operations such as data distribution slicing and the like, the secondary node executes local parameter calculation and model aggregation, asynchronous execution of the data parallel part and the model parallel part is realized, and the network efficiency is improved again.
Exemplary devices
The embodiment also provides a hybrid operator model parallel training device, which comprises the following components:
the model segmentation module is used for dividing the mixed operator model into branch models;
the node control module is used for controlling each training node to run each branch model in parallel to obtain a branch output result of each branch model, and the training nodes are computer hardware used for training the hybrid operator model;
and the training module is used for finishing the training of the hybrid operator model according to the branch output result of each branch model.
Based on the above embodiments, the present invention further provides a terminal device, and a schematic block diagram thereof may be as shown in fig. 6. The terminal equipment comprises a processor, a memory, a network interface, a display screen and a temperature sensor which are connected through a system bus. Wherein the processor of the terminal device is configured to provide computing and control capabilities. The memory of the terminal equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operating system and the computer program to run on the non-volatile storage medium. The network interface of the terminal device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a hybrid operator model parallel training method. The display screen of the terminal equipment can be a liquid crystal display screen or an electronic ink display screen, and the temperature sensor of the terminal equipment is arranged in the terminal equipment in advance and used for detecting the operating temperature of the internal equipment.
It will be understood by those skilled in the art that the block diagram of fig. 6 is only a block diagram of a part of the structure related to the solution of the present invention, and does not constitute a limitation to the terminal device to which the solution of the present invention is applied, and a specific terminal device may include more or less components than those shown in the figure, or may combine some components, or have different arrangements of components.
In one embodiment, a terminal device is provided, where the terminal device includes a memory, a processor, and a hybrid operator model parallel training program stored in the memory and executable on the processor, and when the processor executes the hybrid operator model parallel training program, the following operation instructions are implemented:
dividing the mixed operator model into branch models;
controlling each training node to run each branch model in parallel to obtain a branch output result of each branch model, wherein the training node is computer hardware used for training the hybrid operator model;
and finishing the training of the hybrid operator model according to the branch output result of each branch model.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by hardware instructions of a computer program, which may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), rambus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (13)

1. A parallel training method for a hybrid operator model is characterized by comprising the following steps:
dividing the mixed operator model into branch models;
controlling each training node to run each branch model in parallel to obtain a branch output result of each branch model, wherein the training node is computer hardware used for training the hybrid operator model;
and finishing the training of the hybrid operator model according to the branch output result of each branch model.
2. The parallel training method of hybrid operator models according to claim 1, wherein the dividing of the hybrid operator models into branch models comprises:
according to the mixed operator model, sequentially connected convolutional layers, a self-attention network and a fully-connected network covered by the mixed operator model are obtained, wherein the convolutional layers comprise a plurality of convolutional kernels, the self-attention network comprises a plurality of self-attention layers, and the fully-connected network comprises a plurality of fully-connected layers;
dividing a plurality of convolution kernels of the convolution layer, a plurality of self-attention layers of the self-attention network and a plurality of full-connection layers of the full-connection network into branch models, wherein each branch model respectively comprises the convolution kernels, the self-attention layers and the full-connection layers.
3. The method for parallel training of hybrid operator models according to claim 2, wherein the controlling of each training node to run each of the branch models in parallel to obtain a branch output result of each of the branch models, the training node being computer hardware for training the hybrid operator models, comprises:
inputting training input data into the convolution kernels of the branch models, and controlling the training nodes to run the convolution kernels in parallel to obtain output results of the convolution kernels;
controlling each training node to run the self-attention layer of each branch model in parallel according to the output result of each convolution kernel to obtain the output result of each self-attention layer;
and controlling each training node to run the full-connection layer of each branch model in parallel according to the output result of each self-attention layer to obtain the output result of each full-connection layer in each branch output result, wherein the full-connection layer, the self-attention layer and the convolution kernel on one branch model are positioned on the same training node.
4. The parallel training method for the hybrid operator model according to claim 3, wherein the inputting of training input data to the convolution kernels of each of the branch models and controlling each of the training nodes to run each of the convolution kernels in parallel to obtain an output result of each of the convolution kernels comprises:
obtaining a training GPU in the training nodes according to the training nodes;
controlling the training GPU to capture the training input data from data storage nodes, wherein the data storage nodes are used for storing the training input data;
fixing the operation attribute of each convolution kernel;
and inputting the training input data to each convolution kernel after the operation attribute is fixed, and controlling each training GPU to run each convolution kernel in parallel to obtain an output result of each convolution kernel.
5. The parallel training method for hybrid operator models according to claim 3, wherein the controlling each training node to run the self-attention layer of each branch model in parallel according to the output result of each convolution kernel to obtain the output result of each self-attention layer comprises:
multiplying the output result of each convolution kernel by a set number of parameter matrixes to obtain a set number of construction matrixes, wherein the set number is the number of the self-attention layers in each branch model which are required to be used as input parameters;
and inputting the construction matrixes with the set number corresponding to the convolution kernels into the self-attention layers, controlling the training GPUs in the training nodes to run the self-attention layers after the construction matrixes are input in parallel, and obtaining output results of the self-attention layers, wherein one convolution kernel in each convolution kernel and the column component in each self-attention layer are located in the same branch model.
6. The hybrid operator model parallel training method according to claim 5, wherein the inputting the set number of construction matrices corresponding to each convolution kernel into the corresponding self-attention layer, and controlling the training GPU in each training node to run each self-attention layer in parallel after the construction matrices are input, to obtain the output result of each self-attention layer, comprises:
obtaining a query matrix, a key matrix and a value matrix from the construction matrices;
inputting the query matrix, the key matrix and the value matrix into each self-attention layer;
performing a transposition operation on the key matrix on the training GPU to obtain a transposed matrix of the key matrix;
multiplying the query matrix by the transposed matrix of the key matrix on the training GPU to obtain a multiplication matrix;
normalizing the multiplication matrix on the training GPU to obtain a normalized matrix;
and multiplying the normalized matrix by the value matrix in each self-attention layer on each training GPU in parallel, to obtain the output result of each self-attention layer.
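The per-GPU arithmetic of claim 6 as a minimal sketch; softmax is used for the normalization step and the scaling factor is a common addition, neither of which is fixed by the claim.

```python
import torch
import torch.nn.functional as F

def self_attention_on_gpu(q, k, v):
    """Claim-6 style self-attention on one training GPU:
    transpose K, multiply with Q, normalize, then multiply with V."""
    k_t = k.transpose(-2, -1)              # transposed matrix of the key matrix
    scores = q @ k_t                       # multiplication matrix Q x K^T
    scores = scores / (q.size(-1) ** 0.5)  # scaling (common practice, not in the claim)
    weights = F.softmax(scores, dim=-1)    # normalized matrix
    return weights @ v                     # output result of the self-attention layer
```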
7. The hybrid operator model parallel training method according to claim 3, wherein the controlling each training node to run the fully-connected layer of each branch model in parallel according to the output result of each self-attention layer to obtain the output result of each fully-connected layer in each branch output result, the fully-connected layer, the self-attention layer and the convolution kernel of one branch model being located on the same training node, comprises:
controlling the training GPU in each training node to split the weight matrix of each fully-connected layer by rows to obtain weight row matrices;
and controlling each training GPU to multiply, in parallel, the output result of the self-attention layer of each branch model by the weight row matrices of the corresponding fully-connected layer, to obtain the output result of each fully-connected layer in each branch output result.
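One possible reading of the row split in claim 7, assuming the weight matrix is stored as (out_features, in_features) in the nn.Linear convention; the number of row blocks and the final concatenation are assumptions.

```python
import torch

def row_split_fc(attn_out, weight, num_blocks=4):
    """Split the fully-connected layer's weight matrix into row blocks and multiply
    the self-attention output with each block.

    attn_out: (batch, in_features) pooled self-attention output.
    weight:   (out_features, in_features), the nn.Linear storage convention.
    """
    row_blocks = torch.chunk(weight, num_blocks, dim=0)       # weight row matrices
    partial = [attn_out @ block.t() for block in row_blocks]  # per-block products
    return torch.cat(partial, dim=-1)                         # output result of the FC layer
```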
8. The hybrid operator model parallel training method according to claim 1, wherein each training node is a secondary node, a set number of secondary nodes are located in the same primary node, and the completing the training of the hybrid operator model according to the branch output result of each branch model comprises:
aggregating the branch output results of the branch models located on the secondary nodes to obtain a total output result corresponding to each primary node, wherein the total output result is the output result of the hybrid operator model;
and completing the training of the hybrid operator model according to the total output result corresponding to each primary node.
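A minimal sketch of the aggregation in claim 8, assuming the branch outputs produced on one primary node's secondary nodes are combined by summation on the primary node's device; the claim itself only says the results are aggregated.

```python
import torch

def aggregate_branch_outputs(branch_outputs, primary_device="cuda:0"):
    """Gather the branch output results produced on the secondary nodes of one
    primary node and combine them into that primary node's total output result."""
    gathered = [out.to(primary_device) for out in branch_outputs]  # move to the primary node
    return torch.stack(gathered, dim=0).sum(dim=0)                 # total output result (sum)
```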
9. The hybrid operator model parallel training method according to claim 8, wherein the completing the training of the hybrid operator model according to the total output result corresponding to each primary node comprises:
determining the primary nodes that communicate with one another and recording them as a node cluster;
sharing the total output result corresponding to each primary node in the node cluster among the primary nodes;
controlling each primary node in the node cluster to calculate the average value of the shared total output results, to obtain a final output result corresponding to each primary node in the node cluster;
comparing the final output result corresponding to each primary node in the node cluster with a set output result to obtain a comparison result;
and training the hybrid operator model according to the comparison result until the training of the hybrid operator model is completed.
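A minimal sketch of the node-cluster step of claim 9 using torch.distributed, assuming one process per primary node and an already initialised process group; mean-squared error stands in for the comparison with the set output result, and the backward pass that adjusts the model parameters is left outside the sketch.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

@torch.no_grad()
def cluster_compare(total_output, set_output):
    """Share the total output results inside a node cluster, average them, and
    compare the final output result with the set output result."""
    shared = total_output.clone()
    dist.all_reduce(shared, op=dist.ReduceOp.SUM)  # every primary node shares its total output
    final_output = shared / dist.get_world_size()  # average -> final output result
    return F.mse_loss(final_output, set_output)    # comparison result that drives training
```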
10. The hybrid operator model parallel training method of claim 1, wherein the hybrid operator model is an image classification model.
11. A hybrid operator model parallel training device, characterized by comprising:
a model segmentation module, used for dividing a hybrid operator model into branch models;
a node control module, used for controlling each training node to run each branch model in parallel to obtain a branch output result of each branch model, the training nodes being computer hardware used for training the hybrid operator model;
and a training module, used for completing the training of the hybrid operator model according to the branch output result of each branch model.
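A skeletal sketch of the device of claim 11 as three cooperating components; the class and attribute names are illustrative.

```python
class HybridOperatorTrainingDevice:
    """The three modules of claim 11, sketched as plain attributes."""

    def __init__(self, model_splitter, node_controller, trainer):
        self.model_segmentation_module = model_splitter  # divides the hybrid operator model into branch models
        self.node_control_module = node_controller       # runs each branch model in parallel on its training node
        self.training_module = trainer                   # completes training from the branch output results
```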
12. A terminal device, characterized in that the terminal device comprises a memory, a processor, and a hybrid operator model parallel training program stored in the memory and operable on the processor, wherein the processor implements the steps of the hybrid operator model parallel training method according to any one of claims 1 to 10 when executing the hybrid operator model parallel training program.
13. A computer-readable storage medium having stored thereon a hybrid operator model parallel training program which, when executed by a processor, implements the steps of the hybrid operator model parallel training method according to any one of claims 1 to 10.
CN202211143025.8A (priority date 2022-09-20, filing date 2022-09-20) Hybrid operator model parallel training method, device, equipment and storage medium. Published as CN115481729A (en); legal status: Pending.

Priority Applications (1)

Application Number: CN202211143025.8A; Priority Date: 2022-09-20; Filing Date: 2022-09-20; Title: Hybrid operator model parallel training method, device, equipment and storage medium

Publications (1)

Publication Number: CN115481729A; Publication Date: 2022-12-16

Family ID: 84424132

Family Applications (1)

Application Number: CN202211143025.8A; Title: Hybrid operator model parallel training method, device, equipment and storage medium; Priority Date: 2022-09-20; Filing Date: 2022-09-20

Country Status (1)

CN: CN115481729A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115879504A (en) * 2022-12-30 2023-03-31 珠海市欧冶半导体有限公司 Device and method for splitting and quantizing layerorm operator
CN115879504B (en) * 2022-12-30 2023-08-29 珠海市欧冶半导体有限公司 Device and method for splitting and quantizing layerrnorm operator
CN115879543A (en) * 2023-03-03 2023-03-31 浪潮电子信息产业股份有限公司 Model training method, device, equipment, medium and system
CN115879543B (en) * 2023-03-03 2023-05-05 浪潮电子信息产业股份有限公司 Model training method, device, equipment, medium and system
CN116306796A (en) * 2023-05-17 2023-06-23 北京智源人工智能研究院 Model self-growth training acceleration method and device, electronic equipment and storage medium
CN116306796B (en) * 2023-05-17 2023-09-15 北京智源人工智能研究院 Model self-growth training acceleration method and device, electronic equipment and storage medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination