CN115994590A - Data processing method, system, equipment and storage medium based on distributed cluster - Google Patents

Data processing method, system, equipment and storage medium based on distributed cluster

Info

Publication number
CN115994590A
CN115994590A
Authority
CN
China
Prior art keywords
matrix
momentum
order
training
order momentum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310288301.8A
Other languages
Chinese (zh)
Other versions
CN115994590B (en)
Inventor
郭振华
邱志勇
赵雅倩
李仁刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN202310288301.8A priority Critical patent/CN115994590B/en
Publication of CN115994590A publication Critical patent/CN115994590A/en
Application granted granted Critical
Publication of CN115994590B publication Critical patent/CN115994590B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a data processing method, system, device and storage medium based on a distributed cluster, applied in the technical field of machine learning, comprising the following steps: acquire 1 batch of training samples and train the device's own deep learning model; determine a first-order momentum matrix and a second-order momentum matrix of the gradient data of the deep learning model; send the first-order and second-order momentum matrices to the target edge domain device, so that the target edge domain device determines a momentum parameter ratio matrix from the data sent by each of the n terminal devices and compresses that matrix; receive the compressed matrix, restore its size, update the parameters of the deep learning model through a first-order optimization algorithm based on the restoration result, and return to training until model training ends; then input the data to be identified into the trained deep learning model to obtain the identification result of the data to be identified. Applying this scheme improves the convergence speed of distributed deep learning training and reduces training time.

Description

Data processing method, system, equipment and storage medium based on distributed cluster
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data processing method, system, device and storage medium based on a distributed cluster.
Background
At present, artificial intelligence and new-generation information technologies such as 5G, cloud computing and edge computing reinforce one another, driving intelligent transformation of production, daily life and social governance. As the scenarios in which artificial intelligence lands grow increasingly complex, a growing share of AI workloads runs as cross-domain distributed computation across cloud, edge and terminal devices.
Deep learning models are widely used, for example for plant species recognition on mobile phones and for speech recognition and conversion to text. The initial training of a deep learning model deployed on terminal devices requires enormous computing power; a single terminal device has neither sufficient computing capability nor sufficient training data, so the most common solution is to combine all terminal devices in the edge domain for distributed training, jointly updating the model parameters until the deep learning model is trained.
The conventional cross-domain distributed optimization algorithm is usually SGD (Stochastic Gradient Descent), a simple yet very effective method whose convergence, however, is slow.
In addition, when edge domain devices collaborate on model training, they face limited data transmission bandwidth and a heavy communication load, which is unfavorable to efficient training.
In summary, how to perform distributed training of deep learning models so as to increase convergence speed and training speed is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a data processing method, system, device and storage medium based on a distributed cluster, so as to perform distributed training of a deep learning model with improved convergence speed and training speed.
In order to solve the technical problems, the invention provides the following technical scheme:
the data processing method based on the distributed cluster is applied to each terminal device in the distributed cluster, n terminal devices are all in communication connection with target edge domain devices, n is a positive integer, and the data processing method based on the distributed cluster comprises the following steps:
acquiring training samples of 1 batch, training a self deep learning model, and determining gradient data of the deep learning model;
determining a first-order momentum matrix and a second-order momentum matrix of the gradient data according to the gradient data of the deep learning model;
The first-order momentum matrix and the second-order momentum matrix are sent to the target edge domain equipment, so that the target edge domain equipment determines an average value of n first-order momentum matrices and an average value of n second-order momentum matrices according to received data sent by n terminal equipment, and determines a momentum parameter ratio matrix of a preset first-order optimization algorithm based on the determined average value of n first-order momentum matrices and the determined average value of n second-order momentum matrices;
after the target edge domain equipment compresses the momentum parameter ratio matrix, receiving the compressed matrix sent by the target edge domain equipment, recovering the size of the matrix, updating parameters of the deep learning model through the first-order optimization algorithm based on a recovery result, and returning to execute the operation of acquiring 1 batch of training samples and training the deep learning model until model training is finished;
and inputting the data to be identified into the deep learning model after training is completed, and obtaining an identification result of the data to be identified.
Preferably, after determining the first order momentum matrix of the gradient data, the method further comprises:
Judging whether the difference between the first-order momentum matrix determined based on the training samples of the batch and the first-order momentum matrix determined based on the training samples of the previous batch is smaller than a preset first value or not;
if so, the first-order momentum matrix determined based on the training samples of the batch is saved, and the saved first-order momentum matrix is used as the first-order momentum matrix of the determined gradient data for each subsequent training batch.
Preferably, after determining the second order momentum matrix of the gradient data, the method further comprises:
judging whether the difference between the second-order momentum matrix determined based on the training samples of the batch and the second-order momentum matrix determined based on the training samples of the previous batch is smaller than a preset second value or not;
if so, the second-order momentum matrix determined based on the training samples of the batch is saved, and the saved second-order momentum matrix is used as the second-order momentum matrix of the determined gradient data for each subsequent training batch.
Preferably, after determining the first-order momentum matrix and the second-order momentum matrix of the gradient data, the method further includes:
Performing bias correction on the first-order momentum matrix and the second-order momentum matrix;
correspondingly, the sending the first order momentum matrix and the second order momentum matrix to the target edge domain device includes:
and transmitting the bias-corrected first-order momentum matrix and the bias-corrected second-order momentum matrix to the target edge domain device.
Preferably, after determining the first-order momentum matrix and the second-order momentum matrix of the gradient data, the method further includes:
performing cholesky decomposition on the first-order momentum matrix;
correspondingly, sending the first-order momentum matrix to the target edge domain device comprises the following steps:
sending M or M^T to the target edge domain device, so that the target edge domain device recovers the first-order momentum matrix from the received data;
where M denotes the triangular matrix obtained after Cholesky decomposition of the first-order momentum matrix, and M^T denotes the transpose of the triangular matrix M.
Preferably, after determining the first-order momentum matrix and the second-order momentum matrix of the gradient data, the method further includes:
performing cholesky decomposition on the second-order momentum matrix;
correspondingly, sending the second order momentum matrix to the target edge domain device comprises:
sending V or V^T to the target edge domain device, so that the target edge domain device recovers the second-order momentum matrix from the received data;
where V denotes the triangular matrix obtained after Cholesky decomposition of the second-order momentum matrix, and V^T denotes the transpose of the triangular matrix V.
Preferably, the target edge domain device performs matrix compression on the momentum parameter ratio matrix, including:
and the target edge domain equipment carries out matrix quantization compression on the momentum parameter ratio matrix.
Preferably, the target edge domain device performs matrix quantization compression on the momentum parameter ratio matrix, including:
the target edge domain device performs quantization compression on the a×b momentum parameter ratio matrix by averaging each adjacent k data, obtaining a compressed matrix of size $\frac{a}{\sqrt{k}} \times \frac{b}{\sqrt{k}}$.
Preferably, the target edge domain device determines a momentum parameter ratio matrix of a preset first-order optimization algorithm based on the determined average value of n first-order momentum matrices and the determined average value of n second-order momentum matrices, and the method comprises the following steps:
the target edge domain device is based on the determined average value of n first order momentum matrixes And the average of n second order momentum matrices according to
Figure SMS_2
In the method, a momentum parameter ratio matrix of a preset first-order optimization algorithm is determinedr t
wherein ,
Figure SMS_3
representing the average value of n first order momentum matrixes determined by the target edge domain device,/->
Figure SMS_4
The average value of n second-order momentum matrixes determined by the target edge domain equipment is represented, epsilon is a preset third parameter, and epsilon is more than 0.
Preferably, based on the recovery result, updating parameters of the deep learning model by the first-order optimization algorithm includes:
by passing through
Figure SMS_5
Updating parameters of the deep learning model;
wherein ,x t indicating that the first step is performedtParameters of the deep learning model after secondary training,x t+1 indicating that the first step is performedtParameters of the deep learning model after +1 training,ηthe rate of learning is shown as a function of the learning rate,λrepresenting the attenuation coefficient asr t And (2) representing the recovery result obtained after receiving the compressed matrix sent by the target edge domain device and recovering the size of the matrix.
Preferably, determining a first order momentum matrix of the gradient data according to the gradient data of the deep learning model includes:
determining the first-order momentum matrix of the gradient data from the gradient data of the deep learning model by

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

where β_1 denotes a preset first parameter, m_t denotes the first-order momentum matrix of the gradient data determined at the t-th training, m_{t-1} denotes the first-order momentum matrix of the gradient data determined at the (t-1)-th training, g_t denotes the gradient data of the deep learning model determined at the t-th training, and t denotes the number of trainings.
Preferably, determining a second order momentum matrix of the gradient data according to the gradient data of the deep learning model includes:
determining the second-order momentum matrix of the gradient data from the gradient data of the deep learning model by

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)(g_t - m_t)^2$$

where β_2 denotes a preset second parameter, m_t denotes the first-order momentum matrix of the gradient data determined at the t-th training, v_t denotes the second-order momentum matrix of the gradient data determined at the t-th training, v_{t-1} denotes the second-order momentum matrix of the gradient data determined at the (t-1)-th training, g_t denotes the gradient data of the deep learning model determined at the t-th training, and t denotes the number of trainings.
Preferably, the inputting the data to be identified into the trained deep learning model, and obtaining the identification result of the data to be identified, includes:
and inputting the data to be identified into the deep learning model after training, and performing computer image identification, natural language identification or pattern identification to obtain an identification result of the data to be identified.
Preferably, the triggering condition for ending the model training is as follows:
the deep learning model converges and/or the number of trainings of the deep learning model reaches a set count threshold.
Preferably, the method further comprises:
and outputting fault prompt information when the communication connection with the target edge domain equipment is lost.
Preferably, the method further comprises:
and counting the time consumption of communication with the target edge domain equipment.
Preferably, the deep learning model used by the terminal device is a deep learning model sent by a data center and subjected to model pre-training.
A data processing system based on a distributed cluster, applied to each terminal device in the distributed cluster, where n terminal devices are all in communication connection with a target edge domain device, and n is a positive integer, the data processing system based on the distributed cluster includes:
The gradient data determining module is used for acquiring 1 batch of training samples and training a self deep learning model to determine gradient data of the deep learning model;
the momentum determining module is used for determining a first-order momentum matrix and a second-order momentum matrix of the gradient data according to the gradient data of the deep learning model;
the momentum transmitting module is used for transmitting the first-order momentum matrix and the second-order momentum matrix to the target edge domain equipment, so that the target edge domain equipment determines an average value of n first-order momentum matrices and an average value of n second-order momentum matrices according to received data transmitted by n terminal equipment respectively, and determines a momentum parameter ratio matrix of a preset first-order optimization algorithm based on the determined average value of n first-order momentum matrices and the determined average value of n second-order momentum matrices;
the parameter updating module is used for, after the target edge domain device compresses the momentum parameter ratio matrix, receiving the compressed matrix sent by the target edge domain device and recovering the size of the matrix, updating the parameters of the deep learning model through the first-order optimization algorithm based on the recovery result, and triggering the gradient data determining module until model training is finished;
And the execution module is used for inputting the data to be identified into the deep learning model after training is completed, and obtaining the identification result of the data to be identified.
A data processing apparatus based on a distributed cluster, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the distributed cluster based data processing method as described above.
A computer readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the steps of a distributed cluster based data processing method as described above.
The technical scheme provided by the embodiment of the invention is applied to each terminal device in the distributed cluster, with n terminal devices all communicatively connected to the target edge domain device. During training, each terminal device can acquire 1 batch of training samples and train its own deep learning model to determine the gradient data of the deep learning model; compared with the traditional SGD (stochastic gradient descent) scheme, the application adopts a first-order optimization algorithm based on a first-order momentum matrix and a second-order momentum matrix, which improves the convergence rate. From the gradient data of the deep learning model, a first-order momentum matrix and a second-order momentum matrix of the gradient data can be determined and sent to the target edge domain device; the target edge domain device can then determine the average of the n first-order momentum matrices and the average of the n second-order momentum matrices from the data sent by each of the n terminal devices, and determine the momentum parameter ratio matrix of the preset first-order optimization algorithm based on these determined averages.
Considering that the momentum parameter ratio matrix is large in scale, the target edge domain equipment compresses the momentum parameter ratio matrix and then sends the matrix to each terminal equipment, the terminal equipment can restore the size of the matrix after receiving the compressed matrix sent by the target edge domain equipment, and further, based on the restored matrix, the parameter updating of the deep learning model is carried out through a first-order optimization algorithm, and the next round of training is continued until model training is finished. As can be seen, the target edge domain device performs matrix compression on the momentum parameter ratio matrix and then transmits the momentum parameter ratio matrix, so that the data volume required to be transmitted in communication is reduced, and the training time consumption is reduced.
In conclusion, the scheme of the application can perform distributed training of the deep learning model, improves convergence speed and reduces training time.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an embodiment of a data processing method based on distributed clusters in the present invention;
FIG. 2 is a schematic diagram of a device architecture in one embodiment;
FIG. 3 is a schematic diagram of the quantized compression of a matrix in one embodiment;
FIG. 4 is a schematic diagram of a momentum parameter ratio matrix to be compressed according to an embodiment;
FIG. 5 is a schematic diagram of a distributed cluster-based data processing system according to the present invention;
FIG. 6 is a schematic diagram of a distributed cluster-based data processing apparatus according to the present invention;
fig. 7 is a schematic diagram of a computer readable storage medium according to the present invention.
Detailed Description
The core of the invention is to provide a data processing method based on a distributed cluster, which can perform distributed training of a deep learning model, improves convergence speed and reduces training time consumption.
In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating an implementation of a data processing method based on a distributed cluster in the present invention, where the data processing method based on a distributed cluster may be applied to each terminal device in a distributed cluster, and n terminal devices are all communicatively connected to a target edge domain device, n is a positive integer, and the data processing method based on a distributed cluster includes the following steps:
step S101: and acquiring training samples of 1 batch, training a self deep learning model, and determining gradient data of the deep learning model.
Specifically, in the solution of the present application, the distributed training of the deep learning model is performed, so that the solution of the present application needs to be applied to each terminal device under the target edge domain device, that is, each terminal device under the target edge domain device has a respective deep learning model. The terminal devices in the distributed cluster may also be referred to as cross-domain heterogeneous devices, where the cross-domain refers to cross-geographic domain, and the heterogeneous devices refer to terminal devices that may include various different forms, such as mobile phones, cameras, personal computers, and other terminal devices with certain computing capabilities.
Furthermore, it should be noted that in the distributed cluster, 1 or more target edge domain devices may be provided, and for any 1 target edge domain device, the target edge domain device may be communicatively connected to one or more terminal devices. For example, fig. 2 is a schematic device architecture in one embodiment, where the target edge domain device 1 and the target edge domain device 2 are both communicatively connected to a data center, and the target edge domain device 1 and the target edge domain device 2 are both connected to a plurality of terminal devices.
In practical applications, the tasks performed by the deep learning models in different edge domains may differ, so data interaction between different target edge domain devices is generally not required. In practice, the deep learning model is first trained on a given data set using the computing power of the data center or of the edge domain; after a certain number of trainings, it is deployed to each terminal device. Of course, at that point the accuracy of the deep learning model is not yet sufficient, and further training on the training data collected by the terminal devices is required to meet the accuracy requirement.
That is, in a specific embodiment of the present invention, the deep learning model used by the terminal device is a deep learning model sent by the data center and subjected to model pre-training.
The specific training algorithm of model pre-training can be set and adjusted according to actual needs. In the embodiment, each terminal device can receive the deep learning model which is sent by the data center and is subjected to model pre-training, so that the time consumption of overall training is reduced, namely, each terminal device does not need to perform model training from beginning.
For any 1 terminal device, each time step S101 is triggered, the terminal device may acquire 1 batch of training samples and perform training of its own deep learning model, so as to obtain a local parameter gradient. The loss function may, for example, specifically employ a cross entropy loss function when determining gradient data.
In the training samples of 1 batch, the specific content of the training data can be set and adjusted according to the needs, for example, in some occasions, the deep learning model is used for image recognition, and when training, a plurality of training images are set in the training samples of 1 batch. When training the self deep learning model, the local parameter gradient can be obtained through forward calculation and reverse calculation.
In addition, the specific type of the deep learning model may be various, for example, the deep learning model for performing image recognition may be specifically used for performing plant species recognition, for example, the deep learning model for performing face recognition, the deep learning model for performing data classification recognition, the deep learning model for performing semantic analysis recognition, and the like.
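As an illustration of step S101, the following is a minimal sketch of one local training step on a terminal device. The patent does not prescribe a framework, so PyTorch is assumed here, and the function and variable names (compute_batch_gradients, batch_inputs, batch_labels) are illustrative, not from the original text.

```python
import torch
import torch.nn.functional as F

def compute_batch_gradients(model, batch_inputs, batch_labels):
    # Forward computation on 1 batch of training samples
    model.zero_grad()
    logits = model(batch_inputs)
    # Cross-entropy loss, as suggested in the text above
    loss = F.cross_entropy(logits, batch_labels)
    # Reverse computation yields the local parameter gradients
    loss.backward()
    # Collect the local parameter gradients as the gradient data g_t
    return torch.cat([p.grad.detach().flatten()
                      for p in model.parameters() if p.grad is not None])
```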
Step S102: and determining a first-order momentum matrix and a second-order momentum matrix of the gradient data according to the gradient data of the deep learning model.
Because the first-order optimization algorithm based on the first-order momentum matrix and the second-order momentum matrix is adopted, the first-order momentum matrix and the second-order momentum matrix of the gradient data are determined according to the gradient data of the deep learning model determined in the step S101, and the specific calculation mode can be set and adjusted according to actual needs.
For example, in one embodiment of the present invention, determining a first order momentum matrix of gradient data according to gradient data of a deep learning model may specifically include:
from gradient data of deep learning model bym t =β 1 m t-1 +(1-β 1g t Determining a first order momentum matrix of the gradient data;
wherein ,β 1 indicated is a preset first parameter, typically less than 1 and greater than 0.m t Indicating that the first step is performedtA first order momentum matrix of the gradient data determined during the training,m t-1 indicating that the first step is performedtA first order momentum matrix of the gradient data determined at training time 1,g t indicating that the first step is performedtGradient data of the deep learning model determined during the secondary training, tThe number of training is shown.
In the practical application of the present invention,m 0 typically set to 0, i.em 0 Set to a matrix of all 0's, e.g. by training 1 st timem 1 =β 1 m 0 +(1-β 1g 1 Determining a first order momentum matrix of gradient datam 1 I.e.m 1 Is a first order momentum matrix determined based on the training samples of lot 1. Also, according to the above relation, each time training is performed, a corresponding first order momentum matrix can be determined.
It can be seen that in this embodiment, the manner of determining the first order momentum matrix of the gradient data is relatively simple and convenient.
In one embodiment of the present invention, determining a second order momentum matrix of gradient data according to gradient data of a deep learning model may include:
from gradient data of deep learning model byv t =β 2 v t-1 +(1-β 2 )(g t -m t 2 Determining a second-order momentum matrix of the gradient data;
wherein ,β 2 indicated is a second parameter that is preset, typically less than 1 and greater than 0.m t Indicating that the first step is performedtA first order momentum matrix of the gradient data determined during the training,v t indicating that the first step is performedtA second order momentum matrix of the gradient data determined during the secondary training,v t-1 indicating that the first step is performedtA second order momentum matrix of the gradient data determined at training time 1, g t Indicating that the first step is performedtGradient data of the deep learning model determined during the secondary training,tthe number of training is shown.
In this embodiment, a specific manner of determining the second order momentum matrix of the gradient data is described, namely byv t =β 2 v t-1 +(1-β 2 )(g t -m t 2 And determining a second order momentum matrix of the gradient data. In a general scheme, the first order momentum matrix is not considered when determining the second order momentum matrixm t The design of the implementation mode is beneficial to improving the accuracy of the determined second-order momentum matrix, and is simple and convenient in calculation, namely the complexity of calculation is not excessively increased.
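To make the two momentum updates concrete, the following is a minimal Python/NumPy sketch of the formulas above; the default β values shown are illustrative, not prescribed by the text.

```python
import numpy as np

def update_momenta(g_t, m_prev, v_prev, beta1=0.9, beta2=0.999):
    # First-order momentum: m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
    m_t = beta1 * m_prev + (1.0 - beta1) * g_t
    # Second-order momentum per this embodiment:
    # v_t = beta2 * v_{t-1} + (1 - beta2) * (g_t - m_t)^2
    v_t = beta2 * v_prev + (1.0 - beta2) * (g_t - m_t) ** 2
    return m_t, v_t

# m_0 and v_0 start as all-zero matrices, as described above
g_1 = np.random.randn(4, 4)
m_1, v_1 = update_momenta(g_1, np.zeros((4, 4)), np.zeros((4, 4)))
```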
Step S103: and transmitting the first-order momentum matrix and the second-order momentum matrix to target edge domain equipment, so that the target edge domain equipment determines an average value of n first-order momentum matrices and an average value of n second-order momentum matrices according to the received data transmitted by n terminal equipment, and determines a momentum parameter ratio matrix of a preset first-order optimization algorithm based on the determined average value of n first-order momentum matrices and the determined average value of n second-order momentum matrices.
Any 1 of the n terminal devices can send the determined first-order momentum matrix and the second-order momentum matrix to the target edge domain device, so that the target edge domain device can receive data sent by each of the n terminal devices during each iteration, namely, n first-order momentum matrices and n second-order momentum matrices are received.
The target edge domain device can determine the average value of n first-order momentum matrixes and the average value of n second-order momentum matrixes, and further determine a momentum parameter ratio matrix of a preset first-order optimization algorithm based on the determined average value of n first-order momentum matrixes and the determined average value of n second-order momentum matrixes.
For example, in one embodiment of the present invention, the determining, by the target edge domain device, a momentum parameter ratio matrix of a preset first-order optimization algorithm based on the determined average value of n first-order momentum matrices and the determined average value of n second-order momentum matrices may specifically include:
the target edge domain device, based on the determined average of the n first-order momentum matrices and the average of the n second-order momentum matrices, determines the momentum parameter ratio matrix r_t of the preset first-order optimization algorithm according to

$$r_t = \frac{\bar{m}_t}{\sqrt{\bar{v}_t} + \epsilon}$$

where $\bar{m}_t$ denotes the average of the n first-order momentum matrices determined by the target edge domain device, $\bar{v}_t$ denotes the average of the n second-order momentum matrices determined by the target edge domain device, ε is a preset third parameter, and ε > 0.
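A sketch of the edge-device side of this computation, assuming the n momentum matrices have already been received (NumPy; the ε default is an assumption, not a value from the patent):

```python
import numpy as np

def momentum_ratio(m_list, v_list, eps=1e-8):
    # Average the n first-order and n second-order momentum matrices
    m_bar = np.mean(m_list, axis=0)
    v_bar = np.mean(v_list, axis=0)
    # r_t = m_bar / (sqrt(v_bar) + eps), computed elementwise
    return m_bar / (np.sqrt(v_bar) + eps)
```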
Step S104: after the target edge domain device compresses the momentum parameter ratio matrix, receiving the compressed matrix sent by the target edge domain device, recovering the size of the matrix, updating parameters of the deep learning model through a first-order optimization algorithm based on a recovery result, and returning to execute the operation of the step S101 until model training is finished.
After the target edge domain device determines the momentum parameter ratio matrix r_t of the preset first-order optimization algorithm, it does not directly send r_t to the n terminal devices; it first performs matrix compression, thereby reducing the amount of data to be transmitted.
The specific implementation manner of matrix compression can be set and adjusted according to actual needs, for example, in one specific implementation manner of the present invention, the target edge domain device performs matrix compression on the momentum parameter ratio matrix, which may specifically include:
and the target edge domain equipment carries out matrix quantization compression on the momentum parameter ratio matrix.
According to the embodiment, the matrix is quantized and compressed, the calculation is convenient, and the scale of the matrix can be effectively reduced, so that the target edge domain equipment performs matrix quantization and compression on the momentum parameter ratio matrix.
Of course, there may be various specific quantization compression rules, for example, in an embodiment of the present invention, the target edge domain device performs matrix quantization compression on the momentum parameter ratio matrix, which may specifically include:
the target edge domain device performs quantization compression on the a×b momentum parameter ratio matrix by averaging each adjacent k data, obtaining a compressed matrix of size $\frac{a}{\sqrt{k}} \times \frac{b}{\sqrt{k}}$.
Referring to fig. 3, a schematic diagram of quantization compression of a matrix in one embodiment is shown. In this embodiment, the a×b momentum parameter ratio matrix is quantized and compressed, by averaging each adjacent k data, into a compressed matrix of size $\frac{a}{\sqrt{k}} \times \frac{b}{\sqrt{k}}$, i.e., a matrix whose size is only one k-th of that of the original momentum parameter ratio matrix.
In fig. 3, the matrix before quantization compression is a momentum parameter ratio matrix of size 4×4; in practical applications the momentum parameter ratio matrix may be far larger, and fig. 3 uses a 4×4 matrix only as an illustration. After quantization compression, a matrix of size 2×2 is obtained; it can be seen that the compressed matrix is only one quarter of the size of the original momentum parameter ratio matrix, i.e., in the example of fig. 3, k = 4 and $\sqrt{k} = 2$, so that each element of the compressed matrix is the average of the corresponding adjacent 2×2 block of the original matrix, and the same holds for each of the remaining elements. In other embodiments, the value of k may be adaptively adjusted according to the specific scale of the momentum parameter ratio matrix.
In other embodiments, other quantization compression rules are possible. Take fig. 4 as an example; fig. 4 is a schematic structural diagram of the momentum parameter ratio matrix to be compressed in one embodiment. Averaging a_1, a_2, a_3 and a_4 of fig. 4 yields the 1st value of row 1 of the compressed matrix; averaging a_2, b_1, a_4 and b_3 yields the 2nd value of row 1; averaging b_2, b_1, b_3 and b_4 yields the 3rd value of row 1; and averaging b_2, e_1, e_2 and b_4 yields the 4th value of row 1.
Similarly, averaging c_1, c_2, c_3 and c_4 of fig. 4 yields the 1st value of row 2 of the compressed matrix; averaging d_1, c_2, d_3 and c_4 yields the 2nd value of row 2; averaging d_2, d_1, d_3 and d_4 yields the 3rd value of row 2; and averaging d_2, e_3, e_4 and d_4 yields the 4th value of row 2.
In the example of fig. 4, the original matrix has 5 columns, which is not divisible by 2, so in this example the column-direction data is selected by a sliding window, i.e., some data is multiplexed. The number of rows is 4, which is divisible by 2, so the row-direction data need not be selected repeatedly, i.e., the division in the row direction follows the same principle as in fig. 3.
It can be understood that if data is multiplexed, the compression rate decreases, which is unfavorable for reducing the size of the compressed matrix. Therefore, in practical application, for an a×b momentum parameter ratio matrix in which both a and b are divisible by $\sqrt{k}$, the embodiment above that quantizes and compresses the matrix to a compressed matrix of size $\frac{a}{\sqrt{k}} \times \frac{b}{\sqrt{k}}$ is generally selected.
Since the target edge domain device transmits the data after the matrix compression, the terminal device needs to restore the size of the matrix after receiving the compressed matrix transmitted by the target edge domain device. For example, for the example of fig. 3, each element in a matrix of size 2×2 is extended to 4 parts so that a matrix of size 4×4 can be restored.
After the matrix is recovered, the parameters of the deep learning model can be updated according to the recovery result by a first-order optimization algorithm, at this time, 1 round of training or 1 round of iteration is completed, and the operation of the step S101 can be returned to be executed so as to start the next round of training until the model training is completed.
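A minimal sketch of this block-averaging compression and the terminal-side size recovery, for the simple case where both matrix dimensions are divisible by √k; the function names are illustrative.

```python
import numpy as np

def compress(r, k):
    # Average each adjacent sqrt(k) x sqrt(k) block; assumes both
    # dimensions of r are divisible by sqrt(k), as in the simple case above
    s = int(round(np.sqrt(k)))
    a, b = r.shape
    return r.reshape(a // s, s, b // s, s).mean(axis=(1, 3))

def restore(c, k):
    # Terminal side: expand each compressed element back into an s x s block
    s = int(round(np.sqrt(k)))
    return np.repeat(np.repeat(c, s, axis=0), s, axis=1)

r = np.arange(16, dtype=float).reshape(4, 4)   # 4x4 example, as in fig. 3
c = compress(r, 4)                             # 2x2 compressed matrix
assert restore(c, 4).shape == r.shape          # original size is recovered
```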
In a specific embodiment of the present invention, the updating of the parameters of the deep learning model based on the recovery result described in step S104 by the first-order optimization algorithm may specifically include:
by

$$x_{t+1} = x_t - \eta \left( r_t + \lambda x_t \right)$$

updating the parameters of the deep learning model;

where x_t denotes the parameters of the deep learning model after the t-th training, x_{t+1} denotes the parameters of the deep learning model after the (t+1)-th training, η denotes the learning rate, λ denotes the attenuation coefficient, and r_t denotes the recovery result obtained after receiving the compressed matrix sent by the target edge domain device and restoring the size of the matrix.
In this embodiment, using the recovered matrix r_t, the update of the deep learning model parameters x_t can be conveniently realized.
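For illustration, the update rule as reconstructed above can be written as the following sketch; the learning rate and attenuation coefficient defaults are placeholders, not values from the patent.

```python
def update_parameters(x_t, r_t, lr=1e-3, decay=1e-2):
    # x_{t+1} = x_t - eta * (r_t + lambda * x_t), applied elementwise
    return x_t - lr * (r_t + decay * x_t)
```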
There may be various triggering modes for ending the model training, for example, in a specific embodiment of the present invention, the triggering conditions for ending the model training are:
the deep learning model converges and/or the number of trainings of the deep learning model reaches a set count threshold.
In this embodiment, when the deep learning model converges, the trained deep learning model has achieved a good learning effect, so convergence can serve as the trigger condition, or one of the trigger conditions, for ending model training. In addition, in some cases, training can end once the number of trainings of the deep learning model reaches the set count threshold, whether or not the model has converged, thereby avoiding problems such as overfitting and excessively long training time. In the present application, t, i.e., the number of trainings, may also be referred to as the number of iterations.
In practical applications, the number of trainings of the deep learning model reaching the set count threshold is usually used as the trigger condition for ending model training. Of course, in a few cases, both conditions can be required simultaneously, according to actual needs.
Step S105: and inputting the data to be identified into the deep learning model after training is completed, and obtaining the identification result of the data to be identified.
After the training-completed deep learning model is obtained, the data to be identified is input into the training-completed deep learning model, and the identification result of the data to be identified can be obtained.
As described above, the specific recognition content of the deep learning model of the present application can be set as needed. In one embodiment of the invention, considering that the deep learning model of the present application is generally a neural-network-based deep learning model, and that computer image recognition, natural language processing and statistical data analysis are classic application cases of neural networks, step S105 may specifically be: input the data to be identified into the trained deep learning model and perform computer image recognition, natural language recognition or pattern recognition, obtaining the recognition result of the data to be identified.
Based on the deep learning model, computer image recognition can be performed, i.e., recognizing the content of an image; natural language recognition, i.e., recognizing text/speech content and converting speech into text for output; or pattern recognition, i.e., data analysis such as recognizing regularities in data.
In one embodiment of the present invention, after determining the first order momentum matrix of the gradient data, the method may further include:
judging whether the difference between the first-order momentum matrix determined based on the training samples of the batch and the first-order momentum matrix determined based on the training samples of the previous batch is smaller than a preset first value or not;
if so, the first-order momentum matrix determined based on the training samples of the batch is saved, and the saved first-order momentum matrix is used as the first-order momentum matrix of the determined gradient data for each subsequent training batch.
This embodiment considers that the first order momentum matrix determined in step S102 may not be substantially changed as training is continuously performed, and in this case, if step S102 is performed, the first order momentum matrix of the gradient data is still determined based on the current gradient data, which may result in a waste of computing resources.
In this embodiment, the first-order momentum matrix determined from the current batch of training samples is compared with the first-order momentum matrix determined from the previous batch. For example, at the (t-1)-th round of training, the first-order momentum matrix determined from that batch of training samples is m_{t-1}; at the t-th round, it is m_t. The difference Δm between m_t and m_{t-1} can be expressed as

$$\Delta m = \left\| m_t - m_{t-1} \right\|_2$$

where $\left\| \cdot \right\|_2$ denotes the 2-norm of the difference between the matrices, reflecting the linear distance of the two matrices in space.

If the difference Δm between m_t and m_{t-1} is smaller than the preset first value, for example 0.001, then for each subsequent training batch, i.e., whenever step S102 is subsequently executed, the recorded m_t is directly used as the determined first-order momentum matrix of the gradient data; that is, in this example m_{t+1}, m_{t+2}, m_{t+3}, ... all equal m_t, and the first-order momentum matrix need not be recomputed.
Of course, if the difference Δm between m_t and m_{t-1} is not smaller than the preset first value, computation of the first-order momentum matrix continues, for example, in the above embodiment, via $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$.
In one embodiment of the present invention, after determining the second order momentum matrix of the gradient data, the method further comprises:
judging whether the difference between the second-order momentum matrix determined based on the training samples of the batch and the second-order momentum matrix determined based on the training samples of the previous batch is smaller than a preset second value or not;
if so, the second-order momentum matrix determined based on the training samples of the batch is saved, and the saved second-order momentum matrix is used as the second-order momentum matrix of the determined gradient data for each subsequent training batch.
This embodiment considers that, as in the previous embodiment, the determined second order momentum matrix determined in step S102 may not substantially change with continuous training, and if step S102 is performed, the second order momentum matrix of the gradient data is still determined based on the current gradient data, which may result in a waste of computing resources.
In this embodiment, the second-order momentum matrix determined from the current batch of training samples is compared with the second-order momentum matrix determined from the previous batch. If the difference is smaller than the preset second value, the second-order momentum matrix need not be recomputed; in subsequent iterations, the most recently computed and saved second-order momentum matrix is used directly. The principle is the same as above, so the description is not expanded.
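A sketch of the stabilization check for either momentum matrix follows; the threshold default corresponds to the preset first value (0.001 in the example above), and reading the 2-norm as the Euclidean norm of the flattened difference is an assumption based on the "linear distance" description.

```python
import numpy as np

def momentum_stabilized(curr, prev, threshold=0.001):
    # 2-norm of the difference, taken here as the Euclidean norm of the
    # flattened matrices; True means subsequent batches may reuse `curr`
    return np.linalg.norm((curr - prev).ravel()) < threshold
```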
In one embodiment of the present invention, after determining the first order momentum matrix and the second order momentum matrix of the gradient data, the method may further include:
performing bias correction on the first-order momentum matrix and the second-order momentum matrix;
correspondingly, the first order momentum matrix and the second order momentum matrix are sent to the target edge domain device, and the method comprises the following steps:
the bias-corrected first-order momentum matrix and the bias-corrected second-order momentum matrix are sent to the target edge domain device.
This embodiment performs bias correction on the determined first-order and second-order momentum matrices of the gradient data, which allows the accuracy to be further improved.
Of course, the specific bias correction algorithm can be set and selected as needed. For example, in one specific case, taking the first-order momentum matrix m_t as an example, its bias correction can be performed by

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^{\,t}}$$
It can be understood that, since both the first-order and second-order momentum matrices are bias-corrected in this embodiment, the terminal device transmits the bias-corrected first-order momentum matrix and the bias-corrected second-order momentum matrix to the target edge domain device. Likewise, the target edge domain device computes the averages based on the bias-corrected first-order momentum matrices and the bias-corrected second-order momentum matrices received from the n terminal devices.
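A sketch of the bias correction follows; the patent shows only the first-order formula, so correcting the second-order momentum by 1 - β_2^t is an assumption following the standard Adam convention.

```python
def bias_correct(m_t, v_t, beta1, beta2, t):
    m_hat = m_t / (1.0 - beta1 ** t)  # as in the formula above
    v_hat = v_t / (1.0 - beta2 ** t)  # assumed analogous correction
    return m_hat, v_hat
```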
In one embodiment of the present invention, after determining the first order momentum matrix and the second order momentum matrix of the gradient data, the method may further include:
performing cholesky decomposition on the first-order momentum matrix;
correspondingly, the first order momentum matrix is sent to the target edge domain device, which comprises the following steps:
sending M or M^T to the target edge domain device, so that the target edge domain device recovers the first-order momentum matrix from the received data;
where M denotes the triangular matrix obtained after Cholesky decomposition of the first-order momentum matrix, and M^T denotes the transpose of the triangular matrix M.
This embodiment takes into account that directly transmitting the first-order momentum matrix would involve a larger amount of data. Therefore, in this embodiment a Cholesky decomposition is performed on the first-order momentum matrix.
Taking the first-order momentum matrix m_t as an example, the Cholesky decomposition of the first-order momentum matrix can be expressed as

$$\mathrm{Chol}(m_t) = M M^T$$

where Chol denotes the Cholesky decomposition. Since only M or M^T needs to be transmitted, the amount of data to be transmitted is halved, which reduces training time and the data storage space required by the target edge domain device. The scheme generally adopted is to transmit M^T.
It can be understood that, since in such embodiments the first-order momentum matrix is not transmitted directly, but rather the M or M^T obtained after Cholesky decomposition, the target edge domain device, after receiving M or M^T, recovers the first-order momentum matrix m_t from the received M or M^T.
In one embodiment of the present invention, after determining the first order momentum matrix and the second order momentum matrix of the gradient data, the method further includes:
performing cholesky decomposition on the second-order momentum matrix;
correspondingly, the second order momentum matrix is sent to the target edge domain device, comprising:
will beV or V T Transmitting the second-order momentum matrix to the target edge domain equipment so as to enable the target edge domain equipment to recover the second-order momentum matrix according to the received data;
wherein ,Vrepresented is a triangular matrix obtained after cholesky decomposition of the second order momentum matrix,V T represented by a triangular matrixVIs a transposed matrix of (a).
As in the previous embodiment, this embodiment takes into account that directly transmitting the second-order momentum matrix would involve a larger amount of data. The second-order momentum matrix is therefore decomposed before sending, so that the amount of data to be transmitted is halved, reducing training time and the data storage space required by the target edge domain device. Since the principle is the same as above, the description is not repeated.
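A sketch of the decomposition-based transmission for either momentum matrix (NumPy; note that Cholesky decomposition requires the matrix to be symmetric positive definite, which is assumed here as the patent does):

```python
import numpy as np

def factor_for_transmission(matrix):
    # Chol(matrix) = M @ M.T; only the triangular factor need be sent
    M = np.linalg.cholesky(matrix)  # lower-triangular factor M
    return M.T                      # transmit M^T (roughly half the data)

def recover_matrix(M_T):
    # Receiver side: rebuild the momentum matrix from the received factor
    return M_T.T @ M_T              # equals M @ M^T, the original matrix
```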
In one embodiment of the present invention, the method may further include:
and outputting fault prompt information when the communication connection with the target edge domain equipment is lost.
Because of the distributed training, the scheme of the application needs to perform data interaction between the terminal equipment and the target edge domain equipment, so that when the communication connection with the target edge domain equipment is lost for a certain terminal equipment, fault prompt information can be output, so that a worker can timely process faults.
Further, in a specific embodiment of the present invention, the method may further include:
and counting the time consumption of communication with the target edge domain device.
In the scheme of the application, the time consumption of training can be effectively reduced, and the higher convergence rate is ensured. In some occasions, the training progress may still be slower, so that in this embodiment, the time consumed by communication with the target edge domain device may be counted, so that if the time consumed by communication between the terminal device and the target edge domain device is abnormal, a worker may find out in time, and the time consumed by communication is counted, which also helps the worker to perform subsequent communication analysis work and further optimization on communication.
The technical scheme provided by the embodiment of the invention is applied to each terminal device in the distributed cluster, with n terminal devices all communicatively connected to the target edge domain device. During training, each terminal device can acquire 1 batch of training samples and train its own deep learning model to determine the gradient data of the deep learning model; compared with the traditional SGD (stochastic gradient descent) scheme, the application adopts a first-order optimization algorithm based on a first-order momentum matrix and a second-order momentum matrix, which improves the convergence rate. From the gradient data of the deep learning model, a first-order momentum matrix and a second-order momentum matrix of the gradient data can be determined and sent to the target edge domain device; the target edge domain device can then determine the average of the n first-order momentum matrices and the average of the n second-order momentum matrices from the data sent by each of the n terminal devices, and determine the momentum parameter ratio matrix of the preset first-order optimization algorithm based on these determined averages.
Considering that the momentum parameter ratio matrix is large in scale, the target edge domain equipment compresses the momentum parameter ratio matrix and then sends the matrix to each terminal equipment, the terminal equipment can recover the size of the matrix after receiving the compressed matrix sent by the target edge domain equipment, and further, based on the recovered matrix, the parameter updating of the deep learning model is carried out through a first-order optimization algorithm, and the next round of training is continued until the model training is finished. As can be seen, the target edge domain device performs matrix compression on the momentum parameter ratio matrix and then transmits the momentum parameter ratio matrix, so that the data volume required to be transmitted in communication is reduced, and the training time consumption is reduced.
In conclusion, the scheme of the application can perform distributed training of the deep learning model, improves convergence speed and reduces training time.
Corresponding to the above method embodiment, the embodiment of the present invention further provides a data processing system based on a distributed cluster, which can be referred to above in a mutually corresponding manner.
Referring to fig. 5, which is a schematic structural diagram of a distributed cluster-based data processing system of the present invention, the system is applied to each terminal device in the distributed cluster, the n terminal devices are all in communication connection with a target edge domain device, n is a positive integer, and the distributed cluster-based data processing system includes:
the gradient data determining module 501 is configured to acquire 1 batch of training samples and train the device's own deep learning model, so as to determine gradient data of the deep learning model;
the momentum determining module 502 is configured to determine a first-order momentum matrix and a second-order momentum matrix of gradient data according to gradient data of the deep learning model;
the momentum sending module 503 is configured to send the first-order momentum matrix and the second-order momentum matrix to the target edge domain device, so that the target edge domain device determines an average value of n first-order momentum matrices and an average value of n second-order momentum matrices according to the received data sent by n terminal devices, and determines a momentum parameter ratio matrix of a preset first-order optimization algorithm based on the determined average value of n first-order momentum matrices and the determined average value of n second-order momentum matrices;
The parameter updating module 504 is configured to, after the target edge domain device compresses the momentum parameter ratio matrix, receive the compressed matrix sent by the target edge domain device, restore the size of the matrix, update parameters of the deep learning model by a first-order optimization algorithm based on the restoration result, and trigger the gradient data determining module 501 until model training is completed;
the execution module 505 is configured to input the data to be identified to the trained deep learning model, and obtain an identification result of the data to be identified.
In one embodiment of the present invention, the momentum determination module 502 is further configured to, after determining the first order momentum matrix of the gradient data:
judging whether the difference between the first-order momentum matrix determined based on the training samples of the batch and the first-order momentum matrix determined based on the training samples of the previous batch is smaller than a preset first value or not;
if so, the first-order momentum matrix determined based on the training samples of the batch is saved, and the saved first-order momentum matrix is used as the first-order momentum matrix of the determined gradient data for each subsequent training batch.
In one embodiment of the present invention, the momentum determination module 502 is further configured to, after determining the second order momentum matrix of the gradient data:
judging whether the difference between the second-order momentum matrix determined based on the training samples of the batch and the second-order momentum matrix determined based on the training samples of the previous batch is smaller than a preset second value or not;
if so, the second-order momentum matrix determined based on the training samples of the batch is saved, and the saved second-order momentum matrix is used as the second-order momentum matrix of the determined gradient data for each subsequent training batch.
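The freezing logic of the two embodiments above can be sketched as follows. The Frobenius norm as the "difference" measure is an assumption, since the patent does not fix how the difference between consecutive momentum matrices is measured; a terminal device would call this once per batch for the first-order matrix (with the preset first value as the threshold) and once for the second-order matrix (with the preset second value).

```python
import numpy as np

def momentum_with_freezing(current, previous, frozen, threshold):
    """Return (matrix to use this batch, frozen copy or None). Once the change
    between consecutive batches falls below `threshold`, the matrix is saved
    and reused for every subsequent training batch."""
    if frozen is not None:
        return frozen, frozen                      # already frozen: reuse it
    if previous is not None and np.linalg.norm(current - previous) < threshold:
        return current, current.copy()             # save and freeze from now on
    return current, None                           # still changing: keep updating
```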
In one embodiment of the present invention, after the momentum determination module 502 determines the first order momentum matrix and the second order momentum matrix of the gradient data, the momentum determination module is further configured to:
performing bias correction on the first-order momentum matrix and the second-order momentum matrix;
accordingly, the momentum sending module 503 sends the first order momentum matrix and the second order momentum matrix to the target edge domain device, including:
the first-order momentum matrix after the offset correction and the second-order momentum matrix after the offset correction are sent to the target edge domain device.
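The patent does not spell out the bias-correction formula; one standard choice, shown here purely as an assumption, is the Adam-style correction that divides each momentum matrix by one minus its decay rate raised to the training step.

```python
def bias_correct(m, v, beta1, beta2, t):
    """Adam-style bias correction (an assumed formula; the patent states only
    that bias correction is performed before sending to the edge device)."""
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return m_hat, v_hat
```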
In one embodiment of the present invention, the momentum determination module 502 is further configured to, after determining the first order momentum matrix and the second order momentum matrix of the gradient data:
performing Cholesky decomposition on the first-order momentum matrix;
accordingly, the momentum sending module 503 sends the first order momentum matrix to the target edge domain device, including:
sending $M$ or $M^T$ to the target edge domain device, so that the target edge domain device recovers the first-order momentum matrix from the received data;
where $M$ denotes the triangular matrix obtained by Cholesky decomposition of the first-order momentum matrix, and $M^T$ denotes the transpose of the triangular matrix $M$.
In one embodiment of the present invention, the momentum determination module 502 is further configured to, after determining the first order momentum matrix and the second order momentum matrix of the gradient data:
performing Cholesky decomposition on the second-order momentum matrix;
accordingly, the momentum sending module 503 sends the second order momentum matrix to the target edge domain device, including:
sending $V$ or $V^T$ to the target edge domain device, so that the target edge domain device recovers the second-order momentum matrix from the received data;
where $V$ denotes the triangular matrix obtained by Cholesky decomposition of the second-order momentum matrix, and $V^T$ denotes the transpose of the triangular matrix $V$.
In one embodiment of the present invention, the target edge domain device performs matrix compression on a momentum parameter ratio matrix, including:
and the target edge domain equipment carries out matrix quantization compression on the momentum parameter ratio matrix.
In a specific embodiment of the present invention, the target edge domain device performs matrix quantization compression on the momentum parameter ratio matrix, including:
the target edge domain device performs quantization compression on the $a \times b$ momentum parameter ratio matrix by averaging each adjacent group of $k$ data, obtaining a compressed matrix of size $\frac{a \times b}{k}$.
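Below is a sketch of the adjacent-$k$ averaging, together with one plausible size restoration on the terminal side; the patent says only that the matrix size is restored, so repeating each averaged value $k$ times is an assumption, as is divisibility of $a \cdot b$ by $k$.

```python
import numpy as np

def quantize_compress(ratio_matrix, k):
    """Edge side: average each run of k adjacent entries of the a-by-b
    ratio matrix, producing (a*b)/k values."""
    flat = ratio_matrix.ravel()
    assert flat.size % k == 0, "sketch assumes a*b is divisible by k"
    return flat.reshape(-1, k).mean(axis=1)

def restore_size(compressed, shape, k):
    """Terminal side: restore the original size by repeating each averaged
    value k times (an assumed inverse of the averaging step)."""
    return np.repeat(compressed, k).reshape(shape)
```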
In a specific embodiment of the present invention, the target edge domain device determines a momentum parameter ratio matrix of a preset first-order optimization algorithm based on the determined average value of n first-order momentum matrices and the determined average value of n second-order momentum matrices, including:
the target edge domain device determines the momentum parameter ratio matrix $r_t$ of the preset first-order optimization algorithm from the determined average of the n first-order momentum matrices and the determined average of the n second-order momentum matrices according to

$r_t = \dfrac{\bar{m}_t}{\sqrt{\bar{v}_t} + \epsilon}$

where $\bar{m}_t$ denotes the average of the n first-order momentum matrices determined by the target edge domain device, $\bar{v}_t$ denotes the average of the n second-order momentum matrices determined by the target edge domain device, $\epsilon$ is a preset third parameter, and $\epsilon > 0$.
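On the edge device, the averaging and ratio step might look as follows; the element-wise form of the ratio follows the reconstruction above, so this is a sketch rather than the patent's definitive formula.

```python
import numpy as np

def momentum_ratio(first_order_list, second_order_list, eps=1e-8):
    """Average the n first- and second-order momentum matrices received from
    the n terminal devices, then form the element-wise ratio r_t."""
    m_bar = np.mean(first_order_list, axis=0)   # average of n first-order matrices
    v_bar = np.mean(second_order_list, axis=0)  # average of n second-order matrices
    return m_bar / (np.sqrt(v_bar) + eps)       # eps: the preset third parameter
```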
In one embodiment of the present invention, the parameter updating module 504 performs parameter updating of the deep learning model through a first-order optimization algorithm based on the recovery result, including:
the parameters of the deep learning model are updated through

$x_{t+1} = x_t - \eta \, (r_t + \lambda x_t)$

where $x_t$ denotes the parameters of the deep learning model after the $t$-th training, $x_{t+1}$ denotes the parameters of the deep learning model after the $(t+1)$-th training, $\eta$ denotes the learning rate, $\lambda$ denotes the attenuation coefficient, and $r_t$ denotes the recovery result obtained after receiving the compressed matrix sent by the target edge domain device and restoring the matrix size.
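A direct transcription of this update rule; model.params is a hypothetical parameter array, and the eta/lam defaults are assumed example values, not values from the patent.

```python
def apply_update(model, r, eta=0.001, lam=0.01):
    """x_{t+1} = x_t - eta * (r_t + lam * x_t), with eta the learning rate
    and lam the attenuation coefficient."""
    model.params = model.params - eta * (r + lam * model.params)
```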
In one embodiment of the present invention, the momentum determination module 502 determines a first order momentum matrix of gradient data according to gradient data of a deep learning model, including:
a first-order momentum matrix of the gradient data is determined from the gradient data of the deep learning model by $m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$;
where $\beta_1$ denotes a preset first parameter, $m_t$ denotes the first-order momentum matrix of the gradient data determined at the $t$-th training, $m_{t-1}$ denotes the first-order momentum matrix of the gradient data determined at the $(t-1)$-th training, $g_t$ denotes the gradient data of the deep learning model determined at the $t$-th training, and $t$ denotes the number of trainings.
In one embodiment of the present invention, the momentum determination module 502 determines a second order momentum matrix of gradient data according to gradient data of a deep learning model, including:
a second-order momentum matrix of the gradient data is determined from the gradient data of the deep learning model by $v_t = \beta_2 v_{t-1} + (1-\beta_2)(g_t - m_t)^2$;
where $\beta_2$ denotes a preset second parameter, $m_t$ denotes the first-order momentum matrix of the gradient data determined at the $t$-th training, $v_t$ denotes the second-order momentum matrix of the gradient data determined at the $t$-th training, $v_{t-1}$ denotes the second-order momentum matrix of the gradient data determined at the $(t-1)$-th training, $g_t$ denotes the gradient data of the deep learning model determined at the $t$-th training, and $t$ denotes the number of trainings.
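The two recurrences transcribe directly; note that the second-order term accumulates $(g_t - m_t)^2$, the squared deviation of the gradient from its running mean, rather than the raw squared gradient. The beta defaults are assumed example values for the preset first and second parameters.

```python
def update_momenta(g, state, beta1=0.9, beta2=0.999):
    """m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
    v_t = beta2 * v_{t-1} + (1 - beta2) * (g_t - m_t)**2"""
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * (g - state["m"]) ** 2
    return state["m"], state["v"]
```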
In one embodiment of the present invention, the execution module 505 is specifically configured to:
and inputting the data to be identified into the trained deep learning model, and performing computer image identification, natural language identification or pattern identification to obtain an identification result of the data to be identified.
In one embodiment of the present invention, the triggering conditions for ending the model training are:
the deep learning model converges and/or the training times of the deep learning model reach a set time threshold.
In a specific embodiment of the present invention, the system further includes a prompt information output module, configured to:
and outputting fault prompt information when the communication connection with the target edge domain equipment is lost.
In a specific embodiment of the present invention, the method further includes a statistics module, configured to:
and counting the time consumption of communication with the target edge domain device.
In a specific embodiment of the present invention, the deep learning model used by the terminal device is a deep learning model sent by the data center and subjected to model pre-training.
Corresponding to the above method embodiment, the embodiment of the present invention further provides a data processing device based on the distributed cluster and a computer readable storage medium, which can be referred to correspondingly.
Referring to fig. 6, the distributed cluster-based data processing apparatus may include:
a memory 601 for storing a computer program;
a processor 602 for executing a computer program to perform the steps of the distributed cluster based data processing method as in any of the embodiments described above.
Referring to fig. 7, a computer program 71 is stored on the computer-readable storage medium 70, and the computer program 71, when executed by a processor, implements the steps of the distributed cluster-based data processing method of any of the embodiments described above. The computer-readable storage medium 70 described herein includes random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
It is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The principles and embodiments of the present invention have been described herein with reference to specific examples, but the description of the examples above is only for aiding in understanding the technical solution of the present invention and its core ideas. It should be noted that it will be apparent to those skilled in the art that the present invention may be modified and practiced without departing from the spirit of the present invention.

Claims (20)

1. The data processing method based on the distributed cluster is characterized by being applied to each terminal device in the distributed cluster, wherein n terminal devices are all in communication connection with target edge domain devices, n is a positive integer, and the data processing method based on the distributed cluster comprises the following steps:
acquiring 1 batch of training samples, training its own deep learning model, and determining gradient data of the deep learning model;
determining a first-order momentum matrix and a second-order momentum matrix of the gradient data according to the gradient data of the deep learning model;
the first-order momentum matrix and the second-order momentum matrix are sent to the target edge domain equipment, so that the target edge domain equipment determines an average value of n first-order momentum matrices and an average value of n second-order momentum matrices according to received data sent by n terminal equipment, and determines a momentum parameter ratio matrix of a preset first-order optimization algorithm based on the determined average value of n first-order momentum matrices and the determined average value of n second-order momentum matrices;
after the target edge domain equipment compresses the momentum parameter ratio matrix, receiving the compressed matrix sent by the target edge domain equipment, recovering the size of the matrix, updating parameters of the deep learning model through the first-order optimization algorithm based on a recovery result, and returning to execute the operation of acquiring 1 batch of training samples and training the deep learning model until model training is finished;
And inputting the data to be identified into the deep learning model after training is completed, and obtaining an identification result of the data to be identified.
2. The distributed cluster-based data processing method of claim 1, further comprising, after determining the first order momentum matrix of the gradient data:
judging whether the difference between the first-order momentum matrix determined based on the training samples of the batch and the first-order momentum matrix determined based on the training samples of the previous batch is smaller than a preset first value or not;
if so, the first-order momentum matrix determined based on the training samples of the batch is saved, and the saved first-order momentum matrix is used as the first-order momentum matrix of the determined gradient data for each subsequent training batch.
3. The distributed cluster-based data processing method of claim 1, further comprising, after determining the second order momentum matrix of the gradient data:
judging whether the difference between the second-order momentum matrix determined based on the training samples of the batch and the second-order momentum matrix determined based on the training samples of the previous batch is smaller than a preset second value or not;
If so, the second-order momentum matrix determined based on the training samples of the batch is saved, and the saved second-order momentum matrix is used as the second-order momentum matrix of the determined gradient data for each subsequent training batch.
4. The distributed cluster-based data processing method of claim 1, further comprising, after determining the first order momentum matrix and the second order momentum matrix of the gradient data:
performing bias correction on the first-order momentum matrix and the second-order momentum matrix;
correspondingly, the sending the first order momentum matrix and the second order momentum matrix to the target edge domain device includes:
and transmitting the first-order momentum matrix subjected to offset correction and the second-order momentum matrix subjected to offset correction to the target edge domain equipment.
5. The distributed cluster-based data processing method of claim 1, further comprising, after determining the first order momentum matrix and the second order momentum matrix of the gradient data:
performing Cholesky decomposition on the first-order momentum matrix;
correspondingly, sending the first-order momentum matrix to the target edge domain device comprises the following steps:
sending $M$ or $M^T$ to the target edge domain device, so that the target edge domain device recovers the first-order momentum matrix from the received data;
where $M$ denotes the triangular matrix obtained by Cholesky decomposition of the first-order momentum matrix, and $M^T$ denotes the transpose of the triangular matrix $M$.
6. The distributed cluster-based data processing method of claim 1, further comprising, after determining the first order momentum matrix and the second order momentum matrix of the gradient data:
performing Cholesky decomposition on the second-order momentum matrix;
correspondingly, sending the second order momentum matrix to the target edge domain device comprises:
sending $V$ or $V^T$ to the target edge domain device, so that the target edge domain device recovers the second-order momentum matrix from the received data;
where $V$ denotes the triangular matrix obtained by Cholesky decomposition of the second-order momentum matrix, and $V^T$ denotes the transpose of the triangular matrix $V$.
7. The distributed cluster-based data processing method according to claim 1, wherein the target edge domain device performs matrix compression on the momentum parameter ratio matrix, and the method comprises:
And the target edge domain equipment carries out matrix quantization compression on the momentum parameter ratio matrix.
8. The distributed cluster-based data processing method according to claim 7, wherein the target edge domain device performs matrix quantization compression on the momentum parameter ratio matrix, and the method comprises:
the target edge domain device performs quantization compression on the momentum parameter ratio matrix of size $a \times b$ by averaging each adjacent group of $k$ data, obtaining a compressed matrix of size $\frac{a \times b}{k}$.
9. The distributed cluster-based data processing method according to claim 1, wherein the determining, by the target edge domain device, a momentum parameter ratio matrix of a preset first-order optimization algorithm based on the determined average value of n first-order momentum matrices and the determined average value of n second-order momentum matrices includes:
the target edge domain device determines the momentum parameter ratio matrix $r_t$ of the preset first-order optimization algorithm from the determined average of the n first-order momentum matrices and the average of the n second-order momentum matrices according to

$r_t = \dfrac{\bar{m}_t}{\sqrt{\bar{v}_t} + \epsilon}$

where $\bar{m}_t$ denotes the average of the n first-order momentum matrices determined by the target edge domain device, $\bar{v}_t$ denotes the average of the n second-order momentum matrices determined by the target edge domain device, $\epsilon$ is a preset third parameter, and $\epsilon > 0$.
10. The distributed cluster-based data processing method according to claim 1, wherein the parameter updating of the deep learning model by the first-order optimization algorithm based on the recovery result comprises:
the parameters of the deep learning model are updated through

$x_{t+1} = x_t - \eta \, (r_t + \lambda x_t)$

where $x_t$ denotes the parameters of the deep learning model after the $t$-th training, $x_{t+1}$ denotes the parameters of the deep learning model after the $(t+1)$-th training, $\eta$ denotes the learning rate, $\lambda$ denotes the attenuation coefficient, and $r_t$ denotes the recovery result obtained after receiving the compressed matrix sent by the target edge domain device and restoring the matrix size.
11. The distributed cluster-based data processing method of claim 1, wherein determining a first order momentum matrix of gradient data from gradient data of the deep learning model comprises:
determining the first-order momentum matrix of the gradient data from the gradient data of the deep learning model by $m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$;
where $\beta_1$ denotes a preset first parameter, $m_t$ denotes the first-order momentum matrix of the gradient data determined at the $t$-th training, $m_{t-1}$ denotes the first-order momentum matrix of the gradient data determined at the $(t-1)$-th training, $g_t$ denotes the gradient data of the deep learning model determined at the $t$-th training, and $t$ denotes the number of trainings.
12. The distributed cluster-based data processing method of claim 1, wherein determining a second order momentum matrix of gradient data from gradient data of the deep learning model comprises:
determining the second-order momentum matrix of the gradient data from the gradient data of the deep learning model by $v_t = \beta_2 v_{t-1} + (1-\beta_2)(g_t - m_t)^2$;
where $\beta_2$ denotes a preset second parameter, $m_t$ denotes the first-order momentum matrix of the gradient data determined at the $t$-th training, $v_t$ denotes the second-order momentum matrix of the gradient data determined at the $t$-th training, $v_{t-1}$ denotes the second-order momentum matrix of the gradient data determined at the $(t-1)$-th training, $g_t$ denotes the gradient data of the deep learning model determined at the $t$-th training, and $t$ denotes the number of trainings.
13. The distributed cluster-based data processing method according to claim 1, wherein the inputting the data to be identified into the training-completed deep learning model and obtaining the identification result of the data to be identified includes:
and inputting the data to be identified into the deep learning model after training, and performing computer image identification, natural language identification or pattern identification to obtain an identification result of the data to be identified.
14. The distributed cluster-based data processing method according to claim 1, wherein the triggering condition for the model training to end is:
the deep learning model converges and/or the training times of the deep learning model reach a set time threshold.
15. The distributed cluster-based data processing method of claim 1, further comprising:
and outputting fault prompt information when the communication connection with the target edge domain equipment is lost.
16. The distributed cluster-based data processing method of claim 15, further comprising:
and counting the time consumption of communication with the target edge domain equipment.
17. A distributed cluster-based data processing method according to any one of claims 1 to 16, wherein the deep learning model used by the terminal device is a model pre-trained deep learning model sent by a data center.
18. A data processing system based on a distributed cluster, which is applied to each terminal device in the distributed cluster, wherein n terminal devices are all in communication connection with a target edge domain device, and n is a positive integer, the data processing system based on the distributed cluster comprises:
the gradient data determining module is used for acquiring 1 batch of training samples and training its own deep learning model to determine gradient data of the deep learning model;
the momentum determining module is used for determining a first-order momentum matrix and a second-order momentum matrix of the gradient data according to the gradient data of the deep learning model;
the momentum transmitting module is used for transmitting the first-order momentum matrix and the second-order momentum matrix to the target edge domain equipment, so that the target edge domain equipment determines an average value of n first-order momentum matrices and an average value of n second-order momentum matrices according to received data transmitted by n terminal equipment respectively, and determines a momentum parameter ratio matrix of a preset first-order optimization algorithm based on the determined average value of n first-order momentum matrices and the determined average value of n second-order momentum matrices;
the parameter updating module is used for receiving the compressed matrix sent by the target edge domain device after the target edge domain device compresses the momentum parameter ratio matrix, restoring the size of the matrix, updating the parameters of the deep learning model through the first-order optimization algorithm based on the recovery result, and triggering the gradient data determining module until model training ends;
and the execution module is used for inputting the data to be identified into the deep learning model after training is completed, and obtaining the identification result of the data to be identified.
19. A data processing apparatus based on a distributed cluster, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the distributed cluster-based data processing method as claimed in any one of claims 1 to 17.
20. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, implements the steps of the distributed cluster-based data processing method according to any of claims 1 to 17.
CN202310288301.8A 2023-03-23 2023-03-23 Data processing method, system, equipment and storage medium based on distributed cluster Active CN115994590B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310288301.8A CN115994590B (en) 2023-03-23 2023-03-23 Data processing method, system, equipment and storage medium based on distributed cluster


Publications (2)

Publication Number Publication Date
CN115994590A true CN115994590A (en) 2023-04-21
CN115994590B CN115994590B (en) 2023-07-14

Family

ID=85992467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310288301.8A Active CN115994590B (en) 2023-03-23 2023-03-23 Data processing method, system, equipment and storage medium based on distributed cluster

Country Status (1)

Country Link
CN (1) CN115994590B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299781A (en) * 2018-11-21 2019-02-01 安徽工业大学 Distributed deep learning system based on momentum and beta pruning
CN109978177A (en) * 2019-03-19 2019-07-05 腾讯科技(深圳)有限公司 Model training method, method for processing business, device and relevant device
CN111709533A (en) * 2020-08-19 2020-09-25 腾讯科技(深圳)有限公司 Distributed training method and device of machine learning model and computer equipment
CN112085074A (en) * 2020-08-25 2020-12-15 腾讯科技(深圳)有限公司 Model parameter updating system, method and device
CN113807538A (en) * 2021-04-09 2021-12-17 京东科技控股股份有限公司 Federal learning method and device, electronic equipment and storage medium
CN114897155A (en) * 2022-03-30 2022-08-12 北京理工大学 Integrated model data-free compression method for satellite
CN115420578A (en) * 2022-06-30 2022-12-02 吉林大学 Omicron virus detection method based on microscopic hyperspectral imaging system
CN115510936A (en) * 2021-06-23 2022-12-23 华为技术有限公司 Model training method based on federal learning and cluster analyzer
CN115759230A (en) * 2022-11-22 2023-03-07 京东科技信息技术有限公司 Model training and task processing method, device, system, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QIU ZY ET AL: "Layer-wise based Adabelief Optimization Algorithm for Deep Learning", ACM, page 3 *
LI SHIQI: "Improvement of Gradient Sparsification Methods in Distributed Deep Learning Model Training", China Master's Theses Full-text Database, Information Science and Technology Series *

Also Published As

Publication number Publication date
CN115994590B (en) 2023-07-14

Similar Documents

Publication Publication Date Title
CN113159173B (en) Convolutional neural network model compression method combining pruning and knowledge distillation
CN113241064B (en) Speech recognition, model training method and device, electronic equipment and storage medium
CN115829024B (en) Model training method, device, equipment and storage medium
CN110751265A (en) Lightweight neural network construction method and system and electronic equipment
JP7186591B2 (en) Text Classifier, Learner, and Program
WO2024077981A1 (en) Natural language processing method, system and device, and storage medium
CN109471049B (en) Satellite power supply system anomaly detection method based on improved stacked self-encoder
US20230252294A1 (en) Data processing method, apparatus, and device, and computer-readable storage medium
CN111126595A (en) Method and equipment for model compression of neural network
CN113886460A (en) Low-bandwidth distributed deep learning method
CN115001937B (en) Smart city Internet of things-oriented fault prediction method and device
CN114154626B (en) Filter pruning method for image classification task
CN116070720B (en) Data processing method, system, equipment and storage medium based on distributed cluster
CN115994590B (en) Data processing method, system, equipment and storage medium based on distributed cluster
CN117914690A (en) Edge node network fault prediction method based on deep learning GCN-LSTM
CN115953651B (en) Cross-domain equipment-based model training method, device, equipment and medium
CN117494762A (en) Training method of student model, material processing method, device and electronic equipment
CN112446461A (en) Neural network model training method and device
CN116070170A (en) Cloud edge end data fusion processing method and system based on deep learning
CN115564043A (en) Image classification model pruning method and device, electronic equipment and storage medium
CN113159168B (en) Pre-training model accelerated reasoning method and system based on redundant word deletion
CN114595815A (en) Transmission-friendly cloud-end cooperation training neural network model method
CN113570037A (en) Neural network compression method and device
CN111210009A (en) Information entropy-based multi-model adaptive deep neural network filter grafting method, device and system and storage medium
CN113033422A (en) Face detection method, system, equipment and storage medium based on edge calculation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant