WO2024001870A1 - Training method for an artificial intelligence model and related device - Google Patents

Training method for an artificial intelligence model and related device

Info

Publication number
WO2024001870A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2023/101357
Other languages
English (en)
French (fr)
Inventor
张靖义
王永忠
刘艳琳
Original Assignee
华为技术有限公司
Priority claimed from CN202210986001.2A (CN117390442A)
Application filed by 华为技术有限公司
Priority to EP23808654.0A (EP4332837A1)
Publication of WO2024001870A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N20/00 Machine learning

Definitions

  • This application relates to the technical field of artificial intelligence (AI), and in particular to an AI model training method, a model training system, a computing cluster, a computer-readable storage medium, and a computer program product.
  • Machine learning algorithms are a type of algorithm that automatically analyze and obtain patterns from data and use the patterns to predict unknown data.
  • A training device such as a training server can use such machine learning algorithms to obtain an AI model through training.
  • the training process of the training server may include: the training server determines the neural network used by the model, initializes the weights of the neural network, and obtains an initialized neural network model; the training server then inputs the sample data in the data set into the initialized neural network model and updates the weights of the neural network model based on the model's processing results on the sample data.
  • when the neural network model meets the training stop condition, the training server can stop training.
  • taking collaborative training across two computing centers as an example, each computing center first synchronizes weights to its master node, then the master nodes of the two computing centers synchronize with each other, and finally a synchronous update is performed within each computing center.
  • This application provides a training method for AI models.
  • This method introduces a hierarchical asynchronous training mechanism, specifically using a synchronous update mode within a training unit and an asynchronous update mode between training units, which avoids synchronization waiting time between training units, shortens the overall training time, improves training efficiency, and reduces training costs.
  • This application also provides a model training system, a computing cluster, a computer-readable storage medium, and a computer program product.
  • this application provides a training method for an AI model. The method is applied to a model training system.
  • the model training system includes a first training unit and at least one second training unit.
  • the first training unit includes a plurality of first training sub-units.
  • the first training unit may receive the first training subtask, execute the first training subtask through multiple first training subunits, and obtain the synchronized first weights of the multiple first training subunits.
  • the first training unit can also asynchronously receive the second weight obtained by at least one second training unit executing the second training subtask. Then the first training unit obtains the weight of the AI model based on the above-mentioned first weight and second weight.
  • This method proposes a hierarchical asynchronous training approach, that is, weights are updated synchronously among multiple training subunits of the same training unit and asynchronously among different training units of the model training system. This solves the problem that, because the bandwidth between computing centers is limited, synchronous training would introduce an unacceptable synchronization waiting time and make efficient training impossible.
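  • As a rough illustration of this hierarchical scheme, the Python sketch below runs synchronous averaging inside each training unit and exchanges unit-level weights between units without blocking; the thread-and-queue setup, the plain averaging, and all names are illustrative assumptions rather than the patented algorithm.

```python
import queue
import threading
import numpy as np

# A minimal sketch of hierarchical asynchronous training, assuming two training
# units with simulated sub-units; the update rule (plain averaging) and all
# names are illustrative assumptions, not the filing's exact algorithm.

def local_step(weight, rng):
    """Stand-in for forward/backward/update on one training sub-unit."""
    return weight - 0.01 * rng.standard_normal(weight.shape)  # fake gradient step

def run_unit(name, inbox, peer_inbox, steps=5, dim=4, num_subunits=3, seed=0):
    rng = np.random.default_rng(seed)
    weight = np.zeros(dim)
    for step in range(steps):
        # Synchronous update inside the unit: all sub-units train, then average.
        sub_weights = [local_step(weight, rng) for _ in range(num_subunits)]
        weight = np.mean(sub_weights, axis=0)
        # Asynchronous exchange between units: send without waiting ...
        peer_inbox.put(weight.copy())
        # ... and merge a peer weight only if one has already arrived.
        try:
            peer_weight = inbox.get_nowait()
            weight = 0.5 * (weight + peer_weight)
        except queue.Empty:
            pass  # keep training instead of waiting for the peer
    print(name, "final weight:", np.round(weight, 3))

inbox_1, inbox_2 = queue.Queue(), queue.Queue()
t1 = threading.Thread(target=run_unit, args=("unit-1", inbox_1, inbox_2), kwargs={"seed": 1})
t2 = threading.Thread(target=run_unit, args=("unit-2", inbox_2, inbox_1), kwargs={"seed": 2})
t1.start(); t2.start(); t1.join(); t2.join()
```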
  • the second training unit includes a plurality of second training sub-units.
  • the second training unit can also execute the second training subtask through multiple second training subunits to obtain the synchronized second weights of the multiple second training subunits. In this way, multiple training subunits can be used for parallel training within each training unit, which improves training efficiency.
  • the second training unit can compress the synchronized second weights of multiple second training sub-units.
  • the first training unit may asynchronously receive the compressed second weight. This can reduce the amount of data transmitted between the first training unit and the second training unit, shorten the transmission time, avoid idle resources of the first training unit and the second training unit, improve resource utilization, and improve training efficiency.
  • the second training unit can compress the second weight through different compression mechanisms. For example, the second training unit can determine the difference between the second weights of the multiple second training subunits after this synchronization and the second weights after the last synchronization, and compress the second weights of the multiple second training subunits after this synchronization based on the difference. For another example, the second training unit can determine the norm of each row or column of the second weights of the multiple second training subunits after this synchronization, and compress the second weights of the multiple second training subunits after this synchronization based on the norm.
  • the second training unit can select an appropriate compression mechanism according to the distribution of the second weight and compress the second weight, reducing the amount of transmitted data and shortening the transmission time as much as possible, thereby shortening the training time and improving training efficiency.
  • the first training unit can also obtain the third weight last asynchronously updated by the first training unit and the at least one second training unit, and determine the distance between the comprehensive weight determined by the first weight and the second weight and the third weight.
  • when that distance is greater than a preset distance, the first training unit can obtain the weight of the AI model based on the first weight and the second weight.
  • when the first training unit updates the weights between training units, it can obtain the first correlation between the first weight and the second weight according to a correlation measurement function, and then obtain the weight of the AI model based on the first correlation, the first weight, and the second weight.
  • This method uses the first correlation degree between the first weight and the second weight to update the weights between training units, which can update the weights reasonably and avoid the performance degradation caused by the averaging strategy of the model weights.
  • the first training unit can also update the weights in combination with historical update information. Specifically, the first training unit can obtain, according to the correlation measurement function, the second correlation between the comprehensive weight determined by the first weight and the second weight and a third weight, where the third weight is the weight last asynchronously updated by the first training unit and the at least one second training unit. The first training unit then obtains the change amount of this global update based on the second correlation, the difference between the comprehensive weight and the third weight, and the change amount of the last global update.
  • the first training unit can obtain the weight of the AI model based on the change amount of this global update, the first correlation degree, the first weight, and the second weight.
  • the training unit may be a server.
  • the first training unit and the second training unit may be servers in the same computing center, and the training subunit may be a training card in the server. This enables efficient training in the computing center, which is especially suitable for training scenarios of specific task models.
  • this application provides a model training system.
  • the model training system includes a first training unit and at least one second training unit, and the first training unit includes a plurality of first training subunits:
  • the first training unit is configured to receive a first training subtask, execute the first training subtask through a plurality of first training subunits, and obtain a synchronized first weight of the plurality of first training subunits;
  • the second training unit is used to perform the second training subtask to obtain the second weight
  • the first training unit is also configured to asynchronously receive the second weight, and obtain the weight of the AI model based on the first weight and the second weight.
  • in some possible implementations, the second training unit is specifically used to: execute the second training subtask through the plurality of second training subunits to obtain the synchronized second weights of the plurality of second training subunits.
  • in some possible implementations, the second training unit is also used to compress the synchronized second weights of the plurality of second training subunits, and the first training unit is specifically used to asynchronously receive the compressed second weights.
  • in some possible implementations, the second training unit is specifically used to: determine the difference between the second weights of the plurality of second training subunits after this synchronization and the second weights after the last synchronization and compress the second weights after this synchronization based on the difference; or determine the norm of each row or column of the second weights after this synchronization and compress the second weights after this synchronization based on the norm.
  • in some possible implementations, the first training unit is also used to: obtain the third weight last asynchronously updated by the first training unit and the at least one second training unit, and determine the distance between the comprehensive weight determined by the first weight and the second weight and the third weight; the first training unit is specifically used to obtain the weight of the AI model based on the first weight and the second weight when that distance is greater than a preset distance.
  • in some possible implementations, the first training unit is specifically used to: obtain the first correlation between the first weight and the second weight according to a correlation measurement function, and obtain the weight of the AI model based on the first correlation, the first weight, and the second weight.
  • in some possible implementations, the first training unit is also used to: obtain, according to the correlation measurement function, the second correlation between the comprehensive weight determined by the first weight and the second weight and a third weight, where the third weight is the weight last asynchronously updated by the first training unit and the at least one second training unit, and obtain the change amount of this global update based on the second correlation, the difference between the comprehensive weight and the third weight, and the change amount of the last global update; the first training unit is then specifically used to obtain the weight of the AI model according to the change amount of this global update, the first correlation, the first weight, and the second weight.
  • multiple training sub-units in the training unit adopt an improved parameter server architecture or a ring architecture for synchronization.
  • the training unit is a computing cluster
  • the training subunit is a server in the computing cluster.
  • the training unit is a server
  • the training subunit is a training card in the server.
  • this application provides a computing cluster.
  • the computing cluster includes at least one computing device including at least one processor and at least one memory.
  • the at least one processor and the at least one memory communicate with each other.
  • the at least one processor is configured to execute instructions stored in the at least one memory, so that the computing device or computing cluster executes the model training method as described in the first aspect or any implementation of the first aspect.
  • the present application provides a computer-readable storage medium that stores instructions instructing a computing device or computing cluster to execute the model training method described in the first aspect or any implementation of the first aspect.
  • the present application provides a computer program product containing instructions that, when run on a computing device or computing cluster, cause the computing device or computing cluster to execute the model training method described in the first aspect or any implementation of the first aspect.
  • Figure 1 is a training flow and pipeline diagram of an AI model provided by an embodiment of the present application;
  • Figure 2 is a schematic architectural diagram of a model training system provided by an embodiment of the present application;
  • Figure 3 is a flow chart of an AI model training method provided by an embodiment of the present application;
  • Figure 4 is a training pipeline diagram of an AI model provided by an embodiment of the present application;
  • Figure 5 is a structural diagram of a model training system provided by an embodiment of the present application;
  • Figure 6 is a schematic structural diagram of a computing cluster provided by an embodiment of the present application;
  • Figures 7 and 8 are schematic structural diagrams of computing clusters provided by embodiments of the present application.
  • first and second in the embodiments of this application are only used for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Therefore, features defined as “first” and “second” may explicitly or implicitly include one or more of these features.
  • AI models specifically refer to models trained through AI algorithms such as machine learning algorithms and used to complete AI tasks.
  • the AI model can be a general large model or a model used to complete specific tasks.
  • Specific tasks include but are not limited to image classification, object detection, text recognition, and speech recognition.
  • Weights, in the field of AI, are usually a set of floating-point numbers used as the main parameters of the neural network adopted by the AI model. Normally, weights participate in computation during AI model training and are updated in the backpropagation stage. It should be noted that the weight scale (also called parameter scale) of different AI models can differ; for example, the weight scale of some large AI models can reach hundreds of billions or tens of trillions.
  • multiple training devices can be used for collaborative training.
  • multiple training servers in multiple computing centers can be used for collaborative training across computing centers to improve training efficiency.
  • each computing center can use a master-slave architecture for collaborative training.
  • the nodes in each computing center can be divided into two types: master nodes and worker nodes.
  • during collaborative training, each computing center first performs a weight synchronization to its master node. For example, nodes such as worker a, worker b, and worker c in computing center 1 send W_a, W_b, W_c to Master1, and Master1 obtains the weight W_1 after internal synchronization of computing center 1 based on the weights sent by the worker nodes; for example, Master1 can accumulate the weights of the workers to obtain W_1.
  • computing center 2 can obtain its internally synchronized weight W_2 in a similar manner.
  • the master nodes of the two computing centers then perform a weight synchronization between them.
  • for example, computing center 1 sends W_1 to computing center 2 and receives W_2 sent by computing center 2.
  • in this way, the two computing centers can obtain the weight W synchronized between computing centers.
  • a synchronous update is then performed within each computing center.
  • for example, Master1 sends the weight W to worker a, worker b, and worker c in computing center 1, and Master2 sends the weight W to worker a, worker b, and worker c in computing center 2.
  • for example, a worker in a computing center takes about 177 milliseconds (ms) for the forward operation, about 279 ms for backpropagation, and about 147 ms for the internal gradient update.
  • during these stages, the master is in a waiting state (the time periods corresponding to the "WT" characters in the second and fourth timelines in Figure 1). The masters then synchronize the weights across computing centers, and the cross-computing-center communication between masters takes about 1150 ms; during this time, the workers in each computing center are in a waiting state.
  • after the master communication is completed, each worker can perform its gradient update. In this way, even if the number of training devices is increased, the long synchronization waiting time may prevent the total training time from decreasing significantly, or the total training time may even increase, affecting training efficiency and increasing training costs.
  • embodiments of the present application provide a training method for an AI model. This method can be applied to model training systems.
  • the model training system includes multiple training units.
  • one of the training units is called a first training unit, and other training units other than the first training unit are called second training units.
  • the first training unit may include multiple first training sub-units, and similarly, the second training unit may also include multiple second training sub-units.
  • the first training unit receives the first training subtask, executes the first training subtask through multiple first training subunits to obtain the synchronized first weight of the multiple first training subunits, and asynchronously receives the second weight obtained by the at least one second training unit executing a second training subtask; the first training unit then obtains the weight of the AI model based on the first weight and the second weight.
  • This method proposes a hierarchical asynchronous training approach, that is, weights are updated synchronously among multiple training subunits of the same training unit and asynchronously among different training units of the model training system. This solves the problem that, because the bandwidth between computing centers is limited, synchronous training would introduce an unacceptable synchronization waiting time and make efficient training impossible.
  • the granularity of the training units and training sub-units in the model training system of the embodiment of the present application can be determined according to the weight scale of the AI model that needs to be trained.
  • the training unit can be a computing cluster
  • the computing cluster can be a computing center
  • the training sub-unit can be a server in the computing center.
  • the training unit can be a server and the training sub-unit can be a training card in the server.
  • the training card refers to the processor used to train AI models, including graphics processing unit (GPU) and neural network processing unit (NPU).
  • the system architecture of the embodiment of the present application is introduced below with the training unit as the computing center and the training sub-unit as the server.
  • the model training system includes multiple computing centers.
  • the model training system includes a total of two computing centers, Computing Center 1 and Computing Center 2, as an example.
  • Each computing center includes multiple servers, and the computing architectures of servers in different computing centers can be the same or different.
  • the server in computing center 1 can adopt a four-card architecture
  • the server in computing center 2 can adopt an eight-card architecture.
  • Different computing centers can be interconnected through a switch network.
  • the switch network may include multiple switches.
  • the switch network between computing centers is usually low-bandwidth and high-latency
  • the switch network (not shown in Figure 2) within the computing center is usually high-bandwidth and low-latency.
  • Figure 2 is an example of collaborative training across computing centers.
  • when the weight scale of the AI model is relatively small, the model can also be trained collaboratively by multiple servers in a single computing center.
  • the network of multiple training cards in the server is a high-speed network with higher bandwidth and lower latency. Based on this, multiple training cards on the same server can be updated synchronously, and different servers can be updated asynchronously.
  • the embodiment of the present application also provides an AI model training method.
  • the following is an introduction to the AI model training method of the embodiment of the present application from the perspective of the model training system.
  • the model training system includes a first training unit and at least one second training unit.
  • the first training unit may be the computing center 1 in Figure 2, and the second training unit may be the computing center 2 in Figure 2.
  • the first training unit includes a plurality of first training sub-units.
  • the second training unit may include a plurality of second training sub-units.
  • the first training subunit may be a server in computing center 1, and the second training subunit may be a server in computing center 2.
  • the method includes the following steps:
  • the first training unit receives the first training subtask.
  • the first training subtask is one of multiple training subtasks obtained by task splitting for training the AI model.
  • the number of training subtasks can be equal to the number of training units.
  • the first training subtask is specifically the training subtask scheduled to the first training unit among the plurality of training subtasks.
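  • For illustration, splitting can be as simple as sharding the sample indices across training units, as in the hypothetical helper below (the filing does not fix a particular splitting strategy).

```python
def split_training_task(sample_indices, num_training_units):
    """Split one training task into one training subtask per training unit
    by sharding the sample indices round-robin (an illustrative strategy)."""
    subtasks = [[] for _ in range(num_training_units)]
    for position, sample_index in enumerate(sample_indices):
        subtasks[position % num_training_units].append(sample_index)
    return subtasks

# Example: 10 samples split into 2 training subtasks (one per training unit).
print(split_training_task(list(range(10)), num_training_units=2))
# [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]
```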
  • the first training unit performs the first training subtask through multiple first training subunits, and obtains the synchronized first weights of the multiple first training subunits.
  • the first training unit can execute the first training subtask in parallel through multiple first training subunits, obtain the weight obtained by each first training subunit executing the first training subtask, and then synchronously update the weights of the multiple first training subunits to obtain the first weight.
  • the first training unit includes a master node, such as master1 shown in Figure 2.
  • multiple first training subunits (such as worker a, worker b, and worker c in computing center 1) can report their respective trained weights (such as W_a, W_b, W_c) to the master node, and the master node can obtain the first weight based on the weights reported by worker nodes such as worker a, worker b, and worker c.
  • in some embodiments, the multiple first training subunits can adopt different synchronization methods to obtain the first weight. Each is explained in detail below.
  • the first synchronization method can be a synchronization method based on an improved parameter server (PS) architecture.
  • in the parameter server architecture, each training subunit, such as a worker node, trains the AI model based on the data assigned to it.
  • in the improved parameter server architecture, each worker updates the weight based on its gradient and then uploads the weight to the master node.
  • the master node performs a sum or average operation on the weights reported by the worker nodes, thereby obtaining the synchronized weights of the training subunits.
  • since a worker node does not need to report gradients, wait for the parameter server to aggregate the gradients reported by each worker, and then receive the updated weight issued by the parameter server, the number of communications between the workers and the parameter server is reduced, which reduces communication time and communication overhead.
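  • A minimal sketch of this improved parameter-server style synchronization is shown below, assuming each worker applies its own gradient locally and the master averages the reported weights (the text above also allows summation); the function names are illustrative.

```python
import numpy as np

def worker_update(weight, gradient, learning_rate=0.01):
    # Each worker applies its own gradient locally instead of reporting the
    # gradient to a parameter server.
    return weight - learning_rate * gradient

def master_synchronize(reported_weights):
    # The master aggregates the weights reported by the workers, e.g. by
    # averaging, to obtain the synchronized weight of the training unit.
    return np.mean(np.stack(reported_weights), axis=0)

# Three workers report locally updated weights; the master returns W_1.
rng = np.random.default_rng(0)
w = np.zeros(4)
reported = [worker_update(w, rng.standard_normal(4)) for _ in range(3)]
w1 = master_synchronize(reported)
print(np.round(w1, 3))
```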
  • the second synchronization method can be a synchronization method based on a ring architecture.
  • the ring architecture can be the Ring All Reduce architecture.
  • the Ring All Reduce architecture is mainly divided into two steps: scatter-reduce and all-gather. Assuming a total of 5 workers are used, in the scatter-reduce step the gradient calculated on each worker is divided into 5 equal parts, that is, the gradient of the weights is divided into 5 parts, and every worker uses the same division. Then, through five communications between workers, the gradients of some of the parameters on each worker become complete.
  • any node that completes internal synchronization can be used as a master node to update the weights between training units.
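  • For illustration, the scatter-reduce step can be simulated in-process as below; the chunk bookkeeping follows the standard Ring All Reduce pattern and the worker count of five matches the example above, while the function name and data are hypothetical.

```python
import numpy as np

# A small in-process simulation of the scatter-reduce step of Ring All Reduce;
# a sketch for illustration only (real systems exchange the chunks over the
# network between neighbouring workers).

def ring_scatter_reduce(gradients):
    """gradients: list of equally shaped 1-D arrays, one per worker.
    Returns, for each worker, (chunk_index, fully reduced chunk)."""
    n = len(gradients)
    # Each worker splits its own gradient into n equal parts (a private copy).
    chunks = [np.array_split(np.array(g, dtype=float), n) for g in gradients]
    for step in range(n - 1):
        # What worker i sends this step: its running sum for chunk (i - step) % n.
        sent = [chunks[i][(i - step) % n].copy() for i in range(n)]
        for i in range(n):
            recv_chunk = (i - 1 - step) % n
            chunks[i][recv_chunk] += sent[(i - 1) % n]  # accumulate from the left neighbour
    # After n-1 steps, worker i holds the complete sum for chunk (i + 1) % n.
    return [((i + 1) % n, chunks[i][(i + 1) % n]) for i in range(n)]

workers = 5
grads = [np.full(10, fill_value=w, dtype=float) for w in range(workers)]
for worker, (chunk_id, reduced) in enumerate(ring_scatter_reduce(grads)):
    print(f"worker {worker} owns reduced chunk {chunk_id}: {reduced}")
# Every printed chunk equals 0+1+2+3+4 = 10 elementwise, i.e. the full reduction.
```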
  • the second training unit receives the second training subtask.
  • the second training subtask is one of multiple training subtasks obtained by task splitting for training the AI model.
  • the second training subtask is specifically a training subtask among the plurality of training subtasks that is scheduled to the second training unit.
  • the second training unit can execute the second training subtask in parallel through multiple second training subunits, obtain the weight obtained by each second training subunit executing the second training subtask, and then synchronously update the weights of the multiple second training subunits to obtain the second weight.
  • the second training unit includes a master node, such as master2 shown in Figure 2.
  • multiple second training subunits (such as worker a, worker b, and worker c in computing center 2) can report their respective trained weights (such as W_a, W_b, W_c) to the master node (such as master2), and the master node can obtain the second weight based on the weights reported by worker nodes such as worker a, worker b, and worker c.
  • the plurality of second training sub-units of the second training unit may adopt a synchronization method similar to that of the first training sub-unit, which will not be described again here.
  • the first training unit receives the second weight asynchronously.
  • Asynchronous means executing one task without waiting for another task to finish.
  • the first training unit may receive the second weight in an asynchronous manner. Based on this, the first training unit may receive the second weight during or before the execution of S304, without waiting for S304 to complete before executing S310. This can avoid unnecessary synchronization waiting time, shorten the overall training time, and improve training efficiency.
  • the second training unit can also compress the synchronized second weight of the second training sub-unit to obtain the compressed second weight.
  • the first training unit can asynchronously receive the compressed second weight, which can greatly reduce transmission overhead and shorten transmission time between training units.
  • the second training unit may use different methods to compress the synchronized second weights of the plurality of second training sub-units.
  • the weight compression methods are described in detail below.
  • the first method is compression based on difference values.
  • the second training unit can determine the difference between the second weights of the plurality of second training subunits after this synchronization and the second weights after the last synchronization, and then compress the second weights of the plurality of second training subunits after this synchronization based on the difference.
  • for example, the second training unit can obtain the compressed second weight in the following way: th(k) represents a preset threshold, and select() represents a selection function that selects, from the difference matrix formed by the second weights after this synchronization and the second weights after the last synchronization, the elements whose differences are greater than the preset threshold, and sets the values of the other elements to 0 to obtain the corresponding sparse matrix. The compressed second weight can be this sparse matrix.
  • the second method is norm-based compression. Specifically, the second training unit determines the norm, for example the L2 norm, of each row or column of the second weights of the plurality of second training subunits after this synchronization, and then compresses the second weights of the plurality of second training subunits after this synchronization according to the norm. For example, the second training unit can filter out the target rows or target columns that meet a condition according to the norm of each row or column, and obtain the compressed second weight based on the target rows or columns obtained through the screening.
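  • The two options can be sketched as follows; th(k) is modeled as a fixed threshold and the norm-based rule keeps rows whose L2 norm exceeds a threshold, both of which are assumptions, since the filing leaves the threshold schedule and the selection condition open.

```python
import numpy as np

# Sketches of the two weight-compression options described above. The exact
# threshold schedule th(k) and the row/column selection rule are not fixed
# here, so a constant threshold and L2-norm row selection are assumed.

def compress_by_difference(current_weight, previous_weight, threshold):
    """Difference-based compression: keep only the differences whose magnitude
    since the last synchronization exceeds the threshold; set the rest to 0,
    yielding a sparse matrix that the receiver can apply to its stored copy."""
    difference = current_weight - previous_weight
    mask = np.abs(difference) > threshold      # select() over the difference matrix
    return np.where(mask, difference, 0.0)

def compress_by_row_norm(current_weight, norm_threshold):
    """Norm-based compression: keep only the rows whose L2 norm exceeds the
    threshold; the kept row indices are returned so the weight can be restored."""
    row_norms = np.linalg.norm(current_weight, axis=1)
    kept_rows = np.nonzero(row_norms > norm_threshold)[0]
    return kept_rows, current_weight[kept_rows]

w_prev = np.zeros((3, 4))
w_curr = np.array([[0.5, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0],
                   [0.2, 0.0, 0.9, 0.0]])
print(compress_by_difference(w_curr, w_prev, threshold=0.3))
print(compress_by_row_norm(w_curr, norm_threshold=0.4))
```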
  • S312 The second training unit asynchronously receives the first weight.
  • the second training unit may receive the first weight asynchronously. Based on this, the second training unit can receive the first weight during or before the execution of S308, without waiting for S308 to complete before executing S312. This can avoid unnecessary synchronization waiting time, shorten the overall training time, and improve training efficiency.
  • the first training unit may also compress the synchronized first weight of the first training sub-unit to obtain the compressed first weight.
  • the second training unit can asynchronously receive the compressed first weight, which can greatly reduce the transmission overhead and shorten the transmission time between training units.
  • the manner in which the first training unit compresses the first weight is similar to the manner in which the second training unit compresses the second weight. Please refer to the description of the relevant contents of S310, which will not be described again here.
  • the first training unit obtains the weight of the AI model based on the first weight and the second weight.
  • the first weight and the second weight can be the complete weight of the AI model.
  • the first training unit can perform an averaging operation on the first weight and the second weight, thereby obtaining the updated weight between training units.
  • the next training step can be continued until the training stop condition is met, and the updated weights between training units when the training is stopped can be used as the weight of the AI model.
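  • As a small illustration of this outer loop, the sketch below repeats inter-unit averaging until a training stop condition is met; the stop criterion (a maximum step count or a weight-change tolerance) is an assumed placeholder, since the filing does not fix one.

```python
import numpy as np

# Illustrative outer loop: keep averaging the first and second weights and
# continue training until a stop condition holds. The stop condition used
# here (max steps or a small change in the averaged weight) is an assumption.

def train_until_stop(first_weight, second_weight, refine_step, max_steps=100, tol=1e-4):
    model_weight = 0.5 * (first_weight + second_weight)   # updated weight between training units
    for step in range(max_steps):
        new_first, new_second = refine_step(model_weight)  # next training step on each unit
        new_weight = 0.5 * (new_first + new_second)
        if np.linalg.norm(new_weight - model_weight) < tol:  # training stop condition met
            return new_weight, step + 1
        model_weight = new_weight
    return model_weight, max_steps

# Toy refinement that nudges both units toward a fixed target weight.
target = np.array([1.0, -1.0])
refine = lambda w: (w + 0.3 * (target - w), w + 0.2 * (target - w))
final_weight, steps = train_until_stop(np.zeros(2), np.zeros(2), refine)
print(np.round(final_weight, 3), "after", steps, "steps")
```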
  • in other embodiments, the first training unit may also obtain the first correlation between the first weight and the second weight according to a correlation measurement function, where x and y respectively represent the two quantities involved in the correlation calculation, in this case the first weight and the second weight.
  • the first training unit may then obtain the weight of the AI model based on the first correlation as well as the first weight and the second weight; the result of this calculation is the updated weight between training units.
  • in other embodiments, the first training unit can also combine historical update information to perform weight updates between training units.
  • the historical update information may include the third weight last asynchronously updated by the first training unit and the at least one second training unit, and the change amount of the last global update.
  • the first training unit may first obtain the second correlation between the comprehensive weight determined by the first weight and the second weight and the third weight according to the correlation measurement function.
  • the comprehensive weight can be the average of the first weight and the second weight
  • the second correlation can be:
  • the first training unit can obtain the change amount of this global update based on the second correlation, the difference between the comprehensive weight and the third weight, and the change amount of the last global update.
  • then, the first training unit may obtain the weight of the AI model based on the change amount of this global update, the first correlation, the first weight, and the second weight.
  • when the training stop condition is met, the first training unit can stop training and use the updated weight as the weight of the AI model; when the training stop condition is not met, the first training unit can continue training until the training stop condition is met, and use the weight updated between training units when the training stop condition is met as the weight of the AI model.
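  • The concrete correlation measurement function, inter-unit combination formula, and change-amount formula are given in the original filing and are not reproduced here; the sketch below only illustrates the general shape of such an update, assuming cosine similarity as the correlation measure and a momentum-style blend of the previous change amount, which are illustrative choices rather than the patented formulas.

```python
import numpy as np

# A hedged sketch of the correlation-guided, history-aware inter-unit update.
# Cosine similarity as the correlation measurement function, the 0.5 mixing
# cap, and the momentum-style blend are assumptions for illustration only.

def correlation(x, y):
    """Correlation measurement function between two weight vectors; x and y
    are the two quantities involved in the correlation calculation."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

def hierarchical_update(first_weight, second_weight, third_weight, previous_change):
    # First correlation: between the first (local) and second (remote) weights.
    first_corr = correlation(first_weight, second_weight)
    comprehensive = 0.5 * (first_weight + second_weight)   # comprehensive weight
    # Second correlation: between the comprehensive weight and the weight of
    # the last asynchronous update (the third weight).
    second_corr = correlation(comprehensive, third_weight)
    # Change amount of this global update: previous change blended with the
    # difference between the comprehensive weight and the third weight.
    change = second_corr * previous_change + (comprehensive - third_weight)
    # New AI-model weight from the change amount, the first correlation and
    # the first/second weights (correlation-weighted mix, illustrative only).
    mix = 0.5 * max(first_corr, 0.0)
    new_weight = third_weight + change
    new_weight = (1.0 - mix) * new_weight + mix * second_weight
    return new_weight, change

w1 = np.array([0.20, 0.40, -0.10])     # first weight (synchronized locally)
w2 = np.array([0.25, 0.35, -0.05])     # second weight (received asynchronously)
w3 = np.array([0.15, 0.30, -0.08])     # third weight (last asynchronous update)
new_w, change = hierarchical_update(w1, w2, w3, previous_change=np.zeros(3))
print(np.round(new_w, 3), np.round(change, 3))
```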
  • in some embodiments, the first training unit may also determine the validity of the second weight or the compressed second weight sent by the second training unit. Taking the case where the second training unit directly sends the second weight as an example: when the second weight is an invalid weight, the first training unit can give up this update, wait for the second weight sent next time by the second training unit, and continue to judge the validity of that next second weight, until the received second weight is a valid weight, at which point the weight is updated based on the valid weight.
  • specifically, the first training unit may obtain the third weight last asynchronously updated by the first training unit and the at least one second training unit, and determine the distance between the comprehensive weight determined by the first weight and the second weight and the third weight.
  • the comprehensive weight may be the average of the above-mentioned first weight and the second weight. It should be noted that averaging the first weight and the second weight is only one implementation method for obtaining the comprehensive weight. In other possible implementation methods of the embodiments of this application, the comprehensive weight can also be obtained through other methods.
  • the distance between the comprehensive weight and the third weight can be one or more of cosine distance or Euclidean distance. For convenience of description, the embodiment of the present application uses cosine distance as an example. Based on this, the distance between the comprehensive weight and the third weight can be represented by the above-mentioned second correlation degree.
  • when the distance between the comprehensive weight and the third weight is greater than the preset distance, the first training unit can update the weight; for example, the first training unit may obtain the weight of the AI model based on the above-mentioned first weight and second weight. When the distance is not greater than the preset distance, the first training unit gives up this update; when the weights accumulate enough difference that the distance between the comprehensive weight and the third weight becomes greater than the preset distance, the first training unit performs the weight update again.
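  • A sketch of this gating logic is given below; cosine distance is used as in the example above, and the preset distance value is an arbitrary placeholder.

```python
import numpy as np

# A sketch of the distance-based gating: skip an inter-unit update when the
# comprehensive weight is still too close to the previously updated (third)
# weight. Cosine distance and the threshold value are illustrative choices.

def cosine_distance(x, y):
    cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12)
    return 1.0 - float(cos_sim)

def maybe_update(first_weight, second_weight, third_weight, preset_distance=1e-3):
    comprehensive = 0.5 * (first_weight + second_weight)
    if cosine_distance(comprehensive, third_weight) > preset_distance:
        return 0.5 * (first_weight + second_weight)   # perform the update
    return None                                       # give up this update, wait for the next weights

w1 = np.array([0.2, 0.4, -0.1])
w2 = np.array([0.21, 0.41, -0.09])
w_third = np.array([0.205, 0.405, -0.095])
print(maybe_update(w1, w2, w_third))   # None: change too small, update skipped
```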
  • the second training unit obtains the weight of the AI model based on the first weight and the second weight.
  • the second training unit can perform an averaging operation on the first weight and the second weight to obtain updated weights between training units.
  • the next training step can be continued until the training stop condition is met, and the updated weights between training units when the training is stopped can be used as the weight of the AI model.
  • the second training unit may also obtain the first correlation between the first weight and the second weight according to the correlation measurement function, and then obtain the weight of the AI model according to the first correlation, the first weight, and the second weight.
  • the specific implementation of how the second training unit obtains the weight of the AI model based on the first correlation, the first weight, and the second weight can be found in the description of the above formula (3), and will not be described again here.
  • the second training unit can combine historical update information to update the weights between training units, thereby avoiding performance degradation caused by the averaging strategy of model weights.
  • the process of updating the weights between training units by the second training unit in combination with the historical update information can be found in the description of S314, which will not be described again here.
  • the embodiment of the present application also provides a flow chart of the training method of the AI model in the embodiment of the present application.
  • in this example, the first training unit may be computing center 1, denoted as DC1, and the second training unit may be computing center 2, denoted as DC2.
  • the worker nodes in DC1 and the worker nodes in DC2 can perform the forward operation and backpropagation, then perform their gradient updates respectively, and perform weight synchronization within each DC to obtain the synchronized weights W_1 and W_2.
  • the workers of DC1 can then perform the forward operation of the next training step, and the master of DC1 can asynchronously receive W_2 from DC2; similarly, the workers of DC2 can perform the forward operation of the next training step, and the master of DC2 can asynchronously receive W_1 from DC1.
  • the master of DC1 can update the weight in combination with the asynchronously received W_2 when the weights of the next training step are synchronized.
  • similarly, the master of DC2 can update the weight in combination with the asynchronously received W_1 when the weights of the next training step are synchronized.
  • compared with Figure 1, this method significantly shortens the synchronization waiting time, shortens the overall training time, and improves training efficiency.
  • the second training unit may also include a second training subunit. In this way, the second training unit can send the weights trained by the second training sub-unit to the first training unit for weight update.
  • the first training unit 502 is configured to receive a first training subtask, execute the first training subtask through a plurality of first training subunits 5020, and obtain the synchronized first weights of the plurality of first training subunits 5020;
  • the second training unit 504 is used to perform the second training subtask to obtain the second weight
  • the first training unit 502 is also configured to asynchronously receive the second weight, and obtain the weight of the AI model based on the first weight and the second weight.
  • the communication module 5022 is used to receive the first training subtask;
  • the task execution module 5024 is configured to execute the first training subtask through a plurality of first training subunits 5020 and obtain the synchronized first weights of the plurality of first training subunits 5020;
  • the communication module 5022 is also used to asynchronously receive the second weight obtained by at least one second training unit 504 executing the second sub-training subtask;
  • the weight update module 5026 is used to obtain the weight of the AI model based on the first weight and the second weight.
  • the above-mentioned communication module 5022, task execution module 5024, and weight update module 5026 can be implemented by hardware modules or software modules.
  • the communication module 5022 can be implemented by a transceiver module such as a network interface card or a transceiver.
  • the task execution module 5024 and the weight update module 5026 may be devices implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the above-mentioned PLD can be a complex programmable logical device (CPLD), a field-programmable gate array (field-programmable gate array, FPGA), a general array logic (generic array logic, GAL), or any combination thereof.
  • the second training unit 504 includes a plurality of second training sub-units 5040;
  • the second training unit 504 is specifically used for:
  • the synchronized second weights of the plurality of second training subunits are obtained.
  • the second training unit 504 may include the following functional modules:
  • the communication module 5042 is used to receive the second training subtask;
  • the task execution module 5044 is configured to execute the second training subtask through the plurality of second training subunits 5040 and obtain the synchronized second weights of the plurality of second training subunits 5040;
  • the communication module 5042 is also used to asynchronously send the second weight and asynchronously receive the first weight
  • the weight update module 5046 is used to obtain the weight of the AI model based on the first weight and the second weight.
  • the communication module 5042, the task execution module 5044, and the weight update module 5046 in the above-mentioned second training unit 504 can be implemented by hardware modules or software modules.
  • the communication module 5042, the task execution module 5044, and the weight update module 5046 may be applications or application modules running on a computing device (such as a server) or a computing cluster (such as a computing center).
  • the communication module 5042 can be implemented by a transceiver module such as a network interface card or a transceiver.
  • the task execution module 5044 and the weight update module 5046 may be devices implemented using an application specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the above-mentioned PLD can be implemented by a complex program logic device CPLD, a field programmable gate array FPGA, a general array logic GAL, or any combination thereof.
  • the second training unit 504 also includes:
  • the compression module 5048 is used to compress the synchronized second weights of the plurality of second training sub-units 5040;
  • the communication module 5022 in the first training unit 502 is specifically used for:
  • the compressed second weight is received asynchronously.
  • the second training unit 504 (for example, the compression module 5048 in the second training unit 504) is specifically used to:
  • the first training unit 502 also includes:
  • the compression module 5028 is used to compress the synchronized first weights of the plurality of first training sub-units 5020;
  • the communication module 5022 is also used to asynchronously send the compressed first weight to the second training unit 504, so that the second training unit 504 can update the weight.
  • the compression module 5028 in the first training unit 502 can perform compression based on the difference of weights, or perform compression based on the norm of the weights of each row or column.
  • the compression module 5028 and the compression module 5048 can be implemented by hardware modules or software modules.
  • the compression modules 5028 and 5048 may be applications or application modules running on a computing device (eg, a server) or a computing cluster (eg, a computing center).
  • the compression module 5028 and the compression module 5048 may be devices implemented using an application specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the first training unit 502 also includes:
  • the distance determination module 5029 is used to obtain the third weight last asynchronously updated by the first training unit and the at least one second training unit, and determine the distance between the comprehensive weight determined by the first weight and the second weight and the third weight;
  • the first training unit 502 (for example, the weight update module 5026 in the first training unit 502) is specifically used to:
  • when the distance between the comprehensive weight and the third weight is greater than the preset distance, the weight of the AI model is obtained based on the first weight and the second weight.
  • the second training unit 504 may also include a distance determination module 5049.
  • the distance determination module 5049 is used to obtain the third weight of the last asynchronous update of the second training unit 504 and the first training unit 502, and determine the distance between the comprehensive weight determined by the first weight and the second weight and the third weight.
  • the weight update module 5046 is specifically used to: when the distance between the comprehensive weight and the third weight is greater than the preset distance, obtain the weight of the AI model based on the first weight and the second weight.
  • the distance determination module 5029 and the distance determination module 5049 can be implemented by a hardware module or a software module.
  • the distance determination module 5029 and the distance determination module 5049 may be an application program or application program module running on a computing device (such as a server) or a computing cluster (such as a computing center).
  • the distance determination module 5029 and the distance determination module 5049 may be devices implemented using an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or the like.
  • the first training unit 502 (for example, the weight update module 5026 in the first training unit 502) is specifically used to: obtain the first correlation between the first weight and the second weight according to a correlation measurement function, and obtain the weight of the AI model based on the first correlation, the first weight, and the second weight.
  • the first training unit 502 (for example, the weight update module 5026 in the first training unit 502) is also used to: obtain, according to the correlation measurement function, the second correlation between the comprehensive weight determined by the first weight and the second weight and a third weight, where the third weight is the weight last asynchronously updated by the first training unit and the at least one second training unit, and obtain the change amount of this global update based on the second correlation, the difference between the comprehensive weight and the third weight, and the change amount of the last global update.
  • the first training unit 502 (for example, the weight update module 5026 in the first training unit 502) is then specifically used to: obtain the weight of the AI model according to the change amount of this global update, the first correlation, the first weight, and the second weight.
  • multiple training subunits (such as the first training subunits 5020 or the second training subunits 5040) in a training unit (such as the first training unit 502 or the second training unit 504) can adopt an improved parameter server architecture or a ring architecture for synchronization.
  • the training unit (for example, the first training unit 502 or the second training unit 504) may be a computing cluster, and the computing cluster may be a computing center including multiple servers.
  • the training subunit (for example, the first training subunit 5020 or the second training subunit 5040) may be a server in the computing cluster. This enables collaborative training across computing centers.
  • the training unit may be a server, and the training subunit may be a training card in the server. This enables collaborative training within the computing center.
  • the embodiment of this application also provides a computing cluster.
  • the computing cluster may include at least one computing device.
  • Computing device 600 may be a server or a terminal device. Terminal devices include but are not limited to desktop computers, laptops or smartphones.
  • computing device 600 includes: bus 602, processor 604, memory 606, and communication interface 608.
  • the processor 604, the memory 606 and the communication interface 608 communicate through the bus 602. It should be understood that this application does not limit the number of processors and memories in the computing device 600.
  • the bus 602 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one line is used in Figure 6, but it does not mean that there is only one bus or one type of bus.
  • Bus 602 may include a path that carries information between various components of computing device 600 (eg, memory 606, processor 604, communications interface 608).
  • the processor 604 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
  • Memory 606 may include volatile memory, such as random access memory (RAM). Memory 606 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • the memory 606 stores executable program code, and the processor 604 executes the executable program code to implement the aforementioned AI model training method.
  • the memory 606 stores instructions for the model training system 500 to execute the training method of the AI model.
  • for example, the memory 606 can store the instructions corresponding to the communication module, the task execution module, the weight update module, the compression module, and the distance determination module in the model training system 500.
  • the communication interface 608 uses transceiver modules such as, but not limited to, network interface cards and transceivers to implement communication between the computing device 600 and other devices, computing clusters, or communication networks.
  • a computing cluster may include multiple computing devices 600.
  • the memory 606 of multiple computing devices 600 in the computing cluster may store the instructions of the same model training system 500 for executing the training method of the AI model.
  • alternatively, each of the multiple computing devices 600 in the computing cluster can also be used to execute part of the instructions of the model training system 500 for performing the AI model training method.
  • a combination of multiple computing devices 600 can jointly execute the instructions of the model training system 500 for executing the training method of the AI model.
  • FIG. 7 shows a possible implementation.
  • two computing devices 600A and 600B may be connected through a communication interface 608 .
  • the memory in the computing device 600A stores instructions for executing the functions of the first training unit 502.
  • specifically, the memory in the computing device 600A stores the instructions corresponding to the communication module 5022, the task execution module 5024, the weight update module 5026, the compression module 5028, and the distance determination module 5029.
  • the memory in the computing device 600B stores instructions for executing the functions of the second training unit 504.
  • specifically, the memory in the computing device 600B stores the instructions corresponding to the communication module 5042, the task execution module 5044, the weight update module 5046, the compression module 5048, and the distance determination module 5049.
  • in other words, the memories 606 of the computing devices 600A and 600B jointly store the instructions of the model training system 500 for performing the training method of the AI model.
  • the functions of the computing device 600A shown in FIG. 7 may also be performed by multiple computing devices 600.
  • similarly, the functions of the computing device 600B may also be performed by multiple computing devices 600.
  • multiple computing devices in a computing cluster can be connected through a network.
  • the network may be a wide area network or a local area network, etc.
  • Figure 8 shows a possible implementation. As shown in Figure 8, two computing devices 600C and 600D are connected through a network. Specifically, the connection to the network is made through a communication interface in each computing device.
  • the memory 606 in the computing device 600C stores instructions for executing the functions of the first training unit 502.
  • specifically, the memory in the computing device 600C stores the instructions corresponding to the communication module 5022, the task execution module 5024, the weight update module 5026, the compression module 5028, and the distance determination module 5029.
  • the memory 606 in the computing device 600D stores instructions for executing the functions of the second training unit 504.
  • specifically, the memory in the computing device 600D stores the instructions corresponding to the communication module 5042, the task execution module 5044, the weight update module 5046, the compression module 5048, and the distance determination module 5049.
  • the functions of the computing device 600C may also be performed by multiple computing devices 600.
  • similarly, the functions of the computing device 600D may also be performed by multiple computing devices 600.
  • An embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be any available medium that a computing device can store, or a data storage device, such as a data center, that contains one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, tape), optical media (eg, DVD), or semiconductor media (eg, solid state drive), etc.
  • the computer-readable storage medium includes instructions that instruct the computing device to perform the above-described training method for the AI model applied to the model training system 500 .
  • An embodiment of the present application also provides a computer program product containing instructions.
  • the computer program product may be a software or program product containing instructions capable of running on a computing device or computing cluster or stored in any available medium.
  • when the computer program product runs on a computing device or computing cluster, the computing device or computing cluster is caused to execute the above training method of the AI model.

Abstract

This application provides a training method for an artificial intelligence (AI) model, applied to a model training system that includes a first training unit and at least one second training unit. The method includes: the first training unit receives a first training subtask, executes the first training subtask through multiple first training subunits to obtain the synchronized first weight of the multiple first training subunits, asynchronously receives a second weight obtained by the at least one second training unit executing a second training subtask, and obtains the weight of the AI model based on the first weight and the second weight. By introducing a hierarchical asynchronous training mechanism, that is, synchronous updates within a training unit and asynchronous updates between training units, the method avoids synchronization waiting time between training units, shortens the overall training time, and improves training efficiency.

Description

Training method for an artificial intelligence model and related device
This application claims priority to the Chinese patent application No. 202210764337.4, entitled "Training method, apparatus and training system", filed with the China National Intellectual Property Administration on June 29, 2022, and to the Chinese patent application No. 202210986001.2, entitled "Training method for an artificial intelligence model and related device", filed with the China National Intellectual Property Administration on August 16, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of artificial intelligence (AI), and in particular to an AI model training method, a model training system, a computing cluster, a computer-readable storage medium, and a computer program product.
Background
With the continuous development of AI technology, especially the rise of machine learning (ML) algorithms, many data-driven AI models have emerged. These AI models can be used to complete various AI tasks such as image classification, object detection, text recognition, and speech recognition, which can substantially reduce labor costs.
Machine learning algorithms are a class of algorithms that automatically analyze data to obtain patterns and use the patterns to make predictions on unknown data. A training device such as a training server can use such machine learning algorithms to obtain an AI model through training. The training process of the training server can include: the training server determines the neural network used by the model, initializes the weights of the neural network, and obtains an initialized neural network model; the training server then feeds sample data from the data set into the initialized neural network model, and updates the weights of the neural network model based on the model's processing results on the sample data. When the neural network model meets the training stop condition, the training server can stop training.
Considering that some AI models have very large weight scales, for example the weight scale of some general-purpose large AI models can reach hundreds of billions, multiple training devices can be used for collaborative training, for example multiple training servers in multiple computing centers can perform collaborative training across computing centers to improve training efficiency.
Taking collaborative training across two computing centers as an example, each computing center first performs a synchronization to its master node, then the master nodes of the two computing centers perform a synchronization between them, and finally a synchronous update is performed within each computing center.
Because the network communication between computing centers has low bandwidth and high latency, and the amount of data transmitted between computing centers is very large, a long synchronization waiting time results. Even if the number of training devices is increased, the long synchronization waiting time may prevent the total training time from decreasing significantly, or the total training time may even increase, which affects training efficiency and increases training costs.
Summary
This application provides a training method for an AI model. By introducing a hierarchical asynchronous training mechanism, specifically using a synchronous update mode within a training unit and an asynchronous update mode between training units, the method avoids synchronization waiting time between training units, shortens the overall training time, improves training efficiency, and reduces training costs. This application also provides a model training system, a computing cluster, a computer-readable storage medium, and a computer program product.
In a first aspect, this application provides a training method for an AI model. The method is applied to a model training system. The model training system includes a first training unit and at least one second training unit, and the first training unit includes multiple first training subunits.
Specifically, the first training unit may receive a first training subtask, execute the first training subtask through the multiple first training subunits, and obtain the synchronized first weight of the multiple first training subunits. In addition, the first training unit may asynchronously receive a second weight obtained by the at least one second training unit executing a second training subtask. The first training unit then obtains the weight of the AI model based on the first weight and the second weight.
This method proposes a hierarchical asynchronous training approach, that is, weights are updated synchronously among the multiple training subunits of the same training unit and asynchronously among the different training units of the model training system. This solves the problem that, because the bandwidth between computing centers is limited, synchronous training would introduce an unacceptable synchronization waiting time and make efficient training impossible.
In some possible implementations, the second training unit includes multiple second training subunits. Correspondingly, the second training unit may also execute the second training subtask through the multiple second training subunits to obtain the synchronized second weight of the multiple second training subunits. In this way, multiple training subunits can be used for parallel training inside each training unit, which improves training efficiency.
In some possible implementations, the second training unit may compress the synchronized second weight of the multiple second training subunits. Correspondingly, the first training unit may asynchronously receive the compressed second weight. This reduces the amount of data transmitted between the first training unit and the second training unit, shortens the transmission time, avoids idle resources in the first and second training units, improves resource utilization, and improves training efficiency.
In some possible implementations, the second training unit may compress the second weight through different compression mechanisms. For example, the second training unit may determine the difference between the second weights of the multiple second training subunits after this synchronization and the second weights after the last synchronization, and compress the second weights of the multiple second training subunits after this synchronization based on the difference. For another example, the second training unit may determine the norm of each row or column of the second weights of the multiple second training subunits after this synchronization, and compress the second weights of the multiple second training subunits after this synchronization based on the norm.
In this method, the second training unit can select an appropriate compression mechanism according to the distribution of the second weight and compress the second weight, so as to reduce the amount of transmitted data and shorten the transmission time as much as possible, thereby shortening the training time and improving training efficiency.
In some possible implementations, the first training unit may also obtain the third weight last asynchronously updated by the first training unit and the at least one second training unit, and determine the distance between the comprehensive weight determined by the first weight and the second weight and the third weight. Correspondingly, when the distance between the comprehensive weight and the third weight is greater than a preset distance, the first training unit may obtain the weight of the AI model based on the first weight and the second weight.
In this method, the first training unit measures the validity of the weights by the distance between them. If the new weight is very close to the historical weight, the first training unit can skip this update and wait until a certain difference has accumulated before updating again, which reduces the update frequency and improves resource utilization.
In some possible implementations, when updating the weights between training units, the first training unit may obtain the first correlation between the first weight and the second weight according to a correlation measurement function, and then obtain the weight of the AI model based on the first correlation, the first weight, and the second weight.
By using the first correlation between the first weight and the second weight to update the weights between training units, the method can update the weights reasonably and avoid the performance degradation caused by simply averaging the model weights.
In some possible implementations, the first training unit may also update the weights in combination with historical update information. Specifically, the first training unit may obtain, according to the correlation measurement function, the second correlation between the comprehensive weight determined by the first weight and the second weight and a third weight, where the third weight is the weight last asynchronously updated by the first training unit and the at least one second training unit. The first training unit then obtains the change amount of this global update based on the second correlation, the difference between the comprehensive weight and the third weight, and the change amount of the last global update. Correspondingly, the first training unit may obtain the weight of the AI model based on the change amount of this global update, the first correlation, the first weight, and the second weight.
In this method, the first training unit fully considers the correlation between the first weight and the second weight as well as the historical update information when performing the global weight update, which avoids the performance degradation caused by simply averaging the model weights in the current round of updating.
In some possible implementations, the multiple training subunits in a training unit adopt an improved parameter server architecture or a ring architecture for synchronization. In the improved parameter server architecture, the parameter server can be replaced by a master node; each worker node determines the weight by itself from its gradient and then uploads the weight to the master node, instead of uploading the gradient to a parameter server and having the parameter server determine the weight from the gradients uploaded by the worker nodes, which further shortens the training time. In the ring architecture, the workers form a ring, each worker is connected to two other workers, and each worker exchanges information only with its two neighboring workers. Each worker holds a complete copy of the model parameters and performs gradient calculation and updating. It should be noted that, under the ring architecture, any node that has completed internal synchronization can act as the master node to update the weights between training units.
In some possible implementations, the training unit may be a computing cluster, for example the computing cluster corresponding to a computing center, and the training subunit may be a server in the computing cluster. This enables efficient training across computing centers, which is especially suitable for training scenarios of large AI models.
In some possible implementations, the training unit may be a server, for example the first training unit and the second training unit may be servers in the same computing center, and the training subunit may be a training card in the server. This enables efficient training within a computing center, which is especially suitable for training scenarios of task-specific models.
第二方面，本申请提供一种模型训练系统。所述模型训练系统包括第一训练单元和至少一个第二训练单元，所述第一训练单元包括多个第一训练子单元：
所述第一训练单元,用于接收第一训练子任务,通过多个第一训练子单元执行所述第一训练子任务,获得所述多个第一训练子单元同步后的第一权重;
所述第二训练单元,用于执行第二训练子任务获得第二权重;
所述第一训练单元,还用于异步接收所述第二权重,根据所述第一权重和所述第二权重,获得所述AI模型的权重。
在一些可能的实现方式中,所述第二训练单元包括多个第二训练子单元;
所述第二训练单元具体用于:
通过所述多个第二训练子单元执行所述第二训练子任务,获得所述多个第二训练子单元同步后的第二权重。
在一些可能的实现方式中,所述第二训练单元还用于:
对所述多个第二训练子单元同步后的第二权重进行压缩;
所述第一训练单元具体用于:
异步接收压缩后的所述第二权重。
在一些可能的实现方式中,所述第二训练单元具体用于:
确定所述多个第二训练子单元本次同步后的第二权重与上一次同步后的第二权重的差值,根据所述差值对所述多个第二训练子单元本次同步后的第二权重进行压缩;或者,
确定所述多个第二训练子单元本次同步后的第二权重中各行或各列权重的范数,根据所述范数对所述多个第二训练子单元本次同步后的第二权重进行压缩。
在一些可能的实现方式中,所述第一训练单元还用于:
获取所述第一训练单元与所述至少一个第二训练单元上一次异步更新的第三权重;
确定由所述第一权重和所述第二权重确定的综合权重与所述第三权重的距离;
所述第一训练单元具体用于:
当所述综合权重与所述第三权重的距离大于预设距离,所述第一训练单元根据所述第一权重和所述第二权重,获得所述AI模型的权重。
在一些可能的实现方式中,所述第一训练单元具体用于:
根据相关性度量函数获取所述第一权重和所述第二权重的第一相关度;
根据所述第一相关度以及所述第一权重和所述第二权重,获得所述AI模型的权重。
在一些可能的实现方式中,所述第一训练单元还用于:
根据相关性度量函数获取由所述第一权重和所述第二权重确定的综合权重与第三权重的第二相关度,所述第三权重为所述第一训练单元和所述至少一个第二训练单元上一次异步更新的权重;
根据所述第二相关度、所述综合权重与所述第三权重的差值以及上一次全局更新的变化量,获得本次全局更新的变化量;
所述第一训练单元具体用于:
根据所述本次全局更新的变化量、所述第一相关度、所述第一权重和所述第二权重,获得所述AI模型的权重。
在一些可能的实现方式中,所述训练单元中的多个训练子单元采用改进的参数服务器架构或环形架构进行同步。
在一些可能的实现方式中,所述训练单元为计算集群,所述训练子单元为所述计算集群中的服务器。
在一些可能的实现方式中,所述训练单元为服务器,所述训练子单元为所述服务器中的训练卡。
第三方面,本申请提供一种计算集群。所述计算集群包括至少一台计算设备,所述至少一台计算设备包括至少一个处理器和至少一个存储器。所述至少一个处理器、所述至少一个存储器进行相互的通信。所述至少一个处理器用于执行所述至少一个存储器中存储的指令,以使得计算设备或计算集群执行如第一方面或第一方面的任一种实现方式所述的模型训练方法。
第四方面,本申请提供一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,所述指令指示计算设备或计算集群执行上述第一方面或第一方面的任一种实现方式所述的模型训练方法。
第五方面,本申请提供了一种包含指令的计算机程序产品,当其在计算设备或计算集群上运行时,使得计算设备或计算集群执行上述第一方面或第一方面的任一种实现方式所述的模型训练方法。
本申请在上述各方面提供的实现方式的基础上,还可以进行进一步组合以提供更多实现方式。
附图说明
为了更清楚地说明本申请实施例的技术方法,下面将对实施例中所需使用的附图作以简单地介绍。
图1为本申请实施例提供的一种AI模型的训练流程以及流水图;
图2为本申请实施例提供的一种模型训练系统的架构示意图;
图3为本申请实施例提供的一种AI模型的训练方法的流程图;
图4为本申请实施例提供的一种AI模型的训练流水图;
图5为本申请实施例提供的一种模型训练系统的结构图;
图6为本申请实施例提供的一种计算集群的结构示意图;
图7为本申请实施例提供的一种计算集群的结构示意图;
图8为本申请实施例提供的一种计算集群的结构示意图。
具体实施方式
本申请实施例中的术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。
首先对本申请实施例中所涉及到的一些技术术语进行介绍。
AI模型,具体是指通过AI算法如机器学习算法训练得到的、用于完成AI任务的模型。其中,AI模型可以是通用的大模型,也可以是用于完成特定任务的模型。特定任务包括但不限于图像分类、目标检测、文本识别、语音识别。
权重,在AI领域,通常是一组浮点数,用于作为AI模型所采用的神经网络的主要参数。通常情况下,权重可以在AI模型训练时参与运算,并在反向传播阶段被更新。需要说明的是,不同AI模型的权重规模(也可以称作参数规模)可以是不同的。例如,一些AI大模型的权重规模可以达到千亿级别、十万亿级别。
对于权重规模较大的AI模型,可以采用多台训练设备协同训练。例如,针对权重规模在十万亿级别的AI大模型,可以采用多个计算中心的多台训练服务器进行跨计算中心的协同训练,以提高训练效率。
以两个计算中心的协同训练为例，如图1所示，每个计算中心可以采用主从架构进行协同训练，基于此，每个计算中心中的节点可以分为master节点和worker节点两种类型。在进行协同训练时，每个计算中心内部先向master节点进行一次权重同步，例如，计算中心1中worker a、worker b、worker c等节点向Master 1发送Wa、Wb、Wc，Master 1根据各个worker节点发送的权重获得计算中心1内部同步后的权重W1，其中，Master1可以是将各个worker的权重累计获得W1，同理计算中心2可以采用类似方式获得计算中心2内部同步后的权重W2，然后两个计算中心之间的master节点进行一次权重同步，例如，计算中心1向计算中心2发送W1，接收计算中心2发送的W2，如此，两个计算中心可以获得计算中心之间同步后的权重W，接着再在各个计算中心内部进行一次同步更新，例如Master1向计算中心1中的worker a、worker b、worker c下发权重W，Master2向计算中心2中的worker a、worker b、worker c下发权重W。
由于计算中心之间的网络通信带宽低时延高,计算中心间传输的数据量又非常大,由此产生较长的同步等待时间。例如,参见图1,某个计算中心的worker进行前向运算耗时约为177毫秒(millisecond,ms),进行反向传播耗时约为279ms,进行内部梯度更新耗时约147ms,在上述阶段,master处于等待状态(图1中第二或第四个时间轴中的“WT”字符对应的时间段)。然后master跨计算中心同步权重,master之间进行跨计算中心通信,耗时约1150ms,此时各计算中心的worker处于等待状态,在master通信完成后,各worker可以进行梯度更新。如此,即使增加了训练设备的数量,也可能因较长的同步等待时间导致训练总时间难以大幅降低,甚至训练总时间可能会增加,影响了训练效率,增加了训练成 本。
有鉴于此,本申请实施例提供了一种AI模型的训练方法。该方法可以应用于模型训练系统。模型训练系统包括多个训练单元,为了便于描述,本申请实施例将其中一个训练单元称作第一训练单元,将第一训练单元之外的其他训练单元称作第二训练单元。其中,第一训练单元可以包括多个第一训练子单元,类似地,第二训练单元也可以包括多个第二训练子单元。
具体地,第一训练单元接收第一训练子任务,然后第一训练单元通过多个第一训练子单元执行所述第一训练子任务,获得所述多个第一训练子单元同步后的第一权重,以及异步接收所述至少一个第二训练单元执行第二训练子任务获得的第二权重,接着第一训练单元根据所述第一权重和所述第二权重,获得所述AI模型的权重。
该方法提出了一种分层异步训练方式,即在同一训练单元的多个训练子单元之间通过同步方式进行权重更新,在模型训练系统的不同训练单元之间通过异步方式进行权重更新,解决了由于计算中心之间的带宽有限,采用同步训练的方式会引入不可接受的同步等待时间,无法进行高效训练的问题。
需要说明的是,本申请实施例的模型训练系统中训练单元、训练子单元的粒度可以根据需要训练的AI模型的权重规模确定。例如,权重规模较大,需要较多的计算设备参与训练时,训练单元可以是计算集群,该计算集群可以是计算中心,训练子单元可以是计算中心中的服务器。又例如,权重规模较小,单个计算中心即可完成AI模型的训练时,训练单元可以是服务器,训练子单元可以是服务器中的训练卡。该训练卡是指用于训练AI模型的处理器,包括图形处理器(graphics processing unit,GPU)、神经网络处理器(neural network processing unit,NPU)。
为了使得本申请的技术方案更加清楚、易于理解,下文以训练单元为计算中心,训练子单元为服务器,对本申请实施例的系统架构进行介绍。
参见图2所示的模型训练系统的架构示意图,模型训练系统包括多个计算中心,图2中以模型训练系统包括计算中心1和计算中心2共计2个计算中心进行示例说明。每个计算中心包括多台服务器,不同计算中心中服务器的计算架构可以相同,也可以不同。例如,计算中心1中服务器可以采用四卡架构,计算中心2中服务器可以采用八卡架构。
不同计算中心之间可以通过交换机网络互连。其中,交换机网络可以包括多台交换机。其中,计算中心之间的交换机网络的通常是低带宽高时延,计算中心内部的交换机网络(图2中未示出)通常是高带宽低时延。
基于此,计算中心内部可以进行同步更新,计算中心1和计算中心2之间可以进行异步更新,由此实现分层异步训练。具体地,训练任务可以拆分成多个训练子任务,图1中以拆分为训练子任务1和训练子任务2进行示例说明。计算中心1接收训练子任务1,计算中心1可以通过多个服务器执行训练子任务1,获得多个服务器同步后的第一权重,以及异步接收计算中心2执行训练子任务2获得的第二权重,然后计算中心1根据上述第一权重和第二权重,获得AI模型的权重。
与计算中心1类似,计算中心2可以异步接收第一计算中心执行训练子任务1获得的第一权重,根据第一权重以及计算中心2通过多个服务器执行训练子任务2,获得的多个服务器同步后的第二权重,获得AI模型的权重。
图2是以跨计算中心的协同训练进行示例说明。当AI模型的权重规模相对较小时,也可以通过单个计算中心的多台服务器协同训练。其中,服务器中多张训练卡的网络为高速网络,具有较高带宽、较低时延。基于此,同一服务器的多张训练卡之间可以同步更新,不同服务器之间可以异步更新。
基于本申请实施例提供的模型训练系统,本申请实施例还提供了一种AI模型的训练方法,下面从模型训练系统的角度对本申请实施例的AI模型的训练方法进行介绍。
参见图3所示的AI模型的训练方法的流程图，模型训练系统包括第一训练单元和至少一个第二训练单元。例如，第一训练单元可以是图2中的计算中心1，第二训练单元可以是图2中的计算中心2。第一训练单元包括多个第一训练子单元。第二训练单元可以包括多个第二训练子单元。例如，第一训练子单元可以是计算中心1中的服务器，第二训练子单元可以是计算中心2中的服务器。该方法包括如下步骤：
S302:第一训练单元接收第一训练子任务。
第一训练子任务是用于训练AI模型的任务拆分得到的多个训练子任务中的一个。训练子任务的数量可以等于训练单元的数量。第一训练子任务具体为多个训练子任务中被调度至第一训练单元的训练子任务。
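As an illustrative, non-authoritative sketch of how a training task might be split into one subtask per training unit (the text above does not prescribe a particular splitting scheme), the following Python snippet shards the sample set round-robin across the units; the function name and the sharding rule are assumptions introduced here.

```python
def split_training_task(samples, num_units):
    # One training subtask per training unit: an (approximately) even,
    # round-robin shard of the sample set. The sharding rule is an assumption.
    return [samples[i::num_units] for i in range(num_units)]

# Example: two training units (e.g. two computing centers) receive subtask 1 and subtask 2.
samples = list(range(10))
first_subtask, second_subtask = split_training_task(samples, num_units=2)
```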
S304:第一训练单元通过多个第一训练子单元执行第一训练子任务,获得多个第一训练子单元同步后的第一权重。
第一训练单元可以通过多个第一训练子单元并行地执行第一训练子任务,获得每个第一训练子单元执行第一训练子任务所得的权重,然后对多个第一训练子单元得到的权重进行同步更新,获得第一权重。
其中,第一训练单元包括master节点,例如是图2中所示的master1。多个第一训练子单元(如计算中心1中的worker a、worker b或worker c)可以将各自训练得到的权重(如Wa、Wb、Wc)上报至master节点,master节点可以根据worker节点如worker a、worker b或worker c上报的权重,获得第一权重。
需要说明的是,多个第一训练子单元采用不同同步方式,获得第一权重。下面分别进行详细说明。
第一种同步方式,可以是基于改进的参数服务器(parameter server,PS)架构的同步方式。具体地,每个训练子单元(如worker节点)具有完整的AI模型,每个训练子单元基于各自分配到的数据训练AI模型。区别于传统的PS架构,各个worker训练完一个step得到梯度后即基于该梯度更新权重,然后上传权重至master节点,master节点根据各worker节点上报的权重进行求和或求平均值操作,从而获得训练子单元同步后的权重。由于worker节点无需上报梯度,等待parameter server汇总各个worker上报的梯度获得更新后的权重,以及接收parameter server下发的更新后的权重,减少了worker和parameter server的通信次数,进而减少了通信时长和通信开销。
第二种同步方式,可以是基于环形架构的同步方式。其中,环形架构可以是Ring All Reduce架构。在该架构中没有parameter server,所以worker组成一个环形,每个worker和另外两个worker相连。每个worker都只和相邻的两个worker进行信息传递。每个worker上都有一份完整的模型参数,并进行梯度计算和更新。Ring All Reduce架构主要分为scatter reduce和allgather两个步骤。假设一共使用了5个worker,那么在scatter reduce步骤中,将每个worker上计算出来的梯度分割成5等份,即权重的梯度分成5份,每个worker都使用相同的分割方法。然后通过worker之间的5次通信,让每个worker上都有一部分参数的梯度是完整的。如此可以使得每个worker上都有一部分权重,能够融合所有其他worker上的梯度,得到一份完整的梯度。需要说明的是,Ring All Reduce架构下,完成内部同步的任一个节点均可以作为master节点,进行训练单元之间的权重更新。
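A minimal Python sketch of the intra-unit synchronization described above is given below, assuming the improved parameter-server style update: each worker applies its own gradient locally and uploads weights rather than gradients, and the master averages the uploaded weights. All identifiers are illustrative and not taken from the patent text.

```python
import numpy as np

def local_step(weights, grad, lr=0.01):
    # Each worker applies its own gradient locally and keeps a full weight copy,
    # instead of uploading the gradient to a parameter server.
    return weights - lr * grad

def master_aggregate(worker_weights):
    # The master combines the weights uploaded by the workers; averaging is used
    # here, summation would match the text equally well.
    return np.mean(np.stack(worker_weights), axis=0)

# Example: three workers inside one training unit (e.g. one computing center).
rng = np.random.default_rng(0)
w = rng.normal(size=8)                            # shared initial weights
grads = [rng.normal(size=8) for _ in range(3)]    # per-worker gradients for one step
uploaded = [local_step(w, g) for g in grads]      # workers update and upload weights
w_synced = master_aggregate(uploaded)             # e.g. the first weight W1
```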
S306:第二训练单元接收第二训练子任务。
与第一训练子任务类似，第二训练子任务是用于训练AI模型的任务拆分得到的多个训练子任务中的一个。其中，第二训练子任务具体为多个训练子任务中被调度至第二训练单元的训练子任务。
S308:第二训练单元通过多个第二训练子单元执行第二训练子任务，获得多个第二训练子单元同步后的第二权重。
第二训练单元可以通过多个第二训练子单元并行地执行第二训练子任务,获得每个第二训练子单元执行第二训练子任务所得的权重,然后对多个第二训练子单元得到的权重进行同步更新,获得第二权重。
其中,第二训练单元包括master节点,例如是图2中所示的master2。多个第二训练子单元(如worker a、worker b或worker c)可以将各自训练得到的权重(如计算中心2中的Wa、Wb、Wc)上报至master节点(如master2),master节点可以根据worker节点如worker a、worker b或worker c上报的权重,获得第二权重。
需要说明的是,第二训练单元的多个第二训练子单元可以采用与第一训练子单元相似的同步方式,在此不再赘述。
S310:第一训练单元异步接收第二权重。
异步是指执行一个任务时无需等待另一个任务结束。在本申请实施例中，第一训练单元可以采用异步方式接收第二权重。基于此，第一训练单元可以在执行S304过程中，或者在执行S304之前，接收第二权重，而不必等待S304执行完毕后，再执行上述S310。如此可以避免不必要的同步等待时间，缩短整体的训练时间，提高训练效率。
进一步地,为了减小训练单元之间的传输开销,第二训练单元还可以对第二训练子单元同步后的第二权重进行压缩,获得压缩后的第二权重。相应地,第一训练单元可以异步接收压缩后的第二权重,如此可以较大程度地减少传输开销,缩短训练单元之间的传输时间。
其中,第二训练单元可以采用不同方式对多个第二训练子单元同步后的第二权重进行压缩。下面对权重压缩方式分别进行详细说明。
第一种方式为基于差值的压缩方式。具体地,第二训练单元可以确定多个第二训练子单元本次同步后的第二权重与上一次同步后的第二权重的差值,然后根据差值对多个第二训练子单元本次同步后的第二权重进行压缩。
以第二训练单元为第i个计算中心，当前训练步为第k步，多个第二训练子单元在本次同步后的第二权重记作Wk，多个第二训练子单元在上一次同步后的第二权重记作Wk-1，第二训练单元可以通过如下方式获得压缩后的第二权重：
select(Wk-Wk-1,th(k))                        (1)
其中，th(k)表示预设阈值，select()表示选择函数，具体是从Wk-Wk-1形成的差值矩阵中选择出差值大于预设阈值的元素，然后将其他元素的元素值设置为0，从而获得相应稀疏矩阵。压缩后的第二权重可以为上述稀疏矩阵。
第二种方式为基于范数的压缩方式。具体地,第二训练单元确定所述多个第二训练子单元本次同步后的第二权重中各行或各列权重的范数,例如是L2范数,然后第二训练单元可以根据所述范数对所述多个第二训练子单元本次同步后的第二权重进行压缩。其中,第二训练单元可以根据各行或各列权重的范数,筛选出满足条件的目标行或目标列,根据筛选得到目标行或目标列获得压缩后的第二权重。
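The two compression mechanisms of S310 can be sketched as follows. This is a hedged illustration: the threshold schedule th(k) and the "keep the top rows by L2 norm" criterion are assumptions, since the text only states that a difference threshold, or a row/column norm, is used to decide which weights to keep.

```python
import numpy as np

def compress_by_difference(w_curr, w_prev, threshold):
    # Keep only entries whose change since the last synchronization exceeds the
    # threshold th(k); all other entries are zeroed, giving a sparse matrix.
    diff = w_curr - w_prev
    mask = np.abs(diff) > threshold
    return np.where(mask, w_curr, 0.0), mask

def compress_by_row_norm(w_curr, keep_ratio=0.5):
    # Rank rows by L2 norm and keep only the strongest ones; "rows that satisfy
    # the condition" is interpreted here as top-k by norm, which is an assumption.
    norms = np.linalg.norm(w_curr, axis=1)
    k = max(1, int(keep_ratio * w_curr.shape[0]))
    keep_rows = np.argsort(norms)[-k:]
    compressed = np.zeros_like(w_curr)
    compressed[keep_rows] = w_curr[keep_rows]
    return compressed, keep_rows
```

In practice only the retained entries and their indices would be transmitted to the peer training unit, which is what shortens the inter-unit transfer.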
S312:第二训练单元异步接收第一权重。
与第一训练单元异步接收第二权重类似,第二训练单元可以采用异步方式接收第一权重。基于此,第二训练单元可以在执行S308过程中,或者在执行S308之前,接收第一权重,而不必等待S308执行完毕后,再执行上述S312。如此可以避免不必要的同步等待时间,缩短整体的训练时间,提高训练效率。
进一步地,为了减小训练单元之间的传输开销,第一训练单元还可以对第一训练子单元同步后的第一权重进行压缩,获得压缩后的第一权重。相应地,第二训练单元可以异步接收压缩后的第一权重,如此可以较大程度地减少传输开销,缩短训练单元之间的传输时间。
其中,第一训练单元压缩第一权重的方式与第二训练单元压缩第二权重的方式类似,可以参见S310相关内容描述,在此不再赘述。
S314:第一训练单元根据第一权重和第二权重,获得AI模型的权重。
具体地,第一权重和第二权重可以为AI模型的完整权重,基于此,第一训练单元可以对第一权重和第二权重执行求平均值操作,从而获得训练单元之间更新后的权重。当训练停止条件未被满足时,可以继续执行下一个训练步,直至训练停止条件被满足,可以将训练停止时训练单元之间更新的权重作为AI模型的权重。
第一训练单元也可以根据相关性度量函数,获取所述第一权重和所述第二权重的第一相关度。其中,相关性度量函数可以根据经验设置,例如可以采用余弦相似度构建相关性度量函数,具体如下:
g(x,y)=1-xy/|x||y|                        (2)
其中，x、y分别表示参与相关度计算的两个物理量，假设第一权重为θk，第二权重为θt，将上述第一权重和第二权重代入上述公式(2)，可以得到第一相关度g(θt,θk)。
相应地,第一训练单元可以根据所述第一相关度以及所述第一权重和所述第二权重,获得所述AI模型的权重。其中,第一训练单元可以根据第一相关度确定第一权重和第二权重的系数,然后根据该系数对第一权重、第二权重进行加权求和,从而获得AI模型的权重,具体如下:
θS+1=g(θt,θk)θt+(1-g(θt,θk))θk                            (3)
其中,θS+1表示训练单元之间更新后的权重。
进一步地，为了避免对模型权重的平均策略导致性能下降，第一训练单元还可以结合历史更新信息进行训练单元之间的权重更新。历史更新信息可以包括第一训练单元和至少一个第二训练单元上一次异步更新的第三权重以及上一次全局更新的变化量。其中，第三权重可以记作θS，上一次全局更新的变化量记作Δn-1。
基于此，第一训练单元可以先根据相关性度量函数获取由所述第一权重和所述第二权重确定的综合权重与第三权重的第二相关度。其中，综合权重可以是第一权重和第二权重的平均值，第二相关度可以为：
g((θt+θk)/2,θS)                        (4)
然后,第一训练单元可以根据所述第二相关度、所述综合权重与所述第三权重的差值以及上一次全局更新的变化量,获得本次全局更新的变化量,具体如下所示:
接着，第一训练单元可以根据所述本次全局更新的变化量、所述第一相关度、所述第一权重和所述第二权重，获得所述AI模型的权重。其中，第一训练单元可以先根据所述本次全局更新的变化量、所述第一相关度、所述第一权重和所述第二权重，获得第一训练单元和第二训练单元本次更新的权重，具体如下：
θS+1=Δn+g(θt,θk)θt+(1-g(θt,θk))θk               (6)
当训练停止条件被满足时,第一训练单元可以停止训练,将本次更新的权重作为AI模型的权重;当训练停止条件未被满足时,第一训练单元可以继续训练,直至训练停止条件被满足,获取训练停止条件被满足时训练单元间更新的权重作为AI模型的权重。
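Formulas (2), (3), (4) and (6) above can be combined into a small aggregation routine. The sketch below is an assumption-laden illustration: formula (5) is not legible in this text, so the global change Δn is assumed here to scale the drift between the comprehensive weight and the previously synchronized weight by the second correlation and add the previous change; the variable names are ours, not the patent's.

```python
import numpy as np

def g(x, y, eps=1e-12):
    # Correlation measure of formula (2): g(x, y) = 1 - x·y / (|x||y|).
    return 1.0 - float(np.dot(x, y)) / (np.linalg.norm(x) * np.linalg.norm(y) + eps)

def basic_update(theta_t, theta_k):
    # Formula (3): combine the remote weight theta_t and the local weight theta_k.
    a = g(theta_t, theta_k)
    return a * theta_t + (1.0 - a) * theta_k

def history_update(theta_t, theta_k, theta_s, delta_prev):
    # Formula (6) with a history term. Formula (5) is not shown above, so the
    # global change delta_n is ASSUMED to scale the drift from the previous
    # asynchronous update by the second correlation and add the previous change.
    theta_bar = 0.5 * (theta_t + theta_k)                  # comprehensive weight
    corr2 = g(theta_bar, theta_s)                          # formula (4)
    delta_n = corr2 * (theta_bar - theta_s) + delta_prev   # assumed form of (5)
    a = g(theta_t, theta_k)                                # first correlation
    theta_next = delta_n + a * theta_t + (1.0 - a) * theta_k   # formula (6)
    return theta_next, delta_n
```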
在一些可能的实现方式中,第一训练单元还可以判断第二训练单元发送的第二权重或压缩后的第二权重的有效性。以第二训练单元直接发送第二权重的情况进行示例说明。当第二权重为无效权重时,第一训练单元可以放弃本次更新,等待第二训练单元下一次发送的第二权重,并继续判断下一次发送的第二权重的有效性,直至接收到的第二权重为有效权重时,根据该有效权重进行权重更新。
具体地,第一训练单元可以获取第一训练单元与至少一个第二训练单元上一次异步更新的第三权重,确定由所述第一权重和所述第二权重确定的综合权重与所述第三权重的距离。其中,综合权重可以是上述第一权重和第二权重的平均值。需要说明的是,对第一权重和第二权重求平均值仅仅是获得综合权重的一种实现方式,在本申请实施例其他可能的实现方式中,也可以通过其他方式获得综合权重。综合权重与第三权重的距离可以是余弦距离或欧式距离中的一种或多种。为了便于描述,本申请实施例以余弦距离示例说明。基于此,综合权重和第三权重的距离可以通过上述第二相关度表征。
当所述综合权重与所述第三权重的距离大于预设距离，第一训练单元可以进行权重更新。例如，第一训练单元可以根据上述第一权重和第二权重，获得所述AI模型的权重。当综合权重与第三权重的距离不大于预设距离，第一训练单元放弃本次更新，待权重累计到一定差异，使得综合权重与第三权重的距离大于预设距离时，再进行权重更新。
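A hedged sketch of this validity check: the distance between the comprehensive weight and the third weight is measured (cosine distance here, per the example in the text), and the asynchronous update is applied only when it exceeds a preset distance; the concrete threshold value is an assumption.

```python
import numpy as np

def cosine_distance(x, y, eps=1e-12):
    # Same form as the correlation measure g(x, y) = 1 - x·y / (|x||y|).
    return 1.0 - float(np.dot(x, y)) / (np.linalg.norm(x) * np.linalg.norm(y) + eps)

def maybe_update(theta_t, theta_k, theta_s, preset_distance=1e-3):
    # Skip this round if the comprehensive weight is still too close to the weight
    # of the previous inter-unit update; wait until enough difference accumulates.
    theta_bar = 0.5 * (theta_t + theta_k)
    if cosine_distance(theta_bar, theta_s) <= preset_distance:
        return None                               # discard this update, wait for the next one
    a = cosine_distance(theta_t, theta_k)         # first correlation, formula (2)
    return a * theta_t + (1.0 - a) * theta_k      # proceed with formula (3)
```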
S316:第二训练单元根据第一权重和第二权重,获得AI模型的权重。
与第一训练单元类似，第二训练单元可以对第一权重和第二权重执行求平均值操作，从而获得训练单元之间更新后的权重。当训练停止条件未被满足时，可以继续执行下一个训练步，直至训练停止条件被满足，可以将训练停止时训练单元之间更新的权重作为AI模型的权重。
第二训练单元也可以根据相关性度量函数,获取所述第一权重和所述第二权重的第一相关度,然后第二训练单元可以根据第一相关度以及第一权重和第二权重,获得AI模型的权重。其中,第二训练单元根据第一相关度、第一权重、第二权重,获得AI模型的权重的具体实现可以参见上述公式(3)相关内容描述,在此不再赘述。
进一步地,第二训练单元可以结合历史更新信息进行训练单元之间的权重更新,从而避免对模型权重的平均策略导致性能下降。第二训练单元结合历史更新信息进行训练单元之间的权重更新过程可以参见S314相关内容描述,在此不再赘述。
为了便于理解，本申请实施例还提供了AI模型的训练方法的流水图，如图4所示，第一训练单元可以为计算中心1，记作DC1，第二训练单元可以为计算中心2，记作DC2，DC1中的worker节点以及DC2中的worker节点可以进行前向运算、反向传播，然后分别进行梯度更新，并进行DC内的权重同步，获得同步后的权重W1、W2。DC1的worker可以进行下一个训练步的前向运算，DC1的master可以异步接收来自DC2的W2，类似地，DC2的worker可以进行下一个训练步的前向运算，DC2的master可以异步接收来自DC1的W1，如此，DC1的master可以在该下一个训练步的权重同步时，结合异步接收的W2进行权重更新，类似地，DC2的master可以在该下一个训练步的权重同步时，结合异步接收的W1进行权重更新。与图1相比，该方式明显地缩短了同步等待时间，缩短了整体的训练时间，提高了训练效率。
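To make the pipeline of Figure 4 concrete, here is a schematic, deliberately simplified Python loop for one training unit's master: intra-unit training and synchronization proceed step by step, while the peer unit's weights arrive through a non-blocking queue and are merged only when available. The queue, the callback names and the merge policy are assumptions standing in for the actual inter-center transport.

```python
import queue

def master_loop(inbox, train_one_step, merge, steps=100):
    # inbox:          queue delivering the peer unit's synchronized weights (e.g. W2)
    # train_one_step: runs forward/backward on the local workers and returns the
    #                 intra-unit synchronized weight (e.g. W1)
    # merge:          combines the local weight with an asynchronously received one
    w_global = None
    for _ in range(steps):
        w_local = train_one_step()          # synchronous update inside the unit
        try:
            w_remote = inbox.get_nowait()   # non-blocking: no inter-unit waiting
        except queue.Empty:
            w_remote = None                 # peer not ready yet; keep training
        w_global = merge(w_local, w_remote) if w_remote is not None else w_local
    return w_global
```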
需要说明的是,上述S306、S308、S312、S316为本申请实施例的可选步骤,执行本申请实施例的方法也可以采用其他方式。例如,第二训练单元也可以包括一个第二训练子单元。如此,第二训练单元可以将第二训练子单元训练得到的权重发送至第一训练单元,以进行权重更新。
基于上述内容描述可知,本申请实施例提供了一种分层的混合异步更新的训练方法,实现单独训练单元内同步更新、跨训练单元间异步更新的混合更新策略,如此避免了不可接受的同步等待时长,大幅缩短了训练时间。而且,该方法可以通过启发式算法自适应调节训练单元间更新频率,在通信传输时引入选择性传输机制,在权重更新时引入将历史更新信息与当前更新信息结合的自适应聚合方式,在保证精度的前提下降低了通信成本。并且,结合历史更新信息对权重进行更新可以保证收敛的精度,进而保障AI模型的性能。
基于本申请实施例提供的AI模型的训练方法,本申请实施例还提供一种如前述的模型训练系统。
参见图5所示的模型训练系统500的结构示意图,模型训练系统500包括第一训练单元502和至少一个第二训练单元504。所述第一训练单元包括多个第一训练子单元5020:
所述第一训练单元502,用于接收第一训练子任务,通过多个第一训练子单元5020执行所述第一训练子任务,获得所述多个第一训练子单元同步后的第一权重;
所述第二训练单元504,用于执行第二训练子任务获得第二权重;
所述第一训练单元502,还用于异步接收所述第二权重,根据所述第一权重和所述第二权重,获得所述AI模型的权重。
其中,第一训练单元502可以包括如下功能模块:
通信模块5022,用于接收第一训练子任务;
任务执行模块5024,用于通过多个第一训练子单元5020执行所述第一训练子任务,获得所述多个第一训练子单元5020同步后的第一权重;
所述通信模块5022，还用于异步接收至少一个第二训练单元504执行第二训练子任务获得的第二权重；
权重更新模块5026,用于根据第一权重和所述第二权重,获得所述AI模型的权重。
上述通信模块5022、任务执行模块5024、权重更新模块5026可以通过硬件模块实现或通过软件模块实现。
当通过软件实现时,通信模块5022、任务执行模块5024、权重更新模块5026可以是运行在计算设备(例如是服务器)或计算集群(例如是计算中心)上的应用程序或者应用程序模块。
当通过硬件实现时,通信模块5022可以通过网络接口卡、收发器一类的收发模块实现。任务执行模块5024、权重更新模块5026可以是利用专用集成电路(application-specific integrated circuit,ASIC)实现、或可编程逻辑器件(programmable logic device,PLD)实现的设备等。其中,上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD)、现场可编程门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其任意组合实现。
在一些可能的实现方式中,所述第二训练单元504包括多个第二训练子单元5040;
所述第二训练单元504具体用于:
通过所述多个第二训练子单元5040执行所述第二训练子任务,获得所述多个第二训练子单元同步后的第二权重。
其中,第二训练单元504可以包括如下功能模块:
通信模块5042,用于接收第二训练子任务;
任务执行模块5044,用于通过多个第二训练子单元5040执行所述第二训练子任务,获得所述多个第二训练子单元5040同步后的第二权重;
所述通信模块5042,还用于异步发送第二权重,以及异步接收第一权重;
权重更新模块5046,用于根据第一权重和所述第二权重,获得所述AI模型的权重。
与第一训练单元502的各个模块类似,上述第二训练单元504中的通信模块5042、任务执行模块5044、权重更新模块5046可以通过硬件模块实现或通过软件模块实现。
当通过软件实现时,通信模块5042、任务执行模块5044、权重更新模块5046可以是运行在计算设备(例如是服务器)或计算集群(例如是计算中心)上的应用程序或者应用程序模块。
当通过硬件实现时,通信模块5042可以通过网络接口卡、收发器一类的收发模块实现。任务执行模块5044、权重更新模块5046可以是利用专用集成电路ASIC实现、或可编程逻辑器件PLD实现的设备等。其中,上述PLD可以是复杂程序逻辑器件CPLD、现场可编程门阵列FPGA、通用阵列逻辑GAL或其任意组合实现。
在一些可能的实现方式中,所述第二训练单元504还包括:
压缩模块5048,用于对多个第二训练子单元5040同步后的第二权重进行压缩;
所述第一训练单元502中的通信模块5022具体用于:
异步接收压缩后的所述第二权重。
在一些可能的实现方式中,所述第二训练单元504(例如是第二训练单元504中的压缩模块5048)具体用于:
确定所述多个第二训练子单元本次同步后的第二权重与上一次同步后的第二权重的差值,根据所述差值对所述多个第二训练子单元本次同步后的第二权重进行压缩;或者,
确定所述多个第二训练子单元本次同步后的第二权重中各行或各列权重的范数,根据所述范数对所述多个第二训练子单元本次同步后的第二权重进行压缩。
在一些可能的实现方式中,第一训练单元502还包括:
压缩模块5028,用于对多个第一训练子单元5020同步后的第一权重进行压缩;
通信模块5022,还用于异步发送压缩后的第一权重至第二训练单元504,以便于第二训练单元504进行权重更新。
与第二训练单元504中的压缩模块5048类似,第一训练单元502中的压缩模块5028可以基于权重的差值进行压缩,或者基于各行或各列权重的范数进行压缩。
其中,压缩模块5028、压缩模块5048可以通过硬件模块实现或通过软件模块实现。
当通过软件实现时,压缩模块5028、压缩模块5048可以是运行在计算设备(例如是服务器)或计算集群(例如是计算中心)上的应用程序或者应用程序模块。当通过硬件实现时,压缩模块5028、压缩模块5048可以是利用专用集成电路ASIC实现、或可编程逻辑器件PLD实现的设备等。
在一些可能的实现方式中,所述第一训练单元502还包括:
距离确定模块5029,用于获取所述第一训练单元与所述至少一个第二训练单元上一次异步更新的第三权重,确定由所述第一权重和所述第二权重确定的综合权重与所述第三权重的距离;
所述第一训练单元502(例如是第一训练单元502中的权重更新模块5026)具体用于:
当所述综合权重与所述第三权重的距离大于预设距离,根据所述第一权重和所述第二权重,获得所述AI模型的权重。
与第一训练单元502类似,第二训练单元504也可以包括距离确定模块5049。距离确定模块5049用于获取第二训练单元504与第一训练单元502上一次异步更新的第三权重,确定由第一权重和第二权重确定的综合权重与第三权重的距离。权重更新模块5046具体用于:当综合权重与第三权重的距离大于预设距离,根据第一权重和第二权重,获得AI模型的权重。
其中,距离确定模块5029、距离确定模块5049可以通过硬件模块实现或通过软件模块实现。
当通过软件实现时，距离确定模块5029、距离确定模块5049可以是运行在计算设备（例如是服务器）或计算集群（例如是计算中心）上的应用程序或者应用程序模块。当通过硬件实现时，距离确定模块5029、距离确定模块5049可以是利用专用集成电路ASIC实现、或可编程逻辑器件PLD实现的设备等。
在一些可能的实现方式中,所述第一训练单元502(例如是第一训练单元502中的权重更新模块5026)具体用于:
根据相关性度量函数获取所述第一权重和所述第二权重的第一相关度;
根据所述第一相关度以及所述第一权重和所述第二权重,获得所述AI模型的权重。
在一些可能的实现方式中,第一训练单元502(例如是第一训练单元502中的权重更新模块5026)还用于:
根据相关性度量函数获取由所述第一权重和所述第二权重确定的综合权重与第三权重的第二相关度,所述第三权重为所述第一训练单元和所述至少一个第二训练单元上一次异步更新的权重;
根据所述第二相关度、所述综合权重与所述第三权重的差值以及上一次全局更新的变化量,获得本次全局更新的变化量;
相应地,第一训练单元502(例如是第一训练单元502中的权重更新模块5026)具体用于:
根据所述本次全局更新的变化量、所述第一相关度、所述第一权重和所述第二权重,获得所述AI模型的权重。
在一些可能的实现方式中,所述训练单元(例如是第一训练单元502或第二训练单元504)中的多个训练子单元(例如是第一训练子单元5020或第二训练子单元5040)可以采用改进的参数服务器架构或环形架构进行同步。
在一些可能的实现方式中,所述训练单元(例如是第一训练单元502或第二训练单元504)可以为计算集群,该计算集群可以是包括多台服务器的计算中心,所述训练子单元(例如是第一训练子单元5020或第二训练子单元5040)可以为所述计算集群中的服务器。如此以实现跨计算中心的协同训练。
在一些可能的实现方式中,所述训练单元可以为服务器,所述训练子单元可以为所述服务器中的训练卡。如此可以实现计算中心内的协同训练。
本申请实施例还提供了一种计算集群。该计算集群可以包括至少一台计算设备。计算设备600可以是服务器或终端设备。终端设备包括但不限于台式机、笔记本电脑或者智能手机。如图6所示,计算设备600包括:总线602、处理器604、存储器606和通信接口608。处理器604、存储器606和通信接口608之间通过总线602通信。应理解,本申请不限定计算设备600中的处理器、存储器的个数。
总线602可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示,图6中仅用一条线表示,但并不表示仅有一根总线或一种类型的总线。总线602可包括在计算设备600各个部件(例如,存储器606、处理器604、通信接口608)之间传送信息的通路。
处理器604可以包括中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)、微处理器(micro processor,MP)或者数字信号处理器(digital signal processor,DSP)等处理器中的任意一种或多种。
存储器606可以包括易失性存储器(volatile memory),例如随机存取存储器(random access memory,RAM)。存储器606还可以包括非易失性存储器(non-volatile memory),例如只读存储器(read-only memory,ROM),快闪存储器,机械硬盘(hard disk drive,HDD)或固态硬盘(solid state drive,SSD)。存储器606中存储有可执行的程序代码,处理器604执行该可执行的程序代码以实现前述AI模型的训练方法。具体的,存储器606上存有模型训练系统500用于执行AI模型的训练方法的指令,例如存储器606上可以存储模型训练系统500中通信模块、任务执行模块、权重更新模块、压缩模块、距离确定模块对应的指令。
通信接口608使用例如但不限于网络接口卡、收发器一类的收发模块,来实现计算设备600与其他设备、计算集群或通信网络之间的通信。
如图6所示,计算集群可以包括多台计算设备600。计算集群中的多台计算设备600中的存储器606中可以存有相同的模型训练系统500用于执行AI模型的训练方法的指令。
在一些可能的实现方式中，该计算集群中的多台计算设备600也可以用于执行模型训练系统500用于执行AI模型的训练方法的部分指令。换言之，多台计算设备600的组合可以共同执行模型训练系统500用于执行AI模型的训练方法的指令。
图7示出了一种可能的实现方式。如图7所示,两个计算设备600A和600B可以通过通信接口608实现连接。计算设备600A中的存储器上存有用于执行第一训练单元502的功能的指令,例如计算设备600A中的存储器上存有通信模块5022、任务执行模块5024、权重更新模块5026、压缩模块5028、距离确定模块5029对应的指令。计算设备600B中的存储器上存有用于执行第二训练单元504的功能的指令,例如计算设备600B中的存储器上存有通信模块5042、任务执行模块5044、权重更新模块5046、压缩模块5048、距离确定模块5049对应的指令。换言之,计算设备600A和600B的存储器606共同存储了模型训练系统600用于执行AI模型的训练方法的指令。
应理解,图7中示出的计算设备600A的功能也可以由多个计算设备600完成。同样,计算设备600B的功能也可以由多个计算设备600完成。
在一些可能的实现方式中,计算集群中的多台计算设备可以通过网络连接。其中,所述网络可以是广域网或局域网等等。图8示出了一种可能的实现方式。如图8所示,两个计算设备600C和600D之间通过网络进行连接。具体地,通过各个计算设备中的通信接口与所述网络进行连接。在这一类可能的实现方式中,计算设备600C中的存储器606中存有执行第一训练单元502的功能的指令,例如计算设备600A中的存储器上存有通信模块5022、任务执行模块5024、权重更新模块5026、压缩模块5028、距离确定模块5029对应的指令。同时,计算设备600D中的存储器606中存有执行第二训练单元504的功能的指令,例如计算设备600B中的存储器上存有通信模块5042、任务执行模块5044、权重更新模块5046、压缩模块5048、距离确定模块5049对应的指令。
应理解，图8中示出的计算设备600C的功能也可以由多个计算设备600完成。同样，计算设备600D的功能也可以由多个计算设备600完成。
本申请实施例还提供了一种计算机可读存储介质。所述计算机可读存储介质可以是计算设备能够存储的任何可用介质或者是包含一个或多个可用介质的数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘)等。该计算机可读存储介质包括指令,所述指令指示计算设备执行上述应用于模型训练系统500的AI模型的训练方法。
本申请实施例还提供了一种包含指令的计算机程序产品。所述计算机程序产品可以是包含指令的,能够运行在计算设备或计算集群上或被储存在任何可用介质中的软件或程序产品。当所述计算机程序产品在计算设备或计算集群上运行时,使得计算设备或计算集群执行上述AI模型的训练方法。
最后应说明的是:以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的保护范围。

Claims (23)

  1. 一种人工智能AI模型的训练方法,其特征在于,应用于模型训练系统,所述模型训练系统包括第一训练单元和至少一个第二训练单元,所述第一训练单元包括多个第一训练子单元,所述方法包括:
    所述第一训练单元接收第一训练子任务;
    所述第一训练单元通过多个第一训练子单元执行所述第一训练子任务,获得所述多个第一训练子单元同步后的第一权重,以及异步接收所述至少一个第二训练单元执行第二训练子任务获得的第二权重;
    所述第一训练单元根据所述第一权重和所述第二权重,获得所述AI模型的权重。
  2. 根据权利要求1所述的方法,其特征在于,所述第二训练单元包括多个第二训练子单元,所述方法包括:
    所述第二训练单元通过所述多个第二训练子单元执行所述第二训练子任务,获得所述多个第二训练子单元同步后的第二权重。
  3. 根据权利要求2所述的方法,其特征在于,所述方法还包括:
    所述第二训练单元对所述多个第二训练子单元同步后的第二权重进行压缩;
    所述第一训练单元异步接收所述至少一个第二训练单元执行第二训练子任务获得的第二权重,包括:
    所述第一训练单元异步接收压缩后的所述第二权重。
  4. 根据权利要求3所述的方法,其特征在于,所述第二训练单元对所述多个第二训练子单元同步后的第二权重进行压缩,包括:
    所述第二训练单元确定所述多个第二训练子单元本次同步后的第二权重与上一次同步后的第二权重的差值,根据所述差值对所述多个第二训练子单元本次同步后的第二权重进行压缩;或者,
    所述第二训练单元确定所述多个第二训练子单元本次同步后的第二权重中各行或各列权重的范数,根据所述范数对所述多个第二训练子单元本次同步后的第二权重进行压缩。
  5. 根据权利要求1至4任一项所述的方法,其特征在于,所述方法还包括:
    所述第一训练单元获取所述第一训练单元与所述至少一个第二训练单元上一次异步更新的第三权重;
    所述第一训练单元确定由所述第一权重和所述第二权重确定的综合权重与所述第三权重的距离;
    所述第一训练单元根据所述第一权重和所述第二权重,获得所述AI模型的权重,包括:
    当所述综合权重与所述第三权重的距离大于预设距离,所述第一训练单元根据所述第一权重和所述第二权重,获得所述AI模型的权重。
  6. 根据权利要求1至5任一项所述的方法,其特征在于,所述第一训练单元根据所述第一权重和所述第二权重,获得所述AI模型的权重,包括:
    所述第一训练单元根据相关性度量函数获取所述第一权重和所述第二权重的第一相关度;
    所述第一训练单元根据所述第一相关度以及所述第一权重和所述第二权重,获得所述AI模型的权重。
  7. 根据权利要求6所述的方法,其特征在于,所述方法还包括:
    所述第一训练单元根据相关性度量函数获取由所述第一权重和所述第二权重确定的综合权重与第三权重的第二相关度,所述第三权重为所述第一训练单元和所述至少一个第二训练单元上一次异步更新的权重;
    所述第一训练单元根据所述第二相关度、所述综合权重与所述第三权重的差值以及上一次全局更新的变化量,获得本次全局更新的变化量;
    所述第一训练单元根据所述第一相关度以及所述第一权重和所述第二权重,获得所述AI模型的权重,包括:
    所述第一训练单元根据所述本次全局更新的变化量、所述第一相关度、所述第一权重和所述第二权重,获得所述AI模型的权重。
  8. 根据权利要求1至7任一项所述的方法,其特征在于,所述训练单元中的多个训练子单元采用改进的参数服务器架构或环形架构进行同步。
  9. 根据权利要求1至8任一项所述的方法，其特征在于，所述训练单元为计算集群，所述训练子单元为所述计算集群中的服务器。
  10. 根据权利要求1至8任一项所述的方法,其特征在于,所述训练单元为服务器,所述训练子单元为所述服务器中的训练卡。
  11. 一种模型训练系统,其特征在于,所述模型训练系统包括第一训练单元和至少一个第二训练单元,所述第一训练单元包括多个第一训练子单元:
    所述第一训练单元,用于接收第一训练子任务,通过多个第一训练子单元执行所述第一训练子任务,获得所述多个第一训练子单元同步后的第一权重;
    所述第二训练单元,用于执行第二训练子任务获得第二权重;
    所述第一训练单元,还用于异步接收所述第二权重,根据所述第一权重和所述第二权重,获得所述AI模型的权重。
  12. 根据权利要求11所述的系统,其特征在于,所述第二训练单元包括多个第二训练子单元;
    所述第二训练单元具体用于:
    通过所述多个第二训练子单元执行所述第二训练子任务,获得所述多个第二训练子单元同步后的第二权重。
  13. 根据权利要求12所述的系统,其特征在于,所述第二训练单元还用于:
    对所述多个第二训练子单元同步后的第二权重进行压缩;
    所述第一训练单元具体用于:
    异步接收压缩后的所述第二权重。
  14. 根据权利要求13所述的系统,其特征在于,所述第二训练单元具体用于:
    确定所述多个第二训练子单元本次同步后的第二权重与上一次同步后的第二权重的差值,根据所述差值对所述多个第二训练子单元本次同步后的第二权重进行压缩;或者,
    确定所述多个第二训练子单元本次同步后的第二权重中各行或各列权重的范数,根据所述范数对所述多个第二训练子单元本次同步后的第二权重进行压缩。
  15. 根据权利要求11至14任一项所述的系统,其特征在于,所述第一训练单元还用于:
    获取所述第一训练单元与所述至少一个第二训练单元上一次异步更新的第三权重;
    确定由所述第一权重和所述第二权重确定的综合权重与所述第三权重的距离;
    所述第一训练单元具体用于:
    当所述综合权重与所述第三权重的距离大于预设距离,所述第一训练单元根据所述第一权重和所述第二权重,获得所述AI模型的权重。
  16. 根据权利要求11至15任一项所述的系统,其特征在于,所述第一训练单元具体用于:
    根据相关性度量函数获取所述第一权重和所述第二权重的第一相关度;
    根据所述第一相关度以及所述第一权重和所述第二权重,获得所述AI模型的权重。
  17. 根据权利要求16所述的系统,其特征在于,所述第一训练单元还用于:
    根据相关性度量函数获取由所述第一权重和所述第二权重确定的综合权重与第三权重的第二相关度,所述第三权重为所述第一训练单元和所述至少一个第二训练单元上一次异步更新的权重;
    根据所述第二相关度、所述综合权重与所述第三权重的差值以及上一次全局更新的变化量,获得本次全局更新的变化量;
    所述第一训练单元具体用于:
    根据所述本次全局更新的变化量、所述第一相关度、所述第一权重和所述第二权重,获得所述AI模型的权重。
  18. 根据权利要求11至17任一项所述的系统,其特征在于,所述训练单元中的多个训练子单元采用改进的参数服务器架构或环形架构进行同步。
  19. 根据权利要求11至18任一项所述的系统,其特征在于,所述训练单元为计算集群,所述训练子单元为所述计算集群中的服务器。
  20. 根据权利要求11至18任一项所述的系统,其特征在于,所述训练单元为服务器,所述训练子单元为所述服务器中的训练卡。
  21. 一种计算集群,其特征在于,所述计算集群包括至少一台计算设备,所述至少一台计算设备包括至少一个处理器和至少一个存储器,所述至少一个存储器中存储有计算机可读指令;所述至少一个处理器执行所述计算机可读指令,以使得所述计算集群执行如权利要求1至10中任一项所述的方法。
  22. 一种计算机可读存储介质,其特征在于,包括计算机可读指令;所述计算机可读指令用于实现权利要求1至10任一项所述的方法。
  23. 一种计算机程序产品,其特征在于,包括计算机可读指令;所述计算机可读指令用于实现权利要求1至10任一项所述的方法。
PCT/CN2023/101357 2022-06-29 2023-06-20 一种人工智能模型的训练方法及相关设备 WO2024001870A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP23808654.0A EP4332837A1 (en) 2022-06-29 2023-06-20 Training method for artificial intelligence model, and related device

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210764337.4 2022-06-29
CN202210764337 2022-06-29
CN202210986001.2 2022-08-16
CN202210986001.2A CN117390442A (zh) 2022-06-29 2022-08-16 一种人工智能模型的训练方法及相关设备

Publications (1)

Publication Number Publication Date
WO2024001870A1 true WO2024001870A1 (zh) 2024-01-04

Family

ID=88978477

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/101357 WO2024001870A1 (zh) 2022-06-29 2023-06-20 一种人工智能模型的训练方法及相关设备

Country Status (2)

Country Link
EP (1) EP4332837A1 (zh)
WO (1) WO2024001870A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460457A (zh) * 2018-03-30 2018-08-28 苏州纳智天地智能科技有限公司 一种面向卷积神经网络的多机多卡混合并行异步训练方法
CN111582494A (zh) * 2020-04-17 2020-08-25 浙江大学 一种基于延迟处理的混合分布式机器学习更新方法
CN112712171A (zh) * 2021-01-12 2021-04-27 湖南工业大学 深度卷积神经网络的分布式训练方法、设备和存储介质
CN113011602A (zh) * 2021-03-03 2021-06-22 中国科学技术大学苏州高等研究院 一种联邦模型训练方法、装置、电子设备和存储介质
CN114548426A (zh) * 2022-02-17 2022-05-27 北京百度网讯科技有限公司 异步联邦学习的方法、业务服务的预测方法、装置及系统

Also Published As

Publication number Publication date
EP4332837A1 (en) 2024-03-06

Similar Documents

Publication Publication Date Title
CN110263921B (zh) 一种联邦学习模型的训练方法及装置
US20190279088A1 (en) Training method, apparatus, chip, and system for neural network model
US20220391771A1 (en) Method, apparatus, and computer device and storage medium for distributed training of machine learning model
Guo et al. Cloud resource scheduling with deep reinforcement learning and imitation learning
CN110134636B (zh) 模型训练方法、服务器和计算机可读存储介质
JP6776696B2 (ja) 並列情報処理装置、情報処理方法、およびプログラム
US9607355B2 (en) Model parallel processing method and apparatus based on multiple graphic processing units
WO2024016542A1 (zh) 信息融合方法、数据通信方法、装置及电子设备和非易失性可读存储介质
Jiang et al. Fedmp: Federated learning through adaptive model pruning in heterogeneous edge computing
WO2022057310A1 (zh) 一种图神经网络训练的方法、装置及系统
JP2022017588A (ja) 深層学習フレームワークのトレーニング方法、装置及び記憶媒体
WO2021238508A1 (zh) 一种数据处理的方法、装置和设备
CN117061365B (zh) 一种节点选择方法、装置、设备及可读存储介质
WO2021115082A1 (zh) 作业调度方法以及作业调度装置
CN116663639B (zh) 一种梯度数据同步方法、系统、装置及介质
WO2024001870A1 (zh) 一种人工智能模型的训练方法及相关设备
CN113553149A (zh) 云服务器集群负载调度方法、系统、终端以及存储介质
CN117390442A (zh) 一种人工智能模型的训练方法及相关设备
CN113656494A (zh) 参数服务器的同步方法、系统及可读存储介质
CN113255902A (zh) 神经网络电路、系统和控制数据流的方法
CN113485796B (zh) 一种基于集群架构的分布式可扩展模拟计算方法
US20230043584A1 (en) Optimization of memory use for efficient neural network execution
CN115987998B (zh) 微服务系统领袖者选举方法、系统、存储介质和电子设备
WO2024001861A1 (zh) 模型训练方法、装置、系统及相关设备
WO2024066791A1 (zh) 数据处理方法、装置、系统、介质以及程序产品

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2023808654

Country of ref document: EP

Effective date: 20231127