CN113065642B - Artificial intelligence acceleration method and system based on heterogeneous computing - Google Patents


Publication number
CN113065642B
CN113065642B
Authority
CN
China
Prior art keywords
artificial intelligence
heterogeneous
tensorflow
computing
deep learning
Prior art date
Legal status
Active
Application number
CN202110383757.3A
Other languages
Chinese (zh)
Other versions
CN113065642A (en)
Inventor
李振兴
江波
丁湧
姜鑫
卜炜
何加浪
Current Assignee
Clp Digital Technology Co ltd
Cetc Digital Technology Group Co ltd
Original Assignee
Clp Digital Technology Co ltd
Cetc Digital Technology Group Co ltd
Priority date
Filing date
Publication date
Application filed by Clp Digital Technology Co ltd, Cetc Digital Technology Group Co ltd filed Critical Clp Digital Technology Co ltd
Priority to CN202110383757.3A priority Critical patent/CN113065642B/en
Publication of CN113065642A publication Critical patent/CN113065642A/en
Application granted granted Critical
Publication of CN113065642B publication Critical patent/CN113065642B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7817 Specially adapted for signal processing, e.g. Harvard architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an artificial intelligence acceleration system and method based on heterogeneous computing. Heterogeneous computation on an FPGA and a CPU is realized on the basis of the TensorFlow computing framework: the learning rate of the deep learning network in the TensorFlow framework is gradually increased until it reaches a threshold upper limit and then gradually decreased to obtain a local optimal solution, thereby realizing artificial intelligence acceleration based on heterogeneous computing. The invention combines the CPU and FPGA computing units and, on the basis of the TensorFlow computing framework, improves the computing rate of the framework through optimization in three aspects: learning rate control, framework optimization, and communication optimization. The running time is reduced by 90% compared with a pure-CPU computing unit.

Description

Artificial intelligence acceleration method and system based on heterogeneous computing
Technical Field
The invention relates to the technical field of heterogeneous computing, in particular to an artificial intelligence acceleration method and system based on heterogeneous computing, and more particularly to an artificial intelligence acceleration framework based on heterogeneous computing.
Background
Artificial intelligence processing speed is limited by multi-hop CPU designs and the centralized network topology around a central network node, and cannot satisfy high-throughput computing tasks on tightly coupled data. To increase the speed of artificial intelligence computation and reduce waiting time, a parallel computing mechanism is needed to raise the running speed and shorten the running time. Although distributed processing frameworks such as GraphLab, CNTK, TensorFlow, and Gorila improve the parallel computing rate of artificial intelligence algorithms, their processing objects are mainly loosely coupled data, and their computing capability is particularly insufficient for tightly coupled data. The fundamental reason is that, in order to obtain as close to an ideal target classifier as possible, these distributed processing frameworks all adopt a centralized training architecture, which transmits the parameters computed on each computing node to a central node. This reduces the throughput efficiency of the network, causes serious network congestion, and makes the central node the bottleneck for improving the performance of the whole network and even the whole system. For complex training tasks such as natural language recognition, with many training samples, long training times, and many training parameters, the centralized learning architecture is no longer suitable.
To solve the congestion problem at the central node, Uber proposed the GPU-based Horovod computing framework on the basis of Baidu's Ring Allreduce. The framework arranges the network as a ring, which removes the risk of central-node congestion, shortens the training time, and improves system throughput.
Sridharan et al. proposed the Machine Learning Scaling Library design framework, which uses advanced technologies such as Omni-Path and InfiniBand high-speed networks in cloud or HPC clusters to realize synchronous stochastic gradient descent, accelerating AI applications in a distributed environment and obtaining lower errors. Cho et al. designed PowerAI DDL, which improves deep learning in distributed environments by optimizing the communication protocol and using multiple rings.
Patent document CN107346170A (application number: 201710596217.7) discloses an FPGA heterogeneous computing acceleration system, which comprises field programmable gate array (FPGA) chips; a control module for determining the FPGA cards whose power consumption is to be reduced and generating the corresponding control instructions; and control registers, in one-to-one correspondence with the FPGA chips, for receiving the control instructions corresponding to the FPGA chips and controlling, according to the instructions, the on-off state of the power supply modules corresponding to the FPGA chips and/or the working state of the FPGA chips.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide an artificial intelligence acceleration system and method based on heterogeneous computing.
The invention provides an artificial intelligence acceleration system based on heterogeneous computing, which comprises: heterogeneous computation on an FPGA and a CPU is realized based on the TensorFlow computing framework, and the learning rate of the deep learning network in the TensorFlow framework is gradually increased until it reaches a threshold upper limit and then gradually decreased to obtain a local optimal solution, thereby realizing artificial intelligence acceleration based on heterogeneous computing.
Preferably, the learning rate of the deep learning network in the TensorFlow computing framework includes: adopting learning rates of different amplitudes for different levels of the deep learning network, so as to adapt to its multi-level structure.
Preferably, the method further comprises: analyzing the running states of the CPU and the FPGA to find that insufficient computation threads are a key problem affecting TensorFlow performance, optimizing the weight calculation mode of the neural network nodes, improving AI throughput in the cluster environment, and realizing artificial intelligence acceleration based on heterogeneous computing.
Preferably, the optimized calculation mode of the neural network node weights includes: performing the weight calculations of the ResNet-50 computing units with a batch processing method and updating the results into TensorFlow.
Preferably, the method further comprises the following steps: and adjusting the size of the data packet and scheduling the optimal communication time of the data packet to realize artificial intelligence acceleration based on heterogeneous computation.
Preferably, the adjusting of the packet size includes: since the gradient information of different layers in deep learning is determined by the hierarchy, data packets of a preset size are set, each containing multiple pieces of gradient information, which reduces the number of packets sent and the network delay caused by the allreduce operation.
Preferably, the scheduling of the optimal communication timing comprises:
optimal communication timing module M1: grouping the processes of adjacent preset layers in the deep learning network;
optimal communication timing module M2: setting a threshold for the current group and, based on it, controlling the processes sending data packets within the group;
optimal communication timing module M3: after the current group finishes sending the data packets generated by its allreduce operation, the next group starts its allreduce operation.
The invention provides an artificial intelligence acceleration method based on heterogeneous computing, which comprises: realizing heterogeneous computation on an FPGA and a CPU based on the TensorFlow computing framework, gradually increasing the learning rate of the deep learning network in the TensorFlow framework until it reaches a threshold upper limit and then gradually decreasing it to obtain a local optimal solution, thereby realizing artificial intelligence acceleration based on heterogeneous computing;
the learning rate of the deep learning network in the TensorFlow computing framework comprises: adopting learning rates of different amplitudes for different levels of the deep learning network, so as to adapt to its multi-level structure.
Preferably, the method further comprises: analyzing the running states of the CPU and the FPGA to find that insufficient computation threads are a key problem affecting TensorFlow performance, optimizing the weight calculation mode of the neural network nodes, improving AI throughput in the cluster environment, and realizing artificial intelligence acceleration based on heterogeneous computing;
the optimized calculation mode of the neural network node weights comprises: performing the weight calculations of the ResNet-50 computing units with a batch processing method and updating the results into TensorFlow.
Preferably, the method further comprises: adjusting the data packet size and scheduling the optimal communication timing of the data packets to realize artificial intelligence acceleration based on heterogeneous computing;
the adjusting of the packet size comprises: since the gradient information of different layers in deep learning is determined by the hierarchy, setting data packets of a preset size, each containing multiple pieces of gradient information, thereby reducing the number of packets sent and the network delay caused by the allreduce operation;
the scheduling of the optimal communication timing comprises:
optimal communication timing step S1: grouping the processes of adjacent preset layers in the deep learning network;
optimal communication timing step S2: setting a threshold for the current group and, based on it, controlling the processes sending data packets within the group;
optimal communication timing step S3: after the current group finishes sending the data packets generated by its allreduce operation, the next group starts its allreduce operation.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention combines the CPU and FPGA computing units and, on the basis of the TensorFlow computing framework, improves the computing rate of the framework through optimization in three aspects: learning rate control, framework optimization, and communication optimization. The running time is reduced by 90% compared with a pure-CPU computing unit.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
Fig. 1 shows how accuracy varies with packet size.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit it in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Example 1
The invention provides an artificial intelligence acceleration system based on heterogeneous computing, characterized by comprising: heterogeneous computation on an FPGA and a CPU is realized based on the TensorFlow computing framework, and the learning rate of the deep learning network in the TensorFlow framework is gradually increased until it reaches a threshold upper limit and then gradually decreased to obtain a local optimal solution, thereby realizing artificial intelligence acceleration based on heterogeneous computing.
Specifically, the learning rate of the deep learning network in the TensorFlow computing framework includes: adopting learning rates of different amplitudes for different levels of the deep learning network, so as to adapt to its multi-level structure.
Specifically, the method further comprises: analyzing the running states of the CPU and the FPGA to find that insufficient computation threads are a key problem affecting TensorFlow performance, optimizing the weight calculation mode of the neural network nodes, improving AI throughput in the cluster environment, and realizing artificial intelligence acceleration based on heterogeneous computing.
Specifically, the optimized calculation mode of the neural network node weights includes: performing the weight calculations of the ResNet-50 computing units with a batch processing method and updating the results into TensorFlow.
Specifically, the method further comprises the following steps: and adjusting the size of the data packet and scheduling the optimal communication time of the data packet to realize artificial intelligence acceleration based on heterogeneous computation.
Specifically, the adjusting of the packet size includes: since the gradient information of different layers in deep learning is determined by the hierarchy, data packets of a preset size are set, each containing multiple pieces of gradient information, which reduces the number of packets sent and the network delay caused by the allreduce operation.
Specifically, the scheduling of the optimal communication timing includes:
optimal communication timing module M1: grouping the processes of adjacent preset layers in the deep learning network;
optimal communication timing module M2: setting a threshold for the current group and, based on it, controlling the processes sending data packets within the group;
optimal communication timing module M3: after the current group finishes sending the data packets generated by its allreduce operation, the next group starts its allreduce operation.
The invention provides an artificial intelligence acceleration method based on heterogeneous computing, which comprises: realizing heterogeneous computation on an FPGA and a CPU based on the TensorFlow computing framework, gradually increasing the learning rate of the deep learning network in the TensorFlow framework until it reaches a threshold upper limit and then gradually decreasing it to obtain a local optimal solution, thereby realizing artificial intelligence acceleration based on heterogeneous computing;
the learning rate of the deep learning network in the TensorFlow computing framework comprises: adopting learning rates of different amplitudes for different levels of the deep learning network, so as to adapt to its multi-level structure.
Specifically, the method further comprises: analyzing the running states of the CPU and the FPGA to find that insufficient computation threads are a key problem affecting TensorFlow performance, optimizing the weight calculation mode of the neural network nodes, improving AI throughput in the cluster environment, and realizing artificial intelligence acceleration based on heterogeneous computing;
the optimized calculation mode of the neural network node weights comprises: performing the weight calculations of the ResNet-50 computing units with a batch processing method and updating the results into TensorFlow.
Specifically, the method further comprises: adjusting the data packet size and scheduling the optimal communication timing of the data packets to realize artificial intelligence acceleration based on heterogeneous computing;
the adjusting of the packet size comprises: since the gradient information of different layers in deep learning is determined by the hierarchy, setting data packets of a preset size, each containing multiple pieces of gradient information, thereby reducing the number of packets sent and the network delay caused by the allreduce operation;
the scheduling of the optimal communication timing comprises:
optimal communication timing step S1: grouping the processes of adjacent preset layers in the deep learning network;
optimal communication timing step S2: setting a threshold for the current group and, based on it, controlling the processes sending data packets within the group;
optimal communication timing step S3: after the current group finishes sending the data packets generated by its allreduce operation, the next group starts its allreduce operation.
Example 2
Example 2 is a preferred example of Example 1.
In contrast to existing artificial intelligence acceleration frameworks on GPU+CPU or pure-CPU computing architectures, and at the cost of a modest sacrifice in learner accuracy and generalization, an FPGA+CPU heterogeneous computing framework is designed using FPGA computing units. This realizes a high-performance computing framework for tightly coupled data, reduces running time, and increases the artificial intelligence processing rate, as shown in Fig. 1.
On the basis of the TensorFlow computing framework, the computing rate and running timeliness of the framework are improved through three aspects: learning rate control optimization, framework optimization, and communication optimization.
Accuracy improvement: the widely used stochastic gradient descent (SGD) algorithm, a common choice for deep learning optimizers, is adopted. When training a large number of small-batch tasks, the number of SGD updates gradually decreases as the mini-batch size increases. The following techniques are used to address the accuracy problem.
Learning rate control: fast training and convergence require a large learning rate. In the early stage of training, however, a high learning rate makes model training unstable. The learning rate is therefore adjusted by increasing it gradually; after it reaches the threshold upper limit, it is gradually decreased to obtain a local optimal solution. For different levels of the deep learning network, learning rates of different amplitudes are adopted to adapt to its multi-level structure.
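The warmup-then-decay schedule described above can be sketched as follows. This is an illustrative sketch only: the patent does not specify the curve shapes or any numeric values, so the linear ramp/decay and all parameter values here are assumptions:

```python
def warmup_decay_lr(step, base_lr=0.1, peak_lr=0.8, warmup_steps=500,
                    total_steps=10000):
    """Ramp the learning rate linearly from base_lr up to the threshold
    upper limit peak_lr, then decay it gradually (linear decay assumed;
    the patent does not fix the exact decay form)."""
    if step < warmup_steps:
        # Gradual increase toward the threshold upper limit.
        return base_lr + (peak_lr - base_lr) * step / warmup_steps
    # Gradual decrease after the peak.
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * (1.0 - frac)

def layerwise_lr(step, layer_scale):
    """Learning rates of different amplitudes per network level, as the
    text suggests: scale the shared schedule by a per-layer factor."""
    return layer_scale * warmup_decay_lr(step)
```

In practice such a schedule would be wired into the optimizer via a per-step callback; the per-layer scale factors would be chosen to match the network's multi-level structure.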
Framework optimization: the basic computing framework is TensorFlow, which supports C and Python interfaces as well as several other language bindings. TensorFlow's flexibility and extensibility enable AI models to be trained efficiently on clusters. Even the small centralized portion of a TensorFlow computation, however, can become the bottleneck of cluster operation. Analysis of the running states of the CPU and FPGA shows that insufficient computation threads are the key problem affecting performance; the weight calculation mode of the neural network nodes is therefore optimized to improve AI throughput in the cluster environment, specifically as follows:
the weight calculation is performed according to the samples and the class labels, the weight calculation of the neural network nodes is performed, the single-point calculation is changed into batch processing, and the calculation mode of the Tensorflow system is optimized.
Norm calculation on the FPGA: norm calculation on the FPGA is used to update the network weights and is one form of weight calculation.
Norm calculation is needed at the computing nodes of every layer in the deep learning network. Compared with the large number of computing nodes in ResNet-50, the FPGA does not have enough threads.
Therefore, the weight calculations of the ResNet-50 computing units are performed with a batch processing method and updated into TensorFlow. Compared with single computations, batch processing improves computation timeliness and reduces running time.
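The single-point-to-batch change can be illustrated with NumPy (a hedged sketch: function names and shapes are my own, and in the patent the computation runs on the FPGA rather than in NumPy). Many small per-node norm computations are replaced by one reduction over a stacked batch:

```python
import numpy as np

def norms_single(weight_mats):
    # One norm computation per node: many small, separate launches.
    return [np.linalg.norm(w) for w in weight_mats]

def norms_batched(weight_mats):
    # Batch the per-node norm computations: stack equally shaped weight
    # tensors and reduce once, which maps better onto a device with a
    # limited number of hardware threads.
    stacked = np.stack(weight_mats)              # (batch, rows, cols)
    return np.sqrt((stacked ** 2).sum(axis=(1, 2)))
```

The batched variant assumes the grouped weights share a shape; mixed shapes would need one batch per shape class.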
Communication optimization: distributed parallel deep learning requires allreduce operations to exchange gradient information among processes. The allreduce communication overhead is not negligible in a cluster, and the share of time spent on communication is particularly prominent when computing tasks are short.
Adjusting the packet size: the gradient information of different layers in deep learning depends on the hierarchy. If gradient information is too small and is sent in small packets, a large amount of header data is generated by the allreduce operation. Therefore, to reduce the network delay caused by allreduce, larger data packets must be set so that each packet contains more gradient information and fewer packets are sent. In the experiments, packet sizes are set on the order of KB.
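The packet-size adjustment amounts to greedily packing per-layer gradients into KB-scale buckets before allreduce. A minimal sketch (the 64 KB default and the greedy policy are assumptions for illustration; the patent only states that packet sizes are on the order of KB):

```python
def bucket_gradients(grad_sizes_bytes, bucket_bytes=64 * 1024):
    """Greedily pack per-layer gradient sizes into packets of roughly
    bucket_bytes, so fewer allreduce messages are sent.

    grad_sizes_bytes: gradient size of each layer, in traversal order.
    Returns a list of buckets, each a list of layer indices.
    """
    buckets, current, current_bytes = [], [], 0
    for layer, size in enumerate(grad_sizes_bytes):
        # Flush the current bucket when adding this layer would
        # overflow it (an oversized gradient gets its own packet).
        if current and current_bytes + size > bucket_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(layer)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets
```

Each bucket then becomes one allreduce message, trading a little latency before the first send for far fewer message headers.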
Scheduling the best communication timing: after some nodes complete their computing tasks, the allreduce operation for those tasks can be initiated without waiting for all nodes to finish. However, this may conflict with part of the back propagation; to reduce conflicts, a threshold is set to control the processes sending data packets. In addition, the processes of adjacent layers are grouped to ensure that no conflict occurs during the whole communication process. In the implementation, the whole network is divided into several groups according to correlation, and when one group finishes its allreduce operation, the next group starts its allreduce operation.
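The grouping/threshold scheme can be sketched as a small scheduler object (illustrative only: class and method names are my own, and real coordination would happen over MPI rather than through a local object). Adjacent layers form groups; a group may launch its allreduce only once at least `threshold` processes report its gradients ready, and only one group communicates at a time:

```python
class GroupedAllreduceScheduler:
    """Sketch of threshold-gated, one-group-at-a-time allreduce."""

    def __init__(self, num_layers, group_size, threshold):
        # Adjacent layers are grouped together.
        self.groups = [list(range(i, min(i + group_size, num_layers)))
                       for i in range(0, num_layers, group_size)]
        self.threshold = threshold
        self.ready = {g: 0 for g in range(len(self.groups))}
        self.busy = None      # group currently doing its allreduce
        self.pending = []     # groups waiting for the ring to free up

    def report_ready(self, group):
        """A process reports its gradients for `group` are ready.
        Returns the group currently allowed to communicate (or None)."""
        self.ready[group] += 1
        enough = self.ready[group] >= self.threshold
        if enough and group != self.busy and group not in self.pending:
            self.pending.append(group)
        return self._maybe_launch()

    def finish(self):
        """The current group finished sending; the next group may start."""
        self.busy = None
        return self._maybe_launch()

    def _maybe_launch(self):
        if self.busy is None and self.pending:
            self.busy = self.pending.pop(0)
        return self.busy
```

Serializing the groups this way is what prevents a group's allreduce traffic from colliding with back-propagation and communication of the others.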
Example 3
Example 3 is a preferred example of example 1 and/or example 2
The experiment uses 9 Xeon E5 nodes: one as the scheduling node and 8 as computing nodes. Each computing node is configured with one Alveo U200 as an accelerator. Each node is equipped with a RoCE network card, and the nodes are interconnected through an Ethernet switch. The specific configuration is shown in Table 1 below:
Table 1: Hardware configuration of each node

Hardware item          Specification
CPU                    Intel Xeon E5
Memory                 192 GB
FPGA accelerator card  Alveo U200
Local storage          1 TB NVMe SSD
Shared storage         Lustre shared storage system
Network card           25G dual-port ConnectX-4 EN network card
Operating system       CentOS 7.5
The scheduling node uses the Slurm scheduler.
The storage system uses Lustre; three nodes form a storage cluster with a total shared space of 25 TB.
The experimental data set is the ImageNet 2012 classification dataset.
A mixed precision method is used: half-precision floating point numbers are adopted for computation and communication, and single-precision floating point numbers are adopted for weight updates. The samples are trained with the learning rate optimization method to ensure training accuracy.
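The mixed-precision step (fp16 for computation and communication, fp32 for the weight update) can be sketched as follows. This is an illustrative NumPy simulation, not the patent's FPGA implementation; the loss-scaling factor is a common companion technique and an assumption here, as the patent does not mention it:

```python
import numpy as np

def mixed_precision_update(master_w, grad_fp32, lr, loss_scale=1024.0):
    """One update: the gradient is scaled and cast to half precision
    (as if computed and sent over the network in fp16), then restored
    to fp32 so the weight update runs on the single-precision master
    copy of the weights."""
    # Scale then cast; scaling keeps small gradients above fp16's
    # underflow threshold during transit.
    grad_fp16 = (grad_fp32 * loss_scale).astype(np.float16)
    # Receive side: back to fp32 and unscale before updating.
    grad = grad_fp16.astype(np.float32) / np.float32(loss_scale)
    return master_w - np.float32(lr) * grad
```

Keeping the master weights in fp32 is what preserves accuracy: tiny updates that would vanish in fp16 still accumulate in the single-precision copy.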
In the ResNet-50 experiment, a best accuracy of 74.5% is achieved. As shown in Fig. 1, accuracy decreases as the packet size increases. To optimize system throughput, the packet size and the timing of sending packets must therefore be set reasonably.
The number of training epochs is set to 100. The running time on the FPGA-based high-performance cluster is 27 hours; the running time of an identically configured CPU high-performance computing cluster can be more than 10 times that of the FPGA.
Based on FPGA+CPU heterogeneous computing technology, the invention can greatly improve classification accuracy for large numbers of small-batch tasks in a high-performance computing cluster. In the ResNet-50 cluster, the accuracy reaches 74.5%.
When the CPU is used as the floating-point computing node, operating efficiency is low, and the running time often greatly exceeds that of the FPGA and the GPU. The experiments show that the FPGA performs excellently in floating-point computation; combined with MPI and the advantages of high-performance computing, it greatly increases the running speed of AI and reduces AI running time.

Claims (6)

1. An artificial intelligence acceleration system based on heterogeneous computing, comprising: realizing heterogeneous computation on an FPGA and a CPU based on the TensorFlow computing framework, gradually increasing the learning rate of the deep learning network in the TensorFlow framework until it reaches a threshold upper limit and then gradually decreasing it to obtain a local optimal solution, thereby realizing artificial intelligence acceleration based on heterogeneous computing;
further comprising: adjusting the data packet size and scheduling the optimal communication timing of the data packets to realize artificial intelligence acceleration based on heterogeneous computing;
the adjusting of the packet size comprises: since the gradient information of different layers in deep learning is determined by the hierarchy, setting data packets of a preset size, each containing multiple pieces of gradient information, thereby reducing the number of packets sent and the network delay caused by the allreduce operation;
the scheduling of the optimal communication timing of the data packets comprises:
optimal communication timing module M1: grouping the processes of adjacent preset layers in the deep learning network;
optimal communication timing module M2: setting a threshold for the current group and, based on it, controlling the processes sending data packets within the group;
optimal communication timing module M3: after the current group finishes sending the data packets generated by its allreduce operation, the next group starts its allreduce operation.
2. The artificial intelligence acceleration system based on heterogeneous computing of claim 1, wherein the learning rate of the deep learning network in the TensorFlow computing framework comprises: adopting learning rates of different amplitudes for different levels of the deep learning network, so as to adapt to its multi-level structure.
3. The artificial intelligence acceleration system based on heterogeneous computing of claim 1, further comprising: analyzing the running states of the CPU and the FPGA to find that insufficient computation threads are a key problem affecting TensorFlow performance, optimizing the weight calculation mode of the neural network nodes, improving AI throughput in the cluster environment, and realizing artificial intelligence acceleration based on heterogeneous computing.
4. The artificial intelligence acceleration system based on heterogeneous computing of claim 3, wherein the optimized calculation mode of the neural network node weights comprises: performing the weight calculations of the ResNet-50 computing units with a batch processing method and updating the results into TensorFlow.
5. An artificial intelligence acceleration method based on heterogeneous computing, characterized by comprising: realizing heterogeneous computation of an FPGA and a CPU based on the TensorFlow computation framework; gradually increasing the learning rate of the deep learning network in the TensorFlow computation framework until it reaches a threshold upper limit, then gradually decreasing it to obtain a locally optimal solution, thereby realizing artificial intelligence acceleration based on heterogeneous computation;
wherein setting the learning rate of the deep learning network in the TensorFlow computation framework comprises: adopting learning rates of different magnitudes for different layers of the deep learning network, so as to adapt to the multi-layer structure of the deep learning network;
adjusting the size of the data packets and scheduling their optimal communication timing to realize artificial intelligence acceleration based on heterogeneous computation;
wherein adjusting the size of the data packets comprises: determining the gradient information of different layers in the deep learning network layer by layer, and packing it into data packets of a preset size, each data packet containing multiple pieces of gradient information, thereby reducing the number of data packets sent and the network latency caused by allreduce operations;
wherein scheduling the optimal communication timing of the data packets comprises:
optimal communication timing step S1: grouping the processes of adjacent preset layers in the deep learning network;
optimal communication timing step S2: setting a threshold for the current group, and controlling the sending of data packets within the current group based on the set threshold;
optimal communication timing step S3: after the current group finishes sending the data packets generated by its allreduce operation, the next group starts its allreduce operation.
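The learning-rate schedule of claim 5 — increase step by step until a threshold upper limit, then decrease toward a locally optimal solution — can be sketched as a warmup-then-decay function. The warmup length, peak rate, and decay factor below are assumed values for illustration only.

```python
def warmup_then_decay(step, warmup_steps=100, lr_max=0.1, decay=0.99):
    """Learning rate for a given global training step.

    Rises linearly to lr_max (the threshold upper limit) over
    warmup_steps, then decays geometrically afterwards.
    """
    if step < warmup_steps:
        # gradual increase up to the threshold upper limit
        return lr_max * (step + 1) / warmup_steps
    # gradual decrease after the threshold is reached
    return lr_max * decay ** (step - warmup_steps)
```

With the defaults, the rate climbs from 0.001 at step 0 to the 0.1 ceiling at step 99, and every step thereafter shrinks it by 1%.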
6. The artificial intelligence acceleration method based on heterogeneous computing of claim 5, further comprising: according to the running states of the CPU and the FPGA, analyzing the shortage of computing threads, which is a key problem affecting TensorFlow performance, optimizing the neural network node weight calculation mode, improving the throughput of AI in the cluster environment, and realizing artificial intelligence acceleration based on heterogeneous computation;
wherein the optimized neural network node weight calculation mode comprises: performing weight calculation on the computation units of ResNet-50 using a batch processing method, and updating the computation units into TensorFlow.
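The packet-size adjustment recited in the claims — packing per-layer gradient information into data packets of a preset size so that fewer allreduce sends are needed — can be illustrated with a simple bucketing sketch. The bucket capacity and names are illustrative assumptions, not values from the patent.

```python
def bucket_gradients(grad_sizes, bucket_capacity=8):
    """Group per-layer gradient sizes into buckets of bounded total size.

    grad_sizes: list of (layer_name, num_elements) pairs in back-prop
    order. Returns a list of buckets, each a list of layer names; each
    bucket becomes one data packet handed to allreduce, so fewer,
    larger packets are sent and per-send network latency is amortized.
    """
    buckets, current, used = [], [], 0
    for name, size in grad_sizes:
        # close the current bucket when the preset size would be exceeded
        if used + size > bucket_capacity and current:
            buckets.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        buckets.append(current)
    return buckets
```

Four layer gradients of sizes 3, 4, 5, and 2 fit into two buckets of capacity 8 — two allreduce sends instead of four.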
CN202110383757.3A 2021-04-09 2021-04-09 Artificial intelligence acceleration method and system based on heterogeneous computing Active CN113065642B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110383757.3A CN113065642B (en) 2021-04-09 2021-04-09 Artificial intelligence acceleration method and system based on heterogeneous computing


Publications (2)

Publication Number Publication Date
CN113065642A CN113065642A (en) 2021-07-02
CN113065642B true CN113065642B (en) 2023-04-07

Family

ID=76566579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110383757.3A Active CN113065642B (en) 2021-04-09 2021-04-09 Artificial intelligence acceleration method and system based on heterogeneous computing

Country Status (1)

Country Link
CN (1) CN113065642B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339351A (en) * 2016-08-30 2017-01-18 浪潮(北京)电子信息产业有限公司 SGD (Stochastic Gradient Descent) algorithm optimization system and method
CN108763360A (en) * 2018-05-16 2018-11-06 北京旋极信息技术股份有限公司 A kind of sorting technique and device, computer readable storage medium
CN109034386A (en) * 2018-06-26 2018-12-18 中国科学院计算机网络信息中心 A kind of deep learning system and method based on Resource Scheduler
CN111343148A (en) * 2020-02-05 2020-06-26 苏州浪潮智能科技有限公司 FGPA communication data processing method, system and device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FPGA-based face detection and recognition acceleration platform; Yang Sen; China Master's Theses Full-text Database; 2018-12-15; I138-1290 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant