CN113065642B - Artificial intelligence acceleration method and system based on heterogeneous computing - Google Patents


Publication number
CN113065642B
CN113065642B
Authority
CN
China
Prior art keywords
artificial intelligence
heterogeneous
tensorflow
computing
deep learning
Prior art date
Legal status
Active
Application number
CN202110383757.3A
Other languages
Chinese (zh)
Other versions
CN113065642A (en)
Inventor
李振兴
江波
丁湧
姜鑫
卜炜
何加浪
Current Assignee
Clp Digital Technology Co ltd
Cetc Digital Technology Group Co ltd
Original Assignee
Clp Digital Technology Co ltd
Cetc Digital Technology Group Co ltd
Priority date
Filing date
Publication date
Application filed by Clp Digital Technology Co ltd, Cetc Digital Technology Group Co ltd filed Critical Clp Digital Technology Co ltd
Priority to CN202110383757.3A priority Critical patent/CN113065642B/en
Publication of CN113065642A publication Critical patent/CN113065642A/en
Application granted granted Critical
Publication of CN113065642B publication Critical patent/CN113065642B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7817 Specially adapted for signal processing, e.g. Harvard architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an artificial intelligence acceleration system and method based on heterogeneous computing. Heterogeneous computation on an FPGA and a CPU is realized on the basis of the TensorFlow computing framework: the learning rate of the deep learning network in the TensorFlow framework is gradually increased until it reaches a threshold upper limit and then gradually decreased to obtain a local optimal solution, thereby realizing artificial intelligence acceleration based on heterogeneous computing. The invention combines the CPU and FPGA computing units and, on the basis of the TensorFlow computing framework, improves the computing rate of the framework through optimization in three aspects: learning rate control, framework optimization, and communication optimization. The running time is reduced by 90% compared with a pure-CPU computing unit.

Description

Artificial intelligence acceleration method and system based on heterogeneous computing
Technical Field
The invention relates to the technical field of heterogeneous computing, in particular to an artificial intelligence acceleration method and system based on heterogeneous computing, and more particularly to an artificial intelligence acceleration framework based on heterogeneous computing.
Background
Artificial intelligence processing speed is limited by multi-hop CPU designs and the centralized network topology around a central network node, and cannot satisfy high-throughput computing tasks on tightly coupled data. To increase the speed of artificial intelligence computation and reduce waiting time, a parallel computing mechanism is needed to raise the running speed and shorten the running time. Although distributed processing frameworks such as GraphLab, CNTK, TensorFlow, and Gorila improve the parallel computing rate of artificial intelligence algorithms, their processing objects are mainly loosely coupled data, and their computing capability is particularly insufficient for tightly coupled data. The fundamental reason is that, in order to obtain as close to an ideal target classifier as possible, these distributed processing frameworks all adopt a centralized training architecture, which transmits the parameters computed on each computing node to a central node. This reduces the throughput efficiency of the network, causes serious network congestion, and makes the central node the bottleneck for improving the performance of the whole network and even the whole system. For complex training tasks such as natural language recognition, with many training samples, long training times, and many training parameters, the centralized learning architecture is no longer suitable.
To solve the congestion problem at the central node, Uber proposed the GPU-based Horovod computing framework on the basis of Baidu's Ring Allreduce. The framework arranges the network as a ring, which removes the risk of central-node congestion, shortens the training time, and improves system throughput.
Sridharan et al. proposed the Machine Learning Scaling Library design framework, which uses advanced technologies such as Omni-Path and InfiniBand high-speed networks in cloud or HPC clusters to realize synchronous stochastic gradient descent, accelerating AI applications in a distributed environment and obtaining lower errors. Cho et al. designed PowerAI DDL, which improves deep learning in distributed environments by optimizing the communication protocol and using multiple rings.
Patent document CN107346170A (application number: 201710596217.7) discloses an FPGA heterogeneous computing acceleration system, which comprises field programmable gate array (FPGA) chips; a control module for determining the FPGA cards whose power consumption is to be reduced and generating the corresponding control instructions; and control registers, in one-to-one correspondence with the FPGA chips, for receiving the control instructions corresponding to the FPGA chips and controlling, according to the instructions, the on-off state of the power supply modules corresponding to the FPGA chips and/or the working state of the FPGA chips.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide an artificial intelligence acceleration system and method based on heterogeneous computing.
The invention provides an artificial intelligence acceleration system based on heterogeneous computing, which comprises: heterogeneous computation on an FPGA and a CPU is realized based on the TensorFlow computing framework, and the learning rate of the deep learning network in the TensorFlow framework is gradually increased until it reaches a threshold upper limit and then gradually decreased to obtain a local optimal solution, thereby realizing artificial intelligence acceleration based on heterogeneous computing.
Preferably, the learning rate of the deep learning network in the TensorFlow computing framework includes: adopting learning rates of different amplitudes for different levels of the deep learning network, so as to adapt to its multi-level structure.
Preferably, the method further comprises: analyzing the running states of the CPU and the FPGA to find that insufficient computation threads are a key problem affecting TensorFlow performance, optimizing the weight calculation mode of the neural network nodes, improving AI throughput in the cluster environment, and realizing artificial intelligence acceleration based on heterogeneous computing.
Preferably, the optimized calculation mode of the neural network node weights includes: performing the weight calculations of the ResNet-50 computing units with a batch processing method and updating the results into TensorFlow.
Preferably, the method further comprises the following steps: and adjusting the size of the data packet and scheduling the optimal communication time of the data packet to realize artificial intelligence acceleration based on heterogeneous computation.
Preferably, the adjusting of the packet size includes: since the gradient information of different layers in deep learning is determined by the hierarchy, data packets of a preset size are set, each containing multiple pieces of gradient information, which reduces the number of packets sent and the network delay caused by the allreduce operation.
Preferably, the scheduling of the optimal communication timing comprises:
optimal communication timing module M1: grouping the processes of adjacent preset layers in the deep learning network;
optimal communication timing module M2: setting a threshold for the current group and, based on it, controlling the processes sending data packets within the group;
optimal communication timing module M3: after the current group finishes sending the data packets generated by its allreduce operation, the next group starts its allreduce operation.
The invention provides an artificial intelligence acceleration method based on heterogeneous computing, which comprises: realizing heterogeneous computation on an FPGA and a CPU based on the TensorFlow computing framework, gradually increasing the learning rate of the deep learning network in the TensorFlow framework until it reaches a threshold upper limit and then gradually decreasing it to obtain a local optimal solution, thereby realizing artificial intelligence acceleration based on heterogeneous computing;
the learning rate of the deep learning network in the TensorFlow computing framework comprises: adopting learning rates of different amplitudes for different levels of the deep learning network, so as to adapt to its multi-level structure.
Preferably, the method further comprises: analyzing the running states of the CPU and the FPGA to find that insufficient computation threads are a key problem affecting TensorFlow performance, optimizing the weight calculation mode of the neural network nodes, improving AI throughput in the cluster environment, and realizing artificial intelligence acceleration based on heterogeneous computing;
the optimized calculation mode of the neural network node weights comprises: performing the weight calculations of the ResNet-50 computing units with a batch processing method and updating the results into TensorFlow.
Preferably, the method further comprises: adjusting the data packet size and scheduling the optimal communication timing of the data packets to realize artificial intelligence acceleration based on heterogeneous computing;
the adjusting of the packet size comprises: since the gradient information of different layers in deep learning is determined by the hierarchy, setting data packets of a preset size, each containing multiple pieces of gradient information, thereby reducing the number of packets sent and the network delay caused by the allreduce operation;
the scheduling of the optimal communication timing comprises:
optimal communication timing step S1: grouping the processes of adjacent preset layers in the deep learning network;
optimal communication timing step S2: setting a threshold for the current group and, based on it, controlling the processes sending data packets within the group;
optimal communication timing step S3: after the current group finishes sending the data packets generated by its allreduce operation, the next group starts its allreduce operation.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention combines the CPU and FPGA computing units and, on the basis of the TensorFlow computing framework, improves the computing rate of the framework through optimization in three aspects: learning rate control, framework optimization, and communication optimization. The running time is reduced by 90% compared with a pure-CPU computing unit.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
Fig. 1 shows how accuracy varies with packet size.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit it in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Example 1
The invention provides an artificial intelligence acceleration system based on heterogeneous computing, characterized by comprising: heterogeneous computation on an FPGA and a CPU is realized based on the TensorFlow computing framework, and the learning rate of the deep learning network in the TensorFlow framework is gradually increased until it reaches a threshold upper limit and then gradually decreased to obtain a local optimal solution, thereby realizing artificial intelligence acceleration based on heterogeneous computing.
Specifically, the learning rate of the deep learning network in the TensorFlow computing framework includes: adopting learning rates of different amplitudes for different levels of the deep learning network, so as to adapt to its multi-level structure.
Specifically, the method further comprises: analyzing the running states of the CPU and the FPGA to find that insufficient computation threads are a key problem affecting TensorFlow performance, optimizing the weight calculation mode of the neural network nodes, improving AI throughput in the cluster environment, and realizing artificial intelligence acceleration based on heterogeneous computing.
Specifically, the optimized calculation mode of the neural network node weights includes: performing the weight calculations of the ResNet-50 computing units with a batch processing method and updating the results into TensorFlow.
Specifically, the method further comprises the following steps: and adjusting the size of the data packet and scheduling the optimal communication time of the data packet to realize artificial intelligence acceleration based on heterogeneous computation.
Specifically, the adjusting of the packet size includes: since the gradient information of different layers in deep learning is determined by the hierarchy, data packets of a preset size are set, each containing multiple pieces of gradient information, which reduces the number of packets sent and the network delay caused by the allreduce operation.
Specifically, the scheduling of the optimal communication timing includes:
optimal communication timing module M1: grouping the processes of adjacent preset layers in the deep learning network;
optimal communication timing module M2: setting a threshold for the current group and, based on it, controlling the processes sending data packets within the group;
optimal communication timing module M3: after the current group finishes sending the data packets generated by its allreduce operation, the next group starts its allreduce operation.
The invention provides an artificial intelligence acceleration method based on heterogeneous computing, which comprises: realizing heterogeneous computation on an FPGA and a CPU based on the TensorFlow computing framework, gradually increasing the learning rate of the deep learning network in the TensorFlow framework until it reaches a threshold upper limit and then gradually decreasing it to obtain a local optimal solution, thereby realizing artificial intelligence acceleration based on heterogeneous computing;
the learning rate of the deep learning network in the TensorFlow computing framework comprises: adopting learning rates of different amplitudes for different levels of the deep learning network, so as to adapt to its multi-level structure.
Specifically, the method further comprises: analyzing the running states of the CPU and the FPGA to find that insufficient computation threads are a key problem affecting TensorFlow performance, optimizing the weight calculation mode of the neural network nodes, improving AI throughput in the cluster environment, and realizing artificial intelligence acceleration based on heterogeneous computing;
the optimized calculation mode of the neural network node weights comprises: performing the weight calculations of the ResNet-50 computing units with a batch processing method and updating the results into TensorFlow.
Specifically, the method further comprises: adjusting the data packet size and scheduling the optimal communication timing of the data packets to realize artificial intelligence acceleration based on heterogeneous computing;
the adjusting of the packet size comprises: since the gradient information of different layers in deep learning is determined by the hierarchy, setting data packets of a preset size, each containing multiple pieces of gradient information, thereby reducing the number of packets sent and the network delay caused by the allreduce operation;
the scheduling of the optimal communication timing comprises:
optimal communication timing step S1: grouping the processes of adjacent preset layers in the deep learning network;
optimal communication timing step S2: setting a threshold for the current group and, based on it, controlling the processes sending data packets within the group;
optimal communication timing step S3: after the current group finishes sending the data packets generated by its allreduce operation, the next group starts its allreduce operation.
Example 2
Example 2 is a preferred example of Example 1.
In contrast to existing artificial intelligence acceleration frameworks on GPU+CPU or pure-CPU computing architectures, and at the cost of a modest sacrifice in learner accuracy and generalization, an FPGA+CPU heterogeneous computing framework is designed using FPGA computing units. This realizes a high-performance computing framework for tightly coupled data, reduces running time, and increases the artificial intelligence processing rate, as shown in Fig. 1.
On the basis of the TensorFlow computing framework, the computing rate and running timeliness of the framework are improved through three aspects: learning rate control optimization, framework optimization, and communication optimization.
Accuracy improvement: the widely used stochastic gradient descent (SGD) algorithm, a common choice for deep learning optimizers, is adopted. When training a large number of small-batch tasks, the number of SGD updates gradually decreases as the mini-batch size increases. The following techniques are used to address the accuracy problem.
Learning rate control: fast training and convergence require a large learning rate. In the early stage of training, however, a high learning rate makes model training unstable. The learning rate is therefore adjusted by increasing it gradually; after it reaches the threshold upper limit, it is gradually decreased to obtain a local optimal solution. For different levels of the deep learning network, learning rates of different amplitudes are adopted to adapt to its multi-level structure.
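The warmup-then-decay schedule described above can be sketched as follows. This is an illustrative sketch only: the patent does not specify the curve shapes or any numeric values, so the linear ramp/decay and all parameter values here are assumptions:

```python
def warmup_decay_lr(step, base_lr=0.1, peak_lr=0.8, warmup_steps=500,
                    total_steps=10000):
    """Ramp the learning rate linearly from base_lr up to the threshold
    upper limit peak_lr, then decay it gradually (linear decay assumed;
    the patent does not fix the exact decay form)."""
    if step < warmup_steps:
        # Gradual increase toward the threshold upper limit.
        return base_lr + (peak_lr - base_lr) * step / warmup_steps
    # Gradual decrease after the peak.
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * (1.0 - frac)

def layerwise_lr(step, layer_scale):
    """Learning rates of different amplitudes per network level, as the
    text suggests: scale the shared schedule by a per-layer factor."""
    return layer_scale * warmup_decay_lr(step)
```

In practice such a schedule would be wired into the optimizer via a per-step callback; the per-layer scale factors would be chosen to match the network's multi-level structure.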
Framework optimization: the basic computing framework is TensorFlow, which supports C and Python interfaces as well as several other language bindings. TensorFlow's flexibility and extensibility enable AI models to be trained efficiently on clusters. Even the small centralized portion of a TensorFlow computation, however, can become the bottleneck of cluster operation. Analysis of the running states of the CPU and FPGA shows that insufficient computation threads are the key problem affecting performance; the weight calculation mode of the neural network nodes is therefore optimized to improve AI throughput in the cluster environment, specifically as follows:
the weight calculation is performed according to the samples and the class labels, the weight calculation of the neural network nodes is performed, the single-point calculation is changed into batch processing, and the calculation mode of the Tensorflow system is optimized.
Norm calculation on the FPGA: norm calculation on the FPGA is used to update the network weights and is one form of weight calculation.
Norm calculation is needed at the computing nodes of every layer in the deep learning network. Compared with the large number of computing nodes in ResNet-50, the FPGA does not have enough threads.
Therefore, the weight calculations of the ResNet-50 computing units are performed with a batch processing method and updated into TensorFlow. Compared with single computations, batch processing improves computation timeliness and reduces running time.
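The single-point-to-batch change can be illustrated with NumPy (a hedged sketch: function names and shapes are my own, and in the patent the computation runs on the FPGA rather than in NumPy). Many small per-node norm computations are replaced by one reduction over a stacked batch:

```python
import numpy as np

def norms_single(weight_mats):
    # One norm computation per node: many small, separate launches.
    return [np.linalg.norm(w) for w in weight_mats]

def norms_batched(weight_mats):
    # Batch the per-node norm computations: stack equally shaped weight
    # tensors and reduce once, which maps better onto a device with a
    # limited number of hardware threads.
    stacked = np.stack(weight_mats)              # (batch, rows, cols)
    return np.sqrt((stacked ** 2).sum(axis=(1, 2)))
```

The batched variant assumes the grouped weights share a shape; mixed shapes would need one batch per shape class.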
Communication optimization: distributed parallel deep learning requires allreduce operations to exchange gradient information among processes. The allreduce communication overhead is not negligible in a cluster, and the share of time spent on communication is particularly prominent when computing tasks are short.
Adjusting the packet size: the gradient information of different layers in deep learning depends on the hierarchy. If gradient information is too small and is sent in small packets, a large amount of header data is generated by the allreduce operation. Therefore, to reduce the network delay caused by allreduce, larger data packets must be set so that each packet contains more gradient information and fewer packets are sent. In the experiments, packet sizes are set on the order of KB.
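The packet-size adjustment amounts to greedily packing per-layer gradients into KB-scale buckets before allreduce. A minimal sketch (the 64 KB default and the greedy policy are assumptions for illustration; the patent only states that packet sizes are on the order of KB):

```python
def bucket_gradients(grad_sizes_bytes, bucket_bytes=64 * 1024):
    """Greedily pack per-layer gradient sizes into packets of roughly
    bucket_bytes, so fewer allreduce messages are sent.

    grad_sizes_bytes: gradient size of each layer, in traversal order.
    Returns a list of buckets, each a list of layer indices.
    """
    buckets, current, current_bytes = [], [], 0
    for layer, size in enumerate(grad_sizes_bytes):
        # Flush the current bucket when adding this layer would
        # overflow it (an oversized gradient gets its own packet).
        if current and current_bytes + size > bucket_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(layer)
        current_bytes += size
    if current:
        buckets.append(current)
    return buckets
```

Each bucket then becomes one allreduce message, trading a little latency before the first send for far fewer message headers.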
Scheduling the best communication timing: after some nodes complete their computing tasks, the allreduce operation for those tasks can be initiated without waiting for all nodes to finish. However, this may conflict with part of the back propagation; to reduce conflicts, a threshold is set to control the processes sending data packets. In addition, the processes of adjacent layers are grouped to ensure that no conflict occurs during the whole communication process. In the implementation, the whole network is divided into several groups according to correlation, and when one group finishes its allreduce operation, the next group starts its allreduce operation.
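The grouping/threshold scheme can be sketched as a small scheduler object (illustrative only: class and method names are my own, and real coordination would happen over MPI rather than through a local object). Adjacent layers form groups; a group may launch its allreduce only once at least `threshold` processes report its gradients ready, and only one group communicates at a time:

```python
class GroupedAllreduceScheduler:
    """Sketch of threshold-gated, one-group-at-a-time allreduce."""

    def __init__(self, num_layers, group_size, threshold):
        # Adjacent layers are grouped together.
        self.groups = [list(range(i, min(i + group_size, num_layers)))
                       for i in range(0, num_layers, group_size)]
        self.threshold = threshold
        self.ready = {g: 0 for g in range(len(self.groups))}
        self.busy = None      # group currently doing its allreduce
        self.pending = []     # groups waiting for the ring to free up

    def report_ready(self, group):
        """A process reports its gradients for `group` are ready.
        Returns the group currently allowed to communicate (or None)."""
        self.ready[group] += 1
        enough = self.ready[group] >= self.threshold
        if enough and group != self.busy and group not in self.pending:
            self.pending.append(group)
        return self._maybe_launch()

    def finish(self):
        """The current group finished sending; the next group may start."""
        self.busy = None
        return self._maybe_launch()

    def _maybe_launch(self):
        if self.busy is None and self.pending:
            self.busy = self.pending.pop(0)
        return self.busy
```

Serializing the groups this way is what prevents a group's allreduce traffic from colliding with back-propagation and communication of the others.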
Example 3
Example 3 is a preferred example of example 1 and/or example 2
The experiment uses 9 Xeon E5 nodes: one as the scheduling node and 8 as computing nodes. Each computing node is configured with one Alveo U200 as an accelerator. Each node is equipped with a RoCE network card, and the nodes are interconnected through an Ethernet switch. The specific configuration is shown in Table 1 below:
Table 1: Hardware configuration of each node

Hardware item          Specification
CPU                    Intel Xeon E5
Memory                 192 GB
FPGA accelerator card  Alveo U200
Local storage          1 TB NVMe SSD
Shared storage         Lustre shared storage system
Network card           25G dual-port ConnectX-4 EN network card
Operating system       CentOS 7.5
The scheduling node uses the Slurm scheduler.
The storage system uses Lustre; three nodes form a storage cluster with a total shared space of 25 TB.
The experimental data set is the ImageNet 2012 classification dataset.
A mixed precision method is used: half-precision floating point numbers are adopted for computation and communication, and single-precision floating point numbers are adopted for weight updates. The samples are trained with the learning rate optimization method to ensure training accuracy.
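The mixed-precision step (fp16 for computation and communication, fp32 for the weight update) can be sketched as follows. This is an illustrative NumPy simulation, not the patent's FPGA implementation; the loss-scaling factor is a common companion technique and an assumption here, as the patent does not mention it:

```python
import numpy as np

def mixed_precision_update(master_w, grad_fp32, lr, loss_scale=1024.0):
    """One update: the gradient is scaled and cast to half precision
    (as if computed and sent over the network in fp16), then restored
    to fp32 so the weight update runs on the single-precision master
    copy of the weights."""
    # Scale then cast; scaling keeps small gradients above fp16's
    # underflow threshold during transit.
    grad_fp16 = (grad_fp32 * loss_scale).astype(np.float16)
    # Receive side: back to fp32 and unscale before updating.
    grad = grad_fp16.astype(np.float32) / np.float32(loss_scale)
    return master_w - np.float32(lr) * grad
```

Keeping the master weights in fp32 is what preserves accuracy: tiny updates that would vanish in fp16 still accumulate in the single-precision copy.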
In the ResNet-50 experiment, a best accuracy of 74.5% is achieved. As shown in Fig. 1, accuracy decreases as the packet size increases. To optimize system throughput, the packet size and the timing of sending packets must therefore be set reasonably.
The number of training epochs is set to 100. The running time on the FPGA-based high-performance cluster is 27 hours; the running time of an identically configured CPU high-performance computing cluster can be more than 10 times that of the FPGA.
Based on FPGA+CPU heterogeneous computing technology, the invention can greatly improve classification accuracy for large numbers of small-batch tasks in a high-performance computing cluster. In the ResNet-50 cluster, the accuracy reaches 74.5%.
When the CPU is used as the floating-point computing node, operating efficiency is low, and the running time often greatly exceeds that of the FPGA and the GPU. The experiments show that the FPGA performs excellently in floating-point computation; combined with MPI and the advantages of high-performance computing, it greatly increases the running speed of AI and reduces AI running time.

Claims (6)

1. An artificial intelligence acceleration system based on heterogeneous computing, comprising: realizing heterogeneous computation on an FPGA and a CPU based on the TensorFlow computing framework, gradually increasing the learning rate of the deep learning network in the TensorFlow framework until it reaches a threshold upper limit and then gradually decreasing it to obtain a local optimal solution, thereby realizing artificial intelligence acceleration based on heterogeneous computing;
further comprising: adjusting the data packet size and scheduling the optimal communication timing of the data packets to realize artificial intelligence acceleration based on heterogeneous computing;
the adjusting of the packet size comprises: since the gradient information of different layers in deep learning is determined by the hierarchy, setting data packets of a preset size, each containing multiple pieces of gradient information, thereby reducing the number of packets sent and the network delay caused by the allreduce operation;
the scheduling of the optimal communication timing of the data packets comprises:
optimal communication timing module M1: grouping the processes of adjacent preset layers in the deep learning network;
optimal communication timing module M2: setting a threshold for the current group and, based on it, controlling the processes sending data packets within the group;
optimal communication timing module M3: after the current group finishes sending the data packets generated by its allreduce operation, the next group starts its allreduce operation.
2. The artificial intelligence acceleration system based on heterogeneous computing of claim 1, wherein the learning rate of the deep learning network in the TensorFlow computing framework comprises: adopting learning rates of different amplitudes for different levels of the deep learning network, so as to adapt to its multi-level structure.
3. The artificial intelligence acceleration system based on heterogeneous computing of claim 1, further comprising: analyzing the running states of the CPU and the FPGA to find that insufficient computation threads are a key problem affecting TensorFlow performance, optimizing the weight calculation mode of the neural network nodes, improving AI throughput in the cluster environment, and realizing artificial intelligence acceleration based on heterogeneous computing.
4. The artificial intelligence acceleration system based on heterogeneous computing of claim 3, wherein the optimized calculation mode of the neural network node weights comprises: performing the weight calculations of the ResNet-50 computing units with a batch processing method and updating the results into TensorFlow.
5. An artificial intelligence acceleration method based on heterogeneous computing, characterized by comprising: realizing heterogeneous computation of an FPGA and a CPU based on the TensorFlow computation framework; gradually increasing the learning rate of the deep learning network in the TensorFlow computation framework until it reaches a threshold upper limit, then gradually decreasing it to obtain a locally optimal solution, thereby realizing artificial intelligence acceleration based on heterogeneous computation;
wherein setting the learning rate of the deep learning network in the TensorFlow computation framework comprises: adopting learning rates of different magnitudes for different layers of the deep learning network, so as to adapt to the multi-layer structure of the deep learning network;
adjusting the size of the data packets and scheduling their optimal communication timing to realize artificial intelligence acceleration based on heterogeneous computation;
wherein adjusting the size of the data packets comprises: determining the gradient information of different layers in the deep learning network layer by layer, and packing it into data packets of a preset size, each data packet containing multiple pieces of gradient information, thereby reducing the number of data packets sent and the network latency caused by allreduce operations;
wherein scheduling the optimal communication timing of the data packets comprises:
optimal communication timing step S1: grouping the processes of adjacent preset layers in the deep learning network;
optimal communication timing step S2: setting a threshold for the current group, and controlling the sending of data packets within the current group based on the set threshold;
optimal communication timing step S3: after the current group finishes sending the data packets generated by its allreduce operation, the next group starts its allreduce operation.
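The learning-rate schedule of claim 5 — increase step by step until a threshold upper limit, then decrease toward a locally optimal solution — can be sketched as a warmup-then-decay function. The warmup length, peak rate, and decay factor below are assumed values for illustration only.

```python
def warmup_then_decay(step, warmup_steps=100, lr_max=0.1, decay=0.99):
    """Learning rate for a given global training step.

    Rises linearly to lr_max (the threshold upper limit) over
    warmup_steps, then decays geometrically afterwards.
    """
    if step < warmup_steps:
        # gradual increase up to the threshold upper limit
        return lr_max * (step + 1) / warmup_steps
    # gradual decrease after the threshold is reached
    return lr_max * decay ** (step - warmup_steps)
```

With the defaults, the rate climbs from 0.001 at step 0 to the 0.1 ceiling at step 99, and every step thereafter shrinks it by 1%.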
6. The artificial intelligence acceleration method based on heterogeneous computing of claim 5, further comprising: according to the running states of the CPU and the FPGA, analyzing the shortage of computing threads, which is a key problem affecting TensorFlow performance, optimizing the neural network node weight calculation mode, improving the throughput of AI in the cluster environment, and realizing artificial intelligence acceleration based on heterogeneous computation;
wherein the optimized neural network node weight calculation mode comprises: performing weight calculation on the computation units of ResNet-50 using a batch processing method, and updating the computation units into TensorFlow.
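The packet-size adjustment recited in the claims — packing per-layer gradient information into data packets of a preset size so that fewer allreduce sends are needed — can be illustrated with a simple bucketing sketch. The bucket capacity and names are illustrative assumptions, not values from the patent.

```python
def bucket_gradients(grad_sizes, bucket_capacity=8):
    """Group per-layer gradient sizes into buckets of bounded total size.

    grad_sizes: list of (layer_name, num_elements) pairs in back-prop
    order. Returns a list of buckets, each a list of layer names; each
    bucket becomes one data packet handed to allreduce, so fewer,
    larger packets are sent and per-send network latency is amortized.
    """
    buckets, current, used = [], [], 0
    for name, size in grad_sizes:
        # close the current bucket when the preset size would be exceeded
        if used + size > bucket_capacity and current:
            buckets.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        buckets.append(current)
    return buckets
```

Four layer gradients of sizes 3, 4, 5, and 2 fit into two buckets of capacity 8 — two allreduce sends instead of four.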
CN202110383757.3A 2021-04-09 2021-04-09 Artificial intelligence acceleration method and system based on heterogeneous computing Active CN113065642B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110383757.3A CN113065642B (en) 2021-04-09 2021-04-09 Artificial intelligence acceleration method and system based on heterogeneous computing


Publications (2)

Publication Number Publication Date
CN113065642A CN113065642A (en) 2021-07-02
CN113065642B true CN113065642B (en) 2023-04-07

Family

ID=76566579

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110383757.3A Active CN113065642B (en) 2021-04-09 2021-04-09 Artificial intelligence acceleration method and system based on heterogeneous computing

Country Status (1)

Country Link
CN (1) CN113065642B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339351A (en) * 2016-08-30 2017-01-18 浪潮(北京)电子信息产业有限公司 SGD (Stochastic Gradient Descent) algorithm optimization system and method
CN108763360A (en) * 2018-05-16 2018-11-06 北京旋极信息技术股份有限公司 A kind of sorting technique and device, computer readable storage medium
CN109034386A (en) * 2018-06-26 2018-12-18 中国科学院计算机网络信息中心 A kind of deep learning system and method based on Resource Scheduler
CN111343148A (en) * 2020-02-05 2020-06-26 苏州浪潮智能科技有限公司 FGPA communication data processing method, system and device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FPGA-based face detection and recognition acceleration platform; Yang Sen; China Master's Theses Full-text Database; 2018-12-15; I138-1290 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant