CN113065642B - Artificial intelligence acceleration method and system based on heterogeneous computing - Google Patents
- Publication number
- CN113065642B (application CN202110383757.3A)
- Authority
- CN
- China
- Prior art keywords
- artificial intelligence
- heterogeneous
- tensorflow
- computing
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
- G06F15/7807—System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
- G06F15/7817—Specially adapted for signal processing, e.g. Harvard architectures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides an artificial intelligence acceleration system and method based on heterogeneous computing, comprising: realizing heterogeneous computation of an FPGA and a CPU based on the TensorFlow computing framework, and obtaining a local optimal solution by gradually increasing the learning rate of the deep learning network in the TensorFlow framework until it reaches an upper threshold and then gradually decreasing it, thereby realizing artificial intelligence acceleration based on heterogeneous computing. The invention combines the CPU and FPGA computing units and, on the basis of the TensorFlow computing framework, improves the framework's computing rate through three aspects of optimization: learning rate control, framework optimization, and communication optimization; the running time is reduced by 90% compared with a CPU-only computing unit.
Description
Technical Field
The invention relates to the technical field of heterogeneous computing, in particular to an artificial intelligence acceleration method and system based on heterogeneous computing, and more particularly to an artificial intelligence acceleration framework based on heterogeneous computing.
Background
The artificial intelligence processing speed is limited by multi-hop CPU designs and the centralized network topology around a central network node, and cannot satisfy high-throughput computing tasks over tightly coupled data. To improve the speed of artificial intelligence computation and reduce waiting time, a parallel computing mechanism is required to increase the running speed of artificial intelligence computation and reduce running time. Although distributed processing frameworks such as GraphLab, CNTK, TensorFlow, and Gorila improve the parallel computing rate of artificial intelligence algorithms, their processing targets are mainly loosely coupled data, and their computing capability is particularly insufficient when facing tightly coupled data. The fundamental reason is that, in order to obtain as ideal a target classifier as possible, these distributed frameworks all adopt a centralized training architecture that transmits the parameters computed at each computing node to a central node. This reduces the throughput efficiency of the network, causes serious network congestion, and makes the central node the bottleneck for improving the performance of the whole network and even the whole system. For complex training tasks such as natural language recognition, with many training samples, long training times, and many training parameters, the centralized learning framework is no longer suitable.
In order to solve the congestion problem at the central node, Uber proposed Horovod, a GPU-based computing framework built on Baidu's Ring Allreduce. The framework arranges the network as a ring, which removes the risk of central-node congestion, shortens training time, and improves system throughput.
Sridharan et al. proposed the Machine Learning Scaling Library design framework, which uses Omni-Path, InfiniBand high-speed networks, and other advanced technologies in cloud or HPC clusters to realize synchronous stochastic gradient descent, accelerating AI applications in a distributed environment and obtaining lower errors. Cho et al. designed PowerAI DDL, which improves the deep learning effect in distributed environments by optimizing the communication protocol and using multiple rings.
Patent document CN107346170A (application number: 201710596217.7) discloses an FPGA heterogeneous computing acceleration system, which comprises a field programmable gate array FPGA chip; the control module is used for determining the FPGA cards with the power consumption to be reduced and generating control instructions corresponding to the FPGA cards with the power consumption to be reduced; and the control registers correspond to the FPGA chips one to one and are used for receiving the control instructions corresponding to the FPGA chips and controlling the on-off state of the power supply modules corresponding to the FPGA chips and/or the working state of the FPGA chips according to the control instructions.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide an artificial intelligence acceleration system and method based on heterogeneous computing.
The invention provides an artificial intelligence acceleration system based on heterogeneous computing, comprising: realizing heterogeneous computation of an FPGA and a CPU based on the TensorFlow computing framework, and obtaining a local optimal solution by gradually increasing the learning rate of the deep learning network in the TensorFlow framework until it reaches an upper threshold and then gradually decreasing it, thereby realizing artificial intelligence acceleration based on heterogeneous computing.
Preferably, the learning rate of the deep learning network in the TensorFlow computing framework includes: adopting learning rates of different magnitudes for different layers of the deep learning network, so as to adapt to its multi-level structure.
Preferably, the system further includes: analyzing, from the running states of the CPU and the FPGA, that insufficient computation threads are a key problem affecting TensorFlow performance, and optimizing the neural network node weight calculation to improve AI throughput in the cluster environment, thereby realizing artificial intelligence acceleration based on heterogeneous computing.
Preferably, the optimized neural network node weight calculation includes: performing the weight calculation for the ResNet-50 computing units with a batch processing method and updating the results into TensorFlow.
Preferably, the method further comprises the following steps: and adjusting the size of the data packet and scheduling the optimal communication time of the data packet to realize artificial intelligence acceleration based on heterogeneous computation.
Preferably, the adjusting of the packet size includes: since the gradient information of different layers in deep learning depends on the layer, setting data packets of a preset size, each containing multiple pieces of gradient information, so as to reduce the number of packets sent and the network delay caused by the allreduce operation.
Preferably, the adjusted optimal communication opportunity comprises:
the adjusted optimal communication opportunity module M1: grouping processes of adjacent preset layers in the deep learning network;
the adjusted optimal communication opportunity module M2: setting a threshold aiming at the current group, and controlling the process of sending the data packet in the current group based on the set threshold;
adjusted optimal communication opportunity module M3: and when the current group finishes sending the data packet generated by the allreduce operation, the other group starts the allreduce operation again.
The invention provides an artificial intelligence acceleration method based on heterogeneous computing, comprising: realizing heterogeneous computation of an FPGA and a CPU based on the TensorFlow computing framework, gradually increasing the learning rate of the deep learning network in the TensorFlow framework until it reaches an upper threshold and then gradually decreasing it to obtain a local optimal solution, thereby realizing artificial intelligence acceleration based on heterogeneous computing;
the learning rate of the deep learning network in the TensorFlow computing framework includes: adopting learning rates of different magnitudes for different layers of the deep learning network, so as to adapt to its multi-level structure.
Preferably, the method further includes: analyzing, from the running states of the CPU and the FPGA, that insufficient computation threads are a key problem affecting TensorFlow performance, optimizing the neural network node weight calculation, and improving AI throughput in the cluster environment, thereby realizing artificial intelligence acceleration based on heterogeneous computing;
the optimized neural network node weight calculation includes: performing the weight calculation for the ResNet-50 computing units with a batch processing method and updating the results into TensorFlow.
Preferably, the method further comprises the following steps: adjusting the size of the data packet and scheduling the optimal communication time of the data packet to realize artificial intelligence acceleration based on heterogeneous computation;
the resizing the packet comprises: the gradient information of different layers in deep learning is determined according to the levels, data packets with preset sizes are set, each data packet comprises a plurality of gradient information, the sending number of the data packets is reduced, and the network delay caused by the allreduce operation is reduced;
the adjusted optimal communication opportunity comprises:
adjusted optimal communication opportunity step S1: grouping processes of adjacent preset layers in the deep learning network;
adjusted optimal communication opportunity step S2: setting a threshold aiming at the current group, and controlling the process of sending the data packet in the current group based on the set threshold;
adjusted optimal communication opportunity step S3: and when the current group finishes sending the data packet generated by the allreduce operation, the other group starts the allreduce operation again.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention combines the CPU and FPGA computing units and, on the basis of the TensorFlow computing framework, improves the framework's computing rate through three aspects of optimization: learning rate control, framework optimization, and communication optimization; the running time is reduced by 90% compared with a CPU-only computing unit.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
Fig. 1 shows how accuracy varies with packet size.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but do not limit it in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention; all such changes and modifications fall within the scope of the present invention.
Example 1
The invention provides an artificial intelligence acceleration system based on heterogeneous computing, which is characterized by comprising: realizing heterogeneous computation of an FPGA and a CPU based on the TensorFlow computing framework, and obtaining a local optimal solution by gradually increasing the learning rate of the deep learning network in the TensorFlow framework until it reaches an upper threshold and then gradually decreasing it, thereby realizing artificial intelligence acceleration based on heterogeneous computing.
Specifically, the learning rate of the deep learning network in the TensorFlow computing framework includes: adopting learning rates of different magnitudes for different layers of the deep learning network, so as to adapt to its multi-level structure.
Specifically, the system further includes: analyzing, from the running states of the CPU and the FPGA, that insufficient computation threads are a key problem affecting TensorFlow performance, and optimizing the neural network node weight calculation to improve AI throughput in the cluster environment, thereby realizing artificial intelligence acceleration based on heterogeneous computing.
Specifically, the optimized neural network node weight calculation includes: performing the weight calculation for the ResNet-50 computing units with a batch processing method and updating the results into TensorFlow.
Specifically, the method further comprises the following steps: and adjusting the size of the data packet and scheduling the optimal communication time of the data packet to realize artificial intelligence acceleration based on heterogeneous computation.
Specifically, the adjusting of the packet size includes: since the gradient information of different layers in deep learning depends on the layer, setting data packets of a preset size, each containing multiple pieces of gradient information, so as to reduce the number of packets sent and the network delay caused by the allreduce operation.
Specifically, the adjusted optimal communication opportunity includes:
the adjusted optimal communication opportunity module M1: grouping processes of adjacent preset layers in the deep learning network;
the adjusted optimal communication opportunity module M2: setting a threshold aiming at the current group, and controlling the process of sending the data packet in the current group based on the set threshold;
adjusted optimal communication opportunity module M3: and when the current group finishes sending the data packet generated by the allreduce operation, the other group starts the allreduce operation again.
The invention provides an artificial intelligence acceleration method based on heterogeneous computing, comprising: realizing heterogeneous computation of an FPGA and a CPU based on the TensorFlow computing framework, gradually increasing the learning rate of the deep learning network in the TensorFlow framework until it reaches an upper threshold and then gradually decreasing it to obtain a local optimal solution, thereby realizing artificial intelligence acceleration based on heterogeneous computing;
the learning rate of the deep learning network in the TensorFlow computing framework includes: adopting learning rates of different magnitudes for different layers of the deep learning network, so as to adapt to its multi-level structure.
Specifically, the method further includes: analyzing, from the running states of the CPU and the FPGA, that insufficient computation threads are a key problem affecting TensorFlow performance, optimizing the neural network node weight calculation, and improving AI throughput in the cluster environment, thereby realizing artificial intelligence acceleration based on heterogeneous computing;
the optimized neural network node weight calculation includes: performing the weight calculation for the ResNet-50 computing units with a batch processing method and updating the results into TensorFlow.
Specifically, the method further comprises the following steps: adjusting the size of the data packet and scheduling the optimal communication time of the data packet to realize artificial intelligence acceleration based on heterogeneous computation;
the resizing the packet comprises: the gradient information of different layers in deep learning is determined according to the levels, data packets with preset sizes are set, each data packet comprises a plurality of gradient information, the sending number of the data packets is reduced, and the network delay caused by the allreduce operation is reduced;
the adjusted optimal communication opportunity comprises:
adjusted optimal communication opportunity step S1: grouping processes of adjacent preset layers in the deep learning network;
adjusted optimal communication opportunity step S2: setting a threshold aiming at the current group, and controlling the process of sending the data packet in the current group based on the set threshold;
adjusted optimal communication opportunity step S3: and when the current group finishes sending the data packet generated by the allreduce operation, the other group starts the allreduce operation again.
Example 2
Example 2 is a preferred example of example 1
In contrast to existing artificial intelligence acceleration frameworks on GPU+CPU and pure-CPU computing architectures, and at the cost of a modest sacrifice in learner accuracy and generalization, an FPGA computing unit is used to design an FPGA and CPU heterogeneous computing framework. This realizes a high-performance computing framework for tightly coupled data, reduces running time, and improves the artificial intelligence processing rate, as shown in Fig. 1.
On the basis of the TensorFlow computing framework, the computing rate and runtime efficiency of the framework are improved through three aspects: learning rate control optimization, framework optimization, and communication optimization.
Accuracy improvement: the widely used stochastic gradient descent (SGD) algorithm, common in deep learning optimizers, is adopted. When training many small-batch tasks, the number of SGD updates gradually decreases as the mini-batch size increases. The following techniques solve the resulting accuracy problem.
Learning rate control: fast training and convergence require a large learning rate, but in the early stages of training a high learning rate makes model training unstable. Therefore the learning rate is warmed up gradually; after it reaches an upper threshold, it is gradually decreased to obtain a local optimal solution. Learning rates of different magnitudes are adopted for different layers of the deep learning network, so as to adapt to its multi-level structure.
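The warm-up-then-decay schedule and the layer-dependent amplitudes can be sketched as follows; all constants (base and peak rates, warm-up length, decay factor, per-layer scaling) are illustrative assumptions, not values from the patent:

```python
def learning_rate(step, base_lr=0.1, peak_lr=1.6, warmup_steps=500,
                  decay_rate=0.999):
    """Linear warm-up to peak_lr (the upper threshold), then decay."""
    if step < warmup_steps:
        # Gradually increase toward the upper threshold.
        return base_lr + (peak_lr - base_lr) * step / warmup_steps
    # After reaching the threshold, gradually decrease.
    return peak_lr * (decay_rate ** (step - warmup_steps))

def layer_lr(step, layer_index, num_layers=50):
    """Layer-dependent amplitude: deeper layers get a smaller scale
    (the linear scaling rule here is an assumption for illustration)."""
    scale = 1.0 - 0.5 * layer_index / max(num_layers - 1, 1)
    return learning_rate(step) * scale
```

The exact shape of the decay (exponential here) is a design choice; the patent only specifies the increase-to-threshold-then-decrease pattern.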
Framework optimization: the basic computing framework is TensorFlow, which supports C and Python interfaces as well as multiple other language bindings. TensorFlow's flexibility and extensibility enable AI models to be trained efficiently on clusters. Although centralized operations account for only a small part of TensorFlow, they are likely to become the bottleneck of cluster operation. By analyzing the running states of the CPU and the FPGA, insufficient computation threads were found to be the key problem affecting system performance; the neural network node weight calculation is therefore optimized to improve AI throughput in the cluster environment, specifically as follows:
The weight calculation is performed according to the samples and class labels; the per-node weight computation is changed from single-point calculation to batch processing, optimizing the computation in the TensorFlow system.
Norm calculation on the FPGA: norm calculation on the FPGA is used to update the network weights and is one component of the weight calculation.
Norm calculation is needed at the computing nodes of every layer in the deep learning network. Compared with the large number of computing nodes in ResNet-50, the FPGA does not have enough threads.
Therefore, the weight calculation for the ResNet-50 computing units is performed with a batch processing method and the results are updated into TensorFlow. Compared with single calculations, batch processing improves computational timeliness and reduces running time.
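The shift from single-point to batched norm computation can be illustrated with a small pure-Python sketch; the function names and the batch size are assumptions, not the patent's implementation:

```python
import math

def norms_per_node(weights):
    """Single-point style: one norm call per computing node,
    i.e. many tiny operations dispatched one at a time."""
    return [math.sqrt(sum(x * x for x in w)) for w in weights]

def norms_batched(weights, batch_size=64):
    """Batched style: process the nodes' weight vectors batch by batch,
    mirroring the batch-processing optimization the text applies to
    ResNet-50 weight updates (batch_size is an assumed parameter)."""
    out = []
    for i in range(0, len(weights), batch_size):
        out.extend(math.sqrt(sum(x * x for x in w))
                   for w in weights[i:i + batch_size])
    return out
```

On real hardware the benefit comes from amortizing dispatch overhead over each batch; the results are identical either way.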
Communication optimization: distributed parallel deep learning requires allreduce operations to exchange gradient information among processes. The allreduce communication overhead is not negligible in a cluster, and when the computation tasks are short, the share of time occupied by communication is particularly prominent.
Adjusting the packet size: the gradient information of different layers in deep learning depends on the layer. If gradient information is sent in packets that are too small, the allreduce operation generates a large amount of header data. Therefore, to reduce the network delay caused by the allreduce operation, larger data packets must be configured, each containing more gradient information, which reduces the number of packets sent. In the experiments, the packet size is set on the order of kilobytes.
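The packet-packing idea above can be sketched as follows; the 64 KB default packet size, the 4-byte (float32) element width, and the function name are illustrative assumptions rather than the patent's implementation:

```python
# Group consecutive layers' gradients into packets of bounded size so that
# fewer, larger messages go through allreduce.
ELEM_BYTES = 4  # assume float32 gradients

def pack_gradients(layer_sizes, packet_bytes=64 * 1024):
    """layer_sizes: number of gradient elements per layer.
    Returns a list of packets, each a list of layer indices."""
    packets, current, used = [], [], 0
    for idx, n in enumerate(layer_sizes):
        size = n * ELEM_BYTES
        if current and used + size > packet_bytes:
            packets.append(current)  # current packet is full: flush it
            current, used = [], 0
        current.append(idx)
        used += size
    if current:
        packets.append(current)
    return packets
```

Larger `packet_bytes` yields fewer packets and less header overhead, at the cost of the accuracy effect discussed with Fig. 1.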
Scheduling the best communication opportunity: once some nodes have completed their computation tasks, the allreduce operation for those tasks can be initiated without waiting for all nodes to finish. However, this may conflict with part of the back-propagation; to reduce such conflicts, a threshold is set to control the processes that send data packets. In addition, the processes of adjacent layers are grouped to ensure that no conflict occurs during the whole communication process. In the implementation, the whole network is divided into several groups according to correlation, and only when one group finishes its allreduce operation does the next group start its allreduce operation.
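A toy sketch of the M1-M3 schedule (grouping, threshold gating, one group at a time); the grouping, the threshold value, and the data structures are assumptions for illustration:

```python
# M1: processes of adjacent layers are grouped; M2: a group may send its
# packets only after at least `threshold` of its layers are ready; M3: groups
# perform allreduce one after another, never concurrently.

def schedule_allreduce(ready_layers, groups, threshold=2):
    """ready_layers: set of layer indices whose gradients are computed.
    groups: list of lists of adjacent layer indices.
    Returns the ordered list of group ids cleared to run allreduce."""
    order = []
    for gid, layers in enumerate(groups):
        ready = sum(1 for layer in layers if layer in ready_layers)
        if ready >= threshold:      # M2: threshold-gated sending
            order.append(gid)       # M3: sequential, one group at a time
    return order
```

In a real system the threshold trades latency (start communicating early) against conflicts with ongoing back-propagation.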
Example 3
Example 3 is a preferred example of example 1 and/or example 2
The experiment used 9 Xeon E5 nodes, with one node as the scheduling node and 8 as compute nodes. Each compute node is configured with one Alveo U200 as an accelerator. Every node is equipped with a RoCE network card, and the nodes are interconnected through an Ethernet switch. The specific configuration is shown in Table 1 below:
TABLE 1 Hardware configuration of each node
Hardware item | Configuration
---|---
CPU | Intel Xeon E5
Memory | 192 GB
FPGA accelerator card | Alveo U200
Local storage | 1 TB NVMe SSD
Shared storage | Lustre shared storage system
Network card | 25G dual-port ConnectX-4 EN
Operating system | CentOS 7.5
The scheduling node uses the Slurm scheduler.
The storage system uses Lustre, with three nodes forming a storage cluster that provides 25 TB of shared space in total.
The experimental data set was the ImageNet 2012 classification dataset.
A mixed-precision method is used: half-precision floating-point numbers during computation and communication, and single-precision floating-point numbers during weight updates. The samples are trained with the learning rate optimization method to guarantee training accuracy.
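The mixed-precision update can be sketched in pure Python, simulating the half-precision communication path with the standard `struct` module; the learning rate and the list-of-floats representation are illustrative assumptions:

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE 754 half precision, simulating the
    half-precision representation used for computation and communication."""
    return struct.unpack('e', struct.pack('e', x))[0]

def mixed_precision_step(master_weights, grads, lr=0.01):
    """Gradients travel in half precision; the update is applied to the
    higher-precision master copy of the weights."""
    comm_grads = [to_fp16(g) for g in grads]       # fp16 on the wire
    return [w - lr * g for w, g in zip(master_weights, comm_grads)]
```

Keeping the master weights in higher precision prevents small updates from being rounded away by fp16.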
In the ResNet-50 experiment, a best accuracy of 74.5% was achieved. As shown in Fig. 1, accuracy decreases as the packet size increases; to optimize system throughput, the packet size and the timing of sending packets must be set reasonably.
The number of training epochs is set to 100. The running time of the FPGA-based high-performance cluster is 27 hours; an identically configured CPU-only high-performance computing cluster can take more than 10 times as long.
The invention, based on FPGA+CPU heterogeneous computing technology, can greatly improve classification accuracy for large numbers of small-batch tasks in a high-performance computing cluster; in the ResNet-50 cluster, the accuracy reaches 74.5%.
A CPU used as a floating-point computing node has low operational efficiency, and its running time often greatly exceeds that of an FPGA or GPU. Experiments show that the FPGA performs excellently in floating-point calculation; combined with MPI and the advantages of high-performance computing, it greatly increases the running speed of AI and reduces AI running time.
Claims (6)
1. An artificial intelligence acceleration system based on heterogeneous computing, comprising: realizing heterogeneous computation of an FPGA and a CPU based on the TensorFlow computing framework, gradually increasing the learning rate of the deep learning network in the TensorFlow framework until it reaches an upper threshold and then gradually decreasing it to obtain a local optimal solution, thereby realizing artificial intelligence acceleration based on heterogeneous computing;
further comprising: adjusting the size of the data packet and scheduling the optimal communication time of the data packet to realize artificial intelligence acceleration based on heterogeneous computation;
the resizing the packet comprises: the gradient information of different layers in deep learning is determined according to the levels, data packets with preset sizes are set, each data packet comprises a plurality of gradient information, the sending number of the data packets is reduced, and the network delay caused by the allreduce operation is reduced;
the adjusting the optimal communication time of the data packet comprises:
the adjusted optimal communication opportunity module M1: grouping processes of adjacent preset layers in the deep learning network;
the adjusted optimal communication opportunity module M2: setting a threshold aiming at the current group, and controlling the process of sending the data packet in the current group based on the set threshold;
the adjusted optimal communication opportunity module M3: and after the current group finishes the transmission of the data packet generated by the allreduce operation, the other group starts the allreduce operation again.
2. The artificial intelligence acceleration system based on heterogeneous computing of claim 1, characterized in that the learning rate of the deep learning network in the TensorFlow computation framework comprises: and adopting learning rates with different amplitudes according to different levels of the deep learning network so as to adapt to the multi-level structure of the deep learning network.
3. The artificial intelligence acceleration system based on heterogeneous computing of claim 1, further comprising: analyzing, from the running states of the CPU and the FPGA, that insufficient computation threads are a key problem affecting TensorFlow performance, and optimizing the neural network node weight calculation to improve AI throughput in the cluster environment, thereby realizing artificial intelligence acceleration based on heterogeneous computing.
4. The artificial intelligence acceleration system based on heterogeneous computing of claim 3, wherein the optimized neural network node weight calculation includes: performing the weight calculation for the ResNet-50 computing units with a batch processing method and updating the results into TensorFlow.
5. An artificial intelligence acceleration method based on heterogeneous computing, comprising: realizing heterogeneous computation of an FPGA and a CPU based on the TensorFlow computing framework, gradually increasing the learning rate of the deep learning network in the TensorFlow framework until it reaches an upper threshold and then gradually decreasing it to obtain a local optimal solution, thereby realizing artificial intelligence acceleration based on heterogeneous computing;
the learning rate of the deep learning network in the TensorFlow calculation framework comprises the following steps: adopting learning rates with different amplitudes according to different levels of the deep learning network so as to adapt to a multi-level structure of the deep learning network;
adjusting the size of the data packet and scheduling the optimal communication time of the data packet to realize artificial intelligence acceleration based on heterogeneous computation;
the resizing the packet comprises: the gradient information of different layers in deep learning is determined according to the levels, data packets with preset sizes are set, each data packet comprises a plurality of gradient information, the sending number of the data packets is reduced, and the network delay caused by the allreduce operation is reduced;
the adjusting the best communication opportunity of the data packet comprises:
adjusted optimal communication opportunity step S1: grouping processes of adjacent preset layers in the deep learning network;
adjusted optimal communication opportunity step S2: setting a threshold aiming at the current group, and controlling the process of sending the data packet in the current group based on the set threshold;
adjusted optimal communication opportunity step S3: and when the current group finishes sending the data packet generated by the allreduce operation, the other group starts the allreduce operation again.
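The warm-up-then-decay learning-rate strategy of claim 5 can be illustrated with a minimal Python sketch. The peak value, warm-up length, decay constant, and per-layer scaling rule below are hypothetical, since the claims do not specify them:

```python
def learning_rate(step, peak_lr=0.1, warmup_steps=500, decay_steps=5000):
    """Gradually increase the rate until it reaches an upper threshold
    (peak_lr), then gradually decrease it to settle into a local optimum."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps            # linear ramp-up
    # exponential decay once the threshold has been reached
    return peak_lr * 0.5 ** ((step - warmup_steps) / decay_steps)

def layer_learning_rate(step, layer_index, num_layers):
    """Per-layer rates of different magnitudes: deeper layers here use a
    smaller multiplier (an assumed scaling rule, for illustration only)."""
    scale = 1.0 - 0.5 * layer_index / max(num_layers - 1, 1)
    return learning_rate(step) * scale
```

In practice such a schedule would be wrapped in a `tf.keras.optimizers.schedules.LearningRateSchedule` subclass; the plain functions above only show the shape of the curve.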
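The packet-sizing rule and steps S1–S3 above can be sketched as follows. The bucket size, the `(layer, element count)` input format, and the strictly sequential gating are illustrative assumptions rather than the patent's exact procedure:

```python
def bucket_gradients(layer_grads, bucket_bytes=1 << 20, elem_bytes=4):
    """Pack per-layer gradients (given as (name, element count) pairs in
    layer order) into packets of a preset maximum size, so each packet
    carries several pieces of gradient information and fewer allreduce
    messages cross the network."""
    buckets, current, used = [], [], 0
    for name, n_elems in layer_grads:
        size = n_elems * elem_bytes
        if current and used + size > bucket_bytes:
            buckets.append(current)        # close the full packet
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        buckets.append(current)
    return buckets

def schedule_allreduce(buckets, send):
    """S1: adjacent layers are already grouped into buckets.
    S2/S3: gate the groups so that the next group's allreduce starts
    only after the current group has sent all of its packets."""
    for group in buckets:                  # one group at a time
        for name in group:
            send(name)                     # next group waits until done
```

With 4-byte elements and a 1 MiB bucket, three layers of 100k, 200k, and 50k elements pack into two packets: the 200k layer does not fit alongside the first, but the 50k layer does fit alongside it.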
6. The artificial intelligence acceleration method based on heterogeneous computing of claim 5, further comprising: analyzing, according to the running states of the CPU and the FPGA, the thread shortage that is a key problem affecting TensorFlow performance; optimizing the neural network node weight calculation mode; and improving the throughput of AI in the cluster environment, thereby realizing artificial intelligence acceleration based on heterogeneous computation;
wherein optimizing the neural network node weight calculation mode comprises: performing weight calculation on the computing units of ResNet-50 by a batch processing method, and updating the computing units into TensorFlow.
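The "batch processing" weight calculation of claims 4 and 6 might look like the following minimal sketch, where gradients for one ResNet-50-style computing unit are averaged over the batch and applied in a single update; the flat weight layout, learning rate, and plain-SGD update rule are assumptions, not details from the claims:

```python
def batch_update_weights(weights, per_sample_grads, lr=0.01):
    """Instead of updating the unit's weights once per sample, average
    the per-sample gradients and apply one batched update, reducing the
    number of weight-calculation passes per training step."""
    n = len(per_sample_grads)
    avg = [sum(g[i] for g in per_sample_grads) / n
           for i in range(len(weights))]
    return [w - lr * g for w, g in zip(weights, avg)]
```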
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110383757.3A CN113065642B (en) | 2021-04-09 | 2021-04-09 | Artificial intelligence acceleration method and system based on heterogeneous computing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110383757.3A CN113065642B (en) | 2021-04-09 | 2021-04-09 | Artificial intelligence acceleration method and system based on heterogeneous computing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113065642A CN113065642A (en) | 2021-07-02 |
CN113065642B true CN113065642B (en) | 2023-04-07 |
Family
ID=76566579
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110383757.3A Active CN113065642B (en) | 2021-04-09 | 2021-04-09 | Artificial intelligence acceleration method and system based on heterogeneous computing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113065642B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106339351A (en) * | 2016-08-30 | 2017-01-18 | 浪潮(北京)电子信息产业有限公司 | SGD (Stochastic Gradient Descent) algorithm optimization system and method |
CN108763360A (en) * | 2018-05-16 | 2018-11-06 | 北京旋极信息技术股份有限公司 | A kind of sorting technique and device, computer readable storage medium |
CN109034386A (en) * | 2018-06-26 | 2018-12-18 | 中国科学院计算机网络信息中心 | A kind of deep learning system and method based on Resource Scheduler |
CN111343148A (en) * | 2020-02-05 | 2020-06-26 | 苏州浪潮智能科技有限公司 | FGPA communication data processing method, system and device |
Non-Patent Citations (1)
Title |
---|
FPGA-based face detection and recognition acceleration platform; Yang Sen; China Master's Theses Full-text Database; 2018-12-15; I138-1290 *
Also Published As
Publication number | Publication date |
---|---|
CN113065642A (en) | 2021-07-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3158529B1 (en) | Model parallel processing method and apparatus based on multiple graphic processing units | |
US10282809B2 (en) | Data parallel processing method and apparatus based on multiple graphic processing units | |
Zhang et al. | Asynchronous distributed ADMM for consensus optimization | |
CN110751280A (en) | Configurable convolution accelerator applied to convolutional neural network | |
CN108460457A (en) | A kind of more asynchronous training methods of card hybrid parallel of multimachine towards convolutional neural networks | |
CN111079921A (en) | Efficient neural network training and scheduling method based on heterogeneous distributed system | |
US20220129408A1 (en) | Data actor and data processing method thereof | |
CN106339351A (en) | SGD (Stochastic Gradient Descent) algorithm optimization system and method | |
CN110990140B (en) | Method for scheduling distributed machine learning flow in photoelectric switching network | |
US11425195B1 (en) | Massively parallel in-network compute | |
CN106934454B (en) | Test-schedule method in network on three-dimensional chip based on Petri network | |
Lee et al. | Task parallelism-aware deep neural network scheduling on multiple hybrid memory cube-based processing-in-memory | |
Ko et al. | An in-depth analysis of distributed training of deep neural networks | |
CN113065642B (en) | Artificial intelligence acceleration method and system based on heterogeneous computing | |
US20220277125A1 (en) | Initializing on-chip operations | |
Li et al. | Joint optimization of auto-scaling and adaptive service placement in edge computing | |
Luo et al. | A hybrid approach of ordinal optimization and iterated local search for manufacturing cell formation | |
CN112989270A (en) | Convolution calculating device based on hybrid parallel | |
Duan et al. | Lightweight federated reinforcement learning for independent request scheduling in microgrids | |
WO2021195104A1 (en) | Digital-imc hybrid system architecture for neural network acceleration | |
EP3980888A1 (en) | Explicit scheduling of on-chip operations | |
CN111367653A (en) | Stream computing task management method | |
CN112448899A (en) | Flow scheduling-based multitask training cluster network optimization method | |
Al-Lawati et al. | Gradient Staleness in Asynchronous Optimization Under Random Communication Delays | |
Xie et al. | SpikeNC: An Accurate and Scalable Simulator for Spiking Neural Network on Multi-Core Neuromorphic Hardware |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||