CN114595813A - Heterogeneous acceleration processor and data calculation method - Google Patents

Heterogeneous acceleration processor and data calculation method

Info

Publication number: CN114595813A
Authority: CN (China)
Prior art keywords: data, calculation, vector calculation, module, vector
Application number: CN202210132954.2A
Other languages: Chinese (zh)
Inventors: 尹首一, 位经传, 王洲, 韩慧明, 朱丹, 刘雷波, 魏少军
Current Assignee: Tsinghua University
Original Assignee: Tsinghua University
Priority date: 2022-02-14
Filing date: 2022-02-14
Publication date: 2022-06-07
Application filed by Tsinghua University
Priority to CN202210132954.2A
Publication of CN114595813A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a heterogeneous acceleration processor and a data calculation method. The heterogeneous acceleration processor comprises a calculation module, a storage module RAM, a CPU and a data sorting and transforming module. The calculation module comprises a convolution PE array and a vector calculation control unit; the storage module RAM is used for storing data; the CPU is used for sending a vector calculation instruction to the calculation module; the convolution PE array comprises a plurality of PE units, and the PE units are used for performing neural network convolution calculation on original data; the data sorting and transforming module is used for sorting and transforming the intermediate data output by the calculation module to obtain result data; the vector calculation control unit is used for controlling at least one PE unit to perform vector calculation; the PE unit is further used for performing vector calculation on the original data or the intermediate data under the control of the vector calculation control unit, and, after receiving the vector calculation instruction, performing vector calculation on data in the vector calculation instruction. The invention can flexibly perform neural network convolution calculation and vector calculation while avoiding a waste of resources.

Description

Heterogeneous acceleration processor and data calculation method
Technical Field
The invention relates to the technical field of computers, in particular to a heterogeneous acceleration processor and a data calculation method.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
In a traditional neural network processor, a plurality of computing units are controlled in a data-driven manner, and only whole-array or partial-array parallel processing can be performed. When such a processor needs to perform logic judgment or scalar/vector calculation, it is limited by the fact that a dedicated neural network processor is highly optimized for neural network calculation: the data path and the calculation array are fixed, data cannot be flexibly moved and calculated, and general-purpose conventional calculation and logic judgment are not well supported, so room for optimization remains. In addition, a traditional neural network processor has a large calculation array and abundant on-chip storage resources, and therefore a large area. In application scenarios with a low demand for neural network computing but a need for more general-purpose CPU computing resources, implementing an additional CPU general-purpose processor separately inside the chip wastes resources.
Disclosure of Invention
An embodiment of the present invention provides a heterogeneous acceleration processor, configured to flexibly perform calculations such as neural network convolution calculation and vector calculation while avoiding resource waste, and comprising:
the device comprises a calculation module, a storage module RAM, a CPU and a data sorting and transforming module, wherein the calculation module comprises a convolution PE array and a vector calculation control unit;
the RAM is used for storing original data, intermediate data and result data;
the CPU is used for sending the vector calculation instruction to the calculation module;
the convolution PE array comprises a plurality of PE units, and the PE units are used for performing neural network convolution calculation on original data;
the data sorting and transforming module is used for sorting and transforming the intermediate data output by the computing module to obtain result data;
the vector calculation control unit is used for controlling at least one PE unit to carry out vector calculation;
the PE unit is also used for carrying out vector calculation on the original data or the intermediate data under the control of the vector calculation control unit; and after receiving the vector calculation instruction, performing vector calculation on data in the vector calculation instruction.
An embodiment of the present invention further provides a data calculation method based on a heterogeneous acceleration processor, used to flexibly perform calculations such as neural network convolution calculation and vector calculation while avoiding resource waste. The method is applied to the heterogeneous acceleration processor and includes:
after receiving a neural network convolution calculation instruction, carrying out neural network convolution calculation on original data stored in a storage module RAM through a calculation module to obtain intermediate data, and carrying out sorting transformation on the intermediate data through a data sorting transformation module to obtain result data;
after a vector calculation instruction is received, the vector calculation control unit controls the calculation module to perform vector calculation to obtain intermediate data, and the data sorting and conversion module sorts and converts the intermediate data to obtain result data.
In the embodiment of the invention, the calculation module comprises a convolution PE array and a vector calculation control unit; the RAM is used for storing original data, intermediate data and result data; the CPU is used for sending the vector calculation instruction to the calculation module; the convolution PE array comprises a plurality of PE units, and the PE units are used for performing neural network convolution calculation on the original data; the data sorting and transforming module is used for sorting and transforming the intermediate data output by the calculation module to obtain result data; the vector calculation control unit is used for controlling at least one PE unit to perform vector calculation; the PE unit is also used for performing vector calculation on the original data or the intermediate data under the control of the vector calculation control unit, and, after receiving the vector calculation instruction, performing vector calculation on data in the vector calculation instruction. Compared with the prior art, the PE units perform the neural network convolution calculation directly on the original data and are controlled by the vector calculation control unit to perform vector calculation; the CPU only needs to send a vector calculation instruction, no additional general-purpose CPU computing resources are required, and resource waste is avoided.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is apparent that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort. In the drawings:
FIG. 1 is a first schematic diagram of a heterogeneous acceleration processor according to an embodiment of the present invention;
FIG. 2 is a second schematic diagram of a heterogeneous acceleration processor according to an embodiment of the present invention;
FIG. 3 is a third schematic diagram of a heterogeneous acceleration processor according to an embodiment of the present invention;
FIG. 4 is a first flowchart of a data calculation method based on a heterogeneous acceleration processor according to an embodiment of the present invention;
FIG. 5 is a second flowchart of a data calculation method based on a heterogeneous acceleration processor according to an embodiment of the present invention;
FIG. 6 is a third flowchart of a data calculation method based on a heterogeneous acceleration processor according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
First, terms related to embodiments of the present invention are explained.
AI: intellectual Intelligence, a branch of computer science, attempts to understand the essence of Intelligence and produces a new intelligent machine that can react in a manner similar to human Intelligence.
PE: processing Element, Processing unit. Computing array composed of a series of PE arrays (PE array)
A CPU: the central processing unit, which is used as the operation and control core of the computer system, is the final execution unit for information processing and program operation.
GPU: graphics processing unit, a graphics processor, is a microprocessor that is dedicated to do image and graphics related arithmetic work on a device.
A neural network processor: a neural-network processing unit, a neural network processor, is a microprocessor which is specially used for carrying out relevant operation work of a neural network on equipment.
A multi-core processor: two or more complete compute engines (cores) are integrated into a processor, where the processor can support multiple processors on a system bus, with all bus control signals and command signals provided by the bus controller.
Isomerization: the method realizes the cooperative computing and mutual acceleration among the computing units using different types of instruction sets and architectures, and effectively solves the problems of energy consumption, expandability and the like.
The CPU is the final execution unit for information processing and program operation, and the operation and control core of the computer system. The von Neumann architecture is the basis of modern computers. Under this architecture, programs and data are stored uniformly: instructions and data are fetched from the same storage space and transmitted over the same bus, and these accesses cannot overlap. Under the von Neumann model, the operation of the CPU is divided into the following five stages: the instruction fetch stage (IF), the instruction decode stage (ID), the instruction execute stage (EX), the memory access stage (MEM) and the result write-back stage (WB).
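For reference, the five stages just listed can be written out as a simple enumeration (an illustrative sketch only, not part of the patent):

```python
from enum import Enum

class PipelineStage(Enum):
    """The five CPU stages under the von Neumann model, as listed above."""
    IF = "instruction fetch"
    ID = "instruction decode"
    EX = "instruction execute"
    MEM = "memory access"
    WB = "result write-back"

# An instruction simply passes through the stages in definition order:
for stage in PipelineStage:
    pass  # fetch, decode, execute, access memory, write back
```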
The CPU is the core component of the computer responsible for fetching, decoding and executing instructions. The central processing unit mainly comprises two parts, a controller and an arithmetic unit, and also includes a cache and the buses that carry the data and control connections between them. The three major core components of a computer are the CPU, internal memory and input/output devices. The central processing unit mainly processes instructions, executes operations, controls timing and processes data. In a computer architecture, the CPU is the core hardware unit that controls and allocates all hardware resources of the computer (such as memory and input/output units) and performs general operations. The CPU is the computational and control core of the computer; the operation of every software layer in the computer system is ultimately mapped, through the instruction set, onto operations of the CPU.
The GPU reduces the graphics card's dependence on the CPU and takes over part of the work originally done by the CPU. In 3D graphics processing in particular, the core technologies adopted by the GPU include hardware T&L (geometric transformation and lighting), cubic environment texture mapping and vertex blending, texture compression and bump mapping, and a dual-texture four-pixel 256-bit rendering engine; hardware T&L can be considered a hallmark of the GPU.
A neural network processor, also called a neural network accelerator or computing card, i.e. a deep learning processor, refers to a module dedicated to handling the large number of computing tasks in intelligent applications (other, non-computing tasks are still handled by the CPU). Much of the data processing in neural networks involves matrix multiplication and addition. A large number of GPUs working in parallel provides an inexpensive approach, but the GPUs require more power. FPGAs with built-in DSP modules and local memory are more energy efficient, but they are generally more expensive. Deep learning refers to multi-layered neural networks and the methods used to train them. A neural network processor, in general, learns, judges and makes decisions by means of deep neural networks and mechanisms that simulate the human brain.
An accelerator generally refers to a hardware structure used to accelerate a certain computing pattern or flow, including optimizations of specific hardware logic units and of pipeline processing. For a specific computing mode such as machine learning, conventional processors such as CPUs and GPUs must rely on a general-purpose computing architecture and implementation, and it is often difficult for them to maximize computing speed and performance. A computing accelerator optimizes the design for a specific computing mode from the hardware level up, reducing computation time and increasing computation efficiency, thereby accelerating the calculation. For example, an artificial neural network processor with a dedicated computing architecture is often faster, performs better and is more advantageous in its specific application scenarios than a general-purpose processor.
A multi-core processor is a processor that integrates two or more complete computing engines (cores); such a processor can support multiple processors on a system bus, with the bus controller providing all bus control and command signals. Multi-core technology developed because engineers realized that merely increasing the speed of a single-core chip generates excessive heat without a corresponding performance improvement; even setting the heat problem aside, the cost-performance ratio is unacceptable, since a marginally faster processor costs far more. The application of multi-core technology has two advantages: it brings users more powerful computing performance and, more importantly, it satisfies users' need to run multiple tasks and multi-task computing environments simultaneously.
Common processor chips include the CPU, DSP, GPU, FPGA and ASIC. The CPU and GPU need software support, while the FPGA and ASIC integrate software and hardware, with the software effectively solidified into hardware. In terms of energy efficiency: ASIC > FPGA > GPU > CPU. The root cause of this ordering is that, for computation-intensive algorithms, the more efficient the data movement and the computation, the higher the energy efficiency. The ASIC and FPGA are both closer to the bottom-level I/O, so their computation efficiency is high and their data movement is efficient; however, the FPGA has redundant transistors and interconnects and a lower operating frequency, so its energy efficiency does not reach that of the ASIC. The GPU and CPU are general-purpose processors: both must go through the instruction fetch, instruction decode and instruction execute stages, which shields the handling of bottom-level I/O and decouples software from hardware, but data movement and computation therefore cannot reach the same efficiency, so neither matches the energy efficiency of the ASIC or FPGA. The energy-efficiency gap between the GPU and the CPU mainly arises because most transistors in the CPU are devoted to caches and control logic; for computation-intensive algorithms of low computational complexity, these transistors contribute little, so the CPU's energy efficiency is lower than the GPU's.
Over their long development, processor chips have formed distinct characteristics in usage and in the market. A large amount of open-source and application software exists in the CPU and GPU field, and any new technique is first implemented as an algorithm on the CPU, so CPU programming resources are rich and easy to obtain, development cost is low and the development cycle is short. The FPGA is implemented with bottom-level hardware description languages such as Verilog/VHDL, and developers need a deep understanding of the FPGA's chip characteristics, but its highly parallel nature can improve service performance by orders of magnitude. At the same time the FPGA is dynamically reconfigurable: after deployment in a data center, different logic can be configured according to the service form to realize different hardware acceleration functions. For example, an FPGA board currently deployed on a server may carry picture-compression logic serving a QQ service; if real-time advertisement prediction then needs to scale out and obtain more FPGA computing resources, a simple FPGA reconfiguration flow turns the same board into "new" hardware serving real-time advertisement prediction, which makes it very suitable for volume deployment. An ASIC chip can achieve optimal performance, i.e. high area utilization, high speed and low power consumption; however, the development risk of an ASIC is extremely high, a sufficiently large market is needed to justify the cost, and the time from development to market is long, so it is not well suited to fields such as deep learning CNNs in which algorithms iterate rapidly.
In the prior art, when a neural network processor needs to perform logic judgment or scalar/vector calculation, data cannot be flexibly moved and calculated, and general-purpose conventional calculation and logic judgment are not well supported, leaving room for optimization. In addition, a traditional neural network processor has a large calculation array and abundant on-chip storage resources, a large area, and wasted resources. Therefore, an embodiment of the present invention provides a heterogeneous acceleration processor to solve the above problems.
Fig. 1 is a first schematic diagram of a heterogeneous acceleration processor according to an embodiment of the present invention, including:
the device comprises a calculation module, a storage module RAM, a CPU and a data sorting and transforming module, wherein the calculation module comprises a convolution PE array and a vector calculation control unit;
the RAM is used for storing original data, intermediate data and result data;
the CPU is used for sending the vector calculation instruction to the calculation module;
the convolution PE array comprises a plurality of PE units, and the PE units are used for carrying out neural network convolution calculation on original data;
the data sorting and transforming module is used for sorting and transforming the intermediate data output by the computing module to obtain result data;
the vector calculation control unit is used for controlling at least one PE unit to carry out vector calculation;
the PE unit is also used for carrying out vector calculation on the original data or the intermediate data under the control of the vector calculation control unit; and after receiving the vector calculation instruction, performing vector calculation on data in the vector calculation instruction.
The neural network convolution calculation includes matrix calculation, logic determination, and the like, and the vector calculation includes scalar calculation, logic determination, and the like.
The heterogeneous acceleration processor provided by the embodiments of the present invention can be implemented in several ways.
Example one
The heterogeneous acceleration processor provided by the invention is a novel neural network/CPU heterogeneous acceleration processor architecture that integrates convolution/matrix calculation, logic judgment and scalar/vector calculation functions, deeply fusing neural network convolution calculation with CPU logic; all or some of the PE units can be reconstructed into vector calculation units suited to cooperating with a CPU. When performing neural network convolution calculation, such as CNN convolution, a PE unit is configured as a neural network calculation unit; when the CPU performs calculation, it is configured as a vector calculation unit that executes vector calculation.
In this embodiment, two schemes of the heterogeneous acceleration processor are provided; a simplified sketch of such a dual-mode PE unit follows the two schemes.
Scheme I: all PE units can carry out neural network convolution calculation, and can carry out vector calculation on the original data or the intermediate data under the control of the vector calculation control unit.
Scheme II: all PE units can perform neural network convolution calculation, and partial PE units can perform vector calculation on original data or intermediate data under the control of the vector calculation control unit.
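A simplified behavioral sketch of such a dual-mode PE unit is given below. It is an illustration only: the class name, mode names and operation set are assumptions for the sketch, not the actual hardware interface of the patent.

```python
class PEUnit:
    """Behavioral sketch of one PE unit that supports both modes (names assumed)."""

    def __init__(self):
        self.mode = "conv"  # "conv": neural network convolution; "vector": CPU-adapted vector calculation

    def set_mode(self, mode):
        assert mode in ("conv", "vector")
        self.mode = mode

    def compute(self, a, b, op="mac", acc=0.0):
        if self.mode == "conv":
            # Convolution mode: multiply-accumulate on a weight/feature-map operand pair.
            return acc + a * b
        # Vector mode: ALU/MAC-style operation issued by the vector calculation control unit.
        if op == "add":
            return a + b
        if op == "mul":
            return a * b
        if op == "mac":
            return acc + a * b
        if op == "max":
            return a if a > b else b
        raise ValueError(f"unsupported vector op: {op}")
```

In scheme I every PE unit would accept both modes; in scheme II only part of the array would expose the vector mode.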
In one embodiment, the raw data includes weight data and feature map data;
the RAM comprises a weight storage module and a feature map storage module;
the weight storage module is used for storing the weight data;
the feature map storage module is used for storing the feature map data.
In the above embodiment, the raw data is the raw data required for the neural network convolution calculation.
In one embodiment, the convolution PE array further comprises a normalization and activation function module and a pooling module;
the PE unit is specifically used for performing convolutional network matrix calculation on the feature map data using the weight data to obtain a first result; or performing, under the control of the vector calculation control unit, vector calculation on the original data or the intermediate data to obtain a first result; or performing, after receiving the vector calculation instruction, vector calculation on data in the vector calculation instruction to obtain a first result;
the normalization and activation function module is used for performing batch normalization calculation and function activation on the first result to obtain a second result;
and the pooling module is used for performing pooling calculation on the second result to obtain intermediate data.
In the above embodiment, the calculation module as a whole performs convolutional network matrix calculation on the feature map data using the weight data, implementing a convolution layer; that is, the calculation module realizes convolution calculation or general matrix calculation. The implementation mainly slides multiple groups of weight data over the feature map data in parallel and computes multiple groups of convolution results, as in the sketch below.
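The sliding-window behavior can be sketched functionally as follows (a software model only; the single channel, stride 1 and absence of padding are assumptions made for brevity):

```python
def conv2d(feature_map, weights):
    """Slide one group of weights over a single-channel feature map (stride 1, no padding)."""
    H, W = len(feature_map), len(feature_map[0])
    K = len(weights)  # square kernel assumed
    out_h, out_w = H - K + 1, W - K + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0
            for ki in range(K):
                for kj in range(K):
                    acc += feature_map[i + ki][j + kj] * weights[ki][kj]
            out[i][j] = acc
    return out

# On the convolution PE array, several weight groups would run this in parallel,
# each group producing its own group of convolution results.
```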
The normalization and activation function module realizes batch normalization (batch norm) and function activation (activation). Batch normalization is mainly a multiply-then-add operation, and in some optimizations the batch normalization calculation is fused into the convolution calculation. Commonly used activation functions include ReLU, sigmoid and tanh. A functional sketch of the multiply-then-add form and of its fusion into the convolution follows.
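The sketch below follows the standard batch-normalization definition; the variable names and the folding helper are assumptions used only to illustrate the multiply-then-add form and the fusion optimization.

```python
import math

def batchnorm(x, gamma, beta, mean, var, eps=1e-5):
    """y = gamma * (x - mean) / sqrt(var + eps) + beta, i.e. scale * x + shift."""
    scale = gamma / math.sqrt(var + eps)
    shift = beta - scale * mean
    return scale * x + shift

def fold_batchnorm_into_conv(weights, gamma, beta, mean, var, eps=1e-5):
    """Fold the per-channel scale into the convolution weights, leaving only the shift
    to be added after the convolution (the fusion optimization mentioned above)."""
    scale = gamma / math.sqrt(var + eps)
    shift = beta - scale * mean
    fused_weights = [[w * scale for w in row] for row in weights]
    return fused_weights, shift

def relu(x):
    """One of the commonly used activation functions (ReLU)."""
    return x if x > 0.0 else 0.0
```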
The pooling module performs pooling on the convolution results. Convolution usually produces a feature map with large dimensions, which is then down-sampled: the feature map is evenly sliced into a number of regions, and the maximum or average value of each region is taken to obtain a new, smaller feature. The most common pooling operations are average pooling (mean) and maximum pooling (max). Average pooling: the average value of each small region of the feature map is taken as the pooled value of that region. Maximum pooling: the maximum value of each small region of the feature map is taken as the pooled value of that region. Both forms are sketched below.
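Both pooling forms can be sketched as one function (the window size equals the stride here, matching the "slice into regions" description; the default size is an assumption):

```python
def pool2d(feature_map, size=2, mode="max"):
    """Slice the feature map into non-overlapping size x size regions and reduce each region
    to its maximum (max pooling) or its average (average pooling)."""
    H, W = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, H - size + 1, size):
        row = []
        for j in range(0, W - size + 1, size):
            region = [feature_map[i + di][j + dj] for di in range(size) for dj in range(size)]
            row.append(max(region) if mode == "max" else sum(region) / len(region))
        pooled.append(row)
    return pooled
```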
The data sorting and transforming module sorts and transforms data. It internally contains a feature map buffer used to process the feature map and change its shape or dimensions so that it meets the input requirements of the next layer of calculation; the rearranged feature map is then sent to the RAM or to an off-chip memory. A sketch of this shape change follows.
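As a sketch of the shape/dimension change (the real module operates on the feature map buffer in hardware; the function below only illustrates the idea, and its name and arguments are assumptions):

```python
def reorder_feature_map(feature_map, new_height, new_width):
    """Flatten the buffered feature map and re-tile it into the shape expected by the
    next layer, before writing it back to the RAM or to off-chip memory."""
    flat = [value for row in feature_map for value in row]
    assert len(flat) == new_height * new_width, "dimensions must match the next layer's input"
    return [flat[r * new_width:(r + 1) * new_width] for r in range(new_height)]
```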
In one embodiment, the vector calculation unit comprises one or any combination of an ALU, a MAC, and a vector parallel acceleration unit.
The vector parallel acceleration unit is a unit which can provide vector calculation besides ALU and MAC.
In an embodiment, at least one PE unit in the convolutional PE array is capable of being reconstructed as a vector computation unit adapted to a CPU;
the vector calculation unit is used for carrying out vector calculation on the original data or the intermediate data; and after receiving the vector calculation instruction, performing vector calculation on data in the vector calculation instruction.
In the above embodiments, two other schemes of heterogeneous acceleration processors are provided.
Scheme III: all PE units can perform neural network convolution calculation, and when vector calculation is needed, all PE units can be reconstructed into vector calculation units. The vector calculation unit can perform vector calculation on the original data or the intermediate data under the control of the vector calculation control unit.
Scheme IV: all PE units can perform neural network convolution calculation, and when vector calculation is needed, some of the PE units can be reconstructed into vector calculation units. The vector calculation unit can perform vector calculation on the original data or the intermediate data under the control of the vector calculation control unit.
In addition, in all four schemes (scheme I to scheme IV) the CPU can send the vector calculation instruction to the calculation module: because a conventional CPU has no vector calculation capability, the heterogeneous acceleration processor of the present invention combines the neural network convolution calculation with the CPU processing flow. Specifically, after the instruction decode stage (ID) the CPU issues the instruction (the vector calculation instruction) to the PE units of the heterogeneous acceleration processor, which then carry out the vector calculation. While the reconstructed vector calculation unit performs the vector calculation, the CPU remains in a state indicating that the vector calculation is in progress. The calculation result of the calculation module is returned to the CPU, mainly at the instruction execute stage (EX). This flow is sketched below.
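The dispatch flow just described can be pictured as follows. This is only a sketch: `issue`, `done` and `result` are hypothetical names standing in for the actual CPU/PE-array handshake, which the patent does not spell out at this level.

```python
def execute_instruction(decoded_instr, compute_module):
    """Sketch: after the instruction decode stage (ID) the CPU forwards a vector
    calculation instruction to the compute module (PE array); the CPU stays in a
    'vector calculation in progress' state until the result returns at the EX stage."""
    if decoded_instr["kind"] == "vector":
        compute_module.issue(decoded_instr["op"], decoded_instr["operands"])
        while not compute_module.done():  # CPU indicates that vector calculation is in progress
            pass
        return compute_module.result()    # result handed back to the CPU at the EX stage
    return None  # ordinary scalar instructions follow the normal execution path
```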
In the first to fourth schemes, the deep fusion of the neural network convolution calculation array resource and the CPU calculation and logic function is realized, so that the operation speed is further improved, and the overall power consumption is reduced.
Example two
In this embodiment, the heterogeneous acceleration processor provided by the invention additionally fuses and multiplexes the RAM resources, forming a novel heterogeneous acceleration processor that shares RAM resources and merges CPU vector calculation with neural network convolution calculation. The RAM is split into multiple groups, each containing the cache or RAM required by one CPU. Once the RAM resources are deeply shared with the CPU, the data results of the neural network calculation can be stored in the RAM and the CPU can directly process the results of the neural network convolution, so the CPU shares calculation results with the neural network processor and performs cooperative calculation without moving data over the bus again.
Fig. 2 is a second schematic diagram of the heterogeneous acceleration processor in an embodiment of the present invention. In an embodiment, the RAM comprises a plurality of RAM groups, and each RAM group comprises a weight storage module and a feature map storage module.
In the above embodiment, in the neural network convolution calculation mode, the resources of all RAM groups can be used simultaneously, as in the sketch below.
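A minimal sketch of one RAM group (the field names and the number of groups are assumptions for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class RAMGroup:
    """One RAM group: a weight storage module plus a feature map storage module."""
    weights: list = field(default_factory=list)
    feature_map: list = field(default_factory=list)

# In neural network convolution calculation mode, all groups can feed the PE array at once.
ram_groups = [RAMGroup() for _ in range(4)]  # four groups assumed for illustration
```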
In one embodiment, the RAM includes an instruction RAM and a data RAM;
the instruction RAM is used for storing instructions; the data RAM is used for storing original data, intermediate data and result data;
the CPU is also used for reading instructions from the instruction RAM and data from the data RAM.
In the above-described embodiment, in the CPU vector calculation mode, the existing RAM can be reconfigured into an instruction RAM (Instruction RAM, IRAM) and a data RAM (Data RAM, DRAM).
The instructions stored in the instruction RAM may be instructions for the vector calculation control unit; vector calculation is performed on the data according to the instructions read. The data may be results produced in the neural network convolution calculation mode, including raw results and intermediate results. The CPU thus directly uses the data stored in the DRAM to perform vector and logic judgment operations, so the CPU can share calculation results with the convolution PE units that perform the neural network convolution calculation and carry out cooperative calculation without moving data over the bus again. A configuration-level sketch of this shared RAM follows.
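The sketch below is a configuration-level model only; the class and method names, and the half-and-half split point, are assumptions used to illustrate the reconfiguration and the zero-copy sharing.

```python
class SharedRAM:
    """One physical RAM pool: weight/feature-map storage in convolution mode, or
    re-partitioned into an instruction RAM (IRAM) and a data RAM (DRAM) in CPU mode."""

    def __init__(self, size):
        self.mem = [0] * size
        self.mode = "conv"          # "conv" or "cpu"
        self.iram_base = 0
        self.dram_base = size // 2  # split point is an assumption

    def reconfigure(self, mode):
        assert mode in ("conv", "cpu")
        self.mode = mode            # no data is moved: convolution results stay in place

    def read_instruction(self, addr):
        assert self.mode == "cpu"
        return self.mem[self.iram_base + addr]

    def write_result(self, addr, value):
        # The convolution PE array writes its results into the data region.
        self.mem[self.dram_base + addr] = value

    def read_data(self, addr):
        # In CPU vector calculation mode the same region is read as the DRAM,
        # so the CPU reuses the convolution results without any bus transfer.
        return self.mem[self.dram_base + addr]
```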
EXAMPLE III
The embodiment of the invention provides a multi-core heterogeneous acceleration processor that fuses multiple groups of the single heterogeneous acceleration processor, forming a novel multi-core heterogeneous acceleration processor integrating the calculation fusion and storage sharing of a multi-core CPU and a multi-core neural network processor. The heterogeneous acceleration processors computing in parallel are grouped and fused into a design suitable for both vector calculation and neural network convolution calculation. By adding CPU instruction-flow logic to the multiple groups of vector calculation units (convolution PE units either reconstructed into vector calculation units or controlled by the vector calculation control unit to perform vector calculation) and to the RAM, the design forms an integrated multi-core CPU processor architecture: when a multi-core CPU is required, for example in a robot autonomous navigation scenario, it is configured as a multi-core CPU vector calculation structure; when neural network calculation is required, it is configured into the neural network convolution calculation mode to perform CNN and DNN calculations.
Fig. 3 is a third schematic diagram of the heterogeneous acceleration processor according to an embodiment of the present invention, in an embodiment, there are a plurality of computing modules, and the plurality of computing modules can perform a neural network convolution calculation or a vector calculation in parallel.
A plurality of heterogeneous acceleration processors form a many-core heterogeneous acceleration processor array, in which each heterogeneous acceleration processor can adopt the schemes of the first and second embodiments.
When large-scale parallel convolution or matrix calculation of the neural network is needed, the heterogeneous acceleration processors execute the neural network convolution calculation, for example the pipelined calculation of convolution, batch norm, activation and pooling operations.
During vector calculation, each RAM is reconstructed into an IRAM and a DRAM, each calculation module is reconstructed into a calculation module containing a vector calculation unit adapted to a CPU, and multi-core vector calculation is carried out on the neural network convolution calculation results, achieving multi-task parallel execution, as in the sketch below.
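A sketch of the multi-core dispatch (the thread pool merely models the cores running in parallel; the function and argument names are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(cores, tasks):
    """Run one task per core in parallel. A task is either a neural network convolution
    layer or a CPU vector-calculation kernel, matching the two modes described above."""
    with ThreadPoolExecutor(max_workers=len(cores)) as pool:
        futures = [pool.submit(core, task) for core, task in zip(cores, tasks)]
        return [f.result() for f in futures]

# Example usage (hypothetical): cores = [pe_array_0.run, pe_array_1.run]; tasks = [conv_layer, vector_kernel]
```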
In summary, the heterogeneous acceleration processor provided in the embodiment of the present invention has the following beneficial effects:
First, the convolution PE array of a traditional neural network processor and a vector calculation unit adapted to a CPU are deeply fused, and a neural network convolution calculation mode or a CPU vector calculation mode is adopted according to the calculation requirement, which reduces the waste of area and power consumption caused by implementing a traditional neural network processor calculation array and a CPU vector design independently.
Second, the resources in the RAM are deeply fused and shared and are configured into the corresponding storage modes in the different computing modes; compared with a separate RAM for an independent neural network processor and a separate RAM for a general-purpose CPU, this saves chip area and reduces chip power consumption.
Third, a multi-core heterogeneous acceleration processor array is realized, combining the calculation fusion and storage sharing of a multi-core CPU and a multi-core neural network processor and allowing multi-task parallel execution, thereby improving running speed and reducing power consumption for specific multi-task scenarios.
The embodiment of the invention also provides a data calculation method based on the heterogeneous acceleration processor, as described in the following embodiments. Because the principle by which the method solves the problem is similar to that of the heterogeneous acceleration processor, the implementation of the method can refer to the implementation of the heterogeneous acceleration processor, and repeated details are not described again.
Fig. 4 is a first flowchart of a data calculation method based on a heterogeneous acceleration processor according to an embodiment of the present invention, including:
step 401, after receiving a neural network convolution calculation instruction, performing neural network convolution calculation on the original data stored in the storage module RAM through the calculation module to obtain intermediate data, and performing sorting and transformation on the intermediate data through the data sorting and transforming module to obtain result data;
step 402, after receiving a vector calculation instruction, controlling the calculation module through the vector calculation control unit to perform vector calculation to obtain intermediate data, and performing sorting and transformation on the intermediate data through the data sorting and transforming module to obtain result data.
In one embodiment, the obtaining of the intermediate data by performing a neural network convolution calculation on the raw data stored in the RAM by the calculation module includes:
performing convolutional network matrix calculation on the feature map data using the weight data to obtain a first result;
performing batch normalization calculation and function activation on the first result to obtain a second result;
and performing pooling calculation on the second result to obtain intermediate data.
In one embodiment, the controlling the calculation module to perform vector calculation by the vector calculation control unit to obtain the intermediate data includes:
under the control of the vector calculation control unit, carrying out vector calculation on the original data or the intermediate data to obtain a first result; or after receiving the vector calculation instruction, carrying out vector calculation on data in the vector calculation instruction to obtain a first result;
performing batch normalization calculation and function activation on the first result to obtain a second result;
and performing pooling calculation on the second result to obtain intermediate data.
Fig. 5 is a second flowchart of a data calculation method based on a heterogeneous acceleration processor according to an embodiment of the present invention, where in an embodiment, the method further includes:
step 501, after receiving a vector calculation instruction, reconstructing at least one PE unit in the convolution PE array into a vector calculation unit adapted to a CPU;
step 502, the vector calculation control unit controls the vector calculation unit in the calculation module to perform vector calculation, so as to obtain intermediate data.
Fig. 6 is a flow chart three of the data calculation method based on the heterogeneous acceleration processor in the embodiment of the present invention, and in an embodiment, the method further includes:
601, after receiving a neural network convolution calculation instruction, performing neural network convolution calculation in parallel through a plurality of heterogeneous acceleration processors;
and step 602, after receiving the vector calculation instruction, performing vector calculation in parallel by a plurality of heterogeneous accelerated processors.
In summary, the data calculation method based on the heterogeneous acceleration processor provided in the embodiment of the present invention has the following beneficial effects:
First, the convolution PE array of a traditional neural network processor and a vector calculation unit adapted to a CPU are deeply fused, and a neural network convolution calculation mode or a CPU vector calculation mode is adopted according to the calculation requirement, which reduces the waste of area and power consumption caused by implementing a traditional neural network processor calculation array and a CPU vector design independently.
Second, the resources in the RAM are deeply fused and shared and are configured into the corresponding storage modes in the different computing modes; compared with a separate RAM for an independent neural network processor and a separate RAM for a general-purpose CPU, this saves chip area and reduces chip power consumption.
Third, a multi-core heterogeneous acceleration processor array is realized, combining the calculation fusion and storage sharing of a multi-core CPU and a multi-core neural network processor and allowing multi-task parallel execution, thereby improving running speed and reducing power consumption for specific multi-task scenarios.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A heterogeneous acceleration processor, comprising: the device comprises a calculation module, a storage module RAM, a CPU and a data sorting and transforming module, wherein the calculation module comprises a convolution PE array and a vector calculation control unit;
the storage module RAM is used for storing original data, intermediate data and result data;
the CPU is used for sending the vector calculation instruction to the calculation module;
the convolution PE array comprises a plurality of PE units, and the PE units are used for carrying out neural network convolution calculation on original data;
the data sorting and transforming module is used for sorting and transforming the intermediate data output by the computing module to obtain result data;
the vector calculation control unit is used for controlling at least one PE unit to carry out vector calculation;
the PE unit is also used for carrying out vector calculation on the original data or the intermediate data under the control of the vector calculation control unit; and after receiving the vector calculation instruction, performing vector calculation on data in the vector calculation instruction.
2. The heterogeneous acceleration processor of claim 1, wherein the raw data comprises weight data and feature map data;
the RAM comprises a weight storage module and a feature map storage module;
the weight storage module is used for storing the weight data;
the feature map storage module is used for storing the feature map data.
3. The heterogeneous acceleration processor of claim 2, wherein the convolutional PE array further comprises a normalization and activation function module, a pooling module;
the PE unit is specifically used for performing convolutional network matrix calculation on the feature map data using the weight data to obtain a first result; or performing, under the control of the vector calculation control unit, vector calculation on the original data or the intermediate data to obtain a first result; or performing, after receiving the vector calculation instruction, vector calculation on data in the vector calculation instruction to obtain a first result;
the normalization and activation function module is used for performing batch normalization calculation and function activation on the first result to obtain a second result;
and the pooling module is used for performing pooling calculation on the second result to obtain intermediate data.
4. The heterogeneous acceleration processor of claim 1, wherein the vector calculation unit comprises one or any combination of an ALU, a MAC, and a vector parallel acceleration unit.
5. The heterogeneous acceleration processor of claim 1, wherein at least one PE unit in the convolution PE array is reconfigurable as a vector calculation unit adapted to the CPU;
the vector calculation unit is used for performing vector calculation on the original data or the intermediate data; and after receiving the vector calculation instruction, performing vector calculation on data in the vector calculation instruction.
6. The heterogeneous acceleration processor of claim 2, wherein the RAM comprises a plurality of RAM groups, each RAM group comprising a weight storage module and a feature map storage module.
7. The heterogeneous acceleration processor of claim 1, wherein the RAM includes an instruction RAM and a data RAM;
the instruction RAM is used for storing instructions; the data RAM is used for storing original data, intermediate data and result data;
the CPU is also used for reading instructions from the instruction RAM and data from the data RAM.
8. The heterogeneous acceleration processor of claim 1, wherein there are a plurality of the computation modules, the plurality of computation modules capable of performing neural network convolution computations or vector computations in parallel.
9. A data computing method based on a heterogeneous acceleration processor, which is applied to the heterogeneous acceleration processor of any one of claims 1 to 8, and comprises the following steps:
after receiving a neural network convolution calculation instruction, carrying out neural network convolution calculation on original data stored in a storage module RAM through a calculation module to obtain intermediate data, and carrying out sorting transformation on the intermediate data through a data sorting transformation module to obtain result data;
after receiving the vector calculation instruction, the vector calculation control unit controls the calculation module to perform vector calculation to obtain intermediate data, and the data sorting and conversion module sorts and converts the intermediate data to obtain result data.
10. The method of claim 9, further comprising:
after receiving a vector calculation instruction, reconstructing at least one PE unit in the convolution PE array into a vector calculation unit adapted to a CPU;
and controlling a vector calculation unit in the calculation module to perform vector calculation through a vector calculation control unit to obtain intermediate data.
CN202210132954.2A 2022-02-14 2022-02-14 Heterogeneous acceleration processor and data calculation method Pending CN114595813A (en)

Priority Applications (1)

Application Number: CN202210132954.2A; Priority Date: 2022-02-14; Filing Date: 2022-02-14; Title: Heterogeneous acceleration processor and data calculation method

Applications Claiming Priority (1)

Application Number: CN202210132954.2A; Priority Date: 2022-02-14; Filing Date: 2022-02-14; Title: Heterogeneous acceleration processor and data calculation method

Publications (1)

Publication Number: CN114595813A (en); Publication Date: 2022-06-07

Family

ID=81807000

Family Applications (1)

Application Number: CN202210132954.2A; Status: Pending; Publication: CN114595813A (en); Title: Heterogeneous acceleration processor and data calculation method

Country Status (1)

Country Link
CN (1) CN114595813A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116962176A (en) * 2023-09-21 2023-10-27 浪潮电子信息产业股份有限公司 Data processing method, device and system of distributed cluster and storage medium
CN116962176B (en) * 2023-09-21 2024-01-23 浪潮电子信息产业股份有限公司 Data processing method, device and system of distributed cluster and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination