CN113240074B - Reconfigurable neural network processor - Google Patents

Reconfigurable neural network processor

Info

Publication number: CN113240074B
Authority: CN (China)
Prior art keywords: calculation, neural network, computing, array, unit
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Application number: CN202110407002.2A
Other languages: Chinese (zh)
Other versions: CN113240074A
Inventors: 陈亮, 徐东君, 宋文娜
Current Assignee: Institute of Automation of Chinese Academy of Science
Original Assignee: Institute of Automation of Chinese Academy of Science
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN202110407002.2A
Publication of application: CN113240074A
Application granted, published as CN113240074B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The invention provides a reconfigurable neural network processor comprising an instruction compiling module, a model mapping module, a calculation array control module and a calculation array. The instruction compiling module compiles each neural network application program to be run into network operation instructions. The model mapping module matches the micro-operation codes corresponding to the network operation instructions and, by indexing the micro-operation codes, maps the corresponding neural network application programs onto the calculation array, obtaining a computing unit set for each neural network application program on the array. The calculation array control module controls the reading, writing and computation of each computing unit set for its corresponding neural network application program. The processor supports parallel accelerated computation of multiple neural networks as well as cooperative computation among them, improving the utilization of computing resources and the parallel processing capability for neural networks.

Description

Reconfigurable neural network processor
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a reconfigurable neural network processor.
Background
The development and application of neural networks continuously drive progress in artificial intelligence, and neural networks are widely used in image, speech and text processing. To address the computation speed and power consumption problems of complex neural networks, hardware accelerators for neural networks have drawn increasing attention. However, a traditional CPU (Central Processing Unit) emphasizes functional versatility and complete control, so its architecture must support complex instructions and frequent interrupt switching, and its performance in accelerating complex neural networks is therefore poor.
At present, to improve the parallel computing capability of neural networks, there are various accelerated computing methods and architectures based on the GPU (Graphics Processing Unit), FPGA (Field Programmable Gate Array) and ASIC (Application Specific Integrated Circuit), but each has drawbacks. In large-scale neural network training and inference, the large number of SIMD (Single Instruction Multiple Data) stream computing resources in a GPU greatly improves parallel acceleration capability, but the computing resources are highly redundant and the power consumption is large, which becomes a bottleneck for further performance improvement. FPGA-based accelerators are less flexible than GPUs in algorithm support and require relatively long hardware programming time, although their resource utilization and energy efficiency are relatively high. Hardware accelerators based on ASICs offer good performance and power consumption, but to reach higher energy efficiency they usually adopt relatively fixed neural network task models, which makes it difficult to keep up with algorithm update iterations.
The architecture of a reconfigurable computing chip offers configurability similar to an FPGA while approaching an ASIC in performance and power consumption. However, most existing reconfigurable neural network acceleration architectures focus only on supporting a single-network computing task under different configurations and cannot support parallel computing of multiple neural networks.
Disclosure of Invention
The invention provides a reconfigurable neural network processor to overcome the defect that the prior art cannot support parallel computation of multiple neural networks.
The invention provides a reconfigurable neural network processor, which comprises an instruction compiling module, a model mapping module, a calculation array control module and a calculation array;
the instruction compiling module is used for compiling each neural network application program to be operated into a network operation instruction;
the model mapping module is used for matching a micro-operation code corresponding to a network operation instruction, and mapping a corresponding neural network application program to the computing array by indexing the micro-operation code to obtain a computing unit set of each neural network application program on the computing array; any one computing unit set is used for independently computing the computing tasks of the corresponding neural network application programs, or is used for cooperatively computing the computing tasks of the respective corresponding neural network application programs with the rest computing unit sets, and the computing unit sets share a storage space during cooperative computing;
and the computing array control module is used for controlling the reading and writing and the computation of each computing unit set aiming at the corresponding neural network application program.
According to the reconfigurable neural network processor provided by the invention, each calculation unit set is composed of a plurality of adjacent calculation columns, and the calculation columns are one column of calculation units in the calculation array;
shift buffers are respectively arranged at two sides of each computing unit;
when any calculation column independently forms a calculation unit set, the calculation column is used for independently executing the corresponding neural network application program, and the shift buffers on two sides of each calculation unit in the calculation column are in an enabling state;
when a plurality of adjacent calculation columns form a calculation unit set, the shift buffer at the edge side of the calculation column at the edge of the calculation unit set is in an enabled state, the shift buffer at the inner side of the calculation column at the edge of the calculation unit set is in a disabled state, and the shift buffers at two sides of the calculation column in the calculation unit set are in a disabled state.
According to the reconfigurable neural network processor provided by the invention, the computing unit is used for carrying out convolution operation of a corresponding neural network application program based on the shift operation of the characteristic operand and the corresponding convolution kernel value;
the feature operands are derived based on a feature map of the corresponding neural network application.
According to the reconfigurable neural network processor provided by the invention, the computing unit is a computing core array;
each computing core in the computing core array comprises a basic operation set, and the computing core is used for selecting corresponding computing operation from the basic operation set based on a received computing control signal and executing the corresponding computing operation.
According to the reconfigurable neural network processor provided by the invention, the computing core array is provided with two rows; the calculation unit is used for performing convolution operation and activation function calculation of two lines of data at the same time.
According to the reconfigurable neural network processor provided by the invention, the calculation array control module comprises a plurality of control units, and the number of the control units is the same as the number of columns of the calculation array;
and each control unit is used for controlling the reading and writing and the calculation of the calculation units in the corresponding column aiming at the corresponding neural network application program.
According to the reconfigurable neural network processor provided by the present invention, the processor further comprises:
the storage module is used for caching weight and source data required by the computing array to execute each neural network application program;
the calculation array control module is also used for controlling the read-write operation of the storage module.
According to the reconfigurable neural network processor provided by the invention, the storage module comprises an in-column selector which is used for selectively receiving the output data of the calculation array;
the storage module is further configured to buffer the output data received by the in-column selector.
According to the reconfigurable neural network processor provided by the invention, the storage module further comprises a data cache unit and an output cache unit; the data buffer unit and the output buffer unit are ping-pong buffer structures and are alternately used for accessing the source data and the output data received by the in-column selector.
According to the reconfigurable neural network processor provided by the invention, the data cache unit comprises a shared cache unit, and the shared cache unit is used for storing shared data of a plurality of computing unit sets;
the calculation array control module is further configured to control the multiple calculation unit sets to perform cooperative calculation of the corresponding neural network application program based on the shared data in the shared cache unit.
According to the reconfigurable neural network processor provided by the invention, the calculation array is grouped and reconfigured into computing unit sets that each compute for a corresponding neural network application program. This supports both parallel accelerated computation of multiple neural networks and their cooperative computation, avoids the repeated configuration or even redesign of the hardware structure that existing hardware accelerators require when accelerating multiple neural networks, and allows the amount of resources occupied by each group to be adjusted dynamically according to the computing requirements of the networks, thereby improving the utilization of computing resources and the parallel processing capability for neural networks.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic structural diagram of a reconfigurable neural network processor provided by an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of computing unit sets according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating an operation mode of a calculation column according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a compute array provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of convolution and matrix multiplication operations provided by an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a computing unit provided in an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a memory module according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a reconfigurable neural network processor provided by an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a control unit according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The architecture of a reconfigurable computing chip offers configurability similar to an FPGA (Field Programmable Gate Array), but with fewer configuration items, faster configuration and less redundant resource, while approaching an ASIC (Application Specific Integrated Circuit) in performance and power consumption. It therefore strikes a better balance between network adaptability and high energy efficiency.
With the development of artificial intelligence cloud servers and edge computing terminals, the demand for parallel acceleration of multiple neural networks, splitting of complex networks and multi-network collaborative computing is growing. Improving hardware resource utilization and energy efficiency under such multitask parallel computing is particularly important for neural network processors in edge computing terminals, where computing resources are relatively scarce. In view of the above, embodiments of the present invention provide a reconfigurable neural network processor to improve the parallel acceleration capability for multiple types of neural networks.
Fig. 1 is a schematic structural diagram of a reconfigurable neural network processor provided by an embodiment of the present invention, and as shown in fig. 1, the processor includes an instruction compiling module 110, a model mapping module 120, a compute array control module 130, and a compute array 140;
the instruction compiling module 110 is configured to compile each neural network application program to be run into a network operation instruction;
the model mapping module 120 is configured to match a micro-operation code corresponding to the network operation instruction, and map the corresponding neural network application program to the calculation array 140 by indexing the micro-operation code, so as to obtain a calculation unit set of each neural network application program on the calculation array 140; any one computing unit set is used for independently computing the computing tasks of the corresponding neural network application programs, or is used for cooperatively computing the computing tasks of the respective corresponding neural network application programs with the rest of the computing unit sets, and the computing unit sets share a storage space during cooperative computing;
the calculation array control module 130 is configured to control reading, writing and calculation of each calculation unit set for a corresponding neural network application program.
Herein, a neural network application refers to a computer program developed to accomplish neural network model computation tasks. By operating the neural network application program, the calculation task of the neural network model corresponding to the neural network application program can be completed.
The computing array 140 is composed of a plurality of computing units, and can be reconstructed by grouping the computing units to support parallel accelerated computing of a plurality of neural networks. The number of rows R and the number of columns C of the calculation array 140 may be set arbitrarily according to user requirements, for example, R may be 8, C may be 16, and this is not specifically limited in this embodiment of the present invention.
When several neural network applications need to be run but only one can be executed at a time, calculation efficiency is low. To address this, in the embodiment of the present invention each neural network application to be run is compiled into network operation instructions by the instruction compiling module 110, and the model mapping module 120 maps each application onto the calculation hardware, i.e., the calculation array 140. The calculation units are combined according to the reconstruction information to form calculation unit sets. Any calculation unit set may independently compute the tasks of its corresponding neural network application, or may cooperate with the remaining calculation unit sets on their respective applications, the sets sharing a storage space during cooperative calculation. By dynamically reconfiguring the calculation array in this way, the processor can support parallel calculation of multiple neural network tasks, and can also realize cooperative calculation between data-associated neural networks through shared storage, thereby increasing calculation speed.
Here, the number N of groups of network operation instructions, i.e., the number of groups of neural network tasks to be computed in parallel, may be set arbitrarily according to user requirements. For example, if a user needs to compute 4 groups of neural network tasks in parallel, the neural network applications to be run are compiled into 4 groups of network operation instructions. After each group of network operation instructions is obtained, it is stored in an instruction cache so that the model mapping module 120 can read it.
After the model mapping module 120 reads the N sets of network operation instructions, the mapping from each neural network application to the computing array 140 may be specifically implemented by the following steps: and obtaining the micro-operation codes corresponding to the N sets of network operation instructions through matching to obtain N sets of micro-operation codes, and then combining the computing units corresponding to each set of micro-operation codes into a computing unit set through indexing the corresponding relationship between the micro-operation codes and the computing units in the computing array 140 to finally obtain N computing unit sets. Here, the micro operation code refers to a control code of a specific operation process of the computing array 140, each micro operation code is configured with its corresponding computing unit structure in advance, the computing unit structure may include one or more computing units, and specifically relates to an internal configuration of a single computing unit and a joint configuration between multiple computing units, and its corresponding computing unit structure may be used to implement an operation indicated by the micro operation code.
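Purely as an illustration of the mapping flow just described, and not of the patented implementation itself, the sketch below models how groups of network operation instructions could be matched to micro-operation codes and folded into computing unit sets of adjacent columns. The opcode table, column counts, function and class names are invented for the example.

```python
# Illustrative model of the mapping step: each group of network operation
# instructions is matched to micro-operation codes, and the micro-op codes
# index a pre-configured computing-unit structure (simplified here to a
# number of adjacent columns) that forms one computing-unit set.
from dataclasses import dataclass

# Hypothetical micro-op table: opcode -> number of adjacent columns it needs.
MICRO_OP_COLUMNS = {"CONV3x3": 2, "MATMUL": 3, "POOL": 1}

@dataclass
class ComputeUnitSet:
    network_id: int
    columns: list[int]        # indices of adjacent columns in the array

def map_networks(instruction_groups, total_columns=16):
    """Map N groups of network operation instructions onto the C-column array."""
    sets, next_col = [], 0
    for net_id, ops in enumerate(instruction_groups):
        need = max(MICRO_OP_COLUMNS[op] for op in ops)   # widest micro-op wins
        if next_col + need > total_columns:
            raise RuntimeError("compute array exhausted")
        sets.append(ComputeUnitSet(net_id, list(range(next_col, next_col + need))))
        next_col += need
    return sets

# Example: three networks mapped in parallel onto a 16-column array.
print(map_networks([["CONV3x3", "POOL"], ["MATMUL"], ["CONV3x3"]]))
```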
Then, the calculation array control module 130 controls the reading and writing and calculation of each calculation unit set for the corresponding neural network application program according to each group of the micro operation codes, thereby realizing the parallel calculation of a plurality of neural network tasks. Here, a specific control manner may be to control the entire computing unit set, or to individually control the computing unit set in rows or columns, or to individually control each computing unit in the computing unit set, which is not specifically limited in this embodiment of the present invention.
The processor provided by the embodiment of the invention can obtain each computing unit set capable of computing aiming at a corresponding neural network application program by performing grouping reconstruction on the computing array, can support parallel accelerated computing of a plurality of neural networks and cooperative computing of the plurality of neural networks, solves the problem that the hardware structure of the existing hardware accelerator needs to be configured for multiple times or even redesigned when the plurality of neural networks are accelerated, and can dynamically adjust the quantity of resources occupied by each group according to the operation requirement of the neural networks, thereby improving the utilization rate of computing resources and the parallel processing capability of the neural networks.
Based on any one of the above embodiments, each calculation unit set is composed of a plurality of adjacent calculation columns, and the calculation columns are calculation units in a column of the calculation array;
shift buffers are respectively arranged on two sides of each computing unit;
when any calculation column independently forms a calculation unit set, any calculation column is used for independently executing the corresponding neural network application program, and the shift buffers on two sides of each calculation unit in any calculation column are in an enabling state;
when a plurality of adjacent calculation columns form a calculation unit set, the shift buffers on the edge side of the calculation column at the edge of the calculation unit set are in an enabled state, the shift buffers on the inner side of the calculation column at the edge of the calculation unit set are in a disabled state, and the shift buffers on the two sides of the calculation column in the calculation unit set are in a disabled state.
Specifically, to facilitate reconstruction and control of the calculation unit sets, in the embodiment of the present invention each calculation unit set is controlled separately. By indexing the micro-operation codes, adjacent calculation units belonging to the same neural network calculation task are grouped into the same calculation unit set; a single set occupies at least one calculation column and at most C calculation columns, so that each neural network calculation task finally obtains its own calculation unit set. Fig. 2 is a schematic structural diagram of the computing unit sets, where the computing unit is labeled PE_ROW and the shift buffer is labeled B. As shown in fig. 2, each computing unit set forms its own computing task area, and the task areas are reconstructed and controlled independently, so that multiple neural network computing tasks can be computed in parallel. Here, a calculation column is a column of calculation units in the calculation array.
Further, fig. 3 is a schematic diagram of the operation mode of a calculation column according to an embodiment of the present invention. As shown in fig. 3, the calculation units PE_ROW are controlled and computed column by column as follows: operand 0 (g0) contains m numbers and is broadcast to all PE_ROW units in a column, each number corresponding to one compute core in a PE_ROW; operand 1 ({s0, s1, …, sR}) contains R numbers, each corresponding to one PE_ROW; the calculation control is broadcast column-wise to all PE_ROWs to schedule the calculation process; and a PE_ROW start control signal is set independently, with R bits for the whole column, so that when the control signal is valid the PE_ROW computes according to its control bit, which improves the control flexibility of the PE_ROWs and reduces idle power consumption.
Two adjacent calculation columns may belong to the same calculation unit set or to different sets. When they belong to different sets, that is, when different calculation tasks are grouped separately, the two adjacent columns may interfere with each other. To address this, as shown in the schematic structural diagram of the calculation array in fig. 4, shift buffers B are disposed on the left and right sides of each calculation unit, and whether each shift buffer in a calculation column is enabled is determined by the adjacency relation of that column within the calculation unit set to which it belongs:
when any calculation column independently forms a calculation unit set, namely the calculation column is used for independently executing the corresponding neural network application program, the shift buffers on two sides of each calculation unit in the calculation column can be in an enabling state, so that the calculation column and other calculation unit sets are independent and do not interfere with each other;
when a plurality of adjacent calculation columns form a calculation unit set, the shift buffers on the edge side of the calculation columns at the edge of the calculation unit set are in an enabled state, the shift buffers on the inner side of the calculation columns at the edge of the calculation unit set are in a disabled state, and the shift buffers on the two sides of the calculation columns in the calculation unit set are in a disabled state, so that the calculation columns at the edge of the calculation unit set and other calculation unit sets are independent and do not interfere with each other, and the calculation columns adjacent to the calculation columns and belonging to the calculation unit set can be directly connected through shifting.
Further, a set of shift buffers disposed on both sides of each PE_ROW may buffer two operands for boundary buffering during the shifting of convolution operations. Data can be transferred between left and right adjacent PE_ROWs by shifting.
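The enable rule above can be summarized in a short sketch. This is an assumption-laden illustration rather than the patent's circuit; in particular, treating the outermost array boundary as always enabled and the function name are choices made only for the example.

```python
# Illustrative enable rule for the shift buffers on the left/right side of
# each column: buffers facing a different computing-unit set are enabled
# (to isolate the groups), buffers facing a column of the same set are
# disabled (so data can shift directly between neighbouring columns).
def shift_buffer_states(group_of_column):
    """group_of_column[i] is the computing-unit set owning column i.
    Returns (left_enabled, right_enabled) per column."""
    C = len(group_of_column)
    states = []
    for i, g in enumerate(group_of_column):
        left_en = (i == 0) or (group_of_column[i - 1] != g)
        right_en = (i == C - 1) or (group_of_column[i + 1] != g)
        states.append((left_en, right_en))
    return states

# Example: columns 0-1 belong to set A, column 2 alone forms set B.
print(shift_buffer_states(["A", "A", "B"]))
# -> [(True, False), (False, True), (True, True)]
```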
Based on any of the above embodiments, the calculation unit is configured to perform convolution operation of a corresponding neural network application based on shift operation of a feature operand and a corresponding convolution kernel value;
the feature operands are derived based on a feature map of the corresponding neural network application.
Here, considering that a large number of convolution operations are included in the neural network computing task, in the embodiment of the present invention, each computing unit is combined in a column to perform convolution operations corresponding to the neural network application program, so that R computing units in one computing column can synchronously perform parallel computing of R convolution kernels, thereby improving the efficiency of convolution operations.
Fig. 5 is a schematic diagram of the convolution and matrix multiplication operations provided in an embodiment of the present invention. As shown in fig. 5 (left), a feature operand, namely operand 0, is first obtained from the feature map (Fmap) of the neural network application program corresponding to the computing unit PE_ROW, and the values at the corresponding positions of the convolution kernels (Filter {0-R}) are combined into operand 1 ({s0, s1, …, sR} in fig. 5). On this basis, each PE_ROW performs convolution as follows: two rows of the Fmap are input into the PE_ROW step by step through the shift operation of operand 0, for example ROW0 and ROW1 first, then ROW2 and ROW3; meanwhile, the convolution kernel at the position corresponding to the PE_ROW feeds its kernel values into the PE_ROW in the order indicated by the reverse-S dashed line in fig. 5; on each input the PE_ROW performs a multiplication, thereby completing the convolution over each row of data in operand 0, so that each PE_ROW finally obtains the convolution results of two rows.
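The following is a minimal one-dimensional analogue of this shift-based convolution, included only to make the dataflow concrete. It abstracts away the two-row organization and the PE_ROW hardware and simply accumulates partial products while the feature operand is shifted past serially supplied kernel values; the function name and sizes are arbitrary.

```python
import numpy as np

# Simplified 1-D analogue of the shift-based convolution: the feature
# operand is shifted one step per kernel value while the kernel values are
# fed in one by one, and a running partial sum is accumulated.
def shift_conv_1d(feature, kernel):
    out_len = len(feature) - len(kernel) + 1
    acc = np.zeros(out_len)
    for k, w in enumerate(kernel):               # kernel values arrive serially
        acc += w * feature[k:k + out_len]        # operand window after k shifts
    return acc

feature = np.arange(8.0)        # stand-in for one row of the feature map
kernel = np.array([1.0, 0.5, -1.0])
assert np.allclose(shift_conv_1d(feature, kernel),
                   np.convolve(feature, kernel[::-1], mode="valid"))
```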
Further, besides convolution, matrix multiplication is also at the core of neural network computation. As shown in fig. 5 (right), the embodiment of the present invention also improves the existing matrix multiplication method. To compute the matrix product X × Y, one row of matrix Y corresponds to operand 0 in fig. 3 and is broadcast to the PE_ROWs of a column; matrix Y is read row by row and fed into the PE_ROWs during operation, as shown by the downward dashed arrow in fig. 5. Meanwhile, matrix X is read column by column as operand 1, as shown by the rightward dashed arrow in fig. 5. Each PE_ROW then receives two values of operand 1, so one column of PE_ROWs can process 2R values of one column of matrix X in parallel, and at most 2R rows of results can be obtained in one round of computation.
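As a rough illustration of this column-wise dataflow, again a simplification with R, m and the function name chosen arbitrarily, the sketch below broadcasts one row of Y per round while each PE_ROW consumes two elements of the current column of X and accumulates a 2 × m block of partial products.

```python
import numpy as np

# Simplified model of the column-wise matrix multiply: per round one row of Y
# is broadcast to every PE_ROW in the column, while each PE_ROW receives the
# two elements of the current X column that belong to its two output rows.
def column_matmul(X, Y, R=4, m=4):
    """X: (2R, K), Y: (K, m). One column of R PE_ROWs, 2 output rows per PE_ROW."""
    K = Y.shape[0]
    acc = np.zeros((2 * R, m))
    for k in range(K):                       # one broadcast round per row of Y
        y_row = Y[k]                         # operand 0: broadcast to all PE_ROWs
        x_col = X[:, k]                      # operand 1: X read column by column
        for r in range(R):                   # each PE_ROW owns rows 2r and 2r+1
            acc[2 * r]     += x_col[2 * r] * y_row
            acc[2 * r + 1] += x_col[2 * r + 1] * y_row
    return acc

X = np.random.rand(8, 5)   # 2R = 8 rows
Y = np.random.rand(5, 4)   # m = 4 columns
assert np.allclose(column_matmul(X, Y), X @ Y)
```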
Based on any of the above embodiments, the computing unit is a compute core array;
each computing core in the computing core array comprises a basic operation set, and the computing cores are used for selecting corresponding computing operation from the basic operation set based on received computing control signals and executing the corresponding computing operation.
Here, unlike the prior art, in which simple neural network computation tasks are realized only by simple addition and multiplication, in the embodiment of the present invention each computation core in each computation unit is provided with a basic operation set. The basic operation set may include any of the basic operations required in neural network computation, such as multiplication, addition, comparison and shift. On this basis, each computation core selects and executes the corresponding computation operation from the basic operation set according to the received calculation control signal.
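A toy sketch of such a reconfigurable core follows. The opcode names and the exact contents of the basic operation set are assumptions for illustration, since the description only lists multiplication, addition, comparison and shift as examples.

```python
# Illustrative reconfigurable core: a small basic-operation set and a control
# signal that selects which operation the core performs this cycle.
BASIC_OPS = {
    "MUL": lambda a, b: a * b,
    "ADD": lambda a, b: a + b,
    "MAX": lambda a, b: max(a, b),     # comparison, e.g. for max pooling
    "SHL": lambda a, b: a << b,        # shift
}

def core_step(ctrl, operand_a, operand_b):
    """Execute the operation selected by the calculation control signal."""
    return BASIC_OPS[ctrl](operand_a, operand_b)

print(core_step("MUL", 3, 4))   # 12
print(core_step("MAX", 3, 4))   # 4
```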
The processor provided by the embodiment of the invention can complete various basic operations in the neural network according to the configuration information through each computing unit, the plurality of computing units can reconstruct and complete the complete computation of the neural network, and the computing process of the computing array can be reconfigured to realize the computing process supporting various types of neural networks.
Based on any of the above embodiments, the compute kernel array is provided with two rows; the calculation unit is used for performing convolution operation and activation function calculation of two lines of data at the same time.
Specifically, the computing unit may be arranged as an array with two rows of computing cores. Fig. 6 is a schematic structural diagram of the computing unit according to the embodiment of the present invention, where the computing cores are labeled Core. As shown in fig. 6, each PE_ROW is composed of 2 × m Cores, each Core is a reconfigurable structure, and each round of computation can complete operations such as convolution and activation function calculation for 2 × m points. Here m, i.e. the number of columns of the computing core array, may be set arbitrarily as needed, for example m may be 16, which is not specifically limited in this embodiment of the present invention.
Further, all Cores may be configured to the same calculation process and synchronize their calculation and output of results. The PE_ROW comprises a basic operation set, a register group and a selector: the register group stores intermediate calculation results or data transmitted by adjacent PE_ROWs; the selector selects suitable data from the source data and the register group as operands according to the calculation control signals, selects the corresponding calculation operations from the basic operation set, performs the calculation, stores the results into the register group and controls the data output.
The PE_ROW can adopt a partial-sum temporary storage scheme within a fixed Core: partial sums are computed multiple times in the fixed Core, each partial sum is temporarily stored, all partial sums are finally accumulated, and the convolution result corresponding to each Core is then output. This exploits the reusability of a computing Core and the independence of output channels, and reduces the bandwidth requirement. In addition, under control selection the PE_ROW can also apply maximum or average pooling to reduce the output to m/2 results, effectively reducing the data output and the buffering load of intermediate results.
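The partial-sum and pooling behaviour can be pictured with the following sketch, which is illustrative only: partial results are accumulated in a local register before a single output is emitted, and an optional pairwise pooling stage halves the m outputs. The function names and sizes are invented for the example.

```python
import numpy as np

# Sketch of the partial-sum scheme: a fixed core keeps accumulating partial
# convolution results (e.g. across input channels) in a local register and
# only emits the final sum, so intermediate results never leave the core.
def fixed_core_accumulate(partial_results):
    acc = 0.0                       # local register for the running partial sum
    for p in partial_results:       # each p is one partial convolution result
        acc += p
    return acc                      # the only value written out

def pool_pairs(row, mode="max"):
    """Optional pooling stage: reduce m outputs to m/2 before buffering."""
    pairs = zip(row[0::2], row[1::2])
    if mode == "max":
        return [max(a, b) for a, b in pairs]
    return [(a + b) / 2 for a, b in pairs]

channels = np.random.rand(16)       # 16 per-channel partial sums
assert np.isclose(fixed_core_accumulate(channels), channels.sum())
print(pool_pairs([1, 5, 2, 2, 7, 3, 0, 4]))        # -> [5, 2, 7, 4]
```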
Based on any one of the above embodiments, the calculation array control module includes a plurality of control units, and the number of the control units is the same as the number of columns of the calculation array;
and each control unit is used for controlling the reading and writing and the calculation of the calculation units in the corresponding column aiming at the corresponding neural network application program.
Specifically, considering that each computing unit set is composed of a plurality of adjacent computing columns, in order to facilitate control of each computing unit set, a plurality of control units may be disposed in the computing array control module, the number of the control units is the same as the number of the columns of the computing array, each control unit corresponds to one column of the computing array, and the computing units in the corresponding column may be controlled according to the micro-operation code to perform reading and writing and computing operations on the corresponding neural network application program.
Based on any of the above embodiments, the processor further comprises:
the storage module is used for caching weight and source data required by the calculation array to execute each neural network application program;
the calculation array control module is also used for controlling the read-write operation of the storage module.
Specifically, in consideration of the fact that the weight and the source data are needed in the process of executing each neural network application program by the computing array, a storage module is arranged in the processor and used for caching the weight and the source data needed by the computing array to execute each neural network application program, and the read-write operation of the storage module can be completed according to the operation of the computing array control module in the computing process.
Further, the storage module may be further configured to cache output data after the execution of the compute array is completed. The Memory module may be composed of a large number of SRAM (Static Random-Access Memory) Memory cells and control logic, and the source of the initial data may be read from an off-chip Memory space in a DMA (Direct Memory Access) manner.
According to any of the above embodiments, the memory module includes an in-column selector for selectively receiving output data of the compute array;
the storage module is further configured to buffer the output data received by the in-column selector.
In particular, to relieve buffering pressure, the storage module can selectively receive the output data of the computing array through the in-column selector, so that the output data of the computing array are partially buffered. Further, according to the number of columns of the calculation array, C groups of in-column selectors may be provided, each group regulating the result output of the multiple PE_ROWs in its corresponding calculation column. The regulation is typically adjusted to the cache bandwidth: for example, if there are 8 PE_ROWs in one calculation column and the output cache can only receive the result of one PE_ROW at a time, the in-column selector performs an 8-to-1 selection; when the results of all 8 PE_ROWs in a column can be accepted at once, no in-column selector is needed.
Based on any of the above embodiments, the storage module further includes a data cache unit and an output cache unit; the data buffer unit and the output buffer unit are of a ping-pong buffer structure and are alternately used for accessing the source data and the output data received by the in-column selector.
Specifically, to buffer data seamlessly and improve data throughput, a data buffer unit and an output buffer unit are further arranged in the storage module. They adopt a ping-pong buffer structure and are used alternately to access the source data and the output data received by the in-column selector, and the two buffers exchange roles after each neural network layer is computed.
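A minimal sketch of the ping-pong arrangement follows, assuming a simple two-buffer object with a role swap per layer; the class and method names are invented for the example.

```python
# Minimal sketch of the ping-pong structure: two buffers alternate between
# the "source data" role and the "output" role; after every network layer
# the roles are swapped so the previous layer's outputs become the next
# layer's inputs without copying.
class PingPongBuffer:
    def __init__(self, size):
        self.buffers = [[0] * size, [0] * size]
        self.src = 0                      # index of the current source buffer

    @property
    def source(self):
        return self.buffers[self.src]

    @property
    def output(self):
        return self.buffers[1 - self.src]

    def swap(self):
        """Called after one layer finishes: outputs become next layer's inputs."""
        self.src = 1 - self.src

buf = PingPongBuffer(4)
buf.output[:] = [1, 2, 3, 4]   # layer k writes its results
buf.swap()
print(buf.source)              # layer k+1 reads them: [1, 2, 3, 4]
```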
Further, fig. 7 is a schematic structural diagram of the storage module according to the embodiment of the present invention. As shown in fig. 7, the data in the data cache unit and the output cache unit are divided into C groups according to the number of columns of the calculation array, each group corresponding to one column, and the data cache area is initialized according to the amount of data the neural network needs to process. In addition, according to the number of neural network task groups computed in parallel in the processor, the cached weights are also divided into N groups, each group of weights corresponding to the calculation of one group of neural network tasks; for the same group of neural network tasks, initialization is carried out only once.
Based on any of the above embodiments, the data caching unit includes a shared caching unit, and the shared caching unit is configured to store shared data of a plurality of computing unit sets;
the calculation array control module is also used for controlling the plurality of calculation unit sets to perform cooperative calculation of the corresponding neural network application program based on the shared data in the shared cache unit.
Here, in consideration of the requirement of the multitask collaborative computing complex neural network, in the embodiment of the present invention, a shared cache unit is disposed in a data cache unit of a storage module to store shared data of a plurality of computing unit sets, and on this basis, each computing unit set can perform data communication through the shared cache unit under the control of a computing array control module, thereby implementing collaborative computing between data-associated neural networks.
Based on any of the above embodiments, fig. 8 is a schematic structural diagram of a reconfigurable neural network processor provided in an embodiment of the present invention, where the instruction compiling module corresponds to the instruction cache in the figure, the model mapping module to the model mapping, the calculation array control module to the calculation array controller, the storage module to the storage component, and the calculation array to the calculation component. As shown in fig. 8, the main controller executes the neural network applications, prepares data and calls the acceleration processor, while the off-chip storage mainly stores the input data and the network model initialization information. The acceleration processor mainly comprises a storage component, a calculation component and a control component; the calculation component obtains weights and data from the weight cache and the data cache in the storage component respectively, completes the network operations under the control of the calculation array controller, and selectively stores the output results in the output cache. The calculation component can be reconstructed to support the operation processes of various types of neural networks on the one hand, and can be reconstructed and grouped to support parallel computation of multiple neural networks on the other.
The control component mainly comprises the instruction cache, the model mapping and the calculation array controller. The instruction cache is grouped according to the number of neural network tasks and is initialized once for the same group of neural network tasks. The model mapping reads and decodes the instructions, forms the reconstruction information and the read/write and calculation control information of the calculation array, initializes and modifies the operand address pointers and register files in the calculation array controller according to the instructions, and thus maps the corresponding neural network application programs onto the calculation array, obtaining a computing unit set for each neural network application program on the array.
Fig. 9 is a schematic structural diagram of a control unit according to an embodiment of the present invention, where the register file is labeled Regfile and the control unit is labeled Ctrl. As shown in fig. 9, the calculation array controller includes operand address pointers, C configurable Ctrls and N Regfiles. During computation, the operand address pointers point to the buffers of weights, source data and output data respectively; each Ctrl controls the reading, writing and calculation of the corresponding calculation column in the calculation array, and when several columns of the calculation array are combined to complete one neural network calculation task, the corresponding Ctrls are combined as well and perform the same microcode operation control; the N Regfiles correspond to the N neural network tasks computed in parallel, and each Regfile has at least C registers for recording the completion progress of the computing task, updating the address pointers and similar functions.
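A rough bookkeeping sketch of this controller organization follows. The field names, stride dictionary and dispatch function are assumptions made purely for illustration and do not reflect the actual microcode format.

```python
from dataclasses import dataclass

@dataclass
class RegFile:                      # one Regfile per parallel network task (N total)
    progress: int = 0               # completed micro-op rounds for this task
    weight_ptr: int = 0             # operand address pointers into the caches
    src_ptr: int = 0
    out_ptr: int = 0

def run_micro_op(regfiles, task_id, ctrl_group, strides):
    """All Ctrls in ctrl_group apply the same micro-op for one task; the
    task's Regfile then advances its progress counter and address pointers."""
    rf = regfiles[task_id]
    rf.progress += 1
    rf.weight_ptr += strides.get("weight", 0)
    rf.src_ptr += strides.get("src", 0)
    rf.out_ptr += strides.get("out", 0)
    return [(ctrl, task_id) for ctrl in ctrl_group]   # columns driven this round

regfiles = [RegFile() for _ in range(4)]              # N = 4 parallel tasks
run_micro_op(regfiles, task_id=1, ctrl_group=[4, 5, 6], strides={"src": 16, "out": 8})
print(regfiles[1])                                    # progress=1, src_ptr=16, out_ptr=8
```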
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A reconfigurable neural network processor is characterized by comprising an instruction compiling module, a model mapping module, a calculation array control module and a calculation array;
the instruction compiling module is used for compiling each neural network application program to be operated into a network operation instruction;
the model mapping module is used for matching a micro-operation code corresponding to a network operation instruction, and mapping a corresponding neural network application program to the computing array by indexing the micro-operation code to obtain a computing unit set of each neural network application program on the computing array; any one computing unit set is used for independently computing the computing tasks of the corresponding neural network application programs, or is used for cooperatively computing the computing tasks of the respective corresponding neural network application programs with the rest of the computing unit sets, and the computing unit sets share a storage space during cooperative computing;
the calculation array control module is used for controlling the reading, writing and calculation of each calculation unit set aiming at the corresponding neural network application program;
each calculation unit set consists of a plurality of adjacent calculation columns, and each calculation column is a column of calculation units in the calculation array;
shift buffers are respectively arranged at two sides of each computing unit;
when any calculation column independently forms a calculation unit set, the calculation column is used for independently executing the corresponding neural network application program, and the shift buffers on two sides of each calculation unit in the calculation column are in an enabling state;
when a plurality of adjacent calculation columns form a calculation unit set, the shift buffer at the edge side of the calculation column at the edge of the calculation unit set is in an enabled state, the shift buffer at the inner side of the calculation column at the edge of the calculation unit set is in a disabled state, and the shift buffers at two sides of the calculation column in the calculation unit set are in a disabled state.
2. The reconfigurable neural network processor of claim 1, wherein the computing unit is configured to perform convolution operations of corresponding neural network applications based on shift operations of feature operands and corresponding convolution kernel values;
the feature operands are derived based on a feature map of the corresponding neural network application.
3. The reconfigurable neural network processor of claim 1, wherein the computational unit is a computational core array;
each computing core in the computing core array comprises a basic operation set, and the computing core is used for selecting corresponding computing operation from the basic operation set based on a received computing control signal and executing the corresponding computing operation.
4. The reconfigurable neural network processor of claim 3, wherein the computational core array is provided in two rows; the calculation unit is used for performing convolution operation and activation function calculation of two lines of data at the same time.
5. The reconfigurable neural network processor of any one of claims 1 to 3, wherein the computational array control module comprises a plurality of control units, and the number of the control units is the same as the number of columns of the computational array;
and each control unit is used for controlling the reading, writing and calculation of the calculation units in the corresponding column aiming at the corresponding neural network application program.
6. The reconfigurable neural network processor of any one of claims 1 to 3, further comprising:
the storage module is used for caching the weight and source data required by the calculation array to execute each neural network application program;
the calculation array control module is also used for controlling the read-write operation of the storage module.
7. The reconfigurable neural network processor of claim 6, wherein the storage module comprises an in-column selector configured to selectively receive output data of the computational array;
the storage module is further configured to buffer the output data received by the in-column selector.
8. The reconfigurable neural network processor of claim 7, wherein the storage module further comprises a data buffer unit and an output buffer unit; the data buffer unit and the output buffer unit are ping-pong buffer structures and are alternately used for accessing the source data and the output data received by the in-column selector.
9. The reconfigurable neural network processor of claim 8, wherein the data cache unit comprises a shared cache unit, and the shared cache unit is configured to store data shared by a plurality of computing unit sets;
the calculation array control module is further configured to control the multiple calculation unit sets to perform cooperative calculation of the corresponding neural network application program based on the shared data in the shared cache unit.
CN202110407002.2A 2021-04-15 2021-04-15 Reconfigurable neural network processor Active CN113240074B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110407002.2A CN113240074B (en) 2021-04-15 2021-04-15 Reconfigurable neural network processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110407002.2A CN113240074B (en) 2021-04-15 2021-04-15 Reconfigurable neural network processor

Publications (2)

Publication Number Publication Date
CN113240074A 2021-08-10
CN113240074B 2022-12-06

Family

ID=77128250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110407002.2A Active CN113240074B (en) 2021-04-15 2021-04-15 Reconfigurable neural network processor

Country Status (1)

Country Link
CN (1) CN113240074B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115392443B (en) * 2022-10-27 2023-03-10 之江实验室 Pulse neural network application representation method and device of brain-like computer operating system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9449257B2 (en) * 2012-12-04 2016-09-20 Institute Of Semiconductors, Chinese Academy Of Sciences Dynamically reconstructable multistage parallel single instruction multiple data array processing system
CN107169560B (en) * 2017-04-19 2020-10-16 清华大学 Self-adaptive reconfigurable deep convolutional neural network computing method and device
CN107797962B (en) * 2017-10-17 2021-04-16 清华大学 Neural network based computational array
CN108537330B (en) * 2018-03-09 2020-09-01 中国科学院自动化研究所 Convolution computing device and method applied to neural network
CN108647773B (en) * 2018-04-20 2021-07-23 复旦大学 Hardware interconnection system capable of reconstructing convolutional neural network
CN109472356A (en) * 2018-12-29 2019-03-15 南京宁麒智能计算芯片研究院有限公司 A kind of accelerator and method of restructural neural network algorithm
CN111401533A (en) * 2020-04-28 2020-07-10 南京宁麒智能计算芯片研究院有限公司 Special calculation array for neural network and calculation method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Collection in charge of NMOS from single event effect; Jinqiu Wang et al.; Electronics Express; 2016-12-31; full text *

Also Published As

Publication number Publication date
CN113240074A (en) 2021-08-10

Similar Documents

Publication Publication Date Title
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
US11442786B2 (en) Computation method and product thereof
CN108170640B (en) Neural network operation device and operation method using same
US20070157166A1 (en) System, method and software for static and dynamic programming and configuration of an adaptive computing architecture
JP6865805B2 (en) Arithmetic logic unit and calculation method
CN110674927A (en) Data recombination method for pulse array structure
CN108509270A (en) The high performance parallel implementation method of K-means algorithms on a kind of domestic 26010 many-core processor of Shen prestige
CN113010213B (en) Simplified instruction set storage and calculation integrated neural network coprocessor based on resistance change memristor
Huang et al. IECA: An in-execution configuration CNN accelerator with 30.55 GOPS/mm² area efficiency
CN113240074B (en) Reconfigurable neural network processor
CN112486903A (en) Reconfigurable processing unit, reconfigurable processing unit array and operation method thereof
WO2016024508A1 (en) Multiprocessor device
CN110414672B (en) Convolution operation method, device and system
Song et al. Gpnpu: Enabling efficient hardware-based direct convolution with multi-precision support in gpu tensor cores
CN110490308B (en) Design method of acceleration library, terminal equipment and storage medium
CN111178492B (en) Computing device, related product and computing method for executing artificial neural network model
CN116050492A (en) Expansion unit
CN115357854A (en) Efficient matrix multiplication operation accelerating device and method
CN112862079B (en) Design method of running water type convolution computing architecture and residual error network acceleration system
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card
CN112906877A (en) Data layout conscious processing in memory architectures for executing neural network models
CN113159302A (en) Routing structure for reconfigurable neural network processor
CN110766150A (en) Regional parallel data loading device and method in deep convolutional neural network hardware accelerator
Zhou et al. A customized NoC architecture to enable highly localized computing-on-the-move DNN dataflow
CN113254078B (en) Data stream processing method for efficiently executing matrix addition on GPDPU simulator

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant