CN113469327A - Integrated circuit device for performing conversion advance - Google Patents

Integrated circuit device for performing conversion advance

Info

Publication number
CN113469327A
CN113469327A (application CN202110703451.1A)
Authority
CN
China
Prior art keywords
phase
integrated circuit
data
circuit device
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110703451.1A
Other languages
Chinese (zh)
Other versions
CN113469327B (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority to CN202110703451.1A (granted as CN113469327B)
Publication of CN113469327A
Application granted
Publication of CN113469327B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7817 Specially adapted for signal processing, e.g. Harvard architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Hardware Design (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Signal Processing (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)
  • Advance Control (AREA)

Abstract

The present invention relates to an integrated circuit device for performing conversion advance. The computing device of the present invention is included in an integrated circuit device that also comprises a universal interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete the computing operation specified by the user. The integrated circuit device may further include a storage device, which is connected to the computing device and the other processing devices, respectively, and stores data of the computing device and the other processing devices.

Description

Integrated circuit device for performing conversion advance
Technical Field
The present invention relates generally to the field of neural networks. More particularly, the present invention relates to an integrated circuit device that performs conversion advance, i.e., that advances the data type conversion (quantization) operation.
Background
In recent years, neural network algorithms, as a branch of artificial intelligence algorithms, have exhibited good adaptability and superior performance in a growing number of fields, such as image recognition, object detection, and natural language processing, and have become a research hotspot in both academia and industry.
However, neural network algorithms involve a very large amount of computation (on the order of ten billion operations), and model training requires a back-propagation process that consumes a large amount of hardware resources. Conventional general-purpose processors, which must preserve generality, cannot meet the requirements of intelligent application scenarios, so high-performance, low-power neural network accelerators have become one of the research hotspots of computer architecture in recent years.
Since accelerators have different architectures and impose different constraints on data placement, tiling, movement, and computation, the corresponding programming system must take the low-level hardware implementation details into account when generating instructions. In particular, the convolution and fully-connected operators in a neural network model occupy most of the computing resources, and insufficient hardware computing power reduces operating efficiency.
Therefore, a compilation and optimization scheme for neural network models is urgently needed.
Disclosure of Invention
To at least partially solve the technical problems mentioned in the background, an aspect of the present invention provides an integrated circuit device that performs conversion advance.
In one aspect, the present invention discloses an integrated circuit device for performing conversion advance in a neural network model, the neural network model including a first layer and a second layer, the first and second layers each including a loading stage, a computing stage, and a storing stage, and the computing stage of the second layer including a quantization operation and a low-precision operation. The integrated circuit device includes a processing device and a computing device. The processing device advances the quantization operation to between the computing stage and the storing stage of the first layer; the computing device executes, in the first layer, the loading stage, the computing stage, the quantization operation, and the storing stage of the first layer, and executes, in the second layer, the loading stage, the low-precision operation, and the storing stage of the second layer.
In another aspect, the present invention discloses an integrated circuit device for performing conversion advance in an operator of a neural network model, the operator including a loading stage and a computing stage, the loading stage and the computing stage each including a pre-processing operation and a post-processing operation, and the pre-processing operation of the computing stage including a quantization operation. The integrated circuit device includes: a processing device for advancing the quantization operation to the post-processing operation of the loading stage; and a computing device for performing the quantization operation in the post-processing operation of the loading stage.
The present invention provides a conversion-advance scheme that satisfies both the loop partitioning of computational logic according to the algorithm within a neural network operator and the loop partitioning of data slices within the computation hierarchy, saving inter-layer data movement, reducing the bandwidth occupied on the hardware, and improving performance.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. In the accompanying drawings, several embodiments of the present invention are illustrated by way of example and not by way of limitation, and like reference numerals designate like or corresponding parts throughout the several views, in which:
FIG. 1 is a structural diagram of a board card according to an embodiment of the present invention;
FIG. 2 is a block diagram illustrating an integrated circuit device of an embodiment of the invention;
FIG. 3 is a schematic diagram showing the internal structure of a computing device of an embodiment of the invention;
FIG. 4 is a schematic diagram showing the internal structure of a processor core of an embodiment of the invention;
FIG. 5 is a diagram illustrating an execution tree of an embodiment of the present invention;
FIG. 6 is a diagram illustrating parsing and traversing an execution tree according to an embodiment of the invention;
FIG. 7 is a flow diagram illustrating quantization based on an execution tree according to an embodiment of the present invention; and
FIG. 8 is a diagram illustrating conversion advance for convolution and fully-connected layers according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the terms "first", "second", "third", "fourth", etc. in the claims, the description, and the drawings of the present invention are used to distinguish different objects rather than to describe a particular order. The terms "comprises" and "comprising", when used in the description and claims of the present application, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the description and claims of this application, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the description and claims of this application refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
The following detailed description of embodiments of the invention refers to the accompanying drawings.
FIG. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present invention. As shown in FIG. 1, the board card 10 includes a chip 101, which is a system-on-chip (SoC) integrating one or more combined processing devices. The combined processing device is an artificial-intelligence computing unit that supports various deep learning and machine learning algorithms and meets the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning in particular is widely applied in the field of cloud intelligence, where a notable characteristic is the large amount of input data, which places high demands on the storage and computing capacity of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, a wifi interface, or the like. The data to be processed may be transferred by the external device 103 to the chip 101 through the external interface device 102. The calculation result of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may have different interface forms, such as a PCIe interface, according to different application scenarios.
The board card 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to, and exchanges data with, the control device 106 and the chip 101 via a bus. The control device 106 on the board card 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a microcontroller unit (MCU).
FIG. 2 is a structural diagram of the combined processing device in the chip 101 of this embodiment. As shown in FIG. 2, the combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204.
The computing device 201 is configured to perform user-specified operations, mainly implemented as a single-core smart processor or a multi-core smart processor, to perform deep learning or machine learning computations, which may interact with the processing device 203 through the interface device 202 to collectively perform the user-specified operations.
The interface device 202 is used for transmitting data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202, and write to a storage device on the computing device 201. Further, the computing device 201 may obtain the control instruction from the processing device 203 via the interface device 202, and write the control instruction into a control cache on the computing device 201. Alternatively or optionally, the interface device 202 may also read data from a storage device of the computing device 201 and transmit the data to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of central processing unit (CPU), graphics processing unit (GPU), or other general-purpose and/or special-purpose processor, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic device, discrete gate or transistor logic device, discrete hardware component, etc., and their number may be determined according to actual needs. As mentioned above, the computing device 201 of the present invention, considered on its own, can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, they form a heterogeneous multi-core structure.
The DRAM 204 is used to store the data to be processed. It is a DDR memory, typically 16 GB or larger, and stores data of the computing device 201 and/or the processing device 203.
FIG. 3 shows the internal structure of the computing device 201. The computing device 201 processes input data in fields such as computer vision, speech, natural language, and data mining. The computing device 201 in the figure adopts a multi-core hierarchical design: as a system on chip, it includes a plurality of clusters, and each cluster includes a plurality of processor cores. In other words, the computing device 201 is organized in a system-on-chip / cluster / processor-core hierarchy.
Looking at the system-on-chip hierarchy, as shown in FIG. 3, the computing device 201 includes an external storage controller 301, a peripheral communication module 302, an on-chip interconnect module 303, a synchronization module 304, and a plurality of clusters 305.
There may be multiple external memory controllers 301 (two are shown by way of example in the figure), which access an external memory device, such as the DRAM 204 in FIG. 2, to read data from or write data to off-chip memory in response to access requests issued by the processor cores. The peripheral communication module 302 receives control signals from the processing device 203 through the interface device 202 and starts the computing device 201 to execute tasks. The on-chip interconnect module 303 connects the external memory controllers 301, the peripheral communication module 302, and the plurality of clusters 305, and transmits data and control signals between the modules. The synchronization module 304 is a global synchronization barrier controller (GBC) that coordinates the progress of the clusters and ensures synchronization of information. The plurality of clusters 305 are the computing cores of the computing device 201; four are shown by way of example in the figure, and as hardware evolves the computing device 201 of the present invention may include 8, 16, 64, or even more clusters 305. The clusters 305 are used to efficiently execute deep learning algorithms.
Viewed at the cluster level, as shown in FIG. 3, each cluster 305 includes a plurality of processor cores (IPU cores) 306 and a memory core (MEM core) 307.
Four processor cores 306 are shown by way of example in the figure; the present invention does not limit their number. The internal architecture of a processor core is shown in FIG. 4. Each processor core 306 includes three major modules: a control module 41, an operation module 42, and a storage module 43.
The control module 41 is used for coordinating and controlling the operations of the operation module 42 and the storage module 43 to complete the deep learning task, and includes an Instruction Fetch Unit (IFU) 411 and an Instruction Decode Unit (IDU) 412. The instruction fetch unit 411 is used to obtain an instruction from the processing device 203, and the instruction decode unit 412 decodes the obtained instruction and sends the decoded result to the operation module 42 and the storage module 43 as control information.
The operation module 42 includes a vector operation unit 421 and a matrix operation unit 422. The vector operation unit 421 is used for performing vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation; the matrix operation unit 422 is responsible for the core calculation of the deep learning algorithm, namely matrix multiplication and convolution.
The storage module 43 is used to store or transfer related data, and includes a neuron storage unit (neuron RAM, NRAM) 431, a weight storage unit (weight RAM, WRAM) 432, an input/output direct memory access module (input/output DMA) 433, and a move direct memory access module (MVDMA) 434. The NRAM 431 stores the feature maps to be computed by the processor core 306 and the intermediate results after computation; the WRAM 432 stores the weights of the deep learning network; the input/output DMA 433 controls accesses between the NRAM 431/WRAM 432 and the DRAM 204 through the broadcast bus 309; and the MVDMA 434 controls accesses between the NRAM 431/WRAM 432 and the SRAM 308.
Returning to FIG. 3, the memory core 307 is mainly used for storage and communication, i.e., storing data shared among the processor cores 306 or intermediate results, and performing communication between a cluster 305 and the DRAM 204, communication among the clusters 305, communication among the processor cores 306, and so on. In other embodiments, the memory core 307 has scalar computation capability and performs scalar operations.
The memory core 307 includes a shared storage unit (SRAM) 308, a broadcast bus 309, a cluster direct memory access module (CDMA) 310, and a global direct memory access module (GDMA) 311. The SRAM 308 acts as a high-performance data transfer station: data multiplexed among different processor cores 306 in the same cluster 305 does not need to be fetched from the DRAM 204 by each processor core 306 separately, but is relayed among the processor cores 306 through the SRAM 308. The memory core 307 only needs to rapidly distribute the multiplexed data from the SRAM 308 to the plurality of processor cores 306, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
The broadcast bus 309, the CDMA 310, and the GDMA 311 are used respectively for communication among the processor cores 306, communication among the clusters 305, and data transfer between a cluster 305 and the DRAM 204. They are described separately below.
The broadcast bus 309 is used to accomplish high-speed communication among the processor cores 306 in the cluster 305, and the broadcast bus 309 of this embodiment supports inter-core communication modes including unicast, multicast and broadcast. Unicast refers to point-to-point (i.e., from a single processor core to a single processor core) data transfer, multicast is a communication for transferring a copy of data from SRAM 308 to a specific number of processor cores 306, and broadcast is a communication for transferring a copy of data from SRAM 308 to all processor cores 306, and is a special case of multicast.
CDMA 310 is used to control access to SRAM 308 between different clusters 305 within the same computing device 201.
The GDMA 311 cooperates with the external memory controller 301 to control accesses from the SRAM 308 of a cluster 305 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 308. As can be seen from the foregoing, communication between the DRAM 204 and the NRAM 431 or WRAM 432 can be achieved via two channels. The first channel connects the DRAM 204 with the NRAM 431 or WRAM 432 directly through the input/output DMA 433; the second channel transfers data between the DRAM 204 and the SRAM 308 via the GDMA 311, and then between the SRAM 308 and the NRAM 431 or WRAM 432 via the MVDMA 434. Although the second channel appears to require more components and a longer data path, in some embodiments its bandwidth is substantially greater than that of the first channel, so communication between the DRAM 204 and the NRAM 431 or WRAM 432 may be more efficient over the second channel. Embodiments of the present invention may select the data transmission channel according to the hardware conditions.
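Purely as an illustration of this channel selection (the bandwidth figures and the function name below are assumptions, not values disclosed by this patent), the decision can be sketched as:

```python
def pick_dram_nram_channel(bw_direct_gbps: float, bw_via_sram_gbps: float) -> str:
    """Hypothetical sketch: choose the DRAM <-> NRAM/WRAM data path.
    Channel 1: DRAM <-> NRAM/WRAM directly through the input/output DMA 433.
    Channel 2: DRAM <-> SRAM via GDMA 311, then SRAM <-> NRAM/WRAM via MVDMA 434."""
    return "channel 1 (direct)" if bw_direct_gbps >= bw_via_sram_gbps else "channel 2 (via SRAM)"

# Example with assumed bandwidths: the wider second channel would be preferred.
print(pick_dram_nram_channel(bw_direct_gbps=50.0, bw_via_sram_gbps=200.0))
```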
In other embodiments, the functionality of the GDMA 311 and the functionality of the input/output DMA 433 may be integrated into the same component. For convenience of description, the present invention treats the GDMA 311 and the input/output DMA 433 as different components; as long as the functions realized and the technical effects achieved are similar to those of the present invention, such implementations by those skilled in the art fall within the protection scope of the present invention. Furthermore, the functions of the GDMA 311, the input/output DMA 433, the CDMA 310, and the MVDMA 434 may also be implemented by the same component.
The neural network framework to which this embodiment applies predefines a series of neural network layer or operator interfaces. A developer sets the layer parameters of each layer by calling the application programming interface (API) of the neural network framework, and links the dependency relationships between data and layers to build the neural network model structure. After the network model is trained, the model parameters and weight data are saved in a structured model file and stored in the DRAM 204. During deployment, the processing device 203 calls the framework API, loads the trained network model, and executes the forward inference process of the network model on the computing device 201 with the actual input data to obtain the final output of the network. Since the model structure and parameters are known during forward inference, this embodiment uses this information for acceleration.
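For illustration only, the build / save / deploy flow described above might look roughly like the following sketch; the framework module `nn_framework`, its functions, and all parameter names are hypothetical placeholders, not an API disclosed by this patent.

```python
# Hypothetical sketch of the workflow described above; every name below is assumed.
import nn_framework as nnf

# Build the model structure by calling framework APIs and linking layer dependencies.
net = nnf.Network()
x = net.add_input("data", shape=(1, 3, 224, 224))
conv = net.add_layer("conv1", nnf.Conv2D(out_channels=64, kernel_size=3), inputs=[x])
out = net.add_layer("fc1", nnf.FullyConnected(out_features=10), inputs=[conv])

# After training, the model parameters and weight data are saved in a structured model file.
nnf.save_model(net, "model.bin")

# At deployment time, the processing device loads the trained model and runs
# forward inference on the computing device with the actual input data.
model = nnf.load_model("model.bin")
actual_input = nnf.load_input("input.bin")
result = model.forward(actual_input)
```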
This embodiment proposes a tree-structured programming method for neural network operators, called an execution tree. FIG. 5 shows a schematic diagram of the execution tree of this embodiment. The execution tree is a recursive data structure formed by a root node 501 connected to a subtree; the subtree may have any number of levels and any number of child nodes, and the child nodes are divided into non-leaf nodes and leaf nodes. Non-leaf nodes occupy the intermediate levels of the subtree; two non-leaf nodes 502 and 503 are shown by way of example in FIG. 5. Leaf nodes occupy the last level of the subtree; two leaf nodes 504 and 505 are shown by way of example in FIG. 5. The number of levels and the number of child nodes of the subtree depend on the needs of the operator and are not limited by this embodiment.
The root node and the child nodes share the same execution logic, which comprises: an initial operation, a pre-processing operation, a body operation, a post-processing operation, and an ending operation. The root node and the child nodes also include a loop operation (not shown) that records how many times the node needs to be executed repeatedly.
The initial operation is the first part executed within a level of the execution tree. It is executed only once, is not repeated with the loop, and consists of one-time initialization instructions, such as register initialization instructions and activation-operation configuration instructions. The pre-processing operation is executed after the initial operation and is repeated at least once according to the loop operation; it handles the preparation before the body operation, for example, in the Scale operator, fetching the loop-segment data corresponding to the short-vector right operand. The body operation is executed after the pre-processing operation and is likewise repeated at least once according to the loop operation; it is responsible for the computation of the operator's main loop. If the node is the root node or a non-leaf node, its body operation only slices data and distributes tasks to the child nodes of the next level; if the node is a leaf node, its body operation executes the computational core of the tree, for example an accumulation/addition operation. The post-processing operation is repeated at least once after the body operation according to the loop operation and handles the processing after the computation, such as moving multiplexed data and offsetting registers. The ending operation is executed only once and outputs the computation result.
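The node structure and its five-part execution logic can be summarized by the following minimal sketch; the Python names and layout are chosen only for illustration, since the patent does not define a concrete data structure.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Op = Callable[[], None]  # one operation = a short sequence of device instructions

@dataclass
class ExecTreeNode:
    """Sketch of one execution-tree node (root, non-leaf, or leaf)."""
    initial: Op        # executed once, e.g. register / activation configuration
    pre_process: Op    # repeated each loop iteration; prepares the body operation
    body: Op           # leaf: computational core; root/non-leaf: slice data, dispatch to children
    post_process: Op   # repeated each loop iteration, e.g. move multiplexed data
    ending: Op         # executed once; outputs the computation result
    loop_count: int = 1                 # the "loop operation": number of repetitions
    children: List["ExecTreeNode"] = field(default_factory=list)
```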
The number of executions and the execution timing of the above operations are created by the processing device 203 based on a loop analysis of the operation instructions of the neural network operator on the computing device 201; they are not functional limitations of the execution tree. When loop execution is required, the looped part consists of the pre-processing operation, the body operation, and the post-processing operation.
In this embodiment, the execution of a neural network operator can be roughly divided into three stages: a loading stage, a computing stage, and a storing stage. The processing device 203 therefore divides the execution trees of a neural network operator into three types: a load tree, a compute tree, and a store tree. The execution trees of each operator are composed of the root nodes and subtrees of the load, compute, and store trees; that is, every execution tree of an operator belongs to one of these three types, and each tree has the structure of FIG. 5.
When running the neural network model, the three execution trees of an operator implement all the instructions required for the operator to run on the computing device 201. The computing device 201 first executes all the instructions of the operations, in the corresponding execution order, of one leaf node of the load tree, then executes one leaf node of the compute tree, and finally executes one leaf node of the store tree; this process is repeated until all nodes have been executed.
In more detail, at the compilation stage, when the processing device 203 parses and traverses an execution tree, it follows a pre-order traversal: the initial and pre-processing operations of the root node are executed first, then all the nodes in the body operation of the subtree are traversed, and finally the post-processing and ending operations of the root node are executed. The pre-processing, body, and post-processing operations are repeated in a loop.
To implement the loop operation, when repeated execution is required, a synchronization instruction is inserted after the post-processing operation of the node that needs to be repeated. At run time, when the computing device 201 receives a synchronization instruction, it returns to the pre-processing operation of that node and executes the pre-processing, body, and post-processing operations again until the loop count recorded by the loop operation is satisfied, and then executes the ending operation of the node.
FIG. 6 shows a schematic diagram of parsing and traversing the execution tree in this embodiment. The simplified execution tree includes a root node 601, a first leaf node 602, and a second leaf node 603. Assume the loop operation of the root node 601 records a loop count of 3, the loop operation of the first leaf node 602 records a loop count of 5, and the loop operation of the second leaf node 603 records a loop count of 1. When traversing the execution tree, the processing device 203 first executes the initial and pre-processing operations of the root node 601, then, following the linking order of the subtree, executes the initial, pre-processing, body, and post-processing operations of the first leaf node 602, and then encounters the synchronization instruction 604, whose loop information records that 5 repetitions are required. Since the first leaf node 602 has so far been executed only once, its pre-processing, body, and post-processing operations are repeated until 5 iterations have been completed, and finally the ending operation of the first leaf node 602 is executed. At this point, all operations of the subtree of the first leaf node 602 have been traversed.
The processing device 203 then traverses the subtree of the second leaf node 603. Since the second leaf node 603 only needs to loop once, no synchronization instruction is inserted; its initial, pre-processing, body, post-processing, and ending operations are executed directly, and the traversal returns to the root node 601.
Back at the root node 601, the post-processing operation of the root node 601 is executed. Since the root node 601 needs to be executed 3 times, its post-processing operation is followed by the synchronization instruction 605, whose loop information records that 3 repetitions are required. The processing device 203 therefore returns to the pre-processing operation of the root node 601, executes the entire flow of all its subtrees again as described above, and executes the post-processing operation of the root node 601 again, until 3 iterations have been completed; finally, the ending operation of the root node 601 is executed, completing all operations of the tree rooted at node 601.
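Using the node sketch given earlier, the traversal of FIG. 6 (root node looped 3 times, first leaf node 5 times, second leaf node once) can be modeled roughly as follows; this is a behavioral sketch of the traversal order, not the instruction stream actually emitted by the compiler.

```python
def run(node: "ExecTreeNode") -> None:
    """Behavioral sketch of traversing one execution-tree node (cf. FIG. 6)."""
    node.initial()
    for _ in range(node.loop_count):   # repetition driven by the synchronization instruction
        node.pre_process()
        node.body()                    # leaf: computational core; non-leaf: slice and dispatch
        for child in node.children:    # traverse the subtree linked under the body operation
            run(child)
        node.post_process()
    node.ending()

# FIG. 6 example: within each of the 3 iterations of root node 601, the
# pre/body/post operations of leaf node 602 run 5 times and those of leaf
# node 603 run once, before the root's own post-processing executes.
```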
With the example of FIG. 6 illustrating the traversal order of a single execution tree, it can be seen from the above that, when computing an operator, the computing device 201 repeatedly traverses the nodes of the execution trees in the chained loop load → compute → store → load → compute → store.
When compiling the execution trees, the processing device 203 first analyzes the specific algorithm of the neural network operator to obtain the computation loop levels, constructs the corresponding execution tree levels, and links the subtree relationships. It then derives the maximum input (or output) data amount of each computation loop from the occupation ratio or actual size, within the on-chip resources (mainly the storage space of the NRAM 431), of each data block such as the inputs, outputs, and constants; the loop level of the data slices is obtained by dividing the input data amount of a specific computation loop level by the maximum input data amount of a single loop, and the subtree relationships are linked accordingly. Within each subtree, memory is allocated and released in the appropriate operations according to the data amount of the actual loop. Finally, the corresponding instructions for loading off-chip data, moving multiplexed data, computing, and storing output data are filled into the appropriate operations of each subtree, completing the compilation of the operator.
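As a concrete illustration of this slicing arithmetic (the sizes below are hypothetical, chosen only to show the division): if a computation loop level consumes 4 MB of input while the on-chip budget allows at most 512 KB of input per iteration, the data-slice loop level is 4 MB / 512 KB = 8 iterations.

```python
import math

def slice_loop_count(level_input_bytes: int, max_input_per_loop_bytes: int) -> int:
    """Loop level of the data slices: input data amount of the computation loop
    level divided by the maximum input data amount of a single loop."""
    return math.ceil(level_input_bytes / max_input_per_loop_bytes)

# Hypothetical numbers for illustration only.
print(slice_loop_count(4 * 1024 * 1024, 512 * 1024))  # -> 8
```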
Since the convolutional layers and fully-connected layers account for most of the computation in the full network, these computations need to be optimized to improve full-network performance. This embodiment observes that the large amount of weight data in the convolutional and fully-connected layers carries a certain redundancy, and therefore adopts a low-precision computation method, under the condition of no precision loss at all or a precision loss within an allowable range. In other words, to save hardware resources, this embodiment uses quantization to convert high-precision floating-point numbers into low-precision fixed-point numbers to accelerate the neural network operations. For example, the matrix operation unit 422 only supports multiply-accumulate operations on 8-bit fixed-point numbers (INT8); before a matrix operation is performed, both the input data and the weight data are converted into fixed-point numbers of the INT8 data type and then fed into the matrix operation unit 422 for computation.
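The patent does not specify a particular quantization formula; the sketch below assumes a simple symmetric, per-tensor scale purely to illustrate converting floating-point data to INT8 before the matrix operation.

```python
import numpy as np

def quantize_to_int8(x: np.ndarray):
    """Assumed symmetric per-tensor quantization (illustrative, not the patent's
    definition): map floating-point values onto the INT8 range [-127, 127]."""
    scale = max(float(np.abs(x).max()), 1e-8) / 127.0
    q = np.clip(np.round(x.astype(np.float32) / scale), -127, 127)
    return q.astype(np.int8), scale

# Both the input data and the weight data are converted to INT8 before being fed
# to the matrix operation unit, which only supports INT8 multiply-accumulate.
acts = np.random.randn(64, 128).astype(np.float16)
wts = np.random.randn(128, 32).astype(np.float16)
acts_q, s_a = quantize_to_int8(acts)
wts_q, s_w = quantize_to_int8(wts)
acc = acts_q.astype(np.int32) @ wts_q.astype(np.int32)  # INT8 MACs accumulated in INT32
result = acc.astype(np.float32) * (s_a * s_w)            # rescale back to floating point
```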
For these two kinds of layers, the weight data can be converted in advance using an offline pre-processing quantization method. The weight data is stored offline in the model and can be pre-processed at compile time: it is converted according to the corresponding data type and saved into a new network model file, the corresponding network model structure description file is modified, the operation data type of the corresponding neural network layer is marked, and the corresponding parameters required for quantization are added. At compile time, an instruction sequence for the computing device 201 is generated according to the quantized network model parameters. At run time, through the generated instruction sequence, the computing device 201 loads the required weight data into the WRAM 432 according to the bit width of the corresponding computation data type and performs the convolution and fully-connected layer operations, thereby achieving network acceleration.
However, the input data of the convolution and fully-connected operators may be the output of other neural network layers in the network, whose data type cannot be converted in advance at compile time; a corresponding instruction sequence must be used on chip to complete the data type conversion. These instructions are computation-type instructions and are executed in the computing stage of the operator.
FIG. 7 shows a flowchart of quantization based on the execution tree in this embodiment. As mentioned earlier, the leaf nodes of the execution trees are executed in the order load → compute → store; since the initial and ending operations are not critical to quantization, they are omitted from the description of FIG. 7. In step 701, the pre-processing operation of the load leaf node is executed; in step 702, the body operation of the load leaf node is executed, at which point the input data and the weights are loaded into the NRAM 431 and the WRAM 432; in step 703, the post-processing operation of the load leaf node is executed; in step 704, the pre-processing operation of the compute leaf node is executed; in step 705, the body operation of the compute leaf node is executed; in step 706, the post-processing operation of the compute leaf node is executed. The operations of the storing stage that follow the computing stage are likewise omitted from the description of FIG. 7.
The data type conversion instruction (quantization operation) is executed by the vector operation unit 421 in a vectorized manner on contiguous data in the NRAM 431 according to the corresponding data type. In principle, it would be implemented in the pre-processing operation of the leaf node of the compute tree of the corresponding operator, i.e., the quantization would be performed in step 704. In this embodiment, the data type conversion (quantization operation) is instead completed in advance during data movement and becomes an input/output-type instruction, which can be implemented in the post-processing operation of the leaf node of the load tree, i.e., the quantization is performed earlier, in step 703. This reduces the amount of data moved in the ending operation of the loading stage and in the pre-processing operation of the computing stage.
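Expressed against the execution-tree phases of FIG. 7, the rescheduling simply moves the conversion from the compute leaf's pre-processing (step 704) into the load leaf's post-processing (step 703); the phase lists below are only a schematic restatement of that figure.

```python
# Phase sequence of one operator before the conversion advance (cf. FIG. 7):
before = [
    "701 load.pre_process",
    "702 load.body           (DRAM -> NRAM/WRAM, floating point)",
    "703 load.post_process",
    "704 compute.pre_process (quantize FP -> INT8)",   # conversion as a compute-type instruction
    "705 compute.body        (INT8 matrix operation)",
    "706 compute.post_process",
]

# After the advance, the conversion becomes an input/output-type instruction
# executed in the load tree's post-processing:
after = [
    "701 load.pre_process",
    "702 load.body           (DRAM -> NRAM/WRAM, floating point)",
    "703 load.post_process   (quantize FP -> INT8)",   # conversion advanced to step 703
    "704 compute.pre_process",
    "705 compute.body        (INT8 matrix operation)",
    "706 compute.post_process",
]
```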
Moreover, this embodiment also provides a compile-time optimization method that schedules the data type conversion in advance, for the data type conversion operations that occur in operators such as convolution and fully-connected layers.
In addition to the conversion advance shown in FIG. 7, in full-network execution the matrix operations in operator layers such as convolution and fully-connected layers involve low-precision computation whose input data type (INT8) has a smaller bit width than that of the other operator layers, which use high-precision computation (FP16). When the source of an input data block is the output of another layer, the processing device 203 schedules the data type conversion operation into the computing stage of the corresponding preceding operator layer. FIG. 8 shows a schematic diagram of the conversion advance for the convolution and fully-connected layers of this embodiment, in which two layers are shown by way of example: a first layer and a second layer. Before the conversion advance is performed at compile time, the first layer executes the operations load 801 → compute 802 → store 803, and the second layer, a convolution or fully-connected layer, executes the operations load 804 → quantize 805 → convolution/fully-connected 806 → store 807, where quantize 805 and convolution/fully-connected 806 constitute the computing stage of the second layer; since convolution/fully-connected 806 only accepts fixed-point numbers, the floating-point numbers are first converted into fixed-point numbers in quantize 805.
When the conversion advance is performed at compile time, the processing device 203 advances quantize 805 to between compute 802 and store 803 of the first layer. That is, after the conversion advance, the computing device 201 executes load 801 → compute 802 → quantize 805 → store 803 in the first layer, and load 804 → convolution/fully-connected 806 → store 807 in the second layer. The amount of data involved in the output operation of store 803 of the first layer and in the input operation of load 804 of the second layer is therefore only half of the original (INT8 occupies half the bytes of FP16), which saves bandwidth and improves performance.
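The effect on inter-layer traffic can be illustrated with a hypothetical tensor of N elements, using 2 bytes per FP16 element and 1 byte per INT8 element:

```python
N = 1_000_000  # hypothetical element count of the tensor passed from the first to the second layer

fp16_bytes, int8_bytes = 2 * N, 1 * N

# Before the advance: store 803 writes FP16 and load 804 reads FP16.
traffic_before = fp16_bytes + fp16_bytes      # 4,000,000 bytes

# After the advance: quantize 805 runs inside the first layer's computing stage,
# so store 803 writes INT8 and load 804 reads INT8.
traffic_after = int8_bytes + int8_bytes       # 2,000,000 bytes

print(traffic_after / traffic_before)         # -> 0.5, i.e. the inter-layer traffic is halved
```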
The present invention provides a conversion-advance scheme that satisfies both the loop partitioning of computational logic according to the algorithm within a neural network operator and the loop partitioning of data slices within the computation hierarchy, saving inter-layer data movement, reducing the bandwidth occupied on the hardware, and improving performance.
According to different application scenarios, the electronic device or apparatus of the present invention may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, an internet of things terminal, a mobile phone, a car recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph. The electronic device or apparatus of the present invention can also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction sites, medical care, and the like. Furthermore, the electronic equipment or the device can be used in application scenes such as a cloud end, an edge end and a terminal which are related to artificial intelligence, big data and/or cloud computing. In one or more embodiments, the electronic device or apparatus with high computational power according to the present disclosure may be applied to a cloud device (e.g., a cloud server), and the electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (e.g., a smartphone or a camera). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device to simulate the hardware resources of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device, and uniform management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration can be completed.
It is noted that for the sake of simplicity, the present invention sets forth some methods and embodiments thereof as a series of acts or combinations thereof, but those skilled in the art will appreciate that the inventive arrangements are not limited by the order of acts described. Accordingly, persons skilled in the art may appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the invention. Further, those skilled in the art will appreciate that the described embodiments of the invention are capable of being practiced in other alternative embodiments that may involve fewer acts or modules than are necessary to practice one or more aspects of the invention. In addition, the description of some embodiments of the present invention is also focused on different schemes. In view of this, those skilled in the art will understand that portions of the present invention that are not described in detail in one embodiment may also refer to related descriptions of other embodiments.
In particular implementations, based on the disclosure and teachings of the present invention, one skilled in the art will appreciate that the several embodiments disclosed in the present invention can be practiced by other methods than those disclosed in the present embodiments. For example, as for each unit in the foregoing embodiment of the electronic device or apparatus, the embodiment splits the unit based on the logic function, and there may be another splitting manner in actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present invention, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the scheme of the embodiment of the invention. In addition, in some scenarios, multiple units in an embodiment of the present invention may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors, among other devices. In view of this, the various devices (e.g., computing devices or other processing devices) described in this embodiment may be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), and may be, for example, a variable Resistive Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.
The above embodiments of the present invention are described in detail, and the principle and the implementation of the present invention are explained in this embodiment by applying specific examples, and the above description of the embodiments is only used to help understanding the method of the present invention and the core idea thereof; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (10)

1. An integrated circuit device for performing conversion advance in a neural network model, the neural network model comprising a first layer and a second layer, the first and second layers each comprising a loading stage, a computing stage, and a storing stage, the computing stage of the second layer comprising a quantization operation and a low-precision operation, the integrated circuit device comprising:
a processing device configured to advance the quantization operation to between the computing stage and the storing stage of the first layer; and
a computing device configured to execute, in the first layer, the loading stage, the computing stage, the quantization operation, and the storing stage of the first layer, and to execute, in the second layer, the loading stage, the low-precision operation, and the storing stage of the second layer.
2. The integrated circuit device of claim 1, wherein the computing device comprises a neuron storage unit, the loading stage, the computing stage, and the storing stage are organized as execution trees, and, when compiling the execution trees, the processing device calculates the maximum input/output data amount within one loop from the occupation ratio or actual size, in the neuron storage unit, of each data block such as the inputs, outputs, and constants.
3. The integrated circuit device of claim 2, wherein the processing device obtains the loop level of the data slices, which is used to link subtree relationships in the execution trees, by dividing the input data amount of a specific computation loop level by said maximum data amount.
4. The integrated circuit device of claim 1, wherein the computing device comprises a vector operation unit configured to perform the quantization operation in a vectorized manner.
5. The integrated circuit device of claim 1, wherein the low-precision operation is one of a convolution operation and a fully-connected operation.
6. An integrated circuit device for performing conversion advance in an operator of a neural network model, the operator comprising a loading stage and a computing stage, the loading stage and the computing stage each comprising a pre-processing operation and a post-processing operation, the pre-processing operation of the computing stage comprising a quantization operation, the integrated circuit device comprising:
a processing device configured to advance the quantization operation to the post-processing operation of the loading stage; and
a computing device configured to perform the quantization operation in the post-processing operation of the loading stage.
7. The integrated circuit device of claim 6, wherein the computing device comprises a neuron storage unit, the loading stage and the computing stage are organized as execution trees, and, when compiling the execution trees, the processing device calculates the maximum input/output data amount within one loop from the occupation ratio or actual size, in the neuron storage unit, of each data block such as the inputs, outputs, and constants.
8. The integrated circuit device of claim 7, wherein the processing device obtains the loop level of the data slices, which is used to link subtree relationships in the execution trees, by dividing the input data amount of a specific computation loop level by said maximum data amount.
9. The integrated circuit device of claim 6, wherein the computing device comprises a vector operation unit configured to perform the quantization operation in a vectorized manner.
10. The integrated circuit device of claim 6, wherein the low-precision operation is one of a convolution operation and a fully-connected operation.
CN202110703451.1A 2021-06-24 2021-06-24 Integrated circuit device for performing conversion advance Active CN113469327B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110703451.1A CN113469327B (en) 2021-06-24 2021-06-24 Integrated circuit device for performing conversion advance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110703451.1A CN113469327B (en) 2021-06-24 2021-06-24 Integrated circuit device for performing conversion advance

Publications (2)

Publication Number Publication Date
CN113469327A true CN113469327A (en) 2021-10-01
CN113469327B CN113469327B (en) 2024-04-05

Family

ID=77872677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110703451.1A Active CN113469327B (en) 2021-06-24 2021-06-24 Integrated circuit device for performing conversion advance

Country Status (1)

Country Link
CN (1) CN113469327B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110889503A (en) * 2019-11-26 2020-03-17 中科寒武纪科技股份有限公司 Data processing method, data processing device, computer equipment and storage medium
CN111144564A (en) * 2019-12-25 2020-05-12 上海寒武纪信息科技有限公司 Device for training neural network and integrated circuit board card thereof
US20200226473A1 (en) * 2019-01-15 2020-07-16 BigStream Solutions, Inc. Systems, apparatus, methods, and architectures for heterogeneous precision acceleration of quantized neural networks
US20200242455A1 (en) * 2019-01-28 2020-07-30 Cambricon Technologies Corporation Limited Neural network computation device and method
CN111832719A (en) * 2020-07-28 2020-10-27 电子科技大学 Fixed point quantization convolution neural network accelerator calculation circuit
WO2020220935A1 (en) * 2019-04-27 2020-11-05 中科寒武纪科技股份有限公司 Operation apparatus
CN111898865A (en) * 2020-07-02 2020-11-06 常州市第二人民医院 Smart campus data dynamic management method
US20200364552A1 (en) * 2019-05-13 2020-11-19 Baidu Usa Llc Quantization method of improving the model inference accuracy
CN112799726A (en) * 2021-01-26 2021-05-14 上海寒武纪信息科技有限公司 Data processing device, method and related product
CN113015984A (en) * 2018-01-08 2021-06-22 达莉娅·弗罗洛瓦 Error correction in convolutional neural networks

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113015984A (en) * 2018-01-08 2021-06-22 达莉娅·弗罗洛瓦 Error correction in convolutional neural networks
US20200226473A1 (en) * 2019-01-15 2020-07-16 BigStream Solutions, Inc. Systems, apparatus, methods, and architectures for heterogeneous precision acceleration of quantized neural networks
US20200242455A1 (en) * 2019-01-28 2020-07-30 Cambricon Technologies Corporation Limited Neural network computation device and method
WO2020220935A1 (en) * 2019-04-27 2020-11-05 中科寒武纪科技股份有限公司 Operation apparatus
US20200364552A1 (en) * 2019-05-13 2020-11-19 Baidu Usa Llc Quantization method of improving the model inference accuracy
CN110889503A (en) * 2019-11-26 2020-03-17 中科寒武纪科技股份有限公司 Data processing method, data processing device, computer equipment and storage medium
CN111144564A (en) * 2019-12-25 2020-05-12 上海寒武纪信息科技有限公司 Device for training neural network and integrated circuit board card thereof
CN111898865A (en) * 2020-07-02 2020-11-06 常州市第二人民医院 Smart campus data dynamic management method
CN111832719A (en) * 2020-07-28 2020-10-27 电子科技大学 Fixed point quantization convolution neural network accelerator calculation circuit
CN112799726A (en) * 2021-01-26 2021-05-14 上海寒武纪信息科技有限公司 Data processing device, method and related product

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
蔡瑞初; 钟椿荣; 余洋; 陈炳丰; 卢冶; 陈瑶: "面向"边缘"应用的卷积神经网络量化与压缩方法" [Quantization and compression method of convolutional neural networks for "edge" applications], 计算机应用 (Journal of Computer Applications), No. 09

Also Published As

Publication number Publication date
CN113469327B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
CN112799726A (en) Data processing device, method and related product
CN114035916A (en) Method for compiling and scheduling calculation graph and related product
CN113837922A (en) Computing device, data processing method and related product
CN113469336A (en) Compiling method and execution method for optimizing neural network model and related products
CN113469326B (en) Integrated circuit device and board for executing pruning optimization in neural network model
CN113469337B (en) Compiling method for optimizing neural network model and related products thereof
CN113469327B (en) Integrated circuit device for performing conversion advance
CN113469328B (en) Device, board, method and readable storage medium for executing revolution passing
CN116185942A (en) Data processing method, device, storage medium and electronic equipment
CN115081600A (en) Conversion unit for executing Winograd convolution, integrated circuit device and board card
CN115081603A (en) Computing device, integrated circuit device and board card for executing Winograd convolution
CN113792867B (en) Arithmetic circuit, chip and board card
CN113469365B (en) Reasoning and compiling method based on neural network model and related products thereof
CN113742266B (en) Integrated circuit device, electronic apparatus, board and computing method
WO2022063183A1 (en) Device and method for neural network computing, and board and readable storage medium
CN114692811A (en) Device and board card for executing Winograd convolution
CN115081602A (en) Computing device, integrated circuit device and board card for executing Winograd convolution
CN115952848A (en) Convolution operation circuit, compiling method and related product
CN114692848A (en) Device and board card for obtaining convolution result
CN116402091A (en) Hybrid engine intelligent computing method and device for artificial intelligent chip
CN114692846A (en) Data processing device, data processing method and related product
CN114298292A (en) Equipment and method for acquiring operator data and performing offline model operation
CN115599738A (en) Method for optimizing neural network model and related product
CN114692849A (en) Inverse transformation unit, device and board card for inverse transformation of Winograd convolution bit-to-bit multiplication data
CN116090519A (en) Compiling method of convolution operator and related product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant