CN111831332A - Control system and method for intelligent processor and electronic equipment - Google Patents


Info

Publication number
CN111831332A
Authority
CN
China
Prior art keywords
instruction
fractal
sub
decomposition
serial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010689114.7A
Other languages
Chinese (zh)
Inventor
赵永威
支天
杜子东
陈云霁
徐志伟
孙凝晖
郭崎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
University of Chinese Academy of Sciences
Original Assignee
Institute of Computing Technology of CAS
University of Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS and University of Chinese Academy of Sciences
Priority claimed from application CN202010689114.7A
Publication of CN111831332A
Legal status: Pending

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30181 Instruction operation extension or modification
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F 9/3001 Arithmetic instructions
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3867 Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • G06F 9/3869 Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means


Abstract

The present disclosure provides a control system for an intelligent processor, where each layer of fractal computation subunit of the intelligent processor includes the control system. The control system comprises: a serial decomposition module for serially decomposing the fractal instruction set corresponding to a fractal operation executed by the intelligent processor into serial decomposition sub-instructions, and for temporarily storing them; a degradation module for degrading the serial decomposition sub-instructions, rewriting sub-instructions issued by the previous layer's fractal computation subunit to the current layer's subunit into sub-instructions issued by the current layer's subunit to the next layer's subunit; and a parallel decomposition module for decomposing the degraded serial decomposition sub-instructions in parallel, obtaining parallel decomposition sub-instructions that satisfy the concurrency requirements of all fractal computation subunits in the intelligent processor operating concurrently. The control system can efficiently and accurately control the intelligent processor to execute fractal operations and reduction operations.

Description

Control system and method for intelligent processor and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a control system and method for an intelligent processor, and an electronic device.
Background
Machine learning algorithms are increasingly used in industry as a new tool, in fields including image recognition, speech recognition, face recognition, video analysis, intelligent recommendation, and game playing. In recent years, machine-learning-dedicated computers of many different scales have appeared in industry for these increasingly widespread machine learning loads. For example, some smartphones employ a machine learning processor for on-device face recognition, while cloud services employ machine learning computers for acceleration.
Machine learning algorithms have broad prospects, but their application is limited by programming difficulty. Their breadth shows not only in the variety of application domains but also in hardware platforms of different scales. If every application had to be programmed separately for every kind of hardware, the resulting program-scale dependencies would make programming prohibitively difficult. Developers have therefore adopted programming frameworks (e.g., TensorFlow, PyTorch, MXNet) as bridging models between applications and hardware to alleviate this problem.
However, a programming framework only alleviates the programming challenges faced by users; for hardware vendors the challenge becomes more acute. A hardware vendor must now provide a programming interface for every hardware product and also port every programming framework to every product, which incurs enormous software development cost. TensorFlow alone has more than a thousand operators, and optimizing a single operator on one piece of hardware can occupy a senior software engineer for several months.
Disclosure of Invention
In view of the above drawbacks, the present disclosure provides a control system and method for an intelligent processor, and an electronic device, so as to at least partially solve the above technical problems.
According to a first aspect of the present disclosure, there is provided a control system for an intelligent processor, where each layer of fractal computation subunit of the intelligent processor comprises the control system, the control system comprising: a serial decomposition module for serially decomposing the fractal instruction set corresponding to the fractal operation executed by the intelligent processor to obtain serial decomposition sub-instructions, and for temporarily storing them; a degradation module for degrading the serial decomposition sub-instructions, rewriting the sub-instructions issued by the previous layer's fractal computation subunit to the current layer's subunit into sub-instructions issued by the current layer's subunit to the next layer's subunit; and a parallel decomposition module for decomposing the degraded serial decomposition sub-instructions in parallel, obtaining parallel decomposition sub-instructions that satisfy the concurrency requirements of all fractal computation subunits in the intelligent processor operating concurrently.
In some embodiments, the serial decomposition module comprises a first instruction queue temporary storage unit, a serial decomposition unit and a second instruction queue temporary storage unit; the first instruction queue temporary storage unit is used for temporarily storing the fractal instruction set; the serial decomposition unit is used for serially decomposing the fractal instruction set into the serial decomposition sub-instructions which are executed in sequence according to the hardware capacity corresponding to the intelligent processor; the second instruction queue temporary storage unit is used for temporarily storing the serial decomposition sub-instruction.
In some embodiments, the granularity of the serially-decomposed sub-instructions does not exceed the allowable range of the hardware capacity.
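The serial-decomposition rule described above can be sketched as a few lines of code: an instruction whose granularity exceeds the hardware capacity is split into in-order sub-instructions, each within capacity. This is a hypothetical illustration; the names `Instr` and `serial_decompose` and the integer granularity model are ours, not the patent's.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    op: str           # operation, e.g. "Conv2D" or "SortV"
    granularity: int  # size of the data the instruction covers

def serial_decompose(instr: Instr, capacity: int) -> list[Instr]:
    """Split `instr` into sequentially executed sub-instructions whose
    granularity does not exceed the hardware capacity."""
    if instr.granularity <= capacity:
        return [instr]            # already fits: no decomposition needed
    subs = []
    remaining = instr.granularity
    while remaining > 0:
        step = min(capacity, remaining)
        subs.append(Instr(instr.op, step))  # one in-capacity sub-instruction
        remaining -= step
    return subs

subs = serial_decompose(Instr("Conv2D", 10), capacity=4)
# granularities sum back to the original and none exceeds the capacity
assert [s.granularity for s in subs] == [4, 4, 2]
```

The sub-instructions are executed in sequence, so any decomposition whose pieces cover the original granularity is valid; a real decomposer would also split operands accordingly.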
In some embodiments, the degradation module comprises an allocation unit, a DMA, and a replacement unit; the allocation unit is used for allocating local storage space for operands of the serial decomposition sub-instruction that reside in external memory; the DMA is used for writing the data required by the fractal operation from the local memory into the fractal computation subunit before the serial decomposition sub-instruction is executed; and the replacement unit is used for replacing the corresponding operands of the serial decomposition sub-instruction with the local backup operands.
In some embodiments, the DMA is further used to write the data out of the fractal computation subunit to the local memory after the serial decomposition sub-instruction has executed.
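A toy model of the degradation steps listed above, under assumed names: allocate local space for external operands, copy them into local storage before execution (the DMA-in step), and rewrite the sub-instruction to reference the local backups. The dict-based memories are purely illustrative.

```python
external = {"X": [1, 2, 3, 4]}    # operand held in external memory
local = {}                        # local storage of this fractal layer

def degrade(sub_instr: dict) -> dict:
    """Replace external operands of a serial decomposition sub-instruction
    with local backups, copying the data in (the DMA-in step)."""
    new_operands = []
    for name in sub_instr["operands"]:
        if name in external:                      # operand lives externally
            backup = "local_" + name              # allocated local slot
            local[backup] = list(external[name])  # DMA: external -> local
            new_operands.append(backup)           # use the local backup
        else:
            new_operands.append(name)             # already local: keep as-is
    return {"op": sub_instr["op"], "operands": new_operands}

inst = degrade({"op": "SumV", "operands": ["X"]})
assert inst["operands"] == ["local_X"]       # operand replaced
assert local["local_X"] == [1, 2, 3, 4]      # data copied into local storage
```

After execution, the symmetric DMA-out step would copy results from `local` back to `external`, matching the embodiment above.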
In some embodiments, the serial decomposition sub-instructions comprise fractal instructions and local (native) instructions.
In some embodiments, the parallel decomposition module comprises a parallel decomposition unit and a reduction control unit; the parallel decomposition unit is used for performing k-decomposition on the fractal instruction to obtain fractal sub-instructions, and for sending the fractal sub-instructions to the fractal processing units of each layer of the intelligent processor to execute the fractal operation; and the reduction control unit is used for performing k-decomposition on the local instruction to obtain local sub-instructions, and for sending the local sub-instructions to the local processing units of each layer of the intelligent processor so as to perform a reduction operation on the results of each layer's fractal operation.
In some embodiments, the control system further comprises a register; the register is used for temporarily storing the local instruction.
In some embodiments, the reduction control unit is further configured to send the local instruction to the register, send the local instruction to the parallel decomposition unit for k-decomposition when the current fractal operation finishes, and send the decomposed instructions to the fractal processing unit to perform the reduction operation.
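A hedged sketch of the k-decomposition performed by the parallel decomposition unit and reduction control unit described above: a fractal instruction is split into k fractal sub-instructions (one per fractal processing unit), while a matching local instruction drives the LFU's reduction over the partial results. All names and the (op, data) encoding are illustrative.

```python
def k_decompose(op: str, data: list, k: int):
    """Split `data` into k chunks: k fractal sub-instructions plus one
    local (reduction) instruction combining the k partial results."""
    chunk = (len(data) + k - 1) // k
    fractal_subs = [(op, data[i * chunk:(i + 1) * chunk]) for i in range(k)]
    local_instr = ("REDUCE_" + op, k)   # reduction over k partial results
    return fractal_subs, local_instr

# example: a vector sum on 4 FFUs; each FFU sums one slice concurrently,
# then the LFU reduces the 4 partial sums into the final result
subs, red = k_decompose("SumV", list(range(8)), k=4)
partials = [sum(d) for _, d in subs]    # executed concurrently on the FFUs
assert red == ("REDUCE_SumV", 4)
assert sum(partials) == sum(range(8))   # LFU reduction recovers the result
```

The sub-instructions retain the form of fractal instructions, so each FFU can apply the same decomposition recursively at its own layer.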
According to a second aspect of the present disclosure, there is provided a control method for an intelligent processor, by which each layer of a fractal calculation subunit of the intelligent processor is controlled to perform a fractal operation, the control method comprising: performing serial decomposition on a fractal instruction set corresponding to fractal operation executed by the intelligent processor to obtain a serial decomposition sub-instruction, and temporarily storing the serial decomposition sub-instruction; degrading the serial decomposition sub-instruction, and modifying the serial decomposition sub-instruction issued by the fractal calculation sub-unit of the previous layer to the fractal calculation sub-unit of the current layer into a serial decomposition sub-instruction issued by the fractal calculation sub-unit of the current layer to the fractal calculation sub-unit of the next layer; and carrying out parallel decomposition on the degraded serial decomposition sub-instruction to obtain a parallel decomposition sub-instruction which meets the concurrency requirement of concurrent operation of all fractal calculation sub-units in the intelligent processor, so that the fractal calculation sub-unit executes fractal operation according to the parallel decomposition sub-instruction.
In some embodiments, the step of serially decomposing the fractal instruction set corresponding to the fractal operation executed by the intelligent processor to obtain serial decomposition sub-instructions, and temporarily storing them, includes: fetching the fractal instruction set and temporarily storing it in a first instruction queue temporary storage unit; serially decomposing the fractal instruction set, according to the hardware capacity of the intelligent processor, into serial decomposition sub-instructions executed in sequence; and temporarily storing the serial decomposition sub-instructions in a second instruction queue temporary storage unit.
In some embodiments, the granularity of the serially-decomposed sub-instructions does not exceed the allowable range of the hardware capacity.
In some embodiments, the step of degrading the serial decomposition sub-instruction comprises: allocating local storage space for operands of the serial decomposition sub-instruction that reside in external memory; writing the data required by the fractal operation from the local memory into the fractal computation subunit before the serial decomposition sub-instruction is executed; and replacing the corresponding operands of the serial decomposition sub-instruction with the local backup operands.
In some embodiments, the data is written out of the fractal computation subunit to the local memory after execution of the serial decomposition sub-instruction.
In some embodiments, the serial decomposition sub-instruction comprises a fractal instruction and a native instruction.
In some embodiments, the parallel decomposition of the degraded serial decomposition sub-instructions comprises: performing k-decomposition on the fractal instruction to obtain fractal sub-instructions, and sending the fractal sub-instructions to the fractal computation subunits of each layer of the intelligent processor to perform the fractal operation; and performing k-decomposition on the local instruction to obtain local sub-instructions, and sending the local sub-instructions to the local processing units of each layer of the intelligent processor so as to perform a reduction operation on the results of each layer's fractal operation.
In some embodiments, the control method further comprises: the local instruction is temporarily stored by a register.
In some embodiments, the control method further comprises: when the current fractal operation finishes, performing k-decomposition on the local instruction and sending the decomposed instructions to the fractal computation subunit to perform the reduction operation.
According to a third aspect of the present disclosure, there is provided an electronic device comprising the control system described above.
Drawings
FIG. 1 schematically illustrates the fractal von Neumann architecture provided by a first embodiment of the present disclosure;
FIG. 2 schematically illustrates the structure of a control system for an intelligent processor provided by the first embodiment of the present disclosure;
FIG. 3 schematically shows a flowchart of a control method provided by the first embodiment of the present disclosure;
FIG. 4 schematically shows a flowchart of an instruction decomposition method provided by a second embodiment of the present disclosure;
FIG. 5 schematically shows a logic diagram of a specific example of the instruction decomposition method according to the second embodiment of the present disclosure;
FIG. 6 schematically shows a block diagram of an instruction decomposition apparatus provided by the second embodiment of the present disclosure;
FIG. 7 schematically illustrates a fractal pipeline formed by a two-layer intelligent processor provided by a third embodiment of the present disclosure;
FIG. 8 schematically shows a block diagram of an instruction execution apparatus provided by the third embodiment of the present disclosure;
FIG. 9 schematically shows the structure of a memory management device according to a fourth embodiment of the present disclosure;
FIG. 10 schematically shows a flowchart of a memory management method according to the fourth embodiment of the present disclosure.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
It should be noted that in the drawings or the description, the same drawing reference numerals are used for similar or identical parts. Implementations not depicted or described in the drawings are of a form known to those of ordinary skill in the art. Additionally, while examples of parameters including particular values may be provided herein, it should be appreciated that the parameters need not be exactly equal to the respective values, but may be approximated to the respective values within acceptable error tolerances or design constraints. In addition, directional terms such as "upper", "lower", "front", "rear", "left", "right", and the like, referred to in the following embodiments, are directions only referring to the drawings. Accordingly, the directional terminology used is intended to be in the nature of words of description rather than of limitation.
Research has found that an ideal machine learning computer should be homogeneous, serial, and hierarchical in order to simplify programming (both writing machine learning applications and porting programming frameworks). If all machine learning computers, even those of vastly different scales, employed the same instruction set architecture, program migration would no longer need to be redone for each new product, which would greatly improve programmer productivity. On this basis, the embodiments of the present disclosure construct a fractal machine learning computer by introducing the idea of an intelligent processor, so as to solve the above technical problems, as described in detail below.
To construct a fractal machine learning computer, it must first be established that machine learning application loads are suitable for expression in fractal form. The disclosed embodiments study the common computation primitives of several typical machine learning application loads and find that these loads can be described using one set of computation primitives (vector inner product, vector distance, sorting, activation function, counting, etc.).
Machine learning application loads are typically computing and memory intensive applications, but vary widely in terms of execution control flow, learning approaches, and training methodologies. However, all machine learning application loads have a high degree of concurrency at some granularity, and therefore many heterogeneous machine learning computers design dedicated hardware to take advantage of this characteristic to achieve acceleration. Examples of such specialized hardware include GPUs, FPGAs, and ASIC chips. The disclosed embodiments first decompose these application loads into computation primitives, which are then expressed using fractal expressions.
Specifically, the disclosed embodiments select six representative machine learning application loads, execute them on classical datasets, and break down the execution time spent in each computation primitive.
TABLE 1
(Table 1 is rendered as an image in the original document.)
As shown in Table 1, the embodiments of the present disclosure selected the following loads:
CNN: given the prevalence of deep learning, the AlexNet algorithm and the ImageNet dataset were chosen as the representative application load for convolutional neural networks (CNN).
DNN: also for deep learning, a 3-layer multilayer perceptron (MLP) was chosen as the representative application of deep neural networks (DNN).
K-Means: the k-means algorithm, a classical machine learning clustering algorithm.
K-NN: the k-nearest-neighbor algorithm, a classical machine learning classification algorithm.
SVM: support vector machine, a classical machine learning classification algorithm.
LVQ: learning vector quantization, a classical machine learning classification algorithm.
Based on this, machine learning application loads are decomposed into matrix operations and vector operations. Operations such as vector-matrix multiplication and matrix-vector multiplication are merged into matrix multiplication; operations such as matrix-matrix addition/subtraction, matrix-scalar multiplication, and element-wise vector operations are merged into element-wise transformations. The decomposition yields 7 main computation primitives: inner product, convolution, pooling, matrix multiplication, element-wise transformation, sorting, and counting. To simplify the expression of deep learning applications, dedicated convolution and pooling operations are added in addition to matrix multiplication; the inner product is in fact a vector-vector multiplication and can also represent a fully connected layer in a deep neural network. These 7 common computation primitives essentially suffice to express machine learning application loads.
Next, the disclosed embodiments employ fractal operations to describe the above 7 common computation primitives.
TABLE 2
(Table 2 is rendered as an image in the original document.)
As shown in Table 2, each computation primitive may admit multiple k-decomposition modes. Some operations produce partial results after decomposition, and the final result is obtained by reduction; the required reduction operations are listed in Table 2. The fractal sub-operations obtained by decomposing some operations may share input data, in which case data redundancy must be introduced; the redundant parts are listed in Table 2. It is easy to see that, by introducing reduction operations and data redundancy, all 7 common computation primitives can be represented as fractal operations. To design a new dedicated architecture that executes these fractal operations efficiently, the disclosed embodiments must address the following three key challenges:
1. Reduction operations. To process reduction operations efficiently, embodiments of the present disclosure introduce lightweight local processing units (LFUs) into the architecture. After collecting partial result data from the fractal processing units (FFUs), a local processing unit can efficiently perform the reduction operation on it.
2. Data redundancy. During execution of a fractal operation, the embodiments of the present disclosure need to introduce data redundancy. The storage hierarchy of the fractal machine learning computer must therefore ensure data consistency and exploit data-reuse opportunities.
3. Data communication. Communication between different nodes of a fractal machine learning computer may require complex physical wiring, incurring area, latency, and energy overheads. The embodiments of the present disclosure observe that during fractal execution data communication is needed only between parent and child nodes, which greatly simplifies the datapath design: the designer can build the fractal machine learning computer by iterative modularization, and since all wiring is confined to parent-child connections, wiring congestion is reduced.
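As a concrete instance of the reduction and data-redundancy points above, consider matrix multiplication, one of the seven primitives: splitting A by rows gives independent fractal sub-operations, but every sub-operation needs a redundant full copy of B, and concatenating the partial products is the reduction. The plain-list matrices keep this sketch self-contained; it is our illustration, not the patent's decomposition table.

```python
def matmul(X, Y):
    """Naive matrix product of row-major nested lists."""
    return [[sum(x * Y[j][c] for j, x in enumerate(row))
             for c in range(len(Y[0]))]
            for row in X]

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[1, 0], [0, 1]]

# k = 2 decomposition: half the rows of A each, plus a full copy of B
top = matmul(A[:2], B)      # fractal sub-operation on FFU 0
bottom = matmul(A[2:], B)   # fractal sub-operation on FFU 1 (B duplicated)
assert top + bottom == matmul(A, B)   # reduction = concatenation of rows
```

Other primitives need heavier reductions (e.g. a vector sum reduces partial sums by addition), which is why the architecture dedicates an LFU to them.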
The technical solutions of the embodiments of the present disclosure to solve the above-mentioned key challenges are described in detail below.
A first embodiment of the present disclosure provides a control system for an intelligent processor, where each layer of fractal computation subunit of the intelligent processor includes the control system. The control system includes: a serial decomposition module for serially decomposing the fractal instruction set corresponding to the fractal operation executed by the intelligent processor into serial decomposition sub-instructions and temporarily storing them; a degradation module for degrading the serial decomposition sub-instructions, rewriting sub-instructions issued by the previous layer's fractal computation subunit to the current layer's subunit into sub-instructions issued by the current layer's subunit to the next layer's subunit; and a parallel decomposition module for decomposing the degraded serial decomposition sub-instructions in parallel into parallel decomposition sub-instructions that satisfy the concurrency requirements of all fractal computation subunits in the intelligent processor operating concurrently.
Fig. 1 schematically illustrates the fractal von Neumann architecture provided by the first embodiment of the present disclosure. The intelligent processor of the embodiments of the present disclosure is a computing system constructed using this fractal von Neumann architecture.
In geometry, a fractal is a geometric figure that is self-similar across different scales. The concept of fractal therefore involves a scale invariant for describing geometric figures: starting from a set of simple generation rules and repeatedly replacing certain parts of a figure with a given pattern, complex figures of arbitrary scale can be generated. The replacement rule of the figure is the scale invariant. The disclosed embodiments adopt a similar idea, taking the system description as the scale invariant, which yields the fractal von Neumann architecture.
As shown in fig. 1, the fractal von Neumann architecture is an architecture that can be designed iteratively and modularly, composed of copies generated from itself. The minimal fractal von Neumann architecture consists of a memory, a controller, and operators (an LFU and FFUs), and together with input/output modules forms the smallest-scale computing system, i.e., a fractal computation subunit. A larger fractal von Neumann architecture takes smaller fractal von Neumann architectures as its operators and consists of several such concurrent operators together with a controller, a memory, and input/output modules. By extension, the fractal von Neumann architecture enables computing systems of any scale to be built through iterative modular design. The controller employed at each layer of the fractal von Neumann architecture has the same structure; when designing the hardware circuit, this iterative modular design greatly simplifies the design and verification of the control logic.
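The iterative modular construction described above can be mirrored by a small recursive model: every layer is the same kind of node, holding k child nodes of identical shape until a leaf fractal computation subunit is reached. This sketch illustrates only the self-similar structure, not the real hardware.

```python
class FractalNode:
    """One layer of a fractal von Neumann machine: k children per layer."""
    def __init__(self, depth: int, k: int):
        # depth 0 is a leaf computation subunit with no children
        self.children = ([FractalNode(depth - 1, k) for _ in range(k)]
                         if depth > 0 else [])

    def count_leaves(self) -> int:
        if not self.children:
            return 1
        return sum(c.count_leaves() for c in self.children)

# a 3-layer machine with fan-out 4 at every layer has 4**3 leaf units,
# yet every layer is described by the same rule (the scale invariant)
assert FractalNode(3, 4).count_leaves() == 64
```

Because each layer is described by the same rule, one controller design and one verification effort suffice for the whole hierarchy, which is the point made above.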
The fractal von Neumann architecture employs the same instruction set architecture on each layer, called the fractal instruction set architecture (FISA). The fractal instruction set comprises two kinds of instructions: local (native) instructions and fractal instructions.
This embodiment gives a formal definition of the fractal instruction set architecture:
Definition 3.1 (FISA instruction). A FISA instruction I is a triple <O, P, G>, where O is an operation, P is a finite set of operands, and G is a granularity identifier.
Definition 3.2 (fractal instruction). A FISA instruction I<O, P, G> is a fractal instruction if and only if there exists a set of granularity identifiers G'1, G'2, ..., G'n (with G'i ≤ G, where ≤ is a partial ordering relation defined on the granularity-identifier space) such that the execution behavior of I can be simulated by executing I'1(G'1), I'2(G'2), ..., I'n(G'n) in sequence together with other FISA instructions.
Definition 3.3 (FISA instruction set). An instruction set is a FISA instruction set if and only if it contains at least one fractal instruction.
Definition 3.4 (fractal computer). A computer M whose architecture is a FISA instruction set is a fractal computer if and only if at least one fractal instruction executes fractally on M.
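The definitions above can be paraphrased in code: a FISA instruction is a triple <O, P, G>, and Definition 3.2's condition requires granularity identifiers G'i that are no larger than G under the partial order. The integer granularities and the validity check below are our own simplification of the definition, not part of the patent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FISAInstr:
    O: str     # operation
    P: tuple   # finite set of operands
    G: int     # granularity identifier (totally ordered here for brevity)

def simulates(instr: FISAInstr, parts: list) -> bool:
    """Check the G'_i <= G condition of Definition 3.2 for a proposed
    sequence of smaller-granularity instances of the same operation."""
    return all(p.O == instr.O and p.G <= instr.G for p in parts)

conv = FISAInstr("Conv2D", ("in", "w", "out"), 128)
halves = [FISAInstr("Conv2D", ("in", "w", "out"), 64)] * 2
assert simulates(conv, halves)
assert not simulates(conv, [FISAInstr("Conv2D", ("in", "w", "out"), 256)])
```

A full check would also verify that the sequence reproduces I's execution behavior; only the granularity ordering is modeled here.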
TABLE 3
(Table 3 is rendered as an image in the original document.)
The FISA instruction set of the intelligent processor of the disclosed embodiments is designed at a high level of abstraction, which improves programming productivity while achieving a high ratio of computation to memory access. As shown in Table 3, high-level operations such as convolution and sorting can be represented directly by a single instruction. Low-level operations with lower computation-to-memory-access ratios are also included in the instruction set for better programming flexibility; these are typically treated as local (native) instructions, and the intelligent processor prefers to execute them on LFUs to reduce data movement.
Further, local (native) instructions, which describe reduction operations, are issued by the controller to the local processing unit (LFU) of the fractal von Neumann architecture and executed there. On receiving a fractal instruction, the controller performs k-decomposition on it to produce sub-instructions, which retain the form of fractal instructions and are sent to the fractal processing units (FFUs) for execution, together with a local instruction. Thus, when programming a fractal von Neumann architecture, the programmer need only consider a single, serial instruction set architecture; the heterogeneity between the LFU and the FFUs, and the parallelism among multiple FFUs, are handled by the controller. Because the nodes (fractal processing units) at different levels of the fractal von Neumann architecture share the same instruction set architecture, a programmer need not account for differences between levels, nor write different programs for fractal von Neumann computers of different sizes. Moreover, with the same series of fractal von Neumann architectures, a supercomputer can execute the same program as a smart terminal device, so that one set of code runs from the cloud to the edge without modification.
The fractal von Neumann architecture constructs a memory hierarchy and manages memory in two categories: external storage and local storage. Only the outermost external storage is visible to (and must be managed by) the programmer. In a fractal von Neumann architecture, the local storage of one level serves as the external storage of the next level down, shared by all of that level's fractal processing units (FFUs). Unlike the design principle of a Reduced Instruction Set Computer (RISC), in the fractal instruction set architecture all storage space the programmer can operate on resides in external storage, and each level's controller is responsible for data communication between external storage and local storage. The controller of one level generates the instructions for the next level, playing the role of programmer for the next level's controller; it therefore likewise manages only its own level's local storage and never the internal storage of lower levels. With this design, all storage in the fractal von Neumann architecture is managed by the controller of its own level, the division of duties is clear, and programming is simple.
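The recursive external/local relationship above can be sketched in a few lines of Python. This is a toy illustration only; the class name `FractalNode`, the fan-out of four children, and the halving of storage per level are all hypothetical choices, not taken from the patent:

```python
class FractalNode:
    """Toy model: each level's local storage is the external storage
    of the level below it, shared by all child FFUs."""
    def __init__(self, depth, local_size):
        # managed only by this level's controller
        self.local_storage = bytearray(local_size)
        self.children = (
            [FractalNode(depth - 1, local_size // 2) for _ in range(4)]
            if depth > 0 else []
        )

    def external_view_for_children(self):
        # The only space a child may address: its parent's local storage.
        return self.local_storage

root = FractalNode(depth=2, local_size=1024)
child = root.children[0]
# each level manages its own storage; the child's "external" is the root's local
assert child.local_storage is not root.local_storage
assert root.external_view_for_children() is root.local_storage
```

The point of the sketch is the containment rule: a controller never reaches into the internal storage of a lower level, it only hands out its own local storage as the child's external storage.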
Fig. 2 schematically shows a control system structure diagram for an intelligent processor according to a first embodiment of the present disclosure.
As shown in fig. 2, each node (i.e. each layer of fractal calculation subunit) of the intelligent processor has the same controller for managing the child nodes, so that the entire intelligent processor operates in a fractal manner. Each controller comprises a serial decomposition module, a downgrade module and a parallel decomposition module.
The serial decomposition module comprises a first instruction queue temporary storage unit (IQ), a serial decomposition unit (SD) and a second instruction queue temporary storage unit (SQ).
In the serial decomposition stage, incoming fractal instructions are first buffered in the IQ and then fetched by the SD. The SD serially decomposes each fractal instruction into serial decomposition sub-instructions executed in sequence, subject to the hardware capacity limit of the intelligent processor: the granularity of each serial decomposition sub-instruction must not exceed what the hardware capacity allows. The decomposed instructions are written into the SQ for temporary storage. Because the serial decomposition module is buffered by the two first-in first-out queues IQ and SQ, the serial decomposition stage need not run at the synchronous pace of the pipeline; it executes asynchronously on its own until the IQ is empty or the SQ is full.
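The asynchronous IQ/SD/SQ loop can be sketched as below. Granularities are modeled as plain integers and the chunking rule is a hypothetical stand-in for the real capacity check; only the queue discipline (run until IQ is empty or SQ is full) reflects the text:

```python
from collections import deque

def serial_decompose(granularity, capacity):
    # Hypothetical splitting rule: cut one instruction into chunks,
    # none larger than the hardware capacity.
    return [min(capacity, granularity - i) for i in range(0, granularity, capacity)]

def run_sd(iq, sq, capacity, sq_max):
    # The SD stage runs asynchronously, outside the pipeline's pace,
    # until the IQ is empty or the SQ is full.
    while iq and len(sq) < sq_max:
        for sub in serial_decompose(iq.popleft(), capacity):
            sq.append(sub)

iq = deque([10, 7])          # granularities of two incoming fractal instructions
sq = deque()
run_sd(iq, sq, capacity=4, sq_max=16)
assert all(s <= 4 for s in sq)   # every sub-instruction fits the capacity
```

The two FIFOs decouple the SD from the pipeline: the pipeline drains the SQ at its own pace while the SD refills it whenever there is room.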
The degradation module (DD) includes a check unit, an allocation unit, a DMA, and a replacement unit. The DD takes a serial decomposition sub-instruction from the SQ and degrades it, rewriting it from an instruction issued by the previous node to the current node into an instruction issued by the current node to the next node. The specific operations comprise:
The check unit checks whether data dependencies are satisfied, schedules when an instruction may be launched into the pipeline, and inserts pipeline bubbles as needed.
The allocation unit allocates local memory space for the operands of the serial decomposition sub-instruction that reside in external memory.
For the DMA (Direct Memory Access), the DD generates a DMAC instruction that controls the DMA to write data into local storage before the instruction is executed and to write the data out after the instruction is executed, forming a local backup of the external data for convenient access by the next-level node.
The replacement unit replaces the operands of the serial decomposition sub-instruction with their local backup operands.
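A minimal sketch of the allocation, DMA-in, and replacement steps above (the dependency check is omitted for brevity). The dictionary layout of a sub-instruction and the `Mem`/`DMA` helper classes are hypothetical illustrations, not the patent's interfaces:

```python
def degrade(sub_instr, local_mem, dma):
    """Rewrite a sub-instruction so its operands point at local backups.

    sub_instr: {'op': ..., 'operands': {name: external_address}}
    """
    local_ops = {}
    for name, ext_addr in sub_instr['operands'].items():
        local_addr = local_mem.alloc(name)       # allocation unit
        dma.load(src=ext_addr, dst=local_addr)   # DMAC: load before execution
        local_ops[name] = local_addr             # replacement unit
    return {**sub_instr, 'operands': local_ops}

class Mem:                       # toy bump allocator for local storage
    def __init__(self): self.next = 0
    def alloc(self, _name):
        addr = self.next; self.next += 64; return addr

class DMA:                       # records the transfers it was asked to do
    def __init__(self): self.log = []
    def load(self, src, dst): self.log.append((src, dst))

mem, dma = Mem(), DMA()
out = degrade({'op': 'conv', 'operands': {'x': 0x1000, 'w': 0x2000}}, mem, dma)
assert out['operands'] == {'x': 0, 'w': 64}      # operands now local
assert dma.log == [(0x1000, 0), (0x2000, 64)]    # one DMA-in per operand
```

After this step the next-level node only ever sees local addresses, which is exactly what makes the rewritten instruction look like one issued by the current node.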
The parallel decomposition module comprises a parallel decomposition unit (PD) and a reduction controller (RC). The PD performs k-decomposition on a fractal instruction to obtain fractal sub-instructions and sends them to the fractal processing units in each layer's fractal calculation subunit to execute the fractal operation. The RC performs k-decomposition on a local instruction to obtain local sub-instructions and sends them to the local processing unit in each layer's fractal calculation subunit, so as to perform the reduction operation on the results of each layer's fractal operation.
The RC may also decide to delegate a local instruction to the fractal processing units for execution; it may choose to do so when a node with a weak LFU encounters a local instruction with a large amount of computation. In that case the RC does not send the local instruction to the LFU; instead it sends the instruction to a request register (CMR) of the control system for temporary storage for one beat, and on the next beat the instruction is treated as a fractal instruction, decomposed by the PD, and handed to the FFUs for execution. Because the LFU in the pipeline always works one beat behind the FFU, buffering in the CMR does not change the data dependencies on the pipeline, so execution correctness is preserved.
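The RC's routing decision can be sketched as a simple threshold rule. The field name `workload`, the `lfu_capability` parameter, and the threshold comparison are hypothetical; the patent only says delegation happens when a weak LFU meets a compute-heavy local instruction:

```python
def route_local_instruction(instr, lfu_capability, cmr):
    """Send a local instruction to the LFU, or park it in the CMR for one
    beat so the PD decomposes it for the FFUs on the next beat."""
    if instr['workload'] > lfu_capability:
        cmr.append(instr)            # delegated: executed by FFUs via the PD
        return 'FFU (via CMR)'
    return 'LFU'

cmr = []
assert route_local_instruction({'workload': 10}, lfu_capability=100, cmr=cmr) == 'LFU'
assert route_local_instruction({'workload': 500}, lfu_capability=100, cmr=cmr) == 'FFU (via CMR)'
assert len(cmr) == 1                 # only the heavy instruction was delegated
```

The one-beat delay in the CMR is what keeps the pipeline's dependency ordering intact, since the LFU already trails the FFU by one beat.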
In summary, the present embodiment provides an intelligent processor based on a fractal von Neumann architecture by introducing a lightweight local processing unit (LFU). After collecting partial result data from the fractal processing units (FFUs), the local processing unit can efficiently perform reduction operations on them. At the same time, the structure of the intelligent processor's controller is designed so that the intelligent processor can be efficiently and accurately controlled to execute fractal operations.
The first embodiment of the present disclosure further provides a control method for an intelligent processor, by which each layer of fractal calculation subunit of the intelligent processor can be controlled to perform a fractal operation, and fig. 3 schematically illustrates a flowchart of the control method provided by the first embodiment of the present disclosure, and as shown in fig. 3, the control method includes:
S301, performing serial decomposition on the fractal instructions corresponding to the fractal operation executed by the intelligent processor to obtain serial decomposition sub-instructions, and temporarily storing the serial decomposition sub-instructions.
S302, degrading the serial decomposition sub-instruction, and modifying the serial decomposition sub-instruction issued by the fractal calculation sub-unit of the previous layer to the fractal calculation sub-unit of the current layer into the serial decomposition sub-instruction issued by the fractal calculation sub-unit of the current layer to the fractal calculation sub-unit of the next layer.
And S303, carrying out parallel decomposition on the degraded serial decomposition sub-instruction to obtain a parallel decomposition sub-instruction which meets the concurrency requirement of concurrent operation of all fractal calculation sub-units in the intelligent processor, so that the fractal calculation sub-units execute fractal operation according to the parallel decomposition sub-instruction.
For the details of the control method, please refer to the above-mentioned control system, and the technical effects thereof are the same as those of the control system, which are not described herein again.
In order to improve the efficiency and accuracy of the above instruction decomposition, a second embodiment of the present disclosure provides an instruction decomposition method for the control system and method provided by the first embodiment, and fig. 4 schematically illustrates a flowchart of the instruction decomposition method provided by the second embodiment of the present disclosure, and as shown in fig. 4, the method may include:
S401, determining the decomposition priority of the dimensions along which the operands of the fractal instruction are decomposed.
S402, selecting the dimension of the current decomposition according to the decomposition priority.
S403, serially decomposing the operands of the fractal instruction in the current decomposition dimension.
Fig. 5 schematically shows a logic diagram of a specific example of an instruction decomposition method provided in the second embodiment of the present disclosure, and as shown in fig. 5, the specific logic is as follows:
First, the serial decomposition unit needs to record the dimensions t1, t2, ..., tN in which each fractal instruction can be decomposed, in order of priority.
Then, the serial decomposition unit determines which dimension to decompose according to the priority. The specific method is: for a given dimension, set that dimension and all dimensions of lower priority to atomic granularity while keeping the dimensions of higher priority at their original granularity, obtaining a first instruction identifier; decompose the operand according to the first instruction identifier; judge whether the memory capacity required by the decomposed operand is smaller than the capacity of the memory component of the intelligent processor; if so, select this dimension as the dimension of the current decomposition; if not, move on to the next dimension and judge again. That is, for each i = 0, 1, 2, ..., N, set t1, t2, ..., ti to atomic granularity, forming a new granularity identifier <1, 1, ..., 1, t(i+1), t(i+2), ..., tN>.
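The priority scan above can be sketched directly. The cost model (memory footprint as the product of dimension granularities) is a simplifying assumption for illustration; the real check depends on the operand layout:

```python
from math import prod

def mem_needed(granularity):
    # Hypothetical cost model: footprint proportional to the product of dims.
    return prod(granularity)

def pick_dimension(dims, capacity):
    """Try i = 0..N in priority order: set t1..ti to atomic granularity (1),
    keep the rest, and pick the first identifier that fits the capacity."""
    n = len(dims)
    for i in range(n + 1):
        trial = [1] * i + dims[i:]       # the "first instruction identifier"
        if mem_needed(trial) < capacity:
            return i, trial              # dimension i is decomposed next
    return n, [1] * n                    # even all-atomic is the fallback

i, trial = pick_dimension([8, 8, 8], capacity=100)
assert (i, trial) == (1, [1, 8, 8])      # i = 0 needs 512; i = 1 needs 64 < 100
```

The scan stops at the first i that fits, so higher-priority dimensions keep their full granularity whenever the capacity allows it.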
Finally, the operands of the fractal instruction are serially decomposed in the current dimension: the dimensions with priority lower than the current decomposition dimension take atomic granularity, the dimensions with priority higher than the current decomposition dimension keep their original granularity, and the maximum granularity of the current dimension that satisfies the memory component capacity limit of the intelligent processor is determined, yielding the second instruction identifier. The operands of the fractal instruction are then serially decomposed according to the second instruction identifier. That is, having selected dimension ti for serial decomposition, t1, t2, ..., t(i-1) are all decomposed to atomic granularity (granularity 1) while t(i+1), t(i+2), ..., tN keep their original granularity. A binary search then finds the maximum granularity t'i that satisfies the capacity limit, so the final output instructions have the granularity identifier <1, 1, ..., 1, t'i, t(i+1), t(i+2), ..., tN>.
Further, a binary search method determines the maximum granularity t 'satisfying the capacity limit'iThe method comprises the following steps:
Set the minimum decomposition granularity min to 0 and the maximum decomposition granularity max to ti, then probe a decomposition granularity of (min + max)/2 in the ti dimension.
Judge whether the memory capacity required by the decomposed operand exceeds the capacity of the memory component of the intelligent processor; if so, update the maximum decomposition granularity max to (min + max)/2; if not, update the minimum decomposition granularity min to (min + max)/2.
Judge whether max - min equals 1; if so, the search terminates and min is selected as the decomposition granularity t'i; otherwise, probe again.
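The binary search can be written out as below. As before, the product-of-granularities footprint is an illustrative assumption, and the sketch assumes granularity 1 always fits (otherwise no decomposition would help):

```python
from math import prod

def max_fitting_granularity(dims, i, capacity):
    """Largest t'_i such that <1,...,1, t'_i, t_{i+1},...,t_N> fits capacity,
    found by binary search between 1 and t_i."""
    def fits(g):
        return prod([1] * i + [g] + dims[i + 1:]) <= capacity
    if fits(dims[i]):                # no shrinking needed at all
        return dims[i]
    lo, hi = 1, dims[i]              # invariant: fits(lo), not fits(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo

# dims = [8, 8, 8], decompose dimension i = 1: footprint is g * 8,
# so the largest g with g * 8 <= 40 is 5.
assert max_fitting_granularity([8, 8, 8], 1, capacity=40) == 5
```

Feasibility is monotone in the granularity (larger g never needs less memory), which is what makes binary search valid here and gives the log M bound on the number of probes.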
The number of judgments needed in the serial decomposition process is at most N + log M, where M is the maximum hardware capacity. Assuming the serial decomposer can execute one judgment per hardware clock cycle, serially decomposing a fractal instruction with 10 dimensions on a node with 4 GB of storage requires at most 42 clock cycles, so an optimal decomposition scheme can be found within a reasonable time. After the optimal decomposition scheme is found, the serial decomposer cyclically outputs an instruction template at that granularity, and the addresses of the operands in the decomposed sub-instructions are computed by accumulation.
Furthermore, the parallel decomposition of the serially decomposed sub-instructions can be realized as follows: perform k-decomposition on the input instruction and push the decomposed instructions back onto the input stack; repeat until the number of instructions in the stack exceeds the number of FFUs in the node.
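A sketch of that loop, again modeling granularities as integers. The near-equal split rule and the choice k = 2 are hypothetical; the text only fixes the push-back-and-repeat structure and the stopping condition:

```python
def parallel_decompose(granularity, num_ffus, k=2):
    """Repeatedly k-decompose instructions until there are enough
    pieces to occupy all FFUs of the node."""
    stack = [granularity]
    while len(stack) < num_ffus:
        size = stack.pop(0)
        # hypothetical k-decomposition: split into k near-equal parts
        part = max(1, size // k)
        pieces = [part] * (k - 1) + [size - part * (k - 1)]
        stack.extend(pieces)
    return stack

subs = parallel_decompose(16, num_ffus=4)
assert len(subs) >= 4 and sum(subs) == 16   # work is conserved across splits
```

Note the loop splits whichever instruction is at the front, so after a few iterations the pieces are roughly balanced across the FFUs.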
A DMA controller (DMAC) accepts a higher-level instruction form (the DMAC instruction) and can transfer data (e.g., an n-dimensional tensor) according to a higher-level data structure. Internally, the DMAC translates DMAC instructions into low-level DMA control primitives by generating loops that drive DMA execution.
The instruction decomposition method provided by this embodiment finds the optimal decomposition scheme within a reasonable time; according to that scheme, the serial decomposer cyclically outputs an instruction template at the chosen granularity and computes the addresses of the operands in the decomposed sub-instructions by accumulation, thereby improving the parallel efficiency of the fractal operation.
The second embodiment of the present disclosure further provides an instruction decomposition device used in the control system and method provided in the first embodiment, fig. 6 schematically shows a block diagram of the instruction decomposition device provided in the second embodiment of the present disclosure, and as shown in fig. 6, the device 600 may include:
A determination module 610, configured to determine the decomposition priority of the dimensions along which the operands of the fractal instruction are decomposed.
And a selecting module 620, configured to select a dimension of the current decomposition according to the decomposition priority.
The decomposition module 630 is configured to perform serial decomposition on the operand of the fractal instruction in the current decomposition dimension.
For the details of the embodiment of the instruction decomposition device, please refer to the embodiment of the instruction decomposition method, which brings the same technical effects as the embodiment of the instruction decomposition method, and will not be described herein again.
Because the intelligent processor executes fractal operations, the root node decodes the fractal instructions and sends them to its FFUs, and each FFU repeats the same execution pattern down to the leaf nodes. The leaf nodes perform the actual operations and send the results back to their parent nodes, and each node repeats the same pattern until the final result is collected at the root node. In this process an FFU spends most of its time waiting for data and instructions to arrive and, after completing its operation, waiting for the data to return toward the root node. Thus, if not executed in a pipelined manner, the intelligent processor cannot achieve ideal execution efficiency.
In order to improve the throughput of the intelligent processor, a third embodiment of the present disclosure provides an instruction execution method for the intelligent processor, including: instruction decoding, in which a serial decomposition sub-instruction for executing a fractal operation is decoded into a local instruction and a fractal operation instruction; data loading, in which the data required by the fractal operation is read from an external storage unit into a local storage unit of the intelligent processor; operation execution, in which the fractal operation on the data is completed according to the fractal operation instruction; reduction execution, in which a reduction operation is performed on the result of the fractal operation according to the local instruction; and data write-back, in which the reduction result stored in the local memory is written out to the external memory. Instruction decoding, data loading, operation execution, reduction execution, and data write-back are executed in a pipelined manner.
With continued reference to FIG. 2, the execution of a FISA instruction is divided into five pipeline stages: instruction decode (ID), data load (LD), operation execute (EX), reduction execute (RD), and data write-back (WB). In the ID stage, the controller decodes one serial decomposition sub-instruction into three control signals: a local instruction, a fractal instruction, and a DMAC instruction. In the LD stage, the DMA transfers data from external storage to local storage for access by the FFUs and the LFU. In the EX stage, the FFUs complete the fractal sub-operations. In the RD stage, the LFU completes the reduction operation. In the WB stage, the DMA moves the operation result from local storage back to external storage, completing the execution of one serial decomposition sub-instruction.
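The ID-stage split into three control signals can be illustrated as below. The field names of the sub-instruction and of the three signals are hypothetical placeholders for the actual instruction encoding:

```python
def decode(serial_sub_instr):
    """ID stage sketch: one serial decomposition sub-instruction yields
    three control signals for the three downstream units."""
    return {
        'local':   {'op': serial_sub_instr['reduce_op']},    # for the LFU (RD)
        'fractal': {'op': serial_sub_instr['op']},           # for the FFUs (EX)
        'dmac':    {'load':  serial_sub_instr['inputs'],     # for the DMA (LD)
                    'store': serial_sub_instr['outputs']},   # and (WB)
    }

sig = decode({'op': 'conv', 'reduce_op': 'add',
              'inputs': ['x', 'w'], 'outputs': ['y']})
assert set(sig) == {'local', 'fractal', 'dmac'}
```

Each signal drives exactly one kind of unit, which is what lets the five stages overlap: while the FFUs execute one sub-instruction's fractal signal, the DMA can already be loading the next one's inputs.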
Further, before the ID stage, the instruction execution method also includes serial decomposition of the instruction: the SD decomposes the original FISA fractal instructions into serial decomposition sub-instructions. The fractal instructions in the IQ are continuously decomposed into serial decomposition sub-instructions and written into the SQ for temporary storage, independently of and outside the pipeline.
Because the fractal computing system of the disclosed embodiment adopts the fractal von Neumann architecture, at each single level the instructions of that level's fractal calculation subunit are executed through the instruction-decode, data-load, operation-execute, reduction-execute, and data-write-back pipeline. Across the whole architecture, the five-stage pipelines formed at the individual levels compose a recursively nested fractal pipeline. FIG. 7 schematically illustrates the fractal pipeline formed by a two-layer intelligent processor; as shown in FIG. 7, different hatchings represent the execution of different fractal instructions, and each block represents one execution stage of a serial decomposition sub-instruction. Within an EX stage of the upper level, the lower level runs its own pipeline. Thus, except during the startup and drain phases of the pipeline, the intelligent processor can keep all modules at all levels busy at all times.
According to the instruction execution method provided by this embodiment, instruction execution is divided into the instruction decode, data load, operation execute, reduction execute, and data write-back stages, which are executed as a pipeline, while the serial decomposition of instructions executes asynchronously outside the pipeline. All modules on all levels can be kept busy at any time, which improves the data throughput and the execution efficiency of the intelligent processor.
The third embodiment of the present disclosure further provides an instruction execution apparatus for an intelligent processor, and fig. 8 schematically illustrates a block diagram of the instruction execution apparatus provided in the third embodiment of the present disclosure, and as shown in fig. 8, the apparatus 800 may include:
The instruction decoding unit 810 is configured to decode a serial decomposition sub-instruction for performing the fractal operation into a local instruction and a fractal operation instruction.
And a data loading unit 820, configured to read data required by the fractal operation from an external storage unit to a local storage unit of the intelligent processor.
And the operation execution unit 830 is configured to complete a fractal operation on the data according to the fractal operation instruction.
And the reduction execution unit 840 is used for carrying out reduction operation on the result of the fractal operation according to the local instruction.
And a data write-back unit 850, configured to read the reduction operation result stored in the local memory out to an external memory.
The instruction decoding unit, the data loading unit, the operation execution unit, the reduction execution unit, and the data write-back unit execute in a pipelined manner.
For details of the embodiment of the instruction execution apparatus, please refer to the embodiment of the instruction execution method, which brings about the same technical effects as the embodiment of the instruction execution method, and will not be described herein again.
During operation of the controller, the SD, DD, and PD may all need to allocate memory space; memory management of the intelligent processor is therefore critical to overall efficiency. The space allocated by the PD usually survives only the two adjacent pipeline stages EX and RD; the space allocated by the DD survives one complete serial decomposition sub-instruction cycle; and the life cycle of the space allocated by the SD spans multiple serial decomposition sub-instruction cycles.
Based on the difference in instruction life cycle, a memory management device according to a fourth embodiment of the present disclosure is provided, and fig. 9 schematically illustrates a structure diagram of the memory management device according to the fourth embodiment of the present disclosure, as shown in fig. 9, the memory management device 900 includes:
The circular memory segment 910 is used for storing the external data contained in a serial decomposition sub-instruction, the calculation results, the temporary intermediate results required for reduction, and the like.
Three hardware functional units may access the circular memory segment: the FFUs (in the EX stage), the LFU (in the RD stage), and the DMA (in the LD and WB stages). The circular memory segment is therefore divided into three regions, a first memory region 911, a second memory region 912, and a third memory region 913, used respectively for the fractal operation, the reduction operation, and the data loading and write-back during operation of the intelligent processor. Each functional unit uses one region at a time, avoiding data collisions. The three units take turns using the first memory region 911, the second memory region 912, and the third memory region 913 following the cyclic execution of the pipeline. The cycle proceeds as follows: after the FFU executes an EX stage on a region, the LFU acquires that region in the next pipeline cycle and completes the RD stage in it; after the LFU completes the RD stage, the DMA acquires the region in the next pipeline cycle, first completing the WB stage and then the LD stage of a new instruction; in the cycle after that, the region returns to the FFU, and so on.
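The rotation of region ownership can be modeled with a deque. The class and method names are illustrative only; the sketch captures just the FFU-to-LFU-to-DMA hand-off described above:

```python
from collections import deque

class CircularSegment:
    """Toy model: the three regions of the circular memory segment rotate
    among the FFU (EX), LFU (RD), and DMA (WB then LD) each pipeline cycle,
    so no two units ever touch the same region in the same cycle."""
    def __init__(self):
        # owners[r] is the unit currently holding region r
        self.owners = deque(['FFU', 'LFU', 'DMA'])

    def advance(self):
        # every region passes FFU -> LFU -> DMA -> FFU simultaneously
        self.owners.rotate(-1)

    def owner_of(self, region):
        return self.owners[region]

seg = CircularSegment()
assert seg.owner_of(0) == 'FFU'
seg.advance()
assert seg.owner_of(0) == 'LFU'   # the region the FFU just used passes to the LFU
seg.advance(); seg.advance()
assert seg.owner_of(0) == 'FFU'   # back to the FFU after a full cycle
```

Because ownership is exclusive per cycle, no locking is needed between the three units; the pipeline's beat itself serializes their accesses.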
The static memory segment 920 includes a fourth memory area 921 and a fifth memory area 922, and stores data that is loaded in advance during serial decomposition and shared among multiple serial decomposition sub-instructions. The static memory segment is divided into two areas so that the SD alternates between them for successive input fractal instructions, avoiding data collisions caused by overlapping life cycles of adjacent instructions.
Further, although the DD and SD control the allocation of memory, memory space is not actively released. The space is recycled as the pipeline advances: after one cycle of memory segment usage, new data directly overwrites the old. To make full use of data temporarily held in memory, as shown in FIG. 2, the memory management device further includes a tensor replacement unit (tensor replacement table, TTT) that records the external storage address corresponding to the data currently stored in the circular or static memory segment. When a subsequent operation needs to access the data at the same external address, the access is redirected so that the backup temporarily stored in the local memory of the intelligent processor is used in place of the data in external memory, reducing data traffic. During operation of the intelligent processor, the first memory region 911, the second memory region 912, and the third memory region 913 are called cyclically; on entering the next cycle, the tensor replacement unit clears the external storage addresses recorded in the current cycle, guaranteeing the timeliness of the replacement data. With the TTT, the intelligent processor can forward the operation result of the previous serial decomposition sub-instruction (produced at the end of its RD stage) directly to the input of the next serial decomposition sub-instruction (needed before its EX stage begins), without writing it back and reading it in again. The TTT can significantly improve the execution efficiency of the intelligent processor while maintaining data consistency.
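The TTT's record/lookup/flush behavior can be sketched as a small address map. The class name and the flat address-to-address mapping are simplifying assumptions; a real TTT would track tensor extents, not single addresses:

```python
class TensorReplacementTable:
    """Maps an external storage address to its live local backup so a
    repeated read can be forwarded instead of re-fetched."""
    def __init__(self):
        self.table = {}

    def record(self, ext_addr, local_addr):
        self.table[ext_addr] = local_addr

    def lookup(self, ext_addr):
        # Hit: reuse the local backup; miss: caller must DMA from external.
        return self.table.get(ext_addr)

    def flush(self):
        # Cleared when the circular regions enter their next rotation cycle,
        # since the backups are about to be overwritten.
        self.table.clear()

ttt = TensorReplacementTable()
ttt.record(0x4000, 0x10)
assert ttt.lookup(0x4000) == 0x10   # forwarded from the local backup
ttt.flush()
assert ttt.lookup(0x4000) is None   # stale entries gone after the cycle turns
```

The flush on each rotation cycle is what keeps forwarding consistent: an entry never outlives the region that actually holds its backup.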
In this embodiment, classifying and managing the controller's memory according to the differences in instruction execution life cycles improves the execution efficiency of the intelligent processor; adding the tensor replacement unit to the memory management device further improves that efficiency significantly while maintaining data consistency.
Those skilled in the art will also appreciate that, in addition to implementing the client and server as pure computer readable program code, the client and server could well be implemented by logically programming method steps such that the client and server perform the same functions in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such clients and servers may thus be considered a hardware component, and the means included therein for performing the various functions may also be considered as structures within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solution of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment is described with emphasis on differences from other embodiments. In particular, both for the embodiments of the client and the server, reference may be made to the introduction of embodiments of the method described above.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
A fourth embodiment of the present disclosure further provides a memory management method for an intelligent processor, and fig. 10 schematically shows a flowchart of the memory management method provided in the fourth embodiment of the present disclosure, and as shown in fig. 10, the memory management method includes:
S1001, when serially decomposing an input fractal instruction, store data using the fourth memory area and the fifth memory area of the static memory segment.
S1002, during operation of the intelligent processor, the fractal operation, the reduction operation, and the data loading and write-back of the intelligent processor respectively call the first memory area, the second memory area, and the third memory area of the circular memory segment.
For the details of the embodiment of the memory management method, please refer to the embodiment of the memory management device, which brings the same technical effects as the embodiment of the memory management device, and will not be described herein again.
Furthermore, in some embodiments of the present disclosure, a chip is disclosed that includes the above-described smart processor.
In some embodiments of the present disclosure, a chip package structure is disclosed, which includes the above chip.
In some embodiments of the present disclosure, a board card is disclosed, which includes the above chip packaging structure.
In some embodiments of the present disclosure, an electronic device is disclosed, which includes the above board card.
The electronic device comprises a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
The above-mentioned embodiments are intended to illustrate the objects, aspects and advantages of the present disclosure in further detail, and it should be understood that the above-mentioned embodiments are only examples of the present disclosure and are not intended to limit the present disclosure, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (19)

1. A control system for a smart processor, wherein each layer of fractal calculation subunit of the smart processor comprises the control system, the control system comprising:
the serial decomposition module is used for serially decomposing a fractal instruction set corresponding to the fractal operation executed by the intelligent processor to obtain serial decomposition sub-instructions and temporarily storing the serial decomposition sub-instructions;
the degradation module is used for degrading the serial decomposition sub-instruction and modifying the serial decomposition sub-instruction issued by the fractal calculation sub-unit of the previous layer to the fractal calculation sub-unit of the current layer into the serial decomposition sub-instruction issued by the fractal calculation sub-unit of the current layer to the fractal calculation sub-unit of the next layer;
and the parallel decomposition module is used for performing parallel decomposition on the degraded serial decomposition sub-instruction to obtain a parallel decomposition sub-instruction which meets the concurrency requirement of concurrent operation of all fractal calculation sub-units in the intelligent processor.
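The three-stage control flow of claim 1 (serial decomposition, degradation to the next layer, then parallel decomposition) can be illustrated with a minimal sketch. All names here (`serial_decompose`, `demote`, `parallel_decompose`, the dict-based instruction encoding) are illustrative assumptions, not taken from the patent:

```python
def serial_decompose(instr, capacity):
    """Stage 1: split an operation of instr["size"] elements into
    sequentially executed chunks that each fit the hardware capacity."""
    size = instr["size"]
    return [{"op": instr["op"], "size": min(capacity, size - off), "offset": off}
            for off in range(0, size, capacity)]

def demote(sub, layer):
    """Stage 2: re-issue a sub-instruction from layer i to layer i+1
    (hypothetical encoding: the target layer is stored in the dict)."""
    out = dict(sub)
    out["layer"] = layer + 1
    return out

def parallel_decompose(sub, k):
    """Stage 3: split one serial sub-instruction across k fractal
    calculation sub-units that run concurrently."""
    base, rem = divmod(sub["size"], k)
    parts, off = [], sub["offset"]
    for i in range(k):
        n = base + (1 if i < rem else 0)
        if n:
            parts.append({"op": sub["op"], "size": n, "offset": off, "unit": i})
            off += n
    return parts

def control_pipeline(instr, capacity, k, layer=0):
    """Run one fractal instruction through all three stages."""
    serial_subs = serial_decompose(instr, capacity)
    demoted = [demote(s, layer) for s in serial_subs]
    return [parallel_decompose(d, k) for d in demoted]
```

Under these assumptions, a 10-element operation on hardware with capacity 4 and 2 sub-units yields three serial chunks of sizes 4, 4 and 2, each split in two for the concurrent sub-units.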
2. The control system of claim 1, wherein the serial decomposition module comprises a first instruction queue temporary storage unit, a serial decomposition unit, and a second instruction queue temporary storage unit;
the first instruction queue temporary storage unit is used for temporarily storing the fractal instruction set;
the serial decomposition unit is used for serially decomposing the fractal instruction set into the serial decomposition sub-instructions which are executed in sequence according to the hardware capacity corresponding to the intelligent processor;
the second instruction queue temporary storage unit is used for temporarily storing the serial decomposition sub-instruction.
3. The control system of claim 2, wherein the granularity of the serial decomposition sub-instructions does not exceed the range allowed by the hardware capacity.
4. The control system of claim 1, wherein the degradation module comprises an allocation unit, a DMA, and a replacement unit;
the allocation unit is used for allocating a local storage space for an operand of the serial decomposition sub-instruction that is located in an external memory;
the DMA is used for writing data required by the fractal operation into the fractal calculation subunit from a local memory before the serial decomposition sub-instruction is executed;
the replacing unit is used for replacing the operand corresponding to the serial decomposition sub-instruction with a local backup operand.
5. The control system of claim 4, wherein the DMA is further configured to write the data out of the fractal calculation subunit to the local memory after execution of the serial decomposition sub-instruction.
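The degradation flow of claims 4 and 5 (allocate local space for an external operand, copy it in before execution, substitute the local copy into the sub-instruction, and copy the result back out afterwards) can be sketched as follows. `LocalMemory`, `demote_operand`, `write_back` and the dict encoding are assumed names for illustration only:

```python
class LocalMemory:
    """Toy local memory with a bump allocator (stands in for the allocation unit)."""
    def __init__(self, size):
        self.buf = [0] * size
        self.next_free = 0

    def allocate(self, n):
        addr = self.next_free
        assert addr + n <= len(self.buf), "local memory exhausted"
        self.next_free += n
        return addr

def demote_operand(local, external, sub):
    """Stage an external operand into local memory and patch the sub-instruction."""
    n = sub["size"]
    addr = local.allocate(n)                                              # allocation unit
    local.buf[addr:addr + n] = external[sub["offset"]:sub["offset"] + n]  # DMA write-in
    patched = dict(sub, operand=("local", addr))                          # replacement unit
    return patched, addr

def write_back(local, external, sub, addr):
    """DMA the computed data back out after the sub-instruction finishes (claim 5)."""
    n = sub["size"]
    external[sub["offset"]:sub["offset"] + n] = local.buf[addr:addr + n]
```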
6. The control system of claim 1, wherein the serial disassembly sub-instruction comprises a fractal instruction and a native instruction.
7. The control system of claim 6, wherein the parallel decomposition module comprises a parallel decomposition unit and a reduction control unit;
the parallel decomposition unit is used for performing k-decomposition on the fractal instruction to obtain a fractal sub-instruction, and sending the fractal sub-instruction to the fractal processing units of each layer of the intelligent processor to execute the fractal operation;
and the reduction control unit is used for performing k-decomposition on the local instruction to obtain a local sub-instruction, and sending the local sub-instruction to the local processing units of each layer of the intelligent processor so as to perform a reduction operation on the result of each layer's fractal operation.
8. The control system of claim 6, further comprising a register; the register is used for temporarily storing the local instruction.
9. The control system according to claim 8, wherein the reduction control unit is further configured to send the local instruction to the register, send the local instruction to the parallel decomposition unit for k-decomposition when the current fractal operation is completed, and send the decomposed instruction to the fractal processing unit to perform the reduction operation.
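The k-decomposition and reduction described in claims 7 through 9 can be sketched with a simple example: each fractal unit computes a partial result over its share of the data, and the local (reduction) instruction combines the partials once the fractal operations complete. `k_decompose` and `run_fractal_sum` are assumed names, and summation is merely one convenient reduction to illustrate the pattern:

```python
def k_decompose(data, k):
    """Split `data` into k near-equal parts, one per fractal sub-unit."""
    base, rem = divmod(len(data), k)
    parts, off = [], 0
    for i in range(k):
        n = base + (1 if i < rem else 0)
        parts.append(data[off:off + n])
        off += n
    return parts

def run_fractal_sum(data, k):
    """Each fractal unit sums its part (fractal operation);
    the local unit then reduces the partial results (reduction operation)."""
    partials = [sum(part) for part in k_decompose(data, k)]
    return sum(partials)
```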
10. A control method for an intelligent processor, wherein each layer of fractal calculation subunit of the intelligent processor is controlled by the control method to perform fractal operation, and the control method comprises the following steps:
performing serial decomposition on a fractal instruction set corresponding to fractal operation executed by the intelligent processor to obtain serial decomposition sub-instructions, and temporarily storing the serial decomposition sub-instructions;
degrading the serial decomposition sub-instruction, and modifying the serial decomposition sub-instruction issued by the fractal calculation sub-unit of the previous layer to the fractal calculation sub-unit of the current layer into a serial decomposition sub-instruction issued by the fractal calculation sub-unit of the current layer to the fractal calculation sub-unit of the next layer;
and carrying out parallel decomposition on the degraded serial decomposition sub-instruction to obtain a parallel decomposition sub-instruction which meets the concurrency requirement of concurrent operation of all fractal calculation sub-units in the intelligent processor, so that the fractal calculation sub-unit executes fractal operation according to the parallel decomposition sub-instruction.
11. The control method according to claim 10, wherein the step of performing serial decomposition on the fractal instruction set corresponding to the fractal operation performed by the intelligent processor to obtain serial decomposition sub-instructions and temporarily storing the serial decomposition sub-instructions comprises:
acquiring the fractal instruction set, and temporarily storing the fractal instruction set in a first instruction queue temporary storage unit;
serially decomposing the fractal instruction set into serially decomposed sub-instructions which are executed in sequence according to the hardware capacity corresponding to the intelligent processor;
and temporarily storing the serial decomposition sub-instruction in a temporary storage unit of a second instruction queue.
12. The control method according to claim 11, wherein the granularity of the serial decomposition sub-instructions does not exceed the range allowed by the hardware capacity.
13. The control method of claim 10, wherein the step of downgrading the serially resolved sub-instruction comprises:
allocating a local storage space for an operand located in an external memory in the serial decomposition sub-instruction;
before the serial decomposition sub-instruction is executed, writing data required by the fractal operation into the fractal calculation sub-unit from a local memory;
and replacing the operand corresponding to the serial decomposition sub-instruction with a local backup operand.
14. The control method of claim 13, wherein the data is written out of the fractal calculation subunit to the local memory after execution of the serial decomposition sub-instruction.
15. The control method of claim 13, wherein the serial decomposition sub-instruction comprises a fractal instruction and a local instruction.
16. The control method of claim 15, wherein said parallel decomposing the downgraded serial decomposition sub-instructions comprises:
performing k-decomposition on the fractal instruction to obtain a fractal sub-instruction, and sending the fractal sub-instruction to fractal calculation sub-units of each layer of the intelligent processor to perform fractal operation;
and performing k-decomposition on the local instruction to obtain a local sub-instruction, and sending the local sub-instruction to the local processing units of each layer of the intelligent processor so as to perform a reduction operation on the result of each layer's fractal operation.
17. The control method according to claim 15, characterized by further comprising:
the local instruction is temporarily stored by a register.
18. The control method according to claim 17, characterized by further comprising:
and when the current fractal operation is finished, performing k-decomposition on the local instruction, and sending the decomposed instruction to the fractal calculation subunit to perform the reduction operation.
19. An electronic device, characterized in that it comprises a control system according to any one of claims 1-9.
CN202010689114.7A 2020-07-16 2020-07-16 Control system and method for intelligent processor and electronic equipment Pending CN111831332A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010689114.7A CN111831332A (en) 2020-07-16 2020-07-16 Control system and method for intelligent processor and electronic equipment


Publications (1)

Publication Number Publication Date
CN111831332A true CN111831332A (en) 2020-10-27

Family

ID=72923510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010689114.7A Pending CN111831332A (en) 2020-07-16 2020-07-16 Control system and method for intelligent processor and electronic equipment

Country Status (1)

Country Link
CN (1) CN111831332A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5418942A (en) * 1989-07-06 1995-05-23 Krawchuk; Kenneth V. System and method for storing and managing information
CN110489087A (en) * 2019-07-31 2019-11-22 北京字节跳动网络技术有限公司 A kind of method, apparatus, medium and electronic equipment generating fractal structure
CN111860803A (en) * 2019-04-27 2020-10-30 中科寒武纪科技股份有限公司 Fractal calculation device and method, integrated circuit and board card

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5418942A (en) * 1989-07-06 1995-05-23 Krawchuk; Kenneth V. System and method for storing and managing information
CN111860803A (en) * 2019-04-27 2020-10-30 中科寒武纪科技股份有限公司 Fractal calculation device and method, integrated circuit and board card
CN111860807A (en) * 2019-04-27 2020-10-30 中科寒武纪科技股份有限公司 Fractal calculation device and method, integrated circuit and board card
CN111860805A (en) * 2019-04-27 2020-10-30 中科寒武纪科技股份有限公司 Fractal calculation device and method, integrated circuit and board card
CN110489087A (en) * 2019-07-31 2019-11-22 北京字节跳动网络技术有限公司 A kind of method, apparatus, medium and electronic equipment generating fractal structure

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YONGWEI ZHAO et al.: "Cambricon-F: Machine Learning Computers with Fractal von Neumann Architecture", 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture *
YONGWEI ZHAO et al.: "Machine Learning Computers With Fractal von Neumann Architecture", IEEE Transactions on Computers *

Similar Documents

Publication Publication Date Title
JP6821002B2 (en) Processing equipment and processing method
US11442786B2 (en) Computation method and product thereof
CN111860807B (en) Fractal calculation device, fractal calculation method, integrated circuit and board card
CN111831582B (en) Memory management device and method for intelligent processor and electronic equipment
Zhao et al. Machine learning computers with fractal von Neumann architecture
Chen et al. An instruction set architecture for machine learning
CN111831333B (en) Instruction decomposition method and device for intelligent processor and electronic equipment
Mishra et al. Artificial intelligence accelerators
US11841822B2 (en) Fractal calculating device and method, integrated circuit and board card
US20190042941A1 (en) Reconfigurable fabric operation linkage
CN111831339B (en) Instruction execution method and device for intelligent processor and electronic equipment
Meng et al. Ppoaccel: A high-throughput acceleration framework for proximal policy optimization
CN111831332A (en) Control system and method for intelligent processor and electronic equipment
CN111857824A (en) Control system and method for fractal intelligent processor and electronic equipment
Nguyen et al. Design and Implementation of a Coarse-grained Dynamically Reconfigurable Multimedia Accelerator
US11775299B1 (en) Vector clocks for highly concurrent execution engines
US20230385125A1 (en) Graph partitioning and implementation of large models on tensor streaming processors
Khurge Strategic Infrastructural Developments to Reinforce Reconfigurable Computing for Indigenous AI Applications
US11809981B1 (en) Performing hardware operator fusion
Andrade et al. Multi-Processor System-on-Chip 1: Architectures
Nazar Shahsavani et al. Efficient Compilation and Mapping of Fixed Function Combinational Logic onto Digital Signal Processors Targeting Neural Network Inference and Utilizing High-level Synthesis
CN118034696A (en) Calculation map compiling method, compiling device, calculating device and storage medium
CN118014022A (en) Deep learning-oriented FPGA universal heterogeneous acceleration method and equipment
CN117075903A (en) Tensor-based compiling method, tensor-based compiling device and computer-readable storage medium for tensor-based compiling device
WO2019032396A1 (en) Reconfigurable fabric operation linkage

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201027