CN113438171A

CN113438171A - Multi-chip connection method of low-power-consumption storage and calculation integrated system

Info

Publication number: CN113438171A
Application number: CN202110497911.XA
Authority: CN
Inventors: 唐建石; 臧浩名; 吴华强; 高滨; 钱鹤
Original assignee: Tsinghua University
Current assignee: Beijing Xinli Technology Innovation Center Co.,Ltd.
Priority date: 2021-05-08
Filing date: 2021-05-08
Publication date: 2021-09-24
Anticipated expiration: 2041-05-08
Also published as: CN113438171B

Abstract

The invention belongs to the technical field of integrated circuits, and in particular relates to a multi-chip connection method of a low-power-consumption storage and calculation integrated system. Drawing on on-chip interconnection networks and the PCIe communication protocol, the method provides an efficient, low-power design for the inter-chip interconnection of storage and calculation integrated chips. It optimizes the transaction arrangement and packet efficiency of inter-chip interaction, simplifies path selection by cutting the internal hops of each routing node, and greatly reduces the probability of network deadlock. Compared with conventional schemes, the method greatly reduces hardware overhead, markedly improves transmission efficiency, and is markedly better in terms of low power consumption and generality. It addresses the problem of efficient inter-chip interconnection that arises when future storage and calculation integrated chip systems map complex neural network algorithms onto hardware.

Description

Multi-chip connection method of low-power-consumption storage and calculation integrated system

Technical Field

The invention belongs to the technical field of integrated circuits, and particularly relates to a multi-chip connection method of a low-power-consumption storage and calculation integrated system.

Background

Storage and calculation integrated (compute-in-memory) chips based on resistive random access memory (RRAM) use RRAM devices as the basic building block, arranged in crossbar arrays that form the basic operation units. RRAM offers significant advantages over conventional memories: it is fast, with read and write times below 5 ns at best; it is reliable, with excellent data retention (data can be held normally for 3000 hours at 150 ℃) and an endurance exceeding 10¹² write/erase cycles; it consumes little power, allowing programming at below 1 pJ; and it has a simple structure, with a minimum cell size of 4F² and support for three-dimensional integration, which enables high-density storage. At present, mainstream research on storage and calculation integrated chips still focuses on optimizing device performance and organizing the basic chip architecture, and does not address the interconnection between such chips. Take the mapping of a neural network algorithm onto crossbar arrays as an example: mapping the convolutional neural network AlexNet onto a storage and calculation integrated chip requires dozens of crossbar arrays, and both the arrangement of inputs on the arrays and the data exchange among the arrays are customized and optimized for AlexNet and implemented as an application-specific integrated circuit (hereinafter referred to as ASIC).

Such a custom circuit is very energy efficient, but it also has significant drawbacks. First, new neural network algorithms keep emerging without end, a customized design can only be optimized for one particular algorithm, and making a dedicated circuit compatible with all algorithms is currently technically very difficult. Second, the resources within a single chip cannot be expanded without limit; a more complex network cannot be handled simply by stacking resources, because doing so greatly increases the data scheduling difficulty and the power consumption of the chip. A large system composed of multiple storage and calculation integrated chips therefore needs to be established.

The main characteristic of computing with resistive random access memory is that the two operands of an operation, such as the multiplier and the multiplicand of a multiplication, must be delivered to the device at different times, and in a convolution operation one of the operands stays constant. The resistive random access memory is an analog device: its conductance can be changed by controlling the input voltage, and its resistance then remains fixed in a given state. The conductance value deployed on the device can be regarded as the multiplicand that stays unchanged during multiplication, so the two inputs of a multiplication mapped onto the resistive random access memory must be handled separately. The data path and the control path of a storage and calculation integrated chip are therefore usually designed separately: the control path carries the control instructions and the network weight deployment of the chip, and the data path carries the network features and the computation results of the chip.

In the prior art, bus connection is the most widely used system interconnection scheme. A bus hop needs only bus arbitration to connect a master device directly to a slave device, and high-speed buses operate quickly. However, bus interfaces are relatively complex; a high-speed bus in particular generally has dozens of standard interface signals, which is very unfavorable for inter-chip transmission. The PCIe protocol is a high-speed interconnection protocol proposed by Intel specifically for computer systems. Its advantage is that the bottom layer uses a high-speed differential serial interface, which greatly reduces the number of pins between chips. In a storage and calculation integrated system, however, the complex transactions defined for computer systems are redundant, packet efficiency is relatively low, the tree-shaped connection of the PCIe bus is prone to local breakdown when a root node is blocked, and peer-to-peer communication is inflexible. At the same time, the over-complete design is costly in energy, and the self-training process of the network after power-on is inefficient. Finally, a routing network is highly flexible, the number of bottom-layer interfaces can be reduced through optimized design, and its adaptive capability is strong; but hops in a routing network are generally complex, and an overly flexible network configuration is prone to deadlock and livelock.

Disclosure of Invention

The invention aims to provide a multi-chip connection method of a low-power-consumption storage and calculation integrated system, which is used for realizing customized design of multi-chip interconnection of the storage and calculation integrated chip system, so that the transmission efficiency of the storage and calculation integrated chip system is improved, and the power consumption is reduced.

The invention provides a multi-chip connection method of a low-power-consumption storage and calculation integrated system, which comprises the following steps:

(1) addressing routing nodes in a network formed by low-power-consumption storage and calculation integrated chips, wherein base address information comprises the abscissa and the ordinate of the routing nodes;

(2) a superior master device connected with the storage and calculation integrated chip sends a routing node base address configuration instruction to the routing node connected with the master device; the local instruction analysis unit on that routing node parses the instruction, assigns the corresponding routing node base address register according to it, and sends an enable signal to the central switch of the router;

(3) the central switch sends a node base address configuration instruction corresponding to the enabling signal to other adjacent routing nodes through an x positive sending buffer area, an x negative sending buffer area, a y positive sending buffer area and a y negative sending buffer area respectively according to the received enabling signal;

(4) adjacent routing nodes receive the base address configuration instruction through the corresponding receiving buffers and pass it to the link instruction analysis unit associated with each receiving buffer; after the link instruction analysis unit obtains the enable signal, it assigns the corresponding routing node base address register according to the base address configuration instruction and then checks the assignment; if the assignment fails, a base address register assignment failure message is generated, and a base address configuration failure feedback instruction is generated according to the source base address value in the base address configuration instruction; the link instruction analysis unit selects the sending buffer corresponding to the current input interface and sends the base address configuration failure feedback instruction to that sending buffer; if the assignment succeeds, the position of the current routing node is further determined from the network scale field in the base address configuration instruction; if the current routing node is at the edge of the network, a base address configuration success feedback instruction is generated and sent to the corresponding sending buffer according to the same rule as for the failure feedback instruction; if the current routing node is not at the edge of the network, the base address configuration instruction is processed according to the packet header I/R value it carries and then sent to the corresponding sending buffers, the rule being as follows: if the base-address-to-be-configured field of the current base address configuration instruction differs from the network scale field of the instruction, the link instruction analysis unit selects two of the four sending buffers of the routing node and issues a four-bit sending request to the central switch; after receiving the request, the central switch processes the base address configuration instruction according to the sending buffer selection rule and sends the processed instruction to the corresponding sending buffers;

(5) traversing all routing nodes in the network, repeating the step (4), if the routing node connected with the main equipment receives a base address configuration failure feedback instruction sent by other routing nodes, the link instruction analysis unit regenerates an appointed base address configuration instruction according to a source base address field in the base address configuration failure feedback instruction; the link instruction analysis unit generates a four-bit sending buffer area selection request according to the regenerated specified base address configuration instruction and sends the request to the central arbiter;

(6) after receiving the sending request, the central arbiter selects one sending request from the four sending requests according to the sending buffer selection rule in the step (4) to output an arbitration result, and sends the arbitration result to the central switch, the central switch selects a corresponding sending buffer according to the arbitration result of the central arbiter, and a specified base address configuration instruction generated by the link instruction analysis unit is sent; the link instruction analysis unit adds one to an error counter in the link instruction analysis unit, subtracts one from the error counter when receiving a feedback instruction that the base address configuration of the destination routing node is successful, and completes network initialization when the routing node connected with the master device receives the feedback instruction that the base address configuration transmitted by the four corner routing nodes is successful and the error counter returns to zero;

(7) a superior main device connected with an integrated chip sends a feature configuration instruction to a routing node, an instruction analysis unit of the routing node packages the feature configuration instruction according to a package format after receiving the feature configuration instruction, compares a target base address field for generating a package head with a base address register of the current routing node, generates a four-bit sending buffer area selection request according to a comparison result, a central arbiter selects one of the four-bit sending requests to output an arbitration result after receiving the sending request, and sends the arbitration result to a central switch, the central switch selects a corresponding sending buffer area according to the arbitration result of the central arbiter, and sends the packaged feature configuration instruction to an adjacent corresponding routing node through the sending buffer area;

(8) after the corresponding adjacent routing node receives the feature configuration instruction, its instruction analysis unit is triggered to parse the instruction; if the destination address in the packet header equals the base address register value of the current routing node, the feature configuration instruction is a local instruction: the instruction analysis unit issues an arbitration request to the local arbiter, waits for arbitration to pass, unpacks the feature configuration instruction, and passes it through the local switch into the local receiving buffer; if the destination address in the packet header does not equal the base address register value of the current routing node, the feature configuration instruction continues to propagate over the network: the instruction analysis unit issues to the central arbiter a four-bit sending buffer selection request containing either one or two 1s, and after arbitration passes the instruction is passed through the central switch into the corresponding sending buffer, until the feature configuration instruction reaches the destination router;

(9) the storage and calculation integrated chip returns a calculation result to the main equipment in the network through the destination router, and multi-chip connection between the network main equipment and the routing node in the low-power-consumption storage and calculation integrated system is achieved.

The invention provides a multi-chip connection method of a low-power-consumption storage and calculation integrated system, which has the characteristics and advantages that:

The multi-chip connection method of the low-power-consumption storage and calculation integrated system draws on conventional on-chip interconnection networks and the PCIe communication protocol to design the inter-chip interconnection of storage and calculation integrated chips specifically for high efficiency and low power consumption. The method combines the interface simplicity of the PCIe protocol with the flexibility of a routing network, optimizes the transaction arrangement and packet efficiency of inter-chip interaction, simplifies path selection by cutting the internal hops of each routing node, and greatly reduces the probability of network deadlock. Compared with conventional schemes, the invention greatly reduces hardware overhead and markedly improves transmission efficiency. The method can solve the problem of efficient inter-chip interconnection when future storage and calculation integrated chip systems map complex neural network algorithms onto hardware, and it is markedly improved over conventional schemes in terms of low power consumption and generality. For route hopping, the method designs a maximum matching routing algorithm that is easy to implement in hardware; compared with a conventional polling scheme, link occupancy is improved by more than 25%. In the power-on reset process, parallelism is also improved and the transmission path is optimized. A more compact packet header is adopted in the packet design, and average performance is improved by 50% compared with the PCIe protocol. According to circuit synthesis, a single routing node of the invention consumes 0.5798 mW, occupies a total area of 8519.924 um², and reaches a maximum clock frequency of 600 MHz, exceeding the performance of conventional schemes.

Drawings

Fig. 1 is a schematic diagram of a router structure involved in the method of the present invention.

Fig. 2 is a schematic diagram of the route cutting involved in the method of the present invention.

Fig. 3 is a schematic diagram of the network initialization process involved in the method of the present invention.

FIG. 4 is a diagram illustrating feature configuration instructions and packet formats involved in the method of the present invention.

Fig. 5 is a schematic diagram of the route arbitration process involved in the method of the present invention.

Detailed Description

(1) addressing routing nodes in a network consisting of a low-power-consumption storage and calculation integrated system, wherein base address information comprises the abscissa and the ordinate of the routing nodes;

(2) a superior master device connected with the storage and calculation integrated chip sends a routing node base address configuration instruction to the routing node connected with the master device; the local instruction analysis unit on that routing node parses the instruction, assigns the corresponding routing node base address register according to it, and sends an enable signal to the central switch of the router; the structure of a routing node in the network is shown in Fig. 1;

(4) adjacent routing nodes receive the base address configuration instruction through the corresponding receiving buffers and pass it to the link instruction analysis unit associated with each receiving buffer; after the link instruction analysis unit obtains the enable signal, it assigns the corresponding routing node base address register according to the base address configuration instruction and then checks the assignment; if the assignment fails, a base address register assignment failure message is generated, and a base address configuration failure feedback instruction is generated according to the source base address value in the base address configuration instruction; the link instruction analysis unit selects the sending buffer corresponding to the current input interface and sends the base address configuration failure feedback instruction to that sending buffer (for example, if the instruction was acquired through the x forward receiving buffer, the failure feedback instruction selects the x forward sending buffer); if the assignment succeeds, the position of the current routing node is further determined from the network scale field in the base address configuration instruction; if the current routing node is at the edge of the network, a base address configuration success feedback instruction is generated and sent to the corresponding sending buffer according to the same rule as for the failure feedback instruction; if the current routing node is not at the edge of the network, the base address configuration instruction is processed according to the packet header I/R value it carries and then sent to the corresponding sending buffers, the rule being as follows: if the base-address-to-be-configured field of the current base address configuration instruction differs from the network scale field of the instruction, the link instruction analysis unit selects two of the four sending buffers of the routing node and issues a four-bit sending request to the central switch; after receiving the request, the central switch processes the base address configuration instruction according to the sending buffer selection rule and sends the processed instruction to the corresponding sending buffers; one example of the sending buffer selection rule is: the base-address-to-be-configured fields are 7'd2 and 7'd2, the network scale is 7'd8 and 7'd8, the receiving buffer is the x forward one, and the I/R value is 0; the sending request generated by the instruction analysis unit then selects the x positive and y negative directions, following a type I cut as shown in Fig. 2; after the central switch receives the sending request, it changes the base-address-to-be-configured fields of the base address configuration instruction to 7'd3 and 7'd2 and sends the modified instruction to the x forward sending buffer, and changes the fields to 7'd2 and 7'd1 and sends the modified instruction to the y negative sending buffer; if a base-address-to-be-configured field of the current base address configuration instruction equals the x or y coordinate of the network scale field, the instruction has reached the boundary, and only the other direction is selected for configuration;

(5) traversing all routing nodes in the network, repeating the step (4), as shown in fig. 3, if a routing node connected with the master device receives a base address configuration failure feedback instruction sent by other routing nodes, the link instruction parsing unit regenerates an appointed base address configuration instruction according to a source base address field in the base address configuration failure feedback instruction; the link instruction analysis unit generates a four-bit sending buffer area selection request according to the regenerated specified base address configuration instruction and sends the request to the central arbiter;

(7) when a particular network is deployed to a storage and computation integrated system, communication is also initiated by the master device, the superior master device connected with the integrated chip sends a feature configuration instruction (as shown in fig. 4) to the routing node, the instruction analysis unit of the routing node packages the feature configuration instruction according to the package format shown in fig. 4 after receiving the feature configuration instruction, and compares the destination base address field of the generated packet header with the base address register of the current routing node, generating a four-bit transmission buffer selection request according to the comparison result, selecting one from the four-bit transmission requests to output an arbitration result after the central arbiter receives the transmission request, and transmitting the arbitration result to the central switch, the central switch selects the corresponding transmission buffer according to the arbitration result of the central arbiter, sending the packaged feature configuration instruction to the adjacent corresponding routing node through the sending buffer area;

(9) the storage and computation integrated chip returns a computation result to the main equipment in the network through the destination router;

(10) all routing nodes in the network repeat steps (2) to (9), thereby realizing multi-chip connection between the network master device and the routing nodes in the low-power-consumption storage and calculation integrated system.
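The worked example in step (4) can be reproduced with a short Python sketch. It only illustrates how the base-address-to-be-configured field is rewritten for each forwarded copy of the instruction. The direction names "x+", "x-", "y+", "y-", the coordinate steps, and the function name forward_base_config are assumptions made for illustration; the source does not spell out the boundary handling or the outlets used by other inlets, so neither is modeled here.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical direction names and coordinate steps. The source only names the
# four sending buffers (x positive, x negative, y positive, y negative).
STEP: Dict[str, Tuple[int, int]] = {
    "x+": (1, 0),
    "x-": (-1, 0),
    "y+": (0, 1),
    "y-": (0, -1),
}


def forward_base_config(
    base: Tuple[int, int], directions: Iterable[str]
) -> Dict[str, Tuple[int, int]]:
    """Return the base address written into the copy sent in each direction."""
    x, y = base
    return {d: (x + STEP[d][0], y + STEP[d][1]) for d in directions}


# Example from step (4): field (2, 2), request selects x positive and y negative.
forwarded = forward_base_config((2, 2), ["x+", "y-"])
print(forwarded)  # {'x+': (3, 2), 'y-': (2, 1)}
```

The result matches the values in the text: 7'd3 and 7'd2 for the x forward sending buffer, and 7'd2 and 7'd1 for the y negative sending buffer.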

In step (6) of the above multi-chip connection method, as shown in fig. 5, the route arbitration method includes the following steps:

(1) the five 4-dimensional request vectors issued by the 5 instruction analysis units toward the 4 link sending buffers of the routing node are combined into a 5 x 4 input matrix;

(2) summing each column of the input matrix, and selecting the column with the smallest non-zero sum as the input of the next arbitration;

(3) summing each row in the input matrix;

(4) within the column selected in step (2), the request whose row sum is smallest passes arbitration; all other requests in the arbitrated column are set to zero, and the other requests in the row of the granted request are automatically set to zero;

(5) steps (2), (3) and (4) are repeated over the columns of the 5 x 4 input matrix until every column has been arbitrated or only all-zero columns remain, at which point arbitration ends.
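The five arbitration steps above can be sketched in Python as follows. This is a minimal software model under stated assumptions, not the hardware circuit: the function name arbitrate and the tie-breaking choice (lowest index wins when column sums or row sums are equal) are assumptions, since the source does not say how ties are resolved. Rows stand for the 5 instruction analysis units and columns for the 4 link sending buffers.

```python
from typing import List, Optional


def arbitrate(requests: List[List[int]]) -> List[Optional[int]]:
    """For each output column, return the granted input row, or None."""
    m = [row[:] for row in requests]  # work on a copy of the input matrix
    n_rows, n_cols = len(m), len(m[0])
    grants: List[Optional[int]] = [None] * n_cols

    while True:
        # Step (2): column sums, keep only non-zero columns.
        col_sums = [sum(m[r][c] for r in range(n_rows)) for c in range(n_cols)]
        candidates = [c for c in range(n_cols) if col_sums[c] > 0]
        if not candidates:
            break  # step (5): only all-zero columns remain
        col = min(candidates, key=lambda c: col_sums[c])  # tie: lowest index (assumed)

        # Step (3): row sums.
        row_sums = [sum(m[r]) for r in range(n_rows)]

        # Step (4): in the chosen column, grant the requester with the fewest requests.
        requesters = [r for r in range(n_rows) if m[r][col]]
        row = min(requesters, key=lambda r: row_sums[r])  # tie: lowest index (assumed)
        grants[col] = row

        # Clear the arbitrated column and the other requests of the granted row.
        for r in range(n_rows):
            m[r][col] = 0
        for c in range(n_cols):
            m[row][c] = 0

    return grants


example = [
    [1, 1, 0, 0],  # input 0 requests outputs 0 and 1
    [1, 0, 0, 0],  # input 1 requests output 0 only
    [0, 1, 1, 0],  # input 2 requests outputs 1 and 2
    [0, 0, 1, 0],  # input 3 requests output 2 only
    [0, 0, 0, 0],  # input 4 is idle
]
print(arbitrate(example))  # [1, 0, 2, None]
```

In this example input 1, which has a single request, wins output 0 over input 0, which still gets output 1; three of the four links are used in one arbitration round.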

The working principle and the working process of the method of the invention are described in detail in the following with the accompanying drawings:

The method of the present invention involves the data link layer as well as the physical layer. The organization of the interconnection network is described first. The interconnection protocol adopts a two-dimensional mesh network structure. Compared with the tree structure of PCIe (Peripheral Component Interconnect Express), hops in a mesh network are flexible, which avoids the failure mode of a tree network in which a communication blockage at the root node disables all of its child nodes. Each node in the mesh network is a network routing node, which acts as the agent through which all communication tasks on the network reach the storage and calculation integrated chip. Each router has 8 interfaces (4 in, 4 out) for accessing the network and two interfaces (1 in, 1 out) for connecting the storage and calculation integrated chip, as shown in Fig. 1. The data link layer is mainly responsible for taking the instructions issued by the attached chip, packaging them in the standard data link layer format, and passing them to other routers through the physical layer; it also decides from the destination whether a packet arriving from the physical layer needs to be unpacked, and if so unpacks it according to the standard and passes it to the chip. The physical layer is responsible for data transmission, and the 8 interfaces of the routing network are supported by 8 mutually independent high-speed differential serial interfaces in the physical layer.

Protocol packet format:

The biggest difference between a storage and calculation integrated chip and other AI accelerator chips is that features and weights are deployed separately, so the chip does not need instructions that carry two operands. Moreover, because neural networks vary widely in scale and the dimension of the results passed between layers changes flexibly, the instruction length of the chip is preferably variable. The data link layer of the multi-chip protocol mainly handles two transactions: routing base address configuration and instruction transfer. All messages are sent by the master device through its agent router, and in a scenario with multiple master devices only one master device is allowed to initiate base address configuration requests. In the normal working state the layer is mainly responsible for instruction transfer, and a source address field is added so that inter-chip feedback results can be received. The instruction packet format and the header format are shown in Fig. 4. B/I is the transaction flag: 0 indicates a base address configuration transaction, and 1 indicates an instruction transfer transaction. R/A is the configuration direction flag: 0 represents a configuration request, and 1 represents a configuration response. M/S is the propagation mode flag: 0 represents broadcast, and 1 represents unicast. S/N is the configuration success flag: 0 represents failure, and 1 represents success. I/R is the route link division mode flag: 0 represents I-type division, and 1 represents R-type division; this flag appears only in base address configuration instructions and marks the route propagation direction of the different regions in Fig. 3. T/N is the tail flag: 0 indicates no tail, and 1 indicates a tail.
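To make the six header flags concrete, the following Python sketch models them as a small record that can be packed into an integer and unpacked again. The flag meanings come from the paragraph above; the class name HeaderFlags, the field names, and in particular the bit order used by pack and unpack are hypothetical, because the source text does not give the bit positions of the header.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeaderFlags:
    """Header flag bits; the packing order below is an assumption."""

    b_i: int  # 0 = base address configuration, 1 = instruction transfer
    r_a: int  # 0 = configuration request, 1 = configuration response
    m_s: int  # 0 = broadcast, 1 = unicast
    s_n: int  # 0 = failure, 1 = success
    i_r: int  # 0 = I-type division, 1 = R-type division
    t_n: int  # 0 = no tail, 1 = tail

    def pack(self) -> int:
        """Pack the flags into 6 bits, b_i as the most significant bit."""
        value = 0
        for bit in (self.b_i, self.r_a, self.m_s, self.s_n, self.i_r, self.t_n):
            value = (value << 1) | (bit & 1)
        return value

    @classmethod
    def unpack(cls, value: int) -> "HeaderFlags":
        """Inverse of pack for a 6-bit value."""
        bits = [(value >> shift) & 1 for shift in range(5, -1, -1)]
        return cls(*bits)


# A broadcast base address configuration request using R-type division.
flags = HeaderFlags(b_i=0, r_a=0, m_s=0, s_n=0, i_r=1, t_n=0)
assert HeaderFlags.unpack(flags.pack()) == flags
```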

Each router in the two-dimensional mesh network has interfaces in four directions, and each direction is divided into a sending part and a receiving part. Routing is therefore complex: each inlet can correspond to three forwarding outlets of the router, and can also deliver into the chip through the internal interface. Such complicated hopping is very unfavorable for a routing network: the excess of options greatly increases the energy consumption of the routing algorithm during route selection, and overly flexible hopping brings a risk of deadlock. The tight coupling among the input and output interfaces is therefore cut back, without affecting link coverage, to obtain better link performance. Considering data transmission over the network as a whole, it is easy to see that all routers on the network remain interconnected as long as signals can propagate along the diagonals of the network. The original fully associative relation among the interfaces is therefore split into two connected parts. After the division, data entering each inlet is coupled to only two outlets of the router, so hopping remains flexible while path completeness is guaranteed, and in the two-dimensional mesh structure, provided every router is unobstructed, hopping along the cut routes still yields a shortest path.
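As a rough illustration of how a router could turn a destination base address into the four-bit sending buffer selection request of step (8), consider the sketch below. It assumes a request bit order of (x positive, x negative, y positive, y negative) and assumes that a bit is set for every direction that moves the packet closer to the destination; the source states only that the request contains one or two 1s, so both choices are assumptions, and the exact I-type and R-type outlet pairs of Fig. 2 are not modeled.

```python
from typing import Optional, Tuple

Coord = Tuple[int, int]


def send_request(current: Coord, dest: Coord) -> Optional[Tuple[int, int, int, int]]:
    """Return None for a local packet, else a 4-bit request (x+, x-, y+, y-)."""
    cx, cy = current
    dx, dy = dest
    if (cx, cy) == (dx, dy):
        return None  # destination equals the base address register: deliver locally
    # One bit per direction that reduces the remaining distance (assumed rule).
    return (int(dx > cx), int(dx < cx), int(dy > cy), int(dy < cy))


print(send_request((2, 2), (5, 1)))  # (1, 0, 0, 1): two candidate outlets
print(send_request((2, 2), (2, 7)))  # (0, 0, 1, 0): one candidate outlet
print(send_request((2, 2), (2, 2)))  # None: local instruction
```

Because at most one x direction and one y direction can reduce the distance, the request never contains more than two 1s, which agrees with the two-outlet coupling described above.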

The last problem of network transmission is the assignment of base addresses. In the two-dimensional mesh network, the base address assignment task is initiated by the master device and distributed to the whole network through the master device's agent router, with the base address configuration packets spread by broadcast. The master device's router sends instruction packets to its 4 adjacent routers, and routers in the diagonal regions each spread the packets to two adjacent routers according to the route division, until the network is completely covered. During the boot process, routers in the color-filled regions broadcast the base address assignment instruction according to the R-type division, and routers in the unfilled regions forward it according to the I-type division, so that routers filled with the same color in the network complete base address configuration in the same time slot.

The arbiter in the router is a special 5-input, 4-output arbiter. First, each of the 5 inputs may raise more than one arbitration request, as determined by the route division; second, the arbitration result determines the routing hop, and this deterministic routing has low hardware overhead and high transmission efficiency. The invention therefore defines a maximum allocation arbitration strategy that is easy to implement in hardware. As shown in Fig. 5, the input matrix is summed column by column, that is, the number of requests on each path is counted, and the path with the fewest requests is arbitrated first. Arbitration within a column follows a weight ranking rule: the weight of each input is the sum of all requests in the row of that input, and the request with the smallest weight is served first. When a request from an input is granted, the other requests of that input are automatically masked (see the lines marked in red). The sum of the arbitrated path is also zeroed, indicating that arbitration of that path has ended. The essence of the algorithm is that, to maximize link occupancy, inputs with more requests are given the lowest response priority, because they are still more likely to be served when they conflict with inputs that have fewer requests.

Because the control side of the central switch in the router is driven by the arbitration result, the central arbiter must respect link occupancy during instruction arbitration. The basic approach is to introduce a new channel occupation signal, occupy. After the requesting device receives the arbitration grant, it pulls occupy high to tell the arbiter to hold the arbitration result, which keeps a given channel of the central switch continuously occupied; once instruction transmission is complete, the device promptly pulls occupy low to return the channel to the arbiter, and the next arbitration round begins. Arbitration should also take the state of the link into account: when the receiving buffer is close to saturation and the channel is about to stall, the requesting device cannot complete the data transfer even if the arbiter grants its request. For this case a channel status signal rdy can be brought out from the buffer, and the arbiter grants requests on a link only when rdy is high.
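The interplay of occupy and rdy can be pictured with a small cycle-by-cycle Python model of a single output channel. This is a behavioral sketch under assumptions, not the circuit of the invention: the class name ChannelArbiter, the step interface, the choice of the first listed requester as the winner (standing in for the real arbitration), and the assumption that the granted device raises occupy in the cycle after the grant are all hypothetical.

```python
from typing import List, Optional


class ChannelArbiter:
    """One output channel whose grant is held by occupy and gated by rdy."""

    def __init__(self) -> None:
        self.owner: Optional[int] = None  # requester currently holding the channel

    def step(self, requests: List[int], occupy: bool, rdy: bool) -> Optional[int]:
        """Advance one cycle and return the current channel owner, if any."""
        if self.owner is not None:
            if occupy:
                return self.owner  # grant is held while occupy is high
            self.owner = None  # occupy pulled low: channel returned to the arbiter
        if rdy and requests:
            # Placeholder winner selection; the real arbiter uses the matrix rule.
            self.owner = requests[0]
        return self.owner


arb = ChannelArbiter()
print(arb.step([2, 0], occupy=False, rdy=False))  # None: buffer not ready
print(arb.step([2, 0], occupy=False, rdy=True))   # 2: request granted
print(arb.step([0], occupy=True, rdy=True))       # 2: channel held
print(arb.step([0], occupy=False, rdy=True))      # 0: released, next round
```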

Waveform verification was carried out over all 2²⁰ = 1048576 input combinations of the routing ports. Comparing the results of conventional polling arbitration with those of the longitudinal (column-wise) unidirectional optimization shows that the arbitration scheme provided by the invention improves on the conventional scheme by more than 25%, a significant optimization. A single router consumes 0.5789 mW, runs at a clock frequency of 600 MHz, and occupies an area of 8520 um², surpassing SPI and PCIe and meeting the design target.

Claims

1. A multi-chip connection method of a low-power-consumption storage and calculation integrated system is characterized by comprising the following steps:

2. The method of claim 1, wherein the route arbitration method in step (6) comprises the steps of:

(3) summing each row in the input matrix;