CN113220625A - Design method of calculation amount proving system based on stream processing chip

Info

Publication number: CN113220625A
Application number: CN202110514566.6A
Authority: CN
Inventors: 余光辉
Original assignee: Beijing Huizhichengxin Technology Co ltd
Current assignee: Beijing Huizhichengxin Technology Co ltd
Priority date: 2021-05-12
Filing date: 2021-05-12
Publication date: 2021-08-06
Anticipated expiration: 2041-05-12
Also published as: CN113220625B

Abstract

The invention discloses a design method of a calculation amount proving (proof-of-work) system based on a streaming processing chip, belonging to the technical field of computer processing. The design method defines four packet types exchanged between the main controller and the chips: write packet, read req packet, read data send back packet, and error packet. The beneficial effects of the invention are as follows: the method adopts a full-system architecture built on streaming processing chips, uses the command and message formats between the chips and the main controller, and runs the Ethash algorithm across the whole system, thereby solving the problem of fast Ethereum proof-of-work computation. Bits [183:176], [175:168] and [167:160] of the data packet carry ECC64 check bits that protect { [191:184], [159:112] }, [111:56] and [55:0] respectively, and all bits in the packet are interleaved so that any 3 consecutive bits belong to different ECC groups, which makes the packet tolerant of burst errors up to 3 bits long.

Description

Design method of calculation amount proving system based on stream processing chip

The technical field is as follows:

The invention belongs to the technical field of computer processing, and particularly relates to a design method of a calculation amount proving (proof-of-work) system based on a streaming processing chip.

Background art:

Proof of work (PoW) is the underlying consensus mechanism adopted by mainstream blockchains such as Ethereum. It requires a large number of hash operations and consumes a large amount of memory bandwidth to find a hash value that meets a specific difficulty condition. Because the Ethereum proof-of-work algorithm must frequently access a data set larger than 1GB (the data set grows approximately linearly with time: it exceeds 4GB in 2021 and is expected to exceed 6GB in 2024), it places high demands on the memory capacity and memory access bandwidth of the system.

To solve the problem of fast Ethereum proof-of-work computation, the most common current solution is to use high-performance GPU graphics cards from NVIDIA and AMD. GPUs offer high computing capability and high memory bandwidth, so they achieve better cost-performance and better performance per watt than CPUs and FPGAs when processing proof-of-work computation.

On the other hand, some chip design companies have customized GPU-like chips specifically for Ethereum proof of work. These chips have memory access bandwidth equivalent to that of a GPU and use customized computing modules, whose computing performance exceeds that of a GPU's computing units at greatly reduced power consumption.

GPUs and GPU-like ASIC chips share a common problem: off-chip memory bandwidth is low, which makes it difficult to improve chip performance.

In addition, Ethereum proof of work requires more than 4GB of memory, and with current chip design processes it is impossible to integrate 4GB of SRAM in a single chip. Further improvements in GPU and ASIC chip performance are therefore limited.

The invention content is as follows:

In order to solve the above problems and overcome the shortcomings of the prior art, the invention provides a design method of a calculation amount proving system based on a streaming processing chip, which effectively solves the problem of fast Ethereum proof-of-work computation.

The specific technical scheme for solving the technical problems comprises the following steps: the design method of the calculation amount certification system based on the streaming processing chip is characterized by comprising the following steps:

(1) write packet: the main controller issues write operations to the SRAM and REG in a chip; the data packet between the main controller and the chip is 192 bits wide; bits [191:190] = 'b01 (binary 01) indicate that the message is a write packet; bit [189] selects the destination: 0 means writing only to the chip identified by chip_id, and 1 means broadcasting to all chips; bit [188] is reserved for future address expansion; bits [187:184], 4 bits in total, hold the number of the target chip of the packet; bits [183:176], [175:168] and [167:160] carry ECC64 check bits that protect { [191:184], [159:112] }, [111:56] and [55:0] respectively, and all bits in the packet are interleaved so that any 3 consecutive bits belong to different ECC groups, tolerating a burst error of up to 3 bits; the address is located in bits [159:128], 32 bits in total; the data is located in bits [127:0];

(2) read req packet: the main controller issues a read request to the SRAM or REG in a chip in the following cases: after the SRAM has been updated, to check whether it was written correctly; and when the main controller needs to inspect the internal registers of the chip; bits [191:188] = 'b0001 (binary 0001) indicate that the message is a read packet; bits [187:184], 4 bits in total, hold the number of the target chip of the packet initiated by the main controller; the address is located in bits [159:128], 32 bits in total; the data is located in bits [127:0]; the data packet between the main controller and the chip is 192 bits wide;

(3) read data send back packet: the chip identified by chip_id returns internal SRAM or REG data to the main controller; bits [191:188] = 'b1001 (binary 1001) indicate that the message is a read data send back packet; the address and the data occupy the same positions as in the read packet; bits [187:184], 4 bits in total, hold the number of the chip that initiated the packet; the data packet between the main controller and the chip is 192 bits wide;

(4) error packet: the chip identified by chip_id sends a detected error to the main controller; the occurrence conditions are: inter-chip Serdes ECC error, illegal address, illegal packet type, FIFO underflow/overflow, and Serdes lane hang-up; bits [191:188] = 'b1011 (binary 1011) indicate that the message is an error packet; the address and the ECC occupy the same positions as in the write packet; bits [187:184], 4 bits in total, hold the number of the chip that initiated the packet; bits [127:116], 12 bits in total, are all 0; bits [115:112], 4 bits in total, hold the error type: 4'b0000 means an ECC error was corrected, 4'b0001 means ECC parity-corrected, 4'b0010 means an ECC error occurred but was not corrected, 4'b0100 means an illegal read address, 4'b1000 means an illegal write address, 4'b1100 means an illegal packet type, 4'b1110 means FIFO underflow/overflow, and 4'b1111 means lane hang-up; bits [111:96] hold the id of the lane on which the error occurred; the data packet between the main controller and the chip is 192 bits wide.
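For illustration only, the field layout described in steps (1) to (4) can be sketched in Python as follows. This is a minimal sketch and not part of the claimed method: the helper names (set_field, get_field, make_write_packet, make_read_request, packet_kind) are hypothetical, the ECC bits [183:160] are simply left at zero, and the bit interleaving is not modeled.

```python
# Field positions follow steps (1)-(4); all helper names are hypothetical.
TYPE_READ_REQ = 0b0001   # bits [191:188] of a read req packet
TYPE_READ_DATA = 0b1001  # bits [191:188] of a read data send back packet
TYPE_ERROR = 0b1011      # bits [191:188] of an error packet


def set_field(packet: int, hi: int, lo: int, value: int) -> int:
    """Return the packet with bits [hi:lo] replaced by value."""
    width = hi - lo + 1
    if value < 0 or value >> width:
        raise ValueError(f"value does not fit in bits [{hi}:{lo}]")
    mask = ((1 << width) - 1) << lo
    return (packet & ~mask) | (value << lo)


def get_field(packet: int, hi: int, lo: int) -> int:
    """Extract bits [hi:lo] of the packet."""
    return (packet >> lo) & ((1 << (hi - lo + 1)) - 1)


def make_write_packet(chip_id: int, address: int, data: int,
                      broadcast: bool = False) -> int:
    """Build a 192-bit write packet; ECC bits [183:160] are left at zero."""
    p = set_field(0, 191, 190, 0b01)            # write packet marker
    p = set_field(p, 189, 189, int(broadcast))  # 1 = broadcast to all chips
    p = set_field(p, 187, 184, chip_id)         # target chip number
    p = set_field(p, 159, 128, address)         # 32-bit address
    return set_field(p, 127, 0, data)           # 128-bit data (bit 188 stays 0)


def make_read_request(chip_id: int, address: int) -> int:
    """Build a 192-bit read req packet; ECC bits are left at zero."""
    p = set_field(0, 191, 188, TYPE_READ_REQ)
    p = set_field(p, 187, 184, chip_id)
    return set_field(p, 159, 128, address)


def packet_kind(packet: int) -> str:
    """Classify a packet by its leading bits."""
    if get_field(packet, 191, 190) == 0b01:
        return "write"
    kinds = {TYPE_READ_REQ: "read_req", TYPE_READ_DATA: "read_data",
             TYPE_ERROR: "error"}
    return kinds.get(get_field(packet, 191, 188), "illegal")
```

A Python integer is used as the 192-bit container because it has arbitrary precision; a hardware or C implementation would more likely use three 64-bit words.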

Furthermore, the calculation amount proving system based on the streaming processing chip adopts a system architecture in which a preset number of chips are interconnected: each chip uses its SRAM to store a part of the data set, the total storage capacity of the preset number of chips exceeds the required storage space, and the chips are connected to each other through high-speed serdes buses.

Further, the error occurrence in step (4) includes:

i: inter-chip Serdes ECC error; ii: illegal address; iii: illegal packet type; iv: FIFO underflow/overflow; v: Serdes lane hang-up.
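As a hedged illustration of how a main controller might interpret an error packet, the sketch below decodes the fields given in step (4): packet type in bits [191:188], chip number in [187:184], error type in [115:112] and lane id in [111:96]. The function name decode_error_packet and the returned dictionary are assumptions made for this example; the description strings restate the error-type codes listed in step (4).

```python
# Error-type codes from step (4); the descriptions restate the text.
ERROR_TYPES = {
    0b0000: "ECC error corrected",
    0b0001: "ECC parity-corrected",
    0b0010: "ECC error not corrected",
    0b0100: "illegal read address",
    0b1000: "illegal write address",
    0b1100: "illegal packet type",
    0b1110: "FIFO underflow/overflow",
    0b1111: "lane hang-up",
}


def decode_error_packet(packet: int) -> dict:
    """Decode a 192-bit error packet (hypothetical helper)."""
    if (packet >> 188) & 0xF != 0b1011:
        raise ValueError("not an error packet")
    code = (packet >> 112) & 0xF
    return {
        "chip_id": (packet >> 184) & 0xF,     # bits [187:184]
        "error": ERROR_TYPES.get(code, "unknown"),
        "lane_id": (packet >> 96) & 0xFFFF,   # bits [111:96]
    }
```

Codes that are not listed in step (4) are reported as "unknown" here rather than rejected, which is a design choice of the sketch only.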

The invention has the beneficial effects that: the invention adopts a full-system architecture built on streaming processing chips, uses the command and message formats between the chips and the main controller, and runs the Ethash algorithm across the whole system, thereby solving the problem of fast Ethereum proof-of-work computation.

The invention uses bits [183:176], [175:168] and [167:160] of the data packet for ECC64 checks that protect { [191:184], [159:112] }, [111:56] and [55:0] respectively, and interleaves all bits in the packet so that any 3 consecutive bits belong to different ECC groups, which achieves tolerance of burst errors up to 3 bits long.
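The interleaving property can be checked with a short Python sketch. The text does not give the actual bit ordering, so the round-robin order used here (transmitted position i carries the (i // 3)-th bit of group i % 3) is an assumption, as are the function names. Each ECC group is taken to be 64 bits: 56 protected bits plus its 8 check bits. The sketch only verifies group membership; it does not implement the ECC64 code itself.

```python
GROUPS = 3
GROUP_BITS = 64  # 56 protected bits + 8 check bits per ECC group


def group_bit_positions():
    """Packet bit indices belonging to each ECC group, per the text."""
    g0 = list(range(184, 192)) + list(range(112, 160)) + list(range(176, 184))
    g1 = list(range(56, 112)) + list(range(168, 176))
    g2 = list(range(0, 56)) + list(range(160, 168))
    return [g0, g1, g2]


def interleave_order():
    """Assumed transmit order: position i carries bit i // 3 of group i % 3."""
    groups = group_bit_positions()
    return [groups[i % GROUPS][i // GROUPS]
            for i in range(GROUPS * GROUP_BITS)]


def groups_hit_by_burst(start, length=3):
    """ECC groups touched by `length` consecutive transmitted positions."""
    owner = {}
    for g, bits in enumerate(group_bit_positions()):
        for b in bits:
            owner[b] = g
    order = interleave_order()
    end = min(start + length, len(order))
    return [owner[order[i]] for i in range(start, end)]
```

Under this ordering, any burst of 3 consecutive positions touches each group exactly once, so a group never sees more than one flipped bit from such a burst. If each group can correct a single-bit error (an assumption about ECC64 that the text does not state explicitly), this is consistent with the stated tolerance of burst errors up to 3 bits; a 4-bit burst would hit one group twice.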

The specific implementation mode is as follows:

In the description of the invention, specific details are given only to enable a full understanding of the embodiments of the invention, and those skilled in the art should understand that the invention can be implemented without being limited to these details. In other instances, well-known structures and functions have not been described or shown in detail to avoid obscuring the points of the embodiments of the invention. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific case.

The specific implementation mode of the invention is as follows:

the working principle is as follows:

(1) write packet: the main controller issues write operations to the SRAM and REG in a chip; the address is located in bits [159:128] of the data packet, 32 bits in total; bits [187:184], 4 bits in total, hold the number of the target chip of the packet; bits [183:176], [175:168] and [167:160] carry ECC64 check bits that protect { [191:184], [159:112] }, [111:56] and [55:0] respectively, and all bits in the packet are interleaved so that any 3 consecutive bits belong to different ECC groups, tolerating a burst error of up to 3 bits; the data packet between the main controller and the chip is 192 bits wide;

(2) read req packet: the main controller issues a read request to the SRAM or REG in a chip in the following cases: after the SRAM has been updated, to check whether it was written correctly; and when the main controller needs to inspect the internal registers of the chip; bits [187:184], 4 bits in total, hold the number of the target chip of the packet initiated by the FPGA; the data packet between the main controller and the chip is 192 bits wide;

(3) read data send back packet: the chip identified by chip_id returns internal SRAM or REG data to the main controller; the address is arranged as in the read packet; bits [187:184], 4 bits in total, hold the number of the chip that initiated the packet; the data packet between the main controller and the chip is 192 bits wide;

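(4) error packet: the chip identified by chip_id detects an error and sends it to the main controller; the occurrence conditions comprise: inter-chip Serdes ECC error, illegal address, illegal packet type, FIFO underflow/overflow, and Serdes lane hang-up.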

In the using process:

(1) To meet the requirement for more than 4GB of storage space, a system architecture of multiple interconnected chips is adopted. Each chip uses its SRAM to store a part of the data set, the total storage capacity of the chips exceeds the required storage space, and the chips are connected to one another through multiple high-speed serdes buses.

(2) The core of the Ethash algorithm is 64 rounds of computation and memory access. To improve computing efficiency, this loop is unrolled into a pipeline in which each pipeline stage processes 1 round of computation and memory access.

(3) On the streaming processing architecture formed by multiple chips, each chip is responsible for one or more stages of the 64-stage pipeline, and each chip also responds to memory access and computation requests from other chips and performs the corresponding operations.

(4) To reduce the design complexity of the full-system PCB, a fully interconnected multi-chip topology is abandoned in favor of a ring interconnection scheme. When chip 0 (chip numbers start from 0) needs to send data or commands to chip N (N > 1), the data and commands pass through chip 1, chip 2, and so on in series until they reach chip N. An intermediate chip that receives such data or commands forwards them directly to the next chip.
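The forwarding rule in item (4) can be sketched as follows. This is a minimal illustration under two assumptions that the text does not confirm: the ring is traversed in one direction only (increasing chip number), and the last chip wraps around to chip 0. The function name ring_path is hypothetical.

```python
def ring_path(src: int, dst: int, num_chips: int) -> list:
    """Chips visited, in order, when a packet travels from src to dst
    around an assumed one-way ring (chip k forwards to chip k + 1)."""
    if not (0 <= src < num_chips and 0 <= dst < num_chips):
        raise ValueError("chip number out of range")
    path = [src]
    while path[-1] != dst:
        # An intermediate chip forwards the packet to the next chip.
        path.append((path[-1] + 1) % num_chips)
    return path
```

For example, ring_path(0, 3, 8) visits chips 0, 1, 2 and 3, so the packet takes 3 serdes hops; the hop count grows with the distance around the ring, which is the latency cost traded for a simpler PCB.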

(5) The main controller is implemented by a desktop computer or an embedded SoC chip, and is connected to each chip through a low-speed or high-speed bus (including but not limited to IO interfaces such as I2C, SPI and Serdes) to form an out-of-band control network. The main controller is responsible for initializing all chips, starting and ending tasks, and error handling and recovery.

(6) Considering the number of pipeline stages of the algorithm, and in order to reduce the data latency caused by overly long communication links, the scheme covers only system architectures with the following numbers of chips: 8, 9, 10, 11, 12, 13, 16, 17, 32, 33.

In the specific implementation:

Assume that the system has 4 chips and that the SRAM capacity inside a single chip exceeds 1GB, so that the total storage capacity of the system exceeds 4GB.
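The capacity arithmetic above can be illustrated with a small sketch. The text does not say how the data set is divided among the chips, so the contiguous, equal-sized partition used by locate below is purely an assumption, and both function names are hypothetical.

```python
GIB = 1 << 30  # bytes in 1GB (binary)


def chips_needed(dataset_bytes: int, sram_bytes_per_chip: int) -> int:
    """Smallest number of chips whose combined SRAM holds the data set."""
    return -(-dataset_bytes // sram_bytes_per_chip)  # ceiling division


def locate(byte_address: int, sram_bytes_per_chip: int) -> tuple:
    """Map a data set byte address to (chip number, offset inside that chip),
    assuming each chip holds one contiguous slice of the data set."""
    return divmod(byte_address, sram_bytes_per_chip)
```

With exactly 1GB per chip, a 4GB data set needs 4 chips and any growth beyond 4GB needs a fifth, which is why the per-chip capacity in this embodiment must exceed 1GB.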

The 64-stage computation is divided into two 32-stage pipeline passes; the processing steps of one 32-stage pass are described next.

The first embodiment is as follows:

The first step: chip 0 processes the 1st pipeline stage and then sends the intermediate result to chip 1 through serdes.

The second step: chip 1 processes the 2nd pipeline stage and then sends the intermediate result to chip 2 through serdes.

The third step: processing continues in sequence; after 8 rounds, the 4 chips have computed 32 pipeline stages.

The fourth step: steps one to three are repeated; after a further 8 rounds, the 4 chips have computed all 64 pipeline stages and the whole algorithm is complete.

The second embodiment is as follows:

The first step: chip 0 processes the 1st to 8th pipeline stages, 8 stages in total, and then sends the intermediate result to chip 1 through serdes.

The second step: chip 1 processes the 9th to 16th pipeline stages, 8 stages in total, and then sends the intermediate result to chip 2 through serdes.

The third step: processing continues in sequence; after 1 round, the 4 chips have computed 32 pipeline stages.

The fourth step: steps one to three are repeated; after 1 further round, the 4 chips have computed all 64 pipeline stages and the whole algorithm is complete.

The first scheme has a simpler chip microarchitecture.

The second scheme has better memory access locality and higher bandwidth utilization, similar to the burst operation of a memory.

As with the burst operation of a conventional memory, the 8 pipeline stages handled by one chip require 8 accesses to internal storage, and several of these accesses are bound to hit the SRAM of the same chip. The request commands and the returned data for accesses to the same chip can therefore be packed together, which improves serdes bandwidth utilization. However, this design requires more internal buffers to store the intermediate results of multiple pipeline stages.
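To make the difference between the two embodiments concrete, the sketch below maps each pipeline stage of one 32-stage pass to a chip and counts how many times the intermediate result crosses a serdes link. The function names are hypothetical, and the count covers only the hand-off of intermediate results between consecutive stages; memory access requests sent to other chips are not modeled.

```python
def chip_for_stage_round_robin(stage: int, num_chips: int = 4) -> int:
    """Embodiment one: stage s (1-based) runs on chip (s - 1) mod num_chips."""
    return (stage - 1) % num_chips


def chip_for_stage_blocked(stage: int, num_chips: int = 4,
                           stages_per_chip: int = 8) -> int:
    """Embodiment two: each chip runs stages_per_chip consecutive stages."""
    return ((stage - 1) // stages_per_chip) % num_chips


def serdes_transfers(mapping, num_stages: int = 32) -> int:
    """Number of stage-to-stage hand-offs that move to a different chip."""
    chips = [mapping(s) for s in range(1, num_stages + 1)]
    return sum(1 for a, b in zip(chips, chips[1:]) if a != b)
```

Under these assumptions, serdes_transfers(chip_for_stage_round_robin) gives 31 hand-offs per 32-stage pass, while serdes_transfers(chip_for_stage_blocked) gives 3, at the cost of each chip buffering the state of 8 stages.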

In summary: (1) To meet the requirement for more than 4GB of storage space, a system architecture of multiple interconnected chips is adopted, and each chip uses its SRAM to store a part of the data set. The total storage capacity of the chips exceeds the required storage space. The chips are connected through multiple high-speed serdes buses.

(2) The core of the Ethash algorithm is 64 rounds of computation and memory access. To improve computing efficiency, this loop is unrolled into a pipeline in which each pipeline stage processes 1 round of computation and memory access.

The full-system architecture built on streaming processing chips is adopted, the command and message formats between the chips and the main controller are used, and the Ethash algorithm is run across the whole system, thereby solving the problem of fast Ethereum proof-of-work computation.

Claims

1. A design method of a computation amount certification system based on a streaming processing chip is characterized in that the design method of the certification system comprises the following steps:

2. The method according to claim 1, comprising a system architecture in which a predetermined number of chips are interconnected, wherein each chip uses its SRAM to store a portion of the data set, the total storage capacity of the predetermined number of chips exceeds the required storage space, and the chips are connected to each other via high-speed serdes buses.

3. The method for designing a computation volume certification system based on a streaming processing chip according to claim 1, wherein the error occurrence in step (4) includes: