CN113220625B

CN113220625B - Design method of calculation amount proving system based on stream processing chip

Info

Publication number: CN113220625B
Application number: CN202110514566.6A
Authority: CN
Inventors: 余光辉
Original assignee: Beijing Huizhichengxin Technology Co ltd
Current assignee: Beijing Huizhichengxin Technology Co ltd
Priority date: 2021-05-12
Filing date: 2021-05-12
Publication date: 2024-05-14
Anticipated expiration: 2041-05-12
Also published as: CN113220625A

Abstract

The invention discloses a design method for a proof-of-work (calculation amount proving) system based on a stream processing chip, belonging to the technical field of computer processing. The design method defines four packet types exchanged between the chips and the main controller: WRITE PACKET, READ REQ PACKET, READ DATA SEND BACK packet and Error packet. The beneficial effects of the invention are as follows: by adopting an overall system architecture of stream processing chips and using the command and message formats between the chips and the main controller, the whole system jointly executes the Ethash algorithm, which solves the problem of fast Ethereum proof-of-work computation; bits [183:176], [175:168] and [167:160] of the data packet carry ECC64 check bits that protect {[191:184], [159:112]}, [111:56] and [55:0] respectively, and all bits in the packet are interleaved so that any 3 consecutive bits belong to different ECC groups, so that burst errors of up to 3 bits are tolerated.

Description

Design method for a proof-of-work (calculation amount proving) system based on a stream processing chip

Technical field:

The invention belongs to the technical field of computer processing, and particularly relates to a design method of a calculation amount proving system based on a stream processing chip.

Background art:

Proof of Work (PoW) is the underlying consensus mechanism adopted by mainstream blockchains such as Ethereum. It requires a large number of hash operations and consumes a large amount of memory bandwidth to find hash values that satisfy a given difficulty condition. Because the Ethereum proof-of-work algorithm must frequently access a data set larger than 1 GB (the data set grows approximately linearly over time; the 2021 data set will exceed 4 GB, and the 2024 data set will exceed 6 GB), it places high demands on the memory capacity and access bandwidth of the system.

To solve the problem of fast Ethereum proof-of-work computation, most current solutions use high-performance GPU graphics cards from NVIDIA and AMD. These cards offer strong computing capability and high memory bandwidth, so they achieve better cost-performance and a better performance-to-power ratio than CPUs and FPGAs when computing proof of work.

On the other hand, some chip design companies have custom-built GPU-like chips specifically for Ethereum proof-of-work computation. These chips have memory access bandwidth comparable to that of a GPU and use customized computing modules, whose computing performance exceeds that of GPU computing units while consuming far less power.

Whether GPU or GPU-like ASIC chips, there is a common problem: the off-chip memory bandwidth is low, which makes it difficult to improve the performance of the chip.

In addition, the memory capacity required by the Ethereum proof of work exceeds 4 GB, and with current chip design processes it is not possible to integrate 4 GB of SRAM inside a single chip. This limits further improvement of GPU and ASIC chip performance.

Summary of the invention:

To solve the above problems and overcome the shortcomings of the prior art, the invention provides a design method for a proof-of-work (calculation amount proving) system based on a stream processing chip, which effectively solves the problem of fast Ethereum proof-of-work computation.

The specific technical scheme for solving the technical problems is as follows: the design method of the calculation amount proving system based on the stream processing chip is characterized by comprising the following steps:

(1) WRITE PACKET: a write operation, sent by the main controller, to the SRAM and registers (REG) inside the chip identified by chipid; bits [191:190] = 2'b01 (binary 01) indicate that the message is a WRITE PACKET; bit [189] selects the destination, where 0 writes only to the chip matching chip_id and 1 broadcasts to all chips; bit [188] is reserved and is available for future address expansion; bits [187:184] (4 bits) hold the chipid number of the target chip; bits [183:176], [175:168] and [167:160] carry ECC64 check bits that protect {[191:184], [159:112]}, [111:56] and [55:0] respectively, and all bits in the packet are interleaved so that any 3 consecutive bits belong to different ECC groups, tolerating burst errors of up to 3 bits; the address is located in bits [159:128] (32 bits); the data is located in bits [127:0]; the data packet exchanged between the main controller and the chip is 192 bits wide;

(2) READ REQ PACKET: a request from the main controller to read the SRAM or registers inside a chipid chip; it is used in the following cases: after the SRAM is updated, to check whether the SRAM was written correctly; and when the main controller needs to inspect the internal registers of the chip; bits [191:188] = 4'b0001 (binary 0001) indicate that the message is a READ PACKET; bits [187:184] (4 bits) hold the chipid number of the target chip of the packet initiated by the main controller; the address is located in bits [159:128] (32 bits); the data is located in bits [127:0]; the data packet exchanged between the main controller and the chip is 192 bits wide;

(3) READ DATA SEND BACK packet: the chipid chip returns its internal SRAM or REG data to the main controller; bits [191:188] = 4'b1001 (binary 1001) indicate that the message is a READ DATA SEND BACK packet; the address and data are located at the same positions as in the READ PACKET; bits [187:184] (4 bits) hold the chipid number of the chip that initiated the packet; the data packet exchanged between the main controller and the chip is 192 bits wide;

(4) Error packet: sent by a chipid chip to the main controller after the chip detects an error; the error conditions are: SERDES ECC error, illegal address, illegal packet type, FIFO underflow/overflow, and SERDES lane hang; bits [191:188] = 4'b1011 (binary 1011) indicate that the message is an Error packet; the address and ECC fields are located at the same positions as in the WRITE PACKET; bits [187:184] (4 bits) hold the chipid number of the chip that initiated the packet; bits [127:116] (12 bits) are all 0; bits [115:112] (4 bits) encode the error type: 4'b0000 means the ECC corrected an error, 4'b0001 means an ECC parity error, 4'b0010 means the ECC detected an error that could not be corrected, 4'b0100 means an illegal read address, 4'b1000 means an illegal write address, 4'b1100 means an illegal packet type, 4'b1110 means FIFO underflow/overflow, and 4'b1111 means a lane is blocked (hung); bits [111:96] give the id of the lane on which the error occurred; the data packet exchanged between the main controller and the chip is 192 bits wide.
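The packet layouts in steps (1) to (4) can be sketched in Python as bit-field operations on a 192-bit integer. This is only an illustration under stated assumptions, not the patented implementation: the helper names (`set_field`, `get_field`, `make_write_packet`, `make_read_request`, `packet_type`) are hypothetical, and the three ECC bytes are left at zero because the text does not specify which ECC code is used.

```python
PACKET_BITS = 192  # packet width stated in the text


def set_field(packet: int, hi: int, lo: int, value: int) -> int:
    """Return the packet with bits [hi:lo] replaced by value."""
    width = hi - lo + 1
    if not 0 <= value < (1 << width):
        raise ValueError(f"value does not fit in {width} bits")
    mask = ((1 << width) - 1) << lo
    return (packet & ~mask) | (value << lo)


def get_field(packet: int, hi: int, lo: int) -> int:
    """Extract bits [hi:lo] of the packet."""
    return (packet >> lo) & ((1 << (hi - lo + 1)) - 1)


def make_write_packet(chip_id: int, address: int, data: int,
                      broadcast: bool = False) -> int:
    packet = 0
    packet = set_field(packet, 191, 190, 0b01)            # WRITE PACKET type
    packet = set_field(packet, 189, 189, int(broadcast))  # 1 = all chips
    packet = set_field(packet, 187, 184, chip_id)         # target chipid
    packet = set_field(packet, 159, 128, address)         # 32-bit address
    packet = set_field(packet, 127, 0, data)              # 128-bit data
    return packet  # ECC bytes [183:160] stay 0: ECC code is not specified


def make_read_request(chip_id: int, address: int) -> int:
    packet = 0
    packet = set_field(packet, 191, 188, 0b0001)          # READ PACKET type
    packet = set_field(packet, 187, 184, chip_id)
    packet = set_field(packet, 159, 128, address)
    return packet


def packet_type(packet: int) -> str:
    """Classify a packet by its top bits, as listed in steps (1) to (4)."""
    if get_field(packet, 191, 190) == 0b01:
        return "WRITE"
    names = {0b0001: "READ_REQ", 0b1001: "READ_DATA", 0b1011: "ERROR"}
    return names.get(get_field(packet, 191, 188), "UNKNOWN")
```

For example, `make_write_packet(chip_id=3, address=0x1000, data=0xDEADBEEF)` yields an integer whose bits [187:184] read back as 3 through `get_field`, and `packet_type` classifies it as `"WRITE"`.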

Furthermore, the proof-of-work system based on stream processing chips adopts a system architecture in which a preset number of chips are interconnected; each chip stores a part of the data set in its internal SRAM, the total storage capacity of the preset number of chips exceeds the required storage space, and the chips are connected through high-speed SERDES buses.

Further, the error conditions in step (4) include:

i: inter-chip SERDES ECC error; ii: illegal address; iii: illegal packet type; iv: FIFO underflow/overflow; v: SERDES lane hang.
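As an illustration of how a main controller might interpret an Error packet, the following Python sketch decodes the fields described in step (4): the type bits [191:188], the chipid in [187:184], the error type in [115:112] and the lane id in [111:96]. The function name `decode_error_packet` and the wording of the error labels are assumptions made for this sketch; only the bit positions and code values come from the text.

```python
# Error type codes from bits [115:112] of the Error packet.
ERROR_TYPES = {
    0b0000: "ECC corrected error",
    0b0001: "ECC parity error",
    0b0010: "ECC uncorrectable error",
    0b0100: "illegal read address",
    0b1000: "illegal write address",
    0b1100: "illegal packet type",
    0b1110: "FIFO underflow/overflow",
    0b1111: "lane blocked (hung)",
}


def decode_error_packet(packet: int) -> dict:
    """Decode a 192-bit Error packet given as a Python integer."""
    if ((packet >> 188) & 0xF) != 0b1011:
        raise ValueError("not an Error packet")
    code = (packet >> 112) & 0xF
    return {
        "chip_id": (packet >> 184) & 0xF,           # chip that raised the error
        "error": ERROR_TYPES.get(code, "unknown"),  # codes not listed in the text
        "lane_id": (packet >> 96) & 0xFFFF,         # bits [111:96]
    }
```

Codes that the text does not list, such as 4'b0011, are reported as `"unknown"` rather than guessed.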

The beneficial effects of the invention are as follows: the invention adopts an overall system architecture of stream processing chips and uses the command and message formats between the chips and the main controller, so that the whole system jointly executes the Ethash algorithm, which solves the problem of fast Ethereum proof-of-work computation.

In the invention, bits [183:176], [175:168] and [167:160] of the data packet carry ECC64 check bits that protect {[191:184], [159:112]}, [111:56] and [55:0] respectively. Each protected group contains 56 bits, which together with its 8 check bits forms a 64-bit ECC codeword. All bits in the packet are interleaved so that any 3 consecutive bits belong to different ECC groups, so that burst errors of up to 3 bits are tolerated.
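The burst-tolerance property can be checked with a short Python sketch. It assumes a simple round-robin interleave in which transmitted bit position p carries a bit of ECC group p mod 3; the text only states the property (any 3 consecutive bits belong to different groups), so this particular mapping, the function names and the sample codeword values are assumptions. The ECC code itself is not implemented here.

```python
GROUPS = 3          # three ECC groups per packet
CODEWORD_BITS = 64  # 56 protected bits + 8 check bits per group


def interleave(codewords):
    """Place bit i of group g at wire position i * GROUPS + g."""
    wire = 0
    for g, cw in enumerate(codewords):
        for i in range(CODEWORD_BITS):
            wire |= ((cw >> i) & 1) << (i * GROUPS + g)
    return wire


def deinterleave(wire):
    """Inverse of interleave: rebuild the three 64-bit codewords."""
    codewords = [0] * GROUPS
    for pos in range(GROUPS * CODEWORD_BITS):
        codewords[pos % GROUPS] |= ((wire >> pos) & 1) << (pos // GROUPS)
    return codewords


def flipped_bits_per_group(start, burst_len=3):
    """Flip burst_len consecutive wire bits and count errors per group."""
    original = [0x0123456789ABCDEF, 0xFEDCBA9876543210, 0x0F0F0F0F0F0F0F0F]
    burst_mask = ((1 << burst_len) - 1) << start
    received = deinterleave(interleave(original) ^ burst_mask)
    return [bin(a ^ b).count("1") for a, b in zip(original, received)]
```

With this mapping, a 3-bit burst at any position corrupts exactly one bit in each codeword, so an ECC that corrects single-bit errors per codeword would be able to repair the whole burst.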

Detailed description:

Specific details are set forth in the description in order to provide a thorough understanding of embodiments of the invention; however, it will be apparent to those skilled in the art that the invention is not limited to these details. In other instances, well-known structures and functions have not been shown or described in detail to avoid obscuring aspects of embodiments of the invention. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.

Specific embodiments of the invention:

The working principle is as follows:

(1) WRITE PACKET: a write operation, sent by the main controller, to the SRAM and REG inside a chipid chip; the address is located in bits [159:128] of the data packet (32 bits); bits [187:184] (4 bits) hold the chipid number of the target chip; bits [183:176], [175:168] and [167:160] carry ECC64 check bits that protect {[191:184], [159:112]}, [111:56] and [55:0] respectively, and all bits in the packet are interleaved so that any 3 consecutive bits belong to different ECC groups, tolerating burst errors of up to 3 bits; the data packet exchanged between the main controller and the chip is 192 bits wide;

(2) READ REQ PACKET: a request from the main controller to read the SRAM or registers inside a chipid chip; it is used in the following cases: after the SRAM is updated, to check whether the SRAM was written correctly; and when the main controller needs to inspect the internal registers of the chip; bits [187:184] (4 bits) hold the chipid number of the target chip of the packet initiated by the FPGA; the data packet exchanged between the main controller and the chip is 192 bits wide;

(3) READ DATA SEND BACK packet: the chipid chip returns its internal SRAM or REG data to the main controller; the address is arranged in the same way as in the READ PACKET; bits [187:184] (4 bits) hold the chipid number of the chip that initiated the packet; the data packet exchanged between the main controller and the chip is 192 bits wide;

(4) Error packet: after detecting an error, the chipid chip sends the error to the main controller; the error conditions comprise: inter-chip SERDES ECC error, illegal address, illegal packet type, FIFO underflow/overflow, and SERDES lane hang.

In use:

(1) To meet the requirement for more than 4 GB of storage space, a system architecture of multiple interconnected chips is adopted. Each chip stores a part of the data set in its internal SRAM, the total storage capacity of the chips exceeds the required storage space, and the chips are connected through multiple high-speed SERDES buses.

(2) The core computing unit of the Ethash algorithm is a 64-round loop of computation and memory access operations. To improve computing efficiency, the loop is split into pipeline operations, and each pipeline stage processes one round of computation and memory access.

(3) On the multi-chip stream processing architecture, each chip is responsible for processing one or more stages of the 64-stage pipeline; at the same time, each chip is also responsible for responding to memory access and computation requests from other chips and performing the corresponding operations.

(4) To reduce the design complexity of the system PCB, the fully interconnected multi-chip topology is abandoned in favor of a ring interconnection scheme. When chip 0 (chip numbers start from 0) needs to send data or commands to chip N (N > 1), the data and commands pass serially through chip 1, chip 2, and so on, until they reach chip N. Each intermediate chip that receives such data and commands forwards them directly to the next chip.

(5) The main controller is implemented by a desktop or embedded SoC chip and is connected to each chip through a low-speed or high-speed bus (including but not limited to I2C, SPI, SERDES and other IO interfaces) to form an out-of-band control network. The main controller is responsible for initializing all chips, starting and ending tasks, and handling and recovering from errors.

(6) Considering the number of algorithm pipeline stages, and in order to reduce the data latency overhead caused by overly long communication links, the scheme only includes system configurations with the following numbers of chips: 8, 9, 10, 11, 12, 13, 16, 17, 32, 33.
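The ring forwarding described in item (4) can be sketched as follows. The sketch assumes a unidirectional ring in which every chip forwards to the chip with the next higher number and the last chip wraps around to chip 0; the wrap-around direction and the function name `ring_path` are assumptions, since the text only describes the path from chip 0 to chip N.

```python
def ring_path(src: int, dst: int, num_chips: int) -> list:
    """Return the chips visited, in order, when src sends to dst on the ring."""
    if not (0 <= src < num_chips and 0 <= dst < num_chips):
        raise ValueError("chip number out of range")
    path = [src]
    current = src
    while current != dst:
        current = (current + 1) % num_chips  # forward to the next chip
        path.append(current)
    return path


def hop_count(src: int, dst: int, num_chips: int) -> int:
    """Number of SERDES links traversed between src and dst."""
    return len(ring_path(src, dst, num_chips)) - 1
```

Under these assumptions, the worst-case hop count in a ring of `num_chips` chips is `num_chips - 1`, which illustrates why item (6) restricts the number of chips to limit link latency.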

The specific implementation method comprises the following steps:

Assume that the system has 4 chips and that the SRAM capacity inside a single chip exceeds 1 GB, so that the total memory capacity of the system exceeds 4 GB.
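One way to picture how a data set could be spread over the chips is a contiguous partition, sketched below. The text does not say how the data set is divided, so the contiguous layout, the byte-offset addressing and the names `shard_size` and `locate` are all assumptions made purely for illustration.

```python
GIB = 1 << 30  # bytes in 1 GiB


def shard_size(dataset_bytes: int, num_chips: int) -> int:
    """Bytes each chip must hold (ceiling division) for an even split."""
    return -(-dataset_bytes // num_chips)


def locate(byte_offset: int, dataset_bytes: int, num_chips: int):
    """Map a data set byte offset to (chip number, offset within that chip)."""
    if not 0 <= byte_offset < dataset_bytes:
        raise ValueError("offset outside the data set")
    size = shard_size(dataset_bytes, num_chips)
    return byte_offset // size, byte_offset % size
```

With 4 chips and a 4 GiB data set, each chip would hold exactly 1 GiB under this layout, which is why the per-chip SRAM has to exceed 1 GB once the data set grows beyond 4 GB.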

The 64-stage computation is divided into two 32-stage pipelines; the steps for processing one 32-stage pipeline are described next.

Embodiment one:

First step: chip 0 processes the 1st pipeline stage and then sends the intermediate result to chip 1 via SERDES.

Second step: chip 1 processes the 2nd pipeline stage and then sends the intermediate result to chip 2 via SERDES.

Third step: processing continues in sequence; 8 rounds across the 4 chips compute 32 pipeline stages.

Fourth step: the first to third steps are repeated for 8 rounds, so that in total the 4 chips compute 64 pipeline stages, completing the whole algorithm.

Embodiment two:

First step: chip 0 processes pipeline stages 1 to 8, 8 stages in total, and then sends the intermediate result to chip 1 via SERDES.

Second step: chip 1 processes pipeline stages 9 to 16, 8 stages in total, and then sends the intermediate result to chip 2 via SERDES.

Third step: processing continues in sequence; 1 round across the 4 chips computes 32 pipeline stages.

Fourth step: the first to third steps are repeated for 1 round, so that in total the 4 chips compute 64 pipeline stages, completing the whole algorithm.

The first scheme is simpler in terms of chip microarchitecture design.

The second scheme has better memory access locality and higher bandwidth utilization, similar to the burst operation of memory.

The analogy with the burst operation of conventional memory is as follows: computing 8 pipeline stages requires 8 memory accesses, and with 4 chips at least two of these accesses must target the SRAM of the same chip. The request commands and the returned data that target the same chip can therefore be packed together, which improves the utilization of the SERDES bandwidth. However, this design requires more internal buffers to store the intermediate results of multiple pipeline stages.
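The difference between the two embodiments can be made concrete by writing down the stage-to-chip assignment each one implies for a 32-stage pipeline on 4 chips. The sketch below uses 0-based stage numbers (the text counts from 1), and the function names are hypothetical. It counts only the hand-offs of intermediate results between chips, not the memory access requests that chips send to each other.

```python
NUM_CHIPS = 4
STAGES = 32  # one of the two 32-stage pipelines


def chip_round_robin(stage: int) -> int:
    """Embodiment one: consecutive stages go to consecutive chips."""
    return stage % NUM_CHIPS


def chip_blocked(stage: int, block: int = 8) -> int:
    """Embodiment two: each chip processes a block of 8 consecutive stages."""
    return (stage // block) % NUM_CHIPS


def handoffs(mapping, stages: int = STAGES) -> int:
    """Count how often the intermediate result moves to a different chip."""
    return sum(1 for s in range(stages - 1) if mapping(s) != mapping(s + 1))
```

Under these assumptions, the round-robin assignment moves the intermediate result after every stage, while the blocked assignment moves it only at block boundaries, which is consistent with the better locality attributed to the second scheme.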

To sum up: (1) To meet the requirement for more than 4 GB of memory space, a system architecture of multiple interconnected chips is adopted, and each chip stores a part of the data set in its internal SRAM. The total memory capacity of the chips exceeds the required memory space. The chips are connected through multiple high-speed SERDES buses.

(2) The core computing unit of the Ethash algorithm is a 64-round loop of computation and memory access operations. To improve computing efficiency, the loop is split into pipeline operations, and each pipeline stage processes one round of computation and memory access.

By adopting the overall system architecture of stream processing chips and using the command and message formats between the chips and the main controller, the whole system jointly executes the Ethash algorithm, which solves the problem of fast Ethereum proof-of-work computation.

Claims

1. A design method for a proof-of-work (calculation amount proving) system based on stream processing chips, characterized in that the proof-of-work system based on stream processing chips is as follows:

a system architecture of multiple interconnected chips is adopted, wherein each chip stores a part of the data set in its internal SRAM, the total storage capacity of the chips exceeds the required storage space, and the chips are connected through multiple high-speed SERDES buses;

the core computing unit of the Ethash algorithm is a 64-round loop of computation and memory access operations; the loop is split into pipeline operations, and each pipeline stage processes one round of computation and memory access;

on the stream processing architecture formed by the multiple chips, each chip is responsible for processing one or more stages of the 64-stage pipeline, and at the same time each chip is also responsible for responding to memory access and computation requests from other chips and performing the corresponding operations;

when chip 0 (chip numbers start from 0) needs to send data or commands to chip N (N > 1), the data and commands pass serially through chip 1, chip 2, and so on, until they reach chip N, and each intermediate chip that receives the data and commands forwards them directly to the next chip;

the main controller is implemented by a desktop or embedded SoC chip, is connected to each chip through a low-speed or high-speed bus to form an out-of-band control network, and is responsible for initializing all chips, starting and ending tasks, and handling and recovering from errors;

the design method of the proving system comprises the following steps:

(1) WRITE PACKET: a write operation, sent by the main controller, to the SRAM and registers (REG) inside the chip identified by chipid; bits [191:190] = 2'b01 (binary 01) indicate that the message is a WRITE PACKET; bit [189] selects the destination, where 0 writes only to the chip matching chip_id and 1 broadcasts to all chips; bit [188] is reserved and is available for future address expansion; bits [187:184] (4 bits) hold the chipid number of the target chip; bits [183:176], [175:168] and [167:160] carry ECC64 check bits that protect {[191:184], [159:112]}, [111:56] and [55:0] respectively, and all bits in the packet are interleaved so that any 3 consecutive bits belong to different ECC groups, tolerating burst errors of up to 3 bits; the address is located in bits [159:128] (32 bits); the data is located in bits [127:0]; the data packet exchanged between the main controller and the chip is 192 bits wide;

(2) READ REQ PACKET: a request from the main controller to read the SRAM or registers inside a chipid chip; it is used in the following cases: after the SRAM is updated, to check whether the SRAM was written correctly; and when the main controller needs to inspect the internal registers of the chip; bits [191:188] = 4'b0001 (binary 0001) indicate that the message is a READ PACKET; bits [187:184] (4 bits) hold the chipid number of the target chip of the packet initiated by the main controller; the address is located in bits [159:128] (32 bits); the data is located in bits [127:0]; the data packet exchanged between the main controller and the chip is 192 bits wide;

(3) READ DATA SEND BACK packet: the chipid chip returns its internal SRAM or REG data to the main controller; bits [191:188] = 4'b1001 (binary 1001) indicate that the message is a READ DATA SEND BACK packet; the address and data are located at the same positions as in the READ PACKET; bits [187:184] (4 bits) hold the chipid number of the chip that initiated the packet; the data packet exchanged between the main controller and the chip is 192 bits wide;

(4) Error packet: sent by a chipid chip to the main controller after the chip detects an error; the error conditions are: SERDES ECC error, illegal address, illegal packet type, FIFO underflow/overflow, and SERDES lane hang; bits [191:188] = 4'b1011 (binary 1011) indicate that the message is an Error packet; the address and ECC fields are located at the same positions as in the WRITE PACKET; bits [187:184] (4 bits) hold the chipid number of the chip that initiated the packet; bits [127:116] (12 bits) are all 0; bits [115:112] (4 bits) encode the error type: 4'b0000 means the ECC corrected an error, 4'b0001 means an ECC parity error, 4'b0010 means the ECC detected an error that could not be corrected, 4'b0100 means an illegal read address, 4'b1000 means an illegal write address, 4'b1100 means an illegal packet type, 4'b1110 means FIFO underflow/overflow, and 4'b1111 means a lane is blocked (hung); bits [111:96] give the id of the lane on which the error occurred; the data packet exchanged between the main controller and the chip is 192 bits wide.

2. The method for designing a flow processing chip-based computation amount proving system according to claim 1, wherein the error occurrence in the step (4) comprises: