CN112214427B - Cache structure, workload proving operation chip circuit and data calling method thereof - Google Patents

Cache structure, workload proving operation chip circuit and data calling method thereof

Info

Publication number
CN112214427B
CN112214427B (application CN202011079281.6A)
Authority
CN
China
Prior art keywords
request
unit
cache
calculation
computing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011079281.6A
Other languages
Chinese (zh)
Other versions
CN112214427A (en)
Inventor
汪福全 (Wang Fuquan)
刘明 (Liu Ming)
蔡凯 (Cai Kai)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenglong Singapore Pte Ltd
Original Assignee
Sunlune Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sunlune Technology Beijing Co Ltd filed Critical Sunlune Technology Beijing Co Ltd
Priority to CN202011079281.6A priority Critical patent/CN112214427B/en
Publication of CN112214427A publication Critical patent/CN112214427A/en
Application granted granted Critical
Publication of CN112214427B publication Critical patent/CN112214427B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/16Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1605Handling requests for interconnection or transfer for access to memory bus based on arbitration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781On-chip cache; Off-chip memory

Abstract

The application relates to a cache structure, a workload proving operation chip circuit, and a data calling method thereof. The cache structure comprises a plurality of first cache regions, at least one request screening unit, and a plurality of second cache regions. The first cache regions cache received calculation requests; the request screening unit screens out, from the calculation requests, those whose requested access addresses differ; the second cache regions cache the calculation requests screened by the request screening unit. The technical scheme provided by the application improves the random access performance of calculation requests.

Description

Cache structure, workload proving operation chip circuit and data calling method thereof
Technical Field
The present invention relates to the field of integrated circuit technology, and more particularly, to a cache structure, a workload proving operation chip circuit, and a data calling method thereof.
Background
Proof of Work (PoW) is a consensus mechanism used by mainstream cryptocurrencies such as Ether (Ethereum). Its basic characteristic is that a large number of hash operations must be performed to find a hash value that satisfies a given difficulty target. Cryptocurrencies centered on the ETHASH algorithm, however, require a data set larger than 1 GB and frequent access to that data set during the proof-of-work computation.
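To make the hash-search workload concrete, the following is a minimal sketch of a proof-of-work loop. It is an illustration only, not the patent's method: ETHASH itself uses Keccak-based hashing plus reads from the large data set (exactly what makes memory access the bottleneck), whereas this sketch uses plain SHA-256 and invented names.

```python
import hashlib

def proof_of_work(block_header: bytes, difficulty_target: int) -> int:
    """Search for a nonce whose hash falls below the difficulty target."""
    nonce = 0
    while True:
        digest = hashlib.sha256(block_header + nonce.to_bytes(8, "little")).digest()
        if int.from_bytes(digest, "big") < difficulty_target:
            return nonce  # this hash satisfies the difficulty condition
        nonce += 1

# Example with a very easy target so the loop terminates quickly.
print("found nonce:", proof_of_work(b"example-header", 1 << 248))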
The traditional approach stores the data set in a separate external memory off the computing chip, but this is limited by memory bandwidth and performs poorly. To increase bandwidth, the conventional structure shown in fig. 1 connects each computing unit to a routing unit; the routing units connect to arbitration units through a crossbar switch, and each arbitration unit connects to a storage unit. Based on the structure shown in fig. 1, a computing unit issues a calculation request as follows:
as shown in fig. 1, since one computing unit is connected to one routing unit, the computing request sent by each computing unit is transmitted to the routing unit connected to the computing unit, and then transmitted to each arbitration unit by the routing unit through the crossbar switch, until the computing request sent by the computing unit is arbitrated by some arbitration unit, and then is sent to the corresponding storage unit to execute the request, the computing unit may send the next computing request to the routing unit connected to the computing unit.
With the structure of fig. 1 and this sending mechanism, when several computing units request the same access address, the arbitration unit accepts only one of them, resulting in low access efficiency.
Disclosure of Invention
In view of the above, the present invention provides a cache structure, a workload proving operation chip circuit and a data calling method thereof, so as to improve the access efficiency of the workload proving operation chip data.
The invention provides a cache structure, comprising:
a plurality of first cache regions, at least one request screening unit, and a plurality of second cache regions;
the first cache region is used for caching the received calculation request;
the request screening unit is used for screening out, from the calculation requests, calculation requests whose requested access addresses differ;
the second cache region is used for caching the calculation requests screened by the request screening unit.
Therefore, by arranging the first cache regions, multiple calculation requests from the same computing unit can be cached; these pass through the request screening unit, which yields calculation requests with distinct access addresses; and the screened requests are cached in the second cache regions.
As an implementation manner of the first aspect, the first cache region is specifically configured to: cache the received calculation requests according to a first-in, first-stored rule.
As an implementation manner of the first aspect, the second cache region is specifically configured to: cache the calculation requests screened by the request screening unit according to a first-in, first-stored rule.
Therefore, with the first-in, first-stored rule, the corresponding calculation requests can be cached in order in the first cache regions and the second cache regions.
A workload proving operation chip circuit comprising the above cache structure, the circuit comprising: a computing unit, a cache structure, a routing unit, an arbitration unit, and a storage unit, connected in sequence;
the computing unit is used for sending a computing request to the cache structure according to the received computing task;
the routing unit is used for determining an access path of the calculation request screened by the cache structure and sending the access path of the calculation request to the arbitration unit;
the arbitration unit is used for arbitrating the access path of the received calculation request, and if the access path of the calculation request meets arbitration conditions, the calculation request is sent to a corresponding storage unit to call related data through the access path;
the storage unit is used for storing the data set of the workload proving operation chip.
As an implementation manner of the second aspect, the computing unit, the cache structure, the routing unit, the arbitration unit, and the storage unit are respectively multiple, where:
each computing unit, each cache structure and each routing unit are connected in a one-to-one correspondence manner;
each arbitration unit is connected with each storage unit in a one-to-one correspondence manner;
each routing unit is fully connected with each arbitration unit.
As an implementation of the second aspect, the routing unit and the arbitration unit are connected by a crossbar.
Therefore, adding the cache structure to the workload proving operation chip circuit improves the efficiency of the whole crossbar switch and thereby the access performance of chip data.
A data calling method of the workload proving operation chip circuit comprises the following steps:
issuing calculation requests according to a calculation task;
screening out, from the calculation requests, those whose requested access addresses differ;
determining, according to a preset routing table, access paths for the screened calculation requests, and arbitrating the determined access paths;
respectively calling the data required by the calculation requests that pass arbitration.
In summary, the present invention improves data access efficiency.
Drawings
FIG. 1 is a diagram illustrating a workload proving computing chip according to the prior art;
fig. 2 is a schematic structural diagram of a cache structure according to an embodiment of the present disclosure;
fig. 3 is a circuit diagram of a workload verification operation chip according to an embodiment of the present application.
Detailed Description
To facilitate an understanding of the present application, the present application will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present application are shown in the drawings. This application may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
In the following description, the terms "first", "second", "third", etc., or module A, module B, module C, etc., are used solely to distinguish similar objects and do not denote a particular order or importance of the objects; where permissible, the specific order may be interchanged so that the embodiments of the application described herein can be practiced in an order other than that shown or described.
In the following description, reference numerals indicating steps, such as S100, S200, etc., do not necessarily indicate that the steps are performed in that order; where permissible, the order of the steps may be interchanged, or steps may be performed simultaneously.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
As shown in fig. 2, one embodiment of the present application provides a cache structure, which includes: the system comprises a plurality of first cache regions, at least one request screening unit and a plurality of second cache regions. The input end of the request screening unit is connected with the output end of each first cache region, and the output end of the request screening unit is connected with the input end of each second cache region.
The first cache region is used for caching a received calculation request and sending it to the request screening unit. Caching may follow a preset rule, for example caching requests in slots with addresses from small to large in order of arrival. In fig. 2, the cache address of first cache region a is 0, that of first cache region b is 1, and so on, up to first cache region M with cache address x. When the cache structure receives calculation requests, the first received request is cached in first cache region a (address 0), the second in first cache region b (address 1), and so on, until all calculation requests are cached or all first cache region slots are occupied.
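A minimal Python sketch of this slot-allocation rule follows; the class and field names are our own illustrative assumptions, not the patent's implementation.

```python
class FirstCacheRegions:
    """Toy model: M single-request slots with cache addresses 0..M-1.
    A newly received calculation request is cached in the lowest free slot."""

    def __init__(self, num_regions):
        self.slots = [None] * num_regions  # slot i models first cache region i

    def cache_request(self, request):
        for addr, occupant in enumerate(self.slots):
            if occupant is None:
                self.slots[addr] = request  # cached at lowest free cache address
                return True
        return False  # all first cache regions are occupied
```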
The request screening unit is used for screening out, from the calculation requests held in the first cache regions, those whose requested access addresses differ, and sending the screened requests to the second cache regions. For example, suppose the first calculation request received by the screening unit accesses address 1, the second accesses the same address as the first, and the third accesses address 2. The screening unit then selects the requests with distinct access addresses, namely the first and third calculation requests, and sends them to the second cache regions simultaneously; correspondingly, the two first cache regions holding those requests release their storage resources. When several requests target the same address, a first-in first-out policy applies: of the first and second requests both accessing address 1 in this example, the first was stored into the first cache region earlier and is therefore selected preferentially; the same policy governs the screening of subsequent calculation requests.
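The screening step then amounts to address deduplication with first-in priority. The sketch below continues the same toy model; the mechanism shown is an assumption, since the patent describes behaviour rather than an implementation.

```python
class RequestScreeningUnit:
    """Toy model: pick at most one request per distinct access address from the
    first cache regions and move the picks to the second cache regions."""

    def screen(self, first, second):
        seen_addresses = set()
        # Iterating slots in address order approximates first-in, first-out;
        # a faithful model would track arrival timestamps per request.
        for addr, request in enumerate(first.slots):
            if request is None:
                continue
            if request["address"] in seen_addresses:
                continue  # same address already picked: this request waits
            seen_addresses.add(request["address"])
            second.append(request)    # forward the screened request
            first.slots[addr] = None  # release the first cache region slot
```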
The second cache region is used for caching the calculation requests screened by the request screening unit. Its caching scheme is the same as that of the first cache region, so it is not described again.
In this embodiment, the numbers of first cache regions, request screening units, and second cache regions can each be adjusted to balance resources against efficiency: when the use case demands higher efficiency, more first cache regions, request screening units, and second cache regions can be provided; when efficiency requirements are modest or cost is constrained, fewer suffice.
As shown in fig. 3, another embodiment of the present application provides a workload proving operation chip circuit including the above-mentioned cache structure, the circuit including: the system comprises a computing unit, a cache structure, a routing unit, an arbitration unit and a storage unit which are connected in sequence.
The computing unit is used for sending a computing request to the cache structure according to the received computing task. The number of calculation requests is not limited here.
The cache structure is configured to cache and screen the computation requests sent by the computation unit, and the specific cache rule and the screening rule are the same as those in the previous embodiment, so that details thereof are not repeated here.
The routing unit is used for determining the access path of the calculation request screened by the cache structure and sending the access path of the calculation request to the arbitration unit;
the arbitration unit is used for arbitrating the access path of the received calculation request, and if the access path of the calculation request meets the arbitration condition, the calculation request is sent to the corresponding storage unit to call the related data through the access path.
The storage unit is used for storing the data set of the workload proving operation chip.
Here, taking as an example the first computing unit in fig. 3 issuing three calculation requests, the specific working process of the workload proving operation chip circuit in this embodiment is described:
the first step, the first computing unit sends a first computing request, a second computing request and a third computing request to a first cache structure connected with the first computing request according to the computing task;
In the second step, the first cache regions in the first cache structure cache the first, second, and third calculation requests according to the preset rule. The request screening unit in the first cache structure screens the requested access addresses of the three requests: if the first calculation request accesses address 1, the second also accesses address 1, and the third accesses address 2, the screening unit sends the first and third calculation requests simultaneously to the second cache regions in the first cache structure. The caching rule for the screened requests is the preset rule of the previous embodiment.
In the third step, the second cache regions in the first cache structure send the first and third calculation requests to the first routing unit connected to the first cache structure. The first routing unit determines the access paths of the two requests and sends them through the crossbar switch to the first through k-th arbitration units shown in fig. 3 for arbitration. Once the access path of, say, the first calculation request is granted by some arbitration unit, that arbitration unit sends the request to the storage unit connected to it, so as to call the related data.
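Using the toy classes sketched earlier, this three-request walk-through can be reproduced roughly as follows (again illustrative, with invented request records):

```python
first = FirstCacheRegions(num_regions=4)
second = []  # models the second cache regions

# The first computing unit issues three requests; two share access address 1.
for req in ({"id": 1, "address": 1},
            {"id": 2, "address": 1},
            {"id": 3, "address": 2}):
    first.cache_request(req)

RequestScreeningUnit().screen(first, second)
print([r["id"] for r in second])  # [1, 3]: these proceed to the routing unit,
                                  # while request 2 waits in the first regions
```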
It can be seen from the above that, since the workload proving operation chip has the cache structure, each routing unit can send out several calculation requests at the same time, improving the working efficiency of the whole chip.
With reference to fig. 2 and 3, the following provides an implementation of the present application:
First, the computing unit issues calculation requests (their number is not limited). Each request is first cached at a free cache address in the first cache regions of the cache structure: if cache address 0 is empty, the request is cached there; if a request already occupies address 0, the new request is cached at address 1, and so on. The computing unit can keep sending calculation requests until the first cache regions of the cache structure are full.
Secondly, the request screening unit in the cache structure selects, from all the calculation requests in the first cache regions, those with non-conflicting addresses and fills them into the second cache regions; the filling rule is the same as that of the first cache regions.
Then the calculation requests in the second cache regions are sent through the routing unit to participate simultaneously in the arbitration of the first through k-th arbitration units. If no calculation request satisfies the arbitration condition, none is forwarded to the storage units; if one request satisfies it, that request is forwarded to a storage unit through one of the arbitration units; and if several requests satisfy it, they are forwarded to the storage units through the corresponding arbitration units.
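The arbitration step can be sketched as each storage unit granting at most one request per cycle. The grant policy below (random choice among contenders) and the address-to-unit mapping are assumptions; the patent does not specify the arbitration condition.

```python
import random

def arbitrate(pending, num_storage_units):
    """Toy arbitration: map each request to a storage unit by address and let
    each arbitration/storage unit pair grant one request per cycle."""
    by_unit = {}
    for req in pending:
        by_unit.setdefault(req["address"] % num_storage_units, []).append(req)
    granted = [random.choice(contenders) for contenders in by_unit.values()]
    return granted  # at most one grant per storage unit this cycle
```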
The efficiency of the cache structure provided by the present application is verified as follows:
for example, the number of calculation units is 32, and the number of storage units is 32. If the chip circuit structure shown in fig. 1 is used, when the computing unit sends a computing request to the routing unit, the routing unit sends the computing request to the arbitration unit for arbitration, and if the computing request does not pass arbitration, the computing request occupies the position of the routing unit, so that the computing unit cannot continue to send a new computing request, and the computing unit cannot send a next computing request until the computing request passes arbitration of the arbitration unit. If the address of the computation request issued by the computation unit is completely random or close to random, the efficiency of the structure is:
[The efficiency formulas appear only as images in the source (BDA0002717460670000071 and BDA0002717460670000072) and are not recoverable from the text.]
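For orientation, a standard back-of-the-envelope estimate for this situation (our assumption, not the patent's own derivation) is that with N computing units, each holding one outstanding request to one of N storage units and addresses uniformly random, the fraction of requests granted per cycle is:

```latex
E(N) = 1 - \left(1 - \frac{1}{N}\right)^{N}
     \;\xrightarrow{\;N \to \infty\;}\; 1 - \frac{1}{e} \approx 63.2\%,
\qquad
E(32) = 1 - \left(\frac{31}{32}\right)^{32} \approx 63.8\%
```

This is consistent with the 78% to 95% figures reported below being a substantial improvement over the baseline.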
When the cache structure provided by the present application is used, a calculation request sent by a computing unit is first cached in the cache structure, and the computing unit can continue sending requests until the first cache regions of the cache structure are full. Meanwhile, one or more requests from the cache structure can be sent through the routing unit to the arbitration units for arbitration simultaneously. Under this configuration, again with 32 computing units and 32 storage units:
if the number of the first buffer areas in the buffer structure is 4 and the number of the second buffer areas in the buffer structure is 4, the efficiency of the structure can reach 78%.
If the number of the first buffer areas in the buffer structure is 4 and the number of the second buffer areas is 8, the efficiency of the structure can reach 90%.
If the number of the first buffer areas in the buffer structure is 8 and the number of the second buffer areas is 12, the efficiency of the structure can reach 94%.
If the number of the first buffer areas in the buffer structure is 12 and the number of the second buffer areas is 12, the efficiency of the structure can reach 95%.
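These figures can be sanity-checked in trend (not in exact value) with a rough Monte Carlo model. The model below assumes uniformly random addresses, one grant per storage unit per cycle, and a fixed number of in-flight requests per computing unit standing in for the cache depth; it does not model the patent's exact microarchitecture.

```python
import random

def modelled_efficiency(num_units=32, num_banks=32, outstanding=4,
                        cycles=20_000):
    """Grants per bank per cycle when each unit keeps `outstanding` random
    requests in flight (the cache structure's effect) and each bank serves
    one request per cycle; ungranted requests retry with the same address."""
    inflight = [random.randrange(num_banks)
                for _ in range(num_units * outstanding)]
    grants = 0
    for _ in range(cycles):
        targeted = set(inflight)  # banks with at least one pending request
        grants += len(targeted)
        for bank in targeted:
            served = inflight.index(bank)                   # serve one request
            inflight[served] = random.randrange(num_banks)  # issue a new one
    return grants / (cycles * num_banks)

for depth in (1, 4, 8, 12):
    print(f"outstanding={depth}: efficiency ~ {modelled_efficiency(outstanding=depth):.0%}")
```

With depth 1 this lands roughly at the ~60% baseline, and the modelled efficiency climbs as the depth grows, matching the qualitative claim that deeper caching raises crossbar utilisation.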
In practical applications, the numbers of computing units, storage units, and first and second cache regions in the cache structure can be set according to actual requirements.
In conclusion, the cache structure provided by the present application significantly improves access efficiency.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described; however, any combination of these technical features that contains no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of the present invention, and while their description is specific and detailed, they should not be construed as limiting the scope of the invention. A person skilled in the art can make several variations and improvements without departing from the inventive concept, and these fall within the protection scope of the invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (7)

1. A cache structure, the cache structure comprising:
a plurality of first cache regions, at least one request screening unit, and a plurality of second cache regions;
the first cache region is used for caching the received calculation request;
the request screening unit is used for screening out, from the calculation requests, calculation requests whose requested access addresses differ;
the second cache region is used for caching the calculation requests screened by the request screening unit.
2. The structure of claim 1, wherein the first cache region is specifically configured to: cache the received calculation requests according to a first-in, first-stored rule.
3. The structure of claim 1, wherein the second cache region is specifically configured to: cache the calculation requests screened by the request screening unit according to a first-in, first-stored rule.
4. A workload proving operation chip circuit comprising the cache structure of any one of claims 1 to 3, the circuit comprising: a computing unit, a cache structure, a routing unit, an arbitration unit, and a storage unit, connected in sequence;
the computing unit is used for sending a computing request to the cache structure according to the received computing task;
the routing unit is used for determining an access path of the calculation request screened by the cache structure and sending the access path of the calculation request to the arbitration unit;
the arbitration unit is used for arbitrating the access path of the received calculation request, and if the access path of the calculation request meets arbitration conditions, the calculation request is sent to a corresponding storage unit to call related data through the access path;
the storage unit is used for storing the data set of the workload proving operation chip.
5. The circuit according to claim 4, wherein the computing unit, the cache structure, the routing unit, the arbitration unit and the storage unit are respectively plural, and wherein:
each computing unit, each cache structure and each routing unit are connected in a one-to-one correspondence manner;
each arbitration unit is connected with each storage unit in a one-to-one correspondence manner;
each routing unit is fully connected with each arbitration unit.
6. The circuit of claim 5, wherein the routing unit and the arbitration unit are connected by a crossbar.
7. A data calling method based on the workload proving operation chip circuit as claimed in any one of claims 4 to 6, wherein the method comprises:
sending out a calculation request according to the calculation task;
screening out, from the calculation requests, calculation requests whose requested access addresses differ;
determining, according to a preset routing table, access paths for the screened calculation requests, and arbitrating the determined access paths;
respectively calling the data required by the calculation requests that satisfy the arbitration condition.
CN202011079281.6A 2020-10-10 2020-10-10 Cache structure, workload proving operation chip circuit and data calling method thereof Active CN112214427B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011079281.6A CN112214427B (en) 2020-10-10 2020-10-10 Cache structure, workload proving operation chip circuit and data calling method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011079281.6A CN112214427B (en) 2020-10-10 2020-10-10 Cache structure, workload proving operation chip circuit and data calling method thereof

Publications (2)

Publication Number Publication Date
CN112214427A CN112214427A (en) 2021-01-12
CN112214427B 2022-02-11

Family

ID=74053439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011079281.6A Active CN112214427B (en) 2020-10-10 2020-10-10 Cache structure, workload proving operation chip circuit and data calling method thereof

Country Status (1)

Country Link
CN (1) CN112214427B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051194B (en) * 2021-03-02 2023-06-09 长沙景嘉微电子股份有限公司 Buffer memory, GPU, processing system and buffer access method
CN113435148B (en) * 2021-06-04 2022-11-08 上海天数智芯半导体有限公司 Parameterized cache digital circuit micro-architecture and design method thereof
CN114915594B (en) * 2022-07-14 2022-09-30 中科声龙科技发展(北京)有限公司 Method for balancing routing, network interconnection system, cross switch device and chip
CN115002050B (en) * 2022-07-18 2022-09-30 中科声龙科技发展(北京)有限公司 Workload proving chip
CN114928577B (en) * 2022-07-19 2022-10-21 中科声龙科技发展(北京)有限公司 Workload proving chip and processing method thereof

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0384102A2 (en) * 1989-02-22 1990-08-29 International Business Machines Corporation Multi-processor caches with large granularity exclusivity locking
CA2920528A1 (en) * 2013-08-06 2015-02-12 Huawei Technologies Co., Ltd. Memory access processing method and apparatus, and system
CN106569727A (en) * 2015-10-08 2017-04-19 福州瑞芯微电子股份有限公司 Shared parallel data reading-writing apparatus of multi memories among multi controllers, and reading-writing method of the same
CN106886498A (en) * 2017-02-28 2017-06-23 华为技术有限公司 Data processing equipment and terminal
CN111666239A (en) * 2020-07-10 2020-09-15 深圳开立生物医疗科技股份有限公司 Master-slave equipment interconnection system and master-slave equipment access request processing method


Also Published As

Publication number Publication date
CN112214427A (en) 2021-01-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240315

Address after: 10 Jialeng Road, Singapore # 09-11

Patentee after: Shenglong (Singapore) Pte. Ltd.

Country or region after: Singapore

Address before: 100190 3a18-01, floor 3a, building 9, North Fourth Ring Road West, Haidian District, Beijing

Patentee before: SUNLUNE TECHNOLOGY DEVELOPMENT (BEIJING) Co.,Ltd.

Country or region before: China
