CN111863139B - Gene comparison acceleration method and system based on near-memory computing structure - Google Patents


Info

Publication number
CN111863139B
CN111863139B
Authority
CN
China
Prior art keywords
data
gene
memory
memory structure
cubic
Prior art date
Legal status
Active
Application number
CN202010278048.4A
Other languages
Chinese (zh)
Other versions
CN111863139A (en)
Inventor
谭光明
刘万奇
臧大伟
孙凝晖
陈灿
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202010278048.4A priority Critical patent/CN111863139B/en
Publication of CN111863139A publication Critical patent/CN111863139A/en
Application granted granted Critical
Publication of CN111863139B publication Critical patent/CN111863139B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/76: Architectures of general purpose stored program computers
    • G06F 15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7839: Architectures of general purpose stored program computers comprising a single central processing unit with memory
    • G06F 15/7842: Architectures of general purpose stored program computers comprising a single central processing unit with memory on one IC chip (single chip microcontrollers)
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B 30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G16B 50/50: Compression of genetic data

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Genetics & Genomics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention provides a gene comparison acceleration method and system based on a near-memory computing structure. The method comprises: grouping a plurality of vertically stacked cubic memory structures to obtain a plurality of gene comparison processing groups; acquiring reference sequence data, splitting it into reference data segments, storing the segments in a gene comparison processing group, and carrying out data communication between cubic memory structures through the on-chip network of the gene comparison accelerator; acquiring gene sequence data to be compared, splitting it into data segments to be compared, and inputting the segments into the logic layers of the cubic memory structures in a gene comparison processing group. Each logic layer judges whether the reference data segment to be compared against the current data segment resides in the local storage layer; if so, it fetches the reference data segment from the local storage layer and performs the gene comparison against the current data segment to obtain a comparison result; otherwise, it obtains the comparison result through functional message passing and remote processing.

Description

Gene comparison acceleration method and system based on near-memory computing structure
Technical Field
The invention relates to the fields of computer architecture design and biological gene data processing, and in particular to a gene comparison acceleration method and system based on a near-memory computing structure.
Background
In recent years, bioinformatics, built on biotechnology and computer technology, has developed vigorously; one of its important fields is gene data analysis. Gene sequencing is an indispensable link in gene data analysis. With the continuously falling cost of second-generation sequencing and the explosive growth of sequencing data, existing processors can no longer keep up with the growing sequencing demand. Gene comparison, an indispensable and time-consuming step of gene sequencing, has become the main performance bottleneck and occupies a pivotal position in the overall sequencing process. Therefore, to raise sequencing speed and meet the challenges of practical applications, a new processor architecture is urgently needed to accelerate the gene comparison process.
To improve processing efficiency, several acceleration systems for gene comparison applications have been proposed; their acceleration effect mainly depends on two aspects: 1) the speed at which a single iteration is processed and the data for the subsequent iteration is requested from memory; 2) the speed at which the memory supplies data to the processor. Recently emerging custom accelerators for data-intensive applications focus on the first aspect: their custom processing units handle each iteration with high efficiency and achieve high off-chip bandwidth utilization through massively parallel execution structures. Such custom accelerators shift the performance bottleneck of the application from low memory bandwidth utilization to insufficient memory bandwidth supply, so simply increasing the number of processing units to raise memory access concurrency is no longer effective.
Disclosure of Invention
The invention provides a gene comparison algorithm accelerator based on near-memory computing technology, which mainly targets and solves two key problems: 1) how to provide more sufficient memory bandwidth for FM-index-based gene comparison big-data applications; 2) how to more fully utilize the provided high memory bandwidth to maximize the performance improvement.
Aiming at the problem that the performance bottleneck of custom gene comparison accelerators has shifted from low memory bandwidth utilization to insufficient memory bandwidth supply, and that simply increasing the number of processing units to raise access concurrency yields little benefit, the invention designs a novel gene comparison accelerator structure based on near-memory computing, which provides more sufficient memory bandwidth for FM-index-based gene comparison big-data applications while fully utilizing that bandwidth to maximize the performance improvement of the application algorithm.
The novel near-memory-computing gene comparison accelerator structure is a processor structure that accelerates the gene comparison algorithm with the high memory bandwidth supplied by a novel memory structure based on 3D stacking technology. The 3D stacking technology provides the high memory bandwidth, and the performance of the comparison algorithm is maximized in the most resource-efficient way.
The near-memory computing acceleration structure exploits the internal bandwidth of a novel memory structure based on 3D stacking technology; the structure is shown in FIG. 1. The invention replaces the traditional organization, in which processor and memory sit in separate chips, with one in which processor and memory are integrated in the same chip: the computing function is realized in the internal logic layer of the novel memory structure, which supplies it with ample memory bandwidth. More importantly, with this design the memory bandwidth multiplies as the memory capacity is expanded. At the same time, the invention integrates simple computing logic on the logic layer. The amount of computation per LF-mapping iteration of the comparison application is very small and cannot effectively hide the memory access latency, so access latency is an important factor affecting performance. Compared with DDR memory, this architecture provides lower access latency, because data movement in a near-memory computing architecture stays entirely within one chip, whereas in the traditional processor/memory-separated architecture data must shuttle back and forth between the processor chip and the memory chip, making transmission delays high.
The novel memory structure based on 3D stacking technology is a cube formed by vertically stacking a plurality of storage layers and a logic layer. Each layer is divided into a plurality of vertically adjacent regions, called vaults. Each vault is composed of a plurality of memory banks and has, on its logic layer, its own memory controller, called the vault controller, which controls and schedules the memory accesses of the local vault through lower-level DRAM commands. The vault controller is interconnected with the DRAM banks, giving the novel memory structure internal communication with high internal bandwidth and low power consumption. The available bandwidth of the structure is divided into internal bandwidth and external bandwidth. Internal bandwidth is the data transmission bandwidth between the storage layers and the logic layer inside the cube; external bandwidth is the bandwidth at which the structure supplies data to other devices over the external links. A key property of the internal bandwidth is that it multiplies as the capacity is expanded, and this scalability is essential for further accelerating the gene comparison algorithm.
Aiming at the defects of the prior art, the invention provides a gene comparison acceleration method based on a near memory computing structure, which comprises the following steps:
step 1, acquiring a gene comparison accelerator based on a near memory computing structure, wherein the gene comparison accelerator is composed of a plurality of vertical cubic memory structures, and each cubic memory structure is formed by stacking a plurality of storage layers and a logic layer;
step 2, grouping the plurality of vertical cubic memory structures according to the BWT reference sequence occupation space required by BWT-based gene comparison to obtain a plurality of gene comparison processing groups;
step 3, acquiring reference sequence data, splitting the reference sequence data into reference data segments, respectively storing the reference data segments into a gene comparison processing group, and realizing data communication between cubic memory structures through an on-chip network of a gene comparison accelerator;
step 4, acquiring gene sequence data to be compared, splitting the gene sequence data to be compared into data segments to be compared and inputting the data segments to be compared into logic layers of all cubic memory structures in a gene comparison processing group respectively, judging whether a reference data segment compared with the current data segment to be compared is located in a local storage layer or not by the logic layers, if so, acquiring the reference data segment from the local storage layer, carrying out gene comparison with the current data segment to be compared to obtain a comparison result, and otherwise, obtaining the comparison result by adopting a functional message transmission and remote processing mode;
and 5, circulating the step 3 and the step 4 until all to-be-compared data segments of the to-be-compared gene sequence data are subjected to gene comparison, and summarizing all comparison results to obtain a complete gene comparison result of the to-be-compared gene sequence data.
In the gene comparison acceleration method based on the near-memory computing structure, the functional message passing and remote processing mode of step 4 is specifically:
the source cubic memory structure sends the data address of the reference data segment to the remote cubic memory structure that holds the reference data segment; after receiving the request, the remote cubic memory structure performs the data access and the gene comparison operation locally on its own cubic memory structure, and finally returns the comparison result to the source cubic memory structure.
In the gene comparison acceleration method based on the near-memory computing structure, a memory controller is arranged in the logic layer of each cubic memory structure to control access to the data of the storage layers; the memory controller encapsulates the underlying protocol, so that the internal network communication of the novel memory structure can be packet-based.
The gene comparison acceleration method based on the near memory computing structure, wherein the step 4 comprises:
step 41, the source cubic memory structure sends a request message to the remote cubic memory structure, and then the source cubic memory structure continues other processing;
step 42, the remote cubic memory structure receives the processing request;
step 43, the remote cubic memory structure prefetches the reference data segment data according to the position pointer in the request message;
step 44, the remote cubic memory structure distributes tasks to the local logic layer thereof for carrying out primary gene comparison to obtain an index value;
step 45, the remote cubic memory structure returns the index value as a response to the source cubic memory structure;
and step 46, the source cube memory structure receives the response of the task and waits for the subsequent task scheduling.
In the gene comparison acceleration method based on the near-memory computing structure, the gene comparison accelerator includes a memory access unit that is placed, as part of a prefetcher, in front of the PE array and continuously supplies it with data through prefetching: the scheduler of the input queue performs address translation on the processing requests of the request queue within the input queue to obtain memory addresses and sends them to the prefetcher; the prefetcher accesses the storage layers of the cubic memory structure according to these addresses, and the retrieved data is sent to the data cache of the PE array for the PEs' subsequent computation.
The invention also provides a gene comparison acceleration system based on a near memory computing structure, which comprises:
the method comprises the following steps that a module 1 is used for obtaining a gene comparison accelerator based on a near memory computing structure, wherein the gene comparison accelerator is composed of a plurality of vertical cubic memory structures, and each cubic memory structure is formed by stacking a plurality of storage layers and a logic layer;
the module 2 is used for grouping the plurality of vertical cubic memory structures according to the BWT reference sequence occupation space required by BWT-based gene comparison to obtain a plurality of gene comparison processing groups;
a module 3, acquiring reference sequence data, splitting the reference sequence data into reference data segments, respectively storing the reference data segments into a gene comparison processing group, and realizing data communication between cubic memory structures through an on-chip network of a gene comparison accelerator;
the module 4 acquires gene sequence data to be compared, divides the gene sequence data to be compared into data segments to be compared and inputs the data segments into logic layers of cubic memory structures in a gene comparison processing group, the logic layers judge whether a reference data segment compared with the current data segment to be compared is located in a local storage layer, if yes, the logic layers acquire the reference data segment from the local storage layer and perform gene comparison with the current data segment to be compared to obtain a comparison result, and if not, the logic layers acquire the comparison result by adopting a functional message transmission and remote processing mode;
and the module 5 circulates the module 3 and the module 4 until all the to-be-compared data segments of the to-be-compared gene sequence data complete gene comparison, and summarizes all comparison results to obtain the complete gene comparison result of the to-be-compared gene sequence data.
In the gene comparison acceleration system based on the near-memory computing structure, the functional message passing and remote processing mode of module 4 is specifically:
the source cubic memory structure sends the data address of the reference data segment to the remote cubic memory structure that holds the reference data segment; after receiving the request, the remote cubic memory structure performs the data access and the gene comparison operation locally on its own cubic memory structure, and finally returns the comparison result to the source cubic memory structure.
In the gene comparison acceleration system based on the near-memory computing structure, a memory controller is arranged in the logic layer of each cubic memory structure to control access to the data of the storage layers; the memory controller encapsulates the underlying protocol, so that the internal network communication of the novel memory structure can be packet-based.
In the gene comparison acceleration system based on the near-memory computing structure, the module 4 comprises:
module 41: the source cubic memory structure sends a request message to the remote cubic memory structure, and then the source cubic memory structure continues other processing;
module 42: the remote cubic memory structure receives the processing request;
module 43: the remote cubic memory structure prefetches the reference data segment data according to the position pointer in the request message;
module 44: the remote cubic memory structure distributes the task to its local logic layer for one gene comparison, obtaining an index value;
module 45: the remote cubic memory structure returns the index value as a response to the source cubic memory structure;
module 46: the source cubic memory structure receives the response of the task and waits for subsequent task scheduling.
In the gene comparison acceleration system based on the near-memory computing structure, the gene comparison accelerator includes a memory access unit that is placed, as part of a prefetcher, in front of the PE array and continuously supplies it with data through prefetching: the scheduler of the input queue performs address translation on the processing requests of the request queue within the input queue to obtain memory addresses and sends them to the prefetcher; the prefetcher accesses the storage layers of the cubic memory structure according to these addresses, and the retrieved data is sent to the data cache of the PE array for the PEs' subsequent computation.
According to the scheme, the invention has the advantages that:
the invention solves a core problem of gene comparison algorithm acceleration by introducing a new structure of a near memory computing structure, namely how to provide more sufficient memory bandwidth for the application of FM-index-based gene comparison large data. Where FM-index is an algorithm that indexes compressed data. The data is compressed by an algorithm such as BWT, and the FM-index can directly index on the compressed data, so that the efficiency is high. By introducing the near memory computing structure, sufficient and expandable memory bandwidth is provided, and simultaneously, high memory bandwidth provided by the novel memory structure is utilized more fully, so that the performance improvement maximization is realized in a most resource-saving mode.
Drawings
FIG. 1 is a processor-main memory block diagram;
FIG. 2 is a general block diagram of a near memory computing accelerator;
FIG. 3 is a diagram of a Vault packet structure;
FIG. 4 is an exemplary diagram of memory access traces;
FIG. 5 is a diagram of a functional message structure;
FIG. 6 is a message passing diagram;
FIG. 7 is a diagram of a processing unit compute-access structure;
FIG. 8 is a diagram of the number of processing units in the Vault and utilization.
Detailed Description
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
Overall structure. FIG. 2 is a logic block diagram of the near-memory-computing comparison application accelerator of the present invention. The near-memory computing accelerator is built on the novel 3D-stacked memory structure and can provide abundant internal memory bandwidth. As shown in FIG. 2 (a), the accelerator is composed of 16 novel memory structure "cubes" and provides 128GB of memory capacity. Although these cubes can communicate through the interconnection links, in the application scenario of the present invention, thanks to the independent concurrency among reads, each cube can separately store the genome reference sequence and the read stream data and therefore carry out the alignment process independently; this parallel processing mode raises the internal bandwidth of the whole system to 8TB/s. Here, the read stream data consists of the short gene segments read in to be compared (the gene comparison process compares these segments against the genome reference sequence).
FIG. 2 (b) depicts the novel memory structure "cube", which is formed by stacking 8 DRAM storage layers of 8Gb capacity each and one logic layer; each cube is vertically divided into 32 regions (called vaults) connected by an on-chip network. The invention divides the 32 vaults of each cube into 2 groups of 16 vaults, so that each group can store one copy of the reference sequence data, and different groups can execute independently in parallel.
Besides its storage layers, each vault has the logic layer structure shown in FIG. 2 (c), whose details appear in FIG. 2 (d). Each vault has a built-in dedicated memory controller that controls access to the data of the storage layers; this built-in DRAM controller encapsulates the underlying protocol, so that the internal network communication of the novel memory structure can be packet-based rather than using low-level DRAM commands as DDRx memory does. The encapsulation embeds the protocols of the underlying network communication (protocols exist that use packet transmission), whereas DDRx requires low-level DRAM commands. Communication between vaults goes through a Network Interface (NI), which not only sends and receives messages from other vaults but also forwards messages as a "router" in the network-on-chip. To realize the customized in-memory computing function, a processing element array (PE array) is placed on each vault's logic layer; the processing elements are similar in structure to the PE units of traditional gene comparison accelerators and can complete the FM-index computation. The on-chip prefetcher is an important functional unit of the logic layer; it mainly prefetches reference sequence data and the read stream. Prefetching a reference-sequence data block prepares the required BWT string data for the PE array before the local occurrence counting, and read-stream prefetching fetches the next read (or several reads) of the stream in order, preparing for subsequent read processing. The input and output queues are the interfaces between the vault's logic layer and the network-on-chip: the output queue is the exit through which processing requests are sent to "remote" vaults, and the input queue is the entrance that receives processing requests from "remote" vaults and the processing results they return. A BWT string is the data obtained by transforming character data with the BWT (Burrows-Wheeler transform) algorithm.
Data partitioning and vault grouping. BWT-based gene comparison applications require frequent random access to the BWT of the reference sequence, and to eliminate the overhead of frequent accesses to external memory devices, the roughly 3GB reference sequence needs to reside in memory. However, the capacity of one novel-memory-structure "cube" is only 8GB, and the storage space of each vault's storage layers is only 256MB, so a vault cannot store the complete reference sequence by itself; consequently, the computing units on a vault cannot complete sequence alignment by accessing only the "local" storage layers.
Each vault of the novel memory structure has an independent memory controller and can provide a packet-based remote access mechanism; the network interface on the logic layer also provides the foundation for a network-on-chip, through which different vaults on the chip can communicate. On this basis, the invention divides the reference sequence data into small blocks stored in different vaults, and realizes data communication between vaults through the network-on-chip.
If the reference sequence were divided into 32 parts stored in 32 vaults, all 32 vaults of the novel memory structure would need to be interconnected and data accesses would span 32 regions, which would not only enlarge the network-on-chip and increase implementation difficulty, but also lengthen the average distance of "remote" accesses and hurt performance. In fact, "half" of the novel memory structure, i.e. 16 vaults, has a memory capacity of 4GB and can fully contain the reference sequence. The invention therefore divides the vaults of the novel memory structure into two groups of 16 vaults, each group connected by its own 2D-mesh network-on-chip; meanwhile, the reference sequence is divided into 16 contiguous intervals stored on the storage layers of the 16 vaults respectively, as shown in FIG. 3.
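The following sketch illustrates this layout (block arithmetic and names are our own assumptions): the reference BWT is cut into 16 contiguous blocks, one per vault, and a global BWT position is translated into the vault that stores it plus a local offset.

```python
# Illustrative sketch of the layout above (names and arithmetic are our
# own): the reference BWT is cut into 16 contiguous blocks, one per vault,
# and a global BWT position maps to (vault id, local offset).

VAULTS_PER_GROUP = 16

def block_size(bwt_len: int) -> int:
    """Bytes of BWT per vault; the last vault absorbs any remainder."""
    return (bwt_len + VAULTS_PER_GROUP - 1) // VAULTS_PER_GROUP

def locate(global_pos: int, block: int):
    """Which vault stores this BWT position, and at what local offset."""
    vault_id = min(global_pos // block, VAULTS_PER_GROUP - 1)
    return vault_id, global_pos - vault_id * block

block = block_size(3_000_000_000)      # ~3 GB human reference BWT
print(locate(1_234_567_890, block))    # -> (6, 109567890): ask vault 6
```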
Non-blocking functional message passing. Although data partitioning solves the storage of the reference sequence, the memory access scope of the processing elements (PEs) on a vault is likewise limited to that vault, since the vault's memory controller can only access the "local" storage layers. Accesses to the reference sequence, however, are random across the whole sequence; FIG. 4 shows the trajectory of memory accesses within a vault group, where the vault access order for the sequence "ACT" is "15-7-2".
The most intuitive way to handle a remote access is to request the data directly: the local vault sends a data request to the remote vault, the remote vault accesses the data, fetches it, and transmits it back to the local vault, which waits for the data before performing the subsequent operation. This is called blocking communication; it is the most intuitive scheme but has two prominent problems. First, the computing part of the local vault is stalled waiting for data, which lowers its utilization and hurts overall performance. Second, the volume of network traffic is large and puts heavy pressure on the network-on-chip. For example, in the "counting" process, each LF-mapping iteration needs the two BWT ranks corresponding to sp and ep, each rank holding 64 bytes of data, so 2 x 64 B = 128 B must be transferred per iteration, on top of the "remote" request itself; against a network-on-chip bandwidth of 16 B/cycle, such "large block" transfers put considerable pressure on the network-on-chip.
The invention solves this problem with functional message passing and "remote" processing. Each LF-mapping iteration of the FM-index merely updates BWT index values: the "counting" process uses the two head and tail indexes (sp, ep), and the "judging" process uses a new suffix string index. Thus, for one iteration on a vault, only the new index value has to be obtained; it does not matter where the computation that updates the index takes place. In other words, since every vault can compute the index update, the update can be performed entirely on the "remote" vault that stores the data, with the new index value returned to the "local" vault once the computation completes. (The BWT algorithm transforms the original data, exploiting the repeated substrings present in text data, so that repeated characters appear with higher frequency in adjacent positions of the new string; combined with compression algorithms, this lets the data reach a higher compression ratio.)
The invention adopts the "remote" processing mode: when the local vault performs an LF mapping and the required BWT data is not on the local storage layer but on a remote vault, the local vault sends the data address and other parameters to the remote vault; upon receiving the request, the remote vault performs the data access and the subsequent computation "on site", and finally returns the computation result (the updated index value) to the source vault, avoiding the overhead of shipping large blocks of BWT data.
In this local-request/remote-processing mode, the local vault only has to send a processing request to the remote vault and receive the processing result it returns; no long-distance bulk data transport is needed. To further relieve network pressure, the invention customizes the communication messages between local and remote vaults around the characteristics of the LF mapping, so that they occupy as few bytes as possible while carrying the necessary information.
The functional message is defined as shown in FIG. 5; the meanings of its fields are as follows:
READ #: the read number of the request or reply. We reserve 4-8 read slots in the Input Queue, so after a read has been "transmitted", its ID among the read slots needs only 3 bits to represent.
RES/RESP: request or response identification. A request message indicates that the message was sent by another vault and needs to be processed on this vault; a response message carries the index update value returned by another vault after completing the LF mapping. Since only the two cases request/response are involved, this field needs only one bit.
SRC/DEST: source-vault and destination-vault identifiers. To mark the direction of message transmission, the message must carry the identifiers of the sender (source vault) and the receiver (destination vault); since the vaults of each group are numbered 0-15, each field needs only 4 bits.
SP_V/EP_V: SP/EP valid flag bits. These two bits mark whether the SP and EP fields of the message are valid. If SP_V and EP_V are both "1", the two BWT ranks pointed to by (sp, ep) that this iteration requires lie within the same vault. Otherwise only one of the two fields is valid, indicating that the data pointed to by sp and ep are distributed across different vaults; this is extremely rare, because, by the averaging principle, once the first two bases have been determined within the (sp, ep) interval, the BWT positions corresponding to suffixes sharing those first two bases fall within the agreed vault. Even if there is slight unevenness, the invention can arrange the data so that suffixes with the same first two bases are co-located, e.g., "AA…" is placed at vault No. 0, "AC…" at vault No. 1, and so on up to "TT…" at vault No. 15 (this placement rule is illustrated in the sketch following the message-size discussion below). Thus, when querying the first base, although the interval may span multiple memory regions, the region distribution of the first base can be recorded in advance and the processing performed directly at the corresponding position; after the first base, the BWT data of every (sp, ep) tuple falls within a single vault, i.e., each iteration only needs to send one request to the target vault.
SP/EP: storage areas for the indexes (and the updated indexes). A request message must carry the (sp, ep) index values, and a response message must return the updated (sp, ep) indexes; in the original algorithm both sp and ep require 64-bit storage.
READ [ C ]: the 2 bits in this field store the target base of the current iteration.
With the above design, the functional message defined by the invention occupies only 72 bits (9 bytes); compared with transferring 128 bytes of BWT data, the combination of functional message passing and "remote" in-place processing is undoubtedly more efficient.
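As an illustration, the sketch below packs such a functional message in Python and includes the two-base-prefix placement rule described under SP_V/EP_V. The 28-bit widths of the SP and EP fields are assumptions chosen only so that the widths stated in the text sum to exactly 72 bits; the authoritative field layout is the one in FIG. 5.

```python
# Illustrative only: field names follow the text above; the 28-bit SP/EP
# widths are our assumption so that all fields total exactly 72 bits
# (3 + 1 + 4 + 4 + 1 + 1 + 28 + 28 + 2 = 72). FIG. 5 is authoritative.

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def vault_for_prefix(b1: str, b2: str) -> int:
    """Placement rule from the SP_V/EP_V discussion: the first two bases
    of a suffix select one of 4 x 4 = 16 vaults ("AA"->0, ..., "TT"->15)."""
    return 4 * BASE_CODE[b1] + BASE_CODE[b2]

IDX_BITS = 28  # assumed message width of each of SP and EP

def pack_message(read_id, is_resp, src, dest, sp_v, ep_v, sp, ep, base):
    """Pack one functional message into 9 bytes."""
    word = read_id                   # READ#    (3 bits): read-slot ID
    word = (word << 1) | is_resp     # RES/RESP (1 bit) : request/response
    word = (word << 4) | src         # SRC      (4 bits): source vault
    word = (word << 4) | dest        # DEST     (4 bits): destination vault
    word = (word << 1) | sp_v        # SP_V     (1 bit)
    word = (word << 1) | ep_v        # EP_V     (1 bit)
    word = (word << IDX_BITS) | sp   # SP       (28 bits, assumed)
    word = (word << IDX_BITS) | ep   # EP       (28 bits, assumed)
    word = (word << 2) | base        # READ[C]  (2 bits): target base
    return word.to_bytes(9, "big")   # 72 bits = 9 bytes on the NoC

dest = vault_for_prefix("A", "C")    # suffix "AC..." lives on vault 1
msg = pack_message(read_id=5, is_resp=0, src=15, dest=dest,
                   sp_v=1, ep_v=1, sp=123456, ep=123999,
                   base=BASE_CODE["G"])
print(len(msg))                      # -> 9, versus 128 bytes of BWT ranks
```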
FIG. 6 (a) shows the original inter-vault communication process, which the invention calls "blocking message passing". Its main characteristic is that after sending a data or processing request to the destination vault, the source vault must wait for the destination vault to return the data or the processing result, and during this period the source vault can only idle. Although this communication mode is the most intuitive, its resource idling and waste are serious and lower the throughput of the processing elements.
To solve this problem, the invention proposes "non-blocking message passing", shown in FIG. 6 (b), combined with the remote in-place processing mode. After the source vault sends a processing-request message to the destination vault, the scheduler in the input queue allocates it a corresponding issue slot, and the other components need not wait for the returned result: they can access and compute for the processing requests sent by other vaults, or continue with the next processing of other returned values. The communication proceeds as follows (a minimal queue-based sketch appears after the numbered steps):
1. the source vault sends a request message to the destination (remote) vault, and then the source vault continues other processing;
2. the destination vault receives the processing request;
3. the destination vault prefetches the BWT data according to the location pointer in the message;
4. the destination vault assigns the task to a PE, which performs one LF-mapping iteration to obtain the new index value;
5. the destination vault returns the updated index as a response to the source vault;
6. the source vault receives the response of the task and waits for the scheduling of the subsequent task.
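The following is a minimal software sketch of this six-step exchange, with plain Python queues standing in for the network-on-chip; all class, function, and field names are illustrative, and the LF step is a stand-in for the real update shown earlier.

```python
# A minimal event-style sketch of the six-step non-blocking exchange above,
# using plain queues in place of the network-on-chip. Names are ours.

from collections import deque

class Vault:
    def __init__(self, vid):
        self.vid = vid
        self.inbox = deque()           # input queue (requests + responses)
        self.pending = {}              # issue slots awaiting a response

def lf_step(sp, ep, base):             # stand-in for one PE iteration
    return sp + base, ep + base        # (real LF update shown earlier)

def tick(vault, network):
    """Drain the input queue without ever blocking on a reply."""
    while vault.inbox:
        msg = vault.inbox.popleft()
        if msg["resp"]:                                # step 6: result back
            vault.pending.pop(msg["read_id"])
        else:                                          # steps 2-5: serve it
            sp, ep = lf_step(msg["sp"], msg["ep"], msg["base"])
            network[msg["src"]].inbox.append(
                {"resp": True, "read_id": msg["read_id"],
                 "src": vault.vid, "sp": sp, "ep": ep})

network = {0: Vault(0), 7: Vault(7)}
src, dst = network[0], network[7]
src.pending[42] = "read-42"                            # step 1: fire request
dst.inbox.append({"resp": False, "read_id": 42, "src": 0,
                  "sp": 10, "ep": 20, "base": 2})
tick(dst, network)     # remote vault computes in place (steps 2-5)
tick(src, network)     # source picks up the response (step 6)
print(src.pending)     # -> {} : slot freed, no busy-waiting occurred
```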
Through this communication mode, data communication between vaults and task processing on the vaults are well pipelined: the "waiting" overhead of the source vault is eliminated and the long communication latency is hidden, greatly improving throughput.
Compute-access decoupling. In traditional gene comparison accelerator designs, each PE processes a complete LF-mapping iteration: following the iterative processing flow, the PE first issues the memory request. As shown in FIG. 7 (a), each PE contains an address translation unit (AU) and a memory access unit (MU), which retrieve the BWT data from memory, after which the computation unit (CU) performs the vector computation. The PE therefore has to wait for the data to return during the memory access stage, while its computation unit sits idle. Although this incurs memory-latency waiting and compute idling inside a single PE, the traditional design masks the latency with large-scale PE parallelism, so the utilization of the overall memory bandwidth remains high.
For near-memory computing, however, the on-chip area and power budget of the logic layer are limited, so raising memory-access parallelism by instantiating a large number of PEs is constrained by on-chip resources. The optimized PE structure is shown in FIG. 7 (b): the PE retains only the computation part (CU) and some auxiliary compute-side storage (e.g., registers and an on-chip scratchpad), while the memory access unit and address translation unit (MEM) are "decoupled", moved out of the PE and placed in front of the PE array, where they are shared by the PEs of the array and supply it with data.
In fact, the decoupled memory access units are placed in front of the PE array as part of the prefetcher and continuously supply the PE array with data through prefetching. The scheduler of the input queue performs a simple address translation on the processing requests of the request queue within the input queue and sends the memory addresses to the prefetcher; the prefetcher accesses the vault storage layers according to these addresses, and the retrieved data is sent to the data cache of the PE array for the PEs' subsequent computation. In this way, the memory access and the computation of successive iterations proceed in a pipeline: while a PE processes the previous task, the prefetcher is already prefetching the data of the next task, ensuring that the PEs never wait on memory.
Through the decoupling of computation and memory access and the data prefetching it enables, all functional components operate in a streaming fashion; PE idling is eliminated and the PEs stay busy at all times, achieving higher throughput.
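The two-stage pipeline below conveys the idea in Python (sequential code, so the hardware overlap is only indicated in comments; all names and the toy computation are our own):

```python
# Illustrative two-stage software pipeline mirroring the decoupling above:
# a shared prefetch stage fetches the data for task i+1 while the compute
# stage (the PE) works on task i. Names and workload are our own.

def prefetch(task, memory):
    """Stage 1: address translation + data access (the shared MEM unit)."""
    return memory[task["addr"]]

def compute(task, data):
    """Stage 2: the PE's computation on already-fetched data (toy stand-in)."""
    return sum(data) + task["sp"]

def run_pipeline(tasks, memory):
    results = []
    data = prefetch(tasks[0], memory)            # warm-up fetch for task 0
    for i, task in enumerate(tasks):
        # in hardware these two steps overlap in time: while the PE works
        # on `data` for task i, the prefetcher fetches data for task i+1
        nxt = prefetch(tasks[i + 1], memory) if i + 1 < len(tasks) else None
        results.append(compute(task, data))
        data = nxt
    return results

memory = {0: [1, 2], 64: [3, 4], 128: [5, 6]}
tasks = [{"addr": a, "sp": s} for a, s in [(0, 10), (64, 20), (128, 30)]]
print(run_pipeline(tasks, memory))     # -> [13, 27, 41]
```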
Trading off the number of PEs. Through compute-access decoupling and prefetch optimization, the invention maximizes PE computation efficiency; but for the computation rate of the vault's PE array to match the supply rate of the prefetcher, so that the two pipeline optimally, the number of PEs in the PE array must also be explored. To achieve the parallel processing effect and consume the vault's internal memory bandwidth to the fullest, the number of PEs still needs to be scaled up: a single PE's performance alone cannot match the access speed of the prefetcher, so the computational throughput would fall short of its best. Conversely, if too many PEs are provisioned, the prefetcher cannot supply data at the rate the PE array consumes it, many PE resources sit idle, and resources are wasted.
The invention quantitatively analyzes the number of PEs per vault, as shown in FIG. 8. When each vault has no more than 3 PEs, the computation rate of the PE array cannot keep up with the data supply rate of the prefetcher: although the PEs run at 100% full load, the "consumption" of memory bandwidth is still insufficient because of the limited computational parallelism. With 4 PEs, in both the "counting" and the "judging" processes the PE idle rate is 10-20%, showing that the PE array's computation rate now exceeds the prefetcher's supply rate and the computational concurrency fully utilizes the memory access bandwidth. Beyond 4 PEs, the PE idle rate grows ever larger, i.e., the resource waste becomes more and more obvious. On this basis, the invention builds each vault's PE array from 4 PEs, which both fully utilizes the memory bandwidth resources and ensures effective use of the PE resources.
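The trade-off can also be conveyed with a toy utilization model; the rates below are assumed for illustration only, and FIG. 8 shows the actual measured behavior:

```python
# Toy utilization model (assumed rates, purely illustrative) of the
# trade-off quantified above: below the break-even PE count the prefetcher
# outruns the PE array (bandwidth under-consumed); above it, PEs idle.

PREFETCH_RATE = 4.0    # tasks the prefetcher can feed per cycle (assumed)
PE_RATE = 1.0          # tasks one PE retires per cycle (assumed)

for n_pe in range(1, 9):
    demand = n_pe * PE_RATE
    pe_idle = max(0.0, 1.0 - PREFETCH_RATE / demand)      # starved PEs
    bw_unused = max(0.0, 1.0 - demand / PREFETCH_RATE)    # wasted bandwidth
    print(f"{n_pe} PEs: idle {pe_idle:4.0%}  unused bandwidth {bw_unused:4.0%}")
# With these assumed rates the break-even point is 4 PEs per vault,
# matching the configuration the patent selects from FIG. 8.
```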
The following are system embodiments corresponding to the above method embodiments, and this embodiment mode can be implemented in cooperation with the above embodiment modes. The related technical details mentioned in the above embodiments are still valid in the present embodiment, and are not described herein again in order to reduce repetition. Accordingly, the related art details mentioned in the present embodiment can also be applied to the above-described embodiments.
The invention also provides a gene comparison acceleration system based on a near memory computing structure, which comprises:
the method comprises the following steps that a module 1 is used for obtaining a gene comparison accelerator based on a near memory computing structure, wherein the gene comparison accelerator is composed of a plurality of vertical cubic memory structures, and each cubic memory structure is formed by stacking a plurality of storage layers and a logic layer;
a module 2, grouping the plurality of vertical cubic memory structures according to the BWT reference sequence occupation space required by BWT-based gene comparison to obtain a plurality of gene comparison processing groups;
the module 3 acquires reference sequence data, divides the reference sequence data into reference data segments, respectively stores the reference data segments into a gene comparison processing group, and realizes data communication between cubic memory structures through an on-chip network of the gene comparison accelerator;
the module 4 acquires gene sequence data to be compared, divides the gene sequence data to be compared into data segments to be compared and inputs the data segments to be compared into logic layers of all cubic memory structures in a gene comparison processing group, the logic layers judge whether the reference data segments to be compared with the current data segments to be compared are located in a local storage layer, if yes, the reference data segments are acquired from the local storage layer and are subjected to gene comparison with the current data segments to be compared to obtain comparison results, and otherwise, the comparison results are obtained by adopting a functional message transmission and remote processing mode;
and the module 5 circulates the module 3 and the module 4 until all the to-be-compared data segments of the to-be-compared gene sequence data complete gene comparison, and summarizes all comparison results to obtain a complete gene comparison result of the to-be-compared gene sequence data.
In the gene comparison acceleration system based on the near-memory computing structure, the functional message passing and remote processing mode of module 4 is specifically:
the source cubic memory structure sends the data address of the reference data segment to the remote cubic memory structure that holds the reference data segment; after receiving the request, the remote cubic memory structure performs the data access and the gene comparison operation locally on its own cubic memory structure, and finally returns the comparison result to the source cubic memory structure.
In the gene comparison acceleration system based on the near-memory computing structure, a memory controller is arranged in the logic layer of each cubic memory structure to control access to the data of the storage layers; the memory controller encapsulates the underlying protocol, so that the internal network communication of the novel memory structure can be packet-based.
In the gene comparison acceleration system based on the near-memory computing structure, the module 4 comprises:
module 41: the source cubic memory structure sends a request message to the remote cubic memory structure, and then the source cubic memory structure continues other processing;
module 42: the remote cubic memory structure receives the processing request;
module 43: the remote cubic memory structure prefetches the reference data segment data according to the position pointer in the request message;
module 44: the remote cubic memory structure distributes the task to its local logic layer for one gene comparison, obtaining an index value;
module 45: the remote cubic memory structure returns the index value as a response to the source cubic memory structure;
module 46: the source cubic memory structure receives the response of the task and waits for subsequent task scheduling.
In the gene comparison acceleration system based on the near-memory computing structure, the gene comparison accelerator includes a memory access unit that is placed, as part of a prefetcher, in front of the PE array and continuously supplies it with data through prefetching: the scheduler of the input queue performs address translation on the processing requests of the request queue within the input queue to obtain memory addresses and sends them to the prefetcher; the prefetcher accesses the storage layers of the cubic memory structure according to these addresses, and the retrieved data is sent to the data cache of the PE array for the PEs' subsequent computation.

Claims (10)

1. A gene comparison acceleration method based on a near memory computing structure is characterized by comprising the following steps:
step 1, acquiring a gene comparison accelerator based on a near memory computing structure, wherein the gene comparison accelerator is composed of a plurality of vertical cubic memory structures, and each cubic memory structure is formed by stacking a plurality of storage layers and a logic layer;
step 2, grouping the plurality of vertical cubic memory structures according to the BWT reference sequence occupation space required by BWT-based gene comparison to obtain a plurality of gene comparison processing groups;
step 3, acquiring reference sequence data, splitting the reference sequence data into reference data segments, respectively storing the reference data segments into a gene comparison processing group, and realizing data communication between cubic memory structures through an on-chip network of a gene comparison accelerator;
step 4, acquiring gene sequence data to be compared, splitting the gene sequence data to be compared into data segments to be compared, and then respectively inputting the data segments to be compared into logic layers of all cubic memory structures in a gene comparison processing group, wherein the logic layers judge whether a reference data segment compared with the current data segment to be compared is located in a local storage layer, if so, acquiring the reference data segment from the local storage layer, performing gene comparison with the current data segment to be compared to obtain a comparison result, and otherwise, acquiring the comparison result by adopting a functional message transmission and remote processing mode;
and 5, circulating the step 3 and the step 4 until all the data sections to be compared of the gene sequence data to be compared complete the gene comparison, summarizing all comparison results and obtaining a complete gene comparison result of the gene sequence data to be compared.
2. The method according to claim 1, wherein the functional message passing and remote processing in step 4 are specifically:
the source cubic memory structure sends the data address of the reference data segment to the remote cubic memory structure with the reference data segment, the remote cubic memory structure locally performs data access and gene comparison operation on the cubic memory structure after receiving the request, and finally returns the comparison result to the source cubic memory structure.
3. The method as claimed in claim 1, wherein a memory controller is disposed in the logic layer of each cubic memory structure for controlling access to data in the storage layer, the memory controller encapsulating the underlying protocol so that the internal network communication of the new memory structure can be based on packet transmission.
4. The method according to claim 2, wherein the step 4 comprises:
step 41, the source cubic memory structure sends a request message to the remote cubic memory structure, and then the source cubic memory structure continues other processing;
step 42, the remote cubic memory structure receives a processing request;
step 43, the remote cubic memory structure prefetches the reference data segment data according to the position pointer in the request message;
step 44, the remote cubic memory structure distributes tasks to the local logic layer thereof for carrying out primary gene comparison to obtain an index value;
step 45, the remote cubic memory structure returns the index value as a response to the source cubic memory structure;
and step 46, the source cube memory structure receives the response of the task and waits for the subsequent task scheduling.
5. The method as claimed in claim 1, wherein the gene comparison accelerator comprises a memory access unit, which is placed before the PE array as a part of the prefetcher, and provides data for the PE array continuously by prefetching data, the scheduler of the input queue performs address conversion on the processing request of the "request" queue in the input queue to obtain a memory address, and sends the memory address to the prefetcher, the prefetcher performs data access in the storage layer of the cubic memory structure according to the corresponding memory address, and the data is retrieved and sent to the data cache of the PE array for the PE to perform subsequent calculation.
6. A gene alignment acceleration system based on a near memory computing structure, comprising:
the method comprises the following steps that a module 1 is used for obtaining a gene comparison accelerator based on a near memory computing structure, wherein the gene comparison accelerator is composed of a plurality of vertical cubic memory structures, and each cubic memory structure is formed by stacking a plurality of storage layers and a logic layer;
a module 2, grouping the plurality of vertical cubic memory structures according to the BWT reference sequence occupation space required by BWT-based gene comparison to obtain a plurality of gene comparison processing groups;
the module 3 acquires reference sequence data, divides the reference sequence data into reference data segments, respectively stores the reference data segments into a gene comparison processing group, and realizes data communication between cubic memory structures through an on-chip network of the gene comparison accelerator;
the module 4 acquires gene sequence data to be compared, splits the gene sequence data to be compared into data segments to be compared and inputs them respectively into the logic layers of the cubic memory structures in a gene comparison processing group, and the logic layers judge whether the reference data segment compared with the current data segment to be compared is located in a local storage layer; if so, the reference data segment is acquired from the local storage layer and gene comparison is performed with the current data segment to be compared to obtain a comparison result; otherwise, the comparison result is obtained by adopting a functional message passing and remote processing mode;
and the module 5 circulates the module 3 and the module 4 until all the data sections to be compared of the gene sequence data to be compared complete the gene comparison, and summarizes all comparison results to obtain a complete gene comparison result of the gene sequence data to be compared.
7. The system of claim 6, wherein the means for functional message passing and remote processing in module 4 is specifically:
the source cubic memory structure sends the data address of the reference data segment to the remote cubic memory structure with the reference data segment, and the remote cubic memory structure locally performs data access and gene comparison operation on the cubic memory structure after receiving the request, and finally returns the comparison result to the source cubic memory structure.
8. The system as claimed in claim 6, wherein a memory controller is disposed in the logic layer of each cubic memory structure for controlling access to data in the storage layer, the memory controller encapsulating underlying protocols such that internal network communication of the new memory structure is based on packet transmission.
9. The system of claim 7, wherein the module 4 comprises:
module 41: the source cubic memory structure sends a request message to the remote cubic memory structure, and then the source cubic memory structure continues other processing;
module 42: the remote cubic memory structure receives the processing request;
module 43: the remote cubic memory structure prefetches the reference data segment data according to the position pointer in the request message;
module 44: the remote cubic memory structure distributes the task to its local logic layer for one gene comparison, obtaining an index value;
module 45: the remote cubic memory structure returns the index value as a response to the source cubic memory structure;
module 46: the source cubic memory structure receives the response of the task and waits for subsequent task scheduling.
10. The system of claim 6, wherein the gene comparison accelerator comprises a memory access unit, which is disposed in front of the PE array as a part of the prefetcher, and provides data to the PE array continuously through data prefetching, the scheduler of the input queue performs address conversion on the processing request of the request queue in the input queue to obtain a memory address, and sends the memory address to the prefetcher, the prefetcher performs data access in the storage layer of the cubic memory structure according to the corresponding memory address, and the data is retrieved and sent to the data cache of the PE array for subsequent computation by the PE.
CN202010278048.4A 2020-04-10 2020-04-10 Gene comparison acceleration method and system based on near-memory computing structure Active CN111863139B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010278048.4A CN111863139B (en) 2020-04-10 2020-04-10 Gene comparison acceleration method and system based on near-memory computing structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010278048.4A CN111863139B (en) 2020-04-10 2020-04-10 Gene comparison acceleration method and system based on near-memory computing structure

Publications (2)

Publication Number Publication Date
CN111863139A CN111863139A (en) 2020-10-30
CN111863139B true CN111863139B (en) 2022-10-18

Family

ID=72985584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010278048.4A Active CN111863139B (en) 2020-04-10 2020-04-10 Gene comparison acceleration method and system based on near-memory computing structure

Country Status (1)

Country Link
CN (1) CN111863139B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254104B (en) * 2021-06-07 2022-06-21 中科计算技术西部研究院 Accelerator and acceleration method for gene analysis

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336916A (en) * 2013-07-05 2013-10-02 中国科学院数学与系统科学研究院 Sequencing sequence mapping method and sequencing sequence mapping system
CN106295250A (en) * 2016-07-28 2017-01-04 北京百迈客医学检验所有限公司 Method and device is analyzed in the quick comparison of the short sequence of secondary order-checking
CN109785905A (en) * 2018-12-18 2019-05-21 中国科学院计算技术研究所 A kind of accelerator towards gene alignment algorithm
CN110310705A (en) * 2018-03-16 2019-10-08 北京哲源科技有限责任公司 Support the sequence alignment method and device of SIMD

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6983274B2 (en) * 2002-09-23 2006-01-03 Aaron Thomas Patzer Multiple alignment genome sequence matching processor
US10381106B2 (en) * 2013-01-28 2019-08-13 Hasso-Plattner-Institut Fuer Softwaresystemtechnik Gmbh Efficient genomic read alignment in an in-memory database

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336916A (en) * 2013-07-05 2013-10-02 中国科学院数学与系统科学研究院 Sequencing sequence mapping method and sequencing sequence mapping system
CN106295250A (en) * 2016-07-28 2017-01-04 北京百迈客医学检验所有限公司 Method and device is analyzed in the quick comparison of the short sequence of secondary order-checking
CN110310705A (en) * 2018-03-16 2019-10-08 北京哲源科技有限责任公司 Support the sequence alignment method and device of SIMD
CN109785905A (en) * 2018-12-18 2019-05-21 中国科学院计算技术研究所 A kind of accelerator towards gene alignment algorithm

Also Published As

Publication number Publication date
CN111863139A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
US10936536B2 (en) Memory processing core architecture
US11681650B2 (en) Execution engine for executing single assignment programs with affine dependencies
CN104699631B (en) It is multi-level in GPDSP to cooperate with and shared storage device and access method
Kumar et al. PAMI: A parallel active message interface for the Blue Gene/Q supercomputer
CN109785905B (en) Accelerating device for gene comparison algorithm
CN111124675B (en) Heterogeneous memory computing device oriented to graph computation and operation method thereof
CN101717817B (en) Method for accelerating RNA secondary structure prediction based on stochastic context-free grammar
JP2004537106A (en) System and method for a web server using a reconfigurable processor operating under a single operating system image
CN102929900A (en) Method and device for matching character strings
WO2023097970A1 (en) Many-core definable distributed shared storage structure
CN100489830C (en) 64 bit stream processor chip system structure oriented to scientific computing
CN114398308A (en) Near memory computing system based on data-driven coarse-grained reconfigurable array
CN111863139B (en) Gene comparison acceleration method and system based on near-memory computing structure
CN1816012A (en) Scalable, high-performance, global interconnect scheme for multi-threaded, multiprocessing system-on-a-chip network processor unit
Lee et al. Task parallelism-aware deep neural network scheduling on multiple hybrid memory cube-based processing-in-memory
CN111653317B (en) Gene comparison acceleration device, method and system
US11061676B2 (en) Scatter gather using key-value store
CN116777725A (en) Data multicasting across programmatic controls of multiple compute engines
KR20230062651A (en) On-Chip Interconnect to Memory Channel Controllers
Lasch et al. Bandwidth-optimal relational joins on FPGAs
Liu et al. A program behavior study of block cryptography algorithms on GPGPU
JP2024518587A (en) A programmable accelerator for data-dependent irregular operations.
CN112068955B (en) Communication optimization method in heterogeneous multi-core platform processor and electronic equipment
CN105718380B (en) Cellular array computing system
Akbari et al. A customized processing-in-memory architecture for biological sequence alignment

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant