CN113257356A - Gene sequencing data external sequencing method and device based on different storage levels - Google Patents

Gene sequencing data external sequencing method and device based on different storage levels

Info

Publication number
CN113257356A
Authority
CN
China
Prior art keywords
data
memory
gene sequencing
sequencing
sorted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110633578.0A
Other languages
Chinese (zh)
Inventor
谭光明
刘万奇
李叶文
康宁
孙凝晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Western Research Institute Of China Science And Technology Computing Technology
Original Assignee
Western Research Institute Of China Science And Technology Computing Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Western Research Institute Of China Science And Technology Computing Technology filed Critical Western Research Institute Of China Science And Technology Computing Technology
Priority to CN202110633578.0A priority Critical patent/CN113257356A/en
Publication of CN113257356A publication Critical patent/CN113257356A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses an external sorting method and device for gene sequencing data based on different storage levels, belonging to the technical field of computer architecture design and data sorting. The method comprises the following steps: reading the storage capacity required by the data to be sorted to determine the size of the data; if the size exceeds a preset threshold, sorting the data hierarchically through a first memory, a second memory and a third memory; and, after the data have been sorted, writing the sorted result back to the external memory for storage. The technical scheme maintains the performance of sorting gene sequencing data while improving its efficiency.

Description

Gene sequencing data external sorting method and device based on different storage levels
Technical Field
The invention relates to the technical field of computer architecture design and data sorting, and in particular to an external sorting method and device for gene sequencing data based on different storage levels.
Background
With the rapid development of bioinformatics, gene analysis has become a widely used technique in scientific research and industry, and has been applied successfully to species identification, disease diagnosis and similar tasks. Gene analysis builds on gene sequencing technology, and second-generation sequencing is the technology generally adopted at present.
The cost of next-generation sequencing keeps falling, so the volume of gene sequencing data is growing rapidly; the trend is becoming more pronounced, and the data volume is expected to reach an enormous scale. To process massive gene sequencing data, a complete gene analysis pipeline must run on a modern computing system, and sorting is an important step after the sequencing data have been aligned with a reference sequence.
The gene data to be sorted may be so large that they cannot be read into memory for computation, so external sorting is required. The scheme widely used at present is software external sorting: the processor acts as the control and computation unit, intermediate data are moved between memory and hard disk, and the intermediate data are merged to obtain the final sorted result. However, because this scheme uses the processor as the processing unit throughout the sort, it burdens the CPU and makes sorting inefficient.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an external sorting method and device for gene sequencing data based on different storage levels, so that sorting performance is maintained while sorting efficiency is improved.
The basic scheme provided by the invention is as follows:
the external sorting method for gene sequencing data based on different storage levels comprises the following steps:
reading the storage capacity required by the data to be sorted to determine the size of the data to be sorted;
if the size of the data to be sorted exceeds a preset threshold, sorting the data to be sorted hierarchically through a first memory, a second memory and a third memory;
after the data to be sorted have been sorted, writing the sorted result back to an external memory for storage.
The principle of the basic scheme is as follows:
In this scheme, the external sorting method is built on an in-storage-computing sorting engine for gene sequencing data, which sorts the records by name or by coordinate. Specifically, the storage capacity of the data to be sorted is read to determine the size of the data; when the size exceeds a preset threshold, the data are sorted hierarchically across multiple levels through the first memory, the second memory and the third memory, and the result of the external sort is stored in the external memory.
Sorting the data hierarchically through the first, second and third memories means that, when the data to be sorted are too large to fit in internal memory, they are sorted by a hardware sorting network that operates across the internal memory and the external memory.
The basic scheme has the beneficial effects that:
(1) In this scheme, data to be sorted that occupy more storage than the internal memory provides are sorted hierarchically across the first, second and third memories. Compared with sorting the large data set directly, multi-level sorting divides it into smaller pieces that are sorted separately; the levels can operate in parallel and be time-division multiplexed, which improves the efficiency of external sorting. At the same time, the large capacity of the external memory allows larger gene sequencing data sets to be stored, which improves the performance of external sorting.
(2) In this scheme, the sorting of gene sequencing data is completed by a hardware network formed by the first memory, the second memory and the third memory, which avoids the burden that software sorting places on the processor.
(3) In this scheme, the sorting of gene sequencing data is completed by a hardware network formed by the first memory, the second memory and the third memory, which avoids the large I/O (input/output) overhead of moving the data to be sorted back and forth between the processor and storage, and thereby improves the performance of sorting gene sequencing data.
Further, the storage capacity of the first memory is smaller than that of the second memory, and the storage capacity of the second memory is smaller than that of the third memory.
In this scheme, the data to be sorted are sorted hierarchically through multiple levels of memory with different storage capacities, which improves the performance of sorting gene sequencing data.
Further, the first memory is a static random access memory, the second memory is a dynamic random access memory, and the third memory is an external memory.
In this scheme, the memories of different capacities are further specified as static random access memory, dynamic random access memory and external memory, which makes it easier to divide the data to be sorted among the levels of the hierarchical sort.
Further, the step of sorting the data to be sorted hierarchically through a first memory, a second memory and a third memory if the size of the data to be sorted exceeds a preset threshold comprises:
equally dividing the data to be sorted into a plurality of first gene sequencing data through the first memory;
performing lossless compression on each first gene sequencing data;
and performing bitonic sorting on each losslessly compressed first gene sequencing data.
In this scheme, when the size of the data to be sorted exceeds the preset threshold, the data are equally divided into a plurality of first gene sequencing data through the first memory, that is, through the static random access memory. Each first gene sequencing data is losslessly compressed and then sorted by bitonic sorting. Bitonic sorting is well suited to hardware implementation, so the sorting of the first gene sequencing data is completed directly at the first memory. This speeds up sorting of the partitioned data, avoids the storage-bandwidth limit encountered when the data are sorted externally, and improves the bandwidth utilization of the external memory.
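The compare-and-swap pattern that makes bitonic sorting attractive for hardware can be illustrated in software. The following is a minimal sketch, not the patented circuit: the function name `bitonic_sort` is hypothetical, the input length is assumed to be a power of two, and plain integers stand in for compressed gene records.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network.

    Every pass of the inner loop performs a fixed set of compare-and-swap
    operations that do not depend on the data, which is why the pattern
    maps naturally onto parallel hardware comparators.
    """
    data = list(values)
    n = len(data)
    if n & (n - 1):
        raise ValueError("length must be a power of two (assumption of this sketch)")
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data
```

In hardware, all comparisons within one pass of the `for` loop could run at the same time; the sequential loop here only mimics that schedule.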
Further, the step of performing bitonic sorting on each losslessly compressed first gene sequencing data is followed by:
merging the sorted first gene sequencing data into a plurality of second gene sequencing data according to a tree structure through the second memory;
and performing bitonic sorting on the merged second gene sequencing data.
The step of performing bitonic sorting on the merged second gene sequencing data is further followed by:
merging the sorted second gene sequencing data into a plurality of third gene sequencing data according to a tree structure through the third memory;
and performing bitonic sorting on the merged third gene sequencing data.
The step of performing bitonic sorting on the merged third gene sequencing data is followed by:
merging the sorted third gene sequencing data into ordered final gene sequencing data according to the tree structure and writing the ordered final gene sequencing data back to the external memory.
In this scheme, the sorted first gene sequencing data are merged according to a tree structure through the dynamic random access memory, and the resulting second gene sequencing data are bitonically sorted on the hardware network. The sorted second gene sequencing data are then merged according to the tree structure through the external memory, and the resulting third gene sequencing data are bitonically sorted on the hardware network. Compared with sorting the partitioned gene sequencing data directly in the external memory, this costs less hardware, lends itself to parallel operation, and improves the efficiency of sorting gene sequencing data.
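The tree-structured merging described above can be sketched as follows. This is an illustrative software stand-in under stated assumptions: the function name `tree_merge` is hypothetical, each tree node merges exactly two runs, and an ordinary two-way merge (`heapq.merge` from the Python standard library) replaces the hardware merge-then-bitonic-sort stage of the scheme.

```python
import heapq


def tree_merge(runs):
    """Merge already-sorted runs pairwise, level by level, into one run.

    Each pass over `level` corresponds to one level of the merge tree;
    the merges inside a level are independent of one another and could
    therefore proceed in parallel.
    """
    level = [list(run) for run in runs]
    if not level:
        return []
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]          # one or two sorted runs
            next_level.append(list(heapq.merge(*pair)))
        level = next_level
    return level[0]
```

With four sorted runs, the first level produces two merged runs and the second level produces the final ordered result, mirroring the first, second and final gene sequencing data of the text.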
Further, equally dividing the data to be sorted into a plurality of first gene sequencing data specifically comprises:
the data to be sorted contain N read pairs; the N read pairs are equally divided into T parts, so that each first gene sequencing data contains N/T read pairs.
The data to be sorted are divided equally by read pair into a plurality of first gene sequencing data. This makes hierarchical external sorting of the first gene sequencing data convenient, allows the sort of a large data set to be completed on the storage side, and improves the performance of sorting gene sequencing data.
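The equal division of N read pairs into T parts can be written out as a short sketch. The function name `split_read_pairs` is hypothetical, and the sketch assumes that N is an exact multiple of T, since the text does not say how a remainder would be handled.

```python
def split_read_pairs(read_pairs, t):
    """Divide N read pairs into T equal blocks of N/T read pairs each."""
    n = len(read_pairs)
    if t <= 0 or n % t != 0:
        # Remainder handling is not specified in the source; reject it here.
        raise ValueError("T must be positive and divide N exactly")
    size = n // t
    return [read_pairs[i * size:(i + 1) * size] for i in range(t)]
```

For example, twelve read pairs split with `t = 4` give four blocks of three read pairs, and concatenating the blocks restores the original order.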
Further, the lossless compression specifically encodes the repeated information in each first gene sequencing data according to a directed acyclic graph.
Encoding the repeated information of each first gene sequencing data with a directed acyclic graph compresses that repeated information and reduces the size of the data. The method can also sort directly on files in the compressed format, which improves the bandwidth utilization of the external memory.
In addition, to achieve the above object, the present invention further provides an external sorting device for gene sequencing data based on different storage levels, wherein the device includes:
a memory, a processor, and an external sorting program for gene sequencing data based on different storage levels, wherein the program is stored on the memory and can run on the processor, and, when executed by the processor, implements the steps of the external sorting method for gene sequencing data based on different storage levels.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the external sorting method for gene sequencing data based on different storage levels according to the present invention;
FIG. 2 is a schematic diagram of the gene alignment and analysis module according to an embodiment of the external sorting method for gene sequencing data based on different storage levels;
FIG. 3 is a schematic circuit diagram of internal/external sorting according to an embodiment of the external sorting method for gene sequencing data based on different storage levels of the present invention;
FIG. 4 is a schematic diagram of the hierarchical structure of external sorting according to an embodiment of the external sorting method for gene sequencing data based on different storage levels of the present invention;
FIG. 5 is a schematic structural diagram of block compression in external sorting according to an embodiment of the external sorting method for gene sequencing data based on different storage levels of the present invention;
FIG. 6 is a schematic diagram of an embodiment of the in-storage-computing sorting engine for gene sequencing data according to the present invention.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The following is further detailed by way of specific embodiments:
The external sorting device for gene sequencing data based on different storage levels in this scheme may be a terminal device providing the hardware operating environment. In the embodiment of the invention it may be a terminal device such as a PC (personal computer) or a portable computer.
The terminal device may include: a processor, a communication bus, a user interface, a network interface and a memory. The communication bus provides connection and communication among the processor, the user interface, the network interface and the memory. The user interface may include a display screen (Display) and an input unit such as a keyboard (Keyboard), a tablet or a stylus pen, and may optionally also include a standard wired interface and a wireless interface. The network interface may optionally include a standard wired interface (e.g., an RJ45 interface) and a wireless interface (e.g., a WIFI interface).
In the sorting device of this scheme, the user interface is mainly used for data communication with each terminal; the network interface is mainly used to connect to a background server and exchange data with it; and the processor can be used to call the external sorting program for gene sequencing data based on different storage levels stored in the memory and to execute the following operations, as shown in FIG. 1:
Step S100, acquiring the data to be sorted generated after a Fastq file is aligned with a reference sequence;
That is, referring to FIG. 2, before the data are sorted, one Fastq file, or several Fastq files, must be aligned with a reference sequence to generate SAM intermediate data. The gene sequencing data in the SAM intermediate data are unordered and must be sorted by the name of each gene fragment or by the position of the fragment on the reference sequence. It should be noted that a Fastq file stores many fragmented gene fragments; aligning the Fastq file with the reference sequence determines the position of each fragment on the existing reference sequence, from which the complete sequenced gene sequence is obtained.
Furthermore, in the gene analysis pipeline for paired-end sequencing, two Fastq files are first aligned with a reference sequence, which usually produces a large data set to be sorted (a SAM file). The SAM file is unordered and must be sorted by the names of the gene sequencing data in it or by their coordinates on the reference sequence before the data can be analyzed. The size of the SAM file to be sorted is positively correlated with the size of the input gene data files, which may be small, such as 2GB, 8GB and 16GB, or large, such as 128GB and 256GB.
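The two sort orders mentioned above, by name and by coordinate, can be expressed as sort keys over SAM alignment lines. The sketch below relies on the standard SAM column layout, in which the tab-separated fields begin with QNAME, FLAG, RNAME and POS. The helper names are hypothetical, header lines starting with `@` are assumed to have been removed already, and reference names are compared as plain strings, which is a simplification of this sketch rather than something the text specifies.

```python
def parse_sam_line(line):
    """Extract the fields needed for sorting from one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],      # read (fragment) name
        "rname": fields[2],      # reference sequence name
        "pos": int(fields[3]),   # 1-based leftmost position on the reference
        "raw": line,
    }


def by_name(record):
    """Sort key for ordering by fragment name."""
    return record["qname"]


def by_coordinate(record):
    """Sort key for ordering by position on the reference sequence."""
    return (record["rname"], record["pos"])


lines = [
    "r2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr2\t50\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
records = [parse_sam_line(line) for line in lines]
name_order = [r["qname"] for r in sorted(records, key=by_name)]
coord_order = [r["qname"] for r in sorted(records, key=by_coordinate)]
```

Either key could be supplied to whichever sorter is in use, so the choice of sort order stays independent of the choice between internal and external sorting.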
Step S200, reading the storage capacity required by the data to be sorted to determine the size of the data to be sorted;
Step S300, if the size of the data to be sorted exceeds a preset threshold, sorting the data to be sorted hierarchically through a first memory, a second memory and a third memory;
In this implementation, when the size of the data to be sorted does not exceed the preset threshold, the gene sequencing data are sorted by a conventional processor (CPU) together with internal memory (DRAM); if the size exceeds the preset threshold, the data are sorted hierarchically through the first memory, the second memory and the third memory.
Gene sequencing data vary in size, so the data to be sorted (SAM files) produced by aligning them with the reference sequence vary correspondingly. The internal memory used in this scheme is relatively small and fast, while the external memory is relatively large and slow. Internal sorting is used when the data are small, and external sorting, that is, hierarchical sorting through the first, second and third memories, is used when the data are large. This adaptive approach exploits the strengths of each kind of memory and avoids its weaknesses.
The circuit structure of the scheme is shown in FIG. 3. First, internal or external sorting is selected according to the size of the SAM file generated by alignment and the size of the actual internal memory of the computing system. For example, when the SAM file to be sorted is 2GB and the internal memory is 16GB, internal sorting suffices: the file is treated as an ordinary sorting task and sorted by running a quicksort algorithm on a conventional processor and internal memory (CPU-DRAM) system. When the SAM file is 200GB and the internal memory is only 16GB, an external sorting scheme is needed. External sorting is then an I/O-intensive task in which the external memory also takes part: the SAM file is partitioned into blocks, the blocks are sorted with a bitonic sorting algorithm, and the sort proceeds level by level from small blocks to large ones.
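The internal/external decision in this example reduces to one comparison. The sketch below assumes that the preset threshold simply equals the size of the internal memory; the text only says that the threshold is set according to that size, so the exact rule, the function name `choose_sort_mode` and the constant `GB` are illustrative.

```python
GB = 1024 ** 3


def choose_sort_mode(data_bytes, internal_memory_bytes):
    """Return "internal" or "external" for a data set of the given size.

    Assumption of this sketch: the threshold is the full internal memory
    size. A real system would likely reserve part of the memory.
    """
    threshold = internal_memory_bytes
    return "external" if data_bytes > threshold else "internal"


small_case = choose_sort_mode(2 * GB, 16 * GB)    # the 2GB SAM file above
large_case = choose_sort_mode(200 * GB, 16 * GB)  # the 200GB SAM file above
```

Here `small_case` evaluates to `"internal"` and `large_case` to `"external"`, matching the two cases described in the text.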
Further, the storage capacity of the first memory is smaller than that of the second memory, and the storage capacity of the second memory is smaller than that of the third memory; specifically, the first memory is a static random access memory, the second memory is a dynamic random access memory, and the third memory is an external memory. Sorting hierarchically through memories of different storage capacities allows the data to be sorted to be divided more effectively among the levels, which improves the performance of sorting gene sequencing data.
Step S400, after the data to be sorted have been sorted, writing the sorted result back to an external memory for storage.
Based on the above embodiment, to fully exploit the performance of external sorting, and referring to FIG. 4 and FIG. 5, the scheme sorts larger gene sequencing data externally using three storage levels: an SRAM level, a DRAM level and a Flash level. The first storage level is the on-chip cache (SRAM), that is, the static random access memory of this scheme; the second storage level is the internal memory (DRAM), that is, the dynamic random access memory of this scheme; the third storage level is flash memory (Flash), that is, the external memory of this scheme. Proceeding level by level in this way yields the best external sorting performance attainable.
Further, the data to be sorted initially reside in Flash. After blocking and compression, a sorter at the SRAM level sorts the data blocks produced by partitioning the data in Flash, and the sorted results are sent to a merger at the DRAM level. One merger can merge the sorted data blocks from several SRAM levels, and several mergers at the DRAM level merge the sorted blocks iteratively; the resulting merged block is finally written to Flash, which completes the sorting of one unordered Flash data block. FIG. 5 shows Flash-level sorting, in which the data blocks sorted at the DRAM level are merged step by step through a tree structure until the complete sorted gene sequencing data are obtained at the Flash level. It should be noted that the lossless compression involved improves bandwidth utilization and saves flash storage space.
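How many merge rounds such a hierarchy needs can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope model only: `run_records` (how many records one SRAM-level sorter handles) and `fan_in` (how many sorted blocks one merger combines) are hypothetical parameters that the text does not quantify.

```python
import math


def merge_plan(total_records, run_records, fan_in):
    """Return (initial sorted runs, merge rounds needed to reach one run)."""
    if run_records <= 0 or fan_in < 2:
        raise ValueError("run_records must be positive and fan_in at least 2")
    runs = math.ceil(total_records / run_records)
    rounds = 0
    remaining = runs
    while remaining > 1:
        # Each round merges groups of up to `fan_in` sorted runs.
        remaining = math.ceil(remaining / fan_in)
        rounds += 1
    return runs, rounds


runs, rounds = merge_plan(total_records=1000, run_records=10, fan_in=4)
```

With these illustrative numbers, 100 initial runs shrink to 25, 7, 2 and finally 1, so `rounds` is 4. A larger fan-in per merger reduces the number of rounds at the cost of wider merge hardware.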
In this implementation, the external sorting method for gene sequencing data based on different storage levels adapts between internal and external sorting: whether internal or external sorting is used is determined by the size of the data to be sorted. In this scheme, the internal memory refers to the DRAM main memory of a conventional architecture, and the external memory refers to the Flash storage in a hard disk.
Further, the method sorts the records by name or coordinate using the in-storage-computing sorting engine for gene sequencing data, and comprises internal/external sorting decision, quicksort, external-sort data blocking, lossless compression, bitonic sorting and external-sort merging. The scheme applies to next-generation sequencing (NGS) gene preprocessing, in which many fragmented gene segments are stored in Fastq files. Before the internal/external sorting decision, one Fastq file, or two Fastq files, must be aligned with a reference sequence to generate the SAM data to be sorted. The gene sequencing data in the SAM data are unordered and must be sorted by the names of the gene segments or by their positions on the reference sequence. Alignment determines the position of each fragmented segment on the existing reference sequence so as to obtain the complete sequenced gene sequence.
After the aligned data to be sorted are obtained, their storage capacity is read to determine their size. The internal/external sorting decision selects internal or external sorting according to that size, using a preset threshold that is set dynamically. External sorting is used when the data exceed the preset threshold, and internal sorting when they do not. Quicksort is used for internal sorting, with the sort performed by a conventional processor (CPU) and internal memory (DRAM). External-sort data blocking divides a large data set equally and sorts each resulting first gene sequencing data. Bitonic sorting is used for external sorting; the algorithm is well suited to hardware implementation, so the sort is completed by a hardware sorting network spanning the internal and external memories. External-sort merging combines the sorted first gene sequencing data into ordered final gene sequencing data. Lossless compression compresses the first gene sequencing data in the external memory during external sorting to improve the bandwidth utilization of the external memory; the lossless compression algorithm also allows sorting to be performed directly on the compressed data.
In this embodiment, data to be sorted with a large data volume are divided equally. Specifically, for gene sequencing data (SAM file/Fastq file), the read pairs (reads) are the units to be sorted; if one data set contains N read pairs and is divided equally into T parts, each first gene sequencing data contains N/T read pairs. Each first gene sequencing data is sorted internally by a hardware sorting tree, and the first gene sequencing data are merged by a hardware merging tree.
It should be noted that Fastq is a text format that stores biological sequences (usually nucleic acid sequences) together with their quality scores, encoded in ASCII; it is a standard format for high-throughput gene sequencing. Internal sorting takes place in internal memory, whereas external sorting combines internal and external memory, with data exchanged between the two. The dynamically set preset threshold can be adapted to the size of the internal memory, so that internal or external sorting is chosen according to the size of the data to be sorted.
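For readers unfamiliar with the Fastq format mentioned above, a minimal reader can make the structure concrete. The sketch assumes the common four-line layout per record (a header line starting with `@`, the sequence, a separator line starting with `+`, and the quality string) with no line wrapping; the function name `parse_fastq` is hypothetical and the sample record is made up for illustration.

```python
def parse_fastq(text):
    """Parse unwrapped four-line Fastq records into (name, sequence, quality)."""
    lines = [line for line in text.splitlines() if line]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per record")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed record starting at line %d" % (i + 1))
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        records.append((header[1:], sequence, quality))
    return records


sample = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nFFFF\n"
reads = parse_fastq(sample)
```

Each quality character encodes one score in ASCII, which is why the sequence and quality lines must have the same length.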
In one embodiment, referring to fig. 3, the gene sequencing data external ordering device based on different storage levels performs ordered ordering according to names or coordinates in the gene sequencing data according to the gene sequencing data ordering engine calculated in the storage, and may be an asic, including: the system comprises an internal/external sequencing judger, an external memory chip, a dual tone sequencer, a data merger and a fast sequencing processor which are connected with the internal/external sequencing judger in sequence, wherein the external memory chip is provided with a data partitioning device and a lossless compressor;
the input end of the data blocking device is the input end of the external memory chip, the output end of the data blocking device is connected with the input end of the lossless compressor, and the output end of the lossless compressor is the output end of the external memory chip;
the fast sequencing processor is connected with an internal memory, the double-tone sequencer is connected with an on-chip buffer and an internal memory, and the data merger is connected with an external memory.
In this embodiment, the external memory chip, that is, the flash memory chip in fig. 3, has a data blocking device and a lossless compressor, and thus the storage and calculation unit unloads the compression step to the programmable hardware logic unit in the integrated circuit of the present embodiment, so that the gene sequencing data is compressed in the storage process, thereby implementing the overlap of data input/output (I/O) and calculation, and reducing the time overhead of switching between the steps of the gene sequencing data; in addition, because the hardware is used for unloading the compression flow, high concurrent processing of the compression process can be realized, and the time overhead caused by data compression and decompression in the switching process of the traditional gene sequencing flow is further reduced.
The method for external sorting of gene sequencing data based on different storage levels can run on a device for external sorting of gene sequencing data based on different storage levels, and the device may comprise a memory, a processor, a communication bus, and a program for external sorting of gene sequencing data based on different storage levels that is stored in the memory, wherein:
the communication bus is used for realizing connection and communication between the processor and the memory;
the processor is used for executing the program for external sorting of gene sequencing data based on different storage levels, so as to control the normal operation of the device.
In this embodiment, the ASIC is the carrier that implements the gene data sorting algorithm shown in Fig. 2. The internal/external sort decider selects either the quicksort processor or the bitonic sorter according to a preset threshold and the size of the SAM file to be sorted. The SAM file to be sorted, stored in the external memory, is divided into blocks and compressed by the data blocking device and the lossless compressor provided in the external memory chip. The blocked and compressed gene sequencing data are sent to the bitonic sorter for preliminary sorting; the data merger then merges the preliminary results into larger sorted results, this is iterated, and finally the data merger sends the result to the external memory.
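The flow of this embodiment can be sketched in software as follows. This is an illustrative sketch under stated assumptions, not the circuit itself: the function names are hypothetical, the rule that a file larger than the threshold goes to the bitonic path is taken from claim 1, blocks are padded to a power-of-two length because a textbook bitonic network requires it, `heapq.merge` stands in for the data merger, and the compression step is omitted.

```python
import heapq


def choose_sorter(file_size: int, threshold: int) -> str:
    """Assumed decision rule: files above the threshold are sorted externally."""
    return "bitonic" if file_size > threshold else "quicksort"


def split_equally(records, parts):
    """Divide the records into at most `parts` blocks of (nearly) equal size."""
    size = max(1, -(-len(records) // parts))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def _bitonic_merge(values, ascending):
    """Merge a bitonic sequence into a monotonic one."""
    if len(values) <= 1:
        return list(values)
    values = list(values)
    half = len(values) // 2
    for i in range(half):
        if (values[i] > values[i + half]) == ascending:
            values[i], values[i + half] = values[i + half], values[i]
    return (_bitonic_merge(values[:half], ascending)
            + _bitonic_merge(values[half:], ascending))


def bitonic_sort(values, ascending=True):
    """Sort a list whose length is a power of two with a bitonic network."""
    if len(values) <= 1:
        return list(values)
    half = len(values) // 2
    first = bitonic_sort(values[:half], True)
    second = bitonic_sort(values[half:], False)
    return _bitonic_merge(first + second, ascending)


def sort_block(block):
    """Pad a block with +inf to a power-of-two length, sort, drop the padding."""
    size = 1
    while size < len(block):
        size *= 2
    padded = list(block) + [float("inf")] * (size - len(block))
    return bitonic_sort(padded)[:len(block)]


def merge_tree(runs):
    """Merge sorted runs pairwise, level by level, until one run remains."""
    runs = [list(r) for r in runs]
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + 2])) for i in range(0, len(runs), 2)]
    return runs[0] if runs else []


def external_sort(records, parts=4):
    """Block the input, sort each block, then merge the sorted blocks."""
    return merge_tree(sort_block(b) for b in split_equally(records, parts))


# Made-up coordinates used only for illustration.
positions = [830, 12, 507, 99, 12, 640, 3, 275, 451, 58]
result = external_sort(positions)
```

In the sketch each level of `merge_tree` halves the number of sorted runs, which mirrors the repeated merging of preliminary results into larger sorted results.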
It should be noted that, referring to Fig. 6, the in-storage-computing sorting engine sorts the gene sequencing data by name or by coordinate. The engine has a flash memory controller and a flash translation layer: the flash memory controller controls reads and writes of the flash memory serving as the external memory, and the flash translation layer handles the translation between logical and physical addresses and the scheduling of flash memory accesses. A configurator and a scheduler are connected to the flash translation layer. The configurator can receive the size of the SAM file and write the configuration information obtained by parsing it into the integrated circuit; the scheduler can receive information on the SAM file as equally divided by the data blocking device and, in cooperation with the flash translation layer, controls the operation of the in-storage-computing sorting engine. The flash memory chip is provided with hardware execution units for blocking and compression, and the integrated circuit for sorting gene sequencing data carries out the actual sorting task.
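Sorting by name or by coordinate amounts to choosing a different key for each SAM record. The sketch below is an illustration only: it relies on the SAM text format, in which tab-separated column 1 is the read name (QNAME), column 3 the reference name (RNAME) and column 4 the 1-based position (POS). The helper names, the `ref_order` mapping and the use of plain string comparison for names are assumptions, not details given in this document.

```python
def name_key(sam_line: str) -> str:
    """Key for name order: the read name in column 1 (plain string order)."""
    return sam_line.split("\t", 1)[0]


def coordinate_key(sam_line: str, ref_order: dict[str, int]) -> tuple[int, int]:
    """Key for coordinate order: (reference rank, 1-based position).

    References missing from ref_order (for example '*') are placed last.
    """
    fields = sam_line.split("\t")
    return (ref_order.get(fields[2], len(ref_order)), int(fields[3]))


# Made-up alignment records used only for illustration.
records = [
    "r3\t0\tchr2\t150\t60\t4M\t*\t0\t0\tACGT\tFFFF",
    "r1\t0\tchr1\t900\t60\t4M\t*\t0\t0\tACGT\tFFFF",
    "r2\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF",
]
ref_order = {"chr1": 0, "chr2": 1}  # hypothetical order of the references

by_name = sorted(records, key=name_key)
by_coord = sorted(records, key=lambda line: coordinate_key(line, ref_order))
```

Either key could be handed to whichever sorter is selected, so the choice of sort order stays independent of the choice of sorting hardware.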
For the steps implemented when the gene sequencing data sorting program running on the processor is executed, reference can be made to the embodiments of the method for external sorting of gene sequencing data based on different storage levels in the present invention; they are not described again here.
The foregoing are merely exemplary embodiments of the present invention. No attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of it; the description, taken together with the drawings, makes apparent to those skilled in the art how the several forms of the invention may be embodied in practice. It should be noted that those skilled in the art can make several changes and modifications without departing from the structure of the present invention. Such changes and modifications should also be regarded as falling within the protection scope of the present invention, and they do not affect the effect of implementing the invention or the practicability of the patent. The scope of protection of the present application shall be determined by the contents of the claims, and the description of the embodiments and the like in the specification may be used to interpret the contents of the claims.

Claims (10)

1. A method for external sorting of gene sequencing data based on different storage levels, characterized by comprising the following steps:
reading the storage capacity required by the data to be sorted, so as to determine the size of the data to be sorted;
if the size of the data to be sorted exceeds a preset threshold, sorting the data to be sorted hierarchically through a first memory, a second memory and a third memory;
after the data to be sorted have been sorted, writing the sorting result of the data to be sorted back to an external memory for storage.
2. The method of claim 1, wherein the storage capacity of the first memory is smaller than that of the second memory, and the storage capacity of the second memory is smaller than that of the third memory.
3. The method of claim 2, wherein the first memory is a static random access memory, the second memory is a dynamic random access memory, and the third memory is an external memory.
4. The method of claim 3, wherein the step of sorting the data to be sorted hierarchically through the first memory, the second memory and the third memory if the size of the data to be sorted exceeds the preset threshold comprises:
equally dividing the data to be sorted into a plurality of pieces of first gene sequencing data through the first memory;
performing lossless compression on each piece of first gene sequencing data in the data to be sorted;
and performing bitonic sorting on each piece of first gene sequencing data after lossless compression.
5. The method of claim 4, wherein the step of bitonic sorting of the losslessly compressed first gene sequencing data comprises:
merging the sorted first gene sequencing data into a plurality of pieces of second gene sequencing data according to a tree structure through the second memory;
and performing bitonic sorting on the merged second gene sequencing data.
6. The method of claim 5, wherein the step of bitonic sorting of the merged second gene sequencing data further comprises:
merging the sorted second gene sequencing data into a plurality of pieces of third gene sequencing data according to a tree structure through the third memory;
and performing bitonic sorting on the merged third gene sequencing data.
7. The method of claim 6, wherein the step of bitonic sorting of the merged third gene sequencing data comprises:
merging the sorted third gene sequencing data into ordered final gene sequencing data according to the tree structure, and writing the ordered final gene sequencing data back to the external memory.
8. The method for external sorting of gene sequencing data based on different storage levels according to claim 4, wherein the step of equally dividing the data to be sorted into a plurality of pieces of first gene sequencing data is specifically as follows:
the data to be sorted contain N read pairs, the N read pairs are equally divided into T parts, and the number of read pairs in each piece of first gene sequencing data after the equal division is N/T.
9. The method for external sorting of gene sequencing data based on different storage levels according to claim 4, wherein the lossless compression specifically encodes the repeated information of each piece of first gene sequencing data according to a directed acyclic graph.
10. A device for external sorting of gene sequencing data based on different storage levels, characterized by comprising:
a memory, a processor, and a program for external sorting of gene sequencing data based on different storage levels, wherein the program is stored in the memory and can run on the processor, and, when executed by the processor, implements the method for external sorting of gene sequencing data based on different storage levels according to any one of claims 1 to 9.
CN202110633578.0A 2021-06-07 2021-06-07 Gene sequencing data external sequencing method and device based on different storage levels Pending CN113257356A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110633578.0A CN113257356A (en) 2021-06-07 2021-06-07 Gene sequencing data external sequencing method and device based on different storage levels

Publications (1)

Publication Number Publication Date
CN113257356A true CN113257356A (en) 2021-08-13

Family

ID=77186873

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110633578.0A Pending CN113257356A (en) 2021-06-07 2021-06-07 Gene sequencing data external sequencing method and device based on different storage levels

Country Status (1)

Country Link
CN (1) CN113257356A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110055232A1 (en) * 2009-08-26 2011-03-03 Goetz Graefe Data restructuring in multi-level memory hierarchies
CN108197433A (en) * 2017-12-29 2018-06-22 厦门极元科技有限公司 Datarams and hard disk the shunting storage method of rapid DNA sequencing data analysis platform
CN111261227A (en) * 2020-01-20 2020-06-09 苏州浪潮智能科技有限公司 Sequencing data storage method, device and equipment and computer readable storage medium
WO2020182175A1 (en) * 2019-03-14 2020-09-17 Huawei Technologies Co., Ltd. Method and system for merging alignment and sorting to optimize

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZEKUN YIN et al.: "Computing Platforms for Big Biological Data Analytics: Perspectives and Challenges", Computational and Structural Biotechnology Journal, vol. 15, pages 403-411 *
WANG YUANRONG et al.: "Research on Parallel Design and Optimization of the Gene Panel Pipeline", Chinese Journal of Computers, vol. 42, no. 11, pages 2429-2446 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination