CN116701326A - Method and system for generating target FastQ file from undetermined FastQ file generated by NGS - Google Patents

Method and system for generating target FastQ file from undetermined FastQ file generated by NGS

Info

Publication number
CN116701326A
Authority
CN
China
Prior art keywords
fastq
box
file
ngs
fastq file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210177362.2A
Other languages
Chinese (zh)
Inventor
费家俊
唐英荣
田常丰
洪强
李萌
马臻
龚崝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xuzhenda Biotechnology Co ltd
Original Assignee
Shanghai Xuzhenda Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xuzhenda Biotechnology Co ltd
Priority to CN202210177362.2A
Publication of CN116701326A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1744 Redundancy elimination performed by the file system using compression, e.g. sparse files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5077 Logical partitioning of resources; Management or configuration of virtualized resources
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/06 Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a method and system for generating a target FastQ file from an undetermined FastQ file generated by NGS, as well as a computer-readable storage medium and a terminal. The method comprises the following steps: S1, preparing an original file and Index information to obtain the undetermined FastQ file generated by NGS; S2, generating and correcting split points; S3, decompressing, filtering and compressing; S4, merging. The system comprises a database unit and a processing unit, wherein the database unit is used for storing raw data generated by NGS, sample information and data generated after analysis; the processing unit comprises a split point generation and correction module, a decompression-filtering-compression module and a merging module; the database unit is in communication with the processing unit. The invention can quickly and accurately filter the undetermined FastQ file generated by NGS down to the FastQ file that matches the barcode provided by the customer, and has practical value.

Description

Method and system for generating target FastQ file from undetermined FastQ file generated by NGS
Technical Field
The invention relates to a method and system for processing next-generation sequencing data, and in particular to a method and system for generating a target FastQ file from an undetermined FastQ file generated by NGS.
Background
With the maturation and popularization of next-generation sequencing (NGS), sequencing costs have dropped sharply and the volume of data generated has become very large. For example, a single run of the NovaSeq 6000 next-generation sequencer can yield 3.6 Tb × 2 of data, more than 2000 times the size of the human genome. In recent years, NGS has been increasingly used in the biomedical field for sequencing biological macromolecules. Typically, biopharmaceutical companies, universities or research institutions provide samples to be tested, such as biological macromolecules, genomes or metagenomes, to professional sequencing companies, which analyze the samples by NGS.
To facilitate the distribution and sharing of sequencing data, the bases measured by NGS and their quality scores are mostly stored in FASTQ format (the corresponding file being a FASTQ file). FASTQ is a common format for gene sequencing data; it stores data in units of reads, carries the quality information of each sequenced base, and is the format most commonly used by downstream analysis software.
The undetermined FastQ (Undetermined FastQ) file generated by NGS contains the sequencing results whose sequence tag (called a barcode in NGS) could not be matched to any barcode in the sample sheet (samplesheet) during the primary split of the NGS run. Some customers wish to obtain this portion of the data. However, under the current splitting workflow, the Undetermined FastQ file is copied directly from the split result and is not "clean", so data belonging to other customers could be sent out, which poses a risk to the enterprise.
Therefore, prior to delivery, the file needs to be filtered to exclude 2 types of data:
1) Data carrying known barcodes.
2) Where a Lane (the NGS term for a sequencing channel) is not reserved exclusively for one customer, data carrying other customers' barcodes.
There are various situations in which a barcode cannot be matched. For example, if one barcode is omitted when a customer's data is split, all sequences tagged with that barcode are assigned to the undetermined FastQ file generated by NGS. Because a single Undetermined FastQ file is very large, the time required to filter it directly is unacceptable. It is therefore necessary to divide the Undetermined FastQ file into small fragments and then filter the fragments. The present invention describes the process required to filter this file.
The amount of data generated by NGS is very large, and sequencing companies typically store sequencing data on public clouds (Alibaba Cloud, AWS, and others) for customers to download themselves. Without a box deployed, the customer needs a dedicated computer with the download tool of the corresponding public cloud installed, as well as a high-bandwidth network connection. The customer must manually confirm whether the sequencing data has been generated and manually start the download once it has. If the transfer is interrupted by a network fault or for any other reason, the transfer service must be restarted manually. This requires considerable manual effort and cannot guarantee data integrity. Because the delivered data is large, the download takes a long time, and any interruption during the process leaves files incomplete, which easily causes errors in subsequent information processing steps. The current delivery of sequencing data therefore suffers from huge data volumes, cumbersome manual download management and long transfer times. How to provide scientific research institutions, universities, hospitals, medical institutions, pharmaceutical companies and genomics companies with a dedicated software system for high-speed delivery of sequencing data, and thereby solve the "last kilometer" problem of genomic data delivery, is a technical problem that needs to be solved in this field.
Disclosure of Invention
In order to filter the undetermined FastQ file generated by NGS more quickly and accurately down to the FastQ file that matches the customer-provided barcode, i.e., the target FastQ file, the present invention discloses a method of generating a target FastQ file from an undetermined FastQ file generated by NGS, comprising the following steps:
S1, preparing an original file and Index information to obtain the undetermined FastQ file generated by NGS;
S2, generating and correcting split points: cutting the undetermined FastQ file generated by NGS into a plurality of FastQ fragments according to a file size R; for the boundary point of each FastQ fragment obtained by cutting, searching for a bgzf block boundary within a range of ±M; and splitting the file at the offsets found, used as the actual split points, to obtain the split FastQ fragments; R is 0.01-20 Gb and M is 32-128 Kb;
S3, decompressing, filtering and compressing: decompressing the FastQ fragments split in step S2, filtering them using the target barcode(s), and then compressing them to obtain filtered FastQ fragments;
S4, merging: merging the filtered FastQ fragments obtained in step S3 in their original order to generate the target FastQ file. Here R is a real number of Gb in the range 0.01-20 Gb and is not limited to two decimal places. R is a value computed by the program by weighing factors of the actual environment such as the CPU clock frequency, the memory size and the I/O speed. Generally, the better the computer performance, the smaller the value of R given by the program.
Further, R is 2-8 Gb and M is 32-64 Kb. Preferably, R is 4 Gb and M is 64 Kb.
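The split point correction of step S2 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names are hypothetical, the data is tiny and held in memory, and it assumes that the `BC` subfield is the first entry in the gzip extra field of each BGZF block (as standard bgzip output lays it out). Nominal cut points are placed every `r` bytes and each one is moved to the nearest BGZF block start found within ±`m` bytes.

```python
import struct
import zlib

BGZF_MAGIC = b"\x1f\x8b\x08\x04"  # gzip magic bytes, deflate method, FEXTRA flag


def make_bgzf_block(payload: bytes) -> bytes:
    """Build one BGZF block; used here only to create demo data."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate stream
    cdata = comp.compress(payload) + comp.flush()
    bsize = 18 + len(cdata) + 8 - 1  # BSIZE stores the total block size minus 1
    header = BGZF_MAGIC + struct.pack("<IBBH", 0, 0, 0xFF, 6)  # MTIME, XFL, OS, XLEN
    extra = b"BC" + struct.pack("<HH", 2, bsize)  # 'BC' subfield carrying BSIZE
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + extra + cdata + trailer


def is_block_start(buf: bytes, pos: int) -> bool:
    """True if a BGZF header appears to start at pos (assumes 'BC' comes first)."""
    if pos + 18 > len(buf) or buf[pos:pos + 4] != BGZF_MAGIC:
        return False
    xlen, si, slen = struct.unpack_from("<H2sH", buf, pos + 10)
    return xlen >= 6 and si == b"BC" and slen == 2


def corrected_split_points(buf: bytes, r: int, m: int) -> list:
    """Nominal cuts every r bytes, each moved to the nearest block start within +/- m."""
    points = [0]
    for nominal in range(r, len(buf), r):
        lo, hi = max(0, nominal - m), min(len(buf), nominal + m + 1)
        candidates = [p for p in range(lo, hi) if is_block_start(buf, p)]
        if candidates:
            best = min(candidates, key=lambda p: abs(p - nominal))
            if best > points[-1]:  # keep the split points strictly increasing
                points.append(best)
    return points


# Demo: eight small blocks, cut into roughly three pieces.
blocks = [make_bgzf_block((b"@read%d\nACGT\n+\nIIII\n" % i) * 50) for i in range(8)]
data = b"".join(blocks)
window = max(len(b) for b in blocks)  # stands in for M (32-128 Kb in the text)
print(corrected_split_points(data, r=len(data) // 3, m=window))
```

Because the header bytes could in principle also occur by chance inside compressed data, a production version would presumably confirm each candidate, for example by following the BSIZE field to the next block header.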
In some embodiments, step S1 is specifically: separating data from the original file according to the known Index information, the data that cannot be separated from the original file being the undetermined FastQ file generated by NGS. Obtaining the undetermined FastQ file generated by NGS typically involves the following operations: extracting, from the order task and the sequencing-run task respectively, the primary split result directory from which the undetermined FastQ file is taken and all known Indexes on the Lane on which the undetermined FastQ file was sequenced; and selecting, in the work order system, the Lane to be split and the I7/I5 lengths of the source of the undetermined FastQ file. An Index, i.e., a barcode, is a tag used in sequencing to identify information such as the origin of a sequence. I5 and I7 are the biological tags of each DNA fragment, attached to the two ends of the fragment. In double-ended sequencing, I5 and I7 are both added to the ends of the DNA; in single-ended sequencing, only I7 is added. Without I7, or I5 and I7, added to the ends of the DNA, there is no way to tell which sample each DNA fragment belongs to. I5 and I7 are the biological tags, and an Index is their specific value.
In some embodiments, the decompression, filtering and compression in step S3 are performed as a stream within the same task, and the resulting filtered FastQ fragment file, whose name ends with part-${fragment number}, is stored in the Alibaba Cloud Object Storage Service (OSS).
In some embodiments, step S3 includes placing the 3 processes of decompression, filtering and compression in 3 different threads, connected to one another through queues.
In some embodiments, step S3 further includes running several workers on each batch node, each worker consisting of 3 different threads responsible for decompression, filtering and compression respectively; the number of workers = number of virtual machine cores / 2.
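The worker described above, three threads joined by queues, might look roughly like the following Python sketch. All names are hypothetical, plain gzip stands in for bgzf, and the barcode is assumed to be the last colon-separated field of an Illumina-style read header; the text itself does not specify how the barcode is read from a record.

```python
import gzip
import io
import queue
import threading

DONE = object()  # sentinel marking the end of the record stream


def decompress_stage(blob, out_q):
    lines = gzip.decompress(blob).decode("ascii").splitlines()
    for i in range(0, len(lines) - 3, 4):  # a FastQ record is 4 lines
        out_q.put(lines[i:i + 4])
    out_q.put(DONE)


def filter_stage(in_q, out_q, targets):
    while (rec := in_q.get()) is not DONE:
        # Assumption: the header's last ':' field is the barcode.
        if rec[0].rsplit(":", 1)[-1] in targets:
            out_q.put(rec)
    out_q.put(DONE)


def compress_stage(in_q, sink):
    with gzip.GzipFile(fileobj=sink, mode="wb") as gz:
        while (rec := in_q.get()) is not DONE:
            gz.write(("\n".join(rec) + "\n").encode("ascii"))


def run_worker(blob, targets):
    """One worker: decompress -> filter -> compress, connected by bounded queues."""
    q1, q2, sink = queue.Queue(maxsize=1024), queue.Queue(maxsize=1024), io.BytesIO()
    threads = [
        threading.Thread(target=decompress_stage, args=(blob, q1)),
        threading.Thread(target=filter_stage, args=(q1, q2, targets)),
        threading.Thread(target=compress_stage, args=(q2, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink.getvalue()


def worker_count(vm_cores):
    """Number of workers = number of virtual machine cores / 2 (at least one)."""
    return max(1, vm_cores // 2)


records = (
    "@M1:1:FC:1:1:1:1 1:N:0:ACGT\nAAAA\n+\nIIII\n"
    "@M1:1:FC:1:1:1:2 1:N:0:TTTT\nCCCC\n+\nIIII\n"
    "@M1:1:FC:1:1:1:3 1:N:0:ACGT\nGGGG\n+\nIIII\n"
)
filtered = run_worker(gzip.compress(records.encode("ascii")), {"ACGT"})
print(gzip.decompress(filtered).decode("ascii"))
```

The bounded queues keep a fast stage from running far ahead of a slow one, which limits memory use. For brevity the first stage here decompresses the whole fragment in one call; a real worker would presumably read it incrementally.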
In some embodiments, step S4 further comprises generating an md5 checksum accompanying the target FastQ file, for verifying the authenticity of the target FastQ file.
In some embodiments, the merging of step S4 is performed on a local computer.
In some embodiments, step S4 further includes uploading to the cloud without multipart upload after the merging is completed on the local computer.
In some embodiments, step S4 further comprises uploading to Alibaba Cloud using multipart upload after the merging is completed on the local computer.
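As an illustration of step S4, the sketch below concatenates fragment files in fragment-number order and computes an md5 checksum while copying. It relies on the fact that concatenated gzip members form a valid gzip stream. The file names, the `.md5` sidecar format and the function names are assumptions made for the example, not details from the text.

```python
import gzip
import hashlib
import os
import tempfile


def part_number(path):
    """Fragment number taken from a name ending in part-<number>."""
    return int(path.rsplit("part-", 1)[1])


def merge_parts(part_paths, target_path):
    """Concatenate fragments in numeric order and write an md5 sidecar file."""
    md5 = hashlib.md5()
    with open(target_path, "wb") as out:
        # Numeric sort, so part-10 follows part-9 rather than part-1.
        for path in sorted(part_paths, key=part_number):
            with open(path, "rb") as src:
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    md5.update(chunk)
    with open(target_path + ".md5", "w") as f:
        f.write(f"{md5.hexdigest()}  {os.path.basename(target_path)}\n")
    return md5.hexdigest()


# Demo: eleven tiny gzip fragments, passed in reverse order on purpose.
tmp = tempfile.mkdtemp()
paths = []
for n in range(11):
    p = os.path.join(tmp, f"target.fastq.gz.part-{n}")
    with open(p, "wb") as f:
        f.write(gzip.compress(f"@read{n}\nACGT\n+\nIIII\n".encode("ascii")))
    paths.append(p)
target = os.path.join(tmp, "target.fastq.gz")
digest = merge_parts(reversed(paths), target)
print(digest)
```

Computing the digest during the copy avoids reading the merged file a second time, which matters when the target file is many gigabytes.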
In some embodiments, the method further comprises S5, data delivery: completing delivery of the target FastQ file (i.e., delivering the target FastQ file to the customer) through a box delivery service system; the box delivery service system comprises a box service network, a box cluster and a zookeeper service module, wherein a plurality of server nodes are arranged in the box service network, the box cluster comprises a plurality of boxes, each box is deployed independently and connected to the box service network, the box service network is connected wirelessly to a work order system for data interaction, and the zookeeper service module is used for managing the state of the box service network. When a box needs to communicate with the work order system, the box does not connect to a work order system directly; it connects to the box service network, and the work order system in turn connects to the box service network to carry out data transmission, box management and other work.
In the above box delivery service system, the server node is a CubeMesh node (box network node), and the box is a Cube node (box node).
In the above box delivery service system, the server node includes a remote procedure call service module (RPCServer), a cluster state management module (ClusterKeeper), a first metadata module (MetaData), a box runtime information module (RuntimeCubeInfo), a box state management module (CubeKeeper), a box state heartbeat detection module (CubeKeepAliveWorker), a box connection resolution module (CubeConnectionResolver), and a box message communication protocol library module (CubeTalk Message Library);
the remote procedure call service module (RPCServer) is the entry point through which the work order system remotely accesses the boxes;
the cluster state management module (ClusterKeeper) is used for maintaining state data between the server node and the zookeeper service module and for registering the server node in the box service network;
the first metadata module (MetaData) stores the metadata library that the box service network needs to use;
the box runtime information module (RuntimeCubeInfo) is used for abstracting the state information of one online box;
the box state management module (CubeKeeper) is used for maintaining the connection state with a single box;
the box state heartbeat detection module (CubeKeepAliveWorker) is used for polling the boxes at low frequency so as to avoid zombie boxes caused by inactive connections;
the box connection resolution module (CubeConnectionResolver) acts as the server side of box communication and is responsible for establishing connections with the boxes and for encoding and decoding the communication;
the box message communication protocol library module (CubeTalk Message Library) is the protocol library of box communication and contains the data of all message types that the boxes may use;
the box is used for receiving data pushed from the work order system, managing the storage devices connected to the box and outputting data to those storage devices, and comprises a metadata storage module (Storage Metadata), a network node connection module (MeshConnector), a volume space usage detection thread module (VolumeSpaceUsageCheckThread), a disk operation thread module (DiskManipulateThread), a disk state monitoring module (DiskWatcher), a data download release module (DownloadExtractor), a second metadata module (Metadata), a large block data reading module (BLOBReader), a download module (Downloader), a hard disk initialization module (FSMaker) and a box communication protocol module (CubeTalk Protocol);
the metadata storage module (Storage Metadata) is used for managing the metadata structures of the local storage devices;
the network node connection module (MeshConnector) is the client side of box communication and is responsible for establishing communication with the box service network and for encoding and decoding it;
the volume space usage detection thread module (VolumeSpaceUsageCheckThread) is responsible for monitoring changes in the remaining capacity of the available partitions;
the disk operation thread module (DiskManipulateThread) is responsible for the management of local disks, including mounting, ejecting and initializing;
the disk state monitoring module (DiskWatcher) is responsible for monitoring changes to the storage devices, such as the insertion of a new disk;
the data download release module (DownloadExtractor) is responsible for outputting a completed or partially completed download to an external storage device;
the second metadata module (Metadata) stores the metadata structures associated with the service;
the large block data reading module (BLOBReader) provides the ability to read downloaded data; to limit the number of file handles used by downloads, the data of a single Download is consolidated into one BLOB;
the download module (Downloader) is used for executing the download action for one Download; the Downloader itself does not perform the connection, data transfer or file writing of the download, but is responsible for controlling 3 components that complete the download:
fragment download (FragDownloader): for files without content, after the https download address is obtained, the file is split into several data blocks; the download module (Downloader) creates several fragment downloaders (FragDownloader), which download these data blocks in parallel through a local queue;
fragment commit (FragCommitter): writes each completed fragment to the local disk at its offset;
content download: for files containing content, the content of the file is obtained directly from the box service network;
the hard disk initialization module (FSMaker) provides the disk initialization capability and offers several implementations to support different types of partition tables and partition formats;
the box communication protocol module (CubeTalk Protocol) stores the network protocol used for communication between the box and the box service network.
In the above box delivery service system, the communication protocol by which the remote procedure call service module (RPCServer) connects to the work order system is the Thrift protocol;
the metadata library stored in the first metadata module (MetaData) includes box information (CubeInfo), hard disk information (HDDInfo), download information (Download) and release information (Extraction), wherein the box information (CubeInfo) is the registration information of a box, the hard disk information (HDDInfo) encapsulates a known hard disk, the download information (Download) encapsulates a download task, and the release information (Extraction) encapsulates the process data of one output of data to external storage;
the metadata structures in the metadata storage module (Storage Metadata) comprise block information (BlockInfo), disk information (DiskInfo), volume information (VolumeInfo), disk flag (DiskFlag) and internal storage disk (InternalStorageDisk), wherein the block information (BlockInfo) abstracts all Linux block devices, including unrecognizable ones; the disk information (DiskInfo) abstracts one storage device; the volume information (VolumeInfo) abstracts a partition on a disk; the disk flag (DiskFlag) abstracts a partition flag in a disk partition table; and the internal storage disk (InternalStorageDisk) abstracts and marks one disk information (DiskInfo) as internal storage; the internal storage is used for receiving data from the work order system, and the remaining storage is used for outputting the pushed data;
the metadata structures in the second metadata module (Metadata) include download (Download), download record (DownloadEntry), fragment (Frag), download record state (EntryState), download task state (DownloadState), release (Extraction) and release state (ExtractionState), wherein the download (Download) abstracts one data push; the download record (DownloadEntry) abstracts one downloaded file, downloaded files being divided into files containing content and files not containing content; the fragment (Frag) applies to files not containing content and abstracts one piece of a download record (DownloadEntry) after it has been split; the download record state (EntryState) abstracts the state of one download record (DownloadEntry) during the download process; the download task state (DownloadState) abstracts the state of one data push; the release (Extraction) abstracts one data release; and the release state (ExtractionState) abstracts the state of one data release process.
In the box delivery service system described above, the network protocol includes a negotiation phase and a communication phase;
the negotiation phase is used to establish a link between the box and the service network; it is the same for uplink and downlink, and proceeds as follows:
step A1: the box establishes a TCP (Transmission Control Protocol) connection with the box service network and sends its protocol version to the box service network;
step A2: the box service network replies with its protocol version to the box;
step A3: the box responds whether the protocol version is compatible;
if not, the connection to the box service network may be closed;
if so, the box service network generates an RSA key pair and sends the RSA public key length and the RSA public key to the box;
step A4: the box generates an identifier containing its own device ID and the connection type, encrypts the identifier with the RSA public key it has read, and sends the RSA public key length and the RSA-encrypted identifier to the box service network;
step A5: after decrypting the identifier with the RSA private key of the key pair, the box service network checks the registration information of the device named in the identifier and sends the identification result to the box;
step A6: if the result is not "identification complete", the box disconnects from the network;
step A7: the box service network sends a negotiation-complete flag;
step A8: the communication phase is entered;
after the negotiation phase is completed, the link enters the communication phase, which takes a question-and-answer form and has no end; all communication between the box and the service network takes place in this phase, which proceeds as follows:
step B1: reading 4 bytes from the connection between the box and the box service process;
step B2: interpreting the 4 bytes read to obtain a message length M;
step B3: reading M bytes of data from the connection;
step B4: decrypting the M bytes of data using AES, wherein the key used by the service network for decryption is the key the box used when it registered with the service network, the key used by the box for decryption is its locally configured key, and for a given box the two keys must match;
step B5: processing the decrypted message;
step B6: repeating steps B1-B5 for the next cycle.
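Steps B1 to B6 describe a length-prefixed message loop, sketched below in Python. Two points are assumptions rather than facts from the text: the 4-byte length is treated as a big-endian unsigned integer, and, because the Python standard library has no AES, the cipher is passed in as a callable and the demo uses an insecure XOR placeholder in its place. All function names are hypothetical.

```python
import io
import struct


def read_exact(stream, n):
    """Read exactly n bytes or raise EOFError if the connection ends first."""
    buf = b""
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise EOFError("connection closed")
        buf += chunk
    return buf


def read_message(stream, decrypt):
    # B1-B2: read 4 bytes and interpret them as the length (byte order assumed).
    (length,) = struct.unpack(">I", read_exact(stream, 4))
    # B3-B4: read that many bytes and decrypt them.
    return decrypt(read_exact(stream, length))


def frame(payload, encrypt):
    """Sender side: encrypt the payload and prefix it with its length."""
    body = encrypt(payload)
    return struct.pack(">I", len(body)) + body


def message_loop(stream, decrypt, handle):
    """B5-B6: handle each message, then go around again until the stream ends."""
    while True:
        try:
            message = read_message(stream, decrypt)
        except EOFError:
            return
        handle(message)


def xor_cipher(data, key=0x5A):
    """Placeholder standing in for AES in this demo; NOT secure."""
    return bytes(b ^ key for b in data)


received = []
wire = io.BytesIO(frame(b"ping", xor_cipher) + frame(b"download-created", xor_cipher))
message_loop(wire, xor_cipher, received.append)
print(received)
```

The same functions would work on a socket file object obtained with `socket.makefile("rb")`, since only `read` is used.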
Upon receiving data pushed by the work order system, the box goes through the following phases:
creating a download task: the work order system creates a download on the box through the RPC service of the box service network; the box cannot see the original OSS key of each file, but it knows the ID and file name of each file;
acquiring a download address: the box sends a request to the service network for the address of a file, using the file's ID to identify the file it wants; the service network looks up the requested file and generates an https download address that requires no access key, using the object signing function of Alibaba Cloud;
segmenting the data: after the box obtains the download URL, it uses HTTP range requests to divide the data into several small blocks;
downloading and assembling: the box downloads the divided data fragments through the built-in fragment downloaders, hands the fragments to the fragment committer to be combined, and writes them into the final BLOB; when all fragments have been downloaded, the download of the single file is declared complete.
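The segmentation and assembly phases can be sketched as follows. This is only an illustration under stated assumptions: the block size, the function names and the in-memory `fetch` function (which stands in for an HTTP GET carrying a `Range` header against the signed URL) are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, block):
    """Inclusive (start, end) byte ranges covering a file of the given size."""
    return [(s, min(s + block, size) - 1) for s in range(0, size, block)]


def range_header(start, end):
    """HTTP header asking for bytes start..end inclusive."""
    return {"Range": f"bytes={start}-{end}"}


def commit(blob, offset, data):
    """Write one completed fragment into the final BLOB at its offset."""
    blob[offset:offset + len(data)] = data


# Demo: a fake remote file; fetch() plays the role of the ranged HTTP request.
source = bytes(range(256)) * 10


def fetch(start, end):
    return source[start:end + 1]


ranges = split_ranges(len(source), 1000)
blob = bytearray(len(source))
with ThreadPoolExecutor(max_workers=4) as pool:
    # Fragments may finish in any order; the offset decides where each one lands.
    list(pool.map(lambda r: commit(blob, r[0], fetch(*r)), ranges))
print(ranges, bytes(blob) == source)
```

Writing by offset is what lets the fragments be fetched in parallel and, presumably, lets an interrupted download resume by requesting only the ranges that are still missing.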
In some embodiments, the method further comprises S5, data delivery: completing the delivery of the target FastQ file by a box delivery service method; the box delivery service method comprises the following steps:
step C1, connecting the box to the Internet;
step C2, configuring the information related to the box and initializing the hard disk;
step C3, after the box is online, automatically registering it with the box service network;
and step C4, when data is generated, the box service network notifies the box to start a data download task.
In the above box delivery service method, in step C1, after the box is connected to the Internet, the PPPoE settings, the password, the internal network card parameters and the communication public key of the box need to be configured;
in step C2, a mobile phone is used to scan the device QR code on the box housing to enter a configuration interface, where the information related to the box is configured;
in step C3, after the box is online, it automatically registers with the box service network according to the preset configuration.
After the data is downloaded, it can be copied to wherever it is needed via an external portable hard disk, or the directory of the box can be accessed over the network;
the information related to the box further comprises operator-related information, and step C2 further comprises binding the operator to the box.
The invention also discloses a system for generating a target FastQ file from an undetermined FastQ file generated by NGS, which comprises:
a database unit for storing raw data generated by NGS, sample information and data generated after analysis;
a processing unit comprising:
a split point generation and correction module for cutting the undetermined FastQ file generated by NGS into a plurality of FastQ fragments according to a file size R; for the boundary point of each FastQ fragment obtained by cutting, searching for a bgzf block boundary within a range of ±M; and splitting the file at the offsets found, used as the actual split points, to obtain the split FastQ fragments; R is 1-20 GB and M is 32-128 KB;
a decompression-filtering-compression module for decompressing the split FastQ fragments, filtering them using the target barcode(s), and then compressing them to obtain filtered FastQ fragments;
a merging module for merging the filtered FastQ fragments in their original order to generate the target FastQ file;
the database unit is in communication with the processing unit. Here R is a real number of Gb in the range 0.01-20 Gb and is not limited to two decimal places. R is determined by the specific scenario: the program computes it by weighing factors of the actual environment such as the CPU clock frequency, the memory size and the I/O speed. Generally, the better the computer performance, the smaller the value of R given by the program.
Further, R is 2-8 Gb and M is 32-64 Kb. Preferably, R is 4 Gb and M is 64 Kb.
In some embodiments, the decompression-filtering-compression module is further configured to perform the decompression, filtering and compression as a stream within the same task, and to store the resulting filtered FastQ fragment file, whose name ends with part-${fragment number}, in Alibaba Cloud OSS.
In some embodiments, the decompression-filtering-compression module is further configured to place the 3 processes of decompression, filtering and compression in 3 different threads, connected to one another through queues.
In some embodiments, the decompression-filtering-compression module is further configured to run a plurality of workers on each batch node, where each worker consists of 3 different threads responsible for decompression, filtering and compression respectively; the number of workers = number of virtual machine cores / 2.
In some embodiments, the merging module is further configured to generate an md5 checksum accompanying the target FastQ file. The md5 checksum is used to verify the authenticity of the target FastQ file.
In some embodiments, the merging module is further configured to perform the merging on a local computer.
In some embodiments, the merging module is further configured to upload to the cloud without multipart upload after the merging is completed on the local computer.
In some embodiments, the merging module is further configured to upload to Alibaba Cloud using multipart upload after the merging is completed on the local computer.
In some embodiments, the system further includes a box delivery service system as described above for completing the delivery of the target FastQ file; the box delivery service system is in communication with the database unit and with the processing unit.
The present invention also provides a computer readable storage medium having stored therein executable instructions that when executed implement a method of generating a target FastQ file from an undetermined FastQ file generated by an NGS as described above.
The invention also provides a terminal, which comprises:
A memory for storing executable instructions;
and a processor for implementing the method for generating a target FastQ file from the undetermined FastQ file generated by the NGS as described above when executing the executable instructions stored in the memory.
The invention makes it possible to obtain a specific FastQ file from the massive Undetermined FastQ files produced by NGS. It has practical value, solves a long-standing problem of sequencing companies, and achieves an unexpected technical effect. The method and the system can quickly and accurately recover customer data that was originally lost because of a barcode mismatch.
The beneficial effects of the invention are as follows:
1. The undetermined FastQ file generated by NGS is cut into several fragments of a certain size (R), which are then decompressed, filtered and compressed, so that the time needed to obtain the target FastQ file by filtering is greatly shortened.
2. To reduce the number of times the file is copied, decompression, filtering and compression are done as a stream within the same task. The resulting FastQ fragment file is stored in Alibaba Cloud OSS under a name ending in .part-${fragment number}.
3. The 3 processes of decompression, filtering and compression occupy different resources (see Table 1). Executing the 3 operations directly in the same context leaves node resources underused: the network is idle during filtering and the CPU is idle during decompression. Nodes in Alibaba Cloud batch computing are characterized by relatively abundant CPU. The 3 processes are therefore put into 3 different threads connected through queues, which further improves the processing speed.
4. Several workers run on each batch computing node, each worker consisting of 3 different threads responsible for decompression, filtering and compression respectively; number of workers = number of virtual machine cores / 2, which also helps to further improve the processing speed.
5. md5 verification helps ensure that the user receives the real data. If the target FastQ file is tampered with, its check code changes, so the user can easily detect tampering by comparing the computed check code with the original one.
6. Not using multipart upload makes it easier to apply the invention to cloud services other than Alibaba Cloud.
7. Using Alibaba Cloud multipart upload saves one copy of the file, thereby saving processing time.
TABLE 1. Resources occupied by each of the 3 processes
Process | Resources occupied
Decompression | Network IO (downlink)
Filtering | CPU
Compression | CPU + network (uplink)
At present there is no product on the market that solves the problem of high-speed delivery of huge volumes of gene data; the box delivery service system solves this problem. Once the box is adopted, the system inside the box senses when new data is generated in the cloud and automatically synchronizes it in real time using multithreaded, fragmented downloading with a data retransmission mechanism. This overcomes the drawbacks of sequencing data, namely its huge volume and the complexity and long duration of manually managed downloads. Data delivery is fast, timely, complete and secure. The system supports public cloud object storage services (and can be extended to support more vendors as required). After the data has been transferred to the box, the box can be attached to the user's network and accessed as network attached storage (NAS), and data can also be exported by copying to a high-speed USB 3.0 mobile hard disk.
The box delivery service method is a complete solution covering software, hardware and network; customers only need to pay to enjoy the service of downloading data to their local site at high speed. Data generation in the cloud is detected automatically without manual intervention, and sequencing data is downloaded automatically, even while it is still being generated, which greatly shortens the data delivery time. The method is simple and convenient to use and saves investment, since there is no need to provide a computer and install dedicated download software; it thereby remedies the deficiencies of the prior art.
The conception, specific structure, and technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, features, and effects of the present invention.
Drawings
FIG. 1 is a schematic diagram of BGZF segmentation;
FIG. 2 is a decompression-filtration-compression schematic;
FIG. 3 is a schematic diagram of running several workers under 1 node to complete 3 threads of decompression-filtering-compression;
FIG. 4 is a schematic diagram of a merging step;
FIG. 5 is a schematic diagram of a box delivery service system according to the present invention;
FIG. 6 is a box service network node diagram;
FIG. 7 is a box node diagram;
FIG. 8 is a box communication negotiation diagram;
FIG. 9 is a block diagram of a box communication message;
FIG. 10 is a schematic diagram of data download stage 1;
FIG. 11 is a schematic diagram of data download stage 2;
FIG. 12 is a schematic diagram of data download stage 3;
FIG. 13 is a schematic diagram of data download stage 4;
FIG. 14 is a flow chart of the box delivery service method of the present invention.
Wherein, the reference numerals are as follows:
box service network 100, box cluster 200, zookeeper service module 300, work order system 400, server node 110, box body 210, remote procedure call service module 111, cluster state management module 112, first metadata module 113, box runtime information module 114, box state management module 115, box state heartbeat detection module 116, box connection resolution module 117, box message communication protocol library module 118, metadata storage module 211, network node connection module 212, volume space usage detection thread module 213, disk operation thread module 214, disk state monitoring module 215, data download and release module 216, second metadata module 217, chunk data reading module 218, download module 219, hard disk initialization module 220, box communication protocol module 221.
Detailed Description
To make the technical means, inventive features, objects and effects of the invention easy to understand, the invention is further described below with reference to specific embodiments. The present invention is not limited to the following examples.
It should be understood that the structures, proportions, sizes and the like shown in the drawings are for illustration only and do not limit the conditions under which the invention can be practiced. Any structural modification, change of proportion or adjustment of size that does not affect the effects and objects achievable by the invention still falls within the spirit and scope of the invention.
Example 1
This embodiment provides a method of generating a target FastQ file from an undetermined FastQ file generated by an NGS comprising the steps of:
S1, preparing the original file and Index information
For filtering, the following information needs to be extracted from the order task and the on-line task respectively:
1) The first-level split result directory (either the result directory of a normal first-level split or that of a re-split), which is used to extract the Undetermined FastQ files.
2) All known Indexes on the Lane where the original Undetermined FastQ file was sequenced. If a re-split has been designed, the newly provided Index must be used.
In addition, service personnel need to select, in the work order system, the Lane to be split and the source I7/I5 length of the original Undetermined FastQ file.
S2, generating and correcting the division points
As shown in fig. 1, after the original Undetermined FastQ file is obtained (in this embodiment, its path and name are /path/to/fastq/Undetermined_S0_L002_R2_001.fastq.gz), the file is cut into several FastQ fragments of a certain size R (R may be any real number of GB in the range 0.01-20 GB); each fragment is named frag#n range (lower limit-upper limit). For the boundary point of each FastQ fragment, a BGZF (a blocked compression format) block boundary search is performed within ±M (M is 32-128 KB; in this embodiment M is 64 KB). The offset found by the search is used as the actual split point for segmenting the BGZF-encoded original Undetermined FastQ file.
The FastQ files generated by bcl2fastq (BCL being the file format in which data comes off the sequencer), both the split FastQ files and the unsplittable original Undetermined FastQ file, use the BGZF encoding specification, which is one-way compatible with gzip encoding: a BGZF file is a valid gzip file, but not vice versa. The segmentation and compression/decompression code described in the present invention is implemented with reference to RFC 1952 (https://datatracker.ietf.org/doc/html/rfc1952) and section 4.1 of the SAM/BAM Format Specification (https://samtools.github.io/hts-specs/SAMv1.pdf).
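The split point correction of step S2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names (make_bgzf_block, nominal_split_points, block_size_at, correct_split_point) are hypothetical, and accepting a candidate only when the following block header also parses is an assumption added here to reject chance matches of the magic bytes. The header layout (gzip magic bytes, FLG=4, and a 'BC' extra subfield holding the block size minus one) follows section 4.1 of the SAM specification.

```python
import struct
import zlib

BGZF_MAGIC = b"\x1f\x8b\x08\x04"   # gzip ID1, ID2, CM=deflate, FLG=FEXTRA
MAX_BLOCK = 65536                   # a BGZF block is at most 64 KiB


def make_bgzf_block(payload: bytes) -> bytes:
    """Build one BGZF block (helper used only to create demo data)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)        # raw deflate stream
    cdata = comp.compress(payload) + comp.flush()
    total = 12 + 6 + len(cdata) + 8                       # header + extra + data + trailer
    header = struct.pack("<4sIBBH", BGZF_MAGIC, 0, 0, 255, 6)
    extra = struct.pack("<BBHH", 66, 67, 2, total - 1)    # 'B', 'C', SLEN, BSIZE
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + extra + cdata + trailer


def nominal_split_points(file_size: int, r: int) -> list:
    """Offsets R, 2R, 3R, ... that lie strictly inside the file."""
    return list(range(r, file_size, r))


def block_size_at(buf: bytes, pos: int):
    """Return the BGZF block size if a block header starts at pos, else None."""
    if buf[pos:pos + 4] != BGZF_MAGIC or pos + 12 > len(buf):
        return None
    (xlen,) = struct.unpack_from("<H", buf, pos + 10)
    p, end = pos + 12, pos + 12 + xlen
    if end > len(buf):
        return None
    while p + 6 <= end:                                   # walk the extra subfields
        si1, si2, slen = struct.unpack_from("<BBH", buf, p)
        if (si1, si2, slen) == (66, 67, 2):
            return struct.unpack_from("<H", buf, p + 4)[0] + 1
        p += 4 + slen
    return None


def correct_split_point(path: str, nominal: int, m: int = 64 * 1024) -> int:
    """Return the BGZF block start nearest to `nominal` within +/- m bytes."""
    with open(path, "rb") as fh:
        size = fh.seek(0, 2)
        start = max(0, nominal - m)
        fh.seek(start)
        buf = fh.read(2 * m + 2 * MAX_BLOCK)              # search window plus look-ahead
    candidates = []
    pos = buf.find(BGZF_MAGIC)
    while pos != -1 and start + pos <= nominal + m:
        bsize = block_size_at(buf, pos)
        if bsize is not None:
            nxt = pos + bsize
            # Assumed sanity check: the next block must parse too, or the file ends.
            if start + nxt == size or block_size_at(buf, nxt) is not None:
                candidates.append(start + pos)
        pos = buf.find(BGZF_MAGIC, pos + 1)
    if not candidates:
        raise ValueError("no BGZF block boundary within the search range")
    return min(candidates, key=lambda off: abs(off - nominal))
```

With R = 4 GB and M = 64 KB, each value returned by nominal_split_points would be passed through correct_split_point, and consecutive corrected offsets would then delimit one frag#n range.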
S3, decompressing, filtering and compressing
To reduce the number of times the file is copied, decompression, filtering and compression are done as a stream within the same task. The resulting FastQ fragment file is stored in Alibaba Cloud OSS under a name ending in .part-${fragment number}.
The FastQ fragments generated in step S2 are decompressed, filtered and compressed. Fig. 2 illustrates this with the fragment frag#n range (252866201170-253401202339) as an example; the other FastQ fragments generated in step S2 are processed in the same way. The fragment is read and decompressed by BGZF into FastQ records, each beginning with a header line that starts with @ and carries the index tag (e.g. @A00456:750:HXXXXXXX2:2:1101:6379:1047 1:N:0: nucleotide sequence + ……). These records are then filtered against the known tags (knownInds, such as ACGTACGT+CCCT). The retained records are written and compressed by BGZF into a FastQ fragment file whose name ends in .part-${fragment number} (Undetermined_xxx.fastq.gz.part-n in this embodiment).
Because the 3 processes of decompression, filtering and compression occupy different resources (Table 1), executing the three operations directly in the same context leaves node resources underused: the network is idle during filtering and the CPU is idle during decompression. Nodes in Alibaba Cloud batch computing are characterized by relatively abundant CPU. The 3 processes are therefore placed in 3 different threads connected through queues.
Several workers (number of virtual machine cores / 2) run on each batch computing node (Node); each worker consists of three different threads responsible for decompression, filtering and compression respectively (fig. 3). Fig. 3 also uses the fragment frag#n range (252866201170-253401202339) as an example.
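One worker of step S3 can be sketched as three threads joined by two queues. This is an illustration only: all function names are hypothetical, Python's gzip module stands in for the BGZF reader and writer, an in-memory buffer stands in for the network source and the OSS destination, and the index is assumed to be the text after the last colon of the header line.

```python
import gzip
import io
import os
import queue
import threading

DONE = object()  # sentinel marking the end of the record stream


def decompress_stage(src: bytes, out_q: queue.Queue) -> None:
    """Thread 1: decompress and emit one 4-line FastQ record at a time."""
    with gzip.open(io.BytesIO(src), "rt") as fh:
        while True:
            record = [fh.readline() for _ in range(4)]
            if not record[0]:
                break
            out_q.put(record)
    out_q.put(DONE)


def filter_stage(in_q: queue.Queue, out_q: queue.Queue, known_indexes: set) -> None:
    """Thread 2: keep records whose header line ends with a known index."""
    while (record := in_q.get()) is not DONE:
        index = record[0].rstrip("\n").rsplit(":", 1)[-1]
        if index in known_indexes:
            out_q.put(record)
    out_q.put(DONE)


def compress_stage(in_q: queue.Queue, sink) -> None:
    """Thread 3: recompress the retained records into the sink."""
    with gzip.GzipFile(fileobj=sink, mode="wb") as gz:
        while (record := in_q.get()) is not DONE:
            gz.write("".join(record).encode("ascii"))


def run_worker(src: bytes, known_indexes: set) -> bytes:
    """Run the three stages concurrently and return the compressed fragment."""
    q1, q2 = queue.Queue(maxsize=1024), queue.Queue(maxsize=1024)
    sink = io.BytesIO()
    threads = [
        threading.Thread(target=decompress_stage, args=(src, q1)),
        threading.Thread(target=filter_stage, args=(q1, q2, known_indexes)),
        threading.Thread(target=compress_stage, args=(q2, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink.getvalue()


def worker_count() -> int:
    """Number of workers per node = number of cores / 2 (at least 1)."""
    return max(1, (os.cpu_count() or 2) // 2)
```

The bounded queues keep memory use flat: a fast decompression thread blocks once 1024 records are waiting, so each stage runs at the pace of the slowest one.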
S4, merging
Since the computing framework currently used (Alibaba Cloud batch computing) does not support file merging, the fragments belonging to the same target FastQ file are merged locally, in order, in a separate task; an md5 check code is generated, and the results are then written back to Alibaba Cloud OSS.
As shown in fig. 4, the files Undetermined_xxx.fastq.gz.part-1, Undetermined_xxx.fastq.gz.part-2, Undetermined_xxx.fastq.gz.part-3, …… Undetermined_xxx.fastq.gz.part-n obtained in the previous three steps are merged in order to obtain Undetermined_xxx.fastq.gz. Running md5sum on Undetermined_xxx.fastq.gz then yields the check code file Undetermined_xxx.fastq.md5 that accompanies the target FastQ file. MD5 stands for Message-Digest Algorithm 5; md5sum is a Linux command that calculates the md5 value of a file.
In the merging of this stage, Alibaba Cloud multipart upload is not used. Although using it would save one copy of the file, it is not a standard object storage operation, so the function is left unused to keep deployment on other cloud services possible in the future.
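The merging of step S4 can be sketched as a plain concatenation, which works because a gzip file may consist of several members one after another (RFC 1952). This is a minimal sketch: the function names and the choice to write the check code next to the target as target_path + ".md5" are assumptions made for illustration, not taken from the embodiment.

```python
import hashlib
import os
import re


def part_number(path: str) -> int:
    """Extract n from a file name ending in .part-n."""
    match = re.search(r"\.part-(\d+)$", path)
    if match is None:
        raise ValueError(f"not a fragment file: {path}")
    return int(match.group(1))


def merge_parts(part_paths, target_path: str) -> str:
    """Concatenate the fragments in numeric order and return the md5 hex digest."""
    digest = hashlib.md5()
    with open(target_path, "wb") as out:
        for part in sorted(part_paths, key=part_number):
            with open(part, "rb") as src:
                while chunk := src.read(1 << 20):
                    digest.update(chunk)   # hash while copying: one pass over the data
                    out.write(chunk)
    with open(target_path + ".md5", "w") as fh:
        # Same "<digest>  <name>" layout that md5sum prints.
        fh.write(f"{digest.hexdigest()}  {os.path.basename(target_path)}\n")
    return digest.hexdigest()


def verify_md5(path: str, expected: str) -> bool:
    """Recompute the md5 of a delivered file and compare it with the check code."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        while chunk := fh.read(1 << 20):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()
```

Sorting by the numeric suffix rather than by name matters once there are ten or more fragments, since part-10 sorts before part-2 as a string.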
S5, data delivery
The delivery of the target FastQ file (i.e., delivering the target FastQ file to the customer) is accomplished by the box delivery service system.
The box is temporary storage located at the end user's site, where data lands when the work order system delivers it. The box needs to meet the following requirements. It can receive and store data generated by the work order system, including but not limited to FastQ files (a sequencing data format), BCL files (the raw sequencing data format generated by the sequencer) and Unknown FastQ files (FastQ files whose data cannot be attributed, such as the target FastQ file generated in the present invention). An external storage device can be connected to the box and the received data copied onto it. The box provides an access portal for network services, so that the received data in the box can be accessed directly within the local area network where the box is located;
the box has a management function for external storage devices that covers the whole process of inserting the storage device, copying data and removing the storage device. The box also has monitoring capability and can monitor the following:
the status of the box itself, the process of receiving data from the work order system, and the process of copying data to the external storage device. The box also provides a user delivery interface, through which the user can:
manage the data that the box has received from the work order system and stored, manage external storage devices, output data to an external storage device, and manage the network services built into the box. Communication between the box and the work order system is secure and controlled.
In use, there are cases where the box actively connects to the work order system and cases where the work order system actively connects to the box. However, the boxes are deployed in uncontrolled network environments, which means that a box has no fixed network entry point (IP address, domain name, etc.).
Therefore, the box needs to support passive communication with the work order system: once the box has established a connection with the work order system, the work order system can send requests to the box at any time.
The invention provides a box delivery service system that gives the box a complete solution covering software, hardware and network; customers only need to pay to enjoy the service of downloading data to their local site at high speed. Data generation in the cloud is detected automatically without manual intervention, and sequencing data is downloaded automatically, even while it is still being generated, which greatly shortens the data delivery time. The system is simple and convenient to use and saves investment, since there is no need to provide a computer and install dedicated download software.
As shown in fig. 5, in a first aspect, the box delivery service system includes a box service network 100, a box cluster 200 and a zookeeper service module 300. The zookeeper service module 300 has open-source distributed state management software built in and is used for managing the state of the box service network 100. A plurality of server nodes 110 are disposed in the box service network 100. The box cluster 200 includes a plurality of box bodies 210, each deployed separately and connected separately to the box service network 100. The box service network 100 establishes a wireless connection with the work order system 400 for data interaction. When a box body 210 needs to communicate with the work order system 400, it does not connect to the work order system 400 directly but connects to the box service network 100; the work order system 400 likewise connects to the box service network 100 to carry out data transmission, box management and other tasks.
In the above box delivery service system, the server node 110 is a CubeMeshNode (box network node), and the box body 210 is a CubeNode.
As shown in figs. 6-7, in the above box delivery service system, the server node 110 includes a remote procedure call service module 111 (RPCServer), a cluster state management module 112 (ClusterKeeper), a first metadata module 113 (MetaData), a box runtime information module 114 (RuntimeCubeInfo), a box state management module 115 (CubeKeeper), a box state heartbeat detection module 116 (CubeKeepAliveWorker), a box connection resolution module 117 (cube connection resolution), and a box message communication protocol library module 118 (CubeTalk Message Library);
the remote procedure call service module 111 (RPCServer) is the entry point through which the work order system 400 remotely accesses the box;
the cluster state management module 112 (ClusterKeeper) is used to maintain state data between the server node 110 and the zookeeper service module 300, and to register the server node 110 into the box service network 100;
the first metadata module 113 (MetaData) stores the metadata library that the box service network 100 needs to use;
the box runtime information module 114 (RuntimeCubeInfo) abstracts the state information of one online box;
the box state management module 115 (CubeKeeper) is responsible for maintaining the connection state with one box;
the box state heartbeat detection module 116 (CubeKeepAliveWorker) polls the boxes at low frequency to avoid zombie boxes caused by inactive connections;
the box connection resolution module 117 (cube connection resolution) acts as the server end of box communication and is responsible for establishing connections with boxes and for encoding and decoding the communication;
the box message communication protocol library module 118 (CubeTalk Message Library) is the protocol library for box message communication and contains all message types that a box may use;
the box is used for receiving data pushed from the work order system 400, managing the storage devices connected to it, and outputting data to those storage devices. It includes a metadata storage module 211 (Storage Metadata), a network node connection module 212 (MeshConnector), a volume space usage detection thread module 213 (VolumeSpaceUsageCheckThread), a disk operation thread module 214 (disk management thread), a disk state monitoring module 215 (DiskWatch), a data download and release module 216 (DownloadExtractor), a second metadata module 217 (Metadata), a chunk data reading module 218 (BLOBReader), a download module 219 (Downloader), a hard disk initialization module 220 (FSMakerchusi), and a box communication protocol module 221 (CubeTalk Protocol);
the metadata storage module 211 (Storage Metadata) is used for managing the metadata structure of the local storage device;
the network node connection module 212 (MeshConnector) is the client end of box communication and is responsible for establishing the connection with the box service network 100 and for encoding and decoding the communication;
the volume space usage detection thread module 213 (VolumeSpaceUsageCheckThread) is responsible for monitoring changes in the remaining capacity of the available partitions;
the disk operation thread module 214 (disk management thread) is responsible for handling management tasks of the local disk, including mounting, ejecting and initializing;
the disk state monitoring module 215 (DiskWatch) is responsible for monitoring changes in the storage devices, such as the insertion of a new disk;
the data download and release module 216 (DownloadExtractor) is responsible for outputting a completed or partially completed download to the external storage device;
the second metadata module 217 (Metadata) stores the metadata structures associated with the service;
the chunk data reading module 218 (BLOBReader) provides the ability to read downloaded data; to limit the number of file handles used by downloads, the data of a single download is consolidated into one BLOB;
the download module 219 (Downloader) is the downloader that performs the download action for one download. The download module 219 does not itself transfer data over the connection or write files; instead it controls 3 components that complete the download:
fragment download (fragdownload): for files without content, the file is split into several data blocks after the https download address is obtained; the download module 219 (Downloader) creates multiple fragment downloads (fragdownload), which download these data blocks in parallel through a local queue;
fragment commit write (fragCommitter): writes each completed fragment to the local disk at its offset;
content download: for files containing content, the content of the file is obtained directly from the box service network 100;
the hard disk initialization module 220 (FSMakerchusi) provides disk initialization capability, with multiple implementations to support different types of partition tables and partition formats;
the box communication protocol module 221 (CubeTalk Protocol) stores the network protocol used for communication between the box and the box service network 100.
In the above box delivery service system, the communication protocol by which the remote procedure call service module 111 (RPCServer) connects to the work order system 400 is the Thrift protocol;
the metadata library stored in the first metadata module 113 (MetaData) includes box information (Cube), hard disk information (HDDInfo), download information (Download) and release information (Extraction): the box information encapsulates the registration information of a box, the hard disk information encapsulates a known hard disk, the download information encapsulates a download task, and the release information encapsulates the process data of one output to external storage;
the metadata structure in the metadata storage module 211 (Storage Metadata) includes block information (BlockInfo), disk information (DiskInfo), volume information (VolumeInfo), disk flag (DiskFlag) and internal storage disk (internalstoredisk): the block information abstracts all Linux block devices, including unrecognizable ones; the disk information abstracts one storage device; the volume information abstracts one partition on a disk; the disk flag abstracts one partition flag in the disk partition table; the internal storage disk marks a disk information (DiskInfo) entry as internal storage, which can be used to receive data pushed from the work order system 400, while the remaining disks can be used to output the pushed data;
the metadata structure in the second metadata module 217 (Metadata) includes download (Download), download record (Download entry), fragment (fragment), download record state (EntryState), download task state (Download state), release (Extraction) and release state (Extraction entry): a download abstracts one data push; a download record abstracts one downloaded file, and downloaded files are divided into files with content and files without content; for a file without content, a fragment abstracts one piece of the divided download record; the download record state abstracts the state of one download record during the download process; the download task state abstracts the state of one download; a release abstracts one data release; and the release state abstracts the state of one release process.
In the above box delivery service system, the network protocol comprises a negotiation stage and a communication stage;
as shown in fig. 8, the negotiation stage is used to establish a link between the box and the service network; it is the same for uplink and downlink links and proceeds as follows:
Step A1: the box establishes a TCP (Transmission Control Protocol) connection with the box service network 100 and sends the protocol version of the box (2-byte unsigned int) to the box service network 100;
step A2: the box service network 100 replies to the box with its protocol version (2-byte unsigned int);
step A3: the box replies whether the protocol versions are compatible: 1 = compatible, 2 = incompatible (2-byte unsigned int);
if they are not compatible, the box service network 100 may disconnect;
if they are compatible, the box service network 100 generates an RSA key pair and sends the RSA public key length n (2-byte unsigned int) and the RSA public key (n bytes) to the box;
step A4: the box generates an identifier comprising its own device ID and the connection type (uplink/downlink), encrypts the identifier with the RSA public key it has read, and sends the length of the encrypted identifier (2-byte unsigned int) followed by the encrypted identifier to the box service network 100;
step A5: after decrypting the identifier with the RSA private key, the box service network 100 checks the registration information of the device named in the identifier and sends the identification result to the box: 1 = identification complete, 2 = box not recognized, 3 = box disabled;
step A6: if the result is not identification complete, the box disconnects;
step A7: the box service network 100 sends the negotiation completion flag 1 (1 unsigned byte);
step A8: the communication stage is entered;
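The version exchange of steps A1-A3 can be sketched with fixed-width integers. This is a sketch only: the function names are hypothetical, big-endian (network) byte order is an assumption because the text does not state the byte order, and treating versions as compatible only when they are equal is an assumed rule. The RSA steps A3-A5 are omitted because they need a cryptography library outside the Python standard library.

```python
import struct

COMPATIBLE, INCOMPATIBLE = 1, 2


def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return bytes(buf)


def box_side(sock, box_version: int) -> bool:
    """Box end of steps A1-A3; returns True when the versions are compatible."""
    sock.sendall(struct.pack(">H", box_version))                  # step A1
    (net_version,) = struct.unpack(">H", recv_exact(sock, 2))     # step A2
    ok = net_version == box_version                               # assumed compatibility rule
    sock.sendall(struct.pack(">H", COMPATIBLE if ok else INCOMPATIBLE))  # step A3
    return ok


def network_side(sock, net_version: int) -> bool:
    """Service network end of steps A1-A3."""
    recv_exact(sock, 2)                                           # step A1: box version
    sock.sendall(struct.pack(">H", net_version))                  # step A2
    (flag,) = struct.unpack(">H", recv_exact(sock, 2))            # step A3
    return flag == COMPATIBLE
```

The "H" format code is a 2-byte unsigned integer, matching the field width given in the steps; when network_side returns False the service network would close the connection.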
as shown in fig. 9, after the negotiation stage is completed, the link enters an endless question-and-answer loop in which the box and the service network exchange messages; the communication stage is as follows:
step B1: 4 bytes are read from the connection between the box and the box service process (4-byte unsigned int);
step B2: the 4 bytes read are interpreted as the message length M;
step B3: M bytes of data are read from the connection;
step B4: the M bytes of data are decrypted using AES; the key the service network uses for decryption is the key the box supplied when it registered with the service network, and the key the box uses for decryption is its locally configured key; for the same box, the two keys must match;
step B5: the decoded message is processed;
step B6: steps B1-B5 are repeated for the next cycle.
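The length-prefixed loop of steps B1-B6 can be sketched as follows. This is a sketch under assumptions: the function names are hypothetical, the length prefix is taken to be big-endian, and the AES step is represented by caller-supplied encrypt and decrypt callables, since the cipher mode and key handling are not specified in the text and AES is not part of the Python standard library.

```python
import struct


def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return bytes(buf)


def send_message(sock, plaintext: bytes, encrypt) -> None:
    """Encrypt a message and send it with a 4-byte length prefix."""
    body = encrypt(plaintext)
    sock.sendall(struct.pack(">I", len(body)) + body)


def read_message(sock, decrypt) -> bytes:
    """Steps B1-B4: read the length, read the body, decrypt it."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))   # steps B1-B2
    ciphertext = recv_exact(sock, length)                  # step B3
    return decrypt(ciphertext)                             # step B4


def serve(sock, decrypt, handle) -> None:
    """Steps B5-B6: process each message until the peer closes the link."""
    while True:
        try:
            message = read_message(sock, decrypt)
        except ConnectionError:
            return
        handle(message)
```

Reading the body through recv_exact matters because a single recv call on a TCP socket may return fewer bytes than requested.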
When receiving data pushed by the work order system 400, the box goes through the following stages:
as shown in fig. 10, a download task is created: the work order system 400 creates a Download on the box through the RPC service of the box service network 100; the box cannot see the original OSS key of each file, but it knows the ID and file name of each file;
as shown in fig. 11, the download address is acquired: the box sends a request to the service network for the address of a file, using the ID of the file to identify it; the service network looks up the requested file and generates, through the object signing function of Alibaba Cloud, an https download address that can be used without access credentials;
as shown in fig. 12, the data is divided: after the box obtains the download URL, it uses HTTP range requests to divide the data into several small blocks;
as shown in fig. 13, download and assembly: the box downloads the divided data fragments through the built-in fragdownload components and passes them to the fragCommitter for merging, which writes them into the final BLOB; when all fragments have been downloaded, the download of the single file is declared complete.
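The division and assembly stages can be sketched as follows. The names split_ranges, download_file and fetch_range are hypothetical; fetch_range(start, end) stands in for an HTTP request carrying the header Range: bytes=start-end against the signed https address, and the default block size and degree of parallelism are arbitrary illustrative values.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total_size: int, chunk_size: int) -> list:
    """Inclusive (start, end) byte ranges covering a file of total_size bytes."""
    return [(start, min(start + chunk_size, total_size) - 1)
            for start in range(0, total_size, chunk_size)]


def download_file(fetch_range, total_size: int, blob,
                  chunk_size: int = 4 << 20, parallel: int = 4) -> None:
    """Fetch all ranges in parallel and commit each one at its offset in blob.

    blob is any seekable binary file object; fetch_range is supplied by the caller.
    """
    ranges = split_ranges(total_size, chunk_size)
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        futures = [(start, pool.submit(fetch_range, start, end))
                   for start, end in ranges]
        for start, future in futures:
            data = future.result()   # re-raises any error from the fetch
            blob.seek(start)         # commit the fragment at its own offset
            blob.write(data)
```

Only the committing thread writes to the blob, so the fetches can run concurrently without any locking around the file.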
Example 2
In step S4, this embodiment uses Alibaba Cloud multipart upload.
Step S5, data delivery: the delivery of the target FastQ file is completed by a box delivery service method. As shown in fig. 14, the box delivery service method includes the following steps:
Step C1, accessing the box into the Internet;
step C2, carrying out information configuration related to the box and initializing a hard disk;
step C3, automatically registering the box to the box service network 100 after the box is networked;
step C4, when data is generated, the box service network 100 notifies the box to start a data download task.
In the above box delivery service method, in step C1, after the box is connected to the Internet, the PPPoE settings, the password, the intranet network card parameters and the communication public key of the box need to be configured;
in step C2, a mobile phone is used to scan the device QR code on the box housing to enter a configuration interface, where the information related to the box is configured;
in step C3, after going online the box is automatically registered in the box service network 100 according to its internal preset configuration.
After the data has been downloaded, it can be copied to wherever it is needed through an external mobile hard disk, or the directory of the box can be accessed over the network;
the information related to the box also contains operator-related information, and step C2 also includes binding the operator to the box.
The remainder is the same as in example 1.
Example 3
This embodiment provides a system for generating a target FastQ file from an undetermined FastQ file generated by an NGS, comprising:
The box delivery service system of embodiment 1 for completing delivery of the target FastQ file;
the database unit is used for storing original data generated by the NGS, sample information and data generated after analysis;
a processing unit comprising:
the split point generation and correction module is used for cutting the undetermined FastQ file generated by the NGS into a plurality of FastQ fragments according to a file size R; for the boundary point of each FastQ fragment obtained by cutting, a BGZF block boundary search is carried out within a range of ±M; the offset found by the search is used as the actual split point to obtain the segmented FastQ fragments; R is 1-20 GB and M is 32-128 KB;
the decompression filtering compression module is used for decompressing the segmented FastQ fragments, filtering by using a target barcode(s), and then compressing to obtain filtered FastQ fragments;
the merging module is used for merging the filtered FastQ fragments again according to the sequence to generate the target FastQ file;
the database unit is in communication with the processing unit; the box delivery service system is in communication with the database unit; the box delivery service system is in communication with the processing unit. Here R is a real number in the range 0.01-20 GB and is not limited to two decimal places.
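The dividing point correction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `is_bgzf_header` and `corrected_split_points` are hypothetical, and the header test assumes the common BGZF layout in which the `BC` extra subfield is the first subfield of the gzip extra field. Nominal cut points are placed every R bytes, and each is moved to the nearest BGZF block start found within ±M bytes.

```python
import os
import struct

BGZF_MAGIC = b"\x1f\x8b\x08\x04"  # gzip ID1, ID2, CM=deflate, FLG=FEXTRA
HEADER_LEN = 18                    # 12-byte gzip header + 6-byte 'BC' subfield


def is_bgzf_header(buf: bytes, pos: int) -> bool:
    """Return True if buf[pos:] looks like the start of a BGZF block.

    Assumption: the 'BC' subfield (SLEN = 2) is the first extra subfield.
    """
    if pos + HEADER_LEN > len(buf) or buf[pos:pos + 4] != BGZF_MAGIC:
        return False
    (xlen,) = struct.unpack_from("<H", buf, pos + 10)
    return xlen >= 6 and buf[pos + 12:pos + 16] == b"BC\x02\x00"


def corrected_split_points(path: str, r_bytes: int, m_bytes: int) -> list:
    """Cut every r_bytes, then snap each cut to a block start within +/- m_bytes."""
    size = os.path.getsize(path)
    points = [0]
    with open(path, "rb") as fh:
        for nominal in range(r_bytes, size, r_bytes):
            lo = max(0, nominal - m_bytes)
            fh.seek(lo)
            # Read a little past the window so a header at its edge is complete.
            window = fh.read(nominal + m_bytes - lo + HEADER_LEN)
            hits = [lo + i for i in range(len(window)) if is_bgzf_header(window, i)]
            hits = [h for h in hits if h > points[-1]]
            if not hits:
                raise ValueError(
                    f"no BGZF block start within +/-{m_bytes} bytes of {nominal}"
                )
            # The offset closest to the nominal cut becomes the actual dividing point.
            points.append(min(hits, key=lambda h: abs(h - nominal)))
    points.append(size)
    return points
```

Consecutive entries of the returned list delimit one fragment each. Because the magic bytes could in principle also occur inside compressed data, a stricter variant might additionally follow the BSIZE field of a candidate block and confirm that another block header starts right after it.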
The decompression-filtering-compression module is further configured to complete decompression, filtering and compression in a streaming manner within the same task, and the generated filtered FastQ fragments are stored in Alibaba Cloud OSS as files whose names end in part-${fragment number}.
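A streaming decompress, filter and compress pass over one fragment might look like the sketch below. It rests on several assumptions that the text does not state: the barcode is taken to be the last colon-separated field of the FASTQ header line (as in Illumina-style headers), matching is exact, and the output is written as plain gzip rather than BGZF. The function name `filter_fragment` is hypothetical, and the upload of the resulting part file to OSS is omitted.

```python
import gzip
from itertools import islice


def filter_fragment(src_path: str, dst_path: str, barcodes) -> int:
    """Stream one compressed fragment, keeping reads whose barcode is targeted.

    Assumption: the barcode is the last ':'-separated field of the header line.
    Returns the number of records kept.
    """
    targets = {b.encode() for b in barcodes}
    kept = 0
    # Python's gzip reader handles multi-member files, which BGZF files are.
    with gzip.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        while True:
            record = list(islice(src, 4))  # header, sequence, '+', qualities
            if len(record) < 4:
                break
            barcode = record[0].rstrip().rsplit(b":", 1)[-1]
            if barcode in targets:
                dst.writelines(record)
                kept += 1
    return kept
```

Only four lines are held in memory at a time, so the fragment size does not affect memory use. A call such as `filter_fragment("fragment-3.gz", "sample.fastq.gz.part-3", ["ACGTACGT"])` uses illustrative file names only.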
The decompression-filtering-compression module is further configured to place the three processes of decompression, filtering and compression in three different threads, which are connected to one another through queues.
The decompression-filtering-compression module is further configured to run a plurality of workers on each batch computing node, each worker consisting of three different threads responsible for decompression, filtering and compression respectively; the number of workers = number of virtual machine cores / 2.
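The three-thread worker and the worker count rule can be illustrated with the standard `queue` and `threading` modules. This is a sketch under assumptions: the names `run_worker` and `worker_count` are hypothetical, the queue size of 1024 records is arbitrary, the barcode is again assumed to be the last colon-separated header field, and error handling (a failing stage would leave the other stages waiting) is left out for brevity.

```python
import gzip
import os
import queue
import threading
from itertools import islice

_DONE = object()  # sentinel passed down the pipeline to signal end of input


def _decompress(src_path, out_q):
    with gzip.open(src_path, "rb") as src:
        while True:
            record = list(islice(src, 4))
            if len(record) < 4:
                break
            out_q.put(record)
    out_q.put(_DONE)


def _filter(in_q, out_q, targets):
    while (record := in_q.get()) is not _DONE:
        # Assumption: barcode is the last ':'-separated field of the header.
        if record[0].rstrip().rsplit(b":", 1)[-1] in targets:
            out_q.put(record)
    out_q.put(_DONE)


def _compress(in_q, dst_path):
    with gzip.open(dst_path, "wb") as dst:
        while (record := in_q.get()) is not _DONE:
            dst.writelines(record)


def run_worker(src_path, dst_path, barcodes):
    """One worker: three threads connected by two bounded queues."""
    targets = {b.encode() for b in barcodes}
    q1, q2 = queue.Queue(maxsize=1024), queue.Queue(maxsize=1024)
    threads = [
        threading.Thread(target=_decompress, args=(src_path, q1)),
        threading.Thread(target=_filter, args=(q1, q2, targets)),
        threading.Thread(target=_compress, args=(q2, dst_path)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def worker_count(cores=None):
    """Number of workers = number of virtual machine cores / 2 (at least 1)."""
    cores = cores or os.cpu_count() or 2
    return max(1, cores // 2)
```

The bounded queues apply back-pressure, so a slow compression stage throttles decompression instead of letting records pile up in memory.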
The merging module is further configured to generate an MD5 checksum together with the target FastQ file. The MD5 checksum is used to verify the authenticity of the target FastQ file.
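Merging and checksum generation can be done in a single pass, as in the sketch below. It relies on the fact that concatenated gzip members (and therefore concatenated BGZF or gzip fragments) form a valid gzip stream, so the parts can be joined byte for byte without recompression. The function name `merge_parts`, the `.md5` side-file and its `md5sum`-style layout are assumptions made for illustration.

```python
import hashlib
import os


def merge_parts(part_paths, target_path, chunk_size=1 << 20):
    """Concatenate part files in fragment order and return the MD5 hex digest.

    Assumption: each part name ends in 'part-<fragment number>'.
    """
    ordered = sorted(part_paths, key=lambda p: int(p.rsplit("-", 1)[1]))
    md5 = hashlib.md5()
    with open(target_path, "wb") as out:
        for part in ordered:
            with open(part, "rb") as src:
                while chunk := src.read(chunk_size):
                    out.write(chunk)
                    md5.update(chunk)  # hash while writing, no second read
    digest = md5.hexdigest()
    # Side-file in the same "<digest>  <name>" layout that md5sum prints.
    with open(target_path + ".md5", "w") as fh:
        fh.write(f"{digest}  {os.path.basename(target_path)}\n")
    return digest
```

Sorting by the numeric suffix rather than by name keeps `part-10` after `part-2`, which preserves the original read order in the target file.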
The merging module is also used for completing merging in a local computer.
The merging module is further configured to upload the merged data to the cloud without using multipart upload after the merging is completed on the local computer.
Example 4
In this embodiment, the merging module is further configured to upload the merged data to Alibaba Cloud using multipart upload after the merging is completed on the local computer; the remainder is the same as in embodiment 3.
Example 5
This embodiment provides a computer readable storage medium having stored therein executable instructions that when executed implement the method of generating a target FastQ file from an undetermined FastQ file generated from NGS described in embodiment 1 or 2.
Example 6
The present embodiment provides a terminal, including:
a memory for storing executable instructions;
a processor configured to implement the method of generating a target FastQ file from an undetermined FastQ file generated from NGS of embodiments 1 or 2 when executing executable instructions stored in the memory.
The foregoing describes in detail preferred embodiments of the present invention. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the invention without requiring creative effort by one of ordinary skill in the art. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by the person skilled in the art according to the inventive concept shall be within the scope of protection defined by the claims.

Claims (10)

1. A method of generating a target FastQ file from an undetermined FastQ file generated by NGS, comprising the steps of:
S1, preparing an original file and Index information to obtain an undetermined FastQ file generated by the NGS;
S2, generating and correcting the dividing points: dividing the undetermined FastQ file generated by the NGS into a plurality of FastQ fragments according to the file size R; for the boundary point of each FastQ fragment thus obtained, performing a BGZF boundary search within a range of ±M; dividing the FastQ fragments using the found offset as the actual dividing point to obtain divided FastQ fragments; R is 1-20 GB, M is 32-128 KB;
S3, decompression, filtering and compression: decompressing the FastQ fragments divided in step S2, filtering them using the target barcode(s), and then compressing them to obtain filtered FastQ fragments;
S4, merging: merging the filtered FastQ fragments obtained in step S3 again in sequence to generate the target FastQ file.
2. The method of generating a target FastQ file from an undetermined FastQ file generated by NGS according to claim 1, wherein the decompression, filtering and compression in step S3 are completed in a streaming manner within the same task, and the generated filtered FastQ fragments are stored in Alibaba Cloud OSS as files whose names end in part-${fragment number}.
3. The method of generating a target FastQ file from an undetermined FastQ file generated by NGS according to claim 1, wherein step S3 comprises placing the three processes of decompression, filtering and compression in three different threads, which are connected to one another through queues.
4. The method of generating a target FastQ file from an undetermined FastQ file generated by NGS according to claim 3, wherein step S3 further comprises running a plurality of workers on each batch computing node, each worker consisting of three different threads responsible for decompression, filtering and compression respectively; the number of workers = number of virtual machine cores / 2.
5. The method of generating a target FastQ file from an undetermined FastQ file generated by NGS according to claim 1, wherein step S4 further comprises generating an MD5 checksum together with the target FastQ file, the MD5 checksum being used to verify the authenticity of the target FastQ file.
6. The method of generating a target FastQ file from an undetermined FastQ file generated by NGS according to claim 1, wherein the merging of step S4 is completed on a local computer; and in step S4, after the merging is completed on the local computer, the merged data is uploaded to the cloud without using multipart upload.
7. The method of generating a target FastQ file from an undetermined FastQ file generated from NGS of claim 1, further comprising S5, data delivery: completing delivery of the target FastQ file through a box delivery service system; the box delivery service system comprises a box service network, a box cluster and a zookeeper service module, wherein a plurality of server nodes are arranged in the box service network, the box cluster comprises a plurality of box bodies, each box body is independently deployed and respectively connected with the box service network, the box service network is connected with a work order system through wireless to realize data interaction, and the zookeeper service module is used for managing the state of the box service network.
8. A system for generating a target FastQ file from an undetermined FastQ file generated by NGS, comprising:
the database unit is used for storing original data generated by the NGS, sample information and data generated after analysis;
a processing unit comprising:
the dividing point generation and correction module is used for dividing the undetermined FastQ file generated by the NGS into a plurality of FastQ fragments according to the file size R; for the boundary point of each FastQ fragment thus obtained, performing a BGZF boundary search within a range of ±M; and dividing the FastQ fragments using the found offset as the actual dividing point to obtain divided FastQ fragments; R is 1-20 GB, M is 32-128 KB;
the decompression-filtering-compression module is used for decompressing the divided FastQ fragments, filtering them using the target barcode(s), and then compressing them to obtain filtered FastQ fragments;
the merging module is used for merging the filtered FastQ fragments again according to the sequence to generate the target FastQ file;
the database unit is in communication with the processing unit.
9. A computer readable storage medium having stored therein executable instructions that when executed implement the method of generating a target FastQ file from an NGS-generated undetermined FastQ file as recited in any one of claims 1-7.
10. A terminal, the terminal comprising:
a memory for storing executable instructions;
a processor for implementing the method of generating a target FastQ file from NGS-generated undetermined FastQ files according to any one of claims 1-7 when executing executable instructions stored in said memory.
CN202210177362.2A 2022-02-24 2022-02-24 Method and system for generating target FastQ file from undetermined FastQ file generated by NGS Pending CN116701326A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210177362.2A CN116701326A (en) 2022-02-24 2022-02-24 Method and system for generating target FastQ file from undetermined FastQ file generated by NGS

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210177362.2A CN116701326A (en) 2022-02-24 2022-02-24 Method and system for generating target FastQ file from undetermined FastQ file generated by NGS

Publications (1)

Publication Number Publication Date
CN116701326A true CN116701326A (en) 2023-09-05

Family

ID=87831708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210177362.2A Pending CN116701326A (en) 2022-02-24 2022-02-24 Method and system for generating target FastQ file from undetermined FastQ file generated by NGS

Country Status (1)

Country Link
CN (1) CN116701326A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination