CN111767256B - Method for separating sample read data from fastq file - Google Patents

Method for separating sample read data from fastq file

Info

Publication number
CN111767256B
Authority
CN
China
Prior art keywords
sample
read data
data
queue
barcode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010442649.4A
Other languages
Chinese (zh)
Other versions
CN111767256A (en)
Inventor
黄俊松
文晋
邵艳军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Herui Exquisite Medical Laboratory Co ltd
Original Assignee
Beijing Herui Exquisite Medical Laboratory Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Herui Exquisite Medical Laboratory Co ltd
Priority to CN202010442649.4A
Publication of CN111767256A
Application granted
Publication of CN111767256B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0643 Management of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/065 Replication mechanisms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0659 Command handling arrangements, e.g. command buffers, queues, command scheduling
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

An embodiment of the invention provides a method for separating sample read data from a fastq file. The method comprises: concurrently loading a fastq file containing a plurality of samples through two threads, constructing read data and outputting the read data; parsing a barcode pair from the read data, identifying the sample to which the read data belongs according to the correspondence between the barcode pair and a sample number, and inserting the read data into the sample queue of that sample; and writing the read data in the sample queue into the output fastq file of the corresponding sample through asynchronous sample threads in an asynchronous sample thread pool. Because a plurality of threads work cooperatively in parallel, working efficiency is improved, the time needed to separate sample read data from fastq files is greatly shortened, and the computer's resources are better utilized, so that sample read data can be separated from fastq files rapidly.

Description

Method for separating sample read data from fastq file
Technical Field
Embodiments of the present invention relate generally to the field of gene sequencing and, more particularly, to a method of isolating sample read data from fastq files.
Background
In the field of gene sequencing, fastq is the most commonly used file format for storing the base sequences of genes together with their quality scores and related information. The off-machine data produced by a sequencer can be stored as fastq files after processing. To make the most of sequencers and sequencing kits, it is now common practice to load a mixture of multiple samples onto the sequencer and then output a fastq file that contains the gene data of all of those samples. Such multi-sample fastq files are typically very large, ranging from a few GB to tens or even hundreds of GB. For further gene sequence analysis, per-sample fastq files must be separated from such an original fastq file, i.e., the gene data of each sample is separated into its own fastq file (for double-ended sequencing, there are two separate fastq files per sample). The traditional method for separating sample read data uses a scripting language such as python to read the original fastq file line by line, parse the lines and construct each read, identify the sample to which the read belongs, and append the read to that sample's fastq file. This serial mode of operation, combined with a scripting language of poor performance, makes the process particularly lengthy; for example, when the off-machine fastq file is only a few GB in size, this approach can take nearly 1 hour to complete the gene data separation. When the off-machine data reaches tens or hundreds of GB, more than ten hours are needed to complete this most basic data splitting task.
Disclosure of Invention
According to an embodiment of the present invention, a scheme for separating sample read data from fastq files is provided.
In a first aspect of the present invention, a method of separating sample read data from fastq files is provided. The method comprises the following steps:
concurrently loading fastq files containing a plurality of samples through two threads, constructing read data and outputting the read data;
analyzing a barcode pair from the read data, identifying a sample to which the read data belongs according to the corresponding relation between the barcode pair and a sample number, and inserting the read data into a sample queue of the sample to which the read data belongs;
and writing read data in the sample queue into an output fastq file of a corresponding sample through asynchronous sample threads in an asynchronous sample thread pool.
Further, the concurrent loading of fastq files containing a plurality of samples by two threads, constructing read data and outputting, includes:
starting a first thread and a second thread, distributing a data block queue and setting the size of a data block;
in the first thread, loading the fastq file block by block according to the size of the data block, and inserting the loaded fastq block data into the tail of the data block queue;
in the second thread, taking out data blocks one by one from the head of the data block queue to obtain fastq block data;
and splitting the fastq block data into lines at line feed characters, constructing one piece of read data from every 4 consecutive lines to obtain a plurality of read data, and outputting the read data one by one in order.
Further, the parsing the barcode pair from the read data includes:
taking the first 8 characters of the second row of the read data as the barcode of the read data;
constructing a barcode pair according to the barcode;
in the single-ended sequencing case, there is one barcode, and the barcode is copied to obtain two identical barcodes serving as the barcode pair;
in the double-ended sequencing case, there are two barcodes, and the two barcodes are taken as the barcode pair.
Further, the identifying the sample to which the read data belongs according to the correspondence between the barcode pair and the sample number comprises:
grouping the barcodes in the read data to obtain a plurality of barcode groups; each barcode group comprises a plurality of different barcodes, and the two barcodes of any barcode pair are in the same group, so that a unique correspondence between the barcode pair and the barcode group is obtained;
defining a unique correspondence between each barcode group and a sample number, thereby obtaining a unique correspondence between the barcode pair and the sample number;
and identifying the sample to which the read data corresponding to the barcode pair belongs according to the unique correspondence between the barcode pair and the sample number.
Further, the inserting the read data into a sample queue of samples to which the read data belongs, comprises:
assigning a sample queue to each sample; the sample queue is used for storing read data of the same sample in sequence;
the read data is inserted into the tail of the sample queue of the sample to which it belongs.
Further, the method further comprises the following steps:
in the case of double-ended sequencing, there are two pieces of read data, namely read1 data and read2 data; the read2 data is associated with the read1 data;
the read1 data, carrying this association, is inserted into the tail of the sample queue of the sample to which it belongs.
Further, the method further comprises the following steps:
each time read data has been inserted into the sample queue of the corresponding sample, judging whether all read data has been obtained; if so, setting end marks for all sample queues; otherwise, returning to the step of inserting read data into the sample queue of the sample to which it belongs.
Further, the writing, by the asynchronous sample thread in the asynchronous sample thread pool, read data in the sample queue into the output fastq file of the corresponding sample includes:
in the asynchronous sample thread, judging whether the sample queue is empty and whether an end mark is set; if the sample queue is empty and the end mark is set, ending the current asynchronous thread; if the sample queue is empty and the end mark is not set, entering a waiting state until the sample queue is not empty or the end mark is set; if the sample queue is not empty, taking one piece of read data from the head of the sample queue;
performing reverse association on the acquired read data; if a reverse association result is obtained, writing the acquired read data and its reverse association result into the respective output fastq files of the corresponding sample; if no reverse association result is obtained, writing the acquired read data into the output fastq file of the corresponding sample.
Further, the asynchronous sample thread pool comprises a plurality of asynchronous sample threads, each asynchronous sample thread uniquely corresponds to one sample queue, and different asynchronous sample threads are mutually independent.
In a second aspect of the invention, an electronic device is provided. The electronic device includes: a memory and a processor, the memory having stored thereon a computer program, the processor implementing the method as described above when executing the program.
It should be understood that the description in this summary is not intended to limit the critical or essential features of the embodiments of the invention, nor is it intended to limit the scope of the invention. Other features of the present invention will become apparent from the description that follows.
According to the invention, fastq files containing a plurality of samples are loaded through a plurality of threads concurrently, read data of the plurality of samples are separated, and the read data are output to the fastq files through asynchronous operation of a plurality of sample threads corresponding to the samples; by utilizing a parallel working mode, a plurality of threads work cooperatively at the same time, so that the working efficiency is improved, the time consumption for separating sample read data from fastq files is greatly shortened, the performance utilization rate of a computer is improved, and the aim of rapidly separating the sample read data from the fastq files is fulfilled.
Drawings
The above and other features, advantages and aspects of embodiments of the present invention will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, wherein like or similar reference numerals denote like or similar elements, in which:
FIG. 1 is a flow chart of a method of separating sample read data from fastq files according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a concurrent loading of fastq files and outputting read data according to an embodiment of the present invention;
FIG. 3 is a diagram of a read data structure according to an embodiment of the present invention;
FIG. 4 is a flowchart of a process of identifying the sample to which the read data belongs, according to an embodiment of the invention;
FIG. 5 is a schematic diagram of the correspondence between a pair of barcode and a sample number according to an embodiment of the present invention;
FIG. 6 is a flow chart of inserting the read data into a sample queue of samples to which it belongs according to an embodiment of the invention;
FIG. 7 is a flow chart of writing read data in the sample queue to an output fastq file according to an embodiment of the present invention;
FIG. 8 is a flow chart of inserting the read data into a sample queue of a sample to which it belongs in a single-ended sequencing embodiment according to the present invention;
FIG. 9 is a flow chart of writing read data in the sample queue to an output fastq file in a single-ended sequencing embodiment according to the present invention;
FIG. 10 is a flow chart of inserting the read data into a sample queue of a sample to which it belongs in a double-ended sequencing embodiment according to the present invention;
FIG. 11 is a flow chart of writing read data in the sample queue to an output fastq file in a double-ended sequencing embodiment according to the present invention;
FIG. 12 is a block diagram of an exemplary electronic device capable of implementing embodiments of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In addition, the term "and/or" herein merely describes an association between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
In the invention, fastq files containing a plurality of samples are loaded through a plurality of threads concurrently, read data of the plurality of samples are separated, and the read data is output to the fastq files through asynchronous operation of a plurality of sample threads corresponding to the samples; by using a parallel working mode, a plurality of threads work cooperatively at the same time, so that the working efficiency is improved, the working time is shortened, the performance utilization rate of a computer is improved, and the aim of quickly separating sample read data from fastq files is fulfilled.
FIG. 1 illustrates a flow chart of a method of separating sample read data from fastq files according to an embodiment of the invention.
The method S100 includes:
S110, concurrently loading fastq files containing a plurality of samples through two threads, constructing read data and outputting the read data.
A fastq file containing multiple samples is typically very large, ranging from a few GB to tens or even hundreds of GB. For subsequent gene sequence analysis, per-sample fastq files must be separated from such an original fastq file, i.e., the gene data of each sample is separated into its own fastq file.
As an embodiment of the present invention, the method uses two threads, a first thread and a second thread; the first thread reads the fastq file data in blocks and inserts the data blocks it has read into a data block queue; the second thread takes data blocks out of the data block queue, parses the data in each block, and obtains read data for output. The process of concurrently loading the fastq file through the two threads, constructing read data and outputting the read data, as shown in FIG. 2, includes:
S111, allocating a first thread, a second thread and a data block queue, and setting the maximum number of data items allowed in the data block queue and the size of a data block. The data block queue holds a plurality of data blocks in order; when it is not empty, it has at least a head data block and a tail data block. The access logic of the data block queue is defined as storing at the tail and fetching from the head. A fixed size, for example 1 MB, is set for the data block; it equalizes the amount of fastq data loaded each time and serves as the data access size criterion.
S112, in the first thread, loading the fastq file block by block according to the data block size, and judging whether the data block queue has reached the maximum data item limit; if so, entering a waiting state until the number of data items in the data block queue is smaller than the limit, and then inserting the data block at the tail of the data block queue; otherwise, inserting the data block at the tail of the data block queue directly, the inserted data block becoming the new tail. Then judging whether the fastq file has been read completely; if so, setting the data block queue end mark and ending the first thread; if not, returning to wait for a data block to be allocated and continuing to load the fastq file.
S113, in the second thread, judging whether the data block queue is empty and whether the end mark is set; if the data block queue is empty and the end mark is set, ending the second thread; if the data block queue is empty and the end mark is not set, entering a waiting state until the data block queue is not empty or the end mark is set; if the data block queue is not empty, taking a data block from the head of the data block queue to obtain fastq block data.
The first thread and the second thread run in parallel: the first thread successively loads data blocks from the fastq file into the data block queue while the second thread takes data blocks from the head of the queue one by one and parses them in memory to obtain the read data. With this parallel working mode, the threads cooperate at the same time, which improves operating efficiency and greatly reduces running time.
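Steps S111 to S113 follow the familiar bounded producer-consumer pattern. The following is a minimal Python sketch of that pattern, not the patented implementation. The names `load_blocks`, `consume_blocks`, `run_pipeline` and `MAX_BLOCKS` are hypothetical, and a sentinel object placed in the queue stands in for the end mark described above.

```python
import queue
import threading

BLOCK_SIZE = 1 << 20   # 1 MB, the example data block size from S111
MAX_BLOCKS = 8         # hypothetical maximum number of queued data blocks
END = None             # sentinel standing in for the data block queue end mark


def load_blocks(stream, block_queue, block_size=BLOCK_SIZE):
    """First thread: read the stream block by block and append to the queue tail."""
    while True:
        block = stream.read(block_size)
        if not block:              # empty bytes means the file is fully read
            break
        block_queue.put(block)     # waits while the queue holds MAX_BLOCKS items
    block_queue.put(END)


def consume_blocks(block_queue, handle_block):
    """Second thread: take blocks from the queue head until the end mark arrives."""
    while True:
        block = block_queue.get()  # waits while the queue is empty
        if block is END:
            break
        handle_block(block)


def run_pipeline(stream, handle_block, block_size=BLOCK_SIZE):
    """Run the loader and the consumer concurrently and wait for both to finish."""
    block_queue = queue.Queue(maxsize=MAX_BLOCKS)
    loader = threading.Thread(target=load_blocks,
                              args=(stream, block_queue, block_size))
    parser = threading.Thread(target=consume_blocks,
                              args=(block_queue, handle_block))
    loader.start()
    parser.start()
    loader.join()
    parser.join()
```

Here `queue.Queue(maxsize=...)` supplies both waiting states: `put` blocks when the queue is full and `get` blocks when it is empty. The stream is expected to be a binary file object, for example one opened with `open(path, "rb")`.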
FIG. 3 is a diagram of a read data structure according to an embodiment of the present invention.
S114, splitting the fastq block data into lines at line feed characters, constructing one piece of read data from every 4 consecutive lines to obtain a plurality of read data, and outputting the read data one by one in order.
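The line splitting in S114 can be sketched as follows. Because blocks have a fixed size, a line or a read may straddle two blocks; carrying the unfinished remainder over to the next block is an assumption of this sketch, as the text does not say how block boundaries are handled. The class name `ReadBuilder` is hypothetical.

```python
class ReadBuilder:
    """Groups the lines of successive fastq blocks into 4-line reads."""

    def __init__(self):
        self._tail = b""    # unfinished line carried over from the previous block
        self._lines = []    # complete lines not yet grouped into a read

    def feed(self, block):
        """Return the reads completed by this block, in file order."""
        pieces = (self._tail + block).split(b"\n")
        self._tail = pieces.pop()      # last piece has no line feed yet
        self._lines.extend(pieces)
        reads = []
        while len(self._lines) >= 4:
            reads.append(tuple(self._lines[:4]))
            del self._lines[:4]
        return reads
```

Each returned read is a tuple of four lines: information, sequence, annotation and quality.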
In one embodiment of the present invention, as shown in FIG. 3 (a), four mutually non-contiguous pieces of memory are used to store the four lines of data of one read, wherein the first line is the information line, the second line is the gene sequence data, the third line is the annotation data, and the fourth line is the sequence quality data; the read data is output one by one in order in this memory format.
In another embodiment of the present invention, as shown in FIG. 3 (b), a single block of contiguous memory is used to store the complete four lines of data of a read. Such a block of contiguous memory is named READ, with start position 0 and end position end. Three index values, LF_pos1, LF_pos2 and LF_pos3, point to the positions of the line feed characters that end the first, second and third lines of the read data, respectively. The three line feed characters divide the block into 4 lines of data. The first line is the information line, denoted READ[0, LF_pos1), the left-closed, right-open interval from character 0 to character LF_pos1; the second line is the sequence line, denoted READ[LF_pos1+1, LF_pos2), the left-closed, right-open interval from character LF_pos1+1 to character LF_pos2; the third line is the annotation line, denoted READ[LF_pos2+1, LF_pos3), the left-closed, right-open interval from character LF_pos2+1 to character LF_pos3; the fourth line is the quality line, denoted READ[LF_pos3+1, end), the left-closed, right-open interval from character LF_pos3+1 to the READ end position end. The four lines of a read are thus stored complete in one block of contiguous memory, and the read data is output one by one in order.
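The contiguous layout can be illustrated with a short Python sketch. It assumes the read's bytes contain exactly three line feed characters and no trailing one; the class name `Read` and its property names are hypothetical. Slicing a `memoryview` returns a view onto the same buffer, so the four lines are exposed without being copied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    data: bytes     # the four lines in one contiguous buffer
    lf_pos1: int    # index of the line feed ending the information line
    lf_pos2: int    # index of the line feed ending the sequence line
    lf_pos3: int    # index of the line feed ending the annotation line

    @classmethod
    def from_bytes(cls, data):
        """Locate the three line feeds once; the buffer itself is not split."""
        p1 = data.index(b"\n")
        p2 = data.index(b"\n", p1 + 1)
        p3 = data.index(b"\n", p2 + 1)
        return cls(data, p1, p2, p3)

    def _view(self, start, stop):
        return memoryview(self.data)[start:stop]   # zero-copy slice

    @property
    def info(self):          # READ[0, LF_pos1)
        return self._view(0, self.lf_pos1)

    @property
    def sequence(self):      # READ[LF_pos1+1, LF_pos2)
        return self._view(self.lf_pos1 + 1, self.lf_pos2)

    @property
    def annotation(self):    # READ[LF_pos2+1, LF_pos3)
        return self._view(self.lf_pos2 + 1, self.lf_pos3)

    @property
    def quality(self):       # READ[LF_pos3+1, end)
        return self._view(self.lf_pos3 + 1, len(self.data))
```

Writing such a read to an output file needs only `data` plus one line feed, so no line has to be reassembled.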
After the data block is consumed, a request is issued to release the data block.
Optimizing the read storage structure greatly reduces the number of basic operations on read data. For example, assuming there are 1 billion reads in the original fastq file, the optimized read storage structure avoids, over the whole gene data separation, 1 billion × 4 = 4 billion line splitting and copying operations, as well as 1 billion operations that reassemble four lines into one contiguous block of memory. This eliminates a large amount of unnecessary CPU and memory consumption.
S120, analyzing a barcode pair from the read data, identifying a sample to which the read data belongs according to the corresponding relation between the barcode pair and the sample number, and inserting the read data into a sample queue of the sample to which the read data belongs.
Further, S121, the parsing the barcode pair from the read data includes:
S1211, taking the first 8 characters of the second line of the read data as the barcode of the read data.
S1212, constructing a barcode pair according to the barcode.
In the single-ended sequencing case, there is one barcode, and the barcode is copied to obtain two identical barcodes serving as the barcode pair;
in the double-ended sequencing case, there are two barcodes, and the two barcodes are taken as the barcode pair.
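Steps S1211 and S1212 can be sketched in a few lines of Python. The function names `barcode_of` and `barcode_pair` are hypothetical; the 8-character barcode length comes from S1211.

```python
BARCODE_LEN = 8   # S1211: the barcode is the first 8 characters of the sequence line


def barcode_of(sequence_line):
    """Return the barcode of a read, given its second (sequence) line."""
    return sequence_line[:BARCODE_LEN]


def barcode_pair(seq1, seq2=None):
    """Build the barcode pair.

    Single-ended: only seq1 is given and its barcode is duplicated.
    Double-ended: the pair is made of the barcodes of read1 and read2.
    """
    first = barcode_of(seq1)
    second = first if seq2 is None else barcode_of(seq2)
    return (first, second)
```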
Further, S122, the identifying, according to the correspondence between the barcode pair and the sample number, the sample to which the read data belongs, as shown in fig. 4, includes:
S1221, grouping the barcodes in the read data, as shown in FIG. 5, to obtain a plurality of barcode groups. Each barcode group contains a plurality of barcodes, and no barcode is repeated across groups; the two barcodes of any barcode pair are in the same group, so barcode pairs and barcode groups are in a many-to-one relationship. This gives a unique correspondence between a barcode pair and its barcode group, that is, any barcode pair locates a unique barcode group.
S1222, defining a unique correspondence between each barcode group and a sample number, thereby obtaining a unique correspondence between the barcode pair and the sample number;
S1223, identifying the sample to which the read data corresponding to the barcode pair belongs according to the unique correspondence between the barcode pair and the sample number.
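One way to realize the lookup in S1221 to S1223 is a dictionary from barcode to sample number. This is a sketch under two assumptions: each barcode group is keyed directly by its sample number, and a pair whose barcodes are unknown or fall in different groups is reported as `None` (the text does not say how such pairs are treated). The function names are hypothetical.

```python
def build_barcode_index(groups):
    """Map every barcode to the sample number of its group.

    groups: dict of sample number -> iterable of barcodes in that group.
    Raises ValueError if a barcode appears in more than one group.
    """
    index = {}
    for sample, barcodes in groups.items():
        for barcode in barcodes:
            if barcode in index:
                raise ValueError(f"barcode {barcode} appears in more than one group")
            index[barcode] = sample
    return index


def identify_sample(pair, index):
    """Return the sample number for a barcode pair, or None if it has no match."""
    first, second = pair
    sample = index.get(first)
    if sample is None or index.get(second) != sample:
        return None   # assumption: unmatched pairs are reported as None
    return sample
```

Because barcodes are not repeated across groups, each lookup is a constant-time dictionary access per read.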
Further, S123, inserting the read data into a sample queue of a sample to which the read data belongs, as shown in fig. 6, includes:
assigning a sample queue to each sample; the sample queue is used for storing read data of the same sample in sequence;
judging whether the sequencing is single-ended; if so, acquiring one piece of read data and inserting it into the tail of the corresponding sample queue; otherwise, obtaining read1 data and read2 data from the two fastq files r1 and r2 respectively, two pieces of read data in total, associating the read2 data with the read1 data, and inserting the associated read1 data into the tail of the corresponding sample queue.
As an embodiment of the present invention, each time read data has been inserted into the sample queue of the corresponding sample, it is judged whether all read data has been obtained; if so, end marks are set for all sample queues and the process of extracting read data ends; otherwise, execution returns to inserting read data into the sample queue of the sample to which it belongs.
Whether all read data has been obtained is judged from the end-of-file mark provided by the underlying file system; for example, when reading of the fastq file finishes, the file system returns an EOF mark. EOF is short for "end of file" and indicates that the file has been read completely.
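The insertion step, including the read2-to-read1 association and the end marks, might look like the sketch below. The class `ReadRecord`, the function `dispatch` and the `identify` callback are hypothetical, and a sentinel object placed at the tail of each queue stands in for the end mark.

```python
import queue

END = object()   # sentinel standing in for the per-queue end mark


class ReadRecord:
    """A read1 record that optionally carries its associated read2."""

    def __init__(self, text, mate=None):
        self.text = text
        self.mate = mate   # read2 text for double-ended data, None for single-ended


def dispatch(reads, sample_queues, identify):
    """Insert each read at the tail of its sample's queue, then mark all queues ended.

    reads: iterable of (read1_text, read2_text or None).
    sample_queues: dict of sample number -> queue.Queue.
    identify: callable (read1_text, read2_text) -> sample number (hypothetical).
    """
    for read1, read2 in reads:
        sample = identify(read1, read2)
        if sample in sample_queues:
            sample_queues[sample].put(ReadRecord(read1, read2))
    for sample_queue in sample_queues.values():
        sample_queue.put(END)   # all read data obtained: set every end mark
```

Only the read1 record enters the queue; its read2 travels with it through the `mate` attribute.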
S130, writing read data in the sample queue into an output fastq file of a corresponding sample through an asynchronous sample thread pool, as shown in FIG. 7, including:
In the asynchronous sample thread pool, one of the asynchronous sample threads acquires one piece of read data from the head of its corresponding sample queue. The asynchronous sample thread pool contains a plurality of asynchronous sample threads, each asynchronous sample thread corresponds to exactly one sample queue, and different asynchronous sample threads are independent of each other, i.e., they can run in parallel.
Reverse association is then performed on the acquired read data, that is, an attempt is made to find the read data associated with it. The attempt has two possible outcomes: a reverse association result is found, or it is not. If a result is found, the sequencing is double-ended, and the acquired read data and its reverse association result are both written into the output fastq files of the corresponding sample; if no result is found, the sequencing is single-ended, and the acquired read data is written into the output fastq file of the corresponding sample.
After the acquired read data has been written into the output fastq file of the corresponding sample, it is judged whether the current sample queue is empty. If the queue is empty and the end mark is set, the current asynchronous sample thread ends, which indicates that all read data in the fastq file has been taken out. If the queue is empty but no end mark is set, the queue merely holds no data at the moment while the fastq file still has read data that has not been consumed, so the thread enters a waiting state until the queue is not empty or the end mark is set. If the queue is not empty, execution returns to S130 to write the read data in the sample queue into the output fastq file of the corresponding sample.
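A minimal sketch of the writer side of S130 follows, with one thread per sample queue. It assumes each queue item is a `(read1, read2)` tuple in which `read2` is `None` for single-ended data, and it again uses a sentinel for the end mark. The function names `write_sample` and `start_writers` are hypothetical.

```python
import queue
import threading

END = object()   # sentinel standing in for the per-queue end mark


def write_sample(sample_queue, out1, out2=None):
    """One asynchronous sample thread: drain one queue into the sample's output."""
    while True:
        item = sample_queue.get()   # waits while the queue is empty
        if item is END:             # queue drained and end mark seen
            break
        read1, read2 = item         # read2 is None when nothing is associated
        out1.write(read1)
        if read2 is not None and out2 is not None:
            out2.write(read2)       # double-ended: read2 goes to the second file


def start_writers(jobs):
    """Start one independent thread per (sample_queue, out1, out2) job."""
    threads = [threading.Thread(target=write_sample, args=job) for job in jobs]
    for thread in threads:
        thread.start()
    return threads
```

The caller joins the returned threads after the end marks have been placed, then closes the output files.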
With this technical scheme, the time required to separate sample read data from fastq files is greatly shortened. The detailed performance comparison is given in the following table:
As shown in the above table, in the worst-case sample read data separation scenario, "double-ended fastq separation", the running time of the separation tool based on the present invention is on average more than 80% shorter than that of a conventional separation tool implemented in Python.
The invention provides distinct processing for single-ended sequencing and double-ended sequencing at every stage, so that it supports fast and efficient separation of sample read data from fastq files in both modes and ensures the correctness of the output data in both modes.
In some alternative implementations of the present embodiment, in the case of single-ended sequencing, the process of separating sample read data from fastq files according to the present invention is shown in figs. 1-5 and 8-9. As in the above embodiment, step S110 loads a fastq file.
In step S122, one barcode is obtained from the second line of the read data, and the barcode is copied to obtain two identical barcodes, which serve as the barcode pair.
As shown in fig. 8, in step S123, a sample queue is allocated to each sample; the sample queue stores the read data of the same sample in order. One piece of read data of the fastq file is obtained and inserted into the tail of the corresponding sample queue.
As shown in fig. 9, in step S130, in the asynchronous sample thread, one piece of read data is acquired from the head of the corresponding sample queue. There are a plurality of asynchronous sample threads; each asynchronous sample thread corresponds to exactly one sample queue, and the threads are independent of one another, so different asynchronous sample threads can run in parallel. Inverse association is performed on the acquired read data, that is, an attempt is made to obtain the read data associated with it. In the single-ended case no inverse association result is obtained, so only the acquired read data is written into the output fastq file of the corresponding sample.
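Steps S122 and S123 for the single-ended case can be sketched in Python as follows. The sketch assumes a read is held as a list of its four fastq lines and that the barcode is the first 8 characters of the second line, as stated in claim 2. The names `barcode_pair_single`, `dispatch` and `pair_to_sample` are hypothetical, and silently skipping reads whose barcode pair is unknown is an assumption, since the text does not say how such reads are handled.

```python
from collections import deque

BARCODE_LEN = 8  # claim 2: first 8 characters of the second line


def barcode_pair_single(read_lines):
    """Build a barcode pair from one 4-line read by duplicating its barcode."""
    barcode = read_lines[1][:BARCODE_LEN]
    return (barcode, barcode)


def dispatch(reads, pair_to_sample):
    """Append each read to the tail of its sample's queue, keeping input order."""
    queues = {sample: deque() for sample in set(pair_to_sample.values())}
    for read_lines in reads:
        sample = pair_to_sample.get(barcode_pair_single(read_lines))
        if sample is not None:            # assumption: unknown barcodes are skipped
            queues[sample].append(read_lines)
    return queues
```

Because every read of a sample lands in the same queue in input order, the per-sample writer threads can later drain the queues without coordinating with one another.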
In some alternative implementations of the present embodiment, in the case of double-ended sequencing, the process of separating sample read data from fastq files according to the present invention is shown in figs. 1-5 and 10-11. The two fastq files are r1 and r2, and the two pieces of read data that are output are read1 and read2, respectively. Each pair of read1 and read2 from the two fastq files has the same ID.
In step S122, two pieces of read data are obtained, one barcode is obtained from the second line of each piece of read data, and the two barcodes serve as the barcode pair.
As shown in fig. 10, in step S123, a sample queue is allocated to each sample; the sample queue stores the read data of the same sample in order. read1 and read2 are obtained from the two fastq files r1 and r2, read2 is associated with read1, and the associated read1 is then inserted into the tail of the corresponding sample queue.
As shown in fig. 11, in step S130, in the asynchronous sample thread, one piece of read1 data is acquired from the head of the corresponding sample queue. There are a plurality of asynchronous sample threads; each asynchronous sample thread corresponds to exactly one sample queue, and the threads are independent of one another, so different asynchronous sample threads can run in parallel. Inverse association is performed on the acquired read1 data; in the double-ended case it succeeds and the result is read2, and the read1 data and the inversely associated read2 data are each written into the output fastq file of the corresponding sample.
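The association and inverse association used in the double-ended case can be sketched in Python as follows. This is a hedged illustration only: the `Read` class and the functions `barcode_pair_paired`, `associate`, `inverse_associate` and `emit` are hypothetical names, the ID is assumed to be the first whitespace-separated token of the header line, and writing read1 and read2 to two separate output streams is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Read:
    lines: list                       # the four fastq lines of one read
    mate: Optional["Read"] = None     # associated read2, if any


def barcode_pair_paired(read1, read2, barcode_len=8):
    """One barcode from the second line of each read forms the pair."""
    return (read1.lines[1][:barcode_len], read2.lines[1][:barcode_len])


def associate(read1, read2):
    """Attach read2 to read1 so that only read1 needs to be queued."""
    if read1.lines[0].split()[0] != read2.lines[0].split()[0]:
        raise ValueError("read1 and read2 IDs differ")
    read1.mate = read2
    return read1


def inverse_associate(read):
    """Return the associated read2, or None for single-ended data."""
    return read.mate


def emit(read, out1, out2):
    """Write read1, and read2 when inverse association succeeds."""
    out1.write("\n".join(read.lines) + "\n")
    mate = inverse_associate(read)
    if mate is not None:
        out2.write("\n".join(mate.lines) + "\n")
```

Queuing only read1 with read2 attached keeps each pair together, so the two outputs stay in the same order without extra synchronization.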
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are alternative embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
An exemplary electronic device capable of implementing embodiments of the invention is shown in fig. 12.
The device 1200 includes a Central Processing Unit (CPU) 1201 that can perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 1202 or loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data required for the operation of the device 1200 may also be stored. The CPU 1201, ROM 1202, and RAM 1203 are connected to each other through a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
Various components in device 1200 are connected to I/O interface 1205, including: an input unit 1206 such as a keyboard, mouse, etc.; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208 such as a magnetic disk, an optical disk, or the like; and a communication unit 1209, such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The processing unit 1201 performs the respective methods and processes described above, for example, the method S100. For example, in some embodiments, the method S100 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1200 via ROM 1202 and/or communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the CPU 1201, one or more steps of the method S100 described above may be performed. Alternatively, in other embodiments, CPU 1201 may be configured to perform method S100 by any other suitable means (e.g., by means of firmware).
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a System on a Chip (SOC), a Complex Programmable Logic Device (CPLD), etc.
Program code for carrying out methods of the present invention may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the invention. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (9)

1. A method of separating sample read data from fastq files, comprising:
concurrently loading fastq files containing a plurality of samples through two threads, constructing read data and outputting the read data;
analyzing a barcode pair from the read data, identifying a sample to which the read data belongs according to the corresponding relation between the barcode pair and a sample number, and inserting the read data into a sample queue of the sample to which the read data belongs;
writing read data in the sample queue into an output fastq file of a corresponding sample through an asynchronous sample thread in an asynchronous sample thread pool; wherein,
concurrently loading fastq files containing a plurality of samples through two threads, constructing read data and outputting the read data comprises:
distributing a first thread, a second thread and a data block queue, and setting the maximum data item number limit of the data block queue and the size of the data block;
in the first thread, sending an allocation request and waiting for allocation of a data block of the data block size; in the first thread, reading the fastq file according to the data block size and putting the content that was read into the allocated data block; judging whether the data block queue has reached the maximum data item number limit; if so, entering a waiting state until the number of data items in the data block queue is smaller than the maximum data item number limit, and then inserting the data block into the tail of the data block queue; otherwise, inserting the data block into the tail of the data block queue directly; then judging whether the fastq file has been read completely; if so, setting an end mark on the data block queue and ending the first thread; if not, returning to waiting for allocation of a data block and continuing to load the fastq file;
judging whether the data block queue is empty and an ending mark is set in the second thread, and ending the second thread if the data block queue is empty and the ending mark is set; if the data block queue is empty and the end mark is not set, entering a waiting state until the data block queue is not empty or the end mark is set; if the data block queue is not empty, taking out the data block from the head of the data block queue to obtain fastq block data;
sequentially carrying out line feed analysis on the fastq block data to obtain a plurality of read data, and sequentially outputting the read data one by one;
after the data block is consumed, a request is issued to release the data block.
2. The method of claim 1, wherein parsing the pair of barcode from the read data comprises:
taking the first 8 characters of the second row of the read data as the barcode of the read data;
constructing a barcode pair according to the barcode;
in the case of single-ended sequencing, the number of barcodes is one, and the barcode is copied to obtain two identical barcodes, which serve as the barcode pair;
in the case of double-ended sequencing, the number of barcodes is two, and the two barcodes serve as the barcode pair.
3. The method of claim 1, wherein the identifying the sample to which the read data belongs according to the correspondence of the barcode pair to sample numbers comprises:
grouping the barcodes in the read data to obtain a plurality of barcode groups, wherein each barcode group comprises a plurality of different barcodes, and any barcode and the barcode with which it forms a barcode pair are in the same group, so that a unique corresponding relation between the barcode pair and the barcode group is obtained;
defining a unique corresponding relation between the barcode group and the sample number, thereby obtaining a unique corresponding relation between the barcode pair and the sample number;
and identifying the sample to which the read data corresponding to the barcode pair belongs according to the unique corresponding relation between the barcode pair and the sample number.
4. The method of claim 1, wherein inserting the read data into a sample queue of the sample to which it belongs comprises:
assigning a sample queue to each sample; the sample queue is used for storing read data of the same sample in sequence;
the read data is inserted into the tail of the sample queue of the sample to which it belongs.
5. The method as recited in claim 4, further comprising:
in the case of double-ended sequencing, the read data is two, namely read1 data and read2 data; associating the read2 data with read1 data;
the associated read1 data is inserted into the tail of the sample queue of the sample to which it belongs.
6. The method as recited in claim 1, further comprising:
each time after the read data is inserted into the sample queue of the corresponding sample, judging whether all the read data has been obtained; if so, setting end marks for all sample queues; otherwise, returning to the step of inserting the read data into the sample queue of the sample to which it belongs.
7. The method of claim 1, wherein writing read data in the sample queue into the output fastq file of the corresponding sample by an asynchronous sample thread in an asynchronous sample thread pool comprises:
judging whether the sample queue is empty and an ending mark is set in the asynchronous sample thread, and ending the current asynchronous thread operation if the sample queue is empty and the ending mark is set; if the sample queue is empty and an end flag is not set, entering a wait state until the sample queue is not empty or an end flag is set; if the sample queue is not empty, retrieving a read data from the sample queue head;
performing inverse association on the acquired read data, and if an inverse association result is obtained, writing the acquired read data and the inverse association result thereof into an output fastq file of a corresponding sample respectively; and if no inverse association result is obtained, writing the acquired read data into an output fastq file of a corresponding sample.
8. The method according to claim 1 or 7, wherein the asynchronous sample thread pool comprises a plurality of asynchronous sample threads, each asynchronous sample thread uniquely corresponds to one sample queue, and different asynchronous sample threads are independent from each other.
9. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program, characterized in that the processor, when executing the program, implements the method according to any of claims 1-8.
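The two-thread loading step recited in claim 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the claimed implementation: the standard library `queue.Queue` with `maxsize` stands in for the bounded data block queue, the `END` sentinel stands in for the end mark, the explicit allocation and release of data blocks is omitted, and the names `load_blocks`, `parse_reads` and `load_fastq` are hypothetical.

```python
import io
import queue
import threading

END = object()  # stands in for the data block queue's end mark


def load_blocks(stream, blocks, block_size):
    """First thread: read fixed-size blocks; put() waits while the queue is full."""
    while True:
        block = stream.read(block_size)
        if not block:                 # file fully read: set the end mark
            blocks.put(END)
            return
        blocks.put(block)


def parse_reads(blocks, emit):
    """Second thread: split blocks on line feeds and emit 4-line reads in order."""
    pending, lines = "", []
    while True:
        block = blocks.get()          # waits while the queue is empty
        if block is END:
            break
        pending += block
        *complete, pending = pending.split("\n")
        for line in complete:
            lines.append(line)
            if len(lines) == 4:
                emit(lines)
                lines = []
    if pending:                       # last line without a trailing line feed
        lines.append(pending)
        if len(lines) == 4:
            emit(lines)


def load_fastq(stream, block_size=1 << 20, max_blocks=8):
    """Run loader and parser concurrently and return the reads in file order."""
    blocks = queue.Queue(maxsize=max_blocks)
    reads = []
    loader = threading.Thread(target=load_blocks, args=(stream, blocks, block_size))
    parser = threading.Thread(target=parse_reads, args=(blocks, reads.append))
    loader.start()
    parser.start()
    loader.join()
    parser.join()
    return reads
```

For example, `load_fastq(io.StringIO(text), block_size=5, max_blocks=2)` parses a small in-memory fastq text even though the block boundaries fall in the middle of lines, because the unfinished tail of each block is carried over to the next one.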
CN202010442649.4A 2020-05-22 2020-05-22 Method for separating sample read data from fastq file Active CN111767256B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010442649.4A CN111767256B (en) 2020-05-22 2020-05-22 Method for separating sample read data from fastq file

Publications (2)

Publication Number Publication Date
CN111767256A (en) 2020-10-13
CN111767256B (en) 2023-10-20

Family

ID=72719645

Country Status (1)

Country Link
CN (1) CN111767256B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559020A (en) * 2013-11-07 2014-02-05 中国科学院软件研究所 Method for realizing parallel compression and parallel decompression on FASTQ file containing DNA (deoxyribonucleic acid) sequence read data
CN105760706A (en) * 2014-12-15 2016-07-13 深圳华大基因研究院 Compression method for next generation sequencing data
CN106096332A (en) * 2016-06-28 2016-11-09 深圳大学 Parallel fast matching method and system thereof towards the DNA sequence stored
CN106407743A (en) * 2016-08-31 2017-02-15 上海美吉生物医药科技有限公司 Cluster-based high-throughput data analyzing method
CN107180166A (en) * 2017-04-21 2017-09-19 北京希望组生物科技有限公司 A kind of full-length genome structure variation analysis method and system being sequenced based on three generations
CN107609350A (en) * 2017-09-08 2018-01-19 厦门极元科技有限公司 A kind of data processing method of two generations sequencing data analysis platform
CN108866051A (en) * 2018-06-19 2018-11-23 上海锐翌生物科技有限公司 Amplicon sequencing library and its construction method
CN109727644A (en) * 2018-11-12 2019-05-07 山东省医学科学院基础医学研究所 Venn figure production method and system based on microbial genome two generations sequencing data
CN110008262A (en) * 2019-02-02 2019-07-12 阿里巴巴集团控股有限公司 A kind of data export method and device
CN110033830A (en) * 2019-04-16 2019-07-19 苏州金唯智生物科技有限公司 A kind of data transmission method for uplink, device, equipment and storage medium
CN111061434A (en) * 2019-12-17 2020-04-24 人和未来生物科技(长沙)有限公司 Gene compression multi-stream data parallel writing and reading method, system and medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100323348A1 (en) * 2009-01-31 2010-12-23 The Regents Of The University Of Colorado, A Body Corporate Methods and Compositions for Using Error-Detecting and/or Error-Correcting Barcodes in Nucleic Acid Amplification Process
US10090857B2 (en) * 2010-04-26 2018-10-02 Samsung Electronics Co., Ltd. Method and apparatus for compressing genetic data
US9552458B2 (en) * 2012-03-16 2017-01-24 The Research Institute At Nationwide Children's Hospital Comprehensive analysis pipeline for discovery of human genetic variation

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
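A parallel biological sequence alignment method based on the ant colony algorithm; Li Juan; Tang Deyou; Fu Juan; Computer Engineering & Science (No. 09); 34-40 *
Research progress on parallel processing of biological sequence alignment on heterogeneous systems; Zhu Xiangyuan; Li Renfa; Li Kenli; Hu Zhongwang; Computer Science (No. S2); 399-404+408 *
Research progress on alignment algorithms for high-throughput sequencing data; Tao Ran; Song Xiaofeng; Computers and Applied Chemistry (No. 01); 47-54 *
Challenges of high-throughput computing in the analysis of genomic data from large-scale population cohorts; Zeng Jingyao; Yuan Na; Wei Wenjuan; Li Gen; Du Zhenglin; Frontiers of Data & Computing (No. 01); 121-131 *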

Also Published As

Publication number Publication date
CN111767256A (en) 2020-10-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 102206 room 602, 6 / F, building 4, courtyard 4, shengshengyuan Road, Huilongguan town, Changping District, Beijing (Changping Demonstration Park)

Applicant after: Beijing Herui precision medical device technology Co.,Ltd.

Address before: 102206 room 602, 6 / F, building 4, courtyard 4, shengshengyuan Road, Huilongguan town, Changping District, Beijing (Changping Demonstration Park)

Applicant before: Beijing Herui precision medical laboratory Co.,Ltd.

TA01 Transfer of patent application right

Effective date of registration: 20230913

Address after: Room 102 and Room 103, 1st Floor, Building 5, No. 4 Life Park Road, Life Science Park, Changping District, Beijing, 102206

Applicant after: Beijing Herui exquisite medical laboratory Co.,Ltd.

Address before: 102206 room 602, 6 / F, building 4, courtyard 4, shengshengyuan Road, Huilongguan town, Changping District, Beijing (Changping Demonstration Park)

Applicant before: Beijing Herui precision medical device technology Co.,Ltd.

GR01 Patent grant