CN111767256B

CN111767256B - Method for separating sample read data from fastq file

Info

Publication number: CN111767256B
Application number: CN202010442649.4A
Authority: CN
Inventors: 黄俊松; 文晋; 邵艳军
Original assignee: Beijing Herui Exquisite Medical Laboratory Co ltd
Current assignee: Beijing Herui Exquisite Medical Laboratory Co ltd
Priority date: 2020-05-22
Filing date: 2020-05-22
Publication date: 2023-10-20
Anticipated expiration: 2040-05-22
Also published as: CN111767256A

Abstract

The embodiment of the invention provides a method for separating sample read data from a fastq file, which comprises the steps of loading the fastq file containing a plurality of samples through two threads simultaneously, constructing read data and outputting the read data; analyzing a barcode pair from the read data, identifying a sample to which the read data belongs according to the corresponding relation between the barcode pair and a sample number, and inserting the read data into a sample queue of the sample to which the read data belongs; and writing read data in the sample queue into an output fastq file of a corresponding sample through asynchronous sample threads in an asynchronous sample thread pool. In this way, by using a parallel working mode, a plurality of threads work cooperatively at the same time, so that the working efficiency is improved, the time consumption for separating sample read data from fastq files is greatly shortened, the performance utilization rate of a computer is improved, and the aim of rapidly separating sample read data from fastq files is fulfilled.

Description

Method for separating sample read data from fastq file

Technical Field

Embodiments of the present invention relate generally to the field of gene sequencing and, more particularly, to a method of isolating sample read data from fastq files.

Background

In the field of gene sequencing, fastq is the most commonly used file format for storing base sequences together with their quality scores and related information. The off-machine data of a sequencer can be stored as fastq files after processing. To make the fullest use of sequencers and sequencing reagent kits, it is now common practice to pool multiple samples in a single sequencing run and then output a fastq file that contains the gene data of all of those samples. Such fastq files containing multiple samples are typically very large, ranging from a few GB to tens or hundreds of GB. For further gene sequence analysis, it is necessary to separate per-sample fastq files from such an original fastq file, i.e., to separate the gene data of each sample into its own fastq file (for double-ended sequencing, there are two separate fastq files per sample). The traditional method for separating sample read data is to read the original fastq file line by line using a scripting language such as python, parse and construct each read, identify the sample to which the read belongs, and append the read to the fastq file of that sample. This serial mode of operation, combined with the use of a scripting language with poor performance, makes the separation of sample read data from the fastq file particularly slow; for example, when the off-machine fastq file is only a few GB in size, this approach can take nearly 1 hour to complete the gene data separation. When the off-machine data reaches tens or hundreds of GB, more than ten hours are needed to complete this most basic data splitting task.

Disclosure of Invention

According to an embodiment of the present invention, a scheme for separating sample read data from fastq files is provided.

In a first aspect of the present invention, a method of separating sample read data from fastq files is provided. The method comprises the following steps:

concurrently loading fastq files containing a plurality of samples through two threads, constructing read data and outputting the read data;

analyzing a barcode pair from the read data, identifying a sample to which the read data belongs according to the corresponding relation between the barcode pair and a sample number, and inserting the read data into a sample queue of the sample to which the read data belongs;

and writing read data in the sample queue into an output fastq file of a corresponding sample through asynchronous sample threads in an asynchronous sample thread pool.

Further, the concurrent loading of fastq files containing a plurality of samples by two threads, constructing read data and outputting, includes:

starting a first thread and a second thread, distributing a data block queue and setting the size of a data block;

in the first thread, loading the fastq file block by block according to the size of the data block, and inserting the loaded fastq block data into the tail of the data block queue;

in the second thread, taking out data blocks one by one from the head of the data block queue to obtain fastq block data;

and carrying out line feed analysis on the fastq block data according to line feed symbols, sequentially constructing read data for every 4 lines of data to obtain a plurality of read data, and outputting the read data one by one in sequence.

Further, the parsing the barcode pair from the read data includes:

taking the first 8 characters of the second row of the read data as the barcode of the read data;

constructing a barcode pair according to the barcode;

in the case of single-ended sequencing, there is one barcode, and the barcode is copied to obtain two identical barcodes, which serve as the barcode pair;

in the case of double-ended sequencing, there are two barcodes, and the two barcodes are taken as the barcode pair.

Further, the identifying the sample of the read data according to the corresponding relation of the barcode pair and the sample number comprises:

grouping the barcodes in the read data to obtain a plurality of barcode groups; each barcode group comprises a plurality of different barcodes, and any barcode and the barcode paired with it are in the same group, so that a unique correspondence between the barcode pair and the barcode group is obtained;

defining a unique correspondence between the barcode group and the sample number to obtain a unique correspondence between the barcode pair and the sample number;

and identifying the sample to which the read data corresponding to the barcode pair belongs according to the unique correspondence between the barcode pair and the sample number.

Further, the inserting the read data into a sample queue of samples to which the read data belongs, comprises:

assigning a sample queue to each sample; the sample queue is used for storing read data of the same sample in sequence;

the read data is inserted into the tail of the sample queue of the sample to which it belongs.

Further, the method further comprises the following steps:

in the case of double-ended sequencing, there are two pieces of read data, namely read1 data and read2 data; the read2 data is associated with the read1 data;

the associated read1 data is inserted into the tail of the sample queue of the sample to which it belongs.

Further, the method further comprises the following steps:

after each time the read data is inserted into the sample queue of the corresponding sample, judging whether all of the read data has been obtained; if so, setting end marks for all sample queues; otherwise, returning to the step of inserting the read data into the sample queue of the sample to which it belongs.

Further, the writing, by the asynchronous sample thread in the asynchronous sample thread pool, read data in the sample queue into the output fastq file of the corresponding sample includes:

judging, in the asynchronous sample thread, whether the sample queue is empty and whether the end mark is set; if the sample queue is empty and the end mark is set, ending the current asynchronous sample thread; if the sample queue is empty and the end mark is not set, entering a waiting state until the sample queue is not empty or the end mark is set; if the sample queue is not empty, retrieving one piece of read data from the head of the sample queue;

performing inverse association on the acquired read data; if an inverse association result is obtained, writing the acquired read data and its inverse association result into the output fastq files of the corresponding sample respectively; and if no inverse association result is obtained, writing the acquired read data into the output fastq file of the corresponding sample.

Further, the asynchronous sample thread pool comprises a plurality of asynchronous sample threads, each asynchronous sample thread uniquely corresponds to one sample queue, and different asynchronous sample threads are mutually independent.

In a second aspect of the invention, an electronic device is provided. The electronic device includes: a memory and a processor, the memory having stored thereon a computer program, the processor implementing the method as described above when executing the program.

It should be understood that the description in this summary is not intended to identify key or essential features of the embodiments of the invention, nor is it intended to limit the scope of the invention. Other features of the present invention will become apparent from the description that follows.

According to the invention, fastq files containing a plurality of samples are loaded through a plurality of threads concurrently, read data of the plurality of samples are separated, and the read data are output to the fastq files through asynchronous operation of a plurality of sample threads corresponding to the samples; by utilizing a parallel working mode, a plurality of threads work cooperatively at the same time, so that the working efficiency is improved, the time consumption for separating sample read data from fastq files is greatly shortened, the performance utilization rate of a computer is improved, and the aim of rapidly separating the sample read data from the fastq files is fulfilled.

Drawings

The above and other features, advantages and aspects of embodiments of the present invention will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference numerals denote like or similar elements, in which:

FIG. 1 is a flow chart of a method of separating sample read data from fastq files according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a concurrent loading of fastq files and outputting read data according to an embodiment of the present invention;

FIG. 3 is a diagram of a read data structure according to an embodiment of the present invention;

FIG. 4 is a flowchart of a process of identifying the sample to which the read data belongs, according to an embodiment of the invention;

FIG. 5 is a schematic diagram of the correspondence between a pair of barcode and a sample number according to an embodiment of the present invention;

FIG. 6 is a flow chart of inserting the read data into a sample queue of samples to which it belongs according to an embodiment of the invention;

FIG. 7 is a flow chart of writing read data in the sample queue to an output fastq file according to an embodiment of the present invention;

FIG. 8 is a flow chart of inserting the read data into a sample queue of a sample to which it belongs in a single-ended sequencing embodiment according to the present invention;

FIG. 9 is a flow chart of writing read data in the sample queue to an output fastq file in a single-ended sequencing embodiment according to the present invention;

FIG. 10 is a flow chart of inserting the read data into a sample queue of a sample to which it belongs in a double-ended sequencing embodiment according to the present invention;

FIG. 11 is a flow chart of writing read data in the sample queue to an output fastq file in a double-ended sequencing embodiment according to the present invention;

FIG. 12 is a block diagram of an exemplary electronic device capable of implementing embodiments of the invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In addition, the term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.

In the invention, fastq files containing a plurality of samples are loaded through a plurality of threads concurrently, read data of the plurality of samples are separated, and the read data is output to the fastq files through asynchronous operation of a plurality of sample threads corresponding to the samples; by using a parallel working mode, a plurality of threads work cooperatively at the same time, so that the working efficiency is improved, the working time is shortened, the performance utilization rate of a computer is improved, and the aim of quickly separating sample read data from fastq files is fulfilled.

FIG. 1 illustrates a flow chart of a method of separating sample read data from fastq files according to an embodiment of the invention.

The method S100 includes:

S110, concurrently loading fastq files containing a plurality of samples through two threads, constructing read data and outputting the read data.

The fastq file containing multiple samples is typically very large, ranging from a few GB to tens or hundreds of GB. For subsequent gene sequence analysis, it is necessary to separate per-sample fastq files from such an original fastq file, i.e., to separate the gene data of each sample into individual fastq files.

As an embodiment of the present invention, the present method uses two threads, a first thread and a second thread; the first thread reads the fastq file data in blocks and inserts the data blocks it has read into a data block queue; the second thread takes data blocks out of the data block queue, parses the data in each data block, and obtains read data for output. The process of loading fastq files through the two threads concurrently, constructing read data and outputting the read data, as shown in fig. 2, includes:

S111, allocating a first thread, a second thread and a data block queue, and setting the maximum number of data items allowed in the data block queue and the size of a data block. The data block queue holds a plurality of data blocks arranged in sequence; when it is not empty, it has at least a head data block and a tail data block. The access logic of the data block queue is defined as storing at the tail and fetching from the head. A fixed size, for example 1 MB, is set for the data block; it serves as the data access size standard that equalizes the amount of fastq block data loaded each time.

S112, in the first thread, loading the fastq file block by block according to the size of the data block, and judging whether the data block queue has reached the maximum data item number limit; if so, entering a waiting state until the number of data items in the data block queue is smaller than the limit, and then inserting the data block into the tail of the data block queue; otherwise, inserting the data block into the tail of the data block queue directly, so that the inserted data block becomes the tail of the queue. Then judging whether the fastq file has been read completely; if so, setting the end mark of the data block queue and ending the first thread; if not, returning to wait for a data block to be allocated and continuing to load the fastq file.

S113, in the second thread, judging whether the data block queue is empty and whether the end mark is set; if the data block queue is empty and the end mark is set, ending the second thread; if the data block queue is empty and the end mark is not set, entering a waiting state until the data block queue is not empty or the end mark is set; and if the data block queue is not empty, taking out the data block from the head of the data block queue to obtain fastq block data.

The first thread and the second thread are processed simultaneously in parallel, namely the first thread successively loads the data blocks in the fastq file into the data block queue, and simultaneously the second thread sequentially extracts the data blocks from the head of the data block queue one by one and analyzes the data blocks in the memory, and the read data are analyzed from the data blocks through an analysis process. By using a parallel working mode, the multithreading is operated cooperatively at the same time, so that the operating efficiency is improved, and the operating time is greatly reduced.
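By way of illustration only, the cooperation of the two threads in S111 to S113 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the names `load_blocks`, `consume_blocks`, `run`, `BLOCK_SIZE` and `MAX_ITEMS` are hypothetical, a bounded `queue.Queue` stands in for the data block queue, and a `None` sentinel stands in for the end mark.

```python
import io
import queue
import threading

BLOCK_SIZE = 1 << 20   # 1 MB per data block, the example size given in S111
MAX_ITEMS = 8          # assumed maximum number of data items in the queue
END = None             # sentinel standing in for the queue end mark


def load_blocks(stream, blocks, block_size=BLOCK_SIZE):
    """First thread: read the stream block by block and insert at the queue tail."""
    while True:
        block = stream.read(block_size)
        if not block:            # an empty read means the file is exhausted
            blocks.put(END)      # set the end mark for the second thread
            return
        blocks.put(block)        # waits while the queue is at its item limit


def consume_blocks(blocks, out):
    """Second thread: take blocks from the queue head until the end mark is seen."""
    while True:
        block = blocks.get()     # waits while the queue is empty
        if block is END:
            return
        out.append(block)        # a real tool would parse read data here


def run(stream, block_size=BLOCK_SIZE, max_items=MAX_ITEMS):
    """Start both threads, wait for them, and return the blocks in arrival order."""
    blocks = queue.Queue(maxsize=max_items)
    out = []
    first = threading.Thread(target=load_blocks, args=(stream, blocks, block_size))
    second = threading.Thread(target=consume_blocks, args=(blocks, out))
    first.start()
    second.start()
    first.join()
    second.join()
    return out


if __name__ == "__main__":
    demo = io.BytesIO(b"@r1\nACGT\n+\nIIII\n" * 3)
    print(len(run(demo, block_size=16)), "blocks loaded")
```

The bounded queue gives the waiting behaviour of S112 for free: `put` blocks the first thread while the queue is full, and `get` blocks the second thread while it is empty.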

FIG. 3 is a diagram of a read data structure according to an embodiment of the present invention.

S114, splitting the fastq block data into lines at line feed characters, constructing one piece of read data from every 4 consecutive lines to obtain a plurality of read data, and outputting the read data one by one in sequence.
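By way of illustration only, the line splitting of S114 can be sketched in Python as follows. The function name `reads_from_blocks` is hypothetical. The text does not state how a read that straddles two data blocks is handled; the sketch assumes that the unfinished tail of each block is carried over and prepended to the next block.

```python
def reads_from_blocks(blocks):
    """Yield one read (four lines joined by line feeds) at a time.

    Assumption: a block boundary may fall in the middle of a read, so
    incomplete data is kept in `pending` until the next block arrives.
    """
    pending = b""
    for block in blocks:
        pending += block
        lines = pending.split(b"\n")
        pending = lines.pop()                 # text after the last line feed
        usable = len(lines) - len(lines) % 4  # number of lines forming whole reads
        for i in range(0, usable, 4):
            yield b"\n".join(lines[i:i + 4])
        leftover = lines[usable:]             # complete lines of an unfinished read
        if leftover:
            pending = b"\n".join(leftover) + b"\n" + pending
    # a final read may lack a trailing line feed
    tail = pending.split(b"\n")
    if len(tail) == 4 and tail[3]:
        yield pending
```

A caller would feed this generator with the blocks taken from the head of the data block queue, for example `for read in reads_from_blocks(blocks): ...`.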

In one embodiment of the present invention, as shown in fig. 3 (a), four mutually non-contiguous memory segments are used to store the four lines of data of one read, wherein the first line is the information line, the second line is the gene sequence data, the third line is the annotation data, and the fourth line is the sequence quality data; the read data are output one by one in sequence in this read data memory format.

In another embodiment of the present invention, as shown in FIG. 3 (b), a single block of contiguous memory is used to store the complete four lines of data of a read. Such a block of contiguous memory is named READ, with a start position of 0 and an end position of end. Three index values, LF_pos1, LF_pos2 and LF_pos3, point to the positions of the line feeds that end the first, second and third lines of the read data, respectively. The three line feeds divide the block into 4 lines of data: the first line is the information line, denoted READ[0, LF_pos1), i.e. the left-closed right-open interval from character 0 to character LF_pos1; the second line is the sequence line, denoted READ[LF_pos1+1, LF_pos2), i.e. the left-closed right-open interval from character LF_pos1+1 to character LF_pos2; the third line is the annotation line, denoted READ[LF_pos2+1, LF_pos3), i.e. the left-closed right-open interval from character LF_pos2+1 to character LF_pos3; the fourth line is the quality line, denoted READ[LF_pos3+1, end), i.e. the left-closed right-open interval from character LF_pos3+1 to the READ end position end. In this way a single block of contiguous memory completely stores the four lines of data of the read, and the read data are output one by one in sequence.
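By way of illustration only, the contiguous storage structure with three line feed indices can be sketched in Python as follows. The class name `ReadBuffer` and its accessor names are hypothetical, and the sketch assumes the buffer holds exactly one read without a trailing line feed. `memoryview` slicing is used because it exposes an interval of the buffer without copying the underlying bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadBuffer:
    """One read in a single contiguous buffer plus three line feed indices."""
    buf: bytes       # four lines joined by line feeds, no trailing line feed
    lf_pos1: int     # position of the line feed ending the information line
    lf_pos2: int     # position of the line feed ending the sequence line
    lf_pos3: int     # position of the line feed ending the annotation line

    @classmethod
    def from_bytes(cls, buf):
        """Locate the three line feeds once; the four lines are never split apart."""
        p1 = buf.index(b"\n")
        p2 = buf.index(b"\n", p1 + 1)
        p3 = buf.index(b"\n", p2 + 1)
        return cls(buf, p1, p2, p3)

    def _view(self, start, stop):
        # left-closed, right-open interval [start, stop) of the buffer
        return memoryview(self.buf)[start:stop]

    @property
    def info(self):
        return self._view(0, self.lf_pos1)

    @property
    def sequence(self):
        return self._view(self.lf_pos1 + 1, self.lf_pos2)

    @property
    def annotation(self):
        return self._view(self.lf_pos2 + 1, self.lf_pos3)

    @property
    def quality(self):
        return self._view(self.lf_pos3 + 1, len(self.buf))
```

Because the whole read stays in `buf`, writing it to an output file is a single write of `buf` followed by a line feed, with no need to reassemble the four lines.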

After the data block is consumed, a request is issued to release the data block.

By optimizing the read storage structure, the number of basic operations on read data is greatly reduced. For example, assuming that there are 1 billion reads in the original fastq file, using the optimized read storage structure saves, over the whole gene data separation process, 1 billion × 4 = 4 billion line splitting and copying operations, as well as 1 billion operations that reassemble four lines of data into one contiguous memory block. This frees up a large amount of unnecessary CPU and memory consumption.

S120, analyzing a barcode pair from the read data, identifying a sample to which the read data belongs according to the corresponding relation between the barcode pair and the sample number, and inserting the read data into a sample queue of the sample to which the read data belongs.

Further, S121, the parsing the barcode pair from the read data includes:

S1211, taking the first 8 characters of the second line of the read data as the barcode of the read data.

S1212, constructing a barcode pair according to the barcode.
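By way of illustration only, S1211 and S1212 can be sketched in Python as follows. The names `BARCODE_LEN`, `barcode_of` and `barcode_pair` are hypothetical, and each read is assumed to be given as text with its four lines joined by line feeds. In the single-ended case the one barcode is duplicated; in the double-ended case one barcode is taken from each of the two reads.

```python
BARCODE_LEN = 8  # the first 8 characters of the sequence line, per S1211


def barcode_of(read):
    """Return the barcode of a four-line read given as text."""
    sequence_line = read.split("\n")[1]   # the second line holds the bases
    return sequence_line[:BARCODE_LEN]


def barcode_pair(read1, read2=None):
    """Build the barcode pair for single-ended (read2 is None) or double-ended data."""
    first = barcode_of(read1)
    second = first if read2 is None else barcode_of(read2)
    return (first, second)
```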

Further, S122, the identifying, according to the correspondence between the barcode pair and the sample number, the sample to which the read data belongs, as shown in fig. 4, includes:

S1221, grouping the barcodes in the read data, as shown in FIG. 5, to obtain a plurality of barcode groups. Each barcode group contains a plurality of barcodes, no barcode is repeated across the groups, and any barcode and the barcode paired with it are in the same group. Barcode pairs and the barcode group to which they belong are in a many-to-one relationship, which gives a unique correspondence between a barcode pair and a barcode group, i.e., any barcode pair locates exactly one barcode group.

S1222, defining a unique correspondence between the barcode group and the sample number to obtain a unique correspondence between the barcode pair and the sample number;

S1223, identifying the sample to which the read data corresponding to the barcode pair belongs according to the unique correspondence between the barcode pair and the sample number.
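By way of illustration only, the correspondence built in S1221 to S1223 can be sketched in Python as follows. The names `build_lookup` and `sample_of` are hypothetical. The text does not say what happens to a barcode pair that matches no group; returning `None` in that case is an assumption of this sketch.

```python
def build_lookup(groups):
    """Map every barcode to its sample number.

    `groups` maps a sample number to the barcodes of its barcode group.
    Barcodes must not repeat across groups, so each barcode locates one group.
    """
    barcode_to_sample = {}
    for sample, barcodes in groups.items():
        for barcode in barcodes:
            if barcode in barcode_to_sample:
                raise ValueError(f"barcode {barcode} appears in more than one group")
            barcode_to_sample[barcode] = sample
    return barcode_to_sample


def sample_of(pair, barcode_to_sample):
    """Return the sample number for a barcode pair, or None if it is not identified."""
    first = barcode_to_sample.get(pair[0])
    second = barcode_to_sample.get(pair[1])
    # both barcodes of a valid pair lie in the same barcode group
    if first is not None and first == second:
        return first
    return None
```

Since many barcode pairs map to one group and each group maps to one sample number, a plain dictionary lookup per barcode is enough to identify the sample.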

Further, S123, inserting the read data into a sample queue of a sample to which the read data belongs, as shown in fig. 6, includes:

judging whether the sequencing is single-ended; if so, acquiring one piece of read data and inserting it into the tail of the corresponding sample queue; otherwise, obtaining read1 data and read2 data from the two fastq files r1 and r2 respectively, two pieces of read data in total, associating the read2 data with the read1 data, and inserting the associated read1 data into the tail of the corresponding sample queue.

As an embodiment of the present invention, after each time the read data is inserted into the sample queue of the corresponding sample, it is judged whether all of the read data has been obtained; if so, end marks are set for all sample queues and the process of extracting read data ends; otherwise, execution returns to inserting the read data into the sample queue of the sample to which it belongs.

Whether all of the read data has been obtained is judged by identifying the end-of-file mark provided by the underlying file system; for example, when reading of the fastq file is finished, the EOF mark returned by the file system is obtained. EOF is the English abbreviation of "end of file" and indicates that the file has been read completely.
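By way of illustration only, the insertion of read data into sample queues in S123, including the association of read2 with read1 and the setting of end marks, can be sketched in Python as follows. The names `QueuedRead`, `make_queues`, `dispatch` and `END` are hypothetical; the association is modelled as a `mate` attribute on read1, and the end mark is modelled as a sentinel object placed at the tail of each queue.

```python
import queue
from dataclasses import dataclass
from typing import Optional

END = object()  # sentinel standing in for the end mark of a sample queue


@dataclass
class QueuedRead:
    text: str
    mate: Optional["QueuedRead"] = None   # read2 associated with read1, if any


def make_queues(samples):
    """Assign one sample queue to each sample."""
    return {sample: queue.Queue() for sample in samples}


def dispatch(items, sample_queues):
    """Insert each read at the tail of its sample queue, then mark every queue as ended.

    `items` yields (sample, read1, read2) tuples; read2 is None for single-ended data.
    """
    for sample, read1, read2 in items:
        if read2 is not None:
            read1.mate = read2             # associate read2 with read1
        sample_queues[sample].put(read1)   # only read1 enters the queue
    for q in sample_queues.values():       # all read data obtained
        q.put(END)
```

Placing only read1 in the queue keeps each pair together, so a consumer that takes read1 from the head can recover read2 through the association.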

S130, writing read data in the sample queue into an output fastq file of a corresponding sample through an asynchronous sample thread pool, as shown in FIG. 7, including:

in the asynchronous sample thread pool, a read data is obtained from the head of a sample queue corresponding to one of the asynchronous sample threads. The asynchronous sample thread pool is provided with a plurality of asynchronous sample threads, each asynchronous sample thread only corresponds to one sample queue, and different asynchronous sample threads are mutually independent, namely different asynchronous sample threads can be processed in parallel.

The acquired read data is then inversely associated, i.e., an attempt is made to obtain, through the acquired read data, the read data associated with it. The attempt has two possible outcomes: either an inverse association result is obtained, or it is not. If an inverse association result is obtained, the sequencing is double-ended, and the acquired read data and its inverse association result are each written into the output fastq file of the corresponding sample; if no inverse association result is obtained, the sequencing is single-ended, and only the acquired read data is written into the output fastq file of the corresponding sample.

After the acquired read data has been written into the output fastq file of the corresponding sample, it is judged whether the current sample queue is empty and whether the end mark is set. If the current queue is empty and the end mark is set, all read data in the fastq file has been fetched, and the current asynchronous sample thread ends. If the current queue is empty but no end mark is set, there is merely no data in the queue at the moment while the fastq file still has unconsumed read data, so the thread enters a waiting state until the current queue is not empty or the end mark is set. If the current queue is not empty, the thread returns to S130 and continues writing the read data in the sample queue into the output fastq file of the corresponding sample.
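By way of illustration only, the asynchronous sample threads of S130 can be sketched in Python as follows. The names `write_sample`, `start_writers` and `END` are hypothetical. Each queue item is assumed to be a `(read1, read2)` tuple in which `read2` plays the role of the inverse association result and is `None` for single-ended data; the end mark is modelled as a sentinel at the tail of the queue, and the outputs are any writable text streams.

```python
import io
import queue
import threading

END = object()  # sentinel standing in for the end mark


def write_sample(sample_queue, out1, out2):
    """One asynchronous sample thread: drain one queue into the sample's output(s)."""
    while True:
        item = sample_queue.get()     # waits while the queue is empty
        if item is END:               # queue drained and end mark set
            return
        read1, read2 = item           # read2 is the inverse association result
        out1.write(read1 + "\n")
        if read2 is not None:         # double-ended: write the mate as well
            out2.write(read2 + "\n")


def start_writers(sample_queues, outputs):
    """Start one independent thread per sample queue and return the threads."""
    threads = []
    for sample, q in sample_queues.items():
        out1, out2 = outputs[sample]
        t = threading.Thread(target=write_sample, args=(q, out1, out2))
        t.start()
        threads.append(t)
    return threads


if __name__ == "__main__":
    q = queue.Queue()
    out = io.StringIO()
    (t,) = start_writers({"S1": q}, {"S1": (out, None)})
    q.put(("@a\nACGT\n+\nIIII", None))
    q.put(END)
    t.join()
    print(out.getvalue(), end="")
```

Because each thread owns exactly one queue and one sample's output files, the threads need no locking among themselves.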

By the technical scheme, the time consumption for separating the sample read data from the fastq file is greatly shortened. The detailed performance is compared with the following table:

As shown in the above table, in the worst-case sample read data separation scenario, "double-ended fastq separation", the time consumed by the separation tool based on the present invention is on average more than 80% shorter than that of the conventional separation tool implemented in python.

The invention provides separate processing for single-ended sequencing and double-ended sequencing at every step, so that it is applicable to both, supports rapid and efficient separation of sample read data from fastq files in either mode, and ensures the correctness of the output data in both modes.

In some alternative implementations of the present embodiment, in the case of single-ended sequencing, the process of separating sample read data from fastq files according to the present invention is shown in fig. 1-5 and 8-9; in step S110 of the above embodiment, one fastq file is loaded.

In step S122, one barcode is obtained from the second line of the read data, and the barcode is copied to obtain two identical barcodes, which serve as the barcode pair.

As shown in fig. 8, in step S123, a sample queue is allocated to each sample; the sample queue is used for storing read data of the same sample in sequence; a piece of read data of a fastq file is obtained, and the piece of read data is inserted into the tail of a corresponding sample queue.

As shown in fig. 9, in step S130, in the asynchronous sample thread, one piece of read data is acquired from the head of the corresponding sample queue. There are a plurality of asynchronous sample threads, each asynchronous sample thread corresponds uniquely to one sample queue, and different asynchronous sample threads are mutually independent, i.e., they can be processed in parallel. The acquired read data is inversely associated, i.e., an attempt is made to obtain, through the acquired read data, the read data associated with it; in single-ended sequencing no inverse association result can be obtained, so only the acquired read data is written into the output fastq file of the corresponding sample.

In some alternative implementations of the present embodiment, in the case of double-ended sequencing, the process of the present invention for separating sample read data from fastq files is shown in fig. 1-5 and 10-11, in which two fastq files are r1 and r2, respectively; two reads are output, read1 and read2, respectively. Each pair of read1 and read2 of two fastq files has the same ID.

In step S122, two pieces of read data are obtained, one barcode is obtained from the second line of each piece of read data, and the two barcodes are used as the barcode pair.

As shown in fig. 10, in step S123, a sample queue is allocated to each sample; the sample queue is used for storing read data of the same sample in sequence; read1 and read2 are obtained from the two fastq files r1 and r2, read2 is associated with read1, and the associated read1 is inserted into the tail of the corresponding sample queue.

As shown in fig. 11, in step S130, in the asynchronous sample thread, a read1 data is acquired from the head of the corresponding sample queue. The number of the asynchronous sample threads is multiple, each asynchronous sample thread corresponds to one sample queue uniquely, and different asynchronous sample threads are mutually independent, namely, different asynchronous sample threads can be processed in parallel. And performing inverse association on the acquired read1 data, wherein the inverse association is successful, the inverse association result is read2, and the read1 data and the inverse associated read2 data are respectively written into the output fastq file of the corresponding sample.

It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are alternative embodiments, and that the acts and modules referred to are not necessarily required for the present invention.

An exemplary electronic device capable of implementing embodiments of the invention is shown in fig. 12.

The device 1200 includes a Central Processing Unit (CPU) 1201 that can perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 1202 or loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data required for the operation of the device 1200 may also be stored. The CPU 1201, ROM 1202, and RAM 1203 are connected to each other through a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.

Various components in device 1200 are connected to I/O interface 1205, including: an input unit 1206 such as a keyboard, mouse, etc.; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208 such as a magnetic disk, an optical disk, or the like; and a communication unit 1209, such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.

The CPU 1201 performs the respective methods and processes described above, for example, the method S100. For example, in some embodiments, the method S100 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1200 via ROM 1202 and/or communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the CPU 1201, one or more steps of the method S100 described above may be performed. Alternatively, in other embodiments, CPU 1201 may be configured to perform method S100 by any other suitable means (e.g., by means of firmware).

The functions described herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a System on a Chip (SOC), a Complex Programmable Logic Device (CPLD), etc.

Program code for carrying out methods of the present invention may be written in any combination of one or more programming languages. Such program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the invention. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims

1. A method of separating sample read data from fastq files, comprising:

writing read data in the sample queue into an output fastq file of a corresponding sample through an asynchronous sample thread in an asynchronous sample thread pool; wherein,

concurrently loading fastq files containing a plurality of samples by two threads, constructing read data and outputting, including:

distributing a first thread, a second thread and a data block queue, and setting the maximum data item number limit of the data block queue and the size of the data block;

in the first thread, sending an allocation request and waiting for allocation of a data block of the data block size; in the first thread, reading the fastq file according to the data block size, putting the read data into the allocated data block, and judging whether the data block queue has reached the maximum data item number limit; if so, entering a waiting state until the data item number of the data block queue is smaller than the maximum data item number limit, and then inserting the data block into the tail of the data block queue; otherwise, inserting the data block into the tail of the data block queue; then judging whether the fastq file has been read completely; if so, setting a data block queue end mark and ending the first thread; if not, returning to waiting for allocation of a data block and continuing to load the fastq file;

in the second thread, judging whether the data block queue is empty and whether the end mark is set, and ending the second thread if the data block queue is empty and the end mark is set; if the data block queue is empty and the end mark is not set, entering a waiting state until the data block queue is not empty or the end mark is set; if the data block queue is not empty, taking out the data block from the head of the data block queue to obtain fastq block data;

sequentially carrying out line feed analysis on the fastq block data to obtain a plurality of read data, and sequentially outputting the read data one by one;
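The loading and parsing steps recited above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the names `load_blocks`, `parse_reads`, and `split_into_reads` are hypothetical, a `None` sentinel placed in the queue stands in for the data block queue end mark, the data block allocation request is omitted, and each read is assumed to occupy exactly four lines. The bounded `queue.Queue` supplies the waiting behavior: `put` blocks while the queue is at its maximum item count, and `get` blocks while the queue is empty.

```python
import io
import queue
import threading

END = None  # assumption: a sentinel item stands in for the queue "end mark"


def load_blocks(stream, block_queue, block_size):
    """First thread: read fixed-size blocks and append them to the queue tail."""
    while True:
        block = stream.read(block_size)
        if not block:            # the file has been read completely
            break
        block_queue.put(block)   # waits while the queue is at its item limit
    block_queue.put(END)         # signal the end to the second thread


def parse_reads(block_queue, emit):
    """Second thread: take blocks from the queue head and split them into reads."""
    leftover = b""               # partial line cut off by a block boundary
    lines = []
    while True:
        block = block_queue.get()  # waits while the queue is empty
        if block is END:
            break
        parts = (leftover + block).split(b"\n")
        leftover = parts.pop()   # last piece may be an incomplete line
        for line in parts:
            lines.append(line)
            if len(lines) == 4:  # assumption: one read is exactly four lines
                emit(tuple(lines))
                lines = []
    if leftover:                 # file did not end with a line feed
        lines.append(leftover)
    if len(lines) == 4:
        emit(tuple(lines))


def split_into_reads(stream, block_size=1 << 20, max_items=8):
    """Run both threads concurrently and return the reads in file order."""
    block_queue = queue.Queue(maxsize=max_items)
    reads = []
    loader = threading.Thread(target=load_blocks,
                              args=(stream, block_queue, block_size))
    parser = threading.Thread(target=parse_reads,
                              args=(block_queue, reads.append))
    loader.start()
    parser.start()
    loader.join()
    parser.join()
    return reads
```

Because a fixed-size block can end in the middle of a line, the sketch carries the incomplete tail of each block over to the next one; how the claimed method handles block boundaries is not stated in this section.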

2. The method of claim 1, wherein parsing the barcode pair from the read data comprises:

constructing a barcode pair according to the barcode;

3. The method of claim 1, wherein the identifying the sample to which the read data belongs according to the correspondence of the barcode pair to sample numbers comprises:

4. The method of claim 1, wherein inserting the read data into the sample queue of the sample to which it belongs comprises:

5. The method as recited in claim 4, further comprising:

6. The method as recited in claim 1, further comprising:

7. The method of claim 1, wherein writing read data in the sample queue into the output fastq file of the corresponding sample by an asynchronous sample thread in an asynchronous sample thread pool comprises:

8. The method according to claim 1 or 7, wherein the asynchronous sample thread pool comprises a plurality of asynchronous sample threads, each asynchronous sample thread uniquely corresponds to one sample queue, and different asynchronous sample threads are independent from each other.
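The one-to-one pairing of asynchronous sample threads and sample queues can be sketched in Python as follows. This is a hedged illustration only: the names `sample_writer`, `start_writers`, and `finish_writers` are hypothetical, a `None` sentinel is assumed as the signal that a sample queue is finished, and each read is assumed to be a tuple of four byte strings. Each thread touches only its own queue and its own output file, so the threads need no shared locks.

```python
import queue
import threading

END = None  # assumption: sentinel telling a sample thread its queue is finished


def sample_writer(sample_queue, path):
    """One asynchronous sample thread: drain one queue into one fastq file."""
    with open(path, "wb") as out:
        while True:
            read = sample_queue.get()   # waits while the sample queue is empty
            if read is END:
                break
            out.write(b"\n".join(read) + b"\n")


def start_writers(sample_paths):
    """Create one queue and one thread per sample; return both collections."""
    queues, threads = {}, []
    for sample, path in sample_paths.items():
        q = queue.Queue()
        t = threading.Thread(target=sample_writer, args=(q, path))
        t.start()
        queues[sample] = q
        threads.append(t)
    return queues, threads


def finish_writers(queues, threads):
    """Signal the end on every sample queue and wait for all threads to exit."""
    for q in queues.values():
        q.put(END)
    for t in threads:
        t.join()
```

In use, the thread that identifies the sample for each read would call `queues[sample].put(read)`, and `finish_writers` would be called once all reads have been dispatched.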

9. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program, characterized in that the processor, when executing the program, implements the method according to any of claims 1-8.