Disclosure of Invention
According to an embodiment of the present invention, a scheme for separating sample read data from a fastq file is provided.
In a first aspect of the invention, a method of separating sample read data from a fastq file is provided. The method comprises the following steps:
loading a fastq file containing a plurality of samples through two threads concurrently, constructing and outputting read data;
analyzing a barcode pair from the read data, identifying a sample to which the read data belongs according to the corresponding relation between the barcode pair and the sample number, and inserting the read data into a sample queue of the sample to which the read data belongs;
and writing the read data in the sample queue into an output fastq file of a corresponding sample through an asynchronous sample thread in an asynchronous sample thread pool.
Further, the concurrently loading a fastq file containing a plurality of samples through two threads, constructing and outputting read data, including:
starting a first thread and a second thread, allocating a data block queue and setting the size of a data block;
in the first thread, loading the fastq file block by block according to the size of the data block, and inserting the loaded fastq block data into the tail of the data block queue;
in the second thread, taking out data blocks from the head of the data block queue one by one to obtain fastq block data;
and splitting the fastq block data into lines at the line feed characters, sequentially constructing one piece of read data from every 4 lines of data to obtain a plurality of read data, and outputting the read data one by one in sequence.
Further, the parsing the barcode pair from the read data includes:
taking the first 8 characters of the second line of the read data as the barcode of the read data;
constructing a barcode pair according to the barcode;
in the case of single-ended sequencing, copying the one barcode to obtain two identical barcodes serving as the barcode pair;
in the case of double-ended (paired-end) sequencing, there are two barcodes, one from each read, and the two barcodes serve as the barcode pair.
Further, the identifying the sample to which the read data belongs according to the correspondence between the barcode pair and the sample number includes:
grouping the barcodes in the read data to obtain a plurality of barcode groups; each barcode group comprises a plurality of different barcodes, and any barcode and the barcodes that form a barcode pair with it are in the same group, so that a unique corresponding relation between the barcode pairs and the barcode groups is obtained;
defining the unique corresponding relation between the barcode group and the sample number to obtain the unique corresponding relation between the barcode pair and the sample number;
and identifying the sample to which the read data corresponding to the barcode belongs according to the unique corresponding relation between the barcode pair and the sample number.
Further, the inserting the read data into the sample queue of the sample to which the read data belongs comprises:
allocating a sample queue for each sample; the sample queue is used for storing read data of the same sample in sequence;
and inserting the read data into the tail of the sample queue of the sample to which the read data belongs.
Further, the method also comprises:
in the case of double-ended sequencing, there are two pieces of read data, namely read1 data and read2 data; associating the read2 data with the read1 data;
and inserting the associated read1 data at the tail of the sample queue of the sample to which it belongs.
Further, the method also comprises:
each time read data is inserted into the sample queue of the corresponding sample, judging whether all read data has been acquired; if so, setting end marks for all the sample queues; otherwise, returning to the step of inserting read data into the sample queue of the sample to which the read data belongs.
Further, the writing, by an asynchronous sample thread in an asynchronous sample thread pool, read data in the sample queue into an output fastq file of a corresponding sample includes:
in the asynchronous sample thread, judging whether the sample queue is empty and whether an end mark is set; if the sample queue is empty and the end mark is set, ending the current asynchronous sample thread; if the sample queue is empty and no end mark is set, entering a waiting state until the sample queue is not empty or an end mark is set; if the sample queue is not empty, taking out a piece of read data from the head of the sample queue;
performing reverse association on the obtained read data, and if a reverse association result is obtained, respectively writing the obtained read data and the reverse association result into output fastq files of the corresponding sample; and if no reverse association result is obtained, writing the obtained read data into an output fastq file of the corresponding sample.
Furthermore, the asynchronous sample thread pool comprises a plurality of asynchronous sample threads, each asynchronous sample thread uniquely corresponds to one sample queue, and different asynchronous sample threads are independent from each other.
In a second aspect of the invention, an electronic device is provided. The electronic device includes: a memory having a computer program stored thereon and a processor implementing the method as described above when executing the program.
It should be understood that the statements in this Disclosure of Invention section are not intended to identify key or essential features of any embodiment of the invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
According to the method, a fastq file containing a plurality of samples is loaded through a plurality of threads concurrently, read data of the samples are separated, and the read data are output to the fastq file through asynchronous operation of the sample threads corresponding to the samples; by utilizing a parallel working mode, a plurality of threads simultaneously cooperate to work, so that the working efficiency is improved, the time consumption for separating sample read data from the fastq file is greatly reduced, the performance utilization rate of the computer is improved, and the aim of quickly separating the sample read data from the fastq file is fulfilled.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
In addition, the term "and/or" herein merely describes an association relationship between associated objects and means that three relationships are possible; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship.
According to the method, a fastq file containing a plurality of samples is loaded through a plurality of threads concurrently, read data of the samples are separated, and the read data are output to the fastq file through asynchronous operation of the sample threads corresponding to the samples; by utilizing a parallel working mode, a plurality of threads simultaneously cooperate to work, so that the working efficiency is improved, the working time is shortened, the performance utilization rate of the computer is improved, and the purpose of quickly separating sample read data from a fastq file is achieved.
FIG. 1 shows a flow diagram of a method of separating sample read data from a fastq file, according to an embodiment of the invention.
The method S100 includes:
S110, loading the fastq file containing a plurality of samples through two threads concurrently, and constructing and outputting read data.
The fastq file containing multiple samples is generally very large, ranging from a few GB to tens or hundreds of GB in size. In order to perform the subsequent gene sequence analysis, per-sample fastq files need to be split from the original fastq file, i.e. the gene data of each sample is separated into its own fastq file.
As an embodiment of the invention, the method uses two threads, namely a first thread and a second thread; the first thread is used for reading the fastq file data block by block and inserting the read data blocks into a data block queue; and the second thread is used for taking out data blocks from the data block queue, parsing the data in each data block and outputting the read data. The process of concurrently loading the fastq file through the two threads, constructing read data and outputting the read data, as shown in fig. 2, includes:
S111, allocating a first thread, a second thread and a data block queue, and setting the maximum data item number limit of the data block queue and the size of a data block. The data block queue comprises a plurality of data blocks arranged in sequence; when it is not empty, it has a head data block and a tail data block. The access logic of the data block queue is defined as tail-in, head-out. A fixed size, e.g., 1 MB, is set for the data block; it serves as the unit of data access so that the fastq block data loaded each time is of equal size.
S112, in the first thread, loading the fastq file block by block according to the size of the data block. For each loaded data block, judging whether the data block queue has reached the maximum data item number limit; if so, entering a waiting state until the number of data items in the data block queue is less than the limit, and then inserting the data block at the tail of the data block queue; otherwise, inserting the data block at the tail of the data block queue directly, the inserted data block becoming the new tail. Then judging whether the reading of the fastq file is finished; if so, setting the end mark of the data block queue and ending the first thread; if not, returning to continue loading the fastq file.
S113, in the second thread, judging whether the data block queue is empty and whether the end mark is set. If the data block queue is empty and the end mark is set, ending the second thread; if the data block queue is empty and the end mark is not set, entering a waiting state until the data block queue is not empty or the end mark is set; and if the data block queue is not empty, taking out the data block from the head of the data block queue to obtain the fastq block data.
The first thread and the second thread run in parallel: the first thread sequentially loads the data blocks of the fastq file into the data block queue, while the second thread takes the data blocks from the head of the data block queue one by one and parses read data from them in memory. Working in parallel lets the two threads cooperate, so that file loading and parsing overlap, which improves the working efficiency and reduces the working time.
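The cooperation of the two threads in S111 to S113 can be sketched as follows. This is only an illustrative sketch, not the implementation of the invention: it assumes Python's standard `queue.Queue` as the bounded data block queue, and the names `load_blocks`, `consume_blocks`, `run_pipeline`, `MAX_BLOCKS` and the `END` sentinel (standing in for the end mark) are hypothetical.

```python
import queue
import threading

BLOCK_SIZE = 1 << 20   # 1 MB data block, the example size given in S111
MAX_BLOCKS = 16        # hypothetical maximum number of queued data blocks
END = None             # hypothetical sentinel standing in for the end mark


def load_blocks(path, block_queue, block_size=BLOCK_SIZE):
    """First thread: read the file block by block and insert at the queue tail."""
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:               # empty result means end of file
                break
            block_queue.put(block)      # waits while the queue is at its limit
    block_queue.put(END)                # reading finished: set the end mark


def consume_blocks(block_queue, on_block):
    """Second thread: take blocks from the queue head until the end mark."""
    while True:
        block = block_queue.get()       # waits while the queue is empty
        if block is END:
            break
        on_block(block)


def run_pipeline(path, on_block, block_size=BLOCK_SIZE, max_blocks=MAX_BLOCKS):
    """Run the loader and the consumer concurrently and wait for both."""
    block_queue = queue.Queue(maxsize=max_blocks)
    producer = threading.Thread(target=load_blocks,
                                args=(path, block_queue, block_size))
    consumer = threading.Thread(target=consume_blocks,
                                args=(block_queue, on_block))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
```

In this sketch the waiting states described in S112 and S113 are provided by the blocking `put` and `get` calls of the bounded queue, so no explicit condition checks are written out.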
FIG. 3 is a diagram illustrating a read data structure according to an embodiment of the present invention.
S114, splitting the fastq block data into lines at the line feed characters, sequentially constructing one piece of read data from every 4 lines of data to obtain a plurality of read data, and outputting the read data one by one in sequence.
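A minimal sketch of S114 is given below. It assumes that a fixed-size data block may end in the middle of a line, so the unfinished tail of each block is carried over to the next one; the source does not state how block boundaries are handled, and the function name `iter_reads` and the representation of a read as a 4-tuple of byte strings are assumptions made for illustration.

```python
def iter_reads(blocks):
    """Yield one read (a tuple of 4 lines) at a time from an iterable of blocks."""
    pending = b""   # partial line carried over from the previous block
    lines = []      # complete lines of the read currently being assembled
    for block in blocks:
        pieces = (pending + block).split(b"\n")
        pending = pieces.pop()          # the last piece has no line feed yet
        for line in pieces:
            lines.append(line)
            if len(lines) == 4:         # every 4 lines form one read
                yield tuple(lines)
                lines = []
    if pending:                         # final line without a trailing line feed
        lines.append(pending)
    if len(lines) == 4:
        yield tuple(lines)
    # Assumption: a trailing group of fewer than 4 lines is silently dropped.
```

Because the function is a generator, reads are output one by one in file order, and only the current block plus one partial line are held in memory at a time.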
In an embodiment of the present invention, as shown in fig. 3(a), four non-contiguous memory regions are used to store the four lines of data of a read respectively, where the first line is information data, the second line is gene sequence data, the third line is annotation data, and the fourth line is quality data; the read data are sequentially output one by one in this memory format.
In another embodiment of the present invention, as shown in fig. 3(b), a single block of contiguous memory is used to store the four complete lines of a read; this block of contiguous memory is named READ, with a start position of 0 and an end position of end. Three index values, e.g. LF_pos1, LF_pos2 and LF_pos3, point to the positions of the line feed characters that end the first, second and third lines of the read data respectively. The three line feed characters divide the data into 4 lines. The first line is the information line, denoted READ[0, LF_pos1), i.e. the left-closed right-open interval from position 0 to LF_pos1; the second line is the sequence line, denoted READ[LF_pos1+1, LF_pos2), i.e. the left-closed right-open interval from LF_pos1+1 to LF_pos2; the third line is the annotation line, denoted READ[LF_pos2+1, LF_pos3), i.e. the left-closed right-open interval from LF_pos2+1 to LF_pos3; the fourth line is the quality line, denoted READ[LF_pos3+1, end), i.e. the left-closed right-open interval from LF_pos3+1 to the READ end position end. The four lines of the read are thus stored completely in one block of contiguous memory, and each obtained read data is output in sequence.
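The contiguous layout of fig. 3(b) can be illustrated with the following sketch. The class name `Read`, its field names and the use of `memoryview` slices to avoid copying are assumptions for illustration; only the three line feed indices and the four left-closed right-open intervals come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """One read held in a single contiguous buffer, as in fig. 3(b)."""
    buf: bytes     # four lines joined by line feeds, no trailing line feed
    lf_pos1: int   # index of the line feed ending the information line
    lf_pos2: int   # index of the line feed ending the sequence line
    lf_pos3: int   # index of the line feed ending the annotation line

    @classmethod
    def from_bytes(cls, buf):
        """Locate the three line feeds once; raises ValueError if one is missing."""
        p1 = buf.index(b"\n")
        p2 = buf.index(b"\n", p1 + 1)
        p3 = buf.index(b"\n", p2 + 1)
        return cls(buf, p1, p2, p3)

    def _view(self, start, stop):
        # Slicing a memoryview does not copy the underlying bytes.
        return memoryview(self.buf)[start:stop]

    @property
    def info(self):          # READ[0, LF_pos1)
        return self._view(0, self.lf_pos1)

    @property
    def sequence(self):      # READ[LF_pos1+1, LF_pos2)
        return self._view(self.lf_pos1 + 1, self.lf_pos2)

    @property
    def annotation(self):    # READ[LF_pos2+1, LF_pos3)
        return self._view(self.lf_pos2 + 1, self.lf_pos3)

    @property
    def quality(self):       # READ[LF_pos3+1, end)
        return self._view(self.lf_pos3 + 1, len(self.buf))
```

With this layout a read can be written back to a fastq file as one buffer, and each line is reached by index arithmetic rather than by splitting and copying.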
After the data block is consumed, a request to release the data block is issued.
Through the optimization of the read storage structure, the number of basic operations on the read data is greatly reduced. For example, assuming that there are 1 billion reads in the original fastq file, over the whole gene data separation the optimized read storage structure saves 1 billion × 4 = 4 billion operations of splitting and copying line data, and 1 billion operations of splicing four lines of data back into contiguous memory. This avoids a large amount of unnecessary CPU and memory consumption.
S120, parsing a barcode pair from the read data, identifying a sample to which the read data belongs according to the corresponding relation between the barcode pair and the sample number, and inserting the read data into a sample queue of the sample to which the read data belongs.
Further, S121, parsing the barcode pair from the read data, includes:
S1211, taking the first 8 characters of the second line of the read data as the barcode of the read data.
S1212, constructing a barcode pair according to the barcode.
In the case of single-ended sequencing, copying the one barcode to obtain two identical barcodes serving as the barcode pair;
in the case of double-ended (paired-end) sequencing, there are two barcodes, one from each read, and the two barcodes serve as the barcode pair.
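Steps S1211 and S1212 can be sketched as below. The sketch assumes a read is represented as a tuple of its four lines (an assumption carried over for illustration); the names `barcode_of`, `barcode_pair` and `BARCODE_LEN` are hypothetical.

```python
BARCODE_LEN = 8   # the first 8 characters of the second line, per S1211


def barcode_of(read):
    """Return the barcode of a read: the first 8 characters of its second line."""
    return read[1][:BARCODE_LEN]


def barcode_pair(read1, read2=None):
    """Build the barcode pair for single-ended (read2 is None) or double-ended data."""
    first = barcode_of(read1)
    # Single-ended: duplicate the one barcode. Double-ended: one barcode per read.
    second = first if read2 is None else barcode_of(read2)
    return (first, second)
```

Duplicating the barcode in the single-ended case lets the later lookup treat both sequencing modes uniformly as a pair.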
Further, in S122, the identifying, according to the correspondence between the barcode pair and the sample number, the sample to which the read data belongs includes, as shown in fig. 4:
S1221, as shown in fig. 5, grouping the barcodes in the read data to obtain a plurality of barcode groups. Each barcode group contains a plurality of barcodes; no barcode is repeated across groups; and any barcode and the barcodes that form a barcode pair with it are in the same group. Barcode pairs and barcode groups are therefore in a many-to-one relationship, which gives a unique corresponding relation between the barcode pair and the barcode group, i.e. any barcode pair locates exactly one barcode group.
S1222, defining a unique corresponding relation between the barcode group and the sample number to obtain the unique corresponding relation between the barcode pair and the sample number;
and S1223, identifying the sample to which the read data corresponding to the barcode belongs according to the unique corresponding relation between the barcode pair and the sample number.
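The lookup described in S1221 to S1223 can be sketched with a dictionary from barcode to sample number. This is an illustrative sketch only: the names `build_barcode_index` and `sample_of` are hypothetical, and returning `None` for a pair that matches no group is an assumption, since the source does not say how unmatched reads are handled.

```python
def build_barcode_index(groups):
    """Map every barcode to the sample number of its barcode group.

    groups: dict mapping sample number -> iterable of barcodes in that group.
    """
    index = {}
    for sample, barcodes in groups.items():
        for barcode in barcodes:
            if barcode in index:
                # Barcodes must not repeat across groups (S1221).
                raise ValueError(f"barcode {barcode!r} appears in two groups")
            index[barcode] = sample
    return index


def sample_of(pair, index):
    """Return the sample number for a barcode pair, or None if it has none."""
    first, second = pair
    sample = index.get(first)
    # Both barcodes of a pair must fall in the same group to identify a sample.
    if sample is None or index.get(second) != sample:
        return None
    return sample
```

Because each barcode belongs to exactly one group and each group to exactly one sample, two dictionary lookups suffice per read.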
Further, S123, inserting the read data into the sample queue of the sample to which the read data belongs, as shown in fig. 6, includes:
allocating a sample queue for each sample; the sample queue is used for storing read data of the same sample in sequence;
judging whether the sequencing is single-ended; if so, acquiring one piece of read data and inserting it at the tail of the corresponding sample queue; otherwise, acquiring read1 data and read2 data from the r1 and r2 fastq files respectively, two pieces of read data in total, associating the read2 data with the read1 data, and then inserting the read1 data at the tail of the corresponding sample queue.
As an embodiment of the present invention, each time read data is inserted into the sample queue of the corresponding sample, judging whether all read data has been acquired; if so, setting an end mark for all sample queues and ending the process of extracting read data; otherwise, returning to the step of inserting read data into the sample queue of the sample to which the read data belongs.
Whether all read data has been acquired is judged by identifying the end-of-file mark provided by the underlying file system; for example, when reading of a fastq file ends, the EOF mark returned by the file system is obtained, where EOF is the abbreviation of "end of file" and indicates that the file has been completely read.
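The insertion step S123 and the setting of end marks can be sketched as follows. The sketch models the association of read2 with read1 by queuing them together as one tuple, which is an assumption; the source does not specify the association mechanism. The names `dispatch`, `classify` and the `END` sentinel are hypothetical.

```python
import queue

END = object()   # hypothetical end mark placed at the tail of every sample queue


def dispatch(reads, classify, sample_queues):
    """Insert each read at the tail of its sample queue, then set the end marks.

    reads: iterable of (read1, read2); read2 is None for single-ended data.
    classify: callable (read1, read2) -> sample number, or None if unknown.
    sample_queues: dict mapping sample number -> queue.Queue.
    """
    for read1, read2 in reads:
        sample = classify(read1, read2)
        target = sample_queues.get(sample)
        if target is None:
            continue                   # assumption: unassigned reads are skipped
        target.put((read1, read2))     # read2 travels with read1 (association)
    for target in sample_queues.values():
        target.put(END)                # all reads acquired: mark every queue
```

Exhaustion of the `reads` iterable plays the role of the EOF check here: once the input is consumed, every sample queue receives its end mark.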
S130, writing the read data in the sample queue into the output fastq file of the corresponding sample through the asynchronous sample thread pool, as shown in fig. 7, including:
in the asynchronous sample thread pool, each asynchronous sample thread obtains a piece of read data from the head of its corresponding sample queue. The asynchronous sample thread pool is provided with a plurality of asynchronous sample threads, each asynchronous sample thread uniquely corresponds to one sample queue, and different asynchronous sample threads are independent from each other, namely, different asynchronous sample threads can be processed in parallel.
Performing reverse association on the acquired read data, i.e. attempting to find, through the acquired read data, the read data associated with it. The attempt has two possible outcomes: a result is obtained, or no result is obtained. If a result is obtained, the sequencing is double-ended, and the acquired read data and the reverse association result are written respectively into the output fastq files of the corresponding sample; if no result is obtained, the sequencing is single-ended, and the acquired read data is written into the output fastq file of the corresponding sample.
After the acquired read data is written into the output fastq file of the corresponding sample, judging whether the current sample queue is empty. If the queue is empty and its end mark is set, all the read data in the fastq file has been taken out, and the current asynchronous sample thread ends. If the queue is empty but no end mark is set, the queue merely has no data at the moment while the fastq file still contains read data that has not been consumed, so the thread enters a waiting state until the queue is not empty or its end mark is set. If the queue is not empty, returning to S130 and continuing to write the read data in the sample queue into the output fastq file of the corresponding sample.
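The behaviour of the asynchronous sample threads in S130 can be sketched as below. This is a simplified illustration, not the implementation of the invention: it starts one plain `threading.Thread` per sample queue rather than a managed pool, represents each queued item as a tuple `(read1, read2)` where `read2` is the associated read or `None`, and uses a hypothetical `END` sentinel as the end mark. The names `format_read`, `write_sample` and `start_writers` are likewise hypothetical.

```python
import queue
import threading

END = object()   # hypothetical end mark, inserted after the last read


def format_read(lines):
    """Join the four lines of a read back into fastq text."""
    return b"\n".join(lines) + b"\n"


def write_sample(sample_queue, out1, out2=None):
    """Drain one sample queue into that sample's output file(s)."""
    while True:
        item = sample_queue.get()     # waits while the queue is empty
        if item is END:               # queue drained and end mark set: stop
            break
        read1, read2 = item           # read2 is the associated read, or None
        out1.write(format_read(read1))
        if read2 is not None:         # double-ended: out2 must be provided
            out2.write(format_read(read2))


def start_writers(sample_queues, outputs):
    """Start one independent writer thread per sample queue.

    outputs: dict mapping sample number -> (out1, out2); out2 may be None
    for single-ended data.
    """
    threads = []
    for sample, sample_queue in sample_queues.items():
        out1, out2 = outputs[sample]
        thread = threading.Thread(target=write_sample,
                                  args=(sample_queue, out1, out2))
        thread.start()
        threads.append(thread)
    return threads
```

Since each thread owns exactly one queue and one set of output files, the writers need no locking among themselves, and reads of one sample keep their original order.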
Through the technical scheme of the invention, the time consumption for separating the sample read data from the fastq file is greatly reduced. A detailed performance comparison is as follows:
As shown in the table above, in the worst-case sample read data separation scenario, "double-ended fastq separation", the time consumed by the separation tool based on the invention is on average more than 80% lower than that of a separation tool implemented in conventional Python.
The method provides separate processing modes for single-ended sequencing and double-ended sequencing at each step, so that sample read data can be separated from a fastq file quickly and efficiently in either mode, and the correctness of the output data is ensured in both modes.
In some optional implementations of this embodiment, the process of separating sample read data from a fastq file in the case of single-ended sequencing is shown in figs. 1 to 5 and 8 to 9. Step S110 loads the fastq file as described in the above embodiment.
In the step S122, there is one piece of read data; a barcode is obtained from the second line of the read data, and the barcode is copied to obtain two identical barcodes as the barcode pair.
As shown in fig. 8, in step S123, a sample queue is allocated to each sample; the sample queue is used for storing read data of the same sample in sequence; and acquiring a piece of read data of a fastq file, and inserting the piece of read data into the tail of the corresponding sample queue.
As shown in fig. 9, in the step S130, in the asynchronous sample thread, a piece of read data is obtained from the head of the corresponding sample queue. The number of the asynchronous sample threads is multiple, each asynchronous sample thread uniquely corresponds to one sample queue, and different asynchronous sample threads are independent from each other, namely, different asynchronous sample threads can be processed in parallel. And performing reverse association on the acquired read data, namely attempting to associate the read data associated with the acquired read data through the acquired read data, and writing the acquired read data into an output fastq file of a corresponding sample if a result cannot be obtained through the reverse association.
In some optional implementations of this embodiment, the process of separating sample read data from a fastq file in the case of double-ended sequencing is shown in figs. 1 to 5 and 10 to 11. Compared with the above embodiment, there are two fastq files, r1 and r2, and two reads are output, read1 and read2 respectively. Each pair of read1 and read2 from the two fastq files has the same ID.
In the step S122, there are two pieces of read data; one barcode is obtained from the second line of each piece of read data, and the two barcodes are used as the barcode pair.
As shown in fig. 10, in step S123, a sample queue is allocated to each sample; the sample queue is used for storing read data of the same sample in sequence; acquiring read1 and read2 of two fastq files of r1 and r2, wherein the two pieces of read data are total, associating the read2 with the read1, and inserting the read1 into the tail of a corresponding sample queue after association.
As shown in fig. 11, in step S130, in the asynchronous sample thread, a piece of read1 data is obtained from the head of the corresponding sample queue. The number of the asynchronous sample threads is multiple, each asynchronous sample thread uniquely corresponds to one sample queue, and different asynchronous sample threads are independent from each other, namely, different asynchronous sample threads can be processed in parallel. Reverse association is performed on the acquired read1 data; in this case the reverse association succeeds and its result is read2, and the read1 data and the reverse-associated read2 data are respectively written into the output fastq files of the corresponding sample.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules illustrated are not necessarily required to practice the invention.
An exemplary electronic device capable of implementing embodiments of the present invention is shown in fig. 12.
The device 1200 includes a Central Processing Unit (CPU)1201 that can perform various appropriate actions and processes according to computer program instructions stored in a Read Only Memory (ROM)1202 or computer program instructions loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data necessary for the operation of the device 1200 can also be stored. The CPU 1201, ROM 1202, and RAM 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to bus 1204.
Various components in the device 1200 are connected to the I/O interface 1205 including: an input unit 1206 such as a keyboard, a mouse, or the like; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208, such as a magnetic disk, optical disk, or the like; and a communication unit 1209 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The CPU 1201 executes the respective methods and processes described above, such as the method S100. For example, in some embodiments, method S100 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the CPU 1201, one or more steps of the method S100 described above may be performed. Alternatively, in other embodiments, the CPU 1201 may be configured to perform the method S100 by any other suitable means (e.g., by way of firmware).
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a System on a Chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
Program code for implementing the methods of the present invention may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the invention. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.