CN111767256A - Method for separating sample read data from fastq file - Google Patents

Publication number
CN111767256A
Authority
CN
China
Prior art keywords
sample
read data
data
queue
barcode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010442649.4A
Other languages
Chinese (zh)
Other versions
CN111767256B (en)
Inventor
黄俊松
文晋
邵艳军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Herui Exquisite Medical Laboratory Co ltd
Original Assignee
Beijing Herui Precision Medical Laboratory Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Herui Precision Medical Laboratory Co ltd
Priority to CN202010442649.4A
Publication of CN111767256A
Application granted
Publication of CN111767256B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601: Interfaces specially adapted for storage systems
    • G06F3/0628: Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638: Organizing or formatting or addressing of data
    • G06F3/064: Management of blocks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601: Interfaces specially adapted for storage systems
    • G06F3/0628: Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638: Organizing or formatting or addressing of data
    • G06F3/0643: Management of files
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601: Interfaces specially adapted for storage systems
    • G06F3/0628: Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646: Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/065: Replication mechanisms
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601: Interfaces specially adapted for storage systems
    • G06F3/0628: Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655: Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0659: Command handling arrangements, e.g. command buffers, queues, command scheduling
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

An embodiment of the invention provides a method for separating sample read data from a fastq file. The method comprises: concurrently loading a fastq file containing a plurality of samples with two threads, and constructing and outputting read data; parsing a barcode pair from the read data, identifying the sample to which the read data belongs according to the correspondence between barcode pairs and sample numbers, and inserting the read data into the sample queue of that sample; and writing the read data in each sample queue into the output fastq file of the corresponding sample through an asynchronous sample thread in an asynchronous sample thread pool. Because multiple threads work in parallel and in coordination, the time needed to separate sample read data from the fastq file is greatly shortened and the computer's resources are better utilized, so that sample read data can be separated from the fastq file quickly.

Description

Method for separating sample read data from fastq file
Technical Field
Embodiments of the invention relate generally to the field of gene sequencing and, more particularly, to a method of separating sample read data from a fastq file.
Background
In the field of gene sequencing, fastq is the most commonly used file format for storing gene base sequences together with their quality scores and related information. The raw output data of a sequencer can be stored as a fastq file after processing. To make full use of sequencers and sequencing kits, it is now standard practice to pool multiple samples in one sequencing run and then output a fastq file that contains the gene data of multiple samples. Such multi-sample fastq files are typically very large, ranging from a few GB to tens or even hundreds of GB. For the subsequent gene sequence analysis, per-sample fastq files need to be split from the original fastq file, i.e. the gene data of each sample is separated into its own fastq file (for double-ended sequencing, there are two separate fastq files for each sample). The traditional method for separating sample read data uses a scripting language such as Python to read the original fastq file line by line, parse and construct each read, identify the sample to which the read belongs, and append the read to that sample's fastq file. Because of the serial working mode and the relatively poor performance of the scripting language, this process takes far too long; for example, even when the fastq file is only a few GB in size, this approach can take approximately 1 hour to complete the gene data separation. When the sequencer output reaches dozens or hundreds of GB, it takes more than ten hours to complete even this most basic data splitting task.
Disclosure of Invention
According to an embodiment of the present invention, a scheme for separating sample read data from a fastq file is provided.
In a first aspect of the invention, a method of separating sample read data from a fastq file is provided. The method comprises the following steps:
loading a fastq file containing a plurality of samples through two threads concurrently, constructing and outputting read data;
analyzing a barcode pair from the read data, identifying a sample to which the read data belongs according to the corresponding relation between the barcode pair and the sample number, and inserting the read data into a sample queue of the sample to which the read data belongs;
and writing the read data in the sample queue into an output fastq file of a corresponding sample through an asynchronous sample thread in an asynchronous sample thread pool.
Further, concurrently loading the fastq file containing a plurality of samples with two threads, and constructing and outputting read data, includes:
starting a first thread and a second thread, allocating a data block queue and setting the size of a data block;
in the first thread, loading the fastq file block by block according to the size of the data block, and inserting the loaded fastq block data at the tail of the data block queue;
in the second thread, taking out data blocks from the head of the data block queue one by one to obtain fastq block data;
and splitting the fastq block data into lines at the line feed characters, constructing one piece of read data from every 4 consecutive lines to obtain a plurality of read data, and outputting the read data one by one in order.
Further, parsing the barcode pair from the read data includes:
taking the first 8 characters of the second line of the read data as the barcode of the read data;
constructing a barcode pair from the barcode;
in the case of single-ended sequencing, there is one barcode, which is copied to obtain two identical barcodes serving as the barcode pair;
in the case of paired-end sequencing, there are two barcodes, which together serve as the barcode pair.
Further, identifying the sample to which the read data belongs according to the correspondence between the barcode pair and the sample number includes:
grouping the barcodes in the read data to obtain a plurality of barcode groups; each barcode group comprises a plurality of different barcodes, and any barcode and every barcode that forms a barcode pair with it are in the same group, which gives the unique correspondence between barcode pairs and barcode groups;
defining the unique correspondence between barcode groups and sample numbers to obtain the unique correspondence between barcode pairs and sample numbers;
and identifying the sample to which the read data corresponding to the barcode belongs according to the unique correspondence between the barcode pair and the sample number.
Further, inserting the read data into the sample queue of the sample to which it belongs comprises:
allocating a sample queue for each sample; the sample queue is used for storing the read data of the same sample in order;
and inserting the read data at the tail of the sample queue of the sample to which it belongs.
Further, the method also includes:
in the case of double-ended sequencing, there are two pieces of read data, namely read1 data and read2 data; associating the read2 data with the read1 data;
and inserting the associated read1 data at the tail of the sample queue of the sample to which it belongs.
Further, the method also includes:
each time read data has been inserted into the sample queue of the corresponding sample, judging whether all read data has been acquired; if so, setting end marks for all the sample queues; otherwise, returning to the step of inserting read data into the sample queue of the sample to which it belongs.
Further, writing the read data in the sample queue into the output fastq file of the corresponding sample through an asynchronous sample thread in the asynchronous sample thread pool includes:
in the asynchronous sample thread, judging whether the sample queue is empty and whether an end mark is set; if the sample queue is empty and the end mark is set, ending the current asynchronous thread; if the sample queue is empty and no end mark is set, entering a waiting state until the sample queue is not empty or the end mark is set; if the sample queue is not empty, taking out one piece of read data from the head of the sample queue;
performing reverse association on the obtained read data, and if a reverse association result is obtained, writing the obtained read data and the reverse association result respectively into the output fastq files of the corresponding sample; and if no reverse association result is obtained, writing the obtained read data into the output fastq file of the corresponding sample.
Furthermore, the asynchronous sample thread pool comprises a plurality of asynchronous sample threads, each asynchronous sample thread uniquely corresponds to one sample queue, and different asynchronous sample threads are independent of each other.
In a second aspect of the invention, an electronic device is provided. The electronic device includes: a memory having a computer program stored thereon and a processor implementing the method as described above when executing the program.
It should be understood that the statements in this section are not intended to identify key or essential features of any embodiment of the invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
According to the method, a fastq file containing a plurality of samples is loaded concurrently by multiple threads, the read data of each sample is separated, and the read data is written to the output fastq files asynchronously by the sample thread corresponding to each sample. Because the threads work in parallel and in coordination, working efficiency is improved, the time needed to separate sample read data from the fastq file is greatly reduced, the computer's resources are better utilized, and sample read data can be separated from the fastq file quickly.
Drawings
The above and other features, advantages and aspects of various embodiments of the present invention will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:
FIG. 1 is a flow diagram of a method of separating sample read data from a fastq file according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a process of concurrently loading a fastq file and outputting read data, according to an embodiment of the present invention;
FIG. 3 is a diagram of a read data structure according to an embodiment of the present invention;
FIG. 4 is a flow diagram of a sample process of identifying the sample to which the read data pertains, according to an embodiment of the invention;
FIG. 5 is a diagram illustrating a correspondence between barcode pairs and sample numbers according to an embodiment of the present invention;
FIG. 6 is a flow diagram of inserting the read data into a sample queue of samples to which it belongs, according to an embodiment of the invention;
FIG. 7 is a flowchart of writing read data in the sample queue to an output fastq file, according to an embodiment of the invention;
FIG. 8 is a flow diagram of inserting the read data into a sample queue of samples to which it belongs in a single-ended sequencing embodiment according to the present disclosure;
FIG. 9 is a flow diagram of writing read data in the sample queue to an output fastq file in a single-ended sequencing embodiment in accordance with the present invention;
FIG. 10 is a flow diagram of inserting the read data into a sample queue of samples to which it belongs according to an embodiment of paired-end sequencing of the present invention;
FIG. 11 is a flow chart of writing read data in the sample queue to an output fastq file in an embodiment of double-ended sequencing according to the present invention;
FIG. 12 is a block diagram of an exemplary electronic device capable of implementing embodiments of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
In addition, the term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the objects before and after it are in an "or" relationship.
According to the method, a fastq file containing a plurality of samples is loaded concurrently by multiple threads, the read data of each sample is separated, and the read data is written to the output fastq files asynchronously by the sample thread corresponding to each sample. Because the threads work in parallel and in coordination, working efficiency is improved, the working time is shortened, the computer's resources are better utilized, and sample read data can be separated from a fastq file quickly.
FIG. 1 shows a flow diagram of a method of separating sample read data from a fastq file, according to an embodiment of the invention.
The method S100 includes:
S110: concurrently loading the fastq file containing a plurality of samples with two threads, and constructing and outputting read data.
A fastq file containing multiple samples is generally very large, ranging from a few GB to tens or even hundreds of GB. For the subsequent gene sequence analysis, per-sample fastq files need to be split from the original fastq file, i.e. the gene data of each sample is separated into its own fastq file.
As an embodiment of the invention, the method uses two threads, a first thread and a second thread. The first thread reads the fastq file data block by block and inserts the loaded data blocks into a data block queue; the second thread takes data blocks out of the data block queue, parses the data in each block, and outputs read data. The process of concurrently loading the fastq file with the two threads and constructing and outputting read data, as shown in FIG. 2, includes:
S111: allocating a first thread, a second thread and a data block queue, and setting the maximum data item number limit of the data block queue and the size of a data block. The data block queue comprises a plurality of data blocks arranged in sequence; when it is not empty, it has at least a head data block and a tail data block. The access logic of the data block queue is defined as tail-in, head-out. A fixed size, e.g. 1 MB, is set for the data block; it serves as the data access size standard, so that the fastq block data loaded each time is of equal size.
S112: in the first thread, loading the fastq file block by block according to the data block size. Before each insertion, the first thread judges whether the data block queue has reached the maximum data item number limit; if so, it enters a waiting state until the number of data items in the data block queue is below the limit, and then inserts the data block at the tail of the data block queue; otherwise, it inserts the data block at the tail directly. The inserted data block becomes the new tail of the data block queue. The first thread then judges whether reading of the fastq file is finished; if so, it sets the end mark of the data block queue and the first thread ends; if not, it returns to waiting for a data block to be allocated and continues loading the fastq file.
S113: in the second thread, judging whether the data block queue is empty and whether the end mark is set. If the data block queue is empty and the end mark is set, the second thread ends; if the data block queue is empty and the end mark is not set, the second thread enters a waiting state until the data block queue is not empty or the end mark is set; if the data block queue is not empty, the second thread takes a data block from the head of the data block queue to obtain the fastq block data.
The first thread and the second thread run in parallel: the first thread loads the data blocks of the fastq file into the data block queue in order, while the second thread takes the data blocks from the head of the queue one by one and parses read data from them in memory. With this parallel working mode, the threads cooperate simultaneously, which improves working efficiency and greatly reduces the working time.
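The two-thread loading in S111 to S113 can be sketched as a bounded producer/consumer queue. The following Python sketch is only an illustration under stated assumptions, not the patented implementation: the names `load_blocks`, `consume_blocks`, `run_pipeline` and `MAX_BLOCKS` are hypothetical, and the end mark is modeled as a sentinel item placed at the tail of the queue rather than as a separate flag.

```python
import queue
import threading

BLOCK_SIZE = 1 << 20   # 1 MB data block size, as in the example given in S111
MAX_BLOCKS = 8         # hypothetical maximum data item number limit
END_MARK = None        # sentinel standing in for the queue end mark (assumption)


def load_blocks(stream, block_queue, block_size=BLOCK_SIZE):
    """First thread: read fixed-size blocks and insert them at the queue tail."""
    while True:
        block = stream.read(block_size)
        if not block:              # an empty read means the file is finished
            break
        block_queue.put(block)     # waits while the queue is at its item limit
    block_queue.put(END_MARK)      # signal that loading has finished


def consume_blocks(block_queue, handle_block):
    """Second thread: take blocks from the queue head until the end mark."""
    while True:
        block = block_queue.get()  # waits while the queue is empty
        if block is END_MARK:
            break
        handle_block(block)


def run_pipeline(stream, handle_block, block_size=BLOCK_SIZE, max_blocks=MAX_BLOCKS):
    """Run the loader and the parser concurrently on one binary stream."""
    block_queue = queue.Queue(maxsize=max_blocks)
    loader = threading.Thread(target=load_blocks,
                              args=(stream, block_queue, block_size))
    parser = threading.Thread(target=consume_blocks,
                              args=(block_queue, handle_block))
    loader.start()
    parser.start()
    loader.join()
    parser.join()
```

A caller would pass a file opened in binary mode and a callback that parses each block; `queue.Queue` supplies the waiting behavior described for the full and empty cases.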
FIG. 3 is a diagram illustrating a read data structure according to an embodiment of the present invention.
S114: splitting the fastq block data into lines at the line feed characters, constructing one piece of read data from every 4 consecutive lines to obtain a plurality of read data, and outputting the read data one by one in order.
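A minimal Python sketch of this step follows. The source does not say how a read that straddles two fixed-size blocks is handled; carrying the incomplete tail of one block over to the next is an assumption of this sketch, and the function name `iter_reads` is hypothetical.

```python
def iter_reads(blocks):
    """Yield each read as a tuple of 4 lines (bytes) from an iterable of blocks."""
    leftover = b""   # incomplete last line of the previous block (assumption)
    lines = []       # complete lines collected for the current read
    for block in blocks:
        parts = (leftover + block).split(b"\n")
        leftover = parts.pop()      # the final piece has no line feed yet
        for line in parts:
            lines.append(line)
            if len(lines) == 4:     # every 4 lines form one read
                yield tuple(lines)
                lines = []
    if leftover:                    # file did not end with a line feed
        lines.append(leftover)
    if len(lines) == 4:
        yield tuple(lines)
    # A trailing group of fewer than 4 lines is silently dropped in this sketch.
```

The generator accepts blocks of any size, so it can be driven directly by the data blocks taken from the data block queue.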
In an embodiment of the present invention, as shown in FIG. 3(a), four non-contiguous memory regions are used to store the four lines of a read respectively, where the first line is the information data, the second line is the gene sequence data, the third line is the annotation data, and the fourth line is the quality data; the read data are output one by one in this memory format.
In another embodiment of the present invention, as shown in FIG. 3(b), one contiguous block of memory is used to store the complete four lines of a read. This block is named READ, with start position 0 and end position end. Three index values, LF_pos1, LF_pos2 and LF_pos3, point to the positions of the line feed characters that end the first, second and third lines of the read data respectively. The three line feed characters divide the block into 4 lines. The first line is the information line, READ[0, LF_pos1), the left-closed, right-open interval from position 0 to LF_pos1; the second line is the sequence line, READ[LF_pos1+1, LF_pos2); the third line is the annotation line, READ[LF_pos2+1, LF_pos3); the fourth line is the quality line, READ[LF_pos3+1, end), the left-closed, right-open interval from LF_pos3+1 to the end position end of READ. The four lines of the read are thus stored completely in one contiguous block of memory, and each read obtained is output in order.
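The contiguous layout of FIG. 3(b) can be illustrated with a small Python class. This is a sketch only: the class name `Read` and its accessors are hypothetical, the buffer is assumed to hold exactly one read without a trailing line feed, and `memoryview` slices are used so that the four lines are exposed without copying.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    buf: bytes     # READ: the four lines in one contiguous buffer
    lf_pos1: int   # index of the line feed ending the information line
    lf_pos2: int   # index of the line feed ending the sequence line
    lf_pos3: int   # index of the line feed ending the annotation line

    @classmethod
    def from_bytes(cls, buf):
        """Locate the three line feeds once; no line is split off or copied."""
        p1 = buf.index(b"\n")
        p2 = buf.index(b"\n", p1 + 1)
        p3 = buf.index(b"\n", p2 + 1)
        return cls(buf, p1, p2, p3)

    def _view(self, start, stop):
        # Left-closed, right-open interval [start, stop) over the shared buffer.
        return memoryview(self.buf)[start:stop]

    @property
    def info(self):
        return self._view(0, self.lf_pos1)

    @property
    def sequence(self):
        return self._view(self.lf_pos1 + 1, self.lf_pos2)

    @property
    def annotation(self):
        return self._view(self.lf_pos2 + 1, self.lf_pos3)

    @property
    def quality(self):
        return self._view(self.lf_pos3 + 1, len(self.buf))
```

Writing the read back out needs only `buf` plus one line feed, which is why this layout avoids both the per-line copies and the later re-joining of four separate lines.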
After the data block is consumed, a request to release the data block is issued.
Optimizing the read storage structure greatly reduces the number of basic operations on the read data. For example, assuming there are 1 billion reads in the original fastq file, the optimized read storage structure eliminates, over the whole gene data separation, 1 billion × 4 = 4 billion operations of splitting and copying data and 1 billion operations of splicing four lines back into contiguous memory. This avoids a large amount of unnecessary CPU and memory consumption.
S120: parsing a barcode pair from the read data, identifying the sample to which the read data belongs according to the correspondence between the barcode pair and the sample number, and inserting the read data into the sample queue of the sample to which it belongs.
Further, S121, parsing the barcode pair from the read data, includes:
S1211: taking the first 8 characters of the second line of the read data as the barcode of the read data.
S1212: constructing a barcode pair from the barcode.
In the case of single-ended sequencing, there is one barcode, which is copied to obtain two identical barcodes serving as the barcode pair;
in the case of paired-end sequencing, there are two barcodes, which together serve as the barcode pair.
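Steps S1211 and S1212 amount to a short slicing operation. The Python sketch below illustrates them under the assumption that the second line (the sequence line) of each read is already available as a string; the function names `barcode_of` and `barcode_pair` are hypothetical.

```python
BARCODE_LEN = 8  # the first 8 characters of the sequence line form the barcode


def barcode_of(sequence_line):
    """Return the barcode of a read, given its second (sequence) line."""
    return sequence_line[:BARCODE_LEN]


def barcode_pair(read1_sequence, read2_sequence=None):
    """Build the barcode pair for single-ended or paired-end sequencing."""
    first = barcode_of(read1_sequence)
    if read2_sequence is None:
        # Single-ended: duplicate the one barcode to form the pair.
        return (first, first)
    # Paired-end: one barcode from each of the two reads.
    return (first, barcode_of(read2_sequence))
```

Representing the single-ended case as a pair of identical barcodes lets the later sample lookup treat both sequencing modes the same way.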
Further, S122, identifying the sample to which the read data belongs according to the correspondence between the barcode pair and the sample number, as shown in FIG. 4, includes:
S1221: as shown in FIG. 5, grouping the barcodes in the read data to obtain a plurality of barcode groups. Each barcode group contains a plurality of barcodes, and no barcode appears in more than one group. Any barcode and every barcode that forms a barcode pair with it are in the same group, so barcode pairs and barcode groups are in a many-to-one relationship. This gives the unique correspondence between a barcode pair and a barcode group, i.e. any barcode pair locates exactly one barcode group.
S1222: defining the unique correspondence between barcode groups and sample numbers to obtain the unique correspondence between barcode pairs and sample numbers.
S1223: identifying the sample to which the read data corresponding to the barcode belongs according to the unique correspondence between the barcode pair and the sample number.
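One way to realize the barcode group lookup is a dictionary from barcode to sample number, built once from the group definitions. The sketch below is an assumption-laden illustration: `build_barcode_index` and `identify_sample` are hypothetical names, the group table is supplied by the caller, and returning `None` for an unknown or mismatched pair is a choice of this sketch, since the source does not describe how unmatched reads are treated.

```python
def build_barcode_index(groups):
    """Map every barcode to its sample number.

    groups: dict of sample number -> iterable of barcodes in that sample's group.
    """
    index = {}
    for sample, barcodes in groups.items():
        for barcode in barcodes:
            if barcode in index:
                # Barcodes must not repeat across groups.
                raise ValueError(f"barcode {barcode!r} appears in two groups")
            index[barcode] = sample
    return index


def identify_sample(index, pair):
    """Return the sample number for a barcode pair, or None if there is no match."""
    first, second = pair
    sample = index.get(first)
    if sample is None or index.get(second) != sample:
        return None   # unknown barcode, or the two barcodes are in different groups
    return sample
```

Because both barcodes of a valid pair sit in the same group, two dictionary lookups are enough to locate the unique group and therefore the sample.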
Further, S123, inserting the read data into the sample queue of the sample to which it belongs, as shown in FIG. 6, includes:
allocating a sample queue for each sample; the sample queue is used for storing the read data of the same sample in order;
judging whether the sequencing is single-ended; if so, acquiring one piece of read data and inserting it at the tail of the corresponding sample queue; otherwise, acquiring read1 data and read2 data from the r1 and r2 fastq files respectively (two pieces of read data in total), associating the read2 data with the read1 data, and inserting the associated read1 data at the tail of the corresponding sample queue.
As an embodiment of the present invention, each time read data has been inserted into the sample queue of the corresponding sample, it is judged whether all read data has been acquired; if so, end marks are set for all sample queues and the process of extracting read data ends; otherwise, the process returns to the step of inserting read data into the sample queue of the sample to which it belongs.
Whether all read data has been acquired is judged by detecting the end-of-file indication provided by the underlying file system; for example, when reading of a fastq file ends, the EOF mark returned by the file system is obtained, where EOF is the abbreviation of "end of file" and indicates that the file has been read completely.
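The insertion step can be sketched in Python as follows. This is an illustration under several assumptions: `ReadRecord`, `dispatch` and `sample_of` are hypothetical names, the end mark is modeled as a sentinel appended to every sample queue, exhaustion of the input iterator stands in for the EOF check, and reads whose sample cannot be identified are dropped.

```python
import itertools
import queue
from dataclasses import dataclass
from typing import Optional

END_MARK = None  # sentinel standing in for the sample queue end mark (assumption)


@dataclass
class ReadRecord:
    lines: tuple                          # the four lines of the read
    mate: Optional["ReadRecord"] = None   # read2 associated with read1, if any


def dispatch(reads1, reads2, sample_of, sample_queues):
    """Insert each read into the queue of its sample.

    reads1, reads2: iterables of 4-line tuples; reads2 is None for single-ended.
    sample_of: callable mapping a ReadRecord to a sample number or None.
    sample_queues: dict of sample number -> queue.Queue.
    """
    mates = reads2 if reads2 is not None else itertools.repeat(None)
    for lines1, lines2 in zip(reads1, mates):
        record = ReadRecord(lines1)
        if lines2 is not None:
            record.mate = ReadRecord(lines2)   # associate read2 with read1
        sample = sample_of(record)
        if sample in sample_queues:
            sample_queues[sample].put(record)  # insert at the queue tail
    for sample_queue in sample_queues.values():
        sample_queue.put(END_MARK)             # all reads acquired: mark every queue
```

Only read1 enters the queue in the double-ended case; read2 travels with it through the `mate` reference, which keeps each pair together in the output order.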
S130: writing the read data in the sample queue into the output fastq file of the corresponding sample through the asynchronous sample thread pool, as shown in FIG. 7, includes:
In the asynchronous sample thread pool, each asynchronous sample thread obtains a piece of read data from the head of its corresponding sample queue. The asynchronous sample thread pool contains a plurality of asynchronous sample threads, each asynchronous sample thread uniquely corresponds to one sample queue, and different asynchronous sample threads are independent of each other, i.e. different asynchronous sample threads can run in parallel.
Reverse association is performed on the obtained read data, i.e. an attempt is made to find, through the obtained read data, the read data associated with it. The attempt has two possible outcomes: either an associated read is found or it is not. If an associated read is found, the sequencing is double-ended, and the obtained read data and the reverse association result are each written into the output fastq files of the corresponding sample; if no associated read is found, the sequencing is single-ended, and the obtained read data is written into the output fastq file of the corresponding sample.
After the obtained read data has been written into the output fastq file of the corresponding sample, the thread judges whether the current sample queue is empty and whether its end mark is set. If the queue is empty and the end mark is set, all read data of the fastq file has been taken out, and the current asynchronous sample thread ends. If the queue is empty but the end mark is not set, the queue is only temporarily empty and the fastq file still has read data that has not been consumed, so the thread enters a waiting state until the queue is not empty or its end mark is set. If the queue is not empty, the thread returns to S130 and continues writing the read data in the sample queue into the output fastq file of the corresponding sample.
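The per-sample writer threads can be sketched in Python as below. The sketch makes several assumptions that are not in the source: each queue item is a hypothetical `(read1_lines, read2_lines_or_None)` tuple, so that reverse association reduces to checking the second element; the end mark is a sentinel item; and plain `threading.Thread` objects, one per sample queue, stand in for the asynchronous sample thread pool.

```python
import queue
import threading

END_MARK = None  # sentinel standing in for the sample queue end mark (assumption)


def write_sample(sample_queue, out1, out2=None):
    """Writer thread for one sample: drain its queue into its output file(s)."""
    while True:
        item = sample_queue.get()       # waits while the queue is empty
        if item is END_MARK:            # empty and marked as ended: stop
            break
        lines1, lines2 = item           # lines2 is the associated read2, or None
        out1.write("\n".join(lines1) + "\n")
        if lines2 is not None:          # reverse association found a mate
            out2.write("\n".join(lines2) + "\n")


def start_writers(sample_queues, outputs):
    """Start one independent writer thread per sample queue.

    outputs: dict of sample number -> (read1 output, read2 output or None).
    """
    threads = []
    for sample, sample_queue in sample_queues.items():
        out1, out2 = outputs[sample]
        thread = threading.Thread(target=write_sample,
                                  args=(sample_queue, out1, out2))
        thread.start()
        threads.append(thread)
    return threads
```

Since each thread owns exactly one queue and one set of output files, the writers need no locking among themselves and can proceed in parallel.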
With the technical scheme of the invention, the time needed to separate sample read data from the fastq file is greatly reduced. A detailed performance comparison is as follows:
[Table: performance comparison (image BDA0002504518340000121)]
As shown in the table above, in the worst-case sample read data separation scenario, "double-ended fastq separation", the separation tool based on the invention takes on average more than 80% less time than a separation tool implemented conventionally in Python.
The method provides separate processing for single-ended and double-ended sequencing at each step, so it supports fast and efficient separation of sample read data from a fastq file for both single-ended and double-ended sequencing, and ensures the correctness of the output data in both modes.
In some optional implementations of this embodiment, figs. 1 to 5 and 8 to 9 show the process of separating sample read data from a fastq file according to the present invention in the case of single-ended sequencing. In this case, step S110 of the above embodiment loads one fastq file.
In step S122, there is one piece of read data; one barcode is obtained from the second line of the read data, and the barcode is copied to obtain two identical barcodes, which serve as the barcode pair.
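The single-ended construction of a barcode pair can be illustrated with a short Python sketch. It assumes that a read is held as its four fastq lines and, following claim 3, that the barcode is the first 8 characters of the second line; the function names and the sample record are hypothetical.

```python
def parse_barcode(read_lines, length=8):
    """Return the barcode: the first `length` characters of the read's second line."""
    return read_lines[1][:length]


def single_end_barcode_pair(read_lines):
    """Single-ended case: copy the one barcode to form a pair of identical barcodes."""
    barcode = parse_barcode(read_lines)
    return (barcode, barcode)


# A made-up fastq record: ID line, sequence, separator, quality string.
read = ["@read1", "ACGTACGTTTGACCA", "+", "IIIIIIIIIIIIIII"]
pair = single_end_barcode_pair(read)
```

Representing the single-ended barcode as a pair lets the later sample lookup use the same code path as the double-ended case.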
As shown in fig. 8, in step S123, a sample queue is allocated to each sample; the sample queue stores the read data of the same sample in order. One piece of read data is acquired from the fastq file and inserted at the tail of the corresponding sample queue.
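The allocation of one queue per sample and the tail insertion can be sketched as below. This is only an illustration: the class name `SampleDispatcher`, the sample identifiers, and the use of `collections.deque` are assumptions, and thread synchronisation is omitted for brevity.

```python
from collections import deque


class SampleDispatcher:
    """Holds one FIFO queue per sample (hypothetical helper)."""

    def __init__(self, sample_ids):
        # Allocate a queue for every sample up front.
        self.queues = {sid: deque() for sid in sample_ids}

    def dispatch(self, sample_id, read):
        # Tail insertion preserves the input order of reads within a sample.
        self.queues[sample_id].append(read)


dispatcher = SampleDispatcher(["S1", "S2"])
for sid, read in [("S1", "r1"), ("S2", "r2"), ("S1", "r3")]:
    dispatcher.dispatch(sid, read)
```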
As shown in fig. 9, in step S130, a piece of read data is obtained from the head of the corresponding sample queue in the asynchronous sample thread. There are multiple asynchronous sample threads; each corresponds to exactly one sample queue, and different asynchronous sample threads are independent of each other, that is, they can run in parallel. Reverse association is performed on the acquired read data, that is, an attempt is made to retrieve the read data associated with it; since no result can be obtained by reverse association in this case, the acquired read data is written into the output fastq file of the corresponding sample.
In some optional implementations of this embodiment, figs. 1 to 5 and 10 to 11 show the process of separating sample read data from a fastq file according to the present invention in the case of double-ended sequencing. In this case, there are two fastq files, r1 and r2, and two reads are output, read1 and read2. Each pair of read1 and read2 from the two fastq files has the same ID.
In step S122, there are two pieces of read data; one barcode is obtained from the second line of each piece of read data, and the two barcodes serve as the barcode pair.
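A double-ended barcode pair and its lookup against barcode groups can be sketched in Python as follows. The sketch assumes the 8-character barcode length and the barcode-group-to-sample mapping stated in claims 3 and 4; the function names, the example groups, and the choice to return `None` for a pair that matches no sample are assumptions made for illustration.

```python
def paired_end_barcode_pair(read1_lines, read2_lines, length=8):
    """One barcode from the second line of each read; the two form the pair."""
    return (read1_lines[1][:length], read2_lines[1][:length])


def build_barcode_index(groups):
    """Map every barcode to the sample number of its group.

    `groups` maps a sample number to the set of barcodes in its barcode group.
    """
    index = {}
    for sample, barcodes in groups.items():
        for barcode in barcodes:
            if barcode in index:
                raise ValueError(f"barcode {barcode} appears in two groups")
            index[barcode] = sample
    return index


def identify_sample(index, barcode_pair):
    """Return the sample number if both barcodes lie in the same group, else None."""
    first, second = barcode_pair
    sample = index.get(first)
    if sample is None or sample != index.get(second):
        return None  # assumed handling of unknown or mismatched pairs
    return sample


groups = {1: {"AAAAAAAA", "CCCCCCCC"}, 2: {"GGGGGGGG", "TTTTTTTT"}}
index = build_barcode_index(groups)
read1 = ["@pair1", "AAAAAAAAACGTACGT", "+", "IIIIIIIIIIIIIIII"]
read2 = ["@pair1", "CCCCCCCCTTGATTGA", "+", "IIIIIIIIIIIIIIII"]
sample = identify_sample(index, paired_end_barcode_pair(read1, read2))
```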
As shown in fig. 10, in step S123, a sample queue is allocated to each sample; the sample queue stores the read data of the same sample in order. Read1 and read2 are acquired from the two fastq files r1 and r2, two pieces of read data in total; read2 is associated with read1, and after the association read1 is inserted at the tail of the corresponding sample queue.
As shown in fig. 11, in step S130, a piece of read1 data is obtained from the head of the corresponding sample queue in the asynchronous sample thread. There are multiple asynchronous sample threads; each corresponds to exactly one sample queue, and different asynchronous sample threads are independent of each other, that is, they can run in parallel. Reverse association is performed on the acquired read1 data; in this case the reverse association succeeds and its result is read2, and the read1 data and the reverse-associated read2 data are written into the respective output fastq files of the corresponding sample.
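The association of read2 with read1 and the later reverse association at write time can be sketched as below. This is a hedged illustration rather than the described implementation: the `Read` class, its `mate` field, and the functions `associate` and `write_read` are hypothetical, the ID check reflects the statement that each read1/read2 pair shares the same ID, and in-memory buffers stand in for the two output fastq files of a sample.

```python
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class Read:
    read_id: str
    sequence: str
    quality: str
    mate: Optional["Read"] = None  # associated read2, if any

    def to_fastq(self):
        return f"@{self.read_id}\n{self.sequence}\n+\n{self.quality}\n"


def associate(read1, read2):
    """Attach read2 to read1 so that only read1 needs to be queued."""
    if read1.read_id != read2.read_id:
        raise ValueError("read1 and read2 must share the same ID")
    read1.mate = read2
    return read1


def write_read(read1, out_r1, out_r2):
    """Write read1, and read2 as well when reverse association finds one."""
    out_r1.write(read1.to_fastq())
    if read1.mate is not None:  # reverse association succeeded: double-ended
        out_r2.write(read1.mate.to_fastq())


out_r1, out_r2 = io.StringIO(), io.StringIO()
r1 = Read("pair1", "ACGT", "IIII")
r2 = Read("pair1", "TTGA", "FFFF")
write_read(associate(r1, r2), out_r1, out_r2)
```

A single-ended read simply has no `mate`, so the same `write_read` call writes only to the first output.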
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules illustrated are not necessarily required to practice the invention.
An exemplary electronic device capable of implementing embodiments of the present invention is shown in fig. 12.
The device 1200 includes a Central Processing Unit (CPU) 1201 that can perform various appropriate actions and processes according to computer program instructions stored in a Read Only Memory (ROM) 1202 or computer program instructions loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data necessary for the operation of the device 1200 can also be stored. The CPU 1201, ROM 1202, and RAM 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to bus 1204.
Various components in the device 1200 are connected to the I/O interface 1205 including: an input unit 1206 such as a keyboard, a mouse, or the like; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208, such as a magnetic disk, optical disk, or the like; and a communication unit 1209 such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processing unit 1201 executes the respective methods and processes described above, such as the method S100. For example, in some embodiments, method S100 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the CPU 1201, one or more steps of the method S100 described above may be performed. Alternatively, in other embodiments, the CPU 1201 may be configured to perform the method S100 by any other suitable means (e.g., by way of firmware).
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a complex programmable logic device (CPLD), and the like.
Program code for implementing the methods of the present invention may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the invention. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (10)

1. A method for separating sample read data from a fastq file, comprising:
loading a fastq file containing a plurality of samples through two threads concurrently, constructing and outputting read data;
analyzing a barcode pair from the read data, identifying a sample to which the read data belongs according to the corresponding relation between the barcode pair and the sample number, and inserting the read data into a sample queue of the sample to which the read data belongs;
and writing the read data in the sample queue into an output fastq file of a corresponding sample through an asynchronous sample thread in an asynchronous sample thread pool.
2. The method of claim 1, wherein the loading the fastq file containing a plurality of samples concurrently by two threads, constructing and outputting read data comprises:
allocating a first thread, a second thread and a data block queue, and setting the maximum data item number limit and the size of a data block of the data block queue;
in the first thread, sending an allocation request, and waiting to be allocated a data block of the set data block size;
in the first thread, reading the fastq file according to the size of the data block, putting the read data into the distributed data block, judging whether the data block queue reaches the maximum data item number limit, if so, entering a waiting state until the number of the data items of the data block queue is less than the maximum data item number limit, and inserting the data block into the tail of the data block queue; otherwise, inserting the data block into the tail of the data block queue;
continuously judging whether the reading of the fastq file is finished, if the reading is finished, setting a data block queue finishing mark, and finishing the first thread; if not, returning to wait for distributing the data block and continuously loading the fastq file;
in the second thread, judging whether the data block queue is empty and whether an end mark is set; if the data block queue is empty and the end mark is set, ending the second thread; if the data block queue is empty and the end mark is not set, entering a waiting state until the data block queue is not empty or the end mark is set; if the data block queue is not empty, taking out a data block from the head of the data block queue to obtain fastq block data;
sequentially parsing the fastq block data at line feeds to obtain a plurality of pieces of read data, and outputting the read data one by one in order;
after the data block is consumed, a request to release the data block is issued.
3. The method of claim 1, wherein parsing barcode pairs from the read data comprises:
taking the first 8 characters of the second line of the read data as the barcode of the read data;
constructing a barcode pair according to the barcode;
in the case of single-ended sequencing, copying the barcode to obtain two identical barcodes serving as the barcode pair;
in the case of double-ended sequencing, there are two barcodes, and the two barcodes serve as the barcode pair.
4. The method according to claim 1, wherein the identifying the sample to which the read data belongs according to the correspondence between the barcode pair and the sample number comprises:
grouping the barcodes in the read data to obtain a plurality of barcode groups, wherein each barcode group comprises a plurality of different barcodes, and any barcode and the barcode paired with it are in the same group, so as to obtain a unique correspondence between barcode pairs and barcode groups;
defining the unique corresponding relation between the barcode group and the sample number to obtain the unique corresponding relation between the barcode pair and the sample number;
and identifying the sample to which the read data corresponding to the barcode belongs according to the unique corresponding relation between the barcode pair and the sample number.
5. The method of claim 1, wherein said inserting the read data into a sample queue of samples to which the read data belongs comprises:
allocating a sample queue for each sample; the sample queue is used for storing read data of the same sample in sequence;
and inserting the read data into the tail of the sample queue of the sample to which the read data belongs.
6. The method of claim 5, further comprising:
in the case of double-ended sequencing, there are two pieces of read data, namely read1 data and read2 data; associating the read2 data with the read1 data;
inserting the associated read1 data at the tail of the sample queue of the sample to which it belongs.
7. The method of claim 1, further comprising:
after each insertion of read data into the sample queue of the sample to which it belongs, judging whether all the read data has been acquired; if so, setting end marks for all the sample queues; otherwise, returning to the step of inserting the read data into the sample queue of the sample to which the read data belongs.
8. The method of claim 1, wherein writing read data in the sample queue to an output fastq file of corresponding samples by an asynchronous sample thread in an asynchronous sample thread pool comprises:
in the asynchronous sample thread, judging whether the sample queue is empty and whether an end mark is set; if the sample queue is empty and the end mark is set, ending the current asynchronous sample thread; if the sample queue is empty and no end mark is set, entering a waiting state until the sample queue is not empty or an end mark is set; if the sample queue is not empty, taking out a piece of read data from the head of the sample queue;
performing reverse association on the obtained read data, and if a reverse-association result is obtained, writing the obtained read data and the reverse-association result into the respective output fastq files of the corresponding sample; and if no reverse-association result is obtained, writing the obtained read data into the output fastq file of the corresponding sample.
9. The method according to claim 1 or 8, wherein the asynchronous sample thread pool comprises a plurality of asynchronous sample threads, each asynchronous sample thread corresponds to one unique sample queue, and different asynchronous sample threads are independent from each other.
10. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program, wherein the processor, when executing the program, implements the method of any of claims 1-9.
CN202010442649.4A 2020-05-22 2020-05-22 Method for separating sample read data from fastq file Active CN111767256B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010442649.4A CN111767256B (en) 2020-05-22 2020-05-22 Method for separating sample read data from fastq file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010442649.4A CN111767256B (en) 2020-05-22 2020-05-22 Method for separating sample read data from fastq file

Publications (2)

Publication Number Publication Date
CN111767256A true CN111767256A (en) 2020-10-13
CN111767256B CN111767256B (en) 2023-10-20

Family

ID=72719645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010442649.4A Active CN111767256B (en) 2020-05-22 2020-05-22 Method for separating sample read data from fastq file

Country Status (1)

Country Link
CN (1) CN111767256B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100323348A1 (en) * 2009-01-31 2010-12-23 The Regents Of The University Of Colorado, A Body Corporate Methods and Compositions for Using Error-Detecting and/or Error-Correcting Barcodes in Nucleic Acid Amplification Process
US20130031092A1 (en) * 2010-04-26 2013-01-31 Samsung Electronics Co., Ltd. Method and apparatus for compressing genetic data
US20130311106A1 (en) * 2012-03-16 2013-11-21 The Research Institute At Nationwide Children's Hospital Comprehensive Analysis Pipeline for Discovery of Human Genetic Variation
CN103559020A (en) * 2013-11-07 2014-02-05 中国科学院软件研究所 Method for realizing parallel compression and parallel decompression on FASTQ file containing DNA (deoxyribonucleic acid) sequence read data
CN105760706A (en) * 2014-12-15 2016-07-13 深圳华大基因研究院 Compression method for next generation sequencing data
CN106096332A (en) * 2016-06-28 2016-11-09 深圳大学 Parallel fast matching method and system thereof towards the DNA sequence stored
CN106407743A (en) * 2016-08-31 2017-02-15 上海美吉生物医药科技有限公司 Cluster-based high-throughput data analyzing method
CN107180166A (en) * 2017-04-21 2017-09-19 北京希望组生物科技有限公司 A kind of full-length genome structure variation analysis method and system being sequenced based on three generations
CN107609350A (en) * 2017-09-08 2018-01-19 厦门极元科技有限公司 A kind of data processing method of two generations sequencing data analysis platform
CN108866051A (en) * 2018-06-19 2018-11-23 上海锐翌生物科技有限公司 Amplicon sequencing library and its construction method
CN109727644A (en) * 2018-11-12 2019-05-07 山东省医学科学院基础医学研究所 Venn figure production method and system based on microbial genome two generations sequencing data
CN110008262A (en) * 2019-02-02 2019-07-12 阿里巴巴集团控股有限公司 A kind of data export method and device
CN110033830A (en) * 2019-04-16 2019-07-19 苏州金唯智生物科技有限公司 A kind of data transmission method for uplink, device, equipment and storage medium
CN111061434A (en) * 2019-12-17 2020-04-24 人和未来生物科技(长沙)有限公司 Gene compression multi-stream data parallel writing and reading method, system and medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
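曾?瑶; 苑娜; 魏文娟; 李根; 杜政霖: "Challenges of high-throughput computing in the analysis of large-scale population cohort genomic data", 数据与计算发展前沿 (Frontiers of Data and Computing), no. 01, pages 121 - 131 *
朱香元; 李仁发; 李肯立; 胡忠望: "Research progress on parallel processing of biological sequence alignment based on heterogeneous systems", 计算机科学 (Computer Science), no. 2, pages 399 - 404 *
李娟; 汤德佑; 傅娟: "A parallel biological sequence alignment method based on the ant colony algorithm", 计算机工程与科学 (Computer Engineering & Science), no. 09, pages 34 - 40 *
陶然; 宋晓峰: "Research progress on alignment algorithms for high-throughput sequencing data", 计算机与应用化学 (Computers and Applied Chemistry), no. 01, pages 47 - 54 *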

Also Published As

Publication number Publication date
CN111767256B (en) 2023-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 102206 room 602, 6 / F, building 4, courtyard 4, shengshengyuan Road, Huilongguan town, Changping District, Beijing (Changping Demonstration Park)

Applicant after: Beijing Herui precision medical device technology Co.,Ltd.

Address before: 102206 room 602, 6 / F, building 4, courtyard 4, shengshengyuan Road, Huilongguan town, Changping District, Beijing (Changping Demonstration Park)

Applicant before: Beijing Herui precision medical laboratory Co.,Ltd.

TA01 Transfer of patent application right

Effective date of registration: 20230913

Address after: Room 102 and Room 103, 1st Floor, Building 5, No. 4 Life Park Road, Life Science Park, Changping District, Beijing, 102206

Applicant after: Beijing Herui exquisite medical laboratory Co.,Ltd.

Address before: 102206 room 602, 6 / F, building 4, courtyard 4, shengshengyuan Road, Huilongguan town, Changping District, Beijing (Changping Demonstration Park)

Applicant before: Beijing Herui precision medical device technology Co.,Ltd.

GR01 Patent grant