CN117795605A - Method and device for processing FASTQ data, electronic equipment and storage medium - Google Patents

Publication number
CN117795605A
Authority
CN
China
Prior art keywords
sequence
processed
fastq file
fastq
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280054964.7A
Other languages
Chinese (zh)
Inventor
邓天全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Huada Gene Technology Service Co ltd
BGI Technology Solutions Co Ltd
Original Assignee
Wuhan Huada Gene Technology Service Co ltd
BGI Technology Solutions Co Ltd
Application filed by Wuhan Huada Gene Technology Service Co ltd, BGI Technology Solutions Co Ltd

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Abstract

The application discloses a method and a device for processing FASTQ data, an electronic device, and a storage medium, and relates to the technical field of data processing. The main technical solution comprises: filtering an original FASTQ file to obtain a to-be-processed FASTQ file, wherein the original FASTQ file stores the sequence units of the original sequencing data; acquiring the row sequence identifier of each sequence unit in the to-be-processed FASTQ file; compressing the row sequence identifiers; and deleting the to-be-processed FASTQ file. Compared with the related art, only the row sequence identifiers of the sequence units in the filtered to-be-processed FASTQ file are compressed, rather than the entire sequence units, which reduces the storage space occupied by the to-be-processed FASTQ file. When the to-be-processed FASTQ file is needed, it is restored from the compressed row sequence identifiers and the original sequencing data, a process that is simple and easy to operate.

Description

Method and device for processing FASTQ data, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a method and apparatus for processing FASTQ data, an electronic device, and a storage medium.
Background
In recent years, as sequencing technology has developed, the price of sequencing has fallen steadily and the volume of data produced by sequencing keeps growing, so effectively reducing the storage space occupied by sequencing data in FASTQ format has become an urgent problem.
In the related art, the space occupied by FASTQ data is reduced as follows. After the sequencing device completes sequencing, original FASTQ data is obtained. Because this data contains sequencing errors, low sequencing quality values, and similar problems, it is filtered to obtain filtered FASTQ data. Finally, the filtered FASTQ data is compressed with a preset compression algorithm to reduce the space it occupies.
However, in practical applications the filtered FASTQ data is analyzed for only a short time but retained for a long time, so when there is a large amount of FASTQ data, the retained filtered data occupies still more storage space.
Disclosure of Invention
The application provides a method and a device for processing FASTQ data, an electronic device, and a storage medium. Its main purpose is to solve the problem that FASTQ data occupies a large amount of storage space.
According to a first aspect of the present application, there is provided a method of processing FASTQ data, comprising:
Filtering an original FASTQ file to obtain a FASTQ file to be processed, wherein the original FASTQ file is used for storing sequence units of original sequencing data;
and acquiring a row sequence identifier of each sequence unit in the FASTQ file to be processed, compressing the row sequence identifier, and deleting the FASTQ file to be processed.
Optionally, the method further includes: searching the original FASTQ file for the sequence unit corresponding to each compressed sequence identifier, so as to restore the FASTQ file to be processed.
Optionally, the obtaining the line sequence identifier of each sequence unit in the FASTQ file to be processed, and compressing the line sequence identifier includes:
and acquiring the sequence identification of the first row of each sequence unit in the FASTQ file to be processed, and compressing the sequence identification of the first row.
Optionally, the searching the sequence unit corresponding to the compressed sequence identifier from the original FASTQ file includes:
sequentially reading the compressed sequence identifications and storing the compressed sequence identifications in a preset data set;
and outputting, from the original FASTQ file to the FASTQ file to be processed, each sequence unit whose compressed sequence identifier is contained in the preset data set, so as to restore the FASTQ file to be processed.
Optionally, the searching the sequence unit corresponding to the compressed sequence identifier from the original FASTQ file includes:
sequentially reading the sequence identifications of the compressed first row, and storing the sequence identifications of the first row as a primary key in a preset data set;
and outputting, from the original FASTQ file to the FASTQ file to be processed, each sequence unit whose first-row sequence identifier is contained in the preset data set, so as to restore the FASTQ file to be processed, wherein the restored FASTQ file to be processed stores sequence units each containing four rows.
Optionally, after acquiring the line sequence identifier of each sequence unit in the FASTQ file to be processed and compressing the line sequence identifier, the method further includes:
and clearing the filtered sequencing data in the FASTQ file to be processed.
Optionally, after filtering the original FASTQ file to obtain a to-be-processed FASTQ file, the method further includes:
and performing a hash encryption calculation on the FASTQ file to be processed to obtain an encrypted hash value (the first encryption key).
Optionally, after searching the original FASTQ file for the sequence unit corresponding to each compressed sequence identifier to restore the FASTQ file to be processed, the method further includes:
performing an encryption calculation on the restored FASTQ file to be processed, using the same preset encryption algorithm, to obtain a second encryption key;
and determining, according to whether the first encryption key is consistent with the second encryption key, whether any data was lost in restoring the FASTQ data to be processed.
According to a second aspect of the present application, there is provided a FASTQ data processing apparatus, comprising:
the processing unit is used for filtering the original FASTQ file to obtain a FASTQ file to be processed, wherein the original FASTQ file is used for storing a sequence unit of original sequencing data;
the compressing unit is used for acquiring the row sequence identifier of each sequence unit in the FASTQ file to be processed and compressing the row sequence identifier;
and the deleting unit is used for deleting the FASTQ file to be processed.
Optionally, the apparatus further includes:
and the restoring unit is used for searching the sequence unit corresponding to the compressed sequence identifier from the original FASTQ file so as to restore the FASTQ file to be processed.
Optionally, the compression unit includes:
the acquisition module is used for acquiring the sequence identification of the first row of each sequence unit in the FASTQ file to be processed;
And the compression module is used for compressing the sequence identification of the first row.
Optionally, the restoring unit includes:
the first storage module is used for sequentially reading the compressed sequence identifiers and storing them in a preset data set;
and the first output module is used for outputting, from the original FASTQ file to the FASTQ file to be processed, each sequence unit whose compressed sequence identifier is contained in the preset data set, so as to restore the FASTQ file to be processed.
Optionally, the restoring unit includes:
the second storage module is used for sequentially reading the compressed first-row sequence identifiers and storing them as primary keys in a preset data set;
and the second output module is used for outputting, from the original FASTQ file to the FASTQ file to be processed, each sequence unit whose first-row sequence identifier is contained in the preset data set, so as to restore the FASTQ file to be processed, wherein the restored FASTQ file to be processed stores sequence units each containing four rows.
Optionally, the apparatus further includes:
and the deleting unit is used for clearing the filtered sequencing data in the FASTQ file to be processed after the compressing unit has acquired the row sequence identifier of each sequence unit in the FASTQ file to be processed and compressed the row sequence identifiers.
Optionally, the apparatus further includes:
the first encryption unit is used for performing, after the original FASTQ file is filtered to obtain the FASTQ file to be processed, a hash encryption calculation on the FASTQ file to be processed to obtain an encrypted hash value (the first encryption key).
Optionally, the device further comprises:
the second encryption unit is used for performing an encryption calculation on the restored FASTQ file to be processed, using the same preset encryption algorithm, to obtain a second encryption key;
and the determining unit is used for determining, according to whether the first encryption key is consistent with the second encryption key, whether any data was lost in restoring the FASTQ data to be processed.
According to a third aspect of the present application, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a fourth aspect of the present application, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of the preceding first aspect.
According to a fifth aspect of the present application there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect described above.
According to the method, the device, the electronic equipment and the storage medium for processing FASTQ data provided by the present application, the original FASTQ file is filtered to obtain the FASTQ file to be processed, wherein the original FASTQ file stores the sequence units of the original sequencing data; the row sequence identifier of each sequence unit in the FASTQ file to be processed is acquired and compressed; and the FASTQ file to be processed is deleted. Because only the row sequence identifiers are compressed, rather than the entire sequence units, the storage space occupied by the FASTQ file to be processed is reduced. When the FASTQ file to be processed is needed, it is restored from the compressed row sequence identifiers and the original sequencing data, a process that is simple and easy to operate.
It should be understood that the description of this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:
Fig. 1 is a flow chart of a method for processing FASTQ data according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a format of a FASTQ file to be processed according to an embodiment of the present application;
Fig. 3 is a flow chart of another method for processing FASTQ data according to an embodiment of the present application;
Fig. 4 is a flow chart of another method for processing FASTQ data according to an embodiment of the present application;
Fig. 5 is a flow chart of another method for processing FASTQ data according to an embodiment of the present application;
Fig. 6 is a flow chart of another method for processing FASTQ data according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a row sequence identifier of a first row of a sequence unit according to an embodiment of the present application;
Fig. 8 is a schematic diagram of a restored FASTQ file to be processed according to an embodiment of the present application;
Fig. 9 is a flow chart of another method for processing FASTQ data according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a FASTQ data processing apparatus according to an embodiment of the present application;
Fig. 11 is a schematic structural diagram of another FASTQ data processing apparatus according to an embodiment of the present application;
Fig. 12 is a schematic structural diagram of an electronic device 700 according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Methods, apparatuses, electronic devices, and storage media for processing FASTQ data according to embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a flow chart of a method for processing FASTQ data according to an embodiment of the present application.
As shown in fig. 1, the method comprises the steps of:
step 101, filtering an original FASTQ file to obtain a to-be-processed FASTQ file, where the original FASTQ file is used to store sequence units of original sequencing data.
In practice, the original FASTQ data output by the sequencer is obtained after sequencing; in the embodiments of the present application it is called raw data. During sequencing, part of the raw data may contain sequencing errors and therefore have low sequencing quality values. To preliminarily reduce the storage space occupied by the original FASTQ data, this low-quality data is filtered out, yielding a high-quality FASTQ file to be processed. In the embodiments of the present application, the FASTQ file to be processed is called clean data and is a subset of the raw data.
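The filtering step can be sketched as follows. The application does not state the filtering criteria, so everything concrete here is an assumption made for illustration: Phred+33 quality encoding, a mean-quality threshold of 20, a maximum N fraction of 10%, and the hypothetical names `mean_quality` and `filter_records`.

```python
from typing import Iterable, Iterator, Tuple

# One sequence unit: (identifier row, base row, '+' row, quality row).
Record = Tuple[str, str, str, str]


def mean_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 is an assumption)."""
    if not qual:
        return 0.0
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def filter_records(
    records: Iterable[Record],
    min_mean_q: float = 20.0,      # assumed threshold, not from the source
    max_n_fraction: float = 0.1,   # assumed threshold, not from the source
) -> Iterator[Record]:
    """Yield only the sequence units that pass both quality checks."""
    for record in records:
        _header, seq, _plus, qual = record
        if not seq:
            continue  # an empty read carries no information
        n_fraction = seq.upper().count("N") / len(seq)
        if mean_quality(qual) >= min_mean_q and n_fraction <= max_n_fraction:
            yield record
```

Because the filter only drops units and never modifies them, the clean data stays a strict subset of the raw data, which is the property the later restoration relies on.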
For better understanding, the embodiments of the present application illustrate the format of a FASTQ file to be processed; fig. 2 is a schematic diagram of that format. The FASTQ file to be processed is a standard file storing original sequencing data, in which every four rows represent one independent sequence unit (or sequence storage unit). As shown in fig. 2, the preset data format of each sequence unit is as follows:
the first row begins with '@' and contains the sequence identifier and associated descriptive information; it is the unique identifier of each sequence unit;
the second row is a sequence consisting of A, C, G, T and N, wherein A, C, G, T are base information and N is a placeholder substituted when sequencing of a base fails;
the third row begins with '+', which may be followed by the sequence identifier and associated descriptive information or by nothing at all; in the example shown in the embodiment of the application, the third row is only "+";
and the fourth row is the quality information of the sequence, corresponding one-to-one to the bases in the second row; each base has a quality value, expressed as an ASCII character, that measures the reliability of the sequenced base, and the higher the quality value, the more reliable the base.
It should be noted that fig. 2 is only an exemplary example, and the number of sequence units in the FASTQ file to be processed and the content of each row in each sequence unit in the embodiment of the present application are not limited, and fig. 2 is only given for convenience of better understanding the format.
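The four-row layout described above can be read with a small parser. This is a minimal sketch that assumes each sequence unit occupies exactly four rows with no line wrapping; the function name `read_fastq` is hypothetical.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (identifier row, bases, '+' row, qualities) per sequence unit."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate stray blank lines between units
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed sequence unit at {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"base/quality length mismatch at {header!r}")
        yield header, seq, plus, qual


# Usage with an in-memory file holding a single sequence unit.
example = io.StringIO("@read1 sample\nACGTN\n+\nIIII!\n")
for unit in read_fastq(example):
    print(unit)
```

The length check reflects the one-to-one correspondence between bases and quality values in the second and fourth rows.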
Step 102, acquiring a row sequence identifier of each sequence unit in the FASTQ file to be processed, compressing the row sequence identifier, and deleting the FASTQ file to be processed.
Referring to the detailed description of the sequence unit in step 101, when compression is performed, the row sequence identifier of each sequence unit in the FASTQ file to be processed is read and compressed; the compressed file contains only the identifier information of the sequence units.
As a feasible manner of the embodiment of the present application, the row sequence identifiers of at least one row of each sequence unit may be read and compressed, for example the row sequence identifiers of all four rows, or those of the first three rows. The number of row sequence identifiers selected is not limited.
As a possible manner of the embodiment of the present application, the row sequence identifier of a single row of each sequence unit may be read and compressed, for example that of the first row or that of the second row. Note that when the third row contains only '+' with no description information added, it cannot be read and compressed as an identifier. The embodiment of the application does not limit which specific row is selected for the row sequence identifier.
In the embodiment of the application, after the row sequence identifiers of the FASTQ file to be processed have been compressed, the FASTQ file to be processed is deleted, which saves storage space.
According to the method for processing FASTQ data provided by this embodiment, the original FASTQ file is filtered to obtain the FASTQ file to be processed, wherein the original FASTQ file stores the sequence units of the original sequencing data; the row sequence identifier of each sequence unit in the FASTQ file to be processed is acquired and compressed; and the FASTQ file to be processed is deleted. Because only the row sequence identifiers are compressed, rather than the entire sequence units, the storage space occupied by the FASTQ file to be processed is reduced. When the FASTQ file to be processed is needed, it is restored from the compressed row sequence identifiers and the original sequencing data, a process that is simple and easy to operate.
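Step 102 might look like the sketch below, which keeps only the first row of every sequence unit, writes those rows to a gzip file, and then deletes the filtered file. The choice of gzip, the plain-text input, and the names `compress_identifiers`, `clean_path` and `ids_path` are assumptions; the application only requires that the identifiers be compressed.

```python
import gzip
import os


def compress_identifiers(clean_path: str, ids_path: str,
                         delete_clean: bool = True) -> int:
    """Store the first row of each four-row unit in a gzip file.

    Returns the number of identifiers written. The whole first row,
    including any descriptive text after the identifier, is kept so
    that it can be matched exactly against the original file later.
    """
    count = 0
    with open(clean_path, "rt") as src, gzip.open(ids_path, "wt") as dst:
        for line_no, line in enumerate(src):
            # Rows 0, 4, 8, ... are the identifier rows of the units.
            if line_no % 4 == 0 and line.strip():
                dst.write(line.rstrip("\r\n") + "\n")
                count += 1
    if delete_clean:
        # Deleting the filtered file is what actually frees the space.
        os.remove(clean_path)
    return count
```

The original FASTQ file must be retained for this scheme to work, since the identifier file alone carries no base or quality information.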
In a possible implementation manner of this embodiment, as shown in fig. 3, another method for processing FASTQ data is further provided, including:
step 201, filtering the original FASTQ file to obtain a to-be-processed FASTQ file, where the original FASTQ file is used to store sequence units of original sequencing data.
Step 202, acquiring a row sequence identifier of each sequence unit in the FASTQ file to be processed, compressing the row sequence identifier, and deleting the FASTQ file to be processed.
For the descriptions of step 201 and step 202, refer to the detailed descriptions of step 101 and step 102; they are not repeated here.
Step 203, searching the original FASTQ file for the sequence unit corresponding to each compressed sequence identifier, so as to restore the FASTQ file to be processed.
When the FASTQ file to be processed is needed, the compressed row sequence identifiers are read, the original sequence units corresponding to them are output from the original FASTQ file, and these sequence units are written into a blank FASTQ file to be processed, thereby restoring the FASTQ file to be processed. Because the FASTQ file to be processed can be rebuilt simply by looking up, in the original FASTQ file, the sequence units corresponding to the compressed sequence identifiers, the space it occupies is reduced, and the restoration process is simple and easy to operate.
According to the method for processing FASTQ data provided by this embodiment, the original FASTQ file is filtered to obtain the to-be-processed FASTQ file, wherein the original FASTQ file stores the sequence units of the original sequencing data; the row sequence identifier of each sequence unit in the to-be-processed FASTQ file is acquired and compressed; the to-be-processed FASTQ file is deleted; and the sequence units corresponding to the compressed sequence identifiers are looked up in the original FASTQ file, thereby restoring the to-be-processed FASTQ file.
In a possible implementation manner of this embodiment, as shown in fig. 4, another method for processing FASTQ data is further provided, including:
step 301, filtering the original FASTQ file to obtain a to-be-processed FASTQ file, where the original FASTQ file is used to store sequence units of original sequencing data.
The original FASTQ data output by the sequencer is obtained after sequencing; in the embodiments of the present application it is called raw data. During sequencing, part of the raw data may contain sequencing errors and therefore have low sequencing quality values. To preliminarily reduce the storage space occupied by the original FASTQ data, this low-quality data is filtered out, yielding a high-quality FASTQ file to be processed. In the embodiments of the present application, the FASTQ file to be processed is called clean data and is a subset of the raw data.
Step 302, obtaining a sequence identifier of a first row of each sequence unit in the FASTQ file to be processed, and compressing the sequence identifier of the first row.
When compression is performed, the sequence identifier of the first row of each sequence unit in the FASTQ file to be processed is read and compressed. In the embodiment of the application, the compressed file is called Row1_ID.gz, and it contains only the first-row identifier information of the sequence units.
In the embodiment of the application, after the FASTQ file to be processed is compressed, the FASTQ file to be processed is deleted, so that the occupation of storage space is saved.
Step 303, searching a sequence unit corresponding to the compressed sequence identifier from the original FASTQ file, so as to implement restoration of the FASTQ file to be processed.
When the FASTQ file to be processed is needed, the compressed row sequence identifiers are read, the original sequence units corresponding to them are output from the original FASTQ file, and these sequence units are written into a blank FASTQ file to be processed, thereby restoring the FASTQ file to be processed.
The embodiments of the present application provide another method for processing FASTQ data, and fig. 5 is a schematic flow chart of another method for processing FASTQ data provided in the embodiments of the present application, where the method for processing FASTQ data may be performed alone, may be performed in combination with any one embodiment or possible implementation manner in the embodiment of the present application, and may also be performed in combination with any one technical solution in the related art.
Step 401, filtering the original FASTQ file to obtain a to-be-processed FASTQ file, where the original FASTQ file is used to store sequence units of original sequencing data.
In practice, the original FASTQ data output by the sequencer is obtained after sequencing; in the embodiments of the present application it is called raw data. During sequencing, part of the raw data may contain sequencing errors and therefore have low sequencing quality values. To preliminarily reduce the storage space occupied by the original FASTQ data, this low-quality data is filtered out, yielding a high-quality FASTQ file to be processed. In the embodiments of the present application, the FASTQ file to be processed is called clean data and is a subset of the raw data.
Step 402, acquiring a row sequence identifier of each sequence unit in the FASTQ file to be processed, compressing the row sequence identifier, and deleting the FASTQ file to be processed.
When compression is performed, the row sequence identifier of each sequence unit in the FASTQ file to be processed is read and compressed; the compressed file contains only the identifier information of the sequence units.
As a feasible manner of the embodiment of the present application, the row sequence identifiers of at least one row of each sequence unit may be read and compressed, for example the row sequence identifiers of all four rows, or those of the first three rows. The number of row sequence identifiers selected is not limited.
As a possible manner of the embodiment of the present application, the row sequence identifier of a single row of each sequence unit may be read and compressed, for example that of the first row or that of the second row. Note that when the third row contains only '+' with no description information added, it cannot be read and compressed as an identifier. The embodiment of the application does not limit which specific row is selected for the row sequence identifier.
In the embodiment of the application, after the row sequence identifiers of the FASTQ file to be processed have been compressed, the FASTQ file to be processed is deleted, which saves storage space.
Step 403, sequentially reading the compressed sequence identifiers and storing them in a preset data set.
The compressed sequence identifiers of the FASTQ file to be processed are read into a preset data set. Taking a hash table as an example of the preset data set, each sequence identifier read from the compressed file can be stored in the hash table as a primary key.
Step 404, outputting, from the original FASTQ file to the FASTQ file to be processed, each sequence unit whose compressed sequence identifier is contained in the preset data set, so that the FASTQ file to be processed is restored.
The sequence units in the original FASTQ file whose identifiers are contained in the preset data set (the hash table) are output, yielding the restored FASTQ file to be processed.
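Steps 403 and 404 can be sketched with a Python `set`, which is hash-based and plays the role of the hash table keyed by identifier. The sketch assumes the identifiers were stored one per line in a gzip file, that the original FASTQ file is plain text with four rows per unit, and that the names `restore_fastq`, `ids_path`, `raw_path` and `out_path` are hypothetical.

```python
import gzip


def restore_fastq(ids_path: str, raw_path: str, out_path: str) -> int:
    """Rebuild the filtered file from stored identifiers and the raw file.

    Returns the number of sequence units written.
    """
    # Step 403: load every stored identifier row into a hash-based set.
    with gzip.open(ids_path, "rt") as ids:
        wanted = {line.rstrip("\r\n") for line in ids if line.strip()}

    # Step 404: stream the raw file unit by unit and copy the matches.
    restored = 0
    with open(raw_path, "rt") as raw, open(out_path, "wt") as out:
        while True:
            unit = [raw.readline() for _ in range(4)]
            if not unit[0]:
                break  # end of file
            if unit[0].rstrip("\r\n") in wanted:
                out.writelines(unit)
                restored += 1
    return restored
```

Membership tests on the set take constant time on average, so the restoration is a single pass over the original file. The output follows the order of the original file, which matches the filtered file as long as filtering did not reorder the units.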
The embodiments of the present application provide another method for processing FASTQ data, and fig. 6 is a schematic flow chart of another method for processing FASTQ data provided in the embodiments of the present application, where the method for processing FASTQ data may be performed alone, may be performed in combination with any one embodiment or possible implementation manner in the embodiment of the present application, and may also be performed in combination with any one technical solution in the related art. The method comprises the following steps:
step 501, filtering an original FASTQ file to obtain a to-be-processed FASTQ file, where the original FASTQ file is used to store sequence units of original sequencing data.
In practice, the original FASTQ data output by the sequencer is obtained after sequencing; in the embodiments of the present application it is called raw data. During sequencing, part of the raw data may contain sequencing errors and therefore have low sequencing quality values. To preliminarily reduce the storage space occupied by the original FASTQ data, this low-quality data is filtered out, yielding a high-quality FASTQ file to be processed. In the embodiments of the present application, the FASTQ file to be processed is called clean data and is a subset of the raw data.
Step 502, obtaining a sequence identifier of a first row of each sequence unit in the FASTQ file to be processed, compressing the sequence identifier of the first row, and deleting the FASTQ file to be processed.
When compression is executed, reading a row sequence identifier in a sequence unit in the FASTQ file to be processed, and executing compression, wherein the compressed file only contains the identifier information of the sequence unit.
As a possible manner of the embodiment of the present application, one line of sequence unit identifiers in the sequence units may be read for compression, for example, the first line of sequence identifiers may be read for compression, or the second line of sequence identifiers may be read for compression, where it is to be noted that the third line has only '+', and cannot be read and compressed when no description information is added. The embodiment of the application does not limit the specific row sequence selected by the row sequence identifier.
In the embodiment of the application, after the FASTQ file to be processed is compressed, the FASTQ file to be processed is deleted, so that the occupation of storage space is saved.
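Step 502 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation of the application: gzip is assumed as the compression format (the examples in this document use ".gz" files), the clean file is assumed to be uncompressed text with exactly four rows per unit, and the function name `archive_identifiers` and the file names in the demo are hypothetical.

```python
import gzip
import os
import tempfile


def archive_identifiers(clean_path: str, id_path: str) -> int:
    """Write row 1 of every four-row unit to a gzip file, then delete the input.

    Returns the number of identifiers written.
    """
    count = 0
    with open(clean_path, "rt") as src, gzip.open(id_path, "wt") as dst:
        for line_no, line in enumerate(src):
            if line_no % 4 == 0:  # rows 0, 4, 8, ... are the identifier rows
                dst.write(line)
                count += 1
    os.remove(clean_path)  # the clean file is dropped once its identifiers are stored
    return count


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    clean_path = os.path.join(tmp, "clean.fq")
    id_path = os.path.join(tmp, "row1_id.gz")
    with open(clean_path, "wt") as fh:
        fh.write("@read1\nACGT\n+\nIIII\n@read3\nTTGA\n+\n5555\n")
    n_ids = archive_identifiers(clean_path, id_path)
    clean_deleted = not os.path.exists(clean_path)
    with gzip.open(id_path, "rt") as fh:
        stored_ids = fh.read().splitlines()

print(n_ids, clean_deleted, stored_ids)
```

Deleting the clean file is only safe while the original FASTQ file is retained, since the identifiers alone carry no base or quality information.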
Step 503, sequentially reading the sequence identifiers of the compressed first row, and storing the sequence identifiers of the first row as a primary key in a preset data set.
The sequence identifiers of the first row of the to-be-processed FASTQ file are read into a preset data set. Taking a hash table as an example of the preset data set, the sequence identifiers read from the compressed file may be stored in the hash table as primary keys.
Step 504, outputting, to the to-be-processed FASTQ file, the sequence units in the original FASTQ file that contain a first-row sequence identifier present in the preset data set, so that the to-be-processed FASTQ file is restored; the restored to-be-processed FASTQ file stores sequence units each containing four rows.
For ease of understanding, fig. 7 is a schematic diagram of the first-row sequence identifiers of the sequence units provided in an embodiment of the present application. With reference to fig. 2, compressing the first-row sequence identifiers yields the 2 first-row sequence identifiers shown in fig. 7, which are stored in a hash table as primary keys. Outputting the sequence units in the raw data of the original FASTQ file whose identifiers are in the hash table yields the first 8 rows of the original FASTQ file, that is, the restored to-be-processed FASTQ file; the restoration result is shown in fig. 8.
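Steps 503 and 504 can be sketched as follows, using a Python `set` as the hash-based preset data set. This is a minimal illustration, not the Perl code of the application: the function name `restore_units` and the sample reads are hypothetical, and the sketch assumes that identifiers are unique in the original file and that filtering removed whole units without modifying the surviving ones.

```python
from typing import Iterable, List, Set


def restore_units(raw_lines: Iterable[str], ids: Iterable[str]) -> List[str]:
    """Return the rows of every raw unit whose first row is a stored identifier."""
    keys: Set[str] = {i.rstrip("\n") for i in ids}  # identifiers as hash keys
    restored: List[str] = []
    unit: List[str] = []
    for line in raw_lines:
        unit.append(line.rstrip("\n"))
        if len(unit) == 4:          # a complete four-row sequence unit
            if unit[0] in keys:     # average constant-time membership test
                restored.extend(unit)
            unit = []
    return restored


raw = [
    "@read1 sampleW", "ACGT", "+", "IIII",
    "@read2 sampleW", "GGCA", "+", "HHHH",
    "@read3 sampleW", "TTGA", "+", "!!!!",
]
stored_ids = ["@read1 sampleW", "@read2 sampleW"]  # two identifiers, as in the fig. 7 example
restored = restore_units(raw, stored_ids)
print(len(restored))  # number of restored rows
```

With two stored identifiers that match the first two units, the restored rows are the first 8 rows of the raw data, mirroring the example described for fig. 7 and fig. 8. Units are emitted in raw-file order, so the restored file matches the deleted clean file only if the filter preserved that order.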
The embodiments of the present application provide another method for processing FASTQ data. Fig. 9 is a schematic flow chart of this method. The method may be performed alone, in combination with any embodiment or possible implementation of the present application, or in combination with any technical solution in the related art. The method comprises the following steps:
Step 601, filtering an original FASTQ file to obtain a to-be-processed FASTQ file, where the original FASTQ file is used to store sequence units of original sequencing data.
Step 602, calculating the FASTQ file to be processed according to a preset encryption algorithm to obtain a first encryption key.
To verify that the processing of FASTQ data in the embodiment of the present application reduces the storage space of the to-be-processed FASTQ file (clean data), an example of storing a 5.25Gb sequence file in FASTQ format is given below.
Example 1: Sample W was sequenced by a sequencer to obtain the original FASTQ sequence file (raw data), with a file size of 5.54Gb and the file name "W.fq.gz". "W.fq.gz" was filtered by preset filtering software to remove sequences with low quality values, yielding the to-be-processed FASTQ file "W.clean.fq.gz" with a file size of 5.25Gb. The md5 value of the decompressed file "W.clean.fq" is "747d1e0ee01dd591c0f4b08f06bf52ce". After the file is stored according to the embodiment of the present application and restored according to Example 2, this md5 value is used to check whether the clean data of the to-be-processed FASTQ file remains unchanged, so as to verify the reliability of the method described in the present application.
The first-row sequence identifier of each sequence unit in the file "W.clean.fq" is output to the file "Row1_ID", which is compressed to obtain "Row1_ID.gz", the final storage file for the clean data; the to-be-processed FASTQ file "W.clean.fq.gz" is then deleted. "Row1_ID.gz" is 0.12Gb, which is 2.29% of the size of "W.clean.fq.gz", so the storage space of the clean data is greatly reduced. In the embodiment of the application, the reading and restoration of the clean data are implemented in Perl, and the specific implementation code is as follows:
Step 603, acquiring a row sequence identifier of each sequence unit in the FASTQ file to be processed, compressing the row sequence identifier, and deleting the FASTQ file to be processed.
Step 604, searching for a sequence unit corresponding to the compressed sequence identifier from the original FASTQ file, so as to implement restoration of the FASTQ file to be processed.
Step 605, performing encryption calculation on the restored FASTQ file to be processed by adopting the preset encryption algorithm to obtain a second encryption key.
Step 606, determining whether there is a loss in the restoration process of the to-be-processed FASTQ data according to whether the first encryption key is consistent with the second encryption key.
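The check in steps 602, 605 and 606 can be sketched as follows, taking MD5 as the preset algorithm (the examples in this document report md5 values). This is a hedged Python illustration: the names `md5_of_stream` and `is_lossless` are hypothetical, and MD5 is strictly a message digest rather than an encryption key, so the "first" and "second" keys are represented here as hexadecimal digests.

```python
import hashlib
import io
from typing import BinaryIO


def md5_of_stream(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a binary stream, reading it in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def is_lossless(first_digest: str, second_digest: str) -> bool:
    """Restoration is treated as lossless when both digests are identical."""
    return first_digest == second_digest


clean_before = b"@read1\nACGT\n+\nIIII\n"    # clean data before deletion
clean_restored = b"@read1\nACGT\n+\nIIII\n"  # clean data after restoration
clean_damaged = b"@read1\nACGA\n+\nIIII\n"   # one base differs

first = md5_of_stream(io.BytesIO(clean_before))
second = md5_of_stream(io.BytesIO(clean_restored))
third = md5_of_stream(io.BytesIO(clean_damaged))

print(is_lossless(first, second))  # identical content gives identical digests
print(is_lossless(first, third))   # a changed base gives a different digest
```

Reading in chunks keeps memory use bounded for multi-gigabyte FASTQ files; for real files the stream would be opened with `open(path, "rb")` on the decompressed file, so the digest is comparable to the output of `md5sum`.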
Example 2: The 0.12Gb file "Row1_ID.gz" obtained in Example 1 is used together with the file "W.fq.gz" from Example 1 to verify that the to-be-processed FASTQ file is restored correctly.
The sequence identifiers in the file "Row1_ID.gz" are read into a preset data set (a hash table), with each row of "Row1_ID.gz" stored in the hash table as a primary key. The sequence units in "W.fq.gz" whose identifiers match a primary key of the hash table are then output, yielding the restored clean data of the to-be-processed FASTQ file, namely "W.clean.fq.gz".
In the embodiment of the application, the restoration of the clean data of the to-be-processed FASTQ file is implemented in Perl, and the specific implementation code is as follows:
Further, the file "W.clean.fq.gz" is decompressed to obtain the file "W.clean.fq", and the command "md5sum W.clean.fq" gives the md5 value "747d1e0ee01dd591c0f4b08f06bf52ce", which is the same as the md5 value in Example 1. This shows that the method described in the embodiments of the present application is effective and feasible.
In summary, the original FASTQ file, which stores sequence units of original sequencing data, is filtered to obtain the to-be-processed FASTQ file; the row sequence identifier of each sequence unit in the to-be-processed FASTQ file is obtained and compressed; the to-be-processed FASTQ file is deleted; and the sequence units corresponding to the compressed sequence identifiers are looked up in the original FASTQ file to restore the to-be-processed FASTQ file.
Corresponding to the method for processing FASTQ data, the embodiment of the application also provides a processing device for FASTQ data. Since the device embodiments in the embodiments of the present application correspond to the method embodiments described above, details not disclosed in the device embodiments may refer to the method embodiments described above, and details in the embodiments of the present application are not described again.
Fig. 10 is a schematic structural diagram of a FASTQ data processing apparatus according to an embodiment of the present application, as shown in fig. 10, including:
a processing unit 71, configured to perform filtering processing on an original FASTQ file, to obtain a FASTQ file to be processed, where the original FASTQ file is used to store a sequence unit of original sequencing data;
a compression unit 72, configured to obtain a line sequence identifier of each sequence unit in the FASTQ file to be processed, and compress the line sequence identifier;
a deletion unit 73, configured to delete the FASTQ file to be processed;
and a restoring unit 74, configured to find a sequence unit corresponding to the compressed sequence identifier from the original FASTQ file, so as to implement restoration of the FASTQ file to be processed.
According to the FASTQ data processing device, the original FASTQ file is filtered to obtain the FASTQ file to be processed, the original FASTQ file is used for storing sequence units of original sequencing data, row sequence identifiers of each sequence unit in the FASTQ file to be processed are obtained, the row sequence identifiers are compressed, and the FASTQ file to be processed is deleted.
Further, in one possible implementation manner of the present embodiment, as shown in fig. 11, the compression unit 72 includes:
an obtaining module 721, configured to obtain a sequence identifier of a first row of each sequence unit in the FASTQ file to be processed;
a compression module 722, configured to compress the sequence identifier of the first line.
Further, in one possible implementation manner of this embodiment, as shown in fig. 11, the restoring unit 74 includes:
a first storage module 741, configured to sequentially read the compressed sequence identifiers, and store the sequence identifiers in a preset data set;
and a second output module 742, configured to output, to the to-be-processed FASTQ file, the compressed sequence identifier and the corresponding sequence unit included in the preset data set in the original FASTQ file, so that the to-be-processed FASTQ file is restored.
Further, in one possible implementation manner of this embodiment, as shown in fig. 11, the restoring unit 74 includes:
the second storage module 743 is configured to sequentially read the sequence identifiers of the compressed first row, and store the sequence identifiers of the first row as a primary key in a preset data set;
and a second output module 744, configured to output, to the to-be-processed FASTQ file, corresponding sequence units in the original FASTQ file, where the sequence units include sequence identifiers of a first row in the preset data set, so that the to-be-processed FASTQ file is restored, and the restored to-be-processed FASTQ file stores sequence units including four rows of sequences.
Further, in a possible implementation manner of this embodiment, as shown in fig. 11, the apparatus further includes:
the first encryption unit 75 is configured to perform hash encryption calculation on the to-be-processed FASTQ file after performing filtering processing on the original FASTQ file to obtain the to-be-processed FASTQ file, so as to obtain an encrypted hash value.
Further, in a possible implementation manner of this embodiment, as shown in fig. 11, the apparatus further includes:
a second encryption unit 76, configured to perform encryption calculation on the restored FASTQ file to be processed by using the preset encryption algorithm, so as to obtain a second encryption key;
a determining unit 77, configured to determine whether there is a loss in the restoration process of the FASTQ data to be processed according to whether the first encryption key is consistent with the second encryption key.
According to embodiments of the present application, there is also provided an electronic device, a readable storage medium and a computer program product.
Fig. 12 shows a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 12, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a ROM (Read-Only Memory) 802 or a computer program loaded from a storage unit 808 into a RAM (Random Access Memory) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An I/O (Input/Output) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 performs the methods and processes described above, for example, the method of processing FASTQ data. For example, in some embodiments, the method of processing FASTQ data may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the methods described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the aforementioned method of processing FASTQ data by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, an FPGA (Field Programmable Gate Array), an ASIC (Application-Specific Integrated Circuit), an ASSP (Application-Specific Standard Product), an SOC (System on Chip), a CPLD (Complex Programmable Logic Device), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present application may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this application, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, RAM, ROM, EPROM (Erasable Programmable Read-Only Memory) or flash memory, an optical fiber, a CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (Cathode-Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be noted that artificial intelligence is the discipline of studying how to make a computer simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves technologies at both the hardware and software levels. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies, and the like.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application are achieved, and are not limited herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (17)

  1. A method of processing FASTQ data, comprising:
    filtering an original FASTQ file to obtain a FASTQ file to be processed, wherein the original FASTQ file is used for storing sequence units of original sequencing data;
    and acquiring a row sequence identifier of each sequence unit in the FASTQ file to be processed, compressing the row sequence identifier, and deleting the FASTQ file to be processed.
  2. The method according to claim 1, wherein the method further comprises:
    and searching a sequence unit corresponding to the compressed sequence identifier from the original FASTQ file so as to realize the reduction of the FASTQ file to be processed.
  3. The method according to claim 1 or 2, wherein the obtaining a row sequence identity of each sequence unit in the FASTQ file to be processed, and compressing the row sequence identity, comprises:
    and acquiring the sequence identification of the first row of each sequence unit in the FASTQ file to be processed, and compressing the sequence identification of the first row.
  4. The method according to claim 2, wherein the searching for the sequence unit corresponding to the compressed sequence identifier from the original FASTQ file includes:
    sequentially reading the compressed sequence identifications and storing the compressed sequence identifications in a preset data set;
    and respectively outputting the compressed sequence identifiers and corresponding sequence units contained in the preset data set in the original FASTQ file to the FASTQ file to be processed so as to restore the FASTQ file to be processed.
  5. A method according to claim 3, wherein said searching for a sequence unit corresponding to the compressed sequence identity from the original FASTQ file comprises:
    sequentially reading the sequence identifications of the compressed first row, and storing the sequence identifications of the first row as a primary key in a preset data set;
    and respectively outputting corresponding sequence units containing sequence identifications of a first row in a preset data set in the original FASTQ file to the FASTQ file to be processed so as to restore the FASTQ file to be processed, wherein the restored FASTQ file to be processed stores sequence units containing four rows of sequences.
  6. The method according to any one of claims 1-5, wherein after filtering the original FASTQ file to obtain a FASTQ file to be processed, the method further comprises:
    calculating the FASTQ file to be processed according to a preset encryption algorithm to obtain a first encryption key.
  7. The method according to claim 6, wherein after searching for the sequence unit corresponding to the compressed sequence identifier from the original FASTQ file to implement the restoration of the to-be-processed FASTQ file, the method further comprises:
    performing encryption calculation on the restored FASTQ file to be processed by adopting the preset encryption algorithm to obtain a second encryption key;
    and determining whether the restoration processing of the FASTQ data to be processed has loss or not according to whether the first encryption key is consistent with the second encryption key.
  8. A FASTQ data processing apparatus, comprising:
    the processing unit is used for filtering the original FASTQ file to obtain a FASTQ file to be processed, wherein the original FASTQ file is used for storing a sequence unit of original sequencing data;
    the compressing unit is used for acquiring the row sequence identifier of each sequence unit in the FASTQ file to be processed and compressing the row sequence identifier;
    and the deleting unit is used for deleting the FASTQ file to be processed.
  9. The apparatus of claim 8, wherein the apparatus further comprises:
    the restoring unit is used for searching the sequence unit corresponding to the compressed sequence identifier from the original FASTQ file so as to restore the FASTQ file to be processed.
  10. The apparatus according to claim 8 or 9, wherein the compression unit comprises:
    the acquisition module is used for acquiring the sequence identification of the first row of each sequence unit in the FASTQ file to be processed;
    and the compression module is used for compressing the sequence identification of the first row.
  11. The apparatus of claim 9, wherein the restoring unit comprises:
    the first storage module is used for sequentially reading the compressed sequence identifications and storing the compressed sequence identifications in a preset data set;
    and the second output module is used for respectively outputting the compressed sequence identifier and the corresponding sequence unit contained in the preset data set in the original FASTQ file to the FASTQ file to be processed so as to restore the FASTQ file to be processed.
  12. The apparatus of claim 10, wherein the restoring unit comprises:
    the second storage module is used for sequentially reading the sequence identifications of the compressed first row and storing the sequence identifications of the first row as a primary key in a preset data set;
    and the second output module is used for respectively outputting corresponding sequence units containing the sequence identifiers of the first row in the preset data set in the original FASTQ file to the FASTQ file to be processed so as to restore the FASTQ file to be processed, wherein the restored FASTQ file to be processed stores sequence units containing four rows of sequences.
  13. The apparatus according to any one of claims 8-12, wherein the apparatus further comprises:
    the first encryption unit is used for carrying out hash encryption calculation on the FASTQ file to be processed after filtering the original FASTQ file to obtain the FASTQ file to be processed, so as to obtain an encrypted hash value.
  14. The apparatus of claim 13, wherein the apparatus further comprises:
    the second encryption unit is used for carrying out encryption calculation on the restored FASTQ file to be processed by adopting the preset encryption algorithm to obtain a second encryption key;
    and the determining unit is used for determining whether the restoration processing of the FASTQ data to be processed has loss according to the consistency of the first encryption key and the second encryption key.
  15. An electronic device, comprising:
    at least one processor; and
    a memory communicatively coupled to the at least one processor; wherein,
    the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
  16. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-7.
  17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-7.
CN202280054964.7A 2022-07-25 2022-07-25 Method and device for processing FASTQ data, electronic equipment and storage medium Pending CN117795605A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/107705 WO2024020746A1 (en) 2022-07-25 2022-07-25 Method and apparatus for processing fastq data, and electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN117795605A true CN117795605A (en) 2024-03-29

Family

ID=89704901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280054964.7A Pending CN117795605A (en) 2022-07-25 2022-07-25 Method and device for processing FASTQ data, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN117795605A (en)
WO (1) WO2024020746A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015120170A1 (en) * 2014-02-05 2015-08-13 Bigdatabio, Llc Methods and systems for biological sequence compression transfer and encryption
CN106100641A (en) * 2016-06-12 2016-11-09 深圳大学 Multithreading quick storage lossless compression method and system thereof for FASTQ data
CN108287983A (en) * 2017-01-09 2018-07-17 朱瑞星 A kind of method and apparatus for carrying out compression and decompression to genome
CN109360605B (en) * 2018-09-25 2020-10-20 安吉康尔(深圳)科技有限公司 Genome sequencing data archiving method, server and computer readable storage medium
CN111625509A (en) * 2020-05-26 2020-09-04 福州数据技术研究院有限公司 Lossless compression method for deep sequencing gene sequence data file

Also Published As

Publication number Publication date
WO2024020746A1 (en) 2024-02-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination