WO2023226036A1 - Fastq data processing method and apparatus, electronic device, and storage medium

Info

Publication number
WO2023226036A1
WO2023226036A1 PCT/CN2022/095757 CN2022095757W WO2023226036A1 WO 2023226036 A1 WO2023226036 A1 WO 2023226036A1 CN 2022095757 W CN2022095757 W CN 2022095757W WO 2023226036 A1 WO2023226036 A1 WO 2023226036A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
row
file
preset
unit
Application number
PCT/CN2022/095757
Other languages
French (fr)
Chinese (zh)
Inventor
邓天全
姜三杰
陈世璇
贺丽娟
杨鑫
黎剑波
Original Assignee
深圳华大基因科技服务有限公司
武汉华大基因技术服务有限公司
Application filed by 深圳华大基因科技服务有限公司, 武汉华大基因技术服务有限公司 filed Critical 深圳华大基因科技服务有限公司
Priority to PCT/CN2022/095757 priority Critical patent/WO2023226036A1/en
Priority to CN202280054965.1A priority patent/CN117795855A/en
Publication of WO2023226036A1 publication Critical patent/WO2023226036A1/en

Classifications

    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/46: Conversion to or from run-length codes, i.e. by representing the number of consecutive digits, or groups of digits, of the same kind by a code word and a digit indicative of that kind

Definitions

  • the present disclosure relates to the field of data processing technology, and in particular, to a FASTQ data processing method and device, electronic equipment and storage media.
  • FASTQ files produced by sequencing are often compressed directly with the gzip command.
  • the compression principle of gzip is: when two pieces of FASTQ data have the same content, the content of the later piece can be determined from the position and size of the earlier piece, so the later piece can be replaced with a pair of information (the distance between the two, the length of the same content). Since this pair of information is smaller than the content it replaces, the FASTQ file is compressed.
  • the present disclosure provides a FASTQ data processing method, device, electronic equipment and storage medium. Its main purpose is to use the similarity of each line in the sequence unit in the FASTQ file to classify and store the files by lines, and perform preset lossless compression on the classified files to further save the storage space of the FASTQ file.
  • a method for processing FASTQ data including:
  • Each sequence unit includes four row sequences, and different row sequences correspond to different row sequence identifiers;
  • storing the four line sequences in corresponding four default files according to the line sequence identifiers includes:
  • the first line and add a preset delimiter, wherein the at least one sequence unit includes a first sequence unit and a second sequence unit, the first sequence unit is the first sequence unit of the FASTQ file to be processed, the second sequence unit is the Nth sequence unit of the FASTQ file to be processed, and N is an integer greater than 1;
  • triggering the preset lossless compression command to compress the four preset files respectively includes:
  • the preset lossless compression command includes a gzip compression command or a pigz compression command.
  • the first line in a default file contains:
  • the step of reading the first row sequence in the Nth sequence unit is to write the row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the first default file.
  • Line N includes:
  • the second row sequence in the first sequence unit is read, and the row sequence identifier and related description information of the second row sequence in the first sequence unit are written into the second default file.
  • the first line includes:
  • Line N includes:
  • the step of reading the third row sequence in the first sequence unit is to write the row sequence identifier and related description information of the third row sequence in the first sequence unit into the third preset file.
  • the first line includes:
  • the step of reading the third row sequence in the Nth sequence unit is to write the row sequence identifier and related description information of the third row sequence in the second sequence unit into the third preset file.
  • Line N includes:
  • the step of reading the fourth row sequence in the first sequence unit is to write the row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the fourth preset file.
  • the first line includes:
  • the step of reading the fourth row sequence in the Nth sequence unit is to write the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the fourth default file.
  • Line N includes:
  • triggering the preset lossless compression command to compress the four preset files respectively includes:
  • the compressed line sequences are not compressed again, and the preset lossless compression is performed on the uncompressed line sequences.
  • the method further includes:
  • starting from the first line in the compressed first preset file, second preset file, third preset file and fourth preset file, sequentially read the line sequence identifier and related description information, wherein the order of reading the first preset file, the second preset file, the third preset file and the fourth preset file is consistent with the order of the four lines of the sequence unit;
  • a target sequence unit is formed
  • the target FASTQ file is the restored FASTQ file to be processed.
  • the method also includes:
  • a FASTQ data processing device including:
  • the splitting unit is used to split the FASTQ file to be processed into at least one sequence unit according to the preset data format.
  • Each sequence unit includes four row sequences, and different row sequences correspond to different row sequence identifiers;
  • a storage unit configured to store four row sequences in corresponding four preset files according to the row sequence identifiers, wherein one default file is used to store row sequences of different sequence units with the same row sequence identifier;
  • a compression unit is used to trigger a preset lossless compression command to compress the four preset files respectively.
  • the storage unit includes:
  • a first reading module configured to sequentially read line sequences with the same line sequence identifier in different sequence units from the first sequence unit to the last sequence unit of the FASTQ file to be processed;
  • the first writing module is configured to write the read line sequence identified by the same line sequence into the same preset file according to the reading order.
  • the storage unit includes:
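  • the first processing module is used to read the first row sequence in the first sequence unit of the FASTQ file to be processed, write the row sequence identifier and related description information of the first row sequence in the first sequence unit into the first line of the first preset file, and add a preset delimiter, wherein the at least one sequence unit includes a first sequence unit and a second sequence unit, the first sequence unit is the first sequence unit of the FASTQ file to be processed, the second sequence unit is the Nth sequence unit of the FASTQ file to be processed, and N is an integer greater than 1;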
  • the second processing module is used to read the first row sequence in the Nth sequence unit, and write the row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the first default file.
  • the third processing module is used to read the second row sequence in the first sequence unit, write the row sequence identifier and related description information of the second row sequence in the first sequence unit into the first line of the second preset file, and add a preset delimiter;
  • the fourth processing module is used to read the second row sequence in the Nth sequence unit, and write the row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the second preset file.
  • the fifth processing module is used to read the third row sequence in the first sequence unit, write the row sequence identifier and related description information of the third row sequence in the first sequence unit into the first line of the third preset file, and add a preset delimiter;
  • the sixth processing module is used to read the third row sequence in the Nth sequence unit, and write the row sequence identifier and related description information of the third row sequence in the Nth sequence unit into the third preset file.
  • the seventh processing module is used to read the fourth row sequence in the first sequence unit, write the row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first line of the fourth preset file, and add a preset delimiter;
  • the eighth processing module is used to read the fourth row sequence in the Nth sequence unit, and write the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the fourth preset file.
  • the compression unit is also used for:
  • the preset lossless compression command includes a gzip compression command or a pigz compression command.
  • the first processing module is also used for:
  • the second processing module is also used for:
  • the third processing module is also used for:
  • the fourth processing module is also used for:
  • the fifth processing module is also used for:
  • the sixth processing module is also used for:
  • the seventh processing module is also used for:
  • the eighth processing module is also used for:
  • the compression unit includes:
  • a judgment module used to judge whether the four preset files contain compressed line sequences
  • a compression module configured to, when it is determined that the four preset files contain compressed line sequences, no longer perform compression processing on the compressed line sequences, and perform preset lossless compression on the uncompressed line sequences.
  • the device also includes:
  • the reading unit is configured to, starting from the first line in the compressed first preset file, second preset file, third preset file and fourth preset file, sequentially read the line sequence identifier and related description information, wherein the order of reading the first preset file, the second preset file, the third preset file and the fourth preset file is consistent with the order of the four lines of the sequence unit;
  • a writing unit used to write the target sequence unit into the same target FASTQ file
  • the target FASTQ file is the restored FASTQ file to be processed.
  • the device also includes:
  • a control unit configured to, after the compression unit triggers a preset lossless compression command to compress the four preset files respectively, control the compressed first preset file, second preset file, third preset file and fourth preset file to be read synchronously, reading the line sequence identifier and related description information from the first line to the last line;
  • the reading and writing unit is used to, each time one line is read, sequentially write the reading results of the first preset file, the second preset file, the third preset file and the fourth preset file into the same target FASTQ file.
  • the device also includes:
  • the first encryption unit is used to calculate the FASTQ file to be processed according to the preset encryption algorithm to obtain the first encryption key
  • a decryption unit used to decompress the target FASTQ file using the decompression method corresponding to the preset lossless compression, and obtain the decompressed target FASTQ file;
  • a second encryption unit configured to perform encryption calculation on the decompressed target FASTQ file using the preset encryption algorithm to obtain a second encryption key
  • the determination unit determines whether there is a loss in the storage and restoration of the FASTQ data to be processed based on whether the first encryption key and the second encryption key are consistent.
  • an electronic device including:
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can perform the method described in the first aspect.
  • a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the method described in the first aspect.
  • a computer program product including a computer program that, when executed by a processor, implements the method described in the foregoing first aspect.
  • the FASTQ data processing method, device, electronic equipment and storage medium split the FASTQ file to be processed into at least one sequence unit according to the preset data format, where each sequence unit includes four row sequences and different row sequences correspond to different row sequence identifiers; store the four row sequences in four corresponding preset files according to the row sequence identifiers, where one preset file stores the row sequences of different sequence units that have the same row sequence identifier; and trigger the preset lossless compression command to compress the four preset files respectively.
  • the embodiment of the present disclosure uses the similarity of each line across the sequence units in the FASTQ file to classify and store the data by line, that is, the four line sequences are stored in the four corresponding preset files according to the line sequence identifier, and the preset lossless compression is performed on the four preset files respectively, further saving the storage space of the FASTQ file.
  • Figure 1 is a schematic flowchart of a FASTQ data processing method provided by an embodiment of the present disclosure
  • Figure 2 is a schematic diagram of the format of a FASTQ file provided by an embodiment of the present disclosure
  • Figure 3 is a schematic diagram of the line sequences stored in the first preset file, provided by an embodiment of the present disclosure
  • Figure 4 is a schematic diagram of the line sequences stored in the second preset file, provided by an embodiment of the present disclosure
  • Figure 5 is a schematic diagram of the line sequences stored in the third preset file, provided by an embodiment of the present disclosure
  • Figure 6 is a schematic diagram of a fourth preset file storage line sequence provided by an embodiment of the present disclosure.
  • FIG. 7 is a flow chart of a method for storing four line sequences in corresponding four default files according to an embodiment of the present disclosure
  • Figure 8 is a schematic flow chart of performing restoration processing on compressed FASTQ data provided by an embodiment of the present disclosure
  • Figure 9 is a schematic flowchart of another method of performing restoration processing on compressed FASTQ data provided by an embodiment of the present disclosure.
  • Figure 10 is a schematic structural diagram of a FASTQ data processing device provided by an embodiment of the present disclosure.
  • FIG. 11 is a schematic structural diagram of another FASTQ data processing device provided by an embodiment of the present disclosure.
  • FIG. 12 is a schematic block diagram of an example electronic device 1200 provided by an embodiment of the present disclosure.
  • Figure 1 is a schematic flowchart of a FASTQ data processing method provided by an embodiment of the present disclosure.
  • the method consists of the following steps:
  • Step 101 Split the FASTQ file to be processed into at least one sequence unit according to a preset data format.
  • Each sequence unit includes four row sequences, and different row sequences correspond to different row sequence identifiers.
  • FIG. 2 is a schematic diagram of the format of a FASTQ file provided by the embodiment of the present disclosure.
  • the FASTQ file is a standard file for storing original sequencing data.
  • Every four lines constitute an independent sequence unit (or sequence storage unit). As shown in Figure 2, every four lines represent one sequence unit, for a total of 4 sequence units.
  • the preset data format of each sequence unit is as follows:
  • the first line holds the sequence identifier and related description information, and is the unique identifier of each sequence unit;
  • the second line is the sequence, which consists of A, C, G, T and N, where A, C, G and T are base information, and N is a placeholder used when sequencing of a base fails;
  • the third line starts with ‘+’, followed by the row sequence identifier, relevant description information, or nothing.
  • the data shown in the example of the embodiment of the present invention only has ‘+’ in this line;
  • the fourth line is the quality information of the sequence, which corresponds to the bases in the second line of the sequence.
  • Each base corresponds to a quality value.
  • the quality value is expressed in ASCII code to measure the reliability of the sequenced base. The higher the quality value, the more reliable it is.
  • Figure 2 is only an illustrative example, given to make the format easier to understand. The embodiment of the present disclosure does not limit the number of sequence units in the FASTQ file to be processed or the content of each line in each sequence unit.
  • the FASTQ file to be processed is split into at least one sequence unit based on the preset data format of the sequence unit.
  • four sequence units are used as an example for explanation, but this is not intended to limit the number of sequence units.
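  • As a minimal sketch of Step 101, the four-line grouping described above can be written as follows. The sketch is in Python rather than the Perl mentioned later in this document, and the function name `split_into_units` and the sample reads are hypothetical, chosen only for illustration. It assumes every sequence unit occupies exactly four lines.

```python
from typing import Iterable, Iterator, List, Tuple

# (identifier line, base sequence, '+' line, quality line)
SequenceUnit = Tuple[str, str, str, str]


def split_into_units(lines: Iterable[str]) -> Iterator[SequenceUnit]:
    """Group FASTQ lines into four-line sequence units (hypothetical helper)."""
    buffer: List[str] = []
    for raw in lines:
        buffer.append(raw.rstrip("\n"))
        if len(buffer) == 4:
            # The third line of every unit starts with '+', as described above.
            if not buffer[2].startswith("+"):
                raise ValueError("third line of a sequence unit must start with '+'")
            yield (buffer[0], buffer[1], buffer[2], buffer[3])
            buffer = []
    if buffer:
        raise ValueError("FASTQ line count is not a multiple of four")


# Made-up sample data: two sequence units.
example = [
    "@read1 desc\n", "ACGTN\n", "+\n", "IIII#\n",
    "@read2 desc\n", "TTGCA\n", "+\n", "IIIII\n",
]
units = list(split_into_units(example))
```

  • In this sketch each unit is a plain tuple, so the position in the tuple plays the role of the row sequence identifier.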
  • Step 102 Store four row sequences in corresponding four preset files according to the row sequence identifiers, wherein one default file is used to store row sequences of different sequence units with the same row sequence identifier.
  • After the four lines of each sequence unit of the FASTQ file to be processed are classified, they are written to the four corresponding preset files. That is, the sequence identifier and related description information of the first line are written to the first preset file (File 1), those of the second line are written to the second preset file (File 2), those of the third line are written to the third preset file (File 3), and those of the fourth line (quality information) are written to the fourth preset file (File 4).
  • Figures 3 to 6 are, respectively, schematic diagrams of the line sequences stored in the first preset file, the second preset file, the third preset file and the fourth preset file provided by the embodiment of the present disclosure. As can be seen from Figures 3 to 6:
  • the data stored in the first preset file is the content of the first line of the different sequence units;
  • the data stored in the second preset file is the content of the second line of the different sequence units;
  • the data stored in the third preset file is the content of the third line of the different sequence units;
  • the data stored in the fourth preset file is the content of the fourth line of the different sequence units.
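  • The classified storage of Step 102 can be sketched as below, again in Python rather than the Perl used by the original implementation. The function name `classify_lines` and the uncompressed file names `Row1` to `Row4` are assumptions made for this illustration (the names echo the `Row1.gz` to `Row4.gz` files mentioned in Example 2). The sketch assumes the position of a line within its four-line unit serves as the row sequence identifier.

```python
import os
import tempfile
from typing import List


def classify_lines(fastq_path: str, out_dir: str) -> List[str]:
    """Write line k (k = 1..4) of every sequence unit to its own file."""
    out_paths = [os.path.join(out_dir, f"Row{i}") for i in range(1, 5)]
    outputs = [open(path, "wb") for path in out_paths]
    try:
        with open(fastq_path, "rb") as source:
            for index, line in enumerate(source):
                # index % 4 is the position of the line inside its unit.
                outputs[index % 4].write(line)
    finally:
        for handle in outputs:
            handle.close()
    return out_paths


# Made-up sample data: two sequence units.
with tempfile.TemporaryDirectory() as work_dir:
    sample = os.path.join(work_dir, "sample.fq")
    with open(sample, "wb") as fh:
        fh.write(b"@r1\nACGTN\n+\nIIII#\n@r2\nTTGCA\n+\nIIIII\n")
    row_contents = []
    for path in classify_lines(sample, work_dir):
        with open(path, "rb") as fh:
            row_contents.append(fh.read())
```

  • After the run, `row_contents[1]` holds only base sequences and `row_contents[2]` holds only `+` lines, which is the layout Figures 3 to 6 describe.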
  • Step 103 Trigger a preset lossless compression command to compress the four preset files respectively.
  • the similarity between each line sequence written in four preset files is utilized, combined with the compression principle of gzip, to further reduce the space for storing FASTQ data.
  • the preset lossless compression command includes a gzip compression command or a pigz compression command.
  • Subsequent embodiments take the gzip compression command as an example for explanation. However, this is not intended to limit the compression method to the gzip compression command.
  • the compression principle of gzip is: when FASTQ data contains two pieces of identical content, the content of the later piece can be determined from the position and size of the earlier piece, so the later piece can be replaced with a pair of information (the distance between the two, the length of the same content). Since this pair of information is smaller than the content it replaces, the FASTQ file is compressed.
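  • The (distance, length) replacement described above can be illustrated with a toy sketch. This is not gzip's actual implementation (gzip uses DEFLATE, which also applies Huffman coding and searches a bounded window); the function names `longest_back_reference` and `expand` are hypothetical, and the search is a deliberately simple brute-force one.

```python
from typing import Tuple


def longest_back_reference(data: bytes, pos: int) -> Tuple[int, int]:
    """Return (distance, length) of the longest earlier match for data[pos:]."""
    best_distance, best_length = 0, 0
    for start in range(pos):
        length = 0
        while pos + length < len(data) and data[start + length] == data[pos + length]:
            length += 1
        if length > best_length:
            best_distance, best_length = pos - start, length
    return best_distance, best_length


def expand(prefix: bytes, distance: int, length: int) -> bytes:
    """Rebuild the replaced content by copying from `distance` bytes back."""
    out = bytearray(prefix)
    for _ in range(length):
        out.append(out[-distance])
    return bytes(out)


data = b"ACGTACGT"
pair = longest_back_reference(data, 4)   # the second "ACGT" repeats the first
restored = expand(data[:4], *pair)
```

  • Here the second `ACGT` is described by one pair instead of four literal bytes, and `expand` shows that the pair is enough to recover it.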
  • the third preset file only contains the '+' at the beginning of the third line of the sequence unit.
  • the gzip command is triggered to perform compression processing.
  • The corresponding line sequences of different sequence units are highly similar, so the line sequences stored in each of the four preset files are highly similar. Therefore, when the gzip command is triggered to perform compression processing, classification shortens the distance between identical content, which saves space.
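  • A minimal sketch of Step 103 follows. The method above triggers a gzip or pigz command; this sketch instead uses Python's standard `gzip` module as a stand-in, which is an assumption of the illustration, as are the function name `gzip_files` and the sample contents.

```python
import gzip
import os
import shutil
import tempfile
from typing import List


def gzip_files(paths: List[str]) -> List[str]:
    """Compress each preset file separately, producing <name>.gz next to it."""
    compressed = []
    for path in paths:
        target = path + ".gz"
        with open(path, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        compressed.append(target)
    return compressed


# Made-up contents for the four preset files.
rows = {
    "Row1": b"@r1\n@r2\n",
    "Row2": b"ACGTN\nTTGCA\n",
    "Row3": b"+\n+\n",
    "Row4": b"IIII#\nIIIII\n",
}
with tempfile.TemporaryDirectory() as work_dir:
    paths = []
    for name, content in rows.items():
        path = os.path.join(work_dir, name)
        with open(path, "wb") as fh:
            fh.write(content)
        paths.append(path)
    gz_paths = gzip_files(paths)
    gz_names = [os.path.basename(p) for p in gz_paths]
    round_trip = []
    for gz_path in gz_paths:
        with gzip.open(gz_path, "rb") as fh:
            round_trip.append(fh.read())
```

  • Compressing the four files independently keeps each compressor's history filled with one kind of line only, which is the effect the paragraph above relies on. Actual savings depend on the data and are not shown here.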
  • the FASTQ data processing method splits the FASTQ file to be processed into at least one sequence unit according to the preset data format, where each sequence unit includes four row sequences and different row sequences correspond to different row sequence identifiers; stores the four line sequences in four corresponding preset files according to the line sequence identifiers, where one preset file stores the line sequences of different sequence units that have the same line sequence identifier; and triggers the preset lossless compression command to compress the four preset files respectively.
  • the embodiment of the present disclosure uses the similarity of each line across the sequence units in the FASTQ file to classify and store the data by line, that is, the four line sequences are stored in the four corresponding preset files according to the line sequence identifier, and the preset lossless compression is performed on the four preset files respectively, further saving the storage space of FASTQ files.
  • step 102 is performed to store the four line sequences in corresponding four default files according to the line sequence identifiers. This can be achieved in the following two ways:
  • Method 1 From the first sequence unit to the last sequence unit of the FASTQ file to be processed, read the row sequences with the same row sequence identifier in the different sequence units in sequence, and write the row sequences with the same row sequence identifier into the same preset file in the reading order.
  • For example, suppose the FASTQ file to be processed contains N sequence units (N is greater than 1). Reading starts from sequence unit 1: read the first line in sequence unit 1 and write it to the first line of the first preset file; read the second line in sequence unit 1 and write it to the first line of the second preset file; read the third line in sequence unit 1 and write it to the first line of the third preset file; read the fourth line in sequence unit 1 and write it to the first line of the fourth preset file. After sequence unit 1 has been read, the following sequence units, up to sequence unit N, are read in the same way until all sequence units have been read, which completes the classified storage of the FASTQ file to be processed.
  • Method 2 Read the first row sequence in the first sequence unit of the FASTQ file to be processed, write the row sequence identifier and related description information of the first row sequence in the first sequence unit into the first line of the first preset file, and add a preset delimiter, wherein the at least one sequence unit includes a first sequence unit and a second sequence unit, the first sequence unit is the first sequence unit of the FASTQ file to be processed, the second sequence unit is the Nth sequence unit of the FASTQ file to be processed, and N is an integer greater than 1;
  • Embodiments of the present disclosure provide a method for compressing four preset files respectively.
  • FIG. 7 is a flow chart of a method for storing four line sequences in four corresponding preset files according to an embodiment of the present disclosure. This method can be executed alone, in combination with any embodiment or possible implementation in the embodiments, or in combination with any technical solution in related technologies. As a possible way to perform step 102, this method includes:
  • Step 701 Read the row sequence identifier and related description information of the first row sequence in the first sequence unit;
  • Step 702 Read the row sequence identifier and related description information of the first row sequence in the Nth sequence unit;
  • Step 703 Read the row sequence identifier and related description information of the second row sequence in the first sequence unit;
  • Step 704 Read the row sequence identifier and related description information of the second row sequence in the Nth sequence unit.
  • Step 705 Read the row sequence identifier and related description information of the third row sequence in the first sequence unit;
  • Step 706 Read the row sequence identifier and related description information of the third row sequence in the Nth sequence unit;
  • Step 707 Read the row sequence identifier and related description information of the fourth row sequence in the first sequence unit;
  • Step 708 Read the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit;
  • compressing the four preset files respectively includes: determining, for each of the four preset files, whether it contains compressed line sequences. If so, the compressed line sequences are not compressed again, and the preset lossless compression is performed on the uncompressed line sequences. This processing method ensures lossless restoration.
  • the method shown in Figure 7 is to perform a compression process after reading a line sequence in each sequence unit, and write the compressed line sequence into the corresponding preset file.
  • the following embodiments also provide another possible way to perform step 103, that is, the four line sequences are first stored in the four corresponding preset files according to the line sequence identifiers through any of the above embodiments, and the different preset files are then compressed separately, specifically:
  • FIG. 8 is a schematic flowchart of a method of performing restoration processing on compressed FASTQ data provided by the embodiment of the present disclosure. This method can be executed alone, in combination with any embodiment in the present disclosure or the possible implementations in the embodiments, or in combination with any technical solution in related technologies, including:
  • Step 801 Starting from the first line in the compressed first preset file, second preset file, third preset file and fourth preset file, sequentially read the line sequence identifier and related description information, wherein the order of reading the first preset file, the second preset file, the third preset file and the fourth preset file is consistent with the order of the four lines of the sequence unit.
  • N is greater than 1.
  • the target FASTQ file is the restored FASTQ file to be processed.
  • Step 802 Compose a target sequence unit based on the read row sequence identifier and related description information of each row.
  • Step 803 Write the target sequence unit into the same target FASTQ file.
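  • Steps 801 to 803 can be sketched as follows, in Python rather than the Perl of the original implementation. The function names `read_units` and `restore_fastq`, the target name `restored.fq` and the sample contents are assumptions for illustration. The sketch assumes each compressed preset file holds exactly one line per sequence unit.

```python
import gzip
import os
import tempfile
from typing import Iterator, List, Tuple


def read_units(row_paths: List[str]) -> Iterator[Tuple[bytes, ...]]:
    """Read one line from each of the four compressed files, in unit order."""
    handles = [gzip.open(path, "rb") for path in row_paths]
    try:
        while True:
            unit = tuple(handle.readline() for handle in handles)
            if not any(unit):          # every file is exhausted
                return
            if not all(unit):          # files disagree on the number of lines
                raise ValueError("preset files have different line counts")
            yield unit
    finally:
        for handle in handles:
            handle.close()


def restore_fastq(row_paths: List[str], target_path: str) -> int:
    """Compose target sequence units and write them to one target FASTQ file."""
    count = 0
    with open(target_path, "wb") as target:
        for unit in read_units(row_paths):
            target.writelines(unit)
            count += 1
    return count


# Made-up contents for the four compressed preset files.
rows = [b"@r1\n@r2\n", b"ACGTN\nTTGCA\n", b"+\n+\n", b"IIII#\nIIIII\n"]
with tempfile.TemporaryDirectory() as work_dir:
    row_paths = []
    for number, content in enumerate(rows, start=1):
        path = os.path.join(work_dir, f"Row{number}.gz")
        with gzip.open(path, "wb") as fh:
            fh.write(content)
        row_paths.append(path)
    target = os.path.join(work_dir, "restored.fq")
    unit_count = restore_fastq(row_paths, target)
    with open(target, "rb") as fh:
        restored = fh.read()
```

  • Passing `row_paths` in first-to-fourth order is what keeps the reading order consistent with the four lines of a sequence unit.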
  • Figure 9 is a schematic flowchart of another method of performing restoration processing on compressed FASTQ data provided by an embodiment of the present disclosure. This method can be executed alone, in combination with any embodiment in the present disclosure or the possible implementations in the embodiments, or in combination with any technical solution in related technologies, including:
  • Step 901 Control the compressed first preset file, second preset file, third preset file and fourth preset file to be read synchronously, reading the line sequence identifier and related description information from the first line to the last line. Controlling, by program, the synchronization of reading the first preset file, the second preset file, the third preset file and the fourth preset file improves the reading efficiency, and thus the efficiency of restoring the FASTQ file.
  • Step 902 Each time a line is read, the reading results of the first default file, the second default file, the third default file, and the fourth default file are sequentially written into the same target FASTQ file.
  • the embodiment of the present disclosure performs verification through the following method.
  • the method includes: calculating the FASTQ file to be processed according to a preset encryption algorithm to obtain a first encryption key; decompressing the target FASTQ file using the decompression method corresponding to the preset lossless compression to obtain the decompressed target FASTQ file; performing encryption calculation on the decompressed target FASTQ file using the preset encryption algorithm to obtain a second encryption key; and determining, based on whether the first encryption key and the second encryption key are consistent, whether there is any loss in the storage and restoration of the FASTQ data to be processed.
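  • The consistency check can be sketched as below. It assumes the "preset encryption algorithm" is MD5, which is the algorithm Example 1 in this document uses. The function names `md5_of_file` and `is_lossless` and the sample files are hypothetical.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_lossless(original_path: str, restored_path: str) -> bool:
    """True when the original and restored FASTQ files have the same digest."""
    return md5_of_file(original_path) == md5_of_file(restored_path)


# Made-up sample files: one faithful copy and one with a changed base.
with tempfile.TemporaryDirectory() as work_dir:
    original = os.path.join(work_dir, "original.fq")
    same = os.path.join(work_dir, "restored.fq")
    changed = os.path.join(work_dir, "damaged.fq")
    for path, content in [
        (original, b"@r1\nACGTN\n+\nIIII#\n"),
        (same, b"@r1\nACGTN\n+\nIIII#\n"),
        (changed, b"@r1\nACGTA\n+\nIIII#\n"),
    ]:
        with open(path, "wb") as fh:
            fh.write(content)
    same_result = is_lossless(original, same)
    changed_result = is_lossless(original, changed)
```

  • Reading in chunks keeps memory use bounded, which matters for FASTQ files of the size mentioned in Example 1.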
  • Example 1 provides an application example of using the above-mentioned FASTQ data processing method to save storage space on a FASTQ compressed file with a size of 21789495522bp and a file name of "V350027954_L01_read_1.fq.gz".
  • the MD5 (MD5 Message-Digest Algorithm) value of the decompressed file "V350027954_L01_read_1.fq" is "ca4168a17d0510a5d3f51fa6856d1888". After the classified compressed storage of the present invention is applied and the file is then restored through the restoration method, this MD5 value can be used to check whether the MD5 is consistent, thereby verifying the reliability of the above method.
  • the programming language is perl.
  • the specific implementation code is as follows:
  • Example 2 The following provides an application example of using the method of the embodiment of the present invention to restore the compressed files Row1.gz, Row2.gz, Row3.gz and Row4.gz obtained by classification and storage in the above Example 1 to the original file V350027954_L01_read_1.fq.gz.
  • the programming language is perl, and the specific implementation code is as follows:
  • the present invention also proposes a FASTQ data processing device. Since the device embodiment of the present invention corresponds to the above-mentioned method embodiment, details not disclosed in the device embodiment can be referred to the above-mentioned method embodiment, and will not be described again in the present invention.
  • An embodiment of the present disclosure provides a FASTQ data processing device, as shown in Figure 10, including:
  • the splitting unit 1001 is used to split the FASTQ file to be processed into at least one sequence unit according to the preset data format.
  • Each sequence unit includes four row sequences, and different row sequences correspond to different row sequence identifiers;
  • the storage unit 1002 is used to store the four row sequences in four corresponding preset files according to the row sequence identifiers, wherein each preset file is used to store the row sequences that have the same row sequence identifier in different sequence units;
  • the compression unit 1003 is configured to trigger a preset lossless compression command to compress the four preset files respectively.
  • the FASTQ data processing device splits the FASTQ file to be processed into at least one sequence unit according to the preset data format, where each sequence unit includes four row sequences and different row sequences correspond to different row sequence identifiers; stores the four row sequences in four corresponding preset files according to the row sequence identifiers, where each preset file is used to store the row sequences that have the same row sequence identifier in different sequence units; and triggers the preset lossless compression command to compress the four preset files respectively.
  • the embodiment of the present disclosure uses the similarity of each row across the sequence units of the FASTQ file to classify and store the data by row, that is, the four row sequences are stored in four corresponding preset files according to the row sequence identifiers, and preset lossless compression is performed on each of the four preset files, further saving the storage space of FASTQ files.
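The split, store and compress flow can be sketched as follows. This is a hedged Python illustration, not the patent's Perl code: the helper names `split_rows` and `compress_file` are hypothetical, the row files reuse the `Row1` to `Row4` naming from Example 2, and Python's `gzip` module stands in for the preset lossless compression command.

```python
import gzip
import os
import shutil
import tempfile


def split_rows(fastq_path, out_dir):
    """Write row k of every sequence unit to the plain file Row<k>."""
    paths = [os.path.join(out_dir, f"Row{k}") for k in range(1, 5)]
    outputs = [open(p, "w") for p in paths]
    try:
        with open(fastq_path) as src:
            for index, line in enumerate(src):
                # The position inside the four-line unit selects the file.
                outputs[index % 4].write(line)
    finally:
        for fh in outputs:
            fh.close()
    return paths


def compress_file(path):
    """Gzip one preset file and remove the uncompressed copy."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return gz_path


# Demonstration with two made-up sequence units (hypothetical data).
workdir = tempfile.mkdtemp()
fastq = os.path.join(workdir, "sample.fq")
with open(fastq, "w") as fh:
    fh.write("@r1\nACGT\n+\nFFFF\n@r2\nTTGA\n+\nFF:F\n")

compressed = [compress_file(p) for p in split_rows(fastq, workdir)]
with gzip.open(compressed[1], "rt") as fh:
    bases = fh.read()
```

Each output file then holds one kind of row (identifiers, bases, separators, or qualities), which is the row-wise similarity the disclosure relies on.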
  • the storage unit 1002 includes:
  • a first reading module configured to sequentially read line sequences with the same line sequence identifier in different sequence units from the first sequence unit to the last sequence unit of the FASTQ file to be processed;
  • the first writing module is configured to write the read line sequence identified by the same line sequence into the same preset file according to the reading order.
  • the storage unit includes:
  • the first processing module is used to read the first row sequence in the first sequence unit of the FASTQ file to be processed, write the row sequence identifier and related description information of the first row sequence in the first sequence unit into the first line of the first preset file, and add a preset delimiter, wherein the at least one sequence unit includes a first sequence unit and a second sequence unit, the first sequence unit is the first sequence unit of the FASTQ file to be processed, the second sequence unit is the Nth sequence unit of the FASTQ file to be processed, and N is an integer greater than 1;
  • the second processing module is used to read the first row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the Nth line of the first preset file, and add a preset delimiter;
  • the third processing module is used to read the second row sequence in the first sequence unit, write the row sequence identifier and related description information of the second row sequence in the first sequence unit into the first line of the second preset file, and add a preset delimiter;
  • the fourth processing module is used to read the second row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the Nth line of the second preset file, and add a preset delimiter;
  • the fifth processing module is used to read the third row sequence in the first sequence unit, write the row sequence identifier and related description information of the third row sequence in the first sequence unit into the first line of the third preset file, and add a preset delimiter;
  • the sixth processing module is used to read the third row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the third row sequence in the second sequence unit (that is, the Nth sequence unit) into the Nth line of the third preset file, and add a preset delimiter;
  • the seventh processing module is used to read the fourth row sequence in the first sequence unit, write the row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first line of the fourth preset file, and add a preset delimiter;
  • the eighth processing module is used to read the fourth row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the Nth line of the fourth preset file, and add a preset delimiter.
  • the compression unit 1003 is also used to:
  • the preset lossless compression command includes a gzip compression command or a pigz compression command.
  • the first processing module is also used to:
  • the second processing module is also used to:
  • the third processing module is also used to:
  • the fourth processing module is also configured to include:
  • the fifth processing module is also used to:
  • the sixth processing module is also used to:
  • the seventh processing module is also used to:
  • the eighth processing module is also used to:
  • the compression unit 1003 includes:
  • the determination module 10031 is used to determine whether each of the four preset files contains compressed row sequences;
  • the compression module 10032 is configured to, when it is determined that the four preset files contain compressed row sequences, skip compression of the already-compressed row sequences and perform the preset lossless compression only on the uncompressed row sequences.
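One way to picture the determination and compression modules is the sketch below. It makes two labeled assumptions that the source does not state: the check is done per file rather than per row sequence, and "already compressed" is detected through the two-byte gzip magic number. The helper names are hypothetical.

```python
import gzip
import os
import shutil
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip_compressed(path):
    """Assumption: a file counts as compressed if it starts with the gzip magic."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def compress_if_needed(path):
    """Compress only files that are not already gzip-compressed."""
    if is_gzip_compressed(path):
        return path  # already compressed: leave untouched
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


# Demonstration with a made-up row file (hypothetical data).
workdir = tempfile.mkdtemp()
plain = os.path.join(workdir, "Row2")
with open(plain, "w") as fh:
    fh.write("ACGT\nTTGA\n")

first_pass = compress_if_needed(plain)        # compresses the plain file
second_pass = compress_if_needed(first_pass)  # detects gzip data, skips it
```

Skipping already-compressed data avoids spending time on a second pass that would yield little or no further size reduction.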
  • the device further includes:
  • the reading unit 1004 is configured to, after the compression unit 1003 triggers a preset lossless compression command to compress the four preset files respectively, sequentially read the row sequence identifiers and related description information from the compressed first, second, third and fourth preset files, wherein the order in which the first, second, third and fourth preset files are read is consistent with the order of the four rows of a sequence unit;
  • the composition unit 1005 is used to compose a target sequence unit according to the row sequence identifier and related description information of each row read;
  • the target FASTQ file is the restored FASTQ file to be processed.
  • the control unit 1007 is configured to, after the compression unit triggers a preset lossless compression command to compress the four preset files respectively, control the compressed first, second, third and fourth preset files to be read synchronously, from the first line to the last line, to obtain the row sequence identifiers and related description information;
  • the reading and writing unit 1008 is configured to, each time a line is read, sequentially write the reading results of the first, second, third and fourth preset files into the same target FASTQ file.
  • the device further includes:
  • the first encryption unit 1009 is used to calculate the FASTQ file to be processed according to the preset encryption algorithm to obtain a first encryption key;
  • the decryption unit 10010 is used to decompress the target FASTQ file using the decompression method corresponding to the preset lossless compression to obtain the decompressed target FASTQ file;
  • the second encryption unit 10011 is used to perform the encryption calculation on the decompressed target FASTQ file using the preset encryption algorithm to obtain a second encryption key;
  • the determination unit 10012 is used to determine, based on whether the first encryption key and the second encryption key are consistent, whether any loss occurred in the storage and restoration of the FASTQ data to be processed.
  • the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • FIG. 12 illustrates a schematic block diagram of an example electronic device 1200 that may be used to implement embodiments of the present disclosure.
  • Electronic devices are intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • the device 1200 includes a computing unit 1201, which can perform various appropriate actions and processes according to a computer program stored in ROM (Read-Only Memory) 1202 or a computer program loaded from the storage unit 1208 into RAM (Random Access Memory) 1203. The RAM 1203 can also store various programs and data required for the operation of the device 1200.
  • Computing unit 1201, ROM 1202 and RAM 1203 are connected to each other via bus 1204.
  • An I/O (Input/Output) interface 1205 is also connected to bus 1204.
  • Multiple components in the device 1200 are connected to the I/O interface 1205, including: input unit 12012, such as a keyboard, mouse, etc.; output unit 1207, such as various types of displays, speakers, etc.; storage unit 1208, such as a magnetic disk, optical disk, etc.; and communication unit 1209, such as a network card, modem, wireless communication transceiver, etc.
  • the communication unit 1209 allows the device 1200 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
  • Computing unit 1201 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units that run machine learning model algorithms, a DSP (Digital Signal Processor), and any appropriate processor, controller, microcontroller, etc.
  • the computing unit 1201 performs various methods and processes described above, such as the processing method of FASTQ data.
  • the FASTQ data processing method may be implemented as a computer software program, which is tangibly embodied in a machine-readable medium, such as storage unit 1208.
  • part or all of the computer program may be loaded and/or installed onto device 1200 via ROM 1202 and/or communication unit 1209.
  • When the computer program is loaded into RAM 1203 and executed by computing unit 1201, one or more steps of the method described above may be performed.
  • the computing unit 1201 may be configured to perform the aforementioned processing method of FASTQ data in any other suitable manner (eg, by means of firmware).
  • These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor, which may be a special-purpose or general-purpose programmable processor, may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, RAM, ROM, EPROM (Erasable Programmable Read-Only Memory) or flash memory, an optical fiber, CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having a display device (e.g., a CRT (Cathode-Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to the user, and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
  • the systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system may be interconnected by any form or medium of digital data communication (eg, a communications network). Examples of communication networks include: LAN (Local Area Network), WAN (Wide Area Network), the Internet, and blockchain networks.
  • Computer systems may include clients and servers. Clients and servers are generally remote from each other and typically interact over a communications network. The relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.
  • the server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and VPS ("Virtual Private Server") services.
  • the server can also be a distributed system server or a server combined with a blockchain.
  • artificial intelligence is the study of using computers to simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, planning, etc.). It has both hardware-level technology and software-level technology.
  • Artificial intelligence hardware technology generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technology mainly includes computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technology and other major directions.

Abstract

Disclosed in the present disclosure are a FASTQ data processing method and apparatus, an electronic device, and a storage medium. A FASTQ file to be processed is split into at least one sequence unit according to a preset data format, each sequence unit comprising four row sequences, and different row sequences corresponding to different row sequence identifiers; the four row sequences are respectively stored in corresponding four preset files according to the row sequence identifiers, wherein one preset file is used for storing row sequences having a same row sequence identifier and stored in different sequence units; a preset lossless compression command is triggered to respectively compress the four preset files. Compared with a mode in the prior art of directly triggering a gzip command to compress a FASTQ file, according to the present disclosure, classification and storage are carried out by utilizing the similarity of each row in sequence units in a FASTQ file, that is, the four row sequences are respectively stored in the corresponding four preset files according to the row sequence identifiers, and preset lossless compression is separately performed on the four preset files, so that a storage space of the FASTQ file is further saved.

Description

FASTQ data processing method and apparatus, electronic device, and storage medium
Technical Field
The present disclosure relates to the field of data processing technology, and in particular to a FASTQ data processing method and apparatus, an electronic device, and a storage medium.
Background
In recent years, as sequencing technology has developed, the price of sequencing has fallen steadily and the volume of sequencing data has surged. How to effectively reduce the storage space occupied by sequencing data has become an urgent problem.
To relieve the storage pressure of FASTQ files, the raw FASTQ files output by the sequencer are commonly compressed directly with the gzip command. The compression principle of gzip is as follows: when two blocks of the FASTQ data have the same content, the content of the later block can be determined from the position and size of the earlier block, so the later block can be replaced by a pair of values (the distance between the two blocks, the length of the identical content). Because this pair is smaller than the content it replaces, the FASTQ file is compressed.
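The (distance, length) replacement described above can be illustrated with a toy encoder. This is a simplified Python sketch of the back-reference idea only: the function names and the three-character minimum match are assumptions for the example, and real gzip (DEFLATE) additionally applies Huffman coding and uses a far more efficient match search.

```python
def lz77_encode(data, window=255):
    """Greedy toy encoder: emits literal characters or (distance, length) pairs."""
    tokens = []
    pos = 0
    while pos < len(data):
        best_len, best_dist = 0, 0
        for start in range(max(0, pos - window), pos):
            length = 0
            # Extend the match; it may overlap the current position.
            while (pos + length < len(data)
                   and data[start + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, pos - start
        if best_len >= 3:
            # Replace the repeated block with a back-reference.
            tokens.append((best_dist, best_len))
            pos += best_len
        else:
            tokens.append(data[pos])
            pos += 1
    return tokens


def lz77_decode(tokens):
    """Rebuild the text by copying `length` characters from `distance` back."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])
        else:
            out.append(token)
    return "".join(out)


tokens = lz77_encode("ACGTACGTACGT")
# tokens == ["A", "C", "G", "T", (4, 8)]
roundtrip = lz77_decode(tokens)
```

Here the last eight characters collapse into a single pair, which is why grouping similar rows together gives the compressor more and longer repeats to reference.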
Although compression by the above method relieves storage pressure to some extent, the space saved by the compression still falls short of expectations.
Summary of the Invention
The present disclosure provides a FASTQ data processing method and apparatus, an electronic device, and a storage medium. Its main purpose is to exploit the similarity of each row across the sequence units of a FASTQ file: the rows are classified and stored by row, and preset lossless compression is performed on the classified files, further saving the storage space of the FASTQ file.
According to a first aspect of the present disclosure, a FASTQ data processing method is provided, including:
splitting the FASTQ file to be processed into at least one sequence unit according to a preset data format, each sequence unit including four row sequences, and different row sequences corresponding to different row sequence identifiers;
storing the four row sequences in four corresponding preset files according to the row sequence identifiers, wherein each preset file is used to store the row sequences that have the same row sequence identifier in different sequence units;
triggering a preset lossless compression command to compress the four preset files respectively.
Optionally, storing the four row sequences in four corresponding preset files according to the row sequence identifiers includes:
from the first sequence unit to the last sequence unit of the FASTQ file to be processed, sequentially reading the row sequences that have the same row sequence identifier in different sequence units;
writing the row sequences read with the same row sequence identifier into the same preset file in the order in which they were read.
Optionally, storing the four row sequences in four corresponding preset files according to the row sequence identifiers includes:
reading the first row sequence in the first sequence unit of the FASTQ file to be processed, writing the row sequence identifier and related description information of the first row sequence in the first sequence unit into the first line of the first preset file, and adding a preset delimiter, wherein the at least one sequence unit includes a first sequence unit and a second sequence unit, the first sequence unit is the first sequence unit of the FASTQ file to be processed, the second sequence unit is the Nth sequence unit of the FASTQ file to be processed, and N is an integer greater than 1;
reading the first row sequence in the Nth sequence unit, writing the row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the Nth line of the first preset file, and adding a preset delimiter;
reading the second row sequence in the first sequence unit, writing the row sequence identifier and related description information of the second row sequence in the first sequence unit into the first line of the second preset file, and adding a preset delimiter;
reading the second row sequence in the Nth sequence unit, writing the row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the Nth line of the second preset file, and adding a preset delimiter;
reading the third row sequence in the first sequence unit, writing the row sequence identifier and related description information of the third row sequence in the first sequence unit into the first line of the third preset file, and adding a preset delimiter;
reading the third row sequence in the Nth sequence unit, writing the row sequence identifier and related description information of the third row sequence in the second sequence unit into the Nth line of the third preset file, and adding a preset delimiter;
reading the fourth row sequence in the first sequence unit, writing the row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first line of the fourth preset file, and adding a preset delimiter;
reading the fourth row sequence in the Nth sequence unit, writing the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the Nth line of the fourth preset file, and adding a preset delimiter;
until all sequence units of the FASTQ file to be processed have been written in turn into the first preset file, the second preset file, the third preset file and the fourth preset file.
Optionally, triggering the preset lossless compression command to compress the four preset files respectively includes:
in response to a triggered first preset lossless compression command, compressing the first preset file;
in response to a triggered second preset lossless compression command, compressing the second preset file;
in response to a triggered third preset lossless compression command, compressing the third preset file;
in response to a triggered fourth preset lossless compression command, compressing the fourth preset file.
Optionally, the preset lossless compression command includes a gzip compression command or a pigz compression command.
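Choosing between the two commands can be sketched as follows. This Python fragment only builds the command lines; the function name `compression_command`, the `prefer_parallel` switch, and the `Row1` to `Row4` file names are assumptions for illustration, and preferring pigz when it is installed is a design choice of this sketch rather than something the source prescribes.

```python
import shutil


def compression_command(path, prefer_parallel=True):
    """Return an argv list that compresses `path` with pigz or gzip."""
    # pigz is a parallel gzip implementation; fall back to gzip if absent.
    if prefer_parallel and shutil.which("pigz"):
        return ["pigz", path]
    return ["gzip", path]


# One command per preset file, mirroring the four separate triggers above.
commands = [compression_command(f"Row{k}") for k in range(1, 5)]
gzip_only = compression_command("Row1", prefer_parallel=False)
```

Each command could then be executed with `subprocess.run(cmd, check=True)`; both tools write gzip-format output, so the same decompression works for either.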
可选的,所述读取所述待处理FASTQ文件第一序列单元中的第一行序列,将所述第一序列单元中的第一行序列的行序列标识及相关的描述信息写入第一预设文件中的第一行包括:Optionally, read the first row sequence in the first sequence unit of the FASTQ file to be processed, and write the row sequence identifier and related description information of the first row sequence in the first sequence unit into the first sequence unit. The first line in a default file contains:
将读取到的所述第一序列单元中的第一行序列的行序列标识及相关的描述信息,执行预设无损压缩;Perform preset lossless compression on the read row sequence identifier and related description information of the first row sequence in the first sequence unit;
将压缩后的第一序列单元中的第一行序列的行序列标识及相关的描述信息写入第一预设文件中的第一行。Write the row sequence identifier and related description information of the first row sequence in the compressed first sequence unit into the first row in the first default file.
可选的,所述读取第N序列单元中的第一行序列,将所述第N序列单元中的第一行序列的行序列标识及相关的描述信息写入第一预设文件中的第N行包括:Optionally, the step of reading the first row sequence in the Nth sequence unit is to write the row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the first default file. Line N includes:
将读取到的所述第N序列单元中的第一行序列的行序列标识及相关的描述信息,执行预设无损压缩;Perform preset lossless compression on the read row sequence identifier and related description information of the first row sequence in the Nth sequence unit;
将压缩后的第N序列单元中的第一行序列的行序列标识及相关的描述信息写入第一预设文件中的第N行。Write the row sequence identifier and related description information of the first row sequence in the compressed N-th sequence unit into the N-th row in the first default file.
可选的,所述读取第一序列单元中的第二行序列,将所述第一序列单元中的第二行序列的行序列标识及相关的描述信息写入第二预设文件中的第一行包括:Optionally, the second row sequence in the first sequence unit is read, and the row sequence identifier and related description information of the second row sequence in the first sequence unit are written into the second default file. The first line includes:
将读取到的所述第一序列单元中的第二行序列的行序列标识及相关的描述信息,执行预设无损压缩;Perform preset lossless compression on the read row sequence identifier and related description information of the second row sequence in the first sequence unit;
将压缩后的第一序列单元中的第二行序列的行序列标识及相关的描述信息写入第二预设文件中的第一行。Write the row sequence identifier and related description information of the second row sequence in the compressed first sequence unit into the first row in the second default file.
可选的,所述读取第N序列单元中的第二行序列,将所述第N序列单元中的第二行序列的行序列标识及相关的描述信息写入第二预设文件中的第N行包括:Optionally, the second row sequence in the Nth sequence unit is read, and the row sequence identifier and related description information of the second row sequence in the Nth sequence unit are written into the second default file. Line N includes:
将读取到的所述第N序列单元中的第二行序列的行序列标识及相关的描述信息,执行预设无损压缩;Perform preset lossless compression on the read row sequence identifier and related description information of the second row sequence in the Nth sequence unit;
将压缩后的所述第N序列单元中的第二行序列的行序列标识及相关的描述信息写入第二预设文件中的第N行。Write the compressed row sequence identifier and related description information of the second row sequence in the N-th sequence unit into the N-th row in the second default file.
Optionally, reading the third row sequence in the first sequence unit and writing the row sequence identifier and related description information of the third row sequence in the first sequence unit into the first line of the third preset file includes:
performing the preset lossless compression on the read row sequence identifier and related description information of the third row sequence in the first sequence unit;
writing the compressed row sequence identifier and related description information of the third row sequence in the first sequence unit into the first line of the third preset file.
Optionally, reading the third row sequence in the Nth sequence unit and writing the row sequence identifier and related description information of the third row sequence in the second sequence unit (that is, the Nth sequence unit) into the Nth line of the third preset file includes:
performing the preset lossless compression on the read row sequence identifier and related description information of the third row sequence in the second sequence unit;
writing the compressed row sequence identifier and related description information of the third row sequence in the second sequence unit into the Nth line of the third preset file.
Optionally, reading the fourth row sequence in the first sequence unit and writing the row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first line of the fourth preset file includes:
performing the preset lossless compression on the read row sequence identifier and related description information of the fourth row sequence in the first sequence unit;
writing the compressed row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first line of the fourth preset file.
Optionally, reading the fourth row sequence in the Nth sequence unit and writing the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the Nth line of the fourth preset file includes:
performing the preset lossless compression on the read row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit;
writing the compressed row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the Nth line of the fourth preset file.
Optionally, triggering the preset lossless compression command to compress the four preset files separately includes:
determining, for each of the four preset files, whether it contains row sequences that have already been compressed;
if so, performing no further compression on the already compressed row sequences and performing the preset lossless compression on the uncompressed row sequences.
Optionally, after triggering the preset lossless compression command to compress the four preset files separately, the method further includes:
starting from the first line of each of the compressed first, second, third and fourth preset files, reading the row sequence identifiers and related description information in turn, wherein the order in which the first, second, third and fourth preset files are read is consistent with the order of the four row sequences in a sequence unit;
forming a target sequence unit from the row sequence identifier and related description information read for each line;
writing the target sequence unit into a single target FASTQ file;
repeating until the row sequence identifier and related description information of the Nth line of the first, second, third and fourth preset files have been read, so that the target FASTQ file is the restored FASTQ file to be processed.
Optionally, after triggering the preset lossless compression command to compress the four preset files separately, the method includes:
controlling synchronized reading of the compressed first, second, third and fourth preset files, reading the row sequence identifiers and related description information line by line from the first line to the last line;
each time a line is read, writing the read results of the first, second, third and fourth preset files, in that order, into the same target FASTQ file.
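The synchronized restoration described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `restore_fastq` is hypothetical, the four preset files are assumed to be gzip-compressed text files with one row per line, and Python's `gzip` module stands in for the decompression step.

```python
import gzip
from contextlib import ExitStack
from typing import Sequence


def restore_fastq(gz_paths: Sequence[str], target_path: str) -> int:
    """Interleave the lines of four gzip files back into one FASTQ file.

    gz_paths must be ordered as row 1, row 2, row 3, row 4 of a sequence unit.
    Returns the number of sequence units written.
    """
    if len(gz_paths) != 4:
        raise ValueError("exactly four preset files are expected")
    units = 0
    with ExitStack() as stack:
        # newline="" keeps the original line endings untouched.
        readers = [
            stack.enter_context(gzip.open(p, "rt", encoding="ascii", newline=""))
            for p in gz_paths
        ]
        out = stack.enter_context(
            open(target_path, "w", encoding="ascii", newline="")
        )
        # zip() advances all four readers in lockstep, one line from each.
        # Note: it stops at the shortest file, so a length check on the
        # inputs would be needed to detect truncated preset files.
        for rows in zip(*readers):
            out.writelines(rows)  # rows 1..4 form one target sequence unit
            units += 1
    return units
```

Reading one line from each file per iteration keeps memory use constant regardless of file size, which matters for sequencing data.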
Optionally, the method further includes:
computing over the FASTQ file to be processed according to a preset encryption algorithm to obtain a first encryption key;
decompressing the target FASTQ file using the decompression method corresponding to the preset lossless compression to obtain a decompressed target FASTQ file;
performing the encryption computation on the decompressed target FASTQ file with the preset encryption algorithm to obtain a second encryption key;
determining, according to whether the first encryption key and the second encryption key are consistent, whether the storage and restoration of the FASTQ data to be processed incurred any loss.
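The consistency check above can be illustrated with a short sketch. The text does not name the preset encryption algorithm; the sketch assumes, purely for illustration, that a cryptographic hash such as SHA-256 plays that role and that the two "encryption keys" are the resulting digests. The function names `file_digest` and `is_lossless` are hypothetical.

```python
import hashlib


def file_digest(path: str, algorithm: str = "sha256",
                chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large FASTQ files need not fit in memory."""
    h = hashlib.new(algorithm)  # assumption: a hash stands in for the preset algorithm
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def is_lossless(original_path: str, restored_path: str,
                algorithm: str = "sha256") -> bool:
    """True when the original and restored files produce the same digest."""
    first = file_digest(original_path, algorithm)    # the "first key"
    second = file_digest(restored_path, algorithm)   # the "second key"
    return first == second
```

Identical digests indicate that splitting, compression and restoration changed no byte of the file; any difference indicates a loss somewhere in the pipeline.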
According to a second aspect of the present disclosure, a FASTQ data processing apparatus is provided, including:
a splitting unit, configured to split the FASTQ file to be processed, according to a preset data format, into at least one sequence unit, each sequence unit including four row sequences, with different row sequences corresponding to different row sequence identifiers;
a storage unit, configured to store the four row sequences, according to the row sequence identifiers, in four corresponding preset files, wherein one preset file stores the row sequences that share the same row sequence identifier across different sequence units;
a compression unit, configured to trigger a preset lossless compression command to compress the four preset files separately.
Optionally, the storage unit includes:
a first reading module, configured to read in turn, from the first sequence unit to the last sequence unit of the FASTQ file to be processed, the row sequences that share the same row sequence identifier in the different sequence units;
a first writing module, configured to write the read row sequences that share the same row sequence identifier into the same preset file in the order in which they were read.
Optionally, the storage unit includes:
a first processing module, configured to read the first row sequence in the first sequence unit of the FASTQ file to be processed, write the row sequence identifier and related description information of the first row sequence in the first sequence unit into the first line of the first preset file, and add a preset delimiter, wherein the at least one sequence unit includes a first sequence unit and a second sequence unit, the first sequence unit being the first sequence unit of the FASTQ file to be processed and the second sequence unit being the Nth sequence unit of the FASTQ file to be processed, N being an integer greater than 1;
a second processing module, configured to read the first row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the Nth line of the first preset file, and add a preset delimiter;
a third processing module, configured to read the second row sequence in the first sequence unit, write the row sequence identifier and related description information of the second row sequence in the first sequence unit into the first line of the second preset file, and add a preset delimiter;
a fourth processing module, configured to read the second row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the Nth line of the second preset file, and add a preset delimiter;
a fifth processing module, configured to read the third row sequence in the first sequence unit, write the row sequence identifier and related description information of the third row sequence in the first sequence unit into the first line of the third preset file, and add a preset delimiter;
a sixth processing module, configured to read the third row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the third row sequence in the second sequence unit (that is, the Nth sequence unit) into the Nth line of the third preset file, and add a preset delimiter;
a seventh processing module, configured to read the fourth row sequence in the first sequence unit, write the row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first line of the fourth preset file, and add a preset delimiter;
an eighth processing module, configured to read the fourth row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the Nth line of the fourth preset file, and add a preset delimiter;
until all sequence units of the FASTQ file to be processed have been written in turn into the first, second, third and fourth preset files.
Optionally, the compression unit is further configured to:
compress the first preset file in response to a triggered first preset lossless compression command;
compress the second preset file in response to a triggered second preset lossless compression command;
compress the third preset file in response to a triggered third preset lossless compression command;
compress the fourth preset file in response to a triggered fourth preset lossless compression command.
Optionally, the preset lossless compression command includes a gzip compression command or a pigz compression command.
Optionally, the first processing module is further configured to:
perform the preset lossless compression on the read row sequence identifier and related description information of the first row sequence in the first sequence unit;
write the compressed row sequence identifier and related description information of the first row sequence in the first sequence unit into the first line of the first preset file.
Optionally, the second processing module is further configured to:
perform the preset lossless compression on the read row sequence identifier and related description information of the first row sequence in the Nth sequence unit;
write the compressed row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the Nth line of the first preset file.
Optionally, the third processing module is further configured to:
perform the preset lossless compression on the read row sequence identifier and related description information of the second row sequence in the first sequence unit;
write the compressed row sequence identifier and related description information of the second row sequence in the first sequence unit into the first line of the second preset file.
Optionally, the fourth processing module is further configured to:
perform the preset lossless compression on the read row sequence identifier and related description information of the second row sequence in the Nth sequence unit;
write the compressed row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the Nth line of the second preset file.
Optionally, the fifth processing module is further configured to:
perform the preset lossless compression on the read row sequence identifier and related description information of the third row sequence in the first sequence unit;
write the compressed row sequence identifier and related description information of the third row sequence in the first sequence unit into the first line of the third preset file.
Optionally, the sixth processing module is further configured to:
perform the preset lossless compression on the read row sequence identifier and related description information of the third row sequence in the second sequence unit;
write the compressed row sequence identifier and related description information of the third row sequence in the second sequence unit into the Nth line of the third preset file.
Optionally, the seventh processing module is further configured to:
perform the preset lossless compression on the read row sequence identifier and related description information of the fourth row sequence in the first sequence unit;
write the compressed row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first line of the fourth preset file.
Optionally, the eighth processing module is further configured to:
perform the preset lossless compression on the read row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit;
write the compressed row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the Nth line of the fourth preset file.
Optionally, the compression unit includes:
a judgment module, configured to determine, for each of the four preset files, whether it contains row sequences that have already been compressed;
a compression module, configured to, when the four preset files are determined to contain already compressed row sequences, perform no further compression on the already compressed row sequences and perform the preset lossless compression on the uncompressed row sequences.
Optionally, the apparatus further includes:
a reading unit, configured to, after the compression unit triggers the preset lossless compression command to compress the four preset files separately, start from the first line of each of the compressed first, second, third and fourth preset files and read the row sequence identifiers and related description information in turn, wherein the order in which the first, second, third and fourth preset files are read is consistent with the order of the four row sequences in a sequence unit;
a composing unit, configured to form a target sequence unit from the row sequence identifier and related description information read for each line;
a writing unit, configured to write the target sequence unit into a single target FASTQ file;
until the row sequence identifier and related description information of the Nth line of the first, second, third and fourth preset files have been read, so that the target FASTQ file is the restored FASTQ file to be processed.
Optionally, the apparatus further includes:
a control unit, configured to, after the compression unit triggers the preset lossless compression command to compress the four preset files separately, control synchronized reading of the compressed first, second, third and fourth preset files, reading the row sequence identifiers and related description information line by line from the first line to the last line;
a read-write unit, configured to, each time a line is read, write the read results of the first, second, third and fourth preset files, in that order, into the same target FASTQ file.
Optionally, the apparatus further includes:
a first encryption unit, configured to compute over the FASTQ file to be processed according to a preset encryption algorithm to obtain a first encryption key;
a decryption unit, configured to decompress the target FASTQ file using the decompression method corresponding to the preset lossless compression to obtain a decompressed target FASTQ file;
a second encryption unit, configured to perform the encryption computation on the decompressed target FASTQ file with the preset encryption algorithm to obtain a second encryption key;
a determination unit, configured to determine, according to whether the first encryption key and the second encryption key are consistent, whether the storage and restoration of the FASTQ data to be processed incurred any loss.
According to a third aspect of the present disclosure, an electronic device is provided, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a fourth aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions cause a computer to perform the method of the first aspect.
According to a fifth aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the method of the first aspect.
According to the FASTQ data processing method, apparatus, electronic device and storage medium provided by the present disclosure, the FASTQ file to be processed is split, according to a preset data format, into at least one sequence unit, each sequence unit including four row sequences, with different row sequences corresponding to different row sequence identifiers; the four row sequences are stored, according to the row sequence identifiers, in four corresponding preset files, wherein one preset file stores the row sequences that share the same row sequence identifier across different sequence units; and a preset lossless compression command is triggered to compress the four preset files separately. Compared with the related-art approach of directly applying the preset lossless compression to the FASTQ file, the embodiments of the present disclosure exploit the similarity between corresponding rows of the sequence units in a FASTQ file and store the rows by category, that is, the four row sequences are stored in four corresponding preset files according to the row sequence identifiers and the preset lossless compression is performed on each of the four preset files, which further reduces the storage space of the FASTQ file.
It should be understood that the content described in this section is not intended to identify key or essential features of the embodiments of the present application, nor to limit the scope of the present application. Other features of the present application will become readily understood from the following description.
Description of the drawings
The accompanying drawings are provided for a better understanding of the present solution and do not limit the present disclosure. In the drawings:
Figure 1 is a schematic flowchart of a FASTQ data processing method provided by an embodiment of the present disclosure;
Figure 2 is a schematic diagram of the format of a FASTQ file provided by an embodiment of the present disclosure;
Figure 3 is a schematic diagram of the row sequences stored in a first preset file, provided by an embodiment of the present disclosure;
Figure 4 is a schematic diagram of the row sequences stored in a second preset file, provided by an embodiment of the present disclosure;
Figure 5 is a schematic diagram of the row sequences stored in a third preset file, provided by an embodiment of the present disclosure;
Figure 6 is a schematic diagram of the row sequences stored in a fourth preset file, provided by an embodiment of the present disclosure;
Figure 7 is a flowchart of a method for storing the four row sequences in the four corresponding preset files, provided by an embodiment of the present disclosure;
Figure 8 is a schematic flowchart of restoring compressed FASTQ data, provided by an embodiment of the present disclosure;
Figure 9 is a schematic flowchart of restoring compressed FASTQ data, provided by an embodiment of the present disclosure;
Figure 10 is a schematic structural diagram of a FASTQ data processing apparatus provided by an embodiment of the present disclosure;
Figure 11 is a schematic structural diagram of another FASTQ data processing apparatus provided by an embodiment of the present disclosure;
Figure 12 is a schematic block diagram of an example electronic device 1200 provided by an embodiment of the present disclosure.
Detailed description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described here without departing from the scope and spirit of the present disclosure. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
The FASTQ data processing method, apparatus, electronic device and storage medium of the embodiments of the present disclosure are described below with reference to the accompanying drawings.
Figure 1 is a schematic flowchart of a FASTQ data processing method provided by an embodiment of the present disclosure.
As shown in Figure 1, the method includes the following steps:
Step 101: split the FASTQ file to be processed, according to a preset data format, into at least one sequence unit, each sequence unit including four row sequences, with different row sequences corresponding to different row sequence identifiers.
To aid understanding, the format of the FASTQ file to be processed is described here. Figure 2 is a schematic diagram of the format of a FASTQ file provided by an embodiment of the present disclosure. A FASTQ file is the standard file for storing raw sequencing data, and every four lines form an independent sequence unit (or sequence storage unit). As shown in Figure 2, every four lines represent one sequence unit, for a total of four sequence units. The preset data format of each sequence unit is as follows:
the first line holds the sequence identifier and related description information, begins with '@', and is the unique identifier of each sequence unit;
the second line is the sequence, composed of A, C, G, T and N, where A, C, G and T are base information and N is a placeholder code substituted when sequencing fails;
the third line begins with '+', followed by the row sequence identifier and related description information, or by nothing at all; in the data shown in the examples of the embodiments of the present disclosure, this line contains only '+';
the fourth line is the quality information of the sequence, corresponding one-to-one with the bases in the second line; each base has one quality value, expressed as an ASCII character, that measures the reliability of the sequenced base, and a higher quality value means a more reliable base.
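The four-line layout above can be sketched as a small parser. This is a minimal illustration only: the names `SequenceUnit` and `iter_sequence_units` are hypothetical, and the checks cover just the rules stated above (leading '@', leading '+', one quality character per base).

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class SequenceUnit:
    header: str     # line 1: starts with '@', unique identifier of the unit
    bases: str      # line 2: A, C, G, T or N
    separator: str  # line 3: starts with '+', often nothing else
    quality: str    # line 4: one ASCII quality character per base


def iter_sequence_units(lines: Iterable[str]) -> Iterator[SequenceUnit]:
    """Group consecutive lines into four-line sequence units."""
    rows = [line.rstrip("\n") for line in lines]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(rows), 4):
        header, bases, separator, quality = rows[i:i + 4]
        if not header.startswith("@"):
            raise ValueError(f"line {i + 1}: expected a leading '@'")
        if not separator.startswith("+"):
            raise ValueError(f"line {i + 3}: expected a leading '+'")
        if len(bases) != len(quality):
            raise ValueError(f"line {i + 4}: quality length differs from bases")
        yield SequenceUnit(header, bases, separator, quality)
```

For example, the four lines `@r1 demo`, `ACGTN`, `+`, `IIII#` (an invented record, not one from Figure 2) yield a single `SequenceUnit`.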
It should be noted that Figure 2 is only an illustrative example. The embodiments of the present disclosure do not limit the number of sequence units in the FASTQ file to be processed or the content of each line in each sequence unit; Figure 2 is given only to make the format easier to understand.
In a specific application, the FASTQ file to be processed is split into at least one sequence unit based on the preset data format of the sequence unit. The embodiments of the present disclosure use four sequence units as an example, but this is not intended to limit the specific number.
Step 102: store the four row sequences, according to the row sequence identifiers, in four corresponding preset files, wherein one preset file stores the row sequences that share the same row sequence identifier across different sequence units.
After the four lines of each sequence unit of the FASTQ file to be processed are classified, they are written to four corresponding preset files: the first-line sequence identifier and related description information are written to the first preset file (file 1), the second-line sequence identifier and related description information are written to the second preset file (file 2), the third-line sequence identifier and related description information are written to the third preset file (file 3), and the fourth-line sequence identifier and related description information (quality information) are written to the fourth preset file (file 4).
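Step 102 can be sketched as a single pass over the input that routes each line to one of four output files by its position within the sequence unit. The function name `split_fastq` is hypothetical, and the sketch assumes that the line ending already present in the input serves as the delimiter between rows in each preset file.

```python
from contextlib import ExitStack
from typing import Sequence


def split_fastq(fastq_path: str, preset_paths: Sequence[str]) -> int:
    """Write line k of every sequence unit to preset_paths[k - 1].

    Returns the number of sequence units processed.
    """
    if len(preset_paths) != 4:
        raise ValueError("exactly four preset files are expected")
    count = 0
    with ExitStack() as stack:
        # newline="" preserves the original line endings byte for byte.
        src = stack.enter_context(
            open(fastq_path, "r", encoding="ascii", newline="")
        )
        outs = [
            stack.enter_context(open(p, "w", encoding="ascii", newline=""))
            for p in preset_paths
        ]
        for count, line in enumerate(src, start=1):
            # Lines 1, 5, 9, ... go to file 1; lines 2, 6, 10, ... to file 2; etc.
            outs[(count - 1) % 4].write(line)
    if count % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    return count // 4
```

Because each unit contributes exactly one line to each file, line N of every preset file belongs to the Nth sequence unit, which is what makes the later restoration possible.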
Continuing the example of step 101 and based on the FASTQ file to be processed shown in Figure 2, Figures 3 to 6 are schematic diagrams of the row sequences stored in the first, second, third and fourth preset files, respectively, provided by an embodiment of the present disclosure. As Figures 3 to 6 show, the first preset file stores the first-line content of the different sequence units, the second preset file stores the second-line content of the different sequence units, the third preset file stores the third-line content of the different sequence units, and the fourth preset file stores the fourth-line content of the different sequence units.
Step 103: trigger a preset lossless compression command to compress the four preset files separately.
In the embodiments of the present disclosure, the similarity between the row sequences written line by line into each of the four preset files, combined with the compression principle of gzip, further reduces the space needed to store FASTQ data.
As a feasible implementation of the embodiments of the present application, the preset lossless compression command includes a gzip compression command or a pigz compression command. The subsequent embodiments use the gzip compression command as an example, but this is not intended to limit the compression method to the gzip compression command only.
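Step 103 can be sketched with Python's standard `gzip` module, which writes the same file format as the gzip command. This is an illustrative substitute for invoking the gzip or pigz command line tools, not the disclosed implementation; the function names `gzip_file` and `compress_preset_files` are hypothetical.

```python
import gzip
import shutil
from typing import List, Sequence


def gzip_file(path: str) -> str:
    """Compress one file to path + '.gz' and return the new path."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, constant memory
    return gz_path


def compress_preset_files(paths: Sequence[str]) -> List[str]:
    """Compress each preset file separately, one gzip stream per file."""
    return [gzip_file(p) for p in paths]
```

Unlike the gzip command, this sketch leaves the uncompressed input files in place; removing them afterwards is left to the caller.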
gzip的压缩原理为:当FASTQ数据中有两块内容相同时,只要获取前一块的位置和大小,就可以确定后一块的内容,即可以用(两者之间的距离,相同内容的长度)这样一对信息,来替换后一块内容。由于(两者之间的距离,相同内容的长度)这一对信息的大小,小于被替换内容的大小,所以FASTQ文件得到压缩。The compression principle of gzip is: when there are two pieces of the same content in FASTQ data, as long as the position and size of the previous piece are obtained, the content of the latter piece can be determined, that is, it can be used (the distance between the two, the length of the same content) Such a pair of information replaces the latter piece of content. Since the size of this pair of information (the distance between the two, the length of the same content) is smaller than the size of the replaced content, the FASTQ file is compressed.
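The (distance, length) substitution described above can be illustrated with a toy decoder. This is a minimal Python sketch, not gzip's actual bit-level format: the token layout and the name `expand_tokens` are assumptions made for illustration. In the real DEFLATE format used by gzip, back-references can reach at most 32 KiB back, which is why bringing similar rows closer together can help.

```python
def expand_tokens(tokens):
    """Expand literals and (distance, length) back-references into text.

    A literal is a single character; a tuple (distance, length) means
    "copy `length` characters starting `distance` characters back".
    """
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            distance, length = tok
            start = len(out) - distance
            # Copy one character at a time so overlapping copies also work.
            for i in range(length):
                out.append(out[start + i])
        else:
            out.append(tok)
    return "".join(out)


# Four literals followed by one back-reference replace eight characters.
tokens = list("ACGT") + [(4, 4)]
restored = expand_tokens(tokens)
print(restored)
```

Here the second "ACGT" is stored as the single pair (4, 4) instead of four literal characters.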
Referring again to Figure 5, the third preset file contains only the leading '+' of the third row of each sequence unit. During compression, the gzip command is triggered to perform the compression. Likewise, because corresponding rows of different sequence units are highly similar, the row sequences stored in each of the four preset files are highly similar. When the gzip command is triggered, classifying the rows in this way shortens the distance between identical blocks of content, which saves space.
In the FASTQ data processing method provided by the present disclosure, the FASTQ file to be processed is split, according to a preset data format, into at least one sequence unit. Each sequence unit includes four row sequences, and different row sequences correspond to different row sequence identifiers. The four row sequences are stored in four corresponding preset files according to the row sequence identifiers, where one preset file stores the row sequences that have the same row sequence identifier across different sequence units. A preset lossless compression command is then triggered to compress the four preset files separately. Compared with the related-art approach of directly triggering the gzip command to compress the FASTQ file, the embodiments of the present disclosure exploit the similarity of corresponding rows across sequence units in the FASTQ file and store the rows by category, that is, the four row sequences are stored in four corresponding preset files according to the row sequence identifiers and the preset lossless compression is performed on each of the four preset files, which further saves storage space for the FASTQ file.
In practice, storing the four row sequences in the corresponding four preset files according to the row sequence identifiers in step 102 can be implemented in either of the following two ways:
Way 1: from the first sequence unit to the last sequence unit of the FASTQ file to be processed, read in turn the row sequences that have the same row sequence identifier in different sequence units, and write the row sequences with the same row sequence identifier to the same preset file in the order in which they were read.
As an example, the FASTQ file to be processed contains N sequence units (N greater than 1). During classified storage, the sequence units are read one by one: the first row of sequence unit 1 is read and written to the first row of the first preset file; the second row of sequence unit 1 is read and written to the first row of the second preset file; the third row of sequence unit 1 is read and written to the first row of the third preset file; and the fourth row of sequence unit 1 is read and written to the first row of the fourth preset file. After sequence unit 1 has been read, each subsequent sequence unit, up to sequence unit N, is read in the same way until all sequence units have been read, which completes the classified storage of the FASTQ file to be processed.
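The classified storage just described can be sketched in Python as follows. This is an illustrative sketch only, not the patent's Perl listing: the function name `split_fastq_by_row`, the sample records, and the output file names are assumptions, and the input is assumed to be well-formed four-line FASTQ records.

```python
import os
import tempfile


def split_fastq_by_row(fastq_path, out_paths):
    """Write row k (k = 1..4) of every 4-line record to out_paths[k - 1]."""
    outs = [open(p, "wb") for p in out_paths]
    try:
        with open(fastq_path, "rb") as src:
            for index, line in enumerate(src):
                # index % 4 is the row position inside the sequence unit.
                outs[index % 4].write(line)
    finally:
        for handle in outs:
            handle.close()


# Tiny demonstration with two made-up sequence units.
workdir = tempfile.mkdtemp()
fastq_path = os.path.join(workdir, "example.fq")
with open(fastq_path, "wb") as fh:
    fh.write(b"@r1\nACGT\n+\nFFFF\n@r2\nTTGA\n+\nFF:F\n")

row_paths = [os.path.join(workdir, f"Row{k}") for k in range(1, 5)]
split_fastq_by_row(fastq_path, row_paths)
```

Binary mode is used so that every byte, including line endings, is carried over unchanged.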
Way 2: read the first row sequence in the first sequence unit of the FASTQ file to be processed, write the row sequence identifier and related description information of that first row sequence to the first row of the first preset file, and add a preset delimiter, where the at least one sequence unit includes a first sequence unit and a second sequence unit, the first sequence unit is the first sequence unit of the FASTQ file to be processed, the second sequence unit is the N-th sequence unit of the FASTQ file to be processed, and N is an integer greater than 1;
read the first row sequence in the N-th sequence unit, write the row sequence identifier and related description information of the first row sequence in the N-th sequence unit to the N-th row of the first preset file, and add a preset delimiter;
read the second row sequence in the first sequence unit, write the row sequence identifier and related description information of the second row sequence in the first sequence unit to the first row of the second preset file, and add a preset delimiter;
read the second row sequence in the N-th sequence unit, write the row sequence identifier and related description information of the second row sequence in the N-th sequence unit to the N-th row of the second preset file, and add a preset delimiter;
read the third row sequence in the first sequence unit, write the row sequence identifier and related description information of the third row sequence in the first sequence unit to the first row of the third preset file, and add a preset delimiter;
read the third row sequence in the N-th sequence unit, write the row sequence identifier and related description information of the third row sequence in the N-th sequence unit to the N-th row of the third preset file, and add a preset delimiter;
read the fourth row sequence in the first sequence unit, write the row sequence identifier and related description information of the fourth row sequence in the first sequence unit to the first row of the fourth preset file, and add a preset delimiter;
read the fourth row sequence in the N-th sequence unit, write the row sequence identifier and related description information of the fourth row sequence in the N-th sequence unit to the N-th row of the fourth preset file, and add a preset delimiter;
and so on, until all sequence units of the FASTQ file to be processed have been written in turn to the first, second, third, and fourth preset files.
An embodiment of the present disclosure provides a method for compressing the four preset files separately. Figure 7 is a flowchart of a method for storing the four row sequences in the corresponding four preset files according to an embodiment of the present disclosure. The method can be performed alone, in combination with any embodiment of the present disclosure or any possible implementation thereof, or in combination with any technical solution in the related art. As a feasible way of performing step 101, the method includes:
Step 701: read the row sequence identifier and related description information of the first row sequence in the first sequence unit;
perform the preset lossless compression on the row sequence identifier and related description information read from the first row sequence in the first sequence unit;
write the compressed row sequence identifier and related description information of the first row sequence in the first sequence unit to the first row of the first preset file.
Step 702: read the row sequence identifier and related description information of the first row sequence in the N-th sequence unit;
perform the preset lossless compression on the row sequence identifier and related description information read from the first row sequence in the N-th sequence unit;
write the compressed row sequence identifier and related description information of the first row sequence in the N-th sequence unit to the N-th row of the first preset file.
Step 703: read the row sequence identifier and related description information of the second row sequence in the first sequence unit;
perform the preset lossless compression on the row sequence identifier and related description information read from the second row sequence in the first sequence unit;
write the compressed row sequence identifier and related description information of the second row sequence in the first sequence unit to the first row of the second preset file.
Step 704: read the row sequence identifier and related description information of the second row sequence in the N-th sequence unit;
perform the preset lossless compression on the row sequence identifier and related description information read from the second row sequence in the N-th sequence unit;
write the compressed row sequence identifier and related description information of the second row sequence in the N-th sequence unit to the N-th row of the second preset file.
Step 705: read the row sequence identifier and related description information of the third row sequence in the first sequence unit;
perform the preset lossless compression on the row sequence identifier and related description information read from the third row sequence in the first sequence unit;
write the compressed row sequence identifier and related description information of the third row sequence in the first sequence unit to the first row of the third preset file.
Step 706: read the row sequence identifier and related description information of the third row sequence in the N-th sequence unit;
perform the preset lossless compression on the row sequence identifier and related description information read from the third row sequence in the N-th sequence unit;
write the compressed row sequence identifier and related description information of the third row sequence in the N-th sequence unit to the N-th row of the third preset file.
Step 707: read the row sequence identifier and related description information of the fourth row sequence in the first sequence unit;
perform the preset lossless compression on the row sequence identifier and related description information read from the fourth row sequence in the first sequence unit;
write the compressed row sequence identifier and related description information of the fourth row sequence in the first sequence unit to the first row of the fourth preset file.
Step 708: read the row sequence identifier and related description information of the fourth row sequence in the N-th sequence unit;
perform the preset lossless compression on the row sequence identifier and related description information read from the fourth row sequence in the N-th sequence unit;
write the compressed row sequence identifier and related description information of the fourth row sequence in the N-th sequence unit to the N-th row of the fourth preset file.
It should be noted that the method shown in Figure 7 can be performed as a complete example comprising steps 701 to 708; alternatively, any single one of steps 701 to 708, or any two or more of them, can be performed. The embodiments of the present disclosure do not limit this.
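One way to realize "compress as each row is read" is to write every row straight into a gzip stream for its preset file. The following Python sketch assumes that interpretation; the patent does not specify how the per-row compression is carried out, and the name `split_and_compress` and the sample data are hypothetical.

```python
import gzip
import os
import tempfile


def split_and_compress(fastq_path, gz_paths):
    """Stream row k of every 4-line record into the k-th gzip file."""
    outs = [gzip.open(p, "wb") for p in gz_paths]
    try:
        with open(fastq_path, "rb") as src:
            for index, line in enumerate(src):
                # Each row is compressed on the fly as it is written.
                outs[index % 4].write(line)
    finally:
        for handle in outs:
            handle.close()


# Demonstration on two made-up sequence units.
workdir = tempfile.mkdtemp()
fastq_path = os.path.join(workdir, "example.fq")
with open(fastq_path, "wb") as fh:
    fh.write(b"@r1\nACGT\n+\nFFFF\n@r2\nTTGA\n+\nFF:F\n")

gz_paths = [os.path.join(workdir, f"Row{k}.gz") for k in range(1, 5)]
split_and_compress(fastq_path, gz_paths)
```

With this approach no uncompressed intermediate files are written to disk.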
As a feasible implementation of the embodiments of the present disclosure, to ensure that the compressed FASTQ file to be processed can be restored losslessly, compression is performed only once. Triggering the preset lossless compression command to compress the four preset files separately therefore includes: determining, for each of the four preset files, whether it contains row sequences that have already been compressed; if so, the already compressed row sequences are not compressed again, and the preset lossless compression is performed only on the uncompressed row sequences. This ensures lossless restoration.
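A simple file-level variant of this "compress only once" check can be sketched in Python. This is an assumption-laden illustration: the patent describes the check per row sequence and does not say how compressed content is detected, whereas the sketch below tests a whole file for the two-byte gzip magic number (0x1f 0x8b). The names `is_gzip` and `compress_once` are hypothetical.

```python
import gzip
import os
import shutil
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def is_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def compress_once(path):
    """Return a gzip path for `path`, compressing only if it is not gzip yet."""
    if is_gzip(path):
        return path  # already compressed: leave it untouched
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


workdir = tempfile.mkdtemp()
plain = os.path.join(workdir, "Row3")
with open(plain, "wb") as fh:
    fh.write(b"+\n" * 100)

first = compress_once(plain)   # compresses Row3 into Row3.gz
second = compress_once(first)  # detects gzip data and does nothing
```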
In the method shown in Figure 7, compression is performed each time a row sequence of a sequence unit is read, and the compressed row sequence is written to the corresponding preset file. The following embodiment provides another feasible way of performing step 104: after the four row sequences have been stored in the corresponding four preset files according to the row sequence identifiers by any of the above embodiments, the preset files are compressed separately, specifically:
in response to a triggered first preset lossless compression command, compress the first preset file;
in response to a triggered second preset lossless compression command, compress the second preset file;
in response to a triggered third preset lossless compression command, compress the third preset file;
in response to a triggered fourth preset lossless compression command, compress the fourth preset file.
The embodiments of the present disclosure do not limit the specific implementation of the compression.
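Since the implementation is left open, one possible realization is shown below as a Python sketch that uses the standard `gzip` module in place of invoking the gzip command. The function name `gzip_each`, the file names, and the sample contents are assumptions for illustration.

```python
import gzip
import os
import shutil
import tempfile


def gzip_each(paths):
    """Compress every file in `paths` into its own .gz file."""
    compressed = []
    for path in paths:
        gz_path = path + ".gz"
        with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        compressed.append(gz_path)
    return compressed


# Four made-up preset files, one per row position.
workdir = tempfile.mkdtemp()
contents = [b"@r1\n@r2\n", b"ACGT\nTTGA\n", b"+\n+\n", b"FFFF\nFF:F\n"]
row_paths = []
for k, data in enumerate(contents, start=1):
    path = os.path.join(workdir, f"Row{k}")
    with open(path, "wb") as fh:
        fh.write(data)
    row_paths.append(path)

gz_row_paths = gzip_each(row_paths)
```

Each preset file is compressed independently, so the four compressions could also be run in parallel.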
An embodiment of the present disclosure provides a method for decompressing and restoring the four preset files. Figure 8 is a schematic flowchart of restoring compressed FASTQ data according to an embodiment of the present disclosure. The method can be performed alone, in combination with any embodiment of the present disclosure or any possible implementation thereof, or in combination with any technical solution in the related art, and includes:
Step 801: starting from the first row of the compressed first, second, third, and fourth preset files, read the row sequence identifiers and related description information in turn, where the order in which the first, second, third, and fourth preset files are read is consistent with the order of the four row sequences in a sequence unit.
That is, the first row of the first preset file is read, then the first row of the second preset file, then the first row of the third preset file, and then the first row of the fourth preset file.
Reading then continues with the N-th row of the first preset file, the N-th row of the second preset file, the N-th row of the third preset file, and the N-th row of the fourth preset file, where N is greater than 1.
This continues until the row sequence identifier and related description information of the N-th row of each of the first, second, third, and fourth preset files have been read, so that the target FASTQ file is the restored FASTQ file to be processed.
Step 802: assemble one target sequence unit from the row sequence identifier and related description information read from each row.
Step 803: write the target sequence units to the same target FASTQ file.
The method shown in Figure 8 reads each row of the first, second, third, and fourth preset files one after another, which is relatively slow. To restore the FASTQ file to be processed more efficiently, the following approach can also be used. Figure 9 is a schematic flowchart of another way of restoring compressed FASTQ data according to an embodiment of the present disclosure. The method can be performed alone, in combination with any embodiment of the present disclosure or any possible implementation thereof, or in combination with any technical solution in the related art, and includes:
Step 901: read the row sequence identifiers and related description information from the compressed first, second, third, and fourth preset files synchronously, from the first row to the last row. Controlling the synchronization of the reads of the four preset files programmatically improves read efficiency and therefore the efficiency of restoring the FASTQ file.
Step 902: each time a row is read, write the results read from the first, second, third, and fourth preset files, in that order, to the same target FASTQ file.
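The lockstep restoration of steps 901 and 902 can be sketched in Python as follows. This is a hedged illustration rather than the patent's Perl code: `restore_fastq` and the sample data are hypothetical, the four compressed files are assumed to contain the same number of rows, and the sketch writes an uncompressed target file.

```python
import gzip
import os
import tempfile


def restore_fastq(row_gz_paths, target_path):
    """Interleave row n of the four gzip files back into 4-line records."""
    handles = [gzip.open(p, "rb") for p in row_gz_paths]
    try:
        with open(target_path, "wb") as target:
            # zip() advances all four files together, one row at a time,
            # and stops as soon as any of them is exhausted.
            for rows in zip(*handles):
                target.writelines(rows)
    finally:
        for handle in handles:
            handle.close()


# Build four small compressed preset files from made-up rows.
workdir = tempfile.mkdtemp()
contents = [b"@r1\n@r2\n", b"ACGT\nTTGA\n", b"+\n+\n", b"FFFF\nFF:F\n"]
row_gz_paths = []
for k, data in enumerate(contents, start=1):
    path = os.path.join(workdir, f"Row{k}.gz")
    with gzip.open(path, "wb") as fh:
        fh.write(data)
    row_gz_paths.append(path)

target_path = os.path.join(workdir, "restored.fq")
restore_fastq(row_gz_paths, target_path)
```

Only one row per file is held in memory at a time, so the approach also suits very large inputs.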
As a feasible implementation of the embodiments of the present disclosure, the reliability of the above processing is verified as follows: a first encryption key is computed from the FASTQ file to be processed using a preset encryption algorithm; the target FASTQ file is decompressed using the decompression method corresponding to the preset lossless compression to obtain a decompressed target FASTQ file; a second encryption key is computed from the decompressed target FASTQ file using the same preset encryption algorithm; and whether the storage and restoration of the FASTQ data to be processed incurred any loss is determined according to whether the first encryption key and the second encryption key are consistent.
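A consistency check of this kind can be sketched with an MD5 digest, which is the value compared in the examples of this disclosure. The Python sketch below is illustrative only: `md5_of_file`, the file names, and the sample content are assumptions, and an MD5 digest is strictly a checksum rather than an encryption key.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-ins for the original FASTQ file and the restored FASTQ file.
workdir = tempfile.mkdtemp()
data = b"@r1\nACGT\n+\nFFFF\n"
original_path = os.path.join(workdir, "original.fq")
restored_path = os.path.join(workdir, "restored.fq")
for path in (original_path, restored_path):
    with open(path, "wb") as fh:
        fh.write(data)

# Identical digests indicate that no data was lost in the round trip.
is_lossless = md5_of_file(original_path) == md5_of_file(restored_path)
```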
As Example 1, the following applies the above FASTQ data processing method to save storage space for a compressed FASTQ file named "V350027954_L01_read_1.fq.gz" with a size of 21789495522 bp. The MD5 (MD5 Message-Digest Algorithm) value of the decompressed file "V350027954_L01_read_1.fq" is "ca4168a17d0510a5d3f51fa6856d1888". After the file has been classified, compressed, and stored with the method of the present invention and then restored with the restoration method, this MD5 value can be checked for consistency to verify the reliability of the above method.
In this example, effective compression of the FASTQ data is implemented in Perl. The implementation code is as follows:
Figure PCTCN2022095757-appb-000001
In this embodiment of the invention, each sequence unit of V350027954_L01_read_1.fq.gz is classified by row, and the rows are compressed and written to the files Row1.gz, Row2.gz, Row3.gz, and Row4.gz (the four preset files), whose sizes are 1330794994 bp, 6725867471 bp, 978835 bp, and 8702582307 bp respectively. The total size of the four files is 16760223607 bp, which is only 76.92% of the size of the original file V350027954_L01_read_1.fq.gz, so the FASTQ data is effectively compressed.
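The reported total and percentage follow directly from the four file sizes. A short Python check using only the figures stated above (the variable names are illustrative):

```python
# Sizes of the four compressed preset files, as reported in Example 1.
row_sizes = {
    "Row1.gz": 1330794994,
    "Row2.gz": 6725867471,
    "Row3.gz": 978835,
    "Row4.gz": 8702582307,
}
original_size = 21789495522  # size of V350027954_L01_read_1.fq.gz

total_size = sum(row_sizes.values())
ratio_percent = round(total_size / original_size * 100, 2)
print(total_size, ratio_percent)
```

The third file, which holds only the '+' rows, contributes a negligible share of the total.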
As Example 2, the following applies the method of the embodiments of the present invention to restore the compressed files Row1.gz, Row2.gz, Row3.gz, and Row4.gz obtained by the classified storage in Example 1 to the original file V350027954_L01_read_1.fq.gz. In Example 2 the implementation language is Perl, and the implementation code is as follows:
Figure PCTCN2022095757-appb-000002
Figure PCTCN2022095757-appb-000003
In Example 2, the n-th row (from the first row to the last row) of Row1.gz, Row2.gz, Row3.gz, and Row4.gz is read simultaneously, and the output is compressed and written to a file named "V350027954_L01_read_1.fq.gz".
Further, decompressing the file "V350027954_L01_read_1.fq.gz" yields the file "V350027954_L01_read_1.fq", and the command "md5sum V350027954_L01_read_1.fq" gives its MD5 value as "ca4168a17d0510a5d3f51fa6856d1888". This is identical to the MD5 value in Example 1, which demonstrates that the compression method and the restoration method provided by the embodiments of the present invention are effective and feasible.
Corresponding to the FASTQ data processing method described above, the present invention also provides a FASTQ data processing apparatus. Because the apparatus embodiments of the present invention correspond to the method embodiments described above, details not disclosed in the apparatus embodiments can be found in the method embodiments and are not repeated here.
An embodiment of the present disclosure provides a FASTQ data processing apparatus which, as shown in Figure 10, includes:
a splitting unit 1001, configured to split the FASTQ file to be processed, according to a preset data format, into at least one sequence unit, where each sequence unit includes four row sequences and different row sequences correspond to different row sequence identifiers;
a storage unit 1002, configured to store the four row sequences in four corresponding preset files according to the row sequence identifiers, where one preset file stores the row sequences that have the same row sequence identifier across different sequence units;
a compression unit 1003, configured to trigger a preset lossless compression command to compress the four preset files separately.
In the FASTQ data processing apparatus provided by the present disclosure, the FASTQ file to be processed is split, according to a preset data format, into at least one sequence unit. Each sequence unit includes four row sequences, and different row sequences correspond to different row sequence identifiers. The four row sequences are stored in four corresponding preset files according to the row sequence identifiers, where one preset file stores the row sequences that have the same row sequence identifier across different sequence units. A preset lossless compression command is then triggered to compress the four preset files separately. Compared with the related-art approach of directly triggering the gzip command to compress the FASTQ file, the embodiments of the present disclosure exploit the similarity of corresponding rows across sequence units in the FASTQ file and store the rows by category, that is, the four row sequences are stored in four corresponding preset files according to the row sequence identifiers and the preset lossless compression is performed on each of the four preset files, which further saves storage space for the FASTQ file.
进一步地,在本实施例一种可能的实现方式中,如图11所示,所述存储单元1002包括:Further, in a possible implementation of this embodiment, as shown in Figure 11, the storage unit 1002 includes:
第一读取模块,用于从所述待处理FASTQ文件的第一个序列单元到最后一个序列单元,依次读取不同序列单元中的相同行序列标识的行序列;A first reading module, configured to sequentially read line sequences with the same line sequence identifier in different sequence units from the first sequence unit to the last sequence unit of the FASTQ file to be processed;
第一写入模块,用于按照读取顺序将读取到的所述相同行序列标识的行序列写入同一预设文件中。The first writing module is configured to write the read line sequence identified by the same line sequence into the same preset file according to the reading order.
进一步地,在本实施例一种可能的实现方式中,所述存储单元包括:Further, in a possible implementation of this embodiment, the storage unit includes:
第一处理模块,用于读取所述待处理FASTQ文件第一序列单元中的第一行序列,将所述第一序列单元中的第一行序列的行序列标识及相关的描述信息写入第一预设文件中的第一行,并添加预设分隔符,其中,所述至少一个序列单元包含第一序列单元及第二序列单元,所述第一序列单元为所述待处理FASTQ文件的第一个序列单元,所述第二序列单元为所述待处理FASTQ文件的第N个序列单元,N为大于1的整数;The first processing module is used to read the first row sequence in the first sequence unit of the FASTQ file to be processed, and write the row sequence identifier and related description information of the first row sequence in the first sequence unit. The first line in the first preset file, and add a preset delimiter, wherein the at least one sequence unit includes a first sequence unit and a second sequence unit, and the first sequence unit is the FASTQ file to be processed The first sequence unit, the second sequence unit is the Nth sequence unit of the FASTQ file to be processed, and N is an integer greater than 1;
第二处理模块,用于读取第N序列单元中的第一行序列,将所述第N序列单元中的第一行序列的行序列标识及相关的描述信息写入第一预设文件中的第N行,并添加预设分隔符;The second processing module is used to read the first row sequence in the Nth sequence unit, and write the row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the first default file. The Nth line of and add the preset delimiter;
A third processing module, configured to read the second row sequence in the first sequence unit, write the row sequence identifier and related description information of the second row sequence in the first sequence unit into the first line of the second preset file, and add a preset delimiter;
A fourth processing module, configured to read the second row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the Nth line of the second preset file, and add a preset delimiter;
A fifth processing module, configured to read the third row sequence in the first sequence unit, write the row sequence identifier and related description information of the third row sequence in the first sequence unit into the first line of the third preset file, and add a preset delimiter;
A sixth processing module, configured to read the third row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the third row sequence in the second sequence unit (that is, the Nth sequence unit) into the Nth line of the third preset file, and add a preset delimiter;
A seventh processing module, configured to read the fourth row sequence in the first sequence unit, write the row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first line of the fourth preset file, and add a preset delimiter;
An eighth processing module, configured to read the fourth row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the Nth line of the fourth preset file, and add a preset delimiter;
until all sequence units of the FASTQ file to be processed have been written, in order, into the first preset file, the second preset file, the third preset file, and the fourth preset file.
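The row-wise split performed by the processing modules can be illustrated with a minimal Python sketch. It is not the implementation of the disclosure: the function name `split_fastq`, the output file naming, and the choice of a newline as the preset delimiter are assumptions made only for illustration, and the input is assumed to be a well-formed FASTQ file with exactly four rows per sequence unit.

```python
from itertools import islice
from pathlib import Path


def split_fastq(fastq_path, out_prefix, delimiter="\n"):
    """Write row k of every 4-row sequence unit to the k-th preset file.

    Hypothetical helper: names and the newline delimiter are assumptions.
    Returns the four output paths and the number of sequence units written.
    """
    out_paths = [Path(f"{out_prefix}.row{k}.txt") for k in range(1, 5)]
    handles = [p.open("w", encoding="ascii", newline="") for p in out_paths]
    units = 0
    try:
        with open(fastq_path, "r", encoding="ascii", newline="") as src:
            while True:
                # Take the next sequence unit: up to four consecutive rows.
                unit = [line.rstrip("\n") for line in islice(src, 4)]
                if not unit:
                    break
                if len(unit) != 4:
                    raise ValueError("truncated sequence unit")
                # Row k of the Nth unit becomes line N of the k-th preset file.
                for handle, row in zip(handles, unit):
                    handle.write(row + delimiter)
                units += 1
    finally:
        for handle in handles:
            handle.close()
    return out_paths, units
```

Because each output file then holds only one kind of row (identifiers, bases, separator rows, or quality strings), a general-purpose compressor sees more homogeneous input in each file.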
Further, in a possible implementation of this embodiment, the compression unit 1003 is further configured to:
in response to a triggered first preset lossless compression command, perform compression on the first preset file;
in response to a triggered second preset lossless compression command, perform compression on the second preset file;
in response to a triggered third preset lossless compression command, perform compression on the third preset file;
in response to a triggered fourth preset lossless compression command, perform compression on the fourth preset file.
Further, in a possible implementation of this embodiment, the preset lossless compression command includes a gzip compression command or a pigz compression command.
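As a rough illustration of compressing each preset file separately, the sketch below uses Python's standard `gzip` module as a stand-in for invoking the gzip or pigz command; the function name `compress_preset_files` and the `.gz` output naming are assumptions, not part of the disclosure.

```python
import gzip
import shutil


def compress_preset_files(paths):
    """Compress each preset file on its own; returns the .gz paths.

    Hypothetical helper: the gzip module stands in for the gzip/pigz command.
    """
    compressed = []
    for path in paths:
        gz_path = f"{path}.gz"
        # One independent compression pass per preset file.
        with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        compressed.append(gz_path)
    return compressed
```

Since the four passes are independent of one another, they could also be run concurrently, which is in the spirit of issuing four separate compression commands.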
Further, in a possible implementation of this embodiment, the first processing module is further configured to:
perform the preset lossless compression on the read row sequence identifier and related description information of the first row sequence in the first sequence unit;
write the compressed row sequence identifier and related description information of the first row sequence in the first sequence unit into the first line of the first preset file.
Further, in a possible implementation of this embodiment, the second processing module is further configured to:
perform the preset lossless compression on the read row sequence identifier and related description information of the first row sequence in the Nth sequence unit;
write the compressed row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the Nth line of the first preset file.
Further, in a possible implementation of this embodiment, the third processing module is further configured to:
perform the preset lossless compression on the read row sequence identifier and related description information of the second row sequence in the first sequence unit;
write the compressed row sequence identifier and related description information of the second row sequence in the first sequence unit into the first line of the second preset file.
Further, in a possible implementation of this embodiment, the fourth processing module is further configured to:
perform the preset lossless compression on the read row sequence identifier and related description information of the second row sequence in the Nth sequence unit;
write the compressed row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the Nth line of the second preset file.
Further, in a possible implementation of this embodiment, the fifth processing module is further configured to:
perform the preset lossless compression on the read row sequence identifier and related description information of the third row sequence in the first sequence unit;
write the compressed row sequence identifier and related description information of the third row sequence in the first sequence unit into the first line of the third preset file.
Further, in a possible implementation of this embodiment, the sixth processing module is further configured to:
perform the preset lossless compression on the read row sequence identifier and related description information of the third row sequence in the second sequence unit (that is, the Nth sequence unit);
write the compressed row sequence identifier and related description information of the third row sequence in the second sequence unit into the Nth line of the third preset file.
Further, in a possible implementation of this embodiment, the seventh processing module is further configured to:
perform the preset lossless compression on the read row sequence identifier and related description information of the fourth row sequence in the first sequence unit;
write the compressed row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first line of the fourth preset file.
Further, in a possible implementation of this embodiment, the eighth processing module is further configured to:
perform the preset lossless compression on the read row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit;
write the compressed row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the Nth line of the fourth preset file.
Further, in a possible implementation of this embodiment, as shown in Figure 11, the compression unit 1003 includes:
A determination module 10031, configured to determine, for each of the four preset files, whether it contains row sequences that have already been compressed;
A compression module 10032, configured to, when it is determined that the four preset files contain already-compressed row sequences, skip compression for those row sequences and perform the preset lossless compression only on the uncompressed row sequences.
Further, in a possible implementation of this embodiment, as shown in Figure 11, the apparatus further includes:
A reading unit 1004, configured to, after the compression unit 1003 triggers the preset lossless compression command to compress the four preset files respectively, read the row sequence identifier and related description information line by line, starting from the first line of each of the compressed first, second, third, and fourth preset files, wherein the order in which the first, second, third, and fourth preset files are read is consistent with the order of the four row sequences in a sequence unit;
A composition unit 1005, configured to compose one target sequence unit from the row sequence identifier and related description information read from each line;
A writing unit 1006, configured to write the target sequence unit into the same target FASTQ file;
until the row sequence identifier and related description information in the Nth line of the first, second, third, and fourth preset files have been read, so that the target FASTQ file serves as the restored FASTQ file to be processed.
Further, in a possible implementation of this embodiment, as shown in Figure 11, the apparatus further includes:
A control unit 1007, configured to, after the compression unit triggers the preset lossless compression command to compress the four preset files respectively, control synchronous reading of the row sequence identifier and related description information from the compressed first, second, third, and fourth preset files, from the first line to the last line;
A read-write unit 1008, configured to, each time a line is read, write the read results of the first, second, third, and fourth preset files, in that order, into the same target FASTQ file.
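The synchronous read-back described for the control unit and the read-write unit can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `restore_fastq` is hypothetical, the four inputs are assumed to be gzip-compressed text files holding one row per line, and they are assumed to be passed in the same order as the four rows of a sequence unit.

```python
import gzip


def restore_fastq(gz_paths, target_path):
    """Interleave line N of four compressed preset files into one FASTQ file.

    Hypothetical helper; gz_paths must follow the row order of a sequence unit.
    Returns the number of sequence units written.
    """
    handles = [gzip.open(p, "rt", encoding="ascii", newline="") for p in gz_paths]
    units = 0
    try:
        with open(target_path, "w", encoding="ascii", newline="") as dst:
            # zip advances all four files together, one line at a time,
            # and stops as soon as the shortest file is exhausted.
            for rows in zip(*handles):
                dst.writelines(rows)  # rows 1-4 of one target sequence unit
                units += 1
    finally:
        for handle in handles:
            handle.close()
    return units
```

A stricter variant would also check that all four files contain the same number of lines, since a mismatch would indicate that the preset files do not belong together.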
Further, in a possible implementation of this embodiment, as shown in Figure 11, the apparatus further includes:
A first encryption unit 1009, configured to perform a calculation on the FASTQ file to be processed according to a preset encryption algorithm to obtain a first encryption key;
A decryption unit 10010, configured to decompress the target FASTQ file using the decompression method corresponding to the preset lossless compression, to obtain a decompressed target FASTQ file;
A second encryption unit 10011, configured to perform an encryption calculation on the decompressed target FASTQ file using the preset encryption algorithm to obtain a second encryption key;
A determination unit 10012, configured to determine, according to whether the first encryption key and the second encryption key are consistent, whether the storage and restoration of the FASTQ data to be processed incurred any loss.
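The consistency check between the two computed values can be illustrated with a short sketch. The passage above does not name the preset algorithm, so SHA-256 from Python's `hashlib` is used here purely as an assumed stand-in; the function names `file_digest` and `is_lossless` are likewise hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return a hex digest of a file's bytes (SHA-256 is an assumed stand-in)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large FASTQ files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_lossless(original_path, restored_path):
    """True when the original and restored files have identical digests."""
    return file_digest(original_path) == file_digest(restored_path)
```

Matching digests indicate that the restored file is byte-for-byte identical to the original; differing digests indicate that some step of the storage or restoration changed the data.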
It should be noted that the foregoing explanations of the method embodiments also apply to the apparatus of this embodiment; the principle is the same and is not further limited in this embodiment.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
Figure 12 shows a schematic block diagram of an example electronic device 1200 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in Figure 12, the device 1200 includes a computing unit 1201, which can perform various appropriate actions and processes according to a computer program stored in a ROM (Read-Only Memory) 1202 or a computer program loaded from a storage unit 1208 into a RAM (Random Access Memory) 1203. The RAM 1203 can also store various programs and data required for the operation of the device 1200. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to one another via a bus 1204. An I/O (Input/Output) interface 1205 is also connected to the bus 1204.
Multiple components in the device 1200 are connected to the I/O interface 1205, including: an input unit 12012, such as a keyboard or a mouse; an output unit 1207, such as various types of displays and speakers; a storage unit 1208, such as a magnetic disk or an optical disc; and a communication unit 1209, such as a network card, a modem, or a wireless communication transceiver. The communication unit 1209 allows the device 1200 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunications networks.
The computing unit 1201 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units that run machine learning model algorithms, a DSP (Digital Signal Processor), and any appropriate processor, controller, microcontroller, and the like. The computing unit 1201 performs the methods and processes described above, such as the FASTQ data processing method. For example, in some embodiments, the FASTQ data processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured to perform the aforementioned FASTQ data processing method in any other suitable manner (for example, by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuit systems, integrated circuit systems, FPGAs (Field Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), ASSPs (Application Specific Standard Products), SOCs (Systems On Chip), CPLDs (Complex Programmable Logic Devices), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor, and may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by, or in combination with, an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, RAM, ROM, EPROM (Erasable Programmable Read-Only Memory) or flash memory, an optical fiber, a CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer that has: a display device for displaying information to the user (for example, a CRT (Cathode-Ray Tube) or LCD (Liquid Crystal Display) monitor); and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes back-end components (for example, as a data server), or a computing system that includes middleware components (for example, an application server), or a computing system that includes front-end components (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, and blockchain networks.
A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the defects of difficult management and weak business scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be noted that artificial intelligence is the discipline that studies how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed herein can be achieved; no limitation is imposed here.
The specific embodiments described above do not limit the scope of protection of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims (21)

  1. A method for processing FASTQ data, characterized by comprising:
    splitting a FASTQ file to be processed into at least one sequence unit according to a preset data format, wherein each sequence unit includes four row sequences and different row sequences correspond to different row sequence identifiers;
    storing the four row sequences in four corresponding preset files according to the row sequence identifiers, wherein one preset file is used to store the row sequences of different sequence units that have the same row sequence identifier;
    triggering a preset lossless compression command to compress the four preset files respectively.
  2. The processing method according to claim 1, characterized in that storing the four row sequences in the four corresponding preset files according to the row sequence identifiers comprises:
    reading, in order from the first sequence unit to the last sequence unit of the FASTQ file to be processed, the row sequences that have the same row sequence identifier in different sequence units;
    writing the read row sequences that have the same row sequence identifier into the same preset file in the order in which they were read.
  3. The processing method according to any one of claims 1-2, characterized in that storing the four row sequences in the four corresponding preset files according to the row sequence identifiers comprises:
    reading the first row sequence in the first sequence unit of the FASTQ file to be processed, writing the row sequence identifier and related description information of the first row sequence in the first sequence unit into the first line of the first preset file, and adding a preset delimiter, wherein the at least one sequence unit includes a first sequence unit and a second sequence unit, the first sequence unit is the first sequence unit of the FASTQ file to be processed, the second sequence unit is the Nth sequence unit of the FASTQ file to be processed, and N is an integer greater than 1;
    reading the first row sequence in the Nth sequence unit, writing the row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the Nth line of the first preset file, and adding a preset delimiter;
    reading the second row sequence in the first sequence unit, writing the row sequence identifier and related description information of the second row sequence in the first sequence unit into the first line of the second preset file, and adding a preset delimiter;
    reading the second row sequence in the Nth sequence unit, writing the row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the Nth line of the second preset file, and adding a preset delimiter;
    reading the third row sequence in the first sequence unit, writing the row sequence identifier and related description information of the third row sequence in the first sequence unit into the first line of the third preset file, and adding a preset delimiter;
    reading the third row sequence in the Nth sequence unit, writing the row sequence identifier and related description information of the third row sequence in the second sequence unit into the Nth line of the third preset file, and adding a preset delimiter;
    reading the fourth row sequence in the first sequence unit, writing the row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first line of the fourth preset file, and adding a preset delimiter;
    reading the fourth row sequence in the Nth sequence unit, writing the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the Nth line of the fourth preset file, and adding a preset delimiter;
    until all sequence units of the FASTQ file to be processed have been written, in order, into the first preset file, the second preset file, the third preset file, and the fourth preset file.
  4. The processing method according to any one of claims 1-3, characterized in that triggering the preset lossless compression command to compress the four preset files respectively comprises:
    in response to a triggered first preset lossless compression command, performing compression on the first preset file;
    in response to a triggered second preset lossless compression command, performing compression on the second preset file;
    in response to a triggered third preset lossless compression command, performing compression on the third preset file;
    in response to a triggered fourth preset lossless compression command, performing compression on the fourth preset file.
  5. The processing method according to any one of claims 1-4, characterized in that the preset lossless compression command includes a gzip compression command or a pigz compression command.
  6. The processing method according to claim 3, characterized in that reading the first row sequence in the first sequence unit of the FASTQ file to be processed and writing the row sequence identifier and related description information of the first row sequence in the first sequence unit into the first line of the first preset file comprises:
    performing the preset lossless compression on the read row sequence identifier and related description information of the first row sequence in the first sequence unit;
    writing the compressed row sequence identifier and related description information of the first row sequence in the first sequence unit into the first line of the first preset file.
  7. The processing method according to claim 3, characterized in that reading the first row sequence in the Nth sequence unit and writing the row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the Nth line of the first preset file comprises:
    performing the preset lossless compression on the read row sequence identifier and related description information of the first row sequence in the Nth sequence unit;
    writing the compressed row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the Nth line of the first preset file.
  8. The processing method according to claim 3, characterized in that reading the second row sequence in the first sequence unit, and writing the row sequence identifier and related description information of the second row sequence in the first sequence unit into the first row of the second preset file, comprises:
    performing preset lossless compression on the read row sequence identifier and related description information of the second row sequence in the first sequence unit;
    writing the compressed row sequence identifier and related description information of the second row sequence in the first sequence unit into the first row of the second preset file.
  9. The processing method according to claim 3, characterized in that reading the second row sequence in the Nth sequence unit, and writing the row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the Nth row of the second preset file, comprises:
    performing preset lossless compression on the read row sequence identifier and related description information of the second row sequence in the Nth sequence unit;
    writing the compressed row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the Nth row of the second preset file.
  10. The processing method according to claim 3, characterized in that reading the third row sequence in the first sequence unit, and writing the row sequence identifier and related description information of the third row sequence in the first sequence unit into the first row of the third preset file, comprises:
    performing preset lossless compression on the read row sequence identifier and related description information of the third row sequence in the first sequence unit;
    writing the compressed row sequence identifier and related description information of the third row sequence in the first sequence unit into the first row of the third preset file.
  11. The processing method according to claim 3, characterized in that reading the third row sequence in the Nth sequence unit, and writing the row sequence identifier and related description information of the third row sequence in the second sequence unit into the Nth row of the third preset file, comprises:
    performing preset lossless compression on the read row sequence identifier and related description information of the third row sequence in the second sequence unit;
    writing the compressed row sequence identifier and related description information of the third row sequence in the second sequence unit into the Nth row of the third preset file.
  12. The processing method according to claim 3, characterized in that reading the fourth row sequence in the first sequence unit, and writing the row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first row of the fourth preset file, comprises:
    performing preset lossless compression on the read row sequence identifier and related description information of the fourth row sequence in the first sequence unit;
    writing the compressed row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first row of the fourth preset file.
  13. The processing method according to claim 3, characterized in that reading the fourth row sequence in the Nth sequence unit, and writing the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the Nth row of the fourth preset file, comprises:
    performing preset lossless compression on the read row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit;
    writing the compressed row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the Nth row of the fourth preset file.
  14. The method according to any one of claims 6 to 13, characterized in that triggering the preset lossless compression command to compress the four preset files respectively comprises:
    determining, for each of the four preset files, whether the file contains row sequences that are already compressed;
    if so, skipping compression for the row sequences that are already compressed and performing the preset lossless compression on the uncompressed row sequences.
  15. The processing method according to any one of claims 1 to 14, characterized in that, after triggering the preset lossless compression command to compress the four preset files respectively, the method further comprises:
    starting from the first row of each of the compressed first, second, third and fourth preset files, reading the row sequence identifiers and related description information in turn, wherein the order in which the first, second, third and fourth preset files are read is consistent with the order of the four row sequences in the sequence unit;
    forming a target sequence unit from the row sequence identifier and related description information read from each row;
    writing the target sequence unit into the same target FASTQ file;
    continuing until the row sequence identifier and related description information in the Nth row of the first, second, third and fourth preset files have been read, so that the target FASTQ file is the restored FASTQ file to be processed.
  16. The processing method according to any one of claims 1 to 14, characterized in that, after triggering the preset lossless compression command to compress the four preset files respectively, the method comprises:
    synchronously reading the row sequence identifiers and related description information from the compressed first, second, third and fourth preset files, row by row from the first row to the last row;
    each time a row is read, writing the read results of the first, second, third and fourth preset files, in that order, into the same target FASTQ file.
  17. The processing method according to claim 15 or 16, characterized in that the method further comprises:
    performing a calculation on the FASTQ file to be processed according to a preset encryption algorithm to obtain a first encryption key;
    decompressing the target FASTQ file using the decompression method corresponding to the preset lossless compression to obtain a decompressed target FASTQ file;
    performing an encryption calculation on the decompressed target FASTQ file using the preset encryption algorithm to obtain a second encryption key;
    determining, according to whether the first encryption key and the second encryption key are consistent, whether any loss occurred in the storage and restoration of the FASTQ data to be processed.
  18. A FASTQ data processing apparatus, characterized by comprising:
    a splitting unit, configured to split a FASTQ file to be processed into at least one sequence unit according to a preset data format, wherein each sequence unit comprises four row sequences and different row sequences correspond to different row sequence identifiers;
    a storage unit, configured to store the four row sequences in four corresponding preset files according to the row sequence identifiers, wherein each preset file stores the row sequences that share the same row sequence identifier across different sequence units;
    a compression unit, configured to trigger a preset lossless compression command to compress the four preset files respectively.
  19. An electronic device, characterized by comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1 to 17.
  20. A non-transitory computer-readable storage medium storing computer instructions, characterized in that the computer instructions are used to cause a computer to perform the method according to any one of claims 1 to 17.
  21. A computer program product, characterized by comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 17.
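
As an illustration only, the split, compress, restore and verify steps recited in the claims above might be sketched in Python as follows. The function names (`split_fastq`, `restore_fastq`, `digest_of`), the output file names, and the use of Python's `gzip` module in place of an external gzip or pigz command are assumptions made for this sketch and do not come from the application. SHA-256 stands in for the unspecified "preset encryption algorithm". The sketch also assumes a well-formed input whose line count is a multiple of four.

```python
import gzip
import hashlib
from pathlib import Path


def split_fastq(fastq_path, out_dir):
    """Write row k of every four-row sequence unit to the k-th gzip file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical file names: one "preset file" per row position.
    parts = [out_dir / f"row{k}.txt.gz" for k in range(1, 5)]
    outs = [gzip.open(p, "wb") for p in parts]
    try:
        with open(fastq_path, "rb") as src:
            for i, line in enumerate(src):
                # Row i of the input belongs to preset file (i mod 4).
                outs[i % 4].write(line)
    finally:
        for out in outs:
            out.close()
    return parts


def restore_fastq(parts, target_path):
    """Read the four files in step and interleave their rows."""
    ins = [gzip.open(p, "rb") for p in parts]
    try:
        with open(target_path, "wb") as dst:
            # zip() yields one row from each file per iteration,
            # i.e. one restored sequence unit.
            for unit in zip(*ins):
                dst.writelines(unit)
    finally:
        for f in ins:
            f.close()


def digest_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

Under these assumptions, a round trip would call `split_fastq` on the input, `restore_fastq` on the four resulting files, and then compare `digest_of` for the original and restored files. Equal digests indicate that no data was lost.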

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2022/095757 WO2023226036A1 (en) 2022-05-27 2022-05-27 Fastq data processing method and apparatus, electronic device, and storage medium
CN202280054965.1A CN117795855A (en) 2022-05-27 2022-05-27 FASTQ data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/095757 WO2023226036A1 (en) 2022-05-27 2022-05-27 Fastq data processing method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023226036A1 true WO2023226036A1 (en) 2023-11-30

Family

ID=88918251

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/095757 WO2023226036A1 (en) 2022-05-27 2022-05-27 Fastq data processing method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN117795855A (en)
WO (1) WO2023226036A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130166518A1 (en) * 2011-12-24 2013-06-27 Tata Consultancy Services Limited Compression Of Genomic Data File
CN103384884A (en) * 2012-12-11 2013-11-06 华为技术有限公司 File compression method and device, file decompression method and device, and server
CN104657627A (en) * 2013-11-18 2015-05-27 广州中国科学院软件应用技术研究所 Searching and determining method and system started from FASTQ format read segment
CN106100641A (en) * 2016-06-12 2016-11-09 深圳大学 Multithreading quick storage lossless compression method and system thereof for FASTQ data
CN107565975A (en) * 2017-08-30 2018-01-09 武汉古奥基因科技有限公司 The method of FASTQ formatted file Lossless Compressions

Also Published As

Publication number Publication date
CN117795855A (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN107229420B (en) Data storage method, reading method, deleting method and data operating system
WO2022262183A1 (en) Federated computing processing method and apparatus, electronic device, and storage medium
US20200341670A1 (en) Method, device, and computer readable medium for data deduplication
US11514003B2 (en) Data compression based on key-value store
WO2014094479A1 (en) Method and device for deleting duplicate data
CN107850983B (en) Computer system, storage device and data management method
CN107273542B (en) High-concurrency data synchronization method and system
CN104077380A (en) Method and device for deleting duplicated data and system
EP4053770A1 (en) Schedule information acquiring method, apparatus, device, storage medium and program
JP2022159405A (en) Method and device for appending data, electronic device, storage medium, and computer program
US10802719B2 (en) Method and system for data compression and data storage optimization
CN115934414A (en) Data backup method, data recovery method, device, equipment and storage medium
CN113254267B (en) Data backup method and device for distributed database
US11017155B2 (en) Method and system for compressing data
WO2023226036A1 (en) Fastq data processing method and apparatus, electronic device, and storage medium
WO2021082926A1 (en) Data compression method and apparatus
CN113377391B (en) Method, device, equipment and medium for making and burning image file
CN115639966A (en) Data writing method and device, terminal equipment and storage medium
CN112860376B (en) Snapshot chain manufacturing method and device, electronic equipment and storage medium
US20220269659A1 (en) Method, device and storage medium for deduplicating entity nodes in graph database
CN113326038B (en) Method, apparatus, device, storage medium and program product for providing service
WO2024020746A1 (en) Method and apparatus for processing fastq data, and electronic device and storage medium
CN115132186A (en) End-to-end speech recognition model training method, speech decoding method and related device
US20210278977A1 (en) Method and system for performing data deduplication and compression in a data cluster
CN113836157A (en) Method and device for acquiring incremental data of database

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22943234

Country of ref document: EP

Kind code of ref document: A1