WO2023226036A1 - FASTQ data processing method and apparatus, electronic device and storage medium - Google Patents

FASTQ data processing method and apparatus, electronic device and storage medium

Publication number
WO2023226036A1
WO2023226036A1 PCT/CN2022/095757 CN2022095757W WO2023226036A1 WO 2023226036 A1 WO2023226036 A1 WO 2023226036A1 CN 2022095757 W CN2022095757 W CN 2022095757W WO 2023226036 A1 WO2023226036 A1 WO 2023226036A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
row
file
preset
unit
Application number
PCT/CN2022/095757
Other languages
English (en)
Chinese (zh)
Inventor
邓天全
姜三杰
陈世璇
贺丽娟
杨鑫
黎剑波
Original Assignee
深圳华大基因科技服务有限公司
武汉华大基因技术服务有限公司
Application filed by 深圳华大基因科技服务有限公司 and 武汉华大基因技术服务有限公司
Priority to CN202280054965.1A (CN117795855A)
Priority to PCT/CN2022/095757 (WO2023226036A1)
Publication of WO2023226036A1

Classifications

    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/46: Conversion to or from run-length codes, i.e. by representing the number of consecutive digits, or groups of digits, of the same kind by a code word and a digit indicative of that kind

Definitions

  • the present disclosure relates to the field of data processing technology, and in particular, to a FASTQ data processing method and device, electronic equipment and storage media.
  • the FASTQ files produced by sequencing are often compressed directly with the gzip command.
  • the compression principle of gzip is: when two pieces of FASTQ data have the same content, the content of the later piece can be determined from the position and size of the earlier piece, so the later piece can be replaced with a pair of information (the distance between the two, the length of the same content). Since this pair of information is smaller than the content it replaces, the FASTQ file is compressed.
  • the present disclosure provides a FASTQ data processing method, device, electronic equipment and storage medium. Its main purpose is to use the similarity of each line in the sequence unit in the FASTQ file to classify and store the files by lines, and perform preset lossless compression on the classified files to further save the storage space of the FASTQ file.
  • a method for processing FASTQ data including:
  • Each sequence unit includes four row sequences, and different row sequences correspond to different row sequence identifiers;
  • storing the four line sequences in corresponding four default files according to the line sequence identifiers includes:
  • the first line and add a preset delimiter, wherein the at least one sequence unit includes a first sequence unit and a second sequence unit, the first sequence unit is the first sequence unit of the FASTQ file to be processed, the second sequence unit is the Nth sequence unit of the FASTQ file to be processed, and N is an integer greater than 1;
  • triggering the preset lossless compression command to compress the four preset files respectively includes:
  • the preset lossless compression command includes a gzip compression command or a pigz compression command.
  • the first line in a default file contains:
  • the step of reading the first row sequence in the Nth sequence unit is to write the row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the first default file.
  • Line N includes:
  • the second row sequence in the first sequence unit is read, and the row sequence identifier and related description information of the second row sequence in the first sequence unit are written into the second default file.
  • the first line includes:
  • Line N includes:
  • the step of reading the third row sequence in the first sequence unit is to write the row sequence identifier and related description information of the third row sequence in the first sequence unit into the third preset file.
  • the first line includes:
  • the step of reading the third row sequence in the Nth sequence unit is to write the row sequence identifier and related description information of the third row sequence in the second sequence unit into the third preset file.
  • Line N includes:
  • the step of reading the fourth row sequence in the first sequence unit is to write the row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the fourth preset file.
  • the first line includes:
  • the step of reading the fourth row sequence in the Nth sequence unit is to write the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the fourth default file.
  • Line N includes:
  • triggering the preset lossless compression command to compress the four preset files respectively includes:
  • the compressed line sequence will no longer be compressed, and the uncompressed line sequence will be preset for lossless compression.
  • the method further includes:
  • starting from the first line in the compressed first preset file, second preset file, third preset file and fourth preset file, sequentially read the line sequence identifier and related description information, wherein the order of reading the first preset file, the second preset file, the third preset file and the fourth preset file is consistent with the order of the four lines of the sequence unit;
  • a target sequence unit is formed
  • the target FASTQ file is the restored FASTQ file to be processed.
  • the method also includes:
  • a FASTQ data processing device including:
  • the splitting unit is used to split the FASTQ file to be processed into at least one sequence unit according to the preset data format.
  • Each sequence unit includes four row sequences, and different row sequences correspond to different row sequence identifiers;
  • a storage unit configured to store four row sequences in corresponding four preset files according to the row sequence identifiers, wherein one default file is used to store row sequences of different sequence units with the same row sequence identifier;
  • a compression unit is used to trigger a preset lossless compression command to compress the four preset files respectively.
  • the storage includes:
  • a first reading module configured to sequentially read line sequences with the same line sequence identifier in different sequence units from the first sequence unit to the last sequence unit of the FASTQ file to be processed;
  • the first writing module is configured to write the read line sequence identified by the same line sequence into the same preset file according to the reading order.
  • the storage unit includes:
  • the first processing module is used to read the first row sequence in the first sequence unit of the FASTQ file to be processed, and write the row sequence identifier and related description information of the first row sequence in the first sequence unit.
  • the first sequence unit, the second sequence unit is the Nth sequence unit of the FASTQ file to be processed, and N is an integer greater than 1;
  • the second processing module is used to read the first row sequence in the Nth sequence unit, and write the row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the first default file.
  • the third processing module is used to read the second line sequence in the first sequence unit, and write the line sequence identifier and related description information of the second line sequence in the first sequence unit into the second default file. the first line and add a preset delimiter;
  • the fourth processing module is used to read the second row sequence in the Nth sequence unit, and write the row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the second default file.
  • the fifth processing module is used to read the third row sequence in the first sequence unit, and write the row sequence identifier and related description information of the third row sequence in the first sequence unit into the third default file. the first line and add a preset delimiter;
  • the sixth processing module is used to read the third row sequence in the Nth sequence unit, and write the row sequence identifier and related description information of the third row sequence in the second sequence unit into the third default file.
  • the seventh processing module is used to read the fourth row sequence in the first sequence unit, and write the row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the fourth default file. the first line and add a preset delimiter;
  • the eighth processing module is used to read the fourth row sequence in the Nth sequence unit, and write the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the fourth default file.
  • the compression unit is also used for:
  • the preset lossless compression command includes a gzip compression command or a pigz compression command.
  • the first processing module is also used for:
  • the second processing module is also used for:
  • the third processing module is also used for:
  • the fourth processing module is also used to include:
  • the fifth processing module is also used for:
  • the sixth processing module is also used for:
  • the seventh processing module is also used for:
  • the eighth processing module is also used for:
  • the compression unit includes:
  • a judgment module used to judge whether the four preset files contain compressed line sequences
  • a compression module configured to, when it is determined that the four preset files contain compressed line sequences, no longer perform compression processing on the compressed line sequences, and perform preset lossless compression on the uncompressed line sequences.
  • the device also includes:
  • the reading unit is configured to sequentially read the line sequence identifier and related description information, starting from the first line in the compressed first preset file, second preset file, third preset file and fourth preset file, wherein the order of reading the first preset file, the second preset file, the third preset file and the fourth preset file is consistent with the order of the four lines of the sequence unit;
  • a writing unit used to write the target sequence unit into the same target FASTQ file
  • the target FASTQ file is the restored FASTQ file to be processed.
  • the device also includes:
  • a control unit configured to, after the compression unit triggers a preset lossless compression command to compress the four preset files respectively, synchronously read the line sequence identifiers and related description information of the compressed first preset file, second preset file, third preset file and fourth preset file from the first line to the last line;
  • the reading and writing unit is used to sequentially write the reading results of the first preset file, the second preset file, the third preset file and the fourth preset file into the same target FASTQ file each time one line is read.
  • the device also includes:
  • the first encryption unit is used to calculate the FASTQ file to be processed according to the preset encryption algorithm to obtain the first encryption key
  • a decryption unit used to decompress the target FASTQ file using the decompression method corresponding to the preset lossless compression, and obtain the decompressed target FASTQ file;
  • a second encryption unit configured to perform encryption calculation on the decompressed target FASTQ file using the preset encryption algorithm to obtain a second encryption key
  • the determination unit determines whether there is a loss in the storage and restoration of the FASTQ data to be processed based on whether the first encryption key and the second encryption key are consistent.
  • an electronic device including:
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can perform the method described in the first aspect.
  • a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the method described in the first aspect.
  • a computer program product including a computer program that, when executed by a processor, implements the method described in the foregoing first aspect.
  • the FASTQ data processing method, device, electronic equipment and storage medium split the FASTQ file to be processed into at least one sequence unit according to the preset data format.
  • Each sequence unit includes four row sequences, and different row sequences correspond to different row sequence identifiers;
  • four row sequences are stored in corresponding four preset files according to the row sequence identifiers, wherein one preset file is used to store row sequences of different sequence units with the same row sequence identifier;
  • the preset lossless compression command is triggered to compress the four preset files respectively.
  • the embodiment of the present disclosure uses the similarity of each line in the sequence unit in the FASTQ file to perform classification and storage by line, that is, the four line sequences are stored in the corresponding four preset files according to the line sequence identifier, and preset lossless compression is performed on the four preset files respectively, further saving the storage space of the FASTQ file.
  • Figure 1 is a schematic flowchart of a FASTQ data processing method provided by an embodiment of the present disclosure
  • Figure 2 is a schematic diagram of the format of a FASTQ file provided by an embodiment of the present disclosure
  • Figure 3 is a schematic diagram of a first preset file storage line sequence provided by an embodiment of the present disclosure
  • Figure 4 is a schematic diagram of a second preset file storage line sequence provided by an embodiment of the present disclosure
  • Figure 5 is a schematic diagram of a third preset file storage line sequence provided by an embodiment of the present disclosure
  • Figure 6 is a schematic diagram of a fourth preset file storage line sequence provided by an embodiment of the present disclosure.
  • FIG. 7 is a flow chart of a method for storing four line sequences in corresponding four default files according to an embodiment of the present disclosure
  • Figure 8 is a schematic flow chart of performing restoration processing on compressed FASTQ data provided by an embodiment of the present disclosure
  • Figure 9 is a schematic flowchart of performing restoration processing on compressed FASTQ data provided by an embodiment of the present disclosure.
  • Figure 10 is a schematic structural diagram of a FASTQ data processing device provided by an embodiment of the present disclosure.
  • FIG 11 is a schematic structural diagram of another FASTQ data processing device provided by an embodiment of the present disclosure.
  • FIG. 12 is a schematic block diagram of an example electronic device 1200 provided by an embodiment of the present disclosure.
  • Figure 1 is a schematic flowchart of a FASTQ data processing method provided by an embodiment of the present disclosure.
  • the method consists of the following steps:
  • Step 101 Split the FASTQ file to be processed into at least one sequence unit according to a preset data format.
  • Each sequence unit includes four row sequences, and different row sequences correspond to different row sequence identifiers.
  • FIG. 2 is a schematic diagram of the format of a FASTQ file provided by the embodiment of the present disclosure.
  • the FASTQ file is a standard file for storing original sequencing data.
  • Every four lines constitute an independent sequence unit (or sequence storage unit). As shown in Figure 2, every four lines represent one sequence unit, for a total of 4 sequence units.
  • the preset data format of each sequence unit is as follows:
  • the first line holds the sequence identifier and related description information, and is the unique identifier of each sequence unit;
  • the second line is the sequence, which consists of A, C, G, T and N, where A, C, G and T are base information and N is a placeholder used as a substitute when sequencing of a base fails;
  • the third line starts with ‘+’, followed by the row sequence identifier, relevant description information, or nothing.
  • the data shown in the example of the embodiment of the present invention only has ‘+’ in this line;
  • the fourth line is the quality information of the sequence, which corresponds to the bases in the second line of the sequence.
  • Each base corresponds to a quality value.
  • the quality value is expressed in ASCII code to measure the reliability of the sequenced base. The higher the quality value, the more reliable it is.
  • Figure 2 is only an illustrative example.
  • the embodiment of the present disclosure does not limit the number of sequence units in the FASTQ file to be processed and the content of each line in each sequence unit.
  • Figure 2 is given only to illustrate the format and make it easier to understand.
  • the FASTQ file to be processed is split into at least one sequence unit based on the preset data format of the sequence unit.
  • four sequence units are used as an example for explanation, but this is not intended to limit the method to a specific quantity.
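Step 101 can be sketched in a few lines. The following is a minimal Python illustration, not the applicant's implementation: the names `SequenceUnit` and `split_into_units` and the two sample records are hypothetical, and the only checks applied are the ones stated above (line 3 starts with '+', one quality character per base).

```python
from typing import Iterable, Iterator, NamedTuple


class SequenceUnit(NamedTuple):
    """One FASTQ sequence unit: four consecutive lines (hypothetical type)."""
    header: str     # line 1: sequence identifier and description
    bases: str      # line 2: A, C, G, T or N
    separator: str  # line 3: starts with '+'
    quality: str    # line 4: one ASCII quality character per base


def split_into_units(lines: Iterable[str]) -> Iterator[SequenceUnit]:
    """Group consecutive lines four at a time and sanity-check each group."""
    buffer = []
    for line in lines:
        buffer.append(line.rstrip("\n"))
        if len(buffer) == 4:
            unit = SequenceUnit(*buffer)
            if not unit.separator.startswith("+"):
                raise ValueError(f"line 3 must start with '+': {unit.separator!r}")
            if len(unit.bases) != len(unit.quality):
                raise ValueError("each base needs exactly one quality character")
            yield unit
            buffer = []
    if buffer:
        raise ValueError("line count is not a multiple of four")


# Hypothetical two-record input, made up for illustration.
sample = [
    "@read1 sample\n", "ACGTN\n", "+\n", "FFFF!\n",
    "@read2 sample\n", "GGCAT\n", "+\n", "FF:FF\n",
]
units = list(split_into_units(sample))
print(len(units), units[0].bases)
```

Yielding units one at a time keeps memory use flat, which matters because the input may hold many millions of records.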
  • Step 102 Store four row sequences in corresponding four preset files according to the row sequence identifiers, wherein one default file is used to store row sequences of different sequence units with the same row sequence identifier.
  • After the four lines of each sequence unit of the FASTQ file to be processed are classified, they are written to the four corresponding preset files. That is, the sequence identifier and related description information of the first line are written to the first preset file (File 1), the row sequence identifier and related description information of the second line are written to the second preset file (File 2), those of the third line are written to the third preset file (File 3), and those of the fourth line (quality information) are written to the fourth preset file (File 4).
  • Figures 3 to 6 are, respectively, schematic diagrams of the line sequences stored in the first preset file, the second preset file, the third preset file and the fourth preset file according to the embodiment of the present disclosure. As can be seen from Figures 3 to 6:
  • the data stored in the first preset file is the content related to the first line in different sequence units
  • the data stored in the second preset file is the content related to the second line in different sequence units
  • the data stored in the third preset file is the content related to the third line in different sequence units
  • the data stored in the fourth preset file is the content related to the fourth row in different sequence units.
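The classification in Step 102 amounts to sending row k of every sequence unit to the k-th collection. A minimal in-memory Python sketch follows; `classify_rows` and the sample units are hypothetical names and data, and lists stand in for the four preset files.

```python
from typing import List, Sequence, Tuple


def classify_rows(units: Sequence[Tuple[str, str, str, str]]) -> List[List[str]]:
    """Return four lists; list k holds row k+1 of every unit, in input order."""
    rows: List[List[str]] = [[], [], [], []]
    for unit in units:
        for index, line in enumerate(unit):
            rows[index].append(line)
    return rows


# Hypothetical sequence units, made up for illustration.
units = [
    ("@read1", "ACGTN", "+", "FFFF!"),
    ("@read2", "GGCAT", "+", "FF:FF"),
]
file1, file2, file3, file4 = classify_rows(units)
print(file3)
```

Because every list preserves input order, line i of each list belongs to sequence unit i, which is what later makes restoration possible without any extra index.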
  • Step 103 Trigger a preset lossless compression command to compress the four preset files respectively.
  • the similarity between each line sequence written in four preset files is utilized, combined with the compression principle of gzip, to further reduce the space for storing FASTQ data.
  • the preset lossless compression command includes a gzip compression command or a pigz compression command.
  • Subsequent embodiments take the gzip compression command as an example for explanation. However, this is not intended to limit the compression method to the gzip compression command only.
  • the compression principle of gzip is: when FASTQ data contains two pieces of identical content, the content of the later piece can be determined from the position and size of the earlier piece, so the later piece can be replaced with a pair of information (the distance between the two, the length of the same content). Since this pair of information is smaller than the content it replaces, the FASTQ file is compressed.
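The (distance, length) replacement described above can be illustrated with a toy decoder. This is a simplified sketch of the back-reference idea only: the token format and the name `expand` are hypothetical, and real gzip (DEFLATE) additionally Huffman-codes its tokens, which is omitted here.

```python
from typing import List, Tuple, Union

# A token is either literal text or a (distance, length) back-reference.
Token = Union[str, Tuple[int, int]]


def expand(tokens: List[Token]) -> str:
    """Rebuild the text by copying `length` characters from `distance` back."""
    out: List[str] = []
    for token in tokens:
        if isinstance(token, str):
            out.extend(token)  # literal characters are copied as-is
        else:
            distance, length = token
            start = len(out) - distance
            for offset in range(length):
                # Copy one character at a time so a reference may overlap
                # the text it is currently producing.
                out.append(out[start + offset])
    return "".join(out)


# The second "ACGT\n" is stored as "go back 5 characters, copy 5".
print(repr(expand(["ACGT\n", (5, 5)])))
# A short pattern repeated many times needs only one literal and one pair.
print(repr(expand(["+\n", (2, 6)])))
```

The shorter the distance between two identical pieces, the more likely a compressor with a bounded search window is to find the match, which is the motivation for grouping similar lines together.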
  • the third preset file only contains the '+' at the beginning of the third line of the sequence unit.
  • the gzip command is triggered to perform compression processing.
  • the similarity between the corresponding line sequences of different sequence units is high, which makes the line sequences stored in each of the four preset files highly similar. Therefore, when the gzip command is triggered to perform compression, the classification has shortened the distance between pieces of identical content, which achieves the purpose of saving space.
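One way to observe this effect is to compress the same records once interleaved and once row by row, then compare sizes. The sketch below uses Python's standard `gzip` module on synthetic reads that are made up for illustration; the header pattern is hypothetical, and the sizes printed depend entirely on the data, so no particular saving is claimed.

```python
import gzip

# Hypothetical synthetic reads, made up for illustration only.
units = [
    (f"@V1L1C001R{i:07d}/1", "ACGTTGCA" * 4, "+", "FFFFFFF:" * 4)
    for i in range(1000)
]

# The original layout: four lines per unit, units back to back.
interleaved = "".join("\n".join(unit) + "\n" for unit in units).encode()

# The classified layout: one byte stream per row position.
per_row = [
    "".join(unit[k] + "\n" for unit in units).encode() for k in range(4)
]

whole_size = len(gzip.compress(interleaved))
row_sizes = [len(gzip.compress(data)) for data in per_row]

print("interleaved:", whole_size)
print("per row    :", row_sizes, "total", sum(row_sizes))
```

The third stream consists only of repeated '+' lines, so it shrinks to a few dozen bytes; how the other three streams compare with the interleaved file has to be measured on real data.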
  • the FASTQ data processing method splits the FASTQ file to be processed into at least one sequence unit according to the preset data format.
  • Each sequence unit includes four row sequences. Different row sequences correspond to different row sequence identifiers.
  • the four line sequences are stored in the corresponding four preset files according to the line sequence identifier, wherein one preset file is used to store line sequences of different sequence units with the same line sequence identifier; the preset lossless compression command is triggered to compress the four preset files respectively.
  • the embodiment of the present disclosure uses the similarity of each line in the sequence unit in the FASTQ file to perform classification and storage by line, that is, the four line sequences are stored in the corresponding four preset files according to the line sequence identifier, and preset lossless compression is performed on the four preset files respectively, further saving the storage space of FASTQ files.
  • step 102 is performed to store the four line sequences in corresponding four default files according to the line sequence identifiers. This can be achieved in the following two ways:
  • Method 1: From the first sequence unit to the last sequence unit of the FASTQ file to be processed, read the row sequences with the same row sequence identifier in different sequence units in sequence, and write the row sequences with the same row sequence identifier to the same preset file in the order in which they were read.
  • Suppose the FASTQ file to be processed contains N sequence units (N is greater than 1). When the sequence units are read, the first line in sequence unit 1 is read and written to the first line of the first preset file; the second line in sequence unit 1 is read and written to the first line of the second preset file; the third line in sequence unit 1 is read and written to the first line of the third preset file; and the fourth line in sequence unit 1 is read and written to the first line of the fourth preset file. After sequence unit 1 has been read, the following sequence units up to sequence unit N are read in the same way, until all sequence units have been processed and the classified storage of the FASTQ file to be processed is complete.
  • Method 2: Read the first row sequence in the first sequence unit of the FASTQ file to be processed, write the row sequence identifier and related description information of the first row sequence in the first sequence unit into the first line of the first preset file, and add a preset delimiter, wherein the at least one sequence unit includes a first sequence unit and a second sequence unit, the first sequence unit is the first sequence unit of the FASTQ file to be processed, the second sequence unit is the Nth sequence unit of the FASTQ file to be processed, and N is an integer greater than 1;
  • Embodiments of the present disclosure provide a method for compressing four preset files respectively.
  • FIG. 7 is a flow chart of a method for storing four line sequences in corresponding four preset files according to an embodiment of the present disclosure. This method can be executed alone, in combination with any embodiment or possible implementation in the embodiments, or in combination with any technical solution in related technologies. As a possible way to perform step 101, this method includes:
  • Step 701 Read the row sequence identifier and related description information of the first row sequence in the first sequence unit;
  • Step 702 Read the row sequence identifier and related description information of the first row sequence in the Nth sequence unit;
  • Step 703 Read the row sequence identifier and related description information of the second row sequence in the first sequence unit;
  • Step 704 Read the row sequence identifier and related description information of the second row sequence in the Nth sequence unit.
  • Step 705 Read the row sequence identifier and related description information of the third row sequence in the first sequence unit;
  • Step 706 Read the row sequence identifier and related description information of the third row sequence in the Nth sequence unit;
  • Step 707 Read the row sequence identifier and related description information of the fourth row sequence in the first sequence unit;
  • Step 708 Read the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit;
  • compressing the four preset files respectively includes: determining, for each of the four preset files, whether it contains compressed line sequences; if so, the compressed line sequences are not compressed again, and preset lossless compression is performed on the uncompressed line sequences. This processing method ensures lossless restoration.
  • the method shown in Figure 7 is to perform a compression process after reading a line sequence in each sequence unit, and write the compressed line sequence into the corresponding preset file.
  • the following embodiments also provide another possible way to perform step 103: through any of the above embodiments, the four line sequences are stored in the corresponding four preset files according to the line sequence identifiers, and the different preset files are then compressed separately. Specifically:
  • FIG. 8 is a schematic flowchart of a method of performing restoration processing on compressed FASTQ data provided by the embodiment of the present disclosure. This method can be executed alone, in conjunction with any embodiment in the present disclosure or possible implementation in the embodiments, or in conjunction with any technical solution in related technologies, and includes:
  • Step 801 Starting from the first line in the compressed first preset file, second preset file, third preset file and fourth preset file, sequentially read the line sequence identifier and related description information, wherein the order of reading the first preset file, the second preset file, the third preset file and the fourth preset file is consistent with the order of the four lines of the sequence unit.
  • N is greater than 1.
  • the target FASTQ file is the restored FASTQ file to be processed.
  • Step 802 Compose a target sequence unit based on the read row sequence identifier and related description information of each row.
  • Step 803 Write the target sequence unit into the same target FASTQ file.
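Steps 801 to 803 can be sketched as an interleave of the four row collections. The Python below is an in-memory illustration under the assumption that the four preset files have already been decompressed into lists of lines; `interleave_rows` and the sample data are hypothetical.

```python
from typing import List, Sequence


def interleave_rows(rows: Sequence[Sequence[str]]) -> List[str]:
    """Interleave four row lists back into FASTQ line order."""
    if len(rows) != 4:
        raise ValueError("expected exactly four row lists")
    if len({len(row) for row in rows}) != 1:
        raise ValueError("row lists must all have the same number of lines")
    restored: List[str] = []
    # Steps 801/802: take line i from files 1, 2, 3, 4 in that order,
    # which forms target sequence unit i.
    for target_unit in zip(*rows):
        # Step 803: append the unit to the single target output.
        restored.extend(target_unit)
    return restored


# Hypothetical row contents, made up for illustration.
rows = [
    ["@read1", "@read2"],
    ["ACGTN", "GGCAT"],
    ["+", "+"],
    ["FFFF!", "FF:FF"],
]
print("\n".join(interleave_rows(rows)))
```

The length check guards against a truncated row file, which would otherwise silently drop the trailing sequence units.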
  • Figure 9 is a schematic flowchart of another method of performing restoration processing on compressed FASTQ data provided by an embodiment of the present disclosure. This method can be executed alone, in conjunction with any embodiment in the present disclosure or possible implementation in the embodiments, or in conjunction with any technical solution in related technologies, and includes:
  • Step 901 Synchronously read the line sequence identifiers and related description information of the compressed first preset file, second preset file, third preset file and fourth preset file from the first line to the last line. Controlling this synchronization by program when reading the four files improves reading efficiency, and thus the efficiency of restoring FASTQ files.
  • Step 902 Each time a line is read, the reading results of the first default file, the second default file, the third default file, and the fourth default file are sequentially written into the same target FASTQ file.
  • the embodiment of the present disclosure conducts verification through the following method.
  • the method includes: calculating the FASTQ file to be processed according to a preset encryption algorithm to obtain a first encryption key; decompressing the target FASTQ file using the decompression method corresponding to the preset lossless compression to obtain the decompressed target FASTQ file; performing the same calculation on the decompressed target FASTQ file using the preset encryption algorithm to obtain a second encryption key; and determining, based on whether the first encryption key and the second encryption key are consistent, whether there is any loss in the storage and restoration of the FASTQ data to be processed.
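A minimal Python sketch of this check follows. It assumes MD5 as the preset algorithm, as in Example 1 of this document, and uses a tiny in-memory FASTQ made up for illustration; the helper name `md5_hex` is hypothetical.

```python
import gzip
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as 32 hexadecimal characters."""
    return hashlib.md5(data).hexdigest()


# Hypothetical FASTQ content, made up for illustration.
original = b"@read1\nACGTN\n+\nFFFF!\n@read2\nGGCAT\n+\nFF:FF\n"
first_digest = md5_hex(original)  # value computed before storage

# Classified storage: every 4th line starting at row k, gzip-compressed.
lines = original.splitlines(keepends=True)
compressed_rows = [gzip.compress(b"".join(lines[k::4])) for k in range(4)]

# Restoration: decompress each row stream and interleave line by line.
row_lines = [
    gzip.decompress(blob).splitlines(keepends=True) for blob in compressed_rows
]
restored = b"".join(b"".join(unit) for unit in zip(*row_lines))
second_digest = md5_hex(restored)  # value computed after restoration

print("lossless" if first_digest == second_digest else "mismatch")
```

The digests are taken over the decompressed text rather than over the `.gz` bytes, because a gzip header can carry a timestamp and so two archives of identical content need not be byte-identical.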
  • Example 1 provides an application example of using the above-mentioned FASTQ data processing method to save storage space on a FASTQ compressed file with a size of 21789495522bp and a file name of "V350027954_L01_read_1.fq.gz".
  • the decompressed file is "V350027954_L01_read_1.fq", and its MD5 value, computed with the MD5 message-digest algorithm, is "ca4168a17d0510a5d3f51fa6856d1888". After the classified compressed storage of the present invention and the subsequent restoration, this MD5 value can be compared with that of the restored file; if the two are consistent, the reliability of the above implementation is verified.
  • the programming language is perl.
  • the specific implementation code is as follows:
  • Example 2 The following provides an application example of using the method of the embodiment of the present invention to restore the compressed files Row1.gz, Row2.gz, Row3.gz and Row4.gz obtained by classification and storage in the above Example 1 to the original file V350027954_L01_read_1.fq.gz.
  • the programming language is perl, and the specific implementation code is as follows:
  • the present invention also proposes a FASTQ data processing device. Since the device embodiment of the present invention corresponds to the above-mentioned method embodiment, details not disclosed in the device embodiment can be referred to the above-mentioned method embodiment, and will not be described again in the present invention.
  • An embodiment of the present disclosure provides a FASTQ data processing device, as shown in Figure 10, including:
  • the splitting unit 1001 is used to split the FASTQ file to be processed into at least one sequence unit according to the preset data format.
  • Each sequence unit includes four row sequences, and different row sequences correspond to different row sequence identifiers;
  • the storage unit 1002 is used to store the four row sequences in four corresponding preset files according to the row sequence identifiers, wherein one preset file is used to store the row sequences of different sequence units that have the same row sequence identifier;
  • the compression unit 1003 is configured to trigger a preset lossless compression command to compress the four preset files respectively.
  • the FASTQ data processing device splits the FASTQ file to be processed into at least one sequence unit according to the preset data format, where each sequence unit includes four row sequences and different row sequences correspond to different row sequence identifiers. According to the row sequence identifiers, it stores the four row sequences in four corresponding preset files, wherein one preset file is used to store the row sequences of different sequence units that have the same row sequence identifier. It then triggers the preset lossless compression command to compress the four preset files respectively.
  • Compared with directly triggering a gzip command to compress the FASTQ file, the embodiment of the present disclosure exploits the similarity of each row across the sequence units in the FASTQ file and classifies and stores the data by row: the four row sequences are stored in four corresponding preset files according to the row sequence identifiers, and preset lossless compression is performed on the four preset files respectively, further saving the storage space of FASTQ files.
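  • The split-store-compress flow of units 1001 to 1003 can be sketched as follows. This is a minimal Python illustration under several assumptions: the input is an uncompressed FASTQ file whose records are exactly four lines each, the output names Row1.gz to Row4.gz follow Example 2, and compression is done with Python's `gzip` module rather than by triggering an external command. The function name `split_fastq_by_row` is hypothetical.

```python
import gzip
from pathlib import Path


def split_fastq_by_row(fastq_path, out_dir):
    """Write row k of every four-line sequence unit to Row{k}.gz."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"Row{k}.gz" for k in range(1, 5)]
    # newline="" preserves the original line endings byte for byte.
    outs = [gzip.open(p, "wt", newline="") for p in paths]
    try:
        with open(fastq_path, "rt", newline="") as src:
            for index, line in enumerate(src):
                # The position of a line inside its sequence unit
                # (0 to 3) plays the role of the row sequence identifier.
                outs[index % 4].write(line)
    finally:
        for out in outs:
            out.close()
    return paths
```

  • Grouping rows of the same kind places similar content (read names, bases, separators, quality strings) next to each other, which is the similarity the disclosure relies on to improve the compression ratio.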
  • the storage unit 1002 includes:
  • a first reading module configured to sequentially read line sequences with the same line sequence identifier in different sequence units from the first sequence unit to the last sequence unit of the FASTQ file to be processed;
  • the first writing module is configured to write the read row sequences that have the same row sequence identifier into the same preset file, in the order in which they were read.
  • the storage unit includes:
  • the first processing module is used to read the first row sequence in the first sequence unit of the FASTQ file to be processed, write the row sequence identifier and related description information of the first row sequence in the first sequence unit into the first line of the first preset file, and add a preset delimiter, wherein the Nth sequence unit is a subsequent sequence unit of the FASTQ file to be processed, and N is an integer greater than 1;
  • the second processing module is used to read the first row sequence in the Nth sequence unit, and write the row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the first preset file;
  • the third processing module is used to read the second row sequence in the first sequence unit, write the row sequence identifier and related description information of the second row sequence in the first sequence unit into the first line of the second preset file, and add a preset delimiter;
  • the fourth processing module is used to read the second row sequence in the Nth sequence unit, and write the row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the second preset file;
  • the fifth processing module is used to read the third row sequence in the first sequence unit, write the row sequence identifier and related description information of the third row sequence in the first sequence unit into the first line of the third preset file, and add a preset delimiter;
  • the sixth processing module is used to read the third row sequence in the Nth sequence unit, and write the row sequence identifier and related description information of the third row sequence in the Nth sequence unit into the third preset file;
  • the seventh processing module is used to read the fourth row sequence in the first sequence unit, write the row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first line of the fourth preset file, and add a preset delimiter;
  • the eighth processing module is used to read the fourth row sequence in the Nth sequence unit, and write the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the fourth preset file.
  • the compression unit 1003 is also used to:
  • the preset lossless compression command includes a gzip compression command or a pigz compression command.
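  • Choosing between the two commands can be sketched as follows. This is an illustrative Python helper, not part of the disclosure: the names `compression_command` and `compress_file` are hypothetical, and the preference for pigz when it is installed is an assumption.

```python
import shutil
import subprocess


def compression_command(path, prefer_pigz=True):
    """Build the argument list for compressing one preset file.

    pigz is a parallel gzip implementation that produces
    gzip-compatible output; fall back to gzip when pigz is not on PATH.
    """
    use_pigz = prefer_pigz and shutil.which("pigz") is not None
    tool = "pigz" if use_pigz else "gzip"
    return [tool, str(path)]


def compress_file(path, prefer_pigz=True):
    """Run the command; by default both tools replace path with path.gz."""
    subprocess.run(compression_command(path, prefer_pigz), check=True)
```

  • Because pigz writes the gzip format, files compressed with either command can be decompressed the same way during restoration.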
  • the first processing module is also used to:
  • the second processing module is also used to:
  • the third processing module is also used to:
  • the fourth processing module is also configured to include:
  • the fifth processing module is also used to:
  • the sixth processing module is also used to:
  • the seventh processing module is also used to:
  • the eighth processing module is also used to:
  • the compression unit 1003 includes:
  • the determination module 10031 is used to determine whether the four preset files contain compressed row sequences;
  • the compression module 10032 is configured to, when it is determined that the four preset files contain compressed line sequences, no longer perform compression processing on the compressed line sequences, and perform preset lossless compression on the uncompressed line sequences.
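  • One way to make this determination is sketched below. It is a simplification: the disclosure speaks of compressed row sequences inside the preset files, whereas this sketch checks each preset file as a whole by looking for the two-byte gzip signature at its start. The function names are hypothetical.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def is_gzip_compressed(path):
    """True when the file starts with the gzip signature."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def files_needing_compression(paths):
    """Keep only the files that are not already gzip-compressed."""
    return [p for p in paths if not is_gzip_compressed(p)]
```

  • Skipping data that is already compressed avoids spending time on a second pass that would yield little or no further size reduction.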
  • the device further includes:
  • the reading unit 1004 is configured to, after the compression unit 1003 triggers a preset lossless compression command to compress the four preset files respectively, read the row sequence identifiers and related description information in sequence from the compressed first preset file, second preset file, third preset file and fourth preset file, wherein the order in which the first preset file, the second preset file, the third preset file and the fourth preset file are read is consistent with the order of the four rows of the sequence unit;
  • the composition unit 1005 is used to compose a target sequence unit according to the row sequence identifier and related description information of each row read;
  • the target FASTQ file is the restored FASTQ file to be processed.
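  • Composing one target sequence unit from the four rows read can be sketched as follows. The sanity checks are assumptions based on the common FASTQ layout (a header row starting with "@", a separator row starting with "+", and a quality row as long as the base row); they are not requirements stated in the disclosure. The function name `compose_sequence_unit` is hypothetical.

```python
def compose_sequence_unit(rows):
    """Join four rows (header, bases, separator, quality) into one unit.

    The rows are returned verbatim so that restoration stays lossless;
    the checks only guard against rows read in the wrong order.
    """
    if len(rows) != 4:
        raise ValueError("a sequence unit needs exactly four rows")
    header, bases, separator, quality = (r.rstrip("\r\n") for r in rows)
    if not header.startswith("@"):
        raise ValueError("first row should start with '@'")
    if not separator.startswith("+"):
        raise ValueError("third row should start with '+'")
    if len(bases) != len(quality):
        raise ValueError("base row and quality row differ in length")
    return "".join(rows)
```

  • For instance, `compose_sequence_unit(["@r1\n", "ACGT\n", "+\n", "IIII\n"])` returns the four lines joined into a single record.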
  • the control unit 1007 is configured to, after the compression unit triggers a preset lossless compression command to compress the four preset files respectively, control the compressed first preset file, second preset file, third preset file and fourth preset file to be read synchronously, from the first line to the last line, to obtain the row sequence identifiers and related description information;
  • the reading and writing unit 1008 is configured to, each time a line is read, write the reading results of the first preset file, the second preset file, the third preset file and the fourth preset file, in that order, into the same target FASTQ file.
  • the device further includes:
  • the first encryption unit 1009 is used to calculate the FASTQ file to be processed according to the preset encryption algorithm to obtain the first encryption key;
  • the decryption unit 10010 is used to decompress the target FASTQ file using the decompression method corresponding to the preset lossless compression, and obtain the decompressed target FASTQ file;
  • the second encryption unit 10011 is used to perform encryption calculation on the decompressed target FASTQ file using the preset encryption algorithm to obtain a second encryption key;
  • the determination unit 10012 is used to determine whether there is any loss in the storage and restoration of the FASTQ data to be processed, based on whether the first encryption key and the second encryption key are consistent.
  • the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • FIG. 12 illustrates a schematic block diagram of an example electronic device 1200 that may be used to implement embodiments of the present disclosure.
  • Electronic devices are intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • the device 1200 includes a computing unit 1201, which can perform various appropriate actions and processes according to a computer program stored in a ROM (Read-Only Memory) 1202 or a computer program loaded from a storage unit 1208 into a RAM (Random Access Memory) 1203. The RAM 1203 can also store various programs and data required for the operation of the device 1200.
  • Computing unit 1201, ROM 1202 and RAM 1203 are connected to each other via bus 1204.
  • An I/O (Input/Output) interface 1205 is also connected to bus 1204.
  • Multiple components in the device 1200 are connected to the I/O interface 1205, including: input unit 12012, such as a keyboard, mouse, etc.; output unit 1207, such as various types of displays, speakers, etc.; storage unit 1208, such as a magnetic disk, optical disk, etc.; and communication unit 1209, such as a network card, modem, wireless communication transceiver, etc.
  • the communication unit 1209 allows the device 1200 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
  • Computing unit 1201 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1201 include but are not limited to a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units that run machine learning model algorithms, a DSP (Digital Signal Processor), and any appropriate processor, controller, microcontroller, etc.
  • the computing unit 1201 performs various methods and processes described above, such as the processing method of FASTQ data.
  • the FASTQ data processing method may be implemented as a computer software program, which is tangibly embodied in a machine-readable medium, such as storage unit 1208.
  • part or all of the computer program may be loaded and/or installed onto device 1200 via ROM 1202 and/or communication unit 1209.
  • the computer program When the computer program is loaded into RAM 1203 and executed by computing unit 1201, one or more steps of the method described above may be performed.
  • the computing unit 1201 may be configured to perform the aforementioned processing method of FASTQ data in any other suitable manner (eg, by means of firmware).
  • These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor, which may be a special purpose or general purpose programmable processor, may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses or devices, or any suitable combination of the foregoing.
  • More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, RAM, ROM, EPROM (Erasable Programmable Read-Only Memory) or flash memory, optical fiber, CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device (e.g., a CRT (Cathode-Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input or tactile input).
  • the systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system may be interconnected by any form or medium of digital data communication (eg, a communications network). Examples of communication networks include: LAN (Local Area Network), WAN (Wide Area Network), the Internet, and blockchain networks.
  • Computer systems may include clients and servers. Clients and servers are generally remote from each other and typically interact over a communications network. The relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.
  • the server can be a cloud server, also known as a cloud computing server or cloud host. It is a host product in the cloud computing service system that addresses the defects of difficult management and weak business scalability found in traditional physical hosts and VPS ("Virtual Private Server") services.
  • the server can also be a distributed system server or a server combined with a blockchain.
  • artificial intelligence is the study of using computers to simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, planning, etc.). It has both hardware-level technology and software-level technology.
  • Artificial intelligence hardware technology generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, etc.; artificial intelligence software technology mainly includes computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning, big data processing technology, knowledge graph technology and other major directions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present disclosure relates to a FASTQ data processing method and apparatus, an electronic device, and a storage medium. A FASTQ file to be processed is split into at least one sequence unit according to a preset data format, each sequence unit comprising four row sequences, and different row sequences corresponding to different row sequence identifiers; the four row sequences are respectively stored in four corresponding preset files according to the row sequence identifiers, one preset file being used to store the row sequences that have the same row sequence identifier in different sequence units; a preset lossless compression command is triggered to compress the four preset files respectively. Compared with the prior-art approach of directly triggering a gzip command to compress a FASTQ file, the present disclosure classifies and stores the data by exploiting the similarity of each row across the sequence units of a FASTQ file; that is, the four row sequences are respectively stored in the four corresponding preset files according to the row sequence identifiers, and preset lossless compression is performed separately on the four preset files, so that the storage space of the FASTQ file is further saved.
PCT/CN2022/095757 2022-05-27 2022-05-27 Procédé et appareil de traitement de données fastq, dispositif électronique et support de stockage WO2023226036A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202280054965.1A CN117795855A (zh) 2022-05-27 2022-05-27 Fastq数据的处理方法及装置、电子设备和存储介质
PCT/CN2022/095757 WO2023226036A1 (fr) 2022-05-27 2022-05-27 Procédé et appareil de traitement de données fastq, dispositif électronique et support de stockage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/095757 WO2023226036A1 (fr) 2022-05-27 2022-05-27 Procédé et appareil de traitement de données fastq, dispositif électronique et support de stockage

Publications (1)

Publication Number Publication Date
WO2023226036A1 true WO2023226036A1 (fr) 2023-11-30

Family

ID=88918251

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/095757 WO2023226036A1 (fr) 2022-05-27 2022-05-27 Procédé et appareil de traitement de données fastq, dispositif électronique et support de stockage

Country Status (2)

Country Link
CN (1) CN117795855A (fr)
WO (1) WO2023226036A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130166518A1 (en) * 2011-12-24 2013-06-27 Tata Consultancy Services Limited Compression Of Genomic Data File
CN103384884A (zh) * 2012-12-11 2013-11-06 华为技术有限公司 一种文件压缩方法、文件解压缩方法、装置及服务器
CN104657627A (zh) * 2013-11-18 2015-05-27 广州中国科学院软件应用技术研究所 Fastq格式读段开头的寻找和判断方法以及系统
CN106100641A (zh) * 2016-06-12 2016-11-09 深圳大学 针对fastq数据的多线程快速存储无损压缩方法及其系统
CN107565975A (zh) * 2017-08-30 2018-01-09 武汉古奥基因科技有限公司 Fastq格式文件无损压缩的方法

Also Published As

Publication number Publication date
CN117795855A (zh) 2024-03-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22943234

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202280054965.1

Country of ref document: CN