WO2023226036A1

WO2023226036A1 - Fastq data processing method and apparatus, electronic device, and storage medium

Info

Publication number: WO2023226036A1
Application number: PCT/CN2022/095757
Authority: WO
Inventors: 邓天全; 姜三杰; 陈世璇; 贺丽娟; 杨鑫; 黎剑波
Original assignee: 深圳华大基因科技服务有限公司; 武汉华大基因技术服务有限公司
Priority date: 2022-05-27
Filing date: 2022-05-27
Publication date: 2023-11-30
Also published as: CN117795855A

Abstract

Disclosed in the present disclosure are a FASTQ data processing method and apparatus, an electronic device, and a storage medium. A FASTQ file to be processed is split into at least one sequence unit according to a preset data format, each sequence unit comprising four row sequences, and different row sequences corresponding to different row sequence identifiers; the four row sequences are respectively stored in corresponding four preset files according to the row sequence identifiers, wherein one preset file is used for storing row sequences having a same row sequence identifier and stored in different sequence units; a preset lossless compression command is triggered to respectively compress the four preset files. Compared with a mode in the prior art of directly triggering a gzip command to compress a FASTQ file, according to the present disclosure, classification and storage are carried out by utilizing the similarity of each row in sequence units in a FASTQ file, that is, the four row sequences are respectively stored in the corresponding four preset files according to the row sequence identifiers, and preset lossless compression is separately performed on the four preset files, so that a storage space of the FASTQ file is further saved.

Description

FASTQ data processing methods and devices, electronic equipment and storage media

Technical field

The present disclosure relates to the field of data processing technology, and in particular, to a FASTQ data processing method and device, electronic equipment and storage media.

Background art

In recent years, with the development of sequencing technology, sequencing prices have become lower and lower, resulting in a surge in sequencing data. How to effectively reduce the storage space of sequencing data has become an urgent problem that needs to be solved.

To relieve the storage pressure caused by FASTQ files, the raw FASTQ files produced by sequencing are often compressed directly with the gzip command. The compression principle of gzip is as follows: when two pieces of FASTQ data have the same content, the content of the later piece can be determined from the position and size of the earlier piece. The later piece can therefore be replaced with a pair of values (the distance between the two pieces, the length of the identical content). Because this pair of values is smaller than the content it replaces, the FASTQ file is compressed.

Although compression in this way relieves storage pressure to a certain extent, the storage space it saves still falls short of expectations.

Summary of the invention

The present disclosure provides a FASTQ data processing method, device, electronic equipment and storage medium. Its main purpose is to exploit the similarity between corresponding lines of the sequence units in a FASTQ file: the lines are classified and stored by line, and preset lossless compression is performed on the classified files, further saving the storage space of the FASTQ file.

According to a first aspect of the present disclosure, a method for processing FASTQ data is provided, including:

Split the FASTQ file to be processed into at least one sequence unit according to the preset data format. Each sequence unit includes four row sequences, and different row sequences correspond to different row sequence identifiers;

Store the four row sequences in corresponding four preset files according to the row sequence identifiers, wherein one preset file is used to store the row sequences of different sequence units that have the same row sequence identifier;

Trigger the preset lossless compression command to compress the four preset files respectively.

Optionally, storing the four row sequences in the corresponding four preset files according to the row sequence identifiers includes:

From the first sequence unit to the last sequence unit of the FASTQ file to be processed, sequentially read the row sequences with the same row sequence identifier in different sequence units;

Write the read row sequences having the same row sequence identifier into the same preset file in reading order.

Optionally, storing the four row sequences in the corresponding four preset files according to the row sequence identifiers includes:

Read the first row sequence in the first sequence unit of the FASTQ file to be processed, write the row sequence identifier and related description information of the first row sequence in the first sequence unit into the first line of the first preset file, and add a preset delimiter, wherein the at least one sequence unit includes a first sequence unit and a second sequence unit, the first sequence unit is the first sequence unit of the FASTQ file to be processed, the second sequence unit is the Nth sequence unit of the FASTQ file to be processed, and N is an integer greater than 1;

Read the first row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the Nth line of the first preset file, and add the preset delimiter;

Read the second row sequence in the first sequence unit, write the row sequence identifier and related description information of the second row sequence in the first sequence unit into the first line of the second preset file, and add the preset delimiter;

Read the second row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the Nth line of the second preset file, and add the preset delimiter;

Read the third row sequence in the first sequence unit, write the row sequence identifier and related description information of the third row sequence in the first sequence unit into the first line of the third preset file, and add the preset delimiter;

Read the third row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the third row sequence in the Nth sequence unit into the Nth line of the third preset file, and add the preset delimiter;

Read the fourth row sequence in the first sequence unit, write the row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first line of the fourth preset file, and add the preset delimiter;

Read the fourth row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the Nth line of the fourth preset file, and add the preset delimiter;

Repeat until all the sequence units of the FASTQ file to be processed have been sequentially written into the first preset file, the second preset file, the third preset file and the fourth preset file.

Optionally, triggering the preset lossless compression command to compress the four preset files respectively includes:

In response to the triggered first preset lossless compression command, perform compression on the first preset file;

In response to the triggered second preset lossless compression command, perform compression on the second preset file;

In response to the triggered third preset lossless compression command, perform compression on the third preset file;

In response to the triggered fourth preset lossless compression command, perform compression on the fourth preset file.

Optionally, the preset lossless compression command includes a gzip compression command or a pigz compression command.
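
As an illustration of triggering one compression command per preset file, the following minimal sketch builds and runs a gzip or pigz command line from Python. The helper names and the file names are hypothetical and do not come from the disclosure; the only command-line option used is -f (force overwrite), which both tools accept.

```python
import shutil
import subprocess
from pathlib import Path


def build_compress_command(path, tool="gzip"):
    """Return the argv list that compresses `path` in place with gzip or pigz."""
    if tool not in ("gzip", "pigz"):
        raise ValueError(f"unsupported tool: {tool}")
    # -f overwrites an existing .gz output; both gzip and pigz accept it.
    return [tool, "-f", str(path)]


def compress_preset_files(paths, tool="gzip"):
    """Run one compression command per preset file (hypothetical helper)."""
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} is not installed")
    for path in paths:
        subprocess.run(build_compress_command(Path(path), tool), check=True)
```

For example, compress_preset_files(["file1.txt", "file2.txt", "file3.txt", "file4.txt"], tool="pigz") would issue four separate pigz commands, one per preset file.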

Optionally, reading the first row sequence in the first sequence unit of the FASTQ file to be processed, and writing the row sequence identifier and related description information of the first row sequence in the first sequence unit into the first line of the first preset file, includes:

Perform preset lossless compression on the read row sequence identifier and related description information of the first row sequence in the first sequence unit;

Write the compressed row sequence identifier and related description information of the first row sequence in the first sequence unit into the first line of the first preset file.

Optionally, reading the first row sequence in the Nth sequence unit, and writing the row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the Nth line of the first preset file, includes:

Perform preset lossless compression on the read row sequence identifier and related description information of the first row sequence in the Nth sequence unit;

Write the compressed row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the Nth line of the first preset file.

Optionally, reading the second row sequence in the first sequence unit, and writing the row sequence identifier and related description information of the second row sequence in the first sequence unit into the first line of the second preset file, includes:

Perform preset lossless compression on the read row sequence identifier and related description information of the second row sequence in the first sequence unit;

Write the compressed row sequence identifier and related description information of the second row sequence in the first sequence unit into the first line of the second preset file.

Optionally, reading the second row sequence in the Nth sequence unit, and writing the row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the Nth line of the second preset file, includes:

Perform preset lossless compression on the read row sequence identifier and related description information of the second row sequence in the Nth sequence unit;

Write the compressed row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the Nth line of the second preset file.

Optionally, reading the third row sequence in the first sequence unit, and writing the row sequence identifier and related description information of the third row sequence in the first sequence unit into the first line of the third preset file, includes:

Perform preset lossless compression on the read row sequence identifier and related description information of the third row sequence in the first sequence unit;

Write the compressed row sequence identifier and related description information of the third row sequence in the first sequence unit into the first line of the third preset file.

Optionally, reading the third row sequence in the Nth sequence unit, and writing the row sequence identifier and related description information of the third row sequence in the Nth sequence unit into the Nth line of the third preset file, includes:

Perform preset lossless compression on the read row sequence identifier and related description information of the third row sequence in the Nth sequence unit;

Write the compressed row sequence identifier and related description information of the third row sequence in the Nth sequence unit into the Nth line of the third preset file.

Optionally, reading the fourth row sequence in the first sequence unit, and writing the row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first line of the fourth preset file, includes:

Perform preset lossless compression on the read row sequence identifier and related description information of the fourth row sequence in the first sequence unit;

Write the compressed row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first line of the fourth preset file.

Optionally, reading the fourth row sequence in the Nth sequence unit, and writing the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the Nth line of the fourth preset file, includes:

Perform preset lossless compression on the read row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit;

Write the compressed row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the Nth line of the fourth preset file.

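Optionally, triggering the preset lossless compression command to compress the four preset files respectively includes:

Determine whether each of the four preset files contains compressed row sequences;

If so, do not compress the already compressed row sequences again, and perform the preset lossless compression only on the uncompressed row sequences.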
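
The check can be pictured with the following simplified sketch. It works at whole-file granularity rather than per row sequence, which is an assumption made for brevity, and it recognises gzip data by the two magic bytes that begin every gzip stream. The function names are hypothetical.

```python
import gzip
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip_compressed(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def compress_if_needed(path):
    """Compress `path` to `<path>.gz` unless it already holds gzip data."""
    path = Path(path)
    if is_gzip_compressed(path):
        return path  # already compressed: leave it untouched
    target = path.with_name(path.name + ".gz")
    with open(path, "rb") as src, gzip.open(target, "wb") as dst:
        dst.write(src.read())
    return target
```

Calling compress_if_needed twice on the same data is therefore harmless: the second call detects the gzip header and returns without recompressing.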

Optionally, after triggering a preset lossless compression command to compress the four preset files respectively, the method further includes:

Starting from the first line of each of the compressed first preset file, second preset file, third preset file and fourth preset file, sequentially read the row sequence identifier and related description information, wherein the order in which the first preset file, the second preset file, the third preset file and the fourth preset file are read is consistent with the order of the four rows of the sequence unit;

According to the row sequence identifier and related description information of each read row, a target sequence unit is formed;

Write the target sequence unit into the same target FASTQ file;

Repeat until the row sequence identifier and related description information of the Nth line in the first preset file, the second preset file, the third preset file and the fourth preset file have been read, at which point the target FASTQ file is the restored FASTQ file to be processed.

Optionally, after triggering the preset lossless compression command to compress the four preset files respectively, the method further includes:

Synchronously read the row sequence identifier and related description information from the first line to the last line of the compressed first preset file, second preset file, third preset file and fourth preset file;

Each time a line is read, sequentially write the reading results of the first preset file, the second preset file, the third preset file and the fourth preset file into the same target FASTQ file.
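
The synchronous read-and-write restoration can be sketched as follows. The sketch assumes that the four preset files were compressed with gzip and that the preset delimiter is a newline, so that each row sequence occupies one line; restore_fastq is a hypothetical name.

```python
import gzip


def restore_fastq(preset_gz_paths, target_path):
    """Interleave four gzip-compressed preset files back into one FASTQ file."""
    handles = [gzip.open(p, "rt", newline="") for p in preset_gz_paths]
    try:
        with open(target_path, "w", newline="") as target:
            # zip() advances all four files in lockstep, one line from each,
            # in the order first, second, third, fourth preset file.
            for rows in zip(*handles):
                target.writelines(rows)
    finally:
        for handle in handles:
            handle.close()
```

The paths must be passed in the order of the four rows of a sequence unit, since that order determines the layout of each restored unit.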

Optionally, the method also includes:

Perform an encryption calculation on the FASTQ file to be processed using a preset encryption algorithm to obtain a first encryption key;

Use the decompression method corresponding to the preset lossless compression to decompress the target FASTQ file to obtain the decompressed target FASTQ file;

Use the preset encryption algorithm to perform encryption calculations on the decompressed target FASTQ file to obtain a second encryption key;

Determine, according to whether the first encryption key and the second encryption key are consistent, whether any loss occurred during the storage and restoration of the FASTQ data to be processed.
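
The disclosure does not name the preset encryption algorithm. One plausible reading, assumed here, is that the "encryption key" is a digest of the file contents, so the comparison amounts to a checksum test. The sketch below uses SHA-256 from the Python standard library purely as an example; the function names are hypothetical.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_lossless(original_path, restored_path, algorithm="sha256"):
    """Compare the digest of the original file with that of the restored file."""
    return file_digest(original_path, algorithm) == file_digest(restored_path, algorithm)
```

If is_lossless returns False, the restored FASTQ file differs from the original by at least one byte.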

According to a second aspect of the present disclosure, a FASTQ data processing device is provided, including:

The splitting unit is used to split the FASTQ file to be processed into at least one sequence unit according to the preset data format. Each sequence unit includes four row sequences, and different row sequences correspond to different row sequence identifiers;

A storage unit, configured to store the four row sequences in corresponding four preset files according to the row sequence identifiers, wherein one preset file is used to store the row sequences of different sequence units that have the same row sequence identifier;

A compression unit is used to trigger a preset lossless compression command to compress the four preset files respectively.

Optionally, the storage unit includes:

A first reading module, configured to sequentially read the row sequences with the same row sequence identifier in different sequence units, from the first sequence unit to the last sequence unit of the FASTQ file to be processed;

A first writing module, configured to write the read row sequences having the same row sequence identifier into the same preset file in reading order.

Optionally, the storage unit includes:

A first processing module, configured to read the first row sequence in the first sequence unit of the FASTQ file to be processed, write the row sequence identifier and related description information of the first row sequence in the first sequence unit into the first line of the first preset file, and add a preset delimiter, wherein the at least one sequence unit includes a first sequence unit and a second sequence unit, the first sequence unit is the first sequence unit of the FASTQ file to be processed, the second sequence unit is the Nth sequence unit of the FASTQ file to be processed, and N is an integer greater than 1;

A second processing module, configured to read the first row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the Nth line of the first preset file, and add the preset delimiter;

A third processing module, configured to read the second row sequence in the first sequence unit, write the row sequence identifier and related description information of the second row sequence in the first sequence unit into the first line of the second preset file, and add the preset delimiter;

A fourth processing module, configured to read the second row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the Nth line of the second preset file, and add the preset delimiter;

A fifth processing module, configured to read the third row sequence in the first sequence unit, write the row sequence identifier and related description information of the third row sequence in the first sequence unit into the first line of the third preset file, and add the preset delimiter;

A sixth processing module, configured to read the third row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the third row sequence in the Nth sequence unit into the Nth line of the third preset file, and add the preset delimiter;

A seventh processing module, configured to read the fourth row sequence in the first sequence unit, write the row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first line of the fourth preset file, and add the preset delimiter;

An eighth processing module, configured to read the fourth row sequence in the Nth sequence unit, write the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the Nth line of the fourth preset file, and add the preset delimiter.

Optionally, the compression unit is also used for:

Optionally, the first processing module is also used for:

Optionally, the second processing module is also used for:

Optionally, the third processing module is also used for:

Optionally, the fourth processing module is also used to include:

Optionally, the fifth processing module is also used for:

Optionally, the sixth processing module is also used for:

Optionally, the seventh processing module is also used for:

Optionally, the eighth processing module is also used for:

Optionally, the compression unit includes:

A judgment module, configured to determine whether each of the four preset files contains compressed row sequences;

A compression module, configured to, when it is determined that the four preset files contain compressed row sequences, skip compression of the already compressed row sequences and perform preset lossless compression on the uncompressed row sequences.

Optionally, the device also includes:

A reading unit, configured to sequentially read the row sequence identifier and related description information starting from the first line of each of the compressed first preset file, second preset file, third preset file and fourth preset file, wherein the order in which the first preset file, the second preset file, the third preset file and the fourth preset file are read is consistent with the order of the four rows of the sequence unit;

A component unit used to form a target sequence unit based on the row sequence identifier and related description information of each read row;

A writing unit, used to write the target sequence unit into the same target FASTQ file;

Optionally, the device also includes:

A control unit, configured to, after the compression unit triggers the preset lossless compression command to compress the four preset files respectively, synchronously read the row sequence identifier and related description information from the first line to the last line of the compressed first preset file, second preset file, third preset file and fourth preset file;

A reading and writing unit, configured to, each time a line is read, sequentially write the reading results of the first preset file, the second preset file, the third preset file and the fourth preset file into the same target FASTQ file.

Optionally, the device also includes:

The first encryption unit is used to calculate the FASTQ file to be processed according to the preset encryption algorithm to obtain the first encryption key;

A decryption unit, used to decompress the target FASTQ file using the decompression method corresponding to the preset lossless compression, and obtain the decompressed target FASTQ file;

A second encryption unit, configured to perform encryption calculation on the decompressed target FASTQ file using the preset encryption algorithm to obtain a second encryption key;

A determination unit, configured to determine, according to whether the first encryption key and the second encryption key are consistent, whether any loss occurred during the storage and restoration of the FASTQ data to be processed.

According to a third aspect of the present disclosure, an electronic device is provided, including:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein,

The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can perform the method described in the first aspect.

According to a fourth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the method described in the first aspect.

According to a fifth aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the method described in the foregoing first aspect.

In the FASTQ data processing method, device, electronic equipment and storage medium provided by the present disclosure, the FASTQ file to be processed is split into at least one sequence unit according to the preset data format, each sequence unit including four row sequences, and different row sequences corresponding to different row sequence identifiers; the four row sequences are stored in corresponding four preset files according to the row sequence identifiers, wherein one preset file is used to store the row sequences of different sequence units that have the same row sequence identifier; and a preset lossless compression command is triggered to compress the four preset files respectively. Compared with the related-art approach of directly applying preset lossless compression to the FASTQ file, the embodiments of the present disclosure exploit the similarity between corresponding rows of the sequence units in the FASTQ file to classify and store the data by row. That is, the four row sequences are stored in the corresponding four preset files according to the row sequence identifiers, and preset lossless compression is performed on each of the four preset files, further saving the storage space of the FASTQ file.

It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the application, nor is it intended to limit the scope of the application. Other features of the present application will become readily understood from the following description.

Description of the drawings

The accompanying drawings are used to better understand the present solution and do not constitute a limitation of the present disclosure. In the drawings:

Figure 1 is a schematic flowchart of a FASTQ data processing method provided by an embodiment of the present disclosure;

Figure 2 is a schematic diagram of the format of a FASTQ file provided by an embodiment of the present disclosure;

Figure 3 is a schematic diagram of the row sequences stored in a first preset file provided by an embodiment of the present disclosure;

Figure 4 is a schematic diagram of the row sequences stored in a second preset file provided by an embodiment of the present disclosure;

Figure 5 is a schematic diagram of the row sequences stored in a third preset file provided by an embodiment of the present disclosure;

Figure 6 is a schematic diagram of the row sequences stored in a fourth preset file provided by an embodiment of the present disclosure;

Figure 7 is a schematic flowchart of a method for storing four row sequences in corresponding four preset files provided by an embodiment of the present disclosure;

Figure 8 is a schematic flow chart of performing restoration processing on compressed FASTQ data provided by an embodiment of the present disclosure;

Figure 9 is a schematic flowchart of performing restoration processing on compressed FASTQ data provided by an embodiment of the present disclosure;

Figure 10 is a schematic structural diagram of a FASTQ data processing device provided by an embodiment of the present disclosure;

Figure 11 is a schematic structural diagram of another FASTQ data processing device provided by an embodiment of the present disclosure;

Figure 12 is a schematic block diagram of an example electronic device 1200 provided by an embodiment of the present disclosure.

Detailed description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the present disclosure are included to facilitate understanding and should be considered to be exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.

The following describes the FASTQ data processing method, device, electronic device, and storage medium according to the embodiments of the present disclosure with reference to the accompanying drawings.

Figure 1 is a schematic flowchart of a FASTQ data processing method provided by an embodiment of the present disclosure.

As shown in Figure 1, the method consists of the following steps:

Step 101: Split the FASTQ file to be processed into at least one sequence unit according to a preset data format. Each sequence unit includes four row sequences, and different row sequences correspond to different row sequence identifiers.

To facilitate understanding, the embodiment of the present disclosure first explains the format of the FASTQ file to be processed. Figure 2 is a schematic diagram of the format of a FASTQ file provided by the embodiment of the present disclosure. A FASTQ file is a standard file for storing raw sequencing data. Every four lines constitute an independent sequence unit (or sequence storage unit); Figure 2 shows four sequence units in total. The preset data format of each sequence unit is as follows:

The first line contains the sequence identifier and related description information; it starts with ‘@’ and is the unique identifier of the sequence unit;

The second line is the sequence itself, composed of A, C, G, T and N, where A, C, G and T are base information and N is a placeholder used when sequencing of a base fails;

The third line starts with ‘+’, followed by the row sequence identifier and related description information, or by nothing. In the example data of this embodiment, this line contains only ‘+’;

The fourth line is the quality information of the sequence and corresponds to the bases in the second line, one quality value per base. The quality value is encoded as an ASCII character and measures the reliability of the sequenced base: the higher the quality value, the more reliable the base.
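
To make the correspondence between quality characters and bases concrete, the following sketch decodes a quality line into integer quality values. The disclosure does not state which ASCII offset is used; the sketch assumes the widely used Phred+33 convention, and decode_qualities is a hypothetical name.

```python
def decode_qualities(quality_line, offset=33):
    """Convert each ASCII quality character to an integer quality value.

    The offset of 33 (Phred+33) is an assumption, not taken from the source.
    """
    return [ord(ch) - offset for ch in quality_line]


def pair_bases_with_qualities(bases, quality_line, offset=33):
    """Pair every base in line 2 with its quality value from line 4."""
    if len(bases) != len(quality_line):
        raise ValueError("sequence and quality lines differ in length")
    return list(zip(bases, decode_qualities(quality_line, offset)))
```

Under this assumption, the character ‘!’ maps to quality 0 and ‘I’ maps to quality 40.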

It should be noted that Figure 2 is only an illustrative example, given to make the format easier to understand. The embodiment of the present disclosure does not limit the number of sequence units in the FASTQ file to be processed or the content of each line in each sequence unit.

In a specific application, the FASTQ file to be processed is split into at least one sequence unit based on the preset data format of the sequence unit. The embodiment of the present disclosure uses four sequence units as an example, but this is not intended to limit the number of sequence units.
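
Step 101 can be sketched as a generator that groups every four lines into one sequence unit and checks the ‘@’ and ‘+’ markers described above. SequenceUnit and split_into_units are hypothetical names introduced for illustration only.

```python
from typing import Iterable, Iterator, NamedTuple


class SequenceUnit(NamedTuple):
    header: str      # line 1, starts with '@'
    bases: str       # line 2, composed of A, C, G, T and N
    separator: str   # line 3, starts with '+'
    qualities: str   # line 4, one quality character per base


def split_into_units(lines: Iterable[str]) -> Iterator[SequenceUnit]:
    """Group every four lines of a FASTQ file into one sequence unit."""
    buffer = []
    for line in lines:
        buffer.append(line.rstrip("\n"))
        if len(buffer) == 4:
            unit = SequenceUnit(*buffer)
            if not unit.header.startswith("@") or not unit.separator.startswith("+"):
                raise ValueError(f"malformed sequence unit: {buffer!r}")
            yield unit
            buffer = []
    if buffer:
        raise ValueError("FASTQ input does not contain a multiple of four lines")
```

Because the function accepts any iterable of lines, it can be fed an open file handle directly, so the whole FASTQ file never has to be held in memory.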

Step 102: Store four row sequences in corresponding four preset files according to the row sequence identifiers, wherein one default file is used to store row sequences of different sequence units with the same row sequence identifier.

After the four rows of each sequence unit of the FASTQ file to be processed are classified, they are written to the four corresponding preset files. That is, the row sequence identifier and related description information of the first row are written to the first preset file (File 1), those of the second row are written to the second preset file (File 2), those of the third row are written to the third preset file (File 3), and those of the fourth row (the quality information) are written to the fourth preset file (File 4).

Continuing the example from step 101, and based on the FASTQ file to be processed shown in Figure 2, Figures 3 to 6 are, respectively, schematic diagrams of the row sequences stored in the first preset file, the second preset file, the third preset file and the fourth preset file according to the embodiment of the present disclosure. As can be seen from Figures 3 to 6, the first preset file stores the content of the first row of the different sequence units, the second preset file stores the content of the second row of the different sequence units, the third preset file stores the content of the third row of the different sequence units, and the fourth preset file stores the content of the fourth row of the different sequence units.
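
A minimal sketch of step 102 follows. It routes each row to its preset file in a single pass over the input, rather than in one pass per row type, which is a simplification that produces the same four files. The file names file1.txt to file4.txt, the newline delimiter and the function name are assumptions made for illustration.

```python
from pathlib import Path


def store_by_row(fastq_path, out_dir, delimiter="\n"):
    """Write row k of every sequence unit to preset file k (k = 1..4)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # file1.txt .. file4.txt are hypothetical names for the four preset files.
    paths = [out_dir / f"file{i}.txt" for i in range(1, 5)]
    handles = [open(p, "w", newline="") for p in paths]
    try:
        with open(fastq_path, newline="") as fastq:
            for index, line in enumerate(fastq):
                # index % 4 selects the row position within the sequence unit.
                handles[index % 4].write(line.rstrip("\n") + delimiter)
    finally:
        for handle in handles:
            handle.close()
    return paths
```

After the call, line N of each output file holds the corresponding row of the Nth sequence unit, which is the property the restoration step relies on.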

Step 103: Trigger a preset lossless compression command to compress the four preset files respectively.

In this embodiment of the present disclosure, the similarity between the line sequences written to each of the four preset files is exploited, in combination with the compression principle of gzip, to further reduce the space needed to store FASTQ data.

As a feasible method of the embodiment of the present disclosure, the preset lossless compression command includes a gzip compression command or a pigz compression command. Subsequent embodiments take the gzip compression command as an example for explanation. However, this explanation is not intended to limit the compression method to the gzip compression command only.

The compression principle of gzip is as follows: when the FASTQ data contains two pieces of identical content, the latter piece can be determined as long as the position and size of the former piece are known. The latter piece can therefore be replaced by a pair of values (the distance between the two pieces, the length of the identical content). Since this pair is smaller than the content it replaces, the FASTQ file is compressed.
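The (distance, length) replacement described above can be illustrated with a toy encoder. The following is a minimal sketch, not gzip's actual DEFLATE implementation: the function names `lz77_encode` and `lz77_decode`, the window size and the minimum match length are all assumptions chosen for illustration.

```python
def lz77_encode(data, window=32, min_len=3):
    """Toy encoder: emit single characters or (distance, length) pairs."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search the preceding window for the longest earlier copy.
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:
            tokens.append((best_dist, best_len))  # replace the repeat with a pair
            i += best_len
        else:
            tokens.append(data[i])  # no useful repeat: keep the literal
            i += 1
    return tokens


def lz77_decode(tokens):
    """Rebuild the text by copying `length` characters from `distance` back."""
    buf = []
    for token in tokens:
        if isinstance(token, tuple):
            distance, length = token
            for _ in range(length):
                buf.append(buf[-distance])
        else:
            buf.append(token)
    return "".join(buf)


tokens = lz77_encode("ACGTACGTACGT")
print(tokens)               # ['A', 'C', 'G', 'T', (4, 8)]
print(lz77_decode(tokens))  # ACGTACGTACGT
```

Here the last eight characters are replaced by one pair meaning "go back 4 characters and copy 8", which is the substitution the paragraph describes.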

Please continue to refer to Figure 5. It can be seen from Figure 5 that the third preset file contains only the '+' at the beginning of the third line of each sequence unit, so the gzip command can be triggered to compress it. Similarly, since corresponding lines of the different sequence units are highly similar, the line sequences stored in each of the four preset files are highly similar. Triggering the gzip command to compress these files therefore saves space, because classification shortens the distance between identical pieces of content.
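The effect on the third preset file can be observed directly with Python's standard `gzip` module. This is a minimal sketch under the assumption that the third file holds nothing but '+' lines, as in Figure 5; the helper name `gzip_size` and the count of 10,000 lines are illustrative only.

```python
import gzip


def gzip_size(text):
    """Return the number of bytes gzip needs for the given ASCII text."""
    return len(gzip.compress(text.encode("ascii")))


# Assumed content of the third preset file: one '+' line per sequence unit.
plus_lines = "+\n" * 10_000
print("uncompressed bytes:", len(plus_lines))
print("gzip bytes:", gzip_size(plus_lines))
```

Because every line is identical, almost the whole stream is encoded as back-references to earlier content, so the compressed size is a small fraction of the input.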

The FASTQ data processing method provided by this disclosure splits the FASTQ file to be processed into at least one sequence unit according to the preset data format, where each sequence unit includes four row sequences and different row sequences correspond to different row sequence identifiers; stores the four row sequences in four corresponding preset files according to the row sequence identifiers, wherein each preset file stores the row sequences of different sequence units that have the same row sequence identifier; and triggers the preset lossless compression command to compress the four preset files respectively. Compared with the related-art approach of directly triggering the gzip command to compress the FASTQ file, the embodiment of the present disclosure exploits the similarity of corresponding lines across the sequence units of the FASTQ file to classify and store the data by line, that is, the four row sequences are stored in four corresponding preset files according to the row sequence identifiers and preset lossless compression is performed on the four preset files separately, which further reduces the storage space of the FASTQ file.

In practical applications, step 102, in which the four line sequences are stored in four corresponding preset files according to the line sequence identifiers, can be implemented in the following two ways:

Method 1: From the first sequence unit to the last sequence unit of the FASTQ file to be processed, read in turn the row sequences that have the same row sequence identifier in the different sequence units, and write them to the same preset file in the order in which they were read.

As an example, suppose the FASTQ file to be processed contains N sequence units (N is greater than 1). For classified storage, the sequence units are read one at a time: the first line of sequence unit 1 is read and written to the first line of the first preset file, the second line of sequence unit 1 is read and written to the first line of the second preset file, the third line of sequence unit 1 is read and written to the first line of the third preset file, and the fourth line of sequence unit 1 is read and written to the first line of the fourth preset file. After sequence unit 1 has been read, the remaining sequence units up to sequence unit N are read in the same way, until all sequence units have been processed and the classified storage of the FASTQ file to be processed is complete.
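Method 1 can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation referred to in the examples (which is described as perl code): the function name `split_fastq` is hypothetical, and for brevity it writes each preset file through `gzip.open`, so classification and compression happen in one pass.

```python
import gzip
from itertools import islice


def split_fastq(src_path, out_paths):
    """Write line k of every four-line sequence unit to out_paths[k].

    Returns the number of sequence units processed. The four output
    files are gzip-compressed as they are written.
    """
    if len(out_paths) != 4:
        raise ValueError("exactly four output paths are required")
    # Accept either a plain or a gzip-compressed FASTQ file as input.
    opener = gzip.open if str(src_path).endswith(".gz") else open
    outs = [gzip.open(path, "wt", newline="") for path in out_paths]
    units = 0
    try:
        with opener(src_path, "rt", newline="") as src:
            while True:
                unit = list(islice(src, 4))  # next sequence unit
                if not unit:
                    break
                if len(unit) != 4:
                    raise ValueError("truncated sequence unit at end of file")
                for line, out in zip(unit, outs):
                    out.write(line)  # line k goes to preset file k
                units += 1
    finally:
        for out in outs:
            out.close()
    return units
```

A call such as `split_fastq("reads.fq", ["Row1.gz", "Row2.gz", "Row3.gz", "Row4.gz"])` (file names assumed) keeps the lines in reading order, so line n of each output file belongs to sequence unit n.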

Method 2: Read the first row sequence in the first sequence unit of the FASTQ file to be processed, write the row sequence identifier and related description information of that row sequence to the first line of the first preset file, and add a preset delimiter, wherein the at least one sequence unit includes a first sequence unit and a second sequence unit, the first sequence unit is the first sequence unit of the FASTQ file to be processed, the second sequence unit is the Nth sequence unit of the FASTQ file to be processed, and N is an integer greater than 1;

Embodiments of the present disclosure provide a method for compressing the four preset files respectively. FIG. 7 is a flow chart of a method for storing four line sequences in four corresponding preset files according to an embodiment of the present disclosure. This method can be executed alone, in combination with any embodiment of the present disclosure or any possible implementation thereof, or in combination with any technical solution in the related art. As a possible way to perform step 101, this method includes:

Step 701: Read the row sequence identifier and related description information of the first row sequence in the first sequence unit;

Step 702: Read the row sequence identifier and related description information of the first row sequence in the Nth sequence unit;

Step 703: Read the row sequence identifier and related description information of the second row sequence in the first sequence unit;

Step 704: Read the row sequence identifier and related description information of the second row sequence in the Nth sequence unit.

Step 705: Read the row sequence identifier and related description information of the third row sequence in the first sequence unit;

Step 706: Read the row sequence identifier and related description information of the third row sequence in the Nth sequence unit;

Perform preset lossless compression on the read row sequence identifier and related description information of the third row sequence in the Nth sequence unit;

Write the compressed row sequence identifier and related description information of the third row sequence in the Nth sequence unit to the Nth line of the third preset file.

Step 707: Read the row sequence identifier and related description information of the fourth row sequence in the first sequence unit;

Step 708: Read the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit;

It should be noted that the method shown in Figure 7 can be performed as a complete example comprising steps 701 to 708, or any one of steps 701 to 708 can be performed individually, or at least two of the steps can be selected and performed. The embodiments of the present disclosure do not limit this.

As a feasible method of the embodiment of the present disclosure, in order that the FASTQ file to be processed can be restored losslessly after compression, each line sequence is compressed only once. Therefore, triggering the preset lossless compression command to compress the four preset files respectively includes: determining, for each of the four preset files, whether it contains compressed line sequences; if so, the compressed line sequences are not compressed again, and preset lossless compression is performed only on the uncompressed line sequences. This processing method ensures lossless restoration.
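The "compress only once" rule can be sketched as follows. The text does not say how a compressed line sequence is recognised, so this sketch assumes gzip output and checks for the two-byte gzip magic number; the names `is_gzip` and `compress_once` are hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def is_gzip(data):
    """Assumed test: treat data as already compressed if it starts with the gzip magic."""
    return data[:2] == GZIP_MAGIC


def compress_once(data):
    """Compress uncompressed data; return already-compressed data unchanged."""
    return data if is_gzip(data) else gzip.compress(data)


raw = b"@r1\n@r2\n"
once = compress_once(raw)
twice = compress_once(once)   # second call is a no-op
print(once == twice)          # True
```

Since FASTQ lines begin with printable characters such as '@' or '+', uncompressed line sequences are not mistaken for gzip data under this assumption, and a single decompression step is enough to restore them.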

The method shown in Figure 7 performs compression each time a line sequence is read from a sequence unit and writes the compressed line sequence to the corresponding preset file. The following embodiments also provide another possible way to perform step 104: using any of the above embodiments, the four line sequences are first stored in the four corresponding preset files according to the line sequence identifiers, and the different preset files are then compressed separately, specifically:

The embodiments of this disclosure do not limit specific compression implementation methods.

The embodiment of the present disclosure provides a method for restoring the four compressed preset files. Figure 8 is a schematic flowchart of a method for restoring compressed FASTQ data provided by the embodiment of the present disclosure. This method can be executed alone, in combination with any embodiment of the present disclosure or any possible implementation thereof, or in combination with any technical solution in the related art, and includes:

Step 801: Starting from the first line of the compressed first, second, third and fourth preset files, read the line sequence identifiers and related description information in turn, wherein the order in which the first preset file, the second preset file, the third preset file and the fourth preset file are read is consistent with the order of the four lines of a sequence unit.

Read the first line of the first preset file, the first line of the second preset file, the first line of the third preset file, and the first line of the fourth preset file.

Continue by reading the Nth line of the first preset file, the Nth line of the second preset file, the Nth line of the third preset file, and the Nth line of the fourth preset file, where N is greater than 1.

Step 802: Compose a target sequence unit based on the read row sequence identifier and related description information of each row.

Step 803: Write the target sequence unit into the same target FASTQ file.

The method shown in Figure 8 reads each line of the first, second, third and fourth preset files in sequence, which is relatively slow. To shorten the waiting time for restoration and improve the efficiency of processing FASTQ files, restoration can also be performed in the manner shown in Figure 9. Figure 9 is a schematic flowchart of another method for restoring compressed FASTQ data provided by an embodiment of the present disclosure. This method can be executed alone, in combination with any embodiment of the present disclosure or any possible implementation thereof, or in combination with any technical solution in the related art, and includes:

Step 901: Read the compressed first, second, third and fourth preset files synchronously, reading the line sequence identifiers and related description information of all lines from the first line to the last line. Controlling by program the synchronization of reading the four preset files improves reading efficiency and thus the efficiency of restoring the FASTQ file.

Step 902: Each time a line is read, write the results read from the first, second, third and fourth preset files, in that order, to the same target FASTQ file.
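Steps 901 and 902 amount to iterating over the four preset files in lock-step. The following is a minimal Python sketch under stated assumptions: the function name `restore_fastq` is hypothetical, the preset files are assumed to be gzip-compressed text, and the target is written gzip-compressed as in Example 2 below.

```python
import gzip


def restore_fastq(row_paths, target_path):
    """Rebuild a FASTQ file from four per-line gzip files.

    Line n of each input file is read together and written out as
    sequence unit n. Returns the number of sequence units written.
    """
    handles = [gzip.open(path, "rt", newline="") for path in row_paths]
    units = 0
    try:
        with gzip.open(target_path, "wt", newline="") as target:
            # zip() advances all four files one line at a time; it stops at
            # the shortest file, so inputs of unequal length are truncated.
            for unit in zip(*handles):
                target.writelines(unit)  # lines 1 to 4 of one sequence unit
                units += 1
    finally:
        for handle in handles:
            handle.close()
    return units
```

Opening the files with `newline=""` passes line endings through unchanged, which matters if the restored file is later compared with the original by checksum.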

As a feasible method of the embodiment of the present disclosure, the reliability of the above processing method is verified as follows. The method includes: performing a calculation on the FASTQ file to be processed according to a preset encryption algorithm to obtain a first encryption key; decompressing the target FASTQ file using the decompression method corresponding to the preset lossless compression to obtain the decompressed target FASTQ file; performing the same calculation on the decompressed target FASTQ file using the preset encryption algorithm to obtain a second encryption key; and determining, based on whether the first encryption key and the second encryption key are consistent, whether any loss occurred in the storage and restoration of the FASTQ data to be processed.
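The consistency check can be sketched with Python's `hashlib`. This sketch assumes that the preset algorithm is MD5 (the algorithm used in the examples of this disclosure) and that the target file is gzip-compressed; the function names `md5_of_stream` and `is_lossless` are hypothetical.

```python
import gzip
import hashlib


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def is_lossless(original_fastq_path, restored_gz_path):
    """Compare the original FASTQ file with the decompressed restored file."""
    with open(original_fastq_path, "rb") as f:
        first = md5_of_stream(f)       # digest of the file to be processed
    with gzip.open(restored_gz_path, "rb") as f:
        second = md5_of_stream(f)      # digest of the decompressed target
    return first == second
```

Reading in chunks keeps memory use constant, which is relevant for FASTQ files of the multi-gigabyte sizes mentioned in the examples.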

As Example 1, the following provides an application example in which the above FASTQ data processing method is used to save storage space for a compressed FASTQ file with a size of 21789495522bp and the file name "V350027954_L01_read_1.fq.gz". The MD5 value (MD5 Message-Digest Algorithm) of the decompressed file "V350027954_L01_read_1.fq" is "ca4168a17d0510a5d3f51fa6856d1888". After the file has been stored by classified compression according to the present invention and then restored by the restoration method, this MD5 value can be used to check whether the MD5 is consistent, thereby verifying the reliability of the above method.

In this example, to achieve effective compression of FASTQ data, the programming language is perl. The specific implementation code is as follows:

In this example, each sequence unit of V350027954_L01_read_1.fq.gz is classified and compressed into the files Row1.gz, Row2.gz, Row3.gz and Row4.gz (the four preset files), whose sizes are 1330794994bp, 6725867471bp, 978835bp and 8702582307bp respectively. The total size of the four files is 16760223607bp, which is only 76.92% of the size of the original file V350027954_L01_read_1.fq.gz. The FASTQ data has thus been effectively compressed.
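The total and the percentage stated above follow directly from the four reported sizes, as this short check shows. The numbers are taken from the text; the variable names are illustrative.

```python
# Sizes as reported in Example 1, in the unit used by the text.
row_sizes = {
    "Row1.gz": 1330794994,
    "Row2.gz": 6725867471,
    "Row3.gz": 978835,
    "Row4.gz": 8702582307,
}
original_size = 21789495522  # V350027954_L01_read_1.fq.gz

total = sum(row_sizes.values())
ratio = total / original_size
print(total)            # 16760223607
print(f"{ratio:.2%}")   # 76.92%
```

Row3.gz, which holds only the '+' lines, accounts for less than 0.01% of the total, while the quality lines in Row4.gz account for just over half of it.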

Example 2: The following provides an application example of using the method of the embodiment of the present invention to restore the compressed files Row1.gz, Row2.gz, Row3.gz and Row4.gz obtained by classification and storage in the above Example 1 to the original file V350027954_L01_read_1.fq.gz. In this example 2, the programming language is perl, and the specific implementation code is as follows:

In this Example 2, we simultaneously read the nth line (from the first line to the last line) of Row1.gz, Row2.gz, Row3.gz and Row4.gz, and write the compressed output to a file named "V350027954_L01_read_1.fq.gz".

Further, we decompressed the file "V350027954_L01_read_1.fq.gz" to obtain the file "V350027954_L01_read_1.fq", and used the command "md5sum V350027954_L01_read_1.fq" to obtain its MD5 value, "ca4168a17d0510a5d3f51fa6856d1888". This is the same as the MD5 value in Example 1, proving that the compression method and restoration method provided by the embodiment of the present invention are effective and feasible.

Corresponding to the above-mentioned FASTQ data processing method, the present invention also proposes a FASTQ data processing device. Since the device embodiment of the present invention corresponds to the above-mentioned method embodiment, details not disclosed in the device embodiment can be referred to the above-mentioned method embodiment, and will not be described again in the present invention.

An embodiment of the present disclosure provides a FASTQ data processing device, as shown in Figure 10, including:

The splitting unit 1001 is used to split the FASTQ file to be processed into at least one sequence unit according to the preset data format. Each sequence unit includes four row sequences, and different row sequences correspond to different row sequence identifiers;

The storage unit 1002 is used to store the four line sequences in four corresponding preset files according to the line sequence identifiers, wherein each preset file stores the line sequences of different sequence units that have the same line sequence identifier;

The compression unit 1003 is configured to trigger a preset lossless compression command to compress the four preset files respectively.

The FASTQ data processing device provided by the present disclosure splits the FASTQ file to be processed into at least one sequence unit according to the preset data format, where each sequence unit includes four row sequences and different row sequences correspond to different row sequence identifiers; stores the four row sequences in four corresponding preset files according to the row sequence identifiers, wherein each preset file stores the row sequences of different sequence units that have the same row sequence identifier; and triggers the preset lossless compression command to compress the four preset files respectively. Compared with the related-art approach of directly triggering the gzip command to compress the FASTQ file, the embodiment of the present disclosure exploits the similarity of corresponding lines across the sequence units of the FASTQ file to classify and store the data by line, that is, the four row sequences are stored in four corresponding preset files according to the row sequence identifiers and preset lossless compression is performed on the four preset files separately, which further reduces the storage space of the FASTQ file.

Further, in a possible implementation of this embodiment, as shown in Figure 11, the storage unit 1002 includes:

Further, in a possible implementation of this embodiment, the storage unit includes:

The third processing module is used to read the second row sequence in the first sequence unit, write the row sequence identifier and related description information of the second row sequence in the first sequence unit to the first line of the second preset file, and add a preset delimiter;

Further, in a possible implementation of this embodiment, the compression unit 1003 is also used to:

Further, in a possible implementation of this embodiment, the preset lossless compression command includes a gzip compression command or a pigz compression command.

Further, in a possible implementation of this embodiment, the first processing module is also used to:

Furthermore, in a possible implementation of this embodiment, the second processing module is also used to:

Further, in a possible implementation of this embodiment, the third processing module is also used to:

Further, in a possible implementation of this embodiment, the fourth processing module is also configured to include:

Further, in a possible implementation of this embodiment, the fifth processing module is also used to:

Further, in a possible implementation of this embodiment, the sixth processing module is also used to:

Further, in a possible implementation of this embodiment, the seventh processing module is also used to:

Further, in a possible implementation of this embodiment, the eighth processing module is also used to:

Further, in a possible implementation of this embodiment, as shown in Figure 11, the compression unit 1003 includes:

Determination module 10031, used to determine whether the four preset files contain compressed line sequences;

The compression module 10032 is configured to, when it is determined that the four preset files contain compressed line sequences, no longer perform compression processing on the compressed line sequences, and perform preset lossless compression on the uncompressed line sequences.

Further, in a possible implementation of this embodiment, as shown in Figure 11, the device further includes:

The reading unit 1004 is configured to, after the compression unit 1003 triggers the preset lossless compression command to compress the four preset files respectively, read the line sequence identifiers and related description information in turn, starting from the first line of the compressed first, second, third and fourth preset files, wherein the order in which the first preset file, the second preset file, the third preset file and the fourth preset file are read is consistent with the order of the four lines of a sequence unit;

The composition unit 1005 is used to compose a target sequence unit according to the row sequence identifier and related description information of each row read;

Writing unit 1006, used to write the target sequence unit into the same target FASTQ file;

Further, in a possible implementation of this embodiment, as shown in Figure 11, the device further includes:

The control unit 1007 is configured to, after the compression unit triggers the preset lossless compression command to compress the four preset files respectively, control the synchronized reading of the compressed first, second, third and fourth preset files, reading the line sequence identifiers and related description information from the first line to the last line;

The reading and writing unit 1008 is configured to, each time a line is read, write the results read from the first, second, third and fourth preset files, in that order, to the same target FASTQ file;

The first encryption unit 1009 is used to calculate the FASTQ file to be processed according to the preset encryption algorithm to obtain the first encryption key;

The decryption unit 10010 is used to decompress the target FASTQ file using the decompression method corresponding to the preset lossless compression, and obtain the decompressed target FASTQ file;

The second encryption unit 10011 is used to perform encryption calculation on the decompressed target FASTQ file using the preset encryption algorithm to obtain a second encryption key;

The determination unit 10012 is used to determine whether there is a loss in the storage and restoration of the FASTQ data to be processed based on whether the first encryption key and the second encryption key are consistent.

It should be noted that the foregoing explanation of the method embodiment also applies to the device of this embodiment. The principles are the same and are not repeated in this embodiment.

According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.

Figure 12 illustrates a schematic block diagram of an example electronic device 1200 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.

As shown in Figure 12, the device 1200 includes a computing unit 1201, which can perform various appropriate actions and processes according to a computer program stored in a ROM (Read-Only Memory) 1202 or a computer program loaded from a storage unit 1208 into a RAM (Random Access Memory) 1203. The RAM 1203 can also store various programs and data required for the operation of the device 1200. The computing unit 1201, the ROM 1202 and the RAM 1203 are connected to each other via a bus 1204. An I/O (Input/Output) interface 1205 is also connected to the bus 1204.

Multiple components in the device 1200 are connected to the I/O interface 1205, including: an input unit 1206, such as a keyboard, mouse, etc.; an output unit 1207, such as various types of displays, speakers, etc.; a storage unit 1208, such as a magnetic disk, optical disk, etc.; and a communication unit 1209, such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the device 1200 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.

Computing unit 1201 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units that run machine learning model algorithms, a DSP (Digital Signal Processor), and any appropriate processor, controller, microcontroller, etc. The computing unit 1201 performs the various methods and processes described above, such as the FASTQ data processing method. For example, in some embodiments, the FASTQ data processing method may be implemented as a computer software program that is tangibly embodied in a machine-readable medium, such as the storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured to perform the aforementioned FASTQ data processing method in any other suitable manner (e.g., by means of firmware).

Various implementations of the systems and techniques described herein can be implemented in digital electronic circuit systems, integrated circuit systems, FPGAs (Field Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), ASSPs (Application Specific Standard Products), SOCs (Systems On Chip), CPLDs (Complex Programmable Logic Devices), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor, which may be a special-purpose or general-purpose programmable processor, may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, RAM, ROM, EPROM (Erasable Programmable Read-Only Memory) or flash memory, optical fiber, CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the above.

To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device (e.g., a CRT (Cathode-Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).

The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include: LAN (Local Area Network), WAN (Wide Area Network), the Internet, and blockchain networks.

Computer systems may include clients and servers. Clients and servers are generally remote from each other and typically interact over a communications network. The relationship of client and server is created by computer programs running on the corresponding computers and having a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the defects of difficult management and weak business scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server can also be a server of a distributed system or a server combined with a blockchain.

It should be noted that artificial intelligence is the study of using computers to simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, planning, etc.). It involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, etc.; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graphs and other major directions.

It should be understood that various forms of the process shown above may be used, with steps reordered, added or deleted. For example, each step described in the present disclosure can be executed in parallel, sequentially, or in a different order. As long as the desired results of the technical solution disclosed in the present disclosure can be achieved, there is no limitation here.

The above-mentioned specific embodiments do not constitute a limitation on the scope of the present disclosure. It will be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions are possible depending on design requirements and other factors. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of this disclosure shall be included in the protection scope of this disclosure.

Claims

A method for processing FASTQ data, characterized by comprising:

splitting a FASTQ file to be processed into at least one sequence unit according to a preset data format, wherein each sequence unit includes four row sequences, and different row sequences correspond to different row sequence identifiers;

storing the four row sequences in four corresponding preset files according to the row sequence identifiers, wherein one preset file is used to store the row sequences that have the same row sequence identifier and belong to different sequence units;

triggering a preset lossless compression command to compress the four preset files respectively.
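
As a non-limiting illustration, the splitting, storing and compressing steps above may be sketched in Python as follows. The sketch assumes that every sequence unit occupies exactly four physical lines. The file suffixes and the function names split_fastq and compress_each are hypothetical and do not appear in the present disclosure, and Python's gzip module stands in for the preset lossless compression command.

```python
import gzip
import shutil
from pathlib import Path

# Hypothetical suffixes for the four preset files; the disclosure does not name them.
SUFFIXES = ("id", "seq", "plus", "qual")


def split_fastq(fastq_path, out_dir):
    """Store row k of every four-row sequence unit in preset file k."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"{Path(fastq_path).name}.{s}" for s in SUFFIXES]
    outputs = [open(p, "w", newline="") for p in paths]
    try:
        with open(fastq_path, "r", newline="") as src:
            for index, line in enumerate(src):
                # index % 4 selects the preset file for this row of the unit.
                outputs[index % 4].write(line)
    finally:
        for handle in outputs:
            handle.close()
    return paths


def compress_each(paths):
    """Losslessly compress each preset file separately (gzip format)."""
    compressed = []
    for path in paths:
        target = Path(f"{path}.gz")
        with open(path, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        path.unlink()  # keep only the compressed copy
        compressed.append(target)
    return compressed
```

Calling compress_each(split_fastq("sample.fq", "out")) would leave four gzip files in the out directory, one per row position of the sequence unit.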
The processing method according to claim 1, wherein storing the four row sequences in the four corresponding preset files according to the row sequence identifiers comprises:

sequentially reading, from the first sequence unit to the last sequence unit of the FASTQ file to be processed, the row sequences that have the same row sequence identifier in different sequence units;

writing the read row sequences that have the same row sequence identifier into the same preset file in the order in which they were read.
The processing method according to any one of claims 1-2, characterized in that storing the four row sequences in the four corresponding preset files according to the row sequence identifiers comprises:

reading the first row sequence in the first sequence unit of the FASTQ file to be processed, writing the row sequence identifier and related description information of the first row sequence in the first sequence unit into the first row of the first preset file, and adding a preset delimiter, wherein the at least one sequence unit includes a first sequence unit and a second sequence unit, the first sequence unit is the first sequence unit of the FASTQ file to be processed, the second sequence unit is the Nth sequence unit of the FASTQ file to be processed, and N is an integer greater than 1;

reading the first row sequence in the Nth sequence unit, writing the row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the Nth row of the first preset file, and adding the preset delimiter;

reading the second row sequence in the first sequence unit, writing the row sequence identifier and related description information of the second row sequence in the first sequence unit into the first row of the second preset file, and adding the preset delimiter;

reading the second row sequence in the Nth sequence unit, writing the row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the Nth row of the second preset file, and adding the preset delimiter;

reading the third row sequence in the first sequence unit, writing the row sequence identifier and related description information of the third row sequence in the first sequence unit into the first row of the third preset file, and adding the preset delimiter;

reading the third row sequence in the Nth sequence unit, writing the row sequence identifier and related description information of the third row sequence in the second sequence unit into the Nth row of the third preset file, and adding the preset delimiter;

reading the fourth row sequence in the first sequence unit, writing the row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first row of the fourth preset file, and adding the preset delimiter;

reading the fourth row sequence in the Nth sequence unit, writing the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the Nth row of the fourth preset file, and adding the preset delimiter;

until all the sequence units of the FASTQ file to be processed have been sequentially written into the first preset file, the second preset file, the third preset file and the fourth preset file.
The processing method according to any one of claims 1 to 3, characterized in that triggering the preset lossless compression command to compress the four preset files respectively comprises:

performing compression on the first preset file in response to a triggered first preset lossless compression command;

performing compression on the second preset file in response to a triggered second preset lossless compression command;

performing compression on the third preset file in response to a triggered third preset lossless compression command;

performing compression on the fourth preset file in response to a triggered fourth preset lossless compression command.
The processing method according to any one of claims 1 to 4, characterized in that the preset lossless compression command includes a gzip compression command or a pigz compression command.
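
As a non-limiting illustration, one compression command per preset file may be triggered from Python as sketched below. The function names build_compress_command and compress_preset_files and the default thread count are hypothetical. The sketch assumes that the gzip or pigz command-line tool is installed, and prefers pigz when it is found because pigz writes gzip-format output using several threads (its -p option).

```python
import shutil
import subprocess


def build_compress_command(path, threads=4):
    """Return a pigz command if pigz is installed, otherwise a gzip command."""
    if shutil.which("pigz"):
        # -p sets the number of compression threads used by pigz.
        return ["pigz", "-p", str(threads), str(path)]
    return ["gzip", str(path)]


def compress_preset_files(paths, threads=4):
    """Trigger one compression command for each of the preset files."""
    for path in paths:
        # check=True raises CalledProcessError if the command fails.
        subprocess.run(build_compress_command(path, threads), check=True)
```

Both tools replace each input file with a file of the same name plus a .gz suffix, so the four preset files remain separate after compression.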
The processing method according to claim 3, characterized in that reading the first row sequence in the first sequence unit of the FASTQ file to be processed and writing the row sequence identifier and related description information of the first row sequence in the first sequence unit into the first row of the first preset file comprises:

performing preset lossless compression on the read row sequence identifier and related description information of the first row sequence in the first sequence unit;

writing the compressed row sequence identifier and related description information of the first row sequence in the first sequence unit into the first row of the first preset file.
The processing method according to claim 3, characterized in that reading the first row sequence in the Nth sequence unit and writing the row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the Nth row of the first preset file comprises:

performing preset lossless compression on the read row sequence identifier and related description information of the first row sequence in the Nth sequence unit;

writing the compressed row sequence identifier and related description information of the first row sequence in the Nth sequence unit into the Nth row of the first preset file.
The processing method according to claim 3, characterized in that reading the second row sequence in the first sequence unit and writing the row sequence identifier and related description information of the second row sequence in the first sequence unit into the first row of the second preset file comprises:

performing preset lossless compression on the read row sequence identifier and related description information of the second row sequence in the first sequence unit;

writing the compressed row sequence identifier and related description information of the second row sequence in the first sequence unit into the first row of the second preset file.
The processing method according to claim 3, characterized in that reading the second row sequence in the Nth sequence unit and writing the row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the Nth row of the second preset file comprises:

performing preset lossless compression on the read row sequence identifier and related description information of the second row sequence in the Nth sequence unit;

writing the compressed row sequence identifier and related description information of the second row sequence in the Nth sequence unit into the Nth row of the second preset file.
The processing method according to claim 3, characterized in that reading the third row sequence in the first sequence unit and writing the row sequence identifier and related description information of the third row sequence in the first sequence unit into the first row of the third preset file comprises:

performing preset lossless compression on the read row sequence identifier and related description information of the third row sequence in the first sequence unit;

writing the compressed row sequence identifier and related description information of the third row sequence in the first sequence unit into the first row of the third preset file.
The processing method according to claim 3, characterized in that reading the third row sequence in the Nth sequence unit and writing the row sequence identifier and related description information of the third row sequence in the second sequence unit into the Nth row of the third preset file comprises:

performing preset lossless compression on the read row sequence identifier and related description information of the third row sequence in the second sequence unit;

writing the compressed row sequence identifier and related description information of the third row sequence in the second sequence unit into the Nth row of the third preset file.
The processing method according to claim 3, characterized in that reading the fourth row sequence in the first sequence unit and writing the row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first row of the fourth preset file comprises:

performing preset lossless compression on the read row sequence identifier and related description information of the fourth row sequence in the first sequence unit;

writing the compressed row sequence identifier and related description information of the fourth row sequence in the first sequence unit into the first row of the fourth preset file.
The processing method according to claim 3, characterized in that reading the fourth row sequence in the Nth sequence unit and writing the row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the Nth row of the fourth preset file comprises:

performing preset lossless compression on the read row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit;

writing the compressed row sequence identifier and related description information of the fourth row sequence in the Nth sequence unit into the Nth row of the fourth preset file.
The method according to any one of claims 6-13, characterized in that triggering the preset lossless compression command to compress the four preset files respectively comprises:

determining whether each of the four preset files contains compressed row sequences;

if so, not compressing the already compressed row sequences again, and performing the preset lossless compression on the uncompressed row sequences.
The processing method according to any one of claims 1 to 14, characterized in that, after triggering the preset lossless compression command to compress the four preset files respectively, the method further comprises:

starting from the first row of the compressed first preset file, second preset file, third preset file and fourth preset file, reading the row sequence identifier and related description information in sequence, wherein the order in which the first preset file, the second preset file, the third preset file and the fourth preset file are read is consistent with the order of the four row sequences in a sequence unit;

forming a target sequence unit from the row sequence identifier and related description information read for each row;

writing the target sequence unit into the same target FASTQ file;

repeating the above until the row sequence identifier and related description information of the Nth row of the first preset file, the second preset file, the third preset file and the fourth preset file have been read, whereupon the target FASTQ file is the restored FASTQ file to be processed.
The processing method according to any one of claims 1 to 14, characterized in that, after triggering the preset lossless compression command to compress the four preset files respectively, the method further comprises:

synchronously reading the row sequence identifier and related description information from the first row to the last row of the compressed first preset file, second preset file, third preset file and fourth preset file;

each time a row is read, sequentially writing the reading results of the first preset file, the second preset file, the third preset file and the fourth preset file into the same target FASTQ file.
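
As a non-limiting illustration, the restoration steps above may be sketched in Python as follows. The function name restore_fastq is hypothetical, the four compressed files are assumed to be in gzip format, and they are assumed to be passed in the same order as the four rows of a sequence unit.

```python
import gzip


def restore_fastq(compressed_paths, target_path):
    """Rebuild a FASTQ file by reading the four compressed files in lockstep."""
    handles = [gzip.open(p, "rt", newline="") for p in compressed_paths]
    try:
        with open(target_path, "w", newline="") as target:
            # zip() advances all four files by one row per iteration, so each
            # tuple holds the four rows of one target sequence unit.
            for rows in zip(*handles):
                target.writelines(rows)
    finally:
        for handle in handles:
            handle.close()
```

Because the files are read row by row, the sketch never holds more than one sequence unit in memory, whatever the size of the original FASTQ file.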
The processing method according to claim 15 or 16, characterized in that the method further comprises:

performing a calculation on the FASTQ file to be processed using a preset encryption algorithm to obtain a first encryption key;

decompressing the target FASTQ file using the decompression method corresponding to the preset lossless compression to obtain a decompressed target FASTQ file;

performing an encryption calculation on the decompressed target FASTQ file using the preset encryption algorithm to obtain a second encryption key;

determining, according to whether the first encryption key and the second encryption key are consistent, whether any loss occurred in the storage and restoration processing of the FASTQ data to be processed.
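
As a non-limiting illustration, the consistency check above may be sketched in Python as follows. The claim does not name the preset encryption algorithm; the sketch assumes that a SHA-256 digest plays the role of the encryption key, and the function names file_digest and is_lossless are hypothetical.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_lossless(original_path, restored_path, algorithm="sha256"):
    """True when the original and restored files produce the same digest."""
    return file_digest(original_path, algorithm) == file_digest(restored_path, algorithm)
```

Matching digests indicate that the restored file is byte-for-byte identical to the FASTQ file to be processed; any difference in the digests indicates a loss during storage or restoration.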
A FASTQ data processing device, characterized by comprising:

a splitting unit configured to split a FASTQ file to be processed into at least one sequence unit according to a preset data format, wherein each sequence unit includes four row sequences, and different row sequences correspond to different row sequence identifiers;

a storage unit configured to store the four row sequences in four corresponding preset files according to the row sequence identifiers, wherein one preset file is used to store the row sequences that have the same row sequence identifier and belong to different sequence units;

a compression unit configured to trigger a preset lossless compression command to compress the four preset files respectively.
An electronic device, characterized by including:

at least one processor; and

a memory communicatively connected to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1-17.
A non-transitory computer-readable storage medium storing computer instructions, characterized in that the computer instructions are used to cause the computer to execute the method according to any one of claims 1-17.
A computer program product, characterized by comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-17.