CN115083530A

CN115083530A - Gene sequencing data compression method and device, terminal equipment and storage medium

Info

Publication number: CN115083530A
Application number: CN202211003550.XA
Authority: CN
Inventors: 陈墩金; 王阳开; 毕星浩; 林凯翔; 张力; 孙齐胜
Original assignee: Guangzhou Mingling Gene Technology Co ltd
Current assignee: Guangzhou Mingling Gene Technology Co ltd
Priority date: 2022-08-22
Filing date: 2022-08-22
Publication date: 2022-09-20
Anticipated expiration: 2042-08-22
Also published as: CN115083530B

Abstract

The application discloses a gene sequencing data compression method and device, terminal equipment and a storage medium. A gene sequencing data text to be compressed is obtained; the text is segmented to obtain sequence identifier data, base sequence data and mass number (quality value) sequence data; the text is processed according to the mass number sequence data to obtain a read length; the base sequence is matched against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data; prediction processing is performed according to a pre-established cross prediction model to obtain a prediction result; and the text is arithmetically compressed according to the prediction result. The high compression ratio reduces storage and transmission costs.

Description

Gene sequencing data compression method and device, terminal equipment and storage medium

Technical Field

The application belongs to the technical field of gene detection, and particularly relates to a gene sequencing data compression method, a gene sequencing data compression device, terminal equipment and a storage medium.

Background

Next-generation sequencing technology has accelerated genomics research. It can generate a large amount of gene sequence data at extremely low cost, and as sequencing costs fall further, the volume of gene sequence data is growing explosively, which puts huge pressure on storage. Compressing gene sequence data is an effective way to address this problem.

The standard text storage format for gene sequencing data is the FASTQ format. In FASTQ format, each sequencing read is represented by 4 lines: the first line is the sequence identifier; the second line is the base sequence of the read; the third line is a spare line for additional information and is normally empty apart from its line break; the fourth line holds the mass numbers (quality values) of the bases, which correspond one-to-one with the bases in the second line and indicate the confidence of each base. The length of the base sequence in a read is called the read length and is denoted BP.
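The four-line layout described above can be illustrated with a minimal Python sketch. The names `FastqRecord` and `parse_fastq` are hypothetical, and the `@` prefix on the identifier line and the `+` on the spare line follow common FASTQ practice rather than anything stated in this text.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading '@' (assumed convention)
    bases: str       # line 2, the base sequence
    qualities: str   # line 4, one mass number (quality) character per base


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield one record for every group of four lines."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain a multiple of 4 lines")
    for i in range(0, len(lines), 4):
        header, bases, _spare, qualities = lines[i:i + 4]
        if not header.startswith("@"):
            raise ValueError(f"record at line {i + 1} lacks an '@' header")
        if len(bases) != len(qualities):
            raise ValueError(f"record at line {i + 1}: base/quality length mismatch")
        yield FastqRecord(header[1:], bases, qualities)


sample = "@read1\nACGTN\n+\nIIII#\n@read2\nGGCA\n+\nFFFF\n"
records = list(parse_fastq(sample))
# The read length is simply the number of bases in each record.
read_lengths = [len(r.bases) for r in records]
```

Because the bases and mass numbers correspond one-to-one, the read length can be taken from either line.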

Since the whole-genome reference base sequence of a typical species is known, a matching process is performed first when gene sequencing data is compressed and stored. For each subsequence of the base sequence, the matching process detects whether the reference sequence contains a subsequence that is identical to it, or that differs from it in fewer bases than a preset limit. If such a subsequence exists, the positions of the two subsequences, their common length and the small number of differences are recorded; this is called a match. When the base sequence is stored, suitable matches can be stored in place of the corresponding subsequences, which greatly reduces the number of bases that need to be stored. The part of the base sequence that cannot be described by matches is called the residual sequence of the base sequence.

In theory, adaptive arithmetic coding is among the coding methods with the highest compression ratios. To apply it, a model is designed for the data; from the preceding data (the "above"), the model estimates the probability that the next bit is 0 or 1, and arithmetic coding is performed according to that probability. The more accurate the model's prediction, the better the compression performance. However, because context prediction models carry a high computational overhead, there is currently no mature method for applying adaptive arithmetic coding to gene sequencing data compression.
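The link between prediction accuracy and compression can be sketched numerically. An arithmetic coder spends roughly -log2(p) bits on a symbol to which the model assigned probability p, so the example below sums that ideal cost instead of implementing the coder itself. The class `AdaptiveBitModel` and the helper `ideal_code_length` are hypothetical names used only for illustration, and the simple counting model is an assumption, not the model of this application.

```python
import math


class AdaptiveBitModel:
    """Counts how often 0 and 1 followed each context of `order` previous bits."""

    def __init__(self, order):
        self.order = order
        self.counts = {}  # context tuple -> [count of 0, count of 1]

    def p_one(self, context):
        zeros, ones = self.counts.get(context, [1, 1])  # start from a uniform prior
        return ones / (zeros + ones)

    def update(self, context, bit):
        self.counts.setdefault(context, [1, 1])[bit] += 1


def ideal_code_length(bits, order):
    """Total of -log2(p) over the bits, with the model adapting as it goes."""
    model = AdaptiveBitModel(order)
    total = 0.0
    for i, bit in enumerate(bits):
        context = tuple(bits[max(0, i - order):i])
        p1 = model.p_one(context)
        total += -math.log2(p1 if bit else 1.0 - p1)
        model.update(context, bit)  # the decoder can repeat this update exactly
    return total


bits = [0, 1] * 64                       # a strictly alternating sequence
cost_order0 = ideal_code_length(bits, 0)  # model ignores the preceding bit
cost_order1 = ideal_code_length(bits, 1)  # model sees the preceding bit
```

On this alternating input the model that sees one preceding bit needs far fewer bits than the model that sees none, which is the sense in which a more accurate prediction compresses better.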

Disclosure of Invention

The invention aims to provide a gene sequencing data compression method, a gene sequencing data compression device, a terminal device and a storage medium, so as to solve the defects in the prior art, and the technical problem to be solved by the invention is realized by the following technical scheme.

In a first aspect, the embodiments of the present invention provide a method for compressing gene sequencing data, the method including:

acquiring a gene sequencing data text to be compressed;

segmenting the gene sequencing data text to be compressed to obtain sequence identifier data, base gene sequence data and mass number sequence data;

processing the gene sequencing data text to be compressed according to the mass number sequence data to obtain the read length of the gene sequencing data text to be compressed;

matching the base sequence according to a preset reference sequence to obtain sequence matching record data and sequence residual sequence data;

and according to a pre-established cross prediction model, performing prediction processing on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the quality number sequence data to obtain a prediction result, and performing arithmetic compression on the gene sequencing data text to be compressed according to the prediction result.

Optionally, the pre-established cross prediction model comprises at least a sequence identifier prediction model, a base sequence matching prediction module, a residual sequence prediction module and a mass number sequence prediction module. These four are connected by a state machine, their outputs are connected to a multiplexer, and the output of the multiplexer is connected to an arithmetic coding unit.

Optionally, the sequence identifier prediction model predicts the following text of the sequence identifier from the preceding text of the sequence identifier;

the base sequence matching prediction module predicts the following text of the base sequence matching record from the preceding text of the base sequence matching record;

the residual sequence prediction module predicts the following text of the residual sequence from the base sequence matching record and the preceding text of the residual sequence;

the mass number sequence prediction module predicts the following text of the mass number sequence from the matching record, the residual sequence and the preceding text of the mass number sequence.

Optionally, the processing the to-be-compressed gene sequencing data text according to the mass number sequence data to obtain a read length of the to-be-compressed gene sequencing data text includes:

and compressing the gene sequencing data file to be compressed by adopting a dictionary method to obtain the read length of the gene sequencing data file to be compressed.

In a second aspect, embodiments of the present invention provide a gene sequencing data compression apparatus, including:

the acquisition module is used for acquiring a gene sequencing data text to be compressed;

the determining module is used for segmenting the gene sequencing data text to be compressed to obtain sequence identifier data, base gene sequence data and mass number sequence data;

the processing module is used for processing the gene sequencing data text to be compressed according to the mass number sequence data to obtain the read length of the gene sequencing data text to be compressed;

the matching module is used for matching the base sequence according to a preset reference sequence to obtain sequence matching record data and sequence residual sequence data;

and the prediction module is used for performing prediction processing on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the mass number sequence data according to a pre-established cross prediction model to obtain a prediction result, and performing arithmetic compression on the gene sequencing data text to be compressed according to the prediction result.

Optionally, the sequence identifier prediction model predicts the following text of the sequence identifier from the preceding text of the sequence identifier;

the base sequence matching prediction module predicts the following text of the base sequence matching record from the preceding text of the base sequence matching record;

Optionally, the processing module is configured to:

In a third aspect, an embodiment of the present invention provides a terminal device, including: at least one processor and memory;

the memory stores a computer program; the at least one processor executes the computer program stored in the memory to implement the gene sequencing data compression method provided in the first aspect.

In a fourth aspect, the embodiments of the present invention provide a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed, the method for compressing gene sequencing data provided in the first aspect is implemented.

The embodiment of the invention has the following advantages:

according to the gene sequencing data compression method, device, terminal equipment and storage medium, a gene sequencing data text to be compressed is obtained; the text is segmented to obtain sequence identifier data, base sequence data and mass number sequence data; the text is processed according to the mass number sequence data to obtain its read length; the base sequence is matched against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data; and, according to a pre-established cross prediction model, prediction processing is performed on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the mass number sequence data to obtain a prediction result, and the text is arithmetically compressed according to the prediction result.

Drawings

In order to more clearly illustrate the embodiments or prior art solutions of the present application, the drawings needed for describing the embodiments or prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present application, and that other drawings can be obtained by those skilled in the art without inventive exercise.

FIG. 1 is a schematic flow chart of a method for compressing gene sequencing data according to an embodiment of the present disclosure;

FIG. 2 is a schematic flow chart of another method for compressing gene sequencing data according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of a cross-prediction model according to an embodiment of the present application;

FIG. 4 is a diagram illustrating matching records in an embodiment of the present application;

FIG. 5 is a diagram showing a base sequence for predicting a mass number sequence in an example of the present application;

FIG. 6 is a diagram illustrating a state machine connecting independent predictive models into a composite model according to an embodiment of the present application;

FIG. 7 is a block diagram of a gene sequencing data compression apparatus according to an embodiment of the present application;

FIG. 8 is a block diagram showing the structure of an embodiment of a gene sequencing data compression apparatus according to the present invention;

FIG. 9 is a schematic structural diagram of a terminal device of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following embodiments and accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

An embodiment of the invention provides a gene sequencing data compression method for compressing gene sequences. The method is executed by a gene sequencing data compression device arranged on a terminal device; the terminal device may be, for example, a computer terminal.

Referring to fig. 1, a flow chart of steps of an embodiment of a method for compressing gene sequencing data according to the present invention is shown, and the method may specifically include the following steps:

S101, obtaining a gene sequencing data text to be compressed;

Specifically, the terminal device acquires a gene sequencing data text to be compressed, wherein the text is in FASTQ format. To compress the text, base sequences that repeat need to be matched with the sequences they repeat.

S102, segmenting a gene sequencing data text to be compressed to obtain sequence identifier data, base gene sequence data and mass number sequence data;

The terminal device decomposes the components of the FASTQ data in the gene sequencing data text to be compressed into 4 files, obtaining the sequence identifier data, the base sequence data and the mass number sequence data. Different types of data are written to different files:

1. the sequencing sequence identifiers form the file FXH;

2. the matching records of the base sequences against the reference sequence form the file FXA;

3. the residual sequences of the base sequences form the file FXB;

4. the mass number sequence data form the file FXQ.

S103, processing the gene sequencing data text to be compressed according to the mass number sequence data to obtain the read length of the gene sequencing data text to be compressed;

Specifically, the terminal device compresses the sequence read lengths using a dictionary method, and the compressed read lengths form the read-length part of the output file.
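The text does not specify how the dictionary method works. One plausible reading, shown here purely as an assumption, is that each distinct read length is stored once in a table and every read refers to the table by index; the function names below are hypothetical.

```python
def encode_read_lengths(lengths):
    """Replace each read length by its index in a table of distinct lengths."""
    dictionary = []  # distinct lengths, in order of first appearance
    index_of = {}    # length -> position in `dictionary`
    indices = []
    for length in lengths:
        if length not in index_of:
            index_of[length] = len(dictionary)
            dictionary.append(length)
        indices.append(index_of[length])
    return dictionary, indices


def decode_read_lengths(dictionary, indices):
    """Invert encode_read_lengths."""
    return [dictionary[i] for i in indices]


lengths = [150, 150, 150, 148, 150, 101]
dictionary, indices = encode_read_lengths(lengths)
restored = decode_read_lengths(dictionary, indices)
```

Sequencing runs typically produce only a few distinct read lengths, so under this assumption the index stream is highly repetitive and cheap to store.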

S104, matching the base sequences according to a preset reference sequence to obtain sequence matching recorded data and sequence residual sequence data;

Specifically, since the whole-genome reference base sequence of a typical species is known, a matching process is performed first when gene sequencing data is compressed and stored. For each subsequence of the base sequence, the process detects whether the reference sequence contains a subsequence that is identical to it, or that differs from it in fewer bases than a preset limit. If such a subsequence exists, the positions of the two subsequences, their common length and the small number of differences are recorded; this is called a match. When the base sequence is stored, suitable matches can be stored in place of the corresponding subsequences, which greatly reduces the number of bases to be stored. The part of the base sequence that cannot be described by matches is called the residual sequence of the base sequence. In this way the sequence matching record data and the sequence residual sequence data are obtained separately.
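A minimal sketch of such a matching step follows. It is a simplification under stated assumptions: only one match per read is attempted, the match is seeded by an exact lookup of the first `seed_len` bases, and the names `Match`, `match_read`, `reconstruct`, `seed_len` and `max_mismatches` are hypothetical. The text does not describe the search strategy actually used.

```python
from typing import NamedTuple


class Match(NamedTuple):
    read_pos: int  # where the match starts in the read
    ref_pos: int   # where the match starts in the reference
    length: int    # common length of the two subsequences
    diffs: list    # (offset within the match, base found in the read)


def match_read(read, reference, seed_len=4, max_mismatches=1):
    """Return (match or None, residual bases not covered by the match)."""
    ref_pos = reference.find(read[:seed_len])
    if len(read) < seed_len or ref_pos < 0:
        return None, read  # nothing matched: the whole read is residual
    length, diffs = seed_len, []
    while length < len(read) and ref_pos + length < len(reference):
        if read[length] != reference[ref_pos + length]:
            if len(diffs) == max_mismatches:
                break  # preset limit reached: the match ends here
            diffs.append((length, read[length]))
        length += 1
    return Match(0, ref_pos, length, diffs), read[length:]


def reconstruct(match, residual, reference):
    """Rebuild the read from the match record, the reference and the residual."""
    if match is None:
        return residual
    bases = list(reference[match.ref_pos:match.ref_pos + match.length])
    for offset, base in match.diffs:
        bases[offset] = base
    return "".join(bases) + residual


reference = "TTACGTACGGATCCA"
read = "ACGTACGCATGG"
match, residual = match_read(read, reference)
```

Here ten of the twelve bases are replaced by one match record with a single recorded difference, and only the last two bases remain as the residual sequence; `reconstruct` shows that the read can be recovered exactly.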

And S105, according to a pre-established cross prediction model, performing prediction processing on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the quality number sequence data to obtain a prediction result, and performing arithmetic compression on the gene sequencing data text to be compressed according to the prediction result.

Specifically, the pre-established cross prediction models in the embodiment of the present invention are obtained by respectively establishing independent prediction models for 4 files, i.e., a sequencing sequence identifier composition file FXH, a matching record FXA of a base sequence in a reference sequence, a residual sequence FXB of the base sequence, and mass number sequence data FXQ, and encoding each prediction model by using an adaptive arithmetic coding method to obtain the cross prediction model, wherein each prediction model is obtained according to context.

And the terminal equipment performs arithmetic compression on the gene sequencing data text to be compressed according to the prediction probability.

The embodiment of the application divides a gene sequencing data text to be compressed into three components of sequence identifier data, base sequence data and mass number sequence data.

The read length is first compressed separately using a conventional method. The base sequence is then matched against a preset reference sequence, which splits the base sequence component into two components: sequence matching record data and sequence residual sequence data. According to the pre-established cross prediction model, the following text of each of the four components (sequence identifier data, sequence matching record data, sequence residual sequence data and mass number sequence data) is predicted from its preceding text, the read length and the other components. The gene sequencing data text to be compressed is then arithmetically compressed according to the predicted probabilities, and the high compression ratio reduces storage and transmission costs.

Using the pre-established cross prediction model, the terminal device predicts the residual base sequence and uses the base information to predict the mass number information, and then arithmetically compresses the gene sequencing data text to be compressed according to the predicted probabilities. Compared with an undecomposed FASTQ file, the statistical properties within each subfile are uniform and its regularities are simple, so the statistical model of each file can achieve good predictions without being overly complex.

The gene sequencing data text to be compressed is thus divided into three components: sequence identifier data, base sequence data and mass number sequence data. The read length is first compressed separately; the base sequence is matched against a preset reference sequence, which yields two components from the base sequence component: sequence matching record data and sequence residual sequence data. According to the pre-established cross prediction model, the following text of each of the four components (sequence identifier data, sequence matching record data, sequence residual sequence data and mass number sequence data) is predicted from the read length and the preceding text of each component to obtain a prediction result; the text is arithmetically compressed according to the prediction result, and the high compression ratio reduces storage and transmission costs.

According to the gene sequencing data compression method provided by the embodiment of the invention, a gene sequencing data text to be compressed is obtained; the text is segmented to obtain sequence identifier data, base sequence data and mass number sequence data; the text is processed according to the mass number sequence data to obtain its read length; the base sequence is matched against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data; and, according to a pre-established cross prediction model, prediction processing is performed on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the mass number sequence data to obtain a prediction result, and the text is arithmetically compressed according to the prediction result.

The present invention further provides a supplementary explanation of the gene sequencing data compression method provided in the above embodiments.

Optionally, the pre-established cross prediction model comprises at least a sequence identifier prediction model, a base sequence matching prediction module, a residual sequence prediction module and a mass number sequence prediction module. These four are connected through a state machine, their outputs are connected to a multiplexer, and the output of the multiplexer is connected to the arithmetic coding unit.

The embodiment of the invention adopts a cross prediction model and predicts the residual base sequence by using the matching information; predicting mass number information by using the base information; and connecting a plurality of independent prediction models into a composite model by adopting a finite state machine to obtain the cross prediction model.

Optionally, the sequence identifier prediction model predicts the following text of the sequence identifier from the preceding text of the sequence identifier;

the residual sequence prediction module is determined according to the context of the residual sequence and the base sequence matching record;

the mass number sequence prediction module is determined based on the context of mass number sequence data, base sequence matching records, and the remaining sequence.

Specifically, each prediction model predicts the following text from the preceding text. The preceding text (the "above") is the text that has already been read, and the following text (the "below") is the text that is to be read next.

The prediction model of the FXH file relies solely on the file's own preceding text.

The prediction model of the FXA file likewise depends only on the file's own preceding text.

The prediction model of the FXB file relies on the matching records FXA in addition to its own preceding text. A match is terminated when the mismatches in the alignment exceed the design limit, but the portion of the reference sequence that extends beyond the match has a high probability of remaining highly similar to the read. The extension of the matching record on the reference sequence is therefore introduced to help predict the residual sequence.
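The idea of using the extension of a match as a hint can be sketched as follows. The function `predict_residual` and the example sequences are hypothetical; how the hint is weighted inside the actual probability model is not described in the text.

```python
def predict_residual(reference, match_ref_pos, match_len, residual_len):
    """Return the reference bases that lie just beyond the end of the match."""
    start = match_ref_pos + match_len
    return reference[start:start + residual_len]


reference = "GATTACAGGCTTAACCGT"
# Assume a match covered reference[0:7] ("GATTACA") and then terminated.
residual = "GCCTTAAC"  # the bases of the read that follow the match
predicted = predict_residual(reference, 0, 7, len(residual))
# Count the positions where the extension guessed the residual base correctly.
hits = sum(a == b for a, b in zip(predicted, residual))
```

In this example the extension agrees with the residual at seven of eight positions, so a model that treats the extension as context can assign a high probability to most residual bases.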

The prediction model of the FXQ file relies on the sequenced base sequence (FXA and FXB) in addition to its own preceding text. The following can generally be inferred: the mass number of a mixed base is necessarily low, and the mass numbers corresponding to longer runs of consecutive identical bases are slightly lower. Introducing the base sequence therefore helps predict the mass number sequence.
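These two observations suggest base-derived context features for the mass number model. The sketch below is only an illustration: it assumes that a mixed base is written as `N`, and the function name `quality_contexts` and the cap `max_run` are hypothetical.

```python
def quality_contexts(bases, max_run=3):
    """For each base return (is_mixed_base, capped run length of identical bases)."""
    contexts = []
    run = 0
    for i, base in enumerate(bases):
        # Extend the run if this base repeats the previous one, else restart it.
        run = run + 1 if i > 0 and base == bases[i - 1] else 1
        contexts.append((base == "N", min(run, max_run)))
    return contexts


contexts = quality_contexts("AAAANC")
```

Each pair could then select a separate set of adaptive statistics for the mass number at that position, so that low values after `N` or inside long runs are learned as the expected case rather than as surprises.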

Specifically, the independent prediction models compress the four components (FXH identifiers, FXA matching records, FXB residual bases and FXQ mass numbers) by predicting each of them independently.

The cross prediction model is: FXA is used to predict FXB, while FXA and FXB are used to predict FXQ.

Optionally, processing the to-be-compressed gene sequencing data text according to the mass number sequence data to obtain a read length of the to-be-compressed gene sequencing data text, including:

Fig. 2 is a schematic flow chart of another gene sequencing data compression method according to an embodiment of the present application, where the gene sequencing data compression method includes:

Step 1, traversing all gene sequencing data text files to be compressed and performing format detection on them; if the format of a text file conforms to the standard FASTQ format, some statistical information is recorded, namely the sequence identifiers, base sequences and mass number sequences are determined.

Step 2, compressing the sequence read lengths using a dictionary method to form the read-length part of the output file.

Step 3, matching the base sequences in the FASTQ format against a preset reference sequence to obtain matching records and residual sequences.

Step 4, decomposing the information of the FASTQ file into 4 component files, and establishing independent prediction models for the different component files, namely an FXH model, an FXA model, an FXB model and an FXQ model:

1. FXH sequencing sequence identifier

2. FXA base sequence matching record

3. FXB base sequence residual sequence

4. FXQ mass number sequence

A crossed context prediction model is established for each of the 4 files. "Crossed" means that when a prediction model is built for a file with a later ordinal number, the contents of files with earlier ordinal numbers can be referenced in addition to the contents of the file itself. Such cross-referencing is possible because the four files are compressed and decompressed in sequence.

Step 5, connecting the 4 independent prediction models into a composite model for arithmetic coding to obtain the cross prediction model, and outputting a compressed file according to the read lengths and a state machine-based multiplexer.

FIG. 3 is a schematic diagram of a cross-prediction model according to an embodiment of the present application; the cross prediction model specifically comprises:

the prediction model of the FXH file relies solely on the file's own preceding text.

The prediction model of the FXA file likewise depends only on the file's own preceding text.

The prediction model of the FXB file relies on the matching records FXA in addition to its own preceding text. Consider the following scenario: a match is terminated when the mismatches in the alignment exceed the design limit, but the portion of the reference sequence extending beyond the match has a high probability of remaining similar to the read. The extension of the matching record on the reference sequence is therefore introduced to help predict the residual sequence.

Fig. 4 is a diagram illustrating matching records according to an embodiment of the present application.

FIG. 5 is a diagram showing the prediction of a mass number sequence from a base sequence in one embodiment of the present application: consecutive repeated bases lead to a decrease in mass number, and a mixed base implies a lower mass number.

FIG. 6 is a diagram illustrating a state machine connecting independent prediction models into a composite model according to an embodiment of the present application. A state machine-based multiplexer is used: a state machine with 4 states is introduced, and each state switches in the corresponding one of the 4 prediction models. The 4 prediction models are thereby connected into a composite model. Coding with 4 independent models would require 4 arithmetic coders and 4 output buffers, whereas the composite model requires only one arithmetic coder and one output buffer, which reduces code complexity and the consumption of computing resources.

S0: FXH is switched in; on encountering a separator symbol, jump to S1;

S1: FXA is switched in; after a fixed number of characters has passed, jump to S2;

S2: FXB is switched in; the residual length is obtained by subtracting the matched length from the read length, and after that number of characters has passed, jump to S3;

S3: FXQ is switched in; the number of characters is known from the read length, and after that number of characters has passed, jump back to S0.
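The four states can be sketched as a small routing function that decides which prediction model handles each symbol of an interleaved stream. Several simplifications are assumptions made only for illustration: the match record has a hypothetical fixed width `FXA_WIDTH`, the separator is a newline, and the read length and matched length are passed in as constants, whereas in practice they would differ per read.

```python
from itertools import groupby

FXA_WIDTH = 4     # hypothetical fixed size of one match record, in characters
SEPARATOR = "\n"  # hypothetical symbol that ends the identifier


def route_symbols(stream, read_length, matched_length):
    """Yield (model name, symbol) pairs, switching models as the states change."""
    quotas = {"S1": FXA_WIDTH, "S2": read_length - matched_length, "S3": read_length}
    model_of = {"S0": "FXH", "S1": "FXA", "S2": "FXB", "S3": "FXQ"}
    next_state = {"S0": "S1", "S1": "S2", "S2": "S3", "S3": "S0"}

    def advance(state):
        # Move to the next state, skipping any state that expects zero symbols.
        state = next_state[state]
        while state != "S0" and quotas[state] == 0:
            state = next_state[state]
        return state, quotas.get(state, 0)

    state, remaining = "S0", 0
    for symbol in stream:
        yield model_of[state], symbol
        if state == "S0":
            if symbol == SEPARATOR:  # the identifier ends at the separator
                state, remaining = advance(state)
        else:
            remaining -= 1
            if remaining == 0:       # the quota of this state is used up
                state, remaining = advance(state)


stream = "r1\n0003GGIIII#" + "r2\n0003TAFFFFF"
routed = [(model, "".join(symbol for _, symbol in group))
          for model, group in groupby(route_symbols(stream, 5, 3), key=lambda p: p[0])]
```

Because every symbol is routed to exactly one model, a single arithmetic coder can consume the probabilities of whichever model is currently switched in, which is the saving in coders and buffers described for the composite model.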

FIG. 7 is a block diagram of a gene sequencing data compression apparatus according to an embodiment of the present application. The embodiment is a multithreaded gene data compression program with a visual interface, comprising parameter acquisition, segmentation of files into block files, and a multithread management module, wherein the multithread management module includes the compression/decompression main body module.

Hardware: personal computer

Software: CYGWIN_NT-10.0 / mingw64-x86_64-gcc-g++ (11.2.0-1)

Test environment:

Hardware: Intel Xeon CPU E5-2678 v3 @ 2.50 GHz

Software: CentOS Linux release 7.6.1810 / g++ (Red Hat 4.8.5-44)

The embodiment of the invention provides a visual interface for acquiring or automatically filling in 5 parameters: mode selection (compression or decompression), reference sequence file, number of threads, binary file name and text file name. During compression the text file is the input and the binary file is the output; during decompression the reverse holds. When a large file is compressed, every 128M of text content is divided into one block subfile for multi-threaded compression.
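The block-wise multi-threaded scheme can be sketched as follows. The sketch makes several assumptions: blocks are cut only at record boundaries, a tiny `block_bytes` stands in for the 128M block size, and `zlib` is used as a stand-in compressor so that the example runs; it is not the compression method of this application. The function names are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(lines, block_bytes):
    """Group FASTQ lines into blocks of whole 4-line records of about block_bytes."""
    blocks, current, size = [], [], 0
    for i in range(0, len(lines), 4):
        record = lines[i:i + 4]
        record_size = sum(len(line) + 1 for line in record)  # +1 for each newline
        if current and size + record_size > block_bytes:
            blocks.append("\n".join(current) + "\n")
            current, size = [], 0
        current.extend(record)
        size += record_size
    if current:
        blocks.append("\n".join(current) + "\n")
    return blocks


def compress_blocks(blocks, threads):
    """Compress each block in its own task; zlib is only a stand-in compressor."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda block: zlib.compress(block.encode("ascii")), blocks))


text = "".join(f"@r{i}\nACGT\n+\nIIII\n" for i in range(6))
blocks = split_into_blocks(text.splitlines(), block_bytes=40)
compressed = compress_blocks(blocks, threads=2)
```

Since each block holds whole records, the blocks can be compressed and later decompressed independently and concatenated in order to restore the original text.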

It should be noted that for simplicity of description, the method embodiments are shown as a series of combinations of acts, but those skilled in the art will recognize that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.

Another embodiment of the present invention provides a gene sequencing data compression apparatus, which is used for implementing the gene sequencing data compression method provided in the above embodiments.

Referring to fig. 8, a block diagram of an embodiment of the present invention is shown, wherein the apparatus may specifically include the following modules: an obtaining module 801, a determining module 802, a processing module 803, a matching module 804 and a predicting module 805, wherein:

the acquisition module 801 is used for acquiring a gene sequencing data text to be compressed;

the determining module 802 is configured to segment a gene sequencing data text to be compressed to obtain sequence identifier data, base gene sequence data, and mass number sequence data;

the processing module 803 is configured to process the gene sequencing data text to be compressed according to the mass number sequence data to obtain a read length of the gene sequencing data text to be compressed;

the matching module 804 is used for matching the base sequence against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data;

the prediction module 805 is configured to perform prediction processing on the sequence identifier data, the sequence matching record data, the sequence residual sequence data, and the mass number sequence data according to a pre-established cross prediction model to obtain a prediction result, and perform arithmetic compression on the gene sequencing data text to be compressed according to the prediction result.

According to the gene sequencing data compression device provided by the embodiment of the invention, a gene sequencing data text to be compressed is obtained; the text is segmented to obtain sequence identifier data, base sequence data and mass number sequence data; the text is processed according to the mass number sequence data to obtain its read length; the base sequence is matched against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data; and, according to a pre-established cross prediction model, prediction processing is performed on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the mass number sequence data to obtain a prediction result, and the text is arithmetically compressed according to the prediction result.

This embodiment further supplements the description of the gene sequencing data compression apparatus provided in the above embodiment.

Optionally, the sequence identifier prediction model predicts the following text of the sequence identifier from the preceding text of the sequence identifier;

the mass number sequence prediction module makes its prediction based on the preceding text of the mass number sequence data, the base sequence matching record, and the residual sequence.
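As a hedged illustration of predicting what follows from what precedes, the sketch below uses an adaptive order-k frequency model. This is a generic context-modeling technique swapped in for illustration, not the model of the embodiment: it conditions only on the stream's own preceding symbols, whereas the mass number sequence prediction module described above also draws on the base sequence matching record and the residual sequence. The class name `ContextModel` and its parameters are hypothetical.

```python
from collections import defaultdict
from typing import Dict


class ContextModel:
    """Adaptive order-k model: estimates the next symbol from the k preceding symbols."""

    def __init__(self, alphabet: str, order: int = 2) -> None:
        self.alphabet = alphabet
        self.order = order
        # counts[context][symbol] = number of times symbol followed context
        self.counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

    def _context(self, preceding: str) -> str:
        # Guard order 0, because preceding[-0:] would return the whole string.
        return preceding[-self.order:] if self.order else ""

    def predict(self, preceding: str) -> Dict[str, float]:
        """Return a probability for every symbol, using add-one smoothing."""
        seen = self.counts[self._context(preceding)]
        total = sum(seen.values()) + len(self.alphabet)
        return {s: (seen[s] + 1) / total for s in self.alphabet}

    def update(self, preceding: str, symbol: str) -> None:
        """Record that `symbol` followed the current context."""
        self.counts[self._context(preceding)][symbol] += 1


model = ContextModel(alphabet="IH", order=2)
quality = "IIHIIHIIHII"
for i, sym in enumerate(quality):
    model.update(quality[:i], sym)

# After "II" the training data was always followed by "H".
probs = model.predict("II")
print(probs)
```

The sharper the predicted distribution, the fewer bits an arithmetic coder needs for the symbol that actually occurs, which is why a predictor is paired with the coder.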

Optionally, the processing module is configured to:

Since the device embodiment is substantially similar to the method embodiment, its description is relatively brief; for relevant details, reference may be made to the corresponding description of the method embodiment.

Still another embodiment of the present invention provides a terminal device, configured to execute the gene sequencing data compression method provided in the foregoing embodiment.

Fig. 9 is a schematic structural diagram of a terminal device of the present invention. As shown in Fig. 9, the terminal device includes: at least one processor 901 and a memory 902;

the memory stores a computer program; the at least one processor executes the computer program stored in the memory to implement the gene sequencing data compression method provided by the above embodiments.

The terminal device provided by this embodiment obtains a gene sequencing data text to be compressed; segments the gene sequencing data text to be compressed to obtain sequence identifier data, base gene sequence data, and mass number sequence data; processes the gene sequencing data text to be compressed according to the mass number sequence data to obtain the read length of the gene sequencing data text to be compressed; matches the base gene sequence data against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data; and, according to a pre-established cross prediction model, performs prediction processing on the sequence identifier data, the sequence matching record data, the sequence residual sequence data, and the mass number sequence data to obtain a prediction result, and performs arithmetic compression on the gene sequencing data text to be compressed according to the prediction result.
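The matching step named above, which splits the base data into sequence matching records and residual sequence data, can be sketched as follows. This is a deliberately simplified assumption: it accepts only exact whole-read matches found with `str.find`, whereas a practical matcher would likely tolerate mismatches and partial alignments. The record layout `(read_index, reference_offset, length)` and the function names are hypothetical.

```python
from typing import List, Tuple

MatchRecord = Tuple[int, int, int]  # (read_index, reference_offset, length)


def match_reads(reads: List[str], reference: str) -> Tuple[List[MatchRecord], List[str]]:
    """Split reads into match records and residual reads.

    Reads with no exact occurrence in the reference are kept verbatim
    as residual data.
    """
    records: List[MatchRecord] = []
    residual: List[str] = []
    for idx, read in enumerate(reads):
        offset = reference.find(read)
        if offset >= 0:
            records.append((idx, offset, len(read)))
        else:
            residual.append(read)
    return records, residual


def restore_reads(records: List[MatchRecord], residual: List[str],
                  reference: str, count: int) -> List[str]:
    """Invert match_reads, confirming that the split loses no information."""
    matched = {idx: reference[off:off + n] for idx, off, n in records}
    leftovers = iter(residual)
    return [matched[i] if i in matched else next(leftovers) for i in range(count)]


reference = "ACGTACGTTGACAGGCT"
reads = ["ACGTTG", "GGGGGG", "GACAGG"]
records, residual = match_reads(reads, reference)
print(records, residual)
print(restore_reads(records, residual, reference, len(reads)) == reads)
```

A matched read shrinks to a few integers, so only the residual reads still have to be coded base by base.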

In another embodiment of the present application, a computer-readable storage medium is provided, in which a computer program is stored, and when the computer program is executed, the method for compressing gene sequencing data provided in any of the above embodiments is implemented.

The computer program stored in the computer-readable storage medium of this embodiment, when executed, obtains a gene sequencing data text to be compressed; segments the gene sequencing data text to be compressed to obtain sequence identifier data, base gene sequence data, and mass number sequence data; processes the gene sequencing data text to be compressed according to the mass number sequence data to obtain the read length of the gene sequencing data text to be compressed; matches the base gene sequence data against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data; and, according to a pre-established cross prediction model, performs prediction processing on the sequence identifier data, the sequence matching record data, the sequence residual sequence data, and the mass number sequence data to obtain a prediction result, and performs arithmetic compression on the gene sequencing data text to be compressed according to the prediction result.
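To show how a prediction result can drive arithmetic compression, the following toy coder narrows an interval once per symbol using the probabilities of a simple adaptive model. It is a sketch under stated assumptions: exact rational arithmetic from Python's `fractions` module stands in for the fixed-precision integer arithmetic a practical coder would use, the four-letter alphabet and the count-based predictor are illustrative stand-ins for the cross prediction model, and all names are hypothetical.

```python
from fractions import Fraction
from typing import Dict, List, Tuple

ALPHABET = "ACGT"  # assumption: only these four symbols occur


def predict(counts: Dict[str, int]) -> List[Tuple[str, Fraction, Fraction]]:
    """Turn adaptive counts into a cumulative probability range per symbol."""
    total = sum(counts.values())
    ranges, low = [], Fraction(0)
    for sym in ALPHABET:
        high = low + Fraction(counts[sym], total)
        ranges.append((sym, low, high))
        low = high
    return ranges


def encode(message: str) -> Fraction:
    """Narrow [low, low + width) per symbol; return a number inside the final interval."""
    counts = {s: 1 for s in ALPHABET}
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        for s, lo, hi in predict(counts):
            if s == sym:
                low, width = low + width * lo, width * (hi - lo)
                break
        counts[sym] += 1  # the model adapts after each coded symbol
    return low + width / 2


def decode(code: Fraction, length: int) -> str:
    """Replay the same predictions to recover the symbols from the code value."""
    counts = {s: 1 for s in ALPHABET}
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        target = (code - low) / width
        for s, lo, hi in predict(counts):
            if lo <= target < hi:
                out.append(s)
                low, width = low + width * lo, width * (hi - lo)
                counts[s] += 1
                break
    return "".join(out)


message = "ACGTTTTTAC"
code = encode(message)
print(decode(code, len(message)) == message)
```

Because the decoder rebuilds the same counts in the same order, encoder and decoder stay synchronized without the model itself being transmitted.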

It should be noted that the above detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular is intended to include the plural unless the context clearly indicates otherwise. Furthermore, it will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in other sequences than those illustrated or otherwise described herein.

Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not explicitly listed or inherent to such process, method, article, or apparatus.

Spatially relative terms, such as "above …," "above," and the like, may be used herein for ease of description to describe the spatial relationship of one device or feature to another device or feature as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if a device in the figures is turned over, devices described as "above" or "on" other devices or configurations would then be oriented "below" or "under" the other devices or configurations. Thus, the exemplary term "above …" can include both an orientation of "above …" and an orientation of "below …". The device may also be oriented in other ways, such as rotated 90 degrees or at other orientations, and the spatially relative descriptors used herein are interpreted accordingly.

In the foregoing detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, like numerals typically identify like components, unless context dictates otherwise. The illustrated embodiments described in the detailed description, drawings, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented here.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method of gene sequencing data compression, the method comprising:

acquiring a gene sequencing data text to be compressed;

2. The method of claim 1, wherein the pre-established cross-prediction model comprises at least a sequence identifier prediction model, a base sequence matching prediction module, a residual sequence prediction module, and a mass number sequence prediction module, and the sequence identifier prediction model, the base sequence matching prediction module, the residual sequence prediction module, and the mass number sequence prediction module are connected by a state machine, wherein the outputs of the sequence identifier prediction model, the base sequence matching prediction module, the residual sequence prediction module, and the mass number sequence prediction module are connected to a multiplexer, and the output of the multiplexer is connected to an arithmetic coder.

3. The method of compressing gene sequencing data according to claim 2,

the sequence identifier prediction model predicts the following text of a sequence identifier from the preceding text of the sequence identifier;

the residual sequence prediction module predicts the following text of the residual sequence according to the base sequence matching record and the preceding text of the residual sequence;

4. The method of claim 1, wherein the step of processing the gene sequencing data text to be compressed according to the mass number sequence data to obtain a read length of the gene sequencing data text to be compressed comprises:

5. A gene sequencing data compression apparatus, the apparatus comprising:

6. The apparatus according to claim 5, wherein the pre-established cross prediction model comprises at least a sequence identifier prediction model, a base sequence matching prediction module, a residual sequence prediction module, and a mass number sequence prediction module, and the sequence identifier prediction model, the base sequence matching prediction module, the residual sequence prediction module, and the mass number sequence prediction module are connected by a state machine, wherein outputs of the sequence identifier prediction model, the base sequence matching prediction module, the residual sequence prediction module, and the mass number sequence prediction module are connected to a multiplexer, and an output of the multiplexer is connected to an arithmetic coder.

7. The apparatus of claim 6,

8. The apparatus of claim 5, wherein the processing module is configured to:

9. A terminal device, comprising: at least one processor and memory;

the memory stores a computer program; the at least one processor executes the computer program stored in the memory to implement the gene sequencing data compression method of any one of claims 1-4.

10. A computer-readable storage medium having stored thereon a computer program which, when executed, implements the gene sequencing data compression method of any one of claims 1-4.