CN115083530B - Gene sequencing data compression method and device, terminal equipment and storage medium - Google Patents
- Publication number
- CN115083530B (application CN202211003550.XA)
- Authority
- CN
- China
- Prior art keywords
- sequence
- data
- gene sequencing
- prediction
- compressed
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/50—Compression of genetic data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Abstract
The application discloses a gene sequencing data compression method, a gene sequencing data compression device, terminal equipment and a storage medium. A gene sequencing data text to be compressed is obtained; the text is segmented to obtain sequence identifier data, base sequence data and mass number sequence data; the text is processed according to the mass number sequence data to obtain a read length; the base sequence data is matched against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data; prediction processing is performed on the gene sequencing data text to be compressed according to a pre-established cross prediction model to obtain a prediction result; and the gene sequencing data text to be compressed is arithmetically compressed according to the prediction result. Compressing the gene sequencing data text at a high ratio reduces storage and transmission costs.
Description
Technical Field
The application belongs to the technical field of gene detection, and particularly relates to a gene sequencing data compression method, a gene sequencing data compression device, terminal equipment and a storage medium.
Background
Next-generation sequencing technology has accelerated genomics research. It can generate large amounts of gene sequence data at extremely low cost, and as sequencing costs fall further, the volume of gene sequence data is growing explosively, which puts great pressure on storage. Compressing gene sequence data is an effective way to address this problem.
The standard text storage format for gene sequencing data is the FASTQ format. In FASTQ format, each sequencing read is represented by 4 rows: the first row is the sequence identifier of the read; the second row is the base sequence of the read; the third row is a spare row for additional information and is normally empty apart from its leading '+' marker and the line break; the fourth row holds the mass numbers of the bases, that is, their quality scores, which correspond one-to-one with the bases in the second row and indicate the confidence of each base. The length of the base sequence in a read is called the read length and is denoted BP.
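The four-row layout described above can be illustrated with a minimal parsing sketch. It assumes well-formed, unwrapped FASTQ text in which the first row begins with '@' and the third with '+'; the names `FastqRecord` and `parse_fastq` are hypothetical and are not taken from the application.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # row 1, without the leading '@'
    bases: str       # row 2, the base sequence
    qualities: str   # row 4, one mass number (quality) character per base


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield one record for every group of four rows."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain a multiple of 4 rows")
    for i in range(0, len(lines), 4):
        header, bases, spare, quals = lines[i:i + 4]
        if not header.startswith("@") or not spare.startswith("+"):
            raise ValueError(f"malformed record starting at row {i + 1}")
        if len(bases) != len(quals):
            raise ValueError(f"base/quality length mismatch at row {i + 1}")
        yield FastqRecord(header[1:], bases, quals)


sample = "@read1\nACGTN\n+\nIIII#\n@read2\nGGGA\n+\nFFF:\n"
records = list(parse_fastq(sample))
read_lengths = [len(r.bases) for r in records]  # read length of each record
```

The length check reflects the one-to-one correspondence between the second and fourth rows.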
Since the whole-genome reference base sequence of a typical species is known, a matching process is performed first when gene sequencing data is compressed and stored. For each subsequence of the base sequence, the matching process checks whether the reference sequence contains a subsequence that is identical to it, or that differs from it in fewer bases than a preset limit. If such a subsequence exists, the respective positions of the two subsequences, their common length and the small number of differences are recorded; this record is called a match. When the base sequence is stored, an appropriate match can be stored in place of the corresponding subsequence, which greatly reduces the number of bases that need to be stored. The part of the base sequence that cannot be described by matches is referred to as the remaining sequence of the base sequence.
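A minimal sketch of such a bounded-difference match follows. It is a brute-force scan that places the whole read on the reference, which is a simplification chosen for illustration and not the matching procedure of the application; `find_match` and `max_mismatches` are hypothetical names.

```python
from typing import List, Optional, Tuple


def find_match(reference: str, read: str, max_mismatches: int
               ) -> Optional[Tuple[int, List[Tuple[int, str]]]]:
    """Return (position, differences) for the first placement of `read`
    on `reference` with at most `max_mismatches` differing bases,
    or None when no placement satisfies the limit."""
    n = len(read)
    for pos in range(len(reference) - n + 1):
        diffs: List[Tuple[int, str]] = []
        for offset in range(n):
            if reference[pos + offset] != read[offset]:
                diffs.append((offset, read[offset]))  # offset and the read's base
                if len(diffs) > max_mismatches:
                    break  # too many differences at this position
        else:
            return pos, diffs  # inner loop finished without exceeding the limit
    return None


reference = "TTACGTACGGA"
read = "ACGTTCG"
match_with_one_diff = find_match(reference, read, max_mismatches=1)
match_exact_only = find_match(reference, read, max_mismatches=0)
```

The returned position, the read length and the list of differences together carry the information that the text calls a match.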
In theory, adaptive arithmetic coding is among the coding methods with the highest compression capability. To apply adaptive arithmetic coding, a model is designed for the data; based on the preceding data (the "above"), the model gives the probability that the next bit is 0 or 1, and arithmetic coding is performed according to that probability. The more accurate the model's prediction, the better the compression performance. However, because context prediction models are computationally expensive, there is currently no mature method for applying adaptive arithmetic coding to gene sequencing data compression.
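The link between prediction accuracy and compressed size can be sketched as follows. The code does not implement an arithmetic coder; it only sums the ideal cost of each bit, -log2(p), under a simple adaptive count-based model, which an arithmetic coder approaches closely. The function name `ideal_code_length`, the smoothing scheme and the test data are assumptions made for illustration.

```python
import math
from collections import defaultdict


def ideal_code_length(bits: str, order: int) -> float:
    """Total of -log2(p) over all bits, where each bit is predicted from
    the `order` preceding bits using adaptively updated counts."""
    counts = defaultdict(lambda: [1, 1])  # smoothed [zeros, ones] per context
    total = 0.0
    for i, ch in enumerate(bits):
        context = bits[max(0, i - order):i]
        zeros, ones = counts[context]
        p_one = ones / (zeros + ones)
        p = p_one if ch == "1" else 1.0 - p_one
        total += -math.log2(p)
        counts[context][int(ch)] += 1  # adapt after coding the bit
    return total


data = "0110" * 64
cost_order0 = ideal_code_length(data, 0)  # model that ignores the preceding bits
cost_order2 = ideal_code_length(data, 2)  # model that uses two preceding bits
```

On this periodic input the two-bit context determines the next bit, so the second cost is far smaller than the first.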
Disclosure of Invention
The invention aims to provide a gene sequencing data compression method, a gene sequencing data compression device, a terminal device and a storage medium that overcome the defects of the prior art. The technical problem addressed by the invention is solved by the following technical scheme.
In a first aspect, the embodiments of the present invention provide a method for compressing gene sequencing data, the method including:
acquiring a gene sequencing data text to be compressed;
segmenting the gene sequencing data text to be compressed to obtain sequence identifier data, base sequence data and mass number sequence data;
processing the gene sequencing data text to be compressed according to the mass number sequence data to obtain the read length of the gene sequencing data text to be compressed;
matching the base sequence according to a preset reference sequence to obtain sequence matching record data and sequence residual sequence data;
and according to a pre-established cross prediction model, performing prediction processing on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the quality number sequence data to obtain a prediction result, and performing arithmetic compression on the gene sequencing data text to be compressed according to the prediction result.
Optionally, the pre-established cross prediction model comprises at least a sequence identifier prediction model, a base sequence matching prediction module, a residual sequence prediction module and a mass number sequence prediction module. These four are connected by a state machine, their outputs are connected to a multiplexer, and the output of the multiplexer is connected to an arithmetic coding unit.
Optionally, the sequence identifier prediction model predicts the following text of a sequence identifier from the preceding text of the sequence identifier;
the base sequence matching prediction module predicts the following text of the base sequence matching record from the preceding text of the base sequence matching record;
the residual sequence prediction module predicts the following text of the residual sequence from the base sequence matching record and the preceding text of the residual sequence;
the mass number sequence prediction module predicts the following text of the mass number sequence from the matching record, the remaining sequence and the preceding text of the mass number sequence.
Optionally, the processing the to-be-compressed gene sequencing data text according to the mass number sequence data to obtain a read length of the to-be-compressed gene sequencing data text includes:
and compressing the gene sequencing data file to be compressed by adopting a dictionary method to obtain the read length of the gene sequencing data file to be compressed.
In a second aspect, embodiments of the present invention provide a gene sequencing data compression apparatus, including:
the acquisition module is used for acquiring a gene sequencing data text to be compressed;
the determining module is used for segmenting the gene sequencing data text to be compressed to obtain sequence identifier data, base sequence data and mass number sequence data;
the processing module is used for processing the gene sequencing data text to be compressed according to the mass number sequence data to obtain the read length of the gene sequencing data text to be compressed;
the matching module is used for matching the base sequence according to a preset reference sequence to obtain sequence matching record data and sequence residual sequence data;
and the prediction module is used for performing prediction processing on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the mass number sequence data according to a pre-established cross prediction model to obtain a prediction result, and performing arithmetic compression on the gene sequencing data text to be compressed according to the prediction result.
Optionally, the pre-established cross prediction model comprises at least a sequence identifier prediction model, a base sequence matching prediction module, a residual sequence prediction module and a mass number sequence prediction module. These four are connected by a state machine, their outputs are connected to a multiplexer, and the output of the multiplexer is connected to an arithmetic coding unit.
Optionally, the sequence identifier prediction model predicts the following text of a sequence identifier from the preceding text of the sequence identifier;
the base sequence matching prediction module predicts the following text of the base sequence matching record from the preceding text of the base sequence matching record;
the residual sequence prediction module predicts the following text of the residual sequence from the base sequence matching record and the preceding text of the residual sequence;
the mass number sequence prediction module predicts the following text of the mass number sequence from the matching record, the remaining sequence and the preceding text of the mass number sequence.
Optionally, the processing module is configured to:
and compressing the gene sequencing data file to be compressed by adopting a dictionary method to obtain the read length of the gene sequencing data file to be compressed.
In a third aspect, an embodiment of the present invention provides a terminal device, including: at least one processor and memory;
the memory stores a computer program; the at least one processor executes the computer program stored in the memory to implement the gene sequencing data compression method provided in the first aspect.
In a fourth aspect, the embodiments of the present invention provide a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed, the method for compressing gene sequencing data provided in the first aspect is implemented.
The embodiment of the invention has the following advantages:
According to the gene sequencing data compression method, the gene sequencing data compression device, the terminal equipment and the storage medium, a gene sequencing data text to be compressed is obtained; the text is segmented to obtain sequence identifier data, base sequence data and mass number sequence data; the text is processed according to the mass number sequence data to obtain the read length of the gene sequencing data text to be compressed; the base sequence data is matched against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data; prediction processing is performed on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the quality number sequence data according to a pre-established cross prediction model to obtain a prediction result; and the gene sequencing data text to be compressed is arithmetically compressed according to the prediction result.
Drawings
In order to more clearly illustrate the embodiments or prior art solutions of the present application, the drawings used in the description of the embodiments or prior art will be briefly described below, it is obvious that the drawings in the description below are only some embodiments described in the present application, and that other drawings can be obtained by those skilled in the art without inventive labor.
FIG. 1 is a schematic flow chart of a method for compressing gene sequencing data according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of another method for compressing gene sequencing data according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a cross-prediction model according to an embodiment of the present application;
FIG. 4 is a diagram illustrating matching records in an embodiment of the present application;
FIG. 5 is a diagram showing a base sequence for predicting a mass number sequence in an example of the present application;
FIG. 6 is a diagram illustrating a state machine connecting independent predictive models into a composite model according to an embodiment of the present application;
FIG. 7 is a block diagram of a gene sequencing data compression apparatus according to an embodiment of the present application;
FIG. 8 is a block diagram showing the structure of an embodiment of a gene sequencing data compression apparatus according to the present invention;
FIG. 9 is a schematic structural diagram of a terminal device of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following embodiments and accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
An embodiment of the invention provides a gene sequencing data compression method, which is used for compressing gene sequences. The method of this embodiment is executed by a gene sequencing data compression device arranged on a terminal device; the terminal device includes at least, for example, a computer terminal.
Referring to fig. 1, a flow chart of steps of an embodiment of a method for compressing gene sequencing data according to the present invention is shown, and the method may specifically include the following steps:
S101, obtaining a gene sequencing data text to be compressed;
Specifically, the terminal device acquires the gene sequencing data text to be compressed, which is in FASTQ format. To compress the gene sequencing data text, recurring base sequences need to be matched to the sequences they repeat.
S102, segmenting a gene sequencing data text to be compressed to obtain sequence identifier data, base sequence data and mass number sequence data;
The terminal device processes the gene sequencing data text to be compressed, that is, it decomposes the different components of the FASTQ format data to obtain the sequence identifier data, the base sequence data and the mass number sequence data, and writes the different types of data to 4 different files:
1. FXH, the file formed by the sequencing sequence identifiers;
2. FXA, the matching records of the base sequence in the reference sequence;
3. FXB, the remaining sequence of the base sequence;
4. FXQ, the mass number sequence data.
S103, processing the gene sequencing data text to be compressed according to the mass number sequence data to obtain the read length of the gene sequencing data text to be compressed;
Specifically, the terminal device compresses the sequence read lengths by using a dictionary method, and the compressed read lengths of the gene sequencing data text to be compressed form part of the output file.
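The description does not detail the dictionary method, so the following is only one possible reading, given as a sketch: each distinct read length is stored once in a dictionary, and the per-read lengths are replaced by run-length-coded dictionary indices. The names `encode_read_lengths` and `decode_read_lengths` and the run-length step are assumptions.

```python
from typing import Dict, List, Tuple


def encode_read_lengths(lengths: List[int]) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Return (dictionary, runs): `dictionary` lists each distinct read
    length once; `runs` holds (dictionary index, repeat count) pairs."""
    dictionary: List[int] = []
    index_of: Dict[int, int] = {}
    runs: List[Tuple[int, int]] = []
    for length in lengths:
        if length not in index_of:
            index_of[length] = len(dictionary)
            dictionary.append(length)
        idx = index_of[length]
        if runs and runs[-1][0] == idx:
            runs[-1] = (idx, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((idx, 1))
    return dictionary, runs


def decode_read_lengths(dictionary: List[int], runs: List[Tuple[int, int]]) -> List[int]:
    """Invert encode_read_lengths."""
    return [dictionary[idx] for idx, count in runs for _ in range(count)]


lengths = [150] * 5 + [148] + [150] * 3
dictionary, runs = encode_read_lengths(lengths)
restored = decode_read_lengths(dictionary, runs)
```

Because most reads from one sequencing run share a read length, a representation of this kind stays small, and the decoder recovers every read length before the other components are decoded.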
S104, matching the base sequences according to a preset reference sequence to obtain sequence matching recorded data and sequence residual sequence data;
Specifically, since the whole-genome reference base sequence of a typical species is known, a matching process is performed first when gene sequencing data is compressed and stored. That is, for each subsequence of the base sequence, it is checked whether the reference sequence contains a subsequence that is identical to it or that differs from it in fewer bases than a preset limit. If such a subsequence exists, the respective positions of the two subsequences, their common length and the small number of differences are recorded; this record is called a match. When the base sequence is stored, appropriate matches can be stored in place of the corresponding subsequences, which greatly reduces the number of bases to be stored. The part of the base sequence that cannot be described by matches is referred to as the residual sequence of the base sequence. In this way, the sequence matching record data and the sequence residual sequence data are obtained separately.
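How one read separates into a matching record and a residual sequence can be sketched as below. For brevity the sketch uses a single longest exact match found by brute force, whereas the description also allows a bounded number of differing bases; `longest_exact_match`, `split_read` and `min_length` are hypothetical names.

```python
from typing import Optional, Tuple

Match = Tuple[int, int, int]  # (start in read, start in reference, length)


def longest_exact_match(reference: str, read: str, min_length: int) -> Optional[Match]:
    """Find the longest substring of `read` that occurs in `reference`."""
    for length in range(len(read), min_length - 1, -1):
        for read_start in range(len(read) - length + 1):
            ref_start = reference.find(read[read_start:read_start + length])
            if ref_start != -1:
                return read_start, ref_start, length
    return None


def split_read(reference: str, read: str, min_length: int = 4
               ) -> Tuple[Optional[Match], str]:
    """Return (matching record, residual sequence) for one read."""
    match = longest_exact_match(reference, read, min_length)
    if match is None:
        return None, read  # nothing matched: the whole read is residual
    read_start, _, length = match
    residual = read[:read_start] + read[read_start + length:]
    return match, residual


reference = "CCGATTACAGGT"
record, residual = split_read(reference, "TTGATTACAC")
unmatched = split_read("AAAA", "CGCG")
```

Only the three integers of the record and the short residual need to be stored for the matched read, instead of all ten bases.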
S105, according to a pre-established cross prediction model, performing prediction processing on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the quality number sequence data to obtain a prediction result, and performing arithmetic compression on the gene sequencing data text to be compressed according to the prediction result.
Specifically, the pre-established cross prediction model in the embodiment of the present invention is obtained by establishing an independent prediction model for each of the 4 files, i.e., the sequencing sequence identifier file FXH, the matching record FXA of the base sequence in the reference sequence, the residual sequence FXB of the base sequence and the mass number sequence data FXQ, and coding with each prediction model by the adaptive arithmetic coding method; each prediction model makes its predictions from context.
The terminal device then performs arithmetic compression on the gene sequencing data text to be compressed according to the predicted probability.
The embodiment of the application divides a gene sequencing data text to be compressed into three components of sequence identifier data, base sequence data and mass number sequence data.
The read length is first compressed separately using a conventional method. The base sequence is then matched against a preset reference sequence, which splits the base sequence component into two components: sequence matching record data and sequence residual sequence data. Using the pre-established cross prediction model, the following text of each of the four components (sequence identifier data, sequence matching records, residual sequence data and mass number sequence data) is predicted from its preceding text, the read length and the other components, and the gene sequencing data text to be compressed is arithmetically compressed according to the predicted probability, so that the high compression ratio reduces storage and transmission costs.
According to the pre-established cross prediction model, the terminal device can predict the residual base sequence from the matching information and the mass number information from the base information, and it performs arithmetic compression on the gene sequencing data text to be compressed according to the predicted probability. Compared with an undecomposed FASTQ file, the statistical properties within each subfile are uniform and its regularities are simple, so the statistical model of each file can achieve good predictions without being overly complex.
The gene sequencing data text to be compressed is divided into three components, namely sequence identifier data, base sequence data and mass number sequence data. The read length is first compressed separately. The base sequence is matched against a preset reference sequence, and the base sequence component yields two components, namely sequence matching record data and sequence residual sequence data. According to the pre-established cross prediction model, the read length and the preceding text of each component, the following text of the four components (sequence identifier data, sequence matching record data, residual sequence data and mass number sequence data) is predicted to obtain a prediction result. The gene sequencing data text to be compressed is arithmetically compressed according to the prediction result, and the high compression ratio reduces storage and transmission costs.
According to the gene sequencing data compression method provided by the embodiment of the invention, a gene sequencing data text to be compressed is obtained; the text is segmented to obtain sequence identifier data, base sequence data and mass number sequence data; the text is processed according to the mass number sequence data to obtain the read length of the gene sequencing data text to be compressed; the base sequence data is matched against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data; prediction processing is performed on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the quality number sequence data according to a pre-established cross prediction model to obtain a prediction result; and the gene sequencing data text to be compressed is arithmetically compressed according to the prediction result.
The present invention further provides a supplementary explanation of the gene sequencing data compression method provided in the above embodiments.
Optionally, the pre-established cross prediction model comprises at least a sequence identifier prediction model, a base sequence matching prediction module, a residual sequence prediction module and a mass number sequence prediction module. These four are connected through a state machine, their outputs are connected to a multiplexer, and the output of the multiplexer is connected to the arithmetic coding unit.
The embodiment of the invention adopts a cross prediction model: the matching information is used to predict the residual base sequence, and the base information is used to predict the mass number information. A finite state machine connects the plurality of independent prediction models into a composite model to obtain the cross prediction model.
Optionally, the sequence identifier prediction model predicts the following text of the sequence identifier from the preceding text of the sequence identifier;
the base sequence matching prediction module predicts the following text of the base sequence matching record from the preceding text of the base sequence matching record;
the residual sequence prediction module predicts the following text of the residual sequence from the preceding text of the residual sequence and the base sequence matching record;
the mass number sequence prediction module predicts the following text of the mass number sequence from the preceding text of the mass number sequence data, the base sequence matching record and the remaining sequence.
Specifically, a prediction model predicts the "following" from the "above". The "above" is the text that has already been read, and the "following" is the text that is to be read next.
The prediction model of the FXH file relies only on its own preceding text.
The prediction model of the FXA file relies only on its own preceding text.
The prediction model of the FXB file relies on the matching records FXA in addition to its own preceding text. When the mismatches in an alignment exceed the design limit, matching terminates, but the portion of the reference sequence beyond the end of the match still has a high probability of being similar to the read. Therefore, the extension of the matching record on the reference sequence is introduced to help predict the remaining sequence.
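The use of the match extension as a predictor can be sketched as follows: the reference bases that follow the matched region serve as guesses for the first residual bases. The names `extension_hints` and `hit_rate` and the example sequences are hypothetical; in an actual coder the hint would enter the context of the FXB model rather than be compared directly.

```python
def extension_hints(reference: str, ref_start: int, length: int, count: int) -> str:
    """Reference bases immediately after the matched region, used as
    guesses for the first `count` bases of the remaining sequence."""
    ref_end = ref_start + length
    return reference[ref_end:ref_end + count]


def hit_rate(hints: str, remaining: str) -> float:
    """Fraction of positions where the hint equals the actual base."""
    pairs = list(zip(hints, remaining))
    if not pairs:
        return 0.0
    return sum(h == r for h, r in pairs) / len(pairs)


reference = "ACGTACGTTTGACCA"
# Assume the first 8 bases of the read matched reference[0:8] and the
# rest of the read is the remaining sequence below.
remaining = "TAGTCCA"
hints = extension_hints(reference, ref_start=0, length=8, count=len(remaining))
rate = hit_rate(hints, remaining)
```

A hit rate well above the 25% expected from a uniform guess over four bases is what makes the extension a useful context.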
The prediction model of the FXQ file relies on the sequencing base sequences FXA and FXB in addition to its own preceding text. Two regularities generally hold: the mass number of a mixed base is necessarily low, and the mass numbers of a longer run of consecutive identical bases are slightly lower. Introducing the base sequence therefore helps to predict the mass number sequence.
Specifically, in the independent prediction model, the four components (FXH identifier, FXA matching record, FXB residual base and FXQ mass number) are each predicted and compressed independently.
In the cross prediction model, FXA is used for predicting FXB, and FXA and FXB are used for predicting FXQ.
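A minimal sketch of how the base sequence might enter the FXQ prediction context is given below. It assumes that "mixed bases" refers to ambiguous calls such as N, and it builds for each position a context from the previous mass number, an ambiguity flag and a capped run length of identical bases; the names `quality_contexts`, `train_counts` and `max_run` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Context = Tuple[str, bool, int]  # (previous quality, base is 'N', capped run length)


def quality_contexts(bases: str, qualities: str, max_run: int = 3) -> List[Context]:
    """Build one prediction context per position of the mass number sequence."""
    contexts: List[Context] = []
    run = 0
    for i, base in enumerate(bases):
        run = run + 1 if i > 0 and base == bases[i - 1] else 1
        prev_quality = qualities[i - 1] if i > 0 else ""
        contexts.append((prev_quality, base == "N", min(run, max_run)))
    return contexts


def train_counts(bases: str, qualities: str) -> Dict[Context, Counter]:
    """Count which mass number follows each context; an adaptive coder
    would turn such counts into probabilities."""
    table: Dict[Context, Counter] = defaultdict(Counter)
    for context, quality in zip(quality_contexts(bases, qualities), qualities):
        table[context][quality] += 1
    return table


bases, qualities = "AAAAN", "IIIF#"
contexts = quality_contexts(bases, qualities)
table = train_counts(bases, qualities)
```

Because the decoder has already restored the bases when it reaches FXQ, it can rebuild the same contexts, which is what makes this kind of cross reference decodable.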
Optionally, processing the to-be-compressed gene sequencing data text according to the mass number sequence data to obtain a read length of the to-be-compressed gene sequencing data text, including:
and compressing the gene sequencing data file to be compressed by adopting a dictionary method to obtain the read length of the gene sequencing data file to be compressed.
Fig. 2 is a schematic flow chart of another gene sequencing data compression method according to an embodiment of the present application, where the gene sequencing data compression method includes:
Step 1, traversing all gene sequencing data text files to be compressed and checking their format; if the format of a text file conforms to the standard FASTQ format, recording some statistical information and determining the sequence identifiers, base sequences and mass number sequences.
Step 2, compressing the sequence read lengths by a dictionary method to form the read-length part of the output file.
Step 3, matching the base sequences in the FASTQ format against a preset reference sequence to obtain matching records and residual sequences.
Step 4, decomposing the information of the FASTQ file into 4 component files and establishing an independent prediction model for each component file, namely an FXH model, an FXA model, an FXB model and an FXQ model:
1. FXH sequencing sequence identifier
2. FXA base sequence matching record
3. FXB base sequence residue sequence
4. FXQ Mass number sequence
A cross context prediction model is established for each of the 4 files. The term "cross" means that when a prediction model is built for a file with a later ordinal number, the contents of the files with earlier ordinal numbers can be referred to in addition to the contents of the file itself. Since the four files are compressed and decompressed sequentially, such cross-referencing is possible.
Step 5, connecting the 4 independent prediction models into a composite model for arithmetic coding to obtain the cross prediction model, and outputting a compressed file according to the read length and a state machine-based multiplexer.
FIG. 3 is a schematic diagram of a cross-prediction model according to an embodiment of the present application; the cross prediction model specifically comprises:
The prediction model of the FXH file relies only on its own preceding text.
The prediction model of the FXA file relies only on its own preceding text.
The prediction model of the FXB file relies on the matching record FXA in addition to its own preceding text. Consider the following scenario: when the mismatches in an alignment exceed the design limit, matching terminates, but the portion of the reference sequence beyond the end of the match still has a high probability of being similar to the read. Therefore, the extension of the matching record on the reference sequence is introduced to help predict the remaining sequence.
The prediction model of the FXQ file relies on the sequencing base sequences FXA and FXB in addition to its own preceding text. Two regularities generally hold: the mass number of a mixed base is necessarily low, and the mass numbers of a longer run of consecutive identical bases are slightly lower. Introducing the base sequence therefore helps to predict the mass number sequence.
Fig. 4 is a diagram illustrating matching records according to an embodiment of the present application.
FIG. 5 is a diagram showing the prediction of a mass number sequence from a base sequence in one embodiment of the present application: consecutive repeated bases lower the mass number, and the presence of a mixed base implies a lower mass number.
FIG. 6 is a diagram illustrating a state machine connecting independent prediction models into a composite model according to an embodiment of the present application. A state machine-based multiplexer is used: a state machine with 4 states is introduced, and each state switches in one of the 4 prediction models, so that the 4 prediction models are connected into a composite model. Coding with 4 independent models separately would require 4 arithmetic coders and 4 output buffers, whereas the composite model needs only one arithmetic coder and one output buffer, which reduces code complexity and the consumption of computing resources.
S0, switching on FXH, and jumping to S1 in case of a separation symbol;
s1, switching on FXA, and turning to S2 after the fixed character number passes;
s2, switching on FXB, subtracting the matching length from the read length to obtain the residual length, and then switching to S3 after the number of characters passes;
s3, switching on FXQ, reading the known number of characters, and then switching back to S0.
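The four states above can be sketched as a small symbol router that decides, for every symbol of an interleaved stream, which prediction model is switched in. This is an illustration only: the newline separator, the 9-character FXA layout (6-digit reference position plus 3-digit match length), and the name `route_symbols` are hypothetical and are not specified in the description.

```python
from enum import Enum


class State(Enum):
    S0 = "FXH"  # sequence identifier
    S1 = "FXA"  # matching record
    S2 = "FXB"  # residual bases
    S3 = "FXQ"  # mass numbers


SEPARATOR = "\n"  # assumed separator that ends the identifier
FXA_WIDTH = 9     # hypothetical: 6-digit position + 3-digit match length


def route_symbols(stream, read_len):
    """Yield (model_name, symbol) pairs for one interleaved symbol stream."""
    state, fxa, remaining = State.S0, "", 0
    for ch in stream:
        yield state.value, ch
        if state is State.S0:
            if ch == SEPARATOR:
                state, fxa = State.S1, ""
        elif state is State.S1:
            fxa += ch
            if len(fxa) == FXA_WIDTH:
                # Residual length = read length - matching length.
                remaining = read_len - int(fxa[6:])
                # Sketch-only shortcut: skip S2 when nothing is left over.
                if remaining > 0:
                    state = State.S2
                else:
                    state, remaining = State.S3, read_len
        elif state is State.S2:
            remaining -= 1
            if remaining == 0:
                state, remaining = State.S3, read_len
        else:  # State.S3
            remaining -= 1
            if remaining == 0:
                state = State.S0
```

Because every symbol leaves the router tagged with exactly one model, a single arithmetic coder can consume the whole stream and query the tagged model for each symbol's probabilities.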
FIG. 7 is a block diagram of a gene sequencing data compression apparatus according to an embodiment of the present application. The embodiment of the invention relates to a multithreaded gene data compression program with a visual interface, which comprises parameter acquisition, division of a file into block files, and a multithread management module, wherein the multithread management module comprises a compression/decompression main module.
Hardware: personal computer
Software: CYGWIN_NT-10.0 / mingw64-x86_64-gcc-g++ (11.2.0-1)
Test environment:
Hardware: Intel Xeon CPU E5-2678 v3 @ 2.50 GHz
Software: CentOS Linux release 7.6.1810 / g++ (Red Hat 4.8.5-44)
The embodiment of the invention provides a visual interface for acquiring or automatically filling in 5 parameters: the mode (compression or decompression), the reference sequence file, the number of threads, the binary file name, and the text file name. When compressing, the text file is the input and the binary file is the output; the reverse holds when decompressing. When compressing a large file, every 128 MB of text is split off as one block subfile for multithreaded compression.
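The block-splitting scheme can be illustrated with a short sketch. It is not the described program: `zlib` stands in for the compressor of this application, the function names are hypothetical, and the sketch cuts at fixed byte offsets, whereas a practical splitter would presumably cut at record boundaries.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB of text per block subfile


def split_blocks(text, block_size=BLOCK_SIZE):
    """Cut the input bytes into consecutive blocks of at most block_size."""
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]


def compress_blocks(text, threads, block_size=BLOCK_SIZE):
    """Compress each block on a worker thread; output order matches input."""
    blocks = split_blocks(text, block_size)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # zlib.compress is only a placeholder for the real block compressor.
        return list(pool.map(zlib.compress, blocks))


def decompress_blocks(blocks, threads):
    """Decompress the blocks in parallel and join them back into the text."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return b"".join(pool.map(zlib.decompress, blocks))
```

Since each block is compressed independently, decompression can use the same number of threads, at the cost of each block starting with an untrained model.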
It should be noted that, for simplicity of description, the method embodiments are described as a series or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the described order of acts, since some steps may be performed in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments and that the acts involved are not necessarily required by the invention.
In the gene sequencing data compression method provided by the embodiment of the invention, a gene sequencing data text to be compressed is obtained; the gene sequencing data text to be compressed is segmented to obtain sequence identifier data, base gene sequence data and mass number sequence data; the text is processed according to the mass number sequence data to obtain its read length; the base gene sequence data is matched against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data; and, according to a pre-established cross prediction model, prediction processing is performed on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the mass number sequence data to obtain a prediction result, and the gene sequencing data text to be compressed is arithmetically compressed according to the prediction result.
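The segmentation step and the derivation of the read length from the mass number data can be sketched as follows. The sketch assumes FASTQ-style four-line records (identifier, bases, a `+` line, mass numbers), which this passage does not state explicitly; the function name `split_records` is hypothetical.

```python
def split_records(text):
    """Split four-line records into identifier, base and mass number streams.

    Assumed layout per record: identifier line, base line, '+' line,
    mass number (quality) line.
    """
    lines = text.splitlines()
    ids, bases, quals = [], [], []
    for i in range(0, len(lines) - 3, 4):
        ids.append(lines[i])
        bases.append(lines[i + 1])
        quals.append(lines[i + 3])
    # The read length of each record follows from its mass number line.
    read_lengths = [len(q) for q in quals]
    return ids, bases, quals, read_lengths
```

Each of the three streams can then be handed to its own prediction model, while the read lengths tell the decoder how many base and mass number symbols belong to each record.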
Another embodiment of the present invention provides a gene sequencing data compression apparatus, which is used for implementing the gene sequencing data compression method provided in the above embodiments.
Referring to FIG. 8, a block diagram of the apparatus according to an embodiment of the present invention is shown. The apparatus may specifically include the following modules: an obtaining module 801, a determining module 802, a processing module 803, a matching module 804, and a predicting module 805, wherein:
the acquisition module 801 is used for acquiring a gene sequencing data text to be compressed;
the determining module 802 is configured to segment a gene sequencing data text to be compressed to obtain sequence identifier data, base gene sequence data, and mass number sequence data;
the processing module 803 is configured to process the gene sequencing data text to be compressed according to the mass number sequence data to obtain a read length of the gene sequencing data text to be compressed;
the matching module 804 is used for matching the base gene sequence data against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data;
the prediction module 805 is configured to perform prediction processing on the sequence identifier data, the sequence matching record data, the sequence residual sequence data, and the mass number sequence data according to a pre-established cross prediction model to obtain a prediction result, and perform arithmetic compression on the gene sequencing data text to be compressed according to the prediction result.
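What the matching module produces can be illustrated with a deliberately simplified sketch. It assumes exact matching only: a seed is looked up in the reference and extended until the first mismatch, whereas the description tolerates mismatches up to a design limit. The names `match_read` and `seed_len` and the `(position, length)` record layout are illustrative assumptions.

```python
def match_read(read, reference, seed_len=4):
    """Return ((position, length), residual) for one read.

    Simplification: the match ends at the first mismatch. If the seed is
    not found, the record is None and the whole read is residual.
    """
    pos = reference.find(read[:seed_len]) if len(read) >= seed_len else -1
    if pos < 0:
        return None, read
    length = seed_len
    while (length < len(read)
           and pos + length < len(reference)
           and read[length] == reference[pos + length]):
        length += 1
    # The matched part is replaced by the record; the rest is the residual.
    return (pos, length), read[length:]
```

The read is recoverable as `reference[pos:pos + length] + residual`, which is why only the record and the residual need to be coded.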
In the gene sequencing data compression device provided by the embodiment of the invention, a gene sequencing data text to be compressed is obtained; the gene sequencing data text to be compressed is segmented to obtain sequence identifier data, base gene sequence data and mass number sequence data; the text is processed according to the mass number sequence data to obtain its read length; the base gene sequence data is matched against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data; and, according to a pre-established cross prediction model, prediction processing is performed on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the mass number sequence data to obtain a prediction result, and the gene sequencing data text to be compressed is arithmetically compressed according to the prediction result.
The present invention further provides a supplementary explanation of the gene sequencing data compression apparatus provided in the above embodiment.
Optionally, the pre-established cross prediction model at least comprises a sequence identifier prediction model, a base sequence matching prediction module, a residual sequence prediction module and a mass number sequence prediction module. These four are connected by a state machine, their outputs are connected to a multiplexer, and the output of the multiplexer is connected to an arithmetic coder.
Optionally, the sequence identifier prediction model predicts the following text of the sequence identifier from the preceding text of the sequence identifier;
the base sequence matching prediction module predicts the following text of the base sequence matching record from the preceding text of the base sequence matching record;
the residual sequence prediction module predicts the following text of the residual sequence from the preceding text of the residual sequence and the base sequence matching record;
the mass number sequence prediction module predicts the following text of the mass number sequence from the preceding text of the mass number sequence data, the base sequence matching record, and the residual sequence.
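Predicting the following text from the preceding text can be sketched with an adaptive frequency model keyed by context. The sketch stops short of an arithmetic coder: it only sums the ideal code length, minus log2 of the predicted probability, that an arithmetic coder would approach. The class and function names and the add-one initial counts are illustrative assumptions.

```python
import math
from collections import defaultdict


class AdaptiveModel:
    """Per-context symbol counts, updated after every coded symbol."""

    def __init__(self, alphabet):
        self.alphabet = list(alphabet)
        # Every symbol starts with count 1 so no probability is zero.
        self.counts = defaultdict(lambda: {s: 1 for s in self.alphabet})

    def probability(self, context, symbol):
        table = self.counts[context]
        return table[symbol] / sum(table.values())

    def update(self, context, symbol):
        self.counts[context][symbol] += 1


def ideal_code_length(symbols, contexts, alphabet):
    """Total bits an ideal arithmetic coder would spend on `symbols`."""
    model = AdaptiveModel(alphabet)
    bits = 0.0
    for ctx, sym in zip(contexts, symbols):
        bits -= math.log2(model.probability(ctx, sym))
        model.update(ctx, sym)
    return bits
```

For the alternating sequence `ABABABAB`, using the previous symbol as context yields a shorter ideal code length than using no context at all; a richer context, such as one that also includes the matching record or the base sequence, is used in the same way by changing only the context key.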
Optionally, the processing module is configured to:
compress the gene sequencing data text to be compressed by a dictionary method to obtain the read length of the gene sequencing data text to be compressed.
Because the device embodiment is substantially similar to the method embodiment, it is described only briefly; for relevant details, refer to the corresponding description of the method embodiment.
Still another embodiment of the present invention provides a terminal device, configured to execute the gene sequencing data compression method provided in the foregoing embodiment.
Fig. 9 is a schematic structural diagram of a terminal device according to the present invention, and as shown in fig. 9, the terminal device includes: at least one processor 901 and memory 902;
the memory stores a computer program; at least one processor executes the computer program stored in the memory to implement the gene sequencing data compression methods provided by the above embodiments.
The terminal device provided by this embodiment obtains a gene sequencing data text to be compressed; segments the gene sequencing data text to be compressed to obtain sequence identifier data, base gene sequence data and mass number sequence data; processes the text according to the mass number sequence data to obtain its read length; matches the base gene sequence data against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data; and, according to a pre-established cross prediction model, performs prediction processing on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the mass number sequence data to obtain a prediction result, and arithmetically compresses the gene sequencing data text to be compressed according to the prediction result.
In another embodiment of the present application, a computer-readable storage medium is provided, in which a computer program is stored, and when the computer program is executed, the method for compressing gene sequencing data provided in any of the above embodiments is implemented.
With the computer-readable storage medium of the present embodiment, a gene sequencing data text to be compressed is obtained; the gene sequencing data text to be compressed is segmented to obtain sequence identifier data, base gene sequence data and mass number sequence data; the text is processed according to the mass number sequence data to obtain its read length; the base gene sequence data is matched against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data; and, according to a pre-established cross prediction model, prediction processing is performed on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the mass number sequence data to obtain a prediction result, and the gene sequencing data text to be compressed is arithmetically compressed according to the prediction result.
It should be noted that the above detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application. As used herein, the singular is intended to include the plural unless the context clearly dictates otherwise. Furthermore, it will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in other sequences than those illustrated or otherwise described herein.
Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For ease of description, spatially relative terms such as "above …", "over …", "on the upper surface of …", "upper", etc. may be used herein to describe the spatial positional relationship of one device or feature to another device or feature as shown in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if a device in the figures is turned over, devices described as "above" or "on" other devices or configurations would then be oriented "below" or "under" the other devices or configurations. Thus, the exemplary term "above …" may include both orientations of "above …" and "below …". The device may also be oriented in other ways, such as rotated 90 degrees or at other orientations, and the spatially relative descriptors used herein are interpreted accordingly.
In the foregoing detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, like numerals typically identify like components, unless context dictates otherwise. The illustrated embodiments described in the detailed description, drawings, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented here.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (8)
1. A method of gene sequencing data compression, the method comprising:
acquiring a gene sequencing data text to be compressed;
segmenting the gene sequencing data text to be compressed to obtain sequence identifier data, base gene sequence data and mass number sequence data;
processing the gene sequencing data text to be compressed according to the mass number sequence data to obtain the read length of the gene sequencing data text to be compressed;
matching the base sequence according to a preset reference sequence to obtain sequence matching record data and sequence residual sequence data;
according to a pre-established cross prediction model, performing prediction processing on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the mass number sequence data to obtain a prediction result, and performing arithmetic compression on the gene sequencing data text to be compressed according to the prediction result, wherein the pre-established cross prediction model is obtained by respectively establishing independent prediction models for 4 files, namely a file composed of sequencing sequence identifiers, the matching records of the base sequences in the reference sequence, the residual sequences of the base sequences, and the mass number sequence data, and encoding each with an adaptive arithmetic coding method;
the pre-established cross prediction model at least comprises a sequence identifier prediction model, a base sequence matching prediction module, a residual sequence prediction module and a mass number sequence prediction module, the sequence identifier prediction model, the base sequence matching prediction module, the residual sequence prediction module and the mass number sequence prediction module are connected through a state machine, the outputs of the sequence identifier prediction model, the base sequence matching prediction module, the residual sequence prediction module and the mass number sequence prediction module are connected with a multiplexer, and the output of the multiplexer is connected with an arithmetic coding unit.
2. The method of compressing gene sequencing data according to claim 1, wherein
the sequence identifier prediction model predicts the following text of a sequence identifier from the preceding text of the sequence identifier;
the base sequence matching prediction module predicts the following text of the base sequence matching record from the preceding text of the base sequence matching record;
the residual sequence prediction module predicts the following text of the residual sequence from the base sequence matching record and the preceding text of the residual sequence;
the mass number sequence prediction module predicts the following text of the mass number sequence from the matching record, the residual sequence, and the preceding text of the mass number sequence.
3. The method for compressing gene sequencing data according to claim 1, wherein the step of processing the gene sequencing data text to be compressed according to the mass number sequence data to obtain a read length of the gene sequencing data text to be compressed comprises:
compressing the gene sequencing data text to be compressed by a dictionary method to obtain the read length of the gene sequencing data text to be compressed.
4. A gene sequencing data compression apparatus, the apparatus comprising:
the acquisition module is used for acquiring a gene sequencing data text to be compressed;
the determining module is used for segmenting the gene sequencing data text to be compressed to obtain sequence identifier data, base gene sequence data and mass number sequence data;
the processing module is used for processing the gene sequencing data text to be compressed according to the mass number sequence data to obtain the read length of the gene sequencing data text to be compressed;
the matching module is used for matching the base sequence according to a preset reference sequence to obtain sequence matching record data and sequence residual sequence data;
the prediction module is used for performing prediction processing on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the mass number sequence data according to a pre-established cross prediction model to obtain a prediction result, and performing arithmetic compression on the gene sequencing data text to be compressed according to the prediction result, wherein the pre-established cross prediction model is obtained by respectively establishing independent prediction models for 4 files, namely a file composed of sequencing sequence identifiers, the matching records of the base sequences in the reference sequence, the residual sequences of the base sequences, and the mass number sequence data, and encoding each with an adaptive arithmetic coding method;
the pre-established cross prediction model at least comprises a sequence identifier prediction model, a base sequence matching prediction module, a residual sequence prediction module and a mass number sequence prediction module, the sequence identifier prediction model, the base sequence matching prediction module, the residual sequence prediction module and the mass number sequence prediction module are connected through a state machine, the outputs of the sequence identifier prediction model, the base sequence matching prediction module, the residual sequence prediction module and the mass number sequence prediction module are connected with a multiplexer, and the output of the multiplexer is connected with an arithmetic coding unit.
5. The apparatus of claim 4, wherein
the sequence identifier prediction model predicts the following text of a sequence identifier from the preceding text of the sequence identifier;
the base sequence matching prediction module predicts the following text of the base sequence matching record from the preceding text of the base sequence matching record;
the residual sequence prediction module predicts the following text of the residual sequence from the base sequence matching record and the preceding text of the residual sequence;
the mass number sequence prediction module predicts the following text of the mass number sequence from the matching record, the residual sequence, and the preceding text of the mass number sequence.
6. The apparatus of claim 4, wherein the processing module is configured to:
compress the gene sequencing data text to be compressed by a dictionary method to obtain the read length of the gene sequencing data text to be compressed.
7. A terminal device, comprising: at least one processor and memory;
the memory stores a computer program; the at least one processor executes the memory-stored computer program to implement the gene sequencing data compression method of any of claims 1-3.
8. A computer-readable storage medium having stored thereon a computer program which, when executed, implements the gene sequencing data compression method of any one of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211003550.XA CN115083530B (en) | 2022-08-22 | 2022-08-22 | Gene sequencing data compression method and device, terminal equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115083530A (en) | 2022-09-20 |
CN115083530B (en) | 2022-11-04 |
Family
ID=83244137
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211003550.XA Active CN115083530B (en) | 2022-08-22 | 2022-08-22 | Gene sequencing data compression method and device, terminal equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115083530B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||