CN115083530A - Gene sequencing data compression method and device, terminal equipment and storage medium - Google Patents

Info

Publication number
CN115083530A
Authority
CN
China
Prior art keywords
sequence
data
gene sequencing
prediction
compressed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211003550.XA
Other languages
Chinese (zh)
Other versions
CN115083530B (en)
Inventor
陈墩金
王阳开
毕星浩
林凯翔
张力
孙齐胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Mingling Gene Technology Co ltd
Original Assignee
Guangzhou Mingling Gene Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Mingling Gene Technology Co ltd
Priority to CN202211003550.XA
Publication of CN115083530A
Application granted
Publication of CN115083530B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/50 Compression of genetic data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Abstract

The application discloses a gene sequencing data compression method and device, terminal equipment and a storage medium. A gene sequencing data text to be compressed is obtained; the text is segmented to obtain sequence identifier data, base sequence data and quality score (mass number) sequence data; the text is processed according to the quality score sequence data to obtain its read length; the base sequence is matched against a preset reference sequence to obtain sequence matching record data and residual sequence data; prediction processing is performed according to a pre-established cross prediction model to obtain a prediction result; and the gene sequencing data text to be compressed is arithmetically compressed according to the prediction result. Storage and transmission costs are reduced through the high compression ratio.

Description

Gene sequencing data compression method and device, terminal equipment and storage medium
Technical Field
The application belongs to the technical field of gene detection, and particularly relates to a gene sequencing data compression method, a gene sequencing data compression device, terminal equipment and a storage medium.
Background
Next-generation sequencing technology has accelerated genomics research. It can generate large amounts of gene sequence data at very low cost, and as sequencing costs fall further, the volume of gene sequence data is growing explosively, which puts great pressure on storage. Compressing gene sequence data is an effective way to address this problem.
The standard text storage format for gene sequencing data is the FASTQ format. In FASTQ, each sequencing read is represented by 4 lines: the first line is the sequence identifier; the second line is the base sequence of the read; the third line is a spare line for additional information and is normally empty, ignoring the line break; the fourth line holds the quality scores (mass numbers) of the bases, which correspond one-to-one with the bases in the second line and indicate the confidence of each base. The length of the base sequence in a read is called the read length and is denoted BP.
Since the whole-genome reference base sequence of a typical species is known, a matching process is performed first when gene sequencing data is compressed and stored. For each subsequence of the base sequence, the matching process detects whether the reference sequence contains a subsequence that is identical to it, or that differs from it in fewer bases than a preset limit. If such a subsequence exists, the positions of the two subsequences, their common length and the small number of differences are recorded; this record is called a match. When the base sequence is stored, suitable matches can be stored in place of the corresponding subsequences, which greatly reduces the number of bases that need to be stored. The part of the base sequence that cannot be described by matches is called the residual sequence of the base sequence.
In theory, adaptive arithmetic coding is one of the coding methods with the highest compression ratio. To apply it, a model is designed for the data; the model estimates the probability that the next bit is 0 or 1 from the preceding data, and arithmetic coding is performed according to that probability. The more accurate the model's prediction, the better the compression performance. However, because context prediction models are computationally expensive, there is at present no mature method for applying adaptive arithmetic coding to gene sequencing data compression.
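The link between prediction accuracy and compressed size can be illustrated with a short Python sketch. It is not taken from the application: the class `AdaptiveBitModel`, the function `ideal_code_length` and the count-based probability estimate are assumptions chosen for illustration, and the sketch only sums the ideal arithmetic-code length, -log2(p) per bit, instead of implementing a full arithmetic coder.

```python
import math


class AdaptiveBitModel:
    """Hypothetical adaptive model: one pair of counts per context."""

    def __init__(self):
        self.counts = {}  # context tuple -> [count of 0s, count of 1s]

    def p_one(self, context):
        # Counts start at 1 so that no probability is ever exactly 0.
        c0, c1 = self.counts.get(context, (1, 1))
        return c1 / (c0 + c1)

    def update(self, context, bit):
        self.counts.setdefault(context, [1, 1])[bit] += 1


def ideal_code_length(bits, order):
    """Sum of -log2(p) over the bits; the context is the previous `order` bits."""
    model = AdaptiveBitModel()
    history = (0,) * order
    total = 0.0
    for bit in bits:
        p1 = model.p_one(history)
        total += -math.log2(p1 if bit else 1.0 - p1)
        model.update(history, bit)          # adapt after coding the bit
        history = (history + (bit,))[-order:] if order else ()
    return total


bits = [0, 1] * 50                           # strictly alternating data
no_context_bits = ideal_code_length(bits, order=0)
one_bit_context_bits = ideal_code_length(bits, order=1)
```

On this alternating input the model that sees one preceding bit predicts almost perfectly after a few symbols, so its total code length is far smaller than that of the model with no context.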
Disclosure of Invention
The invention aims to provide a gene sequencing data compression method and device, terminal equipment and a storage medium that overcome the defects of the prior art. The technical problem is solved by the following technical solutions.
In a first aspect, the embodiments of the present invention provide a method for compressing gene sequencing data, the method including:
acquiring a gene sequencing data text to be compressed;
segmenting the gene sequencing data text to be compressed to obtain sequence identifier data, base sequence data and quality score sequence data;
processing the gene sequencing data text to be compressed according to the quality score sequence data to obtain the read length of the gene sequencing data text to be compressed;
matching the base sequence against a preset reference sequence to obtain sequence matching record data and residual sequence data;
and performing prediction processing on the sequence identifier data, the sequence matching record data, the residual sequence data and the quality score sequence data according to a pre-established cross prediction model to obtain a prediction result, and arithmetically compressing the gene sequencing data text to be compressed according to the prediction result.
Optionally, the pre-established cross prediction model comprises at least a sequence identifier prediction model, a base sequence matching prediction module, a residual sequence prediction module and a quality score sequence prediction module, which are connected by a state machine; their outputs are connected to a multiplexer, and the output of the multiplexer is connected to an arithmetic coding unit.
Optionally, the sequence identifier prediction model predicts the following text of the sequence identifier from the preceding text of the sequence identifier;
the base sequence matching prediction module predicts the following text of the base sequence matching record from the preceding text of the base sequence matching record;
the residual sequence prediction module predicts the following text of the residual sequence from the base sequence matching record and the preceding text of the residual sequence;
the quality score sequence prediction module predicts the following text of the quality score sequence from the matching record, the residual sequence and the preceding text of the quality score sequence.
Optionally, processing the gene sequencing data text to be compressed according to the quality score sequence data to obtain the read length of the gene sequencing data text to be compressed includes:
and compressing the gene sequencing data file to be compressed by adopting a dictionary method to obtain the read length of the gene sequencing data file to be compressed.
In a second aspect, embodiments of the present invention provide a gene sequencing data compression apparatus, including:
the acquisition module is used for acquiring a gene sequencing data text to be compressed;
the determining module is used for segmenting the gene sequencing data text to be compressed to obtain sequence identifier data, base sequence data and quality score sequence data;
the processing module is used for processing the gene sequencing data text to be compressed according to the quality score sequence data to obtain the read length of the gene sequencing data text to be compressed;
the matching module is used for matching the base sequence against a preset reference sequence to obtain sequence matching record data and residual sequence data;
and the prediction module is used for performing prediction processing on the sequence identifier data, the sequence matching record data, the residual sequence data and the quality score sequence data according to a pre-established cross prediction model to obtain a prediction result, and arithmetically compressing the gene sequencing data text to be compressed according to the prediction result.
Optionally, the pre-established cross prediction model comprises at least a sequence identifier prediction model, a base sequence matching prediction module, a residual sequence prediction module and a quality score sequence prediction module, which are connected by a state machine; their outputs are connected to a multiplexer, and the output of the multiplexer is connected to an arithmetic coding unit.
Optionally, the sequence identifier prediction model predicts the following text of the sequence identifier from the preceding text of the sequence identifier;
the base sequence matching prediction module predicts the following text of the base sequence matching record from the preceding text of the base sequence matching record;
the residual sequence prediction module predicts the following text of the residual sequence from the base sequence matching record and the preceding text of the residual sequence;
the quality score sequence prediction module predicts the following text of the quality score sequence from the matching record, the residual sequence and the preceding text of the quality score sequence.
Optionally, the processing module is configured to:
and compressing the gene sequencing data file to be compressed by adopting a dictionary method to obtain the read length of the gene sequencing data file to be compressed.
In a third aspect, an embodiment of the present invention provides a terminal device, including: at least one processor and memory;
the memory stores a computer program; the at least one processor executes the computer program stored in the memory to implement the gene sequencing data compression method provided in the first aspect.
In a fourth aspect, the embodiments of the present invention provide a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed, the method for compressing gene sequencing data provided in the first aspect is implemented.
The embodiment of the invention has the following advantages:
according to the gene sequencing data compression method and device, the terminal equipment and the storage medium, a gene sequencing data text to be compressed is obtained; the text is segmented to obtain sequence identifier data, base sequence data and quality score sequence data; the text is processed according to the quality score sequence data to obtain its read length; the base sequence is matched against a preset reference sequence to obtain sequence matching record data and residual sequence data; prediction processing is performed on the sequence identifier data, the sequence matching record data, the residual sequence data and the quality score sequence data according to a pre-established cross prediction model to obtain a prediction result; and the gene sequencing data text to be compressed is arithmetically compressed according to the prediction result.
Drawings
In order to illustrate the embodiments of the present application or the prior art solutions more clearly, the drawings needed for describing them are briefly introduced below. The drawings in the following description show only some of the embodiments described in the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a method for compressing gene sequencing data according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of another method for compressing gene sequencing data according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a cross-prediction model according to an embodiment of the present application;
FIG. 4 is a diagram illustrating matching records in an embodiment of the present application;
FIG. 5 is a diagram showing the prediction of the quality score sequence from the base sequence in an embodiment of the present application;
FIG. 6 is a diagram illustrating a state machine connecting independent predictive models into a composite model according to an embodiment of the present application;
FIG. 7 is a block diagram of a gene sequencing data compression apparatus according to an embodiment of the present application;
FIG. 8 is a block diagram showing the structure of an embodiment of a gene sequencing data compression apparatus according to the present invention;
FIG. 9 is a schematic structural diagram of a terminal device of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following embodiments and accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
An embodiment of the invention provides a gene sequencing data compression method for compressing gene sequences. The method is executed by a gene sequencing data compression device arranged on a terminal device, for example a computer terminal.
Referring to fig. 1, a flow chart of steps of an embodiment of a method for compressing gene sequencing data according to the present invention is shown, and the method may specifically include the following steps:
S101, obtaining a gene sequencing data text to be compressed;
specifically, the terminal device obtains the gene sequencing data text to be compressed, which is in FASTQ format. To compress the text, its repeated base sequences need to be matched against the sequences they repeat.
S102, segmenting the gene sequencing data text to be compressed to obtain sequence identifier data, base sequence data and quality score sequence data;
the terminal device processes the gene sequencing data text to be compressed, decomposing the different components of the FASTQ data to obtain sequence identifier data, base sequence data and quality score sequence data. The different types of data are written to 4 different files:
1. FXH: the sequence identifiers;
2. FXA: the matching records of the base sequence against the reference sequence;
3. FXB: the residual sequence of the base sequence;
4. FXQ: the quality score sequence data.
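The segmentation in S102 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the function name `split_fastq` is hypothetical, the spare third line is assumed to carry no information, and the base sequence stream is returned whole because splitting it into FXA and FXB requires the matching step.

```python
def split_fastq(text):
    """Split FASTQ text into identifier, base and quality score streams."""
    lines = text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain a multiple of 4 lines")
    identifiers, bases, qualities = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, _spare, qual = lines[i:i + 4]
        # One quality score per base: the two lines must have equal length.
        if len(seq) != len(qual):
            raise ValueError(f"length mismatch in record starting at line {i + 1}")
        identifiers.append(header)   # later stored as FXH
        bases.append(seq)            # later split into FXA and FXB by matching
        qualities.append(qual)       # later stored as FXQ
    return identifiers, bases, qualities


sample = "@read1\nACGTN\n+\nIIII#\n@read2\nGGCA\n+\nFFFF\n"
identifiers, bases, qualities = split_fastq(sample)
read_lengths = [len(q) for q in qualities]
```

Because each quality line has exactly one character per base, the read length of every record can be taken directly from the quality score stream, as the last line does.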
S103, processing the gene sequencing data text to be compressed according to the quality score sequence data to obtain the read length of the gene sequencing data text to be compressed;
specifically, the terminal device compresses the read lengths of the sequences by a dictionary method; the result forms the read-length part of the output file.
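The application does not spell out which dictionary method is used for the read lengths. One possible reading, given purely as an assumption, is to store each distinct read length once and replace every read's length with an index into that dictionary; the names `encode_read_lengths` and `decode_read_lengths` below are hypothetical.

```python
def encode_read_lengths(lengths):
    """Replace each read length by its index in a dictionary of distinct lengths."""
    dictionary = []   # distinct lengths in order of first appearance
    index_of = {}     # length -> position in `dictionary`
    indices = []
    for n in lengths:
        if n not in index_of:
            index_of[n] = len(dictionary)
            dictionary.append(n)
        indices.append(index_of[n])
    return dictionary, indices


def decode_read_lengths(dictionary, indices):
    """Inverse of encode_read_lengths."""
    return [dictionary[i] for i in indices]


lengths = [150, 150, 150, 148, 150]
dictionary, indices = encode_read_lengths(lengths)
restored = decode_read_lengths(dictionary, indices)
```

Sequencing runs typically produce only a few distinct read lengths, so under this reading the index stream is highly repetitive and cheap to store.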
S104, matching the base sequence against a preset reference sequence to obtain sequence matching record data and residual sequence data;
specifically, since the whole-genome reference base sequence of a typical species is known, a matching process is performed first when gene sequencing data is compressed and stored. For each subsequence of the base sequence, it is detected whether the reference sequence contains a subsequence that is identical to it, or that differs from it in fewer bases than a preset limit. If such a subsequence exists, the positions of the two subsequences, their common length and the small number of differences are recorded; this record is called a match. When the base sequence is stored, suitable matches can be stored in place of the corresponding subsequences, which greatly reduces the number of bases to be stored. The part of the base sequence that cannot be described by matches is called the residual sequence of the base sequence, so the sequence matching record data and the residual sequence data are obtained separately.
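The matching step can be sketched in Python as a brute-force search. This is an illustrative simplification, not the application's algorithm: the names `best_match` and `encode_read`, the record layout `(reference position, length, differences)`, the `min_length` threshold, and the restriction to a single match anchored at the start of the read are all assumptions. The mismatch limit is applied here as "at most `max_mismatches`".

```python
def best_match(read, reference, max_mismatches=1, min_length=4):
    """Longest read prefix found in `reference` with a bounded number of mismatches.

    Returns (ref_pos, length, diffs) or None; diffs lists (offset, read_base).
    """
    best = None
    for pos in range(len(reference)):
        diffs, length = [], 0
        for offset, base in enumerate(read):
            if pos + offset >= len(reference):
                break
            if reference[pos + offset] != base:
                if len(diffs) == max_mismatches:
                    break                      # limit exceeded: the match ends here
                diffs.append((offset, base))
            length = offset + 1
        # A mismatch at the very end of the match is left to the residual sequence.
        while diffs and diffs[-1][0] == length - 1:
            diffs.pop()
            length -= 1
        if length >= min_length and (best is None or length > best[1]):
            best = (pos, length, diffs)
    return best


def encode_read(read, reference, **options):
    """Return (match record or None, residual sequence)."""
    match = best_match(read, reference, **options)
    if match is None:
        return None, read
    return match, read[match[1]:]


reference = "TTACGGATCCAGT"
match, residual = encode_read("ACGGTTCCTA", reference)
```

In this example the first 8 bases of the read align to reference position 2 with a single difference at offset 4, so only the record and the 2 residual bases need to be stored. A practical matcher would use an index over the reference instead of scanning every position.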
S105, performing prediction processing on the sequence identifier data, the sequence matching record data, the residual sequence data and the quality score sequence data according to a pre-established cross prediction model to obtain a prediction result, and arithmetically compressing the gene sequencing data text to be compressed according to the prediction result.
Specifically, the pre-established cross prediction model in the embodiment of the invention is obtained as follows: independent prediction models are established for the 4 files, namely the sequence identifier file FXH, the matching record file FXA of the base sequence against the reference sequence, the residual sequence file FXB of the base sequence, and the quality score sequence file FXQ; each file is encoded by adaptive arithmetic coding driven by its prediction model, which yields the cross prediction model. Each prediction model makes its prediction according to context.
The terminal device arithmetically compresses the gene sequencing data text to be compressed according to the predicted probabilities.
The embodiment of the application divides the gene sequencing data text to be compressed into three components: sequence identifier data, base sequence data and quality score sequence data.
The read length is first compressed separately by a conventional method. The base sequence is matched against a preset reference sequence, which splits the base sequence component into two components: sequence matching record data and residual sequence data. According to the pre-established cross prediction model, the following text of each of the four components (sequence identifier data, sequence matching records, residual sequence data and quality score sequence data) is predicted from its preceding text, the read length and the other components, and the gene sequencing data text to be compressed is arithmetically compressed according to the predicted probabilities, so that storage and transmission costs are reduced through the high compression ratio.
With the pre-established cross prediction model, the terminal device can predict the residual base sequence and can use the base information to predict the quality score information; it then arithmetically compresses the gene sequencing data text to be compressed according to the predicted probabilities. Compared with an undecomposed FASTQ file, the statistical properties inside each sub-file are uniform and its regularities are simple, so the statistical model of each file can produce good predictions without being overly complex.
The gene sequencing data text to be compressed is thus divided into three components, namely sequence identifier data, base sequence data and quality score sequence data. The read length is first compressed separately; the base sequence is matched against a preset reference sequence, which splits the base sequence component into matching record data and residual sequence data. According to the pre-established cross prediction model, the following text of each of the four components (sequence identifier data, sequence matching record data, residual sequence data and quality score sequence data) is predicted from the read length and the preceding text of each component to obtain a prediction result; the text is arithmetically compressed according to the prediction result, and storage and transmission costs are reduced through the high compression ratio.
According to the gene sequencing data compression method provided by the embodiment of the invention, a gene sequencing data text to be compressed is obtained; the text is segmented to obtain sequence identifier data, base sequence data and quality score sequence data; the text is processed according to the quality score sequence data to obtain its read length; the base sequence is matched against a preset reference sequence to obtain sequence matching record data and residual sequence data; prediction processing is performed on the sequence identifier data, the sequence matching record data, the residual sequence data and the quality score sequence data according to a pre-established cross prediction model to obtain a prediction result; and the gene sequencing data text to be compressed is arithmetically compressed according to the prediction result.
The present invention further provides a supplementary explanation of the gene sequencing data compression method provided in the above embodiments.
Optionally, the pre-established cross prediction model comprises at least a sequence identifier prediction model, a base sequence matching prediction module, a residual sequence prediction module and a quality score sequence prediction module, which are connected by a state machine; their outputs are connected to a multiplexer, and the output of the multiplexer is connected to the arithmetic coding unit.
The embodiment of the invention adopts a cross prediction model: the matching information is used to predict the residual base sequence, and the base information is used to predict the quality score information. A finite state machine connects the independent prediction models into a composite model, which yields the cross prediction model.
Optionally, the sequence identifier prediction model predicts the following text of the sequence identifier from the preceding text of the sequence identifier;
the base sequence matching prediction module predicts the following text of the base sequence matching record from the preceding text of the base sequence matching record;
the residual sequence prediction module makes its prediction according to the preceding text of the residual sequence and the base sequence matching record;
the quality score sequence prediction module makes its prediction according to the preceding text of the quality score sequence data, the base sequence matching record and the residual sequence.
Specifically, each prediction model predicts the following text from the preceding text. The preceding text is the text that has already been read; the following text is the text to be read next.
The prediction model of the FXH file depends only on the file's own preceding text.
The prediction model of the FXA file depends only on the file's own preceding text.
The prediction model of the FXB file depends on the matching records FXA in addition to its own preceding text. When the number of mismatches in an alignment exceeds the designed limit, the match is terminated, but the region just beyond the end of the match is still likely to be highly similar to the reference. Therefore, the extension of the matching record on the reference sequence is introduced to help predict the residual sequence.
The prediction model of the FXQ file depends on the sequenced base sequence, that is FXA and FXB, in addition to its own preceding text. The following can generally be inferred: the quality score of a mixed base is necessarily low, and a longer run of consecutive identical bases corresponds to slightly lower quality scores. Therefore, introducing the base sequence helps predict the quality score sequence.
Specifically, the independent prediction models compress the four components (FXH identifiers, FXA matching records, FXB residual bases and FXQ quality scores) by independent prediction.
The cross prediction model uses FXA to predict FXB, and uses FXA and FXB to predict FXQ.
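The benefit of cross prediction for the quality scores can be illustrated with a small Python experiment. It is a sketch under assumptions, not the application's model: the function `adaptive_cost`, the add-one frequency estimate, and the synthetic data (where the mixed base is written `N` and always receives the low score `#`) are invented for illustration. The experiment compares the ideal adaptive code length of the quality stream when the context is only the previous quality score with the length when the context also contains the current base.

```python
import math
from collections import defaultdict


def adaptive_cost(symbols, contexts, alphabet_size):
    """Ideal code length in bits under an adaptive per-context frequency model."""
    counts = defaultdict(int)   # (context, symbol) -> occurrences so far
    totals = defaultdict(int)   # context -> occurrences so far
    bits = 0.0
    for symbol, context in zip(symbols, contexts):
        # Add-one estimate so unseen symbols keep a non-zero probability.
        p = (counts[(context, symbol)] + 1) / (totals[context] + alphabet_size)
        bits -= math.log2(p)
        counts[(context, symbol)] += 1
        totals[context] += 1
    return bits


# Synthetic data: every mixed base 'N' gets the low quality score '#'.
bases = "ACGTNACGNTACNGT" * 10
quals = "".join("#" if b == "N" else "I" for b in bases)

previous = ["^"] + list(quals[:-1])          # previous quality score, '^' at the start
own_context = previous                       # FXQ predicted from its own preceding text
cross_context = list(zip(previous, bases))   # plus the current base (cross prediction)

alphabet_size = len(set(quals))
own_bits = adaptive_cost(quals, own_context, alphabet_size)
cross_bits = adaptive_cost(quals, cross_context, alphabet_size)
```

On this synthetic input the quality score is fully determined once the base is known, so the cross context needs far fewer bits than the quality-only context. Real data is noisier, and the gain would be correspondingly smaller.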
Optionally, processing the gene sequencing data text to be compressed according to the quality score sequence data to obtain the read length of the gene sequencing data text to be compressed includes:
and compressing the gene sequencing data file to be compressed by adopting a dictionary method to obtain the read length of the gene sequencing data file to be compressed.
Fig. 2 is a schematic flow chart of another gene sequencing data compression method according to an embodiment of the present application, where the gene sequencing data compression method includes:
Step 1, traversing all gene sequencing data text files to be compressed, checking their format and recording some statistical information; if a text file conforms to the standard FASTQ format, its sequence identifiers, base sequences and quality score sequences are determined.
Step 2, compressing the sequence read lengths by a dictionary method to form the read-length part of the output file.
Step 3, matching the base sequences in the FASTQ data against a preset reference sequence to obtain matching records and residual sequences.
Step 4, decomposing the FASTQ information into 4 component files and establishing an independent prediction model for each component file, namely an FXH model, an FXA model, an FXB model and an FXQ model:
1. FXH: sequence identifiers
2. FXA: base sequence matching records
3. FXB: residual sequences of the base sequences
4. FXQ: quality score sequences
A cross context prediction model is established for each of the 4 files. The term "cross" means that when a prediction model is built for a file with a later ordinal number, the contents of the files with earlier ordinal numbers can be referred to in addition to the contents of the file itself. Since the four files are compressed and decompressed sequentially, such cross-referencing is possible.
Step 5, connecting the 4 independent prediction models into a composite model for arithmetic coding to obtain the cross prediction model, and outputting a compressed file according to the read length and a state machine-based multiplexer.
FIG. 3 is a schematic diagram of a cross-prediction model according to an embodiment of the present application; the cross prediction model specifically comprises:
The prediction model of the FXH file depends only on the file's own preceding text.
The prediction model of the FXA file depends only on the file's own preceding text.
The prediction model of the FXB file depends on the matching records FXA in addition to its own preceding text. Consider the following scenario: when the number of mismatches in an alignment exceeds the designed limit, the match is terminated, but the region just beyond the end of the match is still likely to be highly similar to the reference. Therefore, the extension of the matching record on the reference sequence is introduced to help predict the residual sequence.
The prediction model of the FXQ file depends on the sequenced base sequence, that is FXA and FXB, in addition to its own preceding text. The following can generally be inferred: the quality score of a mixed base is necessarily low, and a longer run of consecutive identical bases corresponds to slightly lower quality scores. Therefore, introducing the base sequence helps predict the quality score sequence.
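The use of the match extension as context for the FXB model, described above, can be sketched in Python. The function name `predict_residual`, the record fields and the example sequences are assumptions for illustration. The idea is that the reference bases just past the end of a match serve as a guess for the residual bases, and each guess is paired with the actual base as context for the residual model.

```python
def predict_residual(reference, ref_pos, match_length, residual_length):
    """Reference bases that extend the match: a guess for the residual bases."""
    start = ref_pos + match_length
    return reference[start:start + residual_length]


reference = "ACGTACGGATTACA"
# Hypothetical match record: the first 6 read bases align to reference position 0.
predicted = predict_residual(reference, ref_pos=0, match_length=6, residual_length=6)
actual = "GCATTA"                        # residual bases actually sequenced

contexts = list(zip(predicted, actual))  # (predicted base, actual base) pairs
hits = sum(p == a for p, a in contexts)
```

Here 5 of the 6 residual bases agree with the reference extension, so a model conditioned on the predicted base would assign them high probability.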
Fig. 4 is a diagram illustrating matching records according to an embodiment of the present application.
FIG. 5 is a diagram showing the prediction of a mass number sequence from a base sequence in one embodiment of the present application, including the decrease in mass number caused by consecutive repeats and the lower mass number implied by mixed bases.
FIG. 6 is a diagram illustrating a state machine connecting the independent prediction models into a composite model according to an embodiment of the present application. It comprises a state-machine-based multiplexer: a state machine with 4 states is introduced, and each state switches on the corresponding one of the 4 prediction models. The 4 prediction models are thus connected into a composite model. If coding were performed with the 4 independent models separately, 4 arithmetic coders and 4 output buffers would be needed. With the composite model, only one arithmetic coder and one output buffer are needed, which reduces code complexity and the consumption of computing resources.
S0: FXH is switched on; on encountering a separator symbol, transition to S1;
S1: FXA is switched on; after a fixed number of characters has passed, transition to S2;
S2: FXB is switched on; the residual length is obtained by subtracting the matching length from the read length, and after that number of characters has passed, transition to S3;
S3: FXQ is switched on; the number of characters equals the known read length, and after that number of characters has passed, transition to S0.
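The four-state multiplexer described above can be sketched as follows. This is a minimal illustration, not the embodiment's code: the class name `Multiplexer`, the newline separator, the fixed record width, and the callback `parse_match_length` that extracts the matching length from an FXA record are all assumptions, since the layout of the matching record is not specified here. A zero-length residual is handled by skipping S2, which is also an assumption.

```python
from enum import Enum


class State(Enum):
    S0 = "FXH"  # sequence identifier
    S1 = "FXA"  # matching record
    S2 = "FXB"  # residual sequence
    S3 = "FXQ"  # mass number sequence


class Multiplexer:
    """Selects which prediction model handles the next symbol."""

    def __init__(self, read_length, record_width, parse_match_length,
                 separator="\n"):
        self.read_length = read_length
        self.record_width = record_width          # hypothetical fixed width
        self.parse_match_length = parse_match_length  # hypothetical callback
        self.separator = separator
        self.state = State.S0
        self.remaining = 0
        self.record = []

    def step(self, symbol):
        """Return the state that codes `symbol`, then advance the machine."""
        current = self.state
        if current is State.S0:
            if symbol == self.separator:
                self._enter(State.S1, self.record_width)
        else:
            if current is State.S1:
                self.record.append(symbol)
            self.remaining -= 1
            if self.remaining == 0:
                self._advance(current)
        return current

    def _advance(self, finished):
        if finished is State.S1:
            matched = self.parse_match_length("".join(self.record))
            self.record = []
            self._enter(State.S2, self.read_length - matched)
        elif finished is State.S2:
            self._enter(State.S3, self.read_length)
        else:
            self.state = State.S0

    def _enter(self, state, count):
        self.state = state
        self.remaining = count
        if count == 0:  # nothing to code in this state, skip it
            self._advance(state)
```

Each call to `step` returns the state whose model should supply the probability for the current symbol, so a single arithmetic coder and a single output buffer can serve all four models; the decoder runs the same machine on the symbols it restores.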
FIG. 7 is a block diagram of a gene sequencing data compression apparatus according to an embodiment of the present application. The embodiment of the invention is a multithreaded gene data compression program with a visual interface, comprising a parameter acquisition module, a module that splits the file into block files, and a multithread management module, wherein the multithread management module contains the main compression/decompression module.
Development environment:
Hardware: personal computer
Software: CYGWIN_NT-10.0 / mingw64-x86_64-gcc-g++ (11.2.0-1)
Test environment:
Hardware: Intel Xeon CPU E5-2678 v3 @ 2.50 GHz
Software: CentOS Linux release 7.6.1810 / g++ (Red Hat 4.8.5-44)
The embodiment of the invention provides a visual interface for acquiring or automatically filling in 5 parameters: mode (compression or decompression), reference sequence file, number of threads, binary file name, and text file name. During compression, the text file is the input and the binary file is the output; during decompression the reverse holds. When a large file is compressed, every 128M of text is split off into one block subfile for multithreaded compression.
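The block splitting and multithreaded compression can be sketched as follows. This is an illustration only: `zlib` from the Python standard library stands in for the compression/decompression main body module, and the function names and the default of 4 threads are assumptions. The sketch also splits at fixed byte offsets, whereas a real splitter would presumably align block boundaries with whole sequencing records.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 128 * 1024 * 1024  # 128M of text per block subfile


def split_blocks(data, block_size=BLOCK_SIZE):
    """Cut the input bytes into consecutive blocks of at most block_size."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def compress_blocks(data, threads=4, block_size=BLOCK_SIZE):
    """Compress every block on a thread pool; block order is preserved."""
    blocks = split_blocks(data, block_size)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # zlib.compress is a stand-in codec, not the patented coder.
        return list(pool.map(zlib.compress, blocks))


def decompress_blocks(compressed, threads=4):
    """Decompress the blocks in parallel and join them in original order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return b"".join(pool.map(zlib.decompress, compressed))
```

Because each block is compressed independently, the blocks can also be decompressed independently, at the cost of each block starting with an untrained model.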
It should be noted that, for simplicity of description, the method embodiments are presented as a series of acts, but those skilled in the art will recognize that the embodiments are not limited by the described order of acts, since some steps may be performed in other orders or concurrently. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the acts involved are not necessarily all required to implement the invention.
According to the gene sequencing data compression method provided by the embodiment of the invention, a gene sequencing data text to be compressed is obtained; the gene sequencing data text to be compressed is segmented to obtain sequence identifier data, base gene sequence data and mass number sequence data; the gene sequencing data text to be compressed is processed according to the mass number sequence data to obtain its read length; the base gene sequence data is matched against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data; and, according to a pre-established cross prediction model, prediction processing is performed on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the mass number sequence data to obtain a prediction result, and the gene sequencing data text to be compressed is arithmetically compressed according to the prediction result.
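The segmentation step and the derivation of the read length from the mass number sequence data can be sketched as follows. The sketch assumes that the gene sequencing data text is in four-line FASTQ format and that all reads share a single length; both the function names and these assumptions are illustrative rather than taken from the embodiment.

```python
def split_fastq(text):
    """Split FASTQ text into identifier, base, and mass number streams."""
    lines = text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must have four lines each")
    identifiers, bases, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"length mismatch in record at line {i + 1}")
        identifiers.append(header)
        bases.append(seq)
        quals.append(qual)
    return identifiers, bases, quals


def read_length(quals):
    """Return the common read length if all reads share one, else None."""
    lengths = {len(q) for q in quals}
    return lengths.pop() if len(lengths) == 1 else None


if __name__ == "__main__":
    sample = "@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nII#I\n"
    ids, seqs, quals = split_fastq(sample)
    print(ids, seqs, read_length(quals))
```

A fixed read length is what lets the state machine count characters instead of storing a separator after every base or mass number line.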
Another embodiment of the present invention provides a gene sequencing data compression apparatus, which is used for implementing the gene sequencing data compression method provided in the above embodiments.
Referring to fig. 8, a block diagram of an embodiment of the present invention is shown, wherein the apparatus may specifically include the following modules: an obtaining module 801, a determining module 802, a processing module 803, a matching module 804 and a predicting module 805, wherein:
the acquisition module 801 is used for acquiring a gene sequencing data text to be compressed;
the determining module 802 is configured to segment a gene sequencing data text to be compressed to obtain sequence identifier data, base gene sequence data, and mass number sequence data;
the processing module 803 is configured to process the gene sequencing data text to be compressed according to the mass number sequence data to obtain a read length of the gene sequencing data text to be compressed;
the matching module 804 is used for matching the base gene sequence data against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data;
the prediction module 805 is configured to perform prediction processing on the sequence identifier data, the sequence matching record data, the sequence residual sequence data, and the mass number sequence data according to a pre-established cross prediction model to obtain a prediction result, and to perform arithmetic compression on the gene sequencing data text to be compressed according to the prediction result.
According to the gene sequencing data compression device provided by the embodiment of the invention, a gene sequencing data text to be compressed is obtained; the gene sequencing data text to be compressed is segmented to obtain sequence identifier data, base gene sequence data and mass number sequence data; the gene sequencing data text to be compressed is processed according to the mass number sequence data to obtain its read length; the base gene sequence data is matched against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data; and, according to a pre-established cross prediction model, prediction processing is performed on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the mass number sequence data to obtain a prediction result, and the gene sequencing data text to be compressed is arithmetically compressed according to the prediction result.
A further embodiment of the present invention supplements the description of the gene sequencing data compression apparatus provided in the above embodiment.
Optionally, the pre-established cross prediction model comprises at least a sequence identifier prediction model, a base sequence matching prediction module, a residual sequence prediction module and a mass number sequence prediction module, which are connected through a state machine; the outputs of the sequence identifier prediction model, the base sequence matching prediction module, the residual sequence prediction module and the mass number sequence prediction module are connected to a multiplexer, and the output of the multiplexer is connected to the arithmetic coding unit.
Optionally, the sequence identifier prediction model predicts the following text of the sequence identifier from the preceding text of the sequence identifier;
the base sequence matching prediction module predicts the following text of the base sequence matching record from the preceding text of the base sequence matching record;
the residual sequence prediction module predicts the following text of the residual sequence from the base sequence matching record and the preceding text of the residual sequence;
the mass number sequence prediction module predicts the following text of the mass number sequence from the base sequence matching record, the residual sequence, and the preceding text of the mass number sequence.
Optionally, the processing module is configured to:
and compressing the gene sequencing data file to be compressed by adopting a dictionary method to obtain the read length of the gene sequencing data file to be compressed.
Since the device embodiment is substantially similar to the method embodiment, it is described only briefly; for relevant details, refer to the corresponding description of the method embodiment.
Still another embodiment of the present invention provides a terminal device, configured to execute the gene sequencing data compression method provided in the foregoing embodiment.
Fig. 9 is a schematic structural diagram of a terminal device of the present invention, and as shown in fig. 9, the terminal device includes: at least one processor 901 and memory 902;
the memory stores a computer program; at least one processor executes the computer program stored in the memory to implement the gene sequencing data compression methods provided by the above embodiments.
With the terminal device provided by the embodiment, a gene sequencing data text to be compressed is obtained; the gene sequencing data text to be compressed is segmented to obtain sequence identifier data, base gene sequence data and mass number sequence data; the gene sequencing data text to be compressed is processed according to the mass number sequence data to obtain its read length; the base gene sequence data is matched against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data; and, according to a pre-established cross prediction model, prediction processing is performed on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the mass number sequence data to obtain a prediction result, and the gene sequencing data text to be compressed is arithmetically compressed according to the prediction result.
In another embodiment of the present application, a computer-readable storage medium is provided, in which a computer program is stored, and when the computer program is executed, the method for compressing gene sequencing data provided in any of the above embodiments is implemented.
According to the computer-readable storage medium of the present embodiment, a gene sequencing data text to be compressed is obtained; the gene sequencing data text to be compressed is segmented to obtain sequence identifier data, base gene sequence data and mass number sequence data; the gene sequencing data text to be compressed is processed according to the mass number sequence data to obtain its read length; the base gene sequence data is matched against a preset reference sequence to obtain sequence matching record data and sequence residual sequence data; and, according to a pre-established cross prediction model, prediction processing is performed on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the mass number sequence data to obtain a prediction result, and the gene sequencing data text to be compressed is arithmetically compressed according to the prediction result.
It should be noted that the above detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular is intended to include the plural unless the context clearly indicates otherwise. Furthermore, it will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in other sequences than those illustrated or otherwise described herein.
Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements explicitly listed, but may include other steps or elements not explicitly listed or inherent to such process, method, article, or apparatus.
Spatially relative terms, such as "above," "over," "on the upper surface of," "upper," and the like, may be used herein for ease of description to describe one device or feature's spatial relationship to another device or feature as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if a device in the figures is turned over, devices described as "above" or "on" other devices or configurations would then be oriented "below" or "under" the other devices or configurations. Thus, the exemplary term "above" can include both an orientation of "above" and an orientation of "below". The device may also be oriented in other ways, such as rotated by 90 degrees or at other orientations, and the spatially relative descriptors used herein are to be interpreted accordingly.
In the foregoing detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, like numerals typically identify like components, unless context dictates otherwise. The illustrated embodiments described in the detailed description, drawings, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented here.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of gene sequencing data compression, the method comprising:
acquiring a gene sequencing data text to be compressed;
segmenting the gene sequencing data text to be compressed to obtain sequence identifier data, base gene sequence data and mass number sequence data;
processing the gene sequencing data text to be compressed according to the mass number sequence data to obtain the read length of the gene sequencing data text to be compressed;
matching the base sequence according to a preset reference sequence to obtain sequence matching record data and sequence residual sequence data;
and according to a pre-established cross prediction model, performing prediction processing on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the mass number sequence data to obtain a prediction result, and performing arithmetic compression on the gene sequencing data text to be compressed according to the prediction result.
2. The method of claim 1, wherein the pre-established cross-prediction model comprises at least a sequence identifier prediction model, a base sequence matching prediction module, a residual sequence prediction module, and a mass number sequence prediction module, and the sequence identifier prediction model, the base sequence matching prediction module, the residual sequence prediction module, and the mass number sequence prediction module are connected by a state machine, wherein the outputs of the sequence identifier prediction model, the base sequence matching prediction module, the residual sequence prediction module, and the mass number sequence prediction module are connected to a multiplexer, and the output of the multiplexer is connected to an arithmetic coder.
3. The method of compressing gene sequencing data according to claim 2, wherein
the sequence identifier prediction model predicts the following text of the sequence identifier from the preceding text of the sequence identifier;
the base sequence matching prediction module predicts the following text of the base sequence matching record from the preceding text of the base sequence matching record;
the residual sequence prediction module predicts the following text of the residual sequence from the base sequence matching record and the preceding text of the residual sequence;
the mass number sequence prediction module predicts the following text of the mass number sequence from the matching record, the residual sequence, and the preceding text of the mass number sequence.
4. The method of claim 1, wherein the step of processing the gene sequencing data text to be compressed according to the mass number sequence data to obtain a read length of the gene sequencing data text to be compressed comprises:
and compressing the gene sequencing data file to be compressed by adopting a dictionary method to obtain the read length of the gene sequencing data file to be compressed.
5. A gene sequencing data compression apparatus, the apparatus comprising:
the acquisition module is used for acquiring a gene sequencing data text to be compressed;
the determining module is used for segmenting the gene sequencing data text to be compressed to obtain sequence identifier data, base gene sequence data and mass number sequence data;
the processing module is used for processing the gene sequencing data text to be compressed according to the mass number sequence data to obtain the read length of the gene sequencing data text to be compressed;
the matching module is used for matching the base sequence according to a preset reference sequence to obtain sequence matching record data and sequence residual sequence data;
and the prediction module is used for performing prediction processing on the sequence identifier data, the sequence matching record data, the sequence residual sequence data and the mass number sequence data according to a pre-established cross prediction model to obtain a prediction result, and performing arithmetic compression on the gene sequencing data text to be compressed according to the prediction result.
6. The apparatus according to claim 5, wherein the pre-established cross prediction model comprises at least a sequence identifier prediction model, a base sequence matching prediction module, a residual sequence prediction module, and a mass number sequence prediction module, and the sequence identifier prediction model, the base sequence matching prediction module, the residual sequence prediction module, and the mass number sequence prediction module are connected by a state machine, outputs of the sequence identifier prediction model, the base sequence matching prediction module, the residual sequence prediction module, and the mass number sequence prediction module are connected to a multiplexer, and an output of the multiplexer is connected to an arithmetic coder.
7. The apparatus of claim 6, wherein
the sequence identifier prediction model predicts the following text of the sequence identifier from the preceding text of the sequence identifier;
the base sequence matching prediction module predicts the following text of the base sequence matching record from the preceding text of the base sequence matching record;
the residual sequence prediction module predicts the following text of the residual sequence from the base sequence matching record and the preceding text of the residual sequence;
the mass number sequence prediction module predicts the following text of the mass number sequence from the matching record, the residual sequence, and the preceding text of the mass number sequence.
8. The apparatus of claim 5, wherein the processing module is configured to:
and compressing the gene sequencing data file to be compressed by adopting a dictionary method to obtain the read length of the gene sequencing data file to be compressed.
9. A terminal device, comprising: at least one processor and memory;
the memory stores a computer program; the at least one processor executes the memory-stored computer program to implement the gene sequencing data compression method of any of claims 1-4.
10. A computer-readable storage medium having stored thereon a computer program which, when executed, implements the gene sequencing data compression method of any one of claims 1-4.
CN202211003550.XA 2022-08-22 2022-08-22 Gene sequencing data compression method and device, terminal equipment and storage medium Active CN115083530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211003550.XA CN115083530B (en) 2022-08-22 2022-08-22 Gene sequencing data compression method and device, terminal equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115083530A true CN115083530A (en) 2022-09-20
CN115083530B CN115083530B (en) 2022-11-04

Family

ID=83244137

Country Status (1)

Country Link
CN (1) CN115083530B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant