CN113268461B

CN113268461B - Method and device for gene sequencing data recombination packaging

Info

Publication number: CN113268461B
Application number: CN202110810347.2A
Authority: CN
Inventors: 郭祥学; 张巍
Original assignee: Guangzhou Jiajian Medical Testing Co ltd
Current assignee: Guangzhou Jiajian Medical Testing Co ltd
Priority date: 2021-07-19
Filing date: 2021-07-19
Publication date: 2021-09-17
Anticipated expiration: 2041-07-19
Also published as: CN113268461A

Abstract

The invention discloses a gene sequencing data recombination and encapsulation method comprising the following steps. Step 1: constructing a reference genome database and a gene dictionary; step 2: obtaining a second gene sequence of a chromosome in the sample; step 3: comparing the second gene sequence of step 2 with a plurality of first gene sequences; step 4: comparing the second gene sequence with the standard gene; step 5: sequentially grouping the nucleotides in the gene fragment in groups of N nucleotides; step 6: representing the front section, the gene fragment and the rear section by codes in the gene dictionary to form a group of nucleotide data; step 7: counting and compressing the nucleotide data on different chromosomes to obtain compressed genome data; step 8: restoring the second gene sequence of the sample. Because the invention encodes short runs of nucleotides with a dictionary, the data can be compressed effectively. The invention also provides a device based on the method.

Description

Method and device for gene sequencing data recombination packaging

Technical Field

The invention relates to the field of electric digital data processing of a new generation of information technology, in particular to a method and a device for gene sequencing data recombination and encapsulation.

Background

CN202010457824.7 discloses a lossless compression method for deep-sequencing second gene sequence data files. The technical solution of that application uses, as the basis for comparison, a built-in standard reference genome and a built-in dictionary file that do not need to be transmitted. Therefore, if the converted or compressed second gene sequence data is lost during transmission or storage, the sequence cannot be restored by anyone who lacks the built-in standard gene and the built-in dictionary file, which greatly enhances security. For unmatched files, a temporary dictionary is added according to the variation and is compressed and transmitted together with the file. Once a special variation that fails to match the first time has been written into the dictionary, it need not be stored again even if it appears hundreds or tens of thousands of times in the sequencing data, which saves a great deal of space.

That method uses a dictionary file to reduce the nucleotide sequence data so that it can be compressed and transmitted, but it does not investigate or explain whether an effective path exists for further reducing the amount of data transmitted. Finding such a path is an urgent need in the field.

Disclosure of Invention

The invention aims to provide a gene sequencing data recombination and encapsulation method that applies dictionary coding to short runs of nucleotides and thereby achieves effective compression of the data.

The invention also provides a device based on the method.

In order to achieve the purpose, the invention provides the following technical scheme: a method for gene sequencing data recombination encapsulation comprises the following steps:

step 1: constructing a reference genome database and a gene dictionary, wherein the reference genome database stores first gene sequences of a plurality of chromosomes, and the gene dictionary uses codes to represent the different combinations of nucleotide sequences of length less than or equal to N;

step 2: obtaining a second gene sequence of a chromosome in the sample;

step 3: comparing the second gene sequence of step 2 with a plurality of first gene sequences, and finding the first gene sequence with the highest similarity to the second gene sequence as the standard gene;

step 4: comparing the second gene sequence with the standard gene to separate out each gene fragment in the second gene sequence that differs from the standard gene, together with the N nucleotides in front of and behind the gene fragment; the N nucleotides at the front end of the gene fragment are defined as the front section, and the N nucleotides at the rear end of the gene fragment are defined as the rear section;

step 5: sequentially grouping the nucleotides in the gene fragment in groups of N nucleotides;

step 6: expressing the front section, the gene fragment and the rear section by codes in a gene dictionary to form a group of nucleotide data;

step 7: counting and compressing the nucleotide data on different chromosomes to obtain compressed genome data, and sending the genome data and the serial number of the first gene sequence corresponding to the standard gene to a data receiving end;

step 8: when the data receiving end receives the genome data and the serial number of the first gene sequence, decompressing the genome data, extracting the nucleotide data on each chromosome by referring to the gene dictionary, determining the position of the gene fragment on the standard gene according to the nucleotide sequences of the front section and the rear section and the number of nucleotides between the front section and the rear section, and restoring the second gene sequence of the sample.

In the above method for packaging gene sequencing data by recombination, N is 3 or 4 or 5 or 6.

In the method for packaging gene sequencing data by recombination, the length of the gene fragment is more than N nucleotides.

In the method for packaging gene sequencing data recombination, the first gene sequence in the reference genome database comprises a first gene sequence of an autosome and a first gene sequence of a sex chromosome.
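The comparison in step 4 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `extract_differences` is hypothetical, and the sketch assumes that the sample and the standard gene have equal length (substitutions only, no insertions or deletions), that every run of consecutive mismatches counts as one gene fragment, and that runs lie at least N nucleotides apart.

```python
def extract_differences(standard, sample, n):
    """Return (front, fragment, rear) triples for each run of mismatches.

    Assumption: both sequences have the same length, so position i in the
    sample corresponds to position i in the standard gene.
    """
    if len(standard) != len(sample):
        raise ValueError("this sketch only handles equal-length sequences")
    records = []
    i = 0
    while i < len(sample):
        if standard[i] == sample[i]:
            i += 1
            continue
        # Extend j to the end of the current run of mismatching nucleotides.
        j = i
        while j < len(sample) and standard[j] != sample[j]:
            j += 1
        front = sample[max(0, i - n):i]   # up to N nucleotides before the fragment
        rear = sample[j:j + n]            # up to N nucleotides after the fragment
        records.append((front, sample[i:j], rear))
        i = j
    return records


standard = "TTACGTAAAAAAGGCATT"
sample = "TTACGTCCTTGGGGCATT"
print(extract_differences(standard, sample, 4))
# [('ACGT', 'CCTTGG', 'GGCA')]
```

Handling insertions and deletions would require a real sequence aligner, which is outside the scope of this sketch.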

Meanwhile, the invention also discloses a gene sequencing data recombination packaging device, which comprises the following modules:

a storage module: used to store the constructed reference genome database and gene dictionary, wherein the reference genome database stores first gene sequences of a plurality of chromosomes, and the gene dictionary uses codes to represent the different combinations of nucleotide sequences of length less than or equal to N;

standard genome selection module: comparing the second gene sequence of each chromosome of the sample with a plurality of first gene sequences, and finding out the first gene sequence with the highest similarity with the second gene sequence as a standard gene;

a comparison module: used to compare the second gene sequence with the standard gene and to separate out each gene fragment in the second gene sequence that differs from the standard gene, together with the N nucleotides in front of and behind the gene fragment; the N nucleotides at the front end of the gene fragment are defined as the front section, and the N nucleotides at the rear end of the gene fragment are defined as the rear section;

a dictionary module: used to group the nucleotides in the gene fragment sequentially in groups of N; to represent the front section, the gene fragment and the rear section by codes in the gene dictionary, forming a group of nucleotide data; and to count and compress the nucleotide data on different chromosomes to obtain compressed genome data, and send the genome data and the code number of the reference gene corresponding to the standard gene to a data receiving end.

In the gene sequencing data recombination and encapsulation device, N is 3, 4, 5 or 6.

In the gene sequencing data recombination and encapsulation device, the length of the gene segment is greater than N nucleotides.

In the above gene sequencing data reassembly and packaging apparatus, the first gene sequence in the reference genome database includes a first gene sequence of an autosome and a first gene sequence of a sex chromosome.

Compared with the prior art, the invention has the beneficial effects that:

The gene dictionary restores the front section, the rear section and the gene fragment in the data; the exact position in the first gene sequence is determined according to the length, the sequences of the front section and the rear section, and the number of the first gene sequence; and the corresponding position in the first gene sequence is replaced to obtain the second gene sequence.

The compressed data volume is small, and the calculation speed is high.

Drawings

FIG. 1 is a flow chart of example 1 of the present invention;

FIG. 2 is a topology diagram of example 2 of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

Referring to fig. 1, a method for gene sequencing data recombination encapsulation comprises the following steps:

Step 1: constructing a reference genome database and a gene dictionary, wherein the reference genome database stores first gene sequences of a plurality of chromosomes, and the gene dictionary uses codes to represent the different combinations of nucleotide sequences of length less than or equal to N; each first gene sequence is numbered;

In practice, when N is 3, there are 64 possible combinations of 3 nucleotides; together with the 4 single nucleotides and the 16 combinations of 2 nucleotides, this gives 84 combinations in total.

When N is 4, there are 256 possible combinations of 4 nucleotides; together with the 4 single nucleotides, the 16 combinations of 2 nucleotides and the 64 combinations of 3 nucleotides, this gives 340 combinations in total.

Taking N as 4 as an example, in the gene dictionary, the 340 combinations are represented by symbols.
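As a minimal sketch of how such a dictionary could be built, the Python below enumerates every nucleotide string of length 1 to N and assigns each one an integer code. The function name `build_gene_dictionary`, the alphabet order "ACGT" and the use of consecutive integers as codes are assumptions made for illustration; the source only states that each combination is represented by a symbol.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # assumed alphabet and ordering


def build_gene_dictionary(n):
    """Map every nucleotide string of length 1..n to a distinct integer code."""
    dictionary = {}
    for length in range(1, n + 1):
        for combo in product(NUCLEOTIDES, repeat=length):
            # Codes are assigned in enumeration order: 0, 1, 2, ...
            dictionary["".join(combo)] = len(dictionary)
    return dictionary


print(len(build_gene_dictionary(3)), len(build_gene_dictionary(4)))
# 84 340
```

The totals match the counts given above: 4 + 16 + 64 = 84 for N = 3, and 4 + 16 + 64 + 256 = 340 for N = 4.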

The reference genome database is not limited to a single set of 23 chromosome pairs for males and females; it contains the first gene sequences of many chromosomes, organized in groups of 23 chromosome pairs.

Step 2: obtaining a second gene sequence of a chromosome in the sample;

Each person has 23 second gene sequences. These are compared one by one with the first gene sequences in the reference genome database, and the best-matching first gene sequence for each is taken as its standard gene, so a plurality of standard genes is obtained.

As a further optimization, the positions at which human genes are known to differ can be marked in the first gene sequences of the reference genome database, so that each first gene sequence carries a number of marker points. When a first gene sequence is aligned with the second gene sequence, only the nucleotides of the second gene sequence at the same sites as the marker points are compared, and the first gene sequence with the fewest differences is taken as the standard gene. This significantly shortens the time needed to determine the standard gene and speeds up step 3.
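The marker-point optimization can be sketched in Python as follows. The function names, the dictionary-based data layout and the toy sequences are all hypothetical, and the sketch assumes that marker sites are 0-based positions that refer to the same coordinates in the reference and in the sample.

```python
def count_marker_differences(reference, sample, marker_sites):
    """Count marker sites where the sample differs from the reference."""
    return sum(
        1
        for pos in marker_sites
        if pos >= len(sample) or reference[pos] != sample[pos]
    )


def select_standard_gene(references, marker_sites_by_id, sample):
    """Return the number of the first gene sequence with the fewest
    differences from the sample at its marker sites."""
    return min(
        references,
        key=lambda ref_id: count_marker_differences(
            references[ref_id], sample, marker_sites_by_id[ref_id]
        ),
    )


references = {1: "ACGTACGTAC", 2: "ACGAACGTTC"}   # numbered first gene sequences
marker_sites = {1: [3, 8], 2: [3, 8]}             # known variable positions
sample = "ACGAACCTTC"                             # second gene sequence
print(select_standard_gene(references, marker_sites, sample))
# 2
```

Only two positions per reference are inspected here, rather than the full sequence, which is the source of the speed-up described above.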

For example, if the gene fragment is 101 nucleotides long and N is 4, the fragment is divided into 26 groups: 25 groups of 4 nucleotides and a final group of 1 nucleotide.
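The grouping arithmetic can be checked with a short Python sketch. The helper name `group_nucleotides` and the synthetic 101-nucleotide fragment are illustrative assumptions.

```python
def group_nucleotides(fragment, n):
    """Split a fragment into consecutive groups of n nucleotides;
    the last group may be shorter than n."""
    return [fragment[i:i + n] for i in range(0, len(fragment), n)]


fragment = "ACGT" * 25 + "A"          # synthetic fragment of 101 nucleotides
groups = group_nucleotides(fragment, 4)
print(len(groups), groups[-1])
# 26 A
```

The short final group is why the gene dictionary needs entries for sequences shorter than N as well as for sequences of exactly N nucleotides.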

the nucleotide data consists of several codes in sequence.
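A minimal sketch of such a code sequence is given below, assuming one integer code for the front section, one code per group of the gene fragment, and one code for the rear section, in that order. The record layout, the function names and the integer codes are assumptions for illustration; the source does not specify the exact format.

```python
from itertools import product


def build_gene_dictionary(n, alphabet="ACGT"):
    """Map every nucleotide string of length 1..n to an integer code."""
    codes = {}
    for length in range(1, n + 1):
        for combo in product(alphabet, repeat=length):
            codes["".join(combo)] = len(codes)
    return codes


def encode_record(front, fragment, rear, n, codes):
    """Encode front section, grouped fragment and rear section as codes."""
    groups = [fragment[i:i + n] for i in range(0, len(fragment), n)]
    return [codes[front]] + [codes[g] for g in groups] + [codes[rear]]


def decode_record(record, codes):
    """Invert encode_record: first code is the front section, last code is
    the rear section, everything in between is the gene fragment."""
    reverse = {code: seq for seq, code in codes.items()}
    pieces = [reverse[c] for c in record]
    return pieces[0], "".join(pieces[1:-1]), pieces[-1]


codes = build_gene_dictionary(4)
record = encode_record("ACGT", "TTGACA", "GGCA", 4, codes)
print(len(record))                 # 4 codes: front, "TTGA", "CA", rear
print(decode_record(record, codes))
# ('ACGT', 'TTGACA', 'GGCA')
```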

The data receiving end receives 23 groups of data, and each group of data comprises genome data and a reference gene code.

When the gene sequence of a human chromosome is restored, the main considerations are the sequences of the front section and the rear section and the length between them, both of which can be derived from the codes above.

Generally, whether N = 3 or N = 4, the same front section and rear section are very unlikely to occur separated by the same length at another position, so this positioning method is unique in practice and the data set does not need to carry position data for the differing gene fragments. This effectively reduces the data volume.
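The positioning and restoration described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the differing gene fragment is assumed to replace a stretch of the same length in the standard gene (substitution only), and the sketch raises an error rather than guessing whenever the front section, rear section and length do not identify exactly one position.

```python
def locate_fragment(standard, front, rear, fragment_length):
    """Find the unique start of the fragment in the standard gene, given the
    front section, the rear section and the number of nucleotides between."""
    n = len(front)
    window = n + fragment_length + len(rear)
    matches = []
    for start in range(len(standard) - window + 1):
        rear_start = start + n + fragment_length
        if (standard[start:start + n] == front
                and standard[rear_start:rear_start + len(rear)] == rear):
            matches.append(start + n)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches[0]


def restore_sequence(standard, front, fragment, rear):
    """Replace the located stretch of the standard gene with the fragment."""
    pos = locate_fragment(standard, front, rear, len(fragment))
    return standard[:pos] + fragment + standard[pos + len(fragment):]


standard = "TTACGTAAAAAAGGCATT"
print(restore_sequence(standard, "ACGT", "CCTTGG", "GGCA"))
# TTACGTCCTTGGGGCATT
```

The explicit uniqueness check makes the assumption behind the positioning method visible: if the same flanking sections occurred at the same spacing elsewhere, the record would need additional position information.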

Example 2

Referring to FIG. 2, a gene sequencing data recombination packaging device for implementing the method of example 1 comprises the following modules:

storage module 1: used to store the constructed reference genome database and gene dictionary, wherein the reference genome database stores first gene sequences of a plurality of chromosomes, and the gene dictionary uses codes to represent the different combinations of nucleotide sequences of length less than or equal to N;

standard genome selection module 2: comparing the second gene sequence of each chromosome of the sample with a plurality of first gene sequences, and finding out the first gene sequence with the highest similarity with the second gene sequence as a standard gene;

comparison module 3: used to compare the second gene sequence with the standard gene and to separate out each gene fragment in the second gene sequence that differs from the standard gene, together with the N nucleotides in front of and behind the gene fragment; the N nucleotides at the front end of the gene fragment are defined as the front section, and the N nucleotides at the rear end of the gene fragment are defined as the rear section;

dictionary module 4: used to group the nucleotides in the gene fragment sequentially in groups of N; to represent the front section, the gene fragment and the rear section by codes in the gene dictionary, forming a group of nucleotide data; and to count and compress the nucleotide data on different chromosomes to obtain compressed genome data, and send the genome data and the code number of the reference gene corresponding to the standard gene to a data receiving end.

The working process is as follows:

Manual sequencing is performed to obtain the whole genome sequence of the person tested, which consists of 23 second gene sequences.

The standard genome selection module finds the closest first gene sequence for each second gene sequence, one by one, to serve as its standard gene, so there are multiple standard genes.

The dictionary module encodes the positions at which the first gene sequence and the second gene sequence differ, so that the front section, the rear section and the gene fragment at each differing position form a continuous run of codes. The 23 second gene sequences are encoded one by one by the dictionary module and then compressed to obtain the compressed genome data.

An identical storage module is provided in the server at the peripheral data receiving end. The gene dictionary in that storage module restores the front section, the rear section and the gene fragment in the data; the exact position in the first gene sequence is determined according to the length, the sequences of the front section and the rear section, and the number of the first gene sequence; and the corresponding position in the first gene sequence is replaced to obtain the second gene sequence.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims

1. A gene sequencing data recombination and encapsulation method is characterized by comprising the following steps:

step 2: obtaining a second gene sequence of a chromosome in the sample;

2. The method for recombinantly encapsulating gene sequencing data according to claim 1, wherein N is 3 or 4 or 5 or 6.

3. The method for recombinantly encapsulating gene sequencing data according to claim 1, wherein the gene segment is longer than N nucleotides.

4. The method for repackaging gene sequencing data of claim 1, wherein the first gene sequence comprises a first gene sequence of an autosome and a first gene sequence of a sex chromosome in the reference genomic database.

5. The gene sequencing data recombination packaging device is characterized by comprising the following modules:

6. The genetic sequencing data reassembly device of claim 5, wherein N is 3 or 4 or 5 or 6.

7. The genetic sequencing data reassembly device of claim 5, wherein said gene fragment is longer than N nucleotides.

8. The gene sequencing data recombination packaging apparatus of claim 5, wherein the first gene sequence comprises a first gene sequence of an autosome and a first gene sequence of a sex chromosome in the reference genome database.