CN115455920A - Genome sequencing data quick annotation method and system based on site mapping - Google Patents

Publication number
CN115455920A
CN115455920A CN202211165115.7A CN202211165115A CN115455920A CN 115455920 A CN115455920 A CN 115455920A CN 202211165115 A CN202211165115 A CN 202211165115A CN 115455920 A CN115455920 A CN 115455920A
Authority
CN
China
Prior art keywords
site
value
annotated
mapping
functional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211165115.7A
Other languages
Chinese (zh)
Inventor
方超
郎秋蕾
陈志锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Lianchuan Biotechnology Co ltd
Original Assignee
Hangzhou Lianchuan Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Lianchuan Biotechnology Co ltd
Priority to CN202211165115.7A
Publication of CN115455920A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/169 Annotation, e.g. comment data or footnotes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Computational Linguistics (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method and a system for rapidly annotating genome sequencing data based on site mapping, and belongs to the technical field of bioinformatics. Mapping values are first computed for the start and end sites of all functional components and used to construct an index file. The mapping value of each site to be annotated is computed in the same way and looked up in the index file. If the mapping value of a site falls between the start-site and end-site mapping values of a functional component, the method further checks whether the site itself falls between the start and end sites of that component, and annotates it accordingly. The invention can greatly improve the efficiency of annotation lookup and reduce the time and computational cost of annotation.

Description

Genome sequencing data quick annotation method and system based on site mapping
Technical Field
The invention belongs to the technical field of bioinformatics, and particularly relates to a method and a system for rapidly annotating genome sequencing data based on site mapping.
Background
Next-generation sequencing (NGS), also known as high-throughput sequencing, is a sequencing-by-synthesis technique developed on the basis of PCR and gene chips. Its main characteristics are short read length, high throughput and high accuracy. Compared with first-generation sequencing, high-throughput sequencing greatly reduces cost and sequencing time while maintaining high accuracy, and it is now widely applied across the omics fields, for example in reference-based transcriptome sequencing, resequencing, DNA methylation sequencing, m6A methylation sequencing and single-cell sequencing.
DNA methylation is a major form of epigenetic modification. It can change gene expression without changing the DNA sequence, and plays an important role in regulating gene expression, chromatin conformation and the like. DNA methylation mainly produces 5-methylcytosine (5-mC), along with small amounts of N6-methyladenine (N6-mA), 7-methylguanine (7-mG) and others. In general, methylated DNA refers primarily to 5-mC. In mammalian cells, methylation occurs mainly at the cytosine of CG dinucleotides, whereas plant cells also show a large proportion of non-CG methylation (CHH and CHG, where H stands for A, C or T). 5-mC is formed when DNA methyltransferase (DNMT) transfers a methyl group from the donor S-adenosylmethionine (SAM) to cytosine.
Whole-genome bisulfite sequencing (WGBS) combines bisulfite conversion with second-generation sequencing and can efficiently detect the methylation state of whole-genome DNA at single-base resolution. Bisulfite treatment deaminates unmethylated cytosines in DNA to uracil, while methylated cytosines remain unchanged; when the target fragment is amplified by PCR, all uracils are replaced by thymines. The PCR products are then sequenced at high throughput and compared with a reference sequence to determine whether each CpG/CHG/CHH site is methylated. WGBS can comprehensively and accurately detect the methylation state of whole-genome DNA, laying a foundation for deeper analysis of epigenetic regulation.
CpG islands in the promoter region of a gene are usually unmethylated, which promotes gene transcription, and abnormal methylation can lead to transcriptional inactivation. In general, CpG island methylation leads to gene silencing. DNA methylation also plays an important role in genomic imprinting, where hypermethylation of one of the two alleles results in expression of a single allele.
Current bioinformatics software offers no consistent and rapid method for annotating DNA methylation sequencing data against gene structure regions, such as the promoter, exon (exonic), intron and intergenic regions, or against CpG island regions.
Disclosure of Invention
To solve at least one of the above technical problems, the invention adopts the following technical solution:
The invention provides a method for rapidly annotating genome sequencing data based on site mapping, comprising the following steps:
S1, building an index file:
Obtain the start site and the end site of each functional component region of the species from which the sequencing sample is derived, and compute a mapping value for each site using formula (1):
[Formula (1): image BDA0003860997510000021 in the original publication]
where G_i is the mapping value of the i-th site, INT denotes the rounding operation, S_i is the value of the i-th site, N is a value determined from the chromosome length of the source species, and L_i is the number of digits of the i-th site; if L_i ≤ N, then L_i - N is taken as 1.
The mapping values of the start and end sites of all functional component regions are thus obtained, and the index file is constructed in the following format:
Chr S E s e function
where Chr is the chromosome on which the functional component region lies, S is the mapping value of the region's start site, E is the mapping value of the region's end site, s is the region's start site, e is the region's end site, and function is the category of the region;
S2, obtaining the mapping value of a site to be annotated: let the value of the site be Q, and obtain its mapping value G using formula (1);
S3, searching columns 2 and 3 of the index file for the mapping value G obtained in step S2: if, for some functional component region j, G satisfies S_j ≤ G ≤ E_j, further check whether Q satisfies s_j ≤ Q ≤ e_j; if so, the site to be annotated is annotated as lying in the j-th functional component region.
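Steps S1 to S3 can be sketched in Python as follows. Formula (1) is available here only as an image, so the sketch assumes the form G = INT(S / 10^N) + L × 10^(L - N), which is inferred from the worked example in Example 2 (site 10300 with N = 3 gives 510) and from the rule that L - N is taken as 1 when L ≤ N. The names `mapping_value`, `build_index` and `annotate` are hypothetical, and the plain linear scan over index rows is a simplification; the source does not specify the search structure.

```python
from typing import List, NamedTuple, Tuple


class IndexRow(NamedTuple):
    chrom: str      # Chr: chromosome of the functional component region
    S: int          # mapping value of the start site
    E: int          # mapping value of the end site
    s: int          # start site
    e: int          # end site
    function: str   # category of the functional component region


def mapping_value(site: int, n: int) -> int:
    """Assumed form of formula (1): INT(S / 10^N) + L * 10^(L - N)."""
    digits = len(str(site))                      # L: number of digits of the site
    exponent = digits - n if digits > n else 1   # L - N, taken as 1 when L <= N
    return site // 10 ** n + digits * 10 ** exponent


def build_index(regions: List[Tuple[str, int, int, str]], n: int) -> List[IndexRow]:
    """S1: one row per region in the order Chr, S, E, s, e, function."""
    return [
        IndexRow(chrom, mapping_value(start, n), mapping_value(end, n), start, end, func)
        for chrom, start, end, func in regions
    ]


def annotate(index: List[IndexRow], chrom: str, q: int, n: int) -> List[str]:
    """S2 and S3: coarse check on mapping values, then exact check on sites."""
    g = mapping_value(q, n)
    return [
        row.function
        for row in index
        if row.chrom == chrom and row.S <= g <= row.E and row.s <= q <= row.e
    ]


index = build_index([("Chr1", 10300, 13000, "promoter")], n=3)
print(index[0])
print(annotate(index, "Chr1", 10540, n=3))
```

A site such as 13500 passes the coarse check (its mapping value is 513) but fails the exact check against [10300, 13000], which is why both comparisons are needed.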
In the present invention, the functional components and functional elements have equivalent meanings.
In the present invention, the index value and the mapping value have an equivalent meaning.
In some embodiments of the present invention, the method for determining N specifically comprises:
(1) Obtain the length CL and the gene number GN of each chromosome, and calculate CL/GN;
(2) Obtain a representative number MN of CL/GN over all chromosomes and divide it by a value q; the number of digits in the integer part of MN/q is the value N, where q = 1 to 100.
This way of obtaining N is a selection method that the inventors unexpectedly found to make subsequent annotation more efficient. Those skilled in the art may also select N in other ways; as long as the core idea of the invention is not departed from, such choices should be considered to fall within the protection scope of the invention.
For example, a representative number of all chromosome lengths may be obtained and its square root taken; the number of digits in the integer part of the result is the value N.
In some embodiments of the invention, the representative number is selected from one of a median, a mode, and an average.
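The two ways of choosing N described above can be sketched as follows. This is a minimal illustration: the function names are hypothetical, and "number of digits in the integer part" is implemented by truncating to an integer and counting characters, which is an assumption about how the digit count is taken.

```python
import math
import statistics
from typing import Callable, Dict, Sequence

# The three representative numbers named in the text.
REPRESENTATIVES: Dict[str, Callable[[Sequence[float]], float]] = {
    "median": statistics.median,
    "mode": statistics.mode,
    "average": statistics.mean,
}


def integer_digit_count(x: float) -> int:
    """Number of digits in the integer part of a non-negative number."""
    return len(str(int(x)))


def n_from_gene_density(cl_over_gn: Sequence[float], q: float = 1,
                        representative: str = "median") -> int:
    """N = digit count of MN / q, where MN represents CL/GN over all chromosomes."""
    mn = REPRESENTATIVES[representative](cl_over_gn)
    return integer_digit_count(mn / q)


def n_from_chromosome_length(lengths: Sequence[float],
                             representative: str = "median") -> int:
    """Alternative: N = digit count of the square root of a representative length."""
    return integer_digit_count(math.sqrt(REPRESENTATIVES[representative](lengths)))


print(n_from_gene_density([5000.0, 6000.0, 7000.0], q=10))
print(n_from_chromosome_length([250_000_000]))
```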
In some embodiments of the invention, the source species is a mammal. Preferably, the source species is human.
In some embodiments of the invention, the gene sequencing data is DNA methylation sequencing data.
In some embodiments of the invention, the functional component regions comprise promoter regions, exon regions, intron regions, promoter CGIs, intragenic CGIs, 3' transcript CGIs, intergenic CGIs, repeat regions and miRNA regions.
The promoter CGIs, intragenic CGIs, 3' transcript CGIs and intergenic CGIs are defined according to the gene position of the CGI:
promoter CGIs        -1000 bp TSS to +300 bp TSS
intragenic CGIs      +300 bp TSS to +300 bp TES
3' transcript CGIs   -300 bp TES to +300 bp TES
intergenic CGIs      -300 bp TES to -1000 bp of the next gene's promoter
The invention further provides a system for rapidly annotating genome sequencing data based on site mapping, comprising the following modules:
an index library module for storing an index file, wherein the index file is constructed as follows:
obtain the start site and the end site of each functional component region of the species from which the sequencing sample is derived, and compute a mapping value for each site using formula (1):
[Formula (1): image BDA0003860997510000041 in the original publication]
where G_i is the mapping value of the i-th site, INT denotes the rounding operation, S_i is the value of the i-th site, N is a value determined from the chromosome length of the source species, and L_i is the number of digits of the i-th site; if L_i ≤ N, then L_i - N is taken as 1.
The mapping values of the start and end sites of all functional component regions are thus obtained, and the index file is constructed in the following format:
Chr S E s e function
where Chr is the chromosome on which the functional component region lies, S is the mapping value of the region's start site, E is the mapping value of the region's end site, s is the region's start site, e is the region's end site, and function is the category of the region,
an input module for receiving the sequencing data, obtaining the site to be annotated, and computing the index value of the site using formula (1),
a search module, connected to the input module and the index library module respectively, for searching columns 2 and 3 of the index file for the index value G of the site to be annotated obtained by the input module: if, for some functional component region j, G satisfies S_j ≤ G ≤ E_j, it further checks whether the site value Q satisfies s_j ≤ Q ≤ e_j; if so, the site to be annotated is annotated as lying in the j-th functional component region,
and a result output module for outputting the annotation result.
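One way to picture how the four modules connect is the sketch below. The class and function names are hypothetical and are not taken from the source. The mapping function assumes formula (1) has the form INT(S / 10^N) + L × 10^(L - N), inferred from the worked example in Example 2, and the output line format is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


def mapping_value(site: int, n: int) -> int:
    """Assumed form of formula (1); L - N is taken as 1 when L <= N."""
    digits = len(str(site))
    exponent = digits - n if digits > n else 1
    return site // 10 ** n + digits * 10 ** exponent


@dataclass
class IndexLibrary:
    """Index library module: stores rows (Chr, S, E, s, e, function)."""
    n: int
    rows: List[Tuple[str, int, int, int, int, str]] = field(default_factory=list)

    def add_region(self, chrom: str, start: int, end: int, function: str) -> None:
        self.rows.append((chrom, mapping_value(start, self.n),
                          mapping_value(end, self.n), start, end, function))


@dataclass
class InputModule:
    """Input module: turns a site to be annotated into its index value."""
    n: int

    def index_value(self, site: int) -> int:
        return mapping_value(site, self.n)


@dataclass
class SearchModule:
    """Search module: connected to the input module and the index library."""
    library: IndexLibrary
    source: InputModule

    def annotate(self, chrom: str, site: int) -> List[str]:
        g = self.source.index_value(site)
        return [f for (c, S, E, s, e, f) in self.library.rows
                if c == chrom and S <= g <= E and s <= site <= e]


def output_result(chrom: str, site: int, functions: List[str]) -> str:
    """Result output module: one tab-separated line per annotated site."""
    label = ",".join(functions) if functions else "unannotated"
    return f"{chrom}\t{site}\t{label}"


library = IndexLibrary(n=3)
library.add_region("Chr1", 10300, 13000, "promoter")
search = SearchModule(library, InputModule(n=3))
print(output_result("Chr1", 10540, search.annotate("Chr1", 10540)))
```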
In some embodiments of the present invention, the method for determining N specifically comprises:
(1) Obtain the length CL and the gene number GN of each chromosome, and calculate CL/GN;
(2) Obtain a representative number MN of CL/GN over all chromosomes and divide it by a value q; the number of digits in the integer part of MN/q is the value N, where q = 1 to 100.
As above, this way of obtaining N is a selection method that the inventors unexpectedly found to make subsequent annotation more efficient. Those skilled in the art may also select N in other ways; as long as the core idea of the invention is not departed from, such choices should be considered to fall within the protection scope of the invention.
For example, a representative number of all chromosome lengths may be obtained and its square root taken; the number of digits in the integer part of the result is the value N.
In some embodiments of the invention, the representative number is selected from one of a median, a mode, and an average.
In some embodiments of the invention, the source species is a mammal. Preferably, the source species is human.
In some embodiments of the invention, the gene sequencing data is DNA methylation sequencing data.
In some embodiments of the invention, the functional component regions comprise promoter regions, exon regions, intron regions, promoter CGIs, intragenic CGIs, 3' transcript CGIs, intergenic CGIs, repeat regions and miRNA regions.
The promoter CGIs, intragenic CGIs, 3' transcript CGIs and intergenic CGIs are defined according to the gene position of the CGI:
promoter CGIs        -1000 bp TSS to +300 bp TSS
intragenic CGIs      +300 bp TSS to +300 bp TES
3' transcript CGIs   -300 bp TES to +300 bp TES
intergenic CGIs      -300 bp TES to -1000 bp of the next gene's promoter
Advantageous Effects of the Invention
Compared with the prior art, the invention has the following beneficial effects:
The method and system map the positions of functional components; the approach is simple and easy to operate, and can greatly improve the efficiency of annotation lookup. Taking human chromosome 1 as an example, search efficiency can be improved 26804-fold. The effect is even more pronounced when annotating multiple chromosomes and multiple samples.
Drawings
Fig. 1 shows the genetic location to which CGI belongs.
FIG. 2 shows a schematic representation of a site (10540) located in a promoter region ([ 10300,13000 ]).
FIG. 3 shows a schematic diagram of a rapid annotation system for genome sequencing data based on site mapping in an embodiment of the invention.
Detailed Description
Unless otherwise indicated, implicit from the context, or customary in the art, all parts and percentages herein are based on weight and the testing and characterization methods used are in step with the filing date of the present application. Where applicable, the contents of any patent, patent application, or publication referred to in this application are incorporated herein by reference in their entirety and their equivalent family patents are also incorporated by reference, especially as they disclose definitions relating to synthetic techniques, products and process designs, polymers, comonomers, initiators or catalysts, and the like, in the art. To the extent that a definition of a particular term disclosed in the prior art is inconsistent with any definitions provided herein, the definition of the term provided herein controls.
The numerical ranges in this application are approximations, and thus may include values outside of the ranges unless otherwise specified. A numerical range includes all numbers from the lower value to the upper value, in increments of 1 unit, provided that there is a separation of at least 2 units between any lower value and any higher value. For example, if 100 to 1000 is recited, this is meant to explicitly recite all individual values, e.g., 100, 101, 102, etc., as well as all sub-ranges, e.g., 100 to 166, 155 to 170, 198 to 200, etc. For ranges containing a numerical value less than 1 or containing a fraction greater than 1 (e.g., 1.1,1.5, etc.), then 1 unit is considered to be 0.0001,0.001,0.01, or 0.1, as appropriate. For ranges containing single digit numbers less than 10 (e.g., 1 to 5), 1 unit is typically considered 0.1. These are merely specific examples of what is intended to be expressed and all possible combinations of numerical values between the lowest value and the highest value enumerated are to be considered to be expressly stated in this application.
The terms "comprising," "including," "having," and derivatives thereof do not exclude the presence of any other component, step or procedure, whether or not it is expressly disclosed herein. For the avoidance of doubt, all compositions herein described using the terms "comprising," "including," or "having" may contain any additional additive, adjuvant, or compound, unless expressly stated otherwise. In contrast, the term "consisting essentially of ..." excludes any other components, steps or procedures from the scope of any subsequent recitation, except those that are not essential to operability. The term "consisting of ..." excludes any components, steps or procedures not specifically described or listed. Unless explicitly stated otherwise, the term "or" refers to the listed members individually or in any combination.
In order to make the technical problems, technical solutions and advantageous effects solved by the present invention more apparent, the present invention is further described in detail below with reference to the following embodiments.
Examples
The following examples are used herein to demonstrate preferred embodiments of the invention. It will be appreciated by those of skill in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventor to function in the invention, and thus can be considered to constitute preferred modes for its practice. Those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit or scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs and the disclosures and references cited herein and the materials to which they refer are incorporated by reference.
Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. Such equivalents are intended to be encompassed by the following claims.
The experimental procedures in the following examples are conventional unless otherwise specified. The instruments used in the following examples are, unless otherwise specified, laboratory-standard instruments; the test materials used in the following examples were purchased from a conventional biochemical reagent store unless otherwise specified.
Example 1 method for fast annotation of functional Components based on DNA methylation sequencing
1. Genomic functional component classification and definition
A promoter is a DNA sequence to which a protein binds to initiate transcription of a single RNA transcript from the DNA downstream of the promoter. The RNA transcript may encode a protein (mRNA), or may have a function of its own (e.g., tRNA or rRNA). The promoter is located near the transcription start site of the gene, upstream on the DNA (toward the 5' region of the sense strand). Promoters are about 100-1000 base pairs in length, and their sequence is highly dependent on the gene and transcription product, the type or class of RNA polymerase recruited to the site, and the species of the organism. The promoter region is the binding region for RNA polymerase, and its structure is directly related to the efficiency of transcription.
The transcription start site (TSS) is the base on the DNA strand, usually a purine, that corresponds to the first nucleotide of the nascent RNA strand. Sequences preceding the start site, i.e. toward the 5' end, are referred to as upstream, and sequences following it, i.e. toward the 3' end, are referred to as downstream. Base positions are usually described by numbers: the TSS is +1, downstream positions are +2, +3, ..., and upstream positions are -1, -2, -3, ....
In this example, the inventors define promoter regions uniformly relative to the TSS, as a common annotation standard for subsequent DNA methylation sequencing analysis, as shown in Table 1:
TABLE 1 Promoter region definitions
Region                       Range relative to TSS
Promoter region (promoter)   -2,200 to +500 bp
Proximal (P)                 -200 to +500 bp
Intermediate (I)             -200 to -1,000 bp
Distal (D)                   -1,000 to -2,200 bp
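The sub-regions of Table 1 can be applied to a position expressed as an offset from the TSS, as in the sketch below. The function name is hypothetical. Two points are assumptions because Table 1 does not settle them: shared boundary values (-200 and -1,000) are assigned to the sub-region closer to the TSS, and the offset is taken to be already strand-aware (negative means upstream).

```python
from typing import Optional


def promoter_subregion(offset_bp: int) -> Optional[str]:
    """Classify an offset relative to the TSS using the ranges of Table 1.

    Returns None when the offset lies outside the -2,200 to +500 bp promoter region.
    """
    if -200 <= offset_bp <= 500:
        return "Proximal (P)"
    if -1000 <= offset_bp < -200:
        return "Intermediate (I)"
    if -2200 <= offset_bp < -1000:
        return "Distal (D)"
    return None


for offset in (100, -500, -1500, -3000):
    print(offset, promoter_subregion(offset))
```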
The relative position and CpG content of a promoter sequence are important determinants of its degree of methylation. CpG content is measured by the O/E ratio (observed-to-expected CpG ratio), and the inventors classify promoter regions by this ratio into three classes: low, intermediate and high CpG promoters (LCP, ICP and HCP). The calculation formula is as follows:
[O/E ratio formula: image BDA0003860997510000071 in the original publication]
where Number of CpG is the number of CpG dinucleotides in the sequence, Number of C is the number of C bases in the sequence, Number of G is the number of G bases in the sequence, and Total number of Nucleotides is the total number of bases in the sequence.
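The O/E formula is available here only as an image. The sketch below assumes the commonly used form, O/E = (Number of CpG × Total number of Nucleotides) / (Number of C × Number of G), which uses exactly the four quantities defined above; the patent's own formula may differ in detail. The function name is hypothetical, and no LCP/ICP/HCP thresholds are shown because the text does not give them.

```python
def cpg_oe_ratio(sequence: str) -> float:
    """Observed-to-expected CpG ratio of a DNA sequence (assumed standard form)."""
    seq = sequence.upper()
    n_c = seq.count("C")       # Number of C
    n_g = seq.count("G")       # Number of G
    n_cpg = seq.count("CG")    # Number of CpG dinucleotides
    if n_c == 0 or n_g == 0:
        return 0.0             # no CpG is possible without both C and G
    return n_cpg * len(seq) / (n_c * n_g)


print(cpg_oe_ratio("ACGT"))
print(cpg_oe_ratio("CGCG"))
```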
The inventors defined the CpG island (CGI) as described in table 2 according to the gene location to which the CGI belongs:
TABLE 2 CGI definitions
CGI category         Range
promoter CGIs        -1000 bp TSS to +300 bp TSS
intragenic CGIs      +300 bp TSS to +300 bp TES
3' transcript CGIs   -300 bp TES to +300 bp TES
intergenic CGIs      -300 bp TES to -1000 bp of the next gene's promoter
The gene location definition to which CGI belongs is shown in FIG. 1.
2. Annotating the functional component region to which a sequence belongs by site-by-site traversal
Based on the results of DNA methylation sequencing, the inventors align the sequencing reads to the genome; the alignment results are typically output in SAM format. The SAM format includes the site position field POS: the leftmost position of the alignment, i.e. the genome position to which the first base of the read aligns. Based on this position, one needs to quickly annotate which functional interval of a gene (e.g., promoter region, exon region, intron region) the site lies in, and also to confirm whether the site lies in a CGI or CGI shore region.
Taking the human genome as an example, human chromosome 1 is 249250621 bases long. If several reads align to base position 10540 of the chromosome, how can a site search quickly determine whether that position (10540) belongs to a promoter region?
Suppose the site lies in the promoter region ([10300,13000]) of a gene (as shown in FIG. 2). If the annotation is searched by traversing site by site, the site is found only after 10540 iterations (traversing from the 1st base), which is very inefficient.
3. Annotating functional component regions to which sequences belong by site mapping
To improve the search efficiency of functional annotation, the inventors devised a site-mapping search method for rapid annotation. The detailed process is as follows:
(1) First, the inventors build a mapping for the functional components of the genome using the following formula:
[Formula (1): image BDA0003860997510000081 in the original publication]
where G_i is the mapping value of the i-th site, S_i is the value of the i-th site, N is a value determined from the chromosome length of the source species, and L_i is the number of digits of the i-th site; if L_i ≤ N, then L_i - N is taken as 1.
The value of N is selected as follows:
1. obtain the length CL (i.e. the number of bp) and the gene number GN of each chromosome, and calculate CL/GN;
2. obtain the median MN of CL/GN over all chromosomes and divide it by the value q; the number of digits in the integer part of MN/q is the value N (e.g. 10000, i.e. N = 4), where q = 1 to 100.
For humans, the calculation results are shown in table 3:
TABLE 3 Chromosome length information
Chromosome CL (Mbp) GN CL/GN
1 250 4778 5232.31
2 242 3600 6722.22
3 198 2780 7122.30
4 190 2251 8440.69
5 182 2393 7605.52
6 171 2828 6046.68
7 159 2598 6120.09
8 145 2003 7239.14
9 138 2132 6472.80
10 134 2042 6562.19
11 135 2806 4811.12
12 133 2398 5546.29
13 114 1327 8590.81
14 107 1951 5484.37
15 102 1721 5926.79
16 90 1845 4878.05
17 83 2326 3568.36
18 80 938 8528.78
19 59 2420 2438.02
20 64 1243 5148.83
21 47 726 6473.83
22 51 1129 4517.27
X 156 1973 7906.74
Y 57 496 11491.94
The median MN of CL/GN is 6296.44. If q = 1, MN/q = 6296.44 and N = 4; if q = 10, MN/q = 629.6 and N = 3; if q = 100, MN/q = 62.96 and N = 2.
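The median and the resulting N values can be rechecked from the CL/GN column of Table 3 with a few lines of Python. The variable names are illustrative, and the digit count is taken on the truncated integer part, which is an assumption about how "integer digits" is meant.

```python
import statistics

# CL/GN column of Table 3, chromosomes 1-22, X, Y (values copied as listed).
CL_OVER_GN = [
    5232.31, 6722.22, 7122.30, 8440.69, 7605.52, 6046.68, 6120.09, 7239.14,
    6472.80, 6562.19, 4811.12, 5546.29, 8590.81, 5484.37, 5926.79, 4878.05,
    3568.36, 8528.78, 2438.02, 5148.83, 6473.83, 4517.27, 7906.74, 11491.94,
]


def integer_digits(x: float) -> int:
    """Number of digits in the integer part of a non-negative number."""
    return len(str(int(x)))


mn = statistics.median(CL_OVER_GN)
n_by_q = {q: integer_digits(mn / q) for q in (1, 10, 100)}

print(f"MN = {mn:.3f}")
for q, n in n_by_q.items():
    print(f"q = {q}: MN/q = {mn / q:.2f}, N = {n}")
```

With 24 chromosomes the median is the mean of the 12th and 13th sorted values (6120.09 and 6472.80), which reproduces the 6296.44 quoted above.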
The mapping values of the start and end sites of all functional component regions are thus obtained, and the index file is constructed in the following format:
Chr S E s e function
where Chr is the chromosome on which the functional component region lies, S is the mapping value of the region's start site, E is the mapping value of the region's end site, s is the region's start site, e is the region's end site, and function is the category of the region.
In use, the mapping value of the site to be annotated is obtained first: let the value of the site be Q, and obtain its mapping value G with the formula above. The mapping value G is then searched in columns 2 and 3 of the index file: if, for some functional component region j, G satisfies S_j ≤ G ≤ E_j, it is further checked whether Q satisfies s_j ≤ Q ≤ e_j; if so, the site to be annotated is annotated as lying in the j-th functional component region, completing the annotation quickly.
Example 2 application of the annotation method based on site mapping.
To rapidly annotate functional components with the method established in Example 1, the inventors chose q = 10 and thus N = 3.
Take the promoter region [10300,13000] above as an example. For the start site of the promoter region (10300), S_i = 10300 and L_i = 5, so G_i = INT(10300 / 10^3) + 5 × 10^2 = 10 + 500 = 510.
Similarly, the index value of the end site of the promoter region (13000) is G_j = 513.
The inventors store the following index entry for this promoter region, in the format:
Chr S E s e function
Chr1 510 513 10300 13000 promoter
where Chr is the chromosome on which the functional component region lies, S is the mapping value of the region's start site, E is the mapping value of the region's end site, s is the region's start site, e is the region's end site, and function is the category of the region. The categories of functional component regions include, but are not limited to: promoter region (promoter), exon region (exon), intron region (intron), promoter CGIs, intragenic CGIs, 3' transcript CGIs, intergenic CGIs, repeat region and miRNA region.
All known functional components are converted to index entries in the same way to obtain the index file. Note that each functional component is stored with its complete position interval, i.e. its start and end positions on the genome.
In this way the inventors compress the original sites: the interval [510,513] corresponds to the original sites [10000,13999], i.e. 4000 original sites are stored using 4 index values, so search efficiency can be improved 1000-fold. The effect is very significant for genome data with large data volumes.
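The compression claim can be checked by brute force: every site in 10000 to 13999 should collapse onto one of the four index values 510 to 513. The sketch assumes formula (1) has the form INT(S / 10^N) + L × 10^(L - N), inferred from the worked example above; `mapping_value` is a hypothetical name.

```python
def mapping_value(site: int, n: int) -> int:
    """Assumed form of formula (1); L - N is taken as 1 when L <= N."""
    digits = len(str(site))
    exponent = digits - n if digits > n else 1
    return site // 10 ** n + digits * 10 ** exponent


# Map all 4000 original sites with N = 3 and collect the distinct index values.
index_values = {mapping_value(site, 3) for site in range(10000, 14000)}
print(sorted(index_values))

# Sites just outside the interval fall into other index values.
print(mapping_value(9999, 3), mapping_value(14000, 3))
```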
When annotating, annotation can be carried out only by the annotated locus and the corresponding index value after conversion:
Figure BDA0003860997510000111
During the search, G_k is looked up against the second column (S) and the third column (E) of the index file. If S ≤ G_k ≤ E and s ≤ Q_k ≤ e, the site can be annotated as the corresponding functional component.
For example, the index value of the site 10540 (Q_k) described above is obtained with this method as follows:
G_k = INT(10540 / 10^3) + 5 × 10^2 = 510
The site satisfies 510 ≤ 510 ≤ 513 and 10300 ≤ 10540 ≤ 13000, so the site Q_k is annotated as promoter.
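The check for site 10540 can be sketched in Python. The mapping used here is an assumption inferred from the worked example and the definitions of N and L (integer part of site / 10^N, plus L × 10^(L - N), with L - N taken as 1 when L ≤ N); formula (1) itself is only available as a figure, so this form may differ from it. The function name `site_to_index` and the chromosome label `chr1` are hypothetical.

```python
def site_to_index(site: int, n: int) -> int:
    """Map a genomic site to its index value.

    Assumed form, inferred from the worked example only:
    INT(site / 10^n) + L * 10^(L - n), where L is the number of digits
    of the site and (L - n) is taken as 1 when L <= n.
    """
    digits = len(str(site))
    exponent = digits - n if digits > n else 1
    return site // 10**n + digits * 10**exponent


# Hypothetical index row in the "Chr S E s e function" layout.
row = ("chr1", 510, 513, 10300, 13000, "promoter")

q = 10540
g = site_to_index(q, 3)
chrom, S, E, s, e, function = row

# Two-stage test: coarse match on the index values, then exact match on sites.
label = function if (S <= g <= E and s <= q <= e) else None
print(g, label)  # 510 promoter
```

The coarse comparison on S and E discards most rows cheaply; the comparison on s and e is still required because one index value covers many original sites.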
Example 3 application of a method for rapid annotation of functional Components based on DNA methylation sequencing to the entire chromosome
In this example, chromosome 1 is used. The sequence of chromosome 1 has 249250621 bases in total; with the traditional site traversal method, up to 249250621 traversal steps are needed to find the site to be annotated. With the site mapping search method of Example 1 and N = 4, the chromosome is divided into at most 9249 index values (about 50 functional components), so the site can be found in at most 9299 searches, and the search efficiency is improved by a factor of 26804.
The specific efficiency improvement comparisons are shown in table 3:
TABLE 3 comparison of efficiency
Figure BDA0003860997510000112
Therefore, the site mapping search method of Example 1 greatly improves search performance.
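The chromosome 1 figures in this example can be reproduced with a short calculation. All numbers are taken from the text; the variable names are illustrative, and adding the 50 functional components to the 9249 index values is the reading that yields the stated 9299 searches.

```python
chr1_length = 249_250_621   # bases on chromosome 1, from the example
index_values = 9_249        # upper bound on index values for N = 4
components = 50             # approximate number of functional components

worst_case_traversal = chr1_length
worst_case_mapped = index_values + components

# Integer ratio of the two worst cases.
speedup = worst_case_traversal // worst_case_mapped
print(worst_case_mapped, speedup)  # 9299 26804
```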
Example 4 genome sequencing data quick annotation System based on site mapping
As shown in Fig. 3, this example provides a system that implements the above rapid annotation method. The system includes:
an index library module, used for storing the constructed index file;
an input module, used for receiving the sequencing data, obtaining a site Q to be annotated, and calculating the index value G of the site to be annotated by using the formula;
a search module, connected to the input module and the index library module respectively, used for searching the index value G of the site to be annotated, obtained by the input module, in the 2nd and 3rd columns of the index file; if, for a certain functional component region j, G satisfies S_j ≤ G ≤ E_j, it is further judged whether Q satisfies s_j ≤ Q ≤ e_j, and if so, the site to be annotated is annotated as being located in the j-th functional component region; and
a result output module, used for outputting the annotation result.
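The four modules can be sketched as a small Python pipeline. This is an illustration under stated assumptions, not the patented implementation: the mapping in `site_to_index` is inferred from the worked example for site 10540, the names `IndexRow`, `build_index` and `annotate` are hypothetical, and the exon region is invented test data (only the promoter interval comes from the text).

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class IndexRow(NamedTuple):
    chrom: str
    S: int          # mapping value of the start site
    E: int          # mapping value of the end site
    s: int          # original start site
    e: int          # original end site
    function: str   # category of the functional component region


def site_to_index(site: int, n: int) -> int:
    # Assumed mapping, inferred from the worked example for site 10540.
    digits = len(str(site))
    exponent = digits - n if digits > n else 1
    return site // 10**n + digits * 10**exponent


def build_index(regions: Iterable[Tuple[str, int, int, str]], n: int) -> List[IndexRow]:
    """Index library module: convert (chrom, start, end, function) regions."""
    return [
        IndexRow(c, site_to_index(s, n), site_to_index(e, n), s, e, f)
        for c, s, e, f in regions
    ]


def annotate(index: List[IndexRow], chrom: str, q: int, n: int) -> Optional[str]:
    """Input and search modules: map Q to G, then test S<=G<=E and s<=Q<=e."""
    g = site_to_index(q, n)
    for row in index:
        if row.chrom == chrom and row.S <= g <= row.E and row.s <= q <= row.e:
            return row.function
    return None


# Hypothetical regions; only the promoter interval comes from the text.
index = build_index([("chr1", 10300, 13000, "promoter"),
                     ("chr1", 20000, 25000, "exon")], n=3)

# Result output module: print the annotation for each queried site.
for q in (10540, 22000, 15000):
    print(q, annotate(index, "chr1", q, n=3))
```

A linear scan is used here for clarity; the rows could equally be sorted by S and searched with a faster lookup, which the text does not specify.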
All documents mentioned in this application are incorporated by reference in this application as if each were individually incorporated by reference. Furthermore, it should be understood that various changes and modifications of the present invention can be made by those skilled in the art after reading the above teachings of the present invention, and these equivalents also fall within the scope of the present invention as defined by the appended claims.

Claims (10)

1. A method for rapid annotation of genome sequencing data based on site mapping, characterized by comprising the following steps:
S1, establishing an index file:
obtaining the start sites and end sites of the functional component regions of the species from which the sequencing sample is derived, and obtaining a mapping value for each site using formula (1):
Figure FDA0003860997500000011
wherein G_i represents the mapping value of the i-th site, INT represents the rounding operation, S_i represents the value of the i-th site, N is a value determined according to the chromosome length of the source species, and L_i represents the number of digits of the i-th site; if L_i ≤ N, then L_i - N = 1,
thereby obtaining the mapping values of the start sites and end sites of all functional component regions, and constructing an index file in the following format:
Chr S E s e function
wherein Chr represents the chromosome on which the functional component region is located, S represents the mapping value of the start site of the functional component region, E represents the mapping value of the end site of the functional component region, s represents the original start site of the functional component region, e represents the original end site of the functional component region, and function represents the category of the functional component region;
S2, obtaining the mapping value of a site to be annotated: the value of the site is Q, and the mapping value G of the site to be annotated is obtained using formula (1);
S3, searching for the mapping value G obtained in step S2 in the 2nd and 3rd columns of the index file; if, for a certain functional component region j, G satisfies S_j ≤ G ≤ E_j, further judging whether Q satisfies s_j ≤ Q ≤ e_j, and if so, annotating the site to be annotated as being located in the j-th functional component region.
2. The method for rapid annotation of genome sequencing data according to claim 1, wherein N is determined as follows:
(1) obtaining the length CL and the gene number GN of each chromosome, and calculating CL/GN;
(2) obtaining a representative number MN of CL/GN over all chromosomes and dividing the representative number by a value q, wherein the number of digits of the integer part of MN/q is the value N, and q = 1 to 100.
3. The method for rapid annotation of genome sequencing data according to claim 2, wherein the representative number is selected from one of median, mode and mean.
4. The method for rapid annotation of genome sequencing data according to claim 1, wherein the source species is a mammal.
5. The method for rapid annotation of genome sequencing data according to claim 4, wherein said functional component regions comprise promoter regions, exon regions, intron regions, promoter CGIs, endogenous CGIs, 3' transcript CGIs, endogenous CGIs, repeat regions and miRNA regions.
6. A system for rapid annotation of genome sequencing data based on site mapping, characterized by comprising the following modules:
an index library module, used for storing an index file, wherein the index file is constructed as follows:
obtaining the start sites and end sites of the functional component regions of the species from which the sequencing sample is derived, and obtaining a mapping value for each site using formula (1):
Figure FDA0003860997500000021
wherein G_i represents the mapping value of the i-th site, S_i represents the value of the i-th site, N is a value determined according to the chromosome length of the source species, and L_i represents the number of digits of the i-th site; if L_i ≤ N, then L_i - N = 1,
thereby obtaining the mapping values of the start sites and end sites of all functional component regions, and constructing an index file in the following format:
Chr S E s e function
wherein Chr represents the chromosome on which the functional component region is located, S represents the mapping value of the start site of the functional component region, E represents the mapping value of the end site of the functional component region, s represents the original start site of the functional component region, e represents the original end site of the functional component region, and function represents the category of the functional component region;
an input module, used for receiving sequencing data, obtaining a site Q to be annotated, and calculating the index value G of the site to be annotated using formula (1);
a search module, connected to the input module and the index library module respectively, used for searching the index value G of the site to be annotated, obtained by the input module, in the 2nd and 3rd columns of the index file; if, for a certain functional component region j, G satisfies S_j ≤ G ≤ E_j, it is further judged whether Q satisfies s_j ≤ Q ≤ e_j, and if so, the site to be annotated is annotated as being located in the j-th functional component region; and
a result output module, used for outputting the annotation result.
7. The system for rapid annotation of genome sequencing data according to claim 6, wherein N is determined as follows:
(1) obtaining the length CL and the gene number GN of each chromosome, and calculating CL/GN;
(2) obtaining a representative number MN of CL/GN over all chromosomes and dividing the representative number by a value q, wherein the number of digits of the integer part of MN/q is the value N, and q = 1 to 100.
8. The rapid annotation system of genome sequencing data of claim 7, wherein the representative number is selected from one of median, mode, and mean.
9. The system for rapid annotation of genome sequencing data according to claim 6, wherein the source species is a mammal.
10. The rapid annotation system of genome sequencing data according to claim 9, wherein the functional component regions comprise promoter regions, exon regions, intron regions, promoter CGIs, endogenous CGIs, 3' transcript CGIs, endogenous CGIs, repeat regions and miRNA regions.
CN202211165115.7A 2022-09-23 2022-09-23 Genome sequencing data quick annotation method and system based on site mapping Pending CN115455920A (en)

Publications (1)

Publication Number Publication Date
CN115455920A true CN115455920A (en) 2022-12-09

Family

ID=84306874

Country Status (1)

Country Link
CN (1) CN115455920A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination