CN117106875A - Method for estimating plant genome size and/or repeatability based on low-depth sequencing - Google Patents
Method for estimating plant genome size and/or repeatability based on low-depth sequencing
- Publication number
- CN117106875A
- Authority
- CN
- China
- Prior art keywords
- sequencing
- depth
- genome
- data
- size
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B45/00—ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
Abstract
The invention provides a method for estimating the size and/or the repeatability of a plant genome based on low-depth sequencing, and belongs to the technical field of plant molecular biology. The invention uses RESPECT, which has Gurobi built in, to fit the k-mer frequency distribution, so that the size and/or the repeatability of a plant genome can be obtained from low-depth sequencing data of no more than 5× depth, which reduces the experimental cost. By combining BBDuk and BBMerge from the BBMap package, quality filtering and merging of paired-end sequencing data can be performed as a pipeline. The initial seed whole-genome sequencing depth of the second iteration is obtained from the result of the first iteration; gradient sampling is then set up to obtain the sequencing depth and the estimated genome size calculated by RESPECT at multiple gradients; a curve is plotted, and the mean of the plateau of the curve below 4× depth is taken as the final genome size estimate. The method of the invention is low in cost and can accurately estimate the size of a plant genome.
Description
Technical Field
The invention belongs to the technical field of plant molecular biology, and particularly relates to a method for estimating plant genome size and/or repeatability based on low-depth sequencing.
Background
The genome carries the most fundamental genetic information of a species. Plant genomes are more complex than animal genomes and are characterized by large size, polyploidy, high heterozygosity, and high repeat content. Genome size, i.e., the C-value, is the total amount of DNA contained in a haploid gamete of a species, typically expressed as mass (pg) or as a number of nucleotide base pairs (bp); 1 pg of DNA is approximately equal to 978 Mb. Genome size in land plants varies about 2,400-fold. The genome C-value database (https://cvalues.science.kew.org/) records genome sizes for about 12,000 plants; among flowering plants, the genome of the carnivorous plant Genlisea tuberosa is only 61 Mb, whereas that of Paris japonica is 148 Gb. Genome size is an important indicator of biodiversity and species specificity. The chromosome number and genome size in the cells of each plant taxon are relatively constant, so genome size is an important guide for research on biological evolution and plant systematics, and is also important for breeding, domestication, and the conservation and use of elite germplasm resources.
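The unit conversion mentioned above can be written down in a few lines. This is only an illustrative sketch: the factor of 978 Mb per pg and the two example genome sizes come from the text, while the function names are chosen here for convenience.

```python
# Conversion factor stated in the text: 1 pg of DNA is roughly 978 Mb.
MB_PER_PG = 978.0


def pg_to_mb(c_value_pg: float) -> float:
    """Convert a C-value in picograms to megabases."""
    return c_value_pg * MB_PER_PG


def mb_to_pg(size_mb: float) -> float:
    """Convert a genome size in megabases to picograms."""
    return size_mb / MB_PER_PG


# The two extremes quoted in the text for flowering plants.
print(f"Genlisea tuberosa: {mb_to_pg(61):.3f} pg")
print(f"Paris japonica:    {mb_to_pg(148_000):.1f} pg")
print(f"fold difference:   {148_000 / 61:.0f}")
```

The fold difference between the two quoted genomes comes out at a little over 2,400, in line with the range stated for land plants.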
Polyploidy, repetitive sequences, and intraspecific variation are the three major factors affecting plant genome size. Polyploidy (autopolyploidy and allopolyploidy) can enhance genetic diversity and contributes positively to plant adaptation to environmental change; about 70% of extant plants are polyploid. Repetitive sequences can be divided into tandem repeats and interspersed repeats; they are mostly located in intergenic and intronic regions and account for 10%-85% of a plant genome. Intraspecific variation has been found in many species, such as maize and fescue; the difference between populations can exceed 30%, polyploids and diploids can coexist, and the variation correlates with geographic and climatic factors. It was once generally assumed that the C-value increases as organisms evolve, but there is no clear correlation between the structural and functional complexity of an organism and its genome size, a phenomenon known as the "C-value paradox".
The commonly used methods for determining genome size are flow cytometry, which is simple to operate, economical, efficient, and accurate, and the genome survey based on k-mer analysis, which is experimentally simple, fast, and highly reproducible. A k-mer is a substring of length K contained in a sequence of length L; sliding a window of length K along the sequence one base at a time yields a total of (L-K+1) k-mers. k-mer analysis assumes that sequenced reads are randomly distributed across the genome and that, in the absence of sequencing errors, repetitive sequences, and heterozygous sequences, the k-mer frequency distribution follows a Poisson distribution. In practice, however, the error rate, the proportion of repetitive sequences, and the heterozygosity must all be calculated, and the genome size estimate is corrected accordingly. k-mer analysis generally uses software such as Jellyfish, KMC, KAT, or KmerGenie to obtain the k-mer frequency distribution, and then software such as GCE, GenomeScope, findGSE, or BBNorm to estimate genome size, repeat content, and so on. The estimated genome size is affected by the chosen k value, the upper limit of the frequency distribution, the maximum k-mer coverage, heterozygosity/homozygosity/ploidy, and so on. Current software based on k-mer analysis requires a genome sequencing depth of 30×-50× and is prone to failure when fitting the frequency distribution peaks and to difficulty in distinguishing the main peak from the heterozygous peak, so the estimated genome sizes often differ greatly.
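The k-mer definition above can be illustrated with a minimal sketch. It is not how Jellyfish or KMC work internally (those tools typically count canonical k-mers and use far more compact data structures); the function names and the toy sequence are assumptions made for the example.

```python
from collections import Counter


def kmers(seq: str, k: int):
    """Yield every k-mer obtained by sliding a window one base at a time."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def kmer_spectrum(reads, k: int) -> Counter:
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return Counter(counts.values())


# Toy sequence: L = 10 and K = 4 give L - K + 1 = 7 k-mers.
seq = "ACGTACGTAC"
k = 4
all_kmers = list(kmers(seq, k))
spectrum = kmer_spectrum([seq], k)
print(len(all_kmers), dict(spectrum))
```

The spectrum (multiplicity against number of distinct k-mers) is the frequency distribution that downstream tools fit in order to estimate coverage and genome size.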
Disclosure of Invention
The invention aims to provide a method for estimating the size and/or the repeatability of a plant genome based on low-depth sequencing, which can effectively and accurately estimate the size of the plant genome.
The invention provides a method for estimating plant genome size and/or repeatability based on low-depth sequencing, which comprises the following steps:
performing low-depth whole-genome second-generation sequencing on a plant of unknown genome size to obtain low-depth sequencing data; the size of the low-depth sequencing data is 3-5 Gb; the sequencing mode of the low-depth whole-genome second-generation sequencing comprises single-end sequencing or paired-end sequencing;
performing quality filtering on the low-depth sequencing data by using BBDuk software to obtain clean sequencing data; the quality filtering includes filtering out adapter sequences and contaminating sequences in the low-depth sequencing data;
when the low-depth whole-genome second-generation sequencing is paired-end sequencing, after the quality filtering, merging the quality-filtered sequencing data by using BBMerge software to obtain merged clean sequencing data;
taking the clean sequencing data or the merged clean sequencing data as the data to be processed, setting 5 sampling gradients (100%, 75%, 50%, 25% and 1%) for the data to be processed, and pre-running RESPECT software on them to obtain a first-iteration result; the RESPECT software has Gurobi built in;
obtaining the initial seed whole-genome sequencing depth of the second iteration according to the result of the first iteration;
setting gradient sampling within a target sequencing depth according to the initial seed whole-genome sequencing depth of the second iteration, wherein the target sequencing depth is 0.5×-5×, to obtain sampled data at 11 different sampling gradient depths (100%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, 10%, 5%); performing the second iteration on the sampled data at the different sampling gradient depths by using RESPECT software to obtain a graph of estimated whole-genome sequencing depth against genome size;
obtaining the genome size and/or repeatability of the plateau below 4× depth from the graph of estimated whole-genome sequencing depth against genome size; calculating the mean of the genome size and/or repeatability over the plateau, and taking the mean as the final estimated plant genome size and/or repeatability.
Preferably, the parameter set for each run of the RESPECT software is 1000 iterations.
Preferably, obtaining the initial seed whole genome sequencing depth of the second round of iteration according to the result of the first round of iteration comprises:
comparing the ratios between the whole-genome sequencing depths estimated by RESPECT in the first iteration with the ratios between the sampling gradients, and taking the RESPECT-estimated whole-genome sequencing depths whose ratios match those of the sampling gradients as the correct estimates;
adjusting the sampling percentage up or down based on the correct estimates, so that the initial seed whole-genome sequencing depth of the second iteration is 3×-5×.
Preferably, the form of the plant of unknown genome size comprises fresh plant tissue, silica-gel-dried plant material, or degraded plant specimen material.
Preferably, the plant comprises Arabidopsis.
Preferably, the Arabidopsis comprises diploid Arabidopsis and/or tetraploid Arabidopsis.
The invention provides a method for estimating plant genome size and/or repeatability based on low-depth sequencing. The invention uses RESPECT, which has Gurobi built in, to fit the k-mer frequency distribution; it can solve the optimization problem accurately and is particularly effective for k-mer frequency distributions obtained from low-depth sequencing (below 5×). Compared with the 30×-50× sequencing depth required by conventional genome survey sequencing, the cost can therefore be greatly reduced. In addition, the invention combines the BBDuk and BBMerge programs of the BBMap software package, so that quality filtering and merging of paired-end sequencing data can be run as a pipeline. The invention also creates an initial seed whole-genome sequencing depth from the result of the first iteration, obtains the sequencing depth and the estimated genome size calculated by RESPECT at multiple gradients through gradient sampling, plots a curve, and calculates the mean of the plateau of the curve below 4× depth to obtain the final genome size estimate. The invention can be applied to newly generated sequencing data of plants of unknown genome size, and also to the large volume of published shallow plant sequencing data (mainly data generated by research on organelle phylogenomics).
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a technical roadmap of the invention;
FIG. 2 is a graph for the three species obtained by running RESPECT on subsamples of the raw SRA data of three Arabidopsis species (A. thaliana, A. arenosa, A. suecica) in example 1 of the present invention;
FIG. 3 is a graph showing the results of running GenomeScope2 on 20× subsamples of the same raw SRA data of the three Arabidopsis species (A. thaliana, A. arenosa, A. suecica) in example 1 of the present invention;
FIG. 4 is a graph of the estimated genome size of A. arenosa obtained by running RESPECT after two rounds of iterative sampling of the raw SRA data of A. arenosa in example 2 of the present invention.
Detailed Description
The invention provides a method for estimating plant genome size and/or repeatability based on low-depth sequencing, which comprises the following steps:
performing low-depth whole-genome second-generation sequencing on a plant of unknown genome size to obtain low-depth sequencing data; the size of the low-depth sequencing data is 3-5 Gb; the sequencing mode of the low-depth whole-genome second-generation sequencing comprises single-end sequencing or paired-end sequencing;
performing quality filtering on the low-depth sequencing data by using BBDuk software to obtain clean sequencing data; the quality filtering includes filtering out adapter sequences and contaminating sequences in the low-depth sequencing data;
when the low-depth whole-genome second-generation sequencing is paired-end sequencing, after the quality filtering, merging the quality-filtered sequencing data by using BBMerge software to obtain merged clean sequencing data;
taking the clean sequencing data or the merged clean sequencing data as the data to be processed, and pre-running RESPECT software on the data to be processed to obtain a first-iteration result; the RESPECT software has Gurobi built in;
obtaining the initial seed whole-genome sequencing depth of the second iteration according to the result of the first iteration;
setting gradient sampling within a target sequencing depth according to the initial seed whole-genome sequencing depth of the second iteration, wherein the target sequencing depth is 0.5×-5×, to obtain sampled data at different sampling gradient depths; performing the second iteration on the sampled data at the different sampling gradient depths by using RESPECT software to obtain a graph of estimated whole-genome sequencing depth against genome size;
obtaining the genome size and/or repeatability of the plateau below 4× depth from the graph of estimated whole-genome sequencing depth against genome size; calculating the mean of the genome size and/or repeatability over the plateau, and taking the mean as the final estimated plant genome size and/or repeatability.
The technical roadmap of the method of the invention is shown in FIG. 1. The method is low in cost and can estimate the size of a plant genome effectively and accurately. Unlike the k-mer-based genome evaluation software commonly used in the field, the method can use low-depth sequencing data with a depth far below 30× (the usual sequencing depth requirement in the field) to estimate, at very low cost, important information such as the genome size and repeatability of plants with unknown genome characteristics. It is suitable for diploids, autopolyploids, and allopolyploids, and greatly facilitates work that requires plant genome size information, such as plant genomics and molecular phylogenetics research.
The method first performs low-depth whole-genome second-generation sequencing on a plant of unknown genome size to obtain low-depth sequencing data; the size of the low-depth sequencing data is 3-5 Gb; the sequencing mode of the low-depth whole-genome second-generation sequencing comprises single-end sequencing or paired-end sequencing.
In the present invention, the form of the plant of unknown genome size includes fresh plant tissue, silica-gel-dried plant material, or degraded plant specimen material.
The invention preferably further comprises extracting DNA from a plant of unknown genome size prior to low depth whole genome second generation sequencing of the plant of unknown genome size. The method for extracting the DNA of the plant with unknown genome size is not particularly limited, and a plant DNA extraction method or a kit conventional in the art can be adopted.
In the present invention, the plants include haploid plants and/or polyploid plants; the polyploid plant includes an autopolyploid plant and/or an allopolyploid plant. In a specific implementation of the invention, the plant comprises a diploid plant or a tetraploid plant.
In one embodiment of the invention, the plant comprises Arabidopsis; the Arabidopsis includes diploid Arabidopsis (A. thaliana) and tetraploid Arabidopsis; the tetraploid Arabidopsis includes autotetraploid Arabidopsis and allotetraploid Arabidopsis. In the invention, the autotetraploid Arabidopsis is A. arenosa and the allotetraploid Arabidopsis is A. suecica; A. suecica is a species formed by interspecific hybridization of A. thaliana and A. arenosa.
In the present invention, the low depth whole genome second generation sequencing has a sequencing depth of less than 5×.
After the low-depth sequencing data are obtained, BBDuk software is used to perform quality filtering on the low-depth sequencing data to obtain clean sequencing data; the quality filtering includes filtering out adapter sequences and contaminating sequences in the low-depth sequencing data.
When the low-depth whole-genome second-generation sequencing is paired-end sequencing, the method further comprises, after the quality filtering, merging the quality-filtered sequencing data by using BBMerge software to obtain merged clean sequencing data. The merged data then resemble single-end sequencing data, which can improve the effective sequencing depth of the genome in subsequent analysis.
The invention combines the BBDuk and BBMerge programs of the BBMap software package, so that quality filtering and merging of paired-end sequencing data can be run as a pipeline.
After the clean sequencing data or merged clean sequencing data are obtained, the invention takes them as the data to be processed and pre-runs RESPECT software on the data to be processed to obtain a first-iteration result; the RESPECT software has Gurobi built in.
In the present invention, the number of iterations of the RESPECT software is preferably 1000.
In the invention, RESPECT is run in debug mode to pre-run the data to be processed; the pre-run is the first iteration.
The invention uses RESPECT, which has Gurobi built in, to fit the k-mer frequency distribution; it can solve the optimization problem accurately and is particularly effective for k-mer frequency distributions obtained from low-depth sequencing (below 5×). Compared with the 30×-50× sequencing depth required by conventional genome survey sequencing, the cost can therefore be greatly reduced.
After the result of the first round of iteration is obtained, the invention obtains the whole genome sequencing depth of the initial seed of the second round of iteration according to the result of the first round of iteration.
In the present invention, obtaining the initial seed whole-genome sequencing depth of the second iteration according to the result of the first iteration preferably includes: comparing the ratios between the whole-genome sequencing depths estimated by RESPECT with the ratios between the sampling gradients, and taking the RESPECT-estimated whole-genome sequencing depths whose ratios match those of the sampling gradients as the correct estimates; and adjusting the sampling percentage up or down based on the correct estimates, so that the initial seed whole-genome sequencing depth of the second iteration is 3×-5×.
In the practice of the present invention, when the initial seed whole genome sequencing depth estimated by the RESPECT software is not within the target sequencing depth, then downsampling is performed to set a sampling gradient within the target sequencing depth.
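The adjustment of the sampling percentage can be sketched as a simple proportional rescaling. This is an illustrative calculation rather than the patented procedure itself: it assumes that depth scales linearly with the fraction of reads sampled, the function name and the 4× default target are chosen here for the example, and 17.13× is the full-data depth reported for the data set in Example 2.

```python
def seed_fraction(estimated_depth: float, sampled_fraction: float,
                  target_depth: float = 4.0) -> float:
    """Fraction of the full data set expected to yield `target_depth`.

    Assumption: sequencing depth is proportional to the fraction of
    reads sampled, so the full-data depth is estimated_depth / fraction.
    """
    if estimated_depth <= 0 or not 0 < sampled_fraction <= 1:
        raise ValueError("depth must be > 0 and fraction in (0, 1]")
    full_depth = estimated_depth / sampled_fraction
    # More than 100% of the data cannot be sampled.
    return min(1.0, target_depth / full_depth)


# Full data estimated at 17.13x: which fraction gives a seed of about 4x?
fraction = seed_fraction(17.13, 1.0, target_depth=4.0)
print(f"sample {fraction:.1%} of the reads")
```

When the data set is already shallower than the target, the function returns 1.0, meaning that all of the reads are used.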
In the invention, the graph of sequencing depth against genome size is obtained once the whole-genome sequencing depth and the genome size estimated by RESPECT at the different sampling gradient depths, starting from the initial seed, are stable. The criterion for stability is preferably that the ratios between the sampling gradient depths equal the ratios between the sequencing depths estimated by RESPECT, and that the coefficient of variation of the estimated genome size is less than or equal to 10%. In the present invention, the coefficient of variation is calculated from at least 6 estimated genome size values, and the mean is calculated from the set of values with the smallest coefficient of variation. In the present invention, the curve is flat when the genome size is stable.
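One possible reading of the stability criterion is sketched below. The thresholds (at least 6 values, coefficient of variation of at most 10%, depth of at most 4×) come from the text; treating the plateau as a contiguous run of depth-ordered points, the function names, and the data points are assumptions made purely for illustration.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def plateau_estimate(points, max_depth=4.0, min_points=6, max_cv=0.10):
    """points: iterable of (estimated_depth, estimated_genome_size).

    Returns (mean_size, cv, window) for the contiguous run of at least
    `min_points` values at depth <= max_depth with the smallest CV,
    or None if no run satisfies the `max_cv` criterion.
    """
    sizes = [s for d, s in sorted(points) if d <= max_depth]
    best = None
    for start in range(len(sizes)):
        for end in range(start + min_points, len(sizes) + 1):
            window = sizes[start:end]
            cv = coefficient_of_variation(window)
            if best is None or cv < best[1]:
                best = (mean(window), cv, window)
    if best is None or best[1] > max_cv:
        return None
    return best


# Hypothetical (depth, genome size in Mb) pairs, for illustration only.
points = [(0.5, 200), (0.7, 430), (0.8, 435), (0.9, 440),
          (1.0, 436), (2.0, 438), (3.0, 437), (5.0, 600)]
result = plateau_estimate(points)
print(result)
```

In this toy series the collapsed value at 0.5× and the point above 4× are excluded, and the mean of the remaining six values is reported as the plateau estimate.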
In the invention, the first iteration preferably samples 100%, 75%, 50%, 25% and 1% of the sequencing data and then runs RESPECT on each sample to obtain 5 sequencing depths; the ratios between the 5 sequencing depths are compared with the ratios between the sampling gradients to identify the sampling gradients whose depth ratios are consistent. Any one of these consistent gradients is selected and its sampling percentage is adjusted up or down so that the initial seed whole-genome sequencing depth of the second iteration is 3×-5×; 11 gradients of 100%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, 10% and 5% are then set for sampling.
In the present invention, the sequencing depth estimated by RESPECT may be very inaccurate when the actual sequencing depth is far above the range in which RESPECT is most applicable. If the seed depth is too low, the sampling proportion needs to be increased; otherwise, the sampling proportion is decreased.
The invention finds a stable estimated genome size from the plateau at or below 4× depth.
After obtaining the whole genome sequencing depth of the initial seed of the second round of iteration, setting gradient sampling in the target sequencing depth according to the whole genome sequencing depth of the initial seed of the second round of iteration, wherein the target sequencing depth is 0.5-5 x, and obtaining sampling data with different sampling gradient depths; and respectively carrying out second iteration on the sampling data with different sampling gradient depths by using RESPECT software to obtain estimated full genome sequencing depth and genome size graphs.
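The second-round sampling plan can be laid out as follows. The 11 gradients and the 0.5×-5× target range are those given in the description; the function name, the tuple layout, and the 4× seed depth are assumptions for the example, and the expected depths again assume that depth scales linearly with the sampled fraction.

```python
# The 11 sampling gradients named in the description.
GRADIENTS = (1.00, 0.90, 0.80, 0.70, 0.60, 0.50,
             0.40, 0.30, 0.20, 0.10, 0.05)


def gradient_plan(seed_depth: float, low: float = 0.5, high: float = 5.0):
    """Return (fraction, expected_depth, in_target) for every gradient."""
    plan = []
    for fraction in GRADIENTS:
        depth = seed_depth * fraction
        plan.append((fraction, depth, low <= depth <= high))
    return plan


# Hypothetical seed depth of 4x (within the 3x-5x seed range).
plan = gradient_plan(4.0)
for fraction, depth, in_target in plan:
    flag = "" if in_target else "  (outside 0.5x-5x)"
    print(f"{fraction:>5.0%}  ~{depth:.1f}x{flag}")
```

With a 4× seed, the 10% and 5% gradients fall below 0.5×, so their estimates would lie outside the target range.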
In the present invention, the target sequencing depth is preferably 0.7×-4×, more preferably 3×, 2×, 1×, 0.9×, or 0.8×. In the present invention, 0.7×-4× is the optimal sequencing depth range for RESPECT estimation in Arabidopsis.
The invention creates the initial seed whole-genome sequencing depth from the RESPECT pre-run, repeatedly subsamples the sequencing data, and, by iteration, plots the sequencing depth against the estimated genome size for the sampled data at multiple gradients.
After the graph of estimated whole-genome sequencing depth against genome size is obtained, the invention obtains the genome size and/or repeatability of the plateau below 4× depth from that graph, calculates the mean of the genome size and/or repeatability over the plateau, and takes the mean as the final estimated plant genome size and/or repeatability.
The method of the invention can be applied to newly generated sequencing data of plants of unknown genome size, and can also be applied to published massive plant shallow sequencing data (mainly data generated by research on organelle phylogenetic genomics).
For further explanation of the present invention, the method for estimating plant genome size and/or repeatability based on low-depth sequencing provided by the present invention is described in detail below with reference to the accompanying drawings and examples, which should not be construed as limiting the scope of the present invention.
Example 1
Three Arabidopsis species with known genome sizes and existing high-quality genome assemblies were used: diploid A. thaliana, autotetraploid A. arenosa, and allotetraploid A. suecica, where A. suecica is a species formed by interspecific hybridization of A. thaliana and A. arenosa. The genome sizes, the accession numbers of the selected raw SRA data, and the original sequencing depths of the three species are shown in Table 1.
Table 1. The three species used in Example 1
The raw SRA data were quality-filtered with BBDuk. Based on the genome sizes of the three species and the amount of SRA data remaining after filtering, the filtered SRA data were subsampled with seqtk at 19 different depths: all, 20×, 10×, 8×, 6×, 5×, 4×, 3×, 2×, 1×, 0.9×, 0.8×, 0.7×, 0.6×, 0.5×, 0.4×, 0.3×, 0.2×, 0.1×. For each subsample, the paired-end sequencing files R1 and R2 were merged into a single fastq file with BBMerge. RESPECT was run separately at each depth (-N 1000 --debug) to obtain the genome size at that depth; the estimated genome size was plotted against sequencing depth, the plateau of the estimated genome size below 4× sequencing depth was identified, and its mean was taken as the estimated genome size (FIG. 2).
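The fraction to request from a subsampling tool follows directly from the target depth, the genome size, and the amount of filtered data. The sketch below only does that arithmetic: the depth list is the one given above, whereas the genome size and data volume are hypothetical placeholders rather than values from Table 1, and the function name is chosen for the example.

```python
# Numeric target depths listed in Example 1 ("all" needs no subsampling).
TARGET_DEPTHS = [20, 10, 8, 6, 5, 4, 3, 2, 1,
                 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]


def subsample_fraction(target_depth: float, genome_size_bp: float,
                       total_bases: float) -> float:
    """Fraction of reads to keep so that kept bases / genome size = depth."""
    if genome_size_bp <= 0 or total_bases <= 0:
        raise ValueError("genome size and data volume must be positive")
    return min(1.0, target_depth * genome_size_bp / total_bases)


# Hypothetical inputs for illustration: a 135 Mb genome, 10 Gb of data.
genome_size_bp = 135e6
total_bases = 10e9
plan = {d: subsample_fraction(d, genome_size_bp, total_bases)
        for d in TARGET_DEPTHS}
for depth, fraction in plan.items():
    print(f"{depth:>4}x -> keep {fraction:.4f} of the reads")
```

Each fraction can then be passed to a read sampler; for paired-end files, the same random seed should be used for R1 and R2 so that the read pairs stay matched.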
It can be seen that the genome size and coverage of A. suecica estimated by RESPECT were significantly distorted when all of the data (121×) were used, so gradient subsampling was needed to obtain a stable relationship between coverage and genome size. Below 20×, the curves of estimated genome size versus sequencing depth tended to be smooth. The estimated genome size dropped sharply at 0.3× for A. suecica and at 0.5× for A. arenosa, and fluctuated transiently at 0.6× for A. thaliana. As can be seen from Fig. 2, 4× to 0.7× is the optimal sequencing depth for estimating Arabidopsis genomes with RESPECT. The genome sizes (plateau means) estimated using RESPECT were 150.1 Mb, 436.5 Mb and 354.4 Mb for A. thaliana, A. arenosa and A. suecica, respectively. Their relative order agrees with the actual genome sizes of the species, but the estimates were 11%, 18% and 31% larger than the actual genome sizes, respectively, which is within an acceptable range. The genome sizes estimated by the procedure of the present invention were closer to the actual genome sizes than those obtained by running GenomeScope2 on 20× data (159.7 Mb, 193.4 Mb and 138.3 Mb for A. thaliana, A. arenosa and A. suecica, respectively) (Fig. 3).
Example 2
Assume that the genome size of A. arenosa is unknown. The SRR2040811 data set (10.33 Gb of data) was quality-filtered and its paired-end reads were merged to obtain the data to be processed. First, the approximate whole-genome sequencing depth of the data to be processed needs to be determined quickly. The first round of iteration samples the data to be processed at five gradients (100%, 75%, 50%, 25%, 1%) and pre-runs RESPECT on each, giving the calculated sequencing depth and calculated genome size for each gradient (Table 2). The ratio of the calculated sequencing depths for the five gradients is approximately 37:29:18:10:1 (17.13:13.34:8.48:4.68:0.46), which differs greatly from the ratio of the sampling gradients (100:75:50:25:1). If the 25% sample is taken as the reference instead, the ratio of the calculated sequencing depths for the four gradients from 100% to 25% is approximately 4:3:2:1 (17.13:13.34:8.48:4.68), which is consistent with the ratio of those four gradients (100%:75%:50%:25%). This indicates that the sequencing depth calculated for the 1% sample deviates greatly and should not be used for the second round of iteration. At the same time, the approximate whole-genome sequencing depth of the data to be processed can be judged to be 17.13×.
TABLE 2 Sequencing depth, genome size and genome repeatability calculated by RESPECT in the first round of iteration of Example 2
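The ratio comparison above can be expressed as a simple consistency check: dividing each calculated depth by its sampling fraction should give roughly the same full-data depth for every reliable gradient. The sketch below uses the five depths reported in this example. The median-based rule and the 25% tolerance are assumptions made for illustration and are not the criterion stated in the patent, which compares the ratios directly.

```python
from statistics import median


def implied_full_depths(fractions, depths):
    """Full-data depth implied by each (sampling fraction, calculated depth) pair."""
    return [depth / fraction for fraction, depth in zip(fractions, depths)]


def flag_inconsistent(fractions, depths, tolerance=0.25):
    """Return the sampling fractions whose implied full depth deviates from the
    median implied depth by more than `tolerance` (a hypothetical threshold)."""
    implied = implied_full_depths(fractions, depths)
    centre = median(implied)
    return [
        fraction
        for fraction, value in zip(fractions, implied)
        if abs(value - centre) / centre > tolerance
    ]


if __name__ == "__main__":
    fractions = [1.00, 0.75, 0.50, 0.25, 0.01]   # sampling gradients of round 1
    depths = [17.13, 13.34, 8.48, 4.68, 0.46]    # depths reported in this example
    for f, v in zip(fractions, implied_full_depths(fractions, depths)):
        print(f"{f:>5.0%} sample implies a full depth of {v:.2f}x")
    print("inconsistent gradients:", flag_inconsistent(fractions, depths))
```

With these values only the 1% gradient is flagged, which matches the conclusion drawn in the text.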
For the second round of iteration, the 25% sampled data from the first round (calculated sequencing depth 4.68×) was used as the data to be processed, because 4.68× is close to the upper limit of the optimal running interval (0.5×-4×) recommended for the RESPECT software. Taking this value as the starting seed whole-genome sequencing depth, ten descending sampling gradients of 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, 10% and 5% were set and RESPECT was run on each, giving the genome size and repeatability calculated in the second round of iteration (Table 3).
TABLE 3 Sequencing depth, genome size and genome repeatability calculated by RESPECT in the second round of iteration of Example 2
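As a rough illustration of the second-round design, the nominal depth of each subsample can be computed by multiplying the seed depth by the sampling fraction. The sketch below does this for the 4.68× seed and lists which nominal depths fall inside the 0.5×-4× interval. The function names are hypothetical, and the depths that RESPECT actually calculates (Table 3) may differ from these nominal values.

```python
# Ten descending sampling gradients used in the second round of Example 2.
SECOND_ROUND_FRACTIONS = [0.90, 0.80, 0.70, 0.60, 0.50,
                          0.40, 0.30, 0.20, 0.10, 0.05]


def nominal_depths(seed_depth, fractions=SECOND_ROUND_FRACTIONS):
    """Nominal depth of each subsample: seed depth times sampling fraction."""
    return [round(seed_depth * f, 3) for f in fractions]


def within_window(depths, low=0.5, high=4.0):
    """Keep only the depths inside the recommended running interval."""
    return [d for d in depths if low <= d <= high]


if __name__ == "__main__":
    depths = nominal_depths(4.68)  # seed depth from the first round
    print("nominal depths:", depths)
    print("inside 0.5x-4x:", within_window(depths))
```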
By plotting the data in Table 3 (Fig. 4), it was found that the plateau was not obvious. By manual inspection and by comparing coefficients of variation (CV = standard deviation / mean), the segment of the curve between 3.57× and 1.19× was selected as the plateau (the mean calculated genome size of A. arenosa was 447.9 Mb, with a coefficient of variation of 7.9%). This result is similar to that calculated in Example 1, and the estimate is 21% higher than the actual genome size of A. arenosa.
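The plateau statistics described above (mean genome size and coefficient of variation over a chosen depth window) can be sketched as follows. Since Table 3 is not reproduced here, the data points in the example are hypothetical. The sketch also assumes the sample standard deviation; the text does not say whether the sample or population form was used.

```python
from statistics import mean, stdev


def plateau_summary(points, low_depth, high_depth):
    """Mean genome size and coefficient of variation over a depth window.

    `points` is a list of (calculated_depth, genome_size_mb) pairs.
    CV = standard deviation / mean, using the sample standard deviation
    (an assumption; the source does not specify which form was used).
    """
    sizes = [size for depth, size in points if low_depth <= depth <= high_depth]
    if len(sizes) < 2:
        raise ValueError("need at least two points inside the plateau window")
    average = mean(sizes)
    return average, stdev(sizes) / average


if __name__ == "__main__":
    # Hypothetical (depth, genome size in Mb) pairs, NOT the values of Table 3.
    points = [(4.0, 500.0), (3.0, 440.0), (2.0, 460.0), (1.0, 450.0), (0.3, 300.0)]
    average, cv = plateau_summary(points, low_depth=1.0, high_depth=3.0)
    print(f"plateau mean = {average:.1f} Mb, CV = {cv:.1%}")
```

Comparing the CV of several candidate windows is one way to support the manual choice of plateau, since a lower CV indicates a flatter segment.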
Example 3
For three sequencing depths (4×, 2×, 1×), RESPECT was run directly on the forward (R1) and reverse (R2) reads separately, without merging the paired-end data using BBMerge; because the effective depth is halved, these runs are equivalent to 2×, 1× and 0.5×. The results were compared with those of RESPECT runs at the three sequencing depths 2×, 1× and 0.5× after merging the paired-end data using BBMerge. The two sets of results were found to be very close, in both genome size and genome repeatability. However, if BBMerge is not used to merge the paired-end data, the required sequencing depth doubles and the sequencing cost increases greatly, so merging the paired-end data with BBMerge further reduces the sequencing cost of the method. The invention needs only 3-5 Gb of sequencing data to estimate the genome of an unknown plant species whose genome is smaller than 6 Gb, and most plants have genomes smaller than 6 Gb. Even if the genome of the unknown plant is larger than 6 Gb, the amount of sequencing data only needs to be increased slightly. If the DNB-T7 platform manufactured by MGI Tech (Huada Zhizao) is used, the cost per sample can be kept within 200 yuan, which is cheaper than flow cytometry. The requirements on the sample are also very low: a small amount of tissue, or even degraded specimen material, is sufficient.
TABLE 4 Differences in genome size calculated by RESPECT from single-end and merged data
Although the foregoing examples describe the invention in detail, they are only some, not all, of its embodiments, and it should be understood that other embodiments may be derived from them without departing from the spirit and scope of the invention.
Claims (6)
1. A method for estimating plant genome size and/or repeatability based on low depth sequencing, comprising the steps of:
performing low-depth whole-genome second-generation sequencing on plants with unknown genome sizes to obtain low-depth sequencing data; the size of the low-depth sequencing data is 3-5 Gb; the sequencing mode of the low-depth whole-genome second-generation sequencing comprises single-end sequencing or double-end sequencing;
performing quality filtering on the low-depth sequencing data by using BBDuk software to obtain clean sequencing data; the quality filtering includes filtering out adaptor sequences and contaminating sequences in the low-depth sequencing data;
when the low-depth whole genome second-generation sequencing is double-ended sequencing, after the quality filtering, combining the sequencing data after the quality filtering by using BBMerge software to obtain the combined clean sequencing data;
taking the clean sequencing data or the combined clean sequencing data as data to be processed, setting 5 sampling gradients (100%, 75%, 50%, 25% and 1%) for the data to be processed, and pre-running RESPECT software to obtain a first-round iteration result; the RESPECT software has Gurobi built in;
obtaining the whole genome sequencing depth of the initial seed of the second iteration according to the result of the first iteration;
setting gradient sampling within a target sequencing depth according to the starting seed whole-genome sequencing depth of the second round of iteration, wherein the target sequencing depth is 0.5×-5×, and obtaining sampling data at 11 different sampling gradient depths (100%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, 10%, 5%); performing the second round of iteration on the sampling data at the different sampling gradient depths by using RESPECT software to obtain a curve of estimated whole-genome sequencing depth versus genome size;
identifying the plateau of genome size and/or repeatability below 4× depth on the curve of estimated whole-genome sequencing depth versus genome size; calculating an average value of the genome sizes and/or repetition rates of the plateau, and taking the average value as the final estimated plant genome size and/or repetition rate.
2. The method according to claim 1, wherein the parameter set for each run of the RESPECT software is a number of cycles of 1000.
3. The method of claim 1, wherein deriving a starting seed whole genome sequencing depth for a second round of iterations based on the results of the first round of iterations comprises:
comparing the ratio of the whole-genome sequencing depths estimated by the RESPECT software in the first round of iteration with the ratio of the sampling gradients, and selecting the RESPECT-estimated whole-genome sequencing depths whose ratio agrees with that of the sampling gradients as the correct estimated values;
adjusting the sampling percentage up or down based on the correct estimated value, so that the starting seed whole-genome sequencing depth of the second round of iteration is 3×-5×.
4. The method of claim 1, wherein the form of the plant of unknown genome size comprises plant fresh tissue, plant silica gel desiccated material, or plant specimen degraded material.
5. The method of claim 1, wherein the plant comprises Arabidopsis.
6. The method of claim 5, wherein the Arabidopsis comprises diploid Arabidopsis and/or tetraploid Arabidopsis.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311367837.5A CN117106875B (en) | 2023-10-23 | 2023-10-23 | Method for estimating plant genome size and/or repeatability based on low-depth sequencing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117106875A true CN117106875A (en) | 2023-11-24 |
CN117106875B CN117106875B (en) | 2024-02-06 |
Family
ID=88811287
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311367837.5A Active CN117106875B (en) | 2023-10-23 | 2023-10-23 | Method for estimating plant genome size and/or repeatability based on low-depth sequencing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117106875B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013097149A1 (en) * | 2011-12-29 | 2013-07-04 | 深圳华大基因科技服务有限公司 | Method and device for estimating repeating sequence content of genome |
CN107679366A (en) * | 2017-08-30 | 2018-02-09 | 武汉古奥基因科技有限公司 | A kind of computational methods of genome mutation data |
CN109295185A (en) * | 2018-09-05 | 2019-02-01 | 暨南大学 | A kind of measuring method suitable for single celled eukaryotic algal gene group size |
CN109411014A (en) * | 2018-10-09 | 2019-03-01 | 中国科学院昆明植物研究所 | A kind of cyclic method of plant chloroplast full-length genome assembling based on the sequencing of two generations |
CN111411107A (en) * | 2020-03-27 | 2020-07-14 | 武汉古奥基因科技有限公司 | Method for polyploid genome surfy |
US20220205034A1 (en) * | 2019-09-12 | 2022-06-30 | Zhejiang University | Method for quickly identifying clean transgenic or gene-edited plants and insertion sites by using whole genome re-sequencing data |
Non-Patent Citations (1)
Title |
---|
CEN GUO: "Phylogenomics and the flowering plant tree of life", JOURNAL OF INTEGRATIVE PLANT BIOLOGY, vol. 65, no. 2, pages 299 * |
Also Published As
Publication number | Publication date |
---|---|
CN117106875B (en) | 2024-02-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||