CN117106875B - Method for estimating plant genome size and/or repeatability based on low-depth sequencing - Google Patents

Info

Publication number
CN117106875B
Authority
CN
China
Prior art keywords
sequencing
depth
genome
data
size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311367837.5A
Other languages
Chinese (zh)
Other versions
CN117106875A (en
Inventor
贺正山
杨俊波
曾春霞
李德铢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming Institute of Botany of CAS
Original Assignee
Kunming Institute of Botany of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming Institute of Botany of CAS filed Critical Kunming Institute of Botany of CAS
Priority to CN202311367837.5A priority Critical patent/CN117106875B/en
Publication of CN117106875A publication Critical patent/CN117106875A/en
Application granted granted Critical
Publication of CN117106875B publication Critical patent/CN117106875B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B45/00: ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Organic Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Data Mining & Analysis (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Bioethics (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Genetics & Genomics (AREA)
  • Biochemistry (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method for estimating the size and/or repeatability of a plant genome based on low-depth sequencing, and belongs to the technical field of plant molecular biology. The invention uses RESPECT with built-in Gurobi to fit the k-mer frequency distribution, so that the size and/or repeatability of a plant genome can be obtained from low-depth sequencing data of no more than 5× depth, which reduces the experimental cost; by combining BBDuk and BBMerge from the BBMap package, quality filtering and merging of paired-end sequencing data can be carried out as a pipeline; the initial seed whole-genome sequencing depth for the second round of iteration is obtained from the result of the first round, gradient sampling is set to obtain the sequencing depth and estimated genome size calculated by RESPECT at multiple gradients, a curve is plotted, and the mean of the plateau of the curve below 4× depth is taken as the final genome size estimate. The method of the invention is low in cost and can accurately estimate plant genome size.

Description

Method for estimating plant genome size and/or repeatability based on low-depth sequencing
Technical Field
The invention belongs to the technical field of plant molecular biology, and particularly relates to a method for estimating plant genome size and/or repeatability based on low-depth sequencing.
Background
The genome contains the most fundamental genetic information of a species. Plant genomes are more complex than animal genomes and are characterized by large size, polyploidy, high heterozygosity, high repeat content and the like. Genome size, i.e., the C-value, is the total amount of DNA contained in a haploid gamete of a species, typically expressed as a weight (pg) or as a number of nucleotide base pairs (bp), with 1 pg of DNA corresponding to approximately 978 Mb. Land plant genome sizes vary about 2400-fold. The genome sizes of about 12,000 plants are recorded in the Plant DNA C-values database (https://cvalues.science.kew.org/); among flowering plants, the genome of the corkscrew plant Genlisea tuberosa is only 61 Mb, while the genome of Paris japonica is 148 Gb. Genome size is an important indicator of biodiversity and species specificity. The chromosome number and genome size in the cells of each plant taxon are relatively fixed, so genome size provides important guidance for research on biological evolution and plant systematics and taxonomy, and is also significant for breeding, domestication, and the conservation and use of elite germplasm resources.
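The unit conversion mentioned above can be illustrated with a minimal Python sketch. It assumes only the approximate factor quoted in the text (1 pg ≈ 978 Mb); the function names are hypothetical and chosen for illustration.

```python
# Approximate conversion factor quoted in the text: 1 pg of DNA ~ 978 Mb.
MB_PER_PG = 978.0


def pg_to_mb(c_value_pg: float) -> float:
    """Convert a C-value in picograms to megabases."""
    return c_value_pg * MB_PER_PG


def mb_to_pg(genome_size_mb: float) -> float:
    """Convert a genome size in megabases to picograms."""
    return genome_size_mb / MB_PER_PG


if __name__ == "__main__":
    # The two extremes cited in the text, expressed in picograms.
    for name, size_mb in [("Genlisea tuberosa", 61.0), ("Paris japonica", 148_000.0)]:
        print(f"{name}: {size_mb:,.0f} Mb ~ {mb_to_pg(size_mb):.3f} pg")
```

Because the factor is itself an approximation, converted values should be read as estimates rather than exact measurements.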
Polyploidy, repetitive sequences, and intraspecific variation are three major factors affecting plant genome size. Polyploidy (autopolyploidy and allopolyploidy) can enhance genetic diversity and contributes positively to plant adaptation to environmental change; about 70% of extant plants are polyploid. Repetitive sequences can be divided into tandem repeats and interspersed repeats; they are mostly located in intergenic and intronic regions and account for 10% to 85% of the plant genome. Intraspecific variation has been found in many species such as maize and fescue; the difference between populations can exceed 30%, polyploids and diploids can coexist, and the variation correlates with geographic and climatic factors. It is generally thought that the C-value increases as organisms evolve, but there is no clear correlation between the structural and functional complexity of an organism and the size of its genome, a phenomenon known as the "C-value paradox".
The commonly used methods for determining genome size are mainly flow cytometry, which has the advantages of simple operation, economy, high efficiency and high accuracy, and genome survey based on k-mer analysis, which has the advantages of simple experiments, high speed and high reproducibility. A k-mer is a substring of length K contained in a sequence of length L; sliding along the sequence in steps of one base yields a total of (L-K+1) k-mers. K-mer analysis assumes that sequenced reads are randomly distributed across the genome and that, in the absence of sequencing errors, repetitive sequences and heterozygous sequences, the k-mer distribution follows a Poisson distribution. In practice, however, the error rate, the proportion of repetitive sequences and the heterozygosity must all be calculated, and the genome size estimate is corrected according to these results. K-mer analysis generally uses software such as Jellyfish, KMC, KAT or KmerGenie to obtain the k-mer frequency distribution, and then software such as GCE, GenomeScope, findGSE or BBNorm to estimate genome size, repeatability and the like. The estimated genome size is affected by the chosen k value, the upper limit of the frequency distribution, the maximum k-mer coverage, heterozygosity/homozygosity/ploidy, and so on. Current software based on k-mer analysis requires a genome sequencing depth of 30× to 50×, and is prone to failed fitting of the frequency distribution peaks and to difficulty in distinguishing the main peak from the heterozygous peak, so the estimated genome sizes often differ widely.
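The k-mer definition above can be made concrete with a short Python sketch. It is a toy illustration only: it counts k-mers on a single strand of one sequence and does not use canonical k-mers or any of the tools named in the text; the function names are hypothetical.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count every k-mer obtained by sliding a window of length k one base at a time."""
    if k <= 0 or k > len(seq):
        raise ValueError("k must be between 1 and the sequence length")
    # A sequence of length L yields exactly L - k + 1 windows.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def frequency_histogram(kmer_counts: Counter) -> Counter:
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    return Counter(kmer_counts.values())


if __name__ == "__main__":
    sequence = "ACGTACGTAC"  # toy sequence, L = 10
    counts = count_kmers(sequence, k=4)
    print(sum(counts.values()), "k-mers in total")
    print(dict(frequency_histogram(counts)))
```

The histogram of multiplicities is the quantity that genome survey tools fit; at low sequencing depth most k-mers are seen only once or not at all, which is why the fitting is harder than at 30× to 50×.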
Disclosure of Invention
The invention aims to provide a method for estimating the size and/or the repeatability of a plant genome based on low-depth sequencing, which can effectively and accurately estimate the size of the plant genome.
The invention provides a method for estimating plant genome size and/or repeatability based on low-depth sequencing, which comprises the following steps:
performing low-depth whole-genome second-generation sequencing on a plant of unknown genome size to obtain low-depth sequencing data; the amount of low-depth sequencing data is 3-5 Gb; the sequencing mode of the low-depth whole-genome second-generation sequencing comprises single-end sequencing or paired-end sequencing;
performing quality filtering on the low-depth sequencing data using BBDuk software to obtain clean sequencing data; the quality filtering includes filtering out adapter sequences and contaminating sequences from the low-depth sequencing data;
when the low-depth whole-genome second-generation sequencing is paired-end sequencing, merging the quality-filtered sequencing data using BBMerge software after the quality filtering to obtain merged clean sequencing data;
taking the clean sequencing data or the merged clean sequencing data as the data to be processed, setting 5 sampling gradients (100%, 75%, 50%, 25% and 1%) for the data to be processed, and pre-running RESPECT software to obtain a first-round iteration result; the RESPECT software has Gurobi built in;
obtaining the initial seed whole-genome sequencing depth for the second round of iteration from the result of the first round of iteration;
setting gradient sampling within a target sequencing depth according to the initial seed whole-genome sequencing depth of the second round of iteration, the target sequencing depth being 0.5× to 5×, to obtain sampled data at 11 different sampling gradient depths (100%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, 10%, 5%); performing the second round of iteration on the sampled data at each sampling gradient depth using RESPECT software to obtain a curve of estimated whole-genome sequencing depth versus genome size;
obtaining the genome size and/or repeatability of the plateau below 4× depth from the curve of estimated whole-genome sequencing depth versus genome size; calculating the mean of the genome sizes and/or repeatability values on the plateau, and taking the mean as the final estimated plant genome size and/or repeatability.
Preferably, the number of iterations for each run of the RESPECT software is set to 1000.
Preferably, obtaining the initial seed whole-genome sequencing depth for the second round of iteration from the result of the first round of iteration comprises:
comparing the ratios between the whole-genome sequencing depths estimated by the RESPECT software in the first round of iteration with the ratios between the sampling gradients, and taking the RESPECT-estimated whole-genome sequencing depths whose ratios match as the correct estimates;
adjusting the sampling percentage upward or downward from a correct estimate so that the initial seed whole-genome sequencing depth for the second round of iteration is 3× to 5×.
Preferably, the material of the plant of unknown genome size comprises fresh plant tissue, silica-gel-dried plant material, or degraded plant specimen material.
Preferably, the plant comprises Arabidopsis.
Preferably, the Arabidopsis comprises diploid Arabidopsis and/or tetraploid Arabidopsis.
The invention provides a method for estimating plant genome size and/or repeatability based on low-depth sequencing. The invention uses the software RESPECT with built-in Gurobi to fit the k-mer frequency distribution; it can solve the optimization problem accurately and is particularly effective for k-mer frequency distributions obtained under low-depth sequencing (below 5×). Compared with the 30× to 50× sequencing depth required by conventional genome survey sequencing, the cost can therefore be greatly reduced. In addition, the invention combines the BBDuk and BBMerge software from the BBMap package, so that quality filtering and merging of paired-end sequencing data can be carried out as a pipeline. The invention also establishes an initial seed whole-genome sequencing depth based on the first-round iteration result, obtains through gradient sampling the sequencing depth and estimated genome size calculated by RESPECT at multiple gradients, plots a curve, and calculates the mean of the plateau of the curve below 4× depth to obtain the final genome size estimate. The invention can be applied to newly generated sequencing data from plants of unknown genome size, and also to the large volume of published plant shallow-sequencing data (mainly data generated by organelle phylogenomic studies).
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a technical roadmap of the invention;
FIG. 2 shows the curves for three species obtained by running RESPECT on subsampled raw SRA data of three Arabidopsis species (A. thaliana, A. arenosa, A. suecica) in Example 1 of the present invention;
FIG. 3 shows the results obtained in Example 1 of the present invention by running GenomeScope2 on the 20× subsamples of the same raw SRA data of the three Arabidopsis species (A. thaliana, A. arenosa, A. suecica);
FIG. 4 is a curve of the estimated genome size of A. arenosa obtained by running RESPECT after two rounds of iterative sampling of the raw SRA data of A. arenosa in Example 2 of the present invention.
Detailed Description
The invention provides a method for estimating plant genome size and/or repeatability based on low-depth sequencing, which comprises the following steps:
performing low-depth whole-genome second-generation sequencing on a plant of unknown genome size to obtain low-depth sequencing data; the amount of low-depth sequencing data is 3-5 Gb; the sequencing mode of the low-depth whole-genome second-generation sequencing comprises single-end sequencing or paired-end sequencing;
performing quality filtering on the low-depth sequencing data using BBDuk software to obtain clean sequencing data; the quality filtering includes filtering out adapter sequences and contaminating sequences from the low-depth sequencing data;
when the low-depth whole-genome second-generation sequencing is paired-end sequencing, merging the quality-filtered sequencing data using BBMerge software after the quality filtering to obtain merged clean sequencing data;
taking the clean sequencing data or the merged clean sequencing data as the data to be processed, and pre-running RESPECT software on the data to be processed to obtain a first-round iteration result; the RESPECT software has Gurobi built in;
obtaining the initial seed whole-genome sequencing depth for the second round of iteration from the result of the first round of iteration;
setting gradient sampling within a target sequencing depth according to the initial seed whole-genome sequencing depth of the second round of iteration, the target sequencing depth being 0.5× to 5×, to obtain sampled data at different sampling gradient depths; performing the second round of iteration on the sampled data at each sampling gradient depth using RESPECT software to obtain a curve of estimated whole-genome sequencing depth versus genome size;
obtaining the genome size and/or repeatability of the plateau below 4× depth from the curve of estimated whole-genome sequencing depth versus genome size; calculating the mean of the genome sizes and/or repeatability values on the plateau, and taking the mean as the final estimated plant genome size and/or repeatability.
The technical roadmap of the method of the invention is shown in FIG. 1. The method is low in cost and can effectively and accurately estimate plant genome size. Unlike the k-mer-based genome survey software commonly used in the field, it can use low-depth sequencing data far below 30× (the usual sequencing depth requirement in the field) to estimate, at very low cost, important information such as the genome size and repeatability of plants with unknown genome characteristics. It is suitable for diploids, autopolyploids and allopolyploids, and greatly facilitates research that requires plant genome size information, such as plant genomics and molecular phylogenetics.
In the method, low-depth whole-genome second-generation sequencing is first performed on a plant of unknown genome size to obtain low-depth sequencing data; the amount of low-depth sequencing data is 3-5 Gb; the sequencing mode of the low-depth whole-genome second-generation sequencing comprises single-end sequencing or paired-end sequencing.
In the present invention, the material of the plant of unknown genome size includes fresh plant tissue, silica-gel-dried plant material, or degraded plant specimen material.
The invention preferably further comprises extracting DNA from a plant of unknown genome size prior to low depth whole genome second generation sequencing of the plant of unknown genome size. The method for extracting the DNA of the plant with unknown genome size is not particularly limited, and a plant DNA extraction method or a kit conventional in the art can be adopted.
In the present invention, the plants include diploid plants and/or polyploid plants; the polyploid plants include autopolyploid plants and/or allopolyploid plants. In a specific implementation of the invention, the plant comprises a diploid plant or a tetraploid plant.
In one embodiment of the invention, the plant comprises Arabidopsis; the Arabidopsis includes diploid Arabidopsis thaliana (A. thaliana) and tetraploid Arabidopsis; the tetraploid Arabidopsis includes autotetraploid Arabidopsis and allotetraploid Arabidopsis. In the invention, the autotetraploid Arabidopsis is A. arenosa; the allotetraploid Arabidopsis is A. suecica, a species formed by interspecific hybridization of A. thaliana and A. arenosa.
In the present invention, the low depth whole genome second generation sequencing has a sequencing depth of less than 5×.
After the low-depth sequencing data are obtained, BBDuk software is used to quality-filter the low-depth sequencing data to obtain clean sequencing data; the quality filtering includes filtering out adapter sequences and contaminating sequences from the low-depth sequencing data.
When the low-depth whole-genome second-generation sequencing is paired-end sequencing, the method further comprises, after the quality filtering, merging the quality-filtered sequencing data using BBMerge software to obtain merged clean sequencing data. The merged data take a form similar to single-end sequencing data, which improves the effective sequencing depth of the genome in subsequent analysis.
The invention combines BBDuk, BBMerge software in BBMap software package, and can process quality filtering and combine double-end sequencing data.
After the clean sequencing data or the merged clean sequencing data are obtained, the invention takes them as the data to be processed and pre-runs RESPECT software on the data to be processed to obtain a first-round iteration result; the RESPECT software has Gurobi built in.
In the present invention, the number of iterations of the RESPECT software is preferably 1000.
In the invention, the RESPECT software starts a debug mode to pre-run the data to be processed; the pre-run is the first iteration.
The invention uses the software RESPECT with built-in Gurobi to fit the k-mer frequency distribution; it can solve the optimization problem accurately and is particularly effective for k-mer frequency distributions obtained under low-depth sequencing (below 5×). Compared with the 30× to 50× sequencing depth required by conventional genome survey sequencing, the cost can therefore be greatly reduced.
After the result of the first round of iteration is obtained, the invention obtains the whole genome sequencing depth of the initial seed of the second round of iteration according to the result of the first round of iteration.
In the present invention, obtaining the initial seed whole-genome sequencing depth for the second round of iteration from the result of the first round of iteration preferably includes: comparing the ratios between the whole-genome sequencing depths estimated by the RESPECT software with the ratios between the sampling gradients, and taking the RESPECT-estimated whole-genome sequencing depths whose ratios match as the correct estimates; and adjusting the sampling percentage upward or downward from a correct estimate so that the initial seed whole-genome sequencing depth for the second round of iteration is 3× to 5×.
In the practice of the present invention, when the initial seed whole genome sequencing depth estimated by the RESPECT software is not within the target sequencing depth, then downsampling is performed to set a sampling gradient within the target sequencing depth.
In the invention, the curve of sequencing depth versus genome size is obtained once the whole-genome sequencing depths and genome sizes estimated by the RESPECT software at the different sampling gradient depths are stable; the criterion for stability is preferably: the ratios between the different sampling gradient depths equal the ratios between the sequencing depths estimated by the RESPECT software, and the coefficient of variation of the estimated genome size is less than or equal to 10%. In the present invention, the coefficient of variation is calculated over at least 6 estimated genome size values, and the mean is calculated from the set of values with the smallest coefficient of variation. In the present invention, the curve is smooth when the genome size is stable.
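The stability criterion above (at least 6 values, coefficient of variation of at most 10%, plateau at 4× depth or less) can be sketched in Python as follows. This is one possible reading, not the patented procedure: it assumes the plateau is a run of consecutive points along the depth axis, it uses the sample standard deviation, and the function names and example data are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = standard deviation / mean (sample standard deviation is an assumption)."""
    return stdev(values) / mean(values)


def best_plateau(points, max_depth=4.0, min_points=6, max_cv=0.10):
    """Pick the run of consecutive points with the smallest CV of genome size.

    points: iterable of (estimated_depth, estimated_genome_size_mb).
    Returns (cv, mean_size, window) or None if no window meets the criterion.
    """
    usable = sorted((p for p in points if p[0] <= max_depth), reverse=True)
    best = None
    for start in range(len(usable)):
        for end in range(start + min_points, len(usable) + 1):
            window = usable[start:end]
            sizes = [size for _, size in window]
            cv = coefficient_of_variation(sizes)
            if cv <= max_cv and (best is None or cv < best[0]):
                best = (cv, mean(sizes), window)
    return best


if __name__ == "__main__":
    # Hypothetical (depth, genome size in Mb) pairs, not taken from the patent tables.
    curve = [(5.0, 600), (4.0, 450), (3.0, 440), (2.0, 455), (1.0, 445),
             (0.9, 450), (0.8, 448), (0.7, 452), (0.3, 200)]
    cv, size, window = best_plateau(curve)
    print(f"plateau mean = {size:.1f} Mb, CV = {cv:.1%}, {len(window)} points")
```

In Example 2 the plateau was identified manually with the help of the coefficient of variation, so an automated search like this would serve as a cross-check rather than a replacement for inspecting the curve.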
In the invention, the first round of iteration preferably samples 100%, 75%, 50%, 25% and 1% of the sequencing data and then runs RESPECT to obtain 5 sequencing depths, and the ratios between the 5 sequencing depths are compared to identify the sampling gradients whose depth ratios are consistent with the sampling ratios; any one of these consistent gradients is selected and its sampling percentage is adjusted upward or downward so that the initial seed whole-genome sequencing depth for the second round of iteration is 3× to 5×, and 11 gradients of 100%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, 10% and 5% are then set for sampling.
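The adjustment of the seed depth and the 11-gradient schedule can be sketched in Python. The sketch assumes that depth scales linearly with the sampling fraction (the same proportionality the ratio check relies on) and aims for the midpoint of the 3× to 5× range when an adjustment is needed; both choices, and all function names, are illustrative assumptions.

```python
# The 11 second-round sampling gradients listed in the text.
SECOND_ROUND_FRACTIONS = (1.00, 0.90, 0.80, 0.70, 0.60, 0.50,
                          0.40, 0.30, 0.20, 0.10, 0.05)


def seed_fraction(estimated_depth, sampled_fraction, seed_range=(3.0, 5.0)):
    """Fraction of the full data set whose expected depth lies in seed_range.

    estimated_depth is the depth RESPECT reported for data sampled at
    sampled_fraction; depth is assumed proportional to the fraction.
    """
    low, high = seed_range
    if low <= estimated_depth <= high:
        return sampled_fraction  # already a usable seed
    full_depth = estimated_depth / sampled_fraction
    if full_depth < low:
        raise ValueError("full data set is shallower than the seed range")
    target = (low + high) / 2  # midpoint target is an arbitrary choice
    return min(1.0, target / full_depth)


def gradient_depths(seed_depth, fractions=SECOND_ROUND_FRACTIONS):
    """Nominal depth expected at each second-round gradient of the seed data."""
    return [(fraction, seed_depth * fraction) for fraction in fractions]


if __name__ == "__main__":
    seed = 4.68  # depth reported for the 25% sample in Example 2
    for fraction, depth in gradient_depths(seed):
        print(f"{fraction:>5.0%} of seed data -> about {depth:.2f}x")
```

The depths printed here are nominal; the depths that RESPECT actually reports for each subsample are the ones plotted against estimated genome size.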
In the present invention, the sequencing depth estimated by RESPECT may be very inaccurate when the actual sequencing depth is much higher than the range in which RESPECT works best. If the seed depth is too low, the sampling proportion needs to be increased; if it is too high, the sampling proportion is decreased.
The invention can find stable estimated genome size through a plateau of 4 x depth or less.
After obtaining the whole genome sequencing depth of the initial seed of the second round of iteration, setting gradient sampling in the target sequencing depth according to the whole genome sequencing depth of the initial seed of the second round of iteration, wherein the target sequencing depth is 0.5-5 x, and obtaining sampling data with different sampling gradient depths; and respectively carrying out second iteration on the sampling data with different sampling gradient depths by using RESPECT software to obtain estimated full genome sequencing depth and genome size graphs.
In the present invention, the target sequencing depth is preferably 0.7× to 4×, more preferably 3×, 2×, 1×, 0.9×, or 0.8×. In the present invention, 0.7× to 4× is the optimal sequencing depth range for RESPECT estimation in Arabidopsis.
Based on the RESPECT pre-run, the invention establishes the initial seed whole-genome sequencing depth, repeatedly subsamples the sequencing data, and iteratively obtains the sequencing depth and estimated genome size for the sampled data at multiple gradients, from which a curve is plotted.
After the curve of estimated whole-genome sequencing depth versus genome size is obtained, the invention obtains the genome size and/or repeatability of the plateau below 4× depth from the curve; the mean of the genome sizes and/or repeatability values on the plateau is calculated and taken as the final estimated plant genome size and/or repeatability.
The method of the invention can be applied to newly generated sequencing data of plants of unknown genome size, and can also be applied to published massive plant shallow sequencing data (mainly data generated by research on organelle phylogenetic genomics).
To further illustrate the present invention, the method for estimating plant genome size and/or repeatability based on low-depth sequencing provided by the invention is described in detail below with reference to the accompanying drawings and examples, which should not be construed as limiting the scope of the invention.
Example 1
Three Arabidopsis species with known genome sizes and existing high-quality genome assemblies were used: diploid Arabidopsis thaliana (A. thaliana), autotetraploid A. arenosa, and allotetraploid A. suecica, where A. suecica is a species formed by interspecific hybridization of A. thaliana and A. arenosa. The genome sizes, the accession numbers of the selected raw SRA data, and the original sequencing depths of the three species are shown in the following table (Table 1).
TABLE 1 three species for example 1
The raw SRA data were quality-filtered using BBDuk. Based on the genome sizes of the three species and the amount of SRA data remaining after filtering, the software seqtk was used to draw 19 subsamples of the filtered SRA data at different depths: all, 20×, 10×, 8×, 6×, 5×, 4×, 3×, 2×, 1×, 0.9×, 0.8×, 0.7×, 0.6×, 0.5×, 0.4×, 0.3×, 0.2×, 0.1×. For each subsample, the paired-end sequencing data R1 and R2 were merged into a single fastq file using the software BBMerge. RESPECT was run separately at each depth (-N 1000 --debug) to obtain the genome size at that depth; the estimated genome size was plotted against sequencing depth, the plateau of the estimated genome size below 4× sequencing depth was located, and its mean was taken as the estimated genome size (FIG. 2).
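The subsampling step depends on converting a target depth into a fraction of the filtered reads. A minimal Python sketch of that arithmetic follows, assuming depth = total bases / genome size; the function name and the example numbers are hypothetical and are not the values from Table 1.

```python
def subsample_fraction(target_depth: float, genome_size_bp: float, total_bases: float) -> float:
    """Fraction of reads to keep so the subsample reaches target_depth.

    Assumes depth = total sequenced bases / genome size and uniform read length.
    """
    needed_bases = target_depth * genome_size_bp
    if needed_bases > total_bases:
        raise ValueError("target depth exceeds the depth of the available data")
    return needed_bases / total_bases


if __name__ == "__main__":
    # Hypothetical inputs: a 100 Mb genome and 10 Gb of filtered data (100x).
    genome_size_bp = 100e6
    total_bases = 10e9
    for depth in (20, 10, 5, 4, 3, 2, 1, 0.5, 0.1):
        fraction = subsample_fraction(depth, genome_size_bp, total_bases)
        print(f"{depth:>4}x -> keep {fraction:.4f} of reads")
```

A fraction computed this way could then be passed to a read sampler; for paired-end data the same random seed would need to be used for R1 and R2 so that mates stay paired.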
It can be seen that the genome size and coverage of A. suecica estimated by RESPECT were significantly distorted when all data (121×) were used, and subsampling was required to obtain a stable relationship between coverage and genome size across the sampling gradient. From 20× downward, the curves of estimated genome size versus sequencing depth tended to be smooth. The estimated genome size of A. suecica dropped sharply at 0.3×, that of A. arenosa dropped sharply at 0.5×, and that of A. thaliana fluctuated briefly at 0.6×. As can be seen from FIG. 2, 0.7× to 4× is the optimal sequencing depth range for RESPECT estimation in Arabidopsis. The genome sizes (plateau means) estimated using RESPECT were 150.1 Mb, 436.5 Mb and 354.4 Mb for A. thaliana, A. arenosa and A. suecica, respectively. Their relative ordering matches that of the actual genome sizes of the species, but the estimates were 11%, 18% and 31% larger than the actual genome sizes, respectively, which is within an acceptable range. The genome sizes estimated by the procedure of the present invention were closer to the actual genome sizes of the species than those obtained by running GenomeScope2 on 20× data (estimated genome sizes of 159.7 Mb, 193.4 Mb and 138.3 Mb for A. thaliana, A. arenosa and A. suecica, respectively) (FIG. 3).
Example 2
Assuming that the genome size of A. arenosa is unknown, the data to be processed were obtained by quality-control filtering and paired-end merging of the SRR2040811 data set (data amount 10.33 Gb). First, the approximate whole-genome sequencing depth of the data to be processed needs to be determined quickly. In the first round of iteration, the data to be processed were sampled at five gradients of 100%, 75%, 50%, 25% and 1%, and RESPECT was then pre-run on each, giving the calculated genome sequencing depth and calculated genome size for each gradient (Table 2). The ratio of the calculated sequencing depths for the five sampling gradients is approximately 37:29:18:10:1 (17.13:13.34:8.48:4.68:0.46), which differs markedly from the ratio of the sampling gradients. If the calculated sequencing depth of the 25% sample is taken as the reference, the ratio of the calculated sequencing depths for the four sampling gradients from 100% to 25% is approximately 4:3:2:1 (17.13:13.34:8.48:4.68), which is consistent with the ratio of the four sampling gradients (100%:75%:50%:25%). This indicates that the calculated sequencing depth of the 1% sample deviates greatly and should not be used for the second round of iteration. At the same time, the approximate whole-genome sequencing depth of the data to be processed can be judged to be 17.13×.
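The ratio comparison in this example can be expressed as a short Python sketch using the five depths quoted above. The approach shown (dividing each depth by its sampling fraction and keeping the gradients that agree with the median implied full depth) and the 15% tolerance are assumptions made for illustration; the patent describes the comparison only in terms of ratios.

```python
from statistics import median


def consistent_gradients(fractions, depths, tolerance=0.15):
    """Return the sampling fractions whose depth scales in proportion to the fraction.

    Each depth / fraction is the full-data depth implied by that gradient; a
    gradient is kept if it lies within `tolerance` (relative) of the median.
    """
    implied = [depth / fraction for fraction, depth in zip(fractions, depths)]
    reference = median(implied)
    return [fraction for fraction, value in zip(fractions, implied)
            if abs(value - reference) / reference <= tolerance]


if __name__ == "__main__":
    # First-round gradients and RESPECT depths quoted in Example 2.
    fractions = [1.00, 0.75, 0.50, 0.25, 0.01]
    depths = [17.13, 13.34, 8.48, 4.68, 0.46]
    kept = consistent_gradients(fractions, depths)
    print("consistent gradients:", kept)
```

With these inputs the 1% gradient implies a full depth of about 46×, far from the roughly 17× to 19× implied by the other four, so it is the one excluded, in line with the conclusion drawn in the example.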
Table 2. Sequencing depth, genome size, and genome repeatability calculated by RESPECT in the first iteration of Example 2
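The first-iteration consistency check can be sketched in Python as follows. This is only an illustration of the reasoning in Example 2, not the inventors' implementation: the function name `consistent_gradients`, the use of the median as a reference, and the 15% tolerance are all assumptions, since the text compares the ratios by inspection and gives no numeric cutoff. The depths are the values quoted above.

```python
import statistics

def consistent_gradients(fractions, depths, tolerance=0.15):
    """Return the sampling fractions whose calculated depth scales with the fraction.

    `tolerance` is an illustrative assumption; the source gives no numeric cutoff.
    """
    # Depth implied for the full data set by each sample (depth / fraction).
    implied_full = [d / f for f, d in zip(fractions, depths)]
    # The median serves as reference so that a single outlier cannot shift it.
    reference = statistics.median(implied_full)
    return [
        f for f, full in zip(fractions, implied_full)
        if abs(full - reference) / reference <= tolerance
    ]

# Values reported for the first iteration of Example 2.
fractions = [1.00, 0.75, 0.50, 0.25, 0.01]
depths = [17.13, 13.34, 8.48, 4.68, 0.46]
kept = consistent_gradients(fractions, depths)
print(kept)
```

With these inputs the 1% sample implies a full-data depth of 46×, far from the roughly 17× to 19× implied by the other four samples, so it is the only gradient rejected.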
The second iteration took the 25% sample of the first iteration (calculated sequencing depth 4.68×) as its data to be processed, because 4.68× is close to the upper limit of the optimal operating range (0.5× to 4×) recommended for the RESPECT software. Starting from this seed whole genome sequencing depth, ten downward sampling gradients were set (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, 10%, and 5%), and RESPECT was run on each of them, yielding the genome size and repeatability calculated in the second iteration (Table 3).
Table 3. Sequencing depth, genome size, and genome repeatability calculated by RESPECT in the second iteration of Example 2
Plotting the data in Table 3 (FIG. 4) showed that the plateau was not obvious. By manual inspection and comparison of the coefficient of variation (CV = standard deviation / mean), the part of the curve between 3.57× and 1.19× was selected as the plateau (the mean calculated genome size of Arabidopsis arenosa was 447.9 Mb, with a coefficient of variation of 7.9%). This result is similar to that calculated in Example 1, and the estimate is 21% higher than the actual genome size of the species.
Example 3
RESPECT was run directly on the forward (R1) and reverse (R2) reads, without merging the paired-end sequencing data with BBMerge, at three sequencing depths (4×, 2×, and 1×), which are equivalent to 2×, 1×, and 0.5× (the effective depth is halved). These results were compared with those of RESPECT runs at three sequencing depths (2×, 1×, and 0.5×) after the paired-end sequencing data had been merged with BBMerge. The two sets of results were found to be very close in both genome size and genome repeatability. However, if BBMerge is not used to merge the paired-end sequencing data, the required sequencing depth doubles and the sequencing cost increases greatly; merging the paired-end sequencing data with BBMerge therefore further reduces the sequencing cost of the method. The invention needs only 3-5 Gb of data to estimate the genome of an unknown plant species whose genome is smaller than 6 Gb, and most plants have genomes smaller than 6 Gb. Even if the genome of the unknown plant is larger than 6 Gb, only a small increase in the amount of sequencing data is needed. If the DNB-T7 platform manufactured by MGI Tech (华大智造) is used, the cost per sample can be kept within 200 yuan, which is cheaper than flow cytometry. The requirements on the sample are also very low: even a small amount of tissue or degraded specimen material is sufficient.
Table 4. Differences in genome size calculated by RESPECT from single-end and merged data
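The depth arithmetic behind Example 3 can be illustrated with a short Python sketch. It assumes the usual definition of nominal depth (sequenced bases divided by genome size) and ignores data lost during quality filtering and merging; the function names are hypothetical. Under these assumptions, 3-5 Gb of data for a 6 Gb genome corresponds to roughly 0.5× to 0.83×, which sits at the lower end of the 0.5× to 4× operating range mentioned for RESPECT.

```python
def nominal_depth(data_gb, genome_gb):
    """Nominal whole genome sequencing depth: sequenced bases / genome size."""
    if genome_gb <= 0:
        raise ValueError("genome size must be positive")
    return data_gb / genome_gb

def unmerged_depth_for(merged_depth):
    """Unmerged R1/R2 depth that behaved like `merged_depth` in Example 3.

    Example 3 reports that unmerged data at 4x, 2x, and 1x were equivalent
    to merged data at 2x, 1x, and 0.5x, i.e. a factor of two.
    """
    return 2 * merged_depth

# 3-5 Gb of data for a 6 Gb genome (upper genome size named in the text).
low = nominal_depth(3, 6)
high = nominal_depth(5, 6)
unmerged = [unmerged_depth_for(d) for d in (2, 1, 0.5)]
print(low, round(high, 2), unmerged)
```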
Although the foregoing describes some, but not all, embodiments of the invention, it should be understood that other embodiments may be derived from the present embodiments without departing from the spirit and scope of the invention.

Claims (4)

1. A method for estimating plant genome size and/or repeatability based on low depth sequencing, comprising the steps of:
performing low-depth whole-genome second-generation sequencing on a plant of unknown genome size to obtain low-depth sequencing data; the size of the low-depth sequencing data is 3-5 Gb; the sequencing mode of the low-depth whole-genome second-generation sequencing comprises single-end sequencing or paired-end sequencing;
performing quality filtering on the low-depth sequencing data by using BBDuk software to obtain clean sequencing data; the quality filtering includes filtering out adapter sequences and contaminating sequences in the low-depth sequencing data;
when the low-depth whole-genome second-generation sequencing is paired-end sequencing, after the quality filtering, merging the quality-filtered sequencing data by using BBMerge software to obtain merged clean sequencing data;
taking the clean sequencing data or the merged clean sequencing data as data to be processed, setting 5 sampling gradients, and pre-running the RESPECT software on the data to be processed to obtain a first-iteration result, wherein the 5 sampling gradients are 100%, 75%, 50%, 25% and 1%; the RESPECT software has Gurobi built in;
obtaining the initial seed whole genome sequencing depth of the second iteration according to the first-iteration result, wherein the ratios of the whole genome sequencing depths estimated by the RESPECT software in the first iteration are compared with the ratios of the sampling gradients, and the RESPECT-estimated whole genome sequencing depths whose ratios agree are selected as correct estimates;
adjusting the sampling percentage up or down based on the correct estimates, so that the initial seed whole genome sequencing depth of the second iteration is 3×-5×;
setting gradient sampling within a target sequencing depth according to the initial seed whole genome sequencing depth of the second iteration, wherein the target sequencing depth is 0.5×-4×, to obtain sampled data at 11 different sampling gradient depths, the 11 different sampling gradient depths being 100%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, 10% and 5%; performing the second iteration on the sampled data of each sampling gradient depth by using the RESPECT software, and obtaining a graph of estimated whole genome sequencing depth versus genome size when the whole genome sequencing depth and genome size estimated by the RESPECT software from the initial seed are stable across the different sampling gradient depths;
the criteria for stability are: the ratios between the different sampling gradient depths are equal to the ratios between the sequencing depths estimated by the RESPECT software, and the coefficient of variation of the estimated genome size is less than or equal to 10%, wherein the coefficient of variation is calculated from at least 6 calculated genome size values, and the set of values with the smallest coefficient of variation is used to calculate the average;
obtaining the genome size and/or repeatability of the plateau below 4× depth according to the graph of estimated whole genome sequencing depth versus genome size; calculating the average genome size and/or repeatability of the plateau, and taking the average as the final estimated plant genome size and/or repeatability;
the parameter for each run of the RESPECT software is set to 1000 cycles.
2. The method of claim 1, wherein the material of the plant of unknown genome size comprises fresh plant tissue, silica-gel-dried plant material, or degraded plant specimen material.
3. The method of claim 1, wherein the plant comprises Arabidopsis thaliana.
4. The method of claim 3, wherein the Arabidopsis thaliana comprises diploid Arabidopsis thaliana and/or tetraploid Arabidopsis thaliana.
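The seed-depth adjustment and second-round gradient of claim 1 can be sketched in Python as below. This is a hedged illustration only: the function names, the 4× target (chosen here as the midpoint of the claimed 3×-5× window), and the rule of never sampling above 100% are assumptions, not part of the claims. Example 2 itself simply reused the 25% sample at 4.68×, which also falls inside the window.

```python
# The 11 sampling gradients named in claim 1, as fractions of the seed data.
GRADIENTS = [1.00, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]

def seed_fraction(full_depth, target=4.0):
    """Fraction of the full data set that gives a seed depth near `target`.

    `target` is an assumed midpoint of the 3x-5x window; the fraction is
    capped at 1.0 because data cannot be sampled above 100%.
    """
    if full_depth <= 0:
        raise ValueError("depth must be positive")
    return min(1.0, target / full_depth)

def second_round_depths(seed_depth):
    """Nominal depth of each of the 11 gradients for a given seed depth."""
    return [round(seed_depth * g, 3) for g in GRADIENTS]

# Full-data depth judged correct in the first iteration of Example 2.
fraction = seed_fraction(17.13)
seed_depth = 17.13 * fraction
# Gradient depths for the seed actually used in Example 2 (4.68x).
example_depths = second_round_depths(4.68)
print(round(fraction, 3), round(seed_depth, 2), example_depths)
```

These are nominal depths obtained by multiplication; the depths that RESPECT calculates for each subsample may differ, which is why the stability criteria compare the two ratios.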
CN202311367837.5A 2023-10-23 2023-10-23 Method for estimating plant genome size and/or repeatability based on low-depth sequencing Active CN117106875B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311367837.5A CN117106875B (en) 2023-10-23 2023-10-23 Method for estimating plant genome size and/or repeatability based on low-depth sequencing

Publications (2)

Publication Number Publication Date
CN117106875A CN117106875A (en) 2023-11-24
CN117106875B true CN117106875B (en) 2024-02-06

Family

ID=88811287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311367837.5A Active CN117106875B (en) 2023-10-23 2023-10-23 Method for estimating plant genome size and/or repeatability based on low-depth sequencing

Country Status (1)

Country Link
CN (1) CN117106875B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant