CN111411107A - Method for polyploid genome survey - Google Patents

Publication number
CN111411107A
Authority
CN
China
Prior art keywords
genome
polyploid
sample
quality
reads
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010226501.7A
Other languages
Chinese (zh)
Inventor
袁晓辉
刘海平
肖世俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Gooal Gene Technology Co ltd
Original Assignee
Wuhan Gooal Gene Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Gooal Gene Technology Co ltd
Priority to CN202010226501.7A
Publication of CN111411107A
Legal status: Pending

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09: Recombinant DNA-technology
    • C12N15/10: Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1003: Extracting or separating nucleic acids from biological samples, e.g. pure separation or isolation methods; Conditions, buffers or apparatuses therefor
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10: Ploidy or copy number detection
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10: Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30: Data warehousing; Computing architectures

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biotechnology (AREA)
  • Genetics & Genomics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Organic Chemistry (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Evolutionary Biology (AREA)
  • Analytical Chemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Biochemistry (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Microbiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Plant Pathology (AREA)
  • Immunology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a polyploid genome survey method, belonging to the technical field of molecular biology. The method comprises the following steps: extracting and sequencing genomic DNA; performing quality control on the data produced by genome sequencing; evaluating polyploid genome characteristics; and analyzing ploidy change of the polyploid genome. The design of the method takes into account the complex relationships among the genome sequences of polyploid species and includes analysis methods aimed specifically at the genome size, heterozygosity rate and homology rate of polyploid species, so that it can effectively address the high error rate of existing methods when evaluating polyploid genomes. In theory the method can perform survey analysis on any eukaryotic polyploid genome, so it can be widely applied to polyploid genome evaluation and provides an accurate and effective analysis method for estimating polyploid genome size, heterozygosity rate and homology rate, among other applications.

Description

Method for polyploid genome survey
Technical Field
The invention belongs to the technical field of molecular biology, and particularly relates to a polyploid genome survey method.
Background
A genome survey evaluates the genomic characteristics of a species in the absence of a reference genome: second-generation sequencing data are used to count K-mer frequencies, from which the genome size, heterozygosity, proportion of repetitive sequences and similar information are estimated, providing a reference for subsequent de novo genome assembly. Large genomes are today commonly sequenced by the whole genome shotgun (WGS) method, and highly repetitive and highly heterozygous genomes are frequently encountered. Such genomes significantly increase the difficulty of genome assembly, leading to incomplete and fragmented results, and the genome size of the species cannot be accurately predicted from the repeat sequence information obtained by sequencing alone. In addition, when the genome has an extremely high proportion of repetitive and heterozygous sequences, or when the species is polyploid, it is difficult to obtain the desired assembly result by applying an assembly algorithm directly. It is therefore important to assess the genomic characteristics of a species accurately before genome sequencing, in order to formulate an appropriate sequencing and assembly protocol. There are generally three methods for estimating the genome size of a species: first, measuring the total amount of DNA in the cell nucleus with a flow cytometer; second, identifying the number, ploidy and size of metaphase chromosomes under a microscope by karyotype analysis, which gives a relative quantification of chromosome length and size; and third, estimating the genome size and related information by second-generation genome sequencing based on K-mer analysis.
K-mer analysis has become the most widely used technique for acquiring genome information because it is inexpensive, easy to perform, and yields more analysis results.
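As a minimal illustration of the K-mer counting that such an analysis relies on, the Python sketch below counts all K-mers in a set of reads and builds the depth histogram (how many distinct K-mers occur once, twice, and so on). The function names and the toy reads are illustrative assumptions and are not taken from the patent; production tools usually also merge each K-mer with its reverse complement, which is omitted here.

```python
from collections import Counter


def kmer_counts(reads, k=17):
    """Count every K-mer of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip K-mers containing undetermined bases
                counts[kmer] += 1
    return counts


def depth_histogram(counts):
    """Map each depth (occurrence count) to the number of distinct K-mers at that depth."""
    return Counter(counts.values())


# Toy input: two short reads and a small k so the result is easy to inspect.
reads = ["ACGTACGTAC", "CGTACGTACG"]
counts = kmer_counts(reads, k=4)
hist = depth_histogram(counts)
```

On real data the histogram shows peaks whose positions and relative heights carry the information on genome size, heterozygosity and repeat content.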
With the rapid development of high-throughput sequencing technology and the rapid reduction of sequencing cost, the demand for genome sequencing keeps increasing. To meet the demand for sequencing a growing number of polyploid species, increasingly complex genomic features of polyploid species need to be evaluated. However, current genome feature evaluation methods are accurate only for diploid species, and their results for polyploid species often differ greatly from the true values, so a reliable polyploid genome survey method is urgently needed.
Disclosure of Invention
In order to overcome the low accuracy of traditional genome feature evaluation methods when applied to polyploid species, the invention provides a polyploid genome survey method that can accurately evaluate the genome characteristics of polyploid species and improve evaluation accuracy. The specific technical scheme is as follows:
a method for polyploid genome survey, comprising the following steps:
step 1, extracting and sequencing genomic DNA:
(1) selecting muscle tissue of a biological individual, placing 5-20 mg of the tissue in a 2 mL centrifuge tube, mincing it with surgical scissors, and then disrupting it with a homogenizer;
(2) adding 400 µL of ACL solution and 20 µL of proteinase K;
(3) shaking for 1-2 min to mix thoroughly, then incubating at 55 ℃ for 3 h, removing and mixing the sample every half hour to ensure complete lysis; a completely lysed sample is clear and transparent;
(4) removing the sample, cooling it to room temperature, and then shaking gently to mix;
(5) sequentially adding 300 µL of Ext solution and 300 µL of AB solution to the treated sample, shaking vigorously, and then centrifuging so that the solution separates into layers, wherein the upper layer is a blue extraction layer, the lower layer is a transparent aqueous phase, a partial precipitate layer forms between the two, and the DNA is in the lower aqueous phase;
(6) passing the pipette tip through the upper layer into the lower layer, carefully aspirating the lower-layer solution and transferring it to the adsorption column, avoiding the upper layer and the intermediate layer as far as possible;
(7) centrifuging the adsorption column, then removing it and discarding the waste liquid in the collection tube;
(8) placing the adsorption column back into the collection tube, adding 500 µL of rinsing solution, and centrifuging;
(9) repeating step (8) once;
(10) removing the adsorption column, discarding the waste liquid in the collection tube, placing the adsorption column back into the collection tube, and centrifuging to remove the residual rinsing solution;
(11) placing the adsorption column in a new, clean centrifuge tube, adding 50-100 µL of elution buffer to the center of the column, standing at room temperature for 2-3 min, and then centrifuging; the liquid in the centrifuge tube is the genomic DNA, and the sample is stored at -4 ℃ or -20 ℃;
(12) accurately measuring the DNA concentration and OD ratios (A260/A280 and A260/A230) of the sample with a NanoDrop-1000 spectrophotometer; the sample DNA concentration is required to be greater than 100 ng/µL, the sample volume at least 60 µL, the A260/A280 value 1.8-2.0, and the A260/A230 value 2.0-2.4;
(13) performing high-throughput sequencing of the samples that pass quality control on an Illumina HiSeq 2000 platform;
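The acceptance thresholds in step 1(12) can be expressed as a simple check. The following Python sketch is only an illustration; the function name and argument names are hypothetical, and the numeric limits are the ones stated in step 1(12).

```python
def sample_passes_qc(conc_ng_per_ul, volume_ul, a260_a280, a260_a230):
    """Return True if a DNA sample meets the thresholds listed in step 1(12)."""
    return (
        conc_ng_per_ul > 100            # concentration greater than 100 ng/µL
        and volume_ul >= 60             # at least 60 µL of sample
        and 1.8 <= a260_a280 <= 2.0     # A260/A280 within 1.8-2.0
        and 2.0 <= a260_a230 <= 2.4     # A260/A230 within 2.0-2.4
    )


# Hypothetical measurements for two samples.
good_sample = sample_passes_qc(150, 80, 1.8, 2.4)
dilute_sample = sample_passes_qc(90, 80, 1.9, 2.2)
```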
step 2, quality control of genome sequencing output data:
quality control of the second-generation high-throughput sequencing data of the polyploid genome: evaluating the quality of the sequencing data with FastQC software to ensure that the data meet the requirements of subsequent analysis;
filtering low-quality reads with Trimmomatic software, namely: read pairs with an adapter at either end; read pairs in which the undetermined bases (N) in either read exceed 10% of the read length; and read pairs in which the low-quality bases in either read exceed 50% of the bases in that read;
randomly drawing 10,000 read pairs and aligning them against the NT database (Nucleotide Sequence Database) with BLAST software to ensure that the sequenced sample is not obviously contaminated;
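The two base-composition rules of step 2 can be sketched in Python as follows. This is an illustration of the stated rules only, not the Trimmomatic implementation: the adapter rule is not covered, the function names are hypothetical, and the low-quality cutoff of 5 is taken from the embodiments later in the description.

```python
def read_fails(seq, quals, max_n_frac=0.10, low_q=5, max_low_frac=0.50):
    """Return True if a single read violates the N-content or low-quality rule."""
    if not seq:
        return True  # treat an empty read as failing
    n_frac = seq.upper().count("N") / len(seq)
    low_frac = sum(q <= low_q for q in quals) / len(quals)
    return n_frac > max_n_frac or low_frac > max_low_frac


def keep_pair(read1, read2):
    """Keep a read pair only if neither read fails; each read is (sequence, qualities)."""
    return not (read_fails(*read1) or read_fails(*read2))


# Toy reads: one clean, one with 15% N bases, one with 55% low-quality bases.
clean = ("ACGT" * 5, [30] * 20)
n_rich = ("NNN" + "A" * 17, [30] * 20)
low_quality = ("A" * 20, [2] * 11 + [30] * 9)
```

A pair is discarded as a whole when either mate fails, which mirrors the pair-level wording of the rules.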
step 3, evaluating the characteristics of the polyploid genome:
K-mer analysis based on high-quality genome sequencing data of the polyploid species: using the high-quality sequencing data obtained by the quality control in step 2 and the parameter K = 17, calculating the number of K-mers and the K-mer frequency distribution, and then, according to the formula
Figure BDA0002427843120000031
calculating the genome size of the polyploid species, wherein $G$: the estimated genome size; $N_{\mathrm{base}}$: the total number of bases in the high-quality sequencing data; $L$: the read length; $K$: the parameter K of the K-mer; $C_{\mathrm{Kmer}}$: the K-mer depth corresponding to the polyploid main peak;
meanwhile, K-mers with a depth of less than 2 are regarded as errors, and the genome size is corrected with the formula $G_{\mathrm{revise}} = G \times (1 - \mathrm{ErrorRate})$, wherein $G_{\mathrm{revise}}$: the corrected genome size; $\mathrm{ErrorRate}$: the error rate;
by the formula
Figure BDA0002427843120000032
calculating the heterozygosity of the polyploid species genome, wherein $\varphi$: the estimated genome heterozygosity; $a_{1/2}$: the proportion of K-mers located at one-half of the depth of the polyploid main peak;
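The formula images of step 3 are not reproduced in this text. The Python sketch below therefore assumes the commonly used K-mer relation, genome size = total K-mers / main-peak depth with total K-mers = N_base × (L − K + 1) / L, which uses the same symbols as the definitions above; this is an assumption and may differ from the patent's exact formula. The correction G_revise = G × (1 − ErrorRate) is taken from the text. The heterozygosity formula is not reconstructed, and all function names are hypothetical.

```python
def total_kmers(n_base, read_len, k):
    """Number of K-mers contributed by n_base bases in reads of uniform length read_len."""
    n_reads = n_base / read_len
    return n_reads * (read_len - k + 1)


def genome_size(n_kmers, main_peak_depth):
    """Assumed relation: genome size = total K-mers / depth of the main peak."""
    return n_kmers / main_peak_depth


def corrected_genome_size(g, error_rate):
    """Correction stated in the text: G_revise = G * (1 - ErrorRate)."""
    return g * (1 - error_rate)


# Toy numbers: 1,000 bases in 100 bp reads, K = 17, main peak at depth 42.
n_kmers = total_kmers(1000, 100, 17)
g = genome_size(n_kmers, 42)
g_revise = corrected_genome_size(g, 0.05)
```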
step 4, polyploid genome ploidy change analysis:
genome ploidy change analysis based on an introgression model: using the introgression model formula
Figure BDA0002427843120000033
calculating the genome homology rate and doubling rate of the polyploid species, wherein M is the total number of K-mers in the genome, N is the total number of K-mers in repetitive regions, α is the proportion of homologous regions, 1-α is the doubling rate, β is the proportion of diploid repetitive sequences, and K is the genome heterozygosity rate;
in step 1(5), centrifugation is performed at 12000 rpm for 5-6 min;
in step 1(7), centrifugation is performed at 8000 rpm for 1-2 min;
in step 1(8), centrifugation is performed at 8000 rpm for 1-2 min;
in step 1(10), centrifugation is performed at 12000 rpm for 1-2 min at room temperature;
in step 1(11), centrifugation is performed at 12000 rpm for 1-2 min at room temperature.
Compared with the prior art, the polyploid genome survey method has the following beneficial effects:
First, the design of the method takes into account the complex relationships among the genome sequences of polyploid species and includes analysis methods aimed specifically at the genome size and heterozygosity rate of polyploid species, which effectively addresses the low accuracy of existing methods when evaluating polyploid genomes;
Second, the invention proposes and constructs an introgression model of the polyploid genome. Based on the genome ploidy change analysis using the introgression model in step 4, the homology rate and the doubling rate of the polyploid genome can be calculated, filling a gap in the results available from existing analysis methods;
Third, the method can in theory perform survey analysis on any eukaryotic polyploid genome, so it can be widely applied to polyploid genome evaluation and provides an accurate and effective analysis method for estimating polyploid genome size, heterozygosity rate and homology rate, among other applications.
Compared with the prior art, the method improves the evaluation accuracy by 90 percent.
Drawings
FIG. 1 is a schematic diagram of the K-mer frequency distribution of the tetraploid genome in example 1 of the present invention, wherein: 1, whole-genome heterozygous peak; 2, diploid main peak; 3, tetraploid heterozygous peak; 4, tetraploid main peak;
FIG. 2 is a schematic diagram of the tetraploid gene introgression process in example 1 of the present invention, wherein: A, ancestral chromosome; A', chromosome in which introgression is occurring; B, chromosome in which introgression is complete;
FIG. 3 is a schematic structural diagram of the tetraploid gene introgression model in example 1 of the present invention.
Detailed description of the invention
The invention will be further described with reference to specific embodiments and figures 1-3, but the invention is not limited to these embodiments.
In the embodiments, the tetraploid loach genome is used: a suitable loach individual is selected and its muscle is taken for genomic DNA extraction, using the cell/tissue genomic DNA extraction kit (GK0122) of Shanghai Czeri bioengineering GmbH.
Example 1
A method for polyploid genome survey, comprising the following steps:
step 1, extracting and sequencing genomic DNA:
(1) placing 10 mg of tissue in a 2 mL centrifuge tube, mincing it with surgical scissors (sterilized with alcohol), and then disrupting it with a homogenizer;
(2) adding 400 µL of ACL solution and 20 µL of proteinase K;
(3) shaking to mix for 1 minute, then incubating at 55 ℃ for 3 hours, removing and mixing the sample every half hour to promote complete lysis; a completely lysed sample is clear and transparent;
(4) removing the sample, allowing it to cool to room temperature, and shaking gently to mix;
(5) sequentially adding 300 µL of Ext solution and 300 µL of AB solution to the treated sample, shaking vigorously, and then centrifuging at 12000 rpm for 5 min. The solution separates into layers: the upper layer is a blue extraction layer, the lower layer is a transparent aqueous phase, a partial precipitate layer may be present between the two, and the DNA is in the lower aqueous phase;
(6) passing the pipette tip through the upper layer into the lower layer, carefully aspirating the lower-layer solution and transferring it to the adsorption column, avoiding the upper layer and the intermediate layer as far as possible;
(7) centrifuging at 8000 rpm for 1 min, removing the adsorption column, and discarding the waste liquid in the collection tube;
(8) placing the adsorption column back into the collection tube, adding 500 µL of rinsing solution, and centrifuging at 8000 rpm at room temperature for 1 min;
(9) repeating step (8) once;
(10) removing the adsorption column and discarding the waste liquid in the collection tube; placing the adsorption column back into the collection tube and centrifuging at 12000 rpm at room temperature for 1 min to remove the residual rinsing solution;
(11) placing the adsorption column in a new, clean 1.5 mL centrifuge tube, adding 80 µL of elution buffer to the center of the column, standing at room temperature for 2 min, and then centrifuging at 12000 rpm at room temperature for 1 min. The liquid in the centrifuge tube is the genomic DNA, and the sample can be stored at -4 ℃ or -20 ℃;
(12) accurately measuring the DNA concentration and OD ratios (A260/A280 and A260/A230) of the sample with a NanoDrop-1000 spectrophotometer; the sample DNA concentration is required to be greater than 100 ng/µL and the sample volume at least 60 µL; the A260/A280 value is 1.8 and the A260/A230 value is 2.4;
(13) performing high-throughput sequencing of the samples that pass quality control on an Illumina HiSeq 2000 platform;
step 2, quality control of genome sequencing output data:
the specific steps of high-throughput sequencing data quality control are as follows:
(1) acquiring the raw sequencing data and evaluating their quality with FastQC software to ensure that the data meet the analysis requirements for second-generation sequencing data;
(2) filtering the raw data with Trimmomatic software according to the following rules: read pairs containing an adapter sequence at either end are removed; a read pair is removed when the undetermined bases (N) in either read exceed 10% of the read length; a read pair is removed when the number of low-quality bases (quality value of 5 or less) in either read exceeds 50% of the read length. The parameters used are: -minlen 50 -slinwin 4 -slinqsum 15 -heading 3 -trailing 3;
(3) randomly drawing 10,000 read pairs and aligning them against the NT database (Nucleotide Sequence Database) with the blastn tool to confirm that the most frequently hit sequences come from species closely related to the loach, so as to rule out sample contamination;
(4) after quality control, 37 Gb of high-quality sequencing data are obtained;
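Drawing 10,000 read pairs at random, as in step 2(3), can be done in a single pass over the data with reservoir sampling. The Python sketch below shows one possible way to do this and is not described in the patent; the function name, the fixed seed and the synthetic read identifiers are assumptions. The sampled pairs would then be written out and aligned with blastn.

```python
import random


def sample_pairs(pairs, n=10_000, seed=0):
    """Reservoir-sample n read pairs from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, pair in enumerate(pairs):
        if i < n:
            reservoir.append(pair)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive bounds
            if j < n:
                reservoir[j] = pair     # replace with probability n / (i + 1)
    return reservoir


# Synthetic read-pair identifiers stand in for records parsed from paired FASTQ files.
all_pairs = ((f"read{i}/1", f"read{i}/2") for i in range(50_000))
subset = sample_pairs(all_pairs, n=10_000)
```

Reservoir sampling avoids loading the full data set (37 Gb in this example) into memory.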
step 3, evaluating the characteristics of the polyploid genome:
genome characteristics are evaluated using the quality-controlled, high-quality sequencing data, as follows:
(1) the data obtained in step 2(4) are analyzed with the K-mer analysis software GCE using the parameters -m 1 -D 8 -b 0 -H 1, giving a total of 250,217,368,293 K-mers and the K-mer frequency distribution; the GC content of the genome is 40.98%;
(2) as shown in FIG. 1, from the K-mer distribution characteristics of the tetraploid genome combined with the K-mer frequency distribution, the K-mer depth corresponding to the polyploid main peak is calculated to be 92;
(3) using the genome size formula
Figure BDA0002427843120000051
the tetraploid loach genome size is obtained as 2,719.8 Mb;
(4) using the error correction formula $G_{\mathrm{revise}} = G \times (1 - \mathrm{ErrorRate})$, the corrected tetraploid loach genome size is calculated to be 2,632.9 Mb;
(5) according to the K-mer frequency distribution and the heterozygosity estimation formula
Figure BDA0002427843120000052
the heterozygosity of the tetraploid loach genome is obtained as 1.15%;
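The size reported in step 3(3) can be checked with simple arithmetic, assuming that the genome size is the total K-mer count divided by the main-peak depth (an assumption, since the formula image is not reproduced here). The error rate is not stated in the text, so the value below is back-calculated from the two reported sizes rather than taken from the patent.

```python
# Values reported in Example 1, step 3.
n_kmers = 250_217_368_293   # total K-mers from step 3(1)
main_peak_depth = 92        # polyploid main-peak depth from step 3(2)

# Assumed relation: genome size = total K-mers / main-peak depth.
genome_size_mb = n_kmers / main_peak_depth / 1e6
print(f"{genome_size_mb:.1f} Mb")   # 2719.8 Mb

# ErrorRate is not given in the text; infer it from the reported sizes.
implied_error_rate = 1 - 2632.9 / 2719.8
print(f"{implied_error_rate:.1%}")  # about 3.2%
```

Under this assumption the computed size agrees with the 2,719.8 Mb reported in step 3(3), and the correction in step 3(4) corresponds to an error rate of roughly 3.2%.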
step 4, polyploid genome ploidy change analysis:
diploidization of the tetraploid genome is simulated by gradual regional replacement, random mutations are introduced on the basis of the replacement to simulate a heterozygosity rate of 1%, and a diploidization introgression model of the tetraploid genome is constructed as shown in FIGS. 2-3; according to the introgression model formula
Figure BDA0002427843120000061
The homology rate is 67.1%, and the doubling rate is 32.9%.
Example 2
A method for polyploid genome survey, comprising the following steps:
step 1, extracting and sequencing genomic DNA:
(1) placing 7 mg of tissue in a 2 mL centrifuge tube, mincing it with surgical scissors (sterilized with alcohol), and then disrupting it with a homogenizer;
(2) adding 400 µL of ACL solution and 20 µL of proteinase K;
(3) shaking to mix for 1 minute, then incubating at 55 ℃ for 3 hours, removing and mixing the sample every half hour to promote complete lysis; a completely lysed sample is clear and transparent;
(4) removing the sample, allowing it to cool to room temperature, and shaking gently to mix;
(5) sequentially adding 300 µL of Ext solution and 300 µL of AB solution to the treated sample, shaking vigorously, and then centrifuging at 12000 rpm for 5 min. The solution separates into layers: the upper layer is a blue extraction layer, the lower layer is a transparent aqueous phase, a partial precipitate layer may be present between the two, and the DNA is in the lower aqueous phase;
(6) passing the pipette tip through the upper layer into the lower layer, carefully aspirating the lower-layer solution and transferring it to the adsorption column, avoiding the upper layer and the intermediate layer as far as possible;
(7) centrifuging at 8000 rpm for 1 min, removing the adsorption column, and discarding the waste liquid in the collection tube;
(8) placing the adsorption column back into the collection tube, adding 500 µL of rinsing solution, and centrifuging at 8000 rpm at room temperature for 1 min;
(9) repeating step (8) once;
(10) removing the adsorption column and discarding the waste liquid in the collection tube; placing the adsorption column back into the collection tube and centrifuging at 12000 rpm at room temperature for 1 min to remove the residual rinsing solution;
(11) placing the adsorption column in a new, clean 1.5 mL centrifuge tube, adding 50 µL of elution buffer to the center of the column, standing at room temperature for 2 min, and then centrifuging at 12000 rpm at room temperature for 1 min. The liquid in the centrifuge tube is the genomic DNA, and the sample can be stored at -4 ℃ or -20 ℃;
(12) accurately measuring the DNA concentration and OD ratios (A260/A280 and A260/A230) of the sample with a NanoDrop-1000 spectrophotometer; the sample DNA concentration is required to be greater than 100 ng/µL and the sample volume at least 60 µL; the A260/A280 value is 1.9 and the A260/A230 value is 2.3;
(13) performing high-throughput sequencing of the samples that pass quality control on an Illumina HiSeq 2000 platform;
step 2, quality control of genome sequencing output data:
the specific steps of high-throughput sequencing data quality control are as follows:
(1) acquiring the raw sequencing data and evaluating their quality with FastQC software to ensure that the data meet the analysis requirements for second-generation sequencing data;
(2) filtering the raw data with Trimmomatic software according to the following rules: read pairs containing an adapter sequence at either end are removed; a read pair is removed when the undetermined bases (N) in either read exceed 10% of the read length; a read pair is removed when the number of low-quality bases (quality value of 5 or less) in either read exceeds 50% of the read length. The parameters used are: -minlen 50 -slinwin 4 -slinqsum 15 -heading 3 -trailing 3;
(3) randomly drawing 10,000 read pairs and aligning them against the NT database (Nucleotide Sequence Database) with the blastn tool to confirm that the most frequently hit sequences come from species closely related to the loach, so as to rule out sample contamination;
(4) after quality control, 37 Gb of high-quality sequencing data are obtained;
step 3, evaluating the characteristics of the polyploid genome:
genome characteristics are evaluated using the quality-controlled, high-quality sequencing data, as follows:
(1) the data obtained in step 2(4) are analyzed with the K-mer analysis software GCE using the parameters -m 1 -D 8 -b 0 -H 1, giving a total of 255,192,472,392 K-mers and the K-mer frequency distribution; the GC content of the genome is 40.53%;
(2) from the K-mer distribution characteristics of the tetraploid genome combined with the K-mer frequency distribution, the K-mer depth corresponding to the polyploid main peak is calculated to be 93;
(3) using the genome size formula
Figure BDA0002427843120000071
the tetraploid loach genome size is obtained as 2,744.1 Mb;
(4) using the error correction formula $G_{\mathrm{revise}} = G \times (1 - \mathrm{ErrorRate})$, the corrected tetraploid loach genome size is calculated to be 2,653.3 Mb;
(5) according to the K-mer frequency distribution and the heterozygosity estimation formula
Figure BDA0002427843120000072
the heterozygosity of the tetraploid loach genome is obtained as 1.17%;
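Step 3(2) reports a main-peak depth of 93 but does not describe how the peak is located. One simple possibility, shown here purely as an assumption, is to search the K-mer depth histogram for local maxima while ignoring the low-depth region dominated by sequencing errors, and to take the highest-depth peak as the tetraploid main peak, following the peak order listed for FIG. 1. The function name, window size and histogram values are hypothetical.

```python
def local_peaks(hist, min_depth=2, window=3):
    """Return depths whose K-mer count exceeds every neighbour within `window`."""
    peaks = []
    for depth in sorted(hist):
        if depth < min_depth:
            continue  # skip the error-dominated low-depth region
        neighbours = [
            hist.get(depth + offset, 0)
            for offset in range(-window, window + 1)
            if offset != 0
        ]
        if all(hist[depth] > count for count in neighbours):
            peaks.append(depth)
    return peaks


# Hypothetical histogram (depth -> number of distinct K-mers) with four peaks.
toy_hist = {
    1: 900, 2: 50,
    22: 60, 23: 80, 24: 55,
    45: 150, 46: 200, 47: 140,
    68: 90, 69: 120, 70: 85,
    92: 300, 93: 400, 94: 310,
}
peaks = local_peaks(toy_hist)
main_peak_depth = peaks[-1]  # assumed: the highest-depth peak is the main peak
```

Real histograms are noisier than this toy example, so smoothing or model fitting, as performed by dedicated tools such as GCE, would normally be preferred.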
step 4, polyploid genome ploidy change analysis:
diploidization of the tetraploid genome is simulated by gradual regional replacement, random mutations are introduced on the basis of the replacement to simulate a heterozygosity rate of 1%, and a diploidization introgression model of the tetraploid genome is constructed; according to the introgression model formula
Figure BDA0002427843120000073
The homology rate is 67.3%, and the doubling rate is 32.7%.
Example 3
A method for polyploid genome surfey, which comprises the following steps:
step 1, extracting and sequencing genome DNA:
(1) placing 10 mg of tissue into a 2 ml centrifuge tube, cutting it into pieces with surgical scissors (sterilized with alcohol), and disrupting it with a homogenizer;
(2) adding 400 ul of ACL solution and 20 ul of proteinase K;
(3) shaking and mixing for 2 minutes, then incubating at 55 ℃ for 3 hours, taking the sample out and mixing it every half hour during this period to promote complete lysis; a completely lysed sample is clear and transparent;
(4) taking out the sample and, once it has cooled to room temperature, shaking it gently to mix;
(5) adding 300 ul of Ext solution and 300 ul of AB solution in sequence to the treated sample, shaking vigorously, and then centrifuging at 12000 rpm for 6 min. The solution separates into layers: the upper layer is a blue extraction layer, the lower layer is a transparent aqueous phase, a partial precipitate layer may exist between the two, and the DNA is in the lower aqueous phase;
(6) passing the pipette tip through the upper layer into the lower layer, and carefully transferring the lower layer solution into the adsorption column, avoiding as far as possible aspirating the upper layer and the intermediate layer;
(7) centrifuging at 8000 rpm for 2 min, removing the adsorption column, and pouring off the waste liquid in the collection tube;
(8) putting the adsorption column back into the collection tube, adding 500 ul of rinsing liquid, and centrifuging at 8000 rpm at room temperature for 2 min;
(9) repeating step (8) once;
(10) removing the adsorption column and discarding the waste liquid in the collection tube; placing the adsorption column back into the collection tube and centrifuging at 12000 rpm at room temperature for 2 min to remove residual rinsing liquid;
(11) placing the column in a fresh, clean 1.5 ml centrifuge tube, adding 100 ul of elution buffer to the center of the column, leaving it at room temperature for 3 min, and then centrifuging at 12000 rpm at room temperature for 2 min. The liquid in the centrifuge tube is the genomic DNA, and the sample can be stored at -4 ℃ or -20 ℃;
(12) accurately measuring the DNA concentration and OD ratios (A260/A280 and A260/A230) of the sample with a NanoDrop-1000 spectrophotometer; the sample DNA concentration is required to be greater than 100 ng/ul, the sample DNA volume at least 60 ul, the A260/A280 value 2.0, and the A260/A230 value 2.0;
(13) performing high-throughput sequencing on samples that pass quality control using the Illumina HiSeq 2000 platform;
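The acceptance criteria of step (12) can be written as a small check. The following Python sketch is illustrative only: the function name, its arguments and the default ratio windows are assumptions. The windows are taken from the ranges recited in claim 1 (A260/A280 of 1.8-2.0 and A260/A230 of 2.0-2.4) rather than from the single value of 2.0 given in this embodiment.

```python
def sample_passes_qc(conc_ng_per_ul, volume_ul, a260_280, a260_230,
                     a260_280_range=(1.8, 2.0), a260_230_range=(2.0, 2.4)):
    """Return (passed, reasons) for one extracted DNA sample.

    Thresholds follow the text: concentration > 100 ng/ul and volume >= 60 ul.
    The ratio windows are assumed defaults borrowed from claim 1.
    """
    reasons = []
    if conc_ng_per_ul <= 100:
        reasons.append("concentration must exceed 100 ng/ul")
    if volume_ul < 60:
        reasons.append("volume must be at least 60 ul")
    low, high = a260_280_range
    if not low <= a260_280 <= high:
        reasons.append("A260/A280 outside accepted range")
    low, high = a260_230_range
    if not low <= a260_230 <= high:
        reasons.append("A260/A230 outside accepted range")
    return (not reasons, reasons)


# Hypothetical readings, not data from the patent.
passed, reasons = sample_passes_qc(150.0, 80.0, 2.0, 2.0)
print(passed, reasons)
```

Returning the list of failed criteria alongside the boolean makes it easier to see why a sample was rejected before sequencing.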
step 2, genome sequencing output data quality control:
the specific steps of high-throughput sequencing data quality control are as follows:
(1) acquiring the raw sequencing data and evaluating their quality with FastQC software to ensure that the data meet the requirements of second-generation sequencing data analysis;
(2) filtering the raw data with Trimmomatic software according to the following rules: a read pair is discarded if either read contains an adapter sequence; a read pair is discarded if undetermined bases (N) in either read exceed 10% of the read length; a read pair is discarded if low-quality bases (quality value of 5 or less) in either read exceed 50% of the read length. The parameters used are: -minlen 50 -slinwin 4 -slinqsum 15 -heading 3 -trailing 3;
(3) randomly drawing 10,000 read pairs and aligning them to the NT database (Nucleotide Sequence Database) with the blastn tool, to confirm that the most frequently hit sequences come from species closely related to the loach, thereby ruling out sample contamination;
(4) obtaining 37 Gb of high-quality sequencing data after quality control;
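The three filtering rules of step (2) can be illustrated with a short Python sketch. It is a standalone illustration of the rules as stated, not Trimmomatic itself and not its command line. The function names, the Phred+33 quality encoding and the example adapter string are assumptions introduced here.

```python
def fraction_n(seq):
    """Fraction of undetermined bases (N) in a read; an empty read counts as all N."""
    return seq.upper().count("N") / len(seq) if seq else 1.0


def fraction_low_quality(qual, max_low_q=5, phred_offset=33):
    """Fraction of bases whose quality value is max_low_q or less (Phred+33 assumed)."""
    if not qual:
        return 1.0
    low = sum(1 for ch in qual if ord(ch) - phred_offset <= max_low_q)
    return low / len(qual)


def keep_pair(read1, read2, adapters=()):
    """Return True if a read pair survives all three rules.

    Each read is a (sequence, quality_string) tuple. The pair is dropped
    as soon as either read violates any rule.
    """
    for seq, qual in (read1, read2):
        if any(adapter in seq for adapter in adapters):
            return False          # rule 1: adapter sequence present
        if fraction_n(seq) > 0.10:
            return False          # rule 2: more than 10% N
        if fraction_low_quality(qual) > 0.50:
            return False          # rule 3: more than 50% low-quality bases
    return True


# Illustrative adapter string; the patent does not name the adapter used.
EXAMPLE_ADAPTER = "AGATCGGAAGAGC"
good = ("ACGTACGTAC", "IIIIIIIIII")
n_rich = ("ANNTACGTAC", "IIIIIIIIII")
print(keep_pair(good, good, adapters=(EXAMPLE_ADAPTER,)))
print(keep_pair(good, n_rich, adapters=(EXAMPLE_ADAPTER,)))
```

Filtering at the level of the pair, rather than the single read, keeps the two files of a paired-end run synchronized, which matches the wording of the rules.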
step 3, evaluating the characteristics of the polyploid genome:
the genome characteristic evaluation is carried out by using quality-controlled high-quality sequencing data, and the specific method comprises the following steps:
(1) analyzing the data obtained in step 2(4) with the K-mer analysis software GCE, using the parameters -m 1 -D 8 -b 0 -H 1, to obtain the total number of K-mers (253,936,781,425) and the K-mer frequency distribution; the GC content of the genome is 40.75%;
(2) from the K-mer distribution characteristics of the tetraploid genome and the K-mer frequency distribution, calculating the K-mer depth corresponding to the polyploid main peak to be 93;
(3) using the genome size calculation formula
[formula image BDA0002427843120000091, not reproduced]
obtaining a tetraploid loach genome size of 2722.1 Mb;
(4) using the error correction formula G_revise = G × (1 - ErrorRate), calculating the corrected tetraploid loach genome size to be 2639.7 Mb;
(5) using the heterozygosity estimation formula based on the K-mer frequency distribution
[formula image BDA0002427843120000092, not reproduced]
obtaining a tetraploid loach genome heterozygosity of 1.16%;
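The K-mer based size estimate of this step can be sketched in Python. The formula images are not reproduced on this page, so the sketch relies on assumptions: it uses the commonly used relation between total K-mer count and peak depth, and its base-count form is written with the symbols defined in claim 1 (N_base, L, K, C_Kmer). It takes the most populated depth as the main peak, which is a simplification for a polyploid spectrum. It ignores depths below 2, which is how the error rule of claim 1 is read here. All function names are hypothetical.

```python
from collections import Counter


def kmer_depth_histogram(reads, k=17):
    """Count every K-mer, then tally how many distinct K-mers occur at each depth."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def main_peak_depth(histogram, min_depth=2):
    """Depth with the most distinct K-mers, ignoring depths below min_depth (assumed errors)."""
    candidates = {d: n for d, n in histogram.items() if d >= min_depth}
    return max(candidates, key=candidates.get)


def genome_size_from_kmers(total_kmers, peak_depth):
    """Assumed relation: G = total K-mer count / main peak depth."""
    return total_kmers / peak_depth


def genome_size_from_bases(n_base, read_len, k, peak_depth):
    """Same assumed relation written with N_base, L, K and C_Kmer."""
    return n_base * (read_len - k + 1) / (read_len * peak_depth)


def revise_genome_size(g, error_rate):
    """Correction stated in the text: G_revise = G x (1 - ErrorRate)."""
    return g * (1.0 - error_rate)


# Error rate implied by the two sizes reported in the text (roughly 3%).
implied_error_rate = 1.0 - 2639.7 / 2722.1
print(implied_error_rate)
print(revise_genome_size(2722.1, implied_error_rate))
```

The two size functions agree whenever the total K-mer count equals N_base × (L - K + 1) / L, which holds when all reads share the same length L.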
step 4, polyploid genome ploidy change analysis:
Diploidization of the tetraploid genome is simulated by a stepwise regional replacement method, random mutations are introduced on top of the replacement to simulate a heterozygosity rate of 1%, and a tetraploid genome diploidization introgression model is thereby constructed. According to the introgression model formula
[formula image BDA0002427843120000093, not reproduced]
the homology rate is 67.8% and the doubling (diploidization) rate is 32.2%.
According to the embodiment of the invention, the results all meet expectations, and compared with the prior art the evaluation accuracy of the method is improved by 90%.

Claims (6)

1. A method of polyploid genome survey, comprising the following steps:
step 1, extracting and sequencing genome DNA:
(1) selecting muscle tissue of a biological individual, placing 5-20 mg of the tissue into a 2 ml centrifuge tube, cutting it into pieces with surgical scissors, and then disrupting it with a homogenizer;
(2) adding 400 ul of ACL solution and 20 ul of proteinase K;
(3) shaking for 1-2 min to mix, then incubating at 55 ℃ for 3 h, taking the sample out and mixing it every half hour during this period so that it lyses fully, a completely lysed sample being clear and transparent;
(4) taking out the sample, cooling it to room temperature, and then shaking it gently to mix;
(5) sequentially adding 300 ul of Ext solution and 300 ul of AB solution to the treated sample, shaking vigorously, and then centrifuging so that the solution separates into layers, wherein the upper layer is a blue extraction layer, the lower layer is a transparent aqueous phase, a partial precipitate layer forms between the two layers, and the DNA is in the lower aqueous phase;
(6) passing the pipette tip through the upper layer into the lower layer, and carefully transferring the lower layer solution into the adsorption column, avoiding as far as possible aspirating the upper layer and the intermediate layer;
(7) centrifuging the adsorption column in a centrifuge, then removing the adsorption column and pouring off the waste liquid in the collection tube;
(8) putting the adsorption column back into the collection tube, adding 500 ul of rinsing liquid, and centrifuging;
(9) repeating step (8) once;
(10) removing the adsorption column, discarding the waste liquid in the collection tube, putting the adsorption column back into the collection tube, and centrifuging to remove residual rinsing liquid;
(11) placing the column in a new, clean centrifuge tube, adding 50-100 ul of elution buffer to the center of the column, leaving it at room temperature for 2-3 min, and then centrifuging, the liquid in the centrifuge tube being the genomic DNA, and storing the sample at -4 ℃ or -20 ℃;
(12) accurately measuring the DNA concentration and OD ratios of the sample, namely A260/A280 and A260/A230, with a NanoDrop-1000 spectrophotometer; the sample DNA concentration is required to be greater than 100 ng/ul, the sample DNA volume at least 60 ul, the A260/A280 value 1.8-2.0, and the A260/A230 value 2.0-2.4;
(13) performing high-throughput sequencing on samples that pass quality control using the Illumina HiSeq 2000 platform;
step 2, genome sequencing output data quality control:
quality control of second-generation high-throughput sequencing data of the polyploid genome: evaluating the quality of the sequencing data with FastQC software to ensure that the data meet the requirements of subsequent analysis;
filtering low-quality reads with Trimmomatic software, namely: paired-end reads with an adapter at either end; paired-end reads in which undetermined bases in either read exceed 10% of the read length; and paired-end reads in which low-quality bases in either read exceed 50% of the bases of that read;
randomly drawing 10000 pairs of paired-end reads and aligning them against the NT database (Nucleotide Sequence Database) with blast software, to ensure that the sequenced sample is not obviously contaminated;
step 3, evaluating the characteristics of the polyploid genome:
K-mer analysis based on high-quality genome sequencing data of the polyploid species: using the high-quality sequencing data obtained by the quality control method of step 2 and the parameter K = 17, calculating the number of K-mers and the K-mer frequency distribution, and, according to the formula
[formula image FDA0002427843110000021, not reproduced]
calculating the genome size of the polyploid species, wherein G: estimated genome size; N_base: total number of bases in the high-quality sequencing data; L: read length; K: the parameter K of the K-mer; C_Kmer: depth of the K-mer corresponding to the polyploid main peak;
meanwhile, regarding K-mers with a depth of less than 2 as errors, and performing genome size correction using the formula G_revise = G × (1 - ErrorRate), wherein G_revise: corrected genome size; ErrorRate: error rate;
by the formula
[formula image FDA0002427843110000022, not reproduced]
calculating the heterozygosity of the polyploid species genome, wherein φ: estimated genomic heterozygosity; a_1/2: proportion of K-mers located at one-half of the polyploid main peak depth;
step 4, polyploid genome ploidy change analysis:
genome ploidy change analysis based on the introgression model: using the introgression model formula
[formula image FDA0002427843110000023, not reproduced]
calculating the genome homology rate and doubling rate of the polyploid species, wherein M is the total number of genome K-mers, N is the total number of K-mers in repeat regions, α is the proportion of homologous regions, 1-α is the doubling rate, β is the proportion of diploid repeat sequences, and K is the genome heterozygosity rate.
2. The method of polyploid genome survey according to claim 1, wherein in step 1(5) the centrifugation is carried out at 12000 rpm for 5-6 min.
3. The method of polyploid genome survey according to claim 1, wherein in step 1(7) the centrifugation is carried out at 8000 rpm for 1-2 min.
4. The method of polyploid genome survey according to claim 1, wherein in step 1(8) the centrifugation is carried out at 8000 rpm for 1-2 min.
5. The method of polyploid genome survey according to claim 1, wherein in step 1(10) the centrifugation is carried out at 12000 rpm at room temperature for 1-2 min.
6. The method of polyploid genome survey according to claim 1, wherein in step 1(11) the centrifugation is carried out at 12000 rpm at room temperature for 1-2 min.
CN202010226501.7A 2020-03-27 2020-03-27 Method for polyploid genome surfy Pending CN111411107A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010226501.7A CN111411107A (en) 2020-03-27 2020-03-27 Method for polyploid genome surfy

Publications (1)

Publication Number Publication Date
CN111411107A true CN111411107A (en) 2020-07-14

Family

ID=71489267

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination