CN112553361A - Method for identifying SNP (single nucleotide polymorphism) of broad beans by using simplified genome sequencing data - Google Patents
- Publication number
- CN112553361A (application number CN202011310367.5A)
- Authority
- CN
- China
- Prior art keywords
- snp
- sequence
- sequencing data
- identifying
- rad
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6888—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
- C12Q1/6895—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms for plants, fungi or algae
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/156—Polymorphic or mutational markers
Abstract
The invention discloses a method for identifying broad bean SNPs by using simplified genome sequencing data, which comprises the following steps: step one, extracting sample genomic DNA; step two, constructing a RAD library by digesting the genomic DNA with EcoRI, enriching the library by PCR amplification, recovering the target band from an agarose gel, and performing single-end sequencing; step three, generating raw sequences, performing sequence quality control analysis, clustering the high-quality sequences by sequence similarity to generate RAD-tags, clustering the RAD-tags to perform SNP calling, and correcting the SNP genotypes; step four, KASP marker development and SNP genotyping. Because the method mines SNPs from genome sequencing data, it can identify not only SNP mutations in expressed gene regions but also SNP mutations in non-coding regions within and between genes, so the sources of SNPs are more abundant.
Description
Technical Field
The invention relates to the technical field of biological detection, in particular to a method for identifying SNP (single nucleotide polymorphism) of broad beans by using simplified genome sequencing data.
Background
The broad beans are rich in protein and cellulose and are easy to digest and absorb, dry seeds of the broad beans can be used as grains and feeds or processed into leisure food, and fresh seeds can be used as vegetables for eating. The root system of the broad bean has the function of biological nitrogen fixation, and is an important crop rotation and soil cultivation crop in the structure adjustment of the planting industry.
In recent years, with the rapid development of second-generation sequencing technologies and the significant reduction in sequencing cost, high-throughput sequencing has been widely applied to marker development, gene mapping and other work in crops with large, complex genomes such as wheat. Single nucleotide polymorphism (SNP) markers, that is, DNA sequence polymorphisms caused by single-nucleotide variation at the genome level, have become an ideal new generation of molecular markers because they are widely distributed across the genome, occur at high density, are stable, and are suitable for large-scale screening. In broad beans, however, the number of publicly reported SNP markers is still limited.
Simplified genome sequencing, such as RAD-Seq (restriction site-associated DNA sequencing), uses restriction enzymes to cut genomic DNA and performs high-throughput sequencing of specific fragments, thereby reducing genome complexity while still obtaining sequence data representative of the whole genome of the target species. Because its sequencing depth is moderate, its cost is low and it does not depend on a reference genome, the approach is now widely used for marker development, genetic map construction, target gene mapping and similar work in many non-model species.
Broad bean is a diploid crop (2n = 2x = 12) with a genome of about 13 Gb, about 25 times larger than that of alfalfa, another leguminous crop, making it one of the species with the largest genomes among leguminous crops. This very large genome seriously hinders genomic resource research such as whole-genome sequencing and marker development, so work such as achieving genetic gain with molecular markers has progressed slowly. The prior art therefore needs to be improved.
Disclosure of Invention
The invention provides a method for identifying broad bean SNPs (single nucleotide polymorphisms) by using simplified genome sequencing data, which aims to solve the technical problem that the very large genome of broad beans seriously hinders genome sequencing, marker development and other genomic resource research, so that work such as achieving genetic gain with molecular markers progresses slowly.
In order to solve the technical problems, the invention provides the following technical scheme:
A method for identifying broad bean SNPs by using simplified genome sequencing data comprises the following steps:
step one, taking young leaves of broad bean seedlings as samples, grinding them in liquid nitrogen, and extracting sample genomic DNA;
step two, constructing a RAD library: digesting the genomic DNA with EcoRI, enriching the library by PCR amplification, recovering the target band from an agarose gel, and performing single-end sequencing;
step three, generating raw sequences, performing sequence quality control analysis, removing sequences shorter than 85 bp to obtain screened sequences, clustering the screened sequences by sequence similarity to generate RAD-tags, clustering the RAD-tags to perform SNP calling, and correcting the SNP genotypes;
step four, selecting a sequence with a total length of 100 bp that covers the SNP site and designing KASP primers, so that each KASP marker comprises two forward primer sequences that distinguish the SNP alleles and one universal reverse primer sequence, and performing SNP genotyping for SNP signal detection.
Further, the broad bean young leaf in the first step is the broad bean young leaf which grows for 1 week.
Further, in the first step, after the sample genomic DNA is extracted, the method further comprises:
the quality of the extracted sample genomic DNA was checked by agarose gel electrophoresis, and the concentration of the extracted sample genomic DNA was checked using a NanoDrop2000 ultramicro spectrophotometer.
Further, in step two, after the genomic DNA is digested with EcoRI, an 'A' is added to the 3' end of the digested fragments and an MID adapter is ligated; single-end sequencing uses the Illumina HiSeq2000 platform.
Further, in step three, the raw sequences are generated by the Illumina base calling software CASAVA v1.8.2, and sequence quality control analysis is performed with Trimmomatic under default parameters.
Further, in step three, the screened sequences are clustered by sequence similarity with the ustacks software to generate RAD-tags, the RAD-tags are clustered with the cstacks software under default parameters to perform SNP calling, and finally the SNP genotypes are corrected with a Bayesian algorithm.
Further, in step four, the KASP primers are designed with the Kraken™ software.
Further, in step four, SNP genotyping uses an IntelliQube high-throughput genotyping platform to detect SNP signals.
Furthermore, in step four, the single-site reaction volume for SNP signal detection is 1.6 μL, of which 0.8 μL is sample DNA and 0.8 μL is the mixture of 2× Master mix and Primer mix.
Further, in step four, the PCR amplification program for SNP signal detection is: 95 °C for 15 min; 10 cycles of 94 °C for 20 s and 61-55 °C for 60 s; then 26 cycles of 94 °C for 20 s and 55 °C for 60 s.
The technical scheme provided by the invention has the beneficial effects that at least:
the invention provides a method for identifying broad bean SNP by using simplified genome sequencing data, which is different from the method for mining SNP by using transcriptome sequencing data in the prior art. The SNP identified by the method can provide a powerful genetic tool for broad bean germplasm resource identification, gene localization and molecular breeding.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram of 4 amplification signals of KASP markers provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The embodiment provides a method for identifying broad bean SNP by using simplified genome sequencing data, which comprises the steps of carrying out simplified genome sequencing on broad bean varieties by using RAD-Seq technology, identifying broad bean SNP in a whole genome range by using a reference genome-independent SNP identification technology and analyzing the characteristics of the broad bean SNP.
The broad bean germplasm used in this embodiment was collected and preserved by the Lishui Academy of Agricultural and Forestry Sciences; 8 germplasms (FB017, FB032, FB036, FB056, FB076, FB079, FB080, FB081) were used for simplified genome sequencing, and another 46 germplasms were used to verify SNP accuracy.
The method for identifying the broad bean SNP by using the simplified genome sequencing data comprises the following steps:
step one, DNA extraction:
taking young leaves of broad bean seedlings as samples, grinding the young leaves by using liquid nitrogen, and extracting sample genome DNA by using a DNA extraction kit; wherein, the young leaves of the broad beans adopted in the embodiment are young leaves of the broad beans which grow for 1 week; in this embodiment, after extracting the genomic DNA of the sample, the method further comprises: the quality of the extracted sample genomic DNA was checked by agarose gel electrophoresis, and the concentration of the extracted sample genomic DNA was checked using a NanoDrop2000 ultramicro spectrophotometer.
Step two, library construction and sequencing:
constructing the RAD library with reference to the method of Baird et al. (Baird NA, Etter PD, Atwood TS, et al. Rapid SNP discovery and genetic mapping using sequenced RAD markers [J]. PLoS ONE, 2008, 3: e3376): digesting the genomic DNA with EcoRI, adding an 'A' to the 3' end of the digested fragments, and ligating an MID (multiplex identifier) adapter; enriching the library by PCR amplification, recovering the target band from an agarose gel, and performing single-end sequencing on the Illumina HiSeq2000 platform.
Step three, SNP identification:
generating raw sequences with the Illumina base calling software CASAVA v1.8.2, performing sequence quality control analysis with Trimmomatic under default parameters, removing sequences shorter than 85 bp to obtain screened sequences, clustering the screened sequences by sequence similarity to generate RAD-tags, clustering the RAD-tags to perform SNP calling, and correcting the SNP genotypes;
specifically, clustering the screened sequences into RAD-tags, clustering the RAD-tags for SNP calling, and correcting the SNP genotypes proceed as follows:
clustering the screened sequences with the ustacks software to generate RAD-tags; then, following the reference-genome-independent method for identifying genome-wide SNPs in bottle gourd described by Xu et al. (Xu P, Xu S, Wu X, et al. Population genomic analyses from low-coverage RAD-Seq data: a case study on the non-model cucurbit bottle gourd. Plant J, 2014, 77: 430-442), clustering the RAD-tags with the cstacks software under default parameters to perform SNP calling; and finally correcting the SNP genotypes with a Bayesian algorithm (Hohenlohe PA, Bassham S, Etter PD, et al. Population genomics of parallel adaptation in threespine stickleback using sequenced RAD tags. PLoS Genet, 2010, 6: e1000862).
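The similarity-based clustering step can be illustrated with a minimal Python sketch. It is not the ustacks algorithm: it is a simple greedy grouping by Hamming distance, and the function names, the truncation of reads to a common length, and the mismatch threshold `max_mismatch` are assumptions introduced here for illustration only.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, min_len=85, max_mismatch=2):
    """Greedy clustering: a read joins the first cluster whose seed is similar enough.

    Reads shorter than min_len are dropped (the 85 bp filter in step three).
    Truncating to min_len and the max_mismatch value are illustrative assumptions.
    """
    clusters = []  # list of (seed_sequence, member_reads)
    for read in reads:
        if len(read) < min_len:
            continue
        read = read[:min_len]  # make all reads the same length for comparison
        for seed, members in clusters:
            if hamming(seed, read) <= max_mismatch:
                members.append(read)
                break
        else:
            clusters.append((read, [read]))  # start a new candidate RAD-tag
    return clusters
```

Within each resulting cluster, positions where member reads disagree are the candidate SNP sites that a real pipeline would then evaluate statistically.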
Step four, KASP marker development and genotyping:
Selecting a sequence of about 100 bp in total length that covers the SNP site, and designing KASP primers with the Kraken™ software so that each KASP marker comprises three primer sequences, namely two forward primers that distinguish the SNP alleles and one universal reverse primer; then performing SNP genotyping for SNP signal detection. SNP genotyping was carried out in the public laboratory of the Zhejiang Academy of Agricultural Sciences, and SNP signals were detected on an IntelliQube high-throughput genotyping platform. The single-site reaction volume was 1.6 μL, of which 0.8 μL was sample DNA and 0.8 μL was the mixture of 2× Master mix and Primer mix. The PCR amplification program was: 95 °C for 15 min; 10 cycles of 94 °C for 20 s and 61-55 °C for 60 s (touchdown PCR, decreasing by 0.6 °C per cycle); then 26 cycles of 94 °C for 20 s and 55 °C for 60 s. The results were analyzed with the software supplied with the IntelliQube platform.
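As a worked illustration of the touchdown program described above, the following Python sketch lists the annealing temperature for each cycle. It assumes the temperature starts at 61 °C and drops by 0.6 °C per cycle over the 10 touchdown cycles, and is then held at 55 °C for the remaining 26 cycles. The function name and parameters are hypothetical.

```python
def touchdown_schedule(start=61.0, step=0.6, td_cycles=10,
                       final=55.0, final_cycles=26):
    """Return the annealing temperature (in degrees C) for every PCR cycle."""
    # Touchdown phase: lower the annealing temperature by `step` each cycle.
    temps = [round(start - step * i, 1) for i in range(td_cycles)]
    # Constant phase: hold the final annealing temperature.
    temps += [final] * final_cycles
    return temps


schedule = touchdown_schedule()
print(len(schedule), schedule[:10])
```

Under these assumptions the program runs 36 cycles in total, and the tenth touchdown cycle anneals at 55.6 °C before the constant 55 °C phase begins.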
Further, in another possible embodiment, the implementation process of the second step is as follows:
1) The specific experimental steps are as follows:
1. constructing the library from a starting amount of 1 μg of DNA;
2. fragmenting the DNA to 300-500 bp by Covaris M220 ultrasonication;
3. adding an 'A' to the 3' end and ligating index adapters (TruSeq™ Nano DNA Sample Prep Kit);
4. enriching the library by 8 cycles of PCR amplification;
5. recovering the target band from a 2% agarose gel (Certified Low Range Ultra Agarose);
6. quantifying with TBS380 (PicoGreen) and pooling for loading according to the required data proportions;
7. performing bridge PCR amplification on the cBot to generate clusters;
8. performing 2 × 150 bp sequencing on the Illumina HiSeq platform.
2) Bioinformatics analysis workflow:
the reads obtained by sequencing are aligned to a reference genome sequence with BWA, and reads arising from PCR duplication are then removed with Picard tools. Based on the alignment results, the sequencing depth and coverage relative to the reference genome are calculated. SNPs and small indels are detected with the GATK software package, and structural variants (SVs) are identified with BreakDancer 1.1.2.
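The depth and coverage calculation mentioned above can be sketched in Python as follows. This is a minimal illustration that assumes the per-base depths have already been extracted from the alignment into a list of integers; the function name and the `min_depth` threshold are hypothetical.

```python
def depth_and_coverage(per_base_depth, min_depth=1):
    """Return (mean depth, fraction of reference positions covered).

    per_base_depth: one integer per reference position, giving the number
    of aligned reads overlapping that position.
    """
    n = len(per_base_depth)
    if n == 0:
        return 0.0, 0.0
    mean_depth = sum(per_base_depth) / n
    # A position counts as covered when at least min_depth reads overlap it.
    covered = sum(d >= min_depth for d in per_base_depth)
    return mean_depth, covered / n
```

For example, depths of `[0, 2, 4, 2]` give a mean depth of 2.0 and a coverage of 0.75.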
3) Description of raw sequencing data:
raw image data obtained by Illumina sequencing is converted into sequence data through Base Calling, and the result is stored in a FASTQ file format. The FASTQ file is the most primitive data file, and contains sequence information of sequencing reads as well as sequencing quality information. The FASTQ file format is as follows:
@K00169:186:HM5C2CCXX:6:1101:8136:2962 1:N:0:CTGGCATA
CCACTCATAATCCAGCAAATACTAAATCTGCTGCAGGAAAAGAAATGCGGTTGAGCTTAAATAGCCCAG
+
AFFKKFKKFFKKKKFKAFKKAAKFAFFKKFKKFFKKKKFKAFKKAAKFAFFKKFKKFFKKKKFKAFKKAA
each read occupies 4 lines: the first and third lines hold the read name or ID (the first line starts with "@" and the third with "+"; the ID may be omitted on the third line, but the "+" may not), the second line is the base sequence of the read, and the fourth line gives the sequencing quality value of each base in the second line. To facilitate the storage and sharing of high-throughput sequencing data generated by laboratories, the NCBI data center has built a large database, SRA (Sequence Read Archive, http://www.ncbi.nlm.nih.gov/Traces/SRA), for storing and sharing raw sequencing data. Raw data statistics are shown in Table 1:
TABLE 1 raw sequencing data
Sample | Raw reads | Raw bases | Q20(%) | Q30(%) |
FB017-0911-1 | 34620356 | 5089192332 | 94.85 | 88.84 |
FB032-09016-3 | 35246208 | 5181192576 | 95.16 | 89.36 |
FB036-09019-1 | 35445202 | 5210444694 | 95.11 | 89.36 |
FB056-09031 | 35092390 | 5158581330 | 95.1 | 89.31 |
FB076 | 35898166 | 5277030402 | 95.02 | 89.21 |
FB079 | 36945126 | 5467878648 | 95.36 | 89.77 |
FB080 | 29994288 | 4439154624 | 95.05 | 89.26 |
FB081 | 35299488 | 5224324224 | 95.53 | 90.09 |
Sample: the sample name;
Raw reads: the number of sequencing reads in each file, counted from the raw sequence data with four lines per read;
Raw bases: the number of sequencing reads multiplied by the read length;
Q20, Q30: the percentage of all bases with a Phred quality value greater than 20 and 30, respectively.
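The quantities in Table 1 can be computed directly from a FASTQ file. The following Python sketch is illustrative only: it assumes four lines per record and Phred+33 quality encoding, and it counts bases with quality at or above the threshold (a common convention, which may differ slightly from the "greater than" wording above). The function name and dictionary keys are hypothetical.

```python
import io


def fastq_stats(handle, offset=33):
    """Count reads and bases, and compute Q20/Q30 percentages, from a FASTQ handle."""
    reads = bases = quals = q20 = q30 = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        seq = handle.readline().strip()
        handle.readline()  # the '+' separator line
        qual = handle.readline().strip()
        reads += 1
        bases += len(seq)
        for ch in qual:
            q = ord(ch) - offset  # decode the Phred score (Phred+33 assumed)
            quals += 1
            q20 += q >= 20
            q30 += q >= 30
    pct = lambda n: 100.0 * n / quals if quals else 0.0
    return {"raw_reads": reads, "raw_bases": bases,
            "Q20": pct(q20), "Q30": pct(q30)}


example = io.StringIO("@r1\nACGT\n+\nII5+\n")
print(fastq_stats(example))
```

In the small example, the quality string `II5+` decodes to Phred scores 40, 40, 20 and 10.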
4) Quality control of raw sequencing data:
Illumina sequencing is a second-generation sequencing technology in which a single run can generate billions of reads, so the quality of each read cannot be inspected individually. Instead, the base distribution and quality fluctuation at each cycle are summarized statistically over all sequencing reads, which gives a direct overall picture of the sequencing quality and library construction quality of a sample.
The raw Illumina HiSeq sequencing data contain sequencing adapter sequences, low-quality reads, sequences with a high proportion of N, and sequences that are too short, all of which seriously affect the quality of subsequent assembly. To ensure the accuracy of the subsequent bioinformatics analysis, the raw sequencing data are first filtered to obtain high-quality sequencing data (clean data). The steps, in order, are: removing adapter sequences from reads, and removing reads without inserts caused by adapter self-ligation and similar artifacts; trimming low-quality bases (quality value below 20) from the 3' end of each sequence, then removing the whole sequence if any base with a quality value below 10 remains in it, and otherwise keeping it; removing reads whose N content exceeds 10%; and discarding sequences shorter than 75 bp after adapter removal and quality trimming.
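The filtering rules above (apart from adapter removal, which is omitted here) can be sketched in Python as follows. This is a simplified illustration, not the tool used in the embodiment; the function name and the representation of qualities as a list of integers are assumptions.

```python
def clean_read(seq, quals, trim_q=20, min_q=10, max_n_frac=0.10, min_len=75):
    """Apply the quality rules to one read; return (seq, quals) or None if discarded.

    seq: base sequence; quals: list of Phred scores, one per base.
    """
    # 1. Trim bases with quality below trim_q from the 3' end.
    end = len(seq)
    while end > 0 and quals[end - 1] < trim_q:
        end -= 1
    seq, quals = seq[:end], quals[:end]
    # 2. Discard reads that are too short after trimming. The length check is
    #    done early here only to avoid dividing by zero below; the outcome is
    #    the same as applying it last.
    if len(seq) < min_len:
        return None
    # 3. Discard the read if any remaining base has quality below min_q.
    if any(q < min_q for q in quals):
        return None
    # 4. Discard reads whose fraction of N bases exceeds max_n_frac.
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return None
    return seq, quals
```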
Further, in another possible embodiment, the SNP identification process in step three above is as follows:
Taking into account data characteristics, sequencing quality and experimental factors, a Bayesian model (GATK UnifiedGenotyper) is used to calculate the probability of each possible genotype given the observed data. The genotype with the highest probability is selected as the genotype of the sequenced individual at that site, a quality value reflecting the accuracy of that genotype is reported, and a consensus sequence is obtained. Based on the consensus sequence, sites that are polymorphic relative to the reference sequence are screened and filtered.
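The idea of choosing the most probable genotype from the observed data can be illustrated with a toy Bayesian calculation. The sketch below is not the GATK model: it assumes a biallelic site, a single fixed sequencing error rate, independent reads and uniform priors by default, and all names are hypothetical.

```python
import math


def genotype_posteriors(n_ref, n_alt, error=0.01, priors=None):
    """Posterior probability of each diploid genotype given allele read counts."""
    priors = priors or {"ref/ref": 1 / 3, "ref/alt": 1 / 3, "alt/alt": 1 / 3}
    # Probability that a single read shows the alt allele under each genotype.
    p_alt = {"ref/ref": error, "ref/alt": 0.5, "alt/alt": 1 - error}
    log_post = {}
    for g, p in p_alt.items():
        # Binomial log-likelihood; the binomial coefficient is the same for
        # all genotypes and cancels during normalisation.
        log_post[g] = (math.log(priors[g])
                       + n_alt * math.log(p)
                       + n_ref * math.log(1 - p))
    m = max(log_post.values())  # subtract the maximum for numerical stability
    unnorm = {g: math.exp(v - m) for g, v in log_post.items()}
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}


post = genotype_posteriors(n_ref=5, n_alt=5)
print(max(post, key=post.get))
```

The called genotype is the one with the highest posterior, and that posterior can be converted into a quality value for the call.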
The method mainly comprises the following steps:
1. converting the SAM file to a BAM file and sorting the BAM file;
2. marking PCR duplicates and removing the duplicate reads;
3. filtering out aligned reads with a mapping quality below 10, and indexing;
4. performing local realignment around indels;
5. calling SNPs and indels with GATK;
6. filtering the variant results to obtain high-accuracy variants;
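Steps 2 and 3 of the list above amount to a per-alignment filter. The following Python sketch shows that filter on plain-text SAM lines; it assumes duplicates have already been marked (SAM flag bit 0x400) by an upstream tool, and the function name is hypothetical. In practice this is done on BAM files with dedicated tools rather than line by line.

```python
DUPLICATE_FLAG = 0x400  # SAM flag bit: PCR or optical duplicate
UNMAPPED_FLAG = 0x4     # SAM flag bit: segment unmapped


def keep_alignment(sam_line, min_mapq=10):
    """Return True if a SAM line should be kept for variant calling."""
    if sam_line.startswith("@"):
        return True  # header lines are always kept
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: bitwise FLAG
    mapq = int(fields[4])  # column 5: mapping quality
    if flag & UNMAPPED_FLAG or flag & DUPLICATE_FLAG:
        return False
    return mapq >= min_mapq
```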
the statistical format for SNP identification is shown in Table 2:
TABLE 2 SNP identification statistical Format
type | FB017-0911-1 | FB032-09016-3 | FB036-09019-1 | …… |
all-snp | 7915 | 10227 | 9896 | …… |
hom | 5259 | 6859 | 6676 | …… |
het | 2656 | 3368 | 3220 | …… |
all-indel | 302 | 368 | 385 | …… |
deletion | 144 | 183 | 182 | …… |
insertion | 158 | 185 | 203 | …… |
Hom denotes a homozygous mutation, for example A -> T; het denotes a heterozygous mutation, for example A -> A/T; insertion and deletion denote insertion and deletion mutations, respectively.
Genome-wide SNP mutations can be divided into 6 classes. Taking T: A > C: G as an example, this type of SNP mutation includes T > C and A > G. Since the sequencing data aligns to both the positive and negative strands of the reference genome, when a T > C type mutation occurs on the positive strand of the reference genome, an A > G type mutation is at the same position on the negative strand of the reference genome, and thus T > C and A > G are divided into one class.
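The grouping of the 12 possible single-base changes into 6 strand-symmetric classes can be sketched in Python as follows. The sketch expresses every change from the pyrimidine (T or C) side so that complementary changes such as T>C and A>G receive the same label; the function name is hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def mutation_class(ref, alt):
    """Map a ref>alt base change to one of the 6 strand-symmetric classes."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or ref not in COMPLEMENT or alt not in COMPLEMENT:
        raise ValueError("ref and alt must be two different bases from A, C, G, T")
    if ref in "AG":
        # Re-express the change on the opposite strand so ref is T or C.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}:{COMPLEMENT[ref]}>{alt}:{COMPLEMENT[alt]}"


print(mutation_class("T", "C"), mutation_class("A", "G"))
```

Both calls in the example return the same class label, `T:A>C:G`.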
SNP annotation: ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected in a variety of genomes. Given the chromosome on which a variant is located, its start position, end position, reference nucleotide and variant nucleotide, ANNOVAR can annotate it. In view of ANNOVAR's powerful annotation capabilities and wide international acceptance, it was used to annotate the SNP detection results. The statistics of the SNP annotation results are shown in Table 3, and those of the small indel annotation results in Table 4:
TABLE 3 SNP annotation results
type | FB017-0911-1 | FB032-09016-3 | FB036-09019-1 | …… |
UTR3 | 111 | 120 | 133 | …… |
UTR5 | 132 | 129 | 148 | …… |
downstream | 190 | 204 | 287 | …… |
exonic | 2207 | 3222 | 2976 | …… |
exonic;splicing | 0 | 0 | 0 | …… |
intergenic | 3735 | 4761 | 4688 | …… |
intronic | 1227 | 1469 | 1299 | …… |
splicing | 9 | 8 | 7 | …… |
upstream | 241 | 254 | 290 | …… |
TABLE 4 results of small indel annotation
type | FB017-0911-1 | FB032-09016-3 | FB036-09019-1 | …… |
| 10 | 13 | 15 | …… |
UTR5 | 17 | 9 | 11 | …… |
downstream | 13 | 13 | 20 | …… |
exonic | 37 | 48 | 47 | …… |
intergenic | 136 | 172 | 181 | …… |
intronic | 60 | 80 | 71 | …… |
splicing | 2 | 3 | 4 | …… |
upstream | 23 | 26 | 32 | …… |
For detailed descriptions of the annotation categories in the tables above, see:
http://www.openbioinformatics.org/annovar/annovar_gene.html
sample name.
Upstream: the 1Kb region upstream of the gene.
Exonic: the variation is located in an exon region; missense: non-synonymous variants; stop gain: allowing the gene to acquire a variation of a stop codon; stop loss: a mutation that deprives the gene of a stop codon; synonymous: synonymous variants.
Intronic: the variation is located in an intron region.
Splicing: the variation is located at a splice site (within 2 bp of an exon/intron boundary, in the intron).
Downstream: the 1Kb region downstream of the gene.
Upstream;downstream: the variation is located within 1 kb upstream of one gene and within 1 kb downstream of another gene.
Intergenic: the variation is located in the intergenic region.
For the SNP and small indel sites in the CDS region, the effect of the mutation site on protein translation will be annotated. The statistics of the results (SNPs) of the effect of the mutated site of the CDS region on protein translation are shown in table 5:
TABLE 5 results of the influence of the mutated site of the CDS region on protein translation
type | FB017-0911-1 | FB032-09016-3 | FB036-09019-1 | …… |
nonsynonymous SNV | 888 | 1292 | 1201 | …… |
stopgain SNV | 23 | 33 | 24 | …… |
stoploss SNV | 2 | 3 | 3 | …… |
synonymous SNV | 1294 | 1894 | 1748 | …… |
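The four categories in Table 5 follow from comparing the amino acid encoded before and after the substitution. The Python sketch below illustrates this for a single codon using the standard genetic code; it is a simplified illustration rather than ANNOVAR's implementation, and the function name and its arguments are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def snv_effect(ref_codon, pos, alt_base):
    """Classify a single-base substitution at position pos (0-2) of a codon."""
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:pos] + alt_base.upper() + ref_codon[pos + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous SNV"
    if alt_aa == "*":
        return "stopgain SNV"   # a stop codon is created
    if ref_aa == "*":
        return "stoploss SNV"   # an existing stop codon is lost
    return "nonsynonymous SNV"
```

For example, changing the third base of `TAT` to `A` produces the stop codon `TAA` and is therefore classified as a stopgain SNV.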
Statistics of the effect of mutation sites in the CDS region on protein translation (small indels) are shown in Table 6, and the meanings of the degenerate bases are shown in Table 7:
TABLE 6 results of the influence of the mutated site of the CDS region on protein translation
type | FB017-0911-1 | FB032-09016-3 | FB036-09019-1 | …… |
frameshift | 10 | 19 | 14 | …… |
frameshift | 15 | 18 | 19 | …… |
nonframeshift deletion | 4 | 3 | 6 | …… |
nonframeshift insertion | 8 | 8 | 7 | …… |
stopgain | 0 | 0 | 1 | …… |
TABLE 7 base meanings
Degenerate/mixed bases | A+C+G | V |
Degenerate/mixed bases | A+T+G | D |
Degenerate/mixed bases | T+C+G | B |
Degenerate/mixed bases | A+T+C | H |
Degenerate/mixed bases | A+T | W |
Degenerate/mixed bases | C+G | S |
Degenerate/mixed bases | T+G | K |
Degenerate/mixed bases | A+C | M |
Degenerate/mixed bases | C+T | Y |
Degenerate/mixed bases | A+G | R |
Degenerate/mixed bases | A+G+C+T | N |
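The degenerate base codes in Table 7 can be used, for example, to write a heterozygous genotype as a single character. The following Python sketch encodes a set of observed alleles with the codes from the table; the function name is hypothetical.

```python
# Degenerate base codes from Table 7, keyed by the set of bases they represent.
IUPAC = {
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("ACG"): "V", frozenset("AGT"): "D", frozenset("CGT"): "B",
    frozenset("ACT"): "H", frozenset("ACGT"): "N",
}


def iupac_code(alleles):
    """Return the single-letter code for one or more observed bases."""
    bases = frozenset(a.upper() for a in alleles)
    if len(bases) == 1:
        return next(iter(bases))  # homozygous: the base itself
    return IUPAC[bases]


print(iupac_code("AT"), iupac_code(["G", "A"]), iupac_code("CC"))
```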
Further, to illustrate the feasibility of the method of the present invention for identifying SNP in faba beans using simplified genomic sequencing data, the results of the method were statistically analyzed as follows:
1) sequencing data statistics:
in this example, 8 broad bean germplasms were sequenced on the Illumina HiSeq platform, yielding 35.47 Gb of data in total and 245443516 reads with an average length of 144 bp. Among the 8 germplasms, the amount of sequencing data ranged from 3.83 Gb to 4.77 Gb, with an average of 4.43 Gb; the number of reads ranged from 26415662 to 32822210, with an average of 30680439.5; Q20 and Q30 were above 97.89% and 93.83%, respectively; and GC content ranged from 38.05% to 40.09%. Statistics of the sequencing data of the 8 germplasms are shown in Table 8:
TABLE 8 Sequencing data of the 8 germplasms
2) Broad bean whole genome SNP identification:
in this example, 3722 genome-wide SNPs were identified using the reference-genome-independent SNP identification method developed for bottle gourd together with the Bayesian algorithm; statistics of the SNP identification information for the 8 materials are shown in Table 9:
TABLE 9 SNP identification information
For individual germplasms, FB076 had the fewest identified SNPs (3278) and FB079 the most (3578). Across the 8 germplasms, the number of homozygous SNP mutations ranged from 1579 to 2033 and the number of heterozygous SNP mutations from 1245 to 1804; except in FB080 and FB056, homozygous SNP mutations outnumbered heterozygous ones (Table 9).
Of the 6 SNP mutation types, T:A>C:G accounts for the largest proportion (38.8% on average), followed by C:G>T:A (28.0% on average), while T:A>A:T is the least frequent (7.50% on average). Statistics of the SNP mutation patterns are shown in Table 10:
TABLE 10 SNP mutation patterns
3) SNP validation
To verify the validity of the SNPs, 56 SNPs were selected for KASP marker development after filtering by the criteria of missing rate ≤ 20%, MAF ≥ 0.05 and SNP occurrence frequency ≥ 40; 31 of them were finally converted into KASP markers, a conversion success rate of 55.3%. The 31 pairs of KASP markers developed in this example are shown in Table 11:
TABLE 11 The 31 pairs of KASP markers
Genotyping of 46 broad bean germplasm resources with the 31 pairs of KASP markers showed that 22 pairs of markers produced successfully amplified signals: 14 pairs detected a single genotype signal, 4 pairs showed 2 genotype signals, and 4 pairs showed 3 genotype signals, as shown in FIG. 1. FIG. 1 shows the 4 types of amplification signals of the KASP markers in this example: A, failed amplification; B, single genotype; C, 2 genotypes; D, 3 genotypes.
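Two of the criteria used to select candidate SNPs for KASP development, the missing rate (at most 20%) and the minor allele frequency (MAF, at least 0.05), can be sketched in Python as follows. The sketch assumes diploid genotypes given as two-letter strings with `None` for missing calls; the third criterion (occurrence frequency) is not implemented because its exact definition is not given here, and the function name is hypothetical.

```python
from collections import Counter


def snp_passes_filter(genotypes, max_missing=0.20, min_maf=0.05):
    """Check the missing-rate and MAF criteria for one SNP across samples.

    genotypes: one entry per sample, e.g. "AA", "AG", or None if missing.
    """
    called = [g for g in genotypes if g]
    if not genotypes or not called:
        return False
    missing_rate = 1 - len(called) / len(genotypes)
    if missing_rate > max_missing:
        return False
    allele_counts = Counter("".join(called))  # count individual alleles
    if len(allele_counts) < 2:
        return False  # monomorphic among the called samples
    # The minor allele is taken as the second most common allele.
    minor = allele_counts.most_common()[1][1]
    maf = minor / sum(allele_counts.values())
    return maf >= min_maf
```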
With the rapid decrease in sequencing costs, identification of SNPs in the genome-wide range using genome re-sequencing has been widely used on a variety of crops. Because broad beans have large genome and no reference genome exists at present, SNP (single nucleotide polymorphism) mining of broad beans lags behind other leguminous crops such as soybeans, kidney beans, cowpeas and the like.
In this example, 3722 SNP markers were identified from the RAD-Seq data of 8 germplasms. Unlike Ocaña et al. (Ocaña S, Seoane P, Bautista R, et al. Large-Scale Transcriptome Analysis in Faba Bean (Vicia faba L.) under Ascochyta fabae Infection. PLoS ONE, 2015, 10(8): e013514) and Webb et al. (Webb A, Cottage A, Wood T, et al. A SNP-based consensus genetic map for synteny-based trait targeting in faba bean (Vicia faba L.) [J]. Plant Biotechnology Journal, 2016, 14: 177-185), who mined SNPs from transcriptome sequencing data, this example mines SNPs from genome sequencing data, so that not only SNP mutations in expressed gene regions but also SNP mutations in non-coding regions within and between genes can be identified, and the sources of SNPs are more abundant.
The foregoing describes the preferred embodiments of the present invention. It should be noted that those skilled in the art may make numerous modifications and adaptations without departing from the principles or the scope of the invention. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all such alterations and modifications that fall within the scope of the embodiments of the invention.
Claims (10)
1. A method for identifying broad bean SNPs using simplified genome sequencing data, characterized by comprising the following steps:
step one, taking young leaves of broad bean seedlings as a sample, grinding the young leaves in liquid nitrogen, and extracting genomic DNA from the sample;
step two, constructing a RAD library: digesting the genomic DNA with EcoRI, enriching the library by PCR amplification, recovering the target band from an agarose gel, and performing single-end sequencing;
step three, generating raw sequences, performing sequence quality control analysis, removing sequences shorter than 85 bp to obtain filtered sequences, clustering the filtered sequences by sequence similarity to generate RAD-tags, clustering the RAD-tags to perform SNP calling, and correcting the SNP genotypes;
step four, selecting a sequence that covers the SNP site and has a total length of 100 bp, designing KASP primers such that each KASP marker comprises two forward primer sequences for distinguishing the SNP alleles and one universal reverse primer sequence, and performing SNP genotyping to detect the SNP signals.
2. The method for identifying SNP of broad beans by using simplified genome sequencing data as set forth in claim 1, wherein the young leaves of broad bean seedlings in the first step are young leaves of broad bean seedlings which grow for 1 week.
3. The method for identifying broad bean SNPs using simplified genome sequencing data according to claim 1, wherein step one, after extracting the genomic DNA from the sample, further comprises:
checking the quality of the extracted genomic DNA by agarose gel electrophoresis, and measuring the concentration of the extracted genomic DNA with a NanoDrop 2000 ultramicro spectrophotometer.
4. The method for identifying broad bean SNPs using simplified genome sequencing data according to claim 1, wherein in step two, after the genomic DNA is digested with EcoRI, an 'A' is added to the 3' end of the digested fragments for ligation to MID adapters; and the single-end sequencing is performed on the Illumina HiSeq 2000 platform.
5. The method for identifying broad bean SNPs using simplified genome sequencing data according to claim 1, wherein in step three, the raw sequences are generated by the Illumina base calling software CASAVA v1.8.2, and the sequence quality control analysis is performed using Trimmomatic software with default parameters.
6. The method for identifying broad bean SNPs using simplified genome sequencing data according to claim 1, wherein in step three, the filtered sequences are clustered by sequence similarity using ustacks software to generate RAD-tags, the RAD-tags are clustered using cstacks software with default parameters to perform SNP calling, and the SNP genotypes are finally corrected using a Bayesian algorithm.
7. The method for identifying broad bean SNPs using simplified genome sequencing data according to claim 1, wherein in step four, the KASP primers are designed using Kraken™ software.
8. The method for identifying broad bean SNPs using simplified genome sequencing data according to claim 1, wherein in step four, the SNP genotyping for SNP signal detection is performed on the IntelliQube high-throughput genotyping platform.
9. The method for identifying broad bean SNPs using simplified genome sequencing data according to claim 1, wherein in step four, the volume of a single SNP genotyping reaction is 1.6 μL, comprising 0.8 μL of sample DNA and 0.8 μL of a mixture of 2× Master mix and Primer mix.
10. The method for identifying broad bean SNPs using simplified genome sequencing data according to claim 1, wherein in step four, the PCR amplification program used for SNP genotyping and signal detection is: 95 °C for 15 min; 10 cycles of 94 °C for 20 s and 61-55 °C for 60 s; and 26 cycles of 94 °C for 20 s and 55 °C for 60 s.
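The amplification program of claim 10 can be written out step by step. The sketch below is an illustration only: the claim gives the 61-55 °C range over 10 cycles but not the decrement per cycle, so an even drop of 0.6 °C per cycle is assumed here, and the function name and step labels are hypothetical.

```python
def kasp_touchdown_program(start=61.0, end=55.0, td_cycles=10, final_cycles=26):
    """Expand the PCR program into (step, temperature in °C, seconds) tuples."""
    # Assumption: the annealing temperature falls evenly across the
    # touchdown cycles, i.e. (61 - 55) / 10 = 0.6 °C per cycle.
    step = (start - end) / td_cycles
    program = [("initial denaturation", 95.0, 15 * 60)]
    for i in range(td_cycles):
        program.append(("denaturation", 94.0, 20))
        program.append(("annealing/extension", round(start - i * step, 1), 60))
    for _ in range(final_cycles):
        program.append(("denaturation", 94.0, 20))
        program.append(("annealing/extension", end, 60))
    return program


program = kasp_touchdown_program()
hold_seconds = sum(seconds for _, _, seconds in program)
print(len(program), hold_seconds / 60)
```

Under this assumption the tenth touchdown cycle anneals at 55.6 °C and the 26 remaining cycles run at 55 °C. The summed hold time excludes ramping between temperatures.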
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011310367.5A CN112553361A (en) | 2020-11-20 | 2020-11-20 | Method for identifying SNP (single nucleotide polymorphism) of broad beans by using simplified genome sequencing data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112553361A (en) | 2021-03-26 |
Family
ID=75044213
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011310367.5A Pending CN112553361A (en) | 2020-11-20 | 2020-11-20 | Method for identifying SNP (single nucleotide polymorphism) of broad beans by using simplified genome sequencing data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112553361A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060041955A1 (en) * | 2004-08-23 | 2006-02-23 | Pioneer Hi-Bred International, Inc. | Marker mapping and resistance gene associations in soybean |
CN101701255A (en) * | 2009-11-13 | 2010-05-05 | 中国检验检疫科学研究院 | Primer for PCR identification of kidney bean and identification method |
US20150322447A1 (en) * | 2012-07-06 | 2015-11-12 | Bayer Cropscience Nv | Soybean rod1 gene sequences and uses thereof |
CN106755328A (en) * | 2016-11-25 | 2017-05-31 | 中国农业科学院作物科学研究所 | A kind of construction method of broad bean SSR finger-prints |
CN110139872A (en) * | 2016-12-21 | 2019-08-16 | 中国农业科学院作物科学研究所 | Plant seed character-related protein, gene, promoter and SNP and haplotype |
Non-Patent Citations (4)
Title |
---|
ANNE WEBB et al.: "A SNP-based consensus genetic map for synteny-based trait targeting in faba bean (Vicia faba L.)", PLANT BIOTECHNOLOGY JOURNAL, vol. 14, no. 1, 10 April 2015 (2015-04-10), pages 177 - 185 * |
LIU Tingfu et al.: "Identification of broad bean SNPs using simplified genome sequencing data", Molecular Plant Breeding (分子植物育种), 13 November 2020 (2020-11-13), pages 1 - 9 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210326 |