WO2022160700A1 - Genotype identification of multi-parent crop on basis of high-throughput whole genome sequencing - Google Patents

Publication number
WO2022160700A1
WO2022160700A1 PCT/CN2021/115146 CN2021115146W WO2022160700A1 WO 2022160700 A1 WO2022160700 A1 WO 2022160700A1 CN 2021115146 W CN2021115146 W CN 2021115146W WO 2022160700 A1 WO2022160700 A1 WO 2022160700A1
Authority
WO
WIPO (PCT)
Prior art keywords
genotype
progeny
parent
snp
parents
Prior art date
Application number
PCT/CN2021/115146
Other languages
French (fr)
Chinese (zh)
Inventor
韩斌
朱舟
王阿红
Original Assignee
中国科学院分子植物科学卓越创新中心
Priority date
Filing date
Publication date
Application filed by 中国科学院分子植物科学卓越创新中心
Priority to AU2021423830A (AU2021423830A1)
Publication of WO2022160700A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search

Definitions

  • the invention relates to the technical field of biological information processing, in particular to multi-parent crop genotype identification based on high-throughput whole genome sequencing. More specifically, the present invention provides a method and device for identifying genotypes of multi-parent crops based on high-throughput whole genome sequencing data.
  • Genome sequencing opened the door to high-throughput genotyping. Initially this was done using microarray chip technology, which detects single nucleotide polymorphisms (SNPs) by hybridizing genomic DNA to oligonucleotides on a gene chip. Since hundreds to thousands of markers can be detected in a single hybridization, this method of genotyping greatly improves efficiency [1]. This method has been applied to model organism systems such as human, Arabidopsis and rice [2-4]. Although the goal of high throughput has been achieved, microarray-based approaches have serious limitations: designing, producing, and using microarrays is laborious, time-consuming, and costly.
  • SNPs: single nucleotide polymorphisms
  • next-generation sequencing technology has brought a leap forward in methods for genotyping and genetic mapping.
  • New sequencing technologies not only increase sequencing throughput by several orders of magnitude, but also allow parallel sequencing of many samples [5-6]. Advances in these technologies have paved the way for the development of sequencing-based high-throughput genotyping methods.
  • the new genotyping method combines speed and low cost with high-density marker coverage, high accuracy and high resolution, and is also applicable to more mapping populations and species for comparative genomics and genetic map construction.
  • the object of the present invention is to provide a method and device for identifying the genotypes of multi-parent plants with rapid analysis and accurate results, so that the population genotypes constructed by multiple parents can be analyzed quickly, accurately and reliably.
  • a method for identifying the genotype of a multi-parent plant comprising:
  • in step (c), the analysis of recombination break sites is performed based on the SNP "word string".
  • step (c) includes analyzing the recombination break sites, so as to obtain the analysis result of the recombination break sites.
  • in step (s1), all gaps between the SNPs are removed, no matter how large the actual distance between two adjacent SNPs is.
  • in step (s1), the SNPs constituting the word string are sites at which the parents are homozygous.
  • in step (s1), the method further includes: first screening the SNP sites included in the analysis, thereby excluding any SNP site at which a parent is heterozygous.
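
The word-string construction of step (s1) can be sketched as follows. This is a minimal illustration and not the SNPwindow script itself: the record layout (`pos`, `parents`, `progeny`), the function name `build_snp_string`, and the VCF-style genotype codes (`0/0`, `1/1`, `0/1`) are assumptions made for the example.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    pos: int        # physical position on the chromosome (bp)
    parents: tuple  # one genotype per parent, e.g. ("0/0", "1/1")
    progeny: str    # genotype of the progeny at this site


HOMOZYGOUS = {"0/0", "1/1"}


def build_snp_string(snps):
    """Return (positions, progeny genotypes) as a gap-free, ordered string.

    Sites where any parent is not homozygous are dropped. The physical
    distance between neighbouring SNPs is ignored: only their order is kept.
    """
    kept = [
        s for s in sorted(snps, key=lambda s: s.pos)
        if all(g in HOMOZYGOUS for g in s.parents)
    ]
    return [s.pos for s in kept], [s.progeny for s in kept]


snps = [
    Snp(5_000_000, ("0/0", "1/1"), "1/1"),
    Snp(100, ("0/0", "1/1"), "0/0"),
    Snp(250, ("0/1", "1/1"), "1/1"),  # heterozygous parent: excluded
]
positions, string = build_snp_string(snps)
print(positions, string)  # [100, 5000000] ['0/0', '1/1']
```

The two retained SNPs become neighbours in the string even though they lie almost 5 Mb apart, which is the sense in which the gaps are removed.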
  • in step (s2), scoring is performed according to the scoring rules in Table A.
  • in step (s3), for each chromosomal region of the progeny, the corresponding genotype is determined based on the score value or score value curve of each parent.
  • in step (s3), the genotype of each chromosomal region is determined based on the score value and the standard deviation.
  • for a chromosomal region whose genotype is to be determined: if a parent A has a high score value close to full score (≥80% of full score, preferably ≥80% of full score), the score value of that parent in the region is quite stable without much numerical fluctuation, and the score values of the remaining parents are low (≤50% of full score, preferably ≤30% of full score) or fluctuate greatly, then the genotype of this chromosomal region is determined as the genotype of parent A.
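
The decision rule above can be sketched as a small function. The 80% and 50% thresholds come from the text; the cut-off used to decide that a curve is "stable" (`max_spread`, the standard deviation as a fraction of full score) is not given in the source and is an assumption, as are the function and parameter names.

```python
from statistics import mean, pstdev


def call_region(scores, full_score, high=0.8, low=0.5, max_spread=0.05):
    """Return the parent whose genotype the region is assigned to, or None.

    scores      -- dict: parent name -> list of window scores in the region
    full_score  -- maximum attainable window score
    high, low   -- fractions of full score taken from the text (80%, 50%)
    max_spread  -- ASSUMED cut-off for a "stable" curve: standard deviation
                   as a fraction of full score (not given in the source)
    """
    stats = {
        parent: (mean(vals) / full_score, pstdev(vals) / full_score)
        for parent, vals in scores.items()
    }
    for parent, (level, spread) in stats.items():
        if level < high or spread > max_spread:
            continue  # this parent is not both high and stable
        others = [s for name, s in stats.items() if name != parent]
        # every other parent must be low or strongly fluctuating
        if all(lv <= low or sp > max_spread for lv, sp in others):
            return parent
    return None  # no clear call; the region stays undetermined


region = {
    "parent1": [100, 180, 20, 150],
    "parent4": [295, 298, 296, 297],
}
print(call_region(region, full_score=300))  # parent4
```

Returning `None` when two parents both score high and stable leaves room for the separate heterozygous-region and unknown-region handling described elsewhere in the text.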
  • in step (s3), the method includes: sliding the window over the SNP sites of the whole genome to obtain the score value of each parent on each chromosome, and drawing the score curve of each parent with the score value as the ordinate and the position of each sliding window on the chromosome as the abscissa.
  • in step (s3), a sub-step of evaluating heterozygous regions is included:
  • in step (s3), the degree of similarity between the offspring and each parent in a given segment is quantified, and the genotype of each segment is determined according to the numerical characteristics (value level and standard deviation) of the score curve of each parent.
  • in step (s3), the genotype assessment is performed in the following manner:
  • the method further includes: if the genotypes on both sides of an unknown region are the same, determining the unknown region as that genotype; and if the genotypes on the two sides are different, regarding the middle position of the unknown region as a recombination breakpoint and assigning each half of the unknown region the genotype of its adjacent side.
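
The handling of unknown regions can be sketched as below. The segment representation (`(start, end, genotype)` tuples with half-open coordinates and `None` for "unknown") and the function name are assumptions for illustration; unknown regions at a chromosome end, or next to another unknown region, are simply left unresolved here because the text does not say how they are treated.

```python
def resolve_unknown(segments):
    """Assign genotypes to unknown regions from their flanking regions.

    segments -- list of (start, end, genotype) sorted by position;
                genotype is None for an "unknown" region.
    """
    resolved = []
    for i, (start, end, genotype) in enumerate(segments):
        if genotype is not None:
            resolved.append((start, end, genotype))
            continue
        left = segments[i - 1][2] if i > 0 else None
        right = segments[i + 1][2] if i + 1 < len(segments) else None
        if left is None or right is None:
            resolved.append((start, end, None))  # cannot be decided here
        elif left == right:
            resolved.append((start, end, left))  # same genotype on both sides
        else:
            mid = (start + end) // 2             # recombination breakpoint
            resolved.append((start, mid, left))
            resolved.append((mid, end, right))
    return resolved


segments = [
    (0, 100, "A"),
    (100, 200, None),   # flanked by A and B: split at the midpoint
    (200, 300, "B"),
    (300, 400, None),   # flanked by B and B: becomes B
    (400, 500, "B"),
]
for seg in resolve_unknown(segments):
    print(seg)
```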
  • the progeny is a multi-parent plant.
  • n is 3-6, more preferably 3, 4 or 5.
  • the sequencing data is selected from the group consisting of genome sequencing data, RNA sequencing data, or a combination thereof.
  • sequencing data are files in fastq format.
  • the sliding window size is 170-500 consecutive SNP sites, preferably 200-400 consecutive SNP sites;
  • sequencing depth of the sequencing data is 0.1x-10x, preferably 0.2x-5x.
  • the sequencing depth of the sequencing data is ≥1x, preferably 1x-5x, more preferably 1.5x-3x.
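
As a rough guide to what these depth ranges mean in practice, average depth can be estimated as total sequenced bases divided by genome size. The read count, read length, and genome size below are hypothetical numbers chosen for the arithmetic, not values from the source, and the estimate ignores unmapped or duplicate reads.

```python
def sequencing_depth(n_reads, read_length, genome_size):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


# Hypothetical run: 5 million 150 bp reads on a 375 Mb genome.
depth = sequencing_depth(n_reads=5_000_000, read_length=150,
                         genome_size=375_000_000)
print(f"{depth:.1f}x")  # 2.0x
```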
  • each parental score curve is obtained.
  • the SNP site is used to determine the genotype of the individual
  • in step (b), the sequencing data (such as fastq files) are aligned and processed by the bwa and GATK software to obtain SNP information.
  • the SNP site information includes location information and genotype information.
  • the SNP site used for judging the genotype meets the following requirements:
  • SNP sites cover the whole genome as much as possible, and there will be no deletions in certain regions;
  • the SNP information (position information and genotype information) of the corresponding two parents and offspring is known, and the locus should be deleted if any of the three is unknown.
  • step (c) the evaluation result of each SNP of the progeny is recorded in the rlt file, and the rlt file records the genotype determination situation of each SNP position;
  • the distribution information of the recombination breakpoints on each chromosome of the whole genome of the progeny is recorded in a bin file, and the bin file records the distribution of the recombination breakpoints on the 12 chromosomes of the whole genome.
  • in step (c), genotype reading and recombination break site determination are performed by the SNPwindow script.
  • in step (d), genotype maps are constructed for the m progeny individuals at the same time
  • in step (d), the recombination map is constructed by the SNPwindow script, and the gene map of each progeny individual is drawn by the SNP2png script.
  • step (d) also includes aligning the recombination maps of the individuals through the Bin2MCD script to generate a recombination bin map.
  • the resolution of the recombination bin map is one bin per 5-200kb, preferably one bin per 10-100kb.
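
To make the bin resolution concrete, the sketch below converts one individual's recombination segments into genotypes on a fixed grid of bins. It is not the algorithm of the Bin2MCD script, which is not described here: the fixed-width grid, the rule of taking the genotype at each bin's midpoint, and all names are assumptions for illustration.

```python
def to_bin_map(segments, chrom_length, bin_size=100_000):
    """Return one genotype per fixed-width bin along a chromosome.

    segments -- sorted, non-overlapping (start, end, genotype) tuples that
                cover the chromosome, with half-open coordinates
    Each bin takes the genotype of the segment containing its midpoint
    (an ASSUMED rule, chosen for simplicity).
    """
    bins = []
    idx = 0
    for bin_start in range(0, chrom_length, bin_size):
        bin_end = min(bin_start + bin_size, chrom_length)
        mid = (bin_start + bin_end) // 2
        # advance to the segment that contains the bin midpoint
        while idx < len(segments) - 1 and segments[idx][1] <= mid:
            idx += 1
        bins.append(segments[idx][2])
    return bins


segments = [(0, 250_000, "A"), (250_000, 500_000, "B")]
print(to_bin_map(segments, chrom_length=500_000))
# ['A', 'A', 'B', 'B', 'B']
```

Running the same grid over every individual gives rows of equal length that can be stacked into a population-wide bin matrix.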
  • the method further comprises: processing the recombination bin map to obtain the genetic map of the progeny.
  • the method further comprises: performing QTL analysis on the genetic map.
  • the method further includes: performing a visual analysis on the genotypes of the entire population of parents and progeny, generating genotype data, and constructing a linkage map based on the genotype data.
  • the plants include crops, preferably grass crops.
  • the crops include rice, wheat, soybean, and tobacco.
  • a data analysis device for identifying the genotypes of multi-parent plants comprising:
  • a data input module for inputting the data to be analyzed, the data including: the sequencing data Df of the progeny plant to be identified, and the sequencing data Dp of the parent plants corresponding to the progeny plant;
  • a multi-parent plant genotype identification module configured to perform the method described in the first aspect of the present invention, thereby obtaining the genotype identification result of the progeny;
  • the multi-parent plant genotype identification module includes:
  • an SNP site information analysis submodule, which is configured to determine the SNP site information of the parents and progeny based on the sequencing data Df and the sequencing data Dp;
  • Chromosomal recombination breakpoint analysis sub-module which is configured to judge the genotype of the progeny based on the SNP site information, so as to obtain the evaluation result of each SNP of the progeny and the whole genome of the progeny
  • a genotype map construction submodule, which is configured to construct and/or draw a genotype map of the progeny based on the SNP assessment result information of the progeny and the position information of the whole-genome recombination breakpoints, so as to obtain the genotyping results of the multi-parent progeny plants.
  • the plants include crops, preferably grass crops.
  • the output module includes: a display, a printer, a pad, and the like.
  • Figure 1 shows the simulated two-parental material genome-wide recombination breakpoints.
  • Figure 2 shows the simulated four-parental material genome-wide recombination breakpoints.
  • Figure 3 shows genotyping of two-parental mock progeny using a SNP-based sliding window approach.
  • Figure 4 shows the genotyping of two-parental mock progeny using the SEG-Map software method.
  • Figure 5 shows genotyping of four-parental mock progeny using a SNP-based sliding window approach.
  • Figure 6 shows the effect of different sliding window sizes on the accuracy of genotype determination results.
  • Figure 7 shows the effect of different sequencing depths on the accuracy of genotype determination results.
  • Figure 8 shows the analysis framework flow of the SNP-based sliding window genotyping method.
  • Figure 9 shows gene mapping using the SNP2png script.
  • Figure 10 shows a genotype identification ensemble plot of the rice population.
  • Figure 11 shows the genotype table of the recombinant inbred line individual recombination segment map in one example.
  • Figure 12 shows the SNP "string" with window 15.
  • Figure 13 shows the four parental score curves of rice chromosome 3 mock progeny.
  • Figure 14 shows two parental score curves for rice chromosome 11 mock progeny.
  • Figure 15 shows the scores of each parent when it is determined that the parents are three homozygous genotypes in one embodiment.
  • Figure 16 shows the scores of each parent when it is determined that the two parents are heterozygous genotypes in one embodiment.
  • Figure 17 shows the scores of each parent when the genotype is determined as unknown in one embodiment.
  • Figure 18 shows subsequent genotype determination of unknown regions in one embodiment.
  • Figure 19 shows a graph of the genotyping of single individuals in the DH population.
  • Figure 20 shows a graph of the genotyping ensemble for the DH population.
  • Figure 21 shows genotyping of three parental material in one example.
  • Figure 22 shows SEG-Map identification of three parental material in one example.
  • Figure 23 shows the genotyping of four parental mimics in one example.
  • Figure 24 shows the true genotypes of the four-parent mock material in one example.
  • After extensive and in-depth research, the present inventors have for the first time developed a method for more rapid and accurate genotype identification, thereby enabling more effective genetic mapping and genome analysis.
  • the method of the present invention is particularly suitable for genotyping and identification of multi-parent populations sequenced at low coverage.
  • the genotype information of the real SNPs of multiple parents and progeny in a given segment is read directly, the degree of similarity between the progeny and each parent in that segment is then quantified, and the genotype is determined according to the numerical characteristics (value level and standard deviation) of the score curve of each parent, forming an efficient, simplified and accurate method for multi-parent plant (or multi-parent crop) genotype identification.
  • the present invention has been completed on this basis.
  • the present inventors developed a high-throughput method to identify genotypes of recombinant populations containing multiple parents based on whole-genome low-coverage sequencing data generated by second-generation sequencing technology.
  • the inventors designed a "sliding window" method that determines the genotype of a segment by comprehensively analyzing the genotypes of multiple single nucleotide polymorphisms (SNPs) in a local region of the genome, and then determines the specific positions of the recombination break sites to construct a fine recombination map of the multi-parent population.
  • the inventors constructed simulated whole-genome sequencing data of biparental populations and multi-parental populations, constructed a genetic linkage map using this method, and finally compared the genotype information obtained by identification with the genotypes of the real simulated data.
  • the genotype identification accuracy of the population can reach 89.61%, which is similar to the accuracy of the inventor's SEG-Map software method for identifying the genotype of the parental population (the SEG-Map method has an accuracy of 89.32%).
  • the genotype identification method newly developed by the present inventors has an identification accuracy of 92.10% for multi-parent populations, which cannot be achieved by SEG-Map software or methods.
  • the method of the invention can effectively and quickly analyze the genotype of each individual in the population, plays a key guiding role in genome design and breeding, and can also provide fast and accurate genotype data for QTL mapping of multi-parent populations of different crops.
  • the present inventors tested the method using the real rice RIL genetic population, used high-throughput sequencing-based genotype identification, and finally obtained a fairly good high-precision recombination map.
  • this genotype identification method based on low-coverage genome sequencing can replace the traditional marker-based genotype identification method, and provides a powerful tool for large-scale gene discovery research and for solving more complex biological questions.
  • the method of the invention is more suitable for genotype identification of multi-parent backcross populations that have undergone low coverage sequencing, provides accurate genotype support for QTL mapping, and is also helpful for molecular design breeding applications of multi-parent populations.
  • the terms "containing" or "including" can be open, semi-closed, or closed. In other words, the terms also cover "consisting essentially of" and "consisting of."
  • the term "biparental" indicates that two parents are involved.
  • multi-parent indicates that 3 parents and more are involved.
  • multi-parent plant refers to plants involving 3 parents and more, e.g., progeny plants (e.g., crops) involving 3, 4, or 5 parents.
  • the invention provides a method for identifying multi-parent crop genotypes.
  • the method of the invention is a genotype identification method of the sliding window of the SNP site.
  • the data processing is optimized.
  • the optimized process can directly analyze and process the single-end or paired-end short-read sequencing results generated by next-generation sequencing technology, and finally construct the genetic map of the recombinant population.
  • the genome-wide SNPs of both parents need to be identified before proceeding with the data analysis pipeline.
  • these SNPs can be identified by high-coverage whole-genome deep sequencing, from existing genomic SNP information in the rice haplotype map, or by low-coverage whole-genome sequencing combined with imputation of missing genotypes (SNPs). Since SNPs between two parental varieties can be identified quickly and cost-effectively, sequencing-based genotype identification of a recombinant population mainly relies on the subsequent analysis, including reading genotypes, determining recombination breakpoints, and constructing genetic linkage maps.
  • the first step consists of several tasks that can be processed simultaneously. A certain number of individuals from the recombinant population, together with the parental material, are subjected to second-generation high-throughput sequencing at the same time. The resulting fastq files are aligned and processed by the bwa and GATK software to obtain high-quality SNP information.
  • the SNP loci used for the final determination of genotype should meet the following requirements:
  • SNP sites cover the whole genome as much as possible, and there will be no deletions in certain regions.
  • the SNP information (position information and genotype information) of the corresponding two parents and offspring is known; the locus should be deleted if any one of the three is not known.
  • the rice parent is an inbred homozygous line, and there are basically no heterozygous loci in its genome. Therefore, if a heterozygous SNP locus is found in a parent, the locus is generally considered not credible, so SNP sites at which either parent is heterozygous can be deleted.
  • a python script SNPwindow can be used to judge the genotype of the progeny.
  • the script output will have two files, the rlt file and the bin file.
  • the rlt file records the genotype determination of each SNP position
  • the bin file records the distribution of recombination breakpoints on the 12 chromosomes in the whole genome.
  • a genotype map can first be drawn from the rlt and bin files by a perl script, SNP2png, and the image is output in PNG format.
  • the map is drawn based on the genotype information of the determined SNP loci and the position information of the whole-genome recombination breakpoint. Different colors in the figure represent different genotype types.
  • a perl script can also be used to visualize the genotype profile of the entire population. Programs and scripts used in the analysis process are shown in italics and form a series of analysis steps. The genotype data generated at the end of the analysis process can be directly used in other software (including MapMaker and JoinMap) to construct linkage maps.
  • the bins are reorganized when analyzing the final output data produced by the software, usually at a resolution of one bin per 100kb, or even one bin per 10kb.
  • the genotype results of the mapping population can be imported into programs such as MapMaker [16] or JoinMap [32] for genetic map construction. With the genetic map available, QTL analysis is performed.
  • the genetic map produced by the method of the present invention is much finer in scale than maps produced by most conventional molecular markers.
  • the method of the present invention relates to judging the recombination break site, and its detailed process comprises the following steps:
  • Step 1: Construct the SNP "string".
  • the SNPs on the 12 chromosomes become 12 consecutive word strings (see Figure 12).
  • in the figure, blue represents the homozygous genotype of parent 1, red represents the homozygous genotype of parent 2, and yellow represents the heterozygous genotype.
  • the genome of cultivated rice parent material is highly homozygous, and in rice populations derived from multiple generations of selfing the genome is largely homozygous, with heterozygous regions at only some chromosomal locations. Therefore, the SNP loci included in the analysis are first screened, and any SNP locus at which a parent is heterozygous is excluded, since such loci cannot be accurately judged and scored. In addition, if the sequencing depth of the progeny is not very high, SNP loci that are heterozygous in the progeny can be filtered out, because a heterozygous call made at low depth is unreliable and is likely a misjudgment caused by sequencing errors.
  • Step 2 Score both parents in one window
  • the scores of all SNP sites in a sliding window are calculated and summed for each parent, and the total is taken as that parent's score at the chromosome position of the sliding window.
  • the degree of conformity of the offspring with the parent is measured according to the typing of each parent.
  • the scoring rules are as shown in Table A or similar scoring rules.
  • the scoring rules are formulated according to the genetic laws of organisms.
  • genotype scoring rules in the present invention are further described below.
  • the score of the offspring for any parent consists of three parts: 1. the offspring has the same SNP genotype as the parent at the site; 2. the offspring differs from the parent but conforms to Mendelian inheritance; 3.
  • the number of SNP loci in the offspring to be tested that are the same as that of the parent is m;
  • the number of loci [...] Mendel's law of inheritance, together with the number of misjudged loci caused by various possible factors, is e.
  • s_1 is the score value of an SNP site at which the progeny and the parent are the same.
  • s_2 is the score value of a locus that differs from the parent but conforms to the Mendelian inheritance law.
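
A sliding-window score of this kind can be sketched as below. Table A is not reproduced in this text, so the values `s1=1.0` and `s2=0.5`, the zero score for all other sites, and the rule that a heterozygous progeny site carrying the parent's allele counts as "different but conforming to Mendelian inheritance" are all assumptions made for illustration.

```python
def site_score(progeny, parent, s1=1.0, s2=0.5):
    """Score one SNP site of the progeny against one homozygous parent."""
    if progeny == parent:
        return s1                      # same genotype as the parent
    alleles = set(progeny.split("/"))
    if len(alleles) == 2 and parent.split("/")[0] in alleles:
        return s2                      # heterozygous, carries parent allele
    return 0.0                         # ASSUMED: everything else scores 0


def window_scores(progeny, parent, window):
    """Total score of each window of `window` consecutive SNP sites."""
    per_site = [site_score(g, p) for g, p in zip(progeny, parent)]
    total = sum(per_site[:window])
    scores = [total]
    for i in range(window, len(per_site)):
        total += per_site[i] - per_site[i - window]  # slide by one SNP
        scores.append(total)
    return scores


progeny = ["0/0", "0/0", "0/1", "1/1", "1/1"]
parent1 = ["0/0"] * 5
parent2 = ["1/1"] * 5
print(window_scores(progeny, parent1, window=3))  # [2.5, 1.5, 0.5]
print(window_scores(progeny, parent2, window=3))  # [0.5, 1.5, 2.5]
```

The two curves cross as the window moves from a parent 1 stretch into a parent 2 stretch, which is the pattern the later steps read as a recombination breakpoint.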
  • in a continuous SNP frame of size N, there are i parents to be determined.
  • the genotypes of the progeny and the parent at this locus are g_k and g′_k, respectively.
  • the genotypes of the genes of the pure line parents of rice are generally 0/0, 0
  • the genotypes of the offspring are generally 0/0, 0
  • the frequency of 2 alleles is generally less, and is not considered for the time being.
  • the probability that the offspring matches its genotype is:
  • the genotype of a certain region of the offspring is determined to the parent genotype with the highest coincidence probability, that is, the maximum coincidence probability among i parents is obtained:
  • $P_{\max} = \max\{P_1, P_2, \ldots, P_i\}$
  • the following table is used for genotype scoring.
  • the standard deviation (std) is calculated with a sliding window over the consecutive score values of each parent.
  • the score S is the highest, and the standard deviation is the smallest.
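
The standard deviation over consecutive score values can be computed with a second, short sliding window, as in the sketch below. The span length and the use of the population standard deviation (`statistics.pstdev`) rather than the sample standard deviation are assumptions; the source does not specify either.

```python
from statistics import pstdev


def rolling_std(scores, span):
    """Standard deviation of each run of `span` consecutive window scores."""
    return [pstdev(scores[i:i + span]) for i in range(len(scores) - span + 1)]


stable = [200, 200, 200, 200]       # parent matching the progeny
fluctuating = [0, 200, 0, 200]      # parent not matching the progeny
print(rolling_std(stable, span=3))       # [0.0, 0.0]
print(rolling_std(fluctuating, span=2))  # [100.0, 100.0, 100.0]
```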
  • Step 3 Determine the genotype of the chromosome region according to the score value
  • the score value of each parent on each chromosome can be obtained. Take the score as the ordinate, and draw the score curve of each parent with the position of each sliding window on the chromosome as the abscissa.
  • genotype judgment of each chromosome is based on the characteristics of different parental score curves.
  • a sliding window score was performed on the progeny of the simulated four-parent source, and the score curves of the four parents were drawn according to the score values of the four parents.
  • the yellow curve (parent 4) has a high score value close to full score in this region, and the score value of this parent in the region is quite stable, without large numerical fluctuation.
  • the numerical fluctuation of the score is measured by the standard deviation.
  • parent 4 has a high score value and a small standard deviation in this region, while the score values of the other three parents fluctuate in the range of 0-200 with a high standard deviation, so this region can be judged to be the homozygous genotype of parent 4.
  • the offspring genotypes of different regions of the 12 chromosomes can be determined based on the score values.
  • the rectangular bar corresponding to true in the figure corresponds to the real genotype information of the simulated offspring of each chromosome segment, while the rectangular bar corresponding to judge represents the offspring genotype determined by the method of the present invention, and the information of the two basically matches.
  • the judgment of the heterozygous region is illustrated by the genotype judgment of the simulated progeny of the two parents. According to the principle of genetics, even if it is a hybrid progeny derived from multiple parents, its parental origin in a certain chromosomal region is at most two parents. Therefore, it can be judged whether this region is a heterozygous region according to the score curve of the two parents in this region.
  • the genotypes of the simulated offspring are identified: the rectangular bar labeled true corresponds to the real genotype information of the simulated offspring for each chromosome segment, and the rectangular bar labeled judge represents the genotype of the offspring determined by the method of the invention. Similarly, when the score of one parent is high with a small standard deviation, and the score of the other parent fluctuates considerably with a large standard deviation, the region is judged to be the homozygous genotype of the former (orange or blue area).
  • One of the core ideas of the method of the present invention is to directly read the genotype information of the real SNPs of the parents and the progeny in a given segment, and then to quantify the degree of similarity between the progeny and each parent in that segment. The numerical characteristics of the score curves form a relatively simplified analysis model, from which the genotype of each segment is determined.
  • the criteria for the judgment of the present invention mainly include the following situations:
  • the region to be judged is first defined as "unknown", and its genotype is determined by the genotypes of the regions on both sides. If the genotypes on both sides are the same, the region is determined as this genotype; if the genotypes on both sides are different, the middle position of the region is regarded as a recombination breakpoint, and the two halves of the region take the genotypes of their respective sides.
  • genotype determination is performed through a secondary sliding window.
  • a sliding window was performed on the genotype of the SNP, and a parental score value in each window was counted.
  • the determination of the final genotype depends on the score value and the size of the standard deviation obtained by the secondary sliding window, and the determination of the genotype is carried out by the highest probability that a certain segment of the offspring belongs to a certain parent.
  • genotype determination can be performed faster and more accurately by using the secondary sliding window for genotype determination.
  • a schematic example of a secondary sliding window is as follows:
  • the present invention also provides an identification device or an analysis device for multi-parent crop genotypes for performing the method of the present invention.
  • the device includes:
  • a multi-parent plant genotype identification module is configured to perform the method of the present invention, thereby obtaining the genotype identification result of the progeny;
  • the present invention provides a multi-parent crop genotype identification method based on high-throughput sequencing data for the first time. Before the present invention, there was no systematic method for identifying the genotypes of crops derived from multiple parents.
  • the high-throughput genotype identification method of the present invention can greatly simplify and accelerate the genetic mapping of quantitative traits in crops [37-39,20] .
  • the theoretical method of the present invention can better cooperate with the multi-parent population for genotype identification, improve the accuracy and efficiency of QTL mapping, and make full use of the abundant genetic variation existing in the multi-parent population. It also contributes to the improvement of crop genetic quality and the design of molecular breeding.
  • the present invention can be used to obtain molecular markers closely linked to genes for important agronomic traits, to screen offspring efficiently during breeding, to finely identify the genotype maps of improved varieties, and so on. It provides a fast and efficient means and platform for molecular marker-assisted screening and breeding, raising their efficiency and accuracy to a new level.
  • sequencing-based high-throughput genotyping method of the present invention will provide convenience for solving complex biological problems and improving crop breeding.
  • the GenomicsDBImport program in the GATK package is used to merge all the intermediate variant files, the GenotypeGVCFs program in the GATK package is then used to export the merged variant file, the SelectVariants program is used to select the required SNP site information, and the result is then passed through the VariantFiltration program (parameters are --cluster-size 3 --cluster-window-size 10 --filter-expression "QD<10.00" --filter-name lowQD --filter-expression "FS>15.000" --filter-name highFS --genotype-filter-expression "DP>50
  • a python script After obtaining the SNP information of parents and progeny through GATK software, a python script is used, the principle is to regionalize the SNPs identified by each individual along the sliding window of all SNP sites for comprehensive analysis, based on a fixed-length sliding window to read Take the genotype, then judge the recombination breakpoint and construct the recombination segment map.
  • a perl script uses the intermediate file determined by the program to generate a PNG format recombination segment map for each individual, which is convenient for users to intuitively browse their overall genotype.
  • the GD module in Perl needs to be used when drawing.
  • Bin2MCD Another script, Bin2MCD was next used to generate a high-density map consisting of recombinant bins [19] for subsequent QTL analysis.
  • output files can be used directly to identify QTLs by several QTL analysis software packages, including Windows QTL Cartographer V2.5 [17] .
  • The rice DH population used in this study was constructed by the laboratory of the National Genetic Research Center of the Chinese Academy of Sciences. Its two parents are Kasalath and japonica cv. Nipponbare. The DH population consists of lines produced from the F2 progeny after many years of selfing. The inventors selected dozens of lines for genotype identification and analysis.
  • The three-parent rice plants used in this study were constructed by the laboratory of the National Genetic Research Center of the Chinese Academy of Sciences. The three parents are Wushan Simiao, 93-11 and Shuohui 70. The plants in this population were produced by selfing the hybrid progeny of the three parents, and their genomes carry abundant recombination information.
  • The rice DH population was genotyped using the method of the present invention, and a high-density map composed of recombination bins was generated with Bin2MCD.
  • Genotype analysis and high-density bin map construction were also performed using the method published in 2010.
  • The high-depth (20-30x) sequencing data of the two parents, Kasalath and Nipponbare, were aligned to the Nipponbare reference genome IRGSP 1.0 using bwa, and GATK was then used to find high-quality SNPs between the two parents.
  • A Perl script then substitutes the SNPs at the specified loci on the Nipponbare genome, thereby generating a pseudo reference for the two parents.
  • The low-coverage sequencing data of the DH population were then aligned to the pseudo reference of the two parents for genotyping.
  • The genotype information predicted for the progeny that the inventors simulated from the two parents should be consistent with it.
  • The generated simulation data include three cases: regions homozygous for Wushan Simiao, regions homozygous for 93-11, and regions heterozygous between Wushan Simiao and 93-11.
  • The figure shows the expected length of each region and the locations of the recombination breakpoints.
  • The simulated data are produced from the real sequencing data of the two parents. First, the fastq data of the two parents are aligned to the rice Nipponbare genome, and the required alignment information (chromosome and position) is selected from the resulting sam file. The fastq records from the two parents are then recombined to form the fastq data of the simulated hybrid progeny.
  • The fastq data of the simulated progeny, together with those of the two parents Wushan Simiao and 93-11, were aligned to the rice reference genome IRGSP 1.0. GATK was then used to find genome-wide variants in the two parents and the simulated progeny, which were filtered to obtain high-quality SNP sites.
  • The "sliding window" method is used to judge the SNPs across the whole genome: the two parents are scored and compared within each sliding window. If one parent's score is clearly higher, the segment is judged to be homozygous for that parent (shown in red or blue in the figure). When the scores of the two parents do not differ significantly, the segment is judged to be a heterozygous region of the two parents (shown in yellow in the figure).
  • The inventors designed a quantitative measure of judgment accuracy: the whole genome is divided into thousands of small regions of 100kb (or of 20-200kb), and the results obtained by the method of the present invention are compared with the standard map region by region.
  • The degree of agreement measures the accuracy of the method of the present invention. By this measure, comparing the genotype information obtained from the simulated data with the true genotype of the simulated data, the accuracy of identification for the two parents reaches 89.61%.
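The region-by-region agreement measure described above can be sketched as follows. This is a minimal illustration, not the inventors' script: the function names, the (start, end, label) segment representation, and the choice to represent each region by the genotype at its midpoint are all assumptions made for the example.

```python
def genotype_at(segments, pos):
    """Return the label of the half-open segment [start, end) covering pos."""
    for start, end, label in segments:
        if start <= pos < end:
            return label
    return None


def bin_accuracy(true_segments, called_segments, chrom_length, bin_size=100_000):
    """Fraction of fixed-size regions whose called genotype matches the truth.

    Assumption: each region is represented by the genotype at its midpoint.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    agree = 0
    for i in range(n_bins):
        mid = min(i * bin_size + bin_size // 2, chrom_length - 1)
        if genotype_at(true_segments, mid) == genotype_at(called_segments, mid):
            agree += 1
    return agree / n_bins


# Hypothetical 1 Mb chromosome: the true breakpoint is at 500 kb,
# the called breakpoint is 100 kb too far to the right.
truth = [(0, 500_000, "P1"), (500_000, 1_000_000, "P2")]
called = [(0, 600_000, "P1"), (600_000, 1_000_000, "P2")]
print(bin_accuracy(truth, called, 1_000_000))
```

In this toy case one of the ten 100kb regions disagrees, so the agreement is 0.9. A genome-wide figure would pool the regions of all chromosomes.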
  • The inventors also used the published SEG-Map method to judge the genotype of the simulated progeny data: the fastq files of the simulated data were aligned to the pseudo reference of the two parents, parent-specific fastq sequences were selected with the software, SNP site information was determined from the alignment positions, and the genotype was then determined with the sliding window method.
  • This method has been verified theoretically and by data simulation in more detail in the published articles, and has high accuracy and feasibility.
  • The accuracy obtained with SEG-Map is 89.32%, which differs little from that of the method of the present invention, indicating that the method of the present invention has high feasibility and accuracy.
  • The SEG-Map method is indeed highly reliable for genotype identification with two parents, and the inventors have long used it in the genome analysis of rice materials. However, it cannot identify genotypes in materials derived from multiple parents, and the method of the present invention is intended to solve this problem.
  • The inventors used the SNP-based sliding window method to identify the genotypes of simulated progeny derived from four parents, scoring the four parents within each window.
  • The region is determined to be the homozygous region of the parent; the homozygous regions of the four parents are shown in red, blue, green and yellow, respectively, in the figure.
  • The fastq data of the four parents were used in 100 simulations, and the same method of dividing the genome into small regions was used to quantify the accuracy.
  • The average accuracy of the method of the present invention for genotype identification on the four-parent simulated data is 92.10%.
  • Genotypes are read from the scores of the two parents' SNPs as the "window" slides along the chromosome. A genotype does not change until a recombination breakpoint is encountered.
  • The inventors found two types of breakpoints: one separates two homozygous genotypes, and the other separates a homozygous segment from a heterozygous segment. The former predominates in RILs, while the latter is mostly found in F2 populations.
  • When the sliding window passes a "homozygous/homozygous" breakpoint, the call briefly changes from a homozygous genotype to a heterozygous genotype and then back to a homozygous genotype.
  • When the sliding window passes a "homozygous/heterozygous" breakpoint, the homozygous genotype becomes a heterozygous genotype and then changes to a homozygous genotype again; from this, the boundary between the homozygous region and the heterozygous region can be determined.
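One way to turn per-window calls into the two breakpoint types described above is sketched below. It is an illustration under stated assumptions, not the patented script: the names `collapse_runs` and `find_breakpoints`, the "H" label for heterozygous windows, and the `max_transient` cutoff (how many heterozygous windows still count as the brief transition at a homozygous/homozygous breakpoint) are all hypothetical.

```python
from itertools import groupby


def collapse_runs(calls):
    """Collapse per-window calls [(position, genotype), ...] into runs
    (start, end, genotype, n_windows) of identical consecutive genotypes."""
    runs = []
    for genotype, group in groupby(calls, key=lambda call: call[1]):
        positions = [pos for pos, _ in group]
        runs.append((positions[0], positions[-1], genotype, len(positions)))
    return runs


def find_breakpoints(runs, het="H", max_transient=3):
    """Return [(position, kind), ...] with kind 'hom/hom' or 'hom/het'."""
    breakpoints = []
    i = 0
    while i < len(runs) - 1:
        left, right = runs[i], runs[i + 1]
        nxt = runs[i + 2] if i + 2 < len(runs) else None
        # A short heterozygous run between two different homozygous runs is
        # treated as the brief transition at a homozygous/homozygous breakpoint.
        transient = (
            left[2] != het
            and right[2] == het
            and right[3] <= max_transient
            and nxt is not None
            and nxt[2] not in (het, left[2])
        )
        if transient:
            breakpoints.append(((right[0] + right[1]) // 2, "hom/hom"))
            i += 2
        else:
            kind = "hom/het" if het in (left[2], right[2]) else "hom/hom"
            breakpoints.append(((left[1] + right[0]) // 2, kind))
            i += 1
    return breakpoints


calls = [(100, "A"), (200, "A"), (300, "H"), (400, "B"), (500, "B"),
         (600, "H"), (700, "H"), (800, "H"), (900, "H")]
print(find_breakpoints(collapse_runs(calls), max_transient=1))
```

Here the single heterozygous window at position 300 is read as a homozygous/homozygous breakpoint, while the long heterozygous run starting at 600 gives a homozygous/heterozygous boundary.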
  • The present invention applied different window sizes to the genotype analysis of the final SNP set selected from the four-parent simulated data, and found that the sliding window size does affect the final accuracy.
  • When the window size is small (less than 199), the final accuracy is below 90%.
  • When the sliding window size is increased to 199, the genotype identification accuracy reaches 93.72%; as the window size increases further, the final accuracy changes little, indicating that accuracy does not keep increasing with the sliding window size.
  • A larger sliding window also requires more computing resources and time, and the time cost becomes more prominent when large populations must be processed. Weighing time cost against accuracy, the inventors consider a sliding window size of 199 (or of 180-220) a reasonable choice.
  • SEG-Map can perform genotype identification at low sequencing depth, so the method of the present invention was also tested with respect to sequencing depth.
  • The genome-wide SNPs of both parents need to be identified before proceeding with the data analysis pipeline.
  • These SNPs can be identified by high-coverage whole-genome deep sequencing, from existing genomic SNP information in the rice haplotype map, or by low-coverage whole-genome sequencing combined with imputation of missing genotypes (SNPs). Since SNPs between two parental varieties can be identified quickly and cost-effectively, sequencing-based genotype identification of a recombinant population relies mainly on the subsequent analysis: reading genotypes, determining recombination breakpoints, and constructing genetic linkage maps.
  • The functions, steps, and software (scripts) in the data analysis are shown in Figure 8.
  • The first step consists of several tasks that can be processed simultaneously. A number of individuals from the recombinant population and the parental materials are subjected to second-generation high-throughput sequencing at the same time. The resulting fastq files are aligned and processed with bwa and GATK to obtain high-quality SNP information.
  • The SNP loci used for the final determination of the genotype should meet the following requirements: 1. The SNP loci should cover the whole genome as fully as possible, with no region left without SNPs. 2. The SNP information (position and genotype) of the two parents and the simulated offspring must all be known; if any one of the three is unknown, the locus is deleted. 3. Rice parents are generally inbred homozygous lines with essentially no heterozygous loci in the genome, so a heterozygous SNP locus found in a parent is generally considered unreliable; SNP sites at which either parent is heterozygous are therefore deleted.
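Requirements 2 and 3 above amount to a per-site filter, which could look roughly like this. The sketch assumes unphased genotype strings such as "A/A", with "./." or an absent entry meaning missing; the site dictionaries, the function names, and the optional `drop_het_progeny` switch are illustrative assumptions rather than the format used by the inventors' scripts. Requirement 1 (genome-wide coverage) is a property of the whole SNP set and is not checked here.

```python
def is_missing(gt):
    """True if the genotype is absent or contains a no-call ('.')."""
    return gt is None or "." in gt


def is_het(gt):
    """True if the two alleles of an unphased genotype like 'A/G' differ."""
    first, second = gt.split("/")
    return first != second


def filter_snps(sites, parents, progeny, drop_het_progeny=False):
    """Keep sites that are known in every sample and homozygous in every parent."""
    kept = []
    samples = list(parents) + [progeny]
    for site in sites:
        gts = site["gt"]
        if any(is_missing(gts.get(name)) for name in samples):
            continue  # requirement 2: all genotypes must be known
        if any(is_het(gts[name]) for name in parents):
            continue  # requirement 3: no heterozygous parent
        if drop_het_progeny and is_het(gts[progeny]):
            continue  # optional, for low-depth progeny
        kept.append(site)
    return kept


sites = [
    {"pos": 1, "gt": {"P1": "A/A", "P2": "G/G", "kid": "A/A"}},
    {"pos": 2, "gt": {"P1": "A/G", "P2": "G/G", "kid": "G/G"}},
    {"pos": 3, "gt": {"P1": "C/C", "P2": "./.", "kid": "C/C"}},
    {"pos": 4, "gt": {"P1": "T/T", "P2": "C/C", "kid": "T/C"}},
    {"pos": 5, "gt": {"P1": "T/T", "P2": "C/C"}},
]
print([s["pos"] for s in filter_snps(sites, ["P1", "P2"], "kid")])
```

The same function works unchanged for three or four parents, since the parent list is a parameter.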
  • The script outputs two files, the rlt file and the bin file.
  • The rlt file records the genotype determination at each SNP position.
  • The bin file records the distribution of recombination breakpoints on the 12 chromosomes of the whole genome.
  • A genotype map is generally drawn first from the rlt and bin files with a Perl script, SNP2png; the image is in PNG format.
  • The map is drawn according to the genotype information of the SNP loci determined by the program and the positions of the genome-wide recombination breakpoints. Different colors represent different genotypes in the map.
  • A Perl script can also be used to visualize the genotype of the entire population. Programs and scripts used in the analysis process are shown in italics and form a series of analysis steps. The genotype data generated at the end of the analysis process can be used directly in other software (including MapMaker and JoinMap) to construct linkage maps.
  • When the final output data produced by the software are analyzed, the bins are reorganized, usually at a resolution of one bin per 100kb, or even one bin per 10kb.
  • The genotype results of the mapping population can be imported into programs such as MapMaker [16] or JoinMap [32] for genetic map construction. With the genetic map available, QTL analysis is performed.
  • This genetic map is much finer in scale than maps produced with most traditional molecular markers.
  • This package is compatible with multiple platforms (e.g. Unix, Linux and Windows).
  • In addition to the Perl environment itself, the GD module needs to be installed, because the pipeline includes drawing steps.
  • Step 1: Construct the SNP "string".
  • The SNPs on the 12 chromosomes become 12 continuous strings (Fig. 12).
  • The blue in the figure represents the genotype of parent 1, and the red represents the genotype of parent 2.
  • Color key: homozygous genotype of parent 1, blue; homozygous genotype of parent 2, red; heterozygous genotype, yellow.
  • The genomes of cultivated rice parent materials are highly homozygous, and the genomes of rice populations selfed for multiple generations are also largely homozygous, with heterozygous regions at only some chromosomal locations. The SNP loci included in the analysis are therefore screened first, and any SNP locus at which a parent is heterozygous is excluded, because such loci cannot be judged and scored accurately. In addition, if the sequencing depth of the progeny is not high, SNP loci that are heterozygous in the progeny can be filtered out, because heterozygous calls based on low depth are unreliable and likely to be misjudgments caused by sequencing errors.
  • Step 2: Score both parents in one window
  • The scores of all SNP sites in a sliding window are calculated, and each parent's total is taken as that parent's score at the chromosomal position of the window.
  • The degree to which the offspring conforms to each parent is measured according to each parent's genotype.
  • The preferred scoring rules are shown in Table A; the scoring rules of the present invention are formulated according to the laws of inheritance.
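The window scoring of Step 2 can be sketched as below. Table A is not reproduced in this extract, so the per-SNP rule used here (1 point when the progeny genotype equals the parent's homozygous genotype, 0 otherwise) is a stand-in assumption, as are the function names and the single-allele encoding of homozygous genotypes. The window advances one SNP at a time.

```python
def snp_score(progeny_gt, parent_gt):
    """Stand-in scoring rule (assumption, not Table A): 1 for a match, else 0."""
    return 1 if progeny_gt == parent_gt else 0


def window_scores(progeny, parents, window=199):
    """Score every parent in every sliding window of `window` consecutive SNPs.

    progeny: list of genotypes along one chromosome.
    parents: dict mapping parent name -> list of genotypes at the same sites.
    Returns a dict mapping parent name -> list of window totals.
    If fewer than `window` SNPs exist, one partial window is returned.
    """
    n = len(progeny)
    scores = {}
    for name, parent_gts in parents.items():
        per_snp = [snp_score(o, p) for o, p in zip(progeny, parent_gts)]
        total = sum(per_snp[:window])
        totals = [total]
        for i in range(window, n):
            total += per_snp[i] - per_snp[i - window]  # slide by one SNP
            totals.append(total)
        scores[name] = totals
    return scores


progeny = ["A", "A", "G", "G", "T"]
parents = {
    "P1": ["A", "A", "C", "C", "C"],
    "P2": ["T", "T", "G", "G", "T"],
}
print(window_scores(progeny, parents, window=3))
```

With a window of 3 SNPs, P1 scores 2, 1, 0 and P2 scores 1, 2, 3 across the three windows, showing the switch from P1-like to P2-like genotype along the toy chromosome.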
  • Step 3: Determine the genotype of the chromosome region according to the score value
  • The score of each parent along each chromosome can thus be obtained. With the score as the ordinate and the position of each sliding window on the chromosome as the abscissa, a score curve is drawn for each parent.
  • The genotype of each chromosome is judged from the characteristics of the different parental score curves. As shown in Figure 13, the offspring simulated from four parents were scored by sliding window, and the score curves of the four parents were drawn from their score values.
  • The yellow curve (parent 4) has a high score, close to the full score, in this region, and the parent's score in this region is quite stable, without large fluctuations.
  • The fluctuation of the score is measured by the standard deviation.
  • Parent 4 has a high score and a small standard deviation in this region, while the scores of the other three parents in this region fluctuate in the range of 0-200, with a high standard deviation, so this region is judged to be the homozygous genotype of parent 4.
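The judgment by score level and standard deviation can be illustrated with a small helper. The text gives no numeric cutoffs, so `min_frac` (the fraction of the full score required) and `max_sd` are hypothetical thresholds chosen only for this example, and regions that no single parent wins are simply returned as "unknown".

```python
from statistics import mean, pstdev


def judge_region(region_scores, window=199, min_frac=0.9, max_sd=10.0):
    """Call a region homozygous for a parent whose window scores are both
    high (mean >= min_frac * window) and stable (standard deviation <= max_sd).

    region_scores: dict mapping parent name -> list of window scores in the region.
    Thresholds are illustrative assumptions, not values from the source.
    """
    candidates = []
    for name, scores in region_scores.items():
        if mean(scores) >= min_frac * window and pstdev(scores) <= max_sd:
            candidates.append(name)
    return candidates[0] if len(candidates) == 1 else "unknown"


region = {
    "parent1": [20, 150, 60, 180],     # fluctuates widely
    "parent4": [197, 198, 199, 198],   # near full score, stable
}
print(judge_region(region))
```

In the toy region only parent 4 combines a near-full score with a small standard deviation, so the region is called for parent 4.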
  • The offspring genotypes of different regions of the 12 chromosomes can be determined from the score values.
  • In the figure, the rectangular bar labeled "true" shows the real genotype of the simulated offspring for each chromosome segment, while the bar labeled "judge" shows the offspring genotype determined by the method of the present invention; the two largely match.
  • The judgment of heterozygous regions is illustrated with the genotype judgment of the simulated progeny of two parents.
  • Genotype identification is carried out on the simulated progeny; the bar labeled "true" shows the real genotype of the simulated progeny for each chromosome segment, and the bar labeled "judge" shows the offspring genotype determined by the method of the present invention.
  • The region is judged to be the homozygous genotype of the former (orange or blue area).
  • Step 5: Description of Judgment Criteria
  • The core idea of the method of the present invention is to read directly the real SNP genotypes of the parents and the offspring in a given segment and to quantify the similarity between the offspring and each parent in that segment. The numerical characteristics (level and standard deviation) of each parent's score curve form a relatively simple analysis model, from which the genotype of each segment is determined.
  • The criteria for judgment mainly include the following situations:
  • The region to be judged is first defined as "unknown", and its genotype is determined from the genotypes of the regions on both sides.
  • If the genotypes on both sides are the same, the region is assigned that genotype; if they differ, the midpoint of the region is taken as a recombination breakpoint, and each half of the region is assigned the genotype of its adjacent side.
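The rule for "unknown" regions can be written as a short pass over a list of segments. This is a sketch under assumptions: segments are (start, end, label) tuples, consecutive unknown segments have already been merged into one, and an unknown segment at a chromosome end simply takes the genotype of its only neighbor (a case the text does not address).

```python
def resolve_unknown(segments, unknown="unknown"):
    """Assign each unknown segment from its flanking genotypes.

    Same genotype on both sides: the segment takes that genotype.
    Different genotypes: the segment is split at its midpoint, which
    becomes a recombination breakpoint.
    """
    resolved = []
    for i, (start, end, label) in enumerate(segments):
        if label != unknown:
            resolved.append((start, end, label))
            continue
        left = segments[i - 1][2] if i > 0 else None
        right = segments[i + 1][2] if i + 1 < len(segments) else None
        if left is None or right is None or left == right:
            resolved.append((start, end, left or right or unknown))
        else:
            mid = (start + end) // 2
            resolved.append((start, mid, left))
            resolved.append((mid, end, right))
    # Merge adjacent segments that now carry the same genotype.
    merged = []
    for start, end, label in resolved:
        if merged and merged[-1][2] == label and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end, label)
        else:
            merged.append((start, end, label))
    return merged


segments = [(0, 100, "P1"), (100, 200, "unknown"), (200, 300, "P1"),
            (300, 400, "unknown"), (400, 500, "P2")]
print(resolve_unknown(segments))
```

The first unknown segment is absorbed into P1, and the second is split at position 350, giving two segments, P1 from 0 to 350 and P2 from 350 to 500.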
  • The two parents of the rice DH population used were Kasalath and Nipponbare. The population was formed by inducing haploids from the F1 generation of the biparental cross and doubling them. Its plants are homozygous and their selfed progeny are pure lines, so experiments can be repeated over many years and at multiple sites, making it an ideal material for studying genotype-by-environment interaction.
  • The high-depth sequencing data (20x-30x) of the two parents, Kasalath and Nipponbare, were aligned to the rice reference genome IRGSP 1.0 with bwa, and the required high-quality SNPs were then found with GATK. The SNP information of the parents was merged with that of all progeny into a single vcf file, from which the required variant sites could be conveniently extracted.
  • The average sequencing depth of each progeny in the DH population is about 0.02x, i.e. low-depth sequencing data.
  • For each progeny, its SNP information and the parental information are extracted separately and judged with the SNPwindow script, which yields an rlt file and a bin file for that progeny.
  • The genotype identification results were visualized using the result files obtained in the previous step.
  • The homozygous genotypes of the two parents are shown with Kasalath in red and Nipponbare in blue.
  • The reliability of the heterozygous regions is low; they may result from sequencing errors or from program misjudgments caused by low polymorphism between the two parents in these regions.
  • The Bin2MCD script takes the bin files of the entire population as input and computes a map file of the overall genotype distribution.
  • The map file divides the whole genome into many small bins, and the genotype of each bin is determined from the genotype results of the individual identification.
  • A Perl script was used to visualize the genotype information of the entire population and to calculate the proportion of each genotype at each bin position, an important parameter in population genetics research.
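The per-bin genotype proportions mentioned above are straightforward to compute once each individual has one genotype label per bin. The sketch below is in Python rather than Perl and assumes a simple in-memory layout (a dict of equal-length label lists); it does not read the actual map file format, which is not described in this extract.

```python
from collections import Counter


def genotype_proportions(population):
    """Proportion of each genotype at each bin across a population.

    population: dict mapping individual -> list of genotype labels, one per bin
    (all lists must have the same length).
    Returns a list with one {genotype: proportion} dict per bin.
    """
    rows = list(population.values())
    n_bins = len(rows[0])
    proportions = []
    for b in range(n_bins):
        counts = Counter(row[b] for row in rows)
        total = sum(counts.values())
        proportions.append({g: c / total for g, c in counts.items()})
    return proportions


population = {
    "line1": ["A", "A", "B"],
    "line2": ["A", "B", "B"],
    "line3": ["A", "H", "B"],
    "line4": ["B", "B", "B"],
}
for i, p in enumerate(genotype_proportions(population)):
    print(i, p)
```

Bins whose proportions deviate strongly from expectation can then be inspected, for example as candidate regions of segregation distortion.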
  • The red and blue ratio map in Figure 20 represents the proportion of the three genotypes in different bins.
  • The visualization of this step facilitates a quick, direct view of the population genotype.
  • The output map file, together with the phenotype data, can be used directly for QTL mapping in analysis software such as winQTL.
  • Multi-parent genotype identification was performed on the three-parent progeny materials grown in the laboratory.
  • The sequencing depth of the progeny is about 0.2x.
  • The three parent materials are Wushan Simiao, 93-11 and Shuohui 70, and the sequencing depth of the three parents is about 20x-30x.
  • The SNP information of the progeny and the three parents was merged into a single vcf file, from which the final high-quality SNPs were selected. The SNPwindow script was then used to judge the genotype of the offspring: in a window, if one parent has the highest score, the region is judged to be homozygous for that parent.
  • The bin file of recombination breakpoints is obtained from the multi-parent judgment, and a Perl script is used to visualize the result.
  • The red area corresponds to parent 1, Wushan Simiao.
  • The blue area corresponds to parent 2, 93-11.
  • The green area corresponds to parent 3, Shuohui 70.
  • The yellow area is the heterozygous region.
  • The SEG-Map method had previously been used to determine the genotype of this material. The laboratory's earlier genotype determination of these plants was therefore based mainly on the two parents Wushan Simiao and 93-11, and three genotypes were called: the Wushan Simiao homozygous genotype, the 93-11 homozygous genotype, and the heterozygous genotype of the two. From the three-parent results, the inventors found that segments judged heterozygous in the two-parent analysis likely correspond to the homozygous genotype of the third parent. The method of the present invention can thus make up for the deficiency of the earlier SEG-Map software while maintaining accuracy, and solves the problem of multi-parent genotype judgment.
  • The inventors used the real sequencing fastq data of four rice materials from the laboratory (93-11, Shuohui 70, Wushan Simiao, and Huang Huazhan) and extracted the reads of the corresponding regions according to the alignment results.
  • The reads were artificially combined and screened to produce the data of a simulated progeny whose true genotype and recombination breakpoints are known, so the simulated data can be used to evaluate the feasibility and accuracy of the present invention.
  • The inventors identified dozens of recombination breakpoints in the whole genome, and the chromosomal regions determined were also roughly consistent with the real genotype results.
  • The figure shows the real genotype information of the simulated progeny of the present invention.
  • The inventors examined the intermediate rlt files output by the judgment process to find the reasons for the discrepancies.
  • The possible reasons are as follows: 1. Because the sequencing depth of the progeny is not high, only part of the genome-wide variation is captured, and some important parent-distinguishing sites may be missed, so that the true parent cannot be distinguished in some regions. 2. Two or more of the parents are very similar in certain regions, with no polymorphism between them; this is not caused by sequencing errors or sequencing depth, and for such high-similarity regions no judgment can be made for the time being.
  • The genotype judgment of such a region depends on the genotype information on both sides of it. Some regions therefore cannot be judged accurately and are assigned the most likely parental genotype based on the genotypes on both sides.
  • Multi-parent populations have great application prospects in genetic analysis. Selecting multiple parents increases the genetic diversity of the population, and combining multiple parents into one population through hybridization and selfing (or inbreeding) increases the number of recombination events. Multi-parent populations can not only increase the frequency of recombination and reveal the genetic basis of complex traits, but also have great potential in breeding applications because of the rich genetic basis of the selected parents. Compared with biparental populations, multi-parent populations have more parents, which increases the variation in the population, including allelic and phenotypic diversity, improves mapping accuracy and precision, and improves the efficiency of QTL detection.
  • The recombination events will improve the resolution of QTL mapping; because parental selection for multi-parent populations is more refined, i.e. the criteria are more stringent, and multiple parents increase the diversity of the genetic basis, the QTL results can be applied to breeding research.
  • Multi-parent populations are constructed by mixing multiple parents evenly. Compared with natural populations, the pedigree relationships are known and the population construction is documented in detail, so population stratification is avoided by experimental design, which in turn controls false positives in mapping results.
  • The present inventors developed a novel method for high-throughput genotyping in which SNPs are detected by whole-genome low-coverage resequencing.
  • This type of SNP data differs from traditional genetic markers in two main ways. First, random sequencing generally does not yield information on a given SNP locus for every line in a recombinant population. Second, a single SNP locus is not a reliable marker for genotyping because of potential sequencing errors.
  • To process these SNP data, whose unique properties arise from second-generation sequencing, the inventors further developed a new analysis framework: a "sliding window" method that determines the genotype of a segment from the genotypes of multiple SNPs at that local position.
  • SEG-Map: Sequencing Enabled Genotyping for Mapping recombination populations
  • GAII: Illumina Genome Analyzer II
  • After research, the inventors developed a novel processing and analysis pipeline with corresponding methods and devices. Besides optimizing steps of the earlier SEG-Map program and being compatible with current mainstream bioinformatics software and different types of high-throughput sequencing data, the pipeline of the present invention can, most importantly, quickly, accurately and reliably analyze the genotypes of populations constructed from multiple parents.
  • The method of the invention helps multi-parent populations to be better applied in crop breeding and allows more QTL sites to be identified accurately in multi-parent populations; genomic prediction for multi-parent populations also provides a basis for using them directly as germplasm resources in variety development.
  • The inventors used a high-throughput sequencing-based method to genotype recombinant inbred lines in rice, showing the advantages of this new genotyping method over the commonly used PCR-based method.
  • The inventors genotyped F8-generation individuals of this recombinant inbred line population using 287 insertion/deletion markers (including SSR markers). These markers were amplified by PCR and scored by agarose gel electrophoresis.
  • Each marker covers an average genetic distance of about 5cM, equivalent to a physical distance of about 1.4Mb, which is larger than most previously reported rice genetic maps. Designing, screening, and collecting these PCR markers took three researchers more than a year of work. In the study of recombinant inbred lines in rice, the inventors used Illumina GA to obtain an average marker coverage of 40kb per SNP in less than two weeks. Sequencing-based high-throughput genotyping is thus much faster, more efficient, and less expensive than traditional PCR-based genotyping.
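As a rough worked comparison of the two marker densities quoted above (assuming, for illustration, that both figures are average spacings between adjacent markers):

```latex
\frac{1.4\ \text{Mb per PCR marker}}{40\ \text{kb per SNP}}
  = \frac{1400\ \text{kb}}{40\ \text{kb}}
  = 35
```

On these figures the sequencing-based map is on the order of 35 times denser than the PCR marker map.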
  • The throughput of resequencing can be easily adjusted, which allows the inventors to reach a suitable marker density and breakpoint resolution with the least investment of time and resources.
  • The inventors can increase the resequencing coverage for all or part of the mapping population. Notably, with this method recombination breakpoints can be determined very accurately; with sufficiently high resequencing coverage they can in theory be located to within 1kb. Such fine resolution enables the detection of "double crossovers" that could not previously be identified with other types of genetic markers.
  • This method can improve the accuracy of QTL detection and mapping and increase the efficiency and success rate of gene cloning.
  • Precise identification of recombination breakpoints also enables the study of genomic regions with specific genetic properties, such as recombination hotspots.
  • This high-throughput genotype identification method, combined with second-generation sequencing technology, will greatly simplify and accelerate the genetic mapping of quantitative traits in crops [37-39,20].
  • The method proposed by the present inventors works well with multi-parent populations for genotype identification, improves the accuracy and efficiency of QTL mapping, and makes full use of the abundant genetic variation present in multi-parent populations. It also contributes to the improvement of crop genetic quality and to molecular breeding design.
  • This method can be used to obtain molecular markers closely linked to genes for important agronomic traits, to screen offspring efficiently during breeding, and to finely identify the genotype maps of improved varieties. As a fast and efficient means and platform, it raises efficiency and accuracy to a new level.
  • This sequencing-based high-throughput genotyping method will make it easier to solve complex biological problems and improve crop breeding.

Abstract

Genotype identification of a multi-parent crop on the basis of high-throughput whole genome sequencing. Specifically, the method comprises: (a) for n parents and the progeny thereof, providing sequencing data Df of a progeny crop to be identified and sequencing data Dp of the parent crops corresponding to the progeny crop, with n being a positive integer ≥ 3; (b) determining SNP site information of the parents and the progeny on the basis of the sequencing data Df and the sequencing data Dp; (c) determining the genotype of the progeny on the basis of the SNP site information, thereby obtaining an evaluation result for each SNP of the progeny and the distribution of recombination breakpoints on each chromosome of the whole genome of the progeny; and (d) constructing and/or drawing a genotype map of the progeny, thereby obtaining a genotype identification result of a multi-parent crop. The method can be used to identify the genotype of a multi-parent crop rapidly and accurately at high throughput.

Description

Genotyping of multi-parent crops based on high-throughput whole-genome sequencing
Technical Field
The present invention relates to the technical field of biological information processing, and in particular to the genotyping of multi-parent crops based on high-throughput whole-genome sequencing. More specifically, the present invention provides a method and a device for identifying the genotypes of multi-parent crops from high-throughput whole-genome sequencing data.
Background
At the end of the last century, the use of DNA molecular markers greatly promoted the development of reverse genetics. As molecular biology techniques advanced, the types of markers and the methods for constructing genetic maps were gradually developed and refined. The advent of the polymerase chain reaction (PCR) ushered in an era of explosive growth in the use of molecular markers, because PCR greatly simplifies the experimental steps of marker design and result analysis. These DNA molecular markers are still widely used, but they show increasing limitations in genome coverage, time, and cost. Today, the development of genomics and the gradual maturation of related techniques provide the basis for replacing marker-based mapping methods with genome-based high-throughput strategies.
Genome sequences opened the door to high-throughput genotyping. Initially, this was done with microarray technology, in which single nucleotide polymorphisms (SNPs) are detected by hybridizing genomic DNA to oligonucleotides on a gene chip. Because hundreds to thousands of markers can be detected in a single hybridization, this genotyping method substantially improves efficiency [1]. It has been applied to several model organisms, such as human, Arabidopsis, and rice [2-4]. Although the goal of high throughput has been achieved, microarray-based methods still have serious limitations: they are laborious and time-consuming, and designing, producing, and using microarrays is expensive.
The advent of second-generation sequencing technology brought a methodological leap forward for genotyping and genetic mapping. The new sequencing technologies not only increase sequencing throughput by several orders of magnitude but also allow many samples to be sequenced in parallel [5-6]. These advances have paved the way for sequencing-based high-throughput genotyping methods. The new genotyping methods combine several advantages: they are fast and inexpensive, offer high-density marker coverage, high accuracy, and high resolution, and are applicable to more mapping populations and to comparative genomics and genetic map construction across species.
Although some methods exist for genotyping the progeny of two parents, current methods have obvious shortcomings for genotyping multi-parent plants involving three or more parents, such as low accuracy and long analysis times.
Therefore, there is an urgent need in the art for a method and a device that identify the genotypes of multi-parent plants involving three or more parents rapidly and accurately.
Summary of the Invention
The object of the present invention is to provide a method and a device for identifying the genotypes of multi-parent plants with rapid analysis and accurate results, so that the genotypes of populations constructed from multiple parents can be analyzed quickly, accurately, and reliably.
In a first aspect of the present invention, a method for identifying the genotype of a multi-parent plant (such as a crop) is provided, the method comprising:
(a) for n parents and their progeny, providing sequencing data Df of the progeny plant to be identified and sequencing data Dp of the parent plants corresponding to the progeny plant, where n is a positive integer ≥ 3;
(b) determining SNP site information for the parents and the progeny based on the sequencing data Df and the sequencing data Dp;
(c) determining the genotype of the progeny based on the SNP site information, thereby obtaining an evaluation result for each SNP of the progeny and distribution information for the recombination breakpoints on each chromosome of the whole genome of the progeny;
(d) constructing and/or drawing a genotype map of the progeny based on the SNP evaluation results of the progeny and the positions of the whole-genome recombination breakpoints, thereby obtaining the genotype identification result of the multi-parent plant.
In another preferred embodiment, in step (c), the recombination breakpoints are analyzed based on a SNP "string".
In another preferred embodiment, step (c) includes analyzing the recombination breakpoints to obtain a recombination breakpoint analysis result,
and the recombination breakpoint analysis comprises:
(s1) constructing a SNP "string", in which the genotypes of all SNPs on each chromosome of the parents and the progeny are compressed, in order, into a single string;
(s2) determining, according to a predetermined window size, the sliding windows over the SNP string, and scoring the SNP sites in each window, thereby obtaining a score P for each parent within that window;
(s3) determining the genotype of each chromosomal region of the progeny based on the scores P obtained in step (s2).
In another preferred embodiment, in step (s1), all gaps between SNPs are removed, regardless of the actual physical distance between two adjacent SNPs.
In another preferred embodiment, in step (s1), the SNPs making up the string are SNP sites that are homozygous in the parents.
In another preferred embodiment, step (s1) further includes: first pre-screening the SNP sites to be included in the analysis, thereby excluding any SNP site at which any parent is heterozygous.
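Step (s1) and its pre-screening can be sketched as follows. This is a minimal Python illustration rather than the SNPwindow script itself: the record layout (position, parent genotypes, progeny genotype), the two-letter genotype encoding, and the names `is_homozygous` and `build_snp_string` are assumptions made for the example.

```python
from typing import List, Optional, Tuple

# One SNP record: (position, genotype of each parent, progeny genotype).
# Genotypes are two-letter strings such as "AA" or "AG"; None means missing.
Snp = Tuple[int, List[Optional[str]], Optional[str]]


def is_homozygous(gt: Optional[str]) -> bool:
    """True when the genotype is known and both alleles are identical."""
    return gt is not None and len(gt) == 2 and gt[0] == gt[1]


def build_snp_string(snps: List[Snp]) -> List[Snp]:
    """Keep the SNPs usable for window scoring, ordered by position.

    A site is dropped if any parent is missing or heterozygous, or if the
    progeny genotype is missing. Physical gaps between the surviving sites
    are ignored: they simply become neighbours in the returned list.
    """
    kept = [
        (pos, parents, child)
        for pos, parents, child in snps
        if child is not None and all(is_homozygous(p) for p in parents)
    ]
    return sorted(kept, key=lambda rec: rec[0])
```

Because only the order of the surviving sites matters, a window of fixed SNP count may span very different physical distances in SNP-dense and SNP-sparse regions.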
In another preferred embodiment, in step (s2), scoring follows the scoring rules in Table A.
In another preferred embodiment, in step (s3), the genotype of each chromosomal region of the progeny is determined based on the score, or score curve, of each parent.
In another preferred embodiment, in step (s3), the genotype of each chromosomal region is determined based on the scores and their standard deviation.
In another preferred embodiment, for a chromosomal region whose genotype is to be determined, if a parent A has a high score close to the full score (≥80% of the full score, preferably ≥80% of the full score), and that parent's score is stable across the region without large fluctuations, while the scores of the remaining parents are low (≤50% of the full score, preferably ≤30% of the full score) or fluctuate widely, then the genotype of the region is determined to be that of parent A.
In another preferred embodiment, step (s3) includes: sliding the window across the SNP sites of the whole genome to obtain the score of each parent on each chromosome, and plotting a score curve for each parent with the score on the vertical axis and the position of each sliding window on the chromosome on the horizontal axis.
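The per-parent score curves of steps (s2) and (s3) can be sketched as below. Table A is not reproduced in this part of the text, so the rule in `site_score` (1 point for an identical genotype, 0.5 for a heterozygous progeny carrying the parent's allele, 0 otherwise) is a placeholder assumption, as are the function names and the list-based input layout.

```python
from typing import List


def site_score(parent_gt: str, child_gt: str) -> float:
    """Placeholder scoring rule (an assumption standing in for Table A)."""
    if child_gt == parent_gt:
        return 1.0   # progeny identical to this homozygous parent
    if parent_gt[0] in child_gt:
        return 0.5   # heterozygous progeny carrying the parent's allele
    return 0.0


def score_curves(parent_gts: List[List[str]], child_gts: List[str],
                 window: int) -> List[List[float]]:
    """Return one score curve per parent.

    parent_gts[k][i] is the genotype of parent k at the i-th SNP of the
    string and child_gts[i] is the progeny genotype there. Each curve value
    is the summed score of `window` consecutive SNPs, so the full score of a
    window equals `window`.
    """
    n = len(child_gts)
    if window < 1 or window > n:
        raise ValueError("window must be between 1 and the number of SNPs")
    curves = []
    for gts in parent_gts:
        per_site = [site_score(p, c) for p, c in zip(gts, child_gts)]
        total = sum(per_site[:window])
        curve = [total]
        # slide by one SNP: add the entering site, drop the leaving one
        for i in range(window, n):
            total += per_site[i] - per_site[i - window]
            curve.append(total)
        curves.append(curve)
    return curves
```

Plotting each returned curve against the window index (or the position of the window on the chromosome) gives the per-parent score curves described above.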
In another preferred embodiment, step (s3) includes a sub-step of evaluating heterozygous regions:
(s3a) for the hybrid progeny of multiple parents, it is still assumed that any given chromosomal region derives from at most two parents, and whether the region is heterozygous is judged from the score curves of those two parents in that region.
In another preferred embodiment, in step (s3), the similarity between the progeny and each parent in a given segment is quantified, and the genotype of each segment is determined from the numerical characteristics (level and standard deviation) of each parent's score curve.
In another preferred embodiment, in step (s3), genotypes are called as follows:
(Z1) if one parent has a high and stable score in the segment (its score curve approaches a plateau), while the score curves of the remaining parents fluctuate widely in the segment with a large standard deviation (the curves rise and fall in peaks), the segment is judged to be homozygous for that parent's genotype;
(Z2) when the number of parents being judged is two, and both parents show large fluctuations and a large standard deviation in a segment, the region can be inferred to be heterozygous for the two parents;
when multiple parents are judged, it is likely that only possibly heterozygous segments can be found, that is, segments in which no parent has a high and stable score; because the scores of all parents fluctuate widely in such a segment, only the two most probable parents can be given, based on the numerical characteristics.
(Z3) if two or more parents are very similar to the progeny in a segment, so that two or more parental curves have high scores and small standard deviations, then those parents are very similar to one another in this segment with little difference between them, and the call can be deferred (the segment is marked as an "unknown region").
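Rules (Z1) to (Z3) can be expressed as a small decision function. The sketch below is illustrative only: the 80% threshold is taken from the preferred embodiment above, whereas the stability cut-off `max_sd`, the choice of the two highest-scoring parents for a heterozygous call, and the name `call_segment` are assumptions. It also treats any parent that is not both high and stable as "low or fluctuating", which is a looser test than the text describes.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple, Union

Call = Union[int, Tuple[int, int], str]


def call_segment(curves: Sequence[Sequence[float]], full_score: float,
                 high: float = 0.8, max_sd: float = 0.05) -> Call:
    """Call one segment from the per-parent score values inside it.

    Returns the index of a parent for a homozygous call (Z1), a pair of
    parent indices for a probable heterozygous call (Z2), or "unknown" (Z3).
    `max_sd` is expressed as a fraction of the full score.
    """
    means = [mean(c) / full_score for c in curves]
    sds = [pstdev(c) / full_score for c in curves]
    stable_high = [k for k in range(len(curves))
                   if means[k] >= high and sds[k] <= max_sd]
    if len(stable_high) == 1:
        return stable_high[0]   # Z1: exactly one high, stable parent
    if len(stable_high) >= 2:
        return "unknown"        # Z3: several parents fit equally well
    # Z2: no stable parent; report the two parents with the highest mean
    ranked = sorted(range(len(curves)), key=lambda k: means[k], reverse=True)
    return (min(ranked[:2]), max(ranked[:2]))
```

In practice the thresholds would need tuning against simulated progeny with known genotypes, since suitable values depend on window size and sequencing depth.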
In another preferred embodiment, the method further includes: if the genotypes on both sides of an unknown region are the same, assigning that genotype to the region; and if the genotypes on the two sides differ, treating the midpoint of the unknown region as a recombination breakpoint, with each half of the unknown region taking the genotype of its adjacent side.
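The handling of unknown regions can be sketched as follows. The segment representation (start, end, genotype label) and the function name `resolve_unknown` are assumptions for the example; the sketch also assumes that adjacent unknown segments have already been merged, and it assigns an unknown segment at a chromosome end to its single called neighbour, a case the text does not cover.

```python
from typing import List, Tuple

Segment = Tuple[int, int, str]  # (start, end, genotype label), end exclusive


def resolve_unknown(segments: List[Segment]) -> List[Segment]:
    """Fill "unknown" segments from the genotypes that flank them.

    Same genotype on both sides: the unknown segment takes that genotype.
    Different genotypes: the segment is split at its midpoint, which becomes
    the recombination breakpoint.
    """
    pieces: List[Segment] = []
    for i, (start, end, gt) in enumerate(segments):
        if gt != "unknown":
            pieces.append((start, end, gt))
            continue
        left = segments[i - 1][2] if i > 0 else None
        right = segments[i + 1][2] if i + 1 < len(segments) else None
        if left is None and right is None:
            pieces.append((start, end, gt))       # nothing to infer from
        elif left is None or right is None or left == right:
            pieces.append((start, end, left if left is not None else right))
        else:
            mid = (start + end) // 2
            if mid > start:
                pieces.append((start, mid, left))
            pieces.append((mid, end, right))
    # merge neighbouring pieces that now share a genotype
    merged: List[Segment] = []
    for start, end, gt in pieces:
        if merged and merged[-1][2] == gt and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end, gt)
        else:
            merged.append((start, end, gt))
    return merged
```

After this pass, each boundary between consecutive segments in the returned list is a recombination breakpoint.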
In another preferred embodiment, the progeny is a multi-parent plant.
In another preferred embodiment, n is 3-6, more preferably 3, 4, or 5.
In another preferred embodiment, the sequencing data are selected from the group consisting of genome sequencing data, RNA sequencing data, and combinations thereof.
In another preferred embodiment, the sequencing data are files in fastq format.
In another preferred embodiment, the sliding window size is 170-500 consecutive SNP sites, preferably 200-400 consecutive SNP sites;
and/or the sequencing depth of the sequencing data is 0.1x-10x, preferably 0.2x-5x.
In another preferred embodiment, the sequencing depth of the sequencing data is ≥1, preferably 1-5, more preferably 1.5-3.
In another preferred embodiment, a score curve for each parent is obtained for each chromosome.
In another preferred embodiment, the SNP sites are used to determine the genotype of the individual.
In another preferred embodiment, in step (b), the sequencing data (such as fastq files) are aligned and processed with the bwa and GATK software to obtain the SNP information.
In another preferred embodiment, the SNP site information includes position information and genotype information.
In another preferred embodiment, the SNP sites used to determine genotypes meet the following requirements:
1. The SNP sites cover the whole genome as far as possible, with no regions lacking SNPs;
2. For any SNP site, the SNP information (position and genotype) of the two corresponding parents and of the progeny is all known; if any of the three is unknown, the site should be discarded.
In another preferred embodiment, in step (c), the evaluation result for each SNP of the progeny is recorded in an rlt file, which records the genotype call at each SNP position;
the distribution information for the recombination breakpoints on each chromosome of the whole genome of the progeny is recorded in a bin file, which records the distribution of recombination breakpoints on the 12 chromosomes of the whole genome.
In another preferred embodiment, in step (c), genotype reading and recombination breakpoint determination are performed by the SNPwindow script.
In another preferred embodiment, in step (d), genotype mapping is performed on m individuals of the progeny at the same time.
In another preferred embodiment, step (d) includes constructing the recombination map with the SNPwindow script and drawing the gene map of each progeny individual with the SNP2png script.
In another preferred embodiment, step (d) further includes aligning the recombination maps of the individuals with the Bin2MCD script to generate a recombination bin map.
In another preferred embodiment, the resolution of the recombination bin map is one bin per 5-200 kb, preferably one bin per 10-100 kb.
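One common way to align per-individual recombination maps into shared bins is to use every breakpoint seen in any individual as a bin boundary. The sketch below illustrates that idea only; the text does not describe how Bin2MCD works internally, so the approach, the segment layout, and the name `bin_map` are all assumptions. Each individual's segments are assumed to be sorted, contiguous, and to cover the same chromosome range.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Segment = Tuple[int, int, str]  # (start, end, genotype), end exclusive


def bin_map(individuals: Dict[str, List[Segment]]
            ) -> Tuple[List[Tuple[int, int]], Dict[str, List[str]]]:
    """Build shared bins from the breakpoints of all individuals.

    Every segment boundary observed in any individual becomes a bin
    boundary, so no individual changes genotype inside a bin. Returns the
    bin intervals and, for each individual, the genotype of every bin.
    """
    edges = sorted({p for segs in individuals.values()
                    for s, e, _ in segs for p in (s, e)})
    bins = list(zip(edges[:-1], edges[1:]))
    table: Dict[str, List[str]] = {}
    for name, segs in individuals.items():
        starts = [s for s, _, _ in segs]
        # the segment containing a bin is the last one starting at or before it
        table[name] = [segs[bisect_right(starts, lo) - 1][2] for lo, _ in bins]
    return bins, table
```

The resulting table has one row per individual and one column per bin, which is the general shape of input that linkage-mapping programs expect; very short bins could additionally be merged to reach a target resolution.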
In another preferred embodiment, the method further includes: processing the recombination bin map to obtain a genetic map of the progeny.
In another preferred embodiment, the method further includes: performing QTL analysis on the genetic map.
In another preferred embodiment, the method further includes: visualizing the genotypes of the entire population of parents and progeny, generating genotype data, and constructing a linkage map based on the genotype data.
In another preferred embodiment, the plant includes a crop, preferably a gramineous (grass family) crop.
In another preferred embodiment, the crop includes rice, wheat, soybean, and tobacco.
In a second aspect of the present invention, a data analysis device for identifying the genotypes of multi-parent plants is provided, the device comprising:
a data input module for inputting the data to be analyzed, the data including: the sequencing data Df of the progeny plant to be identified and the sequencing data Dp of the parent plants corresponding to the progeny plant;
a multi-parent plant genotype identification module configured to perform the method of the first aspect of the present invention, thereby obtaining the genotype identification result of the progeny;
and an output module for outputting the genotype identification result of the progeny.
In another preferred embodiment, the multi-parent plant genotype identification module includes:
a SNP site information analysis submodule configured to determine the SNP site information of the parents and the progeny based on the sequencing data Df and the sequencing data Dp;
a chromosomal recombination breakpoint analysis submodule configured to determine the genotype of the progeny based on the SNP site information, thereby obtaining an evaluation result for each SNP of the progeny and distribution information for the recombination breakpoints on each chromosome of the whole genome of the progeny;
a genotype map construction submodule configured to construct and/or draw a genotype map of the progeny based on the SNP evaluation results of the progeny and the positions of the whole-genome recombination breakpoints, thereby obtaining the genotype identification result of the multi-parent plant.
In another preferred embodiment, the plant includes a crop, preferably a gramineous (grass family) crop.
In another preferred embodiment, the output module includes a display, a printer, a tablet (pad), and the like.
It should be understood that, within the scope of the present invention, the technical features described above and those specifically described below (for example, in the embodiments) can be combined with one another to form new or preferred technical solutions. For brevity, these combinations are not enumerated here.
Description of the Drawings
Figure 1 shows the whole-genome recombination breakpoints of the simulated two-parent material.
Figure 2 shows the whole-genome recombination breakpoints of the simulated four-parent material.
Figure 3 shows genotyping of the simulated two-parent progeny using the SNP-based sliding window method.
Figure 4 shows genotyping of the simulated two-parent progeny using the SEG-Map software.
Figure 5 shows genotyping of the simulated four-parent progeny using the SNP-based sliding window method.
Figure 6 shows the effect of sliding window size on the accuracy of genotype calls.
Figure 7 shows the effect of sequencing depth on the accuracy of genotype calls.
Figure 8 shows the analysis workflow of the SNP-based sliding window genotyping method.
Figure 9 shows a gene map drawn with the SNP2png script.
Figure 10 shows the combined genotype map of a rice population.
Figure 11 shows the genotype table of the recombination segment map of individuals from a recombinant inbred line in one embodiment.
Figure 12 shows a SNP "string" with a window of 15.
Figure 13 shows the score curves of the four parents for the simulated progeny on rice chromosome 3.
Figure 14 shows the score curves of the two parents for the simulated progeny on rice chromosome 11.
Figure 15 shows the score of each parent when a region is called as homozygous for parent 3 in one embodiment.
Figure 16 shows the score of each parent when a region is called as heterozygous for two parents in one embodiment.
Figure 17 shows the score of each parent when the genotype is called as unknown in one embodiment.
Figure 18 shows the subsequent genotype call for an unknown region in one embodiment.
Figure 19 shows the genotype map of a single individual in the DH population.
Figure 20 shows the combined genotype map of the DH population.
Figure 21 shows the genotyping of the three-parent material in one embodiment.
Figure 22 shows the SEG-Map genotyping of the three-parent material in one embodiment.
Figure 23 shows the genotyping of the simulated four-parent material in one embodiment.
Figure 24 shows the true genotypes of the simulated four-parent material in one embodiment.
Detailed Description
After extensive and in-depth research, the present inventors have for the first time developed a faster and more accurate genotyping method, enabling more effective genetic mapping and genome analysis. The method of the present invention is particularly suitable for genotyping multi-parent populations sequenced at low coverage. In the present invention, the genotypes of the actual SNPs of the parents and the progeny in a given segment are read directly, the similarity between the progeny and each parent in that segment is then quantified, and calls are made from the numerical characteristics (level and standard deviation) of each parent's score curve, yielding an efficient, simplified, and accurate method for genotyping multi-parent plants (or multi-parent crops). The present invention was completed on this basis.
Specifically, based on whole-genome low-coverage sequencing data generated by second-generation sequencing technology, the present inventors developed a high-throughput method for identifying genotypes in recombinant populations derived from multiple parents. The inventors designed a "sliding window" method that determines the genotype of a segment by jointly analyzing the genotypes of multiple single nucleotide polymorphisms (SNPs) within a local region of the genome, and then locates the recombination breakpoints to construct a fine recombination map of the multi-parent population.
To validate the method, the present inventors constructed simulated whole-genome sequencing data for a two-parent population and a multi-parent population and used the method to construct genetic linkage maps. When the called genotypes were compared with the true genotypes of the simulated data, the genotyping accuracy for the two-parent population reached 89.61%, similar to the accuracy of the inventors' SEG-Map software for two-parent populations (89.32%). The newly developed genotyping method reached an accuracy of 92.10% for the multi-parent population, which the SEG-Map software or method cannot achieve.
The method of the present invention can effectively and rapidly determine the genotype of each individual in a population, can play a key guiding role in genome-design breeding, and can provide fast and accurate genotype data for QTL mapping in multi-parent populations of different crops. The inventors also tested the method on a real rice RIL genetic population using high-throughput sequencing-based genotyping, and obtained high-quality, high-precision recombination maps.
Therefore, as sequencing technology continues to develop and improve, this genotyping method based on low-coverage genome sequencing can replace traditional marker-based genotyping methods and provides a powerful tool for large-scale gene discovery and for addressing more complex biological problems. The method of the present invention is especially suitable for genotyping multi-parent backcross populations sequenced at low coverage, provides accurate genotype support for QTL mapping, and also facilitates molecular-design breeding in multi-parent populations.
Terms
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
As used herein, the terms "containing" and "including (comprising)" can be open-ended, semi-closed, or closed. In other words, the terms also encompass "consisting essentially of" and "consisting of".
As used herein, the term "biparental" means that two parents are involved.
As used herein, the term "multi-parent" means that three or more parents are involved.
As used herein, the term "multi-parent plant" refers to a plant involving three or more parents, for example a progeny plant (such as a crop) involving 3, 4, or 5 parents.
多亲本作物基因型的鉴定方法Methods for the identification of multi-parental crop genotypes
本发明提供了一种多亲本作物基因型的鉴定方法。本发明方法是一种SNP位点滑窗的基因型鉴定方法。The invention provides a method for identifying multi-parent crop genotypes. The method of the invention is a genotype identification method of the sliding window of the SNP site.
在本发明的基于SNP位点滑窗的基因型鉴定方法中,对数据处理进行了优化。优化后的流程能够直接分析处理二代测序技术产生的单向或双向末端短序列测序结果,最终构建出重组群体的遗传图谱。In the genotype identification method based on the sliding window of the SNP site of the present invention, the data processing is optimized. The optimized process can directly analyze and process the unidirectional or bidirectional end short sequence sequencing results generated by the next-generation sequencing technology, and finally construct the genetic map of the recombinant population.
对于一个来自于两个亲本的作图群体,在进行数据分析流程之前,需要对这两个亲本的全基因组范围内的SNP进行鉴定。这个SNP的鉴定工作可以由高覆盖的全基因组深度测序得到,也可以由水稻单倍型图谱中现有的基因组SNP信息得到,还可以由低覆盖的全基因组测序结合缺失基因型(SNP)填补来得到。由于两个亲本品种之间的SNP鉴定工作可以通过快速及省钱的途径得到,那么对于一个重组群体的基于测序的基因型鉴定将主要依靠之后的分析工作,包括读取基因型、重组断裂位点的确定和遗传连锁图谱的构建。For a mapping population from two parents, the genome-wide SNPs of both parents need to be identified before proceeding with the data analysis pipeline. The identification of this SNP can be obtained by high-coverage whole-genome deep sequencing, or by existing genomic SNP information in the rice haplotype map, or by low-coverage whole-genome sequencing combined with missing genotypes (SNPs) to fill in to get. Since SNP identification between two parental varieties can be obtained in a fast and cost-effective way, sequencing-based genotype identification of a recombinant population will mainly rely on subsequent analysis, including reading genotypes, recombination breakpoints Point determination and construction of genetic linkage maps.
数据分析中的功能、步骤、和软件(脚本)可参见图8。Functions, steps, and software (scripts) in data analysis can be seen in Figure 8.
第一个步骤包含了数个可以同时处理的任务。一定数目的重组群体中的个体和亲本材料同时进行二代高通量测序。将得到的fastq文件比对经过bwa和GATK软件处理得到高质量的SNP信息。The first step consists of several tasks that can be processed simultaneously. Individuals and parental material in a certain number of recombinant populations are subjected to second-generation high-throughput sequencing simultaneously. The obtained fastq files were aligned and processed by bwa and GATK software to obtain high-quality SNP information.
最终判断基因型所用的SNP位点应当满足下列几个要求:The SNP loci used for the final determination of genotype should meet the following requirements:
一、SNP位点尽可能覆盖全基因组,不会在某些区域有缺失。1. SNP sites cover the whole genome as much as possible, and there will be no deletions in certain regions.
二、对于任何一个SNP位点,两个亲本和模拟子代的SNP信息(位置信息和基 因型信息)都是可知的,三者中任何一个不可知该位点就应当删去。2. For any SNP locus, the SNP information (position information and genotype information) of the two parents and simulated progeny are known, and the locus should be deleted if any one of the three is not known.
此外,以水稻为例,一般认为水稻亲本是自交纯合系,基因组中基本不会有杂合位点,因此如果在亲本中找出了杂合的SNP位点,一般认为该位点是不可信的,因此可删去了任一亲本为杂合的SNP位点。In addition, taking rice as an example, it is generally believed that the rice parent is an inbred homozygous line, and there is basically no heterozygous locus in the genome. Therefore, if a heterozygous SNP locus is found in the parent, it is generally considered that the locus is Not credible, so SNP sites where either parent is heterozygous can be deleted.
After the high-quality genome-wide SNPs of the two parents and the progeny have been screened, a python script, SNPwindow, can be used to determine the genotype of the progeny. The script outputs two files, an rlt file and a bin file. The rlt file records the genotype call at each SNP position, and the bin file records the distribution of recombination breakpoints across the 12 chromosomes of the genome. These two files are the basis for subsequent mapping and linkage analysis.
Referring to Figure 9, the rlt and bin files can first be used by a perl script, SNP2png, to draw a genotype map in PNG format. The map is drawn from the genotype calls at the SNP sites and the positions of the genome-wide recombination breakpoints, with different colors representing different genotypes.
In practice, a large number of progeny have to be processed. The SNP information of each individual can therefore be extracted and analyzed with the SNPwindow script, which reads the genotypes, determines the recombination breakpoints, and constructs the recombination map. SNP2png is first used to draw the genotype map of each individual, giving an overview of its genotype; the script Bin2MCD then aligns the recombination maps of all individuals to produce a recombination bin map.
Referring to Figure 10, a perl script can also be used to visualize the genotypes of the whole population. The programs and scripts used in the analysis pipeline are shown in italics and form a series of analysis steps. The genotype data produced at the end of the pipeline can be used directly in other software (including MapMaker and JoinMap) to construct linkage maps.
The final output of the analysis software is a set of recombination bins, usually at a resolution of one bin per 100 kb, or even one bin per 10 kb. The genotype results of the mapping population can be imported into programs such as MapMaker [16] or JoinMap [32] to construct a genetic map. Once the genetic map is available, QTL analysis can be performed.
Referring to Figure 11, the genetic map produced by the method of the present invention has a much finer scale than maps produced with most conventional molecular markers.
The method of the present invention involves determining recombination breakpoints; the detailed procedure comprises the following steps:
Step 1: Construct the SNP "string".
Take a window size of 15 (win=15) as an example. The genotypes of all SNPs on the 12 chromosomes of the two parents and the progeny are compressed, in order, into strings. All gaps are removed, regardless of the actual distance between two adjacent SNPs.
With 12 chromosomes, for example, the SNPs on the 12 chromosomes become 12 continuous strings (see Figure 12). In the figure, blue represents the genotype of parent 1 and red represents the genotype of parent 2. For a progeny of these parents, three genotypes are possible at each SNP site: the homozygous genotype of parent 1 (blue), the homozygous genotype of parent 2 (red), and the heterozygous genotype (yellow).
The genomes of artificially bred rice parental materials are generally considered highly homozygous, and the genomes of some rice populations derived by multiple generations of selfing are also fairly homozygous, with heterozygous regions at only some chromosomal positions. The SNP sites included in the analysis are therefore screened first, excluding sites at which any parent is heterozygous, because such sites cannot be scored accurately. In addition, if the sequencing depth of the progeny is not high, SNP sites that are heterozygous in the progeny can be filtered out, because heterozygous calls made at low depth have low reliability and are likely to be miscalls caused by sequencing errors.
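A minimal sketch of Step 1 follows, assuming the filtered SNPs are available as (chromosome, position, genotype) tuples. The function names and the record layout are illustrative assumptions, not the format used by SNPwindow.

```python
from collections import defaultdict


def build_snp_strings(records):
    """Group (chrom, pos, genotype) records into one ordered genotype list per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, pos, gt in records:
        by_chrom[chrom].append((pos, gt))
    strings = {}
    for chrom, sites in by_chrom.items():
        sites.sort(key=lambda site: site[0])  # order by position only
        strings[chrom] = {
            "pos": [pos for pos, _ in sites],  # kept so windows can be mapped back
            "gt": [gt for _, gt in sites],     # the gap-free SNP "string"
        }
    return strings


def windows(genotypes, win=15):
    """Yield every run of `win` consecutive SNPs, regardless of physical spacing."""
    for start in range(len(genotypes) - win + 1):
        yield genotypes[start:start + win]
```

Positions are kept alongside the string only so that a window can later be mapped back to a chromosomal coordinate; the window itself is defined purely by SNP count.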
Step 2: Score the parents within a window
Following Mendel's laws of inheritance, the scores of all SNP sites within a sliding window are calculated, and the sum for each parent is taken as that parent's score at the chromosomal position of the window. Each parent's score measures how well the progeny matches that parent. The scoring rules are as shown in Table A, or similar rules. Preferably, the scoring rules are formulated according to the laws of inheritance of the organism.
The genotype scoring rules of the present invention are further described below with reference to the theoretical model.
For a sliding window composed of consecutive SNPs, the score of the progeny against any one parent consists of three parts: (1) SNP sites at which the progeny is identical to the parent; (2) sites at which the progeny differs from the parent but is consistent with Mendelian inheritance; (3) sites at which the progeny differs from the parent and is inconsistent with Mendelian inheritance, together with sites miscalled for any other reason.
For a single parent A, let m be the number of SNP sites at which the progeny under test is identical to the parent, n the number of sites at which the progeny differs from the parent but is consistent with Mendelian inheritance, and e the number of sites at which the progeny differs from the parent and is inconsistent with Mendelian inheritance, together with sites miscalled for any other reason.
The score S_A of this parent is then:
S_A = s_1*m + s_2*n + s_3*e
where s_1 is the score assigned to a SNP site at which the progeny is identical to the parent,
s_2 is the score assigned to a site at which the progeny differs from the parent but is consistent with Mendelian inheritance, and
s_3 is the score assigned to a site at which the progeny differs from the parent and is inconsistent with Mendelian inheritance, or that is miscalled for any other reason.
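The score S_A = s_1*m + s_2*n + s_3*e can be sketched as below. The way a site is assigned to the three classes is an assumption made for illustration (a site counts as "consistent" when the progeny differs from the parent but still carries one of its alleles); the authoritative rules are those of Table A, which is not reproduced here. The default weights are the values s_1 = 1, s_2 = 1, s_3 = 0 stated in the description.

```python
def alleles(gt):
    """Set of alleles in a VCF-style genotype string such as '0/1' or '1|1'."""
    return set(gt.replace("|", "/").split("/"))


def site_class(progeny_gt, parent_gt):
    """Classify one site as 'same', 'consistent' or 'error' (assumed rule, see lead-in)."""
    c, p = alleles(progeny_gt), alleles(parent_gt)
    if c == p:
        return "same"
    if c & p:
        return "consistent"  # differs, but shares an allele with the parent
    return "error"


def window_score(progeny_gts, parent_gts, s1=1, s2=1, s3=0):
    """Return S_A = s1*m + s2*n + s3*e for one window and one parent."""
    counts = {"same": 0, "consistent": 0, "error": 0}
    for c, p in zip(progeny_gts, parent_gts):
        counts[site_class(c, p)] += 1
    m, n, e = counts["same"], counts["consistent"], counts["error"]
    return s1 * m + s2 * n + s3 * e
```

Calling `window_score` once per parent for the same window gives the set of per-parent scores that the later steps compare.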
In a window of N consecutive SNPs there are i candidate parents. For a given SNP site at chromosomal position k, the genotypes of the progeny and the parent at that site are g_k and g'_k, respectively. The genotype of a pure-line rice parent is generally 0/0, 0|0, 1/1 or 1|1, while the genotype of the progeny is generally 0/0, 0|0, 1/1, 1|1, 0/1 or 0|1. Allele 2 occurs rarely and is not considered for now.
For the i-th individual parent, the probability that the progeny matches its genotype is:
Figure PCTCN2021115146-appb-000001
The Bayesian method is used to obtain the probability that this region of the progeny derives from the parent:
Figure PCTCN2021115146-appb-000002
The genotype of a region of the progeny is assigned to the parental genotype with the highest matching probability, that is, the maximum matching probability among the i parents is found:
P_max = max{P_1, P_2, ..., P_i}
Preferably, in the present invention, the following table is used for genotype scoring.
Table A. Genotype scoring rules
Figure PCTCN2021115146-appb-000003
According to the scoring rules, s_1 = 1, s_2 = 1 and s_3 = 0.
Scoring parent A then gives:
Figure PCTCN2021115146-appb-000004
After the score of each parent has been computed in this way, a sliding window is applied to the consecutive parental scores to calculate the standard deviation (std).
Preferably, a chromosomal region is judged to be the homozygous genotype of parent A when the following conditions are met: the score S is the highest and the standard deviation is the smallest.
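The two quantities used in this decision, the windowed score and the standard deviation of consecutive scores, could be computed along the following lines. This is a sketch under assumptions: per-site scores are taken to be already available as a list of numbers for one parent, the population standard deviation is used, and both window lengths are free parameters rather than values fixed by the description.

```python
from statistics import pstdev


def sliding_scores(site_scores, win=15):
    """First pass: sum the per-site scores over every window of `win` consecutive SNPs."""
    return [sum(site_scores[i:i + win])
            for i in range(len(site_scores) - win + 1)]


def sliding_std(scores, win=15):
    """Second pass: standard deviation of `win` consecutive window scores."""
    return [pstdev(scores[i:i + win])
            for i in range(len(scores) - win + 1)]
```

A parent whose first-pass scores stay near the window size while its second-pass values stay near zero corresponds to the "high score, small standard deviation" condition above.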
Step 3: Determine the genotype of each chromosomal region from the scores
By sliding the window across the genome-wide SNP sites, the score of each parent on every chromosome is obtained. The score curve of each parent is plotted with the score as the ordinate and the chromosomal position of each sliding window as the abscissa.
The genotype of each chromosomal segment is determined from the characteristics of the score curves of the different parents.
In one example (see Figure 13), sliding-window scoring was performed on a simulated progeny derived from four parents, and the score curves of the four parents were plotted from their scores.
Inspecting the score curves of the four parents in this example shows that, in different regions of chromosome 11, the parental curves follow two distinct patterns: a stable plateau and a fluctuating phase. The states of the different parental curves within the same region are therefore used to determine the genotype of the progeny in that region.
For example, in the region from 1 bp to 10,780,000 bp in the figure, the yellow curve (parent 4) scores close to the maximum, and its score is quite stable, without large fluctuations; the standard deviation is used to measure the fluctuation of the score. In this region parent 4 has a high score and a small standard deviation, while the scores of the other three parents fluctuate within the range 0-200 with a large standard deviation, so the region can be judged to be the homozygous genotype of parent 4. In the same way, the progeny genotypes of the different regions of the 12 chromosomes can be determined from the scores.
In the figure, the bar labelled "true" gives the true genotype of the simulated progeny for each chromosomal segment, and the bar labelled "judge" gives the progeny genotype determined by the method of the present invention; the two largely agree.
Determination of heterozygous regions
The determination of heterozygous regions is illustrated with the genotyping of a simulated progeny of two parents. According to genetic principles, even a hybrid progeny derived from multiple parents can derive any given chromosomal region from at most two parents. Whether a region is heterozygous can therefore be judged from the score curves of two parents in that region.
In one example, shown in Figure 14, a simulated progeny was genotyped. The bar labelled "true" gives the true genotype of the simulated progeny for each chromosomal segment, and the bar labelled "judge" gives the progeny genotype determined by the method of the present invention. As before, when one parent has a high score with a small standard deviation and the other parent's score fluctuates considerably with a large standard deviation, the region is judged to be the homozygous genotype of the former (orange or blue regions in the figure).
When, in a given segment (6,000,000 bp to 15,000,000 bp in the figure), the scores of the two parents differ little and both fluctuate with large standard deviations, the segment is judged to be the heterozygous genotype of the two parents (yellow region in the figure). This result also agrees with the true genotype.
Determination based on quantified similarity
One of the core ideas of the method of the present invention is to read the genotypes of the actual SNPs of the parents and the progeny in a given segment directly, quantify the similarity between the progeny and each parent in that segment, and use the numerical characteristics of each parent's score curve (level and standard deviation) to form a relatively simple analytical model from which the genotype of each segment is determined.
The determination criteria of the present invention mainly cover the following cases:
1. If one parent has a high and stable score in a segment (its score curve is close to a plateau), while the score curves of the other parents fluctuate strongly with large standard deviations (their curves rise and fall in peaks), the segment is judged to be the homozygous genotype of that parent (Figure 15).
2. When two parents are being evaluated and both show large fluctuations and large standard deviations in a segment, the region can be inferred to be the heterozygous genotype of the two parents (Figure 16).
In a multi-parent determination, often only candidate heterozygous segments can be identified, that is, segments in which no parent has a high and stable score. Because the scores of all parents fluctuate strongly in such a segment, only the two most likely parents can be reported on the basis of the numerical characteristics.
3. Because the determination relies largely on measuring the genotype similarity between the progeny and the parents, if two or more parents are very similar to the progeny in a segment, so that two or more parental curves have high scores and small standard deviations, the parents concerned are very similar to one another in that segment, and the determination can be deferred.
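The three cases above can be expressed as a small decision function. The description gives no numerical cut-offs, so the thresholds `high` and `stable` below are purely hypothetical, as is the convention of expressing each parent's mean score and standard deviation as a fraction of the maximum window score.

```python
def classify_region(stats, high=0.9, stable=0.05):
    """Classify one region from per-parent (mean_score, std) pairs.

    `stats` maps a parent name to (mean, std), both as fractions of the maximum
    window score. The thresholds are illustrative assumptions only.
    """
    confident = [p for p, (mean, std) in stats.items()
                 if mean >= high and std <= stable]
    if len(confident) == 1:
        return confident[0]  # case 1: homozygous for that parent
    if len(confident) > 1:
        return "unknown"     # case 3: parents indistinguishable here, defer the call
    # Case 2: no parent is high and stable; report the two best-scoring parents
    # as the candidate heterozygous pair.
    top_two = sorted(stats, key=lambda p: stats[p][0], reverse=True)[:2]
    return "het:" + "/".join(sorted(top_two))
```

In a real analysis the thresholds would need to be tuned to the window size and sequencing depth; the function only illustrates the order in which the criteria are checked.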
Referring to Figure 17, in some cases the region to be determined is first labelled "unknown", and its genotype is then decided by the genotypes of the flanking regions.
Referring to Figure 18, if the genotypes on the two sides are the same, the region is assigned that genotype; if they differ, the midpoint of the region is treated as a recombination breakpoint, and each half of the region is assigned the genotype of its adjacent flank.
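The flank rule can be sketched as follows, assuming segments are stored as (start, end, genotype) tuples in chromosomal order and that "unknown" segments are isolated, with a called segment on each side. Both the representation and the function name are assumptions made for illustration.

```python
def resolve_unknown(segments):
    """Assign 'unknown' segments from their flanks, splitting at the midpoint if needed."""
    resolved = []
    last = len(segments) - 1
    for i, (start, end, gt) in enumerate(segments):
        if gt != "unknown" or i == 0 or i == last:
            resolved.append((start, end, gt))  # nothing to infer from
            continue
        left, right = segments[i - 1][2], segments[i + 1][2]
        if left == right:
            resolved.append((start, end, left))
        else:
            mid = (start + end) // 2  # treated as the recombination breakpoint
            resolved.append((start, mid, left))
            resolved.append((mid + 1, end, right))
    return resolved
```

An unknown segment at a chromosome end is left unchanged here, because it has only one flank; how such cases are handled is not stated in the description.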
Genotype determination with two sliding-window passes
In the present invention, genotype determination is preferably performed with two sliding-window passes.
The first pass slides a window over the SNP genotypes and computes the parental scores within each window.
The second pass slides a window over the resulting parental scores and evaluates their level and standard deviation.
The final genotype call depends on the score level and the standard deviation obtained from the second pass, and each region of the progeny is assigned to the parent to which it most probably belongs.
In the present invention, using two sliding-window passes allows genotypes to be determined faster and more accurately.
A schematic example of the two sliding-window passes is shown below:
Figure PCTCN2021115146-appb-000005
Device for identifying multi-parent crop genotypes
The present invention also provides an identification or analysis device for multi-parent crop genotypes that carries out the method of the present invention. Typically, the device comprises:
a data input module for inputting the data to be analyzed, the data comprising the sequencing data Df of the progeny plant to be identified and the sequencing data Dp of the parent plants corresponding to the progeny plant;
a multi-parent plant genotype identification module configured to perform the method of the present invention, thereby obtaining the genotype identification result of the progeny;
and an output module for outputting the genotype identification result of the progeny.
The main advantages of the present invention include:
(a) The present invention provides, for the first time, a method for identifying multi-parent crop genotypes based on high-throughput sequencing data. Before the present invention, there was no systematic method for identifying the genotypes of multi-parent crops.
(b) The high-throughput genotype identification method of the present invention can greatly simplify and accelerate the genetic mapping of quantitative traits in crops [37-39,20].
(c) The method of the present invention is well suited to genotyping multi-parent populations, improving the accuracy and efficiency of QTL mapping and making full use of the rich genetic variation present in such populations. It also contributes to the improvement of crop genetic quality and to molecular breeding design.
(d) In practical applications, the present invention can be used to obtain molecular markers closely linked to genes for important agronomic traits, to screen progeny efficiently during breeding, and to finely characterize the genotype maps of improved varieties, providing a fast and efficient means and platform for marker-assisted selection and raising its efficiency and accuracy to a new level.
In summary, the sequencing-based high-throughput genotype identification method of the present invention will facilitate the solution of complex biological problems and the improvement of crop breeding.
The present invention is further described below with reference to specific examples. It should be understood that these examples are only intended to illustrate the present invention and not to limit its scope. Experimental methods in the following examples for which specific conditions are not given generally follow conventional conditions or the conditions recommended by the manufacturer. Unless otherwise stated, percentages and parts are by weight.
Example 1
1.1 Generation of genome-wide simulated data from real sequencing data
1.1.1 Rice materials and simulated data
High-depth sequencing data already available in the laboratory for the indica rice 93-11 (Oryza sativa ssp. indica cv. 93-11), Shuohui 70, Wushan Simiao and Huang Huazhan were used as the parental materials, and simulated progeny fastq data were obtained by aligning, screening and combining the real fastq data of the parents. For the two-parent simulation, the inventors used 93-11 and Wushan Simiao as the simulated parents and simulated, on every rice chromosome, homozygous regions of each parent and heterozygous regions in which the two parents overlap, in order to test how the method of the present invention determines recombination breakpoints and heterozygous regions. For the multi-parent simulation, 93-11, Shuohui 70, Wushan Simiao and Huang Huazhan were used as the simulated parents, and 100 simulations were performed with different combinations of recombination breakpoints on each rice chromosome. Each chromosome carried 4-6 recombination breakpoints on average, for a total of 50-60 across the genome, so as to simulate as closely as possible the recombination within a multi-parent rice genetic population and to verify the accuracy of the method of the present invention.
1.1.2 SNP identification in the simulated data
The sequencing data of the four parents 93-11, Shuohui 70, Wushan Simiao and Huang Huazhan were aligned to IRGSP 1.0, the complete sequence of the 12 chromosomes of the japonica rice Nipponbare (japonica cv. Nipponbare) sequenced by the International Rice Genome Sequencing Project (IRGSP) (http://rice.plantbiology.msu.edu/annotation_pseudo_current.shtml), using bwa 0.7.17-r1188 [13]. The HaplotypeCaller program of the GATK package [14] (parameter -ERC GVCF) was then used to identify candidate SNPs in the parents and the simulated progeny. After the intermediate variant file (g.vcf) of each parent and simulated progeny had been obtained, the GenomicsDBImport program of GATK was used to merge all the intermediate files, the GenotypeGVCFs program was used to export the merged variant file, the SelectVariants program was used to select the required SNP sites, and the VariantFiltration program (parameters: --cluster-size 3 --cluster-window-size 10 --filter-expression "QD<10.00" --filter-name lowQD --filter-expression "FS>15.000" --filter-name highFS --genotype-filter-expression "DP>50||DP<5" --genotype-filter-name InvalidDP) was used to filter all SNP sites, yielding high-quality SNP sites. The simulated progeny were then genotyped on this basis.
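For reference, the VariantFiltration parameters listed above can be assembled into a single argument list. The sketch below only builds the command; it does not run it. The `gatk` launcher name, the `-R`/`-V`/`-O` arguments and the file names are assumptions added for illustration, while the filter settings are copied from the text.

```python
def variant_filtration_args(reference, vcf_in, vcf_out):
    """Build a GATK VariantFiltration argument list with the filters quoted in the text."""
    return [
        "gatk", "VariantFiltration",
        "-R", reference, "-V", vcf_in, "-O", vcf_out,  # hypothetical file arguments
        "--cluster-size", "3",
        "--cluster-window-size", "10",
        "--filter-expression", "QD<10.00", "--filter-name", "lowQD",
        "--filter-expression", "FS>15.000", "--filter-name", "highFS",
        "--genotype-filter-expression", "DP>50||DP<5",
        "--genotype-filter-name", "InvalidDP",
    ]
```

Passing the list to `subprocess.run(args, check=True)` would execute it on a system where GATK is installed; each `--filter-expression` is paired with the `--filter-name` that follows it.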
1.2 Program development for the sequencing-based genotyping pipeline
The sequencing-based genotyping pipeline has to process massive amounts of data, applies several different algorithms, and uses some existing software, such as sequence alignment software and QTL analysis software. The inventors therefore developed several perl and python scripts that implement the above steps and combine them into a complete, easy-to-use and broadly applicable pipeline.
After the SNP information of the parents and progeny has been obtained with GATK, a python script slides a fixed-length window along all SNP sites so that the SNPs identified in each individual are analyzed region by region: it reads the genotypes window by window, then determines the recombination breakpoints and constructs the recombination segment map. In addition, a perl script uses the intermediate files produced by this program to generate a recombination segment map in PNG format for each individual, so that users can browse its overall genotype at a glance. Drawing the map requires the GD module for Perl.
Next, another script, Bin2MCD, generates a high-density map composed of recombination bins [19] for subsequent QTL analysis. Once the phenotypes have been evaluated and the trait data prepared, the output file can be used directly by QTL analysis packages, including Windows QTL Cartographer V2.5 [17], to identify QTLs.
1.3 Genotype identification in a rice DH population
1.3.1 Rice DH population and three-parent population
The rice DH population used in this study was constructed by the laboratory of the National Center for Gene Research, Chinese Academy of Sciences. Its two parents are Kasalath and the japonica rice Nipponbare (japonica cv. Nipponbare). The DH population consists of lines derived from F2 progeny after many years of selfing. The inventors selected several dozen lines for genotype identification and analysis. The three-parent rice plants used in this study were also constructed by the laboratory of the National Center for Gene Research, Chinese Academy of Sciences. The three parents are Wushan Simiao, 93-11 and Shuohui 70. The plants in this population were produced by selfing of the hybrid progeny of the three parents, and their genomes carry extensive recombination information.
1.3.2 Sequencing-based genotyping of the rice DH population and the three-parent population
The rice DH population was genotyped with the method of the present invention, and Bin2MCD was used to generate a high-density map composed of recombination bins. To assess the accuracy of the method, genotyping and high-density bin mapping were also performed with the method published in 2010. The high-depth (20-30x) sequencing data of the two parents, Kasalath and Nipponbare, were aligned to the Nipponbare reference genome IRGSP 1.0 with bwa, GATK was used to identify the high-quality SNPs of each parent, and a perl script then substituted the SNPs at the specified sites into the Nipponbare genome to produce a pseudo reference for each parent. The low-coverage sequencing data of the DH population were then aligned to the pseudo references of the two parents for genotype identification.
结果result
2.1基于真实亲本的水稻多亲本模拟数据及基因型鉴定2.1 Rice multi-parent simulation data and genotype identification based on real parents
2.1.1水稻基因组信息的模拟数据2.1.1 Simulation data of rice genome information
如图1所示,本发明人模拟的两亲本来源子代的预计基因型信息应当与与之符合。制作的模拟数据包含三种情况:五山丝苗的纯合区域,93-11的纯合区域和五山丝苗、93-11的杂合区域。图中展示了各个区域应有的长度以及重组断裂点的位置。模拟数据的制作是基于两个亲本的真实测序数据,首先分别将两个亲本的fastq数据比对到水稻日本晴基因组,然后根据得到的sam文件中的比对信息(染色体及位置信息)筛选所需的fastq信息,然后将这些来源于两个亲本的fastq数据重新形成了模拟的杂交子代fastq数据。As shown in FIG. 1 , the predicted genotype information of the progeny from the two parents simulated by the inventors should be consistent with it. The generated simulation data includes three cases: the homozygous region of Wushan silk seedlings, the homozygous region of 93-11 and the heterozygous region of Wushan silk seedlings and 93-11. The figure shows the expected length of each region and the location of the recombination breakpoint. The production of the simulated data is based on the real sequencing data of the two parents. First, the fastq data of the two parents are aligned to the rice Nipponbare genome, and then the required alignment information (chromosome and position information) in the obtained sam file is screened. The fastq information from the two parents was then reformatted to form the simulated hybrid progeny fastq data.
如图2所示,采取了类似的方法,制作了来源于四个亲本五山丝苗,黄华占,93-11,硕恢70的模拟子代数据。其预计的基因组基因型信息及重组断裂点应与图中的信息相符合。As shown in Figure 2, a similar method was adopted to produce simulated progeny data derived from four parents Wushan Simiao, Huang Huazhan, 93-11, and Shuohui 70. Its predicted genomic genotype information and recombination breakpoints should be consistent with the information in the figure.
2.1.2 Genotyping of the simulated data
The fastq data of the simulated progeny and of the two parents, Wushan Simiao and 93-11, were aligned to the rice reference genome IRGSP 1.0. GATK was then used to call genome-wide variants in the two parents and the simulated progeny, and the calls were filtered to obtain high-quality SNP sites.
As shown in Figure 3, once the required SNPs were obtained, a "sliding window" method was used to evaluate the genome-wide SNPs. Within each window the two parents are scored and compared; if one parent scores higher in the segment, the segment is called homozygous for that parent (red or blue in the figure). When the two parents' scores are similar, the segment is called heterozygous for the two parents (yellow in the figure). To quantify the accuracy of the calls, the inventors divided the whole genome into thousands of small regions of 100 kb (or of 20-200 kb) and measured how closely the calls of the present method agree with the standard map. By this measure, comparing the genotypes called from the simulated data with the true simulated genotypes, the two-parent identification accuracy reached 89.61%.
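The accuracy measure described above can be sketched as follows. This is a minimal illustration, not the inventors' script: the segment format `(start, end, label)`, the function names, and the choice to represent each bin by its midpoint are all assumptions made for the example.

```python
def region_at(segments, pos):
    """Return the genotype label of the segment covering pos, or None."""
    for start, end, label in segments:
        if start <= pos < end:
            return label
    return None


def bin_accuracy(true_segments, called_segments, chrom_length, bin_size=100_000):
    """Fraction of fixed-size bins whose called genotype matches the truth.

    Assumption: each bin is represented by the genotype at its midpoint.
    """
    matches = 0
    total = 0
    for bin_start in range(0, chrom_length, bin_size):
        mid = min(bin_start + bin_size // 2, chrom_length - 1)
        total += 1
        if region_at(true_segments, mid) == region_at(called_segments, mid):
            matches += 1
    return matches / total


# Toy 1 Mb chromosome: the called breakpoint is 100 kb away from the true one.
truth = [(0, 600_000, "P1"), (600_000, 1_000_000, "P2")]
called = [(0, 500_000, "P1"), (500_000, 1_000_000, "P2")]
accuracy = bin_accuracy(truth, called, chrom_length=1_000_000)
```

In this toy case one of the ten 100 kb bins disagrees with the truth, so a misplaced breakpoint lowers the score in proportion to the length of the miscalled stretch.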
The inventors also genotyped the simulated progeny data with the published SEG-Map method. The fastq files of the simulated data were aligned to the pseudo references of the two parents, parent-specific reads were selected by the software, SNP site information was determined from the alignment positions, and genotypes were then called with the sliding window method. This method has been validated in detail, both theoretically and by data simulation, in the published article, and is accurate and practical. Measured in the same quantitative way, SEG-Map reached an accuracy of 89.32%, close to that of the present method, which indicates that the present method is likewise feasible and accurate.
SEG-Map is reliable for genotyping progeny of two parents, and the inventors have long used it for genome analysis of rice materials. However, it cannot genotype materials derived from multiple parents, and the present method is intended to solve this multi-parent genotyping problem. As shown in Figure 5, the inventors used the SNP-based sliding window method to genotype simulated progeny derived from four parents. The four parents are scored within each window, and if one parent scores highest, the region is called homozygous for that parent; red, blue, green and yellow in the figure mark the homozygous regions of the four parents. The four-parent fastq data were simulated 100 times, and accuracy was again quantified by dividing the genome into small regions. Compared with the standard map, the average accuracy of the present method on the four-parent simulated data was 92.10%.
2.1.3 Determination of recombination breakpoints
As the "window" slides along the chromosome, genotypes are read from the scores of the two parental SNP types. A genotype does not change until a recombination breakpoint is encountered. The inventors found two types of breakpoint: one separates two homozygous genotypes, and the other separates a homozygous segment from a heterozygous segment. The former accounts for the vast majority of breakpoints in RILs, while the latter occurs mostly in F2 populations. When a sliding window meets a "homozygous/homozygous" breakpoint, the homozygous genotype briefly becomes heterozygous and then changes from heterozygous to a homozygous genotype again. When a sliding window meets a "homozygous/heterozygous" breakpoint, the homozygous genotype becomes heterozygous and then changes to homozygous again, at which point the boundary between the homozygous region and the heterozygous region can be determined.
2.1.4 Effect of window size on multi-parent calling accuracy
When using this sequencing-based method for genotyping, suitable analysis parameters must be set. The first consideration is whether the size of the sliding window, that is, the number of SNPs each window contains within a given physical length, affects the accuracy of genotype calling.
As shown in Figure 6, different window sizes were applied to the final SNP set filtered from the four-parent simulated data, and window size was found to affect the final accuracy. When the window was small (smaller than 199), accuracy was below 90%; when the window was increased to 199, accuracy reached 93.72%; increasing the window further changed the accuracy little, so accuracy does not keep rising with window size. A larger window also requires more computing resources and run time, and this cost becomes more pronounced when large populations are processed. Weighing time cost against accuracy, the inventors consider a window size of 199 (or of 180-220) a reasonable choice.
2.1.5 Effect of sequencing depth on multi-parent calling accuracy
Sequencing depth also has an important influence on genotyping, and the SEG-Map software can genotype fairly accurately at low sequencing depth. The present method was therefore tested at different depths.
As shown in Figure 7, genotyping accuracy was tested at three depths: 0.2x, 1.5x and 3x. At each depth, fastq data were simulated 100 times, genotypes were called from the variant information, and accuracy was measured as the degree of agreement with the standard map. Accuracy rose slightly with depth, but less than expected, and the final accuracy did not exceed 95%. The inventors therefore made further improvements.
2.2 Genotype identification method based on a sliding window over SNP sites
2.2.1 Main steps of the data analysis pipeline
To make this new sequencing-based genotyping method widely usable, the inventors organized and optimized the data processing pipeline. The pipeline directly processes single-end or paired-end short reads produced by next-generation sequencing and ultimately constructs a genetic map of the recombinant population.
For a mapping population derived from two parents, genome-wide SNPs between the two parents must be identified before the analysis pipeline is run. These SNPs can be obtained from high-coverage whole-genome sequencing, from existing genomic SNP information in the rice haplotype map, or from low-coverage whole-genome sequencing combined with imputation of missing genotypes (SNPs). Because SNPs between two parental varieties can be identified quickly and cheaply, sequencing-based genotyping of a recombinant population depends mainly on the subsequent analysis: calling genotypes, determining recombination breakpoints, and constructing the genetic linkage map.
The functions, steps, and software (scripts) of the data analysis are shown in Figure 8. The first step comprises several tasks that can run in parallel. A number of individuals from the recombinant population and the parental materials are sequenced together by next-generation high-throughput sequencing. The resulting fastq files are aligned with bwa and processed with GATK to obtain high-quality SNP information. The SNP sites used for the final genotype calls should meet the following requirements. First, the SNP sites should cover the whole genome as far as possible, without gaps in particular regions. Second, for every SNP site, the SNP information (position and genotype) of both parents and of the simulated progeny must be known; if it is unknown in any of the three, the site is removed. Third, rice parents are generally inbred homozygous lines with essentially no heterozygous sites in the genome, so a SNP site called heterozygous in a parent is considered unreliable; SNP sites where either parent is heterozygous were therefore removed.
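The second and third requirements lend themselves to a simple per-site filter. The sketch below is illustrative only: it assumes genotypes are given as VCF-style strings such as `0/0`, `0/1` or `./.`, and the function names are hypothetical rather than part of the described pipeline. The first requirement (genome-wide coverage) is a property of the whole SNP set and is not checked here.

```python
def split_alleles(gt):
    """Split a VCF-style genotype string such as '0/1' or '1|1' into alleles."""
    return gt.replace("|", "/").split("/")


def is_called(gt):
    """A genotype is known only if no allele is missing ('.')."""
    return "." not in split_alleles(gt)


def is_homozygous(gt):
    """True for called genotypes whose alleles are all identical."""
    return is_called(gt) and len(set(split_alleles(gt))) == 1


def keep_site(parent_gts, progeny_gt):
    """Keep a site only if the progeny is called and every parent is
    called and homozygous (requirements two and three above)."""
    if not is_called(progeny_gt):
        return False
    return all(is_homozygous(gt) for gt in parent_gts)


# (chromosome, position, parent genotypes, progeny genotype)
sites = [
    ("chr01", 1200, ["0/0", "1/1"], "0/0"),  # kept
    ("chr01", 3400, ["0/0", "0/1"], "1/1"),  # dropped: heterozygous parent
    ("chr01", 5600, ["0/0", "1/1"], "./."),  # dropped: progeny unknown
    ("chr01", 7800, ["./.", "1/1"], "1/1"),  # dropped: parent unknown
]
kept = [(chrom, pos) for chrom, pos, parents, child in sites
        if keep_site(parents, child)]
```

The same function works unchanged for three or four parents, since `parent_gts` is a list of any length.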
After the high-quality genome-wide SNPs of the two parents and the progeny have been filtered, a python script, SNPwindow, calls the genotype of the progeny. The script outputs two files, an rlt file and a bin file. The rlt file records the genotype call at each SNP position, and the bin file records the distribution of recombination breakpoints over the 12 chromosomes of the genome. These two files are the basis for subsequent plotting and linkage analysis.
As shown in Figure 9, the rlt and bin files are usually first passed to a perl script, SNP2png, which draws a genotype map in PNG format. The map is drawn from the SNP genotypes called by the program and the positions of the genome-wide recombination breakpoints, with different colors representing different genotypes.
In practice a large number of progeny must be processed. The SNP information of each individual is extracted and analyzed with the SNPwindow script to call genotypes, determine recombination breakpoints, and construct the recombination map. SNP2png is first used to draw the genotype map of each individual, giving an overall view of its genotype; the script Bin2MCD then aligns the recombination maps of all individuals to produce a recombination bin map.
As shown in Figure 10, a perl script can also be used to visualize the genotypes of the whole population. The programs and scripts used in the pipeline are shown in italics and form a series of analysis steps. The genotype data generated at the end of the pipeline can be used directly by other software (including MapMaker and JoinMap) to construct linkage maps.
The final output of the analysis software is a set of recombination bins, typically at a resolution of one bin per 100 kb, or even one bin per 10 kb. The genotype results of the mapping population can be imported into programs such as MapMaker [16] or JoinMap [32] to construct the genetic map. Once the genetic map is available, QTL analysis can be performed.
As shown in Figure 11, this genetic map is much finer in scale than maps produced with most traditional molecular markers. The software package is compatible with multiple platforms (e.g. Unix, Linux and Windows). Besides the perl environment itself, the GD module must be installed, because the pipeline includes plotting steps.
2.2.2 Detailed procedure for determining recombination breakpoints
Step 1: Construct the SNP "strings".
Take a window size of 15 (win=15) as an example. The genotypes of all SNPs on the 12 chromosomes of the two parents and the progeny are compressed, in order, into strings. Whatever the actual distance between two adjacent SNPs, all gaps are removed.
The SNPs on the 12 chromosomes thus become 12 continuous strings (Figure 12). In the figure, blue represents the genotype of parent 1 and red the genotype of parent 2. For a progeny of two parents, each SNP site can carry one of three genotypes: homozygous for parent 1 (blue), homozygous for parent 2 (red), or heterozygous (yellow).
The genomes of artificially bred rice parental materials are generally considered highly homozygous, and the genomes of some multi-generation selfed recombinant rice populations are also largely homozygous, with heterozygous regions at only some chromosomal positions. The SNP sites included in the analysis were therefore filtered first, and sites where either parent is heterozygous were excluded, because such sites cannot be scored reliably. In addition, if the sequencing depth of the progeny is low, SNP sites called heterozygous in the progeny can be filtered out, because heterozygous calls made at low depth are unreliable and are likely to be miscalls caused by sequencing errors.
Step 2: Score the two parents within a window
Following Mendelian inheritance, a score is computed for every SNP site in a sliding window, and the sum for each parent is taken as that parent's score at the chromosomal position of the window. Each parent's score measures how closely the progeny matches that parent. The preferred scoring rules are shown in Table A; they are formulated according to the laws of biological inheritance.
Table A. Genotype scoring rules
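The window scoring can be sketched as below. Table A is available here only as an image, so the per-site values in `site_score` (1 for a homozygous match, 0.5 when a heterozygous progeny carries the parent's allele, 0 otherwise) are placeholder assumptions, not the patent's scoring rules. The data layout and function names are likewise hypothetical. The window slides over the gap-free SNP string from Step 1, one SNP at a time.

```python
def site_score(parent_allele, progeny_gt):
    """Score one SNP for one (homozygous) parent.

    Placeholder values, NOT the rules of Table A.
    """
    a, b = progeny_gt
    if a == b:
        return 1.0 if a == parent_allele else 0.0
    return 0.5 if parent_allele in (a, b) else 0.0


def window_scores(parent_alleles, progeny_gts, win=15):
    """Sum per-site scores over every window of `win` consecutive SNPs."""
    per_site = [site_score(p, g) for p, g in zip(parent_alleles, progeny_gts)]
    return [sum(per_site[i:i + win]) for i in range(len(per_site) - win + 1)]


# Toy chromosome with six SNPs and a breakpoint after the third SNP.
parent1 = ["A"] * 6
parent2 = ["G"] * 6
progeny = [("A", "A")] * 3 + [("G", "G")] * 3

p1_curve = window_scores(parent1, progeny, win=3)
p2_curve = window_scores(parent2, progeny, win=3)
```

With these placeholder values the two curves cross as the window passes the breakpoint: parent 1 falls from the maximum to zero while parent 2 rises from zero to the maximum.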
Figure PCTCN2021115146-appb-000006
Step 3: Call the genotype of a chromosomal region from the scores
Sliding the window across the genome-wide SNP sites yields the score of each parent along every chromosome. With the score as the ordinate and the position of each window on the chromosome as the abscissa, a score curve is drawn for each parent. The genotype of each chromosomal segment is called from the characteristics of the parental score curves. As shown in Figure 13, sliding window scoring was applied to the simulated four-parent progeny, and the score curves of the four parents were drawn from their scores.
The score curves of the four parents show two distinct patterns in different regions of chromosome 11: a stable plateau and a fluctuating phase. The state of each parental curve within the same region is therefore used to call the progeny genotype in that region.
For example, in the region from 1 bp to 10,780,000 bp in the figure, the yellow curve (parent 4) scores close to the maximum and is stable, without large fluctuations; the standard deviation is used to measure the fluctuation of the scores. In this region parent 4 has a high score and a small standard deviation, whereas the scores of the other three parents fluctuate within the range 0-200 with a large standard deviation, so the region is called homozygous for parent 4. In the same way, the progeny genotype in the different regions of the 12 chromosomes can be called from the scores. In the figure, the bar labeled "true" gives the true genotype of the simulated progeny for each chromosomal segment, and the bar labeled "judge" gives the genotype called by the present method; the two largely agree.
Step 4: Calling heterozygous regions
The calling of heterozygous regions is illustrated with the simulated progeny of two parents. According to genetic principles, even in a hybrid progeny derived from multiple parents, any given chromosomal region originates from at most two parents. Whether a region is heterozygous can therefore be decided from the score curves of two parents in that region. As shown in Figure 14, the simulated progeny was genotyped; the bar labeled "true" gives the true genotype of the simulated progeny for each chromosomal segment, and the bar labeled "judge" gives the genotype called by the present method. As before, when one parent has a high score with a small standard deviation and the other parent's score fluctuates considerably with a large standard deviation, the region is called homozygous for the former (orange or blue regions in the figure).
When the scores of the two parents in a segment (6,000,000 bp to 15,000,000 bp in the figure) are similar, and both fluctuate with a large standard deviation, the segment is called heterozygous for the two parents (yellow region in the figure). This call also agrees with the true genotype.
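The decision logic of Steps 3 and 4 can be summarized in a small classifier over the per-parent score curves of one region. This is a sketch under stated assumptions: the thresholds `high` and `stable`, the normalization by a maximum window score, and the return labels are all hypothetical, since the description gives the criteria only qualitatively. Regions where more than one parent scores high and stable are left undecided.

```python
from statistics import mean, pstdev


def summarize(scores, max_score):
    """Mean and standard deviation of a score curve, scaled to [0, 1]."""
    return mean(scores) / max_score, pstdev(scores) / max_score


def classify_region(curves, max_score, high=0.9, stable=0.05):
    """Call one region from {parent: [window scores]}.

    `high` and `stable` are illustrative thresholds, not values from the source.
    """
    stats = {p: summarize(s, max_score) for p, s in curves.items()}
    plateau = [p for p, (m, sd) in stats.items() if m >= high and sd <= stable]
    if len(plateau) == 1:
        return plateau[0]      # one high, stable parent: homozygous
    if len(plateau) > 1:
        return "unknown"       # parents indistinguishable here: defer
    # No plateau: report the two best-scoring parents as a heterozygous call.
    top_two = sorted(stats, key=lambda p: stats[p][0], reverse=True)[:2]
    return "het:" + "/".join(sorted(top_two))


# Assumes a window of 199 SNPs scoring at most one point each.
homozygous = classify_region(
    {"P1": [198, 199, 197, 199], "P2": [40, 150, 20, 120]}, max_score=199)
heterozygous = classify_region(
    {"P1": [60, 140, 90, 150], "P2": [130, 70, 120, 60]}, max_score=199)
undecided = classify_region(
    {"P1": [199, 199, 198, 199], "P2": [198, 199, 199, 198],
     "P3": [10, 100, 50, 20]}, max_score=199)
```

Scaling both statistics by the maximum score keeps the thresholds independent of the window size, which matters when window sizes such as 15 and 199 are compared.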
Step 5: Summary of the calling criteria
The core idea of the present method is to read directly the true SNP genotypes of the parents and the progeny in a segment, quantify how similar the progeny is to each parent in that segment, and use the numerical characteristics of each parent's score curve (its level and standard deviation) to build a simplified analysis model from which the genotype of each segment is called. The criteria cover the following cases:
First, if one parent has a high and stable score in the segment (its score curve is close to a plateau) while the score curves of the other parents fluctuate strongly with a large standard deviation (rising and falling in peaks), the segment is called homozygous for that parent (Figure 15).
Second, when two parents are being evaluated and both show large score fluctuations with a large standard deviation in a segment, the region can be inferred to be heterozygous for the two parents (Figure 16).
In multi-parent analysis, it is often only possible to detect a candidate heterozygous segment, that is, a segment in which no parent has a high and stable score. Because the scores of all parents fluctuate strongly in such a segment, only the two most probable parents can be reported, based on the numerical characteristics.
Third, because the method relies largely on measuring the genotypic similarity between the progeny and each parent, a segment in which two or more parents closely resemble the progeny will show two or more parental curves with high scores and small standard deviations. This indicates that the parents concerned are very similar in that segment, and the call can be deferred.
As shown in Figure 17, the deferred region is first labeled "unknown", and its genotype is then determined from the genotypes of the regions flanking it.
As shown in Figure 18, if the genotypes on both sides are the same, the region is assigned that genotype. If the genotypes on the two sides differ, the midpoint of the region is treated as a recombination breakpoint, and each half is assigned the genotype of its adjacent side.
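The flank-based rule can be sketched as follows. The segment format `(start, end, label)` and the function names are assumptions for illustration. The handling of an "unknown" region at a chromosome end, where only one flank exists, is not specified in the description; the sketch simply copies the single available flank, which is an assumption.

```python
def merge_adjacent(segments):
    """Merge consecutive segments that carry the same label."""
    merged = []
    for start, end, label in segments:
        if merged and merged[-1][2] == label and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end, label)
        else:
            merged.append((start, end, label))
    return merged


def resolve_unknown(segments):
    """Assign 'unknown' segments from their flanking genotypes."""
    resolved = []
    for i, (start, end, label) in enumerate(segments):
        if label != "unknown":
            resolved.append((start, end, label))
            continue
        left = resolved[-1][2] if resolved else None
        right = next((lab for _, _, lab in segments[i + 1:] if lab != "unknown"),
                     None)
        if left is None and right is None:
            resolved.append((start, end, "unknown"))      # nothing to copy
        elif left is None or right is None or left == right:
            resolved.append((start, end, left or right))  # same or single flank
        else:
            mid = (start + end) // 2                      # breakpoint at midpoint
            resolved.append((start, mid, left))
            resolved.append((mid, end, right))
    return merge_adjacent(resolved)


calls = [(0, 100, "P1"), (100, 200, "unknown"), (200, 300, "P1"),
         (300, 400, "unknown"), (400, 500, "P2")]
final_calls = resolve_unknown(calls)
```

In the example, the first unknown region lies between two P1 regions and becomes P1, while the second lies between P1 and P2 and is split at position 350.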
2.3 Sequencing-based genotyping of a rice DH population
2.3.1 Rice DH population
The two parents of the rice DH population used were Kasalath and Nipponbare. A DH population is formed by inducing haploids from the F1 of a biparental cross and doubling their chromosomes. The plants are homozygous and their selfed progeny are pure lines, so experiments can be replicated over multiple years and locations, which makes the population ideal material for studying genotype-environment interaction.
2.3.2 SNP identification between the two parents
High-depth sequencing data (20x-30x) of the two parents, Kasalath and Nipponbare, were aligned to the rice reference genome IRGSP 1.0 with bwa, and GATK was then used to obtain the required high-quality SNPs. The SNP information of the parents and of all progeny was merged into a single vcf file, from which the required variant sites can be conveniently extracted.
2.3.3 Genotyping of the DH population
The average sequencing depth of each progeny in the DH population is about 0.02x, which is low-depth data. The SNP information of each progeny was extracted together with the parental information and evaluated with the SNPwindow script, yielding an rlt file and a bin file for each progeny.
As shown in Figure 19, the SNP2png script was used to visualize the genotyping results from the files produced in the previous step. The homozygous genotypes of the two parents can be seen in the figure (red for Kasalath, blue for Nipponbare). For a population that has been selfed for many generations, heterozygous calls are less reliable; they may be miscalls caused by sequencing errors or by low polymorphism between the two parents in the region.
Next, the Bin2MCD script takes the bin files of the whole population as input and computes a map file describing the overall genotype distribution. The map file divides the whole genome into many small bins, and the genotype of each bin is assigned from the genotype calls of the individuals.
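One simple way to build such a population-wide bin map is to take the union of all individuals' breakpoints as bin edges and look up each individual's genotype inside every bin. The sketch below shows that idea only; it is not the Bin2MCD script, whose internal rules (for example, any minimum bin length) are not given in this description. The segment format and function names are assumptions.

```python
def genotype_at(segments, pos):
    """Genotype label of the segment covering pos, or None."""
    for start, end, label in segments:
        if start <= pos < end:
            return label
    return None


def build_bin_map(individual_segments, chrom_length):
    """Return (bins, table): bins delimited by every breakpoint seen in the
    population, and each individual's genotype in each bin."""
    edges = {0, chrom_length}
    for segments in individual_segments.values():
        for start, end, _ in segments:
            edges.update((start, end))
    ordered = sorted(edges)
    bins = list(zip(ordered[:-1], ordered[1:]))
    table = {
        name: [genotype_at(segments, (start + end) // 2) for start, end in bins]
        for name, segments in individual_segments.items()
    }
    return bins, table


population = {
    "dh001": [(0, 400, "K"), (400, 1000, "N")],
    "dh002": [(0, 700, "N"), (700, 1000, "K")],
}
bins, table = build_bin_map(population, chrom_length=1000)
```

Because no individual has a breakpoint inside a bin, every individual has a single well-defined genotype per bin, and each row of `table` can serve as marker data for linkage mapping.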
As shown in Figure 20, once the map file was obtained, a perl script was used to visualize the genotypes of the whole population and to compute the proportion of each genotype at every bin position, an important parameter in population genetics. The red-blue proportion plot in Figure 20 shows the proportions of the three genotypes in each bin. This visualization gives a quick, direct view of the population genotypes. Combined with phenotype data, the output map file can be used directly with analysis software such as winQTL for QTL mapping.
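The per-bin genotype proportions can be computed directly from a table of bin genotypes. The sketch below assumes a hypothetical layout in which each individual maps to a list of genotype labels, one per bin; the labels "K" and "N" stand in for the Kasalath and Nipponbare genotypes.

```python
from collections import Counter


def genotype_proportions(bin_table):
    """For each bin, return {genotype: fraction of individuals}."""
    names = list(bin_table)
    n_bins = len(bin_table[names[0]])
    proportions = []
    for i in range(n_bins):
        counts = Counter(bin_table[name][i] for name in names)
        proportions.append({g: c / len(names) for g, c in counts.items()})
    return proportions


bin_table = {
    "dh001": ["N", "K"],
    "dh002": ["N", "N"],
    "dh003": ["K", "K"],
    "dh004": ["N", "K"],
}
props = genotype_proportions(bin_table)
```

Bins whose proportions depart strongly from the expected ratio are candidates for segregation distortion and are worth inspecting before QTL analysis.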
2.3.4 Genotyping of rice three-parent material
As shown in Figure 16, multi-parent genotyping was performed on three-parent progeny material bred in the laboratory. The progeny was sequenced to about 0.2x; its three parents are Wushan Simiao, 93-11 and Shuohui 70, each sequenced to about 20x-30x. The SNP information of the progeny and the three parents was merged into a single vcf file, from which the final high-quality SNPs were filtered. The SNPwindow script was then used to call the genotype of the progeny: within a window, if one parent scores highest, the region is called homozygous for that parent.
As shown in Figure 21, the multi-parent calling program produced the bin file of recombination breakpoints, and a perl script was used to visualize the result, so that the genotypes on the 12 chromosomes can be seen directly in the image. Red regions correspond to parent 1 (Wushan Simiao), blue regions to parent 2 (93-11), green regions to parent 3 (Shuohui 70), and yellow regions are heterozygous.
As shown in Figure 22, the material was also genotyped with the SEG-Map method. The laboratory's earlier genotyping of these plants was based mainly on two parents, Wushan Simiao and 93-11, and could distinguish three genotypes: homozygous Wushan Simiao, homozygous 93-11, and heterozygous for the two. From the three-parent results, the inventors found that segments called heterozygous in the two-parent analysis probably correspond to regions homozygous for the third parent. The present method can therefore make up for this limitation of the SEG-Map software while maintaining accuracy, and solves the problem of multi-parent genotyping.
2.3.5 Genotype identification of simulated four-parent rice material
As shown in Figure 23, the inventors took real FASTQ sequencing data from four rice parents held in the laboratory (93-11, Shuohui 70, Wushan Simiao and Huanghuazhan), extracted the reads of the corresponding regions segment by segment according to the alignment results, and manually combined and filtered them to produce a simulated progeny dataset. Because the true genotype and the recombination breakpoints of this simulated progeny are known, the data can be used to evaluate the feasibility and accuracy of the present invention.
Using the calling method, the inventors identified several dozen recombination breakpoints across the whole genome, and the chromosomal regions assigned to each parent were largely consistent with the true genotypes.
Figure 24 shows the true genotype information of the simulated progeny.
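Agreement between the called and the true genotype maps can also be expressed as a single number. The sketch below is one possible measure and is not taken from the source: it samples positions along a chromosome and reports the fraction at which the two maps assign the same parent. The bin representation `(start, end, genotype)` and both function names are assumptions.

```python
def genotype_at(bins, pos):
    """Return the genotype of the bin containing pos, or None if no bin covers it."""
    for start, end, genotype in bins:
        if start <= pos <= end:
            return genotype
    return None


def concordance(called_bins, true_bins, positions):
    """Fraction of sampled positions where the called genotype equals the true one."""
    matches = sum(
        genotype_at(called_bins, p) == genotype_at(true_bins, p) for p in positions
    )
    return matches / len(positions)
```

For example, if a true breakpoint at 1,000 bp is called at 1,200 bp on a 2,000 bp segment sampled every 100 bp, two of the twenty sampled positions disagree and the concordance is 0.9.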
Comparing the two in detail, the inventors found differences in some local regions, and examined the intermediate rlt output file of the calling process to find their cause. Two causes are likely. First, because the progeny sequencing depth is low, only part of the genome-wide variation is captured, and some important parent-distinguishing sites may be missed, so the true parent cannot be identified in certain regions. Second, two or more of the parents are very similar in certain regions and show no polymorphism there; this is not caused by sequencing errors or sequencing depth. Such highly similar regions may be left uncalled at first, and their genotype then depends on the genotype information on both flanks. As a result, some regions cannot be called accurately and are assigned the most likely parental genotype of the flanking regions.
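The flank-based treatment of uncalled regions might look like the following sketch. The specific rule is an assumption for illustration: a run of uncalled windows takes the flanking call when both sides agree, a run at a chromosome end takes its single neighbor, and a run between two different calls is left uncalled. The function name `fill_from_flanks` is hypothetical.

```python
def fill_from_flanks(calls):
    """Replace runs of None with the flanking call when the flanks allow it.

    Both flanks agree -> fill with that call.
    Run at a chromosome end -> fill with the single neighboring call.
    Flanks disagree -> leave as None (assumed rule, not from the source).
    """
    filled = list(calls)
    i = 0
    while i < len(filled):
        if filled[i] is not None:
            i += 1
            continue
        j = i
        while j < len(filled) and filled[j] is None:
            j += 1  # j is the first index after the run of None
        left = filled[i - 1] if i > 0 else None
        right = filled[j] if j < len(filled) else None
        if left is None:
            choice = right
        elif right is None or left == right:
            choice = left
        else:
            choice = None
        if choice is not None:
            filled[i:j] = [choice] * (j - i)
        i = j
    return filled
```

Leaving a run uncalled when its flanks disagree is the conservative choice; it marks exactly the regions where a breakpoint may be hidden inside a non-polymorphic stretch.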
The analysis shows clearly that each region of the genome can be assigned to its most likely true parent. This is difficult to achieve with the previously published SEG-Map software or with traditional analysis methods, and it provides a multi-parent analysis method for the breeding of different crops.
3. Discussion
Multi-parent populations have great potential in genetic analysis. Selecting multiple parents increases the genetic diversity of the population, and combining them into one population by crossing and selfing (or inbreeding) increases the number of recombination events. Multi-parent populations thus not only raise the recombination count and help uncover the genetic basis of complex traits but, because of the rich genetic basis of the selected parents, also have great potential in breeding. Compared with biparental populations, multi-parent populations have more parents, which increases population variation (both allelic and phenotypic diversity), improves mapping accuracy and precision, and raises the efficiency of QTL detection; the large number of accumulated recombination events also improves the resolution of QTL mapping. Because the parents of a multi-parent population are selected more carefully, that is, under stricter criteria, and multiple parents broaden the genetic basis, the QTL results can be applied in breeding research. Compared with natural populations, a multi-parent population is built by evenly intermixing multiple parents, so its pedigree is known and the details of its construction are recorded; the experimental design can therefore avoid population stratification and so control false positives in mapping results.
Recombinant populations are the foundation of Mendelian genetics experiments and have long been a key resource in the study of genes, genomes and genetic variation. Genotyping a mapping population, however, has always been laborious and time-consuming: it involves costly and tedious marker development and the genotyping of hundreds of individuals with hundreds of markers. Moreover, the maps obtained this way have relatively low resolution [34-36]. By applying second-generation sequencing technology, the inventors developed a fast, efficient, low-cost, information-rich and reliable genotyping method. With this method, ultra-high-resolution genotyping of a typical mapping population of several hundred individuals can be completed by a genome sequencing service center within weeks, rather than the months or even years required with traditional marker types.
The inventors developed a new method for high-throughput genotyping that detects SNPs by low-coverage whole-genome resequencing. SNP data of this type differ from traditional genetic markers in two main respects. First, in a recombinant population, random sequencing generally does not yield information at a given SNP site for every line. Second, a single SNP site is not a reliable marker for genotyping, because sequencing errors may be present.
To handle the distinctive SNP data produced by second-generation sequencing, the inventors further developed a new analysis framework, a "sliding window" approach, in which the genotype of a segment is determined locally from the genotypes of multiple SNPs.
Based on this theory, the inventors also developed an analysis pipeline named SEG-Map (Sequencing Enabled Genotyping for Mapping recombination populations). SEG-Map starts from the single-end or paired-end short reads produced by the Illumina Genome Analyzer II (GAII) and, through a multi-step analysis, finally constructs the genetic map of the recombinant population. It is applicable to recombinant populations constructed from two parents.
After further research, the inventors devised a novel analysis pipeline together with a corresponding method and device. Besides optimizing the steps of the earlier SEG-Map program and being compatible with current mainstream bioinformatics software and different types of high-throughput sequencing data, the pipeline of the present invention can, most importantly, analyze the genotypes of populations constructed from multiple parents quickly, accurately and reliably.
The method of the present invention helps multi-parent populations to be applied more effectively in crop breeding; it allows more QTLs in multi-parent populations to be identified precisely; and it enables genomic prediction for multi-parent populations, providing a basis for using them directly as germplasm resources in variety development.
In the current pipeline, the programs that read genotypes and determine recombination breakpoints are designed to accommodate multiple types of mapping population, and they connect seamlessly with the preceding SNP identification step and the subsequent construction of the recombination bin map. With these functions combined, the software takes short reads produced by second-generation sequencing as input and, after a series of computations, outputs recombination bins. This output can be analyzed by existing software for genetic linkage map construction and QTL (quantitative trait loci) analysis.
The inventors genotyped rice recombinant inbred lines (RILs) with the high-throughput sequencing-based method, which showed the advantages of the new method over the commonly used PCR-based methods. Before developing the sequencing-based method to genotype the F11 generation of this rice RIL population, the inventors had genotyped F8 individuals of the same population with 287 insertion/deletion markers (including SSR markers). These markers were amplified by PCR and scored by agarose gel electrophoresis. In the genetic linkage map built from the PCR markers, each marker covered on average a genetic distance of about 5 cM, equivalent to a physical distance of about 1.4 Mb; this average spacing is larger than that of most previously reported rice genetic maps. Designing, screening and collecting these PCR markers took three researchers more than a year. In contrast, using the Illumina GA in the rice RIL study, the inventors obtained an average marker density of one SNP per 40 kb in less than two weeks. The sequencing-based high-throughput genotyping method is therefore much faster, more efficient and less costly than traditional PCR-based genotyping.
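The figures quoted above can be put side by side with a short calculation. The two input values (about 1.4 Mb per PCR marker and about 40 kb per SNP) come from the text; the comparison with a rice genome on the order of 400 Mb is added here only as a rough plausibility check.

```latex
\[
\frac{1.4\ \text{Mb per PCR marker}}{40\ \text{kb per SNP}}
  = \frac{1400\ \text{kb}}{40\ \text{kb}} = 35
\]
\[
287\ \text{markers} \times 1.4\ \text{Mb} \approx 402\ \text{Mb}
\]
```

On these numbers the sequencing-based map is about 35 times denser than the PCR-based map, and the PCR marker spacing is consistent with 287 markers spread across the whole rice genome.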
The throughput of resequencing is easily adjusted, which lets the inventors reach a suitable marker density and breakpoint resolution with the least time and resources. When a new scientific question calls for higher marker density or more precise breakpoint localization, the resequencing coverage can be raised for all or part of the mapping population. Notably, the method can locate recombination breakpoints very precisely: with sufficiently high resequencing coverage, a breakpoint can in theory be placed within 1 kb. Such fine resolution makes it possible to detect "double crossover" events that usually cannot be identified with other types of genetic markers. The method therefore improves the accuracy of QTL detection and mapping and raises the efficiency and success rate of gene cloning. Precisely identified recombination breakpoints also make it possible to study genomic regions with special genetic properties, such as recombination hotspots.
In summary, this high-throughput genotyping method based on second-generation sequencing will greatly simplify and accelerate the genetic mapping of quantitative traits in crops [37-39,20]. The approach proposed by the inventors is well suited to genotyping multi-parent populations, improves the accuracy and efficiency of QTL mapping, and makes full use of the rich genetic variation present in such populations. It also contributes to the improvement of crop genetic quality and to molecular breeding design. In practice, the method can be used to obtain molecular markers tightly linked to genes for important agronomic traits, to screen progeny efficiently during breeding, and to finely characterize the genotype maps of improved varieties. It thus provides a fast and efficient platform for marker-assisted selection and raises its efficiency and accuracy to a new level. In short, this sequencing-based high-throughput genotyping method will facilitate the solution of complex biological problems and the improvement of crop breeding.
All documents mentioned in the present invention are incorporated by reference in this application, as if each were individually incorporated by reference. In addition, it should be understood that, after reading the above teachings of the present invention, those skilled in the art may make various changes or modifications to the invention, and such equivalents likewise fall within the scope defined by the claims appended to this application.
References
[1] Winzeler, E.A. et al. (1998) Direct allelic variation scanning of the yeast genome. Science, 281:1194-1197.
[2] Meaburn, E., Butcher, L.M., Schalkwyk, L.C., & Plomin, R. (2006) Genotyping pooled DNA using 100K SNP microarrays: a step towards genomewide association scans. Nucleic Acids Res., 34:e27.
[3] Singer, T. et al. (2006) A high-resolution map of Arabidopsis recombinant inbred lines by whole-genome exon array hybridization. PLoS Genet., 2:e144.
[4] Jeremy, E. et al. (2008) Development and evaluation of a high-throughput, low-cost genotyping platform based on oligonucleotide microarrays in rice. Plant Methods, 4:13.
[5] Craig, D.W. et al. (2008) Identification of genetic variants using bar-coded multiplexed sequencing. Nat. Methods, 5:887-893.
[6] Cronn, R. et al. (2008) Multiplex sequencing of plant chloroplast genomes using Solexa sequencing-by-synthesis technology. Nucleic Acids Res., 36:e122.
[7] Doi K, Iwata N, Yoshimura A (1997) The construction of chromosome substitution lines of African rice (Oryza glaberrima Steud.) in the background of japonica rice (O. sativa L.). Rice Genet. Newsl., 14:39-41.
[8] Wan XY, Wan JM, Su CC, Wang CM, Shen WB, Li JM et al. (2004) QTL detection for eating quality of cooked rice in a population of chromosome segment substitution lines. Theor. Appl. Genet., 110:71-79.
[9] Ebitani T, Takeuchi Y, Nonoue Y, Yamamoto T, Takeuchi K, Yano M (2005) Construction and evaluation of chromosome segment substitution lines carrying overlapping chromosome segments of indica rice cultivar 'Kasalath' in a genetic background of japonica elite cultivar 'Koshihikari'. Breed. Sci., 55:65-73.
[10] Hao W, Jin J, Sun SY, Zhu MZ, Lin HX (2006) Construction of chromosome segment substitution lines carrying overlapping chromosome segments of the whole wild rice genome and identification of quantitative trait loci for rice quality. J. Plant Physiol. Mol. Biol., 32:354-362.
[11] Huang X, Wei X, Sang T, Zhao Q, Feng Q, Zhao Y, Li C, Zhu C, Lu T, Zhang Z, Li M, Fan D, Guo Y, Wang A, Wang L, Deng L, Li W, Lu Y, Weng Q, Liu K, Huang T, Zhou T, Jing Y, Li W, Lin Z, Buckler ES, Qian Q, Zhang Q, Li J, Han B. (2010) Genome-wide association studies of 14 agronomic traits in rice landraces. Nature Genet., 42:961-967.
[12] Li, H., Ruan, J., & Durbin, R. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res., 18:1851-1858.
[13] Kurtz, S. et al. (2004) Versatile and open software for comparing large genomes. Genome Biol., 5:R12.
[14] Rice, P., Longden, I., & Bleasby, A. (2000) EMBOSS: The European molecular biology open software suite. Trends in Genetics, 16:276-277.
[15] Ning, Z., Cox, A.J., & Mullikin, J.C. (2001) SSAHA: a fast search method for large DNA databases. Genome Res., 11:1725-1729.
[16] Lincoln, S.E. & Lander, S.L. (1993) Mapmaker/exp 3.0 and mapmaker/qtl 1.1. Technical report. Whitehead Institute of Medical Research, Cambridge, MA.
[17] Wang, S., Basten, C.J. & Zeng, Z.B. (2007) Windows QTL Cartographer 2.5. Department of Statistics, North Carolina State University, Raleigh, NC.
[18] Li, R. et al. (2009) SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25:1966-1967.
[19] Van Os, H. et al. (2006) Construction of a 10,000-marker ultradense genetic recombination map of potato: Providing a framework for accelerated gene isolation and a genomewide physical map. Genetics, 173:1075-1087.
[20] Xu J, Zhao Q, Du P, Xu C, Wang B, Feng Q, Liu Q, Tang S, Gu M, Han B, Liang G. (2010) Developing high throughput genotyped chromosome segment substitution lines based on population whole-genome re-sequencing in rice (Oryza sativa L.). BMC Genomics, 11:656.
[21] Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, Haberer G, Hellsten U, Mitros T, Poliakov A, Schmutz J, Spannagl M, Tang H, Wang X, Wicker T, Bharti AK, Chapman J, Feltus FA, Gowik U, Grigoriev IV, Lyons E, Maher CA, Martis M, Narechania A, Otillar RP, Penning BW, Salamov AA, Wang Y, Zhang L, Carpita NC, Freeling M, Gingle AR, Hash CT, Keller B, Klein P, Kresovich S, McCann MC, Ming R, Peterson DG, Mehboob-ur-Rahman, Ware D, Westhoff P, Mayer KF, Messing J, Rokhsar DS. (2009) The Sorghum bicolor genome and the diversification of grasses. Nature, 457(7229):551-6.
[22] Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S. (2002) The generic genome browser: a building block for a model organism system database. Genome Res., 12(10):1599-610.
[23] Rice Annotation Project. (2007) Curated genome annotation of Oryza sativa ssp. japonica and comparative genome analysis with Arabidopsis thaliana. Genome Res., 17:175-83.
[24] Rice Annotation Project. (2008) The rice annotation project database (RAP-DB): 2008 update. Nucleic Acids Res., 36:D1028-D1033.
[25] The Rice Full-Length cDNA Consortium. (2003) Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science, 301:376-379.
[26] Liu, X., Lu, T., Yu, S., et al. (2007) A collection of 10,096 indica rice full-length cDNAs reveals highly expressed sequence divergence between Oryza sativa indica and japonica subspecies. Plant Mol. Biol., 65:403-415.
[27] International Rice Genome Sequencing Project (2005) The map-based sequence of the rice genome. Nature, 436:793-800.
[28] Yu, J. et al. (2005) The Genomes of Oryza sativa: A history of duplications. PLoS Biol., 3:266-281.
[29] Dohm, J.C., Lottaz, C., Borodina, T., & Himmelbauer, H. (2008) Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res., 36:e105.
[30] Mouse Genome Sequencing Consortium. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420:520-562.
[31] Frazer, K.A., Eskin, E., Kang, H.M., Bogue, M.A., Hinds, D.A., Beilharz, E.J., Gupta, R.V., Montgomery, J., Morenzoni, M.M., Nilsen, G.B., et al. (2007) A sequence-based variation map of 8.27 million SNPs in inbred mouse strains. Nature, 448:1050-1053.
[32] Stam P. (1993) Construction of integrated genetic linkage maps by means of a new computer package: JoinMap. Plant J., 3:739-44.
[33] Sasaki, A. et al. (2002) A mutant gibberellin-synthesis gene in rice. Nature, 416:701-702.
[34] Eshed, Y. and Zamir, D. (1995) An introgression line population of Lycopersicon pennellii in the cultivated tomato enables the identification and fine mapping of yield-associated QTL. Genetics, 141:1147-1162.
[35] Loudet, O., Chaillou, S., Camilleri, C., Bouchez, D. and Daniel-Vedele, F. (2002) Bay-0 x Shahdara recombinant inbred line population: a powerful tool for the genetic dissection of complex traits in Arabidopsis. Theor. Appl. Genet., 104:1173-1184.
[36] Simon, M., Loudet, O., Durand, S., Berard, A., Brunel, D., Sennesal, F.-X., Durand-Tardif, M., Pelletier, G. and Camilleri, C. (2008) Quantitative trait loci mapping in five new large recombinant inbred line populations of Arabidopsis thaliana genotyped with consensus single-nucleotide polymorphism markers. Genetics, 178:2253-2264.
[37] Huang, X. et al. (2009) High-throughput genotyping by whole-genome resequencing. Genome Res., 19:1068-1076.
[38] Xie W, Feng Q, Yu H, Huang X, Zhao Q, Xing Y, Yu S, Han B, Zhang Q. (2010) Parent-independent genotyping for constructing an ultrahigh-density linkage map based on population sequencing. Proc. Natl. Acad. Sci. U S A, 107(23):10578-83. Epub 2010 May 24.
[39] Zhao Q, Huang X, Lin Z, Han B. (2010) SEG-Map: A novel software for genotype calling and genetic map construction from next-generation sequencing. Rice, 3:98-102.

Claims (10)

  1. 一种对多亲本植物的基因型进行鉴定的方法,其特征在于,所述方法包括:A method for identifying the genotype of a multi-parent plant, wherein the method comprises:
    (a)对于n个亲本及其子代,提供待鉴定的子代植物的测序数据Df,以及与所述子代植物相应的亲本植物的测序数据Dp,其中n为≥3的正整数;(a) for n parents and their progeny, provide the sequencing data Df of the progeny plants to be identified, and the sequencing data Dp of the parental plants corresponding to the progeny plants, wherein n is a positive integer ≥ 3;
    (b)基于所述测序数据Df和所述测序数据Dp,确定亲代和子代的SNP位点信息;(b) based on the sequencing data Df and the sequencing data Dp, determine the SNP site information of the parent and progeny;
    (c)基于所述的SNP位点信息,对子代的基因型进行判断,从而获得所述子代的各个SNP的评定结果以及所述子代的全基因组的各染色体上的重组断裂点的分布信息;(c) Judging the genotype of the progeny based on the SNP site information, thereby obtaining the evaluation results of each SNP of the progeny and the recombination breakpoints on each chromosome of the entire genome of the progeny distribution information;
    (d)基于所述子代的SNP评定结果信息和全基因组重组断裂点的位置信息,构建和/或绘制所述子代的基因型图谱,从而获得所述多亲本植物的基因型鉴定结果。(d) constructing and/or drawing a genotype map of the progeny based on the SNP evaluation result information of the progeny and the position information of the whole-genome recombination breakpoint, thereby obtaining the genotype identification result of the multi-parent plant.
  2. 如权利要求1所述的方法,其特征在于,在步骤(c)中,基于SNP“字串”进行重组断裂位点的分析。The method of claim 1, wherein, in step (c), analysis of recombination break sites is performed based on SNP "word strings".
  3. The method of claim 1, wherein step (c) comprises analyzing recombination break sites, thereby obtaining an analysis result of the recombination break sites,
    and the recombination break site analysis comprises:
    (s1) constructing SNP "strings", wherein the genotypes of all SNPs on each chromosome of the parents and the progeny are compressed, in order, into one string;
    (s2) determining, according to a predetermined window size, the sliding windows corresponding to the SNP string, and scoring the SNP sites in each window, thereby obtaining a score value P for each parent within the window;
    (s3) determining the genotype of each chromosomal region of the progeny based on the score values P obtained in step (s2).
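Illustrative sketch (not part of the claims): steps (s1) to (s3) can be pictured with a few lines of Python. The scoring rule used here, the fraction of informative SNPs in a window at which the progeny allele matches a given parent, is an assumption made for illustration only, since the claim does not fix how the score value P is computed. The names snp_string, window_scores, call_windows and the missing-data symbol "-" are likewise hypothetical.

```python
from typing import Dict, List

MISSING = "-"  # assumed symbol for a SNP site with no usable call


def snp_string(calls: Dict[int, str], positions: List[int]) -> str:
    """(s1) Compress the genotypes at all SNP positions of one chromosome,
    in positional order, into a single string."""
    return "".join(calls.get(pos, MISSING) for pos in sorted(positions))


def window_scores(progeny: str, parents: Dict[str, str],
                  window: int) -> List[Dict[str, float]]:
    """(s2) Slide a fixed-size window along the SNP string and score each
    parent in each window (assumed rule: fraction of matching SNPs)."""
    n = len(progeny)
    if any(len(s) != n for s in parents.values()):
        raise ValueError("all SNP strings must have the same length")
    scores = []
    for start in range(0, n - window + 1):
        row = {}
        for name, pstr in parents.items():
            match = total = 0
            for i in range(start, start + window):
                a, b = progeny[i], pstr[i]
                if a == MISSING or b == MISSING:
                    continue  # skip uninformative sites
                total += 1
                match += (a == b)
            row[name] = match / total if total else 0.0
        scores.append(row)
    return scores


def call_windows(scores: List[Dict[str, float]]) -> List[str]:
    """(s3) Assign each window to the parent with the highest score
    (ties go to the parent listed first)."""
    return [max(row, key=row.get) for row in scores]
```

A toy run with three parents and a window of 4 SNPs (far smaller than the window sizes recited in claim 7) would be window_scores("AAAACCCC", {"P1": "AAAAAAAA", "P2": "CCCCCCCC", "P3": "GGGGGGGG"}, 4), whose first window scores P1 at 1.0 and whose last window scores P2 at 1.0.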
  4. The method of claim 1, wherein, in step (s3), for each chromosomal region of the progeny, the genotype of that region is determined based on the score values or score curves of the parents.
  5. The method of claim 1, wherein step (s3) comprises: sliding the sliding window across the whole-genome SNP sites to obtain the score value of each parent on each chromosome, and plotting a score curve for each parent with the score value as the ordinate and the position of each sliding window on the chromosome as the abscissa.
  6. The method of claim 1, wherein, in step (s3), the degree of similarity between the progeny and each parent in a given segment is quantified, and the genotype of each segment is judged according to the numerical characteristics (magnitude and standard deviation) of the score curve of each parent.
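Illustrative sketch (not part of the claims): one way to read claims 4 to 6 is that each point on the per-parent score curves is compared against the spread of the other parents' scores, and runs of identical calls are then merged into segments whose boundaries approximate the recombination breakpoints. The decision rule below (best score must exceed the mean of the other parents' scores by k standard deviations) and the default k = 2.0 are assumptions for illustration; the claims only state that magnitude and standard deviation are used. The function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional, Tuple


def call_window(row: Dict[str, float], k: float = 2.0) -> Optional[str]:
    """Call the best-scoring parent for one window, or None if it does not
    stand out from the other parents (needs at least two parents)."""
    best = max(row, key=row.get)
    others = [v for name, v in row.items() if name != best]
    threshold = mean(others) + k * pstdev(others)
    return best if row[best] > threshold else None


def segments(calls: List[Optional[str]]) -> List[Tuple[int, int, Optional[str]]]:
    """Merge consecutive windows with the same call into
    (first_window, last_window, parent) segments."""
    out = []
    start = 0
    for i in range(1, len(calls) + 1):
        if i == len(calls) or calls[i] != calls[start]:
            out.append((start, i - 1, calls[start]))
            start = i
    return out
```

Under these assumptions, a window with scores {"P1": 0.9, "P2": 0.1, "P3": 0.2} is called as P1, while {"P1": 0.5, "P2": 0.5, "P3": 0.1} is left uncalled; the boundary between two adjacent segments with different parents marks a candidate breakpoint. Plotting each parent's score against window position, as claim 5 describes, would give the curves from which these rows are read.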
  7. The method of claim 3, wherein the sliding window size is 170-500 consecutive SNP sites, preferably 200-400 consecutive SNP sites;
    and/or the sequencing depth of the sequencing data is 0.1x-10x, preferably 0.2x-5x.
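Illustrative sketch (not part of the claims): the numeric ranges of claim 7 can be captured as simple parameter checks. The class name RunParams and its method names are hypothetical; only the numeric bounds come from the claim. Window size and depth are checked separately because the claim joins them with "and/or".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunParams:
    window_snps: int  # consecutive SNP sites per sliding window
    depth_x: float    # sequencing depth, in fold coverage

    def window_in_range(self, preferred: bool = False) -> bool:
        """True if the window size is within 170-500 (or 200-400 if preferred)."""
        lo, hi = (200, 400) if preferred else (170, 500)
        return lo <= self.window_snps <= hi

    def depth_in_range(self, preferred: bool = False) -> bool:
        """True if the depth is within 0.1x-10x (or 0.2x-5x if preferred)."""
        lo, hi = (0.2, 5.0) if preferred else (0.1, 10.0)
        return lo <= self.depth_x <= hi
```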
  8. The method of claim 1, wherein the plant comprises a crop, preferably a crop of the grass family (Gramineae);
    more preferably, the crop comprises rice, wheat, soybean, and tobacco.
  9. A data analysis device for identifying the genotype of a multi-parent plant, the device comprising:
    a data input module for inputting data to be processed, the data to be processed comprising: sequencing data Df of the progeny plant to be identified, and sequencing data Dp of the parent plants corresponding to the progeny plant;
    a multi-parent plant genotype identification module configured to perform the method of claim 1, thereby obtaining the genotype identification result of the progeny;
    and an output module for outputting the genotype identification result of the progeny.
  10. The device of claim 9, wherein the multi-parent plant genotype identification module comprises:
    an SNP site information analysis submodule configured to determine the SNP site information of the parents and the progeny based on the sequencing data Df and the sequencing data Dp;
    a chromosomal recombination breakpoint analysis submodule configured to judge the genotype of the progeny based on the SNP site information, thereby obtaining an assessment result for each SNP of the progeny and distribution information of the recombination breakpoints on each chromosome across the whole genome of the progeny;
    a genotype map construction submodule configured to construct and/or draw a genotype map of the progeny based on the SNP assessment result information of the progeny and the position information of the whole-genome recombination breakpoints, thereby obtaining the genotype identification result of the multi-parent plant.
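Illustrative sketch (not part of the claims): the three submodules of claim 10 can be pictured as three pluggable functions chained in order, with the input and output modules of claim 9 corresponding to the arguments and return value of run. The class name GenotypingDevice, its field names, and the data shapes passed between stages are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class GenotypingDevice:
    # (Df, Dp) -> SNP site information of parents and progeny
    snp_info: Callable[[Any, Any], Any]
    # SNP site information -> (per-SNP assessment results, breakpoint positions)
    breakpoints: Callable[[Any], Tuple[Any, Any]]
    # (assessment results, breakpoint positions) -> genotype map
    build_map: Callable[[Any, Any], Any]

    def run(self, df: Any, dp: Any) -> Any:
        """Chain the three submodules and return the genotype map."""
        snps = self.snp_info(df, dp)
        calls, bps = self.breakpoints(snps)
        return self.build_map(calls, bps)
```

Keeping each stage behind a plain callable means a different SNP caller or breakpoint detector could be swapped in without touching the other two stages.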
PCT/CN2021/115146 2021-01-30 2021-08-27 Genotype identification of multi-parent crop on basis of high-throughput whole genome sequencing WO2022160700A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2021423830A AU2021423830A1 (en) 2021-01-30 2021-08-27 Genotype identification of multi-parent crop on basis of high-throughput whole genome sequencing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110131330.4 2021-01-30
CN202110131330.4A CN114842907A (en) 2021-01-30 2021-01-30 Multi-parent crop genotype identification based on high-throughput whole genome sequencing

Publications (1)

Publication Number Publication Date
WO2022160700A1 (en) 2022-08-04

Family

ID=82561095

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/115146 WO2022160700A1 (en) 2021-01-30 2021-08-27 Genotype identification of multi-parent crop on basis of high-throughput whole genome sequencing

Country Status (3)

Country Link
CN (1) CN114842907A (en)
AU (1) AU2021423830A1 (en)
WO (1) WO2022160700A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115798580B (en) * 2023-02-10 2023-11-07 北京中仪康卫医疗器械有限公司 Genotype filling and low-depth sequencing-based integrated genome analysis method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104328507A (en) * 2014-10-11 2015-02-04 中国水稻研究所 SNP chip used for identifying rice variety, preparation method and application
CN111508560A (en) * 2020-04-29 2020-08-07 上海师范大学 Method for constructing high-density genotype map of outcrossing species

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
LU WANG; AHONG WANG; XUEHUI HUANG; QIANG ZHAO; GUOJUN DONG; QIAN QIAN; TAO SANG; BIN HAN: "Mapping 49 quantitative trait loci at high resolution through sequencing-based genotyping of rice recombinant inbred lines", THEORETICAL AND APPLIED GENETICS ; INTERNATIONAL JOURNAL OF PLANT BREEDING RESEARCH, SPRINGER, BERLIN, DE, vol. 122, no. 2, 28 September 2010 (2010-09-28), Berlin, DE , pages 327 - 340, XP019873367, ISSN: 1432-2242, DOI: 10.1007/s00122-010-1449-8 *
LUO LONGHAI, YUE GUIDONG, GAO QIANG, WANG JUNYI, XU JIAOHUI, YIN YE: "The Application of High-throughput Sequencing Technology in Plant and Animal Research", SCIENCE CHINA: CHINESE BULLETIN OF LIFE SCIENCE = SCIENTIA SINICA VITAE, vol. 42, no. 2, 1 February 2012 (2012-02-01), pages 107 - 124, XP055954237, ISSN: 1674-7232, DOI: 10.1360/052011-634 *
QING-QING HOU, LI-ZHEN SI, XUE-HUI HUANG, BIN HAN: "Progress on genome-wide association study of important agronomic traits in rice", CHINESE BULLETIN OF LIFE SCIENCES, vol. 28, no. 10, 1 October 2016 (2016-10-01), pages 1 - 8, XP055954244, ISSN: 1004-0374, DOI: 10.13376/j.cbls/2016162 *
X. HUANG, Q. FENG, Q. QIAN, Q. ZHAO, L. WANG, A. WANG, J. GUAN, D. FAN, Q. WENG, T. HUANG, G. DONG, T. SANG, B. HAN: "High-throughput genotyping by whole-genome resequencing", GENOME RESEARCH, COLD SPRING HARBOR LABORATORY PRESS, US, vol. 19, no. 6, 1 June 2009 (2009-06-01), US , pages 1068 - 1076, XP055577519, ISSN: 1088-9051, DOI: 10.1101/gr.089516.108 *
XUE YONGBIAO, HAN BIN, CHONG KANG, WANG TAI, HE ZUHUA, FU XIANGDONG, CHU CHENGCAI, CHENG ZHUKUAN, XU YUNYUAN, LI MING: "Achievements and Prospect of Designer Breeding by Molecular Modules in Rice ", BULLETIN OF CHINESE ACADEMY OF SCIENCES, vol. 33, no. 9, 1 September 2018 (2018-09-01), pages 1 - 10, XP055954240, ISSN: 1000-3045, DOI: 10.16418/j.issn.1000-3045.2018.09.002 *

Also Published As

Publication number Publication date
AU2021423830A1 (en) 2023-12-21
CN114842907A (en) 2022-08-02

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21922295; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
WWE WIPO information: entry into national phase (Ref document number: 2021423830; Country of ref document: AU. Ref document number: AU2021423830; Country of ref document: AU)
ENP Entry into the national phase (Ref document number: 2021423830; Country of ref document: AU; Date of ref document: 20210827; Kind code of ref document: A)
122 Ep: PCT application non-entry in European phase (Ref document number: 21922295; Country of ref document: EP; Kind code of ref document: A1)