WO2022160700A1

WO2022160700A1 - Genotype identification of multi-parent crop on basis of high-throughput whole genome sequencing

Info

Publication number: WO2022160700A1
Application number: PCT/CN2021/115146
Authority: WO
Inventors: 韩斌; 朱舟; 王阿红
Original assignee: 中国科学院分子植物科学卓越创新中心
Priority date: 2021-01-30
Filing date: 2021-08-27
Publication date: 2022-08-04
Also published as: AU2021423830A1; CN114842907A

Abstract

Genotype identification of a multi-parent crop on the basis of high-throughput whole genome sequencing. Specifically, the method comprises: (a) for n parents and the progeny thereof, providing sequencing data Df of a progeny crop to be identified and sequencing data Dp of a parent crop corresponding to the progeny crop, with n being a positive integer of ≥ 3; (b) determining SNP site information of the parent and the progeny on the basis of the sequencing data Df and the sequencing data Dp; (c) determining the genotype of the progeny on the basis of the SNP site information, thereby obtaining an evaluation result of each SNP of the progeny and distribution information of the recombinant breaking point on each chromosome of the whole genome of the progeny; and (d) constructing and/or drawing a genotype map of the progeny, thereby obtaining a genotype identification result of a multi-parent crop. The method can be used for identifying the genotype of the multi-parent crop with high throughput, rapidness and accuracy.

Description

Genotyping of multi-parent crops based on high-throughput whole-genome sequencing

Technical field

The invention relates to the technical field of biological information processing, in particular to multi-parent crop genotype identification based on high-throughput whole genome sequencing. More specifically, the present invention provides a method and device for identifying genotypes of multi-parent crops based on high-throughput whole genome sequencing data.

Background art

At the end of the last century, the use of DNA molecular markers greatly promoted the development of reverse genetics. With the advancement of molecular biology technology, the types of markers and the methods for constructing genetic maps are gradually developed and improved. The advent of polymerase chain reaction (PCR) triggered an era of explosive applications of molecular markers, because PCR can greatly simplify the experimental steps of marker design and result analysis. These DNA molecular markers are still widely used, but also show increasing limitations in terms of genome coverage, time and cost. At present, the development of genomics and the gradual maturity of related technical methods provide the basis for genome-based high-throughput strategies to replace marker-based mapping methods.

Genome sequencing opened the door to high-throughput genotyping. Initially this was done using microarray technology, which detects single nucleotide polymorphisms (SNPs) by hybridizing genomic DNA to oligonucleotides on a gene chip. Since hundreds to thousands of markers can be detected in a single hybridization, this method of genotyping greatly improves efficiency ^[1] . This method has been applied to model organisms such as human, Arabidopsis and rice ^[2-4] . Although the goal of high throughput has been achieved, microarray-based approaches have serious limitations: designing, producing, and using microarrays is laborious, time-consuming, and costly.

The advent of next-generation sequencing technology has brought a leap forward in methods for genotyping and genetic mapping. New sequencing technologies not only increase sequencing throughput by several orders of magnitude, but also allow parallel sequencing of many samples ^[5-6] . Advances in these technologies have paved the way for the development of sequencing-based high-throughput genotyping methods. The new genotyping methods are fast and inexpensive and offer high-density marker coverage, high accuracy and high resolution, while also being applicable to more mapping populations and species for comparative genomics and genetic map construction.

Although some methods exist for genotyping two-parent plants, current methods have obvious shortcomings for the genotyping of multi-parent plants involving 3 or more parents, such as low accuracy and long analysis time.

Therefore, there is an urgent need in the art for a method and device that identify the genotypes of multi-parent plants involving 3 or more parents with rapid analysis and accurate results.

SUMMARY OF THE INVENTION

The object of the present invention is to provide a method and device for identifying the genotypes of multi-parent plants with rapid analysis and accurate results, so that the population genotypes constructed by multiple parents can be analyzed quickly, accurately and reliably.

In a first aspect of the present invention, there is provided a method for identifying the genotype of a multi-parent plant (such as a crop), the method comprising:

(a) for n parents and their progeny, provide the sequencing data Df of the progeny plants to be identified, and the sequencing data Dp of the parental plants corresponding to the progeny plants, wherein n is a positive integer ≥ 3;

(b) based on the sequencing data Df and the sequencing data Dp, determine the SNP site information of the parent and progeny;

(c) judging the genotype of the progeny based on the SNP site information, thereby obtaining the evaluation result of each SNP of the progeny and the distribution information of the recombination breakpoints on each chromosome of the whole genome of the progeny;

(d) constructing and/or drawing a genotype map of the progeny based on the SNP evaluation result information of the progeny and the position information of the whole-genome recombination breakpoint, thereby obtaining the genotype identification result of the multi-parent plant.

In another preferred embodiment, in step (c), the analysis of recombination break sites is performed based on the SNP "word string".

In another preferred embodiment, step (c) includes analyzing the recombination break site, so as to obtain the analysis result of the recombination break site,

And the recombination break site analysis includes:

(s1) constructing a SNP "string", wherein the genotypes of all SNPs on each chromosome of the parent and progeny are sequentially compressed into a string;

(s2) Determine each sliding window corresponding to the SNP string according to a predetermined window size, and score the SNP sites in each window, thereby obtaining the respective score value P of each parent in the window;

(s3) Based on the score value P obtained in the step (s2), the genotype corresponding to each chromosomal region of the progeny is determined.

In another preferred example, in step (s1), no matter how large the actual distance between two adjacent SNPs is, all the gaps between the SNPs are removed.

In another preferred embodiment, in step (s1), the SNPs constituting the word string are homozygous SNP sites of the parent.

In another preferred example, in step (s1), the method further includes: firstly screening the SNP sites included in the analysis, thereby excluding any SNP site whose parent is heterozygous.

In another preferred example, in step (s2), scoring is performed according to the scoring rules in Table A.

In another preferred example, in step (s3), for each chromosomal region of the progeny, the genotype corresponding to each chromosomal region of the progeny is determined based on each parental score value or score value curve.

In another preferred example, in step (s3), the genotype of each chromosomal region is determined based on the score value and the standard deviation.

In another preferred example, for a chromosomal region whose genotype is to be determined, if a parent A has a high score value close to full score (≥80% of full score, preferably ≥80% of full score), the score value of that parent in the region is quite stable without much numerical fluctuation, and the score values of the remaining parents are low (≤50% of full score, preferably ≤30% of full score) or fluctuate strongly, then the genotype of the chromosomal region is determined as the genotype of parent A.

In another preferred example, step (s3) includes: sliding the sliding window along the SNP sites of the whole genome to obtain the score value of each parent on each chromosome, and drawing the score curve of each parent with the score value as the ordinate and the position of each sliding window on the chromosome as the abscissa.

In another preferred embodiment, in the step (s3), the sub-step of evaluating the heterozygous region is included:

(s3a) For the hybrid progeny of multiple parents, it is still assumed that a given chromosomal region derives from at most two parents, and whether the region is a heterozygous region is judged based on the score curves of the two parents in the region.

In another preferred example, in step (s3), the degree of similarity between the progeny and each parent in a section is quantified, and the genotype of each segment is determined according to the numerical characteristics (value level and standard deviation) of the score curve of each parent.

In another preferred example, in step (s3), the genotype assessment is performed in the following manner:

(Z1) If a parent has a high score in the section and its score is relatively stable (the score curve is close to a plateau), while the score curves of the other parents in the section fluctuate strongly with a large standard deviation (the score curves rise and fall in peaks), the segment is judged to be the homozygous genotype of that parent;

(Z2) When the number of parents is two, the region can be inferred to be a heterozygous genotype of the two parents if both parents show large numerical fluctuations and a large standard deviation in the section;

In a multi-parent judgment, it is likely that only candidate heterozygous segments can be found, that is, segments in which no high-scoring and stable parent can be found; because the score of every parent fluctuates strongly in such a segment, only the two most probable parents can be given based on the numerical features.

(Z3) If two or more parents are very similar to the progeny in a certain segment, two or more parental curves with high scores and small standard deviations appear, indicating that some of the parents in the analysis are very similar to each other in the section with little difference; judgment can be temporarily deferred (the segment is marked as an "unknown area").
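By way of illustration only, the decision rules (Z1) to (Z3) can be sketched in Python as follows. The function name `call_segment`, the input layout (mean score as a fraction of full score plus standard deviation for each parent), and the stability threshold `std_max` are hypothetical; only the 80% high-score threshold is taken from the text, and the sketch is not the SNPwindow implementation.

```python
def call_segment(stats, high=0.8, std_max=0.05):
    """Classify one chromosomal segment.

    stats: {parent_name: (mean_score_as_fraction_of_full_score, std)}.
    high:  'close to full score' threshold (80% in the text).
    std_max: assumed cut-off for a 'stable' (plateau-like) curve.
    Returns (label, list_of_parents).
    """
    stable_high = sorted(
        p for p, (mean, std) in stats.items() if mean >= high and std <= std_max
    )
    if len(stable_high) == 1:
        # (Z1) exactly one parent is high and stable: homozygous for it
        return "homozygous", stable_high
    if len(stable_high) >= 2:
        # (Z3) several parents look alike here: defer judgment
        return "unknown", stable_high
    # (Z2) no stable high-scoring parent: report the two most probable parents
    top_two = sorted(stats, key=lambda p: stats[p][0], reverse=True)[:2]
    return "heterozygous_candidate", sorted(top_two)
```

For example, `call_segment({"P1": (0.95, 0.01), "P2": (0.4, 0.2)})` returns `("homozygous", ["P1"])`.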

In another preferred embodiment, the method further includes: if the genotypes on both sides of the unknown region are the same, determining the unknown region as that genotype; and if the genotypes on the two sides are different, taking the middle position of the unknown region as a recombination breakpoint, with each half of the unknown region assigned the genotype of its adjacent side.
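The handling of unknown regions described above can be sketched as follows. The segment representation (ordered, half-open `(start, end, genotype)` tuples with `None` marking an unknown region) and the function name `resolve_unknown` are assumptions made for this illustration.

```python
def resolve_unknown(segments):
    """Assign genotypes to unknown regions from their flanking segments.

    segments: ordered list of (start, end, genotype); genotype None = unknown.
    """
    out = []
    for i, (start, end, gt) in enumerate(segments):
        if gt is not None:
            out.append((start, end, gt))
            continue
        left = segments[i - 1][2] if i > 0 else None
        right = segments[i + 1][2] if i + 1 < len(segments) else None
        if left is not None and left == right:
            # same genotype on both sides: the unknown region takes that genotype
            out.append((start, end, left))
        elif left is not None and right is not None:
            # different genotypes: place a breakpoint at the middle of the region
            mid = (start + end) // 2
            out.append((start, mid, left))
            out.append((mid, end, right))
        else:
            # no usable flank (chromosome end or adjacent unknown): leave unresolved
            out.append((start, end, None))
    return out
```

Regions at a chromosome end have only one known flank, a case the text does not address; the sketch leaves them unresolved rather than guessing.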

In another preferred embodiment, the progeny is a multi-parent plant.

In another preferred embodiment, n is 3-6, more preferably 3, 4 or 5.

In another preferred embodiment, the sequencing data is selected from the group consisting of genome sequencing data, RNA sequencing data, or a combination thereof.

In another preferred embodiment, the sequencing data are files in fastq format.

In another preferred embodiment, the sliding window size is 170-500 consecutive SNP sites, preferably 200-400 consecutive SNP sites;

And/or the sequencing depth of the sequencing data is 0.1x-10x, preferably 0.2x-5x.

In another preferred embodiment, the sequencing depth of the sequencing data is ≥1x, preferably 1x-5x, more preferably 1.5x-3x.

In another preferred embodiment, for each chromosome, each parental score curve is obtained.

In another preferred embodiment, the SNP site is used to determine the genotype of the individual.

In another preferred embodiment, in step (b), the sequencing data (such as fastq files) are compared and processed by bwa and GATK software to obtain SNP information.

In another preferred embodiment, the SNP site information includes location information and genotype information.

In another preferred embodiment, the SNP site used for judging the genotype meets the following requirements:

1. SNP sites cover the whole genome as much as possible, and there will be no deletions in certain regions;

2. For any SNP locus, the SNP information (position information and genotype information) of the corresponding two parents and offspring are known, and the locus should be deleted if any of the three is unknown.

In another preferred example, in step (c), the evaluation result of each SNP of the progeny is recorded in the rlt file, and the rlt file records the genotype determination situation of each SNP position;

The distribution information of the recombination breakpoints on each chromosome of the whole genome of the progeny is recorded in a bin file, and the bin file records the distribution of the recombination breakpoints on the 12 chromosomes of the whole genome.

In another preferred embodiment, in step (c), read genotype and recombination break site judgment are performed by SNPwindow script.

In another preferred embodiment, in step (d), genotype maps are drawn for m individuals of the progeny at the same time.

In another preferred embodiment, in step (d), the recombination map is constructed by the SNPwindow script, and the gene map of each progeny individual is drawn by the SNP2png script.

In another preferred embodiment, in step (d), it also includes performing alignment on the recombination map of each individual through the Bin2MCD script to generate a recombination bin map.

In another preferred embodiment, the resolution of the recombined bin map is one bin per 5-200kb, preferably one bin per 10-100kb.

In another preferred embodiment, the method further comprises: processing the recombination bin map to obtain the genetic map of the progeny.

In another preferred embodiment, the method further comprises: performing QTL analysis on the genetic map.

In another preferred embodiment, the method further includes: performing a visual analysis on the genotypes of the entire population of parents and progeny, generating genotype data, and constructing a linkage map based on the genotype data.

In another preferred embodiment, the plants include crops, preferably grass crops.

In another preferred embodiment, the crops include rice, wheat, soybean, and tobacco.

In a second aspect of the present invention, a data analysis device for identifying the genotypes of multi-parent plants is provided, the device comprising:

A data input module for inputting the data to be processed to be analyzed, the data to be processed includes: the sequencing data Df of the progeny plant to be identified, and the sequencing data Dp of the parent plant corresponding to the progeny plant;

A multi-parental plant genotype identification module, the multi-parent plant genotype identification module is configured to perform the method described in the first aspect of the present invention, thereby obtaining the genotype identification result of the progeny;

and an output module for outputting the genotype identification result of the progeny.

In another preferred embodiment, the described multi-parent plant genotype identification module includes:

The SNP site information analysis submodule is configured to determine the SNP site information of the parent and progeny based on the sequencing data Df and the sequencing data Dp;

A chromosomal recombination breakpoint analysis submodule, which is configured to judge the genotype of the progeny based on the SNP site information, so as to obtain the evaluation result of each SNP of the progeny and the distribution information of recombination breakpoints on each chromosome of the whole genome of the progeny;

A genotype map construction submodule, which is configured to construct and/or draw a genotype map of the progeny based on the SNP evaluation result information of the progeny and the position information of the whole-genome recombination breakpoints, so as to obtain the genotype identification result of the multi-parent plant.

In another preferred embodiment, the output module includes: a display, a printer, a pad, and the like.

It should be understood that within the scope of the present invention, the above-mentioned technical features of the present invention and the technical features specifically described below (e.g., in the embodiments) can be combined with each other to form new or preferred technical solutions. Due to space limitations, these combinations are not described one by one here.

Description of drawings

Figure 1 shows the simulated two-parental material genome-wide recombination breakpoints.

Figure 2 shows the simulated four-parental material genome-wide recombination breakpoints.

Figure 3 shows genotyping of two-parental mock progeny using a SNP-based sliding window approach.

Figure 4 shows the genotyping of two-parental mock progeny using the SEG-Map software method.

Figure 5 shows genotyping of four-parental mock progeny using a SNP-based sliding window approach.

Figure 6 shows the effect of different sliding window sizes on the accuracy of genotype determination results.

Figure 7 shows the effect of different sequencing depths on the accuracy of genotype determination results.

Figure 8 shows the analysis framework flow of the SNP-based sliding window genotyping method.

Figure 9 shows gene mapping using the SNP2png script.

Figure 10 shows a genotype identification ensemble plot of the rice population.

Figure 11 shows the genotype table of the recombinant inbred line individual recombination segment map in one example.

Figure 12 shows the SNP "string" with window 15.

Figure 13 shows the four parental score curves of rice chromosome 3 mock progeny.

Figure 14 shows two parental score curves for rice chromosome 11 mock progeny.

Figure 15 shows the scores of each parent when it is determined that the parents are three homozygous genotypes in one embodiment.

Figure 16 shows the scores of each parent when it is determined that the two parents are heterozygous genotypes in one embodiment.

Figure 17 shows the scores of each parent when the genotype is determined as unknown in one embodiment.

Figure 18 shows subsequent genotype determination of unknown regions in one embodiment.

Figure 19 shows a graph of genotyping of individual individuals in the DH population.

Figure 20 shows a graph of the genotyping ensemble for the DH population.

Figure 21 shows genotyping of three parental material in one example.

Figure 22 shows SEG-Map identification of three parental material in one example.

Figure 23 shows the genotyping of four parental mimics in one example.

Figure 24 shows the true genotypes of the four-parent mock material in one example.

Detailed description

After extensive and in-depth research, the present inventors have for the first time developed a method for faster and more accurate genotype identification, thereby enabling more effective genetic mapping and genome analysis. The method of the present invention is particularly suitable for genotyping multi-parent populations sequenced at low coverage. In the present invention, the genotype information of the actual SNPs of multiple parents and the progeny in a given section is read directly, the degree of similarity between the progeny and each parent in that section is quantified, and the genotype is determined according to the numerical characteristics (value level and standard deviation) of the score curve of each parent, forming an efficient, simplified and accurate method for multi-parent plant (or multi-parent crop) genotype identification. The present invention has been completed on this basis.

Specifically, the present inventors developed a high-throughput method to identify genotypes of recombinant populations containing multiple parents based on whole-genome low-coverage sequencing data generated by second-generation sequencing technology. The inventors designed a "sliding window" method that determines the genotype of a segment by comprehensively analyzing the genotypes of multiple single nucleotide polymorphisms (SNPs) in a local region of the genome, and then determines the specific positions of the recombination break sites to construct a fine recombination map of the multi-parent population.

In order to verify this method, the inventors constructed simulated whole-genome sequencing data of biparental and multi-parent populations, constructed genetic linkage maps using this method, and finally compared the identified genotype information with the true genotypes of the simulated data. For the biparental population, the genotype identification accuracy reaches 89.61%, which is similar to that of the inventors' SEG-Map software method on the same type of population (89.32%). For multi-parent populations, the newly developed genotype identification method has an identification accuracy of 92.10%, which cannot be achieved by the SEG-Map software or method.

The method of the invention can effectively and quickly analyze the genotype of each individual in the population, plays a key guiding role in genome design and breeding, and can also provide fast and accurate genotype data for QTL mapping of multi-parent populations of different crops. At the same time, the present inventors tested the method using the real rice RIL genetic population, used high-throughput sequencing-based genotype identification, and finally obtained a fairly good high-precision recombination map.

Therefore, with the continuous development and improvement of sequencing technology, this genotype identification method based on low-coverage genome sequencing can replace the traditional marker-based genotype identification method, and provides a powerful tool for large-scale gene discovery research and for solving more complex biological questions. The method of the invention is particularly suitable for genotype identification of multi-parent backcross populations that have undergone low-coverage sequencing, provides accurate genotype support for QTL mapping, and is also helpful for molecular design breeding applications of multi-parent populations.

Terms

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

As used herein, the terms "containing" or "including (including)" can be open, semi-closed, and closed. In other words, the term also includes "consisting essentially of," or "consisting of."

As used herein, the term "biparental" indicates that two parents are involved.

As used herein, the term "multi-parent" indicates that 3 or more parents are involved.

As used herein, the term "multi-parent plant" refers to plants involving 3 or more parents, e.g., progeny plants (e.g., crops) involving 3, 4, or 5 parents.

Methods for the identification of multi-parental crop genotypes

The invention provides a method for identifying multi-parent crop genotypes. The method of the invention is a genotype identification method of the sliding window of the SNP site.

In the genotype identification method based on the sliding window of the SNP site of the present invention, the data processing is optimized. The optimized process can directly analyze and process the unidirectional or bidirectional end short sequence sequencing results generated by the next-generation sequencing technology, and finally construct the genetic map of the recombinant population.

For a mapping population from two parents, the genome-wide SNPs of both parents need to be identified before proceeding with the data analysis pipeline. These SNPs can be identified by high-coverage whole-genome deep sequencing, from existing genomic SNP information in the rice haplotype map, or by low-coverage whole-genome sequencing combined with imputation of missing genotypes (SNPs). Since SNPs between two parental varieties can be identified in a fast and cost-effective way, sequencing-based genotype identification of a recombinant population mainly relies on the subsequent analysis, including reading genotypes, determining recombination breakpoints, and constructing genetic linkage maps.

Functions, steps, and software (scripts) in data analysis can be seen in Figure 8.

The first step consists of several tasks that can be processed simultaneously. Individuals and parental material in a certain number of recombinant populations are subjected to second-generation high-throughput sequencing simultaneously. The obtained fastq files were aligned and processed by bwa and GATK software to obtain high-quality SNP information.

The SNP loci used for the final determination of genotype should meet the following requirements:

1. SNP sites cover the whole genome as much as possible, and there will be no deletions in certain regions.

2. For any SNP locus, the SNP information (position information and genotype information) of the two parents and simulated progeny are known, and the locus should be deleted if any one of the three is not known.

In addition, taking rice as an example, a rice parent is generally considered to be an inbred homozygous line with essentially no heterozygous loci in its genome. Therefore, if a heterozygous SNP locus is found in a parent, the locus is generally considered not credible, so SNP sites at which any parent is heterozygous can be deleted.
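The screening requirements above (complete information for every parent and the progeny, and no heterozygous parent) might be applied as in the following minimal Python sketch. The record layout (a dict with `chrom`, `pos`, `parents` and `progeny` keys), the VCF-style genotype strings, and the function name `filter_snps` are assumptions made for illustration and are not taken from the SNPwindow script.

```python
# Genotype values treated as "unknown" (assumed encodings)
MISSING = {None, "", ".", "./.", ".|."}

def alleles(gt):
    """Split a VCF-style genotype such as '0/1' or '1|1' into its alleles."""
    return gt.replace("|", "/").split("/")

def is_homozygous(gt):
    a = alleles(gt)
    return len(a) == 2 and a[0] == a[1]

def filter_snps(records):
    """Keep sites that are genotyped in all samples and homozygous in every parent."""
    kept = []
    for rec in records:
        gts = list(rec["parents"]) + [rec["progeny"]]
        if any(gt in MISSING for gt in gts):
            continue  # genotype unknown for a parent or the progeny: delete the locus
        if not all(is_homozygous(gt) for gt in rec["parents"]):
            continue  # a heterozygous parent is treated as not credible
        kept.append(rec)
    return kept
```

Genome coverage (requirement 1) is a property of the whole SNP set and is not checked by this per-site filter.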

After screening the high-quality genome-wide SNPs of the two parents and progeny, a python script SNPwindow can be used to judge the genotype of the progeny. The script output will have two files, the rlt file and the bin file. The rlt file records the genotype determination of each SNP position, and the bin file records the distribution of recombination breakpoints on the 12 chromosomes in the whole genome. These two files are an important basis for subsequent mapping and linkage analysis.

Referring to Figure 9, a genotype map can generally be drawn first from the rlt and bin files with the perl script SNP2png; the image is output in PNG format. The map is drawn based on the genotype information of the determined SNP loci and the position information of the whole-genome recombination breakpoints. Different colors in the figure represent different genotypes.

In actual work, there will be a large number of offspring groups to be processed, so the SNP information of each individual can be extracted, and then analyzed and tested by the SNPwindow script to read the genotype, judge the recombination breakpoint, and construct the recombination map. First, use SNP2png to draw the gene map of each individual, and have an overall grasp of the individual's genotype, and then use the script Bin2MCD to align the recombination maps of all individuals to generate a recombination bin map.

Referring to Figure 10, a perl script can also be used to visualize the genotype profile of the entire population. Programs and scripts used in the analysis process are shown in italics and form a series of analysis steps. The genotype data generated at the end of the analysis process can be directly used in other software (including MapMaker and JoinMap) to construct linkage maps.

The bins are reorganized when analyzing the final output data produced by the software, usually at a resolution of one bin per 100kb, or even one bin per 10kb. The genotype results of the mapping population can be imported into programs such as MapMaker ^[16] or JoinMap ^[32] for genetic map construction. With the genetic map available, QTL analysis is performed.
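As a rough illustration of expressing one individual's recombination map at a fixed resolution such as one bin per 100kb, the sketch below samples the genotype at the midpoint of each bin. The segment layout (ordered, half-open `(start, end, genotype)` tuples), the midpoint rule, and the function name `to_bin_row` are assumptions for this example; the actual behaviour of the Bin2MCD script is not described in the text and may differ.

```python
def to_bin_row(segments, chrom_length, bin_size=100_000):
    """Return the genotype of each fixed-size bin along one chromosome.

    segments: ordered half-open (start, end, genotype) tuples for one individual.
    A bin takes the genotype of the segment covering its midpoint (assumed rule);
    bins not covered by any segment get None.
    """
    row = []
    for bin_start in range(0, chrom_length, bin_size):
        mid = min(bin_start + bin_size // 2, chrom_length - 1)
        gt = next((g for s, e, g in segments if s <= mid < e), None)
        row.append(gt)
    return row
```

Stacking one such row per individual gives a genotype table of individuals by bins, which is the kind of input that linkage-mapping programs consume.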

Referring to Figure 11, the genetic map produced by the method of the present invention is much finer in scale than maps produced by most conventional molecular markers.

The method of the present invention relates to judging the recombination break site, and its detailed process comprises the following steps:

Step 1: Construct the SNP "string".

Take the window size of 15 (win=15) as an example. The genotypes of all SNPs on the 12 chromosomes of the two parents and progeny were sequentially compressed into a single string. Regardless of the actual distance between two adjacent SNPs, all gaps are removed.

Taking 12 chromosomes as an example, the SNPs on the 12 chromosomes become 12 consecutive word strings (see Figure 12). Blue in the figure represents the genotype of parent 1, and red represents the genotype of parent 2. For the progeny of two parents, there are three possible genotypes at each SNP site: the homozygous genotype of parent 1 (blue), the homozygous genotype of parent 2 (red), and the heterozygous genotype (yellow).

It is generally believed that the genome of cultivated rice parental material is highly homozygous, and that in rice recombinant populations selfed for multiple generations the genome is largely homozygous, with only some heterozygous regions at certain chromosomal locations. Therefore, the SNP loci included in the analysis are first screened manually, and any SNP locus at which a parent is heterozygous is excluded, since such loci cannot be accurately judged and scored. In addition, if the sequencing depth of the progeny is not very high, SNP loci that are heterozygous in the progeny can also be filtered out, because heterozygous calls made at low depth are unreliable and are likely misjudgments caused by sequencing errors.
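Step 1 can be pictured with the following minimal sketch, which groups SNPs by chromosome, orders them by position and then ignores the physical gaps between them, so that each chromosome becomes one ordered "string" of sites. The record layout, the diploid VCF-style genotype strings, and the names `build_snp_strings` and `drop_het_progeny` are hypothetical.

```python
from collections import defaultdict

def build_snp_strings(records, drop_het_progeny=False):
    """Return {chrom: [(pos, parent_gts, progeny_gt), ...]} ordered by position.

    Only the order of SNPs is kept; the distance between neighbours plays
    no further role. With drop_het_progeny=True, sites that are heterozygous
    in the progeny are skipped (useful at low sequencing depth).
    """
    strings = defaultdict(list)
    for rec in records:
        a, b = rec["progeny"].replace("|", "/").split("/")
        if drop_het_progeny and a != b:
            continue
        strings[rec["chrom"]].append(
            (rec["pos"], tuple(rec["parents"]), rec["progeny"])
        )
    for chrom in strings:
        strings[chrom].sort(key=lambda site: site[0])
    return dict(strings)
```

Positions are retained in each tuple only so that window indices can later be mapped back to chromosome coordinates for plotting.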

Step 2: Score both parents in one window

According to the law of Mendelian inheritance, the scores of all SNP sites in a sliding window are calculated, and the total score of each parent is calculated as the score of the parent in the chromosome position of the sliding window. The degree of conformity of the offspring with the parent is measured according to the typing of each parent. The scoring rules are as shown in Table A or similar scoring rules. Preferably, the scoring rules are formulated according to the genetic laws of organisms.

Combined with the theoretical model, the genotype scoring rules in the present invention are further described below.

For a sliding window consisting of consecutive SNPs, the score of the progeny against any parent consists of three parts: 1. loci at which the progeny has the same genotype as the parent; 2. loci at which the progeny differs from the parent but conforms to the Mendelian inheritance law; 3. loci at which the progeny differs from the parent and does not conform to the Mendelian inheritance law, together with misjudged loci caused by various possible factors.

For a single parent A, the number of SNP loci at which the progeny to be tested is the same as the parent is m; the number of loci that differ from the parent but conform to the Mendelian inheritance law is n; and the number of loci that do not conform to the Mendelian inheritance law, together with misjudged loci caused by various possible factors, is e.

Then the score value $S_A$ of the parent is:

$$S_A = s_1 m + s_2 n + s_3 e$$

where $s_1$ is the scoring value of a SNP site at which the progeny and the parent are the same;

$s_2$ is the scoring value of a locus that differs from the parent but conforms to the Mendelian inheritance law; and

$s_3$ is the scoring value of a locus that differs from the parent and does not conform to Mendelian inheritance, or of a misjudged locus caused by various possible factors.
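The score expression above can be illustrated with a short Python sketch. Because Table A is not reproduced in this text, the rule used here to sort a locus into the three classes is an assumption: a locus counts as "different but Mendelian-consistent" when the progeny is heterozygous and carries the allele of a homozygous parent. The names `classify_site` and `window_score` are likewise hypothetical.

```python
def classify_site(progeny_gt, parent_gt):
    """Return 'same', 'mendelian' or 'other' for one SNP (assumed rule)."""
    prog = sorted(progeny_gt.replace("|", "/").split("/"))
    par = sorted(parent_gt.replace("|", "/").split("/"))
    if prog == par:
        return "same"
    if prog[0] != prog[1] and par[0] == par[1] and par[0] in prog:
        # heterozygous progeny sharing the homozygous parent's allele
        return "mendelian"
    return "other"

def window_score(progeny_gts, parent_gts, s1=1, s2=1, s3=0):
    """Compute S_A = s1*m + s2*n + s3*e for one window and one parent."""
    counts = {"same": 0, "mendelian": 0, "other": 0}
    for prog, par in zip(progeny_gts, parent_gts):
        counts[classify_site(prog, par)] += 1
    m, n, e = counts["same"], counts["mendelian"], counts["other"]
    return s1 * m + s2 * n + s3 * e
```

With the default weights, a window with progeny genotypes `0/0, 0/1, 1/1, 1|1` scored against parental genotypes `0/0, 0/0, 0/0, 1/1` gives m = 2, n = 1, e = 1 and a score of 3.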

In a window of N consecutive SNPs, there are i parents to be evaluated. For a given SNP locus at chromosome position k, the genotypes of the progeny and the parent at this locus are $g_k$ and $g'_k$, respectively. The genotypes of pure-line rice parents are generally 0/0, 0|0, 1/1 or 1|1. The genotypes of the progeny are generally 0/0, 0|0, 1/1, 1|1, 0/1 or 0|1. Allele 2 generally occurs at low frequency and is not considered for the time being.

For the ith individual parent, the probability that the offspring matches its genotype is:

Use a Bayesian approach to find the probability that the offspring will belong to the parent in that region:

The genotype of a certain region of the offspring is determined to the parent genotype with the highest coincidence probability, that is, the maximum coincidence probability among i parents is obtained:

$$P_{max} = \max\{P_1, P_2, \ldots, P_i\}$$

Preferably, in the present invention, the following table is used for genotype scoring.

Table A Genotype scoring rules table

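According to the scoring rules, $s_1 = 1$, $s_2 = 1$, and $s_3 = 0$.

If parent A is scored, then substituting these values into the expression for $S_A$ gives:

$$S_A = 1 \cdot m + 1 \cdot n + 0 \cdot e = m + n$$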

According to this method, after the score value of each parent is counted separately, the standard deviation (std) is calculated with a sliding window over the consecutive score values of each parent.

Preferably, when judging that a certain segment of chromosome is the homozygous genotype of the A parent, the following conditions need to be met: the score S is the highest, and the standard deviation is the smallest.

Step 3: Determine the genotype of the chromosome region according to the score value

By sliding the sliding window on the SNP sites of the whole genome, the score value of each parent on each chromosome can be obtained. Take the score as the ordinate, and draw the score curve of each parent with the position of each sliding window on the chromosome as the abscissa.
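A score curve of this kind can be sketched as follows. The sketch assumes a window defined by a fixed number of consecutive SNPs (199 is the size discussed later in the description), a homozygous parent, and the simplified rule that a site scores 1 when the offspring carries the parent's allele and 0 otherwise, which corresponds to $s_1 = s_2 = 1$ and $s_3 = 0$. The function name `score_curve` and the `step` parameter are hypothetical.

```python
def score_curve(child_gts, parent_gts, window=199, step=1):
    """Slide a fixed-size SNP window along one chromosome.

    child_gts, parent_gts: lists of genotype tuples ordered by position.
    Returns one score per window position (the ordinate of the curve).
    """
    scores = []
    for start in range(0, len(child_gts) - window + 1, step):
        total = 0
        for child, parent in zip(child_gts[start:start + window],
                                 parent_gts[start:start + window]):
            # Assumed rule: identical or Mendelian-consistent sites score 1.
            if parent[0] in child:
                total += 1
        scores.append(total)
    return scores


# Toy chromosome: the offspring matches parent A on the first four sites only.
child = [(0, 0)] * 4 + [(1, 1)] * 4
parent_a = [(0, 0)] * 8
curve_a = score_curve(child, parent_a, window=4)
```

In this toy case the curve falls from the full score to zero as the window crosses the point where the offspring stops matching parent A.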

The genotype judgment of each chromosome is based on the characteristics of different parental score curves.

In one example, as shown in FIG. 13, sliding-window scoring was performed on the simulated progeny derived from four parents, and the score curves of the four parents were drawn according to their score values.

Observing the score curves of the four parents, in one instance, it can be seen that the parental curves have two different distribution patterns of plateau stable period and fluctuating period in different regions of chromosome 11. Therefore, the state of different parental curves in the same region is used to determine the genotype of the offspring in this region.

For example, in the 1 bp to 10780000 bp region in the figure, the yellow curve (parent 4) has a high score value close to the full score, and the score value of this parent is quite stable in this region, without large fluctuations. The numerical fluctuation of the score is measured by the standard deviation. In this region, parent 4 has a high score value and a small standard deviation, while the score values of the other three parents fluctuate in the range of 0-200 with a high standard deviation, so this region can be judged to be the homozygous genotype of parent 4. Using a similar method, the offspring genotypes of different regions of the 12 chromosomes can be determined based on the score values.

The rectangular bar corresponding to true in the figure corresponds to the real genotype information of the simulated offspring of each chromosome segment, while the rectangular bar corresponding to judge represents the offspring genotype determined by the method of the present invention, and the information of the two basically matches.

Judgment of heterozygous regions

The judgment of the heterozygous region is illustrated by the genotype judgment of the simulated progeny of the two parents. According to the principle of genetics, even if it is a hybrid progeny derived from multiple parents, its parental origin in a certain chromosomal region is at most two parents. Therefore, it can be judged whether this region is a heterozygous region according to the score curve of the two parents in this region.

In one example, as shown in Figure 14, genotype identification is performed on the simulated offspring. The rectangular bar labeled true corresponds to the real genotype information of each chromosome segment of the simulated offspring, and the rectangular bar labeled judge represents the offspring genotype determined by the method of the present invention. Similarly, when the score of one parent is high with a small standard deviation, and the score of the other parent fluctuates considerably with a large standard deviation, the region is judged to be the homozygous genotype of the former (orange or blue area).

When, in a certain segment (6000000 bp to 15000000 bp in the figure), the scores of the two parents are not much different, both fluctuate to a certain extent, and the standard deviations are large, the segment is judged to be the heterozygous genotype of the two parents (yellow area in the figure). This judgment result is also consistent with the real information.

Judgment based on quantified similarity

One of the core ideas of the method of the present invention is based on directly reading the genotype information of the real SNP of the parent and the progeny in a certain segment, and then quantifying the degree of similarity between the progeny and each parent in this segment. The numerical characteristics of the score curve (value level and standard deviation) form a relatively simplified analysis model, and then determine the genotype of each segment.

The criteria for the judgment of the present invention mainly include the following situations:

1. If a parent has a high score in this segment and its score is relatively stable (the score curve is close to a plateau), while the score curves of the other parents in this segment fluctuate strongly and have a large standard deviation (the score curves rise and fall in peaks), the segment is judged to be the homozygous genotype of that parent (Figure 15).

2. When the number of parents is two, the region can be inferred to be the heterozygous genotype of the two parents if both parents show large numerical fluctuations and a large standard deviation in the segment (Figure 16).

In the multi-parent judgment, it is very likely that only possible heterozygous segments can be found, that is, a high-scoring and stable parent cannot be found in a segment. Because the scores of each parent have large fluctuations in this section, only the two most probable parents can be given according to the numerical characteristics.

3. Because the judgment method largely depends on measuring the genotype similarity between the offspring and the parent, if two or more parents are very similar to the offspring in a certain segment, two or more parental curves with a high score and a small standard deviation will appear. This indicates that some of the parents analyzed are very similar in this segment, with little difference between them, so no judgment is made for the time being.
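The three situations above can be expressed as a small decision function. This is only a sketch under stated assumptions: the description gives no numerical cut-offs, so the thresholds `high` and `low_std` (applied to scores normalised by the window size) and the function name `classify_segment` are hypothetical.

```python
def classify_segment(stats, high=0.9, low_std=0.05):
    """Classify one segment from per-parent score statistics.

    stats: dict mapping parent name -> (mean score, standard deviation),
           both normalised to the window size (values between 0 and 1).
    high, low_std: hypothetical thresholds for "high and stable".
    """
    stable = [p for p, (mean, std) in stats.items()
              if mean >= high and std <= low_std]
    if len(stable) == 1:
        # Situation 1: exactly one high, plateau-like parent.
        return ("homozygous", stable[0])
    if len(stable) >= 2:
        # Situation 3: several near-identical parents, no judgment yet.
        return ("unknown", tuple(sorted(stable)))
    # Situation 2: no stable parent; report the two highest-scoring parents
    # as the most probable origin of a heterozygous segment.
    top_two = sorted(stats, key=lambda p: stats[p][0], reverse=True)[:2]
    return ("heterozygous", tuple(sorted(top_two)))


call = classify_segment({"P1": (0.98, 0.01), "P2": (0.50, 0.20), "P3": (0.40, 0.25)})
```

For multi-parent data the "heterozygous" result should be read as a candidate only, in line with the remark that only the two most probable parents can be given.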

Referring to Figure 17, in some cases, the region to be judged is first defined as "unknown", and the genotype of this region is determined by the genotypes of the regions on both sides.

Referring to Figure 18, if the genotypes on both sides are the same, the region is assigned that genotype. If the genotypes on both sides are different, the middle position of the region is regarded as a recombination breakpoint, and each half of the region is assigned the genotype of the adjacent side.
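The handling of "unknown" regions can be sketched as follows. The segment representation (start, end, genotype), the use of `None` for "unknown", and the function name `resolve_unknown` are assumptions made for illustration. The sketch also assumes that unknown segments are not adjacent to each other, and that an unknown segment at a chromosome end takes the genotype of its single known neighbour, a case the description does not address.

```python
def resolve_unknown(segments):
    """Resolve unknown segments from the genotypes of their flanks.

    segments: list of (start, end, genotype) sorted by position,
              where genotype is None for an "unknown" region.
    """
    resolved = []
    for i, (start, end, genotype) in enumerate(segments):
        if genotype is not None:
            resolved.append((start, end, genotype))
            continue
        left = segments[i - 1][2] if i > 0 else None
        right = segments[i + 1][2] if i + 1 < len(segments) else None
        if left is not None and left == right:
            # Same genotype on both sides: the region takes that genotype.
            resolved.append((start, end, left))
        elif left is not None and right is not None:
            # Different genotypes: place a breakpoint at the middle position.
            middle = (start + end) // 2
            resolved.append((start, middle, left))
            resolved.append((middle + 1, end, right))
        else:
            # Assumption: at a chromosome end, use the only known neighbour.
            resolved.append((start, end, left if left is not None else right))
    return resolved


same_flanks = resolve_unknown([(1, 100, "A"), (101, 200, None), (201, 300, "A")])
split_flanks = resolve_unknown([(1, 100, "A"), (101, 200, None), (201, 300, "B")])
```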

Secondary sliding window for genotype determination

In the present invention, preferably, genotype determination is performed through a secondary sliding window.

For the first time, a sliding window was performed on the genotype of the SNP, and a parental score value in each window was counted.

For the second time, a sliding window was performed on the obtained parental score values to detect the level and standard deviation of the score values.

The determination of the final genotype depends on the score value and the size of the standard deviation obtained by the secondary sliding window, and the determination of the genotype is carried out by the highest probability that a certain segment of the offspring belongs to a certain parent.

In the present invention, genotype determination can be performed faster and more accurately by using the secondary sliding window for genotype determination.
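The second pass can be sketched as follows, taking the per-window parental scores of the first pass as input. This is a simplified illustration: the rule "highest mean and smallest standard deviation" follows the condition stated for a homozygous call, while the function names, the use of the population standard deviation, and returning `None` when no single parent satisfies both conditions are assumptions.

```python
from statistics import mean, pstdev


def rolling_stats(scores, window):
    """Second sliding window: mean and standard deviation of first-pass scores."""
    return [(mean(scores[i:i + window]), pstdev(scores[i:i + window]))
            for i in range(len(scores) - window + 1)]


def best_parent(curves, window):
    """Pick, per position, the parent with the highest mean and the lowest std.

    curves: dict mapping parent name -> list of first-pass window scores
            (all lists of equal length).
    Returns a list of parent names, or None where no parent meets both conditions.
    """
    stats = {parent: rolling_stats(scores, window)
             for parent, scores in curves.items()}
    n_positions = len(next(iter(stats.values())))
    calls = []
    for i in range(n_positions):
        highest = max(stats, key=lambda p: stats[p][i][0])
        steadiest = min(stats, key=lambda p: stats[p][i][1])
        calls.append(highest if highest == steadiest else None)
    return calls


calls = best_parent({"A": [10, 10, 10, 10], "B": [2, 8, 1, 9]}, window=3)
```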

A schematic example of a secondary sliding window is as follows:

Apparatus for the identification of multi-parental crop genotypes

The present invention also provides an identification device or an analysis device for multi-parent crop genotypes for performing the method of the present invention. Typically, the device includes:

a data input module for inputting data to be processed to be analyzed, the data to be processed includes: the sequencing data Df of the progeny plant to be identified, and the sequencing data Dp of the parent plant corresponding to the progeny plant;

A multi-parent plant genotype identification module, the multi-parent plant genotype identification module is configured to perform the method of the present invention, thereby obtaining the genotype identification result of the progeny;

The main advantages of the present invention include:

(a) The present invention provides, for the first time, a multi-parent crop genotype identification method based on high-throughput sequencing data. Before the present invention, there was no systematic method for identifying multi-parent genotypes of crops.

(b) The high-throughput genotype identification method of the present invention can greatly simplify and accelerate the genetic mapping of quantitative traits in crops ^[37-39,20] .

(c) The theoretical method of the present invention can better cooperate with the multi-parent population for genotype identification, improve the accuracy and efficiency of QTL mapping, and make full use of the abundant genetic variation existing in the multi-parent population. It also contributes to the improvement of crop genetic quality and the design of molecular breeding.

(d) In practical applications, the present invention can be used for obtaining molecular markers closely linked to genes for important agronomic traits, for efficient screening of offspring in the breeding process, for fine identification of the genotype maps of improved varieties, and so on. It provides a fast and efficient means and platform for molecular marker-assisted screening and breeding, raising their efficiency and accuracy to a new level.

In conclusion, the sequencing-based high-throughput genotyping method of the present invention will provide convenience for solving complex biological problems and improving crop breeding.

The present invention will be further described below in conjunction with specific embodiments. It should be understood that these examples are only used to illustrate the present invention and not to limit the scope of the present invention. In the following examples, the experimental methods without specific conditions are usually in accordance with conventional conditions, or in accordance with the conditions suggested by the manufacturer. Percentages and parts are weight percentages and parts unless otherwise specified.

Example 1

1.1 Genome-wide simulation data production based on real sequencing data

1.1.1 Rice materials and simulation data

Rice materials with existing high-depth real sequencing data in the laboratory, namely Indica 93-11 (Oryza sativa ssp. indica cv. 93-11), Shuohui 70, Wushan Simiao and Huang Huazhan, were used as parent materials. Mock progeny fastq data were obtained from the real fastq data of the parent materials after alignment, screening and combination. In the two-parent simulation data, the inventors used 93-11 and Wushan Simiao as the simulated parents, and simulated the respective homozygous regions of the two parents and the overlapping heterozygous regions of the two parents on each chromosome of rice, in order to test how the method of the present invention judges recombination breakpoints and heterozygous regions. In the multi-parent simulation data, the four materials 93-11, Shuohui 70, Wushan Simiao and Huang Huazhan were used as simulated parents, and 100 data simulations were performed to simulate different combinations of recombination breakpoints on each chromosome of rice. There are on average 4-6 recombination breakpoints on each chromosome of rice, and a total of 50-60 recombination breakpoints in the whole genome. The purpose is to simulate, as closely as possible, the recombination within a multi-parent rice genetic population and to verify the accuracy of the method of the present invention.

1.1.2 Identification of simulated data SNPs

The sequencing data of 93-11, Shuohui 70, Wushan Simiao and Huang Huazhan were aligned to the complete sequence of the 12 chromosomes of japonica cv. Nipponbare sequenced by the International Rice Genome Sequencing Project (IRGSP), version IRGSP 1.0 (http://rice.plantbiology.msu.edu/annotation_pseudo_current.shtml), using bwa 0.7.17-r1188 ^[13] as the alignment software. Candidate SNPs for the above parents and mock progeny were then identified using the HaplotypeCaller program (parameter -ERC GVCF) in the GATK software package ^[14] . After the intermediate variant file (g.vcf) of each parent and simulated offspring was obtained, the GenomicsDBImport program in the GATK package was used to merge all the intermediate variant files, the GenotypeGVCFs program was used to export the merged variant file, and the SelectVariants program was used to select the required SNP site information. All SNP sites were then filtered with the VariantFiltration program (parameters: --cluster-size 3 --cluster-window-size 10 --filter-expression "QD<10.00" --filter-name lowQD --filter-expression "FS>15.000" --filter-name highFS --genotype-filter-expression "DP>50||DP<5" --genotype-filter-name InvalidDP) to obtain high-quality SNP sites. On this basis, the genotype identification of the mock progeny was carried out.
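For illustration, the filtering step can be assembled as a command line from Python. The sketch below only builds the argument list for VariantFiltration with the parameters listed above; the `gatk` launcher name (GATK4-style invocation), the `-R`/`-V`/`-O` arguments and all file names are assumptions, since the text does not state how the program was invoked.

```python
def variant_filtration_cmd(reference, input_vcf, output_vcf):
    """Build a VariantFiltration command using the parameters given in the text.

    The 'gatk' launcher and the file paths are illustrative assumptions.
    """
    return [
        "gatk", "VariantFiltration",
        "-R", reference,
        "-V", input_vcf,
        "-O", output_vcf,
        "--cluster-size", "3",
        "--cluster-window-size", "10",
        "--filter-expression", "QD<10.00", "--filter-name", "lowQD",
        "--filter-expression", "FS>15.000", "--filter-name", "highFS",
        "--genotype-filter-expression", "DP>50||DP<5",
        "--genotype-filter-name", "InvalidDP",
    ]


# Hypothetical file names, for illustration only.
cmd = variant_filtration_cmd("IRGSP-1.0.fasta", "merged.snp.vcf.gz", "merged.snp.filtered.vcf.gz")
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it; a list is used rather than a single string so that expressions such as `DP>50||DP<5` need no shell quoting.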

1.2 Program Development of Sequencing-Based Genotyping Process

In the process of sequencing-based genotype identification, it is necessary to process massive data, apply a variety of different algorithms, and use some existing software, such as sequence matching software and QTL analysis software. Therefore, the inventor has developed a number of perl and python scripts to realize the above steps and make it a complete and easy-to-use process with wide versatility.

After the SNP information of the parents and progeny is obtained with the GATK software, a python script is used. Its principle is to analyze the SNPs identified in each individual region by region with a sliding window that moves along all SNP sites: genotypes are read with a fixed-length sliding window, the recombination breakpoints are then judged, and a recombination segment map is constructed. In addition, a perl script uses the intermediate file produced by the program to generate a PNG-format recombination segment map for each individual, so that users can browse the overall genotype intuitively. The GD module for Perl is required for drawing.

Another script, Bin2MCD, was next used to generate a high-density map consisting of recombinant bins ^[19] for subsequent QTL analysis. Once phenotypes have been assessed and trait data prepared, output files can be used directly to identify QTLs by several QTL analysis software packages, including Windows QTL Cartographer V2.5 ^[17] .

1.3 Genotype identification based on rice DH population

1.3.1 Rice DH population and three-parent population

The rice DH population used in this study was constructed by the laboratory of the National Genetic Research Center of the Chinese Academy of Sciences. Its two parents are Kasalath and japonica cv. Nipponbare. The DH population consists of lines produced by the F2 progeny after many years of self-recombination. The inventors selected dozens of lines for genotype identification and analysis. The three-parent rice plants used in this study were also constructed by the laboratory of the National Genetic Research Center of the Chinese Academy of Sciences. The three parents are Wushan Simiao, 93-11 and Shuohui 70. The plants in this population were produced by self-recombination of the hybrid progeny of the three parents, and their genomes contain abundant recombination information.

1.3.2 Genotyping of rice DH populations and three-parent populations using sequencing-based methods

The DH population of rice was genotyped using the method of the present invention, and a high-density map composed of recombinant bins was generated by Bin2MCD. At the same time, in order to measure the accuracy of the method, genotype analysis and construction of a high-density bin map were also performed using the method published in 2010. The high-depth (20-30x) sequencing data of the two parents, Kasalath and Nipponbare, were aligned to the Nipponbare reference genome IRGSP 1.0 using the bwa software, and the GATK software was then used to find the high-quality SNP information of the two parents. A perl script was then used to replace the SNPs at the specified loci on the Nipponbare genome, thereby generating a pseudo reference for the two parents. The low-abundance sequencing data of the DH population were then aligned to the pseudo reference of the two parents for genotyping.

result

2.1 Rice multi-parent simulation data and genotype identification based on real parents

2.1.1 Simulation data of rice genome information

As shown in FIG. 1, the genotype information predicted for the progeny simulated by the inventors from the two parents should be consistent with the figure. The generated simulation data includes three cases: the homozygous region of Wushan Simiao, the homozygous region of 93-11, and the heterozygous region of Wushan Simiao and 93-11. The figure shows the expected length of each region and the location of the recombination breakpoints. The simulated data are produced from the real sequencing data of the two parents. First, the fastq data of the two parents are aligned to the rice Nipponbare genome, and the required alignment information (chromosome and position information) in the resulting sam file is then screened. The fastq information from the two parents is then reformatted to form the simulated hybrid progeny fastq data.

As shown in Figure 2, a similar method was adopted to produce simulated progeny data derived from four parents Wushan Simiao, Huang Huazhan, 93-11, and Shuohui 70. Its predicted genomic genotype information and recombination breakpoints should be consistent with the information in the figure.

2.1.2 Genotyping of simulated data

The fastq data of the simulated progeny and the fastq data of the two parents, Wushan Simiao and 93-11, were aligned to the rice reference genome IRGSP 1.0. The genome-wide variation information of the two parents and the simulated progeny was then identified with the GATK software and filtered to obtain high-quality SNP sites.

As shown in Figure 3, after the required SNPs are obtained, the "sliding window" method is used to judge the SNPs of the whole genome, and the two parents are scored and compared within each sliding window. If the score of one parent is clearly higher, the segment is judged to be the homozygous genotype of that parent (indicated in red or blue in the figure). When the scores of the two parents are not significantly different, the segment is judged to be the heterozygous region of the two parents (indicated in yellow in the figure). The inventors designed a quantitative method to measure the accuracy of the judgment: the whole genome is divided into thousands of small regions of 100 kb (or small regions of 20-200 kb), and the degree of agreement between the results obtained by the method of the present invention and the standard map is then compared, which measures the accuracy of the method. According to this method, comparing the genotype information obtained from the simulated data with the real genotype of the simulated data, the accuracy of the two-parent identification reaches 89.61%.
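The accuracy measure described above can be sketched as follows. The description does not state how a small region is compared with the standard map, so this sketch assumes that each region is represented by the genotype at its midpoint; the segment representation (start, end, genotype) and the function names are likewise hypothetical.

```python
def genotype_at(segments, position):
    """Return the genotype of the segment covering a position, or None."""
    for start, end, genotype in segments:
        if start <= position <= end:
            return genotype
    return None


def bin_accuracy(predicted, truth, chrom_length, bin_size=100_000):
    """Fraction of fixed-size regions whose predicted genotype matches the truth.

    Assumption: each region is compared at its midpoint.
    """
    agree = 0
    total = 0
    for start in range(1, chrom_length + 1, bin_size):
        midpoint = min(start + bin_size // 2, chrom_length)
        total += 1
        if genotype_at(predicted, midpoint) == genotype_at(truth, midpoint):
            agree += 1
    return agree / total


# Toy 1 Mb chromosome: the predicted breakpoint is 100 kb away from the true one.
truth = [(1, 500_000, "A"), (500_001, 1_000_000, "B")]
predicted = [(1, 600_000, "A"), (600_001, 1_000_000, "B")]
accuracy = bin_accuracy(predicted, truth, chrom_length=1_000_000)
```

In this toy case one of the ten 100 kb regions disagrees, which gives an accuracy of 0.9.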

At the same time, the inventors also used the published SEG-Map method to judge the genotype of the simulated progeny data. The fastq files of the simulated data were aligned to the pseudo reference of the two parents, the software was used to screen out the parent-specific fastq sequences, the SNP site information was determined from the positions of the sequence alignments, and the sliding window method was then used to determine the genotype information. This method has been verified theoretically and by data simulation in more detail in published articles, and has high accuracy and feasibility. Measured with the same quantitative method, the accuracy of the SEG-Map results is 89.32%, which is not much different from that of the method of the present invention, indicating that the method of the present invention has high feasibility and accuracy.

The SEG-Map method does have high reliability for identifying genotypes derived from two parents, and the inventors have long used this method in the genome analysis of rice materials. However, this method cannot perform genotype identification on materials derived from multiple parents, so the method of the present invention is also intended to solve the problem of multi-parent genotype identification. As shown in Figure 5, the inventors used the SNP-based sliding window method to identify the genotypes of the simulated progeny derived from the four parents, scoring the four parents within one window. The region is then determined to be the homozygous region of the corresponding parent, and the homozygous regions of the four parents are represented by red, blue, green and yellow, respectively, in the figure. The fastq data of the four parents were simulated 100 times, and the method of dividing the genome into small regions was again used to quantify the accuracy. By comparison with the standard map, the average accuracy of the method of the present invention for genotype identification of the four-parent simulated data is 92.10%.

2.1.3 Determination of recombination break sites

Genotypes are read from the scores of the two parental SNPs as the "window" slides along the chromosome. A genotype does not change until a recombination breakpoint is encountered. The inventors found that there are two types of break sites: one separates two homozygous genotypes, and the other separates a segment of homozygous genotype from a segment of heterozygous genotype. The former is the predominant form in RIL populations, while the latter is mostly found in F ₂ populations. When a sliding window hits a "homozygous/homozygous" breakpoint, the homozygous genotype briefly changes to a heterozygous genotype and then changes back from a heterozygous genotype to a homozygous genotype. When a sliding window hits a "homozygous/heterozygous" breakpoint, the homozygous genotype becomes a heterozygous genotype and then changes to a homozygous genotype again. In this way, the boundary point between the homozygous genotype region and the heterozygous genotype region can be determined.
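Locating break sites from a sequence of per-window genotype calls can be sketched as follows. This is a simplified illustration that places a breakpoint midway between two neighbouring windows whose calls differ; it does not merge the brief heterozygous call that appears at a "homozygous/homozygous" breakpoint. The function name `segments_from_calls` and the input layout are assumptions.

```python
def segments_from_calls(positions, calls):
    """Turn per-window genotype calls into segments and breakpoints.

    positions: window centre positions in bp, in ascending order.
    calls:     genotype label of each window (e.g. "A", "B", "H").
    Returns (segments, breakpoints); segments are (start, end, genotype).
    """
    segments = []
    segment_start = positions[0]
    for i in range(1, len(calls)):
        if calls[i] != calls[i - 1]:
            # The genotype changed: put the breakpoint between the two windows.
            breakpoint = (positions[i - 1] + positions[i]) // 2
            segments.append((segment_start, breakpoint, calls[i - 1]))
            segment_start = breakpoint + 1
    segments.append((segment_start, positions[-1], calls[-1]))
    breakpoints = [end for _, end, _ in segments[:-1]]
    return segments, breakpoints


segments, breakpoints = segments_from_calls(
    [100, 200, 300, 400, 500], ["A", "A", "H", "B", "B"])
```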

2.1.4 The influence of different window sizes on the accuracy of multi-parent judgment

When this sequencing-based method is used for genotype identification, appropriate analysis parameters must be set. The first consideration is whether the size of the sliding window affects the accuracy of genotype detection, for example, how many SNPs each window contains within a given physical length.

As shown in Figure 6, the present invention uses different window sizes to perform genotype analysis on the final SNP information screened from the four-parent simulation data, and finds that the sliding window size does affect the final analysis accuracy. When the window size is small (less than 199), the final accuracy is below 90%. When the sliding window size is increased to 199, the genotype identification accuracy reaches 93.72%. When the sliding window size continues to increase, the final accuracy does not change much, indicating that the accuracy of the judgment result does not always increase with the size of the sliding window. In terms of program execution, a larger sliding window requires more computing resources and computing time, and the time cost becomes more prominent when large populations need to be processed. Therefore, weighing the time cost against the accuracy, the inventors consider a sliding window size of 199 (or a sliding window size of 180-220) to be a more reasonable choice.

2.1.5 The influence of different sequencing depths on the accuracy of multi-parent judgment

Sequencing depth has a very important influence on genotype identification, and the SEG-Map software can perform relatively accurate genotype identification at a low sequencing depth. The method of the present invention was therefore tested at different sequencing depths.

As shown in Figure 7, three different depths of 0.2x, 1.5x and 3x were used to test the genotyping accuracy. At each depth, 100 fastq data simulations were performed, genotype identification was then carried out according to the variation information, and the accuracy of genotype identification was finally measured by the degree of conformity with the standard map. The results showed that the accuracy of genotype identification improved slightly with increasing depth, but the improvement did not reach the expected range, and the final genotype identification accuracy did not exceed 95%. Therefore, the present inventors made further improvements.

2.2 Genotype identification method based on SNP locus sliding window

2.2.1 The main steps of the data analysis process

In order to make this new sequencing-based method for genotype identification widely usable, the inventors have organized and optimized the data processing flow. This process can directly analyze and process the single-end or paired-end short-read sequencing results generated by next-generation sequencing technology, and finally construct the genetic map of the recombinant population.

The functions, steps, and software (scripts) in the data analysis are shown in Figure 8. The first step consists of several tasks that can be processed simultaneously. Individuals and parental material in a certain number of recombinant populations are subjected to second-generation high-throughput sequencing simultaneously. The obtained fastq files were aligned and processed by bwa and GATK software to obtain high-quality SNP information. The SNP loci used for the final determination of the genotype should meet the following requirements: 1. The SNP loci should cover the whole genome as much as possible, and there will be no deletions in certain regions. 2. For any SNP locus, the SNP information (position information and genotype information) of the two parents and simulated offspring are known, and if any one of the three is not known, the locus should be deleted. 3. It is generally believed that the rice parent is an inbred homozygous line, and there is basically no heterozygous locus in the genome. Therefore, if a heterozygous SNP locus is found in the parent, it is generally considered that the locus is unreliable, so SNP sites where either parent was heterozygous were deleted.
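Requirements 2 and 3 above can be sketched as a per-site filter. The genotype representation (a tuple of two allele codes, or `None` when the genotype is unknown) and the function name `keep_site` are assumptions made for illustration; requirement 1, genome-wide coverage, concerns the whole SNP set and is not checked here.

```python
def keep_site(parent_gts, child_gt):
    """Decide whether one SNP site is kept for genotype determination.

    parent_gts: list of parental genotypes, each a tuple such as (0, 0),
                or None if the genotype is unknown.
    child_gt:   offspring genotype in the same form.
    """
    # Requirement 2: the genotypes of all parents and the offspring are known.
    if child_gt is None or any(gt is None for gt in parent_gts):
        return False
    # Requirement 3: no parent may be heterozygous at this site.
    if any(gt[0] != gt[1] for gt in parent_gts):
        return False
    return True


kept = keep_site([(0, 0), (1, 1)], (0, 1))
```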

After screening the high-quality genome-wide SNPs of the two parents and progeny, a python script SNPwindow was used to judge the genotype of the progeny. The script output will have two files, the rlt file and the bin file. The rlt file records the genotype determination of each SNP position, and the bin file records the distribution of recombination breakpoints on 12 chromosomes in the whole genome. These two files are an important basis for subsequent mapping and linkage analysis.

As shown in Figure 9, a genotype map is generally drawn first using the rlt and bin files through a perl script SNP2png, and the image format is in PNG format. The map is drawn according to the genotype information of SNP loci determined by the program and the position information of whole-genome recombination breakpoints. Different colors are used to represent different genotype types in the map.

In actual work, there will be a large number of offspring groups to be processed, so the SNP information of each individual can be extracted, and then analyzed and tested by the SNPwindow script to read genotypes, judge the recombination breakpoints, and construct the recombination map. First, use SNP2png to draw the gene map of each individual, and have an overall grasp of the individual's genotype, and then use the script Bin2MCD to align the recombination maps of all individuals to generate a recombination bin map.

As shown in Figure 10, a perl script can also be used to visualize the genotype of the entire population. Programs and scripts used in the analysis process are shown in italics and form a series of analysis steps. The genotype data generated at the end of the analysis process can be directly used in other software (including MapMaker and JoinMap) to construct linkage maps.

As shown in Figure 11, this genetic map is much finer in scale than maps produced by most traditional molecular markers. This package is compatible with multiple platforms (e.g., Unix, Linux and Windows). In addition to the perl environment itself, the GD module also needs to be installed, because the process includes drawing steps.

2.2.2 The detailed process of judging the recombination break site

Step 1: Construct the SNP "string".

In this way, the SNPs on the 12 chromosomes become 12 consecutive strings (Fig. 12). Blue in the figure represents the genotype of parent 1, and red represents the genotype of parent 2. For a two-parent progeny, there are three possible genotypes at each SNP site: the homozygous genotype of parent 1 (blue), the homozygous genotype of parent 2 (red), and the heterozygous genotype (yellow).

It is generally believed that the genomes of artificially cultivated rice parent materials are highly homozygous, and that the genomes of rice populations selfed for multiple generations are also relatively homozygous, with heterozygous regions at only some chromosomal locations. Therefore, the SNP loci included in the analysis are first screened, and any SNP locus at which a parent is heterozygous is excluded, because such loci cannot be accurately judged and scored. In addition, if the sequencing depth of the progeny is not very high, SNP loci that are heterozygous in the progeny can also be filtered out, because heterozygous calls made at low depth are not reliable and are likely to be misjudgments caused by sequencing errors.

Step 2: Score both parents in one window

According to the law of Mendelian inheritance, the scores of all SNP sites in a sliding window are calculated, and the total score of each parent is calculated as the score of the parent in the chromosome position of the sliding window. The degree of conformity of the offspring with the parent is measured according to the typing of each parent. The preferred scoring rules are shown in Table A, and the scoring rules of the present invention are formulated according to the genetic rules of organisms.

Table A Genotype scoring rules table

Step 3: Determine the genotype of the chromosome region according to the score value

By sliding the sliding window on the SNP sites of the whole genome, the score value of each parent on each chromosome can be obtained. Take the score as the ordinate, and draw the score curve of each parent with the position of each sliding window on the chromosome as the abscissa. The genotype judgment of each chromosome is based on the characteristics of different parental score curves. As shown in Figure 13, the offspring of the simulated four-parent source were scored by sliding window, and the score curves of the four parents were drawn according to the score values of the four parents.

Observing the score curves of the four parents, it can be seen that the parental curves show two distinct patterns in different regions of chromosome 11: a stable plateau phase and a fluctuating phase. Therefore, the state of the different parental curves in the same region is used to determine the genotype of the offspring in this region.

For example, in the 1 bp to 10,780,000 bp region in the figure, the yellow curve (parent 4) has a high score close to the full score, and the score of this parent is quite stable in this region, without large fluctuations. The numerical fluctuation of the score is measured by the standard deviation. In this region, parent 4 has a high score and a small standard deviation, while the scores of the other three parents fluctuate in the range of 0-200 with a high standard deviation, so this region is judged to be the homozygous genotype of parent 4. Using a similar method, the offspring genotypes of different regions of the 12 chromosomes can be determined based on the score values. The rectangular bar labeled "true" in the figure shows the real genotype information of each chromosome segment of the simulated offspring, while the rectangular bar labeled "judge" shows the offspring genotype determined by the method of the present invention; the two basically match.
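A region-level judgment based on score level and standard deviation might look like the following sketch. The thresholds `min_frac` and `max_sd_frac` and the function name `judge_region` are hypothetical; the source does not state numerical cut-offs.

```python
from statistics import mean, pstdev


def judge_region(region_scores, full_score, min_frac=0.9, max_sd_frac=0.05):
    """Return the parent whose score curve is high and stable, else None.

    region_scores: dict mapping parent name to the window scores in a region
    full_score: maximum score a window can reach
    min_frac, max_sd_frac: hypothetical thresholds, expressed as fractions
    of full_score, for "high score" and "small standard deviation".
    """
    stable = [
        name
        for name, values in region_scores.items()
        if mean(values) >= min_frac * full_score
        and pstdev(values) <= max_sd_frac * full_score
    ]
    # Exactly one high and stable curve gives a homozygous call; zero or
    # several such curves are left undecided at this stage.
    return stable[0] if len(stable) == 1 else None


call = judge_region(
    {
        "P1": [20, 150, 60, 190],
        "P2": [0, 90, 200, 40],
        "P3": [130, 10, 170, 55],
        "P4": [198, 199, 197, 200],
    },
    full_score=200,
)
print(call)  # P4
```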

Step 4: Judgment of Heterozygous Regions

The judgment of heterozygous regions is illustrated using the genotype judgment of a simulated progeny of two parents. According to the principles of genetics, even in a hybrid progeny derived from multiple parents, a given chromosomal region originates from at most two parents. Therefore, whether a region is heterozygous can be judged from the score curves of the two parents in this region. As shown in Figure 14, genotype identification was carried out on the simulated progeny; the rectangular bar labeled "true" shows the real genotype information of each chromosome segment of the simulated progeny, and the rectangular bar labeled "judge" shows the offspring genotype determined by the method of the present invention. As before, when the score of one parent is high with a small standard deviation, and the score of the other parent fluctuates considerably with a large standard deviation, the region is judged to be the homozygous genotype of the former (orange or blue area).

Step 5: Description of Judgment Criteria

The core idea of the method of the present invention is to read directly the genotype information of the real SNPs of the parents and the offspring in a given segment, quantify the degree of similarity between the offspring and each parent in this segment, and build a relatively simple analysis model from the numerical characteristics (score level and standard deviation) of each parent's score curve, from which the genotype of each segment is determined. The criteria for judgment mainly include the following situations:

In the multi-parent judgment, it is quite likely that only possibly heterozygous segments can be found, that is, no high-scoring and stable parent can be found in a segment. Because the scores of every parent fluctuate strongly in such a segment, only the two most probable parents can be given according to the numerical characteristics. 3. Because the judgment method depends largely on measuring the genotype similarity between the offspring and each parent, if two or more parents are very similar to the offspring in a certain segment, two or more parental curves with high scores and small standard deviations will appear. This indicates that some of the analyzed parents are very similar to each other in this segment, with little difference between them, so no judgment is made for the time being.

As shown in Figure 17, the region to be judged is first defined as "unknown", and the genotype of this region is determined by the genotypes of the regions on both sides.

As shown in Figure 18, if the genotypes on both sides are the same, the region is assigned this genotype. If the genotypes on both sides are different, the midpoint of the region is taken as a recombination breakpoint, and each half of the region is assigned the genotype of the flanking region on its side.
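The flanking-region rule can be sketched as follows. The segment representation (tuples of start, end and genotype) and the function name `resolve_unknown` are assumptions; the handling of an "unknown" region with only one known neighbor is also an assumption, since the source describes only the two-sided case.

```python
def resolve_unknown(segments):
    """Assign genotypes to "unknown" regions from their flanking regions.

    segments: list of (start, end, genotype) tuples, sorted and contiguous.
    Same genotype on both sides: the region takes that genotype.
    Different genotypes: the region is split at its midpoint, which is
    treated as the recombination breakpoint.
    Assumption: with only one known neighbor, that neighbor's genotype is used.
    """
    resolved = []
    for i, (start, end, geno) in enumerate(segments):
        if geno != "unknown":
            resolved.append((start, end, geno))
            continue
        left = segments[i - 1][2] if i > 0 else "unknown"
        right = segments[i + 1][2] if i + 1 < len(segments) else "unknown"
        if left == "unknown" and right == "unknown":
            resolved.append((start, end, "unknown"))
        elif left == right or right == "unknown":
            resolved.append((start, end, left))
        elif left == "unknown":
            resolved.append((start, end, right))
        else:
            mid = (start + end) // 2
            resolved.append((start, mid, left))
            resolved.append((mid + 1, end, right))
    return resolved


result = resolve_unknown([
    (1, 100, "P1"),
    (101, 200, "unknown"),   # flanked by P1 and P2: split at the midpoint
    (201, 300, "P2"),
    (301, 400, "unknown"),   # flanked by P2 on both sides
    (401, 500, "P2"),
])
```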

2.3 Genotyping of rice DH population based on sequencing

2.3.1 Rice DH population

The two parents of the rice DH population used were Kasalath and Nipponbare. A DH (doubled haploid) population is formed by inducing haploids from the F1 generation of a biparental cross and doubling their chromosomes. Its plants are homozygous and their selfed progeny are pure lines, so trials can be repeated over multiple years and at multiple locations. It is an ideal material for studying the interaction between genotype and environment.

2.3.2 SNP identification between two parents

The high-depth sequencing data (20x-30x) of the two parents, Kasalath and Nipponbare, were aligned to the rice reference genome IRGSP 1.0 using the bwa software, and the required high-quality SNP information was then called using the GATK software. The SNP information of the parents was combined with the SNP information of all progeny into a single VCF file, so that the required variant site information can be extracted from it conveniently.
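Extracting informative sites from such a combined VCF file can be sketched with plain text parsing. This is an illustrative reading of standard VCF columns, not the inventors' script; the function name `parent_informative_sites` and the progeny sample name `DH_001` are hypothetical.

```python
import re


def parent_informative_sites(vcf_lines, parent_names):
    """Yield (chrom, pos, calls) for sites that distinguish the parents.

    vcf_lines: iterable of VCF text lines (meta lines, header, records)
    parent_names: sample names of the parents in the VCF header
    calls maps each sample name to a tuple of allele indices, or None if
    the genotype is missing. A site is kept only if every parent is
    homozygous and at least two parents carry different alleles.
    """
    samples = []
    for line in vcf_lines:
        if line.startswith("##"):
            continue
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            continue
        gt_index = fields[8].split(":").index("GT")
        calls = {}
        for name, column in zip(samples, fields[9:]):
            alleles = re.split(r"[/|]", column.split(":")[gt_index])
            calls[name] = None if "." in alleles else tuple(alleles)
        parent_calls = [calls[p] for p in parent_names]
        if any(g is None or len(set(g)) != 1 for g in parent_calls):
            continue  # missing or heterozygous parent
        if len(set(parent_calls)) < 2:
            continue  # parents are identical here, so the site is uninformative
        yield fields[0], int(fields[1]), calls


vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tKasalath\tNipponbare\tDH_001",
    "chr01\t1000\t.\tA\tG\t50\tPASS\t.\tGT:DP\t1/1:25\t0/0:30\t./.:0",
    "chr01\t2000\t.\tC\tT\t50\tPASS\t.\tGT:DP\t0/1:22\t0/0:28\t0/0:1",
    "chr01\t3000\t.\tG\tA\t50\tPASS\t.\tGT:DP\t1/1:24\t1/1:27\t1/1:1",
]
informative = list(parent_informative_sites(vcf, ["Kasalath", "Nipponbare"]))
```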

2.3.3 Genotyping of DH populations

The average sequencing depth of each progeny in the DH population is about 0.02x, which is low-depth sequencing data. The SNP information of each progeny is extracted separately together with the parental information, and the SNPwindow script is then used for genotype judgment, producing an rlt file and a bin file for each progeny.

As shown in Figure 19, the genotype identification results were visualized with the SNP2png script, using the result file obtained in the previous step. In the figure, the homozygous genotypes of the two parents (Kasalath in red and Nipponbare in blue) can be observed. For an inbred population that has been selfed for many generations, the reliability of heterozygous regions is low; such regions may result from sequencing errors or from program misjudgments caused by low polymorphism between the two parents in the region.

Then the Bin2MCD script takes the bin files of the entire population as input and calculates the map file of the overall genotype distribution. The map file divides the whole genome into many small bins, and the genotype of each bin is determined according to the genotype results identified for each individual.
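One way to derive population-wide bins from per-individual segments is to take the union of all breakpoints, as in the sketch below. This is an assumption about how such a map could be built and is not the Bin2MCD script itself; the function name `build_bin_map` and the sample names are hypothetical.

```python
def build_bin_map(individual_segments, chrom_end):
    """Divide a chromosome into bins at the union of all breakpoints.

    individual_segments: dict mapping individual name to a sorted,
        contiguous list of (start, end, genotype) covering 1..chrom_end
    Returns (bins, table): bins is a list of (start, end) tuples, and
    table maps each individual to its genotype in every bin.
    """
    # Every segment start is a potential bin boundary.
    starts = sorted(
        {start for segs in individual_segments.values() for start, _, _ in segs}
        | {1}
    )
    bins = [
        (s, starts[i + 1] - 1 if i + 1 < len(starts) else chrom_end)
        for i, s in enumerate(starts)
    ]
    table = {}
    for name, segs in individual_segments.items():
        table[name] = [
            next(g for s, e, g in segs if s <= bin_start <= e)
            for bin_start, _ in bins
        ]
    return bins, table


bins, table = build_bin_map(
    {
        "DH_001": [(1, 400, "K"), (401, 1000, "N")],
        "DH_002": [(1, 700, "N"), (701, 1000, "K")],
    },
    chrom_end=1000,
)
```

Within each resulting bin no individual has a recombination breakpoint, so one genotype per individual per bin is sufficient.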

As shown in Figure 20, after the map file was obtained, a perl script was used to visualize the genotype information of the entire population, and the proportion of each genotype at each bin position was also calculated, which is an important parameter for population genetics research. The red and blue ratio map in Figure 20 represents the proportion of the three genotypes in different bins. This visualization gives a quick, direct view of the population genotypes. The map file, together with phenotype data, can be used directly as input to analysis software such as winQTL for QTL mapping.
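The per-bin genotype proportions can be computed from a bin-by-individual table as sketched below. The table layout and the function name `genotype_proportions` are hypothetical, and the genotype labels (`K`, `N`, `H`) are placeholders.

```python
from collections import Counter


def genotype_proportions(table):
    """Return, for each bin, the proportion of every genotype.

    table: dict mapping individual name to a list with one genotype per bin
    Returns a list (one entry per bin) of dicts {genotype: proportion}.
    """
    names = list(table)
    n_bins = len(table[names[0]])
    proportions = []
    for i in range(n_bins):
        counts = Counter(table[name][i] for name in names)
        proportions.append({g: c / len(names) for g, c in counts.items()})
    return proportions


props = genotype_proportions({
    "ind1": ["K", "N"],
    "ind2": ["N", "N"],
    "ind3": ["K", "N"],
    "ind4": ["K", "H"],
})
print(props)  # [{'K': 0.75, 'N': 0.25}, {'N': 0.75, 'H': 0.25}]
```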

2.3.4 Genotype identification of the three-parent material of rice

As shown in Figure 16, multi-parent genotype identification was performed on the three-parent progeny materials grown in the laboratory. The amount of sequencing data of the progeny is about 0.2x, and the three parent materials are Wushan Simiao, 93-11 and Shuohui 70, respectively, and the sequencing depth of the three parents is about 20x-30x. The progeny and the three parental SNP information were integrated into the same vcf, and the final high-quality SNP was further screened from it. Then use the SNPwindow script to judge the genotype of the offspring. In a window, if a parent has the highest score, the region is judged as the homozygous genotype of the parent.

As shown in Figure 21, the bin file recording the recombination breakpoints is obtained from the multi-parent judgment, and a perl script is used to visualize the judgment result. The genotype information on the 12 chromosomes can be seen directly in the picture. The red area corresponds to parent 1, Wushan Simiao; the blue area corresponds to parent 2, 93-11; the green area corresponds to parent 3, Shuohui 70; and yellow marks the heterozygous area.

As shown in Figure 22, the SEG-Map method had previously been used to determine the genotype of this material. The laboratory's earlier genotype determination of these plants was therefore based mainly on the two parents, Wushan Simiao and 93-11, and three genotypes were determined: the Wushan Simiao homozygous genotype, the 93-11 homozygous genotype, and the heterozygous genotype of the two. According to the judgment results based on three parents, the inventors found that a heterozygous segment obtained from the two-parent judgment is likely to correspond to the homozygous genotype of the third parent. Therefore, the method of the present invention can make up for the deficiencies of the previous SEG-Map software while ensuring accuracy, and solves the problem of multi-parent genotype judgment.

2.3.5 Genotype identification of rice four-parent mimic material

As shown in Figure 23, the inventors used real sequencing fastq data from four rice materials in the laboratory, 93-11, Shuohui 70, Wushan Simiao, and Huang Huazhan, and extracted the reads of the corresponding regions according to the alignment results. These reads were artificially combined and screened to produce the data of a simulated progeny. Because the real genotype information and recombination breakpoints of this progeny are known, the simulated data can be used to evaluate the feasibility and accuracy of the present invention.
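The idea of a simulated progeny with known breakpoints can be illustrated at the genotype level. Note the simplification: the source combines real sequencing reads, whereas this sketch only copies parental genotypes segment by segment. The function name `simulate_progeny` and the data layout are assumptions.

```python
def simulate_progeny(parent_genotypes, positions, mosaic):
    """Build a simulated progeny as a mosaic of parental segments.

    parent_genotypes: dict mapping parent name to a genotype list
    positions: SNP positions, in the same order as the genotype lists
    mosaic: list of (start, end, parent_name) covering all positions
    Returns (genotypes, truth): the simulated progeny genotypes and the
    true parent of origin at every SNP.
    """
    genotypes, truth = [], []
    for i, pos in enumerate(positions):
        owner = next(p for start, end, p in mosaic if start <= pos <= end)
        genotypes.append(parent_genotypes[owner][i])
        truth.append(owner)
    return genotypes, truth


sim_genotypes, sim_truth = simulate_progeny(
    {
        "A": ["AA", "CC", "GG", "TT"],
        "B": ["GG", "TT", "AA", "CC"],
    },
    positions=[100, 200, 300, 400],
    mosaic=[(1, 250, "A"), (251, 500, "B")],   # one breakpoint at 250/251
)
```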

According to the determination method, the inventors identified dozens of recombination breakpoints in the whole genome, and the determined different chromosomal regions were also roughly consistent with the real genotype results.
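Agreement between the judged and the real genotypes can be quantified, for example, as the fraction of base pairs on which the two segmentations agree. This metric and the function name `concordance` are assumptions for illustration; the source reports only that the results are roughly consistent.

```python
def concordance(true_segments, judged_segments):
    """Fraction of base pairs with the same genotype in both segmentations.

    Both arguments are lists of (start, end, genotype) tuples with
    inclusive coordinates on the same chromosome.
    """
    agree = 0
    total = 0
    for t_start, t_end, t_geno in true_segments:
        total += t_end - t_start + 1
        for j_start, j_end, j_geno in judged_segments:
            overlap = min(t_end, j_end) - max(t_start, j_start) + 1
            if overlap > 0 and t_geno == j_geno:
                agree += overlap
    return agree / total


score = concordance(
    [(1, 500, "P1"), (501, 1000, "P2")],    # real genotype
    [(1, 450, "P1"), (451, 1000, "P2")],    # judged genotype
)
print(score)  # 0.95
```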

As shown in Figure 24, the figure shows the real genotype information of the mock progeny of the present invention.

Comparing the two in detail, the inventors found differences in some local regions. The inventors examined the intermediate rlt output file of the judgment process to find the reasons for these differences. The possible reasons are as follows: 1. Because the depth of the progeny sequencing data is not very high, only part of the variation information of the whole genome is captured, and some important sites that distinguish the parents may be missed, so that the real parent cannot be distinguished in some regions. 2. Two or more of the parents are very similar in certain regions and show no polymorphism there; this is not caused by sequencing errors or sequencing depth. For such high-similarity regions, no judgment is made for the time being, and the genotype is judged from the genotype information of the regions on both sides. Therefore, some regions may not be judged accurately and are instead assigned the most likely parental genotype according to the genotypes on both sides.

It can be clearly seen from the analysis results that the most likely true parent can be determined for different regions of the whole genome, which is difficult to achieve with the previously published SEG-Map software and traditional analysis methods. The invention thus provides a multi-parent analysis method for crop breeding.

3. Discussion

Multi-parent populations have great application prospects in genetic analysis. By selecting multiple parents, the genetic diversity of the population can be increased, and the genomes of multiple parents can be merged into one population by hybridization and selfing (or inbreeding), which increases the number of recombination events. Multi-parent populations can not only increase the frequency of recombination and help uncover the genetic basis of complex traits, but also have great potential in breeding applications owing to the rich genetic basis of the selected parents. Compared with a biparental population, a multi-parent population has more parents, which increases the richness of variation in the population, including allelic diversity and phenotypic diversity, improves mapping accuracy and precision, and improves the efficiency of QTL detection. The additional recombination events will improve the resolution of QTL mapping. Because the parents of a multi-parent population are screened more carefully, that is, under more stringent criteria, and because multiple parents increase the diversity of the genetic basis, the QTL results can be applied to breeding research. Compared with natural populations, multi-parent populations are constructed by mixing multiple parents evenly; because the pedigree relationships are known and detailed information on population construction is available, population stratification is avoided at the experimental design stage, which in turn controls the problem of false positives in mapping results.

Recombinant populations are the basis of Mendelian genetics experiments and have been a key element in the study of genes, genomes, and genetic variation. But genotyping a mapping population has been laborious and time-consuming, involving costly and tedious marker development and the genotyping of hundreds of individuals with hundreds of markers. Moreover, the resolution of maps obtained using such methods is still relatively low [34-36]. By applying second-generation sequencing technology, the inventors developed a method for rapid, efficient, low-cost, informative, and reliable genotype identification. With this approach, ultra-high-resolution genotyping of a typical mapping population of several hundred individuals can be accomplished within weeks by a single genome sequencing service, whereas traditional marker-based genotyping can take months or even years to complete.

The present inventors developed a novel method for high-throughput genotyping that detects SNPs by whole-genome low-coverage resequencing. This type of SNP data differs from traditional genetic markers in two main ways. First, because sequencing is random, information on a given SNP locus is generally not obtained for every line in a recombinant population. Second, a single SNP locus is not a reliable marker for genotyping because of potential sequencing errors.

In order to process SNP data with these unique properties generated by second-generation sequencing, the inventors further developed a new analysis framework, the "sliding window method", in which the genotype of a segment is determined from the genotypes of multiple SNPs at local positions.

Based on this theory, the inventors also developed an analysis pipeline named SEG-Map (Sequencing Enabled Genotyping for Mapping recombination populations), a sequencing-based mapping process for recombinant populations. Using SEG-Map, the genetic map of a recombinant population can be constructed by analyzing the single-end or paired-end short-read sequencing results generated for that population by the Illumina Genome Analyzer II (GAII).

After further research, the inventors of the present invention have developed a novel analysis pipeline and corresponding methods and devices. In addition to optimizing the steps of the previous SEG-Map program and being compatible with current mainstream bioinformatics analysis software and different types of high-throughput sequencing data, the pipeline of the present invention can, most importantly, analyze the genotypes of populations constructed from multiple parents quickly, accurately, and reliably.

The establishment of the method of the invention can help multi-parent populations to be better applied in crop breeding. It can also help to identify more QTL sites accurately in multi-parent populations, and genomic prediction for multi-parent populations can provide a basis for using them directly as germplasm resources in variety development.

In the current analysis pipeline, the programs for reading genotypes and determining recombination break sites are designed to accommodate multiple types of mapping populations, and they connect seamlessly with the preceding SNP identification step and the subsequent construction of the recombination bin map. With these functions combined, the analysis software takes the short reads generated by next-generation sequencing as input and, after a series of operations, outputs the recombination segments. This output can be used by software for constructing genetic linkage maps and for QTL (quantitative trait locus) analysis.

The inventors used a high-throughput sequencing-based method for genotyping recombinant inbred lines in rice, showing the advantages of this new genotyping method over the commonly used PCR-based method. Before developing a sequencing-based method to genotype this population of recombinant inbred lines at the F₁₁ generation, the inventors had genotyped the F₈ generation individuals of the same population using 287 insertion/deletion markers (including SSR markers). These markers were amplified by PCR and identified by agarose gel electrophoresis. In the genetic linkage map constructed from the PCR marker results, each marker covers an average genetic distance of about 5 cM, equivalent to a physical distance of about 1.4 Mb, which is larger than most previously reported rice genetic maps. Designing, screening, and collecting these PCR markers took three researchers more than a year of work. In the study of recombinant inbred lines in rice, the inventors used Illumina GA to obtain an average marker coverage of one SNP per 40 kb in less than two weeks. Sequencing-based high-throughput genotyping is thus much faster, more efficient, and less expensive than traditional PCR-based genotyping.

The throughput of resequencing can be easily adjusted, which allows the inventors to obtain a suitable marker density and resolution of recombination breakpoints while minimizing the investment of time and resources. When a new scientific question arises that requires higher marker density or more precise determination of recombination breakpoints, the inventors can increase the resequencing coverage for the whole mapping population or part of it. It should be noted that, using this method, recombination break sites can be determined very accurately; with sufficiently high resequencing coverage, they can theoretically be located to within 1 kb. Such fine resolution enables the detection of "double crossovers" that could not previously be identified with other types of genetic markers. Ultimately, this method can improve the accuracy of QTL detection and mapping and increase the efficiency and success rate of gene cloning. Precise identification of recombination breakpoints also enables the study of genomic regions with specific genetic properties, such as recombination hotspots.

In conclusion, this high-throughput genotype identification method combined with second-generation sequencing technology will greatly simplify and accelerate the genetic mapping of quantitative traits in crops [37-39,20]. The method proposed by the present inventors is well suited to genotype identification in multi-parent populations, improves the accuracy and efficiency of QTL mapping, and makes full use of the abundant genetic variation present in multi-parent populations. It also contributes to the genetic improvement of crops and to the design of molecular breeding. In practical applications, this method provides a fast and efficient means and platform for obtaining molecular markers closely linked to genes for important agronomic traits, for efficient screening of offspring during breeding, and for fine identification of the genotype maps of improved varieties, raising the efficiency and accuracy of these tasks to a new level. Overall, this sequencing-based high-throughput genotyping method will facilitate the solution of complex biological problems and the improvement of crop breeding.

All documents mentioned herein are incorporated by reference in this application as if each document were individually incorporated by reference. In addition, it should be understood that after reading the above-mentioned teaching content of the present invention, those skilled in the art can make various changes or modifications to the present invention, and these equivalent forms also fall within the scope defined by the appended claims of the application.

references

[1] Winzeler, E.A. et al (1998). Direct allelic variation scanning of the yeast genome. Science, 281:1194-1197.

[2] Meaburn, E., Butcher, L.M., Schalkwyk, L.C., & Plomin, R. (2006) Genotyping pooled DNA using 100K SNP microarrays: a step towards genomewide association scans. Nucleic Acids Res., 34:e27.

[3] Singer, T. et al. (2006) A high-resolution map of Arabidopsis recombinant inbred lines by whole-genome exon array hybridization. PLoS Genet., 2:e144.

[4] Jeremy, E. et al. (2008) Development and evaluation of a high-throughput, low-cost genotyping platform based on oligonucleotide microarrays in rice. Plant Methods, 4:13.

[5] Craig, D.W. et al. (2008) Identification of genetic variants using bar-coded multiplexed sequencing. Nat. Methods, 5:887-893.

[6] Cronn, R. et al (2008). Multiplex sequencing of plant chloroplast genes using Solexa sequencing-by-synthesis technology. Nucleic Acids Res., 36:e122.

[7] Doi K, Iwata N, Yoshimura A (1997). The construction of chromosome substitution lines of African rice (Oryza glaberrima Steud.) in the background of japonica rice (O. sativa L.). Rice Genet. Newsl., 14:39–41.

[8] Wan XY, Wan JM, Su CC, Wang CM, Shen WB, Li JM et al. (2004). QTL detection for eating quality of cooked rice in a population of chromosome segment substitution lines. Theor. Appl. Genet., 110:71–79.

[9] Ebitani T, Takeuchi Y, Nonoue Y, Yamamoto T, Takeuchi K, Yano M (2005). Construction and evaluation of chromosome segment substitution lines carrying overlapping chromosome segments of indica rice cultivar 'Kasalath' in a genetic background of japonica elite cultivar 'Koshihikari'. Breed Sci., 55:65–73.

[10] Hao W, Jin J, Sun SY, Zhu MZ, Lin HX (2006). Construction of chromosome segment substitution lines carrying overlapping chromosome segments of the whole wild rice genome and identification of quantitative trait loci for rice quality. J. Plant Physiol. Mol. Biol., 32:354–362.

[11] Huang X, Wei X, Sang T, Zhao Q, Feng Q, Zhao Y, Li C, Zhu C, Lu T, Zhang Z, Li M, Fan D, Guo Y, Wang A, Wang L, Deng L, Li W, Lu Y, Weng Q, Liu K, Huang T, Zhou T, Jing Y, Li W, Lin Z, Buckler ES, Qian Q, Zhang Q, Li J, Han B. (2010) Genome-wide association studies of 14 agronomic traits in rice landraces. Nature Genet., 42:961-967.

[12] Li, H., Ruan, J., & Durbin, R. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res., 18:1851-1858.

[13] Kurtz, S. et al. (2004) Versatile and open software for comparing large genomes. Genome Biol., 5:R12.

[14] Rice, P., Longden, I., & Bleasby, A. (2000) EMBOSS: The European molecular biology open software suite. Trends in Genetics, 16:276-277.

[15] Ning, Z., Cox, A.J., & Mullikin, J.C. (2001) SSAHA: a fast search method for large DNA databases. Genome Res., 11:1725-1729.

[16] Lincoln, S.E. & Lander, S.L. (1993) Mapmaker/exp 3.0 and mapmaker/qtl 1.1. Technical report. Whitehead Institute of Medical Research, Cambridge, MA.

[17] Wang, S., Basten, C. J. & Zeng, Z. B (2007). Windows QTL Cartographer 2.5. Department of Statistics, North Carolina State University, Raleigh, NC.

[18] Li, R. et al. (2009) SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25: 1966-1967.

[19] Van Os, H. et al. (2006) Construction of a 10,000-marker ultradense genetic recombination map of potato: Providing a framework for accelerated gene isolation and a genomewide physical map. Genetics, 173:1075-1087.

[20] Xu J, Zhao Q, Du P, Xu C, Wang B, Feng Q, Liu Q, Tang S, Gu M, Han B, Liang G. (2010) Developing high throughput genotyped chromosome segment substitution lines based on population whole-genome re-sequencing in rice (Oryza sativa L.). BMC Genomics, 11:656.

[21] Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, Haberer G, Hellsten U, Mitros T, Poliakov A, Schmutz J, Spannagl M, Tang H, Wang X, Wicker T, Bharti AK, Chapman J, Feltus FA, Gowik U, Grigoriev IV, Lyons E, Maher CA, Martis M, Narechania A, Otillar RP, Penning BW, Salamov AA, Wang Y, Zhang L, Carpita NC, Freeling M, Gingle AR, Hash CT, Keller B, Klein P, Kresovich S, McCann MC, Ming R, Peterson DG, Mehboob-ur-Rahman, Ware D, Westhoff P, Mayer KF, Messing J, Rokhsar DS. (2009) The Sorghum bicolor genome and the diversification of grasses. Nature, 457(7229):551-6.

[22] Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S. (2002) The generic genome browser: a building block for a model organism system database. Genome Res., 12(10):1599-610.

[23] Rice Annotation Project. (2007) Curated genome annotation of Oryza sativa ssp. japonica and comparative genome analysis with Arabidopsis thaliana. Genome Res., 17:175-83.

[24] Rice Annotation Project. (2008) The rice annotation project database (RAP-DB): 2008 update. Nucleic Acids Res., 36:D1028-D1033.

[25] The Rice Full-Length cDNA Consortium. (2003) Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice, Science, 301:376–379.

[26] Liu, X., Lu, T., Yu, S., et al. (2007) A collection of 10,096 indica rice full-length cDNAs reveals highly expressed sequence divergence between Oryza sativa indica and japonica subspecies, Plant Mol. Biol., 65:403–415.

[27] International Rice Genome Sequencing Project (2005). The map-based sequence of the rice genome. Nature, 436:793-800.

[28] Yu, J. et al. (2005) The Genomes of Oryza sativa: A history of duplications. PLoS Biol., 3:266-281.

[29] Dohm, J.C., Lottaz, C., Borodina, T., & Himmelbauer, H. (2008) Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res., 36:e105.

[30] Mouse Genome Sequencing Consortium. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420:520–562.

[31] Frazer, K.A., Eskin, E., Kang, H.M., Bogue, M.A., Hinds, D.A., Beilharz, E.J., Gupta, R.V., Montgomery, J., Morenzoni, M.M., Nilsen, G.B., et al. (2007) A sequence-based variation map of 8.27 million SNPs in inbred mouse strains. Nature, 448:1050–1053.

[32] Stam P. (1993) Construction of integrated genetic linkage maps by means of a new computer package: JoinMap. Plant J., 3:739–44.

[33] Sasaki, A. et al. (2002) A mutant gibberellin-synthesis gene in rice. Nature, 416:701-702.

[34] Eshed, Y. and Zamir, D (1995). An introgression line population of Lycopersicon pennellii in the cultivated tomato enables the identification and fine mapping of yield-associated QTL. Genetics, 141:1147–1162.

[35] Loudet, O., Chaillou, S., Camilleri, C., Bouchez, D. and Daniel-Vedele, F. (2002) Bay-0 x Shahdara recombinant inbred line population: a powerful tool for the genetic dissection of complex traits in Arabidopsis. Theor. Appl. Genet., 104:1173–1184.

[36] Simon, M., Loudet, O., Durand, S., Berard, A., Brunel, D., Sennesal, F.-X., Durand-Tardif, M., Pelletier, G. and Camilleri, C. (2008) Quantitative trait loci mapping in five new large recombinant inbred line populations of Arabidopsis thaliana genotyped with consensus single-nucleotide polymorphism markers. Genetics, 178:2253–2264.

[37] Huang, X. et al (2009). High-throughput genotyping by whole-genome resequencing. Genome Res., 19:1068-1076.

[38] Xie W, Feng Q, Yu H, Huang X, Zhao Q, Xing Y, Yu S, Han B, Zhang Q. (2010) Parent-independent genotyping for constructing an ultrahigh-density linkage map based on population sequencing. Proc. Natl. Acad. Sci. USA, 107(23):10578-83. Epub 2010 May 24.

[39] Zhao Q, Huang X, Lin Z, Han B. (2010) SEG-Map: A novel software for genotype calling and genetic map construction from next-generation sequencing. Rice, 3:98-102.

Claims

1. A method for identifying the genotype of a multi-parent plant, wherein the method comprises:

(a) for n parents and their progeny, providing sequencing data Df of the progeny plant to be identified and sequencing data Dp of the parent plants corresponding to the progeny plant, wherein n is a positive integer ≥ 3;

(b) determining the SNP site information of the parents and the progeny based on the sequencing data Df and the sequencing data Dp;

(c) judging the genotype of the progeny based on the SNP site information, thereby obtaining an evaluation result for each SNP of the progeny and distribution information of the recombination breakpoints on each chromosome of the whole genome of the progeny;

(d) constructing and/or drawing a genotype map of the progeny based on the SNP evaluation result information of the progeny and the position information of the whole-genome recombination breakpoints, thereby obtaining the genotype identification result of the multi-parent plant.

2. The method of claim 1, wherein, in step (c), analysis of recombination break sites is performed based on SNP "word strings".

3. The method of claim 1, wherein step (c) comprises analyzing the recombination break sites, thereby obtaining an analysis result of the recombination break sites,

and the recombination break site analysis includes:

(s1) constructing a SNP "string", wherein the genotypes of all SNPs on each chromosome of the parents and the progeny are sequentially compressed into a string;

(s2) determining each sliding window corresponding to the SNP string according to a predetermined window size, and scoring the SNP sites in each window, thereby obtaining the respective score value P of each parent in the window;

(s3) determining the genotype corresponding to each chromosomal region of the progeny based on the score value P obtained in step (s2).

4. The method according to claim 1, wherein, in step (s3), for each chromosomal region of the progeny, the genotype corresponding to that chromosomal region is determined based on each parental score value or score value curve.

5. The method according to claim 1, characterized in that step (s3) comprises: sliding the sliding window over the SNP sites of the whole genome to obtain the score value of each parent on each chromosome, and drawing the score curve of each parent with the score as the ordinate and the position of each sliding window on the chromosome as the abscissa.

6. The method according to claim 1, characterized in that, in step (s3), the degree of similarity between the offspring and each parent in a segment is quantified, and the genotype of each segment is determined according to the numerical characteristics (score level and standard deviation) of the score curve of each parent.

7. The method of claim 3, wherein the sliding window size is 170-500 consecutive SNP sites, preferably 200-400 consecutive SNP sites;

and/or the sequencing depth of the sequencing data is 0.1x-10x, preferably 0.2x-5x.

8. The method of claim 1, wherein the plant comprises a crop, preferably a grass crop;

more preferably, the crops include rice, wheat, soybean, and tobacco.

9. A data analysis device for identifying genotypes of multi-parent plants, the device comprising:

a data input module for inputting data to be analyzed, the data including: the sequencing data Df of the progeny plant to be identified, and the sequencing data Dp of the parent plants corresponding to the progeny plant;

a multi-parent plant genotype identification module configured to perform the method described in claim 1, thereby obtaining the genotype identification result of the progeny;

and an output module for outputting the genotype identification result of the progeny.

10. The device of claim 9, wherein the multi-parent plant genotype identification module comprises:

a SNP site information analysis submodule, which is configured to determine the SNP site information of the parents and the progeny based on the sequencing data Df and the sequencing data Dp;

a chromosomal recombination breakpoint analysis submodule, which is configured to judge the genotype of the progeny based on the SNP site information, so as to obtain the evaluation result of each SNP of the progeny and the distribution information of the recombination breakpoints on each chromosome of the whole genome of the progeny;

a genotype map construction submodule, which is configured to construct and/or draw a genotype map of the progeny based on the SNP evaluation result information of the progeny and the position information of the whole-genome recombination breakpoints, so as to obtain the genotype identification result of the multi-parent plant.