CN112885408A - Method and device for detecting SNP marker locus based on low-depth sequencing

Info

Publication number: CN112885408A
Application number: CN202110199054.5A
Authority: CN
Inventors: 胡晓湘; 王宇哲; 朱迪; 任江丽; 李宁
Original assignee: China Agricultural University
Current assignee: China Agricultural University
Priority date: 2021-02-22
Filing date: 2021-02-22
Publication date: 2021-06-01

Abstract

The invention relates to the field of genetics, in particular to a method and a device for detecting SNP marker sites based on low-depth sequencing. The method comprises the following steps: obtaining genome DNA of an individual to be detected; carrying out individual low-depth whole genome sequencing on the genome DNA, and comparing a sequencing result with a reference genome to obtain polymorphic site information; based on a hidden Markov model, carrying out genotyping on the polymorphic site information by utilizing a reference haplotype database; and the reference haplotype database comprises mutation site information of a breeding group to which the individual to be detected belongs. The invention utilizes the low-depth sequencing data of a single sample to carry out high-accuracy and standardized genotyping of SNP sites of ten million orders of magnitude in the whole genome in a very short time.

Description

Method and device for detecting SNP marker locus based on low-depth sequencing

Technical Field

The invention relates to the field of genetics, in particular to a method and a device for detecting SNP marker sites based on low-depth sequencing.

Background

Single nucleotide polymorphisms (SNPs) are currently the most widely used genetic markers. They are numerous, widely distributed across the genome, and genetically stable. SNPs are widely applied in human, animal and plant research, including the dissection of the genetic mechanisms of traits, studies of selection and evolution, and genomic prediction.

Different studies require different numbers of genetic markers. Those that need high-density genome-wide markers are mainly genome-wide association analysis and genomic selection in animals and plants. In genome-wide association analysis, a higher density of genome-wide markers allows the true causal mutations underlying the target phenotype to be identified more accurately. Genomic selection (GS), which has emerged in animal and plant breeding in recent years, uses high-density SNPs covering the whole genome to build a genomic relationship matrix, estimates the genomic breeding value of each individual, and selects individuals accordingly. Notably, genomic selection is applied research: the number of samples used for selection and breeding grows substantially every year, and in actual production the approach is highly sensitive to three factors, namely the accuracy, timeliness and price of marker typing.

Current genome-wide SNP typing methods fall into two main categories: commercial SNP chips and genome sequencing. Commercial SNP chips were the mainstream method for early whole-genome typing because they are highly standardized, accurate and simple to operate. As research has expanded, however, their shortcomings have gradually become apparent. Most chips contain only tens of thousands to hundreds of thousands of SNP markers, which cannot meet the requirements of every type of study. A given chip can only detect a fixed set of mutation sites and is therefore difficult to extend. Commercial chips are designed from a limited set of mainstream breeds, so some marker sites fail in specific populations. In addition, with the continuous development of sequencing technology, the cost advantage of chip typing has gradually disappeared. On the other hand, although the cost of whole-genome sequencing keeps falling, it is still far from affordable for large-scale breeding applications, and a number of alternative targeted sequencing methods have been derived from it. Reduced-representation genome sequencing, for example, lowers cost by enriching and sequencing only a small fraction of the genome. Compared with chips, these methods represent considerable progress in marker density and cost, but targeted sequencing does not truly cover the whole genome and its analysis requires substantial bioinformatics expertise, so it has not brought a qualitative breakthrough in whole-genome typing for breeding practice.

To achieve higher-density genotyping, the main strategy currently adopted is genotype imputation, for example imputing low-density chip data from high-density chip data, or imputing chip data from high-depth sequencing data. However, these methods rely heavily on high-quality reference haplotype data sets (reference panels). This means not only that the panel must be large and that its own genotypes must be highly reliable, but also that the panel must be closely related genetically to the population being imputed. At present, most studies in livestock and poultry build reference haplotype databases from high-depth sequencing of small samples, and numerous studies report that the quality of such panels cannot guarantee high-accuracy imputation. With markers numbering in the millions, this implies tens of thousands or even hundreds of thousands of wrong genotype calls. The strategy is also computationally complex and slow, so it remains poorly suited to breeding practice.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a method and a device for detecting SNP marker sites based on low-depth sequencing. The invention utilizes the low-depth sequencing data of the sample to be tested and carries out gene typing based on the reference haplotype database, thereby shortening the time of SNP locus typing to a greater extent and having extremely high accuracy.

In a first aspect, the present invention provides a method for detecting SNP marker loci based on low-depth sequencing, comprising:

obtaining genome DNA of an individual to be detected;

performing first low-depth whole genome sequencing on the genome DNA, comparing a sequencing result to a reference genome, and then performing genotyping;

the genotyping is based on a hidden Markov model, and polymorphic sites in a sequencing result are genotyped by utilizing a reference haplotype database;

the reference haplotype database comprises haplotype information of a breeding population to which the individual to be detected belongs, which is obtained by sequencing a second low-depth whole genome.

Further, the sequencing depth of the first low-depth whole genome sequencing is between 0.1X and 1X.

Further, the genotyping is:

for each mutation site in the sequencing result, the hidden Markov model predicts the probability that the site derives from each haplotype in the reference haplotype database, and the genotyping result of the site is output according to the haplotype with the highest probability.
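As a rough illustration of this kind of haplotype-copying hidden Markov model, the sketch below runs a forward-backward pass over a toy reference panel. It is a simplification under stated assumptions, not the method of the invention: the sample is treated as haploid, alleles are coded 0 (reference) and 1 (alternate), and the function names, the `switch` probability and the `err` read-error rate are all hypothetical.

```python
def emission(ref_reads, alt_reads, allele, err=0.01):
    """Likelihood of the observed reads if the true allele is `allele`."""
    p_alt = 1.0 - err if allele == 1 else err
    return (p_alt ** alt_reads) * ((1.0 - p_alt) ** ref_reads)


def impute_haploid(haplotypes, reads, switch=0.05, err=0.01):
    """Forward-backward over K reference haplotypes and T sites.

    haplotypes: list of K lists of 0/1 alleles, each of length T
    reads:      list of T (ref_reads, alt_reads) tuples; (0, 0) means no coverage
    Returns one (best_haplotype_index, called_allele, posterior) tuple per site.
    """
    K, T = len(haplotypes), len(reads)

    def step(vec):
        # Transition: keep copying the same haplotype with probability
        # 1 - switch, otherwise jump to a uniformly chosen haplotype.
        total = sum(vec)
        return [(1.0 - switch) * v + switch * total / K for v in vec]

    def emit(t):
        r, a = reads[t]
        return [emission(r, a, haplotypes[k][t], err) for k in range(K)]

    # Forward pass, normalised at every site to avoid underflow.
    fwd, prev = [], [1.0 / K] * K
    for t in range(T):
        pred = prev if t == 0 else step(prev)
        cur = [p * e for p, e in zip(pred, emit(t))]
        s = sum(cur)
        prev = [c / s for c in cur]
        fwd.append(prev)

    # Backward pass (the transition matrix is symmetric, so `step` is reused).
    bwd = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        w = [b * e for b, e in zip(bwd[t + 1], emit(t + 1))]
        b = step(w)
        s = sum(b)
        bwd[t] = [x / s for x in b]

    calls = []
    for t in range(T):
        post = [f * b for f, b in zip(fwd[t], bwd[t])]
        s = sum(post)
        post = [p / s for p in post]
        best = max(range(K), key=post.__getitem__)
        calls.append((best, haplotypes[best][t], post[best]))
    return calls


panel = [[0, 0, 0, 0, 0], [1, 1, 1, 1, 1], [0, 1, 0, 1, 0]]
sample_reads = [(0, 1), (0, 0), (0, 1), (0, 0), (0, 1)]  # sites 1 and 3 uncovered
calls = impute_haploid(panel, sample_reads)
```

Sites without any reads still receive a call, because the posterior at those sites is carried over from neighbouring covered sites through the transition model. A diploid model would track pairs of haplotypes instead of single ones.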

Further, the method for constructing the reference haplotype database comprises the following steps:

obtaining genomic DNAs of a plurality of individuals of the breeding population, and performing the second low-depth sequencing to obtain sequencing data;

comparing the sequencing data to a reference genome, and judging and screening population polymorphic sites to obtain position information of each polymorphic site in the breeding population;

and processing the mutation site information of the breeding population through an EM iterative algorithm to construct a reference haplotype database.

Further, the second low depth whole genome sequencing has a population sequencing depth of between 300X and 600X.

Further, the plurality of individuals is 1500 or more individuals.

Further, the method for constructing the reference haplotype database further comprises:

after completing the detection of the SNP marker sites, haplotype data obtained by the detection result is incorporated into the reference haplotype database.

The method for detecting SNP marker loci based on low-depth sequencing fits the practical workflow of selection and breeding: breeding presupposes a large reference population, which matches the construction of the reference haplotype database; and the individuals to be evaluated accumulate gradually, in small batches over many rounds, which matches the single-sample unit of analysis used when detecting the SNP marker loci.

In a second aspect, the present invention provides an electronic device comprising:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein:

the memory stores program instructions executable by the processor, the processor invoking the steps of the program instructions to enable performance of the method as provided by the first aspect.

In a third aspect, the invention provides a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the steps of the method as provided in the first aspect.

The invention has the following beneficial effects:

According to the invention, a reference haplotype database derived from a large sample of the target population is built at low cost from low-depth sequencing data, and the database construction stage and the detection stage run independently of each other, so that high-density SNP genotyping of a single low-depth sample is achieved rapidly, economically, accurately and across the whole genome.

In addition, the reference haplotype database provided by the invention can be updated iteratively: once the number of detected samples reaches a certain level, the information from the new samples is merged into the reference haplotype database in a single update, which ensures that subsequently produced samples continue to be typed with high accuracy.

Drawings

FIG. 1 is a flow chart of the method for detecting SNP marker loci based on low-depth sequencing provided by the invention.

Fig. 2 is a schematic physical structure diagram of an electronic device provided in the present invention.

FIG. 3 is a graph showing the results of the relationship between different reference sample sizes and sequencing depths and genotyping accuracy provided in example 3 of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

FIG. 1 is a schematic flow chart of the method for detecting SNP marker loci based on low-depth sequencing provided by the invention. As shown in FIG. 1, the method comprises:

S1, obtaining genome DNA of the individual to be detected;

Specifically, in practical application, the genomic DNA of the individual to be detected can be obtained by any method common in the art. For example, the whole genome may be randomly fragmented by enzymatic digestion or by sonication; genomic DNA fragments obtained by any method that randomly fragments the whole genome can be used for the subsequent sequencing and other steps.

S2, performing first low-depth whole genome sequencing on the genomic DNA;

Specifically, based on the above scheme, low-depth whole genome sequencing can be performed on a second-generation sequencing platform by a conventional method in the field, and the sequencing depth is preferably between 0.1× and 1×.

S3, comparing the sequencing result with a reference genome, and then carrying out genotyping;

Specifically, the genotyping uses a hidden Markov model to genotype the polymorphic sites in the sequencing result against the reference haplotype database;

Further, the reference genome should be that of the species of the individual to be detected; for example, the pig reference genome is used when genotyping pigs, and the chicken reference genome is used when genotyping chickens.

Further, aligning the sequencing data to the reference genome yields an alignment file (BAM file) for each individual.

Further, the genotyping is: for each mutation site in the sequencing result, the hidden Markov model predicts the probability that the site derives from each haplotype in the reference haplotype database, and the genotyping result of the site is output according to the haplotype with the highest probability.

The reference haplotype database provided by the invention is constructed by the following method: obtaining genomic DNAs of a plurality of individuals of the breeding population, and performing the second low-depth sequencing to obtain sequencing data; comparing the sequencing data to a reference genome, and judging and screening population polymorphic sites to obtain position information of each polymorphic site in the breeding population; and processing the mutation site information of the breeding population through an EM iterative algorithm to construct a reference haplotype database.

In this step, the sample size used to construct the reference haplotype database should be more than 1500, and the population sequencing depth at a polymorphic site (the number of samples used to build the database × the sequencing depth of each individual) should be more than 300×, so as to ensure the accuracy of detection. In practical applications, the per-sample sequencing depth can be adjusted according to the number of samples: for example, with 1500 samples the average sequencing depth should be kept above 0.2×, and with 3000 samples it should be kept above 0.1×.
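The relationship between sample size and per-sample depth can be checked with a few lines of arithmetic. The sketch below simply restates the product given in the text; the function names are illustrative, and the 300× default is the lower bound quoted above.

```python
import math


def population_depth(n_samples, per_sample_depth):
    """Population sequencing depth = number of samples x depth per sample."""
    return n_samples * per_sample_depth


def min_per_sample_depth(n_samples, target_population_depth=300.0):
    """Smallest average per-sample depth that reaches the target population depth."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return target_population_depth / n_samples


# The two cases mentioned in the text: 1500 and 3000 reference samples.
depth_1500 = min_per_sample_depth(1500)
depth_3000 = min_per_sample_depth(3000)
```

With these inputs the required averages come out at 0.2× and 0.1× respectively, matching the figures in the paragraph above.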

Furthermore, conventional software in the prior art, such as BaseVar, can be used in this step to identify and screen the population polymorphic sites and obtain the corresponding polymorphic site information, and screening criteria can be set, such as EAF ≥ 0.01.

It should be noted that the EM iterative algorithm involved in this step can be implemented with existing software, such as STITCH or fastPHASE.
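To make the EAF threshold concrete, the sketch below estimates an allele frequency from low-depth read counts with a small expectation-maximization loop and then applies the 0.01 cut-off. It assumes that EAF denotes an estimated allele frequency, that genotypes follow Hardy-Weinberg proportions, and that reads carry a fixed error rate `err`; it does not reproduce BaseVar's actual model, and all names are hypothetical.

```python
def estimate_allele_frequency(read_counts, err=0.01, n_iter=200, tol=1e-9):
    """EM estimate of the alternate-allele frequency at one site.

    read_counts: list of (ref_reads, alt_reads), one tuple per individual.
    """
    if not read_counts:
        raise ValueError("read_counts must not be empty")

    # Genotype likelihoods for 0, 1 and 2 copies of the alternate allele.
    likelihoods = []
    for ref, alt in read_counts:
        row = []
        for g in (0, 1, 2):
            p_alt = (g / 2) * (1 - err) + (1 - g / 2) * err
            row.append((p_alt ** alt) * ((1 - p_alt) ** ref))
        likelihoods.append(row)

    f = 0.5  # starting value
    for _ in range(n_iter):
        prior = ((1 - f) ** 2, 2 * f * (1 - f), f ** 2)
        expected_alt_copies = 0.0
        for row in likelihoods:
            # E-step: posterior genotype probabilities for this individual.
            post = [p * l for p, l in zip(prior, row)]
            total = sum(post)
            expected_alt_copies += (post[1] + 2 * post[2]) / total
        # M-step: frequency = expected alternate copies / total chromosomes.
        new_f = expected_alt_copies / (2 * len(likelihoods))
        converged = abs(new_f - f) < tol
        f = new_f
        if converged:
            break
    return f


def passes_eaf_filter(read_counts, min_eaf=0.01):
    """Keep a candidate site only if its estimated frequency reaches min_eaf."""
    return estimate_allele_frequency(read_counts) >= min_eaf


monomorphic_site = [(1, 0)] * 200                  # every read supports the reference
polymorphic_site = [(1, 0)] * 150 + [(0, 1)] * 50  # a quarter of reads are alternate
```

In this toy input the first site is rejected and the second is retained; with real data the read-error rate and any additional site filters would need to be chosen for the platform in use.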

Further, after the detection of the SNP marker sites is completed, the haplotype data obtained from the detection results can also be incorporated into the reference haplotype database. In practical applications, because constructing the reference haplotype database is the rate-limiting step, it is preferable to accumulate a certain number of detected samples before merging their haplotype data into the database, for example once every 1500 accumulated samples, which keeps the detection process fast.
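The batched update described here can be sketched as a small buffer that triggers a rebuild only when enough new samples have accumulated. This is an illustrative data structure, not part of the disclosed implementation: the class name, its fields and the idea of storing haplotypes as plain lists are assumptions, and a real rebuild would rerun the database construction rather than append to a list.

```python
class ReferencePanel:
    """Toy reference haplotype database with batched updates."""

    def __init__(self, haplotypes=None, batch_size=1500):
        self.haplotypes = list(haplotypes or [])
        self.batch_size = batch_size
        self.pending = []   # typed samples waiting to be merged
        self.rebuilds = 0   # number of batch merges performed so far

    def add_typed_sample(self, sample_haplotypes):
        """Queue the two haplotypes of a newly typed sample.

        Returns True if this call triggered a merge into the panel.
        """
        self.pending.append(sample_haplotypes)
        if len(self.pending) >= self.batch_size:
            self._merge_pending()
            return True
        return False

    def _merge_pending(self):
        # Placeholder for the expensive rebuild step: here the pending
        # haplotypes are simply appended to the panel.
        for hap_pair in self.pending:
            self.haplotypes.extend(hap_pair)
        self.pending.clear()
        self.rebuilds += 1


panel_db = ReferencePanel(batch_size=3)  # small batch size for demonstration
results = [panel_db.add_typed_sample(([0, 1], [1, 1])) for _ in range(3)]
```

Keeping detection and rebuilding separate in this way means that typing a new sample never waits on a database rebuild.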

FIG. 2 is a schematic physical structure diagram of an electronic device provided in the present invention. Referring to FIG. 2, the electronic device includes a processor 31, a memory 32, and a bus 33, wherein the processor 31 and the memory 32 communicate with each other through the bus 33. The processor 31 is configured to call program instructions in the memory 32 to perform the methods provided by the above method embodiments, for example: obtaining genome DNA of an individual to be detected; carrying out individual low-depth whole genome sequencing on the genome DNA, and aligning the sequencing result to a reference genome to obtain polymorphic site information; and based on a hidden Markov model, genotyping the polymorphic site information by using a reference haplotype database.

Furthermore, the logic instructions in the memory 32 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the detection method provided by the above embodiments, for example: obtaining genome DNA of an individual to be detected; carrying out individual low-depth whole genome sequencing on the genome DNA, and aligning the sequencing result to a reference genome to obtain polymorphic site information; and based on a hidden Markov model, genotyping the polymorphic site information by using a reference haplotype database.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

The present invention is further illustrated below based on more specific examples.

Example 1

1. Experimental Material

Ear tissue samples from 3000 individuals of a Duroc core swine herd were used; genomic DNA was extracted and diluted to 40 ng/μL.

2. Experimental methods

2.1 Low depth DNA library construction and sequencing

In this example, the DNA library construction by Tn5 enzyme digestion is described as follows:

(1) The Tn5 apoenzyme is loaded with the specific Tn5ME-A/Tn5Merev and Tn5ME-B/Tn5Merev adapters at 72 °C for 2 h to obtain the Tn5 working enzyme with cut-and-paste activity. The working enzyme is diluted to 16.5 ng/μL, and 4 μL of 5× TAPS-MgCl₂, dimethylformamide (DMF) and nuclease-free water are added to fragment 50 ng of genomic DNA at 55 °C for 10 min.

(2) To each reaction, 3.5 μL of 0.2% SDS was added, and the reaction was incubated at 55 °C for another 10 min. A PCR reaction was then performed, with 96 different indices included in the primers to distinguish individuals.

The PCR program was: 1 × (72 °C, 9 min); 1 × (98 °C, 30 sec); 9 × (98 °C, 30 sec; 63 °C, 30 sec; 72 °C, 3 min).

(3) After the PCR product of each individual was quantified with a Qubit fluorometer (Invitrogen), equal amounts from 96 individuals were pooled and purified with AMPure XP beads (Beckman) in two steps: 0.55× (supernatant retained) followed by 0.1× (beads retained). After the concentration of the purified product was measured, the library fragment size was checked on an Agilent Bioanalyzer 2100 to confirm that the library quality was acceptable.

Paired-end 2 × 100 bp whole genome re-sequencing was performed on all samples on an MGIseq2000 platform, with an average sequencing depth of 0.7× per sample.

2.2 polymorphic site identification screening

The filtered raw sequencing data were aligned on an FPGA-based acceleration server. The reference genome was the pig assembly Sscrofa11.1 (ftp://ftp.ensembl.org/pub/release-99/fasta/sus_scrofa/dna/), and the alignment software was BWA. Alignment took about 2-3 min per sample. In this embodiment, BaseVar was used to identify polymorphic sites with a screening criterion of EAF ≥ 0.01; a boxplot was used to evaluate the population sequencing depth at each site, and sites with a sequencing depth greater than or equal to 1.5 IQR were retained as the population mutation site set. In this example, 11.6M candidate polymorphic sites were obtained across the pig genome.
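The boxplot criterion is stated only briefly in the text. The sketch below assumes it refers to the usual boxplot rule, in which sites whose population depth lies outside the whiskers (more than 1.5 × IQR below the first quartile or above the third quartile) are treated as outliers and dropped. That interpretation, the function name and the input layout are all assumptions made for illustration.

```python
import statistics


def filter_sites_by_depth(site_depths, whisker=1.5):
    """Keep sites whose population depth lies within the boxplot whiskers.

    site_depths: dict mapping a site identifier to its population sequencing depth.
    Returns the retained site identifiers in their original order.
    """
    depths = list(site_depths.values())
    q1, _median, q3 = statistics.quantiles(depths, n=4)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [site for site, depth in site_depths.items() if lower <= depth <= upper]


# Hypothetical site identifiers and depths; the last site is an obvious outlier,
# as might happen in a collapsed repeat region.
toy_depths = {
    "chr1:100": 10, "chr1:200": 11, "chr1:300": 12, "chr1:400": 13,
    "chr1:500": 14, "chr1:600": 15, "chr1:700": 16, "chr1:800": 100,
}
kept_sites = filter_sites_by_depth(toy_depths)
```

Extreme depth is a common sign of mapping artefacts, which is why a depth-based outlier filter is often applied before building a panel.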

2.3 reference haplotype database construction

In this example, STITCH was used for the iterative EM computation, with the number of founder haplotypes preset to 10. The results of a preliminary typing run were used as the filtering standard for the database haplotypes, with the specific parameters imputation info score > 0.4 and Hardy-Weinberg equilibrium (HWE) p-value > 1e-6.
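The two thresholds can be applied as a simple per-site filter. The sketch below is illustrative only: it assumes hard genotype counts are available for the HWE test and uses a one-degree-of-freedom chi-square test, whereas the text does not state which HWE test was used. The function names and the input layout are hypothetical.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 d.f.) test of Hardy-Weinberg proportions from genotype counts."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or p == 1.0:
        return 1.0                    # monomorphic site: no deviation to test
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2))


def keep_site(info_score, n_aa, n_ab, n_bb, min_info=0.4, min_hwe_p=1e-6):
    """Apply the two thresholds quoted in the text to a single site."""
    return info_score > min_info and hwe_chisq_pvalue(n_aa, n_ab, n_bb) > min_hwe_p


balanced = keep_site(0.9, 25, 50, 25)     # genotype counts match HWE expectations
no_hets = keep_site(0.9, 50, 0, 50)       # complete heterozygote deficit
low_info = keep_site(0.3, 25, 50, 25)     # fails the info-score threshold
```

A site must pass both checks to be retained; strong departures from HWE often indicate alignment or calling artefacts rather than biology.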

2.4 candidate sample mutation typing and accuracy assessment

The samples to be typed were processed with the same DNA library construction, sequencing and alignment methods. The raw sequencing data of each sample were read against the constructed reference haplotype database, and the genotypes of all candidate polymorphic sites were identified and typed with a hidden Markov model (HMM), finally yielding SNP typing results for 11.6M sites across the whole genome of each individual. The accuracy of the genotyping results was then assessed with the GeneSeek Genomic Profiler Porcine 80K SNP Array. Genotyping results from 42 samples were collected for evaluation; taking chromosome 13 as an example, the genotype concordance at the sites shared by the two methods reached 99.67%, showing that the method is highly accurate.
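The concordance figure is the fraction of overlapping sites at which the two methods give the same genotype. A minimal way to compute it is sketched below, assuming genotypes are coded as alternate-allele counts (0, 1, 2) and missing calls as `None`; the function name and data layout are illustrative, not taken from the patent.

```python
def genotype_concordance(calls_a, calls_b):
    """Fraction of shared, non-missing sites with identical genotype calls.

    calls_a, calls_b: dicts mapping a site identifier to a genotype (0, 1, 2) or None.
    Returns (concordance, number_of_sites_compared).
    """
    shared = [
        site for site in calls_a
        if site in calls_b
        and calls_a[site] is not None
        and calls_b[site] is not None
    ]
    if not shared:
        raise ValueError("no overlapping sites with calls in both sets")
    matches = sum(1 for site in shared if calls_a[site] == calls_b[site])
    return matches / len(shared), len(shared)


# Hypothetical calls for one sample from the two methods.
low_depth_calls = {"13:1000": 0, "13:2000": 1, "13:3000": 2, "13:4000": 1, "13:5000": 0}
chip_calls = {"13:1000": 0, "13:2000": 1, "13:3000": 2, "13:4000": 2, "13:6000": 1}
concordance, n_compared = genotype_concordance(low_depth_calls, chip_calls)
```

Only sites present in both call sets enter the denominator, which mirrors the evaluation on sites shared by the two methods.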

Example 2

This example is intended to illustrate the accuracy and timeliness of the method for detecting SNP marker sites provided by the present invention.

1. Experimental Material

Blood samples from 3000 individuals of a deeply intercrossed line derived from Huiyang Bearded chickens and Lingnan Yellow chickens were used; genomic DNA was extracted and diluted to 40 ng/μL.

2. Experimental methods

The basic methods of low-depth DNA library construction and sequencing, polymorphic site identification and screening, reference haplotype database construction, and candidate sample mutation typing and accuracy evaluation were the same as in Example 1. The differences were as follows. The average sequencing depth per individual was about 0.8×. The reference genome was chicken GRCg6a (INSDC Assembly GCA_000002315.5, Mar 2018). Because the genomic heterozygosity and complexity of a hybrid population are much higher than those of a purebred population, the number of founder haplotypes in this example was preset to 24. The reference haplotype database yielded 7.9M candidate polymorphic sites on the chicken autosomes (on average about one SNP per 96 bp, evenly distributed across the genome). Accuracy was then evaluated using chicken Chr11 as an example: 28 individuals were analyzed, and typing results for 288895 SNP sites on Chr11 were successfully obtained for every individual. The same 28 individuals were additionally subjected to ultra-deep whole genome sequencing (average sequencing depth of 80× per sample) and genotyped with the GATK 4.1 standardized SNP identification protocol.

In this embodiment, 40 cores were used to construct the reference haplotype database, aligning the whole genome sequencing data of each sample to the genome took about 1-2 min, and building the database from 3000 samples took 4 h in total. In the detection process, for every 100 samples it takes only 8-10 min to go from raw sequencing data to genotyping results for the hundreds of thousands of SNPs on one chromosome, and all SNPs across the whole genome (on the order of ten million) can be produced by processing different chromosomes in parallel. For the 28 individuals used to evaluate accuracy, the concordance between the high-depth results and the genotypes from this method exceeded 99.71%, which shows that the method remains highly accurate in a hybrid population.

In conclusion, the method realizes high-accuracy and standardized genotyping of SNP loci of ten million orders of magnitude in the whole genome in extremely short time by using low-depth sequencing data of a single sample.

Example 3

This example illustrates how the per-sample sequencing depth and the sample size used during reference haplotype database construction affect genotyping accuracy.

The experimental materials and experimental methods used in this example were the same as those of example 2. In the reference haplotype database construction link, different reference sample sizes (200, 500, 1000, 1500, 2000, 3000, 4000) and sequencing depths (0.05 x, 0.1 x, 0.2 x, 0.3 x, 0.5 x) of each sample are extracted, and finally obtained genotyping results are compared with high-depth data to evaluate the accuracy.

The results are shown in FIG. 3. When the average sequencing depth of each sample reaches 0.2× or more and the sample size exceeds 1500, the genotyping accuracy is essentially stable (remaining above 98.78%) and no longer changes appreciably as sequencing depth and sample number increase; at a sequencing depth of 0.2×, once the sample size exceeds 2000 the accuracy exceeds 99%, reaching 99.13%.

Although the invention has been described in detail hereinabove with respect to a general description and specific embodiments thereof, it will be apparent to those skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims

1. A method for detecting SNP marker sites based on low-depth sequencing is characterized by comprising the following steps:

obtaining genome DNA of an individual to be detected;

2. The method of claim 1, wherein the first low depth whole genome sequencing has a sequencing depth between 0.1 x and 1 x.

3. The method of claim 1 or 2, wherein the genotyping is:

4. The method of claim 1, wherein the reference haplotype database is constructed by the method comprising the steps of:

5. The method of claim 4, wherein the second low depth whole genome sequencing has a population sequencing depth between 300 x and 600 x.

6. The method of claim 4, wherein the plurality of individuals is 1500 or more individuals.

7. The method of any of claims 4-6, wherein the reference haplotype database is constructed by a method further comprising:

8. An electronic device, comprising:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein:

the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 7.

9. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 7.