WO2019047181A1 - Method and apparatus for genotyping based on low-depth genome sequencing, and uses thereof - Google Patents

Info

Publication number
WO2019047181A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing
organism
pedigree
tested
site
Application number
PCT/CN2017/101128
Other languages
English (en)
French (fr)
Inventor
郭瑞东
贾超
Original Assignee
深圳华大生命科学研究院
Application filed by 深圳华大生命科学研究院
Priority to PCT/CN2017/101128 (WO2019047181A1)
Priority to CN201780093812.7A (CN110997936B)
Publication of WO2019047181A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids

Definitions

  • the present invention relates to the field of biotechnology, and in particular to the field of genotyping and pedigree analysis, and more particularly to methods, devices and uses for genotyping based on low-depth genome sequencing.
  • the existing breed identification (also referred to herein as "blood analysis") is typified by Wisdom Panel's dog genetic test, which uses a customized microarray chip to genotype a given set of single-base variant sites in a pet dog's DNA. The genotypes are then compared with data from purebred dogs in a database to give the breed proportions of the dog being tested.
  • the above prior art is based on a microarray chip, and each run requires hundreds of samples, so samples must be accumulated before being loaded onto the instrument together, which effectively lengthens the pedigree analysis cycle for each sample to be tested.
  • the cost is also high; that is, the technology cannot deliver test reports to users quickly and cheaply.
  • chip-based detection requires a high DNA concentration in the sample, so there is a certain probability of sampling failure; that is, the sampling requirements are demanding, which further aggravates the long turnaround and high cost of the chip technology.
  • the present invention aims to solve at least one of the technical problems existing in the prior art. To this end, it is an object of the present invention to provide a genotyping and cultivar identification technique that is low in cost and short in detection cycle.
  • genotyping and pedigree analysis can be performed based on whole genome sequencing; as the cost of whole genome sequencing falls, this approach will become cheaper than the chip-based detection scheme. Moreover, a sequencing-based scheme does not need to accumulate a batch of samples. Compared with the chip scheme, it requires less DNA, has a higher sampling success rate and a shorter experimental cycle, and can deliver a test report quickly.
  • the invention provides a method for genotyping based on low-depth genome sequencing, comprising: (a) performing low-depth genome sequencing of the whole genome of a sample to be tested to obtain a sequencing result consisting of multiple sequencing reads; (b) for at least one known variant site, constructing a reference sequence set for that site, the set comprising the variant types of the known variant site and the sequences upstream and downstream of it; (c) aligning the sequencing result obtained in step (a) against the reference sequence set to determine an alignment result for each known variant site, the alignment result including the matched variant types and the number of matches for each; and (d) determining the high-probability variant type of the known variant site based on the alignment result.
  • the inventors have surprisingly found that, using the method of the present invention, the known variant sites of a sample to be tested can be effectively genotyped from low-depth genome sequencing data, and the organism from which the sample is derived can then be effectively subjected to pedigree analysis based on the genotyping results.
  • the method for genotyping based on low-depth genome sequencing of the present invention has low cost, short detection period, and accurate and reliable detection results.
  • the invention provides a method of pedigree analysis of an organism.
  • the method comprises: (1) performing low-depth genome sequencing of the genome of a sample from the organism to be tested, and genotyping at least one known variant site of that organism using the method described above; (2) determining the pedigree of the organism based on the genotyping results.
  • with this method, a known variant site of a sample from the organism to be tested can be genotyped based on low-depth genome sequencing data, thereby determining the organism's pedigree; the method is simple and easy to operate, low in cost and short in detection cycle, and its results are accurate and reliable.
  • the invention provides an apparatus for genotyping based on low depth genome sequencing.
  • the genotyping apparatus comprises: a sequencing unit for performing low-depth genome sequencing of the whole genome of a sample to be tested to obtain a sequencing result consisting of multiple sequencing reads; a reference sequence set construction unit configured to construct, for at least one known variant site, a reference sequence set for that site, the set containing the variant types of the known variant site and the sequences upstream and downstream of it; an alignment unit, connected to both the sequencing unit and the reference sequence set construction unit, for receiving the sequencing result from the sequencing unit and aligning it against the reference sequence set to determine an alignment result for each known variant site, the alignment result including the matched variant types and the number of matches for each; and a high-probability variant type determining unit, connected to the alignment unit, for determining the high-probability variant type of the known variant site based on the alignment result.
  • the invention provides a system for pedigree analysis of an organism.
  • the system comprises: the genotyping device described above, which performs low-depth genome sequencing of the genome of a sample from the organism to be tested and genotypes at least one known variant site of that organism using the method for genotyping based on low-depth genome sequencing described above; and a pedigree determining device, coupled to the genotyping device, for determining the pedigree of the organism based on the genotyping results.
  • the system for performing pedigree analysis of an organism can genotype the known variant sites of a sample to be tested based on low-depth genome sequencing data, thereby determining the organism's pedigree; the system is convenient to operate, low in cost and short in detection cycle, and its results are accurate and reliable.
  • FIG. 1 shows a schematic flow diagram of a method for genotyping based on low-depth genome sequencing according to an embodiment of the present invention;
  • FIG. 2 shows a schematic structural view of an apparatus for genotyping based on low-depth genome sequencing according to an embodiment of the present invention;
  • FIG. 3 shows a schematic structural view of a system for performing pedigree analysis of an organism according to an embodiment of the present invention;
  • FIG. 4 shows the results of the pedigree analysis of the dog to be tested in Example 1;
  • FIG. 5 shows the result of the principal component analysis used to verify the result for the dog to be tested in Example 1;
  • FIG. 6 shows the results of the pedigree analysis of the dog to be tested in Example 2.
  • the invention provides a method of genotyping based on low depth genome sequencing.
  • with this method, the known variant sites of the sample to be tested can be effectively genotyped based on low-depth genome sequencing data, and the organism from which the sample is derived can then be effectively subjected to pedigree analysis based on the genotyping results.
  • the method for genotyping based on low-depth genome sequencing of the present invention has low cost, short detection period, and accurate and reliable detection results.
  • the method comprises the following steps:
  • low depth genome sequencing can be high throughput sequencing with a sequencing depth of no more than 5. According to some specific examples of the invention, the sequencing depth may not exceed three.
  • variant type should be understood broadly and may be any mutated base that differs from the wild type, including but not limited to single nucleotide polymorphisms and insertions and deletions of sequence fragments.
  • known variant sites may include sites known to have single nucleotide polymorphisms, fragment sequence insertions and deletions.
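  • as a minimal illustration of step (b), the sketch below builds one reference sequence per variant type of each known site by joining the upstream flank, the variant itself and the downstream flank. The function name, the layout of the site dictionaries, the 0-based coordinates and the flank length are assumptions made for this example; the text does not specify them.

```python
def build_reference_set(genome, sites, flank=34):
    """Return {"site_id:variant": sequence} for every variant type of every site.

    genome: reference sequence of one chromosome (str)
    sites:  list of dicts with hypothetical keys
            "id", "pos" (0-based start), "ref" (reference allele), "alleles"
    flank:  number of upstream/downstream bases to keep (assumed value)
    """
    refs = {}
    for site in sites:
        start = site["pos"]
        end = start + len(site["ref"])
        upstream = genome[max(0, start - flank):start]
        downstream = genome[end:end + flank]
        for allele in site["alleles"]:
            label = allele if allele else "-"  # "-" marks a deletion allele
            refs[f'{site["id"]}:{label}'] = upstream + allele + downstream
    return refs
```

  • each key combines the site identifier with the variant type, so a read that aligns to one of these sequences identifies both the site and the type it supports.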
  • the sequencing data are divided into a plurality of short sequences of equal length before step (c) is performed.
  • the appearance of mismatched bases can significantly affect the efficiency of genotyping. Therefore, for low-depth genome sequencing, such as sequencing of genomes with a sequencing depth of no more than 3, it is desirable to avoid mismatches as much as possible.
  • the inventors of the present invention have found through research that the longer the length of the sequenced data, the greater the probability of occurrence of base mismatches. Thus, by dividing the sequencing data into a plurality of short sequences, the probability of mismatching can be effectively reduced, thereby improving the efficiency of genotyping of low-depth genome sequencing.
  • the short sequence does not exceed 50 bp in length. According to further embodiments of the invention, the short sequence is preferably 35 bp in length. This reduces mismatches caused by overly long sequences, which would otherwise cause reads that could align to the corresponding position to be filtered out erroneously.
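  • the splitting of sequencing data into equal-length short sequences can be sketched as follows. The function name is hypothetical, and discarding a trailing remainder shorter than the chosen length is an assumption made here so that all pieces have equal length; the text does not say how remainders are handled.

```python
def split_read(read, k=35):
    """Split one read into consecutive, non-overlapping pieces of exactly k bases.

    A trailing remainder shorter than k is dropped (an assumption of this sketch).
    """
    return [read[i:i + k] for i in range(0, len(read) - k + 1, k)]


pieces = split_read("ACGT" * 25)  # a 100 bp read yields two 35 bp pieces
```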
  • in step (c), the alignment yields the matched variant types of the sequencing result and the number of matches for each matched variant type.
  • the matched variant types and their numbers of matches are related to the true variant type at a particular locus, i.e. a known variant site.
  • the high-probability variant type of the known variant site can therefore be obtained by inferring backwards from the alignment result.
  • in this way, relatively reliable genotyping results can be obtained efficiently from low-depth sequencing results.
  • the manner of determining the high-probability variant type based on the alignment result, that is, the manner of the backward inference mentioned above, is not particularly limited.
  • a high probability variation type is determined based on a Bayesian model.
  • the Bayesian model adopts a predetermined mutation type occurrence probability of a predetermined known variation site as a prior probability, and uses the comparison result obtained in the step (c) as a posterior probability.
  • in the expression "predetermined variant type of a predetermined known variant site" used herein, "predetermined" means determined in advance.
  • the Bayesian model takes the predetermined probability of occurrence of a specific variant type at a known variant site as the prior probability, and takes the number of occurrences of that variant type obtained from the alignment as the posterior probability of the corresponding variant type, so that the high-probability variant type of the variant site can be determined. Specifically,
  • the formula P(A|B) = P(B|A) × P(A) / P(B) is used to determine the high-probability variant type of a particular variant site, where the Bayesian model uses the probability of each known type at the known variant site as the prior probability P(A)/P(B). The prior probability can be determined by statistical analysis of a plurality of control samples, that is, samples whose types at the variant site are known, or the various types at the site can be assumed to be equally likely; for example, for a SNP site, the probability that the site is A, T, G or C is 0.25 each.
  • the alignment result is used as the observation: when a read aligns to the sequence corresponding to a certain type, the probability that this type is the base type corresponding to the read is P(B|A).
  • the P(A|B) obtained by the Bayesian model is used to give the final high-probability variant type of the known variant site.
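  • the Bayesian determination described above can be sketched as follows. This is only an illustration under stated assumptions: the likelihood model (each read reports the true type with probability 1 − error_rate and any other type with equal probability), the error_rate value and the function name are not taken from the text; only the equal prior of 0.25 per base for a SNP site is.

```python
def posterior_variant_types(match_counts, prior=None, error_rate=0.01):
    """Posterior probability of each candidate variant type at one site.

    match_counts: {variant_type: number of reads matching that type}
    prior:        {variant_type: prior probability}; equal priors if omitted
    error_rate:   assumed per-read chance of reporting a wrong type
    """
    types = list(match_counts)
    if prior is None:
        prior = {t: 1.0 / len(types) for t in types}
    n_other = max(len(types) - 1, 1)
    unnormalized = {}
    for true_type in types:
        likelihood = 1.0
        for observed, n in match_counts.items():
            p = (1.0 - error_rate) if observed == true_type else error_rate / n_other
            likelihood *= p ** n
        unnormalized[true_type] = likelihood * prior[true_type]
    evidence = sum(unnormalized.values())  # plays the role of P(B)
    return {t: v / evidence for t, v in unnormalized.items()}


posterior = posterior_variant_types({"A": 2, "C": 0, "G": 1, "T": 0})
best_type = max(posterior, key=posterior.get)
```

  • with no reads covering a site, the posterior simply equals the prior, which is one way to carry the uncertainty of low-depth data forward instead of forcing a call.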
  • the alignment result can be constructed into a matching number-variation type database.
  • the type of the database is not particularly limited.
  • the matching number-variant type database may take the form of a hash table in which the variant type is the key and the number of matches is the value. The database can therefore be searched more quickly and conveniently, and the result is more accurate and reliable.
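  • a hash table of this kind can be sketched with a Python dictionary. The key format "site:variant" and the shape of the input (one pair per aligned short sequence) are assumptions made for this example.

```python
from collections import Counter


def count_matches(alignment_hits):
    """Build {"site_id:variant_type": number of matches} from alignment hits.

    alignment_hits: iterable of (site_id, variant_type) pairs,
                    one pair per short sequence that aligned (assumed format)
    """
    table = Counter()
    for site_id, variant_type in alignment_hits:
        table[f"{site_id}:{variant_type}"] += 1
    return table


hits = [("rs1", "A"), ("rs1", "A"), ("rs1", "G"), ("rs2", "T")]
table = count_matches(hits)
```

  • a lookup such as table["rs1:A"] takes constant time on average, and a key that was never matched returns 0 rather than raising an error.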
  • the invention provides an apparatus for genotyping based on low depth genome sequencing.
  • the apparatus is adapted to carry out the aforementioned method of genotyping based on low depth genome sequencing.
  • the device can be used for genotyping the known mutation sites of the sample to be tested based on the low-depth genome sequencing data, and the operation is convenient, the cost is low, the detection period is short, and the detection result is accurate and reliable.
  • the genotyping apparatus 1000 includes a sequencing unit 100, a reference sequence set construction unit 200, a comparison unit 300, and a high probability variation type determination unit 400.
  • the sequencing unit 100 performs low-depth genome sequencing for the whole genome of the sample to be tested in order to obtain sequencing results consisting of multiple sequencing data.
  • the low depth genome sequencing is high throughput sequencing with a sequencing depth of no more than 5.
  • the sequencing depth is no more than three.
  • the reference sequence set construction unit 200 is configured to construct, for at least one known mutation site, a reference sequence set of the known mutation site, the reference sequence set containing the known mutation bit The type of variation of the point and the upstream and downstream sequences of the variant site.
  • the known variant sites comprise sites known to have single nucleotide polymorphisms, fragment sequence insertions and deletions.
  • the comparison unit 300 is connected to the sequencing unit 100 and the reference sequence set construction unit 200, respectively, for receiving the sequencing result from the sequencing unit 100 and aligning it against the reference sequence set to determine an alignment result for each known variant site, the alignment result including the matched variant types of the sequencing result and the number of matches for each matched variant type.
  • the sequencing data was previously divided into a plurality of short sequences of equal length.
  • the short sequence does not exceed 50 bp in length.
  • the short sequence is 35 bp in length.
  • the high probability mutation type determining unit 400 is coupled to the comparison unit 300 for determining a high probability variation type of the known mutation site based on the comparison result.
  • the high probability variation type determining unit 400 determines the high-probability variant type based on a Bayesian model.
  • the Bayesian model takes the predetermined probability of occurrence of a specific variant type at a known variant site as the prior probability, and takes the number of occurrences of that variant type obtained from the alignment as the posterior probability of the corresponding variant type, so that the high-probability variant type of the variant site can be determined. Specifically,
  • the formula P(A|B) = P(B|A) × P(A) / P(B) is used to determine the high-probability variant type of a particular variant site, where the Bayesian model uses the probability of each known type at the known variant site as the prior probability P(A)/P(B). The prior probability can be determined by statistical analysis of a plurality of control samples, that is, samples whose types at the variant site are known, or the various types at the site can be assumed to be equally likely; for example, for a SNP site, the probability that the site is A, T, G or C is 0.25 each.
  • the alignment result is used as the observation: when a read aligns to the sequence corresponding to a certain type, the probability that this type is the base type corresponding to the read is P(B|A).
  • the P(A|B) obtained by the Bayesian model is used to give the final high-probability variant type of the known variant site.
  • the alignment result can be constructed into a matching number-variation type database.
  • the type of the database is not particularly limited.
  • the matching number-variant type database may take the form of a hash table in which the variant type is the key and the number of matches is the value. The database can therefore be searched more quickly and conveniently, and the result is more accurate and reliable.
  • the invention provides a method of pedigree analysis of an organism.
  • with this method, a known variant site of a sample from the organism to be tested can be genotyped based on low-depth genome sequencing data, thereby determining the organism's pedigree; the method is simple and easy to operate, low in cost and short in detection cycle, and its results are accurate and reliable.
  • the method comprises: (1) performing low-depth genome sequencing of the genome of a sample from the organism to be tested, and genotyping at least one known variant site of that organism using the method described above; (2) determining the pedigree of the organism based on the genotyping results.
  • the type of organism to which the method of the present invention is applied is not particularly limited; the method can be used for pedigree analysis of dogs, cats, and even humans.
  • the organism is an animal.
  • the animal comprises a domestic cat (Felis silvestris catus), a domestic dog (Canis lupus familiaris).
  • "blood analysis" (pedigree analysis) refers to determining the bloodline, origin or lineage of a pet such as a domestic cat or a domestic dog; for example, determining, for a particular animal, the breed of its dam or sire and of its more distant ancestors.
  • step (2) the pedigree of the organism is determined based on a predetermined characteristic genotyping of the close relatives of the organism.
  • step (2) further comprises:
  • determining the pedigree of the organism further includes: for the known variant sites, comparing the high-probability variant types of the organism to be tested with the variant types of a plurality of candidate close relatives of the organism, and scoring each candidate close relative to determine its similarity value.
  • the similarity values are the feature values referred to in the embodiments; the two terms are used interchangeably herein and represent the similarity between the pet dog to be tested and each candidate breed.
  • step (2) further comprises:
  • determining the pedigree of the organism further comprises: dividing the DNA sequence of the organism to be tested into a plurality of windows of approximately the same length, each window containing at least one of the known variant sites; and classifying each of the windows based on the similarity values of the candidate close relatives, to determine the close relative source of each window.
  • the method for classifying the windows based on the similarity values is not particularly limited; classification can be done with models including, but not limited to, random forests, support vector machines and naive Bayes, using the classification methods of the party and libsvm libraries in the R language.
  • the preferred classification method is a random forest model.
  • Random forest is a classification model that combines many decision trees to obtain better results. Multiple decision trees are constructed, and each tree classifies the samples according to the weight of each point and the input feature values. Aggregating the classifications of the individual trees yields the classification given by the random forest model.
  • the gene sequence is divided into fragments of the same length, and each fragment is then classified using the base types at its variant sites, such as SNP typing results, as feature values. A fragment classified as a certain breed is thereby determined to have a DNA sequence derived from that breed.
  • the windows can be delineated in the following manner:
  • for consecutive SNP points S1, S2, S3, ..., the distance from S1 to S2 is denoted D1
  • the distance from S2 to S3 is denoted D2, and so on.
  • the longest run of SNP points S1, S2, ..., Sa that satisfies the window-length condition on X (formula not reproduced in this text) is assigned to one window, numbered window 1.
  • the longest following run of SNP points Sa+1, Sa+2, ..., Sb that satisfies the same condition is assigned to another window, numbered window 2.
  • the specific value of X is determined by the species to be detected; for a dog, for example, it may be 1% of the total length of the autosomes in the whole genome sequence.
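  • the window delineation can be sketched as a greedy pass over the sorted SNP positions. Because the exact condition is not reproduced in this text, the sketch assumes that a window is closed once the cumulative distance from its first SNP point (D1 + D2 + ...) would exceed X; the function name and this condition are assumptions, not the patent's formula.

```python
def delineate_windows(positions, x):
    """Group sorted SNP positions into consecutive windows.

    positions: SNP coordinates on one chromosome, in ascending order
    x:         maximum span of a window in bases (assumed meaning of X)
    """
    if not positions:
        return []
    windows = []
    current = [positions[0]]
    for pos in positions[1:]:
        if pos - current[0] <= x:  # cumulative distance still within X
            current.append(pos)
        else:
            windows.append(current)
            current = [pos]
    windows.append(current)
    return windows


windows = delineate_windows([10, 20, 35, 100, 110, 400], x=30)
```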
  • obtaining the close relative source that may correspond to each window further comprises: determining the distance spanned on the genome sequence of the organism to be tested by the known variant sites corresponding to each close relative source, and determining the pedigree weight of each close relative source based on that distance.
  • step (2) may comprise:
  • a pedigree weight of each candidate parent source is determined.
  • the method further comprises: obtaining the breed components of the organism to be tested by weighted calculation; and verifying the resulting breed components using a cluster analysis method;
  • the cluster analysis method is a principal component analysis. Principal component analysis is a commonly used method of data dimensionality reduction. By finding the dimensions in the multidimensional variable group that are linearly combined and having the largest variance, the original data is projected onto the new coordinate axis, so that the dimensionally reduced data can retain more information of the original data.
  • the principal component analysis can be performed using the ppca function in the pcaMethods package in the R language.
  • the method for performing pedigree analysis on an organism of the present invention may include the following steps:
  • the detectable mutations are not only single-base mutations, but also short-length and defined insertions and deletions of known fragment sequences.
  • 5) According to the alignment result of step 4), a hash table is built with the name of each aligned SNP-index as the key and its number of occurrences as the value; the hash table is updated by traversing the alignment results, yielding the number of times each SNP-index was aligned.
  • the paternal and maternal strands are assumed to be equally likely to be detected during sequencing.
  • the probability of each known type at the known variant site is used as the prior probability; here the various types at the variant site are assumed to be equally likely, and this value is denoted P(A)/P(B). The alignment result obtained in step 4) is used as the observation: when a read aligns to the sequence corresponding to a certain type, the probability that this type is the base type corresponding to the read is P(B|A).
  • 7) The genotypes obtained in step 6) are compared with the single-base typing results of samples of different breeds in the background database, and a feature value is derived for each candidate breed from the expected number of sites with identical typing. That is, if the typing results at a site are the same, the feature value of the corresponding breed is increased by one; if they differ, the feature value is unchanged. The total is then divided by the number of existing samples of the breed to give the average feature value for each breed.
  • for consecutive SNP points S1, S2, S3, ..., the distance from S1 to S2 is denoted D1
  • the distance from S2 to S3 is denoted D2, and so on.
  • the longest run of SNP points S1, S2, ..., Sa that satisfies the window-length condition on X (formula not reproduced in this text) is assigned to one window, numbered window 1.
  • the longest following run of SNP points Sa+1, Sa+2, ..., Sb that satisfies the same condition is assigned to another window, numbered window 2.
  • the specific value of X is determined by the species to be detected; here it is 1% of the total length of the autosomes in the dog's whole genome sequence.
  • 9) For each window obtained in step 8), the small DNA segment of the window is classified using the breed feature values derived in step 7), with models including but not limited to random forests, support vector machines and naive Bayes, via the classification methods of the party and libsvm libraries in the R language. The classification result is the breed to which the DNA sequence of the window most likely corresponds, and the classification is based on the feature values of known purebred dogs of each breed from step 7).
  • the classification results of the windows are recorded as b1, b2, ..., bn, each corresponding to one dog breed; in the final breed component estimation formula (not reproduced in this text), the classification results of the segments are summed to obtain the total for each breed.
  • the method further comprises, based on the per-window results for the windows obtained in step 8), weighting the breed components of the organism to be tested by the length of DNA sequence that each window represents, so that the pedigree of the organism is determined from the proportion of each breed component.
  • for a window containing the SNP points Sa, Sa+1, ..., Sb, the weight of the window is given by a formula not reproduced in this text, in which WG is the total number of bases in the dog's autosomes.
  • for each classification window described in step 8), classification yields one result; the results are recorded as b1, b2, ..., bn, each corresponding to one dog breed, and are combined by the final breed component estimation formula (not reproduced in this text).
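  • the length-weighted estimate of breed components can be sketched as follows. The weight and estimation formulas are not reproduced in this text, so the sketch assumes that each window contributes its span in bases to the breed it was classified as, and that the totals are normalized by the summed span of all windows rather than by WG; the function name and input format are hypothetical.

```python
from collections import defaultdict


def breed_components(window_calls):
    """Estimate the proportion of each breed from per-window classifications.

    window_calls: list of (breed, span_in_bases) pairs, one per window
                  (assumed format)
    """
    totals = defaultdict(float)
    for breed, span in window_calls:
        totals[breed] += span
    covered = sum(totals.values())
    return {breed: span / covered for breed, span in totals.items()}


components = breed_components([("Poodle", 300), ("Beagle", 100), ("Poodle", 100)])
```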
  • 10) Verify the results obtained in step 9) using principal component analysis or another clustering method.
  • the breeds with the largest proportions obtained in step 9) are selected and clustered using principal component analysis or another clustering method. From the clustering result, the average distance between the sample to be tested and the samples of each breed is calculated. If the breed closest to the sample to be tested is the dominant breed obtained in step 9), the reliability of the step 9) result is verified.
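  • the distance check in this verification step can be sketched as follows, assuming the samples have already been projected onto a few principal components by some earlier step. The function name and the coordinate format are assumptions; the sketch covers only the averaging and comparison, not the principal component analysis itself.

```python
from math import dist


def nearest_breed(test_point, breed_points):
    """Return the breed whose samples are on average closest to the test sample.

    test_point:   coordinates of the sample to be tested, e.g. (PC1, PC2)
    breed_points: {breed: list of coordinates of that breed's reference samples}
    """
    mean_distance = {
        breed: sum(dist(test_point, p) for p in points) / len(points)
        for breed, points in breed_points.items()
    }
    closest = min(mean_distance, key=mean_distance.get)
    return closest, mean_distance


closest, distances = nearest_breed(
    (0.0, 0.0),
    {"Poodle": [(1.0, 0.0), (0.0, 1.0)], "Beagle": [(3.0, 4.0)]},
)
```

  • the result is considered verified when the breed returned here matches the dominant breed from the component estimate.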
  • step 7) is implemented as follows: for each detected site, the typing result of the organism to be tested at that site, obtained in step 6), is compared one by one with each sample of each breed in the background database, to obtain the similarity between the organism to be tested and each breed (i.e., the "feature value" described above).
  • specifically, the typing result of the organism to be tested at a site is compared with the typing results at that site of the multiple samples of a breed in the background database. If the sample to be tested and a sample in the background database have the same typing result at the site, the similarity between the sample to be tested and that breed (i.e., the "feature value" above) is increased by one. The comparison results for the multiple samples of the same breed in the background database are then weighted and averaged to obtain the similarity (i.e., "feature value") of that breed.
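  • the similarity (feature value) calculation can be sketched as follows. The data layout and function name are assumptions, a plain average over the samples of a breed is used in place of an unspecified weighting, and sites without a call in the sample to be tested are skipped.

```python
def breed_similarity(test_genotypes, background):
    """Average number of sites at which the test sample matches each breed.

    test_genotypes: {site_id: genotype or None if the site was not detected}
    background:     {breed: [ {site_id: genotype}, ... ]} reference samples
    """
    scores = {}
    for breed, samples in background.items():
        total = 0
        for sample in samples:
            total += sum(
                1
                for site, genotype in test_genotypes.items()
                if genotype is not None and sample.get(site) == genotype
            )
        scores[breed] = total / len(samples)  # average over the breed's samples
    return scores


scores = breed_similarity(
    {"s1": "A", "s2": "G", "s3": None},
    {
        "Poodle": [{"s1": "A", "s2": "G", "s3": "T"}, {"s1": "A", "s2": "C", "s3": "T"}],
        "Beagle": [{"s1": "C", "s2": "C", "s3": "T"}],
    },
)
```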
  • the method for performing pedigree analysis on an organism of the present invention can quickly and accurately obtain the genotyping results of the corresponding sites from second-generation low-depth whole genome sequencing data. Because the average sequencing depth is only 1 to 2×, it cannot be known in advance which single-base variant sites will be covered, and exact typing results cannot be obtained. The method therefore expresses the uncertain typing results as probabilities and increases the tolerance of missing values when the organism to be tested is compared against the existing breed database. (How many missing values can be tolerated has no clear-cut answer: as the proportion of missing values in the data increases, the accuracy of breed determination decreases. According to current experience, the pedigree of the organism to be tested can be determined effectively when the number of detected SNP sites is not less than 25% of the total.)
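  • the 25% rule of thumb stated above can be checked with a few lines. The function name and the convention that an undetected site is stored as None are assumptions of this sketch.

```python
def enough_sites_detected(genotypes, total_sites, min_fraction=0.25):
    """Return True if at least min_fraction of all known SNP sites were detected.

    genotypes:   {site_id: genotype or None for an undetected site}
    total_sites: total number of known SNP sites in the panel
    """
    detected = sum(1 for g in genotypes.values() if g is not None)
    return detected / total_sites >= min_fraction


ok = enough_sites_detected({"s1": "A", "s2": None, "s3": None, "s4": None}, total_sites=4)
```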
  • the application prospect is broad.
  • for example, the method can support a pedigree certificate for a purebred dog, certify a direct relationship between two dogs or whether two samples come from the same dog, and give a mixed-breed dog quantitative ancestry proportions together with the presumed breeds within three generations.
  • the invention provides a system for pedigree analysis of an organism.
  • the system for performing pedigree analysis of an organism according to the present invention can genotype the known variant sites of a sample to be tested based on low-depth genome sequencing data, thereby determining the organism's pedigree; the system is convenient to operate, low in cost and short in detection cycle, and its results are accurate and reliable.
  • the system 10000 includes a genotyping device 1000 and a pedigree determining device 2000.
  • the genotyping apparatus 1000 is configured to perform low-depth genome sequencing of the genome of the biological sample to be tested, and to test the genome of the biological sample to be tested by using the method described above for genotyping based on low-depth genome sequencing. At least one known variant site of the organism is genotyped.
  • the organism is an animal. According to some specific examples of the invention, the animal comprises a domestic cat (Felis silvestris catus), a domestic dog (Canis lupus familiaris).
  • the pedigree determining device 2000 is coupled to the genotyping device 1000 for determining the pedigree of the organism based on the results of the genotyping. According to some embodiments of the present invention, in the pedigree determining device 2000, the pedigree of the living body is determined based on a predetermined characteristic genotyping of the close relatives of the living body.
  • the pedigree determining apparatus 2000 further includes a similarity value determining unit, the similarity value determining unit being adapted to target the high probability variation type of the organism to be tested for the known mutation site Comparing with a plurality of variant types of the close relatives of the candidate organisms, and scoring the close relatives of the candidate organisms to determine the similarity values of the close relatives of the candidate organisms.
  • the pedigree determining apparatus 2000 further includes a close relative source determining unit that divides the DNA sequence of the organism to be tested into a plurality of windows of approximately the same length, wherein the window contains at least one of the Knowing the mutation site; and classifying the obtained plurality of windows of the same length based on the similarity values of the close relatives of the candidate organisms, so as to determine the close relative source corresponding to each window.
  • The method used to classify windows from the similarity values is not particularly limited. It includes, but is not limited to, random forests, support vector machines, and naive Bayes, and can be carried out with the classification methods of the party and libsvm libraries in R.
  • The preferred classification method is a random forest model.
  • A random forest is a classification model that combines decision trees to obtain better results. Multiple decision trees are built; each tree classifies the sample according to the weight of each node and the input feature values, and the classifications of the individual trees are then combined to give the classification of the random forest model.
  • The genome sequence is divided into segments of equal length, and each segment (window) is classified using the base types of its variant sites, for example its SNP genotypes, as feature values. The window can thus be assigned to a particular breed; that is, the DNA sequence of the window is deemed to originate from that breed.
  • The pedigree determining device 2000 further includes a pedigree weight determining unit, which determines the distance spanned on the genome sequence of the organism under test by the known variant sites assigned to each close-relative source, and determines the pedigree weight of each close-relative source from that distance.
  • The pedigree determining device 2000 further includes a pedigree determining unit, which performs principal component analysis on the pedigree weights of the close-relative sources so as to determine the pedigree of the organism under test.
  • Principal component analysis is a commonly used dimensionality-reduction method. It finds the linear combinations of the original variables that have the largest variance and projects the original data onto these new coordinate axes, so that the reduced data retain as much of the original information as possible.
  • The principal component analysis can be performed with the ppca function of the pcrMethods package in R.
  • The pedigree analysis method of the present invention derives breed composition from low-depth sequencing data and thereby achieves pedigree analysis.
  • The low-depth genotyping method of the present invention uses low-depth whole-genome data to estimate genotypes at known variant sites (for example, single-base variant sites), whereas conventional variant-calling software such as GATK cannot produce proper results at such low depth.
  • Because the invention builds reference sequences from the sequence on either side of each site to be tested, it obtains accurate single-base genotyping results in one fifth of the time and with one tenth of the memory of the conventional approach.
  • Referring to Figure 1, the organism under test was genotyped with the method of the present invention and then subjected to pedigree analysis.
  • The organism under test was a pet dog that its owner reported to be a Siberian Husky.
  • The sample was saliva, collected non-invasively from the dog with a PG-100 saliva sampler.
  • The 50 bp sequence upstream of the variant site, the base at the variant site, and the 50 bp sequence downstream of the variant site are concatenated in that order to give the reference sequence for that variant type at that site; the resulting file is called the SNP-index.
  • Because each variant site has two possible base types, two reference sequences are constructed for each site by this rule, one per base type, and each is named after its SNP site number and base type. An example is the pair of SNP-index entries for bases A and G at the site numbered BICF2G630100019.
  • Step 5): from the alignment results of step 4), a hash table is built with the name of each aligned SNP-index entry as the key and its number of occurrences as the value. The table is updated while traversing the alignment results, giving the number of times each SNP-index entry was hit.
  • The result of this step is a table of about 160,000 rows, of which only the first three are listed. Each row represents one event in which a cut-read from step 1) aligned to an SNP-index entry from step 3); the first column is the SNP number and the second column is the base carried by that read.
  • Each variant type at the site is assumed to be equally likely a priori; this value is called P(A)/P(B). The alignment result for the site from step 4) is treated as the observation, and the probability that the genotype is the base type carried by a given read is called P(B|A). Bayes' formula then yields the posterior probability P(A|B), which is taken as the possible genotype value at the site.
  • The result of this step is a table of about 160,000 rows, of which only the first three are listed. The first column is the SNP ID, the second column is one possible genotype, the third column is the probability of that genotype, the fourth column is the other possible genotype at the site, and the fifth column is the probability of that other genotype.
  • Step 7): the genotypes obtained in step 6) are compared with the single-base genotyping results of dogs of different breeds in the background database, and one feature value is obtained for each candidate breed from the expected number of identical sites.
  • For each SNP site to be tested, the samples in the background database are compared one by one with the genotype obtained at that site in step 6). If the genotypes agree, the similarity between the sample under test and that breed (the feature value referred to here) is increased by one. After all samples of a breed have been compared, the total is divided by the number of samples of that breed in the database, giving the similarity, that is, the feature value, of that breed.
  • The result of this step is a table of 70 rows, of which only the first four are listed; the first column is the feature value and the second column is the corresponding breed.
  • Step 8): following the order in which the single-base variant sites occur on each chromosome, the DNA of the organism under test is divided into a plurality of windows of equal length, each containing several single-base variant sites.
  • For the N SNP sites to be tested on chromosome 1, denoted S1, S2, S3, ..., Sn, the distance from S1 to S2 is denoted D1 and the distance from S2 to S3 is denoted D2.
  • Given a fixed window size X, the largest run of SNP sites S1, S2, ..., Sa that satisfies the window-size condition is assigned to one window, numbered 1.
  • By the same rule, the largest run of SNP sites Sa+1, Sa+2, ..., Sb that satisfies the condition is assigned to the next window, numbered 2.
  • X is 1% of the total length of the autosomes in the dog genome, that is, 21 Mbp.
  • Step 9): using the windows obtained in step 8), the lengths of the DNA sequences they represent, and the feature values obtained for the different breeds in step 7), the DNA segment in each window is classified with the random forest model of the party and libsvm libraries in R. The result of the classification is the likely breed of that DNA segment, and the basis for the classification is the known feature value of purebred dogs of that breed from step 7).
  • The classification results of the windows are recorded as b1, b2, ..., b100; each label comes from the classification given by the random forest model in this step. Each of b1, b2, ... corresponds to one dog breed, and w1, w2, ..., wn are the window weights, that is, the ratio of the length of the sequence covered by a window to the total sequence length.
  • For each window, the formula for wi computes the length of the DNA sequence contained in the window; WG is the total number of autosomal bases in the dog genome.
  • After the weighted average over the window classifications, the pedigree of the organism under test is 61% Siberian Husky + 39% Greenland Sledge Dog (see Figure 4).
  • As shown in Figure 4, the ancestry composition of the dog under test is 61% Siberian Husky and 39% Greenland Sledge Dog (the photograph in Figure 4 is of the dog under test).
  • Step 10): the result obtained in step 9) is verified by principal component analysis.
  • Principal component analysis is a common dimensionality-reduction method that is implemented in many programming languages, so the result can be obtained directly from the input data.
  • Principal component analysis is carried out in the following steps: 1) compute the covariance matrix of the input matrix; 2) compute the eigenvalues and eigenvectors of that covariance matrix; 3) select the two eigenvectors with the largest eigenvalues; 4) project the input matrix onto these eigenvectors.
  • Specifically, the five most abundant breeds from step 9) are selected and clustered with the ppca function of the pcrMethods package in R.
  • Figure 5 shows the principal component analysis used to verify the result for the dog under test.
  • The horizontal and vertical axes show the two most important components.
  • The Greenland Sledge Dog samples lie at the upper left, the Siberian Husky samples at the lower right, and the dog under test in the middle.
  • The dog under test therefore lies between the Greenland Sledge Dog and the Siberian Husky, in agreement with the proportions obtained in step 9); that is, the result of step 9) is verified.
  • Following the above steps, the inventors estimated the breed composition of the dog from the off-machine sequencing data using the method of the present invention, obtained the result within 2 hours, and issued a report to the dog's owner.
  • To check the accuracy of the method, the inventors compared the breed composition calculated in this example with the owner's own account and found the two to be highly consistent.
  • Figure 4 is the three-generation breed family tree inferred from the DNA data of the dog under test, covering great-grandparents, grandparents, and parents.
  • The organism under test was subjected to pedigree analysis according to the method of Example 1.
  • The organism under test was a pet dog that its owner reported to be a poodle.
  • The final pedigree analysis result is shown in Figure 6, a three-generation breed family tree inferred from the DNA data of the dog under test, covering great-grandparents, grandparents, and parents.
  • As shown in Figure 6, the dog under test is 100% Miniature Poodle (the photograph in Figure 6 is of the dog under test).
  • The invention has already been widely used to issue breed composition reports for the pet dogs of internal users at the applicant (Hua Da Gene).
  • The method can deliver a test report within 1 week of receiving the saliva sample, and the data analysis does not require a mainframe: it runs on a personal computer (4 GB memory) in 2 hours per sample.
  • The low-depth genotyping method of the invention effectively genotypes known variant sites of the sample under test from low-depth genome sequencing data, and the resulting genotypes allow effective pedigree analysis of the organism from which the sample originates. Moreover, the method has a low testing cost, a short turnaround, and accurate and reliable results.

Landscapes

  • Chemical & Material Sciences (AREA)
  • Organic Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

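Provided is a method for genotyping based on low-depth genome sequencing. The method comprises: (a) performing low-depth genome sequencing on the whole genome of a sample to be tested, so as to obtain a sequencing result composed of a plurality of sequencing data; (b) for at least one known variant site, constructing a reference sequence set for the known variant site, the reference sequence set containing the variant types of the known variant site and the sequences upstream and downstream of the variant site; (c) aligning the sequencing result obtained in step (a) against the reference sequence set, so as to determine an alignment result for each known variant site, the alignment result including the matched variant types of the sequencing result and the number of matches for each matched variant type; and (d) determining, based on the alignment result, the high-probability variant type of the known variant site.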

Description

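Method and device for genotyping based on low-depth genome sequencing, and uses thereof
Priority Information
Technical Field
The present invention relates to the field of biotechnology, in particular to genotyping and pedigree analysis, and more particularly to a method and a device for genotyping based on low-depth genome sequencing, and uses thereof.
Background Art
Existing breed identification services (also referred to herein as "pedigree analysis"), typified by the canine genetic test offered by Wisdom Panel, use a custom microarray chip to detect the types of given single-base variant sites in a pet dog's DNA, compare them with data from purebred dogs in a database, and report the breed composition of the dog under test.
This prior art is microarray-based. Each chip run requires upwards of a hundred samples, so samples must be accumulated until a batch is full before the chip is run. As a result, the experimental turnaround for any individual sample is long and the cost is high; in other words, the technology cannot deliver test reports to users quickly and cheaply. In addition, chip assays require a relatively high DNA concentration in the sample, so there is a certain probability of sampling failure, that is, the sampling requirements are demanding. This makes the long turnaround and high cost of chip-based testing even harder to overcome.
Accordingly, current genotyping and pedigree analysis techniques still need improvement.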
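Summary of the Invention
The present invention aims to solve at least one of the technical problems in the prior art. To this end, one object of the invention is to provide a genotyping and breed identification technique with low cost and a short testing turnaround.
It should be noted that the present invention was completed on the basis of the following work and findings of the inventors:
The inventors consider that genotyping and pedigree analysis can be performed on the basis of whole-genome sequencing, because as the cost of whole-genome sequencing falls, the cost of sequencing-based genotyping and pedigree analysis will drop below that of chip-based assays. Moreover, a sequencing-based approach does not require accumulating samples into a batch; compared with a chip, it needs less DNA, has a higher sampling success rate, has a shorter experimental turnaround, and can deliver a report quickly. As sequencing costs continue to fall, genotyping and pedigree analysis based on whole-genome sequencing will become cheaper still.
Furthermore, through a series of experimental studies and exploratory work, the inventors surprisingly found that, starting from low-depth whole-genome sequencing data, genotyping and pedigree analysis of a sample of the organism under test can be achieved effectively by deriving genotypes at known variant sites, expressing uncertain genotyping results as probabilities, and then tolerating more missing values when comparing against an existing dog breed database.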
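Thus, in one aspect, the present invention provides a method for genotyping based on low-depth genome sequencing, characterized by comprising: (a) performing low-depth genome sequencing on the whole genome of a sample to be tested, so as to obtain a sequencing result composed of a plurality of sequencing data; (b) for at least one known variant site, constructing a reference sequence set for the known variant site, the reference sequence set containing the variant types of the known variant site and the sequences upstream and downstream of the variant site; (c) aligning the sequencing result obtained in step (a) against the reference sequence set, so as to determine an alignment result for each known variant site, the alignment result including the matched variant types of the sequencing result and the number of matches for each matched variant type; and (d) determining, based on the alignment result, the high-probability variant type of the known variant site. The inventors surprisingly found that with this method the known variant sites of a sample can be genotyped effectively from low-depth genome sequencing data alone, and that the resulting genotypes allow effective pedigree analysis of the organism from which the sample originates. Moreover, the method is low in cost, has a short testing turnaround, and gives accurate and reliable results.
In another aspect, the present invention provides a method for pedigree analysis of an organism. According to embodiments of the invention, the method comprises: (1) using the method described above to perform low-depth genome sequencing on the genome of a sample of the organism to be tested and to genotype at least one known variant site of that organism; and (2) determining the pedigree of the organism on the basis of the genotyping result. According to embodiments of the invention, with this method the known variant sites of the sample can be genotyped from low-depth genome sequencing data alone and the pedigree of the organism determined from them; the method is simple to operate, low in cost, short in turnaround, and accurate and reliable in its results.
In yet another aspect, the present invention provides a device for genotyping based on low-depth genome sequencing. According to embodiments of the invention, the genotyping device comprises: a sequencing unit for performing low-depth genome sequencing on the whole genome of a sample to be tested, so as to obtain a sequencing result composed of a plurality of sequencing data; a reference sequence set construction unit for constructing, for at least one known variant site, a reference sequence set containing the variant types of the known variant site and the sequences upstream and downstream of the variant site; an alignment unit, connected to the sequencing unit and to the reference sequence set construction unit, for receiving the sequencing result from the sequencing unit and aligning it against the reference sequence set, so as to determine an alignment result for each known variant site, the alignment result including the matched variant types of the sequencing result and the number of matches for each matched variant type; and a high-probability variant type determination unit, connected to the alignment unit, for determining the high-probability variant type of the known variant site on the basis of the alignment result. With this device, known variant sites of a sample can be genotyped from low-depth genome sequencing data; the device is convenient to operate, low in cost, short in turnaround, and accurate and reliable in its results.
In still another aspect, the present invention provides a system for pedigree analysis of an organism. According to embodiments of the invention, the system comprises: the genotyping device described above, which uses the low-depth genotyping method described above to perform low-depth genome sequencing on the genome of a sample of the organism to be tested and to genotype at least one known variant site of that organism; and a pedigree determining device, connected to the genotyping device, for determining the pedigree of the organism on the basis of the genotyping result. According to embodiments of the invention, with this system the known variant sites of the sample can be genotyped from low-depth genome sequencing data alone and the pedigree of the organism determined from them; the system is convenient to operate, low in testing cost, short in turnaround, and accurate and reliable in its results.
Additional aspects and advantages of the invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.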
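Brief Description of the Drawings
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken together with the drawings, in which:
Figure 1 is a flow diagram of the method for genotyping based on low-depth genome sequencing according to an embodiment of the invention;
Figure 2 is a structural diagram of the device for genotyping based on low-depth genome sequencing according to an embodiment of the invention;
Figure 3 is a structural diagram of the system for pedigree analysis of an organism according to an embodiment of the invention;
Figure 4 shows the pedigree analysis result for the pet dog tested in Example 1;
Figure 5 shows the principal component analysis used to verify the result for the pet dog tested in Example 1;
Figure 6 shows the pedigree analysis result for the pet dog tested in Example 2.
Detailed Description of the Invention
Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, in which identical or similar reference numerals denote, throughout, identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting it.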
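Genotyping method and device
In one aspect, the present invention provides a method for genotyping based on low-depth genome sequencing. According to embodiments of the invention, with this method the known variant sites of a sample can be genotyped effectively from low-depth genome sequencing data alone, and the resulting genotypes allow effective pedigree analysis of the organism from which the sample originates. Moreover, the method is low in cost, has a short testing turnaround, and gives accurate and reliable results.
According to embodiments of the invention, and with reference to Figure 1, the method comprises the following steps:
(a): Low-depth genome sequencing is performed on the whole genome of the sample to be tested, so as to obtain a sequencing result composed of a plurality of sequencing data.
According to embodiments of the invention, the low-depth genome sequencing may be high-throughput sequencing with a sequencing depth of no more than 5. According to some specific examples, the sequencing depth may be no more than 3.
(b): For at least one known variant site, a reference sequence set is constructed for the known variant site; the reference sequence set contains the variant types of the known variant site and the sequences upstream and downstream of the variant site.
It should be noted that the term "variant type" as used herein is to be understood broadly: it may be any mutated base that differs from the wild type, including but not limited to single nucleotide polymorphisms and insertions and deletions of sequence fragments. Accordingly, according to embodiments of the invention, the known variant sites may include sites known to carry single nucleotide polymorphisms or insertions and deletions of sequence fragments.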
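(c): The sequencing result obtained in step (a) is aligned against the reference sequence set, so as to determine an alignment result for each known variant site; the alignment result includes the matched variant types of the sequencing result and the number of matches for each matched variant type.
According to some embodiments of the invention, before step (c) the sequencing data are split in advance into a plurality of short sequences of equal length. In low-depth genome sequencing, mismatched bases significantly reduce the efficiency of genotyping. For low-depth sequencing, for example at a depth of no more than 3, mismatches therefore need to be avoided as far as possible. The inventors found that the longer a sequencing read is, the more likely it is to contain a mismatched base. Splitting the sequencing data into short sequences thus effectively reduces the chance of a mismatch and improves the efficiency of genotyping from low-depth data. According to some specific examples, the short sequences are no longer than 50 bp. According to other embodiments, the short sequences are preferably 35 bp long. This reduces mismatches caused by overly long sequences, which would otherwise cause reads that could have aligned to the correct position to be filtered out in error.
(d): Based on the alignment result, the high-probability variant type of the known variant site is determined.
In step (c) above, alignment yields the matched variant types of the sequencing result and the number of matches for each. Those skilled in the art will understand that the matched variant types and their match counts are related to the true variant type at the specific site, that is, the known variant site. Once the alignment result is available, the high-probability variant type of the known variant site can therefore be obtained by reasoning backwards from it. In this way, the method according to embodiments of the invention obtains relatively reliable genotyping results from low-depth sequencing.
According to embodiments of the invention, the manner of determining the high-probability variant type from the alignment result, that is, the backward reasoning mentioned above, is not particularly limited.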
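Steps (b) and the read-splitting that precedes step (c) can be illustrated with a minimal Python sketch. The function names, the naming scheme `<site>_<variant>`, and the choice to discard a trailing piece shorter than the requested length are assumptions made for illustration only; the source specifies only that each reference sequence joins the upstream flank, the variant, and the downstream flank, and that reads are split into equal-length short sequences (at most 50 bp, preferably 35 bp).

```python
def build_reference_set(site_id, upstream, downstream, variant_types):
    """Return one reference sequence per variant type of a known site.

    Each sequence is upstream flank + variant + downstream flank.
    The key format "<site>_<variant>" is an illustrative assumption.
    """
    return {f"{site_id}_{v}": upstream + v + downstream for v in variant_types}


def split_read(read, length=35):
    """Split a read into consecutive short sequences of equal length.

    Assumption: a trailing piece shorter than `length` is discarded so
    that every short sequence has exactly the same length.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    return [read[i:i + length] for i in range(0, len(read) - length + 1, length)]


if __name__ == "__main__":
    # Toy flanks of 4 bp instead of the 50 bp used in the examples.
    refs = build_reference_set("site1", "ACGT", "TTGA", ["A", "G"])
    print(refs)
    print(split_read("ACGTACGTAC", length=4))
```

An insertion or deletion of known sequence could be handled by the same helper by passing the inserted fragment (or an empty string) as a variant type.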
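According to embodiments of the invention, step (d) includes determining the high-probability variant type on the basis of a Bayesian model. According to some specific examples, the Bayesian model uses the occurrence probability of a predetermined variant type at a predetermined known variant site as the prior probability, and uses the alignment result obtained in step (c) as the posterior probability. In the phrase "predetermined variant type at a predetermined known variant site", "predetermined" means determined in advance. Specifically, the Bayesian model takes the occurrence probability, determined in advance, of a specific known type at a known variant site as the prior probability, and takes the number of occurrences of that specific variant type obtained by alignment as the posterior probability corresponding to that variant type, from which the high-probability variant type of the variant site can be determined. Specifically,
the formula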
Figure PCTCN2017101128-appb-000001
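is used to determine the high-probability variant type of a specific variant site. The Bayesian model takes the probability of each known type at the known variant site as the prior probability P(A)/P(B). This prior can be determined by statistical analysis of a number of control samples, that is, samples whose type at the variant site is known; alternatively, all types at the site can be assumed equally likely, for example, for an SNP site, a probability of 0.25 each for A, T, G, and C. The alignment result serves as the observation: when one read aligns to the sequence of a given genotype, the probability that the genotype is the base type carried by that read is P(B|A). The posterior probability P(A|B) given by the Bayesian model is taken as the final high-probability variant type of the known variant site.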
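As a hedged illustration of the Bayesian step, the sketch below computes a posterior over diploid genotypes at a biallelic site from per-allele match counts. It rests on assumptions that are not stated in the source: a uniform prior over the three possible genotypes, error-free reads, and each read sampling either chromosome copy with probability 0.5. Under these assumptions, a single read carrying C at an A/C site gives roughly CC 0.67 and AC 0.33, and two reads carrying G at a C/G site give GG 0.8 and CG 0.2, which is consistent with the probabilities listed in the genotype table of Example 1. The function name is hypothetical.

```python
from itertools import combinations_with_replacement


def genotype_posteriors(allele_counts, alleles):
    """Posterior probability of each diploid genotype at one site.

    allele_counts: {allele: number of reads matching that allele}
    alleles: the possible alleles at the site, e.g. ("A", "C")
    Assumptions: uniform prior over genotypes, no sequencing error.
    """
    genotypes = list(combinations_with_replacement(sorted(alleles), 2))
    prior = 1.0 / len(genotypes)
    weighted = {}
    for genotype in genotypes:
        likelihood = 1.0
        for allele, n_reads in allele_counts.items():
            # Chance that one read drawn from this genotype shows `allele`.
            p_read = genotype.count(allele) / 2.0
            likelihood *= p_read ** n_reads
        weighted["".join(genotype)] = prior * likelihood
    total = sum(weighted.values())
    if total == 0:
        return {}  # the observations are impossible under every genotype
    # Bayes: posterior = prior * likelihood / evidence
    return {g: w / total for g, w in weighted.items() if w > 0}


if __name__ == "__main__":
    for genotype, p in genotype_posteriors({"C": 1}, ("A", "C")).items():
        print(genotype, round(p, 2))
```

A prior estimated from control samples, as the text allows, could replace the uniform `prior` without changing the rest of the calculation.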
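In addition, to make it easier to compare a large number of alignment results, the alignment results can be organised into a match-count/variant-type database. The type of database is not particularly limited. According to some embodiments, the match-count/variant-type database may take the form of a hash table in which the variant type is the key and the number of matches is the value. This makes lookups in the database quicker and more convenient, and the results more accurate and reliable.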
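The hash table described above might be populated as in the following sketch. A naive exact substring search stands in for a real short-read aligner, which is an assumption made purely to keep the example self-contained; the requirement that a hit must cover the variant position, the parameter `variant_offset`, and the function name are likewise illustrative.

```python
from collections import Counter


def count_matches(short_reads, reference_set, variant_offset):
    """Count exact hits per reference sequence.

    reference_set: {name: sequence}, one entry per variant type of a site
    variant_offset: 0-based position of the variant in each sequence
    A hit is counted only if the read covers the variant position,
    since a read confined to a flank matches every variant type.
    """
    counts = Counter()  # key: reference name, value: number of matches
    for read in short_reads:
        for name, sequence in reference_set.items():
            start = sequence.find(read)
            while start != -1:
                if start <= variant_offset < start + len(read):
                    counts[name] += 1
                    break
                start = sequence.find(read, start + 1)
    return counts


if __name__ == "__main__":
    refs = {"site1_A": "ACGTATTGA", "site1_G": "ACGTGTTGA"}
    reads = ["GTAT", "GTGT", "ACGT", "GTAT"]
    print(dict(count_matches(reads, refs, variant_offset=4)))
```

The resulting counts are the observations that a genotype model can consume site by site.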
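Correspondingly, in another aspect, the present invention provides a device for genotyping based on low-depth genome sequencing. The device is suited to carrying out the low-depth genotyping method described above. With this device, known variant sites of a sample can be genotyped from low-depth genome sequencing data; the device is convenient to operate, low in cost, short in turnaround, and accurate and reliable in its results.
According to embodiments of the invention, and with reference to Figure 2, the genotyping device 1000 comprises a sequencing unit 100, a reference sequence set construction unit 200, an alignment unit 300, and a high-probability variant type determination unit 400.
According to embodiments of the invention, the sequencing unit 100 performs low-depth genome sequencing on the whole genome of the sample to be tested, so as to obtain a sequencing result composed of a plurality of sequencing data. According to some embodiments, in the sequencing unit 100 the low-depth genome sequencing is high-throughput sequencing with a sequencing depth of no more than 5. According to some embodiments, the sequencing depth is no more than 3.
According to some embodiments of the invention, the reference sequence set construction unit 200 constructs, for at least one known variant site, a reference sequence set containing the variant types of the known variant site and the sequences upstream and downstream of the variant site. According to some embodiments, the known variant sites include sites known to carry single nucleotide polymorphisms or insertions and deletions of sequence fragments.
According to embodiments of the invention, the alignment unit 300 is connected to the sequencing unit 100 and to the reference sequence set construction unit 200. It receives the sequencing result from the sequencing unit 100 and aligns it against the reference sequence set, so as to determine an alignment result for each known variant site; the alignment result includes the matched variant types of the sequencing result and the number of matches for each matched variant type.
According to some embodiments of the invention, the device further comprises a sequence splitting unit (not shown in the figure), connected to the sequencing unit 100 and to the alignment unit 300, which splits the sequencing data into a plurality of short sequences of equal length before the alignment is performed. According to some specific examples, the short sequences are no longer than 50 bp. According to other embodiments, the short sequences are 35 bp long.
According to some embodiments of the invention, the high-probability variant type determination unit 400 is connected to the alignment unit 300 and determines the high-probability variant type of the known variant site on the basis of the alignment result. According to some embodiments, the high-probability variant type determination unit 400 determines the high-probability variant type on the basis of a Bayesian model. Specifically, the Bayesian model takes the occurrence probability, determined in advance, of a specific known type at a known variant site as the prior probability, and takes the number of occurrences of that specific variant type obtained by alignment as the posterior probability corresponding to that variant type, from which the high-probability variant type of the variant site can be determined. Specifically,
the formula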
Figure PCTCN2017101128-appb-000002
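is used to determine the high-probability variant type of a specific variant site. The Bayesian model takes the probability of each known type at the known variant site as the prior probability P(A)/P(B). This prior can be determined by statistical analysis of a number of control samples, that is, samples whose type at the variant site is known; alternatively, all types at the site can be assumed equally likely, for example, for an SNP site, a probability of 0.25 each for A, T, G, and C. The alignment result serves as the observation: when one read aligns to the sequence of a given genotype, the probability that the genotype is the base type carried by that read is P(B|A). The posterior probability P(A|B) given by the Bayesian model is taken as the final high-probability variant type of the known variant site.
In addition, to make it easier to compare a large number of alignment results, the alignment results can be organised into a match-count/variant-type database. The type of database is not particularly limited. According to some embodiments, the match-count/variant-type database may take the form of a hash table in which the variant type is the key and the number of matches is the value. This makes lookups in the database quicker and more convenient, and the results more accurate and reliable.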
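Pedigree analysis method and system
In still another aspect, the present invention provides a method for pedigree analysis of an organism. According to embodiments of the invention, with this method the known variant sites of a sample of the organism under test can be genotyped from low-depth genome sequencing data alone and the pedigree of the organism determined from them; the method is simple to operate, low in cost, short in turnaround, and accurate and reliable in its results.
According to embodiments of the invention, the method comprises: (1) using the method described above to perform low-depth genome sequencing on the genome of a sample of the organism to be tested and to genotype at least one known variant site of that organism; and (2) determining the pedigree of the organism on the basis of the genotyping result.
It should be noted that the kind of organism to which the method applies is not particularly limited: dogs, cats, and even humans can all be subjected to pedigree analysis by the method of the invention. Thus, according to some embodiments, the organism is an animal. According to some embodiments, the animal includes the domestic cat (Felis silvestris catus) and the domestic dog (Canis lupus familiaris).
Those skilled in the art will understand that the term "pedigree analysis" as used herein means determining the blood relationship, origin, lineage, or genealogy of the organism under test, for example a pet such as a domestic cat or dog; for a particular animal, for example, it means determining the breed of its dam or sire and of more distant ancestors.
According to some embodiments of the invention, in step (2) the pedigree of the organism is determined on the basis of predetermined characteristic genotypes of close relatives of the organism.
According to embodiments of the invention, step (2) further comprises:
for at least one known variant site, scoring at least one candidate close relative of the organism on the basis of the high-probability variant type of the organism under test and the known variant type of the at least one candidate close relative, so as to determine a similarity value for each candidate close relative.
Specifically, according to embodiments of the invention, determining the pedigree of the organism further comprises: for each known variant site, comparing the high-probability variant type of the organism under test with the variant types of a plurality of candidate close relatives, and scoring each candidate close relative, so as to determine a similarity value for each. Those skilled in the art will understand that the higher the similarity value, the closer the relationship between the organism under test and the candidate close relative. It should be noted that the similarity value is the same as the feature value in the examples; the two terms are used interchangeably herein, and both denote the degree of relatedness between the pet dog under test and each candidate breed.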
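The scoring described above can be sketched as follows, under simplifying assumptions: only the single highest-probability genotype at each site is compared (the probabilities themselves are ignored), genotypes are compared irrespective of allele order, sites missing from a reference sample are skipped, and the total is averaged over the samples of a breed as in the worked example. All names are hypothetical.

```python
def normalise(genotype):
    """Treat 'GA' and 'AG' as the same genotype."""
    return "".join(sorted(genotype))


def breed_similarity(test_calls, breed_samples):
    """Average number of matching sites between a test animal and a breed.

    test_calls: {snp_id: genotype} for the animal under test
    breed_samples: list of {snp_id: genotype}, one dict per reference sample
    """
    if not breed_samples:
        return 0.0
    matches = 0
    for sample in breed_samples:
        for snp_id, genotype in test_calls.items():
            reference = sample.get(snp_id)
            # Missing genotypes are tolerated: they simply add nothing.
            if reference is not None and normalise(reference) == normalise(genotype):
                matches += 1
    return matches / len(breed_samples)


if __name__ == "__main__":
    test = {"s1": "CC", "s2": "AG", "s3": "GG"}
    breed = [{"s1": "CC", "s2": "GA", "s3": "AG"}, {"s1": "CC", "s2": "AA"}]
    print(breed_similarity(test, breed))
```

Computing this value once per breed in the background database gives the per-breed similarity (feature) values that the later classification step uses.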
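According to embodiments of the invention, step (2) further comprises:
dividing at least part of the genome sequence of the organism under test into a plurality of windows, each of which contains at least one known variant site; and
classifying at least some of the windows on the basis of the similarity values of the candidate close relatives, so as to determine the candidate close-relative source of at least some of the windows.
Specifically, according to embodiments of the invention, determining the pedigree of the organism further comprises: dividing the DNA sequence of the organism under test into a plurality of windows of approximately equal length, each containing at least one known variant site; and classifying each of the resulting equal-length windows on the basis of the similarity values of the candidate close relatives, so as to determine the close-relative source of each window. It should be noted that the method used to classify windows from the similarity values is not particularly limited; it can be carried out with models including, but not limited to, random forests, support vector machines, and naive Bayes, using the classification methods of the party and libsvm libraries in R. The preferred classification method is a random forest model. A random forest is a classification model that combines decision trees to obtain better results: multiple decision trees are built, each tree classifies the sample according to the weight of each node together with the input feature values, and the classifications of the individual trees are then combined to give the classification of the random forest model. Thus, according to embodiments of the invention, the genome sequence is divided into segments of equal length, and each segment is classified using the base types of its variant sites, for example its SNP genotypes, as feature values; the window can then be assigned to a particular breed, that is, the DNA sequence of the window is deemed to originate from that breed.
It should be noted that "windows of equal length" as described herein should allow a certain deviation in length, for example a variation of 1 to 10% either way. According to embodiments of the invention, the windows can be delimited as follows:
For the N SNP sites to be tested on chromosome 1, denoted S1, S2, S3, ..., Sn, the distance from S1 to S2 is denoted D1 and the distance from S2 to S3 is denoted D2. Given a fixed window size X, the largest number of SNP sites satisfying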
Figure PCTCN2017101128-appb-000003
, namely S1, S2, ..., Sa, are assigned to one window, numbered 1. Then, by the same rule, the largest number of SNP sites satisfying
Figure PCTCN2017101128-appb-000004
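, namely Sa+1, Sa+2, ..., Sb, are assigned to another window, numbered 2. Continuing in the same way, once chromosome 1 has been cut into windows, chromosome 2 is cut by the same rule, and so on until all autosomes have been cut into windows.
The specific value of X depends on the species being tested; it may be 1% of the total autosomal length of the whole genome sequence of the species under test, for example the dog.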
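The window-cutting rule can be sketched as a greedy pass over the sorted SNP positions of one chromosome. The formulas giving the exact condition are only available as images in the source, so the condition used here, that the summed distances between consecutive sites in a window (equivalently, last position minus first position) must not exceed X, is an assumption, as are the function and parameter names.

```python
def partition_windows(positions, window_size):
    """Greedily group SNP positions on one chromosome into windows.

    A window is extended while the span from its first site to the
    candidate site stays within `window_size` (assumed condition).
    Returns a list of windows, each a list of positions.
    """
    windows = []
    current = []
    for position in sorted(positions):
        if current and position - current[0] > window_size:
            windows.append(current)  # close the window and start the next
            current = []
        current.append(position)
    if current:
        windows.append(current)
    return windows


if __name__ == "__main__":
    # Toy coordinates; in the worked example X is about 21 Mbp.
    print(partition_windows([1, 5, 9, 20, 22, 40], window_size=10))
```

Running the function once per autosome and numbering the windows consecutively reproduces the chromosome-by-chromosome procedure described in the text.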
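After the likely close-relative source of each window has been obtained, the method according to embodiments of the invention further comprises: determining the distance spanned on the genome sequence of the organism under test by the known variant sites assigned to each close-relative source, and determining the corresponding pedigree weight of each close-relative source from that distance.
According to embodiments of the invention, step (2) may preferably comprise:
for each candidate parental source, determining the distance spanned on the genome sequence of the organism under test by the known variant sites corresponding to that candidate parental source; and
determining the pedigree weight of each candidate parental source on the basis of that distance.
According to embodiments of the invention, after the pedigree weight of each close-relative source has been determined, the method further comprises: obtaining the breed composition of the organism under test by weighted calculation; and verifying the resulting breed composition by a cluster analysis method, so as to determine the pedigree of the organism under test. According to some specific examples, the cluster analysis method is principal component analysis. Principal component analysis is a commonly used dimensionality-reduction method. It finds the linear combinations of the original variables that have the largest variance and projects the original data onto these new coordinate axes, so that the reduced data retain as much of the original information as possible. According to specific embodiments, the principal component analysis can be performed with the ppca function of the pcrMethods package in R.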
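According to some specific examples of the invention, the method for pedigree analysis of an organism may comprise the following steps:
1) Low-depth genome sequencing is performed on the whole genome of the sample to be tested. Reads obtained from the second-generation sequencing platform that are longer than 50 bp are cut, in order from front to back, into several short sequences of equal length, and the short sequences obtained in this way are collected into a new file, called cut-read.
2) The gene-chip data to be tested are located on the website (https://www.illumina.com), and the specified list of single-base variants to be tested is downloaded together with the reference sequences flanking each variant. In the manner described in detail in the examples, a reference sequence is generated for each genotype at each site to be tested; this file is called the SNP-index. The detectable variants here are not limited to single-base mutations; they also include short insertions and deletions whose variant sequence is known with certainty.
3) SOAPaligner2 is downloaded from the website (http://soap.genomics.org.cn), and the /2bwt-builder command is run with the SNP-index file from step 2) as input to build the data structures needed for alignment.
4) The soap command is used to align the cut-read from step 1) against the SNP-index reference sequences, with the parameters "-v 0 -M 0 -r 0".
5) From the alignment results of step 4), a hash table is built with the name of each aligned SNP-index entry as the key and its number of occurrences as the value; the table is updated while traversing the alignment results, giving the number of times each SNP-index entry was hit.
6) Assuming that the paternal and maternal strands are equally likely to be detected during sequencing, Bayes' formula is applied to the hash table from step 5). The probability of each known type at the known variant site is taken as the prior probability; here all types at the site are assumed equally likely, and this value is called P(A)/P(B). The alignment result from step 4) serves as the observation: when one read aligns to the sequence of a given genotype, the probability that the genotype is the base type carried by that read is P(B|A). The posterior probability P(A|B) given by the Bayesian model is taken as the final high-probability variant type of the known variant site. The formula of the Bayesian model thus gives the possible single-base genotyping result at each site for different depths.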
Figure PCTCN2017101128-appb-000005
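7) The genotypes obtained in step 6) are compared with the single-base genotyping results of samples of different breeds in the background database, and a feature value is obtained for each candidate breed from the expected number of identical sites. "The expected number of identical sites" means the following: if the genotyping results agree, the feature value of that breed is increased by one; if they differ, the feature value is unchanged; the accumulated value is then divided by the number of samples of that breed, giving the average feature value of each breed.
8) Following the order in which the single-base variant sites occur on each chromosome, the DNA of the organism under test is divided into a plurality of windows of equal length, each containing at least one single-base variant site.
For the N SNP sites to be tested on chromosome 1, denoted S1, S2, S3, ..., Sn, the distance from S1 to S2 is denoted D1 and the distance from S2 to S3 is denoted D2. Given a fixed window size X, the largest number of SNP sites satisfying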
Figure PCTCN2017101128-appb-000006
, namely S1, S2, ..., Sa, are assigned to one window, numbered 1. Then, by the same rule, the largest number of SNP sites satisfying
Figure PCTCN2017101128-appb-000007
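, namely Sa+1, Sa+2, ..., Sb, are assigned to another window, numbered 2. Continuing in the same way, once chromosome 1 has been cut into windows, chromosome 2 is cut by the same rule, and so on until all autosomes have been cut into windows.
The specific value of X depends on the species being tested; here it is 1% of the total autosomal length of the whole genome sequence of the dog.
9) For each window obtained in step 8), the feature values obtained for the different breeds in step 7) are used with a model including, but not limited to, a random forest, a support vector machine, or naive Bayes, through the classification methods of the party and libsvm libraries in R, to classify the DNA segment in each window. The result of the classification is the likely breed of that DNA segment, and the basis for the classification is the known feature value of purebred dogs of that breed from step 7).
The classification results of the windows are recorded as b1, b2, ..., bn, where each classification result corresponds to one dog breed, and the final formula for estimating the breed composition is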
Figure PCTCN2017101128-appb-000008
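that is, the classification results of all segments are summed to give the total for each breed.
It should be pointed out, as those skilled in the art will understand, that in "for the N SNP sites to be tested on chromosome 1, denoted S1, S2, S3, ..., Sn" and "the classification results of the windows are recorded as b1, b2, ..., bn" the index n has different meanings: in "Sn", n indexes the SNP sites to be tested, whereas in "bn", n indexes the windows, and "bn" denotes the classification result of the DNA in the window with that index.
According to embodiments of the invention, the method further comprises calculating the breed composition of the organism under test from the results for the different windows obtained in step 8), weighted by the length of the DNA sequence that each window represents, so that the pedigree of the organism under test is determined from the proportions of its breed components.
For a window containing the SNP sites Sa, Sa+1, ..., Sb,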
Figure PCTCN2017101128-appb-000009
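is taken as the weight of the window, where WG is the total number of autosomal bases in the dog genome. For each classification window described in step 8), step 8) yields a classification result; the classification results of the windows are recorded as b1, b2, ..., bn, where each classification result corresponds to one dog breed, and the final formula for estimating the breed composition is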
Figure PCTCN2017101128-appb-000010
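10) The result obtained in step 9) is verified by principal component analysis or another clustering method.
Specifically, the few most abundant breeds from step 9) are selected and clustered by principal component analysis or another clustering method. From the clustering result, the average distance between the sample under test and the samples of each breed is calculated; if the breed closest to the sample under test is the main breed obtained in step 9), the reliability of the result of step 9) is confirmed.
According to some specific examples of the invention, step 7) is carried out as follows: for every site detected, the genotyping result of the organism under test at that site from step 6) is compared one by one with each sample of each breed in the background database, giving the similarity (the "feature value" above) between the organism under test and each breed. Specifically, the genotyping result of the organism under test at the site is compared with the genotyping result at that site of each of the several samples of a breed in the background database; if the organism under test and a database sample agree at the site, the similarity (the "feature value") between the sample under test and that breed is increased by one, and the comparison results over the several samples of the same breed are averaged (weighted average) to give the similarity (the "feature value") of that breed.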
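The length-weighted breed composition of step 9) can be sketched as below. The weight of a window is taken as its length divided by the autosomal genome length WG, as the text states; renormalising the totals so that the reported fractions sum to 1 even when the windows do not cover the whole genome is an assumption added for illustration, as are the names.

```python
from collections import defaultdict


def breed_composition(window_calls, genome_length):
    """Combine per-window breed labels into breed fractions.

    window_calls: list of (breed_label, window_length_in_bp)
    genome_length: total autosomal length WG in bp
    """
    totals = defaultdict(float)
    for breed, length in window_calls:
        totals[breed] += length / genome_length  # window weight w_i
    covered = sum(totals.values())
    if covered == 0:
        return {}
    # Assumption: rescale so that the fractions add up to 1.
    return {breed: weight / covered for breed, weight in totals.items()}


if __name__ == "__main__":
    # Three toy windows of 20 Mbp each on a hypothetical 100 Mbp genome.
    calls = [("breed_x", 20e6), ("breed_y", 20e6), ("breed_x", 20e6)]
    for breed, fraction in breed_composition(calls, 100e6).items():
        print(breed, round(fraction, 3))
```

With windows of nearly equal length the result is close to a simple vote count, but the weighting matters when window lengths deviate, which the text explicitly allows.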
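It should be noted that the pedigree analysis method of the present invention quickly and accurately derives genotyping results for the relevant sites from low-depth, second-generation whole-genome sequencing data. Because the average sequencing depth is only 1x to 2x, it is impossible to know in advance which candidate single-base variant sites will be covered, and a definitive genotype cannot be called at every site. The pedigree of the organism under test can nevertheless be determined effectively by expressing uncertain genotyping results as probabilities and by tolerating more missing values when comparing against the existing breed database for the organism. (There is no clear-cut answer to how many missing values can be tolerated: as the proportion of missing values in the data increases, the accuracy of breed determination decreases. Based on current experience, the number of detected SNP sites should be no less than 25% of the total.) The application prospects are broad. For example, the method can provide a pedigree certificate for a purebred pet dog, a certificate of direct kinship between two dogs, or a determination of whether two samples come from the same dog (a genetic identity card for a pet dog or cat); for a mixed-breed dog it can give quantitative ancestry proportions and an inferred breed tree covering up to three generations.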
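The 25% guideline mentioned above lends itself to a simple pre-check before breed assignment. The sketch below is illustrative only: the function names are hypothetical, and treating a site as "detected" whenever it has any genotype call is an assumption.

```python
def detected_fraction(genotype_calls, total_sites):
    """Fraction of panel sites for which a genotype could be called."""
    if total_sites <= 0:
        raise ValueError("total_sites must be positive")
    return len(genotype_calls) / total_sites


def enough_sites(genotype_calls, total_sites, minimum_fraction=0.25):
    """True if at least `minimum_fraction` of the sites were detected."""
    return detected_fraction(genotype_calls, total_sites) >= minimum_fraction


if __name__ == "__main__":
    calls = {"s1": "CC", "s2": "AG", "s3": "GG"}
    print(detected_fraction(calls, total_sites=10))
    print(enough_sites(calls, total_sites=10))
```

A sample that fails the check would be a candidate for resequencing rather than for breed assignment.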
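In yet another aspect, the present invention provides a system for pedigree analysis of an organism. According to embodiments of the invention, with this system the known variant sites of a sample of the organism under test can be genotyped from low-depth genome sequencing data alone and the pedigree of the organism determined from them; the system is convenient to operate, low in testing cost, short in turnaround, and accurate and reliable in its results.
According to embodiments of the invention, and with reference to Figure 3, the system 10000 comprises a genotyping device 1000 and a pedigree determining device 2000.
According to embodiments of the invention, the genotyping device 1000 uses the low-depth genotyping method described above to perform low-depth genome sequencing on the genome of a sample of the organism to be tested and to genotype at least one known variant site of that organism. According to some embodiments, the organism is an animal. According to some specific examples, the animal includes the domestic cat (Felis silvestris catus) and the domestic dog (Canis lupus familiaris).
According to some embodiments of the invention, the pedigree determining device 2000 is connected to the genotyping device 1000 and determines the pedigree of the organism on the basis of the genotyping result. According to some embodiments, in the pedigree determining device 2000 the pedigree of the organism is determined on the basis of predetermined characteristic genotypes of close relatives of the organism.
According to some embodiments of the invention, the pedigree determining device 2000 further comprises a similarity value determining unit which, for each known variant site, compares the high-probability variant type of the organism under test with the variant types of a plurality of candidate close relatives and scores each candidate close relative, so as to determine a similarity value for each.
According to some embodiments of the invention, the pedigree determining device 2000 further comprises a close-relative source determining unit, which divides the DNA sequence of the organism under test into a plurality of windows of approximately equal length, each containing at least one known variant site, and classifies each of the resulting equal-length windows on the basis of the similarity values of the candidate close relatives, so as to determine the close-relative source of each window. It should be noted that the method used to classify windows from the similarity values is not particularly limited; it includes, but is not limited to, random forests, support vector machines, and naive Bayes, and can be carried out with the classification methods of the party and libsvm libraries in R. The preferred classification method is a random forest model. A random forest is a classification model that combines decision trees to obtain better results: multiple decision trees are built, each tree classifies the sample according to the weight of each node together with the input feature values, and the classifications of the individual trees are then combined to give the classification of the random forest model. Thus, according to embodiments of the invention, the genome sequence is divided into segments of equal length, and each segment is classified using the base types of its variant sites, for example its SNP genotypes, as feature values; the window can then be assigned to a particular breed, that is, the DNA sequence of the window is deemed to originate from that breed.
According to some embodiments of the invention, the pedigree determining device 2000 further comprises a pedigree weight determining unit, which determines the distance spanned on the genome sequence of the organism under test by the known variant sites assigned to each close-relative source, and determines the corresponding pedigree weight of each close-relative source from that distance.
According to some embodiments of the invention, the pedigree determining device 2000 further comprises a pedigree determining unit, which performs principal component analysis on the pedigree weights of the close-relative sources so as to determine the pedigree of the organism under test. Principal component analysis is a commonly used dimensionality-reduction method. It finds the linear combinations of the original variables that have the largest variance and projects the original data onto these new coordinate axes, so that the reduced data retain as much of the original information as possible. According to specific embodiments, the principal component analysis can be performed with the ppca function of the pcrMethods package in R.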
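It should be noted that the method and device for genotyping based on low-depth genome sequencing, and their applications, have at least one of the following advantages:
1. The pedigree analysis method of the invention derives breed composition from low-depth sequencing data and thereby achieves pedigree analysis.
2. The low-depth genotyping method of the invention uses low-depth whole-genome data to estimate genotypes at known variant sites (for example, single-base variant sites), whereas conventional variant-calling software such as GATK cannot produce proper results at such low depth. Moreover, because the invention builds reference sequences from the sequence on either side of each site to be tested, it obtains accurate single-base genotyping results in one fifth of the time and with one tenth of the memory of the conventional approach.
3. The pedigree analysis method of the invention gives a quantitative estimate of ancestry composition. Similar calculations could in future be used to determine the breed composition of other pets such as cats and birds and of livestock such as cattle and chickens, and could also be used to determine human ancestry composition.
The solution of the invention is explained below with reference to examples. Those skilled in the art will understand that the following examples serve only to illustrate the invention and are not to be regarded as limiting its scope. Where specific techniques or conditions are not indicated in the examples, the techniques or conditions described in the literature of the field are followed (for example, J. Sambrook et al., Molecular Cloning: A Laboratory Manual, third edition, Chinese translation by Huang Peitang et al., Science Press), or the product instructions are followed. Reagents and instruments for which no manufacturer is indicated are conventional, commercially available products.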
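Example 1:
Referring to Figure 1, the organism under test was genotyped with the method of the present invention and then subjected to pedigree analysis.
The organism under test was a pet dog that its owner reported to be a Siberian Husky. The sample was saliva, collected non-invasively from the dog with a PG-100 saliva sampler.
The specific steps were as follows:
1) Low-depth genome sequencing was performed on the whole genome of the sample on the BGI-seq500 sequencing platform. Specifically, DNA was extracted from the saliva, the whole-genome DNA was amplified by an enzymatic digestion method, and a library was constructed. Low-depth whole-genome sequencing was then carried out on the BGI-seq 500 at a depth of 2x to 3x. The reads obtained from the second-generation sequencing platform were cut, in order from front to back, into 50 bp short sequences, and the short sequences obtained in this way were collected into a new file, called cut-read.
2) The Illumina Canine HD gene-chip data were located on the website (ftp://webdata2:webdata2@ussd-ftp.illumina.com/downloads/ProductFiles/CanineHD/CanineHD_B.csv) and the file at that link was downloaded; this file is the list of single-base variants to be tested together with the reference sequences flanking each variant.
The 50 bp sequence upstream of the variant site, the base type at the variant site, and the 50 bp sequence downstream of the variant site were concatenated in that order to give the reference sequence for that variant type at that site; this file is called the SNP-index. Because each variant site has two possible base types, two reference sequences were constructed for each site by this rule, one per base type, and each was named after its SNP site number and base type. Shown below are the SNP-index entries for bases A and G at the site numbered BICF2G630100019.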
Figure PCTCN2017101128-appb-000011
Figure PCTCN2017101128-appb-000012
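3) SOAPaligner2 was downloaded from the website (http://soap.genomics.org.cn), and the /2bwt-builder command was run with the SNP-index file from step 2) as input to build the 13 index files needed for alignment, with the suffixes *.amb, *.ann, *.bwt, *.fmv, *.hot, *.lkt, *.pac, *.rev.bwt, *.rev.fmv, *.rev.lkt, *.rev.pac, *.sa, and *.sai.
4) The soap command was used to align the cut-read from step 1) against the SNP-index reference sequences, with the parameters "-v 0 -M 0 -r 0".
5) From the alignment results of step 4), a hash table was built with the name of each aligned SNP-index entry as the key and its number of occurrences as the value; the table was updated while traversing the alignment results, giving the number of times each SNP-index entry was hit. The result of this step is the table below, which contains about 160,000 rows, of which only the first three are listed. Each row represents one event in which a cut-read from step 1) aligned to an SNP-index entry from step 3); the first column is the SNP number and the second column is the base carried by that read:
SNP number         Base
BICF2S23657714     C
BICF2G630130992    G
BICF2G630708586    G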
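6) Assuming that the paternal and maternal strands are equally likely to be detected during sequencing, Bayes' formula was applied to the hash table from step 5) to give the possible single-base genotyping result at each site for different depths.
All types at the variant site are assumed equally likely, and this value is called P(A)/P(B). The alignment result for the site from step 4) is treated as the observation, and the probability that the genotype is the base type carried by a given read is called P(B|A). The Bayes formula below then gives the posterior probability P(A|B), which is taken as the possible genotype value at the site.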
Figure PCTCN2017101128-appb-000013
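The result of this step is the table below, which contains about 160,000 rows, of which only the first three are listed. The first column is the SNP ID, the second column is one possible genotype, the third column is the probability of that genotype, the fourth column is the other possible genotype at the site, and the fifth column is the probability of that other genotype:
SNP ID             Genotype 1    Probability 1    Genotype 2    Probability 2
BICF2S23657714     CC            0.67             AC            0.33
BICF2G630130992    GG            0.8              GC            0.2
BICF2G630708586    GG            0.67             AG            0.33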
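7) The genotypes obtained in step 6) were compared with the single-base genotyping results of dogs of different breeds in the background database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90441), and one feature value was obtained for each candidate breed from the expected number of identical sites. For each SNP site to be tested, the samples in the background database were compared one by one with the genotype obtained at that site in step 6). If the genotypes agreed, the similarity between the sample under test and that breed (that is, the feature value referred to here) was increased by one. Every sample of every breed in the background database was compared in turn; after all samples of a breed had been compared, the total was divided by the number of samples of that breed in the database, giving the similarity, that is, the feature value, of each breed.
The result of this step is the table below, which contains 70 rows, of which only the first four are listed. The first column is the feature value and the second column is the corresponding breed:
Feature value       Breed
94607.4             Siberian Husky (SiberianHusky)
89423.8028571428    Greenland Sledge Dog (GreenlandSledgeDog)
89404.921           Alaskan Malamute (AlaskanMalamute)
89399.9492857142    Chihuahua (Chihuahua)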
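8) Following the order in which the single-base variant sites occur on each chromosome, the DNA of the organism under test was divided into a plurality of windows of equal length, each containing several single-base variant sites.
For the N SNP sites to be tested on chromosome 1, denoted S1, S2, S3, ..., Sn, the distance from S1 to S2 is denoted D1 and the distance from S2 to S3 is denoted D2. Given a fixed window size X, the largest number of SNP sites satisfying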
Figure PCTCN2017101128-appb-000014
, namely S1, S2, ..., Sa, are assigned to one window, numbered 1. Then, by the same rule, the largest number of SNP sites satisfying
Figure PCTCN2017101128-appb-000015
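, namely Sa+1, Sa+2, ..., Sb, are assigned to another window, numbered 2. Continuing in the same way, once chromosome 1 had been cut into windows, chromosome 2 was cut by the same rule, and so on until all autosomes had been cut, giving 100 windows in total, numbered 1, 2, 3, ..., 100.
X is 1% of the total length of the autosomes in the dog genome, that is, 21 Mbp.
9) Using the results for the different windows obtained in step 8), the lengths of the DNA sequences that the windows represent, and the feature values obtained for the different breeds in step 7), the DNA segment in each window was classified with the random forest model of the party and libsvm libraries in R. The result of the classification is the likely breed of that DNA segment, and the basis for the classification is the known feature value of purebred dogs of that breed from step 7). The classification results of the windows are recorded as b1, b2, ..., b100; each label comes from the classification given by the random forest model in this step. Here each of b1, b2, ... corresponds to one dog breed, and w1, w2, ..., wn are the window weights, that is, the ratio of the length of the sequence covered by a window to the total sequence length. The final formula for estimating the breed composition is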
Figure PCTCN2017101128-appb-000016
where wi is calculated as
Figure PCTCN2017101128-appb-000017
对于每一个窗口,该公式计算窗口内包含的DNA序列的长度,WG是狗的全基因组内常染色体的碱基总数。通过对每一个窗口的分类结果按照窗口在染色体上的长度进行加权平均,最终得到各个品种的分类结果的总和。
经过对各窗口分类结果的加权平均计算,待测生物体的血统为:61%西伯利亚哈士奇+39%格陵兰雪橇犬(见图4)。如图4所示,待测宠物狗的祖源成分的具体比例为:61%西伯利亚哈士奇和39%格陵兰雪橇犬(图4中的照片为该待测宠物狗的照片)。
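10) Principal component analysis is used to verify the result obtained in step 9).
Principal component analysis is a common dimensionality-reduction method that is implemented in many programming languages and can produce results directly from the input data. It is carried out in the following steps: 1) compute the covariance matrix of the input matrix; 2) compute the eigenvalues and eigenvectors of that covariance matrix; 3) select the two eigenvectors with the largest eigenvalues; 4) project the input matrix onto these eigenvectors.
Specifically, the five breeds with the largest shares in the result of step 9) are selected and clustered with the ppca function of the pcaMethods package in R.
Figure 5 shows the principal component analysis used to verify the result for the tested dog. The horizontal and vertical axes are the two leading components; the Greenland Sledge Dog samples lie at the upper left, the Siberian Husky samples at the lower right, and the tested dog in the middle. The tested dog lies between the Greenland Sledge Dog and the Siberian Husky, which agrees with the proportions obtained in step 9). In other words, the result of step 9) is verified as accurate.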
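The four PCA steps listed above can be sketched in plain Python as follows. This is ordinary PCA computed by power iteration with deflation, chosen so the example needs no external library; it is not the probabilistic PCA performed by the ppca function named in the text. The function name `top_principal_components` and its parameters are hypothetical, and the input must have at least as many columns as requested components.

```python
def top_principal_components(rows, n_components=2, iterations=200):
    """Project rows onto their leading principal components.

    rows: list of equal-length numeric lists (samples x features).
    Returns one list of n_components scores per input row.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]

    # Step 1: covariance matrix of the centered input.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    components = []
    for _ in range(n_components):
        # Deterministic start vector, made orthogonal to earlier components.
        v = [1.0 / (j + 1) for j in range(d)]
        for comp in components:
            dot = sum(a * b for a, b in zip(v, comp))
            v = [a - dot * b for a, b in zip(v, comp)]
        norm = sum(a * a for a in v) ** 0.5
        v = [a / norm for a in v]

        # Step 2: power iteration converges to the dominant eigenvector.
        for _step in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(a * a for a in w) ** 0.5
            if norm < 1e-12:  # no variance left in the deflated matrix
                break
            v = [a / norm for a in w]
        eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))

        # Step 3: keep this eigenvector, then deflate to expose the next one.
        components.append(v)
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]

    # Step 4: project the centered input onto the selected eigenvectors.
    return [[sum(a * b for a, b in zip(c, comp)) for comp in components]
            for c in centered]
```

In a check like the one in Figure 5, a sample that is a mixture of two breeds would be expected to fall between the two breed clusters along the leading component.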
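Following the above steps, the inventors applied the method of the present invention to the raw sequencer output to estimate the breed composition of the pet dog, obtained the result within 2 hours, and issued a report to the dog's owner.
Further, to verify the accuracy of the method, the inventors compared the breed composition calculated in this example with the owner's own account and found the two to be highly consistent.
Specifically, the raw reads obtained from sequencing totaled 5.5 Gbp, about 2.3x. The pedigree (ancestral composition) analysis result is shown in Figure 4, a three-generation breed pedigree chart inferred from the tested pet dog's DNA data, covering the great-grandparent, grandparent, and parent generations.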
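Example 2:
Pedigree analysis was performed on the organism to be tested according to the method of Example 1.
The organism to be tested was a pet dog named 维妮, described by its owner as a poodle. The final pedigree analysis result is shown in Figure 6, a three-generation breed pedigree chart inferred from the tested pet dog's DNA data, covering the great-grandparent, grandparent, and parent generations. As shown in Figure 6, the tested pet dog is 100% Miniature Poodle (the photograph in Figure 6 is of the tested pet dog).
In addition, the present invention has been widely used to issue breed composition reports for the pet dogs of internal users at the applicant (BGI, 华大基因). So far 48 reports have been issued, covering both purebred and mixed-breed dogs. It should be emphasized that, based on this practice, the method can deliver a test report within 1 week of receiving a dog's saliva sample, and that the data analysis does not require a mainframe: it can be completed on a personal computer (4 GB of memory) in 2 hours per sample.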
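Industrial Applicability
The method of the present invention for genotyping based on low-depth genome sequencing can effectively genotype the known variant sites of a sample to be tested using only low-depth genome sequencing data, and the resulting genotypes can in turn be used to effectively perform pedigree analysis on the organism from which the sample originates. Moreover, the method has a low detection cost and a short detection cycle, and its results are accurate and reliable.
Although specific embodiments of the present invention have been described in detail, those skilled in the art will understand that, in light of all the teachings disclosed, various modifications and substitutions may be made to those details, and that all such changes fall within the scope of protection of the present invention. The full scope of the present invention is given by the appended claims and any equivalents thereof.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "illustrative embodiment", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such illustrative uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.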

Claims (40)

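  1. A method for genotyping based on low-depth genome sequencing, characterized by comprising:
    (a) performing low-depth genome sequencing on the whole genome of a sample to be tested, so as to obtain a sequencing result consisting of a plurality of sequencing data;
    (b) for at least one known variant site, constructing a reference sequence set for the known variant site, the reference sequence set containing the variant types of the known variant site and the upstream and downstream sequences of the variant site;
    (c) aligning the sequencing result obtained in step (a) against the reference sequence set, so as to determine an alignment result for each known variant site, the alignment result comprising the matched variant types of the sequencing result and the number of matches of each matched variant type; and
    (d) determining, based on the alignment result, the high-probability variant type of the known variant site.
  2. The method according to claim 1, characterized in that the low-depth genome sequencing is high-throughput sequencing with a sequencing depth of no more than 5.
  3. The method according to claim 2, characterized in that the sequencing depth is no more than 3.
  4. The method according to claim 1, characterized in that, before step (c) is performed, the sequencing data are split in advance into a plurality of short sequences of equal length.
  5. The method according to claim 4, characterized in that the length of the short sequences is no more than 50 bp.
  6. The method according to claim 5, characterized in that the length of the short sequences is 35 bp.
  7. The method according to claim 1, characterized in that the known variant sites include sites known to have single nucleotide polymorphisms, fragment insertions, and fragment deletions.
  8. The method according to claim 1, characterized in that, in step (d), the high-probability variant type is determined based on a Bayesian model.
  9. The method according to claim 8, characterized in that the Bayesian model uses the predetermined probability of occurrence of a predetermined variant type at a predetermined known variant site as the prior probability, and uses the alignment result obtained in step (c) as the posterior probability.
  10. The method according to claim 8, characterized by further comprising constructing the alignment result as a hash table, in which the variant type is the key and the number of matches is the value.
  11. A method for performing pedigree analysis on an organism, characterized by comprising:
    (1) using the method according to any one of claims 1 to 10 to perform low-depth genome sequencing on the genome of a sample of the organism to be tested, and to genotype at least one known variant site of the organism to be tested;
    (2) determining the pedigree of the organism based on the result of the genotyping.
  12. The method according to claim 11, characterized in that the organism is an animal.
  13. The method according to claim 12, characterized in that the animal includes a domestic cat or a domestic dog.
  14. The method according to claim 11, characterized in that, in step (2), the pedigree of the organism is determined based on predetermined characteristic genotypes of close relatives of the organism.
  15. The method according to claim 14, characterized in that step (2) further comprises:
    for at least one of the known variant sites, scoring at least one candidate close relative of the organism based on the high-probability variant type of the organism to be tested and the known variant type of at least one candidate close relative of the organism, so as to determine a similarity value for each candidate close relative of the organism.
  16. The method according to claim 15, characterized in that step (2) further comprises:
    dividing at least a part of the genome sequence of the organism to be tested into a plurality of windows, each of the plurality of windows containing at least one of the known variant sites; and
    classifying at least a part of the plurality of windows based on the similarity values of the candidate close relatives of the organism, so as to determine the candidate close-relative source corresponding to at least a part of the plurality of windows.
  17. The method according to claim 16, characterized in that the classification is performed by at least one of a random forest model, a support vector machine, and naive Bayes.
  18. The method according to claim 16, characterized in that step (2) further comprises:
    for each candidate parental source, determining the distance, on the genome sequence of the organism to be tested, of the known variant sites corresponding to the candidate parental source;
    determining the pedigree weight of each candidate parental source based on the distance.
  19. The method according to claim 18, characterized in that step (2) further comprises: determining the pedigree of the organism to be tested based on the pedigree weights of the candidate parental sources.
  20. The method according to claim 19, characterized in that, after the pedigree weight of each close-relative source is determined, the method further comprises:
    obtaining the breed composition of the organism to be tested by weighted calculation, and verifying the resulting breed composition of the organism to be tested by a cluster analysis method, so as to determine the pedigree of the organism to be tested based on the pedigree weights of the candidate parental sources.
  21. A device for genotyping based on low-depth genome sequencing, characterized by comprising:
    a sequencing unit, configured to perform low-depth genome sequencing on the whole genome of a sample to be tested, so as to obtain a sequencing result consisting of a plurality of sequencing data;
    a reference sequence set construction unit, configured to construct, for at least one known variant site, a reference sequence set for the known variant site, the reference sequence set containing the variant types of the known variant site and the upstream and downstream sequences of the variant site;
    an alignment unit, connected to the sequencing unit and to the reference sequence set construction unit, configured to receive the sequencing result from the sequencing unit and to align the sequencing result against the reference sequence set, so as to determine an alignment result for each known variant site, the alignment result comprising the matched variant types of the sequencing result and the number of matches of each matched variant type; and
    a high-probability variant type determination unit, connected to the alignment unit, configured to determine, based on the alignment result, the high-probability variant type of the known variant site.
  22. The device according to claim 21, characterized in that, in the sequencing unit, the low-depth genome sequencing is high-throughput sequencing with a sequencing depth of no more than 5.
  23. The device according to claim 22, characterized in that the sequencing depth is no more than 3.
  24. The device according to claim 21, characterized by further comprising a sequence splitting unit, connected to the sequencing unit and to the alignment unit, configured to split the sequencing data in advance into a plurality of short sequences of equal length before the alignment is performed.
  25. The device according to claim 24, characterized in that the length of the short sequences is no more than 50 bp.
  26. The device according to claim 25, characterized in that the length of the short sequences is 35 bp.
  27. The device according to claim 21, characterized in that the known variant sites include sites known to have single nucleotide polymorphisms, fragment insertions, and fragment deletions.
  28. The device according to claim 21, characterized in that the high-probability variant type is determined based on a Bayesian model.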
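  29. The device according to claim 28, characterized in that the Bayesian model uses the predetermined probability of occurrence of a predetermined variant type at the predetermined known variant site as the prior probability, and uses the alignment result as the posterior probability.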
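  30. The device according to claim 29, characterized by further comprising constructing the alignment result as a hash table, in which the variant type is the key and the number of matches is the value.
  31. A system for performing pedigree analysis on an organism, characterized by comprising:
    the genotyping device according to any one of claims 21 to 30, configured to use the method according to any one of claims 1 to 10 to perform low-depth genome sequencing on the genome of a sample of the organism to be tested, and to genotype at least one known variant site of the organism to be tested;
    a pedigree determination device, connected to the genotyping device, configured to determine the pedigree of the organism based on the result of the genotyping.
  32. The system according to claim 31, characterized in that the organism is an animal.
  33. The system according to claim 32, characterized in that the animal includes a domestic cat or a domestic dog.
  34. The system according to claim 31, characterized in that, in the pedigree determination device, the pedigree of the organism is determined based on predetermined characteristic genotypes of close relatives of the organism.
  35. The system according to claim 34, characterized in that the pedigree determination device further comprises a similarity value determination unit, adapted to score, for at least one of the known variant sites, at least one candidate close relative of the organism based on the high-probability variant type of the organism to be tested and the known variant type of at least one candidate close relative of the organism, so as to determine a similarity value for each candidate close relative of the organism.
  36. The system according to claim 35, characterized in that the pedigree determination device further comprises a close-relative source determination unit, adapted to carry out the following steps:
    dividing at least a part of the genome sequence of the organism to be tested into a plurality of windows, each of the plurality of windows containing at least one of the known variant sites; and
    classifying at least a part of the plurality of windows based on the similarity values of the candidate close relatives of the organism, so as to determine the candidate close-relative source corresponding to at least a part of the plurality of windows.
  37. The system according to claim 36, characterized in that the classification is performed by at least one of a random forest model, a support vector machine, and naive Bayes.
  38. The system according to claim 36, characterized in that the pedigree determination device further comprises a pedigree weight determination unit, adapted to:
    for each candidate parental source, determine the distance, on the genome sequence of the organism to be tested, of the known variant sites corresponding to the candidate parental source; and
    determine the pedigree weight of each candidate parental source based on the distance.
  39. The system according to claim 38, characterized in that the pedigree determination device further comprises determining the pedigree of the organism to be tested based on the pedigree weights of the candidate parental sources.
  40. The system according to claim 39, characterized in that, after the pedigree weight of each close-relative source is determined, the system further comprises:
    obtaining the breed composition of the organism to be tested by weighted calculation, and verifying the resulting breed composition of the organism to be tested by a cluster analysis method, so as to determine the pedigree of the organism to be tested based on the pedigree weights of the candidate parental sources.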
PCT/CN2017/101128 2017-09-08 2017-09-08 Method and device for genotyping based on low-depth genome sequencing, and use thereof WO2019047181A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
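PCT/CN2017/101128 WO2019047181A1 (zh) 2017-09-08 2017-09-08 Method and device for genotyping based on low-depth genome sequencing, and use thereof
CN201780093812.7A CN110997936B (zh) 2017-09-08 2017-09-08 Method and device for genotyping based on low-depth genome sequencing, and use thereof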

Publications (1)

Publication Number Publication Date
WO2019047181A1 (zh) 2019-03-14

Family

ID=65635230

Country Status (2)

Country Link
CN (1) CN110997936B (zh)
WO (1) WO2019047181A1 (zh)

Also Published As

Publication number Publication date
CN110997936A (zh) 2020-04-10
CN110997936B (zh) 2024-05-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17924266

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17924266

Country of ref document: EP

Kind code of ref document: A1