CN110997936B - Method, device and application of genotyping based on low-depth genome sequencing - Google Patents

Info

Publication number
CN110997936B
Authority
CN
China
Prior art keywords
organism
sequencing
determining
known mutation
genotyping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201780093812.7A
Other languages
Chinese (zh)
Other versions
CN110997936A (en)
Inventor
郭瑞东
贾超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Shenzhen Co Ltd
Original Assignee
BGI Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids

Landscapes

  • Chemical & Material Sciences (AREA)
  • Organic Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A method for genotyping based on low-depth genome sequencing is provided. The method comprises the following steps: (a) performing low-depth genome sequencing on the whole genome of a sample to be tested so as to obtain a sequencing result consisting of a plurality of sequencing data; (b) constructing, for at least one known mutation site, a reference sequence set of the known mutation site, the reference sequence set containing the mutation types of the known mutation site and the sequences upstream and downstream of the mutation site; (c) comparing the sequencing result obtained in step (a) with the reference sequence set so as to determine a comparison result for each known mutation site, wherein the comparison result comprises the mutation types matched by the sequencing result and the number of matches for each matched mutation type; and (d) determining a high-probability mutation type of the known mutation site based on the comparison result.

Description

Method, device and application of genotyping based on low-depth genome sequencing
PRIORITY INFORMATION
None
Technical Field
The present invention relates to the field of biotechnology, in particular to the field of genotyping and pedigree analysis, and more particularly to methods and devices for genotyping based on low-depth genome sequencing and uses thereof.
Background
Existing breed identification (also referred to herein as "pedigree analysis") services, represented by the canine genetic test offered by Wisdom Panel, use a custom microarray chip to detect the types of given single-base mutation sites in a pet dog's DNA, and then compare those types with data from purebred dogs in a database to give the breed composition of the dog being tested.
The prior art is based on microarray chips, each run of which requires hundreds of samples, so samples must be collected and pooled into a batch before testing. As a result, the experimental period for the pedigree analysis of each sample is long and the cost is high, and a test report cannot be delivered to the user quickly and cheaply. Chip-based detection also requires a relatively high DNA concentration in the sample, so there is a certain probability of sampling failure; this stricter sampling requirement further aggravates the long cycle and high cost of chip detection technology.
Thus, current genotyping and pedigree analysis techniques remain to be improved.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. Therefore, an object of the present invention is to provide a genotyping and breed identification technique that is low in cost and short in detection period.
The present invention has been completed based on the following work and findings of the inventors:
The inventors believe that genotyping and pedigree analysis can be performed based on whole genome sequencing, because as the cost of whole genome sequencing decreases, genotyping and pedigree analysis based on it will cost less than chip-based detection schemes. Moreover, compared with a chip scheme, a sequencing-based scheme does not require samples to be accumulated into a batch, places low demands on the amount of DNA, has a high sampling success rate and a short experimental period, and can deliver a test report quickly.
Furthermore, through a series of experimental studies, the inventors surprisingly found that genotyping and pedigree analysis of a sample of an organism to be tested can be effectively achieved based on low-depth genome sequencing by: inferring the genotypes of known mutation sites from whole-genome low-depth sequencing data; expressing the uncertain typing results in the form of probabilities; and increasing the tolerance to missing values when comparing against existing dog breed databases.
Thus, in one aspect, the invention provides a method of genotyping based on low-depth genome sequencing, comprising: (a) performing low-depth genome sequencing on the whole genome of a sample to be tested so as to obtain a sequencing result consisting of a plurality of sequencing data; (b) constructing, for at least one known mutation site, a reference sequence set of the known mutation site, the reference sequence set containing the mutation types of the known mutation site and the sequences upstream and downstream of the mutation site; (c) comparing the sequencing result obtained in step (a) with the reference sequence set so as to determine a comparison result for each known mutation site, wherein the comparison result comprises the mutation types matched by the sequencing result and the number of matches for each matched mutation type; and (d) determining a high-probability mutation type of the known mutation site based on the comparison result. The inventors surprisingly found that, using the method of the invention, known mutation sites of a sample to be tested can be effectively genotyped based on low-depth genome sequencing data, and that the organism from which the sample is derived can then be effectively subjected to pedigree analysis based on the genotyping results. In addition, the method has the advantages of low cost, a short detection period, and accurate and reliable results.
In another aspect of the invention, a method of performing pedigree analysis on an organism is provided. According to an embodiment of the invention, the method comprises: (1) using the method described above, performing low-depth genome sequencing on the genome of a sample of an organism to be tested, and genotyping at least one known mutation site of the organism to be tested; and (2) determining the pedigree of the organism based on the genotyping results. According to the embodiment of the invention, this method can genotype the known mutation sites of the sample of the organism to be tested based on low-depth genome sequencing data, so that the pedigree of the organism can be determined.
In yet another aspect, the invention provides an apparatus for genotyping based on low-depth genome sequencing. According to an embodiment of the present invention, the genotyping apparatus includes: a sequencing unit for performing low-depth genome sequencing on the whole genome of a sample to be tested so as to obtain a sequencing result consisting of a plurality of sequencing data; a reference sequence set construction unit for constructing, for at least one known mutation site, a reference sequence set of the known mutation site, the reference sequence set containing the mutation types of the known mutation site and the sequences upstream and downstream of the mutation site; an alignment unit, connected to the sequencing unit and to the reference sequence set construction unit, for receiving the sequencing result from the sequencing unit and comparing it with the reference sequence set so as to determine a comparison result for each known mutation site, wherein the comparison result comprises the mutation types matched by the sequencing result and the number of matches for each matched mutation type; and a high-probability mutation type determining unit, connected to the alignment unit, for determining the high-probability mutation type of the known mutation site based on the comparison result. With this apparatus, the known mutation sites of the sample to be tested can be genotyped based on low-depth genome sequencing data; the apparatus is convenient to operate, low in cost and short in detection period, and its results are accurate and reliable.
In yet another aspect of the invention, a system for performing pedigree analysis on an organism is provided. According to an embodiment of the invention, the system comprises: the genotyping apparatus described above, which applies the method for genotyping based on low-depth genome sequencing described above to perform low-depth genome sequencing on the genome of a sample of an organism to be tested and to genotype at least one known mutation site of the organism to be tested; and a pedigree determining device, connected to the genotyping apparatus, for determining the pedigree of the organism based on the genotyping results. According to the embodiment of the invention, the system can genotype the known mutation sites of the sample of the organism to be tested based on low-depth genome sequencing data, so that the pedigree of the organism can be determined; the system is convenient to operate, low in detection cost and short in detection period, and its results are accurate and reliable.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 shows a schematic flow diagram of a method of genotyping based on low depth genome sequencing of the present invention, according to an embodiment of the present invention;
FIG. 2 shows a schematic structural diagram of an apparatus for genotyping based on low depth genome sequencing of the present invention according to an embodiment of the present invention;
FIG. 3 shows a schematic diagram of a system for performing a pedigree analysis of an organism according to an embodiment of the invention;
FIG. 4 shows the results of the pedigree analysis of the pet dog tested in example 1;
FIG. 5 shows the results of the principal component analysis used to verify the pet dog tested in example 1;
FIG. 6 shows the results of the pedigree analysis of the pet dog tested in example 2.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
Genotyping method and apparatus
In one aspect, the invention provides a method for genotyping based on low-depth genome sequencing. According to the embodiment of the invention, the method can effectively genotype the known mutation sites of the sample to be tested based on low-depth genome sequencing data, and the organism from which the sample is derived can then be effectively subjected to pedigree analysis based on the genotyping results. In addition, the method has the advantages of low cost, a short detection period, and accurate and reliable results.
According to an embodiment of the invention, referring to fig. 1, the method comprises the steps of:
(a) Performing low-depth genome sequencing on the whole genome of the sample to be tested to obtain a sequencing result consisting of a plurality of sequencing data.
According to embodiments of the invention, the low-depth genome sequencing may be high-throughput sequencing with a sequencing depth of no more than 5×. According to some specific examples of the invention, the sequencing depth may be no more than 3×.
(b) For at least one known mutation site, constructing a reference sequence set of the known mutation site, the reference sequence set containing the mutation types of the known mutation site and the sequences upstream and downstream of the mutation site.
It should be noted that the term "mutation type" as used herein is to be understood in a broad sense and may be any mutation that differs from the wild type, including but not limited to single nucleotide polymorphisms and insertions and deletions of sequence fragments. Thus, according to embodiments of the present invention, the known mutation sites may include sites known to carry single nucleotide polymorphisms, fragment insertions or fragment deletions.
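As an illustration of step (b), the following is a minimal Python sketch of how a reference sequence set might be built for single-base substitutions. The function name `build_reference_set`, the 0-based coordinates, the `flank` parameter and the toy genome are all assumptions made for this example; the source does not specify the data layout, and insertions or deletions would need extra handling.

```python
from typing import Dict, List, Tuple


def build_reference_set(
    genome: str,
    sites: Dict[str, Tuple[int, List[str]]],
    flank: int = 17,
) -> Dict[str, Tuple[str, str]]:
    """Map each candidate sequence to (site_id, mutation_type).

    genome: reference sequence, 0-based coordinates (assumption).
    sites:  site_id -> (position, known base types at that position).
    flank:  bases kept upstream and downstream; 17 + 1 + 17 = 35 bp is a
            hypothetical default, not a value stated by the source.
    Only single-base substitutions are handled in this sketch.
    """
    reference: Dict[str, Tuple[str, str]] = {}
    for site_id, (pos, types) in sites.items():
        upstream = genome[max(0, pos - flank):pos]
        downstream = genome[pos + 1:pos + 1 + flank]
        for base in types:
            # One entry per known mutation type, sharing the same flanks.
            reference[upstream + base + downstream] = (site_id, base)
    return reference


genome = "ACGTACGTTAGCCGATAGGCTAACGTTAGC"
ref_set = build_reference_set(genome, {"snp1": (10, ["G", "A"])}, flank=4)
```

Keying the set by the full candidate sequence lets a later matching step recover both the site and the mutation type from a single lookup.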
(c) Comparing the sequencing result obtained in step (a) with the reference sequence set so as to determine a comparison result for each known mutation site, wherein the comparison result comprises the mutation types matched by the sequencing result and the number of matches for each matched mutation type.
According to some embodiments of the invention, the sequencing data are divided in advance into a plurality of short sequences of equal length before step (c) is performed. In low-depth genome sequencing, mismatched bases can significantly reduce the efficiency of genotyping. Thus, for low-depth genome sequencing, e.g., genome sequencing with a sequencing depth of no more than 3, mismatches should be avoided as far as possible. The inventors found through their studies that the longer a piece of sequencing data is, the greater the probability that it contains a base mismatch. Dividing the sequencing data into a plurality of short sequences therefore effectively reduces the probability of a mismatch and improves the efficiency of genotyping by low-depth genome sequencing. According to some specific examples of the invention, the short sequence is no more than 50 bp in length. According to further embodiments of the invention, the short sequence is preferably 35 bp in length. This reduces mismatches caused by excessively long sequences, which would otherwise cause reads that could have been aligned to the corresponding positions to be wrongly filtered out.
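The pre-division of reads into equal-length short sequences can be sketched as follows. This is only an illustrative Python sketch: the function name `split_read` is hypothetical, and the choice of consecutive, non-overlapping pieces with the shorter trailing remainder discarded is an assumption, since the source does not say how the division is carried out.

```python
from typing import List


def split_read(read: str, length: int = 35) -> List[str]:
    """Cut a read into consecutive, non-overlapping pieces of equal length.

    A trailing remainder shorter than `length` is dropped (assumption), so
    every returned piece has exactly `length` bases.
    """
    return [read[i:i + length] for i in range(0, len(read) - length + 1, length)]


read = "ACGT" * 25          # a toy 100 bp read
pieces = split_read(read)   # two 35 bp pieces; the last 30 bp are discarded
```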
(d) Determining the high-probability mutation type of the known mutation site based on the comparison result.
In step (c), the mutation types matched by the sequencing result and the number of matches for each matched mutation type are obtained through comparison. It will be appreciated by those skilled in the art that the matched mutation types and their match counts are related to the true mutation type at a particular site, i.e., a known mutation site. Therefore, once the comparison result is obtained, the high-probability mutation type of the known mutation site can be inferred from it. The method according to embodiments of the present invention thus makes it possible to obtain relatively reliable genotyping results from low-depth sequencing results.
According to the embodiment of the present invention, the manner of determining the high-probability mutation type from the comparison result, that is, the manner of the inference mentioned above, is not particularly limited.
According to an embodiment of the present invention, step (d) includes determining the high-probability mutation type based on a Bayesian model. According to some specific examples of the present invention, the Bayesian model uses the occurrence probability of each predetermined mutation type at a predetermined known mutation site as the prior probability, and uses the comparison result obtained in step (c) as the observation from which the posterior probability is computed. Here, "predetermined" means determined in advance. Specifically, the Bayesian model takes the occurrence probability of a specific known type at a known mutation site as the prior probability, takes the number of occurrences of that mutation type obtained by comparison as the observation, and determines the high-probability mutation type of the site from the resulting posterior probability. Specifically:
The high-probability mutation type of a specific mutation site is determined using the formula

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

where A denotes a candidate mutation type at the site and B denotes the observed comparison result. The Bayesian model takes the known type probability of the known mutation site as the prior term P(A)/P(B). The prior probability can be determined by statistical analysis of a plurality of control samples, that is, samples whose types at the mutation site are known; alternatively, the various types can be assumed to occur at the site with equal probability, for example, for a SNP site, the probability of A, T, G or C occurring at the site is 0.25 each. The comparison result serves as the observation: when one read is aligned to the sequence corresponding to a certain type, the value for that type is the likelihood of the base type corresponding to the read, namely P(B|A). The posterior probability P(A|B) obtained from the Bayesian model is used to give the final high-probability mutation type of the known mutation site.
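To make the Bayesian step concrete, the following Python sketch computes a posterior P(A|B) over candidate types from match counts. The likelihood model is an assumption introduced for this example (each matched read reports the true type with probability `1 - error` and each other candidate type with equal probability); the source does not define P(B|A) in this detail, and the sketch does not model heterozygous genotypes. The function name `posterior` and the `error` parameter are hypothetical.

```python
from typing import Dict


def posterior(
    counts: Dict[str, int],
    prior: Dict[str, float],
    error: float = 0.01,
) -> Dict[str, float]:
    """Return P(A|B) for every candidate type A, given match counts B.

    Assumed likelihood: a matched read shows the true type with probability
    1 - error and any other candidate type with probability error / (k - 1).
    """
    k = len(prior)
    unnormalised = {}
    for true_type, p_a in prior.items():
        likelihood = 1.0
        for observed, n in counts.items():
            p_read = 1.0 - error if observed == true_type else error / (k - 1)
            likelihood *= p_read ** n
        unnormalised[true_type] = likelihood * p_a      # P(B|A) * P(A)
    p_b = sum(unnormalised.values())                    # P(B)
    return {t: v / p_b for t, v in unnormalised.items()}


uniform_prior = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}
post = posterior({"G": 2, "A": 1}, uniform_prior)
best_type = max(post, key=post.get)   # the high-probability type under this model
```

With no matched reads the posterior equals the prior, which is one way to carry a missing site forward as an uncertain call rather than a hard one.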
In addition, to make a large number of comparison results easier to look up, the comparison results can be organised into a match-count/mutation-type database. The type of database is not particularly limited; according to some embodiments of the present invention, it may take the form of a hash table in which the mutation types are the keys and the match counts are the values. The database can then be searched quickly and conveniently, and the results are more accurate and reliable.
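A hash table of match counts keyed by mutation type can be sketched in Python with nested dictionaries. Exact string matching against the reference set stands in for the comparison step here and is an assumption; the source does not name the aligner or its matching rule. The function name `count_matches` and the toy sequences are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def count_matches(
    short_seqs: Iterable[str],
    reference: Dict[str, Tuple[str, str]],
) -> Dict[str, Dict[str, int]]:
    """Build site_id -> {mutation_type: match count} from short sequences.

    reference maps a candidate sequence to (site_id, mutation_type).
    Only exact matches are counted (assumption for this sketch).
    """
    table: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for seq in short_seqs:
        hit = reference.get(seq)          # constant-time hash lookup
        if hit is not None:
            site_id, mutation_type = hit
            table[site_id][mutation_type] += 1
    return {site: dict(types) for site, types in table.items()}


reference = {"GTTAGCCGA": ("snp1", "G"), "GTTAACCGA": ("snp1", "A")}
seqs = ["GTTAGCCGA", "GTTAGCCGA", "GTTAACCGA", "TTTTTTTTT"]
match_table = count_matches(seqs, reference)
```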
Accordingly, in another aspect of the invention, the invention provides an apparatus for genotyping based on low depth genomic sequencing. The device is suitable for carrying out the aforementioned method for genotyping based on low depth genome sequencing. By utilizing the device, the known mutation sites of the sample to be detected can be subjected to genotyping based on low-depth genome sequencing data, and the device is convenient to operate, low in cost, short in detection period and accurate and reliable in detection result.
Referring to fig. 2, according to an embodiment of the present invention, the genotyping apparatus 1000 includes: a sequencing unit 100, a reference sequence set construction unit 200, an alignment unit 300 and a high probability variation type determination unit 400.
According to an embodiment of the present invention, the sequencing unit 100 is used for performing low-depth genome sequencing on the whole genome of a sample to be tested, so as to obtain a sequencing result composed of a plurality of sequencing data. According to some embodiments of the invention, in the sequencing unit 100, the low depth genome sequencing is high throughput sequencing, the sequencing depth not exceeding 5. According to some embodiments of the invention, the sequencing depth is no more than 3.
According to some embodiments of the invention, the reference sequence set construction unit 200 is configured to construct, for at least one known mutation site, a reference sequence set of the known mutation site, the reference sequence set containing a mutation type of the known mutation site and a sequence upstream and downstream of the mutation site. According to some embodiments of the invention, the known mutation sites include sites known to have single nucleotide polymorphisms, fragment sequence insertions and deletions.
According to an embodiment of the present invention, the alignment unit 300 is connected to the sequencing unit 100 and to the reference sequence set construction unit 200, and is configured to receive the sequencing result from the sequencing unit 100 and compare it with the reference sequence set so as to determine a comparison result for each known mutation site, where the comparison result includes the mutation types matched by the sequencing result and the number of matches for each matched mutation type.
According to some embodiments of the invention, the apparatus further comprises a sequence dividing unit (not shown), connected to the sequencing unit 100 and to the alignment unit 300, for dividing the sequencing data into a plurality of short sequences of equal length before the comparison. According to some specific examples of the invention, the short sequence is no more than 50 bp in length. According to further embodiments of the invention, the short sequence is 35 bp in length.
According to some embodiments of the present invention, the high-probability mutation type determining unit 400 is connected to the alignment unit 300 and is configured to determine the high-probability mutation type of the known mutation site based on the comparison result. In some embodiments of the present invention, the high-probability mutation type determining unit 400 determines the high-probability mutation type based on a Bayesian model. Specifically, the Bayesian model takes the occurrence probability of a specific known type at a known mutation site as the prior probability, takes the number of occurrences of that mutation type obtained by comparison as the observation, and determines the high-probability mutation type of the site from the resulting posterior probability. Specifically:
The high-probability mutation type of a specific mutation site is determined using the formula

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

where A denotes a candidate mutation type at the site and B denotes the observed comparison result. The Bayesian model takes the known type probability of the known mutation site as the prior term P(A)/P(B). The prior probability can be determined by statistical analysis of a plurality of control samples, that is, samples whose types at the mutation site are known; alternatively, the various types can be assumed to occur at the site with equal probability, for example, for a SNP site, the probability of A, T, G or C occurring at the site is 0.25 each. The comparison result serves as the observation: when one read is aligned to the sequence corresponding to a certain type, the value for that type is the likelihood of the base type corresponding to the read, namely P(B|A). The posterior probability P(A|B) obtained from the Bayesian model is used to give the final high-probability mutation type of the known mutation site.
In addition, to make a large number of comparison results easier to look up, the comparison results can be organised into a match-count/mutation-type database. The type of database is not particularly limited; according to some embodiments of the present invention, it may take the form of a hash table in which the mutation types are the keys and the match counts are the values. The database can then be searched quickly and conveniently, and the results are more accurate and reliable.
Method and system for pedigree analysis
In yet another aspect of the invention, a method of performing pedigree analysis on an organism is provided. According to the embodiment of the invention, the method can genotype the known mutation sites of a sample of the organism to be tested based on low-depth genome sequencing data, so that the pedigree of the organism can be determined.
According to an embodiment of the invention, the method comprises: (1) using the method described above, performing low-depth genome sequencing on the genome of a sample of an organism to be tested, and genotyping at least one known mutation site of the organism to be tested; and (2) determining the pedigree of the organism based on the genotyping results.
The type of organism to which the method of the present invention is applicable is not particularly limited; dogs, cats and even humans can be subjected to pedigree analysis by the method. Thus, according to some embodiments of the invention, the organism is an animal. According to some embodiments of the invention, the animal includes a domestic cat (Felis silvestris catus) or a domestic dog (Canis lupus familiaris).
As will be appreciated by those skilled in the art, the term "pedigree analysis" as used herein refers to determining the ancestry, origin or lineage of a companion animal such as a cat or dog, for example, for a particular animal, determining the breeds of its dam or sire and of its more distant ancestors.
According to some embodiments of the invention, in step (2), the pedigree of the organism is determined based on predetermined characteristic genotypes of the organism's candidate close relatives.
According to an embodiment of the present invention, step (2) further includes:
For at least one of the known mutation sites, scoring at least one candidate close relative of the organism based on the high-probability mutation type of the organism to be tested and on the known mutation type of the candidate close relative, so as to determine a similarity value for each candidate close relative.
Specifically, according to an embodiment of the present invention, determining the pedigree of the organism further comprises: for the known mutation sites, comparing the high-probability mutation types of the organism to be tested with the mutation types of a plurality of candidate close relatives, and scoring the candidate close relatives so as to determine a similarity value for each of them. It will be appreciated by those skilled in the art that a higher similarity value indicates a closer relationship between the organism to be tested and the candidate. It should be noted that the similarity value serves as the characteristic value in the embodiments, and the two terms are used interchangeably herein; it represents the degree of kinship similarity between the pet dog to be tested and each possible candidate breed.
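One possible scoring rule is sketched below in Python. The source does not give the scoring formula, so this is an assumption: the similarity value is taken as the mean, over sites typed in both the sample and the candidate, of the probability the sample assigns to the candidate's known type, and sites missing on either side are skipped. The function name `similarity` and the breed labels are hypothetical.

```python
from typing import Dict, Optional


def similarity(
    sample: Dict[str, Dict[str, float]],
    candidate: Dict[str, Optional[str]],
) -> float:
    """Mean probability the sample gives to the candidate's type per site.

    sample:    site_id -> {mutation_type: probability} for the tested organism.
    candidate: site_id -> known mutation type (None if unknown).
    Sites missing in either input are skipped, tolerating missing values.
    """
    total, used = 0.0, 0
    for site, known_type in candidate.items():
        probs = sample.get(site)
        if probs is None or known_type is None:
            continue
        total += probs.get(known_type, 0.0)
        used += 1
    return total / used if used else 0.0


sample = {"s1": {"A": 0.9, "G": 0.1}, "s2": {"C": 0.8, "T": 0.2}}
candidates = {
    "breed_x": {"s1": "A", "s2": "C", "s3": "G"},   # s3 is untyped in the sample
    "breed_y": {"s1": "G", "s2": "T", "s3": "G"},
}
scores = {name: similarity(sample, types) for name, types in candidates.items()}
```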
According to an embodiment of the present invention, step (2) further includes:
Dividing at least a portion of the genomic sequence of the organism to be tested into a plurality of windows, each of the plurality of windows containing at least one of the known mutation sites; and
Classifying at least a portion of the plurality of windows based on the similarity value of each candidate close relative, so as to determine the candidate close-relative source corresponding to at least a portion of the plurality of windows.
Specifically, according to an embodiment of the present invention, determining the pedigree of an organism further comprises: dividing the DNA sequence of the organism to be tested into a plurality of windows of approximately the same length, each window containing at least one known mutation site; and classifying the resulting windows based on the similarity value of each candidate close relative, so as to determine the close-relative source corresponding to each window. It should be noted that the method for classifying the windows based on the similarity values is not particularly limited; models including, but not limited to, random forest, support vector machine and naive Bayes may be used, for example via the part and libsvm libraries in the R language. The random forest model is preferred. A random forest is a classification model that combines decision trees to obtain a better result: a plurality of decision trees are constructed, each decision tree classifies the samples according to the weight of each point in combination with the input characteristic values, and the classifications of the individual trees are then combined to give the classification of the random forest model. Thus, according to an embodiment of the present invention, the gene sequence is divided into fragments of the same length, and the window of each fragment is classified using the base types of the mutation sites within it, such as SNP typing, as characteristic values, so that the window is assigned to a certain breed, that is, the DNA sequence of the window is identified as originating from that breed.
It should be noted that the "windows of the same length" described herein allow a certain deviation in length, e.g. 1-10% above or below. According to an embodiment of the present invention, the division may be performed as follows:
For the N SNP sites to be detected on chromosome 1, denoted S1, S2, S3, ..., Sn, the distance from S1 to S2 is denoted D1, the distance from S2 to S3 is denoted D2, and so on. Given a fixed window size X, the longest run of SNP sites S1, S2, ..., Sa satisfying $\sum_{i=1}^{a-1} D_i \le X$ is assigned to one window, and the window is numbered 1. By the same rule, the longest run of SNP sites Sa+1, Sa+2, ..., Sb satisfying $\sum_{i=a+1}^{b-1} D_i \le X$ is assigned to another window, and the window is numbered 2. By analogy, after chromosome 1 has been divided into windows, chromosome 2 is divided by the same rule, and so on until all autosomes have been divided into windows.
The specific value of X depends on the species to be tested; for example, for dogs it may be 1% of the total length of the autosomes in the whole genome sequence of the species.
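As a non-limiting illustration, the window division described above may be sketched in Python as follows. The function name `partition_into_windows` and the example positions are hypothetical; the sketch assumes that SNP positions on one chromosome are given as base coordinates, so that the sum of the distances D within a window equals the difference between its last and first position.

```python
from typing import List


def partition_into_windows(positions: List[int], window_size: int) -> List[List[int]]:
    """Greedily group SNP positions so that each window spans at most window_size bases."""
    windows: List[List[int]] = []
    current: List[int] = []
    for pos in sorted(positions):
        # Close the current window when adding this SNP would stretch its span beyond X.
        if current and pos - current[0] > window_size:
            windows.append(current)
            current = []
        current.append(pos)
    if current:
        windows.append(current)
    return windows


# Hypothetical SNP positions (bp) on one chromosome, with X = 100.
snp_positions = [10, 40, 90, 130, 180, 400]
windows = partition_into_windows(snp_positions, window_size=100)
print(windows)  # [[10, 40, 90], [130, 180], [400]]
```

The same function would be applied to each autosome in turn, with the window numbering continuing across chromosomes.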
After the close-relative source to which each window may correspond has been obtained, the method according to an embodiment of the present invention further includes: determining, on the genome sequence of the organism to be tested, the distance between the known mutation sites corresponding to each close-relative source, and determining the ancestry weight of each close-relative source based on the obtained distance.
According to an embodiment of the present invention, preferably, the step (2) may include:
Determining, for each candidate close-relative source, the distance between the known mutation sites corresponding to that candidate close-relative source on the genome sequence of the organism to be tested; and
Determining an ancestry weight for each candidate close-relative source based on the distances.
According to an embodiment of the present invention, after the ancestry weight of each close-relative source is determined, the method further comprises: obtaining the variety components of the organism to be tested by weighted calculation; and verifying the obtained variety components by a cluster analysis method, so as to determine the ancestry of the organism to be tested. According to some specific examples of the invention, the cluster analysis method is principal component analysis. Principal component analysis is a commonly used method for reducing the dimensionality of data: linear combinations of the original variables are formed, the few directions with the largest variance are found, and the original data are projected onto these new coordinate axes, so that the reduced data retain as much of the information in the original data as possible. According to an embodiment of the present invention, the principal component analysis may be performed using the ppca function in the pcaMethods package in the R language.
According to some specific examples of the invention, the method of performing a pedigree analysis on an organism of the invention may comprise the steps of:
1) Low-depth genome sequencing is performed on the whole genome of the sample to be tested. If the reads obtained from the second-generation sequencing platform are longer than 50 bp, each read is cut in order into several short sequences of equal length, and the resulting short sequences are written to a new file, called cut-read.
2) The data of the gene chip to be detected are located on the website (https://www.illumina.com), and the list of specified single-base variants to be detected, together with the reference sequences before and after each variant, is downloaded. Reference sequences corresponding to the different types at the different sites to be detected can then be generated in the manner described in detail in the examples; the resulting file is called SNP-index. The detectable variants here are not limited to single-base mutations; short insertions and deletions whose variant fragment sequences are known and well defined can also be detected.
3) SOAPaligner is downloaded from the website (http://soap.genomics.org.cn), and, with the SNP-index file from step 2) as input, the data structures required for the alignment are created with the 2bwt-builder command.
4) The cut-read from step 1) is aligned to the SNP-index reference sequences using the soap command with the parameters "-v 0 -M 0 -r 0".
5) From the alignment result of step 4), a hash table is built with the name of each aligned SNP-index as the key and its number of occurrences as the value; the hash table is updated while traversing the alignment result, giving the frequency with which each SNP-index was hit.
6) Assuming that the paternal strand and the maternal strand are equally likely to be detected during sequencing, the typing is computed from Bayes' formula and the hash table obtained in step 5). The probability of each known type at the known mutation site is taken as the prior probability; assuming that all types are equally likely at the mutation site, this value is denoted P(A)/P(B). The alignment result obtained in step 4) is taken as the observation: when a read aligns to the sequence corresponding to a certain type, the probability that the typing value is the base type corresponding to that read is denoted P(B|A). The posterior probability P(A|B) given by the Bayesian model is taken as the final high-probability mutation type of the known mutation site. In this way, the possible single-base typing results at each site are obtained for different depths.
7) The genotype obtained in step 6) is compared with the single-base typing results of samples of different varieties in a background database, and a feature value is obtained for each variety according to the expected number of identical sites. That is, for each site, if the typing results are the same, the feature value of the variety is increased by one, and if they differ, the feature value is unchanged; the total is then divided by the number of samples of that variety to give the average feature value of the variety.
8) The DNA of the organism to be tested is divided into a plurality of windows of equal length according to the order of the single-base mutation positions on the different chromosomes, each window containing at least one single-base mutation site.
For the N SNP sites to be detected on chromosome 1, denoted S1, S2, S3, ..., Sn, the distance from S1 to S2 is denoted D1, the distance from S2 to S3 is denoted D2, and so on. Given a fixed window size X, the longest run of SNP sites S1, S2, ..., Sa satisfying $\sum_{i=1}^{a-1} D_i \le X$ is assigned to one window, and the window is numbered 1. By the same rule, the longest run of SNP sites Sa+1, Sa+2, ..., Sb satisfying $\sum_{i=a+1}^{b-1} D_i \le X$ is assigned to another window, and the window is numbered 2. By analogy, after chromosome 1 has been divided into windows, chromosome 2 is divided by the same rule, and so on until all autosomes have been divided into windows.
The specific value of X depends on the species to be tested; here it is 1% of the total length of the autosomes in the whole genome sequence of the dog.
9) For each window obtained in step 8), the small DNA segment of the window is classified using the feature values of the different varieties obtained in step 7) and a model including, but not limited to, random forest, support vector machine or naive Bayes, via the classification methods of the party and libsvm libraries in the R language. The classification result is the variety to which the DNA sequence most likely corresponds, and the basis for classification is the feature values obtained in step 7) for known purebred dogs of each variety.
The classification results of the windows are denoted b1, b2, ..., bn; that is, the classification results of the segments are summed to obtain the total for each variety.
It should be noted that, as those skilled in the art will understand, in the foregoing expressions "the N SNP sites to be detected on chromosome 1 are denoted S1, S2, S3, ..., Sn" and "the classification results of the windows are denoted b1, b2, ..., bn", the index n has different meanings: in "Sn" it numbers the SNP sites to be detected, whereas in "bn" it numbers the windows, "bn" being the classification result for the DNA of window n.
According to an embodiment of the invention, the method further comprises calculating the variety components of the organism to be tested by weighting the detection results of the different windows obtained in step 8) by the lengths of the DNA sequences that the windows represent, so that the ancestry of the organism to be tested is determined from the proportions of its variety components.
For a window containing the SNP sites Sa, Sa+1, ..., Sb, the weight of the window is $w_i = \frac{1}{WG}\sum_{j=a}^{b-1} D_j$, where WG is the total number of bases of the autosomes in the whole genome of the dog. Each window obtained in step 8) yields one classification result; the classification results of the windows are denoted b1, b2, ..., bn, each corresponding to a dog breed, and the final breed composition is estimated as $\sum_{i=1}^{n} w_i b_i$, i.e. the proportion of each breed is the sum of the weights $w_i$ of the windows classified as that breed.
10) The detection results from step 9) are verified using principal component analysis or another clustering method.
Specifically, the several most abundant varieties among those obtained in step 9) are selected and clustered using principal component analysis or another clustering method. From the clustering result, the mean distance between the sample to be tested and the samples of each variety is calculated; if the variety closest to the sample to be tested is the main variety obtained in step 9), the result of step 9) is confirmed to be reliable.
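The verification in step 10) may be illustrated by the following minimal Python sketch. It assumes that the sample to be tested and the reference samples have already been projected to low-dimensional coordinates; the function name `nearest_breed`, the breed labels and all coordinates are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Point = Sequence[float]


def nearest_breed(test_point: Point, breed_points: Dict[str, List[Point]]) -> str:
    """Return the breed whose samples have the smallest mean distance to the test sample."""
    mean_distance = {
        breed: sum(math.dist(test_point, p) for p in points) / len(points)
        for breed, points in breed_points.items()
    }
    return min(mean_distance, key=mean_distance.get)


# Hypothetical 2-D coordinates after dimensionality reduction.
projected = {
    "BreedA": [(1.0, 1.0), (1.2, 0.8)],
    "BreedB": [(-2.0, 3.0), (-1.8, 3.2)],
}
main_breed_from_step_9 = "BreedA"
# The result of step 9) is considered confirmed when the two agree.
print(nearest_breed((0.5, 1.0), projected) == main_breed_from_step_9)  # True
```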
According to some specific examples of the present invention, step 7) is implemented as follows: for each detected site, the typing result of the organism to be tested obtained in step 6) is compared one by one with each sample of each variety in the background database, giving the similarity (i.e. the feature value) between the organism to be tested and each variety. Specifically, the typing result of the organism to be tested at a site is compared with the typing results of the multiple samples of a variety in the background database at that site; whenever the typing result of the organism to be tested agrees with that of a database sample at the site, the similarity (feature value) between the sample to be tested and the variety is increased by one, and the comparison results for the multiple samples of the same variety are combined by weighted averaging to give the similarity (feature value) for that variety.
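By way of illustration only, the comparison described above may be sketched in Python as follows. The data layout (a dictionary from SNP ID to genotype call), the function name `breed_feature_values` and the sample data are assumptions made for this sketch; a plain average over the samples of each breed is used, and sites missing from a reference sample are simply skipped.

```python
from typing import Dict, List

Genotypes = Dict[str, str]  # SNP ID -> genotype call, e.g. {"snp1": "AG"}


def breed_feature_values(test: Genotypes,
                         database: Dict[str, List[Genotypes]]) -> Dict[str, float]:
    """Average, per breed, the number of sites at which a reference sample matches the test sample."""
    features: Dict[str, float] = {}
    for breed, samples in database.items():
        total_matches = 0
        for sample in samples:
            for snp_id, call in test.items():
                # Sort the two alleles so that "AG" and "GA" compare as equal.
                if snp_id in sample and sorted(sample[snp_id]) == sorted(call):
                    total_matches += 1
        features[breed] = total_matches / len(samples)
    return features


# Hypothetical test sample and background database.
test_sample = {"snp1": "CC", "snp2": "GG", "snp3": "AG"}
database = {
    "BreedA": [{"snp1": "CC", "snp2": "GG", "snp3": "GA"},
               {"snp1": "CC", "snp2": "GC", "snp3": "AG"}],
    "BreedB": [{"snp1": "AC", "snp2": "GG", "snp3": "AA"}],
}
print(breed_feature_values(test_sample, database))  # {'BreedA': 2.5, 'BreedB': 1.0}
```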
The method for ancestry analysis of an organism can rapidly and accurately obtain the genotyping results of the relevant sites from whole-genome second-generation low-depth sequencing data. Because the average sequencing depth is only 1x to 2x, it cannot be known in advance which single-base variant sites will be covered, nor can an exact typing result be obtained for every site. The uncertain typing results are therefore expressed as probabilities, and the tolerance to missing values is increased when comparing against the existing variety database of the organism to be tested. (Regarding this tolerance, there is no clear-cut answer as to how many missing values can be tolerated: the accuracy of variety determination decreases as the proportion of missing values in the data increases. Based on current experience, to determine the ancestry of the organism to be tested effectively, the number of detected SNP sites should be no less than 25% of the total.) In practice, the method has broad application prospects. For example, it can provide a pedigree certificate for a purebred pet dog, a certificate that two dogs are directly related or are the same dog (i.e. a genetic identity card for a pet dog or cat), and, for a mixed-breed dog, quantitative proportions of ancestral components and an inferred family tree of breeds within three generations.
In yet another aspect of the invention, a system for performing a pedigree analysis on an organism is provided. According to an embodiment of the invention, the system can genotype the known mutation sites of a sample of the organism to be tested based on low-depth genome sequencing data and thereby determine the ancestry of the organism; the system is convenient to operate, has a low detection cost and a short detection period, and gives accurate and reliable results.
According to an embodiment of the invention, referring to fig. 3, the system 10000 comprises: genotyping apparatus 1000 and pedigree determining apparatus 2000.
According to an embodiment of the present invention, the genotyping apparatus 1000 is used for performing low-depth genome sequencing on the genome of a sample of an organism to be tested and genotyping at least one known mutation site of the organism to be tested by using the method for genotyping based on low-depth genome sequencing as described above. According to some embodiments of the invention, the organism is an animal. According to some specific examples of the invention, the animal includes a domestic cat (Felis silvestris catus) and a domestic dog (Canis lupus familiaris).
According to some embodiments of the invention, a pedigree determining device 2000 is coupled to the genotyping device 1000 for determining the pedigree of the organism based on the result of the genotyping. According to some embodiments of the invention, in the pedigree determining device 2000, the pedigree of the organism is determined based on a predetermined characteristic genotyping of the organism's close relatives.
According to some embodiments of the invention, the pedigree determining device 2000 further comprises a similarity value determining unit adapted to compare, for the known mutation sites, the high-probability mutation type of the organism to be tested with the mutation types of a plurality of candidate close relatives and to score each candidate close relative, so as to determine a similarity value for each candidate close relative.
According to some embodiments of the invention, the pedigree determining apparatus 2000 further comprises a close-relative source determining unit for dividing the DNA sequence of the organism to be tested into a plurality of windows of approximately the same length, each window containing at least one of the known mutation sites; and classifying the windows based on the similarity value of each candidate close relative, so as to determine the close-relative source corresponding to each window. It should be noted that the method for classifying the windows based on the similarity values is not particularly limited; it includes, but is not limited to, random forest, support vector machine and naive Bayes, and may be implemented using the classification methods of the party and libsvm libraries in the R language. Among these, a random forest model is preferred. A random forest is an ensemble classification model: a plurality of decision trees is constructed, each decision tree classifies the sample from the input feature values according to the weight of each site, and the classifications of the individual trees are then combined to give the classification of the random forest model. Thus, according to an embodiment of the present invention, the gene sequence is divided into fragments (windows) of the same length, and each window is classified using the base types at the mutation sites within it (for example, SNP typing) as feature values; the window is thereby assigned to a certain variety, i.e. the DNA sequence of that window is identified as originating from that variety.
According to some embodiments of the invention, the pedigree determination device 2000 further comprises a pedigree weight determination unit adapted to: determine, on the genome sequence of the organism to be tested, the distance between the known mutation sites corresponding to each close-relative source, and determine the ancestry weight of each close-relative source based on the obtained distance.
According to some embodiments of the invention, the pedigree determination device 2000 further comprises a pedigree determination unit adapted to perform a principal component analysis of the ancestry weights of the close-relative sources in order to determine the ancestry of the organism to be tested. Principal component analysis is a commonly used method for reducing the dimensionality of data: linear combinations of the original variables are formed, the few directions with the largest variance are found, and the original data are projected onto these new coordinate axes, so that the reduced data retain as much of the information in the original data as possible. According to an embodiment of the present invention, the principal component analysis may be performed using the ppca function in the pcaMethods package in the R language.
It should be noted that the method, the device and the application of the invention for genotyping based on low-depth genome sequencing have at least one of the following advantages:
1. The method for ancestry analysis of an organism obtains variety components from low-depth sequencing data and thereby achieves ancestry analysis.
2. The method for genotyping based on low-depth genome sequencing uses low-depth whole-genome data to estimate the genotyping result of known mutation sites (such as single-base mutation sites), whereas conventional mutation detection software, such as GATK, cannot normally give a result when the depth is low. In addition, by constructing reference sequences from the sequence flanking each site to be detected, the method can obtain an accurate single-base typing result using one fifth of the time and one tenth of the memory of the traditional method.
3. The method for performing pedigree analysis on organisms of the invention gives a quantitative estimate of the ancestral composition. Similar calculations can in future be used to detect the variety components of pet cats, pet birds and other companion animals, of cattle, chickens and other economically important species, and also to detect human ancestry components.
The scheme of the present invention is explained below with reference to examples. Those skilled in the art will appreciate that the following examples are illustrative of the present invention and should not be construed as limiting its scope. Where specific techniques or conditions are not indicated in the examples, they are carried out according to the techniques or conditions described in the literature in the art (for example, J. Sambrook et al., Molecular Cloning: A Laboratory Manual, Third Edition, translated by Huang Peitang et al., Science Press) or according to the product specifications. Reagents or apparatus for which no manufacturer is indicated are conventional, commercially available products.
Example 1:
Referring to Fig. 1, the organism to be tested is genotyped according to the genotyping method of the invention, and a pedigree analysis of the organism is then performed.
The organism to be tested is a pet dog, which its owner reports to be a Siberian Husky. The sample to be tested is a saliva sample, obtained non-invasively from the pet dog using a PG-100 saliva sampler.
The method comprises the following specific steps:
1) Low-depth genome sequencing of the whole genome of the sample to be tested is performed on a BGI-seq500 sequencing platform. Specifically, DNA is extracted from the saliva, the whole genome of the DNA is amplified by an enzyme digestion method, and a library is then constructed. Whole-genome low-depth sequencing is then performed on the BGI-seq500 at a depth of 2x to 3x. Reads obtained from the second-generation sequencing platform are cut in order into 50 bp short sequences, and the resulting short sequences are written to a new file, called cut-read.
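The read-cutting operation of step 1) may, for example, be sketched in Python as follows. The function name `cut_read` is hypothetical, and discarding a trailing fragment shorter than 50 bp is an assumption of this sketch rather than something stated in the example.

```python
from typing import List


def cut_read(read: str, length: int = 50) -> List[str]:
    """Split one read into consecutive, non-overlapping pieces of `length` bases."""
    # Assumption: a trailing piece shorter than `length` is discarded.
    return [read[i:i + length] for i in range(0, len(read) - length + 1, length)]


read = "ACGT" * 30  # a hypothetical 120 bp read
pieces = cut_read(read, length=50)
print(len(pieces), [len(p) for p in pieces])  # 2 [50, 50]
```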
2) The Illumina CanineHD gene chip data are located on the website (ftp://webdata2:webdata2@ussd-ftp.illumina.com/downloads/ProductFiles/CanineHD/CanineHD_B.csv), and the file at the aforementioned link is downloaded as the list of single-base variants to be detected, together with the reference sequences before and after the variants.
For each mutation site, the 50 bp sequence preceding the site, the base type at the site and the 50 bp sequence following the site are concatenated in order to give the sequence corresponding to that mutation type at that site; the resulting file is called SNP-index. Because each mutation site has two possible base types, two sequences are constructed for the same site according to the above rule, one per base type, and each sequence is named by the SNP site number and the base type. The SNP-index entries for bases A and G at the site numbered BICF2G630100019 are shown below.
>BICF2G630100019_A
GCGACAAGGGTTTTGGTGAATGTCTGCAAAGAGCAGCGACAGCACATTCTGTTACAATTAAGAACAAAATATTAAGATCATATCTAAAGTGTCCTGGCAAATTGCATGCCACCAATCAATT(SEQ ID NO:1)
>BICF2G630100019_G
GCGACAAGGGTTTTGGTGAATGTCTGCAAAGAGCAGCGACAGCACATTCTGTTACAATTAGGAACAAAATATTAAGATCATATCTAAAGTGTCCTGGCAAATTGCATGCCACCAATCAATT(SEQ ID NO:2)
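As a non-limiting illustration, SNP-index entries of the above form may be generated as in the following Python sketch. The function names `build_snp_index` and `to_fasta`, the identifier `SNP_demo` and the short flanking sequences are hypothetical; the flank length is determined by the inputs.

```python
from typing import Dict, Iterable


def build_snp_index(snp_id: str, left_flank: str, right_flank: str,
                    alleles: Iterable[str]) -> Dict[str, str]:
    """Return one reference sequence per allele, keyed by '<SNP id>_<allele>'."""
    return {f"{snp_id}_{allele}": left_flank + allele + right_flank
            for allele in alleles}


def to_fasta(records: Dict[str, str]) -> str:
    """Serialize the records as FASTA text, one header line and one sequence line each."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


# Hypothetical short flanks, for illustration only.
records = build_snp_index("SNP_demo", "ACGTAC", "TTGCA", alleles=("A", "G"))
print(to_fasta(records), end="")
```

Each record name follows the convention shown above, so the base type can later be recovered from the name of the sequence to which a read aligns.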
3) SOAPaligner2 is downloaded from the website (http://soap.genomics.org.cn). With the SNP-index file from step 2) as input, the 13 index files required for the alignment are created with the 2bwt-builder command; their suffixes are amb, ann, bwt, fmv, hot, lkt, pac, rev.bwt, rev.fmv, rev.lkt, rev.pac, sa and sai.
4) The cut-read from step 1) is aligned to the SNP-index reference sequences using the soap command with the parameters "-v 0 -M 0 -r 0".
5) From the alignment result of step 4), a hash table is built with the name of each aligned SNP-index as the key and its number of occurrences as the value; the hash table is updated while traversing the alignment result, giving the frequency with which each SNP-index was hit. The result of this step is the table below; since it contains 160,000 rows, only the first three are listed. Each row represents one event in which a cut-read from step 1) aligned to an SNP-index from step 3); the first column is the SNP number and the second column is the base corresponding to the read:
SNP number | Base
BICF2S23657714 | C
BICF2G630130992 | G
BICF2G630708586 | G
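The hash table of step 5) may be illustrated by the following minimal Python sketch. It assumes that the names of the SNP-index sequences hit by the reads have already been extracted from the alignment output (the parsing of that output is not shown); the function name `count_hits` and the list of hits are hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_hits(hit_names: Iterable[str]) -> Counter:
    """Key: name of the SNP-index sequence hit by a cut-read; value: number of hits."""
    table: Counter = Counter()
    for name in hit_names:
        table[name] += 1
    return table


# Hypothetical hits, named '<SNP number>_<base>'.
hits = ["BICF2S23657714_C", "BICF2G630130992_G", "BICF2G630130992_G"]
table = count_hits(hits)
snp_number, base = "BICF2G630130992_G".rsplit("_", 1)
print(snp_number, base, table["BICF2G630130992_G"])  # BICF2G630130992 G 2
```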
6) Assuming that the paternal strand and the maternal strand are equally likely to be detected during sequencing, the possible single-base typing results at each site are obtained for different depths from Bayes' formula and the hash table obtained in step 5).
Assuming that all types are equally likely at the mutation site, this value is denoted P(A)/P(B). Taking the sequencing alignment result obtained in step 4) as the observation, the probability that the typing value is the base type corresponding to the read is denoted P(B|A), and the posterior probability P(A|B), i.e. the probability of each possible typing value at the site, is obtained from Bayes' formula, $P(A|B) = P(B|A)\,P(A)/P(B)$.
The result of this step is the table below; since it contains 160,000 rows, only the first three are listed. The first column is the SNP ID, the second column is one possible typing, the third column is the probability of the typing in the second column, the fourth column is another possible typing at the site, and the fifth column is the probability of the typing in the fourth column:
SNP ID | Typing 1 | Probability of typing 1 | Typing 2 | Probability of typing 2
BICF2S23657714 | CC | 0.67 | AC | 0.33
BICF2G630130992 | GG | 0.8 | GC | 0.2
BICF2G630708586 | GG | 0.67 | AG | 0.33
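A minimal Python sketch of one possible form of this Bayesian calculation is given below. The likelihood model is an assumption of the sketch: the three genotypes of a biallelic site are given equal prior probability, and each read is taken to sample one of the two strands with probability 1/2, without sequencing error. Under these assumptions, one observed read gives probabilities of 0.67 and 0.33 and two identical reads give 0.8 and 0.2, as in the table above. The function name `genotype_posteriors` is hypothetical.

```python
from typing import Dict, List


def genotype_posteriors(reads: List[str], allele_1: str, allele_2: str) -> Dict[str, float]:
    """Posterior probability of each genotype at a biallelic site, given the observed bases."""
    genotypes = [allele_1 * 2, allele_1 + allele_2, allele_2 * 2]
    prior = 1.0 / len(genotypes)  # assumed uniform prior over the three genotypes
    joint = {}
    for genotype in genotypes:
        likelihood = 1.0
        for base in reads:
            # Each read samples one of the two strands with probability 1/2.
            likelihood *= genotype.count(base) / 2.0
        joint[genotype] = prior * likelihood
    evidence = sum(joint.values())
    if evidence == 0.0:
        return {g: prior for g in genotypes}  # no consistent genotype: fall back to the prior
    return {g: p / evidence for g, p in joint.items()}


one_read = genotype_posteriors(["C"], "C", "A")
two_reads = genotype_posteriors(["G", "G"], "G", "C")
print({g: round(p, 2) for g, p in one_read.items()})   # {'CC': 0.67, 'CA': 0.33, 'AA': 0.0}
print({g: round(p, 2) for g, p in two_reads.items()})  # {'GG': 0.8, 'GC': 0.2, 'CC': 0.0}
```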
7) The genotype obtained in step 6) is compared with the single-base typing results of dogs of different breeds in a background database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE90441), and a feature value is obtained for each breed according to the expected number of identical sites. For each SNP site, the typing result obtained in step 6) is compared one by one with every sample of every breed in the background database; whenever the typing results at the site agree, the similarity (i.e. the feature value referred to here) between the sample to be tested and that breed is increased by one. After all samples of all breeds have been compared, the total for each breed is divided by the number of samples of that breed in the database, giving the similarity, i.e. the feature value, for each breed.
The result of this step is the table below; since it contains 70 rows, only the first four are listed. The first column is the feature value and the second column is the corresponding breed:
Feature value | Breed
94607.4 | Siberian Husky (SiberianHusky)
89423.8028571428 | Greenland Sled Dog (GreenlandSledgeDog)
89404.921 | Alaskan Malamute (AlaskanMalamute)
89399.9492857142 | Chihuahua (Chihuahua)
8) The DNA of the organism to be tested is divided into a plurality of windows of equal length according to the order of the single-base mutations on the different chromosomes, each window containing a plurality of single-base mutation sites.
For the N SNP sites to be detected on chromosome 1, denoted S1, S2, S3, ..., Sn, the distance from S1 to S2 is denoted D1, the distance from S2 to S3 is denoted D2, and so on. Given a fixed window size X, the longest run of SNP sites S1, S2, ..., Sa satisfying $\sum_{i=1}^{a-1} D_i \le X$ is assigned to one window, and the window is numbered 1. By the same rule, the longest run of SNP sites Sa+1, Sa+2, ..., Sb satisfying $\sum_{i=a+1}^{b-1} D_i \le X$ is assigned to another window, and the window is numbered 2. By analogy, after chromosome 1 has been divided into windows, chromosome 2 is divided by the same rule, and so on until all autosomes have been divided, giving 100 windows numbered 1, 2, 3, ..., 100.
X is 1% of the total length of the autosomes in the genome of the dog, i.e. 21 Mbp.
9) For the windows obtained in step 8), taking into account the lengths of the DNA sequences that the windows represent and using the feature values of the different breeds obtained in step 7), the small DNA segment of each window is classified with the random forest models of the party and libsvm libraries in the R language. The classification result is the breed to which the DNA sequence most likely corresponds, and the basis for classification is the feature values obtained in step 7) for known purebred dogs of each breed. The classification results of the windows are denoted b1, b2, ..., b100; the label of each classification is the classification result given by the random forest model in this step. Each of b1, b2, ..., b100 corresponds to a dog breed, and w1, w2, ..., wn are the weights of the windows, i.e. the ratio of the length of the sequence corresponding to the window to the total sequence length. The final breed composition is estimated as $\sum_{i=1}^{n} w_i b_i$, where, for a window containing the SNP sites Sa, Sa+1, ..., Sb, $w_i = \frac{1}{WG}\sum_{j=a}^{b-1} D_j$; the sum is the length of the DNA sequence contained in the window, and WG is the total number of bases of the autosomes in the whole genome of the dog. The classification results of the windows are thus weighted by the length of each window on the chromosome and averaged, finally giving the total for each breed.
By weighted averaging of the classification results of all windows, the ancestry of the organism to be tested was found to be 61% Siberian Husky + 39% Greenland Sled Dog (see Fig. 4; the photograph in Fig. 4 is of the pet dog tested).
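The weighted calculation of step 9) may be illustrated by the following minimal Python sketch, in which the proportion of each breed is the sum of the weights of the windows classified as that breed. The function name `breed_composition`, the window lengths and the genome length are hypothetical. No normalization is applied in this sketch, so the proportions sum to one only if the window lengths add up to the genome length.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def breed_composition(windows: List[Tuple[str, int]], genome_length: int) -> Dict[str, float]:
    """Sum, per breed label b_i, the weights w_i = window length / WG."""
    composition: Dict[str, float] = defaultdict(float)
    for breed, window_length in windows:
        composition[breed] += window_length / genome_length
    return dict(composition)


# Hypothetical windows: (classification result, window length in bp).
windows = [("SiberianHusky", 30), ("GreenlandSledgeDog", 20),
           ("SiberianHusky", 31), ("GreenlandSledgeDog", 19)]
for breed, share in breed_composition(windows, genome_length=100).items():
    print(f"{breed}: {share:.0%}")
```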
10) The detection result obtained in step 9) is verified using principal component analysis.
Principal component analysis is a common method for reducing the dimensionality of data; it is implemented in many programming languages and gives a result directly from the input data. Principal component analysis is carried out in the following steps: 1) compute the covariance matrix of the input matrix; 2) compute the eigenvalues and eigenvectors of the covariance matrix obtained in the previous step; 3) select the two eigenvectors with the largest eigenvalues; and 4) project the input matrix onto these eigenvectors.
Specifically, at most 5 of the breeds obtained in step 9) are selected and clustered using the ppca function in the pcaMethods package in the R language.
Fig. 5 shows the principal component analysis result for the dog tested; the horizontal and vertical axes are the two most significant components. The Greenland Sled Dogs are at the upper left, the Siberian Huskies are at the lower right, and the dog tested is in the middle. It can be seen that the dog tested lies between the Greenland Sled Dogs and the Siberian Huskies, in agreement with the proportions obtained in step 9). That is, the detection result obtained in step 9) is verified as accurate.
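For illustration, the four principal component analysis steps listed above may be sketched in plain Python as follows. This is not the ppca function used in the example: the eigenvectors are approximated here by power iteration with deflation, which is a substitution made so that the sketch needs no external library, and all function names and data are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def center(data: Matrix) -> Matrix:
    """Subtract the column means so that every feature has mean zero."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]


def covariance(centered: Matrix) -> Matrix:
    """Step 1: sample covariance matrix of centered data (rows are samples)."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def dominant_eigenpair(matrix: Matrix, iterations: int = 200) -> Tuple[float, List[float]]:
    """Step 2, approximated: largest eigenvalue and its eigenvector by power iteration."""
    d = len(matrix)
    start = [float(i + 1) for i in range(d)]  # arbitrary non-zero start vector
    scale = math.sqrt(sum(x * x for x in start))
    vec = [x / scale for x in start]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * matrix[i][j] * vec[j] for i in range(d) for j in range(d))
    return value, vec


def pca_2d(data: Matrix) -> Matrix:
    """Steps 3 and 4: project the samples onto the two leading eigenvectors."""
    centered = center(data)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(2):
        value, vec = dominant_eigenpair(cov)
        components.append(vec)
        # Deflation removes the component just found, exposing the next one.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return [[sum(row[j] * comp[j] for j in range(d)) for comp in components]
            for row in centered]


# Hypothetical input: four samples with three features each.
samples = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
projected = pca_2d(samples)
print([[round(abs(x), 3) for x in point] for point in projected])
```

In this toy input the largest variance lies along the first feature and the second largest along the second feature, so the projection recovers those two coordinates.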
Using this method, the inventors estimated the breed composition of the pet dog from the data output by the sequencer, obtained the detection result within 2 hours and reported it to the owner of the pet dog.
Further, to verify the accuracy of the method, the inventors compared the breed composition calculated in this example with the owner's own description of the pet dog; the results showed that the two were highly consistent.
Specifically, the raw reads obtained from sequencing totalled 5.5 Gbp, about 2.3x, and the ancestry (ancestral composition) analysis result is shown in Fig. 4 (a three-generation breed family tree, comprising great-grandparents, grandparents and parents, inferred from the DNA data of the pet dog tested).
Example 2:
The ancestry of the organism to be tested was analyzed as described in Example 1.
The organism to be tested is a pet dog, which its owner reports to be a poodle. The final ancestry analysis result is shown in Fig. 6 (a three-generation family tree, comprising great-grandparents, grandparents and parents, inferred from the DNA data of the dog tested). As shown in Fig. 6, the pet dog tested is 100% Miniature Poodle (the photograph in Fig. 6 is of the pet dog tested).
In addition, the present invention has been widely used for reporting variety composition to pet dogs of applicant (warrior gene) internal users. 48 reports have been presented, including both inbred dogs and heterozygous dogs. It should be emphasized that based on the above practice, from the received saliva samples of the dogs, the method can give test reports within 1 week, and wherein the data analysis does not require a mainframe, and reports can be given on a personal computer (4 GB memory) at a time of 2 hours per sample.
Industrial applicability
The method for genotyping based on low-depth genome sequencing disclosed by the invention can effectively genotype the known mutation sites of a sample to be tested from low-depth genome sequencing data, and the resulting genotyping result can then be used to effectively analyze the lineage of the organism from which the sample is derived. In addition, the method has the advantages of low detection cost, a short detection period, and accurate and reliable detection results.
Although specific embodiments of the invention have been described in detail, those skilled in the art will appreciate that numerous modifications and substitutions of details are possible in light of all the teachings disclosed, and that such modifications fall within the scope of the present invention. The full scope of the invention is given by the appended claims and any equivalents thereof.
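The genotyping idea summarized above (short sequences matched against a reference sequence set, match counts stored in a hash table, and a Bayesian call) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the toy sequences, the k-mer length of 7, the error rate, the prior values, and all function names are hypothetical, and the binomial likelihood with match counts as evidence is one conventional reading of the Bayesian step.

```python
from collections import Counter
from math import comb

ERROR = 0.01  # ASSUMED per-read allele error rate (not from the patent)


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(site_id, allele_seqs, snp_pos, k):
    """Map every k-mer that covers the variant position to (site, allele)."""
    index = {}
    for allele, seq in allele_seqs.items():
        first = max(0, snp_pos - k + 1)
        last = min(snp_pos, len(seq) - k)
        for start in range(first, last + 1):
            index[seq[start:start + k]] = (site_id, allele)
    return index


def count_matches(reads, index, k):
    """Hash table: (site, allele) -> number of reads supporting it."""
    counts = Counter()
    for read in reads:
        # A read is counted once per (site, allele), however many k-mers hit.
        counts.update({index[s] for s in kmers(read, k) if s in index})
    return counts


def genotype_posterior(n_ref, n_alt, priors, error=ERROR):
    """Posterior over diploid genotypes from allele match counts."""
    p_alt = {"ref/ref": error, "ref/alt": 0.5, "alt/alt": 1.0 - error}
    n = n_ref + n_alt
    unnorm = {
        g: priors[g] * comb(n, n_alt) * p ** n_alt * (1.0 - p) ** n_ref
        for g, p in p_alt.items()
    }
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}


# Toy reference set: 10 bp flanks around one SNP (A = ref, G = alt).
up, down = "ACGTTGCAAT", "GGCATCGTTA"
alleles = {"A": up + "A" + down, "G": up + "G" + down}
index = build_index("snp1", alleles, snp_pos=10, k=7)

# Three toy reads: two carry A, one carries G.
reads = [alleles["A"][5:16], alleles["A"][8:19], alleles["G"][6:17]]
counts = count_matches(reads, index, k=7)

priors = {"ref/ref": 0.49, "ref/alt": 0.42, "alt/alt": 0.09}  # ASSUMED
posterior = genotype_posterior(counts[("snp1", "A")], counts[("snp1", "G")], priors)
best = max(posterior, key=posterior.get)
print(dict(counts), best)
```

Only k-mers that cover the variant position are indexed, because k-mers lying entirely in the shared flanks cannot distinguish the alleles. In practice the short-sequence length would be at most 50 bp (35 bp in one claimed embodiment) rather than the toy value used here.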
In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "illustrative embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
SEQUENCE LISTING
<110> Shenzhen Hua institute of great life science
<120> Methods, devices and uses for genotyping based on low depth genome sequencing
<130> PIOC3170836PCN
<160> 2
<170> PatentIn version 3.3
<210> 1
<211> 121
<212> DNA
<213> Artificial
<220>
<223> BICF SNP-index of base A corresponding to the site of G630100019
<400> 1
gcgacaaggg ttttggtgaa tgtctgcaaa gagcagcgac agcacattct gttacaatta 60
agaacaaaat attaagatca tatctaaagt gtcctggcaa attgcatgcc accaatcaat 120
t 121
<210> 2
<211> 121
<212> DNA
<213> Artificial
<220>
<223> BICF SNP-index of base G corresponding to the site of G630100019
<400> 2
gcgacaaggg ttttggtgaa tgtctgcaaa gagcagcgac agcacattct gttacaatta 60
ggaacaaaat attaagatca tatctaaagt gtcctggcaa attgcatgcc accaatcaat 120
t 121
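
The two listed sequences are allele-specific reference sequences for the same site. As a quick check, the sketch below compares them base by base; the variable names are illustrative, and the sequences are copied from SEQ ID NO: 1 and SEQ ID NO: 2 above with the grouping spaces removed.

```python
# SEQ ID NO: 1 (base A) and SEQ ID NO: 2 (base G), copied from the listing.
seq_a = "".join("""
gcgacaaggg ttttggtgaa tgtctgcaaa gagcagcgac agcacattct gttacaatta
agaacaaaat attaagatca tatctaaagt gtcctggcaa attgcatgcc accaatcaat
t
""".split())

seq_g = "".join("""
gcgacaaggg ttttggtgaa tgtctgcaaa gagcagcgac agcacattct gttacaatta
ggaacaaaat attaagatca tatctaaagt gtcctggcaa attgcatgcc accaatcaat
t
""".split())

# Collect (1-based position, base in SEQ 1, base in SEQ 2) for every mismatch.
differences = [
    (i + 1, a, g) for i, (a, g) in enumerate(zip(seq_a, seq_g)) if a != g
]

print(len(seq_a), len(seq_g), differences)  # prints: 121 121 [(61, 'a', 'g')]
```

The single difference sits at position 61, so each sequence carries 60 bases of flanking sequence on either side of the variant.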

Claims (20)

1. A method of performing a lineage analysis on an organism, comprising:
(1) performing low-depth genome sequencing on the genome of a sample of an organism to be tested, and genotyping at least one known mutation site of the organism to be tested, the known mutation site comprising a site known to have a single nucleotide polymorphism, fragment sequence insertion and deletion;
(2) determining the lineage of the organism based on the genotyping result,
wherein
the genotyping in step (1) comprises:
(a) performing low-depth genome sequencing on the whole genome of the organism to be tested so as to obtain a sequencing result consisting of a plurality of sequencing data;
(b) constructing, for at least one known mutation site, a reference sequence set of known mutation sites, the reference sequence set containing the mutation type of the known mutation site and the sequences upstream and downstream of the mutation site;
(c) dividing the sequencing data of step (a) into a plurality of short sequences of equal length, each no more than 50 bp, and comparing the sequencing result thus obtained with the reference sequence set so as to determine a comparison result for each known mutation site, wherein the comparison result comprises the mutation type matched by the sequencing result and the number of matches for that mutation type; and
(d) determining the high-probability mutation type of the known mutation site based on the comparison result and a Bayesian model,
in step (2), the lineage of the organism is determined based on predetermined characteristic genotyping of near sources of the organism, and step (2) further comprises:
(e) for at least one of the known mutation sites, scoring at least one candidate near source of the organism based on the high-probability mutation type of the organism to be tested and the known mutation type of the at least one candidate near source, so as to determine a similarity value for each candidate near source;
(f) dividing at least a portion of the genomic sequence of the organism to be tested into a plurality of windows, each of the plurality of windows containing at least one of the known mutation sites; and
classifying at least a portion of the plurality of windows based on the similarity value of each candidate near source, so as to determine the candidate near source corresponding to each of the at least a portion of the plurality of windows;
(g) determining, for each of the candidate near sources, the distance spanned on the genomic sequence of the organism to be tested by the known mutation sites corresponding to that candidate near source; and
determining a lineage weight for each of the candidate near sources based on the distance;
(h) determining the lineage of the organism to be tested based on the lineage weight of each candidate near source.
2. The method of claim 1, wherein the low depth genome sequencing is high throughput sequencing, and the sequencing depth is no more than 5.
3. The method of claim 2, wherein the sequencing depth is no more than 3.
4. The method of claim 1, wherein the short sequence is 35bp in length.
5. The method of claim 1, wherein the Bayesian model uses a predetermined occurrence probability of the mutation type of the predetermined known mutation site as the prior probability and uses the comparison result obtained in step (c) as the posterior probability.
6. The method of claim 1, further comprising constructing the comparison result as a hash table, wherein the mutation type is the key and the number of matches is the value.
7. The method of claim 1, wherein the organism is an animal.
8. The method of claim 7, wherein the animal comprises a domestic cat or a domestic dog.
9. The method of claim 1, wherein the classifying is performed by at least one of a random forest model, a support vector machine, and a naive Bayes classifier.
10. The method of claim 1, further comprising, after determining the lineage weight for each near source:
obtaining the breed composition of the organism to be tested through weighted calculation, and verifying the obtained breed composition by a cluster analysis method, so as to determine the lineage of the organism to be tested based on the lineage weight of each candidate near source.
11. A system for performing a lineage analysis on an organism, comprising:
a genotyping device for performing low-depth genome sequencing on the genome of a sample of an organism to be tested and genotyping at least one known mutation site of the organism to be tested, the known mutation site comprising a site known to have a single nucleotide polymorphism, fragment sequence insertion and deletion;
a lineage determining device connected to the genotyping device for determining the lineage of the organism based on the genotyping result,
wherein the genotyping device includes:
a sequencing unit for performing low-depth genome sequencing on the whole genome of the organism to be tested so as to obtain a sequencing result consisting of a plurality of sequencing data;
a reference sequence set construction unit for constructing, for at least one known mutation site, a reference sequence set of known mutation sites, the reference sequence set containing the mutation type of the known mutation site and the sequences upstream and downstream of the mutation site;
a comparison unit connected to the sequencing unit and to the reference sequence set construction unit, for receiving the sequencing result from the sequencing unit, dividing the sequencing data into a plurality of short sequences of equal length, each no more than 50 bp, and comparing the sequencing result with the reference sequence set so as to determine a comparison result for each known mutation site, wherein the comparison result comprises the mutation type matched by the sequencing result and the number of matches for that mutation type; and
a high-probability mutation type determining unit connected to the comparison unit, for determining the high-probability mutation type of the known mutation site based on the comparison result and a Bayesian model,
in the lineage determining device, the lineage of the organism is determined based on predetermined characteristic genotyping of near sources of the organism, and the lineage determining device further includes:
a similarity value determination unit adapted to score, for at least one of the known mutation sites, at least one candidate near source of the organism based on the high-probability mutation type of the organism to be tested and the known mutation type of the at least one candidate near source, so as to determine a similarity value for each candidate near source;
a near source determining unit adapted to perform the steps of:
dividing at least a portion of the genomic sequence of the organism to be tested into a plurality of windows, each of the plurality of windows containing at least one of the known mutation sites; and
classifying at least a portion of the plurality of windows based on the similarity value of each candidate near source, so as to determine the candidate near source corresponding to each of the at least a portion of the plurality of windows;
a lineage weight determining unit configured to determine, for each of the candidate near sources, the distance spanned on the genomic sequence of the organism to be tested by the known mutation sites corresponding to that candidate near source; and
to determine a lineage weight for each of the candidate near sources based on the distance;
the lineage determining device determines the lineage of the organism to be tested based on the lineage weights of the candidate near sources obtained by the lineage weight determining unit.
12. The system of claim 11, wherein in the sequencing unit, the low depth genome sequencing is high throughput sequencing, the sequencing depth not exceeding 5.
13. The system of claim 11, wherein the sequencing depth is no more than 3.
14. The system of claim 11, wherein the short sequence is 35bp in length.
15. The system of claim 11, wherein the Bayesian model uses a predetermined occurrence probability of the mutation type of the predetermined known mutation site as the prior probability, and uses the comparison result as the posterior probability.
16. The system of claim 15, further comprising constructing the comparison result as a hash table, wherein the mutation type is the key and the number of matches is the value.
17. The system of claim 11, wherein the organism is an animal.
18. The system of claim 17, wherein the animal comprises a domestic cat or a domestic dog.
19. The system of claim 11, wherein in the near source determining unit, the classification is performed by at least one of a random forest model, a support vector machine, and a naive Bayes classifier.
20. The system of claim 11, further comprising, after determining the lineage weight for each near source:
obtaining the breed composition of the organism to be tested through weighted calculation, and verifying the obtained breed composition by a cluster analysis method, so as to determine the lineage of the organism to be tested based on the lineage weight of each candidate near source.
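
The window-based steps recited in the claims (scoring similarity per candidate near source, assigning each window to a candidate, and weighting each candidate by the genomic distance it covers) can be sketched as below. This is a toy illustration under several assumptions, not the claimed implementation: similarity is taken as the number of shared alleles, a simple highest-score rule stands in for the random forest, support vector machine, or naive Bayes classifier named in the claims, "distance" is read as the span between the first and last known site in a window, and the breed labels, positions, and genotypes are invented.

```python
from collections import Counter, defaultdict


def shared_alleles(g1, g2):
    """Alleles shared by two unphased diploid genotypes, e.g. 'AG' vs 'GG' -> 1."""
    return sum((Counter(g1) & Counter(g2)).values())


def lineage_weights(sites, panels, window_size):
    """sites: [(position, genotype)] for the tested organism.
    panels: {candidate: {position: genotype}} for each candidate near source.
    Returns {candidate: fraction of covered distance assigned to it}."""
    windows = defaultdict(list)
    for pos, gt in sites:
        windows[pos // window_size].append((pos, gt))

    span = defaultdict(int)
    for members in windows.values():
        # Similarity value of each candidate within this window.
        scores = {
            cand: sum(shared_alleles(gt, ref[pos]) for pos, gt in members if pos in ref)
            for cand, ref in panels.items()
        }
        # ASSUMED classifier: pick the highest score (ties go to the first candidate).
        best = max(scores, key=scores.get)
        positions = [pos for pos, _ in members]
        span[best] += max(positions) - min(positions) + 1

    total = sum(span.values())
    return {cand: length / total for cand, length in span.items()}


# Toy data: four windows of two sites each; three resemble breed_x, one breed_y.
positions = [10, 60, 110, 160, 210, 260, 310, 360]
sites = [(p, "AA") for p in positions[:6]] + [(p, "GG") for p in positions[6:]]
panels = {
    "breed_x": {p: "AA" for p in positions},  # hypothetical candidate
    "breed_y": {p: "GG" for p in positions},  # hypothetical candidate
}

weights = lineage_weights(sites, panels, window_size=100)
print(weights)  # prints: {'breed_x': 0.75, 'breed_y': 0.25}
```

Weighting by covered distance rather than by window count means that a candidate assigned to long, well-covered stretches of the genome contributes more than one assigned to a few isolated sites.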
CN201780093812.7A 2017-09-08 2017-09-08 Method, device and application of genotyping based on low-depth genome sequencing Active CN110997936B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/101128 WO2019047181A1 (en) 2017-09-08 2017-09-08 Method for genotyping on the basis of low-depth genome sequencing, device and use

Publications (2)

Publication Number Publication Date
CN110997936A (en) 2020-04-10
CN110997936B (en) 2024-05-10

Family

ID=65635230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780093812.7A Active CN110997936B (en) 2017-09-08 2017-09-08 Method, device and application of genotyping based on low-depth genome sequencing

Country Status (2)

Country Link
CN (1) CN110997936B (en)
WO (1) WO2019047181A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883207B (en) * 2020-07-31 2022-08-16 武汉蓝沙医学检验实验室有限公司 Identification method of biological genetic relationship
CN113637747B (en) * 2021-06-21 2023-02-03 深圳思勤医疗科技有限公司 Method for determining SNV and tumor mutation load in nucleic acid sample and application
CN113186255A (en) * 2021-05-12 2021-07-30 深圳思勤医疗科技有限公司 Method and device for detecting nucleotide variation based on single molecule sequencing
CN113470746B (en) * 2021-06-21 2023-11-21 广州市金域转化医学研究院有限公司 Method for reducing artificially introduced error mutation in high-throughput sequencing and application thereof
CN113327646B (en) * 2021-06-30 2024-04-23 南京医基云医疗数据研究院有限公司 Sequencing sequence processing method and device, storage medium and electronic equipment
CN116168763A (en) * 2022-09-06 2023-05-26 安诺优达基因科技(北京)有限公司 Method and device for grouping and assembling autotetraploid genome, method and device for constructing chromosome and application of method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101539967A (en) * 2008-12-12 2009-09-23 深圳华大基因研究院 Method for detecting mononucleotide polymorphism
CN106755300A (en) * 2016-11-17 2017-05-31 中国科学院华南植物园 A kind of method for recognizing Kiwi berry hybrid strain to filial generation genome contribution proportion

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9916416B2 (en) * 2012-10-18 2018-03-13 Virginia Tech Intellectual Properties, Inc. System and method for genotyping using informed error profiles
US20170213127A1 (en) * 2016-01-24 2017-07-27 Matthew Charles Duncan Method and System for Discovering Ancestors using Genomic and Genealogic Data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Ariel W. Chan et al. Evaluating Imputation Algorithms for Low-Depth Genotyping-By-Sequencing (GBS) Data. PLOS ONE. 2016, pp. 1-17. *
Jean-Simon Brouard et al. Low-depth genotyping-by-sequencing (GBS) in a bovine population: strategies to maximize the selection of high quality genotypes and the accuracy of imputation. BMC Genetics. 2017, Vol. 18, pp. 1-14. *
Ruiqiang Li et al. SNP detection for massively parallel whole-genome resequencing. Genome Research. 2009, Vol. 19, pp. 1124-1132. *

Also Published As

Publication number Publication date
CN110997936A (en) 2020-04-10
WO2019047181A1 (en) 2019-03-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant