Disclosure of Invention
In view of the above, it is desirable to provide an identity relationship identification method, apparatus, device and storage medium capable of improving validity and reliability of identification results.
An identity relationship identification method comprises the following steps:
step S1: obtaining sample mutation information obtained by comparing and analyzing sequencing results;
step S2: searching a plurality of target SNP sites one by one in the mutation sites of the sample mutation information to obtain the mutation and sequencing information of the genotype containing each target SNP site;
the plurality of SNP sites of interest is selected from the following 984 SNP sites:
step S3: selecting a target SNP site meeting preset requirements from a plurality of target SNP sites and mutation and sequencing information thereof to obtain identification information;
step S4: comparing the genotypes of the corresponding target SNP sites in the identification information of different samples, and identifying the identity relationship of the different samples.
An identity relationship authentication apparatus comprising:
the mutation information acquisition module is used for acquiring sample mutation information obtained by comparing and analyzing the sequencing result;
the target SNP information retrieval module is used for retrieving a plurality of target SNP sites one by one in the mutation sites of the sample mutation information to obtain the mutation and sequencing information of the genotype containing each target SNP site; the plurality of SNP sites of interest is selected from the following 984 SNP sites:
the identification information selection module is used for selecting a target SNP site meeting preset requirements from a plurality of target SNP sites and mutation and sequencing information of the target SNP site to obtain identification information; and
the identity relationship identification module is used for comparing the genotypes of the corresponding target SNP sites in the identification information of different samples and identifying the identity relationship of different samples.
A computer device having a processor and a memory, the memory storing a computer program, the processor implementing the steps of the identity relationship authentication method according to any one of the above embodiments when executing the computer program.
A computer storage medium having a computer program stored thereon, the computer program when executed implementing the steps of the identity relationship authentication method according to any one of the above embodiments.
According to the identity relationship identification method, apparatus, device and storage medium described above, identity relationship identification proceeds as follows: the target SNP sites are searched one by one in the mutation sites of the sample mutation information to obtain the mutation and sequencing information, including the genotype, of each target SNP site; the target SNP sites meeting the preset requirements, together with their mutation and sequencing information, are selected to obtain the identification information; finally, the genotypes of the corresponding target SNP sites in the identification information of different samples are compared to identify the identity relationship of the different samples. Because the mutation rate of SNPs is extremely low, and because the influence of any single target SNP on the final result is limited even if it does mutate, identifying identity relationships with SNPs can markedly improve the validity and reliability of the identification result compared with the traditional method based on STR detection.
Further, research has found that individual identification and paternity testing are perfect-match tests with extremely low tolerance for mismatches, so 20 STRs generally achieve a good identification effect for these two purposes. For other kinship testing, however, not all sites are expected to match, which introduces large random errors. For example, between a grandparent and a grandchild, 50% of sites at the haplotype level have different sources, so on average 10 of the 20 STRs match only as they would in a random population. With so few sites, the number of sites that actually match fluctuates widely at random, and the identification effect is very poor. The number of SNPs in the human genome is very large (the 1000 Genomes Project reported about 80 million human polymorphic SNPs, with about 3.5 to 4 million per person on average), so SNPs can provide better support for the identification of various identity relationships. SNPs can be used for paternity testing, individual identification, and kinship testing other than paternity testing, with small error and high reliability.
Furthermore, the traditional STR detection method is DNA fragment analysis, which is not a conventional DNA sequencing method. Most STRs are located in intergenic regions, many of which are currently considered non-functional, and general sequencing projects do not cover these regions. If identity relationships need to be identified in such projects, an additional STR detection experiment is therefore often required, which is time-consuming and labor-intensive and increases project cost. By contrast, a large number of SNPs lie in gene exons and other functional non-coding regions, so the SNPs already detected in most research and clinical sequencing projects can be reused for identity relationship identification, and various identity relationships can be identified without additional experiments. The identity relationship identification method therefore saves time and reduces detection cost.
Detailed Description
To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The identification of identity relationships includes individual identification, paternity (parent-child) identification, and identification of kinship other than parent-child relationships, such as grandparent-grandchild, uncle-nephew, sibling, and cousin relationships, and the like; the "population mutation frequency" of an SNP site refers to the frequency, in a specific population (e.g., the Chinese population), of a base at the SNP site that is inconsistent with the reference sequence; the "allele population frequency" of an SNP site refers to the frequency of each allele of the SNP site in a specific population (e.g., the Chinese population); "mutation quality" refers to the default quality control standard given by GATK (or other mutation analysis software); a "read" refers to a sequencing sequence generated by a high-throughput sequencing platform (such as the various second-generation sequencing platforms); "sequencing coverage" refers to the number of reads covering a sequenced site.
As shown in fig. 1, an embodiment of the present invention provides an identity relationship identification method, which includes the following steps:
Step S110: obtaining sample mutation information obtained by comparing and analyzing the sequencing result.
Each sample can be sequenced using, but not limited to, next-generation sequencing to obtain a sequencing result. The sequencing result can then be aligned to a human reference genome and analyzed to obtain a mutation file containing the sample mutation information of the sample. The sample mutation information includes information such as mutation sites, mutation frequency, and mutation quality. A mutation is defined relative to the reference genome, i.e., it is a variation in the sequencing result that differs from the sequence of the corresponding region or site on the reference genome.
Step S120: searching a plurality of target SNP sites one by one in the mutation sites of the sample mutation information to obtain the mutation and sequencing information, including the genotype, of each target SNP site.
The target SNP sites are preferably located in autosomal exons or functional non-coding regions, with an allele population frequency between 0.45 and 0.55. For the identification of parent-child relationships (father-child, mother-child), searching 100 SNP sites generally achieves an accuracy of about 99.999%, and 960 sites can achieve an accuracy of about $(100 - 10^{-53})\%$; for paternity testing, therefore, the number of target SNP sites may be required to be not less than 100. For other kinship identification, as analyzed later, the relationship is inferred from the expected number of unmatched sites regardless of how many target SNP sites are used; although the judgment cannot be 100% certain, the reliability is still high, and in general the more target SNP sites there are, the more reliable the result. For example, with not less than 720 target SNP sites, kinship at the cousin level can be identified; with not less than 480, kinship at the grandparent-grandchild or uncle-nephew level can be identified; and with not less than 240, kinship at the sibling level can be identified. For individual identification, the genotypes of the target SNP sites must match completely, and the number of target SNP sites is generally required to be not less than 50.
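As a rough numerical check of the paternity figures above, the following sketch assumes that every target SNP site has the maximum single-site non-paternal exclusion rate of 0.125 (allele population frequency 0.5) and that the sites are independent. Both assumptions, and the function name, are illustrative only.

```python
import math

def log10_non_exclusion(n_sites, pe=0.125):
    """log10 of the chance that an unrelated pair is excluded at none of n_sites.

    Assumes independent sites that all share the same exclusion rate `pe`.
    """
    return n_sites * math.log10(1.0 - pe)

for n in (100, 960):
    # A more negative value means a smaller chance of a false parent-child call.
    print(n, round(log10_non_exclusion(n), 2))
```

Under these assumptions the non-exclusion probability is on the order of 10^-6 for 100 sites and 10^-56 for 960 sites, which is in line with the orders of magnitude quoted in the paragraph above.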
In one specific example, the plurality of target SNP sites may be selected from 984 target SNP sites that are located on autosomal exons, have an allele population frequency between 0.45 and 0.55 in the Chinese population, and are covered by most gene exon sequencing projects, as shown in Table 1 below.
TABLE 1
Note: the reference sequence for the SNP sites is hg19. For example, in a target SNP site represented by "10|101293035|C|A", "|" is the field separator, "10" is the chromosome number, "101293035" is the coordinate position on that chromosome, "C" is the base that is identical to the corresponding site on the reference genome, and "A" is the other base, which differs from the corresponding site on the reference genome; the same applies to the other target SNP sites.
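The site notation described in the note can be parsed mechanically. The sketch below is a minimal illustration; the `TargetSnp` type and `parse_target_snp` function are hypothetical names introduced here, not part of the described method.

```python
from typing import NamedTuple

class TargetSnp(NamedTuple):
    chrom: str  # chromosome number, kept as text
    pos: int    # coordinate on the chromosome (hg19)
    ref: str    # base identical to the reference genome
    alt: str    # the other base, differing from the reference

def parse_target_snp(key: str) -> TargetSnp:
    """Split a 'chrom|pos|ref|alt' key on the '|' separator."""
    chrom, pos, ref, alt = key.strip().split("|")
    return TargetSnp(chrom, int(pos), ref, alt)

site = parse_target_snp("10|101293035|C|A")
print(site)
```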
When the plurality of target SNP sites are searched one by one in the mutation sites of the sample mutation information, the current target SNP may or may not be found. For a target SNP that is found in the mutation sites of the sample mutation information, the genotype of the target SNP site is a homozygote or heterozygote that is inconsistent with the reference genotype, and the obtained mutation and sequencing information includes the genotype, allele population frequency, mutation quality, and sequencing coverage of the target SNP site. For a target SNP that is not found in the mutation sites of the sample mutation information, the genotype of the target SNP site is a homozygote consistent with the reference genotype, and the obtained mutation and sequencing information includes the genotype, allele population frequency, and sequencing coverage of the target SNP site. Information such as sequencing coverage can be calculated from the sequencing alignment file (such as a bam file) of the sample for the target SNP site currently being searched.
Taking the reference allele as R and the mutant allele as V as an example, a human is diploid, and for a target SNP which can be searched in the mutation site of the sample mutation information, the genotype of the current target SNP site of the sample is VV (homozygous) or RV (heterozygous), and for a target SNP which cannot be searched in the mutation site of the sample mutation information, the genotype of the current target SNP site of the sample is RR.
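The lookup rule above can be sketched as follows. The data layout is an assumption made for illustration: `sample_variants` is a hypothetical dictionary that maps a site key to `"het"` or `"hom"` for every site present in the sample's mutation information.

```python
def call_genotype(site_key, sample_variants):
    """Return 'RR', 'RV' or 'VV' for one target SNP site.

    sample_variants: dict of site key -> "het" or "hom" (hypothetical layout).
    A site missing from the dict is treated as homozygous reference.
    """
    zygosity = sample_variants.get(site_key)
    if zygosity is None:
        return "RR"  # not found among the mutation sites
    return "RV" if zygosity == "het" else "VV"

variants = {"10|101293035|C|A": "het"}
print(call_genotype("10|101293035|C|A", variants))
print(call_genotype("1|12345|G|T", variants))
```

Treating an absent site as RR is only safe when the site is adequately covered, which is why the coverage and quality check of step S130 follows.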
Step S130: selecting the target SNP sites meeting preset requirements from the plurality of target SNP sites and their mutation and sequencing information to obtain identification information.
Specifically, meeting the preset requirements means that the sequencing coverage exceeds 30 reads and the mutation quality meets the default quality control standard of GATK.
The default quality control criteria for GATK are QD > 2.0, MQ > 40.0, FS < 60.0, HaplotypeScore < 60.0, MQRankSum > -12.5, and ReadPosRankSum > -8.0.
Reliability analysis of the plurality of target SNP sites screens out the high-quality sites shared by the samples and prevents unreliable sites from affecting the judgment. In other words, the typing of a target SNP site requires sufficient coverage and quality control; otherwise random typing errors are likely. For example, suppose that at a target SNP site the father is type AA and the son is type AT. If the son's coverage at the site is low or the quality is poor, for example with only 5 reads, it is quite possible that all 5 reads happen to be T, or that A is not detected because of poor quality, so that the son is finally typed as TT.
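A minimal sketch of this screening step is given below. The dictionary layout and the function name are assumptions for illustration; the thresholds are copied from the criteria listed above, and "more than 30 reads" is read here as strictly greater than 30.

```python
def passes_quality(info, min_coverage=30):
    """Check one target SNP site against the preset requirements.

    info: dict with 'coverage' and, for sites found in the mutation
    information, an 'annotations' dict of GATK annotation values
    (hypothetical layout).
    """
    if info["coverage"] <= min_coverage:  # requires more than 30 reads
        return False
    ann = info.get("annotations")
    if ann is None:
        # Reference-homozygous site: no mutation quality to evaluate.
        return True
    return (ann["QD"] > 2.0 and ann["MQ"] > 40.0 and ann["FS"] < 60.0
            and ann["HaplotypeScore"] < 60.0
            and ann["MQRankSum"] > -12.5 and ann["ReadPosRankSum"] > -8.0)

good = {"QD": 15.0, "MQ": 60.0, "FS": 1.2, "HaplotypeScore": 0.5,
        "MQRankSum": 0.1, "ReadPosRankSum": 0.3}
print(passes_quality({"coverage": 45, "annotations": good}))
print(passes_quality({"coverage": 5}))
```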
Step S140: comparing the genotypes of the corresponding target SNP sites in the identification information of different samples, and identifying the identity relationship of the different samples.
And summarizing the genotypes of the selected target SNP sites meeting the preset requirement to obtain identification information, and generating an identification information file in the utag format.
The identification information file can be used for individual identification, relationship identification and the like.
Taking the parent-child relationship as an example, the non-paternal exclusion rate of a single target SNP site is $PE = 2p^2(1-p)^2$, where $p$ is the allele population frequency of the target SNP site; the maximum value of 0.125 is obtained when $p = 0.5$. When $p$ is between 0.45 and 0.55, the lowest non-paternal exclusion rate of a single target SNP site is 0.1225125. For 984 target SNP sites, the non-paternal exclusion rate obtained by the method of the invention is far higher than that of the conventional paternity test method using 20 STR sites.
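The single-site exclusion rate, 2p²(1-p)² with p the allele population frequency, can be checked numerically. The sketch below also combines sites into a cumulative exclusion rate under the assumption that sites are independent; that combination rule and both function names are illustrative assumptions rather than something stated in the text.

```python
def exclusion_rate(p):
    """Non-paternal exclusion rate of one biallelic SNP with allele frequency p."""
    return 2 * p**2 * (1 - p)**2

def combined_exclusion(freqs):
    """Cumulative exclusion rate over several sites, assuming independence."""
    not_excluded = 1.0
    for p in freqs:
        not_excluded *= 1.0 - exclusion_rate(p)
    return 1.0 - not_excluded

print(exclusion_rate(0.5))   # maximum of the single-site rate
print(exclusion_rate(0.45))  # lowest rate within the 0.45 to 0.55 range
print(combined_exclusion([0.5] * 100))
```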
In a specific example, as shown in fig. 2, step S140 includes:
Step S141: judging whether individual identification or kinship identification is to be carried out on the different samples; if individual identification, executing step S142, otherwise executing step S143.
Step S142: comparing the genotypes of all the corresponding target SNP sites of different samples, and analyzing whether the different samples belong to the same individual according to the comparison result.
In principle, individual identification requires the genotypes of all corresponding target SNP sites to be completely consistent before the samples are judged to come from the same individual. However, when a large number of target SNP sites are compared and the genotypes of only a very small number are inconsistent, those sites can be analyzed case by case, for example where the sample DNA is degraded or where an SNP of the tested individual mutated during embryonic differentiation. Mutations arising during embryonic differentiation may cause slight genetic differences between tissues of the same person, and the samples used for individual identification may come from different tissues; although the probability is very low, such differences do exist, and they generally do not affect the judgment of individual identification.
Step S143: counting the number of matched target SNP sites and/or the number of unmatched target SNP sites among the corresponding target SNP sites of different samples.
The matched target SNP site refers to a target SNP site with at least one allele being the same in different samples. The unmatched target SNP site refers to a target SNP site in which two alleles of different samples are different. And the sum of the number of the matched target SNP sites and the number of the unmatched target SNP sites is equal to the total number of the target SNP sites in the identification information.
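The counting rule can be sketched as follows, assuming each sample's identification information is available as a dictionary from site key to a two-letter genotype string. That layout and the function name are illustrative assumptions.

```python
def count_matches(geno_a, geno_b):
    """Count matched and unmatched target SNP sites shared by two samples.

    geno_a, geno_b: dict of site key -> genotype such as "AA" or "AT".
    A site is matched when the two genotypes share at least one allele,
    and unmatched when both alleles differ.
    """
    matched = unmatched = 0
    for site in geno_a.keys() & geno_b.keys():  # sites present in both samples
        if set(geno_a[site]) & set(geno_b[site]):
            matched += 1
        else:
            unmatched += 1
    return matched, unmatched

sample_1 = {"s1": "AA", "s2": "AT", "s3": "CC"}
sample_2 = {"s1": "TT", "s2": "TT", "s3": "CC", "s4": "GG"}
print(count_matches(sample_1, sample_2))
```

Only sites that passed screening in both samples are compared, so the two counts always sum to the number of shared sites.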
Step S144: judging whether paternity identification or kinship identification other than paternity identification is to be carried out on the different samples; if paternity identification, executing step S145, otherwise executing step S146.
Step S145: calculating the paternity index of each matched target SNP site according to the genotype of the target SNP site and the corresponding allele population frequency, determining a comprehensive paternity index from the paternity indexes of the matched target SNP sites, and analyzing whether the different samples are in a parent-child relationship according to the comprehensive paternity index.
The paternity index PI of each matched target SNP locus is calculated according to the following formula:
where $p_i$ is the frequency of the matching allele, and PI is taken as the sum over all cases that can be matched. The comprehensive paternity index CPI is the product of all PI values.
Whether the different samples are in a parent-child relationship can be analyzed according to the comprehensive paternity index CPI: if the CPI is greater than 1000, the samples can be judged to be in a parent-child relationship.
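The decision rule on the CPI can be sketched as below. Because the per-site PI formula is not reproduced here, the sketch takes already computed, positive PI values as input; summing logarithms instead of multiplying directly is a design choice of this example to stay numerically stable over hundreds of sites, not part of the described method.

```python
import math

def is_parent_child(pi_values, threshold=1000.0):
    """Decide a parent-child relationship from per-site paternity indexes.

    pi_values: iterable of positive PI values, one per matched target SNP site.
    The CPI is the product of all PI values; it is compared with the
    threshold in log10 space to avoid overflow.
    """
    log_cpi = sum(math.log10(pi) for pi in pi_values)
    return log_cpi > math.log10(threshold)

print(is_parent_child([2.0] * 10))  # CPI = 1024
print(is_parent_child([2.0] * 9))   # CPI = 512
```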
Step S146: analyzing the kinship other than parent-child relationships according to the number of unmatched target SNP sites.
It is understood that in other embodiments, step S140 may perform only one or two of individual identification, paternity testing, and kinship testing other than paternity testing. Accordingly, as one specific example, step S140 includes: judging whether individual identification is to be carried out on the different samples, and if so, comparing the genotypes of all corresponding target SNP sites of the different samples and analyzing whether the different samples belong to the same individual according to the comparison result. As another specific example, step S140 includes: judging whether paternity testing is to be carried out on the different samples, and if so, calculating the paternity index of each matched target SNP site according to the genotype of the target SNP site and the corresponding allele population frequency, determining a comprehensive paternity index from the paternity indexes of the matched target SNP sites, and analyzing whether the different samples are in a parent-child relationship according to the comprehensive paternity index, wherein a matched target SNP site is a target SNP site at which the different samples share at least one allele. As yet another specific example, step S140 includes: judging whether kinship testing other than paternity testing is to be carried out on the different samples, and if so, analyzing the kinship according to the number of unmatched target SNP sites, wherein an unmatched target SNP site is a target SNP site at which both alleles differ between the different samples.
More specifically, in one example, if the number of unmatched target SNP sites ≈ the total number of target SNP sites/16, the relationship can be considered grandparent-grandchild, uncle-nephew, or the like; if the number of unmatched target SNP sites ≈ the total number of target SNP sites/32, the relationship can be considered sibling.
Here, the following concept is introduced: unrelated sites at the haploid level between two samples, i.e., sites with no genetic relationship. Only unrelated sites can cause target SNPs to be unmatched between two samples. For an SNP with a population frequency of 0.5, the proportions of the three genotypes AA, BB, and AB are 0.25, 0.25, and 0.5, respectively, and the SNP is unmatched if and only if the two samples are AA and BB, which occurs with probability 2 × 0.25 × 0.25 = 0.125, i.e., 1/8, the maximum non-paternal exclusion rate of a single SNP site.
Taking a total of 960 corresponding target SNP sites between two samples, all with an allele population frequency of 0.5, as an example, the number of unrelated sites, and hence of unmatched target SNP sites, under different kinship relationships is derived below:
① A son inherits one complete set of chromosomes from his father, so the number of unrelated sites between father and son is 0;
② Because crossover between non-sister chromatids during meiosis leads to genetic recombination, an expected 0.5 of the chromosomes that the son inherits from the father come from the grandfather, so the number of unrelated sites between grandfather and grandson is 960 × 0.5 = 480;
③ An expected 0.5 of the chromosomes that the son inherits from the father come from the grandmother, and the grandfather and grandmother each pass 50% of their chromosomes to the uncle, so the probability shared with the uncle is 0.5 × 50% + 0.5 × 50% = 0.5, and the number of unrelated sites between uncle and nephew is 960 × (1 - 0.5) = 480;
④ Between siblings, a site is unrelated only when both alleles differ in origin. For example, if the father is Aa, the mother is Bb, and the elder brother is AB, then when the younger brother is AB, Ab, or aB, a combination that shares an allele of common origin with AB, the target SNP is a related (matched) site; only the fully crossed combination counts as unrelated, with probability 0.5 × 0.5 = 0.25, so the number of unrelated sites between siblings is 960 × 0.25 = 240;
⑤ As calculated above, the probability that an uncle and nephew share the same chromosome segment is 0.5, and the probability that this segment is passed from the uncle to the cousin is 0.5, giving 0.25 between cousins, so the number of unrelated sites between cousins is 960 × (1 - 0.25) = 720;
⑥ Similarly, for the more distant uncle-nephew relationship (for example, a father's cousin and the son), the number of unrelated sites is 960 × (1 - 0.125) = 840.
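The arithmetic in items ① to ⑥ can be reproduced with a short sketch. The shared fractions are taken from the items above (siblings share 1 - 0.25 = 0.75), and the expected number of unmatched sites multiplies the unrelated sites by the 1/8 mismatch probability that applies when every allele population frequency is 0.5. The dictionary keys and the function name are illustrative.

```python
# Fraction of sites related at the haploid level, per items 1 to 6 above.
SHARED_FRACTION = {
    "item 1: father-son": 1.0,
    "item 2: grandfather-grandson": 0.5,
    "item 3: uncle-nephew": 0.5,
    "item 4: siblings": 0.75,
    "item 5: cousins": 0.25,
    "item 6": 0.125,
}

def expected_counts(total_sites, shared, mismatch_prob=0.125):
    """Return (unrelated sites, expected unmatched sites) for one relationship."""
    unrelated = total_sites * (1 - shared)
    return unrelated, unrelated * mismatch_prob

for name, shared in SHARED_FRACTION.items():
    print(name, expected_counts(960, shared))
```

For 960 sites this gives 60 expected unmatched sites for the grandfather-grandson and uncle-nephew cases (total/16) and 30 for siblings (total/32), which agrees with the ratios quoted before the list.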
Table 2 below shows the expected values of the number of unrelated sites and the number of unmatched SNP sites of interest for each type of relatives.
TABLE 2
The above is the ideal result for the case where the allele population frequency of every SNP site is 0.5. In practice, the allele population frequencies of the SNPs deviate from 0.5, which lowers the exclusion rate and thus reduces the number of unmatched target SNPs.
Example testing and research have shown that, when the plurality of target SNP sites searched in step S120 have allele population frequencies between 0.45 and 0.55, kinship other than parent-child relationships can be judged with reference to the expected values of the number of unmatched target SNP sites in Table 2 above.
According to the identity relationship identification method, identity relationship identification proceeds as follows: a plurality of target SNPs are searched one by one in the mutation sites of the sample mutation information to obtain the mutation and sequencing information of each target SNP; whether the reliability of each target SNP meets the preset requirements is judged from its mutation and sequencing information; the target SNPs meeting the preset requirements, together with their mutation and sequencing information, are selected to construct the identification information; and the corresponding target SNPs and their mutation and sequencing information in the identification information of different samples are compared to identify the identity relationship of the different samples. Because the mutation rate of SNPs is extremely low, and because the influence of any single target SNP on the final result is limited even if it does mutate, identifying identity relationships with SNPs can markedly improve the validity and reliability of the identification result compared with the traditional method based on STR detection.
Further, research has found that individual identification and paternity testing are perfect-match tests with extremely low tolerance for mismatches, so 20 STRs generally achieve a good identification effect for these two purposes. For other kinship testing, however, not all sites are expected to match, which introduces large random errors. For example, between a grandparent and a grandchild, 50% of sites at the haplotype level have different sources, so on average 10 of the 20 STRs match only as they would in a random population. With so few sites, the number of sites that actually match fluctuates widely at random, and the identification effect is very poor. The number of SNPs in the human genome is very large (the 1000 Genomes Project reported about 80 million human polymorphic SNPs, with about 3.5 to 4 million per person on average), so SNPs can provide better support for the identification of various identity relationships. SNPs can be used for paternity testing, individual identification, and kinship testing other than paternity testing, with small error and high reliability.
Furthermore, the traditional STR detection method is DNA fragment analysis, which is not a conventional DNA sequencing method. Most STRs are located in intergenic regions, many of which are currently considered non-functional, and general sequencing projects do not cover these regions. If identity relationships need to be identified in such projects, an additional STR detection experiment is therefore often required, which is time-consuming and labor-intensive and increases project cost. By contrast, a large number of SNPs lie in gene exons and other functional non-coding regions, so the SNPs already detected in most research and clinical sequencing projects can be reused for identity relationship identification, and various identity relationships can be identified without additional experiments. For example, after a sequencing project for a genetic disease family is received from the clinic and analyzed by whole-exome sequencing, the identity relationship identification method can be applied directly to the sequenced SNPs, with no additional experiment needed. The identity relationship identification method therefore saves time and reduces detection cost.
As shown in fig. 3, based on the same idea as the method described above, an embodiment of the present invention further provides an identity relationship authentication apparatus 200, which includes:
a mutation information obtaining module 210, configured to obtain sample mutation information obtained by performing comparison analysis on the sequencing result;
a target SNP information retrieval module 220, configured to retrieve a plurality of target SNP sites one by one from the mutation sites of the sample mutation information, so as to obtain mutation and sequencing information of the genotype including each target SNP site;
an identification information selection module 230, configured to select a target SNP site satisfying a preset requirement from multiple target SNP sites and mutation and sequencing information thereof, so as to obtain identification information; and
an identity relationship identification module 240, configured to compare the genotypes of the corresponding target SNP sites in the identification information of different samples and identify the identity relationship of the different samples.
In one particular example, the identity relationship identification module 240 includes a first determination module 241, an individual identification module 242, a matching condition statistics module 243, a second determination module 244, a paternity test module 245, and an other-kinship identification module 246.
The first determining module 241 is used for determining whether to perform individual identification or relationship identification on different samples.
The individual identification module 242 is configured to compare genotypes of all corresponding target SNP sites of different samples, and analyze whether the different samples belong to the same individual according to a comparison result.
The matching condition statistics module 243 is used for counting the number of matched target SNP sites and/or the number of unmatched target SNP sites in the corresponding target SNP sites of different samples.
The second determination module 244 is used to determine whether to perform paternity testing or kinship testing other than paternity testing on different samples.
The paternity test module 245 is configured to calculate a paternity index of each matched target SNP site according to the genotype of the target SNP site and the population frequency of the corresponding allele, determine a comprehensive paternity index according to the paternity index of each matched target SNP site, and analyze whether the different samples belong to a paternity relationship according to the comprehensive paternity index.
The other-kinship identification module 246 is used for analyzing the kinship other than parent-child relationships according to the number of unmatched target SNP sites.
Based on the embodiments described above, the present invention further provides a computer device for identity relationship authentication, which has a processor and a memory, where the memory stores a computer program, and the processor executes the computer program to implement the steps of the identity relationship authentication method according to any one of the embodiments described above.
It will be understood by those skilled in the art that all or part of the processes of the above methods may be implemented by a computer program, which may be stored in a non-volatile computer-readable storage medium, and in the embodiments of the present invention, the program may be stored in the storage medium of a computer system and executed by at least one processor in the computer system to implement the processes of the embodiments including the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Accordingly, the present invention also provides a computer storage medium for identity relationship identification, on which a computer program is stored; when executed, the computer program implements the steps of the identity relationship identification method according to any of the above embodiments.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several embodiments of the present invention, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.