CN108647495B - Identity relationship identification method, device, equipment and storage medium - Google Patents

Publication number
CN108647495B
Authority
CN
China
Prior art keywords
target snp
mutation
identification
sites
paternity
Legal status
Active
Application number
CN201810490416.4A
Other languages
Chinese (zh)
Other versions
CN108647495A (en)
Inventor
刘晶星
刘菲菲
庞柳
赵薇薇
于世辉
Current Assignee
Guangzhou Kingmed Diagnostics Group Co ltd
Guangzhou Kingmed Diagnostics Central Co Ltd
Original Assignee
Guangzhou Kingmed Diagnostics Group Co ltd
Guangzhou Kingmed Diagnostics Central Co Ltd

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6888: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for detection or identification of organisms
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/156: Polymorphic or mutational markers

Abstract

The invention relates to an identity relationship identification method, device, equipment and storage medium that can improve the validity and reliability of the identification result. To identify an identity relationship, a plurality of target SNP sites are searched one by one among the mutation sites of the sample mutation information to obtain, for each target SNP site, mutation and sequencing information that includes its genotype. The target SNP sites that meet preset requirements, together with their mutation and sequencing information, are then selected to obtain identification information. Finally, the genotypes of the corresponding target SNP sites in the identification information of different samples are compared to identify the identity relationship of the samples. Because the rate at which an SNP mutates to another base is extremely low, and because the influence of any single target SNP on the final result is limited even if it does mutate, SNP-based identification can markedly improve the validity and reliability of the result compared with the traditional method based on STR detection.

Description

Identity relationship identification method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of molecular biology and bioinformatics, in particular to an identity relationship identification method, device, equipment and storage medium.
Background
At present, identity relationship identification, such as individual identification and kinship identification (including paternity identification), is mainly based on Short Tandem Repeat (STR) detection. Research shows that the human genome contains far fewer STRs than SNPs (Single Nucleotide Polymorphisms), that fewer still discriminate well between individuals, and that very few usable STRs remain once those whose frequencies fluctuate strongly between populations are excluded. Moreover, a single STR is more prone to mutation than an SNP, and because so few STRs are available, a mutation in even one STR can strongly affect the final identification result.
Disclosure of Invention
In view of the above, it is desirable to provide an identity relationship identification method, apparatus, device and storage medium capable of improving validity and reliability of identification results.
An identity relationship identification method comprises the following steps:
step S1: obtaining sample mutation information obtained by comparing and analyzing sequencing results;
step S2: searching a plurality of target SNP sites one by one in the mutation sites of the sample mutation information to obtain the mutation and sequencing information of the genotype containing each target SNP site;
step S3: selecting a target SNP site meeting preset requirements from a plurality of target SNP sites and mutation and sequencing information thereof to obtain identification information;
step S4: and comparing the genotypes of the corresponding target SNP sites in the identification information of different samples, and identifying the identity relationship of the different samples.
In one embodiment, each target SNP site is located on an autosomal exon or a functional non-coding region, and its allele population frequency is between 0.45 and 0.55.
In one embodiment, in step S2, for a target SNP that can be retrieved from the mutation sites in the sample mutation information, the genotype of the target SNP site is determined to be homozygous or heterozygous inconsistent with the reference genotype, and the obtained mutation and sequencing information includes the genotype, allele population frequency, mutation quality, and sequencing coverage of the target SNP site;
and for the target SNP which cannot be searched in the mutation site of the sample mutation information, the genotype of the target SNP site is homozygosis which is consistent with the reference genotype, and the obtained mutation and sequencing information comprises the genotype, allele population frequency and sequencing coverage of the target SNP site.
In one embodiment, in step S3, meeting the preset requirement means that the sequencing coverage is not less than 30 reads and the mutation quality meets the default quality control standard of GATK.
In one embodiment, the step S4 includes:
step S41: and judging whether individual identification is carried out on different samples, if so, comparing the genotypes of all corresponding target SNP sites of the different samples, and analyzing whether the different samples belong to the same individual according to the comparison result.
In one embodiment, the step S4 includes:
step S42: judging whether paternity testing is to be performed on different samples; if so, calculating the paternity index of each matched target SNP site according to the genotype of the target SNP site and the corresponding allele population frequency, determining a comprehensive paternity index from the paternity indices of the matched target SNP sites, and analyzing whether the different samples have a parent-child relationship according to the comprehensive paternity index;
a matched target SNP site is a target SNP site at which the different samples share at least one allele.
In one embodiment, the step S4 includes:
step S43: judging whether kinship identification other than paternity testing is to be performed on different samples; if so, analyzing the kinship according to the number of unmatched target SNP sites;
an unmatched target SNP site is a target SNP site at which the different samples share no allele.
An identity relationship authentication apparatus comprising:
the mutation information acquisition module is used for acquiring sample mutation information obtained by comparing and analyzing the sequencing result;
the target SNP information retrieval module is used for retrieving a plurality of target SNP sites one by one in the mutation sites of the sample mutation information to obtain the mutation and sequencing information of the genotype containing each target SNP site;
the identification information selection module is used for selecting a target SNP site meeting preset requirements from a plurality of target SNP sites and mutation and sequencing information of the target SNP site to obtain identification information; and
and the identity relationship identification module is used for comparing the genotypes of the corresponding target SNP sites in the identification information of different samples and identifying the identity relationship of different samples.
A computer device having a processor and a memory, the memory storing a computer program, the processor implementing the steps of the identity relationship authentication method according to any one of the above embodiments when executing the computer program.
A computer storage medium having a computer program stored thereon, the computer program when executed implementing the steps of the identity relationship authentication method according to any one of the above embodiments.
According to the identity relationship identification method, the identity relationship identification device, the identity relationship identification equipment and the identity relationship identification storage medium, when the identity relationship is identified, the target SNP sites are searched one by one in the mutation sites of the sample mutation information to obtain the mutation and sequencing information of the genotype of each target SNP site, the target SNP sites meeting the preset requirements and the mutation and sequencing information of the target SNP sites are selected from the target SNP sites to obtain the identification information, finally, the genotypes of the corresponding target SNP sites in the identification information of different samples are compared, and the identity relationship of different samples is identified. Because the mutation rate of the SNP towards other directions is extremely low, even if the SNP is mutated, the influence of a single target SNP on the final result is limited, so that the identification of the identity relationship by the SNP can obviously improve the effectiveness and reliability of the identification result compared with the traditional method using STR detection.
Further, research has found that, because individual identification and paternity testing are perfect-match tests with extremely low tolerance for mismatches, 20 STRs generally suffice for good results in those two tasks. For other kinship tests, however, not all sites are expected to match, so large random errors can arise. For example, between a grandparent and a grandchild, 50% of sites are not homologous at the haploid level; with 20 STRs, the match results at 10 sites on average are therefore no different from those of the random population, so the number of sites that truly match fluctuates widely and the identification performs very poorly. By contrast, the number of SNPs in the human genome is very large (the 1000 Genomes Project reported about 80 million human polymorphic SNPs, with about 3.5 to 4 million per person on average), so SNPs can better support the identification of various identity relationships. SNPs can be used for paternity testing, for individual identification, and for kinship testing other than paternity, with small error and high reliability.
Furthermore, the traditional STR detection method is DNA fragment analysis rather than conventional DNA sequencing. Most STRs are located in intergenic regions, and many are currently considered to lie in non-functional regions that general sequencing projects do not cover. If an identity relationship must be identified in such a sequencing project, an additional STR detection experiment is therefore often required, which is time-consuming and labor-intensive and increases project cost. In contrast, a large number of SNPs lie in gene exons and other functional non-coding regions, so the SNPs already detected in most research and clinical sequencing projects can be reused for identity relationship identification, and various identity relationships can be identified without additional experiments. The identity relationship identification method therefore saves time and reduces detection cost.
Drawings
FIG. 1 is a schematic flow chart of a method for identity relationship authentication according to an embodiment;
FIG. 2 is a flowchart illustrating a specific example of the process of FIG. 1 for authenticating identity relationships between different samples;
FIG. 3 is a schematic structural diagram of an identity relationship authentication apparatus according to an embodiment;
FIG. 4 is a schematic structural diagram of a specific example of the identity relationship identification module in FIG. 3.
Detailed Description
To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Identity relationship identification includes individual identification, paternity identification, and kinship identification other than paternity, such as grandparent-grandchild, uncle-nephew, sibling, and cousin relationship identification. The "population mutation frequency" of an SNP site is the frequency, in a specific population (e.g., the Chinese population), of the base at which the SNP site differs from the reference sequence. The "allele population frequency" of an SNP site is the frequency of each allele of the SNP site in a specific population (e.g., the Chinese population). The "mutation quality" is judged by the default quality control standard given by GATK (or other mutation analysis software). A "read" is a sequence generated by a high-throughput sequencing platform (such as the various second-generation sequencing platforms). The "sequencing coverage" is the number of reads covering a sequenced site.
As shown in fig. 1, an embodiment of the present invention provides an identity relationship identification method, which includes the following steps:
Step S110: obtaining sample mutation information obtained by comparing and analyzing the sequencing result.
Each sample can be sequenced by methods such as, but not limited to, next-generation sequencing to obtain a sequencing result. The sequencing result can then be aligned to a human reference genome and analyzed to obtain a mutation file containing the sample mutation information. The sample mutation information includes mutation sites, mutation frequency, mutation quality, and similar information. A mutation is defined relative to the reference genome: it is a variation in the sequencing result that differs from the sequence of the corresponding region or site on the reference genome.
Step S120: searching a plurality of target SNP sites one by one among the mutation sites of the sample mutation information to obtain, for each target SNP site, mutation and sequencing information that includes its genotype.
Each target SNP site is preferably located on an autosomal exon or a functional non-coding region, with an allele population frequency between 0.45 and 0.55. The number of target SNP sites to search depends on the relationship being identified.
For parent-child identification (father-child, mother-child), searching 100 SNP sites generally gives an accuracy of about 99.999%, and 960 sites give about $(100 - 10^{-53})\%$; for paternity testing, the number of target SNP sites may therefore be required to be not less than 100.
For other kinship identification, the analysis described later infers the relationship from the expected number of unmatched sites. However many target SNP sites are used, such a judgment can never be 100% certain, but it remains highly reliable, and in general the more target SNP sites are used, the more reliable the result. For example, not less than 720 target SNP sites allow kinship identification at the level of cousins, not less than 480 at the level of grandparent-grandchild or uncle-nephew, and not less than 240 at the level of siblings.
For individual identification, the genotypes of the target SNP sites must match completely, and the number of target SNP sites is generally required to be not less than 50.
In one specific example, a plurality of target SNP sites may be selected from the 984 target SNP sites shown in Table 1 below. These sites are located on autosomal exons, have an allele population frequency between 0.45 and 0.55 in the Chinese population, and are covered by most gene exon sequencing projects.
TABLE 1
[Table 1, listing the 984 target SNP sites, appears as images in the original publication.]
Note: the reference sequence for all SNP sites is hg19. For example, in the target SNP site written as "10|101293035|C|A", "|" is the field separator, "10" is the chromosome number, "101293035" is the coordinate on that chromosome, "C" is the base identical to the corresponding site on the reference genome, and "A" is the other base, which differs from the corresponding site on the reference genome. The other target SNP sites follow the same notation.
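The site notation described in the note can be split into its four fields mechanically. The following is a minimal Python sketch for illustration only; the names `TargetSnp` and `parse_target_snp` are hypothetical and are not part of the disclosed method.

```python
from typing import NamedTuple


class TargetSnp(NamedTuple):
    chrom: str  # chromosome number, e.g. "10"
    pos: int    # coordinate on that chromosome (hg19)
    ref: str    # base identical to the reference genome
    alt: str    # the other base, which differs from the reference


def parse_target_snp(label: str) -> TargetSnp:
    """Split a 'chrom|pos|ref|alt' label into its four fields."""
    chrom, pos, ref, alt = label.split("|")
    return TargetSnp(chrom, int(pos), ref, alt)


site = parse_target_snp("10|101293035|C|A")
print(site)
```

Keeping the chromosome as a string rather than an integer is a design choice of this sketch, so that the same structure could also hold non-numeric chromosome names.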
When the target SNP sites are searched one by one among the mutation sites of the sample mutation information, a given target SNP may or may not be found. For a target SNP that is found among the mutation sites, the genotype of the target SNP site is determined to be a homozygote or heterozygote that differs from the reference genotype, and the obtained mutation and sequencing information includes the genotype, allele population frequency, mutation quality and sequencing coverage of the target SNP site. For a target SNP that is not found among the mutation sites, the genotype of the target SNP site is a homozygote consistent with the reference genotype, and the obtained mutation and sequencing information includes the genotype, allele population frequency and sequencing coverage of the target SNP site. Information such as the sequencing coverage can be calculated from the sequencing alignment file of the sample (such as a BAM file) for the target SNP site currently being searched.
Take the reference allele as R and the mutant allele as V. Humans are diploid, so for a target SNP that is found among the mutation sites of the sample mutation information, the genotype of the sample at that site is VV (homozygous) or RV (heterozygous), and for a target SNP that is not found, the genotype of the sample at that site is RR.
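The RR/RV/VV assignment can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `genotype_targets`, the dictionary layout of the called variants, the "het"/"hom" labels, and the example coordinates other than 10|101293035 are assumptions made here, not details given in the description.

```python
def genotype_targets(target_sites, called_variants):
    """Assign a genotype to every target site.

    target_sites: iterable of (chrom, pos, ref, alt) tuples.
    called_variants: dict mapping (chrom, pos, ref, alt) to "het" or "hom";
        a simplified, hypothetical stand-in for a parsed mutation file.
    Returns a dict mapping each site to a sorted two-allele tuple.
    """
    genotypes = {}
    for site in target_sites:
        _chrom, _pos, ref, alt = site
        zygosity = called_variants.get(site)
        if zygosity is None:
            genotypes[site] = (ref, ref)                 # not found: RR
        elif zygosity == "het":
            genotypes[site] = tuple(sorted((ref, alt)))  # found, heterozygous: RV
        elif zygosity == "hom":
            genotypes[site] = (alt, alt)                 # found, homozygous: VV
        else:
            raise ValueError(f"unknown zygosity {zygosity!r} at {site}")
    return genotypes


# Hypothetical example data: only the first site label comes from the text.
targets = [("10", 101293035, "C", "A"), ("1", 1000, "G", "T"), ("2", 2000, "A", "G")]
calls = {("10", 101293035, "C", "A"): "het", ("2", 2000, "A", "G"): "hom"}
result = genotype_targets(targets, calls)
```

A site that is absent from the mutation file is typed RR here regardless of how well it was sequenced; the coverage check of step S130 is what guards against treating an unsequenced site as reference-homozygous.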
Step S130: selecting, from the plurality of target SNP sites, the target SNP sites that meet preset requirements, together with their mutation and sequencing information, to obtain identification information.
Specifically, meeting the preset requirement means that the sequencing coverage is not less than 30 reads and the mutation quality meets the default quality control standard of GATK.
The default quality control criteria for GATK are QD > 2.0, MQ > 40.0, FS < 60.0, HaplotypeScore < 60.0, MQRankSum > -12.5 and ReadPosRankSum > -8.0.
Reliability analysis of the target SNP sites screens out the high-quality sites shared by the samples and keeps unreliable sites from influencing the judgment. Typing a target SNP site requires sufficient coverage and quality control; otherwise random sampling can produce a wrong genotype. For example, suppose that at a target SNP site the father is type AA and the son is type AT. If the son's coverage at that site is low or the site quality is poor, say only 5 reads, all 5 reads may happen to carry T, or A may go undetected because of poor quality, and the son is then wrongly typed as TT.
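The screening rule of step S130 can be expressed as a simple predicate. The sketch below is illustrative and rests on stated assumptions: the function name `passes_quality` and the dictionary of annotation values are hypothetical, the thresholds are those listed in the description, and sites typed as reference-homozygous are checked on coverage alone because no mutation quality is recorded for them.

```python
MIN_COVERAGE = 30  # reads; "not less than 30" per the description

# Thresholds as listed in the description (strict inequalities).
LOWER_BOUNDS = {"QD": 2.0, "MQ": 40.0, "MQRankSum": -12.5, "ReadPosRankSum": -8.0}
UPPER_BOUNDS = {"FS": 60.0, "HaplotypeScore": 60.0}


def passes_quality(coverage, annotations=None):
    """Return True if a target SNP site meets the preset requirements.

    coverage: number of reads covering the site.
    annotations: dict of quality annotations for a site found in the mutation
        file, or None for a site typed as reference-homozygous (assumption:
        only coverage is available for such sites).
    """
    if coverage < MIN_COVERAGE:
        return False
    if annotations is None:
        return True
    for name, bound in LOWER_BOUNDS.items():
        if annotations[name] <= bound:
            return False
    for name, bound in UPPER_BOUNDS.items():
        if annotations[name] >= bound:
            return False
    return True


# Hypothetical annotation values for one well-supported variant site.
good = {"QD": 15.0, "MQ": 60.0, "FS": 1.2, "HaplotypeScore": 0.5,
        "MQRankSum": 0.1, "ReadPosRankSum": -0.3}
print(passes_quality(45, good), passes_quality(5, good))
```

In this sketch a missing annotation raises a `KeyError` rather than being silently accepted; whether to tolerate missing annotations is a policy decision the description does not address.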
Step S140: and comparing the genotypes of the corresponding target SNP sites in the identification information of different samples, and identifying the identity relationship of the different samples.
The genotypes of the selected target SNP sites that meet the preset requirement are compiled into the identification information, and an identification information file in the utag format is generated.
The identification information file can be used for individual identification, relationship identification and the like.
Taking the parent-child relationship as an example, the non-paternal exclusion rate of a single target SNP site is $PE = 2p^2(1-p)^2$, where $p$ is the allele population frequency of the target SNP site; PE reaches its maximum value of 0.125 at $p = 0.5$. When $p$ is between 0.45 and 0.55, the lowest non-paternal exclusion rate of a single target SNP site is 0.1225125. For the 984 target SNP sites, the non-paternal exclusion rate obtained by the method of the invention (shown as a formula image in the original publication) is far higher than that of the traditional 20 STR loci.
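How quickly the exclusion power accumulates with the number of sites can be illustrated numerically. The sketch below assumes that the sites are independent, so that the probability that a random non-parent is excluded at none of n sites is (1 − PE)^n; this standard cumulative form is an assumption here, since the original formula survives only as an image. The function names are hypothetical.

```python
import math


def exclusion_rate(p):
    """Single-site non-paternal exclusion rate, PE = 2 * p^2 * (1 - p)^2."""
    return 2 * p**2 * (1 - p)**2


def log10_non_exclusion(n_sites, p):
    """log10 of the probability that a random non-parent is excluded at none
    of n_sites sites, assuming independent sites with the same frequency p."""
    return n_sites * math.log10(1 - exclusion_rate(p))


for n in (100, 960, 984):
    best = log10_non_exclusion(n, 0.5)    # most informative frequency
    worst = log10_non_exclusion(n, 0.45)  # edge of the 0.45-0.55 window
    print(f"{n} sites: log10(non-exclusion) = {best:.1f} (p=0.5), {worst:.1f} (p=0.45)")
```

Working with the logarithm keeps the result readable even though the probabilities themselves are vanishingly small. Under these assumptions the values for 100 and 960 sites are of the same order as the accuracies quoted earlier in the description.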
In a specific example, as shown in fig. 2, step S140 includes:
Step S141: judging whether individual identification or kinship identification is to be performed on different samples; if individual identification, executing step S142; otherwise, executing step S143.
Step S142: comparing the genotypes of all corresponding target SNP sites of the different samples, and analyzing whether the different samples belong to the same individual according to the comparison result.
In principle, two samples are judged to come from the same individual only if the genotypes of all corresponding target SNP sites are completely consistent. However, when a large number of target SNP sites are compared and only a very small number of genotypes disagree, the discrepancy can be analyzed case by case; possible causes include degradation of the sample DNA or mutation of an SNP during embryonic differentiation of the tested individual. Mutations arising during embryonic differentiation can cause slight genetic differences between tissues of the same body, and the samples used for individual identification may come from different tissues. The probability of this is very low, and it generally does not affect the individual identification judgment.
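The comparison of step S142 amounts to checking genotype equality at every site present in both samples. The following Python sketch is illustrative only; the function name `compare_individuals`, the site labels, and the genotype values are hypothetical, and the decision on how many discordant sites to tolerate is left to case-by-case review as the description suggests.

```python
def compare_individuals(genotypes_a, genotypes_b):
    """Compare two samples site by site.

    genotypes_a, genotypes_b: dicts mapping a site label to a two-allele
        genotype, e.g. {"10|101293035|C|A": ("A", "C")}.
    Returns (number of sites compared, list of sites whose genotypes differ).
    Only sites present in both samples are compared.
    """
    shared = sorted(set(genotypes_a) & set(genotypes_b))
    mismatches = [
        site for site in shared
        if sorted(genotypes_a[site]) != sorted(genotypes_b[site])
    ]
    return len(shared), mismatches


# Hypothetical genotypes for two samples.
sample_1 = {"s1": ("A", "C"), "s2": ("G", "G"), "s3": ("T", "T")}
sample_2 = {"s1": ("C", "A"), "s2": ("G", "G"), "s3": ("C", "T"), "s4": ("A", "A")}
n_compared, discordant = compare_individuals(sample_1, sample_2)
print(n_compared, discordant)
```

Sorting the two alleles before comparison makes the check independent of the order in which a heterozygous genotype happens to be written.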
Step S143: counting the number of matched target SNP sites and/or the number of unmatched target SNP sites among the corresponding target SNP sites of the different samples.
A matched target SNP site is a target SNP site at which the different samples share at least one allele. An unmatched target SNP site is a target SNP site at which the different samples share no allele. The number of matched target SNP sites plus the number of unmatched target SNP sites equals the total number of target SNP sites in the identification information.
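The matched/unmatched definition translates directly into a set intersection on the alleles of the two samples. The sketch below is a minimal illustration with hypothetical names (`is_matched`, `count_matches`) and hypothetical example genotypes.

```python
def is_matched(genotype_a, genotype_b):
    """A site is matched when the two samples share at least one allele."""
    return bool(set(genotype_a) & set(genotype_b))


def count_matches(genotypes_a, genotypes_b):
    """Return (matched, unmatched) counts over the sites present in both samples."""
    shared = set(genotypes_a) & set(genotypes_b)
    matched = sum(is_matched(genotypes_a[s], genotypes_b[s]) for s in shared)
    return matched, len(shared) - matched


# Hypothetical genotypes: only site "s2" shares no allele (AA versus GG).
parent = {"s1": ("A", "A"), "s2": ("A", "A"), "s3": ("A", "G")}
child = {"s1": ("A", "G"), "s2": ("G", "G"), "s3": ("G", "G")}
matched, unmatched = count_matches(parent, child)
print(matched, unmatched)
```

By construction the two counts always sum to the number of sites compared, which mirrors the statement above about the total number of target SNP sites.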
Step S144: judging whether paternity testing or kinship identification other than paternity testing is to be performed on the different samples; if paternity testing, executing step S145; otherwise, executing step S146.
Step S145: calculating the paternity index of each matched target SNP site according to the genotype of the target SNP site and the corresponding allele population frequency, determining a comprehensive paternity index from the paternity indices of the matched target SNP sites, and analyzing whether the different samples have a parent-child relationship according to the comprehensive paternity index.
The paternity index PI of each matched target SNP site is calculated according to a formula shown as an image in the original publication, in which $p_i$ is the frequency of the matching allele and PI is taken as the sum over all cases that can be matched. The comprehensive paternity index CPI is the product of all PI values.
Whether the different samples have a parent-child relationship can be analyzed according to the comprehensive paternity index CPI; if the CPI is greater than 1000, the samples can be judged to have a parent-child relationship.
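Combining the per-site values into a CPI and applying the threshold can be sketched as follows. This is an illustrative sketch only: the per-site PI values are taken as given inputs because the per-site formula is not reproduced in the text, the example values are hypothetical, and the function names are assumptions.

```python
import math

CPI_THRESHOLD = 1000.0  # threshold stated in the description


def log10_cpi(site_pis):
    """log10 of the CPI, i.e. of the product of all per-site PI values.

    Accumulating in log space avoids overflow when hundreds of sites are used.
    Every PI value must be positive.
    """
    return sum(math.log10(pi) for pi in site_pis)


def supports_parent_child(site_pis):
    """True if the product of the PI values exceeds the CPI threshold."""
    return log10_cpi(site_pis) > math.log10(CPI_THRESHOLD)


# Hypothetical per-site PI values: ten sites at PI = 2 give CPI = 1024.
print(supports_parent_child([2.0] * 10))  # CPI = 1024, above the threshold
print(supports_parent_child([2.0] * 9))   # CPI = 512, below the threshold
```

With several hundred matched sites the plain product can exceed the range of a floating-point number, which is why this sketch compares logarithms instead.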
Step S146: analyzing the kinship other than paternity according to the number of unmatched target SNP sites.
It is understood that in other embodiments step S140 may perform only one or two of individual identification, paternity testing, and kinship identification other than paternity testing. Accordingly, in one specific example, step S140 includes: judging whether individual identification is to be performed on different samples and, if so, comparing the genotypes of all corresponding target SNP sites of the different samples and analyzing whether they belong to the same individual according to the comparison result. In another specific example, step S140 includes: judging whether paternity testing is to be performed on different samples and, if so, calculating the paternity index of each matched target SNP site according to the genotype of the target SNP site and the corresponding allele population frequency, determining a comprehensive paternity index from the paternity indices of the matched target SNP sites, and analyzing whether the different samples have a parent-child relationship according to the comprehensive paternity index, where a matched target SNP site is a target SNP site at which the different samples share at least one allele. In yet another specific example, step S140 includes: judging whether kinship identification other than paternity testing is to be performed on different samples and, if so, analyzing the kinship according to the number of unmatched target SNP sites, where an unmatched target SNP site is a target SNP site at which the different samples share no allele.
More specifically, in one example, if the number of unmatched target SNP sites ≈ the total number of target SNP sites / 16, the relationship can be considered grandparent-grandchild, uncle-nephew, or the like; if the number of unmatched target SNP sites ≈ the total number of target SNP sites / 32, the relationship can be considered sibling.
Here a concept is introduced: unrelated sites at the haploid level between two samples, that is, sites with no genetic relationship. Only unrelated sites can produce unmatched target SNPs between two samples. For an SNP with a population frequency of 0.5, the proportions of the three genotypes AA/BB/AB are 0.25, 0.25 and 0.5, respectively, and the SNP is unmatched if and only if the two samples are AA and BB. The probability of this is 2 × 0.25 × 0.25 = 0.125, i.e., 1/8, which is the maximum non-paternal exclusion rate of a single SNP site.
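The 1/8 figure can be checked by brute-force enumeration of genotype pairs. The sketch below assumes that each of the four alleles of two unrelated samples is drawn independently with frequency p for allele A and 1 − p for allele B (random mating); the function name `mismatch_probability` is hypothetical.

```python
from itertools import product


def mismatch_probability(p):
    """Probability that two unrelated samples share no allele at a biallelic
    site, assuming all four alleles are drawn independently with
    P(A) = p and P(B) = 1 - p."""
    allele_freq = {"A": p, "B": 1 - p}
    total = 0.0
    # Enumerate the two alleles of sample 1 (a1, a2) and of sample 2 (b1, b2).
    for a1, a2, b1, b2 in product("AB", repeat=4):
        if not ({a1, a2} & {b1, b2}):  # no allele in common: AA vs BB or BB vs AA
            total += (allele_freq[a1] * allele_freq[a2]
                      * allele_freq[b1] * allele_freq[b2])
    return total


print(mismatch_probability(0.5))
```

Only the two ordered pairs AA/BB and BB/AA survive the filter, so the enumeration reduces to 2p²(1 − p)², the same expression as the single-site exclusion rate.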
The total number 960 of the target SNPs corresponding to the two samples and the allele population frequency of all target SNP sites are 0.5 are taken as examples below to demonstrate the number of unmatched target SNP sites under different relatives:
① Father and son: one chromosome of each pair in the son is inherited entirely from the father, so the number of unrelated sites between father and son is 0;
② Grandfather and grandson: because crossover between non-sister chromatids during meiosis causes genetic recombination, an expected 0.5 of the chromosome the son inherits from the father comes from the grandfather, so the number of unrelated sites between grandfather and grandson is 960 × 0.5 = 480;
③ Uncle and nephew: of the chromosome the son inherits from the father, an expected 0.5 comes from the grandfather and 0.5 from the grandmother, and each of these parts has a 50% chance of also having been passed to the uncle, so the probability that uncle and nephew carry the same chromosome is 0.5 × 50% + 0.5 × 50% = 0.5, and the number of unrelated sites between uncle and nephew is 960 × (1 − 0.5) = 480;
④ Siblings: a site is unrelated only when both alleles differ in origin. For example, if the father is Aa and the mother is Bb, and one sibling is AB, then whenever the other sibling is AB, Ab or aB the two share an allele of common origin and the target SNP is a related, matched site; only the fully crossed combination ab shares nothing. The probability of that combination is 0.5 × 0.5 = 0.25, so the number of unrelated sites between siblings is 960 × 0.25 = 240;
⑤ Cousins: as calculated above, the probability that uncle and nephew carry the same chromosome is 0.5, and the probability that this part of the chromosome is passed from the uncle to the cousin is 0.5, so the probability for cousins is 0.5 × 0.5 = 0.25, and the number of unrelated sites between cousins is 960 × (1 − 0.25) = 720;
⑥ Similarly, for the more distant uncle-nephew relationship one generation further removed than cousins, where the probability of carrying the same chromosome is 0.125, the number of unrelated sites is 960 × (1 − 0.125) = 840.
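The arithmetic in items ① to ⑥ can be reproduced with a short sketch. It assumes, as the text does, 960 target sites, an allele frequency of 0.5 at every site, and a mismatch probability of 1/8 at each unrelated site; the function and variable names are illustrative only.

```python
TOTAL_SITES = 960        # total target SNP sites in the worked example
EXCLUSION_RATE = 1 / 8   # mismatch probability at an unrelated site when allele frequency is 0.5

# Fraction of sites expected to be unrelated, taken from items 1 to 6 above.
UNRELATED_FRACTION = {
    "father-son (item 1)": 0.0,
    "grandfather-grandson (item 2)": 0.5,
    "uncle-nephew (item 3)": 1 - 0.5,
    "siblings (item 4)": 0.25,
    "cousins (item 5)": 1 - 0.25,
    "item 6": 1 - 0.125,
}


def expected_counts(total_sites: int, unrelated_fraction: float):
    """Return (number of unrelated sites, expected number of unmatched sites)."""
    unrelated = total_sites * unrelated_fraction
    return unrelated, unrelated * EXCLUSION_RATE


for relation, fraction in UNRELATED_FRACTION.items():
    unrelated, unmatched = expected_counts(TOTAL_SITES, fraction)
    print(f"{relation:32s} unrelated={unrelated:5.0f}  expected unmatched={unmatched:6.1f}")
```

For 960 sites this gives an expected 60 unmatched sites for grandfather-grandson and uncle-nephew and 30 for siblings, which agrees with the 1/16 and 1/32 rules of thumb stated earlier.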
Table 2 below shows the expected number of unrelated sites and the expected number of unmatched target SNP sites for each type of kinship relationship.
TABLE 2
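[Table 2 is provided as an image in the original publication (Figure BDA0001667897810000161); its contents are not reproduced in the text.]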
The above is the ideal result, which assumes that the allele population frequency of every SNP site is 0.5. In practice, the allele population frequencies of SNPs deviate from 0.5, which lowers the exclusion rate and therefore reduces the number of unmatched target SNPs.
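How quickly the exclusion rate falls as the allele frequency moves away from 0.5 can be seen in the sketch below. It assumes Hardy-Weinberg genotype proportions (p² for AA, q² for BB), which reproduce the 0.25/0.25/0.5 split used in the text when p = 0.5; the function name is illustrative.

```python
def exclusion_rate(p: float) -> float:
    """Probability that two unrelated genotypes share no allele at a biallelic SNP.

    One sample must be AA (probability p^2) and the other BB (probability q^2),
    in either order, assuming Hardy-Weinberg proportions.
    """
    q = 1.0 - p
    return 2.0 * (p * p) * (q * q)


for p in (0.50, 0.45, 0.40, 0.30):
    print(f"allele frequency {p:.2f} -> exclusion rate {exclusion_rate(p):.4f}")
```

Within the 0.45 to 0.55 range the rate stays close to the maximum of 0.125, which is consistent with restricting target SNPs to that range.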
Testing on examples has shown that, when the target SNP sites retrieved in step S120 have allele population frequencies of 0.45 to 0.55, kinship relationships other than paternity can be determined with reference to the expected numbers of unmatched target SNP sites in Table 2 above.
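One way to use such expected values is to pick the relationship whose expected unmatched fraction is closest to the observed one. The sketch below is an assumption about how the comparison might be done, not the procedure prescribed by the patent; the expected fractions are the ones derived in the worked example (unrelated fraction × 1/8), and the function name is hypothetical.

```python
# Expected unmatched fraction = unrelated fraction x 1/8 (allele frequency 0.5 assumed)
EXPECTED_UNMATCHED_FRACTION = {
    "parent-child": 0.0,
    "siblings": 1 / 32,
    "grandparent-grandchild or uncle-nephew": 1 / 16,
    "cousins": 3 / 32,
}


def closest_relationship(unmatched: int, total_sites: int) -> str:
    """Return the relationship whose expected unmatched fraction is nearest the observed one."""
    observed = unmatched / total_sites
    return min(
        EXPECTED_UNMATCHED_FRACTION,
        key=lambda rel: abs(EXPECTED_UNMATCHED_FRACTION[rel] - observed),
    )


print(closest_relationship(58, 960))
```

Grandparent-grandchild and uncle-nephew have the same expected value and cannot be separated by this count alone. A practical implementation would also need a tolerance around each expected value rather than a bare nearest-value rule.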
With the identity relationship identification method described above, a plurality of target SNPs are retrieved one by one from the mutation sites of the sample mutation information to obtain the mutation and sequencing information of each target SNP; whether the reliability of each target SNP meets the preset requirements is judged from that information; the target SNPs that meet the preset requirements, together with their mutation and sequencing information, are selected to construct the identification information; and the corresponding target SNPs and their mutation and sequencing information in the identification information of different samples are compared to identify the identity relationship of the different samples. Because SNPs have an extremely low mutation rate, and because a mutation at any single target SNP has only a limited influence on the final result, identifying identity relationships with SNPs can significantly improve the validity and reliability of the result compared with the traditional STR-based method.
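The retrieval and selection steps summarized above can be sketched as follows. The coverage threshold of 30 reads and the quality thresholds are those given in claims 5 and 6; the data layout, the class and function names, the annotation key spellings and the handling of missing annotations are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

MIN_COVERAGE = 30  # reads, per claim 5

# Thresholds from claim 6; the annotation keys are assumed to follow GATK naming.
QC_RULES = {
    "QD": lambda v: v > 2.0,
    "MQ": lambda v: v > 40.0,
    "FS": lambda v: v < 60.0,
    "HaplotypeScore": lambda v: v < 60.0,
    "MQRankSum": lambda v: v > -12.5,
    "ReadPosRankSum": lambda v: v > -8.0,
}


@dataclass
class SiteInfo:
    site: str
    genotype: Tuple[str, str]
    coverage: int
    annotations: Optional[Dict[str, float]]  # None when the site is not among the called variants


def retrieve(targets: Dict[str, str], calls: Dict[str, dict], coverage: Dict[str, int]) -> List[SiteInfo]:
    """Look up each target SNP; a site absent from the calls is taken as homozygous reference."""
    result = []
    for site, ref in targets.items():
        call = calls.get(site)
        if call is None:
            result.append(SiteInfo(site, (ref, ref), coverage[site], None))
        else:
            result.append(SiteInfo(site, call["genotype"], coverage[site], call["annotations"]))
    return result


def is_reliable(info: SiteInfo) -> bool:
    """Coverage check for every site; quality check only for called variants.

    Annotations missing from a call are skipped, which is an assumption of this sketch.
    """
    if info.coverage < MIN_COVERAGE:
        return False
    if info.annotations is None:
        return True
    return all(rule(info.annotations[k]) for k, rule in QC_RULES.items() if k in info.annotations)


def select(infos: List[SiteInfo]) -> List[SiteInfo]:
    """Keep only the sites that satisfy the preset requirements."""
    return [info for info in infos if is_reliable(info)]
```

The selected list plays the role of the identification information for one sample; two such lists can then be compared site by site.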
Further, research has found that individual identification and paternity testing are perfect-match tests with very low tolerance for mismatches, so 20 STRs generally give good results for them. For other kinship identification, however, not all sites match, and large random errors can arise. For example, between grandparent and grandchild 50% of sites are of different origin at the haploid level, so on average 10 of the 20 STR results are equivalent to those of a random population; the number of truly matching sites therefore fluctuates widely and the identification performs very poorly. The number of SNPs in the human genome is very large (the 1000 Genomes Project reported about 80 million human polymorphic SNPs, with about 3.5 to 4 million per person on average), so SNPs can better support the identification of various identity relationships. SNPs can be used for paternity testing, individual identification, and kinship identification other than paternity, with small error and high reliability.
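The effect of marker count on random fluctuation can be illustrated with a simple binomial model. This sketch does not model STRs themselves: it reuses the SNP mismatch probability of 1/8 and assumes independent sites, purely to show how the relative spread of the unmatched count shrinks as the number of sites grows. The function name is illustrative.

```python
import math


def unmatched_stats(n_unrelated: int, rate: float = 1 / 8):
    """Mean, standard deviation and relative spread of the unmatched-site count.

    Assumes independent sites, so the count follows Binomial(n_unrelated, rate).
    """
    mean = n_unrelated * rate
    sd = math.sqrt(n_unrelated * rate * (1 - rate))
    return mean, sd, sd / mean


for total in (20, 960):
    # Grandparent-grandchild: half of the sites are expected to be unrelated
    mean, sd, rel = unmatched_stats(total // 2)
    print(f"{total:4d} sites: mean={mean:6.2f}  sd={sd:5.2f}  sd/mean={rel:.2f}")
```

With 20 sites the standard deviation is of the same order as the mean, whereas with 960 sites it is a small fraction of it.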
Furthermore, the traditional STR detection method is DNA fragment analysis, which is not a conventional DNA sequencing method. Most STRs are located in intergenic regions, many of which are currently considered non-functional, and general sequencing projects do not cover these regions. If identity relationships need to be identified in such projects, an additional STR detection experiment is usually required, which is time-consuming and labor-intensive and increases project cost. By contrast, a large number of SNPs lie in gene exons and other functional non-coding regions, so the SNPs already detected in most research and clinical sequencing projects can be reused for identity relationship identification, and the various identity relationships can be identified without additional experiments. For example, when a sequencing project for a family with a genetic disease is received clinically and analyzed by whole-exome sequencing, the identity relationship identification method can be applied directly to the sequenced SNPs with no additional experiment. The identity relationship identification method therefore saves time and reduces detection cost.
As shown in fig. 3, based on the same idea as the method described above, an embodiment of the present invention further provides an identity relationship authentication apparatus 200, which includes:
a mutation information obtaining module 210, configured to obtain sample mutation information obtained by performing comparison analysis on the sequencing result;
a target SNP information retrieval module 220, configured to retrieve a plurality of target SNP sites one by one from the mutation sites of the sample mutation information, so as to obtain mutation and sequencing information, including the genotype, of each target SNP site;
an identification information selection module 230, configured to select, from the plurality of target SNP sites and their mutation and sequencing information, the target SNP sites that satisfy a preset requirement, so as to obtain identification information; and
an identity relationship identification module 240, configured to compare the genotypes of the corresponding target SNP sites in the identification information of different samples and to identify the identity relationship of the different samples.
In one particular example, shown in Fig. 4, the identity relationship identification module 240 includes a first determination module 241, an individual identification module 242, a matching condition statistics module 243, a second determination module 244, a paternity relationship identification module 245, and an other kinship identification module 246.
The first determination module 241 is used for determining whether individual identification or kinship identification is to be performed on different samples.
The individual identification module 242 is configured to compare the genotypes of all corresponding target SNP sites of different samples and to analyze whether the different samples belong to the same individual according to the comparison result.
The matching condition statistics module 243 is used for counting the number of matched target SNP sites and/or the number of unmatched target SNP sites among the corresponding target SNP sites of different samples.
The second determination module 244 is used for determining whether paternity testing or kinship identification other than paternity is to be performed on different samples.
The paternity relationship identification module 245 is configured to calculate the paternity index of each matched target SNP site according to the genotype of the target SNP site and the population frequency of the corresponding allele, determine a comprehensive paternity index from the paternity indices of the matched target SNP sites, and analyze whether the different samples are in a paternity relationship according to the comprehensive paternity index.
The other kinship identification module 246 is used for analyzing kinship relationships other than paternity according to the number of unmatched target SNP sites.
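The per-site and comprehensive paternity index computed by the paternity relationship identification module 245 could look like the sketch below. The formula used here is the standard two-person likelihood ratio (tested parent and child only, no second parent typed), which is an assumption: this passage does not state the exact formula. The comprehensive index is taken as the product over matched sites, and all names are illustrative.

```python
import math
from typing import Dict, Iterable, Tuple

Genotype = Tuple[str, str]


def paternity_index(child: Genotype, parent: Genotype, freq: Dict[str, float]) -> float:
    """Likelihood ratio X / Y at one site.

    X = P(child genotype | tested person is the parent)
    Y = P(child genotype | a random person is the parent)
    Returns 0.0 when the pair shares no allele (an unmatched site).
    """
    c1, c2 = child
    x = 0.0
    for transmitted in parent:  # each parental allele is passed on with probability 1/2
        if c1 == c2:
            if transmitted == c1:
                x += 0.5 * freq[c1]  # the other allele comes from the population
        elif transmitted == c1:
            x += 0.5 * freq[c2]
        elif transmitted == c2:
            x += 0.5 * freq[c1]
    y = freq[c1] ** 2 if c1 == c2 else 2 * freq[c1] * freq[c2]
    return x / y


def comprehensive_index(pairs: Iterable[Tuple[Genotype, Genotype]], freq: Dict[str, float]) -> float:
    """Product of the per-site indices over matched sites only."""
    values = [paternity_index(child, parent, freq) for child, parent in pairs]
    return math.prod(v for v in values if v > 0.0)
```

With both allele frequencies at 0.5, a site where child and tested parent are both AA contributes 2, while a site where either is heterozygous contributes 1, so homozygous matches carry most of the evidence.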
Based on the embodiments described above, the present invention further provides a computer device for identity relationship authentication, which has a processor and a memory, where the memory stores a computer program, and the processor executes the computer program to implement the steps of the identity relationship authentication method according to any one of the embodiments described above.
It will be understood by those skilled in the art that all or part of the processes of the above methods may be implemented by a computer program, which may be stored in a non-volatile computer-readable storage medium, and in the embodiments of the present invention, the program may be stored in the storage medium of a computer system and executed by at least one processor in the computer system to implement the processes of the embodiments including the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Accordingly, the present invention also provides a computer storage medium for identity relationship identification, on which a computer program is stored; when executed, the computer program implements the steps of the identity relationship identification method according to any of the above embodiments.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only several implementations of the present invention. Their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (10)

1. An identity relationship identification method is characterized by comprising the following steps:
step S1: obtaining sample mutation information obtained by comparing and analyzing sequencing results;
step S2: retrieving a plurality of target SNP sites one by one from the mutation sites of the sample mutation information to obtain mutation and sequencing information, including the genotype, of each target SNP site; for a target SNP that can be found among the mutation sites of the sample mutation information, the genotype of the target SNP site is determined to be a homozygote or heterozygote inconsistent with the reference genotype, and the obtained mutation and sequencing information comprises the genotype, allele population frequency, mutation quality and sequencing coverage of the target SNP site; for a target SNP that cannot be found among the mutation sites of the sample mutation information, the genotype of the target SNP site is a homozygote consistent with the reference genotype, and the obtained mutation and sequencing information comprises the genotype, allele population frequency and sequencing coverage of the target SNP site;
step S3: selecting a target SNP site meeting preset requirements from a plurality of target SNP sites and mutation and sequencing information thereof to obtain identification information;
step S4: comparing the genotypes of the corresponding target SNP sites in the identification information of different samples, and identifying the identity relationship of the different samples;
step S41: judging whether individual identification is carried out on different samples, if so, comparing the genotypes of all corresponding target SNP sites of the different samples, and analyzing whether the different samples belong to the same individual according to a comparison result;
step S42: judging whether paternity testing is to be performed on different samples, and if so, calculating the paternity index of each matched target SNP site according to the genotype of the target SNP site and the corresponding allele population frequency, determining a comprehensive paternity index from the paternity indices of the matched target SNP sites, and analyzing whether the different samples are in a paternity relationship according to the comprehensive paternity index; a matched target SNP site is a target SNP site at which the different samples share at least one allele;
step S43: judging whether kinship identification other than paternity is to be performed on different samples, and if so, analyzing the kinship relationship other than paternity according to the number of unmatched target SNP sites; an unmatched target SNP site is a target SNP site at which both alleles differ between the different samples;
the SNP is a single nucleotide polymorphism.
2. The method of claim 1, wherein the sample mutation information comprises mutation sites, mutation frequency, and mutation quality.
3. The method of claim 1, wherein each target SNP site is located in an autosomal exon or functional non-coding region and has an allele population frequency between 0.45 and 0.55.
4. The identity relationship identification method of claim 1, wherein for paternity relationship identification, the number of target SNP sites is required to be not less than 100;
for kinship identification of cousins, the number of target SNP sites is required to be not less than 720;
for kinship identification of grandparent-grandchild, uncle-nephew and similar relationships, the number of target SNP sites is required to be not less than 480;
for kinship identification of siblings, the number of target SNP sites is required to be not less than 240;
for individual identification, the number of target SNP sites is required to be not less than 50.
5. The identity relationship identification method of claim 1, wherein in step S3, satisfying the preset requirement means that the sequencing coverage is not less than 30 reads and the mutation quality meets the default quality control criteria of GATK (the Genome Analysis Toolkit software).
6. The identity relationship identification method of claim 5, wherein the default quality control criteria of GATK are QD > 2.0, MQ > 40.0, FS < 60.0, HaplotypeScore < 60.0, MQRankSum > -12.5 and ReadPosRankSum > -8.0.
7. The identity relationship identification method of claim 1, wherein the target SNP sites include the following sites:
[The table of target SNP sites is provided as images in the original publication (Figures FDA0002372351280000031, FDA0002372351280000041, FDA0002372351280000051 and FDA0002372351280000061) and is not reproduced in the text.]
wherein the reference sequence is hg19.
8. An apparatus for authenticating identity relationship, comprising:
the mutation information acquisition module is used for acquiring sample mutation information obtained by comparing and analyzing the sequencing result;
the target SNP information retrieval module is used for retrieving a plurality of target SNP sites one by one from the mutation sites of the sample mutation information to obtain mutation and sequencing information, including the genotype, of each target SNP site; for a target SNP that can be found among the mutation sites of the sample mutation information, the genotype of the target SNP site is determined to be a homozygote or heterozygote inconsistent with the reference genotype, and the obtained mutation and sequencing information comprises the genotype, allele population frequency, mutation quality and sequencing coverage of the target SNP site; for a target SNP that cannot be found among the mutation sites of the sample mutation information, the genotype of the target SNP site is a homozygote consistent with the reference genotype, and the obtained mutation and sequencing information comprises the genotype, allele population frequency and sequencing coverage of the target SNP site;
the identification information selection module is used for selecting a target SNP site meeting preset requirements from a plurality of target SNP sites and mutation and sequencing information of the target SNP site to obtain identification information; and
the identity relationship identification module is used for comparing the genotypes of the corresponding target SNP sites in the identification information of different samples and identifying the identity relationship of the different samples;
the identity relationship identification module comprises a first judgment module, an individual identification module, a matching condition statistics module, a second judgment module, a paternity relationship identification module and an other kinship identification module;
the first judgment module is used for judging whether individual identification or kinship identification is to be performed on different samples;
the individual identification module is used for comparing the genotypes of all corresponding target SNP sites of different samples and analyzing whether the different samples belong to the same individual according to the comparison result;
the matching condition statistics module is used for counting the number of matched target SNP sites and/or the number of unmatched target SNP sites among the corresponding target SNP sites of different samples;
the second judgment module is used for judging whether paternity identification or kinship identification other than paternity is to be performed on different samples;
the paternity relationship identification module is used for calculating the paternity index of each matched target SNP site according to the genotype of the target SNP site and the population frequency of the corresponding allele, determining a comprehensive paternity index from the paternity indices of the matched target SNP sites, and analyzing whether different samples are in a paternity relationship according to the comprehensive paternity index;
the other kinship identification module is used for analyzing kinship relationships other than paternity according to the number of unmatched target SNP sites;
the SNP is a single nucleotide polymorphism.
9. A computer device having a processor and a memory, the memory storing a computer program, the processor implementing the steps of the identity relationship authentication method according to any one of claims 1 to 7 when executing the computer program.
10. A computer storage medium having a computer program stored thereon, wherein the computer program when executed implements the steps of the identity relationship authentication method of any one of claims 1 to 7.
CN201810490416.4A 2018-05-21 2018-05-21 Identity relationship identification method, device, equipment and storage medium Active CN108647495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810490416.4A CN108647495B (en) 2018-05-21 2018-05-21 Identity relationship identification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810490416.4A CN108647495B (en) 2018-05-21 2018-05-21 Identity relationship identification method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108647495A CN108647495A (en) 2018-10-12
CN108647495B true CN108647495B (en) 2020-04-10

Family

ID=63757290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810490416.4A Active CN108647495B (en) 2018-05-21 2018-05-21 Identity relationship identification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108647495B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3938536A4 (en) * 2019-03-12 2023-03-08 Crown Bioscience (Suzhou) Inc. Methods and compositions for identification of tumor models
WO2023219214A1 (en) * 2022-05-12 2023-11-16 Republic Of Korea(National Forensic Service Director Ministry Of Interior And Safety) Snps panel for kinship identification in korean and use thereof
CN115346594B (en) * 2022-08-24 2023-09-05 温州医科大学 Ancestor relationship identification method, system, equipment and medium without raw mother participation
CN117423382A (en) * 2023-10-21 2024-01-19 云准医药科技(广州)有限公司 Single-cell barcode identity recognition method based on SNP polymorphism

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101539967A (en) * 2008-12-12 2009-09-23 深圳华大基因研究院 Method for detecting mononucleotide polymorphism
WO2016049993A1 (en) * 2014-09-30 2016-04-07 深圳华大基因科技有限公司 Method and system for testing identity relations among multiple biological samples
CN107217095A (en) * 2017-06-15 2017-09-29 广东腾飞基因科技股份有限公司 The mankind's paternity identification multiple PCR primer group and detection method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101539967A (en) * 2008-12-12 2009-09-23 深圳华大基因研究院 Method for detecting mononucleotide polymorphism
WO2016049993A1 (en) * 2014-09-30 2016-04-07 深圳华大基因科技有限公司 Method and system for testing identity relations among multiple biological samples
CN106715712A (en) * 2014-09-30 2017-05-24 深圳华大基因科技有限公司 Method and system for testing identity relations among multiple biological samples
CN107217095A (en) * 2017-06-15 2017-09-29 广东腾飞基因科技股份有限公司 The mankind's paternity identification multiple PCR primer group and detection method

Also Published As

Publication number Publication date
CN108647495A (en) 2018-10-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20181012

Assignee: Zhengzhou Jinyu Clinical Laboratory Center Co.,Ltd.

Assignor: GUANGZHOU KINGMED DIAGNOSTICS GROUP Co.,Ltd.

Contract record no.: X2021980010019

Denomination of invention: Identification method, device, equipment and storage medium

Granted publication date: 20200410

License type: Common License

Record date: 20210928

EC01 Cancellation of recordation of patent licensing contract

Assignee: Zhengzhou Jinyu Clinical Laboratory Center Co.,Ltd.

Assignor: GUANGZHOU KINGMED DIAGNOSTICS GROUP Co.,Ltd.

Contract record no.: X2021980010019

Date of cancellation: 20220922

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20181012

Assignee: Zhengzhou Jinyu Clinical Laboratory Center Co.,Ltd.

Assignor: GUANGZHOU KINGMED DIAGNOSTICS GROUP Co.,Ltd.

Contract record no.: X2022980016522

Denomination of invention: Identification method, device, equipment and storage medium

Granted publication date: 20200410

License type: Common License

Record date: 20220927