CN117965748A

CN117965748A - Identification method for screening monozygotic twins based on SNV and INDEL

Info

Publication number: CN117965748A
Application number: CN202410029423.XA
Authority: CN
Inventors: 刘希玲; 孙伟芬; 王紫薇; 韩丁丁; 黄傲; 冯旗; 温姝博; 李辉; 姜磊
Original assignee: Academy Of Forensic Science
Current assignee: Academy Of Forensic Science
Priority date: 2024-01-09
Filing date: 2024-01-09
Publication date: 2024-05-03

Abstract

The invention belongs to the technical field of biological genetics, and in particular relates to an identification method for screening monozygotic twins based on SNV and INDEL. The method comprises the following steps: S1: extracting DNA from the peripheral blood of the monozygotic twins; S2: preparing a whole genome sequencing pre-library from the sample DNA; S3: whole genome sequencing; S4: processing the whole genome sequencing data and screening for SNV and INDEL sites that differ between the monozygotic twins; S5: designing specific primers for the sites from S4, and constructing a sequencing library through PCR amplification and purification; S6: multiplex PCR targeted resequencing; S7: analyzing the targeted resequencing data and verifying the authenticity of the sites detected in S4; S8: performing Sanger sequencing on target sites that failed validation in S7 or whose PCR product is larger than 350 bp. The identification method provided by the invention can be widely applied in judicial fields such as criminal case investigation.

Description

Identification method for screening monozygotic twins based on SNV and INDEL

Technical Field

The invention belongs to the technical field of biological genetics, and in particular relates to an identification method for screening monozygotic twins based on SNV and INDEL.

Background

Monozygotic twins arise from the division of a single fertilized egg, produced when one egg is fertilized by one sperm, and theoretically carry identical genetic information. Traditional forensic DNA typing technology therefore cannot distinguish individuals in complex situations such as monozygotic twins.

In the prior art, the identification of monozygotic twins mainly follows three schemes:

The first scheme: test materials such as peripheral blood, oral swabs and intestinal biopsy samples are collected from monozygotic twins, a whole genome DNA methylation map is obtained using traditional whole genome bisulfite sequencing, and differentially methylated regions are then identified by bioinformatic data analysis. Furthermore, in 2018 Spanners found differentially methylated CpG sites in the saliva DNA of monozygotic twins using PCR high-resolution melting (HRM). This technique detects DNA sequence variation by measuring the melting behaviour of double-stranded DNA during heating, and is better suited to forensic laboratories.

The second scheme: copy number variation (CNV) refers to an increase or decrease in the copy number of certain large fragments of the genome. It can regulate the plasticity of organisms by changing gene dosage, transcript structure and so on, and is one of the main genetic bases of individual phenotypic diversity and of the evolution of population adaptability. CNV is a structural variation of the genome with strong polymorphism and relative instability, and its mutation rate is 100 to 10000 times higher than that of single base substitution. To date, several research teams have used comparative genomic hybridization chips or SNP chips in combination with algorithms such as Birdsuite, PennCNV and QuantiSNP to detect differences in somatic CNV typing in peripheral blood and oral epithelial cell DNA of monozygotic twins.

The third scheme: compared with nuclear DNA, mitochondrial DNA (mtDNA) has a high copy number, a small genome and a high mutation rate caused by the lack of DNA repair mechanisms. Small point mutation differences have therefore been successfully detected in the mitochondrial genomes of monozygotic twins using the Illumina HiSeq 2000 sequencing platform or long-read single-molecule real-time (SMRT) sequencing.

However, each of the above existing technical solutions has certain drawbacks when applied in practice to the identification of monozygotic twins. In particular, traditional whole genome bisulfite sequencing usually requires bisulfite treatment or enzymatic digestion of the DNA sample, which is labour-intensive and carries a risk of DNA damage. Furthermore, there is growing evidence that the methylation signatures associated with monozygotic twins may change over their lifetime because of environmental or aging factors (i.e., epigenetic drift). For CNV, when chip technology is used to detect differences in CNV typing between monozygotic twins, the number of CNV differences is typically very small and only a very small number of fragments pass qPCR verification. Finally, mtDNA is maternally inherited and does not undergo recombination, so the mtDNA of the offspring derives essentially entirely from the egg cell; this exclusive mode of inheritance makes it difficult for mtDNA to distinguish monozygotic twins from the same maternal line.

Therefore, an identification method for monozygotic twins that can detect and identify DNA samples of low content, and that is more efficient and more economical, is of great significance.

Disclosure of Invention

The invention aims to provide an identification method for screening monozygotic twins based on SNV and INDEL, so as to solve the problems described in the background.

In order to solve the technical problems, the invention provides the following technical scheme:

An identification method for identifying monozygotic twins based on SNV and INDEL, comprising the following steps:

S1: extracting DNA from the peripheral blood of the monozygotic twins to obtain sample DNA;

S2: preparing a small fragment library A from the sample DNA, and detecting the library concentration and fragment size of small fragment library A;

S3: performing whole genome sequencing on the small fragment library A obtained in step S2 to obtain sequencing data;

S4: processing the whole genome sequencing data, screening for SNV and INDEL sites that differ between the monozygotic twins, and taking these SNV and INDEL sites as target sites;

S5: designing specific primers for the target sites from step S4, and constructing a sequencing library through PCR amplification and purification;

S6: multiplex PCR targeted resequencing;

S7: analyzing the targeted resequencing data, and verifying the authenticity of the target sites from step S4;

S8: performing Sanger sequencing on target sites that failed validation in step S7 or whose PCR product is larger than 350 bp.
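The routing rule in step S8 can be illustrated with a minimal Python sketch. The `TargetSite` record, its field names and the example sites are hypothetical and are not taken from the invention; only the 350 bp limit and the "failed validation" condition come from the text above.

```python
from dataclasses import dataclass

MAX_AMPLICON_BP = 350  # size limit stated in step S8


@dataclass
class TargetSite:
    name: str              # hypothetical label for a target site
    amplicon_bp: int       # hypothetical field: length of the PCR product
    verified_by_ngs: bool  # hypothetical field: outcome of step S7


def needs_sanger(site: TargetSite) -> bool:
    """Return True when step S8 applies to the site."""
    return (not site.verified_by_ngs) or site.amplicon_bp > MAX_AMPLICON_BP


# Illustrative sites only; the values are made up.
sites = [
    TargetSite("site_1", 280, True),   # verified, short amplicon
    TargetSite("site_2", 410, True),   # amplicon longer than 350 bp
    TargetSite("site_3", 300, False),  # failed validation in S7
]
sanger_queue = [s.name for s in sites if needs_sanger(s)]
```

Under these assumptions, only sites that are both verified in S7 and no longer than 350 bp skip the Sanger step.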

Preferably, S2 specifically comprises the following steps:

(I) fragmenting the sample DNA, performing end repair and A-tailing, and screening to obtain blunt-ended DNA fragments;

(II) ligating adapters to obtain adapter-ligated DNA fragments;

(III) performing PCR amplification and purification to obtain a small fragment library, and detecting the library concentration and fragment size of the small fragment library.

Preferably, S4 specifically comprises the following steps:

(I) performing quality control on the raw sequencing data, removing low-quality reads, and aligning the reads to the human genome sequence;

(II) marking duplicate reads and performing base quality score recalibration;

(III) detecting somatic SNVs and INDELs, and removing variant sites located in known genomic structures, such as repeat structures, copy number variation regions, long homopolymers rich in polymorphisms, and DNA sequences with extremely high GC content;

(IV) from the filtering result of the previous step, further screening for SNV and INDEL sites supported by at least 5 reads, and taking these SNV and INDEL sites as target sites;

(V) performing functional annotation using ANNOVAR software.
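Steps (III) and (IV) above amount to an interval exclusion followed by a read-count threshold. The following Python sketch shows one way this could be done; the half-open interval convention, the dictionary keys `chrom`, `pos` and `reads`, and all example coordinates are assumptions made for illustration only.

```python
from bisect import bisect_right


def build_index(regions):
    """Merge (chrom, start, end) half-open intervals per chromosome."""
    by_chrom = {}
    for chrom, start, end in sorted(regions):
        merged = by_chrom.setdefault(chrom, [])
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # extend overlapping interval
        else:
            merged.append([start, end])
    return {c: ([s for s, _ in iv], [e for _, e in iv]) for c, iv in by_chrom.items()}


def in_excluded(index, chrom, pos):
    """True when pos falls inside an excluded interval on chrom."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    return i >= 0 and pos < ends[i]


def select_targets(candidates, index, min_reads=5):
    """Keep candidates outside excluded regions with at least min_reads reads."""
    return [c for c in candidates
            if not in_excluded(index, c["chrom"], c["pos"]) and c["reads"] >= min_reads]


# Made-up regions and candidate sites.
regions = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 60)]
candidates = [
    {"chrom": "chr1", "pos": 250, "reads": 12},  # inside an excluded region
    {"chrom": "chr1", "pos": 400, "reads": 3},   # too few reads
    {"chrom": "chr2", "pos": 70, "reads": 8},
    {"chrom": "chr3", "pos": 10, "reads": 5},
]
selected = select_targets(candidates, build_index(regions))
```

Merging the intervals first keeps each lookup to a single binary search per site.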

Preferably, S5 specifically comprises the following steps:

(I) designing specific primers for the target sites using Premier5 software, and verifying their specificity by agarose gel electrophoresis;

(II) mixing all forward primers and all reverse primers at equal concentrations to form a single forward primer mixture and a single reverse primer mixture;

(III) amplifying the fragments containing the target sites, purifying the amplified PCR products with magnetic beads, repairing the ends of the PCR products, and screening to obtain blunt-ended DNA fragments;

(IV) ligating adapters to obtain adapter-ligated DNA fragments;

(V) performing PCR amplification and purification on the adapter-ligated DNA fragments to obtain small fragment library B, detecting the library concentration and fragment size of small fragment library B, and diluting it to obtain the sequencing library.

Preferably, the concentration of the single forward primer mixture and of the single reverse primer mixture is 10 μM.

Preferably, in step (V) of S5, after the library concentration and fragment size of small fragment library B have been detected, the library is diluted to 8 μM to give the final sequencing library.

Preferably, S7 specifically comprises the following steps:

(I) converting the raw sequencing data from BCL format to FASTQ format;

(II) removing adapter sequences and low-quality bases at the 3' end;

(III) aligning the processed sequencing data to the human genome, sorting by genome coordinate and building an index;

(IV) genotyping all target sites.
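Step (II) above removes low-quality bases from the 3' end of each read. As a simplified illustration only (a per-base cut, not the algorithm of any particular trimming tool), the idea can be sketched in Python as follows; the Phred+33 encoding and the Q20 cut-off are assumptions.

```python
def phred33(qual):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in qual]


def trim_3prime(seq, qual, min_q=20):
    """Drop trailing bases whose quality is below min_q (assumed cut-off)."""
    scores = phred33(qual)
    keep = len(scores)
    while keep > 0 and scores[keep - 1] < min_q:
        keep -= 1
    return seq[:keep], qual[:keep]


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
trimmed = trim_3prime("ACGTACGT", "IIIII#I#")
```

Only the trailing low-quality base is removed here; the low-quality base in the interior of the read is kept because trimming stops at the first base that meets the cut-off.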

Preferably, S8 specifically comprises the following steps:

(I) for each site that failed verification in S7 or whose PCR product is larger than 350 bp, designing specific primers using Premier5 software, and verifying their specificity by agarose gel electrophoresis;

(III) amplifying the target sites by PCR reaction and purifying the products;

(IV) subjecting the purified products to Sanger sequencing.

Preferably, the identification method for identifying monozygotic twins can be applied in judicial fields such as criminal case investigation.

Compared with the prior art, the invention has the following beneficial effects: the short-fragment sequence variation identification method and the established sequencing data analysis workflow provided by the invention offer high identification efficiency, better cost effectiveness and reliable results; in addition, the identification method can detect SNVs present at low content, and thus addresses the practical problem of low DNA content in trace test materials, degraded samples and the like; therefore, the identification method for monozygotic twins provided by the invention can be widely applied in judicial fields such as forensic examination in criminal cases.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and constitute a part of this specification; together with the embodiments they serve to explain the invention and do not limit it. In the drawings:

FIG. 1 is an experimental flow diagram of an embodiment;

FIG. 2 shows the number of SNVs detected in the monozygotic twin samples by whole genome sequencing in the example;

FIG. 3 shows the karyotype distribution of the SNVs detected in the monozygotic twin samples by whole genome sequencing in the example;

FIG. 4 shows the overlap of the SNVs detected in the monozygotic twin samples by whole genome sequencing in the example;

FIG. 5 is an IGV view of multiplex PCR targeted resequencing in the example, taking chr1:104081131, A > G as an example;

FIG. 6 shows Sanger sequencing verification in the example, taking chr4:185690326, C > T as an example;

FIG. 7 shows the results of the DNA sensitivity analysis in the example, taking chr20:38761028, G > A as an example.

Detailed Description

The following describes preferred implementations of the embodiments of the invention. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art without creative effort, and without departing from the principles of the embodiments of the invention, fall within the scope of protection of the invention.

The test methods used in the examples are conventional methods unless otherwise specified; materials, reagents and the like used, unless otherwise indicated, are all commercially available.

Example 1: monozygotic twins were identified using free DNA from the peripheral blood of a pair of female monozygotic twins as the sample; the specific process is as follows:

S1: extracting DNA from the peripheral blood of the monozygotic twins to obtain sample DNA; the specific operation is as follows: peripheral blood was collected from monozygotic twins aged 27 and 33 years and stored in EDTA anticoagulant tubes; sample DNA was extracted from the blood samples using the commercially available QIAamp DNA Blood Mini Kit (Qiagen, Hilden, Germany), and the DNA concentration was determined using the Qubit dsDNA HS Assay Kit (ThermoFisher Scientific, Carlsbad, USA);

S2: the sample DNA was fragmented to an average size of 350 bp using Covaris ultrasound equipment (Covaris, Woburn, MA, USA); small fragment library A was prepared using the Illumina TruSeq Nano DNA Kit (Illumina, San Diego, CA, USA) according to the manufacturer's instructions, with an initial DNA input of 100 ng;

specifically, the fragmented DNA was purified with magnetic beads and then subjected to end repair and A-tailing to obtain blunt-ended DNA fragments; adapters were ligated, and PCR amplification and purification were performed to obtain small fragment library A; the concentration of small fragment library A was measured using the Quant-iT PicoGreen dsDNA detection kit (Invitrogen, Carlsbad, CA, USA), qPCR quantification was performed using the Universal KAPA library quantification kit, and purity analysis was performed using an Agilent 2100;

S3: whole genome sequencing: paired-end sequencing of the small fragment library A from S2 was performed on an Illumina NovaSeq 6000 to obtain sequencing data in FASTQ format, with an average sequencing depth of not less than 30× and a read length of 150 bp;
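The relation between sequencing depth, read length and read count in S3 can be checked with simple arithmetic. The Python sketch below assumes a haploid human genome size of about 3.1 Gb, which is not a figure given in this document; only the 30× depth and 150 bp read length come from the text.

```python
import math

GENOME_SIZE_BP = 3_100_000_000  # assumption: approximate human genome size


def reads_needed(target_depth, read_length_bp, genome_size_bp=GENOME_SIZE_BP):
    """Number of reads required to reach the target mean depth."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


def mean_depth(n_reads, read_length_bp, genome_size_bp=GENOME_SIZE_BP):
    """Mean depth = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


n_reads = reads_needed(30, 150)       # reads for 30x at 150 bp
depth_check = mean_depth(n_reads, 150)
```

With these assumed values, roughly 620 million 150 bp reads (about 310 million read pairs) correspond to a mean depth of 30×.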

S4: processing the whole genome sequencing data, screening for SNV and INDEL sites that differ between the monozygotic twins, and taking these SNV and INDEL sites as target sites; the specific operation is as follows:

quality control was performed on the sequencing data obtained in S3 using FastQC to remove low-quality bases; the results for the four samples are as follows:

Table one:

Sample	Total reads (M)	% ≥ Q30
Twin A_27 years old	31.304	89.19
Twin B_27 years old	31.109	89.41
Twin A_33 years old	31.097	89.08
Twin B_33 years old	31.099	89.27

The filtered sequences were aligned to the human reference genome hg38 using BWA-MEM (0.7.17) to obtain BAM files; duplicate reads in the BAM files were marked using the MarkDuplicates command of Picard (2.19.0); base quality scores were recalibrated using the BaseRecalibrator command of GATK (3.8.1) in combination with known SNV and INDEL sites from the 1000 Genomes Project (ftp://ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz) and the dbSNP database (ftp://ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz);
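The preprocessing chain described above (alignment, duplicate marking, base quality score recalibration) could be assembled as a list of command lines, for example as in the Python sketch below. All file names and jar paths are hypothetical, the intermediate `samtools sort` step is an assumption that is not stated in this paragraph, and the argument style shown is the legacy Picard and GATK 3 syntax; the commands are only constructed here, not executed.

```python
def bwa_mem_cmd(ref, fq1, fq2, threads=8):
    # bwa mem writes SAM to standard output; the caller redirects or pipes it.
    return ["bwa", "mem", "-t", str(threads), ref, fq1, fq2]


def sort_cmd(sam, bam):
    # Assumed step: coordinate sorting before duplicate marking.
    return ["samtools", "sort", "-o", bam, sam]


def mark_duplicates_cmd(bam_in, bam_out, metrics, picard_jar="picard.jar"):
    return ["java", "-jar", picard_jar, "MarkDuplicates",
            f"I={bam_in}", f"O={bam_out}", f"M={metrics}"]


def base_recalibrator_cmd(ref, bam, known_sites, table,
                          gatk_jar="GenomeAnalysisTK.jar"):
    cmd = ["java", "-jar", gatk_jar, "-T", "BaseRecalibrator", "-R", ref, "-I", bam]
    for vcf in known_sites:  # one -knownSites argument per resource file
        cmd += ["-knownSites", vcf]
    return cmd + ["-o", table]


def preprocessing_plan(sample, ref="hg38.fa"):
    """Return the ordered commands for one sample (hypothetical file names)."""
    known = ["Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
             "dbsnp_146.hg38.vcf.gz"]
    return [
        bwa_mem_cmd(ref, f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz"),
        sort_cmd(f"{sample}.sam", f"{sample}.sorted.bam"),
        mark_duplicates_cmd(f"{sample}.sorted.bam", f"{sample}.dedup.bam",
                            f"{sample}.metrics.txt"),
        base_recalibrator_cmd(ref, f"{sample}.dedup.bam", known,
                              f"{sample}.recal.table"),
    ]


plan = preprocessing_plan("twinA_27")
```

Keeping the commands as argument lists makes them easy to log, review or pass to a job scheduler before anything is run.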

Somatic SNVs and INDELs were detected according to the GATK standard procedure; specifically, a Panel of Normals (PoN) file was first created from in-house whole genome sequencing data using Mutect2; Mutect2 was then run in tumor-only mode, and somatic SNVs and INDELs were detected in combination with the PoN and germline polymorphism information (gnomAD); high-quality variant sites were retained using FilterMutectCalls, and variant sites located in known genomic structures were further excluded, including repeat structures, copy number variations, long homopolymers rich in polymorphisms, and sequences with very high GC content; SNV and INDEL sites supported by at least 5 reads on each allele were selected as target sites for subsequent analysis; functional annotation was performed using ANNOVAR;
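The criterion of at least 5 reads on each allele can be illustrated with a short Python sketch that reads per-allele depths from a VCF-style sample column. It assumes the sample column carries an `AD` entry (reference and alternate read counts, comma separated); the example records are made up.

```python
def allele_depths(format_field, sample_field):
    """Extract per-allele read counts from a VCF FORMAT/sample pair."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    ad = dict(zip(keys, values))["AD"]
    return [int(x) for x in ad.split(",")]


def has_allele_support(depths, min_reads=5):
    """True when every allele (reference and alternate) has min_reads or more."""
    return len(depths) >= 2 and all(d >= min_reads for d in depths)


# Made-up sample columns for two candidate sites.
site_ok = allele_depths("GT:AD:AF:DP", "0/1:18,7:0.28:25")
site_weak = allele_depths("GT:AD:AF:DP", "0/1:30,3:0.09:33")
keep_ok = has_allele_support(site_ok)
keep_weak = has_allele_support(site_weak)
```

In this sketch the second site is dropped because its alternate allele is supported by only 3 reads.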

S5: the target sites screened in step S4 were verified by targeted resequencing; specifically, specific primers were designed using Premier5 software and their specificity was verified by agarose gel electrophoresis; all forward primers and all reverse primers were mixed at equal concentrations to form a single forward primer mixture and a single reverse primer mixture, each at a final concentration of 10 μM; the target sites were amplified by a two-step PCR reaction and adapters were ligated, with the reaction system prepared according to Table two;

Table two:

The reaction system was mixed thoroughly, centrifuged briefly, placed in a PCR instrument, and run according to the program in Table three;

Table three:

The four multiplex-amplified PCR samples were combined at equimolar concentration into a mixed sample to obtain small fragment library B; fragment size (< 350 bp) was checked on a 2% agarose gel, and the library was quantified with a Qubit fluorometer (ThermoFisher Scientific, Carlsbad, USA); small fragment library B was diluted to 8 μM according to the qPCR result of the previous step to obtain the final sequencing library;

S6: multiplex PCR targeted resequencing: paired-end deep sequencing of the sequencing library from S5 was performed on an Illumina NovaSeq 6000 to obtain sequencing data in BCL format;

S7: targeted resequencing data analysis: the sequencing data from S6 (BCL format) were converted to FASTQ format using bcl2fastq software; adapter sequences and low-quality bases at the 3' end were removed using fastp software, and the filtering results for the four samples are shown in Table four;

Table four:

The processed FASTQ files were aligned to the human reference genome hg38 using the BWA-MEM algorithm to obtain BAM files; the BAM files were sorted by genome coordinate and indexed using samtools to obtain processed BAM files; all target sites were genotyped with GATK HaplotypeCaller (--genotyping-mode GENOTYPE_GIVEN_ALLELES) to obtain information such as the genotype, sequencing depth and mutation frequency of each site;
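The mutation frequency reported for each site is the fraction of reads carrying the variant allele, and a site is informative when that fraction differs clearly between the two twins. A minimal Python sketch follows; the two thresholds (`present`, `absent`) and the read counts are hypothetical and are not values stated in this document.

```python
def variant_fraction(ref_reads, alt_reads):
    """Mutation frequency = alt reads / total reads; None when depth is zero."""
    depth = ref_reads + alt_reads
    return alt_reads / depth if depth else None


def is_discordant(frac_a, frac_b, present=0.20, absent=0.01):
    """True when the variant is clearly present in one twin and absent in the other.

    The thresholds are assumed values chosen for illustration.
    """
    if frac_a is None or frac_b is None:
        return False
    low, high = sorted((frac_a, frac_b))
    return low <= absent and high >= present


# Made-up read counts at one target site.
twin_a = variant_fraction(2000, 0)     # no variant reads
twin_b = variant_fraction(1050, 950)   # close to heterozygous
site_differs = is_discordant(twin_a, twin_b)
```

A site with similar fractions in both twins, or with no coverage in either, is not counted as a difference under these assumptions.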

S8: Sanger sequencing was performed on target sites that failed verification in S7 or whose PCR product is larger than 350 bp; specifically, primers specific to each target site were designed using Premier5 software, and their specificity was verified by agarose gel electrophoresis; all forward primers and all reverse primers were mixed at equal concentrations to form a single forward primer mixture and a single reverse primer mixture, each at a final concentration of 10 μM; the target sites were amplified by a two-step PCR reaction, with the reaction system prepared according to Table five;

Table five:

Reaction component	Volume (μL)
DNA (1 ng)	1
2× PCR amplification mixture	10
Site-specific forward primer (10 μM)	1
Site-specific reverse primer (10 μM)	1
PCR sterile water	7
Total volume	20
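When several sites or samples are amplified in parallel, the per-reaction volumes of Table five can be scaled into a master mix. The Python sketch below is an illustration only; the 10% pipetting overage and the choice to add template DNA separately to each tube are assumptions, not part of the described procedure.

```python
# Per-reaction volumes in microlitres, taken from Table five.
REACTION_UL = {
    "DNA (1 ng)": 1.0,
    "2x PCR amplification mixture": 10.0,
    "Site-specific forward primer (10 uM)": 1.0,
    "Site-specific reverse primer (10 uM)": 1.0,
    "PCR sterile water": 7.0,
}
TEMPLATE = "DNA (1 ng)"  # assumed to be added to each tube individually


def master_mix(n_reactions, overage=0.10):
    """Scale every non-template component for n reactions plus overage."""
    factor = n_reactions * (1 + overage)
    return {name: round(vol * factor, 2)
            for name, vol in REACTION_UL.items() if name != TEMPLATE}


total_per_reaction = sum(REACTION_UL.values())  # should equal 20 uL
mix_for_six = master_mix(6)
```

Each tube would then receive 19 μL of master mix plus 1 μL of template DNA, matching the 20 μL total of Table five.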

The reaction system was mixed thoroughly, centrifuged briefly, and placed in a GeneAmp PCR System 9700 thermal cycler (ThermoFisher Scientific, Carlsbad, USA) to run according to the program shown in Table six;

Table six:

The amplification products were purified using a QIAquick PCR purification kit (Qiagen, hilden, germany); the purified product was sequenced on an ABI 3730xl capillary sequencer according to standard protocols, equipped with a 50cm capillary and POP7 polymer;

In total, 9 sites differing between the monozygotic twins were identified; the details are shown in Table seven:

Table seven:

Note: * indicates that the SNV was verified by both multiplex PCR targeted resequencing and Sanger sequencing.

Double-blind test to verify the effectiveness of individual identification: six samples were drawn at random from the DNA samples of the 27-year-old monozygotic twins, and PCR reactions were performed for the 9 SNV sites identified above; DNA preparation and PCR analysis were carried out independently by two investigators, neither of whom had any knowledge of the sample information; the blind test results are shown in Table eight:

Table eight:

Testing SNV

Sample_A

Sample_B

Sample_C

Sample_D

Sample_E

Sample_F

chr1:104081131_A>G

A

A/G

A

A/G

A

chr3:89767141_T>C

T

T/C

T

T/C

T

chr4:185690326_C>T

C

C/T

C

C/T

C

chr5:98825091_A>T

A

A/T

A

A/T

A

chr9:89734622_G>C

G

G/C

G

G/C

G

chr15:38160979_T>A

T

T/A

T

T/A

-

chr17:68016720_T>G

T/G

T

T/G

T

T/G

chr18:60663612_C>T

C/T

C

C/T

C

C/T

chr20:38763028_G>A

G/A

G

G/A

G

G/A

Monozygotic twin

A

B

A

B

A
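The assignment logic behind the blind test can be sketched in Python as a comparison of each blinded sample against the two genotype patterns that appear in Table eight. The function name, the use of "-" for a site without a call, and the rule that every called site must match are assumptions made for this illustration; the genotypes are the two patterns listed in the table.

```python
# (site, genotype pattern of twin A, genotype pattern of twin B), from Table eight.
SITES = [
    ("chr1:104081131_A>G", "A", "A/G"),
    ("chr3:89767141_T>C", "T", "T/C"),
    ("chr4:185690326_C>T", "C", "C/T"),
    ("chr5:98825091_A>T", "A", "A/T"),
    ("chr9:89734622_G>C", "G", "G/C"),
    ("chr15:38160979_T>A", "T", "T/A"),
    ("chr17:68016720_T>G", "T/G", "T"),
    ("chr18:60663612_C>T", "C/T", "C"),
    ("chr20:38763028_G>A", "G/A", "G"),
]


def assign_twin(observed):
    """Return "A" or "B" when all called sites match one twin, else None."""
    scores = {"A": 0, "B": 0}
    called = 0
    for site, geno_a, geno_b in SITES:
        genotype = observed.get(site, "-")
        if genotype == "-":  # no call at this site; skip it
            continue
        called += 1
        scores["A"] += genotype == geno_a
        scores["B"] += genotype == geno_b
    if called == 0:
        return None
    for twin, score in scores.items():
        if score == called:
            return twin
    return None


# A blinded sample with one missing call, following the twin A pattern.
sample = {site: geno_a for site, geno_a, _ in SITES}
sample["chr15:38160979_T>A"] = "-"
result = assign_twin(sample)
```

Because the two patterns differ at every site, a sample can never match both twins at once.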

DNA sensitivity analysis: genomic DNA from the 33-year-old monozygotic twins was diluted to produce 8 concentration gradients of 1, 0.5, 0.25, 0.125, 0.075, 0.05, 0.025 and 0.0125 ng/μL; 1 μL of DNA from each dilution was used in turn to construct a library; specifically, the target sites were amplified by a two-step PCR reaction, with the reaction system prepared according to Table nine;

Table nine:

The reaction system was mixed thoroughly, centrifuged briefly, and placed in a GeneAmp PCR System 9700 thermal cycler (ThermoFisher Scientific, Carlsbad, USA) to run according to the program in Table ten;

Table ten:

The amplification products were purified using a QIAquick PCR purification kit (Qiagen, hilden, germany); the purified product was sequenced on an ABI 3730xl capillary sequencer according to instructions;

According to the capillary electrophoresis result, when the DNA content is as low as 0.25ng, all 9 SNV sites can be detected; when further down to 0.075ng, 7 SNV sites remain detectable.
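The dilution series used in the sensitivity analysis can be tabulated with a few lines of Python. The concentrations, the 1 μL input volume and the two detection results are taken from the text; the assumption that the series is prepared from a 1 ng/μL stock, and the helper names, are illustrative only.

```python
CONCENTRATIONS_NG_PER_UL = [1, 0.5, 0.25, 0.125, 0.075, 0.05, 0.025, 0.0125]
INPUT_VOLUME_UL = 1   # volume of each dilution used for library construction
TOTAL_SITES = 9

# Detection results reported in the text: DNA input (ng) -> SNV sites detected.
REPORTED_DETECTION = {0.25: 9, 0.075: 7}


def input_mass_ng(conc_ng_per_ul, volume_ul=INPUT_VOLUME_UL):
    """DNA input mass = concentration x volume."""
    return conc_ng_per_ul * volume_ul


def dilution_factor(stock_ng_per_ul, target_ng_per_ul):
    """Fold dilution needed to reach the target concentration."""
    return stock_ng_per_ul / target_ng_per_ul


# Assumes a 1 ng/uL stock as the starting point of the series.
series = [(c, input_mass_ng(c), dilution_factor(1, c))
          for c in CONCENTRATIONS_NG_PER_UL]
detection_rate = {mass: n / TOTAL_SITES for mass, n in REPORTED_DETECTION.items()}
```

Under these assumptions the lowest point of the series corresponds to an 80-fold dilution, and 7 of 9 sites (about 78%) remain detectable at 0.075 ng.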

In conclusion, these results show that the 9 SNVs can effectively distinguish the monozygotic twins in the example, and that identification still succeeds when the DNA content is reduced.

Finally, it should be noted that: the foregoing description of the preferred embodiments of the present application is merely illustrative, and the scope of the present application is not limited thereto, since any changes or substitutions that would be easily contemplated by those skilled in the art within the scope of the present application shall fall within the scope of the present application; the embodiments of the present application and features in the embodiments may be combined with each other without conflict. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims

1. An identification method for identifying monozygotic twins based on SNV and INDEL, characterized in that the method comprises the following steps:

S6: multiplex PCR targeted resequencing;

S7: analyzing the targeted resequencing data, and verifying the authenticity of the target sites from step S4;

S8: performing Sanger sequencing on target sites that failed verification in step S7 or whose PCR product is larger than 350 bp.

2. The identification method for identifying monozygotic twins based on SNV and INDEL according to claim 1, characterized in that S2 specifically comprises the following steps:

(II) ligating adapters to obtain adapter-ligated DNA fragments;

3. The identification method for identifying monozygotic twins based on SNV and INDEL according to claim 1, characterized in that S4 specifically comprises the following steps:

(II) marking duplicate reads and performing base quality score recalibration;

(III) detecting somatic SNVs and INDELs, and removing variant sites located in known genomic structures;

(V) performing functional annotation using ANNOVAR software.

4. The identification method for identifying monozygotic twins based on SNV and INDEL according to claim 1, characterized in that S5 specifically comprises the following steps:

(IV) ligating adapters to obtain adapter-ligated DNA fragments;

5. The identification method for identifying monozygotic twins based on SNV and INDEL according to claim 4, characterized in that the concentration of the single forward primer mixture and of the single reverse primer mixture is 10 μM.

6. The identification method for identifying monozygotic twins based on SNV and INDEL according to claim 4, characterized in that in step (V) of S5, after the library concentration and fragment size of small fragment library B have been detected, the library is diluted to 8 μM to obtain the final sequencing library.

7. The identification method for identifying monozygotic twins based on SNV and INDEL according to claim 1, characterized in that S7 specifically comprises the following steps:

(I) converting the raw sequencing data from BCL format to FASTQ format;

(II) removing adapter sequences and low-quality bases at the 3' end;

(IV) genotyping all target sites.

8. The identification method for identifying monozygotic twins based on SNV and INDEL according to claim 1, characterized in that S8 specifically comprises the following steps:

(III) amplifying the target sites by PCR reaction and purifying the products;

(IV) subjecting the purified products to Sanger sequencing.

9. Use of the identification method for identifying monozygotic twins based on SNV and INDEL according to any one of claims 1-8, characterized in that the method can be used in the judicial field.