CN110129422B

CN110129422B - Method for analyzing mutation structure of repeated mutation disease of polynucleotide based on long-fragment PCR and single-molecule sequencing

Info

Publication number: CN110129422B
Application number: CN201910458674.9A
Authority: CN
Inventors: 罗巍; 岑志栋; 姜正文; 杨德壕; 付爱思; 胡奔
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2019-05-29
Filing date: 2019-05-29
Publication date: 2021-06-29
Anticipated expiration: 2039-05-29
Also published as: CN110129422A

Abstract

The invention provides a method for resolving the mutation structure of a polynucleotide repeat mutation disease based on long-fragment PCR and single-molecule sequencing. The method comprises the following steps: (a) providing a sample to be detected, wherein the sample is a nucleic acid sample containing genomic DNA; (b) performing long-fragment PCR on the sample to obtain a first amplification product; (c) adding a barcode sequence to an end of the amplification product to form a barcoded first amplification product; (d) performing single-molecule sequencing on the barcoded amplification product to obtain a data set of the sub-reads corresponding to the target region. Based on this data set, the mutation structure of a polynucleotide repeat mutation disease can be accurately resolved. The invention is efficient, accurate and low in cost.

Description

Method for analyzing mutation structure of repeated mutation disease of polynucleotide based on long-fragment PCR and single-molecule sequencing

Technical Field

The invention relates to the technical field of biology, in particular to a method for analyzing a mutation structure of a polynucleotide repeat mutation disease based on long-fragment PCR and single-molecule sequencing.

Background

Repeat expansion disease (RED) is a large group of genetic diseases caused by abnormal expansion of repeats of 3-12 nucleotides. In addition to the more common trinucleotide repeat expansion mutations such as (CAG)n and (CGG)n, a growing number of other polynucleotide repeat expansion mutations are being reported. In polynucleotide repeat mutation diseases, such as the C9ORF72 gene (GGGGCC)n hexanucleotide repeat mutation and the CSTB gene (CCCCGCCCCGCG)n repeat mutation, the abnormally expanded repeat sequence can reach several kb or even dozens of kb in length. Clarifying the detailed structure of such mutations, and exploring the pathogenic mechanisms of different structures and their relationship with clinical phenotypes, has always been difficult.

Taking familial cortical myoclonic tremor epilepsy (FCMTE) as an example, the disease is a group of autosomal dominant hereditary epilepsy syndromes with significant clinical and genetic heterogeneity. Its main clinical manifestations are adult onset, cortical tremor and myoclonus of the extremities, with or without epileptic seizures, and it can be accompanied by cognitive impairment, dementia, night blindness, migraine, ataxia and other symptoms. Electrophysiological examination may show abnormal electromyographic and electroencephalographic findings, such as giant somatosensory evoked potentials (G-SEP) of cortical origin and the long-latency cortical reflex (LLCR or C-reflex). Antiepileptic drugs are effective.

The clinical phenotype of FCMTE is complex, and under previous diagnostic criteria the diagnosis could be confirmed only in combination with complex electrophysiological examinations such as G-SEP and C-reflex. As a result, many cases were missed or misdiagnosed, and patients could not receive targeted, accurate diagnosis and treatment.

Single-molecule sequencing (also called long-read sequencing) provides a new means of detecting polynucleotide repeat mutations, and other groups have applied it to the study of the pentanucleotide repeat insertion mutation of the SAMD12 gene in FCMTE. However, it has various limitations: the missed-detection rate is high, and detection is only qualitative, in that the presence of the repeat expansion sequence can be seen but reliable sequence content information cannot be provided. In addition, single-molecule sequencing at the whole-genome level is very expensive and cannot be widely adopted in clinical testing.

Therefore, there is an urgent need in the art to develop a novel method for efficiently and accurately analyzing the mutant structure of a disease caused by repetitive mutation of a polynucleotide.

Disclosure of Invention

The invention aims to provide a novel method for efficiently and accurately analyzing the mutant structure of a polynucleotide repeat mutation disease.

In a first aspect of the present invention, there is provided a method of resolving a mutant structure of a polynucleotide repeat mutation disease or a method of resolving a structure of a polynucleotide repeat region, the method comprising the steps of:

(a) providing a sample to be detected, wherein the sample to be detected is a nucleic acid sample containing genome DNA;

(b) carrying out long-fragment PCR on the sample to be detected so as to obtain a first amplification product;

(c) adding a barcode sequence to an end of the amplification product to form a first amplification product with a barcode sequence;

(d) performing single-molecule sequencing on the barcoded amplification product to obtain a data set of the sub-reads corresponding to the target region (i.e., the sub-reads corresponding to the polynucleotide repeat region).

In another preferred example, the method further comprises:

(e) the data set is analyzed to obtain the mutant structure of the target region (i.e., the polynucleotide repeat region).

In another preferred embodiment, between steps (c) and (d), further comprising:

(d0) mixing the barcoded first amplification product with m-1 other barcoded amplification products to obtain a mixed library of amplification products;

wherein the m-1 barcoded amplification products are the 2nd, 3rd, ..., and m-th amplification products, each prepared by steps (a), (b) and (c) and each carrying a different barcode sequence, and the mixed library of amplification products contains the m barcoded amplification products;

wherein m is a positive integer not less than 2.

In another preferred embodiment, m ≥ 5, preferably ≥ 10, more preferably ≥ 20, most preferably ≥ 30.

In another preferred embodiment, m is from 5 to 5000, preferably from 10 to 2000, more preferably from 20 to 500.

In another preferred embodiment, m is 5 to 60, preferably 20 to 50, more preferably 35 to 45.

In another preferred embodiment, in step (d), single-molecule sequencing is performed on the mixed library of amplification products, thereby obtaining the data set of the sub-reads corresponding to the target region (i.e., the sub-reads corresponding to the polynucleotide repeat region).

In another preferred example, in step (e), the data set is split based on different barcode sequences, and then reads with the same barcode sequence are subjected to classification analysis, so as to obtain mutation structures corresponding to the target regions (i.e. polynucleotide repeat regions) of the m samples to be tested respectively.
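The barcode-based splitting described for step (e) can be illustrated with a minimal Python sketch. It is not the software used in the invention: the function names, the barcode sequences, the 30-base search window and the exact-match rule are all hypothetical choices made for illustration, and a practical implementation would need to tolerate sequencing errors in the barcode.

```python
from collections import defaultdict

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def split_by_barcode(reads, barcodes, window=30):
    """Group read IDs by the sample barcode found near either read end.

    reads    : iterable of (read_id, sequence)
    barcodes : dict of sample name -> barcode sequence (hypothetical values)
    window   : number of bases searched at each end (assumed value)
    Reads with no barcode, or with barcodes of several samples, are
    collected under "unassigned". Matching is exact (an assumption).
    """
    groups = defaultdict(list)
    for read_id, seq in reads:
        seq = seq.upper()
        # "|" keeps a barcode from matching across the two joined ends.
        ends = seq[:window] + "|" + seq[-window:]
        hits = [
            sample
            for sample, bc in barcodes.items()
            if bc.upper() in ends or reverse_complement(bc) in ends
        ]
        key = hits[0] if len(hits) == 1 else "unassigned"
        groups[key].append(read_id)
    return dict(groups)


# Synthetic example data (not from the patent).
barcodes = {"sample_1": "ACGTACGTAC", "sample_2": "TGCATGCATG"}
reads = [
    ("r1", "ACGTACGTAC" + "TTTTA" * 10 + "GGCC"),   # barcode 1 at the 5' end
    ("r2", "GGCC" + "TTTGA" * 10 + "CATGCATGCA"),   # barcode 2, reverse strand
    ("r3", "TTTTA" * 12),                           # no barcode
]
groups = split_by_barcode(reads, barcodes)
```

After the split, the reads of each sample can be analyzed separately, which is what allows m samples to share one sequencing run.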

In another preferred embodiment, the length of the polynucleotide repeat region is 200-10000 bp, preferably 1500-5000 bp.

In another preferred embodiment, the polynucleotide repeats are repeats of 3-12nt nucleotide units.

In another preferred embodiment, the polynucleotide repeat region comprises one or more polynucleotide repeats.

In another preferred embodiment, the polynucleotide repeat mutation disease is selected from the group consisting of: familial cortical myoclonic tremor epilepsy (e.g., familial cortical myoclonic tremor epilepsy types 1, 6, 7); C9ORF72-associated amyotrophic lateral sclerosis/frontotemporal dementia; spinocerebellar ataxia (e.g., spinocerebellar ataxia types 8, 10, 31, 36, 37); myotonic dystrophy (e.g., myotonic dystrophy types 1, 2).

In another preferred example, between steps (b) and (c), further comprising:

(c0) separating the first amplification product to obtain a separated and purified first amplification product.

In another preferred embodiment, when the method is used for m amplification products, the m amplification products are separated, thereby obtaining m separated amplification products.

In another preferred embodiment, BluePippin is used to size-select and recover the amplification product.

In another preferred embodiment, the method is non-diagnostic and non-therapeutic.

In a second aspect of the invention, there is provided a kit for diagnosing familial cortical myoclonic tremor epilepsy (FCMTE), the kit comprising a first standard which is a nucleic acid sequence having a (TTTGA)n1 pentanucleotide repeat insertion mutation, wherein n1 is 50-800.

In another preferred embodiment, n1 is 100-500.

In another preferred embodiment, the kit further comprises a second standard, wherein the second standard is a nucleic acid sequence having a (TTTCA) n2 five-nucleotide repeat insertion mutation, wherein n2 is 100-700.

In another preferred embodiment, n2 is 200-500.

In another preferred embodiment, the kit further comprises a primer pair for long-fragment PCR.

In another preferred embodiment, the sequences of the primer pair for long-fragment PCR are shown in SEQ ID Nos. 1 and 2.

In a third aspect of the invention, there is provided the use of a kit according to the second aspect of the invention in the preparation of a test kit for the diagnosis of familial cortical myoclonic tremor epilepsy (FCMTE).

In a fourth aspect of the invention, there is provided a use of a detection reagent for detecting the (TTTGA)n1 pentanucleotide repeat in the SAMD12 gene, wherein the detection reagent is used for preparing a detection kit for diagnosing familial cortical myoclonic tremor epilepsy (FCMTE).

In a fifth aspect of the invention, there is provided a method of diagnosing FCMTE, comprising the steps of: detecting the presence or absence of a TTTGA-type pentanucleotide repeat in the SAMD12 gene of the subject;

wherein, if a TTTGA-type pentanucleotide repeat is present, this indicates that the subject has FCMTE or is more likely than the normal population to develop FCMTE (i.e., is susceptible).

In a sixth aspect of the present invention, there is provided a system (or apparatus) for resolving a mutant structure of a polynucleotide repeat mutation disease, the system comprising:

(i) an LR-PCR amplification module configured to: carrying out long-fragment PCR on a sample to be detected so as to obtain a first amplification product, wherein the sample to be detected is a nucleic acid sample containing genome DNA;

(ii) an amplification product post-processing module configured to: adding a barcode sequence to an end of the amplification product to form a first amplification product with a barcode sequence; and

(iii) a single-molecule sequencing module configured to: perform single-molecule sequencing on the barcoded amplification product to obtain a data set of the sub-reads corresponding to the target region (i.e., the sub-reads corresponding to the polynucleotide repeat region).

In another preferred example, the system further includes:

(iv) a data analysis module configured to: the data set is analyzed to obtain the mutant structure of the target region (i.e., the polynucleotide repeat region).

It is to be understood that within the scope of the present invention, the above-described features of the present invention and those specifically described below (e.g., in the examples) may be combined with each other to form new or preferred embodiments. For reasons of space, these combinations are not described one by one here.

Drawings

FIG. 1 shows a diagram of the pentanucleotide repeat insertion mutation patterns within intron 4 of the SAMD12 gene. The normal sequence is generally (TTTTTTA)₇TTA(TTTTA)₁₃; the sequences of the two mutations are (TTTTA)exp(TTTGA)exp and (TTTTA)exp(TTTCA)exp, respectively (exp: repeat expansion; exp indicates the presence or absence of the repeat expansion sequence and does not represent a copy number).

FIG. 2 shows the RP-PCR results for the pentanucleotide repeat insertion in intron 4 of the SAMD12 gene. As can be seen from the two samples tested, RP-PCR suggested the presence of (TTTTA)n and (TTTGA)n repeat expansions but the absence of a (TTTCA)n repeat expansion.

FIG. 3 shows the result of LR-PCR gel running of five-nucleotide repeat insertion mutation of SAMD12 gene. The III:4, II:6 and IV:2 samples have abnormal amplification bands at about 2000 bp; the P-I-III2 sample has an abnormal amplification band at about 3000 bp.

FIG. 4 shows representative target-region sub-reads from single-molecule sequencing of two FCMTE samples. The detailed sequence of the abnormal amplification band of sample II:6 is (TTTTA)₅TTA(TTTTA)₁₁₄(TTTGA)₁₁₁; the detailed sequence of the abnormal amplification band of sample P-I-III2 is (TTTTA)₃TTA(TTTTA)₃₂(TTTCA)₄₈₁.

FIG. 5 shows the sub-read length and content distribution of the target region for 4 cases of FCMTE samples: A-D is the length distribution of the abnormal amplification bands of each sample; E-H is the distribution of the lengths of (TTTTA) n and (TTTGA) n or (TTTCA) n in each sample abnormal amplification band. The dotted line represents the median (see table 2 for specific values).

FIG. 6 shows (TTTGA) n pentanucleotide repeat insertion mutation pathogenic pedigree, mutation sequence structure and LR-PCR gel map.

FIG. 7 shows Sanger sequencing of the LR-PCR product of a sample carrying the pathogenic (TTTGA)n pentanucleotide repeat insertion mutation, together with a normal control: Sanger sequencing of the normal control suggested a repeat sequence structure of (TTTTTTA)₇TTA(TTTTA)₁₃; Sanger sequencing of the long-fragment PCR product suggested (TTTTA)exp at the 5' end and (TTTGA)exp at the 3' end.

FIG. 8 shows representative target-region sub-reads from single-molecule sequencing of two additional FCMTE samples caused by the (TTTGA)n pentanucleotide repeat insertion mutation: the detailed sequences of the abnormal amplification bands of samples III:4 and IV:2 are (TTTTA)₅TTA(TTTTA)₁₁₉(TTTGA)₁₁₁ and (TTTTA)₅TTA(TTTTA)₁₀₈(TTTGA)₁₁₃, respectively.

Detailed Description

The present inventors have conducted extensive and intensive studies and, for the first time, have developed a method for efficiently and accurately analyzing the mutation structure of a disease caused by repetitive mutation of a polynucleotide. Specifically, the method is based on LR-PCR and single-molecule sequencing, and utilizes LR-PCR products to perform target region single-molecule sequencing, so that more effective reads (effective data increase) are obtained under the condition of reduced sequencing total amount (cost reduction), and the method has the characteristics of high efficiency, high precision and low cost. Based on the method of the present invention, the present inventors also identified for the first time a new mutation structure of FCMTE, a polynucleotide repeat mutation disease, i.e., the presence of a (TTTGA) n-type five-nucleotide repeat insertion mutation on SAMD12 gene. The present invention has been completed based on this finding.

Terms

As used herein, the term "on-target read" of a target region refers to a read associated with a disease mutation structure.

As used herein, the term "TTTGA-type pentanucleotide repeat" refers to the presence of a (TTTGA)n1 pentanucleotide repeat within a target region, wherein n1 is a positive integer as defined above. In the present invention, the TTTGA-type pentanucleotide repeat in the SAMD12 gene is confirmed for the first time to be associated with familial cortical myoclonic tremor epilepsy.

As used herein, the term "TTTCA-type pentanucleotide repeat" refers to the presence of a (TTTCA)n2 pentanucleotide repeat within a target region, wherein n2 is a positive integer as defined above. The TTTCA-type pentanucleotide repeat in the SAMD12 gene has been demonstrated to be associated with familial cortical myoclonic tremor epilepsy.

SAMD12 gene

SAMD12 (Sterile Alpha Motif Domain Containing 12, NCBI ID: 401474) is a protein-coding gene whose "ENST00000409003.5" transcript contains 5 coding exons. It has previously been reported as a causative gene for FCMTE, the causative mutation being a TTTCA-type pentanucleotide repeat insertion located within intron 4.

The specific functions of the protein encoded by the SAMD12 gene are not yet fully known.

LR-PCR

The polymerase chain reaction (PCR) is a molecular biology technique for the in vitro enzymatic synthesis of specific DNA fragments. Amplification of the target DNA fragment starts from oligonucleotide primers complementary to its sequence; the basic principle is similar to the natural replication of DNA, and the specificity depends on the oligonucleotide primers complementary to the two ends of the target sequence. PCR consists of three basic reaction steps: denaturation, annealing and extension. During amplification, under the action of a DNA polymerase (such as Taq DNA polymerase), with dNTPs as substrates and the target sequence as template, a new strand complementary to the template DNA strand is synthesized according to the principles of base pairing and semi-conservative replication. Through multiple cycles of denaturation, annealing and extension, the number of specific DNA fragments increases exponentially, so that a large quantity of a specific gene fragment can be obtained in a short time.

In the present invention, long-range PCR (LR-PCR) refers to a PCR reaction in which the amplification product is 4 kb or longer (preferably 5 kb or longer). Conventional PCR can generally amplify fragments of only 3-4 kb; in the invention, target products longer than 5 kb are amplified by adjusting the PCR reaction conditions and the type of DNA polymerase used (such as Taq polymerase).

In the present invention, the amplification product of Long-range PCR (LR-PCR) is usually 4.5-15kb, preferably 5-10kb, more preferably 5-8 kb.

In the present invention, the amplification of long-fragment DNA sequences can be further improved by adjusting PCR conditions, such as the use of a specific polymerase, optimization of the amount of template DNA, Mg2+ concentration, and the like.

One preferred class of LR-PCR polymerases is TAKARA's dedicated DNA polymerases (Takara LA Taq DNA polymerase and PrimeSTAR GXL DNA polymerase), which can amplify long fragments of up to several tens of kb, including sequences with AT repeats or high GC content.

For the operation of LR-PCR, see also the following documents: Waggott W. Long Range PCR. In: Lo Y.M.D. (ed.) Clinical Applications of PCR. Methods in Molecular Medicine, vol 16. 1998. Humana Press; Saiki RK, Gelfand DH, Stoffel S, et al. Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science. 1988. 239(4839): 487-91.

Single molecule sequencing

In the present invention, for the long-fragment amplification product obtained by LR-PCR amplification, the corresponding reading data can be obtained by single-molecule sequencing. In the present invention, representative single molecule sequencing includes (but is not limited to): tSMS sequencing, nanopore sequencing, and the like.

Typically, based on the technical principle of using fluorescence to label deoxynucleotide and recording the change of fluorescence intensity in real time by a microscope, the new generation sequencing which can detect very long sequence (up to 20kb) by one reaction is completed, and the technical bottleneck of reading the length (100-.

For nucleic acid sequencing, the term "template" refers to a nucleic acid molecule that undergoes a sequencing reaction. For example, in a sequencing-by-synthesis reaction, a template is a molecule used by a polymerase to direct nascent strand synthesis; for example, it is complementary to the nascent strand produced. In nanopore-based sequencing methods, the template is a nucleic acid that passes through the nanopore, either intact or after nucleolytic degradation. The template may comprise, for example, DNA, RNA, or the like, or a combination thereof. In addition, the template may be single-stranded, double-stranded, or may comprise both single-stranded and double-stranded regions.

In the present invention, it is preferred to use a single-molecule sequencing system and to detect nucleic acid templates by analyzing reaction data (e.g., sequence and/or kinetic data) obtained from such a system. In particular, a modification in a template nucleic acid strand can cause a unique and identifiable change in an analytical reaction that allows the modification to be identified. In other embodiments, the modification in the template perturbs the current through the nanopore as the template passes. In a preferred embodiment, such modifications are detected using single-molecule nucleic acid sequencing techniques, wherein each resulting sequence read corresponds to a single molecule of the nucleic acid template. In preferred embodiments, single-molecule nucleic acid sequencing technology is capable of detecting individual nucleotides in real time, for example during nucleotide incorporation or passage through a nanopore. Such sequencing techniques are known in the art and include, for example, nanopore sequencing techniques. For more information on nanopore sequencing, see, e.g., U.S. Patent No. 5,795,782; Kasianowicz, et al. (1996) Proc Natl Acad Sci USA 93(24): 13770-3; Ashkenasy, et al. (2005) Angew Chem Int Ed Engl 44(9): 1401-4; Howorka, et al. (2001) Nat Biotechnol 19(7): 636-9; Astier, et al. (2006) J Am Chem Soc 128(5): 1705-10; U.S.S.N. 13/083,320, filed on 8/4/2011; and Zhao, et al. (2007) Nano Letters 7(6): 1680-.

Furthermore, for single-molecule sequencing techniques, see also the following documents: Ameur A, Kloosterman WP, Hestand MS. Single-molecule sequencing: towards clinical applications. Trends Biotechnol. 2018. 37(1): 72-85; Mitsuhashi, et al. Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads. Genome Biology. 2019. 20:58: 1-17.

Diseases caused by repeated mutation of polynucleotide

The polynucleotide repeat mutation disease (RED) applicable to the present invention is not particularly limited, and any disease caused by abnormal amplification of 3-12 nucleotide repeat sequences, particularly genetic diseases, may be used, and representative examples include (but are not limited to): spinocerebellar ataxia, myotonic dystrophy, C9ORF 72-associated amyotrophic lateral sclerosis/frontotemporal dementia, fragile X syndrome, and the like.

Detection method

The invention provides a method for efficiently and accurately analyzing the mutation structure of the repeated mutation disease of the polynucleotide. The method skillfully integrates the advantages of long-fragment PCR and single-molecule sequencing analysis, thereby not only efficiently and accurately analyzing the FCMTE, but also analyzing the mutation structures of other different polynucleotide repeated mutation diseases.

The invention is particularly suitable for the case where the repeat region of the polynucleotide exceeds 500 bp. In the prior art, when the repeated region of the polynucleotide exceeds 500bp, even if the technology such as second generation sequencing is adopted, the accurate result can not be obtained due to the interference of various factors such as the repeated region of the polynucleotide.

Reagent kit

The invention also provides a kit for detecting FCMTE. The kit of the invention contains a first standard which is a nucleic acid sequence having a (TTTGA) n1 pentanucleotide repeat insertion mutation, wherein n1 is 50-800.

In another preferred embodiment, n1 is 100-500.

In another preferred embodiment, n2 is 200-500.

In another preferred embodiment, the kit further comprises a barcode nucleic acid for adding a barcode sequence to the amplification product.

In another preferred embodiment, the kit also contains m barcode nucleic acids, wherein m is a positive integer greater than or equal to 2.

The main advantages of the invention include:

(a) In the invention, the target region sequence is captured first and then subjected to single-molecule sequencing, so that more sub-reads of the target region (for example, 50-300) with higher accuracy (> 90%) are obtained. Compared with single-molecule sequencing at the whole-genome level, which yields only a single-digit number of target-region sub-reads, the specific sequence content of the repeat mutation is resolved more accurately.

(b) In the present invention, false negatives can be significantly reduced. Even mutation structures that may be missed (false negatives) when the pentanucleotide repeat insertion mutation of the SAMD12 gene in FCMTE is tested by repeat-primed PCR (RP-PCR) and long-range PCR (LR-PCR), such as the rare mutation structure (TTTTA)₁₀₀(TTTCA)₂₁₀(TTTTA)₁₀₀ or the novel pentanucleotide repeat insertion (TTTTA)n(TTTGA)n, can still be accurately detected by the method of the present invention.

(c) Compared with the high price of whole-genome single-molecule sequencing (about 30,000-50,000 yuan per case), the cost of the whole process is only 1/12 of that or less (about 2500 yuan), so the method has higher clinical application value.
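The cost comparison in advantage (c) can be checked with a few lines of Python. This is only an arithmetic illustration using the approximate per-case figures quoted above (about 2500 yuan versus about 30,000 to 50,000 yuan); the variable names are chosen for illustration.

```python
from fractions import Fraction

# Approximate per-case costs quoted in the text, in yuan.
targeted_cost = 2500
whole_genome_cost_range = (30000, 50000)

# Cost of the targeted workflow as a fraction of whole-genome sequencing.
ratios = [Fraction(targeted_cost, wgs) for wgs in whole_genome_cost_range]

for wgs, ratio in zip(whole_genome_cost_range, ratios):
    print(f"{targeted_cost} / {wgs} = {ratio}")
```

The two ends of the quoted price range give ratios of 1/12 and 1/20, which is consistent with the statement "1/12 or less".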

The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Experimental procedures for which no specific conditions are noted in the following examples generally follow conventional conditions, such as those described in Sambrook et al., Molecular Cloning: A Laboratory Manual (New York: Cold Spring Harbor Laboratory Press, 1989), or the manufacturer's recommendations. Unless otherwise indicated, percentages and parts are percentages and parts by weight.

General procedure

1. Long fragment PCR (Long-range PCR, LR-PCR)

1.1 Collecting a peripheral blood sample from the subject and extracting genomic DNA from the sample by the phenol-chloroform method;

1.2 LR-PCR using sample DNA. LR-PCR reaction system (50 µL per reaction): 100-500 ng sample DNA, 0.2 µM primers SAMD12LF and SAMD12LR (see Table 1), 200mM dNTP, 1 × PrimeSTAR GXL buffer, 1.25 U PrimeSTAR GXL DNA polymerase (TAKARA). LR-PCR cycling conditions: denaturation at 98 ℃ for 1 min; 30 cycles of 98 ℃ for 10 seconds and 68 ℃ for 15 minutes; 72 ℃ for 10 minutes; hold at 4 ℃. The amplification products were run on a 1% agarose gel to identify the long-fragment products carrying the mutation (FIG. 3).

TABLE 1 LR-PCR primer sequences

Primer	Sequence	SEQ ID No:
SAMD12LF	5'-TGTGCAGCCATTGGTCCAGTCTT-3'	1
SAMD12LR2	3'-GCTGGCAAAGTTCAGAGGTCACTT-5'	2
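As a planning aid, the LR-PCR cycling program given in step 1.2 can be written down as data and its minimum run time computed. The following Python sketch is illustrative only: the `Step` class and the variable names are hypothetical, and the result counts hold times only, ignoring temperature ramping and the final 4 ℃ hold.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One hold step of a thermal cycling program."""
    temp_c: int
    seconds: int


# Program from step 1.2: 98 C 1 min; 30 x (98 C 10 s, 68 C 15 min); 72 C 10 min.
initial_denaturation = Step(98, 60)
cycle = [Step(98, 10), Step(68, 15 * 60)]
n_cycles = 30
final_extension = Step(72, 10 * 60)

# Sum of programmed hold times (ramp times are not included).
per_cycle = sum(step.seconds for step in cycle)
total_seconds = (
    initial_denaturation.seconds + n_cycles * per_cycle + final_extension.seconds
)

hours, remainder = divmod(total_seconds, 3600)
minutes = remainder // 60
print(f"Programmed hold time: {hours} h {minutes} min")
```

The 15-minute extension step dominates the program, so the hold times alone add up to 7 h 46 min per run.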

2. Single-molecule sequencing on the PacBio sequencing platform

2.1 Sample purification and fragment size selection before single-molecule sequencing: based on the length of the target long fragment observed when the LR-PCR sample was run on the gel, the target LR-PCR product was size-selected and recovered using the BluePippin automated nucleic acid electrophoresis and fragment recovery system;

2.2 Labeling the recovered target fragments of different samples with barcode sequences (barcodes): different barcode sequence tags were added to the target fragments of different samples using SMRTbell Barcoded Adapter Complete Prep-96 (PN: 100-. All labeled target fragments were adjusted to the same concentration based on Qubit (Invitrogen) measurements, pooled, and purified using Agencourt AMPure XP beads according to the PacBio recommended library preparation protocol;

2.3 Single-molecule sequencing library construction: the PacBio library was constructed according to the protocol of PacBio's SMRTbell Template Prep Kit 1.0 (100-259-100) and the "Procedure & Checklist - 10 kb Template Preparation and Sequencing" instructions. DNA damage repair, end repair, etc. were all performed on 200 ng of labeled DNA. AMPure PB magnetic beads (Pacific Biosciences) were used for purification in all steps. Qualitative and quantitative analyses used the Agilent 2100 Fragment Analyzer and the Qubit fluorometer with Quant-iT dsDNA BR Assay Kits (Invitrogen);

2.4 Single-molecule sequencing: following PacBio's protocol and the output of Binding Calculator version 2.3.1.1, SMRTbell templates were annealed to the v2 sequencing primer and bound to DNA polymerase P6 using the DNA/Polymerase Binding Kit P6 (part #: 100-356-300). The polymerase-template complex was purified using the Pacific Biosciences MagBead Binding Kit (part #: 100-133-600), and the sample reaction was set up as directed by the Binding Calculator. The samples were loaded onto a single SMRT Cell v3 (Part #: 100-;

2.5 Post-processing of single-molecule sequencing data: sequencing data were processed using the Pacific Biosciences SMRT Portal and SMRT Analysis System software (v2.3.0).

3. Screening and statistical analysis of target sequences in the single-molecule sequencing data using bioinformatics software: CCS reads matching the length of the target sequence were selected, and those with a predicted accuracy of at least 90% were retained for further analysis; sub-reads covering the entire target sequence of the SAMD12 gene were used as the sub-reads of the target region (FIG. 4). The number of target-region sub-reads was counted for each sample, and for each target-region sub-read the total length of the repeat region (TTTTA + TTTCA or TTTTA + TTTGA), the (TTTTA) length, and the (TTTCA) or (TTTGA) length were calculated using the R language; the median of each length was taken as the representative result for the specific mutation structure of the repeat sequence of the sample.
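The analysis in step 3 can be sketched in Python as follows. The original analysis was performed in R, so this is a swapped-in illustration rather than the inventors' script. The function names, the synthetic reads and their accuracy values are hypothetical, and the sketch assumes error-free reads in which the repeat units can be matched exactly; real sub-reads contain sequencing errors and would need a more tolerant motif matcher.

```python
from statistics import median

MOTIFS = ("TTTTA", "TTTGA", "TTTCA")


def repeat_structure(seq):
    """Collapse a read into [(text, copies), ...].

    Consecutive copies of one motif are merged into a single block;
    bases that match no motif are kept as literal spacers (copies == 1).
    """
    blocks = []  # each entry: [text, copies, is_repeat]
    i = 0
    while i < len(seq):
        unit = seq[i:i + 5]
        if unit in MOTIFS:
            if blocks and blocks[-1][2] and blocks[-1][0] == unit:
                blocks[-1][1] += 1
            else:
                blocks.append([unit, 1, True])
            i += 5
        else:
            if blocks and not blocks[-1][2]:
                blocks[-1][0] += seq[i]
            else:
                blocks.append([seq[i], 1, False])
            i += 1
    return [(text, copies) for text, copies, _ in blocks]


def motif_lengths_bp(seq):
    """Total length in bp contributed by each motif in one read."""
    totals = dict.fromkeys(MOTIFS, 0)
    for text, copies in repeat_structure(seq):
        if text in totals:
            totals[text] += copies * len(text)
    return totals


def summarize(reads, min_accuracy=0.90):
    """Median per-motif length over reads passing the accuracy filter."""
    kept = [seq for seq, accuracy in reads if accuracy >= min_accuracy]
    per_read = [motif_lengths_bp(seq) for seq in kept]
    summary = {m: median(r[m] for r in per_read) for m in MOTIFS}
    summary["n_reads"] = len(kept)
    return summary


def make_read(n_ttta_tail, n_tttga):
    """Build a synthetic read shaped like the structures shown in FIG. 4."""
    return "TTTTA" * 5 + "TTA" + "TTTTA" * n_ttta_tail + "TTTGA" * n_tttga


# Synthetic (sequence, predicted accuracy) pairs, not real data.
reads = [
    (make_read(114, 111), 0.99),
    (make_read(112, 110), 0.95),
    (make_read(116, 113), 0.93),
    (make_read(90, 80), 0.80),  # below the 90% cut-off, ignored
]
summary = summarize(reads)
print(repeat_structure(reads[0][0]))
print(summary)
```

Using the median rather than the mean keeps the representative structure robust to the occasional sub-read whose repeat length is distorted by PCR slippage or sequencing error.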

Example 1

Identification of a novel five-nucleotide repeat insertion mutation (TTTGA) n by combination of LR-PCR and Single-molecule sequencing

For a certain FCMTE family, the pathogenic cause of FCMTE is firstly studied by RP-PCR and LR-PCR.

The inventors found that a long amplified fragment exists in the target region of the pedigree, but RP-PCR suggests that only (TTTTA) n is repeatedly amplified and no (TTTCA) n is repeatedly amplified (FIG. 2).

Through Sanger sequencing of LR-PCR long fragment products, the inventors found that a new unreported (TTTGA) n-pentanucleotide repeat insertion exists at the 3' end (FIG. 7), but the inventors still cannot confirm whether the (TTTCA) n-pentanucleotide repeat insertion still exists inside the long fragment to cause diseases.

Further, the inventors selected 3 RP-PCR-suggested (TTTTA) n-repeat amplified samples (III:4, II:6 and IV:2) in the family, obtained long fragment products by LR-PCR amplification, and performed single molecule sequencing. It was confirmed that in the long fragment cosegregating the pedigree with the disease, only (TTTGA) n pentanucleotide repeat insertion mutation, and no (TTTCA) n pentanucleotide repeat was present.

Using the invention, the inventors established for the first time that in this FCMTE family, in which no (TTTCA)n pentanucleotide repeat insertion was detected by RP-PCR (FIG. 6), the disease is a new, previously unreported form of FCMTE caused by a (TTTGA)n pentanucleotide repeat insertion mutation (FIG. 1, FIG. 4, FIG. 5 and FIG. 8).

Example 2

Resolving the specific sequence of the (TTTCA)n pentanucleotide repeat insertion in the SAMD12 gene in FCMTE by combining LR-PCR and single-molecule sequencing

For one FCMTE sample studied by the inventors (P-I-III:2), in which a (TTTCA)n pentanucleotide repeat insertion mutation had been confirmed by RP-PCR and by Sanger sequencing of the LR-PCR product, the mutation structure was analyzed again by the method combining LR-PCR and single-molecule sequencing.

The results show that LR-PCR amplified the corresponding long-fragment product (FIG. 3). Single-molecule sequencing then confirmed that the specific sequence of the long-fragment product was (TTTTA)₃₅(TTTCA)₄₈₁ (FIG. 4).

Thus, for this FCMTE sample (P-I-III:2), the mutation structure of the polynucleotide repeat mutation disease was accurately defined for the first time, namely the specific sequence of the (TTTCA)n pentanucleotide repeat insertion mutation in the SAMD12 gene.
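A structure such as (TTTTA)₃₅(TTTCA)₄₈₁ can be derived from a read by run-length encoding its consecutive 5-mers. The following is a minimal Python sketch, not the inventors' software. It assumes the repeat region has already been trimmed of flanking sequence and starts in frame with the first repeat unit, and the function names are hypothetical.

```python
from itertools import groupby

def repeat_structure(region, unit_len=5):
    """Split the repeat region into consecutive unit_len-mers and
    run-length encode them as (unit, copy_number) pairs."""
    units = [region[i:i + unit_len] for i in range(0, len(region), unit_len)]
    return [(unit, sum(1 for _ in group)) for unit, group in groupby(units)]

def format_structure(runs):
    """Render the runs in the (UNIT)n notation used in the text."""
    return "".join("(%s)%d" % (unit, count) for unit, count in runs)

# Synthetic region built to match the structure reported for sample P-I-III:2.
region = "TTTTA" * 35 + "TTTCA" * 481
runs = repeat_structure(region)
label = format_structure(runs)
```

On real reads, a single-base error would break a run into several pieces, so in practice this encoding would be applied to error-corrected consensus reads or combined with a tolerance for interrupted units.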

The disease mutation structures of the 4 FCMTE samples analyzed in Examples 1 and 2 are summarized in Table 2.

Table 2. Statistics of the length and content of the target-region sub-reads for the 4 FCMTE samples

Note:

1. The table summarizes the repeat lengths and the median number of repeats.

2. N represents the number of sub-reads obtained for the target long fragment.

3. N.D. indicates not detected. n = (length - 3 bp)/5 bp.
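The conversion in note 3 can be written as a one-line function. This is a small Python sketch. The function name and the example lengths are illustrative assumptions, and the text does not explain the origin of the 3 bp offset, so it is taken as given.

```python
def repeat_count(length_bp, unit_bp=5, offset_bp=3):
    """Number of repeat units n = (length - 3 bp) / 5 bp, as in note 3."""
    return (length_bp - offset_bp) / unit_bp

# Illustrative lengths only: 2408 bp and 178 bp are chosen so that the
# formula returns the copy numbers 481 and 35 quoted in Example 2.
n_ttca = repeat_count(2408)
n_ttta = repeat_count(178)
```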

Discussion

In 2018, the first FCMTE pathogenic gene (the SAMD12 gene) and its pathogenic mutation, the (TTTCA)n pentanucleotide repeat insertion mutation (FIG. 1), were identified, making molecular genetic diagnosis of FCMTE possible for the first time.

However, the methods reported so far for detecting the (TTTCA)n pentanucleotide repeat insertion mutation are mainly RP-PCR, LR-PCR and Southern blot. Combining RP-PCR with LR-PCR or Southern blot can qualitatively determine whether a (TTTCA)n pentanucleotide repeat insertion mutation is present in the intron 4 region of the SAMD12 gene, but the following problem remains and may cause false negatives. Typically, the (TTTCA)n pentanucleotide repeat insertion lies downstream (i.e., 3') of a stretch of (TTTTA)n pentanucleotide repeats, for example (TTTTA)₂₀₀(TTTCA)₂₁₀ (FIG. 1), so RP-PCR with primers designed specifically for the (TTTCA)n insertion produces a sawtooth pattern in the capillary electrophoresis result (FIG. 2). However, it has also been reported that the (TTTCA)n pentanucleotide repeat insertion can lie within the (TTTTA)n pentanucleotide repeat sequence, for example (TTTTA)₁₀₀(TTTCA)₂₁₀(TTTTA)₁₀₀. In this case, although the expanded long allele can be detected by LR-PCR (primers shown in Table 1) or Southern blot (FIG. 3), normal individuals may also carry a long allele containing only the (TTTTA)n expansion, so it is impossible to tell whether the tested individual carries the pathogenic (TTTCA)n pentanucleotide repeat insertion, which may lead to a missed diagnosis.

During mutation detection for FCMTE, a novel pentanucleotide repeat insertion mutation, (TTTGA)n, was unexpectedly discovered. RP-PCR and LR-PCR showed that it co-segregates with the disease in the family (see the Examples), and Sanger sequencing from both ends predicted its structure to be (TTTTA)n(TTTGA)n (FIG. 1).

Because Sanger sequencing cannot cover the entire LR-PCR long fragment, the inventors could not determine whether a (TTTCA)n pentanucleotide repeat insertion mutation was also present inside the fragment. Similar problems arise in spinocerebellar ataxias (SCA), such as SCA10, SCA31 and SCA37, in which pentanucleotide repeat insertions absent from the reference sequence are present. Most current detection methods cannot accurately determine the detailed sequence of such mutations. Current detection methods therefore still have clear shortcomings in detecting these mutations and in determining their sequence content.

Although single-molecule sequencing has some application value, its practical use is greatly limited. First, existing detection of the SAMD12 pentanucleotide repeat insertion mutation by single-molecule sequencing relies on sequencing at the whole-genome level, where the average effective coverage depth is only about 8X and only 1-2 reads, or even none, span the insertion, so the mutation may be missed.

Second, even if 1-2 reads spanning the pentanucleotide repeat insertion of the SAMD12 gene are obtained, it remains very difficult to analyze the specific sequence content of those reads accurately. Single-base misreads are an inherent weakness of single-molecule sequencing, and a limited number of reads cannot be corrected algorithmically. Detection of the SAMD12 pentanucleotide repeat insertion by whole-genome single-molecule sequencing is therefore still qualitative: the presence of an expanded repeat sequence can be seen, but reliable sequence content information, such as the specific repeat numbers and arrangement of (TTTTA)n, (TTTCA)n and (TTTGA)n, cannot be provided.
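Why read depth matters for error correction can be illustrated with a simple majority-vote model. The Python sketch below is not part of the patented method. It assumes that errors are independent between reads, that a position is miscalled whenever at least half of the reads are wrong there, and that the per-read error rate is 10%, a figure chosen only for illustration.

```python
from math import comb

def majority_error(n_reads, per_read_error):
    """Probability that at least half of n_reads are wrong at one position,
    under a binomial model with independent errors (ties count as failures)."""
    k_min = (n_reads + 1) // 2  # smallest number of wrong reads that defeats the vote
    return sum(
        comb(n_reads, k)
        * per_read_error ** k
        * (1 - per_read_error) ** (n_reads - k)
        for k in range(k_min, n_reads + 1)
    )

# Illustrative per-read error rate; not a value taken from the text.
error_rate = 0.10
p_one_read = majority_error(1, error_rate)
p_two_reads = majority_error(2, error_rate)
p_thirty_reads = majority_error(30, error_rate)
```

Under these assumptions one or two reads leave the per-position error at roughly the raw error rate, while a few dozen reads of the same target push it to a negligible level.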

Third, whole-genome single-molecule sequencing is still very expensive; it remains a research tool and cannot yet be widely adopted for clinical testing.

SAMD12 mutation detection in FCMTE requires detailed analysis of the specific sequence content of the mutated long fragment. Taking full account of the technical and practical limitations of the existing methods (RP-PCR, LR-PCR or Southern blot, and whole-genome single-molecule sequencing), the invention combines LR-PCR with single-molecule sequencing of the target region. With this method, the different pentanucleotide repeat insertion mutations in the intron 4 region of the SAMD12 gene were resolved in detail for the first time, the ability of the method to resolve the detailed sequence content of long-fragment polynucleotide repeat mutations was confirmed, and (TTTGA)n was identified for the first time as a new pathogenic pentanucleotide repeat insertion mutation in FCMTE.

Technically, the invention obtains more effective reads, which greatly improves the accuracy of sequence analysis. In terms of sequencing cost, the total sequencing volume is markedly lower than for the whole genome, so the cost is greatly reduced, with the total kept on the order of a thousand yuan.

It will be appreciated that although the Examples illustrate analysis of the detailed sequence content of the pentanucleotide repeat insertion mutation in the SAMD12 gene in FCMTE, the method of the invention can clearly also be used to analyze other mutation structures in FCMTE and the mutation structures of other polynucleotide repeat mutation diseases.

The invention can also serve as a reference for more detailed sequence analysis of similar polynucleotide repeat mutations, and provides a more accurate method for clinical molecular genetic testing and diagnosis.

All documents referred to herein are incorporated by reference into this application as if each were individually incorporated by reference. Furthermore, it should be understood that various changes and modifications of the present invention can be made by those skilled in the art after reading the above teachings of the present invention, and these equivalents also fall within the scope of the present invention as defined by the appended claims.

Sequence listing

<110> Zhejiang university

<120> method for analyzing mutant structure of repeated mutation disease of polynucleotide based on long fragment PCR and single molecule sequencing

<130> P2019-0707

<160> 2

<170> SIPOSequenceListing 1.0

<210> 1

<211> 23

<212> DNA

<213> Artificial Sequence (Artificial Sequence)

<400> 1

tgtgcagcca ttggtccagt ctt 23

<210> 2

<211> 24

<212> DNA

<213> Artificial Sequence (Artificial Sequence)

<400> 2

gctggcaaag ttcagaggtc actt 24

Claims

1. Use of a detection reagent, characterized in that the detection reagent is used for preparing a detection kit for diagnosing familial cortical myoclonic tremor with epilepsy; wherein the detection reagent is a reagent for detecting (TTTGA)n1 pentanucleotide repeats in the SAMD12 gene, wherein n1 is 50-800.