CN115725720A - Primer combination, kit and system for detecting SLC25A13 IVS16 region variation

Info

Publication number
CN115725720A
CN115725720A CN202211269044.5A CN202211269044A CN115725720A CN 115725720 A CN115725720 A CN 115725720A CN 202211269044 A CN202211269044 A CN 202211269044A CN 115725720 A CN115725720 A CN 115725720A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211269044.5A
Other languages
Chinese (zh)
Inventor
文曙
李珉
朱娜
杨锋
栗海波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Semek Gene Technology Co ltd
Original Assignee
Suzhou Semek Gene Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Semek Gene Technology Co ltd
Priority to CN202211269044.5A
Publication of CN115725720A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A50/00: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE in human health protection, e.g. against extreme weather
    • Y02A50/30: Against vector-borne diseases, e.g. mosquito-borne, fly-borne, tick-borne or waterborne diseases whose impact is exacerbated by climate change

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application discloses a primer composition, a kit and a system for detecting variation in the SLC25A13 IVS16 region. The primer composition targets the SLC25A13 IVS16 region and comprises an upstream primer and downstream primers. The upstream primer sequence is: AAACTGGGGTGAGGGATCGAATACACGAGC. The downstream primers comprise a mutant downstream primer and a wild-type downstream primer; the mutant downstream primer sequence is: GCCCGAACCCTTTCCACCC TGCCAACACCCTC, and the wild-type downstream primer sequence is: CTGGCCAAACCATTACAGCGGAGTGATAG. The system combines second-generation sequencing with bioinformatic analysis and can determine the variation result quickly and accurately.

Description

Primer combination, kit and system for detecting SLC25A13 IVS16 region variation
Technical Field
The application relates to the field of gene detection, in particular to a primer combination, a kit and a system for detecting variation in the SLC25A13 IVS16 region.
Background
Citrin deficiency (CD) is an autosomal recessive genetic disease. Its causative gene, SLC25A13, is located on chromosome 7q21.3, and the encoded protein is called citrin.
IVS16ins3kb is a common pathogenic mutation of the SLC25A13 gene. Because the inserted sequence is long, the mutation is currently difficult to detect accurately by probe capture and similar methods. Third-generation long-fragment sequencing can amplify the complete insert, but its cost and per-test price are high.
Disclosure of Invention
The embodiment of the application provides a primer composition for amplifying the SLC25A13 IVS16 region under second-generation sequencing technology, so as to solve the technical problem of high cost caused by the use of third-generation long-fragment sequencing in the prior art.
The primer composition targets the SLC25A13 IVS16 region and comprises an upstream primer and downstream primers. The upstream primer sequence is: AAACTGGGGTGAGGGATCGAATACACGAGC. The downstream primers comprise a mutant downstream primer and a wild-type downstream primer; the mutant downstream primer sequence is: GCCCGAACCCTTTCCATGCCAAACACCCTC, and the wild-type downstream primer sequence is: CTGGCCAAACCATTACAGCGGAGTGATAG.
Further, the primer composition also comprises at least one of the following control primer combinations:
for the amplification region CG0175.ACAD9.NM_014049.5.exon7_control_ACAD9, the upstream primer sequence is: ATAGGGGTTTGGTTTTCTCCAAAGTC, and the downstream primer sequence is: CGCGCACACAGGAGCTACTT;
for the amplification region CG0539.CYP11B1.NM_000497.3.exon8_control, the upstream primer sequence is: CTCTCAGCTCGCCGCTTAC, and the downstream primer sequence is: GACATGGGTCCCACCATCCAGCAAC;
for the amplification region CG0095.INSRR.NM_014215.3.exon2_control_NTRK1_exon2, the upstream primer sequence is: TCCTGATGCCTAGCTTAAGGAGTC, and the downstream primer sequence is: GCATTGGGGGAAATGATCCAAATG;
for the amplification region CG0335.SIL1.NM_001037633.2.exon5_control_A001, the upstream primer sequence is: TCTGTGCTCTCTGGGAGAGAAGTAAA, and the downstream primer sequence is: GAGACTGACATGCAGGATTGACG;
for the amplification region CG0336.SIL1.NM_001037633.2.exon5_control_A002, the upstream primer sequence is: CAGCAATCTTCTTCCAAACTGGAGC, and the downstream primer sequence is: CCATGGTAGACCACAGATCTTGGGC;
for the amplification region CG0593.STIM1.NM_001277961.1.exon10_control_A001, the upstream primer sequence is: AAGTCCATGCCTGCAGTTCTCTT, and the downstream primer sequence is: ATCCACGTCGTCAGTCATGATGAAG;
for the amplification region CG0594.STIM1.NM_001277961.1.exon10_control_A002, the upstream primer sequence is: AAGTCCATGCCTGCAGTTCTCTT, and the downstream primer sequence is: AAAGGCTCCTTCCTTCATCCCCGC;
for the amplification region CG0740.EDNRB.NM_001201397.1.exon3_control, the upstream primer sequence is: GGAAACACTTCTGAGTGGCATTTATTTA, and the downstream primer sequence is: TGAGTAAATGAGCCATCTTTTTAAGGGTCA;
for the amplification region CG0174.IQCB1.NM_001023570.exon3_control, the upstream primer sequence is: GTAATACTGATATGGTACAGAAGCTTCATACCAA, and the downstream primer sequence is: GTTAGGGGAGAAAAATCAAACCTTCA.
Another embodiment of the present application provides use of the primer composition in preparing a kit for detecting variation in the SLC25A13 IVS16 region.
Another embodiment of the present application provides a kit comprising the above primer composition.
Further, the kit also comprises a DNA polymerase and a reaction buffer.
A third embodiment of the present application provides a system for detecting variation in the SLC25A13 IVS16 region, comprising the following modules:
an acquisition module for acquiring a sample of a subject;
an amplification module for performing PCR amplification on the sample;
a library construction module for constructing a multiplex PCR targeted sequencing library;
a sequencing module for sequencing and analysis;
wherein the amplification module uses the primer composition of claim 1 or 2, and the sequencing module uses a second-generation sequencing technology.
Further, the sequencing module comprises a pre-trained machine learning model; sample data obtained after PCR amplification and construction of the multiplex PCR targeted sequencing library are input into the machine learning model to obtain the analysis result corresponding to the sample data.
Further, the pre-trained machine learning model is obtained by dividing historical samples into a training set and a test set at a ratio of 7:3.
Furthermore, the machine learning model classifies samples using a decision tree algorithm, and the decision tree is constructed from four parameters of each sample: total sequencing depth, number of wild-type sequences, number of mutant sequences, and variant allele frequency.
Further, the decision criteria of the decision tree algorithm are as follows:
a sample is determined to be positive when the following conditions are satisfied:
the total sequencing depth is greater than 50X, the number of mutant sequences is greater than or equal to 10, and the mutation ratio is greater than or equal to 10%;
a sample is determined to be negative when the following conditions are satisfied:
the total sequencing depth is greater than or equal to 50X, and the number of mutant sequences is less than 10;
a sample is determined to be undeterminable when the following condition is satisfied:
the total sequencing depth is less than or equal to 50X.
The embodiments of the present application provide a novel primer composition that can stably and accurately detect the SLC25A13 intron insertion mutation under second-generation sequencing technology. This special mutation is detected by combining the advantages of high-throughput sequencing with a bioinformatics algorithm and decision-tree threshold judgment, which ensures rapid and accurate detection.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
Fig. 1 is a flowchart of a method for detecting variation in the SLC25A13 IVS16 region according to an embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
The embodiment of the present application provides a primer composition that targets the SLC25A13 IVS16 region and is used to amplify that region under second-generation sequencing technology, so as to solve the technical problem of high cost in the prior art, which uses third-generation long-fragment sequencing.
Specifically, the primer composition of the present embodiment includes an upstream primer and a downstream primer, and the sequence of the upstream primer is as follows: AAACTGGGGTGAGGGATCGAATACACGAGC; the downstream primer comprises a mutant downstream primer and a wild type downstream primer, and the sequence of the mutant downstream primer is as follows: GCCCGAACCCTTTCCATGCCAACAACACCCTC, and the wild type downstream primer sequence is as follows: CTGGCCAAACCATTACAGCGGAGTGATAG.
The targeted sequencing technology can enrich the interested genome region for sequencing, and the sequencing data output of a single sample is less and the analysis speed is higher, so that the advantages of the NGS technology can be exerted more economically and efficiently, and the targeted sequencing technology can be widely applied to a plurality of fields such as clinical detection, health screening and the like. In addition, the target region can be subjected to deep sequencing by targeted sequencing, so that the detection sensitivity and accuracy of genetic variation in the target region are improved.
Targeted sequencing methods fall into two main categories: hybrid capture sequencing and amplicon sequencing. In amplicon sequencing, PCR primers are designed for the target region of interest, which is then amplified, enriched and sequenced. It is generally suitable for detecting tens to thousands of sites, or a region of up to a few tens of kb. Hybrid capture sequencing is currently performed mainly as liquid-phase hybrid capture: synthetic nucleic acid probes are designed on the principle of complementary base pairing, and the target region of a DNA library is enriched by hybridization in a liquid-phase environment and then sequenced. However, liquid-phase hybrid capture is difficult to operate, takes a long time, and its probe capture efficiency is easily affected, so amplicon sequencing is better suited than hybrid capture to operation by non-specialist technicians. As a method for quickly constructing a targeted sequencing library, multiplex PCR plays an increasingly important role in clinical gene detection and research because it is efficient, systematic, economical and simple.
To confirm that amplification proceeds normally, rather than only the target detection site being amplified, the present embodiment also provides control primer combinations for amplifying other regions, comprising at least one of the following:
for the amplification region CG0175.ACAD9.NM_014049.5.exon7_control_ACAD9, the upstream primer sequence is: ATAGGGGTTTGGTTTTCTCCAAAGTC, and the downstream primer sequence is: CGCGCACACAGGAGCTACTT;
for the amplification region CG0539.CYP11B1.NM_000497.3.exon8_control, the upstream primer sequence is: CTCTCAGCTCGCCGCTTAC, and the downstream primer sequence is: GACATGTCCCATCCAGCAC;
for the amplification region CG0095.INSRR.NM_014215.3.exon2_control_NTRK1_exon2, the upstream primer sequence is: TCCTGATGCCTAGCTTAAGGAGTC, and the downstream primer sequence is: GCATTGGGGGAAATGATCCAAATG;
for the amplification region CG0335.SIL1.NM_001037633.2.exon5_control_A001, the upstream primer sequence is: TCTGTGCTCTCTGGGAGAGAAGTAAA, and the downstream primer sequence is: GAGACTGACATGCAGGTACG;
for the amplification region CG0336.SIL1.NM_001037633.2.exon5_control_A002, the upstream primer sequence is: CAGCAATCTTCTTCCAAACTGGAGC, and the downstream primer sequence is: CCATGGTAGACCACAGATCTTGGGC;
for the amplification region CG0593.STIM1.NM_001277961.1.exon10_control_A001, the upstream primer sequence is: AAGTCCATGCCTGCAGTTCTCTT, and the downstream primer sequence is: ATCCACGTCGTCAGTCATGATGAAG;
for the amplification region CG0594.STIM1.NM_001277961.1.exon10_control_A002, the upstream primer sequence is: AAGTCCATGCCTGCAGTTCTCTT, and the downstream primer sequence is: AAAGGCTCCTTCCTTCATCCCCGC;
for the amplification region CG0740.EDNRB.NM_001201397.1.exon3_control, the upstream primer sequence is: GGAAACACTTCTGAGTGGCATTTATTTA, and the downstream primer sequence is: TGAGTAAATGAGCCATCTTTAAGGGTCA;
for the amplification region CG0174.IQCB1.NM_001023570.exon3_control, the upstream primer sequence is: GTAATACTGATATGGTACAGAAGCTTCATACCAA, and the downstream primer sequence is: GTTAGGGGAGAAAAATCAAACCTTA.
The primer composition, alone or together with one or more of the control primer combinations, serves as a component of a kit for detecting variation in the SLC25A13 IVS16 region. The kit allows variation in the SLC25A13 IVS16 region to be detected with second-generation sequencing in place of third-generation sequencing.
In these embodiments, the kit typically also includes a DNA polymerase and a reaction buffer.
Because the variant insert is long, probe capture can cover only part of the fragment, and detection of the variant is further affected by limits such as sequencing depth. Multiplex amplicon sequencing can amplify the variant fragment in a targeted manner, and certain embodiments of the application supplement it with bioinformatic analysis to effectively avoid the randomness of the amplified fragments. Accordingly, in some embodiments, a system for detecting variation in the SLC25A13 IVS16 region is provided, comprising:
an acquisition module for acquiring a sample of a subject;
an amplification module for performing PCR amplification on the sample using the primer composition of the above embodiment, including constructing an amplification region file and calculating the depth of each amplicon according to the amplification region;
a library construction module for constructing a multiplex PCR targeted sequencing library;
a sequencing module for sequencing and analysis using a second-generation sequencing technology.
The system formed by these modules executes a method for detecting variation in the SLC25A13 IVS16 region, comprising the following steps:
step S101, collecting a sample of a subject;
step S102, performing PCR amplification on the sample using the primer composition of the above embodiment, including constructing an amplification region file and calculating the depth of each amplicon according to the amplification region;
step S103, constructing a multiplex PCR targeted sequencing library;
step S104, sequencing and analyzing using a second-generation sequencing technology.
In some embodiments, the sequencing module includes a pre-trained machine learning model; sample data obtained after PCR amplification and construction of the multiplex PCR targeted sequencing library are input into the machine learning model to obtain the corresponding analysis result.
In some embodiments, the pre-trained machine learning model is generated by dividing historical samples into a training set and a test set at a ratio of 7:3.
In certain embodiments, the machine learning model classifies samples using a decision tree algorithm constructed from four parameters of each sample: total sequencing depth, number of wild-type sequences, number of mutant sequences, and variant allele frequency.
In some embodiments, the decision criteria of the decision tree algorithm are as follows:
a sample is determined to be positive when the following conditions are satisfied:
the total sequencing depth is greater than 50X, the number of mutant sequences is greater than or equal to 10, and the mutation ratio is greater than or equal to 10%;
a sample is determined to be negative when the following conditions are satisfied:
the total sequencing depth is greater than or equal to 50X, and the number of mutant sequences is less than 10;
a sample is determined to be undeterminable when the following condition is satisfied:
the total sequencing depth is less than or equal to 50X.
The bioinformatic method described above derives a unified decision standard from the analysis of a large amount of historical sample data, and this standard is the key to judging the mutation accurately and quickly from the amplification result. The process of establishing this standard is therefore described in detail and includes the following steps:
First, basic quality control is performed on the FASTQ sequencing data of the historical samples; the quality control criteria are data quality Q20 > 90% and Q30 > 85%. Q20 and Q30 have the following meanings: each base in the sequencing data has a corresponding quality value. A quality value of Q20 means that the probability of a base-call error is 1%, that is, an error rate of 1% or an accuracy of 99%. A quality value of Q30 means that the probability of error is 0.1%, that is, an error rate of 0.1% or an accuracy of 99.9%.
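The relationship between a quality value and its error rate follows the standard Phred definition, P = 10^(-Q/10). The sketch below illustrates it and shows how a Q20 or Q30 rate could be computed; the function names and the example quality values are hypothetical and are not taken from the application.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality value Q into the base-call error probability."""
    return 10 ** (-q / 10)


def fraction_at_or_above(qualities, threshold):
    """Fraction of bases whose quality value is >= threshold (the Q20 or Q30 rate)."""
    if not qualities:
        return 0.0
    return sum(1 for q in qualities if q >= threshold) / len(qualities)


# Hypothetical per-base quality values for a handful of bases.
example_qualities = [35, 32, 30, 28, 22, 20, 18, 40, 38, 31]
q20_rate = fraction_at_or_above(example_qualities, 20)
q30_rate = fraction_at_or_above(example_qualities, 30)
```

In practice these rates would be computed over every base in the FASTQ file rather than over a short list.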
Next, the quality-controlled data are processed with Sentieon software (NGS data analysis acceleration software). Alignment with the accelerated bwa mem algorithm in the software yields an unsorted BAM file, which is then sorted by genomic coordinate to obtain the final sorted BAM file, that is, the alignment data.
The method specifically comprises the following steps:
The start and end coordinates of each amplicon are determined from the primer design file.
The primer design file contains the start and end coordinates of the forward primer and the start and end coordinates of the reverse primer. The primer amplification region file is constructed from the start coordinate of the forward primer and the end coordinate of the reverse primer.
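As an illustration of this step, the sketch below derives an amplicon region from one primer pair and renders it as a tab-separated line. The `PrimerPair` record, the four-column layout and the coordinates are assumptions made for the example; the application does not specify the file format, and the coordinates are placeholders rather than real genome positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrimerPair:
    """One row of a (hypothetical) primer design file."""
    name: str
    chrom: str
    fwd_start: int
    fwd_end: int
    rev_start: int
    rev_end: int


def amplicon_region(pair):
    """The amplicon runs from the forward-primer start to the reverse-primer end."""
    if pair.fwd_start > pair.rev_end:
        raise ValueError(f"primer pair {pair.name} has inverted coordinates")
    return (pair.chrom, pair.fwd_start, pair.rev_end, pair.name)


def region_file_text(pairs):
    """Render one tab-separated line per amplicon."""
    return "\n".join(
        "\t".join(str(field) for field in amplicon_region(p)) for p in pairs
    )


# Placeholder coordinates, for illustration only.
wild_type_pair = PrimerPair("wild_type", "chr7", 1000, 1029, 1200, 1228)
region_text = region_file_text([wild_type_pair])
```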
The amplicon position information, the number of wild-type sequences and the number of mutant sequences of the test sample are then obtained.
From the BAM files of the historical test samples obtained above, the paired-end alignment information of the sequences amplified by each primer pair is used for subsequent statistics.
The start and end positions at which a sequence aligns to the reference genome correspond to the 5' end coordinate of the forward primer and the 5' end coordinate of the reverse primer used for amplification.
Given the start and end coordinates of the forward and reverse wild-type primers, the number of amplicon sequencing reads produced by each primer pair can be calculated from the 5' end coordinate of the forward primer and the 5' end coordinate of the reverse primer. The alignment coordinates of each sequencing read (the position in the reference genome of the amplicon to which the read corresponds) are compared with the primer coordinates (the positions of the primers designed for the known target amplicon in the primer design file). A read pair of the test sample in the SLC25A13 IVS16 region is counted as wild-type when the left position of its left read is consistent with the 5' end coordinate of the wild-type forward primer and the right end position of its right read is consistent with the 5' end coordinate of the wild-type reverse primer. This gives the number of sequencing reads corresponding to the wild-type amplicon.
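The coordinate comparison for wild-type reads can be sketched as follows. This is a simplified illustration that assumes each read pair has already been reduced to a tuple of (left start of the left read, right end of the right read); extracting those values from a BAM file is not shown, and all coordinates are placeholders.

```python
def count_wild_type_pairs(read_pairs, fwd_5prime, rev_5prime):
    """Count read pairs whose outer coordinates coincide with the primer 5' ends.

    read_pairs: iterable of (left_start, right_end) alignment coordinates.
    fwd_5prime: 5' end coordinate of the wild-type forward primer.
    rev_5prime: 5' end coordinate of the wild-type reverse primer.
    """
    return sum(
        1
        for left_start, right_end in read_pairs
        if left_start == fwd_5prime and right_end == rev_5prime
    )


# Placeholder alignments: two pairs span the full amplicon, two do not.
example_pairs = [(1000, 1228), (1000, 1228), (1000, 1200), (990, 1228)]
wild_type_count = count_wild_type_pairs(example_pairs, 1000, 1228)
```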
A mutant amplicon sequence is a read in which one part is SLC25A13 gene sequence and the other part is insertion sequence. Therefore, after extracting the reads at the two positions to which such sequences may align, sequence matching is performed against the mutant primers designed in the first part, allowing up to 2 base mismatches: the start of the left read of the pair must match the forward primer, and the end of the right read of the pair must match the reverse primer. The extracted sequences are counted to obtain the number of mutant amplicon sequences.
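A minimal sketch of this mismatch-tolerant matching is given below. It assumes both primers are supplied in the same orientation as the reads being compared (any reverse-complementing is left to the caller), and it uses short toy sequences rather than the primers of the application.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def starts_with_primer(read, primer, max_mismatches=2):
    """True if the first len(primer) bases of the read match the primer."""
    if len(read) < len(primer):
        return False
    return mismatches(read[: len(primer)], primer) <= max_mismatches


def ends_with_primer(read, primer, max_mismatches=2):
    """True if the last len(primer) bases of the read match the primer."""
    if len(read) < len(primer):
        return False
    return mismatches(read[-len(primer):], primer) <= max_mismatches


def count_mutant_pairs(pairs, fwd_primer, rev_primer, max_mismatches=2):
    """Count pairs whose left read starts with the forward primer and whose
    right read ends with the reverse primer, within the mismatch allowance."""
    return sum(
        1
        for left, right in pairs
        if starts_with_primer(left, fwd_primer, max_mismatches)
        and ends_with_primer(right, rev_primer, max_mismatches)
    )


# Toy primers and reads, for illustration only.
toy_fwd, toy_rev = "ACGTACGT", "TTGGCCAA"
toy_pairs = [
    ("ACGTACGTGGGG", "CCCCTTGGCCAA"),  # exact match at both ends
    ("ACGAACGAGGGG", "CCCCTTGGCCAT"),  # 2 mismatches left, 1 right
    ("AGGAACGAGGGG", "CCCCTTGGCCAA"),  # 3 mismatches left, rejected
    ("ACGT", "TTGGCCAA"),              # left read shorter than the primer
]
mutant_count = count_mutant_pairs(toy_pairs, toy_fwd, toy_rev)
```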
Historical samples are selected, and the numbers of wild-type and mutant amplicons in each sample are calculated using the method above.
The characteristics of the historical samples, such as the mutant and wild-type counts and the mutation ratio, are summarized, and a classification threshold is determined.
The numbers of mutant and wild-type sequences in the historical positive and negative samples are counted. The mutation ratio is calculated as: number of mutant sequences / (number of mutant sequences + number of wild-type sequences). The mutation ratio is calculated for the historical negative and positive samples with this formula, and a suitable threshold for judging the sample result is selected according to the total sequencing depth of the historical samples in the region (number of mutant sequences plus number of wild-type sequences), the number of mutant sequences, and the mutation ratio. Sample characteristic information is shown in Table 1 below:
TABLE 1
Sample | Total depth | Number of wild-type sequences | Number of mutant sequences | Variant allele frequency
S1 | 1000 | 500 | 500 | 0.5
S2 | 2000 | 2000 | 0 | 0
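The mutation ratio formula can be checked against the two rows of Table 1 with a short calculation. The function name is hypothetical, and returning `None` for a sample with no reads is an assumption made for the sketch.

```python
def mutation_ratio(mutant_count, wild_type_count):
    """Mutant reads divided by the sum of mutant and wild-type reads."""
    total = mutant_count + wild_type_count
    if total == 0:
        return None  # no reads in the region, so the ratio is undefined
    return mutant_count / total


# Values taken from Table 1.
ratio_s1 = mutation_ratio(500, 500)
ratio_s2 = mutation_ratio(0, 2000)
```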
Based on the historical sample results, a decision tree, a classical classification algorithm, is used to obtain the optimal thresholds of the final features for judging positive and negative results. The historical results are divided into a training set and a test set at a ratio of 7:3, and the corresponding decision tree is constructed from the 4 parameters using a decision tree algorithm. The final result is as follows: when the number of mutant sequences in the sample is >= 10 and the mutation ratio is >= 0.01, the sample can be judged as positive.
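The 7:3 division can be sketched as below. This covers only the split, not the fitting of the decision tree; the shuffling, the fixed seed and the function name are assumptions made for the example, since the application does not say how samples are assigned to each set.

```python
import random


def split_7_to_3(samples, seed=0):
    """Shuffle the samples and divide them 7:3 into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 7) // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Ten hypothetical sample identifiers.
train_set, test_set = split_7_to_3(range(10))
```

A decision tree would then be fitted on the training set using the four parameters and evaluated on the test set.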
During threshold selection, when the total depth of a sample is below 50X, it cannot be determined whether the region was effectively amplified, owing to the specificity of PCR amplification, so the result cannot be judged. Whether another variation in the region prevented amplification, or whether the library needs to be rebuilt and re-sequenced, can be determined experimentally. Therefore, the sample result cannot be determined when the total sequencing depth of the sample in the region is 50X or less.
Therefore, the final sample is determined as follows:
The total sequencing depth, the number of mutant sequences, the number of wild-type sequences and the mutation ratio of the sample to be tested are obtained. When the total sequencing depth is less than or equal to 50X, the sample result is judged as undeterminable. When the total sequencing depth is greater than or equal to 50X and the number of mutant sequences is less than 10, the result is judged as negative. When the total sequencing depth is greater than 50X and the number of mutant sequences is greater than or equal to 10, the result is judged as positive if the mutation ratio is greater than or equal to 0.01, and negative otherwise.
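The determination procedure above can be written as a small function. This is a sketch under stated assumptions: total depth is taken as mutant plus wild-type reads, as defined earlier in the description; the undeterminable check is applied first, so a depth of exactly 50X is undeterminable; and `min_ratio` defaults to the 0.01 given in this passage and is exposed as a parameter. The function and label names are hypothetical.

```python
def classify_sample(wild_type_count, mutant_count,
                    min_depth=50, min_mutant=10, min_ratio=0.01):
    """Return 'undeterminable', 'negative' or 'positive' for one sample."""
    total_depth = wild_type_count + mutant_count
    # Too few reads: the region may not have been amplified effectively.
    if total_depth <= min_depth:
        return "undeterminable"
    if mutant_count < min_mutant:
        return "negative"
    ratio = mutant_count / total_depth
    return "positive" if ratio >= min_ratio else "negative"


result_s1 = classify_sample(500, 500)   # row S1 of Table 1
result_s2 = classify_sample(2000, 0)    # row S2 of Table 1
```

Keeping the thresholds as keyword arguments makes it easy to re-run the classification if a different cut-off is adopted.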
In this embodiment, there is also provided an electronic device comprising a memory in which a computer program is stored and a processor arranged to run the computer program to perform the method in the above embodiments.
These computer programs may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks, and corresponding steps may be implemented by different modules.
The programs described above may be run on a processor or may be stored in memory (also referred to as computer-readable media), which includes permanent and non-permanent, removable and non-removable media that implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
The use of the above system for detecting variation in the SLC25A13 IVS16 region is described below with an exemplary embodiment.
The sample undergoes PCR library construction and sequencing; the PCR primer pool contains a primer pair that amplifies the wild-type sequence and a primer pair that amplifies the mutant sequence. The FASTQ data obtained by high-throughput sequencing are then processed in the following steps:
1) Quality control: sequencing adapters and low-quality bases or sequences are removed, and the quality of the raw data is evaluated. The quality control criteria are data quality Q20 > 90% and Q30 > 85%.
2) Alignment: the sequences in the FASTQ file processed in step 1 are mapped to the human reference genome using BWA software, and the resulting BAM file storing the alignment information is sorted to obtain the final BAM file.
3) Depth calculation: using the designed primer information and the BAM file obtained in step 2, the total depth of the sample in the SLC25A13 IVS16 region, the number of wild-type sequences and the number of mutant sequences are counted, and the corresponding mutation ratio is calculated.
4) Collection of historical sample information: after quality control, alignment and depth calculation have been performed on the historical samples, the total depth, number of wild-type sequences, number of mutant sequences and mutation ratio of the negative and positive samples in the region are summarized. The mutation ratio is calculated as: number of mutant sequences / (number of mutant sequences + number of wild-type sequences).
5) Based on the historical sample results, a decision tree (a tree-structured model and a classical binary classification algorithm) is used to obtain the optimal thresholds of the final features for judging positive and negative results. The historical results are divided into a training set and a test set at a ratio of 7:3, and the corresponding decision tree is constructed from the 4 parameters using a decision tree algorithm. The final result is as follows: when the number of mutant sequences in the sample is >= 10 and the mutation ratio is >= 0.01, the sample can be judged as positive.
Meanwhile, owing to the specificity of PCR amplification, when the total sequencing depth is less than or equal to 50X it cannot be determined whether the region was effectively amplified, so the result cannot be judged. Whether another variation in the region prevented amplification, or whether the library needs to be rebuilt and re-sequenced, can be determined experimentally.
Test example
7 samples were selected, and each sample was tested in 3 replicates.
DNA was extracted from each test sample. After fragmentation and end repair, PCR primers were added for amplification. After amplification and library construction, the library was sequenced on the instrument, and the sequencing data were obtained in FASTQ format.
Quality control was performed on the FASTQ data: low-quality bases and adapter sequences were removed, and data quality statistics were computed. The statistics covered a data volume greater than 1.5G, an average sequencing depth of 3000X, Q20 greater than 90%, and Q30 greater than 85%.
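The Q20/Q30 statistics mentioned above can be computed directly from the FASTQ quality strings. The sketch below assumes Phred+33 encoding and interprets "1.5G" as 1.5 billion bases; both are assumptions, and the function names are hypothetical. The average depth of 3000X is not checked here, because it requires the aligned data.

```python
def q_fraction(quality_strings, threshold, offset=33):
    """Fraction of bases whose Phred score is >= threshold.

    quality_strings: iterable of FASTQ quality lines (Phred+33 assumed).
    """
    total = 0
    passed = 0
    for qual in quality_strings:
        for ch in qual:
            total += 1
            if ord(ch) - offset >= threshold:
                passed += 1
    return passed / total if total else 0.0


def passes_qc(total_bases, q20, q30):
    """Check the data volume, Q20 and Q30 limits quoted in the text."""
    return total_bases > 1.5e9 and q20 > 0.90 and q30 > 0.85
```

In practice these numbers usually come from the report of the trimming or QC tool, so a function like this is mainly useful as a cross-check.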
The FASTQ data, after quality control and removal of low-quality bases and adapter sequences, were mapped to the reference genome using the BWA module of the Sentieon software. The alignment information of each sequence in the FASTQ was acquired, yielding a BAM file that stores the alignment information.
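As a rough illustration of this alignment step, the sketch below builds an equivalent pipeline with the open-source bwa and samtools command-line tools instead of the Sentieon BWA module named in the text. That substitution is an assumption, as are the function names and file paths. The reference is expected to have been indexed beforehand with bwa index.

```python
import subprocess


def build_alignment_pipeline(reference, fastq_r1, fastq_r2, out_bam, threads=4):
    """Return the two command lines: bwa mem piped into samtools sort."""
    bwa_cmd = ["bwa", "mem", "-t", str(threads), reference, fastq_r1, fastq_r2]
    # "-" tells samtools sort to read the alignments from standard input.
    sort_cmd = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    return bwa_cmd, sort_cmd


def run_alignment(reference, fastq_r1, fastq_r2, out_bam, threads=4):
    """Run the pipeline and index the sorted BAM (needs bwa and samtools on PATH)."""
    bwa_cmd, sort_cmd = build_alignment_pipeline(
        reference, fastq_r1, fastq_r2, out_bam, threads)
    aligner = subprocess.Popen(bwa_cmd, stdout=subprocess.PIPE)
    sorter = subprocess.Popen(sort_cmd, stdin=aligner.stdout)
    aligner.stdout.close()  # let bwa receive SIGPIPE if samtools exits early
    sort_status = sorter.wait()
    align_status = aligner.wait()
    if sort_status != 0 or align_status != 0:
        raise RuntimeError("alignment pipeline failed")
    subprocess.run(["samtools", "index", out_bam], check=True)
```

Keeping command construction separate from execution makes the command lines easy to inspect or log before anything is run.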
The number of wild-type sequences and the number of mutant sequences of each test sample were calculated from the sample's BAM file and the sequence information of the wild-type and mutant primers used to amplify the SLC25A13 IVS16 region. Wild-type sequence count: a read pair in the SLC25A13 IVS16 region is counted when the left-end start position of its left read coincides with the 5' end coordinate of the forward wild-type primer and the right-end termination position of its right read coincides with the 5' end coordinate of the reverse wild-type primer. Mutant sequence count: read pairs that may align to either of the two possible positions are extracted and matched against the mutant primers designed in the first part, allowing up to 2 mismatched bases; the left-end start sequence of the pair must match the forward primer and the right-end termination sequence of the pair must match the reverse primer. The extracted pairs are counted to give the number of mutant amplicon sequences. The result is then judged from the total depth, the number of wild-type sequences, the number of mutant sequences, and the mutation ratio in the region, using the thresholds obtained in the fifth step.
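The two counting rules can be sketched as below. This is an illustration under stated assumptions, not the applicant's code: read pairs are assumed to be given in reference (forward-strand) orientation, as they are stored in a BAM file, so the right read is compared with the reverse complement of the reverse primer. All function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def within_mismatches(segment, primer, max_mismatches=2):
    """True if segment has the primer's length and differs at <= max_mismatches bases."""
    if len(segment) != len(primer):
        return False
    return sum(a != b for a, b in zip(segment, primer)) <= max_mismatches


def is_mutant_pair(left_read, right_read, fwd_primer, rev_primer, max_mismatches=2):
    """Left read must start with the forward primer and the right read must end
    with the reverse complement of the reverse primer (mismatches allowed)."""
    left_start = left_read[:len(fwd_primer)]
    right_end = right_read[-len(rev_primer):]
    return (within_mismatches(left_start, fwd_primer, max_mismatches)
            and within_mismatches(right_end, reverse_complement(rev_primer),
                                  max_mismatches))


def count_mutant_pairs(pairs, fwd_primer, rev_primer, max_mismatches=2):
    """pairs: iterable of (left_read_sequence, right_read_sequence)."""
    return sum(is_mutant_pair(left, right, fwd_primer, rev_primer, max_mismatches)
               for left, right in pairs)


def count_wild_type_pairs(pair_coords, fwd_5p, rev_5p):
    """pair_coords: iterable of (left_start, right_end) reference coordinates.
    A pair counts when both ends coincide with the primer 5' coordinates."""
    return sum(1 for left_start, right_end in pair_coords
               if left_start == fwd_5p and right_end == rev_5p)
```

With the primers of claim 1, fwd_primer would be the upstream primer and rev_primer the mutant downstream primer. The coordinate convention (0-based or 1-based, inclusive or exclusive end) must match whatever tool supplies the read positions.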
The results of 3 replicates of 7 samples were as follows:
Figure BDA0003894401070000091
Figure BDA0003894401070000101
The results of the current test samples were judged according to the results above and the thresholds summarized from the historical samples. The calls are as follows:
Figure BDA0003894401070000102
Figure BDA0003894401070000111
The calls for all samples and replicates were 100% consistent with the true results.
Therefore, this embodiment is based on high-throughput sequencing with target-region amplification. In addition to simple operation and controllable cost, it detects the specific long-fragment insertion more accurately and efficiently than hybridization capture sequencing.
From the high-throughput data obtained with the designed PCR primers, the number of sequences of each amplicon is calculated from the primer amplification coordinates of the wild-type amplicon and the sequence features of the mutant amplicon. Historical samples are used as the training set, the features of each sample are evaluated, and the final detection thresholds are determined. These thresholds are then used to call the IVS16ins3kb variant of SLC25A13 in the sample to be tested with high accuracy.
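How a threshold might be derived from historical samples can be illustrated with a deliberately simplified sketch: a 7:3 split followed by a single-feature cutoff search (a depth-1 tree, or decision stump). This is a swapped-in simplification, not the four-parameter decision tree of the application; the function names, the dictionary layout of a sample, and the seed are all hypothetical.

```python
import random


def split_7_3(samples, seed=0):
    """Shuffle a copy of the samples and split it 7:3 into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


def best_threshold(train, feature):
    """Find the cutoff t that maximizes training accuracy for the rule
    'sample[feature] >= t means positive'.

    train: list of dicts holding the feature value and a boolean 'positive' label.
    Returns (threshold, accuracy); ties keep the smallest threshold.
    """
    best_t, best_acc = None, -1.0
    for t in sorted({s[feature] for s in train}):
        correct = sum((s[feature] >= t) == s["positive"] for s in train)
        acc = correct / len(train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


def accuracy(samples, feature, threshold):
    """Accuracy of the learned cutoff on held-out samples."""
    correct = sum((s[feature] >= threshold) == s["positive"] for s in samples)
    return correct / len(samples) if samples else 0.0
```

A typical use would call split_7_3 on the labeled historical samples, fit best_threshold on the training part for a feature such as the number of mutant sequences, and report accuracy on the test part. A full decision tree repeats this cutoff search recursively over several features.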
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art to which the present application pertains. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A primer composition targeting the SLC25A13 IVS16 region, comprising an upstream primer and a downstream primer, characterized in that the upstream primer has the following sequence: AAACTGGGGTGAGGGATCGAATACACGAGC; the downstream primer comprises a mutant downstream primer and a wild-type downstream primer, wherein the sequence of the mutant downstream primer is as follows: GCCCGAACCCTTTCCATGCCAAACACCCTC, and the sequence of the wild-type downstream primer is as follows: CTGGCCAAACCATTACAGCGGAGTGATAG.
2. The primer composition of claim 1, wherein the primer composition further comprises at least one of the following control primer combinations:
for the amplification region CG0175.ACAD9.NM_014049.5.EXON7_CONTROL_ACAD9, the upstream primer sequence is as follows: ATAGGGGTTTGGTTTTCTCCAAAGTC, and the downstream primer sequence is as follows: CGCGCACACAGGAGCTACTT;
for the amplification region CG0539.CYP11B1.NM_000497.3.EXON8_CONTROL, the upstream primer sequence is as follows: CTCTCAGCTCGCCGCTTAC, and the downstream primer sequence is as follows: GACATGTCCCATCCAGCAC;
for the amplification region CG0095.INSRR.NM_014215.3.EXON2_CONTROL_NTRK1.EXON2, the upstream primer sequence is as follows: TCCTGATGCCTAGCTTAAGGAGTC, and the downstream primer sequence is as follows: GCATTGGGGGAAATGATCCAAATG;
for the amplification region CG0335.SIL1.NM_001037633.2.EXON5_CONTROL_A001, the upstream primer sequence is as follows: TCTGTGCTCTCTGGGAGAGAAGTAAA, and the downstream primer sequence is as follows: GAGACTGACATGCAGGTACG;
for the amplification region CG0336.SIL1.NM_001037633.2.EXON5_CONTROL_A002, the upstream primer sequence is as follows: CAGCAATCTTCTTCCAAACTGGAGC, and the downstream primer sequence is as follows: CCATGGTAGACCACAGATCTTGGGC;
for the amplification region CG0593.STIM1.NM_001277961.1.EXON10_CONTROL_A001, the upstream primer sequence is as follows: AAGTCCATGCCTGCAGTTCTCTT, and the downstream primer sequence is as follows: ATCCACGTCGTCAGTCATGATGAAG;
for the amplification region CG0594.STIM1.NM_001277961.1.EXON10_CONTROL_A002, the upstream primer sequence is as follows: AAGTCCATGCCTGCAGTTCTCTT, and the downstream primer sequence is as follows: AAAGGCTCCTTCCTTCATCCCCGC;
for the amplification region CG0740.EDNRB.NM_001201397.1.EXON3_CONTROL, the upstream primer sequence is as follows: GGAAACACTTCTGAGTGGCATTTATTTA, and the downstream primer sequence is as follows: TGAGTAAATGAGCCATCTTTAAGGGTCA;
for the amplification region CG0174.IQCB1.NM_001023570.EXON3_CONTROL, the upstream primer sequence is as follows: GTAATACTGATATGGTACAGAAGCTTCATACCAA, and the downstream primer sequence is as follows: GTTAGGGGAGAAAAATCAAACCTTA.
3. Use of the primer composition of claim 1 or 2 in the preparation of a kit for detecting variation in the SLC25A13 IVS16 region.
4. A kit, characterized in that the kit comprises the primer composition of claim 1 or 2.
5. The kit of claim 4, wherein: the kit also comprises DNA polymerase and reaction buffer solution.
6. A system for detecting variation in the SLC25A13 IVS16 region, comprising the following modules:
an acquisition module for acquiring a sample of a subject;
an amplification module for performing PCR amplification on the sample;
a library construction module for constructing a multiplex PCR targeted sequencing library;
a sequencing module for sequencing and analysis;
wherein the amplification module uses the primer composition of claim 1 or 2, and the sequencing module uses second-generation sequencing technology.
7. The system of claim 6, wherein: the sequencing module comprises a pre-trained machine learning model, and sample data obtained after PCR amplification and multiplex PCR targeted sequencing library construction are input into the machine learning model to obtain an analysis result corresponding to the sample data.
8. The system of claim 7, wherein: the pre-trained machine learning model is obtained by dividing historical samples into a training set and a test set at a ratio of 7:3.
9. The system of claim 8, wherein: the machine learning model performs classification using a decision tree algorithm, and the decision tree is constructed on four parameters of each sample: the total sequencing depth, the number of wild-type sequences, the number of mutant sequences, and the variant allele frequency.
10. The system of claim 8, wherein: the decision criteria of the decision tree algorithm are as follows:
a sample is determined to be positive when the following conditions are satisfied:
the total sequencing depth is greater than 50X, the number of mutant sequences is greater than or equal to 10, and the mutation ratio is greater than or equal to 10%;
a sample is determined to be negative when the following conditions are satisfied:
the total sequencing depth is greater than or equal to 50X, and the number of mutant sequences is less than 10;
a sample is determined to be undeterminable when the following condition is satisfied:
the total sequencing depth is less than or equal to 50X.
CN202211269044.5A 2022-10-17 2022-10-17 Primer combination, kit and system for detecting SLC25A13IVS16 region variation Pending CN115725720A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211269044.5A CN115725720A (en) 2022-10-17 2022-10-17 Primer combination, kit and system for detecting SLC25A13IVS16 region variation

Publications (1)

Publication Number Publication Date
CN115725720A true CN115725720A (en) 2023-03-03

Family

ID=85293663

Country Status (1)

Country Link
CN (1) CN115725720A (en)

Similar Documents

Publication Publication Date Title
US10216895B2 (en) Rare variant calls in ultra-deep sequencing
US10127351B2 (en) Accurate and fast mapping of reads to genome
CN105740650B (en) A method of quick and precisely identifying high-throughput genomic data pollution sources
CN114333987B (en) Data analysis method for predicting drug resistance phenotype based on metagenomic sequencing
WO2023115662A1 (en) Method for detecting variant nucleic acids
CN113249453B (en) Method for detecting copy number change
CN116072218A (en) Sequencing method
CN105986013A (en) Method and device for determining microbial species
CN105950707A (en) Method and system for determining nucleic acid sequence
CN109920480B (en) Method and device for correcting high-throughput sequencing data
CN110444253B (en) Method and system suitable for mixed pool gene positioning
CN116386718A (en) Method, apparatus and medium for detecting copy number variation
CN114530199A (en) Method and device for detecting low-frequency mutation based on double sequencing data and storage medium
Whitehouse et al. Timesweeper: accurately identifying selective sweeps using population genomic time series
CN113930492A (en) Biological information processing method for paternity test of contaminated sample
CN110232951B (en) Method, computer readable medium and application for judging saturation of sequencing data
CN115725720A (en) Primer combination, kit and system for detecting SLC25A13IVS16 region variation
WO2020068881A1 (en) Compositions, systems, apparatuses, and methods for validation of microbiome sequence processing and differential abundance analyses via multiple bespoke spike-in mixtures
CN116312779A (en) Method and apparatus for detecting sample contamination and identifying sample mismatch
CN115637288A (en) Method for detecting copy number change of SMN1 and SMN2 genes and application thereof
CN114944188A (en) Sample homology judgment model and establishment method and application thereof
Zachariasen et al. Identification of representative species-specific genes for abundance measurements
CN117672354A (en) Method and apparatus for comparing quality of complete genome assembly of closely related species of mammals
Chlis et al. Extracting reliable gene expression signatures through stable bootstrap validation
CN114703301A (en) Primer group and kit for identifying three kinds of Bordetella, and application thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination