CN112435710A - Method for detecting single-sample SMN gene copy number in WES data - Google Patents

Info

Publication number
CN112435710A
Authority
CN
China
Prior art keywords
gene
coverage
value
copy number
smn1
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011107940.2A
Other languages
Chinese (zh)
Other versions
CN112435710B (en)
Inventor
余伟师
梁萌萌
鲍远亮
栗海波
贺洪鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Saifu Decoding Beijing Gene Technology Co ltd
Original Assignee
Saifu Decoding Beijing Gene Technology Co ltd
Application filed by Saifu Decoding Beijing Gene Technology Co ltd
Priority to CN202011107940.2A
Publication of CN112435710A
Application granted
Publication of CN112435710B
Legal status: Active

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10: Ploidy or copy number detection

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method for detecting the SMN gene copy number of a single sample in WES data. Negative samples and positive samples with known actual SMN gene copy numbers are used in advance to construct a data set of SMN1 and SMN2 copy number values, against which the gene copy number of a single sample is determined. Control intervals that correlate strongly with the SMN gene copy number are searched for among the whole-exome BED intervals, and the read coverage of these intervals is used to correct batch effects between samples, which improves the accuracy of the detection method. The method thereby detects the SMN gene copy number of a single sample accurately and also detects SMN1 2+0 silent carriers that carry the g.27134T>G point mutation.

Description

Method for detecting single-sample SMN gene copy number in WES data
Technical Field
The invention relates to the field of genome-wide variant detection for biology and precision medicine, and in particular to a method for detecting the SMN gene copy number of a single sample in whole-exome sequencing (WES) data.
Background
Spinal muscular atrophy (SMA) is an inherited neurological disease that causes motor neuron degeneration, muscle atrophy, muscle weakness and ultimately death. SMA is caused by deletion or mutation of the human gene "survival of motor neuron 1" (SMN1). The disease is closely tied to two highly homologous genes, SMN1 and SMN2 ("survival of motor neuron 2"), whose sequences are nearly identical and which are distinguished mainly by two sites, one on exon 7 and one on exon 8. Most unaffected individuals carry 2 copies of SMN1 and 2 copies of SMN2. Most SMN2 transcripts skip exon 7, so SMN2 yields only a small amount of full-length SMN mRNA. Consequently, a person in whom both copies of SMN1 have lost function develops the disease, and a person with only one functional copy of SMN1 is a carrier. When SMN1 is non-functional, the SMN2 copy number affects the age of onset and the severity of the disease.
Existing SMA genetic tests fall into the following classes. (1) PCR (polymerase chain reaction) or first-generation (Sanger) sequencing: the target region is amplified and then resolved by restriction enzyme digestion or by first-generation sequencing. In a patient, the C peak of SMN1 at position c.840 is absent and only the homozygous T peak of SMN2 is seen; unaffected individuals and carriers show a heterozygous C/T peak. (2) MLPA (multiplex ligation-dependent probe amplification), first reported by Schouten et al. in 2002, is a technique for qualitative and semi-quantitative analysis of target DNA sequences. It is efficient and specific, and can detect copy number changes of 45 nucleotide sequences in one reaction. Different probe sequences are designed for the c.840C>T site so that fragments of different lengths are amplified for SMN1 and SMN2, and the peak heights reflect copy number variation. (3) Second-generation sequencing: Feng et al. of Baylor College of Medicine published a study in Genetics in Medicine (PMID: 28125085) on detecting SMA by second-generation sequencing, covering 6648 samples. The principle is as follows: samples of the same batch undergo target-region capture sequencing of the SMN genes; the total coverage of exons 1 to 8 of SMN1 and SMN2 is counted; single-end reads are extracted and the proportion of SMN1 reads to SMN2 reads is determined at c.840C>T; the number of SMN1 and SMN2 copies carried by each person is then calculated from that proportion and the total coverage. Compared with MLPA, both sensitivity and specificity exceed 98%.
The study also confirmed several pathogenic point mutations, and its g.27134T>G calls agreed with RFLP (restriction fragment length polymorphism) results; this site is closely associated with the special SMN1 2+0 carrier type. However, sensitivity and specificity for the SMN2 copy number are not explicitly reported.
Each of these three methods has its own drawbacks. (1) PCR-RFLP or first-generation sequencing: there is a risk of incomplete enzyme digestion, carriers cannot be distinguished from unaffected individuals, and the SMN2 copy number cannot be determined. Only patients with a homozygous SMN1 deletion can be clinically diagnosed; in all other situations the method serves only as a preliminary screen. (2) MLPA: the kit cannot detect point mutations or the special SMN1 2+0 carriers, and its throughput is low. (3) The method of Feng et al. is based on the NGS platform and can identify the special SMN1 2+0 carrier variant, but samples must be analyzed within the same batch to eliminate batch effects, so the result suffers when a batch contains too few samples. Because the method computes coverage only from the alignments of single-end reads, part of the useful information may be lost. It also counts coverage over all of exons 1 to 8. Although this takes the alignment of the whole SMN gene into account, uncertainty in library preparation and sequencing and differences in amplification efficiency between exons cause the read counts to vary, which strongly affects SMN copy number detection, especially the true copy number of exons 7 and 8.
In addition, some open-source software can detect the SMN gene copy number in WGS data, but it requires WGS data or batches of samples and therefore does not meet the need for single-sample detection. To detect the SMN gene copy number quickly and accurately, and in particular to meet the clinical need for single-sample testing, the present method builds on the NGS platform and WES data: a data set is constructed from a large number of test samples, the probability values corresponding to different SMN copy numbers are established in advance, and batch effects between samples are thoroughly removed, which improves the flexibility and reliability of detection.
Disclosure of Invention
The application provides a method for detecting the SMN gene copy number of a single sample in WES data, to solve the prior-art problems that the SMN gene copy number of a single sample cannot be detected accurately and that the special SMN1 2+0 carrier state cannot be detected at the same time.
The application provides a method for detecting the SMN gene copy number of a single sample in WES data, comprising:
S1, collecting WES data from different batches for negative samples and positive samples whose actual SMN gene copy numbers are known, and searching the whole-exome BED intervals for control intervals that correlate strongly with the SMN gene copy number;
S2, correcting batch effects between the negative samples and the positive samples of different batches by using the read coverage of the control intervals; defining the negative samples and positive samples with known actual SMN gene copy numbers from the different batches of WES data as "all samples"; and calculating, for all samples, the distribution range of P1 values at each SMN1 copy number and the distribution range of P2 values at each SMN2 copy number. The P1 value distribution ranges are grouped by the actual SMN1 copy number of the samples: the range for samples with 0 copies of SMN1 is defined as P1_zero, the range for samples with 1 copy as P1_one, the range for samples with 2 copies as P1_two, and so on. The P2 value distribution ranges are grouped by the actual SMN2 copy number of the samples in the same way: the range for samples with 0 copies of SMN2 is defined as P2_zero, the range for samples with 1 copy as P2_one, the range for samples with 2 copies as P2_two, and so on. Hereinafter, P1_zero, P1_one, P1_two and so on are collectively referred to as the P1 value, and P2_zero, P2_one, P2_two and so on as the P2 value.
counting, over the samples among all samples that have been verified as silent carriers, the distribution range P_silent of the corrected coverage at the g.27134T>G site of intron 7, so that whether a single sample is a silent carrier can be judged from the P_silent distribution range together with the evidence that the SMN1 copy number of the single sample is 2;
S3, calculating the p1 value of exons 7 and 8 of the SMN1 gene and the p2 value of exons 7 and 8 of the SMN2 gene for a single test sample, and determining the SMN1 and SMN2 copy numbers that correspond to these p1 and p2 values from the P1 and P2 value distribution ranges calculated in S2;
counting the coverage value p_silent at the g.27134T>G site of intron 7 of the single test sample, and judging the state of the single test sample from the p_silent value and its SMN1 copy number:
judging the single test sample to be a silent carrier when the p_silent value lies within the P_silent distribution range calculated in S2 and the SMN1 copy number of the single test sample is 2;
judging the single test sample to be a suspected silent carrier when the p_silent value lies within the P_silent distribution range calculated in S2 but the SMN1 copy number of the single test sample is not 2;
and judging the single test sample to be a non-silent carrier in all other cases.
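The three-way decision at the end of S3 can be sketched in a few lines of Python. This is only an illustration of the rule as stated above: the function name, the `SilentRange` type and the example numbers are assumptions made for the sketch and do not come from the patent text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SilentRange:
    """Hypothetical container for the P_silent distribution range from S2."""
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


def classify_silent_carrier(p_silent: float,
                            smn1_copy_number: int,
                            p_silent_range: SilentRange) -> str:
    """Apply the S3 rule: in range and 2 copies, in range only, or neither."""
    in_range = p_silent_range.contains(p_silent)
    if in_range and smn1_copy_number == 2:
        return "silent carrier"
    if in_range:
        return "suspected silent carrier"
    return "non-silent carrier"


# Illustrative values only.
example_range = SilentRange(low=0.39, high=5000.0)
print(classify_silent_carrier(0.50, 2, example_range))
print(classify_silent_carrier(0.50, 1, example_range))
print(classify_silent_carrier(0.01, 2, example_range))
```

The SMN1 copy number passed in is the one already determined from the p1 value, so the classification depends on the copy number call being made first.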
Further, the step of searching for the control intervals in S1 includes:
S101, verifying the actual SMN1 and SMN2 copy numbers of all samples on an MLPA platform, and processing the sequencing data with a bioinformatics analysis pipeline to obtain BAM files;
S102, pre-screening the BED intervals of two-copy genes, and counting the coverage of all samples over the whole-exome BED intervals;
S103, normalizing the coverage of all samples to 100X to obtain the corrected coverage of each sample;
S104, calculating correlation and variance from the corrected coverage of all samples, and selecting BED intervals with good correlation and low variance as control intervals.
Furthermore, the control intervals are the top 5 BED intervals with good correlation and low variance.
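A minimal sketch of S103 and S104 follows. The text does not define how "correlation" is measured or how correlation and variance are combined, so several things here are assumptions: coverage is scaled so that each sample's mean is 100X, correlation is the Pearson coefficient between an interval's coverage and the coverage of the SMN region across samples, intervals below a hypothetical threshold `min_corr` are dropped, and the survivors are ranked by variance. All function and parameter names are illustrative.

```python
from math import sqrt
from statistics import mean, pvariance


def normalize_to_depth(coverages, target=100.0):
    """Scale one sample's per-interval coverage so its mean equals `target`."""
    sample_mean = mean(coverages.values())
    return {name: cov * target / sample_mean for name, cov in coverages.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def select_control_intervals(interval_cov, target_cov, top_n=5, min_corr=0.9):
    """Pick up to `top_n` intervals: well correlated with the target, lowest variance.

    interval_cov: {interval_name: [normalized coverage, one value per sample]}
    target_cov:   [normalized coverage of the SMN region, one value per sample]
    """
    candidates = []
    for name, values in interval_cov.items():
        r = pearson(values, target_cov)
        if r >= min_corr:
            candidates.append((pvariance(values), -r, name))
    candidates.sort()  # lowest variance first, ties broken by higher correlation
    return [name for _, _, name in candidates[:top_n]]
```

Filtering first and ranking second is one simple way to combine the two criteria; a combined score would be an equally plausible reading of the text.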
Further, the step S2 includes:
S201, counting and correcting the total coverage of exons 7 and 8 of the SMN1 and SMN2 genes for all samples, to obtain the corrected total coverage of exons 7 and 8 of the SMN1 and SMN2 genes;
S202, counting and correcting the total coverage of all samples over the 5 control intervals, to obtain the corrected mean coverage of the control intervals;
S203, counting and correcting the coverage of the 3 point mutations for all samples, to obtain the corrected coverage of the 3 point mutations; the coverage of the 3 point mutations includes the coverage of the c.840C>T site on exon 7, the coverage of the c.*239G>A site on exon 8 and the coverage of the g.27134T>G site on intron 7; calculating the ratio values of the corrected coverage of the SMN1 gene on exons 7 and 8; calculating the ratio values of the corrected coverage of the SMN2 gene on exons 7 and 8;
S204, calculating the copy number values p_e7_s1 and p_e8_s1 of exons 7 and 8 of the SMN1 gene from the corrected total coverage of exons 7 and 8 of the SMN1 and SMN2 genes, the corrected mean coverage of the control intervals and the ratio values; calculating the copy number values p_e7_s2 and p_e8_s2 of exons 7 and 8 of the SMN2 gene; calculating the p1 value from the p_e7_s1 and p_e8_s1 values; and calculating the p2 value from the p_e7_s2 and p_e8_s2 values.
Furthermore, every correction is performed with the corresponding in-batch median coverage.
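The text does not spell out the arithmetic of the in-batch median correction. One reading that is consistent with the formulas for the copy number coefficients is to divide each sample's raw coverage by the median of that quantity over the batch, so that a typical sample ends up near 1. The sketch below shows that reading only; the function name and the sample identifiers are hypothetical.

```python
from statistics import median


def correct_by_batch_median(batch_values):
    """Divide each sample's raw coverage by the batch median of that quantity.

    batch_values: {sample_id: raw coverage of one region or site}
    Assumption: "corrected by the in-batch median" means division by the median.
    """
    batch_median = median(batch_values.values())
    if batch_median == 0:
        raise ValueError("batch median coverage is zero; cannot correct")
    return {sample: value / batch_median for sample, value in batch_values.items()}


# Illustrative raw coverage of one region for a batch of three samples.
print(correct_by_batch_median({"s1": 80.0, "s2": 100.0, "s3": 120.0}))
```

The same helper would be applied separately to each quantity named in S201 to S203 (exon totals, control intervals, point-mutation sites), since each has its own batch median.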
Furthermore, the ratio value of the SMN1 gene on exon 7 and the p_e7_s1 value are calculated as:
ratio_e7_s1=rc_e7_s1/(rc_e7_s1+rc_e7_s2);
cn_e7_s1=rc_e7_s1_total/rc_control;
cn_e7_s2=rc_e7_s2_total/rc_control;
p_e7_s1=ratio_e7_s1*(cn_e7_s1+cn_e7_s2)*2;
the ratio value of the SMN1 gene on exon 8 and the p_e8_s1 value are calculated as:
ratio_e8_s1=rc_e8_s1/(rc_e8_s1+rc_e8_s2);
cn_e8_s1=rc_e8_s1_total/rc_control;
cn_e8_s2=rc_e8_s2_total/rc_control;
p_e8_s1=ratio_e8_s1*(cn_e8_s1+cn_e8_s2)*2;
the p1 value of the SMN1 gene is calculated as:
p1=(p_e7_s1+p_e8_s1)/2
the ratio value of the SMN2 gene on exon 7 and the p_e7_s2 value are calculated as:
ratio_e7_s2=rc_e7_s2/(rc_e7_s1+rc_e7_s2);
p_e7_s2=ratio_e7_s2*(cn_e7_s1+cn_e7_s2)*2;
the ratio value of the SMN2 gene on exon 8 and the p_e8_s2 value are calculated as:
ratio_e8_s2=rc_e8_s2/(rc_e8_s1+rc_e8_s2);
p_e8_s2=ratio_e8_s2*(cn_e8_s1+cn_e8_s2)*2;
the p2 value of the SMN2 gene is calculated as:
p2=(p_e7_s2+p_e8_s2)/2
the p_silent value is calculated as:
p_silent=[corrected coverage of the g.27134T>G site]
the P_silent value is calculated as:
P_silent=[min{p_silent_sample1,p_silent_sample2,...,p_silent_sampleN},5000]
the variable names in the formulas have the following meanings:
rc_e7_s1: corrected coverage of the c.840C>T site on exon 7 for SMN1, corrected by the in-batch median,
rc_e8_s1: corrected coverage of the c.*239G>A site on exon 8 for SMN1, corrected by the in-batch median,
rc_e7_s2: corrected coverage of the c.840C>T site on exon 7 for SMN2, corrected by the in-batch median,
rc_e8_s2: corrected coverage of the c.*239G>A site on exon 8 for SMN2, corrected by the in-batch median,
rc_control: coverage of the control intervals, corrected by the in-batch median,
rc_e7_s1_total: corrected total coverage of SMN1 on exon 7,
rc_e8_s1_total: corrected total coverage of SMN1 on exon 8,
rc_e7_s2_total: corrected total coverage of SMN2 on exon 7,
rc_e8_s2_total: corrected total coverage of SMN2 on exon 8,
cn_e7_s1: copy number coefficient of SMN1 on exon 7,
cn_e8_s1: copy number coefficient of SMN1 on exon 8,
cn_e7_s2: copy number coefficient of SMN2 on exon 7,
cn_e8_s2: copy number coefficient of SMN2 on exon 8,
ratio_e7_s1: ratio value of SMN1 on exon 7,
ratio_e8_s1: ratio value of SMN1 on exon 8,
ratio_e7_s2: ratio value of SMN2 on exon 7,
ratio_e8_s2: ratio value of SMN2 on exon 8,
p_e7_s1: copy number probability value of SMN1 on exon 7,
p_e8_s1: copy number probability value of SMN1 on exon 8,
p_e7_s2: copy number probability value of SMN2 on exon 7,
p_e8_s2: copy number probability value of SMN2 on exon 8,
p1: probability value for the corresponding copy number of the SMN1 gene in a single sample,
p2: probability value for the corresponding copy number of the SMN2 gene in a single sample,
P1: distribution of the p1 values of all samples, grouped by corresponding copy number,
P2: distribution of the p2 values of all samples, grouped by corresponding copy number,
p_silent: corrected coverage of the g.27134T>G site for a single sample,
P_silent: threshold distribution range for silent carriers, running from the minimum of all p_silent values of the samples, after outliers are excluded, to 5000 (5000 is the upper limit, theoretically the maximum corrected coverage at that site).
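The formulas and variable definitions above translate almost directly into code. The sketch below is one possible transcription, with p2 computed from the SMN2 values p_e7_s2 and p_e8_s2 as step S204 describes. The `ExonCoverage` type, the function names and the example numbers are assumptions introduced for illustration; outlier exclusion for P_silent is not specified in the text and is assumed to have been done beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExonCoverage:
    """Corrected coverage values for one exon (7 or 8); hypothetical container."""
    rc_s1: float        # corrected coverage at the distinguishing site, SMN1
    rc_s2: float        # corrected coverage at the distinguishing site, SMN2
    rc_s1_total: float  # corrected total coverage of SMN1 on this exon
    rc_s2_total: float  # corrected total coverage of SMN2 on this exon


def exon_p_values(exon, rc_control):
    """Return (p_s1, p_s2) for one exon, following the ratio/cn/p formulas."""
    site_total = exon.rc_s1 + exon.rc_s2
    if site_total == 0 or rc_control == 0:
        raise ValueError("zero coverage; p values are undefined")
    ratio_s1 = exon.rc_s1 / site_total
    ratio_s2 = exon.rc_s2 / site_total
    cn_sum = exon.rc_s1_total / rc_control + exon.rc_s2_total / rc_control
    return ratio_s1 * cn_sum * 2, ratio_s2 * cn_sum * 2


def gene_p_values(exon7, exon8, rc_control):
    """Average the exon 7 and exon 8 values into p1 (SMN1) and p2 (SMN2)."""
    p_e7_s1, p_e7_s2 = exon_p_values(exon7, rc_control)
    p_e8_s1, p_e8_s2 = exon_p_values(exon8, rc_control)
    return (p_e7_s1 + p_e8_s1) / 2, (p_e7_s2 + p_e8_s2) / 2


def p_silent_range(p_silent_values, upper_limit=5000.0):
    """P_silent = [min of the carriers' p_silent values, 5000]; outliers assumed removed."""
    return min(p_silent_values), upper_limit


# Illustrative sample with balanced SMN1/SMN2 signal and typical coverage.
balanced = ExonCoverage(rc_s1=0.5, rc_s2=0.5, rc_s1_total=1.0, rc_s2_total=1.0)
print(gene_p_values(balanced, balanced, rc_control=1.0))
```

With balanced site coverage and coefficients near 1, both p values come out near 2, which matches the expectation for a sample carrying 2 copies of each gene.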
Further, the step in S3 of calculating the p1 value for exons 7 and 8 of the SMN1 gene and the p2 value for exons 7 and 8 of the SMN2 gene of the single test sample comprises:
S301, counting, for the single test sample, the total coverage of exons 7 and 8 of the SMN1 and SMN2 genes, the coverage of the 5 control intervals and the coverage of the 3 point mutations, and correcting each of them to obtain the corrected total coverage of exons 7 and 8 of the SMN1 and SMN2 genes, the corrected mean coverage of the 5 control intervals and the corrected coverage of the 3 point mutations of the single test sample; the coverage of the 3 point mutations comprises the coverage of the c.840C>T site on exon 7, the coverage of the c.*239G>A site on exon 8 and the coverage of the g.27134T>G site on intron 7;
S302, calculating the ratio values of the corrected coverage of the SMN1 and SMN2 genes on exons 7 and 8 from the corrected total coverage of exons 7 and 8 of the SMN1 and SMN2 genes, the corrected mean coverage of the 5 control intervals and the corrected coverage of the 3 point mutations of the single test sample;
S303, calculating the p1 and p2 values for exons 7 and 8 of the SMN1 and SMN2 genes of the single test sample.
Further, the total coverage of exons 7 and 8 of the SMN1 and SMN2 genes of the single test sample is corrected using the median of the corrected total coverage of exons 7 and 8 of the SMN1 and SMN2 genes described in S201;
the coverage of the 5 control intervals of the single test sample is corrected using the median of the corrected mean coverage of the control intervals described in S202;
the coverage of the 3 point mutations of the single test sample is corrected using the median of the corrected coverage of the 3 point mutations described in S203.
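Because a single test sample has no batch of its own, its coverage is corrected against medians carried over from the reference data set. The sketch below assumes that "corrected using the median" means dividing the test sample's raw value by the stored reference median; the `ReferenceMedians` type, its field names and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceMedians:
    """Medians stored from the reference data set (S201 to S203); illustrative."""
    exon_total: float      # median for the exon 7/8 total coverage
    control: float         # median for the control-interval mean coverage
    point_mutation: float  # median for the point-mutation site coverage


def correct_single_sample(raw_exon_total, raw_control_intervals,
                          raw_point_mutation, ref):
    """Correct one test sample's raw coverage with stored reference medians.

    Assumption: correction is division by the corresponding reference median.
    """
    control_mean = sum(raw_control_intervals) / len(raw_control_intervals)
    return {
        "rc_total": raw_exon_total / ref.exon_total,
        "rc_control": control_mean / ref.control,
        "rc_site": raw_point_mutation / ref.point_mutation,
    }


ref = ReferenceMedians(exon_total=120.0, control=100.0, point_mutation=60.0)
print(correct_single_sample(90.0, [100.0, 110.0, 90.0, 95.0, 105.0], 30.0, ref))
```

Storing the medians once and reusing them is what lets a sample be analyzed on its own, without re-running the whole reference batch.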
Advantageous effects
The invention provides a method for detecting the SMN gene copy number of a single sample in WES data, which effectively solves the prior-art problem that the SMN gene copy number of a single sample cannot be detected accurately, and achieves the following technical effects:
negative samples and positive samples with known actual SMN gene copy numbers are used in advance to construct a data set of SMN1 and SMN2 copy number values, against which the gene copy number of a single sample is determined; control intervals that correlate strongly with the SMN gene copy number are searched for among the whole-exome BED intervals; and the read coverage of these intervals is used to correct batch effects between samples, which improves the accuracy of the detection method and also allows SMN1 2+0 silent carriers with the g.27134T>G point mutation to be detected.
It should be understood that all combinations of the foregoing concepts and additional concepts described in greater detail below can be considered as part of the inventive subject matter of this disclosure unless such concepts are mutually inconsistent.
The foregoing and other aspects, embodiments and features of the present teachings can be more fully understood from the following description taken in conjunction with the accompanying drawings. Additional aspects of the present invention, such as features and/or advantages of exemplary embodiments, will be apparent from the description which follows, or may be learned by practice of specific embodiments in accordance with the teachings of the present invention.
Drawings
FIG. 1 is a flowchart of searching the whole-exome BED intervals for control intervals that correlate strongly with the SMN gene copy number according to the present invention;
FIG. 2 is a flowchart of calculating the distribution ranges of the copy number P values of the SMN1 and SMN2 genes for all samples according to the present invention;
FIG. 3 is a flowchart of determining the copy number p values of the SMN1 and SMN2 genes and the silent carrier status of a single test sample according to the present invention.
Detailed Description
In this disclosure, aspects of the present invention are described with reference to the accompanying drawings, in which a number of illustrative embodiments are shown. Embodiments of the present disclosure are not necessarily defined to include all aspects of the invention. It should be appreciated that the various concepts and embodiments described above, as well as those described in greater detail below, may be implemented in any of numerous ways, as the disclosed concepts and embodiments are not limited to any one implementation. In addition, some aspects of the present disclosure may be used alone, or in any suitable combination with other aspects of the present disclosure.
To solve the prior-art problems that the SMN gene copy number of a single sample cannot be detected accurately and that the special SMN1 2+0 carrier state cannot be detected at the same time, negative samples and positive samples with known actual SMN gene copy numbers are used in advance to construct a data set of SMN1 and SMN2 copy number values, against which the gene copy number of a single sample is determined. Control intervals that correlate strongly with the SMN gene copy number are searched for among the whole-exome BED intervals, and the read coverage of these intervals is used to correct batch effects between samples, which improves the accuracy of the detection method. The method thereby detects the SMN gene copy number of a single sample accurately and also detects SMN1 2+0 silent carriers that carry the g.27134T>G point mutation.
In particular, the application provides a method for detecting the SMN gene copy number of a single sample in WES data, comprising:
S1, as shown in FIG. 1, collecting WES data from different batches for 500 negative and positive samples whose actual SMN gene copy numbers are known, and searching the whole-exome BED intervals for control intervals that correlate strongly with the SMN gene copy number;
S2, correcting batch effects between the negative samples and the positive samples of different batches by using the read coverage of the control intervals; defining the negative samples and positive samples with known actual SMN gene copy numbers from the different batches of WES data as "all samples"; and calculating, for all samples, the distribution range of P1 values at each SMN1 copy number and the distribution range of P2 values at each SMN2 copy number;
counting, over the samples among all samples that have been verified as silent carriers, the distribution range P_silent of the corrected coverage at the g.27134T>G site of intron 7, so that whether a single sample is a silent carrier can be judged from the P_silent distribution range together with the evidence that the SMN1 copy number of the single sample is 2;
S3, calculating the p1 value of exons 7 and 8 of the SMN1 gene and the p2 value of exons 7 and 8 of the SMN2 gene for a single test sample, and determining the SMN1 and SMN2 copy numbers that correspond to these p1 and p2 values from the P1 and P2 value distribution ranges calculated in S2;
counting the coverage value p_silent at the g.27134T>G site of intron 7 of the single test sample, and judging the state of the single test sample from the p_silent value and its SMN1 copy number:
judging the single test sample to be a silent carrier when the p_silent value lies within the P_silent distribution range calculated in S2 and the SMN1 copy number of the single test sample is 2;
judging the single test sample to be a suspected silent carrier when the p_silent value lies within the P_silent distribution range calculated in S2 but the SMN1 copy number of the single test sample is not 2;
and judging the single test sample to be a non-silent carrier in all other cases.
In specific implementation, as shown in FIG. 1, the step of searching for the control intervals in S1 includes:
S101, verifying the actual SMN1 and SMN2 copy numbers of all samples on an MLPA platform, and processing the sequencing data with a bioinformatics analysis pipeline to obtain BAM files;
S102, pre-screening the BED intervals of two-copy genes, and counting the coverage of all samples over the whole-exome BED intervals;
S103, normalizing the coverage of all samples to 100X to obtain the corrected coverage of each sample;
S104, calculating correlation and variance from the corrected coverage of all samples, and selecting the top 5 BED intervals with good correlation and low variance as control intervals, which reduces correction bias when the mean or median is subsequently calculated.
In specific implementation, the step S2 includes:
S201, counting the total coverage of exons 7 and 8 of the SMN1 and SMN2 genes for all samples and correcting it with the corresponding in-batch median coverage, to obtain the corrected total coverage of exons 7 and 8 of the SMN1 and SMN2 genes;
S202, counting the total coverage of all samples over the 5 control intervals and correcting it with the corresponding in-batch median coverage, to obtain the corrected mean coverage of the control intervals;
S203, counting the coverage of the 3 point mutations for all samples and correcting it with the corresponding in-batch median coverage, to obtain the corrected coverage of the 3 point mutations; the coverage of the 3 point mutations includes the coverage of the c.840C>T site on exon 7, the coverage of the c.*239G>A site on exon 8 and the coverage of the g.27134T>G site on intron 7; calculating, according to the formulas, the ratio values of the corrected coverage of the SMN1 gene on exons 7 and 8; calculating the ratio values of the corrected coverage of the SMN2 gene on exons 7 and 8;
S204, calculating the copy number values p_e7_s1 and p_e8_s1 of exons 7 and 8 of the SMN1 gene from the corrected total coverage of exons 7 and 8 of the SMN1 and SMN2 genes, the corrected mean coverage of the control intervals and the ratio values; calculating the copy number values p_e7_s2 and p_e8_s2 of exons 7 and 8 of the SMN2 gene; calculating the p1 value from the p_e7_s1 and p_e8_s1 values; and calculating the p2 value from the p_e7_s2 and p_e8_s2 values.
In specific implementation, the ratio value of the SMN1 gene on exon 7 and the p_e7_s1 value are calculated as:
ratio_e7_s1=rc_e7_s1/(rc_e7_s1+rc_e7_s2);
cn_e7_s1=rc_e7_s1_total/rc_control;
cn_e7_s2=rc_e7_s2_total/rc_control;
p_e7_s1=ratio_e7_s1*(cn_e7_s1+cn_e7_s2)*2;
the ratio value of the SMN1 gene on exon 8 and the p_e8_s1 value are calculated as:
ratio_e8_s1=rc_e8_s1/(rc_e8_s1+rc_e8_s2);
cn_e8_s1=rc_e8_s1_total/rc_control;
cn_e8_s2=rc_e8_s2_total/rc_control;
p_e8_s1=ratio_e8_s1*(cn_e8_s1+cn_e8_s2)*2;
the p1 value of the SMN1 gene is calculated as:
p1=(p_e7_s1+p_e8_s1)/2
the ratio value of the SMN2 gene on exon 7 and the p_e7_s2 value are calculated as:
ratio_e7_s2=rc_e7_s2/(rc_e7_s1+rc_e7_s2);
p_e7_s2=ratio_e7_s2*(cn_e7_s1+cn_e7_s2)*2;
the ratio value of the SMN2 gene on exon 8 and the p_e8_s2 value are calculated as:
ratio_e8_s2=rc_e8_s2/(rc_e8_s1+rc_e8_s2);
p_e8_s2=ratio_e8_s2*(cn_e8_s1+cn_e8_s2)*2;
the p2 value of the SMN2 gene is calculated as:
p2=(p_e7_s2+p_e8_s2)/2
The distribution ranges of the P1 values at each SMN1 copy number and of the P2 values at each SMN2 copy number for all 500 samples of this example are shown in Table 1 below:
Table 1. Ranges of P1 and P2 values at the corresponding copy numbers of the SMN1 and SMN2 genes for all samples
Number of copies    Range of P1 values    Range of P2 values
0                   [0, 0.003]            [0, 0.002]
1                   [0.95, 1.12]          [0.83, 1.15]
2                   [1.86, 2.21]          [1.97, 2.16]
3                   [2.92, 3.22]          [2.89, 3.21]
4                   [3.81, 4.17]          [4.23, 4.23]
The P_silent value range of the 6 silent carriers among all 500 samples is [0.39, 0.65].
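As an illustration of how the Table 1 ranges could be used to assign a copy number, the sketch below performs a simple interval lookup. The dictionary and function names are hypothetical, and returning `None` for a value that falls between ranges is an assumption of this sketch; the text does not say how such values are handled.

```python
# Ranges transcribed from Table 1: copy number -> (low, high), inclusive.
P1_RANGES = {
    0: (0.0, 0.003),
    1: (0.95, 1.12),
    2: (1.86, 2.21),
    3: (2.92, 3.22),
    4: (3.81, 4.17),
}
P2_RANGES = {
    0: (0.0, 0.002),
    1: (0.83, 1.15),
    2: (1.97, 2.16),
    3: (2.89, 3.21),
    4: (4.23, 4.23),
}


def copy_number_from_p(p_value, ranges):
    """Return the copy number whose range contains p_value, else None."""
    for copy_number, (low, high) in ranges.items():
        if low <= p_value <= high:
            return copy_number
    return None  # assumption: values outside every range are left uncalled
```

A p1 value of 2.11 would map to an SMN1 copy number of 2 under this lookup, while a value such as 1.5 lies in no range and stays uncalled.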
In a specific implementation, the step in S3 of calculating the copy number value p1 of exon 7 and exon 8 of the SMN1 gene and the copy number value p2 of exon 7 and exon 8 of the SMN2 gene for each test sample comprises:
S301, counting the total coverage of exon 7 and exon 8 of the SMN1 gene and SMN2 gene of a single test sample, and correcting it with the median of the corrected total coverage of exon 7 and exon 8 of the SMN1 gene and SMN2 gene from S201 to obtain the corrected total coverage of exon 7 and exon 8 of the SMN1 gene and SMN2 gene of the single test sample;
counting the coverage of the 5 control intervals and the coverage of the 3 point mutations, and correcting the control-interval coverage with the median of the corrected mean coverage of the control intervals from S202 to obtain the corrected mean coverage of the 5 control intervals;
obtaining the corrected coverage of the 3 point mutations; the coverage of the 3 point mutations includes the coverage of the c.840C>T site on exon 7, the coverage of the c.*239G>A site on exon 8, and the coverage of the g.27134T>G site on intron 7, and the corrected coverage of the 3 point mutations is obtained by correcting with the median of the corrected coverage of the 3 point mutations from S203;
S302, calculating the ratio values of the corrected coverage of the SMN1 gene and SMN2 gene on exon 7 and exon 8;
S303, calculating the copy number value p1 of exon 7 and exon 8 of the SMN1 gene and the copy number value p2 of exon 7 and exon 8 of the SMN2 gene of the single test sample.
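The correction in S301 can be sketched as follows. The text does not spell out the arithmetic of "correcting by the median", so this sketch assumes it means dividing the test sample's raw coverage by the corresponding median obtained in S2; that assumption, the function names, and the dictionary keys are all hypothetical.

```python
from statistics import median


def batch_medians(batch):
    """batch maps a quantity name (e.g. an exon total or a control mean)
    to the list of corrected values obtained for all samples in S2."""
    return {name: median(values) for name, values in batch.items()}


def correct_test_sample(raw, medians):
    """Divide each raw coverage of the test sample by its reference median.

    Assumption: 'correction' is a plain ratio to the reference median.
    """
    corrected = {}
    for name, value in raw.items():
        reference = medians[name]
        if reference <= 0:
            raise ValueError(f"reference median for {name!r} must be positive")
        corrected[name] = value / reference
    return corrected
```

Under this assumption a single test sample needs only the stored medians from S2, not the full batch, which is what allows the method to run on one sample at a time.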
The copy number value p1 of exon 7 and exon 8 of the SMN1 gene and the copy number value p2 of exon 7 and exon 8 of the SMN2 gene of 500 test samples with known actual SMN gene copy numbers were calculated according to the above steps, giving the copy numbers of the SMN1 gene and SMN2 gene and the SMN1 2+0 silent carrier status, as shown in Table 2.
Table 2. Copy numbers of the SMN1 and SMN2 genes and SMN1 2+0 silent carrier status of 15 test samples

| Sample | SMN1 copy number | SMN2 copy number | p1 value | p2 value | p_silent value | Silent carrier status |
| --- | --- | --- | --- | --- | --- | --- |
| Sample 1 | 4 | 1 | 4.17 | 1.15 | 0 | No |
| Sample 2 | 4 | 0 | 3.81 | 0.006 | 0 | No |
| Sample 3 | 3 | 2 | 3.22 | 2.16 | 0 | No |
| Sample 4 | 3 | 1 | 2.92 | 0.83 | 0 | No |
| Sample 5 | 3 | 0 | 3.16 | 0.007 | 0.002 | No |
| Sample 6 | 2 | 2 | 1.95 | 2.07 | 0 | No |
| Sample 7 | 2 | 3 | 2.11 | 3.13 | 0 | No |
| Sample 8 | 2 | 1 | 2.21 | 0.89 | 0.47 | Yes |
| Sample 9 | 2 | 3 | 1.86 | 3.11 | 0.51 | Yes |
| Sample 10 | 1 | 3 | 0.98 | 3.21 | 0 | No |
| Sample 11 | 1 | 2 | 0.95 | 1.97 | 0 | No |
| Sample 12 | 1 | 1 | 1.12 | 1.09 | 0 | No |
| Sample 13 | 0 | 4 | 0 | 4.23 | 0 | No |
| Sample 14 | 0 | 3 | 0.003 | 2.89 | 0.013 | No |
| Sample 15 | 0 | 2 | 0 | 2.01 | 0.001 | No |
As can be seen from the above table, the SMN gene copy numbers of the 15 test samples calculated by this method are the same as the known actual SMN gene copy numbers, and the SMN1 2+0 silent carrier status is consistent with the RFLP detection results. The method provided by the invention can accurately detect the SMN gene copy number of a single sample and, at the same time, determine the sample's SMN1 2+0 silent carrier status.
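The silent carrier decision stated in claim 1 can be sketched as a three-way rule. The function name and the returned labels are hypothetical; the P_silent range is passed in rather than fixed, since this example reports [0.39, 0.65] for its 6 verified carriers while claim 6 defines the range with an upper bound of 5000.

```python
def silent_carrier_status(p_silent, smn1_copy_number, p_silent_range):
    """Classify a single test sample.

    p_silent: corrected coverage at the g.27134T>G site of the test sample.
    smn1_copy_number: SMN1 copy number already called for the sample.
    p_silent_range: (low, high) P_silent range derived from verified carriers.
    """
    low, high = p_silent_range
    in_range = low <= p_silent <= high
    if in_range and smn1_copy_number == 2:
        return "silent carrier"
    if in_range:
        return "suspected silent carrier"
    return "non-silent carrier"
```

With the range (0.39, 0.65), Sample 8 of Table 2 (p_silent 0.47, SMN1 copy number 2) is classified as a silent carrier, and Sample 5 (p_silent 0.002, SMN1 copy number 3) as a non-silent carrier.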
It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without any inventive step, are within the scope of protection of the invention. Unless defined otherwise, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs.
The use of "first," "second," and similar terms in the description and claims of the present application does not denote any order, quantity, or importance; these terms are used only to distinguish one element from another. Similarly, the singular forms "a," "an," and "the" do not denote a limitation of quantity, but rather denote the presence of at least one, unless the context clearly dictates otherwise.

Claims (8)

1. A method for detecting single-sample SMN gene copy number in WES data, comprising:
S1, collecting, from different batches of WES data, negative samples with known actual SMN gene copy number and positive samples with known actual SMN gene copy number, and searching the whole-exome BED intervals for control intervals highly correlated with the SMN gene copy number;
S2, correcting batch effects between the negative samples and the positive samples using the reads coverage of the control intervals; defining the negative samples with known actual SMN gene copy number and the positive samples with known actual SMN gene copy number from the different batches of WES data as all samples; and calculating the P1 value distribution range at each corresponding SMN1 gene copy number and the P2 value distribution range at each corresponding SMN2 gene copy number for all samples;
counting the P_silent value distribution range of the corrected coverage of the g.27134T>G site on intron 7 for those samples, among all samples, that have been verified to be silent carriers;
S3, calculating the p1 value of exon 7 and exon 8 of the SMN1 gene and the p2 value of exon 7 and exon 8 of the SMN2 gene of a single test sample, and determining the copy numbers of the SMN1 gene and SMN2 gene corresponding to the p1 value and p2 value of this step according to the P1 value and P2 value distribution ranges calculated in S2;
counting the coverage p_silent value of the g.27134T>G site on intron 7 of the single test sample; and determining the silent carrier status of the single test sample according to the p_silent value and the SMN1 gene copy number of the single test sample:
when the p_silent value is within the P_silent value distribution range calculated in S2 and the SMN1 gene copy number of the single test sample is 2, determining that the single test sample is a silent carrier;
when the p_silent value is within the P_silent value distribution range calculated in S2 but the SMN1 gene copy number of the single test sample is not 2, determining that the single test sample is a suspected silent carrier;
and in all other cases, determining that the single test sample is a non-silent carrier.
2. The method for detecting single-sample SMN gene copy number in WES data according to claim 1, wherein the step of finding the control intervals in S1 comprises:
S101, verifying the actual copy numbers of the SMN1 gene and SMN2 gene of all samples using an MLPA platform, and processing the samples with a bioinformatics analysis pipeline to obtain BAM files;
S102, pre-screening the BED intervals of two-copy genes, and counting the coverage of all samples in the whole-exome BED intervals;
S103, correcting the coverage of all samples to 100X to obtain the corrected coverage of the samples;
S104, calculating correlation and variance from the corrected coverage of all samples, and selecting BED intervals with good correlation and low variance as control intervals.
3. The method for detecting single-sample SMN gene copy number in WES data according to claim 2, wherein the control intervals are the top 5 BED intervals with good correlation and low variance.
4. The method for detecting single-sample SMN gene copy number in WES data according to claim 3, wherein step S2 comprises:
S201, counting and correcting the total coverage of exon 7 and exon 8 of the SMN1 gene and SMN2 gene of all samples to obtain the corrected total coverage of exon 7 and exon 8 of the SMN1 gene and SMN2 gene;
S202, counting and correcting the total coverage of all samples in the 5 control intervals to obtain the corrected mean coverage of the control intervals;
S203, counting and correcting the coverage of the 3 point mutations of all samples to obtain the corrected coverage of the 3 point mutations; the coverage of the 3 point mutations includes the coverage of the c.840C>T site on exon 7, the coverage of the c.*239G>A site on exon 8, and the coverage of the g.27134T>G site on intron 7; calculating the ratio values of the corrected coverage of the SMN1 gene on exon 7 and exon 8; calculating the ratio values of the corrected coverage of the SMN2 gene on exon 7 and exon 8;
S204, calculating the copy number value p_e7_s1 of exon 7 and the copy number value p_e8_s1 of exon 8 of the SMN1 gene from the corrected total coverage of exon 7 and exon 8 of the SMN1 gene and SMN2 gene, the corrected mean coverage of the control intervals, and the ratio values; calculating the copy number value p_e7_s2 of exon 7 and the copy number value p_e8_s2 of exon 8 of the SMN2 gene; calculating the p1 value from the p_e7_s1 value and the p_e8_s1 value; calculating the p2 value from the p_e7_s2 value and the p_e8_s2 value; the distribution range of the p1 values of all samples, compiled by corresponding copy number, is P1, and the distribution range of the p2 values of all samples, compiled by corresponding copy number, is P2.
5. The method for detecting single-sample SMN gene copy number in WES data according to claim 4, wherein the correction is performed using the corresponding median coverage within the batch.
6. The method for detecting single-sample SMN gene copy number in WES data according to claim 5, wherein the p1 value and the p2 value of each sample among all samples are calculated, and the p1 values and p2 values of all samples are then compiled by corresponding copy number; the calculation is as follows:
the ratio value of the SMN1 gene on exon 7 and the p_e7_s1 value are calculated as:
ratio_e7_s1=rc_e7_s1/(rc_e7_s1+rc_e7_s2);
cn_e7_s1=rc_e7_s1_total/rc_control;
cn_e7_s2=rc_e7_s2_total/rc_control;
p_e7_s1=ratio_e7_s1*(cn_e7_s1+cn_e7_s2)*2;
the ratio value of the SMN1 gene on exon 8 and the p_e8_s1 value are calculated as:
ratio_e8_s1=rc_e8_s1/(rc_e8_s1+rc_e8_s2);
cn_e8_s1=rc_e8_s1_total/rc_control;
cn_e8_s2=rc_e8_s2_total/rc_control;
p_e8_s1=ratio_e8_s1*(cn_e8_s1+cn_e8_s2)*2;
the p1 value of the SMN1 gene is calculated as:
p1=(p_e7_s1+p_e8_s1)/2
the ratio value of the SMN2 gene on exon 7 and the p_e7_s2 value are calculated as:
ratio_e7_s2=rc_e7_s2/(rc_e7_s1+rc_e7_s2);
p_e7_s2=ratio_e7_s2*(cn_e7_s1+cn_e7_s2)*2;
the ratio value of the SMN2 gene on exon 8 and the p_e8_s2 value are calculated as:
ratio_e8_s2=rc_e8_s2/(rc_e8_s1+rc_e8_s2);
p_e8_s2=ratio_e8_s2*(cn_e8_s1+cn_e8_s2)*2;
the p2 value of the SMN2 gene is calculated as:
p2=(p_e7_s2+p_e8_s2)/2
the formula for the p_silent value is:
p_silent=[corrected coverage of the g.27134T>G site]
the formula for the P_silent value is:
P_silent=[min{p_silent_sample1,p_silent_sample2,...,p_silent_sampleN},5000]
the variable names in the formulas have the following meanings:
rc_e7_s1: corrected coverage of the c.840C>T site on exon 7 of SMN1, corrected by the median within the batch,
rc_e8_s1: corrected coverage of the c.*239G>A site on exon 8 of SMN1, corrected by the median within the batch,
rc_e7_s2: corrected coverage of the c.840C>T site on exon 7 of SMN2, corrected by the median within the batch,
rc_e8_s2: corrected coverage of the c.*239G>A site on exon 8 of SMN2, corrected by the median within the batch,
rc_control: coverage of the control intervals, corrected by the median within the batch,
rc_e7_s1_total: corrected total coverage of SMN1 on exon 7,
rc_e8_s1_total: corrected total coverage of SMN1 on exon 8,
rc_e7_s2_total: corrected total coverage of SMN2 on exon 7,
rc_e8_s2_total: corrected total coverage of SMN2 on exon 8,
cn_e7_s1: copy number coefficient of SMN1 on exon 7,
cn_e8_s1: copy number coefficient of SMN1 on exon 8,
cn_e7_s2: copy number coefficient of SMN2 on exon 7,
cn_e8_s2: copy number coefficient of SMN2 on exon 8,
ratio_e7_s1: ratio value of SMN1 on exon 7,
ratio_e8_s1: ratio value of SMN1 on exon 8,
ratio_e7_s2: ratio value of SMN2 on exon 7,
ratio_e8_s2: ratio value of SMN2 on exon 8,
p_e7_s1: copy number probability value of SMN1 on exon 7,
p_e8_s1: copy number probability value of SMN1 on exon 8,
p_e7_s2: copy number probability value of SMN2 on exon 7,
p_e8_s2: copy number probability value of SMN2 on exon 8,
p1: probability value of the corresponding SMN1 gene copy number in a single sample,
p2: probability value of the corresponding SMN2 gene copy number in a single sample,
P1: distribution range of the p1 values of all samples, compiled by corresponding copy number,
P2: distribution range of the p2 values of all samples, compiled by corresponding copy number,
p_silent: corrected coverage of the g.27134T>G site in a single sample,
P_silent: the silent carrier threshold distribution range, running from the minimum of all sample p_silent values after outliers are excluded up to 5000.
7. The method for detecting single-sample SMN gene copy number in WES data according to claim 6, wherein the step in S3 of calculating the copy number value p1 of exon 7 and exon 8 of the SMN1 gene and the copy number value p2 of exon 7 and exon 8 of the SMN2 gene of the single test sample comprises:
S301, separately counting the total coverage of exon 7 and exon 8 of the SMN1 gene and SMN2 gene of the single test sample, the coverage of the 5 control intervals, and the coverage of the 3 point mutations, and correcting each to obtain the corrected total coverage of exon 7 and exon 8 of the SMN1 gene and SMN2 gene of the single test sample, the corrected mean coverage of the 5 control intervals, and the corrected coverage of the 3 point mutations; the coverage of the 3 point mutations includes the coverage of the c.840C>T site on exon 7, the coverage of the c.*239G>A site on exon 8, and the coverage of the g.27134T>G site on intron 7;
S302, calculating the ratio values of the corrected coverage of the SMN1 gene and SMN2 gene on exon 7 and exon 8 from the corrected total coverage of exon 7 and exon 8 of the SMN1 gene and SMN2 gene of the single test sample, the corrected mean coverage of the 5 control intervals, and the corrected coverage of the 3 point mutations;
S303, calculating the copy number values p1 and p2 of exon 7 and exon 8 of the SMN1 gene and SMN2 gene of the single test sample.
8. The method for detecting single-sample SMN gene copy number in WES data according to claim 7, wherein:
the total coverage of exon 7 and exon 8 of the SMN1 gene and SMN2 gene of the single test sample is corrected using the median of the corrected total coverage of exon 7 and exon 8 of the SMN1 gene and SMN2 gene described in S201;
the coverage of the 5 control intervals of the single test sample is corrected using the median of the corrected mean coverage of the control intervals described in S202;
the coverage of the 3 point mutations of the single test sample is corrected using the median of the corrected coverage of the 3 point mutations described in S203.
CN202011107940.2A 2020-10-16 2020-10-16 Method for detecting single sample SMN gene copy number in WES data Active CN112435710B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011107940.2A CN112435710B (en) 2020-10-16 2020-10-16 Method for detecting single sample SMN gene copy number in WES data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011107940.2A CN112435710B (en) 2020-10-16 2020-10-16 Method for detecting single sample SMN gene copy number in WES data

Publications (2)

Publication Number Publication Date
CN112435710A true CN112435710A (en) 2021-03-02
CN112435710B CN112435710B (en) 2024-05-03

Family

ID=74694965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011107940.2A Active CN112435710B (en) 2020-10-16 2020-10-16 Method for detecting single sample SMN gene copy number in WES data

Country Status (1)

Country Link
CN (1) CN112435710B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117153249B (en) * 2023-10-26 2024-02-02 北京华宇亿康生物工程技术有限公司 Methods, devices and media for detecting SMN gene copy number variation

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104762398A (en) * 2015-04-17 2015-07-08 代苒 Method for detecting spinal muscular atrophy virulence gene
CN108048548A (en) * 2017-11-07 2018-05-18 北京华瑞康源生物科技发展有限公司 People's spinal muscular atrophy Disease-causing gene copy number detects PCR kit for fluorescence quantitative
WO2018117986A1 (en) * 2016-12-23 2018-06-28 Leader Medical Genetics And Genomics, Co., Ltd. A method for detecting a copy number of smn1 gene
US20190066842A1 (en) * 2016-03-09 2019-02-28 Baylor College Of Medicine A novel algorithm for smn1 and smn2 copy number analysis using coverage depth data from next generation sequencing
CN110699436A (en) * 2018-07-10 2020-01-17 天津华大医学检验所有限公司 Method and system for determining whether number seven exon deletion exists in SMN1 gene of sample to be detected

Also Published As

Publication number Publication date
CN112435710B (en) 2024-05-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant