CN117153249B - Methods, devices and media for detecting SMN gene copy number variation - Google Patents

Methods, devices and media for detecting SMN gene copy number variation

Info

Publication number
CN117153249B
Authority
CN
China
Prior art keywords
amplicon
depth
gene
copy number
smn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311401218.3A
Other languages
Chinese (zh)
Other versions
CN117153249A (en)
Inventor
钟影
张倩倩
李宁
刘会涛
李翔
辛忠涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Pinfeng Medical Technology Co ltd
Co Health Beijing Laboratories Co ltd
Original Assignee
Shanghai Pinfeng Medical Technology Co ltd
Co Health Beijing Laboratories Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Pinfeng Medical Technology Co ltd, Co Health Beijing Laboratories Co ltd filed Critical Shanghai Pinfeng Medical Technology Co ltd
Priority to CN202311401218.3A priority Critical patent/CN117153249B/en
Publication of CN117153249A publication Critical patent/CN117153249A/en
Application granted granted Critical
Publication of CN117153249B publication Critical patent/CN117153249B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search

Abstract

The invention relates to a method, device, and medium for detecting SMN gene copy number variation. The method comprises the following steps: comparing pre-processed sequencing data of a DNA sample to be tested with the sequencing data of a reference genome to obtain comparison result data; based on the comparison result data, homogenizing the amplicon coverage depth within the DNA sample to be tested to obtain the amplicon depth for the SMN gene; determining a reference region for correcting the amplicon depth; constructing a control set based on the reference region and copy-number-normal samples, the control set being used to determine a depth ratio threshold corresponding to the SMN gene copy number; and obtaining SMN gene copy number variation data for the DNA sample to be tested based on the exon copy number ratios of the SMN1 gene and the SMN2 gene and on the depth ratio threshold. The invention can obtain accurate detection results of SMN gene copy number variation efficiently and at low cost.

Description

Methods, devices and media for detecting SMN gene copy number variation
Technical Field
The present invention relates generally to biological information processing, and in particular, to methods, computing devices, and computer storage media for detecting SMN gene copy number variation.
Background
Copy number variation (CNV) refers to the deletion or amplification of DNA fragments of at least 1 kbp relative to the reference genome.
Conventional methods for detecting SMN gene copy number variation are based, for example, on targeted capture sequencing or whole genome sequencing, and few documents or patents propose methods for analyzing SMN gene copy number from multiplex polymerase chain reaction (PCR) amplicon data. Compared with the multiplex amplicon technique, the targeted capture technique is more expensive and more complex to operate. The multiplex amplicon technique is efficient, economical, and simple; it can detect multiple genotypes simultaneously in the same reaction tube, significantly saves detection time and cost, and can meet clinical requirements. However, multiplex PCR often involves a large number of primers, which tend to interact and affect one another's amplification efficiency. External conditions, such as different batches and different PCR amplicons, also affect amplification efficiency and therefore the final result, making the detection of SMN gene copy number variation inaccurate.
Therefore, the traditional method for detecting SMN gene copy number variation based on the targeted capture sequencing technology and the whole genome sequencing technology has high technical cost and complex operation. The method for detecting the SMN gene copy number variation based on the multiplex amplicon technology has the characteristics of high efficiency, economy and simplicity, but is difficult to obtain an accurate detection result.
In summary, the traditional methods for detecting SMN gene copy number variation have the following shortcoming: it is difficult to obtain accurate detection results regarding SMN gene copy number variation efficiently and at low cost.
Disclosure of Invention
The invention provides a method, a computing device and a computer storage medium for detecting SMN gene copy number variation, which can obtain accurate detection results about the SMN gene copy number variation efficiently and at low cost.
According to a first aspect of the present invention, there is provided a method for detecting SMN gene copy number variation. The method comprises the following steps: comparing pre-processed sequencing data of a DNA sample to be tested with the sequencing data of a reference genome to obtain comparison result data, wherein the sequencing data are obtained by an amplicon sequencing technology based on predetermined SMN gene amplification primers, and the predetermined SMN gene amplification primers are used for simultaneously amplifying the exon 7 or exon 8 region of the SMN1 gene and the SMN2 gene; based on the comparison result data, homogenizing the amplicon coverage depth within the DNA sample to be tested to obtain the amplicon depth for the SMN gene; determining a reference region for correcting the amplicon depth; constructing a control set based on the reference region and copy-number-normal samples, the control set being used to determine a depth ratio threshold corresponding to the SMN gene copy number; and obtaining SMN gene copy number variation data for the DNA sample to be tested based on the exon copy number ratios of the SMN1 gene and the SMN2 gene and on the depth ratio threshold.
According to a second aspect of the present invention there is also provided a computing device, the device comprising: a memory configured to store one or more computer programs; and a processor coupled to the memory and configured to execute one or more programs to cause the apparatus to perform the method of the first aspect of the invention.
According to a third aspect of the present invention, there is also provided a non-transitory computer-readable storage medium. The non-transitory computer readable storage medium has stored thereon machine executable instructions that, when executed, cause a machine to perform the method of the first aspect of the invention.
In some embodiments, obtaining SMN gene copy number variation data for a DNA sample to be tested comprises: determining the SMN gene copy number of the DNA sample to be tested based on the determined depth ratio threshold corresponding to the SMN gene copy number; correcting the determined copy number of the SMN gene with respect to the sample of the DNA to be tested based on the ratio of the copy numbers of exons of the SMN1 gene and the SMN2 gene so as to obtain variation data of the copy number of the SMN gene with respect to the sample of the DNA to be tested.
In some embodiments, the reference region is an amplicon region whose amplicon depth varies consistently with the amplicon depth of the SMN1 gene and the SMN2 gene.
In some embodiments, constructing the control set comprises: screening amplicon regions of the SMN gene amplicon having a coefficient of variation that satisfies a predetermined condition to determine a reference region for correcting amplicon depth; and based on the determined internal reference regions, homogenizing amplicon depths of a predetermined number of normal samples, respectively, for constructing a control set.
In some embodiments, determining the reference region for correcting amplicon depth comprises: obtaining reference sequencing data for different batches of samples in which the SMN1 gene and the SMN2 gene are both known to have a copy number of 2, so as to construct a coverage depth matrix based on the reference sequencing data; calculating, based on the coverage depth matrix, a correlation coefficient of amplicon depth between each amplicon and the amplicons of the SMN gene; determining candidate amplicon regions based on the correlation coefficient and the coefficient of variation of the amplicon depth; determining whether the coefficient of variation of a candidate amplicon region satisfies a predetermined condition; and in response to determining that the coefficient of variation of the candidate amplicon region satisfies the predetermined condition, determining that amplicon region as a reference region for correcting the amplicon depth.
In some embodiments, determining the candidate amplicon regions comprises: screening, as candidate amplicon regions, a plurality of amplicon regions whose correlation coefficient with the amplicon depth of the SMN1 gene and SMN2 gene amplicons is greater than or equal to a first correlation coefficient threshold and whose amplicon-depth coefficient of variation is close to that of the SMN gene amplicons.
In some embodiments, determining the depth ratio threshold corresponding to the SMN gene copy number comprises: calculating a corrected depth median for each amplicon region of the control set to calculate a second ratio between the amplicon depth of the SMN gene for each sample in the control set and the corrected depth median; and separately counting the distribution range of the second ratio of the different SMN gene copy numbers so as to determine a depth ratio threshold corresponding to the SMN gene copy number.
In some embodiments, correcting the calculated SMN gene copy number for the test DNA sample comprises: selecting a sample with the same ratio of the SMN1 gene to the SMN2 gene on the same exon; grouping the selected samples according to the same ratio; and calculating a corresponding copy number ratio threshold for each group of identical ratios, so as to correct the determined SMN gene copy number for the DNA sample to be tested based on the corresponding copy number ratio threshold.
In some embodiments, the predetermined SMN gene amplification primers are a pair of primers for simultaneously amplifying the exon 7 or exon 8 region of the SMN1 gene and the SMN2 gene, and the amplified fragment comprises a single nucleotide polymorphism site that distinguishes the SMN1 gene from the SMN2 gene.
In some embodiments, constructing the control set comprises: determining whether the reference region comprises a plurality of amplicons; in response to determining that the reference region includes a plurality of amplicons, calculating an average amplicon depth for the reference region; correcting the amplicon depth of the SMN gene of the normal sample using the average amplicon depth of the reference region to generate an amplicon depth of the SMN gene corrected via the reference region; and constructing a control set based on the amplicon depth of the SMN gene corrected via the reference region.
The summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the invention, nor is it intended to be used to limit the scope of the invention.
Drawings
FIG. 1 shows a schematic diagram of a system for implementing a method for detecting SMN gene copy number variation according to an embodiment of the present invention.
FIG. 2 shows a flow chart of a method for detecting SMN gene copy number variation according to an embodiment of the present invention.
FIG. 3 shows a flow chart of a method for determining an internal reference region for correcting amplicon depth according to an embodiment of the invention.
FIG. 4 shows a flow chart of a method for constructing a control set according to an embodiment of the invention.
FIG. 5 shows a flowchart of a method for detecting SMN gene copy number variation according to an embodiment of the present invention.
FIG. 6 schematically shows a block diagram of an electronic device suitable for implementing embodiments of the invention.
Like or corresponding reference characters indicate like or corresponding parts throughout the several views.
Detailed Description
Preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present invention are illustrated in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The term "comprising" and variations thereof as used herein are open ended, i.e., mean "including but not limited to". The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same objects.
As described above, the conventional method for detecting SMN gene copy number variation based on the targeted capture sequencing technology and the whole genome sequencing technology is technically costly and complicated to operate. The method for detecting the SMN gene copy number variation based on the multiplex amplicon technology has the characteristics of high efficiency, economy and simplicity, but is difficult to obtain an accurate detection result. Therefore, the conventional method for detecting the copy number variation of the SMN gene has the following defects: it is difficult to obtain accurate detection results regarding SMN gene copy number variation efficiently at low cost.
To at least partially address one or more of the above problems, as well as other potential problems, exemplary embodiments of the present invention provide a method of detecting copy number variation based on amplicon sequencing data. The method compares sequencing data of a sample, obtained by an amplicon sequencing technology based on predetermined SMN gene amplification primers (which simultaneously amplify the exon 7 or exon 8 region of the SMN1 gene and the SMN2 gene), with the sequencing data of a reference genome to obtain comparison result data. In this way, the invention obtains sequencing data in which the SMN1 gene and the SMN2 gene behave consistently at exon 7 or exon 8. Because the invention detects SMN gene copy number variation using a multiplex amplicon technology, it is efficient and low in cost. In addition, by homogenizing the amplicon coverage depth within the DNA sample to be tested based on the comparison result data to obtain the amplicon depth for the SMN gene, the invention reduces the influence of the amount of sequencing data on the depth of each amplicon region. Furthermore, the invention determines a reference region for correcting amplicon depth; constructs a control set based on the reference region and copy-number-normal samples, the control set being used to determine a depth ratio threshold corresponding to the SMN gene copy number; and obtains SMN gene copy number variation data for the DNA sample to be tested based on the exon copy number ratios of the SMN1 gene and the SMN2 gene and on the depth ratio threshold. Thus, the present invention can obtain an accurate detection result regarding the SMN gene copy number.
FIG. 1 shows a schematic diagram of a system 100 for implementing a method for detecting SMN gene copy number variation, according to an embodiment of the present invention. As shown in FIG. 1, the system 100 includes a computing device 110 and a sequencing device 130. In some embodiments, the computing device 110 and the sequencing device 130 exchange data directly or via a network (not shown).
The sequencing device 130 is used, for example, to provide sequencing data for a DNA sample to be tested. The sequencing device 130 obtains the sequencing data for the DNA sample to be tested by an amplicon sequencing technology using predetermined SMN gene amplification primers. In some embodiments, the predetermined SMN gene amplification primers are a pair of primers for simultaneously amplifying the exon 7 or exon 8 region of the SMN1 gene and the SMN2 gene, and the amplified fragment comprises a single nucleotide polymorphism site that distinguishes the SMN1 gene from the SMN2 gene.
With respect to computing device 110, it detects, for example, SMN gene copy number variations. Specifically, the computing device 110 is configured to compare the pre-processed sequencing data on the DNA sample to be tested with the sequencing data of the reference genome, so as to obtain comparison result data; based on the comparison result data, homogenizing the amplicon coverage depth in the DNA sample to be tested so as to obtain the amplicon depth for the SMN gene; and determining a reference region for correcting amplicon depth. The computing device 110 is further configured to construct a control set for determining a depth ratio threshold corresponding to the SMN gene copy number based on the reference region and the copy number normal sample; and obtaining SMN gene copy number variation data about the DNA sample to be tested based on the exon copy number ratios and depth ratio thresholds of the SMN1 gene and the SMN2 gene.
In some embodiments, computing device 110 may have one or more processing units, including special purpose processing units such as GPUs, FPGAs, and ASICs, as well as general purpose processing units such as CPUs. In addition, one or more virtual machines may also be running on each computing device. The computing device 110 includes, for example: a comparison result data obtaining unit 112, an amplicon depth obtaining unit 114, a reference region determining unit 116, a depth ratio threshold determining unit 118, and an SMN gene copy number variation data obtaining unit 120. These units may be configured on one or more computing devices 110.
The comparison result data obtaining unit 112 is used to compare the pre-processed sequencing data for the DNA sample to be tested with the sequencing data of the reference genome to obtain comparison result data, wherein the sequencing data is obtained by an amplicon sequencing technology based on predetermined SMN gene amplification primers for simultaneously amplifying the exon 7 or exon 8 region of the SMN1 gene and the SMN2 gene.
The amplicon depth obtaining unit 114 is used to homogenize, based on the comparison result data, the amplicon coverage depth within the DNA sample to be tested to obtain the amplicon depth for the SMN gene.
The reference region determining unit 116 is used to determine the reference region for correcting the amplicon depth.
The depth ratio threshold determining unit 118 is used to construct a control set based on the reference region and copy-number-normal samples, the control set being used to determine a depth ratio threshold corresponding to the SMN gene copy number.
The SMN gene copy number variation data obtaining unit 120 is used to obtain the SMN gene copy number variation data for the DNA sample to be tested based on the exon copy number ratios of the SMN1 gene and the SMN2 gene and on the depth ratio threshold.
FIG. 2 shows a flowchart of a method 200 for detecting SMN gene copy number variation according to an embodiment of the present invention. It should be appreciated that the method 200 may be performed, for example, at the electronic device 600 depicted in FIG. 6, or at the computing device 110 depicted in FIG. 1. It should be appreciated that method 200 may also include additional actions not shown and/or may omit actions shown; the scope of the invention is not limited in this respect.
At step 202, computing device 110 compares the pre-processed sequencing data for the DNA sample to be tested with the sequencing data of the reference genome to obtain comparison result data, the sequencing data being obtained by an amplicon sequencing technology based on predetermined SMN gene amplification primers for simultaneously amplifying the exon 7 or exon 8 region of the SMN1 gene and the SMN2 gene.
The predetermined SMN gene amplification primers are a pair of primers for simultaneously amplifying the exon 7 or exon 8 region of the SMN1 gene and the SMN2 gene, and the amplified fragments contain single nucleotide polymorphism (SNP) sites (exon 7: c.840C>T; exon 8: c.*239G>A) that distinguish the SMN1 gene from the SMN2 gene.
A method of obtaining the pre-processed sequencing data for the DNA sample to be tested includes, for example: constructing a library for the DNA sample to be tested with an amplicon library-construction sequencing kit (which contains the predetermined SMN gene amplification primers), obtaining high-throughput sequencing data, and obtaining high-quality pre-processed sequencing data that meets the kit's basic quality control requirements. For example, the BED file of the amplicons is obtained based on the physical positions of the amplicon panel primers (BED stands for Browser Extensible Data, an annotation format in which each row gives the position of an amplicon on the genome together with its annotation information). Then, the sequencing data is compared with the sequencing data of the reference genome using a bioinformatics method to obtain the comparison result data (a BAM file).
It should be appreciated that in some embodiments, the computing device 110 also obtains sequencing data of samples with known SMN gene copy numbers, which is used to determine the decision thresholds in subsequent steps.
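As a rough illustration of the amplicon BED file mentioned above, the following sketch parses tab-separated BED lines into amplicon records. The `Amplicon` type, the `parse_bed` helper, and the coordinates and amplicon names are hypothetical placeholders, not the kit's actual panel.

```python
from typing import NamedTuple


class Amplicon(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # 0-based, exclusive (BED convention)
    name: str


def parse_bed(text: str) -> list[Amplicon]:
    """Parse BED text with at least four tab-separated columns per line."""
    amplicons = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and optional header or comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name = line.split("\t")[:4]
        amplicons.append(Amplicon(chrom, int(start), int(end), name))
    return amplicons


# Hypothetical coordinates, for illustration only.
EXAMPLE_BED = "chr5\t1000\t1150\tSMN_exon7\nchr5\t2000\t2140\tSMN_exon8\n"
```

Each parsed record gives the genomic interval over which coverage depth would later be counted from the alignment file.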
At step 204, the computing device 110 homogenizes the amplicon coverage depth within the DNA sample to be tested based on the comparison data in order to obtain an amplicon depth for the SMN gene.
For example, from the BAM file of the DNA sample to be tested (the file generated by aligning the high-throughput sequencing data to the reference genome), the computing device 110 counts the coverage depth D of each BED region, as well as the average amplicon depth d_av or the median amplicon depth d_med of the DNA sample to be tested. The amplicon coverage depth within the sample is then homogenized to reduce the influence of the amount of sequencing data on the depth of each amplicon region. For example, the computing device 110 uses the average amplicon depth d_av of the DNA sample to be tested to correct the amplicon coverage depth D, thereby obtaining the amplicon depth d_nor for the SMN gene, i.e., the homogenized amplicon depth.
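The within-sample homogenization can be sketched as follows. The text only says that d_av (or d_med) is used to "correct" D; the sketch assumes this means dividing each amplicon's depth by the sample-wide value, and the function name and example depths are hypothetical.

```python
from statistics import mean, median


def normalize_depths(depths: dict[str, float], use_median: bool = False) -> dict[str, float]:
    """Return d_nor per amplicon, assuming d_nor = D / d_av (or D / d_med)."""
    values = list(depths.values())
    scale = median(values) if use_median else mean(values)
    if scale == 0:
        raise ValueError("sample has no coverage; cannot normalize")
    return {name: d / scale for name, d in depths.items()}


# Hypothetical per-amplicon coverage depths for one sample.
sample_depths = {"SMN_exon7": 100.0, "ref_a": 200.0, "ref_b": 300.0}
d_nor = normalize_depths(sample_depths)
```

Dividing by a sample-wide depth makes samples sequenced to different total depths comparable, which is the stated purpose of this step.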
At step 206, the computing device 110 determines an internal reference region for correcting amplicon depth.
A method for determining an internal reference region for correcting amplicon depth includes, for example: computing device 110 obtains reference sequencing data for different batches of samples in which the SMN1 gene and the SMN2 gene are both known to have a copy number of 2, so as to construct a coverage depth matrix based on the reference sequencing data; calculates, based on the coverage depth matrix, a correlation coefficient of amplicon depth between each amplicon and the amplicons of the SMN gene; determines candidate amplicon regions based on the correlation coefficient and the coefficient of variation of the amplicon depth; determines whether the coefficient of variation of a candidate amplicon region satisfies a predetermined condition; and in response to determining that the coefficient of variation of the candidate amplicon region satisfies the predetermined condition, determines that amplicon region as a reference region for correcting the amplicon depth. The method 300 of determining the reference region for correcting amplicon depth is described in detail below in conjunction with FIG. 3 and is not repeated here.
A method of determining candidate amplicon regions includes, for example: the computing device 110 screens, as candidate amplicon regions, a plurality of amplicon regions whose correlation coefficient with the amplicon depth of the SMN1 gene and SMN2 gene amplicons is greater than or equal to the first correlation coefficient threshold and whose amplicon-depth coefficient of variation is close to that of the SMN gene amplicons.
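The candidate screening described above can be sketched as a filter over a coverage depth matrix. This is a minimal sketch under stated assumptions: the Pearson correlation is used as the correlation coefficient, "close" is modeled as an absolute difference in coefficient of variation, and the thresholds `min_corr` and `max_cv_diff` as well as all depth values are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def coefficient_of_variation(x: list[float]) -> float:
    return pstdev(x) / mean(x)


def candidate_regions(matrix: dict[str, list[float]], smn_name: str,
                      min_corr: float = 0.9, max_cv_diff: float = 0.05) -> list[str]:
    """Keep amplicons that track the SMN amplicon depth and have a similar CV."""
    smn = matrix[smn_name]
    smn_cv = coefficient_of_variation(smn)
    selected = []
    for name, depths in matrix.items():
        if name == smn_name:
            continue
        similar_cv = abs(coefficient_of_variation(depths) - smn_cv) <= max_cv_diff
        if pearson(depths, smn) >= min_corr and similar_cv:
            selected.append(name)
    return selected


# Rows: amplicons; columns: homogenized depths across samples (hypothetical values).
depth_matrix = {
    "SMN_exon7": [1.0, 1.2, 0.8, 1.1],
    "ref_a": [2.0, 2.4, 1.6, 2.2],   # rises and falls with the SMN amplicon
    "ref_b": [1.0, 0.8, 1.2, 0.9],   # moves in the opposite direction
}
candidates = candidate_regions(depth_matrix, "SMN_exon7")
```

Here only `ref_a` passes both criteria, since `ref_b` is negatively correlated with the SMN amplicon.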
At step 208, computing device 110 constructs a control set for determining a depth ratio threshold corresponding to the SMN gene copy number based on the reference region and the copy number normal sample.
As to a method for constructing a control set, it includes, for example: computing device 110 screens amplicon regions of SMN gene amplicons whose coefficients of variation meet a predetermined condition to determine a reference region for correcting amplicon depth; and based on the determined internal reference regions, homogenizing amplicon depths of a predetermined number of normal samples, respectively, for constructing a control set.
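A control set along these lines might be built as in the sketch below. It assumes that correcting by the reference region means dividing the SMN amplicon depth by the mean depth of the reference amplicons; the function names, amplicon names, and depth values are hypothetical.

```python
from statistics import mean


def correct_by_reference(sample: dict[str, float], smn_name: str,
                         ref_names: list[str]) -> float:
    """Assumed correction: SMN amplicon depth / mean depth of the reference amplicons."""
    ref_depth = mean(sample[name] for name in ref_names)
    if ref_depth == 0:
        raise ValueError("reference region has no coverage")
    return sample[smn_name] / ref_depth


def build_control_set(normal_samples: list[dict[str, float]], smn_name: str,
                      ref_names: list[str]) -> list[float]:
    """Reference-corrected SMN depths for a set of copy-number-normal samples."""
    return [correct_by_reference(s, smn_name, ref_names) for s in normal_samples]


# Hypothetical homogenized depths for three copy-number-normal samples.
normals = [
    {"SMN1_exon7": 1.0, "ref_a": 2.0, "ref_b": 2.0},
    {"SMN1_exon7": 1.2, "ref_a": 2.5, "ref_b": 2.3},
    {"SMN1_exon7": 0.9, "ref_a": 1.7, "ref_b": 1.9},
]
control = build_control_set(normals, "SMN1_exon7", ["ref_a", "ref_b"])
```

With a single reference amplicon the mean reduces to that amplicon's depth, so the same function covers both cases.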
Regarding a method for determining copy number threshold intervals corresponding to different SMN gene copy numbers, it includes, for example: computing device 110 calculates a corrected depth median for each amplicon region of the control set to calculate a second ratio between the amplicon depth of the SMN gene for each sample in the control set and the corrected depth median; and separately counting the distribution range of the second ratio of the different SMN gene copy numbers so as to determine a depth ratio threshold corresponding to the SMN gene copy number.
For example, the computing device 110 calculates a corrected depth median for each amplicon region of the control set to calculate a second ratio between the amplicon depth and corrected depth median for the SMN gene for each sample in the control set.
The calculation method of the second ratio is described below in conjunction with equation (1):
R = SMN_D_ref / median(SMN_D_ref)    (1)
In formula (1), R represents the second ratio. SMN_D_ref represents the amplicon depth of the SMN gene of the sample. median(SMN_D_ref) represents the median amplicon depth for each amplicon region of the control set. The amplicon depth of the SMN gene of the sample is divided by this median amplicon depth to generate the second ratio.
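Formula (1) can be written directly in code. In this minimal sketch the function name and the input values (reference-corrected SMN depths of control samples) are hypothetical.

```python
from statistics import median


def second_ratios(smn_d_ref: list[float]) -> list[float]:
    """R = SMN_D_ref / median(SMN_D_ref) for each sample in the control set."""
    med = median(smn_d_ref)
    if med == 0:
        raise ValueError("control set median depth is zero")
    return [d / med for d in smn_d_ref]


# Hypothetical reference-corrected SMN depths of three control samples.
ratios = second_ratios([1.8, 2.0, 2.2])
```

A sample whose corrected depth equals the control median gets R = 1, so R values are directly comparable across amplicon regions.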
The distribution ranges of the second ratios (e.g., R values) for the different SMN gene copy numbers (e.g., 0, 1, 2, 3, ≥4) are counted separately to determine the depth ratio threshold corresponding to each SMN gene copy number. The calculation of the distribution range of the second ratio is described below in conjunction with formula (2):
q = R_mean ± 1.96 × R_sd    (2)
In formula (2), q represents the distribution range of the second ratio. R_mean represents the mean of the second ratio. R_sd represents the standard deviation of the second ratio. It should be appreciated that the depth ratio threshold ranges corresponding to different SMN gene copy numbers may intersect. The intersecting portion is treated as, for example, a gray zone. The invention can periodically optimize the depth ratio thresholds corresponding to the SMN gene copy numbers as the sample set grows.
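The range in formula (2) and the gray-zone idea can be sketched as follows. The sketch assumes the sample standard deviation (the text does not say whether the sample or population form is meant), and the function names and R values per copy number are hypothetical.

```python
from statistics import mean, stdev


def ratio_range(ratios: list[float]) -> tuple[float, float]:
    """Return (low, high) = mean -/+ 1.96 * standard deviation, as in formula (2)."""
    m, s = mean(ratios), stdev(ratios)
    return (m - 1.96 * s, m + 1.96 * s)


def classify(r: float, ranges: dict[int, tuple[float, float]]) -> list[int]:
    """Copy numbers whose range contains r; more than one hit indicates a gray zone."""
    return sorted(cn for cn, (low, high) in ranges.items() if low <= r <= high)


# Hypothetical R values observed for samples with known copy numbers 1 and 2.
ranges = {
    1: ratio_range([0.45, 0.50, 0.55]),
    2: ratio_range([0.90, 1.00, 1.10]),
}
```

For example, `classify(0.5, ranges)` returns `[1]`, while a value that falls inside two overlapping ranges returns both copy numbers and would need the follow-up correction or a manual review.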
At step 210, computing device 110 obtains SMN gene copy number variation data for the DNA sample to be tested based on the exon copy number ratios and depth ratio thresholds of the SMN1 gene and SMN2 gene.
For example, computing device 110 determines a SMN gene copy number for the DNA sample to be tested based on the determined depth ratio threshold corresponding to the SMN gene copy number; and correcting the determined copy number of the SMN gene with respect to the sample of the DNA to be tested based on the ratio of the copy numbers of exons of the SMN1 gene and the SMN2 gene, so as to obtain variation data of the copy number of the SMN gene with respect to the sample of the DNA to be tested.
It will be appreciated that the copy numbers of exon 7 and exon 8 of the SMN1 gene and the SMN2 gene differ from person to person, and thus the ratio of the read numbers of the exon 7 BED regions of the SMN1 gene and the SMN2 gene also differs. The invention applies a multiplex amplicon technology and uses the same predetermined SMN gene amplification primers to simultaneously amplify the exon 7 and exon 8 regions of the SMN gene, so that the amplification efficiency of the same exon of the SMN1 gene and the SMN2 gene is as similar as possible. For example, for exon 7, when the copy number of the SMN1 gene is 2 and the copy number of the SMN2 gene is 2, the ratio of the corrected read numbers of the SMN1 gene and the SMN2 gene is relatively stable, e.g., it fluctuates within a certain range. Thus, the ratio of the corrected read numbers of the SMN1 gene and the SMN2 gene at the same exon can also be used to correct the copy numbers of the SMN1 gene and the SMN2 gene. Based on this principle, the present invention can further correct the exon copy numbers.
A method for correcting the determined SMN gene copy number for a DNA sample to be tested, for example, comprises: computing device 110 picks samples that have the same ratio between the copy numbers of the SMN1 gene and the SMN2 gene on the same exon (e.g., exon 7); grouping the selected samples according to the same ratio; and calculating a corresponding copy number ratio threshold for each group of identical ratios, so as to correct the determined SMN gene copy number for the DNA sample to be tested based on the corresponding copy number ratio threshold.
A method for determining the SMN gene copy number of the DNA sample to be tested includes, for example, the following.
For example, computing device 110 determines the copy number of each amplicon, denoted CN_SMN1_exon7, CN_SMN1_exon8, CN_SMN2_exon7, and CN_SMN2_exon8, respectively. The corresponding copy number ratio thresholds for the SMN1 gene and the SMN2 gene at exon7 and exon8 are determined based on the ratio values R_exon7 and R_exon8, and these thresholds are then used to correct the determined copy numbers CN_SMN1_exon7, CN_SMN1_exon8, CN_SMN2_exon7, and CN_SMN2_exon8 of each amplicon.
For example, if the ratio (of read numbers) of the SMN1 gene to the SMN2 gene at the same exon (exon7 or exon8) is 1:1, it is determined whether the copy numbers at that exon are the same. For example, it is determined whether the copy numbers CN_SMN1_exon7 and CN_SMN2_exon7 of the SMN1 gene and the SMN2 gene at exon7 are the same, and whether the copy numbers CN_SMN1_exon8 and CN_SMN2_exon8 at exon8 are the same. If they are the same, the determined amplicon copy number is taken as the final copy number. If they are different, the copy number of the SMN2 gene multiplied by the corresponding copy number ratio threshold is determined as the final copy number of the SMN1 gene. For example, the final copy number of the SMN1 gene at exon7 is CN_SMN1_exon7_final = CN_SMN2_exon7 × the corresponding copy number ratio threshold.
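The consistency check described above can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and the copy number ratio threshold is treated as a single scalar, which is an assumption, since the description does not state its exact form.

```python
def correct_smn1_copy_number(cn_smn1, cn_smn2, ratio_threshold):
    """Return the final SMN1 copy number for one exon.

    If the two depth-based calls agree, the call is kept. Otherwise the
    SMN1 copy number is re-derived from the SMN2 copy number and the
    copy number ratio threshold (a scalar here, by assumption).
    """
    if cn_smn1 == cn_smn2:
        return cn_smn1
    return cn_smn2 * ratio_threshold

# Hypothetical case: the read ratio indicates 1:1 (threshold value 1.0),
# but the depth-based calls for SMN1 and SMN2 disagree.
final_cn = correct_smn1_copy_number(cn_smn1=3, cn_smn2=2, ratio_threshold=1.0)
print(final_cn)
```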
In the above scheme, the invention makes the amplification efficiency of the same exon of the SMN1 gene and the SMN2 gene as nearly identical as possible, and thereby obtains sequencing data in which the SMN1 gene and the SMN2 gene behave consistently at exon 7 or exon 8. In addition, the invention reduces the influence of the amount of sequencing data on the depth of the amplicon regions, and determines and corrects the SMN gene copy number variation data using the determined internal reference data, the exon copy number ratio, and the depth ratio threshold, so that the detected SMN gene copy number variation data is more accurate. The invention can therefore obtain accurate detection results for SMN gene copy number variation with high efficiency and at low cost.
FIG. 3 shows a flowchart of a method 300 for determining an internal reference region for correcting amplicon depth, according to an embodiment of the invention. It should be appreciated that the method 300 may be performed, for example, at the electronic device 600 depicted in fig. 6, or at the computing device 110 depicted in fig. 1. It should be appreciated that method 300 may also include additional actions not shown and/or may omit actions shown, the scope of the invention being not limited in this respect.
At step 302, computing device 110 obtains reference sequencing data for samples from different batches in which the SMN1 gene and the SMN2 gene are both known to have 2 copies, in order to construct a coverage depth matrix based on the reference sequencing data.
For example, sequencing data are obtained for normal samples from different batches in which the SMN1 gene and the SMN2 gene have both been shown by gold standard methods to have 2 copies, for use in constructing a matrix of the corrected coverage depth of each amplicon. The rows of the coverage depth matrix indicate amplicons, e.g., the total number of amplicons is X (X is a natural number). The columns of the coverage depth matrix indicate samples, e.g., the total number of normal samples from different batches is n (n is a natural number).
At step 304, the computing device 110 calculates a correlation coefficient between amplicons of the SMN gene with respect to amplicon depth based on the coverage depth matrix.
For example, computing device 110 uses an R language program to calculate the Pearson correlation coefficient between pairs of row vectors of the coverage depth matrix, i.e., between the homogenized amplicon depth (D_nor) of the SMN gene and that of each other amplicon, so as to obtain the correlation coefficients for the homogenized amplicon depth D_nor. It should be appreciated that the Pearson correlation coefficient reflects the degree of linear correlation between two amplicons. Its value lies between -1 and 1. A value of 1 indicates a perfect positive correlation between the two amplicons; a value of -1 indicates a perfect negative correlation; a value of 0 indicates that there is no linear correlation between the two amplicons.
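The row-wise correlation can be sketched as follows. The description uses an R language program; this Python version is a stand-in for illustration only, and the amplicon names and depth values are made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Rows are amplicons, columns are samples (homogenized depth D_nor).
# All names and values below are hypothetical.
depth_matrix = {
    "SMN1_exon7": [1.00, 1.10, 0.90, 1.20],
    "amplicon_A": [2.00, 2.20, 1.80, 2.40],  # proportional to the SMN row
    "amplicon_B": [1.20, 0.90, 1.10, 1.00],
}

target = depth_matrix["SMN1_exon7"]
correlations = {
    name: pearson(target, row)
    for name, row in depth_matrix.items()
    if name != "SMN1_exon7"
}
print(correlations)
```

In this toy matrix `amplicon_A` tracks the SMN amplicon closely while `amplicon_B` is negatively correlated with it, so only the former would be a plausible candidate.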
At step 306, the computing device 110 determines candidate amplicon regions based on the correlation coefficient and the coefficient of variation for the amplicon depth.
The coefficient of variation ("CV value"), also referred to as the relative standard deviation, measures the degree of variation among observed values. The coefficient of variation of an amplicon region measures the degree of variation of that amplicon region. It is calculated, for example, as the ratio of the standard deviation to the mean (a relative value). The smaller the CV value of the amplicon region, the more accurate the SMN gene copy number variation that is ultimately determined based on that amplicon region.
For example, computing device 110 screens, as candidate amplicon regions, a plurality of amplicon regions whose correlation coefficient with the homogenized coverage depth of the SMN gene (SMN1 gene and SMN2 gene) amplicons is greater than or equal to a first correlation coefficient threshold (for example, 0.85) and whose coefficient of variation is close to that of the SMN gene amplicon (the CV value of D_nor). The candidate amplicon regions determined by computing device 110 are denoted, for example, MRi.
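A minimal sketch of this screening step is given below. The 0.85 correlation threshold comes from the description; the tolerance `max_cv_gap` that defines "close" CV values, the function names, and all data are assumptions made for illustration.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """CV as sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

def screen_candidates(stats, smn_cv, min_r=0.85, max_cv_gap=0.05):
    """Select amplicon regions by correlation and CV similarity.

    `stats` maps region name -> (correlation with SMN depth, region CV).
    `max_cv_gap` is a hypothetical tolerance; the description only says
    the CV should be "close" to that of the SMN amplicon.
    """
    return [
        name
        for name, (r, cv) in stats.items()
        if r >= min_r and abs(cv - smn_cv) <= max_cv_gap
    ]

# Made-up homogenized depths of the SMN amplicon across three samples.
smn_cv = coefficient_of_variation([0.9, 1.0, 1.1])

# Made-up (correlation, CV) pairs for three other amplicon regions.
region_stats = {
    "MR1": (0.92, 0.11),  # high correlation, similar CV: kept
    "MR2": (0.95, 0.30),  # CV too different: rejected
    "MR3": (0.60, 0.10),  # correlation too low: rejected
}
candidates = screen_candidates(region_stats, smn_cv)
print(candidates)
```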
At step 308, the computing device 110 determines whether the coefficient of variation of the candidate amplicon region satisfies a predetermined condition.
For example, the computing device 110 calculates a first ratio of the amplicon depth of the SMN gene to the amplicon average depth of the candidate amplicon region MRi to determine a coefficient of variation of the candidate amplicon region based on the first ratio. For example, the coefficient of variation of the first ratio is determined as the coefficient of variation of the candidate amplicon region.
The manner of calculating the first ratio is described below in conjunction with equation (3):
D_ref=SMN_D_nor/MRi_D_nor (3)
In formula (3) above, SMN_D_nor represents the amplicon depth of the SMN gene, MRi_D_nor represents the average amplicon depth of the candidate amplicon region MRi, and D_ref represents the first ratio.
At step 310, in response to determining that the coefficient of variation of the candidate amplicon region satisfies the predetermined condition, computing device 110 determines the candidate amplicon region to be a reference region for correcting amplicon depth.
The predetermined condition is satisfied, for example, when the coefficient of variation of the candidate amplicon region is less than or equal to a predetermined coefficient of variation threshold (for example, but not limited to, 10%).
For example, if computing device 110 determines that the CV value of D_ref is within 10%, the candidate amplicon region in question is determined to be the final reference region. The reference regions are, for example, MR_SMN1 and MR_SMN2 (a reference region may comprise a plurality of candidate amplicon regions, and each amplicon may be designated MR_SMN1i or MR_SMN2i, where i = 1, 2, 3, ..., m, and m is a natural number).
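Formula (3) and the 10% criterion can be combined into a short Python sketch. The function names and depth values are hypothetical; only the ratio D_ref = SMN_D_nor / MRi_D_nor and the example threshold of 10% come from the description.

```python
from statistics import mean, stdev

def cv(values):
    """Coefficient of variation: sample standard deviation over mean."""
    return stdev(values) / mean(values)

def first_ratio(smn_depths, candidate_depths):
    """Per-sample D_ref = SMN_D_nor / MRi_D_nor, as in formula (3)."""
    return [s / c for s, c in zip(smn_depths, candidate_depths)]

def select_reference_regions(smn_depths, candidates, max_cv=0.10):
    """Keep candidates whose first ratio has a CV of at most max_cv."""
    selected = {}
    for name, depths in candidates.items():
        ratio_cv = cv(first_ratio(smn_depths, depths))
        if ratio_cv <= max_cv:
            selected[name] = ratio_cv
    return selected

# Made-up homogenized depths across four normal samples.
smn_d_nor = [1.0, 1.2, 0.8, 1.1]
candidate_regions = {
    "MR1": [2.0, 2.4, 1.6, 2.2],   # moves together with the SMN amplicon
    "MR2": [1.0, 0.6, 1.6, 0.55],  # does not follow the SMN amplicon
}
reference = select_reference_regions(smn_d_nor, candidate_regions)
print(reference)
```

A candidate whose depth rises and falls with the SMN amplicon yields a nearly constant ratio and therefore a small CV, which is why `MR1` is retained here.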
The reference region is an amplicon region whose depth is correlated with, and varies consistently with, the amplicon depth of the SMN1 gene and the SMN2 gene. It should be appreciated that a set of reference regions may be determined for each panel.
By the above means, the invention can minimize the differences in amplicon depth calculation results caused by differences between samples.
Fig. 4 shows a flow chart of a method 400 for constructing a control set according to an embodiment of the invention. It should be appreciated that the method 400 may be performed, for example, at the electronic device 600 depicted in fig. 6, or at the computing device 110 depicted in fig. 1. It should be appreciated that method 400 may also include additional actions not shown and/or may omit actions shown, the scope of the invention being not limited in this respect.
At step 402, computing device 110 determines whether the reference region includes a plurality of amplicons.
At step 404, if the computing device 110 determines that the reference region includes a plurality of amplicons, an average amplicon depth for the reference region is calculated.
At step 406, computing device 110 corrects the amplicon depth of the SMN gene for the normal sample using the average amplicon depth of the reference region to generate an amplicon depth of the SMN gene corrected via the reference region.
The calculation of the amplicon depth of the SMN gene corrected via the reference region is described below in conjunction with formula (4):
SMN_D_ref=SMN_D_nor/MR_D_nor (4)
In formula (4) above, SMN_D_nor represents the amplicon depth of the SMN gene, MR_D_nor represents the average amplicon depth of the reference region, and SMN_D_ref represents the amplicon depth of the SMN gene corrected via the reference region. Each amplicon covers exon 7 or exon 8 of the SMN1 gene and the SMN2 gene.
At step 408, the computing device 110 constructs a control set based on the amplicon depth of the SMN gene corrected via the reference region.
By the above means, the invention can obtain a control set corrected by the internal reference.
Fig. 5 shows a flowchart of a method 500 for detecting SMN gene copy number variation according to an embodiment of the present invention. It should be appreciated that the method 500 may be performed, for example, at the electronic device 600 depicted in fig. 6, or at the computing device 110 depicted in fig. 1. It should be appreciated that method 500 may also include additional actions not shown and/or may omit actions shown, the scope of the invention being not limited in this respect.
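Steps 402 to 408 can be sketched as follows. Formula (4) is taken from the description; summarizing the control set by the median of the corrected depths follows the usage elsewhere in this document, while the function name, data layout, and numbers are hypothetical.

```python
from statistics import mean, median

def corrected_depth(smn_d_nor, reference_depths):
    """SMN_D_ref = SMN_D_nor / MR_D_nor, as in formula (4).

    MR_D_nor is the average over the reference amplicons, which also
    covers the case of a reference region with a single amplicon.
    """
    mr_d_nor = mean(reference_depths)
    return smn_d_nor / mr_d_nor

# Made-up normal samples: homogenized SMN1 exon 7 depth plus the depths
# of two reference amplicons in the same sample.
normal_samples = [
    {"SMN1_exon7": 1.0, "reference": [2.0, 2.0]},
    {"SMN1_exon7": 1.2, "reference": [2.0, 2.0]},
    {"SMN1_exon7": 0.9, "reference": [1.5, 2.5]},
]

corrected = [
    corrected_depth(s["SMN1_exon7"], s["reference"]) for s in normal_samples
]
# The control set is summarized here by the median corrected depth.
control_median = median(corrected)
print(corrected, control_median)
```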
At step 502, the computing device 110 compares the sequencing data of a predetermined number of DNA samples to be tested with the sequencing data of the reference genome to obtain comparison result data.
For example, amplicon sequencing data were collected for 400 samples from different batches with known SMN gene copy numbers, including 150 samples with an SMN gene copy number of 2. The 150 copy-number-normal samples were used to determine the reference region and to construct the control set. All 400 samples were used to determine the depth ratio thresholds corresponding to the SMN gene copy numbers and the copy number ratio thresholds corresponding to the same exon between the SMN genes.
At step 504, the computing device 110 homogenizes the amplicon depth for each test DNA sample based on the comparison data to obtain an amplicon depth for the SMN gene.
For example, computing device 110 obtains the bam files of samples that passed quality control, uses cnvkit to count the coverage depth D of the bed region, and uses bamdst to count the average depth D_av of the targeted amplicon region, so as to correct the coverage depth D of the bed region of each DNA sample to be tested and obtain the homogenized amplicon depth D_nor for the SMN gene. The algorithm for correcting the coverage depth D of the bed region of each DNA sample to be tested is described below in conjunction with formula (5):
D_nor=D/D_av (5)
In formula (5) above, D_nor represents the homogenized amplicon depth for the SMN gene, D represents the amplicon coverage depth of the bed region of each DNA sample to be tested, and D_av represents the average sequencing depth of the targeted amplicon region. It will be appreciated that by homogenizing the amplicon depth of each sample, the invention can eliminate the effects of sequencing data volume and of batch-to-batch variation.
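Formula (5) can be illustrated with a small Python sketch showing why homogenization removes the effect of sequencing data volume. The function name and the depth values are hypothetical; in practice D and D_av would come from the cnvkit and bamdst outputs mentioned above.

```python
def homogenize(depths, average_depth):
    """Apply D_nor = D / D_av to every amplicon of one sample."""
    if average_depth <= 0:
        raise ValueError("average depth must be positive")
    return {name: d / average_depth for name, d in depths.items()}

# Two made-up samples with the same relative profile, where the second
# was sequenced at twice the depth of the first.
sample_a = homogenize(
    {"SMN1_exon7": 500.0, "SMN2_exon7": 450.0}, average_depth=1000.0
)
sample_b = homogenize(
    {"SMN1_exon7": 1000.0, "SMN2_exon7": 900.0}, average_depth=2000.0
)
print(sample_a == sample_b)
```

After division by the per-sample average depth, the two samples have identical homogenized profiles even though their raw depths differ by a factor of two.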
At step 506, the computing device 110 determines an internal reference region for correcting amplicon depth.
For example, computing device 110 constructs a coverage depth matrix using data from 150 samples from different batches with an SMN gene copy number of 2. The rows of the coverage depth matrix indicate amplicons and the columns indicate samples. Computing device 110 uses an R language program to calculate the Pearson correlation coefficients between the amplicon coverage depth of the SMN gene amplicon regions and the depths of the other amplicons. Then, amplicon regions whose Pearson correlation coefficient is greater than or equal to 0.85 and whose coefficient of variation (CV value) is close to the coefficient of variation (CV value) of the homogenized amplicon depth (D_nor) of the SMN gene are screened to obtain candidate amplicon regions. For example, computing device 110 screens out 5 candidate amplicon regions related to exon7 and 3 candidate amplicon regions related to exon8. Computing device 110 corrects the amplicon depth of the exon7 region of the SMN1 gene using each of the 5 candidate amplicon regions related to exon7 in turn, and calculates the coefficient of variation of the corrected exon7 amplicon depth across the 150 samples, so as to select the region with the lowest coefficient of variation as the internal reference region. For example, the reference region MR_exon7 is determined to be chr11_520127_52023776. The reference region related to exon8 is determined by a similar method. The reference region MR_exon8 related to exon8 is, for example, chr11_5244311_5244562.
At step 508, the computing device 110 constructs a control set corrected by the reference region, based on the determined reference region and the predetermined number of normal samples.
For example, the computing device 110 uses the determined reference regions to further homogenize the amplicon depths of each of the 400 samples according to equation (3), in order to eliminate batch effects between samples. The median of the homogenized amplicon depths of the SMN gene (e.g., SMN1_exon7_D_ref, SMN2_exon7_D_ref, SMN1_exon8_D_ref, SMN2_exon8_D_ref) across the 150 normal samples is calculated in order to construct a control set corrected based on the reference region.
At step 510, computing device 110 determines depth ratio thresholds corresponding to different SMN gene copy numbers.
For example, computing device 110 screens samples with SMN gene copy numbers of 0, 1, 2, 3, and ≥4, respectively; for the samples of each copy number, the distribution range of the second ratio (R value) is calculated according to the aforementioned formula (4), so as to obtain the depth ratio thresholds corresponding to the different SMN gene exon copy numbers. Table 1 below schematically shows the depth ratio thresholds corresponding to different SMN gene copy numbers. For example, the "Exon7 threshold interval" indicates the depth ratio thresholds corresponding to different SMN gene copy numbers for exon 7; the "Exon8 threshold interval" indicates the depth ratio thresholds corresponding to different SMN gene copy numbers for exon 8.
At step 512, computing device 110 picks samples where the ratio of SMN1 gene to SMN2 gene is the same across the same exons, so that for each same ratio, a corresponding copy number ratio threshold is calculated.
For example, computing device 110 selects samples in which the ratio between the SMN1 gene and the SMN2 gene on the same exon (e.g., SMN1-exon7:SMN2-exon7 for exon 7) is 1:1, 1:2, 1:3, 1:4, 2:1, 2:3, 3:2, 3:1, or 4:1, respectively; for each ratio, a second ratio between the amplicon depth of the SMN gene and the corrected depth median is calculated, for example according to the aforementioned formula (3), in order to determine the corresponding copy number ratio threshold for the same exon of the SMN gene based on the aforementioned formula (4). Table 2 below schematically shows the corresponding copy number ratio thresholds determined for the same exons of the SMN gene.
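The grouping of samples by copy number ratio can be sketched in Python as follows. Representing each threshold as the (min, max) range of observed read ratios within a group is an assumption made for illustration, as are the function names, field names, and data.

```python
from collections import defaultdict
from fractions import Fraction

def group_by_copy_ratio(samples):
    """Group observed read ratios by the known SMN1:SMN2 copy ratio.

    Fraction reduces equivalent ratios, so 2:2 falls into the 1:1 group.
    Samples with an SMN2 copy number of 0 are not handled here.
    """
    groups = defaultdict(list)
    for s in samples:
        key = Fraction(s["cn_smn1"], s["cn_smn2"])
        groups[key].append(s["read_ratio"])
    return groups

def ratio_thresholds(groups):
    """Summarize each group as a (low, high) interval (an assumption)."""
    return {key: (min(v), max(v)) for key, v in groups.items()}

# Made-up samples with known exon 7 copy numbers and observed
# SMN1/SMN2 read ratios.
samples = [
    {"cn_smn1": 2, "cn_smn2": 2, "read_ratio": 1.02},
    {"cn_smn1": 1, "cn_smn2": 1, "read_ratio": 0.97},
    {"cn_smn1": 1, "cn_smn2": 2, "read_ratio": 0.51},
    {"cn_smn1": 2, "cn_smn2": 4, "read_ratio": 0.49},
]
thresholds = ratio_thresholds(group_by_copy_ratio(samples))
print(thresholds)
```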
At step 514, computing device 110 determines the SMN gene copy number for the DNA sample to be tested.
For example, the computing device 110 uses the average amplicon depth of the reference region to correct the homogenized amplicon depth of the SMN gene (SMN_D_nor) of the sample to be tested, in order to generate the amplicon depth of the SMN gene corrected via the reference region (SMN_D_ref) for the DNA sample to be tested.
Then, the computing device 110 calculates, for the DNA sample to be tested, the ratio R of the amplicon depth of the SMN gene corrected via the internal reference region (SMN_D_ref) to the amplicon depth median of each amplicon region of the control set based on formula (1), as well as the ratio of the SMN1 gene to the SMN2 gene on the same exon. The latter ratios are, for example, R_exon7 (R_exon7 = SMN1_exon7_D_ref / SMN2_exon7_D_ref) and R_exon8 (R_exon8 = SMN1_exon8_D_ref / SMN2_exon8_D_ref).
Thereafter, the computing device 110 determines the SMN gene copy numbers for the DNA sample to be tested, denoted for example CN_SMN1_exon7, CN_SMN1_exon8, CN_SMN2_exon7, and CN_SMN2_exon8, respectively, from the calculated ratio R and the corresponding depth ratio thresholds determined at step 510.
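The copy number call can be sketched as a lookup of the ratio R in the depth ratio threshold intervals. The interval values below are placeholders and are not the values of Table 1; the function name and the handling of ratios that fall in no interval are likewise assumptions.

```python
def call_copy_number(r_value, intervals):
    """Map a depth ratio R to a copy number using threshold intervals.

    Returns the copy number if R falls in exactly one interval,
    "gray zone" if it falls in overlapping intervals, and None if it
    falls in no interval (this last case is an assumption).
    """
    hits = [cn for cn, (low, high) in intervals.items() if low <= r_value <= high]
    if len(hits) == 1:
        return hits[0]
    if len(hits) > 1:
        return "gray zone"
    return None

# Placeholder intervals for one amplicon (not taken from the patent).
exon7_intervals = {1: (0.35, 0.65), 2: (0.80, 1.20), 3: (1.15, 1.60)}

# R = SMN_D_ref of the test sample / control-set median (made-up values).
control_median = 0.50
r_clear = 0.52 / control_median  # about 1.04
r_gray = 0.59 / control_median   # about 1.18

print(call_copy_number(r_clear, exon7_intervals))
print(call_copy_number(r_gray, exon7_intervals))
```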
At step 516, computing device 110 determines a corresponding copy number ratio threshold based on the ratio of SMN1 gene to SMN2 gene on the same exon in order to correct for the determined SMN gene copy number for the DNA sample to be tested.
Table 3 below shows exemplary SMN gene copy number variation data for 20 DNA samples tested. As can be seen from Table 3, except for 1 sample (sample 09) that fell in the gray zone, the SMN gene copy number variation results of the remaining 19 DNA samples were completely consistent with the gold standard results.
Fig. 6 schematically shows a block diagram of an electronic device 600 suitable for use in implementing embodiments of the invention. The electronic device 600 may be for implementing the methods 200 to 500 shown in fig. 2 to 5. As shown in fig. 6, the electronic device 600 includes a central processing unit (i.e., CPU 601) that can perform various suitable actions and processes according to computer program instructions stored in a read-only memory (i.e., ROM 602) or computer program instructions loaded from a storage unit 608 into a random access memory (i.e., RAM 603). In the RAM 603, various programs and data required for the operation of the electronic device 600 can also be stored. The CPU 601, ROM 602, and RAM 603 are connected to each other through a bus 604. An input/output interface (i.e., I/O interface 605) is also connected to bus 604.
A number of components in the electronic device 600 are connected to the I/O interface 605, including: an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The CPU 601 performs the respective methods and processes described above, for example, the methods 200 to 500. For example, in some embodiments, the methods 200 to 500 may be implemented as a computer software program stored on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by CPU 601, one or more of the operations of methods 200 to 500 described above may be performed. Alternatively, in other embodiments, CPU 601 may be configured to perform one or more actions of methods 200 to 500 in any other suitable manner (e.g., by means of firmware).
It should be further appreciated that the present invention can be a method, apparatus, system, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for carrying out operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of the computer readable program instructions, such that the electronic circuitry can execute the computer readable program instructions.
Various aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor in a voice interaction device, a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of devices, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
The above is only an alternative embodiment of the present invention and is not intended to limit the present invention, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A method for detecting SMN gene copy number variation, comprising:
comparing the pre-processed sequencing data for the test DNA sample with the sequencing data of the reference genome to obtain comparison result data, the pre-processed sequencing data for the test DNA sample being obtained via an amplicon sequencing technique based on predetermined SMN gene amplification primers for simultaneously amplifying regions of exon 7 or exon 8 of the SMN1 gene and the SMN2 gene;
based on the comparison result data, homogenizing the amplicon coverage depth in the DNA sample to be tested so as to obtain the amplicon depth for the SMN gene;
determining an internal reference region for correcting the amplicon depth, the internal reference region being an amplicon region consistent with the variation in amplicon depth of the SMN1 gene and the SMN2 gene;
constructing a control set based on the reference region and the copy number normal sample for determining a depth ratio threshold corresponding to the SMN gene copy number; and
based on the exon copy number ratio and depth ratio threshold of the SMN1 gene and the SMN2 gene, obtaining SMN gene copy number variation data about the DNA sample to be detected,
wherein determining the reference region for correcting amplicon depth comprises:
obtaining reference sequencing data for samples from different batches in which the SMN1 gene and the SMN2 gene are both known to have 2 copies, so as to construct a coverage depth matrix based on the reference sequencing data;
calculating a correlation coefficient between amplicons of the SMN gene with respect to amplicon depth based on the coverage depth matrix;
determining candidate amplicon regions based on the correlation coefficient and the coefficient of variation for the amplicon depth;
determining whether the coefficient of variation of the candidate amplicon region satisfies a predetermined condition; and
in response to determining that the coefficient of variation of the candidate amplicon region satisfies the predetermined condition, determining the candidate amplicon region to be a reference region for correcting amplicon depth.
2. The method of claim 1, wherein obtaining SMN gene copy number variation data for the test DNA sample comprises:
determining the SMN gene copy number of the DNA sample to be tested based on the determined depth ratio threshold corresponding to the SMN gene copy number;
correcting the determined copy number of the SMN gene with respect to the sample of the DNA to be tested based on the ratio of the copy numbers of exons of the SMN1 gene and the SMN2 gene so as to obtain variation data of the copy number of the SMN gene with respect to the sample of the DNA to be tested.
3. The method of claim 1, wherein constructing a control set comprises:
screening amplicon regions of the SMN gene amplicon having a coefficient of variation that satisfies a predetermined condition to determine a reference region for correcting amplicon depth; and
homogenizing, based on the determined internal reference regions, the amplicon depths of each of a predetermined number of normal samples for use in constructing a control set.
4. The method of claim 1, wherein determining a depth ratio threshold corresponding to SMN gene copy number comprises:
calculating a corrected depth median for each amplicon region of the control set, so as to calculate a second ratio between the amplicon depth of the SMN gene for each sample in the control set and the corrected depth median; and
separately determining the distribution range of the second ratios for each SMN gene copy number, so as to determine the depth ratio threshold corresponding to the SMN gene copy number.
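The threshold determination of claim 4 may be illustrated, under stated assumptions, by the following Python sketch. The data layout (sample name mapped to a dictionary of corrected amplicon depths), the function names, and the choice to report each distribution range as a simple (minimum, maximum) pair are assumptions of this example rather than features recited in the claim.

```python
from statistics import median


def second_ratios(control_depths):
    """control_depths maps sample -> {amplicon: corrected depth}.

    Returns sample -> {amplicon: depth divided by the control-set median}.
    """
    amplicons = list(next(iter(control_depths.values())))
    medians = {a: median(d[a] for d in control_depths.values()) for a in amplicons}
    return {
        sample: {a: depths[a] / medians[a] for a in amplicons}
        for sample, depths in control_depths.items()
    }


def ratio_ranges(ratios, known_copy_number, amplicon):
    """Group the ratios of one amplicon by known copy number and return (min, max) per group."""
    groups = {}
    for sample, per_amplicon in ratios.items():
        groups.setdefault(known_copy_number[sample], []).append(per_amplicon[amplicon])
    return {cn: (min(values), max(values)) for cn, values in groups.items()}


# Toy control set with one amplicon; copy numbers are assumed to be known.
control = {
    "s1": {"ex7": 100.0},
    "s2": {"ex7": 50.0},
    "s3": {"ex7": 110.0},
    "s4": {"ex7": 150.0},
    "s5": {"ex7": 100.0},
}
known = {"s1": 2, "s2": 1, "s3": 2, "s4": 3, "s5": 2}
ranges = ratio_ranges(second_ratios(control), known, "ex7")
```

A depth ratio threshold between adjacent copy numbers could then be placed between the upper bound of one range and the lower bound of the next; how exactly that boundary is chosen is left open here.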
5. The method of claim 2, wherein correcting the SMN gene copy number determined for the DNA sample to be detected comprises:
selecting samples having the same ratio of the SMN1 gene to the SMN2 gene on the same exon;
grouping the selected samples according to the same ratio; and
calculating, for each group of identical ratios, a corresponding copy number ratio threshold, so as to correct, based on the corresponding copy number ratio threshold, the SMN gene copy number determined for the DNA sample to be detected.
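The grouping step of claim 5 may be sketched as follows. This example assumes that the SMN1:SMN2 ratio on an exon is observed as the fraction of reads carrying the SMN1 allele at the distinguishing site, and that each group's threshold is summarized as the (minimum, maximum) of that fraction; both assumptions, as well as the names `smn1_fraction` and `ratio_thresholds`, are illustrative only.

```python
from collections import defaultdict


def smn1_fraction(smn1_reads, smn2_reads):
    """Fraction of reads supporting SMN1 at the distinguishing site, or None if no reads."""
    total = smn1_reads + smn2_reads
    return smn1_reads / total if total else None


def ratio_thresholds(samples):
    """samples is an iterable of (known_ratio_label, smn1_reads, smn2_reads).

    Samples sharing a known SMN1:SMN2 ratio label are grouped together, and
    the observed SMN1 fraction range of each group is returned.
    """
    groups = defaultdict(list)
    for label, smn1_reads, smn2_reads in samples:
        fraction = smn1_fraction(smn1_reads, smn2_reads)
        if fraction is not None:
            groups[label].append(fraction)
    return {label: (min(v), max(v)) for label, v in groups.items()}


# Toy read counts for samples whose SMN1:SMN2 ratio is assumed to be known.
training = [
    ("2:2", 50, 50),
    ("2:2", 48, 52),
    ("1:2", 30, 60),
    ("1:2", 35, 65),
]
group_thresholds = ratio_thresholds(training)
```

A sample to be detected whose observed fraction falls inside one group's range could then have its SMN1 and SMN2 copy numbers adjusted to agree with that group's ratio.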
6. The method according to claim 1, wherein the predetermined SMN gene amplification primers are a pair of primers for simultaneously amplifying a region of exon 7 or exon 8 of both the SMN1 gene and the SMN2 gene, and the amplified fragment contains a single nucleotide polymorphism site distinguishing the SMN1 gene from the SMN2 gene.
7. The method of claim 1, wherein constructing a control set comprises:
determining whether the reference region comprises a plurality of amplicons;
in response to determining that the reference region includes a plurality of amplicons, calculating an average amplicon depth for the reference region;
correcting the amplicon depth of the SMN gene of the normal sample using the average amplicon depth of the reference region to generate an amplicon depth of the SMN gene corrected via the reference region; and
constructing the control set based on the amplicon depth of the SMN gene corrected via the reference region.
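The control set construction of claim 7 may be sketched as follows. The example assumes that "correcting" an SMN amplicon depth means dividing it by the reference region depth; the claim does not state the operation, and the function names and data layout are likewise hypothetical.

```python
from statistics import mean


def normalize_by_reference(smn_depths, reference_depths):
    """Divide each SMN amplicon depth by the reference region depth.

    When the reference region comprises several amplicons, their average
    depth is used; with a single amplicon, its depth is used directly.
    """
    if len(reference_depths) > 1:
        reference = mean(reference_depths)
    else:
        reference = reference_depths[0]
    if reference == 0:
        raise ValueError("reference region has zero depth")
    return {amplicon: depth / reference for amplicon, depth in smn_depths.items()}


def build_control_set(normal_samples):
    """normal_samples maps sample -> {"smn": {amplicon: depth}, "ref": [depths]}."""
    return {
        name: normalize_by_reference(sample["smn"], sample["ref"])
        for name, sample in normal_samples.items()
    }


# Toy input: one normal sample with two SMN amplicons and two reference amplicons.
control_set = build_control_set(
    {"n1": {"smn": {"ex7": 200.0, "ex8": 100.0}, "ref": [100.0, 300.0]}}
)
```

Dividing by a within-sample reference depth is one common way to remove differences in overall sequencing yield between samples and batches.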
8. A computing device, comprising:
at least one processing unit; and
at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the computing device to perform the steps of the method according to any one of claims 1 to 7.
9. A computer-readable storage medium having stored thereon a computer program which, when executed by a machine, implements the method according to any one of claims 1 to 7.
CN202311401218.3A 2023-10-26 2023-10-26 Methods, devices and media for detecting SMN gene copy number variation Active CN117153249B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311401218.3A CN117153249B (en) 2023-10-26 2023-10-26 Methods, devices and media for detecting SMN gene copy number variation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311401218.3A CN117153249B (en) 2023-10-26 2023-10-26 Methods, devices and media for detecting SMN gene copy number variation

Publications (2)

Publication Number Publication Date
CN117153249A (en) 2023-12-01
CN117153249B (en) 2024-02-02

Family

ID=88910249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311401218.3A Active CN117153249B (en) 2023-10-26 2023-10-26 Methods, devices and media for detecting SMN gene copy number variation

Country Status (1)

Country Link
CN (1) CN117153249B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650312A (en) * 2016-12-29 2017-05-10 安诺优达基因科技(北京)有限公司 Device for detecting DNA copy number variation of circulating tumor
CN112435710A (en) * 2020-10-16 2021-03-02 赛福解码(北京)基因科技有限公司 Method for detecting single-sample SMN gene copy number in WES data
KR102273257B1 (en) * 2020-11-16 2021-07-06 주식회사 엔젠바이오 Copy number variations detecting method based on read-depth and analysis apparatus
CN113192555A (en) * 2021-04-21 2021-07-30 杭州博圣医学检验实验室有限公司 Method for detecting copy number of second-generation sequencing data SMN gene by calculating sequencing depth of differential allele
CN114457144A (en) * 2022-03-22 2022-05-10 上海润达榕嘉生物科技有限公司 Method for detecting copy number of target gene
KR20220060198A (en) * 2020-11-04 2022-05-11 국립암센터 Method for Predicting Survival Prognosis of Pancreatic Cancer Patients Using Gene Copy Number Variation Profile
CN115637288A (en) * 2022-12-23 2023-01-24 苏州赛福医学检验有限公司 Method for detecting copy number change of SMN1 and SMN2 genes and application thereof
WO2023030233A1 (en) * 2021-08-30 2023-03-09 广州燃石医学检验所有限公司 Copy number variation detection method and application thereof
CN116287192A (en) * 2023-01-18 2023-06-23 浙江大学 Kit for integrating SMN1 and SMN2 copy number, minor variation and family linkage analysis and application thereof
CN116386718A (en) * 2023-05-30 2023-07-04 北京华宇亿康生物工程技术有限公司 Method, apparatus and medium for detecting copy number variation

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650312A (en) * 2016-12-29 2017-05-10 安诺优达基因科技(北京)有限公司 Device for detecting DNA copy number variation of circulating tumor
CN112435710A (en) * 2020-10-16 2021-03-02 赛福解码(北京)基因科技有限公司 Method for detecting single-sample SMN gene copy number in WES data
KR20220060198A (en) * 2020-11-04 2022-05-11 국립암센터 Method for Predicting Survival Prognosis of Pancreatic Cancer Patients Using Gene Copy Number Variation Profile
KR102273257B1 (en) * 2020-11-16 2021-07-06 주식회사 엔젠바이오 Copy number variations detecting method based on read-depth and analysis apparatus
CN113192555A (en) * 2021-04-21 2021-07-30 杭州博圣医学检验实验室有限公司 Method for detecting copy number of second-generation sequencing data SMN gene by calculating sequencing depth of differential allele
WO2023030233A1 (en) * 2021-08-30 2023-03-09 广州燃石医学检验所有限公司 Copy number variation detection method and application thereof
CN114457144A (en) * 2022-03-22 2022-05-10 上海润达榕嘉生物科技有限公司 Method for detecting copy number of target gene
WO2023179053A1 (en) * 2022-03-22 2023-09-28 上海润达榕嘉生物科技有限公司 Method for detecting number of copies of target gene
CN115637288A (en) * 2022-12-23 2023-01-24 苏州赛福医学检验有限公司 Method for detecting copy number change of SMN1 and SMN2 genes and application thereof
CN116287192A (en) * 2023-01-18 2023-06-23 浙江大学 Kit for integrating SMN1 and SMN2 copy number, minor variation and family linkage analysis and application thereof
CN116386718A (en) * 2023-05-30 2023-07-04 北京华宇亿康生物工程技术有限公司 Method, apparatus and medium for detecting copy number variation

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
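Detection of exogenous gene copy number in the Pichia pastoris genome by TaqMan real-time fluorescent quantitative PCR; 李凯; 高宏雷; 高立; 祁小乐; 高玉龙; 徐延伟; 王笑梅; Acta Veterinaria et Zootechnica Sinica (05); full text *
王佶; 安宇; 周水珍; 王艺; 刘仁超. Analysis of SMN1 and SMN2 gene copy number variation in spinal muscular atrophy. Chinese Journal of Evidence-Based Pediatrics. 2013, (03), full text. *
Analysis of SMN1 and SMN2 gene copy number variation in spinal muscular atrophy; 王佶; 安宇; 周水珍; 王艺; 刘仁超; Chinese Journal of Evidence-Based Pediatrics (03); full text *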

Also Published As

Publication number Publication date
CN117153249A (en) 2023-12-01

Similar Documents

Publication Publication Date Title
Singer et al. Single-cell mutation identification via phylogenetic inference
Wenger et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome
Alachiotis et al. RAiSD detects positive selection based on multiple signatures of a selective sweep and SNP vectors
Hardwick et al. Reference standards for next-generation sequencing
Daber et al. Understanding the limitations of next generation sequencing informatics, an approach to clinical pipeline validation using artificial data sets
Lohmueller et al. Natural selection affects multiple aspects of genetic variation at putatively neutral sites across the human genome
Browning et al. Haplotype phasing: existing methods and new developments
Sveinbjornsson et al. Weighting sequence variants based on their annotation increases power of whole-genome association studies
Olson et al. Best practices for evaluating single nucleotide variant calling methods for microbial genomics
TWI748263B (en) Gene mutation identification method, device and storage medium
Cibulskis et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples
Faure et al. DiMSum: an error model and pipeline for analyzing deep mutational scanning data and diagnosing common experimental pathologies
Feder et al. LDx: estimation of linkage disequilibrium from high-throughput pooled resequencing data
Faye et al. Re-ranking sequencing variants in the post-GWAS era for accurate causal variant identification
Renzette et al. On the analysis of intrahost and interhost viral populations: human cytomegalovirus as a case study of pitfalls and expectations
CN116386718B (en) Method, apparatus and medium for detecting copy number variation
CN111462816B (en) Method, electronic device and computer storage medium for detecting microdeletion and microduplication of germ line genes
Munch et al. Selective sweeps across twenty millions years of primate evolution
Piazza et al. CEQer: a graphical tool for copy number and allelic imbalance detection from whole-exome sequencing data
US20190259468A1 (en) System and Method for Correlated Error Event Mitigation for Variant Calling
Wood et al. Recommendations for accurate resolution of gene and isoform allele-specific expression in RNA-Seq data
Talevich et al. CNVkit-RNA: copy number inference from RNA-sequencing data
Carvajal-Rodriguez HacDivSel: two new methods (haplotype-based and outlier-based) for the detection of divergent selection in pairs of populations
Rentas et al. Utility of droplet digital PCR and NGS-based CNV clinical assays in hearing loss diagnostics: current status and future prospects
Rafajlović et al. Demography-adjusted tests of neutrality based on genome-wide SNP data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant