CN115497557A - Method and device for detecting gene copy number variation aiming at targeted sequencing - Google Patents

Method and device for detecting gene copy number variation aiming at targeted sequencing

Info

Publication number
CN115497557A
Authority
CN
China
Prior art keywords
gene
sample
copy number
number variation
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211046725.5A
Other languages
Chinese (zh)
Inventor
王涛
贾磊
肖姗姗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Repugene Technology Co ltd
Original Assignee
Hangzhou Repugene Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Repugene Technology Co ltd filed Critical Hangzhou Repugene Technology Co ltd
Priority to CN202211046725.5A priority Critical patent/CN115497557A/en
Publication of CN115497557A publication Critical patent/CN115497557A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/20 Screening of libraries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Library & Information Science (AREA)
  • Biotechnology (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biochemistry (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The embodiment of the invention provides a method and a device for detecting gene copy number variation in targeted sequencing. The method comprises: obtaining a sample to be detected and negative samples; performing quality control on the sample sequencing data and aligning the quality-controlled data to a reference genome to obtain the position information of the sequence fragments; dividing the target sequencing region into intervals and combining the position information of the sequence fragments to obtain a depth index; obtaining, from the depth index and a preset scheme, the gene set identified by primary screening of copy number variation genes; removing that gene set and then calculating and selecting the best negative sample; and calculating, in combination with the sample to be detected, the copy number variation genes in the sample to be detected. With this method, potential copy number variation in the sample to be detected is excluded before the most closely related normal sample is selected from the background sample pool as the control for detecting copy number variation in the sample to be detected, which reduces detection cost and improves detection performance.

Description

Method and device for detecting gene copy number variation aiming at targeted sequencing
Technical Field
The invention relates to the technical field of gene detection, and in particular to a method and a device for detecting gene copy number variation in targeted sequencing.
Background
For copy number variation detection, traditional methods include fluorescence in situ hybridization (FISH), multiplex ligation-dependent probe amplification (MLPA), droplet digital PCR (ddPCR), and chromosomal microarray (CMA). FISH hybridizes sequence-specific fluorescently labeled probes and detects the resulting fluorescent signal under a microscope, which indicates the presence or absence of a particular target DNA sequence. ddPCR partitions template DNA into thousands of nanoliter-scale droplets, allowing absolute quantitation of target copy number without a standard curve. In addition to these traditional methods, detection methods based on next-generation sequencing (NGS) are widely used, including targeted sequencing (Target-NGS), whole-exome sequencing (WES), and whole-genome sequencing (WGS). NGS-based copy number variation detection relies on differences in sequencing depth to identify genes or genomic regions with copy number variation, but the genomic regions covered differ between sequencing methods, so the identification algorithms differ as well. Targeted sequencing covers only a few genomic intervals, and copy number variation is generally identified from differences in the relative sequencing depth of each interval. WES and WGS cover a large portion of the genome, so in addition to direct identification from depth differences, copy number variation signals can also be identified with techniques such as neural networks and wavelet transforms. Copy number detection based on targeted sequencing can be designed around specific genes, is highly targeted, costs less than WES and WGS, and is now widely applied in related fields.
Copy number detection methods based on targeted sequencing can be divided into paired-sample detection, detection against a multi-sample background pool, and no-control detection. Paired-sample detection collects normal tissue or blood cells from the same individual as a control for detecting copy number variation in that individual's tumor tissue. The background-pool method selects multiple normal samples and combines them to construct a single background sample pool. No-control detection does not depend on a control sample and identifies copy number variation directly from depth differences within the sample itself.
Among copy number variation detection methods based on targeted sequencing, paired-sample detection is theoretically optimal. However, it requires a normal sample from the same individual, which is not always available, and sequencing the paired sample doubles the cost of the whole protocol. The background-pool method has several problems in how the sample pool is constructed: when background samples are selected based on the sample to be detected, the copy number variation status of that sample is not taken into account; the appropriate number of background samples is difficult to define; and a sample pool that is kept unchanged for a long time may not represent the characteristics of new samples. No-control detection is inexpensive, but it can only correct depth using population-level genomic features (such as GC content and the distribution of repetitive sequences) and cannot correct for sample-specific characteristics.
Disclosure of Invention
In view of the problems in the prior art, embodiments of the present invention provide a method and an apparatus for detecting gene copy number variation by targeted sequencing.
The embodiment of the invention provides a method for detecting gene copy number variation aiming at targeted sequencing, which comprises the following steps:
obtaining a sample to be detected, obtaining related negative samples according to the sample to be detected, and establishing a sample library from the sample to be detected and the negative samples;
performing quality control on the sample sequencing data in the sample library, and aligning the quality-controlled sample sequencing data to a reference genome to obtain the position information of each sequence fragment in the sample sequencing data;
acquiring a preset target sequencing region of the reference genome, dividing the target sequencing region into consecutive gene intervals, and computing read depth statistics from the position information of each sequence fragment in the sample sequencing data to obtain a depth index for each gene interval;
obtaining, from the depth index of each gene interval and a preset copy number variation gene primary screening scheme, the gene set identified by primary screening of copy number variation genes;
removing, from the sample to be detected and the negative samples, the gene intervals corresponding to the gene set identified by primary screening, applying extreme-value normalization to the depth indexes of the remaining intervals of the sample to be detected and of each negative sample, calculating the distance between the sample to be detected and each negative sample, and selecting the best negative sample according to the distance;
and calculating the relative depth ratio of each gene interval in the sample to be detected based on the best negative sample, calculating the adjusted relative copy number ratio of each gene interval from the relative depth ratio, calculating the gene-level copy number ratio from the adjusted relative copy number ratio, and determining the copy number variation genes in the sample to be detected based on the gene-level copy number ratio.
In one embodiment, the method further comprises:
sorting the gene intervals of the sample to be detected by depth index, and selecting the corresponding target genes one by one in sorted order for gene primary screening, wherein the gene primary screening comprises: calculating the standard deviation of the depth indexes of the intervals of the remaining genes other than the target gene, and comparing the interval depths of the target gene against that standard deviation;
and determining the gene set identified by primary screening of copy number variation genes according to the comparison results of the gene primary screening.
In one embodiment, the method further comprises:
calculating the GC proportion of each gene interval according to the sequence information of the reference genome;
dividing the GC proportion range into windows, and determining, within each window, the screened gene intervals whose depth index in the sample to be detected ranks in the top 5%;
and comparing each interval with the screened gene intervals one by one, and when 60% or more of the gene intervals of a target gene belong to the screened gene intervals, determining that the target gene belongs to the gene set identified by primary screening of copy number variation genes.
In one embodiment, the method further comprises:
removing, from the sample sequencing data, adaptor sequences, low-quality sequences at both ends, and sequences that contain multiple consecutive N bases or whose length is below a preset threshold.
In one embodiment, the reference genome comprises:
GRCh37 or GRCh38.
The embodiment of the invention provides a device for detecting gene copy number variation aiming at targeted sequencing, which comprises:
the acquisition module is used for obtaining a sample to be detected, obtaining related negative samples according to the sample to be detected, and establishing a sample library from the sample to be detected and the negative samples;
the quality control module is used for performing quality control on the sample sequencing data in the sample library, and aligning the quality-controlled sample sequencing data to a reference genome to obtain the position information of each sequence fragment in the sample sequencing data;
the interval division module is used for acquiring a preset target sequencing region of the reference genome, dividing the target sequencing region into consecutive gene intervals, and computing read depth statistics from the position information of each sequence fragment in the sample sequencing data to obtain a depth index for each gene interval;
the primary screening module is used for obtaining, from the depth index of each gene interval and a preset copy number variation gene primary screening scheme, the gene set identified by primary screening of copy number variation genes;
the selection module is used for removing, from the sample to be detected and the negative samples, the gene intervals corresponding to the gene set identified by primary screening, applying extreme-value normalization to the depth indexes of the remaining intervals of the sample to be detected and of each negative sample, calculating the distance between the sample to be detected and each negative sample, and selecting the best negative sample according to the distance;
and the calculating module is used for calculating the relative depth ratio of each gene interval in the sample to be detected based on the best negative sample, calculating the adjusted relative copy number ratio of each gene interval from the relative depth ratio, calculating the gene-level copy number ratio from the adjusted relative copy number ratio, and determining the copy number variation genes in the sample to be detected based on the gene-level copy number ratio.
In one embodiment, the apparatus further comprises:
the sorting module is used for sorting the gene intervals of the sample to be detected by depth index, and selecting the corresponding target genes one by one in sorted order for gene primary screening, wherein the gene primary screening comprises: calculating the standard deviation of the depth indexes of the intervals of the remaining genes other than the target gene, and comparing the interval depths of the target gene against that standard deviation;
and the determining module is used for determining the gene set identified by primary screening of copy number variation genes according to the comparison results of the gene primary screening.
In one embodiment, the apparatus further comprises:
the second calculation module is used for calculating the GC proportion of each gene interval according to the sequence information of the reference genome;
the dividing module is used for dividing the GC proportion range into windows, and determining, within each window, the screened gene intervals whose depth index in the sample to be detected ranks in the top 5%;
and the comparison module is used for comparing each interval with the screened gene intervals one by one, and when 60% or more of the gene intervals of a target gene belong to the screened gene intervals, determining that the target gene belongs to the gene set identified by primary screening of copy number variation genes.
An embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method for detecting copy number variation of a gene for targeted sequencing when executing the program.
Embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program that, when executed by a processor, performs the steps of the above method for detecting copy number variation of a gene for targeted sequencing.
The embodiment of the invention provides a method and a device for detecting gene copy number variation in targeted sequencing. A sample to be detected is obtained, related negative samples are obtained according to the sample to be detected, and a sample library is established from them. Quality control is performed on the sample sequencing data in the sample library, and the quality-controlled data are aligned to a reference genome to obtain the position information of each sequence fragment. A preset target sequencing region of the reference genome is acquired and divided into consecutive gene intervals, and read depth statistics are computed from the position information of each sequence fragment to obtain a depth index for each gene interval. From the depth index of each gene interval and a preset primary screening scheme, the gene set identified by primary screening of copy number variation genes is obtained. The gene intervals corresponding to that gene set are removed from the sample to be detected and the negative samples, extreme-value normalization is applied to the depth indexes of the remaining intervals, the distance between the sample to be detected and each negative sample is calculated, and the best negative sample is selected according to the distance. Based on the best negative sample, the relative depth ratio of each gene interval in the sample to be detected is calculated, the adjusted relative copy number ratio of each gene interval is calculated from the relative depth ratio, the gene-level copy number ratio is calculated from the adjusted relative copy number ratio, and the copy number variation genes in the sample to be detected are determined from the gene-level copy number ratio. In this way, potential copy number variation in the sample to be detected is excluded, and the nearest normal sample is selected from the background sample pool (the negative samples) as the control for detecting copy number variation in the sample to be detected, which reduces detection cost and improves detection performance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a method for detecting copy number variation of a gene for targeted sequencing according to an embodiment of the present invention;
FIG. 2 is a diagram showing the structure of an apparatus for detecting copy number variation of a gene by targeted sequencing according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a method for detecting gene copy number variation for target sequencing according to an embodiment of the present invention, and as shown in fig. 1, the method for detecting gene copy number variation for target sequencing according to an embodiment of the present invention includes:
Step S101, obtaining a sample to be detected, obtaining related negative samples according to the sample to be detected, and establishing a sample library from the sample to be detected and the negative samples.
Specifically, a sample to be detected is obtained, and related negative samples are obtained according to it. For example, when the sample to be detected is tumor tissue cells, the related negative samples may come from tissue or blood cell samples that are negative for copy number variation, taken from the same individual or from different individuals, or from negative in-house reference materials. In addition, to ensure the diversity of the negative sample library, multiple (more than 10) negative samples from different times, batches, and material types need to be selected. The corresponding sample library is then established from the sample to be detected and the negative samples.
Step S102, performing quality control on the sample sequencing data in the sample library, and aligning the quality-controlled sample sequencing data to a reference genome to obtain the position information of each sequence fragment in the sample sequencing data.
Specifically, quality control is performed on the sample sequencing data in the sample library, including the sequencing data of the sample to be detected and of the negative samples. Quality control may consist of removing adaptor sequences, low-quality sequences at both ends, and sequences that contain multiple consecutive N bases or whose length is below a threshold. Sequence alignment software is then used to align the quality-controlled data to a reference genome, yielding the alignment result at each position and the position of each sequence fragment of the sample sequencing data on the reference genome. The human reference genome may be GRCh37 or GRCh38, and the sequence alignment software may be bwa or bowtie2.
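The read-level filters in the quality control step above can be illustrated with a minimal Python sketch. The function names, the exact-match adapter search, and the thresholds (`min_q=20`, `min_len=50`, `max_n_run=3`) are hypothetical choices made for illustration; the source only states that a preset threshold is used and does not name a trimming tool.

```python
import re

def trim_adapter(seq, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any.
    (Exact matching is a simplification assumed for this sketch.)"""
    idx = seq.find(adapter)
    return seq if idx == -1 else seq[:idx]

def trim_low_quality_ends(seq, quals, min_q=20):
    """Trim bases from both ends while their quality is below min_q.
    min_q is an assumed value, not one given in the source."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]

def passes_qc(seq, min_len=50, max_n_run=3):
    """Keep a read only if it is long enough and has no long run of N bases.
    min_len and max_n_run are assumed thresholds."""
    if len(seq) < min_len:
        return False
    # Reject reads containing more than max_n_run consecutive N bases.
    return re.search("N{%d,}" % (max_n_run + 1), seq) is None
```

Reads that survive these filters would then be passed to the aligner (bwa or bowtie2 in the text) to obtain their positions on the reference genome.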
Step S103, acquiring a preset target sequencing region of the reference genome, dividing the target sequencing region into consecutive gene intervals, and computing read depth statistics from the position information of each sequence fragment in the sample sequencing data to obtain a depth index for each gene interval.
Specifically, a preset target sequencing region in the reference genome is obtained and divided into consecutive intervals, with an interval size of 100-1000 bp, to obtain the gene intervals. Read depth statistics are then computed from the position information of each sequence fragment in the sample sequencing data, and the mean depth or median depth of each interval is taken as the depth index of that interval.
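As a rough illustration of the interval division and depth index described above, the following sketch splits one target region into consecutive bins and computes the mean or median per-base depth of each bin. The half-open coordinate convention, the function names, and the `bin_size=200` default are assumptions for this example; the text only specifies an interval size of 100-1000 bp.

```python
from statistics import mean, median

def split_region(start, end, bin_size=200):
    """Split the half-open region [start, end) into consecutive bins.
    The last bin may be shorter than bin_size."""
    return [(s, min(s + bin_size, end)) for s in range(start, end, bin_size)]

def bin_depth_index(bins, reads, use_median=False):
    """Depth index per bin from aligned reads on the same chromosome.

    bins  : contiguous (start, end) bins as returned by split_region
    reads : (read_start, read_end) half-open alignment intervals
    """
    if not bins:
        return []
    region_start, region_end = bins[0][0], bins[-1][1]
    depth = [0] * (region_end - region_start)
    for read_start, read_end in reads:
        # Count each read once at every base it covers inside the region.
        for pos in range(max(read_start, region_start), min(read_end, region_end)):
            depth[pos - region_start] += 1
    aggregate = median if use_median else mean
    return [aggregate(depth[s - region_start:e - region_start]) for s, e in bins]
```

For example, `bin_depth_index(split_region(0, 10, 5), [(0, 5), (3, 8)])` covers bases 3 and 4 twice, so the first bin has a higher mean depth than the second.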
Step S104, obtaining, from the depth index of each gene interval and a preset copy number variation gene primary screening scheme, the gene set identified by primary screening of copy number variation genes.
Specifically, the gene set identified by primary screening of copy number variation genes is obtained from the depth index of each gene interval and a preset primary screening scheme, so that some genes with copy number variation, particularly those with pronounced copy number changes, are identified in advance. The identified abnormal gene intervals are removed in the subsequent steps, so that the best negative control sample finally selected is more similar to the sample to be detected.
In addition, the preset copy number variation gene primary screening scheme may be as follows:
The genomic intervals of the sample to be detected are sorted from high to low by read depth. Target genes are selected one by one in the sorted interval order. For each target gene, the mean (μ) and standard deviation (σ) of the depths of the remaining genomic intervals, excluding those of the target gene, are calculated, and the depths of the target gene's intervals are compared against the remaining intervals. The target gene is determined to have copy number variation if more than 50% of its intervals have a depth of at least μ+2σ, or more than 20% of its intervals have a depth of at least μ+3σ. All intervals of genes determined to have copy number variation are excluded from the remaining intervals. The primary screening judgment is completed for all genes one by one in this way, yielding all genes with copy number variation and their sets of genomic intervals. The judgment formula is as follows:
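$$\frac{\left|\{\, i : \mathrm{RegionDepth}_i \ge \mu + 2\sigma \,\}\right|}{\mathrm{TotalRegion}} > 50\% \quad \text{or} \quad \frac{\left|\{\, i : \mathrm{RegionDepth}_i \ge \mu + 3\sigma \,\}\right|}{\mathrm{TotalRegion}} > 20\%$$
where RegionDepth represents the depth of a gene interval and TotalRegion represents the number of intervals that the gene contains.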
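The depth-based primary screening scheme above can be sketched as follows. This is one possible reading rather than the patented implementation: genes are visited in order of their deepest interval, the population standard deviation is used, and the function name and the parallel-list data layout are hypothetical.

```python
from statistics import mean, pstdev

def prescreen_by_depth(depths, genes):
    """Flag genes whose intervals are unusually deep.

    depths : depth index of each interval
    genes  : gene name of each interval (parallel to depths)
    Returns the set of genes flagged as candidate copy number variations.
    """
    # Visit genes in order of their deepest interval, high to low
    # (an assumed interpretation of "selected one by one in sorted order").
    order = []
    for i in sorted(range(len(depths)), key=lambda i: -depths[i]):
        if genes[i] not in order:
            order.append(genes[i])

    flagged = set()
    for gene in order:
        target = [d for d, g in zip(depths, genes) if g == gene]
        # Remaining intervals exclude the target gene and genes already flagged.
        rest = [d for d, g in zip(depths, genes)
                if g != gene and g not in flagged]
        if len(rest) < 2:
            continue
        mu, sigma = mean(rest), pstdev(rest)  # population std is an assumption
        frac2 = sum(d >= mu + 2 * sigma for d in target) / len(target)
        frac3 = sum(d >= mu + 3 * sigma for d in target) / len(target)
        if frac2 > 0.5 or frac3 > 0.2:
            flagged.add(gene)
    return flagged
```

Excluding already flagged genes from the remaining intervals keeps one strongly amplified gene from inflating μ and σ for the genes judged after it.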
In addition, the preset copy number variation gene primary screening scheme may also be as follows:
The GC proportion (GC_ratio) of each interval is calculated from the sequence information of the reference genome, which may be GRCh37 or GRCh38 and must be the same reference genome used for alignment. Because the sequencing depth of regions with very high or very low GC content is strongly affected by GC content, intervals with GC_ratio < 0.3 or GC_ratio > 0.8 are removed during primary screening of copy number variation genes. The remaining GC_ratio range (the target range) [0.3, 0.8] is divided into windows of width 0.05, and within each GC window the gene intervals whose sequencing depth in the sample to be detected ranks in the top 5% are identified and recorded as GC_top5 (the screened gene intervals). The results of all GC_ratio windows are then combined: a gene is determined to have copy number variation when 60% or more of its intervals belong to GC_top5. The primary screening judgment is completed for all genes one by one in this way, yielding the set of all genes with copy number variation.
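A minimal sketch of the GC-window screening scheme above is given below. Several details are assumptions not fixed by the text: each non-empty window contributes at least one interval to GC_top5, a gene's 60% fraction is computed over its intervals that remain after GC filtering, and the function names and parallel-list layout are hypothetical.

```python
def gc_ratio(seq):
    """Fraction of G and C bases in an interval's reference sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def prescreen_by_gc(depths, gc, genes, lo=0.3, hi=0.8, width=0.05,
                    top=0.05, min_frac=0.6):
    """Flag genes with at least min_frac of their intervals in GC_top5."""
    n_windows = round((hi - lo) / width)
    windows = {}
    for i, r in enumerate(gc):
        if lo <= r <= hi:  # intervals outside the target range are dropped
            w = min(int((r - lo) / width + 1e-9), n_windows - 1)
            windows.setdefault(w, []).append(i)

    gc_top5 = set()
    for idx in windows.values():
        # Assumption: keep at least one interval per non-empty window.
        k = max(1, int(len(idx) * top + 1e-9))
        ranked = sorted(idx, key=lambda i: depths[i], reverse=True)
        gc_top5.update(ranked[:k])

    kept = [i for idx in windows.values() for i in idx]
    flagged = set()
    for gene in {genes[i] for i in kept}:
        own = [i for i in kept if genes[i] == gene]
        if sum(i in gc_top5 for i in own) / len(own) >= min_frac:
            flagged.add(gene)
    return flagged
```

Ranking depth within each GC window, rather than across all intervals, compares each interval only against intervals with similar GC content, which is the motivation the text gives for the windowing.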
Step S105, removing, from the sample to be detected and the negative samples, the gene intervals corresponding to the gene set identified by primary screening of copy number variation genes, applying extreme-value normalization to the depth indexes of the remaining intervals of the sample to be detected and of each negative sample, calculating the distance between the sample to be detected and each negative sample, and selecting the best negative sample according to the distance.
Specifically, based on the gene set identified by primary screening of copy number variation genes, the corresponding gene intervals are removed from the sample to be detected and the negative samples. Extreme-value normalization is then applied to the interval depth indexes of the sample to be detected and of each negative sample, giving normalized depths (AveDepth), to eliminate the influence of sequencing depth differences between samples; other normalization methods may be substituted. Based on AveDepth, the Euclidean distance between each negative sample in the negative sample library and the sample to be detected is calculated one by one, and the negative sample with the smallest Euclidean distance (Dist) is taken as the nearby control sample (NearbyControl) of the sample to be detected, that is, the best negative sample. The Euclidean distance (Dist) is calculated as follows:
$$\mathrm{Dist} = \sqrt{\sum_{i=1}^{N} \left(\mathrm{AveDepth}_{t,i} - \mathrm{AveDepth}_{n,i}\right)^{2}}$$
where N represents the total number of remaining intervals of the sample, t represents the sample to be detected, and n represents the negative sample being compared.
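The selection of the best negative sample can be sketched as below. The sketch assumes that the normalization is min-max scaling to [0, 1], which is only one reading of the translated term; the text itself notes that other normalization methods may be substituted. The function names and the dictionary layout of the negative sample library are hypothetical.

```python
import math

def min_max_normalize(depths):
    """Scale depths to [0, 1]; an assumed form of the normalization step."""
    lo, hi = min(depths), max(depths)
    if hi == lo:
        return [0.0] * len(depths)
    return [(d - lo) / (hi - lo) for d in depths]

def select_best_negative(test_depths, negatives, excluded):
    """Return (name, distance) of the negative sample closest to the test sample.

    test_depths : depth index per interval for the sample to be detected
    negatives   : {sample_name: depth index per interval}
    excluded    : indices of intervals removed by primary screening
    """
    keep = [i for i in range(len(test_depths)) if i not in excluded]
    t = min_max_normalize([test_depths[i] for i in keep])
    best_name, best_dist = None, math.inf
    for name, depths in negatives.items():
        c = min_max_normalize([depths[i] for i in keep])
        # Euclidean distance over the remaining intervals.
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(t, c)))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist
```

Removing the pre-screened intervals before computing the distance keeps a strongly amplified gene in the test sample from dominating the comparison with the normal samples.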
Step S106, calculating the relative depth ratio of each gene interval in the sample to be detected based on the best negative sample, calculating the adjusted relative copy number ratio of each gene interval from the relative depth ratio, calculating the gene-level copy number ratio from the adjusted relative copy number ratio, and determining the copy number variation genes in the sample to be detected based on the gene-level copy number ratio.
Specifically, the sequencing depth of the sample to be detected is corrected using the selected nearby control sample (the best negative sample) and the GC proportion (GC_ratio) of each interval, and the result is Log2-transformed to obtain the relative depth ratio (RD) of each interval of the sample to be detected. The difference between the relative depth ratio of each interval and the median (MedianLog2) of the relative depth ratios of all intervals is then calculated and recorded as the adjusted relative copy number ratio (AdjustRatio). The calculation formula is as follows, where $RD_{All}$ denotes the relative depth ratios of all intervals:
$$\mathrm{AdjustRatio} = RD - \operatorname{median}(RD_{All})$$
Based on AdjustRatio, the gene-level copy number ratio of a given gene of the sample to be detected is calculated and recorded as ResRatio. The calculation formula is as follows, where n represents the number of intervals contained in the gene:
Figure BDA0003822647990000092
Based on the gene-level copy number ratio ResRatio, the copy number is calculated using the following formula:
Figure BDA0003822647990000093
The copy number (CN) of each gene is thereby obtained, from which the copy number variation genes in the sample to be detected are identified.
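The chain from relative depth ratio to copy number can be sketched as follows. Only the AdjustRatio step is stated explicitly in the text; the other choices here are assumptions: the GC correction is omitted, ResRatio is taken as the mean of AdjustRatio over a gene's n intervals, and the copy number is taken as ploidy × 2^ResRatio with a diploid baseline. The function name is hypothetical.

```python
import math
from statistics import mean, median

def gene_copy_numbers(test_depths, control_depths, genes, ploidy=2):
    """Return {gene: (ResRatio, CN)} for the sample to be detected.

    test_depths    : depth index per interval in the test sample (all > 0)
    control_depths : depth index per interval in the best negative sample (all > 0)
    genes          : gene name per interval
    """
    # Relative depth ratio (RD): log2 of test depth over control depth.
    # GC correction is omitted in this sketch.
    rd = [math.log2(t / c) for t, c in zip(test_depths, control_depths)]
    # AdjustRatio = RD - median(RD_All), as given in the text.
    center = median(rd)
    adjust = [r - center for r in rd]

    result = {}
    for gene in dict.fromkeys(genes):  # unique genes, original order
        vals = [a for a, g in zip(adjust, genes) if g == gene]
        res_ratio = mean(vals)            # assumed: mean over the gene's intervals
        cn = ploidy * 2 ** res_ratio      # assumed: diploid baseline
        result[gene] = (res_ratio, cn)
    return result
```

Under these assumptions, a gene whose intervals are all twice as deep as the control, in a panel where most other intervals match the control, has ResRatio 1 and an estimated copy number of 4.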
An embodiment of the invention provides a method for detecting gene copy number variation in targeted sequencing data. The method comprises: obtaining a sample to be detected, obtaining associated negative samples according to the sample to be detected, and establishing a sample library from the sample to be detected and the negative samples; performing quality control on the sample sequencing data in the sample library, and aligning the quality-controlled sequencing data to a reference genome to obtain position information for each sequence fragment in the sample sequencing data; acquiring a preset target sequencing region of the reference genome, dividing the target sequencing region into consecutive gene intervals, and computing read depth statistics from the position information of each sequence fragment to obtain a depth index for each gene interval; obtaining, according to the depth index of each gene interval and a preset preliminary screening scheme for copy number variation genes, a gene set of preliminarily screened copy number variation genes; removing the gene intervals corresponding to that gene set from the sample to be detected and from the negative samples, applying min-max normalization to the depth indexes of the remaining intervals, calculating the distance between the sample to be detected and each negative sample, and selecting the best negative sample according to the distance; and calculating, based on the best negative sample, the relative depth ratio of each gene interval in the sample to be detected, deriving the adjusted relative copy number ratio of each gene interval from the relative depth ratio, deriving the gene-level copy number ratio from the adjusted relative copy number ratios, and determining the copy number variation genes in the sample to be detected from the gene-level copy number ratio. In this way, potential copy number variation in the sample to be detected is excluded before the control is chosen, and the nearest normal sample in the background sample pool (the negative samples) is selected as the control for detecting copy number variation in the sample to be detected, which reduces detection cost and improves detection performance.
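The control-selection step described above (removing the preliminarily screened genes, applying the max-min or "most value" normalization to the depth indexes, and picking the nearest negative sample) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the data layout, and the use of Euclidean distance are assumptions, since this passage does not name the distance metric.

```python
import math

def min_max_normalize(depths):
    """Scale a list of depth indexes to [0, 1] using its own min and max."""
    lo, hi = min(depths), max(depths)
    if hi == lo:
        return [0.0 for _ in depths]
    return [(d - lo) / (hi - lo) for d in depths]

def select_best_negative(test_depths, negatives, prescreened_genes, interval_genes):
    """Return (name, distance) of the negative sample nearest to the test sample.

    test_depths:       depth index per gene interval for the sample to be detected
    negatives:         dict of sample name -> depth index per gene interval
    prescreened_genes: genes flagged by the preliminary screen (removed first)
    interval_genes:    gene name of each interval, in the same order as the depths
    """
    keep = [i for i, g in enumerate(interval_genes) if g not in prescreened_genes]
    if not keep:
        raise ValueError("no gene intervals left after removing prescreened genes")
    test = min_max_normalize([test_depths[i] for i in keep])
    best_name, best_dist = None, math.inf
    for name, depths in negatives.items():
        neg = min_max_normalize([depths[i] for i in keep])
        dist = math.dist(test, neg)  # Euclidean distance: an assumption
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist
```

Normalizing each sample separately makes samples sequenced at different overall depths comparable, so the distance reflects the shape of the coverage profile rather than its scale.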
In another embodiment of the present invention, Panel sequencing and copy number variation detection were performed on 13 reference samples using four methods: detection based on a negative sample pool, detection with a randomly selected negative sample as the control, detection without a control, and the detection method of the invention. The four methods are identical except for the choice of control sample, and their results on the 13 reference samples were compared and analyzed. The 13 reference samples were sequenced with a Panel containing 11 genes, and their known copy number variation information is as follows:
[Figure BDA0003822647990000101: known copy number variation information of the 13 reference samples]
Copy number variation detection was performed on the 13 samples using the method of the invention, the detection method based on a negative sample pool (containing 10 negative samples), detection with a randomly selected negative sample as the control, and detection without a control sample. The results show that all four methods effectively detect the copy number variation genes present in the 13 samples. The method of the invention produced no false positives; the method based on the negative sample pool produced 4 false positives, a false positive rate of 3.1%; the method using a randomly selected negative sample as the control produced 4 false positives, a false positive rate of 3.1%; and the method without a control produced 1 false positive, a false positive rate of 0.8%.
The above results indicate that, in this embodiment, the method of the invention, which selects the nearest negative sample from the negative sample pool as the control, achieves a better detection result than the other three methods.
Fig. 2 is a schematic diagram of an apparatus for detecting gene copy number variation for targeted sequencing according to an embodiment of the present invention. The apparatus includes an acquisition module S201, a quality control module S202, an interval division module S203, a preliminary screening module S204, a selection module S205 and a calculation module S206, wherein:
The acquisition module S201 is configured to obtain a sample to be detected, obtain associated negative samples according to the sample to be detected, and establish a sample library from the sample to be detected and the negative samples.
The quality control module S202 is configured to perform quality control on the sample sequencing data in the sample library, and to align the quality-controlled sequencing data to the reference genome to obtain position information for each sequence fragment in the sample sequencing data.
The interval division module S203 is configured to acquire a preset target sequencing region of the reference genome, divide the target sequencing region into consecutive gene intervals, and compute read depth statistics from the position information of each sequence fragment in the sample sequencing data to obtain a depth index for each gene interval.
The preliminary screening module S204 is configured to obtain, according to the depth index of each gene interval and a preset preliminary screening scheme for copy number variation genes, a gene set of preliminarily screened copy number variation genes.
The selection module S205 is configured to remove the gene intervals corresponding to that gene set from the sample to be detected and from the negative samples, apply min-max normalization to the depth indexes of the remaining intervals, calculate the distance between the sample to be detected and each negative sample, and select the best negative sample according to the distance.
The calculation module S206 is configured to calculate, based on the best negative sample, the relative depth ratio of each gene interval in the sample to be detected, derive the adjusted relative copy number ratio of each gene interval from the relative depth ratio, derive the gene-level copy number ratio from the adjusted relative copy number ratios, and determine the copy number variation genes in the sample to be detected from the gene-level copy number ratio.
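The chain of ratios handled by the calculation module S206 can be illustrated with a short sketch. The passage names the quantities (relative depth ratio, adjusted relative copy number ratio, gene-level copy number ratio) but not their formulas, so everything below is an assumption made for illustration: the adjustment divides by the median ratio, the gene-level value is the median over a gene's intervals, and the gain and loss thresholds are hypothetical. Control depths are assumed to be non-zero.

```python
from statistics import median

def call_gene_cnv(test_depths, control_depths, interval_genes,
                  gain_threshold=1.5, loss_threshold=0.5):
    """Return (gene-level ratios, CNV calls) for one sample against one control."""
    # 1. Relative depth ratio of each gene interval (test / best negative sample).
    ratios = [t / c for t, c in zip(test_depths, control_depths)]

    # 2. Adjusted relative copy number ratio: re-center so that a typical
    #    interval sits at 1.0 (assumed adjustment).
    center = median(ratios)
    adjusted = [r / center for r in ratios]

    # 3. Gene-level copy number ratio: median over the intervals of each gene
    #    (assumed aggregation).
    by_gene = {}
    for gene, value in zip(interval_genes, adjusted):
        by_gene.setdefault(gene, []).append(value)
    gene_ratio = {gene: median(values) for gene, values in by_gene.items()}

    # 4. Call genes whose ratio crosses the hypothetical thresholds.
    calls = {}
    for gene, ratio in gene_ratio.items():
        if ratio >= gain_threshold:
            calls[gene] = "gain"
        elif ratio <= loss_threshold:
            calls[gene] = "loss"
    return gene_ratio, calls
```

Aggregating with a median rather than a mean keeps a single noisy interval from driving the call for a whole gene; other aggregations would fit the same description.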
In one embodiment, the apparatus may further comprise:
a sorting module, configured to sort the gene intervals of the sample to be detected by depth index, and to select the corresponding target genes one by one in sorted order for gene preliminary screening, wherein the gene preliminary screening comprises: calculating the standard deviation of the depth indexes of the intervals of the remaining genes other than the target gene, and comparing the interval depth of the target gene with the standard deviation.
a determining module, configured to determine the gene set of preliminarily screened copy number variation genes according to the comparison results of the gene preliminary screening.
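The standard-deviation screen performed by these two modules can be sketched as below. The passage says only that the target gene's interval depth is compared with the standard deviation of the remaining genes' depth indexes; the specific rule used here (flag a gene whose mean depth lies more than `k` standard deviations from the mean of the remaining intervals) and the default `k=3.0` are assumptions, and all names are hypothetical.

```python
from statistics import mean, stdev

def prescreen_by_std(depths, interval_genes, k=3.0):
    """Return genes whose depth deviates strongly from the remaining genes."""
    by_gene = {}
    for gene, depth in zip(interval_genes, depths):
        by_gene.setdefault(gene, []).append(depth)

    # Process target genes in order of depth, highest first.
    order = sorted(by_gene, key=lambda g: mean(by_gene[g]), reverse=True)

    flagged = []
    for target in order:
        # Depth indexes of the intervals of all genes other than the target.
        rest = [d for g, ds in by_gene.items() if g != target for d in ds]
        if len(rest) < 2:
            continue  # a standard deviation needs at least two values
        spread = stdev(rest)
        deviation = abs(mean(by_gene[target]) - mean(rest))
        if spread > 0 and deviation > k * spread:
            flagged.append(target)
    return flagged
```

Leaving the target gene out of the standard deviation matters: an amplified gene would otherwise inflate the spread and mask itself.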
In one embodiment, the apparatus may further comprise:
a second calculation module, configured to calculate the GC content of each gene interval according to the sequence information of the reference genome.
a dividing module, configured to divide the GC content range into windows and, within each window, to identify as screened gene intervals those intervals whose depth index in the sample to be detected ranks in the top 5%.
a comparison module, configured to compare each gene interval with the screened gene intervals one by one; when at least 60% of the gene intervals of a target gene belong to the screened gene intervals, the target gene belongs to the gene set of preliminarily screened copy number variation genes.
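The GC-window screen described by these three modules can be sketched as follows. The 5% and 60% figures come from the text; the number of GC windows, the rounding rule for "5% of a window", and the input layout (a list of dicts with `gene`, `sequence` and `depth` keys) are assumptions made for this example.

```python
def gc_fraction(sequence):
    """Fraction of G and C bases in a reference sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

def prescreen_by_gc(intervals, n_windows=20, top_fraction=0.05, gene_fraction=0.60):
    """Return genes with at least `gene_fraction` of their intervals screened."""
    # Group interval indexes into GC windows of equal width (assumed binning).
    windows = {}
    for index, interval in enumerate(intervals):
        gc = gc_fraction(interval["sequence"])
        key = min(int(gc * n_windows), n_windows - 1)
        windows.setdefault(key, []).append(index)

    # In each window, screen the intervals whose depth is in the top 5%.
    # Rounding down means windows with fewer than 20 intervals screen none.
    screened = set()
    for members in windows.values():
        n_top = int(len(members) * top_fraction + 1e-9)
        ranked = sorted(members, key=lambda i: intervals[i]["depth"], reverse=True)
        screened.update(ranked[:n_top])

    # A gene is flagged when >= 60% of its intervals were screened.
    per_gene = {}
    for index, interval in enumerate(intervals):
        hits, total = per_gene.get(interval["gene"], (0, 0))
        per_gene[interval["gene"]] = (hits + (index in screened), total + 1)
    return {gene for gene, (hits, total) in per_gene.items()
            if hits / total >= gene_fraction}
```

Ranking depth within a GC window rather than across the whole panel compares each interval only with intervals of similar GC content, which limits the influence of GC-related coverage bias on the screen.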
For specific limitations of the apparatus for detecting gene copy number variation for targeted sequencing, reference may be made to the limitations of the method for detecting gene copy number variation for targeted sequencing described above, which are not repeated here. The modules of the apparatus may be implemented wholly or partly in software, in hardware, or in a combination of the two. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
Fig. 3 illustrates the physical structure of an electronic device. As shown in Fig. 3, the electronic device may include a processor 301, a memory 302, a communication interface 303 and a communication bus 304, wherein the processor 301, the memory 302 and the communication interface 303 communicate with one another via the communication bus 304. The processor 301 may call logic instructions in the memory 302 to perform the following method: obtaining a sample to be detected, obtaining associated negative samples according to the sample to be detected, and establishing a sample library from the sample to be detected and the negative samples; performing quality control on the sample sequencing data in the sample library, and aligning the quality-controlled sequencing data to a reference genome to obtain position information for each sequence fragment in the sample sequencing data; acquiring a preset target sequencing region of the reference genome, dividing the target sequencing region into consecutive gene intervals, and computing read depth statistics from the position information of each sequence fragment to obtain a depth index for each gene interval; obtaining, according to the depth index of each gene interval and a preset preliminary screening scheme for copy number variation genes, a gene set of preliminarily screened copy number variation genes; removing the gene intervals corresponding to that gene set from the sample to be detected and from the negative samples, applying min-max normalization to the depth indexes of the remaining intervals, calculating the distance between the sample to be detected and each negative sample, and selecting the best negative sample according to the distance; and calculating, based on the best negative sample, the relative depth ratio of each gene interval in the sample to be detected, deriving the adjusted relative copy number ratio of each gene interval from the relative depth ratio, deriving the gene-level copy number ratio from the adjusted relative copy number ratios, and determining the copy number variation genes in the sample to be detected from the gene-level copy number ratio.
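The interval division and read depth step recited in this method can be sketched as below. The passage does not define the depth index or the interval length, so this example assumes the depth index is the mean per-base read depth of an interval and takes the interval size as a caller-supplied parameter; coordinates are half-open `(chrom, start, end)` tuples, and all function names are hypothetical.

```python
def split_region(chrom, start, end, size):
    """Split the half-open region [start, end) into consecutive intervals
    of at most `size` bases."""
    return [(chrom, s, min(s + size, end)) for s in range(start, end, size)]

def interval_depths(intervals, reads):
    """Mean per-base read depth of each interval.

    reads: aligned fragments as half-open (chrom, start, end) tuples,
           i.e. the position information obtained from alignment.
    """
    depths = []
    for chrom, i_start, i_end in intervals:
        covered_bases = 0
        for r_chrom, r_start, r_end in reads:
            if r_chrom == chrom:
                # Length of the overlap between the read and the interval.
                covered_bases += max(0, min(i_end, r_end) - max(i_start, r_start))
        depths.append(covered_bases / (i_end - i_start))
    return depths
```

The nested loop is quadratic and only meant to show the arithmetic; for real panel data a coverage tool or an indexed alignment file would normally supply these counts.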
Furthermore, the logic instructions in the memory 302 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, performs the method provided in the foregoing embodiments, the method including, for example: obtaining a sample to be detected, obtaining associated negative samples according to the sample to be detected, and establishing a sample library from the sample to be detected and the negative samples; performing quality control on the sample sequencing data in the sample library, and aligning the quality-controlled sequencing data to a reference genome to obtain position information for each sequence fragment in the sample sequencing data; acquiring a preset target sequencing region of the reference genome, dividing the target sequencing region into consecutive gene intervals, and computing read depth statistics from the position information of each sequence fragment to obtain a depth index for each gene interval; obtaining, according to the depth index of each gene interval and a preset preliminary screening scheme for copy number variation genes, a gene set of preliminarily screened copy number variation genes; removing the gene intervals corresponding to that gene set from the sample to be detected and from the negative samples, applying min-max normalization to the depth indexes of the remaining intervals, calculating the distance between the sample to be detected and each negative sample, and selecting the best negative sample according to the distance; and calculating, based on the best negative sample, the relative depth ratio of each gene interval in the sample to be detected, deriving the adjusted relative copy number ratio of each gene interval from the relative depth ratio, deriving the gene-level copy number ratio from the adjusted relative copy number ratios, and determining the copy number variation genes in the sample to be detected from the gene-level copy number ratio.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for detecting copy number variation of a gene for targeted sequencing, comprising:
obtaining a sample to be detected, obtaining a related negative sample according to the sample to be detected, and establishing a sample library according to the sample to be detected and the negative sample;
performing quality control on the sample sequencing data in the sample library, and aligning the quality-controlled sample sequencing data to a reference genome to obtain position information of each sequence fragment in the sample sequencing data;
acquiring a preset target sequencing region of the reference genome, dividing the target sequencing region into consecutive gene intervals, and performing read depth statistics in combination with the position information of each sequence fragment in the sample sequencing data to obtain a depth index of each gene interval;
obtaining, according to the depth index of each gene interval and in combination with a preset copy number variation gene preliminary screening scheme, a gene set obtained by preliminary screening of copy number variation genes;
removing, according to the gene set obtained by preliminary screening of copy number variation genes, the gene intervals corresponding to the gene set from the sample to be detected and the negative sample, performing min-max normalization of the depth index on the sample to be detected and the negative sample after removal, calculating the distance between the sample to be detected and the negative sample, and selecting the best negative sample according to the distance;
and calculating the relative depth ratio of each gene interval in the sample to be detected based on the best negative sample, calculating the adjusted relative copy number ratio of each gene interval according to the relative depth ratio, calculating the gene-level copy number ratio according to the adjusted relative copy number ratio, and determining the copy number variation genes in the sample to be detected based on the gene-level copy number ratio.
2. The method for detecting gene copy number variation for targeted sequencing according to claim 1, wherein obtaining the gene set obtained by preliminary screening of copy number variation genes according to the depth index of each gene interval and in combination with the preset copy number variation gene preliminary screening scheme comprises:
sorting the gene intervals of the sample to be detected by depth index, and selecting the corresponding target genes one by one in sorted order for gene preliminary screening, wherein the gene preliminary screening comprises: calculating the standard deviation of the depth indexes of the intervals of the remaining genes other than the target gene, and comparing the interval depth of the target gene with the standard deviation;
and determining a gene set obtained by the primary screening of the copy number variation gene according to the comparison result of the primary screening of the gene.
3. The method for detecting gene copy number variation through targeted sequencing according to claim 1, wherein the obtaining of the gene set obtained through copy number variation gene preliminary screening according to the depth index of each gene interval and by combining a preset copy number variation gene preliminary screening scheme comprises:
calculating the GC proportion of each gene interval according to the sequence information of the reference genome;
dividing the GC content range into windows, and identifying, within each window, the screened gene intervals whose depth index in the sample to be detected ranks in the top 5%;
and comparing each gene interval with the screened gene intervals one by one, and when at least 60% of the gene intervals of a target gene belong to the screened gene intervals, determining that the target gene belongs to the gene set obtained by preliminary screening of copy number variation genes.
4. The method for detecting copy number variation of genes for targeted sequencing according to claim 1, wherein the quality control of the sample sequencing data in the sample library comprises:
removing, from the sample sequencing data, adaptor sequences, low-quality sequences at both ends, and sequences that contain a plurality of consecutive N bases or whose length is below a preset threshold.
5. The method of claim 1, wherein the reference genome comprises:
GRCh37, GRCh38.
6. An apparatus for detecting gene copy number variation for targeted sequencing, the apparatus comprising:
the acquisition module is used for acquiring a sample to be detected, acquiring a related negative sample according to the sample to be detected, and establishing a sample library according to the sample to be detected and the negative sample;
the quality control module is used for performing quality control on the sample sequencing data in the sample library, and aligning the quality-controlled sample sequencing data to a reference genome to obtain position information of each sequence fragment in the sample sequencing data;
the interval division module is used for acquiring a preset target sequencing region of the reference genome, dividing the target sequencing region into consecutive gene intervals, and performing read depth statistics in combination with the position information of each sequence fragment in the sample sequencing data to obtain a depth index of each gene interval;
the preliminary screening module is used for obtaining, according to the depth index of each gene interval and in combination with a preset copy number variation gene preliminary screening scheme, a gene set obtained by preliminary screening of copy number variation genes;
the selection module is used for removing, according to the gene set obtained by preliminary screening of copy number variation genes, the gene intervals corresponding to the gene set from the sample to be detected and the negative sample, performing min-max normalization of the depth index on the sample to be detected and the negative sample after removal, calculating the distance between the sample to be detected and the negative sample, and selecting the best negative sample according to the distance;
and the calculating module is used for calculating the relative depth ratio of each gene interval in the sample to be detected based on the best negative sample, calculating the adjusted relative copy number ratio of each gene interval according to the relative depth ratio, calculating the gene-level copy number ratio according to the adjusted relative copy number ratio, and determining the copy number variation genes in the sample to be detected based on the gene-level copy number ratio.
7. The apparatus for detecting copy number variation of a gene for targeted sequencing of claim 6, further comprising:
the sorting module is used for sorting the gene intervals of the sample to be detected by depth index, and selecting the corresponding target genes one by one in sorted order for gene preliminary screening, wherein the gene preliminary screening comprises: calculating the standard deviation of the depth indexes of the intervals of the remaining genes other than the target gene, and comparing the interval depth of the target gene with the standard deviation;
and the determining module is used for determining the gene set obtained by the primary screening of the copy number variation gene according to the comparison result of the primary screening of the gene.
8. The apparatus for detecting copy number variation of a gene for targeted sequencing of claim 6, further comprising:
the second calculation module is used for calculating the GC proportion of each gene interval according to the sequence information of the reference genome;
the dividing module is used for dividing the GC content range into windows, and identifying, within each window, the screened gene intervals whose depth index in the sample to be detected ranks in the top 5%;
and the comparison module is used for comparing each gene interval with the screened gene intervals one by one, and when at least 60% of the gene intervals of a target gene belong to the screened gene intervals, the target gene belongs to the gene set obtained by preliminary screening of copy number variation genes.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps of the method for detecting copy number variation of a gene for targeted sequencing according to any one of claims 1 to 5.
10. A non-transitory computer readable storage medium, having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the method for detecting copy number variation of a gene for targeted sequencing of any one of claims 1 to 5.
CN202211046725.5A 2022-08-30 2022-08-30 Method and device for detecting gene copy number variation aiming at targeted sequencing Pending CN115497557A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211046725.5A CN115497557A (en) 2022-08-30 2022-08-30 Method and device for detecting gene copy number variation aiming at targeted sequencing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211046725.5A CN115497557A (en) 2022-08-30 2022-08-30 Method and device for detecting gene copy number variation aiming at targeted sequencing

Publications (1)

Publication Number Publication Date
CN115497557A true CN115497557A (en) 2022-12-20

Family

ID=84467379

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211046725.5A Pending CN115497557A (en) 2022-08-30 2022-08-30 Method and device for detecting gene copy number variation aiming at targeted sequencing

Country Status (1)

Country Link
CN (1) CN115497557A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117409856A (en) * 2023-10-25 2024-01-16 北京博奥医学检验所有限公司 Mutation detection method, system and storable medium based on single sample to be detected targeted gene region second generation sequencing data
CN117409856B (en) * 2023-10-25 2024-03-29 北京博奥医学检验所有限公司 Mutation detection method, system and storable medium based on single sample to be detected targeted gene region second generation sequencing data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination