A device for detecting copy number variation in FFPE samples
Technical Field
The invention belongs to the field of molecular biology detection, and particularly relates to a device and a method for detecting copy number variation of an FFPE sample.
Background
Tissue specimens prepared by formalin fixation and paraffin embedding are called formalin-fixed paraffin-embedded (FFPE) tissue samples, or FFPE samples for short. FFPE samples can be preserved for a long time, and a large number of tumor tissue sections in particular are stored in this form. FFPE samples are commonly used for clinical pathological examination, tumor gene testing and medical research, and are a valuable resource for elucidating disease mechanisms, discovering therapeutic targets, indicating prognosis and the like.
Copy number variation (CNV) of genes is a clinically important structural variation that is related to the prognosis of various tumors and to sensitivity to targeted drugs. Reliable CNV detection results can provide an important basis for clinical medication, disease evaluation and the like. The CNV detection technologies used in clinical practice are mostly based on PCR, in situ hybridization or immunohistochemistry (e.g., FISH, IHC). These methods can cover only one gene per test, and their detection sensitivity is low.
CNV detection based on a Next-Generation Sequencing (NGS) platform can provide CNV detection results of a plurality of genes at one time on the premise of ensuring detection performance. Most of the traditional NGS platform CNV detection technologies are researched and developed based on a whole genome sequencing technology platform, and with the continuous progress of the NGS technology, the high-depth sequencing technology based on target region capture gradually shows advantages in the application scene of clinical detection.
However, because whole genome sequencing data and target region capture sequencing data differ fundamentally, conventional NGS-based CNV detection methods are not suitable for target region capture sequencing data: their accuracy is difficult to guarantee and their sensitivity needs to be improved. The problem is particularly pronounced for FFPE samples. DNA in FFPE samples is severely fragmented, which affects target gene DNA capture, NGS sequencing and other steps, and ultimately degrades key technical indicators such as the effective depth of the target region. Making use of the low-depth sequencing data produced by low-quality FFPE samples is therefore a major technical challenge.
Disclosure of Invention
In view of the above-described drawbacks of the prior art, an object of the present invention is to provide a detection apparatus and a detection method with higher detection sensitivity for CNV in an FFPE sample.
The inventors of the present invention have made intensive studies to solve the above-mentioned technical problems, and as a result, found that: in the CNV detection method of the FFPE sample, whether to perform reasonable noise reduction processing on the data or not and whether to use an appropriate background library directly affect the detection result, and particularly, such an effect is significant in the capture sequencing. The sensitivity of the FFPE sample CNV detection can be improved by more reasonable and comprehensive noise reduction treatment and application of a dynamic background library, thereby completing the invention.
Namely, the present invention comprises:
an apparatus for detecting copy number variation (which may occur in a genetic region or in a non-genetic region) in an FFPE sample, comprising:
a sequencing data acquisition module for acquiring capture sequencing data from an FFPE sample to be detected and sequencing data from a healthy population sample, the healthy population sample being samples from a plurality of healthy (normal) persons;
a sequence comparison module, connected to the sequencing data acquisition module, for comparing the sequencing data acquired by the sequencing data acquisition module with a reference genome sequence to obtain a comparison result (including information such as a chromosome where each short sequence that can be compared with the reference genome is located, coordinates, matching condition of the short sequence and the reference genome), and calculating a depth value of each site (referring to each site on the genome, but depth values of some sites in captured sequencing may be 0) according to the comparison result;
an early-stage data processing module, connected to the sequence comparison module, for dividing a target region (100 k to 100 M, a whole genome or a key region of interest) into overlapping windows of a certain length (50 to 1000 bp, with 10 to 70% overlap), removing the depth extreme values (maximum and minimum) of the sites in each window, calculating the depth mean or median, and calculating the GC content of the reference genome sequence in each window;
a normalization module, connected to the early-stage data processing module, for normalizing the depth mean or depth median in each window obtained by the early-stage data processing module, and calculating the Z value in each window for the FFPE sample to be detected and for the healthy population sample;
a background library screening module, connected to the normalization module, for screening n healthy person samples (each corresponding to one healthy person) according to the Z values of the FFPE sample to be detected and of the healthy population sample to obtain a background library sample set of n healthy person samples, and then constructing a matrix $X_{m \times n}$ with m rows and n columns from the Z values of the n healthy person samples in m windows;
a data fluctuation elimination module, connected to the background library screening module, for eliminating the inherent data fluctuation caused by capture sequencing;
a GC correction module, connected to the data fluctuation elimination module, for performing GC correction according to the GC content in each window;
and an output module, connected to the GC correction module, for outputting CNV detection results (including, for example, a graph showing CNV detection results, determination results of negative/positive CNV variation, etc.).
The sequencing data acquisition module of the FFPE sample copy number variation detection device acquires sequencing data obtained by sequencing the DNA in the FFPE sample to be detected using a next-generation sequencing (NGS) method. Mainstream NGS platforms generally use sequencing-by-synthesis (SBS) technology for nucleic acid sequencing. Before sequencing, a sequencing library must be constructed from the nucleic acid (DNA or RNA) sample. The basic workflow is as follows: first, the ends of the fragmented DNA are repaired; an 'A' base is then added to the 3' end of each repaired fragment; the DNA fragments are then ligated to DNA adapters containing sequencing primer binding sites; finally, PCR amplification completes the construction of the sequencing library. The specific NGS method is not particularly limited, and any NGS method known to those skilled in the art may be used.
Preferably, the sequencing data is sequencing data obtained using a capture sequencing method;
the target gene for the capture sequencing may vary for different target diseases. The target disease may be, for example, a solid cancer (e.g., gastric cancer, breast cancer, colorectal cancer, lung cancer, etc.).
For example, where the target disease is breast cancer, the target gene may be, for example, the EGFR gene, ERBB2 gene, FGFR1 gene, KIT gene, PIK3CA gene and/or PTEN gene; where the target disease is colorectal cancer, the target gene may be, for example, the EGFR gene, ERBB2 gene, FGFR2 gene, KRAS gene, MET gene and/or PTEN gene; where the target disease is gastric cancer, the target gene may be, for example, the EGFR gene, ERBB2 gene, FGFR1 gene, FGFR2 gene, KRAS gene, MET gene, PIK3CA gene and/or PTEN gene; where the target disease is lung cancer, the target gene may be, for example, the ALK gene, BRAF gene, EGFR gene, ERBB2 gene, FGFR1 gene, KRAS gene, MET gene, PIK3CA gene and/or PTEN gene.
Preferably, the early-stage data processing module divides the window by a sliding window method.
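By way of non-limiting illustration, sliding-window division can be sketched as follows. The sketch assumes half-open coordinates and a step equal to the window length multiplied by (1 - overlap); the function name `sliding_windows` and the default length and overlap are hypothetical choices made for this example only.

```python
def sliding_windows(start, end, length=200, overlap=0.5):
    """Yield half-open (window_start, window_end) intervals covering [start, end).

    length  : window length in bp (assumed value for illustration)
    overlap : fraction of each window shared with the next one (assumed value)
    """
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, round(length * (1 - overlap)))  # distance between window starts
    pos = start
    while pos < end:
        yield pos, min(pos + length, end)  # the last window is truncated at `end`
        if pos + length >= end:
            break
        pos += step


# A 500 bp region, 200 bp windows, 50% overlap.
windows = list(sliding_windows(0, 500, length=200, overlap=0.5))
print(windows)
```

With these assumed settings each site away from the region boundaries is covered by two windows.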
Preferably, the normalization module calculates the Z value in each window of the sample to be detected according to the following formula (1), where $Z_i$ denotes the Z value of the i-th window:
$Z_i = \mathrm{trimScale}(Z_i, Z_i)$ …… (1).
Preferably, formula (2) is defined:
where chr denotes a chromosome, $S_t$ denotes the FFPE sample to be detected, and $S_N$ denotes a healthy population sample;
the background library screening module selects, according to the Z values of the FFPE sample to be detected and of the healthy population samples, the n healthy person samples with the smallest d values to obtain the screened background library sample set $S_1, S_2, S_3, \ldots, S_n$ (n and N are both natural numbers and n < N).
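By way of non-limiting illustration, the selection of the n healthy person samples with the smallest d values can be sketched as follows. The expression of formula (2) is not reproduced in this text, so the sketch takes the distance as a pluggable function; the Euclidean distance used as the default is purely a hypothetical stand-in and is not the d of formula (2). The names `select_background` and `euclidean_distance` are likewise illustrative.

```python
import math


def euclidean_distance(z_test, z_ref):
    # ASSUMPTION: placeholder for the d value of formula (2), which is not
    # reproduced here. Any other distance can be passed to select_background.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z_test, z_ref)))


def select_background(z_test, healthy, n, distance=euclidean_distance):
    """Pick the n healthy samples closest to the test sample.

    z_test  : list of Z values of the sample to be detected (one per window)
    healthy : dict mapping sample name -> list of Z values (same windows)
    Returns (chosen names, matrix with m rows = windows and n columns = samples).
    """
    ranked = sorted(healthy, key=lambda name: distance(z_test, healthy[name]))
    chosen = ranked[:n]
    m = len(z_test)
    matrix = [[healthy[name][i] for name in chosen] for i in range(m)]
    return chosen, matrix


z_test = [0.0, 1.0, -1.0]
healthy = {
    "H1": [0.1, 0.9, -1.1],
    "H2": [2.0, 2.0, 2.0],
    "H3": [0.0, 1.2, -0.8],
    "H4": [-3.0, 0.0, 1.0],
}
chosen, x_background = select_background(z_test, healthy, n=2)
print(chosen)
```

Because the background set is chosen separately for each sample to be detected, the background library is dynamic rather than fixed.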
Preferably, the data fluctuation elimination module performs singular value decomposition on the background library matrix $X_{m \times n}$ to obtain a factor matrix $U_{m \times r}$ with m rows and r columns, where r is the number of factors; the k factors with the largest contribution rates (i.e., the k top-ranked factors; k is generally 4 to 10) are then used for LOESS regression to obtain the residual $Z_p$.
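By way of non-limiting illustration, removing the leading background factors from the Z values of the sample to be detected can be sketched as follows. The sketch assumes NumPy is available and, for brevity, substitutes an ordinary least-squares fit for the LOESS regression described above, so it is a simplification and not the regression of the invention. The function name `remove_background_factors` and the toy data are hypothetical.

```python
import numpy as np


def remove_background_factors(x_background, z_test, k=4):
    """Regress z_test on the top-k left singular vectors of the background matrix.

    x_background : (m, n) array of Z values, m windows by n background samples
    z_test       : (m,) array of Z values of the sample to be detected
    Returns the residual (m,) after removing the fitted background component.
    NOTE: ordinary least squares is used here in place of LOESS (assumption).
    """
    # Singular values are returned in descending order, so the first k columns
    # of u are the factors with the largest contribution.
    u, s, _vt = np.linalg.svd(x_background, full_matrices=False)
    k = min(k, u.shape[1])
    factors = u[:, :k]
    coef, *_ = np.linalg.lstsq(factors, z_test, rcond=None)
    return z_test - factors @ coef


# Toy data: every background sample shares one systematic pattern.
pattern = np.array([1.0, -1.0, 0.5, -0.5, 2.0, -2.0])
x_background = np.column_stack([pattern * scale for scale in (0.8, 1.0, 1.2)])
z_test = 1.1 * pattern
z_test[2] += 3.0  # simulated copy number signal in window 2
z_p = remove_background_factors(x_background, z_test, k=1)
print(int(np.argmax(np.abs(z_p))))
```

In the toy data the shared pattern is removed, and the window carrying the simulated signal has the largest residual.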
Preferably, the GC correction module performs LOESS-regression-based GC correction on $Z_p$ according to the GC content in each window to obtain the residual $Z_{pg}$.
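By way of non-limiting illustration, a LOESS-based GC correction can be sketched in plain Python as follows. The sketch assumes a local linear fit with tricube weights and a neighbourhood fraction `frac` of 0.5; these settings and the function names `loess_fit` and `gc_correct` are hypothetical choices, not parameters fixed by the invention.

```python
def loess_fit(x, y, frac=0.5):
    """Local linear regression with tricube weights; returns fitted y at each x."""
    n = len(x)
    span = min(n, max(2, round(frac * n)))  # neighbours used for each local fit
    fitted = []
    for x0 in x:
        dist = [abs(xi - x0) for xi in x]
        h = sorted(dist)[span - 1] or 1e-12  # local bandwidth (guard against 0)
        w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dist]
        sw = sum(w)  # at least 1, because x0 itself has weight 1
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 1e-15 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def gc_correct(z_p, gc, frac=0.5):
    """Return the residual of z_p after removing its LOESS trend against GC content."""
    trend = loess_fit(gc, z_p, frac)
    return [z - t for z, t in zip(z_p, trend)]


gc = [0.30 + 0.03 * i for i in range(10)]   # GC content per window
z_p = [2.0 * g + 1.0 for g in gc]           # pure GC-driven trend
z_pg = gc_correct(z_p, gc)
print(max(abs(r) for r in z_pg))
```

In this toy input the values depend only on GC content, so the residual is numerically zero; a real copy number change would remain in the residual.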
Preferably, the FFPE sample copy number variation detection apparatus further comprises:
a data quality detection module, connected to the sequencing data acquisition module and the sequence comparison module, for performing quality control on the sequencing data obtained by the sequencing data acquisition module. Quality control includes, but is not limited to, removing low-quality short sequences, removing short sequences with a high N content, and removing short sequences containing Adapter sequence, and finally compiling the relevant quality control statistics.
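By way of non-limiting illustration, the three filters and the final statistics can be sketched as follows. The thresholds (mean Phred quality of 20, N fraction of 0.1), the Phred+33 quality encoding, the example adapter prefix and the decision to discard rather than trim adapter-containing reads are all assumptions made for this example; the function names are hypothetical.

```python
from collections import Counter

# ASSUMPTION: example adapter prefix; replace with the adapter actually used.
ADAPTER = "AGATCGGAAGAGC"


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def qc_filter(reads, adapter=ADAPTER, min_mean_q=20, max_n_frac=0.1):
    """Filter (name, sequence, quality) reads and count why reads were dropped."""
    kept, stats = [], Counter()
    for name, seq, qual in reads:
        stats["total"] += 1
        if not seq or not qual:
            stats["empty"] += 1
        elif mean_phred(qual) < min_mean_q:
            stats["low_quality"] += 1
        elif seq.upper().count("N") / len(seq) > max_n_frac:
            stats["high_n"] += 1
        elif adapter in seq.upper():
            stats["adapter"] += 1
        else:
            kept.append((name, seq, qual))
            stats["passed"] += 1
    return kept, stats


reads = [
    ("r1", "ACGTACGTAC", "IIIIIIIIII"),      # good read
    ("r2", "ACGTACGTAC", "##########"),      # low quality
    ("r3", "ANNTACGTAC", "IIIIIIIIII"),      # 20% N
    ("r4", "ACGT" + ADAPTER, "I" * 17),      # contains adapter
]
kept, stats = qc_filter(reads)
print([name for name, _, _ in kept], dict(stats))
```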
In addition, the present invention further comprises:
a method for detecting copy number variation (which may occur in a genetic region or in a non-genetic region) in an FFPE sample, comprising:
a sequencing data acquisition step of acquiring capture sequencing data from the FFPE sample to be detected and sequencing data from a healthy population sample, the healthy population sample being samples from a plurality of healthy persons;
a sequence comparison step, comparing the sequencing data obtained in the sequencing data acquisition step with a reference genome sequence to obtain a comparison result (including, for example, information such as a chromosome where each short sequence that can be compared with the reference genome is located, coordinates, and matching conditions of the short sequences and the reference genome), and calculating a depth value of each site (referring to each site on the genome, but depth values of some sites in the captured sequencing may be 0) according to the comparison result;
an early-stage data processing step of dividing a target region (100 k to 100 M, a whole genome or a key region of interest) into overlapping windows of a certain length (50 to 1000 bp, with 10 to 70% overlap), removing the depth extreme values (maximum and minimum) of the sites in each window, calculating the depth mean or median, and calculating the GC content of the reference genome sequence in each window;
a normalization step of normalizing the depth mean or depth median in each window obtained in the early-stage data processing step, and calculating the Z value in each window for the FFPE sample to be detected and for the healthy population sample;
a background library screening step of screening n healthy person samples (each corresponding to one healthy person) according to the Z values of the FFPE sample to be detected and of the healthy population sample to obtain a background library sample set, and then constructing a matrix $X_{m \times n}$ with m rows and n columns from the Z values of the n healthy person samples in m windows;
a data fluctuation elimination step of eliminating the inherent data fluctuation caused by capture sequencing;
a GC correction step of performing GC correction according to the GC content in each window; and
an output step of outputting the CNV detection result (including, for example, a graph showing the CNV detection result, a determination of negative/positive CNV variation, and the like).
The sequencing data acquisition step of the FFPE sample copy number variation detection method acquires sequencing data obtained by sequencing the DNA in the FFPE sample to be detected using a next-generation sequencing (NGS) method. Mainstream NGS platforms generally use sequencing-by-synthesis (SBS) technology for nucleic acid sequencing. Before sequencing, a sequencing library must be constructed from the nucleic acid (DNA or RNA) sample. The basic workflow is as follows: first, the ends of the fragmented DNA are repaired; an 'A' base is then added to the 3' end of each repaired fragment; the DNA fragments are then ligated to DNA adapters containing sequencing primer binding sites; finally, PCR amplification completes the construction of the sequencing library. The specific NGS method is not particularly limited, and any NGS method known to those skilled in the art may be used.
Preferably, the sequencing data is sequencing data obtained using a capture sequencing method;
the target gene for the capture sequencing may vary for different target diseases. The target disease may be, for example, a solid cancer (e.g., gastric cancer, breast cancer, colorectal cancer, lung cancer, etc.).
For example, where the target disease is breast cancer, the target gene may be, for example, the EGFR gene, ERBB2 gene, FGFR1 gene, KIT gene, PIK3CA gene and/or PTEN gene; where the target disease is colorectal cancer, the target gene may be, for example, the EGFR gene, ERBB2 gene, FGFR2 gene, KRAS gene, MET gene and/or PTEN gene; where the target disease is gastric cancer, the target gene may be, for example, the EGFR gene, ERBB2 gene, FGFR1 gene, FGFR2 gene, KRAS gene, MET gene, PIK3CA gene and/or PTEN gene; where the target disease is lung cancer, the target gene may be, for example, the ALK gene, BRAF gene, EGFR gene, ERBB2 gene, FGFR1 gene, KRAS gene, MET gene, PIK3CA gene and/or PTEN gene.
Preferably, the early-stage data processing step divides the windows by a sliding window method.
Preferably, the normalization step calculates the Z value in each window of the sample to be detected according to the following formula (1), where $Z_i$ denotes the Z value of the i-th window:
$Z_i = \mathrm{trimScale}(Z_i, Z_i)$ …… (1).
Preferably, formula (2) is defined:
where chr denotes a chromosome, $S_t$ denotes the FFPE sample to be detected, and $S_N$ denotes a healthy population sample;
the background library screening step selects, according to the Z values of the FFPE sample to be detected and of the healthy population samples, the n healthy person samples with the smallest d values to obtain the screened background library sample set $S_1, S_2, S_3, \ldots, S_n$ (n and N are both natural numbers and n < N).
Preferably, the data fluctuation elimination step performs singular value decomposition on the background library matrix $X_{m \times n}$ to obtain a factor matrix $U_{m \times r}$ with m rows and r columns, where r is the number of factors; the k factors with the largest contribution rates (i.e., the k top-ranked factors; k is generally 4 to 10) are then used for LOESS regression to obtain the residual $Z_p$.
Preferably, the GC correction step performs LOESS-regression-based GC correction on $Z_p$ according to the GC content in each window to obtain the residual $Z_{pg}$.
Preferably, the copy number variation detection method further comprises:
a data quality detection step of performing quality control on the sequencing data obtained in the sequencing data acquisition step. Quality control includes, but is not limited to, removing low-quality short sequences, removing short sequences with a high N content, and removing short sequences containing Adapter sequence, and finally compiling the relevant quality control statistics.
For preferred embodiments of the steps above, reference is made to the preferred embodiments of the corresponding modules described earlier.
According to the present invention, a detection apparatus and a detection method with higher detection sensitivity for the FFPE sample CNV are provided.
Drawings
FIG. 1 is a schematic diagram of an apparatus for detecting copy number variation of an FFPE sample according to the present invention.
FIG. 2 is a graph showing the results of CNV detection of multiple genes of breast cancer in example 1.
Detailed description of the invention
Technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art, and in case of conflict, the definitions in this specification shall control.
Definitions
Reference genome: a complete set of haploid sequences carried by a cell or organism, including the complete set of genes and spacer sequences.
Comparison: generally refers to sequence alignment, i.e., the process of arranging two or more sequences according to certain rules in order to determine their similarity or homology.
Depth value: for a given site on the genome, the number of short sequences covering that site according to the comparison result.
Window (sliding window): generally refers to a fixed-length region on the genome.
Background library: a sample library composed of samples from a plurality of healthy persons (generally ≥ 20).
Capture sequencing: the process of capturing DNA fragments from a specific region (region of interest) of the genome with pre-designed probes and then performing NGS sequencing on the captured DNA fragments.
NGS (high-throughput sequencing): high-throughput sequencing, also known as "next-generation" sequencing technology, is characterized by the ability to sequence hundreds of thousands to millions of DNA molecules in parallel in a single run, and by its short read length.
trimScale(w, v): w is a value to be normalized and v is a data set.
a. Remove a certain percentage of the highest and lowest values of v to obtain the trimmed data set $v'$.
b. Compute the mean $\mu$ and standard deviation $\sigma$ of $v'$.
c. Compute $(w - \mu)/\sigma$ as the final result.
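By way of non-limiting illustration, trimScale can be sketched as follows. The sketch assumes that the final result is the usual standard score of w against the trimmed data, that the same fraction (10% here) is trimmed from each end, and that the sample standard deviation is used; the name `trim_scale` and the parameter `trim_fraction` are hypothetical.

```python
from statistics import mean, stdev


def trim_scale(w, v, trim_fraction=0.1):
    """Standard score of w against v after trimming both tails of v.

    trim_fraction : fraction removed from EACH end of the sorted data (assumed).
    """
    data = sorted(v)
    k = int(len(data) * trim_fraction)      # number of values dropped per tail
    trimmed = data[k:len(data) - k]
    mu = mean(trimmed)
    sigma = stdev(trimmed)                  # sample standard deviation (assumed)
    if sigma == 0:
        raise ValueError("trimmed data has zero variance")
    return (w - mu) / sigma


depths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # 100 is an outlier window
print(trim_scale(5.5, depths))              # value at the trimmed mean
```

Trimming keeps the outlier value 100 from inflating the mean and standard deviation, which would otherwise compress the Z values of the remaining windows.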
SVD (singular value decomposition): SVD is an important matrix decomposition in linear algebra, and is a generalization of unitary diagonalization of a normal matrix in matrix analysis. The method has important application in the fields of signal processing, statistics and the like. The effect is to map the data set into a low dimensional space. The eigenvalues of the data set (characterized by singular values in SVD) are arranged according to importance, the dimension reduction process is a process of discarding unimportant eigenvectors, and the space formed by the remaining eigenvectors is the space after dimension reduction.
Examples
The present invention will be described in more detail with reference to examples. It should be understood that the embodiments described herein are intended to illustrate, but not limit the invention.
Example 1
The CNV condition of the FFPE sample of the tissue of a female breast cancer patient is detected by adopting the device for detecting the copy number variation of the FFPE sample.
1.1 extraction of DNA from FFPE samples
The FFPE sample DNA was obtained by performing extraction procedures using the GeneRead DNA FFPE Kit (QIAGEN Co.) according to the manual.
1.2 sample disruption
The FFPE sample DNA was sheared with a Bioruptor sonicator (30 cycles, 30 s ON/30 s OFF) into fragments of about 200 bp to obtain fragmented DNA.
1.3 End Repair
(1) The required reagents were taken out in advance from the kit stored at -20 ℃; the amounts for a single sample are shown in Table 1.
TABLE 1
(2) End repair reaction: after the DNA sample was added, the 1.5mL centrifuge tube was placed in a Thermomixer and incubated at 20 ℃ for 30 minutes. After the reaction, the DNA in the reaction system was purified and recovered using 1.8× nucleic acid purification magnetic beads and dissolved in 32 μL of EB.
1.4 A-Tailing
(1) The required reagents were taken out in advance from the kit stored at -20 ℃; the amounts for a single sample are shown in Table 2:
TABLE 2
(2) A-tailing reaction: after 32 μL of the DNA recovered in the previous purification step was added, the 1.5mL centrifuge tube was placed in a Thermomixer and incubated at 37 ℃ for 30 minutes. The DNA in the reaction system was purified and recovered using 1.8× nucleic acid purification magnetic beads and dissolved in 18 μL of EB.
1.5 Adapter Ligation
(1) The required reagents were taken out in advance from the kit stored at -20 ℃; the amounts for a single sample are shown in Table 3:
TABLE 3
(2) Adapter ligation reaction: after 18 μL of the DNA recovered in the previous purification step was added, the sample tube was incubated in a Thermomixer at 20 ℃ for 15 minutes. The DNA in the reaction system was purified and recovered using 1.8× nucleic acid purification magnetic beads and dissolved in 30 μL of EB.
1.6 PCR reaction
(1) The required reagents were taken out from the kit stored at -20 ℃, and the PCR reaction system was prepared in a 2mL PCR tube:
TABLE 4
(2) The PCR program was set as follows:
After the reaction, the sample was taken out promptly and stored in a refrigerator at 4 ℃, and the instrument was exited or switched off as required.
(3) The DNA in the reaction system was purified and recovered using 0.9× nucleic acid purification magnetic beads, and the purified library was dissolved in 20 μL of ddH2O. The library was quantified with Qubit and analyzed on the Agilent 2100.
1.7 Breast cancer target region Capture chip library hybridization
(1) In this experiment, buffers for providing an ionic environment for the hybridization capture reaction, and washing solutions and rinsing solutions for eluting physical adsorption or nonspecific hybridization were commercially available.
(2) Preparing the hybridization library: the DNA library to be hybridized was thawed on ice and a total mass of 1 μg was taken (this DNA library is referred to as the sample library in the subsequent steps).
(3) Preparing the Ann primer pool: 1000 pmol each of the tag primer In1 (100 μM) corresponding to the sample library Index and the common primer (1000 μM) were mixed (this mixture is referred to as the Ann primer pool in the subsequent steps).
(4) Preparing the hybridization sample: 5 μL of COT DNA (Human COT-1 DNA, Life Technologies, 1mg/mL), 1 μg of the sample library and the Ann primer pool were added to a 1.5mL EP tube. The EP tube containing the sample library pool/COT DNA/Ann primer pool was sealed with sealing film and placed in a vacuum apparatus until completely dry.
(5) Dissolving the hybridization sample: the following were added to the dry powder of sample library pool/COT DNA/Ann primer pool:
7.5 μL of 2× hybridization buffer
3 μL of hybridization component A
(6) After mixing well, the mixture was denatured for 10 minutes on a pre-prepared 95 ℃ heating module.
(7) The mixture was transferred to a 0.2mL flat-capped PCR tube containing 4.5. mu.L of the capture chip. Vortex well for 3 seconds and place the hybridization sample mixture on a 47 ℃ heating block for 16 hours. The temperature of the heat cover of the heating module needs to be set to 57 ℃, and the product after hybridization needs to be subjected to subsequent elution and recovery operation.
(8) The 10× washing solutions (I, II and III), the 10× rinsing solution and the 2.5× magnetic bead washing solution were diluted to 1× working solutions.
TABLE 5
(9) The following reagents were preheated in a 47 ℃ heating module:
400 μL of 1× rinsing solution
100 μL of 1× washing solution I
1.8 preparation of affinity adsorption magnetic beads
(1) Streptavidin magnetic beads (Dynabeads M-280Streptavidin, hereinafter referred to as magnetic beads) were equilibrated at room temperature for 30 minutes, and then the beads were vortexed thoroughly for 15 seconds.
(2) 100 μL of magnetic beads was dispensed into a 1.5mL centrifuge tube, and the tube was placed on a magnetic rack. After about 5 minutes, the supernatant was carefully discarded, 1× magnetic bead washing solution at twice the initial volume of the magnetic beads was added, and the mixture was vortexed for 10 seconds. The tube was returned to the magnetic rack to adsorb the magnetic beads, and once the solution was clear the supernatant was aspirated and discarded. The beads were washed twice in total.
(3) After washing, the magnetic bead washing solution was aspirated, and the magnetic beads were resuspended by vortexing in 1× magnetic bead washing solution equal to the initial volume of the magnetic beads and transferred to a 0.2mL PCR tube. The PCR tube was placed on a magnetic rack until the solution cleared, and the supernatant was then aspirated and discarded.
1.9 binding and rinsing of DNA and affinity adsorption magnetic beads
(1) The hybridized sample library was transferred into the 0.2mL PCR tube containing the affinity adsorption magnetic beads and mixed by vortexing.
(2) The 0.2mL PCR tube was placed in a 47 ℃ heating block for 45 minutes and vortexed once every 15 minutes to bind the DNA to the beads.
(3) After the 45-minute incubation, 100 μL of 1× washing solution I preheated to 47 ℃ was added to the 15 μL of captured DNA sample and vortexed for 10 seconds. The entire contents of the 0.2mL PCR tube were transferred to a 1.5mL centrifuge tube. The 1.5mL centrifuge tube was placed on a magnetic rack to adsorb the magnetic beads, and the supernatant was discarded.
(4) The 1.5mL centrifuge tube was removed from the magnetic rack and 200 μL of 1× rinsing solution preheated to 47 ℃ was added. The mixture was pipetted up and down 10 times (working quickly so that the reagent and sample did not fall below 47 ℃) and then placed on a 47 ℃ heating module for 5 minutes. This step was repeated, for a total of two washes with 1× rinsing solution at 47 ℃. The 1.5mL centrifuge tube was placed on a magnetic rack to adsorb the magnetic beads, and the supernatant was discarded.
(5) 200 μL of room-temperature 1× washing solution I was added to the 1.5mL centrifuge tube and vortexed for 2 minutes; the tube was placed on a magnetic rack to adsorb the magnetic beads, and the supernatant was discarded. 200 μL of room-temperature 1× washing solution II was added and vortexed for 1 minute; the tube was placed on a magnetic rack to adsorb the magnetic beads, and the supernatant was discarded. 200 μL of room-temperature 1× washing solution III was added and vortexed for 30 seconds; the tube was placed on a magnetic rack to adsorb the magnetic beads, and the supernatant was discarded.
(6) The 1.5mL centrifuge tube was removed from the magnetic rack, and 45 μL of PCR water was added to resuspend the magnetic beads carrying the captured sample.
1.10 PCR amplification of captured DNA
(1) The post-capture PCR mix was prepared according to the following table, and vortexed and mixed well after preparation. Both the enriching primer F and the enriching primer R were purchased from Yingchi Weiji Co.
(2) The amplification program of magnetic bead adsorption DNA PCR was set as follows:
(3) Recovery and purification of the hybridization-capture PCR product: the DNA in the reaction system was purified and recovered using 0.9× nucleic acid purification magnetic beads, and the purified library was dissolved in 30 μL of ddH2O.
1.11 library quantitation
The library was analyzed on a 2100 Bioanalyzer (Agilent)/LabChip GX (Caliper) and quantified by QPCR, and the library concentration was recorded.
1.12 on-machine sequencing of libraries
The constructed library was sequenced with NextSeq 550 AR.
1.13 data processing and analysis
The sequencing results obtained in section 1.12 were processed and analyzed using the FFPE sample copy number variation detection device provided by the invention.
The FFPE sample copy number variation detection apparatus of example 1 includes the following modules.
A sequencing data acquisition module:
acquires the sequencing data obtained by capture sequencing of the breast cancer FFPE sample to be detected using the breast cancer target region capture chip.
The data quality inspection module:
performs data quality inspection on the sequencing data, filtering out short sequences with a low average quality value, short sequences with a high N content and short sequences containing Adapter sequence, to obtain the filtered sequencing data C.
A sequence alignment module:
aligns the short sequences of the filtered sequencing data C to the human reference genome HG19 to obtain comparison result A, and calculates the depth value of each site on the genome from comparison result A to obtain result D.
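By way of non-limiting illustration, the calculation of per-site depth values from a comparison result can be sketched as follows. The sketch assumes each aligned short sequence is reduced to a (chromosome, start, end) tuple with half-open coordinates and ignores gaps and clipping; the function name `site_depths` and the toy coordinates are hypothetical.

```python
def site_depths(alignments, chrom, region_start, region_end):
    """Depth at every site of [region_start, region_end) on one chromosome.

    alignments : iterable of (chromosome, start, end) with half-open intervals
                 (a simplified stand-in for a real comparison result)
    """
    # Difference array: +1 where a read starts covering, -1 where it stops.
    diff = [0] * (region_end - region_start + 1)
    for read_chrom, start, end in alignments:
        if read_chrom != chrom:
            continue
        lo, hi = max(start, region_start), min(end, region_end)
        if lo >= hi:
            continue  # read does not overlap the region
        diff[lo - region_start] += 1
        diff[hi - region_start] -= 1
    depths, running = [], 0
    for delta in diff[:-1]:
        running += delta
        depths.append(running)
    return depths


alignments = [("chr17", 100, 105), ("chr17", 102, 108), ("chr1", 100, 110)]
depth_d = site_depths(alignments, "chr17", 100, 110)
print(depth_d)
```

Sites not covered by any short sequence keep a depth value of 0, as can happen outside the captured regions.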
The early-stage data processing module:
divides the cancer target region into overlapping windows of a certain length, removes the depth extreme values in each window, calculates the depth median, and calculates the GC content of the reference genome sequence in each window to obtain result X.
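By way of non-limiting illustration, the per-window depth summary and GC content can be sketched as follows. The sketch assumes that exactly one maximum and one minimum site depth are removed per window and that ambiguous bases (such as N) are excluded from the GC denominator; both choices and the function names are hypothetical.

```python
from statistics import median


def window_depth(depths, use_median=True):
    """Drop one minimum and one maximum site depth, then summarise the rest."""
    if len(depths) < 3:
        raise ValueError("need at least three sites per window")
    trimmed = sorted(depths)[1:-1]
    return median(trimmed) if use_median else sum(trimmed) / len(trimmed)


def gc_content(sequence):
    """Fraction of G and C among the called bases (A, C, G, T) of a sequence."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


print(window_depth([0, 10, 12, 11, 500]))   # extremes 0 and 500 are removed
print(gc_content("GGCCAATT"))
```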
A normalization module:
combining results X and D, calculates the Z value in each window of the genomic DNA to be detected according to the formula $Z_i = \mathrm{trimScale}(Z_i, Z_i)$.
Background library screening module:
chr denotes a chromosome, $S_t$ denotes the sample to be detected, and $S_n$ denotes a background library sample.
According to the Z values of the genomic DNA to be detected and of the background library, the background library samples with the smallest d values are selected to obtain the screened background library sample set $S_1, S_2, S_3, \ldots, S_n$.
The matrix $X_{m \times n}$ is constructed from the Z values of the n samples over m windows and kept as the background library.
The data fluctuation elimination module:
performs singular value decomposition on the background library matrix $X_{m \times n}$ to obtain a factor matrix $U_{m \times n}$ with m rows and n columns, where n is the number of factors. LOESS regression is performed with the several factors having the largest contribution rates to obtain the residual $Z_p$.
A GC correction module:
performs LOESS-regression-based GC correction on $Z_p$ according to the GC content in the m windows to obtain the residual $Z_{pg}$.
An output module:
displays a graph of the CNV detection result.
The detection result is shown in FIG. 2, where each small dot is the $Z_{pg}$ value of one window. Copy number increases were detected for both the PIK3CA and ERBB2 genes.
1.14 validation of results
RNA was extracted from fresh primary tumor tissue of the same patient and reverse transcribed, and QPCR was used to verify whether the expression levels of the PIK3CA and ERBB2 genes were increased; the verification result was consistent with the detection result of 1.13. The detection device provided by the invention can thus successfully detect copy number variation in FFPE samples.
Industrial applicability
The FFPE sample CNV detection device and the detection method can obviously improve the detection sensitivity of CNV.