CN113299342A - Copy number variation detection method and device based on chip data - Google Patents


Publication number
CN113299342A
Authority
CN
China
Prior art keywords
window
detected
fluorescence signal
signal intensity
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110673034.7A
Other languages
Chinese (zh)
Other versions
CN113299342B (en
Inventor
卢娜如
张军
孔令印
梁波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Basecare Medical Appliances Co ltd
Original Assignee
Suzhou Basecare Medical Appliances Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Basecare Medical Appliances Co ltd
Priority to CN202110673034.7A (patent CN113299342B)
Publication of CN113299342A
Application granted
Publication of CN113299342B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Abstract

The application relates to a copy number variation detection method and a detection device based on chip data. The method compares the to-be-detected sequence of fluorescence signal intensity data of each window on the genome of the sample to be detected with the reference sequence of fluorescence signal intensity data of the corresponding window of a reference sample in a pre-established reference library, to obtain a comparison data sequence of the fluorescence signal intensity of each window. It then detects the significance differences among the windows and, based on them, determines a target window and a corresponding region to be determined in the sample to be detected. Finally, it determines the abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescence signal intensity of each window and a preset variation threshold. In this way the abnormal region can be effectively identified and its boundary can be read directly from the target window corresponding to the abnormal region, without manual intervention, so that high-resolution copy number variation detection is realized.

Description

Copy number variation detection method and device based on chip data
Technical Field
The present application relates to the field of genetic data analysis technologies, and in particular, to a copy number variation detection method, a detection apparatus, a computer device, and a storage medium based on chip data.
Background
Copy Number Variation (CNV) is caused by rearrangement of the genome. It generally refers to an increase or decrease in the copy number of genomic fragments longer than 1 kb, and mainly manifests as deletions and duplications at the sub-microscopic level. In recent years, various techniques have been developed for detecting CNV in the human genome.
Current CNV detection methods mainly comprise CNV detection based on high-throughput sequencing and CNV detection based on chips. Among chip-based methods, the high-density SNP (Single Nucleotide Polymorphism) chip is commonly used. According to the technical principle, SNP chips can be divided into five types: allele-specific hybridization (ASH), allele-specific primer extension (ASPE), single-base chain extension (SBCE), allele-specific cleavage (ASC) and allele-specific ligation (ASL). At present, common algorithms for CNV detection with an SNP chip include PennCNV and cnvPartition. PennCNV extracts the allele fluorescence signal intensities from the SNP chip, integrates the SNP position and SNP allele frequency information, and identifies CNVs with a Hidden Markov Model (HMM) algorithm; cnvPartition identifies CNVs mainly from two indexes, the Log R Ratio (LRR) and the B Allele Frequency (BAF), and gives a confidence value.
However, the existing CNV detection methods for SNP chips mainly rely on semi-automatic analysis in Windows software, and the data can be analyzed only after the raw chip output has been converted into a fixed format with third-party data conversion software. This is cumbersome to operate. In addition, these methods have low sensitivity for mosaicism detection and produce too many false-positive CNV calls, and they cannot accurately give the CNV start and end positions, so the CNV boundary can be determined only by manual review. Therefore, the conventional technology offers no scheme for effectively detecting copy number variation.
Disclosure of Invention
In view of the above, it is necessary to provide a chip data-based copy number variation detection method, a detection apparatus, a computer device, and a storage medium, which can effectively detect copy number variation, in order to solve the problems of copy number variation detection in the conventional technologies.
A method for copy number variation detection based on chip data, the method comprising:
according to gene sequencing data of a sample to be detected, acquiring a sequence to be detected of fluorescence signal intensity data of each window of the sample to be detected on a genome;
comparing the sequence to be detected of the fluorescence signal intensity data with a reference sequence of the fluorescence signal intensity data of a corresponding window of a reference sample on the genome in a pre-established reference library based on each window on the genome to obtain a comparison data sequence of the fluorescence signal intensity of each window;
according to the comparison data sequence of the fluorescence signal intensity of each window, detecting the significance difference among the windows, and determining a target window and a corresponding region to be determined in a sample to be detected based on the significance difference among the windows, wherein the target window is a window corresponding to the initial position and a window corresponding to the end position of the region to be determined in the sample to be detected;
and determining an abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescence signal intensity of each window and a preset variation threshold.
In one embodiment, the preset variation threshold comprises a preset copy number deletion threshold and a preset copy number duplication threshold. Determining an abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescence signal intensity of each window and the preset variation threshold comprises the following steps: acquiring the comparison data average value of the window fluorescence signal intensities in the region to be determined, according to the window corresponding to the start position and the window corresponding to the end position of the region to be determined in the sample to be detected and the comparison data sequence of the fluorescence signal intensity of each window; matching the comparison data average value of the region to be determined against the preset copy number deletion threshold and the preset copy number duplication threshold; if the comparison data average value matches the preset copy number deletion threshold, determining that the region to be determined is an abnormal region with copy number deletion; if the comparison data average value matches the preset copy number duplication threshold, determining that the region to be determined is an abnormal region with copy number duplication; and if the comparison data average value matches neither the preset copy number duplication threshold nor the preset copy number deletion threshold, determining that the region to be determined is not an abnormal region.
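The threshold matching described in this embodiment can be sketched as follows. This is only an illustration: the function name `classify_region`, the default threshold values, and the reading of "matches" as "mean at or below the deletion threshold" or "mean at or above the duplication threshold" are assumptions, not values or rules stated in the application.

```python
from statistics import mean


def classify_region(log2_ratios, start, end,
                    del_threshold=-0.3, dup_threshold=0.25):
    """Classify the region spanning windows start..end (inclusive).

    log2_ratios   -- comparison data sequence, one value per window
    del_threshold -- hypothetical copy number deletion threshold
    dup_threshold -- hypothetical copy number duplication threshold
    """
    # Comparison data average value of the windows inside the region.
    region_mean = mean(log2_ratios[start:end + 1])
    if region_mean <= del_threshold:      # assumed meaning of "matches"
        return "deletion", region_mean
    if region_mean >= dup_threshold:      # assumed meaning of "matches"
        return "duplication", region_mean
    return "normal", region_mean


if __name__ == "__main__":
    ratios = [0.01, -0.52, -0.61, -0.55, 0.03]
    print(classify_region(ratios, 1, 3))
```

The start and end indices play the role of the target windows, so the reported region boundaries come directly from the window indices.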
In one embodiment, after determining the abnormal region in the sample to be detected, the method further includes: acquiring the copy number of the abnormal region in the sample to be detected; and calculating the mosaic ratio of the abnormal region according to the copy number of the abnormal region and the preset variation threshold.
In one embodiment, calculating the mosaic ratio of the abnormal region includes: if the abnormal region is determined to be an abnormal region with copy number duplication, calculating a first difference between the copy number of the abnormal region and 2, and determining the quotient of the first difference and the copy number duplication threshold as the mosaic ratio of the abnormal region; if the abnormal region is determined to be an abnormal region with copy number deletion, calculating a second difference between 2 and the copy number of the abnormal region, and determining the quotient of the second difference and the copy number deletion threshold as the mosaic ratio of the abnormal region.
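A minimal sketch of this calculation is given below. The function name `mosaic_ratio` and the default thresholds of 1.0 (that is, thresholds expressed in copy-number units, where a full single-copy gain or loss gives a ratio of 1) are assumptions made for illustration; the application does not state the threshold values.

```python
def mosaic_ratio(copy_number, kind, dup_threshold=1.0, del_threshold=1.0):
    """Return the mosaic ratio of an abnormal region.

    copy_number -- estimated copy number of the abnormal region
    kind        -- "duplication" or "deletion"
    thresholds  -- hypothetical values, assumed to be in copy-number units
    """
    if kind == "duplication":
        # First difference: copy number minus 2, divided by the duplication threshold.
        return (copy_number - 2) / dup_threshold
    if kind == "deletion":
        # Second difference: 2 minus copy number, divided by the deletion threshold.
        return (2 - copy_number) / del_threshold
    raise ValueError("kind must be 'duplication' or 'deletion'")


if __name__ == "__main__":
    print(mosaic_ratio(2.5, "duplication"))
    print(mosaic_ratio(1.7, "deletion"))
```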
In one embodiment, the acquiring, according to gene sequencing data of a sample to be detected, a sequence to be detected of fluorescence signal intensity data of each window on a genome of the sample to be detected includes: performing data conversion on the gene sequencing data of the sample to be detected to obtain the fluorescence signal intensity to be detected of each site corresponding to the gene sequencing data of the sample to be detected; carrying out window segmentation on the genome of the sample to be detected according to the window division condition of the reference sample, and acquiring the intensity of the fluorescence signal to be detected of each window according to the intensity of the fluorescence signal to be detected of the locus in each window; performing local linear regression correction on the intensity of the fluorescence signal to be detected of the window in the sample to be detected to obtain corrected intensity data of the fluorescence signal to be detected of each window; and generating a to-be-detected sequence of fluorescence signal intensity data based on the to-be-detected fluorescence signal intensity data of each window in the to-be-detected sample.
In one embodiment, the comparing the to-be-detected sequence of the fluorescence signal intensity data with the reference sequence of the fluorescence signal intensity data of the window corresponding to the reference sample on the genome in the pre-established reference library based on each window on the genome to obtain the comparison data sequence of the fluorescence signal intensity of each window includes: for each window on the genome, extracting fluorescence signal intensity data to be detected of the window from the sequence to be detected, and extracting reference fluorescence signal intensity data of the corresponding window from the reference sequence; acquiring the ratio of the fluorescence signal intensity data to be detected of the window to the reference fluorescence signal intensity data, and determining the logarithm of the ratio with the base of 2 as comparison data of the fluorescence signal intensity of the window; and acquiring an alignment data sequence of the fluorescence signal intensity of each window based on the alignment data of the fluorescence signal intensity of each window on the genome.
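The per-window comparison described here is a base-2 log ratio, which can be sketched as below. The function name `log2_ratio_sequence` is hypothetical, and the sketch assumes both sequences are already aligned window by window and contain strictly positive intensities.

```python
import math


def log2_ratio_sequence(test_windows, reference_windows):
    """Return log2(test / reference) for each pair of corresponding windows."""
    if len(test_windows) != len(reference_windows):
        raise ValueError("test and reference must cover the same windows")
    # math.log2 raises ValueError for zero or negative ratios, so windows
    # are assumed to have been filtered to positive intensities beforehand.
    return [math.log2(t / r) for t, r in zip(test_windows, reference_windows)]


if __name__ == "__main__":
    print(log2_ratio_sequence([2.0, 1.0, 0.5], [1.0, 1.0, 1.0]))
```

A value near 0 indicates that the sample and the reference have similar intensity in that window, while negative and positive values point towards loss and gain respectively.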
In one embodiment, detecting the significance difference among the windows according to the comparison data sequence of the fluorescence signal intensity of each window, and determining the target window and the corresponding region to be determined in the sample to be detected based on the significance difference among the windows, includes: identifying the significance differences among the windows with a statistical algorithm according to the comparison data sequence of the fluorescence signal intensity of each window; acquiring a plurality of regions based on the significance differences among the windows, wherein no significance difference exists among the windows within each region; identifying the significance differences among the plurality of regions, and if no significance difference exists between adjacent regions, merging those adjacent regions; if a significance difference exists between adjacent regions, obtaining the region to be determined, together with the window corresponding to its start position and the window corresponding to its end position, based on the adjacent regions with the significance difference.
An apparatus for detecting copy number variation based on chip data, the apparatus comprising:
the fluorescence signal intensity data acquisition module is used for acquiring a to-be-detected sequence of fluorescence signal intensity data of each window of a to-be-detected sample on a genome according to gene sequencing data of the to-be-detected sample;
the comparison processing module is used for comparing the sequence to be detected of the fluorescence signal intensity data with a reference sequence of the fluorescence signal intensity data of a corresponding window of a reference sample on the genome in a pre-established reference library based on each window on the genome to obtain a comparison data sequence of the fluorescence signal intensity of each window;
the region identification module is used for detecting significance differences among the windows according to comparison data sequences of fluorescence signal intensities of the windows, determining a target window and a corresponding region to be determined in a sample to be detected based on the significance differences among the windows, wherein the target window is a window corresponding to the starting position and a window corresponding to the ending position of the region to be determined in the sample to be detected;
and the abnormal region determining module is used for determining the abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescence signal intensity of each window and a preset variation threshold.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the method as described above when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as set forth above.
According to the above copy number variation detection method, device, computer equipment and storage medium, the to-be-detected sequence of fluorescence signal intensity data of each window on the genome of the sample to be detected is obtained from the gene sequencing data of the sample to be detected. Based on each window on the genome, the to-be-detected sequence is compared with the reference sequence of fluorescence signal intensity data of the corresponding window of the reference sample in the pre-established reference library, to obtain the comparison data sequence of the fluorescence signal intensity of each window. The significance differences among the windows are then detected, and the target window and the corresponding region to be determined in the sample to be detected are determined based on them. The abnormal region in the sample to be detected can then be effectively identified according to the region to be determined, the comparison data sequence of the fluorescence signal intensity of each window and the preset variation threshold. The boundary of the abnormal region can be read directly from the target window corresponding to the abnormal region, without manual intervention, so that high-resolution copy number variation detection is realized.
Drawings
FIG. 1 is a schematic flow chart of a copy number variation detection method according to an embodiment;
FIG. 2 is a schematic flow chart showing the steps of obtaining fluorescence signal intensity data for the sequence to be detected in one embodiment;
FIG. 3 is a schematic flow chart illustrating the comparison step of fluorescence signal intensity data according to an embodiment;
FIG. 4 is a schematic flow chart of a copy number variation detection method according to another embodiment;
FIG. 5 is a block diagram of an embodiment of an apparatus for detecting copy number variation;
FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In an embodiment, as shown in fig. 1, a copy number variation detection method based on chip data is provided. This embodiment is illustrated by applying the method to a terminal; it is to be understood that the method may also be applied to a server, or to a system including a terminal and a server, in which case it is implemented through interaction between the terminal and the server. The terminal can be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer or portable wearable device, and the server can be implemented as an independent server or as a server cluster formed by a plurality of servers. In this embodiment, the method includes the following steps:
Step 102, acquiring the to-be-detected sequence of fluorescence signal intensity data of the sample to be detected in each window on the genome according to the gene sequencing data of the sample to be detected.
The sample to be detected is the sample on which copy number variation detection is to be performed. The gene sequencing data refers to the raw data of the corresponding sample output from the chip. The windows are obtained by dividing the chip SNP (Single Nucleotide Polymorphism) sites on each chromosome of the whole genome according to a preset rule. The sequence to be detected is obtained based on the fluorescence signal intensity data of each window on the genome of the corresponding sample. In this embodiment, the gene sequencing data of the sample to be detected is processed to obtain the fluorescence signal intensity data of the sample to be detected in each window on the genome, and the to-be-detected sequence of fluorescence signal intensity data is generated based on the fluorescence signal intensity data of each window and the order of the windows on the genome.
Step 104, comparing, based on each window on the genome, the to-be-detected sequence of the fluorescence signal intensity data with the reference sequence of the fluorescence signal intensity data of the corresponding window of the reference sample in a pre-established reference library, to obtain a comparison data sequence of the fluorescence signal intensity of each window.
The reference sample is a normal sample, i.e. a sample without copy number variation. The reference library is a sample database established based on the reference sample. In this embodiment, the reference library stores the reference sequence of fluorescence signal intensity data of each window of the reference sample on the genome. Specifically, the gene sequencing data of the reference sample is processed in advance to obtain the fluorescence signal intensity data of the reference sample in each window on the genome; a reference sequence of the fluorescence signal intensity data is generated based on the fluorescence signal intensity data of each window and the order of the windows on the genome; and the reference library is established based on this reference sequence, so that the reference sequence can be retrieved directly from the reference library whenever copy number variation of a sample to be detected needs to be detected. It can be understood that the data processing and window division of the sample to be detected can be performed in the same manner as for the reference sample, so that the windows of the sample to be detected correspond one to one with the windows of the reference sample on the genome and the data have the same dimension. The comparison refers to the process of comparing the fluorescence signal intensity data of each window of the sample to be detected with the fluorescence signal intensity data of the corresponding window of the reference sample under specified conditions.
Specifically, in this embodiment, the fluorescence signal intensity data of the sample to be detected corresponding to each window is compared with the fluorescence signal intensity data of the corresponding window in the reference sample based on each window on the genome, so as to obtain comparison data of the fluorescence signal intensity of each window, and further obtain a comparison data sequence of the corresponding fluorescence signal intensity according to the comparison data of the fluorescence signal intensity of each window on the genome and the arrangement sequence of each window on the genome.
Step 106, detecting the significance difference among the windows according to the comparison data sequence of the fluorescence signal intensity of each window, and determining a target window and a corresponding region to be determined in the sample to be detected based on the significance difference among the windows.
The region to be determined is a statistically distinct region identified based on the significance differences among the windows; the windows within each region to be determined do not differ significantly from one another. The target windows are the windows corresponding to the start position and the end position of the region to be determined in the sample to be detected. Specifically, in this embodiment, the significance differences among the windows on the genome are detected based on the comparison data sequence of the fluorescence signal intensity of each window, and the windows are then merged accordingly, that is, consecutive windows without a significance difference are merged, so as to obtain one or more merged regions to be determined.
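The merging of adjacent regions without a significance difference can be sketched as below. The application does not name the statistical algorithm, so this sketch swaps in Welch's t statistic with a fixed critical value `t_crit` as a stand-in; the function names, the starting partition `boundaries`, and the value 3.0 are all assumptions. Each starting region needs at least two windows so that a variance can be computed.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Absolute Welch t statistic for two samples (each with >= 2 values)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if mean(a) == mean(b) else math.inf
    return abs(mean(a) - mean(b)) / se


def merge_regions(ratios, boundaries, t_crit=3.0):
    """Merge adjacent regions that do not differ significantly.

    ratios     -- comparison data sequence, one log2 ratio per window
    boundaries -- sorted (start, end) inclusive window index pairs
    t_crit     -- hypothetical cut-off standing in for a significance test
    """
    merged = [boundaries[0]]
    for start, end in boundaries[1:]:
        prev_start, prev_end = merged[-1]
        left = ratios[prev_start:prev_end + 1]
        right = ratios[start:end + 1]
        if welch_t(left, right) < t_crit:
            merged[-1] = (prev_start, end)   # no significant difference: merge
        else:
            merged.append((start, end))      # significant: keep the boundary
    return merged


if __name__ == "__main__":
    ratios = [0.0, 0.02, -0.01, 0.01, -0.5, -0.52, -0.48, -0.51]
    print(merge_regions(ratios, [(0, 1), (2, 3), (4, 5), (6, 7)]))
```

The first and last index of each merged region correspond to the target windows, i.e. the start and end windows of a region to be determined.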
Step 108, determining the abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescence signal intensity of each window and a preset variation threshold.
Since copy number variation is generally expressed as copy number deletion or copy number duplication, the variation threshold preset in this embodiment may include a copy number deletion threshold and a copy number duplication threshold. In this embodiment, a comparison data average value of the intensities of the fluorescence signals of the windows in the to-be-determined region is obtained based on the comparison data sequence of the intensities of the fluorescence signals of the windows, and the comparison data average value of the intensities of the fluorescence signals of the windows in the to-be-determined region is matched with a preset variation threshold, if the comparison data average value is matched with the preset variation threshold, the corresponding to-be-determined region in the to-be-detected sample is determined to be an abnormal region, otherwise, the corresponding to-be-determined region in the to-be-detected sample is determined not to be the abnormal region.
In the above copy number variation detection method, the to-be-detected sequence of fluorescence signal intensity data of the sample to be detected in each window on the genome is obtained according to the gene sequencing data of the sample to be detected. Based on each window on the genome, this sequence is compared with the reference sequence of fluorescence signal intensity data of the corresponding window of the reference sample in the pre-established reference library, to obtain the comparison data sequence of the fluorescence signal intensity of each window. The significance differences among the windows are detected, and the target window and the corresponding region to be determined in the sample to be detected are determined based on them. The abnormal region in the sample to be detected is then effectively identified according to the region to be determined, the comparison data sequence of the fluorescence signal intensity of each window and the preset variation threshold. The boundary of the abnormal region can be read directly from the target window corresponding to the abnormal region, so that high-resolution copy number variation detection is realized without manual intervention.
In an embodiment, as shown in fig. 2, obtaining the to-be-detected sequence of the fluorescence signal intensity data of each window on the genome of the to-be-detected sample according to the gene sequencing data of the to-be-detected sample may specifically include:
Step 202, performing data conversion on the gene sequencing data of the sample to be detected to obtain the to-be-detected fluorescence signal intensity of each site corresponding to the gene sequencing data of the sample to be detected.
The gene sequencing data refers to the raw data of the corresponding sample output from the chip. In this embodiment, in order to distinguish the fluorescence signal intensity of the sample to be detected from that of the reference sample, the fluorescence signal intensity of each site of the sample to be detected is referred to as the to-be-detected fluorescence signal intensity, and the fluorescence signal intensity of each site of the reference sample is referred to as the reference fluorescence signal intensity.
Further, because the input amount differs from sample to sample, the R values (the to-be-detected fluorescence signal intensities) at the sites of each sample to be detected need to be normalized, so that the R values of all samples are mapped to the same dimension and errors caused by differing dimensions are eliminated. In addition, differences between samples, for example in chip, reagent or manual operation, may also cause measurement errors, so the normalized fluorescence signal intensity at the same site may fluctuate, and sites with larger fluctuations will affect the subsequent CNV detection. Therefore, in this embodiment, statistics that reflect the dispersion of the data at the same site across samples, such as the Standard Deviation (SD) and the Evenness, are calculated, the sites on the chip are filtered according to these statistics, and sites whose fluorescence signal intensity is always 0 in the samples are removed, so as to eliminate these differences and improve the accuracy of subsequent detection.
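The normalization and site filtering described above can be sketched as follows. The application does not specify the normalization formula or the filtering cut-offs, so dividing by the sample mean, the population standard deviation as the dispersion statistic, and the cut-off `max_sd=0.3` are all assumptions; the Evenness statistic is not modelled here.

```python
from statistics import mean, pstdev


def normalize_sample(r_values):
    """Scale one sample's R values so that their mean is 1 (assumed scheme)."""
    m = mean(r_values)
    return [r / m for r in r_values]


def keep_sites(samples, max_sd=0.3):
    """Return the indices of the sites that pass filtering.

    samples -- normalized R-value lists, one per sample, same site order
    max_sd  -- hypothetical cut-off on the per-site standard deviation
    """
    kept = []
    for j in range(len(samples[0])):
        column = [sample[j] for sample in samples]
        if all(v == 0 for v in column):
            continue                      # signal is always 0: remove the site
        if pstdev(column) > max_sd:
            continue                      # fluctuates too much across samples
        kept.append(j)
    return kept


if __name__ == "__main__":
    samples = [[1.0, 1.0, 0.0, 0.2], [1.1, 0.9, 0.0, 1.8]]
    print(keep_sites(samples))
```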
Step 204, performing window division on the genome of the sample to be detected according to the window division condition of the reference sample, and acquiring the to-be-detected fluorescence signal intensity of each window according to the to-be-detected fluorescence signal intensity of each site in the window.
Because the to-be-detected sequence of fluorescence signal intensity data is compared, window by window on the genome, with the reference sequence of fluorescence signal intensity data of the corresponding window of the reference sample, the window division conditions need to be kept consistent. The genome of the sample to be detected should therefore be divided into windows under the same condition as the reference sample. Specifically, the window division condition of the reference sample requires that the number of sites in one window be ≤ 10 or that the window size be ≤ 18 kb. In order to improve the comparison accuracy and avoid errors caused by differing chip data, the gene sequencing data of the sample to be detected and the gene sequencing data of the reference sample acquired in the above steps should be detection data from the same chip. In this embodiment, the genome of the sample to be detected is divided into windows according to the window division condition of the reference sample, giving N windows; the mean of the to-be-detected fluorescence signal intensities of all sites in each window is then calculated and taken as the to-be-detected fluorescence signal intensity of that window, denoted B. The calculation formula is as follows:
$$B_i = \frac{1}{n}\sum_{j=1}^{n} R_j$$
where $B_i$ is the R value of the i-th window (i.e., the to-be-detected fluorescence signal intensity of the i-th window), $n$ is the number of sites contained in the window, and $R_j$ is the R value of the j-th site in the window (i.e., the to-be-detected fluorescence signal intensity of the j-th site).
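The window division and the per-window mean can be sketched as below. The greedy rule used here (close a window once it holds 10 sites or once the next site would stretch it beyond 18 kb) is one possible reading of the stated condition and is an assumption, as are the function names.

```python
from statistics import mean


def split_windows(positions, max_sites=10, max_span=18_000):
    """Greedily group sorted site positions (bp, one chromosome) into windows.

    Returns a list of windows, each a list of site indices.
    """
    windows, current = [], []
    for idx, pos in enumerate(positions):
        too_many = len(current) == max_sites
        too_wide = bool(current) and pos - positions[current[0]] > max_span
        if too_many or too_wide:
            windows.append(current)
            current = []
        current.append(idx)
    if current:
        windows.append(current)
    return windows


def window_intensities(r_values, windows):
    """B_i = mean of the R values of the sites in window i."""
    return [mean([r_values[j] for j in w]) for w in windows]


if __name__ == "__main__":
    positions = [100, 5000, 9000, 30000, 31000]
    wins = split_windows(positions, max_sites=2)
    print(wins)
    print(window_intensities([1.0, 3.0, 2.0, 4.0, 6.0], wins))
```

To keep the windows of the sample and the reference aligned, the same window list would be computed once and reused for both.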
Step 206, performing local linear regression correction on the to-be-detected fluorescence signal intensity of the windows in the sample to be detected, to obtain corrected to-be-detected fluorescence signal intensity data of each window.
In the genome sequence, characteristics such as GC content (the proportion of G and C bases among the A, T, G and C bases of a genome sequence) influence the amplification efficiency of the sample sequence and the binding efficiency between target sequence and probe during chip detection, so that the signal intensity ratio across the genome follows a nonlinear distribution. To eliminate the influence of such genomic characteristics, this embodiment applies GC correction to the fluorescence signal intensity value of each window. The principle of GC correction is that the fluorescence signal intensities of regions with the same GC content are affected by GC content in the same way, whereas the fluorescence signal intensities of regions with different GC content should in theory differ; on this basis, multiplying the fluorescence signal intensity of windows with the same GC content by a fixed weight corrects the fluorescence signal intensity to a linear level.
Specifically, the GC content of the chromosome sequence in each window of the sample to be detected is calculated and denoted $C_1, C_2, C_3, \dots, C_N$. It is understood that, given the fluorescence signal intensity of each window, the GC content of each window can be obtained from the corresponding conversion. Let the median of the fluorescence signal intensities of the N windows (i.e., all windows in the sample to be detected) be M. Windows whose GC content is equal to within one thousandth are classified into one category, and the resulting categories are denoted $G_1, G_2, G_3, \dots, G_M$ (here the subscript M is the number of categories, not the median M). Suppose a GC content category $G_i$ comprises n windows with fluorescence signal intensity values $B_1, B_2, B_3, \dots, B_n$, and let $M_i$ be the median of these n values. On this basis, a weight is assigned to each window, thereby eliminating the effect of GC content on the fluorescence signal intensity. After GC correction, the corrected fluorescence signal intensity data to be detected of every window on all chromosomes is obtained. The correction can be performed by the following formula:
$$B_{GC} = B \times \frac{M}{M_i}$$
wherein M is the median of the fluorescence signal intensity values of all windows in the sample to be detected, $M_i$ is the median of the fluorescence signal intensities of all windows whose GC content is $G_i$, B is the fluorescence signal intensity of a window with GC content $G_i$ before correction, and $B_{GC}$ is the fluorescence signal intensity data of that window after correction. In particular, the value of M depends on the parity of the number N of windows in the sample to be detected. With the window intensities sorted in ascending order as $B_{(1)} \le B_{(2)} \le \dots \le B_{(N)}$, when N is odd,
$$M = B_{\left(\frac{N+1}{2}\right)}$$
and when N is even,
$$M = \frac{B_{\left(\frac{N}{2}\right)} + B_{\left(\frac{N}{2}+1\right)}}{2}$$
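The GC correction can be illustrated with a short sketch. It assumes that the weight applied to a window is the overall median M divided by the median $M_i$ of its GC category, and that "equal by thousandth" means equal after rounding the GC content to three decimal places; the function name `gc_correct` is hypothetical.

```python
from statistics import median


def gc_correct(intensities, gc_contents):
    """Return GC-corrected window intensities B_GC = B * M / M_i.

    intensities: fluorescence signal intensity B of each window
    gc_contents: GC content (0..1) of each window, same order
    Assumes all category medians are non-zero (intensities are positive).
    """
    overall_median = median(intensities)  # M, over all N windows

    # Group windows into GC categories G_i (GC rounded to 3 decimals).
    categories = {}
    for b, gc in zip(intensities, gc_contents):
        categories.setdefault(round(gc, 3), []).append(b)

    # M_i: median intensity within each GC category.
    category_median = {g: median(vals) for g, vals in categories.items()}

    return [
        b * overall_median / category_median[round(gc, 3)]
        for b, gc in zip(intensities, gc_contents)
    ]
```

`statistics.median` already handles the odd and even cases described above (middle value, or mean of the two middle values), so no separate parity branch is needed in the sketch.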
Step 208, generating a sequence to be detected of fluorescence signal intensity data based on the corrected fluorescence signal intensity data to be detected of each window in the sample to be detected.
Specifically, after the local linear regression correction of the above step yields corrected fluorescence signal intensity data for each window in the sample to be detected, the sequence to be detected of fluorescence signal intensity data is obtained by arranging the corrected data of the windows in the order in which the windows lie on the genome.
In one embodiment, as shown in fig. 3, comparing the sequence to be detected of the fluorescence signal intensity data with the reference sequence of the fluorescence signal intensity data of the corresponding window of the reference sample on the genome in the pre-established reference library based on each window on the genome to obtain the comparison data sequence of the fluorescence signal intensity of each window, specifically includes:
step 302, for each window on the genome, extracting fluorescence signal intensity data to be detected of the window from the sequence to be detected, and extracting reference fluorescence signal intensity data of the corresponding window from the reference sequence.
In this embodiment, the reference sequence of fluorescence signal intensity data of the corresponding windows of the reference sample on the genome is obtained by processing the gene sequencing data of the reference sample with the same method as shown in fig. 2. The gene sequencing data of the sample to be detected and of the reference sample involved in the comparison come from the same chip, and both samples are divided into windows under the same window division condition, so that the data of each window can be compared directly. Specifically, in this embodiment, for each window on the genome, the fluorescence signal intensity data to be detected of the window is extracted from the sequence to be detected, the reference fluorescence signal intensity data of the corresponding window is extracted from the reference sequence, and the window data are compared in the subsequent steps.
Step 304, acquiring the ratio of the fluorescence signal intensity data to be detected of the window to the reference fluorescence signal intensity data, and determining the base-2 logarithm of the ratio as the comparison data of the fluorescence signal intensity of the window.
For each window on the genome, the corrected fluorescence signal intensity data to be detected is compared with the reference fluorescence signal intensity data of the same window, and the comparison data of the fluorescence signal intensity of each window is calculated. Specifically, the comparison data $CNV_i$ of the fluorescence signal intensity of each window (also known as the $\log_2 RR$ value) can be determined by the following formula:
$$CNV_i = \log_2\left(\frac{TR_i}{R_i}\right)$$
wherein i is the index of the window, $TR_i$ is the corrected fluorescence signal intensity data of the i-th window in the sample to be detected, $R_i$ is the corrected reference fluorescence signal intensity data of the i-th window in the reference sample, and $CNV_i$ is the comparison data of the fluorescence signal intensity of the i-th window.
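As a small illustration of step 304, the per-window comparison data can be computed as the base-2 logarithm of the ratio between the corrected intensity of the sample to be detected and that of the reference sample. The function name `log2_ratio_sequence` is hypothetical, and the sketch assumes both sequences list the same windows in the same genomic order with strictly positive intensities.

```python
import math


def log2_ratio_sequence(test_intensities, reference_intensities):
    """Return CNV_i = log2(TR_i / R_i) for every window i, in genome order."""
    if len(test_intensities) != len(reference_intensities):
        raise ValueError("both samples must use the same window division")
    return [
        math.log2(tr / r)
        for tr, r in zip(test_intensities, reference_intensities)
    ]
```

A value of 0 means the window of the sample to be detected has the same intensity as the reference window; positive values indicate a relative gain and negative values a relative loss.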
Step 306, acquiring a comparison data sequence of the fluorescence signal intensity of each window based on the comparison data of the fluorescence signal intensity of each window on the genome.
Specifically, comparing the fluorescence signal intensity data to be detected of each window in the sample to be detected with the reference fluorescence signal intensity data of the corresponding window in the reference sample, as in the above steps, yields the comparison data of the fluorescence signal intensity of each window on the genome; arranging these comparison data in the order of the windows on the genome then gives the comparison data sequence of the fluorescence signal intensity of each window.
In one embodiment, detecting significant differences between the windows according to the comparison data sequence of the fluorescence signal intensity of each window, and determining a target window and a corresponding region to be determined in the sample to be detected based on the significant differences between the windows, includes: identifying significant differences between the windows by a statistical algorithm according to the comparison data sequence of the fluorescence signal intensity of each window; obtaining, based on the significant differences between the windows, a plurality of regions that differ significantly from one another, wherein no significant difference exists between the windows within each region; identifying significant differences between the plurality of regions, and, if no significant difference exists between adjacent regions, merging those adjacent regions; and, if a significant difference exists between adjacent regions, obtaining the region to be determined, together with the window corresponding to its start position and the window corresponding to its end position, based on the adjacent regions that differ significantly.
For example, after the comparison data sequence of the fluorescence signal intensity of each window is obtained through the above steps, the $\log_2 RR$ values of the windows on each chromosome can be statistically analyzed with an algorithm such as Circular Binary Segmentation (CBS) or a Hidden Markov Model (HMM) to identify regions with statistical differences. It is understood that the regions can be divided according to the following principle: there is no significant difference between the windows within a region, and there is a significant difference between two adjacent windows located in different regions.
Further, because of the large number of windows on the genome, algorithms such as CBS or HMM identify many statistically different regions, and such a region does not by itself indicate the presence of a CNV. Therefore, to obtain the final complete CNV regions, this embodiment further processes the identified statistically different regions with a small-segment CNV merging algorithm. Specifically, the small-segment CNV merging algorithm tests, based on the standard Z-test principle, whether two consecutive regions differ significantly; if they do not, the regions are merged; if they do, a breakpoint position (i.e., the position of a target window) is determined. The next breakpoint is determined by the same procedure, and once two breakpoints have been determined, the region between them is analyzed as a region to be determined in which the chromosome may be abnormal.
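The merging step can be sketched as below. The text only states that a standard Z-test is used, so the details here are assumptions made for illustration: a two-sample z statistic on the segment means with population variances, a two-sided critical value of 1.96, and the hypothetical function names `z_statistic` and `merge_segments`. The input is taken to be the list of segments already produced by CBS or an HMM, each segment being the list of $\log_2 RR$ values of its windows.

```python
import math
from statistics import mean, pvariance


def z_statistic(a, b):
    """Two-sample z statistic for the difference between the means of a and b."""
    diff = mean(a) - mean(b)
    se = math.sqrt(pvariance(a) / len(a) + pvariance(b) / len(b))
    if se == 0.0:
        # Degenerate case: both segments are constant.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def merge_segments(segments, z_crit=1.96):
    """Merge adjacent segments whose means do not differ significantly.

    Each boundary that survives in the returned list is a breakpoint.
    """
    if not segments:
        return []
    merged = [list(segments[0])]
    for seg in segments[1:]:
        if abs(z_statistic(merged[-1], seg)) < z_crit:
            merged[-1].extend(seg)    # no significant difference: merge
        else:
            merged.append(list(seg))  # significant difference: breakpoint
    return merged
```

Each pair of consecutive breakpoints in the result delimits one region to be determined, whose first and last windows are the target windows.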
In one embodiment, the preset variation threshold comprises a preset copy number deletion threshold and a preset copy number duplication threshold; determining an abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescence signal intensity of each window and the preset variation threshold specifically comprises the following steps:
obtaining the mean of the comparison data of the fluorescence signal intensity of the windows in the region to be determined, according to the window corresponding to the start position and the window corresponding to the end position of the region to be determined in the sample to be detected and the comparison data sequence of the fluorescence signal intensity of each window. Specifically, the window corresponding to the start position and the window corresponding to the end position identify which windows belong to the region to be determined; the comparison data of those windows are then extracted from the comparison data sequence, and their mean is calculated. The mean of the comparison data of the region to be determined is then matched against the preset copy number deletion threshold and the preset copy number duplication threshold: if it matches the preset copy number deletion threshold, the region to be determined is determined to be an abnormal region with a copy number deletion; if it matches the preset copy number duplication threshold, the region to be determined is determined to be an abnormal region with a copy number duplication; and if it matches neither threshold, the region to be determined is determined not to be an abnormal region. In this way, whether a chromosome abnormality exists in the sample to be detected can be judged.
Specifically, in this embodiment, when the mean of the comparison data of the windows in the region to be determined is greater than the preset copy number duplication threshold, the mean is determined to match the preset copy number duplication threshold, and the region to be determined is determined to be an abnormal region with a copy number duplication; when the mean is smaller than the preset copy number deletion threshold, the mean is determined to match the preset copy number deletion threshold, and the region to be determined is determined to be an abnormal region with a copy number deletion.
Further, when the mean of the comparison data of the region to be determined is greater than zero and smaller than the preset copy number duplication threshold, the mosaic ratio of the duplication can be calculated according to the subsequent steps; likewise, when the mean is greater than the preset copy number deletion threshold and smaller than zero, the mosaic ratio of the deletion can be calculated according to the subsequent steps.
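The threshold matching described above can be sketched as follows. The function name `classify_region` and the returned labels are hypothetical, the thresholds are passed in by the caller because the text does not give their values, and the handling of means exactly equal to zero or to a threshold (reported here as "normal") is an assumption, since the text does not specify the boundary cases.

```python
from statistics import mean


def classify_region(log2rr, start, end, dup_threshold, del_threshold):
    """Classify the region spanning windows start..end (inclusive).

    log2rr: comparison data sequence (log2RR value of every window)
    dup_threshold: preset copy number duplication threshold (> 0)
    del_threshold: preset copy number deletion threshold (< 0)
    """
    region_mean = mean(log2rr[start:end + 1])
    if region_mean > dup_threshold:
        return "duplication"
    if region_mean < del_threshold:
        return "deletion"
    if 0 < region_mean < dup_threshold:
        return "mosaic duplication candidate"
    if del_threshold < region_mean < 0:
        return "mosaic deletion candidate"
    return "normal"
```

For instance, with illustrative thresholds of 0.4 and -0.6 (values chosen only for this example), a region whose windows average 0.6 would be reported as a duplication, while one averaging 0.2 would be passed on to the mosaic ratio calculation.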
In one embodiment, as shown in fig. 4, after determining the abnormal region in the sample to be detected, the method further comprises:
step 402, obtaining the copy number of the abnormal area in the sample to be detected.
In this embodiment, after an abnormal region is determined to exist in the sample to be detected, the mosaic ratio of the abnormal region may be further calculated, which improves the sensitivity of CNV detection and avoids false-positive CNV identification. Specifically, after an abnormal region is determined to exist in the sample to be detected, the copy number of the abnormal region may be obtained by the following formula:
$$CN = 2 \times 2^{\overline{\log_2 RR}}$$
wherein $\overline{\log_2 RR}$ is the mean of the $\log_2 RR$ values of the windows in the abnormal region, and CN is the copy number of the abnormal region.
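A minimal sketch of the copy number conversion follows. It assumes the reference sample carries two copies of every region, so that the intensity ratio equals CN / 2 and the copy number is recovered as 2 × 2 raised to the mean $\log_2 RR$ of the region; the function name `copy_number` is hypothetical.

```python
from statistics import mean


def copy_number(region_log2rr):
    """Estimate the copy number of a region from its windows' log2RR values.

    Assumption: the reference sample is diploid (2 copies), hence
    CN = 2 * 2 ** mean(log2RR).
    """
    return 2 * 2 ** mean(region_log2rr)
```

Under this assumption a mean of 0 gives CN = 2 (normal), a mean of 1 gives CN = 4, and a mean of -1 gives CN = 1.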
Step 404, calculating the mosaic ratio of the abnormal region according to the copy number of the abnormal region and the preset variation threshold.
In this embodiment, if the abnormal region is determined to be an abnormal region with a copy number duplication, a first difference between the copy number of the abnormal region and 2 is calculated, and the quotient of the first difference and the copy number duplication threshold is determined as the mosaic ratio of the abnormal region; if the abnormal region is determined to be an abnormal region with a copy number deletion, a second difference between 2 and the copy number of the abnormal region is calculated, and the quotient of the second difference and the copy number deletion threshold is determined as the mosaic ratio of the abnormal region.
For example, if the copy number of the abnormal region is CN and the preset variation threshold includes a copy number duplication threshold a and a copy number deletion threshold b, then for an abnormal region with a copy number duplication, the mosaic ratio of the duplicated region is:
$$\text{mosaic ratio} = \frac{CN - 2}{a}$$
and for an abnormal region with a copy number deletion, the mosaic ratio of the deleted region is:
$$\text{mosaic ratio} = \frac{2 - CN}{b}$$
In this embodiment, calculating the mosaic ratio of the abnormal region effectively improves the detection rate, avoids false positives, and greatly improves the accuracy of CNV identification.
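The two quotients can be written as a short helper. The function name `mosaic_ratio` and the `kind` labels are hypothetical, and the sketch assumes that the thresholds a and b are supplied as positive values on the copy number scale, which the text does not state explicitly.

```python
def mosaic_ratio(cn, kind, dup_threshold, del_threshold):
    """Return the mosaic ratio of an abnormal region.

    cn: copy number CN of the abnormal region
    kind: "duplication" or "deletion"
    dup_threshold: copy number duplication threshold a (assumed > 0)
    del_threshold: copy number deletion threshold b (assumed > 0)
    """
    if kind == "duplication":
        return (cn - 2) / dup_threshold   # first difference / a
    if kind == "deletion":
        return (2 - cn) / del_threshold   # second difference / b
    raise ValueError("kind must be 'duplication' or 'deletion'")
```

With a = b = 1, for example, a duplicated region with CN = 2.5 and a deleted region with CN = 1.5 would both yield a mosaic ratio of 0.5.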
It should be understood that although the steps in the flow charts of figs. 1-4 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited to the order shown, and the steps may be performed in other orders. Moreover, at least some of the steps in figs. 1-4 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a chip data-based copy number variation detection apparatus, including: a fluorescence signal intensity data acquisition module 501, a comparison processing module 502, a region identification module 503 and an abnormal region determination module 504, wherein:
a fluorescence signal intensity data acquisition module 501, configured to acquire, according to gene sequencing data of a sample to be detected, a sequence to be detected of fluorescence signal intensity data of each window on a genome of the sample to be detected;
a comparison processing module 502, configured to compare, based on each window on the genome, the to-be-detected sequence of the fluorescence signal intensity data with a reference sequence of fluorescence signal intensity data of a corresponding window on the genome of a reference sample in a pre-established reference library, so as to obtain a comparison data sequence of fluorescence signal intensity of each window;
the region identification module 503 is configured to detect a significance difference between the windows according to the comparison data sequence of the fluorescence signal intensity of each window, and determine a target window in the sample to be detected and a corresponding region to be determined based on the significance difference between the windows, where the target window is a window corresponding to a start position and a window corresponding to an end position of the region to be determined in the sample to be detected;
an abnormal region determining module 504, configured to determine an abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescence signal intensities of the windows, and a preset variation threshold.
In one embodiment, the preset variation threshold comprises a preset copy number deletion threshold and a preset copy number duplication threshold; the abnormal region determining module is specifically configured to: obtain the mean of the comparison data of the fluorescence signal intensity of the windows in the region to be determined, according to the window corresponding to the start position and the window corresponding to the end position of the region to be determined in the sample to be detected and the comparison data sequence of the fluorescence signal intensity of each window; match the mean of the comparison data of the region to be determined against the preset copy number deletion threshold and the preset copy number duplication threshold; if the mean matches the preset copy number deletion threshold, determine that the region to be determined is an abnormal region with a copy number deletion; if the mean matches the preset copy number duplication threshold, determine that the region to be determined is an abnormal region with a copy number duplication; and if the mean matches neither the preset copy number duplication threshold nor the preset copy number deletion threshold, determine that the region to be determined is not an abnormal region.
In one embodiment, the apparatus further includes a mosaic ratio calculation module, configured to: obtain the copy number of the abnormal region in the sample to be detected after the abnormal region in the sample to be detected is determined; and calculate the mosaic ratio of the abnormal region according to the copy number of the abnormal region and the preset variation threshold.
In one embodiment, the mosaic ratio calculation module is specifically configured to: if the abnormal region is determined to be an abnormal region with a copy number duplication, calculate a first difference between the copy number of the abnormal region and 2, and determine the quotient of the first difference and the copy number duplication threshold as the mosaic ratio of the abnormal region; if the abnormal region is determined to be an abnormal region with a copy number deletion, calculate a second difference between 2 and the copy number of the abnormal region, and determine the quotient of the second difference and the copy number deletion threshold as the mosaic ratio of the abnormal region.
In one embodiment, the fluorescence signal intensity data acquisition module is specifically configured to: performing data conversion on the gene sequencing data of the sample to be detected to obtain the fluorescence signal intensity to be detected of each site corresponding to the gene sequencing data of the sample to be detected; carrying out window segmentation on the genome of the sample to be detected according to the window division condition of the reference sample, and acquiring the intensity of the fluorescence signal to be detected of each window according to the intensity of the fluorescence signal to be detected of the locus in each window; performing local linear regression correction on the intensity of the fluorescence signal to be detected of the window in the sample to be detected to obtain corrected intensity data of the fluorescence signal to be detected of each window; and generating a to-be-detected sequence of fluorescence signal intensity data based on the to-be-detected fluorescence signal intensity data of each window in the to-be-detected sample.
In one embodiment, the comparison processing module is specifically configured to: for each window on the genome, extract the fluorescence signal intensity data to be detected of the window from the sequence to be detected, and extract the reference fluorescence signal intensity data of the corresponding window from the reference sequence; acquire the ratio of the fluorescence signal intensity data to be detected of the window to the reference fluorescence signal intensity data, and determine the base-2 logarithm of the ratio as the comparison data of the fluorescence signal intensity of the window; and acquire a comparison data sequence of the fluorescence signal intensity of each window based on the comparison data of the fluorescence signal intensity of each window on the genome.
In one embodiment, the region identification module is specifically configured to: identifying the significant difference among the windows by adopting a statistical algorithm according to the comparison data sequence of the fluorescence signal intensity of each window; acquiring a plurality of regions with significant differences based on the significant differences among the windows, wherein the significant differences do not exist among the windows in each region; identifying significance differences among the multiple regions, and if no significance difference exists between adjacent regions, merging the adjacent regions without significance differences; if the adjacent regions have significance differences, obtaining the region to be determined, and the window corresponding to the starting position and the window corresponding to the ending position of the region to be determined based on the adjacent regions with significance differences.
For specific limitations of the copy number variation detection apparatus, reference may be made to the above limitations of the copy number variation detection method, which are not repeated here. The modules in the copy number variation detection apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. Each module may be embedded, in hardware form, in a processor of the computer device or be independent of it, or may be stored, in software form, in a memory of the computer device, so that the processor can call and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a method of detecting copy number variation. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
according to gene sequencing data of a sample to be detected, acquiring a sequence to be detected of fluorescence signal intensity data of each window of the sample to be detected on a genome;
comparing the sequence to be detected of the fluorescence signal intensity data with a reference sequence of the fluorescence signal intensity data of a corresponding window of a reference sample on the genome in a pre-established reference library based on each window on the genome to obtain a comparison data sequence of the fluorescence signal intensity of each window;
according to the comparison data sequence of the fluorescence signal intensity of each window, detecting the significance difference among the windows, and determining a target window and a corresponding region to be determined in a sample to be detected based on the significance difference among the windows, wherein the target window is a window corresponding to the initial position and a window corresponding to the end position of the region to be determined in the sample to be detected;
and determining an abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescence signal intensity of each window and a preset variation threshold.
In one embodiment, the preset variation threshold comprises a preset copy number deletion threshold and a preset copy number duplication threshold; the processor, when executing the computer program, further performs the steps of: obtaining the mean of the comparison data of the fluorescence signal intensity of the windows in the region to be determined, according to the window corresponding to the start position and the window corresponding to the end position of the region to be determined in the sample to be detected and the comparison data sequence of the fluorescence signal intensity of each window; matching the mean of the comparison data of the region to be determined against the preset copy number deletion threshold and the preset copy number duplication threshold; if the mean matches the preset copy number deletion threshold, determining that the region to be determined is an abnormal region with a copy number deletion; if the mean matches the preset copy number duplication threshold, determining that the region to be determined is an abnormal region with a copy number duplication; and if the mean matches neither the preset copy number duplication threshold nor the preset copy number deletion threshold, determining that the region to be determined is not an abnormal region.
In one embodiment, the processor, when executing the computer program, further performs the steps of: obtaining the copy number of the abnormal region in the sample to be detected; and calculating the mosaic ratio of the abnormal region according to the copy number of the abnormal region and the preset variation threshold.
In one embodiment, the processor, when executing the computer program, further performs the steps of: if the abnormal region is determined to be an abnormal region with a copy number duplication, calculating a first difference between the copy number of the abnormal region and 2, and determining the quotient of the first difference and the copy number duplication threshold as the mosaic ratio of the abnormal region; if the abnormal region is determined to be an abnormal region with a copy number deletion, calculating a second difference between 2 and the copy number of the abnormal region, and determining the quotient of the second difference and the copy number deletion threshold as the mosaic ratio of the abnormal region.
In one embodiment, the processor, when executing the computer program, further performs the steps of: performing data conversion on the gene sequencing data of the sample to be detected to obtain the fluorescence signal intensity to be detected of each site corresponding to the gene sequencing data of the sample to be detected; carrying out window segmentation on the genome of the sample to be detected according to the window division condition of the reference sample, and acquiring the intensity of the fluorescence signal to be detected of each window according to the intensity of the fluorescence signal to be detected of the locus in each window; performing local linear regression correction on the intensity of the fluorescence signal to be detected of the window in the sample to be detected to obtain corrected intensity data of the fluorescence signal to be detected of each window; and generating a to-be-detected sequence of fluorescence signal intensity data based on the to-be-detected fluorescence signal intensity data of each window in the to-be-detected sample.
In one embodiment, the processor, when executing the computer program, further performs the steps of: for each window on the genome, extracting the fluorescence signal intensity data to be detected of the window from the sequence to be detected, and extracting the reference fluorescence signal intensity data of the corresponding window from the reference sequence; acquiring the ratio of the fluorescence signal intensity data to be detected of the window to the reference fluorescence signal intensity data, and determining the base-2 logarithm of the ratio as the comparison data of the fluorescence signal intensity of the window; and acquiring a comparison data sequence of the fluorescence signal intensity of each window based on the comparison data of the fluorescence signal intensity of each window on the genome.
In one embodiment, the processor, when executing the computer program, further performs the steps of: identifying the significant difference among the windows by adopting a statistical algorithm according to the comparison data sequence of the fluorescence signal intensity of each window; acquiring a plurality of regions with significant differences based on the significant differences among the windows, wherein the significant differences do not exist among the windows in each region; identifying significance differences among the multiple regions, and if no significance difference exists between adjacent regions, merging the adjacent regions without significance differences; if the adjacent regions have significance differences, obtaining the region to be determined, and the window corresponding to the starting position and the window corresponding to the ending position of the region to be determined based on the adjacent regions with significance differences.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
according to gene sequencing data of a sample to be detected, acquiring a sequence to be detected of fluorescence signal intensity data of each window of the sample to be detected on a genome;
comparing, window by window across the genome, the sequence to be detected of the fluorescence signal intensity data with a reference sequence of the fluorescence signal intensity data of the corresponding windows of a reference sample in a pre-established reference library, to obtain a comparison data sequence of the fluorescence signal intensity of each window;
according to the comparison data sequence of the fluorescence signal intensity of each window, detecting significant differences between the windows, and determining a target window and a corresponding region to be determined in the sample to be detected based on those differences, wherein the target window comprises the window corresponding to the start position and the window corresponding to the end position of the region to be determined in the sample to be detected;
and determining an abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescence signal intensity of each window and a preset variation threshold.
In one embodiment, the preset variation threshold comprises a preset copy number deletion threshold and a preset copy number duplication threshold; the computer program, when executed by the processor, further performs the steps of: obtaining the mean of the comparison data of the window fluorescence signal intensities within the region to be determined, according to the window corresponding to the start position and the window corresponding to the end position of the region to be determined in the sample to be detected and the comparison data sequence of the fluorescence signal intensity of each window; comparing that mean with the preset copy number deletion threshold and the preset copy number duplication threshold; if the mean matches the preset copy number deletion threshold, determining that the region to be determined is an abnormal region with a copy number deletion; if the mean matches the preset copy number duplication threshold, determining that the region to be determined is an abnormal region with a copy number duplication; and if the mean matches neither threshold, determining that the region to be determined is not an abnormal region.
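The threshold comparison might look like the sketch below. The text does not define what "matching" a threshold means, so it is read here as the region mean lying at or beyond the threshold; the default threshold values are placeholders, not values from the patent.

```python
from statistics import mean


def classify_region(log2_ratios, start, end, del_threshold=-0.3, dup_threshold=0.25):
    """Classify one region to be determined from its mean log2 ratio.

    start, end: window indices of the region's first and last window (inclusive).
    ASSUMPTIONS: a mean <= del_threshold counts as matching the deletion
    threshold and a mean >= dup_threshold as matching the duplication
    threshold; both default values are hypothetical.
    Returns (label, region_mean) with label in {"deletion", "duplication", "normal"}.
    """
    region_mean = mean(log2_ratios[start:end + 1])
    if region_mean <= del_threshold:
        return "deletion", region_mean
    if region_mean >= dup_threshold:
        return "duplication", region_mean
    return "normal", region_mean
```

Only regions labelled "deletion" or "duplication" would be carried forward as abnormal regions.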
In one embodiment, the computer program, when executed by the processor, further performs the steps of: obtaining the copy number of the abnormal region in the sample to be detected; and calculating the chimerism ratio of the abnormal region according to the copy number of the abnormal region and the preset variation threshold.
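This passage does not say how the copy number is obtained. One conventional conversion, shown here purely as an assumption, treats the reference as diploid and maps a mean log2 ratio r to a copy number of 2 × 2^r; the function name is hypothetical.

```python
def copy_number_from_log2(mean_log2_ratio, reference_ploidy=2):
    """Convert a region's mean log2 ratio to an estimated copy number.

    ASSUMPTION: the reference sample carries `reference_ploidy` copies
    (2 for autosomes), so copy number = ploidy * 2 ** log2_ratio. This
    conversion is a common convention, not a step stated in the source.
    """
    return reference_ploidy * 2 ** mean_log2_ratio
```

Under this convention a log2 ratio of 0 corresponds to the normal two copies.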
In one embodiment, the computer program, when executed by the processor, further performs the steps of: if the abnormal region is an abnormal region with a copy number duplication, calculating a first difference by subtracting 2 from the copy number of the abnormal region, and taking the quotient of the first difference and the copy number duplication threshold as the chimerism ratio of the abnormal region; if the abnormal region is an abnormal region with a copy number deletion, calculating a second difference by subtracting the copy number of the abnormal region from 2, and taking the quotient of the second difference and the copy number deletion threshold as the chimerism ratio of the abnormal region.
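The two quotients described above can be written out as a short sketch. The scale of the thresholds is not given in this passage; the example assumes they are expressed in copy-number units (for instance 1.0 for a full single-copy gain or loss), which is an illustrative choice.

```python
def chimerism_ratio(copy_number, kind, dup_threshold, del_threshold):
    """Chimerism ratio of an abnormal region.

    kind: "duplication" or "deletion".
    Duplication: (copy_number - 2) / dup_threshold.
    Deletion:    (2 - copy_number) / del_threshold.
    ASSUMPTION: thresholds are positive and in copy-number units.
    """
    if kind == "duplication":
        return (copy_number - 2) / dup_threshold
    if kind == "deletion":
        return (2 - copy_number) / del_threshold
    raise ValueError("kind must be 'duplication' or 'deletion'")
```

With both thresholds set to 1.0, a duplicated region with copy number 2.5 gives a ratio of 0.5, and a deleted region with copy number 1.7 gives a ratio of about 0.3.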
In one embodiment, the computer program, when executed by the processor, further performs the steps of: converting the gene sequencing data of the sample to be detected to obtain the fluorescence signal intensity of each site covered by that data; dividing the genome of the sample to be detected into windows according to the window division of the reference sample, and obtaining the fluorescence signal intensity of each window from the fluorescence signal intensities of the sites within that window; performing local linear regression correction on the per-window fluorescence signal intensities of the sample to be detected to obtain corrected fluorescence signal intensity data for each window; and generating the sequence to be detected of fluorescence signal intensity data from the fluorescence signal intensity data of each window of the sample to be detected.
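The local linear regression correction can be sketched as a plain locally weighted fit. This passage does not name the covariate being corrected for, so the per-window `covariate` (GC content would be one plausible choice), the tricube kernel, the `frac` bandwidth and the re-centring on the overall mean are all assumptions made for the example.

```python
def local_linear_fit(x, y, frac=0.5):
    """Fitted value at each x from a tricube-weighted local linear regression."""
    n = len(x)
    k = max(2, int(round(frac * n)))  # number of neighbours in each local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def correct_intensities(covariate, intensity, frac=0.5):
    """Remove the covariate trend from per-window intensities.

    ASSUMPTION: the corrected value is the residual from the local fit,
    shifted back to the overall mean intensity.
    """
    fitted = local_linear_fit(covariate, intensity, frac)
    overall = sum(intensity) / len(intensity)
    return [yi - fi + overall for yi, fi in zip(intensity, fitted)]
```

If the intensities depend on the covariate in an exactly linear way, the corrected values all collapse to the overall mean, which is a convenient sanity check.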
In one embodiment, the computer program, when executed by the processor, further performs the steps of: for each window on the genome, extracting the fluorescence signal intensity data of that window from the sequence to be detected, and extracting the reference fluorescence signal intensity data of the corresponding window from the reference sequence; obtaining the ratio of the window's fluorescence signal intensity data to be detected to its reference fluorescence signal intensity data, and taking the base-2 logarithm of that ratio as the comparison data of the fluorescence signal intensity of the window; and obtaining the comparison data sequence of the fluorescence signal intensity of each window from the comparison data of every window on the genome.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: identifying significant differences between windows by applying a statistical algorithm to the comparison data sequence of the fluorescence signal intensity of each window; dividing the windows into a plurality of regions based on those differences, such that the windows within each region do not differ significantly from one another; testing for significant differences between the regions, and merging adjacent regions that do not differ significantly; and, where adjacent regions do differ significantly, obtaining from those adjacent regions the region to be determined, together with the window corresponding to its start position and the window corresponding to its end position.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, a database or another medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not every possible combination is described; however, any combination of these technical features that involves no contradiction should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (10)

1. A method for detecting copy number variation based on chip data, the method comprising:
according to gene sequencing data of a sample to be detected, acquiring a sequence to be detected of fluorescence signal intensity data of each window of the sample to be detected on a genome;
comparing, window by window across the genome, the sequence to be detected of the fluorescence signal intensity data with a reference sequence of the fluorescence signal intensity data of the corresponding windows of a reference sample in a pre-established reference library, to obtain a comparison data sequence of the fluorescence signal intensity of each window;
according to the comparison data sequence of the fluorescence signal intensity of each window, detecting significant differences between the windows, and determining a target window and a corresponding region to be determined in the sample to be detected based on those differences, wherein the target window comprises the window corresponding to the start position and the window corresponding to the end position of the region to be determined in the sample to be detected;
and determining an abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescence signal intensity of each window and a preset variation threshold.
2. The method of claim 1, wherein the preset variation threshold comprises a preset copy number deletion threshold and a preset copy number duplication threshold; and wherein determining the abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescence signal intensity of each window and the preset variation threshold comprises:
obtaining the mean of the comparison data of the window fluorescence signal intensities within the region to be determined, according to the window corresponding to the start position and the window corresponding to the end position of the region to be determined in the sample to be detected and the comparison data sequence of the fluorescence signal intensity of each window;
comparing that mean with the preset copy number deletion threshold and the preset copy number duplication threshold; if the mean matches the preset copy number deletion threshold, determining that the region to be determined is an abnormal region with a copy number deletion; if the mean matches the preset copy number duplication threshold, determining that the region to be determined is an abnormal region with a copy number duplication; and if the mean matches neither threshold, determining that the region to be determined is not an abnormal region.
3. The method of claim 2, wherein after determining the abnormal region in the sample to be detected, the method further comprises:
obtaining the copy number of the abnormal region in the sample to be detected;
and calculating the chimerism ratio of the abnormal region according to the copy number of the abnormal region and the preset variation threshold.
4. The method of claim 3, wherein said calculating a chimerism ratio of said abnormal region comprises:
if the abnormal region is an abnormal region with a copy number duplication, calculating a first difference by subtracting 2 from the copy number of the abnormal region, and taking the quotient of the first difference and the copy number duplication threshold as the chimerism ratio of the abnormal region;
if the abnormal region is an abnormal region with a copy number deletion, calculating a second difference by subtracting the copy number of the abnormal region from 2, and taking the quotient of the second difference and the copy number deletion threshold as the chimerism ratio of the abnormal region.
5. The method according to claim 1, wherein the obtaining of the to-be-detected sequence of the fluorescence signal intensity data of each window on the genome of the to-be-detected sample according to the gene sequencing data of the to-be-detected sample comprises:
converting the gene sequencing data of the sample to be detected to obtain the fluorescence signal intensity of each site covered by that data;
dividing the genome of the sample to be detected into windows according to the window division of the reference sample, and obtaining the fluorescence signal intensity of each window from the fluorescence signal intensities of the sites within that window;
performing local linear regression correction on the per-window fluorescence signal intensities of the sample to be detected to obtain corrected fluorescence signal intensity data for each window;
and generating the sequence to be detected of fluorescence signal intensity data from the fluorescence signal intensity data of each window of the sample to be detected.
6. The method according to claim 1, wherein the comparing the sequence to be detected of the fluorescence signal intensity data with the reference sequence of the fluorescence signal intensity data of the corresponding window of the reference sample on the genome in the pre-established reference library based on each window on the genome to obtain the comparison data sequence of the fluorescence signal intensity of each window comprises:
for each window on the genome, extracting fluorescence signal intensity data to be detected of the window from the sequence to be detected, and extracting reference fluorescence signal intensity data of the corresponding window from the reference sequence;
obtaining the ratio of the window's fluorescence signal intensity data to be detected to its reference fluorescence signal intensity data, and taking the base-2 logarithm of that ratio as the comparison data of the fluorescence signal intensity of the window;
and obtaining the comparison data sequence of the fluorescence signal intensity of each window from the comparison data of every window on the genome.
7. The method of claim 1, wherein detecting significant differences between the windows according to the comparison data sequence of the fluorescence signal intensity of each window, and determining the target window and the corresponding region to be determined in the sample to be detected based on those differences, comprises:
identifying significant differences between windows by applying a statistical algorithm to the comparison data sequence of the fluorescence signal intensity of each window;
dividing the windows into a plurality of regions based on those differences, such that the windows within each region do not differ significantly from one another;
testing for significant differences between the regions, and merging adjacent regions that do not differ significantly; and, where adjacent regions do differ significantly, obtaining from those adjacent regions the region to be determined, together with the window corresponding to its start position and the window corresponding to its end position.
8. An apparatus for detecting copy number variation based on chip data, the apparatus comprising:
the fluorescence signal intensity data acquisition module is used for acquiring a to-be-detected sequence of fluorescence signal intensity data of each window of a to-be-detected sample on a genome according to gene sequencing data of the to-be-detected sample;
the comparison processing module is used for comparing, window by window across the genome, the sequence to be detected of the fluorescence signal intensity data with a reference sequence of the fluorescence signal intensity data of the corresponding windows of a reference sample in a pre-established reference library, to obtain a comparison data sequence of the fluorescence signal intensity of each window;
the region identification module is used for detecting significant differences between the windows according to the comparison data sequence of the fluorescence signal intensity of each window, and determining a target window and a corresponding region to be determined in the sample to be detected based on those differences, wherein the target window comprises the window corresponding to the start position and the window corresponding to the end position of the region to be determined in the sample to be detected;
and the abnormal region determining module is used for determining the abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescence signal intensity of each window and a preset variation threshold.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110673034.7A CN113299342B (en) 2021-06-17 2021-06-17 Copy number variation detection method and detection device based on chip data


Publications (2)

Publication Number Publication Date
CN113299342A true CN113299342A (en) 2021-08-24
CN113299342B CN113299342B (en) 2024-03-15

Family

ID=77328615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110673034.7A Active CN113299342B (en) 2021-06-17 2021-06-17 Copy number variation detection method and detection device based on chip data

Country Status (1)

Country Link
CN (1) CN113299342B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115376613A (en) * 2022-09-13 2022-11-22 郑州思昆生物工程有限公司 Base type detection method, device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060134674A1 (en) * 2002-11-11 2006-06-22 Affymetrix, Inc. Methods for identifying DNA copy number changes
CN104204220A (en) * 2011-12-31 2014-12-10 深圳华大基因医学有限公司 Method for detecting genetic variation
CN105574361A (en) * 2015-11-05 2016-05-11 上海序康医疗科技有限公司 Method for detecting variation of copy numbers of genomes
CN111916150A (en) * 2019-05-10 2020-11-10 北京贝瑞和康生物技术有限公司 Method and device for detecting genome copy number variation


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
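Sun Yulin; Liu Fei; Zhao Xiaohang: "Genome-wide association analysis of copy number variation", Progress in Biochemistry and Biophysics (生物化学与生物物理进展), no. 08 *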


Also Published As

Publication number Publication date
CN113299342B (en) 2024-03-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant