CN110246543B

CN110246543B - Method and computer system for detecting copy number variation by using single sample based on second-generation sequencing technology

Info

Publication number: CN110246543B
Application number: CN201910541057.5A
Authority: CN
Inventors: 郎继东; 王博; 杨家亮; 田埂
Original assignee: Geneis Beijing Co ltd
Current assignee: Geneis Beijing Co ltd
Priority date: 2019-06-21
Filing date: 2019-06-21
Publication date: 2021-02-26
Anticipated expiration: 2039-06-21
Also published as: CN110246543A

Abstract

The invention discloses a method and a computer system for detecting copy number variation using a single sample based on second-generation sequencing technology. The method performs copy number variation (CNV) detection on a single sample directly from raw sequencing data, without depending on factors that traditional methods need or require, such as alignment, sequencing depth, GC content correction and sample pairing. The method therefore simplifies the experimental and analysis steps and reduces cost, its analysis results are highly consistent with those of traditional methods, and, by incorporating clinical verification results (such as FISH verification), it effectively corrects false positives and false negatives produced by traditional methods.

Description

Method and computer system for detecting copy number variation by using single sample based on second-generation sequencing technology

Technical Field

The invention relates to gene detection, in particular to a method and a computer system for detecting copy number variation by using a single sample based on a second-generation sequencing technology.

Background

Copy number variation (CNV) is a type of structural variation caused by genomic rearrangement and can be classified by size into the microscopic and submicroscopic levels. Microscopic structural variation mainly refers to chromosome aberrations visible under a microscope, including euploidy or aneuploidy, insertion, deletion, inversion, translocation and the like; submicroscopic structural variation mainly refers to variation in DNA fragments longer than 1Kb, including insertion, deletion, duplication, inversion, translocation and the like. Copy number variation is one of the important pathogenic factors in human disease, and current research has found that CNV is associated with the pathogenesis of, or susceptibility to, many complex genetic diseases, including tumors, acquired immunodeficiency syndrome, systemic lupus erythematosus, autoimmune inflammatory diseases and the like. Copy number variation detection is therefore clinically necessary: it allows variation in large fragments of DNA sequence in the genome to be discovered early, providing a reference basis for the diagnosis and treatment of disease.

Many means and methods for detecting copy number variation currently exist: methods based on the polymerase chain reaction, including multiplex ligation-dependent probe amplification (MLPA), multiplex amplifiable probe hybridization (MAPH) and the like; methods based on hybridization, including fluorescence in situ hybridization (FISH), Giemsa banding and the like; and methods based on chip technology, including single nucleotide polymorphism chips and the like. These methods are complicated to operate, have low resolution and struggle to provide specific information about the variant region; they also have low analysis throughput and are expensive, so their cost-effectiveness is poor. With the rapid development of second-generation sequencing technology, sequencing cost has fallen sharply, analysis throughput has increased exponentially, and resolution has reached the Kb level, enabling deeper study of submicroscopic copy number variation. At present, algorithms for detecting CNV, such as CNVkit, CNVnator and Control-FREEC, are mostly developed at the whole genome sequencing (WGS) level and, for the sake of detection accuracy, generally require a paired sample; for single-sample detection, CNV identification is typically corrected based on sequencing depth and GC content. With the growing demand for targeted sequencing, more specialized algorithms, such as PatternCNV and ioncopy, have appeared in addition to the aforementioned ones. However, all of these methods rely on alignment, depend heavily on sequencing depth and GC content, and are limited by the parameter settings of the alignment and of the analysis algorithm, so the overall process is cumbersome and the experiment and analysis costs are high.

Disclosure of Invention

In view of this, the present invention provides a method and a computer system for detecting copy number variation using a single sample based on second-generation sequencing technology. The method performs copy number variation (CNV) detection on a single sample directly from raw sequencing data, without depending on factors that traditional methods need or require, such as alignment, sequencing depth, GC content correction and sample pairing.

Specifically, the present invention includes the following.

In a first aspect of the present invention, a method for detecting copy number variation using a single sample based on a second generation sequencing technology is provided, which comprises the following steps:

(1) Establish a first gene sample database and a second gene sample database, wherein the first gene sample database comprises A cases of genes with copy number variation, the second gene sample database comprises B cases of genes without copy number variation in the corresponding genes, and A and B are each a natural number of 50 or more. Preferably, the genes in the first gene sample database comprise copy number variations, and the genes in the second gene sample database do not have copy number variations at the positions (i.e. regions) where the genes in the first gene sample database have them. It should be noted that the genes in the second gene sample database may carry variations, including copy number variations, in regions other than the copy number variation regions of the first gene sample database. To ensure the reliability and accuracy of the method of the present invention, A and B each generally need to be a natural number of 50 or more, preferably 100 or more, more preferably 200 or more, and still more preferably 300 or more. The upper limit of A and B is not particularly limited.

(2) Divide the j copy number variation regions, of length L_j bp each, using a sliding window of size m bp and a step of n bp, so that each copy number variation region yields i = L_j/n seed sequences, where i is the quotient if L_j is exactly divisible by n and is otherwise L_j/n rounded down plus 1, giving a matrix of j × i seed sequences in total. In general, j is a natural number of 1 or more, preferably 10 or more, and more preferably 30 or more. m is a natural number of 50 or more, more preferably 80 or more, and still more preferably not greater than L. n is a natural number of 1 or more and not greater than L, preferably 5 or more and not greater than L.

(3) Perform exact, non-fault-tolerant sequence matching of the j × i seed sequences against the first gene sample database and the second gene sample database respectively, obtaining j × i matrices of perfectly matched seed sequence counts in each database.

(4) Normalize the matrices of perfectly matched seed sequence counts in each database, i.e. divide each perfectly matched seed sequence count by the average of all perfectly matched seed sequence counts of the copy number variation region.

(5) Apply 0-value filling to the normalized matrices of perfectly matched seed sequence counts, i.e. take the largest number of seed sequences obtained for any copy number variation region as the reference, and set to 0 the matrix entries of the remaining regions that have fewer seed sequences than this number.

(6) Perform mathematical modeling on the A + B normalized, 0-value-filled matrices of perfectly matched seed sequence counts, building a statistical model from the negative and positive results, to finally obtain a mathematical model for judging copy number variation as negative or positive.

(7) Repeat steps (2) to (5) on the sample to be judged and use the mathematical model obtained in step (6) to predict and judge copy number variation; the sample is judged positive if the predicted value is greater than 0.5, preferably greater than 0.6, and more preferably greater than 0.8, and negative otherwise.
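Steps (2) and (3) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `seed_count`, `make_seeds` and `count_exact_matches`, the toy region and the toy reads are all hypothetical, and the sketch assumes that window starts are 0, n, 2n, ... (which reproduces the stated count of L_j/n rounded up), that a window running past the region end is simply truncated, and that only the forward strand of each read is searched.

```python
import random

def seed_count(length: int, n: int) -> int:
    """Number of seeds for a region: length / n, rounded up."""
    return (length + n - 1) // n

def make_seeds(region: str, m: int, n: int) -> list[str]:
    """Cut one region into seeds with an m-bp window moved in n-bp steps."""
    # Assumption: windows that run past the region end are truncated.
    return [region[start:start + m] for start in range(0, len(region), n)]

def count_exact_matches(seeds: list[str], reads: list[str]) -> list[int]:
    """For each seed, count the reads that contain it with zero mismatches."""
    # Assumption: forward strand only; reverse complements are not searched.
    return [sum(seed in read for read in reads) for seed in seeds]

# Hypothetical toy data: a 100-bp region and three reads.
rng = random.Random(0)
region = "".join(rng.choice("ACGT") for _ in range(100))
reads = [region[:60], region[40:100], "T" * 60]

seeds = make_seeds(region, m=50, n=40)       # one row of the seed matrix
counts = count_exact_matches(seeds, reads)   # one row of the count matrix
print(len(seeds), counts)
```

Applying `make_seeds` to each of the j regions gives the rows of the seed matrix, and applying `count_exact_matches` to each row against one sample's reads gives that sample's count matrix. Plain substring search is used here only for readability; it would be far too slow for real sequencing data.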

Preferably, in the method for detecting copy number variation by using a single sample based on the second generation sequencing technology, the gene sample data in the first gene sample database and the second gene sample database are derived from data obtained by whole genome sequencing and/or target region capture/amplification sequencing.

Preferably, in the method for detecting copy number variation by using a single sample based on the second generation sequencing technology, the copy number variation comprises gene copy number amplification and/or deletion.

Preferably, in the method for detecting copy number variation using a single sample based on the second-generation sequencing technology of the present invention, the copy number variation includes chromosomal euploid or aneuploid aberrations, insertions, deletions, inversions and translocations of chromosomes, and insertions, deletions, duplications, inversions or translocations of DNA fragments.

Preferably, in the method for detecting copy number variation using a single sample based on the second-generation sequencing technology of the present invention, the length of the DNA fragment is 1Kb or more, preferably 1.5Kb or more. On the other hand, it is preferably 10Kb or less, more preferably 8Kb or less.

Preferably, in the method for detecting copy number variation by using a single sample based on the second generation sequencing technology, the gene is ERBB2.

Preferably, in the method for detecting copy number variation using a single sample based on the second-generation sequencing technology of the present invention, the data statistical model is established by a logistic regression or deep learning algorithm in step (6).
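Of the two options named above, logistic regression is the simpler to illustrate. The following is a minimal sketch under stated assumptions, not the model used in the invention: each sample's normalized, 0-value-filled matrix is assumed to be flattened into one feature vector, the four toy samples and their labels are hypothetical, and the learning rate `lr` and iteration count `epochs` are arbitrary illustrative values.

```python
import math

def predict(w, b, features):
    """Predicted probability that a sample is positive."""
    z = b + sum(wk * xk for wk, xk in zip(w, features))
    z = max(-30.0, min(30.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for features, label in zip(X, y):
            error = predict(w, b, features) - label
            for k, value in enumerate(features):
                grad_w[k] += error * value
            grad_b += error
        w = [wk - lr * gk / len(X) for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Hypothetical flattened 2 x 2 matrices; the last entry is 0-value filling.
X = [
    [1.6, 0.4, 1.0, 0.0],  # positive
    [1.5, 0.5, 1.0, 0.0],  # positive
    [1.0, 1.0, 1.0, 0.0],  # negative
    [0.9, 1.1, 1.0, 0.0],  # negative
]
y = [1, 1, 0, 0]

w, b = train_logistic(X, y)
for features, label in zip(X, y):
    print(label, round(predict(w, b, features), 3))
```

A predicted value above 0.5 is then read as positive, matching the threshold of step (7). In practice a tested library implementation would normally be preferred over a hand-written optimizer.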

In a second aspect of the invention, there is provided a computer system comprising a processor and configured to perform the method of the first aspect of the invention.

The invention simplifies the experimental and analysis steps and reduces cost, its analysis results are highly consistent with those of traditional methods, and, by incorporating clinical verification results (such as FISH verification), it effectively corrects false positives and false negatives produced by traditional methods.

Drawings

FIG. 1 is an exemplary flow chart of the present invention.

Detailed Description

Reference will now be made in detail to various exemplary embodiments of the invention. The detailed description should not be construed as limiting the invention, but as a more detailed description of certain aspects, features and embodiments of the invention.

It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Further, for numerical ranges in this disclosure, it is understood that the upper and lower limits of the range, and each intervening value therebetween, is specifically disclosed. Every smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in a stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although only preferred methods and materials are described herein, any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention. All documents mentioned in this specification are incorporated by reference herein for the purpose of disclosing and describing the methods and/or materials associated with the documents. In case of conflict with any incorporated document, the present specification will control. Unless otherwise indicated, "%" or "amount" are percentages by weight.

Examples

The analysis method of the present invention was tested on whole exome data from 5 samples known to be ERBB2 amplification positive and 5 samples known to be ERBB2 amplification negative. Specifically, one ERBB2-positive sample, Sample1, is taken as an example (as shown in the flow chart of FIG. 1), and steps 1 to 11 are repeated for the other samples, as follows:

1. Collect whole exome sequencing data from 272 ERBB2 gene amplification-positive samples and 1029 ERBB2 gene amplification-negative samples, and divide the data into a training set and a test set; the training set comprises 223 positive samples and 817 negative samples, and the test set comprises 49 positive samples and 212 negative samples.

2. The ERBB2 gene comprises 27 exons in total. The 27 amplified or deleted regions, of length L_j bp (0 < j < 28), are divided using a sliding window of size 50bp and a step of 40bp, so that each amplified or deleted region yields i = L_j/40 seed sequences, where i is the quotient if L_j is exactly divisible by 40 and is otherwise L_j/40 rounded down plus 1, giving a matrix of 27 × i seed sequences in total. The L_j are respectively 311bp, 152bp, 214bp, 135bp, 69bp, 116bp, 142bp, 120bp, 127bp, 74bp, 91bp, 200bp, 133bp, 91bp, 161bp, 48bp, 139bp, 123bp, 99bp, 186bp, 156bp, 76bp, 147bp, 98bp, 189bp, 253bp and 974bp. The corresponding i are 8, 4, 6, 4, 2, 3, 4, 2, 3, 6, 4, 3, 5, 2, 4, 3, 5, 4, 2, 4, 3, 5, 7, 25, respectively.

3. Perform exact, non-fault-tolerant sequence matching of the 27 × i seed sequences against the data of each of the 1040 samples in the training set, obtaining a 27 × i matrix of perfectly matched seed sequence counts for each sample.

4. Normalize the matrix of perfectly matched seed sequence counts for each sample, i.e. divide each perfectly matched seed sequence count by the average of all perfectly matched seed sequence counts of the amplified or deleted region.

5. Apply 0-value filling to the normalized matrix of perfectly matched seed sequence counts, i.e. take the largest number of seed sequences obtained for any amplified or deleted region as the reference, and set to 0 the matrix entries of the remaining regions that have fewer seed sequences than this number.

6. Perform mathematical modeling on the 1040 normalized, 0-value-filled 27 × 25 matrices of perfectly matched seed sequence counts. First, 10-fold cross-validation is carried out on the 1040 samples, and a convolutional neural network (CNN), a deep learning algorithm, is used together with the negative and positive results to select hyperparameters and to tune and optimize the model. The final optimal mathematical model, with a training set AUC of 93.04% and a test set AUC of 94.54%, is used as the model for judging new samples of the same data type. The model parameters are shown in Table 1.

TABLE 1 model parameters

7. Repeat steps 2 to 5 with Sample1 and use the optimal mathematical model obtained in step 6 to predict and judge copy number variation; the predicted value is 0.9916596, which is greater than 0.5, so the sample is considered positive.
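Steps 4 and 5 of the example (normalization and 0-value filling) can be sketched in a few lines of Python. This is an illustration under assumptions rather than the implementation used in the example: the function names and the raw counts are hypothetical, the average is assumed to be taken separately over the seeds of each region (the text could also be read as an average over all regions), and a region with no matching reads is assumed to stay at 0.

```python
def normalize_region(counts):
    """Divide each seed count in one region by that region's mean count."""
    mean = sum(counts) / len(counts)
    # Assumption: a region with no matching reads is left as all zeros.
    return [c / mean if mean > 0 else 0.0 for c in counts]

def zero_fill(rows):
    """Right-pad every region's row with 0 up to the longest row."""
    width = max(len(row) for row in rows)
    return [row + [0.0] * (width - len(row)) for row in rows]

# Hypothetical raw counts for three regions with 4, 2 and 3 seeds.
raw_counts = [[12, 8, 10, 10], [30, 10], [7, 7, 7]]

matrix = zero_fill([normalize_region(row) for row in raw_counts])
for row in matrix:
    print(row)
```

Every row ends up with the width of the longest region, which is how 27 regions with at most 25 seeds each become a 27 × 25 matrix.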

The predicted values and judgment results of Sample2-Sample10 are shown in Table 2 below.

TABLE 2 summary of predicted results for each sample

Sample ID	Predicted value	Observed value
Sample1	0.9916596	Positive
Sample2	0.9989957	Positive
Sample3	0.9999901	Positive
Sample4	0.9990958	Positive
Sample5	0.99751943	Positive
Sample6	0.012844639	Negative
Sample7	0.006111831	Negative
Sample8	0.003521628	Negative
Sample9	0.008016149	Negative
Sample10	0.002645513	Negative
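The judgment rule of step 7 can be checked against the predicted values in Table 2 with a short Python sketch. The variable names are illustrative; the predicted values are copied from the table and 0.5 is the threshold stated in step 7.

```python
THRESHOLD = 0.5  # cut-off stated in step 7

# Predicted values copied from Table 2.
predictions = {
    "Sample1": 0.9916596,
    "Sample2": 0.9989957,
    "Sample3": 0.9999901,
    "Sample4": 0.9990958,
    "Sample5": 0.99751943,
    "Sample6": 0.012844639,
    "Sample7": 0.006111831,
    "Sample8": 0.003521628,
    "Sample9": 0.008016149,
    "Sample10": 0.002645513,
}

# A sample is judged positive when its predicted value exceeds the threshold.
calls = {
    sample: "Positive" if value > THRESHOLD else "Negative"
    for sample, value in predictions.items()
}

for sample, call in calls.items():
    print(sample, predictions[sample], call)
```

With these values the rule gives the same judgment for Sample1 to Sample5 and for Sample6 to Sample10 at the preferred thresholds of 0.6 and 0.8 as well, because no predicted value lies between 0.5 and 0.8.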

It will be apparent to those skilled in the art that various modifications and variations can be made in the specific embodiments of the present disclosure without departing from the scope or spirit of the disclosure. Other embodiments will be apparent to those skilled in the art from consideration of the specification. The specification and examples are exemplary only.

Claims

1. A method for detecting copy number variation by using a single sample based on a second-generation sequencing technology is characterized by comprising the following steps:

(1) establishing a first gene sample database and a second gene sample database, wherein the first gene sample database comprises A cases of copy number variation genes, the second gene sample database comprises B cases of genes which do not have copy number variation at corresponding positions, and A and B are natural numbers of more than 50 respectively;

(2) dividing the j copy number variation regions, of length L_j bp each, using a sliding window of size m bp and a step of n bp, so that each copy number variation region yields i = L_j/n seed sequences, wherein i is the quotient if L_j is exactly divisible by n and is otherwise L_j/n rounded down plus 1, obtaining a matrix of j × i seed sequences in total;

(3) respectively carrying out non-fault-tolerant complete sequence matching on the j x i seed sequences in the first gene sample database and the second gene sample database, and obtaining a matrix of j x i completely matched seed sequence numbers in each database;

(4) normalizing the matrix of perfect match seed sequence numbers in each database, i.e., dividing each perfect match seed sequence number by the average of all perfect match seed sequence numbers of the copy number variation region;

(5) performing 0-value filling on the normalized matrix of perfect match seed sequence numbers, namely taking the largest number of seed sequences obtained for any copy number variation region as the reference, and setting to 0 the matrix entries of the remaining regions that have fewer seed sequences than this number;

(6) performing mathematical modeling on the A + B standardized fully matched seed sequence number matrixes subjected to the 0 complementing value processing, establishing a data statistical model according to negative and positive results, and finally obtaining a negative and positive mathematical model for judging copy number variation;

(7) repeating steps (2) to (5) on the sample to be judged, predicting and judging the copy number variation using the mathematical model obtained in step (6), and judging the sample to be positive if the predicted value is greater than 0.5 and negative otherwise.

2. The method of claim 1, wherein j is a natural number greater than 1.

3. The method for detecting copy number variation using a single sample based on second-generation sequencing technology according to claim 2, wherein the gene sample data in the first gene sample database and the second gene sample database are derived from data obtained by whole genome sequencing and/or by target region capture sequencing or target region amplification sequencing.

4. The method of claim 3, wherein the copy number variation comprises gene copy number amplification and/or deletion.

5. The method of claim 4, wherein the copy number variation comprises chromosome euploid or aneuploid aberrations, chromosome insertions, deletions, inversions or translocations, and DNA fragment insertions, deletions, duplications, inversions or translocations.

6. The method of claim 5, wherein the DNA fragment has a length of 1Kb or more.

7. The method for detecting copy number variation by using a single sample based on the second-generation sequencing technology of claim 1, wherein the gene is ERBB2.

8. The method for detecting copy number variation using single sample based on next-generation sequencing technology according to claim 1, wherein the data statistical model is established in step (6) by logistic regression or deep learning algorithm.

9. A computer system comprising a processor and configured to perform the method of any one of claims 1-8.