Disclosure of Invention
In view of the above, there is a need for a method, an apparatus, a device and a storage medium for analyzing copy number variation, so as to improve the accuracy and resolution of copy number variation analysis based on next generation sequencing.
A method of analyzing copy number variation, comprising the steps of:
step S1: obtaining DNA sequencing data of a genome target region;
step S2: calling a first application program through a first application program interface, and obtaining extracted sequencing data obtained by extracting reads covering the target area from the DNA sequencing data by the first application program according to the number of the reads to be extracted, wherein the number of the reads to be extracted is determined according to the number of bases of the target area, the sequencing read length and the preset average depth;
step S3: calling a second application program through a second application program interface to obtain a comparison result obtained by performing genome comparison on the extracted sequencing data by the second application program;
step S4: calling a third application program through a third application program interface to obtain a distinguishing result obtained by distinguishing the PCR repeated read and the non-PCR repeated read in the comparison result by the third application program;
step S5: calling a fourth application program through a fourth application program interface, and obtaining the number of reads which are counted from the distinguishing result by the fourth application program and have non-PCR repetition and comparison scores not less than a preset value, wherein the number of the reads falling into each target area is obtained;
step S6: and calling a fifth application program through a fifth application program interface, and obtaining the ratio of reads and/or the copy number of the CNV region determined by the fifth application program according to the read number of each target region.
In one embodiment, in step S2, the number of reads to be extracted is (number of bases of target region x preset average depth)/(sequencing read length x correlation coefficient), wherein the correlation coefficient is less than 1;
the predetermined average depth is determined according to the mutation analysis type of the sample to be detected, wherein the predetermined average depth of the somatic mutation is not less than 950 ×, and the predetermined average depth of the germline mutation is not less than 80 ×.
In one embodiment, after the step S2 and before the step S3, the method further comprises:
step S03: calling a sixth application program through a sixth application program interface, obtaining an evaluation result obtained by performing sequencing quality evaluation on the extracted sequencing data by the sixth application program, and executing step S3 on the extracted sequencing data of which the evaluation result meets a preset requirement; otherwise, calling a first application program through a first application program interface, extracting reads covering the target region from the DNA sequencing data according to the number of the reads to be extracted after the parameters of the first application program are adjusted to obtain new extracted sequencing data, calling a sixth application program through a sixth application program interface to obtain a new evaluation result obtained by carrying out sequencing quality evaluation on the new extracted sequencing data through the sixth application program, and executing a step S3 on the new extracted sequencing data of which the new evaluation result meets the preset requirement; otherwise, returning to step S1, DNA sequencing data for the new genomic target region is obtained.
In one embodiment, the preset requirements are: the average read mass is more than 85% of the total read mass of Q30, the average GC content is between 40% and 55%, and the bases A, T, C and G respectively account for 25% +/-2%.
In one embodiment, the step S6 includes:
step S61: obtaining the expected value p of the ith target area of the test sample obtained by fitting the fifth application program to the total read numbers of the test sample and the reference sample respectively through the beta-binomial distribution modeli;
Step S62: obtaining a formula exp (Y) by said fifth applicationi)=Yi*Pi/(1-Pi) Determined expected read number exp (Y) for each target region of the test specimeni) Wherein Y isiThe read number of the ith target area of the test sample;
step S63: obtaining a formula by said fifth applicationThe determined proportion of reads of the CNV region on the corresponding chromosome, and/or
Obtaining a formula CNV by said fifth applicationcopy=CNV ratio2 copy number of CNV region of female autosomal and X chromosomes, or CNV according to formulacopy=CNV ratio2 copy number of CNV region of male autosomal chromosomes and CNV according to the formulacopy=CNVratioDetermining the copy number of the CNV region of the male X or Y chromosome;
wherein, CNVratioIs the ratio of read of the CNV region, CNVcopyIs the copy number, X, of the CNV regioni-jIs the number of reads from the ith target region to the jth target region in the region of the test sample where the CNV is located,
in one embodiment, in the step S62, the method further includes: obtaining a formula Ratio by the fifth applicationi=Yi/exp(Yi) Determined Ratio of reads for each target region of test sampleiWherein Y isiThe number of reads for the ith target region of the test sample.
In one embodiment, the method for analyzing copy number variation further includes step S7: and calling a seventh application program and an eighth application program through a seventh application program interface and an eighth application program interface respectively to obtain the results of the seventh application program and the eighth application program for annotating and graphically displaying the ratio of reads of all target areas and CNV areas.
An apparatus for analyzing copy number variation, comprising:
the sequencing data acquisition module is used for acquiring DNA sequencing data of the genome target region;
the extraction calling module is used for calling a first application program through a first application program interface, extracting reads covering the target region from the DNA sequencing data according to the number of the reads to be extracted by the first application program, and obtaining extracted sequencing data, wherein the number of the reads to be extracted is determined according to the number of bases of the target region, the sequencing read length and the preset average depth;
the comparison calling module is used for calling a second application program through a second application program interface to obtain a comparison result obtained by comparing the genome of the extracted sequencing data by the second application program;
the distinguishing and calling module is used for calling a third application program through a third application program interface to obtain a distinguishing result obtained by distinguishing the PCR repeated read and the non-PCR repeated read in the comparison result by the third application program;
the statistical calling module is used for calling a fourth application program through a fourth application program interface to obtain the number of reads which are counted from the distinguishing result by the fourth application program and are not repeated by the PCR and have the comparison score not less than a preset value, and the number of the reads falling into each target area is obtained; and
and the CNV analysis calling module is used for calling a fifth application program through a fifth application program interface to obtain the read proportion and/or copy number of the CNV region determined by the fifth application program according to the read number of each target region.
A computer device having a processor and a memory, the memory having stored thereon a computer program, the processor implementing the steps of the method for analyzing copy number variation according to any of the above embodiments when executing the computer program.
A computer storage medium having a computer program stored thereon, the computer program when executed implementing the steps of the method for analyzing copy number variation according to any of the above embodiments.
The traditional next generation sequencing technology is used for analyzing CNV by means of related biological information software, and researches show that most of the biological information software is not accurate enough in predicting the missing or repeated copy number of CNV fragments and determining the positions of variant regions, and has larger deviation from gold standard (MLPA verification) data of CNV. The analysis method, the analysis device, the equipment and the storage medium for copy number variation provided by the invention can be used for sequentially extracting, comparing, marking and distinguishing the DNA sequencing data of the second-generation sequencing by calling the corresponding application program, and finally obtaining the read proportion and/or the copy number of the CNV region, the final result has high accuracy and good resolution, and particularly, in the extraction process, the number of the reads to be extracted is determined according to the base number, the sequencing read length and the preset average depth of the target region, so that different sequencing results can be analyzed in a targeted manner, and the reliability of the analysis result is greatly improved.
The analysis method of the copy number variation is an analysis method for non-disease diagnosis purposes, the CNV region of the genome target region is analyzed by the analysis method of the copy number variation, the obtained result can be used for various CNV analyses so as to be further used for researching various effective or ineffective CNVs, particularly CNV related to health diseases, and the analysis result can not be directly used as a diagnosis result for diagnosing whether a certain disease is suffered from, but can be used as an intermediate result together with other results for auxiliary diagnosis of the disease and pathological research and analysis of the disease, so that the analysis method has important clinical research and use values.
Detailed Description
To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
"read" as used herein refers to a sequencing sequence generated by a high throughput sequencing platform (e.g., a second generation sequencing platform); the sequencing depth refers to the ratio of the total base number obtained by sequencing to the size of the genome to be detected, and usually the unit x represents the multiple; the "sequencing read length" refers to the length of one sequencing in the sequencing process; the 'PCR repeated read' refers to read obtained by carrying out mirror image replication on the same molecule for multiple times by PCR, and the standard for judging whether the molecule is a mirror image molecule is as follows: the start and end positions of reads are the same, the base sequences between the start point and the end point are the same, and different reads are obtained as long as one of the start point, the end point or the sequences between the start point and the end point is different; the comparison score refers to a scoring mechanism of sequence comparison, the quality of a reaction sequence compared with a genome is high, and the accuracy of sequence comparison with the genome is high when the comparison score is high; the "reference sample" is a normal sample without CNV region on the genome theoretically, however, it is difficult to obtain such an ideal sample in reality, and in order to reduce data bias caused by different capture efficiency of experiments in different batches, the selection is performed in the samples in the same batch, and the conditions are selected as follows: and calculating a correlation coefficient between the read numbers of the test sample and the alternative sample in the target area, wherein the sample with the correlation coefficient meeting the preset requirement (such as being more than or equal to 0.97) is selected as a reference sample, and CNV calculation is carried out on the test sample by using the reference sample correlation data.
As shown in fig. 1, the method for analyzing copy number variation according to an embodiment of the present invention includes the following steps:
step S11: DNA sequencing data for the genomic target region is obtained.
In this embodiment, a second-generation sequencing mode is adopted to sequence the captured DNA (whole exon region or exon region of a specific gene set) of the genomic target region, so as to obtain DNA sequencing data. Data obtained off-line is typically in bcl format.
In one embodiment, step S11 further includes: calling applications such as bcl2fastq (Illumina company) through a bcl2fastq program interface to obtain a fastq format data file which is analyzed by subsequent software and converted from the bcl format data file by the applications such as bcl2 fastq.
Step S12: and calling the first application program through the first application program interface to obtain the extracted sequencing data, wherein the first application program extracts the reads covering the target area from the DNA sequencing data according to the number of the reads to be extracted.
The number of reads to be extracted in this example is determined based on the number of bases in the target region, the sequencing read length, and the predetermined average depth.
By extracting corresponding reads, the sequencing samples of different batches or sample piece differences caused by different data among samples can be reduced. Specifically, the depth of the extracted data can be determined according to whether the detected sample is to be subjected to somatic mutation or germ line mutation, and the preset average depth of the somatic mutation is generally not less than 950 ×, such as 1000 × ± 20 ×, and the preset average depth of the germ line mutation is not less than 80 ×, such as 100 × ± 20 ×, 200 × ± 20 × etc., which are commonly used.
More specifically, in one embodiment, T is definedsizeNumber of bases (unit: bp), D, which is a target regionMIs a preset average depth (can be selected according to requirements, as long as all sample standards are consistent), and readssizeAs sequencing read length (sequence length of read measured by a sequencing instrument), the number of reads to be extractednumCan be calculated according to the following formula:
readnum=Tsize*DM/(readsizecorrelation coefficient)
The correlation coefficient is less than 1, for example, the correlation coefficient may be in a range of 0.8 to 0.98, or may be 0.8, 0.85, 0.9, 0.95, or the like. By selecting a proper correlation coefficient, the number of the extracted reads is more than that preset theoretically, so that a certain number of invalid reads can be removed when genome comparison and PCR repeated read distinguishing are carried out subsequently, and the accuracy and reliability of an analysis result are ensured.
In one embodiment, the first application may be a software program, such as seqtk, by which corresponding reads are extracted from the fastq-formatted DNA sequencing data according to the number of reads to be extracted.
Step S13: and calling a second application program through a second application program interface to obtain a comparison result obtained by performing genome comparison on the extracted sequencing data by the second application program.
The comparison function is to perform genome positioning on the sequencing reads, and obtain comparison results for recording comparison information of all the reads, such as sequence names, comparison scores, genome positions, sequence comparison details and the like.
In an embodiment, the second application may be alignment software such as bwa and bowie, and the second application performs genome alignment on the extracted fastq-format sequencing data meeting the quality assessment standard, so as to obtain an alignment result of the bam-format file after the alignment.
Step S14: and calling a third application program through a third application program interface to obtain a distinguishing result obtained by distinguishing the PCR repeated read and the non-PCR repeated read in the comparison result by the third application program.
In one embodiment, the third application may be picardtools, samtools, or other marking software, and the third application compares the results to mark the PCR repeat read to distinguish it from the non-PCR repeat read, and removes the repeat read generated by the PCR process during subsequent calculations.
Step S15: and calling a fourth application program through a fourth application program interface, and obtaining the read which is obtained by the fourth application program from the distinguishing result and has statistical non-PCR repetition and the comparison score not less than a preset value, so as to obtain the read number falling into each target area.
In one embodiment, the fourth application program may be the Rsamtools toolkit in R language, etc., and the sequencing depth of autosomes of all samples (including the test sample and the reference sample), X chromosomes of female sample, X and Y chromosomes of male sample in the target region, i.e., the number of reads falling in the target region, and the GC content of the target region are counted by the fourth application program. According to the processing of the above step S13 and step S14, the request for read is: 1) non-PCR repeat read; 2) the comparison score is not less than a preset value, such as not less than 20.
The target region for sequencing is the region captured by all probes contained in the panle design, e.g., can be stored in the format of a bed file, e.g., in a more specific embodiment, in the format of: first column: chromosome number, second column: the starting position of the target region, the third column: end position of target region, fourth column: name of the target area. For the male and female sex chromosome statistics, because the female sex chromosome composition is different from the male sex chromosome composition, the female sex chromosome composition is 2X chromosomes, the male sex chromosome composition is one X chromosome, one Y chromosome, so when the percentage of reads or copy number of CNV region of sex chromosome is calculated subsequently, it needs to be calculated separately according to sex.
Step S16: and calling a fifth application program through a fifth application program interface, and obtaining the proportion and/or copy number of the reads of the CNV region determined by the fifth application program according to the number of the reads of each target region.
In step 16, CNV analysis is performed on the autosome of all samples (including the test sample and the reference sample), the X chromosome of female and the X, Y chromosome of male.
In one embodiment, the fifth application may be an ExomeDepth toolkit in R language, or the like.
In one embodiment, as shown in fig. 2, step S16 includes:
step S161: obtaining an expected value p of the ith target area of the test sample obtained by fitting the fifth application program to the total read numbers of the test sample and the reference sample respectively through a beta-binomial distribution modeli。
Step S162: the expected number of reads for each target region of the test sample is determined.
Specifically, in one embodiment, the formula exp (Y) is obtained by the fifth application as availablei)=Yi*Pi/(1-Pi) Determined expected read number exp (Y) for each target region of the test specimeni) Wherein Y isiThe number of reads for the ith target region of the test sample.
Step S163: determining the proportion of reads of the CNV region on the corresponding chromosome, and/or the copy number of the CNV region.
Specifically, in one embodiment, the fifth application formulates as may be obtained
Determining the proportion of reads of the CNV region on the corresponding chromosome.
Occupation ratio of read in CNV region CNVratioAfter the determination, a fifth application program according to the formula CNV may be further obtainedcopy=CNV ratio2 copy number of CNV region of female autosomal and X chromosomes, or CNV according to formulacopy=CNV ratio2 copy number of CNV region of male autosomal chromosomes and CNV according to the formulacopy=CNVratioDetermining the copy number of the CNV region of the male X or Y chromosome.
Wherein, CNVratioIs the ratio of read of the CNV region, CNVcopyIs the copy number of the CNV region, Xi-jIs the number of reads from the ith target region to the jth target region in the region of the test sample where the CNV is located,
in calculating the read occupation ratio of CNV region (CNV)ratio) In this case, the CNV region may be generally one or more adjacent exon regions, so the proportion of read of the CNV region needs to be recalculated based on the CNV region. CNV calculated as described aboveratioIt does not reflect the specific duplication or exact value of the CNV, and thus, the CNV can be further calculatedcopyTo be straightThe copy number values of the CNV regions on the genome are then reflected. The formula for calculating the copy number of the CNV region on autosomal chromosomes 1-22 and female X chromosome is: CNVcopy=CNV ratio2, the calculated formula for copy number of CNV region of male X or Y chromosome: CNVcopy=CNVratio。
CNV if it is located on the autosomal or female X chromosome copy2, the CNV region is a normal region, and there is no duplication or deletion, since human is diploid, there are two copies per chromosome in normal autosome, CNVcopy<2 indicates that the CNV region is deleted, CNVcopy>2 indicates that there is a duplication in this CNV region. Copy number of CNV region and CNV on Male X and Y chromosomeratioThe values are the same because both the X and Y chromosomes are 1 copy in males, if the CNV is located on chromosome X, Y in malescopy1 indicates that the CNV region is a normal region, if CNVcopy<1 denotes that the CNV region is a deletion region, CNVcopy>1 indicates that the CNV region is an overlapping region.
Further, in an embodiment, the step S162 further includes: obtaining by the fifth application according to the formula Ratioi=Yi/exp(Yi) Determined Ratio of reads for each target region of test sampleiWherein Y isiThe number of reads for the ith target region of the test sample.
As shown in fig. 3, in the method for analyzing copy number variation according to another embodiment of the present invention, after step S22 and before step S23, the method further includes step S023:
calling a sixth application program through a sixth application program interface to obtain an evaluation result obtained by performing sequencing quality evaluation on the extracted sequencing data by the sixth application program, and executing step S3 on the extracted sequencing data of which the evaluation result meets the preset requirement; otherwise, calling the first application program through the first application program interface, obtaining reads covering the target area extracted from the DNA sequencing data according to the number of the reads to be extracted after the parameters of the first application program are adjusted, obtaining new extracted sequencing data, calling the sixth application program through the sixth application program interface, obtaining a new evaluation result obtained by carrying out sequencing quality evaluation on the new extracted sequencing data through the sixth application program, and executing the step S3 on the new extracted sequencing data of which the new evaluation result meets the preset requirement; otherwise, returning to step S1, DNA sequencing data for the new genomic target region is obtained.
Steps S21, S22, S23, S24, S25 and S26 are the same as steps S11, S12, S13, S14, S15 and S16, respectively. Step S26 may further include steps S161, S162, and S163 described above.
In step S023, the sixth application program is used to mainly evaluate the contents such as the base quality fraction, Q30, GC content, etc., and the base composition in the read, and to screen the read with poor quality. The sixth application may be quality control software such as fastqc, fastx, ClinQC, etc. Specifically, in one embodiment, the preset requirements are: the average read mass is more than 85% of the total read mass of Q30, the average GC content is between 40% and 55%, and the bases A, T, C and G respectively account for 25% +/-2%.
The random seed number is a parameter in the seqtk software, and researches show that adjusting the random seed number causes the final extracted reads to be different, and new extracted sequencing data can be obtained.
If neither of the two sequencing quality assessments meets the predetermined requirements, the sample is required to be resequenced, and step S21 is performed to retrieve DNA sequencing data of a new genomic target region.
By carrying out sequencing quality evaluation on the extracted sequencing data, namely quality control, unqualified sequencing data can be abandoned, and the accuracy and reliability of subsequent data analysis can be guaranteed.
Further, as shown in fig. 3, in an embodiment, the method for analyzing copy number variation further includes step S27: and calling a seventh application program and an eighth application program through a seventh application program interface and an eighth application program interface respectively to obtain the results of the seventh application program and the eighth application program which are used for annotating and graphically displaying the ratio of reads of all target areas and CNV areas.
In step S27, the CNVs calculated in step S26 are subjected to numerical values of genes, exons, OMIM database, and the like, and the Ratio of reads for all target regionsiCNV area and ratio of read to CNV arearatioAnd visually displaying the value.
Specifically, the seventh application program may be a perl language, the eighth application program may be a ggplot package of an R language, and for example, the perl language may be used to sort data formats, and the ggplot toolkit of the R language is used to implement drawing.
By annotating and graphically displaying the corresponding results, the copy number variation condition can be more intuitively and clearly reflected.
Based on the same idea as the above method, as shown in fig. 4, the present invention also provides an analysis apparatus 30 for copy number variation, comprising:
a sequencing data acquisition module 31, configured to acquire DNA sequencing data of a genome target region;
the extraction calling module 32 is configured to call a first application program through a first application program interface, obtain extracted sequencing data obtained by extracting, by the first application program, reads covering a target region from DNA sequencing data according to the number of the reads to be extracted, where the number of the reads to be extracted is determined according to the number of bases in the target region, the sequencing read length, and a preset average depth;
the comparison calling module 33 is configured to call a second application program through a second application program interface, so as to obtain a comparison result obtained by performing genome comparison on the extracted sequencing data by the second application program;
the distinguishing and calling module 34 is configured to call a third application program through a third application program interface, so as to obtain a distinguishing result obtained by distinguishing the PCR repeat read and the non-PCR repeat read in the comparison result by the third application program;
the statistical calling module 35 is configured to call a fourth application program through a fourth application program interface, obtain reads with a comparison score not less than a preset value and counted from the discrimination result by the fourth application program, and obtain the number of reads falling into each target region; and
and the CNV analysis calling module 36 is configured to call a fifth application program through a fifth application program interface, and obtain the percentage of reads and/or the copy number of the CNV regions, which are determined by the fifth application program according to the read number of each target region.
The sequencing data obtaining module 31 may further include a format conversion module, for example, calling an application program such as bcl2fastq through a bcl2fastq program interface to obtain a data file in the fastq format, which is analyzed by subsequent software and converted from a data file in the bcl format by the application program such as bcl2 fastq.
The extraction calling module 32 may further comprise a read number calculating module for calculating the read number according to the formulanum=Tsize*DM/(readsizeCorrelation coefficient) to calculate the number of reads to be extracted, where TsizeNumber of bases (unit: bp), D, which is a target regionMIs a preset average depth (can be selected according to requirements, as long as all sample standards are consistent), and readssizeThe sequence read length (the sequence length of read measured by the sequencer). The correlation coefficient is less than 1, such as 0.8-0.98.
Specifically, as shown in fig. 5, in one embodiment, the CNV analysis calling module 36 includes an expected value calculation calling module 361, an expected read number calculation calling module 362, and a read proportion calculation calling module 363.
The expected value calculation calling module 361 is used for obtaining an expected value p of the ith target area of the test sample, which is obtained by fitting the fifth application program to the total read numbers of the test sample and the reference sample respectively through a beta-binomial distribution modeli。
The expected read number calculation calling module 362 is used for obtaining the formula exp (Y) by the fifth applicationi)=Yi*Pi/(1-Pi) Determined expected read number exp (Y) for each target region of the test specimeni) Wherein Y isiThe number of reads for the ith target region of the test sample.
The read proportion calculation calling module 363 is used for obtaining the result of the fifth applicationProcedure according to formula
Determining the proportion of reads of the CNV region on the corresponding chromosome.
Further, the CNV analysis calling module 36 further includes a copy number calculation calling module 364. The copy number calculation call module 364 is used to obtain the formula CNV by the fifth applicationcopy=CNV ratio2 copy number of CNV region of female autosomal and X chromosomes, or CNV according to formulacopy=CNV ratio2 copy number of CNV region of male autosomal chromosomes and CNV according to the formulacopy=CNVratioDetermining the copy number of the CNV region of the male X or Y chromosome.
Further, in one embodiment, the read percentage calculation call module 363 is further configured to obtain the Ratio of the fifth application program according to the formula Ratioi=Yi/exp(Yi) Determined Ratio of reads for each target region of test sampleiWherein Y isiThe number of reads for the ith target region of the test sample.
In another embodiment, as shown in fig. 6, the apparatus 40 for analyzing copy number variation further includes a quality control calling module 043. The quality control calling module 043 is configured to call a sixth application program through a sixth application program interface, and obtain an evaluation result obtained by performing sequencing quality evaluation on the extracted sequencing data by the sixth application program. The comparison calling module 43 is used for calling a second application program through a second application program interface for the extracted sequencing data meeting the preset requirement, so as to obtain a comparison result obtained by comparing the genome of the extracted sequencing data by the second application program; otherwise, the extraction calling module 42 calls the first application program through the first application program interface to obtain the read covering the target region extracted from the DNA sequencing data according to the number of the reads to be extracted after the parameters of the first application program are adjusted, so as to obtain new extracted sequencing data, the quality control calling module 043 calls the sixth application program through the sixth application program interface to obtain an evaluation result obtained by performing sequencing quality evaluation on the new extracted sequencing data again through the sixth application program, the comparison calling module 43 is used for calling the second application program through the second application program interface on the new extracted sequencing data meeting the preset requirement, so as to obtain a comparison result obtained by performing genome comparison on the extracted sequencing data through the second application program interface; otherwise, DNA sequencing data for the new genomic target region is acquired by the sequencing data acquisition module 41 on the second pass.
In the embodiment shown in fig. 6, the functions of the sequencing data obtaining module 41, the extraction calling module 42, the comparison calling module 43, the differentiation calling module 44, the statistic calling module 45 and the CNV analysis calling module 46 are respectively the same as the sequencing data obtaining module 31, the extraction calling module 32, the comparison calling module 33, the differentiation calling module 34, the statistic calling module 35 and the CNV analysis calling module 36 in fig. 5. The CNV analysis calling module 46 may further include an expected value calculation calling module 361, an expected read number calculation calling module 362, and a read proportion calculation calling module 363, or include an expected value calculation calling module 361, an expected read number calculation calling module 362, a read proportion calculation calling module 363, and a copy number calculation calling module 364.
Further, as shown in fig. 6, the copy number variation analysis apparatus 40 may further include an annotation and graphical presentation calling module 47. The annotation and graphical presentation calling module 47 is used for annotating and graphically presenting the proportion of reads of all target regions and CNV regions. The annotation and graphical display calling module 47 is configured to call the seventh application program and the eighth application program through the seventh application program interface and the eighth application program interface, respectively, to obtain results of performing annotation and graphical display on the Ratio of reads in all target regions and CNV regions by the seventh application program and the eighth application program, such as performing numerical values in genes and exons, an OMIM database, and the like on the calculated CNVs, and obtaining the Ratio of reads in all target regionsiCNV area and ratio of read to CNV arearatioAnd visually displaying the value.
Based on the embodiments described above, the present invention further provides a computer device for analyzing copy number variation, which has a processor and a memory, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the method for analyzing copy number variation according to any of the embodiments described above.
It will be understood by those skilled in the art that all or part of the processes of the above methods may be implemented by a computer program, which may be stored in a non-volatile computer-readable storage medium, and in the embodiments of the present invention, the program may be stored in the storage medium of a computer system and executed by at least one processor in the computer system to implement the processes of the embodiments including the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Accordingly, the present invention also provides a computer storage medium for analyzing copy number variation, which stores a computer program, and when the computer program is executed, the steps of the method for analyzing copy number variation of any of the above embodiments are realized.
The copy number variation analysis method is suitable for all whole exon sequencing technologies, and in order to illustrate the technical effect generated by the analysis method, a gene defect screening package item is used as case analysis, the item is used for screening diseases by using an extracted blood sample, and the analysis is carried out by shifting from the traditional single-site screening to a second-generation sequencing platform. In this case, 2 positive samples and 1 negative sample (sample names: TEST001, TEST002 and TEST003, respectively) were selected. The specific analysis procedure is as follows:
the first step is as follows: the Illumina Nextseq500 platform is adopted for sequencing, and the sequencing adopts base sequencing with the reading length of 150bp and the single joint of 8 bp. Acquiring the bcl data of the lower machine, and performing bcl data conversion, wherein bcl2fastq-R < lower machine directory > < output directory > - -sample-sheet < sample sheet file > - -use-bases-mask y150n, I8 and y150 n. A fastq file of 3 samples was obtained. The data volume statistics of the machine-off are shown in table 1.
TABLE 1 sample offline data volume statistics
Sample name
|
Clusters(Raw)
|
Clusters(PF)
|
Yield(MBases)
|
TEST001
|
255,944
|
255,944
|
77
|
TEST002
|
218,307
|
218,307
|
65
|
TEST003
|
219,816
|
219,816
|
66 |
The second step is that: selecting a fastq file with a specific sequencing depth
The target region of the test item is about 80kb, and because of germline mutation, a minimum depth of 200 × per sample is required. The number of reads to be extracted (81314 × 200)/(150 × 0.9) 120465. Extraction is realized by adopting seqtk software: seqtk sample-s100infastq 120465> outfastq.
The third step: sequencing quality assessment
The fatqc < outfattq >, through the quality control analysis, the sequencing data of the three samples meet the requirements in the aspects of GC content, Q30 percentage, base composition and the like.
The fourth step is genome alignment and PCR repeat read marking
Genome alignment: and b, obtaining a comparison result bam file by the bw mem-M-t 2< R1.fastq.gz > < R2.fastq.gz >, and establishing an index file, namely samtools index < mapping.
PCR repeat read labeling: read, java-Xmx5g-jar marks duties, jar INPUT, < mapping, bam > OUTPUT, < out.dup.bam > circuitry _ FILE, < dup.metrics >.
The genome alignment statistics are shown in Table 2.
TABLE 2 sample genome alignment and sequencing depth statistics
The fifth step: depth of target region
In order to facilitate the analysis process and the standardization of documents, a depth calculation algorithm is integrated into an R language script getDepth _ NB.R, and the depths of different target areas are counted according to different genders. Autosomes of males and females were counted together, and X for females and X, Y for males were counted separately. Specifically, the comparison result bam files of all samples are placed in a bam.list file, the comparison result bam of a female sample is placed in a bamx.list file, the comparison result file of a male sample is placed in a bamy.list file, the target region of an autosome is stored in chra.bed, the target region of an X chromosome is placed in chrx.bed, the XY target region is placed in chry.bed, and the command of counting the depth is as follows:
R--slave--args chrA.bed bam.list outdir A<getDepth_NB.R
R--slave--args chrX.bed bamX.list outdir X<getDepth_NB.R
R--slave--args chrY.bed bamY.list outdir Y<getDepth_NB.R
the statistical results are put in the files chrA.data.info, chrX.data.info and chrY.data.info.
And a sixth step: CNV calculation
In order to facilitate analysis process and document standardization, all the various algorithms (target region expected read number calculation and ratio value algorithm; CNV ratio value algorithm) in the CNV calculation step are integrated in the R language script callCNV _ NB.R.
The concrete operation is the same as the fifth step, and the command for calculating the CNV is as follows:
R--slave--args chrA.bed bam.list outdir A<callCNV_NB.R
R--slave--args chrX.bed bamX.list outdir X<callCNV_NB.R
R--slave--args chrY.bed bamY.list outdir Y<callCNV_NB.R
this step obtains the CNV results for CNV on autosomal, X, Y chromosome.
The seventh step: CNV annotations and graphical presentations
And according to the analysis result in the sixth step, performing annotation and visualization on the CNV. TEST001 sample CNV annotation results see table 3 and fig. 7; TEST002 sample CNV annotation results see table 4 and fig. 8; TEST003 sample CNV annotation results see fig. 9, which is a negative sample.
TABLE 3 TEST001CNV notes
TABLE 4 TEST002CNV notes results
Based on the CNV results generated in the seventh step, MLPA probes were designed and the CNV results were verified, and the MLPAratio values of 3 samples are shown in FIGS. 10, 11 and 12. As shown in fig. 10, 11 and 12, the positions of CNV copy numbers calculated by the three samples were all consistent with the positions of CNVs obtained by the MLPA method and copy abnormality types, and the CNV calculated by the TEST001 sample was a repeat of the 3 rd to 9 th exon regions of DMD gene, consistent with the MLPA results; and calculating that the TEST002 sample has a repeated region in the exon 3-exon 7 region of the DMD gene, and the result is consistent with the verification result of MLPA. The absence of missing repeat regions in the TEST003 samples was calculated to be consistent with the MLPA validation results.
All positive CNVs differed substantially very closely from the copy number of the MLPA assay, calculating a ratio of 1.98 for TEST001 replicates and a DMD copy ratio of around 2 for the MLPA assay. And calculating the ratio value of TEST002 repetition to be 1.3, verifying by MLPA to obtain the copy ratio value of the DMD to be about 1.3, and testing other subsequent samples to obtain the ratio value calculated by the algorithm, wherein the difference between the ratio value and the verification result of the MLPA is not more than 0.3. The calculated CNV false positive rate is 0%, namely all positive CNV calculated by the algorithm can be verified by an MLPA experimental means.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.