CN108959853B

CN108959853B - Analysis method, analysis device, equipment and storage medium for copy number variation

Info

Publication number: CN108959853B
Application number: CN201810481363.XA
Authority: CN
Inventors: 孙明明; 唐小艳; 邓俊豪; 向丽娜; 赵薇薇; 于世辉
Original assignee: Guangzhou Kingmed Diagnostics Group Co ltd; Guangzhou Kingmed Diagnostics Central Co Ltd
Current assignee: Guangzhou Kingmed Diagnostics Group Co ltd
Priority date: 2018-05-18
Filing date: 2018-05-18
Publication date: 2020-01-17
Anticipated expiration: 2038-05-18
Also published as: CN108959853A

Abstract

The present invention relates to a method, an apparatus, a device, and a storage medium for analyzing copy number variation. The copy number variation analysis method provided by the invention is used for sequentially extracting, comparing, marking and distinguishing and carrying out statistical analysis on the second-generation sequenced DNA sequencing data by calling a corresponding application program, so that the occupation ratio and/or the copy number of the read in the CNV region are/is finally obtained, the final result is high in accuracy and good in resolution, and especially in the extraction process, the number of the read to be extracted is determined according to the number of bases in a target region, the sequencing read length and the preset average depth, so that different sequencing results can be analyzed in a targeted manner, and the reliability of the analysis result is greatly improved.

Description

Analysis method, analysis device, equipment and storage medium for copy number variation

Technical Field

The present invention relates to the field of bioinformatics technologies, and in particular, to a method, an apparatus, a device, and a storage medium for analyzing copy number variation.

Background

With the increasing maturity of the second-generation sequencing technology and the rapid development of the second-generation sequencing technology in the application field related to human genome detection, the adoption of the second-generation sequencing technology to perform human genome analysis to assist in diagnosing diseases or performing pathological analysis of diseases has become an effective means, wherein Copy Number Variation (CNV) analysis is a very important analysis content. The main principle of copy number variation analysis is to determine the coverage of important region segments on human genome by the next generation sequencing technology and to determine whether there is difference with reference sample by statistical means, so as to determine whether there is mutation such as deletion or duplication, and finally to determine genotype or assist diagnosis of related diseases.

At present, the result of copy number variation analyzed by the second-generation sequencing data has certain false positive, so that the verification needs to be carried out by the first-generation sequencing (Sanger) of the industry gold standard. The method for verifying copy number variation by using a first-generation sequencing technology is a multiplex Ligation-dependent Probe Amplification (MLPA) technology, has very high accuracy and resolution in detecting copy number variation, and is one of the currently accepted gold standards for verifying copy number variation. However, the first-generation sequencing is generally inefficient and cumbersome to operate.

Disclosure of Invention

In view of the above, there is a need for a method, an apparatus, a device and a storage medium for analyzing copy number variation, so as to improve the accuracy and resolution of copy number variation analysis based on next generation sequencing.

A method of analyzing copy number variation, comprising the steps of:

step S1: obtaining DNA sequencing data of a genome target region;

step S2: calling a first application program through a first application program interface, and obtaining extracted sequencing data obtained by extracting reads covering the target area from the DNA sequencing data by the first application program according to the number of the reads to be extracted, wherein the number of the reads to be extracted is determined according to the number of bases of the target area, the sequencing read length and the preset average depth;

step S3: calling a second application program through a second application program interface to obtain a comparison result obtained by performing genome comparison on the extracted sequencing data by the second application program;

step S4: calling a third application program through a third application program interface to obtain a distinguishing result obtained by distinguishing the PCR repeated read and the non-PCR repeated read in the comparison result by the third application program;

step S5: calling a fourth application program through a fourth application program interface, and obtaining the number of reads which are counted from the distinguishing result by the fourth application program and have non-PCR repetition and comparison scores not less than a preset value, wherein the number of the reads falling into each target area is obtained;

step S6: and calling a fifth application program through a fifth application program interface, and obtaining the ratio of reads and/or the copy number of the CNV region determined by the fifth application program according to the read number of each target region.

In one embodiment, in step S2, the number of reads to be extracted is (number of bases of target region x preset average depth)/(sequencing read length x correlation coefficient), wherein the correlation coefficient is less than 1;

the predetermined average depth is determined according to the mutation analysis type of the sample to be detected, wherein the predetermined average depth of the somatic mutation is not less than 950 ×, and the predetermined average depth of the germline mutation is not less than 80 ×.

In one embodiment, after the step S2 and before the step S3, the method further comprises:

step S03: calling a sixth application program through a sixth application program interface, obtaining an evaluation result obtained by performing sequencing quality evaluation on the extracted sequencing data by the sixth application program, and executing step S3 on the extracted sequencing data of which the evaluation result meets a preset requirement; otherwise, calling a first application program through a first application program interface, extracting reads covering the target region from the DNA sequencing data according to the number of the reads to be extracted after the parameters of the first application program are adjusted to obtain new extracted sequencing data, calling a sixth application program through a sixth application program interface to obtain a new evaluation result obtained by carrying out sequencing quality evaluation on the new extracted sequencing data through the sixth application program, and executing a step S3 on the new extracted sequencing data of which the new evaluation result meets the preset requirement; otherwise, returning to step S1, DNA sequencing data for the new genomic target region is obtained.

In one embodiment, the preset requirements are: the average read mass is more than 85% of the total read mass of Q30, the average GC content is between 40% and 55%, and the bases A, T, C and G respectively account for 25% +/-2%.

In one embodiment, the step S6 includes:

step S61: obtaining the expected value p of the ith target area of the test sample obtained by fitting the fifth application program to the total read numbers of the test sample and the reference sample respectively through the beta-binomial distribution model_i；

Step S62: obtaining a formula exp (Y) by said fifth application_i)＝Y_i*P_i/(1-P_i) Determined expected read number exp (Y) for each target region of the test specimen_i) Wherein Y is_iThe read number of the ith target area of the test sample;

step S63: obtaining a formula by said fifth applicationThe determined proportion of reads of the CNV region on the corresponding chromosome, and/or

Obtaining a formula CNV by said fifth application_copy＝CNV _ratio2 copy number of CNV region of female autosomal and X chromosomes, or CNV according to formula_copy＝CNV _ratio2 copy number of CNV region of male autosomal chromosomes and CNV according to the formula_copy＝CNV_ratioDetermining the copy number of the CNV region of the male X or Y chromosome;

wherein, CNV_ratioIs the ratio of read of the CNV region, CNV_copyIs the copy number, X, of the CNV region_i-jIs the number of reads from the ith target region to the jth target region in the region of the test sample where the CNV is located,

in one embodiment, in the step S62, the method further includes: obtaining a formula Ratio by the fifth application_i＝Y_i/exp(Y_i) Determined Ratio of reads for each target region of test sample_iWherein Y is_iThe number of reads for the ith target region of the test sample.

In one embodiment, the method for analyzing copy number variation further includes step S7: and calling a seventh application program and an eighth application program through a seventh application program interface and an eighth application program interface respectively to obtain the results of the seventh application program and the eighth application program for annotating and graphically displaying the ratio of reads of all target areas and CNV areas.

An apparatus for analyzing copy number variation, comprising:

the sequencing data acquisition module is used for acquiring DNA sequencing data of the genome target region;

the extraction calling module is used for calling a first application program through a first application program interface, extracting reads covering the target region from the DNA sequencing data according to the number of the reads to be extracted by the first application program, and obtaining extracted sequencing data, wherein the number of the reads to be extracted is determined according to the number of bases of the target region, the sequencing read length and the preset average depth;

the comparison calling module is used for calling a second application program through a second application program interface to obtain a comparison result obtained by comparing the genome of the extracted sequencing data by the second application program;

the distinguishing and calling module is used for calling a third application program through a third application program interface to obtain a distinguishing result obtained by distinguishing the PCR repeated read and the non-PCR repeated read in the comparison result by the third application program;

the statistical calling module is used for calling a fourth application program through a fourth application program interface to obtain the number of reads which are counted from the distinguishing result by the fourth application program and are not repeated by the PCR and have the comparison score not less than a preset value, and the number of the reads falling into each target area is obtained; and

and the CNV analysis calling module is used for calling a fifth application program through a fifth application program interface to obtain the read proportion and/or copy number of the CNV region determined by the fifth application program according to the read number of each target region.

A computer device having a processor and a memory, the memory having stored thereon a computer program, the processor implementing the steps of the method for analyzing copy number variation according to any of the above embodiments when executing the computer program.

A computer storage medium having a computer program stored thereon, the computer program when executed implementing the steps of the method for analyzing copy number variation according to any of the above embodiments.

The traditional next generation sequencing technology is used for analyzing CNV by means of related biological information software, and researches show that most of the biological information software is not accurate enough in predicting the missing or repeated copy number of CNV fragments and determining the positions of variant regions, and has larger deviation from gold standard (MLPA verification) data of CNV. The analysis method, the analysis device, the equipment and the storage medium for copy number variation provided by the invention can be used for sequentially extracting, comparing, marking and distinguishing the DNA sequencing data of the second-generation sequencing by calling the corresponding application program, and finally obtaining the read proportion and/or the copy number of the CNV region, the final result has high accuracy and good resolution, and particularly, in the extraction process, the number of the reads to be extracted is determined according to the base number, the sequencing read length and the preset average depth of the target region, so that different sequencing results can be analyzed in a targeted manner, and the reliability of the analysis result is greatly improved.

The analysis method of the copy number variation is an analysis method for non-disease diagnosis purposes, the CNV region of the genome target region is analyzed by the analysis method of the copy number variation, the obtained result can be used for various CNV analyses so as to be further used for researching various effective or ineffective CNVs, particularly CNV related to health diseases, and the analysis result can not be directly used as a diagnosis result for diagnosing whether a certain disease is suffered from, but can be used as an intermediate result together with other results for auxiliary diagnosis of the disease and pathological research and analysis of the disease, so that the analysis method has important clinical research and use values.

Drawings

FIG. 1 is a flow chart illustrating a method for analyzing copy number variation according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a specific process of step S16 in FIG. 1;

FIG. 3 is a schematic flow chart of a method for analyzing copy number variation according to another embodiment;

FIG. 4 is a schematic structural diagram of an apparatus for analyzing copy number variation according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a specific structure of the CNV analysis module shown in FIG. 4;

FIG. 6 is a schematic view of an apparatus for analyzing copy number variation according to another embodiment;

FIG. 7 is a TEST001CNV visualization;

FIG. 8 is a TEST002CNV visualization;

FIG. 9 is a TEST003 CNV visualization;

FIG. 10 shows TEST001CNV MLPA validation results;

FIG. 11 shows TEST002CNV MLPA validation results;

FIG. 12 shows TEST003 CNV MLPA validation results.

Detailed Description

To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

"read" as used herein refers to a sequencing sequence generated by a high throughput sequencing platform (e.g., a second generation sequencing platform); the sequencing depth refers to the ratio of the total base number obtained by sequencing to the size of the genome to be detected, and usually the unit x represents the multiple; the "sequencing read length" refers to the length of one sequencing in the sequencing process; the 'PCR repeated read' refers to read obtained by carrying out mirror image replication on the same molecule for multiple times by PCR, and the standard for judging whether the molecule is a mirror image molecule is as follows: the start and end positions of reads are the same, the base sequences between the start point and the end point are the same, and different reads are obtained as long as one of the start point, the end point or the sequences between the start point and the end point is different; the comparison score refers to a scoring mechanism of sequence comparison, the quality of a reaction sequence compared with a genome is high, and the accuracy of sequence comparison with the genome is high when the comparison score is high; the "reference sample" is a normal sample without CNV region on the genome theoretically, however, it is difficult to obtain such an ideal sample in reality, and in order to reduce data bias caused by different capture efficiency of experiments in different batches, the selection is performed in the samples in the same batch, and the conditions are selected as follows: and calculating a correlation coefficient between the read numbers of the test sample and the alternative sample in the target area, wherein the sample with the correlation coefficient meeting the preset requirement (such as being more than or equal to 0.97) is selected as a reference sample, and CNV calculation is carried out on the test sample by using the reference sample correlation data.

As shown in fig. 1, the method for analyzing copy number variation according to an embodiment of the present invention includes the following steps:

step S11: DNA sequencing data for the genomic target region is obtained.

In this embodiment, a second-generation sequencing mode is adopted to sequence the captured DNA (whole exon region or exon region of a specific gene set) of the genomic target region, so as to obtain DNA sequencing data. Data obtained off-line is typically in bcl format.

In one embodiment, step S11 further includes: calling applications such as bcl2fastq (Illumina company) through a bcl2fastq program interface to obtain a fastq format data file which is analyzed by subsequent software and converted from the bcl format data file by the applications such as bcl2 fastq.

Step S12: and calling the first application program through the first application program interface to obtain the extracted sequencing data, wherein the first application program extracts the reads covering the target area from the DNA sequencing data according to the number of the reads to be extracted.

The number of reads to be extracted in this example is determined based on the number of bases in the target region, the sequencing read length, and the predetermined average depth.

By extracting corresponding reads, the sequencing samples of different batches or sample piece differences caused by different data among samples can be reduced. Specifically, the depth of the extracted data can be determined according to whether the detected sample is to be subjected to somatic mutation or germ line mutation, and the preset average depth of the somatic mutation is generally not less than 950 ×, such as 1000 × ± 20 ×, and the preset average depth of the germ line mutation is not less than 80 ×, such as 100 × ± 20 ×, 200 × ± 20 × etc., which are commonly used.

More specifically, in one embodiment, T is defined_sizeNumber of bases (unit: bp), D, which is a target region_MIs a preset average depth (can be selected according to requirements, as long as all sample standards are consistent), and reads_sizeAs sequencing read length (sequence length of read measured by a sequencing instrument), the number of reads to be extracted_numCan be calculated according to the following formula:

read_num＝T_size*D_M/(read_sizecorrelation coefficient)

The correlation coefficient is less than 1, for example, the correlation coefficient may be in a range of 0.8 to 0.98, or may be 0.8, 0.85, 0.9, 0.95, or the like. By selecting a proper correlation coefficient, the number of the extracted reads is more than that preset theoretically, so that a certain number of invalid reads can be removed when genome comparison and PCR repeated read distinguishing are carried out subsequently, and the accuracy and reliability of an analysis result are ensured.

In one embodiment, the first application may be a software program, such as seqtk, by which corresponding reads are extracted from the fastq-formatted DNA sequencing data according to the number of reads to be extracted.

Step S13: and calling a second application program through a second application program interface to obtain a comparison result obtained by performing genome comparison on the extracted sequencing data by the second application program.

The comparison function is to perform genome positioning on the sequencing reads, and obtain comparison results for recording comparison information of all the reads, such as sequence names, comparison scores, genome positions, sequence comparison details and the like.

In an embodiment, the second application may be alignment software such as bwa and bowie, and the second application performs genome alignment on the extracted fastq-format sequencing data meeting the quality assessment standard, so as to obtain an alignment result of the bam-format file after the alignment.

Step S14: and calling a third application program through a third application program interface to obtain a distinguishing result obtained by distinguishing the PCR repeated read and the non-PCR repeated read in the comparison result by the third application program.

In one embodiment, the third application may be picardtools, samtools, or other marking software, and the third application compares the results to mark the PCR repeat read to distinguish it from the non-PCR repeat read, and removes the repeat read generated by the PCR process during subsequent calculations.

Step S15: and calling a fourth application program through a fourth application program interface, and obtaining the read which is obtained by the fourth application program from the distinguishing result and has statistical non-PCR repetition and the comparison score not less than a preset value, so as to obtain the read number falling into each target area.

In one embodiment, the fourth application program may be the Rsamtools toolkit in R language, etc., and the sequencing depth of autosomes of all samples (including the test sample and the reference sample), X chromosomes of female sample, X and Y chromosomes of male sample in the target region, i.e., the number of reads falling in the target region, and the GC content of the target region are counted by the fourth application program. According to the processing of the above step S13 and step S14, the request for read is: 1) non-PCR repeat read; 2) the comparison score is not less than a preset value, such as not less than 20.

The target region for sequencing is the region captured by all probes contained in the panle design, e.g., can be stored in the format of a bed file, e.g., in a more specific embodiment, in the format of: first column: chromosome number, second column: the starting position of the target region, the third column: end position of target region, fourth column: name of the target area. For the male and female sex chromosome statistics, because the female sex chromosome composition is different from the male sex chromosome composition, the female sex chromosome composition is 2X chromosomes, the male sex chromosome composition is one X chromosome, one Y chromosome, so when the percentage of reads or copy number of CNV region of sex chromosome is calculated subsequently, it needs to be calculated separately according to sex.

Step S16: and calling a fifth application program through a fifth application program interface, and obtaining the proportion and/or copy number of the reads of the CNV region determined by the fifth application program according to the number of the reads of each target region.

In step 16, CNV analysis is performed on the autosome of all samples (including the test sample and the reference sample), the X chromosome of female and the X, Y chromosome of male.

In one embodiment, the fifth application may be an ExomeDepth toolkit in R language, or the like.

In one embodiment, as shown in fig. 2, step S16 includes:

step S161: obtaining an expected value p of the ith target area of the test sample obtained by fitting the fifth application program to the total read numbers of the test sample and the reference sample respectively through a beta-binomial distribution model_i。

Step S162: the expected number of reads for each target region of the test sample is determined.

Specifically, in one embodiment, the formula exp (Y) is obtained by the fifth application as available_i)＝Y_i*P_i/(1-P_i) Determined expected read number exp (Y) for each target region of the test specimen_i) Wherein Y is_iThe number of reads for the ith target region of the test sample.

Step S163: determining the proportion of reads of the CNV region on the corresponding chromosome, and/or the copy number of the CNV region.

Specifically, in one embodiment, the fifth application formulates as may be obtained

Determining the proportion of reads of the CNV region on the corresponding chromosome.

Occupation ratio of read in CNV region CNV_ratioAfter the determination, a fifth application program according to the formula CNV may be further obtained_copy＝CNV _ratio2 copy number of CNV region of female autosomal and X chromosomes, or CNV according to formula_copy＝CNV _ratio2 copy number of CNV region of male autosomal chromosomes and CNV according to the formula_copy＝CNV_ratioDetermining the copy number of the CNV region of the male X or Y chromosome.

Wherein, CNV_ratioIs the ratio of read of the CNV region, CNV_copyIs the copy number of the CNV region, X_i-jIs the number of reads from the ith target region to the jth target region in the region of the test sample where the CNV is located,

in calculating the read occupation ratio of CNV region (CNV)_ratio) In this case, the CNV region may be generally one or more adjacent exon regions, so the proportion of read of the CNV region needs to be recalculated based on the CNV region. CNV calculated as described above_ratioIt does not reflect the specific duplication or exact value of the CNV, and thus, the CNV can be further calculated_copyTo be straightThe copy number values of the CNV regions on the genome are then reflected. The formula for calculating the copy number of the CNV region on autosomal chromosomes 1-22 and female X chromosome is: CNV_copy＝CNV _ratio2, the calculated formula for copy number of CNV region of male X or Y chromosome: CNV_copy＝CNV_ratio。

CNV if it is located on the autosomal or female X chromosome _copy2, the CNV region is a normal region, and there is no duplication or deletion, since human is diploid, there are two copies per chromosome in normal autosome, CNV_copy<2 indicates that the CNV region is deleted, CNV_copy>2 indicates that there is a duplication in this CNV region. Copy number of CNV region and CNV on Male X and Y chromosome_ratioThe values are the same because both the X and Y chromosomes are 1 copy in males, if the CNV is located on chromosome X, Y in males_copy1 indicates that the CNV region is a normal region, if CNV_copy<1 denotes that the CNV region is a deletion region, CNV_copy>1 indicates that the CNV region is an overlapping region.

Further, in an embodiment, the step S162 further includes: obtaining by the fifth application according to the formula Ratio_i＝Y_i/exp(Y_i) Determined Ratio of reads for each target region of test sample_iWherein Y is_iThe number of reads for the ith target region of the test sample.

As shown in fig. 3, in the method for analyzing copy number variation according to another embodiment of the present invention, after step S22 and before step S23, the method further includes step S023:

calling a sixth application program through a sixth application program interface to obtain an evaluation result obtained by performing sequencing quality evaluation on the extracted sequencing data by the sixth application program, and executing step S3 on the extracted sequencing data of which the evaluation result meets the preset requirement; otherwise, calling the first application program through the first application program interface, obtaining reads covering the target area extracted from the DNA sequencing data according to the number of the reads to be extracted after the parameters of the first application program are adjusted, obtaining new extracted sequencing data, calling the sixth application program through the sixth application program interface, obtaining a new evaluation result obtained by carrying out sequencing quality evaluation on the new extracted sequencing data through the sixth application program, and executing the step S3 on the new extracted sequencing data of which the new evaluation result meets the preset requirement; otherwise, returning to step S1, DNA sequencing data for the new genomic target region is obtained.

Steps S21, S22, S23, S24, S25 and S26 are the same as steps S11, S12, S13, S14, S15 and S16, respectively. Step S26 may further include steps S161, S162, and S163 described above.

In step S023, the sixth application program is used to mainly evaluate the contents such as the base quality fraction, Q30, GC content, etc., and the base composition in the read, and to screen the read with poor quality. The sixth application may be quality control software such as fastqc, fastx, ClinQC, etc. Specifically, in one embodiment, the preset requirements are: the average read mass is more than 85% of the total read mass of Q30, the average GC content is between 40% and 55%, and the bases A, T, C and G respectively account for 25% +/-2%.

The random seed number is a parameter in the seqtk software, and researches show that adjusting the random seed number causes the final extracted reads to be different, and new extracted sequencing data can be obtained.

If neither of the two sequencing quality assessments meets the predetermined requirements, the sample is required to be resequenced, and step S21 is performed to retrieve DNA sequencing data of a new genomic target region.

By carrying out sequencing quality evaluation on the extracted sequencing data, namely quality control, unqualified sequencing data can be abandoned, and the accuracy and reliability of subsequent data analysis can be guaranteed.

Further, as shown in fig. 3, in an embodiment, the method for analyzing copy number variation further includes step S27: and calling a seventh application program and an eighth application program through a seventh application program interface and an eighth application program interface respectively to obtain the results of the seventh application program and the eighth application program which are used for annotating and graphically displaying the ratio of reads of all target areas and CNV areas.

In step S27, the CNVs calculated in step S26 are subjected to numerical values of genes, exons, OMIM database, and the like, and the Ratio of reads for all target regions_iCNV area and ratio of read to CNV area_ratioAnd visually displaying the value.

Specifically, the seventh application program may be a perl language, the eighth application program may be a ggplot package of an R language, and for example, the perl language may be used to sort data formats, and the ggplot toolkit of the R language is used to implement drawing.

By annotating and graphically displaying the corresponding results, the copy number variation condition can be more intuitively and clearly reflected.

Based on the same idea as the above method, as shown in fig. 4, the present invention also provides an analysis apparatus 30 for copy number variation, comprising:

a sequencing data acquisition module 31, configured to acquire DNA sequencing data of a genome target region;

the extraction calling module 32 is configured to call a first application program through a first application program interface, obtain extracted sequencing data obtained by extracting, by the first application program, reads covering a target region from DNA sequencing data according to the number of the reads to be extracted, where the number of the reads to be extracted is determined according to the number of bases in the target region, the sequencing read length, and a preset average depth;

the comparison calling module 33 is configured to call a second application program through a second application program interface, so as to obtain a comparison result obtained by performing genome comparison on the extracted sequencing data by the second application program;

the distinguishing and calling module 34 is configured to call a third application program through a third application program interface, so as to obtain a distinguishing result obtained by distinguishing the PCR repeat read and the non-PCR repeat read in the comparison result by the third application program;

the statistical calling module 35 is configured to call a fourth application program through a fourth application program interface, obtain reads with a comparison score not less than a preset value and counted from the discrimination result by the fourth application program, and obtain the number of reads falling into each target region; and

and the CNV analysis calling module 36 is configured to call a fifth application program through a fifth application program interface, and obtain the percentage of reads and/or the copy number of the CNV regions, which are determined by the fifth application program according to the read number of each target region.

The sequencing data obtaining module 31 may further include a format conversion module, for example, calling an application program such as bcl2fastq through a bcl2fastq program interface to obtain a data file in the fastq format, which is analyzed by subsequent software and converted from a data file in the bcl format by the application program such as bcl2 fastq.

The extraction calling module 32 may further comprise a read number calculating module for calculating the read number according to the formula_num＝T_size*D_M/(read_sizeCorrelation coefficient) to calculate the number of reads to be extracted, where T_sizeNumber of bases (unit: bp), D, which is a target region_MIs a preset average depth (can be selected according to requirements, as long as all sample standards are consistent), and reads_sizeThe sequence read length (the sequence length of read measured by the sequencer). The correlation coefficient is less than 1, such as 0.8-0.98.

Specifically, as shown in fig. 5, in one embodiment, the CNV analysis calling module 36 includes an expected value calculation calling module 361, an expected read number calculation calling module 362, and a read proportion calculation calling module 363.

The expected value calculation calling module 361 is used for obtaining an expected value p of the ith target area of the test sample, which is obtained by fitting the fifth application program to the total read numbers of the test sample and the reference sample respectively through a beta-binomial distribution model_i。

The expected read number calculation calling module 362 is used for obtaining the formula exp (Y) by the fifth application_i)＝Y_i*P_i/(1-P_i) Determined expected read number exp (Y) for each target region of the test specimen_i) Wherein Y is_iThe number of reads for the ith target region of the test sample.

The read proportion calculation calling module 363 is used for obtaining the result of the fifth applicationProcedure according to formula

Further, the CNV analysis calling module 36 further includes a copy number calculation calling module 364. The copy number calculation call module 364 is used to obtain the formula CNV by the fifth application_copy＝CNV _ratio2 copy number of CNV region of female autosomal and X chromosomes, or CNV according to formula_copy＝CNV _ratio2 copy number of CNV region of male autosomal chromosomes and CNV according to the formula_copy＝CNV_ratioDetermining the copy number of the CNV region of the male X or Y chromosome.

Further, in one embodiment, the read percentage calculation call module 363 is further configured to obtain the Ratio of the fifth application program according to the formula Ratio_i＝Y_i/exp(Y_i) Determined Ratio of reads for each target region of test sample_iWherein Y is_iThe number of reads for the ith target region of the test sample.

In another embodiment, as shown in fig. 6, the apparatus 40 for analyzing copy number variation further includes a quality control calling module 043. The quality control calling module 043 is configured to call a sixth application program through a sixth application program interface, and obtain an evaluation result obtained by performing sequencing quality evaluation on the extracted sequencing data by the sixth application program. The comparison calling module 43 is used for calling a second application program through a second application program interface for the extracted sequencing data meeting the preset requirement, so as to obtain a comparison result obtained by comparing the genome of the extracted sequencing data by the second application program; otherwise, the extraction calling module 42 calls the first application program through the first application program interface to obtain the read covering the target region extracted from the DNA sequencing data according to the number of the reads to be extracted after the parameters of the first application program are adjusted, so as to obtain new extracted sequencing data, the quality control calling module 043 calls the sixth application program through the sixth application program interface to obtain an evaluation result obtained by performing sequencing quality evaluation on the new extracted sequencing data again through the sixth application program, the comparison calling module 43 is used for calling the second application program through the second application program interface on the new extracted sequencing data meeting the preset requirement, so as to obtain a comparison result obtained by performing genome comparison on the extracted sequencing data through the second application program interface; otherwise, DNA sequencing data for the new genomic target region is acquired by the sequencing data acquisition module 41 on the second pass.

In the embodiment shown in fig. 6, the functions of the sequencing data obtaining module 41, the extraction calling module 42, the comparison calling module 43, the differentiation calling module 44, the statistic calling module 45 and the CNV analysis calling module 46 are respectively the same as the sequencing data obtaining module 31, the extraction calling module 32, the comparison calling module 33, the differentiation calling module 34, the statistic calling module 35 and the CNV analysis calling module 36 in fig. 5. The CNV analysis calling module 46 may further include an expected value calculation calling module 361, an expected read number calculation calling module 362, and a read proportion calculation calling module 363, or include an expected value calculation calling module 361, an expected read number calculation calling module 362, a read proportion calculation calling module 363, and a copy number calculation calling module 364.

Further, as shown in fig. 6, the copy number variation analysis apparatus 40 may further include an annotation and graphical presentation calling module 47. The annotation and graphical presentation calling module 47 is used for annotating and graphically presenting the proportion of reads of all target regions and CNV regions. The annotation and graphical display calling module 47 is configured to call the seventh application program and the eighth application program through the seventh application program interface and the eighth application program interface, respectively, to obtain results of performing annotation and graphical display on the Ratio of reads in all target regions and CNV regions by the seventh application program and the eighth application program, such as performing numerical values in genes and exons, an OMIM database, and the like on the calculated CNVs, and obtaining the Ratio of reads in all target regions_iCNV area and ratio of read to CNV area_ratioAnd visually displaying the value.

Based on the embodiments described above, the present invention further provides a computer device for analyzing copy number variation, which has a processor and a memory, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the method for analyzing copy number variation according to any of the embodiments described above.

It will be understood by those skilled in the art that all or part of the processes of the above methods may be implemented by a computer program, which may be stored in a non-volatile computer-readable storage medium, and in the embodiments of the present invention, the program may be stored in the storage medium of a computer system and executed by at least one processor in the computer system to implement the processes of the embodiments including the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

Accordingly, the present invention also provides a computer storage medium for analyzing copy number variation, which stores a computer program, and when the computer program is executed, the steps of the method for analyzing copy number variation of any of the above embodiments are realized.

The copy number variation analysis method is suitable for all whole exon sequencing technologies, and in order to illustrate the technical effect generated by the analysis method, a gene defect screening package item is used as case analysis, the item is used for screening diseases by using an extracted blood sample, and the analysis is carried out by shifting from the traditional single-site screening to a second-generation sequencing platform. In this case, 2 positive samples and 1 negative sample (sample names: TEST001, TEST002 and TEST003, respectively) were selected. The specific analysis procedure is as follows:

the first step is as follows: the Illumina Nextseq500 platform is adopted for sequencing, and the sequencing adopts base sequencing with the reading length of 150bp and the single joint of 8 bp. Acquiring the bcl data of the lower machine, and performing bcl data conversion, wherein bcl2fastq-R < lower machine directory > < output directory > - -sample-sheet < sample sheet file > - -use-bases-mask y150n, I8 and y150 n. A fastq file of 3 samples was obtained. The data volume statistics of the machine-off are shown in table 1.

TABLE 1 sample offline data volume statistics

Sample name	Clusters(Raw)	Clusters(PF)	Yield(MBases)
				TEST001	255,944	255,944	77
TEST002	218,307	218,307	65
				TEST003	219,816	219,816	66

The second step is that: selecting a fastq file with a specific sequencing depth

The target region of the test item is about 80kb, and because of germline mutation, a minimum depth of 200 × per sample is required. The number of reads to be extracted (81314 × 200)/(150 × 0.9) 120465. Extraction is realized by adopting seqtk software: seqtk sample-s100infastq 120465> outfastq.

The third step: sequencing quality assessment

The fatqc < outfattq >, through the quality control analysis, the sequencing data of the three samples meet the requirements in the aspects of GC content, Q30 percentage, base composition and the like.

The fourth step is genome alignment and PCR repeat read marking

Genome alignment: and b, obtaining a comparison result bam file by the bw mem-M-t 2< R1.fastq.gz > < R2.fastq.gz >, and establishing an index file, namely samtools index < mapping.

PCR repeat read labeling: read, java-Xmx5g-jar marks duties, jar INPUT, < mapping, bam > OUTPUT, < out.dup.bam > circuitry _ FILE, < dup.metrics >.

The genome alignment statistics are shown in Table 2.

TABLE 2 sample genome alignment and sequencing depth statistics

The fifth step: depth of target region

In order to facilitate the analysis process and the standardization of documents, a depth calculation algorithm is integrated into an R language script getDepth _ NB.R, and the depths of different target areas are counted according to different genders. Autosomes of males and females were counted together, and X for females and X, Y for males were counted separately. Specifically, the comparison result bam files of all samples are placed in a bam.list file, the comparison result bam of a female sample is placed in a bamx.list file, the comparison result file of a male sample is placed in a bamy.list file, the target region of an autosome is stored in chra.bed, the target region of an X chromosome is placed in chrx.bed, the XY target region is placed in chry.bed, and the command of counting the depth is as follows:

R--slave--args chrA.bed bam.list outdir A<getDepth_NB.R

R--slave--args chrX.bed bamX.list outdir X<getDepth_NB.R

R--slave--args chrY.bed bamY.list outdir Y<getDepth_NB.R

the statistical results are put in the files chrA.data.info, chrX.data.info and chrY.data.info.

And a sixth step: CNV calculation

In order to facilitate analysis process and document standardization, all the various algorithms (target region expected read number calculation and ratio value algorithm; CNV ratio value algorithm) in the CNV calculation step are integrated in the R language script callCNV _ NB.R.

The concrete operation is the same as the fifth step, and the command for calculating the CNV is as follows:

R--slave--args chrA.bed bam.list outdir A<callCNV_NB.R

R--slave--args chrX.bed bamX.list outdir X<callCNV_NB.R

R--slave--args chrY.bed bamY.list outdir Y<callCNV_NB.R

this step obtains the CNV results for CNV on autosomal, X, Y chromosome.

The seventh step: CNV annotations and graphical presentations

And according to the analysis result in the sixth step, performing annotation and visualization on the CNV. TEST001 sample CNV annotation results see table 3 and fig. 7; TEST002 sample CNV annotation results see table 4 and fig. 8; TEST003 sample CNV annotation results see fig. 9, which is a negative sample.

TABLE 3 TEST001CNV notes

TABLE 4 TEST002CNV notes results

Based on the CNV results generated in the seventh step, MLPA probes were designed and the CNV results were verified, and the MLPAratio values of 3 samples are shown in FIGS. 10, 11 and 12. As shown in fig. 10, 11 and 12, the positions of CNV copy numbers calculated by the three samples were all consistent with the positions of CNVs obtained by the MLPA method and copy abnormality types, and the CNV calculated by the TEST001 sample was a repeat of the 3 rd to 9 th exon regions of DMD gene, consistent with the MLPA results; and calculating that the TEST002 sample has a repeated region in the exon 3-exon 7 region of the DMD gene, and the result is consistent with the verification result of MLPA. The absence of missing repeat regions in the TEST003 samples was calculated to be consistent with the MLPA validation results.

All positive CNVs differed substantially very closely from the copy number of the MLPA assay, calculating a ratio of 1.98 for TEST001 replicates and a DMD copy ratio of around 2 for the MLPA assay. And calculating the ratio value of TEST002 repetition to be 1.3, verifying by MLPA to obtain the copy ratio value of the DMD to be about 1.3, and testing other subsequent samples to obtain the ratio value calculated by the algorithm, wherein the difference between the ratio value and the verification result of the MLPA is not more than 0.3. The calculated CNV false positive rate is 0%, namely all positive CNV calculated by the algorithm can be verified by an MLPA experimental means.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments only express several embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for analyzing copy number variation, comprising the steps of:

step S1: obtaining DNA sequencing data of a genome target region;

2. The method for analyzing copy number variation according to claim 1, wherein in step S2, the number of reads to be extracted is (number of bases in target region x preset average depth)/(sequencing read length x correlation coefficient), wherein the correlation coefficient is less than 1;

3. The method for analyzing copy number variation according to claim 1, further comprising, after the step S2 and before the step S3:

4. The method for analyzing copy number variation according to claim 3, wherein the predetermined requirement is: the average read mass is more than 85% of the total read mass of Q30, the average GC content is between 40% and 55%, and the bases A, T, C and G respectively account for 25% +/-2%.

5. The method for analyzing copy number variation according to any one of claims 1 to 4, wherein the step S6 comprises:

step S63: obtaining a formula by said fifth application

The determined proportion of reads of the CNV region on the corresponding chromosome, and/or

Is obtained bySaid fifth application program according to formula CNV_copy＝CNV_ratio2 copy number of CNV region of female autosomal and X chromosomes, or CNV according to formula_copy＝CNV_ratio2 copy number of CNV region of male autosomal chromosomes and CNV according to the formula_copy＝CNV_ratioDetermining the copy number of the CNV region of the male X or Y chromosome;

6. the method for analyzing copy number variation according to claim 5, wherein the step S62 further comprises: obtaining a formula Ratio by the fifth application_i＝Y_i/exp(Y_i) Determined Ratio of reads for each target region of test sample_iWherein Y is_iThe number of reads for the ith target region of the test sample.

7. The method for analyzing copy number variation according to claim 6, further comprising step S7: and calling a seventh application program and an eighth application program through a seventh application program interface and an eighth application program interface respectively to obtain the results of the seventh application program and the eighth application program for annotating and graphically displaying the ratio of reads of all target areas and CNV areas.

8. An apparatus for analyzing copy number variation, comprising:

9. A computer device having a processor and a memory, the memory having stored thereon a computer program, the processor implementing the steps of the method of analyzing copy number variation according to any one of claims 1 to 7 when executing the computer program.

10. A computer storage medium having a computer program stored thereon, wherein the computer program when executed implements the steps of the method of analyzing copy number variation of any of claims 1-7.