CN110951853B - Method for accurately detecting DNA viruses in human genome - Google Patents
- Publication number
- CN110951853B (application CN201911264769.3A)
- Authority
- CN
- China
- Prior art keywords
- reads
- virus
- genome
- read
- human
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Chemical & Material Sciences (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Medical Informatics (AREA)
- Organic Chemistry (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Biotechnology (AREA)
- Genetics & Genomics (AREA)
- Analytical Chemistry (AREA)
- Zoology (AREA)
- Wood Science & Technology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Molecular Biology (AREA)
- Evolutionary Biology (AREA)
- Theoretical Computer Science (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Immunology (AREA)
- Databases & Information Systems (AREA)
- Public Health (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Epidemiology (AREA)
- Bioethics (AREA)
- Data Mining & Analysis (AREA)
- Microbiology (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Biochemistry (AREA)
- General Engineering & Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention discloses a method for accurately detecting DNA viruses in the human genome. The method accurately evaluates the infection type and load of a double-stranded DNA virus and, at the same time, accurately and flexibly determines the type of double-stranded DNA virus integrated into the human genome, its integration site, and the resulting fusion sequence. The method can detect multiple different viral infections simultaneously; it can finely classify reads from different subtypes of the same virus and thereby determine the viral load; and it is suitable for read types from different library construction sources, making it broadly applicable to NGS-based detection of viral pathogens.
Description
Technical Field
The invention relates to the technical field of virus detection, in particular to a method for accurately detecting DNA viruses in human genomes.
Background
Tumors associated with double-stranded DNA viruses are tumors that are closely related to double-stranded DNA virus infection and that arise when interactions between the virus and host cells after infection produce a series of biological effects that trigger carcinogenic mechanisms. They are often accompanied by infection with high-risk carcinogenic virus strains, insertion of high-risk carcinogenic viral genomic DNA into human cellular DNA, and co-infection with multiple virus subtypes during tumor progression. One example is human papillomavirus (HPV), which is associated with cervical cancer, head and neck tumors, and other cancers. Our previous studies found that the DNA of hepatitis B virus (HBV) and Epstein-Barr virus (EBV) is also commonly integrated into the human genome, and that DNA integration plays an essential role in carcinogenesis. Blocking integration has therefore become an important focus of research.
Taking HPV as an example, more than 200 HPV types have been discovered to date. HPV is a double-stranded DNA virus that specifically infects the squamous epithelial cells of human skin and mucosa. HPV infection is sexually transmitted; according to incomplete statistics, the genital HPV infection rate among sexually active young women is as high as 80%, and a woman may be infected with different HPV types at different periods of life or with several HPV types at the same time. Persistent infection with high-risk HPV types (15 types, including 16, 18, 31, 33, 35, 39, 45, 51, 52, 56, 58, 59, 68, 73, and 82) is the most critical causative factor in the development of cervical cancer. When the cervical epithelium is damaged, HPV can pass through the epidermis at the damaged site, enter the basal lamina, divide along with the basal lamina stem cells, and begin to replicate in large numbers in the squamous cells above the basal lamina; mature virus is released as the surface cells are shed. HPV generally exists in a free (episomal) state in cervical epithelial cells, but its DNA can also integrate into human chromosomes. Integration of high-risk HPV into the host genome is one of the decisive factors in the onset and development of cervical cancer, and studies show that high-risk HPV integration can be detected in more than 90% of cervical cancers. Identifying the HPV infection type, load, and integration status is therefore of great significance for the precise prevention and treatment of cervical cancer.
At present, clinical measurement of virus type and load relies mainly on Hybrid Capture 2 (HC2) and Cervista, which are based on hybridization signal amplification, and Cobas 4800, which is based on real-time PCR. None of these methods covers all HPV types or avoids cross-reaction between different HPV types and, most importantly, none of them can determine whether HPV has integrated or detect the integration state. Meanwhile, the rapid development of second-generation sequencing (NGS) technology has opened a new route to virus typing and integration detection. Whole-genome sequencing, whole-transcriptome sequencing, and virus-specific capture (targeted) sequencing provide an opportunity to detect virus type and state comprehensively.
Disclosure of Invention
In view of the above problems, the present invention aims to overcome the shortcomings of the prior art and to provide a method for accurately detecting DNA viruses in the human genome. The method accurately evaluates the infection type and load of double-stranded DNA viruses and, at the same time, accurately and flexibly determines the type of double-stranded DNA virus integrated into the human genome, its integration site, and the resulting fusion sequence.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following aspects:
In a first aspect, the present invention provides a method for detecting the virus type in the human genome, comprising the following steps:
1) collecting the genomes of all virus types from a database, treating them as pseudo-chromosomes, and combining them with the chromosomes of the human genome to obtain a mixed genome;
2) extracting and sequencing the DNA of a patient to obtain the patient's sequencing reads, and aligning them to the mixed genome obtained in step 1) (the first alignment);
3) collecting the alignments to non-human chromosomes from the result of step 2) and, for each specific virus type aligned to, classifying the reads according to the length ratio and the similarity ratio of the read in the first alignment, wherein the reads are screened using the following formulas:
$L_M \geq (L_M + L_S + L_H + L_I) \times 0.5$;
$3 \times L_I + 2 \times L_D + L_{MIS} \leq (L_M + L_D) \times 0.2$,
wherein $L_M$ denotes the length of the read aligned to the specific virus type; $L_S$ and $L_H$ denote the lengths at the two ends of the read relative to the viral DNA alignment; $L_I$ denotes the length of insertions in the middle of the read; $L_D$ denotes the length of deletions in the middle of the read; and $L_{MIS}$ denotes the length of single-base mismatches on the read;
4) counting the virus type and load from the reads that satisfy both formulas in step 3).
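The two screening inequalities of step 3) can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not the patented implementation: it assumes that $L_M$, $L_I$ and $L_D$ correspond to the summed `M`, `I` and `D` operations of a SAM CIGAR string, that $L_S$ and $L_H$ correspond to soft-clipped (`S`) and hard-clipped (`H`) bases, and that $L_{MIS}$ can be derived from the SAM `NM` edit distance as `NM - L_I - L_D`. The function names `cigar_lengths` and `passes_screen` are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_lengths(cigar):
    """Sum the length of each CIGAR operation type in one alignment."""
    totals = {}
    for length, op in CIGAR_RE.findall(cigar):
        totals[op] = totals.get(op, 0) + int(length)
    return totals


def passes_screen(cigar, nm):
    """Apply the two screening inequalities to one viral alignment.

    Assumptions (not stated in the source): L_S/L_H are soft/hard clips,
    and L_MIS = NM - L_I - L_D, where NM is the SAM edit distance.
    """
    t = cigar_lengths(cigar)
    l_m, l_s, l_h = t.get("M", 0), t.get("S", 0), t.get("H", 0)
    l_i, l_d = t.get("I", 0), t.get("D", 0)
    l_mis = max(nm - l_i - l_d, 0)
    # Integer forms of the two formulas, to avoid floating-point rounding:
    # L_M >= 0.5 * (L_M + L_S + L_H + L_I)
    length_ok = 2 * l_m >= l_m + l_s + l_h + l_i
    # 3*L_I + 2*L_D + L_MIS <= 0.2 * (L_M + L_D)
    similarity_ok = 5 * (3 * l_i + 2 * l_d + l_mis) <= l_m + l_d
    return length_ok and similarity_ok
```

For example, a fully matched 100 bp read with two mismatches (`passes_screen("100M", 2)`) passes, whereas a read with 60 of 100 bases clipped (`passes_screen("40M60S", 0)`) fails the length condition.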
It should be noted that the detection method of the present invention can accurately detect the specific type and relative load of the double-stranded DNA virus infection, and is suitable for diseases related to the double-stranded DNA virus infection, such as cervical diseases, head and neck diseases, HBV-related liver diseases, EBV-related lymphatic system diseases, nasopharyngeal diseases, and gastric diseases.
Preferably, the alignment in step 2) is performed using the BWA-MEM algorithm.
Preferably, step 2) further comprises removing PCR duplicates. More preferably, the PCR duplicates are removed using Picard MarkDuplicates.
Preferably, in step 4), for paired-end sequencing reads, the two reads of a pair enter the statistics of virus type and load only when both reads satisfy the two formulas in step 3).
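The paired-end rule can be sketched as a simple tally. The input shape (one tuple per read pair with the assigned virus type and a pass/fail flag for each mate) and the function name `count_reads_by_type` are hypothetical, and counting a passing pair as two reads is an assumption; the source only states that both reads must satisfy the formulas.

```python
from collections import Counter


def count_reads_by_type(pairs):
    """Tally screened reads per virus type for paired-end data.

    `pairs` is an iterable of (virus_type, mate1_passes, mate2_passes).
    A pair contributes only if BOTH mates passed the screening formulas.
    """
    counts = Counter()
    for virus_type, mate1_ok, mate2_ok in pairs:
        if mate1_ok and mate2_ok:
            counts[virus_type] += 2  # assumption: count both reads of the pair
    return counts
```

The resulting per-type counts give the relative load of each detected virus type; a pair in which only one mate passes is discarded entirely.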
In a second aspect, the present invention provides a method for detecting the viral content in the human genome, comprising the following step: based on the virus type and load statistics above, the viral copy number is quantified relative to a selectable internal reference gene, using the alignment of that internal reference gene to the mixed genome. The quantification formula is as follows:
wherein $CN_H$ is the copy number of the internal reference gene, 2 by default; $D_V$ is the effective cumulative depth of the viral genome, obtained by summing the number of times each single-base site of the viral genome is covered by all reads from step 3) above; $D_H$ is the effective cumulative depth of the internal reference gene, obtained in the same way by summing the number of times each single-base site is covered by all reads of the internal reference gene aligned to the mixed genome; $C_V$ is the alignment coverage of the viral genome, i.e. the length of the viral genome occupied by the single-base sites covered by all reads from step 3) above; $C_H$ is the alignment coverage of the internal reference gene, i.e. the length occupied by the single-base sites covered by all reads of the internal reference gene aligned to the mixed genome of step 1); $L_V$ is the effective length of the viral genome targeted by the sequencing probes; and $L_H$ is the effective length of the internal reference gene targeted by the sequencing probes.
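The cumulative depth ($D_V$, $D_H$) and alignment coverage ($C_V$, $C_H$) described above can be computed from the aligned read spans as in the following sketch. It covers only these two input quantities and does not reproduce the quantification formula itself, which is not given in this text. The function name `depth_and_coverage` and the 0-based half-open interval convention are assumptions.

```python
def depth_and_coverage(intervals, genome_length):
    """Return (cumulative depth D, covered length C) for one reference.

    `intervals` holds 0-based half-open (start, end) aligned spans of the
    screened reads on the viral genome or on the internal reference gene.
    """
    depth = [0] * genome_length
    for start, end in intervals:
        # Clamp each span to the reference, then add one unit of depth
        # to every single-base site the read covers.
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[pos] += 1
    cumulative_depth = sum(depth)                    # D: summed per-base counts
    covered_length = sum(1 for d in depth if d > 0)  # C: sites covered at least once
    return cumulative_depth, covered_length
```

Two reads spanning positions 0-5 and 3-8 of a 10 bp reference give a cumulative depth of 10 and a covered length of 8.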
In a third aspect, the present invention provides a method for detecting whether viral integration has occurred in the human genome and, if so, the integration sites, comprising the following steps:
constructing a reference genome of human and the corresponding virus types according to the virus genome types detected;
re-aligning all reads from the first alignment to this reference genome; and
detecting whether the virus has integrated, and the integration sites, from the type-specific alignment results, based on the principle of chimeric-read detection.
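The chimeric-read principle can be illustrated by classifying each read pair according to which genomes its mates align to. The text names three categories (single-end chimeric reads, paired-end chimeric reads, and distant paired-end cross-region reads) but defines them only by reference to FIG. 1, so the definitions in this sketch are assumptions made for illustration, and the function name `classify_pair` is hypothetical.

```python
def classify_pair(mate1_sources, mate2_sources):
    """Classify one read pair from the genomes each mate aligns to.

    Each argument is a set drawn from {"human", "virus"}.
    Assumed definitions (the source defers to FIG. 1):
      - paired-end chimeric: both mates contain human and viral sequence;
      - single-end chimeric: exactly one mate contains both;
      - distant paired-end cross-region: one mate is purely human and
        the other purely viral.
    """
    m1, m2 = frozenset(mate1_sources), frozenset(mate2_sources)
    both = frozenset({"human", "virus"})
    if m1 == both and m2 == both:
        return "paired-end chimeric"
    if m1 == both or m2 == both:
        return "single-end chimeric"
    if {m1, m2} == {frozenset({"human"}), frozenset({"virus"})}:
        return "distant paired-end cross-region"
    return "non-chimeric"
```

Pairs classified as non-chimeric carry no evidence of a human-virus junction and would not contribute to integration-site detection.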
Preferably, the method comprises the following steps:
S1, aligning all reads from the first alignment separately to the human reference genome;
S2, aligning all reads from the first alignment separately to the reference genome of the specific virus type;
S3, aligning all reads from the first alignment to the mixed reference genome of human and the corresponding virus type, and removing PCR duplicates from the alignment result using Picard MarkDuplicates;
S4, combining the results of steps S1 and S2 to classify the reads in the alignment result of step S3 into single-end chimeric reads, paired-end chimeric reads, and distant paired-end cross-region reads;
S5, for paired-end chimeric reads, merging the two reads into a single read for a second alignment; for single-end chimeric reads, performing the second alignment on the single chimeric read;
S6, filtering the reads in the alignment result of step S5;
S7, locally clustering all reads retained after the filtering in step S6 according to their positions on the human genome, retaining the sites supported by 3 or more reads, and annotating the gene positions and functions of these sites; and
S8, assembling the reads annotated in step S7, splitting the assembled sequences into viral and human parts, performing a third alignment of these parts to the mixed reference genome, and retaining the assembled sequences whose alignment result is consistent with the BWA-MEM alignment result.
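The local clustering in step S7 can be sketched as follows. The threshold of 3 supporting reads comes from the text; the clustering rule (reads on the same chromosome whose positions lie within `window` bases of the previous read join the same cluster) and the default `window=500` are assumptions, as is the function name `cluster_sites`.

```python
from collections import defaultdict


def cluster_sites(reads, window=500, min_reads=3):
    """Cluster reads by human-genome position and keep supported sites.

    `reads` is an iterable of (chromosome, position) tuples.
    Returns a list of (chromosome, first_pos, last_pos, read_count).
    `window` is an assumed parameter; only `min_reads=3` is from the text.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in reads:
        by_chrom[chrom].append(pos)

    sites = []
    for chrom in sorted(by_chrom):
        positions = sorted(by_chrom[chrom])
        cluster = [positions[0]]
        for pos in positions[1:]:
            if pos - cluster[-1] <= window:
                cluster.append(pos)  # close enough: extend the current cluster
            else:
                if len(cluster) >= min_reads:
                    sites.append((chrom, cluster[0], cluster[-1], len(cluster)))
                cluster = [pos]      # gap too large: start a new cluster
        if len(cluster) >= min_reads:
            sites.append((chrom, cluster[0], cluster[-1], len(cluster)))
    return sites
```

Sites returned by this function would then be passed to annotation; clusters with fewer than three reads are dropped as insufficiently supported.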
Preferably, the reads filtered out in step S6 include reads for which:
the result is inconsistent with the BWA-MEM alignment;
the viral or human portion of the read is too short (less than or equal to 30 bp);
the overlap between the viral and human portions of the read is too long (greater than or equal to 50% of the read length);
the alignment of the human portion of the read is not unique; or
the human portion of the read originates from a low-complexity repeat region of DNA.
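The filtering criteria above can be expressed as a single predicate. The sketch below is illustrative only: the `ChimericRead` record and its field names are hypothetical, and interpreting the "too long" criterion as the overlap between the human-aligned and virus-aligned spans on the read is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ChimericRead:
    read_length: int
    human_span: tuple          # (start, end) on the read, 0-based half-open
    virus_span: tuple          # (start, end) on the read, 0-based half-open
    human_hit_unique: bool     # human portion aligns to exactly one locus
    human_in_repeat: bool      # human portion falls in a repeat region
    consistent_with_bwa: bool  # second alignment agrees with BWA-MEM


def keep_read(r, min_part=30, max_overlap_fraction=0.5):
    """Return True if the read survives all five filters of step S6."""
    human_len = r.human_span[1] - r.human_span[0]
    virus_len = r.virus_span[1] - r.virus_span[0]
    overlap = max(
        0,
        min(r.human_span[1], r.virus_span[1])
        - max(r.human_span[0], r.virus_span[0]),
    )
    if not r.consistent_with_bwa:
        return False
    if human_len <= min_part or virus_len <= min_part:  # part <= 30 bp
        return False
    if overlap >= max_overlap_fraction * r.read_length:  # overlap >= 50%
        return False
    if not r.human_hit_unique:
        return False
    if r.human_in_repeat:
        return False
    return True
```

A 150 bp read whose human portion spans bases 0-80 and whose viral portion spans bases 75-150, with a unique non-repeat human hit, is retained; the same read with a 30 bp human portion is discarded.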
Preferably, ANNOVAR is used to annotate the gene positions and functions in step S7; IDBA-UD is used for the assembly in step S8; and both the second alignment in step S5 and the third alignment in step S8 use BLASTN.
In conclusion, the beneficial effects of the invention are as follows:
the method can detect multiple different viral infections simultaneously;
the method can finely classify reads from different subtypes of the same virus and thereby determine the viral load;
the method is suitable for read types from different library construction sources and is broadly applicable to NGS-based detection of viral pathogens;
the method can accurately detect the type of virus integrated into the human genome, its integration site, and the specific integration sequence, providing solid support for downstream verification.
Drawings
FIG. 1 is a schematic flow diagram of a method of detecting viral integration in the human genome according to the present invention;
FIG. 2 is a graph showing the relationship between the highest-load HPV type and the number of integration sites of the integrated HPV types in various HPV infection samples, wherein about 66.7% of the HPV types integrated in various HPV infection samples are the types with the highest viral load;
FIG. 3 is a graph showing the result of HPV typing in example 1, wherein the vertical axis represents the number of effective HPV alignments;
FIG. 4 is a graph showing the result of HPV typing in example 2, wherein the vertical axis represents the number of effective HPV alignments;
FIG. 5 is a graph showing the result of HPV typing in example 3, wherein the vertical axis represents the number of effective HPV alignments;
FIG. 6 is a graph showing the result of HPV typing in example 4, wherein the vertical axis represents the number of effective HPV alignments;
FIG. 7 is a graph showing the result of HPV typing in example 5, wherein the vertical axis represents the number of effective HPV alignments;
FIG. 8 is a graph showing the result of HPV typing in example 6, wherein the vertical axis represents the number of effective HPV alignments;
FIG. 9 is a graph of read support statistics for a sample with the highest viral load Type 1;
FIG. 10 is a graph of statistical read support counts for samples with the highest viral load Type 2.
Detailed Description
In some embodiments, the invention provides a method for accurately detecting the infection type and load of double-stranded DNA viruses, viral integration breakpoints, and human-virus genome fusion sequences. Detection results from this method guide virus-related tumor screening and treatment decisions, making them more accurate and efficient. The method can detect the main infecting virus types with carcinogenic potential, assess cancer risk from the occurrence of integration, guide personalized screening strategies for related tumors, and support antiviral and antitumor targeted treatment plans for cancer patients according to the number and biological significance of the viral integration sites.
In some embodiments, the invention provides a method for calculating the infection type of a double-stranded DNA virus. The method is based on second-generation sequencing reads and is best suited to virus capture sequencing. It filters the alignment information of the sequencing reads, accurately selects the reads that originate from viral DNA, removes the duplication bias that may be introduced during library construction, and counts the reads of each type of viral DNA, which indirectly reflects the load of each infecting virus type. Specifically, the method comprises the following steps:
1) for the initial alignment, the genomes of all virus types collected from a database are treated as pseudo-chromosomes and merged with the chromosomes of the human genome to construct a mixed genome;
2) the alignment software uses the BWA-MEM algorithm, which supports locally optimal alignment; after alignment, Picard MarkDuplicates is used to remove PCR duplicates;
3) the alignments to non-human chromosomes are collected from the alignment result and, for each specific virus type aligned to, the reads are accurately classified a second time according to the length ratio and the similarity ratio of the read alignment, using the following read screening formulas:
$L_M \geq (L_M + L_S + L_H + L_I) \times 0.5$
$3 \times L_I + 2 \times L_D + L_{MIS} \leq (L_M + L_D) \times 0.2$
wherein $L_M$ denotes the length of the read aligned to the specific virus type; $L_S$ and $L_H$ denote the lengths of the two ends (larger fragments) of the read relative to the viral DNA alignment; $L_I$ denotes the length of insertions (small fragments) in the middle of the read; $L_D$ denotes the length of deletions (small fragments) in the middle of the read; and $L_{MIS}$ denotes the length of single-base mismatches on the read;
4) reads that satisfy both conditions (formulas) above enter the statistics of virus type load; for paired-end sequencing reads, the two reads enter the downstream statistics only if both satisfy the two conditions;
5) the steps above give preliminary statistics of the virus type load; the viral copy number is then quantified relative to a selectable internal reference gene according to the alignment of that gene, and the quantification formula is as follows:
wherein $CN_H$ is the copy number of the internal reference gene, 2 by default; $D_V$ is the effective cumulative depth of the viral genome, obtained by summing the number of times each single-base site of the viral genome is covered by all reads from step 3) above; $D_H$ is the effective cumulative depth of the internal reference gene, obtained in the same way by summing the number of times each single-base site is covered by all reads of the internal reference gene aligned to the mixed genome; $C_V$ is the alignment coverage of the viral genome, i.e. the length of the viral genome occupied by the single-base sites covered by all reads from step 3) above; $C_H$ is the alignment coverage of the internal reference gene, i.e. the length occupied by the single-base sites covered by all reads of the internal reference gene aligned to the mixed genome; $L_V$ is the effective length of the viral genome targeted by the sequencing probes; and $L_H$ is the effective length of the internal reference gene targeted by the sequencing probes.
The internal reference genes used in calculating the viral copy number are chosen to suit the fixed probes and sequencing reads of the particular library construction and detection system, and include, but are not limited to, conserved genes of the human genome.
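Step 1) above, in which viral genomes are appended to the human chromosomes as pseudo-chromosomes, can be sketched as a FASTA merge. This is a minimal illustration, not the procedure actually used: the function names `read_fasta` and `build_mixed_genome` are hypothetical, and returning the set of viral sequence names (so that later steps can recognise alignments to non-human chromosomes) is a design choice of this sketch.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) records from a FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def build_mixed_genome(human_handle, virus_handle, out_handle, width=60):
    """Write human chromosomes followed by viral pseudo-chromosomes.

    Returns the set of viral sequence names in the mixed genome.
    """
    virus_names, seen = set(), set()
    for source, handle in (("human", human_handle), ("virus", virus_handle)):
        for name, seq in read_fasta(handle):
            if name in seen:
                raise ValueError(f"duplicate sequence name: {name}")
            seen.add(name)
            if source == "virus":
                virus_names.add(name)
            out_handle.write(f">{name}\n")
            for i in range(0, len(seq), width):
                out_handle.write(seq[i:i + width] + "\n")
    return virus_names
```

The handles can be open files or in-memory `io.StringIO` objects; the merged FASTA would then be indexed by the aligner before the first alignment.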
In some embodiments, the above algorithm is mainly applied to whole-genome sequencing, exome sequencing, whole-transcriptome sequencing, and virus capture sequencing data from clinical patients with double-stranded DNA virus infection. By mining the DNA sequencing data, the method scans for the double-stranded DNA virus types and subtypes present and identifies the DNA fusion site sequences for each virus type that reaches the detection level. The patient's virus-related carcinogenic risk is then judged comprehensively from the infection type, the infection abundance, the integration into the human genome, and the significance of the corresponding integration sites, in order to guide clinical early-warning decisions and treatment plans. The main application scenarios are as follows:
1. determining the infection type of double-stranded DNA viruses such as human papillomavirus (HPV), hepatitis B virus (HBV), and Epstein-Barr virus (EBV). These viruses have numerous types and variant strains, and mixed multi-type infections are common; with this detection method, the viral DNA types that reach the detection abundance in the DNA sequencing data can be detected and the abundance of each type counted;
2. predicting the sites at which viral DNA integrates into the human genome and interpreting the significance of the corresponding human genome integration sites, in order to guide clinical intervention against viral integration and prevent progression of the corresponding tumor;
3. applying the method to the genomes of tumor patients to perform pattern recognition on large-fragment structural variation and predict pan-cancer prognosis;
4. applying the method to the genomes of tumor patients to predict the pan-cancer treatment response to antitumor drugs based on the synthetic lethality principle, such as those directed at POLQ and PARP1.
In some embodiments, the reference genomes of human and corresponding virus types are specifically constructed according to the virus genome types detected by the algorithm, all reads are compared again, and the detection of virus integration and integration sites is carried out based on the detection principle of chimera reads according to the comparison result of type specificity.
The method specifically comprises the following steps:
1. aligning all reads separately to the human reference genome;
2. aligning all reads separately to the reference genome of the specific virus type;
3. aligning all reads to a mixed reference genome of human and the corresponding virus type, and removing PCR duplicates from the alignment results with Picard MarkDuplicates;
4. combining the results of steps 1 and 2, statistically classifying the reads in the alignment results of step 3 into single-end chimeric reads, paired-end chimeric reads and distant paired-end cross-region reads (see FIG. 1);
5. processing single-end and paired-end chimeric reads case by case: the two mates of a paired-end chimeric read are merged into one integrated read for a second alignment with BLASTN; for single-end chimeric reads, the second BLASTN alignment is performed on the single chimeric read;
6. filtering the alignment results of step 5, the reads filtered out being those for which: the result is inconsistent with the BWA-MEM alignment; the viral or human portion of the read is too short (less than or equal to 30 bp); the overlap between the viral and human alignments on the read is too long (greater than or equal to 50% of the read length); the alignment of the human portion of the read is not unique; or the human portion of the read comes from a low-repeat region of DNA;
7. locally clustering all reads retained after the filtering in step 6 according to the position of their human portion, retaining the sites supported by 3 or more reads, and annotating the gene position and function of the sites with ANNOVAR software;
8. assembling the reads from step 7 with IDBA-UD software, splitting the assembled sequences into viral and human parts, performing a third BLASTN alignment, and retaining the assembled sequences whose alignment results are consistent with the BWA-MEM alignment.
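The local clustering in step 7 can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the invention: the function name `cluster_breakpoints`, the representation of each read as a (chromosome, position) breakpoint and the 10 bp clustering window are all hypothetical, since the text specifies only the threshold of 3 supporting reads.

```python
from collections import defaultdict


def cluster_breakpoints(breakpoints, window=10, min_reads=3):
    """Cluster human-side breakpoints and keep well-supported sites.

    breakpoints: iterable of (chrom, pos), one entry per retained read.
    A breakpoint joins the current cluster if it lies within `window` bp
    of the previous breakpoint on the same chromosome (assumed rule).
    Returns a list of (chrom, first_pos, last_pos, n_reads).
    """
    by_chrom = defaultdict(list)
    for chrom, pos in breakpoints:
        by_chrom[chrom].append(pos)

    sites = []
    for chrom, positions in sorted(by_chrom.items()):
        positions.sort()
        cluster = [positions[0]]
        for pos in positions[1:]:
            if pos - cluster[-1] <= window:
                cluster.append(pos)
            else:
                if len(cluster) >= min_reads:
                    sites.append((chrom, cluster[0], cluster[-1], len(cluster)))
                cluster = [pos]
        if len(cluster) >= min_reads:
            sites.append((chrom, cluster[0], cluster[-1], len(cluster)))
    return sites


# Toy input (invented coordinates): three reads support one chr8 site.
reads = [("chr8", 1000), ("chr8", 1003), ("chr8", 1005), ("chr8", 5000), ("chr3", 42)]
sites = cluster_breakpoints(reads)
```

Only the chr8 cluster around position 1000 survives, because the other two breakpoints are each supported by a single read.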
In one embodiment, when multiple virus types of the same species are detected, risk assessment can be performed according to the viral integration sites obtained by the above method, so as to identify the virus type carrying the major oncogenic risk; the integrated virus type is consistent with the type having the highest viral load, with a consistency rate of approximately 70% (see FIG. 2).
In one embodiment, the free DNA sample is obtained from cells or tissue collected during routine examination, including but not limited to cervical exfoliated cells, liver puncture cells, lymph node biopsies, blood and saliva.
In some embodiments, if integration sites of a tumor-associated virus are found in a patient sample, the frequency of downstream clinical monitoring can be increased or a more aggressive clinical treatment plan adopted; conversely, if none are found, the frequency of clinical monitoring can be reduced or the treatment plan de-escalated (see Examples 1-6).
In one embodiment, the virus types involved in the present invention may comprise one or more double-stranded DNA viruses such as HPV, HBV and EBV; for a given virus, any of its various types may be included; for a given type of a given virus, probes may be designed against the entire genome or against selected regions of the genome.
To better illustrate the objects, aspects and advantages of the present invention, the present invention is further described with reference to the accompanying drawings and specific embodiments. The following examples, covering infection-type distribution and integration-site detection in cervical cancer, nasopharyngeal carcinoma and liver cancer samples, are for illustrative purposes only and are not intended to be limiting. Unless otherwise specified, the experimental methods in the present invention are all conventional methods.
Example 1
One embodiment of the method for accurately detecting a DNA virus in a human genome of the present invention comprises the steps of:
For patient A, who had mild cervicitis, a portion of cervical tissue was taken and subjected to capture sequencing. The sequencing data were analyzed as follows.
The alignment information of the sequencing reads is filtered to accurately select reads derived from viral DNA, the duplication bias that may be introduced during library construction is removed, and the number of reads for each viral DNA type is counted as an indirect measure of the load of each infecting virus type. The specific steps are as follows:
(1) in the initial alignment, the genomes of all HPV types collected from the papillomavirus genome database PaVE are treated as pseudo-chromosomes and merged with the chromosomes of the human genome to construct a mixed genome;
(2) alignment is performed with the BWA-MEM algorithm, which supports local optimal alignment, and PCR duplicates are removed after alignment with Picard MarkDuplicates;
(3) the alignments to the HPV genomes are counted and, for reads aligned to a specific HPV type, a second, precise classification is performed according to the aligned-length ratio and similarity ratio of each read, using the following screening formulas:
$L_M \ge (L_M + L_S + L_H + L_I) \times 0.5$;
$3 \times L_I + 2 \times L_D + L_{MIS} \le (L_M + L_D) \times 0.2$,
where $L_M$ is the length of the read aligned to the specific virus type; $L_S$ and $L_H$ are the lengths of the (larger) segments at the two ends of the read that are not aligned to the viral DNA; $L_I$ is the length of (small) insertions in the middle of the read; $L_D$ is the length of (small) deletions in the middle of the read; and $L_{MIS}$ is the length of single-base mismatches on the read;
(4) reads satisfying both conditions enter the virus-type load statistics; for paired-end sequencing reads, a pair enters the downstream statistics only when both reads satisfy the conditions.
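The two screening inequalities of step (3) can be evaluated directly from an alignment's CIGAR string. The sketch below is illustrative only and rests on assumptions: that $L_S$ and $L_H$ correspond to the soft-clipped (S) and hard-clipped (H) lengths in the SAM CIGAR string, that $L_M$, $L_I$ and $L_D$ correspond to the M, I and D operations, and that the mismatch count is supplied by the caller. The function names `passes_screen` and `pair_passes` are hypothetical.

```python
import re

# SAM CIGAR operations: a length followed by one operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def passes_screen(cigar, mismatches):
    """Apply both screening inequalities to a single alignment."""
    length = dict.fromkeys("MIDSH", 0)
    for count, op in CIGAR_RE.findall(cigar):
        if op in ("=", "X"):      # extended match/mismatch ops counted as M
            op = "M"
        if op in length:
            length[op] += int(count)
    l_m, l_i, l_d = length["M"], length["I"], length["D"]
    l_s, l_h = length["S"], length["H"]
    # Formula 1: the aligned part is at least half of the read.
    long_enough = l_m >= (l_m + l_s + l_h + l_i) * 0.5
    # Formula 2: weighted indels plus mismatches stay within 20%.
    similar_enough = 3 * l_i + 2 * l_d + mismatches <= (l_m + l_d) * 0.2
    return long_enough and similar_enough


def pair_passes(cigar1, mm1, cigar2, mm2):
    """For paired-end data, both mates must satisfy the conditions."""
    return passes_screen(cigar1, mm1) and passes_screen(cigar2, mm2)
```

For instance, `passes_screen("100M50S", 2)` holds because 100 of 150 bases are aligned and the two mismatches are well below the 20% bound, whereas `passes_screen("40M110S", 0)` fails the first inequality.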
Based on the detected viral genome types, two HPV types (HPV31 and HPV33) were found in the sample from patient A. Integration-site detection was performed separately for each of the two infection types. For HPV31, a mixed reference genome of human and HPV31 was constructed, all reads were aligned again, and the presence or absence of viral integration and the integration sites were detected from the type-specific alignment results based on the chimeric-read detection principle.
The specific steps are as follows:
(1) All reads were aligned separately to the human reference genome.
(2) All reads were aligned separately to the HPV31 reference genome.
(3) All reads were aligned to the mixed reference genome of human and HPV31, and PCR duplicates were removed from the alignment results with Picard MarkDuplicates.
(4) Combining the results of steps (1) and (2), the reads in the alignment results of step (3) were statistically classified into single-end chimeric reads, paired-end chimeric reads and distant paired-end cross-region reads (see FIG. 1).
(5) Single-end and paired-end chimeric reads were processed case by case: the two mates of a paired-end chimeric read were merged into one integrated read for a second alignment with BLASTN; for single-end chimeric reads, the second BLASTN alignment was performed on the single chimeric read.
(6) The alignment results of step (5) were filtered, the reads filtered out being those for which: the result is inconsistent with the BWA-MEM alignment; the viral or human portion of the read is too short (less than or equal to 30 bp); the overlap between the viral and human alignments on the read is too long (greater than or equal to 50% of the read length); the alignment of the human portion of the read is not unique; or the human portion of the read comes from a low-repeat region of DNA.
(7) All reads retained after the filtering in step (6) were locally clustered according to the position of their human portion, the sites supported by 3 or more reads were retained, and the gene position and function of the sites were annotated with ANNOVAR software.
(8) The reads from step (7) were assembled with IDBA-UD software, the assembled sequences were split into viral and human parts, a third BLASTN alignment was performed, and the assembled sequences whose alignment results are consistent with the BWA-MEM alignment were retained.
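The five filters of step (6) can be expressed as a single predicate over a chimeric-read record. The sketch below is a hedged illustration: the record type `ChimericHit` and all of its field names are hypothetical, and the "too short" filter is read here as rejecting a read when either its viral or its human portion is 30 bp or less, which is one interpretation of the text.

```python
from dataclasses import dataclass


@dataclass
class ChimericHit:
    read_length: int
    viral_len: int                 # bases aligned to the virus by BLASTN
    human_len: int                 # bases aligned to human by BLASTN
    overlap_len: int               # bases claimed by both alignments
    agrees_with_bwa: bool          # BLASTN placement matches BWA-MEM
    human_unique: bool             # human portion aligns to a single locus
    human_in_flagged_region: bool  # human portion lies in a region excluded in step (6)


def keep_hit(hit, min_part=30, max_overlap_frac=0.5):
    """Return True if the hit survives all five filters of step (6)."""
    if not hit.agrees_with_bwa:
        return False
    if hit.viral_len <= min_part or hit.human_len <= min_part:
        return False
    if hit.overlap_len >= max_overlap_frac * hit.read_length:
        return False
    if not hit.human_unique:
        return False
    if hit.human_in_flagged_region:
        return False
    return True


# Toy records (invented values) for a 150 bp read.
good = ChimericHit(150, 80, 75, 5, True, True, False)
too_short = ChimericHit(150, 30, 120, 0, True, True, False)
```

Here `keep_hit(good)` is true, while `keep_hit(too_short)` is false because the viral portion is only 30 bp.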
The results showed that no integration site of HPV31 on the human genome was detected.
For HPV33, the same analysis was performed and two different integration sites were detected (see table 1 for results).
TABLE 1 HPV33 integration results
In summary, infection with two HPV types (HPV31 and HPV33, see FIG. 3) was found in patient A, and high-risk HPV33 was found to be integrated at two different sites of the human genome. Colposcopy can therefore follow to avoid a missed diagnosis, which differs from the recommendation of the international guidelines for cases that are not HPV16- or HPV18-positive.
Example 2
In patient B, who had mild cervicitis (for the detection method, see Example 1), infection with three HPV types (HPV16, HPV31 and HPV56) was found (as shown in FIG. 4), but no HPV integration was found. Follow-up can therefore be continued to avoid unnecessary colposcopy, which differs from the international guideline recommendation of colposcopy referral for HPV16-positive cases.
Example 3
During the follow-up of patient C, who had a low-grade cervical lesion (for the detection method, see Example 1), high-risk HPV16 infection was found (as shown in FIG. 5), but no HPV integration was found. Follow-up can therefore be continued to avoid unnecessary colposcopy, which differs from the international guideline recommendation of colposcopy referral for HPV16-positive cases.
Example 4
During the subsequent follow-up of patient D, who had a low-grade cervical lesion (for the detection method, see Example 1), persistent infection with multiple HPV types (HPV16 and HPV56) was found (as shown in FIG. 6), and high-risk HPV56 was found to be integrated into the human genome (see Table 2). Colposcopy can therefore follow to avoid disease progression.
TABLE 2 HPV56 integration results
Example 5
In patient E, who had a high-grade cervical lesion (for the detection method, see Example 1), high-risk HPV16 infection was found (as shown in FIG. 7), accompanied by 2 integration sites (see Table 3). Surgical treatment can therefore be adopted to avoid progression to cervical cancer.
TABLE 3 HPV16 integration results
Example 6
In cervical cancer patient F (for the detection method, see Example 1), infection with multiple HPV types (HPV16 and HPV18) was found (as shown in FIG. 8), and high-risk HPV18 was found to be integrated at two different sites of the human genome (see Table 4). The integration site is located near the human CHRAC1 gene, which can guide personalized clinical medication.
TABLE 4 HPV18 integration results
Example 7
112 nasopharyngeal carcinoma samples were collected from the First Affiliated Hospital of Sun Yat-sen University, and the EBV infection type was detected by one embodiment of the method of the present invention, comprising the following steps:
(1) In the initial alignment, the genomes of the two EBV types, Type 1 and Type 2, were collected from the literature and the NCBI database, treated as pseudo-chromosomes and merged with the chromosomes of the human genome to construct a mixed genome.
(2) Alignment was performed with the BWA-MEM algorithm, which supports local optimal alignment, and PCR duplicates were removed after alignment with Picard MarkDuplicates.
(3) The alignments to the EBV genomes were counted and, for reads aligned to a specific virus type, a second, precise classification was performed according to the aligned-length ratio and similarity ratio of each read, using the following screening formulas:
$L_M \ge (L_M + L_S + L_H + L_I) \times 0.5$;
$3 \times L_I + 2 \times L_D + L_{MIS} \le (L_M + L_D) \times 0.2$,
where $L_M$ is the length of the read aligned to the specific virus type; $L_S$ and $L_H$ are the lengths of the (larger) segments at the two ends of the read that are not aligned to the viral DNA; $L_I$ is the length of (small) insertions in the middle of the read; $L_D$ is the length of (small) deletions in the middle of the read; and $L_{MIS}$ is the length of single-base mismatches on the read;
(4) Reads satisfying both conditions entered the virus-type load statistics; for paired-end sequencing reads, a pair entered the downstream statistics only when both reads satisfied the conditions.
For each sample, the EBV type with the highest infection load and its number of supporting reads were counted. The results are shown in FIGS. 9 and 10: in most samples, the type with the highest viral load was Type 1.
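The per-sample statistic described here, the type with the highest load together with its supporting read count, amounts to a simple tally over the type assigned to each screened read. The sketch below is illustrative; the function name `dominant_type` and the toy read counts are assumptions and do not reflect the actual data of the 112 samples.

```python
from collections import Counter


def dominant_type(read_types):
    """Return (type, supporting read count) for the most abundant type.

    read_types: iterable with one virus-type label per screened read.
    Returns None when no read passed the screening.
    """
    counts = Counter(read_types)
    if not counts:
        return None
    return counts.most_common(1)[0]


# Toy sample (invented counts): 120 Type1 reads and 15 Type2 reads.
sample = ["Type1"] * 120 + ["Type2"] * 15
best = dominant_type(sample)
```

Repeating the call for every sample yields the per-sample dominant type and read support that the figures summarize.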
Example 8
For liver cancer patient G, liver cancer tissue and paracancerous tissue were taken and separately subjected to capture sequencing, and the infection types and integration sites of the cancer tissue and the paracancerous tissue were detected by the method of the present invention. The specific detection steps for the cancer tissue are as follows:
The alignment information of the sequencing reads of the cancer tissue is filtered to accurately select reads derived from viral DNA, the duplication bias that may be introduced during library construction is removed, and the number of reads for each viral DNA type is counted as an indirect measure of the load of each infecting virus type. The specific steps are as follows:
(1) In the initial alignment, the genomes of 11 HBV viruses were collected from the literature and the NCBI database, and all collected HBV genomes were treated as pseudo-chromosomes and merged with the chromosomes of the human genome to construct a mixed genome.
(2) Alignment was performed with the BWA-MEM algorithm, which supports local optimal alignment, and PCR duplicates were removed after alignment with Picard MarkDuplicates.
(3) The alignments to the HBV genomes were counted and, for reads aligned to a specific HBV type, a second, precise classification was performed according to the aligned-length ratio and similarity ratio of each read, using the following screening formulas:
$L_M \ge (L_M + L_S + L_H + L_I) \times 0.5$;
$3 \times L_I + 2 \times L_D + L_{MIS} \le (L_M + L_D) \times 0.2$,
where $L_M$ is the length of the read aligned to the specific virus type; $L_S$ and $L_H$ are the lengths of the (larger) segments at the two ends of the read that are not aligned to the viral DNA; $L_I$ is the length of (small) insertions in the middle of the read; $L_D$ is the length of (small) deletions in the middle of the read; and $L_{MIS}$ is the length of single-base mismatches on the read;
(4) Reads satisfying both conditions entered the virus-type load statistics; for paired-end sequencing reads, a pair entered the downstream statistics only when both reads satisfied the conditions.
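Constructing the mixed genome of step (1) comes down to appending the viral sequences to the human reference as additional FASTA records, so that each viral genome behaves as a pseudo-chromosome during alignment. The sketch below is a minimal illustration; the function name `build_mixed_reference`, the file names and the toy sequences are assumptions, and a real run would use the full human reference and the collected viral genomes.

```python
import os
import tempfile


def build_mixed_reference(human_fasta, viral_fastas, out_path):
    """Concatenate FASTA files into one reference; return sequence names."""
    names, seen = [], set()
    with open(out_path, "w") as out:
        for path in [human_fasta, *viral_fastas]:
            with open(path) as fh:
                for line in fh:
                    line = line.rstrip("\n")
                    if not line:
                        continue
                    if line.startswith(">"):
                        name = line[1:].split()[0]
                        # Names must be unique so alignments stay unambiguous.
                        if name in seen:
                            raise ValueError(f"duplicate sequence name: {name}")
                        seen.add(name)
                        names.append(name)
                    out.write(line + "\n")
    return names


# Toy demonstration with made-up sequences (not real genome data).
with tempfile.TemporaryDirectory() as tmp:
    human = os.path.join(tmp, "human.fa")
    virus = os.path.join(tmp, "virus.fa")
    mixed = os.path.join(tmp, "mixed.fa")
    with open(human, "w") as fh:
        fh.write(">chr1 toy\nACGTACGT\n")
    with open(virus, "w") as fh:
        fh.write(">AB014381 toy\nTTGGCCAA\n")
    names = build_mixed_reference(human, [virus], mixed)
    with open(mixed) as fh:
        merged_text = fh.read()
```

The merged file would then be indexed by the aligner before the first alignment; that indexing step is outside this sketch.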
Based on the detected viral genome types, infection with three HBV types (AB014381, AF090842, AB033554) was found in the cancer sample from patient G. Integration-site detection was performed separately for each of the three infection types. For AB014381, a mixed reference genome of human and AB014381 was constructed, all reads were aligned again, and the presence or absence of viral integration and the integration sites were detected from the type-specific alignment results based on the chimeric-read detection principle.
The specific steps are as follows:
(1) All reads were aligned separately to the human reference genome.
(2) All reads were aligned separately to the AB014381 reference genome.
(3) All reads were aligned to the mixed reference genome of human and AB014381, and PCR duplicates were removed from the alignment results with Picard MarkDuplicates.
(4) Combining the results of steps (1) and (2), the reads in the alignment results of step (3) were statistically classified into single-end chimeric reads, paired-end chimeric reads and distant paired-end cross-region reads (see FIG. 1).
(5) Single-end and paired-end chimeric reads were processed case by case: the two mates of a paired-end chimeric read were merged into one integrated read for a second alignment with BLASTN; for single-end chimeric reads, the second BLASTN alignment was performed on the single chimeric read.
(6) The alignment results of step (5) were filtered, the reads filtered out being those for which: the result is inconsistent with the BWA-MEM alignment; the viral or human portion of the read is too short (less than or equal to 30 bp); the overlap between the viral and human alignments on the read is too long (greater than or equal to 50% of the read length); the alignment of the human portion of the read is not unique; or the human portion of the read comes from a low-repeat region of DNA.
(7) All reads retained after the filtering in step (6) were locally clustered according to the position of their human portion, the sites supported by 3 or more reads were retained, and the gene position and function of the sites were annotated with ANNOVAR software.
(8) The reads from step (7) were assembled with IDBA-UD software, the assembled sequences were split into viral and human parts, a third BLASTN alignment was performed, and the assembled sequences whose alignment results are consistent with the BWA-MEM alignment were retained.
For the two other infection types detected in the cancer tissue, AF090842 and AB033554, the integration-site detection procedure above was repeated. In the end, integration sites were detected only for AB014381, with 3 sites in total (as shown in Table 5 below).
Infection-type detection was performed on the paracancerous tissue in the same manner, and infection with the same three HBV types (AB014381, AF090842, AB033554) was detected. Integration-site detection was performed separately for each of the three infection types. In the end, 2 integration sites were detected for AB014381 (as shown in Table 5 below).
TABLE 5 AB014381 Virus integration site
Example 9
Cervical brush samples were collected from women at the cervical screening clinic of the First Affiliated Hospital of Sun Yat-sen University and preserved in BD SurePath LBC cell preservation solution. Genomic DNA was extracted with the EasyPure Genomic DNA Kit (TransGen Biotech, Beijing), fragmented with a Bioruptor Pico instrument, ligated to adapters and purified to prepare a DNA library. The library was hybridized with HPV probe DNA and captured with magnetic beads, and the captured fragments were sequenced by high-throughput paired-end PE150 sequencing. The sequencing data were then analyzed by the method of the present invention as follows:
The alignment information of the sequencing reads is filtered to accurately select reads derived from viral DNA, the duplication bias that may be introduced during library construction is removed, and the number of reads for each viral DNA type is counted as an indirect measure of the load of each infecting virus type. The specific steps are as follows:
(1) In the initial alignment, the genomes of all HPV types collected from the papillomavirus genome database PaVE were treated as pseudo-chromosomes and merged with the chromosomes of the human genome to construct a mixed genome.
(2) Alignment was performed with the BWA-MEM algorithm, which supports local optimal alignment, and PCR duplicates were removed after alignment with Picard MarkDuplicates.
(3) The alignments to the HPV genomes were counted and, for reads aligned to a specific virus type, a second, precise classification was performed according to the aligned-length ratio and similarity ratio of each read, using the following screening formulas:
$L_M \ge (L_M + L_S + L_H + L_I) \times 0.5$;
$3 \times L_I + 2 \times L_D + L_{MIS} \le (L_M + L_D) \times 0.2$,
where $L_M$ is the length of the read aligned to the specific virus type; $L_S$ and $L_H$ are the lengths of the (larger) segments at the two ends of the read that are not aligned to the viral DNA; $L_I$ is the length of (small) insertions in the middle of the read; $L_D$ is the length of (small) deletions in the middle of the read; and $L_{MIS}$ is the length of single-base mismatches on the read;
(4) Reads satisfying both conditions entered the virus-type load statistics; for paired-end sequencing reads, a pair entered the downstream statistics only when both reads satisfied the conditions.
For each detected viral genome type, a type-specific reference genome of human plus the corresponding virus type was constructed, all reads were aligned again, and the presence or absence of viral integration and the integration sites were detected from the type-specific alignment results based on the chimeric-read detection principle.
The specific steps are as follows:
(1) All reads were aligned separately to the human reference genome.
(2) All reads were aligned separately to the reference genome of the specific virus type.
(3) All reads were aligned to the mixed reference genome of human and the corresponding virus type, and PCR duplicates were removed from the alignment results with Picard MarkDuplicates.
(4) Combining the results of steps (1) and (2), the reads in the alignment results of step (3) were statistically classified into single-end chimeric reads, paired-end chimeric reads and distant paired-end cross-region reads (see FIG. 1).
(5) Single-end and paired-end chimeric reads were processed case by case: the two mates of a paired-end chimeric read were merged into one integrated read for a second alignment with BLASTN; for single-end chimeric reads, the second BLASTN alignment was performed on the single chimeric read.
(6) The alignment results of step (5) were filtered, the reads filtered out being those for which: the result is inconsistent with the BWA-MEM alignment; the viral or human portion of the read is too short (less than or equal to 30 bp); the overlap between the viral and human alignments on the read is too long (greater than or equal to 50% of the read length); the alignment of the human portion of the read is not unique; or the human portion of the read comes from a low-repeat region of DNA.
(7) All reads retained after the filtering in step (6) were locally clustered according to the position of their human portion, the sites supported by 3 or more reads were retained, and the gene position and function of the sites were annotated with ANNOVAR software.
(8) The reads from step (7) were assembled with IDBA-UD software, the assembled sequences were split into viral and human parts, a third BLASTN alignment was performed, and the assembled sequences whose alignment results are consistent with the BWA-MEM alignment were retained.
The infection types and the number of integration sites were counted for each sample. Samples that had both multiple HPV infections and integration sites were selected, 15 cases in total. For each sample, the ratio of the supporting read count of each infection type to the total number of viral reads was plotted as a stacked bar chart, and the number of integration sites of each infection type was plotted as a second stacked bar chart.
The results are shown in FIG. 2, where the horizontal axis is the sample name and different colors represent different infection types; the lower panel shows the ratio of the read count of each infection type to the total number of viral reads in each sample, and the upper panel shows the number of integration sites of each infection type in each sample. As can be seen from FIG. 2, among the 15 samples infected with multiple HPV types, only one HPV type was integrated in 11 samples, and in 10 samples the integrated HPV type was the type with the highest viral load, accounting for 66.7% of all samples.
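The two statistics behind FIG. 2 can be reproduced with a few lines of arithmetic. The sketch below is illustrative: the function names `read_fractions` and `concordance_rate` are hypothetical, and the per-sample type labels are invented placeholders arranged only so that 10 of 15 samples are concordant, as reported above.

```python
def read_fractions(type_counts):
    """Fraction of viral reads supporting each type (lower panel of the plot)."""
    total = sum(type_counts.values())
    return {t: n / total for t, n in type_counts.items()}


def concordance_rate(samples):
    """Share of samples whose highest-load type is among the integrated types.

    samples: list of (highest_load_type, set_of_integrated_types).
    """
    hits = sum(1 for top, integrated in samples if top in integrated)
    return hits / len(samples)


# Placeholder data: 10 concordant and 5 discordant samples (labels invented).
samples = [("HPV16", {"HPV16"})] * 10 + [("HPV16", {"HPV52"})] * 5
rate_percent = round(concordance_rate(samples) * 100, 1)

fractions = read_fractions({"HPV16": 90, "HPV56": 10})
```

With 10 concordant samples out of 15, the rate evaluates to 10 / 15, which rounds to the 66.7% stated in the text.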
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and not for limiting the protection scope of the present invention, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.
Claims (9)
1. A method for detecting the presence or absence of viral integration and integration sites in the human genome for non-diagnostic purposes comprising the steps of:
1) collecting all types of virus genomes from a database and taking the virus genomes as pseudo chromosomes, and combining the pseudo chromosomes with chromosomes of human genomes to obtain mixed genomes;
2) extracting and sequencing the DNA of a patient to obtain a genome of the patient, and comparing the genome of the patient with the mixed genome obtained in the step 1) for the first time;
3) counting the alignments to non-human chromosomes in the alignment result of step 2) and, for reads aligned to a specific type of viral genome, classifying the reads according to the aligned-length ratio and similarity ratio of the reads in the first alignment, wherein the reads are screened using the following formulas:
$L_M \ge (L_M + L_S + L_H + L_I) \times 0.5$;
$3 \times L_I + 2 \times L_D + L_{MIS} \le (L_M + L_D) \times 0.2$,
where $L_M$ is the length of the read aligned to the specific virus type; $L_S$ and $L_H$ are the lengths of the segments at the two ends of the read that are not aligned to the viral DNA; $L_I$ is the length of insertions in the middle of the read; $L_D$ is the length of deletions in the middle of the read; and $L_{MIS}$ is the length of single-base mismatches on the read;
4) counting the types and loads of the reads that satisfy the two formulas in step 3) to obtain the types of virus in the human genome;
5) constructing reference genomes of human and the corresponding virus types according to the virus types in the human genome detected in step 4);
6) re-aligning all reads of the first alignment to the reference genome; and
7) detecting the presence or absence of viral integration and the integration sites, based on the chimeric-read detection principle, according to the virus-type-specific alignment results.
2. The method of claim 1, wherein the step 2) is performed using a BWA-MEM algorithm.
3. The method of claim 1, wherein the step 2) further comprises removing PCR repeats.
4. The method of claim 3, wherein the PCR repeats are removed using the software Picard MarkDuplicates.
5. The method of claim 1, wherein in step 4), for paired-end sequencing reads, the virus type and load statistics are performed only when both reads satisfy the two formulas in step 3).
6. The method of claim 1, comprising the steps of:
S1, aligning all reads of the first alignment separately to the human reference genome;
S2, aligning all reads of the first alignment separately to the reference genome of the specific virus type;
S3, aligning all reads of the first alignment to the mixed reference genome of human and the corresponding virus type, and removing PCR duplicates from the alignment results using Picard MarkDuplicates;
S4, combining the results of step S1 and step S2, statistically classifying the reads in the alignment results of step S3 into single-end chimeric reads, paired-end chimeric reads and distant paired-end cross-region reads;
S5, merging the two mates of each paired-end chimeric read into one integrated read for a second alignment; for single-end chimeric reads, performing the second alignment on the single chimeric read;
S6, filtering the reads in the alignment results of step S5;
S7, locally clustering all reads retained after the filtering in step S6 according to their position on the human genome, retaining the sites supported by 3 or more reads, and annotating the gene position and function of the sites; and
S8, assembling the reads annotated in step S7, splitting the assembled sequences into viral and human parts, performing a third alignment of the parts to the mixed reference genome, and retaining the assembled sequences whose alignment results are consistent with the BWA-MEM alignment result of claim 2.
7. The method of claim 6, wherein the reads filtered out in step S6 include reads for which:
the result is inconsistent with the BWA-MEM alignment;
the viral or human portion of the read is too short;
the overlap between the viral and human alignments on the read is too long;
the alignment of the human portion of the read is not unique; or
the human portion of the read comes from a low-repeat region of DNA.
8. The method of claim 6, wherein the annotation of gene location and function is performed using ANNOVAR software in step S7; in the step S8, IDBA-UD software is used for assembly; the second alignment in step S5 and the third alignment in S8 both use BLASTN software.
9. A method for detecting the viral content of a human genome for non-diagnostic purposes, comprising the steps of:
1) collecting all types of virus genomes from a database and taking the virus genomes as pseudo chromosomes, and combining the pseudo chromosomes with chromosomes of human genomes to obtain mixed genomes;
2) extracting and sequencing the DNA of a patient to obtain a genome of the patient, and comparing the genome of the patient with the mixed genome obtained in the step 1) for the first time;
3) counting the alignments to non-human chromosomes in the alignment result of step 2) and, for reads aligned to a specific type of viral genome, classifying the reads according to the aligned-length ratio and similarity ratio of the reads in the first alignment, wherein the reads are screened using the following formulas:
$L_M \ge (L_M + L_S + L_H + L_I) \times 0.5$;
$3 \times L_I + 2 \times L_D + L_{MIS} \le (L_M + L_D) \times 0.2$,
where $L_M$ is the length of the read aligned to the specific virus type; $L_S$ and $L_H$ are the lengths of the segments at the two ends of the read that are not aligned to the viral DNA; $L_I$ is the length of insertions in the middle of the read; $L_D$ is the length of deletions in the middle of the read; and $L_{MIS}$ is the length of single-base mismatches on the read;
4) counting the types and loads of the reads that satisfy both formulas in step 3) to obtain the types of virus in the human genome;
5) based on the virus types and loads counted in step 4), performing relative quantification of the viral copy number according to the result of aligning a selectable reference gene against the mixed genome, wherein the quantification formula is as follows:
wherein $CN_H$ is the copy number of the reference gene, 2 by default; $D_V$ is the effective cumulative depth of the viral genome, obtained by cumulatively counting the single-base site coverage of all reads of step 3) of claim 1; $D_H$ is the effective cumulative depth of the reference gene, obtained in the same manner by cumulatively counting the single-base site coverage of all reads after alignment of the reference gene against the mixed genome of claim 1; $C_V$ is the alignment coverage of the viral genome, i.e., the length of the viral genome spanned by the single-base sites covered by all reads of step 3) of claim 1; $C_H$ is the alignment coverage of the reference gene, i.e., the length of the single-base sites covered by all reads aligned to the reference gene in the mixed genome of claim 1; $L_V$ is the effective length of the viral genome targeted by the sequencing probe; and $L_H$ is the effective length of the reference gene targeted by the sequencing probe.
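The two screening inequalities of step 3) in claim 9 can be illustrated with a minimal Python sketch. It is not the patentee's implementation. It assumes that $L_M$, $L_S$, $L_H$, $L_I$ and $L_D$ correspond to the summed `M`, `S`, `H`, `I` and `D` lengths of a SAM-style CIGAR string, and that the mismatch length $L_{MIS}$ is supplied by the caller. The function names `cigar_lengths` and `passes_viral_filter` are hypothetical. The thresholds 0.5 and 0.2 are rewritten as integer comparisons to avoid floating-point rounding.

```python
import re

# Matches one CIGAR operation such as "90M" or "10S".
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_lengths(cigar):
    """Return the summed length of each CIGAR operation in the string."""
    totals = {}
    for length, op in _CIGAR_OP.findall(cigar):
        totals[op] = totals.get(op, 0) + int(length)
    return totals


def passes_viral_filter(cigar, mismatches):
    """Apply the two screening inequalities to one alignment against a virus.

    Assumption (not stated in the claims): L_M, L_S, L_H, L_I and L_D are the
    summed M, S, H, I and D CIGAR lengths; L_MIS is passed in as `mismatches`.
    """
    t = cigar_lengths(cigar)
    l_m, l_s, l_h = t.get("M", 0), t.get("S", 0), t.get("H", 0)
    l_i, l_d = t.get("I", 0), t.get("D", 0)

    # A read with no aligned bases cannot be assigned to a virus.
    if l_m == 0:
        return False

    # L_M >= (L_M + L_S + L_H + L_I) * 0.5, written as 2 * L_M >= total.
    long_enough = 2 * l_m >= l_m + l_s + l_h + l_i
    # 3*L_I + 2*L_D + L_MIS <= (L_M + L_D) * 0.2, written as 5 * penalty <= L_M + L_D.
    similar_enough = 5 * (3 * l_i + 2 * l_d + mismatches) <= l_m + l_d
    return long_enough and similar_enough


if __name__ == "__main__":
    for cigar, mis in [("90M10S", 2), ("40M60S", 0), ("50M10D50M", 5)]:
        print(cigar, mis, passes_viral_filter(cigar, mis))
```

Under these assumptions, a read with CIGAR `90M10S` and two mismatches is retained, whereas `40M60S` fails the length criterion and `50M10D50M` with five mismatches fails the similarity criterion.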
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911264769.3A CN110951853B (en) | 2019-12-10 | 2019-12-10 | Method for accurately detecting DNA viruses in human genome |
PCT/CN2019/124917 WO2021114186A1 (en) | 2019-12-10 | 2019-12-12 | Method for accurately detecting dna viruses in human genome |
AU2020101909A AU2020101909A4 (en) | 2019-12-10 | 2020-08-20 | A method for accurately detecting dna virus in the human genome |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911264769.3A CN110951853B (en) | 2019-12-10 | 2019-12-10 | Method for accurately detecting DNA viruses in human genome |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110951853A (en) | 2020-04-03 |
CN110951853B (en) | 2021-03-30 |
Family
ID=69980885
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201911264769.3A CN110951853B (en) | Expired - Fee Related | 2019-12-10 | 2019-12-10 | Method for accurately detecting DNA viruses in human genome |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN110951853B (en) |
AU (1) | AU2020101909A4 (en) |
WO (1) | WO2021114186A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111584003B (en) * | 2020-04-10 | 2022-05-10 | 中国人民解放军海军军医大学 | Optimized detection method for virus sequence integration |
CN112530519B (en) * | 2020-12-14 | 2021-08-24 | 广东美格基因科技有限公司 | Method and system for detecting microorganisms and drug resistance genes in sample |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010130731A1 (en) * | 2009-05-12 | 2010-11-18 | Virco Bvba | Hiv-1-c resistance monitoring |
CN103261442A (en) * | 2010-12-02 | 2013-08-21 | 深圳华大基因健康科技有限公司 | Method and system for bioinformatics analysis of hpv precise typing |
CN104762402A (en) * | 2015-04-21 | 2015-07-08 | 广州定康信息科技有限公司 | Method for rapidly detecting human genome single base mutation and micro-insertion deletion |
CN110349629A (en) * | 2019-06-20 | 2019-10-18 | 广州赛哲生物科技股份有限公司 | Analysis method for detecting microorganisms by using metagenome or macrotranscriptome |
2019
- 2019-12-10: CN201911264769.3A, patent CN110951853B, not active (Expired - Fee Related)
- 2019-12-12: PCT/CN2019/124917, WO2021114186A1, active (Application Filing)
2020
- 2020-08-20: AU2020101909A, patent AU2020101909A4, active
Non-Patent Citations (2)
Title |
---|
Genomic modeling of hepatitis B virus integration frequency in the human genome; Ondrej Podlaha et al.; PLOS ONE; 2019-07-29; vol. 14, no. 7; e0220376, pp. 1-9 * |
VERSE: a novel approach to detect virus integration in host genomes through reference genome customization; Qingguo Wang et al.; Genome Medicine; 2015-01-20; vol. 7, no. 2; pp. 1-8 * |
Also Published As
Publication number | Publication date |
---|---|
CN110951853A (en) | 2020-04-03 |
AU2020101909A4 (en) | 2020-09-24 |
WO2021114186A1 (en) | 2021-06-17 |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 2021-03-30 |