WO2021114186A1 - Method for accurately detecting DNA viruses in human genome - Google Patents

Method for accurately detecting DNA viruses in human genome

Publication number
WO2021114186A1
Authority
WO
WIPO (PCT)
Prior art keywords
reads
virus
genome
comparison
human
Application number
PCT/CN2019/124917
Other languages
French (fr)
Chinese (zh)
Inventor
胡争
崔资凤
许微
Original Assignee
中山大学附属第一医院
Application filed by 中山大学附属第一医院

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the invention relates to the technical field of virus detection, in particular to a method for accurately detecting DNA viruses in the human genome.
  • Double-stranded DNA virus-associated tumors are tumors closely related to double-stranded DNA virus infection, caused when the biological effects of the interaction between the virus and host cells trigger carcinogenic mechanisms. They are often accompanied by infection with high-risk oncogenic virus strains, insertion of high-risk oncogenic viral genomic DNA into human cellular DNA, and co-infection with multiple virus subtypes during tumor progression.
  • Examples include Human papillomavirus (HPV) in cervical cancer and head and neck tumors; our previous research found that Hepatitis B virus (HBV) and Epstein-Barr virus (EBV) DNA is commonly integrated into the human genome, and that DNA integration plays an essential role in carcinogenesis. Blocking integration has therefore become a focus of research.
  • Taking HPV as an example, more than 200 HPV types have been discovered so far; they are double-stranded DNA viruses that specifically infect the squamous epithelial cells of human skin and mucosa.
  • HPV infection is a sexually transmitted disease. According to incomplete statistics, the genital tract HPV infection rate among sexually active young women is as high as 80%; a woman may be infected with different HPV types at different periods of her life, and may carry several HPV types at the same time.
  • Persistent infection with high-risk HPV (15 types: 16, 18, 31, 33, 35, 39, 45, 51, 52, 56, 58, 59, 68, 73, and 82) is the most critical pathogenic factor in the development of cervical cancer.
  • When the cervical epithelium is damaged, HPV can pass through the epidermis at the damaged site, enter the basal layer, divide along with the basal stem cells, and begin to replicate in large numbers in the squamous cells above the basal layer; mature virus is released when the surface cells are shed.
  • HPV generally exists in a free (episomal) state in cervical epithelial cells, and its DNA can integrate into human chromosomes.
  • Hybrid Capture 2 (HC2);
  • Cervista, based on hybridization signal amplification analysis;
  • Cobas 4800, a real-time PCR method.
  • None of the above methods covers all HPV types, and cross-reactions between different HPV types cannot be avoided.
  • In addition, none of the above methods can determine whether HPV has integrated or where it has integrated.
  • The rapid development of next-generation sequencing technology has provided new methods for detecting virus type and integration.
  • Whole-genome sequencing, whole-transcriptome sequencing, and virus-specific capture (targeted) sequencing provide opportunities for comprehensive detection of virus type and status.
  • The purpose of the present invention is to overcome the shortcomings of the above-mentioned prior art and provide a method that can accurately detect DNA viruses in the human genome: an analysis method that accurately assesses the infection type and load of double-stranded DNA viruses and, at the same time, accurately and flexibly determines the type of double-stranded DNA virus integrated into the human genome, the integration site, and the resulting fusion sequence.
  • the technical solutions adopted by the present invention include the following aspects:
  • the present invention provides a method for detecting virus types in the human genome, including the following steps:
  • 2) Extract the patient's DNA and sequence it to obtain the patient's genome data, then align it for the first time against the mixed genome obtained in step 1);
  • L_M is the length of the read segment aligned to a specific virus type;
  • L_S and L_H are the lengths of the segments at the two ends of the read that do not align to the viral DNA;
  • L_I is the length of insertions in the middle of the read;
  • L_D is the length of deletions in the middle of the read;
  • L_MIS is the length of single-base mismatches on the read;
  • 4) For the reads that satisfy the two formulas in step 3), count virus type and load to obtain the result.
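The per-read lengths defined above can be derived from a standard SAM alignment record. The sketch below is one possible reading, not the patented procedure: it assumes L_S and L_H correspond to soft-clipped (S) and hard-clipped (H) bases in the CIGAR string, and that mismatches are estimated from the NM edit-distance tag minus inserted and deleted bases. The function name is hypothetical, and the two filtering formulas of step 3) are not reproduced in this text, so no threshold is applied.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation letter (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def alignment_lengths(cigar: str, nm: int) -> dict:
    """Sum CIGAR operation lengths into the quantities named in the text.

    Assumption: L_S = soft-clipped bases, L_H = hard-clipped bases.
    Assumption: L_MIS = NM - insertions - deletions (NM is the edit distance).
    """
    totals = Counter()
    for length, op in CIGAR_OP.findall(cigar):
        totals[op] += int(length)
    l_i, l_d = totals["I"], totals["D"]
    return {
        "L_M": totals["M"] + totals["="] + totals["X"],
        "L_S": totals["S"],
        "L_H": totals["H"],
        "L_I": l_i,
        "L_D": l_d,
        "L_MIS": max(nm - l_i - l_d, 0),
    }

print(alignment_lengths("10S85M2I3D50M3S", nm=8))
```

A downstream filter would compare these values against the thresholds given by the two formulas in step 3).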
  • The detection method of the present invention can accurately detect the specific type and relative load of double-stranded DNA virus infection, and is suitable for, but not limited to, diseases related to double-stranded DNA virus infection, such as HPV-related cervical disease and head and neck disease, HBV-related liver disease, and EBV-related lymphatic system disease, nasopharyngeal disease, and gastric disease.
  • Preferably, the BWA-MEM algorithm is used for the alignment in step 2).
  • Preferably, step 2) further includes removing PCR duplicates. More preferably, Picard MarkDuplicates is used to remove PCR duplicates.
  • In step 4), for paired-end sequencing reads, virus type and load are counted when both reads of a pair satisfy the two formulas in step 3).
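The paired-end rule above can be sketched as a counting step. This is an illustration under stated assumptions: each mate arrives as a `(pair_id, virus_type, passes_filter)` tuple, where `passes_filter` is the outcome of the step 3) formulas (not reproduced here), and a pair is counted only if both mates are assigned to the same virus type. The same-type requirement and the function name are assumptions, not taken from the source.

```python
from collections import Counter, defaultdict

def count_pairs_by_type(records):
    """Count read pairs per virus type, keeping pairs where both mates pass the filter."""
    mates = defaultdict(list)
    for pair_id, virus_type, passes in records:
        mates[pair_id].append((virus_type, passes))

    counts = Counter()
    for pair in mates.values():
        # Require exactly two mates, both passing the (externally computed) filter.
        if len(pair) == 2 and all(passes for _, passes in pair):
            types = {virus_type for virus_type, _ in pair}
            if len(types) == 1:  # assumption: both mates support the same type
                counts[types.pop()] += 1
    return counts

records = [
    ("p1", "HPV16", True), ("p1", "HPV16", True),
    ("p2", "HPV16", True), ("p2", "HPV16", False),
    ("p3", "HPV18", True), ("p3", "HPV18", True),
]
print(count_pairs_by_type(records))
```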
  • The present invention provides a method for detecting the virus content in the human genome, which includes the following step: based on the virus type and load statistics above, the virus copy number is quantified relative to an optional internal reference gene, using the alignment results against the above-mentioned mixed genome. The quantification formula is as follows:
  • CN_H is the copy number of the internal reference gene;
  • D_V is the effective cumulative depth of the viral genome, obtained by summing, over the single-base sites of the viral genome, the number of times the reads from step 3) above cover each site;
  • D_H is the effective cumulative depth of the internal reference gene, obtained in the same way from all reads aligned to the internal reference gene in the above-mentioned mixed genome;
  • C_V is the alignment coverage of the viral genome, i.e. the fraction of the viral genome length occupied by the single-base sites covered by the reads from step 3) above;
  • C_H is the alignment coverage of the internal reference gene, i.e. the fraction of the internal reference gene length occupied by the single-base sites covered by the reads aligned to the internal reference gene in the mixed genome;
  • L_V is the effective length of the viral genome covered by the sequencing probes;
  • L_H is the effective length of the internal reference gene covered by the sequencing probes.
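The depth and coverage terms defined above (D_V, D_H, C_V, C_H) can be computed from the reference intervals spanned by the retained reads. A minimal sketch, assuming 0-based half-open `(start, end)` intervals on a single reference and a hypothetical function name. The quantification formula itself is not reproduced in this text, so the sketch only prepares its inputs.

```python
def depth_and_coverage(intervals, ref_length):
    """Return (cumulative depth, covered fraction) for one reference sequence.

    intervals: iterable of 0-based half-open (start, end) aligned spans.
    ref_length: effective length of the reference (L_V or L_H in the text).
    """
    depth = [0] * ref_length
    for start, end in intervals:
        # Clamp each span to the reference so stray coordinates cannot overflow.
        for pos in range(max(start, 0), min(end, ref_length)):
            depth[pos] += 1
    cumulative_depth = sum(depth)                 # D: total per-base coverage count
    covered_sites = sum(1 for d in depth if d > 0)
    return cumulative_depth, covered_sites / ref_length  # (D, C)

print(depth_and_coverage([(0, 50), (25, 75)], ref_length=100))
```

The same function would be called once with the viral reads (giving D_V and C_V) and once with the reads on the internal reference gene (giving D_H and C_H).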
  • the present invention provides a method for detecting virus integration and integration sites in the human genome, which includes the following steps:
  • Virus integration and integration sites are detected based on the detection principle of chimeric reads.
  • the method includes the following steps:
  • S4) Combine the results of steps S1 and S2, classify the reads in the alignment result of step S3, and divide them into single-end chimeric reads, paired-end chimeric reads, and long-distance paired-end cross-region reads;
  • S6) Filter the reads in the alignment result of step S5;
  • S7) Cluster all reads retained after the filtering in step S6 locally according to their position on the human genome, retain the sites supported by ≥3 reads, and annotate those sites with gene position and function;
  • S8) Assemble the reads of the sites annotated in step S7, split each assembled sequence into its viral and human parts, and align them against the mixed reference genome for the third time; assembled sequences whose alignment is consistent with the BWA-MEM alignment result of claim 2 are retained as the result.
  • The reads filtered out in step S6 include:
  • reads whose viral or human part is too short (<30 bp);
  • reads in which the overlap between the viral and human alignments is too long (>50% of the read length);
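The two length-based filters listed above can be sketched as a single predicate. This is an illustration only: it assumes the thresholds are "shorter than 30 bp" and "longer than 50% of the read length", and that the human and viral parts of a chimeric read are given as 0-based half-open spans on the read. The function and parameter names are hypothetical.

```python
def keep_chimeric_read(read_len, human_span, virus_span,
                       min_part=30, max_overlap_frac=0.5):
    """Return True if a chimeric read survives the two length-based filters.

    human_span, virus_span: (start, end) positions on the read, 0-based half-open.
    """
    human_len = human_span[1] - human_span[0]
    virus_len = virus_span[1] - virus_span[0]
    # Filter 1: either part of the chimera is too short.
    if human_len < min_part or virus_len < min_part:
        return False
    # Filter 2: the stretch claimed by both alignments is too long.
    overlap = max(0, min(human_span[1], virus_span[1]) - max(human_span[0], virus_span[0]))
    return overlap <= max_overlap_frac * read_len

print(keep_chimeric_read(150, (0, 80), (70, 150)))   # both parts 80 bp, 10 bp overlap
print(keep_chimeric_read(150, (0, 130), (125, 150))) # viral part only 25 bp
```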
  • In step S7, ANNOVAR is used to annotate gene position and function; in step S8, IDBA-UD is used for assembly; the second alignment in step S5 and the third alignment in step S8 both use BLASTN.
  • the method of the present invention can simultaneously detect multiple different virus infections
  • the method of the present invention can finely distinguish the reads of different subtypes of the same virus, and thereby determine the viral load;
  • the method of the present invention is suitable for reads from different library construction sources, and is generally applicable to NGS-based viral pathogen detection;
  • the method of the present invention can accurately detect the type of virus integrated into the human genome, the integration site, and the specific integration sequence, and provide solid theoretical support for downstream verification.
  • Figure 1 is a schematic flow chart of the method for detecting virus integration in the human genome in the present invention
  • Figure 2 is a graph showing, for samples with multiple HPV infections, the relationship between the HPV type with the highest viral load and the number of integration sites of each integrated HPV type; it shows that in about 66.7% of the samples infected with multiple HPV types, the integrated HPV type is the type with the highest viral load;
  • Figure 3 is a graph of the results of HPV typing in Example 1, where the vertical axis is the number of valid HPV comparison reads;
  • Figure 4 is a graph of HPV typing results in Example 2, where the vertical axis is the number of HPV valid comparison reads;
  • Figure 5 is a graph of HPV typing results in Example 3, where the vertical axis is the number of HPV valid comparison reads;
  • Figure 6 is a graph of HPV typing results in Example 4, where the vertical axis is the number of HPV valid comparison reads;
  • Figure 7 is a graph of HPV typing results in Example 5, where the vertical axis is the number of valid HPV comparison reads;
  • Figure 8 is a graph of HPV typing results in Example 6, where the vertical axis is the number of HPV valid comparison reads;
  • Figure 9 is a graph showing the statistical results of the read support number of the samples with the highest viral load of Type1;
  • Figure 10 is a graph showing the statistical results of the read support number of the samples with the highest viral load of Type2.
  • The present invention provides a method for accurately detecting the multiple infection types and load of double-stranded DNA viruses, viral integration breakpoints, and human-virus genome fusion sequences; guided by the detection results of the method, screening and treatment decisions for virus-related tumors become more accurate and efficient.
  • The method of the present invention can detect the main viral infection types with carcinogenic potential, assess cancer risk from the occurrence of integration, guide personalized screening strategies for the related tumors, and, according to the number and biological significance of the virus integration sites, provide targeted antiviral and antitumor therapy programs for cancer patients.
  • the present invention provides a method for calculating the type of double-stranded DNA virus infection.
  • The calculation method is based on next-generation sequencing reads, and the best use scenario is virus capture sequencing. By filtering the alignment information of the sequencing reads, the reads from viral DNA are accurately selected; duplication bias that may be introduced during library construction is removed; and the number of reads of each type of viral DNA is counted, which indirectly reflects the load of the infecting virus type. The method includes the following steps:
  • The alignment software uses the BWA-MEM algorithm, which supports local optimal alignment, and Picard MarkDuplicates is used after alignment to remove PCR duplicates;
  • L_M is the length of the read segment aligned to a specific virus type;
  • L_S and L_H are the lengths of the segments (larger fragments) at the two ends of the read that do not align to the viral DNA;
  • L_I is the length of insertions (small fragments) in the middle of the read;
  • L_D is the length of deletions (small fragments) in the middle of the read;
  • L_MIS is the length of single-base mismatches on the read.
  • CN_H is the copy number of the internal reference gene;
  • D_V is the effective cumulative depth of the viral genome, obtained by summing, over the single-base sites of the viral genome, the number of times the reads from step 3) above cover each site;
  • D_H is the effective cumulative depth of the internal reference gene, obtained in the same way from all reads aligned to the internal reference gene in the above-mentioned mixed genome;
  • C_V is the alignment coverage of the viral genome, i.e. the fraction of the viral genome length occupied by the single-base sites covered by the reads from step 3) above;
  • C_H is the alignment coverage of the internal reference gene, i.e. the fraction of the internal reference gene length occupied by the single-base sites covered by the reads aligned to the internal reference gene in the mixed genome;
  • L_V is the effective length of the viral genome covered by the sequencing probes;
  • L_H is the effective length of the internal reference gene covered by the sequencing probes.
  • The internal reference genes used in the virus copy number calculation above are chosen to suit the fixed probes and sequencing reads of different library construction and detection systems, and include but are not limited to conserved genes of the human genome.
  • The above algorithm is mainly applied to whole-genome sequencing, exome sequencing, whole-transcriptome sequencing, and virus capture sequencing data of clinical patients with diseases related to double-stranded DNA virus infection.
  • It scans for the species and types of double-stranded DNA virus present and identifies the DNA fusion site sequences for each virus type that reaches the detection level; the patient's corresponding virus-related carcinogenic risk is then judged comprehensively from the type of double-stranded DNA virus infection, the infection abundance, whether integration into the human genome has occurred, and the significance of the corresponding integration sites, to guide clinical early-warning decisions and treatment plans.
  • the main application scenarios are as follows:
  • A mixed reference genome of human and the corresponding virus types is specifically constructed, all reads are re-aligned, and virus integration and integration sites are detected from the type-specific alignment results based on the detection principle of chimeric reads. Specifically, this includes the following steps:
  • 4) Combining the results of step 1 and step 2, classify the reads in the alignment results of step 3 and divide them into single-end chimeric reads, paired-end chimeric reads, and long-distance paired-end cross-region reads (see Figure 1).
  • The reads filtered out include: reads whose result is inconsistent with the BWA-MEM alignment; reads whose viral or human part is too short (<30 bp); reads in which the overlap between the viral and human alignments is too long (>50% of the read length); reads whose human part does not align uniquely; and reads whose human part comes from a low repetitive region of DNA;
  • 7) Cluster all reads retained after the filtering in step 6 locally according to the position of the human part of the read, retain the sites supported by ≥3 reads, and use ANNOVAR to annotate the gene location and function of those sites;
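The local clustering in step 7 can be sketched as a single pass over sorted breakpoint positions. The source gives only the minimum support (3 reads); the 100 bp merging window, the input layout, and the function name below are assumptions chosen for illustration.

```python
def cluster_sites(breakpoints, window=100, min_reads=3):
    """Group human-side breakpoints into candidate integration sites.

    breakpoints: iterable of (chrom, pos), one entry per retained read.
    window: maximum gap in bp between neighbouring reads of one site (assumed value).
    min_reads: minimum number of supporting reads for a site to be kept.
    """
    clusters = []
    for chrom, pos in sorted(breakpoints):
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == chrom and pos - last["end"] <= window:
            last["end"] = pos       # extend the current site
            last["reads"] += 1
        else:
            clusters.append({"chrom": chrom, "start": pos, "end": pos, "reads": 1})
    return [c for c in clusters if c["reads"] >= min_reads]

reads = [("chr8", 100), ("chr8", 120), ("chr8", 150), ("chr8", 5000), ("chr3", 10)]
print(cluster_sites(reads))
```

Sites that survive this step would then be passed to the annotation tool.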
  • Risk assessment can be performed based on the virus integration sites obtained by the above calculation method, and the virus types that carry the main carcinogenic risk can be identified; the integrated virus type is largely consistent with the virus type with the highest viral load, with a concordance rate of about 70% (see Figure 2).
  • the cell-free DNA sample used is derived from cell tissue in routine examinations, and is applicable but not limited to cervical exfoliated cells, liver puncture cells, lymph node biopsy tissue, blood, saliva, and the like.
  • the frequency of downstream clinical monitoring can be increased, or a more aggressive clinical treatment plan adopted; conversely, the frequency of clinical monitoring can be reduced, or the clinical treatment plan de-escalated (see Examples 1 to 6).
  • The virus types involved in the present invention can include one or more double-stranded DNA viruses such as HPV, HBV, and EBV; for one virus, any of its different types can be included; and for a single type of a virus, probes can be designed against the entire genome or against partial regions of the genome.
  • The reads from viral DNA are accurately selected; duplication bias that may be introduced during library construction is removed; and the number of reads of each type of viral DNA is counted, which indirectly reflects the load of the infecting virus type. The specific steps are as follows:
  • The alignment software uses the BWA-MEM algorithm, which supports local optimal alignment, and Picard MarkDuplicates is used after alignment to remove PCR duplicates;
  • L_M is the length of the read segment aligned to a specific virus type;
  • L_S and L_H are the lengths of the segments (larger fragments) at the two ends of the read that do not align to the viral DNA;
  • L_I is the length of insertions (small fragments) in the middle of the read;
  • L_D is the length of deletions (small fragments) in the middle of the read;
  • L_MIS is the length of single-base mismatches on the read.
  • Infection with two HPV types (HPV31 and HPV33) was found in patient A's sample.
  • The integration sites were then detected for each type separately.
  • For HPV31, a mixed reference genome of human and HPV31 was constructed, all reads were re-aligned, and virus integration and integration sites were detected from the type-specific alignment results based on the detection principle of chimeric reads.
  • 4) Combining the results of step 1 and step 2, classify the reads in the alignment results of step 3 and divide them into single-end chimeric reads, paired-end chimeric reads, and long-distance paired-end cross-region reads (Figure 1).
  • The reads filtered out include: reads whose result is inconsistent with the BWA-MEM alignment; reads whose viral or human part is too short (<30 bp); reads in which the overlap between the viral and human alignments is too long (>50% of the read length); reads whose human part does not align uniquely; and reads whose human part comes from a low repetitive region of DNA.
  • Colposcopy can be performed immediately to avoid a missed diagnosis; this differs from the international guideline recommendation that patients positive for HPV types other than 16 and 18 continue to be observed.
  • Infection with three HPV types (HPV16, HPV31, HPV56) was found (as shown in Figure 4), but no HPV integration was found, so follow-up can be continued and unnecessary colposcopy avoided; this differs from the international guideline recommendation of colposcopy referral for HPV16-positive patients.
  • HPV16 infection was found (as shown in Figure 7), with two integration sites (see Table 3); surgery can be used to prevent progression to cervical cancer.
  • Infection with HPV16 and HPV18 was found (as shown in Figure 8), and high-risk HPV18 was found to be integrated at two different sites in the human genome (see Table 4).
  • The integration site is located near the human CHRAC1 gene, which can guide personalized clinical medicine.
  • The genomes of EBV Type1 and Type2 were collected from the literature and the NCBI database, and the two EBV genomes were treated as pseudochromosomes and merged with the chromosomes of the human genome to construct a hybrid genome.
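Treating viral genomes as pseudochromosomes amounts to appending them to the human reference as extra FASTA records. A minimal sketch operating on FASTA text held in memory; it assumes each viral input contains a single record, and the record names and function name are hypothetical. In practice the merged file would then be indexed for the aligner.

```python
def build_hybrid_fasta(human_fasta: str, viral_fastas: dict) -> str:
    """Append each viral genome to the human FASTA text as a renamed record.

    viral_fastas: mapping of pseudochromosome name -> single-record FASTA text.
    """
    parts = [human_fasta.rstrip("\n")]
    for name, fasta in viral_fastas.items():
        # Drop the original header and keep only the sequence lines.
        seq_lines = [ln for ln in fasta.strip().splitlines()
                     if ln and not ln.startswith(">")]
        parts.append(f">{name}")
        parts.extend(seq_lines)
    return "\n".join(parts) + "\n"

human = ">chr1\nACGT\n"
viruses = {"virus_type1": ">some_accession description\nGGCC\nTT\n"}
print(build_hybrid_fasta(human, viruses), end="")
```

Renaming each record to a short, unique pseudochromosome name keeps the virus type recoverable from the reference name column of the alignment output.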
  • The alignment software uses the BWA-MEM algorithm, which supports local optimal alignment, and Picard MarkDuplicates is used after alignment to remove PCR duplicates.
  • L_M is the length of the read segment aligned to a specific virus type;
  • L_S and L_H are the lengths of the segments (larger fragments) at the two ends of the read that do not align to the viral DNA;
  • L_I is the length of insertions (small fragments) in the middle of the read;
  • L_D is the length of deletions (small fragments) in the middle of the read;
  • L_MIS is the length of single-base mismatches on the read.
  • For liver cancer patient G, the liver cancer tissue and its adjacent tissue were collected, captured, and sequenced separately, and the infection types and integration sites in the cancer tissue and the adjacent tissue were detected by the method of the present invention.
  • the specific detection steps for cancer tissues are as follows:
  • The reads from viral DNA are accurately selected; duplication bias that may be introduced during library construction is removed; and the number of reads of each type of viral DNA is counted, which indirectly reflects the load of the infecting virus type.
  • The alignment software uses the BWA-MEM algorithm, which supports local optimal alignment, and Picard MarkDuplicates is used after alignment to remove PCR duplicates.
  • L_M is the length of the read segment aligned to a specific virus type;
  • L_S and L_H are the lengths of the segments (larger fragments) at the two ends of the read that do not align to the viral DNA;
  • L_I is the length of insertions (small fragments) in the middle of the read;
  • L_D is the length of deletions (small fragments) in the middle of the read;
  • L_MIS is the length of single-base mismatches on the read.
  • Infection with three HBV types (AB014381, AF090842, AB033554) was found in the cancer sample of patient G.
  • The integration sites were then detected for each type separately.
  • For AB014381, a mixed reference genome of human and the AB014381 virus was constructed, all reads were re-aligned, and virus integration and integration sites were detected from the type-specific alignment results based on the detection principle of chimeric reads.
  • 4) Combining the results of step 1 and step 2, classify the reads in the alignment results of step 3 and divide them into single-end chimeric reads, paired-end chimeric reads, and long-distance paired-end cross-region reads (Figure 1).
  • The reads filtered out include: reads whose result is inconsistent with the BWA-MEM alignment; reads whose viral or human part is too short (<30 bp); reads in which the overlap between the viral and human alignments is too long (>50% of the read length); reads whose human part does not align uniquely; and reads whose human part comes from a low repetitive region of DNA.
  • The reads from viral DNA are accurately selected; duplication bias that may be introduced during library construction is removed; and the number of reads of each type of viral DNA is counted, which indirectly reflects the load of the infecting virus type. The specific steps are as follows:
  • The alignment software uses the BWA-MEM algorithm, which supports local optimal alignment, and Picard MarkDuplicates is used after alignment to remove PCR duplicates.
  • L_M is the length of the read segment aligned to a specific virus type;
  • L_S and L_H are the lengths of the segments (larger fragments) at the two ends of the read that do not align to the viral DNA;
  • L_I is the length of insertions (small fragments) in the middle of the read;
  • L_D is the length of deletions (small fragments) in the middle of the read;
  • L_MIS is the length of single-base mismatches on the read.
  • A mixed reference genome of human and the corresponding virus types is specifically constructed, all reads are re-aligned, and virus integration and integration sites are detected from the type-specific alignment results based on the detection principle of chimeric reads.
  • 4) Combining the results of step 1 and step 2, classify the reads in the alignment results of step 3 and divide them into single-end chimeric reads, paired-end chimeric reads, and long-distance paired-end cross-region reads (Figure 1).
  • The reads filtered out include: reads whose result is inconsistent with the BWA-MEM alignment; reads whose viral or human part is too short (<30 bp); reads in which the overlap between the viral and human alignments is too long (>50% of the read length); reads whose human part does not align uniquely; and reads whose human part comes from a low repetitive region of DNA.
  • The infection types and the number of integration sites were counted for each sample. A total of 15 samples with multiple HPV infections and integration sites were selected.
  • One stacked bar graph is drawn from the proportion of reads supporting each infection type among all viral reads in each sample, and another from the number of integration sites of each infection type in each sample.
  • The results are shown in Figure 2. The horizontal axis is the sample name, and different colors indicate different infection types. The lower panel shows the proportion of reads of each infection type in each sample relative to the total viral reads; the upper panel shows the number of integration sites for each infection type in each sample.
  • It can be seen from Figure 2 that among the 15 samples with multiple HPV infections, 11 have an integrated HPV type, and in 10 of them the integrated HPV type is the type with the highest viral load, accounting for 66.7% of all samples.
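The 66.7% figure follows directly from the counts quoted above (10 of the 15 samples). A short check is given below; the share among the 11 samples with an integrated type is a derived number that the source does not state, included only for comparison. The helper name is hypothetical.

```python
def share(part: int, whole: int) -> str:
    """Format part/whole as a percentage with one decimal place."""
    return f"{part / whole:.1%}"

total_samples = 15        # samples with multiple HPV infections
integrated_samples = 11   # samples with an integrated HPV type
concordant_samples = 10   # integrated type is also the highest-load type

print(share(concordant_samples, total_samples))       # share of all samples
print(share(concordant_samples, integrated_samples))  # share of samples with integration
```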

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Organic Chemistry (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Genetics & Genomics (AREA)
  • Analytical Chemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Immunology (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Data Mining & Analysis (AREA)
  • Microbiology (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A method for accurately detecting DNA viruses in a human genome. The method is an analysis method that accurately evaluates the infection type and load of a double-stranded DNA virus and accurately and flexibly determines the types and integration sites of double-stranded DNA viruses integrated into the human genome, together with the resulting fusion sequence. The method can detect multiple different virus infections simultaneously; it can finely distinguish the reads of different subtypes of the same virus so as to determine the viral load; and it is suitable for reads from different library construction sources and is generally applicable to NGS-based viral etiology detection.

Description

A Method for Precise Detection of DNA Viruses in Human Genome
Technical field
The invention relates to the technical field of virus detection, in particular to a method for accurately detecting DNA viruses in the human genome.
Background art
双链DNA病毒相关肿瘤,是指跟双链DNA病毒感染密切相关、由双链DNA病毒感染后与宿主细胞相互作用产生的一系列生物学效应触发致癌机并导致的肿瘤,常伴随高风险致癌病毒株的感染、高风险致癌病毒基因组DNA插入人体细胞DNA、肿瘤进展过程中存在多种病毒亚型的共感染等现象。如双链DNA病毒如Human papillomavirus(HPV)与宫颈癌、头颈部肿瘤等;我们前期研究发现,Hepatitis B virus(HBV),Epstein–Barr virus(EBV)存在普遍的DNA整合入人类基因组的现象,并且DNA整合的产生在其致癌的过程中发挥着必不可少的作用。因此,阻断其整合的发生成为研究的重点。Double-stranded DNA virus-associated tumors refer to tumors that are closely related to double-stranded DNA virus infections and are caused by a series of biological effects caused by the interaction of double-stranded DNA viruses with host cells that trigger carcinogenic mechanisms and are often accompanied by high-risk cancers. Virus strain infection, high-risk cancer-causing virus genomic DNA inserted into human cell DNA, and co-infection of multiple virus subtypes during tumor progression. For example, double-stranded DNA viruses such as Human papillomavirus (HPV) and cervical cancer, head and neck tumors, etc.; our previous research found that Hepatitis B virus (HBV), Epstein-Barr virus (EBV) have universal DNA integration into the human genome. And the production of DNA integration plays an essential role in its carcinogenic process. Therefore, blocking its integration has become the focus of research.
Take HPV as an example. More than 200 HPV types have been identified so far; they are double-stranded DNA viruses that specifically infect the squamous epithelial cells of human skin and mucosa. HPV infection is a sexually transmitted disease. According to incomplete statistics, the genital tract HPV infection rate among sexually active young women is as high as 80%; a woman may be infected with different HPV types at different periods of her life, and infections with multiple HPV types may coexist at the same time. Persistent infection with high-risk HPV (15 types, including 16, 18, 31, 33, 35, 39, 45, 51, 52, 56, 58, 59, 68, 73 and 82) is the most critical causative factor in the occurrence and development of cervical cancer. When the cervical epithelium is damaged, HPV can pass through the damaged site, enter the basal layer, divide along with the basal stem cells, and begin replicating in large numbers in the squamous cells above the basal layer; mature virus is released when the surface cells are shed. HPV generally exists in a free (episomal) state in cervical epithelial cells, but its DNA can integrate into human chromosomes. Integration of high-risk HPV into the host genome is one of the decisive factors in the occurrence and development of cervical cancer; studies show that high-risk HPV integration can be detected in more than 90% of cervical cancers. Identifying the HPV infection type, the viral load, and the integration status is therefore of great importance for the precise prevention and treatment of cervical cancer.
At present, clinical detection of virus type and load mainly relies on Hybrid Capture 2 (HC2) and Cervista, which are based on hybridization signal amplification, and on Cobas 4800, which is based on real-time PCR. None of these methods covers all HPV types or avoids cross-reactivity between different HPV types. Most importantly, none of them can determine whether HPV has integrated or characterize its integration status. Meanwhile, the rapid development of next-generation sequencing has created new approaches for detecting virus types and integration. Whole-genome sequencing, whole-transcriptome sequencing, and virus-specific capture targeted sequencing provide an opportunity for comprehensive detection of virus type and status.
Summary of the invention
In view of the above problems, the purpose of the present invention is to overcome the shortcomings of the prior art and provide a method for accurately detecting DNA viruses in the human genome. The method accurately assesses the infection type and load of double-stranded DNA viruses and, at the same time, accurately and flexibly determines the type of double-stranded DNA virus integrated into the human genome, the integration sites, and the resulting fusion sequences.
To achieve the above purpose, the technical solutions adopted by the present invention include the following aspects:
In a first aspect, the present invention provides a method for detecting virus types in the human genome, comprising the following steps:
1) Collect the genomes of all types of the virus from a database, treat them as pseudo-chromosomes, and merge them with the chromosomes of the human genome to obtain a mixed genome;
2) Extract the patient's DNA, sequence it to obtain the patient's genome sequence data, and align the data to the mixed genome obtained in step 1) (the first alignment);
3) Collect the alignments to non-human chromosomes from the result of step 2). For reads aligned to the genome of a specific virus type, classify the reads according to the aligned-length fraction and the similarity fraction of the first alignment, screening the reads with the following formulas:
$$L_M \ge (L_M + L_S + L_H + L_I) \times 0.5;$$
$$3 \times L_I + 2 \times L_D + L_{MIS} \le (L_M + L_D) \times 0.2,$$
where $L_M$ is the length of the read that aligns to the specific virus type; $L_S$ and $L_H$ are the lengths at the two ends of the read that do not align to the viral DNA; $L_I$ is the length of insertions within the read; $L_D$ is the length of deletions within the read; and $L_{MIS}$ is the length of single-base mismatches in the read;
4) For the reads that satisfy both formulas in step 3), compile the statistics of virus type and load to obtain the result.
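Step 1) can be illustrated with a minimal Python sketch. It assumes the human and viral references are available as FASTA text; the helper names `parse_fasta` and `build_mixed_genome` are hypothetical and are not part of the described method or of any existing tool. The sketch only shows the idea of adding each viral genome as an extra named sequence (a pseudo-chromosome) next to the human chromosomes.

```python
from typing import Dict, Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (sequence name, sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (name, sequence) records from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first token of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def build_mixed_genome(human: Iterable[Record],
                       viruses: Iterable[Record]) -> Dict[str, str]:
    """Merge human chromosomes and viral genomes into one reference.

    Each viral genome is stored under its own name, so it behaves like
    an additional (pseudo-)chromosome of the mixed genome.
    """
    mixed: Dict[str, str] = {}
    for name, seq in human:
        mixed[name] = seq
    for name, seq in viruses:
        if name in mixed:
            raise ValueError(f"duplicate sequence name: {name}")
        mixed[name] = seq
    return mixed


# Toy data only; real references would be read from files.
human_records = list(parse_fasta(">chr1 toy\nACGT\nAC\n>chr2\nGG\n"))
virus_records = list(parse_fasta(">HPV16\nTTTT\n>HPV18\nCC\n"))
mixed_genome = build_mixed_genome(human_records, virus_records)
```

Because the viral sequences keep their own names, alignments to non-human chromosomes can later be selected simply by sequence name.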
It should be noted that the detection method of the present invention can accurately detect the specific type and relative load of a double-stranded DNA virus infection. It is applicable to, but not limited to, diseases related to double-stranded DNA virus infection, such as HPV-related cervical disease and head and neck disease, HBV-related liver disease, and EBV-related lymphatic, nasopharyngeal, and gastric diseases.
Preferably, the BWA-MEM algorithm is used for the alignment in step 2).
Preferably, step 2) further comprises removing PCR duplicates. More preferably, the Picard MarkDuplicates software is used to remove the PCR duplicates.
Preferably, in step 4), for paired-end sequencing reads, a pair is included in the statistics of virus type and load only when both reads satisfy the two formulas in step 3).
In a second aspect, the present invention provides a method for detecting the viral content in the human genome, comprising the following step: based on the above statistics of virus type and load, perform relative quantification of the viral copy number according to the alignment of an optional internal reference gene to the above mixed genome, using the following formula:
Figure PCTCN2019124917-appb-000001
where $CN_H$ is the copy number of the internal reference gene (default 2); $D_V$ is the effective cumulative depth of the viral genome, obtained by summing, over all reads from step 3), the number of times each single-base position of the viral genome is covered; $D_H$ is the effective cumulative depth of the internal reference gene, obtained in the same way from all reads that align to the internal reference gene in the alignment to the mixed genome; $C_V$ is the alignment coverage of the viral genome, i.e., the fraction of the viral genome length made up of the single-base positions covered by the reads from step 3); $C_H$ is the alignment coverage of the internal reference gene, i.e., the fraction of the internal reference gene length made up of the single-base positions covered by the reads aligned to it in the mixed genome of step 1); $L_V$ is the effective length of the viral genome targeted by the sequencing probes; and $L_H$ is the effective length of the internal reference gene targeted by the sequencing probes.
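The quantification formula itself is only available as an image in this text, so it is not reproduced here. What can be sketched from the definitions is how a cumulative depth (such as $D_V$ or $D_H$) and an alignment coverage (such as $C_V$ or $C_H$) might be computed from the aligned intervals of the retained reads. The function names and the 0-based half-open interval convention are assumptions made for this illustration.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open [start, end) on the reference


def per_base_depth(intervals: Iterable[Interval], ref_length: int) -> List[int]:
    """Return how many reads cover each single-base position."""
    diff = [0] * (ref_length + 1)
    for start, end in intervals:
        start, end = max(0, start), min(ref_length, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    depth, running = [], 0
    for i in range(ref_length):
        running += diff[i]
        depth.append(running)
    return depth


def cumulative_depth(depth: List[int]) -> int:
    """Sum of per-base coverage counts (the role played by D_V or D_H)."""
    return sum(depth)


def coverage_fraction(depth: List[int]) -> float:
    """Fraction of positions covered at least once (the role of C_V or C_H)."""
    if not depth:
        return 0.0
    return sum(1 for d in depth if d > 0) / len(depth)


# Toy example: two reads on a 10-base reference.
toy_depth = per_base_depth([(0, 4), (2, 6)], ref_length=10)
```

For the toy example, the per-base depth is `[1, 1, 2, 2, 1, 1, 0, 0, 0, 0]`, giving a cumulative depth of 8 and a coverage of 0.6.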
In a third aspect, the present invention provides a method for detecting whether a virus has integrated into the human genome and, if so, at which sites, comprising the following steps:
constructing a reference genome of human and the corresponding virus type according to the virus genome type detected above;
realigning all reads from the first alignment to said reference genome; and
for the alignment results specific to the virus type, detecting whether the virus has integrated, and the integration sites, based on the principle of chimeric read detection.
Preferably, the method comprises the following steps:
S1. Align all reads from the first alignment to the human reference genome alone;
S2. Align all reads from the first alignment to the reference genome of the specific virus type alone;
S3. Align all reads from the first alignment to the mixed reference genome of human and the corresponding virus type, and use Picard MarkDuplicates to remove PCR duplicates from the alignment result;
S4. Using the results of steps S1 and S2, classify the reads in the alignment result of step S3 into single-end chimeric reads, paired-end chimeric reads, and long-distance paired-end cross-region reads;
S5. For the paired-end chimeric reads, merge each pair into a single read and perform a second alignment; for the single-end chimeric reads, perform a second alignment on the single chimeric read;
S6. Filter the reads in the alignment result of step S5;
S7. Locally cluster all reads retained after the filtering in step S6 by their position on the human genome, retain the sites supported by ≥3 reads, and annotate the sites with gene position and function; and
S8. Assemble the reads annotated in step S7, split each assembled sequence into its viral and human parts, and align them to the mixed reference genome a third time; retain the assembled sequences whose alignment results agree with the BWA-MEM alignment results of claim 2.
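Step S7 can be sketched as a simple position-based clustering. The text does not state how close two reads must be to count as the same site, so the `max_gap` parameter below is a hypothetical choice made for this illustration; only the threshold of at least 3 supporting reads comes from the description.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Breakpoint = Tuple[str, int]      # (human chromosome, position)
Site = Tuple[str, int, int, int]  # (chromosome, first pos, last pos, reads)


def cluster_breakpoints(points: Iterable[Breakpoint],
                        max_gap: int = 10,
                        min_reads: int = 3) -> List[Site]:
    """Group nearby human-genome positions and keep well-supported sites.

    max_gap is an assumed clustering distance (not given in the source);
    min_reads = 3 follows the "number of reads >= 3" rule of step S7.
    """
    by_chrom: Dict[str, List[int]] = defaultdict(list)
    for chrom, pos in points:
        by_chrom[chrom].append(pos)

    sites: List[Site] = []
    for chrom in sorted(by_chrom):
        positions = sorted(by_chrom[chrom])
        cluster = [positions[0]]
        for pos in positions[1:]:
            if pos - cluster[-1] <= max_gap:
                cluster.append(pos)
            else:
                if len(cluster) >= min_reads:
                    sites.append((chrom, cluster[0], cluster[-1], len(cluster)))
                cluster = [pos]
        if len(cluster) >= min_reads:
            sites.append((chrom, cluster[0], cluster[-1], len(cluster)))
    return sites


toy_points = [("chr3", 100), ("chr3", 102), ("chr3", 105),
              ("chr3", 500), ("chr8", 40), ("chr8", 41)]
toy_sites = cluster_breakpoints(toy_points)
```

In the toy data only the chr3 cluster around positions 100 to 105 has three supporting reads, so it is the only site that would be passed on to annotation.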
Preferably, the reads filtered out in step S6 include reads for which:
the result is inconsistent with the BWA-MEM alignment;
the viral or human portion of the read is too short (≤30 bp);
the overlap between the viral and human alignments is too long (≥50% of the read length);
the alignment of the human portion of the read is not unique; or
the human portion of the read comes from a low-repeat region of DNA.
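The five exclusion criteria can be collected into one small check. This is a sketch under assumptions: the record type `ChimericCandidate` and its field names are hypothetical, and the three boolean fields are assumed to have been determined upstream (for example, from the second alignment and a repeat annotation). The "too short" rule is read here as applying when either the viral or the human portion is ≤30 bp.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChimericCandidate:
    read_len: int               # total read length in bases
    virus_len: int              # length of the portion aligned to the virus
    human_len: int              # length of the portion aligned to human
    overlap_len: int            # bases claimed by both alignments
    agrees_with_bwa_mem: bool   # second alignment matches BWA-MEM result
    human_unique: bool          # human portion aligns to a single locus
    human_in_flagged_region: bool  # human portion lies in the excluded region type


def rejection_reasons(c: ChimericCandidate) -> List[str]:
    """Return the list of criteria a candidate violates (empty = keep)."""
    reasons = []
    if not c.agrees_with_bwa_mem:
        reasons.append("inconsistent with BWA-MEM")
    if c.virus_len <= 30 or c.human_len <= 30:
        reasons.append("segment too short")
    if c.overlap_len >= 0.5 * c.read_len:
        reasons.append("overlap too long")
    if not c.human_unique:
        reasons.append("human alignment not unique")
    if c.human_in_flagged_region:
        reasons.append("human segment in excluded region")
    return reasons


good = ChimericCandidate(read_len=150, virus_len=60, human_len=85,
                         overlap_len=5, agrees_with_bwa_mem=True,
                         human_unique=True, human_in_flagged_region=False)
short_human = ChimericCandidate(read_len=150, virus_len=120, human_len=30,
                                overlap_len=0, agrees_with_bwa_mem=True,
                                human_unique=True, human_in_flagged_region=False)
```

Returning the reasons rather than a bare boolean makes it easy to tabulate why candidates were dropped when tuning a pipeline.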
Preferably, the ANNOVAR software is used in step S7 to annotate gene position and function; the IDBA-UD software is used in step S8 for assembly; and the BLASTN software is used for both the second alignment in step S5 and the third alignment in step S8.
In summary, the beneficial effects of the present invention are:
The method of the present invention can detect multiple different viral infections simultaneously;
The method of the present invention can finely classify reads from different subtypes of the same virus and thereby determine viral load;
The method of the present invention is applicable to reads from different library construction sources and is broadly applicable to NGS-based viral pathogen detection;
The method of the present invention can accurately detect the type of virus integrated into the human genome, the integration sites, and the specific integrated sequences, providing solid theoretical support for downstream validation.
Description of the drawings
Figure 1 is a schematic flow chart of the method for detecting viral integration in the human genome according to the present invention;
Figure 2 shows the relationship between the HPV type with the highest load and the number of integration sites of the integrated HPV type in samples with multiple HPV infections; in about 66.7% of the samples infected with multiple HPV types, the integrated HPV type is the type with the highest viral load;
Figure 3 shows the HPV typing results of Example 1; the vertical axis is the number of valid aligned HPV reads;
Figure 4 shows the HPV typing results of Example 2; the vertical axis is the number of valid aligned HPV reads;
Figure 5 shows the HPV typing results of Example 3; the vertical axis is the number of valid aligned HPV reads;
Figure 6 shows the HPV typing results of Example 4; the vertical axis is the number of valid aligned HPV reads;
Figure 7 shows the HPV typing results of Example 5; the vertical axis is the number of valid aligned HPV reads;
Figure 8 shows the HPV typing results of Example 6; the vertical axis is the number of valid aligned HPV reads;
Figure 9 shows the read support counts for samples in which the type with the highest viral load is Type1;
Figure 10 shows the read support counts for samples in which the type with the highest viral load is Type2.
Detailed description
In some embodiments, the present invention provides a method for accurately detecting the types and loads of double-stranded DNA virus infection, viral integration breakpoints, and human-virus genome fusion sequences. Guiding virus-associated tumor screening and treatment decisions with the results of this method is more accurate and efficient. The method can detect the main infecting virus types with carcinogenic potential, assess cancer risk from the occurrence of integration, guide personalized screening strategies for the related tumors, and, based on the number and biological significance of the viral integration sites, suggest antiviral and targeted anti-tumor treatment plans for cancer patients.
In some embodiments, the present invention provides a method for calculating the type of double-stranded DNA virus infection. The method is based on next-generation sequencing reads, and its optimal use case is virus capture sequencing. The alignment information of the sequencing reads is filtered to precisely select the reads originating from viral DNA; duplicate bias that may be introduced during library construction is removed from the reads; and the number of reads for each type of viral DNA is counted, which indirectly reflects the load of each infecting virus type. The method comprises the following steps:
1) In the initial alignment, the genomes of all virus types collected from the database are treated as pseudo-chromosomes and merged with the chromosomes of the human genome to construct a mixed genome;
2) Alignment uses the BWA-MEM algorithm, which supports locally optimal alignment; after alignment, Picard MarkDuplicates is used to remove PCR duplicates;
3) The alignments to non-human chromosomes are collected from the alignment result. For reads aligned to a specific virus type, the reads are accurately reclassified according to the aligned-length fraction and the similarity fraction of the read alignment, using the following screening formulas:
$$L_M \ge (L_M + L_S + L_H + L_I) \times 0.5$$
$$3 \times L_I + 2 \times L_D + L_{MIS} \le (L_M + L_D) \times 0.2$$
where $L_M$ is the length of the read that aligns to the specific virus type; $L_S$ and $L_H$ are the lengths at the two ends of the read (larger fragments) that do not align to the viral DNA; $L_I$ is the length of insertions (small fragments) within the read; $L_D$ is the length of deletions (small fragments) within the read; and $L_{MIS}$ is the length of single-base mismatches in the read;
4) Reads that satisfy both of the above conditions (formulas) enter the statistics of virus type load; for paired-end sequencing reads, both reads must satisfy the above conditions (both formulas) before the pair enters the downstream statistics;
5) The above steps complete the preliminary statistics of virus type load. Relative quantification of the viral copy number is then performed according to the alignment of an optional internal reference gene, using the following formula:
Figure PCTCN2019124917-appb-000002
where $CN_H$ is the copy number of the internal reference gene (default 2); $D_V$ is the effective cumulative depth of the viral genome, obtained by summing, over all reads from step 3), the number of times each single-base position of the viral genome is covered; $D_H$ is the effective cumulative depth of the internal reference gene, obtained in the same way from all reads that align to the internal reference gene in the alignment to the mixed genome; $C_V$ is the alignment coverage of the viral genome, i.e., the fraction of the viral genome length made up of the single-base positions covered by the reads from step 3); $C_H$ is the alignment coverage of the internal reference gene, i.e., the fraction of the internal reference gene length made up of the single-base positions covered by the reads aligned to it in the mixed genome; $L_V$ is the effective length of the viral genome targeted by the sequencing probes; and $L_H$ is the effective length of the internal reference gene targeted by the sequencing probes.
The internal reference gene used in the above viral copy number calculation is applicable to the fixed probes and sequencing reads adopted by different library construction and detection systems, and includes, but is not limited to, conserved genes of the human genome.
In some embodiments, the above algorithm is mainly applied to whole-genome sequencing, exome sequencing, whole-transcriptome sequencing, and virus capture sequencing data from clinical patients with double-stranded DNA virus infection. By mining the DNA sequencing data, the method scans for the double-stranded DNA virus species and types present, identifies the DNA fusion site sequences for each virus type that reaches the detection level, and comprehensively assesses the patient's virus-related cancer risk from the infecting virus species, the infection abundance, the integration into the human genome, and the significance of the corresponding integration sites, thereby guiding clinical early-warning decisions and treatment plans. The main application scenarios are as follows:
1. Determining the infection type of double-stranded DNA viruses such as human papillomavirus (HPV), hepatitis B virus (HBV), and Epstein-Barr virus (EBV). These viruses have numerous types and variants and often occur as mixed multi-type infections. With the detection method of the present invention, the viral DNA types that reach detectable abundance in the DNA sequencing data can be detected and the abundance of each type counted;
2. Predicting the integration sites of viral DNA integrated into the human genome, interpreting the significance of the corresponding human genome integration sites, and guiding clinical intervention against viral integration to prevent progression of the corresponding tumor;
3. Application to tumor patient genomes: pattern recognition on large structural variations to predict pan-cancer prognosis;
4. Application to tumor patient genomes: predicting the pan-cancer therapeutic response to anti-tumor drugs based on the principle of synthetic lethality, such as polq- and PARP1-based drugs.
In some embodiments, according to the virus genome types detected by the above algorithm, a reference genome specific to human plus the corresponding virus type is constructed, and all reads are realigned to it. For the type-specific alignment results, whether the virus has integrated, and the integration sites, are detected based on the principle of chimeric read detection. The procedure comprises the following steps:
1. Align all reads to the human reference genome alone;
2. Align all reads to the reference genome of the specific virus type alone;
3. Align all reads to the mixed reference genome of human and the corresponding virus type, and use Picard MarkDuplicates to remove PCR duplicates from the alignment result;
4. Using the results of steps 1 and 2, classify the reads in the alignment result of step 3 into single-end chimeric reads, paired-end chimeric reads, and long-distance paired-end cross-region reads (see Figure 1);
5. Handle single-end and paired-end chimeric reads separately: for paired-end chimeric reads, merge each pair into a single read and perform a second alignment with BLASTN; for single-end chimeric reads, perform a second alignment of the single chimeric read with BLASTN;
6. Filter the alignment results of step 5. Reads are filtered out when: the result is inconsistent with the BWA-MEM alignment; the viral or human portion of the read is too short (≤30 bp); the overlap between the viral and human alignments is too long (≥50% of the read length); the alignment of the human portion of the read is not unique; or the human portion of the read comes from a low-repeat region of DNA;
7. Locally cluster all reads retained after the filtering in step 6 by their position on the human genome, retain the sites supported by ≥3 reads, and annotate the sites with gene position and function using the ANNOVAR software;
8. Assemble the reads from step 7 using the IDBA-UD software, split each assembled sequence into its viral and human parts, and perform a third alignment with BLASTN; retain the assembled sequences whose alignment results agree with the BWA-MEM alignment.
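The three read categories of step 4 are defined by Figure 1, which is not reproduced in this text. The Python sketch below therefore encodes only one possible reading, stated here as an assumption: a mate is "chimeric" when its own alignment segments hit both human and virus; a pair is single-end chimeric when exactly one mate is chimeric, paired-end chimeric when both are, and a cross-region pair when neither mate is chimeric but one mate maps to human and the other to virus. All names are hypothetical.

```python
from enum import Enum
from typing import Iterable


class PairClass(Enum):
    SINGLE_END_CHIMERIC = "single-end chimeric"
    PAIRED_END_CHIMERIC = "paired-end chimeric"
    CROSS_REGION_PAIR = "long-distance cross-region pair"
    OTHER = "not informative"


def classify_pair(mate1_hits: Iterable[str],
                  mate2_hits: Iterable[str]) -> PairClass:
    """Classify a read pair from the genomes each mate aligns to.

    Each argument lists the genomes ("human" and/or "virus") hit by the
    alignment segments of one mate. The category definitions are an
    assumed interpretation, not taken from the source figure.
    """
    hits1, hits2 = set(mate1_hits), set(mate2_hits)
    both = {"human", "virus"}
    chimeric1, chimeric2 = both <= hits1, both <= hits2

    if chimeric1 and chimeric2:
        return PairClass.PAIRED_END_CHIMERIC
    if chimeric1 or chimeric2:
        return PairClass.SINGLE_END_CHIMERIC
    if hits1 | hits2 == both:
        # Neither mate is split, but the pair spans human and virus.
        return PairClass.CROSS_REGION_PAIR
    return PairClass.OTHER


example = classify_pair(["human", "virus"], ["human"])
```

Under this reading, a pair with one split mate, such as `example`, falls into the single-end chimeric class and would go to the single-read second alignment of step 5.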
In one embodiment, in order to detect multiple virus types of the same species, risk assessment can be performed on the basis of the viral integration sites obtained by the above calculation method to identify the virus types that carry the main carcinogenic risk. The integrated virus type largely coincides with the virus type with the highest viral load, with a concordance rate of 70% (see Figure 2).
In one embodiment, the cell-free DNA sample used comes from cells or tissue obtained in routine examinations, including but not limited to exfoliated cervical cells, liver puncture biopsy cells, lymph node biopsy tissue, blood, and saliva.
In some embodiments, if integration sites of a tumor-associated virus are found in the patient sample, the frequency of downstream clinical monitoring can be increased or a more aggressive clinical treatment plan adopted; otherwise, the monitoring frequency can be reduced or the clinical treatment plan de-escalated (see Examples 1 to 6).
In one embodiment, the virus types covered by the present invention may include one or more double-stranded DNA viruses such as HPV, HBV, and EBV; for a given virus, any number of viral species of different types may be included; and for the same species of the same virus, probes may be designed against the whole genome or against partial regions of the genome.
To better illustrate the purpose, technical solutions, and advantages of the present invention, the invention is further described below with reference to the drawings and specific examples. The invention is illustrated through examples of type-specific infection distribution and integration site detection in cervical cancer, nasopharyngeal carcinoma, and liver cancer samples. These examples are for illustration only, and the invention is not limited to these three diseases. Unless otherwise specified, the experimental methods used in the present invention are conventional methods.
Example 1
An example of the method of the present invention for accurately detecting DNA viruses in the human genome comprises the following steps:
For patient A, who had mild cervicitis, a portion of cervical tissue was taken for capture sequencing. The sequencing data were analyzed as follows.
The alignment information of the sequencing reads was filtered to precisely select the reads originating from viral DNA; duplicate bias that may be introduced during library construction was removed from the reads; and the number of reads for each type of viral DNA was counted, which indirectly reflects the load of each infecting virus type. The specific steps are as follows:
(1) In the initial alignment, the genomes of all HPV types were collected from the papillomavirus genome database PaVE, treated as pseudo-chromosomes, and merged with the chromosomes of the human genome to construct a mixed genome;
(2) Alignment used the BWA-MEM algorithm, which supports locally optimal alignment; after alignment, Picard MarkDuplicates was used to remove PCR duplicates;
(3) The alignments to HPV genomes were collected from the alignment result. For reads aligned to a specific HPV type, the reads were accurately reclassified according to the aligned-length fraction and the similarity fraction of the read alignment, using the following screening formulas:
$$L_M \ge (L_M + L_S + L_H + L_I) \times 0.5;$$
$$3 \times L_I + 2 \times L_D + L_{MIS} \le (L_M + L_D) \times 0.2,$$
where $L_M$ is the length of the read that aligns to the specific virus type; $L_S$ and $L_H$ are the lengths at the two ends of the read (larger fragments) that do not align to the viral DNA; $L_I$ is the length of insertions (small fragments) within the read; $L_D$ is the length of deletions (small fragments) within the read; and $L_{MIS}$ is the length of single-base mismatches in the read;
(4) Reads that satisfied both of the above conditions entered the statistics of virus type load; for paired-end sequencing reads, both reads had to satisfy the above conditions before the pair entered the downstream statistics.
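The text does not say how the six lengths are extracted from a BWA-MEM alignment. One plausible route, sketched here as an assumption, is to read them from the SAM CIGAR string and the NM (edit distance) tag: $L_S$ and $L_H$ are taken as the soft-clipped (S) and hard-clipped (H) bases, $L_M$ as the aligned bases (M, = and X), $L_I$ and $L_D$ as the I and D bases, and $L_{MIS}$ as NM minus the inserted and deleted bases. The function name `cigar_lengths` is hypothetical.

```python
import re
from typing import Dict

_CIGAR_FULL = re.compile(r"(?:\d+[MIDNSHP=X])+")
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_lengths(cigar: str, nm: int) -> Dict[str, int]:
    """Derive the screening lengths from a CIGAR string and NM value.

    Assumed mapping (not stated in the source): S -> L_S, H -> L_H,
    M/=/X -> L_M, I -> L_I, D -> L_D, and L_MIS = NM - L_I - L_D.
    """
    if not _CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    totals = {op: 0 for op in "MIDNSHP=X"}
    for count, op in _CIGAR_OP.findall(cigar):
        totals[op] += int(count)
    l_i, l_d = totals["I"], totals["D"]
    return {
        "L_M": totals["M"] + totals["="] + totals["X"],
        "L_S": totals["S"],
        "L_H": totals["H"],
        "L_I": l_i,
        "L_D": l_d,
        # NM counts mismatches plus inserted and deleted bases.
        "L_MIS": max(0, nm - l_i - l_d),
    }


# A 150-base toy read: 10 clipped, 2 inserted, 3 deleted, 3 mismatches.
toy = cigar_lengths("10S85M2I50M3D3M", nm=8)
```

For the toy read, the derived values are $L_M = 138$, $L_S = 10$, $L_I = 2$, $L_D = 3$ and $L_{MIS} = 3$; both screening inequalities hold (138 ≥ 75 and 15 ≤ 28.2), so such a read would be counted.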
Based on the detected virus genome types, infection with two HPV types (HPV31 and HPV33) was found in patient A's sample. Integration sites were detected separately for each of the two types. For HPV31, a mixed reference genome of human and HPV31 was constructed. All reads were realigned, and for the type-specific alignment results, whether the virus had integrated, and the integration sites, were detected based on the principle of chimeric read detection.
The specific steps are as follows:
(1)将所有读段单独比对人参考基因组。(1) Compare all reads individually to the human reference genome.
(2)将所有读段单独比对HPV31病毒参考基因组。(2) Compare all reads individually to the HPV31 virus reference genome.
(3)将所有读段比对人和HPV31病毒混合参考基因组,使用Picard Mark duplicates对比对结果进行去除PCR重复。(3) Compare all reads to the human and HPV31 virus mixed reference genome, and use Picard Mark Duplicates to compare the results to remove PCR duplication.
(4)结合步骤1和步骤2的结果,对步骤3中的比对结果进行读段的统计分类,分成单端嵌合体读段,双端嵌合体读段及远距离双端跨区域读段(附图1)。(4) Combining the results of step 1 and step 2, perform a statistical classification of the reads from the comparison results in step 3, and divide them into single-ended chimera reads, double-ended chimera reads, and long-distance double-ended cross-regional reads. (Figure 1).
(5)分情况处理单端嵌合体读段和双端嵌合体读段:对于双端嵌合体读段,将其合并成一体读段进行BLASTN二次比对;对于单端嵌合体读段,对嵌合单条读段进行BLASTN二次比对。(5) Handling single-ended chimera reads and double-ended chimera reads according to circumstances: for double-ended chimera reads, merge them into a single read for BLASTN secondary comparison; for single-ended chimera reads, Perform BLASTN secondary alignment on chimeric single reads.
(6)对步骤5的比对结果进行过滤,过滤读段包括:跟BWA-MEM比对的结果不一致;病毒及人的读段占比过短(≤30bp);病毒及人的比对交叉读段过长(≥50%读段长度);人的读段部分比对结果不唯一;人的读段部分来自DNA低重复区域。(6) Filter the comparison results of step 5. The filtered reads include: the results of the comparison with BWA-MEM are inconsistent; the ratio of virus and human reads is too short (≤30bp); the comparison of viruses and human crosses The reads are too long (≥50% of the read length); the comparison results of the human reads are not unique; the human reads are from low repetitive regions of DNA.
(7)将步骤6所有过滤后保留的读段按照人的读段位置进行局部聚类,保留读段数目≥3的位点,将位点使用ANNOVAR软件进行基因位置及功能的注释。(7) Perform partial clustering of all the reads retained after filtering in step 6 according to the human read position, retain the sites with the number of reads ≥ 3, and use ANNOVAR software to annotate the gene positions and functions of the sites.
(8)将步骤7中的读段使用IDBA-UD软件进行组装,将组装的序列分病毒及人的部分进行第三次BLASTN比对,比对结果与BWA-MEM比对一致的组装序列进行保留。(8) Use the IDBA-UD software to assemble the reads in step 7, and divide the assembled sequence into virus and human parts for the third BLASTN alignment, and the alignment results are consistent with the assembled sequence of the BWA-MEM alignment. Reserved.
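As an illustration of the local clustering in step (7), the following minimal sketch groups the human-side positions of the retained reads and keeps clusters supported by at least 3 reads. The function name, the input layout and the 500 bp clustering window are assumptions made for this example; the text specifies only the read-count threshold.

```python
def cluster_breakpoints(breakpoints, window=500, min_reads=3):
    """Group (chrom, pos) human-side positions into local clusters.

    `window` (hypothetical value) is the largest gap allowed between
    consecutive positions of one cluster; clusters with fewer than
    `min_reads` supporting reads are dropped.
    """
    clusters = []
    for chrom, pos in sorted(breakpoints):
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == chrom and pos - last["end"] <= window:
            last["end"] = pos
            last["n_reads"] += 1
        else:
            clusters.append(
                {"chrom": chrom, "start": pos, "end": pos, "n_reads": 1}
            )
    return [c for c in clusters if c["n_reads"] >= min_reads]
```

Each surviving cluster is one candidate integration site; in the workflow described here it would then be passed to annotation.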
The results showed that no integration site of HPV31 in the human genome was detected.
The same analysis was performed for HPV33, and two distinct integration sites were detected (see Table 1).
Table 1. HPV33 integration results
Figure PCTCN2019124917-appb-000003
In summary, patient A was found to be infected with two HPV types (HPV31 and HPV33; see Figure 3), and high-risk HPV33 was found to be integrated at two distinct sites in the human genome. Colposcopy can therefore be performed immediately to avoid a missed diagnosis. This differs from the international guidelines, under which patients positive for types other than HPV16 and HPV18 may continue to be observed.
Example 2
In patient B, who had mild cervicitis (for the detection method, see Example 1), infection with three HPV types (HPV16, HPV31 and HPV56) was found (Figure 4), but no HPV integration was found. Follow-up can therefore continue and unnecessary colposcopy can be avoided. This differs from the international guidelines, which recommend colposcopy referral for HPV16-positive patients.
Example 3
During follow-up of patient C, who had a low-grade cervical lesion (for the detection method, see Example 1), high-risk HPV16 infection was found (Figure 5), but no HPV integration was found. Follow-up can therefore continue and unnecessary colposcopy can be avoided. This differs from the international guidelines, which recommend colposcopy referral for HPV16-positive patients.
Example 4
During follow-up of patient D, who had a low-grade cervical lesion (for the detection method, see Example 1), persistent infection with multiple HPV types (HPV16 and HPV56) was found (Figure 6), and high-risk HPV56 was found to be integrated into the human genome (see Table 2). Colposcopy can therefore be performed immediately to prevent progression.
Table 2. HPV56 integration results
Figure PCTCN2019124917-appb-000004
Example 5
In patient E, who had a high-grade cervical lesion (for the detection method, see Example 1), high-risk HPV16 infection was found (Figure 7) together with 2 integration sites (see Table 3). Surgical treatment can be adopted to prevent progression to cervical cancer.
Table 3. HPV16 integration results
Figure PCTCN2019124917-appb-000005
Example 6
In cervical cancer patient F (for the detection method, see Example 1), infection with multiple HPV types (HPV16 and HPV18) was found (Figure 8), and high-risk HPV18 was found to be integrated at two distinct sites in the human genome (see Table 4). One integration site lies near the human CHRAC1 gene, which can guide personalized clinical medication.
Table 4. HPV18 integration results
Figure PCTCN2019124917-appb-000006
Example 7
A total of 112 nasopharyngeal carcinoma samples were collected from the First Affiliated Hospital of Sun Yat-sen University, and the EBV infection type of each sample was detected with an embodiment of the method of the present invention. The specific steps are as follows:
(1) For the initial alignment, two types of EBV, Type1 and Type2, were collected from the literature and the NCBI database. The genomes of both EBV types were treated as pseudo-chromosomes and merged with the chromosomes of the human genome to construct a hybrid genome.
(2) Alignment was performed with the BWA-MEM algorithm, which supports locally optimal alignment, and PCR duplicates were then removed with Picard MarkDuplicates.
(3) The alignment results against the EBV genomes are tallied. For reads aligned to a specific viral type, each read is accurately reclassified according to the aligned-length ratio and the similarity ratio of its alignment. The read-filtering formulas are as follows:
$L_M \ge (L_M + L_S + L_H + L_I) \times 0.5$;
$3 \times L_I + 2 \times L_D + L_{MIS} \le (L_M + L_D) \times 0.2$,
where $L_M$ is the length of the read aligned to the specific viral type; $L_S$ and $L_H$ are the lengths of the (larger) segments at the two ends of the read that do not align to viral DNA; $L_I$ is the length of (small) insertions within the read; $L_D$ is the length of (small) deletions within the read; and $L_{MIS}$ is the number of single-base mismatches in the read;
(4) Reads that satisfy both conditions enter the per-type viral load statistics. For paired-end sequencing, both reads of a pair must satisfy the conditions before the pair enters the downstream statistics.
For each sample, the EBV type with the highest viral load and the number of reads supporting that type were determined. The results are shown in Figures 9 and 10: in most samples, the type with the higher viral load was Type1.
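The per-sample tally described above can be sketched as follows. The function name and the input layout (one viral type label per read that passed the filter) are assumptions for illustration; ties are broken alphabetically only to make the example deterministic.

```python
from collections import Counter


def dominant_type(read_types):
    """Return (type, supporting_read_count) for the most supported type.

    `read_types` holds one viral type label per filtered read of a sample.
    Returns None when the sample has no viral reads.
    """
    counts = Counter(read_types)
    if not counts:
        return None
    # Highest count first; alphabetical order breaks ties.
    return min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

For example, a sample whose filtered reads are labelled `["Type1", "Type1", "Type2"]` yields `("Type1", 2)`.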
Example 8
From liver cancer patient G, liver cancer tissue and adjacent non-tumor tissue were collected and each subjected to capture sequencing. The infection types and integration sites in the cancer tissue and in the adjacent tissue were detected with the method of the present invention. The detection steps for the cancer tissue were as follows:
The alignment information of the sequencing reads from the cancer tissue was filtered to accurately select the reads originating from viral DNA, duplicates that may have been introduced during library construction were removed, and the number of reads for each viral type was counted as an indirect measure of the load of each infecting type. The specific steps are as follows:
(1) For the initial alignment, 11 HBV genomes were collected from the literature and the NCBI database. All collected HBV genomes were treated as pseudo-chromosomes and merged with the chromosomes of the human genome to construct a hybrid genome.
(2) Alignment was performed with the BWA-MEM algorithm, which supports locally optimal alignment, and PCR duplicates were then removed with Picard MarkDuplicates.
(3) The alignment results against the HBV genomes are tallied. For reads aligned to a specific HBV type, each read is accurately reclassified according to the aligned-length ratio and the similarity ratio of its alignment. The read-filtering formulas are as follows:
$L_M \ge (L_M + L_S + L_H + L_I) \times 0.5$;
$3 \times L_I + 2 \times L_D + L_{MIS} \le (L_M + L_D) \times 0.2$,
where $L_M$ is the length of the read aligned to the specific viral type; $L_S$ and $L_H$ are the lengths of the (larger) segments at the two ends of the read that do not align to viral DNA; $L_I$ is the length of (small) insertions within the read; $L_D$ is the length of (small) deletions within the read; and $L_{MIS}$ is the number of single-base mismatches in the read;
(4) Reads that satisfy both conditions enter the per-type viral load statistics. For paired-end sequencing, both reads of a pair must satisfy the conditions before the pair enters the downstream statistics.
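Step (1) above, in which viral genomes are appended to the human chromosomes as pseudo-chromosomes, amounts to concatenating FASTA records under unique names. The sketch below works on FASTA text held in memory; the function names and the 60-column line width are assumptions for illustration, not part of the described method.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        elif name is None:
            raise ValueError("sequence data found before any FASTA header")
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def build_hybrid_reference(human_fasta, viral_fasta, width=60):
    """Append viral genomes to the human chromosomes as pseudo-chromosomes."""
    out, seen = [], set()
    for name, seq in read_fasta(human_fasta) + read_fasta(viral_fasta):
        if name in seen:
            raise ValueError(f"duplicate sequence name: {name}")
        seen.add(name)
        out.append(">" + name)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

The combined text would then be written to a file and indexed by the aligner before the first alignment.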
Based on the detected viral genotypes, infection with three HBV types (AB014381, AF090842 and AB033554) was found in the cancer sample from patient G. Integration sites were then detected separately for each of the three types. For AB014381, a hybrid reference genome of human and AB014381 was constructed, all reads were realigned to it, and the type-specific alignment results were used to determine, on the basis of chimeric reads, whether the virus had integrated and at which sites.
The specific steps are as follows:
(1) Align all reads to the human reference genome alone.
(2) Align all reads to the AB014381 reference genome alone.
(3) Align all reads to the hybrid human and AB014381 reference genome, and remove PCR duplicates from the alignment results with Picard MarkDuplicates.
(4) Using the results of steps 1 and 2, classify the reads in the step 3 alignment results into single-end chimeric reads, paired-end chimeric reads, and long-distance paired-end cross-region reads (Figure 1).
(5) Handle single-end and paired-end chimeric reads separately: merge each paired-end chimeric read pair into a single read and realign it with BLASTN; realign each single-end chimeric read directly with BLASTN.
(6) Filter the results of step 5. A read is removed if its result is inconsistent with the BWA-MEM alignment; if its viral or human portion is too short (≤30 bp); if the overlap between its viral and human alignments is too long (≥50% of the read length); if its human portion does not align uniquely; or if its human portion comes from a low-complexity repetitive DNA region.
(7) Cluster the reads retained after step 6 locally by the position of their human portion, keep the sites supported by ≥3 reads, and annotate the gene location and function of each site with ANNOVAR.
(8) Assemble the reads from step 7 with IDBA-UD, align the viral and human portions of the assembled sequences with BLASTN a third time, and retain the assembled sequences whose alignment is consistent with the BWA-MEM alignment.
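The five removal criteria of step (6) can be expressed as a single predicate. The following sketch assumes that each candidate chimeric read has already been summarized into the fields shown; the class, field and function names are hypothetical, while the two numeric thresholds are the ones stated in step (6).

```python
from dataclasses import dataclass


@dataclass
class ChimericHit:
    read_len: int          # total read length in bases
    human_len: int         # bases aligned to the human genome
    viral_len: int         # bases aligned to the viral genome
    overlap_len: int       # bases claimed by both alignments
    agrees_with_bwa: bool  # BLASTN result consistent with BWA-MEM
    human_unique: bool     # human portion aligns to a single locus
    human_in_repeat: bool  # human portion lies in a repeat region


def keep_hit(hit, min_part=30, max_overlap_frac=0.5):
    """Return True when the read survives all five filters of step (6)."""
    if not hit.agrees_with_bwa:
        return False
    if hit.human_len <= min_part or hit.viral_len <= min_part:
        return False  # viral or human portion too short (<= 30 bp)
    if hit.overlap_len >= max_overlap_frac * hit.read_len:
        return False  # overlap too long (>= 50% of the read length)
    if not hit.human_unique:
        return False
    if hit.human_in_repeat:
        return False
    return True
```

A 150 bp read with 80 human bases, 75 viral bases and a 5 bp overlap passes when the three flags are favourable; shrinking the viral portion to 30 bp removes it.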
For the other two infection types detected in the cancer tissue, AF090842 and AB033554, the integration site detection steps above were repeated. In the end, integration sites were detected only for AB014381, with 3 sites found (see Table 5).
The adjacent tissue was analyzed with the same infection-type detection steps, and the same three HBV types (AB014381, AF090842 and AB033554) were detected. Integration sites were then detected separately for each of the three types. In the end, 2 integration sites were detected for AB014381 (see Table 5).
Table 5. AB014381 integration sites
Figure PCTCN2019124917-appb-000007
Example 9
Cervical brush samples from women were collected at the cervical screening clinic of the First Affiliated Hospital of Sun Yat-sen University and preserved in BD SurePath LBC cell preservation solution. Genomic DNA was extracted with the EasyPure Genomic DNA Kit (TransGen Biotech, Beijing), fragmented with a Bioruptor Pico sonicator, ligated to adapters and purified to prepare a DNA library. The library was hybridized with HPV probe DNA and captured with magnetic beads, and the captured fragments were sequenced by high-throughput paired-end sequencing (PE150). The sequencing data were then analyzed with the method of the present invention as follows:
The alignment information of the sequencing reads was filtered to accurately select the reads originating from viral DNA, duplicates that may have been introduced during library construction were removed, and the number of reads for each viral type was counted as an indirect measure of the load of each infecting type. The specific steps are as follows:
(1) For the initial alignment, the genomes of all HPV types were collected from the papillomavirus genome database PaVE. The collected HPV genomes were treated as pseudo-chromosomes and merged with the chromosomes of the human genome to construct a hybrid genome.
(2) Alignment was performed with the BWA-MEM algorithm, which supports locally optimal alignment, and PCR duplicates were then removed with Picard MarkDuplicates.
(3) The alignment results against the HPV genomes are tallied. For reads aligned to a specific viral type, each read is accurately reclassified according to the aligned-length ratio and the similarity ratio of its alignment. The read-filtering formulas are as follows:
$L_M \ge (L_M + L_S + L_H + L_I) \times 0.5$;
$3 \times L_I + 2 \times L_D + L_{MIS} \le (L_M + L_D) \times 0.2$,
where $L_M$ is the length of the read aligned to the specific viral type; $L_S$ and $L_H$ are the lengths of the (larger) segments at the two ends of the read that do not align to viral DNA; $L_I$ is the length of (small) insertions within the read; $L_D$ is the length of (small) deletions within the read; and $L_{MIS}$ is the number of single-base mismatches in the read;
(4) Reads that satisfy both conditions enter the per-type viral load statistics. For paired-end sequencing, both reads of a pair must satisfy the conditions before the pair enters the downstream statistics.
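One possible way to obtain the six lengths used in the formulas is to read them from the CIGAR string and the NM tag of a SAM record. The mapping below is an assumption made for this sketch and is not stated in the text: $L_S$ and $L_H$ are taken as soft- and hard-clipped bases, $L_M$ as the sum of the M, = and X operations, and $L_{MIS}$ as the NM edit distance minus the inserted and deleted bases. The function names are hypothetical.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def lengths_from_cigar(cigar, nm):
    """Derive the formula lengths from a CIGAR string and the NM tag."""
    totals = dict.fromkeys("MIDNSHP=X", 0)
    for count, op in _CIGAR_OP.findall(cigar):
        totals[op] += int(count)
    l_i, l_d = totals["I"], totals["D"]
    return {
        "L_M": totals["M"] + totals["="] + totals["X"],
        "L_S": totals["S"],
        "L_H": totals["H"],
        "L_I": l_i,
        "L_D": l_d,
        # NM counts mismatches plus inserted and deleted bases.
        "L_MIS": max(nm - l_i - l_d, 0),
    }


def read_passes(cigar, nm):
    """Apply both filtering inequalities to one alignment."""
    v = lengths_from_cigar(cigar, nm)
    length_ok = v["L_M"] >= (v["L_M"] + v["L_S"] + v["L_H"] + v["L_I"]) * 0.5
    similarity_ok = (
        3 * v["L_I"] + 2 * v["L_D"] + v["L_MIS"]
        <= (v["L_M"] + v["L_D"]) * 0.2
    )
    return length_ok and similarity_ok
```

With this mapping, an alignment `100M50S` with `NM=3` passes both conditions, while `40M110S` fails the length condition because fewer than half of the bases are aligned.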
Based on the detected viral genotypes, a hybrid reference genome of human and each corresponding viral type is constructed, all reads are realigned to it, and the type-specific alignment results are used to determine, on the basis of chimeric reads, whether the virus has integrated and at which sites.
The specific steps are as follows:
(1) Align all reads to the human reference genome alone.
(2) Align all reads to the reference genome of the specific viral type alone.
(3) Align all reads to the hybrid reference genome of human and the corresponding viral type, and remove PCR duplicates from the alignment results with Picard MarkDuplicates.
(4) Using the results of steps 1 and 2, classify the reads in the step 3 alignment results into single-end chimeric reads, paired-end chimeric reads, and long-distance paired-end cross-region reads (Figure 1).
(5) Handle single-end and paired-end chimeric reads separately: merge each paired-end chimeric read pair into a single read and realign it with BLASTN; realign each single-end chimeric read directly with BLASTN.
(6) Filter the results of step 5. A read is removed if its result is inconsistent with the BWA-MEM alignment; if its viral or human portion is too short (≤30 bp); if the overlap between its viral and human alignments is too long (≥50% of the read length); if its human portion does not align uniquely; or if its human portion comes from a low-complexity repetitive DNA region.
(7) Cluster the reads retained after step 6 locally by the position of their human portion, keep the sites supported by ≥3 reads, and annotate the gene location and function of each site with ANNOVAR.
(8) Assemble the reads from step 7 with IDBA-UD, align the viral and human portions of the assembled sequences with BLASTN a third time, and retain the assembled sequences whose alignment is consistent with the BWA-MEM alignment.
The infection types and the number of integration sites were counted for each sample. Samples with multiple HPV infections and at least one integration site were selected, 15 in total. One stacked bar chart was drawn from the proportion of reads supporting each infection type among all viral reads in each sample, and another from the number of integration sites of each infection type in each sample.
The results are shown in Figure 2. The horizontal axis gives the sample name, and different colors denote different infection types. The lower panel shows, for each sample, the proportion of reads of each infection type among all viral reads; the upper panel shows, for each sample, the number of integration sites of each infection type. As Figure 2 shows, in 11 of the 15 samples with multiple HPV infections only one HPV type was integrated, and in 10 samples the integrated type was the type with the highest viral load, which is 66.7% of all 15 samples.
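The quantities plotted in the lower panel and the 66.7% figure reduce to simple ratios, sketched below. The function names and the read counts in the usage note are hypothetical; only the 10-of-15 proportion comes from the text.

```python
def type_fractions(read_counts):
    """Fraction of all viral reads supporting each infection type."""
    total = sum(read_counts.values())
    if total == 0:
        return {viral_type: 0.0 for viral_type in read_counts}
    return {viral_type: n / total for viral_type, n in read_counts.items()}


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage with one decimal."""
    return round(100 * part / whole, 1)
```

For a sample with 75 reads supporting HPV16 and 25 supporting HPV56 (made-up counts), `type_fractions` gives 0.75 and 0.25, and `percent(10, 15)` reproduces the share of samples in which the integrated type was also the type with the highest load.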
Finally, it should be noted that the above examples serve only to illustrate the technical solutions of the present invention and do not limit its scope of protection. Although the present invention has been described in detail with reference to preferred examples, those of ordinary skill in the art will understand that the technical solutions of the present invention may be modified or replaced with equivalents without departing from their essence and scope.

Claims (10)

  1. A method for detecting viral types in the human genome, comprising the following steps:
    1) collecting the viral genomes of all types from a database, treating them as pseudo-chromosomes, and merging them with the chromosomes of the human genome to obtain a hybrid genome;
    2) extracting and sequencing DNA from a patient to obtain the patient genome, and aligning it to the hybrid genome of step 1) in a first alignment;
    3) tallying the non-human chromosomes in the alignment results of step 2) and, for each specific viral genome type aligned to, classifying the reads according to the aligned-length ratio and the similarity ratio of the first-alignment reads, the reads being filtered with the following formulas:
    $L_M \ge (L_M + L_S + L_H + L_I) \times 0.5$;
    $3 \times L_I + 2 \times L_D + L_{MIS} \le (L_M + L_D) \times 0.2$,
    where $L_M$ is the length of the read aligned to the specific viral type; $L_S$ and $L_H$ are the lengths at the two ends of the read that do not align to viral DNA; $L_I$ is the length of insertions within the read; $L_D$ is the length of deletions within the read; and $L_{MIS}$ is the number of single-base mismatches in the read;
    4) compiling the statistics of viral type and load from the reads that satisfy both formulas of step 3).
  2. The method of claim 1, wherein the alignment in step 2) uses the BWA-MEM algorithm.
  3. The method of claim 1, wherein step 2) further comprises removing PCR duplicates.
  4. The method of claim 3, wherein the PCR duplicates are removed with the Picard MarkDuplicates software.
  5. The method of claim 1, wherein in step 4), for paired-end sequencing reads, the statistics of viral type and load are compiled only when both reads of a pair satisfy both formulas of step 3).
  6. A method for detecting viral content in the human genome, comprising the following step: based on the statistics of viral type and load of any one of claims 1 to 5, performing relative quantification of the viral copy number from the alignment of an optional internal reference gene to the hybrid genome of claim 1, with the following quantification formula:
    Figure PCTCN2019124917-appb-100001
    where $CN_H$ is the copy number of the internal reference gene, 2 by default; $D_V$ is the effective cumulative depth of the viral genome, obtained by summing the number of times each single-base position of the viral genome is covered by all reads from step 3) of claim 1; $D_H$ is the effective cumulative depth of the internal reference gene, obtained in the same way by summing the per-base coverage counts of all reads aligned to the internal reference gene in the hybrid genome of claim 1; $C_V$ is the alignment coverage of the viral genome, that is, the fraction of the viral genome length made up of the single-base positions covered by all reads from step 3) of claim 1; $C_H$ is the alignment coverage of the internal reference gene, that is, the fraction of the internal reference gene length made up of the single-base positions covered by all reads aligned to the internal reference gene in the hybrid genome of claim 1; $L_V$ is the effective length of the viral genome targeted by the sequencing probes; and $L_H$ is the effective length of the internal reference gene targeted by the sequencing probes.
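    The cumulative depth ($D_V$, $D_H$) and alignment coverage ($C_V$, $C_H$) defined above can be computed from the aligned intervals of the reads, as in the minimal sketch below. The function name and the 0-based half-open interval convention are assumptions for illustration. The quantification formula itself is given only as a figure and is not reproduced here.

```python
def depth_and_coverage(intervals, region_len):
    """Return (cumulative_depth, coverage_fraction) for one region.

    `intervals` holds 0-based, half-open (start, end) aligned blocks on a
    region of `region_len` bases, for example a viral genome or an
    internal reference gene.
    """
    depth = [0] * region_len
    for start, end in intervals:
        for i in range(max(start, 0), min(end, region_len)):
            depth[i] += 1
    cumulative_depth = sum(depth)            # sum of per-base coverage counts
    covered = sum(1 for d in depth if d > 0)  # positions covered at least once
    return cumulative_depth, covered / region_len
```

    Two reads covering positions 0 to 9 and 5 to 14 of a 20-base region give a cumulative depth of 20 and a coverage of 0.75.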
  7. A method for detecting whether a virus has integrated into the human genome and at which sites, comprising the following steps:
    constructing a reference genome of human and the corresponding viral type according to the viral genome type detected by any one of claims 1 to 5;
    realigning all first-alignment reads to the reference genome; and
    for the alignment results of the specific viral type, detecting whether the virus has integrated and the integration sites on the basis of chimeric reads.
  8. The method of claim 7, comprising the following steps:
    S1. aligning all first-alignment reads to the human reference genome alone;
    S2. aligning all first-alignment reads to the reference genome of the specific viral type alone;
    S3. aligning all first-alignment reads to the hybrid reference genome of human and the corresponding viral type, and removing PCR duplicates from the alignment results with Picard MarkDuplicates;
    S4. using the results of steps S1 and S2, classifying the reads in the alignment results of step S3 into single-end chimeric reads, paired-end chimeric reads, and long-distance paired-end cross-region reads;
    S5. for the paired-end chimeric reads, merging each pair into a single read and performing a second alignment; for the single-end chimeric reads, performing the second alignment on each chimeric read directly;
    S6. filtering the reads in the alignment results of step S5;
    S7. clustering the reads retained after step S6 locally by their position in the human genome, keeping the sites supported by ≥3 reads, and annotating the gene location and function of each site; and
    S8. assembling the reads annotated in step S7, aligning the viral and human portions of the assembled sequences to the hybrid reference genome in a third alignment, and retaining the assembled sequences whose alignment is consistent with the BWA-MEM alignment of claim 2.
  9. The method of claim 8, wherein the reads filtered out in step S6 include reads for which:
    the result is inconsistent with the BWA-MEM alignment;
    the viral or human portion of the read is too short;
    the overlap between the viral and human alignments is too large a fraction of the read;
    the human portion of the read does not align uniquely; or
    the human portion of the read comes from a low-complexity repetitive DNA region.
  10. The method of claim 8, wherein in step S7 the gene location and function are annotated with ANNOVAR software; in step S8 the assembly is performed with IDBA-UD software; and the second alignment in step S5 and the third alignment in step S8 both use BLASTN software.
PCT/CN2019/124917 2019-12-10 2019-12-12 Method for accurately detecting dna viruses in human genome WO2021114186A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911264769.3A CN110951853B (en) 2019-12-10 2019-12-10 Method for accurately detecting DNA viruses in human genome
CN201911264769.3 2019-12-10

Publications (1)

Publication Number Publication Date
WO2021114186A1 true WO2021114186A1 (en) 2021-06-17

Family

ID=69980885

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/124917 WO2021114186A1 (en) 2019-12-10 2019-12-12 Method for accurately detecting dna viruses in human genome

Country Status (3)

Country Link
CN (1) CN110951853B (en)
AU (1) AU2020101909A4 (en)
WO (1) WO2021114186A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689912A (en) * 2020-12-14 2021-11-23 广东美格基因科技有限公司 Method and system for correcting microbial contrast result based on metagenome sequencing

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111584003B (en) * 2020-04-10 2022-05-10 中国人民解放军海军军医大学 Optimized detection method for virus sequence integration

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010130731A1 (en) * 2009-05-12 2010-11-18 Virco Bvba Hiv-1-c resistance monitoring
CN103261442A (en) * 2010-12-02 2013-08-21 深圳华大基因健康科技有限公司 Method and system for bioinformatics analysis of hpv precise typing
CN104762402A (en) * 2015-04-21 2015-07-08 广州定康信息科技有限公司 Method for rapidly detecting human genome single base mutation and micro-insertion deletion
CN110349629A (en) * 2019-06-20 2019-10-18 广州赛哲生物科技股份有限公司 Analysis method for detecting microorganisms by using metagenome or macrotranscriptome

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PODLAHA ONDREJ, WU GEORGE, DOWNIE BRYAN, RAMAMURTHY RAGHURAMAN, GAGGAR ANUJ, SUBRAMANIAN MANI, YE ZHISHEN, JIANG ZHAOSHI: "Genomic modeling of hepatitis B virus integration frequency in the human genome", PLOS ONE, vol. 14, no. 7, pages e0220376, XP055820125, DOI: 10.1371/journal.pone.0220376 *
QINGGUO WANG;PEILIN JIA;ZHONGMING ZHAO: "VERSE: a novel approach to detect virus integration in host genomes through reference genome customization", GENOME MEDICINE, vol. 7, no. 1, 20 January 2015 (2015-01-20), pages 2, XP021210912, ISSN: 1756-994X, DOI: 10.1186/s13073-015-0126-6 *

Also Published As

Publication number Publication date
CN110951853A (en) 2020-04-03
CN110951853B (en) 2021-03-30
AU2020101909A4 (en) 2020-09-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19955818

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19955818

Country of ref document: EP

Kind code of ref document: A1