CN110951853B

CN110951853B - Method for accurately detecting DNA viruses in human genome

Info

Publication number: CN110951853B
Application number: CN201911264769.3A
Authority: CN
Inventors: 胡争; 崔资凤; 许微
Original assignee: First Affiliated Hospital of Sun Yat Sen University
Current assignee: First Affiliated Hospital of Sun Yat Sen University
Priority date: 2019-12-10
Filing date: 2019-12-10
Publication date: 2021-03-30
Anticipated expiration: 2039-12-10
Also published as: CN110951853A; AU2020101909A4; WO2021114186A1

Abstract

The invention discloses a method for accurately detecting DNA viruses in the human genome. The method accurately evaluates the infection type and load of a double-stranded DNA virus and, at the same time, accurately and flexibly determines the type, integration site and resulting fusion sequence of a double-stranded DNA virus integrated into the human genome. The method can detect multiple different virus infections simultaneously; it can finely assign reads to different subtypes of the same virus and thereby determine the viral load; and it is applicable to read types from different library construction sources, making it generally applicable to NGS-based detection of viral etiology.

Description

Method for accurately detecting DNA viruses in human genome

Technical Field

The invention relates to the technical field of virus detection, in particular to a method for accurately detecting DNA viruses in human genomes.

Background

Tumors related to double-stranded DNA viruses are tumors that are closely associated with double-stranded DNA virus infection and arise when a series of biological effects, produced by the interaction between the virus and host cells after infection, trigger carcinogenic mechanisms. They are often accompanied by infection with high-risk oncogenic virus strains, insertion of high-risk oncogenic viral genomic DNA into human cellular DNA, and co-infection with multiple virus subtypes during tumor progression. Examples include human papillomavirus (HPV) in cervical cancer and head and neck tumors. Our previous studies found that DNA integration into the human genome is a common phenomenon for hepatitis B virus (HBV) and Epstein-Barr virus (EBV), and that DNA integration plays an essential role in carcinogenesis. Blocking integration has therefore become an important focus of research.

Taking HPV as an example, more than 200 HPV types have been discovered to date. HPV is a double-stranded DNA virus that specifically infects human skin and mucosal squamous epithelial cells. HPV infection is sexually transmitted; according to incomplete statistics, the genital HPV infection rate among sexually active young women is as high as 80%, and a woman can be infected with different HPV types at different periods of life and with several HPV types at the same time. Persistent infection with high-risk HPV types (15 types: 16, 18, 31, 33, 35, 39, 45, 51, 52, 56, 58, 59, 68, 73 and 82) is the most critical causative factor in the development of cervical cancer. When the cervical epithelium is damaged, HPV can pass through the epidermis at the damaged site, enter the basal layer, divide together with the basal stem cells, and begin to replicate in large numbers in the squamous cells above the basal layer; mature virus is released as the surface cells are shed. HPV generally exists in a free state in cervical epithelial cells, but its DNA can also integrate into human chromosomes. Integration of high-risk HPV into the host genome is one of the decisive factors in the development and progression of cervical cancer, and studies show that high-risk HPV integration can be detected in more than 90% of cervical cancers. Therefore, identifying the HPV infection type, load and integration status is of great significance for the precise prevention and treatment of cervical cancer.

At present, Hybrid Capture 2 (HC2) and Cervista, which are based on hybridization signal amplification, and Cobas 4800, which is based on real-time PCR, are the main methods used clinically to determine virus type and load. None of these methods covers all HPV types or avoids cross-reaction between different HPV types and, most importantly, none of them can determine whether HPV is integrated or detect the integration state. Meanwhile, the rapid development of second-generation sequencing technology has opened up new approaches to detecting virus type and integration. Whole-genome sequencing, whole-transcriptome sequencing and virus-specific capture (targeted) sequencing provide an opportunity to comprehensively detect virus type and state.

Disclosure of Invention

Based on the above problems, the present invention aims to overcome the disadvantages of the prior art and provide a method for accurately detecting DNA viruses in human genomes, which can accurately evaluate the infection type and load of double-stranded DNA viruses, and simultaneously accurately and flexibly judge the type, integration site and generated fusion sequence of the double-stranded DNA viruses integrated into human genomes.

In order to achieve the purpose, the technical scheme adopted by the invention comprises the following aspects:

In a first aspect, the present invention provides a method for detecting the virus type in the human genome, comprising the following steps:

1) collecting the genomes of all virus types from a database, treating them as pseudo-chromosomes, and merging them with the chromosomes of the human genome to obtain a mixed genome;

2) extracting and sequencing the DNA of a patient to obtain the genome sequence data of the patient, and aligning these data to the mixed genome obtained in step 1) for the first time (first alignment);

3) collecting the alignments to non-human chromosomes from the result of step 2) and, for each read aligned to the genome of a specific virus type, classifying the read according to the length ratio and the similarity ratio of its first alignment, wherein the reads are screened using the following formulas:

L_M ≥ (L_M + L_S + L_H + L_I) × 0.5;

3 × L_I + 2 × L_D + L_MIS ≤ (L_M + L_D) × 0.2,

wherein L_M denotes the length of the read aligned to the specific virus type; L_S and L_H denote the lengths of the segments at the two ends of the read with respect to the alignment to the viral DNA; L_I denotes the length of intermediate insertions on the read; L_D denotes the length of intermediate deletions on the read; and L_MIS denotes the length of single-base mismatches on the read;

4) counting the virus type and load from the reads that satisfy both formulas in step 3).
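The two screening inequalities of step 3) can be illustrated with a minimal Python sketch. It assumes, which the text does not state explicitly, that the subscripts M, S, H, I and D correspond to the summed lengths of the CIGAR operations of the same letters in a SAM alignment record, and that the mismatch count L_MIS is supplied by the caller. The function names are hypothetical. Integer arithmetic is used so that the 0.5 and 0.2 thresholds are evaluated exactly.

```python
import re

# One CIGAR element is a length followed by a single operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_lengths(cigar):
    """Sum the lengths of each CIGAR operation in one alignment."""
    totals = {}
    for length, op in CIGAR_RE.findall(cigar):
        totals[op] = totals.get(op, 0) + int(length)
    return totals


def passes_virus_filter(cigar, mismatches):
    """Apply both screening inequalities of step 3) to a single read.

    Assumption: L_M, L_S, L_H, L_I, L_D are the summed CIGAR lengths of
    M, S, H, I, D; `mismatches` plays the role of L_MIS.
    """
    t = cigar_lengths(cigar)
    l_m, l_s, l_h = t.get("M", 0), t.get("S", 0), t.get("H", 0)
    l_i, l_d = t.get("I", 0), t.get("D", 0)
    # L_M >= (L_M + L_S + L_H + L_I) * 0.5, multiplied through by 2.
    length_ok = 2 * l_m >= l_m + l_s + l_h + l_i
    # 3*L_I + 2*L_D + L_MIS <= (L_M + L_D) * 0.2, multiplied through by 5.
    similarity_ok = 5 * (3 * l_i + 2 * l_d + mismatches) <= l_m + l_d
    return length_ok and similarity_ok
```

Under these assumptions, a read with CIGAR `40M60S` fails the length condition because fewer than half of its bases are aligned, while a fully aligned 100 bp read with 2 mismatches passes both conditions.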

It should be noted that the detection method of the present invention can accurately detect the specific type and relative load of the double-stranded DNA virus infection, and is suitable for diseases related to the double-stranded DNA virus infection, such as cervical diseases, head and neck diseases, HBV-related liver diseases, EBV-related lymphatic system diseases, nasopharyngeal diseases, and gastric diseases.

Preferably, the alignment in step 2) is performed using the BWA-MEM algorithm.

Preferably, step 2) further comprises removing PCR duplicates. More preferably, the PCR duplicates are removed using Picard MarkDuplicates.

Preferably, in step 4), for paired-end sequencing reads, a read pair is included in the statistics of virus type and load when both reads of the pair satisfy the two formulas in step 3).
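The paired-end rule can be sketched as follows. This is only an illustration: the input layout (one tuple per read pair, with a flag per mate saying whether it satisfied both formulas) and the choice to count both reads of an accepted pair are assumptions, not details given in the text.

```python
from collections import Counter


def count_virus_load(pairs):
    """Count supporting reads per virus type for paired-end data.

    `pairs` is an iterable of (virus_type, mate1_passes, mate2_passes).
    A pair contributes only when both mates passed the screening formulas.
    Assumption: each accepted pair contributes two reads to the count.
    """
    counts = Counter()
    for virus_type, mate1_passes, mate2_passes in pairs:
        if mate1_passes and mate2_passes:
            counts[virus_type] += 2
    return counts
```

The resulting per-type read counts are the quantity that the later steps treat as an indirect measure of load.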

In a second aspect, the present invention provides a method for detecting the viral content in the human genome, comprising the following step: based on the above statistics of virus type and load, performing relative quantification of the virus copy number according to the alignment of an optional internal reference gene to the mixed genome, using the following quantification formula:

wherein CN_H is the copy number of the reference gene, 2 by default; D_V is the effective cumulative depth of the viral genome, obtained by accumulating, over the single-base sites of the viral genome, the number of times each site is covered by all reads of step 3) above; D_H is the effective cumulative depth of the internal reference gene, obtained in the same way by accumulating the number of times each single-base site is covered by all reads of the internal reference gene after alignment to the mixed genome; C_V is the alignment coverage of the viral genome, i.e. the length of the viral genome occupied by the single-base sites covered by all reads of step 3) above; C_H is the alignment coverage of the reference gene, i.e. the length of the reference gene occupied by the single-base sites covered by all of its reads aligned to the mixed genome of step 1); L_V is the effective length of the viral genome targeted by the sequencing probes; and L_H is the effective length of the reference gene targeted by the sequencing probes.
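The two per-sequence quantities defined above, cumulative depth (D) and alignment coverage (C), can be computed from read alignment intervals as in the sketch below. It covers only these two inputs and does not reproduce the quantification formula itself. The interval convention (0-based, half-open) and the function name are assumptions made for illustration.

```python
def depth_and_coverage(intervals, sequence_length):
    """Return (cumulative_depth, covered_length) for one reference sequence.

    `intervals` holds (start, end) alignment spans, 0-based and half-open.
    cumulative_depth sums, over all single-base sites, how many reads
    cover each site; covered_length counts sites covered at least once.
    """
    depth = [0] * sequence_length
    for start, end in intervals:
        # Clip each span to the sequence boundaries before counting.
        for pos in range(max(start, 0), min(end, sequence_length)):
            depth[pos] += 1
    cumulative_depth = sum(depth)
    covered_length = sum(1 for d in depth if d > 0)
    return cumulative_depth, covered_length
```

The same function can be applied to the viral genome (giving D_V and C_V) and to the internal reference gene (giving D_H and C_H).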

In a third aspect, the present invention provides a method for detecting the presence or absence of viral integration and integration sites in the human genome, comprising the steps of:

constructing a reference genome of human and the corresponding virus type according to the detected virus genome type;

re-aligning all reads of the first alignment to the reference genome; and

detecting whether the virus is integrated, and the integration site, from the type-specific alignment result based on the chimeric-read detection principle.

Preferably, the method comprises the steps of:

S1, aligning all reads of the first alignment independently to the human reference genome;

S2, aligning all reads of the first alignment independently to the reference genome of the specific virus type;

S3, aligning all reads of the first alignment to the mixed reference genome of human and the corresponding virus type, and removing PCR duplicates from the alignment result using Picard MarkDuplicates;

S4, combining the results of steps S1 and S2 to statistically classify the reads in the alignment result of step S3 into single-ended chimeric reads, double-ended chimeric reads and distant paired-end cross-region reads;

S5, for double-ended chimeric reads, merging the two reads into a single integrated read for a second alignment; for single-ended chimeric reads, performing the second alignment on the single chimeric read;

S6, filtering the reads in the alignment result of step S5;

S7, locally clustering all reads retained after the filtering in step S6 according to their positions on the human genome, retaining the sites supported by 3 or more reads, and annotating the gene position and function of these sites; and

S8, assembling the reads annotated in step S7, dividing the assembled sequences into viral and human parts, performing a third alignment against the mixed reference genome, and retaining the assembled sequences whose alignment results are consistent with the BWA-MEM alignment result of claim 2.
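The local clustering of step S7 can be sketched as a single pass over sorted human-genome positions. The text specifies only the minimum support of 3 reads; the clustering window (`window`, 500 bp here) and the representation of each read as a `(chromosome, position)` pair are hypothetical choices made for this illustration.

```python
def cluster_sites(read_positions, window=500, min_reads=3):
    """Group reads by human-genome position and keep well-supported sites.

    `read_positions` is a list of (chromosome, position) pairs. A read joins
    the current cluster when it lies on the same chromosome and within
    `window` bp of the cluster's last read. `window` is an assumed value.
    """
    clusters = []
    for chrom, pos in sorted(read_positions):
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == chrom and pos - last["end"] <= window:
            last["end"] = pos
            last["n_reads"] += 1
        else:
            clusters.append(
                {"chrom": chrom, "start": pos, "end": pos, "n_reads": 1}
            )
    # Retain only sites supported by at least `min_reads` reads.
    return [c for c in clusters if c["n_reads"] >= min_reads]
```

Each retained cluster would then be passed to the annotation step, for example as a genomic interval.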

Preferably, the reads removed by the filtering in step S6 include reads for which:

the result is inconsistent with the BWA-MEM alignment;

the viral or human part of the read is too short (less than or equal to 30 bp);

the overlap between the viral and human alignments on the read is too long (greater than or equal to 50% of the read length);

the alignment of the human part of the read is not unique; or

the human part of the read comes from a low-repeat region of DNA.
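The filtering criteria listed above can be expressed as a single predicate, sketched below. The argument layout is hypothetical: the human and viral parts are given as half-open intervals in read coordinates, and the three remaining criteria are passed in as precomputed flags. Treating a read as too short when either part is at most 30 bp is an interpretation of the text.

```python
def keep_chimeric_read(read_length, human_span, virus_span,
                       consistent_with_bwa, human_unique, human_in_repeat):
    """Return True if a chimeric read survives the step S6 filters.

    `human_span` and `virus_span` are (start, end) read coordinates,
    0-based and half-open. The boolean flags are assumed to come from
    earlier alignment and annotation steps.
    """
    human_len = human_span[1] - human_span[0]
    virus_len = virus_span[1] - virus_span[0]
    overlap = max(0, min(human_span[1], virus_span[1])
                  - max(human_span[0], virus_span[0]))
    if not consistent_with_bwa:
        return False
    if human_len <= 30 or virus_len <= 30:   # part too short
        return False
    if 2 * overlap >= read_length:           # overlap >= 50% of read length
        return False
    if not human_unique:                     # human part maps ambiguously
        return False
    if human_in_repeat:                      # human part in excluded region
        return False
    return True
```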

Preferably, the ANNOVAR software is used for annotating gene positions and functions in the step S7; in the step S8, IDBA-UD software is used for assembly; the second alignment in step S5 and the third alignment in S8 both use BLASTN software.

In conclusion, the beneficial effects of the invention are as follows:

the method of the invention can detect a plurality of different virus infections simultaneously;

the method can finely distinguish the read classification of different subtypes of the same virus, so as to judge the virus load;

the method is applicable to read types from different library construction sources, and is generally applicable to NGS-based detection of viral etiology;

the method can accurately detect the type and the integration site of the virus integrated into the human genome, and the specific integration sequence, and provides solid theoretical support for downstream verification.

Drawings

FIG. 1 is a schematic flow diagram of a method of detecting viral integration in the human genome according to the present invention;

FIG. 2 is a graph showing the relationship between the highest-load HPV type and the number of integration sites of the integrated HPV types in various HPV infection samples, wherein about 66.7% of the HPV types integrated in various HPV infection samples are the types with the highest viral load;

FIG. 3 is a graph showing the result of HPV typing in example 1, wherein the vertical axis represents the number of effective HPV alignments;

FIG. 4 is a graph showing the result of HPV typing in example 2, wherein the vertical axis represents the number of effective HPV alignments;

FIG. 5 is a graph showing the result of HPV typing in example 3, wherein the vertical axis represents the number of effective HPV alignments;

FIG. 6 is a graph showing the result of HPV typing in example 4, wherein the vertical axis represents the number of effective HPV alignments;

FIG. 7 is a graph showing the result of HPV typing in example 5, wherein the vertical axis represents the number of effective HPV alignments;

FIG. 8 is a graph showing the result of HPV typing in example 6, wherein the vertical axis represents the number of effective HPV alignments;

FIG. 9 is a graph of the read support counts for samples in which the type with the highest viral load is Type 1;

FIG. 10 is a graph of the read support counts for samples in which the type with the highest viral load is Type 2.

Detailed Description

In some embodiments, the invention provides a method for accurately detecting double-stranded DNA virus susceptibility polymorphism and load, virus integration breakpoint and human-virus genome fusion sequence, and the detection result based on the method guides virus-related tumor screening and treatment decision, so that the method is more accurate and efficient. The method can detect the main virus infection type with potential carcinogenic effect, judge the cancer risk through the occurrence of integration, guide the personalized screening strategy of related tumors, and provide an antiviral and antitumor targeted treatment scheme for cancer patients according to the number of virus integration sites and biological significance.

In some embodiments, the invention provides a method for calculating the infection type of a double-stranded DNA virus. The method is based on second-generation sequencing reads, and its best use scenario is virus capture sequencing. The method filters the alignment information of the sequencing reads, accurately selects the reads derived from viral DNA, removes the duplication bias that may be introduced during library construction, and counts the reads of each viral DNA type, which indirectly reflects the load of each infecting virus type. Specifically, the method comprises the following steps:

1) in the initial alignment, the genomes of all virus types collected from a database are treated as pseudo-chromosomes and merged with the chromosomes of the human genome to construct a mixed genome;

2) the alignment software uses the BWA-MEM algorithm, which supports local optimal alignment, and after alignment Picard MarkDuplicates is used to remove PCR duplicates;

3) the alignments to non-human chromosomes are collected, and the reads aligned to a specific virus type are accurately classified a second time according to the length ratio and the similarity ratio of the read alignment, using the following read screening formulas:

L_M ≥ (L_M + L_S + L_H + L_I) × 0.5

3 × L_I + 2 × L_D + L_MIS ≤ (L_M + L_D) × 0.2

wherein L_M denotes the length of the read aligned to the specific virus type; L_S and L_H denote the lengths of the two ends (larger fragments) of the read with respect to the alignment to the viral DNA; L_I denotes the length of the intermediate (small-fragment) insertions on the read; L_D denotes the length of the intermediate (small-fragment) deletions on the read; and L_MIS denotes the length of single-base mismatches on the read;

4) reads that satisfy the above two conditions (formulas) enter the statistics of virus type load; for paired-end sequencing reads, a pair enters the downstream statistics if both reads satisfy the above conditions (both formulas);

5) the above steps complete the preliminary statistics of virus type load; relative quantification of the virus copy number is then performed according to the alignment of an optional internal reference gene, using the following quantification formula:

wherein CN_H is the copy number of the reference gene, 2 by default; D_V is the effective cumulative depth of the viral genome, obtained by accumulating, over the single-base sites of the viral genome, the number of times each site is covered by all reads of step 3) above; D_H is the effective cumulative depth of the internal reference gene, obtained in the same way by accumulating the number of times each single-base site is covered by all reads of the internal reference gene after alignment to the mixed genome; C_V is the alignment coverage of the viral genome, i.e. the length of the viral genome occupied by the single-base sites covered by all reads of step 3) above; C_H is the alignment coverage of the internal reference gene, i.e. the length of the reference gene occupied by the single-base sites covered by all of its reads aligned to the mixed genome; L_V is the effective length of the viral genome targeted by the sequencing probes; and L_H is the effective length of the reference gene targeted by the sequencing probes.

The internal reference genes involved in the calculation of the copy number of the virus are suitable for fixed probes and sequencing reads adopted by different library construction detection systems, and include but are not limited to conserved genes of human genomes.

In some embodiments, the above algorithm is mainly applied to whole-genome sequencing, exome sequencing, whole-transcriptome sequencing and virus capture sequencing data of clinical patients with double-stranded DNA virus infection. By mining the DNA sequencing data, the method scans for the double-stranded DNA virus species and types present and identifies the DNA fusion site sequences for each virus type that reaches the detection level. The patient's virus-related carcinogenic risk is then judged comprehensively from the infection type, the infection abundance, the integration into the human genome and the significance of the corresponding integration sites, to guide clinical early-warning decisions and treatment plans. The main application scenarios are as follows:

1. determination of the infection type of double-stranded DNA viruses, including, for example, human papillomavirus (HPV), hepatitis B virus (HBV) and Epstein-Barr virus (EBV). These viruses have numerous types and variant strains, and mixed multi-type infections are common; with the detection method, the viral DNA types that reach the detection abundance in the DNA sequencing data can be detected and the abundance of each type counted;

2. prediction of the sites at which viral DNA is integrated into the human genome and interpretation of the significance of the corresponding human integration sites, to guide clinical intervention against viral integration and prevent progression of the corresponding tumor;

3. application to the genome of a tumor patient, performing pattern recognition on large-fragment structural variations to predict pan-cancer prognosis;

4. application to the genome of a tumor patient to predict the pan-cancer treatment response to anti-tumor drugs based on the synthetic lethality principle, such as those targeting POLQ and PARP1.

In some embodiments, reference genomes of human and the corresponding virus type are constructed specifically according to the virus genome types detected by the above algorithm, all reads are aligned again, and virus integration and integration sites are detected from the type-specific alignment result based on the chimeric-read detection principle.

The method specifically comprises the following steps:

1. aligning all reads separately to the human reference genome;

2. aligning all reads separately to the reference genome of the specific virus type;

3. aligning all reads to the mixed reference genome of human and the corresponding virus type, and removing PCR duplicates from the alignment result using Picard MarkDuplicates;

4. combining the results of steps 1 and 2 to statistically classify the reads in the alignment result of step 3 into single-ended chimeric reads, double-ended chimeric reads and distant paired-end cross-region reads (see FIG. 1);

5. processing single-ended and double-ended chimeric reads case by case: for double-ended chimeric reads, the two reads are merged into a single integrated read for a second alignment with BLASTN; for single-ended chimeric reads, the second BLASTN alignment is performed on the single chimeric read;

6. filtering the alignment result of step 5, the reads filtered out including those for which: the result is inconsistent with the BWA-MEM alignment; the viral or human part of the read is too short (less than or equal to 30 bp); the overlap between the viral and human alignments on the read is too long (greater than or equal to 50% of the read length); the alignment of the human part of the read is not unique; or the human part of the read comes from a low-repeat region of DNA;

7. locally clustering all reads retained after the filtering in step 6 according to their positions on the human genome, retaining the sites supported by 3 or more reads, and annotating the gene position and function of these sites using the ANNOVAR software;

8. assembling the reads from step 7 using the IDBA-UD software, dividing the assembled sequences into viral and human parts, performing a third BLASTN alignment, and retaining the assembled sequences whose alignment results are consistent with the BWA-MEM alignment.

In one embodiment, when multiple types of the same virus species are detected, risk assessment can be performed according to the virus integration sites obtained by the above calculation method to identify the virus type carrying the major oncogenic risk. The virus type with integration is consistent with the virus type with the highest viral load, with a concordance rate reaching 70% (see FIG. 2).

In one embodiment, the free DNA sample is obtained from a cell tissue in routine examination, and is suitable for, but not limited to, exfoliated cells of the cervix, punctured cells of the liver, lymph node biopsy, blood, saliva, and the like.

In some embodiments, if integration sites of a tumor-associated virus are found in a patient sample, the frequency of downstream clinical monitoring can be increased, or the patient can be switched to a more aggressive clinical treatment protocol; conversely, the frequency of clinical monitoring can be decreased, or the clinical treatment protocol can be de-escalated (see Examples 1-6).

In one embodiment, the virus types involved in the present invention may comprise one or more of double-stranded DNA viruses such as HPV, HBV and EBV; for a virus, any of a variety of different types of virus species may be included; for the same species of the same virus, probes designed for the entire genome may be considered, as well as probes designed for regions of the genome.

To better illustrate the objects, aspects and advantages of the present invention, the present invention will be further described with reference to the accompanying drawings and specific embodiments. The present invention is illustrated by the following examples of type infection distribution and integration site detection for cervical cancer, nasopharyngeal carcinoma and liver cancer samples, which are for illustrative purposes only and are not intended to be limiting. Unless otherwise specified, the experimental methods in the present invention are all conventional methods.

Example 1

One embodiment of the method for accurately detecting a DNA virus in a human genome of the present invention comprises the steps of:

For patient A with mild cervicitis, a portion of cervical tissue was taken and subjected to capture sequencing. The sequencing data were analyzed as follows.

The alignment information of the sequencing reads was filtered, the reads derived from viral DNA were accurately selected, the duplication bias that may be introduced during library construction was removed, and the reads of each viral DNA type were counted, which indirectly reflects the load of each infecting virus type. The specific steps are as follows:

(1) In the initial alignment, the genomes of all HPV types collected from the papillomavirus genome database PaVE were treated as pseudo-chromosomes and merged with the chromosomes of the human genome to construct a mixed genome.

(2) The alignment software used the BWA-MEM algorithm, which supports local optimal alignment, and after alignment Picard MarkDuplicates was used to remove PCR duplicates.

(3) The alignments to the HPV genomes were collected, and the reads aligned to a specific HPV type were accurately classified a second time according to the length ratio and the similarity ratio of the read alignment, using the following read screening formulas:

L_M ≥ (L_M + L_S + L_H + L_I) × 0.5;

3 × L_I + 2 × L_D + L_MIS ≤ (L_M + L_D) × 0.2.

(4) Reads that satisfied the above two conditions entered the statistics of virus type load; for paired-end sequencing reads, a pair entered the downstream statistics if both reads satisfied the above conditions.
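Step (1), merging viral genomes into the human reference as pseudo-chromosomes, can be sketched on FASTA text as follows. The `virus_` name prefix is a hypothetical convention introduced here so that alignments to non-human sequences can be recognized later in step (3); the source does not say how pseudo-chromosomes are named.

```python
def build_mixed_genome(human_fasta, virus_fasta, prefix="virus_"):
    """Concatenate two FASTA texts into one mixed reference.

    Viral sequence names are tagged with `prefix` (an assumed convention)
    so they can be told apart from human chromosomes after alignment.
    """
    lines = [line for line in human_fasta.splitlines() if line.strip()]
    for line in virus_fasta.splitlines():
        if not line.strip():
            continue
        if line.startswith(">"):
            # Rename the viral record, keeping the original identifier.
            line = ">" + prefix + line[1:]
        lines.append(line)
    return "\n".join(lines) + "\n"


def is_virus_contig(name, prefix="virus_"):
    """True when a reference name belongs to a viral pseudo-chromosome."""
    return name.startswith(prefix)
```

In practice the merged text would be written to a file and indexed by the aligner before the first alignment.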

Based on the detected viral genome types, two HPV types (HPV31 and HPV33) were found in the sample from patient A. Integration sites were detected separately for each of these two infection types. For HPV31, a mixed reference genome of human and HPV31 was constructed, all reads were aligned again, and the presence of viral integration and the integration sites were detected from the type-specific alignment result based on the chimeric-read detection principle.

The specific steps are as follows:

(1) All reads were aligned individually to the human reference genome.

(2) All reads were aligned individually to the HPV31 reference genome.

(3) All reads were aligned to the mixed reference genome of human and HPV31, and PCR duplicates were removed from the alignment result using Picard MarkDuplicates.

(4) The results of steps (1) and (2) were combined to statistically classify the reads in the alignment result of step (3) into single-ended chimeric reads, double-ended chimeric reads and distant paired-end cross-region reads (see FIG. 1).

(5) Single-ended and double-ended chimeric reads were processed case by case: for double-ended chimeric reads, the two reads were merged into a single integrated read for a second alignment with BLASTN; for single-ended chimeric reads, the second BLASTN alignment was performed on the single chimeric read.

(6) The alignment result of step (5) was filtered, the reads filtered out including those for which: the result was inconsistent with the BWA-MEM alignment; the viral or human part of the read was too short (less than or equal to 30 bp); the overlap between the viral and human alignments on the read was too long (greater than or equal to 50% of the read length); the alignment of the human part of the read was not unique; or the human part of the read came from a low-repeat region of DNA.

(7) All reads retained after the filtering in step (6) were locally clustered according to their positions on the human genome, the sites supported by 3 or more reads were retained, and the gene position and function of these sites were annotated using the ANNOVAR software.

(8) The reads from step (7) were assembled using the IDBA-UD software, the assembled sequences were divided into viral and human parts and subjected to a third BLASTN alignment, and the assembled sequences whose alignment results were consistent with the BWA-MEM alignment were retained.

The results showed that no integration site of HPV31 on the human genome was detected.

For HPV33, the same analysis was performed and two different integration sites were detected (see table 1 for results).

TABLE 1 HPV33 integration results

In summary, infection with two HPV types (HPV31 and HPV33, see FIG. 3) was found in patient A, and high-risk HPV33 was found to be integrated at two different sites in the human genome. Colposcopy can therefore be performed to avoid a missed diagnosis, which differs from the international guideline recommendation for patients who are not positive for HPV16 or HPV18.

Example 2

In patient B with mild cervicitis (for the detection method, see Example 1), infection with three HPV types (HPV16, HPV31 and HPV56) was found (see FIG. 4), but no HPV integration was found. Follow-up can be continued to avoid unnecessary colposcopy, which differs from the international guideline recommendation of colposcopy referral for HPV16-positive patients.

Example 3

In the follow-up of patient C with a low-grade cervical lesion (for the detection method, see Example 1), high-risk HPV16 infection was found (see FIG. 5), but no HPV integration was found. Follow-up can be continued to avoid unnecessary colposcopy, which differs from the international guideline recommendation of colposcopy referral for HPV16-positive patients.

Example 4

In the follow-up of patient D with a low-grade cervical lesion (for the detection method, see Example 1), persistent infection with multiple HPV types (HPV16 and HPV56) was found (see FIG. 6), and high-risk HPV56 was found to be integrated into the human genome (see Table 2). Colposcopy can be performed to avoid disease progression.

TABLE 2 HPV56 integration results

Example 5

In patient E with a high-grade cervical lesion (for the detection method, see Example 1), high-risk HPV16 infection was found (see FIG. 7), accompanied by 2 integration sites (see Table 3). Surgical treatment can be adopted to avoid progression to cervical cancer.

TABLE 3 HPV16 integration results

Example 6

In cervical cancer patient F (for the detection method, see Example 1), infection with multiple HPV types (HPV16 and HPV18) was found (see FIG. 8), and high-risk HPV18 was found to be integrated at two different sites in the human genome (see Table 4). The integration site is located near the human CHRAC1 gene, which can guide personalized clinical medication.

TABLE 4 HPV18 integration results

Example 7

112 nasopharyngeal carcinoma samples were collected from the First Affiliated Hospital of Sun Yat-sen University and used in one embodiment of the method of the present invention to detect the EBV infection type, comprising the following steps:

(1) In the initial alignment process, the genomes of the two EBV types, Type 1 and Type 2, were collected from the literature and the NCBI database, and the two EBV genomes were merged as pseudo-chromosomes with the chromosomes of the human genome to construct a mixed genome.

(2) Alignment was performed with the BWA-MEM algorithm, which supports local alignment, and after alignment Picard MarkDuplicates was used to remove PCR duplicates.

(3) The alignment results for the EBV genomes were counted, and the reads aligned to a specific virus type were accurately classified a second time according to the alignment length ratio and the similarity ratio, using the following read screening formulas:

L_M≥(L_M+L_S+L_H+L_I)×0.5;

3×L_I+2×L_D+L_MIS≤(L_M+L_D)×0.2.

The EBV type with the highest infection load and the number of reads supporting it were determined for each sample. The results are shown in FIGS. 9 and 10: in most samples, the type with the highest viral load was Type 1.
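Step (1) above, in which the viral genomes are appended to the human genome as pseudo-chromosomes, can be sketched as follows. This is a minimal illustration and not the inventors' implementation: the function name `build_mixed_genome`, the file paths, and the `virus_` header prefix are hypothetical, and the inputs are assumed to be plain (uncompressed) FASTA files.

```python
def build_mixed_genome(human_fasta, viral_fastas, out_fasta, prefix="virus_"):
    """Concatenate a human FASTA with viral FASTA files into one mixed genome.

    Each viral sequence header gets a prefix (a hypothetical convention) so
    that viral pseudo-chromosomes are easy to tell apart from human ones.
    Returns the list of viral pseudo-chromosome names that were written.
    """
    viral_names = []
    with open(out_fasta, "w") as out:
        # Copy the human chromosomes unchanged.
        with open(human_fasta) as fh:
            for line in fh:
                out.write(line if line.endswith("\n") else line + "\n")
        # Append every viral genome as an extra pseudo-chromosome.
        for path in viral_fastas:
            with open(path) as fh:
                for line in fh:
                    line = line.rstrip("\n")
                    if not line:
                        continue  # skip blank lines
                    if line.startswith(">"):
                        # Keep only the sequence ID, drop the free-text description.
                        name = prefix + line[1:].split()[0]
                        viral_names.append(name)
                        out.write(">" + name + "\n")
                    else:
                        out.write(line + "\n")
    return viral_names
```

The resulting file would then be indexed and used as the alignment reference, for example with `bwa index` followed by `bwa mem`.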

Example 8

In liver cancer patient G, liver cancer tissue and para-cancerous tissue were collected, captured, and sequenced separately, and the infection type and integration sites in each tissue were detected by the method of the invention. The specific detection steps for the cancer tissue are as follows:

The alignment information of the sequencing reads from the cancer tissue was filtered to accurately select reads originating from viral DNA, the duplication bias that may be introduced during library construction was removed, and the number of reads for each type of viral DNA was counted, which indirectly reflects the load of each infecting virus type. The specific steps are as follows:

(1) In the initial alignment process, 11 HBV genomes were collected from the literature and the NCBI database, and the HBV genomes of all collected types were merged as pseudo-chromosomes with the chromosomes of the human genome to construct a mixed genome.

(3) The alignment results for the HBV genomes were counted, and the reads aligned to a specific HBV type were accurately classified a second time according to the alignment length ratio and the similarity ratio, using the following read screening formulas:

L_M≥(L_M+L_S+L_H+L_I)×0.5;

3×L_I+2×L_D+L_MIS≤(L_M+L_D)×0.2.

Based on the detected viral genome types, infection with three HBV types (AB014381, AF090842, AB033554) was found in the cancer sample from patient G. Integration sites were detected separately for each of these three infection types. For AB014381, a mixed reference genome of human and the AB014381 virus was constructed, all reads were aligned again, and the presence of viral integration and the integration sites were detected from the type-specific alignment result based on the chimeric-read detection principle.

The method comprises the following specific steps:

(1) All reads were aligned separately to the human reference genome.

(2) All reads were aligned separately to the AB014381 virus reference genome.

(3) All reads were aligned to the mixed reference genome of human and the AB014381 virus, and PCR duplicates were removed from the alignment result using Picard MarkDuplicates.

(4) Combining the results of steps (1) and (2), the reads in the alignment result of step (3) were classified into single-end chimeric reads, paired-end chimeric reads, and distant paired-end cross-region reads (FIG. 1).
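Step (4) can be pictured with the simplified sketch below. The precise category definitions are given in FIG. 1 and are not reproduced in this section, so the rules used here are assumptions for illustration only: a read that aligns partly to a human chromosome and partly to a viral pseudo-chromosome is treated as a single-end chimeric read, and a pair whose two reads align to human and virus respectively is treated as a paired-end chimeric pair. The third category (distant paired-end cross-region reads) is not modelled, and the sketch uses only the mixed-genome alignment of step (3); how the results of steps (1) and (2) are combined is omitted. The `Segment` type and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One aligned piece of a read: reference name plus aligned length."""
    ref: str
    aligned_len: int


def origins(segments, viral_refs):
    """Return the set of origins ('human', 'virus') covered by a read's segments."""
    return {"virus" if s.ref in viral_refs else "human" for s in segments}


def classify_pair(read1, read2, viral_refs):
    """Classify a read pair from its alignments to the mixed reference genome.

    read1, read2: lists of Segment objects (primary plus supplementary pieces).
    viral_refs:   names of the viral pseudo-chromosomes.
    """
    o1, o2 = origins(read1, viral_refs), origins(read2, viral_refs)
    # Assumed rule: a single read covering both human and viral sequence
    # spans a junction.
    if o1 == {"human", "virus"} or o2 == {"human", "virus"}:
        return "single-end chimeric"
    # Assumed rule: one read entirely human, the other entirely viral.
    if (o1, o2) in (({"human"}, {"virus"}), ({"virus"}, {"human"})):
        return "paired-end chimeric"
    return "not chimeric"


viral = {"AB014381"}
print(classify_pair([Segment("chr8", 90), Segment("AB014381", 60)],
                    [Segment("chr8", 150)], viral))
```

In practice the segments would come from the primary and supplementary alignment records of each read, and further filtering would be needed before a junction is reported.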

For the other two infection types detected in the cancer tissue, AF090842 and AB033554, the integration site detection procedure described above was repeated. In the end, integration sites were detected only for the AB014381 virus, 3 in total (as shown in Table 5 below).

Infection types were detected in the para-cancerous tissue in the same manner as described above, and infection with the same three HBV types (AB014381, AF090842, AB033554) was found. Integration sites were detected separately for each of these three infection types. In the end, 2 integration sites were detected for the AB014381 virus (as shown in Table 5 below).

TABLE 5 AB014381 Virus integration site

Example 9

Cervical brush samples were collected from women attending the cervical screening clinic of the First Affiliated Hospital of Sun Yat-sen University and preserved in BD SurePath LBC cell preservation solution. Genomic DNA was extracted with the EasyPure Genomic DNA Kit (TransGen Biotech, Beijing) and fragmented with a Bioruptor Pico sonicator. Adapters were then added and the product was purified to prepare a DNA library, which was hybridized with HPV probe DNA and captured with magnetic beads. The captured fragments were sequenced by high-throughput paired-end sequencing (PE150), and the sequencing data were analyzed by the method of the invention as follows:

(1) In the initial alignment process, all HPV genome types collected from the papillomavirus genome database PaVE were merged as pseudo-chromosomes with the chromosomes of the human genome to construct a mixed genome.

(3) The alignment results for the HPV genomes were counted, and the reads aligned to a specific virus type were accurately classified a second time according to the alignment length ratio and the similarity ratio, using the following read screening formulas:

L_M≥(L_M+L_S+L_H+L_I)×0.5；

3×L_I+2×L_D+L_MIS≤(L_M+L_D)×0.2，

wherein L_M denotes the length of the read aligned to the specific virus type; L_S and L_H denote the lengths at the two ends of the read (larger fragments) that are not aligned to the viral DNA; L_I denotes the length of insertions in the middle of the read (small fragments); L_D denotes the length of deletions in the middle of the read (small fragments); and L_MIS denotes the length of single-base mismatches on the read.
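The two screening inequalities can be checked per alignment roughly as follows. This is a sketch under stated assumptions rather than the patented implementation: it assumes that L_M, L_I, L_D, L_S and L_H correspond to the total lengths of the SAM CIGAR operations M, I, D, S and H, and that L_MIS can be derived from the SAM `NM` edit distance as NM minus the inserted and deleted bases. The function names and the example alignments are hypothetical.

```python
import re
from collections import Counter

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_lengths(cigar):
    """Sum the length of each CIGAR operation, e.g. '30S100M' -> {'S': 30, 'M': 100}."""
    totals = {}
    for length, op in CIGAR_RE.findall(cigar):
        totals[op] = totals.get(op, 0) + int(length)
    return totals


def passes_screen(cigar, nm):
    """Return True if an alignment satisfies both screening inequalities.

    cigar: SAM CIGAR string of the read aligned to a viral pseudo-chromosome.
    nm:    SAM NM tag (edit distance = mismatches + inserted + deleted bases).
    """
    t = cigar_lengths(cigar)
    # Assumption: '=' and 'X' count as aligned bases alongside 'M'.
    l_m = t.get("M", 0) + t.get("=", 0) + t.get("X", 0)
    l_s, l_h = t.get("S", 0), t.get("H", 0)
    l_i, l_d = t.get("I", 0), t.get("D", 0)
    l_mis = max(nm - l_i - l_d, 0)

    length_ok = l_m >= (l_m + l_s + l_h + l_i) * 0.5
    similarity_ok = 3 * l_i + 2 * l_d + l_mis <= (l_m + l_d) * 0.2
    return length_ok and similarity_ok


def count_types(alignments):
    """Tally read support per virus type for alignments that pass the screen.

    alignments: iterable of (virus_type, cigar, nm) tuples.
    """
    return Counter(v for v, cigar, nm in alignments if passes_screen(cigar, nm))


# Hypothetical alignments: the second is mostly clipped and is rejected.
example = [("HPV16", "150M", 1), ("HPV16", "100S50M", 0), ("HPV56", "150M", 0)]
print(count_types(example))
```

The per-type counts returned by `count_types` play the role of the read support numbers used to rank infection types by load.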

According to the detected viral genome types, a type-specific reference genome of human and the corresponding virus type was constructed, all reads were aligned again, and the presence of viral integration and the integration sites were detected from the type-specific alignment result based on the chimeric-read detection principle.

The method comprises the following specific steps:

(1) All reads were aligned separately to the human reference genome.

(2) All reads were aligned separately to the reference genome of the specific virus type.

(3) All reads were aligned to the mixed reference genome of human and the corresponding virus type, and PCR duplicates were removed from the alignment result using Picard MarkDuplicates.

The infection types and the number of integration sites were counted for each sample. Samples with multiple HPV infections and with integration sites were selected, 15 cases in total. For each sample, the proportion of reads supporting each infection type relative to the total viral reads was plotted as a stacked bar chart, as was the number of integration sites for each infection type.

The results are shown in FIG. 2, where the horizontal axis represents the sample name and different colors represent different infection types. The lower panel shows the proportion of reads for each infection type relative to the total viral reads in each sample, and the upper panel shows the number of integration sites for each infection type in each sample. As can be seen from FIG. 2, among the 15 samples with multiple HPV infections, only one HPV type was integrated in 11 samples, and in 10 samples (66.7% of all samples) the integrated HPV type was the type with the highest viral load.
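The quantities plotted in FIG. 2 and the 66.7% figure follow from simple per-sample arithmetic, sketched below. The read counts and site counts are made-up toy values and the function names are hypothetical; only the 10-of-15 proportion comes from the text.

```python
def read_fractions(read_counts):
    """Convert per-type read support numbers into fractions of total viral reads."""
    total = sum(read_counts.values())
    if total == 0:
        return {}
    return {hpv_type: n / total for hpv_type, n in read_counts.items()}


def integrated_type_is_dominant(read_counts, integration_sites):
    """True if exactly one type is integrated and it has the highest read support."""
    integrated = [t for t, n in integration_sites.items() if n > 0]
    if len(integrated) != 1:
        return False
    dominant = max(read_counts, key=read_counts.get)
    return integrated[0] == dominant


# Toy sample (hypothetical numbers): HPV16 dominates and is the only integrated type.
sample_reads = {"HPV16": 800, "HPV56": 200}
sample_sites = {"HPV16": 2, "HPV56": 0}
print(read_fractions(sample_reads))
print(integrated_type_is_dominant(sample_reads, sample_sites))

# Proportion reported in the text: 10 of 15 samples.
print(f"{10 / 15:.1%}")
```

Applying `integrated_type_is_dominant` to every sample and dividing the number of `True` results by the number of samples gives the kind of proportion quoted above.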

Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and not for limiting the protection scope of the present invention, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A method for detecting the presence or absence of viral integration and integration sites in the human genome for non-diagnostic purposes comprising the steps of:

L_M≥(L_M+L_S+L_H+L_I)×0.5；

3×L_I+2×L_D+L_MIS≤(L_M+L_D)×0.2，

wherein L_M denotes the length of the read aligned to the specific virus type; L_S and L_H denote the lengths at the two ends of the read that are not aligned to the viral DNA; L_I denotes the length of insertions in the middle of the read; L_D denotes the length of deletions in the middle of the read; and L_MIS denotes the length of single-base mismatches on the read;

4) counting the types and the loads of the reads which meet the two formulas in the step 3) to obtain the type of the virus in the human genome;

5) constructing reference genomes of the human and the corresponding virus types according to the virus types in the human genome detected in the step 4);

6) re-aligning each of all first alignment reads to the reference genome; and

2. The method of claim 1, wherein the step 2) is performed using a BWA-MEM algorithm.

3. The method of claim 1, wherein the step 2) further comprises removing PCR duplicates.

4. The method of claim 3, wherein the PCR duplicates are removed using the software Picard MarkDuplicates.

5. The method of claim 1, wherein in step 4), for paired-end sequencing reads, the virus type and load statistics are performed only when both reads of a pair satisfy the two formulas in step 3).

6. The method of claim 1, comprising the steps of:

S6, filtering the reads in the alignment result of step S5;

7. The method of claim 6, wherein the reads filtered out in step S6 include reads for which:

the alignment result is inconsistent with that of BWA-MEM;

the viral portion or the human portion of the read is too short;

the overlap between the viral and human portions of the read is too long;

the alignment of the human portion of the read is not unique; or

the human portion of the read is derived from a low-repeat region of DNA.

8. The method of claim 6, wherein the annotation of gene location and function is performed using ANNOVAR software in step S7; in the step S8, IDBA-UD software is used for assembly; the second alignment in step S5 and the third alignment in S8 both use BLASTN software.

9. A method for detecting the viral content of a human genome for non-diagnostic purposes, comprising the steps of:

L_M≥(L_M+L_S+L_H+L_I)×0.5；

3×L_I+2×L_D+L_MIS≤(L_M+L_D)×0.2，

4) counting the types and the loads of the reads which meet the two formulas in the step 3) to obtain the type of the virus in the human genome;

5) based on the statistics of the virus types and loads in the step 4), performing relative quantification of the viral copy number according to the alignment of a selectable reference gene to the mixed genome, wherein the quantification formula is as follows:

wherein CN_H is the copy number of the reference gene, 2 by default; D_V is the effective cumulative depth of the viral genome, obtained by cumulatively counting the single-base coverage of all reads of step 3) of claim 1; D_H is the effective cumulative depth of the reference gene, obtained in the same manner by cumulatively counting the single-base coverage of all reads aligned to the reference gene after alignment to the mixed genome of claim 1; C_V is the alignment coverage of the viral genome, i.e., the length of the viral genome covered by the single-base sites of all reads of step 3) of claim 1; C_H is the alignment coverage of the reference gene, i.e., the length of the reference gene covered by the single-base sites of all reads aligned to the reference gene in the mixed genome of claim 1; L_V is the effective length of the viral genome targeted by the sequencing probes; and L_H is the effective length of the reference gene targeted by the sequencing probes.