WO2021114186A1

WO2021114186A1 - Method for accurately detecting DNA viruses in human genome

Info

Publication number: WO2021114186A1
Application number: PCT/CN2019/124917
Authority: WO
Inventors: 胡争; 崔资凤; 许微
Original assignee: 中山大学附属第一医院
Priority date: 2019-12-10
Filing date: 2019-12-12
Publication date: 2021-06-17
Also published as: CN110951853A; CN110951853B; AU2020101909A4

Abstract

A method for accurately detecting DNA viruses in the human genome. The method is an analysis method that accurately evaluates the infection type and load of double-stranded DNA viruses and accurately and flexibly determines the types of double-stranded DNA viruses integrated into the human genome, their integration sites, and the resulting fusion sequences. The method can detect multiple different virus infections simultaneously; it can finely distinguish and classify reads from different subtypes of the same virus and thereby determine the viral load; and it is suitable for reads from different library construction sources, making it broadly applicable to NGS-based detection of viral pathogens.

Description

A Method for Precise Detection of DNA Viruses in Human Genome

Technical field

The invention relates to the technical field of virus detection, in particular to a method for accurately detecting DNA viruses in the human genome.

Background art

Double-stranded DNA virus-associated tumors are tumors that are closely related to double-stranded DNA virus infection. They arise from a series of biological effects produced when the virus interacts with host cells and triggers carcinogenic mechanisms, and they are often accompanied by infection with high-risk oncogenic virus strains, insertion of the genomic DNA of the high-risk oncogenic virus into human cellular DNA, and co-infection with multiple virus subtypes during tumor progression. Examples include human papillomavirus (HPV) in cervical cancer and head and neck tumors; our previous research found that hepatitis B virus (HBV) and Epstein-Barr virus (EBV) DNA is also commonly integrated into the human genome. DNA integration plays an essential role in the carcinogenic process, so blocking integration has become a focus of research.

Taking HPV as an example, more than 200 HPV types have been discovered to date. They are double-stranded DNA viruses that specifically infect the squamous epithelial cells of human skin and mucosa. HPV infection is a sexually transmitted disease. According to incomplete data, the reproductive tract HPV infection rate among sexually active young women is as high as 80%; women may be infected with different HPV types at different periods of their lives and may be infected with multiple HPV types at the same time. Persistent infection with high-risk HPV (types 16, 18, 31, 33, 35, 39, 45, 51, 52, 56, 58, 59, 68, 73, and 82) is the most critical factor leading to the development of cervical cancer. When the cervical epithelium is damaged, HPV can pass through the epidermis at the damaged site, enter the basal layer, divide along with the basal stem cells, and begin to replicate in large numbers in the squamous cells above the basal layer; mature virus is released when the surface cells are shed. HPV generally exists in a free state in cervical epithelial cells, but its DNA can also be integrated into human chromosomes. Integration of high-risk HPV into the host genome is one of the decisive factors in the occurrence and development of cervical cancer; studies have shown that integration of high-risk HPV can be detected in more than 90% of cervical cancers. Therefore, identifying the HPV infection type, the viral load, and whether integration has occurred is of great significance for the precise prevention and treatment of cervical cancer.

At present, clinical detection of virus type and load mainly relies on Hybrid Capture 2 (HC2) and Cervista, which are based on hybridization signal amplification, and on Cobas 4800, which is based on real-time PCR. None of these methods covers all HPV types, and none can avoid cross-reaction between different HPV types. Most importantly, none of them can determine whether HPV has integrated or where it has integrated. Meanwhile, the rapid development of next-generation sequencing technology has created new methods for detecting virus type and integration: whole-genome sequencing, whole-transcriptome sequencing, and virus-specific capture (targeted) sequencing provide opportunities for comprehensive detection of virus type and status.

Summary of the invention

Based on the above problems, the purpose of the present invention is to overcome the shortcomings of the above-mentioned prior art and provide a method for accurately detecting DNA viruses in the human genome: an analysis method that can accurately assess the type and load of double-stranded DNA viruses and, at the same time, accurately and flexibly determine the type of double-stranded DNA virus integrated into the human genome, the integration site, and the resulting fusion sequence.

In order to achieve the above objectives, the technical solutions adopted by the present invention include the following aspects:

In the first aspect, the present invention provides a method for detecting virus types in the human genome, including the following steps:

1) Collect the genomes of all types of the virus from a database, treat them as pseudo-chromosomes, and merge them with the chromosomes of the human genome to obtain a mixed genome;

2) Extract the patient's DNA and sequence it to obtain sequencing reads, and align the reads to the mixed genome obtained in step 1) for the first time;

3) Compile statistics on the alignments to non-human chromosomes from step 2). For reads aligned to a specific type of virus genome, classify the reads according to the length ratio and similarity ratio of the first alignment, filtering the reads with the following formulas:

L_M ≥ (L_M + L_S + L_H + L_I) × 0.5;

3 × L_I + 2 × L_D + L_MIS ≤ (L_M + L_D) × 0.2,

where L_M is the length of the read aligned to the specific type of virus; L_S and L_H are the lengths at the two ends of the read that do not align to the viral DNA; L_I is the length of insertions in the middle of the read; L_D is the length of deletions in the middle of the read; and L_MIS is the length of single-base mismatches on the read;

4) Compile the virus type and load statistics from the reads that satisfy both formulas in step 3).
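The two filter conditions in step 3) can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation. It assumes that the lengths are read from a SAM-style CIGAR string, that L_S and L_H correspond to soft-clipped (S) and hard-clipped (H) bases, and that the mismatch length L_MIS can be approximated as the NM edit distance minus the inserted and deleted bases. The function names are hypothetical.

```python
import re

# One CIGAR token: a length followed by an operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_lengths(cigar):
    """Sum the length of each CIGAR operation, e.g. '20S80M2D50M'."""
    totals = {}
    for length, op in _CIGAR_OP.findall(cigar):
        totals[op] = totals.get(op, 0) + int(length)
    return totals


def passes_viral_filter(cigar, nm):
    """Apply the two read-filter conditions to one viral alignment.

    cigar: CIGAR string of the alignment to a viral pseudo-chromosome.
    nm:    NM tag (edit distance), assumed to count mismatched,
           inserted and deleted bases together.
    """
    t = cigar_lengths(cigar)
    l_m = t.get("M", 0) + t.get("=", 0) + t.get("X", 0)  # aligned bases
    l_s = t.get("S", 0)  # assumption: L_S = soft-clipped bases
    l_h = t.get("H", 0)  # assumption: L_H = hard-clipped bases
    l_i = t.get("I", 0)
    l_d = t.get("D", 0)
    l_mis = max(nm - l_i - l_d, 0)  # assumption: mismatches = NM - indel bases

    long_enough = l_m >= (l_m + l_s + l_h + l_i) * 0.5
    similar_enough = 3 * l_i + 2 * l_d + l_mis <= (l_m + l_d) * 0.2
    return long_enough and similar_enough
```

For example, `passes_viral_filter("20S80M2D50M", 6)` evaluates both conditions for a read with 130 aligned bases, 20 clipped bases, a 2-base deletion and 4 mismatches.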

It should be noted that the detection method of the present invention can accurately detect the specific type and relative load of double-stranded DNA virus infection, and is suitable for, but not limited to, diseases related to double-stranded DNA virus infection, such as HPV-related cervical diseases and head and neck diseases, HBV-related liver diseases, and EBV-related lymphatic system diseases, nasopharyngeal diseases, and gastric diseases.

Preferably, the BWA-MEM algorithm is used for the alignment in step 2).

Preferably, step 2) further includes removing PCR duplicates. More preferably, Picard MarkDuplicates is used to remove the PCR duplicates.

Preferably, in step 4), for paired-end sequencing reads, a pair enters the virus type and load statistics only when both reads satisfy the two formulas in step 3).
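The paired-end rule and the per-type counting of step 4) can be illustrated with a short sketch. It assumes that each read pair has already been assigned to one viral type and that each mate has already been tested against the two formulas. The tuple layout and the function name are hypothetical, and counting pairs rather than individual reads is a choice made here for illustration.

```python
from collections import Counter


def count_pairs_per_type(pairs):
    """Count read pairs per viral type.

    pairs: iterable of (virus_type, mate1_ok, mate2_ok) tuples, where the
           booleans say whether each mate passed the step 3) filter.
    A pair is counted only when both mates passed.
    """
    counts = Counter()
    for virus_type, mate1_ok, mate2_ok in pairs:
        if mate1_ok and mate2_ok:
            counts[virus_type] += 1
    return counts


# Hypothetical input: four pairs assigned to two HPV types.
example = [
    ("HPV31", True, True),
    ("HPV31", True, False),  # dropped: only one mate passed
    ("HPV33", True, True),
    ("HPV33", True, True),
]
print(count_pairs_per_type(example))
```

The resulting counts per type are the quantity that the text describes as indirectly reflecting the load of each infecting type.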

In a second aspect, the present invention provides a method for detecting the virus content in the human genome, which includes the following steps: based on the above virus type and load statistics, perform relative quantification of the virus copy number according to the alignment of an optional internal reference gene to the above mixed genome. The quantification formula is as follows:

where CN_H is the copy number of the internal reference gene (default 2); D_V is the effective cumulative depth of the viral genome, obtained by summing the number of times all reads from step 3) cover each single-base site of the viral genome; D_H is the effective cumulative depth of the internal reference gene, obtained in the same way by summing the single-base site coverage of all reads aligned to the internal reference gene in the above mixed genome; C_V is the alignment coverage of the viral genome, i.e. the proportion of the viral genome length occupied by the single-base sites covered by the reads from step 3); C_H is the alignment coverage of the internal reference gene, i.e. the proportion of the internal reference gene length occupied by the single-base sites covered by reads aligned to the mixed genome of step 1); L_V is the effective length of the viral genome covered by the sequencing probes; and L_H is the effective length of the internal reference gene covered by the sequencing probes.
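The two per-target quantities defined above, the cumulative depth (D_V or D_H) and the alignment coverage (C_V or C_H), can be computed from aligned read spans as in the sketch below. It assumes that each retained read is represented by a 0-based, end-exclusive interval on the target sequence; the function name is hypothetical. The sketch stops at D and C and does not attempt the final copy-number formula, which is not given in this text.

```python
def depth_and_coverage(intervals, target_length):
    """Return (cumulative depth, coverage fraction) for one target sequence.

    intervals:     list of (start, end) aligned spans, 0-based, end-exclusive.
    target_length: length of the viral genome or internal reference gene.
    """
    per_base = [0] * target_length
    for start, end in intervals:
        # Clamp each span to the target so stray coordinates are ignored.
        for pos in range(max(start, 0), min(end, target_length)):
            per_base[pos] += 1

    cumulative_depth = sum(per_base)            # D: total base-level coverage
    covered_sites = sum(1 for d in per_base if d > 0)
    coverage = covered_sites / target_length    # C: fraction of sites covered
    return cumulative_depth, coverage
```

The same function would be applied once to the viral genome and once to the internal reference gene, so that both targets are measured in the same manner.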

In the third aspect, the present invention provides a method for detecting virus integration and integration sites in the human genome, which includes the following steps:

Construct a reference genome of human and the corresponding virus type according to the virus genome types detected above;

Align all reads from the first alignment to this reference genome again; and

For the alignment results of a specific virus type, detect virus integration and integration sites based on the detection principle of chimera reads.

Preferably, the method includes the following steps:

S1. Align all reads from the first alignment individually to the human reference genome;

S2. Align all reads from the first alignment individually to the reference genome of the specific virus type;

S3. Align all reads from the first alignment to the mixed reference genome of human and the corresponding virus type, and use Picard MarkDuplicates to remove PCR duplicates from the alignment results;

S4. Combining the results of steps S1 and S2, classify the reads in the alignment results of step S3 into single-ended chimera reads, double-ended chimera reads, and long-distance double-ended cross-regional reads;

S5. For double-ended chimera reads, merge the pair into a single read for a second alignment; for single-ended chimera reads, perform the second alignment on the chimeric single read;

S6. Filter the reads in the alignment results of step S5;

S7. Locally cluster all reads retained after the filtering in step S6 according to their positions on the human genome, retain the sites supported by ≥ 3 reads, and annotate the sites with gene position and function; and

S8. Assemble the reads annotated in step S7, split the assembled sequences into viral and human parts, and align them to the mixed reference genome for a third time; the assembled sequences whose alignment results are consistent with the BWA-MEM alignment results are retained.

Preferably, the reads filtered out in step S6 include reads for which:

The result is inconsistent with the BWA-MEM alignment;

The viral or human portion of the read is too short (≤ 30 bp);

The overlap between the viral and human alignments is too long (≥ 50% of the read length);

The alignment of the human portion of the read is not unique; or

The human portion of the read comes from a low-complexity, repetitive region of DNA.
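The five exclusion criteria of step S6 can be expressed as a single predicate, sketched below. The record type and its field names are hypothetical, and the sketch assumes that the per-read facts (portion lengths, overlap, uniqueness, repeat status, agreement with BWA-MEM) have already been derived from the BLASTN and BWA-MEM results. Treating "too short" as applying to either portion of the read is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class ChimeraAlignment:
    read_length: int
    viral_length: int          # bases aligned to the virus
    human_length: int          # bases aligned to the human genome
    overlap_length: int        # bases claimed by both alignments
    consistent_with_bwa: bool  # second alignment agrees with BWA-MEM
    human_unique: bool         # human portion has a single best hit
    human_in_repeat: bool      # human portion flagged as repetitive upstream


def keep_chimera(a):
    """Return True if the read survives all five exclusion criteria."""
    if not a.consistent_with_bwa:
        return False
    if a.viral_length <= 30 or a.human_length <= 30:  # portion too short
        return False
    if a.overlap_length >= 0.5 * a.read_length:       # overlap too long
        return False
    if not a.human_unique:
        return False
    if a.human_in_repeat:
        return False
    return True
```

Keeping the criteria in one function makes the thresholds (30 bp and 50% of the read length) easy to audit against the text.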

Preferably, in step S7, ANNOVAR software is used to annotate gene position and function; in step S8, IDBA-UD software is used for assembly; and both the second alignment in step S5 and the third alignment in step S8 use BLASTN software.
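The local clustering in step S7 (group reads by their position on the human genome and keep sites supported by at least 3 reads) can be sketched as follows. The text does not state how close two positions must be to belong to the same site, so the `window` parameter and its default of 10 bp are assumptions made purely for illustration, as is the function name.

```python
from collections import defaultdict


def cluster_breakpoints(reads, window=10, min_reads=3):
    """Cluster human-side positions and keep well-supported sites.

    reads:     list of (chrom, pos) pairs, one per retained read.
    window:    hypothetical maximum gap (bp) between neighbouring positions
               in one cluster; not specified in the source text.
    min_reads: minimum supporting reads per site (>= 3 in the text).
    Returns a list of (chrom, first_pos, last_pos, n_reads).
    """
    by_chrom = defaultdict(list)
    for chrom, pos in reads:
        by_chrom[chrom].append(pos)

    sites = []
    for chrom in sorted(by_chrom):
        positions = sorted(by_chrom[chrom])
        cluster = [positions[0]]
        for pos in positions[1:]:
            if pos - cluster[-1] <= window:
                cluster.append(pos)
            else:
                if len(cluster) >= min_reads:
                    sites.append((chrom, cluster[0], cluster[-1], len(cluster)))
                cluster = [pos]
        if len(cluster) >= min_reads:
            sites.append((chrom, cluster[0], cluster[-1], len(cluster)))
    return sites
```

The retained sites would then be passed to an annotation tool such as ANNOVAR, which is outside the scope of this sketch.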

In summary, the beneficial effects of the present invention are:

The method of the present invention can simultaneously detect multiple different virus infections;

The method of the present invention can finely distinguish and classify reads from different subtypes of the same virus, and thereby determine the viral load;

The method of the present invention is suitable for reads from different library construction sources and is broadly applicable to NGS-based detection of viral pathogens;

The method of the present invention can accurately detect the type of virus integrated into the human genome, the integration site, and the specific integration sequence, and provides solid theoretical support for downstream verification.

Description of the drawings

Figure 1 is a schematic flow chart of the method for detecting virus integration in the human genome in the present invention;

Figure 2 is a graph showing, for samples with multiple HPV infections, the relationship between the HPV type with the highest viral load and the number of integration sites of the integrated HPV types; it shows that in about 66.7% of the samples infected with multiple HPV types, the integrated HPV type is the type with the highest viral load;

Figure 3 is a graph of the HPV typing results in Example 1, where the vertical axis is the number of valid HPV-aligned reads;

Figure 4 is a graph of the HPV typing results in Example 2, where the vertical axis is the number of valid HPV-aligned reads;

Figure 5 is a graph of the HPV typing results in Example 3, where the vertical axis is the number of valid HPV-aligned reads;

Figure 6 is a graph of the HPV typing results in Example 4, where the vertical axis is the number of valid HPV-aligned reads;

Figure 7 is a graph of the HPV typing results in Example 5, where the vertical axis is the number of valid HPV-aligned reads;

Figure 8 is a graph of the HPV typing results in Example 6, where the vertical axis is the number of valid HPV-aligned reads;

Figure 9 is a graph showing the number of supporting reads for the samples in which Type1 has the highest viral load;

Figure 10 is a graph showing the number of supporting reads for the samples in which Type2 has the highest viral load.

Detailed description

In some embodiments, the present invention provides a method for accurately detecting the multiple infection types and load of double-stranded DNA viruses, viral integration breakpoints, and human-virus genome fusion sequences, so that screening and treatment decisions for virus-related tumors can be guided by the detection results more accurately and efficiently. The method of the present invention can detect the main viral infection types with potential carcinogenic effects, assess cancer risk according to whether integration has occurred, guide personalized screening strategies for the related tumors, and, according to the number and biological significance of the virus integration sites, provide antiviral and anti-tumor targeted therapy options for cancer patients.

In some embodiments, the present invention provides a method for calculating the type of double-stranded DNA virus infection. The calculation is based on next-generation sequencing reads, and the best use scenario is virus capture sequencing. By filtering the alignment information of the sequencing reads, reads originating from viral DNA are accurately selected; duplication bias that may be introduced during library construction is removed; and the number of reads from each type of viral DNA is counted, which indirectly reflects the load of each infecting virus type. The method includes the following steps:

1) In the initial alignment, the genomes of all virus types collected from the database are used as pseudo-chromosomes and merged with the chromosomes of the human genome to construct a mixed genome;

2) The alignment software uses the BWA-MEM algorithm, which supports local optimal alignment, and Picard MarkDuplicates is used to remove PCR duplicates after alignment;

3) Statistics are compiled on the alignments to non-human chromosomes. For reads aligned to a specific type of virus, the reads are accurately classified a second time according to the length ratio and similarity ratio of the alignment. The read filter formulas are as follows:

L_M ≥ (L_M + L_S + L_H + L_I) × 0.5

3 × L_I + 2 × L_D + L_MIS ≤ (L_M + L_D) × 0.2

where L_M is the length of the read aligned to the specific type of virus; L_S and L_H are the lengths of the (larger) fragments at the two ends of the read that do not align to the viral DNA; L_I is the length of the (small-fragment) insertions in the middle of the read; L_D is the length of the (small-fragment) deletions in the middle of the read; and L_MIS is the length of single-base mismatches on the read;

4) Reads that satisfy both of the above conditions (formulas) enter the virus type load statistics. For paired-end sequencing reads, both reads must satisfy the above conditions (both formulas) before entering downstream statistics;

5) The above steps complete the initial virus type load statistics. Relative quantification of the virus copy number is then performed according to the alignment of an optional internal reference gene. The quantification formula is as follows:

where CN_H is the copy number of the internal reference gene (default 2); D_V is the effective cumulative depth of the viral genome, obtained by summing the number of times all reads from step 3) cover each single-base site of the viral genome; D_H is the effective cumulative depth of the internal reference gene, obtained in the same way by summing the single-base site coverage of all reads aligned to the internal reference gene in the above mixed genome; C_V is the alignment coverage of the viral genome, i.e. the proportion of the viral genome length occupied by the single-base sites covered by the reads from step 3); C_H is the alignment coverage of the internal reference gene, i.e. the proportion of the internal reference gene length occupied by the single-base sites covered by reads aligned to the mixed genome; L_V is the effective length of the viral genome covered by the sequencing probes; and L_H is the effective length of the internal reference gene covered by the sequencing probes.

The internal reference genes used in the above virus copy number calculation are chosen to suit the fixed probes and sequencing reads of the different library-construction detection systems, and include but are not limited to conserved genes of the human genome.
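Step 1) above, building the mixed genome, amounts to appending the viral FASTA records to the human reference so that each viral genome behaves as an extra chromosome. The sketch below is one plain-Python way to do this, under the assumption that all inputs are uncompressed FASTA files; the function name and file paths are hypothetical. It also returns the viral sequence names, which can later be used to recognise alignments to non-human pseudo-chromosomes.

```python
def build_mixed_genome(human_fasta, viral_fastas, out_fasta):
    """Write human plus viral records to out_fasta; return the viral names."""
    viral_names = []
    with open(out_fasta, "w") as out:
        # Copy the human chromosomes unchanged.
        with open(human_fasta) as handle:
            for line in handle:
                line = line.rstrip("\n")
                if line:
                    out.write(line + "\n")
        # Append each viral genome as a pseudo-chromosome.
        for path in viral_fastas:
            with open(path) as handle:
                for line in handle:
                    line = line.rstrip("\n")
                    if not line:
                        continue
                    if line.startswith(">"):
                        # The sequence name is the first word of the header.
                        viral_names.append(line[1:].split()[0])
                    out.write(line + "\n")
    return viral_names
```

The merged file would then be indexed and used as the reference for the first alignment; that indexing step is not shown here.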

In some embodiments, the above algorithm is mainly applied to whole-genome sequencing, exome sequencing, whole-transcriptome sequencing, and virus capture sequencing data from clinical patients with double-stranded DNA virus infection. By mining the DNA sequencing data, the algorithm scans for the types of double-stranded DNA virus present and identifies the DNA fusion site sequences for each virus type that reaches the detection level. The patient's virus-related carcinogenic risk is then judged comprehensively from the type of double-stranded DNA virus infection, the infection abundance, whether the virus has integrated into the human genome, and the significance of the corresponding integration sites, to guide clinical early-warning decisions and treatment plans. The main application scenarios are as follows:

1. Determining the infection type of double-stranded DNA viruses such as human papillomavirus (HPV), hepatitis B virus (HBV), and Epstein-Barr virus (EBV). These viruses have many types and variants, and mixed infections with multiple types are common. The detection method of the present invention detects the viral DNA types that reach the detection abundance in the DNA sequencing data and counts the abundance of each type;

2. Predicting the sites at which viral DNA has integrated into the human genome, interpreting the significance of the corresponding human genome integration sites, and guiding clinical intervention against viral integration to prevent progression of the corresponding tumors;

3. Applying the method to tumor patient genomes to perform pattern recognition on large structural variations and predict pan-cancer prognosis;

4. Applying the method to tumor patient genomes to predict the pan-cancer therapeutic response to anti-tumor drugs that target POLQ and PARP1 and are based on the synthetic lethality principle.

In some embodiments, based on the virus genome types detected by the above algorithm, a reference genome of human and the corresponding virus type is constructed, all reads are aligned again, and, for the type-specific alignment results, virus integration and integration sites are detected based on the detection principle of chimera reads. Specifically, this includes the following steps:

1. Align all reads individually to the human reference genome;

2. Align all reads individually to the reference genome of the specific virus type;

3. Align all reads to the mixed reference genome of human and the corresponding virus type, and use Picard MarkDuplicates to remove PCR duplicates from the alignment results;

4. Combining the results of steps 1 and 2, classify the reads in the alignment results of step 3 into single-ended chimera reads, double-ended chimera reads, and long-distance double-ended cross-regional reads (see Figure 1);

5. Handle single-ended and double-ended chimera reads separately: for double-ended chimera reads, merge the pair into a single read for a second alignment with BLASTN; for single-ended chimera reads, perform the second BLASTN alignment on the chimeric single read;

6. Filter the alignment results of step 5. Reads are filtered out if: the result is inconsistent with the BWA-MEM alignment; the viral or human portion of the read is too short (≤ 30 bp); the overlap between the viral and human alignments is too long (≥ 50% of the read length); the alignment of the human portion of the read is not unique; or the human portion of the read comes from a low-complexity, repetitive region of DNA;

7. Locally cluster all reads retained after the filtering in step 6 according to their positions on the human genome, retain the sites supported by ≥ 3 reads, and use ANNOVAR software to annotate the gene position and function of the sites;

8. Use IDBA-UD software to assemble the reads from step 7, split the assembled sequences into viral and human parts for a third alignment with BLASTN, and retain the assembled sequences whose alignment results are consistent with the BWA-MEM alignment.

In one embodiment, when multiple virus types of the same species are detected, risk assessment can be performed based on the virus integration sites obtained by the above calculation method to identify the virus type that poses the main carcinogenic risk. The integrated virus type is largely consistent with the virus type that has the highest viral load, with a concordance rate of 70% (see Figure 2).

In one embodiment, the cell-free DNA sample used is derived from cells or tissue collected during routine examinations, including but not limited to cervical exfoliated cells, liver puncture cells, lymph node biopsy tissue, blood, and saliva.

In some embodiments, if tumor-associated virus integration sites are found in a patient sample, the frequency of downstream clinical monitoring can be increased or a more intensive clinical treatment plan can be adopted; conversely, if none are found, the monitoring frequency can be reduced or the clinical treatment plan can be downgraded (see Examples 1 to 6).

In one embodiment, the virus types involved in the present invention can include one or more double-stranded DNA viruses such as HPV, HBV, and EBV; for a given virus, any of its different types can be included; and for a given type of a virus, probes can be designed either for the entire genome or for partial regions of the genome.

In order to better illustrate the objectives, technical solutions and advantages of the present invention, the present invention will be further described below with reference to the accompanying drawings and specific embodiments. The following describes the present invention through specific examples of infection type distribution and integration site detection in cervical cancer, nasopharyngeal cancer, and liver cancer samples. It should be noted that these examples are for illustrative purposes only, and the present invention is not limited to these three diseases. Unless otherwise specified, the experimental methods in the present invention are all conventional methods.

Example 1

An embodiment of the method for accurately detecting DNA viruses in the human genome of the present invention includes the following steps:

For patient A with mild cervicitis, part of the cervical tissue was taken for capture sequencing, and the following analysis was performed on the sequencing data.

By filtering the alignment information of the sequencing reads, reads originating from viral DNA are accurately selected; duplication bias that may be introduced during library construction is removed; and the number of reads from each type of viral DNA is counted, which indirectly reflects the load of each infecting virus type. The specific steps are as follows:

(1) In the initial alignment, the genomes of all HPV types collected from the papillomavirus genome database PaVE are used as pseudo-chromosomes and merged with the chromosomes of the human genome to construct a mixed genome;

(2) The alignment software uses the BWA-MEM algorithm, which supports local optimal alignment, and Picard MarkDuplicates is used to remove PCR duplicates after alignment;

(3) Statistics are compiled on the alignments to the HPV genomes. For reads aligned to a specific HPV type, the reads are accurately classified a second time according to the length ratio and similarity ratio of the alignment. The read selection formulas are as follows:

L_M ≥ (L_M + L_S + L_H + L_I) × 0.5;

3 × L_I + 2 × L_D + L_MIS ≤ (L_M + L_D) × 0.2,

where L_M is the length of the read aligned to the specific type of virus; L_S and L_H are the lengths of the (larger) fragments at the two ends of the read that do not align to the viral DNA; L_I is the length of the (small-fragment) insertions in the middle of the read; L_D is the length of the (small-fragment) deletions in the middle of the read; and L_MIS is the length of single-base mismatches on the read;

(4) Reads that satisfy both of the above conditions enter the virus type load statistics. For paired-end sequencing reads, both reads must satisfy the above conditions before entering downstream statistics.
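As an illustrative check of the two filter conditions, consider a hypothetical 150-base read with L_M = 130, L_S = 20, L_H = 0, L_I = 0, L_D = 2 and L_MIS = 4. These values are invented for illustration and are not taken from the patient data.

```latex
\begin{align*}
L_M = 130 &\ge (L_M + L_S + L_H + L_I) \times 0.5 = (130 + 20 + 0 + 0) \times 0.5 = 75 \\
3 L_I + 2 L_D + L_{MIS} = 3(0) + 2(2) + 4 = 8 &\le (L_M + L_D) \times 0.2 = (130 + 2) \times 0.2 = 26.4
\end{align*}
```

Both conditions hold, so this read would enter the type load statistics. By contrast, a 150-base read with L_M = 60 and L_S = 90 fails the first condition, because 60 < 150 × 0.5 = 75.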

According to the detected virus genome types, infection with two HPV types (HPV31 and HPV33) was found in patient A's sample, and integration sites were tested separately for each type. For HPV31, a mixed reference genome of human and HPV31 was constructed, all reads were aligned again, and, for the type-specific alignment results, virus integration and integration sites were detected based on the detection principle of chimera reads.

Specific steps are as follows:

(1) Align all reads individually to the human reference genome.

(2) Align all reads individually to the HPV31 reference genome.

(3) Align all reads to the mixed human and HPV31 reference genome, and use Picard MarkDuplicates to remove PCR duplicates from the alignment results.

(4) Combining the results of steps 1 and 2, classify the reads in the alignment results of step 3 into single-ended chimera reads, double-ended chimera reads, and long-distance double-ended cross-regional reads (Figure 1).

(5) Handle single-ended and double-ended chimera reads separately: for double-ended chimera reads, merge the pair into a single read for a second alignment with BLASTN; for single-ended chimera reads, perform the second BLASTN alignment on the chimeric single read.

(6) Filter the alignment results of step 5. Reads are filtered out if: the result is inconsistent with the BWA-MEM alignment; the viral or human portion of the read is too short (≤ 30 bp); the overlap between the viral and human alignments is too long (≥ 50% of the read length); the alignment of the human portion of the read is not unique; or the human portion of the read comes from a low-complexity, repetitive region of DNA.

(7) Locally cluster all reads retained after the filtering in step 6 according to their positions on the human genome, retain the sites supported by ≥ 3 reads, and use ANNOVAR software to annotate the gene position and function of the sites.

(8) Use IDBA-UD software to assemble the reads from step 7, split the assembled sequences into viral and human parts for a third alignment with BLASTN, and retain the assembled sequences whose alignment results are consistent with the BWA-MEM alignment.

The results showed that no HPV31 integration site was detected in the human genome.

The same analysis was performed for HPV33, and two different integration sites were detected (see Table 1 for the results).

Table 1 HPV33 integration results

In summary, infection with two HPV types (HPV31 and HPV33; see Figure 3) was found in patient A, and high-risk HPV33 was found to be integrated at two different sites in the human genome. Colposcopy can therefore be performed immediately to avoid a missed diagnosis, in contrast to the international guidelines, which recommend continued observation for patients positive for HPV types other than HPV16 and HPV18.

Example 2

In patient B with mild cervicitis (see Example 1 for the detection method), infection with three HPV types (HPV16, HPV31, HPV56) was found (as shown in Figure 4), but no HPV integration was found. Follow-up can be continued and unnecessary colposcopy avoided, in contrast to the international guidelines, which recommend colposcopy referral for HPV16-positive patients.

Example 3

During follow-up of patient C with low-grade cervical lesions (see Example 1 for the detection method), high-risk HPV16 infection was found (as shown in Figure 5), but no HPV integration was found. Follow-up can be continued and unnecessary colposcopy avoided, in contrast to the international guidelines, which recommend colposcopy referral for HPV16-positive patients.

Example 4

During follow-up of patient D with low-grade cervical lesions (see Example 1 for the detection method), persistent infection with multiple HPV types (HPV16, HPV56) was found (as shown in Figure 6), and high-risk HPV56 was found to be integrated into the human genome (see Table 2). Colposcopy can be performed immediately to avoid progression.

Table 2 HPV56 integration results

Example 5

In patient E with high-grade cervical lesions (see Example 1 for the detection method), high-risk HPV16 infection (as shown in Figure 7) with two integration sites (see Table 3) was found, and surgery can be performed to avoid progression to cervical cancer.

Table 3 HPV16 integration results

Example 6

In patient F with cervical cancer (see Example 1 for the detection method), infection with multiple HPV types (HPV16, HPV18) was found (as shown in Figure 8), and high-risk HPV18 was found to be integrated at two different sites in the human genome (see Table 4). The integration site is located near the human CHRAC1 gene, which can guide personalized clinical treatment.

Table 4 HPV18 integration results

Example 7

This example applies the method to detect the type of EBV infection in 112 nasopharyngeal carcinoma cases collected at the First Affiliated Hospital of Sun Yat-sen University. The specific steps are as follows:

(1) In the initial alignment, the genomes of the two EBV types, Type1 and Type2, were collected from the literature and the NCBI database, treated as pseudochromosomes, and merged with the chromosomes of the human genome to construct a hybrid genome.

(2) Alignment was performed with the BWA-MEM algorithm, which supports locally optimal alignment, and Picard MarkDuplicates was used to remove PCR duplicates after alignment.

(3) The alignment results for the EBV genomes were tallied. After a read was aligned to a specific virus type, the reads were classified a second time according to the length ratio and similarity ratio of the read alignment. The read selection formulas are as follows:

$$L_M \geq (L_M + L_S + L_H + L_I) \times 0.5;$$

$$3 \times L_I + 2 \times L_D + L_{MIS} \leq (L_M + L_D) \times 0.2.$$

For each sample, the EBV type with the highest viral load and the number of reads supporting that type were counted. The results are shown in Figures 9 and 10: in most samples, the type with the highest viral load is Type1.
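The two read selection formulas can be illustrated with a minimal Python sketch. It assumes that the lengths are read from a SAM-style CIGAR string (M for the aligned length, S and H for the unaligned ends, I and D for insertions and deletions) and that the mismatch length is derived from an NM-style edit distance; this mapping, and the function names `cigar_lengths` and `passes_filter`, are illustrative assumptions rather than part of the described method.

```python
import re

# One CIGAR element is a length followed by an operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_lengths(cigar):
    """Sum the lengths of each CIGAR operation, e.g. '30S100M' -> {'S': 30, 'M': 100}."""
    totals = {}
    for length, op in CIGAR_RE.findall(cigar):
        totals[op] = totals.get(op, 0) + int(length)
    return totals


def passes_filter(cigar, nm):
    """Return True when a read satisfies both selection formulas.

    Assumption: nm is an edit distance that counts mismatched, inserted
    and deleted bases, so the mismatch length is nm minus the indel lengths.
    """
    t = cigar_lengths(cigar)
    l_m = t.get("M", 0)  # aligned length
    l_s = t.get("S", 0)  # unaligned end (assumed: soft clip)
    l_h = t.get("H", 0)  # unaligned end (assumed: hard clip)
    l_i = t.get("I", 0)  # inserted bases
    l_d = t.get("D", 0)  # deleted bases
    l_mis = max(nm - l_i - l_d, 0)

    # Formula 1: the aligned part is at least half of the read.
    length_ok = l_m >= (l_m + l_s + l_h + l_i) * 0.5
    # Formula 2: weighted insertions, deletions and mismatches stay within 20%.
    similarity_ok = 3 * l_i + 2 * l_d + l_mis <= (l_m + l_d) * 0.2
    return length_ok and similarity_ok
```

Under these assumptions, a read with CIGAR `100M50S` and two mismatches passes both formulas, whereas `60M90S` fails the length formula because less than half of the read is aligned.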

Example 8

In liver cancer patient G, the liver cancer tissue and its adjacent tissue were sampled, captured, and sequenced separately, and the infection types and integration sites in the cancer tissue and the adjacent tissue were detected by the method of the present invention. The specific detection steps for the cancer tissue are as follows:

By filtering the alignment information of the sequencing reads from the cancer tissue, the reads originating from viral DNA are accurately selected. After removing the duplication bias that may be introduced during library construction, the number of DNA reads of each virus type is counted, which indirectly reflects the load of each infecting virus type. The specific steps are as follows:

(1) In the initial alignment, the genomes of 11 HBV types were collected from the literature and the NCBI database, and all collected HBV genomes were used as pseudochromosomes and merged with the chromosomes of the human genome to construct a mixed genome.

(3) The alignment results for the HBV genomes were tallied. After a read was aligned to a specific HBV type, the reads were classified a second time according to the length ratio and similarity ratio of the read alignment. The read selection formulas are as follows:

$$L_M \geq (L_M + L_S + L_H + L_I) \times 0.5;$$

$$3 \times L_I + 2 \times L_D + L_{MIS} \leq (L_M + L_D) \times 0.2.$$

Based on the detected viral genome types, infections with three HBV types (AB014381, AF090842, AB033554) were found in the cancer sample of patient G. Integration sites were then tested separately for each of these three types. For AB014381, a mixed reference genome of human and the AB014381 virus was constructed and all reads were re-aligned. On the type-specific alignment results, virus integration and integration sites were detected based on the chimeric-read detection principle.
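Building a mixed reference of this kind amounts to appending the viral genome to the human chromosomes as an additional sequence. The following is a minimal Python sketch under the assumption that both references are available as FASTA text; the function name `merge_fasta` and the duplicate-name check are illustrative choices, not part of the described method.

```python
import io


def merge_fasta(human_fasta, virus_fastas, out):
    """Write the human chromosomes, then each viral genome as a pseudochromosome.

    human_fasta and each item of virus_fastas are open text handles in FASTA
    format; out is a writable text handle. Returns the sequence names in the
    order they were written.
    """
    names = []
    for handle in [human_fasta, *virus_fastas]:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines between records
            if line.startswith(">"):
                fields = line[1:].split()
                name = fields[0] if fields else ""
                # Sequence names must stay unique in the merged reference.
                if name in names:
                    raise ValueError(f"duplicate sequence name: {name}")
                names.append(name)
            out.write(line + "\n")
    return names
```

With real files, the three handles would typically come from `open(...)`; the merged file would then be indexed by the aligner before the re-alignment.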

Specific steps are as follows:

(1) Align all reads separately to the human reference genome.

(2) Align all reads separately to the reference genome of the AB014381 virus.

(3) Align all reads to the mixed reference genome of human and the AB014381 virus, and use Picard MarkDuplicates on the alignment results to remove PCR duplicates.

For the other two infection types detected in the cancer tissue, AF090842 and AB033554, the above integration site detection steps were repeated. In the end, only 3 integration sites, all on the AB014381 virus, were detected (as shown in Table 5 below).

The same infection type detection steps were performed on the adjacent tissue. Likewise, infections with three HBV types (AB014381, AF090842, AB033554) were detected in the adjacent tissue, and integration sites were tested separately for each of these three types. Finally, two integration sites were detected on the AB014381 virus (as shown in Table 5 below).
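How supporting reads might be grouped into integration sites can be sketched as follows. The minimum of 3 supporting reads follows the threshold stated in the claims; the input format (one human-side breakpoint per chimeric read), the 10 bp grouping window, and the function name `cluster_breakpoints` are hypothetical choices made only for this illustration.

```python
from collections import defaultdict


def cluster_breakpoints(breakpoints, window=10, min_reads=3):
    """Group nearby breakpoints and keep sites with enough supporting reads.

    breakpoints: iterable of (chromosome, position), one per chimeric read.
    window: maximum gap in bp between neighbouring breakpoints of one site
            (hypothetical value).
    Returns a list of (chromosome, start, end, n_reads).
    """
    by_chrom = defaultdict(list)
    for chrom, pos in breakpoints:
        by_chrom[chrom].append(pos)

    sites = []
    for chrom in sorted(by_chrom):
        positions = sorted(by_chrom[chrom])
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev <= window:
                count += 1  # still inside the current cluster
            else:
                if count >= min_reads:
                    sites.append((chrom, start, prev, count))
                start = pos  # open a new cluster
                count = 1
            prev = pos
        if count >= min_reads:
            sites.append((chrom, start, prev, count))
    return sites
```

In this sketch, three breakpoints at positions 100, 103, and 105 on one chromosome form one site, while an isolated breakpoint or a pair of breakpoints is discarded.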

Table 5 AB014381 Virus Integration Site

Example 9

Female cervical scan samples were collected at the Cervical Screening Clinic of the First Affiliated Hospital of Sun Yat-sen University and preserved in BD SurePath LBC cell preservation solution. Genomic DNA was extracted with the Beijing Quanjin EasyPure Genomic DNA Kit and fragmented with a Bioruptor Pico sonicator. Adapters were added, the DNA was purified, and a DNA library was prepared. The library was hybridized with HPV probe DNA and captured with magnetic beads, and the captured fragments were subjected to high-throughput paired-end (PE150) sequencing. The sequencing data were then analyzed by the method of the present invention as follows:

(1) In the initial alignment, the genomes of all HPV types collected from the papillomavirus genome database PaVE were used as pseudochromosomes and merged with the chromosomes of the human genome to construct a hybrid genome.

(3) The alignment results for the HPV genomes were tallied. After a read was aligned to a specific virus type, the reads were classified a second time according to the length ratio and similarity ratio of the read alignment. The read selection formulas are as follows:

$$L_M \geq (L_M + L_S + L_H + L_I) \times 0.5;$$

$$3 \times L_I + 2 \times L_D + L_{MIS} \leq (L_M + L_D) \times 0.2.$$

Based on the detected virus genome types, a reference genome of human plus each corresponding type was constructed and all reads were re-aligned. On the type-specific alignment results, virus integration and integration sites were detected based on the chimeric-read detection principle.
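Before the type-specific re-alignment, the detected types have to be tallied from the filtered reads. Because this example uses paired-end sequencing, the sketch below counts a read pair only when both mates passed the selection formulas, as the claims describe for paired-end reads. The input format, the extra requirement that both mates hit the same type, and the function name `count_pairs_by_type` are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def count_pairs_by_type(mates):
    """Count read pairs per virus type, most supported type first.

    mates: iterable of (pair_name, virus_type, passed_filter), one record per
    mate aligned to a viral pseudochromosome.
    """
    by_pair = defaultdict(list)
    for pair_name, virus_type, passed in mates:
        by_pair[pair_name].append((virus_type, passed))

    counts = Counter()
    for records in by_pair.values():
        # Both mates must be present and must have passed the formulas.
        if len(records) != 2 or not all(passed for _, passed in records):
            continue
        types = {virus_type for virus_type, _ in records}
        # Assumption: a pair supports a type only if both mates hit that type.
        if len(types) == 1:
            counts[types.pop()] += 1
    return counts.most_common()
```

The first entry of the returned list is the type with the most supporting pairs, which serves here as a simple proxy for the type with the highest load.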

Specific steps are as follows:

(1) Align all reads separately to the human reference genome.

(2) Align all reads separately to the reference genome of the specific virus type.

(3) Align all reads to the mixed reference genome of human and the corresponding type, and use Picard MarkDuplicates on the alignment results to remove PCR duplicates.

The infection types and the number of integration sites were counted for each sample. A total of 15 samples with multiple HPV infections and integration sites were selected. One stacked bar graph was drawn from the proportion of reads supporting each infection type among all viral reads in each sample, and another from the number of integration sites of each infection type in each sample.

The results are shown in Figure 2. The horizontal axis is the sample name, and different colors indicate different infection types. The lower panel shows, for each sample, the proportion of reads of each infection type among all viral reads. The upper panel shows the number of integration sites for each infection type in each sample. Figure 2 shows that, of the 15 samples with multiple HPV infections, 11 carry an integrated HPV type, and in 10 of these the integrated type is the one with the highest viral load, accounting for 66.7% of all samples.
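The quantities behind Figure 2 are simple ratios, sketched below in Python. The per-sample dictionaries and the function names are hypothetical; only the final arithmetic (10 of 15 samples) comes from the text.

```python
def read_proportions(read_counts):
    """Share of viral reads per infection type within one sample."""
    total = sum(read_counts.values())
    if total == 0:
        return {}
    return {virus_type: n / total for virus_type, n in read_counts.items()}


def top_type_is_integrated(read_counts, site_counts):
    """True when the type with the most reads also has an integration site."""
    top = max(read_counts, key=read_counts.get)
    return site_counts.get(top, 0) > 0


def share_of_samples(n_hits, n_samples):
    """Percentage of samples, rounded to one decimal place."""
    return round(100 * n_hits / n_samples, 1)


# Reported counts: 10 of the 15 samples have an integrated top-load type.
reported_share = share_of_samples(10, 15)  # 66.7
```

For a hypothetical sample with 600 HPV16 reads, 300 HPV56 reads, and 100 HPV31 reads, `read_proportions` gives 0.6, 0.3, and 0.1, which are the segment heights of one stacked bar.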

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit its scope of protection. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements may be made to the technical solutions of the present invention without departing from their essence and scope.

Claims

A method for detecting virus types in the human genome, comprising the following steps:

1) Collect all types of virus genomes from the database, use them as pseudo-chromosomes, and merge them with the chromosomes of the human genome to obtain a mixed genome;

2) Extract the patient's DNA and sequence it to obtain the patient's genome, and align it to the mixed genome obtained in step 1) for the first time;

3) Compile statistics on the reads aligned to non-human chromosomes in the alignment result of step 2); for each specific type of virus genome aligned, classify the reads according to the length ratio and similarity ratio of the first read alignment, and filter the reads using the following formulas:

$$L_M \geq (L_M + L_S + L_H + L_I) \times 0.5;$$

$$3 \times L_I + 2 \times L_D + L_{MIS} \leq (L_M + L_D) \times 0.2,$$

wherein L_M represents the length of the read aligned to the specific type of virus; L_S and L_H represent the lengths at the two ends of the read that are not aligned to the viral DNA; L_I represents the length of insertions within the read; L_D represents the length of deletions within the read; and L_MIS represents the length of single-base mismatches on the read;

4) For the reads that satisfy both formulas in step 3), compile statistics of virus type and load to obtain the result.
The method of claim 1, wherein the BWA-MEM algorithm is used for the alignment in step 2).
The method of claim 1, wherein step 2) further comprises removing PCR duplicate sequences.
The method of claim 3, wherein the software Picard MarkDuplicates is used to remove the PCR duplicate sequences.
The method of claim 1, wherein in step 4), for paired-end sequencing reads, the statistics of virus type and load are compiled only when both reads of a pair satisfy the two formulas in step 3).
A method for detecting the virus content in the human genome, comprising the following steps: based on the statistical results of virus type and load obtained by the method of any one of claims 1 to 5, perform relative quantification of the virus copy number according to the ratio relative to the result of aligning an optional internal reference gene to the mixed genome of claim 1, the quantification formula being as follows:

wherein CN_H is the copy number of the internal reference gene, 2 by default; D_V is the effective cumulative depth of the viral genome, obtained by summing the number of times each single-base site of the viral genome is covered by all reads in step 3) of claim 1; D_H is the effective cumulative depth of the internal reference gene, obtained in the same manner by summing the single-base site coverage of all reads of the internal reference gene aligned to the mixed genome of claim 1; C_V is the coverage ratio of the viral genome, i.e., the fraction of the viral genome length accounted for by the single-base sites covered by all reads in step 3) of claim 1; C_H is the coverage ratio of the internal reference gene, i.e., the fraction of the internal reference gene length accounted for by the single-base sites covered by all reads of the internal reference gene aligned to the mixed genome of claim 1; L_V is the effective length of the viral genome covered by the sequencing probes; and L_H is the effective length of the internal reference gene covered by the sequencing probes.
A method for detecting virus integration and integration sites in the human genome, comprising the following steps:

According to the virus genome type detected by the method of any one of claims 1 to 5, construct a reference genome of human and the corresponding virus type;

Align all the reads of the first alignment to the reference genome again; and

For the alignment results of the specific virus type, detect virus integration or integration sites based on the chimeric-read detection principle to obtain the result.
The method of claim 7, comprising the following steps:

S1. Align all reads of the first alignment separately to the human reference genome;

S2. Align all reads of the first alignment separately to the reference genome of the specific virus type;

S3. Align all reads of the first alignment to the mixed reference genome of human and the corresponding type, and use Picard MarkDuplicates to remove PCR duplicate sequences from the alignment result;

S4. Combining the results of step S1 and step S2, classify the reads in the alignment result of step S3 into single-end chimeric reads, paired-end chimeric reads, and long-distance paired-end cross-region reads;

S5. For paired-end chimeric reads, merge the pair into a single read for a second alignment; for single-end chimeric reads, perform the second alignment on the single chimeric read;

S6. Filter the reads in the alignment result of step S5;

S7. Perform local clustering of all reads retained after the filtering in step S6 according to their positions on the human genome, retain the sites supported by ≥ 3 reads, and annotate the sites with gene position and function; and

S8. Assemble the reads annotated in step S7, divide each assembled sequence into viral and human parts for a third alignment to the mixed reference genome, and retain the assembled sequences whose alignment result is consistent with the BWA-MEM alignment result of claim 2, to obtain the result.
The method of claim 8, wherein the reads filtered out in step S6 include the following reads:

Reads whose alignment result is inconsistent with the BWA-MEM alignment result;

Reads whose viral part or human part is too short;

Reads in which the proportion of overlap between the viral part and the human part is too large;

Reads whose human part does not align uniquely; or

Reads whose human part comes from low repetitive regions of DNA.
The method of claim 8, wherein in step S7, ANNOVAR software is used to annotate gene position and function; in step S8, IDBA-UD software is used for assembly; and BLASTN software is used for the second alignment in step S5 and the third alignment in step S8.