WO2020077559A1 - Method, device and storage medium for mining temperate phages from bacterial whole-genome sequences - Google Patents

Method, device and storage medium for mining temperate phages from bacterial whole-genome sequences

Info

Publication number
WO2020077559A1
WO2020077559A1 PCT/CN2018/110636 CN2018110636W WO2020077559A1 WO 2020077559 A1 WO2020077559 A1 WO 2020077559A1 CN 2018110636 W CN2018110636 W CN 2018110636W WO 2020077559 A1 WO2020077559 A1 WO 2020077559A1
Authority
WO
WIPO (PCT)
Prior art keywords
phage
sequence
bacteriophage
analysis
mild
Prior art date
Application number
PCT/CN2018/110636
Other languages
English (en)
French (fr)
Inventor
宋文琛
孙海汐
肖敏凤
程丽
邓子卿
王云
沈玥
李俊桦
Original Assignee
深圳华大生命科学研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大生命科学研究院
Priority to PCT/CN2018/110636 (WO2020077559A1)
Priority to CN201880098544.2A (CN112823206B)
Publication of WO2020077559A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N7/00: Viruses; Bacteriophages; Compositions thereof; Preparation or purification thereof
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/70: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving virus or bacteriophage

Definitions

  • The present invention relates to the technical field of bacteriophages, and in particular to a method, device and storage medium for mining temperate bacteriophages from bacterial whole-genome sequences.
  • A bacteriophage (phage) is a virus that specifically infects bacteria. Phages are found wherever bacterial communities occur, for example in soil, the ocean, and human and animal intestines. Phages have no cellular structure and consist mainly of a protein shell enclosing genetic material of a single nucleic acid type, DNA or RNA. Phages range from 20 to 200 nanometers in length, and their genomes encode from as few as a handful of genes to as many as several hundred. Phages cannot grow or replicate independently and must use the energy and metabolic systems of the host cell to grow and proliferate. A phage recognizes its host by binding specifically to receptors on the bacterial surface and therefore has strict host specificity.
  • Phages fall into two types according to their life cycle: lytic and temperate. After infecting a host cell, a lytic phage rapidly completes its own proliferation and releases progeny phages by lysing the host cell. A temperate phage does not lyse the cell directly after infection; instead it integrates its genome into the host DNA, or persists in the cell as a circular plasmid, and is replicated along with the host DNA. Under certain conditions a temperate phage can enter the lytic state and release progeny phages by lysing the host cell.
  • Lytic phages are relatively easy to isolate and analyze because of their distinctive phenotype.
  • Basic research still struggles to cover the full variety of phages, and naturally occurring lytic phages are isolated directly. Phages used to treat bacterial infections carry a degree of unpredictability in clinical efficacy, timeliness and safety.
  • PhiSpy uses five features (protein length, transcription strand orientation, AT and GC ratio measures, phage insertion sites, and similarity to phage proteins) and can identify prophages that have no sequence similarity to known phage genes. Its screening accuracy is still lacking, however, and it cannot guarantee that a predicted phage is functionally intact. Prophinder, a prediction tool dedicated to prophages, offers limited functionality, but its embeddability gives it an advantage when combined with other software tools.
  • PhageFinder starts from the observations that phage regions do not always have atypical G+C content, that phages do not always integrate into coding regions, and that they do not exclusively use tRNAs as integration target sites. Screening for disrupted genes or tRNAs as a stand-alone method is therefore not sufficiently reliable.
  • Its search against a collection of phage sequences, together with 441 phage-specific hidden Markov models (HMMs) analyzed with HMMSEARCH to locate prophage regions, is the highlight behind its improved accuracy.
  • The online service PHAST and its upgraded version PHASTER provide a web server that accurately identifies, annotates and graphically displays results for bacterial and viral sequences.
  • PHAST accepts raw DNA sequence data or partially annotated data in GenBank format, and quickly performs a series of database comparisons and phage "basic" feature recognition steps to locate, annotate and display prophage sequences and prophage features.
  • The user interface accepts only one data set at a time, so high-throughput online operation is not possible.
  • The most prominent phage prediction and mining tools of recent years include MetaPhinder, VirFinder and VirSorter.
  • MetaPhinder compares assembled metagenomic contigs against a pre-built phage database using the blastn algorithm and computes a composite indicator, the average nucleotide identity (ANI), from all successful hits. If the ANI of a contig is greater than 1.7%, the contig is considered likely to contain phage sequence.
  • This method is simple in principle and easy to operate, but its limitation is equally obvious: it struggles to recover phage sequences that are not yet in the database.
  • VirFinder, developed by Ren et al., addresses this problem well.
  • VirFinder exploits the difference in k-mer frequency between phages and host bacteria, using logistic regression classifiers trained on a large number of k-mers derived from phages and bacteria. VirFinder does not depend on a database and achieves good results when classifying a test set independent of the training set, which demonstrates its reliability in identifying unknown phage sequences. VirSorter relies heavily on similarity searches against existing virus databases, but it has the additional advantage of using a custom-built reference database of viral genomes, supplemented with viral genome sequences collected from freshwater, seawater and human gut, lung and saliva samples. Another advantage is its use of strand-switching and short-gene criteria, which are typical viral characteristics that do not require similarity searches. In screening based on viral genes, a contig must carry at least three predicted genes for a prediction to be made, which excludes many shorter contigs.
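  • The invention provides a method, a device and a storage medium for mining temperate phages from bacterial whole-genome sequences, enabling accurate mining and prediction of phage sequences.
  • According to a first aspect, an embodiment provides a method for mining temperate phages from a bacterial whole-genome sequence, including: obtaining the bacterial whole-genome sequence; clustering and aligning the functional elements of temperate phages to that sequence, and taking the regions where the functional elements cluster as suspected phage regions, the functional elements including hypothetical protein elements, infection elements, assembly elements and unknown conserved elements; searching for repeated sequences at both ends of a suspected phage region to obtain the temperate phage integration sites and thereby determine the temperate phage sequence region; and performing genomic feature analysis on that sequence region, treated as a phage genome sequence, to output the sequence information of the functional phage.
  • The above method further includes: before the genomic feature analysis, comparing relatively conserved sequences of the host genus/species against the temperate phage sequence region to determine whether such a sequence has been inserted into the region.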
  • The above functional elements are extracted, based on the mosaic structure of phages, from whole-genome data with phage annotation information in a phage genome database, where the mosaic structure means that genes expressing similar functions on the phage genome tend to occupy adjacent positions.
  • The above functional elements further include one or more of a lysis element, an integration element, a replication element, a regulatory element, a packaging element, an immune escape element, and a tRNA element.
  • The genomic feature analysis includes GC content and/or k-mer frequency feature analysis.
  • The genomic feature analysis also includes a comparative analysis of the genomic features of the host bacterium and those of the phage.
  • The comparative analysis is an analysis of the difference in protein length between the prophage and the host bacterium.
  • The comparative analysis is an analysis of the difference in transcription direction between the prophage and the host bacterium.
  • An embodiment provides a device for mining temperate phages from a bacterial whole-genome sequence, including:
  • a sequence acquisition unit, used to acquire the bacterial whole-genome sequence;
  • a suspected region determining unit, used to cluster and align the functional elements of temperate phages to the bacterial whole-genome sequence and to take the regions where the functional elements cluster as suspected phage regions, where the functional elements include hypothetical protein elements, infection elements, assembly elements and unknown conserved elements;
  • an integration site determination unit, used to search for repeated sequences at both ends of a suspected phage region to obtain the temperate phage integration sites, thereby determining the temperate phage sequence region;
  • an omics feature analysis unit, used to perform genomic feature analysis on the temperate phage sequence region, treated as a phage genome sequence, and to output the sequence information of the functional phage.
  • The above device further includes:
  • a conserved sequence judgment unit, used to compare relatively conserved sequences of the host genus/species against the temperate phage sequence region before the genomic feature analysis is performed, to determine whether such a sequence has been inserted into the region.
  • The above functional elements are extracted, based on the mosaic structure of phages, from whole-genome data with phage annotation information in a phage genome database, where the mosaic structure means that genes expressing similar functions on the phage genome tend to occupy adjacent positions.
  • The above functional elements further include one or more of a lysis element, an integration element, a replication element, a regulatory element, a packaging element, an immune escape element, and a tRNA element.
  • The genomic feature analysis includes GC content and/or k-mer frequency feature analysis.
  • The genomic feature analysis also includes a comparative analysis of the genomic features of the host bacterium and those of the phage.
  • The comparative analysis is an analysis of the difference in protein length between the prophage and the host bacterium.
  • The comparative analysis is an analysis of the difference in transcription direction between the prophage and the host bacterium.
  • an embodiment provides a computer-readable storage medium, including a program, which can be executed by a processor to implement the method of the first aspect.
  • The method of the invention achieves accurate mining and prediction of phage sequences; its predictions are functional phages, and false positives and false negatives in the results are reduced.
  • FIG. 1 is a flow chart of a method for mining temperate phages from a bacterial whole-genome sequence according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of the mechanism by which a phage integrates into the genome of the host bacterium in an embodiment of the present invention.
  • FIG. 3 is a structural block diagram of a device for mining temperate phages from a bacterial whole-genome sequence according to an embodiment of the present invention.
  • FIG. 4 is a graph showing the clustering of four types of functional elements (INF, HYP, ASB, UNS) on the genome of a host bacterium known to contain a prophage in an embodiment of the present invention.
  • FIG. 5 is a diagram showing a conserved sequence derived from E. coli inserted into a suspected phage region in an embodiment of the present invention.
  • FIG. 6 is a graph showing the analysis of the GC% content characteristics of phage genomes in an embodiment of the present invention.
  • FIG. 8 shows the analysis of the difference in average protein length between prophages and the related host bacteria in an embodiment of the present invention.
  • FIG. 9 is a graph showing the analysis of the difference in transcription direction between the prophage and the host in an embodiment of the present invention.
  • The embodiment of the invention provides a method for accurately mining and predicting temperate phage sequences from bacterial whole-genome sequencing data.
  • The method combines biological characteristics, such as horizontal gene transfer in host bacteria, with analysis of phage genomic features. Its results contain not only the predicted phage sequences but also bioinformatic evidence, drawn from comparing bacterial and phage genomic features and consistent with biologically verified results, to ensure that each prediction is a functional phage and to reduce false positives and false negatives.
  • A method for mining temperate phages from a bacterial whole-genome sequence includes the following steps:
  • S101: Obtain the bacterial whole-genome sequence. The bacterial whole-genome sequence may be a genome sequence assembled from whole-genome sequencing data of the bacterium.
  • The sequencing data comprise a large number of sequencing reads, which can be assembled by various existing methods to obtain the bacterial whole-genome sequence. The reads may come from any second-generation high-throughput sequencing technology.
  • S102: Cluster and align the functional elements of temperate phages to the bacterial whole-genome sequence, and take the regions where the functional elements cluster as suspected phage regions, where the functional elements include hypothetical protein elements, infection elements, assembly elements and unknown conserved elements.
  • The functional elements of temperate phages are elements extracted from the annotation information of existing phage genomes.
  • The functional elements are extracted from genomic data on the basis of the mosaic structure of phages (that is, genes expressing similar functions on the phage genome tend to occupy adjacent positions).
  • For example, phage functional elements were mined from the annotation information of 2,101 phage genomes in the NCBI RefSeq database (January 2018) to build an element library of about 100,000 functional elements.
  • The functional elements are classified into 11 main categories: LYS (lysis element), INT (integration element), REP (replication element), REG (regulatory element), PAC (packaging element), ASB (assembly element), INF (infection element), EVA (immune evasion, that is, immune escape element), HYP (hypothetical protein element), UNS (unsorted, that is, unknown conserved element) and tRNA (transfer RNA element). All functional elements obtained in the preliminary mining were de-duplicated by removing elements with nucleic acid similarity > 75%, which left 99,438 phage functional elements (Table 1), and the proportion of each functional element category in the phage genomes was calculated to obtain the data features.
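The de-redundancy step described above can be illustrated with a minimal sketch. The source does not say which tool or similarity measure was used, so the greedy strategy, the `identity` helper (based on Python's `difflib`, a crude stand-in for an alignment-based identity) and the function names below are all assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def identity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; a hypothetical stand-in for a real
    alignment-based nucleotide identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def deduplicate(elements: list[str], threshold: float = 0.75) -> list[str]:
    """Greedy de-redundancy: keep an element only if its similarity to every
    element already kept does not exceed `threshold`."""
    kept: list[str] = []
    for seq in elements:
        if all(identity(seq, k) <= threshold for k in kept):
            kept.append(seq)
    return kept
```

A greedy pass like this depends on input order (the first representative of each similar group is retained), which is a common but not the only possible design choice.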
  • Among these, the infection element (INF) covers functions related to structural proteins such as the phage tail and tail sheath and, together with the assembly element (ASB), is essential for phage survival. The hypothetical protein element (HYP), which has the highest proportion, is related to the unknown conserved element (UNS) and to phage structure, which makes these elements more accurate when applied to locating prophages.
  • The functional elements are constructed from the annotation information of phage functional proteins, and statistical methods are used to analyze the proportions of the functional elements and the composition of phage genomes, yielding the genomic characteristics of phages.
  • The four best-performing types of functional elements were selected, and clustering of these functional elements was used to determine the suspected phage regions.
  • The functional elements are extracted on the basis of the special mosaic structure of phages, which widens the range of functional genes and improves the accuracy of fuzzy matching.
  • The clustering of functional elements and the proportions of their functions provide a reliable, genome-wide basis for establishing the presence of prophages.
  • Fuzzy matching of functional elements works around the incompleteness of current phage databases and makes the most of known phage data, and selecting the optimal functional elements avoids a bias toward lytic phages.
  • The most prominent functional element, the infection element, can filter out most negative results in the preliminary screening.
  • A region where the functional elements "cluster" is a region of the bacterial whole-genome sequence where the functional elements are concentrated, which is consistent with the special mosaic structure of phages. Such a region is called a "suspected phage region", meaning that a phage is present in it with high probability.
  • In addition to the four types of functional elements (hypothetical protein elements, infection elements, assembly elements and unknown conserved elements), one or more of the lysis, integration, replication, regulatory, packaging, immune escape and tRNA elements may also be clustered and aligned to the bacterial whole-genome sequence. Selecting additional functional elements can improve the accuracy of the clustering alignment to some extent.
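The idea of a "suspected phage region" as a stretch of the genome where element hits are concentrated can be sketched as follows. This is only an illustration under stated assumptions: the source gives no clustering algorithm or thresholds, so the gap-merging approach, the `Hit` type and the `max_gap` and `min_hits` parameters are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    start: int      # alignment start on the bacterial genome
    end: int        # alignment end on the bacterial genome
    category: str   # e.g. "INF", "HYP", "ASB" or "UNS"


def suspect_regions(hits, max_gap=5000, min_hits=4):
    """Merge hits separated by at most `max_gap` bases into clusters and
    report clusters with at least `min_hits` hits as (start, end) regions.
    Both thresholds are illustrative assumptions, not values from the source."""
    regions = []
    cluster = []
    cluster_end = 0
    for hit in sorted(hits):
        if cluster and hit.start - cluster_end > max_gap:
            if len(cluster) >= min_hits:
                regions.append((cluster[0].start, cluster_end))
            cluster = []
        cluster.append(hit)
        cluster_end = hit.end if len(cluster) == 1 else max(cluster_end, hit.end)
    if len(cluster) >= min_hits:
        regions.append((cluster[0].start, cluster_end))
    return regions
```

In practice the hits would come from aligning the element library against the genome; here they are simply passed in as coordinates.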
  • S103: Search for repeated sequences at both ends of the suspected phage region to obtain the temperate phage integration sites, thereby determining the temperate phage sequence region.
  • As shown in FIG. 2, according to the mechanism by which a phage integrates into the host genome, a temperate phage infecting its host uses an integrase and integration sites to insert itself into the host genome, forming a prophage.
  • POP' denotes the integration site of the phage, and BOB' denotes the integration site of the host bacterium.
  • During infection the phage circularizes, POP' and BOB' undergo homologous recombination, and the phage integrates into the genome of the host bacterium to form a prophage.
  • The repeated sequences at the two ends of the prophage, BOP' and POB', are the integration sites at its two ends; they appear as a repeat sequence mostly 10 to 200 bp long.
  • By searching for these repeated sequences, the temperate phage integration sites can be obtained. They mark the precise boundaries at the two ends of the prophage, and the region between them is the temperate phage sequence region.
  • Before the genomic feature analysis, relatively conserved sequences of the host genus/species are compared against the temperate phage sequence region to determine whether such a sequence has been inserted into the region.
  • A "relatively conserved sequence of the host genus/species" is a sequence that is conserved within the genus or species to which the host bacterium carrying the temperate phage belongs.
  • For E. coli, for example, it is a sequence conserved across the various members of E. coli.
  • Multiple sequence alignment software (such as Mugsy) is run on representative whole-genome data of the host genus/species to obtain the relatively conserved sequences.
  • For example, existing genus and species information for the host bacterium is used to obtain representative genome data, and multiple sequence alignment software (such as Mugsy) is used to build a representative genome set, from which the conserved sequences of representative E. coli genomes, that is, the relatively conserved sequences of the host genus/species, are obtained.
  • The relatively conserved sequences are compared against the temperate phage sequence region or the suspected phage region; regions into which a host conserved sequence has been inserted are identified and eliminated. This effectively removes false positives from the prediction results and ensures the accuracy of temperate phage mining and prediction.
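The elimination step can be sketched as an interval check: a candidate region is dropped when a host-conserved block falls inside it. This is a simplified illustration; it assumes the conserved blocks have already been located on the genome as half-open (start, end) coordinates, and the function names and the `min_overlap` parameter are hypothetical.

```python
def has_conserved_insertion(region, conserved_blocks, min_overlap=1):
    """True if any host-conserved block overlaps `region` by at least
    `min_overlap` bases. Intervals are half-open (start, end) pairs."""
    r_start, r_end = region
    for c_start, c_end in conserved_blocks:
        overlap = min(r_end, c_end) - max(r_start, c_start)
        if overlap >= min_overlap:
            return True
    return False


def filter_regions(regions, conserved_blocks, min_overlap=1):
    """Keep only the candidate regions free of host-conserved insertions."""
    return [r for r in regions
            if not has_conserved_insertion(r, conserved_blocks, min_overlap)]
```

For example, `filter_regions([(100, 2000), (5000, 9000)], [(6000, 6500)])` keeps only the first region, because the second contains a conserved block.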
  • S104: Perform genomic feature analysis on the temperate phage sequence region, treated as a phage genome sequence, and output the sequence information of the functional phage.
  • The genomic feature analysis includes GC content and/or k-mer frequency feature analysis, and optionally phage (viral) feature analysis such as annotation of proteins of unknown function. In a preferred embodiment of the present invention, the genomic feature analysis also includes a comparative analysis of the genomic characteristics of the host bacterium and the phage, such as a comparison of transcription direction between the prophage and the host bacterium.
  • The method of the embodiment makes full use of existing temperate phage databases to analyze and derive their omics data features. It performs multiple sequence alignment on a large amount of whole-genome data from bacteria of the same host genus/species to obtain the host's relatively conserved sequences and uses them to eliminate false positives. Combined with the phage assembly mechanism, parameters are set to eliminate results in which an inserted host conserved sequence has destroyed the phage's integrity.
  • Taking the biological characteristics of phages into account as a whole makes it possible to determine accurately whether a suspected phage is a functional phage.
  • The method of the embodiment directly outputs the sequence information of the functional phage and, optionally, a comparison between the genomic features of the host bacterium and those of the predicted phage, to support the accuracy of the prediction.
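The two genomic features named in S104 are straightforward to compute. The sketch below is a minimal illustration, not the patent's implementation: the function names and the default `k=4` are assumptions, and ambiguous bases (such as N) are simply ignored.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def kmer_frequencies(seq: str, k: int = 4) -> dict:
    """Relative frequency of every k-mer made only of A, C, G and T."""
    seq = seq.upper()
    alphabet = set("ACGT")
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= alphabet
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}
```

Computing both features for a candidate region and for the rest of the host genome gives the kind of phage-versus-host comparison the text describes.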
  • The prediction results agree with the true results at a rate above 95%. From whole-genome data of 267 E. coli cases, 3 functional phages were accurately predicted, and experimental verification and sequencing confirmed the accuracy of the output sequence information of the functional phages. The predicted temperate phages will help expand phage databases, and temperate phage prediction for specific host bacteria offers a new option for phage therapy targeting those bacteria.
  • The present invention also provides a device for mining temperate phages from a bacterial whole-genome sequence. As shown in FIG. 3, it includes: a sequence acquisition unit 301, used to obtain the bacterial whole-genome sequence; a suspected region determining unit 302, used to cluster and align the functional elements of temperate phages to the bacterial whole-genome sequence and to take the regions where the functional elements cluster as suspected phage regions, where the functional elements include hypothetical protein elements, infection elements, assembly elements and unknown conserved elements; an integration site determination unit 303, used to search for repeated sequences at both ends of a suspected phage region to obtain the temperate phage integration sites and thereby determine the temperate phage sequence region; and an omics feature analysis unit 304, used to perform genomic feature analysis on the temperate phage sequence region, treated as a phage genome sequence, and to output the sequence information of the functional phage.
  • The device further includes a conserved sequence judgment unit 305, used to compare relatively conserved sequences of the host genus/species against the temperate phage sequence region before the genomic feature analysis is performed, to determine whether such a sequence has been inserted into the region.
  • The functional elements are extracted, based on the mosaic structure of phages, from whole-genome data with phage annotation information in a phage genome database, where the mosaic structure means that genes expressing similar functions on the phage genome tend to occupy adjacent positions.
  • The functional elements further include one or more of a lysis element, an integration element, a replication element, a regulatory element, a packaging element, an immune escape element, and a tRNA element.
  • The genomic feature analysis includes GC content and/or k-mer frequency feature analysis.
  • The genomic feature analysis also includes a comparative analysis of the genomic characteristics of the host bacterium and the phage, for example an analysis of the difference in protein length, or of the difference in transcription direction, between the prophage and the host bacterium.
  • A person skilled in the art will understand that all or part of the functions of the methods in the above embodiments may be implemented by hardware or by a computer program.
  • The program may be stored in a computer-readable storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disk, a hard disk and the like.
  • A computer executes the program to realize the above functions. For example, the program is stored in the memory of a device, and all or part of the above functions are realized when the processor executes the program in memory.
  • The program may also be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk or a removable hard disk, and be downloaded or copied into the memory of the local device, or used to update the system version of the local device.
  • An embodiment provides a computer-readable storage medium that includes a program executable by a processor to implement the method for mining temperate phages from a bacterial whole-genome sequence according to an embodiment of the present invention.
  • The functional elements were screened, and comparison of the 11 types of functional elements yielded four types: infection elements (INF), assembly elements (ASB), hypothetical protein elements (HYP) and unknown conserved elements (UNS).
  • With these four types of functional elements, the suspected phage region can be clearly obtained. FIG. 4 shows the clustering of the four types of functional elements (INF, HYP, ASB, UNS) on the genome of a host bacterium known to contain a prophage (NC_000907, Haemophilus influenzae Rd KW20); the suspected phage region obtained coincides with the actual, annotated phage region. This indicates that the suspected phage regions obtained by the method of the present invention are reliable.
  • The integration sites at both ends of the suspected phage region were then mined.
  • The biological basis is that a temperate phage circularizes when infecting the host bacterium and encodes an integrase that recognizes the integration site on the host; under the action of the integrase, the phage integrates into the host genome through homologous recombination.
  • New integration sites are formed at both ends of the resulting prophage (that is, the state in which the temperate phage is inserted into the host genome), as shown in FIG. 2. Based on this biological characteristic, the repeated sequences at both ends are used to determine the exact positions of the two ends of the prophage.
  • For example, E. coli ATCC25922 was judged to contain a prophage, with a Lambda phage integrated in its predicted region (that is, the suspected phage region). However, comparison against E. coli conserved sequences showed that a conserved sequence derived from E. coli MG1655 had been inserted into the suspected phage region, destroying its functional integrity. Induction experiments confirmed that no functional phage was present. This is therefore a false positive rather than a functional phage, and it must be excluded from the final mining prediction result.
  • The analysis summarizes the relevant genomic characteristics and derives the corresponding data features. For example, the GC% content characteristics of 1,031 phage genomes are shown in FIG. 6 ([...]-32%, 52%-58%, 69%-71%), whereas the GC content of the host bacteria is mostly distributed between 30% and 60%, a fairly clear difference.
  • For transcription direction, the value for the prophage is 46.6%, showing that most consecutive genes are encoded on the strand transcribed in the same direction, whereas the value for the host bacterium is only 0.8%: only 56 of its 7,146 genes are transcribed consecutively in the same direction. Based on these three distinguishing characteristics, feature matching determines whether the suspected phage region identified in the previous step is a functional temperate phage.
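The transcription-direction comparison can be illustrated with a small sketch. The source does not define how its percentages were computed, so the metric below (the share of genes lying in a run of at least `min_run` consecutive genes on the same strand) and the function name are assumptions chosen only to show the kind of calculation involved. The host figure quoted above is at least arithmetically consistent: 56 / 7146 is about 0.8%.

```python
from itertools import groupby


def same_direction_fraction(strands, min_run=10):
    """Fraction of genes that belong to a run of at least `min_run`
    consecutive genes on the same strand. `strands` lists '+' or '-'
    for each gene in genome order. The metric is a hypothetical example."""
    if not strands:
        return 0.0
    run_lengths = (len(list(group)) for _, group in groupby(strands))
    in_runs = sum(n for n in run_lengths if n >= min_run)
    return in_runs / len(strands)
```

A prophage region, whose genes tend to be co-oriented, would score much higher under such a metric than the surrounding host genome.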

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Organic Chemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Engineering & Computer Science (AREA)
  • Immunology (AREA)
  • Genetics & Genomics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Microbiology (AREA)
  • General Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Biochemistry (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Physics & Mathematics (AREA)
  • Medicinal Chemistry (AREA)
  • Biomedical Technology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Micro-Organisms Or Cultivation Processes Thereof (AREA)

Abstract

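Provided are a method and a device for mining temperate phages from a bacterial whole-genome sequence, and a storage medium containing a program capable of executing the method. The method comprises: obtaining a bacterial whole-genome sequence; aligning functional elements of temperate phages to the bacterial whole-genome sequence by clustering, and taking a region in which the functional elements cluster as a suspected phage region, wherein the functional elements include hypothetical protein elements, infection elements, assembly elements and unknown conserved elements; searching for repeat sequences at both ends of the suspected phage region to obtain the temperate phage integration sites, thereby determining the sequence region of the temperate phage; and performing genomic feature analysis on the sequence region of the temperate phage as a phage genome sequence, and outputting the sequence information of the functional phage.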

Description

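Method, device and storage medium for mining temperate phages from bacterial whole-genome sequences
Technical Field
The present invention relates to the technical field of bacteriophages, and in particular to a method, a device and a storage medium for mining temperate phages from bacterial whole-genome sequences.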
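Background Art
Bacteriophages (phages) are viruses that specifically infect bacteria and are found wherever bacterial communities occur, for example in soil, the oceans, and the intestines of humans and animals. Phages have no cellular structure and consist mainly of a protein coat enclosing a single type of nucleic acid, DNA or RNA, as genetic material. Phages range from 20 nm to 200 nm in length, and their genomes encode from a handful to several hundred genes. Phages cannot grow or replicate independently and must use the energy and metabolic systems of the host cell to grow and proliferate. Phages recognize their hosts by binding specifically to receptors on the bacterial surface and therefore have strict host specificity. According to their life cycle, phages are divided into lytic and temperate types. After infecting a host cell, a lytic phage rapidly proliferates and releases progeny phages by lysing the host cell. A temperate phage does not lyse the cell directly after infection; instead it integrates its genome into the host DNA, or persists in the cell as a circular plasmid, and replicates along with the host DNA. Under certain conditions, a temperate phage can enter the lytic state and release progeny phages by lysing the host cell.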
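As early as the 1920s, shortly after phages were discovered, their discoverer Felix d'Herelle used them clinically against bacterial infections, for example to control outbreaks of cholera in India and plague in Egypt. However, because basic research on phages was insufficient at the time, phage therapy suffered from inconsistent efficacy evaluation standards, production methods that were hard to standardize, and low product purity. With the discovery and use of antibiotics, phage therapy was soon replaced by cheaper and more efficient antibiotic therapy, and it gradually faded from the medical and research systems of developed Western countries. In recent years, as antibiotic-resistant bacteria have spread worldwide, the effectiveness of antibiotics against bacterial infections has been seriously challenged, prompting some scientists to return to phage therapy research. In July 2005, the journal Antimicrobial Agents and Chemotherapy reported the first standardized randomized double-blind human trial of phage therapy, demonstrating the safety of an oral phage preparation in humans. In June 2009, the Journal of Wound Care reported the first phase I clinical trial approved by the US FDA, demonstrating the safety of a phage preparation in wound treatment. In September of the same year, the journal Clinical Otolaryngology reported the first randomized controlled clinical trial evaluating the efficacy of phage therapy; the study showed that a phage cocktail was safe and effective for treating chronic human ear infections caused by drug-resistant Pseudomonas aeruginosa. In addition, many other animal experiments and human clinical trials have evaluated phage treatment of conditions including burn infections and lung infections.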
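Natural phages are extremely diverse, and the vast majority of them are temperate phages. Only in recent years has it become possible to use bioinformatics tools to mine large-scale sequence datasets, namely bacterial genome data, for viruses and to examine the relationships between viruses and their host bacteria. At present, however, phage resource collections and databases are particularly sparse: NCBI GenBank, EMBL-EBI and Phantom, the best-known large databases worldwide, each hold genome information for only about 3,000 phages, and the genome annotation is very incomplete, which hampers phage research, phage engineering and phage therapy.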
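For temperate phages, which make up the great majority, the prophage (the phage integrated into the host genome) can, through expression of its genes, affect metabolism, bacterial adhesion, colonization, invasion, spread, resistance to immune responses, exotoxin production, serum resistance, the destruction of competing bacteria, antibiotic resistance and so on. Functional studies and diversity analyses of temperate phages therefore need more complete phage genome sequences in order to understand the true extent of genetic diversity, genetic and exchange capacity, and phage evolution. Where phage and host actually coexist, the prophage is usually overlooked and treated as part of the bacterial genome, so in theory complete bacterial genome sequences contain a huge, under-explored and publicly available resource of phage genomes. Among the phages discovered so far, lytic phages are comparatively easy to mine and analyze because of their distinctive phenotype, but basic research cannot yet cover all kinds of phages, and using isolated natural lytic phages directly to treat bacterial infections carries a degree of unpredictability in clinical efficacy, timeliness and safety.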
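Therefore, to overcome the drawbacks of phage therapy and advance the use of phages against bacteria-related diseases, the biggest problem to face first is that very little is known about the 10^31 to 10^32 phages that exist in nature. Meanwhile, although genomics and synthetic biology are developing rapidly, the most important constraint on phage research is the scarcity of phage data. Advances in techniques for predicting and mining whole phage genome sequences will directly benefit both phage research and phage therapy.
Before bioinformatics, genomics and second-generation sequencing were developed and widely adopted, phage mining relied solely on experimental screening of natural lytic phages against a target host bacterium, with all its drawbacks: it is time-consuming and costly, and isolation and screening are subject to chance. In that approach, the host bacterium is co-cultured with a sample suspected of containing phages, lysis is observed after amplification, and a phage sample is then obtained. Such phages were often used clinically to kill pathogens without having been sequenced, that is, without knowing whether they carried virulence genes.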
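In bioinformatics, a series of prediction software tools and mining methods has also been developed to exploit phage resources. Early ones include Phage Finder (Fouts, 2006), Prophinder (Lima-Mendez et al., 2008), PHAST (Zhou et al., 2011) and PhiSpy (Akhter, Aziz & Edwards, 2012). PhiSpy compares five features (functional protein length, transcription strand directionality, quantified AT and GC ratios, phage insertion sites, and similarity to phage proteins) and can thereby identify prophages with no sequence similarity to known phage genes, but its screening precision is still lacking and it cannot guarantee that a functional phage is complete. Prophinder, a tool aimed only at prophage prediction, is limited in functionality, but because it can be embedded in other workflows it has an advantage when combined with other software tools. Regarding accuracy, Phage Finder starts from the observations that prophage regions do not always have atypical G+C nucleotide composition, that phages do not always integrate into coding regions, and that they do not exclusively use tRNAs as integration targets; screening that scans for disrupted genes or tRNAs as a stand-alone method therefore does not give sufficiently reliable results. Its main strength in accuracy is that it locates prophage regions by searching against a collection of phage sequences and against 441 phage-specific hidden Markov models (HMMs) analyzed with HMMSEARCH. In addition, the online service PHAST and its upgraded version PHASTER provide a web server that can accurately identify, annotate and graphically display bacterial and viral sequence results. Most importantly, PHAST accepts raw DNA sequence data or partially annotated GenBank-format data and quickly performs a number of database comparisons and phage "cornerstone" feature identification steps to locate, annotate and display prophage sequences and prophage features. Its drawback is that the user interface accepts only one dataset at a time, so it cannot be operated online at high throughput.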
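Prominent phage prediction tools of recent years include MetaPhinder, VirFinder and VirSorter. MetaPhinder aligns assembled metagenomic contigs against a pre-built phage database with blastn and computes from all successful hits a combined measure, the average nucleotide identity (ANI); if a contig's ANI exceeds 1.7%, the contig is considered likely to contain phage sequence. This approach is simple in principle and easy to use, but its limitation is equally obvious: it can hardly recover phage sequences that are not yet in the database. VirFinder, recently developed by Ren et al., addresses this problem well. Studies have shown that viruses and their hosts differ markedly in genomic k-mer frequencies. VirFinder is a logistic regression classifier based on this difference, trained on large numbers of k-mers derived from phages and from bacteria respectively. VirFinder does not depend on a database and performs well when classifying test sets independent of the training set, demonstrating its reliability for identifying unknown phage sequences. VirSorter relies largely on similarity searches against existing viral databases, but it has the additional advantage of using a custom-built viral reference genome database augmented with viral genome sequences sampled from freshwater, seawater, and the human gut, lung and saliva. Another advantage is its use of strand-switching and short-gene criteria, two typical viral signatures that require no similarity search. Its viral-gene-based screening needs at least three predicted genes on a contig to make a prediction, which excludes many shorter contigs.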
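At present, all bioinformatics-based methods for predicting and mining phages are limited by the scarcity of existing phage data. Whether they rely on database alignment or on machine-learning models, they are limited relative to the complex species diversity of phages and their vast, unknown numbers in nature. Further screening of results also tends to overlook the integrity of functional phages (phages that can excise completely from the host genome and form viral particles): a prophage integrated in the host genome faces host defense mechanisms, horizontal gene transfer and similar pressures, which can leave it incomplete and unable to excise and form viral particles. Moreover, these methods lack means of subsequently verifying predictions and cannot specifically guard against false positives and false negatives.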
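Summary of the Invention
The present invention provides a method, a device and a storage medium for mining temperate phages from bacterial whole-genome sequences, which enable accurate mining and prediction of phage sequences.
According to a first aspect, one embodiment provides a method for mining temperate phages from a bacterial whole-genome sequence, comprising:
obtaining a bacterial whole-genome sequence;
aligning functional elements of temperate phages to the bacterial whole-genome sequence by clustering, and taking a region in which the functional elements cluster as a suspected phage region, wherein the functional elements include hypothetical protein elements, infection elements, assembly elements and unknown conserved elements;
searching for repeat sequences at both ends of the suspected phage region to obtain the temperate phage integration sites, thereby determining the sequence region of the temperate phage;
performing genomic feature analysis on the sequence region of the temperate phage as a phage genome sequence, and outputting the sequence information of the functional phage.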
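Further, the method also comprises:
before the genomic feature analysis, aligning relatively conserved sequences of the host genus/species against the sequence region of the temperate phage to determine whether a relatively conserved sequence of the host genus/species has been inserted into the sequence region.
Further, the functional elements are extracted, on the basis of the mosaic structure of phages, from whole-genome data with phage annotation in a phage genome database, where the mosaic structure means that genes expressing similar functions on a phage genome tend to be located next to each other.
Further, the functional elements also include one or more of lysis elements, integration elements, replication elements, regulatory elements, packaging elements, immune evasion elements and tRNA elements.
Further, the genomic feature analysis includes GC content and/or K-mer frequency feature analysis.
Further, the genomic feature analysis also includes a comparative analysis of host genome features and phage genome features.
Further, the comparative analysis is an analysis of the protein length difference between the prophage and the host bacterium.
Further, the comparative analysis is an analysis of the transcription direction difference between the prophage and the host bacterium.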
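According to a second aspect, one embodiment provides a device for mining temperate phages from a bacterial whole-genome sequence, comprising:
a sequence acquisition unit, configured to obtain a bacterial whole-genome sequence;
a suspected region determination unit, configured to align functional elements of temperate phages to the bacterial whole-genome sequence by clustering and to take a region in which the functional elements cluster as a suspected phage region, wherein the functional elements include hypothetical protein elements, infection elements, assembly elements and unknown conserved elements;
an integration site determination unit, configured to search for repeat sequences at both ends of the suspected phage region to obtain the temperate phage integration sites, thereby determining the sequence region of the temperate phage;
a genomic feature analysis unit, configured to perform genomic feature analysis on the sequence region of the temperate phage as a phage genome sequence and to output the sequence information of the functional phage.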
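Further, the device also comprises:
a conserved sequence judgment unit, configured to align, before the genomic feature analysis, relatively conserved sequences of the host genus/species against the sequence region of the temperate phage to determine whether a relatively conserved sequence of the host genus/species has been inserted into the sequence region.
Further, the functional elements are extracted, on the basis of the mosaic structure of phages, from whole-genome data with phage annotation in a phage genome database, where the mosaic structure means that genes expressing similar functions on a phage genome tend to be located next to each other.
Further, the functional elements also include one or more of lysis elements, integration elements, replication elements, regulatory elements, packaging elements, immune evasion elements and tRNA elements.
Further, the genomic feature analysis includes GC content and/or K-mer frequency feature analysis.
Further, the genomic feature analysis also includes a comparative analysis of host genome features and phage genome features.
Further, the comparative analysis is an analysis of the protein length difference between the prophage and the host bacterium.
Further, the comparative analysis is an analysis of the transcription direction difference between the prophage and the host bacterium.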
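According to a third aspect, one embodiment provides a computer-readable storage medium comprising a program that can be executed by a processor to implement the method of the first aspect.
The method of the present invention enables accurate mining and prediction of phage sequences; its predictions are functional phages, and false positives and false negatives in the results are reduced.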
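Brief Description of the Drawings
FIG. 1 is a flowchart of a method for mining temperate phages from a bacterial whole-genome sequence according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the mechanism by which a phage integrates into the host genome in an embodiment of the present invention;
FIG. 3 is a structural block diagram of a device for mining temperate phages from a bacterial whole-genome sequence according to an embodiment of the present invention;
FIG. 4 shows the clustering of the four classes of functional elements (INF, HYP, ASB, UNS) on the genome of a host bacterium known to contain a prophage in an embodiment of the present invention;
FIG. 5 shows a suspected phage region into which a conserved sequence derived from E. coli has been inserted in an embodiment of the present invention;
FIG. 6 shows the results of the GC% content analysis of phage genomes in an embodiment of the present invention;
FIG. 7 shows the results of the protein length difference analysis between a prophage and its host bacterium in an embodiment of the present invention;
FIG. 8 shows the results of the mean protein length difference analysis between a prophage and its host bacterium in an embodiment of the present invention;
FIG. 9 shows the results of the transcription direction difference analysis between a prophage and its host in an embodiment of the present invention.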
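Detailed Description
The present invention is described in further detail below through specific embodiments with reference to the drawings. In the following embodiments, many details are described so that the invention can be better understood. However, those skilled in the art will readily recognize that some of the features can be omitted in different cases, or can be replaced by other elements, materials or methods.
In addition, the features, operations or characteristics described in the specification can be combined in any suitable way to form various embodiments. The steps or actions in the method description can also be reordered or adjusted in ways obvious to those skilled in the art. The various orders in the specification and drawings are therefore only for clearly describing a particular embodiment and do not imply a required order, unless it is otherwise stated that a certain order must be followed.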
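An embodiment of the present invention provides a method for accurately mining and predicting temperate phage sequences from bacterial whole-genome sequencing data. The method combines biological features such as horizontal gene transfer in the host bacterium with analysis of phage genomic features. It outputs not only the predicted phage sequence but also, by comparing bacterial and phage genomic features, bioinformatic evidence supporting the accuracy of the result, ensuring that the prediction is a functional phage and reducing false positives and false negatives.
As shown in FIG. 1, in one embodiment of the present invention, a method for mining temperate phages from a bacterial whole-genome sequence comprises the following steps:
S101: Obtain a bacterial whole-genome sequence.
In embodiments of the present invention, the bacterial whole-genome sequence may be a genome sequence assembled from bacterial whole-genome sequencing data. The sequencing data comprise a large number of sequencing reads, which can be assembled into the bacterial whole-genome sequence by any of various existing methods. The reads may come from any second-generation high-throughput sequencing technology.
S102: Align functional elements of temperate phages to the bacterial whole-genome sequence by clustering, and take a region in which the functional elements cluster as a suspected phage region, wherein the functional elements include hypothetical protein elements, infection elements, assembly elements and unknown conserved elements.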
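In embodiments of the present invention, the functional elements of temperate phages are elements extracted from existing phage genome annotation. Specifically, in one embodiment, the functional elements are extracted, on the basis of the mosaic structure of phages (genes expressing similar functions on a phage genome tend to be located next to each other), from whole-genome data with phage annotation in a phage genome database. In one specific embodiment, using the mosaic structure characteristic of phages, phage functional elements were mined from the genome annotation of 2,101 phages in the NCBI-Refseq database (January 2018) to build an element library of about 100,000 functional elements. The elements were then classified by function into 11 main categories: LYS (lysis elements), INT (integration elements), REP (replication elements), REG (regulatory elements), PAC (packaging elements), ASB (assembly elements), INF (infection elements), EVA (immune evasion elements), HYP (hypothetical protein elements), UNS (unsorted; unknown conserved elements), and tRNA elements (transfer elements). All functional elements from the initial mining were de-duplicated by removing elements with nucleic acid similarity >75%, finally yielding 99,438 phage functional elements (Table 1); the proportion of each class of functional element on phage genomes was then computed to obtain data features.
Table 1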
Figure PCTCN2018110636-appb-000001
Figure PCTCN2018110636-appb-000002
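In embodiments of the present invention, only four classes of functional elements (hypothetical protein elements, infection elements, assembly elements and unknown conserved elements) need to be aligned by clustering to the bacterial whole-genome sequence to obtain the regions where these elements cluster, which are taken as suspected phage regions. These four classes are the optimal elements identified by screening the functional elements against host genomes known to contain prophages and comparing the 11 classes above. The rationale is as follows: infection elements (INF) cover functions related to structural proteins such as the phage tail and tail sheath and, together with assembly elements (ASB), are essential for phage survival; hypothetical protein elements (HYP), which account for the largest proportion, and unknown conserved elements (UNS) are related to phage structure and locate prophages more accurately.
This embodiment is based on the more than 2,000 well-annotated whole genomes in the existing phage collection of the NCBI-Refseq database. Using phage functional protein annotation, about 100,000 functional elements were constructed, and the proportions of the functional elements and the composition of phage genomes were analyzed statistically to obtain phage genomic features. Four optimal classes of functional elements were selected from them, and functional element clustering is used to identify suspected phage regions. Because the functional elements are extracted on the basis of the special mosaic structure of phages, the range of functional genes is extended and the accuracy of fuzzy matching is improved. The clustering of functional elements and their functional proportions provide reliable, genome-wide evidence for the presence of a prophage. Fuzzy matching of functional elements compensates for the current scarcity of phage databases and makes the fullest use of known phage data, and the choice of optimal elements avoids a bias toward lytic phages. The best-performing functional elements are the infection elements, which can filter out most negative results in the initial screening.
In embodiments of the present invention, a region where functional elements show "clustering" is a region of the bacterial whole-genome sequence in which functional elements occur in concentration, consistent with the special mosaic structure of phages. Such a region is called a "suspected phage region", meaning that a phage is highly likely to be present there.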
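The clustering idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: it assumes the alignment hits of functional elements are already available as `(start, end, category)` tuples, and the function name `find_suspected_regions` and the thresholds `max_gap` and `min_hits` are hypothetical choices that the source does not specify.

```python
from typing import List, Tuple

Hit = Tuple[int, int, str]  # (start, end, element category such as "INF")

# The four element classes named in the text.
CORE_CATEGORIES = {"HYP", "INF", "ASB", "UNS"}


def find_suspected_regions(hits: List[Hit], max_gap: int = 5000,
                           min_hits: int = 5) -> List[Tuple[int, int, int]]:
    """Merge nearby functional-element hits into candidate regions.

    Returns (region_start, region_end, hit_count) for every cluster holding
    at least `min_hits` hits. Both thresholds are illustrative assumptions.
    """
    core = sorted(h for h in hits if h[2] in CORE_CATEGORIES)
    regions = []
    current = None  # [start, end, count] of the cluster being built
    for start, end, _ in core:
        if current is not None and start - current[1] <= max_gap:
            # Close enough to the running cluster: extend it.
            current[1] = max(current[1], end)
            current[2] += 1
        else:
            # Too far away: close the previous cluster and start a new one.
            if current is not None and current[2] >= min_hits:
                regions.append(tuple(current))
            current = [start, end, 1]
    if current is not None and current[2] >= min_hits:
        regions.append(tuple(current))
    return regions
```

Hits from other categories (for example LYS) are ignored here; passing a larger category set would correspond to the optional use of additional element classes.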
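In a preferred embodiment of the present invention, in addition to the four classes of hypothetical protein, infection, assembly and unknown conserved elements, one or more of lysis elements, integration elements, replication elements, regulatory elements, packaging elements, immune evasion elements and tRNA elements may also be selected for clustering alignment against the bacterial whole-genome sequence. Selecting additional functional elements can improve the accuracy of the clustering alignment to some extent.
S103: Search for repeat sequences at both ends of the suspected phage region to obtain the temperate phage integration sites, thereby determining the sequence region of the temperate phage.
As shown in FIG. 2, according to the mechanism by which a phage integrates into the host genome, a temperate phage infecting a host uses an integrase and integration sites to insert into the host genome and form a prophage. In the figure, POP' denotes the phage integration site and BOB' the host integration site. On infection the phage circularizes, POP' and BOB' then undergo homologous recombination, and the phage integrates into the host genome to form a prophage. The repeat sequences at the two ends of the prophage, BOP' and POB', are the integration sites at its two ends and appear as repeats that are mostly 10 to 200 bp long. By searching for repeat sequences at both ends of the suspected phage region, the temperate phage integration sites, that is, the precise boundaries of the prophage, are therefore obtained, and the region between the two boundaries is the sequence region of the temperate phage.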
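A sliding-window search for the flanking repeats might look like the sketch below. It is an assumption-laden illustration rather than the script used in the source: it looks only for exact direct repeats, the function name `find_flanking_repeat` and the `flank` window size are hypothetical, and only the 10 to 200 bp length range is taken from the text.

```python
from typing import Optional, Tuple


def find_flanking_repeat(genome: str, region_start: int, region_end: int,
                         flank: int = 500, min_len: int = 10,
                         max_len: int = 200) -> Optional[Tuple[int, int, int]]:
    """Find the longest exact direct repeat shared by both ends of a region.

    Each end of the suspected region is searched within +/- `flank` bp.
    Returns (left_pos, right_pos, length) in 0-based genome coordinates,
    or None when no repeat of at least `min_len` bp exists.
    """
    left_lo = max(0, region_start - flank)
    left = genome[left_lo:region_start + flank]
    right_lo = max(0, region_end - flank)
    right = genome[right_lo:region_end + flank]
    for k in range(min(max_len, len(left), len(right)), min_len - 1, -1):
        # Index every k-mer of the left window by its first position.
        seen = {}
        for i in range(len(left) - k + 1):
            seen.setdefault(left[i:i + k], i)
        # Slide a window of the same size along the right window.
        for j in range(len(right) - k + 1):
            i = seen.get(right[j:j + k])
            # Require the two copies not to overlap on the genome.
            if i is not None and left_lo + i + k <= right_lo + j:
                return left_lo + i, right_lo + j, k
    return None
```

The two returned positions can then be used as refined prophage boundaries. Real integration sites may be imperfect repeats, which this exact-match sketch would miss.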
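In a preferred embodiment of the present invention, before proceeding to step S104, relatively conserved sequences of the host genus/species are first aligned against the sequence region of the temperate phage to determine whether a relatively conserved sequence of the host genus/species has been inserted into the sequence region.
In embodiments of the present invention, "relatively conserved sequences of the host genus/species" are sequences that exist as conserved sequences within the genus and species to which the host bacterium of the temperate phage belongs. For example, if the host of the temperate phage is Escherichia coli, the term refers to sequences that are conserved across E. coli.
In one embodiment of the present invention, multiple sequence alignment software is used to build the relatively conserved sequences of the host genus/species from representative whole-genome data of that genus/species. For example, for an E. coli host, representative genome data are obtained using existing information on the genus and species of the host, and a representative genome set is built with multiple sequence alignment software (for example Mugsy) to obtain the conserved sequences of the representative E. coli genomes, that is, the relatively conserved sequences of the host genus/species. These conserved sequences are then aligned against the sequence region of the temperate phage, or against the suspected phage region; sequence regions into which a conserved host sequence has been inserted are identified and discarded. This effectively removes false positives from the predictions and ensures the accuracy of temperate phage mining and prediction.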
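Once conserved host sequences have been aligned to the genome, the filtering decision itself reduces to interval arithmetic. The sketch below assumes the alignment has already produced conserved-sequence hits as half-open `(start, end)` intervals; the names `conserved_insertions` and `is_disrupted` and the `min_len` cutoff are hypothetical, since the source only says that parameters are set for this purpose.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) genome coordinates


def conserved_insertions(region: Interval, conserved_hits: Iterable[Interval],
                         min_len: int = 500) -> List[Interval]:
    """Return the stretches of conserved host sequence lying inside `region`.

    Only overlaps of at least `min_len` bp are reported (assumed cutoff).
    """
    r_start, r_end = region
    found = []
    for h_start, h_end in conserved_hits:
        lo, hi = max(r_start, h_start), min(r_end, h_end)
        if hi - lo >= min_len:
            found.append((lo, hi))
    return found


def is_disrupted(region: Interval, conserved_hits: Iterable[Interval],
                 min_len: int = 500) -> bool:
    """True when the candidate prophage contains conserved host sequence."""
    return bool(conserved_insertions(region, conserved_hits, min_len))
```

A candidate for which `is_disrupted` returns True would be treated as a likely false positive and dropped before the genomic feature analysis.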
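S104: Perform genomic feature analysis on the sequence region of the temperate phage as a phage genome sequence, and output the sequence information of the functional phage.
In a preferred embodiment of the present invention, "genomic feature analysis" includes GC content and/or K-mer frequency feature analysis, and optionally analysis of phage viral features such as annotation of proteins of unknown function. Further, in a preferred embodiment, the genomic feature analysis also includes a comparative analysis of host genome features and phage genome features, for example an analysis of the transcription direction difference between the prophage and the host bacterium.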
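GC content and k-mer frequencies are simple to compute directly from a sequence string. The following is a minimal sketch; the function names and the default `k = 4` are assumptions, as the source does not state which k is used.

```python
from collections import Counter
from typing import Dict

VALID_BASES = set("ACGT")


def gc_content(seq: str) -> float:
    """GC percentage of a sequence, ignoring ambiguous bases such as N."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def kmer_frequencies(seq: str, k: int = 4) -> Dict[str, float]:
    """Relative frequency of every k-mer made only of A, C, G and T."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= VALID_BASES
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}
```

Computing both profiles for the candidate region and for the rest of the host genome gives the two vectors that a comparative analysis would contrast.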
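The method of this embodiment makes full use of existing temperate phage databases to derive their genomic data features. It performs a population-level multiple sequence alignment of a large number of whole genomes from bacteria of the same genus as the host to obtain sequences that are conserved in the host relative to the phage, uses these conserved sequences to remove false positives and, taking the phage assembly mechanism into account, sets parameters to remove results whose integrity has been destroyed by insertion of conserved host sequence. By considering the biological characteristics of phages as a whole, it accurately determines whether a suspected phage region is a functional phage. The method directly outputs the sequence information of the functional phage and, optionally, a comparison between the host genome features and the predicted phage genome features, ensuring the accuracy of the prediction.
With the method of this embodiment, predictions agreed with the true results in more than 95% of cases on the constructed test dataset. Three functional phages were accurately predicted from the whole-genome data of 267 E. coli strains; the predictions were verified experimentally, and sequencing confirmed the accuracy of the output sequence information. The predicted temperate phages will help expand phage databases, and predicting temperate phages for specific host bacteria offers a new option for phage therapy targeting those hosts.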
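Corresponding to the method of the embodiments for mining temperate phages from a bacterial whole-genome sequence, the present invention also provides a device for mining temperate phages from a bacterial whole-genome sequence. As shown in FIG. 3, the device comprises: a sequence acquisition unit 301, configured to obtain a bacterial whole-genome sequence; a suspected region determination unit 302, configured to align functional elements of temperate phages to the bacterial whole-genome sequence by clustering and to take a region in which the functional elements cluster as a suspected phage region, wherein the functional elements include hypothetical protein elements, infection elements, assembly elements and unknown conserved elements; an integration site determination unit 303, configured to search for repeat sequences at both ends of the suspected phage region to obtain the temperate phage integration sites, thereby determining the sequence region of the temperate phage; and a genomic feature analysis unit 304, configured to perform genomic feature analysis on the sequence region of the temperate phage as a phage genome sequence and to output the sequence information of the functional phage.
In a preferred embodiment of the present invention, the device also comprises a conserved sequence judgment unit 305, configured to align, before the genomic feature analysis, relatively conserved sequences of the host genus/species against the sequence region of the temperate phage to determine whether a relatively conserved sequence of the host genus/species has been inserted into the sequence region.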
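In a preferred embodiment of the present invention, the functional elements are extracted, on the basis of the mosaic structure of phages, from whole-genome data with phage annotation in a phage genome database, where the mosaic structure means that genes expressing similar functions on a phage genome tend to be located next to each other.
In a preferred embodiment of the present invention, the functional elements also include one or more of lysis elements, integration elements, replication elements, regulatory elements, packaging elements, immune evasion elements and tRNA elements.
In a preferred embodiment of the present invention, the genomic feature analysis includes GC content and/or K-mer frequency feature analysis. In a preferred embodiment, the genomic feature analysis also includes a comparative analysis of host genome features and phage genome features, for example an analysis of the protein length difference or of the transcription direction difference between the prophage and the host bacterium.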
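Those skilled in the art will understand that all or some of the functions of the methods in the above embodiments may be implemented in hardware or by a computer program. When implemented by a computer program, the program may be stored in a computer-readable storage medium, which may include read-only memory, random-access memory, a magnetic disk, an optical disc, a hard disk and the like, and the functions are realized when a computer executes the program. For example, the program is stored in the memory of a device, and all or some of the above functions are realized when a processor executes the program in the memory. Alternatively, the program may be stored on a storage medium such as a server, another computer, a magnetic disk, an optical disc, a flash drive or a removable hard disk, and downloaded or copied into the memory of a local device, or used to update the system of the local device; all or some of the functions of the above embodiments are realized when a processor executes the program in the memory.
Accordingly, one embodiment of the present invention provides a computer-readable storage medium comprising a program that can be executed by a processor to implement the method of the embodiments for mining temperate phages from a bacterial whole-genome sequence.
The technical solution of the present invention is described in detail below through examples. It should be understood that the examples are illustrative only and are not to be construed as limiting the scope of protection of the invention.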
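Examples
1. Construction of the phage functional element library:
Because of the mosaic structure characteristic of phages, in which adjacent genes express the same function, the concept of functional elements was used to mine phage functional elements and build an element library of about 100,000 functional elements. The corresponding functional elements were extracted from the genome annotation of 2,101 phages in the NCBI-Refseq database as of January 2018 and classified by function into 11 main categories: LYS (lysis elements), INT (integration elements), REP (replication elements), REG (regulatory elements), PAC (packaging elements), ASB (assembly elements), INF (infection elements), EVA (immune evasion elements), HYP (hypothetical protein elements), UNS (unsorted; unknown conserved elements), and tRNA elements (transfer elements). All functional elements from the initial mining were de-duplicated by removing elements with nucleic acid similarity >75%, finally yielding 99,438 phage functional elements (Table 1); the proportion of each class of functional element on phage genomes was then computed to obtain data features.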
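2. Screening of the phage functional element library:
The functional elements were screened against host genomes known to contain prophages, and comparison of the 11 classes identified four optimal classes: infection elements (INF), assembly elements (ASB), hypothetical protein elements (HYP) and unknown conserved elements (UNS). The rationale is as follows: infection elements (INF) cover functions related to structural proteins such as the phage head and tail and, together with assembly elements (ASB), are essential for phage survival; hypothetical protein elements (HYP), which account for the largest proportion, and unknown conserved elements (UNS) are related to phage structure and locate prophages more accurately.
Using the clustering of functional elements, the suspected region containing a phage can be clearly identified. FIG. 4 shows the clustering of the four classes of functional elements (INF, HYP, ASB, UNS) on the genome of a host bacterium known to contain a prophage (NC_000907, Haemophilus influenzae Rd KW20); the suspected phage region obtained coincides with the region where the phage actually lies (the annotated phage). This indicates that the suspected phage regions obtained by the method of the present invention are reliable.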
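3. Mining of the integration sites at both ends of the suspected phage region:
A sliding-window script written in Python was used to mine the integration sites at both ends of the suspected phage region. The biological basis is as follows: when a temperate phage infects a host bacterium, it circularizes and encodes an integrase, which recognizes the integration site on the host; under the action of the integrase the phage integrates into the host genome by homologous recombination. New integration sites are thereby formed at the two ends of the resulting prophage (the state in which the temperate phage is inserted into the host genome), as shown in FIG. 2. On the basis of this biological property, the repeat sequences at the two ends are used to pinpoint the exact positions of the two ends of the prophage.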
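4. Multiple sequence alignment analysis of the host population:
Representative genome data were obtained using existing information on the genus and species of the host, and a representative genome set was built with multiple sequence alignment software (for example Mugsy). The relatively conserved sequences of the host genus/species were derived from the analysis results. Aligning these conserved sequences against the suspected phage regions effectively screens out false positives among the predictions.
For example, as shown in FIG. 5, E. coli ATCC25922, which prediction software such as MetaPhinder (software A in the figure) and Phaster (software B in the figure) judged to contain a prophage, has a λ phage integrated in the predicted region (that is, the suspected phage region). Alignment with conserved E. coli sequences, however, showed that a conserved sequence derived from E. coli MG1655 had been inserted into the suspected phage region, destroying its functional integrity, and induction experiments confirmed that no functional phage was present. This is therefore a false positive, not a functional phage, and must be excluded from the final mining and prediction results.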
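5. Genomic feature analysis of suspected phages:
Based on temperate phages, the relevant genomic features were analyzed and summarized to obtain the corresponding data features. For example, FIG. 6 shows the results of the GC% content analysis of 1,031 phage genomes: phage genomes fall relatively specifically into three value ranges (26%-32%, 52%-58%, 69%-71%), whereas the GC content of host bacteria is distributed largely between 30% and 60%, a fairly clear difference.
Taking Pseudomonas aeruginosa (CP011317.1, Pseudomonas aeruginosa strain Carb01 63) as an example, FIG. 7 shows the protein length difference analysis between the prophage and its host. The horizontal axis is the protein length interval in amino acids and the vertical axis is the proportion of functional protein genes; each bar is the number of expressed genes in that length interval divided by the total number of expressed genes. The protein lengths of the confirmed prophage differ clearly from those of the host: most prophage proteins lie in the 50-200 interval, accounting for more than 50%. FIG. 8 shows the mean protein length analysis: the mean prophage protein length is 206 amino acids versus 316 amino acids for the host, so prophage proteins are generally shorter than host proteins.
For the same Pseudomonas aeruginosa strain, FIG. 9 shows the transcription direction difference analysis between the prophage and the host. The bars show the genes transcribed consecutively in the same direction as a proportion of all genes in the genome: 46.6% for the prophage, meaning that most consecutive genes are encoded on the same transcribed strand, but only 0.8% for the host, where only 56 of 7,146 genes are consecutive same-direction genes. Based on these three differential features, the suspected phage region determined in the previous step is judged by feature matching as to whether it is a functional temperate phage.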
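The protein length and transcription direction features can be illustrated as follows. This is a sketch under explicit assumptions: the source does not define how "consecutive same-direction genes" are counted, so the run-based definition and the `min_run` threshold below are hypothetical, as are the function names. Only the 50-200 amino acid interval comes from the text.

```python
from itertools import groupby
from typing import Sequence


def fraction_in_length_range(protein_lengths: Sequence[int],
                             lo: int = 50, hi: int = 200) -> float:
    """Share of proteins whose length (in amino acids) lies in [lo, hi]."""
    if not protein_lengths:
        return 0.0
    in_range = sum(lo <= n <= hi for n in protein_lengths)
    return in_range / len(protein_lengths)


def codirectional_fraction(strands: Sequence[str], min_run: int = 5) -> float:
    """Share of genes lying in runs of >= `min_run` same-strand genes.

    `strands` lists the strand ("+" or "-") of each gene in genome order.
    The run-based definition is an assumption, not the source's definition.
    """
    if not strands:
        return 0.0
    in_runs = 0
    for _, group in groupby(strands):
        run = len(list(group))
        if run >= min_run:
            in_runs += run
    return in_runs / len(strands)
```

Applying both functions to the genes of a candidate region and to the genes of the surrounding host genome yields paired values of the kind compared in FIG. 7 and FIG. 9.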
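6. Prediction and mining results:
In this example, among 267 E. coli host strains, 3 were predicted to harbor functional temperate phages; this was supported by biological reasoning and verified by induction experiments, in agreement with the predictions.
The present invention has been described above using specific examples, which serve only to aid understanding and are not intended to limit the invention. Those skilled in the art to which the invention belongs may make a number of simple deductions, variations or substitutions based on the idea of the invention.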

Claims (17)

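  1. A method for mining temperate phages from a bacterial whole-genome sequence, characterized in that the method comprises:
    obtaining a bacterial whole-genome sequence;
    aligning functional elements of temperate phages to the bacterial whole-genome sequence by clustering, and taking a region in which the functional elements cluster as a suspected phage region, wherein the functional elements include hypothetical protein elements, infection elements, assembly elements and unknown conserved elements;
    searching for repeat sequences at both ends of the suspected phage region to obtain the temperate phage integration sites, thereby determining the sequence region of the temperate phage;
    performing genomic feature analysis on the sequence region of the temperate phage as a phage genome sequence, and outputting the sequence information of the functional phage.
  2. The method according to claim 1, characterized in that the method further comprises:
    before the genomic feature analysis, aligning relatively conserved sequences of the host genus/species against the sequence region of the temperate phage to determine whether a relatively conserved sequence of the host genus/species has been inserted into the sequence region.
  3. The method according to claim 1, characterized in that the functional elements are extracted, on the basis of the mosaic structure of phages, from whole-genome data with phage annotation in a phage genome database, wherein the mosaic structure means that genes expressing similar functions on a phage genome tend to be located next to each other.
  4. The method according to claim 1, characterized in that the functional elements further include one or more of lysis elements, integration elements, replication elements, regulatory elements, packaging elements, immune evasion elements and tRNA elements.
  5. The method according to claim 1, characterized in that the genomic feature analysis includes GC content and/or K-mer frequency feature analysis.
  6. The method according to claim 5, characterized in that the genomic feature analysis further includes a comparative analysis of host genome features and phage genome features.
  7. The method according to claim 6, characterized in that the comparative analysis is an analysis of the protein length difference between the prophage and the host bacterium.
  8. The method according to claim 6, characterized in that the comparative analysis is an analysis of the transcription direction difference between the prophage and the host bacterium.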
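  9. A device for mining temperate phages from a bacterial whole-genome sequence, characterized in that the device comprises:
    a sequence acquisition unit, configured to obtain a bacterial whole-genome sequence;
    a suspected region determination unit, configured to align functional elements of temperate phages to the bacterial whole-genome sequence by clustering and to take a region in which the functional elements cluster as a suspected phage region, wherein the functional elements include hypothetical protein elements, infection elements, assembly elements and unknown conserved elements;
    an integration site determination unit, configured to search for repeat sequences at both ends of the suspected phage region to obtain the temperate phage integration sites, thereby determining the sequence region of the temperate phage;
    a genomic feature analysis unit, configured to perform genomic feature analysis on the sequence region of the temperate phage as a phage genome sequence and to output the sequence information of the functional phage.
  10. The device according to claim 9, characterized in that the device further comprises:
    a conserved sequence judgment unit, configured to align, before the genomic feature analysis, relatively conserved sequences of the host genus/species against the sequence region of the temperate phage to determine whether a relatively conserved sequence of the host genus/species has been inserted into the sequence region.
  11. The device according to claim 9, characterized in that the functional elements are extracted, on the basis of the mosaic structure of phages, from whole-genome data with phage annotation in a phage genome database, wherein the mosaic structure means that genes expressing similar functions on a phage genome tend to be located next to each other.
  12. The device according to claim 9, characterized in that the functional elements further include one or more of lysis elements, integration elements, replication elements, regulatory elements, packaging elements, immune evasion elements and tRNA elements.
  13. The device according to claim 9, characterized in that the genomic feature analysis includes GC content and/or K-mer frequency feature analysis.
  14. The device according to claim 13, characterized in that the genomic feature analysis further includes a comparative analysis of host genome features and phage genome features.
  15. The device according to claim 14, characterized in that the comparative analysis is an analysis of the protein length difference between the prophage and the host bacterium.
  16. The device according to claim 14, characterized in that the comparative analysis is an analysis of the transcription direction difference between the prophage and the host bacterium.
  17. A computer-readable storage medium, characterized in that it comprises a program that can be executed by a processor to implement the method according to any one of claims 1-8.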
PCT/CN2018/110636 2018-10-17 2018-10-17 Method, device and storage medium for mining temperate phages from bacterial whole-genome sequences WO2020077559A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
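PCT/CN2018/110636 WO2020077559A1 (zh) 2018-10-17 2018-10-17 Method, device and storage medium for mining temperate phages from bacterial whole-genome sequences
CN201880098544.2A CN112823206B (zh) 2018-10-17 2018-10-17 Method, device and storage medium for mining temperate phages from bacterial whole-genome sequences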

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/110636 WO2020077559A1 (zh) 2018-10-17 2018-10-17 Method, device and storage medium for mining temperate phages from bacterial whole-genome sequences

Publications (1)

Publication Number Publication Date
WO2020077559A1 (zh) 2020-04-23

Family

ID=70283327

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/110636 WO2020077559A1 (zh) 2018-10-17 2018-10-17 Method, device and storage medium for mining temperate phages from bacterial whole-genome sequences

Country Status (2)

Country Link
CN (1) CN112823206B (zh)
WO (1) WO2020077559A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
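CN113658641A (zh) * 2021-07-20 2021-11-16 Peking University Phage classification method, apparatus, device and storage medium
CN115198036B (zh) * 2022-09-13 2022-12-30 Jiangsu Environmental Engineering Technology Co., Ltd. Phage identification and host prediction method based on nanopore and high-throughput sequencing data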

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
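CN108220249A (zh) * 2016-12-12 2018-06-29 Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine Long-tailed phage, method for obtaining it, and use thereof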

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100055669A1 (en) * 2004-07-06 2010-03-04 Mixis France, S.A. Generation of Recombinant Genes in Bacteriophages
FR2910492B1 (fr) * 2006-12-20 2013-02-15 Bio Modeling Systems Ou Bmsystems Procede de preparation de bacteriophages modifies par insertion de sequences aleatoires dans les proteines de ciblage desdits bacteriophages
FR2945049B1 (fr) * 2009-04-30 2013-10-04 Pherecydes Pharma Modification du genome d'un bacteriophage lytique par immobilisation dudit bacteriophage dans sa bacterie hote

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BONNIE, L. H. ET AL.: "Phage hunters: computational strategies for finding phages in large- scale omics datasets", VIRUS RESEARCH, vol. 244, January 2018 (2018-01-01), pages 110 - 115, XP055704139 *

Also Published As

Publication number Publication date
CN112823206A (zh) 2021-05-18
CN112823206B (zh) 2024-04-16

Similar Documents

Publication Publication Date Title
Coelho et al. Towards the biogeography of prokaryotic genes
Dutilh et al. A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes
Silva et al. SUPER-FOCUS: a tool for agile functional analysis of shotgun metagenomic data
Jurtz et al. MetaPhinder—identifying bacteriophage sequences in metagenomic data sets
Yahara et al. Long-read metagenomics using PromethION uncovers oral bacteriophages and their interaction with host bacteria
Deboutte et al. Honey-bee–associated prokaryotic viral communities reveal wide viral diversity and a profound metabolic coding potential
Miller et al. A draft genome sequence for the Ixodes scapularis cell line, ISE6
Boratto et al. Yaravirus: A novel 80-nm virus infecting Acanthamoeba castellanii
Wu et al. DeePhage: distinguishing virulent and temperate phage-derived sequences in metavirome data with a deep learning approach
Le Doujet et al. Closely-related Photobacterium strains comprise the majority of bacteria in the gut of migrating Atlantic cod (Gadus morhua)
Nasko et al. CRISPR spacers indicate preferential matching of specific virioplankton genes
Kahlke et al. Unique core genomes of the bacterial family vibrionaceae: insights into niche adaptation and speciation
Yin et al. On the origin of microbial ORFans: quantifying the strength of the evidence for viral lateral transfer
Li et al. Metagenomic analysis reveals unexplored diversity of archaeal virome in the human gut
Bzhalava et al. Extension of the viral ecology in humans using viral profile hidden Markov models
WO2020077559A1 (zh) Method, device and storage medium for mining temperate phages from bacterial whole-genome sequences
Gauthier et al. DEPhT: a novel approach for efficient prophage discovery and precise extraction
Roach et al. Hecatomb: an end-to-end research platform for viral metagenomics
Xu et al. A chromosome-level genome assembly of the red drum, Sciaenops ocellatus
Du et al. Highly host-linked viromes in the built environment possess habitat-dependent diversity and functions for potential virus-host coevolution
Scott et al. Genes and regulatory mechanisms associated with experimentally-induced bovine respiratory disease identified using supervised machine learning methodology
Pfennig et al. MgCod: Gene Prediction in Phage Genomes with Multiple Genetic Codes
Bai et al. RNA-seq of HaHV-1-infected abalones reveals a common transcriptional signature of Malacoherpesviruses
Coutinho et al. RaFAH: A superior method for virus-host prediction
Papudeshi et al. Host interactions of novel Crassvirales species belonging to multiple families infecting bacterial host, Bacteroides cellulosilyticus WH2

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18937035

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18937035

Country of ref document: EP

Kind code of ref document: A1