CN116525009A

CN116525009A - Method and device for determining tumor neoepitope from microorganism

Info

Publication number: CN116525009A
Application number: CN202310487903.6A
Authority: CN
Inventors: 于建东; 蔡毅骅; 陈庚; 李航文
Original assignee: Siwei Shanghai Biotechnology Co ltd
Current assignee: Siwei Shanghai Biotechnology Co ltd
Priority date: 2023-05-04
Filing date: 2023-05-04
Publication date: 2023-08-01

Abstract

The present application relates to a method, apparatus, computing device, computer-readable storage medium and computer program product for determining a tumor neoepitope of microbial origin. Based on known microbial protein sequences or microbial protein sequences predicted from metagenomes, protein fragments recognizable by the host immune system can be predicted and screened, enabling rapid, efficient and highly accurate determination of neoepitopes.

Description

Method and device for determining tumor neoepitope from microorganism

Technical Field

The present disclosure relates to the field of bioinformatics and tumor immunotherapy, and more particularly to methods, apparatus, computing devices, computer readable storage media and computer program products for determining tumor neoepitopes of microbial origin.

Background

Tumor-specific antigens, also known as tumor neoantigens, are antigens produced only in tumor cells. They can bind to human leukocyte antigen (HLA), be recognized by CD4+ and CD8+ T cells, and activate the body's anti-tumor immune response (Zhang, Z., et al., Neoantigen: A New Breakthrough in Tumor Immunotherapy. Front Immunol, 2021. 12: p. 672356). Neoantigens arise from many sources, including single nucleotide variations (SNVs), insertions/deletions (INDELs), transcript splice variations, gene fusions, and the like. Because neoantigens are absent from normal tissue cells, they bypass central tolerance and avoid off-target damage to non-tumor tissues. They have therefore become new targets for tumor immunotherapy, are well suited to the construction of cancer vaccines, and have broad therapeutic prospects and clinical application value.

Studies have shown that microorganisms such as bacteria are present in and invade tumors, and that protein fragments of bacteria that have invaded tumor cells can be presented on the surface of the tumor cells and recognized by the immune system, thereby activating immune cells and enhancing their recognition and killing of the tumor cells. Kalaora S, et al. (Identification of bacteria-derived HLA-bound peptides in melanoma. Nature. 2021 Apr;592:138-143) proposed that bacterial peptides presented on tumor cells can serve as potential targets for immunotherapy, pointing toward the mechanisms by which bacteria influence immune activation and therapeutic response. The team identified intratumoral bacteria in melanoma and obtained their genomic profiles, then used 16S rRNA gene sequencing and HLA peptidomics to identify bacterial peptide sequences recognizable by the immune system, and ultimately identified nearly 300 peptides from 41 different bacteria presented by HLA protein complexes on the melanoma cell surface. Many of the bacterial peptides are shared among different metastases of the same patient or among tumors of different patients, and therefore have a strong capacity to induce immune activation.

Metagenomic sequencing based on high-throughput sequencing (next-generation sequencing, NGS) can accurately identify microorganisms at the species level and predict the genes in microbial genomes and the proteins they express, which helps to determine tumor neoepitopes derived from microorganisms.

Disclosure of Invention

A first aspect of the present disclosure proposes a method of determining a tumor neoepitope of microbial origin, the method comprising: obtaining metagenome sequencing data, wherein the metagenome sequencing data comprises sequencing data obtained by high-throughput sequencing of bacterial DNA in tumor-associated samples and non-tumor-associated samples; performing metagenome assembly based on the metagenome sequencing data to obtain an assembled genome sequence, and predicting coding genes in the genome based on the assembled genome sequence; determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on the predicted coding genes and/or the metagenome sequencing data; determining bacteria that are significantly enriched in the tumor-associated samples; obtaining a known-species determination result indicating whether the significantly enriched bacteria are of a known species; determining, based on the significantly enriched bacteria, the genome or protein sequence of bacteria of a known species from a known genome database, or determining, based on the significantly enriched bacteria, the protein sequence of bacteria of an unknown species from the predicted coding genes; predicting the binding affinity to MHC of peptide fragments in the protein sequence of the bacteria of the known species or of the bacteria of the unknown species, and screening for those peptide fragments that can bind to MHC, thereby determining MHC-binding peptide fragments; and determining the immunogenicity, host similarity and MHC typing number of the MHC-binding peptide fragments, and screening the peptide fragments based on the immunogenicity, host similarity and MHC typing number, thereby determining tumor neoepitopes.
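By way of illustration only, the step of obtaining candidate peptide fragments from a bacterial protein sequence before affinity prediction might be sketched as follows. The function name `enumerate_peptides` and the window lengths of 8 to 11 residues are assumptions made for this sketch (such lengths are commonly used for MHC class I peptides); the disclosure itself does not specify peptide lengths.

```python
def enumerate_peptides(protein, lengths=(8, 9, 10, 11)):
    """List every distinct contiguous peptide of the given lengths.

    Assumption: window lengths 8-11 are illustrative and are not taken
    from the disclosure.
    """
    seen = set()
    peptides = []
    for k in lengths:
        # Slide a window of width k along the protein sequence.
        for start in range(len(protein) - k + 1):
            pep = protein[start:start + k]
            if pep not in seen:
                seen.add(pep)
                peptides.append(pep)
    return peptides


if __name__ == "__main__":
    # Hypothetical 10-residue sequence used only to exercise the function.
    for pep in enumerate_peptides("ACDEFGHIKL"):
        print(pep)
```

Each resulting peptide would then be passed to an MHC binding affinity predictor of the practitioner's choice.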

Optionally, in one embodiment of the above aspect, the tumor-associated sample is a donor tumor tissue sample or a tumor patient stool sample.

Optionally, in one embodiment of the above aspect, the non-tumor associated sample is a donor paracancerous tissue sample, a normal tissue sample, or a stool sample of a healthy population.

Optionally, in one embodiment of the above aspect, the method further comprises: before performing metagenome assembly based on the metagenome sequencing data to obtain an assembled genome sequence and predicting coding genes in the genome based on the assembled genome sequence, performing quality control on the metagenome sequencing data to obtain quality-controlled metagenome sequencing data.

Optionally, in one embodiment of the above aspect, the quality control criteria are: the terminal base quality is greater than Q20, the number of N bases is less than 5, and the sequence length is greater than or equal to 100 bp.
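By way of illustration only, the three quality control criteria might be applied to a single read as in the following sketch. The function name `passes_qc`, the Phred+33 quality encoding, and the reading of "terminal base" as both the first and the last base of the read are assumptions made for this sketch and are not stated in the disclosure.

```python
def passes_qc(seq, qual, min_q=20, max_n=5, min_len=100, offset=33):
    """Return True if a read meets the three stated QC criteria.

    seq  : base calls of the read
    qual : quality string of the same length (assumed Phred+33 encoded)
    """
    if not seq or len(seq) != len(qual):
        return False
    # Criterion: sequence length greater than or equal to 100 bp.
    if len(seq) < min_len:
        return False
    # Criterion: number of N bases less than 5.
    if seq.upper().count("N") >= max_n:
        return False
    # Criterion: terminal base quality greater than Q20.
    # Assumption: both the 5' and 3' terminal bases are checked.
    ends = (ord(qual[0]) - offset, ord(qual[-1]) - offset)
    return all(q > min_q for q in ends)


if __name__ == "__main__":
    print(passes_qc("A" * 100, "I" * 100))  # 'I' encodes Q40 under Phred+33
```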

Optionally, in one embodiment of the above aspect, determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on the predicted coding genes and/or the metagenome sequencing data comprises: determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on the predicted coding genes; or determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples using species annotation software based on the metagenome sequencing data; or performing both of the foregoing determinations, selecting the bacterial species contained in both sets of results as the bacterial species in the tumor-associated and non-tumor-associated samples, and determining their corresponding bacterial abundances.
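By way of illustration only, the third option (keeping only species reported by both approaches) might be sketched as a set intersection. The function name `consensus_species` and the example abundances are hypothetical. Because the disclosure does not state how the two abundance estimates are combined, the sketch simply returns both values for each shared species.

```python
def consensus_species(gene_based, annotation_based):
    """Keep species found by both approaches.

    Both arguments map species name -> abundance. The result maps each
    shared species to a pair (gene-based abundance, annotation-based
    abundance); how to merge the pair is left open (assumption).
    """
    shared = gene_based.keys() & annotation_based.keys()
    return {sp: (gene_based[sp], annotation_based[sp]) for sp in sorted(shared)}


if __name__ == "__main__":
    # Hypothetical abundances for demonstration only.
    by_genes = {"Fusobacterium nucleatum": 0.12, "Escherichia coli": 0.30}
    by_software = {"Fusobacterium nucleatum": 0.10, "Bacteroides fragilis": 0.05}
    print(consensus_species(by_genes, by_software))
```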

Optionally, in one embodiment of the above aspect, the length of the assembled genomic sequence is greater than or equal to 90 bp.

Optionally, in one embodiment of the above aspect, determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on the predicted coding genes comprises: performing sequence alignment of the predicted coding genes against sequences in a known database to predict bacterial species and to determine bacterial abundance at each taxonomic level.

Optionally, in one embodiment of the above aspect, the input sequence for the sequence alignment is the protein sequence translated from the predicted coding gene.

Optionally, in one embodiment of the above aspect, performing metagenome assembly based on the metagenome sequencing data to obtain an assembled genome sequence, and predicting coding genes in the genome based on the assembled genome sequence comprises: performing metagenome assembly based on the quality-controlled metagenome sequencing data to obtain an assembled genome sequence, and predicting coding genes in the genome based on the assembled genome sequence.

Optionally, in one embodiment of the above aspect, determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples using species annotation software based on the metagenome sequencing data comprises: using species annotation software to determine bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on the quality-controlled metagenome sequencing data.

Optionally, in one embodiment of the above aspect, determining the bacteria significantly enriched in the tumor-associated samples comprises: determining the significantly enriched bacteria by a Wilcoxon rank-sum test.

Optionally, in one embodiment of the above aspect, the screening criteria for the significantly enriched bacteria are: the bacterial abundance in the tumor-associated samples is at least 2 times that in the non-tumor-associated samples, and the p-value of the statistical test is less than or equal to 0.05.
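By way of illustration only, the enrichment criteria might be evaluated for one bacterial species as in the following sketch. The function names `rank_sum_p` and `is_enriched` are hypothetical. The sketch assumes a two-sided Wilcoxon rank-sum test computed with the tie-corrected normal approximation, and assumes that the fold change is taken between group mean abundances; the disclosure does not specify either detail. In practice an established statistics library would normally be used for the test.

```python
import math
from statistics import mean


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation with tie correction)."""
    if not x or not y:
        return 1.0
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n1, n2, n = len(x), len(y), len(x) + len(y)
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2          # mean of ranks i+1 .. j for a tie group
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    mu = n1 * (n + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0
    z = (rank_sum_x - mu) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def is_enriched(tumor, control, min_fold=2.0, max_p=0.05):
    """Apply both criteria: abundance ratio >= 2 and p-value <= 0.05."""
    mean_t, mean_c = mean(tumor), mean(control)
    # Assumption: the fold change compares group means; written as a product
    # so that a zero control mean does not cause a division by zero.
    fold_ok = mean_t > 0 and mean_t >= min_fold * mean_c
    return fold_ok and rank_sum_p(tumor, control) <= max_p


if __name__ == "__main__":
    # Hypothetical relative abundances of one species in each group.
    tumor = [0.30, 0.28, 0.35, 0.40, 0.33, 0.31, 0.29, 0.37]
    control = [0.05, 0.04, 0.06, 0.07, 0.03, 0.05, 0.08, 0.02]
    print(is_enriched(tumor, control))
```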

Optionally, in one embodiment of the above aspect, the MHC is a high-frequency HLA in the Chinese population.

Optionally, in one embodiment of the above aspect, the screening criteria for a peptide fragment that can bind to MHC, whether from the protein sequence of a bacterium of a known species or of a bacterium of an unknown species, are: the affinity ranks in the top 0.5%, and the affinity is greater than 0 and less than or equal to 500 nM.
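By way of illustration only, the two binding criteria might be applied to predictor output as in the following sketch. The `Prediction` record, its field names, and the peptide sequences are hypothetical. The sketch assumes that the affinity predictor reports, for each peptide and allele pair, an affinity in nM and a percentile rank in which smaller values indicate stronger binders; the disclosure does not name the predictor.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    peptide: str
    allele: str
    affinity_nm: float    # predicted binding affinity in nM
    rank_percent: float   # percentile rank; assumption: lower is stronger


def is_binder(p, max_rank=0.5, max_affinity_nm=500.0):
    """Top 0.5% by rank, and 0 < affinity <= 500 nM."""
    return p.rank_percent <= max_rank and 0 < p.affinity_nm <= max_affinity_nm


def select_binders(predictions):
    return [p for p in predictions if is_binder(p)]


if __name__ == "__main__":
    # Hypothetical predictions for demonstration only.
    preds = [
        Prediction("ACDEFGHIK", "HLA-A*02:01", 35.0, 0.2),
        Prediction("LMNPQRSTV", "HLA-A*02:01", 820.0, 0.4),
        Prediction("WYACDEFGH", "HLA-A*11:01", 120.0, 1.5),
    ]
    for p in select_binders(preds):
        print(p.peptide, p.allele)
```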

Optionally, in one embodiment of the above aspect, determining the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment comprises: determining the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment and a deep learning model built with a deep neural network.

Optionally, in one embodiment of the above aspect, determining the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment further comprises scoring the peptide fragment with the deep learning model, where a higher score indicates higher immunogenicity.

Optionally, in one embodiment of the above aspect, determining the similarity of the MHC-binding peptide fragment to the host based on the MHC-binding peptide fragment comprises: performing sequence alignment of the MHC-binding peptide fragment against the protein sequences of the host from which the tumor-associated and non-tumor-associated samples are derived, and determining the similarity of the MHC-binding peptide fragment to the host.

Optionally, in one embodiment of the above aspect, determining the similarity of the MHC-binding peptide fragment to the host based on the MHC-binding peptide fragment further comprises introducing, as an output of the sequence alignment, a similarity score between the MHC-binding peptide fragment and the host protein sequences, where a higher score indicates higher similarity to the host.
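By way of illustration only, a host similarity score might be computed as in the following sketch. The disclosure does not name an alignment tool, so the sketch substitutes a simple ungapped sliding-window identity for a full sequence alignment; this is a stand-in technique, and the function name `host_similarity` and the example sequences are hypothetical.

```python
def host_similarity(peptide, host_proteins):
    """Best fraction of identical positions over all ungapped windows.

    Returns a value in [0, 1]; 1.0 means the peptide occurs verbatim in
    at least one host protein. Stand-in for a real aligner (assumption).
    """
    if not peptide:
        return 0.0
    k = len(peptide)
    best = 0.0
    for protein in host_proteins:
        # Compare the peptide with every window of equal length.
        for start in range(len(protein) - k + 1):
            window = protein[start:start + k]
            matches = sum(a == b for a, b in zip(peptide, window))
            best = max(best, matches / k)
    return best


if __name__ == "__main__":
    # Hypothetical host protein fragment for demonstration only.
    print(host_similarity("ACDEFGHIK", ["MMMACDEFGHIRLLL"]))
```

A real pipeline would typically run an established protein aligner against the host proteome and read the score from its output.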

Optionally, in one embodiment of the above aspect, determining the MHC typing number of the MHC-binding peptide fragment based on the MHC-binding peptide fragment comprises: counting the number of all MHC types to which the MHC-binding peptide fragment may bind, thereby determining the MHC typing number of the MHC-binding peptide fragment.

Optionally, in one embodiment of the above aspect, determining the MHC typing number of the MHC-binding peptide fragment based on the MHC-binding peptide fragment further comprises introducing a score for the number of all MHC types to which each peptide fragment may bind, where a higher score indicates a greater MHC typing number for the MHC-binding peptide fragment.
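By way of illustration only, counting the MHC types to which each peptide fragment may bind might be sketched as follows. The function name `mhc_typing_number` and the example pairs are hypothetical; the input is assumed to be the list of (peptide, allele) pairs that already passed the binding screen.

```python
from collections import defaultdict


def mhc_typing_number(binding_pairs):
    """Count the distinct MHC alleles each peptide is predicted to bind.

    binding_pairs: iterable of (peptide, allele) tuples that passed the
    affinity screen. Duplicate pairs are counted once.
    """
    alleles_by_peptide = defaultdict(set)
    for peptide, allele in binding_pairs:
        alleles_by_peptide[peptide].add(allele)
    return {pep: len(alleles) for pep, alleles in alleles_by_peptide.items()}


if __name__ == "__main__":
    # Hypothetical binding results for demonstration only.
    pairs = [
        ("ACDEFGHIK", "HLA-A*02:01"),
        ("ACDEFGHIK", "HLA-A*11:01"),
        ("ACDEFGHIK", "HLA-A*11:01"),
        ("LMNPQRSTV", "HLA-A*24:02"),
    ]
    print(mhc_typing_number(pairs))
```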

Optionally, in one embodiment of the above aspect, screening the peptide fragments based on the immunogenicity, the similarity to the host, and the MHC typing number, thereby determining the tumor neoepitope, comprises: screening for peptide fragments that have a high deep learning model score, a high score for the number of MHC types to which they can bind, and a low score for similarity to the host protein sequences, thereby determining the tumor neoepitope.
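By way of illustration only, the final screening might be sketched as a filter followed by a ranking. The sketch assumes that a higher immunogenicity score, a larger MHC typing number and a lower host similarity score are all preferred. The `Candidate` record, the function name `screen`, and all three thresholds are hypothetical; the disclosure gives no numerical cut-offs for this step.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    peptide: str
    immunogenicity: float     # deep learning model score, higher is better
    host_similarity: float    # similarity to host proteins, lower is better
    mhc_typing_number: int    # number of MHC types bound, higher is better


def screen(candidates, min_immuno=0.5, max_host_sim=0.8, min_types=2):
    """Filter by three illustrative thresholds, then rank the survivors."""
    kept = [
        c for c in candidates
        if c.immunogenicity >= min_immuno
        and c.host_similarity <= max_host_sim
        and c.mhc_typing_number >= min_types
    ]
    # Rank: immunogenicity descending, typing number descending,
    # host similarity ascending.
    return sorted(
        kept,
        key=lambda c: (-c.immunogenicity, -c.mhc_typing_number, c.host_similarity),
    )


if __name__ == "__main__":
    # Hypothetical candidates for demonstration only.
    cands = [
        Candidate("ACDEFGHIK", 0.91, 0.44, 3),
        Candidate("LMNPQRSTV", 0.95, 1.00, 4),
        Candidate("GHIKLMNPQ", 0.70, 0.33, 2),
    ]
    for c in screen(cands):
        print(c.peptide)
```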

A second aspect of the present disclosure provides an apparatus for determining a tumor neoepitope of microbial origin, comprising: a metagenome sequencing data acquisition module configured to acquire metagenome sequencing data, the metagenome sequencing data comprising sequencing data obtained by high-throughput sequencing of bacterial DNA in tumor-associated samples and non-tumor-associated samples; a coding gene prediction module configured to perform metagenome assembly based on the metagenome sequencing data to obtain an assembled genome sequence, and to predict coding genes in the genome based on the assembled genome sequence; a determination module of bacterial species and abundance configured to determine bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on the predicted coding genes and/or the metagenome sequencing data; a determination module of significantly enriched bacteria configured to determine bacteria significantly enriched in the tumor-associated samples; a known-species determination result acquisition module configured to acquire a known-species determination result indicating whether the significantly enriched bacteria are of a known species; a determining module of a genome or protein sequence of a bacterium of a known species configured to determine, based on the significantly enriched bacteria, the genome or protein sequence of the bacteria of the known species from a known genome database; a determining module of a protein sequence of a bacterium of an unknown species configured to determine, based on the significantly enriched bacteria, the protein sequence of the bacteria of the unknown species from the predicted coding genes; a determining module of peptide fragments capable of binding to MHC configured to predict the binding affinity to MHC of peptide fragments in the protein sequence of the bacteria of the known species or of the bacteria of the unknown species, and to screen for those peptide fragments that can bind to MHC, thereby determining MHC-binding peptide fragments; and a tumor neoepitope determining module configured to determine the immunogenicity, host similarity and MHC typing number of the MHC-binding peptide fragments, and to screen the peptide fragments based on the immunogenicity, host similarity and MHC typing number, thereby determining tumor neoepitopes.

Optionally, in one embodiment of the above aspect, the apparatus further comprises a sequencing data quality control module configured to perform quality control on the metagenome sequencing data before metagenome assembly is performed based on the metagenome sequencing data to obtain an assembled genome sequence and coding genes in the genome are predicted based on the assembled genome sequence, thereby obtaining quality-controlled metagenome sequencing data.

Optionally, in one embodiment of the above aspect, the determining module of the bacterial species and abundance comprises: a determination module based on the predicted bacterial species and abundance of the encoding gene configured to determine bacterial species and abundance in the tumor-associated sample and non-tumor-associated sample based on the predicted encoding gene; or a determination module of bacterial species and abundance using species annotation software configured to determine bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data; or a determination module based on the predicted bacterial species and abundance of the encoding gene configured to determine bacterial species and abundance in the tumor-associated sample and non-tumor-associated sample based on the predicted encoding gene; a determination module of bacterial species and abundance using species annotation software configured to determine bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data; and a bacterial species selection and bacterial abundance determination module configured to select bacterial species contained in both the results of determining bacterial species and abundance based on the predicted encoding genes and determining bacterial species and abundance using species annotation software as bacterial species in the determined tumor-associated and non-tumor-associated samples and determine their respective bacterial abundances.

Optionally, in one embodiment of the above aspect, the tumor neoepitope determining module comprises: a determining module of immunogenicity of MHC-binding peptide fragments configured to determine the immunogenicity of the MHC-binding peptide fragments based on the MHC-binding peptide fragments and a deep learning model built with a deep neural network; a determining module of host similarity of MHC-binding peptide fragments configured to perform sequence alignment of the MHC-binding peptide fragments against the protein sequences of the host from which the tumor-associated and non-tumor-associated samples are derived, and to determine the similarity of the MHC-binding peptide fragments to the host; a determining module of MHC typing number of MHC-binding peptide fragments configured to count the number of all MHC types to which the MHC-binding peptide fragments may bind, thereby determining the MHC typing number of the MHC-binding peptide fragments; and a peptide fragment screening and neoepitope determining module configured to screen the peptide fragments based on the immunogenicity, the similarity to the host, and the MHC typing number, thereby determining a tumor neoepitope.

A third aspect of the present disclosure proposes a computing device comprising: a processor; and a memory for storing computer-executable instructions that, when executed, cause the processor to perform the method of determining a tumor neoepitope of microbial origin in the first aspect.

A fourth aspect of the present disclosure proposes a computer-readable storage medium having stored thereon computer-executable instructions for performing the method of determining a tumor neoepitope of microbial origin of the first aspect.

A fifth aspect of the present disclosure proposes a computer program product tangibly stored on a computer-readable storage medium and comprising computer-executable instructions that, when executed, cause at least one processor to perform the method of determining a tumor neoepitope of microbial origin of the first aspect.

A sixth aspect of the present disclosure proposes a protein sequence comprising the sequence shown in any one of SEQ ID NOs 11 to 40.

A seventh aspect of the present disclosure proposes a nucleic acid sequence encoding the protein sequence of the sixth aspect. Optionally, in one embodiment of the above aspect, the nucleic acid sequence is an mRNA sequence.

Drawings

Features, advantages, and other aspects of the disclosure will become more apparent upon reference to the following detailed description, taken in conjunction with the accompanying drawings, wherein, by way of illustration and not limitation, several embodiments of the disclosure are shown in which:

Fig. 1 shows a schematic flow chart of a method for determining a tumor neoepitope of microbial origin according to the present invention.

Fig. 2 shows a schematic flow chart of a process of determining bacterial species and abundance according to an embodiment of the disclosure.

Fig. 3 shows a schematic flow chart of a process of determining a tumor neoepitope according to one embodiment of the present disclosure.

Fig. 4 shows a schematic flow chart of a method of determining a tumor neoepitope of microbial origin according to one embodiment of the present disclosure.

Fig. 5 shows a schematic block diagram of an apparatus for determining tumor neoepitopes of microbial origin according to the present invention.

Fig. 6 shows a schematic block diagram of a determination module of bacterial species and abundance according to an embodiment of the disclosure.

Fig. 7 shows a schematic block diagram of a tumor neoepitope determination module according to one embodiment of the present disclosure.

Fig. 8 shows a schematic block diagram of an apparatus for determining tumor neoepitopes of microbial origin according to one embodiment of the present disclosure.

Fig. 9 shows a schematic block diagram of a computing device according to one embodiment of the present disclosure.

Fig. 10 shows a schematic flow chart of a method of determining a tumor neoepitope of microbial origin according to one embodiment of the present disclosure.

Fig. 11 shows a distribution circle plot of species whose abundance in tumor-associated samples is significantly higher than in non-tumor-associated samples at the species level, in an embodiment in accordance with the disclosure. The left half of the plot shows the proportion of each group within each bacterial species; the right half shows the proportion of each bacterial species within each group. The bacterial species enriched in each group can be clearly identified from the distribution circle plot.

Fig. 12 shows a heat map of the distribution of species whose abundance in tumor-associated samples is significantly higher than in non-tumor-associated samples at the species level, in an embodiment in accordance with the disclosure. The uppermost color block represents the grouping information of the samples. The hierarchical clustering tree represents the similarity in species composition among samples: the shorter the distance between species, the more similar their distribution across the samples. Colors from light to dark indicate relative species abundance from low to high.

Detailed Description

General definitions and terms

All patents, patent applications, scientific publications, manufacturer's instructions and guidelines, and the like, cited herein, whether supra or infra, are hereby incorporated by reference in their entirety. Nothing herein is to be construed as an admission that the disclosure is not entitled to antedate such disclosure.

Unless otherwise indicated, scientific and technical terms used herein have the meaning commonly understood by one of ordinary skill in the art to which this invention pertains. Also, the terms related to protein and nucleic acid chemistry, molecular biology, cell and tissue culture, and microbiology used herein are terms widely used in the relevant art (see, e.g., Molecular Cloning: A Laboratory Manual, 2nd Edition, J. Sambrook et al. eds., Cold Spring Harbor Laboratory Press, Cold Spring Harbor 1989). To aid understanding of the present invention, definitions and explanations of related terms are provided below.

As used herein, "at least one" or "one or more" may mean 1, 2, 3, 4, 5, 6, 7, 8 or more.

As used herein, the terms "comprises," "comprising," "includes," "including," "having" and "containing" are open-ended, meaning the inclusion of the stated elements, steps or components, but not the exclusion of other non-recited elements, steps or components. The expression "consisting of" does not include any elements, steps or components not specified. The expression "consisting essentially of" means that the scope is limited to the specified elements, steps, or components, plus any optional elements, steps, or components that do not significantly affect the basic and novel properties of the claimed subject matter. It should be understood that the expressions "consisting essentially of" and "consisting of" are encompassed within the meaning of the expression "comprising".

As used herein, the term "and/or" in connection with a plurality of recited elements should be understood to include both individual and combined options. In other words, "and/or" includes "and" as well as "or". For example, A and/or B includes A, B and A+B. A, B and/or C includes A, B, C and any combination thereof, e.g., A+B, A+C, B+C and A+B+C. Further elements defined by "and/or" are to be understood in a similar manner and include any one of, and any combination of, these.

Any numerical value or range of numerical values, such as concentration or range of concentration, should be construed as modified by the term "about" in any event, unless otherwise indicated. Thus, a numerical value typically includes ±10% of the value. For example, a concentration of 1mg/mL includes 0.9mg/mL to 1.1mg/mL. Likewise, a concentration range of 1% to 10% (w/v) includes 0.9% (w/v) to 11% (w/v). As used herein, the use of a numerical range explicitly includes all possible subranges, all individual values within the range including integers and fractions within the range unless the context clearly indicates otherwise.

As used herein, the term "sample" includes any sample of tissue or fluid isolated from an individual, such as, for example, skin, plasma, serum, spinal fluid, lymph, synovial fluid, urine, sputum, tears, blood cells, organs, tumors, and fecal matter (e.g., feces). The term "tumor-associated sample" as used herein includes a tumor tissue sample or a sample directly related to or derived from a lesion or disorder thereof. For example, in a colon cancer patient, the tumor-associated sample comprises a colon cancer tissue sample or a stool sample from the colon cancer patient. For example, in a nasopharyngeal carcinoma patient, the tumor-associated sample includes a nasopharyngeal carcinoma tissue sample or a sputum sample of the nasopharyngeal carcinoma patient. The term "non-tumor-associated sample" as used herein refers to a paracancerous tissue sample, a normal tissue sample, or another healthy population sample that contains no corresponding cancer cells or tissues and can serve as a normal control for the tumor-associated sample. A healthy population herein refers, relative to tumor patients having a given tumor, to a population that has neither that tumor nor other diseases that may affect the outcome of the experiment. For example, when the tumor-associated sample is a stool sample from a colon cancer patient, the non-tumor-associated sample may be a stool sample from a healthy population not suffering from colon cancer or other diseases that may affect the outcome of the experiment. In some embodiments, a healthy population refers to a population that does not have any tumor.

As used herein, the term "wild-type" means that the sequence is naturally occurring and not artificially modified, including naturally occurring mutants.

As used herein, "antigen" refers to a molecule that upon entry into the body can elicit an immune response that is acquired by the body and that can be directed to the production of antibodies, or to specific immunogenically active cells, or both. It will be appreciated by those skilled in the art that any macromolecule, including almost all proteins or peptide fragments, may act as an antigen. Still further, the antigen may be from recombinant or genomic DNA or RNA.

The term "neoantigen" as used herein is an antigen having at least one alteration that makes it different from the corresponding wild-type parent antigen; for example, the alteration is a tumor cell variation. The neoantigen may include a polypeptide sequence or a nucleotide sequence. As used herein, the term "variation" is the difference between a subject's nucleic acid and a reference human genome used as a control, which includes both genetic mutation and genetic recombination. Mutations may include point mutations caused by single base changes, or deletions, duplications and insertions of multiple bases.

As used herein, the term "tumor neoantigen" is a neoantigen that is present in a tumor cell or tissue of a subject but not in a corresponding normal cell or tissue of the subject. A tumor neoantigen can serve as a tumor marker when identifying tumor cells by diagnostic testing, and can also serve as a potential candidate for cancer treatment.

As used herein, an "epitope (also referred to as an antigenic determinant)" is a portion of an antigen that is recognized by the immune system (particularly by antibodies, B cells or T cells) in a suitable context. The epitope may be a conformational epitope or a linear epitope. The epitope in the present invention is a linear epitope, defined by a linear continuous amino acid sequence of a specific region of a protein. As used herein, a "tumor neoepitope" is capable of binding with high affinity to MHC molecules such that tumor neoantigen is presented for recognition by T cells and causes T cell activation, thereby attacking tumor cells.

As used herein, the term "tumor" encompasses solid tumors and hematological tumors. Solid tumors include, but are not limited to: squamous cell carcinoma, adenocarcinoma, basal cell carcinoma, renal cell carcinoma, ductal carcinoma of the breast, soft tissue sarcoma, osteosarcoma, melanoma, small-cell lung cancer, non-small-cell lung cancer, lung adenocarcinoma, peritoneal carcinoma, hepatocellular carcinoma, gastrointestinal cancer, gastric cancer, pancreatic cancer, neuroendocrine carcinoma, glioblastoma, cervical cancer, ovarian cancer, liver cancer, bladder cancer, brain cancer, hepatoma, breast cancer, colon cancer, colorectal cancer, endometrial or uterine cancer, esophageal cancer, salivary gland cancer, renal cancer, prostate cancer, vulval cancer, thyroid cancer, neuroblastoma, or head and neck cancer. Hematological tumors include, but are not limited to: leukemia, lymphoma, myeloma, acute myelogenous leukemia, chronic myelogenous leukemia, acute lymphoblastic leukemia, chronic lymphoblastic leukemia, hairy cell leukemia, Hodgkin's lymphoma, non-Hodgkin's lymphoma, mantle cell lymphoma or multiple myeloma.

As used herein, the term "HLA binding affinity", "HLA affinity" or "MHC binding affinity" means the binding affinity between a particular antigen and a particular MHC allele.

The term "Major Histocompatibility Complex (MHC)" relates to the gene complex that occurs in all vertebrates. MHC proteins or molecules play a role in the signaling between lymphocytes and antigen presenting cells in a normal immune response. Human MHC, also known as HLA, human leukocyte antigen, is located on chromosome 6 and mainly comprises MHC-I and MHC-II.

The term "MHC-I" or "MHC class I" refers to a major histocompatibility complex class I protein or gene. Within the human MHC-I (HLA-I) region, there are HLA-A, HLA-B, HLA-C, HLA-E, HLA-F, CD1a, CD1b and CD1c sub-regions. MHC class I proteins are present on almost all cell surfaces, including most tumor cells. MHC-I proteins are loaded with antigens that are typically derived from endogenous proteins or pathogens present within the cell and then presented to Cytotoxic T Lymphocytes (CTLs). T cell receptors are capable of recognizing and binding peptides complexed with MHC class I molecules. Each cytotoxic T lymphocyte expresses a unique T cell receptor, capable of binding to a specific MHC/peptide complex. MHC class I molecules mediate primarily the presentation of endogenous antigens.

The term "MHC-II" or "MHC class II" refers to a major histocompatibility complex class II protein or gene. MHC class II proteins are mainly expressed on antigen presenting cells such as B cells, monocytes, macrophages and dendritic cells. MHC class II molecules mediate primarily the presentation of exogenous antigens, which present exogenous antigen polypeptide molecules to Th cells (helper T cells).

As used herein, the term "allele" generally refers to one of a pair of genes that are located at the same position on a pair of homologous chromosomes and control contrasting traits; it may refer to one form of a gene, of a gene sequence, or of a protein. The term "allelic typing" refers to correctly assigning the alleles (or heterozygous sites) of a diploid (or even polyploid) genome to the paternal or maternal chromosome according to their parent of origin, so that all alleles from the same parent are ultimately arranged on the same chromosome.

HLA is classified into three major classes according to genetic locus: class I, class II and class III. Because the sequences are highly variable, HLA has many different alleles. An aberrant peptide needs to bind to HLA in order to be recognized by the T cell receptor (TCR), thereby eliciting an immune response. Thus, predicting the HLA type is critical for identifying tumor antigens. Currently, HLA genotyping is mainly performed using PCR technology in combination with allele-specific oligonucleotides (ASO) or sequence-specific oligonucleotide probes (SSO). HLA typing through analysis of second-generation sequencing data makes it possible to obtain polymorphisms as small as a single SNV, together with genotype and haplotype information.

As used herein, "genome" refers to the sum of all genetic material of an organism. Genetic material includes DNA or RNA.

As used herein, the term "neural network" is a machine learning model for classification or regression, consisting of multiple layers of linear transformations, each followed by an element-wise nonlinearity, typically trained by stochastic gradient descent and back propagation.

"Nucleoside" is a generic term for a class of glycosides. Nucleosides are constituents of nucleic acids and nucleotides. Nucleosides are formed by condensing D-ribose or D-2-deoxyribose with a pyrimidine base or a purine base. Herein, "nucleotide" includes deoxyribonucleotides and ribonucleotides and derivatives thereof. As used herein, a "ribonucleotide" is a constituent of ribonucleic acid (RNA) and consists of one molecule of base, one molecule of pentose, and one molecule of phosphate; it refers to a nucleotide having a hydroxyl group at the 2'-position of the β-D-ribofuranosyl group. A "deoxyribonucleotide" is a constituent of deoxyribonucleic acid (DNA), is likewise composed of one molecule of base, one molecule of pentose and one molecule of phosphate, and refers to a nucleotide in which the hydroxyl group at the 2'-position of the β-D-ribofuranosyl group is replaced by hydrogen; it is a main chemical component of chromosomes. A "nucleotide" is generally referred to by the single letter representing the base therein: "A (a)" means adenine-containing deoxyadenylate or adenylate, "C (c)" means cytosine-containing deoxycytidylate or cytidylate, "G (g)" means guanine-containing deoxyguanylate or guanylate, "U (u)" means uracil-containing uridylate, and "T (t)" means thymine-containing deoxythymidylate.

As used herein, the terms "polynucleotide" and "nucleic acid" are used interchangeably to refer to a polymer of deoxyribonucleotides (deoxyribonucleic acid, DNA) or a polymer of ribonucleotides (ribonucleic acid, RNA). "Polynucleotide sequence", "nucleic acid sequence" and "nucleotide sequence" are used interchangeably to refer to the ordering of nucleotides in a polynucleotide. It will be appreciated by those skilled in the art that the coding strand (sense strand) of DNA can be considered to have the same nucleotide sequence as the RNA it encodes, with deoxythymidylate in the sequence of the coding strand of DNA corresponding to uridylate in the sequence of the RNA it encodes.

As used herein, "encoding" refers to the inherent properties of a particular nucleotide sequence in a polynucleotide, such as a gene, cDNA or mRNA, that can be used as a template to synthesize polymers and macromolecules in other biological processes, provided that there is a defined nucleotide sequence or a defined amino acid sequence. Thus, a gene encodes a protein, meaning that mRNA of the gene is transcribed and translated to produce the protein in a cell or other biological system.

As used herein, the term "polypeptide" refers to a polymer comprising two or more amino acids covalently linked by peptide bonds. A "protein" may comprise one or more polypeptides, wherein the polypeptides interact with each other by covalent or non-covalent means. Unless otherwise indicated, "polypeptide" and "protein" may be used interchangeably.

As used herein, the term "host" refers to a subject from which tumor-related samples and non-tumor-related samples are derived. The host may be any animal, such as a mammal, particularly a human.

As used herein, "NGS" refers to next-generation sequencing technology, also known as high-throughput sequencing or massively parallel sequencing (MPS). Unlike conventional Sanger (dideoxy) sequencing, NGS allows a large number of nucleic acid molecules to be sequenced in parallel at a time; typically, a single sequencing reaction yields no less than 100 Mb of sequencing data.

As used herein, "Q20" refers to a sequencing base error rate of 1%. "Q30" means that the sequencing base error rate is 0.1%. As used herein, "N base" refers to an unknown base, a base type that cannot be determined by sequencing.
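The Q20 and Q30 values above follow the standard Phred relation Q = -10·log10(P), where P is the base-call error probability. A minimal sketch of the conversion is given below; the function names are illustrative and are not part of the disclosed method.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score Q into a base-call error probability P."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert a base-call error probability P into a Phred quality score Q."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (20, 30):
        print(f"Q{q}: error rate = {phred_to_error_probability(q):.4%}")
```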

As used herein, "bp" refers to Base Pair, a Pair of matched bases, commonly used to measure the length of DNA.

As used herein, "Wilcoxon" (Wilcoxon signed-rank test) is a non-parametric test, often used to test for differences between comparison groups. As used herein, "P-value" refers to a parameter used to determine the outcome of a hypothesis test, with smaller P-values indicating more significant results, i.e., more significant differences between comparison groups.

As used herein, "abundance" refers to relative content.

As used herein, the "NCBI NR database" is a Non-redundant protein library (Non-Redundant Protein Sequence Database), including the Non-redundant protein sequences in GenBank, EMBL, DDBJ, PDB. NCBI NR gives the amino acid sequence corresponding to all known or possible coding sequences, as well as the sequence numbers in the specialized protein database. The nucleic acid data and the protein data can be linked together corresponding to a cross index based on nucleic acid sequences.

As used herein, "MetaPhlAn" is metagenomic species annotation software that, based on metagenomic sequencing data and by sequence comparison, rapidly provides qualitative and quantitative analysis of the species composition of a microbial community, including relative abundance information.

As used herein, "NetMHCpan-4.1" is software for predicting the affinity of a peptide fragment for an MHC class I molecule, version number 4.1.

As used herein, "Deep Neural Network (DNN)", also known as a deep feed forward network (DFN) or multi-layer perceptron (MLP), refers to a neural network with many hidden layers, and is a technology in the field of Machine Learning (ML).

As used herein, "BlastP" refers to a search tool based on a local alignment algorithm, which is a commonly used tool software for bioinformatics. The input protein sequence can be compared with known sequences in a database to obtain information such as sequence similarity, so as to judge the source or evolutionary relationship of the sequence.

As used herein, "Quality control", also abbreviated as QC, refers to the deliberate trimming and filtering of low-confidence sequences during analysis.

Methods and apparatuses for determining tumor neoepitopes according to embodiments of the present specification are described below with reference to the accompanying drawings.

Method for determining tumor neoepitope derived from microorganism

In a first aspect, the present disclosure provides a method of determining a tumor neoepitope of microbial origin. Fig. 1 shows a schematic flow chart of a method of determining a tumor neoepitope of microbial origin according to the present disclosure. As shown in fig. 1, the method 100 of determining a tumor neoepitope includes steps 110, 120, 130, 140, 150, 160, 170, 180, and 190.

Step 110: obtaining metagenomic sequencing data. In this step, metagenomic sequencing data is obtained; the metagenomic sequencing data comprises sequencing data produced by high throughput sequencing of bacterial DNA in tumor-related samples and non-tumor-related samples.

In some embodiments, tumor-related and non-tumor-related samples are obtained, and bacterial DNA in the tumor-related and non-tumor-related samples is extracted for high throughput sequencing. In some embodiments, bacterial DNA is extracted from tumor-associated and non-tumor-associated samples with reference to the extraction method of Nejman D, et al. The human tumor microbiome is composed of tumor type-specific intracellular bacteria [J]. Science, 2020, 368(6494): 973-980. In some embodiments, metagenomic sequencing is performed on an ILLUMINA high throughput sequencing platform. In some embodiments, the sequencing mode of high throughput sequencing is paired-end, with a read length of 150 bp. Single-end sequencing first fragments a DNA sample into 200-500 bp fragments, ligates a primer sequence to one end of each DNA fragment, adds an adapter to the other end, fixes the fragments on a flow cell to generate DNA clusters, and then performs sequencing and reading. Paired-end sequencing adds sequencing primer binding sites to the adapters at both ends when the DNA library is constructed; after the first round of sequencing is completed, the template strand of the first round is removed, and the Paired-End Module guides regeneration and in situ amplification of the complementary strand to reach the template quantity required for the second round of sequencing, in which sequencing by synthesis of the complementary strand is performed. Compared with single-end sequencing, paired-end sequencing reduces errors and yields better assembly results. Single-end sequencing reads in only one direction, so sequencing quality decreases as read length increases; paired-end sequencing can read more than half of the sequence to be determined from each of the two directions and then merge the two reads according to their overlap, so the quality of the resulting sequence is better.

In some embodiments, in obtaining the metagenomic sequencing data at step 110, the number of tumor-associated samples and the number of non-tumor-associated samples are each no less than 10. In some embodiments, step 110 of obtaining metagenomic sequencing data comprises: reading a sample information table provided by a user, and processing the sequencing file of each sample stored in a specific folder according to the sample information table, so as to obtain the metagenomic sequencing data. In some implementations, the sample information table is samplelist.txt. In some embodiments, the format of the sample information table is: tab-delimited, with two columns, wherein the first column is the sample grouping information and the second column is the sample name, with one sample per row.

In a specific embodiment, tumor-related samples and non-tumor-related samples are obtained, and the number of tumor-related samples and the number of non-tumor-related samples are each no less than 10. Bacterial DNA in the tumor-related and non-tumor-related samples is extracted with reference to the extraction method of Nejman D, et al. The human tumor microbiome is composed of tumor type-specific intracellular bacteria [J]. Science, 2020, 368(6494): 973-980; bacterial DNA sequencing, i.e., metagenomic sequencing, is completed on an ILLUMINA high throughput sequencing platform, and the fastq files from paired-end sequencing are obtained. The paired-end fastq files are copied into a specific folder. Step 110 of obtaining metagenomic sequencing data includes: reading a sample information table provided by a user, and processing the sequencing file of each sample stored in the specific folder according to the sample information table, so as to obtain the metagenomic sequencing data. The user provides a sample information table, e.g., samplelist.txt, in the following format: tab-delimited, with two columns, wherein the first column is the sample grouping information and the second column is the sample name, with one sample per row.
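As an illustration of the sample information table described above, the following minimal sketch reads a two-column, tab-delimited table (group, sample name) and checks that each group contains at least 10 samples. The group labels "tumor" and "normal" and the function names are assumptions made for this example only; the disclosure does not specify them.

```python
import csv
import io
from collections import defaultdict


def read_sample_table(handle):
    """Parse a tab-delimited table: column 1 = group, column 2 = sample name."""
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        if len(row) < 2:
            raise ValueError(f"expected two tab-separated columns, got: {row!r}")
        groups[row[0].strip()].append(row[1].strip())
    return dict(groups)


def groups_meeting_minimum(groups, minimum=10):
    """Report, per group, whether it holds at least `minimum` samples."""
    return {group: len(samples) >= minimum for group, samples in groups.items()}


if __name__ == "__main__":
    # Hypothetical table content; a real run would open samplelist.txt instead.
    demo = "".join(f"tumor\tT{i}\n" for i in range(10))
    demo += "".join(f"normal\tN{i}\n" for i in range(9))
    table = read_sample_table(io.StringIO(demo))
    print(groups_meeting_minimum(table))
```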

Step 120: predicting the coding genes in the genome. In this step, metagenome assembly is performed based on the metagenomic sequencing data to obtain an assembled genome sequence, and the coding genes in the genome are predicted based on the assembled genome sequence.

In some embodiments, step 120 of predicting the coding genes in the genome comprises performing metagenome assembly based on the metagenomic sequencing data, the metagenome assembly comprising: breaking the sequencing data into small fragments, and assembling sequence fragments of the metagenome. In some embodiments, metagenome assembly comprises: breaking the sequencing data into small fragments based on a de Bruijn graph algorithm, and assembling sequence fragments of the metagenome; sequence fragment assembly is first performed on the sequencing data of each sample, which is called "single sample assembly"; the reads of all samples that were not successfully assembled are then assembled again together, which is called "mixed assembly"; the sequence fragments of single sample assembly and mixed assembly are thereby obtained. In some embodiments, step 120 of predicting the coding genes in the genome comprises predicting the coding genes in the genome based on the assembled genome sequence, which comprises identifying potential start codons and stop codons in the assembled sequence fragments, searching for the open reading frame (ORF) between a start codon and a stop codon, and predicting protein-coding genes based on characteristics such as ORF length and codon usage bias. In some embodiments, step 120 of predicting the coding genes in the genome further comprises determining the expression level of the coding genes. In some embodiments, determining the expression level of the coding genes includes removing redundant genes and calculating the expression levels of the different genes. In some embodiments, removing redundant genes includes clustering similar sequences according to a defined sequence identity threshold, and designating a representative sequence (e.g., referred to as a "centroid") for each cluster, which is the longest sequence in the cluster. Non-redundant sequences are generated by deleting sequences that are highly similar to the centroid. In some embodiments, step 120 of predicting the coding genes in the genome comprises: breaking the sequencing data into small fragments and assembling sequence fragments of the metagenome, wherein sequence fragment assembly is first performed on the sequencing data of each sample, which is called "single sample assembly", and the reads of all samples that were not successfully assembled are then assembled again together, which is called "mixed assembly", thereby obtaining the sequence fragments of single sample assembly and mixed assembly; identifying potential start codons and stop codons in the assembled sequence fragments; searching for the open reading frame (ORF) between a start codon and a stop codon, and predicting protein-coding genes according to characteristics such as ORF length and codon usage bias; and removing redundant genes and calculating the expression levels of the different predicted genes.
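The ORF search described above can be illustrated with a deliberately simplified sketch. It scans only the forward strand, assumes ATG as the sole start codon, and filters only by a minimum ORF length; real gene predictors also scan the reverse strand, accept alternative start codons, and model codon usage bias. The function name and the `min_codons` parameter are assumptions for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # simplification: alternative bacterial starts are ignored


def find_orfs(sequence, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    start and stop codons. Only the first ATG after the previous in-frame stop
    is used, so each stop codon yields at most one (the longest) ORF.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    contig = "CCATGAAATTTTAGGG"
    for start, end, frame in find_orfs(contig):
        print(frame, contig[start:end])
```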

In a specific embodiment, step 120 of predicting the coding genes in the genome comprises assembling sequence fragments based on the metagenomic sequencing data: megahit is called to assemble the sequencing data of each sample, which is called "single sample assembly", and the reads of all samples that were not successfully assembled are assembled again together using megahit, which is called "mixed assembly", thereby obtaining the sequence fragments of single sample assembly and mixed assembly; based on the assembled sequence fragments, prodigal is called to predict the coding genes in the genome; further, based on the predicted coding genes, cd-hit is called to remove redundant genes, and salmon is called to align the sequencing data to the predicted genes and calculate the expression levels of the different genes.
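For orientation, the tool chain named above could be driven roughly as follows for a single sample. This is a sketch under stated assumptions, not the inventors' actual invocation: the file-naming scheme, the 0.95 identity threshold, the use of cd-hit-est on nucleotide gene sequences, and the omission of the mixed-assembly pass are all choices made for this example. The function only builds the command lines and does not execute them.

```python
import shlex


def build_pipeline_commands(sample, reads_1, reads_2, out_dir, identity=0.95):
    """Build (but do not run) per-sample command lines for step 120."""
    assembly_dir = f"{out_dir}/{sample}_megahit"
    contigs = f"{assembly_dir}/final.contigs.fa"  # MEGAHIT's contig output file
    genes = f"{out_dir}/{sample}.genes.fna"        # hypothetical file names
    proteins = f"{out_dir}/{sample}.proteins.faa"
    nonredundant = f"{out_dir}/{sample}.genes.nr.fna"
    index_dir = f"{out_dir}/{sample}_salmon_index"
    quant_dir = f"{out_dir}/{sample}_salmon_quant"
    return [
        # 1. single sample assembly from paired-end reads
        ["megahit", "-1", reads_1, "-2", reads_2, "-o", assembly_dir],
        # 2. gene prediction in metagenome mode
        ["prodigal", "-i", contigs, "-d", genes, "-a", proteins, "-p", "meta"],
        # 3. redundancy removal at the chosen identity threshold
        ["cd-hit-est", "-i", genes, "-o", nonredundant, "-c", str(identity)],
        # 4. index the non-redundant genes, then quantify reads against them
        ["salmon", "index", "-t", nonredundant, "-i", index_dir],
        ["salmon", "quant", "-i", index_dir, "-l", "A",
         "-1", reads_1, "-2", reads_2, "-o", quant_dir],
    ]


if __name__ == "__main__":
    for command in build_pipeline_commands("S1", "S1_R1.fastq", "S1_R2.fastq", "out"):
        print(shlex.join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` once the tools are installed; keeping the commands as lists avoids shell-quoting problems with sample names.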

Step 130 determines bacterial species and abundance, in which bacterial species and abundance in tumor-related and non-tumor-related samples are determined based on predicted encoding genes and/or metagenomic sequencing data.

In some embodiments, step 130 determining the bacterial species and abundance comprises: based on the predicted encoding genes, bacterial species and abundance in tumor-related and non-tumor-related samples are determined. In some embodiments, step 130 determining the bacterial species and abundance comprises: based on the metagenomic sequencing data, species annotation software is used to determine bacterial species and abundance in tumor-related and non-tumor-related samples. In some embodiments, step 130 determining the bacterial species and abundance comprises: determining bacterial species and abundance in tumor-related and non-tumor-related samples based on the predicted encoding genes; determining bacterial species and abundance in tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data; and selecting bacterial species contained in both the results of determining bacterial species and abundance based on the predicted encoding genes and determining bacterial species and abundance using species annotation software as bacterial species in the determined tumor-associated and non-tumor-associated samples and determining their corresponding bacterial abundances.
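The cross-validation of the two approaches amounts to taking the intersection of the two species lists. A minimal sketch is given below; the species names and abundance values are made-up inputs, and reporting both abundance estimates side by side is an assumption, since the text does not state which estimate is retained.

```python
def consensus_species(gene_based, annotation_based):
    """Keep species found by both methods.

    Both arguments map species name -> relative abundance. The result maps each
    shared species to a (gene-based, annotation-based) abundance pair.
    """
    shared = gene_based.keys() & annotation_based.keys()
    return {name: (gene_based[name], annotation_based[name]) for name in sorted(shared)}


if __name__ == "__main__":
    # Illustrative values only.
    from_genes = {"Fusobacterium nucleatum": 0.12, "Escherichia coli": 0.30}
    from_annotation = {"Fusobacterium nucleatum": 0.10, "Bacteroides fragilis": 0.05}
    print(consensus_species(from_genes, from_annotation))
```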

The two methods, determining the bacterial species based on the predicted coding genes and determining the bacterial species using species annotation software, each have their own advantages. If both methods are used to determine the bacterial species, they can verify each other; taking the bacterial species contained in the results of both methods as the finally determined bacterial species improves the reliability of the bacterial species determination.

Step 140 identifies the significantly enriched bacteria, in which step the significantly enriched bacteria in the tumor-associated sample are identified.

In some embodiments, in determining the significantly enriched bacteria at step 140, the significantly enriched bacteria in the tumor-associated sample are determined based on a statistical test. In some embodiments, the statistical test is a Wilcoxon rank sum test. In some embodiments, the screening criteria for the significantly enriched bacteria are: the bacterial abundance in the tumor-associated samples is greater than or equal to 2 times the bacterial abundance in the non-tumor-associated samples, and the p-value of the statistical test is less than or equal to 0.05. The advantages of the Wilcoxon rank sum test are that it is not restricted by the form of the population distribution and therefore has a wide range of application; it is applicable to data whose values at the two extremes are not exactly determined; it can be applied regardless of what the distribution is and whether the distribution is known; and it is easy to calculate. Setting the screening criteria for significantly enriched bacteria facilitates selection of bacterial species enriched in tumor-associated samples and facilitates determination of broad-spectrum microorganism-derived tumor neoepitopes in subsequent steps.

In a specific embodiment, step 140 of determining the significantly enriched bacteria comprises: calling the Wilcoxon rank sum test to identify the bacteria that are significantly enriched in the tumor-related samples; the screening criteria for the significantly enriched bacteria are: the bacterial abundance in the tumor-associated samples is greater than or equal to 2 times the bacterial abundance in the non-tumor-associated samples, and the p-value of the statistical test is less than or equal to 0.05. Optionally, the software R and the R package circlize are used to plot a species distribution circle diagram showing the significantly enriched bacterial species and bacterial abundance in tumor-related samples. Optionally, the software R and the R package pheatmap are used to plot a species distribution heat map showing the significantly enriched bacterial species and bacterial abundance in tumor-related samples. The species distribution circle diagram and the species distribution heat map can intuitively show the bacterial species, the bacterial abundance and/or the bacterial distribution that are significantly enriched in the tumor-related samples.
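The screening criteria above (abundance ratio of at least 2 and p-value of at most 0.05) can be sketched in pure Python as follows. Several points are assumptions for illustration: the abundance ratio is computed from group means, the rank sum p-value is two-sided and uses the normal approximation with tie correction (no continuity correction), and all function names are hypothetical. In practice a statistics library routine for the rank sum test would normally be used instead.

```python
import math
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank sum p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    w = sum(average_ranks(combined)[:n1])  # rank sum of the first group
    counts = {}
    for value in combined:
        counts[value] = counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return 1.0
    z = (w - n1 * (n + 1) / 2) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))


def is_enriched(tumor, normal, min_fold=2.0, alpha=0.05):
    """Apply both criteria: mean abundance ratio >= min_fold and p <= alpha."""
    tumor_mean, normal_mean = mean(tumor), mean(normal)
    fold_ok = tumor_mean > 0 if normal_mean == 0 else tumor_mean / normal_mean >= min_fold
    return fold_ok and rank_sum_p_value(tumor, normal) <= alpha
```

For example, `is_enriched(tumor_abundances, normal_abundances)` would be evaluated once per bacterial species, with each list holding that species' relative abundance across the samples of one group.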

Step 150: obtaining a known-species determination result for the bacteria. In this step, a determination result is obtained that indicates whether the significantly enriched bacteria are of a known species.

In some embodiments, in obtaining the bacterial known species determination result at step 150, it is determined whether the significantly enriched bacteria are bacteria of a known species or bacteria of an unknown species based on the bacterial species and the significantly enriched bacteria in the tumor-associated sample and the non-tumor-associated sample. In some embodiments, in obtaining the determination of the known species of bacteria at step 150, the known database is searched based on the bacterial species determined at step 130 and the significantly enriched bacteria determined at step 140, and a determination is made as to whether the genomic sequence or protein sequence of the significantly enriched bacteria is present in the known database, and if so, the known species is determined; if the bacteria are not present, judging the bacteria to be of unknown species, and thus obtaining a judging result of the bacteria of the known species. In some embodiments, the known database is the NCBI Genome database.

In one embodiment, in step 150, the known-species determination is performed as follows: based on the names of the bacteria predicted and screened in step 130, the NCBI Genome database (https://www.ncbi.nlm.nih.gov/Genome) is searched for genomic sequence or protein sequence information of the significantly enriched bacteria determined in step 140; if such information is present, the bacteria are determined to be of a known species; if it is not present, the bacteria are determined to be of an unknown species, thereby obtaining the known-species determination result.

If the significantly enriched bacteria are of a known species, step 160 is employed. Step 160: determining the genome or protein sequence of the bacteria of the known species. In this step, the genome or protein sequence of the bacteria of the known species is determined from a known genome database based on the significantly enriched bacteria.

In some embodiments, the known genome database is the NCBI Genome database. In some embodiments, step 160 of determining the genome or protein sequence of the bacteria of the known species includes searching the NCBI Genome database, based on the species-level names of the bacteria that are significantly enriched in the tumor-related samples, to determine the bacterial genome or protein sequence.

In a specific embodiment, step 160 of determining the genome or protein sequence of the bacteria of the known species includes searching the NCBI Genome database, based on the species-level names of the bacteria that are significantly enriched in the tumor-related samples, and obtaining a bacterial genome or protein fasta sequence file.

If the significantly enriched bacteria are of an unknown species, step 170 is employed. Step 170: determining the protein sequence of the bacteria of the unknown species. In this step, the protein sequence of the bacteria of the unknown species is determined from the predicted coding genes based on the significantly enriched bacteria.

In some embodiments, step 170 of determining the protein sequence of the bacteria of the unknown species comprises, for the significantly enriched bacteria, translating the predicted coding genes into protein sequences, thereby determining the protein sequence of the bacteria of the unknown species.

In a specific embodiment, step 170 of determining the protein sequence of the bacteria of the unknown species comprises translating the genes into a protein fasta sequence file based on the correspondence of the genes and bacterial species predicted in step 130.
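The translation of predicted genes into a protein fasta file can be sketched as below. The sketch uses the standard genetic code for internal codons and does not handle the alternative bacterial start codons (e.g., GTG or TTG, which encode Met at the start position); the function names and the example gene identifier are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # codons starting with T
    "LLLLPPPPHHQQRRRR"   # codons starting with C
    "IIIMTTTTNNKKSSRR"   # codons starting with A
    "VVVVAAAADDEEGGGG"   # codons starting with G
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(gene):
    """Translate a coding sequence; stop at the first stop codon, X = unknown."""
    seq = gene.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def to_protein_fasta(genes):
    """Render {gene_id: nucleotide sequence} as protein fasta text."""
    return "".join(f">{gene_id}\n{translate(seq)}\n" for gene_id, seq in genes.items())


if __name__ == "__main__":
    print(to_protein_fasta({"gene_0001": "ATGAAATTTTAG"}), end="")
```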

Step 180: determining peptide fragments that can bind to MHC. In this step, the binding affinity to MHC is predicted for peptide fragments in the protein sequence of the bacteria of the known species or for peptide fragments in the protein sequence of the bacteria of the unknown species; the peptide fragments that can bind to MHC are then screened from the peptide fragments in the protein sequence of the bacteria of the known species or from the peptide fragments in the protein sequence of the bacteria of the unknown species, thereby determining the peptide fragments that can bind to MHC.

In some embodiments, step 180 of determining peptide fragments that can bind to MHC comprises predicting and scoring, using an artificial neural network, the binding affinity of peptide fragments to MHC molecules of known sequence, based on the genome or protein sequence of the bacteria of the known species from step 160 or on the protein sequence of the bacteria of the unknown species from step 170. In some embodiments, when the number of peptide fragments screened exceeds 50, step 180 of determining peptide fragments that can bind to MHC further comprises predicting and scoring the binding affinity of the peptide fragments to MHC molecules of known sequence using a deep neural network, to assist in verifying the screened peptide fragments having affinity. In some embodiments, the predicted binding affinity results are presented as peptide fragment-MHC pairing results. In some embodiments, step 180 of determining peptide fragments that can bind to MHC comprises screening the MHC-binding peptide fragments using NetMHCpan-4.1. In some embodiments, the screening criterion for NetMHCpan-4.1 is: screening the NetMHCpan-4.1 analysis results to obtain peptide fragments with rank% less than 0.5. In some embodiments, step 180 of determining peptide fragments that can bind to MHC includes using BigMHC scoring to assist in verifying the screened peptide fragments having affinity. In some embodiments, peptide fragments with higher BigMHC scores are more preferred.

In a specific embodiment, step 180 of determining peptide fragments that can bind to MHC comprises calling the open-source trained NetMHCpan-4.1 model to predict the binding affinity of peptide fragments to MHC molecules of known sequence, based on the bacterial genome or protein fasta sequence file of step 160 or the protein fasta sequence file of step 170, so as to screen for peptide fragments having HLA class I affinity; and calling the open-source trained BigMHC model to predict the binding affinity of the peptide fragments to MHC molecules of known sequence, so as to assist in verifying the peptide fragments predicted by NetMHCpan-4.1.

NetMHCpan-4.1: the binding affinity of peptide fragments to MHC molecules of known sequence is predicted using an artificial neural network, and the predicted affinities are expressed as the scores SW_score A and SW_score B. SW_score A and SW_score B are the "EL_rank" item and the "BA_rank" item in the NetMHCpan-4.1 calculation result, respectively, and both scores range from 0 to 100. The smaller SW_score A and SW_score B are, the more strongly a peptide fragment binds to the corresponding MHC molecule. Illustratively, the screening criteria are set to SW_score A and SW_score B each equal to or less than 2 (see https://services.health.dtu.dk/services/NetMHCITman-4.0/).

BigMHC: the binding affinity of peptide fragments to MHC molecules of known sequence is predicted using a deep neural network, and the predicted affinity is expressed as the score SW_score. SW_score is the "bigmhc_im" item in the BigMHC calculation result, and its value ranges from 0 to 1. A higher SW_score indicates stronger binding of a peptide fragment to the corresponding MHC molecule (see Albert BA, et al. Deep Neural Networks Predict MHC-I Epitope Presentation and Transfer Learn Neoepitope Immunogenicity. bioRxiv; 2022.).
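The two-stage screening described above (rank-based filtering on the NetMHCpan-4.1 scores, followed by ordering on the BigMHC score) can be sketched as follows. The record layout with the keys `el_rank`, `ba_rank` and `bigmhc_im`, the function name, and the example values are assumptions for illustration; the sketch presumes that both tools have already been run and their outputs merged per peptide-MHC pair. The default threshold of 2 follows the illustrative criterion stated above.

```python
def screen_binders(predictions, max_rank=2.0):
    """Keep pairs whose EL and BA ranks are both <= max_rank, best BigMHC first."""
    kept = [
        record for record in predictions
        if record["el_rank"] <= max_rank and record["ba_rank"] <= max_rank
    ]
    return sorted(kept, key=lambda record: record["bigmhc_im"], reverse=True)


if __name__ == "__main__":
    # Made-up prediction records for three peptide-MHC pairs.
    demo = [
        {"peptide": "pep1", "allele": "HLA-A*02:01", "el_rank": 2.0, "ba_rank": 1.0, "bigmhc_im": 0.40},
        {"peptide": "pep2", "allele": "HLA-A*02:01", "el_rank": 1.5, "ba_rank": 3.0, "bigmhc_im": 0.95},
        {"peptide": "pep3", "allele": "HLA-B*07:02", "el_rank": 0.3, "ba_rank": 0.8, "bigmhc_im": 0.90},
    ]
    for record in screen_binders(demo):
        print(record["peptide"], record["allele"], record["bigmhc_im"])
```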

Step 190 determines a tumor neoepitope, in which immunogenicity, host similarity and number of MHC-typing of the peptide fragment binding to MHC are determined based on the peptide fragment binding to MHC, and the peptide fragment is selected based on the immunogenicity, host similarity and number of MHC-typing, thereby determining the tumor neoepitope.

As used herein, "immunogenicity" refers primarily to the ability of an endogenous or related protein (e.g., a therapeutic protein) to elicit an immune response in the body or to cause an immune-related event, i.e., the ability of an antigen to stimulate specific immune cells so that they are activated, proliferate and differentiate, ultimately producing immune effectors such as antibodies and sensitized lymphocytes.

In some embodiments, determining the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment comprises: immunogenicity of the MHC-binding peptide is determined based on a deep learning model established by the deep neural network and the MHC-binding peptide. In some embodiments, determining the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment further comprises scoring according to a deep learning model, the higher the score the higher the immunogenicity.

As used herein, "similarity" is the number of amino acids in a peptide fragment that can bind to MHC that are identical to the protein sequence of the host, divided by the total number of amino acids comprised in the peptide fragment that can bind to MHC.

In some embodiments, determining the similarity of the MHC-binding peptide fragment to the host based on the MHC-binding peptide fragment comprises: aligning the peptide fragment that can bind to MHC with the protein sequences of the host from which the tumor-related samples and the non-tumor-related samples are derived, and determining the similarity between the peptide fragment that can bind to MHC and the host. In some embodiments, determining the similarity of the MHC-binding peptide fragment to the host further comprises introducing a similarity score between the MHC-binding peptide fragment and the host protein sequence, the score being the output of the sequence alignment, with a higher score indicating a higher similarity to the host.
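The similarity definition above (identical amino acids divided by peptide length) can be illustrated with a simplified stand-in for the sequence alignment. The sketch below slides the peptide along each host protein without gaps and keeps the best window; this ungapped scan is an assumption made to keep the example self-contained and is not the alignment tool used in the disclosure, which would also handle gaps and large databases. The function name and example sequences are hypothetical.

```python
def host_similarity(peptide, host_proteins):
    """Best ungapped identity of `peptide` against any window of any host protein.

    Returns identical positions / peptide length, a value between 0 and 1.
    """
    length = len(peptide)
    if length == 0:
        return 0.0
    best = 0
    for protein in host_proteins:
        for start in range(len(protein) - length + 1):
            window = protein[start:start + length]
            matches = sum(a == b for a, b in zip(peptide, window))
            best = max(best, matches)
    return best / length


if __name__ == "__main__":
    # Made-up sequences: 4 of 5 residues match the window "ACDEY".
    print(host_similarity("ACDEF", ["MMACDEYMM"]))
```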

As used herein, the "MHC typing number of an MHC-binding peptide fragment" is the number of different MHC types that have binding affinity for the same MHC-binding peptide fragment. The larger the MHC typing number, the more likely the peptide fragment is to be bound by multiple MHC molecules and to exert stronger immune activation. Determining the MHC typing number of MHC-binding peptide fragments facilitates screening for high-quality tumor neoepitopes capable of strong immune activation.

In some embodiments, determining the MHC typing number of the MHC-binding peptide fragment based on the MHC-binding peptide fragment comprises: counting the number of all MHC types that the MHC-binding peptide fragment may bind, thereby determining the MHC typing number of the MHC-binding peptide fragment. In some embodiments, determining the MHC typing number of the MHC-binding peptide fragment based on the MHC-binding peptide fragment further comprises introducing a score of the same binding peptide fragment for multiple MHC types, the score being the number of all MHC types that each peptide fragment may bind, where a higher score indicates a larger MHC typing number.

In some embodiments, determining the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment comprises: determining the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment and a deep learning model built on a deep neural network; determining the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment further comprises scoring with the deep learning model, where a higher score indicates higher immunogenicity; determining the similarity of the MHC-binding peptide fragment to the host based on the MHC-binding peptide fragment comprises: aligning the MHC-binding peptide fragment against the protein sequences of the host from which the tumor-related sample and the non-tumor-related sample are derived, and determining the similarity between the MHC-binding peptide fragment and the host; determining the similarity of the MHC-binding peptide fragment to the host based on the MHC-binding peptide fragment further comprises introducing a similarity score between the MHC-binding peptide fragment and the host protein sequence, the score being the output of the sequence alignment, where a higher score indicates higher similarity to the host; determining the MHC typing number of the MHC-binding peptide fragment based on the MHC-binding peptide fragment comprises: counting the number of all MHC types that the MHC-binding peptide fragment may bind, thereby determining the MHC typing number of the MHC-binding peptide fragment; and determining the MHC typing number of the MHC-binding peptide fragment based on the MHC-binding peptide fragment further comprises introducing a score of the same binding peptide fragment for multiple MHC types, the score being the number of all MHC types that each peptide fragment may bind, where a higher score indicates a larger MHC typing number. Introducing the immunogenicity, the similarity to the host, and the MHC typing number of the MHC-binding peptide fragments facilitates screening for high-quality tumor neoepitopes of microbial origin.

In some embodiments, screening peptide fragments based on immunogenicity, similarity to the host, and MHC typing number, and thereby determining tumor neoepitopes, comprises: screening for peptide fragments that score high in the deep learning model scoring, score high in the scoring of the same binding peptide fragment for multiple MHC types, and score low in the scoring of similarity between the MHC-binding peptide fragment and the host protein sequence, thereby determining the tumor neoepitopes. Screening for peptide fragments with high immunogenicity, low similarity to the host, and a large MHC typing number facilitates screening for high-quality tumor neoepitopes of microbial origin.

Fig. 2 shows a schematic flow chart of a process of determining bacterial species and abundance according to an embodiment of the disclosure. As shown in FIG. 2, a process 200 of determining bacterial species and abundance includes steps 210, 220, and 230.

Step 210 determines bacterial species and abundance based on the predicted encoding genes, in which step bacterial species and abundance in tumor-associated and non-tumor-associated samples are determined based on the predicted encoding genes.

In some embodiments, step 210 of determining bacterial species and abundance based on the predicted coding genes comprises aligning the predicted genes against sequences in a known database to predict the bacterial species and determine bacterial abundance at the same taxonomic level. In some embodiments, the known database is the NCBI NR database. In some embodiments, step 210 of determining bacterial species and abundance based on the predicted coding genes comprises aligning the predicted genes against sequences in the NCBI NR database to predict the bacterial species and determine bacterial abundance at the same taxonomic level. In some embodiments, the input sequence for the sequence alignment is the translated protein sequence of the predicted coding gene. Sequence alignment against known databases facilitates prediction of bacterial species.

In a specific embodiment, step 210 of determining bacterial species and abundance based on the predicted coding genes comprises using blast to align the predicted genes against sequences in the NCBI NR database (https://ftp.ncbi.nlm.nih.gov/blast/db/) to predict the bacterial species and determine bacterial abundance at the same taxonomic level.

Step 220 uses species annotation software to determine bacterial species and abundance, in which step bacterial species and abundance in tumor-related and non-tumor-related samples are determined using species annotation software based on the metagenomic sequencing data. In some embodiments, the species annotation software is MetaPhlAn.

In a specific embodiment, step 220 of determining bacterial species and abundance using species annotation software comprises calling MetaPhlAn to predict the bacterial species and their distribution in the tumor-related and non-tumor-related samples based on the metagenomic sequencing data, and determining bacterial abundance at the same taxonomic level.

Step 230 selects bacterial species and determines bacterial abundance, in which step bacterial species contained in both the results of determining bacterial species and abundance based on predicted encoding genes and determining bacterial species and abundance using species annotation software are selected as bacterial species in the determined tumor-associated and non-tumor-associated samples and their corresponding bacterial abundances are determined. In some embodiments, determining their respective bacterial abundances comprises calculating an average of the bacterial abundances of the selected bacterial species in determining bacterial species and abundances based on the predicted encoding genes and determining bacterial species and abundances using species annotation software, thereby determining the respective bacterial abundances.
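The selection and averaging in step 230 can be sketched as follows. This is a minimal illustration, assuming each method's result for one sample is available as a mapping from species name to abundance; the function name `consensus_abundance` and the input layout are hypothetical and not part of the disclosure.

```python
from __future__ import annotations


def consensus_abundance(
    gene_based: dict[str, float],
    annotation_based: dict[str, float],
) -> dict[str, float]:
    """Keep species reported by both methods and average their abundances."""
    shared = gene_based.keys() & annotation_based.keys()
    return {
        species: (gene_based[species] + annotation_based[species]) / 2
        for species in sorted(shared)
    }
```

Species reported by only one of the two methods are dropped, which reflects the mutual verification described for the two approaches.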

Returning to FIG. 1, after the bacterial species and abundance are determined as described above, step 140 is performed to determine the significantly enriched bacteria, in which step the significantly enriched bacteria in the tumor-associated sample are determined. The specific operation of step 140 to determine the significantly enriched bacteria is as described above.

Fig. 3 shows a schematic flow chart of a process for determining a tumor neoepitope according to one embodiment of the present disclosure. As shown in fig. 3, the process 300 of determining a tumor neoepitope includes steps 310, 320, 330 and 340.

Step 310 determines the immunogenicity of the MHC-binding peptide fragments, in which step the immunogenicity of the MHC-binding peptide fragments is determined based on a deep learning model established by the deep neural network and the MHC-binding peptide fragments.

In some embodiments, step 310 of determining the immunogenicity of the MHC-binding peptide fragments comprises determining the immunogenicity using BigMHC. In some embodiments, the deep learning model built on the deep neural network is trained on amino acid sequence features of different peptide fragments and MHC. In some embodiments, the deep learning model built on the deep neural network is a trained BigMHC model. In some embodiments, step 310 of determining the immunogenicity of the MHC-binding peptide fragments comprises determining the immunogenicity using a trained BigMHC model, the model having been trained on amino acid sequence features of different peptide fragments and MHC. In some embodiments, step 310 of determining the immunogenicity of the MHC-binding peptide fragments comprises scoring and ranking with the deep learning model, where a higher score indicates higher immunogenicity. Tumor immunogenicity is the basis for initiating tumor immunotherapy, so the more immunogenic a neoantigen is, the more likely it is to elicit an immune response. In some embodiments, in step 310 of determining the immunogenicity of the MHC-binding peptide fragments, the sequence of the peptide fragment and the MHC typing information are input to the model, and the model outputs a score between 0 and 1, where a higher score indicates greater immunogenicity of the peptide fragment.

In a specific embodiment, step 310 of determining the immunogenicity of the MHC-binding peptide fragments comprises calling the open-source trained BigMHC model to predict the immunogenicity of the peptide fragments. In this step, the sequence of the peptide fragment (e.g., QTYKTNSSVKK, SEQ ID NO: 41) and the MHC typing information (e.g., HLA-A*11:01) are input to the model. The result is a score between 0 and 1, where a higher score indicates greater immunogenicity of the peptide fragment. The trained BigMHC model was trained on amino acid sequence features of different peptide fragments and MHC.

Step 320 determines the similarity of the MHC-binding peptide to the host, in which step the MHC-binding peptide is aligned with the protein sequences of the host from which the tumor-associated sample and the non-tumor-associated sample are derived, and the similarity of the MHC-binding peptide to the host is determined.

In some embodiments, step 320 determining the similarity of the MHC-binding peptide to the host comprises introducing a score for the similarity of the MHC-binding peptide to the host protein sequence to determine the similarity to the host. In some embodiments, step 320 of determining the MHC-binding peptide fragment to host similarity comprises creating a protein sequence index file of the host (e.g., human) in advance, aligning the MHC-binding peptide fragment to host protein sequences based on the pre-created index file, and scoring the similarity of the MHC-binding peptide fragment to host protein sequences to determine the MHC-binding peptide fragment to host similarity.

In a specific embodiment, step 320 of determining the similarity of the MHC-binding peptide fragments to the host comprises creating a blastp index file of the protein sequences of the host (e.g., human) in advance, aligning the MHC-binding peptide fragments against the host protein sequences using blastp based on the created blastp index file, and obtaining a similarity score between each MHC-binding peptide fragment and the host protein sequences (the score is the blastp output), thereby determining the similarity of the MHC-binding peptide fragments to the host.

Step 330 determines the MHC typing number of the MHC-binding peptide fragments, in which step the number of all MHC types that each MHC-binding peptide fragment may bind is counted, thereby determining the MHC typing number of the MHC-binding peptide fragments.

In some embodiments, step 330 of determining the MHC typing number of the MHC-binding peptide fragments comprises introducing a score of the same binding peptide fragment for multiple MHC types to determine the MHC typing number. In step 330, the predicted binding affinity results are presented as peptide-MHC pairs, and one peptide fragment may be paired with multiple MHC types in the output. The score of the same binding peptide fragment for multiple MHC types is a basic count statistic: the number of all MHC types that each peptide fragment may bind.

In a specific embodiment, step 330 of determining the MHC typing number of the MHC-binding peptide fragments comprises introducing, based on the MHC-binding peptide fragments, a score of the same binding peptide fragment for multiple MHC types, and counting the number of all MHC types that each peptide fragment may bind, thereby determining the MHC typing number of the MHC-binding peptide fragments.
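The counting in step 330 can be sketched in a few lines of Python. The sketch assumes the predicted binders are available as (peptide, MHC type) pairs; the function name `mhc_typing_numbers` and the pair layout are illustrative assumptions rather than the format of any particular prediction tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def mhc_typing_numbers(pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count the distinct MHC types paired with each peptide fragment."""
    types_by_peptide: Dict[str, Set[str]] = defaultdict(set)
    for peptide, mhc_type in pairs:
        # A set ignores repeated rows for the same peptide-MHC pair.
        types_by_peptide[peptide].add(mhc_type)
    return {peptide: len(types) for peptide, types in types_by_peptide.items()}
```

Using a set per peptide means that duplicated rows in the prediction output do not inflate the score.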

Step 340 screens peptide fragments and determines tumor neoepitopes, in which step peptide fragments are screened based on immunogenicity, similarity to host, and number of MHC typing, thereby determining tumor neoepitopes.

In some embodiments, step 340 of screening the peptide fragments and determining the tumor neoepitopes comprises screening for peptide fragments that score high in the deep learning model scoring, score high in the scoring of the same binding peptide fragment for multiple MHC types, and score low in the scoring of similarity between the MHC-binding peptide fragment and the host protein sequence, thereby determining the tumor neoepitopes.

In a specific embodiment, step 340 of screening the peptide fragments and determining the tumor neoepitopes comprises selecting, as tumor neoepitopes, peptide fragments with a score of 0.9 or more in the immunogenicity deep learning model scoring, a score of 0.5 or less in the scoring of similarity between the MHC-binding peptide fragment and the host protein sequence, and a score of 2 or more in the scoring of the same binding peptide fragment for multiple MHC types.
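The three thresholds of this specific embodiment can be expressed as a simple filter. The sketch below is illustrative only: the `Candidate` record, its field names, and the function `select_neoepitopes` are hypothetical, and the three scores are assumed to have been computed beforehand in steps 310, 320 and 330.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    peptide: str
    immunogenicity: float    # deep learning model score, between 0 and 1
    host_similarity: float   # similarity score against host protein sequences
    mhc_typing_number: int   # number of MHC types the peptide may bind


def select_neoepitopes(
    candidates: Iterable[Candidate],
    min_immunogenicity: float = 0.9,
    max_host_similarity: float = 0.5,
    min_mhc_typing_number: int = 2,
) -> List[Candidate]:
    """Keep candidates that satisfy all three thresholds (inclusive bounds)."""
    return [
        c
        for c in candidates
        if c.immunogenicity >= min_immunogenicity
        and c.host_similarity <= max_host_similarity
        and c.mhc_typing_number >= min_mhc_typing_number
    ]
```

The defaults mirror the values stated in the specific embodiment; exposing them as parameters allows other embodiments to use different cut-offs.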

In some embodiments, the method of determining a tumor neoepitope of microbial origin further comprises sequencing data quality control. Fig. 4 shows a schematic flow chart of a method of determining a tumor neoepitope of microbial origin according to one embodiment of the present disclosure. As shown in fig. 4, the method 400 of determining a tumor neoepitope of microbial origin comprises steps 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414 and 415. Among them, steps 401, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414 and 415 are the same as steps 110, 120, 210, 220, 230, 140, 150, 160, 170, 180, 310, 320, 330 and 340, respectively, and are not described in detail herein.

Step 402 performs quality control of the sequencing data, in which step, before metagenome assembly is performed based on the metagenomic sequencing data to obtain an assembled genome sequence and coding genes in the genome are predicted based on the assembled genome sequence, quality control is performed on the metagenomic sequencing data, thereby obtaining quality-controlled metagenomic sequencing data.

In some embodiments, the criteria for quality control are: the terminal base quality is greater than Q20, the number of N bases is less than 5, and the sequence length is greater than or equal to 100 bp. In some embodiments, step 402 of sequencing data quality control comprises removing adapter sequences. In some embodiments, step 402 of sequencing data quality control comprises removing adapter sequences and removing low-quality sequencing data, to obtain quality-controlled sequencing data having a terminal base quality greater than Q20, fewer than 5 N bases, and a sequence length greater than or equal to 100 bp.

In a specific embodiment, step 402 of sequencing data quality control comprises invoking the fastp software to perform quality control with the fastq sequencing data obtained in step 401 (or step 110) as input, the quality control comprising: removing adapter sequences and removing low-quality sequencing data, to obtain quality-controlled sequencing data (i.e., after adapter removal, sequencing data with a terminal base quality greater than Q20, fewer than 5 N bases, and a sequence length greater than or equal to 100 bp).
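The stated quality-control criteria can be illustrated as a per-read check. This sketch is not fastp and does not reproduce its trimming or filtering logic; it assumes Phred+33 quality encoding and interprets "terminal base quality" as the quality of the first and last base of a read after adapter removal, both of which are assumptions made here for illustration. The function name `passes_qc` is hypothetical.

```python
def passes_qc(
    sequence: str,
    quality: str,
    min_terminal_q: int = 20,
    n_limit: int = 5,
    min_length: int = 100,
) -> bool:
    """Return True if a read meets the three stated criteria.

    Criteria: terminal base quality > Q20, fewer than 5 N bases,
    and read length >= 100 bp. Quality is assumed to be Phred+33.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings must have equal length")
    if len(sequence) < min_length:
        return False
    if sequence.upper().count("N") >= n_limit:
        return False
    # Assumption: "terminal" means the first and last base of the read.
    terminal_scores = (ord(quality[0]) - 33, ord(quality[-1]) - 33)
    return all(score > min_terminal_q for score in terminal_scores)
```

In practice a dedicated tool would also trim low-quality ends rather than only rejecting reads; the check above only shows how the three thresholds combine.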

The quality control of the sequencing data in step 402 improves the quality of the obtained sequencing data, which avoids sequence contamination that would otherwise affect the subsequent analysis and the determination of tumor neoepitopes.

In some embodiments, step 403 (or step 120) of predicting the coding genes in the genome comprises: performing metagenome assembly based on the quality-controlled metagenomic sequencing data to obtain an assembled genome sequence, and predicting the coding genes in the genome based on the assembled genome sequence. Performing this operation on quality-controlled sequencing data avoids adverse effects on the subsequent analysis and the determination of tumor neoepitopes.

In some embodiments, step 405 (or step 220) determining bacterial species and abundance using species annotation software comprises: based on the metagenomic sequencing data after quality control, species annotation software is used to determine bacterial species and abundance in tumor-related and non-tumor-related samples. The use of quality-controlled sequencing data for the manipulation can avoid affecting subsequent analysis and determination of tumor neoepitopes.

In some embodiments, the tumor-related sample is a donor tumor tissue sample or a tumor patient stool sample, or the non-tumor-related sample is a donor paracancerous tissue sample, a normal tissue sample, or a healthy population stool sample. In some embodiments, the tumor-related sample is a donor tumor tissue sample or a tumor patient stool sample, and the non-tumor-related sample is a donor paracancerous tissue sample, a normal tissue sample, or a healthy population stool sample. A tumor neoantigen is a neoantigen that is present in a tumor cell or tissue of a subject but not in a corresponding normal cell or tissue of the subject. It can serve as a tumor marker when tumor cells are identified by diagnostic testing, and can also serve as a potential candidate for cancer treatment. A paracancerous tissue sample, a normal tissue sample, or a stool sample from a healthy population serves as a control sample. Establishing such control samples facilitates the study of significantly enriched bacteria in tumor tissue samples or stool samples from tumor patients and the further determination of tumor neoepitopes of microbial origin.

In some embodiments, the length of the assembled genomic sequence is greater than or equal to 90bp. Long sequences can improve the accuracy of species annotation analysis.

In some embodiments, the MHC is a high-frequency HLA of the Chinese population. Determining peptide fragments that can bind high-frequency HLA of the Chinese population facilitates the development and study of tumor neoepitopes suitable for cancer treatment in the Chinese population. The high-frequency HLA of the Chinese population comprises the human MHC listed in Table 3; see, for example, He Y, Li J, Mao W, et al. HLA common and well-documented alleles in China [J]. HLA, 2018, 92(4): 199-205.

In some embodiments, in step 411 (or step 180) of determining the MHC-binding peptide fragments, the criteria for screening MHC-binding peptide fragments from the protein sequences of bacteria of known species or of bacteria of unknown species are: the affinity ranks in the top 0.5%, and the affinity is greater than 0 and less than or equal to 500 nM. Tumor immunogenicity is the basis for initiating tumor immunotherapy, so neoantigens that bind MHC with high affinity are more likely to elicit an immune response. Screening for peptide fragments with higher binding affinity to MHC facilitates screening for high-quality tumor neoepitopes of microbial origin.
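The two affinity criteria can be combined in a small filter, sketched below. The sketch assumes that the affinity prediction step supplies, for each peptide-MHC pair, both a predicted affinity in nM and a percentile rank in which smaller values mean stronger binding; the record `BindingPrediction`, its field names, and the function `is_binder` are hypothetical and do not describe the output format of any specific tool.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class BindingPrediction:
    peptide: str
    mhc_type: str
    affinity_nm: float    # predicted binding affinity in nM (lower is stronger)
    percent_rank: float   # assumed percentile rank of the affinity (lower is stronger)


def is_binder(
    p: BindingPrediction,
    max_percent_rank: float = 0.5,
    max_affinity_nm: float = 500.0,
) -> bool:
    """Top 0.5% by rank, and affinity greater than 0 and at most 500 nM."""
    return p.percent_rank <= max_percent_rank and 0 < p.affinity_nm <= max_affinity_nm


def screen_binders(predictions: Iterable[BindingPrediction]) -> List[BindingPrediction]:
    """Return only the predictions that meet both criteria."""
    return [p for p in predictions if is_binder(p)]
```

Both conditions must hold at once, so a pair with a favorable rank but an affinity above 500 nM is still excluded.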

Device for determining tumor neoepitope derived from microorganism

In a second aspect, the present disclosure provides a device for determining a tumor neoepitope of microbial origin. Fig. 5 shows a schematic block diagram of an apparatus for determining a tumor neoepitope of microbial origin according to the present disclosure. As shown in fig. 5, the apparatus 500 for determining a tumor neoepitope of microbial origin includes a metagenome sequencing data acquisition module 510, a coding gene prediction module 520, a determination module 530 of bacterial species and abundance, a determination module 540 of significantly enriched bacteria, a bacteria known species judgment result acquisition module 550, a determination module 560 of the genome or protein sequence of bacteria of known species, a determination module 570 of the protein sequence of bacteria of unknown species, a determination module 580 of MHC-binding peptide fragments, and a tumor neoepitope determination module 590.

A metagenome sequencing data acquisition module 510 configured to acquire metagenome sequencing data, the metagenome sequencing data comprising sequencing data after high throughput sequencing of bacterial DNA in tumor-related samples and non-tumor-related samples. In some embodiments, the operation of the metagenomic sequencing data acquisition module 510 may refer to the operation described above with reference to 110 of fig. 1.

In some embodiments, tumor-related and non-tumor-related samples are obtained, and bacterial DNA in the tumor-related and non-tumor-related samples is extracted for high throughput sequencing. In some embodiments, bacterial DNA is extracted from the tumor-related and non-tumor-related samples by the extraction method described in Nejman D, et al. The human tumor microbiome is composed of tumor type-specific intracellular bacteria [J]. Science, 2020, 368(6494): 973-980. In some embodiments, metagenomic sequencing is performed on an Illumina high throughput sequencing platform. In some embodiments, the sequencing mode of the high throughput sequencing is paired-end, with a read length of 150 bp. In some embodiments, the number of tumor-related samples and the number of non-tumor-related samples are each no less than 10.

A coding gene prediction module 520 configured to perform metagenomic assembly based on the metagenomic sequencing data, obtain an assembled genomic sequence, and predict a coding gene in the genome based on the assembled genomic sequence. In some embodiments, the operation of the encoding gene prediction module 520 may refer to the operation described above with reference to 120 of fig. 1.

A determination module 530 of bacterial species and abundance configured to determine bacterial species and abundance in tumor-related and non-tumor-related samples based on predicted encoding genes and/or metagenomic sequencing data. In some embodiments, the operation of the determination module 530 of bacterial species and abundance may refer to the operation described above with reference to 130 of fig. 1.

In some embodiments, the determination module 530 of bacterial species and abundance comprises: a determination module based on the predicted bacterial species and abundance of the encoding gene configured to determine bacterial species and abundance in the tumor-associated sample and the non-tumor-associated sample based on the predicted encoding gene.

In some embodiments, the determination module 530 of bacterial species and abundance comprises: a determination module of bacterial species and abundance using species annotation software configured to determine bacterial species and abundance in tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data.

In some embodiments, the determination module 530 of bacterial species and abundance comprises: a determination module based on the predicted bacterial species and abundance of the encoding gene configured to determine bacterial species and abundance in the tumor-associated sample and the non-tumor-associated sample based on the predicted encoding gene; a determination module of bacterial species and abundance using species annotation software configured to determine bacterial species and abundance in tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data; and a bacterial species selection and bacterial abundance determination module configured to select bacterial species contained in both the results of determining bacterial species and abundance based on the predicted encoding genes and determining bacterial species and abundance using species annotation software as bacterial species in the determined tumor-associated and non-tumor-associated samples and determine their respective bacterial abundances.

The method in the module for determining bacterial species and abundance based on predicted coding genes and the method in the module for determining bacterial species and abundance using species annotation software each have advantages and disadvantages. If both modules are used to determine the bacterial species, the two methods can verify each other; taking the bacterial species contained in the results of both modules as the finally determined bacterial species improves the reliability of the bacterial species determination.

A determination module 540 of significantly enriched bacteria configured to determine the significantly enriched bacteria in the tumor-related sample. In some embodiments, the operation of the determination module 540 of significantly enriched bacteria may refer to the operation described above with reference to 140 of fig. 1.

A bacteria known species judgment result acquisition module 550 configured to acquire a bacteria known species judgment result indicating whether the bacteria significantly enriched are of a known species. In some embodiments, the operation of the bacteria known species determination result acquisition module 550 may refer to the operation described above with reference to 150 of fig. 1.

If the species of the significantly enriched bacteria is known, module 560 is employed. A determination module 560 of the genome or protein sequence of bacteria of known species is configured to determine, based on the significantly enriched bacteria, the genome or protein sequence of the bacteria of known species from a database of known genomes. In some embodiments, the operation of the determination module 560 of the genome or protein sequence of bacteria of known species may refer to the operation described above with reference to 160 of fig. 1.

If the species of the significantly enriched bacteria is unknown, module 570 is employed. A determination module 570 of the protein sequence of bacteria of unknown species is configured to determine, based on the significantly enriched bacteria, the protein sequence of the bacteria of unknown species from the predicted coding genes. In some embodiments, the operation of the determination module 570 of the protein sequence of bacteria of unknown species may refer to the operation described above with reference to 170 of fig. 1.

A determination module 580 of MHC-binding peptide fragments configured to predict the binding affinity to MHC of peptide fragments in the protein sequences of the bacteria of known species or of the bacteria of unknown species, and to screen for the peptide fragments in those protein sequences that can bind to MHC, thereby determining the MHC-binding peptide fragments. In some embodiments, the operation of the determination module 580 of MHC-binding peptide fragments may refer to the operation described above with reference to 180 of fig. 1.

A tumor neoepitope determination module 590 configured to determine immunogenicity, host similarity, and number of MHC types of the MHC-binding peptide fragments based on the MHC-binding peptide fragments, and screen the peptide fragments based on the immunogenicity, host similarity, and number of MHC types, thereby determining the tumor neoepitope. In some embodiments, the operation of the tumor neoepitope determination module 590 can refer to the operation described above with reference to 190 of fig. 1.

In some embodiments, the tumor neoepitope determination module 590 comprises: a determining module of the immunogenicity of the MHC-binding peptide fragments, configured to determine the immunogenicity of the MHC-binding peptide fragments based on the MHC-binding peptide fragments and a deep learning model built on a deep neural network; a determining module of the similarity of the MHC-binding peptide fragments to the host, configured to align the MHC-binding peptide fragments against the protein sequences of the host from which the tumor-related sample and the non-tumor-related sample are derived, and determine the similarity of the MHC-binding peptide fragments to the host; a determining module of the MHC typing number of the MHC-binding peptide fragments, configured to count the number of all MHC types that the MHC-binding peptide fragments may bind, thereby determining the MHC typing number of the MHC-binding peptide fragments; and a peptide fragment screening and neoepitope determining module configured to screen the peptide fragments based on immunogenicity, similarity to the host, and MHC typing number, thereby determining the tumor neoepitopes.

Introducing the determining module of the immunogenicity of the MHC-binding peptide fragments, the determining module of the similarity of the MHC-binding peptide fragments to the host, and the determining module of the MHC typing number of the MHC-binding peptide fragments facilitates screening for high-quality tumor neoepitopes of microbial origin.

Fig. 6 shows a schematic block diagram of a determination module of bacterial species and abundance according to an embodiment of the disclosure. As shown in fig. 6, the determination of bacterial species and abundance module 600 includes a determination of bacterial species and abundance module 610 based on predicted encoding genes, a determination of bacterial species and abundance module 620 using species annotation software, and a bacterial species selection and bacterial abundance determination module 630.

A determination module 610 based on the predicted bacterial species and abundance of the encoding gene, configured to determine bacterial species and abundance in the tumor-associated sample and the non-tumor-associated sample based on the predicted encoding gene. In some embodiments, the operation of the determination module 610 based on the predicted bacterial species and abundance of the encoding gene may be referred to the operation described above with reference to 210 of fig. 2.

A determination module 620 of bacterial species and abundance using species annotation software configured to determine bacterial species and abundance in tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data. In some embodiments, the operation of the determination module 620 using species annotation software may refer to the operation described above with reference to 220 of fig. 2.

A bacterial species selection and bacterial abundance determination module 630 configured to select bacterial species contained in both the results of determining bacterial species and abundance based on the predicted encoding genes and determining bacterial species and abundance using species annotation software as bacterial species in the determined tumor-associated and non-tumor-associated samples and determine their respective bacterial abundances. In some embodiments, the operation of the bacterial species selection module 630 may refer to the operation described above with reference to 230 of fig. 2.

Fig. 7 shows a schematic block diagram of a tumor neoepitope determination module according to one embodiment of the present disclosure. As shown in fig. 7, the tumor neoepitope determination module 700 includes a module 710 for determining the immunogenicity of MHC-binding peptide fragments, a module 720 for determining the similarity of MHC-binding peptide fragments to the host, a module 730 for determining the MHC typing number of MHC-binding peptide fragments, and a peptide fragment screening and neoepitope determination module 740.

A module 710 for determining the immunogenicity of the MHC-binding peptide fragments, configured to determine the immunogenicity of the MHC-binding peptide fragments based on the MHC-binding peptide fragments and a deep learning model built on a deep neural network. In some embodiments, the operation of module 710 may refer to the operation described above with reference to 310 of fig. 3.

A module 720 for determining the similarity of the MHC-binding peptide fragments to the host, configured to determine this similarity by aligning the MHC-binding peptide fragments against the protein sequences of the host from which the tumor-associated sample and the non-tumor-associated sample are derived. In some embodiments, the operation of module 720 may refer to the operation described above with reference to 320 of fig. 3.

A module 730 for determining the MHC typing number of the MHC-binding peptide fragments, configured to count all MHC types that may bind each MHC-binding peptide fragment, thereby determining its MHC typing number. In some embodiments, the operation of module 730 may refer to the operation described above with reference to 330 of fig. 3.

A peptide fragment screening and neoepitope determination module 740, configured to screen peptide fragments based on immunogenicity, similarity to the host, and MHC typing number, thereby determining tumor neoepitopes. In some embodiments, the operation of module 740 may refer to the operation described above with reference to 340 of fig. 3.
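As a minimal sketch of the screening performed by module 740, the following Python code filters candidate peptide fragments on the three scores. The default thresholds are the illustrative values given in Example 1 (immunogenicity of 0.9 or more, similarity to the host of 0.5 or less, MHC typing number of 2 or more). The `Candidate` class, the `screen` function, and the placeholder peptide strings are assumptions made for illustration and are not part of SmartBacNeo.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    peptide: str
    immunogenicity: float   # score from the deep learning model
    host_similarity: float  # match length / epitope length
    n_mhc_types: int        # number of MHC types predicted to bind the peptide


def screen(candidates: List[Candidate],
           min_immunogenicity: float = 0.9,
           max_host_similarity: float = 0.5,
           min_mhc_types: int = 2) -> List[Candidate]:
    """Keep candidates that pass all three illustrative thresholds."""
    return [
        c for c in candidates
        if c.immunogenicity >= min_immunogenicity
        and c.host_similarity <= max_host_similarity
        and c.n_mhc_types >= min_mhc_types
    ]


# Placeholder peptides, not real epitopes.
pool = [
    Candidate("PEPTIDEAA", 0.95, 0.33, 3),  # passes all three criteria
    Candidate("PEPTIDEBB", 0.95, 0.80, 3),  # too similar to the host
    Candidate("PEPTIDECC", 0.40, 0.10, 4),  # low immunogenicity
    Candidate("PEPTIDEDD", 0.92, 0.20, 1),  # binds only one MHC type
]
selected = screen(pool)
```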

In some embodiments, the apparatus for determining a tumor neoepitope of microbial origin further comprises a sequencing data quality control module. Fig. 8 shows a schematic block diagram of an apparatus for determining tumor neoepitopes of microbial origin according to one embodiment of the present disclosure. As shown in fig. 8, the apparatus 800 for determining tumor neoepitopes includes a metagenomic sequencing data acquisition module 801, a sequencing data quality control module 802, a coding gene prediction module 803, a module 804 for determining bacterial species and abundance based on predicted coding genes, a module 805 for determining bacterial species and abundance using species annotation software, a bacterial species selection and bacterial abundance determination module 806, a module 807 for determining significantly enriched bacteria, a known-species judgment result acquisition module 808, a module 809 for determining the genome or protein sequence of bacteria of known species, a module 810 for determining the protein sequence of bacteria of unknown species, a module 811 for determining peptide fragments capable of binding to MHC, a module 812 for determining the immunogenicity of the MHC-binding peptide fragments, a module 813 for determining the similarity of the MHC-binding peptide fragments to the host, a module 814 for determining the MHC typing number of the MHC-binding peptide fragments, and a peptide fragment screening and neoepitope determination module 815. Modules 801, 803, 804, 805, 806, 807, 808, 809, 810, 811, 812, 813, 814 and 815 are the same as modules 510, 520, 610, 620, 630, 540, 550, 560, 570, 580, 710, 720, 730 and 740, respectively, and are not described again here.

A sequencing data quality control module 802, configured to perform quality control on the metagenomic sequencing data, thereby obtaining quality-controlled metagenomic sequencing data, before metagenomic assembly is performed on the sequencing data to obtain an assembled genomic sequence and the coding genes in the genome are predicted from the assembled genomic sequence (that is, before the operation of the coding gene prediction module 803 or the coding gene prediction module 520). The operation of the sequencing data quality control module 802 may refer to the operation described above with reference to 402 of fig. 4.

In some embodiments, the criteria for quality control are: the terminal base quality is greater than Q20, the number of N bases is less than 5, and the sequence length is greater than or equal to 100 bp. In some embodiments, the sequencing data quality control module 802 is configured to remove linker sequences and low-quality sequencing data, and to obtain quality-controlled sequencing data with a terminal base quality greater than Q20, fewer than 5 N bases, and a sequence length greater than or equal to 100 bp.
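The three quality control criteria can be expressed as a per-read check. The sketch below is not the implementation of any particular tool; it assumes Phred+33 quality encoding and interprets "terminal base quality" as the quality of the first and last base of the read. Both points, and the function name `passes_qc`, are assumptions made for illustration.

```python
def passes_qc(seq: str, qual: str,
              min_terminal_q: int = 20,  # terminal base quality must exceed this
              n_limit: int = 5,          # number of N bases must be below this
              min_len: int = 100,        # minimum read length in bp
              phred_offset: int = 33) -> bool:
    """Return True if a read meets the three quality control criteria."""
    if len(seq) != len(qual) or len(seq) < min_len:
        return False
    if seq.upper().count("N") >= n_limit:
        return False
    # Assumption: "terminal" refers to both ends of the read.
    terminal_scores = (ord(qual[0]) - phred_offset, ord(qual[-1]) - phred_offset)
    return all(q > min_terminal_q for q in terminal_scores)


good_seq = "ACGT" * 25   # 100 bp, no N bases
good_qual = "I" * 100    # "I" encodes Q40 under Phred+33
```

In practice, linker removal and trimming would be applied before such a check, so that reads are evaluated in their trimmed form.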

Other aspects

In a third aspect, the present disclosure provides a computing device comprising: a processor; and a memory for storing computer-executable instructions that, when executed, cause the processor to perform the method in the embodiments.

Fig. 9 shows a schematic block diagram of a computing device according to one embodiment of the present disclosure. As can be seen in fig. 9, computing device 900 includes a Central Processing Unit (CPU) 910 (e.g., a processor) and a memory 920 coupled to Central Processing Unit (CPU) 910. The memory 920 is used to store computer-executable instructions that, when executed, cause the Central Processing Unit (CPU) 910 to perform the method of determining a tumor neoepitope derived from a microorganism in the above embodiments. A Central Processing Unit (CPU) 910 and a memory 920 are connected to each other by a bus, to which an input/output (I/O) interface is also connected. Computing device 900 may also include a number of components (not shown in fig. 9) connected to the I/O interface, including, but not limited to: an input unit such as a keyboard, a mouse, etc.; an output unit such as various types of displays, speakers, and the like; a storage unit such as a magnetic disk, an optical disk, or the like; and communication units such as network cards, modems, wireless communication transceivers, and the like. The communications unit allows the computing device 900 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.

Further, the above-described method can alternatively be implemented by a computer-readable storage medium. The computer-readable storage medium has computer-readable program instructions embodied thereon for performing various embodiments of the present disclosure. The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.

Thus, in a fourth aspect, the present disclosure proposes a computer-readable storage medium having stored thereon computer-executable instructions for performing the method of determining a tumor neoepitope of microbial origin in various embodiments of the present disclosure.

In a fifth aspect, the present disclosure proposes a computer program product tangibly stored on a computer-readable storage medium and comprising computer-executable instructions that, when executed, cause at least one processor to perform the method of determining a tumor neoepitope of microbial origin in various embodiments of the present disclosure.

Tumor neoepitope derived from microorganism

In a sixth aspect, the present disclosure also provides a protein sequence comprising the sequence set forth in any one of SEQ ID NOs 11-40. These protein sequences are exemplary microbial-derived tumor neoepitopes determined using the methods or apparatus of the present invention for determining microbial-derived tumor neoepitopes.

In a seventh aspect, the present disclosure also provides a nucleic acid sequence encoding the protein sequence of the sixth aspect. In some embodiments, the nucleic acid sequence encodes a sequence set forth in any one of SEQ ID NOs 11-40. The nucleic acid sequence may be a DNA sequence or an RNA sequence, and is a coding sequence for the protein sequence of the sixth aspect. As used herein, "coding sequence" or "coding region sequence" refers to a nucleotide sequence in a polynucleotide that can serve, in a biological process, as a template for synthesis of a polymer having a defined nucleotide sequence (e.g., tRNA and mRNA) or a defined amino acid sequence. The coding sequence may be a DNA sequence or an RNA sequence. A DNA sequence (including both the coding strand, which is identical to the mRNA sequence, and the template strand complementary to it) or an mRNA sequence is considered to encode a polypeptide if the corresponding mRNA is translated into that polypeptide in a biological process.

Preferably, the nucleic acid sequence is an mRNA sequence. The translated protein sequence of the mRNA sequence comprises the sequence shown in any one of SEQ ID NOs 11 to 40. These mRNA sequences are mRNA sequences encoding exemplary microbial-derived tumor neoepitopes, or their complements, determined using the methods or apparatus of the invention for determining microbial-derived tumor neoepitopes. In general, mRNA can comprise a 5'-UTR sequence, a coding sequence for a polypeptide, a 3'-UTR sequence, and optionally a poly(A) sequence. mRNA can be produced, for example, by in vitro transcription or chemical synthesis. In some embodiments, the mRNA of the invention is obtained by in vitro transcription from a DNA template by an RNA polymerase (e.g., T7 RNA polymerase). In some embodiments, the mRNA of the invention comprises (1) a 5'-UTR, (2) a coding sequence, (3) a 3'-UTR, and (4) an optionally present poly(A) sequence. In some embodiments, the mRNA of the invention is a nucleoside-modified mRNA. In some embodiments, the mRNA of the present invention comprises an optionally present 5' cap.

In some embodiments, the sequences shown in SEQ ID NOs 11-40 are derived from F. nucleatum (Fusobacterium nucleatum).

In some embodiments, the sequences shown in SEQ ID NOS 11-40 have affinity for the corresponding mouse MHC shown in the mouse MHC column of Table 3, respectively.

Advantageous effects

Compared with traditional tumor neoepitope discovery methods, the method and the device focus on microorganisms (such as bacteria) that live in symbiosis with tumors, provide a brand-new angle for tumor neoepitope discovery, and greatly increase the number of tumor neoepitopes that can be discovered. At the same time, because specific microorganisms (such as bacteria) are specific to tumors, the targeting of tumor treatment is effectively improved. The method and the device therefore provide an important reference for the development of tumor neoantigen vaccines, for tumor immunotherapy, and for the design of tumor immunotherapy targets.

The methods, devices and platforms (also known as SmartBacNeo) provided by the inventors can efficiently identify and screen protein fragments of intracellular microorganisms (e.g., bacteria) in tumors that are potentially recognized by the immune system. For known intratumoral symbiotic microorganisms (e.g., bacteria) or pathogenic microorganisms (e.g., bacteria, such as F. nucleatum (Fusobacterium nucleatum)), fragments of bacterial protein sequences that are recognized by the host (e.g., human) immune system can be screened rapidly and accurately.

Examples

Various exemplary embodiments of the present disclosure are described in detail below with reference to the drawings. While the exemplary methods and apparatus described below include, among other components, software and/or firmware executed on hardware, it should be noted that these examples are merely illustrative and should not be considered limiting. For example, it is contemplated that any or all of the hardware, software, and firmware components could be embodied exclusively in hardware, exclusively in software, or in any combination of hardware and software. Thus, while exemplary methods and apparatus are described below, those skilled in the art will readily appreciate that the examples provided are not intended to limit the manner in which such methods and apparatus may be implemented.

Furthermore, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and systems according to various embodiments of the present disclosure. It should be noted that the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by special purpose hardware-based systems which perform the specified functions or operations, or combinations of special purpose hardware and computer instructions.

The present disclosure is described below in terms of several embodiments.

Example 1 establishment of a tumor neoepitope identification tool SmartBacNeo based on the method for determining a tumor neoepitope derived from a microorganism provided in the present disclosure

The invention provides a tumor neoepitope identification tool, SmartBacNeo, based on metagenomic sequencing and a deep learning model. SmartBacNeo mainly comprises the steps shown in fig. 10. As shown in fig. 10, the method 1000 of determining a tumor neoepitope of microbial origin comprises step 1010, step 1020, step 1030, step 1040, step 1050, step 1060, step 1070, step 1080 and step 1090.

Step 1010, obtaining sequencing data. In this step, metagenomic sequencing data are obtained, comprising data from high-throughput sequencing of bacterial DNA in tumor-related samples and non-tumor-related samples. The number of tumor-related samples and the number of non-tumor-related samples are each not less than 10. Bacterial DNA is extracted from the tumor-related and non-tumor-related samples following the extraction method of Nejman D, et al. (The human tumor microbiome is composed of tumor type-specific intracellular bacteria [J]. Science, 2020, 368(6494): 973-980), and bacterial DNA sequencing, i.e., metagenomic sequencing, is performed on an Illumina high-throughput sequencing platform to obtain paired-end fastq files. The paired-end fastq files are copied into a specific folder. SmartBacNeo reads a sample information table provided by the user and processes the sequencing file of each sample stored in that folder according to the table, thereby acquiring the metagenomic sequencing data. The sample information table, e.g., samplelist.txt, has the following format: two tab-separated columns, the first giving the sample grouping information and the second the sample name, with one sample per line.
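The sample information table format described above (two tab-separated columns, group then sample name, one sample per line) can be read with a few lines of Python. This is a minimal sketch; the function name `read_sample_table` and the group and sample names in the example are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def read_sample_table(text: str) -> Dict[str, List[str]]:
    """Parse 'group<TAB>sample' lines into a mapping from group to sample names."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 tab-separated columns")
        group, sample = (f.strip() for f in fields)
        groups[group].append(sample)
    return dict(groups)


# Hypothetical content of samplelist.txt.
example = "tumor\tT01\ntumor\tT02\nhealthy\tH01\n"
samples = read_sample_table(example)
```

A real table would list at least 10 samples per group, as the step requires.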

Step 1020, quality control of sequencing data. In this step, with the fastq sequencing data obtained in step 1010 as input, SmartBacNeo automatically calls the fastp software to perform quality control. The quality control comprises removing the linker sequence and removing low-quality sequencing data, so as to obtain quality-controlled sequencing data (i.e., after linker removal, sequencing data with a terminal base quality greater than Q20, fewer than 5 N bases, and a sequence length greater than or equal to 100 bp).

Step 1030, gene assembly, gene prediction and determination of gene expression level. In this step, based on the quality-controlled sequencing data, SmartBacNeo automatically calls megahit to assemble the sequencing data of each sample into sequence fragments; this is called "single sample assembly". The reads from all samples that were not successfully assembled in the single sample assembly are then pooled and assembled again with megahit; this is called "hybrid assembly". Sequence fragments from both single sample assembly and hybrid assembly are thereby obtained. Based on the assembled sequence fragments, SmartBacNeo automatically calls prodigal to predict the coding genes in the genome; based on the predicted coding genes, it automatically calls cd-hit to remove redundant genes; and it automatically calls salmon to align the quality-controlled sequencing data to the predicted genes and calculate the expression level of each gene. The tools invoked are defined and operate as follows:

megahit: based on the de Bruijn graph algorithm, the quality-controlled sequencing data are broken into small fragments, and the sequence fragments of the metagenome are assembled.

prodigal: potential start codons (ATG, GTG and TTG) and stop codons (TAA, TAG and TGA) are identified in the assembled sequence fragments (DNA sequences). The open reading frame (ORF) between a start codon and a stop codon is searched, and protein-coding genes are predicted based on characteristics such as ORF length and codon usage bias.

cd-hit: similar sequences are clustered according to a defined sequence identity threshold, and a representative sequence, called the "centroid", is chosen for each cluster; it is the longest sequence in the cluster. Non-redundant sequences are generated by deleting sequences that are highly similar to the centroid.

salmon: the concept of quasi-mapping is combined with a two-phase inference procedure to provide accurate expression estimates at very high speed while using little memory. salmon performs its inference using an expressive and realistic model of the sequencing data, which takes into account experimental attributes and biases that are common in real sequencing data.
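To make the ORF search described for prodigal concrete, the following Python sketch scans the forward strand of a DNA sequence in all three reading frames and reports stretches that run from a start codon (ATG, GTG or TTG) to the next in-frame stop codon (TAA, TAG or TGA). It is a deliberately naive illustration, not prodigal's algorithm: it ignores the reverse strand, codon usage bias, and any scoring of candidate genes. The function name `find_orfs` and the minimum length are assumptions.

```python
from typing import List, Tuple

START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_len: int = 30) -> List[Tuple[int, int, int]]:
    """Return (start, end, frame) for forward-strand ORFs.

    `start` is the 0-based index of the start codon; `end` is exclusive and
    includes the stop codon. Only ORFs of at least `min_len` bases are kept.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


# Toy sequence: ATG + 10 GCC codons + TAA, shifted into frame 2 by a 2-base prefix.
toy = "CC" + "ATG" + "GCC" * 10 + "TAA" + "GG"
toy_orfs = find_orfs(toy)
```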

Step 1040, bacterial distribution statistics and determination of significantly enriched bacteria. In this step, the genes predicted in step 1030 are aligned against the sequences in the NCBI NR database using blast to predict bacterial species and to determine bacterial abundance at the same taxonomic level. Based on the quality-controlled fastq sequencing data, SmartBacNeo automatically calls MetaPhlAn to predict bacterial species and distribution and to determine bacterial abundance at the same taxonomic level. The bacterial species contained in both the result based on the predicted coding genes and the result obtained with the species annotation software are selected as the bacterial species determined for the tumor-associated and non-tumor-associated samples. For each selected species, the average of its abundance in the two results is calculated to give the corresponding bacterial abundance. SmartBacNeo then automatically calls the Wilcoxon rank sum test to identify the bacteria significantly enriched in the tumor-related samples. The screening criteria for significantly enriched bacteria are: the bacterial abundance in the tumor-associated samples is greater than or equal to 2 times that in the non-tumor-associated samples, and the p-value of the statistical test is less than or equal to 0.05. Optionally, the software R and the R package circlize are used to plot a species distribution circle graph showing the significantly enriched bacterial species and their abundance in tumor-related samples. Optionally, the software R and the R package pheatmap are used to plot a species distribution heat map showing the significantly enriched bacterial species and their abundance in tumor-related samples.
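The enrichment criteria of step 1040 can be sketched for a single bacterial species as follows. The code implements a two-sided Wilcoxon rank sum test with a tie-corrected normal approximation, using only the Python standard library, and then applies the two thresholds (fold change of at least 2, p-value of at most 0.05). It assumes that "bacterial abundance" in each group is summarized by the group mean and that a zero mean in the healthy group yields an infinite fold change; these assumptions, the function names, and the abundance values are illustrative only. A statistics package would normally be used instead of a hand-written test.

```python
import math
from statistics import mean
from typing import Sequence, Tuple


def rank_sum_p(x: Sequence[float], y: Sequence[float]) -> float:
    """Two-sided Wilcoxon rank sum p-value (normal approximation, tie-corrected)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    r1 = sum(r for r, (_, grp) in zip(ranks, pooled) if grp == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0  # all values identical: no evidence of a difference
    z = (u1 - mu) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def is_enriched(tumor: Sequence[float], healthy: Sequence[float],
                min_fold: float = 2.0, max_p: float = 0.05) -> Tuple[bool, float, float]:
    """Apply the fold-change and p-value criteria to one species."""
    t, h = mean(tumor), mean(healthy)
    if h == 0:
        fold = math.inf if t > 0 else 0.0
    else:
        fold = t / h
    p = rank_sum_p(tumor, healthy)
    return (fold >= min_fold and p <= max_p), fold, p


# Hypothetical abundances for one species in 10 tumor and 10 healthy samples.
tumor_abund = [5.0 + i for i in range(10)]
healthy_abund = [0.1 * i for i in range(10)]
flag, fold, p = is_enriched(tumor_abund, healthy_abund)
```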

Step 1050, obtaining the judgment of whether the bacterial species is known. In this step, based on the bacterial species determined in step 1040, the NCBI Genome database (https://www.ncbi.nlm.nih.gov/Genome) is searched for genomic sequence or protein sequence information of the significantly enriched bacteria. If such information is found, the species is judged to be a known species; if not, it is judged to be an unknown species.

If the significantly enriched bacterial species is known, step 1060 is used to determine the genome or protein sequence of the bacteria of the known species. In this step, the NCBI Genome database is searched for the bacterial genome or protein fasta sequence file based on the species-level name of the bacterial species significantly enriched in the tumor-associated samples.

If the significantly enriched bacterial species is unknown, step 1070 is used to determine the protein sequence of the bacteria of the unknown species. In this step, based on the correspondence between genes and bacterial species predicted in the bacterial distribution statistics of step 1040, SmartBacNeo translates the genes into a protein fasta sequence file.
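The translation of predicted genes into a protein fasta file in step 1070 can be sketched as follows. The codon table is built from the standard layout with bases in TCAG order; the code assumes that GTG and TTG at the first codon are read as methionine, as is usual for bacterial start codons, and that translation stops at the first stop codon. The function names `translate` and `to_fasta` and the record name are hypothetical.

```python
from typing import Iterable, Tuple

BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # first base T
    "LLLLPPPPHHQQRRRR"   # first base C
    "IIIMTTTTNNKKSSRR"   # first base A
    "VVVVAAAADDEEGGGG"   # first base G
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
START_CODONS = {"ATG", "GTG", "TTG"}


def translate(cds: str) -> str:
    """Translate a coding sequence up to (not including) the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if i == 0 and codon in START_CODONS:
            aa = "M"  # assumption: alternative start codons are read as Met
        else:
            aa = CODON_TABLE.get(codon, "X")  # X for codons with ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def to_fasta(records: Iterable[Tuple[str, str]]) -> str:
    """Format (name, protein) pairs as fasta text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


fasta_text = to_fasta([("gene_1", translate("ATGGCCTAA"))])
```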

Step 1080, determining bacteria-derived peptide fragments capable of binding to MHC. In this step, based on the bacterial genome or protein fasta sequence file of step 1060, or the protein fasta sequence file of step 1070, SmartBacNeo automatically calls the trained model of the open-source NetMHCpan-4.1 to predict the binding affinity of peptide fragments to MHC molecules of known sequence and to screen for peptide fragments with HLA class I affinity. SmartBacNeo also automatically calls the trained model of the open-source BigMHC to predict the binding affinity of peptide fragments to MHC molecules of known sequence, as an auxiliary validation of the peptide fragments predicted by NetMHCpan-4.1. The tools invoked are defined and operate as follows:

NetMHCpan-4.1: predicts the binding of peptide fragments to MHC molecules of known sequence using artificial neural networks. The predicted affinity is reported as the scores SW_score A and SW_score B: the smaller SW_score A and SW_score B are, the more strongly a peptide fragment binds to the corresponding MHC molecule.

BigMHC: predicts the binding affinity of peptide fragments to MHC molecules of known sequence using deep neural networks. The predicted affinity is reported as the score SW_score C: the higher SW_score C is, the more strongly a peptide fragment binds to the corresponding MHC molecule.
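Before binding can be predicted, candidate peptide fragments have to be enumerated from the protein sequences. The sketch below slides windows over a protein and collects unique fragments. The lengths 8 to 11 are an assumption (they are typical for MHC class I ligands) and are not specified in this disclosure; the function name and the toy protein are likewise hypothetical. The prediction tools themselves are not called here.

```python
from typing import Iterable, List


def enumerate_peptides(protein: str, lengths: Iterable[int] = (8, 9, 10, 11)) -> List[str]:
    """Return unique peptide fragments of the given lengths, in order of first occurrence."""
    seen = set()
    peptides = []
    for k in lengths:
        for i in range(len(protein) - k + 1):
            pep = protein[i:i + k]
            if pep not in seen:
                seen.add(pep)
                peptides.append(pep)
    return peptides


# Toy 10-residue protein: three 8-mers and two 9-mers.
toy_peptides = enumerate_peptides("MKTAYIAKQR", lengths=(8, 9))
```

Each enumerated fragment would then be paired with every MHC type of interest and passed to the predictors, giving one score set per HLA-peptide pair.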

Step 1090, determining tumor neoepitopes. In this step, the trained model of the open-source BigMHC is called to predict the immunogenicity of the peptide fragments; a score for the number of MHC types that bind the same peptide fragment is introduced, shown as "number_of_HLAs" in Table 2; and a score for the similarity of MHC-binding peptide fragments to host protein sequences is introduced, shown as "similarity (match length/epitope length)" in Table 2. Tumor neoepitopes are determined from these scores. The multi-MHC-typing score reflects the number of MHC types that can bind the same peptide fragment; the higher this score, the more MHC types the peptide fragment may bind, and the stronger the immune activation it may exert. Among the candidate peptide fragments, those with higher scores are preferred for further study. Illustratively, among the candidate peptide fragments, peptide fragments having an immunogenicity score of 0.9 or more, a similarity score with the host of 0.5 or less, and an MHC typing number score of 2 or more are selected as tumor neoepitopes.
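The two scores introduced in step 1090 can be illustrated with a short Python sketch. The similarity is computed here as the length of the longest exact run shared between the epitope and any host protein, divided by the epitope length; treating "match length" as an exact shared run is an assumption, since the disclosure obtains it by sequence alignment. The number of HLA types per peptide is counted from HLA-peptide pairs. All names, peptides and host sequences below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def longest_shared_run(epitope: str, protein: str) -> int:
    """Length of the longest substring common to both sequences."""
    best = 0
    prev = [0] * (len(protein) + 1)
    for a in epitope:
        cur = [0] * (len(protein) + 1)
        for j, b in enumerate(protein, start=1):
            if a == b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def host_similarity(epitope: str, host_proteins: List[str]) -> float:
    """Match length / epitope length, taking the best match over all host proteins."""
    if not epitope:
        return 0.0
    match = max((longest_shared_run(epitope, p) for p in host_proteins), default=0)
    return match / len(epitope)


def count_hlas(pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count the distinct HLA types predicted to bind each peptide."""
    by_peptide = defaultdict(set)
    for hla, peptide in pairs:
        by_peptide[peptide].add(hla)
    return {pep: len(hlas) for pep, hlas in by_peptide.items()}


sim = host_similarity("ACDEFGHIK", ["MMACDEFQQQ"])  # shares the run "ACDEF"
hla_counts = count_hlas([
    ("HLA-A*02:01", "ACDEFGHIK"),
    ("HLA-B*07:02", "ACDEFGHIK"),
    ("HLA-A*02:01", "LMNPQRSTV"),
])
```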

Example 2 Prediction of bacterial-derived tumor neoepitopes using the tumor neoepitope identification tool SmartBacNeo

In this exemplary embodiment, the SmartBacNeo tool was used to predict bacterial-derived tumor neoepitopes. The specific operations may refer to the operations described above with reference to 1010, 1020, 1030, 1040, 1050, 1060, 1070, 1080 and 1090 of fig. 10. In this embodiment, in step 1010, the tumor-related samples are stool samples of 10 colorectal cancer patients, and the non-tumor-related samples are stool samples of 10 healthy people.

Table 1 shows the bacteria significantly enriched in the fecal samples of tumor patients, obtained with SmartBacNeo through steps 1010, 1020, 1030 and 1040 in this example. As shown in Table 1, bacteria associated with colorectal cancer, such as Fusobacterium nucleatum, were identified from the human colorectal cancer samples. Yachida S, et al. (Metagenomic and metabolomic analyses reveal distinct stage-specific phenotypes of the gut microbiota in colorectal cancer [J]. Nature Medicine, 2019, 25(6): 968-976) showed that F. nucleatum (Fusobacterium nucleatum) is highly enriched in colon cancer tissue and may be strongly correlated with the occurrence of colorectal cancer. This agreement supports the feasibility of SmartBacNeo.

Table 1. Significantly enriched bacteria in the intestinal tract of colorectal cancer patients, as analyzed by SmartBacNeo

Species: species-level name of the bacterium

Tumor_mean: average bacterial abundance in the intestinal tract of tumor patients

Healthy_mean: average bacterial abundance in the intestinal tract of healthy people

log2Foldchange: log2 of the ratio of bacterial abundance in the intestinal tract of tumor patients to that in the intestinal tract of healthy people

p-value: p-value of the statistical test

Inf: infinity

In addition, through step 1040, this example obtained an exemplary species distribution circle graph showing the significantly enriched bacterial species and their abundance in the feces of tumor patients, using the software R and the R package circlize, as shown in fig. 11. The figure visually shows the bacterial species that are significantly enriched in the fecal samples of tumor patients. The left part of the graph shows the proportion of the different groups within each bacterial species; the right part shows the proportion of the different bacterial species within each group. The bacterial species enriched in the feces of tumor patients can be clearly identified from the distribution circle graph.

Through step 1040, this example also obtained an exemplary species distribution heat map showing the significantly enriched bacterial species and their abundance in the feces of tumor patients, using the software R and the R package pheatmap, as shown in fig. 12. The color block at the top represents the grouping information of the samples, and the hierarchical clustering tree represents the similarity of species composition between samples: the closer two samples are in the tree, the more similar their species distributions. Colors from light to dark indicate relative species abundance from low to high.

The results of the exemplary analysis of step 1080 of this example are shown in Table 2. Each row of Table 2 is one HLA-peptide pair, and the scores are those used to assess the affinity to HLA and the immunogenicity of the peptide fragments predicted by SmartBacNeo. The reliability scores of the predicted affinity peptide fragments are SW_score A, SW_score B and SW_score C. The smaller SW_score A and SW_score B are, the higher the reliability of the result; the larger SW_score C is, the higher the reliability of the result.

Table 2. Exemplary neoepitope prediction results

HLA_type: HLA typing information

Epitope: predicted epitope information

Length: length of epitope sequence

SW_score A: scoring criterion A

SW_score B: scoring criterion B

SW_score C: scoring criterion C

Similarity (match length/epitope length): similarity of epitope sequences to host SwissProt protein sequences

name_in_multi_hlas: HLA typing information having the same epitope

Number_of_hlas: total number of HLA types having the same epitope

The received_in_multi_peptides: HLA typing information for peptide fragments containing the same epitope

Number_of_pep_hlas: total number of peptide fragments containing the same epitope

In this exemplary embodiment, SmartBacNeo was also used to predict peptide fragments with MHC class I affinity from F. nucleatum, a common tumor-associated pathogen. In step 1080, peptide fragments with rank% less than 0.5 in the NetMHCpan-4.1 analysis results are screened. If more than 50 peptide fragments pass this screen, BigMHC is invoked in step 1080 and the peptide fragments with the highest "BigMHC score (immunogenicity score)" are preferred; in step 1090, the peptide fragments with the highest score for the number of MHC types binding the same peptide fragment, and with the lowest score for similarity of the MHC-binding peptide fragment to host protein sequences, are selected. Illustratively, candidate peptide fragments with an immunogenicity score of 0.9 or more, a similarity score with the host of 0.5 or less, and an MHC typing number score of 2 or more are selected as tumor neoepitopes.
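The selection cascade described above can be sketched as a filter followed by a ranking. The rank% cutoff of 0.5 and the limit of 50 peptide fragments come from the text; the order of the ranking keys (immunogenicity first, then number of HLA types, then host similarity) and the truncation to the top 50 are assumptions made for illustration, as are the function name and the field names of the input rows.

```python
from typing import Dict, List


def prioritise(rows: List[Dict], max_rank: float = 0.5, max_candidates: int = 50) -> List[Dict]:
    """Keep strong binders; if too many remain, rank them and keep the top ones."""
    strong = [r for r in rows if r["rank_percent"] < max_rank]
    if len(strong) <= max_candidates:
        return strong
    ranked = sorted(
        strong,
        # Assumed priority: high immunogenicity, many HLA types, low host similarity.
        key=lambda r: (-r["immunogenicity"], -r["n_hlas"], r["similarity"]),
    )
    return ranked[:max_candidates]


# Hypothetical prediction rows: 60 strong binders and one weak binder.
rows = [
    {"peptide": f"P{i}", "rank_percent": 0.1, "immunogenicity": i / 100,
     "n_hlas": 1, "similarity": 0.2}
    for i in range(60)
]
rows.append({"peptide": "weak", "rank_percent": 0.7, "immunogenicity": 0.99,
             "n_hlas": 3, "similarity": 0.0})
picked = prioritise(rows)
```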

SmartBacNeo has two core functions: 1) identifying and screening bacteria that are specifically enriched in tumor-associated samples; 2) predicting and selecting, from different bacterial species, peptide fragments that can be bound by the host MHC and activate the host immune response. Using SmartBacNeo, a number of neoepitope sequences were predicted from the protein sequences of Fusobacterium nucleatum, some of which have been validated. As shown in Table 3, "neoepitope sequence-1" (human MHC binding) lists predicted peptide fragments from Fusobacterium nucleatum proteins that may activate the human immune response. These 10 peptide fragments (SEQ ID NOs: 1-10) are derived from Fusobacterium nucleatum in melanoma and were validated by Kalaora et al. as being bound by human MHC (Kalaora S, et al. Identification of bacteria-derived HLA-bound peptides in melanoma [J]. Nature, 2021, 592(7852): 138-143). This supports the feasibility of using the SmartBacNeo tool to determine tumor neoepitopes.

If a predicted neoepitope sequence is genuine, it should be recognized by the immune system and thereby lead to the killing of tumor cells. To verify the credibility of the predicted neoepitope sequences, and considering that it is difficult to perform the relevant experiments directly in humans, the applicant has chosen to first verify the feasibility and credibility of the neoepitope sequences predicted by SmartBacNeo in a mouse experiment.

First, in step 1080 (determining the bacteria-derived MHC-binding peptide fragments), based on the bacterial genome or protein FASTA sequence file of step 1060, or the protein FASTA sequence file of step 1070, the applicant had SmartBacNeo automatically invoke the open-source NetMHCpan-4.1, a trained model, to predict the binding affinity of the peptide fragments to mouse MHC molecules of known sequence, and thereby screened for peptide fragments with mouse MHC affinity. Finally, in step 1090 (determining the tumor neoepitope), mouse neoepitope sequences against mouse MHC molecules were obtained (shown as "New epitope sequence-2" in Table 3).
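Before binding affinity can be predicted, candidate peptide fragments have to be drawn from the protein FASTA file. The sketch below shows one common way to do this with a sliding window. It is an assumption rather than the documented SmartBacNeo procedure: the function names are hypothetical, and the window lengths of 8 to 11 residues are chosen only because the sequences in Table 3 fall in that range.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of record id -> sequence."""
    records, name, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            # Use the first word of the header as the record id.
            name, chunks = (line[1:].split() or ["unnamed"])[0], []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def enumerate_peptides(sequence, lengths=(8, 9, 10, 11)):
    """Return unique sliding-window peptides made of standard residues only."""
    seen, peptides = set(), []
    for k in lengths:
        for i in range(len(sequence) - k + 1):
            pep = sequence[i:i + k]
            # Skip windows containing ambiguous residues such as X.
            if pep not in seen and set(pep) <= STANDARD_AMINO_ACIDS:
                seen.add(pep)
                peptides.append(pep)
    return peptides
```

The resulting peptide list would then be passed to the affinity predictor together with the MHC types of interest.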

Second, the applicant plans to prepare a mouse tumor model for further experiments to verify the feasibility and credibility of the neoepitope sequences predicted by SmartBacNeo. The specific method is as follows: Fusobacterium nucleatum (ATCC 25586) and a mouse-derived esophageal squamous carcinoma cell line mEC are co-cultured, and the mixed culture is inoculated subcutaneously into C57BL/6 mice to prepare a mouse tumor model. The mice are divided into experimental and control groups. The experimental group consists of tumor-bearing mice injected with a vaccine containing the mouse neoepitope sequences in Table 3, grouped according to the neoepitope sequence to be verified. The control group consists of tumor-bearing mice without vaccine injection. This example can demonstrate the feasibility and credibility of the neoepitope sequences predicted by SmartBacNeo through vaccine-mediated inhibition or killing of tumor cells. If the subcutaneous tumor cells of the mice in the experimental group are inhibited or undergo apoptosis while the subcutaneous tumor cells of the mice in the control group grow normally, this indicates that the vaccine comprising the neoepitope sequences predicted by SmartBacNeo has an inhibitory or killing effect on tumor cells infected with Fusobacterium nucleatum, thereby verifying the feasibility of SmartBacNeo.

TABLE 3 Exemplary Fusobacterium nucleatum neoepitope sequences predicted by SmartBacNeo

Human MHC	New epitope sequence-1	SEQ ID NO:	Mouse MHC	New epitope sequence-2	SEQ ID NO:
HLA-A02:01	ATSDLNDLY	1	H-2-Db	KAIEFMQTM	11
HLA-A11:01	FSDKMVDYL	2	H-2-Dd	LSIQNFTVL	12
HLA-A24:02	FTTDTAAAL	3	H-2-Dq	ISFKNVITF	13
HLA-B40:01	HRYPDRVLL	4	H-2-Kb	VSHENLSLL	14
HLA-B46:01	IAHTNPNTL	5	H-2-Kd	IAMTSYTPL	15
HLA-C01:02	IASDVSAIL	6	H-2-Kk	VSYEIQDTM	16
HLA-C03:04	LAHTNPNTL	7	H-2-Kq	IAISFFDNI	17
HLA-C06:02	NSDADPMSY	8	H-2-Ld	SAFGVIATL	18
HLA-C07:02	QQLETPIML	9	H-2-Lq	HSVEFLQYL	19
HLA-C08:01	TVDHAAITL	10	H-2-Qa1	ITLQNYFRM	20
			H-2-Qa2	MVYNNLYEL	21
				SGILFIHYI	22
				AAPGRFEAL	23
				VSYSNEAKI	24
				KAIEFVDRI	25
				VAMKKLPEL	26
				IGFGNYISV	27
				IMVRNHAKL	28
				ISPTSFFQI	29
				SALWPFSTM	30
				FGYNIPTL	31
				STYRIITEI	32
				VSLNIIEL	33
				SAYIPTNVISI	34
				YIPTNVISI	35
				IAILGMDEL	36
				VFYLHSRL	37
				AVYNHYKRI	38
				LADDNFSTI	39
				RGVPQIEVTF	40

Other embodiments

Fig. 9 shows a schematic block diagram of a computing device according to one embodiment of the present disclosure. As can be seen in fig. 9, computing device 900 includes a Central Processing Unit (CPU) 910 (e.g., a processor) and a memory 920 coupled to Central Processing Unit (CPU) 910. The memory 920 is used to store computer-executable instructions that, when executed, cause the Central Processing Unit (CPU) 910 to perform the methods in the above embodiments. A Central Processing Unit (CPU) 910 and a memory 920 are connected to each other by a bus, to which an input/output (I/O) interface is also connected. Computing device 900 may also include a number of components (not shown in fig. 9) connected to the I/O interface, including, but not limited to: an input unit such as a keyboard, a mouse, etc.; an output unit such as various types of displays, speakers, and the like; a storage unit such as a magnetic disk, an optical disk, or the like; and communication units such as network cards, modems, wireless communication transceivers, and the like. The communications unit allows the computing device 900 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.

Accordingly, in another embodiment, the present disclosure presents a computing device comprising a processor; and a memory for storing computer-executable instructions that, when executed, cause the processor to perform the method of determining tumor neoepitopes in various embodiments of the present disclosure.

Thus, in another embodiment, the present disclosure presents a computer-readable storage medium having stored thereon computer-executable instructions for performing the method of determining a tumor neoepitope in various embodiments of the present disclosure.

In another embodiment, the present disclosure proposes a computer program product tangibly stored on a computer-readable storage medium and comprising computer-executable instructions that, when executed, cause at least one processor to perform the method of determining a tumor neoepitope in various embodiments of the present disclosure.

In general, the various example embodiments of the disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic, or any combination thereof. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of the embodiments of the present disclosure are illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.

Computer readable program instructions or computer program products for executing the various embodiments of the present disclosure can also be stored in the cloud. When they need to be invoked, a user can access the computer readable program instructions stored in the cloud through the mobile internet, a fixed network, or other networks, and thereby implement the technical solutions disclosed in the various embodiments of the present disclosure.

While embodiments of the present disclosure have been described with reference to several particular embodiments, it should be understood that embodiments of the present disclosure are not limited to the particular embodiments of the disclosure. The embodiments of the disclosure are intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims. The scope of the claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.

Claims

1. A method of determining a tumor neoepitope of microbial origin, the method comprising:

obtaining metagenome sequencing data, wherein the metagenome sequencing data comprises sequencing data obtained by sequencing bacterial DNA in tumor-related samples and non-tumor-related samples in a high throughput manner;

performing metagenome assembly based on the metagenome sequencing data to obtain an assembled genome sequence, and predicting coding genes in the genome based on the assembled genome sequence;

determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on predicted encoding genes and/or the metagenomic sequencing data;

determining bacteria that are significantly enriched in the tumor-associated sample;

obtaining a known species determination of a bacterium, the known species determination of a bacterium indicating whether the significantly enriched bacterium is a known species;

determining, based on the significantly enriched bacteria, the genome or protein sequence of bacteria of a known species from a known genome database; or alternatively

determining a protein sequence of a bacterium of an unknown species from the predicted encoding gene based on the significantly enriched bacterium;

predicting the binding affinity of the peptide fragment in the protein sequence of the known species of bacteria or the peptide fragment in the protein sequence of the unknown species of bacteria to the MHC, screening the peptide fragment in the protein sequence of the known species of bacteria that can bind to the MHC or the peptide fragment in the protein sequence of the unknown species of bacteria, and thereby determining the peptide fragment that can bind to the MHC; and

determining the immunogenicity, host similarity and MHC typing number of said MHC-binding peptide fragments based on said MHC-binding peptide fragments, and screening peptide fragments based on said immunogenicity, host similarity and MHC typing number, thereby determining the tumor neoepitope.

2. The method of claim 1, wherein

the tumor-related sample is a donor tumor tissue sample or a fecal sample of a tumor patient, and/or

the non-tumor-related sample is a donor paracancerous tissue sample, a normal tissue sample, or a fecal sample of a healthy population.

3. The method of claim 1 or 2, the method further comprising: before performing metagenome assembly based on the metagenome sequencing data to obtain an assembled genome sequence and predicting coding genes in the genome based on the assembled genome sequence, performing quality control on the metagenome sequencing data to obtain quality-controlled metagenome sequencing data; preferably, the quality control criteria are: a terminal base quality greater than Q20, fewer than 5 N bases, and a sequence length greater than or equal to 100 bp.

4. The method of any one of claims 1-3, wherein

determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on predicted encoding genes and/or the metagenomic sequencing data comprises:

determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on the predicted encoding genes; or alternatively

determining bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data; or alternatively

determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on the predicted encoding genes;

determining bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data; and

selecting the bacterial species contained in both the result of determining bacterial species and abundance based on the predicted encoding genes and the result of determining bacterial species and abundance using species annotation software as the bacterial species in the tumor-related and non-tumor-related samples, and determining their corresponding bacterial abundances.

5. The method of any one of claims 1-4, wherein

the length of the assembled genome sequence is greater than or equal to 90 bp.

6. The method of claim 4 or 5, wherein

determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on the predicted encoding genes comprises:

aligning the predicted encoding genes with sequences in a known database to predict bacterial species, and determining bacterial abundance at the same taxonomic level; preferably, the input sequence for the sequence alignment is the translated protein sequence of the predicted encoding genes.

7. The method of any one of claims 4-6, wherein

performing metagenome assembly based on the metagenome sequencing data to obtain an assembled genome sequence, and predicting coding genes in the genome based on the assembled genome sequence, comprises:

performing metagenome assembly based on the metagenome sequencing data after quality control to obtain an assembled genome sequence, and predicting coding genes in the genome based on the assembled genome sequence; and/or

determining bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data comprises:

using species annotation software to determine bacterial species and abundance in the tumor-related and non-tumor-related samples based on the quality-controlled metagenomic sequencing data.

8. The method of any one of claims 1-7, wherein

determining bacteria that are significantly enriched in the tumor-associated sample comprises: determining the significantly enriched bacteria by a Wilcoxon rank sum test; preferably, the screening criteria for the significantly enriched bacteria are: the bacterial abundance in the tumor-related samples is greater than or equal to 2 times that in the non-tumor-related samples, and the p-value of the statistical test is less than or equal to 0.05.

9. The method of any one of claims 1-8, wherein

the MHC is a high-frequency HLA of the Chinese population.

10. The method of any one of claims 1-9, wherein

the screening criteria for peptide fragments in the protein sequences of bacteria of known species or of unknown species that can bind to MHC are: the affinity ranks in the top 0.5%, and the affinity is greater than 0 and less than or equal to 500 nM.

11. The method of any one of claims 1-10, wherein

determining the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment comprises:

determining the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment and a deep learning model built with a deep neural network; preferably, determining the immunogenicity of the MHC-binding peptide fragment further comprises scoring according to the deep learning model, wherein the higher the score, the higher the immunogenicity; and/or

determining the similarity of the MHC-binding peptide fragment to a host based on the MHC-binding peptide fragment comprises:

determining the similarity of the MHC-binding peptide fragment to the host by aligning the MHC-binding peptide fragment with the protein sequences of the host from which the tumor-associated sample and the non-tumor-associated sample are derived; preferably, determining the similarity of the MHC-binding peptide fragment to the host further comprises introducing an MHC-bindable peptide fragment to host protein sequence similarity score, wherein the score is an output of the sequence alignment, and the higher the score, the higher the similarity to the host; and/or

determining the MHC typing number of the MHC-binding peptide fragment based on the MHC-binding peptide fragment comprises:

counting the number of all MHC types to which the MHC-binding peptide fragment may bind, thereby determining the MHC typing number of the MHC-binding peptide fragment; preferably, determining the MHC typing number of the MHC-binding peptide fragment further comprises introducing a multiple MHC typing identical binding peptide fragment score for the number of all MHC types to which each peptide fragment may bind, wherein the higher the score, the greater the MHC typing number of the MHC-binding peptide fragment.

12. The method of claim 11, wherein

screening peptide fragments based on the immunogenicity, host similarity and MHC typing number, and thereby determining the tumor neoepitope, comprises:

screening for peptide fragments with a high deep learning model score, a high multiple MHC typing identical binding peptide fragment score, and a low MHC-bindable peptide fragment to host protein sequence similarity score, thereby determining the tumor neoepitope.

13. An apparatus for determining a tumor neoepitope of microbial origin, comprising:

a metagenome sequencing data acquisition module configured to acquire metagenome sequencing data comprising sequencing data after high-throughput sequencing of bacterial DNA in tumor-related samples and non-tumor-related samples;

a coding gene prediction module configured to perform metagenome assembly based on the metagenome sequencing data, obtain an assembled genome sequence, and predict a coding gene in a genome based on the assembled genome sequence;

a determination module of bacterial species and abundance configured to determine bacterial species and abundance in the tumor-related and non-tumor-related samples based on predicted encoding genes and/or the metagenomic sequencing data;

a determination module of significantly enriched bacteria configured to determine significantly enriched bacteria in the tumor-associated sample;

a bacteria known species judgment result acquisition module configured to acquire a bacteria known species judgment result indicating whether the significantly enriched bacteria is a known species;

a determining module of a genomic or protein sequence of a bacterium of a known species configured to determine, based on the significantly enriched bacterium, the genomic or protein sequence of the bacterium of the known species from a known genomic database;

a determining module of a protein sequence of a bacterium of an unknown species configured to determine a protein sequence of a bacterium of an unknown species from the predicted encoding gene based on the significantly enriched bacterium;

a determining module of peptide fragments capable of binding to MHC configured to predict binding affinity of peptide fragments in the protein sequence of the known species of bacteria or peptide fragments in the protein sequence of the unknown species of bacteria to MHC, screening peptide fragments in the protein sequence of the known species of bacteria capable of binding to MHC or peptide fragments in the protein sequence of the unknown species of bacteria, and thereby determining peptide fragments capable of binding to MHC; and

a tumor neoepitope determining module configured to determine the immunogenicity, host similarity and MHC typing number of the MHC-binding peptide fragments based on the MHC-binding peptide fragments, and to screen the peptide fragments based on the immunogenicity, host similarity and MHC typing number, thereby determining a tumor neoepitope.

14. The apparatus of claim 13, further comprising a sequencing data quality control module configured to perform quality control on the metagenomic sequencing data, before metagenome assembly is performed based on the metagenomic sequencing data to obtain an assembled genome sequence and coding genes in the genome are predicted based on the assembled genome sequence, thereby obtaining quality-controlled metagenomic sequencing data; preferably, the quality control criteria are: a terminal base quality greater than Q20, fewer than 5 N bases, and a sequence length greater than or equal to 100 bp.

15. The apparatus of claim 13 or 14, wherein the determination module of bacterial species and abundance comprises:

a determination module of bacterial species and abundance based on the predicted encoding genes, configured to determine bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on the predicted encoding genes; or alternatively

a determination module of bacterial species and abundance using species annotation software configured to determine bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data; or alternatively

a determination module of bacterial species and abundance based on the predicted encoding genes, configured to determine bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on the predicted encoding genes;

a determination module of bacterial species and abundance using species annotation software configured to determine bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data; and

a bacterial species selection and bacterial abundance determination module configured to select bacterial species contained in both the results of determining bacterial species and abundance based on the predicted encoding genes and determining bacterial species and abundance using species annotation software as bacterial species in the determined tumor-associated and non-tumor-associated samples and determine their respective bacterial abundances.

16. The apparatus of any one of claims 13-15, wherein the tumor neoepitope determining module comprises:

a determining module of immunogenicity of an MHC-binding peptide fragment configured to determine the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment and a deep learning model built with a deep neural network;

a determining module of the similarity of the MHC-binding peptide fragment to the host configured to align the MHC-binding peptide fragment with the protein sequences of the host from which the tumor-associated sample and the non-tumor-associated sample are derived, and thereby determine the similarity of the MHC-binding peptide fragment to the host;

a determining module of the MHC typing number of the MHC-binding peptide fragment configured to count the number of all MHC types to which the MHC-binding peptide fragment may bind, thereby determining the MHC typing number of the MHC-binding peptide fragment; and

a peptide fragment screening and neoepitope determining module configured to screen peptide fragments based on the immunogenicity, the similarity to host, and the MHC typing number, thereby determining a tumor neoepitope.

17. A computing device, comprising:

a processor; and

a memory for storing computer-executable instructions that, when executed, cause the processor to perform the method of any one of claims 1-12.

18. A computer readable storage medium having stored thereon computer executable instructions for performing the method according to any of claims 1-12.

19. A computer program product tangibly stored on a computer-readable storage medium and comprising computer-executable instructions that, when executed, cause at least one processor to perform the method of any one of claims 1-12.

20. A protein sequence comprising the sequence set forth in any one of SEQ ID NOs: 11-40.

21. A nucleic acid sequence encoding the protein sequence of claim 20; preferably, the nucleic acid sequence is an mRNA sequence.