CN116525009A - Method and device for determining tumor neoepitope from microorganism
- Publication number
- CN116525009A; CN202310487903.6A
- Authority
- CN
- China
- Prior art keywords
- tumor
- mhc
- determining
- species
- sequence
- Prior art date
- Legal status (an assumption, not a legal conclusion)
- Pending
Classifications
- G—PHYSICS
  - G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    - G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
      - G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
        - G16B30/20—Sequence assembly
        - G16B30/10—Sequence alignment; Homology search
      - G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
      - G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
  - Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    - Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
      - Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
        - Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- Medical Informatics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Analytical Chemistry (AREA)
- Chemical & Material Sciences (AREA)
- Bioethics (AREA)
- Databases & Information Systems (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The present application relates to a method, apparatus, computing device, computer-readable storage medium and computer program product for determining a tumor neoepitope of microbial origin. Based on known microbial protein sequences or microbial protein sequences predicted from metagenomes, it can predict and screen protein fragments recognizable by the host immune system, enabling quick, efficient and highly accurate determination of neoepitopes.
Description
Technical Field
The present disclosure relates to the field of bioinformatics and tumor immunotherapy, and more particularly to methods, apparatus, computing devices, computer readable storage media and computer program products for determining tumor neoepitopes of microbial origin.
Background
Tumor-specific antigens, also known as tumor neoantigens, are antigens produced only in tumor cells. They can bind to human leukocyte antigen (HLA), be recognized by CD4+ and CD8+ T cells, and activate the body's anti-tumor immune response (Zhang, Z., et al., Neoantigen: A New Breakthrough in Tumor Immunotherapy. Front Immunol, 2021. 12: p. 672356). Neoantigens arise from many sources, including single nucleotide variations (SNVs), insertions/deletions (INDELs), transcript splice variants, gene fusions, and the like. Because neoantigens are absent from normal tissue cells, they bypass central tolerance, and off-target damage to non-tumor tissue can be avoided. Neoantigens have therefore become a new target for tumor immunotherapy, are well suited to constructing cancer vaccines, and have broad therapeutic prospects and clinical application value.
Studies have shown that microorganisms such as bacteria are present in and invade tumors, and that protein fragments of bacteria that have invaded tumor cells can be presented on the tumor cell surface and recognized by the immune system, thereby activating immune cells, enhancing their recognition of tumor cells, and killing the tumor cells. Kalaora S, et al. (Identification of bacteria-derived HLA-bound peptides in melanoma. Nature. 2021 Apr;592:138-143) proposed that bacterial peptides presented on tumor cells can serve as potential targets for immunotherapy, shedding light on the mechanisms by which bacteria influence immune system activation and therapeutic response. Using 16S rRNA gene sequencing and HLA peptidomics, the team identified intratumoral bacteria in melanoma, obtained their genomic profiles, and identified bacterial peptide sequences that can be recognized by the immune system, finally identifying nearly 300 peptides from 41 different bacteria presented by HLA protein complexes on the melanoma cell surface. Many of the bacteria-derived peptides are shared between different metastases of the same patient or between tumors of different patients, and therefore also have a strong capacity to trigger immune activation.
Metagenomic sequencing based on high-throughput sequencing (next-generation sequencing, NGS) can accurately identify microorganisms at the species level and predict the genes in a microbial genome and the proteins they express, which helps to determine tumor neoepitopes derived from microorganisms.
Disclosure of Invention
A first aspect of the present disclosure proposes a method of determining a tumor neoepitope of microbial origin, the method comprising: obtaining metagenome sequencing data, wherein the metagenome sequencing data comprises sequencing data obtained by high-throughput sequencing of bacterial DNA in tumor-related samples and non-tumor-related samples; performing metagenome assembly based on the metagenome sequencing data to obtain an assembled genome sequence, and predicting coding genes in the genome based on the assembled genome sequence; determining bacterial species and abundance in the tumor-related and non-tumor-related samples based on the predicted coding genes and/or the metagenome sequencing data; determining bacteria that are significantly enriched in the tumor-related samples; obtaining a known-species determination result indicating whether each significantly enriched bacterium is a known species; for significantly enriched bacteria of a known species, determining the genome or protein sequence of the bacteria from a known genome database; or, for significantly enriched bacteria of an unknown species, determining the protein sequence of the bacteria from the predicted coding genes; predicting the binding affinity to MHC of the peptide fragments in the protein sequence of the bacteria of the known species or of the bacteria of the unknown species, and screening those peptide fragments to determine the peptide fragments that can bind to MHC; and determining the immunogenicity, host similarity and number of MHC types of the MHC-binding peptide fragments, and screening the peptide fragments based on the immunogenicity, host similarity and number of MHC types, thereby determining tumor neoepitopes.
Optionally, in one embodiment of the above aspect, the tumor-associated sample is a donor tumor tissue sample or a tumor patient stool sample.
Optionally, in one embodiment of the above aspect, the non-tumor associated sample is a donor paracancerous tissue sample, a normal tissue sample, or a stool sample of a healthy population.
Optionally, in one embodiment of the above aspect, the method further comprises: before performing metagenome assembly based on the metagenome sequencing data to obtain an assembled genome sequence and predicting coding genes in the genome based on the assembled genome sequence, performing quality control on the metagenome sequencing data to obtain quality-controlled metagenome sequencing data.
Optionally, in one embodiment of the above aspect, the quality control criteria are: the terminal base quality is greater than Q20, the number of N bases is less than 5, and the sequence length is greater than or equal to 100 bp.
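As an illustration only, the quality-control criteria above could be checked per read roughly as follows. This is a minimal sketch under stated assumptions: qualities are Phred+33 encoded, "terminal base quality" is read as the quality of the first and last base of the read, and the function names are hypothetical rather than part of the disclosed method.

```python
def phred33(ch: str) -> int:
    """Decode one Phred+33 quality character (assumed encoding)."""
    return ord(ch) - 33


def passes_qc(seq: str, qual: str, min_q: int = 20,
              max_n: int = 5, min_len: int = 100) -> bool:
    """Return True if a read satisfies the three criteria.

    Assumption: 'terminal base quality' means the first and last base.
    """
    if not seq or len(seq) != len(qual):
        return False
    if len(seq) < min_len:                 # sequence length >= 100 bp
        return False
    if seq.upper().count("N") >= max_n:    # number of N bases < 5
        return False
    # terminal base quality strictly greater than Q20
    return phred33(qual[0]) > min_q and phred33(qual[-1]) > min_q
```

A real pipeline would more likely trim low-quality ends with a dedicated tool before applying the length filter; the sketch only shows how the three thresholds combine.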
Optionally, in one embodiment of the above aspect, determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on the predicted coding genes and/or the metagenomic sequencing data comprises: (i) determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on the predicted coding genes; or (ii) determining bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data; or (iii) determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on the predicted coding genes, determining bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data, and selecting the bacterial species contained in both results as the bacterial species determined for the tumor-associated and non-tumor-associated samples, and determining their corresponding bacterial abundances.
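The third option, keeping only species reported by both approaches, amounts to a set intersection. The sketch below is illustrative; the function name is hypothetical, and because the text does not state which abundance value is retained for a shared species, the sketch assumes the value from the species annotation software.

```python
def consensus_species(gene_based: dict[str, float],
                      annotation_based: dict[str, float]) -> dict[str, float]:
    """Keep species found by both methods.

    Assumption: the abundance reported for a shared species is the one
    from the annotation-software result; the source does not specify this.
    """
    shared = gene_based.keys() & annotation_based.keys()
    return {species: annotation_based[species] for species in sorted(shared)}
```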
Optionally, in one embodiment of the above aspect, the length of the assembled genomic sequence is 90 bp or greater.
Optionally, in one embodiment of the above aspect, determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on the predicted coding genes comprises: performing sequence alignment of the predicted coding genes against sequences in a known database to predict the bacterial species and to determine bacterial abundance at each taxonomic level.
Optionally, in one embodiment of the above aspect, the input sequence for the sequence alignment is the protein sequence translated from the predicted coding gene.
Optionally, in one embodiment of the above aspect, performing metagenome assembly based on the metagenome sequencing data, obtaining an assembled genomic sequence, and predicting coding genes in the genome based on the assembled genomic sequence comprises: performing metagenome assembly based on the quality-controlled metagenome sequencing data to obtain an assembled genomic sequence, and predicting coding genes in the genome based on the assembled genomic sequence.
Optionally, in one embodiment of the above aspect, determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples using species annotation software based on the metagenomic sequencing data comprises: determining bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software based on the quality-controlled metagenomic sequencing data.
Optionally, in one embodiment of the above aspect, determining the bacteria significantly enriched in the tumor-associated sample comprises: determining the significantly enriched bacteria by the Wilcoxon rank sum test.
Optionally, in one embodiment of the above aspect, the screening criteria for the significantly enriched bacteria are: the bacterial abundance in the tumor-related samples is at least 2 times that in the non-tumor-related samples, and the p-value of the statistical test is less than or equal to 0.05.
Optionally, in one embodiment of the above aspect, the MHC is an HLA type that occurs at high frequency in the Chinese population.
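The enrichment criterion for a single bacterial species can be sketched as below. This is an illustration under stated assumptions, not the disclosed implementation: the rank sum p-value uses the two-sided normal approximation without tie or continuity correction, the fold change is computed on mean abundances (the text does not say whether mean or median is used), and all function names are hypothetical.

```python
import math
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1              # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_p(x: list[float], y: list[float]) -> float:
    """Two-sided Wilcoxon rank sum p-value (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                    # rank sum of the first group
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))


def is_enriched(tumor: list[float], control: list[float],
                min_fold: float = 2.0, alpha: float = 0.05) -> bool:
    """Abundance >= 2x in tumor samples (assumed: by mean) and p <= 0.05."""
    fold_ok = mean(tumor) >= min_fold * mean(control)
    return fold_ok and rank_sum_p(tumor, control) <= alpha
```

With very small sample sizes the normal approximation is coarse, so an exact test from a statistics library would be preferable in practice.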
Optionally, in one embodiment of the above aspect, the screening criteria for a peptide fragment in the protein sequence of a bacterium of a known species, or in the protein sequence of a bacterium of an unknown species, that can bind to MHC are: the affinity ranks in the top 0.5%, and the affinity is greater than 0 and less than or equal to 500 nM.
Optionally, in one embodiment of the above aspect, determining the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment comprises: determining the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment and a deep learning model built with a deep neural network.
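A possible reading of this screening criterion is sketched below. It assumes that each prediction carries both a percentile rank (where 0.5 means the top 0.5%) and a predicted affinity in nM, and that both conditions must hold. The `BindingPrediction` record and its field names are hypothetical and do not correspond to the output format of any particular prediction tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BindingPrediction:
    peptide: str
    allele: str
    affinity_nm: float    # predicted binding affinity in nM (assumed field)
    rank_percent: float   # percentile rank of the affinity (assumed field)


def is_binder(pred: BindingPrediction,
              max_rank_percent: float = 0.5,
              max_affinity_nm: float = 500.0) -> bool:
    """Top 0.5% by rank and 0 < affinity <= 500 nM (assumed conjunction)."""
    return (pred.rank_percent <= max_rank_percent
            and 0 < pred.affinity_nm <= max_affinity_nm)
```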
Optionally, in one embodiment of the above aspect, determining the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment further comprises scoring according to a deep learning model, the higher the score the higher the immunogenicity.
Optionally, in one embodiment of the above aspect, determining the similarity of the MHC-binding peptide fragment to the host based on the MHC-binding peptide fragment comprises: performing sequence alignment of the MHC-binding peptide fragment against the protein sequences of the host from which the tumor-related sample and the non-tumor-related sample are derived, thereby determining the similarity of the MHC-binding peptide fragment to the host.
Optionally, in one embodiment of the above aspect, determining the similarity of the MHC-binding peptide fragment to the host based on the MHC-binding peptide fragment further comprises introducing, as the output of the sequence alignment, a similarity score between the MHC-binding peptide fragment and the host protein sequences, wherein a higher score indicates higher similarity to the host.
Optionally, in one embodiment of the above aspect, determining the number of MHC types of the MHC-binding peptide fragment based on the MHC-binding peptide fragment comprises: counting the number of all MHCs to which the MHC-binding peptide fragment may bind, thereby determining the number of MHC types of the MHC-binding peptide fragment.
Optionally, in one embodiment of the above aspect, determining the number of MHC types of the MHC-binding peptide fragment based on the MHC-binding peptide fragment further comprises introducing a score for the number of all MHC types potentially bound by each peptide fragment, wherein a higher score indicates a greater number of MHC types for the MHC-binding peptide fragment.
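To make the idea of a host similarity score concrete, the sketch below scores a peptide by its best ungapped identity against any same-length window of the host proteins. This is a deliberately simplified stand-in for a real sequence alignment, which would normally use a dedicated alignment tool and a substitution matrix; the function name and the identity-fraction score are assumptions for illustration only.

```python
def host_similarity(peptide: str, host_proteins: list[str]) -> float:
    """Best fraction of identical residues over all ungapped host windows.

    Returns a value in [0, 1]; higher means more similar to the host.
    """
    k = len(peptide)
    if k == 0:
        return 0.0
    best = 0.0
    for protein in host_proteins:
        for start in range(len(protein) - k + 1):
            window = protein[start:start + k]
            matches = sum(a == b for a, b in zip(peptide, window))
            best = max(best, matches / k)
    return best
```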
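Counting the MHC types a peptide may bind reduces to counting distinct alleles per peptide among the predictions that passed the binding screen. A minimal sketch, assuming the input is a list of (peptide, allele) pairs; the function name and the allele labels in the usage note are illustrative only.

```python
from collections import defaultdict


def mhc_type_counts(binding_pairs: list[tuple[str, str]]) -> dict[str, int]:
    """Map each peptide to the number of distinct alleles it may bind."""
    alleles_by_peptide: dict[str, set[str]] = defaultdict(set)
    for peptide, allele in binding_pairs:
        alleles_by_peptide[peptide].add(allele)   # duplicates collapse
    return {pep: len(alleles) for pep, alleles in alleles_by_peptide.items()}
```

For example, a peptide listed with `HLA-A*02:01` twice and `HLA-A*11:01` once would receive a count of 2.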
Optionally, in one embodiment of the above aspect, screening the peptide fragments based on the immunogenicity, the similarity to the host, and the number of MHC types, and thereby determining the tumor neoepitope, comprises: screening for peptide fragments that score high in the deep learning model and score high for the number of MHC types bound, taking into account the similarity score between the MHC-binding peptide fragment and the host protein sequences, thereby determining the tumor neoepitope.
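The final screening step combines the three scores. The sketch below is one possible way to do so and rests on several assumptions that the text does not fix: the thresholds are hypothetical placeholders, lower host similarity is assumed to be preferable, and the `Candidate` record and `screen` function are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    peptide: str
    immunogenicity: float    # deep learning model score (higher is better)
    host_similarity: float   # similarity to host proteins, assumed in [0, 1]
    mhc_types: int           # number of MHC types the peptide may bind


def screen(candidates: list[Candidate],
           min_immunogenicity: float = 0.5,     # hypothetical threshold
           max_host_similarity: float = 0.8,    # hypothetical threshold
           min_mhc_types: int = 2) -> list[Candidate]:  # hypothetical
    """Filter, then rank by immunogenicity, MHC types, and host similarity.

    Assumption: peptides less similar to the host are preferred.
    """
    kept = [c for c in candidates
            if c.immunogenicity >= min_immunogenicity
            and c.host_similarity <= max_host_similarity
            and c.mhc_types >= min_mhc_types]
    return sorted(kept, key=lambda c: (-c.immunogenicity,
                                       -c.mhc_types,
                                       c.host_similarity))
```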
A second aspect of the present disclosure provides an apparatus for determining a tumor neoepitope of microbial origin, comprising: a metagenome sequencing data acquisition module configured to acquire metagenome sequencing data comprising sequencing data obtained by high-throughput sequencing of bacterial DNA in tumor-related samples and non-tumor-related samples; a coding gene prediction module configured to perform metagenome assembly based on the metagenome sequencing data, obtain an assembled genome sequence, and predict coding genes in the genome based on the assembled genome sequence; a bacterial species and abundance determination module configured to determine bacterial species and abundance in the tumor-related and non-tumor-related samples based on the predicted coding genes and/or the metagenome sequencing data; a significantly enriched bacteria determination module configured to determine the bacteria significantly enriched in the tumor-related samples; a known-species determination result acquisition module configured to acquire a known-species determination result indicating whether each significantly enriched bacterium is a known species; a module for determining the genome or protein sequence of bacteria of a known species, configured to determine, for significantly enriched bacteria of a known species, the genome or protein sequence of the bacteria from a known genome database; a module for determining the protein sequence of bacteria of an unknown species, configured to determine, for significantly enriched bacteria of an unknown species, the protein sequence of the bacteria from the predicted coding genes; an MHC-binding peptide fragment determination module configured to predict the binding affinity to MHC of the peptide fragments in the protein sequence of the bacteria of the known species or of the bacteria of the unknown species, and to screen those peptide fragments to determine the peptide fragments capable of binding to MHC; and a tumor neoepitope determination module configured to determine the immunogenicity, host similarity and number of MHC types of the MHC-binding peptide fragments, and to screen the peptide fragments based on the immunogenicity, host similarity and number of MHC types, thereby determining tumor neoepitopes.
Optionally, in one embodiment of the above aspect, the apparatus further comprises a sequencing data quality control module configured to perform quality control on the metagenome sequencing data, before metagenome assembly is performed based on the metagenome sequencing data to obtain an assembled genome sequence and coding genes in the genome are predicted based on the assembled genome sequence, thereby obtaining quality-controlled metagenome sequencing data.
Optionally, in one embodiment of the above aspect, the quality control criteria are: the terminal base quality is greater than Q20, the number of N bases is less than 5, and the sequence length is greater than or equal to 100 bp.
Optionally, in one embodiment of the above aspect, the bacterial species and abundance determination module comprises: (i) a gene-based determination module configured to determine bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on the predicted coding genes; or (ii) an annotation-based determination module configured to determine bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data; or (iii) the gene-based determination module, the annotation-based determination module, and a bacterial species selection and bacterial abundance determination module configured to select the bacterial species contained in both the result of determining bacterial species and abundance based on the predicted coding genes and the result of determining bacterial species and abundance using species annotation software as the bacterial species determined for the tumor-associated and non-tumor-associated samples, and to determine their respective bacterial abundances.
Optionally, in one embodiment of the above aspect, the tumor neoepitope determination module comprises: an immunogenicity determination module configured to determine the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment and a deep learning model built with a deep neural network; a host similarity determination module configured to perform sequence alignment of the MHC-binding peptide fragment against the protein sequences of the host from which the tumor-associated sample and the non-tumor-associated sample are derived, and to determine the similarity of the MHC-binding peptide fragment to the host; an MHC type number determination module configured to count the number of all MHCs to which the MHC-binding peptide fragment may bind, thereby determining the number of MHC types of the MHC-binding peptide fragment; and a peptide fragment screening and neoepitope determination module configured to screen the peptide fragments based on the immunogenicity, the similarity to the host, and the number of MHC types, thereby determining a tumor neoepitope.
A third aspect of the present disclosure proposes a computing device comprising: a processor; and a memory for storing computer-executable instructions that, when executed, cause the processor to perform the method of determining a tumor neoepitope of microbial origin in the first aspect.
A fourth aspect of the present disclosure proposes a computer-readable storage medium having stored thereon computer-executable instructions for performing the method of determining a tumor neoepitope of microbial origin of the first aspect.
A fifth aspect of the present disclosure proposes a computer program product tangibly stored on a computer-readable storage medium and comprising computer-executable instructions that, when executed, cause at least one processor to perform the method of determining a tumor neoepitope of microbial origin of the first aspect.
A sixth aspect of the present disclosure proposes a protein sequence comprising the sequence shown in any one of SEQ ID NOs 11 to 40.
A seventh aspect of the disclosure proposes a nucleic acid sequence encoding the protein sequence of the sixth aspect. Optionally, in one embodiment of the above aspect, the nucleic acid sequence is an mRNA sequence.
Drawings
Features, advantages, and other aspects of the disclosure will become more apparent upon reference to the following detailed description, taken in conjunction with the accompanying drawings, wherein, by way of illustration and not limitation, several embodiments of the disclosure are shown in which:
FIG. 1 shows a schematic flow chart of a method for determining a tumor neoepitope of microbial origin according to the present invention.
Fig. 2 shows a schematic flow chart of a process of determining bacterial species and abundance according to an embodiment of the disclosure.
Fig. 3 shows a schematic flow chart of a process of determining a tumor neoepitope according to one embodiment of the present disclosure.
Fig. 4 shows a schematic flow chart of a method of determining a tumor neoepitope of microbial origin according to one embodiment of the present disclosure.
FIG. 5 shows a schematic block diagram of an apparatus for determining tumor neoepitopes of microbial origin according to the present invention.
Fig. 6 shows a schematic block diagram of a determination module of bacterial species and abundance according to an embodiment of the disclosure.
Fig. 7 shows a schematic block diagram of a tumor neoepitope determination module according to one embodiment of the present disclosure.
Fig. 8 shows a schematic block diagram of an apparatus for determining tumor neoepitopes of microbial origin according to one embodiment of the present disclosure.
Fig. 9 shows a schematic block diagram of a computing device according to one embodiment of the present disclosure.
Fig. 10 shows a schematic flow chart of a method of determining a tumor neoepitope of microbial origin according to one embodiment of the present disclosure.
Fig. 11 shows a circular distribution plot of the species whose abundance, at the species level, is significantly higher in tumor-associated samples than in non-tumor-associated samples, in an embodiment according to the disclosure. The left part of the plot shows the proportion of each group within each bacterium; the right part shows the proportion of each bacterial species within each group. The bacterial species enriched in each group can be read directly from the plot.
Fig. 12 shows a heat map of the species whose abundance, at the species level, is significantly higher in tumor-associated samples than in non-tumor-associated samples, in an embodiment according to the disclosure. The color block at the top indicates the group to which each sample belongs, and the hierarchical clustering tree indicates how similar the species composition of different samples is; the closer two species are in the tree, the more similar their distribution across samples. Colors from light to dark indicate relative abundance from low to high.
Detailed Description
General definitions and terms
All patents, patent applications, scientific publications, manufacturer's instructions and guidelines, and the like, cited herein, whether supra or infra, are hereby incorporated by reference in their entirety. Nothing herein is to be construed as an admission that the disclosure is not entitled to antedate such disclosure.
Unless otherwise indicated, scientific and technical terms used herein have the meaning commonly understood by one of ordinary skill in the art to which this invention pertains. Also, the terms related to protein and nucleic acid chemistry, molecular biology, cell and tissue culture, and microbiology, as used herein, are terms that are widely used in the relevant art (see, e.g., Molecular Cloning: A Laboratory Manual, 2nd Edition, J. Sambrook et al. eds., Cold Spring Harbor Laboratory Press, Cold Spring Harbor 1989). Meanwhile, in order to better understand the present invention, definitions and explanations of related terms are provided below.
As used herein, "at least one" or "one or more" may mean 1, 2, 3, 4, 5, 6, 7, 8 or more.
As used herein, the terms "comprises," "comprising," "includes," "including," "having" and "containing" are open-ended, meaning the inclusion of the stated elements, steps or components, but not the exclusion of other non-recited elements, steps or components. The expression "consisting of ..." does not include any elements, steps or components not specified. The expression "consisting essentially of ..." means that the scope is limited to the specified elements, steps, or components, plus any optional elements, steps, or components that do not significantly affect the basic and novel properties of the claimed subject matter. It should be understood that the expressions "consisting essentially of ..." and "consisting of ..." are encompassed within the meaning of the expression "comprising".
As used herein, the term "and/or" in connection with a plurality of recited elements should be understood to include both individual and combined options. In other words, "and/or" includes "and" as well as "or". For example, A and/or B includes A, B, and A+B. A, B and/or C includes A, B, C and any combination thereof, e.g., A+B, A+C, B+C and A+B+C. Further elements defined by "and/or" are to be understood in a similar manner and include any one of, and any combination of, these.
Any numerical value or range of numerical values, such as concentration or range of concentration, should be construed as modified by the term "about" in any event, unless otherwise indicated. Thus, a numerical value typically includes ±10% of the value. For example, a concentration of 1mg/mL includes 0.9mg/mL to 1.1mg/mL. Likewise, a concentration range of 1% to 10% (w/v) includes 0.9% (w/v) to 11% (w/v). As used herein, the use of a numerical range explicitly includes all possible subranges, all individual values within the range including integers and fractions within the range unless the context clearly indicates otherwise.
As used herein, the term "sample" includes any sample of tissue or fluid isolated from an individual, such as, for example, skin, plasma, serum, spinal fluid, lymph, synovial fluid, urine, sputum, tears, blood cells, organs, tumors, and fecal matter (e.g., feces). The term "tumor-associated sample" is used herein to include a tumor tissue sample or a sample directly related to or derived from a lesion or disorder thereof. For example, in a colon cancer patient, the tumor-related sample comprises a colon cancer tissue sample or a stool sample from the colon cancer patient. For example, in a nasopharyngeal carcinoma patient, the tumor-related sample includes a nasopharyngeal carcinoma tissue sample or a sputum sample of the nasopharyngeal carcinoma patient. The term "non-tumor-related sample" as used herein refers to a paracancerous tissue sample, a normal tissue sample, or another healthy population sample that can serve as a normal control for tumor-related samples and that does not contain the corresponding cancer cells and tissues. A healthy population herein refers to a population that, in contrast to tumor patients having a given tumor, has neither that tumor nor other diseases that may affect the outcome of the experiment. For example, when the tumor-related sample is a stool sample from a colon cancer patient, the non-tumor-related sample may be a stool sample from a healthy population not suffering from colon cancer or other diseases that may affect the outcome of the experiment. In some embodiments, a healthy population refers to a population that does not have any tumor.
As used herein, the term "wild-type" means that the sequence is naturally occurring and not artificially modified, including naturally occurring mutants.
As used herein, "antigen" refers to a molecule that, upon entry into the body, can elicit an acquired immune response, which may involve the production of antibodies, the activation of specific immunocompetent cells, or both. It will be appreciated by those skilled in the art that any macromolecule, including almost all proteins or peptide fragments, may act as an antigen. Still further, the antigen may be derived from recombinant or genomic DNA or RNA.
The term "neoantigen" as used herein is an antigen having at least one alteration that makes it different from the corresponding wild-type parent antigen; for example, the alteration is a tumor cell variation. The neoantigen may include a polypeptide sequence or a nucleotide sequence. As used herein, the term "variation" is the difference between a subject's nucleic acid and a reference human genome used as a control, and includes both genetic mutation and genetic recombination. Mutations may include point mutations caused by single base changes, or deletions, duplications and insertions of multiple bases.
As used herein, the term "tumor neoantigen" is a neoantigen that is present in a tumor cell or tissue of a subject but not in a corresponding normal cell or tissue of the subject. It can serve as a tumor marker when identifying tumor cells by diagnostic testing, and can also serve as a potential candidate for cancer treatment.
As used herein, an "epitope (also referred to as an antigenic determinant)" is a portion of an antigen that is recognized by the immune system (particularly by antibodies, B cells or T cells) in a suitable context. The epitope may be a conformational epitope or a linear epitope. The epitope in the present invention is a linear epitope, defined by a linear continuous amino acid sequence of a specific region of a protein. As used herein, a "tumor neoepitope" is capable of binding with high affinity to MHC molecules such that tumor neoantigen is presented for recognition by T cells and causes T cell activation, thereby attacking tumor cells.
As used herein, the term "tumor" encompasses solid tumors and hematological tumors. Solid tumors include, but are not limited to: squamous cell carcinoma, adenocarcinoma, basal cell carcinoma, renal cell carcinoma, ductal carcinoma of the breast, soft tissue sarcoma, osteosarcoma, melanoma, small-cell lung cancer, non-small-cell lung cancer, lung adenocarcinoma, peritoneal carcinoma, hepatocellular carcinoma, gastrointestinal cancer, gastric cancer, pancreatic cancer, neuroendocrine carcinoma, glioblastoma, cervical cancer, ovarian cancer, liver cancer, bladder cancer, brain cancer, hepatoma, breast cancer, colon cancer, colorectal cancer, endometrial or uterine cancer, esophageal cancer, salivary gland cancer, renal cancer, prostate cancer, vulval cancer, thyroid cancer, neuroblastoma, or head and neck cancer. Hematological tumors include, but are not limited to: leukemia, lymphoma, myeloma, acute myelogenous leukemia, chronic myelogenous leukemia, acute lymphoblastic leukemia, chronic lymphoblastic leukemia, hairy cell leukemia, Hodgkin's lymphoma, non-Hodgkin's lymphoma, mantle cell lymphoma or multiple myeloma.
As used herein, the terms "HLA binding affinity", "HLA affinity" and "MHC binding affinity" mean the binding affinity between a particular antigen and a particular MHC allele.
The term "Major Histocompatibility Complex (MHC)" relates to the gene complex that occurs in all vertebrates. MHC proteins or molecules play a role in the signaling between lymphocytes and antigen presenting cells in a normal immune response. Human MHC, also known as HLA, human leukocyte antigen, is located on chromosome 6 and mainly comprises MHC-I and MHC-II.
The term "MHC-I" or "MHC class I" refers to a major histocompatibility complex class I protein or gene. Within the human MHC-I (HLA-I) region, there are HLA-A, HLA-B, HLA-C, HLA-E, HLA-F, CD1a, CD1b and CD1c sub-regions. MHC class I proteins are present on almost all cell surfaces, including most tumor cells. MHC-I proteins are loaded with antigens that are typically derived from endogenous proteins or pathogens present within the cell and then presented to Cytotoxic T Lymphocytes (CTLs). T cell receptors are capable of recognizing and binding peptides complexed with MHC class I molecules. Each cytotoxic T lymphocyte expresses a unique T cell receptor, capable of binding to a specific MHC/peptide complex. MHC class I molecules mediate primarily the presentation of endogenous antigens.
The term "MHC-II" or "MHC class II" refers to a major histocompatibility complex class II protein or gene. MHC class II proteins are mainly expressed on antigen presenting cells such as B cells, monocytes, macrophages and dendritic cells. MHC class II molecules mediate primarily the presentation of exogenous antigens, which present exogenous antigen polypeptide molecules to Th cells (helper T cells).
As used herein, the term "allele" generally refers to one of a pair of genes that control contrasting traits and are located at the same position on a pair of homologous chromosomes; it may denote one form of a gene, one form of a gene sequence, or one form of a protein. The term "allelic typing" refers to correctly assigning the alleles (or heterozygous sites) of a diploid (or even polyploid) genome to the paternal or maternal chromosome according to their parental origin, so that all alleles from the same parent are ultimately placed on the same chromosome.
HLA is classified into three major classes according to genetic locus: class I, class II and class III. Because the sequences are highly variable, there are many different HLA alleles. An aberrant peptide needs to bind to HLA in order to be recognized by the T cell receptor (TCR), thereby eliciting an immune response. Thus, determining the HLA type is critical for identifying tumor antigens. Currently, HLA genotyping is mainly performed using PCR technology in combination with allele-specific oligonucleotides (ASO) or sequence-specific oligonucleotide probes (SSO). When HLA typing is carried out by analysis of next-generation sequencing data, polymorphisms as small as a single SNV, as well as genotype and haplotype information, can be obtained.
As used herein, "genome" refers to the sum of all genetic material of an organism. Genetic material includes DNA or RNA.
As used herein, the term "neural network" is a machine learning model for classification or regression, consisting of multiple layers of linear transformations, each followed by an element-wise nonlinearity, typically trained by stochastic gradient descent and backpropagation.
"Nucleoside" is a generic term for a class of glycosides. Nucleosides are constituents of nucleic acids and nucleotides. Nucleosides are formed by condensing D-ribose or D-2-deoxyribose with a pyrimidine base or a purine base. Herein, "nucleotide" includes deoxyribonucleotides and ribonucleotides and derivatives thereof. As used herein, a "ribonucleotide" is a constituent of ribonucleic acid (RNA) and consists of one molecule of base, one molecule of pentose, and one molecule of phosphate; it refers to a nucleotide having a hydroxyl group at the 2'-position of the β-D-ribofuranosyl group. A "deoxyribonucleotide" is a constituent of deoxyribonucleic acid (DNA), likewise consists of one molecule of base, one molecule of pentose and one molecule of phosphate, and refers to a nucleotide in which the hydroxyl group at the 2'-position of the β-D-ribofuranosyl group is replaced by hydrogen; it is a main chemical component of chromosomes. A "nucleotide" is generally referred to by the single letter representing its base: "A (a)" means adenine-containing deoxyadenylate or adenylate, "C (c)" means cytosine-containing deoxycytidylate or cytidylate, "G (g)" means guanine-containing deoxyguanylate or guanylate, "U (u)" means uracil-containing uridylate, and "T (t)" means thymine-containing deoxythymidylate.
As used herein, the terms "polynucleotide" and "nucleic acid" are used interchangeably to refer to a polymer of deoxyribonucleotides (deoxyribonucleic acid, DNA) or a polymer of ribonucleotides (ribonucleic acid, RNA). "Polynucleotide sequence", "nucleic acid sequence" and "nucleotide sequence" are used interchangeably to refer to the ordering of nucleotides in a polynucleotide. It will be appreciated by those skilled in the art that the coding strand (sense strand) of DNA can be considered to have the same nucleotide sequence as the RNA it encodes, with deoxythymidylate in the sequence of the coding strand of DNA corresponding to uridylate in the sequence of the RNA it encodes.
As used herein, "encoding" refers to the inherent property of a specific nucleotide sequence in a polynucleotide, such as a gene, cDNA or mRNA, to serve as a template for the synthesis, in biological processes, of other polymers and macromolecules having either a defined nucleotide sequence or a defined amino acid sequence. Thus, a gene encodes a protein if the mRNA of the gene is transcribed and translated to produce the protein in a cell or other biological system.
As used herein, the term "polypeptide" refers to a polymer comprising two or more amino acids covalently linked by peptide bonds. A "protein" may comprise one or more polypeptides, wherein the polypeptides interact with each other by covalent or non-covalent means. Unless otherwise indicated, "polypeptide" and "protein" may be used interchangeably.
As used herein, the term "host" refers to a subject from which tumor-related samples and non-tumor-related samples are derived. The host may be any animal, such as a mammal, particularly a human.
As used herein, "NGS" (next-generation sequencing) is also known as high-throughput sequencing or massively parallel sequencing (MPS). Unlike conventional Sanger (dideoxy) sequencing, it sequences a large number of nucleic acid molecules in parallel in a single run; typically a single sequencing reaction yields no less than 100 Mb of sequencing data.
As used herein, "Q20" refers to a sequencing base error rate of 1%. "Q30" means that the sequencing base error rate is 0.1%. As used herein, "N base" refers to an unknown base, a base type that cannot be determined by sequencing.
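The two error rates above follow from the standard Phred relationship Q = -10·log10(P), where P is the base-calling error probability. The following is a minimal sketch of that conversion; the function names are hypothetical and chosen only for illustration.

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Convert a Phred quality score Q into a base-calling error probability."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p: float) -> float:
    """Convert a base-calling error probability P into a Phred quality score."""
    return -10 * math.log10(p)


# Q20 corresponds to an error rate of about 1%, Q30 to about 0.1%.
q20_error = phred_to_error_rate(20)
q30_error = phred_to_error_rate(30)
```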
As used herein, "bp" refers to Base Pair, a Pair of matched bases, commonly used to measure the length of DNA.
As used herein, "Wilcoxon" (Wilcoxon signed-rank test) is a non-parametric test, often used to test for differences between comparison groups. As used herein, "P-value" refers to a parameter used to judge the outcome of a hypothesis test; a smaller P-value indicates a more significant result, i.e., a more significant difference between the comparison groups.
As used herein, "abundance" refers to relative content.
As used herein, the "NCBI NR database" is a Non-redundant protein library (Non-Redundant Protein Sequence Database), including the Non-redundant protein sequences in GenBank, EMBL, DDBJ, PDB. NCBI NR gives the amino acid sequence corresponding to all known or possible coding sequences, as well as the sequence numbers in the specialized protein database. The nucleic acid data and the protein data can be linked together corresponding to a cross index based on nucleic acid sequences.
As used herein, "MetaPhlAn" is metagenomic species annotation software that, by aligning metagenomic sequencing data against reference sequences, rapidly provides a qualitative and quantitative species classification of a microbial community together with relative abundance information.
As used herein, "NetMHCpan-4.1" is software for predicting the affinity of a peptide fragment for an MHC class I molecule, version number 4.1.
As used herein, "Deep Neural Network (DNN)" also known as deep feed forward network (DFN), multi-layer perceptron (MLP), refers to a neural network with many hidden layers, a technology in the field of Machine Learning (ML).
As used herein, "BlastP" refers to a search tool based on a local alignment algorithm, which is a commonly used tool software for bioinformatics. The input protein sequence can be compared with known sequences in a database to obtain information such as sequence similarity, so as to judge the source or evolutionary relationship of the sequence.
As used herein, "quality control" (QC) refers to the trimming and filtering of low-confidence sequences during analysis.
Methods and apparatuses for determining tumor neoepitopes according to embodiments of the present specification are described below with reference to the accompanying drawings.
Method for determining tumor neoepitope derived from microorganism
In a first aspect, the present disclosure provides a method of determining a tumor neoepitope of microbial origin. Fig. 1 shows a schematic flow chart of a method of determining a tumor neoepitope of microbial origin according to the present disclosure. As shown in fig. 1, the method 100 of determining a tumor neoantigen includes steps 110, 120, 130, 140, 150, 160, 170, 180, and 190.
Step 110 obtains metagenomic sequencing data. The metagenomic sequencing data comprises sequencing data obtained by high-throughput sequencing of bacterial DNA in tumor-related samples and non-tumor-related samples.
In some embodiments, tumor-related and non-tumor-related samples are obtained, and bacterial DNA in the tumor-related and non-tumor-related samples is extracted for high-throughput sequencing. In some embodiments, bacterial DNA is extracted from tumor-related and non-tumor-related samples by reference to the extraction method of Nejman D, et al. The human tumor microbiome is composed of tumor type-specific intracellular bacteria [J]. Science, 2020, 368(6494): 973-980. In some embodiments, metagenomic sequencing is performed on an ILLUMINA high-throughput sequencing platform. In some embodiments, the sequencing mode of the high-throughput sequencing is paired-end, with a read length of 150 bp. In single-end sequencing, the DNA sample is first fragmented into 200-500 bp fragments, a primer sequence is ligated to one end of each DNA fragment and an adapter is added to the other end, the fragments are immobilized on a flow cell to generate DNA clusters, and the clusters are then sequenced and read. In paired-end sequencing, sequencing primer binding sites are added to the adapters at both ends when the DNA library is constructed; after the first round of sequencing is completed, the template strand of the first round is removed, and the paired-end module (Paired-End Module) guides the complementary strand to be regenerated and amplified in situ so as to reach the template quantity required for the second round, in which the complementary strand is sequenced by synthesis. Compared with single-end sequencing, paired-end sequencing reduces errors and yields better assembly results.
Single-end sequencing reads in only one direction, so sequencing quality decreases as read length increases; paired-end sequencing reads more than half of the target sequence from each of the two directions and then merges the two reads at their overlap, giving better read quality.
In some embodiments, in obtaining the metagenomic sequencing data at step 110, the number of tumor-related samples and the number of non-tumor-related samples are each no less than 10. In some embodiments, step 110 of obtaining metagenomic sequencing data comprises: reading a sample information table provided by a user, and processing the sequencing file of each sample stored in a specific folder according to the sample information table, so as to obtain the metagenomic sequencing data. In some embodiments, the sample information table is samplelist.txt. In some embodiments, the format of the information table is: two tab-separated columns, where the first column is the sample grouping information and the second column is the sample name, with one sample per row.
In a specific embodiment, tumor-related samples and non-tumor-related samples are obtained, and the number of tumor-related samples and the number of non-tumor-related samples are each no less than 10. Bacterial DNA in the tumor-related and non-tumor-related samples is extracted by reference to the extraction method of Nejman D, et al. The human tumor microbiome is composed of tumor type-specific intracellular bacteria [J]. Science, 2020, 368(6494): 973-980; bacterial DNA sequencing, i.e., metagenomic sequencing, is completed on an ILLUMINA high-throughput sequencing platform, and the fastq files from paired-end sequencing are obtained. The paired-end fastq files are copied into a specific folder. Step 110 of obtaining metagenomic sequencing data includes: reading a sample information table provided by a user, and processing the sequencing file of each sample stored in the specific folder according to the sample information table, so as to obtain the metagenomic sequencing data. The user provides a sample information table, e.g., samplelist.txt, in the following format: two tab-separated columns, where the first column is the sample grouping information and the second column is the sample name, with one sample per row.
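As an illustration of the sample information table described above, the following minimal sketch parses a two-column, tab-separated table and checks the "no less than 10 samples per group" condition. The function names and the group labels "tumor" and "normal" are assumptions made for this example; the specification does not prescribe them.

```python
import csv
import io
from collections import defaultdict


def read_sample_list(handle):
    """Parse a tab-separated table: column 1 = group, column 2 = sample name."""
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        if len(row) < 2:
            raise ValueError(f"expected two tab-separated columns, got: {row!r}")
        groups[row[0].strip()].append(row[1].strip())
    return dict(groups)


def check_group_sizes(groups, minimum=10):
    """Report, per group, whether it holds at least `minimum` samples."""
    return {group: len(samples) >= minimum for group, samples in groups.items()}


# Hypothetical table content; a real run would open samplelist.txt instead.
example = io.StringIO("tumor\tT01\ntumor\tT02\nnormal\tN01\n")
groups = read_sample_list(example)
size_ok = check_group_sizes(groups)
```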
Step 120 predicts the coding genes in the genome, in which step metagenome assembly is performed based on metagenome sequencing data, an assembled genome sequence is obtained, and the coding genes in the genome are predicted based on the assembled genome sequence.
In some embodiments, step 120 of predicting the coding genes in the genome comprises performing metagenome assembly based on the metagenomic sequencing data, the metagenome assembly comprising: breaking the sequencing data into small fragments and assembling them into sequence fragments of the metagenome. In some embodiments, metagenome assembly comprises: breaking the sequencing data into small fragments based on a de Bruijn graph algorithm and assembling sequence fragments of the metagenome; the sequencing data of each sample is first assembled separately, which is called "single-sample assembly", and the reads of all samples that were not successfully assembled are then pooled and assembled again, which is called "mixed assembly", thereby obtaining the sequence fragments of the single-sample assembly and the mixed assembly. In some embodiments, step 120 comprises predicting the coding genes in the genome based on the assembled genome sequence, which comprises: identifying potential start codons and stop codons in the assembled sequence fragments; and searching for open reading frames (ORFs) between the start codons and stop codons, and predicting protein-coding genes based on features such as ORF length and codon usage bias. In some embodiments, step 120 further comprises determining the expression level of the coding genes. In some embodiments, determining the expression level of the coding genes includes removing redundant genes and calculating the expression levels of the different genes. In some embodiments, removing redundant genes includes clustering similar sequences according to a defined sequence identity threshold and designating a representative sequence (referred to, e.g., as the "centroid") for each cluster, which is the longest sequence in the cluster.
Non-redundant sequences are generated by deleting sequences that are highly similar to the centroid. In some embodiments, step 120 of predicting the coding genes in the genome comprises: breaking the sequencing data into small fragments and assembling sequence fragments of the metagenome, where the sequencing data of each sample is first assembled separately ("single-sample assembly") and the reads of all samples that were not successfully assembled are then assembled again ("mixed assembly"), thereby obtaining the sequence fragments of the single-sample assembly and the mixed assembly; identifying potential start and stop codons in the assembled sequence fragments; searching for open reading frames (ORFs) between the start codons and stop codons, and predicting protein-coding genes based on features such as ORF length and codon usage bias; and removing redundant genes and calculating the expression levels of the different predicted genes.
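The start-codon and stop-codon search described above can be illustrated with a deliberately simplified sketch. It scans the three reading frames of one strand for ATG...stop ORFs and applies only a minimum-length filter; it is not the algorithm of any particular gene predictor, and it ignores codon usage bias and alternative start codons. The function names and the `min_len` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def reverse_complement(seq):
    """Return the reverse complement so the opposite strand can be scanned too."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_len=30):
    """Return (frame, start, end, orf) tuples for one strand.

    `end` is exclusive and the ORF sequence includes its stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first in-frame start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((frame, start, i + 3, seq[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


# Toy contig; real assembled fragments are far longer.
toy_orfs = find_orfs("CCATGAAATTTGGGTAACC", min_len=9)
```

To cover both strands, the same function can be applied to `reverse_complement(seq)`.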
In a specific embodiment, step 120 of predicting the coding genes in the genome comprises: assembling sequence fragments based on the metagenomic sequencing data by calling megahit to assemble the sequencing data of each sample ("single-sample assembly"), and calling megahit again to assemble the reads of all samples that were not successfully assembled ("mixed assembly"), thereby obtaining the sequence fragments of the single-sample assembly and the mixed assembly; calling prodigal to predict the coding genes in the genome based on the assembled sequence fragments; calling cd-hit to remove redundant genes from the predicted coding genes; and calling salmon to align the sequencing data to the predicted genes and calculate the expression levels of the different genes.
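The tool chain named above could be wired together roughly as follows. This is a sketch under stated assumptions, not the applicant's actual invocation: the file layout, the `cd-hit-est` variant, the 0.95 identity threshold and the library-type setting `-l A` are all choices made for this example, and the mixed-assembly pass over unassembled reads is omitted. The function only builds the command lists; it does not run them.

```python
def build_pipeline_commands(sample, r1, r2, outdir, identity=0.95):
    """Build per-sample commands for assembly, gene prediction,
    redundancy removal and quantification (paths are hypothetical)."""
    asm_dir = f"{outdir}/{sample}_megahit"
    contigs = f"{asm_dir}/final.contigs.fa"  # MEGAHIT's default contig file name
    genes = f"{outdir}/{sample}.genes.fna"
    proteins = f"{outdir}/{sample}.proteins.faa"
    nr_genes = f"{outdir}/{sample}.genes.nr.fna"
    index = f"{outdir}/{sample}_salmon_index"
    quant = f"{outdir}/{sample}_salmon_quant"
    return [
        # single-sample assembly from paired-end reads
        ["megahit", "-1", r1, "-2", r2, "-o", asm_dir],
        # gene prediction in metagenome mode, writing nucleotide and protein FASTA
        ["prodigal", "-i", contigs, "-d", genes, "-a", proteins, "-p", "meta"],
        # cluster genes at the assumed identity threshold to remove redundancy
        ["cd-hit-est", "-i", genes, "-o", nr_genes, "-c", str(identity)],
        # index the non-redundant genes, then quantify reads against them
        ["salmon", "index", "-t", nr_genes, "-i", index],
        ["salmon", "quant", "-i", index, "-l", "A", "-1", r1, "-2", r2, "-o", quant],
    ]


commands = build_pipeline_commands("S1", "S1_R1.fastq", "S1_R2.fastq", "out")
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed; keeping command construction separate from execution makes the pipeline easy to inspect before running.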
Step 130 determines bacterial species and abundance, in which bacterial species and abundance in tumor-related and non-tumor-related samples are determined based on predicted encoding genes and/or metagenomic sequencing data.
In some embodiments, step 130 determining the bacterial species and abundance comprises: based on the predicted encoding genes, bacterial species and abundance in tumor-related and non-tumor-related samples are determined. In some embodiments, step 130 determining the bacterial species and abundance comprises: based on the metagenomic sequencing data, species annotation software is used to determine bacterial species and abundance in tumor-related and non-tumor-related samples. In some embodiments, step 130 determining the bacterial species and abundance comprises: determining bacterial species and abundance in tumor-related and non-tumor-related samples based on the predicted encoding genes; determining bacterial species and abundance in tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data; and selecting bacterial species contained in both the results of determining bacterial species and abundance based on the predicted encoding genes and determining bacterial species and abundance using species annotation software as bacterial species in the determined tumor-associated and non-tumor-associated samples and determining their corresponding bacterial abundances.
The two methods, determining the bacterial species based on the predicted coding genes and determining the bacterial species using species annotation software, each have their own advantages. If both methods are used, their results can verify each other; taking the bacterial species contained in the results of both methods as the finally determined bacterial species improves the reliability of the species determination.
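The mutual verification described above amounts to intersecting the two species lists. A minimal sketch follows; the function name is hypothetical, and because the text does not say which of the two abundance estimates is retained, the example simply keeps both.

```python
def consensus_species(gene_based, annotation_based):
    """Keep only species reported by both methods.

    Each argument maps species name -> relative abundance. The result maps
    species name -> (gene-based abundance, annotation-based abundance).
    """
    shared = sorted(set(gene_based) & set(annotation_based))
    return {sp: (gene_based[sp], annotation_based[sp]) for sp in shared}


# Placeholder species names and invented abundances, for illustration only.
by_genes = {"species_A": 0.50, "species_B": 0.30, "species_C": 0.20}
by_annotation = {"species_B": 0.45, "species_C": 0.25, "species_D": 0.30}
confirmed = consensus_species(by_genes, by_annotation)
```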
Step 140 identifies the significantly enriched bacteria, in which step the significantly enriched bacteria in the tumor-associated sample are identified.
In some embodiments, in determining the significantly enriched bacteria at step 140, the significantly enriched bacteria in the tumor-related samples are determined based on a statistical test. In some embodiments, the statistical test is a Wilcoxon rank sum test. In some embodiments, the screening criteria for the significantly enriched bacteria are: the bacterial abundance in the tumor-related samples is greater than or equal to 2 times the bacterial abundance in the non-tumor-related samples, and the p-value of the statistical test is less than or equal to 0.05. The Wilcoxon rank sum test has the advantages that it does not depend on the form of the population distribution and therefore has a wide range of application; it is applicable to data whose values are not determined at one or both ends; it can be applied regardless of what the distribution is or whether it is known; and it is easy to calculate. Setting the screening criteria for significantly enriched bacteria facilitates the selection of bacterial species enriched in tumor-related samples and facilitates the determination of broad-spectrum microorganism-derived tumor neoepitopes in subsequent steps.
In a specific embodiment, step 140 of determining the significantly enriched bacteria comprises: calling the Wilcoxon rank sum test to identify the bacteria that are significantly enriched in the tumor-related samples; the screening criteria for the significantly enriched bacteria are: the bacterial abundance in the tumor-related samples is greater than or equal to 2 times the bacterial abundance in the non-tumor-related samples, and the p-value of the statistical test is less than or equal to 0.05. Optionally, the software R and the R package circlize are used to draw a circular species distribution plot showing the significantly enriched bacterial species and their abundance in the tumor-related samples. Optionally, the software R and the R package pheatmap are used to draw a species distribution heatmap showing the significantly enriched bacterial species and their abundance in the tumor-related samples. The circular species distribution plot and the species distribution heatmap intuitively show the bacterial species, bacterial abundance and/or bacterial distribution that are significantly enriched in the tumor-related samples.
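The screening criteria above (at least a 2-fold abundance and p ≤ 0.05) can be sketched in pure Python. The rank sum test below uses the normal approximation with average ranks for ties and no tie or continuity correction, which is adequate for illustration only; a statistics library would normally be used. Comparing group means for the fold change is an assumption, since the text does not say which summary of abundance is compared, and the function names are hypothetical.

```python
import math
from statistics import mean


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank sum p-value via the normal approximation."""
    combined = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    rank_sum_x = 0.0
    i, n = 0, len(combined)
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the tied ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    expected = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - expected) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def is_enriched(tumor, normal, fold=2.0, alpha=0.05):
    """Apply both criteria: fold change of group means and rank sum p-value."""
    fold_ok = mean(tumor) >= fold * mean(normal)
    return fold_ok and rank_sum_pvalue(tumor, normal) <= alpha


# Invented abundances for ten tumor-related and ten non-tumor-related samples.
tumor_abundance = list(range(11, 21))
normal_abundance = list(range(1, 11))
enriched = is_enriched(tumor_abundance, normal_abundance)
```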
Step 150 obtains a bacterial known-species determination result, which indicates whether the significantly enriched bacteria belong to a known species.
In some embodiments, in obtaining the bacterial known-species determination result at step 150, it is determined whether the significantly enriched bacteria are bacteria of a known species or bacteria of an unknown species, based on the bacterial species in the tumor-related and non-tumor-related samples and on the significantly enriched bacteria. In some embodiments, at step 150, a known database is searched based on the bacterial species determined at step 130 and the significantly enriched bacteria determined at step 140, and it is determined whether the genomic sequence or protein sequence of the significantly enriched bacteria is present in the known database; if it is present, the bacteria are judged to be of a known species; if it is not present, the bacteria are judged to be of an unknown species, thereby obtaining the bacterial known-species determination result. In some embodiments, the known database is the NCBI Genome database.
In one embodiment, in step 150, the NCBI Genome database (https://www.ncbi.nlm.nih.gov/Genome) is searched, based on the names of the bacteria predicted and screened in step 130, for the genomic sequence or protein sequence information of the significantly enriched bacteria determined in step 140; if it is present, the bacteria are judged to be of a known species; if it is not present, the bacteria are judged to be of an unknown species, thereby obtaining the bacterial known-species determination result.
If the significantly enriched bacteria belong to a known species, step 160 is performed. Step 160 determines the genome or protein sequence of the bacteria of the known species, in which step the genome or protein sequence is determined from a known genome database based on the significantly enriched bacteria.
In some embodiments, the known genome database is the NCBI Genome database. In some embodiments, step 160 of determining the genome or protein sequence of the bacteria of the known species includes searching the NCBI Genome database, based on the species-level names of the bacteria that are significantly enriched in the tumor-related samples, to determine the bacterial genome or protein sequence.
In a specific embodiment, step 160 of determining the genome or protein sequence of the bacteria of the known species includes searching the NCBI Genome database, based on the species-level names of the bacteria that are significantly enriched in the tumor-related samples, and obtaining a bacterial genome or protein fasta sequence file.
If the significantly enriched bacteria belong to an unknown species, step 170 is performed. Step 170 determines the protein sequence of the bacteria of the unknown species, in which step the protein sequence is determined from the predicted coding genes based on the significantly enriched bacteria.
In some embodiments, step 170 of determining the protein sequence of the bacteria of the unknown species comprises translating the predicted coding genes of the significantly enriched bacteria into protein sequences, thereby determining the protein sequence of the bacteria of the unknown species.
In a specific embodiment, step 170 of determining the protein sequence of the bacteria of the unknown species comprises translating the genes into a protein fasta sequence file based on the correspondence of the genes and bacterial species predicted in step 130.
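Translating a predicted gene into a protein FASTA record can be sketched as follows. The codon table is the standard genetic code, whose amino acid assignments are shared by the bacterial code; the simplification here is that alternative bacterial start codons (such as GTG or TTG) are not translated as methionine. The function names and the gene identifier are hypothetical.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Codons enumerated in TCAG order line up with the amino acid string above.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(gene):
    """Translate a coding sequence up to (not including) the first stop codon."""
    gene = gene.upper().replace("U", "T")
    protein = []
    for i in range(0, len(gene) - 2, 3):
        aa = CODON_TABLE.get(gene[i:i + 3], "X")  # X marks codons with unknown bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def to_fasta(proteins):
    """Format a {name: sequence} mapping as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in proteins.items())


fasta_text = to_fasta({"gene_1": translate("ATGAAATTTGGGTAA")})
```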
Step 180 determines peptide fragments that bind to MHC, in which step the binding affinity to MHC is predicted for peptide fragments in the protein sequence of the bacteria of the known species or in the protein sequence of the bacteria of the unknown species, and the peptide fragments that bind to MHC are screened out from those protein sequences, thereby determining the peptide fragments that bind to MHC.
In some embodiments, step 180 of determining the peptide fragments that bind to MHC includes, based on the genome or protein sequence of the bacteria of the known species from step 160 or the protein sequence of the bacteria of the unknown species from step 170, using an artificial neural network to predict and score the binding affinity of the peptide fragments to MHC molecules of known sequence. In some embodiments, when the number of screened peptide fragments exceeds 50, step 180 further comprises using a deep neural network to predict and score the binding affinity of the peptide fragments to MHC molecules of known sequence, which assists in verifying the screened peptide fragments having affinity. In some embodiments, the predicted binding affinity results are presented as peptide fragment-MHC pairing results. In some embodiments, step 180 comprises screening the MHC-binding peptide fragments using NetMHCpan-4.1. In some embodiments, the screening criterion for NetMHCpan-4.1 is: peptide fragments with rank% less than 0.5 are selected from the NetMHCpan-4.1 analysis results. In some embodiments, step 180 includes using BigMHC scoring to assist in verifying the screened peptide fragments having affinity. In some embodiments, peptide fragments with higher BigMHC scores are more preferred.
In a specific embodiment, step 180 of determining the MHC-binding peptide fragments comprises: based on the bacterial genome or protein fasta sequence file of step 160, or the protein fasta sequence file of step 170, calling the open-source trained NetMHCpan-4.1 model to predict the binding affinity of the peptide fragments to MHC molecules of known sequence, so as to screen for peptide fragments having HLA class I affinity; and calling the open-source trained BigMHC model to predict the binding affinity of the peptide fragments to MHC molecules of known sequence, so as to assist in validating the peptide fragments predicted by NetMHCpan-4.1.
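Before affinity prediction, candidate peptide fragments have to be enumerated from each protein sequence. The sketch below slides windows over a protein to list unique peptides; the window lengths of 8 to 11 residues are an assumption commonly made for MHC class I ligands and are not stated in the text, and the function name is hypothetical. Prediction tools often perform this enumeration internally.

```python
def peptide_windows(protein, lengths=(8, 9, 10, 11)):
    """List unique peptides of the given lengths, in order of first appearance."""
    seen = set()
    peptides = []
    for k in lengths:
        for i in range(len(protein) - k + 1):
            peptide = protein[i:i + k]
            if peptide not in seen:
                seen.add(peptide)
                peptides.append(peptide)
    return peptides


# Toy sequence; letters are placeholders rather than a real protein.
candidates = peptide_windows("ABCDEFGHI", lengths=(8, 9))
```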
NetMHCpan-4.1: an artificial neural network is used to predict the binding affinity of peptide fragments to MHC molecules of known sequence, and the predicted affinity is expressed as the scores SW_score A and SW_score B. SW_score A and SW_score B are the "EL_rank" item and the "BA_rank" item in the NetMHCpan-4.1 results, respectively, and each ranges from 0 to 100. The smaller SW_score A and SW_score B are, the more strongly the peptide fragment binds to the corresponding MHC molecule. Illustratively, the screening criterion is that SW_score A and SW_score B are each less than or equal to 2 (see https://services.health.dtu.dk/services/NetMHCITman-4.0/).
BigMHC: a deep neural network is used to predict the binding affinity of peptide fragments to MHC molecules of known sequence, and the predicted affinity is expressed as the score SW_score. SW_score is the "bigmhc_im" item in the BigMHC results and ranges from 0 to 1. A higher SW_score indicates stronger binding of the peptide fragment to the corresponding MHC molecule (see Albert BA, et al. Deep Neural Networks Predict MHC-I Epitope Presentation and Transfer Learn Neoepitope Immunogenicity. bioRxiv; 2022.).
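The two-stage use of the scores described above, a rank cutoff on the NetMHCpan-4.1 items followed by ordering on the BigMHC item, can be sketched as follows. The record layout and the key names `el_rank`, `ba_rank` and `bigmhc_im` are assumptions for this example (they mirror the item names quoted in the text, not the tools' actual output columns), and all scores shown are invented.

```python
def screen_candidates(rows, rank_cutoff=2.0):
    """Keep peptide-MHC pairs whose EL and BA ranks are both within the cutoff,
    then order them by descending BigMHC score (higher is preferred)."""
    kept = [
        row for row in rows
        if row["el_rank"] <= rank_cutoff and row["ba_rank"] <= rank_cutoff
    ]
    return sorted(kept, key=lambda row: row["bigmhc_im"], reverse=True)


# Placeholder peptides with invented scores.
predictions = [
    {"peptide": "PEP1", "mhc": "HLA-A*02:01", "el_rank": 0.3, "ba_rank": 1.1, "bigmhc_im": 0.40},
    {"peptide": "PEP2", "mhc": "HLA-A*02:01", "el_rank": 1.5, "ba_rank": 0.9, "bigmhc_im": 0.85},
    {"peptide": "PEP3", "mhc": "HLA-B*07:02", "el_rank": 0.2, "ba_rank": 4.0, "bigmhc_im": 0.95},
]
ranked = screen_candidates(predictions)
```

Passing `rank_cutoff=0.5` approximates the stricter rank% criterion mentioned for some embodiments, except that the comparison here is inclusive.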
Step 190 determines a tumor neoepitope, in which step the immunogenicity, the host similarity and the number of MHC types of the MHC-binding peptide fragments are determined, and the peptide fragments are screened based on the immunogenicity, the host similarity and the number of MHC types, thereby determining the tumor neoepitope.
As used herein, "immunogenicity" refers primarily to the ability of a substance to elicit an immune response in the body against itself or a related protein (e.g., a therapeutic protein), or to cause an immune-related event; that is, the ability of an antigen to stimulate specific immune cells so that they are activated, proliferate and differentiate, ultimately producing immune effector substances such as antibodies and sensitized lymphocytes.
In some embodiments, determining the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment comprises: immunogenicity of the MHC-binding peptide is determined based on a deep learning model established by the deep neural network and the MHC-binding peptide. In some embodiments, determining the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment further comprises scoring according to a deep learning model, the higher the score the higher the immunogenicity.
As used herein, "similarity" is the number of amino acids of an MHC-binding peptide fragment that are identical to the protein sequence of the host, divided by the total number of amino acids in the MHC-binding peptide fragment.
In some embodiments, determining the similarity of the MHC-binding peptide fragment to the host based on the MHC-binding peptide fragment comprises: performing sequence alignment of the MHC-binding peptide fragment against the protein sequences of the host from which the tumor-related sample and the non-tumor-related sample are derived, and determining the similarity of the MHC-binding peptide fragment to the host. In some embodiments, determining the similarity of the MHC-binding peptide fragment to the host based on the MHC-binding peptide fragment further comprises introducing a score for the similarity of the MHC-binding peptide fragment to the host protein sequence, the score being the output of the sequence alignment; the higher the score, the higher the similarity to the host.
As used herein, the "MHC typing number of an MHC-binding peptide fragment" is the number of different MHC types having binding affinity for the same MHC-binding peptide fragment. The greater the MHC typing number, the more likely the peptide fragment is to be bound by multiple MHC molecules and to exert stronger immune activation. Determining the MHC typing number of MHC-binding peptide fragments facilitates screening for high-quality tumor neoepitopes capable of strong immune activation.
In some embodiments, determining the MHC typing number of the MHC-binding peptide fragment based on the MHC-binding peptide fragment comprises: counting the number of all MHC types to which the MHC-binding peptide fragment may bind, thereby determining the MHC typing number of the MHC-binding peptide fragment. In some embodiments, determining the MHC typing number of the MHC-binding peptide fragment based on the MHC-binding peptide fragment further comprises introducing a multi-MHC-typing score for the same binding peptide fragment, the score being the number of all MHC types to which each peptide fragment may bind; the higher the score, the greater the MHC typing number of the MHC-binding peptide fragment.
In some embodiments, determining the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment comprises: determining the immunogenicity of the MHC-binding peptide fragment based on a deep learning model established by the deep neural network and the MHC-binding peptide fragment; determining the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment further comprises scoring according to the deep learning model, the higher the score, the higher the immunogenicity; determining the similarity of the MHC-binding peptide fragment to the host based on the MHC-binding peptide fragment comprises: performing sequence alignment of the MHC-binding peptide fragment against the protein sequences of the host from which the tumor-related sample and the non-tumor-related sample are derived, and determining the similarity of the MHC-binding peptide fragment to the host; determining the similarity of the MHC-binding peptide fragment to the host based on the MHC-binding peptide fragment further comprises introducing a score for the similarity of the MHC-binding peptide fragment to the host protein sequence, the score being the output of the sequence alignment, the higher the score, the higher the similarity to the host; determining the MHC typing number of the MHC-binding peptide fragment based on the MHC-binding peptide fragment comprises: counting the number of all MHC types to which the MHC-binding peptide fragment may bind, thereby determining the MHC typing number of the MHC-binding peptide fragment; and determining the MHC typing number of the MHC-binding peptide fragment based on the MHC-binding peptide fragment further comprises introducing a multi-MHC-typing score for the same binding peptide fragment, the score being the number of all MHC types to which each peptide fragment may bind, the higher the score, the greater the MHC typing number of the MHC-binding peptide fragment.
Introducing the immunogenicity of the MHC-binding peptide fragments, their similarity to the host, and their MHC typing number facilitates screening for high-quality tumor neoepitopes of microbial origin.
In some embodiments, screening the peptide fragments based on immunogenicity, similarity to the host, and MHC typing number, thereby determining the tumor neoepitope, comprises: selecting peptide fragments that have a high score in the deep learning model scoring, a high multi-MHC-typing score for the same binding peptide fragment, and a low score for the similarity of the MHC-binding peptide fragment to the host protein sequence, thereby determining the tumor neoepitope. Screening for peptide fragments with high immunogenicity, low similarity to the host, and a large MHC typing number facilitates screening for high-quality tumor neoepitopes of microbial origin.
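The definition can be written as a formula. The symbols are introduced here only for illustration: p is an MHC-binding peptide fragment, |p| is its length in amino acids, and n_id(p) is the number of its amino acids that are identical to the host protein sequence.

```latex
\mathrm{similarity}(p) = \frac{n_{\mathrm{id}}(p)}{|p|},
\qquad 0 \le \mathrm{similarity}(p) \le 1
```

For instance, a peptide fragment of 10 amino acids of which 4 are identical to the host sequence has a similarity of 4/10 = 0.4.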
Fig. 2 shows a schematic flow chart of a process of determining bacterial species and abundance according to an embodiment of the disclosure. As shown in FIG. 2, a process 200 of determining bacterial species and abundance includes steps 210, 220, and 230.
Step 210 determines bacterial species and abundance based on the predicted encoding genes, in which step bacterial species and abundance in tumor-associated and non-tumor-associated samples are determined based on the predicted encoding genes.
In some embodiments, step 210 of determining bacterial species and abundance based on the predicted encoding genes comprises performing sequence alignment of the predicted genes against sequences in a known database to predict bacterial species and to determine bacterial abundance at the same taxonomic level. In some embodiments, the known database is the NCBI NR database. In some embodiments, step 210 of determining bacterial species and abundance based on the predicted encoding genes comprises performing sequence alignment of the predicted genes against sequences in the NCBI NR database to predict bacterial species and to determine bacterial abundance at the same taxonomic level. In some embodiments, the input sequence for the sequence alignment is the translated protein sequence of the predicted coding gene. Sequence alignment against known databases facilitates the prediction of bacterial species.
In a specific embodiment, step 210 of determining bacterial species and abundance based on the predicted coding genes comprises using BLAST to align the predicted genes against sequences in the NCBI NR database (https://ftp.ncbi.nlm.nih.gov/blast/db/) to predict bacterial species and to determine bacterial abundance at the same taxonomic level.
Step 220 uses species annotation software to determine bacterial species and abundance, in which step bacterial species and abundance in tumor-related and non-tumor-related samples are determined using species annotation software based on the metagenomic sequencing data. In some embodiments, the species annotation software is MetaPhlAn.
In a specific embodiment, step 220 of determining bacterial species and abundance using species annotation software comprises calling MetaPhlAn to predict the bacterial species and their distribution in the tumor-related and non-tumor-related samples based on the metagenomic sequencing data, and determining bacterial abundance at the same taxonomic level.
Step 230 selects bacterial species and determines bacterial abundance. In this step, the bacterial species contained in both the result of determining bacterial species and abundance based on the predicted encoding genes and the result of determining bacterial species and abundance using species annotation software are selected as the determined bacterial species in the tumor-associated and non-tumor-associated samples, and their respective bacterial abundances are determined. In some embodiments, determining their respective bacterial abundances comprises calculating, for each selected bacterial species, the average of its bacterial abundance from the determination based on the predicted encoding genes and its bacterial abundance from the determination using species annotation software.
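The selection and averaging of step 230 can be sketched as follows. This is a minimal illustration that assumes each method's result has already been reduced to a mapping from species name to abundance at the same taxonomic level; the function name and the dictionary representation are hypothetical.

```python
def select_species(gene_based, annotation_based):
    """Keep species reported by both methods and average their abundances.

    gene_based:       {species: abundance} from the predicted coding genes
    annotation_based: {species: abundance} from the species annotation software
    """
    shared = gene_based.keys() & annotation_based.keys()
    return {
        species: (gene_based[species] + annotation_based[species]) / 2
        for species in sorted(shared)
    }
```

Species found by only one of the two methods are dropped, which reflects the mutual verification the two determinations are meant to provide.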
Returning to FIG. 1, after the bacterial species and abundance are determined as described above, step 140 is performed to determine the significantly enriched bacteria, in which step the significantly enriched bacteria in the tumor-associated sample are determined. The specific operation of step 140 to determine the significantly enriched bacteria is as described above.
Fig. 3 shows a schematic flow chart of a process for determining a tumor neoepitope according to one embodiment of the present disclosure. As shown in fig. 3, the process 300 of determining a tumor neoepitope includes steps 310, 320, 330 and 340.
Step 310 determines the immunogenicity of the MHC-binding peptide fragments, in which step the immunogenicity of the MHC-binding peptide fragments is determined based on a deep learning model established by the deep neural network and the MHC-binding peptide fragments.
In some embodiments, step 310 of determining the immunogenicity of the MHC-binding peptide fragments comprises determining the immunogenicity using BigMHC. In some embodiments, the deep learning model established by the deep neural network is trained on amino acid sequence features of different peptide fragments and MHC. In some embodiments, the deep learning model established by the deep neural network is the trained BigMHC model. In some embodiments, step 310 of determining the immunogenicity of the MHC-binding peptide fragments comprises determining the immunogenicity using the trained BigMHC model, which was trained on amino acid sequence features of different peptide fragments and MHC. In some embodiments, step 310 of determining the immunogenicity of the MHC-binding peptide fragments comprises scoring and ranking according to the deep learning model, the higher the score, the higher the immunogenicity. Tumor immunogenicity is the basis for initiating tumor immunotherapy, so the higher the immunogenicity of a neoantigen, the higher the likelihood of an immune response. In some embodiments, in step 310, the sequence of the peptide fragment and the MHC typing information are input to the model, and the model outputs a score between 0 and 1, with a higher score indicating greater immunogenicity of the peptide fragment.
In a specific embodiment, step 310 of determining the immunogenicity of the MHC-binding peptide fragments comprises calling the open-source trained BigMHC model to predict the immunogenicity of the peptide fragments. In this step, the sequence of the peptide fragment (e.g., QTYKTNSSVKK, SEQ ID NO: 41) and the MHC typing information (e.g., HLA-A*11:01) are input to the model. The result is a score between 0 and 1; the higher the score, the more immunogenic the peptide fragment. The trained BigMHC model was trained on amino acid sequence features of different peptide fragments and MHC.
Step 320 determines the similarity of the MHC-binding peptide to the host, in which step the MHC-binding peptide is aligned with the protein sequences of the host from which the tumor-associated sample and the non-tumor-associated sample are derived, and the similarity of the MHC-binding peptide to the host is determined.
In some embodiments, step 320 determining the similarity of the MHC-binding peptide to the host comprises introducing a score for the similarity of the MHC-binding peptide to the host protein sequence to determine the similarity to the host. In some embodiments, step 320 of determining the MHC-binding peptide fragment to host similarity comprises creating a protein sequence index file of the host (e.g., human) in advance, aligning the MHC-binding peptide fragment to host protein sequences based on the pre-created index file, and scoring the similarity of the MHC-binding peptide fragment to host protein sequences to determine the MHC-binding peptide fragment to host similarity.
In a specific embodiment, step 320 of determining the similarity of the MHC-binding peptide fragments to the host comprises creating a blastp index file of the protein sequences of the host (e.g., human) in advance, using blastp to align the MHC-binding peptide fragments against the host protein sequences based on the created blastp index file, and obtaining a score for the similarity of the MHC-binding peptide fragments to the host protein sequences (the score is the blastp output result), thereby determining the similarity of the MHC-binding peptide fragments to the host.
Step 330 determines the MHC typing number of the MHC-binding peptide fragments. In this step, the number of all MHC types to which each MHC-binding peptide fragment may bind is counted, thereby determining the MHC typing number of the MHC-binding peptide fragments.
In some embodiments, step 330 of determining the MHC typing number of the MHC-binding peptide fragments comprises introducing a multi-MHC-typing score for the same binding peptide fragment to determine the MHC typing number of the MHC-binding peptide fragments. In step 330, the predicted binding affinity results are presented as peptide-MHC pairs, so that one peptide fragment may appear in the output paired with multiple MHC types. The multi-MHC-typing score for the same binding peptide fragment is a basic count: the number of all MHC types to which each peptide fragment may bind.
In a specific embodiment, step 330 of determining the MHC typing number of the MHC-binding peptide fragments comprises introducing, based on the MHC-binding peptide fragments, a multi-MHC-typing score for the same binding peptide fragment, and counting the number of all MHC types to which each peptide fragment may bind, thereby determining the MHC typing number of the MHC-binding peptide fragments.
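One possible way to turn blastp output into a per-peptide similarity is sketched below. It assumes the alignment was run with BLAST+ tabular output restricted to the columns `qseqid sseqid nident qlen`, and it computes the number of identical residues divided by the peptide length, taking the best host hit for each peptide. The file names in the comments and the choice of these columns are illustrative assumptions; the text states only that the score is the blastp output result.

```python
# Illustrative BLAST+ commands (file and database names are hypothetical):
#   makeblastdb -in host_proteins.fasta -dbtype prot -out host_db
#   blastp -task blastp-short -query peptides.fasta -db host_db \
#          -outfmt "6 qseqid sseqid nident qlen" -out hits.tsv


def best_host_similarity(tabular_lines):
    """Return {peptide_id: highest nident/qlen over all host hits}.

    Each line is expected to hold four tab-separated fields:
    qseqid, sseqid, nident, qlen. Peptides without any hit do not appear
    in the output and can be treated by the caller as similarity 0.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, _sseqid, nident, qlen = line.split("\t")[:4]
        similarity = int(nident) / int(qlen)
        best[qseqid] = max(best.get(qseqid, 0.0), similarity)
    return best
```

Taking the maximum over hits is a conservative choice: a peptide counts as host-like if it resembles any host protein.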
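The count described for step 330 can be sketched as follows, assuming the binding predictions have already been filtered and reduced to (peptide, MHC type) pairs. The function name and input representation are hypothetical.

```python
from collections import defaultdict


def count_mhc_typings(pairs):
    """Map each peptide to the number of distinct MHC types it is paired with.

    pairs: iterable of (peptide, mhc_type) tuples, one per predicted binding.
    """
    mhc_by_peptide = defaultdict(set)
    for peptide, mhc_type in pairs:
        mhc_by_peptide[peptide].add(mhc_type)  # a set ignores repeated pairs
    return {peptide: len(types) for peptide, types in mhc_by_peptide.items()}
```

Using a set per peptide means a pair that is listed twice is counted only once.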
Step 340 screens peptide fragments and determines tumor neoepitopes, in which step peptide fragments are screened based on immunogenicity, similarity to host, and number of MHC typing, thereby determining tumor neoepitopes.
In some embodiments, step 340 of screening the peptide fragments and determining the tumor neoepitope comprises selecting peptide fragments that have a high score in the deep learning model scoring, a high multi-MHC-typing score for the same binding peptide fragment, and a low score for the similarity of the MHC-binding peptide fragment to the host protein sequence, thereby determining the tumor neoepitope.
In a specific embodiment, step 340 of screening the peptide fragments and determining the tumor neoepitope comprises selecting, as tumor neoepitopes, peptide fragments with a score of 0.9 or more in the immunogenicity deep learning model scoring, a score of 0.5 or less for the similarity of the MHC-binding peptide fragment to the host protein sequence, and a multi-MHC-typing score of 2 or more for the same binding peptide fragment.
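The three thresholds of this specific embodiment can be combined in a simple filter. The sketch below assumes the three scores have already been computed for each peptide fragment; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Scores for one MHC-binding peptide fragment (hypothetical layout)."""
    peptide: str
    immunogenicity: float   # deep learning model score, 0 to 1
    host_similarity: float  # similarity to host protein sequences
    mhc_typing_count: int   # number of MHC types the peptide may bind


def is_neoepitope(c, min_immunogenicity=0.9, max_similarity=0.5, min_typings=2):
    """Apply the three screening thresholds to one candidate."""
    return (
        c.immunogenicity >= min_immunogenicity
        and c.host_similarity <= max_similarity
        and c.mhc_typing_count >= min_typings
    )


def select_neoepitopes(candidates):
    """Return the peptides that pass all three criteria, preserving order."""
    return [c.peptide for c in candidates if is_neoepitope(c)]
```

The thresholds are keyword parameters so that other embodiments, which describe the criteria only as high or low scores, can use different cutoffs.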
In some embodiments, the method of determining a tumor neoepitope of microbial origin further comprises sequencing data quality control. Fig. 4 shows a schematic flow chart of a method of determining a tumor neoepitope of microbial origin according to one embodiment of the present disclosure. As shown in fig. 4, the method 400 of determining a tumor neoepitope of microbial origin comprises steps 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414 and 415. Among them, step 401, step 403, step 404, step 405, step 406, step 407, step 408, step 409, step 410, step 411, step 412, step 413, step 414, and step 415 are the same as step 110, step 120, step 210, step 220, step 230, step 140, step 150, step 160, step 170, step 180, step 310, step 320, step 330 and step 340, respectively, and are not described in detail herein.
Step 402 performs quality control of the sequencing data. In this step, before metagenome assembly is performed based on the metagenomic sequencing data to obtain an assembled genome sequence and coding genes in the genome are predicted based on the assembled genome sequence, quality control is performed on the metagenomic sequencing data, thereby obtaining quality-controlled metagenomic sequencing data.
In some embodiments, the criteria for quality control are: the terminal base quality is greater than Q20, the number of N bases is less than 5, and the sequence length is greater than or equal to 100 bp. In some embodiments, step 402 of quality control of the sequencing data comprises removing the linker sequence. In some embodiments, step 402 of quality control of the sequencing data comprises removing the linker sequence and removing low-quality sequencing data, to obtain quality-controlled sequencing data having a terminal base quality greater than Q20, a number of N bases less than 5, and a sequence length greater than or equal to 100 bp.
In a specific embodiment, step 402 of quality control of the sequencing data comprises invoking the fastp software to perform quality control with the fastq sequencing data obtained in step 401 (or step 110) as input, the quality control comprising: removing the linker sequence and removing low-quality sequencing data, to obtain quality-controlled sequencing data (i.e., sequencing data that, after linker removal, has a terminal base quality greater than Q20, a number of N bases less than 5, and a sequence length greater than or equal to 100 bp).
The quality control of the sequencing data in step 402 improves the quality of the sequencing data obtained, thereby avoiding sequence contamination and its effect on the subsequent analysis and the determination of tumor neoepitopes.
In some embodiments, step 403 (or step 120) of predicting coding genes in the genome comprises: performing metagenome assembly based on the quality-controlled metagenomic sequencing data to obtain an assembled genome sequence, and predicting coding genes in the genome based on the assembled genome sequence. Using quality-controlled sequencing data for this operation avoids adverse effects on the subsequent analysis and the determination of tumor neoepitopes.
In some embodiments, step 405 (or step 220) of determining bacterial species and abundance using species annotation software comprises: determining bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software, based on the quality-controlled metagenomic sequencing data. Using quality-controlled sequencing data for this operation avoids adverse effects on the subsequent analysis and the determination of tumor neoepitopes.
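To make the three read-level criteria concrete, the sketch below applies them to a single read in plain Python. It is not the fastp algorithm and it omits linker removal. It assumes Phred+33 quality encoding and reads "terminal base quality greater than Q20" as trimming bases from the 3' end until the last base exceeds Q20; both are assumptions made for illustration.

```python
def phred33(qual_char):
    """Convert one FASTQ quality character (Phred+33 assumed) to a score."""
    return ord(qual_char) - 33


def quality_control(seq, qual, min_q=20, max_n=5, min_len=100):
    """Trim low-quality 3' bases, then apply the N-count and length criteria.

    Returns the trimmed (seq, qual) pair, or None if the read is discarded.
    """
    end = len(seq)
    # Drop trailing bases whose quality is not greater than min_q.
    while end > 0 and phred33(qual[end - 1]) <= min_q:
        end -= 1
    seq, qual = seq[:end], qual[:end]
    # Discard reads with max_n or more N bases, or shorter than min_len.
    if seq.upper().count("N") >= max_n or len(seq) < min_len:
        return None
    return seq, qual
```

In practice a dedicated tool handles paired-end reads and linker detection as well; the function above only shows how the stated thresholds interact.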
In some embodiments, the tumor-related sample is a donor tumor tissue sample or a tumor patient stool sample, or the non-tumor-related sample is a donor paracancerous tissue sample, a normal tissue sample, or a healthy population stool sample. In some embodiments, the tumor-related sample is a donor tumor tissue sample or a tumor patient stool sample, and the non-tumor-related sample is a donor paracancerous tissue sample, a normal tissue sample, or a healthy population stool sample. A tumor neoantigen is a neoantigen that is present in a tumor cell or tissue of a subject but not in the corresponding normal cell or tissue of the subject. It can serve as a tumor marker when tumor cells are identified by diagnostic testing, and can also serve as a potential candidate for cancer treatment. Paracancerous tissue samples, normal tissue samples, and stool samples from healthy populations serve as control samples; establishing such control samples is useful for studying the significantly enriched bacteria in tumor tissue samples or stool samples from tumor patients and for further determining tumor neoepitopes of microbial origin.
In some embodiments, the length of the assembled genomic sequence is greater than or equal to 90bp. Long sequences can improve the accuracy of species annotation analysis.
In some embodiments, the MHC is a high-frequency HLA of the Chinese population. Determining peptide fragments that can bind to high-frequency HLAs of the Chinese population facilitates the development and study of tumor neoepitopes suitable for cancer treatment in the Chinese population. The high-frequency HLAs of the Chinese population include the human MHC listed in Table 3; see, for example, He Y, Li J, Mao W, et al. HLA common and well-documented alleles in China [J]. HLA, 2018, 92(4): 199-205.
In some embodiments, in step 411 (or step 180) of determining MHC-binding peptide fragments, the criteria for screening for MHC-binding peptide fragments in the protein sequence of a bacterium of a known species or in the protein sequence of a bacterium of an unknown species are: the affinity ranks in the top 0.5%, and the affinity is greater than 0 and less than or equal to 500 nM. Tumor immunogenicity is the basis for initiating tumor immunotherapy, and neoantigens that bind to MHC with high affinity have a higher probability of generating an immune response. Screening for peptide fragments with higher binding affinity to MHC facilitates screening for high-quality tumor neoepitopes of microbial origin.
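The affinity criteria above can be sketched as a filter, reading them as: the predicted affinity ranks within the top 0.5%, and the affinity value is greater than 0 and at most 500 nM. The sketch assumes each prediction carries a percentile rank (lower means stronger) and an affinity in nM; the tuple layout and parameter names are hypothetical.

```python
def passes_affinity_screen(rank_percent, affinity_nm,
                           max_rank_percent=0.5, max_affinity_nm=500.0):
    """Return True for a top-ranked binder with affinity in (0, 500] nM."""
    return rank_percent <= max_rank_percent and 0 < affinity_nm <= max_affinity_nm


def screen_by_affinity(predictions):
    """Filter (peptide, mhc, rank_percent, affinity_nm) tuples."""
    return [
        (peptide, mhc)
        for peptide, mhc, rank_percent, affinity_nm in predictions
        if passes_affinity_screen(rank_percent, affinity_nm)
    ]
```

The lower bound on affinity excludes zero or negative values, which cannot be valid concentrations.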
Device for determining tumor neoepitope derived from microorganism
In a second aspect, the present disclosure provides a device for determining a tumor neoepitope of microbial origin. Fig. 5 shows a schematic block diagram of an apparatus for determining a tumor neoepitope of microbial origin according to the present disclosure. As shown in fig. 5, the apparatus 500 for determining a tumor neoepitope of microbial origin includes a metagenome sequencing data acquisition module 510, a coding gene prediction module 520, a determination module 530 of bacterial species and abundance, a determination module 540 of significantly enriched bacteria, a bacteria known species judgment result acquisition module 550, a determination module 560 of the genome or protein sequence of bacteria of known species, a determination module 570 of the protein sequence of bacteria of unknown species, a determining module 580 for MHC-binding peptide fragments, and a tumor neoepitope determination module 590.
A metagenome sequencing data acquisition module 510 configured to acquire metagenome sequencing data, the metagenome sequencing data comprising sequencing data after high throughput sequencing of bacterial DNA in tumor-related samples and non-tumor-related samples. In some embodiments, the operation of the metagenomic sequencing data acquisition module 510 may refer to the operation described above with reference to 110 of fig. 1.
In some embodiments, tumor-related and non-tumor-related samples are obtained, and bacterial DNA in the tumor-related and non-tumor-related samples is extracted for high-throughput sequencing. In some embodiments, bacterial DNA is extracted from the tumor-associated and non-tumor-associated samples by the extraction methods of Nejman D, et al. The human tumor microbiome is composed of tumor type-specific intracellular bacteria [J]. Science, 2020, 368(6494): 973-980. In some embodiments, metagenomic sequencing is performed on an Illumina high-throughput sequencing platform. In some embodiments, the sequencing mode of the high-throughput sequencing is paired-end, with a read length of 150 bp. In some embodiments, the number of tumor-associated samples and the number of non-tumor-associated samples are each no less than 10.
A coding gene prediction module 520 configured to perform metagenomic assembly based on the metagenomic sequencing data, obtain an assembled genomic sequence, and predict a coding gene in the genome based on the assembled genomic sequence. In some embodiments, the operation of the encoding gene prediction module 520 may refer to the operation described above with reference to 120 of fig. 1.
A determination module 530 of bacterial species and abundance configured to determine bacterial species and abundance in tumor-related and non-tumor-related samples based on predicted encoding genes and/or metagenomic sequencing data. In some embodiments, the operation of the determination module 530 of bacterial species and abundance may refer to the operation described above with reference to 130 of fig. 1.
In some embodiments, the determination module 530 of bacterial species and abundance comprises: a module for determining bacterial species and abundance based on predicted encoding genes, configured to determine bacterial species and abundance in the tumor-associated sample and the non-tumor-associated sample based on the predicted encoding genes.
In some embodiments, the determination module 530 of bacterial species and abundance comprises: a module for determining bacterial species and abundance using species annotation software, configured to determine bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data.
In some embodiments, the determination module 530 of bacterial species and abundance comprises: a module for determining bacterial species and abundance based on predicted encoding genes, configured to determine bacterial species and abundance in the tumor-associated sample and the non-tumor-associated sample based on the predicted encoding genes; a module for determining bacterial species and abundance using species annotation software, configured to determine bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data; and a bacterial species selection and bacterial abundance determination module, configured to select the bacterial species contained in both the result of determining bacterial species and abundance based on the predicted encoding genes and the result of determining bacterial species and abundance using species annotation software as the determined bacterial species in the tumor-associated and non-tumor-associated samples, and to determine their respective bacterial abundances.
The method in the module for determining bacterial species and abundance based on predicted encoding genes and the method in the module for determining bacterial species and abundance using species annotation software each have advantages and disadvantages. If both modules are used to determine the bacterial species, the two methods can verify each other; taking the bacterial species contained in the results of both modules as the finally determined bacterial species improves the reliability of the determination.
A determination module 540 of the significantly enriched bacteria, configured to determine the significantly enriched bacteria in the tumor-related sample. In some embodiments, the operation of the determination module 540 of the significantly enriched bacteria may refer to the operation described above with reference to 140 of fig. 1.
A bacteria known species judgment result acquisition module 550 configured to acquire a bacteria known species judgment result indicating whether the bacteria significantly enriched are of a known species. In some embodiments, the operation of the bacteria known species determination result acquisition module 550 may refer to the operation described above with reference to 150 of fig. 1.
If the species of the significantly enriched bacteria is known, module 560 is employed. A determination module 560 of the genome or protein sequence of a bacterium of a known species, configured to determine, based on the significantly enriched bacteria, the genome or protein sequence of the bacterium of the known species from a database of known genomes. In some embodiments, the operation of the determination module 560 of the genome or protein sequence of a bacterium of a known species may refer to the operation described above with reference to 160 of fig. 1.
If the species of the significantly enriched bacteria is unknown, module 570 is employed. A determination module 570 of the protein sequence of a bacterium of an unknown species, configured to determine, based on the significantly enriched bacteria, the protein sequence of the bacterium of the unknown species from the predicted encoding genes. In some embodiments, the operation of the determination module 570 of the protein sequence of a bacterium of an unknown species may refer to the operation described above with reference to 170 of fig. 1.
A determining module 580 for MHC-binding peptide fragments, configured to predict the binding affinity to MHC of peptide fragments in the protein sequence of a bacterium of a known species or a bacterium of an unknown species, and to screen for MHC-binding peptide fragments in the protein sequence of the bacterium of the known species or the bacterium of the unknown species, thereby determining the MHC-binding peptide fragments. In some embodiments, the operation of the determining module 580 for MHC-binding peptide fragments may refer to the operation described above with reference to 180 of fig. 1.
A tumor neoepitope determination module 590 configured to determine immunogenicity, host similarity, and number of MHC types of the MHC-binding peptide fragments based on the MHC-binding peptide fragments, and screen the peptide fragments based on the immunogenicity, host similarity, and number of MHC types, thereby determining the tumor neoepitope. In some embodiments, the operation of the tumor neoepitope determination module 590 can refer to the operation described above with reference to 190 of fig. 1.
In some embodiments, the tumor neoepitope determination module 590 comprises: a determining module of immunogenicity of the MHC-binding peptide fragment configured to determine immunogenicity of the MHC-binding peptide fragment based on a deep learning model established by the deep neural network and the MHC-binding peptide fragment; a module for determining the similarity of the MHC-binding peptide to the host, which is configured to sequence-align the MHC-binding peptide to the protein sequences of the host from which the tumor-associated sample and the non-tumor-associated sample are derived, and determine the similarity of the MHC-binding peptide to the host; a determining module of the MHC-typing number of MHC-binding peptide fragments, configured to count the number of all MHC-binding peptide fragments likely to bind, thereby determining the MHC-typing number of MHC-binding peptide fragments; and a peptide fragment screening and neoepitope determining module configured to screen peptide fragments based on immunogenicity, similarity to host, and MHC typing number, thereby determining tumor neoepitopes.
Introducing the determining module for the immunogenicity of MHC-binding peptide fragments, the determining module for the similarity of MHC-binding peptide fragments to the host, and the determining module for the MHC typing number of MHC-binding peptide fragments helps to screen out high-quality tumor neoepitopes derived from microorganisms.
Fig. 6 shows a schematic block diagram of a determination module of bacterial species and abundance according to an embodiment of the disclosure. As shown in fig. 6, the determination module 600 of bacterial species and abundance includes a determination module 610 of bacterial species and abundance based on predicted coding genes, a determination module 620 of bacterial species and abundance using species annotation software, and a bacterial species selection and bacterial abundance determination module 630.
A determination module 610 of bacterial species and abundance based on predicted coding genes, configured to determine the bacterial species and abundance in the tumor-associated sample and the non-tumor-associated sample based on the predicted coding genes. In some embodiments, for the operation of the determination module 610, reference may be made to the operation described above with reference to 210 of fig. 2.
A determination module 620 of bacterial species and abundance using species annotation software configured to determine bacterial species and abundance in tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data. In some embodiments, the operation of the determination module 620 using species annotation software may refer to the operation described above with reference to 220 of fig. 2.
A bacterial species selection and bacterial abundance determination module 630, configured to select the bacterial species contained in both the result of determining bacterial species and abundance based on the predicted coding genes and the result of determining bacterial species and abundance using species annotation software as the bacterial species determined for the tumor-associated and non-tumor-associated samples, and to determine their respective bacterial abundances. In some embodiments, for the operation of the bacterial species selection and bacterial abundance determination module 630, reference may be made to the operation described above with reference to 230 of fig. 2.
Fig. 7 shows a schematic block diagram of a tumor neoepitope determination module according to one embodiment of the present disclosure. As shown in fig. 7, the tumor neoepitope determining module 700 includes a determining module 710 for immunogenicity of MHC-binding peptide fragments, a determining module 720 for similarity of MHC-binding peptide fragments to a host, a determining module 730 for MHC-typing number of MHC-binding peptide fragments, and a peptide fragment screening and neoepitope determining module 740.
Determining module 710 of immunogenicity of the MHC-binding peptide fragments, configured to determine immunogenicity of the MHC-binding peptide fragments based on a deep learning model established by the deep neural network and the MHC-binding peptide fragments. In some embodiments, the operation of the immunogenicity determination module 710 may refer to the operation described above with reference to 310 of fig. 3.
A determining module 720 for the similarity of MHC-binding peptide fragments to the host, configured to align the sequences of the MHC-binding peptide fragments against the protein sequences of the host from which the tumor-associated sample and the non-tumor-associated sample are derived, thereby determining the similarity of the MHC-binding peptide fragments to the host. In some embodiments, for the operation of the determining module 720, reference may be made to the operation described above with reference to 320 of fig. 3.
A determining module 730 for the MHC typing number of MHC-binding peptide fragments, configured to count all MHC types to which an MHC-binding peptide fragment may bind, thereby determining the MHC typing number of the MHC-binding peptide fragment. In some embodiments, for the operation of the determining module 730, reference may be made to the operation described above with reference to 330 of fig. 3.
A peptide fragment screening and neoepitope determining module 740 configured to screen peptide fragments based on immunogenicity, similarity to host, and MHC typing number, thereby determining tumor neoepitopes. In some embodiments, the operations of the peptide fragment screening and neoepitope determination module 740 may be referred to the operations described above with reference to 340 of fig. 3.
In some embodiments, the apparatus for determining a tumor neoepitope of microbial origin further comprises a sequencing data quality control module. Fig. 8 shows a schematic block diagram of an apparatus for determining tumor neoepitopes of microbial origin according to one embodiment of the present disclosure. As shown in fig. 8, the apparatus 800 for determining tumor neoepitopes includes a metagenome sequencing data acquisition module 801, a sequencing data quality control module 802, a coding gene prediction module 803, a determination module 804 of bacterial species and abundance based on predicted coding genes, a determination module 805 of bacterial species and abundance using species annotation software, a bacterial species selection and bacterial abundance determination module 806, a determination module 807 of significantly enriched bacteria, a known-species judgment result acquisition module 808, a determination module 809 of the genome or protein sequence of bacteria of known species, a determination module 810 of the protein sequence of bacteria of unknown species, a determination module 811 of MHC-binding peptide fragments, a determination module 812 of the immunogenicity of MHC-binding peptide fragments, a determination module 813 of the similarity of MHC-binding peptide fragments to the host, a determination module 814 of the MHC typing number of MHC-binding peptide fragments, and a peptide fragment screening and neoepitope determination module 815. Module 801, module 803, module 804, module 805, module 806, module 807, module 808, module 809, module 810, module 811, module 812, module 813, module 814, and module 815 are the same as module 510, module 520, module 610, module 620, module 630, module 540, module 550, module 560, module 570, module 580, module 710, module 720, module 730, and module 740, respectively, and are not described again here.
A sequencing data quality control module 802, configured to perform quality control on the metagenomic sequencing data, thereby obtaining quality-controlled metagenomic sequencing data, before metagenomic assembly is performed on the metagenomic sequencing data to obtain an assembled genomic sequence and coding genes in the genome are predicted from the assembled genomic sequence (i.e., before the operation of the coding gene prediction module 803 or the coding gene prediction module 520). For the operation of the sequencing data quality control module 802, reference may be made to the operation described above with reference to 402 of fig. 4.
In some embodiments, the criteria for quality control are: terminal base quality greater than Q20, fewer than 5 N bases, and sequence length greater than or equal to 100 bp. In some embodiments, the sequencing data quality control module 802 is configured to remove adapter (linker) sequences and low-quality sequencing data, obtaining quality-controlled sequencing data with a terminal base quality greater than Q20, fewer than 5 N bases, and a sequence length greater than or equal to 100 bp.
Other aspects
In a third aspect, the present disclosure provides a computing device comprising: a processor; and a memory for storing computer-executable instructions that, when executed, cause the processor to perform the method in the embodiments.
Fig. 9 shows a schematic block diagram of a computing device according to one embodiment of the present disclosure. As can be seen in fig. 9, computing device 900 includes a Central Processing Unit (CPU) 910 (e.g., a processor) and a memory 920 coupled to Central Processing Unit (CPU) 910. The memory 920 is used to store computer-executable instructions that, when executed, cause the Central Processing Unit (CPU) 910 to perform the method of determining a tumor neoepitope derived from a microorganism in the above embodiments. A Central Processing Unit (CPU) 910 and a memory 920 are connected to each other by a bus, to which an input/output (I/O) interface is also connected. Computing device 900 may also include a number of components (not shown in fig. 9) connected to the I/O interface, including, but not limited to: an input unit such as a keyboard, a mouse, etc.; an output unit such as various types of displays, speakers, and the like; a storage unit such as a magnetic disk, an optical disk, or the like; and communication units such as network cards, modems, wireless communication transceivers, and the like. The communications unit allows the computing device 900 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
Further, the above-described method can alternatively be implemented by a computer-readable storage medium. The computer-readable storage medium has computer-readable program instructions embodied thereon for performing various embodiments of the present disclosure. The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
Thus, in a fourth aspect, the present disclosure proposes a computer-readable storage medium having stored thereon computer-executable instructions for performing the method of determining a tumor neoepitope of microbial origin in various embodiments of the present disclosure.
In a fifth aspect, the present disclosure proposes a computer program product tangibly stored on a computer-readable storage medium and comprising computer-executable instructions that, when executed, cause at least one processor to perform the method of determining a tumor neoepitope of microbial origin in various embodiments of the present disclosure.
Tumor neoepitope derived from microorganism
In a sixth aspect, the present disclosure also provides a protein sequence comprising the sequence set forth in any one of SEQ ID NOs 11-40. These protein sequences are exemplary microbial-derived tumor neoepitopes determined using the methods or apparatus of the present invention for determining microbial-derived tumor neoepitopes.
In a seventh aspect, the present disclosure also provides a nucleic acid sequence encoding the protein sequence of the sixth aspect. In some embodiments, the nucleic acid sequence encodes a sequence set forth in any one of SEQ ID NOs 11-40. The nucleic acid sequence may be a DNA sequence or an RNA sequence, which is a coding sequence for the protein sequence of the sixth aspect. As used herein, "coding sequence" or "coding region sequence" refers to a nucleotide sequence in a polynucleotide that can serve, in a biological process, as a template for the synthesis of other polymers having a defined nucleotide sequence (e.g., tRNA and mRNA) or a defined amino acid sequence (e.g., a polypeptide). The coding sequence may be a DNA sequence or an RNA sequence. A DNA sequence or mRNA sequence is considered to encode a polypeptide if the mRNA corresponding to the DNA sequence (including the coding strand, which is identical to the mRNA sequence, and the template strand complementary to it) is translated into the polypeptide in a biological process.
Preferably, the nucleic acid sequence is an mRNA sequence. The protein sequence translated from the mRNA sequence comprises the sequence shown in any one of SEQ ID NOs 11 to 40. These mRNA sequences are mRNA sequences, or their complements, encoding exemplary microbial-derived tumor neoepitopes determined using the methods or apparatus of the invention for determining microbial-derived tumor neoepitopes. In general, mRNA can comprise a 5'-UTR sequence, a coding sequence for a polypeptide, a 3'-UTR sequence, and optionally a poly(A) sequence. mRNA can be produced, for example, by in vitro transcription or chemical synthesis. In some embodiments, the mRNA of the invention is obtained by in vitro transcription from a DNA template by an RNA polymerase (e.g., T7 RNA polymerase). In some embodiments, the mRNA of the invention comprises (1) a 5'-UTR, (2) a coding sequence, (3) a 3'-UTR, and (4) an optional poly(A) sequence. In some embodiments, the mRNA of the invention is a nucleoside-modified mRNA. In some embodiments, the mRNA of the invention comprises an optional 5' cap.
In some embodiments, the sequences shown in SEQ ID NOs 11-40 are derived from Fusobacterium nucleatum (F. nucleatum).
In some embodiments, the sequences shown in SEQ ID NOS 11-40 have affinity for the corresponding mouse MHC shown in the mouse MHC column of Table 3, respectively.
Advantageous effects
Compared with traditional methods for discovering tumor neoepitopes, the method and device focus on microorganisms (such as bacteria) that coexist with tumors, provide a brand-new angle for tumor neoepitope discovery, and greatly increase the number of tumor neoepitopes that can be discovered. Meanwhile, because specific microorganisms (such as bacteria) are specific to tumors, the targeting of tumor treatment is effectively improved. This provides an important reference for the development of tumor neoantigen vaccines, tumor immunotherapy and the design of tumor immunotherapy targets.
The methods, devices and platform (also known as SmartBacNeo) provided by the inventors can efficiently identify and screen protein fragments of tumor-resident intracellular microorganisms (e.g., bacteria) that may be recognized by the immune system. For known intratumoral symbiotic microorganisms or pathogenic microorganisms (e.g., bacteria such as Fusobacterium nucleatum), the portions of bacterial protein sequences that are recognized by the host (e.g., human) immune system can be screened rapidly and accurately.
Examples
Various exemplary embodiments of the present disclosure are described in detail below with reference to the drawings. While the exemplary methods, apparatus described below include software and/or firmware executed on hardware among other components, it should be noted that these examples are merely illustrative and should not be considered as limiting. For example, it is contemplated that any or all of the hardware, software, and firmware components could be embodied exclusively in hardware, exclusively in software, or in any combination of hardware and software. Thus, while exemplary methods and apparatus have been described below, those skilled in the art will readily appreciate that the examples provided are not intended to limit the manner in which such methods and apparatus may be implemented.
Furthermore, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and systems according to various embodiments of the present disclosure. It should be noted that the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by special purpose hardware-based systems which perform the specified functions or operations, or combinations of special purpose hardware and computer instructions.
The present disclosure is described below in terms of several embodiments.
Example 1 establishment of a tumor neoepitope identification tool SmartBacNeo based on the method for determining a tumor neoepitope derived from a microorganism provided in the present disclosure
The invention provides a tumor neoepitope identification tool, SmartBacNeo. SmartBacNeo is based on metagenomic sequencing and a deep learning model, and mainly comprises the steps shown in fig. 10. As shown in fig. 10, the method 1000 of determining a tumor neoepitope of microbial origin comprises step 1010, step 1020, step 1030, step 1040, step 1050, step 1060, step 1070, step 1080 and step 1090.
Step 1010, obtaining sequencing data. In this step, metagenomic sequencing data are obtained, comprising sequencing data from high-throughput sequencing of bacterial DNA in tumor-related samples and non-tumor-related samples. The number of tumor-related samples and the number of non-tumor-related samples are each not less than 10. Bacterial DNA in the tumor-related and non-tumor-related samples is extracted with reference to the extraction method of Nejman D, et al. (The human tumor microbiome is composed of tumor type-specific intracellular bacteria [J]. Science, 2020, 368(6494): 973-980), bacterial DNA sequencing, i.e., metagenomic sequencing, is completed on an Illumina high-throughput sequencing platform, and paired-end fastq files are obtained. The paired-end fastq files are copied into a specific folder. SmartBacNeo reads a sample information table provided by the user and processes the sequencing file of each sample stored in the specific folder according to that table, thereby acquiring the metagenomic sequencing data. The sample information table provided by the user, e.g., samplelist.txt, has the following format: two tab-separated columns, the first being the sample grouping information and the second the sample name, with one sample per line.
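As an illustration only, the two-column sample information table described above could be read as sketched below. The function names, the group labels "tumor" and "healthy", and the sample names are hypothetical and do not come from SmartBacNeo itself; the minimum of 10 samples per group is the one stated in step 1010.

```python
import csv
import io
from collections import defaultdict


def read_sample_table(handle):
    """Parse a tab-separated table: column 1 = group, column 2 = sample name."""
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not "".join(row).strip():
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-separated columns, got {row!r}")
        group, sample = (field.strip() for field in row)
        groups[group].append(sample)
    return dict(groups)


def undersized_groups(groups, minimum=10):
    """Return {group: sample count} for groups with fewer than `minimum` samples."""
    return {g: len(s) for g, s in groups.items() if len(s) < minimum}


# Hypothetical table content; a real samplelist.txt would be opened from disk.
example = io.StringIO("tumor\tT01\ntumor\tT02\nhealthy\tH01\n")
table = read_sample_table(example)
too_small = undersized_groups(table)
```

In this toy input both groups fall below the stated minimum, so `undersized_groups` reports them; a real run would be expected to return an empty dictionary.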
Step 1020, quality control of sequencing data. In this step, with the fastq sequencing data obtained in step 1010 as input, SmartBacNeo automatically invokes the fastp software to perform quality control of the sequencing data. The quality control comprises removing adapter (linker) sequences and removing low-quality sequencing data, so as to obtain quality-controlled sequencing data (i.e., after adapter removal, sequencing data with terminal base quality greater than Q20, fewer than 5 N bases, and sequence length greater than or equal to 100 bp).
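The three quality criteria can be made concrete with a small per-read check. This is a simplified sketch, not how fastp works internally: it assumes Phred+33 quality encoding, reads "terminal base quality" as the quality of the last base of the read, and discards a failing read outright instead of trimming it. The function name and parameters are hypothetical.

```python
def passes_quality_control(sequence, quality, min_tail_q=20, max_n=5,
                           min_length=100, phred_offset=33):
    """Check one FASTQ read against the three criteria stated in the text.

    Assumptions (not from the source): Phred+33 encoding, and "terminal base
    quality" meaning the quality of the final base of the read.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    if len(sequence) < min_length:           # length must be >= 100 bp
        return False
    if sequence.upper().count("N") >= max_n:  # fewer than 5 N bases
        return False
    tail_q = ord(quality[-1]) - phred_offset
    return tail_q > min_tail_q                # terminal quality must exceed Q20


# 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
good_read = passes_quality_control("A" * 100, "I" * 100)
low_tail = passes_quality_control("A" * 100, "I" * 99 + "5")
```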
Step 1030, gene assembly, gene prediction and determination of gene expression level. In this step, based on the quality-controlled sequencing data, SmartBacNeo automatically invokes megahit to assemble sequence fragments from the sequencing data of each sample, which is called "single sample assembly"; the reads from all samples that were not successfully assembled are then pooled and assembled again with megahit, which is called "hybrid assembly", so as to obtain the sequence fragments of the single sample assembly and the hybrid assembly. Based on the assembled sequence fragments, SmartBacNeo automatically invokes prodigal to predict the coding genes in the genome; based on the predicted coding genes, SmartBacNeo automatically invokes cd-hit to remove redundant genes; and SmartBacNeo automatically invokes salmon to align the quality-controlled sequencing data to the predicted genes and calculate the expression level of each gene. The tools invoked are defined and operate as follows:
megahit: based on the de Bruijn graph algorithm, the quality-controlled sequencing data are broken into small fragments (k-mers), and the sequence fragments of the metagenome are assembled from them.
prodigal: potential start codons (ATG, GTG and TTG) and stop codons (TAA, TAG and TGA) are identified in the assembled sequence fragments (DNA sequences). Open reading frames (ORFs) between a start codon and a stop codon are searched for, and protein-coding genes are predicted based on features such as ORF length and codon usage bias.
cd-hit: similar sequences are clustered according to a defined sequence identity threshold, and a representative sequence, called the "centroid", is chosen for each cluster; it is the longest sequence in the cluster. Non-redundant sequences are generated by deleting the sequences that are highly similar to the centroid.
salmon: the concept of quasi-mapping is combined with a two-phase inference procedure to provide accurate expression estimates at very high speed while using little memory. salmon performs its inference with an expressive and realistic model of sequencing data, which takes into account experimental attributes and biases commonly observed in real sequencing data.
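To make the ORF search described for prodigal more tangible, the following toy scanner finds forward-strand open reading frames using the start and stop codons listed above. It is a deliberately minimal sketch: prodigal itself also scans the reverse strand and scores candidates (for example by length and codon usage), none of which is reproduced here, and the function name and `min_length` parameter are hypothetical.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_length=30):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    `end` is exclusive and includes the stop codon. Within a frame, the first
    start codon after the previous stop is used, giving the longest ORF.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon in START_CODONS:
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if pos + 3 - start >= min_length:
                    orfs.append((start, pos + 3, frame))
                start = None  # resume the search after this stop codon
    return orfs


# Synthetic sequence: 2 bp of padding, ATG, ten GCT codons, TAA, 2 bp of padding.
toy_contig = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
toy_orfs = find_orfs(toy_contig)
```

The `min_length` filter stands in, very loosely, for the length-based evidence a real gene predictor weighs.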
Step 1040, bacterial distribution statistics and determination of significantly enriched bacteria. In this step, the genes predicted in step 1030 are aligned with blast against the sequences in the NCBI NR database to predict bacterial species, and bacterial abundance is determined at the same taxonomic level; based on the quality-controlled fastq sequencing data, SmartBacNeo automatically invokes MetaPhlAn to predict bacterial species and distribution, and bacterial abundance is determined at the same taxonomic level; the bacterial species contained in both the result of determining bacterial species and abundance based on the predicted coding genes and the result of determining bacterial species and abundance using species annotation software are selected as the bacterial species determined for the tumor-associated and non-tumor-associated samples; for each selected bacterial species, the average of its abundance in the two results is calculated, thereby determining the corresponding bacterial abundance; SmartBacNeo automatically applies the Wilcoxon rank sum test to identify the bacteria significantly enriched in tumor-related samples. The screening criteria for significantly enriched bacteria are: the bacterial abundance in the tumor-associated samples is greater than or equal to 2 times the bacterial abundance in the non-tumor-associated samples, and the p-value of the statistical test is less than or equal to 0.05. Optionally, the software R and the R package circlize are used to plot a species distribution circle diagram showing the significantly enriched bacterial species and their abundance in tumor-related samples. Optionally, the software R and the R package pheatmap are used to plot a species distribution heat map showing the significantly enriched bacterial species and their abundance in tumor-related samples.
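The selection logic of step 1040 can be sketched in a few functions: keep species reported by both methods, average the two abundance estimates, and apply the fold-change and p-value criteria. The rank sum test below is a plain normal approximation with midranks for ties and no tie or continuity correction, written out only to keep the sketch dependency-free; it is an assumption that this matches the variant SmartBacNeo calls. All function names and the example numbers are hypothetical.

```python
import math


def consensus_abundance(gene_based, annotation_based):
    """Keep species found by both methods; average their two abundances."""
    shared = gene_based.keys() & annotation_based.keys()
    return {sp: (gene_based[sp] + annotation_based[sp]) / 2 for sp in sorted(shared)}


def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank sum p-value (normal approximation, midranks)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1  # average 1-based rank of the tied block
        i = j + 1
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return math.erfc(abs(w - mean) / sd / math.sqrt(2))


def log2_fold_change(tumor_mean, healthy_mean):
    """log2(tumor/healthy); infinite when one of the means is zero."""
    if healthy_mean == 0:
        return math.inf
    if tumor_mean == 0:
        return -math.inf
    return math.log2(tumor_mean / healthy_mean)


def is_enriched(tumor, healthy, fold=2.0, alpha=0.05):
    """Apply both criteria: mean abundance >= 2x and rank sum p-value <= 0.05."""
    t_mean = sum(tumor) / len(tumor)
    h_mean = sum(healthy) / len(healthy)
    return (t_mean > 0 and t_mean >= fold * h_mean
            and rank_sum_p_value(tumor, healthy) <= alpha)


# Hypothetical abundances for one species in 10 tumor and 10 healthy samples.
tumor_values = [float(v) for v in range(10, 20)]
healthy_values = [float(v) for v in range(10)]
enriched = is_enriched(tumor_values, healthy_values)
```

For production use, a vetted statistics library would normally replace the hand-written test.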
Step 1050, obtaining a judgment of whether the bacteria belong to a known species. In this step, based on the bacterial species determined in step 1040, the NCBI Genome database (https://www.ncbi.nlm.nih.gov/Genome) is searched for genomic sequence or protein sequence information of the significantly enriched bacteria; if such information is found, the bacteria are judged to be of a known species; if not, they are judged to be of an unknown species.
If the significantly enriched bacteria are of a known species, step 1060 is used to determine the genome or protein sequence of the bacteria of known species. In this step, the NCBI Genome database is searched for the bacterial genome or protein fasta sequence file based on the species-level name of the bacteria significantly enriched in the tumor-associated samples.
If the significantly enriched bacteria are of an unknown species, step 1070 is used to determine the protein sequence of the bacteria of unknown species. In this step, based on the correspondence between genes and bacterial species predicted in the bacterial distribution statistics of step 1040, SmartBacNeo translates the genes into a protein fasta sequence file.
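A minimal sketch of the gene-to-protein translation in step 1070 is given below, assuming the predicted genes are in-frame coding sequences on the forward strand. It uses the standard codon assignments and, as an explicit assumption, treats the alternative bacterial start codons GTG and TTG as methionine when they occur first. The gene name and sequence are hypothetical, and the source does not state which translation routine SmartBacNeo actually uses.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard codon table, built in TCAG order (third base varies fastest).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(gene):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    gene = gene.upper()
    protein = []
    for pos in range(0, len(gene) - 2, 3):
        codon = gene[pos:pos + 3]
        amino_acid = CODON_TABLE.get(codon, "X")  # X for ambiguous codons
        if amino_acid == "*":
            break
        if pos == 0 and codon in {"GTG", "TTG"}:
            amino_acid = "M"  # assumption: alternative start codons initiate with Met
        protein.append(amino_acid)
    return "".join(protein)


def to_fasta(genes):
    """Render {gene name: DNA sequence} as protein FASTA text."""
    return "".join(f">{name}\n{translate(seq)}\n" for name, seq in genes.items())


fasta_text = to_fasta({"gene_1": "ATGGCCAAATAA"})
```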
Step 1080, determining bacteria-derived MHC-binding peptide fragments. In this step, based on the bacterial genome or protein fasta sequence file of step 1060, or the protein fasta sequence file of step 1070, SmartBacNeo automatically invokes the trained model of the open-source NetMHCpan-4.1 to predict the binding affinity of peptide fragments to MHC molecules of known sequence, so as to screen for peptide fragments with affinity for HLA class I. SmartBacNeo also automatically invokes the trained model of the open-source BigMHC to predict the binding affinity of peptide fragments to MHC molecules of known sequence, to help validate the peptide fragments predicted by NetMHCpan-4.1. The tools invoked are defined and operate as follows:
NetMHCpan-4.1: the binding of peptide fragments to MHC molecules of known sequence is predicted using artificial neural networks. The predicted affinities are expressed as the scores SW_score A and SW_score B: the smaller SW_score A and SW_score B are, the more strongly a peptide fragment binds to the corresponding MHC molecule.
BigMHC: the binding affinity of peptide fragments to MHC molecules of known sequence is predicted using deep neural networks, and the predicted affinity is expressed as the score SW_score C. A higher SW_score C indicates stronger binding of a peptide fragment to the corresponding MHC molecule.
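Before any predictor can score peptide-MHC pairs, the protein sequences have to be cut into candidate peptide fragments. The sketch below enumerates all distinct sub-sequences of a protein with a sliding window. The window lengths of 8 to 11 residues are an assumption based on typical MHC class I ligand lengths, not a value stated in the source, and the function name is hypothetical.

```python
def enumerate_peptides(protein, lengths=(8, 9, 10, 11)):
    """Yield each distinct peptide of the given lengths, in sliding-window order."""
    seen = set()
    for k in lengths:
        for start in range(len(protein) - k + 1):
            peptide = protein[start:start + k]
            if peptide not in seen:  # report repeated sub-sequences only once
                seen.add(peptide)
                yield peptide


# Hypothetical 10-residue protein, cut into 8-mers and 9-mers.
candidates = list(enumerate_peptides("ACDEFGHIKL", lengths=(8, 9)))
```

Each candidate would then be paired with every MHC type of interest and passed to the affinity predictors.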
Step 1090, determining tumor neoepitopes. In this step, the trained model of the open-source BigMHC is invoked to predict the immunogenicity of the peptide fragments; a scoring mechanism for the same binding peptide fragment across multiple MHC types is introduced, shown as "Number_of_HLAs" in Table 2; and a scoring mechanism for the similarity of MHC-binding peptide fragments to host protein sequences is introduced, shown as "Similarity (match length/epitope length)" in Table 2, thereby determining tumor neoepitopes. The multi-MHC-type scoring mechanism reflects the number of MHC types that can bind the same peptide fragment; the higher this score, the more MHC types a peptide fragment may bind and the stronger the immune activation it may exert. Among the candidate peptide fragments, those with higher scores are preferred for further study. Illustratively, among the candidate peptide fragments, peptide fragments having an immunogenicity score of 0.9 or more, a similarity score with the host of 0.5 or less, and an MHC typing number score of 2 or more are selected as tumor neoepitopes.
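The illustrative thresholds of step 1090 translate directly into a filter. In the sketch below, similarity is computed as match length divided by epitope length, following the column name in Table 2, and the MHC typing number is the count of distinct MHC types predicted to bind the peptide. The class, field names, peptide strings and scores are hypothetical placeholders, not results from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    peptide: str
    immunogenicity: float  # immunogenicity score in [0, 1]
    match_length: int      # longest match to a host protein, in residues
    hla_types: tuple       # MHC types predicted to bind this peptide

    @property
    def similarity(self):
        """Similarity to host = match length / epitope length."""
        return self.match_length / len(self.peptide)

    @property
    def number_of_hlas(self):
        return len(set(self.hla_types))


def select_neoepitopes(candidates, min_immunogenicity=0.9,
                       max_similarity=0.5, min_hlas=2):
    """Keep candidates that meet all three illustrative thresholds."""
    return [c for c in candidates
            if c.immunogenicity >= min_immunogenicity
            and c.similarity <= max_similarity
            and c.number_of_hlas >= min_hlas]


two_types = ("HLA-A*02:01", "HLA-B*07:02")
pool = [
    Candidate("ACDEFGHIK", 0.95, 3, two_types),         # passes all three
    Candidate("CDEFGHIKL", 0.95, 6, two_types),         # too similar to host
    Candidate("DEFGHIKLM", 0.80, 3, two_types),         # immunogenicity too low
    Candidate("EFGHIKLMN", 0.95, 3, ("HLA-A*02:01",)),  # binds only one MHC type
]
selected = [c.peptide for c in select_neoepitopes(pool)]
```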
Example 2 tumor neoepitope identification tool SmartBacneo predicts a bacterial-derived tumor neoepitope
In this exemplary embodiment, the SmartBacNeo analysis tool was used to predict bacteria-derived tumor neoepitopes. For the specific operations, reference may be made to the operations described above with reference to 1010, 1020, 1030, 1040, 1050, 1060, 1070, 1080 and 1090 of fig. 10. In this embodiment, in step 1010, the tumor-related samples are stool samples of 10 colorectal cancer patients, and the non-tumor-related samples are stool samples of 10 healthy people.
Table 1 shows the bacteria significantly enriched in the fecal samples of tumor patients, obtained with SmartBacNeo through steps 1010, 1020, 1030 and 1040 in this example. As shown in Table 1, bacteria associated with colorectal cancer, such as Fusobacterium nucleatum, were screened out from the human colorectal cancer samples. A published study (Yachida S, et al. Metagenomic and metabolomic analyses reveal distinct stage-specific phenotypes of the gut microbiota in colorectal cancer [J]. Nature Medicine, 2019, 25(6): 968-976) has shown that Fusobacterium nucleatum is highly enriched in colon cancer tissue and may be strongly correlated with the occurrence of colorectal cancer. This demonstrates the feasibility of SmartBacNeo.
Table 1. Results of SmartBacNeo analysis of significantly enriched bacteria in the intestinal tract of colorectal cancer patients
Species: species-level name of the bacterium
Tumor_mean: average bacterial abundance in the intestinal tract of tumor patients
Healthy_mean: average bacterial abundance in the intestinal tract of healthy people
log2Foldchange: log2 of the ratio of bacterial abundance in the intestinal tract of tumor patients to that in the intestinal tract of healthy people
p-value: p-value of the statistical test
Inf: infinity
In addition, in this example, step 1040 produced an exemplary species distribution circle diagram, drawn with the software R and the R package circlize, showing the significantly enriched bacterial species and their abundance in the feces of tumor patients, as shown in fig. 11. The figure visually shows the bacterial species that are significantly enriched in fecal samples from tumor patients. The left part of the figure shows the proportion of each group within each bacterial species; the right part shows the proportion of each bacterial species within each group. The bacterial species enriched in the feces of tumor patients can be clearly identified from the distribution circle diagram.
In this example, step 1040 also produced an exemplary species distribution heat map, drawn with the software R and the R package pheatmap, showing the significantly enriched bacterial species and their abundance in the feces of tumor patients, as shown in fig. 12. The uppermost color block represents the grouping information of the samples, and the hierarchical cluster tree represents the degree of similarity in species composition between samples; the closer two species are in the tree, the more similar their distribution across the samples. Colors from light to dark indicate relative species abundance from low to high.
The exemplary analysis results of step 1080 of this example are shown in Table 2, where each row of Table 2 is an HLA-peptide fragment pair, and the scores are those used by SmartBacNeo to assess the predicted affinity of the peptide fragment for the HLA and its immunogenicity. The reliability scores of the predicted affinity results are SW_score A, SW_score B and SW_score C. The smaller SW_score A and SW_score B are, the more reliable the result; the larger SW_score C is, the more reliable the result.
TABLE 2 example Table of neoepitope prediction results
HLA_type: HLA typing information
Epitope: predicted epitope information
Length: length of epitope sequence
SW_score A: scoring criterion A
SW_score B: scoring criterion B
SW_score C: scoring criterion C
Similarity (match length/match length): similarity of the epitope sequence to host SwissProt protein sequences
Name_in_multi_hlas: HLA typing information for the HLA types having the same epitope
Number_of_hlas: total number of HLA types having the same epitope
The received_in_multi_peptides: HLA typing information for peptide fragments containing the same epitope
Number_of_pep_hlas: total number of peptide fragments containing the same epitope
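The sharing columns in the legend above can be illustrated with a short sketch. It assumes the prediction output is available as (HLA type, epitope) pairs and groups them by epitope to obtain the list of HLA types and their count, in the spirit of Name_in_multi_hlas and Number_of_hlas. The function name and the example rows are hypothetical and are not taken from SmartBacNeo or from Table 2.

```python
from collections import defaultdict


def summarize_hla_sharing(pairs):
    """Group (hla_type, epitope) pairs by epitope.

    Returns {epitope: (sorted list of HLA types, number of HLA types)},
    which corresponds to the Name_in_multi_hlas and Number_of_hlas columns.
    """
    hlas_by_epitope = defaultdict(set)
    for hla_type, epitope in pairs:
        hlas_by_epitope[epitope].add(hla_type)
    return {
        epitope: (sorted(hlas), len(hlas))
        for epitope, hlas in hlas_by_epitope.items()
    }


# Hypothetical input rows; the peptide strings are placeholders only.
example_pairs = [
    ("HLA-A02:01", "PEPTIDEAA"),
    ("HLA-A11:01", "PEPTIDEAA"),
    ("HLA-A02:01", "PEPTIDEBB"),
]
summary = summarize_hla_sharing(example_pairs)
```

Using a set per epitope means a repeated HLA-peptide pair is counted only once, which seems the natural reading of "total number of HLA types having the same epitope".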
In this exemplary embodiment, SmartBacNeo was also used to predict peptide fragments with MHC class I affinity from F. nucleatum, a common tumor-associated pathogen. In step 1080, peptide fragments with a rank% below 0.5 in the NetMHCpan-4.1 results are retained. If more than 50 peptide fragments pass this filter, BigMHC is invoked in step 1080 and, preferably, the peptide fragments with the highest "BigMHC score (immunogenicity score)" are selected. In step 1090, the peptide fragments with the highest "multiple MHC typing identical binding peptide fragment score" and the lowest "MHC-bindable peptide fragment to host protein sequence similarity score" are selected. Illustratively, a candidate peptide fragment with an immunogenicity score of 0.9 or more, a host similarity score of 0.5 or less, and an MHC typing number score of 2 or more is selected as a tumor neoepitope.
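The illustrative thresholds in this paragraph can be sketched as a simple filter. The sketch assumes that all scores have already been computed and stored per candidate; it does not call NetMHCpan-4.1 or BigMHC, and it does not model the conditional BigMHC invocation for more than 50 binders. The class, field, and function names are hypothetical, and ordering the survivors by immunogenicity first is an assumption rather than something the text specifies.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    peptide: str
    rank_percent: float    # rank% reported by the affinity predictor
    immunogenicity: float  # immunogenicity score (higher is better)
    host_similarity: float # similarity to host proteins (lower is better)
    n_mhc_types: int       # number of MHC types binding the same peptide


def select_neoepitopes(candidates, max_rank=0.5, min_immuno=0.9,
                       max_similarity=0.5, min_mhc_types=2):
    """Apply the illustrative cut-offs and return the surviving candidates."""
    # Step 1: keep binders whose rank% is strictly below the cut-off.
    binders = [c for c in candidates if c.rank_percent < max_rank]
    # Step 2: apply the three illustrative thresholds from the text.
    kept = [
        c for c in binders
        if c.immunogenicity >= min_immuno
        and c.host_similarity <= max_similarity
        and c.n_mhc_types >= min_mhc_types
    ]
    # Assumed ordering: most immunogenic first, then most MHC types,
    # then least similar to the host.
    return sorted(
        kept,
        key=lambda c: (-c.immunogenicity, -c.n_mhc_types, c.host_similarity),
    )


# Hypothetical candidates with made-up scores, for illustration only.
pool = [
    Candidate("PEP_A", 0.1, 0.95, 0.2, 3),
    Candidate("PEP_B", 0.6, 0.99, 0.1, 4),  # fails the rank% cut-off
    Candidate("PEP_C", 0.3, 0.80, 0.2, 3),  # fails immunogenicity
    Candidate("PEP_D", 0.2, 0.99, 0.6, 3),  # fails host similarity
    Candidate("PEP_E", 0.4, 0.92, 0.5, 2),  # passes on the boundaries
]
selected = select_neoepitopes(pool)
```

Keeping the cut-offs as keyword arguments makes it easy to tighten or relax them, since the text presents these values only as an illustration.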
SmartBacNeo has two core functions: 1) identifying and screening bacteria that are specifically enriched in tumor-associated samples; 2) predicting and selecting, from different bacterial species, peptide fragments that can be bound by the host MHC and activate the host immune response. Using SmartBacNeo, a number of neoepitope sequences were predicted from the protein sequences of Fusobacterium nucleatum, some of which have already been validated. As shown in Table 3, the "neoepitope sequence-1" entries listed against human MHC are predicted peptide fragments from Fusobacterium nucleatum proteins that might activate a human immune response. These 10 peptide fragments (SEQ ID NOs: 1-10) have been validated by Kalaora et al. as Fusobacterium nucleatum-derived peptides in melanoma that can be bound by human MHC (Kalaora S, et al. Identification of bacteria-derived HLA-bound peptides in melanoma [J]. Nature, 2021, 592(7852): 138-143.). This demonstrates the feasibility of using the SmartBacNeo tool to determine tumor neoepitopes.
If a predicted neoepitope sequence is reliable, it should be recognized by the immune system and thereby lead to the killing of tumor cells. To verify the credibility of the predicted neoepitope sequences, and because the relevant experiments are difficult to perform directly in humans, the applicant chose to first verify the feasibility and credibility of the neoepitope sequences predicted by SmartBacNeo in a mouse experiment.
First, in determining the bacteria-derived MHC-binding peptide fragments in step 1080, the applicant used the bacterial genome or protein FASTA sequence file from step 1060, or the protein FASTA sequence file from step 1070. SmartBacNeo automatically invoked the trained model of the open-source NetMHCpan-4.1 to predict the binding affinity of the peptide fragments to mouse MHC molecules of known sequence and to screen for peptide fragments with mouse MHC affinity. Finally, in determining the tumor neoepitopes in step 1090, mouse neoepitope sequences against mouse MHC molecules were obtained (shown as "neoepitope sequence-2" in Table 3).
Second, the applicant plans to prepare a mouse tumor model for further experiments to verify the feasibility and credibility of the neoepitope sequences predicted by SmartBacNeo. The specific method is as follows: Fusobacterium nucleatum (ATCC 25586) and the mouse-derived esophageal squamous carcinoma cell line mEC are co-cultured, and the mixed culture is inoculated subcutaneously into C57BL/6 mice to prepare the mouse tumor model. The mice are divided into experimental and control groups. The experimental group consists of tumor-bearing mice injected with a vaccine containing the mouse neoepitope sequences in Table 3, subdivided according to the neoepitope sequence to be verified. The control group consists of tumor-bearing mice that receive no vaccine. This example can demonstrate the feasibility and credibility of the neoepitope sequences predicted by SmartBacNeo through the vaccine's inhibition or killing of tumor cells. If the subcutaneous tumor cells of the experimental group are inhibited or undergo apoptosis while those of the control group grow normally, then a vaccine comprising the neoepitope sequences predicted by SmartBacNeo inhibits or kills tumor cells infected with Fusobacterium nucleatum, thereby verifying the feasibility of SmartBacNeo.
TABLE 3 Exemplary Fusobacterium nucleatum neoepitope sequences predicted by SmartBacNeo
Human MHC | Neoepitope sequence-1 | SEQ ID NO: | Mouse MHC | Neoepitope sequence-2 | SEQ ID NO: |
---|---|---|---|---|---|
HLA-A02:01 | ATSDLNDLY | 1 | H-2-Db | KAIEFMQTM | 11 |
HLA-A11:01 | FSDKMVDYL | 2 | H-2-Dd | LSIQNFTVL | 12 |
HLA-A24:02 | FTTDTAAAL | 3 | H-2-Dq | ISFKNVITF | 13 |
HLA-B40:01 | HRYPDRVLL | 4 | H-2-Kb | VSHENLSLL | 14 |
HLA-B46:01 | IAHTNPNTL | 5 | H-2-Kd | IAMTSYTPL | 15 |
HLA-C01:02 | IASDVSAIL | 6 | H-2-Kk | VSYEIQDTM | 16 |
HLA-C03:04 | LAHTNPNTL | 7 | H-2-Kq | IAISFFDNI | 17 |
HLA-C06:02 | NSDADPMSY | 8 | H-2-Ld | SAFGVIATL | 18 |
HLA-C07:02 | QQLETPIML | 9 | H-2-Lq | HSVEFLQYL | 19 |
HLA-C08:01 | TVDHAAITL | 10 | H-2-Qa1 | ITLQNYFRM | 20 |
 | | | H-2-Qa2 | MVYNNLYEL | 21 |
 | | | | SGILFIHYI | 22 |
 | | | | AAPGRFEAL | 23 |
 | | | | VSYSNEAKI | 24 |
 | | | | KAIEFVDRI | 25 |
 | | | | VAMKKLPEL | 26 |
 | | | | IGFGNYISV | 27 |
 | | | | IMVRNHAKL | 28 |
 | | | | ISPTSFFQI | 29 |
 | | | | SALWPFSTM | 30 |
 | | | | FGYNIPTL | 31 |
 | | | | STYRIITEI | 32 |
 | | | | VSLNIIEL | 33 |
 | | | | SAYIPTNVISI | 34 |
 | | | | YIPTNVISI | 35 |
 | | | | IAILGMDEL | 36 |
 | | | | VFYLHSRL | 37 |
 | | | | AVYNHYKRI | 38 |
 | | | | LADDNFSTI | 39 |
 | | | | RGVPQIEVTF | 40 |
Other embodiments
Fig. 9 shows a schematic block diagram of a computing device according to one embodiment of the present disclosure. As can be seen in fig. 9, computing device 900 includes a Central Processing Unit (CPU) 910 (e.g., a processor) and a memory 920 coupled to Central Processing Unit (CPU) 910. The memory 920 is used to store computer-executable instructions that, when executed, cause the Central Processing Unit (CPU) 910 to perform the methods in the above embodiments. A Central Processing Unit (CPU) 910 and a memory 920 are connected to each other by a bus, to which an input/output (I/O) interface is also connected. Computing device 900 may also include a number of components (not shown in fig. 9) connected to the I/O interface, including, but not limited to: an input unit such as a keyboard, a mouse, etc.; an output unit such as various types of displays, speakers, and the like; a storage unit such as a magnetic disk, an optical disk, or the like; and communication units such as network cards, modems, wireless communication transceivers, and the like. The communications unit allows the computing device 900 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
Accordingly, in another embodiment, the present disclosure presents a computing device comprising a processor; and a memory for storing computer-executable instructions that, when executed, cause the processor to perform the method of determining tumor neoepitopes in various embodiments of the present disclosure.
Further, the above-described method can alternatively be implemented by a computer-readable storage medium. The computer-readable storage medium has computer-readable program instructions embodied thereon for performing various embodiments of the present disclosure. The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
Thus, in another embodiment, the present disclosure presents a computer-readable storage medium having stored thereon computer-executable instructions for performing the method of determining a tumor neoepitope in various embodiments of the present disclosure.
In another embodiment, the present disclosure proposes a computer program product tangibly stored on a computer-readable storage medium and comprising computer-executable instructions that, when executed, cause at least one processor to perform the method of determining a tumor neoepitope in various embodiments of the present disclosure.
In general, the various example embodiments of the disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic, or any combination thereof. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of the embodiments of the present disclosure are illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Computer-readable program instructions or computer program products for executing the various embodiments of the present disclosure can also be stored in the cloud. When they need to be invoked, a user can access the computer-readable program instructions stored in the cloud for executing an embodiment of the present disclosure through the mobile internet, a fixed network, or another network, thereby implementing the technical solutions disclosed in the various embodiments of the present disclosure.
While embodiments of the present disclosure have been described with reference to several particular embodiments, it should be understood that embodiments of the present disclosure are not limited to the particular embodiments of the disclosure. The embodiments of the disclosure are intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims. The scope of the claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.
Claims (21)
1. A method of determining a tumor neoepitope of microbial origin, the method comprising:
obtaining metagenome sequencing data, wherein the metagenome sequencing data comprises sequencing data obtained by sequencing bacterial DNA in tumor-related samples and non-tumor-related samples in a high throughput manner;
Performing metagenome assembly based on the metagenome sequencing data to obtain an assembled genome sequence, and predicting coding genes in the genome based on the assembled genome sequence;
determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on predicted encoding genes and/or the metagenomic sequencing data;
determining bacteria that are significantly enriched in the tumor-associated sample;
obtaining a known species determination of a bacterium, the known species determination of a bacterium indicating whether the significantly enriched bacterium is a known species;
determining, based on the significantly enriched bacteria, the genome or protein sequence of bacteria of a known species from a known genome database; or alternatively
Determining a protein sequence of a bacterium of an unknown species from the predicted encoding gene based on the significantly enriched bacterium;
predicting the binding affinity of the peptide fragment in the protein sequence of the known species of bacteria or the peptide fragment in the protein sequence of the unknown species of bacteria to the MHC, screening the peptide fragment in the protein sequence of the known species of bacteria that can bind to the MHC or the peptide fragment in the protein sequence of the unknown species of bacteria, and thereby determining the peptide fragment that can bind to the MHC; and
Determining the immunogenicity, host similarity and number of MHC-typing of said MHC-binding peptide fragments based on said MHC-binding peptide fragments, and screening peptide fragments based on said immunogenicity, host similarity and number of MHC-typing, thereby determining the tumor neoepitope.
2. The method of claim 1, wherein
The tumor-related sample is a donor tumor tissue sample or a fecal sample of a tumor patient, and/or
The non-tumor related sample is a donor paracancerous tissue sample, a normal tissue sample or a stool sample of a healthy population.
3. The method of claim 1 or 2, the method further comprising: before performing metagenome assembly based on the metagenome sequencing data to obtain an assembled genome sequence and predicting coding genes in the genome based on the assembled genome sequence, performing quality control on the metagenome sequencing data to obtain quality-controlled metagenome sequencing data; preferably, the quality control criteria are: terminal base quality greater than Q20, fewer than 5 N bases, and sequence length greater than or equal to 100 bp.
4. A method according to any one of claims 1-3, wherein
Determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on predicted encoding genes and/or the metagenomic sequencing data comprises:
determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on the predicted encoding genes; or alternatively
Determining bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data; or alternatively
Determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples based on the predicted encoding genes;
determining bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data; and
selecting the bacterial species contained in both the result of determining bacterial species and abundance based on the predicted encoding genes and the result of determining bacterial species and abundance using species annotation software as the bacterial species in the tumor-related samples and the non-tumor-related samples, and determining their corresponding bacterial abundance.
5. The method of any one of claims 1-4, wherein
The length of the assembled genome sequence is more than or equal to 90bp.
6. The method of claim 4 or 5, wherein
Based on the predicted encoding genes, determining bacterial species and abundance in the tumor-associated and non-tumor-associated samples comprises:
the predicted coding genes are sequence aligned with sequences in a known database to predict bacterial species and determine bacterial abundance at the same taxonomic level, preferably the input sequence for sequence alignment is the translated protein sequence of the predicted coding genes.
7. The method of any one of claims 4-6, wherein
Performing metagenome assembly based on the metagenome sequencing data, obtaining an assembled genome sequence, and predicting a coding gene in a genome based on the assembled genome sequence, including:
performing metagenome assembly based on the metagenome sequencing data after quality control to obtain an assembled genome sequence, and predicting coding genes in the genome based on the assembled genome sequence; and/or
Determining bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data comprises:
Species annotation software is used to determine bacterial species and abundance in the tumor-related and non-tumor-related samples based on quality-controlled metagenomic sequencing data.
8. The method of any one of claims 1-7, wherein
Determining bacteria that are significantly enriched in the tumor-associated sample comprises: determining the significantly enriched bacteria by a Wilcoxon rank-sum test; preferably, the screening criteria for the significantly enriched bacteria are: the bacterial abundance in the tumor-related samples is greater than or equal to 2 times that in the non-tumor-related samples, and the p-value of the statistical test is less than or equal to 0.05.
9. The method of any one of claims 1-8, wherein
The MHC is high-frequency HLA of Chinese population.
10. The method of any one of claims 1-9, wherein
Screening criteria for peptide fragments in protein sequences of bacteria of known species or of unknown species that bind to MHC are: the affinity ranks within the top 0.5%, and the affinity is greater than 0 and less than or equal to 500 nM.
11. The method of any one of claims 1-10, wherein
Determining the immunogenicity of the MHC-binding peptide fragment based on the MHC-binding peptide fragment comprises:
Determining the immunogenicity of the MHC-binding peptide based on a deep learning model established by a deep neural network and the MHC-binding peptide, preferably, determining the immunogenicity of the MHC-binding peptide based on the MHC-binding peptide further comprises scoring according to the deep learning model, the higher the score the higher the immunogenicity; and/or
Determining the similarity of the MHC-binding peptide fragment to a host based on the MHC-binding peptide fragment comprises:
determining the similarity of the MHC-binding peptide fragment to the host by sequence alignment of the MHC-binding peptide fragment with the protein sequences of the host from which the tumor-associated sample and the non-tumor-associated sample are derived; preferably, determining the similarity of the MHC-binding peptide fragment to the host based on the MHC-binding peptide fragment further comprises introducing an MHC-binding peptide fragment to host protein sequence similarity score, wherein the score is the output of the sequence alignment, and the higher the score, the higher the similarity to the host; and/or
Determining the MHC class number of the MHC-binding peptide fragments based on the MHC-binding peptide fragments comprises:
counting the number of all MHC to which the MHC-binding peptide fragment may bind, thereby determining the MHC-typing number of the MHC-binding peptide fragment, preferably, determining the MHC-typing number of the MHC-binding peptide fragment based on the MHC-binding peptide fragment further comprises introducing a multiple MHC-typing identical binding peptide fragment scoring score for the number of all MHC to which each peptide fragment may bind, the higher the score the greater the MHC-typing number of the MHC-binding peptide fragment.
12. The method of claim 11, wherein
Screening peptide fragments based on the immunogenicity, host similarity and MHC typing numbers, and further determining tumor neoepitopes comprises:
screening for peptide fragments with a high deep learning model score, a high multiple MHC typing identical binding peptide fragment score, and a low MHC-binding peptide fragment to host protein sequence similarity score, thereby determining the tumor neoepitope.
13. An apparatus for determining a tumor neoepitope of microbial origin, comprising:
a metagenome sequencing data acquisition module configured to acquire metagenome sequencing data comprising sequencing data after high-throughput sequencing of bacterial DNA in tumor-related samples and non-tumor-related samples;
a coding gene prediction module configured to perform metagenome assembly based on the metagenome sequencing data, obtain an assembled genome sequence, and predict a coding gene in a genome based on the assembled genome sequence;
a determination module of bacterial species and abundance configured to determine bacterial species and abundance in the tumor-related and non-tumor-related samples based on predicted encoding genes and/or the metagenomic sequencing data;
A determination module of significantly enriched bacteria configured to determine significantly enriched bacteria in the tumor-associated sample;
a bacteria known species judgment result acquisition module configured to acquire a bacteria known species judgment result indicating whether the significantly enriched bacteria is a known species;
a determining module of a genomic or protein sequence of a bacterium of a known species configured to determine, based on the significantly enriched bacterium, the genomic or protein sequence of the bacterium of the known species from a known genomic database;
a determining module of a protein sequence of a bacterium of an unknown species configured to determine a protein sequence of a bacterium of an unknown species from the predicted encoding gene based on the significance-enriched bacterium;
a determining module of peptide fragments capable of binding to MHC configured to predict binding affinity of peptide fragments in the protein sequence of the known species of bacteria or peptide fragments in the protein sequence of the unknown species of bacteria to MHC, screening peptide fragments in the protein sequence of the known species of bacteria capable of binding to MHC or peptide fragments in the protein sequence of the unknown species of bacteria, and thereby determining peptide fragments capable of binding to MHC; and
A tumor neoepitope determining module configured to determine immunogenicity, host similarity and number of MHC class-divisions of the MHC-binding peptide based on the MHC-binding peptide, and to screen the peptide based on the immunogenicity, host similarity and number of MHC class-divisions, thereby determining a tumor neoepitope.
14. The apparatus of claim 13, further comprising a sequencing data quality control module configured to perform quality control on the metagenomic sequencing data before metagenome assembly is performed based on the metagenomic sequencing data to obtain an assembled genomic sequence and coding genes in the genome are predicted based on the assembled genomic sequence, thereby obtaining quality-controlled metagenomic sequencing data; preferably, the quality control criteria are: terminal base quality greater than Q20, fewer than 5 N bases, and sequence length greater than or equal to 100 bp.
15. The apparatus of claim 13 or 14, wherein the determination module of bacterial species and abundance comprises:
a determination module based on the predicted bacterial species and abundance of the encoding gene configured to determine bacterial species and abundance in the tumor-associated sample and non-tumor-associated sample based on the predicted encoding gene; or alternatively
A determination module of bacterial species and abundance using species annotation software configured to determine bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data; or alternatively
A determination module based on the predicted bacterial species and abundance of the encoding gene configured to determine bacterial species and abundance in the tumor-associated sample and non-tumor-associated sample based on the predicted encoding gene;
a determination module of bacterial species and abundance using species annotation software configured to determine bacterial species and abundance in the tumor-related and non-tumor-related samples using species annotation software based on the metagenomic sequencing data; and
a bacterial species selection and bacterial abundance determination module configured to select bacterial species contained in both the results of determining bacterial species and abundance based on the predicted encoding genes and determining bacterial species and abundance using species annotation software as bacterial species in the determined tumor-associated and non-tumor-associated samples and determine their respective bacterial abundances.
16. The device of any one of claims 13-15, wherein the tumor neoepitope determining module comprises:
A determining module of immunogenicity of an MHC-binding peptide fragment configured to determine immunogenicity of the MHC-binding peptide fragment based on a deep learning model established by a deep neural network and the MHC-binding peptide fragment;
a module for determining the similarity of the MHC-binding peptide to the host, which is configured to sequence-align the MHC-binding peptide to the protein sequences of the host from which the tumor-associated sample and the non-tumor-associated sample are derived, and determine the similarity of the MHC-binding peptide to the host;
a determining module of MHC class number of MHC-binding peptide fragments configured to count the number of all MHC that the MHC-binding peptide fragments may bind to, thereby determining the MHC class number of the MHC-binding peptide fragments; and
a peptide fragment screening and neoepitope determining module configured to screen peptide fragments based on the immunogenicity, the similarity to host, and the MHC typing number, thereby determining a tumor neoepitope.
17. A computing device, comprising:
a processor; and
a memory for storing computer-executable instructions that, when executed, cause the processor to perform the method of any of claims 1-16.
18. A computer readable storage medium having stored thereon computer executable instructions for performing the method according to any of claims 1-12.
19. A computer program product tangibly stored on a computer-readable storage medium and comprising computer-executable instructions that, when executed, cause at least one processor to perform the method of any one of claims 1-12.
20. A protein sequence comprising the sequence set forth in any one of SEQ ID NOs 11-40.
21. A nucleic acid sequence encoding the protein sequence of claim 20; preferably, the nucleic acid sequence is an mRNA sequence.
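The enrichment criterion recited in claim 8 (abundance at least 2 times higher in tumor-related samples and a Wilcoxon rank-sum p-value of 0.05 or less) can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the p-value uses the two-sided normal approximation without tie correction, the fold change compares mean abundances (the claim does not say which summary statistic is used), and all function names and abundance values are hypothetical.

```python
import math
from statistics import mean


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation,
    average ranks for ties, no tie correction of the variance)."""
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n = len(pooled)
    rank_sum_x = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))


def is_enriched(tumor, control, min_fold=2.0, alpha=0.05):
    """Assumed reading of the criterion: mean fold change >= min_fold
    and rank-sum p-value <= alpha."""
    fold_ok = mean(tumor) >= min_fold * mean(control)
    return fold_ok and rank_sum_p(tumor, control) <= alpha


# Hypothetical relative abundances of one species in each group.
tumor_abundance = [8, 9, 10, 11, 12, 13]
control_abundance = [1, 2, 3, 4, 5, 6]
enriched = is_enriched(tumor_abundance, control_abundance)
```

For small groups or heavily tied data an exact test from a statistics library would usually be preferable to the normal approximation used here.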
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310487903.6A CN116525009A (en) | 2023-05-04 | 2023-05-04 | Method and device for determining tumor neoepitope from microorganism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310487903.6A CN116525009A (en) | 2023-05-04 | 2023-05-04 | Method and device for determining tumor neoepitope from microorganism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116525009A (en) | 2023-08-01 |
Family
ID=87389904
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310487903.6A Pending CN116525009A (en) | 2023-05-04 | 2023-05-04 | Method and device for determining tumor neoepitope from microorganism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116525009A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118053500A (en) * | 2024-02-23 | 2024-05-17 | 深圳微伴医学检验实验室 | Multi-layer neural network-based flora abundance prediction method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||