CN111627497B

CN111627497B - Method for extracting immunotherapeutic new antigen based on tumor specific transcription region assembled by new transcripts and application

Info

Publication number: CN111627497B
Application number: CN202010426721.4A
Authority: CN
Inventors: 万季; 刘鹏; 夏迪; 潘有东; 王奕; 宋麒
Original assignee: Shenzhen Neocura Biotechnology Corp
Current assignee: Shenzhen Neocura Biotechnology Corp
Priority date: 2020-05-19
Filing date: 2020-05-19
Publication date: 2023-06-13
Anticipated expiration: 2040-05-19
Also published as: CN111627497A

Abstract

The invention discloses a method, and its application, for extracting immunotherapeutic neoantigens from tumor-specific transcribed regions assembled from novel transcripts. The method comprises the following steps: S01, transcriptome deep sequencing data alignment; S02, transcript assembly; S03, transcript filtering; S04, translation initiation codon prediction; S05, transcript translation; S06, obtaining tumor-specific full-length novel transcript protein sequences; S07, obtaining novel transcript protein sequences with tumor-specific partial sequence differences; S08, merging protein fragments; S09, splitting protein fragments; S10, human leukocyte antigen genotyping; S11, peptide affinity prediction; and optionally, S12, mass spectrometry validation. The tumor neoantigens discovered by the method are not limited to annotated coding regions, so more neoantigens can be discovered; because they derive from non-mutated, highly expressed transcripts, they have a degree of universality across tumor types; and neoantigens confirmed by mass spectrometry are more likely to elicit an immune response.

Description

Method for extracting immunotherapeutic new antigen based on tumor specific transcription region assembled by new transcripts and application

Technical Field

The invention relates to the field of tumor immunotherapy, in particular to a method for extracting an immunotherapeutic new antigen based on a tumor specific transcription region assembled by new transcripts and application thereof.

Background

Tumor immunotherapy with neoantigen vaccines offers a marked therapeutic effect, applicability across many cancer types, and low toxicity and side effects, and has become an important member of the immunotherapy family. Its efficacy depends heavily on the choice of neoantigen polypeptides, which in turn depends heavily on the underlying data and prediction algorithms. In theory, neoantigens can arise from a variety of sources, but current clinical practice considers only neoantigens arising from DNA point mutations and insertions/deletions (indels). Although neoantigen vaccines based on point mutations and indels show good clinical results, studies have shown that neoantigens from other biological sources may be more immunogenic. Moreover, for some malignancies with a low mutation burden, relying on only these few sources yields too few predicted neoantigens, which limits the choice of neoantigen vaccine formulations. Developing additional sources of neoantigens is therefore important for both research and clinical application.

Disclosure of Invention

To address the problem of obtaining tumor neoantigens, the invention takes into account the large number of novel transcripts present in tumor genomes and provides a bioinformatics method for obtaining tumor-specific neoantigens.

In a first aspect, the present invention provides a method for extracting immunotherapeutic neoantigens based on tumor-specific transcribed regions of neotranscript assembly, comprising the steps of:

S01, transcriptome deep sequencing data alignment;

S02, transcript assembly;

S03, transcript filtering;

S04, translation initiation codon prediction;

S05, transcript translation;

S06, obtaining tumor-specific full-length novel transcript protein sequences;

S07, obtaining novel transcript protein sequences with tumor-specific partial sequence differences;

S08, merging protein fragments;

S09, splitting protein fragments;

S10, human leukocyte antigen genotyping;

S11, peptide affinity prediction;

and optionally, S12, mass spectrometry validation.

In some embodiments of the invention, S01 comprises the steps of:

S101, acquiring whole-transcriptome deep sequencing data, covering coding RNA and non-coding RNA, for a tumor sample and a normal control sample;

S102, filtering the whole-transcriptome deep sequencing data of the tumor sample and the normal control sample;

S103, constructing an index for a reference genome;

S104, aligning the filtered data obtained in S102 to the reference genome indexed in S103;

preferably, in S101, libraries are constructed for sequencing using an rRNA-depleted strand-specific library construction method and a small-fragment enrichment library construction method;

preferably, in S101, the sample data comprise a plurality of overlapping or partially overlapping short reads, and the sequencing data of the tumor sample and of the normal control sample are each not less than 30G;

preferably, in S102, short reads whose average base quality is below 20 or which contain sequencing adapter sequences are removed.

In some embodiments of the invention, in S02, full transcriptome deep sequencing data alignment results that have mapped short read sequences to a reference genome are assembled into transcripts.

In some embodiments of the invention, in S03, known human full-length transcripts and repetitive sequences present in the assembled transcripts are removed.

In some embodiments of the invention, S04 comprises the steps of:

S401, calculating the coding potential of the novel transcripts of the tumor sample and the normal control sample, and classifying them as protein-coding or non-protein-coding transcripts according to the strength of that coding potential;

S402, predicting the translation initiation codons of the protein-coding transcripts in the tumor sample and the normal control sample.

In some embodiments of the invention, in S05, the novel transcripts with coding capacity in tumor samples and normal control samples are translated according to predicted translational start codons to yield protein sequences.

In some embodiments of the invention, in S06, the protein sequences translated from the tumor sample are compared with those translated from the normal control sample: the tumor protein sequences are traversed, and those that cannot be found in the normal control are taken as protein sequences specific to the tumor sample.

In some embodiments of the invention, S07 comprises the steps of:

S701, filtering out the tumor-sample-specific proteins;

S702, aligning all remaining novel transcript proteins against all transcript protein sequences of the normal control sample, wherein sequences in the alignment result that are inconsistent with the normal control sample are defined as novel transcript protein sequences with tumor-specific partial sequence differences.

In some embodiments of the invention, in S08, the tumor specific full-length novel transcript protein sequence obtained in S06 and the tumor specific partial sequence difference novel transcript protein sequence obtained in S07 are combined and sequences less than 9 in length are filtered.

In some embodiments of the invention, in S09, the protein sequence obtained in S08 is split, preferably into k-mer residue peptide fragments of 9 to 12 amino acids in length.

In some embodiments of the invention, in S11, the affinity of the k-mer residue peptide fragment after S09 cleavage to the HLA molecule is predicted and a candidate neoantigen having an affinity greater than a threshold is selected.

In some embodiments of the present invention, in S12, the tumor sample is analyzed by mass spectrometry, the resulting data are imported into MaxQuant, the candidate neoantigens are added as the search database, and candidate peptides identified in the search are confirmed as neoantigens.

According to an aspect of the present invention, there is provided a computer-implemented bioinformatics method of exploring a tumor neoantigen based on a result of assembling a new transcript, comprising the steps performed by a processor of: acquiring full transcriptome sequencing data of a tumor sample and a normal control sample; assembling transcripts of tumor samples and normal control samples; obtaining new transcripts of the tumor sample and the normal control sample; predicting new transcript encoding protein sequences of tumor samples and normal control samples; obtaining a new transcript protein sequence and a protein fragment sequence specific to a tumor sample; calculating the binding affinity of the specific proteins and protein fragments of the tumor sample and the MHC molecules to obtain candidate tumor neoantigens; screening and verifying candidate new antigens based on mass spectrum data.

Preferably, the sample is a fresh tissue sample; alternatively, paraffin tissue samples may be selected.

A second aspect of the invention provides the use of the method of the first aspect for the preparation of a medicament or medical device for extracting immunotherapeutic neoantigens.

Compared with the prior art, the scheme of the invention has the following advantages:

1. The tumor neoantigens found by the method of the invention are not limited in origin to annotated coding regions, so more neoantigens can be found. Current common methods mainly rely on target-region capture sequencing or exome sequencing and obtain neoantigens by affinity prediction after somatic mutations have been identified, which essentially restricts the analysis to known coding regions of the genome.

2. The tumor neoantigens obtained by the invention derive from non-mutated, highly expressed transcripts (such as endogenous retroviral elements), and therefore have a degree of universality across tumor types.

3. Tumor neoantigens confirmed by mass spectrometry have the advantage that the peptides are shown to be actually expressed and are therefore more likely to elicit an immune response.

Drawings

FIG. 1 is a flow chart of the extraction of immunotherapeutic neoantigens according to one embodiment of the invention.

Detailed Description

Other advantages and effects of the invention will be readily apparent to those skilled in the art from the following description of embodiments given by way of specific examples. The invention may also be practiced or applied in other embodiments, and the details of this description may be modified or varied without departing from the spirit and scope of the invention.

In order that those skilled in the art may better understand the invention, embodiments of the invention are described in detail below with reference to the accompanying drawing. The terms "first," "second," "again," "then," "next," and the like used in the specific embodiments herein are not intended to limit the order of the steps.

FIG. 1 is a flow chart of an embodiment of the invention for extracting immunotherapeutic neoantigens, the method comprising the following steps performed by a processor:

S01, transcriptome deep sequencing data alignment

Specifically, libraries are first constructed and sequenced using an rRNA-depleted strand-specific library construction method and a small-fragment enrichment library construction method, yielding whole-transcriptome deep sequencing data, covering coding RNA and non-coding RNA, for the tumor sample and the normal control sample. The sample data comprise a plurality of overlapping or partially overlapping short reads, the degree of overlap depending on the sequencing depth, and no less than 30G of sequencing data is required for each of the tumor sample and the normal control sample.

Second, the whole-transcriptome deep sequencing data of the tumor sample and the normal control sample are filtered to remove short reads whose average base quality is below 20 or which contain sequencing adapter sequences, which improves the accuracy and efficiency of subsequent analysis.
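
The filtering rule of this step can be illustrated with a minimal Python sketch. It assumes Phred+33 quality strings and a single adapter sequence passed in by the caller; the function names are hypothetical, and the sketch is not the Trimmomatic implementation used in the embodiment.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Average Phred score of a FASTQ quality string (Phred+33 is assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def keep_read(sequence: str, quality: str, adapter: str,
              min_mean_quality: float = 20.0) -> bool:
    """Return True when a read passes both filters described for this step."""
    if mean_phred(quality) < min_mean_quality:
        return False  # average base quality below 20
    if adapter and adapter in sequence:
        return False  # read contains the sequencing adapter
    return True
```

Exact substring matching is used here only for brevity; a production adapter trimmer tolerates mismatches and trims rather than discards.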

Third, an index is built for the reference genome. The reference genome is the base sequence of each human chromosome, typically in FASTA format, and can be downloaded from UCSC (version hg38/GRCh38). The filtered data are then aligned to the reference genome to map the short reads onto it. Specifically, HISAT2 can be used to align the filtered data of the tumor sample and the normal control sample.

S02, transcript assembly

After alignment of the whole-transcriptome deep sequencing data, the short reads have been mapped to the reference genome, and the alignment results can be assembled into transcripts using reference-guided assembly while also allowing de novo assembly. Specifically, StringTie can be used to assemble the transcripts of the tumor sample and the normal control sample.

S03, transcript filtering

The assembled transcripts include a large number of known human full-length transcripts. Because these transcripts are expressed in normal tissue, removing them speeds up subsequent analysis. Specifically, known transcripts in the tumor sample and the normal control sample are filtered out according to the transcript numbering in the StringTie assembly results.

In addition, repetitive sequences make up about 55% of the human genome. Because of the large number of simple repeats, short reads are often aligned to incorrect genomic locations, which affects alignment-based transcript assembly, so repetitive sequences need to be removed. Specifically, the transcript sequences are evaluated with RepeatMasker, and transcripts containing repetitive sequences are then removed from the tumor sample and the normal control sample.

S04, translation initiation codon prediction

Specifically, CPAT is first used to calculate the coding potential of the novel transcripts of the tumor sample and the normal control sample, and the transcripts are classified as protein-coding or non-protein-coding according to the strength of that coding potential; second, the translation initiation codons of the protein-coding transcripts in the tumor sample and the normal control sample are predicted.
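
The classification by coding potential can be sketched as a simple threshold on a per-transcript coding probability. The default cutoff of 0.364 is an assumption taken from the value commonly recommended for human data in the CPAT documentation; it is not stated in this description, and the function name is hypothetical.

```python
def split_by_coding_probability(probabilities, cutoff=0.364):
    """Split transcript IDs into (coding, noncoding) by coding probability.

    `probabilities` maps transcript ID -> coding probability in [0, 1].
    The cutoff is an assumed value, not one given in the description.
    """
    coding, noncoding = [], []
    for transcript_id, probability in probabilities.items():
        if probability >= cutoff:
            coding.append(transcript_id)
        else:
            noncoding.append(transcript_id)
    return coding, noncoding
```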

S05, transcript translation

Specifically, in-house software is used to translate the novel transcripts with coding potential in the tumor sample and the normal control sample, starting from the predicted translation initiation codons, to obtain protein sequences. Alternatively, the novel transcripts may be translated with ORFfinder or gelator to obtain the protein sequences.
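
Since the in-house translation software is not disclosed, the following is only a minimal sketch of the idea: translate a transcript from a given start-codon position to the first in-frame stop codon using the standard genetic code. The function name and the 0-based `start` argument are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_from(transcript: str, start: int) -> str:
    """Translate from the 0-based start-codon position to the first stop."""
    seq = transcript.upper().replace("U", "T")
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X marks an ambiguous codon
        if aa == "*":
            break  # in-frame stop codon ends translation
        protein.append(aa)
    return "".join(protein)
```

For example, `translate_from("GGATGGCCTTTTAAGG", 2)` reads the codons ATG, GCC, TTT and stops at TAA.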

S06, obtaining the tumor specific full-length new transcript protein sequence

Tumor-specific protein sequences are proteins that are translated and expressed only in the tumor sample and not in the normal control sample. Specifically, in-house software is used to compare the protein sequences translated from the tumor sample with those translated from the normal control sample: the tumor protein sequences are traversed, and those that cannot be found in the normal control are taken as protein sequences specific to the tumor sample.
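
A minimal sketch of this comparison, assuming that "cannot be found in the normal control" means no exact full-length match; the in-house software is not disclosed, so the function name and the matching rule are assumptions.

```python
def tumor_specific_proteins(tumor, normal):
    """Return tumor proteins whose full sequence is absent from the normal set.

    Both arguments map a protein ID to its amino-acid sequence.
    Exact full-length matching is an assumed criterion.
    """
    normal_sequences = set(normal.values())
    return {
        protein_id: sequence
        for protein_id, sequence in tumor.items()
        if sequence not in normal_sequences
    }
```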

S07, obtaining a new transcript protein sequence with a tumor specific partial sequence difference

In addition to the full-length novel transcript protein sequences obtained in S06, the tumor sample also contains novel transcripts that differ only partially in sequence from transcripts of the normal control sample. Such transcripts may result from different splicing patterns, insertion/deletion variants, and the like. After translation, this typically appears as a portion of the protein sequence that is present only in the tumor sample, and such partially differing protein sequences can also give rise to neoantigens. Specifically, the tumor-sample-specific proteins are first filtered out, and all remaining novel transcript proteins are then aligned, using in-house software, against all transcript protein sequences of the normal control sample. Sequences in the alignment result that are inconsistent with the normal control sample are defined as novel transcript protein sequences with tumor-specific partial sequence differences.
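
The idea of extracting the partially differing regions can be sketched with Python's standard `difflib` module standing in for the undisclosed in-house aligner. The pairing of one tumor protein with one normal protein, the function name, and the `flank` parameter (extra residues kept on each side so that peptides spanning the junction are retained) are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def differing_segments(tumor_seq, normal_seq, flank=8):
    """Return tumor sub-sequences that do not match the normal sequence.

    Each non-matching region is extended by `flank` residues on both sides.
    """
    matcher = SequenceMatcher(None, normal_seq, tumor_seq, autojunk=False)
    segments = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical to the normal control, not tumor specific
        start = max(0, j1 - flank)
        end = min(len(tumor_seq), j2 + flank)
        segments.append(tumor_seq[start:end])
    return segments
```

With `flank=8`, a 9-residue window can still cover a single altered residue together with its unchanged neighbors.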

S08, merging protein fragments

Specifically, the tumor-specific full-length novel transcript protein sequences obtained in S06 and the novel transcript protein sequences with tumor-specific partial sequence differences obtained in S07 are merged, and sequences shorter than 9 residues are filtered out.
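
A minimal sketch of the merge-and-filter step. Removing duplicate sequences while preserving order is an added assumption, and the function name is hypothetical.

```python
def merge_fragments(full_length, partial, min_length=9):
    """Merge the S06 and S07 sequences and drop those shorter than min_length.

    Duplicates are removed (an assumption) and input order is preserved.
    """
    merged, seen = [], set()
    for sequence in list(full_length) + list(partial):
        if len(sequence) >= min_length and sequence not in seen:
            seen.add(sequence)
            merged.append(sequence)
    return merged
```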

S09, protein fragment segmentation

Specifically, the protein sequences obtained in the previous step are split into shorter k-mers. The k-mers of a string are all of its substrings of length k: for an input protein sequence, a sliding window with a step size of 1 extracts sequences of fixed length k one after another, starting from the first amino acid residue, and these sequences are the k-mers. More specifically, in-house software is used to split the protein sequences obtained in S08 into k-mers of 9 to 12 amino acids in length.
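
The sliding-window splitting described above can be sketched as follows; the function names are hypothetical and the in-house software itself is not disclosed.

```python
def kmers(protein, k):
    """All substrings of length k, taken with a sliding window of step 1."""
    return [protein[i:i + k] for i in range(len(protein) - k + 1)]


def peptide_fragments(protein, k_min=9, k_max=12):
    """Collect the k-mers of a protein for every k from k_min to k_max."""
    fragments = []
    for k in range(k_min, k_max + 1):
        fragments.extend(kmers(protein, k))
    return fragments
```

A protein of length n yields n - k + 1 fragments for each k, so a 12-residue sequence gives 4 + 3 + 2 + 1 = 10 fragments over k = 9 to 12.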

S10, genotyping human leukocyte antigen

The human leukocyte antigen (HLA) genes lie in a polymorphic region on the short arm of chromosome 6 that participates in the immune response, and form the gene complex with the highest allelic polymorphism in the human genome. The MHC class I molecules they encode mainly mediate antigen recognition and killing by CD8+ T cells, while the class II molecules mainly engage CD4+ T cells, thereby initiating the immune response. Different HLA subtypes may bind the same polypeptide with different affinities, so determining the HLA subtype of the sample is a prerequisite for screening the binding of HLA to candidate neoantigens. Specifically, HLA-LA is used to genotype the human leukocyte antigens of the normal control sample.

S11, peptide fragment affinity prediction

Mutant proteins expressed by tumor cells are not expressed by normal cells. These abnormal protein sequences are processed into short peptides by the proteasome, bound by human leukocyte antigens, presented on the cell surface, and recognized by T cells as foreign antigens. The affinity between a specific HLA subtype and a polypeptide is predicted algorithmically, and peptides with strong affinity for HLA molecules are selected. Specifically, NetMHCpan 4.0 is used to predict the affinity of the k-mer peptides obtained in S09 for the HLA molecules, and peptides whose affinity is stronger than a threshold (typically a predicted IC50 below 500 nM) are selected as candidate neoantigens.
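
The selection step can be sketched as a filter over already-parsed predictions. The sketch assumes that affinity is reported as an IC50 value in nM, where a lower value means stronger binding, so the 500 nM threshold acts as an upper bound; the tuple layout and the function name are hypothetical, and parsing of the predictor's output is not shown.

```python
def select_candidates(predictions, max_ic50_nm=500.0):
    """Keep (peptide, allele, ic50_nm) tuples with IC50 below the threshold.

    Results are sorted from strongest (lowest IC50) to weakest binder.
    """
    kept = [p for p in predictions if p[2] < max_ic50_nm]
    return sorted(kept, key=lambda p: p[2])
```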

S12, mass spectrometry validation

Specifically, the tumor sample is analyzed by mass spectrometry, the resulting data are imported into MaxQuant, the candidate neoantigens are added as the search database, and candidate peptides identified in the search are confirmed as neoantigens.
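
Once the search has produced a list of identified peptide sequences, the confirmation amounts to an intersection with the candidate list. The sketch below assumes the identified peptides have already been read from the search results; the function name is hypothetical.

```python
def confirmed_neoantigens(candidates, identified):
    """Return the candidate peptides that were identified by mass spectrometry.

    The order of `candidates` is preserved in the result.
    """
    identified_set = set(identified)
    return [peptide for peptide in candidates if peptide in identified_set]
```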

The specific parameters of the software used in the invention are as follows:

Filtering of raw data is performed using Trimmomatic, an example command of which is:

wherein trimmomatic-0.36.jar is the Trimmomatic executable file; PE indicates paired-end sequencing; phred33 indicates the base quality encoding; sample_1.fastq.gz and sample_2.fastq.gz are the input raw data; sample.clear.R1.fq.gz, sample.unpaired.R1.fq.gz, sample.clear.R2.fq.gz and sample.unpaired.R2.fq.gz are the output data; ILLUMINACLIP:adapter.fa:2:30:10:8:true specifies removal of sequencing adapter sequences, the parameters that follow being, in order, the adapter sequence file, the maximum number of mismatches allowed, the match threshold in palindrome mode, and the match threshold in simple mode; LEADING indicates that bases with a quality below 20 are removed from the start of the read; TRAILING indicates that bases with a quality below 20 are removed from the end of the read; MINLEN indicates the minimum read length.

The genome index is constructed using HISAT2: the splice sites and the exon sequences in the genome annotation file are first extracted separately, and the genome index is then built, with example commands:

where hg38.fa is the human genome sequence, the GENCODE GTF file is the genome annotation file, and extract_splice_sites.py, extract_exons.py and hisat2-build are programs included in the HISAT2 package.

Sequences are aligned using HISAT2, an example command of which is:

where hg38 is the reference genome index that has been constructed, and the alignment results are sorted with SAMtools after alignment. samtools view is the view command of SAMtools, used here to filter the results further.

Transcripts are assembled using StringTie, an example command of which is:

wherein the GENCODE GTF file is the human genome annotation file.

Transcripts containing repetitive sequences are removed using RepeatMasker, an example command of which is:

wherein the assembled transcript sequences are first extracted using bedtools, and the repetitive sequences in them are then marked using RepeatMasker.

The coding potential of the transcripts is predicted using CPAT, an example command of which is:

wherein the -d and -x parameters specify the prebuilt model files supplied with the software, and -o is the prediction result file.

The transcripts are translated using in-house software, an example command of which is:

The differential protein sequences are found using in-house software, an example command of which is:

wherein -t is the protein sequences of the tumor sample, -n is the protein sequences of the normal sample, -out1 is the protein sequences expressed only in the tumor sample, and -out2 is the differing partial sequences of proteins that are expressed in both the normal sample and the tumor sample but with different sequences.

HLA genotyping was performed using HLA-LA, an example command of which is:

wherein the graph PRG_MHC_GRCh38_withIMGT denotes the population reference graph index, which can either be built with the HLA-LA program itself or be downloaded from the download page provided by the program.

Peptide affinity is predicted using NetMHCpan 4.0, an example command of which is:

wherein -BA indicates that binding affinity prediction is to be performed, -l indicates the peptide length, -a indicates the HLA genotype, -inptype indicates the type of the input, and -xls and -xlsfile together specify the output file.

Mass spectrometry validation is performed using MaxQuant: after the mass spectrometry data of the sample are imported, the digestion mode is set to No digestion and the Global FASTA file is set to the FASTA file of candidate neoantigens.
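
The candidate neoantigen FASTA file used as the search database can be produced from the peptide list with a few lines of Python. The header naming scheme (`candidate_1`, `candidate_2`, and so on) is an assumption for illustration; the description does not specify the header format.

```python
def candidates_to_fasta(peptides, prefix="candidate"):
    """Render peptides as FASTA text, one numbered record per peptide."""
    return "".join(
        f">{prefix}_{index}\n{peptide}\n"
        for index, peptide in enumerate(peptides, start=1)
    )
```

The returned string can be written to a file and selected in the search software as the sequence database.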

While the preferred embodiments and examples of the present invention have been described in detail, the present invention is not limited to the above-described embodiments and examples, and various changes may be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims

1. A method for extracting immunotherapeutic neoantigens based on tumor specific transcribed regions of neotranscript assembly comprising the steps of:

s01, transcriptome deep sequencing data comparison;

s02, transcript assembly;

s03, filtering transcripts;

s04, predicting a translation initiation codon;

s05, translating transcripts;

s06, obtaining a tumor specific full-length new transcript protein sequence;

s08, combining protein fragments;

s09, dividing protein fragments;

s10, genotyping human leukocyte antigen;

s11, predicting peptide fragment affinity;

s01, comprising the following steps:

s103, constructing an index for a reference genome;

s104, comparing the filtered data obtained in the S102 with the reference genome obtained in the S103;

s02, assembling complete transcriptome deep sequencing data comparison results with short reading sequences positioned to a reference genome into transcripts;

s06, comparing the protein sequences obtained by translation of the tumor sample and the normal control sample, traversing the protein sequences of the tumor sample, and obtaining a specific protein sequence of the tumor sample which cannot be searched in the normal control;

s07, the method comprises the steps of:

s701, filtering a tumor sample specific protein;

s702, comparing all the filtered new transcription proteins with all the transcription protein sequences corresponding to the normal control sample, wherein the sequence inconsistent with the normal control sample in the comparison result is defined as a new transcription protein sequence with a tumor specific partial sequence difference;

in S08, combining the tumor specific full-length novel transcript protein sequence obtained in S06 and the tumor specific partial sequence difference novel transcript protein sequence obtained in S07, and filtering the sequences with the length less than 9.

2. The method of claim 1, further comprising S12, mass spectrometry validation.

3. The method of claim 1, wherein in S101, libraries are constructed and sequenced using an rRNA-depleted strand-specific library construction method and a small-fragment enrichment library construction method.

4. The method of claim 1, wherein in S101, the sample data comprises a plurality of overlapping or partially overlapping short read sequences, and the tumor sample and the normal control sample have no less than 30G of sequencing data.

5. The method of claim 1, wherein in S102, short reads whose average base quality is below 20 or which contain sequencing adapter sequences are removed.

6. The method according to any one of claims 1 to 5, wherein in S03, known human full-length transcripts and repetitive sequences present in the assembled transcripts are removed.

7. The method according to any one of claims 1-5, characterized in that in S04, the following steps are included:

8. The method according to any one of claims 1 to 5, wherein in S05, the novel transcripts having the ability to encode in tumor samples and normal control samples are translated according to predicted translational start codons to give protein sequences.

9. The method according to any one of claims 1 to 5, wherein in S09 the protein sequence obtained in S08 is split;

and/or, in S11, predicting the affinity of the k-mer residue peptide segment after S09 segmentation and HLA molecules, and selecting the k-mer residue peptide segment with the affinity being greater than a threshold value as a candidate new antigen;

and/or S12, carrying out mass spectrometry experimental analysis on the tumor sample, importing the generated data into MaxQuant software, adding candidate neoantigens as a search library, and finally, successfully identifying the obtained peptide fragment as the neoantigen.

10. The method according to claim 9, wherein in S09 the protein sequence obtained in S08 is split into k-mer residue peptide fragments of 9 to 12 amino acids in length.

11. Use of a method according to any one of claims 1-10 for the preparation of a medicament or medical device for extracting immunotherapeutic neoantigens.