WO2024051097A1 - Neoantigen identification method and device for tumor-specific circular rnas, apparatus and medium - Google Patents

Neoantigen identification method and device for tumor-specific circular RNAs, apparatus and medium

Info

Publication number
WO2024051097A1
Authority
WO
WIPO (PCT)
Prior art keywords
tumor
candidate
neoantigen
specific
reads
Application number
PCT/CN2023/077356
Other languages
French (fr)
Chinese (zh)
Inventor
万季
汪健
赵钊
潘有东
王弈
Original Assignee
深圳新合睿恩生物医疗科技有限公司
深圳市新合生物医疗科技有限公司
北京新合睿恩生物医疗科技有限公司
Application filed by 深圳新合睿恩生物医疗科技有限公司, 深圳市新合生物医疗科技有限公司, 北京新合睿恩生物医疗科技有限公司
Publication of WO2024051097A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00: ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/10: Design of libraries
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • the present invention claims priority to the Chinese patent application filed with the China Patent Office on September 6, 2022, with application number 2022110862377 and entitled "Neoantigen identification method and device, equipment and medium for tumor-specific circular RNA", the entire contents of which are incorporated herein by reference.
  • the invention belongs to the technical field of bioinformatics, and specifically relates to a neoantigen identification method and device, electronic equipment and storage medium for tumor-specific circular RNA.
  • RNAs that exist stably in the body as covalently closed circular structures, most of which are produced by back-splicing, are called circular RNAs (circRNAs).
  • because circRNA has a covalently closed, stable structure, its degradation rate is lower than that of mRNA.
  • the neoantigen it encodes therefore persists in tumor cells for a long time and is more likely to be recognized by T cells.
  • the object of the present invention is to provide a neoantigen identification method and device, equipment, and storage medium for tumor-specific circular RNA, which can improve the accuracy of circRNA identification, thereby making circRNA-derived neoantigens more immunogenic.
  • a first aspect of the present invention discloses a method for identifying neoantigens of tumor-specific circular RNA, which includes: obtaining first sequencing data of tumor tissue samples and second sequencing data of adjacent cancer tissue samples;
  • the specified number of neoantigen candidate peptides ranked first are determined as the neoantigen target peptides.
  • a second aspect of the present invention discloses a neoantigen identification device for tumor-specific circular RNA, which includes: a data acquisition unit, used to acquire first sequencing data of tumor tissue samples and second sequencing data of adjacent cancer tissue samples;
  • a detection unit configured to detect circRNAs on the first sequencing data and the second sequencing data respectively, to obtain a plurality of candidate circRNAs
  • a pseudo-reference unit used to construct a pseudo-reference sequence of the first specified length upstream and downstream of the reverse BSJ site according to its sequence order for each of the candidate circular RNAs;
  • an alignment unit used to determine the reads in the first sequencing data that have the BSJ site, and whose sequences of a second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence, as first candidate reads; and to determine the reads in the second sequencing data that have the BSJ site, and whose sequences of the second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence, as second candidate reads; wherein the second specified length is less than or equal to the first specified length;
  • a first determination unit configured to determine the candidate circular RNA supported by the first candidate reads as the first circular RNA detected from the tumor tissue sample, and to determine the candidate circular RNA supported by the second candidate reads as the second circular RNA detected from the adjacent cancer tissue sample;
  • a filtering unit configured to filter out the first circular RNA that is the same as the second circular RNA from the plurality of first circular RNAs to obtain a plurality of tumor-specific circular RNAs
  • a translation prediction unit for predicting the translation ability score of each of the tumor-specific circular RNAs
  • a peptide acquisition unit used to acquire a plurality of neoantigen candidate peptides derived from a plurality of tumor-specific circular RNAs
  • a scoring unit configured to score and rank multiple neoantigen candidate peptide segments for immunogenicity according to the translation ability score of the tumor-specific circular RNA
  • the second determination unit is used to determine a specified number of neoantigen candidate peptides that are ranked first as neoantigen target peptides.
  • a third aspect of the present invention discloses an electronic device, including a memory storing executable program code and a processor coupled to the memory; the processor calls the executable program code stored in the memory to execute the neoantigen identification method of tumor-specific circular RNA disclosed in the first aspect.
  • a fourth aspect of the present invention discloses a computer-readable storage medium that stores a computer program, wherein the computer program causes the computer to execute the neoantigen identification method of tumor-specific circular RNA disclosed in the first aspect.
  • the beneficial effect of the present invention is that the provided neoantigen identification method, device, equipment, and storage medium for tumor-specific circular RNAs construct pseudo-reference sequences for re-alignment of the detected candidate circular RNAs.
  • the sequencing data of the tumor tissue samples and of the adjacent cancer tissue samples are each aligned to the pseudo-reference sequences, and the aligned first candidate reads and second candidate reads are extracted respectively. The candidate circRNA supported by the first candidate reads is determined as the first circRNA detected from the tumor tissue sample, and the candidate circRNA supported by the second candidate reads is determined as the second circRNA detected from the paracancerous tissue sample.
  • by combining the two, normal circRNAs present in both tumor tissue samples and paracancerous tissue samples are filtered out to obtain tumor-specific circRNAs. The translation ability scores of the tumor-specific circRNAs are then predicted, the neoantigen candidate peptides derived from them are scored and ranked for immunogenicity, and the top-ranked ones are finally determined as neoantigen target peptides. This can increase the sources of tumor neoantigens, broaden the scope of neoantigen screening, and improve the accuracy of circRNA identification, thereby making circRNA-derived neoantigens more immunogenic.
  • Figure 1 is a flow chart of a method for neoantigen identification of tumor-specific circular RNA
  • Figure 2 is a schematic structural diagram of the reverse 100bp pseudo-reference sequence of the circular RNA
  • Figure 3 is a schematic structural diagram of two different circular RNA sequences formed by the same BSJ site
  • Figure 4 is a schematic structural diagram of a tumor-specific circular RNA neoantigen identification device
  • FIG. 5 is a schematic structural diagram of an electronic device disclosed in an embodiment of the present invention.
  • 401, data acquisition unit; 402, detection unit; 403, pseudo-reference unit; 404, comparison unit; 405, first determination unit; 406, filtering unit; 407, translation prediction unit; 408, peptide acquisition unit; 409, scoring unit; 410, second determination unit; 501, memory; 502, processor.
  • the embodiment of the present invention discloses a method for identifying neoantigens of tumor-specific circular RNA.
  • the method includes the following steps S1 to S10:
  • the sample library is constructed using the A-tailing + RNase R (Li+ buffer) method.
  • this method digests linear RNA to the greatest extent, which is not only conducive to the accurate detection of circular RNA (circRNA) but is also very important for determining the full-length sequence of circRNA.
  • the samples include tumor tissue samples and adjacent cancer tissue samples.
  • the first sequencing data, the second sequencing data, and the third sequencing data are all second-generation high-throughput circular RNA sequencing data, and each can be high-quality sequencing data obtained after quality control and filtering of the raw sequencing data that came off the sequencer.
  • in step S1, the same quality control filtering is performed on the raw off-machine sequencing (circleSeq) data of the tumor tissue samples and of the adjacent cancer tissue samples, and on the raw polysome-profiling-based sequencing (polysome profiling RNASeq) data of the tumor tissue samples, to obtain the first sequencing data, the second sequencing data and the third sequencing data, respectively.
  • the quality control filtering process may include the following steps S101 to S102:
  • the adapter removal operation is performed on the original sequencing data to filter out adapter (adapter) sequences; at the same time, low-quality reads are filtered out.
  • a window with a length of 10 and a step size of 1 is slid to the right along each raw read, and the average sequencing quality of the 10 bases in the window is calculated at every position. If the average sequencing quality is <15, the window sequence and the sequence to its right are judged to be a low-quality region and are deleted from the read. The length of the read retained after this first deletion is then checked.
  • the sequence length of the reads is required to be >20 bp.
  • if the sequence length of the read retained after the first deletion is ≤20 bp, the read is considered a low-quality read and is deleted entirely. The reads remaining after this deletion are the filtered reads that are ultimately retained.
  • the filtered reads are aligned to the human ribosomal RNA sequences, the aligned ribosomal reads are filtered out, and the unaligned reads are retained as high-quality sequencing data for subsequent analysis, which prevents ribosomal RNA contamination from affecting the analysis results.
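  • the sliding-window trimming described in the quality control step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `trim_read` and the representation of quality values as a list of integers are assumptions, and the garbled comparison signs in the text are read as "mean quality below 15" and "retained length greater than 20 bp".

```python
from typing import List, Optional

WINDOW = 10      # window length stated in the text
MIN_MEAN_Q = 15  # assumed reading: a window with mean quality below 15 is low quality
MIN_LEN = 20     # retained reads must be longer than 20 bp


def trim_read(seq: str, quals: List[int]) -> Optional[str]:
    """Cut a read at the first low-quality window; return None if too short.

    `quals` holds one numeric quality value per base (hypothetical input format).
    """
    cut = len(seq)
    # Slide the window to the right with a step size of 1.
    for start in range(0, len(seq) - WINDOW + 1):
        window = quals[start:start + WINDOW]
        if sum(window) / WINDOW < MIN_MEAN_Q:
            # The window and everything to its right are removed.
            cut = start
            break
    trimmed = seq[:cut]
    # Discard the whole read if what is left is not longer than 20 bp.
    return trimmed if len(trimmed) > MIN_LEN else None
```

  • a read of 40 high-quality bases passes through unchanged, while a read whose quality collapses early is discarded because too little of it survives trimming.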
  • in step S2, the same circular RNA detection processing is performed on the first sequencing data of the tumor tissue sample and on the second sequencing data of the adjacent cancer tissue sample.
  • the tumor tissue circleSeq sequencing data (i.e., the first sequencing data) is taken as an example; tumor.circ.R1.fq and tumor.circ.R2.fq used below are the names of the sequencing data files.
  • the standard algorithm processes of detection algorithms CIRCexplorer2 and CIRI2 are respectively used to detect circRNA in tumor tissues.
  • the CIRCexplorer2 standard algorithm process includes three steps: Align, Parse, and Annotate, which are summarized as follows:
  • the alignment software STAR is used to align the first sequencing data.
  • the example command is:
  • --chimSegmentMin specifies that the number of bases aligned at one end of the chimeric alignment is at least 10bp;
  • --runThreadN specifies the number of running threads
  • --genomeDir specifies the reference index file path used
  • --readFilesIn specifies the input sequencing data.
  • -t indicates the comparison tool used in the Align step
  • -b indicates the output file name
  • Chimeric.out.junction indicates the input file name
  • the back splicing site is annotated based on the provided gene annotation file.
  • the example command is:
  • -r specifies the gene annotation file
  • -g specifies the human reference genome file
  • -b specifies the junction information output by the Parse step
  • -o specifies the output file name.
  • --in specifies the sam format file, which is generated by comparing circRNA sequencing data with the bwa-mem tool; --ref_file specifies the human reference genome file; --anno specifies the gene annotation file; --out specifies the output result file.
  • the BSJ site refers to the backsplicing junction (BSJ).
  • the first specified length can be set to 100 bp. Step S3 is then specifically: for each candidate circRNA, construct, according to the sequence order of the circRNA, a pseudo-reference sequence covering 100 bp upstream and downstream of the back-splicing junction (BSJ) site, together with its index information.
  • the reverse pseudo-reference sequence constructed for the candidate circRNA is shown in Figure 2.
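  • one way to picture the pseudo-reference construction is the sketch below. It assumes the candidate circRNA is available as a linearized sequence running from the back-splice acceptor to the back-splice donor, so that the junction joins the end of the sequence to its start; the function name `build_pseudo_reference` and this input convention are illustrative assumptions, not taken from the patent.

```python
from typing import Tuple


def build_pseudo_reference(circ_seq: str, flank: int = 100) -> Tuple[str, int]:
    """Return (pseudo_reference, bsj_index) for a linearized circRNA sequence.

    The pseudo-reference is the last `flank` bases (upstream of the BSJ)
    followed by the first `flank` bases (downstream of the BSJ).
    `bsj_index` is the position in the pseudo-reference where the junction lies.
    """
    if not circ_seq or flank < 1:
        raise ValueError("need a non-empty sequence and a positive flank")
    # A circRNA shorter than the flank contributes its whole length on each side.
    flank = min(flank, len(circ_seq))
    upstream = circ_seq[-flank:]
    downstream = circ_seq[:flank]
    return upstream + downstream, len(upstream)
```

  • with a 3 bp flank, the toy sequence `AAACCCGGGTTT` gives the pseudo-reference `TTTAAA` with the junction at index 3.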
  • S4. Determine the reads in the first sequencing data that have the BSJ site, and whose sequences of a second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence, as first candidate reads; and determine the reads in the second sequencing data that have the BSJ site, and whose sequences of the second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence, as second candidate reads.
  • the second specified length is less than or equal to the first specified length.
  • the second specified length may be set to 3 bp, 5 bp, or 10 bp, etc.; in some other possible embodiments, the second specified length may also be set equal to the first specified length, but it should not be greater than the first specified length, which is 100 bp.
  • the second specified length is set to 3 bp as an example.
  • in step S4, all reads supporting the circular RNA BSJ site are extracted from the first sequencing data and the second sequencing data and, according to the index information, re-aligned to the pseudo-reference sequence constructed in step S3. The 3 bp sequences upstream and downstream of the BSJ site in a read are required to match the pseudo-reference sequence completely. That is, reads in the first sequencing data and the second sequencing data that have a BSJ site and whose 3 bp sequences upstream and downstream of the BSJ site completely match the pseudo-reference sequence are determined as candidate circular RNA reads (i.e., the first candidate reads and the second candidate reads).
  • the screened candidate circRNA reads can further be aligned to the normal human genome and transcriptome, and the candidate circRNA reads that align normally are filtered out to obtain the real candidate circRNA reads.
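  • the flank-match criterion of step S4 can be illustrated with the sketch below. It makes a simplifying assumption that is not in the patent: the read is placed on the pseudo-reference without gaps at a known offset `ref_start` (a real aligner would report this placement). The function name `supports_bsj` and its parameters are hypothetical.

```python
def supports_bsj(read: str, ref_start: int, pseudo_ref: str, bsj: int, k: int = 3) -> bool:
    """True if the read covers k bases on each side of the BSJ and matches them exactly.

    read       -- read sequence, placed ungapped on the pseudo-reference
    ref_start  -- pseudo-reference position of the read's first base (assumed known)
    pseudo_ref -- pseudo-reference sequence around the junction
    bsj        -- index in pseudo_ref where the junction lies
    k          -- second specified length (3 bp in the described embodiment)
    """
    lo, hi = bsj - k, bsj + k
    if lo < 0 or hi > len(pseudo_ref):
        return False  # pseudo-reference too short for this flank length
    if ref_start > lo or ref_start + len(read) < hi:
        return False  # read does not span the junction with k bases on both sides
    # Mismatches outside the flanking window are tolerated by this criterion.
    return read[lo - ref_start:hi - ref_start] == pseudo_ref[lo:hi]
```

  • only the 2k bases around the junction are compared here, which mirrors the requirement that the 3 bp upstream and downstream of the BSJ site match completely.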
  • S5. Determine the candidate circRNA supported by the first candidate reads as the first circRNA detected from the tumor tissue sample; and determine the candidate circRNA supported by the second candidate reads as the second circRNA detected from the adjacent cancer tissue sample.
  • in step S5, among the multiple candidate circRNAs, those supported by reads whose 3 bp sequences upstream and downstream of the BSJ site completely match the pseudo-reference sequence are taken as the circular RNAs finally detected from the tumor tissue samples and the adjacent cancer tissue samples (i.e., the first circular RNA and the second circular RNA).
  • tumor-specific circular RNA refers to circular RNA that exists only in tumor cells. The circRNAs that are also detected in adjacent tissue samples are circRNAs that normally exist in cells and therefore need to be filtered out; that is, the first circRNAs that are the same as a second circRNA are filtered out of the multiple first circRNAs.
  • the circRNA database preferably uses one or more combinations of circBase, CIRCpedia v2 and CircAtlas.
  • step S6 may specifically include: filtering out, from the plurality of first circRNAs, the first circRNAs that are identical to the second circRNAs, and filtering out, from the plurality of first circRNAs, the first circRNAs that are the same as normal-cell circRNAs in the specified circRNA database; the retained first circRNAs are the final tumor-specific circRNAs.
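  • the filtering in step S6 amounts to a set difference. The sketch below assumes, purely for illustration, that each circRNA is identified by a `(chromosome, start, end, strand)` tuple of its back-splice coordinates; the patent does not state how circRNA identity is compared, and the function name `tumor_specific` is hypothetical.

```python
from typing import Iterable, List, Tuple

CircId = Tuple[str, int, int, str]  # assumed key: (chromosome, start, end, strand)


def tumor_specific(first: Iterable[CircId],
                   second: Iterable[CircId],
                   database: Iterable[CircId]) -> List[CircId]:
    """Keep tumor circRNAs absent from adjacent tissue and from the normal database."""
    normal = set(second) | set(database)
    # Preserve the input order of the tumor-derived circRNAs.
    return [circ for circ in first if circ not in normal]
```

  • in practice the `database` argument would hold circRNAs collected from resources such as circBase, CIRCpedia v2 or CircAtlas, converted to the same coordinate convention.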
  • the translation ability of the tumor-specific circRNA can be further evaluated, that is, the translation potential of the tumor-specific circRNA can be determined.
  • the specific implementation includes the following steps S701 to S704:
  • since the third sequencing data is already high-quality sequencing data after the quality control filtering of steps S101 to S102, it can be aligned directly to the pseudo-reference sequences of the tumor-specific circular RNAs. If the third sequencing data contains reads that have a BSJ site and whose sequences of the second specified length (3 bp) upstream and downstream of the BSJ site match the pseudo-reference sequence of a tumor-specific circular RNA, those reads are determined to be third candidate reads spanning the BSJ site.
  • polysome profiling extracts the RNA that is bound to ribosomes and being translated in the cell and, combined with high-throughput RNA sequencing, can accurately identify the RNA sequences being translated. That is, if third candidate reads align across the BSJ site of a tumor-specific circular RNA, that circular RNA is bound to ribosomes and being translated, and its translation potential can be considered the greatest. Its translation ability score can then be set to the maximum score. For example, if the value range of the translation ability score is [0, 1], the minimum score is 0 and the maximum score is 1.
  • because circular RNA lacks a 5'-end cap structure, it uses internal ribosome entry site (IRES) elements with a special secondary structure to recruit ribosomes and initiate translation. Its translation ability can therefore be predicted by analyzing the endogenous IRES elements in the circRNA sequence. In this example, for tumor-specific circular RNAs that have no third candidate reads aligned across the BSJ site, the translation ability is predicted by analyzing the endogenous IRES elements in the circular RNA sequence.
  • in step S704, the full-length sequences of all tumor-specific circular RNAs that have no third candidate reads across the BSJ site are constructed as first full-length sequences. Multiple specified hexamer nucleic acid sequences are mapped to each first full-length sequence, the target hexamer nucleic acid sequences in the first full-length sequence that coincide with a specified hexamer nucleic acid sequence are determined, and adjacent target hexamer nucleic acid sequences are merged.
  • each merged sequence is regarded as an IRES fragment: target hexamer nucleic acid sequences that overlap on the first full-length sequence are merged, while merged sequences that do not overlap one another anywhere in the first full-length sequence are regarded as different IRES fragments, so that multiple IRES fragments can be obtained in the first full-length sequence.
  • the specified hexamer nucleic acid sequence includes the following four types:
  • the first full-length sequence is: AATAAAAGATTGGAGGACAAAAACCGG.
  • the bolded part is the merged two IRES fragments.
  • the raw score of each IRES fragment is equal to the sum of the Z scores of all target hexamer nucleic acid sequences in the IRES fragment divided by the length of the merged sequence of the IRES fragment. Finally, the maximum raw score among the multiple IRES fragments in a tumor-specific circular RNA is normalized so that it falls in a specified value range (such as [0, 1]), thereby obtaining the translation ability score of the tumor-specific circRNA.
  • the specified hexamer nucleic acid sequences can be pre-collected hexamer nucleic acid sequences with Z score > 7.
  • these short IRES-like functional hexamer elements are significantly enriched in circular RNA and drive circular RNA translation.
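  • the IRES fragment scoring can be sketched as follows. The hexamers and Z scores passed in are placeholders supplied by the caller, since the actual specified hexamers are not reproduced in this text. Two further assumptions are made: hits are merged only when they strictly overlap, and the final normalization is a min-max rescaling across circRNAs, which the text does not specify. The function names are hypothetical.

```python
from typing import Dict, List, Tuple


def ires_fragments(seq: str, hexamer_z: Dict[str, float]) -> List[Tuple[int, int, float]]:
    """Return (start, end, raw_score) for each merged IRES fragment in seq."""
    # Locate every occurrence of a specified hexamer (hits come out sorted by start).
    hits = []
    for i in range(len(seq) - 5):
        z = hexamer_z.get(seq[i:i + 6])
        if z is not None:
            hits.append((i, i + 6, z))
    # Merge overlapping hits; non-overlapping hits start a new fragment.
    frags: List[Tuple[int, int, float]] = []
    for start, end, z in hits:
        if frags and start < frags[-1][1]:
            s, e, total = frags[-1]
            frags[-1] = (s, max(e, end), total + z)
        else:
            frags.append((start, end, z))
    # Raw score = sum of Z scores divided by merged fragment length.
    return [(s, e, total / (e - s)) for s, e, total in frags]


def max_raw_score(seq: str, hexamer_z: Dict[str, float]) -> float:
    """Largest raw fragment score in the sequence, or 0.0 if no hexamer is found."""
    return max((score for _, _, score in ires_fragments(seq, hexamer_z)), default=0.0)


def normalize(scores: List[float]) -> List[float]:
    """Min-max rescale to [0, 1] (an assumed normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```

  • circRNAs with third candidate reads across the BSJ site would bypass this calculation and receive the maximum score directly, as described for the polysome profiling data.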
  • Step S8 specifically includes the following steps S801 to S804:
  • Circular RNA is formed by abnormal back-splicing of precursor RNA molecules, and circular RNA in the cytoplasm is mainly formed from the exons of conventional transcripts. For circRNAs spanning multiple exons, completely different circRNA sequences may be formed due to alternative splicing. As shown in Figure 3, the same BSJ site can form two different circRNA sequences. Therefore, it is necessary to construct the full-length sequence of tumor-specific circular RNA.
  • samtools is first used to extract all reads within the interval covered by the tumor-specific circular RNA. Since the experimental processing during sample library construction has filtered out the sequences of all linear transcripts, these reads can be used to determine the internal structure of the circular RNA. The number of reads supporting each possible exon junction in the tumor-specific circular RNA is then counted, and a junction with ≥3 reads is considered a valid junction. Finally, the full-length sequences of all tumor-specific circular RNAs are constructed from the exon junction information within the interval and serve as the second full-length sequences.
  • Non-"ATG" start codons are relatively common in IRES-mediated translation, so all "NTG" codons are treated as start codons when predicting open reading frame (ORF) sequences, where N represents any of the bases A, C, T, and G.
  • ORFs are predicted in three reading frames on the second full-length sequence constructed for each tumor-specific circular RNA, extending downstream from an "NTG" start codon and stopping at the first stop codon. A base sequence that starts at a start codon and is not interrupted by a stop codon is determined to be an ORF sequence with the potential to encode protein. Each predicted ORF sequence must be at least 60 bp long and must span the BSJ site. If a longer ORF sequence completely covers a shorter ORF sequence, the shorter ORF sequence is no longer considered.
  • the ORF sequences that meet these criteria are translated into amino acid sequences according to the codon table and are regarded as full-length protein sequences derived from tumor-specific circular RNAs.
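  • the ORF search can be sketched as below. Several choices here are assumptions made for illustration: the circular sequence is scanned on two concatenated copies so that an ORF can run across the junction, ORFs with no stop codon within those two copies are skipped, ORF length is counted without the stop codon, and the function names are hypothetical. Translation uses the standard genetic code.

```python
from typing import List, Tuple

STOPS = {"TAA", "TAG", "TGA"}
_BASES = "TCAG"
_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order.
CODON_TABLE = dict(zip((a + b + c for a in _BASES for b in _BASES for c in _BASES), _AA))


def bsj_spanning_orfs(circ_seq: str, min_len: int = 60) -> List[Tuple[int, int, str]]:
    """Find NTG-initiated ORFs that cross the BSJ of a linearized circRNA sequence.

    Coordinates refer to the doubled sequence; the BSJ sits at index len(circ_seq).
    """
    n = len(circ_seq)
    doubled = circ_seq + circ_seq
    orfs = []
    for start in range(n):  # start codons are taken on the first copy only
        if doubled[start + 1:start + 3] != "TG":  # any NTG counts as a start codon
            continue
        for pos in range(start + 3, len(doubled) - 2, 3):
            if doubled[pos:pos + 3] in STOPS:
                # Keep ORFs that are long enough and extend past the junction.
                if pos > n and pos - start >= min_len:
                    orfs.append((start, pos))
                break
    # Drop any ORF completely covered by another ORF.
    kept = [o for o in orfs
            if not any(p != o and p[0] <= o[0] and o[1] <= p[1] for p in orfs)]
    return [(s, e, doubled[s:e]) for s, e in kept]


def translate(orf: str) -> str:
    """Translate an ORF codon by codon with the standard codon table."""
    return "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, len(orf) - 2, 3))
```

  • the default minimum length of 60 bp follows the text; a smaller value is only useful for toy inputs.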
  • because the sequence of a circular RNA molecule is highly consistent with the sequence of the transcript from which it is derived, the protein sequences are also largely similar. Therefore, after the full-length protein sequence is divided into multiple peptide segments, these peptide segments need to be filtered further to remove those present in the normal human proteome, yielding the tumor-specific circular RNA neoantigen candidate peptide segments.
  • the translation ability score is specifically used as a scoring indicator of the neoantigen immunogenicity scoring model, and then the neoantigen immunogenicity scoring model is used to score each neoantigen candidate peptide segment separately and then rank.
  • step S9 specifically includes: scoring and ranking the immunogenicity of multiple neoantigen candidate peptides according to the translation ability score and abundance of the tumor-specific circRNA.
  • the process by which neoantigens elicit an immune response is relatively complex. It includes proteasome cleavage of protein sequences into peptide fragments, processing of the peptide fragments in the endoplasmic reticulum and their presentation on the cell surface by MHC molecules, and binding of pMHC (peptide-MHC) complexes to T cell receptors (TCR) to initiate the immune response, among other steps.
  • the embodiments of the present invention further take into account that the peptide segments recognized and presented by human histocompatibility antigen molecules (human leukocyte antigen, HLA) are usually short: class I HLA mainly recognizes peptide segments 8-12 aa long, while the peptides recognized by class II HLA are slightly longer, mainly 15 aa.
  • in step S804, the full-length protein sequence obtained in the preceding step can be divided with a sliding window into peptide segments of the lengths mainly recognized by HLA, as described above. Neoantigen candidate peptides whose sequence length is within the first length range are determined to be class I neoantigen candidate peptides, and neoantigen candidate peptides whose sequence length is within the second length range are determined to be class II neoantigen candidate peptides; wherein the second length range is greater than the first length range.
  • peptides whose sequence length is within the first length range are regarded as class I neoantigen candidate peptides and bind to HLA class I molecules; peptides whose sequence length is within the second length range, a range that includes 15 aa (such as 14-16 aa), are regarded as class II neoantigen candidate peptides.
  • the preferred class II neoantigen candidate peptide is 15aa in length and binds to HLA class II molecules.
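  • the sliding division into class I and class II candidate peptides, followed by removal of peptides found in the normal proteome, can be sketched as follows. The function name `candidate_peptides`, the default length ranges (8-12 aa and 15 aa, taken from the preferred values in the text) and the representation of the normal proteome as a list of protein strings are illustrative assumptions.

```python
from typing import Dict, Iterable, List, Set


def candidate_peptides(protein: str,
                       normal_proteome: List[str],
                       class1_lengths: Iterable[int] = range(8, 13),
                       class2_lengths: Iterable[int] = (15,)) -> Dict[str, Set[str]]:
    """Slide windows over a protein and keep peptides absent from normal proteins."""
    result: Dict[str, Set[str]] = {"I": set(), "II": set()}
    for hla_class, lengths in (("I", class1_lengths), ("II", class2_lengths)):
        for k in lengths:
            for i in range(len(protein) - k + 1):
                peptide = protein[i:i + k]
                # Discard peptides that occur anywhere in a normal protein.
                if not any(peptide in normal for normal in normal_proteome):
                    result[hla_class].add(peptide)
    return result
```

  • a substring scan over every normal protein is slow for a full proteome; an index of normal k-mers would be the usual replacement, but the simple form keeps the filtering rule visible.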
  • step S9 is specifically: based on the translation ability score and abundance of the tumor-specific circRNA, and the binding affinity of each neoantigen candidate peptide to the corresponding HLA molecule, perform immunogenicity on multiple neoantigen candidate peptides respectively. Score and sort.
  • the physical and chemical properties of the neoantigen candidate peptide itself can also be considered, such as the possibility of the peptide being cleaved by the proteasome, the hydrophobicity score of the T cell-binding residues in the peptide, etc.
  • the scoring indicators used for neoantigen candidate peptides include a combination of several of at least the following: the translation ability score of the tumor-specific circRNA from which the peptide is derived, the abundance of the tumor-specific circRNA from which the peptide is derived, the binding affinity of the peptide to the corresponding HLA molecule, the possibility that the peptide is cleaved by the proteasome, and the hydrophobicity score of the T cell-binding residues in the peptide.
  • the neoantigen immunogenicity scoring model can be set as a linear model, in which the neoantigen immunogenicity score is obtained by normalizing each scoring indicator, assigning it a weight, and summing. Specifically, for each neoantigen candidate peptide, several of the following indicators are combined with their corresponding weight coefficients to obtain the neoantigen immunogenicity score of that peptide: the translation ability score of the tumor-specific circRNA from which the peptide is derived, the abundance of the tumor-specific circRNA from which the peptide is derived, the binding affinity of the peptide to the corresponding HLA molecule, the possibility that the peptide is cleaved by the proteasome, and the hydrophobicity score of the T cell-binding residues in the peptide. The neoantigen candidate peptides are then sorted from high to low by neoantigen immunogenicity score.
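  • a linear scoring model of this kind can be sketched as below. The indicator names and weight values are hypothetical, since the patent text does not give the weights, and every indicator is assumed to be already normalized to [0, 1] with higher values meaning more favourable (for example, binding affinity rescaled so that stronger binding scores higher).

```python
from typing import Dict, List


def immunogenicity_score(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of normalized scoring indicators for one peptide."""
    return sum(weight * features[name] for name, weight in weights.items())


def rank_peptides(candidates: Dict[str, Dict[str, float]],
                  weights: Dict[str, float],
                  top_n: int) -> List[str]:
    """Return the top_n peptides ordered from highest to lowest score."""
    ordered = sorted(candidates,
                     key=lambda pep: immunogenicity_score(candidates[pep], weights),
                     reverse=True)
    return ordered[:top_n]
```

  • the `top_n` peptides returned correspond to the specified number of top-ranked candidates that are taken as neoantigen target peptides.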
  • top-ranked neoantigen candidate peptides can be determined as circRNA-derived neoantigen target peptides, and their immunogenicity can be further verified through experiments and used in clinical immunotherapy for tumor patients.
  • the first sequencing data of the tumor tissue samples and the second sequencing data of the adjacent cancer tissue samples are each aligned to the pseudo-reference sequences, and the first candidate reads and second candidate reads aligned to the pseudo-reference sequences are extracted to determine the first circRNAs detected in the tumor tissue samples and the second circRNAs detected in the adjacent tissue samples.
  • the two are combined to filter out the normal circRNAs that exist in both tumor tissue samples and adjacent tissue samples, yielding tumor-specific circRNAs. The number of first candidate reads aligned to the corresponding pseudo-reference sequence is then verified for each tumor-specific circRNA, and the abundance of the tumor-specific circRNA from which each neoantigen candidate peptide is derived is calculated from that number.
  • for tumor-specific circRNAs with third candidate reads aligned across the BSJ site, the translation ability score is set to the maximum score; for the others, the translation ability score is predicted from the endogenous IRES elements.
  • scoring indicators such as the abundance and translation ability score of the tumor-specific circRNA from which a peptide is derived and the binding affinity of the peptide to the corresponding HLA molecule are then combined to score and rank the neoantigen candidate peptides for immunogenicity, and the top-ranked ones are finally determined as neoantigen target peptides.
  • this can increase the sources of tumor neoantigens and broaden the scope of neoantigen screening while further improving the accuracy of circRNA identification, thereby making circRNA-derived neoantigens more immunogenic.
  • embodiments of the present invention provide a computer-implemented method that uses second-generation sequencing data to explore proteins translated from circular RNA as neoantigens for tumor-specific immunotherapy. This expands the screening scope of neoantigens, which is especially beneficial for tumor types with a low mutation load. The method also comprehensively considers the translation potential of tumor-specific circRNAs. By integrating the results of the two most advanced circular RNA detection algorithms (CIRCexplorer2 and CIRI2), constructing circular RNA pseudo-reference sequences for re-alignment, and verifying the number of reads aligned to the corresponding pseudo-reference sequence for each candidate circular RNA, it achieves more accurate circRNA identification and can therefore identify circRNA-based immunotherapy neoantigens more accurately.
  • the embodiment of the present invention discloses a neoantigen identification device for tumor-specific circular RNA.
  • the device can be embedded in a computer.
  • the device includes a data acquisition unit 401, a detection unit 402, a pseudo-reference unit 403, a comparison unit 404, a first determination unit 405, a filtering unit 406, a translation prediction unit 407, a peptide acquisition unit 408, a scoring unit 409, and a second determination unit 410.
  • the data acquisition unit 401 is used to acquire the first sequencing data of tumor tissue samples and the second sequencing data of paracancerous tissue samples;
  • the detection unit 402 is used to detect circRNAs on the first sequencing data and the second sequencing data respectively to obtain multiple candidate circRNAs;
  • the pseudo-reference unit 403 is used to construct, for each candidate circRNA and following its sequence order, a reversed pseudo-reference sequence covering the first specified length upstream and downstream of the BSJ site;
  • the comparison unit 404 is used to determine, as first candidate reads, the reads in the first sequencing data that contain the BSJ site and whose sequences of the second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence;
  • the comparison unit 404 is likewise used to determine, as second candidate reads, the reads in the second sequencing data that contain the BSJ site and whose sequences of the second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence; the second specified length is less than or equal to the first specified length;
  • the first determination unit 405 is used to determine the candidate circRNAs supported by the first candidate reads as first circRNAs detected in the tumor tissue sample, and to determine the candidate circRNAs supported by the second candidate reads as second circRNAs detected in the paracancerous tissue sample;
  • the filtering unit 406 is used to filter out the first circular RNA that is the same as the second circular RNA from the plurality of first circular RNAs to obtain a plurality of tumor-specific circular RNAs;
  • the translation prediction unit 407 is used to predict the translation ability score of each tumor-specific circular RNA
  • the peptide acquisition unit 408 is used to acquire multiple neoantigen candidate peptides derived from multiple tumor-specific circular RNAs;
  • the scoring unit 409 is used to score and rank the immunogenicity of multiple neoantigen candidate peptide segments according to the translation ability score of the tumor-specific circular RNA;
  • the second determination unit 410 is used to determine a specified number of top-ranked neoantigen candidate peptide segments as neoantigen target peptides.
  • an embodiment of the present invention discloses an electronic device, including a memory 501 storing executable program code and a processor 502 coupled with the memory 501;
  • the processor 502 calls the executable program code stored in the memory 501 to execute the tumor-specific circular RNA neoantigen identification method described in the above embodiments.
  • Embodiments of the present invention also disclose a computer-readable storage medium that stores a computer program, wherein the computer program causes the computer to execute the neoantigen identification method of tumor-specific circular RNA described in the above embodiments.

Abstract

The present invention belongs to the technical field of bioinformatics. Disclosed in the present invention is a neoantigen identification method for tumor-specific circular RNAs. The method comprises: for candidate circular RNAs that have been detected, constructing pseudo-reference sequences for realignment, respectively aligning sequencing data of tumor tissue samples and paracancerous tissue samples to the pseudo-reference sequences, extracting determined first candidate reads and second candidate reads after alignment, determining candidate circular RNAs supported by the first candidate reads as first circular RNAs in the tumor tissues and candidate circular RNAs supported by the second candidate reads as second circular RNAs in the paracancerous tissues, fusing the first circular RNAs and the second circular RNAs, filtering the circular RNAs, which are present in both the tumor tissues and the paracancerous tissues, to obtain tumor-specific circular RNAs, further predicting a translation ability score, and scoring and ranking neoantigens derived from the circular RNAs to obtain top-ranked neoantigens. In the present invention, sources of tumor neoantigens can be increased, the identification accuracy of circular RNAs can also be improved, and the neoantigens thereof are more immunogenic.

Description

Neoantigen identification method and device, apparatus, and medium for tumor-specific circular RNA
The present application claims priority to Chinese patent application No. 2022110862377, filed with the China Patent Office on September 6, 2022 and entitled "Neoantigen identification method and device, apparatus, and medium for tumor-specific circular RNA", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention belongs to the technical field of bioinformatics, and specifically relates to a neoantigen identification method and device, electronic apparatus, and storage medium for tumor-specific circular RNA.
Background
Contrary to the conventional view, many long non-coding RNAs that exist stably in vivo take a covalently closed circular form. Most of them are produced by back-splicing and are called circular RNAs (circRNAs). In recent years, with the development and application of new transcriptome sequencing technologies and the continuous optimization of the corresponding computational analysis pipelines, scientists have discovered hundreds of thousands of exonic back-spliced circular RNAs in eukaryotes.
Studies have found that proteins encoded by circular RNAs play an important role in regulating the growth of cancer cells. Because a circular RNA has a stable, covalently closed structure, it degrades more slowly than mRNA, so the neoantigens it encodes persist longer in tumor cells and are more likely to be recognized by T cells.
However, most of a circular RNA's sequence is identical to the normal genomic sequence and differs only near the back-splice junction, and linear RNA transcribed from the same gene interferes with circular RNA detection. As a result, the false-positive rate of current circular RNA detection remains high, which lowers the accuracy of circular RNA neoantigen identification.
Summary of the Invention
The object of the present invention is to provide a neoantigen identification method and device, apparatus, and storage medium for tumor-specific circular RNA that improve the accuracy of circular RNA identification, so that the neoantigens derived from circular RNAs are more immunogenic.
A first aspect of the present invention discloses a neoantigen identification method for tumor-specific circular RNA, comprising: obtaining first sequencing data of a tumor tissue sample and second sequencing data of a paracancerous tissue sample;
performing circular RNA detection on the first sequencing data and the second sequencing data respectively to obtain a plurality of candidate circular RNAs;
for each candidate circular RNA, constructing, following its sequence order, a reversed pseudo-reference sequence covering a first specified length upstream and a first specified length downstream of the BSJ site;
determining, as first candidate reads, the reads in the first sequencing data that contain the BSJ site and whose sequences of a second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence; and determining, as second candidate reads, the reads in the second sequencing data that contain the BSJ site and whose sequences of the second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence; wherein the second specified length is less than or equal to the first specified length;
determining the candidate circular RNAs supported by the first candidate reads as first circular RNAs detected in the tumor tissue sample; and determining the candidate circular RNAs supported by the second candidate reads as second circular RNAs detected in the paracancerous tissue sample;
filtering out, from the plurality of first circular RNAs, each first circular RNA that is identical to a second circular RNA, to obtain a plurality of tumor-specific circular RNAs;
predicting a translation ability score for each tumor-specific circular RNA;
obtaining a plurality of neoantigen candidate peptides derived from the plurality of tumor-specific circular RNAs;
scoring and ranking the immunogenicity of the plurality of neoantigen candidate peptides according to the translation ability scores of the tumor-specific circular RNAs;
determining a specified number of top-ranked neoantigen candidate peptides as neoantigen target peptides.
A second aspect of the present invention discloses a neoantigen identification device for tumor-specific circular RNA, comprising: a data acquisition unit, configured to obtain first sequencing data of a tumor tissue sample and second sequencing data of a paracancerous tissue sample;
a detection unit, configured to perform circular RNA detection on the first sequencing data and the second sequencing data respectively to obtain a plurality of candidate circular RNAs;
a pseudo-reference unit, configured to construct, for each candidate circular RNA and following its sequence order, a reversed pseudo-reference sequence covering a first specified length upstream and a first specified length downstream of the BSJ site;
a comparison unit, configured to determine, as first candidate reads, the reads in the first sequencing data that contain the BSJ site and whose sequences of a second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence; and to determine, as second candidate reads, the reads in the second sequencing data that contain the BSJ site and whose sequences of the second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence; wherein the second specified length is less than or equal to the first specified length;
a first determination unit, configured to determine the candidate circular RNAs supported by the first candidate reads as first circular RNAs detected in the tumor tissue sample, and to determine the candidate circular RNAs supported by the second candidate reads as second circular RNAs detected in the paracancerous tissue sample;
a filtering unit, configured to filter out, from the plurality of first circular RNAs, each first circular RNA that is identical to a second circular RNA, to obtain a plurality of tumor-specific circular RNAs;
a translation prediction unit, configured to predict a translation ability score for each tumor-specific circular RNA;
a peptide acquisition unit, configured to obtain a plurality of neoantigen candidate peptides derived from the plurality of tumor-specific circular RNAs;
a scoring unit, configured to score and rank the immunogenicity of the plurality of neoantigen candidate peptides according to the translation ability scores of the tumor-specific circular RNAs;
a second determination unit, configured to determine a specified number of top-ranked neoantigen candidate peptides as neoantigen target peptides.
A third aspect of the present invention discloses an electronic apparatus, comprising a memory storing executable program code and a processor coupled to the memory; the processor calls the executable program code stored in the memory to execute the neoantigen identification method for tumor-specific circular RNA disclosed in the first aspect.
A fourth aspect of the present invention discloses a computer-readable storage medium storing a computer program, wherein the computer program causes a computer to execute the neoantigen identification method for tumor-specific circular RNA disclosed in the first aspect.
The beneficial effects of the present invention are as follows. In the provided neoantigen identification method and device, apparatus, and storage medium for tumor-specific circular RNA, pseudo-reference sequences for re-alignment are constructed for the detected candidate circular RNAs, and the sequencing data of the tumor tissue sample and of the paracancerous tissue sample are each aligned to the pseudo-reference sequences. The aligned first candidate reads and second candidate reads are extracted; candidate circular RNAs supported by first candidate reads are determined as first circular RNAs detected in the tumor tissue sample, and candidate circular RNAs supported by second candidate reads are determined as second circular RNAs detected in the paracancerous tissue sample. Combining the two sets allows normal circular RNAs present in both samples to be filtered out, yielding tumor-specific circular RNAs. The translation ability score of each tumor-specific circular RNA is then predicted, the neoantigen candidate peptides derived from it are scored and ranked for immunogenicity accordingly, and the top-ranked peptides are determined as neoantigen target peptides. This increases the sources of tumor neoantigens and broadens the scope of neoantigen screening while improving the identification accuracy of circular RNAs, so that circular-RNA-derived neoantigens are more immunogenic.
Brief Description of the Drawings
The drawings here show specific examples of the technical solutions of the present invention and, together with the detailed description, form part of the specification and serve to explain the technical solutions, principles, and effects of the present invention.
Unless otherwise specified or defined, the same reference numerals in different drawings denote the same or similar technical features, and the same or similar technical features may also be denoted by different reference numerals.
Figure 1 is a flow chart of a neoantigen identification method for tumor-specific circular RNA;
Figure 2 is a schematic structural diagram of the reversed pseudo-reference sequence of a circular RNA, with 100 bp on each side of the BSJ site;
Figure 3 is a schematic structural diagram of two different circular RNA sequences formed from the same BSJ site;
Figure 4 is a schematic structural diagram of a neoantigen identification device for tumor-specific circular RNA;
Figure 5 is a schematic structural diagram of an electronic apparatus disclosed in an embodiment of the present invention.
Explanation of reference numerals:
401, data acquisition unit; 402, detection unit; 403, pseudo-reference unit; 404, comparison unit; 405, first determination unit; 406, filtering unit; 407, translation prediction unit; 408, peptide acquisition unit; 409, scoring unit; 410, second determination unit; 501, memory; 502, processor.
Detailed Description
To facilitate understanding of the present invention, specific embodiments are described in more detail below with reference to the accompanying drawings.
Unless otherwise specified or defined, all technical and scientific terms used herein have the meanings commonly understood by those skilled in the art. Where the technical solutions of the present invention are applied to a real scenario, the technical and scientific terms used herein may also take meanings consistent with the purpose of those technical solutions. The terms "first, second, ..." are used herein only to distinguish names and do not denote a specific quantity or order. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
Unless otherwise specified or defined, "said" and "the" as used herein refer to a technical feature or technical content mentioned or described earlier in the text; the feature or content referred to may be the same as, or similar to, the one mentioned earlier.
Technical content or technical features that run contrary to the purpose of the present invention, or that are clearly contradictory, are to be excluded.
As shown in Figure 1, an embodiment of the present invention discloses a neoantigen identification method for tumor-specific circular RNA, comprising the following steps S1 to S10:
S1. Obtain first sequencing data of a tumor tissue sample, second sequencing data of a paracancerous tissue sample, and third sequencing data of the tumor tissue sample based on polysome profiling.
Preferably, the sample libraries are prepared using the A-tailing + RNase R (Li+ buffer) method. Studies have shown that this method digests linear RNA as completely as possible, which not only helps detect circular RNA (circRNA) accurately but is also important for determining the full-length sequence of a circular RNA.
In the embodiment of the present invention, the samples include a tumor tissue sample and a paracancerous tissue sample. The first, second, and third sequencing data in this step are all second-generation high-throughput circular RNA sequencing data, and each may be high-quality sequencing data obtained by applying quality-control filtering to the raw off-machine sequencing data.
Specifically, in step S1, the raw off-machine sequencing (circleSeq) data of the tumor tissue sample and of the paracancerous tissue sample, and the raw polysome-profiling sequencing (polysome profiling RNASeq) data of the tumor tissue sample, each undergo the same quality-control filtering to obtain the first, second, and third sequencing data. The quality-control filtering may include the following steps S101 to S102:
S101. Take the raw off-machine sequencing data, remove adapters, and filter out low-quality reads to obtain filtered reads.
In this step, adapter sequences are removed from the raw sequencing data and low-quality reads are filtered out. Specifically, a window of length 10 slides rightward over the raw data with a step of 1, and the average sequencing quality of the 10 bases in the window is calculated at each position. If the average quality is below 15, the window and the sequence to its right are judged to be a low-quality region, and the entire read containing them is deleted. The reads that survive this first deletion are then checked for length: the sequence length is required to be greater than 20 bp, and a surviving read that does not meet this requirement is likewise regarded as low quality and deleted in full. The reads that remain are the filtered reads.
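The sliding-window and length checks in S101 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function name `passes_quality_filter` and its parameters are hypothetical, per-base qualities are assumed to be already decoded to integer Phred scores, and adapter removal is assumed to have been done by a separate tool.

```python
from typing import List


def passes_quality_filter(
    qualities: List[int],
    window: int = 10,              # window length from the text
    step: int = 1,                 # step size from the text
    min_mean_quality: float = 15,  # windows below this mean are low quality
    min_length: int = 20,          # reads must be longer than this (in bp)
) -> bool:
    """Return True if a read survives both the window check and the length check.

    Hypothetical helper: `qualities` holds one Phred score per base.
    """
    # Length check: the text requires a sequence length greater than 20 bp.
    # Both checks discard the whole read, so their order does not change
    # the outcome.
    if len(qualities) <= min_length:
        return False
    # Slide the window from left to right across the read.
    for start in range(0, len(qualities) - window + 1, step):
        window_scores = qualities[start:start + window]
        if sum(window_scores) / window < min_mean_quality:
            # A single low-quality window removes the entire read.
            return False
    return True


if __name__ == "__main__":
    print(passes_quality_filter([30] * 25))             # uniformly good read
    print(passes_quality_filter([30] * 15 + [5] * 10))  # low-quality tail
```

Note that this sketch follows the text in discarding the whole read rather than trimming it at the first low-quality window, which is how some trimming tools behave.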
S102. Align the filtered reads to human ribosomal RNA sequences, and take the unaligned reads as the high-quality sequencing data.
Here, the filtered reads are aligned to human ribosomal RNA sequences, reads that align to ribosomal RNA are removed, and the unaligned reads are retained as high-quality sequencing data for subsequent analysis. This prevents ribosomal RNA contamination from affecting the analysis results.
S2. Perform circular RNA detection on the first sequencing data and the second sequencing data respectively to obtain a plurality of candidate circular RNAs.
In step S2, the first sequencing data of the tumor tissue sample and the second sequencing data of the paracancerous tissue sample are processed in the same way to detect circular RNAs. To avoid repetition, the tumor tissue circleSeq sequencing data (that is, the first sequencing data) is used as the example below, where tumor.circ.R1.fq and tumor.circ.R2.fq are the sequencing data file names.
Preferably, the standard workflows of the detection algorithms CIRCexplorer2 and CIRI2 are each used to detect circular RNAs in the tumor tissue. The CIRCexplorer2 standard workflow comprises three steps, Align, Parse, and Annotate, summarized as follows:
In the Align step, the alignment software STAR is used to align the first sequencing data. An example command is:
STAR \
--chimSegmentMin 10 \
--runThreadN 10 \
--genomeDir hg38_star_index \
--readFilesIn tumor.circ.R1.fq tumor.circ.R2.fq
Here, --chimSegmentMin specifies that the aligned segment at either end of a chimeric alignment must be at least 10 bp;
--runThreadN specifies the number of threads to run; --genomeDir specifies the path to the reference index files; --readFilesIn specifies the input sequencing data.
In the Parse step, the CIRCexplorer2 parse command is used to parse the junction information output by the Align step. An example command is:
CIRCexplorer2 parse \
-t STAR \
-b back_spliced_junction.bed \
Chimeric.out.junction
Here, -t specifies the alignment tool used in the Align step; -b specifies the output file name; Chimeric.out.junction is the input file name.
In the Annotate step, the back-splice sites are annotated according to the supplied gene annotation file. An example command is:
CIRCexplorer2 annotate \
-r GTF \
-g hg38.fa \
-b back_spliced_junction.bed \
-o tumor_circRNA.txt
Here, -r specifies the gene annotation file; -g specifies the human reference genome file; -b specifies the junction information output by the Parse step; -o specifies the output file name.
An example command for the standard workflow of the other detection algorithm, CIRI2, is as follows:
perl CIRI2.pl \
--in tumor.circ.sam \
--ref_file hg38.fa \
--anno GTF \
--out outfile
Here, --in specifies the input SAM file, which is generated by aligning the circRNA sequencing data with the bwa-mem tool; --ref_file specifies the human reference genome file; --anno specifies the gene annotation file; --out specifies the output result file.
Finally, all candidate circular RNAs detected by the two detection algorithms, CIRCexplorer2 and CIRI2, in the first sequencing data and the second sequencing data are pooled.
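Pooling the calls from both tools and both samples amounts to taking a union over a common key. The sketch below is illustrative only: the `CircRNA` record, the `normalize` helper, and the `zero_based_start` flag are hypothetical, and it assumes each tool's output has already been parsed into chromosome, start, end, and strand. Different tools may report coordinates under different conventions, so the key is normalized before merging.

```python
from typing import Iterable, NamedTuple, Set


class CircRNA(NamedTuple):
    """Hypothetical key identifying a candidate circRNA by its BSJ coordinates."""
    chrom: str
    start: int   # 1-based, inclusive (convention assumed for this sketch)
    end: int
    strand: str


def normalize(chrom: str, start: int, end: int, strand: str,
              zero_based_start: bool = False) -> CircRNA:
    """Convert a call to the 1-based convention used by CircRNA."""
    return CircRNA(chrom, start + 1 if zero_based_start else start, end, strand)


def merge_candidates(*call_sets: Iterable[CircRNA]) -> Set[CircRNA]:
    """Union of all call sets; duplicates across tools or samples collapse."""
    merged: Set[CircRNA] = set()
    for calls in call_sets:
        merged.update(calls)
    return merged


if __name__ == "__main__":
    # Toy calls: the same junction reported under two coordinate conventions.
    tool_a_tumor = {normalize("chr1", 999, 2000, "+", zero_based_start=True)}
    tool_b_tumor = {normalize("chr1", 1000, 2000, "+")}
    tool_b_normal = {normalize("chr2", 500, 900, "-")}
    print(len(merge_candidates(tool_a_tumor, tool_b_tumor, tool_b_normal)))
```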
S3. For each candidate circular RNA, construct, following its sequence order, a reversed pseudo-reference sequence covering a first specified length upstream and a first specified length downstream of the BSJ site.
The BSJ site is the back-splice junction (backsplicing junction, BSJ), and the first specified length may be set to 100 bp. Step S3 therefore constructs, for each candidate circular RNA and following the sequence order of the circular RNA, a reversed pseudo-reference sequence of 100 bp upstream and 100 bp downstream of the BSJ site, together with its index information. The reversed pseudo-reference sequence constructed for a candidate circular RNA is shown in Figure 2.
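One way to picture the pseudo-reference in S3: if the circular RNA is written as a linear string from its first base to its last, the back-splice junction joins the last base to the first, so the sequence around the junction is the tail of the string followed by its head. The sketch below makes that assumption explicit; the function name and the handling of circular RNAs shorter than the flank length are illustrative choices, not taken from the patent.

```python
from typing import Tuple


def build_pseudo_reference(circ_seq: str, flank: int = 100) -> Tuple[str, int]:
    """Return (pseudo_reference, bsj_index) for one candidate circRNA.

    `circ_seq` is assumed to be the circRNA sequence written linearly,
    so that the back-splice junction joins its last base to its first.
    `bsj_index` is the position in the pseudo-reference where the
    downstream flank begins.
    """
    if not circ_seq:
        raise ValueError("empty circRNA sequence")
    # Assumption for this sketch: cap the flank at the circRNA length.
    flank = min(flank, len(circ_seq))
    upstream = circ_seq[-flank:]    # bases immediately before the junction
    downstream = circ_seq[:flank]   # bases immediately after the junction
    return upstream + downstream, len(upstream)


if __name__ == "__main__":
    pseudo, bsj = build_pseudo_reference("A" * 150 + "C" * 150)
    print(len(pseudo), bsj)
```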
S4. Determine, as first candidate reads, the reads in the first sequencing data that contain the BSJ site and whose sequences of a second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence; and determine, as second candidate reads, the reads in the second sequencing data that contain the BSJ site and whose sequences of the second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence.
The second specified length is less than or equal to the first specified length. For example, the second specified length may be set to 3 bp, 5 bp, or 10 bp; in some other possible embodiments it may also be set equal to the first specified length, but it should not exceed the first specified length, that is, 100 bp.
This embodiment is described with the second specified length set to 3 bp. In step S4, all reads supporting a circular RNA BSJ site are extracted from the first sequencing data and the second sequencing data and are re-aligned, according to the index information, to the pseudo-reference sequences constructed in step S3. The 3 bp upstream and the 3 bp downstream of the BSJ site in each read are required to match the pseudo-reference sequence exactly. In other words, reads in the first and second sequencing data that contain the BSJ site and whose 3 bp flanks on each side of it match the pseudo-reference sequence exactly are determined as candidate circular RNA reads (that is, the first candidate reads and the second candidate reads, respectively).
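The matching rule in S4 can be illustrated with a small check on a single re-aligned read. This sketch assumes an ungapped alignment described only by the read's start offset on the pseudo-reference; the function name `supports_bsj` and its arguments are hypothetical, and a real pipeline would take positions from an aligner's output.

```python
def supports_bsj(read: str, read_start: int, pseudo_ref: str,
                 bsj_index: int, k: int = 3) -> bool:
    """Check that a read spans the BSJ and matches exactly k bases on each side.

    read_start: 0-based offset of the read's first base on the pseudo-reference
                (ungapped alignment assumed for this sketch).
    bsj_index:  index in the pseudo-reference of the first downstream base.
    k:          the second specified length (3 bp in the described embodiment).
    """
    lo, hi = bsj_index - k, bsj_index + k
    # The junction window itself must lie inside the pseudo-reference.
    if lo < 0 or hi > len(pseudo_ref):
        return False
    read_end = read_start + len(read)  # exclusive end on the pseudo-reference
    # The read must cover the whole window of k bases on each side.
    if read_start > lo or read_end < hi:
        return False
    # Exact match is required only within the 2k-base junction window.
    return read[lo - read_start:hi - read_start] == pseudo_ref[lo:hi]


if __name__ == "__main__":
    pseudo = "C" * 100 + "A" * 100
    print(supports_bsj("CCCCCAAAAA", 95, pseudo, 100))
```

Mismatches outside the junction window are tolerated by this check; whether and how such mismatches are limited would depend on the aligner settings, which the text does not specify.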
为了进一步消除假阳性的reads,筛选出的候选环状RNA reads可与人类正常基因组和转录组进行比对,滤除正常比对上的候选环状RNA reads,即可获得真实的候选环状RNA reads。In order to further eliminate false positive reads, the screened candidate circRNA reads can be compared with the normal human genome and transcriptome, and the candidate circRNA reads on the normal alignment can be filtered out to obtain the real candidate circRNA reads.
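The exact-match requirement on the BSJ flanks can be illustrated with a minimal Python sketch. The function name `flanks_match`, the 0-based junction offsets, and the toy sequences are assumptions made for illustration; the text does not prescribe a data layout or an aligner.

```python
def flanks_match(read: str, read_bsj: int, pseudo_ref: str, ref_bsj: int,
                 flank: int = 3) -> bool:
    """Return True if the `flank` bases on each side of the BSJ agree exactly.

    `read_bsj` and `ref_bsj` are hypothetical 0-based offsets of the first base
    downstream of the junction in the read and in the pseudo-reference.
    """
    # A sequence that does not cover the full flank on both sides cannot support the BSJ.
    if read_bsj < flank or read_bsj + flank > len(read):
        return False
    if ref_bsj < flank or ref_bsj + flank > len(pseudo_ref):
        return False
    read_window = read[read_bsj - flank:read_bsj + flank]
    ref_window = pseudo_ref[ref_bsj - flank:ref_bsj + flank]
    return read_window == ref_window


# Toy pseudo-reference: 6 bp upstream of the junction followed by 6 bp downstream.
pseudo_ref = "ACGTTG" + "CATGCA"
supporting = flanks_match("GTTGCATG", 4, pseudo_ref, 6)  # flanks TTG|CAT agree
mismatched = flanks_match("GTAGCATG", 4, pseudo_ref, 6)  # upstream flank differs
```

Setting `flank` to 5 or 10 corresponds to the other second specified lengths mentioned above.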
S5. Candidate circular RNAs supported by first candidate reads are determined as first circular RNAs detected in the tumor tissue sample; and candidate circular RNAs supported by second candidate reads are determined as second circular RNAs detected in the paracancerous tissue sample.
In step S5, among the multiple candidate circular RNAs, those supported by reads whose 3 bp sequences upstream and downstream of the BSJ site match the pseudo-reference sequence exactly are taken as the circular RNAs finally detected in the tumor tissue sample and the paracancerous tissue sample (i.e., the first circular RNAs and the second circular RNAs).
S6. From the multiple first circular RNAs, filter out those that are identical to a second circular RNA, to obtain multiple tumor-specific circular RNAs.
As the name suggests, a tumor-specific circular RNA is a circular RNA present only in tumor cells. Circular RNAs that are also detected in the paracancerous tissue sample are circular RNAs normally present in cells and therefore need to be removed; that is, any first circular RNA identical to a second circular RNA is filtered out.
However, because of individual differences and the limitations of sequencing data, the circular RNAs in the paracancerous tissue sample cannot represent all normal circular RNAs. Additional circular RNAs from normal cells can therefore be collected from published circRNA databases, and the first circular RNAs detected in the tumor tissue sample can be further compared against them to determine the tumor-specific circular RNAs. The circRNA database is preferably one of, or a combination of, circBase, CIRCpedia v2, and CircAtlas.
Step S6 may therefore specifically comprise: filtering out, from the multiple first circular RNAs, those identical to a second circular RNA, and filtering out those identical to a normal-cell circular RNA in a designated circRNA database; the first circular RNAs that remain are taken as the final tumor-specific circular RNAs.
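The two filters of step S6 amount to a set difference. The sketch below assumes that two circular RNAs are "identical" when their chromosome, back-splice coordinates, and strand coincide; this identity criterion and the `CircRNA` record are illustrative assumptions, not taken from the text.

```python
from typing import Iterable, List, NamedTuple


class CircRNA(NamedTuple):
    """Hypothetical identity key for a circular RNA."""
    chrom: str
    start: int
    end: int
    strand: str


def tumor_specific(first: Iterable[CircRNA],
                   second: Iterable[CircRNA],
                   database: Iterable[CircRNA]) -> List[CircRNA]:
    # Union of circRNAs seen in paracancerous tissue and in normal-cell databases.
    normal = set(second) | set(database)
    # Keep only tumor-sample circRNAs that are absent from both sources.
    return [c for c in first if c not in normal]


first = [CircRNA("chr1", 100, 500, "+"),
         CircRNA("chr2", 300, 900, "-"),
         CircRNA("chr3", 10, 80, "+")]
second = [CircRNA("chr2", 300, 900, "-")]
database = {CircRNA("chr3", 10, 80, "+")}
specific = tumor_specific(first, second, database)
```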
S7. Predict a translation ability score for each tumor-specific circular RNA.
Once the tumor-specific circular RNAs have been detected, their translation ability, i.e. their translation potential, can be further evaluated. A specific implementation comprises the following steps S701 to S704:
S701. Reads in the third sequencing data that contain a BSJ site, and whose sequences of the second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence of a tumor-specific circular RNA, are determined as reads spanning the BSJ site.
Because the third sequencing data is already high-quality data that has undergone the quality-control filtering of steps S101 to S102, it can be aligned directly to the pseudo-reference sequences of the tumor-specific circular RNAs. According to the alignment result, any read in the third sequencing data that contains a BSJ site and whose sequences of the second specified length (3 bp) upstream and downstream of the BSJ site match the pseudo-reference sequence of a tumor-specific circular RNA is determined as a third candidate read spanning the BSJ site.
S702. For each tumor-specific circular RNA in turn, determine whether a third candidate read aligns across its BSJ site. If so, perform step S703; otherwise, perform step S704.
S703. For a tumor-specific circular RNA whose BSJ site is covered by an aligned third candidate read, set the translation ability score to the maximum score.
Polysome profiling extracts the RNA in a cell that is bound to ribosomes and being translated; combined with high-throughput RNA sequencing, it can accurately identify RNA sequences undergoing translation. In other words, if a third candidate read aligns across the BSJ site of a tumor-specific circular RNA, that circular RNA is bound to ribosomes and being translated, and its translation potential can be regarded as maximal, so its translation ability score can be set to the maximum score, for example 1. Assuming the translation ability score takes values in [0, 1], the minimum score is 0 and the maximum score is 1.
S704. For a tumor-specific circular RNA whose BSJ site is not covered by any aligned third candidate read, construct its first full-length sequence, determine the multiple IRES fragments in the first full-length sequence, and compute a raw score for each IRES fragment; the maximum raw score among the IRES fragments is then normalized to obtain the translation ability score of that tumor-specific circular RNA.
Numerous studies have confirmed that circular RNAs, being covalently closed, lack the 5' cap structure found on mRNA and instead use an internal ribosome entry site (IRES) element, which has a characteristic secondary structure, to recruit ribosomes and initiate translation. The translation ability of a circular RNA can therefore be predicted by analyzing the endogenous IRES elements in its sequence. In this embodiment, this approach is applied to tumor-specific circular RNAs whose BSJ site is not covered by any aligned third candidate read.
Step S704 is specifically implemented as follows. For every tumor-specific circular RNA whose BSJ site is not covered by an aligned third candidate read, its full-length sequence is constructed as the first full-length sequence. Multiple designated hexamer nucleic acid sequences are mapped onto the first full-length sequence, and the occurrences of the designated hexamers in the first full-length sequence are identified as target hexamer nucleic acid sequences. Target hexamers at adjacent positions are merged, and each merged sequence is regarded as one IRES fragment: target hexamers that overlap on the first full-length sequence are merged, whereas merged regions that do not overlap one another anywhere along the first full-length sequence are regarded as different IRES fragments. Multiple IRES fragments in the first full-length sequence are obtained in this way. For example, suppose the designated hexamer nucleic acid sequences are the following four:
AATAAA, AAAAGA, ACAAAA, CAAAAA;
and the first full-length sequence is AATAAAAGATTGGAGGACAAAAACCGG. The two merged IRES fragments are then AATAAAAGA (positions 1 to 9, from AATAAA and AAAAGA) and ACAAAAA (positions 17 to 23, from ACAAAA and CAAAAA).
The raw score of each IRES fragment equals the sum of the Z scores of all target hexamer nucleic acid sequences in that fragment divided by the length of the merged fragment. Finally, the maximum raw score among the IRES fragments of the tumor-specific circular RNA is normalized so that it falls within a specified range (such as [0, 1]), yielding the translation ability score of that tumor-specific circular RNA.
The designated hexamer nucleic acid sequences may be pre-collected hexamers with a Z score > 7. These short hexamer elements with IRES-like function are significantly enriched in circular RNAs and drive circular RNA translation.
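The hexamer merging and raw-score calculation of step S704 can be sketched as follows, using the example sequence from the text. The Z scores in `HEXAMER_Z` are made-up placeholders, hexamers are merged only when they strictly overlap, and wrap-around across the BSJ is not handled; all three are simplifying assumptions. The normalization to [0, 1] is left out because the text does not specify the method.

```python
from typing import Dict, List

# Hypothetical Z scores (all > 7) for the four hexamers in the worked example.
HEXAMER_Z: Dict[str, float] = {
    "AATAAA": 8.0, "AAAAGA": 9.0, "ACAAAA": 7.5, "CAAAAA": 10.0,
}


def find_ires_fragments(seq: str, hexamer_z: Dict[str, float]) -> List[dict]:
    """Merge overlapping hexamer hits into IRES fragments and score each one."""
    fragments: List[dict] = []
    for i in range(len(seq) - 5):  # left-to-right scan, so hits come sorted by start
        kmer = seq[i:i + 6]
        if kmer not in hexamer_z:
            continue
        if fragments and i < fragments[-1]["end"]:
            # Overlaps the previous fragment: extend it and accumulate the Z score.
            fragments[-1]["end"] = max(fragments[-1]["end"], i + 6)
            fragments[-1]["z_sum"] += hexamer_z[kmer]
        else:
            fragments.append({"start": i, "end": i + 6, "z_sum": hexamer_z[kmer]})
    for frag in fragments:
        frag["seq"] = seq[frag["start"]:frag["end"]]
        # Raw score = sum of hexamer Z scores / merged fragment length.
        frag["raw_score"] = frag["z_sum"] / (frag["end"] - frag["start"])
    return fragments


fragments = find_ires_fragments("AATAAAAGATTGGAGGACAAAAACCGG", HEXAMER_Z)
max_raw = max(f["raw_score"] for f in fragments)
```

With these placeholder Z scores the two fragments are AATAAAAGA and ACAAAAA, and the second one supplies the maximum raw score that would then be normalized.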
S8. Obtain multiple neoantigen candidate peptides derived from the multiple tumor-specific circular RNAs.
Step S8 specifically comprises the following steps S801 to S804:
S801. Construct the second full-length sequences of the multiple tumor-specific circular RNAs.
Circular RNAs are formed by aberrant back-splicing of precursor RNA molecules, and circular RNAs in the cytoplasm are formed mainly from exons of conventional transcripts. For a circular RNA spanning multiple exons, alternative splicing may produce entirely different circular RNA sequences; as shown in Figure 3, the same BSJ site can give rise to two different circular RNA sequences. The full-length sequence of each tumor-specific circular RNA therefore needs to be constructed.
During construction, samtools is first used to extract all reads within the interval covered by a tumor-specific circular RNA. Because the experimental processing during library preparation has already removed all linear transcript sequences, these reads can be used to determine the internal structure of the circular RNA. Next, the number of reads at every possible exon junction within the tumor-specific circular RNA is counted, and a junction with at least 3 reads is regarded as valid. Finally, the full-length sequences of all tumor-specific circular RNAs are constructed from the exon junction information within the region and taken as the second full-length sequences.
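The junction-support threshold can be sketched in a few lines. The representation of a junction as a `(donor_end, acceptor_start)` tuple, one per spliced read, is an assumption for illustration; extracting those tuples from the alignments is outside the scope of this sketch.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Junction = Tuple[int, int]  # hypothetical (donor_end, acceptor_start) coordinates


def valid_junctions(junction_reads: Iterable[Junction],
                    min_reads: int = 3) -> Dict[Junction, int]:
    """Count reads per exon junction and keep junctions with >= min_reads reads."""
    counts = Counter(junction_reads)
    return {junction: n for junction, n in counts.items() if n >= min_reads}


# Toy input: three junctions supported by 4, 2 and 3 reads respectively.
observed = [(1200, 1500)] * 4 + [(1200, 1800)] * 2 + [(1650, 1800)] * 3
kept = valid_junctions(observed)
```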
S802. Predict open reading frames in the second full-length sequence in all three reading frames to obtain multiple open reading frame sequences.
Non-"ATG" start codons are relatively common in IRES-mediated translation. When predicting open reading frame (ORF) sequences, every "NTG" is therefore considered as a start codon, where N is any of the bases A, C, T, or G.
The second full-length sequences constructed for all tumor-specific circular RNAs are scanned in three reading frames: each prediction starts at an "NTG" start codon, extends downstream, and stops when a stop codon is encountered. A stretch of bases that begins at a start codon and is not interrupted by a stop codon is determined to be an ORF sequence with protein-coding potential. Each predicted ORF sequence must be at least 60 bp long and must span the BSJ site. If a longer ORF sequence completely covers a shorter one, the shorter ORF sequence is no longer considered.
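A rough sketch of ORF prediction on a circular sequence is given below. It assumes the full-length sequence is written so that the BSJ lies between its last and first base, reports only ORFs terminated by a stop codon (endless rolling-circle ORFs are skipped), and interprets "completely covered" as sharing the same stop codon; these are illustrative choices rather than the method prescribed by the text.

```python
from typing import Dict, List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def predict_circular_orfs(seq: str, min_len: int = 60) -> List[Tuple[int, str]]:
    """Return (start, orf) pairs for NTG-initiated ORFs that span the BSJ."""
    n = len(seq)

    def codon(pos: int) -> str:
        # Read three bases with wrap-around, since the molecule is circular.
        return "".join(seq[(pos + k) % n] for k in range(3))

    best_by_stop: Dict[int, Tuple[int, str]] = {}
    for start in range(n):  # every start position covers all three frames
        if codon(start)[1:] != "TG":  # NTG start codon
            continue
        pos = start + 3
        for _ in range(n):  # at most n codons, i.e. three laps of the circle
            if codon(pos) in STOP_CODONS:
                break
            pos += 3
        else:
            continue  # no stop codon found: skipped in this sketch
        length = pos - start  # stop codon not counted
        if length < min_len or pos <= n:  # pos <= n means the BSJ is not crossed
            continue
        orf = "".join(seq[i % n] for i in range(start, pos))
        stop_at = pos % n
        # ORFs sharing a stop codon are nested; keep only the longest one.
        if stop_at not in best_by_stop or len(orf) > len(best_by_stop[stop_at][1]):
            best_by_stop[stop_at] = (start, orf)
    return sorted(best_by_stop.values())


toy_circ = "AAATGACCCCCCATGGCC"  # ATG at index 12 runs across the BSJ to TGA at index 3
toy_orfs = predict_circular_orfs(toy_circ, min_len=9)
```

The default `min_len=60` matches the length requirement in the text; the toy call lowers it only so that a short example sequence yields a result.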
S803. Translate the open reading frame sequences that reach the third specified length and span the BSJ site into amino acid sequences according to the codon table.
After the multiple ORF sequences have been predicted, those that meet the criteria (i.e., at least 60 bp long and spanning the BSJ site) are translated into amino acid sequences according to the codon table, and these are regarded as the full-length protein sequences derived from the tumor-specific circular RNAs.
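Translation by codon table can be sketched without external libraries. The table below is the standard genetic code in the usual TCAG ordering; translating a non-ATG start codon literally (for example CTG as leucine) rather than as methionine is an assumption of this sketch, since the text does not say how the first residue is handled.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf: str) -> str:
    """Translate an ORF (DNA letters, length a multiple of 3) up to the first stop."""
    protein = []
    for i in range(0, len(orf) - len(orf) % 3, 3):
        aa = CODON_TABLE[orf[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


toy_protein = translate("ATGGCCAAA")
```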
S804. Cut the amino acid sequences into multiple peptides and filter out peptides contained in the normal human proteome, to obtain multiple neoantigen candidate peptides.
Because the sequence of a circular RNA molecule is highly identical to that of the transcript from which it originates, the resulting protein sequences are also largely similar. After cutting into peptides, the peptides therefore need to be filtered further to remove those present in the normal human proteome, yielding the tumor-specific circular RNA neoantigen candidate peptides.
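Cutting and filtering can be sketched as a sliding window followed by a substring check. The default length range of 8 to 12 residues, the representation of the normal proteome as a list of protein strings, and the toy sequences are assumptions; a real proteome would call for an indexed lookup rather than a linear scan.

```python
from typing import Iterable, List, Set


def slide_peptides(protein: str, lengths: Iterable[int]) -> Set[str]:
    """All distinct sub-peptides of the given lengths, taken with a step of one residue."""
    return {
        protein[i:i + k]
        for k in lengths
        for i in range(len(protein) - k + 1)
    }


def candidate_peptides(protein: str,
                       normal_proteome: List[str],
                       lengths: Iterable[int] = range(8, 13)) -> List[str]:
    peptides = slide_peptides(protein, lengths)
    # Drop any peptide that occurs verbatim in a normal human protein.
    return sorted(p for p in peptides
                  if not any(p in ref for ref in normal_proteome))


toy_candidates = candidate_peptides("MAKLVTRQESW", ["XXMAKLVTRQXX"])
```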
S9. Score and rank the immunogenicity of the multiple neoantigen candidate peptides according to the translation ability scores of the tumor-specific circular RNAs.
Specifically, the translation ability score is used as one scoring indicator of a neoantigen immunogenicity scoring model; the model is then used to score each neoantigen candidate peptide, and the peptides are ranked.
Preferably, after the multiple tumor-specific circular RNAs have been obtained in step S6 by filtering out the first circular RNAs identical to a second circular RNA, the number of first candidate reads aligned to each tumor-specific circular RNA may also be counted, and the abundance of each tumor-specific circular RNA calculated from that number, expressed in reads per million (RPM). This abundance can likewise serve as a scoring indicator of the neoantigen immunogenicity scoring model. Step S9 is then specifically: scoring and ranking the immunogenicity of the multiple neoantigen candidate peptides according to the translation ability score and the abundance of the tumor-specific circular RNAs.
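The RPM abundance can be written as a one-line calculation. The text does not state which read total serves as the denominator; the sketch assumes it is the total number of reads in the tumor library.

```python
def reads_per_million(candidate_reads: int, total_reads: int) -> float:
    """Abundance in RPM: first candidate reads per million reads in the library.

    `total_reads` is assumed to be the library-wide read count.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return candidate_reads * 1_000_000 / total_reads


abundance = reads_per_million(25, 50_000_000)  # 25 BSJ reads in a 50 M read library
```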
The process by which a neoantigen exerts its immunogenicity is fairly complex. It includes cleavage of the protein sequence by the proteasome to produce peptides, processing of the peptides in the endoplasmic reticulum and their presentation on the cell surface by MHC molecules, and binding of the pMHC (peptide-MHC) complex to the T cell receptor (TCR) to initiate the immune response, among other steps.
Embodiments of the present invention further take into account that the peptides recognized and presented by human leukocyte antigen (HLA) molecules are usually short: class I HLA mainly recognizes peptides 8-12 aa long, whereas class II HLA recognizes slightly longer peptides, mainly 15 aa.
Accordingly, in step S804 the full-length protein sequence obtained in step S803 may be cut with a sliding window into peptides in the length ranges mainly recognized by HLA, as described above. Neoantigen candidate peptides whose length falls within a first length range are determined as class I neoantigen candidate peptides, and neoantigen candidate peptides whose length falls within a second length range are determined as class II neoantigen candidate peptides, the second length range being greater than the first length range.
Peptides whose length falls within the first length range (such as 8-12 aa) are regarded as class I neoantigen candidate peptides, which bind HLA class I molecules. Peptides whose length falls within the second length range, which includes 15 aa (such as 14-16 aa), are regarded as class II neoantigen candidate peptides, which bind HLA class II molecules; the preferred length of a class II neoantigen candidate peptide is 15 aa.
The binding affinity between each class I neoantigen candidate peptide and the HLA class I molecules of the patient from whom the tumor tissue sample was taken, and between each class II neoantigen candidate peptide and that patient's HLA class II molecules, can then be predicted. This binding affinity can also serve as a scoring indicator of the neoantigen immunogenicity scoring model. Step S9 is then specifically: scoring and ranking the immunogenicity of the multiple neoantigen candidate peptides according to the translation ability score and abundance of the tumor-specific circular RNAs and the binding affinity of each neoantigen candidate peptide to the corresponding HLA molecule.
In addition, the physicochemical properties of the neoantigen candidate peptide itself may be considered, such as the likelihood that the peptide is produced by proteasomal cleavage and the hydrophobicity score of the peptide residues that contact the T cell.
Ultimately, the neoantigen immunogenicity scoring model preferably takes the whole of the above process into account. The scoring indicators used for a neoantigen candidate peptide (hereinafter "peptide") include at least a combination of several of the following: the translation ability score of the tumor-specific circular RNA from which the peptide is derived, the abundance of that tumor-specific circular RNA, the binding affinity of the peptide to the corresponding HLA molecule, the likelihood that the peptide is produced by proteasomal cleavage, and the hydrophobicity score of the peptide residues that contact the T cell.
Preferably, the neoantigen immunogenicity scoring model may be a linear model in which each scoring indicator is normalized, assigned its own weight, and summed to give the neoantigen immunogenicity score. Specifically, for each neoantigen candidate peptide, a combination of several of the indicators listed above (translation ability score, abundance, HLA binding affinity, likelihood of proteasomal cleavage, and hydrophobicity score of the T cell-contacting residues) is weighted by the corresponding weight coefficients to obtain the peptide's neoantigen immunogenicity score, and the peptides are then ranked from the highest score to the lowest.
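A weighted-sum model of this kind, followed by the top-N selection, might look like the sketch below. The indicator names, the weights, the peptide labels, and the convention that every indicator is already normalized to [0, 1] with higher values being more favorable are all illustrative assumptions; the text gives no concrete weights.

```python
from typing import Dict, List

# Hypothetical weights for the five normalized indicators.
WEIGHTS: Dict[str, float] = {
    "translation": 0.3, "abundance": 0.2, "affinity": 0.3,
    "cleavage": 0.1, "hydrophobicity": 0.1,
}


def immunogenicity_score(features: Dict[str, float],
                         weights: Dict[str, float]) -> float:
    """Linear model: weighted sum of normalized scoring indicators."""
    return sum(weights[name] * features[name] for name in weights)


def rank_peptides(candidates: Dict[str, Dict[str, float]],
                  weights: Dict[str, float], top_n: int) -> List[str]:
    """Sort peptides by score, highest first, and keep the top_n."""
    ordered = sorted(candidates,
                     key=lambda pep: immunogenicity_score(candidates[pep], weights),
                     reverse=True)
    return ordered[:top_n]


candidates = {
    "PEPTIDEA": {"translation": 1.0, "abundance": 0.5, "affinity": 0.8,
                 "cleavage": 0.6, "hydrophobicity": 0.4},
    "PEPTIDEB": {"translation": 0.2, "abundance": 0.9, "affinity": 0.9,
                 "cleavage": 0.5, "hydrophobicity": 0.5},
    "PEPTIDEC": {"translation": 0.6, "abundance": 0.1, "affinity": 0.3,
                 "cleavage": 0.9, "hydrophobicity": 0.9},
}
top_two = rank_peptides(candidates, WEIGHTS, top_n=2)
```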
S10. Determine a specified number of the top-ranked neoantigen candidate peptides as neoantigen target peptides.
Finally, the top-ranked neoantigen candidate peptides can be determined as circular RNA-derived neoantigen target peptides, whose immunogenicity can be further verified experimentally for use in clinical immunotherapy of tumor patients.
It can be seen that, in embodiments of the present invention, circular RNA pseudo-reference sequences for re-alignment are constructed for the detected candidate circular RNAs; the first sequencing data of the tumor tissue sample and the second sequencing data of the paracancerous tissue sample are each aligned to the pseudo-reference sequences; and the first candidate reads and second candidate reads that align to the pseudo-reference sequences are extracted to determine the first circular RNAs detected in the tumor tissue sample and the second circular RNAs detected in the paracancerous tissue sample. The two sets are combined to filter out normal circular RNAs present in both the tumor tissue sample and the paracancerous tissue sample, yielding the tumor-specific circular RNAs. The number of first candidate reads aligned to the pseudo-reference sequence of each tumor-specific circular RNA is then verified and used to calculate the abundance of the tumor-specific circular RNA from which each neoantigen candidate peptide is derived. For tumor-specific circular RNAs with reads spanning the BSJ site, the translation ability score is set to the maximum score; for those without such reads, the translation ability score is predicted from endogenous IRES elements. Scoring indicators such as the abundance and translation ability score of the source tumor-specific circular RNA and the binding affinity of the peptide to the corresponding HLA molecule are combined to score and rank the immunogenicity of the derived neoantigen candidate peptides, and the top-ranked peptides are finally determined as neoantigen target peptides. In this way, the sources of tumor neoantigens are increased and the scope of neoantigen screening is broadened, while the accuracy of circular RNA identification is further improved, so that the circular RNA-derived neoantigens selected are more immunogenic.
Compared with approaches that focus mainly on neoantigens derived from somatic single-nucleotide variants (SNVs) and insertion-deletion variants (INDELs), embodiments of the present invention provide a computer-implemented method, based on next-generation sequencing data, that explores proteins translated from circular RNAs as a potential source of neoantigens for tumor-specific immunotherapy. This expands the scope of neoantigen screening and is especially beneficial for tumor types with a low mutation burden. The method also takes the translation potential of tumor-specific circular RNAs into account and, by integrating the results of two state-of-the-art circular RNA detection algorithms (CIRCexplorer2 and CIRI2), constructing circular RNA pseudo-reference sequences for re-alignment, and verifying the number of reads aligned to the pseudo-reference sequence of each candidate circular RNA, achieves more accurate circular RNA identification and can therefore identify circular RNA-based immunotherapy neoantigens more accurately.
As shown in Figure 4, an embodiment of the present invention discloses a neoantigen identification device for tumor-specific circular RNAs, which may be embedded in a computer. The device comprises a data acquisition unit 401, a detection unit 402, a pseudo-reference unit 403, an alignment unit 404, a first determination unit 405, a filtering unit 406, a translation prediction unit 407, a peptide acquisition unit 408, a scoring unit 409, and a second determination unit 410, wherein:
the data acquisition unit 401 is configured to acquire first sequencing data of a tumor tissue sample and second sequencing data of a paracancerous tissue sample;
the detection unit 402 is configured to perform circular RNA detection on the first sequencing data and the second sequencing data respectively, to obtain multiple candidate circular RNAs;
the pseudo-reference unit 403 is configured to construct, for each candidate circular RNA according to its sequence order, a reversed pseudo-reference sequence of the first specified length on each side, upstream and downstream, of the BSJ site;
the alignment unit 404 is configured to determine, as first candidate reads, reads in the first sequencing data that contain a BSJ site and whose sequences of the second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence; and to determine, as second candidate reads, reads in the second sequencing data that contain a BSJ site and whose sequences of the second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence; wherein the second specified length is less than or equal to the first specified length;
the first determination unit 405 is configured to determine candidate circular RNAs supported by first candidate reads as first circular RNAs detected in the tumor tissue sample; and to determine candidate circular RNAs supported by second candidate reads as second circular RNAs detected in the paracancerous tissue sample;
the filtering unit 406 is configured to filter out, from the multiple first circular RNAs, those identical to a second circular RNA, to obtain multiple tumor-specific circular RNAs;
the translation prediction unit 407 is configured to predict a translation ability score for each tumor-specific circular RNA;
the peptide acquisition unit 408 is configured to obtain multiple neoantigen candidate peptides derived from the multiple tumor-specific circular RNAs;
the scoring unit 409 is configured to score and rank the immunogenicity of the multiple neoantigen candidate peptides according to the translation ability scores of the tumor-specific circular RNAs;
the second determination unit 410 is configured to determine a specified number of the top-ranked neoantigen candidate peptides as neoantigen target peptides.
As shown in Figure 5, an embodiment of the present invention discloses an electronic apparatus comprising a memory 501 storing executable program code and a processor 502 coupled to the memory 501;
wherein the processor 502 invokes the executable program code stored in the memory 501 to execute the neoantigen identification method for tumor-specific circular RNAs described in the above embodiments.
An embodiment of the present invention also discloses a computer-readable storage medium storing a computer program, wherein the computer program causes a computer to execute the neoantigen identification method for tumor-specific circular RNAs described in the above embodiments.
The purpose of the above embodiments is to reproduce and derive the technical solutions of the present invention by way of example, and thereby to describe fully its technical solutions, objectives, and effects, so that the public may understand the disclosure more thoroughly and comprehensively; they do not limit the scope of protection of the present invention.
Nor are the above embodiments an exhaustive enumeration of the present invention; many other embodiments not listed here may exist. Any substitution or improvement made without departing from the concept of the present invention falls within its scope of protection.

Claims (10)

  1. A neoantigen identification method for tumor-specific circular RNAs, characterized by comprising:
    acquiring first sequencing data of a tumor tissue sample and second sequencing data of a tumor-adjacent tissue sample;
    performing circular RNA detection on the first sequencing data and on the second sequencing data, respectively, to obtain a plurality of candidate circular RNAs;
    for each candidate circular RNA, constructing, following its sequence order, a pseudo-reference sequence comprising a first specified length upstream and a first specified length downstream of the back-splice junction (BSJ) site;
    determining, as first candidate reads, the reads in the first sequencing data that contain the BSJ site and whose sequences of a second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence; and determining, as second candidate reads, the reads in the second sequencing data that contain the BSJ site and whose sequences of the second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence; wherein the second specified length is less than or equal to the first specified length;
    determining each candidate circular RNA supported by the first candidate reads to be a first circular RNA detected in the tumor tissue sample; and determining each candidate circular RNA supported by the second candidate reads to be a second circular RNA detected in the tumor-adjacent tissue sample;
    filtering out, from the plurality of first circular RNAs, those first circular RNAs that are identical to a second circular RNA, to obtain a plurality of tumor-specific circular RNAs;
    predicting a translation ability score for each tumor-specific circular RNA;
    obtaining a plurality of neoantigen candidate peptides derived from the plurality of tumor-specific circular RNAs;
    scoring the immunogenicity of each neoantigen candidate peptide according to the translation ability score of the tumor-specific circular RNA, and ranking the neoantigen candidate peptides by that score;
    determining a specified number of top-ranked neoantigen candidate peptides to be the neoantigen target peptides.
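The pseudo-reference and read-support steps of claim 1 can be illustrated with a minimal Python sketch. It assumes exact string matching in place of a real aligner and assumes reads are already oriented like the circular RNA sequence; all function names and the toy sequences are hypothetical and are not taken from the patent.

```python
from typing import Dict, Iterable, Set


def build_pseudo_reference(circ_seq: str, flank: int) -> str:
    """Join the last `flank` bases of the circRNA to its first `flank` bases.

    In a linear representation of the circle, the back-splice junction (BSJ)
    lies between the last and the first base, so the junction sits at
    index `flank` of the returned string.
    """
    if flank <= 0 or flank > len(circ_seq):
        raise ValueError("flank must be between 1 and len(circ_seq)")
    return circ_seq[-flank:] + circ_seq[:flank]


def read_supports_bsj(read: str, pseudo_ref: str, flank: int, min_overhang: int) -> bool:
    """True if the read matches the pseudo-reference exactly and covers at
    least `min_overhang` bases on each side of the junction."""
    if min_overhang > flank:
        raise ValueError("min_overhang must not exceed flank")
    start = pseudo_ref.find(read)
    while start != -1:
        end = start + len(read)
        if start <= flank - min_overhang and end >= flank + min_overhang:
            return True
        start = pseudo_ref.find(read, start + 1)
    return False


def detected_circrnas(reads: Iterable[str], pseudo_refs: Dict[str, str],
                      flank: int, min_overhang: int) -> Set[str]:
    """IDs of candidate circRNAs supported by at least one BSJ-spanning read."""
    reads = list(reads)
    return {
        circ_id
        for circ_id, ref in pseudo_refs.items()
        if any(read_supports_bsj(r, ref, flank, min_overhang) for r in reads)
    }


def tumor_specific(tumor_reads: Iterable[str], adjacent_reads: Iterable[str],
                   candidates: Dict[str, str], flank: int, min_overhang: int) -> Set[str]:
    """circRNAs detected in the tumor sample but not in the adjacent sample."""
    refs = {cid: build_pseudo_reference(seq, flank) for cid, seq in candidates.items()}
    first = detected_circrnas(tumor_reads, refs, flank, min_overhang)
    second = detected_circrnas(adjacent_reads, refs, flank, min_overhang)
    return first - second
```

Here `flank` plays the role of the first specified length and `min_overhang` the role of the second specified length, which is why the sketch rejects `min_overhang > flank`.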
  2. The neoantigen identification method for tumor-specific circular RNAs according to claim 1, characterized in that predicting the translation ability score of each tumor-specific circular RNA comprises:
    acquiring third sequencing data of the tumor tissue sample obtained by polysome (polyribosome) profiling;
    determining, as third candidate reads spanning the BSJ site, the reads in the third sequencing data that contain a BSJ site and whose sequences of the second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence of a tumor-specific circular RNA;
    determining in turn, for each tumor-specific circular RNA, whether any third candidate read aligns to the position of its BSJ site;
    if so, setting the translation ability score of that tumor-specific circular RNA to the maximum score;
    if not, constructing a first full-length sequence of that tumor-specific circular RNA, identifying a plurality of IRES fragments in the first full-length sequence, calculating a raw score for each IRES fragment, and normalizing the largest raw score among the IRES fragments to obtain the translation ability score of the tumor-specific circular RNA.
  3. The neoantigen identification method for tumor-specific circular RNAs according to claim 2, characterized in that identifying a plurality of IRES fragments in the first full-length sequence comprises:
    mapping a plurality of specified hexamer nucleic acid sequences onto the first full-length sequence, determining the target hexamer nucleic acid sequences in the first full-length sequence that coincide with a specified hexamer nucleic acid sequence, and merging target hexamer nucleic acid sequences at adjacent positions to obtain a plurality of IRES fragments.
  4. The neoantigen identification method for tumor-specific circular RNAs according to claim 3, characterized in that calculating the raw score of each IRES fragment comprises:
    dividing the sum of the Z-scores of all target hexamer nucleic acid sequences in each IRES fragment by the sequence length of that IRES fragment to obtain the raw score of that IRES fragment.
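The hexamer mapping of claim 3 and the raw score of claim 4 can be sketched as follows. This is an illustrative sketch under stated assumptions: "adjacent" is interpreted as overlapping or abutting hits, the sequence is treated as linear (hits are not wrapped around the BSJ), and the Z-score table is supplied by the caller. The function names and example values are hypothetical.

```python
from typing import Dict, List, Tuple

K = 6  # hexamer length


def find_hexamer_hits(seq: str, z_scores: Dict[str, float]) -> List[Tuple[int, float]]:
    """Return (start, z) for every position whose hexamer is in the table."""
    hits = []
    for i in range(len(seq) - K + 1):
        z = z_scores.get(seq[i:i + K])
        if z is not None:
            hits.append((i, z))
    return hits


def merge_into_fragments(hits: List[Tuple[int, float]]) -> List[Tuple[int, int, float]]:
    """Merge overlapping or abutting hits into (start, end, z_sum) fragments.

    `end` is exclusive. Hits are expected in ascending start order, which is
    how find_hexamer_hits returns them.
    """
    fragments: List[Tuple[int, int, float]] = []
    for start, z in hits:
        if fragments and start <= fragments[-1][1]:
            s, e, total = fragments[-1]
            fragments[-1] = (s, max(e, start + K), total + z)
        else:
            fragments.append((start, start + K, z))
    return fragments


def raw_scores(fragments: List[Tuple[int, int, float]]) -> List[Tuple[int, int, float]]:
    """Raw score of a fragment = sum of hexamer Z-scores / fragment length."""
    return [(s, e, total / (e - s)) for s, e, total in fragments]
```

The claims do not specify how the largest raw score is normalized into the translation ability score, so that step is left out of the sketch.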
  5. The neoantigen identification method for tumor-specific circular RNAs according to any one of claims 2 to 4, characterized in that obtaining a plurality of neoantigen candidate peptides derived from the plurality of tumor-specific circular RNAs comprises:
    constructing second full-length sequences of the plurality of tumor-specific circular RNAs, and predicting open reading frames in the second full-length sequences in each of the three reading frames to obtain a plurality of open reading frame sequences;
    translating, according to the codon table, the open reading frame sequences that reach a third specified length and span the BSJ site into amino acid sequences;
    splitting the amino acid sequences into a plurality of peptides and filtering out the peptides contained in the normal human proteome, to obtain a plurality of neoantigen candidate peptides.
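The open-reading-frame and peptide steps of claim 5 can be sketched in Python. The sketch makes several assumptions that are not stated in the patent: the circular sequence is doubled so that a reading frame can cross the BSJ, an ORF starts at ATG and ends at the first in-frame stop codon, ORFs with no stop codon within two passes are skipped, and the sequence is written in the DNA alphabet. The function names are hypothetical; the codon table is the standard genetic code.

```python
from itertools import product
from typing import Iterable, List, Set

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def bsj_spanning_orfs(circ_seq: str, min_nt: int) -> List[str]:
    """ORFs (without the stop codon) that cross the junction between the
    last and first base of the circle and are at least `min_nt` long."""
    n = len(circ_seq)
    doubled = circ_seq + circ_seq
    orfs = []
    # Trying every start position in the first copy covers all three frames.
    for start in range(n):
        if doubled[start:start + 3] != "ATG":
            continue
        for pos in range(start, 2 * n - 2, 3):
            if CODON_TABLE.get(doubled[pos:pos + 3]) == "*":
                # Coding region is doubled[start:pos]; it spans the BSJ
                # only if it extends past index n - 1.
                if pos > n and pos - start >= min_nt:
                    orfs.append(doubled[start:pos])
                break
    return orfs


def translate(orf: str) -> str:
    """Translate codon by codon; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(orf[i:i + 3], "X") for i in range(0, len(orf) - 2, 3))


def candidate_peptides(protein: str, lengths: Iterable[int],
                       normal_proteins: Iterable[str]) -> Set[str]:
    """All k-mers of the given lengths that occur in no normal protein."""
    normal = list(normal_proteins)
    peptides = set()
    for k in lengths:
        for i in range(len(protein) - k + 1):
            pep = protein[i:i + k]
            if not any(pep in ref for ref in normal):
                peptides.add(pep)
    return peptides
```

A production pipeline would likely replace the substring scan in `candidate_peptides` with an indexed lookup against the reference proteome; the linear scan is kept here only for clarity.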
  6. The neoantigen identification method for tumor-specific circular RNAs according to any one of claims 2 to 4, characterized in that, after filtering out, from the plurality of first circular RNAs, those identical to a second circular RNA to obtain the plurality of tumor-specific circular RNAs, the method further comprises:
    counting the number of first candidate reads aligned to each tumor-specific circular RNA;
    calculating the abundance of each tumor-specific circular RNA from the number of first candidate reads;
    and wherein scoring the immunogenicity of each neoantigen candidate peptide according to the translation ability score of the tumor-specific circular RNA and ranking the neoantigen candidate peptides comprises:
    scoring the immunogenicity of each neoantigen candidate peptide according to the translation ability score and the abundance of the tumor-specific circular RNA, and ranking the neoantigen candidate peptides by that score.
  7. The neoantigen identification method for tumor-specific circular RNAs according to claim 6, characterized in that, after obtaining the plurality of neoantigen candidate peptides derived from the plurality of tumor-specific circular RNAs, the method further comprises:
    determining the neoantigen candidate peptides whose sequence length falls within a first length range to be class I neoantigen candidate peptides; and determining the neoantigen candidate peptides whose sequence length falls within a second length range to be class II neoantigen candidate peptides; wherein the second length range is greater than the first length range;
    predicting the binding affinity between each class I neoantigen candidate peptide and the HLA class I molecules of the patient from whom the tumor tissue sample was taken; and predicting the binding affinity between each class II neoantigen candidate peptide and the HLA class II molecules of that patient;
    and wherein scoring the immunogenicity of each neoantigen candidate peptide according to the translation ability score and the abundance of the tumor-specific circular RNA and ranking the neoantigen candidate peptides comprises:
    scoring the immunogenicity of each neoantigen candidate peptide according to the translation ability score and the abundance of the tumor-specific circular RNA and the binding affinity of that neoantigen candidate peptide to the corresponding HLA molecule, and ranking the neoantigen candidate peptides by that score.
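The class assignment and ranking in claim 7 can be illustrated with a short Python sketch. The claims do not give the length ranges or the scoring formula, so everything numeric here is an assumption made only for illustration: the 8 to 11 and 13 to 25 residue ranges, the use of a binding score in [0, 1] where higher means stronger predicted binding, and the simple product used to combine the three factors. The names `Candidate`, `hla_class`, `immunogenicity` and `top_targets` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

CLASS_I_LENGTHS = range(8, 12)    # assumed first length range (8-11 residues)
CLASS_II_LENGTHS = range(13, 26)  # assumed second length range (13-25 residues)


@dataclass
class Candidate:
    peptide: str
    translation_score: float  # translation ability score of the source circRNA, assumed in [0, 1]
    abundance: float          # abundance of the source circRNA (any non-negative unit)
    binding_score: float      # assumed in [0, 1]; higher means stronger predicted HLA binding


def hla_class(peptide: str) -> Optional[str]:
    """Assign a peptide to class I or class II by length, or None if neither."""
    if len(peptide) in CLASS_I_LENGTHS:
        return "I"
    if len(peptide) in CLASS_II_LENGTHS:
        return "II"
    return None


def immunogenicity(c: Candidate, max_abundance: float) -> float:
    """Hypothetical combination: product of the three factors, with the
    abundance scaled to [0, 1] by the largest abundance in the candidate set."""
    rel_abundance = c.abundance / max_abundance if max_abundance > 0 else 0.0
    return c.translation_score * rel_abundance * c.binding_score


def top_targets(candidates: List[Candidate], n: int) -> List[Candidate]:
    """Rank class I and class II candidates by score and keep the top n."""
    usable = [c for c in candidates if hla_class(c.peptide) is not None]
    max_ab = max((c.abundance for c in usable), default=0.0)
    ranked = sorted(usable, key=lambda c: immunogenicity(c, max_ab), reverse=True)
    return ranked[:n]
```

Any monotone combination of the three factors would fit the wording of the claim equally well; the product is chosen here only because a zero in any factor then rules the peptide out.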
  8. A neoantigen identification device for tumor-specific circular RNAs, characterized by comprising:
    a data acquisition unit, configured to acquire first sequencing data of a tumor tissue sample and second sequencing data of a tumor-adjacent tissue sample;
    a detection unit, configured to perform circular RNA detection on the first sequencing data and on the second sequencing data, respectively, to obtain a plurality of candidate circular RNAs;
    a pseudo-reference unit, configured to construct, for each candidate circular RNA and following its sequence order, a pseudo-reference sequence comprising a first specified length upstream and a first specified length downstream of the back-splice junction (BSJ) site;
    an alignment unit, configured to determine, as first candidate reads, the reads in the first sequencing data that contain the BSJ site and whose sequences of a second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence; and to determine, as second candidate reads, the reads in the second sequencing data that contain the BSJ site and whose sequences of the second specified length upstream and downstream of the BSJ site match the pseudo-reference sequence; wherein the second specified length is less than or equal to the first specified length;
    a first determination unit, configured to determine each candidate circular RNA supported by the first candidate reads to be a first circular RNA detected in the tumor tissue sample; and to determine each candidate circular RNA supported by the second candidate reads to be a second circular RNA detected in the tumor-adjacent tissue sample;
    a filtering unit, configured to filter out, from the plurality of first circular RNAs, those first circular RNAs that are identical to a second circular RNA, to obtain a plurality of tumor-specific circular RNAs;
    a translation prediction unit, configured to predict a translation ability score for each tumor-specific circular RNA;
    a peptide acquisition unit, configured to obtain a plurality of neoantigen candidate peptides derived from the plurality of tumor-specific circular RNAs;
    a scoring unit, configured to score the immunogenicity of each neoantigen candidate peptide according to the translation ability score of the tumor-specific circular RNA and to rank the neoantigen candidate peptides by that score;
    a second determination unit, configured to determine a specified number of top-ranked neoantigen candidate peptides to be the neoantigen target peptides.
  9. An electronic device, characterized by comprising a memory storing executable program code and a processor coupled to the memory; wherein the processor invokes the executable program code stored in the memory to perform the neoantigen identification method for tumor-specific circular RNAs according to any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, wherein the computer program causes a computer to perform the neoantigen identification method for tumor-specific circular RNAs according to any one of claims 1 to 7.
PCT/CN2023/077356 2022-09-06 2023-02-21 Neoantigen identification method and device for tumor-specific circular rnas, apparatus and medium WO2024051097A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211086237.7 2022-09-06
CN202211086237.7A CN115240773B (en) 2022-09-06 2022-09-06 New antigen identification method and device, equipment and medium of tumor specific circular RNA

Publications (1)

Publication Number Publication Date
WO2024051097A1 (en) 2024-03-14

Family

ID=83680826

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/077356 WO2024051097A1 (en) 2022-09-06 2023-02-21 Neoantigen identification method and device for tumor-specific circular rnas, apparatus and medium

Country Status (2)

Country Link
CN (1) CN115240773B (en)
WO (1) WO2024051097A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115240773B (en) * 2022-09-06 2023-07-28 深圳新合睿恩生物医疗科技有限公司 New antigen identification method and device, equipment and medium of tumor specific circular RNA

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109801678A (en) * 2019-01-25 2019-05-24 上海鲸舟基因科技有限公司 Based on the tumour antigen prediction technique of full transcript profile and its application
US20190279742A1 (en) * 2017-10-10 2019-09-12 Gritstone Oncology, Inc. Neoantigen identification using hotspots
EP3618071A1 (en) * 2018-08-28 2020-03-04 CeCaVa GmbH & Co. KG Methods for selecting tumor-specific neoantigens
CN111192632A (en) * 2019-12-16 2020-05-22 深圳市新合生物医疗科技有限公司 Method and device for extracting gene fusion immunotherapy novel antigen by integrating deep sequencing data of DNA and RNA
CN111584006A (en) * 2020-05-06 2020-08-25 西安交通大学 Circular RNA identification method based on machine learning strategy
CN111627497A (en) * 2020-05-19 2020-09-04 深圳市新合生物医疗科技有限公司 Method for extracting immunotherapy new antigen based on tumor specific transcription region assembled by new transcript and application
CN113035272A (en) * 2021-03-08 2021-06-25 深圳市新合生物医疗科技有限公司 Method and apparatus for obtaining new antigens for immunotherapy based on endosomal cell variation
US20220112556A1 (en) * 2020-10-14 2022-04-14 Shenzhen Neocura Biotechnology Corporation Method and system for calculating tumor neoantigen burden
CN114882951A (en) * 2022-05-27 2022-08-09 深圳裕泰抗原科技有限公司 Method and device for detecting MHC II tumor neoantigen based on next generation sequencing data
CN115240773A (en) * 2022-09-06 2022-10-25 深圳新合睿恩生物医疗科技有限公司 Method, device, equipment and medium for identifying novel antigen of tumor specific circular RNA

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109022580B (en) * 2018-07-31 2021-12-24 华南农业大学 Canine circular RNA gene as diagnosis marker of canine breast tumor
BR112021023411A2 (en) * 2019-05-22 2022-02-01 Massachusetts Inst Technology Compositions and methods of circular rna

Also Published As

Publication number Publication date
CN115240773B (en) 2023-07-28
CN115240773A (en) 2022-10-25

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 23861805; Country of ref document: EP; Kind code of ref document: A1)