CN117174166B - Tumor neoantigen prediction method and system based on third-generation sequencing data - Google Patents

Tumor neoantigen prediction method and system based on third-generation sequencing data

Info

Publication number
CN117174166B
Authority
CN
China
Prior art keywords
information
sample
snp
tumor
transcript
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311401140.5A
Other languages
Chinese (zh)
Other versions
CN117174166A (en)
Inventor
张函槊
闫柏先
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Genex Health Co Ltd
Original Assignee
Genex Health Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Genex Health Co Ltd
Priority to CN202311401140.5A
Publication of CN117174166A
Application granted
Publication of CN117174166B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Peptides Or Proteins (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention relates to the technical field of bioinformatics and discloses a tumor neoantigen prediction method and system based on third-generation sequencing data. The method comprises the following steps: receiving third-generation whole-exome sequencing data of a sample; receiving HLA typing information of the sample; mapping and annotating the third-generation whole-exome sequencing data; carrying out structural variation analysis on the sample according to the annotation and transcript information; carrying out SNP analysis on the sample according to the annotation, the transcript information and the structural variation information; obtaining specific polypeptide information according to the SNP information of a tumor sample and the SNP information of a control sample; and performing NetMHCpan analysis on the tumor sample according to the specific polypeptide information and the HLA typing information of the tumor sample, so as to predict tumor neoantigens. The invention makes full use of the advantages of third-generation sequencing technology, predicts tumor neoantigens from the specific polypeptides of the tumor sample with high precision and good accuracy, and can bring great economic value to society.

Description

Tumor neoantigen prediction method and system based on third-generation sequencing data
Technical Field
The invention relates to the technical field of bioinformatics, in particular to a tumor neoantigen prediction method and system based on third-generation sequencing data.
Background
Third-generation sequencing technology, also known as de novo sequencing technology, refers to single-molecule sequencing technology. In DNA sequencing, third-generation sequencing does not require PCR amplification, so each DNA molecule is sequenced individually.
Third-generation sequencing technologies fall into two main classes according to their principle. The first is single-molecule fluorescence sequencing: deoxynucleotides are fluorescently labeled, and changes in fluorescence intensity are observed in real time through a microscope. When a fluorescently labeled deoxynucleotide is incorporated into the DNA strand, its fluorescence can be detected on the strand; once it forms a chemical bond with the strand, the DNA polymerase cleaves off the fluorophore and the fluorescence disappears. The labeled deoxynucleotides do not affect the activity of the DNA polymerase, and after the fluorophore is cleaved off, the synthesized DNA strand is identical to a natural DNA strand. The second is nanopore sequencing, which uses electrophoresis to drive individual molecules through a nanopore one by one. Because the diameter of the nanopore is very small, only a single nucleic acid polymer can pass through at a time, and because the bases A, T, C and G have different electrical properties, the type of base passing through can be detected from differences in the electrical signal.
Compared with first-generation and second-generation sequencing technologies, third-generation sequencing technology can read longer DNA fragments and offers high data quality, high accuracy and wide coverage. There is currently a need to apply third-generation sequencing technology to tumor neoantigen prediction, so as to improve the efficiency and quality of the prediction and provide a solid knowledge base for medical research.
Disclosure of Invention
The invention provides a tumor neoantigen prediction method and a tumor neoantigen prediction system based on third-generation sequencing data, which are used for improving the efficiency and quality of tumor neoantigen prediction.
The invention provides a tumor neoantigen prediction method based on third generation sequencing data, which comprises the following steps:
S1, carrying out S101-S105 on a tumor sample and on a control sample, respectively, to obtain SNP information of the tumor sample and SNP information of the control sample, wherein the tumor sample and the control sample are collectively referred to as the sample in S101-S105;
S101, receiving third-generation whole-exome sequencing data of the sample;
S102, receiving HLA typing information of the sample;
S103, mapping and annotating the third-generation whole-exome sequencing data to obtain annotation and transcript information of the sample;
S104, carrying out structural variation analysis on the sample according to the annotation and transcript information to obtain structural variation information of the sample;
S105, carrying out SNP analysis on the sample according to the annotation, the transcript information and the structural variation information to obtain SNP information of the sample;
S2, obtaining specific polypeptide information according to the SNP information of the tumor sample and the SNP information of the control sample;
S3, carrying out NetMHCpan analysis on the tumor sample according to the specific polypeptide information and the HLA typing information of the tumor sample, so as to predict tumor neoantigens.
According to the tumor neoantigen prediction method based on the third generation sequencing data provided by the invention, the step S105 comprises the following steps:
S10501, marking the annotation information of each transcript according to the annotation, the transcript information and the structural variation information;
S10502, mapping the transcripts with minimap2 to obtain mapping information;
S10503, comparing the annotation information with the mapping information, and retaining the mapping information that is consistent with the annotation information;
S10504, carrying out SNP analysis on the sample according to the mapping information that is consistent with the annotation information, to obtain SNP information of the transcripts in the sample.
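The consistency check in S10503 can be pictured with a minimal sketch. The text does not define what "consistent" means, so the sketch assumes it means that a transcript's alignment lands on the same chromosome and strand as its annotation; the record layout and the function name `keep_consistent` are hypothetical.

```python
def keep_consistent(annotations, mappings):
    """Keep only alignments that agree with the transcript's annotation.

    annotations: {transcript_id: (chrom, strand)}  (hypothetical layout)
    mappings:    list of dicts with keys "id", "chrom", "strand"
    Assumption: "consistent" = same chromosome and same strand.
    """
    kept = []
    for aln in mappings:
        expected = annotations.get(aln["id"])
        if expected == (aln["chrom"], aln["strand"]):
            kept.append(aln)
    return kept


annotations = {"tx1": ("chr1", "+"), "tx2": ("chr7", "-")}
mappings = [
    {"id": "tx1", "chrom": "chr1", "strand": "+"},  # agrees with annotation
    {"id": "tx2", "chrom": "chr9", "strand": "-"},  # wrong chromosome, dropped
    {"id": "tx3", "chrom": "chr2", "strand": "+"},  # no annotation, dropped
]
consistent = keep_consistent(annotations, mappings)
print([aln["id"] for aln in consistent])
```

Only the retained alignments would then be passed on to SNP analysis in S10504.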
According to the tumor neoantigen prediction method based on the third generation sequencing data provided by the invention, the step S2 comprises the following steps:
S201, comparing the SNP information of the tumor sample with the SNP information of the control sample, and retaining the Somatic SNP information (somatic mutation SNP information);
S202, filtering out the Somatic SNP information that appears only once in the tumor sample;
S203, comparing the structural variation information of the tumor sample with the structural variation information of the control sample, and retaining the Somatic structural variation information;
S204, merging the transcripts in the tumor sample that contain the filtered Somatic SNP information with the transcripts that contain the Somatic structural variation information to obtain a transcript set;
S205, obtaining transcript base sequence information for the transcript set according to reference information (such as annotation files, reference sequences and the like);
S206, correcting the transcript base sequence information according to the filtered Somatic SNP information and the filtered Somatic structural variation information;
S207, retaining the CDS region sequence information in the transcripts that can be translated into polypeptides, according to the reference information and the corrected transcript base sequence information;
S208, filtering out the CDS region sequence information that has no start codon at the reference start codon position;
S209, translating the filtered CDS region sequence information to obtain peptide chain information;
S210, cutting each peptide chain into a plurality of polypeptides according to the peptide chain information to obtain polypeptide information;
S211, filtering out the polypeptide information that can be obtained by normal translation of the reference sequence, to obtain the specific polypeptide information.
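As an illustration of S201 and S202, the following minimal sketch keeps SNPs that are present in the tumor sample but absent from the control sample, then drops those seen only once. The SNP key `(transcript_id, position, ref_base, alt_base)`, the function name `somatic_snps` and the `min_support` parameter are assumptions made for illustration; they are not taken from the source.

```python
from collections import Counter


def somatic_snps(tumor_calls, control_calls, min_support=2):
    """Return tumor-only SNPs observed at least `min_support` times.

    Each call is a hypothetical tuple:
    (transcript_id, position, ref_base, alt_base).
    """
    control = set(control_calls)
    support = Counter(tumor_calls)  # number of observations per SNP
    return {
        snp
        for snp, count in support.items()
        if snp not in control and count >= min_support
    }


tumor = [
    ("tx1", 120, "G", "A"), ("tx1", 120, "G", "A"),  # seen twice, tumor only
    ("tx2", 45, "C", "T"),                           # seen once, dropped (S202)
    ("tx3", 300, "T", "C"), ("tx3", 300, "T", "C"),  # also in control (S201)
]
control = [("tx3", 300, "T", "C")]

result = somatic_snps(tumor, control)
print(result)
```

The same set-difference idea would apply to the structural variation comparison in S203, with a structural variation record in place of the SNP tuple.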
According to the tumor neoantigen prediction method based on the third generation sequencing data, provided by the invention, the tumor neoantigen prediction method further comprises the following steps:
S4, filtering, backtracking and collating the information in the NetMHCpan analysis result to obtain antigen prediction information.
According to the tumor neoantigen prediction method based on the third generation sequencing data provided by the invention, the step S4 comprises the following steps:
S401, retaining, according to the NetMHCpan analysis result information, the peptide-related information whose EL rank is less than or equal to 2;
S402, backtracking the retained polypeptides according to the related information and the specific polypeptide information, and collating the position information and start-stop information of each polypeptide in its transcript, together with the structural variation information and SNP information contained in that transcript, to obtain antigen prediction information.
Note that the official definition of EL rank is: rank of the predicted affinity compared to a set of 400,000 random natural peptides. This measure is not affected by the inherent bias of certain molecules towards higher or lower mean predicted affinities. Strong binders are defined as having %Rank < 0.5, and weak binders as having %Rank < 2. It is recommended to select candidate binders based on %Rank rather than nM affinity.
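The thresholds above can be applied with a small helper. This is a sketch only: the function names and the row layout are hypothetical, and the treatment of the boundary values (exactly 0.5 and exactly 2) is an assumption, with values up to and including 2 retained as in S401.

```python
def classify_el_rank(el_rank):
    """Label a peptide by its EL rank (%Rank); None means "not retained".

    Assumption: boundaries are inclusive (<= 0.5 strong, <= 2 weak).
    """
    if el_rank <= 0.5:
        return "strong"
    if el_rank <= 2.0:
        return "weak"
    return None


def retain_binders(rows):
    """rows: list of (peptide, el_rank) pairs (hypothetical layout)."""
    kept = []
    for peptide, el_rank in rows:
        label = classify_el_rank(el_rank)
        if label is not None:
            kept.append((peptide, el_rank, label))
    return kept


rows = [("KLATGSHE", 0.12), ("AKLATGSH", 1.70), ("MAKLATGS", 8.40)]
binders = retain_binders(rows)
print(binders)
```

The peptide strings and rank values in the example are made up purely to exercise the thresholds.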
According to the tumor neoantigen prediction method based on the third generation sequencing data provided by the invention, step S103 utilizes software (such as TAGET software) special for the third generation sequencing data to map and annotate the sample.
According to the tumor neoantigen prediction method based on the third-generation sequencing data provided by the invention, step S104 is used for carrying out structural variation analysis on a sample by utilizing software (such as TAGET-sv software) special for the third-generation sequencing data.
The invention also provides a tumor neoantigen prediction system based on third-generation sequencing data, comprising:
a tumor-control sample processing module for: carrying out steps S101-S105 on the tumor sample and on the control sample, respectively, to obtain SNP information of the tumor sample and SNP information of the control sample, wherein the tumor sample and the control sample are collectively referred to as the sample in steps S101-S105:
S101, receiving third-generation whole-exome sequencing data of the sample;
S102, receiving HLA typing information of the sample;
S103, mapping and annotating the third-generation whole-exome sequencing data to obtain annotation and transcript information of the sample;
S104, carrying out structural variation analysis on the sample according to the annotation and transcript information to obtain structural variation information of the sample;
S105, carrying out SNP analysis on the sample according to the annotation, the transcript information and the structural variation information to obtain SNP information of the sample;
a specific polypeptide information acquisition module for: obtaining specific polypeptide information according to the SNP information of the tumor sample and the SNP information of the control sample;
a tumor neoantigen prediction module for: carrying out NetMHCpan analysis on the tumor sample according to the specific polypeptide information and the HLA typing information of the tumor sample, so as to predict tumor neoantigens.
The invention also provides electronic equipment, which comprises a processor and a memory storing a computer program, and is characterized in that the processor executes the computer program to realize the tumor neoantigen prediction method of any one of the above.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements any of the above-described tumor neoantigen prediction methods.
The present invention also provides a computer program product comprising a computer program storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing any one of the above-described tumor neoantigen prediction methods.
The tumor neoantigen prediction method and system based on third-generation sequencing data provided by the invention exploit the advantages of third-generation sequencing technology, namely long average read length, short sequencing time and no need for amplification. The sample is sequenced by third-generation whole-exome sequencing, which effectively ensures the accuracy and completeness of the data. The third-generation whole-exome sequencing data are mapped and annotated; structural variation analysis and SNP analysis are carried out on the sample according to the annotation and transcript information; the SNP information of the tumor sample is compared with the SNP information of the control sample to obtain specific polypeptide information; and NetMHCpan analysis is carried out on the tumor sample by combining the specific polypeptide information with the HLA typing information of the tumor sample. Tumor neoantigens can thus be predicted with high quality and high efficiency, which brings important innovation value to tumor neoantigen prediction and great economic benefit to medical research.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following brief description will be given of the drawings used in the embodiments or the description of the prior art, it being obvious that the drawings in the following description are some embodiments of the invention and that other drawings can be obtained from them without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a tumor neoantigen prediction method based on third-generation sequencing data.
Fig. 2 is a schematic structural diagram of a tumor neoantigen prediction system based on third-generation sequencing data.
Fig. 3 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions thereof will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments, which should not be construed as limiting the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. In the description of the present invention, it is to be understood that the terminology used is for the purpose of description only and is not to be interpreted as indicating or implying relative importance.
The tumor neoantigen prediction method and system based on the third generation sequencing data provided by the invention are described below with reference to fig. 1-3.
FIG. 1 is a flow chart of a tumor neoantigen prediction method based on third generation sequencing data. Referring to fig. 1, the tumor neoantigen prediction method based on the third generation sequencing data provided by the invention can include:
S1, carrying out S101-S105 on a tumor sample and on a control sample, respectively, to obtain SNP information of the tumor sample and SNP information of the control sample, wherein the tumor sample and the control sample are collectively referred to as the sample in S101-S105.
It should be noted that the sample may be any effective sample meeting the sequencing standard, the tumor sample may be a tumor sample separated from a human body, the control sample may be a normal sample separated from a human body, the tumor sample may also be a tumor sample separated from a mouse, a pig or the like, and the control sample may be a normal sample of a corresponding species, that is, the species types of the tumor sample and the control sample need to be unified.
S101, receiving third-generation whole-exome sequencing data of the sample. In one embodiment, the third-generation whole-exome sequencing data received in S101 can be obtained by third-generation whole-exome sequencing of the sample on platforms well known in the art (e.g., capture kits, sequencers, etc.).
S102, receiving HLA typing information of the sample. In one embodiment, the HLA typing information obtained in S102 may be known in advance. For example, animal models are purchased with associated HLA typing information. If the typing of the sample is unknown, the sample may be HLA typed by techniques well known in the art, such as serologic typing techniques, cytological typing techniques, DNA typing techniques, and the like.
S103, mapping and annotating the third-generation whole-exome sequencing data to obtain the annotation and transcript information of the sample. In one embodiment, S103 may use TAGET software, which is dedicated to third-generation sequencing data, to compare the third-generation whole-exome sequencing data with reference genome data (e.g., hg38 for human or mm10 for mouse; the species of the reference genome must match the species of the sample), so as to map and annotate the sample and obtain its annotation and transcript information.
S104, carrying out structural variation analysis on the sample according to the annotation and the transcript information to obtain the structural variation information of the sample. In one embodiment, S104 may perform structural variation analysis on the sample by TAGET-sv software specific for the third generation sequencing data.
S105, carrying out SNP analysis on the sample according to the annotation, the transcript information and the structural variation information to obtain SNP information of the sample. In one embodiment, step S105 may include S10501-S10504:
S10501, marking the annotation information of each transcript according to the annotation, the transcript information and the structural variation information.
S10502, mapping the transcripts with minimap2 to obtain mapping information.
S10503, comparing the annotation information with the mapping information, and reserving the mapping information consistent with the annotation information.
S10504, carrying out SNP analysis on the sample according to the mapping information consistent with the annotation information, and obtaining SNP information of transcripts in the sample.
S2, obtaining specific polypeptide information according to the SNP information of the tumor sample and the SNP information of the control sample. In one embodiment, step S2 may include S201-S211:
S201, comparing the SNP information of the tumor sample with the SNP information of the control sample, and retaining the Somatic SNP information (somatic mutation SNP information).
S202, filtering out the information of the Somatic SNP which only appears once in the tumor sample.
S203, comparing the structural variation information of the tumor sample with the structural variation information of the control sample, and retaining the structural variation information of the Somatic.
S204, merging the transcripts in the tumor sample that contain the filtered Somatic SNP information with the transcripts that contain the Somatic structural variation information to obtain a transcript set.
S205, transcript base sequence information is obtained for the transcript set according to the reference information (for example, including the reference sequence, the annotation file and the like).
S206, correcting the base sequence information of the transcripts according to the filtered information of the Somatic SNP and the filtered information of the Somatic structural variation.
S207, retaining CDS region sequence information which can be translated into polypeptide in the transcript according to the reference information and the corrected transcript base sequence information.
S208, filtering CDS region sequence information without an initiation codon at the position of the reference initiation codon.
S209, translating according to the filtered CDS region sequence information to obtain peptide chain information.
S210, cutting the peptide chain into a plurality of polypeptides according to the peptide chain information to obtain polypeptide information. In one embodiment, S210 may cut the peptide chain into polypeptides of 8-15 amino acids in length.
S211, filtering out the polypeptide information that can be obtained by normal translation of the reference sequence, to obtain the specific polypeptide information.
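Steps S208-S211 can be sketched as follows, assuming the standard genetic code, CDS sequences given as plain DNA strings, and a start codon check simplified to "the CDS begins with ATG". The function names and the toy sequences are hypothetical; this is an illustration of the idea, not the patented implementation.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINOS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): AMINOS[i]
    for i, codon in enumerate(product(BASES, repeat=3))
}


def translate(cds):
    """Translate a CDS up to the first stop codon (S209)."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino = CODON_TABLE[cds[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


def cut_peptides(protein, min_len=8, max_len=15):
    """All sub-peptides of min_len..max_len amino acids (S210)."""
    return {
        protein[start:start + size]
        for size in range(min_len, max_len + 1)
        for start in range(len(protein) - size + 1)
    }


def specific_peptides(tumor_cds, reference_cds):
    """Peptides from the corrected CDS that the reference cannot yield (S211)."""
    if not tumor_cds.startswith("ATG"):  # simplified start codon check (S208)
        return set()
    reference = cut_peptides(translate(reference_cds))
    return cut_peptides(translate(tumor_cds)) - reference


reference_cds = "ATGGCCAAACTGGTGACCGGCAGCCACGAATAA"  # MAKLVTGSHE
tumor_cds = "ATGGCCAAACTGGCGACCGGCAGCCACGAATAA"      # GTG>GCG, V>A
peptides = specific_peptides(tumor_cds, reference_cds)
print(sorted(peptides))
```

In this toy case the protein is only 10 amino acids long, so every window of 8 or more residues spans the substituted position and all six candidate peptides are tumor-specific.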
S3, according to the specific polypeptide information and HLA typing information of the tumor sample, carrying out NetMHCpan analysis on the tumor sample so as to predict tumor neoantigens. In one embodiment, S3 may perform NetMHCpan analysis on tumor samples by existing NetMHCpan software with version number 4.1.
NetMHCpan is an immune epitope prediction algorithm based on artificial neural networks that predicts the binding affinity of peptides to MHC class I molecules. Its training set combines more than 180,000 quantitative binding measurements covering MHC molecules from multiple species, such as human, mouse and pig, with MS-derived MHC eluted ligand data covering 55 HLA and mouse alleles. The algorithm does not require any prior knowledge of the specific MHC molecule and is highly accurate.
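A call to the NetMHCpan command-line tool for a list of candidate peptides might be assembled as in the sketch below. The `-p` (peptide input) and `-a` (comma-separated allele list) options follow the usage examples shipped with NetMHCpan as far as we are aware; they should be checked against the installed version. The helper name, the file name and the executable path are hypothetical, and the sketch only builds the command without running it.

```python
def netmhcpan_command(peptide_file, alleles, executable="netMHCpan"):
    """Build (but do not run) a NetMHCpan command line.

    Assumptions: `-p` switches to peptide-list input and `-a` takes a
    comma-separated allele list; verify both with `netMHCpan -h`.
    """
    if not alleles:
        raise ValueError("at least one MHC allele is required")
    return [executable, "-p", peptide_file, "-a", ",".join(alleles)]


command = netmhcpan_command("specific_peptides.txt", ["H-2-Db", "H-2-Kb"])
print(" ".join(command))
```

The resulting list could be handed to `subprocess.run` on a machine where NetMHCpan is installed, with the standard output captured for the filtering described in S4.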
S4, filtering, backtracking and collating the information in the NetMHCpan analysis result to obtain antigen prediction information. In one embodiment, step S4 may include S401-S402:
S401, retaining, according to the NetMHCpan analysis result information, the peptide-related information whose EL rank is less than or equal to 2, wherein peptides with an EL rank less than or equal to 0.5 are marked as strong binders and peptides with an EL rank greater than 0.5 and up to 2 are marked as weak binders.
S402, backtracking the retained polypeptides according to the related information and the specific polypeptide information, and collating the position information and start-stop information of each polypeptide in its transcript, together with the structural variation information and SNP information contained in that transcript, to obtain antigen prediction information. This records the polypeptides required for antigen prediction, their positions and the reason each was produced, which facilitates subsequent experimental verification and testing.
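The backtracking in S402 amounts to looking each retained peptide up in the translated peptide chains and reporting where it sits and which variants its transcript carries. The sketch below assumes a hypothetical record layout (`protein` string plus a free-form `variants` list per transcript) and reports 0-based start and exclusive end positions in the peptide chain.

```python
def backtrack(peptide, chains):
    """Locate a retained peptide in every translated transcript.

    chains: {transcript_id: {"protein": str, "variants": list}}
    (hypothetical layout). Positions are 0-based, end exclusive.
    """
    hits = []
    for transcript_id, record in chains.items():
        start = record["protein"].find(peptide)
        while start != -1:
            hits.append({
                "transcript": transcript_id,
                "start": start,
                "end": start + len(peptide),
                "variants": record["variants"],
            })
            start = record["protein"].find(peptide, start + 1)
    return hits


chains = {
    "tx1": {"protein": "MAKLATGSHE", "variants": ["SNP c.14T>C"]},
    "tx2": {"protein": "MSTNPKPQRK", "variants": []},
}
hits = backtrack("KLATGSHE", chains)
print(hits)
```

The variant label in the example is invented for illustration; in practice it would be whatever SNP and structural variation records were attached to the transcript earlier in the pipeline.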
It should be noted that, the execution subject of the tumor neoantigen prediction method provided by the present invention may be any terminal-side device meeting technical requirements, such as a tumor neoantigen prediction apparatus.
The tumor neoantigen prediction method provided by the invention takes advantage of the fact that, unlike second-generation sequencing, third-generation sequencing sequences each transcript individually during whole-exome sequencing, so the specificity of the translated peptide chain can be ensured from the structural variations and SNPs carried by each individual transcript. After third-generation whole-exome sequencing and HLA analysis of the sample, the sequencing data are mapped and annotated, structural variation analysis and SNP analysis are performed, and Somatic analysis is carried out against the control sample, finally yielding the specific transcripts associated with the cancer. This cannot be achieved with second-generation sequencing technology and is the core of the invention. The specific transcripts are translated into specific peptide chains, which are cut as required into polypeptides of 8-15 amino acids in length, and the normal polypeptides that the reference transcripts would yield under normal conditions are filtered out, giving the specific polypeptides associated with the cancer. Finally, NetMHCpan is combined with the HLA typing of the sample to carry out efficient, high-quality tumor neoantigen prediction on the specific polypeptides; the results are then filtered and the origin of each specific polypeptide is traced back, which facilitates subsequent experimental verification and testing. The tumor neoantigen prediction method provided by the invention has low computing resource requirements and high benefit, can bring important innovation value and great economic benefit to tumor neoantigen prediction, and provides a substantial knowledge base for medical research.
Furthermore, the tumor neoantigen prediction method provided by the invention was used to predict tumor neoantigens in mouse tumor samples. Third-generation full-length transcriptome sequencing was performed on mouse tumor samples and mouse control samples, respectively, and a total of 7 groups of mouse tumor samples were used for tumor neoantigen prediction. The HLA types of the experimental mice were known to be H-2-Db and H-2-Kb (the typing was provided by the mouse supplier, so no HLA analysis was necessary). Specific mouse information is shown in Table 1 below:
The samples were mapped and annotated against the GRCm39 standard reference files, followed by the structural variation and SNP analysis steps; polypeptides of 8-15 amino acids in length were generated and the specific polypeptide information was obtained; NetMHCpan analysis was then carried out according to the specific polypeptide information and the H-2-Db and H-2-Kb typing. The predicted neoantigen results are shown in Table 2 below.
Notes:
Sample: tumor sample number;
8-15 mer peptides: number of tumor neoantigens meeting the screening criteria among polypeptides of 8-15 amino acids in length.
Experimental results show that the tumor neoantigen prediction method provided by the invention can efficiently predict the tumor neoantigens of a sample and fully accounts for the specific polypeptides produced when transcripts carrying structural variations and SNPs are translated; the prediction results have a low false-positive rate and high precision.
The tumor neoantigen predicting system provided by the invention is described below, and the tumor neoantigen predicting system described below and the tumor neoantigen predicting method described above can be referred to correspondingly.
Referring to fig. 2, the tumor neoantigen prediction system based on the third generation sequencing data provided by the present invention may include:
a tumor-control sample processing module A for: carrying out steps S101-S105 on the tumor sample and on the control sample, respectively, to obtain SNP information of the tumor sample and SNP information of the control sample, wherein the tumor sample and the control sample are collectively referred to as the sample in steps S101-S105:
S101, receiving third-generation whole-exome sequencing data of the sample;
S102, receiving HLA typing information of the sample;
S103, mapping and annotating the third-generation whole-exome sequencing data to obtain annotation and transcript information of the sample;
S104, carrying out structural variation analysis on the sample according to the annotation and transcript information to obtain structural variation information of the sample;
S105, carrying out SNP analysis on the sample according to the annotation, the transcript information and the structural variation information to obtain SNP information of the sample;
a specific polypeptide information acquisition module B for: obtaining specific polypeptide information according to the SNP information of the tumor sample and the SNP information of the control sample;
a tumor neoantigen prediction module C for: carrying out NetMHCpan analysis on the tumor sample according to the specific polypeptide information and the HLA typing information of the tumor sample, so as to predict tumor neoantigens.
It should be noted that steps S101-S105 may all be executed by the tumor-control sample processing module A itself, or the tumor-control sample processing module A may include five sub-modules:
a sequencing sub-module for: receiving third-generation whole-exome sequencing data of the sample;
an HLA analysis sub-module for: receiving HLA typing information of the sample;
a mapping and annotation sub-module for: mapping and annotating the third-generation whole-exome sequencing data to obtain annotation and transcript information of the sample;
a structural variation analysis sub-module for: carrying out structural variation analysis on the sample according to the annotation and transcript information to obtain structural variation information of the sample;
an SNP analysis sub-module for: carrying out SNP analysis on the sample according to the annotation, the transcript information and the structural variation information to obtain SNP information of the sample.
In one embodiment, the specific polypeptide information obtaining module B may include:
a somatic SNP information obtaining sub-module for: comparing the SNP information of the tumor sample with the SNP information of the control sample, and retaining the somatic SNP information;
a first filtering sub-module for: filtering out information on somatic SNPs that appear only once in the tumor sample;
a somatic structural variation information obtaining sub-module for: comparing the structural variation information of the tumor sample with the structural variation information of the control sample, and retaining the somatic structural variation information;
a transcript set obtaining sub-module for: merging transcripts that contain the filtered somatic SNP information with transcripts that contain the somatic structural variation information to obtain a transcript set;
a transcript base sequence information obtaining sub-module for: obtaining transcript base sequence information for the transcript set according to reference information;
a transcript base sequence information correction sub-module for: correcting the transcript base sequence information according to the filtered somatic SNP information and the retained somatic structural variation information;
a second filtering sub-module for: retaining, according to the reference information and the corrected transcript base sequence information, the CDS region sequence information in the transcripts that can be translated into polypeptides;
a third filtering sub-module for: filtering out CDS region sequence information that lacks a start codon at the reference start codon position;
a peptide chain information obtaining sub-module for: translating the filtered CDS region sequence information to obtain peptide chain information;
a polypeptide information obtaining sub-module for: cutting each peptide chain into a plurality of polypeptides according to the peptide chain information to obtain polypeptide information;
a specific polypeptide information obtaining sub-module for: filtering out the polypeptide information that can be translated from the reference information, to obtain the specific polypeptide information.
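The first four sub-modules above amount to set comparisons between tumor and control calls. The sketch below illustrates that idea only; the record layouts are assumptions made for illustration (an SNP keyed by `(transcript, position, ref, alt)` with an observation count, a structural variation as a tuple whose first field is the transcript ID), and "appears only once" is interpreted here as an observation count of 1.

```python
def somatic_snps(tumor_snps, control_snps, min_support=2):
    """Keep tumor SNPs absent from the control and seen at least min_support times.

    tumor_snps / control_snps: dict mapping (transcript, pos, ref, alt) -> count.
    The dict layout and the min_support parameter are illustrative assumptions.
    """
    return {
        key: count
        for key, count in tumor_snps.items()
        if key not in control_snps and count >= min_support
    }


def somatic_svs(tumor_svs, control_svs):
    """Keep structural variations found in the tumor but not in the control."""
    return set(tumor_svs) - set(control_svs)


def transcript_set(snps, svs):
    """Union of transcript IDs carrying a somatic SNP or a somatic SV."""
    return {key[0] for key in snps} | {sv[0] for sv in svs}


tumor = {("tx1", 12, "G", "A"): 5, ("tx1", 40, "C", "T"): 1, ("tx2", 7, "A", "G"): 4}
control = {("tx2", 7, "A", "G"): 6}
kept_snps = somatic_snps(tumor, control)
kept_svs = somatic_svs({("tx3", "DEL", 100, 250), ("tx2", "INS", 10, 10)},
                       {("tx2", "INS", 10, 10)})
transcripts = transcript_set(kept_snps, kept_svs)
```

In this toy input, the SNP on `tx2` is dropped because the control also carries it, and the second SNP on `tx1` is dropped because it was observed only once.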
In one embodiment, the tumor neoantigen prediction system may further comprise:
a prediction result collation module D for: performing information filtering, backtracking, and collation according to the NetMHCpan analysis result to obtain antigen prediction information.
In one embodiment, the prediction result collation module D may include:
a fourth filtering sub-module for: retaining, according to the NetMHCpan analysis result information, the peptide-related information whose EL rank is less than or equal to 2;
an antigen prediction information obtaining sub-module for: tracing the retained polypeptides back according to the peptide-related information and the specific polypeptide information, and collating the position information and start-stop information of each polypeptide in its transcript, together with the structural variation information and SNP information contained in that transcript, to obtain the antigen prediction information.
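The two sub-modules of module D can be sketched as a filter followed by a lookup. This is an illustrative sketch only: the row and origin dictionaries, their field names (`peptide`, `allele`, `el_rank`, `transcript`, `start`, `end`, `snps`, `svs`), and the function name `collate_predictions` are assumptions, not the patent's data format or the NetMHCpan output format.

```python
def collate_predictions(netmhcpan_rows, peptide_origins, max_el_rank=2.0):
    """Keep rows with EL rank <= max_el_rank and attach each peptide's origin.

    netmhcpan_rows: list of dicts with 'peptide', 'allele', 'el_rank'.
    peptide_origins: dict mapping peptide -> list of dicts describing where the
    peptide sits in a transcript and which variants that transcript carries.
    Both layouts are hypothetical.
    """
    collated = []
    for row in netmhcpan_rows:
        if row["el_rank"] > max_el_rank:
            continue  # fourth filtering sub-module: EL rank must be <= 2
        for origin in peptide_origins.get(row["peptide"], []):
            # Backtracking: merge the prediction with its transcript context.
            collated.append({**row, **origin})
    return collated


rows = [
    {"peptide": "AKDLFDPW", "allele": "HLA-A02:01", "el_rank": 0.5},
    {"peptide": "MAKDLFDP", "allele": "HLA-A02:01", "el_rank": 3.1},
    {"peptide": "MAKDLFDPW", "allele": "HLA-A02:01", "el_rank": 2.0},
]
origins = {
    "AKDLFDPW": [{"transcript": "tx1", "start": 1, "end": 9,
                  "snps": [("tx1", 12, "G", "A")], "svs": []}],
    "MAKDLFDPW": [{"transcript": "tx1", "start": 0, "end": 9,
                   "snps": [("tx1", 12, "G", "A")], "svs": []}],
}
report = collate_predictions(rows, origins)
```

A peptide that maps to several transcripts yields one output record per origin, so the variant context of every source transcript stays visible.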
Fig. 3 is a schematic diagram of the physical structure of an electronic device. As shown in Fig. 3, the electronic device may include a processor 810, a communications interface 820, a memory 830, and a communication bus 840, where the processor 810, the communications interface 820, and the memory 830 communicate with one another through the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform a tumor neoantigen prediction method comprising:
S1, performing steps S101-S105 on a tumor sample and a control sample, respectively, to obtain SNP information of the tumor sample and SNP information of the control sample, where in steps S101-S105 the tumor sample and the control sample are collectively referred to as the sample;
S101, receiving third-generation whole-exome sequencing data of the sample;
S102, receiving HLA typing information of the sample;
S103, mapping and annotating the third-generation whole-exome sequencing data to obtain annotation and transcript information of the sample;
S104, performing structural variation analysis on the sample according to the annotation and transcript information to obtain structural variation information of the sample;
S105, performing SNP analysis on the sample according to the annotation, the transcript information, and the structural variation information to obtain SNP information of the sample;
S2, obtaining specific polypeptide information according to the SNP information of the tumor sample and the SNP information of the control sample;
S3, performing NetMHCpan analysis on the tumor sample according to the specific polypeptide information and the HLA typing information of the tumor sample, so as to predict tumor neoantigens.
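Step S3 hands the specific polypeptides and the tumor sample's HLA types to NetMHCpan. The sketch below only assembles a command line and does not run anything. The helper names are hypothetical, and the flags used (`-p`, `-f`, `-a`, `-xls`, `-xlsfile`) and the allele spelling without an asterisk reflect an understanding of NetMHCpan 4.x usage that should be treated as an assumption and checked against the help output of the installed version.

```python
def to_netmhcpan_allele(hla):
    """Normalize 'A*02:01' or 'HLA-A*02:01' to the 'HLA-A02:01' spelling.

    The target spelling is an assumption about what NetMHCpan expects.
    """
    name = hla.strip().upper()
    if not name.startswith("HLA-"):
        name = "HLA-" + name
    return name.replace("*", "")


def build_netmhcpan_command(peptide_file, hla_types, xls_out, executable="netMHCpan"):
    """Return the argument list for one NetMHCpan run (nothing is executed)."""
    alleles = ",".join(sorted({to_netmhcpan_allele(h) for h in hla_types}))
    return [
        executable,
        "-p",                 # assumed: treat the input as a peptide list
        "-f", peptide_file,   # assumed: input file
        "-a", alleles,        # assumed: comma-separated allele names
        "-xls",               # assumed: also write a tabular report
        "-xlsfile", xls_out,  # assumed: path of that report
    ]


cmd = build_netmhcpan_command(
    "specific_peptides.txt",
    ["HLA-A*02:01", "HLA-B*07:02", "A*02:01"],
    "tumor.xls",
)
```

Returning a list rather than a shell string keeps the command suitable for `subprocess.run` without shell quoting; duplicate alleles are collapsed before the call.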
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium. When executed by a processor, the computer program performs the tumor neoantigen prediction method provided above, the method comprising:
S1, performing steps S101-S105 on a tumor sample and a control sample, respectively, to obtain SNP information of the tumor sample and SNP information of the control sample, where in steps S101-S105 the tumor sample and the control sample are collectively referred to as the sample;
S101, receiving third-generation whole-exome sequencing data of the sample;
S102, receiving HLA typing information of the sample;
S103, mapping and annotating the third-generation whole-exome sequencing data to obtain annotation and transcript information of the sample;
S104, performing structural variation analysis on the sample according to the annotation and transcript information to obtain structural variation information of the sample;
S105, performing SNP analysis on the sample according to the annotation, the transcript information, and the structural variation information to obtain SNP information of the sample;
S2, obtaining specific polypeptide information according to the SNP information of the tumor sample and the SNP information of the control sample;
S3, performing NetMHCpan analysis on the tumor sample according to the specific polypeptide information and the HLA typing information of the tumor sample, so as to predict tumor neoantigens.
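Within step S2, the path from a corrected transcript sequence to specific polypeptides (correcting bases, checking the start codon, translating, cutting, and removing peptides also found in the reference) can be illustrated with a toy example. This is a sketch under stated assumptions, not the patented implementation: positions are 0-based, only SNP substitutions are applied (structural variations are not handled), the peptide lengths 8 to 11 are an assumed default that the text does not specify, and all function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def apply_snps(seq, snps):
    """Substitute bases; snps is an iterable of (pos0, ref, alt)."""
    bases = list(seq)
    for pos, ref, alt in snps:
        if bases[pos] != ref:
            raise ValueError(f"reference base mismatch at {pos}")
        bases[pos] = alt
    return "".join(bases)


def translate_cds(seq, cds_start):
    """Translate from the reference start position; None if no ATG is there."""
    if seq[cds_start:cds_start + 3] != "ATG":
        return None
    chain = []
    for i in range(cds_start, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break  # stop codon ends the peptide chain
        chain.append(aa)
    return "".join(chain)


def slice_peptides(chain, lengths=(8, 9, 10, 11)):
    """Cut a peptide chain into all windows of the given lengths."""
    return {chain[i:i + k] for k in lengths for i in range(len(chain) - k + 1)}


def specific_peptides(tumor_chain, reference_chain, lengths=(8, 9, 10, 11)):
    """Peptides obtainable from the tumor chain but not from the reference."""
    return slice_peptides(tumor_chain, lengths) - slice_peptides(reference_chain, lengths)


# Toy transcript: 2-base 5' UTR followed by a 10-codon CDS.
reference = "GG" + "ATG" + "GCT" + "AAA" + "GGT" + "CTG" + "TTT" + "GAT" + "CCG" + "TGG" + "TAA"
tumor = apply_snps(reference, [(12, "G", "A")])  # GGT -> GAT, Gly -> Asp
ref_chain = translate_cds(reference, 2)
tumor_chain = translate_cds(tumor, 2)
neo = specific_peptides(tumor_chain, ref_chain, lengths=(8, 9))
```

Every window that overlaps the substituted residue differs from the reference and is kept; a variant that destroys the ATG at the reference start position makes `translate_cds` return `None`, which corresponds to discarding that CDS.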
In yet another aspect, the present invention provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the tumor neoantigen prediction method provided above, the method comprising:
S1, performing steps S101-S105 on a tumor sample and a control sample, respectively, to obtain SNP information of the tumor sample and SNP information of the control sample, where in steps S101-S105 the tumor sample and the control sample are collectively referred to as the sample;
S101, receiving third-generation whole-exome sequencing data of the sample;
S102, receiving HLA typing information of the sample;
S103, mapping and annotating the third-generation whole-exome sequencing data to obtain annotation and transcript information of the sample;
S104, performing structural variation analysis on the sample according to the annotation and transcript information to obtain structural variation information of the sample;
S105, performing SNP analysis on the sample according to the annotation, the transcript information, and the structural variation information to obtain SNP information of the sample;
S2, obtaining specific polypeptide information according to the SNP information of the tumor sample and the SNP information of the control sample;
S3, performing NetMHCpan analysis on the tumor sample according to the specific polypeptide information and the HLA typing information of the tumor sample, so as to predict tumor neoantigens.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, in essence, or the part of it that contributes over the prior art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications and substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A tumor neoantigen prediction method based on third-generation sequencing data, comprising:
S1, performing steps S101-S105 on a tumor sample and a control sample, respectively, to obtain SNP information of the tumor sample and SNP information of the control sample, where in steps S101-S105 the tumor sample and the control sample are collectively referred to as the sample;
S101, receiving third-generation whole-exome sequencing data of the sample;
S102, receiving HLA typing information of the sample;
S103, mapping and annotating the third-generation whole-exome sequencing data to obtain annotation and transcript information of the sample;
S104, performing structural variation analysis on the sample according to the annotation and transcript information to obtain structural variation information of the sample;
S105, performing SNP analysis on the sample according to the annotation, the transcript information, and the structural variation information to obtain SNP information of the sample;
S2, obtaining specific polypeptide information according to the SNP information of the tumor sample and the SNP information of the control sample;
S3, performing NetMHCpan analysis on the tumor sample according to the specific polypeptide information and the HLA typing information of the tumor sample, so as to predict tumor neoantigens;
S4, performing information filtering, backtracking, and collation according to the NetMHCpan analysis result to obtain antigen prediction information;
wherein step S2 includes:
S201, comparing the SNP information of the tumor sample with the SNP information of the control sample, and retaining the somatic SNP information;
S202, filtering out information on somatic SNPs that appear only once in the tumor sample;
S203, comparing the structural variation information of the tumor sample with the structural variation information of the control sample, and retaining the somatic structural variation information;
S204, merging transcripts in the tumor sample that contain the filtered somatic SNP information with transcripts that contain the somatic structural variation information to obtain a transcript set;
S205, obtaining transcript base sequence information for the transcript set according to reference information;
S206, correcting the transcript base sequence information according to the filtered somatic SNP information and the retained somatic structural variation information;
S207, retaining, according to the reference information and the corrected transcript base sequence information, the CDS region sequence information in the transcripts that can be translated into polypeptides;
S208, filtering out CDS region sequence information that lacks a start codon at the reference start codon position;
S209, translating the filtered CDS region sequence information to obtain peptide chain information;
S210, cutting each peptide chain into a plurality of polypeptides according to the peptide chain information to obtain polypeptide information;
S211, filtering out the polypeptide information that can be translated from the reference information, to obtain the specific polypeptide information;
and step S4 includes:
retaining, according to the NetMHCpan analysis result information, the peptide-related information whose EL rank is less than or equal to 2;
tracing the retained polypeptides back according to the peptide-related information and the specific polypeptide information, and collating the position information and start-stop information of each polypeptide in its transcript, together with the structural variation information and SNP information contained in that transcript, to obtain the antigen prediction information.
2. The method of claim 1, wherein step S105 comprises:
S10501, marking the annotation information of each transcript according to the annotation, the transcript information, and the structural variation information;
S10502, performing minimap2 mapping on the transcripts to obtain mapping information;
S10503, comparing the annotation information with the mapping information, and retaining the mapping information that is consistent with the annotation information;
S10504, performing SNP analysis on the sample according to the mapping information that is consistent with the annotation information, to obtain SNP information of the transcripts in the sample.
3. The method of claim 1 or 2, wherein step S103 uses TAGET software to map and annotate the sample.
4. The method of claim 1 or 2, wherein step S104 uses TAGET-sv software to perform structural variation analysis on the sample.
5. A tumor neoantigen prediction system based on third-generation sequencing data, comprising:
a tumor-control sample processing module for: performing steps S101-S105 on a tumor sample and a control sample, respectively, to obtain SNP information of the tumor sample and SNP information of the control sample, where in steps S101-S105 the tumor sample and the control sample are collectively referred to as the sample:
S101, receiving third-generation whole-exome sequencing data of the sample;
S102, receiving HLA typing information of the sample;
S103, mapping and annotating the third-generation whole-exome sequencing data to obtain annotation and transcript information of the sample;
S104, performing structural variation analysis on the sample according to the annotation and transcript information to obtain structural variation information of the sample;
S105, performing SNP analysis on the sample according to the annotation, the transcript information, and the structural variation information to obtain SNP information of the sample;
a specific polypeptide information obtaining module for: obtaining specific polypeptide information according to the SNP information of the tumor sample and the SNP information of the control sample;
a tumor neoantigen prediction module for: performing NetMHCpan analysis on the tumor sample according to the specific polypeptide information and the HLA typing information of the tumor sample, so as to predict tumor neoantigens;
a prediction result collation module for: performing information filtering, backtracking, and collation according to the NetMHCpan analysis result to obtain antigen prediction information;
wherein the specific polypeptide information obtaining module comprises:
a somatic SNP information obtaining sub-module for: comparing the SNP information of the tumor sample with the SNP information of the control sample, and retaining the somatic SNP information;
a first filtering sub-module for: filtering out information on somatic SNPs that appear only once in the tumor sample;
a somatic structural variation information obtaining sub-module for: comparing the structural variation information of the tumor sample with the structural variation information of the control sample, and retaining the somatic structural variation information;
a transcript set obtaining sub-module for: merging transcripts that contain the filtered somatic SNP information with transcripts that contain the somatic structural variation information to obtain a transcript set;
a transcript base sequence information obtaining sub-module for: obtaining transcript base sequence information for the transcript set according to reference information;
a transcript base sequence information correction sub-module for: correcting the transcript base sequence information according to the filtered somatic SNP information and the retained somatic structural variation information;
a second filtering sub-module for: retaining, according to the reference information and the corrected transcript base sequence information, the CDS region sequence information in the transcripts that can be translated into polypeptides;
a third filtering sub-module for: filtering out CDS region sequence information that lacks a start codon at the reference start codon position;
a peptide chain information obtaining sub-module for: translating the filtered CDS region sequence information to obtain peptide chain information;
a polypeptide information obtaining sub-module for: cutting each peptide chain into a plurality of polypeptides according to the peptide chain information to obtain polypeptide information;
a specific polypeptide information obtaining sub-module for: filtering out the polypeptide information that can be translated from the reference information, to obtain the specific polypeptide information;
and the prediction result collation module comprises:
a fourth filtering sub-module for: retaining, according to the NetMHCpan analysis result information, the peptide-related information whose EL rank is less than or equal to 2;
an antigen prediction information obtaining sub-module for: tracing the retained polypeptides back according to the peptide-related information and the specific polypeptide information, and collating the position information and start-stop information of each polypeptide in its transcript, together with the structural variation information and SNP information contained in that transcript, to obtain the antigen prediction information.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the tumor neoantigen prediction method of any one of claims 1 to 4 when the program is executed.
7. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the tumor neoantigen prediction method according to any one of claims 1 to 4.
CN202311401140.5A 2023-10-26 2023-10-26 Tumor neoantigen prediction method and system based on third-generation sequencing data Active CN117174166B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311401140.5A CN117174166B (en) 2023-10-26 2023-10-26 Tumor neoantigen prediction method and system based on third-generation sequencing data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311401140.5A CN117174166B (en) 2023-10-26 2023-10-26 Tumor neoantigen prediction method and system based on third-generation sequencing data

Publications (2)

Publication Number Publication Date
CN117174166A CN117174166A (en) 2023-12-05
CN117174166B true CN117174166B (en) 2024-03-26

Family

ID=88930011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311401140.5A Active CN117174166B (en) 2023-10-26 2023-10-26 Tumor neoantigen prediction method and system based on third-generation sequencing data

Country Status (1)

Country Link
CN (1) CN117174166B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009061182A1 (en) * 2007-11-05 2009-05-14 Leiden University Medical Center Hla class ii pi4k2b mhags and application thereof
CN108796055A (en) * 2018-06-12 2018-11-13 深圳裕策生物科技有限公司 Tumor neogenetic antigen detection method, device and storage medium based on the sequencing of two generations
CN109706065A (en) * 2018-12-29 2019-05-03 深圳裕策生物科技有限公司 Tumor neogenetic antigen load detection device and storage medium
CN109801678A (en) * 2019-01-25 2019-05-24 上海鲸舟基因科技有限公司 Based on the tumour antigen prediction technique of full transcript profile and its application
CN110600077A (en) * 2019-08-29 2019-12-20 北京优迅医学检验实验室有限公司 Prediction method of tumor neoantigen and application thereof
CN111755067A (en) * 2019-03-28 2020-10-09 格源致善(上海)生物科技有限公司 Screening method of tumor neoantigen
CN113533741A (en) * 2021-06-23 2021-10-22 深圳市新合生物医疗科技有限公司 Method for predicting new antigen based on polypeptide structural index
WO2022074098A1 (en) * 2020-10-08 2022-04-14 Fundació Privada Institut D'investigació Oncològica De Vall Hebron Method for the identification of cancer neoantigens
CN114627967A (en) * 2022-03-15 2022-06-14 北京基石生命科技有限公司 Method for accurately annotating three-generation full-length transcript
CN116779028A (en) * 2023-05-19 2023-09-19 北京大学 Method, device and computer readable storage medium for predicting neoepitope based on structural variation detection

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110534156B (en) * 2019-09-02 2022-06-17 深圳市新合生物医疗科技有限公司 Method and system for extracting immunotherapy new antigen
CN112309502A (en) * 2020-10-14 2021-02-02 深圳市新合生物医疗科技有限公司 Method and system for calculating tumor neoantigen load

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20240228
Address after: 100195 Beijing Haidian Xingshikou Road Yiyuan Cultural and Creative Industry Base Area C West Section of Building 11 201
Applicant after: GENEX HEALTH Co.,Ltd.
Country or region after: China
Address before: 3rd Floor, West Section, Building 11, Zone C, Yiyuan Cultural and Creative Industry Base, No. 65 Xingshikou Road, Haidian District, Beijing, 100195
Applicant before: Beijing cornerstone Jingzhun Diagnostic Technology Co.,Ltd.
Country or region before: China
GR01 Patent grant