CN110146636B - Screening method and screening device for gastric cancer typing protein markers and application of screened protein markers - Google Patents

Publication number
CN110146636B
Authority
CN
China
Prior art keywords
protein
gastric cancer
dimensionality reduction
typing
screening
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910367519.6A
Other languages
Chinese (zh)
Other versions
CN110146636A (en)
Inventor
秦钧
汪宜
刘明伟
夏夏
宋雷
李恺
倪晓天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Guhai Tianmu Biomedical Technology Co ltd
Original Assignee
Beijing Guhai Tianmu Biomedical Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Guhai Tianmu Biomedical Technology Co ltd
Priority to CN201910367519.6A
Publication of CN110146636A
Application granted
Publication of CN110146636B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N30/00 Investigating or analysing materials by separation into components using adsorption, absorption or similar phenomena or using ion-exchange, e.g. chromatography or field flow fractionation
    • G01N30/02 Column chromatography
    • G01N30/88 Integrated analysis systems specially adapted therefor, not covered by a single one of the groups G01N30/04 - G01N30/86
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/53 Immunoassay; Biospecific binding assay; Materials therefor
    • G01N33/574 Immunoassay; Biospecific binding assay; Materials therefor for cancer
    • G01N33/57407 Specifically defined cancers
    • G01N33/57446 Specifically defined cancers of stomach or intestine
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/68 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
    • G01N33/6803 General methods of protein analysis not limited to specific proteins or families of proteins
    • G01N33/6848 Methods of protein analysis involving mass spectrometry
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N30/00 Investigating or analysing materials by separation into components using adsorption, absorption or similar phenomena or using ion-exchange, e.g. chromatography or field flow fractionation
    • G01N30/02 Column chromatography
    • G01N30/88 Integrated analysis systems specially adapted therefor, not covered by a single one of the groups G01N30/04 - G01N30/86
    • G01N2030/8809 Integrated analysis systems specially adapted therefor, not covered by a single one of the groups G01N30/04 - G01N30/86 analysis specially adapted for the sample
    • G01N2030/8813 Integrated analysis systems specially adapted therefor, not covered by a single one of the groups G01N30/04 - G01N30/86 analysis specially adapted for the sample biological materials
    • G01N2030/8831 Integrated analysis systems specially adapted therefor, not covered by a single one of the groups G01N30/04 - G01N30/86 analysis specially adapted for the sample biological materials involving peptides or proteins

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Immunology (AREA)
  • Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Biomedical Technology (AREA)
  • Urology & Nephrology (AREA)
  • Hematology (AREA)
  • Biochemistry (AREA)
  • Analytical Chemistry (AREA)
  • Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Microbiology (AREA)
  • Cell Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Food Science & Technology (AREA)
  • Medicinal Chemistry (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Hospice & Palliative Care (AREA)
  • Oncology (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The invention provides a screening method and a screening device for gastric cancer typing protein markers, and applications of the screened protein markers. The screening method comprises the following steps: screening proteins that meet a retention condition from a protein expression mass spectrum database formed from a plurality of samples, to serve as an effective protein set; sequentially performing two dimensionality reduction treatments on the effective protein set to obtain a dimensionality reduction protein set; and performing cluster analysis on the dimensionality reduction protein set to obtain protein markers of different classes. The method screens protein markers that are significantly highly expressed in cancer samples from the protein mass spectrum database of a large number of gastric cancer samples, and divides gastric cancer into different classes according to the correlation between the protein markers and survival rate. Because the markers differ significantly between the classes, gastric cancer typing becomes more accurate.

Description

Screening method and screening device for gastric cancer typing protein markers and application of screened protein markers
Technical Field
The invention relates to the field of tumor detection, in particular to a screening method and a screening device for a gastric cancer typing protein marker and application of the screened protein marker.
Background
Gastric cancer is the third leading cause of cancer death. China is one of the high-incidence areas for gastric cancer, and its new cases each year account for nearly half of the worldwide total. There are currently three main methods for assessing the prognosis of gastric cancer. The first is Lauren typing, which is based on tissue morphology and classifies gastric cancer into intestinal, diffuse and mixed types. Among them, intestinal gastric cancer has the best prognosis, with a 5-year survival rate of 65%; diffuse gastric cancer has the worst prognosis, with a 5-year survival rate of 48%. The second is TNM staging, which stages gastric cancer patients according to tumor size, invasiveness, number of lymph nodes, and whether proximal or distal metastasis has occurred. Among the stages, stage IIA has the best prognosis, with a 5-year survival rate of 78%; stage IIIC has the worst prognosis, with a 5-year survival rate of 15%. The third is the genome- and transcriptome-based typing of the ACRG research institute, which divides gastric cancer into four types: MSI type, microsatellite stable/epithelial-mesenchymal transition type (MSS/EMT), microsatellite stable/TP53-positive type and microsatellite stable/TP53-negative type. Among them, the MSI type has the best prognosis, with a 5-year survival rate of 71%; the MSS/EMT type has the worst prognosis, with a 5-year survival rate of 33%.
The Lauren typing method has the most limited ability to assess prognosis: its 5-year survival rates differ by at most 17 percentage points (intestinal type 65%; diffuse type 48%). Of the three methods, TNM staging is the best at assessing the prognosis of gastric cancer patients, with 5-year survival rates differing by up to 63 percentage points (stage IIA 78%; stage IIIC 15%). However, neither Lauren typing nor TNM staging is a molecular-based method of prognosis evaluation, so neither can support personalized selection of chemotherapy regimens for patients.
The genome-based gastric cancer typing of ACRG analyzes gastric cancer at the molecular level for the first time. Its ability to assess prognosis is weaker than that of TNM staging (the 5-year survival rate differs by at most 35.85%; MSI type 71%; MSS/EMT type 33%), but it explains the causes of gastric cancer to some extent.
Therefore, how to more accurately classify gastric cancer in the population has become a problem to be solved.
Disclosure of Invention
The main object of the invention is to provide a screening method and a screening device for gastric cancer typing protein markers, and applications of the screened protein markers, so that gastric cancer in a population can be typed more accurately.
In order to achieve the above object, according to one aspect of the present invention, there is provided a method for screening gastric cancer typing protein markers, the method comprising: screening proteins that meet a retention condition from a protein expression mass spectrum database formed from a plurality of samples, to serve as an effective protein set; sequentially performing two dimensionality reduction treatments on the effective protein set to obtain a dimensionality reduction protein set; and performing cluster analysis on the dimensionality reduction protein set to obtain protein markers of different classes.
Further, the two dimensionality reduction processes comprise: carrying out primary dimensionality reduction treatment on the effective protein set by adopting a principal component analysis method to obtain a first dimensionality reduction data set; and performing second dimensionality reduction treatment on the first dimensionality reduction data set by using t-SNE to obtain a dimensionality reduction protein set.
Further, the retention condition includes a quality condition and/or a frequency condition. The quality condition includes at least one of the following: the protein has at least two peptide fragments meeting the quality requirement; at least one of the peptide fragments is a unique peptide fragment meeting the quality requirement; and the protein has at least three peptide fragments meeting the quality requirement. The frequency condition is that the protein is present in at least 80% of the samples.
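The retention condition above can be sketched as a simple filter. This is only an illustration under stated assumptions: the `ProteinEvidence` record, the function names, and the particular combination of quality criteria (at least two qualifying peptides, at least one of them unique) are hypothetical choices, since the text allows any one or more of the listed criteria.

```python
from dataclasses import dataclass


@dataclass
class ProteinEvidence:
    """Identification evidence for one protein in one sample (hypothetical record)."""
    qualified_peptides: int         # peptides that meet the quality requirement
    qualified_unique_peptides: int  # unique peptides that meet the quality requirement


def passes_quality(ev, min_peptides=2, min_unique=1):
    # Assumed reading of the quality condition: at least two qualifying
    # peptides, at least one of which is a unique peptide.
    return (ev.qualified_peptides >= min_peptides
            and ev.qualified_unique_peptides >= min_unique)


def effective_protein_set(samples, min_fraction=0.8):
    """samples: list of dicts mapping protein name -> ProteinEvidence.

    Returns the sorted names of proteins that pass the quality condition
    in at least `min_fraction` of the samples (the frequency condition).
    """
    counts = {}
    for sample in samples:
        for name, ev in sample.items():
            if passes_quality(ev):
                counts[name] = counts.get(name, 0) + 1
    needed = min_fraction * len(samples)
    return sorted(name for name, n in counts.items() if n >= needed)
```

With five samples, a protein must pass the quality check in at least four of them to enter the effective protein set.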
Further, in the step of performing the second dimensionality reduction treatment on the first dimensionality reduction data set by t-SNE, the learning rate is set to 10 to 500, the number of iterations is set to 7500 or more, and the iteration stop is set to 100 to 400.
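The parameter ranges stated for the t-SNE step can be checked before a run, as in the minimal sketch below. The class and field names are hypothetical. Treating "iteration stop" as an early-stopping patience (comparable to an option such as `n_iter_without_progress` in scikit-learn's `TSNE`) is an assumption, as is treating 7500 iterations as an allowed value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TsneSettings:
    """Hypothetical container for the three t-SNE settings named in the text."""
    learning_rate: float
    n_iterations: int
    iteration_stop: int  # assumed to mean an early-stopping patience

    def violations(self):
        """Return a list of human-readable range violations (empty if none)."""
        problems = []
        if not 10 <= self.learning_rate <= 500:
            problems.append("learning_rate outside 10-500")
        if self.n_iterations < 7500:
            problems.append("n_iterations below 7500")
        if not 100 <= self.iteration_stop <= 400:
            problems.append("iteration_stop outside 100-400")
        return problems
```

A settings object whose `violations()` list is empty falls inside all three stated ranges.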
Further, in the step of cluster analysis, the average distance in the classes is less than or equal to 4.58, and the average distance between the classes is more than or equal to 9.68.
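The two distance thresholds can be checked on a labelled set of reduced points. The text does not define how the averages are computed, so this sketch assumes Euclidean distance, averages over same-class point pairs for the intra-class value, and averages over pairs of class centroids for the inter-class value. All function names are hypothetical.

```python
import math
from itertools import combinations


def mean_intra_class_distance(points, labels):
    """Mean Euclidean distance over all pairs of points that share a label."""
    dists = [math.dist(p, q)
             for (p, lp), (q, lq) in combinations(zip(points, labels), 2)
             if lp == lq]
    return sum(dists) / len(dists)


def mean_inter_class_distance(points, labels):
    """Mean Euclidean distance between class centroids (assumed definition)."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    centroids = [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups.values()]
    dists = [math.dist(a, b) for a, b in combinations(centroids, 2)]
    return sum(dists) / len(dists)


def meets_thresholds(points, labels, max_intra=4.58, min_inter=9.68):
    """True if the clustering satisfies both thresholds quoted in the text."""
    return (mean_intra_class_distance(points, labels) <= max_intra
            and mean_inter_class_distance(points, labels) >= min_inter)
```

The sketch expects at least two classes and at least one class with two or more points; otherwise the averages are undefined.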
Further, before the two dimensionality reduction treatments are performed on the effective protein set, the method further comprises: standardizing the effective protein set to obtain a standardized effective protein set; preferably, the standardization is performed so that the mean value of the proteins in the effective protein set is 0 and the variance is 1.
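Standardization to mean 0 and variance 1 can be sketched with the standard library alone. The sketch assumes a samples-by-proteins matrix, standardizes each protein column across samples, and uses the population variance; the text does not specify these details, and the function name is hypothetical.

```python
import statistics


def standardize(matrix):
    """matrix: list of samples (rows), each a list of protein abundances (columns).

    Returns a new matrix in which every protein column has mean 0 and
    population variance 1. Constant columns are mapped to 0.0.
    """
    columns = list(zip(*matrix))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [[(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, means, stds)]
            for row in matrix]
```

Putting all proteins on a comparable scale keeps highly abundant proteins from dominating the later principal component analysis.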
Further, the protein markers of different classes are protein markers with a P value of less than 10^-35 based on a variance test; preferably, the protein markers with a P value of less than 10^-35 based on the variance test comprise: C0: H2AFZ, H2AFJ, H2AFV; C1: ACTG2, ACTA2, ACTC1; C2: FLNA, COL6A2, COL6A1; C3: TUBB2B, TUBB2A, TUBB; C4: H3F3C, H3F3A, H3F3B; C5: X; C6: PABPC1, HNRNPK, DDX39B.
Further, the protein markers of different classes also include protein markers with a P value of less than 10^-20 based on the variance test; preferably, the protein markers with a P value of less than 10^-20 based on the variance test include:
C0,HIST2H2AC、HIST2H2AA3、HIST2H2AA4、HIST1H2AJ、HIST1H2AL、HIST1H2AG、 HIST1H2AM、HIST1H2AI、HIST1H2AK、HIST1H2AH、HIST1H2AD、HIST1H2AA、 HIST1H2AC、HIST1H2AE、HIST1H2AB、HIST3H2A、H2AFX、RPS3A、HIST1H1B、EEF1A2、 RAN、LMNB1、HIST2H2AB;
C1,ACTA1、DES、SYNM、MYL9、PGM5、SORBS1;
C2,TPM1、MYH11、TPM2、LMOD1、TAGLN、MYH10、EHD2、COL6A3、FLNC、 CNN1、HSPB6、TPM4、SYNPO2、MYLK、CALD1、DPYSL3、CRYAB、ACTN1、TF、TLN1、 VCL、HSPG2、TGFBI、CKB、TPM3、KNG1、PDLIM7、ILK、RRAS、CSRP1、LUM、TTR、 SOD3、MYL6、ALB、AOC3、OGN、ACTBL2;
C3,TUBB3、TUBA3D、TUBA3C、TUBA3E、CAPZA2、IQGAP1、TUBB4B、GDI1、 HPX、MYH9、TUBA4A、TUBB6、TUBA1A、TUBA1C、TUBA1B、TUBB4A、ANXA5、 PLS3、CAPZA1、ACTR2、SRSF3、NPEPPS、CLTCL1、SERPINH1、PDLIM3、HNRNPD、 TUBA8;
C4,HIST3H3、HIST1H3A、HIST1H3D、HIST2H3C、HIST2H3A、HIST2H3D、HIST1H3C、 HIST1H3J、HIST1H3I、HIST1H3B、HIST1H3F、HIST1H3E、HIST1H3G、HIST1H3H;
C5,X;
C6,HSPA8、DDX39A、EIF4A1、PABPC3、EEF1G、HNRNPM、RPN1、RPL4、NCL、 ILF3、DHX9、RPS9、RPS19、RPS4X、XRCC5、HNRNPR、HSP90AA1、SYNCRIP、RPS15A、 EIF4A3、GANAB、RBMX、EEF2、RPL13A、RPL7、RPS16、PSMA6、RPS24、FUBP1、EIF4A2、 EIF2S1、RPL36、SRSF1、RPL7A、RPLP0、RPL6、TUBB8、ARF1、ARF3、RPSA、RAB14、 EEF1A1、RPL27、DHX15、ILF2、SRSF7、STIP1、HNRNPA2B1、RPL10A、RPL23A、ARF4、 RPS18、RPL38、NME1-NME2、GNB2L1、RPS6、PDIA6、HNRNPA1、PA2G4、RPS3、RAB7A、 NONO、RPL9、YWHAQ、QARS、PDIA3、HNRNPA3、CCT5、RPS20、XRCC6、HSP90AB1、 RPS28、PCBP2、HNRNPH2、DDX3X、ARPC1B、TAGLN2、EIF3F、DDOST、HSPA4、 HNRNPA1L2、CNDP2、PPIB、CTNND1、RPS13、HSPA9、PRKDC、RPS27A、UBB、UBA52、 UBC、CCT6A、RPL15、NME2、EIF3A、HNRNPC、RPL13、KHSRP、DDX5、SARNP、ALYREF、 ATP2A2、ELAVL1、RUVBL2、ATP6V1A、RAB5C、UBA1、CCT8、STT3B、HSPA2、COPB2、 DAD1、P4HB、RRBP1、RPL24、HSP90B1、EFHD2、PGD、DDX3Y、ANXA2、EEF1D、 ARCN1、YWHAB、PKM、PSMA7、IDH1、SARS、SNRNP200、PSMD2、HNRNPF、TMED10、 HNRNPH1、ENO1、EIF6、HNRNPH3、PGK2、PARP1、HDLBP、STT3A、PRDX6、HSPD1、 PGK1、PHB、PPA1、HSPE1、DNM2、CAPN1、OTUB1、ATP6V1B2。
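The two significance tiers used for the marker lists above (P value below 10^-35, and P value below 10^-20) can be expressed as a small grouping step. The function name and the input layout are hypothetical, and the P values are assumed to have been computed elsewhere by the variance test.

```python
def tier_markers(p_values, strict=1e-35, relaxed=1e-20):
    """p_values: dict mapping (cluster, protein) -> P value from the variance test.

    Returns (core, extended): `core` holds markers below the strict cutoff,
    `extended` holds markers below the relaxed cutoff that did not reach
    the strict one. Both are dicts mapping cluster -> list of proteins.
    """
    core, extended = {}, {}
    for (cluster, protein), p in p_values.items():
        if p < strict:
            core.setdefault(cluster, []).append(protein)
        elif p < relaxed:
            extended.setdefault(cluster, []).append(protein)
    return core, extended
```

Proteins whose P value reaches neither cutoff are simply left out of both tiers.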
According to a second aspect of the present application, there is provided a method for typing gastric cancer in a population, the method comprising typing according to the different classes of protein markers screened by any one of the above methods for screening gastric cancer typing protein markers, to obtain different types.
According to a third aspect of the present application, there is provided a method for typing a gastric cancer sample of an individual, comprising typing according to a gastric cancer typing standard, wherein the gastric cancer typing standard is different types of protein markers selected by any one of the above screening methods, or different types of gastric cancer obtained by any one of the above screening methods.
According to a fourth aspect of the present application, there is provided a method for typing a gastric cancer sample of an individual, the method comprising: typing a known gastric cancer sample set according to any one of the above methods for typing gastric cancer in a population; screening proteins that meet the retention condition from the protein mass spectrum data of a gastric cancer sample to be detected, to serve as a protein set to be detected; sequentially performing two dimensionality reduction treatments on the protein set to be detected under the same conditions as those used for the effective protein set of the known gastric cancer sample set, to obtain a dimensionality reduction protein set of the gastric cancer sample to be detected; and performing cluster analysis on the dimensionality reduction protein set of the gastric cancer sample to be detected to obtain the type with the highest similarity in the known gastric cancer sample set, which is the type corresponding to the gastric cancer sample to be detected.
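One simple way to picture the final step is to assign the sample to the known type whose centroid lies closest in the reduced space. This is a sketch under assumptions, not the method as claimed: the text speaks of cluster analysis and "highest similarity" without defining the measure, so Euclidean distance to class centroids is a stand-in, and both function names are hypothetical.

```python
import math


def centroids_by_type(points, labels):
    """Compute the centroid of each known type from reduced coordinates."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return {lab: tuple(sum(c) / len(g) for c in zip(*g))
            for lab, g in groups.items()}


def assign_type(sample_point, centroids):
    """Return the type label whose centroid is closest to the sample."""
    return min(centroids, key=lambda lab: math.dist(sample_point, centroids[lab]))
```

The known sample set supplies the centroids once; each new sample then needs only its own reduced coordinates.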
Further, the two dimensionality reduction processes comprise: performing primary dimensionality reduction treatment on a protein set to be detected by adopting a principal component analysis method to obtain a first dimensionality reduction data set; and performing second dimensionality reduction treatment on the first dimensionality reduction data set by using t-SNE to obtain a dimensionality reduction protein set.
Further, the retention condition includes a quality condition and/or a frequency condition. The quality condition includes at least one of the following: the protein has at least two peptide fragments meeting the quality requirement; at least one of the peptide fragments is a unique peptide fragment meeting the quality requirement; and the protein has at least three peptide fragments meeting the quality requirement. The frequency condition is that the protein is present in at least 80% of the samples in the known gastric cancer sample set.
Further, the first dimensionality reduction treatment is performed on the protein mass spectrum data of the gastric cancer sample to be detected using the same parameters as the principal component analysis dimensionality reduction performed on the protein mass spectrum data of the known gastric cancer sample set, to obtain a first dimensionality reduction data set.
Further, the second dimensionality reduction treatment is performed on the first dimensionality reduction data set using the same parameters as the t-SNE dimensionality reduction performed on the protein mass spectrum data of the known gastric cancer sample set, to obtain a dimensionality reduction protein set.
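For the principal component step, reusing "the same parameters" can be read as projecting the new sample with the mean and principal axes already fitted on the known sample set. The sketch below assumes those two quantities were stored from the earlier fit; the function name and argument layout are hypothetical, and the t-SNE step is not shown.

```python
def project(sample, mean, components):
    """Project one sample onto previously fitted principal axes.

    sample:     protein abundances of the sample to be detected
    mean:       per-protein means from the known sample set (assumed stored)
    components: list of principal axis vectors from the known sample set
    """
    centered = [x - m for x, m in zip(sample, mean)]
    return [sum(c * v for c, v in zip(centered, axis)) for axis in components]
```

Because the axes are not refitted, the new sample lands in the same coordinate system as the known samples.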
Further, in the step of cluster analysis, the average distance in the classes is less than or equal to 4.58, and the average distance between the classes is more than or equal to 9.68.
Further, before the two dimensionality reduction treatments are performed on the effective protein set, the method further comprises: standardizing the effective protein set to obtain a standardized effective protein set; preferably, the standardization is performed so that the mean value of the proteins in the effective protein set is 0 and the variance is 1.
According to a fifth aspect of the present application, there is provided a reagent, a kit or a chip for detecting gastric cancer, wherein the reagent, the kit or the chip comprises different classes of protein markers obtained by any one of the above-mentioned screening methods.
Further, the detection includes any one or more of typing diagnosis, survival prognosis evaluation and screening of chemotherapeutic drugs.
According to a sixth aspect of the present application, there is provided a use of the different classes of protein markers obtained by any one of the above screening methods in the preparation of a reagent, a kit or a chip for detecting gastric cancer.
Further, the protein marker is a protein marker whose expression amount is significantly higher in the belonging class than in the remaining classes.
Further, the detection includes any one or more of typing diagnosis, survival prognosis evaluation and screening of chemotherapeutic drugs.
According to a seventh aspect of the present application, there is provided a screening device for gastric cancer typing protein markers, comprising: a screening module A for screening proteins that meet the retention condition from a protein expression mass spectrum database formed from a plurality of samples, to serve as an effective protein set; a dimensionality reduction module A for sequentially performing two dimensionality reduction treatments on the effective protein set to obtain a dimensionality reduction protein set; and a clustering and typing module A for performing cluster analysis on the dimensionality reduction protein set to obtain protein markers of different classes.
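The three modules of the device form a linear pipeline, which can be sketched as a class that chains three callables. The class name, attribute names, and the idea of passing each stage in as a function are illustrative assumptions, not part of the described device.

```python
class ScreeningDevice:
    """Chains screening, dimensionality reduction and clustering stages.

    Each stage is supplied as a plain callable so that any concrete
    implementation can be plugged in (hypothetical design).
    """

    def __init__(self, screen, reduce_dims, cluster):
        self.screen = screen            # database -> effective protein set
        self.reduce_dims = reduce_dims  # effective set -> reduced set
        self.cluster = cluster          # reduced set -> class assignments

    def run(self, database):
        effective = self.screen(database)
        reduced = self.reduce_dims(effective)
        return self.cluster(reduced)
```

Each module consumes only the output of the module before it, which matches the order in which the device is described.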
Further, the dimension reduction module a includes: the principal component dimensionality reduction module A is used for carrying out primary dimensionality reduction treatment on the effective protein set by adopting a principal component analysis method to obtain a first dimensionality reduction data set; and the t-SNE dimensionality reduction module A is used for carrying out second dimensionality reduction treatment on the first dimensionality reduction data set by adopting t-SNE to obtain a dimensionality reduction protein set.
Further, the retention condition comprises a quality condition and/or a frequency condition. The quality condition comprises at least one of the following: the protein has at least two peptide fragments meeting the quality requirement; at least one of the peptide fragments is a unique peptide fragment meeting the quality requirement; and the protein has at least three peptide fragments meeting the quality requirement. The frequency condition is that the protein is present in at least 80% of the samples.
Further, in the t-SNE dimension reduction module A: the learning rate is set to 10 to 500, the number of iterations is set to 7500 or more, and the iteration stop is set to 100 to 400.
Furthermore, in the cluster classification module A, the average distance in the classes is less than or equal to 4.58, and the average distance between the classes is more than or equal to 9.68.
Further, the screening device further comprises a standardization processing module A for standardizing the effective protein set to obtain a standardized effective protein set; preferably, the standardization processing module A is used to make the mean value of the proteins in the effective protein set 0 and the variance 1.
Further, the protein markers of different classes are protein markers with a P value of less than 10^-35 based on a variance test; preferably, the protein markers with a P value of less than 10^-35 based on the variance test comprise: C0: H2AFZ, H2AFJ, H2AFV; C1: ACTG2, ACTA2, ACTC1; C2: FLNA, COL6A2, COL6A1; C3: TUBB2B, TUBB2A, TUBB; C4: H3F3C, H3F3A, H3F3B; C5: X; C6: PABPC1, HNRNPK, DDX39B.
Further, the protein markers of different classes also include protein markers with a P value of less than 10^-20 based on the variance test; preferably, the protein markers with a P value of less than 10^-20 based on the variance test include:
C0,HIST2H2AC、HIST2H2AA3、HIST2H2AA4、HIST1H2AJ、HIST1H2AL、HIST1H2AG、 HIST1H2AM、HIST1H2AI、HIST1H2AK、HIST1H2AH、HIST1H2AD、HIST1H2AA、 HIST1H2AC、HIST1H2AE、HIST1H2AB、HIST3H2A、H2AFX、RPS3A、HIST1H1B、EEF1A2、 RAN、LMNB1、HIST2H2AB;
C1,ACTA1、DES、SYNM、MYL9、PGM5、SORBS1;
C2,TPM1、MYH11、TPM2、LMOD1、TAGLN、MYH10、EHD2、COL6A3、FLNC、 CNN1、HSPB6、TPM4、SYNPO2、MYLK、CALD1、DPYSL3、CRYAB、ACTN1、TF、TLN1、 VCL、HSPG2、TGFBI、CKB、TPM3、KNG1、PDLIM7、ILK、RRAS、CSRP1、LUM、TTR、 SOD3、MYL6、ALB、AOC3、OGN、ACTBL2;
C3,TUBB3、TUBA3D、TUBA3C、TUBA3E、CAPZA2、IQGAP1、TUBB4B、GDI1、 HPX、MYH9、TUBA4A、TUBB6、TUBA1A、TUBA1C、TUBA1B、TUBB4A、ANXA5、 PLS3、CAPZA1、ACTR2、SRSF3、NPEPPS、CLTCL1、SERPINH1、PDLIM3、HNRNPD、 TUBA8;
C4,HIST3H3、HIST1H3A、HIST1H3D、HIST2H3C、HIST2H3A、HIST2H3D、HIST1H3C、 HIST1H3J、HIST1H3I、HIST1H3B、HIST1H3F、HIST1H3E、HIST1H3G、HIST1H3H;
C5,X;
C6,HSPA8、DDX39A、EIF4A1、PABPC3、EEF1G、HNRNPM、RPN1、RPL4、NCL、 ILF3、DHX9、RPS9、RPS19、RPS4X、XRCC5、HNRNPR、HSP90AA1、SYNCRIP、RPS15A、EIF4A3、GANAB、RBMX、EEF2、RPL13A、RPL7、RPS16、PSMA6、RPS24、FUBP1、EIF4A2、 EIF2S1、RPL36、SRSF1、RPL7A、RPLP0、RPL6、TUBB8、ARF1、ARF3、RPSA、RAB14、 EEF1A1、RPL27、DHX15、ILF2、SRSF7、STIP1、HNRNPA2B1、RPL10A、RPL23A、ARF4、 RPS18、RPL38、NME1-NME2、GNB2L1、RPS6、PDIA6、HNRNPA1、PA2G4、RPS3、RAB7A、 NONO、RPL9、YWHAQ、QARS、PDIA3、HNRNPA3、CCT5、RPS20、XRCC6、HSP90AB1、 RPS28、PCBP2、HNRNPH2、DDX3X、ARPC1B、TAGLN2、EIF3F、DDOST、HSPA4、 HNRNPA1L2、CNDP2、PPIB、CTNND1、RPS13、HSPA9、PRKDC、RPS27A、UBB、UBA52、 UBC、CCT6A、RPL15、NME2、EIF3A、HNRNPC、RPL13、KHSRP、DDX5、SARNP、ALYREF、 ATP2A2、ELAVL1、RUVBL2、ATP6V1A、RAB5C、UBA1、CCT8、STT3B、HSPA2、COPB2、 DAD1、P4HB、RRBP1、RPL24、HSP90B1、EFHD2、PGD、DDX3Y、ANXA2、EEF1D、ARCN1、YWHAB、PKM、PSMA7、IDH1、SARS、SNRNP200、PSMD2、HNRNPF、TMED10、 HNRNPH1、ENO1、EIF6、HNRNPH3、PGK2、PARP1、HDLBP、STT3A、PRDX6、HSPD1、 PGK1、PHB、PPA1、HSPE1、DNM2、CAPN1、OTUB1、ATP6V1B2。
According to an eighth aspect of the present application, there is provided a device for typing gastric cancer in a population, comprising: a screening module B for screening proteins that meet the retention condition from a protein expression mass spectrum database formed from a plurality of samples, to serve as an effective protein set; a dimensionality reduction module B for sequentially performing two dimensionality reduction treatments on the effective protein set to obtain a dimensionality reduction protein set; and a clustering and typing module B for performing cluster analysis on the dimensionality reduction protein set to obtain different types divided according to protein markers of different classes.
Further, the dimension reduction module B includes: the principal component dimensionality reduction module B is used for performing primary dimensionality reduction treatment on the effective protein set by adopting a principal component analysis method to obtain a first dimensionality reduction data set; and the t-SNE dimensionality reduction module B is used for performing second dimensionality reduction treatment on the first dimensionality reduction data set by adopting t-SNE to obtain a dimensionality reduction protein set.
Further, the retention condition includes a quality condition and/or a frequency condition. The quality condition includes at least one of the following: the protein has at least two peptide fragments meeting the quality requirement; at least one of the peptide fragments is a unique peptide fragment meeting the quality requirement; and the protein has at least three peptide fragments meeting the quality requirement. The frequency condition is that the protein is present in at least 80% of the samples.
Further, in the t-SNE dimension reduction module B: the learning rate is set to 10 to 500, the number of iterations is set to 7500 or more, and the iteration stop is set to 100 to 400.
Furthermore, in the cluster classification module B, the average distance in the classes is less than or equal to 4.58, and the average distance between the classes is more than or equal to 9.68.
Further, the gastric cancer typing device further comprises a standardization processing module B for standardizing the effective protein set to obtain a standardized effective protein set; preferably, the standardization processing module B is used to make the mean value of the proteins in the effective protein set 0 and the variance 1.
Further, the protein markers of different classes are protein markers with a P value of less than 10^-35 based on a variance test; preferably, the protein markers with a P value of less than 10^-35 based on the variance test comprise: C0: H2AFZ, H2AFJ, H2AFV; C1: ACTG2, ACTA2, ACTC1; C2: FLNA, COL6A2, COL6A1; C3: TUBB2B, TUBB2A, TUBB; C4: H3F3C, H3F3A, H3F3B; C5: X; C6: PABPC1, HNRNPK, DDX39B.
Further, the protein markers of different classes also include protein markers with a P value of less than 10^-20 based on the variance test; preferably, the protein markers with a P value of less than 10^-20 based on the variance test include:
C0,HIST2H2AC、HIST2H2AA3、HIST2H2AA4、HIST1H2AJ、HIST1H2AL、HIST1H2AG、 HIST1H2AM、HIST1H2AI、HIST1H2AK、HIST1H2AH、HIST1H2AD、HIST1H2AA、 HIST1H2AC、HIST1H2AE、HIST1H2AB、HIST3H2A、H2AFX、RPS3A、HIST1H1B、EEF1A2、 RAN、LMNB1、HIST2H2AB;
C1,ACTA1、DES、SYNM、MYL9、PGM5、SORBS1;
C2,TPM1、MYH11、TPM2、LMOD1、TAGLN、MYH10、EHD2、COL6A3、FLNC、 CNN1、HSPB6、TPM4、SYNPO2、MYLK、CALD1、DPYSL3、CRYAB、ACTN1、TF、TLN1、VCL、HSPG2、TGFBI、CKB、TPM3、KNG1、PDLIM7、ILK、RRAS、CSRP1、LUM、TTR、 SOD3、MYL6、ALB、AOC3、OGN、ACTBL2;
C3,TUBB3、TUBA3D、TUBA3C、TUBA3E、CAPZA2、IQGAP1、TUBB4B、GDI1、 HPX、MYH9、TUBA4A、TUBB6、TUBA1A、TUBA1C、TUBA1B、TUBB4A、ANXA5、 PLS3、CAPZA1、ACTR2、SRSF3、NPEPPS、CLTCL1、SERPINH1、PDLIM3、HNRNPD、 TUBA8;
C4,HIST3H3、HIST1H3A、HIST1H3D、HIST2H3C、HIST2H3A、HIST2H3D、HIST1H3C、 HIST1H3J、HIST1H3I、HIST1H3B、HIST1H3F、HIST1H3E、HIST1H3G、HIST1H3H;
C5,X;
C6,HSPA8、DDX39A、EIF4A1、PABPC3、EEF1G、HNRNPM、RPN1、RPL4、NCL、 ILF3、DHX9、RPS9、RPS19、RPS4X、XRCC5、HNRNPR、HSP90AA1、SYNCRIP、RPS15A、 EIF4A3、GANAB、RBMX、EEF2、RPL13A、RPL7、RPS16、PSMA6、RPS24、FUBP1、EIF4A2、 EIF2S1、RPL36、SRSF1、RPL7A、RPLP0、RPL6、TUBB8、ARF1、ARF3、RPSA、RAB14、 EEF1A1、RPL27、DHX15、ILF2、SRSF7、STIP1、HNRNPA2B1、RPL10A、RPL23A、ARF4、 RPS18、RPL38、NME1-NME2、GNB2L1、RPS6、PDIA6、HNRNPA1、PA2G4、RPS3、RAB7A、 NONO、RPL9、YWHAQ、QARS、PDIA3、HNRNPA3、CCT5、RPS20、XRCC6、HSP90AB1、 RPS28、PCBP2、HNRNPH2、DDX3X、ARPC1B、TAGLN2、EIF3F、DDOST、HSPA4、 HNRNPA1L2、CNDP2、PPIB、CTNND1、RPS13、HSPA9、PRKDC、RPS27A、UBB、UBA52、 UBC、CCT6A、RPL15、NME2、EIF3A、HNRNPC、RPL13、KHSRP、DDX5、SARNP、ALYREF、 ATP2A2、ELAVL1、RUVBL2、ATP6V1A、RAB5C、UBA1、CCT8、STT3B、HSPA2、COPB2、 DAD1、P4HB、RRBP1、RPL24、HSP90B1、EFHD2、PGD、DDX3Y、ANXA2、EEF1D、 ARCN1、YWHAB、PKM、PSMA7、IDH1、SARS、SNRNP200、PSMD2、HNRNPF、TMED10、 HNRNPH1、ENO1、EIF6、HNRNPH3、PGK2、PARP1、HDLBP、STT3A、PRDX6、HSPD1、 PGK1、PHB、PPA1、HSPE1、DNM2、CAPN1、OTUB1、ATP6V1B2。
According to a ninth aspect of the present application, there is provided a device for typing a gastric cancer sample of an individual, the device comprising: a gastric cancer typing device for typing gastric cancer according to a known gastric cancer sample set, the gastric cancer typing device being any one of the above gastric cancer typing devices; a screening module C for screening proteins that meet the retention condition from the protein mass spectrum data of a gastric cancer sample to be detected, to serve as a protein set to be detected; a dimensionality reduction module C for sequentially performing two dimensionality reduction treatments on the protein set to be detected under the same conditions as those used in the gastric cancer typing device for the effective protein set of the known gastric cancer sample set, to obtain a dimensionality reduction protein set of the gastric cancer sample to be detected; and a clustering and typing module C for performing cluster analysis on the dimensionality reduction protein set of the gastric cancer sample to be detected to obtain the type with the highest similarity in the known gastric cancer sample set, which is the type corresponding to the gastric cancer sample to be detected.
Further, the dimension reduction module C includes: the principal component dimensionality reduction module C is used for performing primary dimensionality reduction treatment on the protein set to be detected by adopting a principal component analysis method to obtain a first dimensionality reduction data set; and the t-SNE dimensionality reduction module C is used for performing second dimensionality reduction treatment on the first dimensionality reduction data set by adopting t-SNE to obtain a dimensionality reduction protein set.
Further, the retention condition includes a quality condition and/or a frequency condition. The quality condition includes at least one of the following: the protein has at least two peptide fragments meeting the quality requirement; at least one of the peptide fragments is a unique peptide fragment meeting the quality requirement; and the protein has at least three peptide fragments meeting the quality requirement. The frequency condition is that the protein is present in at least 80% of the samples in the known gastric cancer sample set.
Furthermore, in the principal component dimension reduction module C, the parameter setting of the dimension reduction processing is the same as the parameter of the principal component analysis dimension reduction processing performed on the protein mass spectrum data in the known gastric cancer sample set in the typing device.
Further, in the t-SNE dimension reduction module C: the parameter setting of the dimensionality reduction treatment is the same as the parameter of the t-SNE dimensionality reduction treatment on protein mass spectrum data in a known gastric cancer sample set in a typing device.
Furthermore, in the cluster classification module C, the average distance in the classes is less than or equal to 4.58, and the average distance between the classes is more than or equal to 9.68.
Further, the device also comprises a standardization processing module C for carrying out standardization processing on the protein set to be detected to obtain a standardized protein set to be detected; preferably, the normalization processing module C is configured to make the mean value of the proteins in the protein set to be tested 0 and the variance 1.
According to a tenth aspect of the present application, there is provided a storage medium comprising a stored program, wherein the program performs any one of the above-described methods for screening gastric cancer typing protein markers; or the program performs any one of the above methods for typing gastric cancer in a human population; or the program performs any one of the above methods for typing a gastric cancer sample of an individual.
According to an eleventh aspect of the present application, there is provided a processor for running a program, wherein the program performs any one of the above-described methods for screening gastric cancer typing protein markers; or the program performs any one of the above methods for typing gastric cancer in a human population; or the program performs any one of the above methods for typing a gastric cancer sample of an individual.
The technical scheme of the invention is applied.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 shows the number of proteins (3109-4064) identified in each of 50 samples according to example 1 of the present invention.
FIG. 2 shows that 7 different classes (cluster) separated by cluster analysis in example 1 correspond to 7 gastric cancer classifications.
Fig. 3 shows the significant difference in survival rates between the different types of gastric cancer in fig. 2, with the prognosis for survival being best for type C0, followed by type C4, and worst for type C2.
FIG. 4 shows the relationship between chemotherapy treatment and survival prognosis for the patients typed as C6 in FIG. 2.
Fig. 5 shows the relationship between the post-operative survival time and the typing results of 224 samples in example 3 according to the present invention, the post-operative survival time showed the best prognosis for type C0, followed by type C4, and worst type C2, consistent with the predicted results of typing provided in example 1.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail with reference to examples.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "… … A module," "… … B module," "… … C module," and the like in the description and claims of this application are used for distinguishing similar objects and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances in order to facilitate the description of the embodiments of the application herein. Furthermore, the terms "comprises," "comprising," and any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For convenience of description, some terms or expressions referred to in the embodiments of the present application are explained below:
Proteomics: the science of studying life phenomena and their laws at the protein level in an integral, dynamic and quantitative manner, using high-resolution protein separation technology and high-efficiency protein identification technology.
Qualitative principle of proteomics: with the development of high performance liquid chromatography and electrostatic field orbitrap mass spectrometry, liquid chromatography tandem mass spectrometry (LC-MS/MS) has become the main technology for proteomics research. The basic (bottom-up) steps for identifying proteins are: collect the samples, extract the total protein, digest the protein into peptide fragments, separate the peptides by HPLC, pass the fractions in sequence into the mass spectrometer for ionization, obtain the mass-to-charge ratio and peak shape of each ion by MS, calculate the amino acid composition by software, and obtain the identity and sequence information of the protein by database search and comparison.
Dimensionality reduction: techniques for representing multidimensional data in 2 or 3 dimensions. Common dimensionality reduction methods include principal component analysis, t-SNE, Sammon mapping, isometric mapping (Isomap), locally linear embedding, canonical correlation analysis, SNE, minimum variance unbiased estimation, Laplacian eigenmaps, and the like.
Principal component analysis (PCA): a multivariate statistical analysis method in which a plurality of variables are linearly transformed to select a smaller number of important variables.
t-SNE: t-distributed stochastic neighbor embedding, a nonlinear dimensionality reduction algorithm for mining high-dimensional data. It maps multidimensional data into 2 or 3 dimensions suitable for human observation.
Cluster analysis: a multivariate statistical analysis method for classifying samples or indexes on the principle that like gathers with like. The objects under study are a large number of samples that are to be classified reasonably according to their respective characteristics, with no model available for reference, i.e., without prior knowledge. It divides the study objects into relatively homogeneous groups (clusters), so that objects in the same cluster are highly similar while objects in different clusters are highly dissimilar.
Mass spectral data generally have the following characteristic parameters: PSM, Peptide, Unique peptides, Strict peptides and Protein. A PSM (peptide-spectrum match) is obtained by comparing the polypeptides in a database with a mass spectrum and outputting the polypeptide with the highest score; the higher the PSM score, the higher the reliability. Proteins are assembled from peptides, so one protein can correspond to many peptides, and the greater the number of peptides detected, the greater the probability that the protein has indeed been identified. Some peptides detected by mass spectrometry are present in only one protein; when such a peptide is identified, the corresponding protein can be confidently considered present, and these peptides are called unique peptides. In addition, if a peptide's Mascot ion score in the database search software is high (more than 20 points), the peptide is called a high-quality peptide (strict peptide), and this measure also helps to characterize the presence of the protein.
As mentioned in the background art, the gastric cancer typing criteria of the prior art are not accurate enough. To further improve the accuracy of gastric cancer typing, the present inventors studied the existing typing schemes and concluded that they are inaccurate because they type the cancer either by the diseased site or by the apparent symptoms. The ACRG typing does type gastric cancer at the molecular level, but only at the gene level, and actual life activities cannot be predicted from the genome alone: proteins are the executors of life activities, so only changes at the proteome level come close to the actual state of life activity. The present application therefore explores gastric cancer-related protein markers from the perspective of proteomics and types gastric cancer according to the relationship between the distinct molecular characteristics of the protein markers and clinical postoperative survival time. The typing result is relatively more accurate, as is the survival prognosis evaluation for different gastric cancer conditions, which better supports medication guidance for chemotherapy and gives the method high clinical application value.
On the basis of the above research results, the applicant proposed the technical solution of the present application. In a typical embodiment, there is provided a method of screening for a gastric cancer-typing protein marker, the method comprising: screening proteins meeting the retention condition from a protein expression mass spectrum database formed by a plurality of samples to serve as an effective protein set; sequentially carrying out dimensionality reduction treatment on the effective protein set twice to obtain a dimensionality reduction protein set; and performing clustering analysis on the dimensionality reduction protein set to obtain different types of protein markers.
According to the method for screening the gastric cancer typing protein marker, the effective protein set is screened from the protein expression mass spectrum database formed by a plurality of samples, so that the protein with low accuracy is eliminated, and the accuracy of the protein for detection is improved; and then, obtaining proteins closely related to the proteome and a relation library of clinical information through two times of dimensionality reduction treatment, and then carrying out clustering analysis on the proteins so as to obtain different classes of protein markers. The method screens protein markers which are obviously and highly expressed in the cancer samples from a large number of protein spectrum databases of the gastric cancer samples, and classifies the gastric cancer into different classes of protein markers according to the correlation between different protein markers and survival rates, and the markers have obvious difference between different classes, so that the gastric cancer can be classified more accurately.
The protein expression mass spectrum database formed from the plurality of samples is accompanied by known information on the known gastric cancer cases, including age, sex, cancer cell ratio, Lauren type, TNM type, survival time, death, chemotherapy regimen, and the like.
The two dimensionality reduction treatments are to remove proteins with small change amplitude, combine proteins with high correlation with clinical information, reduce complexity of model construction, and facilitate calculation of distance between samples so as to ensure that distance between similar individuals is as small as possible, so that any dimensionality reduction treatment method capable of achieving the purpose is suitable for the application. In order to capture the gastric cancer features in a large amount of protein information and achieve better effect of gathering similar gastric cancer patients, in a preferred embodiment of the present application, the two dimensionality reduction treatments comprise: carrying out primary dimensionality reduction treatment on the effective protein set by adopting a principal component analysis method to obtain a first dimensionality reduction data set; and performing second dimensionality reduction treatment on the first dimensionality reduction data set by using t-SNE to obtain a dimensionality reduction protein set.
The PCA dimension reduction method is to perform dimension reduction from the perspective of overall similarity of a sample collection, while the t-SNE dimension reduction method focuses more on local similarity of samples.
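The contrast between the two methods can be made concrete with their standard textbook formulations. The following is a sketch only: the notation (X, S, w_k, p_ij, q_ij, sigma_i) is introduced here for illustration and does not appear in the application, which does not state the formulas it relies on.

```latex
% PCA: X is the centred n-by-p data matrix (n samples, p proteins).
% The d leading eigenvectors of the covariance matrix S capture the
% directions of largest overall variance; Z holds the reduced data.
S = \frac{1}{n-1} X^{\top} X, \qquad
S w_k = \lambda_k w_k, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p, \qquad
Z = X \, [\, w_1, \dots, w_d \,]

% t-SNE: x_i are the inputs (here, the PCA-reduced samples) and y_i the
% 2- or 3-dimensional embedding. Neighbour probabilities are matched
% by minimising a Kullback-Leibler divergence.
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \ne i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \ne l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}, \qquad
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Because the Gaussian weights p_{j|i} decay quickly with distance, the t-SNE cost is dominated by pairs of nearby samples, which is the sense in which it emphasizes local similarity, whereas the PCA projection is determined by the variance of the whole sample collection.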
In the process of screening the effective protein set from the database, the specific screening conditions can be reasonably adjusted according to the actual mass spectrum data. In some preferred embodiments of the present application, the retention condition includes a quality condition and/or a frequency condition. The quality condition includes at least one of: the protein has two peptide fragments satisfying the quality requirement, at least one of which is a unique peptide fragment satisfying the quality requirement (also translated as a specific peptide fragment, referring to a peptide fragment that can occur in only one protein); or the protein has at least three peptide fragments satisfying the quality requirement. The frequency condition is that the protein is present in at least 80% of the samples.
The quality requirement is set reasonably according to the spectrum quality of the peptide fragments in the database. To improve the accuracy of the screened proteins, the quality requirement in the present application is that the peptide fragment be of high quality, meaning that its Mascot score (ion score) is greater than 20. A high-quality unique peptide fragment is a peptide fragment that can occur in only one protein and has an ion score greater than 20. Having two peptide fragments meeting the quality requirement means that the protein comprises at least two peptide fragments with a score greater than 20; having at least three peptide fragments meeting the quality requirement means that the protein comprises at least three peptide fragments with a score greater than 20. The frequency condition means that the protein is present in more than 80% of the samples in the database, which ensures that the protein is a commonly occurring protein.
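The retention conditions above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the `Peptide` record, the function names, and the choice to require both the quality and the frequency condition (the text allows "and/or") are hypothetical, and a protein is counted as present in a sample only when it passes the quality condition there.

```python
from dataclasses import dataclass

ION_SCORE_CUTOFF = 20   # "high quality" threshold stated in the text
MIN_FREQUENCY = 0.8     # protein must be present in at least 80% of samples


@dataclass
class Peptide:
    ion_score: float    # Mascot ion score of the peptide
    is_unique: bool     # True if the peptide can occur in only one protein


def passes_quality(peptides):
    """Quality condition: two high-quality peptides of which at least one
    is unique, or at least three high-quality peptides."""
    strict = [p for p in peptides if p.ion_score > ION_SCORE_CUTOFF]
    has_unique = any(p.is_unique for p in strict)
    return (len(strict) >= 2 and has_unique) or len(strict) >= 3


def effective_protein_set(samples):
    """samples: list of dicts mapping protein name -> list of Peptide.
    Assumption: both conditions are applied together. Returns the proteins
    that pass the quality condition in at least MIN_FREQUENCY of samples."""
    counts = {}
    for sample in samples:
        for protein, peptides in sample.items():
            if passes_quality(peptides):
                counts[protein] = counts.get(protein, 0) + 1
    needed = MIN_FREQUENCY * len(samples)
    return {protein for protein, n in counts.items() if n >= needed}
```

Applying the quality check per sample before counting frequency keeps low-confidence identifications from inflating a protein's apparent prevalence; a different reading of "and/or" would simply drop one of the two checks.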
In the first dimensionality reduction treatment, the dimensionality is reduced by principal component analysis (PCA) with default settings.
In a preferred embodiment, in the step of performing the second dimension reduction processing on the first dimension reduction dataset by using t-SNE, the learning rate is set to 10 to 500, the number of iterations is set to 7500 or more, and the iteration stop is set to 100 to 400.
In the method, the clustering analysis step can be obtained by reasonably adjusting on the basis of the existing clustering analysis principle, the specific adjustment principle is to ensure that the intra-class distance is as small as possible and the inter-class distance is as large as possible after repeated iteration, and when the result is converged, the calculation is terminated to obtain the clustering result. In a preferred embodiment of the present application, in the step of cluster analysis, the average distance within a class is less than or equal to 4.58, and the average distance between classes is greater than or equal to 9.68.
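The two thresholds can be checked on a clustering result with a short sketch such as the one below. The application does not define how the average intra-class and inter-class distances are computed, so the definitions used here are assumptions: Euclidean distance in the reduced space, averaged over all same-class pairs and all different-class pairs respectively. The function names are hypothetical.

```python
from itertools import combinations
from math import dist


def average_distances(points, labels):
    """Return (mean intra-class distance, mean inter-class distance).
    Assumption: both are plain averages of pairwise Euclidean distances."""
    intra, inter = [], []
    for (a, label_a), (b, label_b) in combinations(zip(points, labels), 2):
        (intra if label_a == label_b else inter).append(dist(a, b))
    if not intra or not inter:
        raise ValueError("need at least one same-class and one cross-class pair")
    return sum(intra) / len(intra), sum(inter) / len(inter)


def meets_thresholds(points, labels, max_intra=4.58, min_inter=9.68):
    """Check the preferred thresholds quoted in the text."""
    intra, inter = average_distances(points, labels)
    return intra <= max_intra and inter >= min_inter
```

The numeric thresholds only make sense in the coordinate scale of the embedding that produced them, so they should be read together with the standardization and dimensionality reduction settings used.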
Before data analysis, it is usually necessary to standardize (normalize) the data and then analyze the standardized data. Data standardization comprises two aspects: making the data trend in the same direction (co-trending) and making the data dimensionless. Nondimensionalization mainly addresses the comparability of the data. There are various methods, and after standardization the raw data are all converted into dimensionless index values, i.e., every index is on the same order of magnitude, so that comprehensive evaluation and analysis can be carried out.
For the mass spectrometry data of proteomics, the method also comprises the step of standardizing the effective protein set to obtain a standardized effective protein set before the effective protein set is subjected to two times of dimensionality reduction treatment in sequence. The specific normalization process may be performed by an existing method, such as Z-score normalization. In a preferred embodiment, the normalization process is performed such that the mean value of the proteins in the effective protein set is 0 and the variance is 1.
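Z-score standardization to mean 0 and variance 1 can be sketched as follows. The sketch assumes that each protein (column) is standardized across samples (rows) and that the population variance is used; the application does not state either choice, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def z_score_columns(matrix):
    """matrix: list of rows (samples), each a list of protein abundances.
    Returns a new matrix in which every protein column has mean 0 and
    population variance 1. Constant columns are mapped to all zeros."""
    scaled_columns = []
    for column in zip(*matrix):
        mu, sigma = mean(column), pstdev(column)
        if sigma == 0:
            scaled_columns.append([0.0] * len(column))
        else:
            scaled_columns.append([(x - mu) / sigma for x in column])
    # Transpose back to rows = samples.
    return [list(row) for row in zip(*scaled_columns)]
```

For a sample to be typed later, the same per-protein mean and standard deviation obtained from the known sample set would normally be reused rather than recomputed, which is consistent with the requirement elsewhere in the text that the same conditions be applied.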
The method for screening gastric cancer typing protein markers yields different classes of protein markers from the cluster analysis. In a preferred embodiment of the present application, the different classes of protein markers are those with a P value of less than 10^-35 based on the variance test. In a more preferred embodiment of the present application, the different classes of protein markers with a P value of less than 10^-35 based on the variance test comprise: C0: H2AFZ, H2AFJ, H2AFV; C1: ACTG2, ACTA2, ACTC1; C2: FLNA, COL6A2, COL6A1; C3: TUBB2B, TUBB2A, TUBB; C4: H3F3C, H3F3A, H3F3B; C5: X; C6: PABPC1, HNRNPK, DDX39B.
Since the protein markers of the different classes have extremely significant differences in expression levels between the class to which they belong and the remaining classes, they can accurately mark the molecular characteristics of gastric cancers of different types (i.e., protein expression characteristics, which means that the expression levels of the protein markers in each class are significantly higher than those of the corresponding proteins in the other classes). In the above C5 class, X means that no protein is expressed in the class in a significantly higher amount than in other classes.
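The "variance test" is not specified further in the text. Assuming it refers to a one-way analysis of variance across the classes, the test statistic for a single protein could be computed as in the sketch below, which also reports the class with the highest mean expression. The function names are hypothetical, and converting F into a P value requires the F distribution (for example via `scipy.stats.f_oneway`), which is left out to keep the sketch dependency-free.

```python
def one_way_anova_f(groups):
    """groups: one list of expression values per class for a single protein.
    Returns the one-way ANOVA F statistic (between-class mean square
    divided by within-class mean square)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def highest_class(groups):
    """Index of the class whose mean expression is highest."""
    means = [sum(g) / len(g) for g in groups]
    return means.index(max(means))
```

A protein would then be a candidate marker for the class returned by `highest_class` when the corresponding P value falls below the chosen cutoff.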
When the protein markers of the above categories are used for typing gastric cancer, typing can be based on any one protein marker in each category being expressed at a significantly higher level than in the remaining categories, or on any two or three protein markers in each category being expressed at significantly higher levels than in the remaining categories. The more protein markers used, the more accurate the typing result.
In order to further improve the accuracy of gastric cancer typing, in other preferred embodiments of the present application, the protein markers of different classes include, in addition to the above-mentioned protein markers, protein markers with a P value of less than 10^-20 based on the variance test. In some more preferred embodiments, the protein markers with a P value of less than 10^-20 based on the variance test comprise:
C0: HIST2H2AC, HIST2H2AA3, HIST2H2AA4, HIST1H2AJ, HIST1H2AL, HIST1H2AG, HIST1H2AM, HIST1H2AI, HIST1H2AK, HIST1H2AH, HIST1H2AD, HIST1H2AA, HIST1H2AC, HIST1H2AE, HIST1H2AB, HIST3H2A, H2AFX, RPS3A, HIST1H1B, EEF1A2, RAN, LMNB1, HIST2H2AB;
C1: ACTA1, DES, SYNM, MYL9, PGM5, SORBS1;
C2: TPM1, MYH11, TPM2, LMOD1, TAGLN, MYH10, EHD2, COL6A3, FLNC, CNN1, HSPB6, TPM4, SYNPO2, MYLK, CALD1, DPYSL3, CRYAB, ACTN1, TF, TLN1, VCL, HSPG2, TGFBI, CKB, TPM3, KNG1, PDLIM7, ILK, RRAS, CSRP1, LUM, TTR, SOD3, MYL6, ALB, AOC3, OGN, ACTBL2;
C3: TUBB3, TUBA3D, TUBA3C, TUBA3E, CAPZA2, IQGAP1, TUBB4B, GDI1, HPX, MYH9, TUBA4A, TUBB6, TUBA1A, TUBA1C, TUBA1B, TUBB4A, ANXA5, PLS3, CAPZA1, ACTR2, SRSF3, NPEPPS, CLTCL1, SERPINH1, PDLIM3, HNRNPD, TUBA8;
C4: HIST3H3, HIST1H3A, HIST1H3D, HIST2H3C, HIST2H3A, HIST2H3D, HIST1H3C, HIST1H3J, HIST1H3I, HIST1H3B, HIST1H3F, HIST1H3E, HIST1H3G, HIST1H3H;
C5: X;
C6: HSPA8, DDX39A, EIF4A1, PABPC3, EEF1G, HNRNPM, RPN1, RPL4, NCL, ILF3, DHX9, RPS9, RPS19, RPS4X, XRCC5, HNRNPR, HSP90AA1, SYNCRIP, RPS15A, EIF4A3, GANAB, RBMX, EEF2, RPL13A, RPL7, RPS16, PSMA6, RPS24, FUBP1, EIF4A2, EIF2S1, RPL36, SRSF1, RPL7A, RPLP0, RPL6, TUBB8, ARF1, ARF3, RPSA, RAB14, EEF1A1, RPL27, DHX15, ILF2, SRSF7, STIP1, HNRNPA2B1, RPL10A, RPL23A, ARF4, RPS18, RPL38, NME1-NME2, GNB2L1, RPS6, PDIA6, HNRNPA1, PA2G4, RPS3, RAB7A, NONO, RPL9, YWHAQ, QARS, PDIA3, HNRNPA3, CCT5, RPS20, XRCC6, HSP90AB1, RPS28, PCBP2, HNRNPH2, DDX3X, ARPC1B, TAGLN2, EIF3F, DDOST, HSPA4, HNRNPA1L2, CNDP2, PPIB, CTNND1, RPS13, HSPA9, PRKDC, RPS27A, UBB, UBA52, UBC, CCT6A, RPL15, NME2, EIF3A, HNRNPC, RPL13, KHSRP, DDX5, SARNP, ALYREF, ATP2A2, ELAVL1, RUVBL2, ATP6V1A, RAB5C, UBA1, CCT8, STT3B, HSPA2, COPB2, DAD1, P4HB, RRBP1, RPL24, HSP90B1, EFHD2, PGD, DDX3Y, ANXA2, EEF1D, ARCN1, YWHAB, PKM, PSMA7, IDH1, SARS, SNRNP200, PSMD2, HNRNPF, TMED10, HNRNPH1, ENO1, EIF6, HNRNPH3, PGK2, PARP1, HDLBP, STT3A, PRDX6, HSPD1, PGK1, PHB, PPA1, HSPE1, DNM2, CAPN1, OTUB1, ATP6V1B2.
When the above protein markers are used for typing, the more protein markers used per type, the more accurate the typing of gastric cancer. When all of the above protein markers are used simultaneously for analysis, the typing result is more accurate still.
The protein marker screened by the method for screening the gastric cancer typing marker can more accurately type gastric cancer from the protein molecule level, and the 5-year survival rates corresponding to different types are different. Therefore, any unknown gastric cancer sample can be detected or diagnosed by using the different types of the protein markers.
It should be noted that, the method for screening gastric cancer type protein markers of the present application can screen protein markers belonging to different categories, and the method for screening protein markers can also be used for typing gastric cancer, i.e., the method for typing gastric cancer in a population is the same as the method for screening gastric cancer type protein markers.
Of course, the method for typing gastric cancer in a human may be a method for typing using the protein marker. In a second exemplary embodiment of the present application, there is also provided a method for typing a gastric cancer in a human, the method comprising typing a gastric cancer using a protein marker selected by any one of the methods for screening a gastric cancer-typing protein marker described above to obtain different gastric cancer types. The typing method of the present application is typing at the protein level, which is closer to the true life state and thus relatively more accurate.
The protein markers screened by the method for screening gastric cancer protein markers, and the different gastric cancer types defined by those protein markers, can be used for gastric cancer screening or survival prognosis typing of any individual sample, and are particularly suitable for accurately typing individual gastric cancer cases. Thus, in a third exemplary embodiment of the present application, there is also provided a method for typing a gastric cancer sample of an individual, comprising typing according to a gastric cancer typing standard, wherein the gastric cancer typing standard is the set of different types defined by the different classes of protein markers screened by any one of the above screening methods, or the set of different types obtained by the above method for typing gastric cancer in a human population. Using the different types defined by the protein markers makes the typing diagnosis of the individual more accurate.
In a fourth exemplary embodiment of the present application, there is also provided a method for typing a gastric cancer sample in an individual, the method comprising: typing a known gastric cancer sample set according to any one of the above methods for typing gastric cancer in a population; screening out proteins meeting the retention condition from protein mass spectrum data of a gastric cancer sample to be detected, and taking the proteins as a protein set to be detected; sequentially carrying out dimensionality reduction treatment on the protein set to be detected twice according to the same conditions with the effective protein set in the known gastric cancer sample set to obtain a dimensionality reduction protein set of the gastric cancer sample to be detected; and performing cluster analysis on the dimensionality reduction protein set of the gastric cancer sample to be detected to obtain a type with the highest similarity to the known gastric cancer sample set, namely the type corresponding to the gastric cancer sample to be detected.
The method for typing an individual gastric cancer sample is a comparison method based on the similarity between the individual sample and a plurality of known gastric cancer samples in a database. Typing information linking the protein mass spectrum data to the clinical information knowledge base is first established from the existing sample library, which yields the specific parameters for data standardization, the two dimensionality reductions and the clustering. The protein mass spectrum data of the sample to be tested are then compared with those of the known samples to find the type of the known samples with the highest similarity, together with the corresponding prognosis information for chemotherapy and treatment regimens.
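The final matching step can be sketched as a nearest-centroid assignment. This is an illustration under assumptions rather than the method of the application: the text does not define the similarity measure, so Euclidean distance to each type's centroid is assumed here, and the sample to be tested is assumed to have already been standardized and reduced into the same space as the known samples. The function names are hypothetical.

```python
from math import dist


def centroids(points, labels):
    """Mean position of the known samples of each type."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc)
            for label, acc in sums.items()}


def assign_type(sample, points, labels):
    """Return the type whose centroid is closest to the sample
    (assumed similarity measure: Euclidean distance)."""
    cents = centroids(points, labels)
    return min(cents, key=lambda label: dist(sample, cents[label]))
```

For example, with known samples at (0, 0) and (0, 2) typed C0 and samples at (10, 10) and (12, 10) typed C2, a sample at (1, 1) is assigned to C0. The prognosis information recorded for that type can then be looked up for the individual.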
In the above typing method, the purpose of the two dimensionality reduction treatments is the same as in the population gastric cancer typing method: to remove proteins with small variation and to combine proteins highly correlated with clinical information. This reduces the complexity of the matching model and facilitates calculating the distance between the sample to be tested and each known sample, which in turn keeps the distance between similar individuals as small as possible, so that the match between the sample to be tested and the known samples, and hence the typing result of the sample to be tested, is more accurate.
Therefore, any dimension reduction method capable of achieving the above-described effects is applicable to the present application. In a preferred embodiment of the present application, the two dimension reduction processes include: performing primary dimensionality reduction treatment on a protein set to be detected by adopting a principal component analysis method to obtain a first dimensionality reduction data set; and performing second dimensionality reduction treatment on the first dimensionality reduction data set by using t-SNE to obtain a dimensionality reduction protein set.
The PCA dimension reduction method is to perform dimension reduction from the perspective of overall similarity of a sample collection, while the t-SNE dimension reduction method focuses more on local similarity of samples.
The parameters of the principal component analysis and of the t-SNE dimensionality reduction are set to the values obtained during gastric cancer typing of the protein spectrum database of the known samples, i.e., they are the same as the parameters in the gastric cancer typing method.
In the above typing method, the retention conditions, including the quality condition and/or the frequency condition, may be set according to the quality of the protein mass spectrum data of the samples actually used. In a preferred embodiment of the present application, the quality condition includes at least one of: the protein has two peptide fragments meeting the quality requirement, at least one of which is a unique peptide fragment meeting the quality requirement; or the protein has at least three peptide fragments meeting the quality requirement. The frequency condition is that the protein is present in at least 80% of the samples in the known gastric cancer sample set.
The quality requirement is the same high-quality peptide fragment requirement as that applied to the known gastric cancer sample set. "High quality" means that the Mascot score (ion score) of the peptide fragment is greater than 20. A high-quality unique peptide fragment is a peptide fragment that can occur in only one protein and has an ion score greater than 20. Having two peptide fragments meeting the quality requirement means that the protein comprises at least two peptide fragments with a score greater than 20; having at least three peptide fragments meeting the quality requirement means that the protein comprises at least three peptide fragments with a score greater than 20. The frequency condition means that the protein is present in more than 80% of the samples in the database, which ensures that the protein is a commonly occurring protein.
In the method, the clustering analysis step can be obtained by reasonably adjusting on the basis of the existing clustering analysis principle, and the specific adjustment principle is to ensure that the intra-class distance is as small as possible and the inter-class distance is as large as possible. In a preferred embodiment of the present application, in the step of cluster analysis, the average distance within a class is less than or equal to 4.58, and the average distance between classes is greater than or equal to 9.68.
Before data analysis, it is usually necessary to standardize (normalize) the data and then analyze the standardized data. Data standardization comprises two aspects: making the data trend in the same direction (co-trending) and making the data dimensionless. Nondimensionalization mainly addresses the comparability of the data. There are various methods, and after standardization the raw data are all converted into dimensionless index values, i.e., every index is on the same order of magnitude, so that comprehensive evaluation and analysis can be carried out.
For the mass spectrometry data of proteomics, the method also comprises the step of standardizing the protein set to be detected to obtain a standardized protein set to be detected before performing the dimension reduction treatment twice on the protein set to be detected. The specific normalization process may be performed by an existing method, such as Z-score normalization. In a preferred embodiment, the normalization process is performed such that the mean value of the proteins in the test protein set is 0 and the variance is 1.
In a fifth exemplary embodiment of the present application, there is also provided a reagent, a kit or a chip for detecting gastric cancer, wherein the reagent, the kit or the chip comprises different classes of protein markers obtained by any one of the methods described above. The protein marker screened by the method has the advantages that the prognosis of different gastric cancer patients can be predicted relatively more accurately by typing, and the method is favorable for more accurately evaluating whether the patients are subjected to chemotherapy or not and the curative effect of various medication schemes.
The detection in the reagent, kit or chip for detecting gastric cancer includes any one or more of typing diagnosis, survival prognosis evaluation and screening of chemotherapeutic drugs.
In a sixth exemplary embodiment of the present application, there is further provided a use of the different classes of protein markers obtained by any one of the above-mentioned methods in the preparation of a reagent, a kit or a chip for detecting gastric cancer.
In a preferred embodiment, the protein marker is a protein marker whose expression level is significantly higher in the belonging class than in the remaining classes. In other preferred embodiments, the protein markers are classified into 7 classes (corresponding to the 7 classes of protein markers in the screening method, which is not described herein), and the expression level of each class of protein markers in each class is significantly higher than that in the other classes.
Applications herein include, but are not limited to, typing, prognosis of survival, and screening for chemotherapeutic drugs.
In a seventh exemplary embodiment of the present application, there is also provided a screening apparatus for a gastric cancer typing protein marker, including: the screening module A is used for screening out proteins meeting the retention conditions from a protein expression mass spectrum database formed by a plurality of samples to be used as an effective protein set; the dimensionality reduction module A is used for sequentially carrying out dimensionality reduction treatment on the effective protein set twice to obtain a dimensionality reduction protein set; and the clustering and typing module A is used for carrying out clustering analysis on the dimensionality reduction protein set to obtain different types of protein markers.
The screening device first runs screening module A to screen an effective protein set from the database, thereby excluding proteins of low accuracy and improving the accuracy of the proteins used for detection. Dimensionality reduction module A then performs two rounds of dimensionality reduction on the effective protein set to obtain the proteins that are closely associated with the relational database of proteome and clinical information, and cluster typing module A performs cluster analysis on these proteins, so that different classes of protein markers are obtained; the different gastric cancer types marked by these classes correspond to clinically different 5-year survival rates. The screening device can screen, from the protein profile databases of a large number of gastric cancer samples, protein markers that are significantly highly expressed in cancer samples, and can divide the protein markers into different classes according to the relationship between the markers and the 5-year survival rate. Because the markers differ significantly between classes, gastric cancer is typed more accurately.
In a preferred embodiment, the dimension reduction module a comprises: the system comprises a principal component dimensionality reduction module A and a t-SNE dimensionality reduction module A, wherein the principal component dimensionality reduction module A is used for performing primary dimensionality reduction processing on an effective protein set by adopting a principal component analysis method to obtain a first dimensionality reduction data set; and the t-SNE dimensionality reduction module A is used for carrying out second dimensionality reduction treatment on the first dimensionality reduction data set by adopting t-SNE to obtain a dimensionality reduction protein set.
The principal component dimension reduction module A adopts a PCA dimension reduction method to reduce the dimension from the perspective of the overall similarity of the sample total set, and the t-SNE dimension reduction module A adopts a t-SNE dimension reduction method to pay more attention to the local similarity of the samples. In the preferred embodiment, the principal component dimension reduction module A is adopted for dimension reduction, and then the t-SNE dimension reduction module A is adopted for dimension reduction, so that the beneficial effects of both total and local are achieved.
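To make the "overall similarity" view of the first reduction step concrete, the sketch below extracts the first principal component of a small data set and reports the share of total variance it explains. It is an illustration only: it uses power iteration on the covariance matrix instead of the singular value decomposition mentioned elsewhere in the source, it covers only the principal component step (t-SNE is not shown), and the function name `first_principal_component` and the toy data are hypothetical.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (direction, explained variance ratio, per-sample scores)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    explained = eigenvalue / sum(cov[i][i] for i in range(d))
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, explained, scores

# Hypothetical data: two perfectly correlated proteins in four samples.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, explained_ratio, scores = first_principal_component(samples)
```

In this toy case a single component captures all of the variance, which is the situation in which replacing many correlated proteins by a few components loses the least information.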
In a preferred embodiment, the retention condition in screening module A includes a quality condition and/or a frequency condition. The quality condition includes at least one of the following: the protein has two peptide fragments meeting the quality requirement, at least one of which is a unique peptide fragment meeting the quality requirement; or the protein has at least three peptide fragments meeting the quality requirement. The frequency condition is that the protein is present in at least 80% of the samples.
The quality requirement is set according to the spectrum quality of the peptide fragments in the protein mass spectra of the database. To improve the accuracy of the screened proteins, the quality requirement in the present application is that the peptide fragment be of high quality, where "high quality" means that the Mascot score (ion score) of the peptide fragment is greater than 20 points. A high-quality unique peptide fragment is a peptide fragment that can occur in only one protein and whose ion score is greater than 20 points; having two peptide fragments that meet the high-quality requirement means that the protein contains at least two peptide fragments with a score greater than 20 points; and having at least three peptide fragments that meet the quality requirement means that the protein contains at least three peptide fragments with a score greater than 20 points. The frequency condition means that the protein is present in more than 80% of the samples in the database, which ensures that the protein is a commonly occurring protein.
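The retention conditions described above can be sketched as a simple filter. This is a hedged illustration: the data layout (a list of peptide records per protein plus a count of samples in which the protein was detected), the names `Peptide`, `passes_quality`, `passes_frequency` and `retain`, and the toy proteins are all hypothetical, and the sketch applies both the quality and the frequency condition although the source allows "and/or".

```python
from dataclasses import dataclass

@dataclass
class Peptide:
    ion_score: float  # Mascot ion score
    is_unique: bool   # True if the peptide can occur in only one protein

def passes_quality(peptides, min_score=20.0):
    """Two high-quality peptides with at least one unique, or three or more."""
    good = [p for p in peptides if p.ion_score > min_score]
    has_unique = any(p.is_unique for p in good)
    return (len(good) >= 2 and has_unique) or len(good) >= 3

def passes_frequency(detected_in, total_samples, min_fraction=0.8):
    """Protein must be present in at least 80% of the samples."""
    return detected_in / total_samples >= min_fraction

def retain(proteins, total_samples):
    """proteins maps name -> (peptides, number of samples with the protein)."""
    return [name for name, (peps, n) in proteins.items()
            if passes_quality(peps) and passes_frequency(n, total_samples)]

# Hypothetical proteins in a 50-sample database.
proteins = {
    "P1": ([Peptide(35.0, True), Peptide(22.0, False)], 45),   # kept
    "P2": ([Peptide(35.0, False), Peptide(22.0, False)], 50),  # no unique peptide
    "P3": ([Peptide(25.0, False)] * 3, 30),                    # only 60% of samples
}
kept = retain(proteins, total_samples=50)
```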
In a preferred embodiment, the principal component dimension reduction module performs the first dimension reduction by using a default setting.
In a preferred embodiment, in the t-SNE dimension reduction module a: the learning rate is set to 10 to 500, the number of iterations is set to 7500 or more, and the iteration stop is set to 100 to 400.
The cluster typing module A can be obtained by reasonably adjusting on the basis of the conventional cluster analysis module, the specific adjustment principle is to ensure that the intra-class distance is as small as possible and the inter-class distance is as large as possible after repeated iteration, and when the result is converged, the calculation is terminated to obtain a clustering result. In a preferred embodiment of the present application, in the cluster classification module a, the average distance within a class is less than or equal to 4.58, and the average distance between classes is greater than or equal to 9.68.
For the mass spectrum data of proteomics, the screening device also comprises a standardization processing module A before the dimension reduction module A is adopted to carry out two times of dimension reduction processing on the effective protein set. The specific standardized processing module a may be an existing module, such as a Z-score standardized module. In a preferred embodiment, the normalization processing module a is configured to make the mean of the proteins in the effective protein set 0 and the variance 1.
In a preferred embodiment of the present application, the different classes of protein markers selected by the above screening device are protein markers with a variance test P value of less than 10^-35. In a more preferred embodiment of the present application, the different classes of protein markers with a variance test P value of less than 10^-35 comprise: C0, H2AFZ, H2AFJ, H2AFV; C1, ACTG2, ACTA2, ACTC1; C2, FLNA, COL6A2, COL6A1; C3, TUBB2B, TUBB2A, TUBB; C4, H3F3C, H3F3A, H3F3B; C5, X; C6, PABPC1, HNRNPK, DDX39B.
Since the protein markers of the different classes have extremely significant differences in expression levels between the class to which they belong and the remaining classes, they can accurately mark the molecular characteristics of gastric cancers of different types (i.e., protein expression characteristics, which means that the expression levels of the protein markers in each class are significantly higher than those of the corresponding proteins in the other classes). In the above C5 class, X means that no protein is expressed in the class in a significantly higher amount than in other classes.
When the protein markers of the various classes screened by the screening device are used for typing gastric cancer, typing can be based on any one protein marker of a class being expressed at a significantly higher level than in the other classes, or on any two or three protein markers of a class being expressed at significantly higher levels than in the other classes. The more protein markers are used, the more accurate the typing result.
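One way to picture marker-based typing is to score each class by the average standardized expression of its markers in a sample and pick the highest-scoring class. The sketch below does this for the markers listed above (gene symbols written in upper case). The scoring rule, the function names `marker_scores` and `best_class`, and the toy expression values are assumptions made for illustration and are not taken from the source; class C5 has no marker and therefore cannot be assigned by this rule.

```python
from statistics import mean

# Markers listed in the text for P < 10^-35; class C5 has none ("X").
MARKERS = {
    "C0": ["H2AFZ", "H2AFJ", "H2AFV"],
    "C1": ["ACTG2", "ACTA2", "ACTC1"],
    "C2": ["FLNA", "COL6A2", "COL6A1"],
    "C3": ["TUBB2B", "TUBB2A", "TUBB"],
    "C4": ["H3F3C", "H3F3A", "H3F3B"],
    "C6": ["PABPC1", "HNRNPK", "DDX39B"],
}

def marker_scores(zscores):
    """Mean standardized expression per class; undetected markers are skipped."""
    scores = {}
    for cls, genes in MARKERS.items():
        values = [zscores[g] for g in genes if g in zscores]
        if values:
            scores[cls] = mean(values)
    return scores

def best_class(zscores):
    """Class whose detected markers have the highest mean expression."""
    scores = marker_scores(zscores)
    return max(scores, key=scores.get)

# Hypothetical standardized expression values for one sample.
sample = {"ACTG2": 2.1, "ACTA2": 1.8, "ACTC1": 1.5, "FLNA": 0.2, "H2AFZ": -0.4}
predicted = best_class(sample)
```

Using all three markers of a class rather than one averages out measurement noise, which is in line with the statement that more markers give a more accurate result.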
In order to further improve the accuracy of gastric cancer typing, in other preferred embodiments of the present application, the different classes of protein markers screened by the screening device include, in addition to the protein markers above, protein markers with a variance test P value of less than 10^-20. In some more preferred embodiments, the protein markers with a variance test P value of less than 10^-20 comprise:
C0,HIST2H2AC、HIST2H2AA3、HIST2H2AA4、HIST1H2AJ、HIST1H2AL、HIST1H2AG、 HIST1H2AM、HIST1H2AI、HIST1H2AK、HIST1H2AH、HIST1H2AD、HIST1H2AA、 HIST1H2AC、HIST1H2AE、HIST1H2AB、HIST3H2A、H2AFX、RPS3A、HIST1H1B、EEF1A2、 RAN、LMNB1、HIST2H2AB;
C1,ACTA1、DES、SYNM、MYL9、PGM5、SORBS1;
C2,TPM1、MYH11、TPM2、LMOD1、TAGLN、MYH10、EHD2、COL6A3、FLNC、 CNN1、HSPB6、TPM4、SYNPO2、MYLK、CALD1、DPYSL3、CRYAB、ACTN1、TF、TLN1、 VCL、HSPG2、TGFBI、CKB、TPM3、KNG1、PDLIM7、ILK、RRAS、CSRP1、LUM、TTR、 SOD3、MYL6、ALB、AOC3、OGN、ACTBL2;
C3,TUBB3、TUBA3D、TUBA3C、TUBA3E、CAPZA2、IQGAP1、TUBB4B、GDI1、 HPX、MYH9、TUBA4A、TUBB6、TUBA1A、TUBA1C、TUBA1B、TUBB4A、ANXA5、 PLS3、CAPZA1、ACTR2、SRSF3、NPEPPS、CLTCL1、SERPINH1、PDLIM3、HNRNPD、 TUBA8;
C4,HIST3H3、HIST1H3A、HIST1H3D、HIST2H3C、HIST2H3A、HIST2H3D、HIST1H3C、 HIST1H3J、HIST1H3I、HIST1H3B、HIST1H3F、HIST1H3E、HIST1H3G、HIST1H3H;
C5,X;
C6,HSPA8、DDX39A、EIF4A1、PABPC3、EEF1G、HNRNPM、RPN1、RPL4、NCL、 ILF3、DHX9、RPS9、RPS19、RPS4X、XRCC5、HNRNPR、HSP90AA1、SYNCRIP、RPS15A、 EIF4A3、GANAB、RBMX、EEF2、RPL13A、RPL7、RPS16、PSMA6、RPS24、FUBP1、EIF4A2、EIF2S1、RPL36、SRSF1、RPL7A、RPLP0、RPL6、TUBB8、ARF1、ARF3、RPSA、RAB14、 EEF1A1、RPL27、DHX15、ILF2、SRSF7、STIP1、HNRNPA2B1、RPL10A、RPL23A、ARF4、 RPS18、RPL38、NME1-NME2、GNB2L1、RPS6、PDIA6、HNRNPA1、PA2G4、RPS3、RAB7A、 NONO、RPL9、YWHAQ、QARS、PDIA3、HNRNPA3、CCT5、RPS20、XRCC6、HSP90AB1、 RPS28、PCBP2、HNRNPH2、DDX3X、ARPC1B、TAGLN2、EIF3F、DDOST、HSPA4、 HNRNPA1L2、CNDP2、PPIB、CTNND1、RPS13、HSPA9、PRKDC、RPS27A、UBB、UBA52、 UBC、CCT6A、RPL15、NME2、EIF3A、HNRNPC、RPL13、KHSRP、DDX5、SARNP、ALYREF、 ATP2A2、ELAVL1、RUVBL2、ATP6V1A、RAB5C、UBA1、CCT8、STT3B、HSPA2、COPB2、 DAD1、P4HB、RRBP1、RPL24、HSP90B1、EFHD2、PGD、DDX3Y、ANXA2、EEF1D、 ARCN1、YWHAB、PKM、PSMA7、IDH1、SARS、SNRNP200、PSMD2、HNRNPF、TMED10、 HNRNPH1、ENO1、EIF6、HNRNPH3、PGK2、PARP1、HDLBP、STT3A、PRDX6、HSPD1、 PGK1、PHB、PPA1、HSPE1、DNM2、CAPN1、OTUB1、ATP6V1B2。
When the above protein markers are used for typing, the more protein markers are used per class, the more accurate the typing of gastric cancer. When all of the protein markers are used together for analysis, the typing result is more accurate still.
The screening device for gastric cancer typing protein markers in essence also performs the function of typing gastric cancer in a human population, that is, the screening device can also be called a device for typing gastric cancer in a human population. In an eighth exemplary embodiment of the present application, there is also provided a gastric cancer typing device, including: a screening module B for screening proteins satisfying a retention condition from a protein expression mass spectrum database formed from a plurality of samples, as an effective protein set; a dimensionality reduction module B for sequentially performing two rounds of dimensionality reduction on the effective protein set to obtain a dimensionality reduction protein set; and a cluster typing module B for performing cluster analysis on the dimensionality reduction protein set to obtain different types divided according to different classes of protein markers.
The typing device first runs screening module B to screen an effective protein set from the database, thereby excluding proteins of low accuracy and improving the accuracy of the proteins used for detection; it then runs dimensionality reduction module B to perform two rounds of dimensionality reduction on the effective protein set, obtaining the proteins that are closely associated with the relational database of proteome and clinical information; finally, it runs cluster typing module B to perform cluster analysis on these proteins, so that different classes of protein markers are obtained. The device can screen, from the protein profile databases of a large number of gastric cancer samples, protein markers that are significantly highly expressed in cancer samples, and can divide the protein markers into different classes according to the relationship between the markers and the 5-year survival rate. Because the markers differ significantly between classes, gastric cancer can be typed more accurately.
In a preferred embodiment, the dimension reduction module B comprises: the system comprises a principal component dimensionality reduction module and a t-SNE dimensionality reduction module, wherein the principal component dimensionality reduction module is used for performing primary dimensionality reduction processing on an effective protein set by adopting a principal component analysis method to obtain a first dimensionality reduction data set; and the t-SNE dimensionality reduction module is used for performing second dimensionality reduction treatment on the first dimensionality reduction data set by adopting t-SNE to obtain a dimensionality reduction protein set.
In a preferred embodiment, the retention condition in screening module B includes a quality condition and/or a frequency condition. The quality condition includes at least one of the following: the protein has two peptide fragments meeting the quality requirement, at least one of which is a unique peptide fragment meeting the quality requirement; or the protein has at least three peptide fragments meeting the quality requirement. The frequency condition is that the protein is present in at least 80% of the samples.
In a preferred embodiment, the learning rate in the t-SNE dimension reduction module B is set to be 10-500, the iteration number is set to be more than 7500, and the iteration stop is set to be 100-400.
In a preferred embodiment, in the cluster classification module B, the average distance within a class is less than or equal to 4.58, and the average distance between classes is greater than or equal to 9.68.
In a preferred embodiment, the typing device further comprises a normalization processing module B, and the normalization processing module B is configured to perform normalization processing on the effective protein set to obtain a normalized effective protein set.
In another preferred embodiment, the normalization processing module B is used to make the mean of the proteins in the effective protein set 0 and the variance 1.
The effects of the above-mentioned device for typing gastric cancer in human populations according to the preferred embodiments are the same as those of the above-mentioned device for screening gastric cancer typing protein markers, and the details thereof can be found in the above-mentioned related matters.
In a ninth exemplary embodiment of the present application, there is also provided an apparatus for typing a gastric cancer sample of an individual, the apparatus including: a gastric cancer typing device for typing gastric cancer according to a known gastric cancer sample set, wherein the gastric cancer typing device is any one of the gastric cancer typing devices; the screening module C is used for screening out proteins meeting the retention conditions from the protein mass spectrum data of the gastric cancer sample to be detected, and the proteins are used as a protein set to be detected; the dimensionality reduction module C is used for sequentially carrying out dimensionality reduction treatment on the protein set to be detected twice according to the same conditions as the effective protein set in the known gastric cancer sample set in the gastric cancer parting device to obtain a dimensionality reduction protein set of the gastric cancer sample to be detected; and the clustering and typing module C is used for clustering and analyzing the dimensionality reduction protein set of the gastric cancer sample to be detected to obtain the typing with the highest similarity with the known gastric cancer sample set, namely the typing corresponding to the gastric cancer sample to be detected.
The device firstly establishes a typing model between protein mass spectrum data and a clinical information knowledge base in an existing sample base by a comparison method based on the similarity with a plurality of known gastric cancer samples in a database, obtains specific parameters of data standardization, dimension reduction twice and clustering, and then compares the protein mass spectrum data of a sample to be detected with the protein mass spectrum data of the known sample, thereby finding out the typing of the known sample with the highest similarity and the prognosis information of a chemotherapy and treatment scheme corresponding to the known sample. The comparison typing detection method based on the similarity of the sample to be detected and the known sample enables the typing result of the sample to be detected to be more accurate.
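The similarity-based comparison described above can be pictured as a nearest-neighbour lookup in the reduced space. The sketch below is an assumption-laden illustration, not the device's actual algorithm: it assumes that the sample to be tested and the known samples already have coordinates in the same reduced space, that Euclidean distance measures similarity, and that the type of the closest known sample (or a majority vote over the k closest) is reported; the name `assign_type` and the coordinates are hypothetical.

```python
import math
from collections import Counter

def assign_type(query, known_points, known_types, k=1):
    """Return the most common type among the k nearest known samples."""
    ranked = sorted(zip(known_points, known_types),
                    key=lambda pair: math.dist(query, pair[0]))
    votes = Counter(t for _, t in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical reduced coordinates and types of known gastric cancer samples.
known_points = [(0.0, 0.0), (1.0, 0.5), (10.0, 10.0), (11.0, 9.0)]
known_types = ["C0", "C0", "C2", "C2"]

# Hypothetical reduced coordinates of the sample to be tested.
predicted = assign_type((9.5, 9.0), known_points, known_types)
```

Once the closest known samples are found, their recorded clinical information (survival, chemotherapy regimen) can be looked up for the predicted type.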
In the device for typing the individual samples, the two-time dimensionality reduction treatment also aims to remove the protein with small change amplitude, combine the protein with high correlation with clinical information, reduce the complexity of a matching model, facilitate the calculation of the distance between the sample to be tested and each known sample, further ensure that the distance between similar individuals is as small as possible, ensure that the result of matching the sample to be tested and the known sample is more accurate, and further ensure that the typing result of the sample to be tested is more accurate.
Therefore, any dimension reduction method capable of achieving the above-described effects is applicable to the present application. In a preferred embodiment of the present application, the dimension reduction module C includes: the principal component dimensionality reduction module C is used for performing primary dimensionality reduction treatment on the protein set to be detected by adopting a principal component analysis method to obtain a first dimensionality reduction data set; and the t-SNE dimensionality reduction module C is used for performing second dimensionality reduction treatment on the first dimensionality reduction data set by adopting t-SNE to obtain a dimensionality reduction protein set.
The principal component dimension reduction module C performs dimension reduction from the perspective of overall similarity of the sample total set, and the t-SNE dimension reduction module C focuses more on local similarity of the samples.
The specific parameter settings in principal component dimension reduction module C and t-SNE dimension reduction module C follow the parameters obtained when typing gastric cancer from the protein profile database of the known samples; that is, the parameters are the same as those used in the gastric cancer typing device.
In the above typing apparatus, the retention conditions, including the quality condition and/or the frequency condition, may be set appropriately according to the quality of the protein mass spectrum data of the samples actually used. In a preferred embodiment of the present application, the quality condition includes at least one of the following: the protein has two peptide fragments meeting the quality requirement, at least one of which is a unique peptide fragment meeting the quality requirement; or the protein has at least three peptide fragments meeting the quality requirement. The frequency condition is that the protein is present in at least 80% of the samples in the known gastric cancer sample set.
The quality requirement is the same as the high-quality peptide fragment requirement applied to the known gastric cancer sample set. "High quality" means that the Mascot score (ion score) of the peptide fragment is greater than 20 points. A high-quality unique peptide fragment is a peptide fragment that can occur in only one protein and whose ion score is greater than 20 points; having two peptide fragments that meet the high-quality requirement means that the protein contains at least two peptide fragments with a score greater than 20 points; and having at least three peptide fragments that meet the quality requirement means that the protein contains at least three peptide fragments with a score greater than 20 points. The frequency condition means that the protein is present in more than 80% of the samples in the database, which ensures that the protein is a commonly occurring protein.
In the device, cluster analysis module C can be obtained by reasonably adjusting an existing cluster analysis module; the guiding principle of the adjustment is to make the intra-class distance as small as possible and the inter-class distance as large as possible. In a preferred embodiment of the present application, in cluster analysis module C, the average intra-class distance is less than or equal to 4.58, and the average inter-class distance is greater than or equal to 9.68.
Before data analysis, the data usually need to be standardized (normalization), and the analysis is then performed on the standardized data. Data standardization covers two aspects: making the indices trend in the same direction (co-trending) and non-dimensionalization. Non-dimensionalization mainly addresses the comparability of data, and many methods exist. After standardization, the original data are all converted into dimensionless evaluation values, that is, every index is on the same order of magnitude, so that comprehensive evaluation and analysis can be carried out.
For the mass spectrometry data of proteomics, before the dimension reduction module C is used to sequentially perform the dimension reduction processing on the protein set to be detected twice, the device further comprises a standardization processing module C, which is used to perform standardization processing on the protein set to be detected, so as to obtain a standardized protein set to be detected. The specific standardized processing module C may employ an existing module, such as a Z-score standardized module. In a preferred embodiment, the normalization processing module C is configured to make the mean value of the proteins in the test protein set 0 and the variance 1.
In a tenth exemplary embodiment of the present application, there is also provided a storage medium including a stored program, wherein the program, when run, executes any one of the above methods for screening gastric cancer typing protein markers, or any one of the above methods for typing gastric cancer in a human population, or any one of the above methods for typing a gastric cancer sample of an individual.
In an eleventh exemplary embodiment of the present application, there is further provided a processor for running a program, wherein the program, when run, executes any one of the above methods for screening gastric cancer typing protein markers, or any one of the above methods for typing gastric cancer in a human population, or any one of the above methods for typing a gastric cancer sample of an individual.
The advantageous effects of the present application will be further described with reference to specific examples.
Example 1
First part, sample preparation method based on paraffin-embedded samples
Protein extraction and analysis were performed on paraffin-embedded tissue samples of 50 known gastric cancer cases (specifically including known information about age, sex, cancer cell ratio, Lauren typing, TNM typing, survival time, death or not, chemotherapy regimen, etc.) by detailed steps as follows:
(I) Slicing:
Take the embedded wax block, remove the excess paraffin around the tissue block with a scalpel, and trim the block into a trapezoid;
fix the wax block on the microtome, set the section thickness to 5 microns, and cut sections;
pick up the sections with a writing brush or a toothpick and flatten them in a section spreader at 45 ℃;
mount the sections on slides (at least two slides, one for the experiment and one as a spare), number the slides with a pencil once the wax sections are in place, stand the slides upright to air-dry, then place them on a slide baker and bake at 60 ℃ for 2 hours;
store in a slide box at -20 ℃.
(II) protein extraction
Take the sections out of storage at -20 ℃, place them on a slide rack, and bake them in an oven at 60 ℃ for 1 hour;
after the sections return to room temperature, dewax them; the steps and times are as follows:
Xylene I, 10 min
Xylene II, 10 min
Xylene III, 10 minutes
100% ethanol, 10 min
95% ethanol, 10 min
90% ethanol, 10 min
85% ethanol, 10 min
75% ethanol, 10 min
Distilled water I, 10 min
Distilled water II, 10 min
After air-drying, the sections were scraped with a scalpel blade into an EP tube containing 100 μl of a 50 mM NH4HCO3 solution.
(III) Protein reduction and alkylation, and enzymatic digestion into peptides:
Add 0.5 microliter of 1 M DTT and heat at 56 ℃ for 30 minutes; after returning to room temperature, add 1 M IAA and keep in the dark at room temperature for 30 minutes; then add 0.5 microliter of 1 M DTT and leave at room temperature for 15 minutes;
add 3 micrograms of mass-spectrometry sequencing-grade trypsin, mix well, and digest overnight at 37 ℃;
the next day, add 200 microliters of a solution of 50% acetonitrile and 0.1% formic acid, shake to mix at room temperature for 5 minutes, centrifuge at 10000 g for 5 minutes, and transfer the supernatant to a new tube; repeat these steps once, combine the two supernatants, and dry the peptides at 60 ℃ in a vacuum concentrator.
(IV) desalting:
Redissolve the dried peptides in 100 microliters of 0.1% formic acid solution; the desalting procedure is as follows:
Self-made desalting column: T400 tip + 2 layers of C18 membrane + 1 mg of C18 powder;
Desalting column activation:
100% ACN, 100 microliters, centrifuge at 1000 g for 1 min
50% ACN, 100 microliters, centrifuge at 1000 g for 3 min
80% ACN, 100 microliters, centrifuge at 1000 g for 2 min
50% ACN, 100 microliters, centrifuge at 1000 g for 3 min
Desalting column equilibration: 0.1% FA, 100 microliters, centrifuge at 1000 g for 4 min; repeat once;
Peptide loading: add the redissolved peptides to the desalting column and centrifuge at 1000 g for 4 min; reload the flow-through once;
Salt wash: 0.1% FA, 100 microliters, centrifuge at 1000 g for 4 min; repeat once;
Peptide elution: place the desalting column in a new collection tube, add 100 microliters of 25% ACN + 0.1% FA, and centrifuge at 1000 g for 4 min; collect the eluate, dry the peptides at 60 ℃ in a vacuum concentrator, and submit them for LC-MS.
The second part, a method for gastric cancer typing based on gastric cancer proteome data of paraffin-embedded samples or a method for screening gastric cancer typing protein markers, comprises the following detailed steps:
1. Protein accuracy screening. The proteins detected in the known samples are screened, and a protein is retained if at least two high-quality peptide fragments are detected for it, one of which is a unique high-quality peptide fragment, or if at least three high-quality peptide fragments are detected for it.
2. Protein frequency screening. The proteins of all known samples (also called known gastric cancer samples) are screened by frequency, and proteins present in at least 80% of the samples are retained.
3. The protein levels of all known samples are standardized, so that each protein is standardized across the cohort to values with a mean of 0 and a variance of 1.
4. Principal component analysis is performed on the known gastric cancer sample set to reduce the dimensionality, so that the principal components retained after dimensionality reduction explain most of the sample characteristics. The specific parameter is as follows: the singular value decomposition method is set to automatic selection.
5. Further dimension reduction was performed using t-SNE on a known gastric cancer sample set, and the specific parameter learning rate was set to 100, the number of iterations 20000, and the iteration stop was set to 300.
6. Performing cluster analysis on dimensionality reduction data of the known gastric cancer sample set after dimensionality reduction to ensure that the average distance in the class is less than or equal to 4.58 and the average distance between the classes is more than or equal to 9.68, so that the known gastric cancer sample set is divided into different classes, proteins with expression quantity remarkably higher than that of the proteins in the other classes exist in each class, the proteins are protein markers of the corresponding class, and the different classes are different types of the gastric cancer in the crowd.
Third part: results
A total of 8550 proteins were identified in the 50 paraffin-embedded gastric cancer tissue samples by the above method, with 3109 to 4064 proteins identified per sample, as shown in FIG. 1. The abscissa of FIG. 1 represents the number of identified proteins and the ordinate represents the frequency of samples; FIG. 1 shows that 3109 to 4064 proteins were identified in each of the 50 samples.
The results of typing the 50 paraffin-embedded gastric cancer tissue samples are shown in Table 1, Table 2, FIG. 2 and FIG. 3. Table 1 and Table 2 list the protein markers that are significantly differentially expressed in the different classes; FIG. 2 shows that the 7 different classes (clusters) produced by the cluster analysis correspond to 7 gastric cancer types; and FIG. 3 shows that survival differs significantly (P value of 0.00005772) between the different gastric cancer types of FIG. 2, with type C0 having the best survival prognosis, followed by type C4, and type C2 having the worst.
Table 1:
[Table 1 is reproduced as an image in the original publication: BDA0002047924280000241]
As can be seen from Table 1, the above protein markers differ significantly between the different classes (variance test P values are all less than 10^-35), and the FDR (false discovery rate, the expected ratio of the number of false rejections to the number of all rejected null hypotheses) of the above results is 0.
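The marker screen described above can be illustrated with a short sketch. It assumes that the "variance test" is a one-way analysis of variance across the classes and that the FDR is estimated with the Benjamini-Hochberg procedure; neither is stated in the source, and the function names and the samples-by-proteins layout are hypothetical.

```python
import numpy as np
from scipy.stats import f_oneway


def anova_p_values(X, labels):
    """One-way ANOVA P value for each protein (column) across class labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    groups = [X[labels == c] for c in np.unique(labels)]
    return np.array([
        f_oneway(*[g[:, j] for g in groups]).pvalue
        for j in range(X.shape[1])
    ])


def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P values (one common FDR estimate)."""
    p = np.asarray(p, dtype=float)
    order = np.argsort(p)
    ranked = p[order] * len(p) / np.arange(1, len(p) + 1)
    # Enforce monotonicity, working down from the largest P value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty_like(p)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


def select_markers(X, labels, protein_names, p_cutoff=1e-35):
    """Names of proteins whose ANOVA P value falls below the cutoff."""
    p = anova_p_values(X, labels)
    return [name for name, value in zip(protein_names, p) if value < p_cutoff]
```

With a cutoff of 1e-35 this corresponds to the Table 1 threshold; passing `p_cutoff=1e-20` would correspond to the looser threshold used for Table 2.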
Table 2:
[Table 2 is reproduced as images in the original publication: BDA0002047924280000251, BDA0002047924280000261, BDA0002047924280000271, BDA0002047924280000281]
As can be seen from Table 2, the protein markers differ significantly between the different classes (variance test P values are all less than 10^-20), and the FDRs of the results are all 0. This indicates that the protein markers in Table 2 can also be used to type gastric cancer.
Example 2
Based on the typing results of Example 1, this example further examined how sensitivity to chemotherapy and to medication correlates with survival prognosis in patients with different gastric cancer types. FIG. 4 shows the relationship between chemotherapy treatment and survival prognosis for patients with type C6.
Example 3
Using the typing results established in Example 1, typing diagnosis and detection were performed on 224 newly collected gastric cancer samples. The protein separation and extraction and the mass spectrometry procedures were the same as in Example 1. The typing detection steps are as follows:
1. Protein accuracy screening. The proteins detected in the sample to be tested are screened, and a protein is retained if at least two high-quality peptide fragments are detected for it, at least one of which is a unique high-quality peptide fragment, or if at least three high-quality peptide fragments are detected for it.
2. Protein frequency screening. A protein detected in the sample to be tested is retained if it appears in at least 80% of the known gastric cancer samples of Example 1.
3. Standardization. The proteins retained in the sample to be tested are standardized using the parameters used to standardize the proteins in the known gastric cancer sample set of Example 1, so that the mass spectrum data of the effectively detected proteins of the sample to be tested are comparable with those of the known gastric cancer sample set; that is, each protein in the sample to be tested is standardized to a level comparable with that protein in the typed samples.
4. Using the same specific parameters as in Example 1, principal component analysis is performed on the gastric cancer sample to be tested, reducing dimensionality at the protein level to obtain first dimensionality reduction data.
5. Using the same t-SNE parameters as in Example 1, the first dimensionality reduction data are further reduced with t-SNE to obtain the dimensionality reduction protein set of the gastric cancer sample to be tested.
6. Using the same cluster analysis parameters as in Example 1, cluster analysis is performed on the dimensionality reduction protein set of the gastric cancer sample to be tested so that the sample is assigned to a class. Based on the similarity of its proteins to the known gastric cancer sample set, the type with the highest similarity is taken as the type of the gastric cancer sample to be tested. The typing results are shown in Table 3.
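Steps 1 to 3 of this procedure can be sketched in a few lines. This is an illustrative outline only: the function names, the dictionary-based data layout, and the decision to skip proteins that lack reference parameters are assumptions, and whether a peptide fragment counts as "high-quality" is taken to have been decided upstream.

```python
from collections import Counter


def passes_quality(n_high_quality, n_unique_high_quality):
    """Step 1: at least two high-quality peptides with at least one unique,
    or at least three high-quality peptides."""
    return (n_high_quality >= 2 and n_unique_high_quality >= 1) or n_high_quality >= 3


def frequent_proteins(known_samples, min_fraction=0.8):
    """Step 2: proteins present in at least min_fraction of the known samples.

    known_samples is an iterable of collections of protein identifiers.
    """
    known_samples = list(known_samples)
    counts = Counter(p for sample in known_samples for p in set(sample))
    threshold = min_fraction * len(known_samples)
    return {p for p, c in counts.items() if c >= threshold}


def standardize(sample, ref_mean, ref_std):
    """Step 3: z-score each protein with the mean and standard deviation
    obtained from the known sample set (not recomputed on the new sample)."""
    return {
        p: (value - ref_mean[p]) / ref_std[p]
        for p, value in sample.items()
        if p in ref_mean and ref_std.get(p, 0) > 0
    }
```

Reusing the reference mean and standard deviation, rather than recomputing them on the new sample, is what keeps the new sample on the same scale as the typed samples.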
Table 3:
[Table 3 is reproduced as images in the original publication: BDA0002047924280000291, BDA0002047924280000301, BDA0002047924280000311, BDA0002047924280000321]
From the typing results for each case in Table 3 and the survival times recorded at follow-up, the following were calculated:
type C0: 23 cases, average postoperative survival time 5.09 years;
type C1: 39 cases, average postoperative survival time 3.93 years;
type C2: 30 cases, average postoperative survival time 2.97 years;
type C3: 28 cases, average postoperative survival time 3.93 years;
type C4: 36 cases, average postoperative survival time 4.59 years;
type C5: 31 cases, average postoperative survival time 3.94 years;
type C6: 37 cases, average postoperative survival time 4.06 years.
As can be seen from the average postoperative survival time of each type and the results shown in FIG. 5, type C0 has the best survival prognosis, followed by type C4, and type C2 has the worst. This result is fully consistent with the typing prediction in Example 1. The typing detection method therefore has the advantage of giving more accurate results.
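The ranking stated above can be checked directly against the per-type figures. The sketch below uses only the case counts and average survival times listed for Example 3; the variable and function names are illustrative, and the case-weighted overall mean is a derived quantity that the source does not report.

```python
# type -> (number of cases, average postoperative survival in years),
# copied from the per-type results of Example 3
SURVIVAL = {
    "C0": (23, 5.09),
    "C1": (39, 3.93),
    "C2": (30, 2.97),
    "C3": (28, 3.93),
    "C4": (36, 4.59),
    "C5": (31, 3.94),
    "C6": (37, 4.06),
}


def summarize(survival):
    """Return total cases, case-weighted mean survival, and types ranked
    from longest to shortest average survival."""
    total = sum(n for n, _ in survival.values())
    weighted_mean = sum(n * years for n, years in survival.values()) / total
    ranking = sorted(survival, key=lambda t: survival[t][1], reverse=True)
    return total, weighted_mean, ranking


total, weighted_mean, ranking = summarize(SURVIVAL)
print(total)        # the case counts add up to the 224 samples of Example 3
print(ranking[0], ranking[1], ranking[-1])  # best, second best, worst type
```

The case counts sum to 224, matching the number of newly collected samples, and the ordering places C0 first, C4 second and C2 last.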
From the above description, it can be seen that the above embodiments of the present application achieve the following technical effects: a method for extracting proteins from paraffin-embedded samples was developed according to the characteristics of paraffin tissue. Combined with targeted mass spectrometry conditions and parameter settings, the method can identify more than 3000 proteins from an ultra-micro paraffin-embedded sample in a single LC-MS analysis in a short time, so that large numbers of tissue wax block samples can be converted rapidly and efficiently into deep and accurate proteome data.
On this basis, the present application further classifies gastric cancer into 7 types according to different protein markers at the protein molecular level and establishes a gastric cancer prognosis evaluation model, in which the 5-year survival rate of the type with the best prognosis reaches 80.29% and that of the type with the worst prognosis is 35.85%. The prediction level is higher than that of the Lauren method, comparable to that of the ACRG method, and slightly lower than that of the TNM method. The protein composition characteristics of the 7 gastric cancer types obtained in the present application can be used to guide or evaluate whether a patient should receive chemotherapy and the efficacy of various medication regimens, providing valuable survival-related personalized choices for the treatment of gastric cancer patients.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer-readable medium does not include a transitory computer-readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (31)

1. A method for screening a gastric cancer typing protein marker, which comprises the following steps:
screening proteins meeting the retention condition from a protein expression profile database formed by a plurality of gastric cancer samples to serve as an effective protein set;
sequentially carrying out dimensionality reduction treatment twice on the effective protein set to obtain a dimensionality reduction protein set;
performing clustering analysis on the dimensionality reduction protein set to obtain different categories of protein markers, wherein the two dimensionality reduction treatments comprise:
carrying out primary dimensionality reduction treatment on the effective protein set by adopting a principal component analysis method to obtain a first dimensionality reduction data set;
performing a second dimensionality reduction treatment on the first dimensionality reduction data set by adopting t-SNE to obtain the dimensionality reduction protein set, wherein the retention condition comprises a quality condition and/or a frequency condition,
the quality condition includes at least one of: the protein has at least two peptide fragments meeting the quality requirement, at least one of which is a unique peptide fragment meeting the quality requirement; and the protein has at least three peptide fragments meeting the quality requirement;
the frequency condition is that the protein is present in at least 80% of the samples;
the different classes of protein markers comprise protein markers having a P value of less than 10^-35 based on a variance test;
the protein markers having a P value of less than 10^-35 based on the variance test comprise:
C0,H2AFZ、H2AFJ、H2AFV;
C1,ACTG2、ACTA2、ACTC1;
C2,FLNA、COL6A2、COL6A1;
C3,TUBB2B、TUBB2A、TUBB;
C4,H3F3C、H3F3A、H3F3B;
C5,X;
C6,PABPC1、HNRNPK、DDX39B。
2. The screening method according to claim 1, wherein in the step of performing the second dimensionality reduction treatment on the first dimensionality reduction data set by adopting t-SNE, the learning rate is set to 10 to 500, the number of iterations is set to 7500 or more, and the iteration stop is set to 100 to 400.
3. The screening method according to claim 1, wherein in the step of cluster analysis, the average distance within a class is not more than 4.58 and the average distance between classes is not less than 9.68.
4. The screening method according to any one of claims 1 to 3, wherein before the two successive dimensionality reduction treatments of the effective protein set, the method further comprises: carrying out standardization treatment on the effective protein set to obtain a standardized effective protein set.
5. The screening method according to claim 4, wherein the standardization treatment is performed such that the mean value of the proteins in the effective protein set is 0 and the variance is 1.
6. The screening method according to claim 1, wherein the different classes of protein markers further comprise protein markers having a P value of less than 10^-20 based on a variance test,
the protein markers having a P value of less than 10^-20 based on the variance test comprise:
C0,HIST2H2AC、HIST2H2AA3、HIST2H2AA4、HIST1H2AJ、HIST1H2AL、HIST1H2AG、HIST1H2AM、HIST1H2AI、HIST1H2AK、HIST1H2AH、HIST1H2AD、HIST1H2AA、HIST1H2AC、HIST1H2AE、HIST1H2AB、HIST3H2A、H2AFX、RPS3A、HIST1H1B、EEF1A2、RAN、LMNB1、HIST2H2AB;
C1,ACTA1、DES、SYNM、MYL9、PGM5、SORBS1;
C2,TPM1、MYH11、TPM2、LMOD1、TAGLN、MYH10、EHD2、COL6A3、FLNC、CNN1、HSPB6、TPM4、SYNPO2、MYLK、CALD1、DPYSL3、CRYAB、ACTN1、TF、TLN1、VCL、HSPG2、TGFBI、CKB、TPM3、KNG1、PDLIM7、ILK、RRAS、CSRP1、LUM、TTR、SOD3、MYL6、ALB、AOC3、OGN、ACTBL2;
C3,TUBB3、TUBA3D、TUBA3C、TUBA3E、CAPZA2、IQGAP1、TUBB4B、GDI1、HPX、MYH9、TUBA4A、TUBB6、TUBA1A、TUBA1C、TUBA1B、TUBB4A、ANXA5、PLS3、CAPZA1、ACTR2、SRSF3、NPEPPS、CLTCL1、SERPINH1、PDLIM3、HNRNPD、TUBA8;
C4,HIST3H3、HIST1H3A、HIST1H3D、HIST2H3C、HIST2H3A、HIST2H3D、HIST1H3C、HIST1H3J、HIST1H3I、HIST1H3B、HIST1H3F、HIST1H3E、HIST1H3G、HIST1H3H;
C5,X;
C6,HSPA8、DDX39A、EIF4A1、PABPC3、EEF1G、HNRNPM、RPN1、RPL4、NCL、ILF3、DHX9、RPS9、RPS19、RPS4X、XRCC5、HNRNPR、HSP90AA1、SYNCRIP、RPS15A、EIF4A3、GANAB、RBMX、EEF2、RPL13A、RPL7、RPS16、PSMA6、RPS24、FUBP1、EIF4A2、EIF2S1、RPL36、SRSF1、RPL7A、RPLP0、RPL6、TUBB8、ARF1、ARF3、RPSA、RAB14、EEF1A1、RPL27、DHX15、ILF2、SRSF7、STIP1、HNRNPA2B1、RPL10A、RPL23A、ARF4、RPS18、RPL38、NME1-NME2、GNB2L1、RPS6、PDIA6、HNRNPA1、PA2G4、RPS3、RAB7A、NONO、RPL9、YWHAQ、QARS、PDIA3、HNRNPA3、CCT5、RPS20、XRCC6、HSP90AB1、RPS28、PCBP2、HNRNPH2、DDX3X、ARPC1B、TAGLN2、EIF3F、DDOST、HSPA4、HNRNPA1L2、CNDP2、PPIB、CTNND1、RPS13、HSPA9、PRKDC、RPS27A、UBB、UBA52、UBC、CCT6A、RPL15、NME2、EIF3A、HNRNPC、RPL13、KHSRP、DDX5、SARNP、ALYREF、ATP2A2、ELAVL1、RUVBL2、ATP6V1A、RAB5C、UBA1、CCT8、STT3B、HSPA2、COPB2、DAD1、P4HB、RRBP1、RPL24、HSP90B1、EFHD2、PGD、DDX3Y、ANXA2、EEF1D、ARCN1、YWHAB、PKM、PSMA7、IDH1、SARS、SNRNP200、PSMD2、HNRNPF、TMED10、HNRNPH1、ENO1、EIF6、HNRNPH3、PGK2、PARP1、HDLBP、STT3A、PRDX6、HSPD1、PGK1、PHB、PPA1、HSPE1、DNM2、CAPN1、OTUB1、ATP6V1B2。
7. A reagent, kit or chip for detecting gastric cancer, which comprises different classes of protein markers screened by the screening method according to any one of claims 1 to 6.
8. The reagent, kit or chip according to claim 7, wherein the detection comprises any one or more of typing diagnosis, survival prognosis evaluation and screening of chemotherapeutic drugs.
9. Use of different classes of protein markers screened by the method of any one of claims 1 to 6 for the preparation of reagents, kits or chips for detecting gastric cancer.
10. The use according to claim 9, wherein the protein marker is a protein marker whose expression level is significantly higher in the belonging class than in the remaining classes.
11. The use of claim 9 or 10, wherein the detection comprises any one or more of typing diagnosis, prognosis of survival assessment and screening for chemotherapeutic agents.
12. A screening apparatus for a gastric cancer-typing protein marker, comprising:
the screening module A is used for screening out proteins meeting the retention conditions from a protein expression profile database formed by a plurality of gastric cancer samples to serve as an effective protein set;
the dimensionality reduction module A is used for sequentially carrying out dimensionality reduction treatment on the effective protein set twice to obtain a dimensionality reduction protein set;
a clustering and classifying module A, configured to perform clustering analysis on the dimensionality reduction protein set to obtain different categories of protein markers, where the dimensionality reduction module A includes:
the principal component dimensionality reduction module A is used for carrying out primary dimensionality reduction treatment on the effective protein set by adopting a principal component analysis method to obtain a first dimensionality reduction data set;
a t-SNE dimension reduction module A for performing a second dimension reduction treatment on the first dimension reduction data set by using t-SNE to obtain the dimension reduction protein set, wherein the retention condition comprises a quality condition and/or a frequency condition,
the quality condition includes at least one of: the protein has at least two peptide fragments meeting the quality requirement, at least one of which is a unique peptide fragment meeting the quality requirement; and the protein has at least three peptide fragments meeting the quality requirement;
the frequency condition is that the protein is present in at least 80% of the samples;
the different classes of protein markers are protein markers having a P value of less than 10^-35 based on a variance test;
the protein markers having a P value of less than 10^-35 based on the variance test comprise:
C0,H2AFZ、H2AFJ、H2AFV;
C1,ACTG2、ACTA2、ACTC1;
C2,FLNA、COL6A2、COL6A1;
C3,TUBB2B、TUBB2A、TUBB;
C4,H3F3C、H3F3A、H3F3B;
C5,X;
C6,PABPC1、HNRNPK、DDX39B。
13. The screening apparatus according to claim 12, wherein in the t-SNE dimension reduction module A, the learning rate is set to 10 to 500, the number of iterations is set to 7500 or more, and the iteration stop is set to 100 to 400.
14. The screening apparatus according to claim 12, wherein the cluster classification module A has an average distance within a class of not more than 4.58 and an average distance between classes of not less than 9.68.
15. The screening apparatus according to any one of claims 12 to 14, further comprising: a standardization processing module A for carrying out standardization treatment on the effective protein set to obtain a standardized effective protein set.
16. The screening apparatus according to claim 15, wherein the standardization processing module A is configured to make the mean value of the proteins in the effective protein set 0 and the variance 1.
17. The screening apparatus of claim 12, wherein the different classes of protein markers further comprise protein markers having a P value of less than 10^-20 based on a variance test,
the protein markers having a P value of less than 10^-20 based on the variance test comprise:
C0,HIST2H2AC、HIST2H2AA3、HIST2H2AA4、HIST1H2AJ、HIST1H2AL、HIST1H2AG、HIST1H2AM、HIST1H2AI、HIST1H2AK、HIST1H2AH、HIST1H2AD、HIST1H2AA、HIST1H2AC、HIST1H2AE、HIST1H2AB、HIST3H2A、H2AFX、RPS3A、HIST1H1B、EEF1A2、RAN、LMNB1、HIST2H2AB;
C1,ACTA1、DES、SYNM、MYL9、PGM5、SORBS1;
C2,TPM1、MYH11、TPM2、LMOD1、TAGLN、MYH10、EHD2、COL6A3、FLNC、CNN1、HSPB6、TPM4、SYNPO2、MYLK、CALD1、DPYSL3、CRYAB、ACTN1、TF、TLN1、VCL、HSPG2、TGFBI、CKB、TPM3、KNG1、PDLIM7、ILK、RRAS、CSRP1、LUM、TTR、SOD3、MYL6、ALB、AOC3、OGN、ACTBL2;
C3,TUBB3、TUBA3D、TUBA3C、TUBA3E、CAPZA2、IQGAP1、TUBB4B、GDI1、HPX、MYH9、TUBA4A、TUBB6、TUBA1A、TUBA1C、TUBA1B、TUBB4A、ANXA5、PLS3、CAPZA1、ACTR2、SRSF3、NPEPPS、CLTCL1、SERPINH1、PDLIM3、HNRNPD、TUBA8;
C4,HIST3H3、HIST1H3A、HIST1H3D、HIST2H3C、HIST2H3A、HIST2H3D、HIST1H3C、HIST1H3J、HIST1H3I、HIST1H3B、HIST1H3F、HIST1H3E、HIST1H3G、HIST1H3H;
C5,X;
C6,HSPA8、DDX39A、EIF4A1、PABPC3、EEF1G、HNRNPM、RPN1、RPL4、NCL、ILF3、DHX9、RPS9、RPS19、RPS4X、XRCC5、HNRNPR、HSP90AA1、SYNCRIP、RPS15A、EIF4A3、GANAB、RBMX、EEF2、RPL13A、RPL7、RPS16、PSMA6、RPS24、FUBP1、EIF4A2、EIF2S1、RPL36、SRSF1、RPL7A、RPLP0、RPL6、TUBB8、ARF1、ARF3、RPSA、RAB14、EEF1A1、RPL27、DHX15、ILF2、SRSF7、STIP1、HNRNPA2B1、RPL10A、RPL23A、ARF4、RPS18、RPL38、NME1-NME2、GNB2L1、RPS6、PDIA6、HNRNPA1、PA2G4、RPS3、RAB7A、NONO、RPL9、YWHAQ、QARS、PDIA3、HNRNPA3、CCT5、RPS20、XRCC6、HSP90AB1、RPS28、PCBP2、HNRNPH2、DDX3X、ARPC1B、TAGLN2、EIF3F、DDOST、HSPA4、HNRNPA1L2、CNDP2、PPIB、CTNND1、RPS13、HSPA9、PRKDC、RPS27A、UBB、UBA52、UBC、CCT6A、RPL15、NME2、EIF3A、HNRNPC、RPL13、KHSRP、DDX5、SARNP、ALYREF、ATP2A2、ELAVL1、RUVBL2、ATP6V1A、RAB5C、UBA1、CCT8、STT3B、HSPA2、COPB2、DAD1、P4HB、RRBP1、RPL24、HSP90B1、EFHD2、PGD、DDX3Y、ANXA2、EEF1D、ARCN1、YWHAB、PKM、PSMA7、IDH1、SARS、SNRNP200、PSMD2、HNRNPF、TMED10、HNRNPH1、ENO1、EIF6、HNRNPH3、PGK2、PARP1、HDLBP、STT3A、PRDX6、HSPD1、PGK1、PHB、PPA1、HSPE1、DNM2、CAPN1、OTUB1、ATP6V1B2。
18. A device for typing gastric cancer in a population, comprising:
the screening module B is used for screening out proteins meeting the retention conditions from a protein expression profile database formed by a plurality of gastric cancer samples to serve as an effective protein set;
the dimensionality reduction module B is used for sequentially carrying out dimensionality reduction treatment on the effective protein set for two times to obtain a dimensionality reduction protein set;
a clustering and classifying module B, configured to perform clustering analysis on the dimensionality reduction protein set to obtain different classifications divided according to different categories of protein markers, where the dimensionality reduction module B includes:
the principal component dimensionality reduction module B is used for performing primary dimensionality reduction treatment on the effective protein set by adopting a principal component analysis method to obtain a first dimensionality reduction data set;
a t-SNE dimension reduction module B for performing a second dimension reduction treatment on the first dimension reduction data set by using t-SNE to obtain the dimension reduction protein set, wherein the retention condition comprises a quality condition and/or a frequency condition,
the quality condition includes at least one of: the protein has at least two peptide fragments meeting the quality requirement, at least one of which is a unique peptide fragment meeting the quality requirement; and the protein has at least three peptide fragments meeting the quality requirement;
the frequency condition is that the protein is present in at least 80% of the samples;
the different classes of protein markers are protein markers having a P value of less than 10^-35 based on a variance test;
the protein markers having a P value of less than 10^-35 based on the variance test comprise:
C0,H2AFZ、H2AFJ、H2AFV;
C1,ACTG2、ACTA2、ACTC1;
C2,FLNA、COL6A2、COL6A1;
C3,TUBB2B、TUBB2A、TUBB;
C4,H3F3C、H3F3A、H3F3B;
C5,X;
C6,PABPC1、HNRNPK、DDX39B。
19. The gastric cancer typing device according to claim 18, wherein in the t-SNE dimension reduction module B, the learning rate is set to 10 to 500, the number of iterations is set to 7500 or more, and the iteration stop is set to 100 to 400.
20. The gastric cancer typing device according to claim 18, wherein in the clustering and classifying module B, the average distance within a class is not more than 4.58 and the average distance between classes is not less than 9.68.
21. The gastric cancer typing device according to any one of claims 18 to 20, further comprising a normalization processing module B for normalizing said effective protein set to obtain normalized effective protein set.
22. The gastric cancer typing device according to claim 21, wherein the normalization processing module B is used to make the mean value of the proteins in the effective protein set 0 and the variance 1.
23. The gastric cancer typing device according to claim 18, wherein the different classes of protein markers further comprise protein markers having a P value of less than 10^-20 based on a variance test,
the protein markers having a P value of less than 10^-20 based on the variance test comprise:
C0,HIST2H2AC、HIST2H2AA3、HIST2H2AA4、HIST1H2AJ、HIST1H2AL、HIST1H2AG、HIST1H2AM、HIST1H2AI、HIST1H2AK、HIST1H2AH、HIST1H2AD、HIST1H2AA、HIST1H2AC、HIST1H2AE、HIST1H2AB、HIST3H2A、H2AFX、RPS3A、HIST1H1B、EEF1A2、RAN、LMNB1、HIST2H2AB;
C1,ACTA1、DES、SYNM、MYL9、PGM5、SORBS1;
C2,TPM1、MYH11、TPM2、LMOD1、TAGLN、MYH10、EHD2、COL6A3、FLNC、CNN1、HSPB6、TPM4、SYNPO2、MYLK、CALD1、DPYSL3、CRYAB、ACTN1、TF、TLN1、VCL、HSPG2、TGFBI、CKB、TPM3、KNG1、PDLIM7、ILK、RRAS、CSRP1、LUM、TTR、SOD3、MYL6、ALB、AOC3、OGN、ACTBL2;
C3,TUBB3、TUBA3D、TUBA3C、TUBA3E、CAPZA2、IQGAP1、TUBB4B、GDI1、HPX、MYH9、TUBA4A、TUBB6、TUBA1A、TUBA1C、TUBA1B、TUBB4A、ANXA5、PLS3、CAPZA1、ACTR2、SRSF3、NPEPPS、CLTCL1、SERPINH1、PDLIM3、HNRNPD、TUBA8;
C4,HIST3H3、HIST1H3A、HIST1H3D、HIST2H3C、HIST2H3A、HIST2H3D、HIST1H3C、HIST1H3J、HIST1H3I、HIST1H3B、HIST1H3F、HIST1H3E、HIST1H3G、HIST1H3H;
C5,X;
C6,HSPA8、DDX39A、EIF4A1、PABPC3、EEF1G、HNRNPM、RPN1、RPL4、NCL、ILF3、DHX9、RPS9、RPS19、RPS4X、XRCC5、HNRNPR、HSP90AA1、SYNCRIP、RPS15A、EIF4A3、GANAB、RBMX、EEF2、RPL13A、RPL7、RPS16、PSMA6、RPS24、FUBP1、EIF4A2、EIF2S1、RPL36、SRSF1、RPL7A、RPLP0、RPL6、TUBB8、ARF1、ARF3、RPSA、RAB14、EEF1A1、RPL27、DHX15、ILF2、SRSF7、STIP1、HNRNPA2B1、RPL10A、RPL23A、ARF4、RPS18、RPL38、NME1-NME2、GNB2L1、RPS6、PDIA6、HNRNPA1、PA2G4、RPS3、RAB7A、NONO、RPL9、YWHAQ、QARS、PDIA3、HNRNPA3、CCT5、RPS20、XRCC6、HSP90AB1、RPS28、PCBP2、HNRNPH2、DDX3X、ARPC1B、TAGLN2、EIF3F、DDOST、HSPA4、HNRNPA1L2、CNDP2、PPIB、CTNND1、RPS13、HSPA9、PRKDC、RPS27A、UBB、UBA52、UBC、CCT6A、RPL15、NME2、EIF3A、HNRNPC、RPL13、KHSRP、DDX5、SARNP、ALYREF、ATP2A2、ELAVL1、RUVBL2、ATP6V1A、RAB5C、UBA1、CCT8、STT3B、HSPA2、COPB2、DAD1、P4HB、RRBP1、RPL24、HSP90B1、EFHD2、PGD、DDX3Y、ANXA2、EEF1D、ARCN1、YWHAB、PKM、PSMA7、IDH1、SARS、SNRNP200、PSMD2、HNRNPF、TMED10、HNRNPH1、ENO1、EIF6、HNRNPH3、PGK2、PARP1、HDLBP、STT3A、PRDX6、HSPD1、PGK1、PHB、PPA1、HSPE1、DNM2、CAPN1、OTUB1、ATP6V1B2。
24. An apparatus for typing a gastric cancer sample in an individual, the apparatus comprising:
a gastric cancer typing device for typing gastric cancer according to a known gastric cancer sample set, the gastric cancer typing device being the gastric cancer typing device according to any one of claims 18 to 23;
the screening module C is used for screening out proteins meeting the retention conditions from the protein mass spectrum data of the gastric cancer sample to be detected, and the proteins are used as a protein set to be detected;
the dimensionality reduction module C is used for sequentially carrying out dimensionality reduction treatment on the protein set to be detected twice according to the same conditions of the effective protein set in the known gastric cancer sample set in the gastric cancer typing device to obtain a dimensionality reduction protein set of the gastric cancer sample to be detected;
a clustering and typing module C, configured to perform clustering analysis on the dimensionality reduction protein set of the gastric cancer sample to be detected, to obtain a typing with the highest similarity to the known gastric cancer sample set, that is, the typing corresponding to the gastric cancer sample to be detected, where the dimensionality reduction module C includes:
the principal component dimensionality reduction module C is used for performing primary dimensionality reduction treatment on the protein set to be detected by adopting a principal component analysis method to obtain a first dimensionality reduction data set;
a t-SNE dimension reduction module C for performing second dimension reduction processing on the first dimension reduction data set by using t-SNE to obtain the dimension reduction protein set, wherein the retention condition comprises a quality condition and/or a frequency condition,
the quality condition includes at least one of: the protein has at least two peptide fragments meeting the quality requirement, at least one of which is a unique peptide fragment meeting the quality requirement; and the protein has at least three peptide fragments meeting the quality requirement;
the frequency condition is that the protein is present in at least 80% of the samples in the known gastric cancer sample set.
25. The apparatus of claim 24, wherein the parameter settings of the principal component dimension reduction process in the principal component dimension reduction module C are the same as the parameters of the principal component analysis dimension reduction process performed on the protein mass spectrometry data in the known gastric cancer sample set in the typing apparatus.
26. The apparatus of claim 24, wherein the parameter settings of the dimensionality reduction treatment in the t-SNE dimension reduction module C are the same as the parameters of the t-SNE dimensionality reduction treatment performed on the protein mass spectrometry data in the known gastric cancer sample set in the typing device.
27. The apparatus according to claim 24, wherein the cluster classification module C has an average distance within a class of 4.58 or less and an average distance between classes of 9.68 or more.
28. The device according to any one of claims 24 to 27, further comprising a normalization module C for normalizing the test protein set to obtain a normalized test protein set.
29. The apparatus of claim 28, wherein the normalization processing module C is configured to make the mean of the proteins in the test protein set 0 and the variance 1.
30. A storage medium comprising a stored program, wherein the program executes the method for screening a gastric cancer-typing protein marker according to any one of claims 1 to 6.
31. A processor for running a program, wherein the program performs the method for screening for gastric cancer-typing protein markers according to any one of claims 1 to 6.
CN201910367519.6A 2019-04-30 2019-04-30 Screening method and screening device for gastric cancer typing protein markers and application of screened protein markers Active CN110146636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910367519.6A CN110146636B (en) 2019-04-30 2019-04-30 Screening method and screening device for gastric cancer typing protein markers and application of screened protein markers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910367519.6A CN110146636B (en) 2019-04-30 2019-04-30 Screening method and screening device for gastric cancer typing protein markers and application of screened protein markers

Publications (2)

Publication Number Publication Date
CN110146636A (en) 2019-08-20
CN110146636B (en) 2022-05-13

Family

ID=67594492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910367519.6A Active CN110146636B (en) 2019-04-30 2019-04-30 Screening method and screening device for gastric cancer typing protein markers and application of screened protein markers

Country Status (1)

Country Link
CN (1) CN110146636B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111275109A (en) * 2020-01-20 2020-06-12 国网山东省电力公司枣庄供电公司 Power equipment state data characteristic optimization method and system based on self-encoder
CN111551720A (en) * 2020-04-30 2020-08-18 昆山德诺瑞尔生物科技有限公司 Method for detecting OGN protein expression and application thereof in gastric cancer auxiliary diagnosis
CN111534596A (en) * 2020-05-22 2020-08-14 江西省肿瘤医院(江西省癌症中心) Glioma malignant progression and survival prognosis detection molecular marker, temozolomide drug resistance detection target spot and application
CN111933211B (en) * 2020-06-28 2023-10-31 北京谷海天目生物医学科技有限公司 Cancer accurate chemotherapy typing marker screening method, chemotherapy sensitivity molecular typing method and application
CN113130009A (en) * 2021-04-19 2021-07-16 林燕 Application of regulating EIF4A3 expression to regulating apoptosis, migration and invasion capacity of liver cancer cells
CN114511569B (en) * 2022-04-20 2022-07-12 中南大学湘雅医院 Tumor marker-based medical image identification method, device, equipment and medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11519899B2 (en) * 2015-03-20 2022-12-06 London Health Sciences Centre Research Inc. Metabolomics profiling of central nervous system injury
CN108445097A (en) * 2017-03-31 2018-08-24 北京谷海天目生物医学科技有限公司 Molecular typing of diffuse type gastric cancer, protein marker for typing, screening method and application thereof
CN109554469B (en) * 2017-09-26 2021-10-12 北京大学 Tumor cell of T cell acute lymphatic leukemia and molecular marker thereof

Also Published As

Publication number Publication date
CN110146636A (en) 2019-08-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100195 No. 150, Building 1, Zone B, No. 80, Xingshikou Road, Haidian District, Beijing

Applicant after: BEIJING GUHAI TIANMU BIOMEDICAL TECHNOLOGY Co.,Ltd.

Address before: No. 301-183, Floor 3, Building 2, Zone B, central liquid cold and heat source environmental system industrial base project, No. 80, Xingshikou Road, Haidian District, Beijing 102206

Applicant before: BEIJING GUHAI TIANMU BIOMEDICAL TECHNOLOGY Co.,Ltd.

GR01 Patent grant