WO2018133553A1 - Method for establishing quantitative reference range for healthy person urinary proteome and acquiring disease-related urinary protein marker - Google Patents


Info

Publication number
WO2018133553A1
WO2018133553A1 PCT/CN2017/113550 CN2017113550W WO2018133553A1 WO 2018133553 A1 WO2018133553 A1 WO 2018133553A1 CN 2017113550 W CN2017113550 W CN 2017113550W WO 2018133553 A1 WO2018133553 A1 WO 2018133553A1
Authority
WO
WIPO (PCT)
Prior art keywords
urine
protein
data
proteome
urinary
Application number
PCT/CN2017/113550
Other languages
French (fr)
Chinese (zh)
Inventor
秦钧
冷文川
甄蓓
倪晓天
路天元
汪宜
王广舜
孙长青
钟博文
Original Assignee
北京蛋白质组研究中心
Priority claimed from CN201710051714.9A (CN108334747B)
Priority claimed from CN201710048188.0A (CN108334752B)
Application filed by 北京蛋白质组研究中心
Publication of WO2018133553A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G: PHYSICS
    • G01: MEASURING; TESTING
    • G01N: INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N30/00: Investigating or analysing materials by separation into components using adsorption, absorption or similar phenomena or using ion-exchange, e.g. chromatography or field flow fractionation
    • G01N30/02: Column chromatography
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Biochemistry (AREA)
  • Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • Immunology (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

A method for establishing a quantitative reference range for the urinary proteome of healthy people and for acquiring disease-related urinary protein markers: a quantitative reference range for the healthy human urinary proteome is derived from an established urinary proteome data set of healthy people; the urinary proteome data set of patients with a given disease is screened, using the hypergeometric test, for outlier proteins that serve as urinary protein markers related to that disease; and, taking tumors as an example, a tumor-related outlier urinary protein library is established. The method eliminates interference from physiological fluctuations and from proteins that differ between individuals when screening for urinary protein biomarkers.

Description

Method for establishing a quantitative reference range for the healthy human urinary proteome and for obtaining disease-related urinary protein markers
Technical field
The invention belongs to the establishment of biomarker data in the biomedical field. It relates to a method for establishing a quantitative reference range for the healthy human urinary proteome from a healthy human urinary proteome data set that covers intra-individual and inter-individual physiological fluctuations and differences, and to the healthy human urinary proteome database so established. It further relates to a method for obtaining disease-related urinary protein markers (i.e., outlier urinary proteins) by screening the urinary proteome of patients with a given disease against the quantitative reference range of the healthy human urinary proteome, and in particular to the establishment of disease-related outlier urinary protein libraries and to a tumor-related outlier urinary protein library established with tumors as the representative disease.
Background
Urine is the most commonly used body-fluid sample in clinical testing after blood. Routine urinalysis measures bilirubin, glucose, ketone bodies, protein, blood cells and other indicators for the diagnosis of various diseases and for monitoring treatment efficacy. Given the value of urine testing in health care, scientists worldwide have been using proteomics to search urine for new protein markers for disease diagnosis, prognosis, and efficacy monitoring. The current workflow for finding new biomarkers in urine is usually divided into a discovery phase and a validation phase. In the discovery phase, proteomic methods are applied to a few to a few dozen samples (usually <50) from each of the target disease group and the control group, and proteins that differ significantly between the two groups become candidate biomarkers that enter the validation phase. In the validation phase, the candidate biomarkers are tested on large, independent sample sets. Because high-throughput, deep, quantitative urinary proteome assays have been lacking, candidate markers found from small sample sizes in the discovery phase are in practice usually proteins that differ between individuals rather than proteins that truly reflect the difference between disease and control states. This is the main reason why no new urinary protein marker discovered by proteomics has yet reached clinical use. To discover new urinary protein markers, it is therefore necessary to establish a quantitative reference range for the human urinary proteome that covers intra-individual and inter-individual differences and physiological fluctuations; only then can a method for obtaining tumor urinary protein markers be established that effectively overcomes the interference caused by intra- and inter-individual physiological fluctuations and differences in the urinary proteome.
Summary of the invention
To solve the problems in the prior art, one object of the invention is to provide a method for establishing a quantitative reference range for the healthy human urinary proteome, and further to provide a healthy human urinary proteome database. The database comprises a healthy human urinary proteome data set that covers intra- and inter-individual differences and physiological fluctuations, the number of healthy human urinary proteins determined from that data set, and the calculated quantitative reference range of the healthy human urinary proteome.
The method proposed by the invention for establishing a quantitative reference range for the healthy human urinary proteome comprises the following steps:
1) Sampling: collect urine samples from a statistically meaningful number of healthy people;
2) Preparing urinary protein samples: prepare one urinary protein sample from each collected urine sample;
3) Detection: analyze each urinary protein sample by mass spectrometry to obtain its mass spectrometry data;
4) Database search and quantification: for the mass spectrometry data of each urinary protein sample, perform database searching, peptide quantification, and protein assembly to determine the protein species in the sample and the quantity of each protein, which together form one urinary proteome data record;
5) Define sub-data sets according to person and sampling time span: pooling the urinary proteome data of all urinary protein samples taken from a single person over different sampling time spans gives that person's intra-individual urinary proteome sub-data set (BCM); pooling the urinary proteome data of all urinary protein samples from many people, each sampled once or only a few times, gives the inter-individual urinary proteome sub-data set (BPRC);
6) Calculate the distribution range of the coefficients of variation of all urinary protein quantification data within each sub-data set in order to assess intra-individual physiological fluctuation;
7) Using random resampling, analyze the sub-data sets of the two individuals with the longest sampling time spans to determine the number of samples required to cover the intra-individual physiological fluctuations or differences of the healthy human urinary proteome;
8) Merge the sub-data sets of all subjects (BCM and BPRC) to obtain the total data set A of healthy human urinary proteome data; only proteins with quantitative information in at least 10% of the urine samples of a sub-data set or of the total data set are included when assessing the inter-individual physiological fluctuations and differences of the urinary proteome for that sub-data set or the total data set;
9) Calculate the quantitative reference range of the healthy human urinary proteome from the data of total data set A.
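Step 7) does not specify how the random resampling is carried out. The following is a minimal sketch of one possible reading, not the procedure of the invention: for each subset size k, k samples are drawn at random from one person's series and the fraction of that person's proteins seen in the subset is averaged over many draws. The coverage metric, the function names `coverage_curve` and `samples_needed`, the 95% target, and the toy data are all assumptions made for illustration.

```python
import random

def coverage_curve(samples, n_iter=200, seed=0):
    """samples: list of sets, each holding the proteins quantified in one
    urine sample of the same person (hypothetical input format).
    Returns {k: mean fraction of the person's proteins seen in k random samples}."""
    rng = random.Random(seed)
    all_proteins = set().union(*samples)
    curve = {}
    for k in range(1, len(samples) + 1):
        fractions = []
        for _ in range(n_iter):
            subset = rng.sample(samples, k)      # draw k samples without replacement
            seen = set().union(*subset)
            fractions.append(len(seen) / len(all_proteins))
        curve[k] = sum(fractions) / n_iter
    return curve

def samples_needed(curve, target=0.95):
    """Smallest subset size whose mean coverage reaches the target (assumed criterion)."""
    return min(k for k, frac in curve.items() if frac >= target)

# Toy series of four samples from one hypothetical individual.
series = [{"A", "B"}, {"A", "C"}, {"A", "B", "C", "D"}, {"B", "D"}]
curve = coverage_curve(series)
needed = samples_needed(curve, target=1.0)
```

The subset size at which the curve flattens gives an estimate of how many samplings of one person are needed; a different coverage metric (for example, the span of quantitative values per protein) could be substituted without changing the resampling loop.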
When the data in step 9) follow a normal distribution, the quantitative reference range is established by the parametric method: the upper and lower limits of the reference range covering the target percentage of the population are calculated by formula from the statistical parameters of the data (mean and standard deviation); for example, the mean plus or minus 2 standard deviations covers 95% of individuals. When it is uncertain whether the data in step 9) follow a normal distribution, the quantitative reference range is established by the nonparametric method: the upper and lower limits obtained by the percentile method directly cover the target percentage of individuals; for example, the 2.5th and 97.5th percentiles cover 95% of individuals.
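The two ways of setting the reference range can be sketched as follows for a single protein. This is an illustrative sketch only: the function names, the linear-interpolation percentile rule, and the example values are assumptions, since the text does not state which percentile convention is used.

```python
import statistics

def percentile(values, q):
    """Percentile (q in 0..100) with linear interpolation between ranks.
    The interpolation rule is an assumption; the source does not specify one."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def parametric_range(values, k=2.0):
    """Mean +/- k standard deviations; k = 2 covers about 95% of individuals
    when the data are normally distributed."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return m - k * s, m + k * s

def nonparametric_range(values, lower_q=2.5, upper_q=97.5):
    """Percentile method; makes no assumption about the distribution."""
    return percentile(values, lower_q), percentile(values, upper_q)

# Hypothetical quantification values of one protein across 101 healthy samples.
abundance = [float(v) for v in range(1, 102)]
low, high = nonparametric_range(abundance)
p_low, p_high = parametric_range(abundance)
```

In practice one of the two functions would be applied protein by protein over total data set A, with the nonparametric version used whenever normality has not been established.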
Different sub-data sets are defined according to person and sampling time span. Sub-data sets formed from urine samples of few people sampled many times are used to assess the intra-individual physiological fluctuations and differences of the urinary proteome; sub-data sets formed from urine samples of many people sampled once or only a few times are used to assess the inter-individual physiological fluctuations and differences of the urinary proteome; male and female urinary proteome sub-data sets can be used to assess the inter-individual physiological fluctuations and differences of the urinary proteome for the different sexes.
The assessment is performed by calculating the coefficient of variation of each qualifying protein in the corresponding sub-data set or total data set, and then displaying the distribution range of the coefficients of variation of the qualifying proteins in each sub-data set or the total data set as a box plot, in order to assess the corresponding inter-individual physiological fluctuations and differences of the urinary proteome.
In step 2), the urinary protein sample is obtained by a method based on ultracentrifugation and reduction: the pellet obtained by centrifuging the urine sample is resuspended in resuspension buffer (50 mM Tris, 250 mM sucrose, pH 8.5); dithiothreitol is then added and the sample is heated to remove most of the uromodulin; the sample is washed with wash buffer (10 mM triethanolamine, 100 mM sodium chloride, pH 7.4) and centrifuged, and the resulting pellet is the urinary protein sample of that urine sample.
In step 3), the urinary protein sample is separated by polyacrylamide gel electrophoresis (SDS-PAGE), the gel is cut into 6 bands for in-gel digestion, and the digests are then pooled into a 2-fraction peptide sample representing one urinary proteome; the 2-fraction peptide sample is analyzed by LC-MS/MS to obtain the mass spectrometry data of the urinary protein sample for each urine sample. The purpose of the database search in step 4) is to analyze the mass spectrometry output, determine the proteins it contains, and obtain first-level (MS1) quantification results for all peptides, thereby obtaining the proteome data corresponding to each urinary protein sample.
Intra-individual physiological fluctuations and differences of the urinary proteome of healthy individuals are assessed over three sampling time spans (within 24 hours, 3 consecutive days, and more than 2 months). The assessment determines the distribution range of the coefficient of variation (the standard deviation of a protein's quantification data divided by the mean of that data) of each protein in the corresponding sub-data set. Each sub-data set sampled within 24 hours or over 3 consecutive days contains 3-5 urinary proteome data records; the coefficient of variation is calculated for proteins with quantitative data in all 3-5 urine samples, giving the distribution range of the coefficients of variation of all qualifying proteins in each sub-data set, which is displayed as a box plot. Each sub-data set with a sampling time span of more than 2 months contains 6-62 urinary proteome data records; the coefficient of variation is calculated for proteins with quantitative data in at least 3 urine samples (sub-data sets with <30 urinary proteomes) or in at least 10% of urine samples (sub-data sets with >30 urinary proteomes), again giving the distribution range of the coefficients of variation of all qualifying proteins in each sub-data set, which is displayed as a box plot.
Inter-individual physiological fluctuations and differences of the healthy human urinary proteome are assessed on the total data set and on its male and female sub-data sets: for proteins with quantitative data in more than 10% of the urine samples of each data set or sub-data set, the coefficient of variation of the quantification data is calculated, and the distribution of the coefficients of variation of all qualifying proteins in each data set and sub-data set is displayed as a box plot.
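The coefficient-of-variation assessment described above can be sketched as follows. The data layout (one dictionary of protein quantities per urine sample), the function names, and the `min_count` guard are assumptions made for illustration; the guard is needed only because a standard deviation requires at least two values.

```python
import statistics

def coefficient_of_variation(values):
    """CV = standard deviation / mean of one protein's quantification data."""
    return statistics.stdev(values) / statistics.mean(values)

def cv_per_protein(dataset, min_fraction=0.10, min_count=2):
    """dataset: list of samples, each a dict {protein: quantity} (assumed format).
    A protein qualifies if it is quantified in at least `min_fraction` of the
    samples and in at least `min_count` samples."""
    n_samples = len(dataset)
    quantified = {}
    for sample in dataset:
        for protein, value in sample.items():
            quantified.setdefault(protein, []).append(value)
    cvs = {}
    for protein, values in quantified.items():
        if len(values) >= max(min_count, min_fraction * n_samples):
            cvs[protein] = coefficient_of_variation(values)
    return cvs

# Hypothetical sub-data set of three samples; P2 is quantified only once.
toy = [{"P1": 10.0}, {"P1": 20.0}, {"P1": 30.0, "P2": 5.0}]
cvs = cv_per_protein(toy)
```

The values of the returned dictionary are what a box plot would display for one sub-data set; for the short time spans (24 hours or 3 consecutive days) `min_fraction=1.0` reproduces the requirement that a protein be quantified in every sample.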
The invention further provides a healthy human urinary proteome database comprising the sub-data sets and total data set determined as above, the healthy human urinary protein species determined from the data sets, and the calculated quantitative reference range of the healthy human urinary proteome. The quantitative reference range comprises the 2025 urinary proteins and their values listed in and covered by Table 7-1 or Table 7-2.
More broadly, the invention also provides a method for establishing a quantitative reference range for healthy human urinary proteins, comprising the following steps:
Sampling: collect urine samples from healthy people;
Sample preparation: prepare urinary protein samples from the collected urine samples;
Detection: analyze the urinary protein samples to obtain protein detection data for them;
Determination: classify the protein detection data; for each class, select an upper limit value and a lower limit value from the multiple protein detection data to form a quantitative reference range; the ranges of all classes together constitute the quantitative reference range for healthy human urinary proteins.
Any quantitative reference range for healthy human urinary proteins established by this method from healthy human urinary protein data falls within the scope of the present disclosure. The human urinary protein data include, but are not limited to, the protein species and the content of each protein.
Another object of the invention is to provide a method for obtaining disease-related urinary protein markers. The disease may be any disease; the technical solution is described below using tumors as an example, and the statements concerning tumors apply equally to any given disease.
The method proposed by the invention for obtaining disease-related (taking tumors as an example) urinary protein markers comprises the following steps:
(1) Randomly divide the healthy human urinary proteome data set A into three sub-data sets A1, A2, and A3; determine the quantitative reference range of the healthy human urinary proteome from sub-data set A1 by the nonparametric percentile method, taking the quantitative value at the 99.5th percentile of each urinary protein in that data set as the upper limit of the quantitative reference range;
(2) Draw part of the tumor patient urinary proteome data set B to form training sub-data set B1, and screen each urinary proteome data record in it against the reference range upper limits established in (1); a protein that exceeds the upper limit in at least two samples is added to the candidate tumor-related outlier urinary protein library; once all training data have been screened, a candidate tumor-related outlier urinary protein library C1 is obtained;
(3) Draw part of the tumor patient urinary proteome data set B to form validation sub-data set B2; screen each urinary proteome data record in sub-data sets A2 and B2 against the reference range upper limits established in (1), so that each urinary proteome (sample) yields a sample-specific outlier urinary protein library C2; compare all proteins in each sample-specific library C2 with the proteins in the candidate library C1 generated in (2) to determine the proteins shared by the two libraries and their number; the more shared proteins, the more closely the sample resembles a tumor patient sample;
use the hypergeometric test to calculate the p-value of the overlap of shared proteins between libraries C1 and C2, and use these p-values to plot a receiver operating characteristic (ROC) curve to examine the ability of the candidate library C1 generated in (2) to distinguish the urinary proteomes of healthy people from those of tumor patients in validation sub-data sets A2 and B2;
(4) Randomly sample the tumor patient urinary proteome data set B N times (N is a natural number greater than 10) to form N pairs of training sub-data set B1 and validation sub-data set B2; perform the same analysis as in (3) on each pair to obtain N candidate tumor-related outlier urinary protein libraries C1 and N ROC curves; the candidate library C1 corresponding to the largest area under the ROC curve is taken as the final tumor-related outlier urinary protein library C, and the outlier proteins it contains are the tumor urinary protein markers.
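Steps (1) to (3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the data layout (one dictionary of protein quantities per sample), all function names, the interpolation rule in `percentile`, the choice to skip proteins that have no reference limit, and the use of the number of reference proteins as the hypergeometric population size are assumptions.

```python
from math import comb

def percentile(values, q):
    """Percentile (q in 0..100) with linear interpolation (assumed convention)."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def upper_limits(healthy_samples, q=99.5):
    """Step (1): per-protein upper reference limit from sub-data set A1."""
    by_protein = {}
    for sample in healthy_samples:
        for protein, value in sample.items():
            by_protein.setdefault(protein, []).append(value)
    return {p: percentile(v, q) for p, v in by_protein.items()}

def sample_outliers(sample, limits):
    """Sample-specific outlier library C2: proteins above their upper limit.
    Proteins without a reference limit are skipped (assumption)."""
    return {p for p, v in sample.items() if p in limits and v > limits[p]}

def candidate_library(patient_samples, limits, min_samples=2):
    """Step (2): proteins that are outliers in at least `min_samples` samples (C1)."""
    counts = {}
    for sample in patient_samples:
        for p in sample_outliers(sample, limits):
            counts[p] = counts.get(p, 0) + 1
    return {p for p, c in counts.items() if c >= min_samples}

def overlap_pvalue(c1, c2, n_reference):
    """Step (3): hypergeometric P(X >= |C1 & C2|) when |C2| proteins are drawn
    from n_reference proteins of which |C1| belong to the library."""
    overlap, draws = len(c1 & c2), len(c2)
    tail = sum(comb(len(c1), k) * comb(n_reference - len(c1), draws - k)
               for k in range(overlap, min(len(c1), draws) + 1))
    return tail / comb(n_reference, draws)
```

A small p-value means that a sample shares more outlier proteins with C1 than chance would explain. Repeating the split N times and keeping the C1 with the largest area under the ROC curve, as in step (4), would wrap these functions in a loop.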
The above method further includes a step of validating the established tumor-related outlier urinary protein library C:
(5) Draw a fully independent part of the tumor patient urinary proteome data set B (i.e., samples that never took part in the training and validation process) to form validation sub-data set B3; use sub-data sets A3 and B3 to test the ability of the final tumor-related outlier urinary protein library C obtained in (4) to distinguish healthy people from tumor patients, by the same method as in (3): obtain the hypergeometric test p-value for the urinary proteome of each healthy person and each tumor patient, and compare it with the cutoff value Pc determined in (4) to decide whether each urinary proteome belongs to a healthy person or a tumor patient; the sensitivity and specificity with which the tumor-related outlier urinary protein library distinguishes healthy people from tumor patients are determined from the false-positive and false-negative rates.
In step (1), the quantitative reference range of the healthy human urinary proteome is calculated from the data of sub-data set A1 by the nonparametric method: the upper and lower limits obtained by the percentile method directly cover the target percentage of individuals (for example, the 2.5th and 97.5th percentiles cover 95% of individuals).
The process of establishing the tumor patient urinary proteome data set B used in step (2) comprises:
1) Sampling: collect urine samples from tumor patients;
2) Preparing urinary protein samples: prepare one urinary protein sample from each collected urine sample;
3) Detection: analyze each urinary protein sample by mass spectrometry to obtain its mass spectrometry data;
4) Database search and quantification: for the mass spectrometry data of each urinary protein sample, perform database searching, peptide quantification, and protein assembly to determine the protein species in the sample and the quantity of each protein, which together form one urinary proteome data record;
5) Pool the urinary proteome data of all urinary protein samples to obtain the tumor patient urinary proteome data set B.
The tumor-related outlier urinary protein library obtained by the above method, and the outlier urinary proteins it contains (i.e., the tumor urinary protein markers), also belong to the invention.
The tumor-related outlier urinary protein library comprises 509 urinary proteins, specifically: A1BG, A2M, ABCB7, ABCD4, ABCE1, ABHD11, ABHD12, ABHD14B, ACADM, ACADSB, ACE2, ACO2, ACOT9, ACSL3, ACSM2A, ACSM2B, ACTR1B, ADD1, AGT, AHNAK2, AHSG, ALDH1L2, ALDH3A1, ALDH3A2, ALDH3B1, ALDH4A1, ALDOC, AMY2A, AMY2B, ANGPTL6, ANK1, ANPEP, ANXA1, ANXA10, ANXA2, ANXA3, ANXA4, ANXA5, ANXA6, APMAP, APOB, APP, AQP7, ARFIP1, ARG1, ARHGAP1, ARL13B, ARL6IP5, ARL8A, ARMC9, ARRDC1, ASNA1, ASPH, ATP13A3, ATP2A2, ATP6AP1, ATP6V0A1, ATP6V0C, ATP6V1B1, ATP6V1B2, AZU1, B3GNT3, BBOX1, BLMH, BPI, BPIFB1, C14orf166, C19orf59, C1orf123, C1QB, C1QC, C3, C4A, C4BPA, C5, C6orf211, C8A, C8B, C9, CAMK2G, CANT1, CCDC22, CCDC64B, CCT4, CCT6A, CDC42BPA, CDH11, CEACAM1, CEACAM6, CEACAM8, CECR5, CERS3, CFB, CHCHD3, CHI3L1, CHP1, CLCA4, CLCN7, CLPTM1, CLRN3, CMBL, CNDP2, CNN3, COL12A1, COL4A2, COMP, COPA, COPS7A, CORO1C, CPT1A, CPVL, CRAT, CRHBP, CRP, CRYAB, CRYL1, CTBS, CTNNA1, CTSG, CTSH, CWH43, CYP1A1, DDAH2, DDOST, DDX6, DERL1, DHCR24, DHX15, DIRC2, DKC1, DMTN, DNAJA1, DNAJB1, DNAJC7, DNASE1, DNM2, DOCK2, DOCK5, DOCK9, DPM1, DPP4, DPT, DSCR3, ECHS1, EEF1A1, EIF4A1, ELMO1, ELMO3, EMC1, EMD, EMILIN1, ENO1, ENOPH1, ENPEP, ENPP4, ENPP6, ENPP7, EPB41, EPB42, EPCAM, EPDR1, EPHA2, EPPK1, EPX, ERAP1, ERLIN1, ERLIN2, ERMP1, ESRP1, ETF1, FAM151A, FARP1, FCAR, FCN2, FDFT1, FDXR, FGA, FGB, FGG, FGL1, FGL2, FGR, FLOT2, FN3KRP, FNBP1L, FOLH1, FRK, GALK2, GALM, GALNT1, GDF15, GIPC2, GLDC, GLUD1, GLYAT, GNS, GOLM1, GOLT1B, GPD2, GPR110, GPR126, GPR137B, GPR56, GPX3, GSTK1, GSTT1, HBB, HDLBP, HECTD3, HEXA, HEXB, HGD, HLA-A, HLA-B, HLA-DQB1, HLA-DRA, HLA-DRB1, HLA-DRB5, HMOX2, HNRNPA0, HNRNPA1, HNRNPD, HNRNPF, HNRNPH3, HNRNPK, HNRNPM, HNRNPU, HSP90B1, HSPA2, HSPA6, HSPA8, HSPA9, HYOU1, IDH2, IDH3A, IGFBP7, IGLL1, ILF3, IMPDH2, IQCC, ITIH1, ITIH2, KCNJ15, KIAA0319L, KIF13B, KRT18, KRT20, KRT7, KRT8, LAMB1, LAMP1, LAMTOR1, LBR, LLGL2, LMF2, LMNB2, LONP1, LPCAT3, LRG1, LRRK2, LZTFL1, MAL2, MARVELD3, MCCC1, MCRS1, MFSD1, MGAM, MITD1, MLYCD, MMP7, MNDA, MPO, MPP1, MRPL12, MRPL39, MRPS18B, MRPS22, MSMO1, MTAP, MTCH1, MTHFD1, MTOR, MUC20, MUT, MVP, MYO18A, MYO1B, MYO1D, NAMPT, NCF1, ND4, NDFIP1, NDUFA10, NDUFA9, NDUFS3, NDUFS8, NNT, NPTN, NSDHL, NT5C3A, NUDT9, NUMA1, OAT, OGFOD3, OGN, OLA1, ORM1, P4HB, PAPSS2, PCCA, PCCB, PCK1, PCK2, PDHA1, PDIA3, PDP1, PEF1, PFKFB2, PFKL, PGD, PHB2, PHPT1, PI4K2A, PICALM, PIP4K2A, PIP4K2C, PITRM1, PLA2G15, PLAU, PLCG2, PLEKHF2, PNKD, PON1, PPP2R2A, PRDX4, PRKAB1, PRKACA, PSMA5, PSMA7, PSMC3, PSMC4, PSMD3, PSMD7, PSMF1, PTPLAD1, PTPRC, RAB11B, RAB12, RAB24, RAB32, RAB7A, RAB9A, RDH13, RELA, RILPL2, RNASE3, RNASET2, ROGDI, RPN1, RPN2, RPS3, RPS6KA1, RPS9, RPTOR, RRAGC, RRAS2, S100A7, SCAMP3, SCARB1, SCARB2, SEC22B, SEPT9, SERPINA1, SERPINA3, SERPINA7, SERPIND1, SF3B1, SF3B3, SGPL1, SIAE, SIRT5, SLC12A6, SLC12A7, SLC12A9, SLC13A2, SLC15A1, SLC15A4, SLC17A1, SLC17A3, SLC17A5, SLC1A5, SLC22A11, SLC25A10, SLC25A13, SLC25A24, SLC25A4, SLC25A5, SLC26A11, SLC26A4, SLC2A1, SLC30A2, SLC34A2, SLC35F2, SLC35F6, SLC38A7, SLC3A1, SLC46A1, SLC4A1, SLC6A19, SLC6A8, SLC7A9, SLC9A3, SLFN5, SMPDL3A, SMPDL3B, SNRNP200, SNX27, SORT1, SPAG9, SPARCL1, SPNS1, SPRYD4, SPTA1, SPTB, SPTLC1, SSR1, ST13, STAP1, STARD3NL, STC1, STIM1, STK11IP, STOM, STRN, STT3A, STUB1, STX3, STX7, STX8, STXBP1, SUCLA2, SUCLG2, SULT1A1, SULT1C2, SUN2, SVIL, TACSTD2, TAOK1, TARDBP, TBC1D9B, TBCB, TBL2, TCIRG1, TFRC, TGM2, TGM3, TIMM44, TIMM50, TM9SF2, TM9SF3, TM9SF4, TMBIM1, TMED2, TMED4, TMEM104, TMEM106A, TMEM160, TMEM176A, TMEM176B, TMEM192, TMEM205, TMEM27, TMEM40, TMEM55B, TMEM9, TMLHE, TOMM40L, TOP2B, TPCN1, TPCN2, TPM3, TRIM14, TRIP10, TST, TTC38, TUFM, TXNDC5, TYMP, UGT1A6, UGT1A9, UGT2B7, UPK3B, VAC14, VAMP7, VAPA, VAPB, VASN, VIM, VKORC1L1, VNN2, VPS37C, VPS4A, VSNL1, VTI1B, VWA5A, WASH1, WASL, ZADH2, ZNRF2.
A further object of the present invention is to provide a method for judging the tumor relevance of a urine sample to be tested by using the tumor-related outlier urinary protein library. The method is as follows: the proteome data of the urine sample to be tested are obtained; a hypergeometric distribution test is used to calculate the p-value of the overlap of identical proteins between the urine sample and the tumor-related outlier urinary protein library; and the Pc value corresponding to a specificity of 95% is determined. When the hypergeometric test p-value is greater than Pc, the urine sample to be tested is judged to be a sample from a healthy person; when the p-value is less than Pc, it is judged to be a sample from a tumor patient.
Similarly, other disease-related outlier urinary protein libraries can be used to judge the disease relevance of a urine sample to be tested. The process is as follows: the proteome data of the urine sample to be tested are obtained; a hypergeometric distribution test is used to calculate the p-value of the overlap of identical proteins between the urine sample and the outlier urinary protein library of the disease; and the Pc value corresponding to a specificity of 95% is determined. When the hypergeometric test p-value is greater than Pc, the urine sample to be tested is judged to be a sample from a healthy person; when the p-value is less than Pc, it is judged to be a sample from a patient with the disease.
Effects of the invention: by collecting urinary proteome data from healthy persons on a large scale, a urinary proteome dataset is established that covers intra-individual and inter-individual differences and physiological fluctuations, and this dataset is used to establish a quantitative reference range for the urinary proteome. The urinary proteome data of patients with a given disease (tumors are used as the example) are screened against this reference range to obtain outlier urinary protein markers related to the disease. This screening process effectively excludes interference from physiological fluctuations and from proteins that differ between individuals during the discovery of urinary protein biomarkers.
Brief Description of the Drawings
Figure 1 shows the coefficients of variation for the intra-individual physiological fluctuation range of the healthy human urinary proteome over 24 hours and over 3 consecutive days. The 24-hour data come from 2 volunteers (U001 and U002), and the 3-consecutive-day data come from 16 volunteers (U001-U005, U007-U017). The vertical axis is the coefficient of variation, and the horizontal axis shows the different sub-datasets of different individuals.
Figure 2 shows the coefficients of variation for the intra-individual physiological fluctuation range of the healthy human urinary proteome over more than 60 days. Except for U010, U015 and U017, the sampling time span of the other 14 volunteers was 61-314 days. The vertical axis is the coefficient of variation, and the horizontal axis shows the sub-datasets of different individuals.
Figure 3 shows the relationship between the number of samples and the amplitude of intra-individual physiological fluctuation of the healthy human urinary proteome. Panel A shows this relationship for volunteer U001, and panel B shows it for volunteer U002. The vertical axis is the coefficient of variation, and the horizontal axis is the number of samples in the sub-dataset.
Figure 4 shows the coefficients of variation for the inter-individual physiological fluctuation range of the healthy human urinary proteome. Vertical axis: coefficient of variation. Horizontal axis: BCM, BPRC, A1, Female and Male are sub-datasets, and A is the total dataset; the number in parentheses is the median of the distribution of protein coefficients of variation in each dataset.
Figure 5 is the total ion chromatogram generated by liquid chromatography tandem mass spectrometry (LC-MS) analysis of one urine protein sample (comprising peptide samples in 2 fractions) from volunteer U001. The vertical axis is signal intensity, and the horizontal axis is retention time.
Figure 6 is a flow chart of the process of establishing the tumor-related outlier urinary protein library:
Panel A shows the generation of the training dataset and of the candidate tumor-related outlier protein library;
Panel B shows the generation of the validation dataset and the evaluation of the candidate tumor-related outlier protein library;
Panel C shows the generation of the test dataset and the testing of the final tumor-related outlier protein library.
Detailed Description
In one aspect, the present invention provides a method for establishing a quantitative reference range for the healthy human urinary proteome, and further provides a healthy human urinary proteome database. In another aspect, it provides a method for obtaining disease-related urinary protein markers and, taking tumors as the example, further provides a tumor-related outlier urinary protein library. The invention uses the quantitative reference range of the healthy human urinary proteome to screen the urinary proteome data of patients with a given disease for outlier proteins, and finally determines the disease-related (tumors as the example) outlier urinary protein library through analysis in three stages: discovery, validation and testing. For these stages, the urinary proteome data of healthy persons and of patients are randomly divided into training, validation and test sub-datasets, respectively. In general, the term proteome refers to the collection of all types of proteins in a cell, a tissue, a body fluid or an individual. In the present invention, the urinary proteome refers to all of the different types of proteins contained in each urine sample.
To achieve the above results, the present invention is described with respect to the following aspects:
I. Preparation of urine protein samples
For the urine samples collected from healthy persons and from patients with a given disease, the present invention uses the following method, based on ultracentrifugation and reduction, to obtain urine protein samples:
(1) Centrifuge 10 ml of urine at 100,000 g and 4 °C for 20 minutes; discard the supernatant and keep the pellet;
(2) Transfer the pellet to a centrifuge tube, add 60 μl of resuspension buffer (50 mM Tris, 250 mM sucrose, pH 8.5), let stand at room temperature for 10 minutes, and resuspend the pellet thoroughly by pipetting;
(3) Add dithiothreitol to the resuspended pellet to a final concentration of 50 mM and heat at 80 °C for 10 minutes to remove most of the uromodulin from the sample;
(4) Add washing buffer (10 mM triethanolamine, 100 mM sodium chloride, pH 7.4) to a volume of 400 μl, then centrifuge at 100,000 g and 4 °C for 20 minutes; discard the supernatant and keep the pellet.
This pellet serves as the urine protein sample of the urine sample.
II. Mass spectrometric detection of urine protein samples
In the present invention, the pellet of each urine protein sample prepared by the above ultracentrifugation method is dissolved in 60 μl of 1% sodium dodecyl sulfate buffer (1% SDS, 50 mM Tris, pH 8.5), and 30 μl is loaded and separated by polyacrylamide gel electrophoresis (SDS-PAGE). The gel is then cut into 6 bands for in-gel digestion, and the digests are pooled into peptide samples in 2 fractions, which together represent one urinary proteome. The 2 peptide fractions are analyzed by LC-MS/MS to obtain the urine protein sample data for each urine sample (mass spectrometry data; see Figure 5 for the chromatogram). The specific operations are as follows:
The peptide sample obtained after digestion is dissolved in 20 μl of loading buffer (5% methanol, 0.1% formic acid), and 5 μl is loaded for data acquisition on a Thermo Scientific nanoflow liquid chromatography tandem high-resolution mass spectrometry system (nLC-Easy1000-Q Exactive-HF).
The nanoflow LC trap (loading) column has an inner diameter of 100 μm and is packed with C18 packing from Dr. Maisch GmbH (particle diameter 3 μm, pore size 120 nm) to a bed length of 2 cm. The nanoflow LC analytical (separation) column has an inner diameter of 150 μm and is packed with C18 packing from Dr. Maisch GmbH (particle diameter 1.9 μm, pore size 120 nm) to a bed length of 12 cm. Mobile phase A is 0.1% formic acid; mobile phase B is acetonitrile with 0.1% formic acid. The elution gradient for peptide separation is as follows: 5%-31% mobile phase B from 0 to 69 minutes, and 95% mobile phase B from 70 to 75 minutes.
Mass spectrometry data are acquired in Data Dependent Acquisition mode. The Q Exactive-HF parameters are as follows. MS1: resolution 120,000; scan range 300-1400 m/z; AGC 3E+6; maximum ion injection time 80 ms. MS2: precursor ions are isolated and fragmented in order of decreasing MS1 signal intensity (Top 20 mode); resolution 15,000; precursor isolation window 3 m/z; AGC 2E+4; maximum ion injection time 20 ms; HCD normalized collision energy 27%. A dynamic exclusion of 12 s is used during acquisition.
III. Analysis of the mass spectrometry data of urine protein samples
The mass spectrometry data obtained from each urine protein sample are searched against a database using bioinformatics tools and methods. The purpose of the database search is to analyze the data produced by the mass spectrometer and to determine which proteins they contain. In this process, the MS2 spectra of the precursor ions in the data are analyzed: within a given mass deviation range, the intensity distribution of the fragment ions is compared with the theoretical intensities, and each precursor ion is scored according to the fragment ions that fall within the mass deviation range, which gives the identification of the precursor ion (a short peptide). The short peptides are then matched against a library of known protein amino acid sequences to determine the proteins to which the detected peptides belong, which gives the protein identification results. The specific process and the parameters used are as follows:
The mass spectrometry data are searched against a peptide sequence database using Proteome Discoverer V2.0 software with the Mascot 2.3 search engine. The database search parameters are set in the "Mascot" template as follows: under "Protein Database", select the human protein sequence database, here the human protein reference sequence database of the National Center for Biotechnology Information (NCBI); under "Enzyme Name", select Trypsin; under "Maximum Missed Cleavage", enter 2 (at most 2 missed cleavage sites are allowed); under "Instrument", select Default; under "Taxonomy", select All entries; under "Precursor Mass Tolerance", enter 20 ppm; under "Fragment Mass Tolerance", enter 50 mmu; under "Use Average Precursor Mass", select False; under "From Quan Method", select None; under "Show All Modifications", select False; under "Dynamic Modification", select the commonly occurring modifications Acetyl (Protein N-term), DeStreak (C), Oxidation (M) and Carbamidomethyl (C). The false positive identification rate at the peptide level must be below 1%.
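The two mass tolerances listed above are given in different units: parts per million for the first and milli-mass units (mmu) for the second. The following minimal sketch only illustrates what such tolerance windows mean numerically. It is not the scoring algorithm of Mascot or Proteome Discoverer, the function names are hypothetical, and applying the 50 mmu window to fragment ions is an assumption made for illustration.

```python
MMU_IN_DA = 0.001  # 1 milli-mass unit (mmu) = 0.001 Da


def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Signed mass error of an observed m/z in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def precursor_matches(observed_mz: float, theoretical_mz: float,
                      tol_ppm: float = 20.0) -> bool:
    """True if the observed m/z lies within a relative (ppm) window."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm


def fragment_matches(observed_mz: float, theoretical_mz: float,
                     tol_mmu: float = 50.0) -> bool:
    """True if the observed m/z lies within an absolute (mmu) window.

    Assumption: the 50 mmu value is treated as an absolute fragment window.
    """
    return abs(observed_mz - theoretical_mz) <= tol_mmu * MMU_IN_DA
```

A relative window scales with m/z (20 ppm at m/z 1000 is 0.02), whereas the absolute window stays at 0.05 regardless of m/z.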
Using the peptide-spectrum match information produced by the database search, the MS1 spectra in the raw data are processed to obtain MS1-level quantitative results for all peptides. The batch calculation uses the existing program "Protein abundance quantification software based on peptide cross-regression of high-resolution mass spectrometry data" (abbreviated PQPCR) V1.0 (computer software copyright registration certificate of the National Copyright Administration of the People's Republic of China: Ruan Zhu Deng Zi No. 0451332; registration number 2012SR083269; registration date September 4, 2012; copyright owner: Beijing Proteome Research Center). The quantified peptides are assembled into their corresponding proteins according to the amino acid sequences of the proteins in the database, giving the urinary proteome data corresponding to each urine protein sample. The urinary proteome refers to all of the different types of proteins contained in each urine sample; all of the proteins identified in one urine sample are referred to as one urinary proteome.
IV. Establishing the healthy human urinary proteome dataset and its sub-datasets, and the tumor patient urinary proteome dataset
The urinary proteome data of each healthy person obtained by the above analysis are merged to obtain the healthy human urinary proteome dataset A (integrating Tables 4 and 5; a dataset of 497 urinary proteomes from 167 healthy persons). The urinary proteome data of each tumor patient (tumors serve as the example disease) are merged to obtain the tumor urinary proteome dataset B (see Table 8-2; a dataset of 154 urinary proteomes from 7 types of solid tumor: 17 cases of bladder cancer, 4 of breast cancer, 25 of cervical cancer, 22 of colorectal cancer, 14 of esophageal cancer, 47 of gastric cancer and 25 of lung cancer).
The data in the healthy human urinary proteome dataset A are used to evaluate the intra-individual and inter-individual physiological fluctuations and differences of the healthy human urinary proteome and to establish its quantitative reference range. The data in dataset A can be divided into different sub-datasets according to the type of physiological fluctuation or difference to be evaluated. For example, the data used to evaluate the intra-individual differences of one individual can form a sub-dataset (see Table 3); the data within this sub-dataset can be further divided into sub-datasets according to sampling time span, in order to evaluate the physiological fluctuations and differences of the urinary proteome within a healthy individual over different time spans. Sub-datasets can also be established according to factors such as sex.
The data of the dataset or sub-datasets are used to systematically evaluate the intra-individual and inter-individual differences and physiological fluctuations of the healthy human urinary proteome, and on this basis the quantitative reference range of the healthy human urinary proteome is calculated by the percentile method (see Table 6).
The data in the tumor urinary proteome dataset B are randomly divided, as required, into training, validation and test sub-datasets, which are used respectively for the discovery of tumor-related outlier urinary proteins, for their validation, and for testing their ability to distinguish healthy persons from tumor patients.
V. Evaluating intra-individual physiological fluctuations and differences of the healthy human urinary proteome
The intra-individual physiological fluctuations and differences of the healthy human urinary proteome were evaluated over three sampling time spans (within 24 hours, over 3 consecutive days, and over more than 2 months). The evaluation determines the distribution of the coefficients of variation (standard deviation of a protein's quantitative data divided by the mean of that protein's quantitative data) of the proteins in the corresponding sub-dataset. Each sub-dataset sampled within 24 hours or over 3 consecutive days contains 3-5 urinary proteomes; the coefficient of variation is calculated for those proteins that have quantitative data in all 3-5 urine samples, which gives the distribution of coefficients of variation of all qualifying proteins in each sub-dataset, displayed as a box plot. Each sub-dataset with a sampling time span of more than 2 months contains 6-62 urinary proteomes; the coefficient of variation is calculated for those proteins that have quantitative data in at least 3 urine samples (sub-datasets with fewer than 30 urinary proteomes) or in at least 10% of the urine samples (sub-datasets with more than 30 urinary proteomes), which gives the distribution of coefficients of variation of all qualifying proteins in each sub-dataset, displayed as a box plot.
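The per-protein coefficient of variation described above can be sketched as follows. This is an illustration only: the data layout (one dict of protein quantities per urinary proteome), the function names and the use of the sample standard deviation are assumptions, since the source does not state which standard deviation estimator was used.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = standard deviation / mean (sample standard deviation assumed)."""
    return stdev(values) / mean(values)


def cv_distribution(sub_dataset, min_samples):
    """Return {protein: CV} for one sub-dataset.

    sub_dataset: list of dicts, one per urinary proteome, mapping a protein
                 name to its quantitative value (proteins that were not
                 quantified in a sample are simply absent from its dict).
    min_samples: minimum number of samples in which a protein must have
                 quantitative data in order to be included.
    """
    by_protein = {}
    for proteome in sub_dataset:
        for protein, quantity in proteome.items():
            by_protein.setdefault(protein, []).append(quantity)
    return {
        protein: coefficient_of_variation(values)
        for protein, values in by_protein.items()
        # stdev needs at least two values; a zero mean would divide by zero
        if len(values) >= max(min_samples, 2) and mean(values) > 0
    }
```

For a 24-hour or 3-day sub-dataset, `min_samples` would be the number of proteomes in that sub-dataset; for a long-span sub-dataset it would be 3 or 10% of the sample count, following the rule in the text. The resulting CV values are what a box plot would then summarize.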
VI. Evaluating inter-individual physiological fluctuations and differences of the healthy human urinary proteome
Dataset A, containing 497 urinary proteomes from 167 healthy persons, and its male and female sub-datasets are used to evaluate the inter-individual physiological fluctuations and differences of the healthy human urinary proteome. For each dataset or sub-dataset, the coefficient of variation of the quantitative data is calculated for every protein that has quantitative data in more than 10% of the urine samples, and the distribution of coefficients of variation of all qualifying proteins in each dataset and sub-dataset is displayed as a box plot.
VII. Establishing the quantitative reference range of the healthy human urinary proteome
The above systematic evaluation of the intra-individual and inter-individual physiological fluctuations and differences of the healthy human urinary proteome shows that the established dataset A, containing 497 urinary proteomes from 167 healthy persons, covers the intra-individual and inter-individual physiological fluctuations and differences of the urinary proteome in the healthy population. For each protein in dataset A, the percentile method is applied to its quantitative data in the 497 urine samples to determine its quantitative values at different percentiles, which serve as the quantitative reference range of that protein in the urinary proteome of the healthy population. For example, the quantitative values of a protein at the 2.5th and 97.5th percentiles cover the range within which the protein fluctuates in 95% of the 497 urine samples. The quantitative reference ranges of all proteins in dataset A can be used to exclude interference from physiological fluctuations or inter-individual differences during the development of urinary protein biomarkers; they can also help to find outlier proteins that fall outside the quantitative reference range when urinary proteome information is used for health management.
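The percentile method described above can be illustrated with a short sketch. The source does not state which percentile interpolation rule was used, so linear interpolation between order statistics is assumed here, and the function names are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0-100), linear interpolation between order statistics.

    Assumption: the interpolation rule is chosen for illustration only.
    """
    ordered = sorted(values)
    if not ordered:
        raise ValueError("at least one value is required")
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def reference_range(values, lower_q=2.5, upper_q=97.5):
    """Quantitative reference range of one protein as (lower, upper) limits."""
    return percentile(values, lower_q), percentile(values, upper_q)
```

With `lower_q=2.5` and `upper_q=97.5` the range covers the central 95% of the healthy samples, matching the example in the text; other percentiles can be passed for a one-sided upper limit.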
VIII. Screening for outlier proteins and establishing the tumor-related outlier urinary protein library
The healthy human urinary proteome dataset A (497 urinary proteomes from 167 healthy persons) is randomly divided into 3 sub-datasets. The first sub-dataset, A1, contains 350 healthy urinary proteomes and is used to establish the quantitative reference range of the healthy human urinary proteome (by the percentile method). The second sub-dataset, A2, contains 100 healthy urinary proteomes and is used to validate the ability of the screened tumor-related outlier urinary proteins to distinguish healthy persons from tumor patients. The third sub-dataset, A3, contains 47 healthy urinary proteomes and is used for the final independent test of the ability of the validated tumor-related outlier urinary protein library to distinguish healthy persons from tumor patients. Once generated, the test sub-dataset A3 takes no further part in the discovery and validation of tumor-related outlier proteins, so that it remains independent when it is used to test the finally established library. The urinary proteome dataset of tumor patients is likewise randomly divided, in proportion to the numbers of the 7 tumor types, into a training sub-dataset B1, a validation sub-dataset B2 and a test sub-dataset B3, which are used together with the corresponding healthy sub-datasets (A1-A3) to establish the tumor-related outlier urinary protein library. Sub-datasets B1, B2 and B3 contain the urinary proteome data of 45, 61 and 48 tumor patients, respectively. Once generated, the test sub-dataset B3 likewise takes no further part in the discovery and validation of tumor-related outlier proteins, so that it remains independent when it is used to test the finally established library.
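The random division into non-overlapping sub-datasets can be sketched as below. The function name and the fixed seed are illustrative assumptions, and the sketch is not stratified: to follow the tumor split described above, it would be applied separately to the samples of each tumor type with sizes proportional to that type.

```python
import random


def random_split(items, sizes, seed=0):
    """Shuffle items and cut them into consecutive, non-overlapping parts.

    sizes must sum to the number of items; a seed is used so that the split
    can be reproduced (the seed value itself is arbitrary).
    """
    items = list(items)
    if sum(sizes) != len(items):
        raise ValueError("sizes must sum to the number of items")
    random.Random(seed).shuffle(items)
    parts, start = [], 0
    for size in sizes:
        parts.append(items[start:start + size])
        start += size
    return parts


# Example: indices of the 497 healthy urinary proteomes split into A1, A2, A3.
a1, a2, a3 = random_split(range(497), [350, 100, 47])
```

Because the parts are cut from a single shuffled list, no proteome can appear in more than one sub-dataset, which is the property that keeps the test sub-dataset independent.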
a) Determining the quantitative reference range of the healthy human urinary proteome from sub-dataset A1 (350 urinary proteomes):
The systematic evaluation of the intra-individual and inter-individual physiological fluctuations and differences of the healthy human urinary proteome (see the methods of Sections V and VI) shows that the established dataset A1 of 350 healthy urinary proteomes covers the intra-individual and inter-individual physiological fluctuations and differences of the urinary proteome in the healthy population. For each protein in this dataset, the percentile method is applied to its quantitative data in the 350 urine samples to determine its quantitative values at different percentiles, which serve as the quantitative reference range of that protein in the urinary proteome of the healthy population.
b) The specific process of screening tumor-related outlier proteins and building the library is as follows (see Figure 6 for the complete workflow):
(1) The quantitative reference range of the healthy human urinary proteome is established with the nonparametric percentile method and the healthy urinary proteome sub-dataset A1. The range is determined as described in a); here, the quantitative value at the 99.5th percentile of each urinary protein's quantitative data across the 350 urinary proteomes of sub-dataset A1 is taken as the upper limit of the quantitative reference range;
(2) Each urinary proteome in the training sub-dataset B1, which contains the urinary proteome data of 45 tumor patients, is screened against the upper limit of the reference range established in (1). If a protein exceeds the upper limit of the reference range in at least two samples, it is included in the candidate tumor-related outlier urinary protein library. Once all training data have been screened, one candidate tumor-related outlier urinary protein library C1 is obtained.
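The screening rule of step (2), a protein exceeding the reference upper limit in at least two training samples, can be sketched as follows. The data layout and function names are assumptions for illustration; in particular, skipping proteins that have no upper limit in the reference range is an assumption, since the source does not say how such proteins are treated.

```python
from collections import Counter


def sample_outliers(proteome, upper_limits):
    """Proteins of one urinary proteome that exceed their reference upper limit.

    proteome:     dict mapping protein name to quantitative value
    upper_limits: dict mapping protein name to the upper reference limit
    Assumption: proteins absent from upper_limits are ignored.
    """
    return {
        protein
        for protein, quantity in proteome.items()
        if protein in upper_limits and quantity > upper_limits[protein]
    }


def candidate_library(training_proteomes, upper_limits, min_samples=2):
    """Proteins that are outliers in at least min_samples training proteomes."""
    counts = Counter()
    for proteome in training_proteomes:
        counts.update(sample_outliers(proteome, upper_limits))
    return {protein for protein, n in counts.items() if n >= min_samples}
```

The same `sample_outliers` helper also yields the sample-specific outlier set of a single proteome, which is what a later comparison against the candidate library would use.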
(3) Each urinary proteome in sub-dataset A2, which contains 100 healthy urinary proteomes, and in the validation sub-dataset B2, which contains the urinary proteomes of 61 tumor patients, is screened against the upper limit of the reference range established in (1), so that each urinary proteome yields one sample-specific outlier urinary protein library C2. All proteins in each sample-specific library C2 are compared with the proteins in the candidate tumor-related outlier urinary protein library C1 generated in (2) to count how many proteins the two libraries share. The more proteins a sample-specific library C2 shares with the candidate library C1, the more closely that sample resembles the samples of tumor patients. A hypergeometric distribution test (hypergeometric test) is used to calculate the p-value of the overlap of identical proteins between the two libraries (see Table 9 for the calculation method; the formula is as follows).
Figure PCTCN2017113550-appb-000001
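The formula itself is only available as an image in the source. As a hedged illustration, the sketch below computes the standard upper-tail hypergeometric probability, that is, the probability of observing at least the given number of shared proteins by chance. The choice of the upper tail, the parameter names and the definition of the background population are assumptions and may differ from the calculation in Table 9.

```python
from math import comb


def hypergeom_overlap_p(population, library_size, sample_size, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(population, library_size, sample_size).

    population:   total number of proteins that could be drawn (assumed background)
    library_size: number of proteins in the candidate library C1
    sample_size:  number of proteins in the sample-specific library C2
    overlap:      number of proteins shared by C1 and C2
    """
    max_overlap = min(library_size, sample_size)
    total = comb(population, sample_size)
    tail = sum(
        comb(library_size, i) * comb(population - library_size, sample_size - i)
        for i in range(overlap, max_overlap + 1)
    )
    return tail / total
```

A larger overlap gives a smaller p-value, which is consistent with the text: samples that share many outlier proteins with C1 look more like tumor samples.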
In this way, the healthy sub-dataset A2 and the tumor validation sub-dataset B2 together yield 161 hypergeometric test p-values. These p-values are used to plot a receiver operating characteristic (ROC) curve, which is used to examine how well the candidate tumor-related outlier urinary protein library C1 generated in (2) distinguishes the urinary proteomes of healthy persons from those of tumor patients in the validation sub-datasets A2 and B2. The vertical axis of the ROC curve is scaled from 0 to 1 and is unitless; it measures the sensitivity with which healthy and tumor urinary proteomes are distinguished, and values closer to 1 indicate higher sensitivity. The horizontal axis is the false positive rate, likewise scaled from 0 to 1 and unitless; the specificity equals 1 minus the false positive rate, and values closer to 1 indicate higher specificity. In the ideal case both sensitivity and specificity are 1 and the area under the ROC curve is 1, so the area under the ROC curve can be used to measure discriminating ability. In addition, the hypergeometric test p-value corresponding to a desired sensitivity or specificity can be taken as the cutoff value (Pc value) for distinguishing healthy persons from tumor patients. In this application, the cutoff value Pc is always determined at a specificity of 95%.
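The evaluation described above can be sketched in a few functions: the area under the ROC curve computed by pairwise comparison, a cutoff Pc chosen from the healthy p-values at a target specificity, and the resulting sensitivity. This is a minimal sketch under stated assumptions: a sample is called tumor only when its p-value is strictly below Pc, ties in the AUC count as one half, and all function names are hypothetical; the source does not describe how the curve or the cutoff was computed in practice.

```python
from math import ceil


def auc_from_pvalues(healthy_p, tumor_p):
    """Area under the ROC curve: probability that a randomly chosen tumor
    sample has a smaller p-value than a randomly chosen healthy sample."""
    wins = 0.0
    for t in tumor_p:
        for h in healthy_p:
            if t < h:
                wins += 1.0
            elif t == h:
                wins += 0.5
    return wins / (len(tumor_p) * len(healthy_p))


def cutoff_at_specificity(healthy_p, specificity=0.95):
    """Largest healthy p-value usable as Pc with at least the given specificity.

    A sample is called tumor when p < Pc, so a healthy sample counts as a
    false positive only when its p-value is strictly below Pc.
    """
    n = len(healthy_p)
    allowed_false_positives = n - ceil(round(n * specificity, 9))
    return sorted(healthy_p)[allowed_false_positives]


def specificity_at(healthy_p, pc):
    """Fraction of healthy samples not called tumor (p >= Pc)."""
    return sum(p >= pc for p in healthy_p) / len(healthy_p)


def sensitivity_at(tumor_p, pc):
    """Fraction of tumor samples called tumor (p < Pc)."""
    return sum(p < pc for p in tumor_p) / len(tumor_p)
```

With 100 healthy validation samples and a 95% specificity target, at most 5 healthy samples may fall below the cutoff, so the sixth-smallest healthy p-value is returned as Pc.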
(4) Step (3) above describes the result for one randomly generated training sub-dataset B1 (45 tumor urinary proteomes) and its corresponding validation sub-dataset B2 (61 tumor urinary proteomes), both drawn from 106 tumor-patient urinary proteomes (the data remaining after 48 urinary proteomes were randomly drawn, in proportion to the numbers of the 7 tumor types, from the 154 tumor urinary proteomes of dataset B to form tumor test sub-dataset B3). To avoid the sampling error of a single random draw, the 106 tumor-patient urinary proteomes were randomly split 20 times, giving 20 pairs of training and validation sub-datasets (20 pairs of B1 and B2). The same analysis as in (3) above was performed on each pair, yielding 20 candidate tumor-associated outlier urinary protein pools C1 and 20 ROC curves. The candidate pool C1 corresponding to the largest area under the ROC curve (0.957) was designated the final tumor-associated outlier urinary protein pool C (509 tumor-associated outlier proteins; see Table 10). At a specificity of 95%, Pc is 1.78 × 10⁻⁸ and the corresponding sensitivity (= 1 - false negative rate) is 85.2% (see panel B of Figure 6). A sample whose hypergeometric test p-value is greater than Pc is considered a healthy-person sample; a sample whose p-value is less than Pc is considered a tumor-patient sample.
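The repeated random splitting in (4) can be sketched as below. This is an illustration under assumptions: the split is a plain shuffle with a fixed seed for reproducibility, and `evaluate` is a hypothetical callback standing in for the analysis of steps (2) and (3) that returns the area under the ROC curve for one training and validation pair.

```python
import random


def repeated_splits(sample_ids, n_train=45, n_repeats=20, seed=0):
    """Return n_repeats random (training, validation) partitions of sample_ids."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits


def select_best_split(splits, evaluate):
    """Return (score, training, validation) for the highest-scoring split.

    evaluate(training, validation) is a hypothetical callback that returns
    the area under the ROC curve obtained with that pair.
    """
    scored = [(evaluate(train, val), train, val) for train, val in splits]
    return max(scored, key=lambda item: item[0])


# 106 tumor-patient proteomes split into 45 training and 61 validation, 20 times.
splits = repeated_splits(range(106))
```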
(5) Finally, the fully independent test sub-datasets A3 (47 healthy-person urinary proteomes) and B3 (48 tumor-patient urinary proteomes), which never took part in training or validation, are used to test how well the final tumor-associated outlier urinary protein pool C obtained in (4) distinguishes healthy persons from tumor patients. Following the method of (3), a hypergeometric test p-value is obtained for each healthy-person and tumor-patient urinary proteome and compared with the cutoff value Pc determined in (4) to assign each proteome to the healthy or the tumor group. The sensitivity and specificity of the pool are then determined from the false positive rate and the false negative rate. For example, 2 of the 47 healthy persons were misclassified into the tumor group (false positive rate 4.26%) and 8 of the 48 tumor patients were misclassified into the healthy group (false negative rate 16.67%). According to the results on the test sub-datasets, the sensitivity (= 1 - false negative rate) of the pool is approximately 85% and its specificity (= 1 - false positive rate) is greater than 95% (see panel C of Figure 6).
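The rates in (5) follow directly from the misclassification counts. The helper below is a small worked calculation with hypothetical names; with the counts quoted above (2 of 47 and 8 of 48) it evaluates to a specificity of about 95.7% and a sensitivity of about 83.3%.

```python
def classification_rates(n_healthy, false_positives, n_tumor, false_negatives):
    """Compute error rates and the derived sensitivity and specificity."""
    false_positive_rate = false_positives / n_healthy
    false_negative_rate = false_negatives / n_tumor
    return {
        "false_positive_rate": false_positive_rate,
        "false_negative_rate": false_negative_rate,
        "specificity": 1.0 - false_positive_rate,
        "sensitivity": 1.0 - false_negative_rate,
    }


# Counts taken from the example in step (5).
rates = classification_rates(n_healthy=47, false_positives=2,
                             n_tumor=48, false_negatives=8)
```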
The present invention is described in further detail below with reference to specific examples. Unless otherwise specified, the methods used in the examples are conventional methods and the terms used have their ordinary meanings. The tumors mentioned in the examples serve only as representatives of disease and are not limiting; the method of the present invention is applicable to all diseases.
Example 1. Establishing a dataset for assessing intra-individual physiological fluctuations and differences in the urinary proteome of healthy persons, and assessing intra-individual physiological fluctuations of the urinary proteome
The process of establishing the dataset includes:
1) Sampling: urine samples were collected continuously, over different time spans, from 17 volunteers who gave informed consent; see Table 1 for sampling times and numbers;
2) Preparing urinary protein samples: each collected urine sample was made into one urinary protein sample (a peptide sample containing 2 fractions) according to the method of item 1 above;
3) Detection: each urinary protein sample was analyzed according to the method of item 2 above to obtain its mass spectrometry data. Taking U001-1 in the first row of Table 1 as an example (a urinary protein sample made from one of the urine samples collected from volunteer U001 within 24 hours), the mass spectra are shown in Figure 5 (the upper and lower spectra correspond to the 2 fractions of the peptide sample);
4) Database search and quantification: according to the method of item 3 above, database searching, peptide quantification, and protein assembly were performed on the mass spectrometry data of each urinary protein sample to determine the protein species in the sample and the quantity of each protein, giving the urinary proteome data. Taking U001-1 as an example (urinary protein samples made from the 4 urine samples collected from volunteer U001 within 24 hours), the urinary proteome data are shown in Table 2, which contains quantitative data for 1615 proteins across the 4 samples collected within 24 hours; for reasons of space, only part of the protein data is shown;
5) According to the method of item 4 above, the urinary proteome data were merged in turn to obtain, for each of the 17 healthy volunteers, an intra-individual urinary proteome dataset covering that volunteer's sampling time span. Taking volunteer U001 as an example, the intra-individual urinary proteome sub-dataset is shown in Table 3; it contains quantitative data for 3264 proteins across 62 samples collected from this volunteer over 314 days; for reasons of space, only part of the protein data is shown;
6) According to the method of item 4 above, different sub-datasets were defined by person and by sampling time span (as shown in Table 3), and the distribution of the coefficients of variation of all urinary protein quantitative data within each sub-dataset was calculated to assess the intra-individual physiological fluctuations or differences of the healthy-person urinary proteome over different sampling time spans;
7) Using random resampling, the sub-datasets of the 2 volunteers with the longest sampling time spans (314 and 264 days; 62 and 51 urinary proteomes, respectively), namely the sub-dataset of volunteer U001 shown in Table 3 and the sub-dataset of volunteer U002 (data omitted here for reasons of space), were analyzed to determine the number of samples needed to cover the intra-individual physiological fluctuations or differences of the healthy-person urinary proteome.
The dataset of this example includes short-term (within 24 hours, or on 3 consecutive days) and long-term (more than 60 days) sampling data from 17 volunteers. The total sampling time span per volunteer was 5 to 314 days, and either daily first-morning urine samples or 24-hour urine samples were collected. This yielded sub-dataset BCM, comprising 319 urinary proteomes in total (see Table 4).
Sub-dataset BCM was divided into per-individual sub-datasets according to the volunteer who provided each urine sample (see Table 3); within these, further sub-datasets were defined according to whether the samples were collected continuously within 24 hours or on 3 consecutive days. These sub-datasets were used to assess the range of intra-individual physiological fluctuations or differences of the urinary proteome of healthy persons over 24 hours, over 3 consecutive days, and over more than 60 days. The results are shown in Figures 1 and 2 (horizontal axis: sub-datasets of different individuals; vertical axis: coefficient of variation). Specifically:
The intra-individual 24-hour physiological fluctuation data shown in Figure 1 come from a total of four 24-hour sub-datasets from 2 volunteers (U001 and U002), each sub-dataset containing 3 to 5 urinary proteomes (for example, Table 2 shows the data for the 4 urine samples collected from volunteer U001 within 24 hours; each urine sample gives 1 proteome, and these are merged into one 24-hour sub-dataset). For each protein with quantitative data in all urine samples of a sub-dataset, the coefficient of variation of its quantitative data (standard deviation / mean) was calculated. The distribution of the coefficients of variation of all qualifying proteins in a sub-dataset is displayed as a box plot and represents the intra-individual 24-hour physiological fluctuation range of the urinary proteome. The median coefficients of variation of the four sub-datasets were between 0.29 and 0.33, and the most variable protein had a coefficient of variation of 2.0 (see U001-2 in the left part of Figure 1).
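The per-protein coefficient of variation described above can be sketched as follows. The sketch assumes that missing quantifications are represented as `None` and that the sample standard deviation (n - 1 denominator) is used; the text states only "standard deviation / mean", so both are assumptions, as are the names.

```python
from statistics import mean, median, stdev


def cv_per_protein(quant):
    """Map protein -> CV, keeping only proteins quantified in every sample.

    quant: dict of protein name -> list of quantitative values, one per
           urine sample, with None where the protein was not quantified.
    """
    cvs = {}
    for protein, values in quant.items():
        if any(v is None for v in values):
            continue  # protein lacks data in at least one sample
        avg = mean(values)
        if avg > 0:
            cvs[protein] = stdev(values) / avg
    return cvs


def median_cv(quant):
    """Median CV of all qualifying proteins in one sub-dataset."""
    return median(cv_per_protein(quant).values())
```

The distribution of the returned values is what a box plot for one sub-dataset would display.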
The intra-individual physiological fluctuation data for 3 consecutive days come from 35 sub-datasets from 16 volunteers (U001-U005 and U007-U017), each containing 3 urinary proteomes (from urine sampled each morning). The distribution of the coefficients of variation for each sub-dataset was obtained by the same method used for the 24-hour assessment and represents the intra-individual physiological fluctuation range of the urinary proteome over 3 consecutive days (see the right part of Figure 1). The median coefficients of variation over 3 consecutive days were 0.23 to 0.5, slightly higher than the quantitative fluctuation of the urinary proteome within 24 hours.
The intra-individual physiological fluctuation data for more than 60 days come from 17 sub-datasets from 17 volunteers, each containing 6 to 62 urinary proteomes with sampling time spans of 61 to 314 days. For sub-datasets with fewer than 30 urinary proteomes, the coefficient of variation of a protein was calculated only if the protein had quantitative information in at least 3 urine samples (a protein that cannot be detected in at least 3 urine samples is not considered a common protein of the healthy-person urinary proteome, and its physiological fluctuation is not assessed). For sub-datasets with 30 or more urinary proteomes, the coefficient of variation was calculated only if the protein had quantitative information in at least 10% of the urine samples (proteins below this threshold are likewise not considered common and are not assessed). The physiological fluctuation range of the urinary proteome in each sub-dataset is represented by the distribution of the coefficients of variation of all qualifying proteins (see Figure 2). The median coefficients of variation of long-term intra-individual fluctuations were 0.45 to 0.87 (see Figure 2), markedly higher than the intra-individual fluctuations within 24 hours and over 3 consecutive days.
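The eligibility rule for the long-term sub-datasets can be written as a small helper. This sketch assumes that "at least 10%" is rounded up to a whole number of samples and that missing quantifications are represented as `None`; the function names are hypothetical.

```python
def min_samples_required(n_samples):
    """Minimum number of samples in which a protein must be quantified."""
    if n_samples < 30:
        return 3
    # Ceiling of 10% of n_samples, computed in integer arithmetic.
    return (n_samples + 9) // 10


def is_eligible(values):
    """values: one entry per urine sample, None where not quantified."""
    detected = sum(v is not None for v in values)
    return detected >= min_samples_required(len(values))
```

For the two largest sub-datasets (62 and 51 urinary proteomes), this rule requires quantification in at least 7 and 6 samples, respectively.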
The data in Figure 2 also show that intra-individual physiological fluctuation of the urinary proteome has no linear relationship with the sampling time span. This indicates that the fluctuation does not grow without limit over time but stays within a limited, stable range. It is therefore feasible to establish a personal quantitative reference range for the urinary proteome from the range of a person's intra-individual physiological fluctuations.
Furthermore, this example used the two largest personal urinary proteome sub-datasets (62 and 51 urinary proteomes, respectively) to analyze the minimum number of different samples needed to cover a stable range of intra-individual physiological fluctuations. In each sub-dataset, only proteins with quantitative information in at least 10% of the urine samples were included in the analysis. Using random resampling, 3 to 25 urinary proteomes were randomly drawn from each sub-dataset to form sub-datasets with sample sizes of 3 to 25. To limit the effect of sampling error, this process was repeated 100 times, so that each sample size had 100 sub-datasets produced by repeated random draws. The quantitative mean of each protein was calculated in each of these sub-datasets (giving 100 means per protein); from these 100 means, the mean and the standard deviation of the quantitative mean were calculated, and from them the coefficient of variation of the quantitative mean. Finally, a box plot shows the distribution of the coefficients of variation of the quantitative means of all proteins at each sample size (see Figure 3). Figure 3 is based on mutually independent datasets from two independent individuals (A from U001, B from U002). The results clearly show that after about 15 urinary proteomes from one person have been measured, the quantitative means of the proteins in the urinary proteome begin to stabilize, indicating that the individual's range of physiological fluctuations has essentially been covered.
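The resampling procedure for a single protein can be sketched as below. It assumes sampling without replacement, a fixed random seed for reproducibility, and that `values` holds only the samples in which the protein was quantified; none of these details is stated in the text, and the names are hypothetical.

```python
import random
from statistics import mean, stdev


def cv_of_resampled_means(values, sample_size, n_repeats=100, seed=0):
    """CV of the protein's mean across n_repeats random subsets of a given size."""
    rng = random.Random(seed)
    means = [mean(rng.sample(values, sample_size)) for _ in range(n_repeats)]
    return stdev(means) / mean(means)


def stability_curve(values, sizes=range(3, 26)):
    """CV of the resampled mean for each sample size from 3 to 25."""
    return {n: cv_of_resampled_means(values, n) for n in sizes}
```

Repeating `stability_curve` for every eligible protein and plotting the values per sample size as box plots would give a figure of the kind described for Figure 3.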
Statistics such as the protein species included in each sub-dataset used to assess intra-individual physiological fluctuations in healthy persons are given in Table 1.
Table 1. Statistics of the sub-datasets used to assess intra-individual physiological fluctuations in healthy persons
Figure PCTCN2017113550-appb-000002
Figure PCTCN2017113550-appb-000003
Figure PCTCN2017113550-appb-000004
Table 2. Urinary proteome data of urinary protein sample U001-1
Figure PCTCN2017113550-appb-000005
Table 3. Urinary proteome sub-dataset of U001 (3264 proteins in 62 samples collected over 314 days)
Figure PCTCN2017113550-appb-000006
Figure PCTCN2017113550-appb-000007
Table 4. Sub-dataset BCM of 319 urinary proteomes from 17 volunteers
Figure PCTCN2017113550-appb-000008
Example 2. Establishing a dataset for assessing inter-individual physiological fluctuations and differences in the urinary proteome of healthy persons, and assessing inter-individual physiological fluctuations of the urinary proteome
Data collection for the healthy-person urinary proteome was the same as in Example 1.
This example collected sub-dataset BPRC, consisting of 178 urinary proteomes from 150 volunteers (see Table 5).
Table 5. Sub-dataset BPRC of 178 urinary proteomes from 150 healthy volunteers
Figure PCTCN2017113550-appb-000009
Sub-dataset BPRC and sub-dataset BCM were merged to obtain dataset A, comprising 497 urinary proteomes from 167 healthy volunteers (Tables 4 and 5 combined; omitted here). Dataset A can also be divided by volunteer sex into male and female urinary proteome sub-datasets, or into other sub-datasets. Sub-dataset BCM (319 urinary proteomes from 17 healthy volunteers) can be used to assess inter-individual physiological fluctuations and differences of the urinary proteome when a few people are sampled many times; sub-dataset BPRC (178 urinary proteomes from 150 healthy volunteers) can be used to assess them when many people are sampled once or a few times; the male (Male; 343 urinary proteomes from 98 healthy volunteers) and female (Female; 154 urinary proteomes from 69 healthy volunteers) sub-datasets can be used to assess them by sex. Only proteins with quantitative information in at least 10% of the urine samples of a sub-dataset were included in the assessment for that sub-dataset. As before, the coefficient of variation of each qualifying protein in the corresponding sub-dataset was calculated, and the distribution of these coefficients of variation is shown as a box plot for each sub-dataset to assess the corresponding inter-individual physiological fluctuations and differences (see Figure 4).
Both dataset A (497 healthy-person urinary proteomes) and sub-dataset A1 (350 healthy-person urinary proteomes) can be used to establish a quantitative reference range for the healthy-person urinary proteome. Figure 4 shows that the inter-individual physiological fluctuation ranges of the urinary proteome in the 6 sub-datasets are very similar, with median coefficients of variation between 1.01 and 1.19, which also indicates that dataset A or sub-dataset A1 essentially covers the inter-individual physiological fluctuations and differences of the healthy-person urinary proteome. However, the inter-individual fluctuation range is markedly higher than the intra-individual range (Figures 4, 2, and 1).
Statistics such as the protein species included in each sub-dataset used to assess inter-individual physiological fluctuations and differences in healthy persons are given in Table 6.
Table 6. Statistics of the sub-datasets used to assess inter-individual physiological fluctuations and differences in healthy persons
Figure PCTCN2017113550-appb-000010
Example 3. Establishing a quantitative reference range for the healthy-person urinary proteome and a healthy-person urinary proteome database
Examples 1 and 2 above systematically assessed the intra-individual and inter-individual physiological fluctuations and differences of the healthy-person urinary proteome and showed that the collected data cover them.
The total healthy-person urinary proteome dataset A (Tables 4 and 5 combined; 497 urinary proteomes from 167 healthy persons) was randomly divided into three sub-datasets: the first, A1, contains 350 healthy-person urinary proteomes; the second, A2, contains 100; and the third, A3, contains 47. In this example, quantitative reference ranges for the healthy-person urinary proteome were established from total dataset A and from sub-dataset A1, respectively.
Methods for establishing a quantitative reference range are either parametric or non-parametric. The parametric method requires normally distributed data, so that the upper and lower limits of the reference range covering a target percentage of the population can be computed from the statistical parameters of the data (mean and standard deviation); for example, the mean plus or minus 2 standard deviations covers about 95% of individuals. The parametric method cannot be used when it is unclear whether the data are normally distributed.
The non-parametric method makes no assumption about the statistical distribution of the data: limits obtained by the percentile method cover the target percentage of individuals directly; for example, the 2.5th and 97.5th percentiles cover 95% of individuals. Because the quantitative data of some proteins in the dataset are normally distributed and others are not, this example uses the non-parametric method, for computational convenience, to establish the quantitative reference range of the healthy-person urinary proteome. Example results are shown in Tables 7-1 and 7-2.
According to Table 7-1 (reference range established from dataset A), taking the healthy-person urinary protein DYNC1H1 as an example, its values at the 2.5th and 97.5th percentiles (0.024 to 11.344) cover the quantitative fluctuation range of 95% of the 497 urine samples, and its values at the 5th and 95th percentiles (0.918 to 8.964) cover that of 90% of the 497 urine samples.
According to Table 7-2 (reference range established from sub-dataset A1, with the 99.5th percentile taken as the upper limit of the quantitative reference range), taking DYNC1H1 again as an example, its values at the 2.5th and 97.5th percentiles (0.044 to 10.962) cover the quantitative fluctuation range of 95% of the 350 urine samples, and its value at the 99.5th percentile (19.279) is the upper limit of the quantitative reference range.
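A non-parametric reference range of the kind described above can be sketched with a simple percentile function. The text does not say which percentile definition was used; the sketch assumes linear interpolation between order statistics, and the function names are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0 to 100) by linear interpolation between ranks."""
    ranked = sorted(values)
    position = (len(ranked) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ranked) - 1)
    fraction = position - lower
    return ranked[lower] + (ranked[upper] - ranked[lower]) * fraction


def reference_range(values):
    """Limits for one protein across all healthy-person urinary proteomes."""
    return {
        "p2_5": percentile(values, 2.5),    # lower limit of the 95% range
        "p97_5": percentile(values, 97.5),  # upper limit of the 95% range
        "p99_5": percentile(values, 99.5),  # upper limit used for outlier screening
    }
```

Applied to every protein in sub-dataset A1, `reference_range` would give one row per protein, comparable in layout to Table 7-2; other percentile conventions would give slightly different limits.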
Where a study does not need to be divided into discovery and validation stages, the range established from dataset A can be used; where a study must use discovery and validation stages, the range established from A1 can be used.
A healthy-person urinary proteome database is established on the basis of the above quantitative reference ranges. The database includes the sub-datasets identified above (Tables 1 to 5), the total dataset A (Table 6), and the healthy-person urinary protein species and calculated quantitative reference ranges determined from total dataset A or sub-dataset A1 (Table 7-1 or Table 7-2).
Example 4. Establishing tumor-patient urinary proteome dataset B and tumor-associated outlier urinary protein pool C
The process of establishing the tumor-patient urinary proteome dataset was the same as in Example 1.
This example collected 154 urinary proteomes from 154 patients with 7 types of solid tumor to establish tumor-patient urinary proteome dataset B (see Table 8-2): 17 cases of bladder cancer, 4 of breast cancer, 25 of cervical cancer, 22 of colorectal cancer, 14 of esophageal cancer, 47 of gastric cancer, and 25 of lung cancer. The tumor-associated outlier urinary protein pool C was established from the total healthy-person urinary proteome dataset A of Example 2 (Tables 4 and 5 combined; 497 urinary proteomes from 167 persons) and the tumor-patient urinary proteome dataset B of this example, as follows:
Healthy-person urinary proteome dataset A (497 urinary proteomes from 167 healthy persons) was randomly divided into three sub-datasets. The first, A1, contains 350 healthy-person urinary proteomes and is used to establish the quantitative reference range of the healthy-person urinary proteome (by the percentile method). The second, A2, contains 100 healthy-person urinary proteomes and is used to validate how well the screened tumor-associated outlier urinary proteins distinguish healthy persons from tumor patients. The third, A3, contains 47 healthy-person urinary proteomes and is used for the final independent test of the validated tumor-associated outlier urinary protein pool. The tumor-patient urinary proteome dataset was likewise randomly divided, in proportion to the numbers of the 7 tumor types, into training sub-dataset B1, validation sub-dataset B2, and test sub-dataset B3 (see Table 8-1), which are used together with the corresponding healthy-person sub-datasets (A1 to A3) to establish the tumor-associated outlier urinary protein pool. Sub-datasets B1, B2, and B3 contain 45, 61, and 48 tumor-patient urinary proteomes, respectively. Once generated, test sub-dataset B3 takes no further part in the discovery and validation of tumor-associated outlier proteins, which preserves its independence when it is used to test how well the final pool distinguishes healthy persons from tumor patients.
Table 8-1. Distribution of tumor-patient urinary proteome dataset B
Figure PCTCN2017113550-appb-000011
The 154 tumor urinary proteomes are shown in Table 8-2.
Table 8-2. Tumor-patient urinary proteome dataset B
Figure PCTCN2017113550-appb-000012
The specific process of screening tumor-associated outlier proteins and building the pool is as follows:
(1) Using the method of Example 3, a quantitative reference range for the healthy-person urinary proteome is established from the first healthy-person urinary proteome sub-dataset A1. Here, the 99.5th percentile of each urinary protein's quantitative data across the 350 urinary proteomes of A1 is taken as the upper limit of the quantitative reference range;
(2) Each urinary proteome in training sub-dataset B1 (45 tumor-patient urinary proteomes) is screened against the upper limit of the reference range established in (1). A protein that exceeds the upper limit in at least two samples is added to the candidate tumor-associated outlier urinary protein pool. Once all training data have been screened, one candidate tumor-associated outlier urinary protein pool C1 is obtained.
(3) Each urinary proteome in sub-dataset A2 (100 urinary proteomes from healthy individuals) and in the validation sub-dataset B2 (61 urinary proteomes from tumor patients) was screened against the upper limits established in (1), so that every urinary proteome yields its own sample-specific outlier urinary protein library, C2. The proteins in each sample-specific library C2 were compared with those in the candidate tumor-associated library C1 generated in (2) to count the proteins shared by the two libraries. The more proteins C2 shares with C1, the more closely the sample resembles a tumor patient sample. The p-value of the overlap between the two libraries was calculated with the hypergeometric test (see Table 9 for the calculation).
Figure PCTCN2017113550-appb-000013
In this way, the healthy sub-dataset A2 and the tumor validation sub-dataset B2 together yield 161 hypergeometric test p-values. These p-values were used to plot a receiver operating characteristic (ROC) curve, which shows how well the candidate tumor-associated outlier urinary protein library C1 generated in (2) separates the healthy (A2) and tumor (B2) urinary proteomes of the validation data. The vertical axis of the ROC curve runs from 0 to 1, is unitless, and measures the sensitivity of the separation; the closer to 1, the higher the sensitivity. The horizontal axis is the false positive rate, also unitless and running from 0 to 1. Specificity equals 1 minus the false positive rate; the closer it is to 1, the higher the specificity. In the ideal case sensitivity and specificity are both 1 and the area under the ROC curve is 1, so the area under the ROC curve can be used to measure discriminating power. In addition, a hypergeometric test p-value corresponding to a desired sensitivity or specificity can be chosen as the cutoff value (Pc) for separating healthy individuals from tumor patients. Throughout this application, the cutoff Pc is the one that gives a specificity of 95%.
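The ROC analysis and the choice of Pc at 95% specificity can be sketched as below. This is an illustration under stated assumptions rather than the applicants' implementation: the area under the curve is computed with the rank-based (Mann-Whitney) formulation, Pc is taken from the observed healthy p-values, and the function names and sample p-values are hypothetical.

```python
import math


def roc_auc(healthy_p, tumor_p):
    """Area under the ROC curve: the chance that a tumor sample has a
    smaller p-value than a healthy sample (ties count one half)."""
    wins = 0.0
    for t in tumor_p:
        for h in healthy_p:
            if t < h:
                wins += 1.0
            elif t == h:
                wins += 0.5
    return wins / (len(tumor_p) * len(healthy_p))


def cutoff_at_specificity(healthy_p, specificity=0.95):
    """Pick Pc so that at most (1 - specificity) of healthy samples fall below it."""
    ordered = sorted(healthy_p)
    # Small epsilon guards against floating-point error in the product.
    allowed_false_pos = math.floor(len(ordered) * (1 - specificity) + 1e-9)
    return ordered[allowed_false_pos]


def sensitivity_at(tumor_p, pc):
    """Fraction of tumor samples called tumor, i.e. with p-value below Pc."""
    return sum(p < pc for p in tumor_p) / len(tumor_p)


# Made-up p-values, for illustration only.
healthy_p = [i / 20 for i in range(1, 21)]
tumor_p = [0.01, 0.02, 0.07, 0.5]
pc = cutoff_at_specificity(healthy_p)
print(roc_auc(healthy_p, tumor_p), pc, sensitivity_at(tumor_p, pc))
```

Choosing Pc from the sorted healthy p-values guarantees that the specificity on those samples is at least the requested value, at the cost of tying the cutoff to the particular validation set.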
Table 9. Contingency table for the hypergeometric test
              In C2               Not in C2                       Total
In C1         q (C1∩C2)           m-q (C1 - C1∩C2)                m (C1)
Not in C1     k-q (C2 - C1∩C2)    n-k+q (T - C1 - C2 + C1∩C2)     n (T - C1)
Total         k (C2)              15447-k (T - C2)                15447 (T)
Note: C1 is the tumor-associated outlier protein library; it contains m proteins.
C2 is the sample-specific outlier protein library; it contains k proteins.
T is the set of proteins detected in the urinary proteomes of all healthy individuals and tumor patients; it contains 15447 proteins.
C1∩C2 is the intersection of C1 and C2; it contains q proteins.
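Steps (1) to (3), together with the contingency table above, can be sketched as follows. This is a minimal illustration under several assumptions that the source does not spell out: each proteome is a dictionary mapping protein name to quantity, undetected proteins are simply absent, a protein without a healthy reference limit is never counted as an outlier, and the p-value is the upper-tail hypergeometric probability of seeing at least q shared proteins. All function names are hypothetical.

```python
from collections import Counter
from math import comb
from statistics import quantiles


def upper_limits(healthy_samples):
    """Step (1): 99.5th percentile of each protein across healthy proteomes."""
    values = {}
    for sample in healthy_samples:
        for protein, quantity in sample.items():
            values.setdefault(protein, []).append(quantity)
    # quantiles(n=200) returns 199 cut points; index 198 is the 99.5th percentile.
    return {p: quantiles(v, n=200, method="inclusive")[198]
            for p, v in values.items() if len(v) >= 2}


def outliers(sample, limits):
    """Proteins in one proteome that exceed their upper reference limit."""
    return {p for p, quantity in sample.items()
            if p in limits and quantity > limits[p]}


def candidate_library(tumor_samples, limits, min_samples=2):
    """Step (2): proteins that are outliers in at least min_samples samples (C1)."""
    counts = Counter(p for s in tumor_samples for p in outliers(s, limits))
    return {p for p, c in counts.items() if c >= min_samples}


def overlap_p_value(c1, c2, total=15447):
    """Step (3): P(X >= q) for X ~ Hypergeometric(total, m=|C1|, k=|C2|)."""
    m, k, q = len(c1), len(c2), len(c1 & c2)
    favourable = sum(comb(m, i) * comb(total - m, k - i)
                     for i in range(q, min(m, k) + 1))
    return favourable / comb(total, k)


# Toy data, for illustration only.
healthy = [{"A": 1.0, "B": 2.0}, {"A": 3.0, "B": 2.0}]
tumors = [{"A": 5.0, "B": 1.0}, {"A": 4.0, "B": 3.0}, {"B": 2.5}]
limits = upper_limits(healthy)
c1 = candidate_library(tumors, limits)
print(sorted(c1), overlap_p_value(c1, outliers(tumors[1], limits), total=10))
```

Exact integer binomial coefficients are used so that very small p-values, such as those on the order of 10^-8 reported in this example, do not lose precision before the final division.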
(4) Step (3) above gives the result for a single randomly generated pair of sub-datasets: one training sub-dataset B1 (45 tumor urinary proteomes) and the corresponding validation sub-dataset B2 (61 tumor urinary proteomes), both drawn from the urinary proteome data of 106 tumor patients. These 106 are what remains of the 154 tumor urinary proteomes in dataset B after 48 were randomly drawn, in proportion to the 7 tumor types, to form the tumor test sub-dataset B3. To avoid the sampling error of a single random draw, the data of the 106 tumor patients were randomly sampled 100 times, giving 100 pairs of training and validation sub-datasets (100 pairs of B1 and B2). Each pair was analyzed as in (3), yielding 100 candidate tumor-associated outlier urinary protein libraries C1 and 100 ROC curves. The candidate library C1 corresponding to the largest area under the ROC curve (0.957) was chosen as the final tumor-associated outlier urinary protein library C (containing 509 tumor-associated outlier proteins, see Table 10). At a specificity of 95% the Pc value is 1.78 × 10^-8, and the corresponding sensitivity (= 1 - false negative rate) is 85.2% (see panel B of Figure 6). When the hypergeometric test p-value of an analyzed sample is greater than Pc, the sample is considered to come from a healthy individual; when it is smaller than Pc, the sample is considered to come from a tumor patient.
(5) Finally, the fully independent test sub-datasets A3 and B3 (urinary proteome data of 47 healthy individuals and 48 tumor patients that never took part in training or validation) were used to test how well the final tumor-associated outlier urinary protein library C obtained in (4) distinguishes healthy individuals from tumor patients. Following the method of (3), a hypergeometric test p-value was obtained for the urinary proteome of each healthy individual and tumor patient and compared with the cutoff Pc determined in (4) to assign each urinary proteome to the healthy or tumor group. The sensitivity and specificity of the library were then determined from the false positive and false negative rates. For example, 2 of the 47 healthy individuals were misclassified into the tumor group (false positive rate 4.26%), and 8 of the 48 tumor patients were misclassified into the healthy group (false negative rate 16.67%). According to these test results, the sensitivity (= 1 - false negative rate) of the tumor-associated outlier urinary protein library in distinguishing healthy individuals from tumor patients is approximately 85%, and the specificity (= 1 - false positive rate) is greater than 95% (see panel C of Figure 6).
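The classification rule and the arithmetic of the independent test can be reproduced with a short sketch. The p-values below are made up so that the misclassification counts match the example (2 of 47 healthy, 8 of 48 tumor); treating a p-value exactly equal to Pc as healthy is an assumption, since the source only covers the strictly greater and strictly smaller cases. With these counts the sensitivity evaluates to 1 - 8/48, about 83.3%, which the text rounds to approximately 85%.

```python
PC = 1.78e-8  # cutoff reported in step (4)


def classify(p_value, pc=PC):
    """p < Pc -> tumor; otherwise healthy (the tie case is an assumption)."""
    return "tumor" if p_value < pc else "healthy"


def sensitivity_specificity(healthy_p, tumor_p, pc=PC):
    """Return (sensitivity, specificity) from the two groups of p-values."""
    false_pos = sum(classify(p, pc) == "tumor" for p in healthy_p)
    false_neg = sum(classify(p, pc) == "healthy" for p in tumor_p)
    sensitivity = 1 - false_neg / len(tumor_p)
    specificity = 1 - false_pos / len(healthy_p)
    return sensitivity, specificity


# Hypothetical p-values arranged to give 2 false positives and 8 false negatives.
healthy_p = [1e-9] * 2 + [0.5] * 45
tumor_p = [1e-12] * 40 + [0.1] * 8
sens, spec = sensitivity_specificity(healthy_p, tumor_p)
print(f"sensitivity={sens:.4f} specificity={spec:.4f}")
# prints: sensitivity=0.8333 specificity=0.9574
```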
Table 10. Tumor-associated outlier urinary protein library C
Figure PCTCN2017113550-appb-000014
Note: The number in parentheses after each cancer in the first row is the number of urine samples for that tumor type.
The numbers in the table give how many times the corresponding protein was an outlier in samples of the corresponding tumor type.
The 509 tumor-associated outlier proteins identified are: A1BG, A2M, ABCB7, ABCD4, ABCE1, ABHD11, ABHD12, ABHD14B, ACADM, ACADSB, ACE2, ACO2, ACOT9, ACSL3, ACSM2A, ACSM2B, ACTR1B, ADD1, AGT, AHNAK2, AHSG, ALDH1L2, ALDH3A1, ALDH3A2, ALDH3B1, ALDH4A1, ALDOC, AMY2A, AMY2B, ANGPTL6, ANK1, ANPEP, ANXA1, ANXA10, ANXA2, ANXA3, ANXA4, ANXA5, ANXA6, APMAP, APOB, APP, AQP7, ARFIP1, ARG1, ARHGAP1, ARL13B, ARL6IP5, ARL8A, ARMC9, ARRDC1, ASNA1, ASPH, ATP13A3, ATP2A2, ATP6AP1, ATP6V0A1, ATP6V0C, ATP6V1B1, ATP6V1B2, AZU1, B3GNT3, BBOX1, BLMH, BPI, BPIFB1, C14orf166, C19orf59, C1orf123, C1QB, C1QC, C3, C4A, C4BPA, C5, C6orf211, C8A, C8B, C9, CAMK2G, CANT1, CCDC22, CCDC64B, CCT4, CCT6A, CDC42BPA, CDH11, CEACAM1, CEACAM6, CEACAM8, CECR5, CERS3, CFB, CHCHD3, CHI3L1, CHP1, CLCA4, CLCN7, CLPTM1, CLRN3, CMBL, CNDP2, CNN3, COL12A1, COL4A2, COMP, COPA, COPS7A, CORO1C, CPT1A, CPVL, CRAT, CRHBP, CRP, CRYAB, CRYL1, CTBS, CTNNA1, CTSG, CTSH, CWH43, CYP1A1, DDAH2, DDOST, DDX6, DERL1, DHCR24, DHX15, DIRC2, DKC1, DMTN, DNAJA1, DNAJB1, DNAJC7, DNASE1, DNM2, DOCK2, DOCK5, DOCK9, DPM1, DPP4, DPT, DSCR3, ECHS1, EEF1A1, EIF4A1, ELMO1, ELMO3, EMC1, EMD, EMILIN1, ENO1, ENOPH1, ENPEP, ENPP4, ENPP6, ENPP7, EPB41, EPB42, EPCAM, EPDR1, EPHA2, EPPK1, EPX, ERAP1, ERLIN1, ERLIN2, ERMP1, ESRP1, ETF1, FAM151A, FARP1, FCAR, FCN2, FDFT1, FDXR, FGA, FGB, FGG, FGL1, FGL2, FGR, FLOT2, FN3KRP, FNBP1L, FOLH1, FRK, GALK2, GALM, GALNT1, GDF15, GIPC2, GLDC, GLUD1, GLYAT, GNS, GOLM1, GOLT1B, GPD2, GPR110, GPR126, GPR137B, GPR56, GPX3, GSTK1, GSTT1, HBB, HDLBP, HECTD3, HEXA, HEXB, HGD, HLA-A, HLA-B, HLA-DQB1, HLA-DRA, HLA-DRB1, HLA-DRB5, HMOX2, HNRNPA0, HNRNPA1, HNRNPD, HNRNPF, HNRNPH3, HNRNPK, HNRNPM, HNRNPU, HSP90B1, HSPA2, HSPA6, HSPA8, HSPA9, HYOU1, IDH2, IDH3A, IGFBP7, IGLL1, ILF3, IMPDH2, IQCC, ITIH1, ITIH2, KCNJ15, KIAA0319L, KIF13B, KRT18, KRT20, KRT7, KRT8, LAMB1, LAMP1, LAMTOR1, LBR, LLGL2, LMF2, LMNB2, LONP1, LPCAT3, LRG1, LRRK2, LZTFL1, MAL2, MARVELD3, MCCC1, MCRS1, MFSD1, MGAM, MITD1, MLYCD, MMP7, MNDA, MPO, MPP1, MRPL12, MRPL39, MRPS18B, MRPS22, MSMO1, MTAP, MTCH1, MTHFD1, MTOR, MUC20, MUT, MVP, MYO18A, MYO1B, MYO1D, NAMPT, NCF1, ND4, NDFIP1, NDUFA10, NDUFA9, NDUFS3, NDUFS8, NNT, NPTN, NSDHL, NT5C3A, NUDT9, NUMA1, OAT, OGFOD3, OGN, OLA1, ORM1, P4HB, PAPSS2, PCCA, PCCB, PCK1, PCK2, PDHA1, PDIA3, PDP1, PEF1, PFKFB2, PFKL, PGD, PHB2, PHPT1, PI4K2A, PICALM, PIP4K2A, PIP4K2C, PITRM1, PLA2G15, PLAU, PLCG2, PLEKHF2, PNKD, PON1, PPP2R2A, PRDX4, PRKAB1, PRKACA, PSMA5, PSMA7, PSMC3, PSMC4, PSMD3, PSMD7, PSMF1, PTPLAD1, PTPRC, RAB11B, RAB12, RAB24, RAB32, RAB7A, RAB9A, RDH13, RELA, RILPL2, RNASE3, RNASET2, ROGDI, RPN1, RPN2, RPS3, RPS6KA1, RPS9, RPTOR, RRAGC, RRAS2, S100A7, SCAMP3, SCARB1, SCARB2, SEC22B, SEPT9, SERPINA1, SERPINA3, SERPINA7, SERPIND1, SF3B1, SF3B3, SGPL1, SIAE, SIRT5, SLC12A6, SLC12A7, SLC12A9, SLC13A2, SLC15A1, SLC15A4, SLC17A1, SLC17A3, SLC17A5, SLC1A5, SLC22A11, SLC25A10, SLC25A13, SLC25A24, SLC25A4, SLC25A5, SLC26A11, SLC26A4, SLC2A1, SLC30A2, SLC34A2, SLC35F2, SLC35F6, SLC38A7, SLC3A1, SLC46A1, SLC4A1, SLC6A19, SLC6A8, SLC7A9, SLC9A3, SLFN5, SMPDL3A, SMPDL3B, SNRNP200, SNX27, SORT1, SPAG9, SPARCL1, SPNS1, SPRYD4, SPTA1, SPTB, SPTLC1, SSR1, ST13, STAP1, STARD3NL, STC1, STIM1, STK11IP, STOM, STRN, STT3A, STUB1, STX3, STX7, STX8, STXBP1, SUCLA2, SUCLG2, SULT1A1, SULT1C2, SUN2, SVIL, TACSTD2, TAOK1, TARDBP, TBC1D9B, TBCB, TBL2, TCIRG1, TFRC, TGM2, TGM3, TIMM44, TIMM50, TM9SF2, TM9SF3, TM9SF4, TMBIM1, TMED2, TMED4, TMEM104, TMEM106A, TMEM160, TMEM176A, TMEM176B, TMEM192, TMEM205, TMEM27, TMEM40, TMEM55B, TMEM9, TMLHE, TOMM40L, TOP2B, TPCN1, TPCN2, TPM3, TRIM14, TRIP10, TST, TTC38, TUFM, TXNDC5, TYMP, UGT1A6, UGT1A9, UGT2B7, UPK3B, VAC14, VAMP7, VAPA, VAPB, VASN, VIM, VKORC1L1, VNN2, VPS37C, VPS4A, VSNL1, VTI1B, VWA5A, WASH1, WASL, ZADH2, ZNRF2.
The 509 outlier proteins in the cancer outlier protein library (C) determined in this example are tumor-specific proteins. They can serve as tumor markers in the research and development of services, kits or other products for early cancer screening or monitoring based on urinary protein detection.
With the method of this example, the type of disease targeted by the urine samples can be changed, so the method can be used to develop services and products that classify different diseases and conditions (such as protein markers for specific diseases). These are not enumerated here, but similar modifications made by those skilled in the art with reference to this example also fall within the present disclosure.
Industrial applicability
The present invention provides a method for establishing the quantitative reference range of the healthy human urinary proteome and a healthy human urinary proteome database, and further provides a method for obtaining disease-related urinary protein markers together with the resulting tumor-associated outlier urinary protein library. It better excludes interference from proteins that vary through physiological fluctuation or inter-individual differences during urinary protein biomarker discovery, provides a basis for clinical testing and scientific experiments, and is suitable for industrial application.
Figure PCTCN2017113550-appb-000015

Claims (22)

  1. A method for establishing a quantitative reference range for the healthy human urinary proteome, comprising the following steps:
    1) sampling: collecting urine samples from a statistically sufficient number of healthy individuals;
    2) preparing urinary protein samples: preparing one urinary protein sample from each collected urine sample;
    3) detection: analyzing each urinary protein sample by mass spectrometry to obtain mass spectrometry data for each urinary protein sample;
    4) database search and quantification: performing database searching, peptide quantification and protein assembly on the mass spectrometry data of each urinary protein sample, and determining the protein species in each urinary protein sample and the quantity of each protein to form one urinary proteome data set;
    5) defining different sub-datasets according to individuals and sampling time spans, including: pooling the urinary proteome data of all urinary protein samples of a single individual over different sampling time spans to obtain that individual's intra-individual urinary proteome sub-dataset (BCM); and pooling the urinary proteome data of all urinary protein samples from many individuals sampled a few times or once to obtain the inter-individual urinary proteome sub-dataset (BPRC);
    6) calculating the distribution of the coefficients of variation of all urinary protein quantitative data within each sub-dataset to assess intra-individual physiological fluctuation;
    7) analyzing, by random resampling, the sub-datasets of the 2 individuals with the longest sampling time spans to determine the number of samples required to cover the intra-individual physiological fluctuation or differences of the healthy human urinary proteome;
    8) merging the sub-datasets (BCM and BPRC) of all individuals to obtain the total dataset A of healthy human urinary proteome data, wherein only proteins with quantitative information in at least 10% of the urine samples of a sub-dataset or of the total dataset are included in assessing the inter-individual physiological fluctuation and differences of the urinary proteome for that sub-dataset or total dataset;
    9) calculating the quantitative reference range of the healthy human urinary proteome from the data of total dataset A.
  2. The method according to claim 1, wherein, when the data in step 9) follow a normal distribution, the quantitative reference range is established parametrically: the upper and lower limits of the reference range covering the target percentage of the population are calculated by formula from the statistical parameters of the data (mean and standard deviation), for example the mean plus or minus 2 standard deviations covers 95% of individuals.
  3. The method according to claim 1, wherein, when it is uncertain whether the data in step 9) follow a normal distribution, the quantitative reference range is established non-parametrically: the upper and lower limits of the reference range obtained by the percentile method directly cover the target percentage of individuals, for example the 2.5th and 97.5th percentiles cover 95% of individuals.
  4. The method according to claim 1, 2 or 3, wherein different sub-datasets are defined according to individuals and sampling time spans: sub-datasets formed from urine samples of few individuals sampled many times are used to assess the intra-individual physiological fluctuation and differences of the urinary proteome; sub-datasets formed from urine samples of many individuals sampled a few times or once are used to assess the inter-individual physiological fluctuation and differences of the urinary proteome; and male and female urinary proteome sub-datasets can be used to assess the inter-individual physiological fluctuation and differences of the urinary proteome between sexes.
  5. The method according to claim 4, wherein the assessment comprises calculating the coefficient of variation of each qualifying protein in the corresponding sub-dataset or total dataset, and then displaying the distribution of the coefficients of variation of the qualifying proteins in each sub-dataset or the total dataset as a box plot, in order to assess the corresponding physiological fluctuation and differences of the urinary proteome.
  6. The method according to any one of claims 1 to 5, wherein step 2) obtains the urinary protein sample by a method based on ultracentrifugation and reduction: the pellet obtained by centrifuging the urine sample is resuspended in resuspension buffer (50 mM Tris, 250 mM sucrose, pH 8.5), dithiothreitol is added, the mixture is heated to remove most of the uromodulin in the sample, and the sample is washed with wash buffer (10 mM triethanolamine, 100 mM sodium chloride, pH 7.4) and centrifuged; the resulting pellet is the urinary protein sample of that urine sample.
  7. The method according to claim 6, wherein in step 3) the urinary protein sample is separated by SDS polyacrylamide gel electrophoresis (SDS-PAGE), the gel is cut into 6 bands for in-gel digestion, and the digests are combined into a peptide sample of 2 fractions representing one urinary proteome; the 2-fraction peptide sample is analyzed by LC-MS/MS to obtain the mass spectrometry data of the urinary protein sample for each urine sample; and in step 4) the database search analyzes the mass spectrometry output to determine the proteins it contains and to obtain primary quantitative results for all peptides, thereby obtaining the proteome data corresponding to each urinary protein sample.
  8. The method according to claim 4, wherein the intra-individual physiological fluctuation and differences of the urinary proteome of healthy individuals are assessed over three different sampling time spans (within 24 hours, 3 consecutive days, and more than 2 months), the assessment being to determine the distribution of the coefficients of variation (standard deviation of the protein quantitative data divided by their mean) of the quantitative data of each protein in the corresponding sub-dataset;
    each sub-dataset sampled within 24 hours or over 3 consecutive days contains 3 to 5 urinary proteome data sets; for proteins with quantitative data in all of the 3 to 5 urine samples, the coefficient of variation is calculated, and the distribution of the coefficients of variation of all qualifying proteins in each sub-dataset is obtained and displayed as a box plot;
    each sub-dataset with a sampling time span of more than 2 months contains 6 to 62 urinary proteome data sets; for proteins with quantitative data in at least 3 urine samples (sub-datasets of <30 urinary proteomes) or in 10% of urine samples (sub-datasets of >30 urinary proteomes), the coefficient of variation is calculated, and the distribution of the coefficients of variation of all qualifying proteins in each sub-dataset is obtained and displayed as a box plot.
  9. The method according to claim 4, wherein the total dataset and its male and female sub-datasets are used to assess the inter-individual physiological fluctuation and differences of the healthy human urinary proteome: for proteins with quantitative data in more than 10% of the urine samples of each dataset or sub-dataset, the coefficient of variation of the quantitative data is calculated, and the distribution of the coefficients of variation of all qualifying proteins in each dataset and sub-dataset is displayed as a box plot.
  10. A healthy human urinary proteome database, comprising the sub-datasets and total dataset determined in any one of claims 1 to 9, the healthy human urinary protein species determined from the datasets, and the calculated quantitative reference range of the healthy human urinary proteome; the quantitative reference range of the healthy human urinary proteome comprises the 2025 urinary proteins listed and covered in Table 7-1 or Table 7-2 and their values.
  11. A method for establishing a quantitative reference range for healthy human urinary proteins, comprising the following steps:
    sampling: collecting urine samples from healthy individuals;
    sample preparation: preparing urinary protein samples from the collected urine samples;
    detection: analyzing the urinary protein samples to obtain protein detection data for the urinary protein samples;
    determination: classifying the protein detection data, selecting for each class an upper limit value and a lower limit value from the multiple protein detection data to form a quantitative reference range, and combining the classes to form the quantitative reference range for healthy human urinary proteins.
  12. The method according to claim 11, wherein the protein detection data include, but are not limited to, the protein species and the content of each protein.
  13. A method for obtaining disease-related urinary protein markers by establishing a disease-related outlier urinary protein library, comprising the following steps:
    (1) randomly dividing the healthy human urinary proteome dataset A according to any one of claims 1 to 10 into three sub-datasets A1, A2 and A3, and determining the quantitative reference range of the healthy human urinary proteome from sub-dataset A1 by the non-parametric percentile method, with the 99.5th percentile of the quantitative values of each urinary protein in that sub-dataset taken as the upper limit of the quantitative reference range;
    (2) drawing part of the patient urinary proteome dataset B to form a training sub-dataset B1, and screening each urinary proteome in it against the upper limits of the reference range established in (1); a protein that exceeds its upper limit in at least two samples is added to the candidate disease-related outlier urinary protein library; screening all of the training data produces one candidate disease-related outlier urinary protein library C1;
    (3) drawing part of the patient urinary proteome dataset B to form a validation sub-dataset B2, and screening each urinary proteome in sub-datasets A2 and B2 against the upper limits of the reference range established in (1), so that each urinary proteome (sample) yields one sample-specific outlier urinary protein library C2; comparing all proteins in each sample-specific library C2 with the proteins in the candidate disease-related library C1 generated in (2) to determine which and how many proteins the two libraries share, wherein the more proteins are shared, the more closely the sample resembles a patient sample;
    calculating the p-value of the overlap between libraries C1 and C2 with the hypergeometric test, and using these p-values to plot a receiver operating characteristic (ROC) curve that shows how well the candidate disease-related outlier urinary protein library C1 generated in (2) distinguishes the urinary proteomes of healthy individuals and patients in sub-datasets A2 and B2;
    (4) randomly sampling the patient urinary proteome dataset B N times (N being a natural number greater than 10) to form N pairs of training sub-dataset B1 and validation sub-dataset B2, and performing the same analysis as in (3) on each pair to obtain N candidate disease-related outlier urinary protein libraries C1 and N ROC curves, wherein the candidate library C1 corresponding to the largest area under the ROC curve is determined to be the final disease-related outlier urinary protein library C, and the outlier proteins it contains are the disease-related urinary protein markers.
  14. The method according to claim 13, further comprising a step of verifying the established disease-associated outlier urinary protein library C:
    (5) A completely independent portion of the patient urinary proteome dataset B (one that never took part in the training or validation process) is extracted to form a validation sub-dataset B3, and sub-datasets A3 and B3 are used to test the ability of the final disease-associated outlier urinary protein library C obtained in (4) to distinguish healthy persons from patients. The procedure is the same as in (3): a hypergeometric test p-value is obtained for the urinary proteome of each healthy person and each patient and compared with the cutoff value Pc determined in (4) to decide whether each urinary proteome belongs to a healthy person or a patient, and the sensitivity and specificity with which the library distinguishes healthy persons from patients are determined from the false positive rate and the false negative rate.
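The evaluation in step (5) can be sketched as below. This is an illustrative helper under stated assumptions: the function name is hypothetical, and a p-value exactly equal to Pc is treated as healthy, a case the claims leave open.

```python
def sensitivity_specificity(patient_p, healthy_p, pc):
    """Sensitivity and specificity of the rule 'p < Pc means patient'.

    patient_p / healthy_p are hypergeometric p-values for samples whose
    true status is known (for example, sub-datasets B3 and A3).
    """
    true_pos = sum(p < pc for p in patient_p)     # patients called patient
    true_neg = sum(p >= pc for p in healthy_p)    # healthy called healthy
    sensitivity = true_pos / len(patient_p)       # 1 - false negative rate
    specificity = true_neg / len(healthy_p)       # 1 - false positive rate
    return sensitivity, specificity
```

For instance, with patient p-values `[0.001, 0.02, 0.3]`, healthy p-values `[0.5, 0.04, 0.9, 0.2]` and `pc = 0.05`, two of three patients and three of four healthy persons are classified correctly.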
  15. The method according to claim 13 or 14, wherein in step (1) the quantitative reference range of the healthy human urinary proteome is calculated from the data of sub-dataset A1 by a non-parametric method: the upper and lower limits of the reference range obtained by the percentile method actually cover the target percentage of individuals (for example, the 2.5th and 97.5th percentiles cover 95% of individuals).
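A non-parametric percentile reference range of this kind, and the upper-limit screening that uses it, can be sketched as follows. The interpolation rule (linear interpolation between order statistics) is one common convention and is an assumption here, as are the function names and the decision to skip proteins that have no reference limit.

```python
def percentile(values, q):
    """q-th percentile (0-100) by linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def reference_range(values, coverage=95.0):
    """Lower and upper limits covering the central `coverage` percent of values."""
    tail = (100.0 - coverage) / 2.0
    return percentile(values, tail), percentile(values, 100.0 - tail)


def outlier_proteins(sample, upper_limits):
    """Proteins whose quantity in `sample` exceeds the reference upper limit.

    sample and upper_limits map protein name -> quantity; proteins without
    a reference limit are ignored (an assumption of this sketch).
    """
    return {name for name, qty in sample.items()
            if name in upper_limits and qty > upper_limits[name]}
```

With 101 healthy-person values 1, 2, ..., 101 for one protein, `reference_range` returns limits of 3.5 and 98.5 under this interpolation convention.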
  16. The method according to claim 13, 14 or 15, wherein the process of establishing the patient urinary proteome dataset B in step (2) comprises:
    1) Sampling: collecting urine samples from patients;
    2) Preparing urinary protein samples: preparing one urinary protein sample from each collected urine sample;
    3) Detection: performing mass spectrometry on each urinary protein sample to obtain mass spectrometry data for each urinary protein sample;
    4) Database search and quantification: performing database searching, peptide quantification and protein assembly on the mass spectrometry data of each urinary protein sample to determine the protein species in the sample and the quantity of each protein, thereby forming one urinary proteome data record;
    5) Pooling the urinary proteome data of all urinary protein samples to obtain the patient urinary proteome dataset B.
  17. The method according to any one of claims 13 to 16, wherein the disease is any disease; tumors are used as the example in the embodiments.
  18. The disease-associated outlier urinary protein library obtained by the method according to any one of claims 13 to 16, and the outlier urinary proteins contained therein, namely disease-associated urinary protein markers.
  19. The tumor-associated outlier urinary protein library obtained by the method according to claim 17, and the outlier urinary proteins contained therein, namely tumor urinary protein markers.
  20. The tumor-associated outlier urinary protein library according to claim 19, comprising 509 urinary proteins, specifically A1BG, A2M, ABCB7, ABCD4, ABCE1, ABHD11, ABHD12, ABHD14B, ACADM, ACADSB, ACE2, ACO2, ACOT9, ACSL3, ACSM2A, ACSM2B, ACTR1B, ADD1, AGT, AHNAK2, AHSG, ALDH1L2, ALDH3A1, ALDH3A2, ALDH3B1, ALDH4A1, ALDOC, AMY2A, AMY2B, ANGPTL6, ANK1, ANPEP, ANXA1, ANXA10, ANXA2, ANXA3, ANXA4, ANXA5, ANXA6, APMAP, APOB, APP, AQP7, ARFIP1, ARG1, ARHGAP1, ARL13B, ARL6IP5, ARL8A, ARMC9, ARRDC1, ASNA1, ASPH, ATP13A3, ATP2A2, ATP6AP1, ATP6V0A1, ATP6V0C, ATP6V1B1, ATP6V1B2, AZU1, B3GNT3, BBOX1, BLMH, BPI, BPIFB1, C14orf166, C19orf59, C1orf123, C1QB, C1QC, C3, C4A, C4BPA, C5, C6orf211, C8A, C8B, C9, CAMK2G, CANT1, CCDC22, CCDC64B, CCT4, CCT6A, CDC42BPA, CDH11, CEACAM1, CEACAM6, CEACAM8, CECR5, CERS3, CFB, CHCHD3, CHI3L1, CHP1, CLCA4, CLCN7, CLPTM1, CLRN3, CMBL, CNDP2, CNN3, COL12A1, COL4A2, COMP, COPA, COPS7A, CORO1C, CPT1A, CPVL, CRAT, CRHBP, CRP, CRYAB, CRYL1, CTBS, CTNNA1, CTSG, CTSH, CWH43, CYP1A1, DDAH2, DDOST, DDX6, DERL1, DHCR24, DHX15, DIRC2, DKC1, DMTN, DNAJA1, DNAJB1, DNAJC7, DNASE1, DNM2, DOCK2, DOCK5, DOCK9, DPM1, DPP4, DPT, DSCR3, ECHS1, EEF1A1, EIF4A1, ELMO1, ELMO3, EMC1, EMD, EMILIN1, ENO1, ENOPH1, ENPEP, ENPP4, ENPP6, ENPP7, EPB41, EPB42, EPCAM, EPDR1, EPHA2, EPPK1, EPX, ERAP1, ERLIN1, ERLIN2, ERMP1, ESRP1, ETF1, FAM151A, FARP1, FCAR, FCN2, FDFT1, FDXR, FGA, FGB, FGG, FGL1, FGL2, FGR, FLOT2, FN3KRP, FNBP1L, FOLH1, FRK, GALK2, GALM, GALNT1, GDF15, GIPC2, GLDC, GLUD1, GLYAT, GNS, GOLM1, GOLT1B, GPD2, GPR110, GPR126, GPR137B, GPR56, GPX3, GSTK1, GSTT1, HBB, HDLBP, HECTD3, HEXA, HEXB, HGD, HLA-A, HLA-B, HLA-DQB1, HLA-DRA, HLA-DRB1, HLA-DRB5, HMOX2, HNRNPA0, HNRNPA1, HNRNPD, HNRNPF, HNRNPH3, HNRNPK, HNRNPM, HNRNPU, HSP90B1, HSPA2, HSPA6, HSPA8, HSPA9, HYOU1, IDH2, IDH3A, IGFBP7, IGLL1, ILF3, IMPDH2, IQCC, ITIH1, ITIH2, KCNJ15, KIAA0319L, KIF13B, KRT18, KRT20, KRT7, KRT8, LAMB1, LAMP1, LAMTOR1, LBR, LLGL2, LMF2, LMNB2, LONP1, LPCAT3, LRG1, LRRK2, LZTFL1, MAL2, MARVELD3, MCCC1, MCRS1, MFSD1, MGAM, MITD1, MLYCD, MMP7, MNDA, MPO, MPP1, MRPL12, MRPL39, MRPS18B, MRPS22, MSMO1, MTAP, MTCH1, MTHFD1, MTOR, MUC20, MUT, MVP, MYO18A, MYO1B, MYO1D, NAMPT, NCF1, ND4, NDFIP1, NDUFA10, NDUFA9, NDUFS3, NDUFS8, NNT, NPTN, NSDHL, NT5C3A, NUDT9, NUMA1, OAT, OGFOD3, OGN, OLA1, ORM1, P4HB, PAPSS2, PCCA, PCCB, PCK1, PCK2, PDHA1, PDIA3, PDP1, PEF1, PFKFB2, PFKL, PGD, PHB2, PHPT1, PI4K2A, PICALM, PIP4K2A, PIP4K2C, PITRM1, PLA2G15, PLAU, PLCG2, PLEKHF2, PNKD, PON1, PPP2R2A, PRDX4, PRKAB1, PRKACA, PSMA5, PSMA7, PSMC3, PSMC4, PSMD3, PSMD7, PSMF1, PTPLAD1, PTPRC, RAB11B, RAB12, RAB24, RAB32, RAB7A, RAB9A, RDH13, RELA, RILPL2, RNASE3, RNASET2, ROGDI, RPN1, RPN2, RPS3, RPS6KA1, RPS9, RPTOR, RRAGC, RRAS2, S100A7, SCAMP3, SCARB1, SCARB2, SEC22B, SEPT9, SERPINA1, SERPINA3, SERPINA7, SERPIND1, SF3B1, SF3B3, SGPL1, SIAE, SIRT5, SLC12A6, SLC12A7, SLC12A9, SLC13A2, SLC15A1, SLC15A4, SLC17A1, SLC17A3, SLC17A5, SLC1A5, SLC22A11, SLC25A10, SLC25A13, SLC25A24, SLC25A4, SLC25A5, SLC26A11, SLC26A4, SLC2A1, SLC30A2, SLC34A2, SLC35F2, SLC35F6, SLC38A7, SLC3A1, SLC46A1, SLC4A1, SLC6A19, SLC6A8, SLC7A9, SLC9A3, SLFN5, SMPDL3A, SMPDL3B, SNRNP200, SNX27, SORT1, SPAG9, SPARCL1, SPNS1, SPRYD4, SPTA1, SPTB, SPTLC1, SSR1, ST13, STAP1, STARD3NL, STC1, STIM1, STK11IP, STOM, STRN, STT3A, STUB1, STX3, STX7, STX8, STXBP1, SUCLA2, SUCLG2, SULT1A1, SULT1C2, SUN2, SVIL, TACSTD2, TAOK1, TARDBP, TBC1D9B, TBCB, TBL2, TCIRG1, TFRC, TGM2, TGM3, TIMM44, TIMM50, TM9SF2, TM9SF3, TM9SF4, TMBIM1, TMED2, TMED4, TMEM104, TMEM106A, TMEM160, TMEM176A, TMEM176B, TMEM192, TMEM205, TMEM27, TMEM40, TMEM55B, TMEM9, TMLHE, TOMM40L, TOP2B, TPCN1, TPCN2, TPM3, TRIM14, TRIP10, TST, TTC38, TUFM, TXNDC5, TYMP, UGT1A6, UGT1A9, UGT2B7, UPK3B, VAC14, VAMP7, VAPA, VAPB, VASN, VIM, VKORC1L1, VNN2, VPS37C, VPS4A, VSNL1, VTI1B, VWA5A, WASH1, WASL, ZADH2, ZNRF2.
  21. Use of the tumor-associated outlier urinary protein library according to claim 19 or 20, wherein proteomic data of a urine sample to be tested are obtained by steps 2)-4) of claim 1 or claim 16; a hypergeometric test is used to calculate the p-value of the protein overlap between the urine sample and the tumor-associated outlier urinary protein library; the Pc value at a specificity of 95% is determined; when the hypergeometric test p-value is greater than Pc, the urine sample is judged to be a healthy-person sample, and when the p-value is less than Pc, the urine sample is judged to be a tumor-patient sample.
  22. Use of the disease-associated outlier urinary protein library according to claim 18, wherein proteomic data of a urine sample to be tested are obtained by steps 2)-4) of claim 1 or claim 16; a hypergeometric test is used to calculate the p-value of the protein overlap between the urine sample and the disease-associated outlier urinary protein library; the Pc value at a specificity of 95% is determined; when the hypergeometric test p-value is greater than Pc, the urine sample is judged to be a healthy-person sample, and when the p-value is less than Pc, the urine sample is judged to be a sample from a patient with the disease.
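One way to fix Pc at a specificity of 95% and then classify a new sample is sketched below. The claims do not say how Pc is derived, so this is only an assumed procedure: Pc is taken from the p-values of known healthy samples so that at most 5% of them fall below it, and a p-value equal to Pc is treated as healthy. All names are hypothetical.

```python
def cutoff_at_specificity(healthy_p, specificity=0.95):
    """Choose Pc so that at most (1 - specificity) of healthy p-values are < Pc."""
    xs = sorted(healthy_p)
    # Number of healthy samples allowed to be misclassified as patients;
    # the small epsilon guards against floating-point round-off.
    allowed_fp = int(len(xs) * (1.0 - specificity) + 1e-9)
    return xs[allowed_fp]


def classify(p_value, pc):
    """Apply the decision rule of the claims: p < Pc means patient."""
    return "patient" if p_value < pc else "healthy"
```

With twenty healthy p-values 0.05, 0.10, ..., 1.00, one false positive is allowed, so Pc becomes the second-smallest value and only the sample at 0.05 would be called a patient.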
PCT/CN2017/113550 2017-01-20 2017-11-29 Method for establishing quantitative reference range for healthy person urinary proteome and acquiring disease-related urinary protein marker WO2018133553A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201710051714.9A CN108334747B (en) 2017-01-20 2017-01-20 Method for obtaining tumor urine protein marker and obtained tumor-related outlier urine protein library
CN201710048188.0 2017-01-20
CN201710048188.0A CN108334752B (en) 2017-01-20 2017-01-20 Method for establishing quantitative reference range of healthy human urine proteome and healthy human urine proteome database
CN201710051714.9 2017-01-20

Publications (1)

Publication Number Publication Date
WO2018133553A1 true WO2018133553A1 (en) 2018-07-26

Family

ID=62907724

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/113550 WO2018133553A1 (en) 2017-01-20 2017-11-29 Method for establishing quantitative reference range for healthy person urinary proteome and acquiring disease-related urinary protein marker

Country Status (1)

Country Link
WO (1) WO2018133553A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101067833A (en) * 2007-05-09 2007-11-07 冯连元 Method for unified standardizing various examination and test results normal range referencel value and its actual measured value in clinical medicine
CN103884806A (en) * 2012-12-21 2014-06-25 中国科学院大连化学物理研究所 Proteome label-free quantification method combining tandem mass spectrometry with machine learning algorithm
WO2016083832A1 (en) * 2014-11-28 2016-06-02 The University Of Birmingham Bladder cancer prognosis

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
AN, LONGFEI ET AL.: "Application of Proteomics to Screen Urinary Biomarkers in Women with Cervical Cancer", CHINA BIOTECHNOLOGY, vol. 36, no. 9, 31 December 2016 (2016-12-31), pages 1 - 10 *
CHEN, BIN ET AL.: "Non-official translation: Lesson 11: How to Determine Reference Value Ranges", CHINESE JOURNAL OF PREVENTIVE MEDICINE, vol. 36, no. 5, 30 September 2002 (2002-09-30), pages 355 - 357 *
CUI, YOUHONG ET AL.: "Non- official translation: Quantitative Electrophoresis Analysis and Reference Values of Healthy Human Urine Protein Components", CHINESE JOURNAL OF NEPHROLOGY, 31 December 1994 (1994-12-31) *
ZHAO, XUHONG ET AL.: "Urinary Proteomic Analysis for Diagnosis and Differential Diagnosis of Prostatic Hyperplasia", JOURNAL OF CAPITAL MEDICAL UNIVERSITY, vol. 30, 30 June 2009 (2009-06-30), pages 277 - 281 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17892995

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17892995

Country of ref document: EP

Kind code of ref document: A1