KR20190005034A - Apparatus and method for quantification analysis of sample using liquid chromatograph-mass spectrometry

Info

Publication number: KR20190005034A
Application number: KR1020170085558A
Authority: KR
Inventors: 김빛내리; 김종서; 정교원; 최연
Original assignee: 서울대학교산학협력단; 기초과학연구원
Priority date: 2017-07-05
Filing date: 2017-07-05
Publication date: 2019-01-15
Also published as: KR101958900B1

Abstract

A quantitative sample analysis apparatus includes: a spectrum recognition unit that receives a liquid chromatograph-mass spectrometry (LC-MS) spectrum of a sample including a first material to which a first label material is attached and a second material to which a second label material having a different mass from the first label material is attached; and a peptide spectrum match (PSM) analysis unit that extracts the PSM corresponding to the first label material from the mass spectrometry spectrum received by the spectrum recognition unit, specifies a template of the first material using the PSM of the first label material, calculates a predicted retention time (RT) shift between the first and second label materials, specifies a template of the second material using the predicted RT shift, and reconstructs the signals of the first and second materials using the two templates to perform a peptide-level quantitative analysis of the sample. The apparatus can apply machine learning to accurately and efficiently predict the retention time shift caused by substituting hydrogen in the label material with deuterium (²H or D).

Description

The present invention relates to an apparatus and method for analyzing a sample using liquid chromatograph-mass spectrometry.

Embodiments relate to an apparatus and method for quantitative analysis (quantification) of a sample in liquid chromatograph-mass spectrometry (LC-MS). More particularly, they relate to a technique that predicts the retention time shift caused by the use of label materials in LC-MS analysis and uses that prediction to accurately separate the signals of label materials with adjacent masses, thereby improving the multiplexity of quantitative measurement.

Liquid chromatograph-mass spectrometry (LC-MS) is a technique in which a target material in a liquid state is passed through a column or the like to separate it into components, each component is ionized, and materials with different mass-to-charge ratios are then separated by mass analysis. It can be used, for example, for the quantification (identification) of proteins. When the ionization process for mass analysis in LC-MS is carried out in tandem, the technique is referred to as LC-MS/MS.

Among LC-MS/MS techniques, there is a method in which a label material with a known mass-to-charge ratio is bound to a target material in advance, and the labeled target material is quantified by identifying the spectrum of the label material in the spectrum obtained from mass analysis. Depending on the degree to which the material used as the label is ionized, this approach is divided into MS1-based quantification, which uses labels based on the parent molecule corresponding to the precursor, and MS2-based quantification, which uses labels based on fragment molecules further broken down from the parent molecule.

FIG. 1 is a conceptual diagram illustrating the MS1-based quantification technique, which uses parent-molecule-based labels.

Referring to FIG. 1, different types of label materials (110, 120) are attached to the different types of target materials (11, 12) to be detected. When a sample containing these target materials (11, 12) is analyzed by LC-MS/MS, graphs (1100, 1200) corresponding to fragment ions are obtained in a three-dimensional space whose axes are the LC retention time, the mass-to-charge ratio (m/z) of the fragment ions, and the ion signal intensity. The target material (11, 12) corresponding to each graph can then be identified from the mass-to-charge ratio of the label materials (110, 120), which makes quantification of the sample possible.

However, because a label material may contain isotopes with slightly different atomic masses, an actual mass spectrometry spectrum takes the form shown in FIG. 2.

Referring to FIG. 2, the mass spectrometry spectrum of label material (110) includes a plurality of graphs (1101-1103) with slightly different mass-to-charge ratios, and the mass spectrometry spectrum of label material (120) likewise includes a plurality of graphs (1201-1203) with slightly different mass-to-charge ratios. Because each mass spectrometry spectrum is spread over a range of mass-to-charge ratios in this way, it is common practice, for unambiguous discrimination, to use label materials whose mutual mass difference is at or above a certain level, for example 6 daltons (Da) or more.

For example, the paper "Stable-Isotope Dimethyl Labeling for Quantitative Proteomics" (Jue-Liang Hsu et al., Anal. Chem., 2003, 75 (24), pp 6843-6852) discloses a quantification technique in which the mass difference between label materials is adjusted as desired by substituting one or more hydrogen atoms with deuterium (²H, or D) in a peptide-form label material composed of carbon (C), nitrogen (N), and hydrogen (H). As shown in FIG. 2, a labeling width is defined by the mass difference between the label materials used; the narrower the labeling width, the higher the multiplexity of the quantitative measurement can be.

However, when the mass difference between label materials is reduced, graphs with adjacent mass-to-charge ratios can overlap, which limits the multiplexity of quantitative measurement. FIG. 3A is a graph showing the spectral overlap that occurs when label materials with a small mass difference are used. It shows a case in which three types of label materials with a mass difference of 2 Da are used and the sample contains three isotopes with a mass difference of 1 Da.

The graphs drawn in different colors in FIG. 3A represent the mass spectrometry spectra of different types of label materials. Graphs (301-303) represent the spectrum of a label made of the base peptide material, graphs (311-313) represent the spectrum of a label material in which two hydrogen atoms of the base label material are substituted with deuterium, and graphs (321-323) represent the spectrum of a label material in which four hydrogen atoms of the base label material are substituted with deuterium. As shown, because the mass-to-charge ratios (m/z) of the spectra are adjacent, pairs of graphs belonging to different label materials (303, 311 and 313, 312) overlap each other.
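The overlap described for FIG. 3A can be illustrated with a small sketch. It assumes nominal integer mass offsets in daltons and singly charged ions, and the function names `isotope_peak_offsets` and `overlapping_offsets` are hypothetical helpers introduced here for illustration; they do not come from the patent.

```python
from collections import defaultdict


def isotope_peak_offsets(label_shifts_da, n_isotopes, isotope_spacing_da=1):
    """Map each nominal mass offset (Da) to the (label, isotope) peaks that land on it.

    Assumption: integer nominal masses and charge 1, so the m/z offset equals the mass offset.
    """
    peaks = defaultdict(list)
    for label_idx, shift in enumerate(label_shifts_da):
        for iso_idx in range(n_isotopes):
            peaks[shift + iso_idx * isotope_spacing_da].append((label_idx, iso_idx))
    return dict(peaks)


def overlapping_offsets(peaks):
    """Keep only the offsets that are shared by peaks of more than one label."""
    return {
        offset: owners
        for offset, owners in peaks.items()
        if len({label for label, _ in owners}) > 1
    }


# Three labels spaced 2 Da apart, three isotope peaks each (the FIG. 3A scenario).
narrow = overlapping_offsets(isotope_peak_offsets([0, 2, 4], n_isotopes=3))

# The same three labels spaced 6 Da apart (the conventional spacing).
wide = overlapping_offsets(isotope_peak_offsets([0, 6, 12], n_isotopes=3))
```

Under these assumptions, the 2 Da spacing produces shared offsets at +2 Da and +4 Da, where the last isotope peak of one label coincides with the first peak of the next, while the 6 Da spacing produces none.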

Analysis instruments whose resolution is increased by a factor of several tens could be used to solve this problem, but they are prohibitively expensive and therefore inefficient, and even though the resolution increases, the intensity of the measured signal decreases.

In addition, a label material produced by substituting hydrogen with deuterium exhibits a slight shift in LC retention time compared with the material before substitution. FIG. 3B is a graph showing the retention time shift that occurs when label materials in which hydrogen is substituted with deuterium are used. As shown in FIG. 3B, as hydrogen is substituted with deuterium, each label material undergoes a slight shift in LC retention time. As a result, the mass spectrometry spectrum corresponding to a label material may not be detected, or its signal intensity may be detected inaccurately.

For example, suppose that a label material (30) with a specific mass-to-charge ratio has been identified on graph (302) of the mass spectrometry spectrum. Because the experimenter knows in advance the mass difference between the label materials used, the experimenter will look for the spectrum of the other label material in the region of the graph where the mass-to-charge ratio is higher than that of the identified label material (30) by that mass difference. However, because of the retention time shift, graphs (311-313) are displaced in one direction along the retention time axis relative to the predicted region, so the signal intensity of graphs (311-313) may be measured inaccurately, or graphs (311-313) may not be detected at all.
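The search problem described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Feature` type, the `find_partner` helper, the tolerance values, and all numeric values are assumptions chosen only to show how a fixed retention time window misses a shifted partner.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Feature:
    mz: float         # mass-to-charge ratio
    rt: float         # retention time in minutes
    intensity: float  # ion signal intensity


def find_partner(anchor: Feature, features: List[Feature], mass_diff_da: float,
                 charge: int = 1, rt_shift: float = 0.0,
                 mz_tol: float = 0.01, rt_tol: float = 0.1) -> Optional[Feature]:
    """Look for the partner label at the expected m/z and (optionally shifted) RT.

    All tolerances are illustrative assumptions, not values from the patent.
    """
    target_mz = anchor.mz + mass_diff_da / charge
    target_rt = anchor.rt + rt_shift
    candidates = [
        f for f in features
        if abs(f.mz - target_mz) <= mz_tol and abs(f.rt - target_rt) <= rt_tol
    ]
    # Return the most intense candidate, or None if nothing falls in the window.
    return max(candidates, key=lambda f: f.intensity, default=None)


# Illustrative values only: the partner is displaced by 0.3 min along the RT axis.
light = Feature(mz=500.0, rt=30.0, intensity=1.0e6)
heavy = Feature(mz=502.0, rt=29.7, intensity=8.0e5)

missed = find_partner(light, [heavy], mass_diff_da=2.0)                  # no RT correction
found = find_partner(light, [heavy], mass_diff_da=2.0, rt_shift=-0.3)    # with a predicted shift
```

Without a shift correction the partner falls outside the retention time window and is missed; supplying a predicted shift moves the window onto it.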

FIGS. 3C and 3D are graphs showing spectra corresponding to different label materials overlapping each other with a retention time shift, for a case in which the masses of the label materials increase in steps of 2 Da and four isotopes are included.

As shown in FIG. 3C, retention time shifts have occurred among graphs (301-304), graphs (311-314), and graphs (321-324), each set corresponding to a different label material. In addition, graphs (303, 304) and graphs (311, 312), which correspond to different materials, overlap each other, and likewise graphs (313, 314) and graphs (321, 322), which correspond to different materials, overlap each other. The overlapping graphs are detected in the form of graphs (331, 332) and graphs (333, 334) shown in FIG. 3D, and from graphs of this form it is impossible to identify the signal corresponding to each material.

FIG. 3E is a conceptual diagram showing a quantitative analysis process using a conventional protein quantification tool that is subject to the above limitations.

Referring to FIG. 3E, the dashed rectangles (341-347) represent mass spectrum pairs of a sample labeled with label materials having a mass difference, and the solid rectangles (351, 352) represent the spectrum pairs actually detected by the conventional quantification tool. As shown, for most of the spectra other than rectangle (351), the spectrum pair itself was not found because of the retention time shift and the overlap of graphs. In the case of rectangle (352), the retention time shift caused a spectrum to be erroneously detected as paired with the spectrum of a different material rather than with its true partner.

"Stable-Isotope Dimethyl Labeling for Quantitative Proteomics", Jue-Liang Hsu et al., Anal. Chem., 2003, 75 (24), pp 6843-6852

To solve the problems described above, one aspect of the present invention can provide an apparatus and method for quantitative analysis of a sample that, in liquid chromatograph-mass spectrometry (LC-MS), predict the retention time shift caused by substituting hydrogen (H) with deuterium (²H, or D) in a label material and thereby accurately separate the signals of materials with adjacent masses, as well as a computer program for executing the method.

An apparatus for quantitative analysis of a sample according to an embodiment includes: a spectrum recognition unit configured to receive a liquid chromatograph-mass spectrometry (LC-MS) spectrum of a sample including a first material to which a first label material is attached and a second material to which a second label material having a different mass from the first label material is attached; and a PSM analysis unit configured to extract a peptide spectrum match (PSM) corresponding to the first label material from the mass spectrometry spectrum received by the spectrum recognition unit, specify a template of the first material using the PSM of the first label material, calculate a predicted retention time (RT) shift between the first label material and the second label material, specify a template of the second material using the predicted RT shift, and perform PSM-level quantitative analysis of the sample by reconstructing the signals of the first material and the second material using the templates of the first material and the second material.

The apparatus for quantitative analysis of a sample according to an embodiment further includes a protein analysis unit configured to perform protein-level quantitative analysis of the sample using the results of the PSM-level quantitative analysis for one or more materials.

In one embodiment, the PSM analysis unit includes an RT shift prediction unit that calculates the predicted RT shift. The RT shift prediction unit includes: a feature extraction unit configured to extract one or more predetermined features from the mass spectrometry spectrum; and a machine learning unit configured to apply the one or more features to a rule obtained through machine learning that uses a training set of mass spectrometry spectra as input.
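As a rough illustration of "learning a rule from a training set and applying it to extracted features", the sketch below fits a one-feature ordinary least squares line and uses it to predict an RT shift. This is a deliberately simple stand-in: the patent's figures refer to comparing regression models, including support vector regression, and nothing here reproduces that model. The helper names and the training values are made up for the example.

```python
def fit_linear_rule(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (single feature)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired training samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_rt_shift(rule, x):
    """Apply the learned rule to one feature value."""
    slope, intercept = rule
    return slope * x + intercept


# Hypothetical training set: number of substituted deuterium atoms vs. observed RT shift (min).
# These numbers are invented for illustration and are not measurements.
train_n_deuterium = [2, 4, 6, 8]
train_rt_shift = [-0.1, -0.2, -0.3, -0.4]

rule = fit_linear_rule(train_n_deuterium, train_rt_shift)
predicted = predict_rt_shift(rule, 10)
```

A practical version would use several features at once and a model chosen by comparing regression errors, which a single-feature line cannot capture.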

In one embodiment, the one or more features include at least one of: the number of deuterium atoms substituted in the second label material, the normalized RT of the mass spectrometry spectrum, the peptide sequence length of the first label material, the proportion of positions in the second material that are labeled with the second label material, or the normalized peak width of the mass spectrometry spectrum.
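The five features listed above could be collected into a single record along the following lines. This is a sketch under stated assumptions: the `RTShiftFeatures` type and `extract_features` helper are hypothetical, and normalizing the retention time and peak width by the total run length is an assumption made here for illustration; this passage does not say how the normalization is done.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RTShiftFeatures:
    n_deuterium: int              # deuterium atoms substituted in the second label
    normalized_rt: float          # retention time on a 0..1 scale (assumed normalization)
    peptide_length: int           # number of residues in the peptide sequence
    labeled_fraction: float       # labeled positions divided by sequence length
    normalized_peak_width: float  # peak width on the same 0..1 scale (assumed)


def extract_features(sequence: str, n_deuterium: int, rt_min: float,
                     peak_width_min: float, n_labeled_sites: int,
                     run_length_min: float) -> RTShiftFeatures:
    """Build the feature record for one identified peptide."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    if run_length_min <= 0:
        raise ValueError("run_length_min must be positive")
    return RTShiftFeatures(
        n_deuterium=n_deuterium,
        normalized_rt=rt_min / run_length_min,
        peptide_length=len(sequence),
        labeled_fraction=n_labeled_sites / len(sequence),
        normalized_peak_width=peak_width_min / run_length_min,
    )


# Illustrative input values only.
features = extract_features("PEPTIDEK", n_deuterium=4, rt_min=30.0,
                            peak_width_min=0.5, n_labeled_sites=2,
                            run_length_min=120.0)
```

The number of labeled sites is passed in rather than derived from the sequence, because which residues carry a label depends on the labeling chemistry in use.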

In one embodiment, the PSM analysis unit includes an isotope separation unit configured to specify the template of the first material using a normal distribution curve obtained from the PSM corresponding to the first label material and a previously known isotope distribution profile of the elements constituting the first material.
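One way to picture such a template is as a grid over retention time and isotope peak, formed by multiplying a bell-shaped elution curve by the relative isotope abundances. The sketch below takes that view; the outer-product construction, the function names, and the abundance values are assumptions for illustration and are not stated in the patent.

```python
import math


def gaussian_profile(rts, center, sigma):
    """Unnormalized normal-distribution curve evaluated at each retention time."""
    return [math.exp(-0.5 * ((rt - center) / sigma) ** 2) for rt in rts]


def build_template(rts, center, sigma, isotope_abundances):
    """Return template[i][j]: expected relative signal at rts[i] for isotope peak j.

    Assumption: elution shape and isotope pattern are independent, so the
    template is their outer product.
    """
    total = sum(isotope_abundances)
    iso = [a / total for a in isotope_abundances]
    elution = gaussian_profile(rts, center, sigma)
    return [[e * a for a in iso] for e in elution]


rts = [29.8, 29.9, 30.0, 30.1, 30.2]
isotopes = [0.6, 0.3, 0.1]  # illustrative abundances, not a real isotope profile

# Template of the first material, centered on the retention time from its PSM.
template_first = build_template(rts, center=30.0, sigma=0.1, isotope_abundances=isotopes)

# Template of the second material: same shape, moved by a predicted RT shift (assumed -0.1 min).
template_second = build_template(rts, center=30.0 - 0.1, sigma=0.1, isotope_abundances=isotopes)
```

Here the second template reuses the first material's shape and only changes its center, which mirrors the idea of specifying the second template from the first one plus the predicted RT shift.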

In one embodiment, the PSM analysis unit includes a reconstruction unit configured to reconstruct the LC-MS spectrum of the sample into a signal of the first material and a signal of the second material by calculating the relative amounts of the first material and the second material using the template of the first material and the template of the second material.
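Computing relative amounts from two templates can be sketched as a two-parameter least squares fit, in which the observed signal is modeled as a weighted sum of the templates. The patent does not specify the fitting procedure, so the use of least squares, the `relative_amounts` helper, and the numbers below are assumptions for illustration.

```python
def relative_amounts(observed, template_a, template_b):
    """Solve observed ≈ a * template_a + b * template_b for (a, b) by least squares.

    All three arguments are flat sequences of equal length.
    """
    if not (len(observed) == len(template_a) == len(template_b)):
        raise ValueError("observed signal and templates must have the same length")
    aa = sum(x * x for x in template_a)
    bb = sum(y * y for y in template_b)
    ab = sum(x * y for x, y in zip(template_a, template_b))
    ao = sum(x * o for x, o in zip(template_a, observed))
    bo = sum(y * o for y, o in zip(template_b, observed))
    det = aa * bb - ab * ab
    if abs(det) < 1e-12:
        raise ValueError("templates are not linearly independent")
    # Cramer's rule on the 2x2 normal equations.
    amount_a = (ao * bb - bo * ab) / det
    amount_b = (bo * aa - ao * ab) / det
    return amount_a, amount_b


# Isotope patterns of two labels 2 Da apart, on nominal mass offsets 0..4 Da (illustrative).
template_light = [0.6, 0.3, 0.1, 0.0, 0.0]
template_heavy = [0.0, 0.0, 0.6, 0.3, 0.1]

# Synthetic overlapped signal: 2 parts light plus 1 part heavy.
observed = [2 * l + 1 * h for l, h in zip(template_light, template_heavy)]

amount_light, amount_heavy = relative_amounts(observed, template_light, template_heavy)
```

Although the two patterns share the +2 Da position, the fit recovers the 2:1 mixture because the templates differ at the other positions. Templates that also span retention time can be flattened into one list each and passed to the same function.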

A quantitative analysis method according to an embodiment includes: receiving, by a quantitative analysis apparatus, an LC-MS spectrum of a sample including a first material to which a first label material is attached and a second material to which a second label material having a different mass from the first label material is attached; calculating, by the quantitative analysis apparatus, a predicted RT shift between the first label material and the second label material from the mass spectrometry spectrum; extracting, by the quantitative analysis apparatus, a PSM corresponding to the first label material from the mass spectrometry spectrum; specifying, by the quantitative analysis apparatus, a template of the first material using the extracted PSM of the first label material; specifying, by the quantitative analysis apparatus, a template of the second material using the template of the first material and the predicted RT shift; reconstructing, by the quantitative analysis apparatus, the signals of the first material and the second material using the templates of the first material and the second material; and performing, by the quantitative analysis apparatus, PSM-level quantitative analysis of the sample using the reconstructed signals of the first material and the second material.

The quantitative analysis method according to an embodiment further includes performing, by the quantitative analysis apparatus, protein-level analysis of the sample using the results of the PSM-level quantitative analysis for one or more materials.

In one embodiment, the first label material is a peptide containing carbon, nitrogen, and hydrogen, and the second label material is a peptide in which one or more hydrogen (H) atoms of the first label material are substituted with deuterium (²H, or D).

In one embodiment, the second label material includes five or more peptides whose masses increase sequentially from the mass of the first label material according to the number of substituted deuterium atoms.

In one embodiment, calculating the predicted RT shift includes: extracting one or more predetermined features from the mass spectrometry spectrum; and calculating the predicted RT shift by applying the one or more features to a rule obtained through machine learning that uses a training set of mass spectrometry spectra as input.

In one embodiment, the one or more features include at least one of: the number of deuterium atoms substituted in the second label material, the normalized RT of the mass spectrometry spectrum, the peptide sequence length of the first label material, the proportion of positions in the second material that are labeled with the second label material, or the normalized peak width of the mass spectrometry spectrum.

In one embodiment, specifying the template of the first material includes specifying the template of the first material using a normal distribution curve obtained from the PSM corresponding to the first label material and a previously known isotope distribution of the elements constituting the first material.

In one embodiment, reconstructing the signals of the first material and the second material includes calculating the relative amounts of the first material and the second material using the template of the first material and the template of the second material.

A computer program according to an embodiment is combined with hardware to execute the method for quantitative analysis of a sample according to the embodiments described above, and may be stored on a computer-readable medium.

According to the apparatus and method for quantitative analysis of a sample according to one aspect of the present invention, the retention time shift caused by substituting hydrogen (H) with deuterium (²H, or D) in a label material in liquid chromatograph-mass spectrometry (LC-MS) can be predicted efficiently by applying machine learning. As a result, deuterium atoms can be used to produce label materials, which increases the number of types of usable label materials.

In addition, according to the apparatus and method for quantitative analysis of a sample according to one aspect of the present invention, the signal of each material is reconstructed through retention time shift prediction from a template specified using a peptide spectrum match (PSM). The signal of each material can therefore be accurately separated even when label materials with adjacent masses are used, for example with a mass difference of about 4 daltons (Da) or even 2 daltons (Da). This has the advantage of improving the multiplexity of quantitative measurement without using expensive equipment to increase resolution.

FIG. 1 is a conceptual diagram illustrating the MS1-based quantification technique, which uses parent-molecule-based labels.
FIG. 2 is a graph of a mass spectrometry spectrum obtained using label materials that contain isotopes.
FIG. 3A is a graph showing the spectral overlap that occurs when label materials with a small mass difference are used.
FIG. 3B is a graph showing the retention time shift that occurs when label materials in which hydrogen (H) is substituted with deuterium (²H, or D) are used.
FIGS. 3C and 3D are graphs showing spectra corresponding to different materials overlapping each other with a retention time shift.
FIG. 3E is a conceptual diagram showing a quantitative analysis process using a conventional protein quantification tool.
FIG. 4 is a block diagram of an apparatus for quantitative analysis of a sample according to an embodiment.
FIG. 5A is a graph showing log-normal fitting of a mass spectrometry spectrum.
FIG. 5B is a graph showing the process of specifying the mass spectrometry spectrum of another label material using the predicted retention time shift.
FIGS. 6A to 6C are graphs showing the process of separating a mass spectrometry spectrum by label material and performing quantification according to an embodiment.
FIG. 7 is a graph showing the process of normalizing retention time.
FIG. 8 is a graph showing normalized retention time.
FIG. 9 is a graph showing the trend of the retention time shift in repeated experiments on the same peptide and the same label material.
FIG. 10 is a graph showing the retention time shift against the number of deuterium atoms substituted in the label material.
FIG. 11 is a graph showing the retention time shift against the normalized retention time.
FIG. 12 is a graph showing the retention time shift against the peptide sequence length.
FIG. 13 is a graph showing the retention time shift against the proportion of labeled amino acids.
FIG. 14 is a graph showing the retention time against the peak width of the mass spectrometry spectrum.
FIG. 15 is a graph showing the error of various regression models.
FIGS. 16A and 16B are diagrams illustrating support vector regression (SVR), one of the regression models.
FIGS. 17A and 17B are graphs showing the performance obtained when SVR is used as the regression model.
FIG. 18 shows exemplary peptides that can be used as label materials with the method for quantitative analysis of a sample according to an embodiment.
FIGS. 19A to 19H are graphs comparing the quantification results of an embodiment with theoretical values, according to the ratio of labeled proteins.
FIGS. 20A and 20B are graphs comparing the quantification results of an embodiment with the quantification results of the prior art, according to the ratio of labeled proteins.
FIG. 21 is a flowchart of a method for quantitative analysis of a sample according to an embodiment.

The following detailed description of the invention refers to the accompanying drawings, which show, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the invention differ from one another but need not be mutually exclusive. For example, a particular shape, structure, or characteristic described herein in connection with one embodiment may be implemented in another embodiment without departing from the spirit and scope of the invention.

It should also be understood that the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the invention, if properly described, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals refer to the same or similar functions throughout the several aspects.

Hereinafter, some embodiments of the invention are described in more detail with reference to the drawings.

FIG. 4 is a block diagram of an apparatus for quantitative analysis of a sample according to an embodiment.

Referring to FIG. 4, the quantitative sample analysis apparatus according to this embodiment includes a spectrum recognition unit 41 and a peptide spectrum match (PSM) analysis unit 42. In one embodiment, the apparatus further includes a protein analysis unit 43. The apparatus according to the embodiments may be entirely hardware, or partly hardware and partly software. That is, the apparatus and each unit included in it may collectively refer to a device for storing data of a specific format and content, or for exchanging such data by electronic communication, together with the associated software. The hardware may be a data processing device that includes a CPU or another processor. The software run by the hardware may refer to a running process, an object, an executable, a thread of execution, a program, and the like.

In this specification, the units 41-43 that make up the quantitative sample analysis apparatus are not necessarily intended to denote physically distinct components. Although the units are drawn as separate blocks in FIG. 4, this is only a functional division of the apparatus according to the operations it performs. Depending on the embodiment, some or all of the units may be integrated into a single device, or one or more units may be implemented as a separate device that is physically distinct from the others. For example, the units may be components communicatively connected to one another in a distributed computing environment.

The spectrum recognition unit 41 is configured to receive the liquid chromatograph-mass spectrometry (LC-MS) spectrum of a sample for which the retention time (RT) shift is to be predicted. The sample is labeled with two or more label materials of different masses. For example, the sample may include a first material to which a first label material is attached and a second material to which a second label material, differing in mass from the first label material, is attached. The first and second materials may be proteins, and the first and second label materials attached to them may be peptides having a predetermined composition.

In one embodiment, the spectrum recognition unit 41 recognizes the PSMs corresponding to peptides from the mass spectrometry spectrum. Recognition may be performed with MS-GF+, which matches MS/MS spectra against the peptides of a protein sequence database, but is not limited to this method. A recognized PSM corresponds not only to a peptide sequence but also to the label material attached to that peptide.

The PSM analysis unit 42 analyzes the PSMs recognized by the spectrum recognition unit 41 and performs PSM-level quantitative analysis as follows. First, it applies log-normal fitting to the signal of the material (e.g., the first material) carrying a particular label material (e.g., the first label material) to obtain a probability density function with a log-normal distribution, and multiplies that function by the isotope distribution profile of the material to specify the template of the material. Next, it predicts the RT shift that occurs in the signal of another label material (e.g., the second label material) whose mass differs from that of the label material for which the PSM was recognized. It then specifies the template of the material carrying the other label material (e.g., the second material) from the template of the first material and the predicted RT shift. Finally, it reconstructs the signal of each material from the specified templates.

Accordingly, in this specification a template refers to the signal shape of a material at the mass-to-charge ratio (m/z) corresponding to that material, determined in advance with the isotope distribution profile of the material taken into account. The relative amount of each material in the sample can be calculated from the ratio of the actually measured mass spectrometry signal to the template of each material.

Specifically, a signal obtained by LC-MS/MS has an intensity as a function of the mass-to-charge ratio (m/z) and the RT value, and this intensity represents the summed abundance, at that RT, of the eluted ions having that m/z. The signal obtained for a particular eluted ion forms a bell-shaped peak along the RT axis over a limited RT interval at a particular m/z, and the area under this peak is proportional to the abundance of the ion. The process of reconstructing the signal of a label material according to the embodiments is described below.

Assume that the ions have the same charge state, and let P₁ denote the signal of the first material and P₂ the signal of the second material. Assume further that P₁ has an m/z of 1001 and an RT range of 3 to 6, and that P₂ has an m/z of 1005 and an RT range of 1 to 4. If the peptide has three isotopes in the ratio 1:r₁:r₂, the template matrix of the first material, which is the cluster signal of its isotopes, can be expressed by Equation 1 below, and the template matrix of the second material can be expressed by Equation 2 below.

[Equation 1]

[Equation 2]

In each matrix of Equations 1 and 2, the i-th row corresponds to the m/z value (i + 1000) and the j-th column corresponds to RT. In addition, x_t in Equation 1 and y_t in Equation 2 denote the monoisotopic signal intensity at the corresponding m/z and RT when the total intensity of the target signal is 1. The total signal, which contains the signals of both the first and second materials, can then be defined as in Equation 3 below, where q₁ and q₂ denote the total intensity of the signal having the corresponding m/z.

[Equation 3]

Through log-normal fitting using the PSM of a label material, the PSM analysis unit 42 reconstructs the signal of the material carrying that label material as a probability density function with a log-normal distribution, and multiplies the probability density function by the isotope distribution profile of the material to obtain each of the templates T₁ and T₂ of Equation 3. In this specification, a probability density function with a log-normal distribution is also referred to simply as a normal distribution curve.
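The layout of the template matrices described above can be sketched in Python as follows. The equation images are not reproduced in this text, so the sketch assumes that each template places the RT profile (x_t or y_t) on the monoisotopic row and scaled copies (r₁, r₂) on the next two rows, and that Equation 3 is the weighted sum q₁T₁ + q₂T₂. The function names and all numeric values are hypothetical.

```python
def build_template(mono_row, rt_profile, isotope_ratios, n_rows, n_cols):
    """Return an n_rows x n_cols template matrix as a list of lists.

    mono_row       : 1-based row of the monoisotopic peak (row i <-> m/z i + 1000)
    rt_profile     : {1-based RT column: monoisotopic intensity x_t or y_t}
    isotope_ratios : relative isotope intensities, e.g. (1, r1, r2)
    """
    template = [[0.0] * n_cols for _ in range(n_rows)]
    for k, ratio in enumerate(isotope_ratios):
        for rt, intensity in rt_profile.items():
            template[mono_row - 1 + k][rt - 1] = ratio * intensity
    return template


def mix(templates, weights):
    """Weighted sum of templates (the assumed form of Equation 3)."""
    n_rows, n_cols = len(templates[0]), len(templates[0][0])
    return [[sum(w * t[i][j] for w, t in zip(weights, templates))
             for j in range(n_cols)] for i in range(n_rows)]


# Hypothetical numbers, for illustration only.
ratios = (1.0, 0.5, 0.1)                  # 1 : r1 : r2
x = {3: 0.1, 4: 0.4, 5: 0.3, 6: 0.2}      # P1: m/z 1001, RT 3 to 6
y = {1: 0.2, 2: 0.5, 3: 0.2, 4: 0.1}      # P2: m/z 1005, RT 1 to 4
T1 = build_template(1, x, ratios, n_rows=7, n_cols=6)
T2 = build_template(5, y, ratios, n_rows=7, n_cols=6)
S = mix([T1, T2], weights=[2.0, 3.0])     # q1 = 2, q2 = 3
```

In this toy layout the two isotope clusters do not share any m/z row; when the mass difference between labels is smaller than the isotope envelope, rows of T₁ and T₂ overlap and the weights must be separated by fitting.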

FIG. 5A is a graph showing log-normal fitting of a mass spectrometry spectrum.

Referring to FIG. 5A, assume that the PSM to be analyzed consists of one or more graphs 500-506. For log-normal fitting, the PSM analysis unit 42 detects the apex of the signal intensity in graph 500, in which the PSM corresponding to the label material was detected. The apex is determined as the first inflection point of the signal intensity found by searching, from the RT at which the signal corresponding to the label material was identified, in the direction of increasing signal intensity. Performing log-normal fitting with this apex determines the component x_t of T₁ in Equations 1 and 3 and its RT range.

Specifically, let the input corresponding to graph 500 be the vector (z₀; z₁; ...; z_p; ...) whose components are the ion signal intensities at each position on the RT axis, and let p be the RT value at the intensity apex. The probability density function at position x on the RT axis can then be defined as lnN(x; μ, σ²), where μ is a location parameter that may be any positive number and σ is a scale parameter. Input values are extracted between e, the smallest RT with e > p and signal intensity z_e ≤ z_p/10, and s, the largest RT with s < p and signal intensity z_s ≥ z_p/100, and a function is defined for each RT value t between s and e as in Equation 4 below. Regardless of the value of μ, this function is a probability density function with a log-normal distribution that starts at s and has its apex at p; the value of μ determines the shape of the curve.

[Equation 4]

The PSM analysis unit 42 then varies μ and, for each value, computes the cosine between the vector (z_s; z_{s+1}; ...; z_{e-1}; z_e) and the vector (f_μ(z_s); f_μ(z_{s+1}); ...; f_μ(z_{e-1}); f_μ(z_e)) (shown as 510-513), and determines the value of μ that gives the largest cosine. The vector (f_μ(z_s); f_μ(z_{s+1}); ...; f_μ(z_{e-1}); f_μ(z_e)) evaluated at that μ is output as the normal distribution function corresponding to graph 500.
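The μ search described above can be sketched in Python as follows. Equation 4 is not reproduced in this text, so the sketch assumes one form that matches the stated properties: a log-normal density in (t − s) whose mode is pinned at p − s by setting σ² = μ − ln(p − s). This parameterization, the grid of candidate μ values, and all function names are assumptions for illustration.

```python
import math


def lognormal_pdf(x, mu, sigma2):
    """Density of a log-normal distribution with parameters mu and sigma^2."""
    if x <= 0:
        return 0.0
    return (math.exp(-(math.log(x) - mu) ** 2 / (2 * sigma2))
            / (x * math.sqrt(2 * math.pi * sigma2)))


def shifted_lognormal(t, s, p, mu):
    """Assumed form of f_mu: starts at s, apex at p for every admissible mu."""
    # The mode of lnN(mu, sigma2) is exp(mu - sigma2); pin it to p - s.
    sigma2 = mu - math.log(p - s)
    if sigma2 <= 0:
        raise ValueError("mu must exceed ln(p - s)")
    return lognormal_pdf(t - s, mu, sigma2)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fit_mu(times, intensities, s, p, mu_grid):
    """Return (mu, cosine) for the grid value whose curve best matches the data."""
    best = None
    for mu in mu_grid:
        if mu <= math.log(p - s):
            continue  # outside the admissible range of the assumed form
        model = [shifted_lognormal(t, s, p, mu) for t in times]
        score = cosine(intensities, model)
        if best is None or score > best[1]:
            best = (mu, score)
    return best


# Synthetic check: data generated with mu = 1.0 and an arbitrary scale factor.
s, p = 10.0, 12.0
times = [s + 0.1 * k for k in range(1, 61)]
observed = [5.0 * shifted_lognormal(t, s, p, mu=1.0) for t in times]
best_mu, best_cos = fit_mu(times, observed, s, p, mu_grid=[0.8, 0.9, 1.0, 1.1, 1.2])
```

Because the cosine ignores overall scale, the fit recovers only the curve shape; the amplitude is left to the later weight estimation.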

Log-normal fitting cannot be performed in this way, however, for the graphs 502 and 504 in which no label material was detected. For this reason, the PSM analysis unit 42 includes an RT shift prediction unit 421, which calculates a predicted value of the RT shift between the first label material and the second label material. The signal of the second label material is reconstructed as a normal distribution curve by moving the apex of the normal distribution curve 510 fitted to the first label material by the predicted RT shift. This process determines the component y_t of T₂ in Equations 2 and 3 and its RT range.

FIG. 5B is a graph showing the process of specifying the mass spectrometry spectrum of another label material using the predicted RT shift.

Referring to FIG. 5B, assume that the PSM analysis unit 42 has obtained the normal distribution function 600 corresponding to graph 500, and that the signal corresponding to the label material was obtained from graph 500. Because the mass difference between the label materials tells the PSM analysis unit 42 the mass-to-charge ratio (m/z) at which the signal of another label material will appear, it can determine the graphs 502 and 504 that correspond to the other label materials. The PSM analysis unit 42 also calculates the RT shift for each of the graphs 502 and 504 from one or more features extracted for RT shift prediction. Using the calculated RT shifts, the normal distribution curves 620 and 640 corresponding to graphs 502 and 504 can be obtained with the same fitting approach as before. The machine learning process by which the PSM analysis unit 42 predicts the RT shift is described later with reference to FIGS. 5 to 17.

FIGS. 6A to 6C are graphs showing a process of separating a mass spectrometry spectrum by label material and performing quantification according to an embodiment.

Referring to FIG. 6A, assume that, as described above with reference to FIG. 3D, a spectrum has been acquired using different label materials 110, 120, and 130, that the spectrum consists of graphs 301, 302 and graphs 323, 324, each corresponding to a different label material, and graphs 331-334, in which the signals of different label materials overlap, and that the signal corresponding to a label material has been identified from graph 301. Through the log-normal fitting process described above, the PSM analysis unit 42 can determine a domain 610 spanning the m/z range and RT range of the corresponding material. The m/z range is known from the isotope information, and the RT range is obtained from the log-normal fitting.

Next, using the RT shift predicted by the RT shift prediction unit 421, the PSM analysis unit 42 determines the domains 620 and 630 in which the signals of the other label materials, which differ in mass from the label material whose signal was detected, should lie, and detects signals in those domains. At this point, the signal detected in each of the domains 620 and 630 is still a superposition of the signals of different peptides, which overlap because of their isotopes.

Referring to FIG. 6B, to resolve this overlap, the PSM analysis unit 42 uses templates that reflect the isotope distribution to separate the superposed signal into signals 701-704, 801-804, and 901-904 corresponding to the different peptides. Specifically, the isotope separation unit 422 of the PSM analysis unit 42 calculates an isotope distribution profile from the ion composition of the labeled material. For example, suppose the ion composition is C₁₀O₃₀. Written as a vector whose n-th component is the abundance of the n-th isotope, the known isotope distribution of carbon (C) is (0.9893, 0.0107), and that of oxygen (O) is (0.99757, 0.00038, 0.00205).

To calculate the isotope distribution of a peptide with ion composition C₁₀O₃₀, the isotope separation unit 422 treats the peptide as equivalent to 40 atoms of a virtual element α. The isotope distribution profile of α is determined, as in Equation 5 below, by summing the isotope distribution of carbon (C) and that of oxygen (O) in proportion to the fraction of each element in the peptide.

[Equation 5]
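The proportional sum for the virtual element α can be sketched in Python as follows. The sketch assumes that Equation 5 weights each element's isotope vector by that element's share of the atoms (10/40 for carbon and 30/40 for oxygen); the function name is hypothetical, and the abundances are the ones quoted above.

```python
def virtual_element_profile(composition, element_profiles):
    """Atom-count-weighted average of per-element isotope distributions."""
    total_atoms = sum(composition.values())
    width = max(len(profile) for profile in element_profiles.values())
    alpha = [0.0] * width
    for element, count in composition.items():
        for k, abundance in enumerate(element_profiles[element]):
            alpha[k] += count * abundance / total_atoms
    return alpha, total_atoms


ELEMENT_PROFILES = {
    "C": (0.9893, 0.0107),
    "O": (0.99757, 0.00038, 0.00205),
}
alpha, n_atoms = virtual_element_profile({"C": 10, "O": 30}, ELEMENT_PROFILES)
# alpha is approximately (0.9955025, 0.00296, 0.0015375) for n_atoms = 40
```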

Next, the relative signal intensity of the n-th isotope of the peptide can be determined by a directed acyclic graph (DAG) algorithm that uses this isotope distribution profile. Specifically, when there are two kinds of isotopes, each is represented in the DAG by an edge, e₁ or e₂: for any n, e₁ connects the (n-1)-th node to the n-th node, and e₂ connects the (n-2)-th node to the n-th node. For a path consisting of l₁ edges e₁ and l₂ edges e₂, the strength of the path can be expressed by the probability mass function of Equation 6 below.

[Equation 6]

In Equation 6, m is the number of trials and p₁, p₂, and p₃ are the abundance probabilities of the respective isotopes. Substituting the result of Equation 5 and a trial count of 40 into Equation 6 gives the path strength defined by Equation 7 below.

[Equation 7]

Based on Equation 7, a DAG path search algorithm can collect all paths from the 0th node to the n-th node, and the signal intensity of the n-th isotope is obtained as the sum of the strengths of all paths collected in this way. The intensities are normalized so that the 0th intensity is 1.
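The path-sum described above can be sketched in Python as follows. Equations 6 and 7 are not reproduced in this text, so the sketch assumes the standard multinomial probability mass function with m = 40 trials, and it enumerates the edge counts (l₁, l₂) with l₁ + 2·l₂ = n instead of walking a graph explicitly. The probabilities are those of the virtual element α computed from the abundances quoted earlier, under the assumed proportional form of Equation 5.

```python
from math import comb


def path_strength(m, l1, l2, p):
    """Multinomial pmf for l1 one-step edges and l2 two-step edges in m trials."""
    l0 = m - l1 - l2
    return comb(m, l1) * comb(m - l1, l2) * p[0] ** l0 * p[1] ** l1 * p[2] ** l2


def isotope_envelope(m, p, n_max):
    """Relative intensity of isotope peaks 0..n_max, normalized to peak 0."""
    raw = []
    for n in range(n_max + 1):
        total = 0.0
        for l2 in range(n // 2 + 1):
            l1 = n - 2 * l2          # edge counts must add up to a mass shift of n
            if l1 + l2 <= m:
                total += path_strength(m, l1, l2, p)
        raw.append(total)
    return [value / raw[0] for value in raw]


# Virtual element alpha for C10O30 (assumed form of Equation 5).
p_alpha = (0.9955025, 0.00296, 0.0015375)
envelope = isotope_envelope(m=40, p=p_alpha, n_max=4)
```

Summing over edge counts is equivalent to summing over all ordered paths, because the multinomial coefficient already counts the orderings.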

Referring to FIG. 6C, the reconstruction unit 423 of the PSM analysis unit 42 reconstructs the signal of each material using the template specified for that material, which is obtained through the process described above from the RT shift predicted by the RT shift prediction unit 421 and the isotope intensity distribution calculated by the isotope separation unit 422. Specifically, based on Equation 3, the reconstruction unit 423 calculates the weights q₁ and q₂ of the templates that minimize the error between the measured signal intensity and the signal intensity computed from the prediction.

For example, given the measured signal intensity and the template signal matrices to which the predicted values have been applied, the reconstruction unit 423 can calculate the values of q₁ and q₂ that minimize Equation 8 below. The calculation for obtaining q₁ and q₂ will be readily understood by a person skilled in the art, so a detailed description is omitted.

[Equation 8]
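One way to carry out this minimization is ordinary least squares, sketched below in Python. The sketch assumes that Equation 8 is the sum of squared differences between the measured matrix and q₁T₁ + q₂T₂, and solves the resulting 2×2 normal equations directly. The function name and the toy matrices are hypothetical, and non-negativity of the weights is not enforced.

```python
def fit_weights(measured, template1, template2):
    """Least-squares weights (q1, q2) for measured ~ q1*template1 + q2*template2."""
    s = [v for row in measured for v in row]
    t1 = [v for row in template1 for v in row]
    t2 = [v for row in template2 for v in row]
    a11 = sum(a * a for a in t1)
    a12 = sum(a * b for a, b in zip(t1, t2))
    a22 = sum(b * b for b in t2)
    b1 = sum(a * v for a, v in zip(t1, s))
    b2 = sum(b * v for b, v in zip(t2, s))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("templates are collinear; weights are not identifiable")
    q1 = (b1 * a22 - b2 * a12) / det
    q2 = (a11 * b2 - a12 * b1) / det
    return q1, q2


# Toy templates that overlap in both m/z (rows) and RT (columns).
T1 = [[0.2, 0.5, 0.3, 0.0], [0.1, 0.25, 0.15, 0.0]]
T2 = [[0.0, 0.3, 0.5, 0.2], [0.0, 0.15, 0.25, 0.1]]
S = [[2.0 * a + 3.0 * b for a, b in zip(r1, r2)] for r1, r2 in zip(T1, T2)]
q1, q2 = fit_weights(S, T1, T2)
```

With noisy data or strongly overlapping templates, a constrained solver (for example non-negative least squares) may be a better choice than the closed form shown here.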

In one embodiment, the reconstruction unit 423 may take into account, when reconstructing the signal, the noise caused by other co-eluted ions. If C_j denotes a matrix representing the shape and position of the noise and w_j denotes the amount of noise, the signal including noise can be expressed as in Equation 9 below, where L is the number of label materials.

[Equation 9]

The noise components of Equation 9 can be obtained by log-normal fitting at the signal intensity apexes found outside the signals of the labeled peptides, and are normalized so that their maximum value is 1.

When the reconstruction unit 423 takes noise into account, the difference between the measured signal and the reconstructed signal can be defined by Equation 10 below.

[Equation 10]

The peptide signal intensity excluding noise can then be defined as in Equation 11 below, and the signal-to-noise ratio (SNR) is defined as in Equation 12 below. For accuracy, the reconstruction unit 423 may exclude from the analysis any PSM whose SNR is below a preset threshold. In addition, when several PSMs share the same domain, the reconstruction unit 423 may consider only the PSM with the lowest spectrum E-value and/or the PSM with the lowest peptide-level q-value.

[Equation 11]

[Equation 12]

The protein analysis unit 43 performs protein-level analysis using the relative amounts (e.g., q₁ and q₂) of the labeled peptides obtained by the PSM analysis unit 42. A PSM may match a single protein uniquely, or it may be a sequence shared by several proteins, in which case the protein it originates from is ambiguous. To resolve this ambiguity, the protein analysis unit 43 selects PSMs on the assumption that the label ratio of a protein in the sample and the ratio of the PSMs matched to it are consistent.

Specifically, let the PSMs and the proteins be the node sets U and V of a bipartite graph, and let E be the set of edges between nodes. In the graph G(U, V, E), an edge (u, v) between a node u in U and a node v in V exists when the PSM matches the protein. The signal intensity of node u is given by the quantitative analysis of the PSM analysis unit 42, as in Equation 13 below.

[Equation 13]

Let U(v) be the set of PSM nodes connected to protein node v, and let Q^v be the sum of the vectors q^u over U(v). If the ratio of a protein and the ratio of the PSMs matched to it are consistent, q^u and Q^v should point in similar directions. The protein analysis unit 43 selects edges so that this directional similarity holds. It computes the cosine between the vector q^u and the vector Q^v - q^u, which measures how closely the directions of q^u and Q^v agree.

For each U(v) of size 1 or more, the protein analysis unit 43 updates U(v) by removing the edges whose cosine, calculated as above, is below a preset threshold (e.g., 0.8), and then updates Q^v accordingly. Repeating this for every protein node v excludes the nodes of PSMs that are inaccurate or are outliers. This procedure may be called per-protein pruning.

Next, the protein analysis unit 43 analyzes each PSM node u. If node u is connected to several protein nodes v, the corresponding PSM is shared by several proteins. In that case, the protein analysis unit 43 keeps only the edge to the node v for which the cosine between the vector q^u and the vector Q^v is largest. This is done for every node u. This procedure may be called per-peptide pruning.

The per-protein and per-peptide pruning procedures described above may be repeated alternately several times (e.g., 10 times). Afterwards, protein nodes with fewer matched PSMs than a preset threshold (e.g., 2) may be excluded. The vectors Q^v of the remaining protein nodes v are the final protein-level quantification result. This process yields the proportions in which the components of the proteins used as the sample are present.
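The alternating pruning described above can be sketched in Python as follows. The data structures (a dictionary from protein to its set of PSM identifiers, and a dictionary from PSM to its quantity vector), the handling of a protein with a single PSM (kept, because the remainder vector is zero), and the example values are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: an undefined cosine (zero vector) is treated as "consistent".
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 1.0


def protein_total(psms, q):
    """Q^v: component-wise sum of q^u over the PSMs attached to one protein."""
    dim = len(next(iter(q.values())))
    total = [0.0] * dim
    for u in psms:
        total = [t + x for t, x in zip(total, q[u])]
    return total


def per_protein_pruning(edges, q, threshold=0.8):
    for v, psms in edges.items():
        total = protein_total(psms, q)
        keep = set()
        for u in psms:
            rest = [t - x for t, x in zip(total, q[u])]   # Q^v - q^u
            if cosine(q[u], rest) >= threshold:
                keep.add(u)
        edges[v] = keep


def per_peptide_pruning(edges, q):
    totals = {v: protein_total(psms, q) for v, psms in edges.items()}
    for u in q:
        owners = [v for v, psms in edges.items() if u in psms]
        if len(owners) > 1:
            best = max(owners, key=lambda v: cosine(q[u], totals[v]))
            for v in owners:
                if v != best:
                    edges[v].discard(u)


def prune(edges, q, rounds=10, min_psms=2):
    edges = {v: set(psms) for v, psms in edges.items()}
    for _ in range(rounds):
        per_protein_pruning(edges, q)
        per_peptide_pruning(edges, q)
    edges = {v: psms for v, psms in edges.items() if len(psms) >= min_psms}
    totals = {v: protein_total(psms, q) for v, psms in edges.items()}
    return edges, totals


# Hypothetical two-label quantities: protein A is about 1:2, protein B about 10:1.
q = {"u1": [1, 2], "u2": [2, 4], "u3": [3, 6], "u4": [3, 0.3],
     "u5": [10, 1], "u6": [5, 0.6], "u7": [1, 0.6]}
edges = {"A": {"u1", "u2", "u3", "u4", "u7"}, "B": {"u4", "u5", "u6", "u7"}}
kept, totals = prune(edges, q)
```

In this example the outlier u4 is dropped from protein A by per-protein pruning, and the shared PSM u7 is assigned to protein B by per-peptide pruning.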

The following describes, as part of the detailed operation of the RT shift prediction unit 421 of the PSM analysis unit 42, the process of using machine learning to determine one or more features that correlate strongly with the RT shift so that the RT shift can be predicted.

FIG. 7 is a graph showing the RT normalization process.

Referring to FIG. 7, to avoid run-to-run error, LC-MS/MS spectra of the target material are acquired in multiple measurements using several label materials of different masses, and the absolute RT values measured from them can be plotted as the frequency of occurrence at each time between 0 and 125 minutes. If RT₂₀ and RT₈₀ denote the 20th and 80th RT values counted from the smallest absolute value upward, and the time interval between them is called the RT span, the normalized RT value can be defined as [(RT - RT₂₀) / RT span].

FIG. 8 is a graph showing normalized RT.

Referring to FIG. 8, in the inventors' experiment RT₂₀ was 27 minutes and the RT span was 62 minutes. At the signal point of the label material, shown as a dot, the RT was 94 minutes, giving a normalized RT of 1.08. The peak width in the plot of signal intensity against RT was 1 minute, which gives a normalized peak width of 0.01613 when divided by the RT span, and the RT shift between label materials was 0.4 minutes, which gives 0.0065 when divided by the RT span.
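The normalization arithmetic above can be checked with a short Python sketch. The helper that derives RT₂₀ and the RT span from a list of measurements reads "20th" and "80th" as nearest-rank percentiles, which is an interpretation and not something the text states; the function names are hypothetical.

```python
import math


def rt_reference(rts, low=0.20, high=0.80):
    """Return (rt_low, rt_span) using nearest-rank percentiles (an assumption)."""
    ordered = sorted(rts)
    n = len(ordered)
    rt_low = ordered[max(math.ceil(low * n) - 1, 0)]
    rt_high = ordered[max(math.ceil(high * n) - 1, 0)]
    return rt_low, rt_high - rt_low


def normalize_rt(rt, rt_low, rt_span):
    """Normalized RT as defined above: (RT - RT20) / RT span."""
    return (rt - rt_low) / rt_span


def normalize_width(width, rt_span):
    """Durations such as peak widths or RT shifts are only divided by the span."""
    return width / rt_span


# Values reported for FIG. 8: RT20 = 27 min, RT span = 62 min.
norm_rt = normalize_rt(94.0, 27.0, 62.0)      # about 1.08
norm_peak_width = normalize_width(1.0, 62.0)  # about 0.01613
norm_rt_shift = normalize_width(0.4, 62.0)    # about 0.0065
```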

FIG. 9 is a graph showing the trend of the RT shift in repeated experiments with the same peptide and the same label material.

Referring to FIG. 9, across repeated runs the correlation coefficient between the results of the runs was 0.803690, and the root-mean-square error (RMSE) was 0.001312, less than one tenth of the peak width. This means that the RT shift follows a consistent trend across repeated runs, and suggests that the RT shift of one sample can be predicted from the RT shift of another.
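For readers who want to reproduce this kind of run-to-run comparison, the two statistics can be sketched as follows. The shift values are hypothetical placeholders, not data from the specification.

```python
import math


def rmse(a, b):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical normalized RT shifts of the same peptides in two repeated runs
run_1 = [0.0050, 0.0062, 0.0071, 0.0044, 0.0080]
run_2 = [0.0053, 0.0060, 0.0075, 0.0040, 0.0078]

print(rmse(run_1, run_2), pearson(run_1, run_2))
```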

With this in mind, the inventors carried out various experiments, based on the normalized RT, to extract features that affect the RT shift. Using 22,000 PSMs (also referred to as mass spectrometry spectra) obtained from three LC-MS/MS runs as a training set, they tested a range of candidate features and selected those that correlate linearly with the RT shift. In one embodiment, the selected features include the number of deuterium (D) atoms substituted in another label substance relative to the peptide bearing the base label substance, and the normalized RT of the ion signal corresponding to the target substance.

In one embodiment, the selected features further include the peptide sequence length of the target substance, the ratio of labeled sites in the target substance, and/or the normalized RT width of the ion signal corresponding to the target substance. The ratio of labeled sites is the number of positions in the peptide to which a molecule of the label substance is attached, divided by the length of the peptide. The molecule of the label substance may be an amino acid such as lysine, but is not limited to this; other amino acids such as arginine or tyrosine may also be used.
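As an illustration of how the five features might be assembled for one PSM, here is a minimal sketch. It assumes that labeled sites are simply the lysine (K) residues of the sequence, which is a simplification made for the example; the function names, dictionary keys, and the example sequence are hypothetical.

```python
def labeled_site_ratio(peptide, labeled_residues="K"):
    """Number of labeled positions divided by the peptide length.

    Assumption: a position counts as labeled if its residue is listed in
    `labeled_residues` (lysine only by default).
    """
    if not peptide:
        raise ValueError("peptide sequence must not be empty")
    n_labeled = sum(1 for residue in peptide if residue in labeled_residues)
    return n_labeled / len(peptide)


def extract_features(peptide, n_deuterium, norm_rt, norm_peak_width):
    """Collect the five features named in the text into one record."""
    return {
        "n_deuterium": n_deuterium,
        "normalized_rt": norm_rt,
        "peptide_length": len(peptide),
        "labeled_site_ratio": labeled_site_ratio(peptide),
        "normalized_peak_width": norm_peak_width,
    }


# Hypothetical PSM: a 10-residue peptide with one lysine
features = extract_features("LVNELTEFAK", n_deuterium=4,
                            norm_rt=1.08, norm_peak_width=0.016)
print(features)
```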

FIG. 10 is a graph showing the RT shift against the number of deuterium atoms substituted in the label substance, one of the features used for machine learning. As shown in FIG. 10, the normalized RT shift appeared to be related to the number of deuterium (D) atoms substituted in the label substance, and this feature had the highest distance correlation, 0.57. In this specification, distance correlation is used as a similarity measure that covers both linear and nonlinear association between a selected feature and the RT shift, and was computed by the method disclosed in Gabor J. Szekely et al., "Measuring and testing dependence by correlation of distances" (The Annals of Statistics, 35(6):2769-2794).
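For one-dimensional samples, the sample distance correlation can be sketched in a few lines. This is a plain O(n²) illustration of the standard double-centering construction, not the inventors' code, and the data in the example are hypothetical.

```python
import math


def _centered_distances(values):
    """Pairwise |a - b| distances, double-centered by row, column and grand means."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_mean = [sum(row) / n for row in dist]
    grand_mean = sum(row_mean) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [[dist[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation of two equally long 1-D sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    dcov2 = sum(a[i][j] * b[i][j] for i, j in pairs) / n ** 2
    dvar2_x = sum(a[i][j] ** 2 for i, j in pairs) / n ** 2
    dvar2_y = sum(b[i][j] ** 2 for i, j in pairs) / n ** 2
    denom = math.sqrt(dvar2_x * dvar2_y)
    if denom == 0.0:
        return 0.0  # one of the variables is constant
    return math.sqrt(max(dcov2, 0.0) / denom)


# A purely nonlinear relation still gives a non-zero distance correlation.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(distance_correlation(x, [v * v for v in x]))
```

Unlike the Pearson coefficient, which is zero for the symmetric quadratic example above, the distance correlation is positive, which is why it suits features whose relation to the RT shift need not be linear.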

FIG. 11 is a graph showing the RT shift against the normalized RT, another feature used for machine learning. Referring to FIG. 11, the RT shift correlates with the normalized RT and tends to decrease near the start and the end of the liquid chromatography run. For visual clarity, the number of substituted deuterium atoms in the label substance was set to 20; the same applies to FIGS. 12 to 15 described below. The distance correlation between the normalized RT and the RT shift was 0.33.

FIG. 12 is a graph showing the RT shift against the peptide sequence length, another feature used for machine learning. Referring to FIG. 12, the longer the peptide corresponding to the target substance, the smaller the RT shift. This can be understood as deuterium (D) substitution having a relatively smaller effect on a longer peptide. The distance correlation between the peptide sequence length and the RT shift was 0.43.

FIG. 13 is a graph showing the RT shift against the ratio of labeled amino acid (lysine), another feature used for machine learning. Referring to FIG. 13, the higher the lysine ratio, the larger the RT shift. The distance correlation between the lysine ratio and the RT shift was 0.40.

FIG. 14 is a graph showing the RT shift against the peak width of the mass spectrometry spectrum, another feature used for machine learning. In this specification, the peak width is defined as follows. On the normal distribution curve corresponding to each mass-to-charge ratio (m/z), if the signal intensity is at least 1/100 of the curve's maximum intensity within a particular RT interval and below 1/100 of that maximum everywhere outside it, that RT interval is defined as the peak width. The distance correlation between the peak width and the RT shift was 0.25.
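The peak-width definition can be sketched as follows, assuming a sampled elution curve with a single contiguous peak. The function name and the Gaussian test curve are hypothetical.

```python
import math


def peak_width(rt, intensity, fraction=0.01):
    """Length of the RT interval where intensity >= fraction * maximum intensity.

    Assumption: the curve has one contiguous peak, so the interval is bounded
    by the first and last sample at or above the threshold.
    """
    threshold = fraction * max(intensity)
    above = [t for t, s in zip(rt, intensity) if s >= threshold]
    return max(above) - min(above)


# Hypothetical Gaussian elution curve centered at 94 min with sigma = 0.2 min
rt_grid = [93.0 + 0.01 * i for i in range(201)]
curve = [math.exp(-0.5 * ((t - 94.0) / 0.2) ** 2) for t in rt_grid]

print(round(peak_width(rt_grid, curve), 2))
```

For a Gaussian, the 1/100 threshold is reached at about 3.03 standard deviations from the center, so this definition yields a width of roughly six standard deviations.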

The RT shift prediction unit 421 computes a predicted RT shift by running a machine learning algorithm that takes as input the features correlated with the RT shift, as shown in FIGS. 10 to 14.

Specifically, the feature extraction unit 4211 of the RT shift prediction unit 421 extracts the feature values described above from a training set of mass spectrometry spectra for which the protein corresponding to the target substance, the types of peptides used as labels, the isotope mass differences of the target substance, and the like are known. Next, the machine learning unit 4212 of the RT shift prediction unit 421 runs a machine learning algorithm that takes the feature values and the RT shifts measured from the training set as input, and generates a rule defining the relationship between each feature value and the RT shift. Applying this rule to a sample outside the training set yields the predicted RT shift in that sample's mass spectrometry spectrum.

In the embodiments, the form of the rule defining the relationship between the feature values and the RT shift depends on the machine learning algorithm and cannot be defined uniformly. The machine learning unit 4212 may apply a linear regression algorithm, a k-nearest neighbors (KNN) algorithm, a radius neighbors algorithm, a neural network algorithm, a decision tree algorithm, a kernel ridge regression (KRR) algorithm, a support vector regression (SVR) algorithm, or another suitable machine learning algorithm.

Each feature value may be supplied to the algorithm in rescaled or standardized form. For rescaling, a feature value x is converted into a rescaled feature value x' according to Equation 14 below, so that the minimum becomes -1 and the maximum becomes 1.

[Equation 14]
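The equation itself is not reproduced in this text. A form consistent with the stated range (minimum mapped to -1, maximum mapped to 1) would be the following, where x_min and x_max denote the smallest and largest values of the feature; these two symbols are assumptions introduced here:

```latex
x' = \frac{2\,(x - x_{\min})}{x_{\max} - x_{\min}} - 1
```

Substituting x = x_min gives x' = -1 and x = x_max gives x' = 1, as required.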

For standardization, each feature value is converted into a standardized feature value x' according to Equation 15 below, so that the feature has a mean of 0 and a standard deviation of 1. In Equation 15, σ is the standard deviation of the population before standardization.

[Equation 15]
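The equation itself is not reproduced in this text. A form consistent with the stated properties (mean 0, standard deviation 1) would be the following, where σ is the standard deviation named in the text and μ, an assumed symbol, is the mean of the feature before standardization:

```latex
x' = \frac{x - \mu}{\sigma}
```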

FIG. 15 is a graph showing the error of various regression models. Referring to FIG. 15, machine learning was performed using all five features described above with reference to FIGS. 10 to 14, and for each regression model the RMSE between the predicted and the measured RT shift was computed by 10-fold cross validation. Of these, the SVR algorithm with standardized feature values gave the best result, with an RMSE of 0.0016.
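The evaluation procedure can be sketched as below. To keep the example dependency-free, a one-feature least-squares line stands in for the regression model; it is not the SVR model of the specification, and the data are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Stand-in model: ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def cross_validated_rmse(xs, ys, fit, k=10):
    """k-fold cross validation: every sample is predicted by a model that never saw it."""
    n = len(xs)
    squared_errors = []
    for fold in range(k):
        held_out = set(range(fold, n, k))  # deterministic folds, no shuffling
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        predict = fit(train_x, train_y)
        squared_errors.extend((predict(xs[i]) - ys[i]) ** 2 for i in held_out)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical data: normalized RT shift growing linearly with deuterium count
n_deuterium = [4 * (i % 5 + 1) for i in range(50)]
shift = [0.0003 * d + 0.0001 for d in n_deuterium]

print(cross_validated_rmse(n_deuterium, shift, fit_line))
```

Any of the regression models listed in the text could be passed as `fit`, provided it returns a callable that predicts from a feature value.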

In the SVR algorithm, assuming the original data have variables x and y, a model of the form y = wx + b is built while ignoring the magnitude of the error on most data points and imposing only the constraint that all data fall within a tolerance (denoted, for example, ε). The values of w and b that define y = wx + b are then computed so that all data lie within the tolerance and the width (m) between the data points closest to the model is maximal, which expresses the relationship between x and y as a linear combination. For example, a model such as that of FIG. 16A is judged to describe the relationship between the variables well, whereas a model such as that of FIG. 16B is judged not to. In this embodiment, this approach can be applied to define the relation between the feature values and the RT shift and to perform machine learning.
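The tolerance ε can be made concrete with the ε-insensitive loss used in the common textbook formulation of SVR. The sketch below assumes that formulation, including the regularization weight `c`, which the specification does not mention.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Errors smaller than epsilon cost nothing; larger errors cost the excess."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def svr_objective(w, b, xs, ys, epsilon, c=1.0):
    """Textbook linear SVR objective for y = w*x + b (one feature).

    0.5 * w^2 favors a flat model; the second term penalizes points that
    fall outside the epsilon tolerance band. `c` is an assumed trade-off weight.
    """
    penalty = sum(epsilon_insensitive_loss(y, w * x + b, epsilon)
                  for x, y in zip(xs, ys))
    return 0.5 * w * w + c * penalty


# Hypothetical points lying within 0.1 of the line y = 2x
xs = [0.0, 1.0, 2.0]
ys = [0.05, 1.95, 4.05]
print(svr_objective(2.0, 0.0, xs, ys, epsilon=0.1))  # no point leaves the band
```

Training then amounts to finding the w and b that minimize this objective; a library implementation would normally be used for that step.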

Meanwhile, in one embodiment, machine learning may be performed with an SVR algorithm to which a kernel function is applied, in order to approximate linearly the relationship between nonlinearly related variables. When a kernel function known as the Gaussian radial basis function is applied to the original data before machine learning with the SVR algorithm, the dimensionality of the data is raised, so that complex relationships between the variables can also be captured by the model.
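A minimal sketch of the Gaussian radial basis function kernel follows. The width parameter `gamma` and the feature vectors are assumptions chosen for illustration.

```python
import math


def rbf_kernel(u, v, gamma=1.0):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def gram_matrix(samples, gamma=1.0):
    """Kernel value for every pair of samples; this matrix is what a kernel method works with."""
    return [[rbf_kernel(u, v, gamma) for v in samples] for u in samples]


# Hypothetical standardized feature vectors for three PSMs
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
for row in gram_matrix(samples, gamma=0.5):
    print([round(value, 4) for value in row])
```

The kernel equals 1 for identical vectors and decays toward 0 as they move apart, so the model effectively compares each new PSM with nearby training PSMs.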

The specific operations of the SVR algorithm are readily understood by a person of ordinary skill in the art to which the invention pertains, so a detailed description is omitted here to keep the gist of the invention clear.

FIGS. 17A and 17B are graphs showing performance when SVR is used as the regression model. Referring to FIG. 17A, with a mean normalized peak width of 0.016, the RMSE of the predicted RT shift was only 0.0016, one tenth of that width, which confirms the accuracy of the RT shift prediction. Referring to FIG. 17B, the correlation coefficient between the predicted and the measured RT shift was 0.76242, indicating a high correlation.

The inventors also compared RT shift predictions from machine learning using each of the five features described above with reference to FIGS. 10 to 14 individually against predictions using all five features together; the results are given in Table 1 below. As Table 1 shows, the difference in RMSE between using the features individually and using them together was relatively small. For the correlation coefficient, however, machine learning with all features together gave a substantially higher correlation.

Prediction type | Feature used | RMSE | Correlation coefficient
Single feature | Number of deuterium (D) atoms | 2.2×10^-3 | 0.47
Single feature | Normalized RT | 2.4×10^-3 | 0.24
Single feature | Peptide sequence length | 2.4×10^-3 | 0.21
Single feature | Ratio of labeled amino acid | 2.3×10^-3 | 0.33
Single feature | Peak width | 2.4×10^-3 | 0.32
All features | All of the above | 1.6×10^-3 | 0.77

The results in Table 1 indicate that the RT shift in a mass spectrometry spectrum is small relative to the full width of the spectrum, and that for spectra in which the RT shift is large, the RT shift can be predicted accurately through machine learning using the five features according to this embodiment.

With the quantitative sample analysis apparatus according to the embodiments of the invention described above, the RT shift caused by substituting hydrogen (H) in the label substance with deuterium (²H, or D) can be predicted efficiently by applying machine learning. Moreover, even when label substances with closely spaced masses are used, with a mass difference of, for example, 4 daltons (Da) or even 2 Da, the signal of each label substance can be separated accurately, so the multiplexity of quantitative measurement can be improved without expensive high-resolution equipment.

FIG. 18 shows exemplary peptides that can be used as label substances with the quantitative sample analysis method according to one embodiment.

Referring to FIG. 18, as shown at the top, a peptide having the structure H₂N-peptide-K-NH₂ is reacted with sodium cyanoborohydride (NaBH₃CN) and an aldehyde to give a peptide structure whose termini each carry two ethyl groups; this structure serves as the base label substance. By progressively replacing some hydrogen (H) atoms in the reagents with deuterium (D), five additional peptide structures for label substances are obtained, with masses increasing from the base label substance in steps of 4 daltons (Da). With the peptides shown in FIG. 18, measurement with six label substances at once (6-plexing) is possible. This is only an example: by changing the reagents to increase the number of peptide types, measurement with 8 or 10 label substances is also possible, and the number of usable peptides may be any number of one or more and is not limited to a particular number.

FIGS. 19A to 19H are graphs comparing quantification results according to one embodiment with theoretical values, for various ratios of labeled proteins.

FIGS. 19A to 19H show quantitative analysis results for samples in which the six peptide structures of FIG. 18 were used as label substances. The ratio of proteins tagged with the respective label substances is 1:1:1:1:1:1 in FIG. 19A, 1:1.2:1:1.2:1:1.2 in FIG. 19B, 1:4:1:4:1:4 in FIG. 19C, 1:8:1:8:1:8 in FIG. 19D, 1:10:1:10:1:10 in FIG. 19E, 1:20:1:20:1:20 in FIG. 19F, and 1:50:1:50:1:50 in FIG. 19H.

Each graph shows the ratio of every other label substance's signal to the signal of label substance No. 2 as a log2 fold change, and the dotted lines mark the theoretical signal positions. As shown, the quantification results of this embodiment track the theoretical values closely even as the ratio of proteins tagged with each label substance varies widely.
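The theoretical positions marked by the dotted lines follow directly from the mixing ratios. A minimal sketch, using the 1:4:1:4:1:4 ratio of FIG. 19C and a hypothetical function name:

```python
import math


def log2_fold_changes(signals, reference_index=1):
    """log2 of each signal relative to the reference label (label No. 2 is index 1)."""
    reference = signals[reference_index]
    return [math.log2(signal / reference) for signal in signals]


# Theoretical signals for the 1:4:1:4:1:4 mixture of FIG. 19C
theoretical = log2_fold_changes([1, 4, 1, 4, 1, 4])
print(theoretical)
# prints: [-2.0, 0.0, -2.0, 0.0, -2.0, 0.0]
```

The same function applied to measured label intensities gives the values that are compared against these theoretical positions.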

FIGS. 20A and 20B are graphs comparing quantification results according to one embodiment with those of the prior art, for different ratios of labeled proteins.

In FIGS. 20A and 20B, the prior art is represented by MaxQuant, a quantification tool based on LC-MS/MS spectra. Because MaxQuant can measure at most three label substances simultaneously, three peptides whose masses increase in steps of 4 daltons (Da), prepared as described above with reference to FIG. 18, were used as label substances. The setup was arranged so that signals did not overlap due to isotopes. As in FIGS. 19A to 19H, each graph in FIGS. 20A and 20B shows the ratio of every other label substance's signal to the signal of label substance No. 2 as a log2 fold change, and the dotted lines mark the theoretical signal positions.

FIG. 20A shows the quantification results when the ratio of proteins tagged with the three label substances is 1:1:1, and FIG. 20B shows the results when the ratio is 1:10:20. As shown, the prior art gives relatively accurate results at 1:1:1, but its error from the theoretical values grows as the ratio increases to 1:10:20, whereas the results of this embodiment continue to track the theoretical values as the ratio increases.

FIG. 21 is a flowchart of a quantitative sample analysis method according to one embodiment.

Referring to FIG. 21, the LC-MS/MS spectra of a sample are received (S1), and the spectra to be analyzed are identified among the received spectra (S2). Identifying the spectra to be analyzed includes recognizing PSMs from the spectra, and may be done with, for example, MS-GF+, which matches MS/MS spectra against peptides in a protein sequence database, but is not limited to this.

Next, the signal corresponding to the first label substance is extracted from the recognized PSM (S3), and a template corresponding to the signal of the peptide that is the first substance, tagged with that label substance, is specified through log-normal fitting of the extracted signal (S4). The template may reflect the known isotope distribution profile of the first substance. That is, the template of a substance is obtained by multiplying the normal distribution function fitted to the recognized PSM by the isotope distribution profile of that substance.
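One way to picture the multiplication described above is as a two-dimensional template over RT and isotope peak index. The sketch below assumes a Gaussian elution curve whose center and width have already been fitted, and uses a hypothetical isotope profile; none of the numbers come from the specification.

```python
import math


def gaussian(t, center, sigma):
    """Unit-height Gaussian elution curve evaluated at retention time t."""
    return math.exp(-0.5 * ((t - center) / sigma) ** 2)


def build_template(rt_grid, center, sigma, isotope_profile):
    """template[i][k] = elution curve at rt_grid[i] * relative abundance of isotope peak k."""
    return [[gaussian(t, center, sigma) * abundance for abundance in isotope_profile]
            for t in rt_grid]


# Hypothetical fitted parameters and isotope profile (monoisotopic, +1, +2 peaks)
rt_grid = [93.0 + 0.5 * i for i in range(5)]  # 93.0 ... 95.0 min
isotope_profile = [0.55, 0.30, 0.15]
template = build_template(rt_grid, center=94.0, sigma=0.5,
                          isotope_profile=isotope_profile)

for row in template:
    print([round(value, 3) for value in row])
```

At the apex of the elution curve the template row reproduces the isotope profile, and away from the apex every isotope peak is attenuated by the same factor.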

In addition, based on the known mass difference between the first label substance and the second label substance, a predicted value of the RT shift between them is computed. Computing the predicted value includes extracting one or more features for machine learning from the mass spectrometry spectrum (S5) and applying the extracted features to a machine learning algorithm such as SVR to predict the RT shift (S6).

Next, the template of the second substance, tagged with the second label substance, is specified using the reconstructed PSM of the first substance and the predicted RT shift (S7).

Next, an analysis that reconstructs the signals of the first substance and the second substance using the specified templates of the first substance and the second substance is performed, yielding a PSM-level quantification result for the sample (S8). This may include computing the relative amounts of the first substance and the second substance by comparing the measured signal with each template. Furthermore, the analysis results for one or more PSMs matched to the same protein may be combined to perform protein-level quantification of the sample (S9).
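The comparison of the measured signal with the two templates can be illustrated as a least-squares fit. This is only a sketch under the assumption that the measured signal is a weighted sum of the two templates; the specification does not state which fitting method is used, and the template values are hypothetical.

```python
def dot(u, v):
    """Inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))


def fit_two_templates(observed, template_1, template_2):
    """Least-squares amounts (a, b) such that observed ~ a*template_1 + b*template_2.

    Solves the 2x2 normal equations directly. Assumption: templates are
    flattened to 1-D and are not proportional to each other.
    """
    a11, a12, a22 = dot(template_1, template_1), dot(template_1, template_2), dot(template_2, template_2)
    b1, b2 = dot(template_1, observed), dot(template_2, observed)
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("templates are too similar to separate")
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det


# Hypothetical, partially overlapping templates and a mixed signal
template_1 = [1.0, 0.6, 0.2, 0.0, 0.0]
template_2 = [0.0, 0.3, 1.0, 0.6, 0.2]
observed = [2.0 * p + 5.0 * q for p, q in zip(template_1, template_2)]

amount_1, amount_2 = fit_two_templates(observed, template_1, template_2)
print(round(amount_1, 6), round(amount_2, 6), round(amount_2 / amount_1, 6))
```

Even though the templates overlap, the fit recovers the two amounts separately, which is the property that allows closely spaced labels to be quantified.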

The operations of the quantitative sample analysis method according to the embodiments described above may be implemented at least in part as a computer program and recorded on a computer-readable recording medium. A computer-readable recording medium on which a program implementing these operations is recorded includes every kind of recording device that stores data readable by a computer. Examples of computer-readable recording media include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. The computer-readable recording medium may also be distributed over computer systems connected by a network, so that computer-readable code is stored and executed in a distributed manner. The functional programs, code, and code segments for implementing this embodiment will be readily understood by a person of ordinary skill in the art to which this embodiment pertains.

Although the invention has been described above with reference to embodiments, a person skilled in the art will understand that it can be modified and changed in various ways without departing from the spirit and scope of the invention as set out in the claims below.

Claims

1. A quantitative sample analysis apparatus comprising:
a spectrum recognition unit configured to receive a liquid chromatography-mass spectrometry spectrum of a sample including a first substance to which a first label substance is attached and a second substance to which a second label substance having a mass different from that of the first label substance is attached; and
a peptide spectrum match analysis unit configured to extract a peptide spectrum match corresponding to the first label substance from the mass spectrometry spectrum received by the spectrum recognition unit, specify a template of the first substance using the peptide spectrum match of the first label substance, calculate a predicted retention time shift between the first label substance and the second label substance, specify a template of the second substance using the predicted retention time shift, and perform peptide-spectrum-match-level quantitative analysis of the sample by reconstructing the signals of the first substance and the second substance using the templates of the first substance and the second substance.

2. The apparatus of claim 1, further comprising a protein analysis unit configured to perform protein-level quantitative analysis of the sample using the peptide-spectrum-match-level quantitative analysis results for one or more substances.

3. The apparatus of claim 1,
wherein the first label substance is a peptide comprising carbon, nitrogen, and hydrogen, and
wherein the second label substance is a peptide in which one or more hydrogen atoms of the first label substance are replaced by deuterium.

4. The apparatus of claim 3,
wherein the second label substance comprises five or more peptides whose masses increase sequentially from the mass of the first label substance according to the number of substituted deuterium atoms.

5. The apparatus of claim 1,
wherein the peptide spectrum match analysis unit includes a retention time shift prediction unit that calculates the predicted retention time shift, and
the retention time shift prediction unit includes:
a feature extraction unit configured to extract one or more predetermined features from the mass spectrometry spectrum; and
a machine learning unit configured to apply the one or more features to a rule obtained through machine learning that takes a training set of mass spectrometry spectra as input.

6. The apparatus of claim 5,
wherein the one or more features include at least one of the number of deuterium atoms substituted in the second label substance, the normalized retention time of the mass spectrometry spectrum, the peptide sequence length of the first label substance, the ratio of labeled sites, or the normalized peak width of the mass spectrometry spectrum.

7. The apparatus of claim 5,
wherein the peptide spectrum match analysis unit includes an isotope separation unit configured to specify the template of the first substance using a normal distribution curve obtained from the peptide spectrum match corresponding to the first label substance and a previously known isotope distribution profile of the elements constituting the first substance.

8. The apparatus of claim 5,
wherein the peptide spectrum match analysis unit includes a reconstruction unit configured to reconstruct the signal of the first substance and the signal of the second substance from the liquid chromatography-mass spectrometry spectrum of the sample by calculating the relative amounts of the first substance and the second substance using the template of the first substance and the template of the second substance.

9. A method for quantitative analysis of a sample, comprising:
receiving, by a quantitative analysis apparatus, a liquid chromatography-mass spectrometry spectrum of a sample including a first substance to which a first label substance is attached and a second substance to which a second label substance having a mass different from that of the first label substance is attached;
calculating, by the quantitative analysis apparatus, a predicted retention time shift between the first label substance and the second label substance from the mass spectrometry spectrum;
extracting a peptide spectrum match corresponding to the first label substance from the mass spectrometry spectrum;
specifying a template of the first substance using the extracted peptide spectrum match of the first label substance;
specifying, by the quantitative analysis apparatus, a template of the second substance using the template of the first substance and the predicted retention time shift;
reconstructing, by the quantitative analysis apparatus, the signals of the first substance and the second substance using the templates of the first substance and the second substance; and
performing, by the quantitative analysis apparatus, peptide-spectrum-match-level quantitative analysis of the sample using the reconstructed signals of the first substance and the second substance.

10. The method of claim 9,
further comprising performing, by the quantitative analysis apparatus, protein-level quantitative analysis of the sample using the peptide-spectrum-match-level quantitative analysis results for one or more substances.

11. The method of claim 9,
wherein the first label substance is a peptide comprising carbon, nitrogen, and hydrogen, and
wherein the second label substance is a peptide in which one or more hydrogen atoms of the first label substance are replaced by deuterium.

12. The method of claim 11,
wherein the second label substance comprises five or more peptides whose masses increase sequentially from the mass of the first label substance according to the number of substituted deuterium atoms.

13. The method of claim 9,
wherein calculating the predicted retention time shift comprises:
extracting one or more predetermined features from the mass spectrometry spectrum; and
calculating the predicted retention time shift by applying the one or more features to a rule obtained through machine learning that takes a training set of mass spectrometry spectra as input.

14. The method of claim 13,
wherein the one or more features include at least one of the number of deuterium atoms substituted in the second label substance, the normalized retention time of the mass spectrometry spectrum, the peptide sequence length of the first label substance, the ratio of labeled sites, or the normalized peak width of the mass spectrometry spectrum.

15. The method of claim 9,
wherein specifying the template of the first substance comprises specifying the template of the first substance using a normal distribution curve obtained from the peptide spectrum match corresponding to the first label substance and a previously known isotope distribution profile of the elements constituting the first substance.

16. The method of claim 9,
wherein reconstructing the signals of the first substance and the second substance comprises calculating the relative amounts of the first substance and the second substance using the template of the first substance and the template of the second substance.

17. A computer program stored in a medium and, in combination with hardware, executing the method for quantitative analysis of a sample according to any one of claims 9 to 16.