KR102511137B1

KR102511137B1 - Method and system for analyzing similarity of chromatogram image

Info

Publication number: KR102511137B1
Application number: KR1020220074629A
Authority: KR
Inventors: 김진영; 정재철; 김희승; 이현수
Original assignee: Republic of Korea (대한민국)
Priority date: 2022-06-20
Filing date: 2022-06-20
Publication date: 2023-03-17

Abstract

A method and a system for analyzing similarity are disclosed. More specifically, the present invention relates to a method and a system for analyzing the similarity of chromatogram images, which perform impurity profiling of a specific compound using a gas chromatography-flame ionization detector (GC-FID). According to an embodiment of the present invention, samples of methamphetamine, a narcotic, are separated and analyzed and their GC-FID chromatogram images are extracted, so that mismatched retention times between data sets no longer have to be adjusted manually, one by one, against an internal standard. Chromatograms similar to the GC-FID chromatogram of each sample can be extracted, and, unlike conventional data-based approaches, the method provides a new statistical analysis that finds similar samples from GC-FID chromatogram images.

Description

Method and system for analyzing the similarity of chromatogram images {METHOD AND METHOD FOR ANALYZING SIMILARITY OF CHROMATOGRAM IMAGE}

The present invention relates to a similarity analysis method, and more particularly to a method and system for analyzing the similarity of chromatogram images that perform impurity profiling of a specific compound using a gas chromatography-flame ionization detector.

When narcotics are brought into Korea from abroad, tracking and cracking down on drug crime organizations becomes easier if the inherent characteristics of the drugs manufactured and smuggled in by foreign clandestine manufacturing organizations, such as the precursors used in manufacture, the synthesis method, and the structures of the isomers they contain, are identified, and if similarity analysis information between the smuggled drug and reference drugs is used.

In Korea, impurity profiling is used as a scientific analysis technique for identifying these inherent characteristics. It targets philopon (methamphetamine), currently the most widely abused narcotic, and is used to confirm the links between clandestinely manufactured and smuggled methamphetamine, as well as its identity and interrelation with other methamphetamine circulating on the market.

Chromatography is the most common and most widely used instrumental analysis method for impurity profiling of a specific compound. In Korea, an analysis method using gas chromatography-flame ionization detection (GC-FID) has been developed and used for methamphetamine impurity profiling. GC-FID is known to be useful for impurity profiling because it serves as a general-purpose detector for volatile hydrocarbon compounds and also offers excellent sensitivity.

As prior work for the analysis, each trace impurity present in philopon (methamphetamine) was identified using a gas chromatograph-mass spectrometer (GC-MS) for impurity profiling ahead of data analysis, and 166 characteristic compounds among them were selected as impurities. Impurity profiling was then performed by semi-quantitative analysis of these impurities in the chromatograms of 71 samples obtained by GC-FID analysis. In addition, clustering was performed by applying support vector machines (SVM), a supervised method, as the cluster analysis method used in the final chemometric analysis ¹⁾.

However, although SVM analysis is useful for processing numerical data, is easier to use than deep learning, and seldom overfits, it has drawbacks: it requires testing across many combinations, kernels, and models; training slows down when the input data set is large; it requires a large amount of computation; and its results are difficult to interpret ²⁾.

Korean Registered Patent Publication No. 10-1615480 (publication date: 2016.04.25.)

1) Dong Won Shin, Beom Jun Ko, Jae Chul Cheong, Wonho Lee, Suhkmann Kim, and Jin Young Kim, Impurity profiling and chemometric analysis of methamphetamine seizures in Korea, Vol 33 No 2 98-107 2020.
2) Brereton RG, Lloyd GR. Support vector machines for classification and regression. Analyst. 2010 Feb;135(2):230-67. doi: 10.1039/b918972f. / Ovidiu Ivanciuc. Applications of Support Vector Machines in Chemistry. In: Reviews in Computational Chemistry, Volume 23, Eds.: K. B. Lipkowitz and T. R. Cundari. Wiley-VCH, Weinheim, 2007, pp. 291-400.
3) Nobuyuki Otsu, "A Threshold Selection Method from Gray-Level Histograms", 「IEEE Transactions on Systems, Man, and Cybernetics」, 1979.
4) Ali, Mohammed, et al. "TimeCluster: dimension reduction applied to temporal data for visual analytics." The Visual Computer 35.6 (2019): 1013-1026.
5) R. Bellman and R. Kalaba, "On adaptive control processes," Automatic Control, IRE Transactions on, vol. 4, no. 2, pp. 1-9, 1959. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1104847 / T. Kahveci and A. Singh, "Variable length queries for time series data", in Data Engineering, 2001. Proceedings. 17th International Conference on, 2001, pp. 273-282. [Online]. Available: http://dx.doi.org/10.1109/ICDE.2001.914838 / Gorecki and Tomasz (2018), Sakoe and Chiba (1978).
6) Donald J. Berndt · James Clifford, "Using Dynamic Time Warping to Find Patterns in Time Series", 「AAAI-94 Workshop on Knowledge Discovery in Databases」, 1994 / Pavel Senin, "Dynamic Time Warping Algorithm Review", 2008 / Marion Morel · Catherine Achard · Richard Kulpa · Séverine Dubuisson, "Time-series averaging using constrained dynamic time warping with tolerance", 「Pattern Recognition, Volume 74, Issue C」, 2018 / Zoltan Geler · Vladimir Kurbalija · Mirjana Ivanović · Miloš Radovanović · Weihui Dai, "Dynamic Time Warping: Itakura vs Sakoe-Chiba", 「IEEE International Symposium on INnovations in Intelligent SysTems and Applications」, 2019.
7) A. K. Jain, R. P. W. Duin, J. Mao, Statistical pattern recognition: A review, IEEE Trans. Pattern Anal. Mach. Intell. 22 (1) (2000) 4-37.
8) International Journal of Machine Learning and Cybernetics (2019) 10:1227-1246.
9) Sakoe, H., & Chiba, S. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1), 43-49.
10) François Petitjean · Alain Ketterlin · Pierre Gançarski, "A global averaging method for dynamic time warping, with applications to clustering", 「Pattern Recognition Volume 44, Issue 3」, 2011 / Marion Morel · Catherine Achard · Richard Kulpa · Séverine Dubuisson, "Time-series averaging using constrained dynamic time warping with tolerance", 「Pattern Recognition, Volume 74, Issue C」, 2018.

The present invention was devised to overcome these problems and to obtain analysis results quickly and simply. For methamphetamine impurity profiling, its object is to binarize images of GC-FID chromatograms, which consist of signals plotted against retention time (RT), to convert them into time series data through preprocessing that removes unnecessary parts, and, for similarity analysis, to apply to impurity profiling dynamic time warping (DTW)-based TimeSeriesKMeans, a sliding window that divides the entire time series into multiple subsequences to form a matrix, and the cluster-based similarity partitioning algorithm (CSPA), a clustering ensemble technique.

To solve the above problem, a method for analyzing the similarity of chromatogram images according to an embodiment of the present invention is a method for analyzing the similarity of chromatogram images that reflect the retention time of a compound, and may include: receiving and preprocessing a plurality of chromatogram images; selecting, along the X axis of each preprocessed image, the points corresponding to the maximum Y-axis value and converting them into time series data in which the points are equally spaced in time; generating a subsequence matrix by applying to the time series data a sliding window algorithm configured with a fixed window size and stride; generating a similarity matrix by clustering the subsequence matrix using a cluster-based similarity partitioning algorithm; and calculating correlation coefficients based on the similarity matrix, calculating the normalized cross-correlation between the time series data, and comparing the two results with preset reference values to determine the similarity between chromatograms.

The step of preprocessing the input chromatogram images may include converting each input chromatogram image into a grayscale image, and converting the grayscale image into a binarized image using a binarization algorithm.

After the step of converting the grayscale image into a binarized image using the binarization algorithm, the method may include: converting the pixel values of all points included in the binarized image to the maximum value; calculating the maximum Y-axis value of each binarized image and, with that maximum Y-axis value as the reference, removing from each binarized image the region below the Y-axis value of the point having the initial Y-axis value; and resizing all binarized images to the same size.

The subsequence matrix may be expressed by the following equation,

[Equation: subsequence matrix; not reproduced in this text]

where w is the window size, s is the stride value, and n is the step.
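The sliding window step can be illustrated with a minimal sketch. It assumes the common construction in which row i of the subsequence matrix holds the w consecutive points starting at index i·s; the function name `sliding_window_matrix` and the sample signal are hypothetical and are not taken from the specification.

```python
def sliding_window_matrix(series, w, s):
    """Split a 1-D sequence into overlapping subsequences (rows of a matrix).

    w is the window size and s is the stride; the number of rows n follows
    from the series length. This layout is an assumed, conventional one.
    """
    if w <= 0 or s <= 0:
        raise ValueError("w and s must be positive")
    if len(series) < w:
        raise ValueError("series is shorter than the window")
    n = (len(series) - w) // s + 1  # number of window positions (steps)
    return [list(series[i * s : i * s + w]) for i in range(n)]


# Hypothetical 10-point signal, window of 4 points, stride of 2 points.
signal = [0, 1, 4, 9, 4, 1, 0, 2, 6, 2]
matrix = sliding_window_matrix(signal, w=4, s=2)
for row in matrix:
    print(row)
```

With these values the matrix has four rows, each overlapping the previous row by two points.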

The step of generating a similarity matrix by clustering the subsequence matrix using the cluster-based similarity partitioning algorithm may include applying a dynamic time warping (DTW) algorithm to the subsequence matrix to minimize the potential warping path of each time series, and the cost function for minimizing shifts and distortions in time in the dynamic time warping algorithm may be expressed by the following equation,

[Equation: DTW cost function; not reproduced in this text]

where C is the cost function, x and y are points of the individual time series, and N and M are the lengths of the individual time series.

The step of generating a similarity matrix by clustering the subsequence matrix using the cluster-based similarity partitioning algorithm may include generating a global cost matrix by applying a Sakoe-Chiba band, and the global cost matrix may be expressed by the following equation,

[Equation: global cost matrix; not reproduced in this text]
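As an illustration of dynamic time warping under a Sakoe-Chiba band, the sketch below uses the textbook recurrence D(i, j) = c(x_i, y_j) + min{D(i-1, j-1), D(i-1, j), D(i, j-1)}. The squared-difference local cost, the band definition |i - j| <= radius, and the name `dtw_distance` are assumptions made for this example; they are not the equations of the specification, which are not reproduced in this text.

```python
import math


def dtw_distance(x, y, radius=None):
    """DTW distance between two sequences, optionally band-constrained.

    radius=None means no global constraint; otherwise only cells with
    |i - j| <= radius are visited (a simple Sakoe-Chiba band, assumed here).
    """
    n, m = len(x), len(y)
    inf = math.inf
    # D[i][j]: minimal accumulated cost of aligning x[:i] with y[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        if radius is None:
            j_lo, j_hi = 1, m
        else:
            j_lo, j_hi = max(1, i - radius), min(m, i + radius)
        for j in range(j_lo, j_hi + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2  # assumed local cost
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return math.sqrt(D[n][m])


# Two hypothetical peaks with the same shape, shifted by one time step.
a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]
unconstrained = dtw_distance(a, b)        # warping absorbs the shift
banded = dtw_distance(a, b, radius=1)     # a one-step band still allows it
rigid = dtw_distance(a, b, radius=0)      # no warping: Euclidean distance
print(unconstrained, banded, rigid)
```

The band limits how far a peak may be shifted before it is matched, which is the role a global constraint plays when retention times drift slightly between runs.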

The step of generating a similarity matrix by clustering the subsequence matrix using the cluster-based similarity partitioning algorithm may include setting the number of clusters using one or more clustering evaluation indices. The clustering evaluation indices may include one or more of: inertia, which is the sum of squared distances between each cluster center and the time series data within that cluster; the silhouette score (silhouette index), which, over all time series data, averages the distance between a cluster and its nearest neighboring cluster; the Calinski-Harabasz index, calculated as the ratio between the within-cluster and between-cluster variances; and the Davies-Bouldin index, which rates highly those models with high similarity within clusters and low similarity between clusters.

In addition, to solve the above problem, according to an embodiment of the present invention, a computer program stored in a readable recording medium may be provided to implement the method for analyzing the similarity of chromatogram images.

According to an embodiment of the present invention, through the separation and analysis of samples of methamphetamine, a narcotic, and the extraction of their GC-FID chromatogram images, there is no need to manually adjust, one by one and against an internal standard, retention times that do not match between data sets. Chromatograms similar to the GC-FID chromatogram of each sample can be extracted, and, unlike conventional data-based approaches, a method for analyzing the similarity of chromatogram images can be provided that is based on a new statistical analysis for finding similar samples from GC-FID chromatogram images.

FIG. 1 is a flowchart illustrating a method for analyzing the similarity of chromatogram images according to an embodiment of the present invention.
FIG. 2 shows images before and after the binarization process of the method for analyzing the similarity of chromatogram images according to an embodiment of the present invention.
FIG. 3 shows a corrected binarized image produced by the method for analyzing the similarity of chromatogram images according to an embodiment of the present invention.
FIG. 4 shows, as an image, a chromatogram image converted into time series data by the similarity analysis method according to an embodiment of the present invention.
FIG. 5 illustrates the sliding window applied in the similarity analysis method according to an embodiment of the present invention.
FIG. 6 compares GC-FID chromatogram time series image data before and after DTW is applied in the method for analyzing the similarity of chromatogram images according to an embodiment of the present invention.
FIGS. 7A and 7B show, respectively, the matched points between two time series together with their warping path, and the result of applying the global constraint of TimeSeriesKMeans, in the method for analyzing the similarity of chromatogram images according to an embodiment of the present invention.
FIG. 8 shows the values of inertia, silhouette score, Calinski-Harabasz index, and Davies-Bouldin index when the number of clusters of the TimeSeriesKMeans algorithm is set from 10 to 30 according to an embodiment of the present invention.
FIG. 9 shows a correlation heatmap based on a hypergraph according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments, and should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

The terms used in this specification are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" are intended to indicate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In addition, unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in this application.

Before the detailed description of the present invention, the terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings. They should be interpreted with meanings and concepts consistent with the technical idea of the present invention, based on the principle that an inventor may appropriately define the concepts of terms in order to describe his or her own invention in the best way.

In addition, unless otherwise defined, the technical and scientific terms used have the meanings commonly understood by a person of ordinary skill in the art to which this invention belongs, and descriptions of well-known functions and configurations that could unnecessarily obscure the gist of the present invention are omitted from the following description and the accompanying drawings. The drawings are provided as examples so that the idea of the present invention can be fully conveyed to those skilled in the art. Accordingly, the present invention is not limited to the drawings presented below and may be embodied in other forms.

In addition, the method and system according to an embodiment of the present invention may be configured as a modular computer program, and may be recorded in a single physical memory or distributed across two or more memories and recording media.

In the following description, the method for analyzing the similarity of chromatogram images of the present invention is described using GC-FID chromatogram images of methamphetamine, the most widely abused illegal drug in Korea, as the analysis data. However, the analysis target is not limited to data for a specific substance, and the method according to embodiments of the present invention may also be used for a wider variety of compounds.

According to the method for analyzing the similarity of chromatogram images of the present invention, the similarity between samples can be derived for 108 chromatogram images, in which the detector signal is plotted as a function of time, by using dynamic time warping-based TimeSeriesKMeans, a sliding window that divides the entire time series into multiple subsequences to form a matrix, and the cluster-based similarity partitioning algorithm (CSPA), one of the clustering ensemble techniques.
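The similarity matrix underlying CSPA can be sketched as a co-association matrix: for each pair of samples, the fraction of base clusterings that place both in the same cluster. The sketch below assumes that each base clustering is given as one list of labels over the same samples; the name `cspa_similarity` and the three label lists are hypothetical, and the graph partitioning step that CSPA may apply afterwards is omitted.

```python
def cspa_similarity(labelings):
    """Co-association matrix over several clusterings of the same samples.

    labelings is a list of label lists, one per base clustering (assumed
    input format). Entry [i][j] is the fraction of clusterings in which
    samples i and j share a cluster.
    """
    r = len(labelings)
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        if len(labels) != n:
            raise ValueError("all clusterings must label the same samples")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / r for c in row] for row in counts]


# Three hypothetical clusterings of four samples.
labelings = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
similarity = cspa_similarity(labelings)
for row in similarity:
    print([round(v, 2) for v in row])
```

Samples 0 and 1 share a cluster in every clustering and receive a similarity of 1.0, while samples 0 and 3 never do and receive 0.0.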

In addition, according to the similarity analysis method of an embodiment of the present invention, retention time correction, an essential step that must precede similarity analysis of chromatogram images in which the detection signal of a specific component is displayed as a signal against retention time, is automated, which makes time series image analysis convenient and fast.

According to an embodiment of the present invention, as reference values (cutoff values) for image clustering and similarity analysis, the normalized cross-correlation value may be set to 0.95 and the correlation coefficient for the similarity matrix obtained from the CSPA analysis may be set to 0.9. When the analysis was performed with these values, clusters containing from one to six similar images were derived for 67 of the 108 images.
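The two cutoff values can be illustrated with a small decision sketch. It assumes that the normalized cross-correlation is taken at zero lag without mean subtraction, that the correlation coefficient is a Pearson coefficient between two rows of the similarity matrix, and that a pair counts as similar only when both values reach their cutoffs. These choices, the function names, and the sample data are assumptions for illustration; the specification does not fix them in this passage.

```python
import math


def ncc(a, b):
    """Normalized cross-correlation at zero lag (assumed definition)."""
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den else 0.0


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    den = math.sqrt(sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b))
    return num / den if den else 0.0


def is_similar(series_a, series_b, sim_row_a, sim_row_b,
               ncc_cutoff=0.95, corr_cutoff=0.9):
    """Apply both cutoffs (combining them with AND is an assumption)."""
    return (ncc(series_a, series_b) >= ncc_cutoff
            and pearson(sim_row_a, sim_row_b) >= corr_cutoff)


# Hypothetical time series and similarity-matrix rows for two samples.
series_a = [0, 1, 4, 1, 0]
series_b = [0, 1, 4, 1, 0.5]
series_c = [4, 1, 0, 0, 0]
row_a = [1.0, 0.9, 0.1]
row_b = [0.9, 1.0, 0.2]

close_pair = is_similar(series_a, series_b, row_a, row_b)
far_pair = is_similar(series_a, series_c, row_a, row_b)
print(close_pair, far_pair)
```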

Accordingly, the present invention shows that clustering and similarity analysis can be extended to various types of chromatogram images expressed as retention time and signal, and the analysis results are expected to serve as useful scientific evidence in cases of methamphetamine smuggled and illegally distributed through several countries.

Meanwhile, the sample analyzer used for the experiments in the method for analyzing the similarity of chromatogram images according to an embodiment of the present invention was a GC-FID (7890GC) fitted with an automatic sample injector (7683B) from Agilent (USA), and the capillary column mounted on the GC was an Agilent DB-5MS (30 m × 0.32 mm I.D., 1.0 μm film thickness).

For the samples used in the embodiment of the present invention, the flow rate in the capillary column was set to 5.0 mL/min. The GC oven temperature was held at 40 °C for 1 minute, raised to 100 °C at 5 °C/min, then raised to 300 °C at 10 °C/min and held for 4 minutes. The temperatures of the inlet and the detector were set to 250 °C and 280 °C, respectively. Samples were injected in splitless mode with the purge-on time set to 1.0 min, and the injection volume was 2 μL. The flow rates of hydrogen gas and air into the detector were 30 mL/min and 300 mL/min, respectively, and the analysis was performed with the detector temperature set to 300 °C.
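As a worked check of the oven program stated above, the total run time can be computed from the hold times and ramp rates. The sketch assumes ideal linear ramps with no equilibration or cool-down time; the function name `oven_run_time` is hypothetical.

```python
def oven_run_time(initial_temp, initial_hold, ramps):
    """Total run time in minutes for a GC oven temperature program.

    ramps is a list of (rate in deg C/min, target temperature in deg C,
    hold time in min) tuples, applied in order. Linear ramps are assumed.
    """
    total = initial_hold
    temp = initial_temp
    for rate, target, hold in ramps:
        total += (target - temp) / rate + hold
        temp = target
    return total


# 40 deg C for 1 min, 5 deg C/min to 100 deg C, 10 deg C/min to 300 deg C,
# then a 4 min hold, as stated in the text.
run_time = oven_run_time(40, 1, [(5, 100, 0), (10, 300, 4)])
print(run_time)  # 1 + 12 + 20 + 4 = 37.0 minutes
```

Under these assumptions each chromatogram spans 37 minutes of retention time.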

In the sample pretreatment step, 50 mg of sample was weighed into a centrifuge tube and dissolved in 2 mL of 0.1 M phosphate buffer (pH 7.0), and the pH was adjusted to 9 by adding 0.25 mL of 10% Na2CO3 solution. Then 0.5 mL of ethyl acetate containing the internal standard (rendered in the source as 노사코산, "nosakosan") at a concentration of 10 μg/mL was added, and extraction was carried out by shaking for 10 minutes. After ultra-high-speed centrifugation at 20,000 × g for 10 minutes, 2 μL of the supernatant was injected into the GC-FID for analysis.

Hereinafter, a method and system for analyzing the similarity of chromatogram images according to an embodiment of the present invention are described in detail with reference to the drawings. In the following description, unless otherwise stated, each step is performed by the chromatogram image similarity analysis system according to an embodiment of the present invention and its components.

FIG. 1 is a flowchart illustrating a method for analyzing the similarity of chromatogram images according to an embodiment of the present invention.

Referring to FIG. 1, the method for analyzing the similarity of chromatogram images according to an embodiment of the present invention is a method for analyzing the similarity of chromatogram images that reflect the retention time of a compound, and may include: receiving and preprocessing a plurality of chromatogram images (S100); selecting, along the X axis of each preprocessed image, the points corresponding to the maximum Y-axis value and converting them into time series data in which the points are equally spaced in time (S110); generating a subsequence matrix by applying to the time series data a sliding window algorithm configured with a fixed window size and stride (S120); generating a similarity matrix by clustering the subsequence matrix using a cluster-based similarity partitioning algorithm (S130); and calculating correlation coefficients based on the similarity matrix, calculating the normalized cross-correlation between the time series data, and comparing the two results with preset reference values to determine the similarity between chromatograms (S140).
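Step S110 can be pictured with a minimal sketch that reads one value per pixel column of a binarized image. It assumes the image is a list of rows with row 0 at the top, that curve pixels carry the value 255, and that one pixel column corresponds to one equally spaced time step; the name `image_to_series` and the tiny sample image are hypothetical.

```python
def image_to_series(binary, foreground=255):
    """Convert a binarized chromatogram image into a time series.

    For every column (X position) the topmost foreground pixel is taken as
    the maximum Y value; its height above the bottom edge becomes the signal.
    Columns without foreground pixels yield 0. The pixel conventions are
    assumptions for this example.
    """
    height = len(binary)
    width = len(binary[0])
    series = []
    for x in range(width):
        value = 0
        for row in range(height):  # scan from the top of the image down
            if binary[row][x] == foreground:
                value = height - row  # bottom row maps to 1, top row to height
                break
        series.append(value)
    return series


# A hypothetical 4 x 5 binarized image containing a single peak.
image = [
    [0,   0,   255, 0,   0],
    [0,   255, 255, 0,   0],
    [0,   255, 255, 255, 0],
    [255, 255, 255, 255, 255],
]
series = image_to_series(image)
print(series)
```

Because every column contributes exactly one value, the resulting points are equally spaced along the retention time axis.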

Preprocessing (step S100)

Specifically, in the step of receiving and preprocessing a plurality of chromatogram images (S100), a plurality of compound samples containing the target component are injected into a GC-FID instrument and analyzed, and the resulting chromatogram images are input to the chromatogram image similarity analysis system of the present invention and preprocessed.

First, to obtain information from a chromatogram image, the objects contained in the image must be separated by image processing. Binarization, one such image processing method, converts each pixel to 0 or 1 (or to 0 or 255) according to its gray value, which separates the objects in the image from the background.

According to an embodiment of the present invention, Otsu thresholding, a widely known binarization technique, can be applied to a grayscale image to find the threshold that optimally separates it into two regions.

In binarization by Otsu thresholding, the threshold that minimizes the sum of the weighted variances of the two groups, foreground and background, is found and used to separate the image into white and black regions, as expressed in Equation 1 below³⁾.

Here, P(i) denotes the probability from the gray-level histogram, t the threshold value, and l the gray level.

FIG. 2 shows images before and after binarization by the chromatogram image similarity analysis method according to an embodiment of the present invention. Referring to FIG. 2, because no meaningful information can be obtained from the colors of a GC-FID chromatogram image (a), the image is converted to a grayscale image and then to a binarized image (b) by Otsu thresholding.
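As an illustration of the binarization described above, the following is a minimal sketch of Otsu thresholding in plain Python, assuming the standard within-class variance criterion over an 8-bit histogram. The function names `otsu_threshold` and `binarize` and the list-of-rows image representation are assumptions made for this example and do not come from the disclosure; which side of the threshold counts as foreground depends on the polarity of the input image.

```python
def otsu_threshold(gray):
    """Return the threshold t that minimizes the weighted within-class
    variance of an 8-bit grayscale image given as a list of pixel rows."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)

    best_t, best_within = 0, float("inf")
    for t in range(255):                      # classes: [0..t] and [t+1..255]
        n0 = sum(hist[: t + 1])
        n1 = total - n0
        if n0 == 0 or n1 == 0:
            continue                          # one class is empty, skip
        mu0 = sum(i * hist[i] for i in range(t + 1)) / n0
        mu1 = sum(i * hist[i] for i in range(t + 1, 256)) / n1
        var0 = sum((i - mu0) ** 2 * hist[i] for i in range(t + 1)) / n0
        var1 = sum((i - mu1) ** 2 * hist[i] for i in range(t + 1, 256)) / n1
        within = (n0 / total) * var0 + (n1 / total) * var1
        if within < best_within:
            best_t, best_within = t, within
    return best_t


def binarize(gray, t):
    """Map pixels above t to 255 and all others to 0 (assumed polarity)."""
    return [[255 if value > t else 0 for value in row] for row in gray]
```

An exhaustive search over all 255 candidate thresholds is used here for readability; a production implementation would normally update the class statistics incrementally.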

Next, FIG. 3 shows a corrected binarized image produced by the chromatogram image similarity analysis method according to an embodiment of the present invention. As shown in FIG. 3, in the binarized image obtained through binarization, the pixel value of every point is set uniformly to '255', so that every pixel in the image has a binary value of either '0' or '255'.

Then, to offset the fact that the points lie at different distances from '0' along the Y axis of each image, the maximum Y-axis value is found for every image and the Y axis is adjusted on that basis, removing the unnecessary region at the bottom of the image so that no point has a minimum Y-axis value below '0'.

Afterwards, the images are resized to the same dimensions for conversion into time-series data. For example, all images may be uniformly converted to a resolution of 600×1200.

Time-series data conversion (step S110)

Next, the step of selecting, along the X axis of each preprocessed image, the points corresponding to the maximum Y-axis value and converting them into time-series data in which the points are equally spaced in time (S110) converts the preprocessed images, from which unnecessary regions have been removed and whose sizes have been equalized, into time-series data. For each preprocessed image, the Y-axis value corresponding to the maximum of the points at each position along the X axis is selected, yielding time-series data that exhibit various patterns over identical time intervals. FIG. 4 shows a chromatogram image converted into time-series data by the similarity analysis method according to an embodiment of the present invention, and illustrates such time-series data.
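The conversion in step S110 can be sketched as follows, assuming a binarized image stored as a list of pixel rows with row 0 at the top and foreground pixels equal to 255. The function name `image_to_series` and the convention of returning 0 for a column without foreground pixels are assumptions made for this example.

```python
def image_to_series(binary, foreground=255):
    """For every X position, return the height of the highest foreground
    pixel, measured upward from the bottom row of the image."""
    height = len(binary)
    width = len(binary[0])
    series = []
    for x in range(width):
        y_max = 0                              # assumed value for empty columns
        for row in range(height):              # scan from the top row downward
            if binary[row][x] == foreground:
                y_max = height - 1 - row       # first hit from the top is the maximum
                break
        series.append(y_max)
    return series
```

Because every image has the same width after preprocessing, every resulting series has the same length and the same spacing between points.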

In particular, the original image before preprocessing is a three-channel RGB image with red (R), green (G), and blue (B) channels; converting it into one-dimensional time-series data reduces the data size substantially, to about 1/4, which speeds up computation during data analysis.

Subsequence matrix generation (step S120)

Next, the step of generating a subsequence matrix by applying to the time-series data a sliding window algorithm configured with a fixed window size and stride (S120) divides the entire time-series data obtained from the chromatogram images into multiple subsequences using a sliding window and arranges them into a matrix, so that dynamic time warping (DTW) can be applied for similarity analysis.

The sliding window described above divides an image according to the window size and then passes each divided image through a model as input to obtain a result. Before neural networks came into use, linear functions were used, so the computation was inexpensive. When a sliding window is applied to a neural network, which is mainly used to process image or video data, the computation must be repeated many times and becomes slow; conversely, increasing the stride to speed up the computation prevents accurate results from being obtained.

To overcome this problem, the embodiment of the present invention implements the sliding window as a convolution, which reduces the number of operations and makes the computation efficient. With this sliding window algorithm, subsequences of the entire time-series data can be generated for a given window size and stride.

FIG. 5 illustrates the sliding window applied in the similarity analysis method according to an embodiment of the present invention. Referring to FIG. 5, it shows the sliding window algorithm applied to data with a window size of '3' and a stride of '1'⁴⁾.

In the embodiment of the present invention, the window moves sequentially over the time-series data to generate the subsequence matrix {Matrix(W, S)}.
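The windowing in step S120 can be sketched with plain list slicing. The function name `subsequence_matrix` is an assumption made for this example, and trailing points that do not fill a complete window are simply dropped, which the disclosure does not specify.

```python
def subsequence_matrix(series, window, stride):
    """Return one row per window position: the `window` consecutive points
    starting at offsets 0, stride, 2*stride, ... that fit inside the series."""
    last_start = len(series) - window
    return [series[start:start + window]
            for start in range(0, last_start + 1, stride)]
```

With a window size of 3 and a stride of 1, as in FIG. 5, the series 1, 2, 3, 4, 5 yields the rows (1, 2, 3), (2, 3, 4), and (3, 4, 5).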

Similarity matrix generation (step S130)

Next, in the step of generating a similarity matrix by clustering the subsequence matrix with a cluster-based similarity partitioning algorithm (S130), the subsequences of each time series are clustered using DTW-based TimeSeriesKMeans, which is used for time-series clustering, and CSPA, which is based on the clustering ensemble technique; a hypergraph, which allows hyperedges, is then generated from the result and reconstructed into a similarity matrix.

The subsequence matrix described above can be expressed as Equation 2 below.

DTW is known as a useful method for aligning multiple signals quickly and consistently. Introduced in the 1960s, DTW has been used in data mining and time-series clustering since the early 2000s, where it serves as a measure of time-series similarity⁵⁾.

DTW selectively compresses or stretches the original signals so that the distance between them is minimized. This process yields well-aligned chromatogram detection signals. The Euclidean distance, which sums the squared differences between vectors, compares values at the same time points and therefore indicates similarity only when similar patterns occur at the same times. DTW, in contrast, also takes the preceding distances into account, so it can compute the similarity between time series that have similar shapes but occur at different times, and it can do so even when the series differ in length⁶⁾. DTW is therefore well suited to chromatograms, which are sequential time-series data of detection signal versus retention time (RT).

Accordingly, the chromatogram image similarity analysis method according to an embodiment of the present invention uses DTW to warp the time-series data so that other similar images can be found, and measures the similarity of time-series images by minimizing shifts and distortions in time.

That is, it is desirable to apply the DTW algorithm to the time-series data, compute the cumulative distance along each candidate path based on the distance measure, and thereby minimize the potential warping path. The cost function used in the DTW algorithm to minimize shifts and distortions in time can be expressed as Equation 3 below.

Here, C is the cost function, x and y are points of the individual time series, and N and M are the lengths of the individual time series.
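A minimal sketch of the cumulative-cost computation follows, assuming the standard dynamic-programming recurrence for DTW with squared pointwise differences as the local cost. The function name `dtw_distance` is an assumption made for this example, and the exact cost function of Equation 3 may differ from the one used here.

```python
def dtw_distance(x, y):
    """DTW distance between sequences x (length N) and y (length M)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = local + min(cost[i - 1][j],      # x advances alone
                                     cost[i][j - 1],      # y advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m] ** 0.5
```

Two peaks of the same shape that are shifted in time, or sequences of different lengths, can therefore have a DTW distance of zero even though a point-by-point comparison would report a difference.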

FIG. 6 compares GC-FID chromatogram time-series image data before (a) and after (b) DTW is applied in the chromatogram image similarity analysis method according to an embodiment of the present invention.

In addition, to select an appropriate number of clusters when applying the TimeSeriesKMeans algorithm, according to an embodiment of the present invention the algorithm is applied to the entire time-series data with the number of clusters varied from 10 to 30, and one or more of the following four clustering evaluation indices can be used: inertia, the sum of squared distances between each cluster center and the samples in that cluster; the silhouette score (silhouette index), the mean over all samples of the silhouette coefficient, which is computed from each sample's average distance to its own cluster and to the nearest other cluster; the Calinski-Harabasz index, computed as the ratio of between-cluster variance to within-cluster variance; and the Davies-Bouldin index, which rates a model as well clustered when similarity is high within clusters and low between clusters.

Clustering is an unsupervised learning methodology that partitions the data into groups of highly similar objects by measuring the similarity between objects; it is distinct from supervised learning methods such as SVM, which apply a classification model built from training data and are evaluated by cross-validation accuracy. The main purpose of data clustering is to find clusters that have characteristic patterns⁷⁾.

With such clustering, chromatograms that are closely related can be grouped into clusters by measuring the similarity between chromatograms, and the common features of these clusters can be extracted. Because better clustering performance yields more meaningful information, an improved clustering algorithm is needed to analyze data with higher correlation.

In such clustering, two objects that belong to the same cluster can be regarded as highly similar, and two that do not can be regarded as dissimilar. CSPA applies this heuristic: a similarity of '1' between two objects means they belong to the same cluster, and '0' means they belong to different clusters. CSPA combines multiple clustering models to obtain better results. That is, it is an algorithm that produces multiple clustering results and then performs clustering on those outputs to generate a single final clustering result, improving the quality of the final result by combining diverse clustering results⁸⁾.

Accordingly, the number of clusters 'K' can be set to a value larger than the number generally considered appropriate. For example, considering that CSPA yields finer-grained results and taking the four evaluation indices described above into account, K can be set to '21', which was judged to perform relatively well within the range of 10 to 30 (K=21).

In addition, to reduce the cost of comparing every pair of points between sequences and to prevent abnormal alignments, a Sakoe-Chiba band with the Sakoe-Chiba radius, the variable that defines the warping window size, set to 12 can be used as the global constraint of TimeSeriesKMeans⁹⁾.

Without a constraint, the global cost matrix can be expressed as Equation 4 below.

When the Sakoe-Chiba band is applied as the global constraint, the global cost matrix can be expressed as Equation 5 below.

Here, Kp denotes the constrained step.
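The effect of the band can be sketched by restricting the cumulative-cost computation to cells whose row and column indices differ by at most the radius. This is a minimal illustration under that assumption; the function name `dtw_sakoe_chiba` is hypothetical, and the way a particular library defines the radius may differ in detail.

```python
def dtw_sakoe_chiba(x, y, radius):
    """DTW distance in which point i of x may only be matched with
    points j of y satisfying |i - j| <= radius."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        lo = max(1, i - radius)                # first column inside the band
        hi = min(m, i + radius)                # last column inside the band
        for j in range(lo, hi + 1):
            local = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    # Cells outside the band stay infinite, so the result is infinite when
    # the band cannot connect the two end points.
    return cost[n][m] ** 0.5
```

A radius of 0 reduces the computation to a point-by-point comparison of equal-length sequences, while a radius at least as large as the longer sequence reproduces unconstrained DTW; only the cells inside the band are evaluated, which is where the cost saving comes from.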

In particular, the K-Means algorithm for time-series data can proceed by updating the cluster centers through DTW barycenter averaging (DBA). DBA is an averaging method that iteratively refines an initial average sequence to minimize the distance from the sequences to the average sequence. It sets an arbitrary sequence from the sequence set as the average sequence, computes DTW between the average sequence and each individual sequence in the set to find the associations between them, and then updates each point of the average sequence to the barycenter of its associated points¹⁰⁾.
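One refinement pass of DBA, as described above, can be sketched as follows. The function names `dtw_path` and `dba_update` are assumptions made for this example, the local cost is assumed to be the squared difference, and a practical implementation would repeat the update until the average sequence stops changing.

```python
def dtw_path(a, b):
    """Return the optimal DTW alignment as a list of index pairs (i, j)."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = local + min(cost[i - 1][j - 1],
                                     cost[i - 1][j],
                                     cost[i][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


def dba_update(average, sequences):
    """Replace each point of `average` with the mean (barycenter) of all
    points that DTW associates with it across `sequences`."""
    associated = [[] for _ in average]
    for seq in sequences:
        for avg_idx, seq_idx in dtw_path(average, seq):
            associated[avg_idx].append(seq[seq_idx])
    return [sum(values) / len(values) for values in associated]
```

Every index of the average sequence appears in each alignment path at least once, so each list of associated points is non-empty as long as at least one sequence is supplied.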

FIGS. 7a and 7b show, respectively, the points matched between two time series together with their warping path, and the result of applying the global constraint of TimeSeriesKMeans, in the chromatogram image similarity analysis method according to an embodiment of the present invention. FIG. 7a shows the points matched between two time series and the warping path based on them, and FIG. 7b shows the result of applying, as the global constraint, a Sakoe-Chiba band with a Sakoe-Chiba radius of '6'.

Similarity determination (step S140)

Next, in the step of calculating a correlation coefficient based on the similarity matrix, calculating the normalized cross-correlation between the time-series data, and comparing both results with preset reference values to determine the similarity between chromatograms (S140), the correlation based on the similarity matrix derived in step S130 and the normalized cross-correlation between the entire time-series data are calculated, and the images are defined as similar if the values are at or above a certain level.

According to an embodiment of the present invention, as described above, to determine the appropriate number of clusters for TimeSeriesKMeans, the algorithm is applied to the entire time-series data with the number of clusters (K) varied from 10 to 30, and K is chosen using the four evaluation indices (inertia, silhouette score, Calinski-Harabasz index, and Davies-Bouldin index).

Of these, a smaller inertia indicates denser clusters, and the point at which the decrease in inertia slows sharply (the elbow) can be chosen as the appropriate number of clusters. The silhouette score ranges from -1, indicating incorrect clustering, to 1, indicating dense, well-formed clusters, so it can also be consulted when setting the number of clusters.

In addition, the Calinski-Harabasz index, defined as the ratio of between-cluster variance to within-cluster variance, indicates denser and better-separated clusters as its value increases, and the Davies-Bouldin index, defined as the ratio of within-cluster distances to between-cluster distances and representing the average similarity between clusters, indicates better clustering as its value approaches 0. The number of clusters can be selected by considering all four evaluation indices.
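Two of the four indices, inertia and the silhouette score, can be sketched directly from their definitions. The function names and the convention of scoring a single-member cluster as 0 are assumptions made for this example; `dist` is any distance function supplied by the caller, for instance a DTW distance when the items are subsequences, and the silhouette score requires at least two clusters.

```python
def inertia(points, labels, centers, dist):
    """Sum of squared distances from each point to its cluster center."""
    return sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


def silhouette_score(points, labels, dist):
    """Mean silhouette coefficient over all points (range -1 to 1)."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)                 # single-member cluster
            continue
        a = sum(own) / len(own)                # mean distance inside own cluster
        b = min(                               # mean distance to nearest other cluster
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in clusters if other != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Evaluating these functions for each candidate K from 10 to 30 gives the kind of curves from which an elbow or a maximum can be read off.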

In the embodiment of the present invention, all indices performed relatively well where the number of clusters (K) was '20' or more, and K can accordingly be set to '21'.

FIG. 8 shows the values of the inertia, silhouette score, Calinski-Harabasz index, and Davies-Bouldin index when the number of clusters of the TimeSeriesKMeans algorithm according to an embodiment of the present invention is set from '10' to '30'.

The CSPA algorithm can then be applied: it receives the data set, outputs multiple clustering results, and performs clustering on those outputs to generate a single final clustering result, combining the multiple clustering results to obtain an improved final result.

The embodiment of the present invention is characterized in that the similarity between time series is analyzed on the basis of the CSPA algorithm, under the premise that if two time series are similar, the multiple subsequences generated from them are likely to be assigned to the same clusters.

Here, the hypergraph consists of vertices and edges, and the similarity of two objects can be represented in the form below, as 1 if they belong to the same cluster and 0 otherwise. Based on the hypergraph, a similarity matrix can be obtained that represents the proportion of clusterings in which two objects belong to the same cluster. FIG. 9 shows a correlation heatmap based on the hypergraph according to an embodiment of the present invention.
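One way to read the similarity matrix described above is as a co-association matrix. The sketch below assumes that `labels[i][p]` holds the cluster label assigned to the p-th subsequence of time series i, and that the similarity of two series is the fraction of window positions at which their subsequences share a cluster; both the data layout and the function name `co_association` are assumptions made for this example and are not stated in the disclosure.

```python
def co_association(labels):
    """Return an n x n matrix whose entry (i, j) is the fraction of window
    positions at which series i and series j fall in the same cluster."""
    n = len(labels)
    positions = len(labels[0])
    return [
        [sum(a == b for a, b in zip(labels[i], labels[j])) / positions
         for j in range(n)]
        for i in range(n)
    ]
```

The diagonal is always 1, the matrix is symmetric, and two series whose subsequences never share a cluster receive a similarity of 0.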

Here, the window size used in the cross-correlation calculation is set to 12, and similar data can be identified according to the set values used to extract data by similarity.

Specifically, data are judged similar if the normalized cross-correlation (NCC) and the correlation coefficient for the similarity matrix (R) are both at or above the adjustable set values (α, β), and dissimilar if either condition is not met.

Here, the set values (α, β) are parameters that can be tuned as needed to find highly similar data, that is, candidate data. For example, setting the reference values for NCC and R to 0.5 each yields more candidate data judged similar to a given datum than setting them to 0.9.

Accordingly, in one embodiment of the present invention, the reference value for the normalized cross-correlation is set to 0.95 or higher and the reference value for the correlation coefficient of the similarity matrix to 0.9 or higher, so that time-series data having a very strong correlation with the target time-series data can be extracted.
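The decision rule in step S140 can be sketched as follows. The Pearson coefficient is used for both quantities, and the window size of 12 is interpreted here as the maximum lag searched by the cross-correlation; that interpretation, the function names, and the use of rows of the similarity matrix as inputs to R are assumptions made for this example rather than details confirmed by the disclosure.

```python
def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    num = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    den = (sum((a - mean_u) ** 2 for a in u)
           * sum((b - mean_v) ** 2 for b in v)) ** 0.5
    return num / den if den else 0.0


def max_ncc(x, y, max_lag):
    """Largest normalized cross-correlation over lags in [-max_lag, max_lag]."""
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[lag:], y[:len(y) - lag]
        else:
            a, b = x[:len(x) + lag], y[-lag:]
        k = min(len(a), len(b))
        if k >= 2:                             # need two points to correlate
            best = max(best, pearson(a[:k], b[:k]))
    return best


def is_similar(ncc, r, alpha=0.95, beta=0.9):
    """Similar only if NCC >= alpha and R >= beta (both conditions)."""
    return ncc >= alpha and r >= beta
```

For two series `x` and `y` with similarity-matrix rows `row_x` and `row_y`, a call such as `is_similar(max_ncc(x, y, 12), pearson(row_x, row_y))` applies the thresholds of 0.95 and 0.9 given above; lowering `alpha` and `beta` admits more candidates.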

Table 1 below shows example GC-FID chromatogram image similarity results produced by the chromatogram image similarity analysis method according to an embodiment of the present invention.

| Target sample | Similar samples | Correlation coefficient (r) |
|---|---|---|
| 201100701 | 201100702 | 0.9285 |
| 201100702 | 201100701 | 0.9285 |
| 201101301 | 201101302 | 0.9462 |
| 201101302 | 201101301 | 0.9462 |
| 201101401 | 201101402 | 0.9148 |
| 201101402 | 201101401 | 0.9148 |
| 201102001 | 201102003 | 0.9843 |
| | 201102004 | 0.9843 |
| | 201102002 | 0.9843 |
| 201102002 | 201102003 | 1.0000 |
| | 201102004 | 1.0000 |
| | 201102001 | 0.9843 |
| 201102003 | 201102004 | 1.0000 |
| | 201102002 | 1.0000 |
| | 201102001 | 0.9843 |
| 201102004 | 201102003 | 1.0000 |
| | 201102002 | 1.0000 |
| | 201102001 | 0.9843 |
| 201201401 | 201201402 | 0.9170 |
| 201201402 | 201201401 | 0.9170 |
| 201202301 | 201202302 | 0.9442 |
| | 201202311 | 0.9162 |
| 201202302 | 201202301 | 0.9442 |
| 201202304 | 201202308 | 0.9935 |
| | 201202309 | 0.9856 |
| 201202308 | 201202304 | 0.9935 |
| | 201202309 | 0.9914 |
| 201202309 | 201202308 | 0.9914 |
| | 201202304 | 0.9856 |
| | 201202312 | 0.9752 |
| 201202311 | 201202301 | 0.9162 |
| 201202312 | 201202309 | 0.9752 |
| 201301802 | 201301804 | 0.9916 |
| | 201301808 | 0.9886 |
| 201301804 | 201301802 | 0.9916 |
| | 201301808 | 0.9760 |
| 201301808 | 201301802 | 0.9886 |
| | 201301804 | 0.9760 |
| 201400102 | 201400104 | 0.9744 |
| | 201400105 | 0.9624 |
| 201400104 | 201400102 | 0.9744 |
| | 201400105 | 0.9068 |
| 201400105 | 201400102 | 0.9624 |
| | 201400104 | 0.9068 |
| 201400503 | 201502501 | 0.9833 |
| | 201600401 | 0.9588 |
| 201401001 | 201401801 | 0.9094 |
| 201401801 | 201401001 | 0.9094 |
| 201401802 | 201502614 | 0.9109 |
| | 201502501 | 0.9056 |
| 201500501 | 201501402 | 0.9661 |
| | 201500601 | 0.9259 |
| 201500601 | 201500501 | 0.9259 |
| 201501402 | 201500501 | 0.9661 |
| 201502501 | 201400503 | 0.9833 |
| | 201502614 | 0.9686 |
| | 201600401 | 0.9464 |
| | 201401802 | 0.9056 |
| 201502614 | 201502501 | 0.9686 |
| | 201401802 | 0.9109 |
| | 201600401 | 0.9070 |
| 201600101 | 201600401 | 0.9191 |
| 201600401 | 201400503 | 0.9588 |
| | 201502501 | 0.9464 |
| | 201600101 | 0.9191 |
| | 201502614 | 0.9070 |
| 201800102 | 201801201 | 0.9943 |
| | 201800201 | 0.9789 |
| | 201802001 | 0.9724 |
| | 201802701 | 0.9587 |
| | 201800301 | 0.9568 |
| 201800201 | 201800102 | 0.9789 |
| | 201801201 | 0.9782 |
| | 201802001 | 0.9756 |
| | 201802701 | 0.9618 |
| | 201800301 | 0.9482 |
| 201800301 | 201800102 | 0.9568 |
| | 201800201 | 0.9482 |
| | 201801201 | 0.9384 |
| | 201802001 | 0.9003 |
| 201800501 | 201800701 | 1.0000 |
| | 201801301 | 0.9389 |
| | 201800601 | 0.9368 |
| 201800601 | 201801301 | 0.9532 |
| | 201801302 | 0.9530 |
| | 201800501 | 0.9368 |
| | 201800701 | 0.9368 |
| | 201802701 | 0.9141 |
| | 201802001 | 0.9060 |
| 201800701 | 201800501 | 1.0000 |
| | 201801301 | 0.9389 |
| | 201800601 | 0.9368 |
| 201800801 | 201801301 | 0.9701 |
| | 201801302 | 0.9436 |
| 201800901 | 201801101 | 0.9608 |
| | 201801301 | 0.9054 |
| 201801101 | 201800901 | 0.9608 |
| 201801201 | 201800102 | 0.9943 |
| | 201800201 | 0.9782 |
| | 201802001 | 0.9713 |
| | 201802701 | 0.9636 |
| | 201800301 | 0.9384 |
| 201801301 | 201801302 | 0.9746 |
| | 201800801 | 0.9701 |
| | 201800601 | 0.9532 |
| | 201800501 | 0.9389 |
| | 201800701 | 0.9389 |
| | 201800901 | 0.9054 |
| 201801302 | 201801301 | 0.9746 |
| | 201800601 | 0.9530 |
| | 201800801 | 0.9436 |
| 201801502 | 201801503 | 0.9745 |
| 201801503 | 201801502 | 0.9745 |
| | 201802412 | 0.9273 |
| | 201802404 | 0.9028 |
| 201801504 | 201802411 | 0.9379 |
| | 201801801 | 0.9342 |
| 201801601 | 201802405 | 0.9354 |
| | 201802409 | 0.9278 |
| 201801801 | 201801504 | 0.9342 |
| | 201802406 | 0.9267 |
| 201802001 | 201802701 | 0.9950 |
| | 201800201 | 0.9756 |
| | 201800102 | 0.9724 |
| | 201801201 | 0.9713 |
| | 201800601 | 0.9060 |
| | 201800301 | 0.9003 |
| 201802401 | 201802405 | 0.9936 |
| | 201802409 | 0.9857 |
| | 201802407 | 0.9645 |
| 201802403 | 201802407 | 0.9453 |
| | 201802406 | 0.9401 |
| 201802404 | 201802412 | 0.9491 |
| | 201801503 | 0.9028 |
| 201802405 | 201802401 | 0.9936 |
| | 201802409 | 0.9752 |
| | 201802407 | 0.9640 |
| | 201802413 | 0.9361 |
| | 201801601 | 0.9354 |
| | 201802408 | 0.9180 |
| 201802406 | 201802403 | 0.9401 |
| | 201801801 | 0.9267 |
| | 201802407 | 0.9071 |
| 201802407 | 201802401 | 0.9645 |
| | 201802405 | 0.9640 |
| | 201802409 | 0.9473 |
| | 201802403 | 0.9453 |
| | 201802406 | 0.9071 |
| | 201802413 | 0.9033 |
| 201802408 | 201802413 | 0.9548 |
| | 201802405 | 0.9180 |
| | 201802409 | 0.9033 |
| 201802409 | 201802401 | 0.9857 |
| | 201802405 | 0.9752 |
| | 201802407 | 0.9473 |
| | 201801601 | 0.9278 |
| | 201802408 | 0.9033 |
| 201802411 | 201801504 | 0.9379 |
| 201802412 | 201802404 | 0.9491 |
| | 201801503 | 0.9273 |
| 201802413 | 201802408 | 0.9548 |
| | 201802405 | 0.9361 |
| | 201802407 | 0.9033 |
| 201802701 | 201802001 | 0.9950 |
| | 201801201 | 0.9636 |
| | 201800201 | 0.9618 |
| | 201800102 | 0.9587 |
| | 201800601 | 0.9141 |

Following the steps described above, the method for analyzing the similarity of chromatogram images according to the present invention makes it possible to analyze the impurities in methamphetamine samples that have entered or are circulating in Korea and to perform a similarity discriminant analysis on their GC-FID chromatogram images.

In particular, according to an embodiment of the present invention, the chromatogram image obtained from sample analysis is converted into time-series data, a subsequence matrix is constructed with a sliding window, a similarity matrix is constructed from that data using TimeSeriesKMeans-based CSPA, and sequences are judged to be similar when the correlation coefficient calculated on the similarity matrix and the value calculated over the entire time series are at or above a specific threshold. This makes it possible to implement an improved algorithm that removes the need to manually adjust each data set against an internal standard in order to correct the retention-time mismatches that arise while methamphetamine samples are separated and analyzed and their GC-FID chromatogram images are extracted. It also makes it possible to extract the chromatograms that are similar to the GC-FID chromatogram of each sample.
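
The similarity-matrix step can be illustrated with a minimal Python sketch. CSPA is commonly expanded as the cluster-based similarity partitioning algorithm, in which the similarity of two items is the fraction of clusterings that place them in the same cluster. The function name `cspa_similarity` and the toy label vectors are assumptions made for illustration; in the described embodiment the label vectors would come from TimeSeriesKMeans runs, which are not reproduced here.

```python
import numpy as np

def cspa_similarity(labelings):
    """Co-association matrix: share of clusterings in which items i and j co-cluster.

    labelings: array-like of shape (n_clusterings, n_items) holding cluster labels.
    """
    labelings = np.asarray(labelings)
    # same[r, i, j] is True when clustering r puts items i and j in one cluster.
    same = labelings[:, :, None] == labelings[:, None, :]
    return same.mean(axis=0)

# Hypothetical labels from two clustering runs over three subsequences.
S = cspa_similarity([[0, 0, 1],
                     [0, 1, 1]])
print(S)
```

Each entry of `S` lies between 0 and 1, so the matrix can be used directly as input to a correlation or thresholding step.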

In addition, unlike conventional data-based approaches, the invention presents a new statistical analysis method that uses GC-FID chromatogram images to find similar samples, and it can therefore be applied to actual analysis.

The method for analyzing the similarity of chromatogram images according to an embodiment of the present invention has been described in detail above. It goes without saying that a program stored in a computer-readable recording medium for implementing the method can also be realized.

That is, those skilled in the art will readily understand that the method described above may be provided on a computer-readable recording medium by tangibly embodying a program of instructions for implementing it. In other words, the method may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable recording medium.

Here, the computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or they may be known and available to those skilled in computer software.

The computer-readable recording media described above may include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, flash memory, and USB memory. The program instructions may include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. In addition, the hardware devices may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

Although many details have been specifically described above, they should be interpreted as examples of preferred embodiments rather than as limitations on the scope of the invention. Accordingly, the scope of the invention should be determined not by the described embodiments but by the claims and their equivalents.

Claims

A method for analyzing the similarity of chromatogram images in which the retention time of a compound is reflected, the method comprising:
receiving and pre-processing a plurality of chromatogram images;
selecting, along the X-axis of each pre-processed image, the points corresponding to the maximum Y-axis value, and converting the selected points into time-series data having a uniform time interval;
generating a subsequence matrix by applying, to the time-series data, a sliding window algorithm configured with a fixed window size and stride;
generating a similarity matrix by clustering the subsequence matrix using a cluster-based similarity segmentation algorithm; and
calculating a correlation coefficient based on the similarity matrix, calculating a normalized cross-correlation between the respective time-series data, and comparing the two calculation results with a preset reference value to determine the similarity between the chromatograms.
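
As an illustration of the final comparison step, the Python sketch below computes a normalized cross-correlation between two equal-length series and compares it with a reference value. Taking the peak over all lags, the function names, and the reference value 0.9 are assumptions; the claim does not specify them.

```python
import numpy as np

def max_normalized_cross_correlation(a, b):
    """Peak of the zero-mean normalized cross-correlation over all lags.

    Assumes equal-length, non-constant series; at lag 0 the value equals
    the Pearson correlation coefficient.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    a = (a - a.mean()) / (a.std() * len(a))
    b = (b - b.mean()) / b.std()
    return float(np.correlate(a, b, mode="full").max())

def is_similar(a, b, reference=0.9):
    """Hypothetical decision rule: similar when the peak reaches the reference value."""
    return max_normalized_cross_correlation(a, b) >= reference

# Synthetic single-peak trace and a copy shifted by five samples.
t = np.linspace(0.0, 1.0, 200)
peak = np.exp(-((t - 0.5) / 0.02) ** 2)
print(is_similar(peak, np.roll(peak, 5)))
```

Searching over lags is one way to tolerate small retention-time shifts between two traces without aligning them by hand.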

The method of claim 1,
wherein the pre-processing of the received chromatogram images comprises:
converting each received chromatogram image into a grayscale image; and
converting the grayscale image into a binarized image using a binarization algorithm.
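
The pre-processing and point-selection steps can be sketched in Python as follows. The luma weights, the fixed threshold of 128, the assumption that the trace is darker than the background, and all function names are illustrative choices; the claims do not name a specific grayscale formula or binarization algorithm.

```python
import numpy as np

def to_grayscale(rgb):
    """Weighted sum of the R, G, B channels (ITU-R BT.601 luma weights)."""
    return np.asarray(rgb, dtype=float)[..., :3] @ np.array([0.299, 0.587, 0.114])

def binarize(gray, threshold=128.0):
    """True where a pixel is darker than the threshold (assumed to be the trace)."""
    return gray < threshold

def trace_to_series(mask):
    """For every column, height of the topmost trace pixel; 0 if the column is empty."""
    height = mask.shape[0]
    top_row = np.argmax(mask, axis=0)   # first True row in each column
    has_trace = mask.any(axis=0)
    return np.where(has_trace, height - 1 - top_row, 0)

# Tiny synthetic image: white background with a few black trace pixels.
img = np.full((5, 4, 3), 255, dtype=np.uint8)
img[4, 0] = 0
img[2, 1] = 0
img[0, 2] = 0
print(trace_to_series(binarize(to_grayscale(img))))
```

Because each column yields exactly one value, the resulting series has one sample per X position and therefore a uniform spacing along the time axis.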

The method of claim 2, further comprising,
after converting the grayscale image into a binarized image using the binarization algorithm:
converting the pixel values of all points included in the binarized image into the maximum value;
calculating the maximum Y-axis value of each binarized image and removing, from each binarized image, the area lying below the Y-axis value of the point having the initial Y-axis value, with reference to the maximum Y-axis value; and
resizing all binarized images to the same size.

The method of claim 1,
wherein the subsequence matrix is expressed by the following equation:

where w is the window size, s is the stride value, and n is the step.
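
A minimal Python sketch of building a subsequence matrix with a sliding window is given below. It assumes the common convention that window k starts at sample k·s and holds w consecutive samples; the function name `sliding_window` is hypothetical, and the sketch is not a transcription of the claimed equation.

```python
import numpy as np

def sliding_window(series, w, s):
    """Stack length-w windows taken every s samples into an (n, w) matrix."""
    series = np.asarray(series, dtype=float)
    n = (len(series) - w) // s + 1   # number of complete windows
    if n < 1:
        raise ValueError("window is longer than the series")
    return np.stack([series[k * s : k * s + w] for k in range(n)])

# Ten samples, window size 4, stride 2.
M = sliding_window(range(10), w=4, s=2)
print(M.shape)
```

Under this convention a series of length L yields floor((L - w) / s) + 1 rows, and any trailing samples that do not fill a complete window are dropped.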

The method of claim 1,
wherein generating the similarity matrix by clustering the subsequence matrix using the cluster-based similarity segmentation algorithm comprises
minimizing a potential warping path of each time-series data by applying a dynamic time warping algorithm to the subsequence matrix,
wherein the cost function for minimizing shifts and distortions in time in the dynamic time warping algorithm is expressed by the following equation:

where C is the cost function, x and y are points of the individual time-series data, and N and M are the lengths of the individual time-series data.

The method of claim 1,
wherein generating the similarity matrix by clustering the subsequence matrix using the cluster-based similarity segmentation algorithm comprises
generating a global cost matrix by applying the Sakoe-Chiba band, wherein the global cost matrix is expressed by the following equation:

(expressed as a real number, where c_i,j is a function that implements the timing difference between two patterns, k is a variable representing an arbitrary i or j, M is the boundary value of the i-axis, N is the boundary value of the j-axis, and Kp is a constraint value, with p = n/m, m being a multiple in the i- or j-axis direction and n being a multiple in the direction orthogonal to m).
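
Dynamic time warping restricted to a Sakoe-Chiba band can be sketched in Python as follows. The squared-difference local cost, the band expressed as a half-width in samples (rather than the constraint value Kp recited above), and the function name `dtw_distance` are assumptions made for illustration, not the claimed formulation.

```python
import numpy as np

def dtw_distance(x, y, band=None):
    """DTW distance between two 1-D series, optionally limited to |i - j| <= band."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    if band is None:
        band = max(n, m)              # no constraint
    band = max(band, abs(n - m))      # band must be wide enough to reach cell (n, m)
    # D[i, j] is the cheapest accumulated cost of aligning x[:i] with y[:j].
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))

# The same peak, shifted by one sample.
a = [0, 0, 1, 2, 1, 0]
b = [0, 1, 2, 1, 0, 0]
print(dtw_distance(a, b, band=1), dtw_distance(a, b, band=0))
```

A band of zero forces a sample-by-sample comparison, whereas a band of one already lets the warping path absorb the one-sample shift in this example.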

The method of claim 6,
wherein generating the similarity matrix by clustering the subsequence matrix using the cluster-based similarity segmentation algorithm comprises
setting the number of clusters using one or more clustering evaluation indices,
wherein the clustering evaluation indices include one or more of:
inertia, which calculates the sum of squared distances between the cluster center and the time-series data within a cluster;
the silhouette index, which calculates, for all time-series data, the average distance between one cluster and the nearest other cluster;
the Calinski-Harabasz index, which is calculated as the ratio between the within-cluster variance and the between-cluster variance; and
the Davies-Bouldin index, which evaluates models with high intra-cluster similarity and low inter-cluster similarity.
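
Two of the evaluation indices named above can be sketched in Python with NumPy alone. The sketch uses Euclidean distance on fixed-length vectors and the conventional form of the Calinski-Harabasz index (between-cluster dispersion divided by within-cluster dispersion, each scaled by its degrees of freedom); the function names and the toy data are assumptions, and the embodiment's own clustering of time series is not reproduced here.

```python
import numpy as np

def inertia(X, labels):
    """Sum of squared distances from each sample to its cluster centroid."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)

def calinski_harabasz(X, labels):
    """Between-cluster over within-cluster dispersion, scaled by degrees of freedom."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, k = len(X), len(np.unique(labels))
    overall = X.mean(axis=0)
    between = sum(
        (labels == c).sum() * ((X[labels == c].mean(axis=0) - overall) ** 2).sum()
        for c in np.unique(labels)
    )
    return float((between / (k - 1)) / (inertia(X, labels) / (n - k)))

# Four one-dimensional samples forming two well-separated clusters.
X = [[0.0], [2.0], [10.0], [12.0]]
labels = [0, 0, 1, 1]
print(inertia(X, labels), calinski_harabasz(X, labels))
```

To choose the number of clusters, such indices would typically be evaluated for several candidate values and compared: lower inertia and a higher Calinski-Harabasz value both indicate tighter, better-separated clusters.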

A computer program stored in a computer-readable recording medium for implementing the method for analyzing the similarity of chromatogram images according to any one of claims 1 to 7.