KR20240043388A

KR20240043388A - Apparatus for classification model building to classify fire accelerators and method thereof

Info

Publication number: KR20240043388A
Application number: KR1020220122422A
Authority: KR
Inventors: 박치현; 박우용; 이동계
Original assignee: 대한민국(관리부서: 행정안전부 국립과학수사연구원장)
Priority date: 2022-09-27
Filing date: 2022-09-27
Publication date: 2024-04-03
Also published as: WO2024071819A1

Abstract

The present invention relates to an apparatus and method for building a classification model for classifying fire accelerants.
The classification model building apparatus according to the present invention includes a database that collects GC-MS data obtained from suspected arson cases; a data preprocessing unit that preprocesses the collected GC-MS data; a model building unit that generates a data set from the preprocessed GC-MS data and uses the generated data set to determine hyperparameters for building a plurality of classification models; and a verification unit that verifies the performance of the built classification models in extracting fire substances.

Description

Apparatus for classification model building to classify fire accelerators and method thereof

The present invention relates to an apparatus and method for building a classification model for classifying fire accelerants and, more specifically, to an apparatus and method for building a machine-learning-based classification model that classifies the accelerants used in fires from GC-MS data.

In modern society, financial losses from fire damage are an important consideration with wide-ranging implications. It is therefore important to determine whether a fire occurred naturally or was caused by arson. Most common arson cases involve fire accelerants such as hydrocarbon fuels (gasoline, kerosene, diesel), organic solvents, or candles.

Conventionally, the marker components of fire accelerants have been separated and identified through analysis using gas chromatography-mass spectrometry (GC-MS), solid-phase microextraction (SPME), or solvent extraction.

Because gas chromatography-mass spectrometry (GC-MS) provides continuous and natural information about a sample, there have been many attempts to develop machine learning models using GC-MS data.

Korean Registered Patent Publication No. 10-2375679 (published March 18, 2022)

The technical problem addressed by the present invention is to provide an apparatus and method for building a machine-learning-based classification model that classifies the accelerants used in fires from GC-MS data.

To achieve this technical objective, a classification model building apparatus according to an embodiment of the present invention includes a database that collects GC-MS data obtained from suspected arson cases; a data preprocessing unit that preprocesses the collected GC-MS data; a model building unit that generates a data set from the preprocessed GC-MS data and uses the generated data set to determine hyperparameters for building a plurality of classification models; and a verification unit that verifies the performance of the built classification models in extracting fire substances.

The database classifies the GC-MS data into a plurality of categories according to the substance that caused the fire and stores it, and the substance may include at least one of a fire accelerant, gasoline, kerosene, diesel, a candle, and an organic solvent.

The plurality of classification models may include a random forest (RF) model, a support vector machine (SVM) model, and a convolutional neural network (CNN) model.

When the classification model to be trained is the random forest (RF) model or the support vector machine (SVM) model, the preprocessing unit may analyze the GC-MS data to extract a plurality of peaks corresponding to chemical substances, and may compare the extracted peaks with reference peaks to calculate a cosine similarity.

The preprocessing unit may extract the substance information contained in the GC-MS data when the calculated similarity is greater than a reference value.

When the classification model to be trained is the convolutional neural network (CNN) model, the preprocessing unit may preprocess the GC-MS data using an equation (not reproduced here) that maps each element of the matrix to its standardized value.

The model building unit may determine the hyperparameters of the random forest (RF) model using the number of estimators and the maximum depth of the model.

The model building unit may determine the hyperparameters using a radial basis function (RBF) as the kernel of the support vector machine (SVM) model.

The model building unit may determine the hyperparameters of the convolutional neural network (CNN) model by placing a dropout layer after the convolutional layer, using the ReLU (Rectified Linear Unit) function as the activation function of all intermediate layers, and applying a sigmoid function to the output layer.

In addition, a classification model building method using the classification model building apparatus according to an embodiment of the present invention includes: building a database by collecting GC-MS data obtained from suspected arson cases; preprocessing the collected GC-MS data; generating a data set from the preprocessed GC-MS data and using the generated data set to determine hyperparameters for building a plurality of classification models; and verifying the performance of the built classification models in extracting fire substances.

According to the present invention, artificial intelligence can be used to classify fire accelerants in fire residues from actual suspected arson cases, which can support effective and accurate investigation of fire incidents.

FIG. 1 is a block diagram illustrating a classification model building apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart of a classification model building method using the classification model building apparatus according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram illustrating the conditions for setting optimal hyperparameters for the random forest (RF) model and the support vector machine (SVM) model in step S230 shown in FIG. 2.
FIG. 4 is an exemplary diagram illustrating the convolutional neural network (CNN) model built in step S230 shown in FIG. 2.
FIG. 5 is an exemplary diagram illustrating the average GC-MS data for each category.
FIG. 6 is an exemplary diagram showing features extracted from GC-MS data.
FIG. 7 is an exemplary diagram showing the performance evaluation results of the classification models according to an embodiment of the present invention.
FIG. 8 is an exemplary diagram showing the ROC curves and PR curves of the classification models according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The thickness of lines and the sizes of components shown in the drawings may be exaggerated for clarity and convenience of explanation.

The terms used below are defined in consideration of their functions in the present invention and may vary depending on the intention or practice of the measurement subject or operator. The definitions of these terms should therefore be based on the content of this specification as a whole.

Hereinafter, a classification model building apparatus for classifying fire accelerants according to an embodiment of the present invention is described in more detail with reference to FIG. 1.

FIG. 1 is a block diagram illustrating a classification model building apparatus according to an embodiment of the present invention.

As shown in FIG. 1, the classification model building apparatus 100 according to an embodiment of the present invention includes a database 110, a data preprocessing unit 120, a model building unit 130, and a verification unit 140.

First, the database 110 stores raw data obtained from suspected arson cases. More specifically, the database 110 collects GC-MS data acquired by gas chromatography-mass spectrometry (GC-MS) and classifies the collected GC-MS data according to the type of fire accelerant.

The data preprocessing unit 120 preprocesses the GC-MS data so that it can be used as training data. For example, to use the GC-MS data as training data for a convolutional neural network (CNN) model, the data preprocessing unit 120 extracts a two-dimensional matrix from the GC-MS data and standardizes the extracted matrix.

The model building unit 130 builds a plurality of machine-learning-based classification models. That is, the model building unit 130 obtains classification models based on a random forest (RF) model, a support vector machine (SVM) model, and a convolutional neural network (CNN) model.

The model building unit 130 then divides the preprocessed learning data into training data and validation data for the plurality of models, and uses the training data and validation data to determine the hyperparameters for each model.

Finally, the verification unit 140 verifies the classification models built with the determined hyperparameters.

Hereinafter, a classification model building method for classifying fire accelerants according to an embodiment of the present invention is described in more detail with reference to FIGS. 2 to 8.

FIG. 2 is a flowchart of a classification model building method using the classification model building apparatus according to an embodiment of the present invention.

As shown in FIG. 2, the classification model building apparatus 100 according to an embodiment of the present invention builds a database by collecting GC-MS data obtained from suspected arson cases (S210).

More specifically, the classification model building apparatus 100 receives the GC-MS data from a server of an institute that analyzes suspected arson cases, such as the National Forensic Service.

The GC-MS data are the results of analysis performed under the standard analysis conditions listed in Table 1.

These standard analysis conditions are not mandatory, and the analysis conditions may be modified as needed for a more appropriate analysis.

The received GC-MS data are classified by fire accelerant, gasoline, kerosene, diesel, candle, and organic solvent, and stored in the database 110.

Next, the data preprocessing unit 120 preprocesses the GC-MS data stored in the database 110 so that it can be used as training data (S220).

Because the GC-MS data stored in the database 110 were obtained from actual fire cases, they are imbalanced across categories.

Accordingly, the data preprocessing unit 120 preprocesses the GC-MS data according to the classification model.

In more detail, the classification model building apparatus 100 according to an embodiment of the present invention builds classification models using a random forest (RF) model, a support vector machine (SVM) model, and a convolutional neural network (CNN) model.

The data preprocessing unit 120 therefore preprocesses the GC-MS data to improve the performance of the random forest (RF) model, the support vector machine (SVM) model, and the convolutional neural network (CNN) model.

The method of preprocessing the GC-MS data for training each classification model is described in more detail below.

For the random forest (RF) model and the support vector machine (SVM) model, the data preprocessing unit 120 preprocesses the GC-MS data into one-dimensional data. Specifically, the data preprocessing unit 120 sets the peaks of substances frequently found in fire residues as reference peaks. It then analyzes the GC-MS data to detect the peak corresponding to each chemical substance, and compares the detected peaks with the reference peaks to determine whether a similar peak is present. If a similar peak is found, the data preprocessing unit 120 calculates the cosine similarity between that peak and the reference peak. If the calculated similarity exceeds a reference value, the data preprocessing unit 120 extracts the substance information corresponding to that peak.
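The peak-matching step described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes each peak is represented by a mass spectrum stored as a mapping from m/z to intensity, and the function names `cosine_similarity` and `match_substances`, the example spectra, and the 0.9 threshold are all hypothetical (the source does not state the reference value).

```python
import math

def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two spectra given as {m/z: intensity} dicts."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def match_substances(detected_peaks, reference_peaks, threshold=0.9):
    """Return names of reference substances matched by at least one detected peak.

    detected_peaks: list of {m/z: intensity} dicts, one per detected peak.
    reference_peaks: {substance_name: {m/z: intensity}} (hypothetical library).
    threshold: assumed reference value; not given in the source.
    """
    matched = []
    for name, ref in reference_peaks.items():
        best = max((cosine_similarity(p, ref) for p in detected_peaks), default=0.0)
        if best > threshold:
            matched.append(name)
    return matched

# Hypothetical spectra, for illustration only.
reference = {
    "toluene": {91: 100.0, 92: 60.0, 65: 12.0},
    "n-decane": {43: 100.0, 57: 95.0, 71: 45.0, 85: 30.0},
}
detected = [{91: 98.0, 92: 61.0, 65: 10.0}, {44: 100.0, 28: 50.0}]
print(match_substances(detected, reference))
```

The matched substance names, together with their signal intensities, could then form the one-dimensional feature vector fed to the RF and SVM models.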

When the classification model is the convolutional neural network (CNN) model, the data preprocessing unit 120 preprocesses the two-dimensional matrix contained in the GC-MS data into a relatively weak signal using Equation 1.
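Equation 1 itself is not reproduced in this text. Purely as an illustration, and assuming a conventional z-score standardization over the whole matrix (an assumption that the source does not confirm), the preprocessing of a two-dimensional GC-MS intensity matrix might look like the sketch below. The function name `standardize_matrix` and the sample values are hypothetical.

```python
import math

def standardize_matrix(matrix):
    """Z-score every element using the mean and std of the whole matrix.

    ASSUMPTION: the source's Equation 1 is unavailable; z-scoring is one
    common form of standardization and is used here only as a stand-in.
    """
    values = [x for row in matrix for x in row]
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    std = math.sqrt(var)
    if std == 0.0:
        # A constant matrix carries no contrast; map it to all zeros.
        return [[0.0 for _ in row] for row in matrix]
    return [[(x - mean) / std for x in row] for row in matrix]

# Rows: retention-time scans; columns: m/z channels (hypothetical values).
raw = [[0.0, 10.0], [20.0, 30.0]]
z = standardize_matrix(raw)
print(z)
```

After this transformation the matrix has zero mean and unit variance, which keeps strong and weak signals on a comparable scale for the CNN.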

When step S220 is complete, the model building unit 130 uses the preprocessed GC-MS data to determine the hyperparameters to be applied to each classification model (S230).

More specifically, the model building unit 130 assigns approximately 20% of all preprocessed GC-MS data to a test set, assigns approximately 25% of the remaining GC-MS data to a validation set, and assigns the rest to a training set, which yields roughly a 60:20:20 training/validation/test split.
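The two-stage split described above (20% held out for testing, then 25% of the remainder for validation) can be sketched as follows. This is a simple random split under assumed names (`split_dataset`, `seed`); the source does not say whether the split is stratified by category.

```python
import random

def split_dataset(samples, test_frac=0.20, val_frac_of_rest=0.25, seed=0):
    """Shuffle and split samples into (train, validation, test) lists."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    # First carve off the test set from the full data.
    n_test = round(len(items) * test_frac)
    test, rest = items[:n_test], items[n_test:]
    # Then take the validation set from what remains.
    n_val = round(len(rest) * val_frac_of_rest)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Because 25% of the remaining 80% equals 20% of the total, the three subsets end up at about 60%, 20%, and 20% of the data.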

Next, the model building unit 130 performs cross-validation using the training data and the validation data to determine the hyperparameters to be applied to each classification model.

FIG. 3 is an exemplary diagram illustrating the conditions for setting optimal hyperparameters for the random forest (RF) model and the support vector machine (SVM) model in step S230 shown in FIG. 2, and FIG. 4 is an exemplary diagram illustrating the convolutional neural network (CNN) model built in step S230 shown in FIG. 2.

When the classification model is the random forest (RF) model, the model building unit 130 determines the hyperparameters using the number of estimators and the maximum depth of the model. As shown in (a) of FIG. 3, the random forest (RF) model achieves its highest accuracy with a maximum depth of 30 and [value missing from the source text] estimators.
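A search over the number of estimators and the maximum depth can be sketched as an exhaustive grid search. The sketch below is generic: `grid_search`, the candidate values in `rf_grid`, and the scoring function `toy_score` are all hypothetical, since the source does not list the grid it searched. In practice the scoring function would train a random forest with the given parameters and return its validation accuracy; the toy function's optimum is arbitrary and does not reflect the result reported in FIG. 3.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Return (best_params, best_score) over the Cartesian product of param_grid.

    evaluate: callable taking a params dict and returning a validation
    score, where higher is better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical candidate values; not taken from the source.
rf_grid = {"n_estimators": [50, 100, 200], "max_depth": [10, 20, 30]}

def toy_score(params):
    # Stand-in for "train a random forest and measure validation accuracy".
    return -abs(params["max_depth"] - 20) - abs(params["n_estimators"] - 200) / 100

best, score = grid_search(rf_grid, toy_score)
print(best)
```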

In addition, when the classification model is the support vector machine (SVM) model, the model building unit 130 uses a radial basis function (RBF) as the kernel of the support vector machine (SVM) model. The model building unit 130 optimizes the hyperparameters by performing a grid search with 5-fold stratified cross-validation. As shown in (b) of FIG. 3, the ranges of the kernel variables for the optimal hyperparameters all span [values missing from the source text]; the optimum is found at C = 10, where C is defined as the inverse of the regularization parameter, and at a kernel coefficient (γ) of 0.01.
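Stratified k-fold cross-validation keeps the class proportions roughly equal in every fold, which matters for imbalanced data such as real fire cases. The following is a minimal sketch of how the five folds could be generated; the function name `stratified_kfold` and the toy labels are assumptions, and the grid of C and γ candidates evaluated on each fold is not reproduced in the source.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs, keeping class ratios similar per fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    rng = random.Random(seed)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin across the k folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train_idx, val_idx

# Imbalanced toy labels (hypothetical): 20 gasoline, 10 kerosene, 5 candle.
labels = ["gasoline"] * 20 + ["kerosene"] * 10 + ["candle"] * 5
for train_idx, val_idx in stratified_kfold(labels):
    counts = {c: sum(labels[i] == c for i in val_idx) for c in set(labels)}
    print(len(train_idx), len(val_idx), counts)
```

Each candidate (C, γ) pair would be scored by averaging its validation performance over the five folds, and the pair with the best average would be selected.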

Finally, when the classification model is the convolutional neural network (CNN) model, the model building unit 130 builds a convolutional neural network (CNN) model with a network structure such as AlphaNet or GoogLeNet, and adds a 1x1 convolutional layer to optimize performance.

As shown in FIG. 4, the dropout layer of the convolutional neural network (CNN) model is placed after the convolutional layer. The ReLU (Rectified Linear Unit) function is used as the activation function of all intermediate layers, and a sigmoid function is applied to the output layer.
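The building blocks named above (a 1x1 convolution, ReLU in the intermediate layers, dropout after the convolution, and a sigmoid at the output) can be illustrated in plain Python. This is a toy sketch and not the network of FIG. 4: the weights, the dropout rate, and the final "sum then sigmoid" step are hypothetical stand-ins for trained layers.

```python
import math
import random

def relu(x):
    return x if x > 0.0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conv1x1(feature_map, weights):
    """1x1 convolution: mix channels independently at every position.

    feature_map: list of positions, each a list of C_in channel values.
    weights: C_out rows of C_in weights (hypothetical values below).
    """
    return [[sum(w * x for w, x in zip(row, pixel)) for row in weights]
            for pixel in feature_map]

def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each value with probability `rate` in training."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

feature_map = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]  # 2 positions, 3 channels
weights = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]       # 3 channels -> 2 channels

mixed = conv1x1(feature_map, weights)
hidden = [[relu(v) for v in pixel] for pixel in mixed]      # ReLU in hidden layers
rng = random.Random(0)
hidden = [dropout(p, rate=0.5, rng=rng, training=False) for p in hidden]
score = sigmoid(sum(v for pixel in hidden for v in pixel))  # sigmoid at the output
print(mixed, hidden, score)
```

A 1x1 convolution changes the number of channels without touching the spatial layout, which is why it is commonly used to reduce computation in GoogLeNet-style networks.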

When step S230 is complete, the verification unit 140 verifies the plurality of built classification models (S240).

FIG. 5 is an exemplary diagram illustrating the average GC-MS data for each category, and FIG. 6 is an exemplary diagram showing features extracted from GC-MS data.

As shown in FIG. 5, GC-MS data containing fire residues of gasoline or organic solvents show short retention times, whereas data containing diesel or paraffin show long retention times.

As shown in FIG. 6, the GC-MS spectra of gasoline and solvents show similar patterns, because both contain alkylated benzenes. The presence of alkylated benzenes is generally regarded as strong evidence of gasoline, but most alkylated benzenes also produce strong signals in other petroleum solvents.

In addition, fire accelerants with higher alkanes, such as kerosene, diesel, and candles, share isoparaffins as a distinguishable component; therefore, when the fire accelerant is a candle, the paraffin wax components are used as the classification marker.

Accordingly, the classification model building apparatus 100 for classifying fire accelerants according to an embodiment of the present invention obtains key substance information from the GC-MS data through the data preprocessing in step S220. The classification model building apparatus 100 then collects the signal intensities corresponding to the obtained substance information.

The classification model building apparatus 100 trains the plurality of classification models using the collected signal intensities, so that the trained classification models output the corresponding substance for input GC-MS data.

FIG. 7 is an exemplary diagram showing the performance evaluation results of the classification models according to an embodiment of the present invention.

As shown in FIG. 7, the verification unit 140 uses the test set to evaluate the performance of the random forest (RF) model, the support vector machine (SVM) model, and the convolutional neural network (CNN) model. The weighted-average F1-scores, where the F1-score is the harmonic mean of precision (the proportion of positive identifications that are actually positive) and recall (the proportion of actual positives that are identified as positive), were 0.88, 0.88, and 0.92, respectively.
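The weighted-average F1-score can be computed by taking the F1-score of each class and weighting it by the number of true samples in that class. The sketch below uses hypothetical labels and an assumed function name (`weighted_f1`); it illustrates the metric only and does not reproduce the figures reported above.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n_true - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += f1 * n_true
    return total / len(y_true)

# Hypothetical labels for illustration.
y_true = ["gasoline", "gasoline", "gasoline", "diesel", "diesel", "candle"]
y_pred = ["gasoline", "gasoline", "diesel", "diesel", "diesel", "candle"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.833
```

Here gasoline and diesel each score an F1 of 0.8 and candle scores 1.0, so the support-weighted average is (0.8 × 3 + 0.8 × 2 + 1.0 × 1) / 6 ≈ 0.833.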

FIG. 8 is an exemplary diagram showing the ROC curves and PR curves of the classification models according to an embodiment of the present invention.

The area under the receiver operating characteristic (ROC) curve (AU-ROC) is a widely used metric for assessing the performance of a classifier. The closer the AU-ROC is to 0.5, the less able the model is judged to be to distinguish positives from negatives; the closer it is to 1, the better the model is judged to distinguish them.
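The interpretation of AU-ROC given above follows from its equivalence to a ranking probability: it equals the chance that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way for a binary problem; the function name `auc_roc` and the scores are hypothetical, and for the multi-class setting of this document a one-vs-rest treatment per category is assumed rather than stated by the source.

```python
def auc_roc(y_true, scores):
    """AU-ROC as the probability that a random positive outranks a random negative.

    y_true: 1 for positive, 0 for negative. Tied scores count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
print(auc_roc(y_true, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]))  # 1.0, perfectly separated
print(auc_roc(y_true, [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]))  # 0.5, uninformative scores
```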

As shown in FIG. 8, the AU-ROC values of the random forest (RF) model, the support vector machine (SVM) model, and the convolutional neural network (CNN) model according to an embodiment of the present invention are 0.98, 0.98, and 0.99, respectively, so each classification model is highly reliable.

The AU-ROC is a powerful metric, but it is not sufficient for evaluating models trained on imbalanced data sets. The present invention therefore also evaluates the area under the PR (precision-recall) curve (AU-PRC) of each model. As with the AU-ROC, the closer the AU-PRC is to 1, the better the model's performance. The AU-PRC values of the random forest (RF) model, the support vector machine (SVM) model, and the convolutional neural network (CNN) model are 0.94, 0.93, and 0.96, respectively, so the classification models can be judged to be highly reliable.
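One common way to estimate the area under the precision-recall curve is average precision: the samples are ranked by score and the precision is averaged over the ranks at which positives appear. The sketch below assumes this estimator (the source does not say how its AU-PRC was computed) and uses hypothetical scores and an assumed function name.

```python
def average_precision(y_true, scores):
    """Area under the precision-recall curve, estimated as average precision.

    Samples are ranked by descending score; precision is accumulated at each
    rank where a positive (label 1) is found.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos

# Imbalanced toy example: 2 positives among 8 samples (hypothetical scores).
y_true = [1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.05, 0.4]
print(round(average_precision(y_true, scores), 3))  # 0.833
```

Unlike AU-ROC, this value ignores true negatives, so a large majority class cannot inflate it, which is why it is the more informative metric for imbalanced data.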

According to the present invention, artificial intelligence can be used to classify fire accelerants in fire residues from actual suspected arson cases, which can support effective and accurate investigation of fire incidents.

The present invention has been described with reference to the embodiments shown in the drawings, but these are merely illustrative, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible. The true scope of technical protection of the present invention should therefore be determined by the technical spirit of the claims below.

100: classification model building apparatus
110: database
120: data preprocessing unit
130: model building unit
140: verification unit

Claims

A classification model building apparatus comprising:
a database that collects GC-MS data obtained from suspected arson cases;
a data preprocessing unit that preprocesses the collected GC-MS data;
a model building unit that generates a data set using the preprocessed GC-MS data and determines, using the generated data set, hyperparameters for building a plurality of classification models; and
a verification unit that verifies the performance of the built plurality of classification models in extracting fire substances.

The classification model building apparatus of claim 1,
wherein the database classifies the GC-MS data into a plurality of categories according to the substance that caused the fire and stores the classified data, and
wherein the substance includes at least one of a fire accelerant, gasoline, kerosene, diesel, a candle, and an organic solvent.

The classification model building apparatus of claim 1,
wherein the plurality of classification models include a random forest (RF) model, a support vector machine (SVM) model, and a convolutional neural network (CNN) model.

The classification model building apparatus of claim 3,
wherein, when the classification model to be trained is the random forest (RF) model or the support vector machine (SVM) model,
the data preprocessing unit analyzes the GC-MS data to extract a plurality of peaks corresponding to chemical substances, and calculates a cosine similarity by comparing the extracted peaks with reference peaks.

The classification model building apparatus of claim 4,
wherein the data preprocessing unit extracts substance information contained in the GC-MS data when the calculated similarity is greater than a reference value.

The classification model building apparatus of claim 3,
wherein, when the classification model to be trained is the convolutional neural network (CNN) model,
the data preprocessing unit preprocesses the GC-MS data using an equation (not reproduced here) that maps each element of the matrix to its standardized value.

The classification model building apparatus of claim 1,
wherein the model building unit determines the hyperparameters of a random forest (RF) model using the number of estimators and the maximum depth of the model.

The classification model building apparatus of claim 1,
wherein the model building unit determines the hyperparameters using a radial basis function (RBF) as the kernel of a support vector machine (SVM) model.

The classification model building apparatus of claim 1,
wherein the model building unit determines the hyperparameters of a convolutional neural network (CNN) model by placing a dropout layer after the convolutional layer, using a ReLU (Rectified Linear Unit) function as the activation function of all intermediate layers, and applying a sigmoid function to the output layer.

A classification model building method using a classification model building apparatus, the method comprising:
building a database by collecting GC-MS data obtained from suspected arson cases;
preprocessing the collected GC-MS data;
generating a data set using the preprocessed GC-MS data and determining, using the generated data set, hyperparameters for building a plurality of classification models; and
verifying the performance of the built plurality of classification models in extracting fire substances.

The classification model building method of claim 10,
wherein the step of building the database classifies the GC-MS data into a plurality of categories according to the substance that caused the fire and stores the classified data, and
wherein the substance includes at least one of a fire accelerant, gasoline, kerosene, diesel, a candle, and an organic solvent.

The classification model building method of claim 10,
wherein the plurality of classification models include a random forest (RF) model, a support vector machine (SVM) model, and a convolutional neural network (CNN) model.

The classification model building method of claim 12,
wherein, when the classification model to be trained is the random forest (RF) model or the support vector machine (SVM) model,
the preprocessing step analyzes the GC-MS data to extract a plurality of peaks corresponding to chemical substances, and calculates a cosine similarity by comparing the extracted peaks with reference peaks.

The classification model building method of claim 13,
wherein the preprocessing step extracts substance information contained in the GC-MS data when the calculated similarity is greater than a reference value.

The classification model building method of claim 12,
wherein, when the classification model to be trained is the convolutional neural network (CNN) model,
the preprocessing step preprocesses the GC-MS data using an equation (not reproduced here) that maps each element of the matrix to its standardized value.

The classification model building method of claim 10,
wherein the step of determining the hyperparameters determines the hyperparameters of a random forest (RF) model using the number of estimators and the maximum depth of the model.

The classification model building method of claim 10,
wherein the step of determining the hyperparameters determines the hyperparameters using a radial basis function (RBF) as the kernel of a support vector machine (SVM) model.

The classification model building method of claim 10,
wherein the step of determining the hyperparameters determines the hyperparameters of a convolutional neural network (CNN) model by placing a dropout layer after the convolutional layer, using a ReLU (Rectified Linear Unit) function as the activation function of all intermediate layers, and applying a sigmoid function to the output layer.