KR20220000458A - Method and apparatus for predicting diagnostic result in real-time PCR

Info

Publication number: KR20220000458A
Application number: KR1020200078121A
Authority: KR
Inventors: 김민진; 김선빈; 강병규
Original assignee: 제노플랜코리아 주식회사
Priority date: 2020-06-26
Filing date: 2020-06-26
Publication date: 2022-01-04
Also published as: KR102470870B1

Abstract

According to one embodiment of the present invention, a method for predicting a diagnosis result using real-time polymerase chain reaction (PCR) is disclosed. The method includes the steps of: collecting experimental data including experimental result data calculated in an amplification experiment of a target sequence and metadata for the experiment; inputting the experimental data into a machine learning model, extracting characteristic data of the experimental data, and training the machine learning model based on the extracted characteristic data; and predicting a diagnosis result for a target sequence through the trained machine learning model.

Description

Method and apparatus for predicting diagnostic results using real-time PCR

The present invention relates to a method and apparatus for predicting a diagnosis result using real-time PCR, and more particularly, to a method capable of predicting in advance, through machine learning, a diagnosis result using real-time PCR, and to an apparatus that performs the method.

Polymerase chain reaction (PCR) is a technique that repeatedly heats and cools a sample solution containing a nucleic acid so that a region of the nucleic acid having a specific nucleotide sequence is replicated in a chain reaction, exponentially amplifying the nucleic acid having that region. It is widely used for analysis and diagnosis in the life sciences, genetics, and medicine.

In particular, real-time polymerase chain reaction (hereinafter, real-time PCR) is a technique that monitors PCR amplification products in real time using fluorescent substances. It has attracted attention because it enables quantitative analysis with high sensitivity and specificity in a short time.

However, when amplifying a target sequence by real-time PCR, it has not been uncommon for a sample to be judged negative because nothing was detected even at the maximum Ct value (threshold cycle value). This is because conventional real-time PCR relies only on the visible range, and the threshold is set arbitrarily, so the criterion is unclear.

In addition, because a positive or negative result can be determined only by checking result values at or above the threshold line set by the user or the program, the diagnostician has had no choice but to rely on visual inspection to decide whether to repeat the experiment for result values below the threshold line.

Accordingly, when the amplification result for the target sequence in real-time PCR for a specific pathogen is a false negative, the test subject may not be retested, or may behave as usual despite actually being positive, and may thereby transmit the pathogen to others.

Therefore, there is a growing need for a reliable result prediction method that can reduce the frequency of false-negative determinations in real-time PCR.

The present invention is intended to solve the above problems, and its object is to provide a method and apparatus that predict a diagnosis result using real-time PCR so as to reduce the frequency of false-negative determinations.

The technical problems addressed by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the following description.

A method for predicting a diagnosis result using real-time PCR (polymerase chain reaction) according to an embodiment of the present invention may include: collecting experimental data including experimental result data produced in an amplification experiment on a target sequence and metadata for the experiment; inputting the experimental data into a machine learning model to extract characteristic data of the experimental data, and training the machine learning model based on the extracted characteristic data; and predicting a diagnosis result for the target sequence through the trained machine learning model.

In an embodiment, the metadata may include experimental condition information for at least one of a buffer, a primer, the target sequence, Taq polymerase, dNTPs, and a sample.

In an embodiment, the experimental result data may include signal values over the maximum number of amplification cycles in the amplification experiment and determination data for the signal values.

In an embodiment, the experimental result data may further include experimental result data produced for a negative control.

In an embodiment, the method may further include preprocessing the collected experimental data so that it can be used by the machine learning model, and the training may be performed by inputting the preprocessed data into the machine learning model.

In an embodiment, the preprocessing may include: dividing the samples of the collected experimental data into instances and then converting the data values of each sample so divided into real values; and vectorizing the data values of each sample converted into real values.

In an embodiment, the method may further include performing data augmentation on the preprocessed data.

In an embodiment, the data augmentation may be performed by generating local patterns in which the preprocessed data is divided into sections of a predetermined number of cycles of the amplification experiment, and treating each generated local pattern as one sample.

In an embodiment, the data augmentation may additionally generate shifted local patterns divided into sections of the predetermined number of cycles.

In an embodiment, the data augmentation may be performed by jittering the preprocessed data to generate data with artificially added noise.

In an embodiment, the machine learning model may be a convolutional neural network (CNN) model.

In an embodiment, the CNN model may use the Adam optimizer and batch normalization as learning algorithms.

In an embodiment, the training may include training and validating the CNN model through K-fold cross-validation.

In an embodiment, the training may include: dividing the preprocessed data into a training set and a test set; and dividing the training set into a predetermined number of folds, designating one of the folds as a validation set, and performing cross-validation using the remaining folds as the training set.

An apparatus for predicting a diagnosis result using real-time PCR according to an embodiment of the present invention may include: an input unit that receives experimental data including experimental result data produced in an amplification experiment on a target sequence and metadata for the experiment; and a processor that inputs the received experimental data into a machine learning model to extract characteristic data of the experimental data, and predicts a diagnosis result for the target sequence through the machine learning model trained based on the extracted characteristic data.

In an embodiment, the metadata may include experimental condition information for at least one of a buffer, a primer, the target sequence, Taq polymerase, dNTPs, and a sample.

In an embodiment, the experimental result data may further include experimental result data produced for a negative control.

In an embodiment, the processor may train the machine learning model by preprocessing the received experimental data so that it can be used by the machine learning model and inputting the preprocessed data into the machine learning model.

In an embodiment, the processor may preprocess the experimental data by dividing the samples of the received experimental data into instances, converting the data values of each sample so divided into real values, and vectorizing the data values of each sample converted into real values.

According to the method and apparatus for predicting a diagnosis result using real-time PCR according to an embodiment of the present invention, positive, negative, and undetermined results can be reliably distinguished through machine learning and provided to the diagnostician, so that the frequency of false-negative determinations can be reduced.

A brief description of each drawing is provided for a fuller understanding of the drawings cited in the detailed description of the present invention.
FIG. 1 is a flowchart illustrating a model building process by deep learning according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a method for predicting a diagnosis result using real-time PCR according to an embodiment of the present invention.
FIG. 3 shows a screen displaying amplification experiment data for each sample of real-time PCR according to an embodiment of the present invention.
FIG. 4 shows a screen displaying positive or negative determination results for each sample of real-time PCR according to an embodiment of the present invention.
FIG. 5 shows a pipeline of a CNN model according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining a method of training and evaluating a CNN model through K-fold cross-validation according to an embodiment of the present invention.
FIG. 7 is a block diagram schematically illustrating the configuration of an apparatus for predicting a diagnosis result using real-time PCR according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. Note that, in the accompanying drawings, the same components are denoted by the same reference numerals wherever possible. Detailed descriptions of well-known functions and configurations that may obscure the gist of the present invention will be omitted.

Some embodiments of the present invention may be represented by functional block configurations and various processing steps. Some or all of these functional blocks may be implemented by any number of hardware and/or software components that perform specific functions. For example, the functional blocks of the present invention may be implemented by one or more microprocessors, or by circuit configurations for a given function. The functional blocks may also be implemented in various programming or scripting languages, and may be implemented as algorithms running on one or more processors. In addition, the present invention may employ conventional techniques for electronic configuration, signal processing, and/or data processing.

In addition, terms such as "... unit" and "module" used in this specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software. A "unit" or "module" may also be implemented by a program that is stored in an addressable storage medium and can be executed by a processor.

For example, a "unit" or "module" may be implemented by components such as software components, object-oriented software components, class components, and task components, as well as by processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "indirectly connected" with a device interposed between them. Throughout the specification, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

In addition, the connecting lines or connecting members between components shown in the drawings merely illustrate functional connections and/or physical or circuit connections. In an actual device, connections between components may be realized by various functional, physical, or circuit connections that can be substituted or added.

The terms used in the present invention are used only to describe specific embodiments and are not intended to limit the present invention. A singular expression includes the plural unless the context clearly indicates otherwise. In the present invention, terms such as "include" or "have" are intended to indicate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof. That is, a statement in the present invention that a specific configuration is "included" does not exclude other configurations, and means that additional configurations may fall within the practice of the present invention or the scope of its technical idea.

Some components of the present invention may not be essential components that perform essential functions of the present invention, but optional components that merely improve performance. The present invention may be implemented with only the components essential to realizing its essence, excluding components used merely to improve performance, and a structure that includes only the essential components and excludes such optional components also falls within the scope of the present invention.

FIG. 1 is a flowchart briefly illustrating a model building process by deep learning according to an embodiment of the present invention.

The model building process may consist of i) a data collection process that collects data, ii) a data preprocessing process that preprocesses the data gathered in the data collection process, iii) a feature extraction process that extracts the necessary characteristics from the preprocessed data, and iv) a model construction process that makes predictions using the extracted characteristics. The model construction process may include designing a model and properly training and evaluating the designed model.

In the data collection process, experimental data may be collected by performing real-time PCR (polymerase chain reaction) experiments (S110). The collected experimental data may include experimental result data produced in an amplification experiment on the target sequence and metadata for the experiment.

In the data preprocessing process, the collected experimental data may be divided into sample units (S120), and it may be determined whether the experimental data divided into sample units consists of real numbers (S130). If the experimental data is real-valued, vectorization is performed (S140); otherwise, the data is first converted into real numbers (S150) and then vectorized (S140).

In the feature extraction process, characteristic data is extracted from the preprocessed data (S160). In the model construction process, a machine learning model is trained based on the extracted characteristic data (S170), and model evaluation may be performed to select the optimal model among the trained machine learning models (S180).

The machine learning model built as described above is used to predict a diagnosis result using real-time PCR. A specific method for doing so will be described with reference to FIGS. 2 to 6.

FIG. 2 is a flowchart illustrating a method for predicting a diagnosis result using real-time PCR according to an embodiment of the present invention.

First, experimental data including experimental result data produced in an amplification experiment on a target sequence and metadata for the experiment may be collected (S210).

Here, the experimental result data may include signal values over the maximum number of amplification cycles (Ct_max) of the experiment and determination data for the experimental result. The signal value is a value obtained by quantifying the amplification curve of the target sequence with respect to the threshold, and the determination data for the experimental result indicates, based on the signal value, whether amplification of the target sequence succeeded. Such data may be labeled with the result of determining whether the diagnostic test is positive, negative, or undetermined.

Meanwhile, FIG. 3 shows a screen displaying amplification experiment data for each sample of real-time PCR according to an embodiment of the present invention.

As shown in FIG. 3, on the screen displaying the amplification experiment data for each sample (31-1 to 31-10) for the target sequence of real-time PCR, the amplified amount of each sample may increase exponentially, doubling each time one cycle is repeated.

Because dNTPs are consumed once more than a certain number of cycles have been repeated, the efficiency gradually falls as the reaction proceeds, and the amount of target-sequence DNA reaches a plateau, as in the curve shapes shown in FIG. 3. The larger the initial amount (concentration) of target-sequence DNA, the sooner the amplification product reaches a detectable amount, so the amplification curve can be observed earlier.

Here, the number of cycles required for the amplified amount of each sample to reach the arbitrarily set threshold (32) is called the Ct value (threshold cycle), and the Ct value varies inversely with the initial amount of target-sequence DNA: the larger the initial amount, the smaller the Ct value. The threshold (32) represents a statistically significant signal level well above the baseline, and may generally be set at the point where the rise of the amplification signal becomes clearly distinguishable.
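
As a non-limiting illustration, the determination of the Ct value described above may be sketched as follows. The function names `threshold_cycle` and `ideal_curve`, the linear interpolation between cycles, and the idealized doubling model capped at a plateau are assumptions made for this example and are not specified in this description.

```python
def threshold_cycle(signal, threshold):
    """Return the (fractional) cycle at which `signal` first reaches `threshold`.

    signal[i] is the reading after cycle i + 1.
    Returns None if the threshold is never reached within the run.
    """
    for i, value in enumerate(signal):
        if value >= threshold:
            if i == 0:
                return 1.0
            prev = signal[i - 1]
            # Linear interpolation between cycle i and cycle i + 1 (assumption).
            return i + (threshold - prev) / (value - prev)
    return None


def ideal_curve(initial_amount, cycles, plateau=1000.0):
    """Hypothetical idealized curve: doubling every cycle, capped at a plateau."""
    return [min(initial_amount * 2 ** c, plateau) for c in range(1, cycles + 1)]


# Two illustrative samples that differ only in their initial amount.
ct_high = threshold_cycle(ideal_curve(1.0, 40), threshold=100.0)
ct_low = threshold_cycle(ideal_curve(0.001, 40), threshold=100.0)
```

Under this idealized model, the sample with the larger initial amount crosses the threshold at an earlier cycle (`ct_high` is smaller than `ct_low`), and a curve that never reaches the threshold yields no Ct value at all.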

The experimental result data produced in the amplification experiment may include the signal values obtained by quantifying the amplification curve of each sample, a quantitative value obtained by back-calculating the initial amount of DNA, and result data indicating whether each sample was judged positive, negative, or undetermined on that basis.

FIG. 4 shows result data indicating whether each sample shown in FIG. 3 was judged positive, negative, or undetermined.

As shown in FIG. 4, samples 1 to 5 (31-1 to 31-5) may be judged positive, samples 6 to 9 (31-6 to 31-9) undetermined (retest required), and sample 10 (31-10) negative. For samples 1 to 5 (31-1 to 31-5), the number of cycles needed to reach the threshold was smaller than a preset value; for samples 6 to 9 (31-6 to 31-9), it was larger than the preset value. Because the diagnosis result for samples 6 to 9 (31-6 to 31-9) may be a false negative, a retest may be carried out for them.

Sample 10 (31-10) was judged negative because its experimental value (amplification amount) over the maximum number of amplification cycles (Ct_max) of the experiment was at or below the preset value.
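
As a non-limiting illustration, the three-way determination described for FIG. 4 may be sketched as a simple rule. The function name `label_sample`, the parameter `ct_cutoff`, the numeric values, and the treatment of a Ct value exactly equal to the cutoff are assumptions made for this example; the description itself only distinguishes "smaller than" and "larger than" a preset value, and a sample that never reaches the threshold is modeled here as having no Ct value.

```python
def label_sample(ct, ct_cutoff):
    """Label a sample from its Ct value (None means the threshold was never reached).

    - no Ct within the maximum number of cycles -> "negative"
    - Ct smaller than the preset cutoff         -> "positive"
    - otherwise                                 -> "undetermined" (retest required)
    """
    if ct is None:
        return "negative"
    if ct < ct_cutoff:
        return "positive"
    return "undetermined"


# Hypothetical Ct values and cutoff, for illustration only.
labels = [label_sample(ct, ct_cutoff=35.0) for ct in (22.4, 37.8, None)]
```

Labels of this kind correspond to the determination data that accompanies the signal values in the experimental result data.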

Because the Ct value depends on the initial copy number of the target sequence, a sample with a low concentration of the target sequence has a higher Ct value, and may be judged negative if it fails to cross the threshold within the maximum Ct range.

Therefore, so that the patterns of both high-concentration and low-concentration samples can be collected and learned, the same sample may be divided into multiple concentration ranges and tested at each.

Meanwhile, the metadata may refer to data describing the experimental conditions, such as the buffer, primers, target sequences, Taq polymerase, and dNTPs used in the experiment, and the meta properties of the sample.

In addition, at this step, an experiment on a negative control may also be performed to collect experimental data.

Noise may appear in the collected data because trace impurities are mixed into the negative control or because of problems with the fluorescence detector. Since such noise can take a variety of patterns, experiments on negative control samples may be performed a sufficient number of times so that experimental result data for the negative control can be collected.

Thereafter, the collected experimental data may be preprocessed so that it can be used by the machine learning model (S220).

At this time, the samples mixed together in the collected experimental data are divided into instances, the data values of each sample so divided are converted into real values, and the real-valued data of each sample is vectorized so that it can be applied to the machine learning model. The division of samples into instances may also be performed for experiments conducted over multiple concentration ranges.

Meanwhile, whereas the signal values are real numbers, the metadata consists of non-real values such as categorical data or boolean data, so these values may be converted into real values before vectorization is performed.
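
As a non-limiting illustration, the conversion of metadata into real values may be sketched as follows. One-hot encoding of categorical fields is an assumption made for this example, since the description does not specify the conversion scheme; the function name `encode_metadata` and the field names and values are likewise hypothetical.

```python
def encode_metadata(record, categories):
    """Convert one metadata record into a flat list of real values.

    Booleans become 0.0/1.0, categorical fields listed in `categories`
    are one-hot encoded (assumption), and anything else is cast to float.
    Fields are visited in sorted order so that the layout is stable.
    """
    vector = []
    for field in sorted(record):
        value = record[field]
        if isinstance(value, bool):
            vector.append(1.0 if value else 0.0)
        elif field in categories:
            vector.extend(1.0 if value == c else 0.0 for c in categories[field])
        else:
            vector.append(float(value))
    return vector


# Hypothetical experiment metadata, for illustration only.
record = {"buffer": "A", "hot_start": True, "primer_conc_nM": 200}
categories = {"buffer": ["A", "B", "C"]}
metadata_vector = encode_metadata(record, categories)
```

With the hypothetical record above, the buffer field expands to three real values, the boolean to one, and the numeric field to one, giving a five-element vector.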

Here, the metadata converted into real values is not used in the convolution operation, because the aim is not to extract local information or patterns from it.
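
As a non-limiting illustration, keeping the metadata out of the convolution may be sketched as follows: only the signal values pass through a one-dimensional convolution, and the real-valued metadata is appended afterward. Concatenating the metadata with the convolution output, the ReLU activation, the single hand-picked kernel, and the function names are all assumptions made for this example; the description states only that the metadata is not used in the convolution operation.

```python
def conv1d(signal, kernel):
    """Valid-mode 1D convolution as used in CNNs (no kernel flip)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def build_features(signal, metadata_vector, kernel):
    """Convolve the signal only, then append the metadata unchanged."""
    local = [max(0.0, v) for v in conv1d(signal, kernel)]  # ReLU (assumption)
    return local + list(metadata_vector)


# A difference kernel responds to the cycle-to-cycle rise of the curve.
features = build_features([1.0, 2.0, 4.0, 8.0], [0.0, 1.0], kernel=[-1.0, 1.0])
```

In a trained CNN the kernels would be learned rather than fixed; the fixed difference kernel is used here only to keep the sketch self-contained.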

Meanwhile, data augmentation may be performed to increase the amount of preprocessed data.

For example, data augmentation may be performed by generating local patterns in which the preprocessed data is divided into sections of a preset number of cycles of the amplification experiment, and treating each generated local pattern as one sample. For example, in the present invention, local patterns divided into sections of 10 cycles may be generated. In this way, it becomes possible to predict whether DNA amplification will succeed from data covering only 10 cycles.

In this case, shifted local patterns divided into sections of 10 cycles may additionally be generated.
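The sectioning and shifting described above can be sketched as a sliding window over the per-cycle signal. This is a minimal illustration under assumptions: the function name `cycle_windows`, the shift of one cycle, and the placeholder signal are not taken from the text.

```python
from typing import List, Sequence


def cycle_windows(signal: Sequence[float], width: int = 10, step: int = 10) -> List[List[float]]:
    """Cut a per-cycle signal into windows of `width` cycles.

    step == width yields non-overlapping sections;
    step < width yields shifted (overlapping) windows.
    """
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    return [list(signal[i:i + width])
            for i in range(0, len(signal) - width + 1, step)]


# Placeholder 40-cycle signal; real fluorescence values would be used in practice.
signal = [float(c) for c in range(40)]

blocks = cycle_windows(signal, width=10, step=10)   # non-overlapping 10-cycle sections
shifted = cycle_windows(signal, width=10, step=1)   # sections shifted by one cycle

print(len(blocks), len(shifted))  # 4 31
```

Each window is treated as one sample, so a single 40-cycle curve yields 4 samples without shifting and 31 samples with a one-cycle shift.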

In addition, data augmentation may be performed by jittering the preprocessed data to generate data with artificially added noise. Adding noise artificially can increase the diversity of the data learned by the machine learning model.
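A minimal sketch of jittering follows, assuming zero-mean Gaussian noise; the text does not state the noise distribution or its magnitude, so `sigma` and the sample values are illustrative only.

```python
import random
from typing import List, Optional, Sequence


def jitter(signal: Sequence[float], sigma: float = 0.01,
           rng: Optional[random.Random] = None) -> List[float]:
    """Return a copy of `signal` with zero-mean Gaussian noise added to each value."""
    rng = rng or random.Random()
    return [x + rng.gauss(0.0, sigma) for x in signal]


rng = random.Random(0)                      # fixed seed for reproducibility
original = [0.10, 0.11, 0.13, 0.18, 0.30]   # illustrative signal values
augmented = [jitter(original, sigma=0.01, rng=rng) for _ in range(3)]

for row in augmented:
    print([round(x, 3) for x in row])
```

Each call produces a new noisy copy, so one measured curve can contribute several training samples.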

Such data augmentation not only increases the number of samples in the collected data but also yields a more generalized (regularized) model. This prevents overfitting of the machine learning model and can thereby improve classification and prediction performance on unknown samples.

The preprocessed data is then divided into a training set, a validation set, and a test set through a data split process, which is described in detail with reference to FIG. 6.

Thereafter, the preprocessed data may be input to the machine learning model, characteristic data of the preprocessed data may be extracted, and the machine learning model may be trained based on the extracted characteristic data (S230).

A machine learning model may be used to predict DNA amplification. Various machine learning models may be used, including mathematical models such as a Support Vector Machine (SVM), Bayesian models such as a Bayesian Network, and ensemble models such as Random Forest or Gradient Boosting.

In the present invention, however, because local patterns and characteristics are important factors, a Convolutional Neural Network (CNN), one of the representative machine learning models, was used.

Among machine learning models, a CNN is mainly applied to data that carries local information, such as images. A CNN uses filters to extract local information from its input and then re-extracts the important information from what was extracted through a technique called pooling. A CNN can predict an output value by repeating this process several times.

In the present invention, the CNN may be used to identify locally the amount of change and the changing patterns of the signal values in the preprocessed experimental data, and to extract mainly those patterns that carry significant information.

Thereafter, a diagnosis result for the target sequence may be predicted through the trained machine learning model (S240). For example, when experimental data is collected from a real-time amplification experiment performed on an arbitrary sample in order to diagnose a specific pathogen, the machine learning model can accurately predict the diagnosis result for the pathogen associated with the target sequence (that is, whether the DNA is amplified) and provide it to the diagnostician. The prediction is based on the amount of change and/or the characteristics of the changing pattern of the amplification signal over a predetermined number of early cycles (for example, from the start to cycle 10), before the amplification signal reaches the titer.

FIG. 5 shows the pipeline of a CNN model according to an embodiment of the present invention.

As shown in FIG. 5, the CNN model may include a feature extraction region 52 and a classification region 53. The feature extraction region 52 is also called the hidden layers region, and the classification region 53 is also called the neural network region.

In an embodiment of the present invention, the feature extraction region 52 may have a structure in which three convolution layers and three max-pooling layers are stacked repeatedly, and the classification region 53 may consist of two fully connected layers. An activation function may be placed between each convolution layer and max-pooling layer.

The convolution layer extracts features of the input data 51 through filters, and the max-pooling layer extracts the core of the extracted features. In the present invention, the convolution layer extracts local characteristics or patterns of the signal values through filters, and the max-pooling layer performs dimensionality reduction, keeping only the important features among those extracted by the convolution layer and discarding the features considered unimportant.
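The roles of the convolution layer, activation function, and max-pooling layer can be illustrated with a minimal one-dimensional sketch. The filter `diff_kernel`, the ReLU activation, the pool size, and the signal values are assumptions chosen for illustration; in a trained CNN the filter weights are learned rather than fixed.

```python
from typing import List, Sequence


def conv1d(signal: Sequence[float], kernel: Sequence[float], stride: int = 1) -> List[float]:
    """1-D cross-correlation without padding, as used in CNN convolution layers."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(signal) - k + 1, stride)]


def relu(values: Sequence[float]) -> List[float]:
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


def max_pool1d(values: Sequence[float], size: int = 2) -> List[float]:
    """Keep the maximum of each non-overlapping block of `size` values."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


# Illustrative amplification-like signal over 8 cycles.
signal = [0.0, 0.0, 0.1, 0.3, 0.7, 1.5, 3.1, 6.3]
diff_kernel = [-1.0, 1.0]   # responds to the cycle-to-cycle increase

features = relu(conv1d(signal, diff_kernel))   # 7 local features
pooled = max_pool1d(features, size=2)          # 3 features after pooling

print(len(features), len(pooled))  # 7 3
```

Here the filter captures the local rate of change, and pooling keeps the strongest response in each pair of neighbouring positions while halving the length.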

Specifically, a feature extraction process usually maps high-dimensional data to low-dimensional data, but raw data of 40 or fewer dimensions has too few features from which to derive a result. The dimension of the data values used in the present invention is Ct_max, which is not rich enough to actually solve the problem, so a feature extraction process is required.

The various filters used in the CNN model can each extract a different local pattern. In this process, a large amount of characteristic data can be extracted from the original low-dimensional data. However, if all of the extracted characteristic data is then used, characteristic data that does not actually contribute to prediction is also used, which can cause an overfitting problem.

Therefore, after the characteristic data is extracted, the important characteristic data must be selected from it, and this selection is handled by the max-pooling layer.

The convolution layers and max-pooling layers alternate repeatedly so that the features of the input data are sufficiently extracted. The max-pooling layer may, however, be used optionally. Because the max-pooling layer reduces the size of the data while introducing a degree of random loss, it enables efficient training and helps prevent overfitting.

In the feature extraction region 52, the filter size, stride, and padding are adjusted so that the input and output sizes of the feature extraction part are calculated and matched.
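The relationship between filter size, stride, padding, and output size follows the standard formula floor((n + 2p - k) / s) + 1 for an input of length n. The sketch below traces the length of a 10-cycle input through three convolution and pooling stages; the kernel sizes, strides, and padding are assumptions, since the text does not give them.

```python
def conv_output_length(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a 1-D convolution or pooling layer."""
    if kernel > n + 2 * padding:
        raise ValueError("kernel is larger than the padded input")
    return (n + 2 * padding - kernel) // stride + 1


length = 10          # one 10-cycle window
trace = []
for _ in range(3):   # three convolution + max-pooling stages
    length = conv_output_length(length, kernel=3, stride=1, padding=1)  # convolution
    trace.append(length)
    length = conv_output_length(length, kernel=2, stride=2)             # max pooling
    trace.append(length)

print(trace)  # [10, 5, 5, 2, 2, 1]
```

With a padding of 1 and a kernel of 3, each convolution preserves the length, so only the pooling stages shrink the data.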

The output of the feature extraction region 52 is characteristic data in the form of a feature map. A flatten layer converts the feature map into an array that serves as the input to the neural network, and a classification model in the form of fully connected layers may thereby be generated. A softmax function is finally applied to the output of the fully connected layers, so that whether the DNA of the sample will be amplified can be predicted as a probability (an output value between 0 and 1) 54 (S240).
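As a minimal sketch of the final step, the softmax function converts the raw outputs of the last fully connected layer into probabilities that sum to 1. The two-class layout (not amplified, amplified) and the example values are assumptions for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical outputs of the final fully connected layer:
# index 0 = not amplified, index 1 = amplified.
probs = softmax([0.3, 1.9])
print([round(p, 3) for p in probs])
```

The second entry can be read as the predicted probability that the sample will be amplified.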

Meanwhile, the model training process requires a loss function and a learning algorithm. The loss function is the cross entropy commonly used in classification models, as given by Equation 1.

Model training proceeds in the direction that minimizes the value of the cross-entropy loss function. Equation 2 gives the loss function J_regularized, newly defined using L2 regularization in order to regularize the model.
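Equations 1 and 2 are not reproduced in this text. For reference only, the standard textbook forms of the binary cross entropy and of an L2-regularized loss are shown below. The symbols N (number of samples), y_i (true label), \hat{y}_i (predicted probability), w (model weights), and \lambda (regularization strength) are assumptions, and the exact scaling of the penalty term may differ from the original equations.

```latex
% Assumed standard form of the cross-entropy loss (cf. Equation 1)
J(w) = -\frac{1}{N} \sum_{i=1}^{N}
       \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]

% Assumed standard form of the L2-regularized loss (cf. Equation 2)
J_{\mathrm{regularized}}(w) = J(w) + \frac{\lambda}{2} \lVert w \rVert_2^2
```

The penalty term grows with the magnitude of the weights, so minimizing J_regularized discourages large weights and thereby limits overfitting.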

However, because this approach to minimizing the loss function can easily become trapped in local minima, it is preferable to use a learning algorithm other than conventional Stochastic Gradient Descent (SGD).

In an embodiment of the present invention, the ADAM optimizer and batch normalization were used as the learning algorithm.

Unlike SGD, the ADAM optimizer can assign a different learning rate to each parameter. A particular parameter w_t can therefore be updated with a larger step, and two kinds of moments (the mean and the uncentered variance) are applied so that training does not become trapped in a local minimum.
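The per-parameter step size and the two moment estimates can be seen in a minimal sketch of the standard Adam update rule. The hyperparameter values are the commonly used defaults, not values from the text, and the quadratic objective is a toy example.

```python
import math
from typing import List, Sequence, Tuple


def adam_step(w: Sequence[float], grad: Sequence[float],
              m: Sequence[float], v: Sequence[float], t: int,
              lr: float = 1e-3, beta1: float = 0.9, beta2: float = 0.999,
              eps: float = 1e-8) -> Tuple[List[float], List[float], List[float]]:
    """One Adam update; `t` is the 1-based step count used for bias correction."""
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, grad, m, v):
        mi = beta1 * mi + (1.0 - beta1) * gi         # first moment (mean)
        vi = beta2 * vi + (1.0 - beta2) * gi * gi    # second moment (uncentered variance)
        m_hat = mi / (1.0 - beta1 ** t)              # bias-corrected estimates
        v_hat = vi / (1.0 - beta2 ** t)
        new_w.append(wi - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v


# Toy objective f(w) = w0^2 + 10 * w1^2, whose gradients differ in scale by 10x.
w, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 51):
    grad = [2.0 * w[0], 20.0 * w[1]]
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)

loss = w[0] ** 2 + 10.0 * w[1] ** 2
print(w, loss)
```

Although the two gradients differ by a factor of ten, dividing by the square root of the second moment gives both parameters a similar effective step, which is the per-parameter learning rate behaviour described above.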

Meanwhile, the data used for training is local information cut into sections of 10 cycles. Overfitting is checked based on the difference between the test accuracy and the training accuracy, and it can be prevented by adjusting the number of epochs.
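One possible way to choose the number of epochs from the accuracy gap is sketched below. The selection rule, the threshold `max_gap`, and the accuracy histories are hypothetical; the text states only that the gap is monitored and the number of epochs adjusted.

```python
from typing import Sequence


def select_epochs(train_acc: Sequence[float], test_acc: Sequence[float],
                  max_gap: float = 0.06) -> int:
    """Return the 1-based epoch with the best test accuracy among epochs whose
    train-test accuracy gap does not exceed `max_gap`."""
    if len(train_acc) != len(test_acc) or not train_acc:
        raise ValueError("accuracy histories must be non-empty and equally long")
    epochs = range(len(train_acc))
    eligible = [e for e in epochs if train_acc[e] - test_acc[e] <= max_gap]
    if not eligible:
        # Fall back to the epoch with the smallest gap.
        eligible = [min(epochs, key=lambda e: train_acc[e] - test_acc[e])]
    return max(eligible, key=lambda e: test_acc[e]) + 1


# Hypothetical accuracy histories over six epochs.
train_history = [0.70, 0.80, 0.88, 0.93, 0.97, 0.99]
test_history = [0.68, 0.78, 0.85, 0.88, 0.87, 0.86]

chosen = select_epochs(train_history, test_history)
print(chosen)  # 4
```

After epoch 4 the training accuracy keeps rising while the test accuracy falls, which is the widening gap that signals overfitting.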

In this process, cross-validation may be performed to evaluate the performance of the several trained CNN models and to select the final model from among them.

Specifically, as shown in FIG. 6, the preprocessed data 60 is divided into a predetermined number of training sets 61 and test sets 62, and the training set 61 is split by K-fold cross-validation so that all of the data can be used in both training and validation. For example, the training set may consist of 10 folds, one of which is designated as the validation fold while the remaining 9 folds are used as training folds, and 10 rounds of cross-validation may be performed.
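The 10-fold split can be sketched as an index generator in which each fold serves once as the validation fold. This is a minimal illustration: the function name `k_fold_indices` and the sample count are hypothetical, the test set is assumed to have been held out beforehand, and in practice the indices would usually be shuffled before splitting.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training indices, validation indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


splits = list(k_fold_indices(25, k=10))
for round_no, (training, validation) in enumerate(splits, start=1):
    print(round_no, len(training), validation)
```

Every sample appears in exactly one validation fold and in the training folds of the other nine rounds.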

FIG. 7 is a block diagram schematically illustrating the configuration of an apparatus for predicting a diagnosis result using real-time PCR (Polymerase Chain Reaction) according to an embodiment of the present invention.

As shown in FIG. 7, an apparatus 700 according to an embodiment of the present invention may include an input unit 710, a memory 720, and a processor 730.

The input unit 710 may receive experimental data including experimental result data calculated in an amplification experiment of a target sequence and metadata for the experiment. The input unit 710 may receive the experimental data from an external device or an external recording medium through wired or wireless communication.

The memory 720 may store programs and data necessary for the operation of the apparatus 700. In an embodiment, the memory 720 may store control information or data included in signals transmitted and received by the apparatus 700. The memory 720 may be a storage medium such as ROM, RAM, a hard disk, a CD-ROM, or a DVD, or a combination of storage media. There may also be a plurality of memories 720. According to an embodiment, the memory 720 may store a program for performing the operations of the above-described embodiments of the present invention.

The processor 730 may control the series of processes by which the apparatus 700 operates. For example, it may control the components of the apparatus 700 so that the apparatus 700 performs the operations for predicting a diagnosis result using real-time PCR according to an embodiment. There may be a plurality of processors 730, and the processor 730 may perform the operations of the apparatus 700 by executing a program stored in the memory 720.

In an embodiment, the processor 730 may preprocess the input experimental data so that it can be used in the machine learning model. The processor 730 may then input the preprocessed data to the machine learning model, extract characteristic data of the preprocessed data, and calculate a predicted value for the amplification result of the target sequence through the machine learning model trained based on the extracted characteristic data.

In this case, the metadata may include experimental condition information on at least one of a buffer, a primer, a target sequence, Taq polymerase, dNTPs, and a sample. The experimental result data may include signal values for the maximum number of amplification repetitions in the amplification experiment and judgment data for the signal values. The experimental result data may further include experimental result data calculated for a negative control.

According to an embodiment, the processor 730 may separate each sample of the collected experimental data into instances, determine whether the data value of each instance is a real value, convert it to a real value if it is not, and then vectorize the data values of each sample.

According to an embodiment, the processor 730 may perform data augmentation on the preprocessed data. Specifically, the processor 730 may perform data augmentation by generating local patterns in which the preprocessed data is divided into sections of a predetermined number of cycles of the amplification experiment (for example, 10 cycles), and configuring each generated local pattern as one sample.

According to an embodiment, the processor 730 may perform data augmentation by additionally generating shifted local patterns divided into sections of the predetermined number of cycles (for example, 10 cycles), or by jittering the preprocessed data to generate data with artificially added noise.

According to an embodiment, the processor 730 may train and validate the CNN model through K-fold cross-validation. In this case, the processor 730 may divide the preprocessed data into a training set and a test set, configure the training set as 10 folds, designate one of the folds as the validation set, and perform cross-validation using the remaining folds as the training set.

According to the various embodiments of the present invention described above, positive, negative, and indeterminate results can be clearly distinguished through machine learning and provided to the diagnostician, so that the frequency of false-negative determinations can be reduced.

Meanwhile, the above-described embodiments can be written as a program executable on a computer and can be implemented on a general-purpose digital computer that runs the program using a computer-readable medium. The data structures used in the above-described embodiments may be recorded on a computer-readable medium through various means. The above-described embodiments may also be implemented in the form of a recording medium containing computer-executable instructions, such as program modules executed by a computer. For example, methods implemented as software modules or algorithms may be stored on a computer-readable recording medium as computer-readable and computer-executable code or program instructions.

As an example, a computer-readable medium may be provided that stores a program for performing the steps of: i) collecting experimental data including experimental result data calculated in an amplification experiment of a target sequence and metadata for the experiment; ii) preprocessing the collected experimental data so that it can be used in a machine learning model; iii) inputting the preprocessed data to the machine learning model, extracting characteristic data of the preprocessed data, and training the machine learning model based on the extracted characteristic data; and iv) predicting a diagnosis result for the target sequence through the trained machine learning model.

The computer-readable medium may be any recording medium that can be accessed by a computer and may include volatile and non-volatile media and removable and non-removable media. The computer-readable medium may include, but is not limited to, magnetic storage media such as ROM, floppy disks, and hard disks, and optically readable media such as CD-ROMs and DVDs. The computer-readable medium may also include computer storage media and communication media.

In addition, a plurality of computer-readable recording media may be distributed across computer systems connected by a network, and data stored on the distributed recording media, for example program instructions and code, may be executed by at least one computer.

The specific implementations described in the present invention are merely examples and do not limit the scope of the present invention in any way. For brevity of the specification, descriptions of conventional electronic components, control systems, software, and other functional aspects of such systems may be omitted.

Claims

1. A method for predicting a diagnosis result using real-time PCR (Polymerase Chain Reaction), the method comprising:
collecting experimental data including experimental result data calculated in an amplification experiment of a target sequence and metadata for the experiment;
inputting the experimental data into a machine learning model, extracting characteristic data of the experimental data, and training the machine learning model based on the extracted characteristic data; and
predicting a diagnosis result for the target sequence through the trained machine learning model.

2. The method of claim 1,
wherein the metadata
includes experimental condition information on at least one of a buffer, a primer, a target sequence, Taq polymerase, dNTPs, and a sample.

3. The method of claim 1,
wherein the experimental result data
includes signal values for the maximum number of amplification repetitions in the amplification experiment and judgment data for the signal values.

4. The method of claim 3,
wherein the experimental result data
further includes experimental result data calculated for a negative control.

5. The method of claim 1,
further comprising preprocessing the collected experimental data so that it can be used in the machine learning model,
wherein the training is performed by inputting the preprocessed data into the machine learning model.

6. The method of claim 5,
wherein the preprocessing comprises:
separating each sample of the collected experimental data into instances, and then converting the data values of each sample separated into instances to real values; and
vectorizing the data values of each sample converted to real values.

7. The method of claim 5,
further comprising performing data augmentation on the preprocessed data.

8. The method of claim 7,
wherein the performing of the data augmentation
is carried out by generating local patterns in which the preprocessed data is divided into sections of a predetermined number of cycles of the amplification experiment, and configuring each generated local pattern as one sample.

9. The method of claim 8,
wherein the performing of the data augmentation
further generates shifted local patterns divided into sections of the predetermined number of cycles.

10. The method of claim 7,
wherein the performing of the data augmentation
is carried out by jittering the preprocessed data to generate data with artificially added noise.

11. The method of claim 5,
wherein the machine learning model
is a Convolutional Neural Network (CNN) model.

12. The method of claim 11,
wherein the CNN model
uses an ADAM optimizer and batch normalization as the learning algorithm.

13. The method of claim 11,
wherein the training
trains and validates the CNN model through K-fold cross-validation.

14. The method of claim 13,
wherein the training comprises:
dividing the preprocessed data into a training set and a test set; and
configuring the training set as a predetermined number of folds, designating one of the folds as a validation set, and performing cross-validation using the remaining folds as the training set.

15. An apparatus for predicting a diagnosis result using real-time PCR, the apparatus comprising:
an input unit for receiving experimental data including experimental result data calculated in an amplification experiment of a target sequence and metadata for the experiment; and
a processor for inputting the received experimental data into a machine learning model, extracting characteristic data of the experimental data, and predicting a diagnosis result for the target sequence through the machine learning model trained based on the extracted characteristic data.

16. The apparatus of claim 15,
wherein the metadata
includes experimental condition information on at least one of a buffer, a primer, a target sequence, Taq polymerase, dNTPs, and a sample.

17. The apparatus of claim 15,
wherein the experimental result data
includes signal values for the maximum number of amplification repetitions in the amplification experiment and judgment data for the signal values.

18. The apparatus of claim 17,
wherein the experimental result data
further includes experimental result data calculated for a negative control.

19. The apparatus of claim 15,
wherein the processor
preprocesses the received experimental data so that it can be used in the machine learning model, and trains the machine learning model by inputting the preprocessed data into the machine learning model.

20. The apparatus of claim 19,
wherein the processor
performs the preprocessing by separating each sample of the received experimental data into instances, converting the data values of each sample separated into instances to real values, and vectorizing the data values of each sample converted to real values.

21. A computer program product comprising a recording medium storing a program for executing the method of any one of claims 1 to 14.