KR20220074201A

KR20220074201A - Machine Learning System based on Convolutional Neural Network that Mimic Human Visual Inspection for Analysis of Epigenetic Data and its Operation Method

Info

Publication number: KR20220074201A
Application number: KR1020200162428A
Authority: KR
Inventors: 송길태; 오동빈
Original assignee: 부산대학교 산학협력단
Priority date: 2020-11-27
Filing date: 2020-11-27
Publication date: 2022-06-03
Also published as: KR102601275B1

Abstract

A CNN-based machine learning system that mimics human visual inspection for epigenetic data analysis, and an operating method thereof, are presented. According to one embodiment, the operating method of the system, performed by a computer device, may comprise: a step in which a convolutional neural network (CNN) model learns from cancer genome data analysis results; and a step in which the trained CNN model calls peaks on new cancer genome data it was not trained on, in order to search for functional regions that may underlie the mechanism of cancer development.

Description

Machine learning system based on a convolutional neural network that mimics human visual inspection for the analysis of epigenetic data, and its operation method

The following embodiments relate to epigenetic data analysis and, more particularly, to a CNN-based machine learning system that mimics human visual inspection for epigenetic data analysis and an operating method thereof.

In the post-genomic era, understanding epigenetic regulation is one of the most important challenges in biology and medicine. To elucidate pathological mechanisms and accurately identify target genes for developing therapeutics, it is important to assess aberrations in the genome-wide interactions between proteins and genomic elements that cause gene dysregulation. In this regard, chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a widely used technique for identifying the genomic binding sites of epigenetic regulators, including histones, transcription factors, and DNA/RNA-binding proteins. ChIP-seq makes it possible to discover interactions between protein complexes and DNA regulatory elements, as well as their gene regulatory networks. Analyses of ChIP-seq data have shown how histone modifications and nucleic acid-protein interactions regulate key aspects of gene regulation and of cell lineage determination and maintenance.

One of the major computational difficulties in analyzing ChIP-seq data is identifying peaks, that is, genomic regions where aligned reads are enriched after sequence reads are mapped to a given reference genome. This task is challenging for several reasons. Sequencing errors and local biases caused by structural variation complicate the peak-calling problem. Diverse data patterns arising from biological variability, different genomic regions, experimental conditions, and sequencing coverage make the problem harder still. A sensitive and reliable computational method for detecting peaks against a noisy background is especially important in medical research, where patient samples may be limited and data quantity and quality are therefore suboptimal.

Several software tools for calling peaks in ChIP-seq data, such as MACS2, HOMER, SICER, and SPP, have been developed based on various probabilistic and unsupervised learning methods. Some of these tools are highly sensitive in calling peaks but suffer from high false positive rates. Other tools may require additional information as input (for example, mappability scores for portions of the genome that are overlapped by one or more uniquely mappable reads).

Neural network models have been proposed to denoise ChIP-seq data and improve peak-calling performance. Ensemble approaches have also been proposed to increase the accuracy of ChIP-seq peak calling. An ensemble approach uses the outputs of several existing peak-calling tools to remove outliers among their peak-calling decisions. These approaches have increased the sensitivity of calling true peaks but still suffer from high false positive rates. In particular, human cancer cell lines are too complex for ChIP-seq to be used readily to understand human malignancies, because the data patterns vary greatly depending on the patient's primary cancer tissue. The false positive rate of ChIP-seq peak calling in human cancer cell lines is worse than in other datasets.

To address these problems, experts label true peaks using visualization tools such as the UCSC Genome Browser and the Integrative Genomics Viewer (IGV). False positive peaks can also be corrected by expert researchers. However, it would be very inefficient for scientists to find every peak across the whole genome for large volumes of ChIP-seq data. Hocking et al. proposed a supervised learning approach based on grid search to optimize the parameters (for example, cutoff values) that a user must set when running a peak-calling tool such as MACS2. Their approach uses data in which peaks have been labeled by experts, learns the parameter values, and applies the optimized parameters to the rest of the dataset. However, it requires the labeling work to be completed for each individual ChIP-seq dataset and for each peak-calling tool.

Landt, S. G. et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Research 22, 1813-1831 (2012).

The embodiments describe a CNN-based machine learning system that mimics human visual inspection for epigenetic data analysis and an operating method thereof. More specifically, they provide a convolutional neural network (CNN)-based deep learning model that mimics the visual inspection performed by human researchers for ChIP-seq peak detection, one of the core steps in the analysis of cancer epigenomic data.

An object of the embodiments is to provide a CNN-based machine learning system that mimics human visual inspection for epigenetic data analysis, and an operating method thereof, in which a deep learning model learns from the cancer genome data analysis results of human researchers and the trained model then searches new, previously unseen cancer genome data for functional regions that may underlie the mechanism of cancer development, so that analysis is fast while retaining the accuracy of a human researcher.

According to one embodiment, an operating method of a CNN-based machine learning system that mimics human visual inspection for epigenetic data analysis, performed by a computer device, may comprise: a step in which a convolutional neural network (CNN) model learns from cancer genome data analysis results; and a step in which the trained CNN model calls peaks on new cancer genome data it was not trained on, in order to search for functional regions that may underlie the mechanism of cancer development.

The step in which the CNN model learns from the cancer genome data analysis results may include a step of converting the input data of the CNN model into vectors using a preprocessing module.

The step of converting the input data of the CNN model into vectors using the preprocessing module may include a step of converting, through a convolution layer, at least one of the following kinds of input data into vectors: ChIP-seq read mapping data, data in which peak positions are labeled for at least some regions of the ChIP-seq read mapping data, and RefSeq annotation information.

The step of converting the input data of the CNN model into vectors using the preprocessing module may include a step of reducing the dimension of the converted vectors through a max-pooling layer in order to reduce noise in the input data.
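The dimension reduction described here can be illustrated with a minimal pure-Python sketch of one-dimensional max pooling. The function name `max_pool_1d`, the pool size, and the default stride are assumptions chosen for illustration; the source does not specify the pooling parameters or the framework used.

```python
def max_pool_1d(signal, pool_size, stride=None):
    """Max-pool a 1-D sequence without padding (illustrative sketch).

    pool_size and stride are hypothetical parameters; by default the
    stride equals the pool size, so windows do not overlap.
    """
    if stride is None:
        stride = pool_size
    if pool_size <= 0 or stride <= 0:
        raise ValueError("pool_size and stride must be positive")
    last_start = len(signal) - pool_size
    # Keep only the maximum of each window: small fluctuations are dropped
    # while the height of a strong signal survives.
    return [max(signal[i:i + pool_size]) for i in range(0, last_start + 1, stride)]


read_depth = [1, 3, 2, 0, 5, 4]
pooled = max_pool_1d(read_depth, pool_size=2)
```

With non-overlapping windows of size 2, the six-element input shrinks to three elements, each the maximum of its window.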

The step in which the CNN model learns from the cancer genome data analysis results may include a step of training the CNN model through a plurality of inception modules, each comprising at least one of convolution, max-pooling, and average-pooling filters.

The step of training the CNN model through the plurality of inception modules may include: a step of providing the CNN model with information on peak patterns and on the magnitude of strong peak signals through at least one first inception module comprising convolution, max-pooling, and average-pooling filters; a step of preventing the problem of growing vector sizes in the max-pooling and average-pooling layers through at least one second inception module comprising convolution and max-pooling filters; and a step of providing at least one third inception module comprising a convolution filter that is wider and has a longer stride than the convolution filters of the first and second inception modules.

In the step in which the CNN model learns from the cancer genome data analysis results, a residual connection structure may be used between the plurality of inception modules to prevent the vanishing gradient problem.
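As a rough illustration of a residual connection, the sketch below adds a block's input back onto its output. The helper `smooth` is a hypothetical stand-in for an inception module (a three-point moving average), not the module described in the source, and the function names are assumptions.

```python
def residual_block(x, transform):
    """Return x + transform(x), the basic residual (skip) connection.

    `transform` stands in for any length-preserving module; here it is a
    hypothetical placeholder for an inception module.
    """
    y = transform(x)
    if len(y) != len(x):
        raise ValueError("transform must preserve length for the skip connection")
    return [xi + yi for xi, yi in zip(x, y)]


def smooth(x):
    """Hypothetical stand-in module: 3-point moving average with clamped edges."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3 for i in range(n)]


out = residual_block([3.0, 3.0, 3.0], smooth)
```

Because the input is passed through unchanged and only summed with the module output, the gradient has a direct path around each module, which is the usual argument for why such connections mitigate vanishing gradients.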

In the step in which the CNN model learns from the cancer genome data analysis results, an average pooling layer following a fully connected layer may be added at the end of the plurality of inception modules to reduce the dimension.

In the step in which the CNN model learns from the cancer genome data analysis results, the CNN model may learn the optimal threshold for a genomic region based on the ChIP-seq read mapping data in a selected window.

In the step in which the CNN model learns from the cancer genome data analysis results, because the output vector of the CNN model is smaller than the input vector, the output vector may be expanded to the same size as the input vector so that the presence or absence of a peak is predicted at each individual position.
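One simple way to expand a shorter output vector to the input length is nearest-neighbor repetition, sketched below. The source does not state which expansion scheme is used, so this choice, along with the name `expand_to_length`, is an assumption made purely for illustration.

```python
def expand_to_length(output, target_len):
    """Stretch `output` to `target_len` elements by repeating values.

    Nearest-neighbor upsampling is an assumed scheme; position i of the
    expanded vector copies the output element covering that position.
    """
    n = len(output)
    if n == 0:
        raise ValueError("output must not be empty")
    return [output[i * n // target_len] for i in range(target_len)]


expanded = expand_to_length([0.1, 0.9], 6)
```

After expansion, every input position has exactly one prediction, so a per-position peak label can be compared with or derived from the model output.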

In the step in which the CNN model learns from the cancer genome data analysis results, a sigmoid activation may be applied to the output data of the CNN model to assign a score to each peak, thereby measuring the importance of the peak.
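The scoring step can be sketched as follows: a sigmoid maps each raw output to a value between 0 and 1, contiguous positions above a cutoff are merged into a peak, and each peak receives a score. The 0.5 cutoff, the use of the maximum probability as the peak score, and the name `call_peaks` are all assumptions for illustration; the source only states that a sigmoid activation is used to score peaks.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def call_peaks(logits, threshold=0.5):
    """Return (start, end, score) tuples with half-open [start, end) positions.

    threshold and the max-probability score are hypothetical choices.
    """
    probs = [sigmoid(v) for v in logits]
    peaks, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a new peak begins
        elif p < threshold and start is not None:
            peaks.append((start, i, max(probs[start:i])))
            start = None  # the current peak ends
    if start is not None:  # peak running to the end of the window
        peaks.append((start, len(probs), max(probs[start:])))
    return peaks


peaks = call_peaks([-2.0, 1.0, 3.0, -1.0, 0.5])
```

Here positions 1 to 2 and position 4 exceed the cutoff, so two peaks are reported, each with its own score.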

In the step in which the trained CNN model calls peaks on new cancer genome data it was not trained on, in order to search for functional regions that may underlie the mechanism of cancer development, new cancer genome data may be input into the trained CNN model, and peaks may be predicted by calling peaks and assigning a score to each peak.

According to another embodiment, a CNN-based machine learning system that mimics human visual inspection for epigenetic data analysis may comprise: a CNN model training unit in which a convolutional neural network (CNN) model learns from cancer genome data analysis results; and a CNN model peak prediction unit in which the trained CNN model calls peaks on new cancer genome data it was not trained on, in order to search for functional regions that may underlie the mechanism of cancer development.

The CNN model training unit may include a preprocessing module that converts into vectors at least one of the following kinds of input data of the CNN model: ChIP-seq read mapping data, data in which peak positions are labeled for at least some regions of the ChIP-seq read mapping data, and RefSeq annotation information.

The CNN model training unit may include a plurality of inception modules, each comprising at least one of convolution, max-pooling, and average-pooling filters.

The plurality of inception modules may include: at least one first inception module comprising convolution, max-pooling, and average-pooling filters, to provide the CNN model with information on peak patterns and on the magnitude of strong peak signals; at least one second inception module comprising convolution and max-pooling filters, to prevent the problem of growing vector sizes in the max-pooling and average-pooling layers; and at least one third inception module comprising a convolution filter that is wider and has a longer stride than the convolution filters of the first and second inception modules.

A residual connection structure may be used between the plurality of inception modules to prevent the vanishing gradient problem.

In the CNN model training unit, an average pooling layer following a fully connected layer may be added at the end of the plurality of inception modules to reduce the dimension.

The CNN model training unit may apply a sigmoid activation to the output data of the CNN model to assign a score to each peak, thereby measuring the importance of the peak.

The CNN model peak prediction unit may input new cancer genome data into the trained CNN model and predict peaks by calling peaks and assigning a score to each peak.

According to the embodiments, it is possible to provide a CNN-based machine learning system that mimics human visual inspection for epigenetic data analysis, and an operating method thereof, in which a deep learning model learns from the cancer genome data analysis results of human researchers and the trained model then searches new, previously unseen cancer genome data for functional regions that may underlie the mechanism of cancer development, so that analysis is fast while retaining the accuracy of a human researcher.

FIG. 1 is a diagram illustrating the visualization of labeled peaks using ChIP-seq read mapping data according to an embodiment.
FIGS. 2A to 2C are diagrams showing examples of filters of a CNN structure according to an embodiment.
FIG. 3 is a diagram showing a CNN structure according to an embodiment.
FIG. 4 is a diagram showing the peak-calling process of a trained CNN model according to an embodiment.
FIG. 5 is a diagram illustrating a CNN-based machine learning system that mimics human visual inspection for epigenetic data analysis according to an embodiment.
FIG. 6 is a flowchart showing an operating method of a CNN-based machine learning system that mimics human visual inspection for epigenetic data analysis according to an embodiment.

Hereinafter, embodiments are described with reference to the accompanying drawings. The described embodiments may, however, be modified into various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art. The shapes and sizes of elements in the drawings may be exaggerated for clarity.

The embodiments below provide a convolutional neural network (CNN) deep learning model that mimics the visual inspection performed by human researchers for ChIP-seq peak detection, one of the core steps in the analysis of cancer epigenomic data. Here, the epigenome refers to the set of substances involved in regulating gene expression. According to the embodiments, accuracy at the level of manual inspection by expert researchers analyzing ChIP-seq epigenomic data can be achieved.

The embodiments cover the development, design, and validation of a deep learning model that mimics the visual inspection of human researchers. In particular, to accurately locate functional regions that are very small relative to the cancer genome, the embodiments may use a loss function design that differs from conventional deep learning training, together with recent deep learning techniques such as batch normalization and residual structures.

Existing epigenomic analysis tools rely on classical techniques such as the Poisson assumption. For cancer genomes, where the level of variation within the genome is high, these tools produce many false positives, which makes accurate analysis difficult. Here, the Poisson assumption refers to the discrete probability distribution (the Poisson distribution) assumed when predicting the occurrence pattern of regulatory elements in the genome.

Using a deep learning model trained on epigenomic data manually inspected by human researchers, the embodiments make it possible to develop an analysis tool that combines the accuracy of a human researcher with the speed of classical analysis tools. The following describes in more detail a CNN-based machine learning system that mimics human visual inspection for epigenetic data analysis, and its operating method, in which a deep learning model learns from the cancer genome data analysis results of human researchers and the trained model searches new, previously unseen cancer genome data for functional regions that may underlie the mechanism of cancer development.

ChIP-seq is one of the key experimental resources available for understanding genome-wide epigenetic interactions and identifying functional elements associated with disease. Analyzing ChIP-seq data is important but computationally difficult because of irregular noise and varying levels of bias. Although many peak-calling methods have been developed, current computational tools still sometimes require manual inspection by humans using data visualization. Given the enormous volume of ChIP-seq data, however, it is nearly impossible for researchers to find all peaks manually. Recently developed convolutional neural networks (CNNs) can achieve human-like classification accuracy and can therefore be applied to this problem. In this embodiment, a new supervised learning method for ChIP-seq peak identification using a CNN is designed and integrated into a software pipeline called CNN-Peaks. Data labeled by researchers, in which the presence or absence of peaks is annotated for some genomic regions, are used for training. The trained model can then be applied to predict peaks in previously unseen genomic regions across multiple ChIP-seq datasets, including benchmark datasets commonly used to validate peak-calling methods. According to the embodiments, performance superior to that of existing methods can be confirmed.

Recently, supervised machine learning methods based on deep neural networks such as convolutional neural networks (CNNs) have been applied successfully to epigenetics, regulatory genomics, and systems biology. The present embodiments provide a new CNN-based peak-calling software pipeline, named CNN-Peaks, that uses data partially labeled by researchers. The embodiments identify local peak-calling thresholds, which may differ from one genomic region to another, in order to reduce false positive peak calls and thereby resolve local biases caused by structural variation in human cancer cell lines. Note that some peaks in genomic regions with frequent copy number variation are expected to have a higher mapping depth than other peaks. As training data, CNN-Peaks uses data labeled by experts as well as read count information obtained by preprocessing the raw ChIP-seq read mapping. The software tool trains a model that determines an appropriate peak detection cutoff for a given genomic region by considering the ChIP-seq read mapping pattern of the neighboring regions.

CNNs have been used successfully to solve problems in image processing and natural language translation, where data patterns have features with local dependencies. In addition, the integrated annotation information (RefSeq) available from NCBI (https://www.ncbi.nlm.nih.gov/) is used as training input data for building the predictive CNN. RefSeq data include the genomic locations of genes, transcripts, and protein-coding information, which can help determine the presence or absence of peaks during human visual inspection.

The CNN-Peaks software package according to the embodiments consists of three main modules: one for feeding the original input data (labeled data, genome annotations, and read count information) into the CNN structure, one for training the model using the training data, and one for predicting peaks in unseen data. The performance of CNN-Peaks is evaluated using labeled data reserved as test data. CNN-Peaks is also tested on a ChIP-seq benchmark dataset commonly used for evaluation, and its performance is compared with that of other major peak-calling tools. In addition, the CNN-Peaks tool is used to analyze a variety of real datasets for histone modifications such as H3K27ac and for transcription factor binding such as GATAD2.

Users can install the CNN-Peaks pipeline using a Docker image and run the package on their own data with the trained model to predict narrow histone modifications and transcription factor binding sites in humans. Experienced users can also use CNN-Peaks to build new predictive models trained on their own labeled data. The software package according to the embodiments includes a desktop application that helps experts create labeled data easily by labeling peaks in randomly selected genomic regions of given ChIP-seq read mapping data. The pipeline can also be applied to other types of high-throughput sequencing data, such as DNase-seq and ATAC-seq. The package is available through the GitHub repository http://github.com/odb9402/CNNPeaks.

Data description

FIG. 1 is a diagram illustrating the visualization of labeled peaks using ChIP-seq read mapping data according to an embodiment.

According to the embodiments, several ChIP-seq read mapping datasets in BAM format were obtained from the ENCODE data portal to generate labeled data and to validate CNN-Peaks. These data include several ChIP-seq datasets for examining histone modifications such as H3K36me3, H3K4me3, H3K27me3, H2AFZ, and H3K9ac, and the binding of transcription factors such as GATAD2, POLR2A, and SMARCE1. In addition, ChIP-seq data from the leukemia cell line K562 were downloaded, and background binding in genomic regions was labeled to train the CNN model.

Some genomic regions were randomly selected from the ChIP-seq read mapping data in BAM format, and, as shown in FIG. 1, peak positions can be labeled (Labeled area) using a visualization tool that displays the BAM alignment data.

In addition to the labeled data and the ChIP-seq read mapping BAM files, NCBI's human Reference Sequence (RefSeq) collection, a curated, non-redundant set of genomic, transcript, and protein sequence records, can be used as an additional input vector for the CNN model. RefSeq data may include protein-coding positions and similar products. When researchers inspect ChIP-seq data manually to determine peaks, genome annotation information (RefSeq annotation information), such as transcripts and their corresponding protein records, is usually displayed alongside the given ChIP-seq read mapping data. This information often helps the researcher estimate where peaks are located. To mimic human inspection, the model can be built to take these annotation types as additional input. The annotation information can be added as a binary vector indicating the presence or absence of a transcript or protein record at each genomic position.
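The binary annotation vector described above can be sketched as follows. This is a minimal illustration in plain Python, not the package's own code: the function name `annotation_vector`, the half-open interval convention, and the example coordinates are all assumptions made for the sketch.

```python
def annotation_vector(window_start, window_end, intervals):
    """Return a 0/1 list with one entry per base in [window_start, window_end)."""
    vec = [0] * (window_end - window_start)
    for start, end in intervals:
        # Clip each annotated interval to the window before marking it.
        lo = max(start, window_start)
        hi = min(end, window_end)
        for pos in range(lo, hi):
            vec[pos - window_start] = 1
    return vec


# Hypothetical transcript intervals (half-open) on the same chromosome as the window.
transcripts = [(102, 105), (108, 120)]
vec = annotation_vector(100, 110, transcripts)
print(vec)
```

Intervals that fall outside the window leave the vector unchanged, so the same interval list can be reused for every window on a chromosome.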

Data preprocessing

The CNN-Peaks pipeline according to an embodiment may include a preprocessing module. The preprocessing module converts all input data, including the ChIP-seq read mapping BAM files, the labeled data, and the RefSeq annotation information, into correctly shaped vectors to be fed to the CNN. To normalize labeled regions with different window sizes to the same size, a fixed number of bins (12,000 by default) can be used. If a window is smaller than the target window size, the CNN-Peaks user must use the visual inspection tool to label additional regions to add to the window, adjust the target window size parameter, or simply remove the window.
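One way to bring windows of different lengths to a fixed number of bins is sketched below. The text states only that a fixed number of bins is used, so averaging the depth inside each bin and the function name `bin_depth` are assumptions of this sketch. Rejecting windows that are too small mirrors the requirement that the user handle such windows.

```python
def bin_depth(depth, n_bins):
    """Average per-base read depth into n_bins nearly equal-width bins."""
    n = len(depth)
    if n < n_bins:
        # Mirrors the text: windows below the target size must be handled by the user.
        raise ValueError("window is smaller than the target number of bins")
    binned = []
    for b in range(n_bins):
        lo = b * n // n_bins
        hi = (b + 1) * n // n_bins  # hi > lo is guaranteed because n >= n_bins
        chunk = depth[lo:hi]
        binned.append(sum(chunk) / len(chunk))
    return binned


# Toy window of six bases reduced to three bins (a real run would use 12,000 bins).
print(bin_depth([0, 2, 4, 6, 8, 10], 3))
```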

The preprocessing module also includes a function that smooths the read count pattern to reduce noise in the read alignment data. This smoothing step also helps bring the CNN model closer to human visual inspection. Without the smoothing and correction steps, the raw depth pattern is very noisy. Max pooling and convolution with a Gaussian filter can be used for smoothing. Gaussian filters are commonly used for denoising and smoothing in various domains, such as image processing. The convolution and max pooling operations can be expressed by the following equations, respectively.

[Equation 1]

[Equation 2]

Here, * denotes the convolution operation, and the stride of both convolution and max pooling is 1. Each operation has its own filter size; v and u are the input vectors, and v_i is the i-th element of v. X is the vector of ChIP-seq read mapping data, to which the Gaussian filter is applied, and X_smoothing is the smoothed vector of X. This can be expressed by the following equations.

[Equation 3]

[Equation 4]
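The smoothing step can be illustrated with a short plain-Python sketch of stride-1 max pooling followed by convolution with a normalized Gaussian kernel. The order of the two operations, the odd filter sizes, the edge handling, and the default parameter values are assumptions chosen for the illustration; they are not taken from the equations of this document.

```python
import math


def max_pool(v, k):
    """Stride-1 max pooling with an odd window k; edges use the part of the window that fits."""
    half = k // 2
    return [max(v[max(0, i - half): i + half + 1]) for i in range(len(v))]


def gaussian_kernel(k, sigma):
    """Odd-length Gaussian kernel normalized to sum to 1."""
    half = k // 2
    weights = [math.exp(-(j * j) / (2.0 * sigma * sigma)) for j in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve_same(v, kernel):
    """Stride-1 convolution with zero padding so the output keeps len(v)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(v)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(v):
                acc += w * v[idx]
        out.append(acc)
    return out


def smooth(x, pool_size=3, filter_size=5, sigma=1.0):
    """Max-pool the read depth, then blur it with a Gaussian filter."""
    return convolve_same(max_pool(x, pool_size), gaussian_kernel(filter_size, sigma))


noisy_depth = [0, 0, 10, 0, 0, 0, 9, 0, 0]
print([round(value, 2) for value in smooth(noisy_depth)])
```

Because the kernel is symmetric, the convolution and the corresponding correlation give the same result, which keeps the sketch short.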

CNN architecture

FIGS. 2A to 2C are diagrams illustrating examples of the filters of a CNN structure according to an embodiment. More specifically, FIG. 2A shows the A-module, FIG. 2B shows the B-module, and FIG. 2C shows the C-module.

FIGS. 2A to 2C show examples of the inception-module-based filters of the CNN-Peaks model according to an embodiment. The CNN-Peaks model can be built on the inception module used in GoogLeNet, which extracts a variety of features from data through filters of different sizes. Here, each of the convolutions 201, 211, and 221 denotes a convolution operation with a filter size of 1×N, and the concatenation filters 204, 213, and 222 combine the outputs of the convolutions 201, 211, and 221 and the pooling operations 202, 203, and 212.

The A-module can use three types of filters: convolution 201, max pooling 202, and average pooling 203. The B-module can use two types of filters, convolution 211 and max pooling 212, without average pooling, and may consist of three kinds of convolution filters 211. The C-module may consist of four kinds of convolution filters 221 that are wider and have a longer stride than the filters used in the A-module and the B-module.

FIG. 3 is a diagram illustrating a CNN structure according to an embodiment.

Referring to FIG. 3, the CNN-Peaks model 310 uses three types of inception modules that are slimmer than the conventional inception module. Here, the CNN-Peaks model 310 may be included in the CNN model. The CNN-Peaks model 310 consists of the inception modules 313, 314, and 315 shown in FIGS. 2A to 2C together with several other hidden layers, such as the pooling layers 312 and 316 and the convolutional layer 311.

The CNN-Peaks model 310 learns an optimal threshold for calling peaks in a local genomic region of the ChIP-seq data 301, subtracts (340) the threshold from each input signal value, and applies a sigmoid operation 350 to determine whether a peak is present. That is, the peak presence probability 360 can be obtained through the sigmoid operation 350, and peaks can be predicted (390) from the CNN-Peaks model. Cross-entropy 380 can be used as the loss function for training the CNN-Peaks model 310. In FIG. 3, arrows represent the residual connections between the inception modules 313, 314, and 315, and "Expand" represents the expansion of the output vector 317.

Placing the inception modules 313, 314, and 315 at the beginning of the CNN-Peaks model 310 is known to cause inefficiency in terms of prediction accuracy. To avoid this potential problem, a convolution filter 311 and a max pooling layer 312 can be provided at the beginning of the CNN-Peaks model 310. The details of the CNN-Peaks model 310 according to an embodiment are as follows.

First, the ChIP-seq read mapping pattern 301 and the input vector for annotation information (RefSeq) 302 are converted into vectors by the convolutional layer 311. The max pooling layer 312 then reduces the dimension of these vectors. These layers are followed by seven inception modules 313, 314, and 315, each of which has a layer that concatenates the outputs of its filters. Two of the inception modules are called A-modules 313 and can use three types of filters: convolution, max pooling, and average pooling. The A-module 313 can provide the CNN-Peaks model 310 with information on the peak pattern and on the magnitude of strong peak signals. However, because max pooling and average pooling produce the same number of features as the previous layer, the vector size grows exponentially with the depth of the pooling layers. To address this problem, three further inception modules, called B-modules 314, can be used. The B-module 314 consists of three kinds of convolution filters without average pooling. Finally, two inception modules called C-modules 315 can be used. The C-module 315 has four kinds of convolution filters that are wider and have a longer stride than the filters used in the A-module 313 and the B-module 314.

At the end of the inception modules 315, a fully connected layer followed by an average pooling layer 316 can be added to reduce the dimension. In addition, a residual structure can be used between the inception modules 313, 314, and 315, which helps prevent the vanishing gradient problem. Batch normalization can also be used to avoid overfitting during training.

Output layer of the CNN structure

To determine whether a peak is present at each genomic position, the output layer 317 of the CNN-Peaks model 310 would need as many neurons as there are genomic bases. Such a large number of neurons in the output layer 317 usually causes a marked degradation in learning performance. To reduce the number of neurons, the CNN-Peaks model 310 can be designed to learn the optimal threshold for a genomic region based on the ChIP-seq read mapping pattern 301 of the selected window, rather than computing a p-value or the probability of a peak signal at each individual genomic position. This significantly reduces the number of neurons required in the output layer 317 of the CNN-Peaks model 310 and prevents the performance degradation.

Because the output vector 317 is smaller than the input vector, it is expanded to the same size as the input vector so that the presence or absence of a peak can be predicted at each individual position. This expanded vector 320 is implemented using the broadcasting convention of TensorFlow and NumPy, which allows operations between vectors of different sizes. The peak calling process of the CNN-Peaks model 310 is described with reference to FIG. 4.
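The expand, subtract, and sigmoid steps can be sketched in plain Python as a stand-in for array broadcasting. The threshold values below are hypothetical model outputs. The helper names, the requirement that the input length be a multiple of the output length, and the 0.5 decision cutoff are all assumptions made for the illustration.

```python
import math


def expand(thresholds, target_len):
    """Repeat each threshold so the vector matches the input length."""
    if target_len % len(thresholds) != 0:
        raise ValueError("input length must be a multiple of the output length")
    reps = target_len // len(thresholds)
    return [t for t in thresholds for _ in range(reps)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def peak_probabilities(signal, thresholds):
    """Subtract the expanded threshold from each signal value, then apply the sigmoid."""
    expanded = expand(thresholds, len(signal))
    return [sigmoid(s - t) for s, t in zip(signal, expanded)]


signal = [0.0, 1.0, 6.0, 7.0, 2.0, 1.0]
thresholds = [3.0, 4.0, 5.0]  # hypothetical model output, one value per two positions
probs = peak_probabilities(signal, thresholds)
calls = [p > 0.5 for p in probs]  # assumed cutoff: signal above threshold means peak
print(calls)
```

Learning one threshold per group of positions rather than one output per base is what keeps the output layer small. The expansion restores a per-position decision without adding any parameters.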

Loss function

Determining whether a peak signal is present is a binary classification problem. Cross-entropy 380 can be used as the loss function for training the CNN-Peaks model 310. Most methods for classification problems require a trade-off between sensitivity and specificity, and care must be taken not to favor only one of them. In peak calling on ChIP-seq data, peaks are rare relative to the size of the whole genome. A method that tends to call 'no-peak' (that is, one with a low false-positive error rate) will show high accuracy even though it misses important peaks. Therefore, the weighted cross-entropy in the following equation can be used as the loss.

[Equation 5]

Here, X is the input read count vector and is accompanied by the annotation vector and the parameter set of the model. y_i is the i-th element of the labeled data for the input, and the prediction term is the i-th element of the output predicted for the given input X and model parameters. The weight makes false-negative calls count more heavily than false-positive calls in the evaluation, and it can be determined as the ratio between the negative (no-peak) and positive (peak) regions of the given data.
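A weighted cross-entropy of this kind can be sketched as follows. The exact form of Equation 5 is not reproduced in this text, so the sketch assumes the common formulation in which the weight multiplies only the positive (peak) term. The function names and the clipping constant `eps` are likewise assumptions.

```python
import math


def class_weight(labels):
    """Ratio of no-peak positions to peak positions in the labeled data."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("labels contain no peak positions")
    return negatives / positives


def weighted_cross_entropy(labels, probs, weight, eps=1e-12):
    """Per-position loss; `weight` scales the penalty for missing a labeled peak."""
    losses = []
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        losses.append(-(weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return losses


labels = [0, 0, 0, 1]          # one peak position out of four
probs = [0.5, 0.5, 0.5, 0.5]   # an uninformative prediction
w = class_weight(labels)
losses = weighted_cross_entropy(labels, probs, w)
print(w, [round(l, 4) for l in losses])
```

With three no-peak positions for every peak position, a missed peak costs three times as much as a false call. This counteracts the pull toward always predicting 'no-peak'.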

In addition, the Top-K method can be applied to the loss function. In the Top-K method, sensitivity is treated as more important than specificity for high values of K, whereas specificity is more important for low values of K. Because the K largest losses are selected from the L per-position losses, K cannot exceed L; to balance sensitivity and specificity, K is set to L/2, where L is the size of the output vector. The final loss function can be expressed by the following equation.

[Equation 6]

Here, the n-th term of the sum is the n-th largest individual weighted cross-entropy loss among all l_i. The loss function was optimized with the Adam optimizer, which adjusts the model parameters using backpropagation.
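The Top-K selection can be illustrated with a few lines of Python. Whether the selected losses are summed or averaged is not recoverable from this text, so the mean is used here as an assumption. Taking K as half the number of per-position losses is also an assumption of the sketch, and `top_k_loss` is a hypothetical helper name.

```python
def top_k_loss(losses, k):
    """Mean of the k largest per-position losses."""
    if not 1 <= k <= len(losses):
        raise ValueError("k must be between 1 and len(losses)")
    largest = sorted(losses, reverse=True)[:k]
    return sum(largest) / k


per_position = [0.1, 2.0, 0.5, 1.5]   # toy weighted cross-entropy values
k = len(per_position) // 2            # assumed: half of the vector length
print(top_k_loss(per_position, k))
```

Only the positions where the model is most wrong contribute to the update, so easy background positions do not dominate training.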

Scoring peaks using sigmoid activation

An important task of a peak calling algorithm is to assign a significance score to each peak. A p-value is computed for each genomic position under a Poisson distribution (the per-position counts in genome-wide tag data from a ChIP experiment are known to follow a Poisson distribution), and these p-values are combined with the sigmoid values from the output layer 317 of the CNN-Peaks model 310 to measure the significance of a peak. The sigmoid function 350 in the output layer 317 of the CNN-Peaks model 310 produces a value that can be interpreted as the probability that a peak is present. The score of each peak called by the CNN-Peaks model 310 can be determined as the product of the sigmoid activation value of that peak and the p-value from the Poisson distribution. This score can help users assess the significance of a particular peak.
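A literal reading of this scoring rule is sketched below. The text does not say how the Poisson rate is estimated or whether the p-value is transformed before multiplication. The background rate `lam`, the upper-tail definition of the p-value, and the use of the raw p-value are therefore assumptions of this sketch.

```python
import math


def poisson_upper_tail(count, lam):
    """P(X >= count) for X ~ Poisson(lam)."""
    if count <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, count):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def peak_score(sigmoid_value, count, lam):
    """Product of the sigmoid activation and the Poisson p-value, as stated in the text."""
    return sigmoid_value * poisson_upper_tail(count, lam)


# Hypothetical peak: sigmoid activation 0.9, 12 reads, assumed background rate of 3 reads.
print(peak_score(0.9, 12, 3.0))
```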

FIG. 4 is a diagram illustrating the peak calling process of a trained CNN model according to an embodiment.

FIG. 4 shows the peak calling process using the trained CNN-Peaks model 410. The signal is the read mapping depth of the ChIP-seq input data, and the hatched (gray box) area below the signal indicates that a gene is present in the RefSeq annotation information.

Reference numeral 401 denotes a window in a genomic region that contains both the ChIP-seq read mapping signal and the RefSeq annotation information. The peaks in the window (the hatched portion of 402) can be predicted using the trained CNN-Peaks model 410 and output in BED format.
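Turning per-position peak calls into BED records can be sketched as follows. BED uses 0-based, half-open coordinates. The function names, the assumption that `window_start` is already 0-based, and the example chromosome and offsets are hypothetical.

```python
def calls_to_bed(chrom, window_start, calls):
    """Merge runs of True calls into half-open (chrom, start, end) intervals."""
    intervals = []
    run_start = None
    for offset, is_peak in enumerate(calls):
        if is_peak and run_start is None:
            run_start = offset                      # a new peak begins here
        elif not is_peak and run_start is not None:
            intervals.append((chrom, window_start + run_start, window_start + offset))
            run_start = None
    if run_start is not None:                       # peak runs to the end of the window
        intervals.append((chrom, window_start + run_start, window_start + len(calls)))
    return intervals


def format_bed(intervals):
    """Render intervals as tab-separated BED lines (first three columns only)."""
    return "\n".join(f"{c}\t{s}\t{e}" for c, s, e in intervals)


peaks = calls_to_bed("chr1", 1000, [False, True, True, False, True])
print(format_bed(peaks))
```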

FIG. 5 is a diagram illustrating a CNN-based machine learning system that mimics human visual inspection for epigenetic data analysis according to an embodiment.

FIG. 5 shows the pipeline of a CNN-based machine learning system that mimics human visual inspection for epigenetic data analysis. Each parallelogram represents data, each rectangular box represents an individual module, the dashed rounded rectangle represents the trained CNN-Peaks model 540, and the arrows represent the data flow. Here, the CNN-Peaks model may be included in the CNN model and may be referred to as the CNN model.

The CNN-based machine learning system that mimics human visual inspection for epigenetic data analysis according to an embodiment may include a CNN model learning unit 570 and a CNN model peak prediction unit 580.

The CNN model learning unit 570 trains the CNN model 530 on the cancer genome data analysis result data 501, 502, and 503. The CNN model peak prediction unit 580 uses the trained CNN model 540 to call peaks (505) in new cancer genome data 504 on which the model was not trained, in order to search for functional regions that may underlie a mechanism of cancer development.

As the training input data 501, 502, and 503 of the CNN model learning unit 570, ChIP-seq data 501 in BAM format, labeled data 502 in text format, and RefSeq annotation information 503 can be used. The input data 501, 502, and 503 are converted into vectors with a shape suitable for the CNN model 530, and the CNN model 530 is trained on these vectors.

The trained CNN model 540 can then be applied to unlabeled data 504 in the prediction step. The CNN model peak prediction unit 580 can use the trained CNN model 540 to call peaks for narrow histone modifications and transcription factor binding sites in other human ChIP-seq data 504 without additional labeling or training. Peaks in ATAC-seq data can also be detected using CNN-Peaks.

Unlike most other peak calling methods, CNN-Peaks does not require additional control ChIP-seq data, which is typically used as a background signal to reduce false-positive errors. That is, the user can determine the peaks of the target ChIP-seq data 504 without control data.

FIG. 6 is a flowchart illustrating an operating method of a CNN-based machine learning system that mimics human visual inspection for epigenetic data analysis according to an embodiment.

Referring to FIG. 6, the operating method of a CNN-based machine learning system that mimics human visual inspection for epigenetic data analysis, performed by a computer device according to an embodiment, may include a step (S110) in which a CNN model learns cancer genome data analysis result data, and a step (S120) in which the trained CNN model calls peaks in new cancer genome data on which it was not trained, in order to search for functional regions that may underlie a mechanism of cancer development.

The operating method of the CNN-based machine learning system is described in more detail below, taking the CNN-based machine learning system of FIG. 5 as an example.

In step S110, the CNN model learning unit 570 trains the CNN model 530 on the cancer genome data analysis result data. The CNN model learning unit 570 can use the preprocessing module 510 to convert the input data 501, 502, and 503 of the CNN model 530 into vectors. More specifically, the preprocessing module 510 of the CNN model learning unit 570 can convert, through a convolutional layer, at least one of the following input data 501, 502, and 503 into vectors: the ChIP-seq read mapping data, the data in which peak positions are labeled for at least some regions of the ChIP-seq read mapping data, and the RefSeq annotation information. In addition, the preprocessing module 510 of the CNN model learning unit 570 can reduce the dimension of the converted vectors through a max pooling layer in order to reduce noise in the input data 501, 502, and 503.

The CNN model learning unit 570 may include a plurality of inception modules, each including at least one of convolution, max pooling, and average pooling filters. The plurality of inception modules may include a first inception module, a second inception module, and a third inception module, and one or more of each of the first, second, and third inception modules may be provided.

The first inception module may include convolution, max pooling, and average pooling filters in order to provide the CNN model 530 with information on the peak pattern and on the magnitude of strong peak signals. The second inception module may include convolution and max pooling filters in order to avoid the growth in vector size caused by the max pooling and average pooling layers. The third inception module may include convolution filters that are wider and have a longer stride than the convolution filters of the first and second inception modules. For example, the first, second, and third inception modules may correspond to the A-module, B-module, and C-module described with reference to FIGS. 2 and 3. A residual connection structure can be used between the plurality of inception modules to prevent the vanishing gradient problem.

In the CNN model learning unit 570, a fully connected layer followed by an average pooling layer may be added at the end of the plurality of inception modules to reduce the dimension. Because the output vector of the CNN model 530 is smaller than the input vector, the CNN model learning unit 570 can expand the output vector to the same size as the input vector so that the presence or absence of a peak is predicted at each individual position. The CNN model learning unit 570 can also measure the significance of the peaks by assigning a score to each peak using sigmoid activation on the output data of the CNN model 530. The CNN model learning unit 570 allows the CNN model 530 to learn the optimal threshold for a genomic region based on the ChIP-seq read mapping data 501 of the selected window.

In step S120, the CNN model peak prediction unit 580 uses the trained CNN model 540 to call peaks (505) in new cancer genome data 504 on which the model was not trained, in order to search for functional regions that may underlie a mechanism of cancer development. More specifically, the CNN model peak prediction unit 580 can input the new cancer genome data 504 into the trained CNN model 540 and predict peaks (505) by calling peaks and assigning a score to each peak.

As described above, according to the embodiments, a deep learning model learns the results of cancer genome data analysis performed by human researchers, and the trained model searches new, unseen cancer genome data for functional regions that may underlie a mechanism of cancer development. This enables fast analysis with the accuracy of a human researcher.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or if components of the described system, structure, apparatus, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

1. A method of operating a CNN-based machine learning system that mimics human visual inspection for epigenetic data analysis, the method being performed by a computer device and comprising:
training a convolutional neural network (CNN) model on cancer genome data analysis result data; and
calling peaks, by the trained CNN model, on new cancer genome data on which the CNN model has not been trained, to search for functional regions that may underlie a mechanism of cancer development.

2. The method of claim 1, wherein training the CNN model on the cancer genome data analysis result data comprises:
converting input data of the CNN model into a vector using a preprocessing module.

3. The method of claim 2, wherein converting the input data of the CNN model into a vector using the preprocessing module comprises:
transforming, through a convolutional layer, input data comprising at least one of ChIP-seq read mapping data, data labeled with peak positions for at least a partial region of the ChIP-seq read mapping data, and RefSeq annotation information into a vector.

4. The method of claim 3, wherein converting the input data of the CNN model into a vector using the preprocessing module further comprises:
reducing the dimension of the transformed vector through a max-pooling layer to reduce noise in the input data.
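
The preprocessing recited in claims 3 and 4 (a convolutional layer followed by a max-pooling layer) can be illustrated with a minimal sketch. The kernel values, the pooling width, the example read depths, and the function names below are all assumptions chosen for illustration; the claims do not specify them, and a real model would learn the kernel weights.

```python
# Illustrative sketch only: kernel, pool size, data, and names are assumptions,
# not values taken from the patent.

def conv1d(signal, kernel, stride=1):
    """Valid (unpadded) 1-D convolution, computed as cross-correlation."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]

def max_pool1d(signal, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]

# Hypothetical per-base read depth for a 12-bp window of ChIP-seq coverage.
coverage = [0, 1, 0, 2, 8, 9, 7, 1, 0, 0, 1, 0]

features = conv1d(coverage, [1, 1, 1])  # fixed summing kernel stands in for learned weights
pooled = max_pool1d(features, 2)        # halves the length and suppresses small fluctuations
```

In this toy window the pooled vector keeps the strong signal in the middle while shortening the representation, which is the noise-reduction role the claim assigns to max pooling.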

5. The method of claim 1, wherein training the CNN model on the cancer genome data analysis result data comprises:
training the CNN model through a plurality of inception modules including at least one filter among convolution, max pooling, and average pooling filters.

6. The method of claim 5, wherein training the CNN model through the plurality of inception modules comprises:
providing information on a peak pattern and on the magnitude of a strong peak signal to the CNN model through at least one first inception module including convolution, max pooling, and average pooling filters;
preventing the vector sizes of the max pooling and average pooling layers from growing through at least one second inception module including convolution and max pooling filters; and
providing at least one third inception module including a convolutional filter that is wider and has a longer stride than the convolutional filters of the first inception module and the second inception module.
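
The three kinds of inception module in claim 6 differ in which parallel branches they contain. The sketch below shows one way such branches could be arranged for a 1-D signal. The kernel values, pool sizes, strides, zero padding, and all function names are assumptions for illustration; the claim does not state how the branches are sized or combined.

```python
# Illustrative sketch only: kernels, pool sizes, padding, and names are assumptions.

def pad_same(signal, size):
    """Zero-pad so a stride-1 window of `size` yields len(signal) outputs."""
    left = (size - 1) // 2
    right = size - 1 - left
    return [0.0] * left + list(signal) + [0.0] * right

def conv_branch(signal, kernel, stride=1):
    """Zero-padded 1-D convolution (cross-correlation), then keep every stride-th output."""
    padded = pad_same(signal, len(kernel))
    full = [
        sum(padded[i + j] * w for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]
    return full[::stride]

def max_pool_branch(signal, size):
    """Stride-1 max pooling; zero padding assumes non-negative inputs such as read depth."""
    padded = pad_same(signal, size)
    return [max(padded[i:i + size]) for i in range(len(signal))]

def avg_pool_branch(signal, size):
    """Stride-1 average pooling with zero padding."""
    padded = pad_same(signal, size)
    return [sum(padded[i:i + size]) / size for i in range(len(signal))]

def first_module(signal, kernel, pool):
    """Convolution, max pooling, and average pooling branches side by side."""
    return [conv_branch(signal, kernel), max_pool_branch(signal, pool),
            avg_pool_branch(signal, pool)]

def second_module(signal, kernel, pool):
    """Convolution and max pooling only, so fewer output channels are produced."""
    return [conv_branch(signal, kernel), max_pool_branch(signal, pool)]

def third_module(signal, wide_kernel, stride):
    """A wider convolution applied with a longer stride."""
    return [conv_branch(signal, wide_kernel, stride)]

signal = [0.0, 2.0, 6.0, 2.0, 0.0]          # hypothetical coverage values
a = first_module(signal, [1.0, 1.0, 1.0], 3)
b = second_module(signal, [1.0, 1.0, 1.0], 3)
c = third_module(signal, [1.0] * 5, 2)
```

Here the max pooling branch carries the height of the strongest signal in each neighborhood and the average pooling branch carries its overall level, which is one reading of how the first module could expose peak shape and magnitude to later layers.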

7. The method of claim 5, wherein training the CNN model on the cancer genome data analysis result data comprises:
using a residual connection structure between the plurality of inception modules to avoid the vanishing gradient problem.
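
A residual connection adds a block's input back to its output, so the signal (and, during training, the gradient) always has an identity path around the block. The following is a minimal sketch of that idea; the `smooth` function is a hypothetical stand-in for an inception module, and the requirement that the block preserve length is an assumption of this sketch.

```python
# Illustrative sketch only: `smooth` is a hypothetical stand-in for an inception module.

def residual(x, block):
    """Return x + block(x) element-wise (a skip connection around `block`)."""
    out = block(x)
    if len(out) != len(x):
        raise ValueError("block must preserve length for the skip connection")
    return [xi + oi for xi, oi in zip(x, out)]

def smooth(x):
    """Toy transformation used in place of a learned module."""
    return [0.5 * v for v in x]

y = residual([1.0, 2.0, 4.0], smooth)
identity = residual([1.0, 2.0, 4.0], lambda v: [0.0] * len(v))
```

Even when the wrapped block contributes nothing, as in `identity`, the input passes through unchanged, which is why stacking such blocks tends to keep gradients from shrinking toward zero.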

8. The method of claim 5, wherein training the CNN model on the cancer genome data analysis result data comprises:
adding, at the end of the plurality of inception modules, an average pooling layer following a fully connected layer to reduce the dimension.

9. The method of claim 1, wherein training the CNN model on the cancer genome data analysis result data comprises:
learning an optimal threshold value for a genomic region based on the ChIP-seq read mapping data of a window selected through the CNN model.

10. The method of claim 1, wherein, because the output vector of the CNN model is smaller than the input vector, training the CNN model on the cancer genome data analysis result data comprises:
expanding the output vector to the same size as the input vector and predicting the presence or absence of a peak at each individual position.
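
Claim 10 expands the shortened output vector back to the input length so that each position receives its own prediction. The claim does not say how the expansion is done; the sketch below assumes nearest-neighbor repetition and a fixed 0.5 cutoff, both of which are illustrative choices, as are the function name and the example values.

```python
# Illustrative sketch only: nearest-neighbor expansion and the 0.5 cutoff are assumptions.

def expand_to_length(values, target_len):
    """Stretch `values` to `target_len` entries by repeating the nearest source value."""
    n = len(values)
    return [values[min(i * n // target_len, n - 1)] for i in range(target_len)]

model_output = [0.1, 0.9, 0.2]                      # hypothetical per-bin peak probabilities
per_position = expand_to_length(model_output, 6)    # one value per input position
has_peak = [p >= 0.5 for p in per_position]         # presence/absence at each position
```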

11. The method of claim 1, wherein training the CNN model on the cancer genome data analysis result data comprises:
measuring the importance of each peak by assigning a score to each peak in the output data of the CNN model using sigmoid activation.
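
Sigmoid activation maps any real-valued model output to a score strictly between 0 and 1 for moderate inputs, which makes scores comparable across peaks. A minimal sketch follows; the raw output values are hypothetical, and the two-branch form is simply a standard way to avoid floating-point overflow for large negative inputs.

```python
import math

# Illustrative sketch only: the raw output values are hypothetical.

def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

raw_outputs = [-3.0, 0.0, 2.5]                 # hypothetical pre-activation values per peak
scores = [sigmoid(v) for v in raw_outputs]     # higher score = more confident peak
```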

12. The method of claim 1, wherein calling peaks on the new cancer genome data on which the trained CNN model has not been trained, to search for functional regions that may underlie a mechanism of cancer development, comprises:
inputting the new cancer genome data to the trained CNN model to predict peaks through peak calling and to score each peak.
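
Once the trained model has produced a per-position probability for new data, peak calling can be viewed as grouping consecutive above-threshold positions into intervals and attaching a score to each. The sketch below assumes a fixed threshold, half-open intervals, and the maximum probability as the peak score; none of these details are specified in the claim, and the function name and `offset` parameter are hypothetical.

```python
# Illustrative sketch only: threshold, interval convention, and scoring rule are assumptions.

def call_peaks(probs, threshold=0.5, offset=0):
    """Return (start, end, score) for each run of positions with prob >= threshold.

    Intervals are half-open [start, end); `offset` is the genomic coordinate of probs[0].
    The score of a peak is the highest probability inside it.
    """
    peaks = []
    run_start = None
    for i, p in enumerate(probs):
        if p >= threshold and run_start is None:
            run_start = i
        elif p < threshold and run_start is not None:
            peaks.append((offset + run_start, offset + i, max(probs[run_start:i])))
            run_start = None
    if run_start is not None:
        peaks.append((offset + run_start, offset + len(probs), max(probs[run_start:])))
    return peaks

probs = [0.1, 0.7, 0.9, 0.2, 0.1, 0.6]     # hypothetical model output for a 6-bp window
peaks = call_peaks(probs, threshold=0.5, offset=1000)
```

Sorting the returned tuples by score would give a ranking of candidate regions for follow-up, in the same spirit as the scoring step the claim describes.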

13. A CNN-based machine learning system that mimics human visual inspection for epigenetic data analysis, the system comprising:
a CNN model learning unit in which a convolutional neural network (CNN) model is trained on cancer genome data analysis result data; and
a CNN model peak prediction unit in which the trained CNN model calls peaks on new cancer genome data on which it has not been trained, to search for functional regions that may underlie a mechanism of cancer development.

14. The system of claim 13, wherein the CNN model learning unit comprises:
a preprocessing module that converts input data of the CNN model, drawn from ChIP-seq read mapping data, data labeled with peak positions for at least a partial region of the ChIP-seq read mapping data, and RefSeq annotation information, into a vector.

15. The system of claim 13, wherein the CNN model learning unit comprises:
a plurality of inception modules including at least one filter among convolution, max pooling, and average pooling filters.

16. The system of claim 15, wherein the plurality of inception modules comprise:
at least one first inception module including convolution, max pooling, and average pooling filters, configured to provide information on a peak pattern and on the magnitude of a strong peak signal to the CNN model;
at least one second inception module including convolution and max pooling filters, configured to prevent the vector sizes of the max pooling and average pooling layers from growing; and
at least one third inception module including a convolutional filter that is wider and has a longer stride than the convolutional filters of the first inception module and the second inception module.

17. The system of claim 15, wherein a residual connection structure is used between the plurality of inception modules to prevent the vanishing gradient problem.

18. The system of claim 15, wherein, in the CNN model learning unit, an average pooling layer following a fully connected layer is added at the end of the plurality of inception modules to reduce the dimension.

19. The system of claim 13, wherein the CNN model learning unit measures the importance of each peak by assigning a score to each peak in the output data of the CNN model using sigmoid activation.

20. The system of claim 13, wherein the CNN model peak prediction unit inputs new cancer genome data to the trained CNN model to predict peaks through peak calling and to score each peak.