KR20200017653A

KR20200017653A - Method for prediction of drug-target interactions

Info

Publication number: KR20200017653A
Application number: KR1020180092793A
Authority: KR
Inventors: 남호정; 이인구; 금종수
Original assignee: 광주과학기술원 (Gwangju Institute of Science and Technology)
Priority date: 2018-08-09
Filing date: 2018-08-09
Publication date: 2020-02-19
Also published as: KR102213670B1

Abstract

According to the present invention, a method for predicting a protein-target interaction may comprise the steps of: learning a protein sequence using a convolutional neural network in a processor to extract a local residue pattern; learning a drug fingerprint through a first fully connected layer in the processor to extract a drug fingerprint pattern; concatenating the local residue pattern and the drug fingerprint pattern in the processor; and learning the concatenated pattern through a second fully connected layer in the processor.

Description

Method for predicting drug-target interactions {METHOD FOR PREDICTION OF DRUG-TARGET INTERACTIONS}

The present invention relates to a method for predicting drug-target protein interactions.

In the early stages of drug discovery, identifying drug-target interactions (DTIs) plays an important role, because a drug inhibits or activates the biological activity of its target through physical binding. Drug developers therefore screen for compounds that interact with a specific target and exhibit the biological activity of interest. However, identifying DTIs through large-scale chemical or biological experiments typically takes two to three years of experimental time and incurs high costs. Accordingly, as drug, target, and interaction data have accumulated, various in silico methods have been developed to predict potential DTIs and thereby aid drug discovery.

Among computational approaches, similarity-based methods were studied first. These methods assume that a drug binds to proteins similar to its known targets, and vice versa. One of the best-established approaches, which integrates the chemical space and the genomic space into a pharmacological space, is the method of Yamanishi et al.; it uses kernel regression with information on known drug interactions as input to identify DTIs. To overcome its enormous computational requirements, Bleakley et al. developed a local model that trains interaction models locally rather than globally. In addition to greatly reducing computational complexity, this model outperformed the earlier one. However, similarity-based methods work well for DTIs within a specific protein class but not across other classes, and they are not commonly used for DTI prediction at present. A method is therefore needed that predicts DTIs regardless of protein class and of the similarity between targets or between drugs. Such a method requires considerable computational power.

Japanese Patent Application Laid-Open No. 2017-520868, published July 27, 2017, title: Binding affinity prediction system and method.
Korean Patent Laid-Open Publication No. 10-2018-0017827, published February 21, 2018, title: Method and system for predicting an RNA sequence region that binds to a protein using base profile and composition.
Korean Patent Laid-Open Publication No. 10-2018-0052959, published May 21, 2018, title: Method and apparatus for predicting binding affinity between MHC and peptide.

It is an object of the present invention to provide a method for predicting novel protein-target interactions.

A method for predicting a protein-target interaction according to an embodiment of the present invention may include: learning, in a processor, a protein sequence using a convolutional neural network to extract a local residue pattern; learning, in the processor, a drug fingerprint through a first fully connected layer to extract a drug fingerprint pattern; concatenating, in the processor, the local residue pattern and the drug fingerprint pattern; and learning, in the processor, the concatenated pattern through a second fully connected layer.

In an embodiment, learning the protein sequence may include: learning local residue patterns for a plurality of protein sequences through a convolution layer; and extracting, through a max pooling layer, the maximum value from the results for the learned local sequence patterns as the local residue pattern.

In an embodiment, the max pooling layer is a global max pooling layer.

In an embodiment, the plurality of protein sequences are taken for training from at least one database among DrugBank, IUPHAR (International Union of Basic and Clinical Pharmacology), and KEGG (Kyoto Encyclopedia of Genes and Genomes).

In an embodiment, the size of the convolution layer is the predetermined protein sequence size (MPL) minus the window size (WS), plus 1.

In an embodiment, learning the protein sequence may further include setting, through an embedding layer, the size of the embedding vectors for the plurality of protein sequences to the predetermined protein sequence size (MPL).

In an embodiment, the method may further include generating, in the processor, a score corresponding to the protein-target interaction through an output layer.

In an embodiment, generating the score may include: setting a weight for each of the local residue pattern and the drug fingerprint pattern; and generating the score by applying an activation function that takes the weighted local residue pattern and the weighted drug fingerprint pattern as input.

In an embodiment, the activation function is a sigmoid function.

In an embodiment, the method may further include adjusting the learning rate or the window size during cross-validation.

The method for predicting a protein-target interaction according to an embodiment of the present invention extracts local residue patterns using a CNN, learns the corresponding drug fingerprint patterns, and concatenates the two, so that the outcome of a protein-target interaction can be predicted efficiently.

BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings are provided to aid understanding of the present embodiments and, together with the detailed description, provide embodiments. However, the technical features of the present embodiments are not limited to any particular drawing, and the features disclosed in the drawings may be combined with one another to form new embodiments.
FIG. 1 is a diagram conceptually illustrating a deep learning apparatus for predicting drug-target interactions according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a deep learning model for predicting drug-target interactions according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the overall scheme for extracting local residue patterns from a full protein sequence.
FIG. 4 is a diagram illustrating precision-recall curves for the optimized models of the protein descriptors.
FIG. 5 is a diagram illustrating a performance comparison on an independent data set from the PubChem data set.
FIG. 6 is a diagram illustrating a performance comparison on an independent data set from the PubChem data set.
FIG. 7 is a diagram comparing the performance of the convolution model of the present invention with that of Wen's model.
FIG. 8A shows the binding site of the mast/stem cell growth factor receptor Kit protein, and FIG. 8B shows the binding site of the 5-hydroxytryptamine receptor 1B protein.
FIG. 9 is a diagram illustrating a method for predicting drug-target interactions according to an embodiment of the present invention.

DETAILED DESCRIPTION
Hereinafter, the present invention is described clearly and in detail with reference to the drawings, to such an extent that a person of ordinary skill in the art can readily practice it.

The present invention may be modified in various ways and may take various forms, so specific embodiments are illustrated in the drawings and described in detail in the text. This is not intended to limit the present invention to any particular disclosed form; the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope. Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms.

FIG. 1 is a diagram conceptually illustrating a deep learning apparatus 100 for predicting drug-target interactions (DTIs) according to an embodiment of the present invention. Referring to FIG. 1, the deep learning apparatus 100 may include a CNN unit 110, a first FC unit 120, a concatenation unit 130, a second FC unit 140, and a DTI prediction unit 150.

The CNN unit 110 may be implemented to receive a protein sequence and extract local residue patterns of the protein by learning through a convolution layer and a pooling layer. The convolution layer may generate a plurality of feature maps by performing convolution operations on the protein sequence with a plurality of filters (or kernels). To simplify the output of the convolution layer, the pooling layer may extract a specific value from a predetermined region of each feature map. For example, the pooling layer may extract the maximum, minimum, or average value of a unit region, among other options. For convenience of description, the pooling layer of the present invention is assumed below to be a max pooling layer, which extracts the maximum value. In short, the CNN unit 110 can extract local residue patterns using filters.

The first FC unit 120 may be implemented to receive a drug fingerprint and learn it through a fully connected layer. The first FC unit 120 may organize and abstract the existing feature information using a weight vector and an activation function. This can be expressed as O = activation_function(W × H + b), where O is the output layer, W the weights, H the input layer, and b the bias. Each dimension of the output layer is thus the sum of the input-layer values, each multiplied by its weight, plus the bias. Applying the activation function to this result introduces nonlinearity. Each dimension of the output layer therefore represents the result of organizing the dimensions of the input layer through the weights. For the drug fully connected layer, each input dimension indicates whether a particular substructure is present in the compound. The output values of the fully connected network represent combinations of specific substructures derived by applying weights to the input substructure values. In other words, learning through the fully connected layer means that the model learns the specific substructures that matter for drug-target prediction.
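
The relation O = activation_function(W × H + b) can be illustrated with a minimal pure-Python sketch. The toy 4-bit fingerprint, the weight values, and the choice of ELU with alpha = 1.0 as the activation are assumptions made only for illustration; the text describes a 2048-bit fingerprint and learned weights.

```python
import math


def elu(x, alpha=1.0):
    """Exponential linear unit; alpha = 1.0 is an assumed default."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def dense(inputs, weights, bias, activation):
    """Compute O = activation(W x H + b) for a single input vector H."""
    outputs = []
    for row, b in zip(weights, bias):
        # Weighted sum of all input dimensions plus the bias term.
        z = sum(w * h for w, h in zip(row, inputs)) + b
        outputs.append(activation(z))
    return outputs


# Hypothetical 4-bit fingerprint (1 = substructure present).
fingerprint = [1, 0, 1, 1]
# Hypothetical weights: two output dimensions, four input dimensions.
W = [[0.5, -1.0, 0.25, 0.0],
     [-2.0, 0.0, 0.5, 0.5]]
b = [0.1, 0.0]

pattern = dense(fingerprint, W, b, elu)
```

Each entry of `pattern` is one weighted combination of substructure indicators, which is the sense in which the layer "organizes" the input dimensions.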

The concatenation unit 130 may be implemented to concatenate the local residue pattern output from the CNN unit 110 and the drug fingerprint pattern output from the first FC unit 120.

The second FC unit 140 may be implemented to learn the pattern output from the concatenation unit 130 through a fully connected layer. Here, meaningful combinations can be learned from the refined drug and protein features in order to predict the interaction between the drug and the target protein. The second FC unit 140 may operate in a manner similar to the first FC unit 120.

The DTI prediction unit 150 may be implemented to generate, from the output of the second FC unit 140, a score corresponding to the drug-target interaction (DTI). The final drug-target interaction can be predicted by weighting the organized combinations produced by the second FC unit 140. For example, with a sigmoid function as the activation function, the organized combinations can be scored with a value between 0 and 1, and classification can be performed according to this score. In this way, the scoring outputs a value that predicts the final DTI by weighting the combinations organized in the second fully connected layer.
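
A minimal sketch of this scoring step follows, assuming a single output neuron. The feature values, weights, bias, and the 0.5 decision threshold are hypothetical; the text only states that the score lies between 0 and 1 and is used for classification.

```python
import math


def sigmoid(z):
    """Map any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dti_score(features, weights, bias):
    """Weight the organized combinations and squash the sum with a sigmoid."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(z)


def classify(score, threshold=0.5):
    """Label a pair as interacting when the score reaches the threshold.

    The threshold of 0.5 is an assumption, not a value from the text.
    """
    return score >= threshold


# Hypothetical output of the second fully connected layer.
features = [0.8, -0.3, 1.2]
weights = [1.0, 2.0, 0.5]
bias = -0.2

score = dti_score(features, weights, bias)
interacts = classify(score)
```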

In general, identifying drug-target interactions (DTIs) plays an important role in drug discovery. A drug performs its specific function by interacting with its target protein in various ways. Because in vitro and in vivo experiments are costly and labor-intensive, in silico DTI prediction approaches are gaining importance. Accordingly, the deep learning apparatus 100 according to an embodiment of the present invention may apply a convolutional neural network (CNN) to raw protein sequences to capture the local residue patterns that participate in DTIs. In addition, the deep learning apparatus 100 may detect the binding regions of a protein for a DTI by examining the pooled convolution results. In sum, as a predictive model that detects local residue patterns of target proteins, the deep learning apparatus 100 represents the protein features of raw protein sequences well and can yield better prediction results than previous approaches.

In typical feature-based DTI prediction models, the drug fingerprint is the most commonly used descriptor of drug substructures. A drug fingerprint can be converted into a binary vector whose index values indicate the presence of substructures of the drug. For proteins, composition, transition, and distribution (CTD) descriptors can be used as a common computational representation. Although various protein and chemical descriptors have been introduced, feature-based models have not shown sufficiently good predictive performance. In typical machine learning models, features must be derived by transforming the original raw forms, such as SMILES strings and amino acid sequences; local residue pattern information is lost in this transformation and is difficult to recover.

The deep learning apparatus 100 according to an embodiment of the present invention can extract local residue patterns for DTIs without losing local residue pattern information and without bias from the maximum length. By using raw protein sequences, it can predict DTIs at large scale across diverse target protein classes as well as diverse protein lengths.

FIG. 2 is a diagram illustrating a deep learning model for predicting drug-target interactions according to an embodiment of the present invention. Referring to FIG. 2, the deep learning model applies convolution filters across the entire protein sequence to extract the key local residue patterns that participate in DTIs. By pooling the maximum convolution results, the model can determine how well a given protein sequence matches the local residue patterns that participate in DTIs. Using these values as input to higher layers, the deep learning model can construct abstracted and organized features for the protein.

Finally, the deep learning model of the present invention may concatenate the drug features (or drug fingerprint patterns) with the protein features (or protein sequence patterns). The drug features may be computed by learning from drug fingerprints through a fully connected layer (the first fully connected layer). The concatenated features are then passed through a higher fully connected layer (the second fully connected layer) to predict the likelihood of a DTI.

The deep learning model according to an embodiment of the present invention may be trained on large-scale DTI information integrated from various DTI databases, which may include DrugBank, IUPHAR (International Union of Basic and Clinical Pharmacology), and KEGG (Kyoto Encyclopedia of Genes and Genomes). The model may be optimized using negative interactions predicted from MATADOR and by Liu et al. With the optimized model, DTIs from bioassay sources such as PubChem BioAssays and KinaseSARfari can be predicted to evaluate model performance.

Meanwhile, to build the training data set, known DTIs may be obtained from three databases: DrugBank, KEGG, and IUPHAR. DTIs duplicated across the three databases may be removed, yielding a total of 12,859 compounds, 5,163 proteins, and 35,145 DTIs. All collected DTIs may be treated as positive samples for training. Negative samples may be formed by randomly selecting DTIs that are disjoint from the positive samples, in twice the number of the positive samples. Because the negative samples are selected at random, ten negative data sets may be generated.
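
The negative sampling described above can be sketched as follows. This is only an illustration under stated assumptions: the drug and protein identifiers are hypothetical, negatives are drawn uniformly from all drug-protein pairs not listed as positive, and each of the ten sets uses a different random seed.

```python
import random


def sample_negatives(positives, drugs, proteins, ratio=2, seed=0):
    """Draw random drug-protein pairs that are not known positives.

    Returns ratio * len(positives) distinct pairs (ratio = 2 per the text).
    """
    rng = random.Random(seed)
    positive_set = set(positives)
    target = ratio * len(positive_set)
    # Assumes every positive pair lies inside drugs x proteins.
    available = len(drugs) * len(proteins) - len(positive_set)
    if target > available:
        raise ValueError("not enough non-positive pairs to sample from")
    negatives = set()
    while len(negatives) < target:
        pair = (rng.choice(drugs), rng.choice(proteins))
        if pair not in positive_set:
            negatives.add(pair)
    return sorted(negatives)


# Hypothetical toy data; the real set has 35,145 positive DTIs.
positives = [("D1", "P1"), ("D2", "P2"), ("D3", "P3")]
drugs = ["D1", "D2", "D3", "D4"]
proteins = ["P1", "P2", "P3", "P4"]

# Ten independently sampled negative sets, one per seed.
negative_sets = [
    sample_negatives(positives, drugs, proteins, seed=s) for s in range(10)
]
```

Training once per negative set is one way to check that results do not depend on a particular random draw.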

For evaluation, two independent test data sets may be built from the PubChem BioAssay database and ChEMBL KinaseSARfari. These data sets consist of the results of experimental assays. To obtain positive DTIs from PubChem, "active" DTIs may be collected from assays reporting a dissociation constant (Kd < 10 μM). Because the goal is to predict whether a drug binds to a protein, the dissociation constant (Kd) is, among the various assay types (IC50, EC50, Kd, Ki, AC50), the most appropriate for obtaining positive samples.
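
The positive-sample filter can be sketched as a simple predicate over assay records. The record layout (`drug`, `protein`, `assay_type`, `outcome`, `kd_um`) is hypothetical and does not reflect the actual PubChem BioAssay schema; only the "active" label and the 10 micromolar Kd cutoff come from the text.

```python
KD_THRESHOLD_UM = 10.0  # micromolar cutoff stated in the text


def select_positive_dtis(records, threshold_um=KD_THRESHOLD_UM):
    """Keep active Kd assay results whose Kd is below the threshold."""
    positives = []
    for rec in records:
        if (rec["assay_type"] == "Kd"
                and rec["outcome"] == "active"
                and rec["kd_um"] < threshold_um):
            positives.append((rec["drug"], rec["protein"]))
    return positives


# Hypothetical assay records for illustration only.
records = [
    {"drug": "D1", "protein": "P1", "assay_type": "Kd",
     "outcome": "active", "kd_um": 0.5},
    {"drug": "D2", "protein": "P1", "assay_type": "Kd",
     "outcome": "active", "kd_um": 25.0},
    {"drug": "D3", "protein": "P2", "assay_type": "IC50",
     "outcome": "inactive", "kd_um": None},
]

positive_pairs = select_positive_dtis(records)
```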

For negative samples, samples annotated as "inactive" in the other assay types may be used. However, PubChem BioAssay would yield far too many negative samples. Therefore, first, only negative samples whose drug and target both appear in the positive samples of PubChem BioAssay are collected. Second, random negative samples whose drug or protein appears in the positive samples are added until the number of negative samples equals the number of positive samples. As a result, 12,906 positive and negative samples may be obtained, covering 14,343 drugs and 714 proteins.

Samples may also be collected from KinaseSARfari, which consists of assays involving compounds that bind to kinase domains. To obtain positive samples from KinaseSARfari, each assay result with a dissociation constant small enough to be judged positive (Kd < 10 μM) may be regarded as positive. In contrast to PubChem BioAssay, the number of negative samples in KinaseSARfari is similar to the number of positive samples, so the negative samples are not subsampled. The resulting 3,835 positive samples and 5,520 negative samples cover 3,379 compounds and 389 proteins.

Meanwhile, the deep learning model according to an embodiment of the present invention may take raw protein sequences as the protein input. For drugs, the ECFP fingerprint may be used, which treats a molecule as a graph and searches subgraphs of the whole molecular graph for substructures of the molecule. Specifically, ECFP fingerprints of radius 2 may be computed from raw SMILES strings using RDKit. Each drug is thus represented as a binary vector of length 2048, in which each index indicates the presence of a particular substructure.
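
The final representation, a 2048-bit binary vector, can be illustrated with a toy sketch of the folding step alone. This is not an ECFP implementation: in practice RDKit enumerates the circular substructures and produces the bit vector. The integer identifiers below are made up, and folding by taking the identifier modulo the vector length is an assumed, commonly used scheme.

```python
def fold_to_bits(substructure_ids, n_bits=2048):
    """Set bit (id mod n_bits) for every substructure identifier."""
    bits = [0] * n_bits
    for ident in substructure_ids:
        bits[ident % n_bits] = 1
    return bits


# Hypothetical substructure identifiers for one molecule.
ids = [17, 2065, 4100]
fingerprint = fold_to_bits(ids)

# 17 and 2065 fold onto the same index, so only two bits are set.
set_bits = [i for i, bit in enumerate(fingerprint) if bit]
```

The collision between the first two identifiers shows why a set bit indicates that at least one substructure mapped to that index, rather than identifying a unique substructure.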

Meanwhile, the deep learning model according to an embodiment of the present invention extracts local residue patterns from the full protein sequence through a CNN and computes a latent representation of the drug fingerprint through fully connected layers. After both the drug layers and the protein layers have been processed, the processed layers are concatenated and the result is computed in a fully connected layer. All layers except the output layer may be activated with the exponential linear unit (ELU) function.
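
The equation for the ELU activation is not reproduced in this text. For reference, the standard definition of ELU is given below; the symbol α is a positive hyperparameter whose value the text does not state, so any particular value (for example 1) would be an assumption.

```latex
\mathrm{ELU}(x) =
\begin{cases}
x, & x > 0 \\
\alpha \left( e^{x} - 1 \right), & x \le 0
\end{cases}
```

Unlike ReLU, ELU yields small negative outputs for negative inputs, saturating at −α.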

The output layer may be activated with a sigmoid function for classification. The entire neural network model may be implemented in Keras.

Meanwhile, one difficulty in describing protein features for machine learning and deep learning models is that proteins differ in length. Another is that only specific parts of a protein, such as particular domains or motifs, are involved in a DTI, not the entire protein structure. Consequently, the physicochemical properties of the whole protein sequence are ill-suited as a descriptor for DTIs, because the parts of the sequence not involved in the DTI contribute noise. Accurate prediction therefore requires extracting the local residue patterns involved in DTIs. A convolutional neural network can capture important local residue patterns across the entire space.

FIG. 3 is a diagram illustrating the overall scheme for extracting local residue patterns from a full protein sequence. Referring to FIG. 3, to enable convolution over a protein sequence, each amino acid residue of the sequence may be converted from its amino acid label into an embedding vector with an embedding size (ES) of 20. In this conversion, the length of the embedded sequence for every protein may be set to the maximum protein length (MPL), namely 2500. Remaining positions are padded with a null label ($) and its corresponding embedding vector; these positions yield meaningless convolution results. As a result, a 20 × 2500 embedding layer may be constructed as the protein feature. Convolution may then be performed over the protein sequence in a 1D manner with a stride of 1, each convolution spanning the j-th through (j + WS)-th amino acids.
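
The padding, embedding, and stride-1 convolution steps can be sketched in pure Python. Several details are assumptions made for illustration: toy sizes (MPL = 12, ES = 4, WS = 3) replace the stated MPL = 2500 and ES = 20, the embedding table is random rather than learned, sequences longer than MPL are truncated, and a single filter with constant weights stands in for the learned filters. The activation applied after the convolution is omitted.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "$"  # null label used for padding


def pad_sequence(seq, mpl):
    """Truncate (an assumption) or right-pad a sequence to length mpl."""
    return seq[:mpl] + PAD * (mpl - len(seq))


def make_embedding(es, seed=0):
    """Random stand-in for a learned embedding table of size es."""
    rng = random.Random(seed)
    return {label: [rng.uniform(-1.0, 1.0) for _ in range(es)]
            for label in AMINO_ACIDS + PAD}


def conv1d(embedded, kernel, bias=0.0):
    """Stride-1 1D convolution of one filter over an embedded sequence."""
    ws = len(kernel)
    outputs = []
    for j in range(len(embedded) - ws + 1):
        total = bias
        for k in range(ws):
            # Dot product of filter row k with the residue at position j + k.
            total += sum(w * x for w, x in zip(kernel[k], embedded[j + k]))
        outputs.append(total)
    return outputs


MPL, ES, WS = 12, 4, 3  # toy sizes for illustration
table = make_embedding(ES)
padded = pad_sequence("MKTAYIAK", MPL)
embedded = [table[label] for label in padded]
kernel = [[0.1] * ES for _ in range(WS)]  # hypothetical filter weights
feature_map = conv1d(embedded, kernel)
```

The resulting `feature_map` has MPL − WS + 1 entries, one per window position.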

The convolution over the full protein sequence produces, for each convolution filter, a convolution layer of size (MPL − WS + 1), where WS is the window size. Finally, global max pooling may be performed for each filter to extract the most important local residue pattern.

Here, j ranges over all convolution results for protein P_k. For each window size, a vector whose length equals the number of filters and which holds the maximum convolution results can be produced. No bias is introduced by the position of a local residue pattern or by the maximum protein length. Finally, a latent representation of the protein can be constructed by concatenating all max-pooling results. This representation indicates how important each local residue pattern in the sequence is to the interaction. For further organization and abstraction of the protein features, the concatenated max-pooling results can be fed to a fully connected layer.
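The padding, embedding, convolution, and global max-pooling steps described above can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the random embedding table and kernels stand in for trained weights, the ReLU activation and the truncation of sequences longer than MPL are assumptions, and all function names are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residue labels
PAD = "$"                              # null label used for padding


def pad_sequence(seq, mpl):
    """Right-pad with the null label to length mpl (truncation is an assumption)."""
    return seq[:mpl] + PAD * max(0, mpl - len(seq))


def make_embedding(es, seed=0):
    """Untrained placeholder table mapping each label to a vector of length es."""
    rng = random.Random(seed)
    return {label: [rng.uniform(-1, 1) for _ in range(es)]
            for label in AMINO_ACIDS + PAD}


def conv1d(embedded, kernel, bias=0.0):
    """Stride-1 1D convolution followed by ReLU (the activation is an assumption).

    embedded: list of MPL vectors, kernel: list of WS vectors.
    Returns MPL - WS + 1 activations, one per window position.
    """
    ws = len(kernel)
    out = []
    for j in range(len(embedded) - ws + 1):
        s = bias
        for i in range(ws):
            s += sum(w * x for w, x in zip(kernel[i], embedded[j + i]))
        out.append(max(0.0, s))
    return out


def global_max_pool(activations):
    """Keep only the strongest response of a filter, wherever it occurs."""
    return max(activations)


def protein_features(seq, kernels, table, mpl):
    """One pooled value per filter; the result is the latent protein vector."""
    embedded = [table[label] for label in pad_sequence(seq, mpl)]
    return [global_max_pool(conv1d(embedded, k)) for k in kernels]


# Toy sizes for readability; the text uses ES = 20 and MPL = 2500.
table = make_embedding(es=4)
rng = random.Random(1)
kernels = [[[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
           for _ in range(5)]  # 5 filters with window size WS = 3
features = protein_features("MKTAYIAK", kernels, table, mpl=12)
```

Because only the maximum over positions is kept, the pooled vector has one entry per filter regardless of where in the sequence a pattern matched. Non-standard residue labels would raise a `KeyError` in this sketch.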

As described above, a latent representation of the drug fingerprint descriptor is more useful for predicting DTIs. After the protein and drug features have been refined by neural network layers, these features are concatenated, and a fully connected layer can be applied to obtain the final output that determines whether the drug and the target interact.

Using the constructed deep neural model, the input can be propagated to the output layer in a feed-forward manner. The deep neural model can compute the loss with binary cross-entropy as follows.
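The standard definition of binary cross-entropy is shown below as a reference sketch. The symbols N (number of training pairs), y_n (true label, 0 or 1), and ŷ_n (predicted interaction probability) are assumed notation and are not taken from the original.

```latex
\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{n=1}^{N}
\left[ y_n \log \hat{y}_n + (1 - y_n) \log\left(1 - \hat{y}_n\right) \right]
```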

An L2-norm penalty can be applied to the loss function to prevent overfitting.
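A minimal sketch of a binary cross-entropy loss with an added L2 penalty is shown below. The regularization strength `lam`, the clipping constant `eps`, and the function names are assumptions for illustration; the text does not state the values used.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def l2_penalty(weights, lam):
    """L2-norm penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def regularized_loss(y_true, y_pred, weights, lam=1e-4):
    """Data loss plus weight penalty; lam is a hypothetical setting."""
    return binary_cross_entropy(y_true, y_pred) + l2_penalty(weights, lam)
```

The penalty grows with the squared magnitude of the weights, so minimizing the combined loss discourages large weights that fit noise in the training data.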

Finally, the weights are updated with the Adam optimizer, so that the deep learning model can provide generalized predictions.

In addition, batch normalization may be used to prevent overfitting.

The deep learning model according to an embodiment of the present invention may tune hyperparameters that affect performance, such as the learning rate and the window size, during cross-validation. An external validation data set can be constructed to select the most appropriate hyperparameters. Positive DTIs that do not overlap with the training data set are collected from the MATADOR database. To build a reliable negative data set, negative DTIs with a high negative score (> 0.93) can be taken from the data of Liu et al. As a result, 400 positive DTIs and 404 negative DTIs can be generated as the external validation set. After the external validation data set is built, a grid search can be used to identify the optimal hyperparameters, that is, those yielding the highest area under the precision-recall curve (AUPR). When the AUPR is measured, the optimal threshold can be given by the equal error rate (EER).

Here, θ is the classification threshold, and α is a constant that determines the cost ratio of misclassification between accuracy and recall.

Meanwhile, after the classification threshold is determined, the predictive performance of the deep neural model on independent test data sets can be measured in terms of sensitivity (Sen), specificity (Spe), precision (Pre), accuracy (Acc), and the F1 measure (F1). As a general step in setting hyperparameters, the learning rate for weight updates may first be set to 0.0001. After the learning rate is fixed, the window size and number, and the hidden layers for the drug features, can be benchmarked using the AUPR.
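The five metrics named above can be computed from confusion-matrix counts, as in the sketch below. The convention that a score at or above the threshold counts as a positive prediction is an assumption, and the function names are hypothetical.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, TN, FN; a score >= threshold is predicted positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Return 0.0 instead of failing when a denominator is empty."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true, scores, threshold):
    """Sensitivity, specificity, precision, accuracy, and F1 at one threshold."""
    tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
    sen = _ratio(tp, tp + fn)
    spe = _ratio(tn, tn + fp)
    pre = _ratio(tp, tp + fp)
    acc = _ratio(tp + tn, tp + fp + tn + fn)
    f1 = _ratio(2 * pre * sen, pre + sen)
    return {"Sen": sen, "Spe": spe, "Pre": pre, "Acc": acc, "F1": f1}
```

Reporting all five values together makes imbalances visible, for example a model with high specificity but low sensitivity at the chosen threshold.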

FIG. 4 illustrates precision-recall curves for the optimized models of the protein descriptors. As shown in FIG. 4, by selecting the optimized hyperparameters of the model, an AUPR of 0.817 can finally be obtained on the validation data set. The fully optimized model is visualized as a graph in FIG. 2. Models using other protein descriptors can be built and optimized in the same way.
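A grid search of the kind described above can be sketched as an exhaustive loop over hyperparameter combinations. The `evaluate` callback is hypothetical: in practice it would train a model with the given settings and return the AUPR on the external validation set. The toy scoring function and the candidate values below are placeholders, not results from the source.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Try every combination in grid and keep the one with the highest score.

    grid:     dict mapping a hyperparameter name to a list of candidate values
    evaluate: callable taking a dict of settings and returning a score
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    """Stand-in for 'train, then measure AUPR on the validation set'."""
    return -abs(params["window_size"] - 15) - params["learning_rate"]


candidate_grid = {
    "learning_rate": [0.001, 0.0001],
    "window_size": [10, 15, 20],
}
best_params, best_score = grid_search(candidate_grid, toy_evaluate)
```

The cost grows with the product of the candidate list lengths, which is one reason to fix the learning rate first and then benchmark the remaining settings.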

Meanwhile, after the hyperparameters are tuned, the performance of different protein descriptors can be compared: the CTD descriptor (commonly used in chemogenomic models), the normalized SW score, and the convolution method of the present invention.

FIG. 5 illustrates a performance comparison on the independent PubChem data set. Referring to FIG. 5, the results show that the convolution model of the present invention outperforms the other protein descriptors on all data sets. With the threshold selected by the EER, the convolution model performs consistently on both the PubChem and KinaseSARfari data sets, which indicates that it is generally applicable. In contrast, for the similarity descriptor the selected threshold does not work when evaluated on the independent test data sets: the metrics are markedly imbalanced, indicating a loss of generalization. Moreover, the similarity descriptor gives an AUPR of about 0.65 in the validation phase but an accuracy of only about 0.52 on the PubChem data set, which is close to random on the independent data set.

The SW score feature works only when the protein class is restricted and specified. The SW score is therefore unsuitable for large-scale DTI prediction. The CTD descriptor performs better than the similarity descriptor but worse than the convolution model of the present invention. However, on the KinaseSARfari data set, as shown in FIG. 6, the performance of the CTD descriptor drops sharply, particularly in the F1 score. Because the KinaseSARfari data set provides compound-domain bioassays, a CTD model trained on whole protein sequences cannot predict compound-domain interactions, so its F1 score and sensitivity fall markedly. In contrast to CTD, the convolution model of the present invention learns local residue patterns during training and can therefore predict compound-domain interactions reliably.

FIG. 7 illustrates a performance comparison between the convolution model of the present invention and the model of Wen et al. In addition to comparing the convolution of the present model with other protein descriptors, the convolution model is compared with a previous model. The previous model selected for comparison is that of Wen et al. Performance is difficult to compare against other studies because their purposes and data sets differ from those of the present model. DTIs are represented using the optimized model and descriptors. After the pre-training and fine-tuning steps, the PubChem data set is evaluated and the results are compared with those of the present model. In this comparison, the convolution model of the present invention performs better than the previous model, as shown in FIG. 7. The previous model is built as a deep belief network (DBN), which is a stack of restricted Boltzmann machines (RBMs). However, RBMs are now regarded as outdated and are used in deep learning methods only as a weight initialization technique.

Meanwhile, binding regions can be detected sequentially by the CNN. Because the maximum result is collected for each filter of each window, the pooled convolution results can highlight regions that match local residue patterns. The pooled values are not directly tied to the DTI prediction because they pass through higher fully connected layers, but the high values among them can influence the prediction because they carry the scores of matched local residue patterns. Therefore, if the convolution model of the present invention can capture local residue patterns, it can assign high values to actual binding regions.

Examining the intermediate layer of the prediction results shows that the convolution model of the present invention can collect local residue patterns and use them as features for predicting DTIs. The sc-PDB database provides atom-level descriptions of proteins, ligands, and binding sites from complex structures. By parsing the binding site annotations, binding sites between protein domains and physiological ligands can be queried.

DTI predictions with high scores in the independent PubChem data set are examined. Interestingly, the convolution results yield high values where binding regions are included. For example, mast/stem cell growth factor receptor Kit (P10721, KIT_HUMAN) has many sc-PDB annotations with physiological ligands, such as 3GOF, 4UOI, 4HVS, 3GOE, and 1PKG, as shown in FIG. 8A. In FIG. 8A, the annotated binding residues are colored red for each sc-PDB annotation. The various sc-PDB annotations differ slightly but share common regions. As expected, some pooled results of the convolution layer, ranked highly among the filters that influence the high prediction score, accurately cover the binding regions.

In addition, these convolution results are not limited to one protein class. For 5-hydroxytryptamine (serotonin) receptor 1B (P28222, 5HT1B_HUMAN), a G-protein-coupled receptor (GPCR), two sc-PDB entries (4IAQ, 4IAR) can be found. In these data, the four binding regions that form alpha-helical structures can be covered by the pooled convolution results, as shown in FIG. 8B.

The method for predicting protein-target interactions according to an embodiment of the present invention presents a new DTI prediction model that uses a CNN to extract local residue patterns from whole target protein sequences. The model of the present invention can be trained on DTIs from various drug databases and optimized on DTIs from MATADOR. As a result, the local features detected from protein sequences outperform other protein descriptors such as CTD and the SW score. The model of the present invention also performs better than the previous model built on a DBN. Furthermore, examining the pooled convolution results and comparing them with the sc-PDB annotations shows that the model is able to detect binding regions in protein sequences. Finally, the approach of the present invention, which collects local residue patterns with a CNN, enriches the protein features obtained from the raw sequence and thereby yields better characteristics than previous approaches.

FIG. 9 illustrates a method for predicting drug-target interactions according to an embodiment of the present invention. Referring to FIGS. 1 to 9, the method for predicting drug-target interactions may proceed as follows.

A protein sequence pattern is learned through a convolutional neural network (CNN), and as a result a local residue pattern can be captured (S110). That is, the local residue pattern can be learned from the protein pattern through the CNN. The CNN may consist of a convolution layer and a global max-pooling layer. The convolution layer can be formed by performing a convolution operation on the protein pattern. The global max-pooling layer can be formed by aggregating the outputs of the convolution layer using their maximum value. The internal pattern of the drug fingerprint can then be learned (S120). For example, the drug fingerprint pattern can be learned from the drug fingerprint through a fully connected (FC) layer, that is, an inner product operation. In an embodiment, the drug fingerprint is represented as a binary vector of length 2048 (ECFP4), which can be learned through the fully connected (FC) layer.

Thereafter, the CNN-learned protein sequence pattern and the FC-learned drug fingerprint pattern can be concatenated (S130). The learned local protein sequence pattern and the drug fingerprint pattern are simply joined end to end. For example, the concatenation can be expressed as concatenating([0,1], [2,3]) -> [0,1,2,3]. The concatenated pattern can then be learned through a fully connected layer (S140). Here, the interactions between the two can be learned by using the learned drug pattern and protein sequence pattern. The drug interaction can then be predicted (S150). A final score can be derived by assigning weights to the protein pattern and the drug pattern organized through the fully connected layer and applying an activation function. Because every feature is given a weight, the layer is called fully connected. For example, when the layer is H and the weight matrix is W, the score can be expressed as sigmoid(H*W) = sigmoid([0.9, 0.1, 0.5, 0.11] * [0.2, 0.5, 0.8, 0.1]) = sigmoid(0.641) = Score.
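The concatenation and scoring examples above can be reproduced with a short sketch. Splitting H into a two-element protein pattern and a two-element drug pattern is an assumption made for illustration, as are the function names and the zero bias.

```python
import math


def concatenate(a, b):
    """Join two feature vectors end to end."""
    return list(a) + list(b)


def sigmoid(x):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def output_score(features, weights, bias=0.0):
    """Weighted sum of every feature followed by the sigmoid activation."""
    weighted_sum = sum(h * w for h, w in zip(features, weights))
    return sigmoid(weighted_sum + bias)


protein_pattern = [0.9, 0.1]   # hypothetical pooled protein features
drug_pattern = [0.5, 0.11]     # hypothetical drug fingerprint features
H = concatenate(protein_pattern, drug_pattern)
W = [0.2, 0.5, 0.8, 0.1]
score = output_score(H, W)
```

The weighted sum is 0.9·0.2 + 0.1·0.5 + 0.5·0.8 + 0.11·0.1 = 0.641, and the sigmoid of 0.641 is roughly 0.655, which would be read as the predicted interaction score.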

The steps and/or operations according to the present invention may, in other embodiments, occur in a different order, in parallel, or concurrently for different epochs and the like, as would be understood by a person of ordinary skill in the art.

Depending on the embodiment, some or all of the steps and/or operations may be implemented or performed, at least in part, using one or more processors that run instructions, programs, interactive data structures, clients, and/or servers stored on one or more non-transitory computer-readable media. The one or more non-transitory computer-readable media may be, by way of example, software, firmware, hardware, and/or any combination thereof. In addition, the functionality of a "module" discussed herein may be implemented in software, firmware, hardware, and/or any combination thereof.

One or more non-transitory computer-readable media and/or means for implementing or performing one or more operations, steps, or modules of embodiments of the present invention may include, but are not limited to, application-specific integrated circuits (ASICs), standard integrated circuits, controllers that execute appropriate instructions (including microcontrollers and/or embedded controllers), field-programmable gate arrays (FPGAs), complex programmable logic devices (CPLDs), and the like.

Meanwhile, the foregoing description presents only specific embodiments for carrying out the invention. The present invention encompasses not only concrete, practically usable means themselves but also the technical idea, that is, the abstract and conceptual idea that can be used as technology in the future.

CNN: convolutional neural network
FC: fully connected neural network
100: deep learning device
110: CNN unit
120: first FC unit
130: concatenation unit
140: second FC unit
150: DTI prediction unit

Claims

A method for predicting protein-target interactions, the method comprising:
learning, in a processor, a protein sequence using a convolutional neural network to extract a local residue pattern;
learning, in the processor, a drug fingerprint through a first fully connected layer to extract a drug fingerprint pattern;
concatenating, in the processor, the local residue pattern and the drug fingerprint pattern; and
learning, in the processor, the concatenated pattern through a second fully connected layer.

The method of claim 1,
wherein learning the protein sequence comprises:
learning local residue patterns for a plurality of protein sequences through a convolution layer; and
extracting, through a max-pooling layer, a maximum value from the results of the learned local sequence patterns as the local residue pattern.

The method of claim 2,
wherein the max-pooling layer is a global max-pooling layer.

The method of claim 2,
wherein the plurality of protein sequences are learned from a database integrating DrugBank, the International Union of Basic and Clinical Pharmacology (IUPHAR), and the Kyoto Encyclopedia of Genes and Genomes (KEGG).

The method of claim 2,
wherein the size of the convolution layer is the predetermined protein sequence size (MPL) minus the window size (WS), plus one.

The method of claim 5,
wherein learning the protein sequence further comprises
setting, through an embedding layer, the size of the embedding vector for the plurality of protein sequences to the predetermined protein sequence size (MPL).

The method of claim 1,
further comprising generating, in the processor, a score corresponding to the protein-target interaction via an output layer.

The method of claim 7,
wherein generating the score comprises:
assigning a weight to each of the local residue pattern and the drug fingerprint pattern; and
generating the score by applying an activation function that takes the weighted local residue pattern and the weighted drug fingerprint pattern as input.

The method of claim 8,
wherein the activation function is a sigmoid function.

The method of claim 1,
further comprising adjusting a learning rate or a window size during cross-validation.