KR102617958B1

KR102617958B1 - Method and apparatus for cross attention mechanism based compound-protein interaction prediction

Info

Publication number: KR102617958B1
Application number: KR1020220184080A
Authority: KR
Inventors: 강재우; 장광훈; 응웬옥광
Original assignee: 고려대학교산학협력단
Priority date: 2022-12-26
Filing date: 2022-12-26
Publication date: 2023-12-27

Abstract

The present invention relates to a method and apparatus for predicting compound-protein interactions based on a cross-attention mechanism. The method may include encoding compound information based on molecular graph data and molecular fingerprint data, encoding protein information based on protein sequence data, inputting the encoded compound information and protein information into a first cross-attention block, and predicting an interaction between a compound and a protein based on the output of the first cross-attention block.

Description

Method and apparatus for predicting compound-protein interactions based on a cross-attention mechanism

The present invention relates to a method and apparatus for predicting compound-protein interactions based on a cross-attention mechanism. More specifically, it relates to a method and apparatus that use a graph convolutional neural network together with molecular fingerprint data, and that combine compound and protein information through a cross-attention mechanism rather than through a simple combination.

Compound-protein interaction prediction is an essential step in drug development and is conventionally performed through costly molecular docking simulations. Artificial intelligence based approaches have been proposed as a replacement. Two kinds of deep learning models are mainly used to bring compound data into such approaches: graph convolutional neural network models, which learn from the molecular structure represented as graph data, and artificial neural network models, which learn from molecular fingerprint data in which the substructures of a molecule are converted into bits. Most existing studies use only one of the two. Recent work on compound-protein interaction prediction has also moved beyond simply combining compound and protein data, and actively studies how to model the coupling between the two kinds of data during training. The present invention relates to a method and apparatus for predicting compound-protein interactions that use a graph convolutional neural network together with molecular fingerprint data, and that combine compound and protein information through a cross-attention mechanism rather than through a simple combination.

Korean Laid-open Patent Publication No. 10-2020-0124923 (2020.11.04)

An object of the present invention is to propose a method for predicting compound-protein interactions that is effective and performs better than existing techniques that use only one of molecular structure graph data and fingerprint data.

The technical problems addressed by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the description below.

To achieve the above object, a method for predicting compound-protein interactions based on a cross-attention mechanism according to an embodiment of the present invention may include encoding compound information based on molecular graph data and molecular fingerprint data, encoding protein information based on protein sequence data, inputting the encoded compound information and protein information into a first cross-attention block, and predicting an interaction between a compound and a protein based on the output of the first cross-attention block.

According to an embodiment, encoding the compound information may include inputting the molecular graph data into a directed message passing neural network (D-MPNN), inputting the molecular fingerprint data into a multi-layer perceptron (MLP) artificial neural network, and inputting the output of the D-MPNN and the output of the MLP artificial neural network into a second cross-attention block.

According to an embodiment, the second cross-attention block takes Q (query), K (key), and V (value) as inputs, where K and V may be determined based on the output of the D-MPNN and Q may be determined based on the output of the MLP artificial neural network.

According to an embodiment, the output of the second cross-attention block is determined from the inputs Q, K, and V by an attention formula (the formula image is not reproduced in this text), where C is the number of embedding dimensions and d may be the number of attention heads.

According to an embodiment, encoding the compound information may further include inputting the output of the second cross-attention block into a self-attention block.

According to an embodiment, encoding the protein information may include preprocessing protein data with a TAPE (Tasks Assessing Protein Embeddings) tokenizer, generating the protein sequence data as one-dimensional data by one-hot encoding, and inputting the protein sequence data into a 1D convolutional neural network (CNN).

According to an embodiment, the first cross-attention block takes Q (query), K (key), and V (value) as inputs, where Q may be determined based on the encoded compound information and K and V may be determined based on the encoded protein information.

According to an embodiment, the output of the first cross-attention block is determined from the inputs Q, K, and V by an attention formula (the formula image is not reproduced in this text), where C is the number of embedding dimensions and d may be the number of attention heads.

To achieve the above object, an apparatus for predicting compound-protein interactions based on a cross-attention mechanism according to an embodiment of the present invention includes at least one processor, a memory that loads a computer program executed by the at least one processor, and storage that stores the computer program. The computer program may be controlled so that the at least one processor performs an operation of encoding compound information based on molecular graph data and molecular fingerprint data, an operation of encoding protein information based on protein sequence data, an operation of inputting the encoded compound information and protein information into a first cross-attention block, and an operation of predicting an interaction between a compound and a protein based on the output of the first cross-attention block.

To achieve the above object, a computer program stored in a computer-readable recording medium according to an embodiment of the present invention may, in combination with a computing device, execute an operation of encoding compound information based on molecular graph data and molecular fingerprint data, an operation of encoding protein information based on protein sequence data, an operation of inputting the encoded compound information and protein information into a first cross-attention block, and an operation of predicting an interaction between a compound and a protein based on the output of the first cross-attention block.

According to the present invention, a method for predicting compound-protein interactions is disclosed that is effective and performs better than existing techniques that use only one of molecular structure graph data and fingerprint data.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a diagram showing the structure of a compound-protein interaction prediction model according to an embodiment of the present invention.
FIG. 2 is a diagram showing the detailed structure of a cross-attention mechanism according to an embodiment of the present invention.
FIG. 3 is a flowchart showing a method for predicting compound-protein interactions based on a cross-attention mechanism according to an embodiment of the present invention.
FIG. 4 is a diagram showing the process of encoding compound information according to an embodiment of the present invention.
FIG. 5 is a diagram showing the process of encoding protein information according to an embodiment of the present invention.
FIG. 6 is a diagram showing the configuration of an apparatus for predicting compound-protein interactions based on a cross-attention mechanism according to an embodiment of the present invention.

The objects and technical configuration of the present invention, and the resulting operation and effects, will be understood more clearly from the following detailed description based on the accompanying drawings. Embodiments of the present invention are described in detail with reference to the accompanying drawings.

The embodiments disclosed in this specification should not be interpreted or used as limiting the scope of the present invention. To those skilled in the art, it is evident that the description, including the embodiments, has a variety of applications. Accordingly, any embodiments described in the detailed description are examples intended to explain the present invention better, and are not intended to limit its scope to those embodiments.

The functional blocks shown in the drawings and described below are only examples of possible implementations. Other implementations may use other functional blocks without departing from the spirit and scope of the detailed description. In addition, although one or more functional blocks of the present invention are shown as individual blocks, one or more of them may be a combination of various hardware and software components that perform the same function.

In addition, the expression that something includes certain components is an open-ended expression that simply indicates that those components are present, and should not be understood as excluding additional components.

Furthermore, when a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that component, but it should be understood that other components may exist in between.

Detailed embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a diagram showing the structure of a compound-protein interaction prediction model according to an embodiment of the present invention.

Referring to FIG. 1, the compound-protein interaction prediction model according to an embodiment of the present invention may consist of three input artificial neural networks, namely a directed message passing neural network (D-MPNN), a multi-layer perceptron (MLP) artificial neural network, and a 1D convolutional neural network (CNN), together with attention blocks and an output artificial neural network. The D-MPNN and the MLP artificial neural network are used to encode compound information, and the 1D CNN may be used to encode protein information. The compound information and protein information encoded by the D-MPNN, the MLP, and the 1D CNN are combined through a cross-attention block, and the predicted binding score may be output by the output artificial neural network.

First, the method of encoding compound information with the input neural networks is described. Referring to FIG. 1, among the input artificial neural networks, the D-MPNN and the MLP artificial neural network may receive molecular structure graph data and molecular fingerprint data (Morgan fingerprint data), respectively.

Here, the molecular structure graph data and the molecular fingerprint data can be described as follows.

- The molecular graph data consists of a set of atoms and a set of bonds.

- The Morgan fingerprint vector is a binary vector whose entries, 0 or 1, express the absence or presence of particular substructures. The Morgan algorithm explores all paths reaching every atom in the molecule, and the Morgan fingerprint vector is the data obtained by converting each unique path into a bit.
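The idea of a binary substructure vector can be illustrated with a toy sketch. This is not the Morgan algorithm itself: the substructure labels below are hypothetical and written by hand, and the function name `fold_substructures` is an assumption made for illustration. It only shows how a set of substructure identifiers might be hashed into a fixed-length vector of 0s and 1s.

```python
import zlib

def fold_substructures(substructures, n_bits=16):
    """Fold substructure identifiers into a fixed-length binary vector."""
    bits = [0] * n_bits
    for s in substructures:
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(s.encode("utf-8")) % n_bits
        bits[index] = 1  # 1 marks "a substructure mapped to this bit is present"
    return bits

# Hypothetical substructure labels for ethanol (not produced by a real Morgan run).
ethanol_substructures = ["C", "O", "C-C", "C-O", "C-C-O"]
fingerprint = fold_substructures(ethanol_substructures, n_bits=16)
print(fingerprint)
```

Two different substructures can land on the same bit, so the number of set bits may be smaller than the number of substructures. In practice a cheminformatics toolkit would be used to generate the fingerprint.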

The molecular graph data is input to the D-MPNN, and the Morgan fingerprint vector may be input to the MLP artificial neural network. The D-MPNN is a model that learns the relationships between atoms using the directed edges of the molecular structure graph rather than its nodes, and it outputs a graph-based compound feature. The MLP artificial neural network extracts nonlinear information from its input and outputs a fingerprint-based compound feature.
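A minimal sketch of message passing over directed edges is given below, to make the contrast with node-based message passing concrete. It assumes the commonly described D-MPNN update (hidden states live on directed bonds, and a message to edge v->w excludes the reverse edge w->v), simplified to use atom features only. The function name, the weight names `W_i`, `W_m`, `W_a`, and all sizes are assumptions for illustration and are not taken from the patent.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dmpnn_atom_states(atom_feats, bonds, W_i, W_m, W_a, steps=3):
    """Toy directed message passing; returns one hidden vector per atom.
    Assumes the molecule has at least one bond."""
    n_hidden = W_i.shape[0]
    # Each undirected bond (v, w) becomes two directed edges v->w and w->v.
    edges = [(v, w) for v, w in bonds] + [(w, v) for v, w in bonds]
    # Initial state of edge v->w, here computed from the source atom only.
    h0 = np.stack([relu(W_i @ atom_feats[v]) for v, _ in edges])
    h = h0.copy()
    for _ in range(steps):
        new_h = np.zeros_like(h)
        for i, (v, w) in enumerate(edges):
            msg = np.zeros(n_hidden)
            for j, (k, t) in enumerate(edges):
                # Collect edges k->v entering v, skipping the reverse edge w->v.
                if t == v and k != w:
                    msg += h[j]
            new_h[i] = relu(h0[i] + W_m @ msg)
        h = new_h
    states = []
    for v in range(len(atom_feats)):
        msg = np.zeros(n_hidden)
        for j, (_, t) in enumerate(edges):
            if t == v:  # all edges entering atom v
                msg += h[j]
        states.append(relu(W_a @ np.concatenate([atom_feats[v], msg])))
    return np.stack(states)

rng = np.random.default_rng(0)
F, H = 4, 8                       # assumed feature and hidden sizes
atoms = rng.normal(size=(3, F))   # three atoms in a chain 0-1-2
bonds = [(0, 1), (1, 2)]
atom_states = dmpnn_atom_states(
    atoms, bonds,
    rng.normal(size=(H, F)),
    rng.normal(size=(H, H)) * 0.1,
    rng.normal(size=(H, F + H)) * 0.1,
)
print(atom_states.shape)  # (3, 8)
```

Summing the rows of `atom_states` would give a single molecule-level vector; whether the patent's model keeps per-atom vectors or a pooled vector is not stated in this text.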

The two outputs can be combined through a cross-attention mechanism. The cross-attention mechanism takes Q (query), K (key), and V (value) as inputs; Q can be generated from the output of the MLP artificial neural network, and K and V from the output of the D-MPNN. This is expressed in Equation 1.

As in Equation 1, Q is generated from the output of the MLP artificial neural network, and K and V are each generated from the output of the D-MPNN. Each of them is produced by a projection function, which can be defined in terms of a weight and a bias.
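The image of Equation 1 is not reproduced in this text. A plausible reading, under assumed notation, is sketched below: $x_{\mathrm{fp}}$ stands for the output of the MLP artificial neural network and $x_{\mathrm{g}}$ for the output of the D-MPNN, and each projection is taken to be affine with its own weight $W$ and bias $b$. These symbols and the affine form are assumptions, not the patent's notation.

```latex
\begin{aligned}
Q &= f_Q(x_{\mathrm{fp}}), \qquad K = f_K(x_{\mathrm{g}}), \qquad V = f_V(x_{\mathrm{g}}), \\
f_{\ast}(x) &= W_{\ast}\, x + b_{\ast}, \qquad \ast \in \{Q, K, V\}
\end{aligned}
```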

FIG. 2 is a diagram showing the detailed structure of a cross-attention mechanism according to an embodiment of the present invention.

The outputs of the D-MPNN and the MLP artificial neural network can be combined through a cross-attention mechanism that takes Q, K, and V as inputs. This is expressed in Equation 2.
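The image of Equation 2 is likewise missing from this text. Assuming the standard scaled dot-product form, with $C$ the number of embedding dimensions and $d$ the number of attention heads, one consistent reconstruction would be the following. The scaling by $\sqrt{C/d}$ is an assumption based on how $C$ and $d$ are defined in the text.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{C/d}}\right) V
```

With a single attention head ($d = 1$) the scaling factor reduces to $\sqrt{C}$.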

Here, C and d denote the number of embedding dimensions and the number of attention heads, respectively. This model uses a single-head cross-attention mechanism (d = 1), but this is only an example and does not limit the scope of the present invention. According to an embodiment of the present invention, self-attention can additionally be applied to refine the compound representation, producing the final representation of the compound feature as in Equation 2.
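The compound-side fusion described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the patented implementation: it assumes standard scaled dot-product attention with a single head, affine projections, a D-MPNN output of five atom-level vectors, and an MLP output of one vector. The names `attention` and `init_params` and all sizes are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q_in, kv_in, params):
    """Single-head attention: queries from q_in, keys and values from kv_in."""
    (Wq, bq), (Wk, bk), (Wv, bv) = params
    Q = q_in @ Wq + bq
    K = kv_in @ Wk + bk
    V = kv_in @ Wv + bv
    C = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(C))  # (n_query, n_key), rows sum to 1
    return weights @ V

def init_params(rng, c):
    # Three (weight, bias) pairs for the Q, K, and V projections.
    return [(rng.normal(scale=c ** -0.5, size=(c, c)), np.zeros(c)) for _ in range(3)]

rng = np.random.default_rng(0)
C = 8
graph_feat = rng.normal(size=(5, C))  # assumed: 5 atom-level vectors from the D-MPNN
fp_feat = rng.normal(size=(1, C))     # assumed: 1 vector from the fingerprint MLP

# Cross-attention: Q from the fingerprint branch, K and V from the graph branch.
fused = attention(fp_feat, graph_feat, init_params(rng, C))
# Self-attention: Q, K, and V all come from the fused representation.
compound = attention(fused, fused, init_params(rng, C))
print(compound.shape)  # (1, 8)
```

Using the same `attention` function for both steps makes the difference explicit: cross-attention passes two different inputs, self-attention passes the same input twice.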

Next, the method of encoding protein information is described. Referring to FIG. 1, among the input artificial neural networks, the 1D CNN may receive protein sequence data. Protein data is preprocessed with the TAPE (Tasks Assessing Protein Embeddings) tokenizer, which represents each residue by the unique number assigned to it in the UniRep vocabulary. The protein sequence is then one-hot encoded into one-dimensional data and input to the 1D CNN model, which can generate the protein feature.
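The protein branch can be sketched as tokenize, one-hot encode, then convolve along the sequence. The vocabulary below is a hypothetical stand-in, not the UniRep vocabulary used by the TAPE tokenizer, and the kernel width, channel count, and example sequence are assumptions chosen for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical vocabulary: index 0 is reserved for padding or unknown residues.
VOCAB = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

def tokenize(sequence):
    """Map each residue letter to its integer id (0 if unknown)."""
    return [VOCAB.get(residue, 0) for residue in sequence]

def one_hot(token_ids, vocab_size):
    encoded = np.zeros((len(token_ids), vocab_size))
    encoded[np.arange(len(token_ids)), token_ids] = 1.0
    return encoded

def conv1d(x, kernels, bias):
    """Valid 1D convolution (cross-correlation) over the sequence axis, then ReLU.
    x: (length, in_channels); kernels: (width, in_channels, out_channels)."""
    width = kernels.shape[0]
    windows = np.stack([x[i:i + width] for i in range(x.shape[0] - width + 1)])
    out = np.einsum("lwc,wco->lo", windows, kernels) + bias
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
ids = tokenize("MKTAYIAK")                      # arbitrary example sequence
x = one_hot(ids, vocab_size=len(VOCAB) + 1)     # shape (8, 21)
features = conv1d(x, rng.normal(size=(3, 21, 16)), np.zeros(16))
print(features.shape)  # (6, 16)
```

A sequence of length 8 with a kernel of width 3 and no padding yields 6 positions, each described by 16 channels.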

To predict compound-protein interactions accurately, the way in which the compound feature and the protein feature obtained from the input artificial neural networks are combined is important. The present invention proposes a new asymmetric artificial intelligence model that first combines several kinds of data on the structure of the compound and then combines the result with the protein data through a cross-attention module to predict the interaction. The cross-attention module is trained to capture the patterns of influence that compound data has on a variety of protein data. In other words, it models how a protein reacts with a compound. In this model, the formula for combining the compound feature and the protein feature is given by Equation 3.

The output of the cross-attention block is input to the output artificial neural network, which generates a binding score between the protein and the compound; the interaction between the compound and the protein can be predicted from this score. For example, the output of the cross-attention block is input to an MLP artificial neural network, and the MLP artificial neural network can generate the binding score as its output. The final interaction between the compound and the protein can be predicted based on this binding score.
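The asymmetric combination and the output head can be sketched as follows. The sketch assumes scaled dot-product attention without bias terms, a compound feature of one vector, a protein feature of six position vectors, and a two-layer MLP head; the function names and all sizes are hypothetical, and the weights are random rather than trained, so the resulting score has no meaning beyond showing the data flow.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(compound, protein, Wq, Wk, Wv):
    Q = compound @ Wq   # queries come from the compound feature
    K = protein @ Wk    # keys come from the protein feature
    V = protein @ Wv    # values come from the protein feature
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V, weights

def output_mlp(x, W1, b1, W2, b2):
    """Two-layer MLP mapping the joint feature to a scalar binding score."""
    hidden = np.maximum(x @ W1 + b1, 0.0)
    return (hidden @ W2 + b2).item()

rng = np.random.default_rng(0)
C = 8
compound = rng.normal(size=(1, C))   # assumed: final compound representation
protein = rng.normal(size=(6, C))    # assumed: 6 position-level protein vectors
Wq, Wk, Wv = (rng.normal(scale=C ** -0.5, size=(C, C)) for _ in range(3))

joint, weights = cross_attention(compound, protein, Wq, Wk, Wv)
score = output_mlp(
    joint,
    rng.normal(size=(C, 4)), np.zeros(4),
    rng.normal(size=(4, 1)), np.zeros(1),
)
print(weights.shape, type(score).__name__)  # (1, 6) float
```

The attention weights form one row over the six protein positions, which reflects the asymmetry described in the text: the compound queries the protein, not the other way around.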

FIG. 3 is a flowchart showing a method for predicting compound-protein interactions based on a cross-attention mechanism according to an embodiment of the present invention. Each step of the method described below with reference to the drawings may be understood as being performed by an apparatus for predicting compound-protein interactions based on a cross-attention mechanism. In the following description, unless otherwise defined, "apparatus" may be understood to mean an apparatus for predicting compound-protein interactions based on a cross-attention mechanism according to various embodiments of the present invention.

Referring to FIG. 3, the apparatus may first encode compound information based on molecular graph data and molecular fingerprint data (Morgan fingerprint data) (S310). The molecular graph data consists of a set of atoms and a set of bonds. The molecular fingerprint data can be expressed as a Morgan fingerprint vector, a binary vector whose entries, 0 or 1, express the absence or presence of particular substructures. The process of encoding compound information in step S310 is described below with reference to FIG. 4.

FIG. 4 is a diagram showing the process of encoding compound information according to an embodiment of the present invention.

Referring to FIG. 4, the molecular graph data may be input to the directed message passing neural network (D-MPNN) (S410). Here, the D-MPNN is a model that learns the relationships between atoms using the directed edges of the molecular structure graph rather than its nodes, and it outputs a graph-based compound feature. In addition, the molecular fingerprint data, expressed as a Morgan fingerprint vector, may be input to the multi-layer perceptron (MLP) artificial neural network (S420). The MLP artificial neural network extracts nonlinear information from its input and outputs a fingerprint-based compound feature.

The output of the D-MPNN and the output of the MLP artificial neural network are input to a cross-attention block and may be combined through the cross-attention mechanism (S430). The cross-attention block used in step S430 takes Q (query), K (key), and V (value) as inputs; Q can be generated from the output of the MLP artificial neural network, and K and V from the output of the D-MPNN. For the process of generating Q, K, and V from these two outputs, refer to Equation 1 above.

According to an embodiment, to refine the compound representation, the output of the cross-attention block can be used as the input of a self-attention block, so that a self-attention mechanism is additionally applied to produce the final representation of the compound.

Returning to FIG. 3, the apparatus may encode protein information based on protein sequence data (S320). The process of encoding protein information in step S320 is described below with reference to FIG. 5.

FIG. 5 is a diagram showing the process of encoding protein information according to an embodiment of the present invention.

Referring to FIG. 5, protein data is preprocessed with the TAPE (Tasks Assessing Protein Embeddings) tokenizer, which represents each residue by the unique number assigned to it in the UniRep vocabulary (S510), after which the protein sequence data may be generated as one-dimensional data by one-hot encoding (S520). The generated protein sequence data is input to the 1D convolutional neural network (CNN), and the final protein feature may be generated as the output of the 1D CNN (S530).

The compound information and protein information encoded in steps S310 and S320 are input to a cross-attention block and may be combined through the cross-attention mechanism (S330). The cross-attention block used in step S330 takes Q (query), K (key), and V (value) as inputs; Q can be generated from the encoded compound information, and K and V from the encoded protein information. For the process of generating the combined output from the encoded compound information and protein information, refer to Equation 3 above.

Finally, the apparatus may predict the interaction between the compound and the protein based on the output of the cross-attention block used in step S330 (S340). As one example, as shown in FIG. 1, the output of the cross-attention block of step S330 can be passed through an MLP artificial neural network to generate a binding score, and the final interaction between the compound and the protein can be predicted based on that binding score.

FIG. 6 is a diagram showing the configuration of an apparatus for predicting compound-protein interactions based on a cross-attention mechanism according to an embodiment of the present invention.

The apparatus 600 for predicting compound-protein interactions based on a cross-attention mechanism according to an embodiment of the present invention may include a processor 610, a network interface 620, a memory 630, storage 640, and a data bus 650 connecting them, and may of course further include any additional components required to achieve the object of the present invention.

The processor 610 controls the overall operation of each component. The processor 610 may be a CPU (Central Processing Unit), an MPU (Micro Processor Unit), an MCU (Micro Controller Unit), or any type of processor widely known in the technical field to which the present invention pertains. In addition, the processor 610 may perform computation for at least one application or program for carrying out the method for predicting compound-protein interactions based on a cross-attention mechanism according to an embodiment of the present invention.

The network interface 620 supports wired and wireless Internet communication for the apparatus 600 according to an embodiment of the present invention, and may also support other known communication methods. The network interface 620 may therefore be configured to include the corresponding communication modules.

The memory 630 stores various data, commands, and/or information, and may load one or more computer programs 641 from the storage 640 in order to perform the compound-protein interaction prediction method based on a cross-attention mechanism according to an embodiment of the present invention. Although FIG. 1 shows RAM as one example of the memory 630, various other storage media may of course also be used as the memory 630.

The storage 640 may non-transitorily store one or more computer programs 641 and large-volume network information. The storage 640 may be any one of a non-volatile memory such as a read-only memory (ROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), a removable disk, or any type of computer-readable recording medium widely known in the technical field to which the present invention pertains.

The computer program 641 is loaded into the memory 630 and, when executed by one or more processors 610, may perform an operation of encoding compound information based on molecular graph data and molecular fingerprint data, an operation of encoding protein information based on protein sequence data, an operation of inputting the encoded compound information and protein information into a cross-attention block, and an operation of predicting the interaction between a compound and a protein based on the output of the cross-attention block.
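The last two operations can be illustrated with a minimal sketch. It assumes the encoders have already produced small token matrices (the numbers below are made up), uses single-head scaled dot-product attention without learned projections, and ends with a hypothetical linear head plus sigmoid (`predict_interaction`, `head_weights`, and `bias` are illustrative names, not taken from the specification). It is not the program 641 itself, only one way the data flow could look, with Q taken from the compound and K and V taken from the protein.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    Each query row attends over all key rows and returns a weighted
    average of the value rows. Learned projections are omitted.
    """
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def predict_interaction(compound_tokens, protein_tokens, head_weights, bias=0.0):
    """Hypothetical prediction head: cross-attend, mean-pool, linear layer, sigmoid."""
    # Q comes from the encoded compound; K and V come from the encoded protein.
    attended = cross_attention(compound_tokens, protein_tokens, protein_tokens)
    n = len(attended)
    pooled = [sum(row[j] for row in attended) / n for j in range(len(attended[0]))]
    logit = sum(w * x for w, x in zip(head_weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Made-up encoder outputs: 2 compound tokens and 3 protein tokens, 4 dimensions each.
compound = [[0.1, 0.3, -0.2, 0.5], [0.4, -0.1, 0.2, 0.0]]
protein = [[0.2, 0.1, 0.0, -0.3], [0.5, 0.5, -0.5, 0.1], [-0.2, 0.4, 0.3, 0.2]]
score = predict_interaction(compound, protein, head_weights=[0.5, -0.25, 0.75, 1.0])
```

In this sketch `score` is a value between 0 and 1 that could be read as an interaction probability; a trained model would learn the projection and head weights rather than take them as fixed inputs.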

The data bus 650 serves as a path for moving commands and/or information between the processor 610, the network interface 620, the memory 630, and the storage 640 described above.

Although embodiments of the present invention have been described above with reference to the attached drawings, those of ordinary skill in the technical field to which the present invention pertains will understand that the present invention can be implemented in other specific forms without changing its technical idea or essential features. Therefore, the embodiments described above should be understood in all respects as illustrative and not restrictive.

600: Compound-protein interaction prediction device based on cross-attention mechanism
610: Processor
620: Network interface
630: Memory
640: Storage
641: Computer program
650: Data bus

Claims

A compound-protein interaction prediction method based on a cross-attention mechanism, the method comprising:
encoding compound information based on each of molecular graph data and molecular fingerprint data;
encoding protein information based on protein sequence data;
inputting the encoded compound information and the encoded protein information into a first cross-attention block; and
predicting an interaction between a compound and a protein based on an output of the first cross-attention block,
wherein the first cross-attention block takes Q (query), K (key), and V (value) as inputs, Q being determined based on the encoded compound information, and K and V being determined based on the encoded protein information, and
wherein the encoding of the compound information comprises:
inputting the molecular graph data into a directed message passing neural network (D-MPNN);
inputting the molecular fingerprint data into a multi-layer perceptron (MLP) artificial neural network; and
inputting the output of the D-MPNN and the output of the MLP artificial neural network into a second cross-attention block, thereby outputting the encoded compound information and the Q input to the first cross-attention block.

(Deleted)

The method according to claim 1,
wherein the second cross-attention block takes Q (query), K (key), and V (value) as inputs, V and K being determined based on the output of the D-MPNN, and Q being determined based on the output of the MLP artificial neural network.

The method according to claim 3,
wherein the output of the second cross-attention block is determined from the inputs Q, K, and V as follows,

where C is the number of embedding dimensions and d is the number of attention heads.
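The equation this claim refers to does not appear in the text above. As a minimal sketch, assuming the block follows standard multi-head scaled dot-product attention, the C embedding dimensions would be split across d heads, and each head would scale its dot products by the square root of the per-head width C/d. That scaling, the function name `multi_head_attention`, and the omission of learned projection matrices are all assumptions made for illustration, not the claimed formula.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_attention(Q, K, V, d):
    """Multi-head scaled dot-product attention without learned projections.

    Q: n_q x C, K and V: n_k x C, d: number of heads (must divide C).
    Assumption: each head sees C // d dimensions and scales by sqrt(C / d).
    """
    C = len(Q[0])
    if C % d != 0:
        raise ValueError("embedding dimension C must be divisible by head count d")
    head_dim = C // d
    scale = math.sqrt(head_dim)
    out = [[0.0] * C for _ in Q]
    for h in range(d):
        lo, hi = h * head_dim, (h + 1) * head_dim
        for i, q in enumerate(Q):
            # Attention scores of query i against every key, within head h only.
            scores = [sum(a * b for a, b in zip(q[lo:hi], k[lo:hi])) / scale for k in K]
            weights = softmax(scores)
            # Concatenating heads amounts to writing each head's slice in place.
            for j in range(lo, hi):
                out[i][j] = sum(w * v[j] for w, v in zip(weights, V))
    return out
```

With C = 4 and d = 2, for example, each head attends over a 2-dimensional slice and scales its scores by the square root of 2.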

The method according to claim 1,
wherein the encoding of the compound information further comprises inputting the output of the second cross-attention block into a self-attention block.

The method according to claim 1,
wherein the encoding of the protein information comprises:
preprocessing protein data based on a TAPE (Tasks Assessing Protein Embeddings) tokenizer;
generating the protein sequence data as one-dimensional data based on one-hot encoding; and
inputting the protein sequence data into a 1D convolutional neural network (CNN).
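The one-hot encoding and 1D convolution steps can be sketched as follows. This is an illustration under stated assumptions: the TAPE tokenizer is replaced here by a plain character lookup over the 20 standard amino acids, and the names `AMINO_ACIDS`, `one_hot_encode`, and `conv1d`, the kernel shape, and the ReLU activation are hypothetical choices rather than details from the specification.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet: the 20 standard residues

def one_hot_encode(sequence):
    """Turn a protein string into a list of one-hot rows (length x 20)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        if residue in index:  # unknown residues are left as all-zero rows
            row[index[residue]] = 1.0
        encoded.append(row)
    return encoded

def conv1d(inputs, kernels, biases):
    """Valid 1D convolution along the sequence axis, followed by ReLU.

    inputs: length x C_in, kernels: C_out x k x C_in, biases: C_out.
    Returns (length - k + 1) x C_out feature rows.
    """
    k = len(kernels[0])
    out = []
    for start in range(len(inputs) - k + 1):
        window = inputs[start:start + k]
        row = []
        for kernel, b in zip(kernels, biases):
            acc = b
            for t in range(k):
                acc += sum(w * x for w, x in zip(kernel[t], window[t]))
            row.append(max(0.0, acc))  # ReLU
        out.append(row)
    return out

# A single width-2 filter that responds to "A" followed by "C".
encoded = one_hot_encode("ACD")
kernel = [[0.0] * len(AMINO_ACIDS) for _ in range(2)]
kernel[0][AMINO_ACIDS.index("A")] = 1.0
kernel[1][AMINO_ACIDS.index("C")] = 1.0
features = conv1d(encoded, [kernel], [0.0])  # features == [[2.0], [0.0]]
```

The filter fires on the window "AC" and stays silent on "CD", which shows how a 1D CNN picks up local residue patterns from a one-hot sequence.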

(Deleted)

The method according to claim 1,
wherein the output of the first cross-attention block is determined from the inputs Q, K, and V as follows,

where C is the number of embedding dimensions and d is the number of attention heads.

A compound-protein interaction prediction device based on a cross-attention mechanism, the device comprising:
at least one processor;
a memory into which a computer program executed by the at least one processor is loaded; and
a storage that stores the computer program,
wherein the computer program, when executed by the at least one processor, causes the device to perform:
an operation of encoding compound information based on each of molecular graph data and molecular fingerprint data;
an operation of encoding protein information based on protein sequence data;
an operation of inputting the encoded compound information and the encoded protein information into a first cross-attention block; and
an operation of predicting an interaction between a compound and a protein based on an output of the first cross-attention block,
wherein the first cross-attention block takes Q (query), K (key), and V (value) as inputs, Q being determined based on the encoded compound information, and K and V being determined based on the encoded protein information, and
wherein the operation of encoding the compound information comprises:
an operation of inputting the molecular graph data into a directed message passing neural network (D-MPNN);
an operation of inputting the molecular fingerprint data into a multi-layer perceptron (MLP) artificial neural network; and
an operation of inputting the output of the D-MPNN and the output of the MLP artificial neural network into a second cross-attention block, thereby outputting the encoded compound information and the Q input to the first cross-attention block.

A computer program stored on a computer-readable recording medium, the computer program being combined with a computing device to perform:
an operation of encoding compound information based on each of molecular graph data and molecular fingerprint data;
an operation of encoding protein information based on protein sequence data;
an operation of inputting the encoded compound information and the encoded protein information into a first cross-attention block; and
an operation of predicting an interaction between a compound and a protein based on an output of the first cross-attention block,
wherein the first cross-attention block takes Q (query), K (key), and V (value) as inputs, Q being determined based on the encoded compound information, and K and V being determined based on the encoded protein information, and
wherein the operation of encoding the compound information comprises:
an operation of inputting the molecular graph data into a directed message passing neural network (D-MPNN);
an operation of inputting the molecular fingerprint data into a multi-layer perceptron (MLP) artificial neural network; and
an operation of inputting the output of the D-MPNN and the output of the MLP artificial neural network into a second cross-attention block, thereby outputting the encoded compound information and the Q input to the first cross-attention block.