WO2024090848A1

WO2024090848A1 - Data augmentation method associated with target protein

Info

Publication number: WO2024090848A1
Application number: PCT/KR2023/015594
Authority: WO
Inventors: 이대석; 신봉근
Original assignee: 디어젠 주식회사
Priority date: 2022-10-26
Filing date: 2023-10-11
Publication date: 2024-05-02
Also published as: KR20240063817A

Abstract

Disclosed according to an embodiment of the present disclosure is a computer program stored on a computer-readable storage medium. The computer program performs a method comprising the steps of: acquiring a target protein contained in training data and index information associated with the target protein; identifying a homologous protein of the target protein; and augmenting the training data by matching the index information associated with the target protein with the homologous protein.

Description

Data augmentation method associated with target protein

The present invention relates to a data augmentation method, and more specifically, to a learning data augmentation technology associated with a target protein.

The drug-target interaction (DTI) prediction problem is the problem of computationally predicting the chemical affinity between a given drug molecule and a target protein, where the affinity is measured in various ways, such as IC50, K_i, K_d, or their modifications. For DTI problems, methods such as docking have been used in situations where the structure of the target is known. In situations where it is not, the problem has been defined and dealt with in various ways, such as binary classification problems, regression problems, and bipartite graph inference problems. In particular, from the perspective of regression problems, various machine learning and deep learning algorithms have been used, such as KronRLS, SimBoost, DeepDTA, and Deargen's MT-DTI.

One of the difficulties in approaching the DTI problem with deep learning is that few types of proteins are available as learning data. One piece of evidence is that only 229 and 224 types of proteins appear in the KIBA and DAVIS datasets, respectively. Another, indirect piece of evidence is that the human body has only about 790 types of GPCRs and about 500 types of protein kinases, which are the protein categories that are the main targets of drugs. This problem is particularly important because, referring to Figure 3, deep learning models trained on such limited data generalize poorly to new targets.

Republic of Korea Patent No. 10-2213670 (2021.02.02) discloses a method for predicting drug-target interaction.

The purpose of the present disclosure is to provide a data augmentation method associated with a target protein, which expands (augments) data using homologous proteins to partially solve the problem arising from the lack of types of proteins that can be used as learning data.

Meanwhile, the technical problem to be achieved by the present disclosure is not limited to the technical problems mentioned above, and may include various technical problems within the scope of what is apparent to those skilled in the art from the contents described below.

A method performed by a computing device according to an embodiment of the present disclosure for realizing the above-described object is disclosed. The method may include: acquiring a target protein included in learning data and indicator information associated with the target protein; identifying a homologous protein of the target protein; and augmenting the learning data by matching the indicator information associated with the target protein with the homologous protein.

Alternatively, the indicator information associated with the target protein may include information about the affinity of the target protein for the drug.

Alternatively, the method may further include filtering the augmented learning data by considering the affinity information of the target protein for the drug and the affinity information for the drug of the homologous protein.

Alternatively, the filtering step may include comparing, within the batch currently being learned, the affinity information for the drug of the target protein, which is given from the learning data or predicted by a deep learning model, with the affinity information for the drug of the homologous protein predicted by the deep learning model.

Alternatively, the filtering step may include filtering data regarding homologous proteins that have an accuracy higher than a certain rank among the accuracy values of the homologous proteins in the batch, and the accuracy values may be generated based on a comparison between the affinity information for the drug of the homologous proteins predicted by the deep learning model and the affinity information for the drug of the target protein given from the learning data or predicted by the deep learning model.

Alternatively, the filtering step may include performing filtering on the augmented learning data from the middle of the learning process of the deep learning model currently being trained.

Alternatively, the step of identifying a homologous protein of the target protein may further include performing multiple sequence alignment (MSA) on the target protein and a plurality of homologous proteins.

Alternatively, performing the multiple sequence alignment may include performing a search for the target protein and a plurality of homologous proteins that satisfy a preset identity ratio.

According to an embodiment of the present disclosure for realizing the above-described object, a computer program stored in a computer-readable storage medium is disclosed. When executed on one or more processors, the computer program performs the following operations for data augmentation associated with a target protein: acquiring a target protein included in learning data and indicator information associated with the target protein; identifying a homologous protein of the target protein; and augmenting the learning data by matching the indicator information associated with the target protein with the homologous protein.

A computing device according to an embodiment of the present disclosure for realizing the above-described object is disclosed. The device includes at least one processor and a memory, wherein the at least one processor may be configured to acquire a target protein included in learning data and indicator information associated with the target protein, identify a homologous protein of the target protein, and augment the learning data by matching the indicator information associated with the target protein with the homologous protein.

The present disclosure can provide a data augmentation method associated with a target protein that can expand (augment) data using a homologous protein to partially solve problems arising from a lack of learning data.

Meanwhile, the effects of the present disclosure are not limited to the effects mentioned above, and various effects may be included within the range apparent to those skilled in the art from the contents described below.

Figure 1 is a block diagram of a computing device for data augmentation associated with a target protein according to an embodiment of the present disclosure.

Figure 2 is a diagram schematically showing a data augmentation method associated with a target protein according to an embodiment of the present disclosure.

Figure 3 is a first experiment result showing the effect of a data augmentation method for predicting drug-target affinity according to an embodiment of the present disclosure.

Figure 4 is a second experiment result showing the effect of the data augmentation method for predicting drug-target affinity according to an embodiment of the present disclosure.

Figure 5 is a flowchart showing a data augmentation method associated with a target protein according to an embodiment of the present disclosure.

Figure 6 is a conceptual diagram showing a neural network according to an embodiment of the present disclosure.

Figure 7 is a block diagram of a computing device according to an embodiment of the present disclosure.

Various embodiments are now described with reference to the drawings. In this specification, various descriptions are presented to provide an understanding of the disclosure. However, it is clear that these embodiments may be practiced without these specific descriptions.

As used herein, the terms “component,” “module,” “system,” and the like refer to a computer-related entity, hardware, firmware, software, a combination of software and hardware, or an implementation of software. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, a thread of execution, a program, and/or a computer. For example, both an application running on a computing device and the computing device can be a component. One or more components may reside within a processor and/or thread of execution. A component may be localized within one computer. A component may be distributed between two or more computers. Additionally, these components can execute from various computer-readable media having various data structures stored thereon. Components may communicate through local and/or remote processes, for example, according to a signal having one or more data packets (e.g., data from one component interacting with another component in a local system or a distributed system, and/or data transmitted by way of the signal to other systems over a network such as the Internet).

Additionally, the term “or” is intended to mean an inclusive “or” and not an exclusive “or.” That is, unless otherwise specified or clear from context, “X uses A or B” is intended to mean any of the natural inclusive substitutions. In other words, “X uses A or B” applies to any of the following cases: X uses A; X uses B; or X uses both A and B. Additionally, the term “and/or” as used herein should be understood to refer to and include all possible combinations of one or more of the related listed items.

Additionally, the terms “comprise” and/or “comprising” should be understood to mean that the corresponding feature and/or element is present. However, the terms “comprise” and/or “comprising” should be understood as not excluding the presence or addition of one or more other features, elements and/or groups thereof. Additionally, unless otherwise specified or the context is clear to indicate a singular form, the singular terms herein and in the claims should generally be construed to mean “one or more.”

And, the term “at least one of A or B” should be interpreted to mean “a case containing only A,” “a case containing only B,” and “a case of combining A and B.”

Those skilled in the art will additionally recognize that the various illustrative logical blocks, components, modules, circuits, means, logic, and algorithm steps described in connection with the embodiments disclosed herein may be implemented using electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, blocks, configurations, means, logics, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented in hardware or software will depend on the specific application and design constraints imposed on the overall system. A skilled technician can implement the described functionality in a variety of ways for each specific application. However, such implementation decisions should not be construed as causing a departure from the scope of the present disclosure.

The description of the presented embodiments is provided to enable anyone skilled in the art to use or practice the present invention. Various modifications to these embodiments will be apparent to those skilled in the art. The general principles defined herein may be applied to other embodiments without departing from the scope of the disclosure. Therefore, the present invention is not limited to the embodiments presented herein. The present invention is to be interpreted in the broadest scope consistent with the principles and novel features presented herein.

In this disclosure, network function, artificial neural network, and neural network may be used interchangeably.

In the present disclosure, an “indicator” associated with a target protein is a concept representing various information related to the characteristics of the target protein. Indicators associated with a target protein can represent a variety of information, including information about the properties of the target protein itself, information about the properties between the target protein and other substances, and information about substances closely related to the target protein, and can be expressed in a variety of ways, including as an index, a scale, or a measured value. As an example, an indicator associated with the target protein may include “affinity” between the target protein and a compound.

Additionally, in the present disclosure, “affinity” is a concept representing various associations (e.g., binding potential, relevance, correlation, reactivity, interaction, etc.) between a biological target and a compound, and can be determined based on various indices, scales, or measurements. For example, affinity can be determined based on binding affinity, half maximal inhibitory concentration (IC50), half maximal effective concentration (EC50), half activity concentration (AC50), etc.

In addition, binding affinity, which may be included in affinity in the present disclosure, may refer to the bonding strength between a plurality of reversibly bound molecules, and is a type of measure of the degree of reaction between a biological target and a compound. It is generally known that a compound with a higher binding affinity has a higher probability of binding specifically and selectively to a biological target. Additionally, binding affinity may be the strength of the binding action between a protein or DNA and a drug or inhibitor. Additionally, binding affinity can be measured based on, for example, the equilibrium dissociation constant (K_D). In this case, the smaller the K_D value, the higher the binding affinity of the drug or inhibitor to the biological target. Conversely, the larger the K_D value, the lower the binding affinity of the drug or inhibitor to the biological target. Additionally, binding affinity can be affected by non-covalent intermolecular interactions between two molecules, such as hydrogen bonds, electrostatic interactions, and hydrophobicity. In addition, binding affinity can be obtained by measuring physical samples using experiments and measuring devices, or by consulting a database in which measured values are stored. Meanwhile, various methods other than those based on the equilibrium dissociation constant (K_D) can be used to measure binding affinity, and the present disclosure encompasses various methods for measuring binding affinity.
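The inverse relationship between K_D and binding affinity described above is often made explicit by working on a negative logarithmic scale. The following is a minimal sketch of that conversion; the log transform, the function name `pkd_from_kd`, and the example values are assumptions for illustration and are not taken from the present disclosure.

```python
import math


def pkd_from_kd(kd_molar: float) -> float:
    """Return pK_D = -log10(K_D), with K_D expressed in mol/L.

    A smaller K_D (stronger binding) gives a larger pK_D.
    """
    if kd_molar <= 0:
        raise ValueError("K_D must be positive")
    return -math.log10(kd_molar)


# Hypothetical K_D values in nanomolar, converted to mol/L before the transform.
kd_values_nm = {"compound_1": 1.0, "compound_2": 1000.0}
pkd = {name: pkd_from_kd(kd * 1e-9) for name, kd in kd_values_nm.items()}
```

On this scale, the compound with the smaller K_D receives the larger value, so "higher is stronger" holds for the transformed quantity.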

Meanwhile, as a more specific example, “affinity” in the present disclosure may refer to the binding force, catalytic rate, substrate specificity, chemical selectivity, receptor agonism, or receptor antagonism that acts between a drug and a target substance. In addition, here, the target substance may be a protein such as a receptor, and the drug may interact with the binding site of the target substance to act as a ligand that can form a stable complex between the drug and the target substance. The complex may include coenzymes or cofactors such as metal ions in addition to the drug-target substance. Additionally, the ligand may be a small molecule that can non-covalently bind to a target biomolecule for pharmacological purposes, or may be a biomolecule such as a nucleotide polymer, peptide, or antibody. Hereinafter, a method of data augmentation for drug-target affinity prediction by the computing device according to the present disclosure will be described through FIGS. 1 to 7.

The configuration of the computing device 100 shown in FIG. 1 is only a simplified example. In one embodiment of the present disclosure, the computing device 100 may include different components for performing the computing environment of the computing device 100, and only some of the disclosed components may configure the computing device 100.

The computing device 100 may include a processor 110, a memory 130, and a network unit 150.

The processor 110 may be composed of one or more cores, and may include a processor for data analysis and deep learning, such as a central processing unit (CPU), a general purpose graphics processing unit (GPGPU), or a tensor processing unit (TPU) of a computing device. The processor 110 may read a computer program stored in the memory 130 and perform data processing for machine learning according to an embodiment of the present disclosure. According to an embodiment of the present disclosure, the processor 110 may perform operations for learning a neural network. The processor 110 may perform calculations for learning a neural network, such as processing input data for learning in deep learning (DL), extracting features from input data, calculating errors, and updating the weights of the neural network using backpropagation. At least one of the CPU, GPGPU, and TPU of the processor 110 may process learning of the network function. For example, the CPU and GPGPU can work together to process learning of network functions and data classification using network functions. Additionally, in one embodiment of the present disclosure, the processors of a plurality of computing devices can be used together to process learning of network functions and data classification using network functions. Additionally, a computer program executed in a computing device according to an embodiment of the present disclosure may be a CPU, GPGPU, or TPU executable program.

According to an embodiment of the present disclosure, the data augmentation method associated with the target protein performed in the processor 110 may be understood as a type of weakly supervised learning. Weakly supervised learning refers to training a machine learning model with labels that are inaccurate in some respect. For example, labels obtained through crowdsourcing may be used. According to one embodiment of the present disclosure, the processor 110 augments the learning data by matching the affinity information of the “target protein” for a given drug to a “homologous protein” of the target protein. This augmentation of learning data can be understood as a type of weakly supervised learning in the sense that, from the standpoint of the homologous protein, the relatively inaccurate affinity information of another protein is matched instead of the protein's own exact affinity information.

Additionally, the processor 110 may acquire the target protein included in the learning data and index information associated with the target protein, and identify a homologous protein of the target protein. Thereafter, the processor 110 may augment the learning data by matching index information associated with the target protein with the homologous protein. At this time, the indicator information associated with the target protein may include information on the affinity of the target protein for the drug. For reference, the processor 110 may augment the learning data by using any indicator that can be assumed to have similar values when the protein structures are similar.

Meanwhile, the processor 110 uses homologous proteins to partially solve the problem that there are not many types of proteins that can be used as learning data. For example, if target protein A in the learning data has high affinity with drug X, protein B, which has a similar sequence pattern to target protein A, is also likely to have high affinity with drug X. This is because protein B is likely to share the three-dimensional structure through which target protein A binds to drug X. Therefore, the processor 110 can expand (augment) the data by including homologous proteins in the learning data set used when training a pre-trained deep learning model (e.g., a DTI deep learning model). Specifically, the processor 110 may perform learning assuming that if the affinity value of (A, X) is a, the affinity value of (B, X) is also a.

Additionally, in relation to expanding data using homologous proteins, a problem may arise where the site where target protein A binds to drug X is omitted or appears in a modified form in homologous protein B. In that case, the index (e.g., affinity) of homologous protein B and drug X cannot be estimated from the index of target protein A and drug X. Therefore, it is necessary to properly filter the training data set expanded by the method described above rather than using it as is. To solve this problem, the processor according to an embodiment of the present disclosure can filter some of the data augmented using homologous proteins. This filtering operation can further improve the performance of data augmentation by preventing data for which relatively high accuracy cannot be guaranteed from being included in the learning data.

According to an embodiment of the present disclosure, the memory 130 may store any type of information generated or determined by the processor 110 and any type of information received by the network unit 150.

According to an embodiment of the present disclosure, the memory 130 is a flash memory type, hard disk type, multimedia card micro type, or card type memory (e.g. (e.g. SD or -Only Memory), and may include at least one type of storage medium among magnetic memory, magnetic disk, and optical disk. The computing device 100 may operate in connection with web storage that performs a storage function of the memory 130 on the Internet. The description of the memory described above is merely an example, and the present disclosure is not limited thereto.

The network unit 150 according to an embodiment of the present disclosure may use a variety of wired communication systems, such as Public Switched Telephone Network (PSTN), x Digital Subscriber Line (xDSL), Rate Adaptive DSL (RADSL), Multi Rate DSL (MDSL), Very High Speed DSL (VDSL), Universal Asymmetric DSL (UADSL), High Bit Rate DSL (HDSL), and Local Area Network (LAN).

In addition, the network unit 150 presented in this specification may use a variety of wireless communication systems, such as Code Division Multi Access (CDMA), Time Division Multi Access (TDMA), Frequency Division Multi Access (FDMA), Orthogonal Frequency Division Multi Access (OFDMA), Single Carrier-FDMA (SC-FDMA), and other systems.

In the present disclosure, the network unit 150 may be configured regardless of communication mode, such as wired or wireless, and may be composed of various communication networks such as a local area network (LAN), a personal area network (PAN), or a wide area network (WAN). Additionally, the network may be the well-known World Wide Web (WWW), and may also use wireless transmission technology used for short-distance communication, such as Infrared Data Association (IrDA) or Bluetooth.

The techniques described herein can be used in the networks mentioned above, as well as other networks.

According to an embodiment of the present disclosure, the processor 110 may acquire a target protein included in learning data and indicator information associated with the target protein. At this time, the indicator information associated with the target protein may include various indicators whose values can be assumed to be similar when protein structures are similar. As an example, indicator information associated with a target protein may include information about the affinity of the target protein for a drug. Additionally, the processor 110 may identify a homologous protein of the target protein, and augment the learning data by matching the indicator information associated with the target protein with the homologous protein. That is, the processor 110 can augment the learning data by matching various indicator information associated with the target protein with homologous proteins of the target protein.

For example, referring to FIG. 2, the target protein (A) and the drug (X) have a known interaction (①) with each other. For reference, since the interaction (①) between the target protein (A) and the drug (X) is known, the indicator associated with the target protein (A) (e.g., affinity information (a)) is already known. In addition, the target protein (A) and the homologous protein (B) are proteins with similar sequence patterns and have a homology (③) relationship with each other. For reference, “similar” in the present disclosure may mean having a sequence identity greater than a preset ratio. In other words, the target protein (A) and the homologous protein (B) are proteins whose sequence patterns have a sequence identity greater than a preset ratio and are homologous to each other. In addition, since the target protein (A) and the homologous protein (B) are proteins with similar sequence patterns, the affinity value for the homologous protein (B) and the drug (X) can be estimated using the affinity information (a), so the homologous protein (B) and the drug (X) have a potential interaction (②) relationship with each other.

According to one embodiment of the present disclosure, the processor 110 may acquire the target protein (A) included in the learning data and the affinity information of the target protein for the drug (X). Affinity information may include information about the binding force or force (strength) acting between the target protein (A) and the drug (X), and may include various types of information in addition to this information. For example, affinity information may include various types of information, including K_D (equilibrium dissociation constant), K_i, IC50 (half maximal inhibitory concentration), EC50 (half maximal effective concentration), AC50 (half activity concentration), etc.

According to one embodiment of the present disclosure, the processor 110 may identify a homologous protein (B) of the target protein (A). Additionally, the processor 110 may perform multiple sequence alignment (MSA) on the target protein and a plurality of homologous proteins. For example, the processor 110 may search a database for proteins homologous to the target protein (A) and perform multiple sequence alignment (MSA). Multiple sequence alignment has been used in approaches to protein structure prediction problems, such as methods using evolutionary correlations and template-based methods. In addition, this structure prediction method using homologous proteins has also been applied to the problem of drug-target interaction prediction. The processor 110 may use this multiple sequence alignment to identify proteins homologous to the target protein (A) among a plurality of proteins. By way of example, the processor 110 may perform homologous structure search and multiple sequence alignment (MSA) using the HHBlits algorithm, which is based on a hidden Markov model, but is not limited to this, and algorithms that have already been developed or that are developed in the future may also be applied. Additionally, the processor 110 may perform a search for a plurality of homologous proteins that satisfy a preset identity ratio with the target protein. For example, the processor 110 may use the HHBlits algorithm for homologous protein search by setting the minimum match ratio to 70%. The processor 110 may determine the preset ratio by finding a balance between the diversity and the accuracy of data expansion. The preset ratio may be a number determined through the prediction performance of a pre-trained deep learning model. However, the preset ratio is only an example and is not limited thereto.
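The identity-ratio criterion can be illustrated with a short sketch that operates on rows of an already computed alignment. This is only an illustration under stated assumptions: the function names, the example sequences, and the choice to compute identity over columns where both rows have a residue are hypothetical, and the sketch does not reproduce how HHBlits itself computes or applies its threshold.

```python
from typing import Dict, List


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def select_homologs(target_row: str,
                    candidate_rows: Dict[str, str],
                    min_identity: float = 70.0) -> List[str]:
    """Keep candidates whose identity to the target meets the preset ratio."""
    return [name for name, row in candidate_rows.items()
            if percent_identity(target_row, row) >= min_identity]


# Hypothetical alignment rows ("-" marks a gap).
target = "MKTAYIAKQR"
candidates = {
    "hom_1": "MKTAYIAKQK",  # 9 of 10 columns match
    "hom_2": "MKT-YLGRQR",  # 6 of 9 comparable columns match
    "hom_3": "AAAAAAAAAA",  # 2 of 10 columns match
}
selected = select_homologs(target, candidates, min_identity=70.0)
```

Raising `min_identity` trades diversity of the augmented data for accuracy, which is the balance the preset ratio is meant to strike.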

According to one embodiment of the present disclosure, the processor 110 can augment (expand) the learning data by matching the affinity information of the target protein with the homologous protein. The processor 110 can augment (expand) the learning data using homologous proteins to partially solve the problem that there are not many types of proteins that can be used as learning data. For example, if target protein A in the learning data has high affinity with drug X, protein B, which has a similar sequence pattern to target protein A, is also likely to have high affinity with drug X. This is because protein B is likely to share the three-dimensional structure through which target protein A binds to drug X. Therefore, the processor 110 can expand (augment) the data by including homologous proteins in the learning data set used when training a pre-trained deep learning model (e.g., a DTI deep learning model). For example, if the affinity value of (A, X) is a, the processor 110 may learn by assuming that the affinity value of (B, X) is also a.
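The matching step can be sketched as follows: each labeled sample for a target protein is copied to each of its homologs, keeping the drug and the affinity label. The `Sample` type, the `augmented` flag, and the function name are hypothetical and introduced only for illustration; the flag is an assumption that makes later filtering of augmented samples possible.

```python
from typing import Dict, List, NamedTuple


class Sample(NamedTuple):
    protein: str      # protein identifier (or sequence)
    drug: str         # drug identifier (or SMILES string)
    affinity: float   # affinity label
    augmented: bool   # True if the label was borrowed from a homologous target


def augment_with_homologs(samples: List[Sample],
                          homologs: Dict[str, List[str]]) -> List[Sample]:
    """For each (A, X, a), add (B, X, a) for every homolog B of A."""
    out = list(samples)
    for s in samples:
        for h in homologs.get(s.protein, []):
            out.append(Sample(h, s.drug, s.affinity, True))
    return out


# Hypothetical data: target A with measured affinity 7.2 for drug X.
train = [Sample("A", "X", 7.2, False)]
homolog_map = {"A": ["B", "C"]}
augmented_train = augment_with_homologs(train, homolog_map)
```

The original samples are kept unchanged at the front of the list, so the augmented samples can be down-weighted or filtered without touching measured labels.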

According to an embodiment of the present disclosure, the processor 110 may filter the augmented learning data by considering the affinity information of the target protein for the drug and the affinity information for the drug of the homologous protein. The processor 110 may perform filtering to solve problems with the previously augmented learning data. A problem may arise in which the site where the target protein (A) binds to the drug (X) is omitted or appears in a modified form in the homologous protein (B). In that case, the affinity between the homologous protein (B) and the drug (X) cannot be estimated from the affinity between the target protein (A) and the drug (X). Therefore, it is necessary to properly filter the training data set expanded by the method described above rather than using it as is. Accordingly, the processor 110 may use (B, X, a) for learning only when the affinity of the homologous protein (B) and the drug (X) predicted by the deep learning model being trained is relatively close to the affinity a of the target protein (A) and the drug (X) given in the data set.

According to one embodiment of the present disclosure, the processor 110 may compare, within the batch currently being learned, “the affinity information for the drug of the target protein predicted by the deep learning model” with “the affinity information for the drug of the homologous protein predicted by the deep learning model.” Alternatively, if the affinity information for the drug of the target protein is already included in the learning data, the processor 110 may compare, within the batch currently being learned, “the affinity information for the drug of the target protein (in the learning data)” with “the affinity information for the drug of the homologous protein predicted by the deep learning model.” Additionally, the processor 110 may filter for data regarding homologous proteins that have an accuracy higher than a certain rank (e.g., higher than the median value) among the accuracy values of the homologous proteins in the batch, so that this data is used for learning. The processor 110 can improve the accuracy of learning by filtering for data about homologous proteins that have an accuracy higher than a certain rank. Here, the accuracy values can be generated based on a comparison between the affinity information for the drug of the homologous protein predicted by the deep learning model and the affinity information for the drug of the target protein given from the learning data or predicted by the deep learning model.
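The rank-based filtering within a batch can be sketched as below. The sketch assumes that "accuracy" is measured by the absolute difference between the predicted affinity of the homologous pair and the reference affinity of the target pair (smaller difference meaning higher accuracy), and that the median is the cutoff; both choices, along with the function name and example values, are illustrative assumptions rather than the disclosed implementation.

```python
from statistics import median
from typing import List, Sequence


def filter_augmented_batch(pred_homolog: Sequence[float],
                           reference: Sequence[float]) -> List[bool]:
    """Return a keep-mask over the augmented samples of one batch.

    pred_homolog[i]: model-predicted affinity for homolog pair (B_i, X_i).
    reference[i]:    affinity of the matching target pair (A_i, X_i), either
                     taken from the learning data or predicted by the model.
    """
    errors = [abs(p - r) for p, r in zip(pred_homolog, reference)]
    cutoff = median(errors)
    # Keep samples whose error is at or below the batch median.
    return [e <= cutoff for e in errors]


# Hypothetical batch of four augmented samples.
predicted = [7.0, 5.0, 6.9, 9.5]
labels = [7.1, 7.0, 7.0, 7.0]
keep = filter_augmented_batch(predicted, labels)
```

Only the samples flagged `True` would contribute to the loss for this batch; the remaining augmented samples are skipped.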

According to an embodiment of the present disclosure, the processor 110 may perform filtering on the augmented learning data, and may start the filtering partway through the learning process of the deep learning model currently being trained. In other words, the processor 110 does not filter the augmented learning data from the start of learning, but begins the filtering operation at some point after learning has begun (that is, after learning has progressed to a certain extent). This is because the prediction accuracy of the deep learning model cannot be sufficiently guaranteed at the beginning of learning, so it is preferable not to perform filtering operations that rely on the deep learning model from the start of learning up to a certain point, and to perform them only after that point, once learning has progressed far enough that the accuracy of the model's predictions can be guaranteed to some extent. For example, during the learning process of the deep learning model currently being trained, the processor 110 may refrain from filtering the augmented learning data from the start of learning until a preset epoch (e.g., until epoch 850), and may perform the filtering operation after the preset epoch. Through this filtering operation, data that may reduce prediction accuracy can be removed from the augmented learning data, so the final learning performance of the deep learning model can be further improved.
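The filtering described above can be illustrated with a minimal Python sketch. It is not the implementation of the disclosure: the function name `select_augmented_samples`, the `predict` callable, and the use of absolute error as the accuracy measure are all assumptions made for illustration. Only the median cutoff and the preset epoch of 850 come from the examples given in the text.

```python
from statistics import median


def select_augmented_samples(batch, predict, epoch, filter_start_epoch=850):
    """Return the augmented samples of one batch that may be used for learning.

    batch: list of (homolog, drug, affinity_a) triples, where affinity_a is the
           affinity of the original target protein for the drug.
    predict: hypothetical callable (homolog, drug) -> affinity predicted by the
             model that is currently being trained.
    """
    # Before the preset epoch the model is not yet reliable, so nothing is filtered.
    if epoch < filter_start_epoch:
        return list(batch)
    if not batch:
        return []
    # Assumption: accuracy is measured as closeness of the prediction to affinity_a,
    # so a smaller absolute error means a higher accuracy.
    errors = [abs(predict(homolog, drug) - a) for homolog, drug, a in batch]
    cutoff = median(errors)
    # Keep the samples whose error is at or below the median of the batch.
    return [sample for sample, err in zip(batch, errors) if err <= cutoff]
```

With a batch of three homologs whose predicted affinities deviate from a by about 0.1, 4.0, and 0.4, the call returns all three samples before epoch 850 and only the first and third afterwards.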

Figure 3 is a first experiment result showing the effect of the data augmentation method for predicting drug-target affinity according to an embodiment of the present disclosure, and Figure 4 is a second experiment result showing the effect of the data augmentation method for predicting drug-target affinity according to an embodiment of the present disclosure. Figures 3 and 4 show performance evaluation results for all performance measurement items (e.g., MSE, CI, AUPR) based on different data sets. As performance evaluation indicators, the mean squared error (MSE), the concordance index (CI), and the area under the precision-recall curve (AUPR) were used.

As an example, referring to FIG. 3, the first curve (original split) is the learning curve of the MT-DTI (Molecule Transformer Drug Target Interaction) model on the existing KIBA data set. The second curve (New split) is the learning curve of MT-DTI on the KIBA data set divided so that only new proteins appear during testing (hereinafter “new data set”). Comparing the learning curve of the new split with that of the original split shows that the ability of a deep learning model trained on limited data to generalize to new targets is quite limited.

Referring to FIG. 4 as an example, the third curve (original MT-DTI) is the learning curve of MT-DTI on the new data set, and is the same as the second curve (New split) in FIG. 3. The fourth curve (MT-DTI with MSA) is the learning curve when multiple sequence alignment (MSA) information is additionally used on the new data set. In other words, the fourth curve is the learning curve obtained when the learning data is augmented using homologous proteins by the data augmentation method for predicting drug-target affinity according to an embodiment of the present disclosure (before filtering is applied). The fifth curve (MT-DTI with MSA filtered from 850 epochs) is the learning curve when MSA information is additionally used on the new data set and filtering is also performed. In other words, the fifth curve is the learning curve obtained when the learning data is augmented using homologous proteins and the filtering operation is additionally applied, according to an embodiment of the present disclosure. For reference, filtering was applied starting from epoch 850. The fifth curve shows excellent performance on all performance measurement items (e.g., MSE, CI, AUPR). Each learning curve in FIG. 4 is the average of five results, each obtained by training on a different union of four of the five training folds. The smaller the MSE and the larger the CI and AUPR, the better the performance. That is, as shown in FIG. 4, learning performance was improved by “using the learning data augmented based on homologous proteins” and, additionally, by “performing filtering in consideration of the affinity information of the target protein for the drug and the affinity information of the homologous protein for the drug.”

According to an embodiment of the present disclosure, using the existing KIBA data set (e.g., a data set related to binding affinity), the same deep learning model may be trained in the following three ways, and the performance of the trained models on the test set may be compared using MSE, CI, AUPR, and the like. The three methods may include: ① learning from the existing KIBA data set; ② learning with element 1 applied (e.g., data set expansion using MSA); and ③ learning with element 1 (e.g., data set expansion using MSA) and element 2 (e.g., filtering of the expanded data set) applied. A hold-out cross-validation method may be used for verification, but verification is not limited thereto. For example, when using hold-out cross-validation, the entire data set can be divided into one test set and k training sets. More specifically, when k=3, the following can be performed: (1) “train with train set 2 and train set 3 => evaluate with the test set”; (2) “train with train set 1 and train set 3 => evaluate with the test set”; and (3) “train with train set 1 and train set 2 => evaluate with the test set.” In this case, the artificial intelligence model (deep learning model) or learning method can be evaluated using the average of the scores from the evaluations (1), (2), ..., (k) (for reference, Figure 4 uses k=5). If the cross-validation result shows that “the performance of method ② is better than the performance of method ①,” it can be determined that element 1 is effective. Additionally, if “the performance of method ③ is better than the performance of method ②,” it can be determined that element 2 is effective.
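The hold-out cross-validation scheme described above (one fixed test set, k training sets, one training set left out per run, scores averaged) can be sketched as follows. The function name and the `train_and_score` callable are hypothetical placeholders; the sketch only shows how the k runs are assembled and averaged.

```python
def holdout_cross_validation(train_folds, test_set, train_and_score):
    """Average the test-set score over k runs, each leaving out one training fold.

    train_folds: list of k lists of samples (the k training sets).
    test_set: the single held-out test set shared by every run.
    train_and_score: hypothetical callable (train_samples, test_set) -> float that
                     trains a fresh model and returns its score on the test set.
    """
    scores = []
    for left_out in range(len(train_folds)):
        # Union of every training fold except the one left out in this run.
        train_samples = [
            sample
            for index, fold in enumerate(train_folds)
            if index != left_out
            for sample in fold
        ]
        scores.append(train_and_score(train_samples, test_set))
    return sum(scores) / len(scores)
```

Running the same routine once per learning method (①, ②, ③) and comparing the averaged scores gives the comparison described in the text.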

Below, we will briefly look at the operation flow of the present application based on the details described above.

The data augmentation method associated with the target protein shown in FIG. 5 can be performed by the computing device 100 described above. Therefore, even if the content is omitted below, the information described with respect to the computing device 100 can be equally applied to the description of the data augmentation method associated with the target protein.

The computing device 100 according to an embodiment of the present disclosure may acquire the target protein included in the learning data and indicator information associated with the target protein (S110). For example, an indicator associated with a target protein may include any information whose value can be inferred to be similar for a protein whose structure is similar to that of the target protein. Additionally, the indicator information associated with the target protein may include information on the affinity of the target protein for the drug.

The computing device 100 according to an embodiment of the present disclosure may identify a homologous protein of the target protein (S120). Here, the computing device 100 may perform multiple sequence alignment (MSA) on the target protein and a plurality of homologous proteins. Additionally, the multiple sequence alignment can be performed by searching for a plurality of homologous proteins that satisfy a preset identity ratio with respect to the target protein.
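As a rough illustration of selecting homologs by a preset identity ratio, the sketch below scores already-aligned sequences. The text does not define how the identity ratio is computed, so the definition used here (matching non-gap columns divided by alignment length), the gap character `-`, the threshold value, and both function names are assumptions. A real pipeline would obtain the alignment from an MSA tool.

```python
def identity_ratio(aligned_a, aligned_b):
    """Fraction of alignment columns in which both sequences carry the same residue.

    Assumption: both inputs are rows of the same alignment, with '-' marking gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return matches / len(aligned_a)


def select_homologs(target_aligned, candidates, min_identity=0.3):
    """Return the names of candidate proteins meeting the preset identity ratio.

    candidates: dict mapping a protein name to its aligned sequence.
    min_identity: illustrative threshold, not a value taken from the disclosure.
    """
    return [
        name
        for name, aligned in candidates.items()
        if identity_ratio(target_aligned, aligned) >= min_identity
    ]
```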

Additionally, the computing device 100 according to an embodiment of the present disclosure can augment the learning data by matching the indicator information associated with the target protein to the homologous proteins (S130). That is, the computing device 100 can augment the indicator data by assigning the indicator information associated with the target protein to each homologous protein.
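Step S130 can be pictured with the following minimal sketch, in which each (target protein, drug, affinity) sample is copied once per homolog with the affinity value carried over unchanged. The triple layout, the `homologs` mapping, and the function name are illustrative assumptions rather than the data format of the disclosure.

```python
def augment_training_data(training_data, homologs):
    """Add one (homolog, drug, affinity) sample per homolog of each target protein.

    training_data: list of (target_protein, drug, affinity) triples.
    homologs: dict mapping a target protein to a list of its homologous proteins.
    """
    augmented = list(training_data)  # the original samples are kept
    for target, drug, affinity in training_data:
        for homolog in homologs.get(target, []):
            # The homolog inherits the indicator value measured for the target.
            augmented.append((homolog, drug, affinity))
    return augmented
```

Because the inherited labels are only approximately correct, such samples act as weak supervision, which is why the subsequent filtering step matters.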

Meanwhile, the computing device 100 may filter the augmented learning data by considering the affinity information of the target protein for the drug and the affinity information of the homologous protein for the drug. In the batch currently being learned, the computing device 100 can calculate accuracy information by comparing the affinity information of the target protein for the drug, which is either given in the learning data or predicted by the deep learning model, with the affinity information of the homologous protein for the drug predicted by the deep learning model, and can use the calculated accuracy information for filtering. Additionally, the computing device 100 may filter the data so that data regarding homologous proteins whose accuracy is higher than a certain rank among the accuracy values of the homologous proteins in the batch is selected. Additionally, the computing device 100 may start filtering the augmented learning data partway through the learning process of the deep learning model currently being trained.

In the above description, steps S110 to S130 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present disclosure. Additionally, some steps may be omitted or the order between steps may be changed as needed.

Meanwhile, as discussed above, the present disclosure utilizes a weakly supervised learning method based on data augmentation that assigns the affinity information of the target protein for a given drug to the homologous proteins of the target protein. Rather than performing the filtering operation from the beginning, it performs the filtering operation on the augmented data from a stage after the middle of learning (e.g., after 800 epochs), and can thereby improve the learning performance of the neural network model for predicting drug-target affinity.

In addition, the present disclosure discussed above can augment learning data and improve the learning performance of a neural network model without the need to compare or project the structures of homologous proteins. Accordingly, the present disclosure can improve learning performance while preventing the excessive consumption of learning resources that may be caused by analyzing the structures of homologous proteins, mutually projecting structures, analyzing ghost structures, and the like in the process of augmenting learning data.

Figure 6 is a schematic diagram showing a network function according to an embodiment of the present disclosure.

Throughout this specification, computational model, neural network, and network function may be used interchangeably. A neural network can generally consist of a set of interconnected computational units, which can be referred to as nodes. These nodes may also be referred to as neurons. A neural network consists of at least one node. Nodes (or neurons) that make up neural networks may be interconnected by one or more links.

Within a neural network, one or more nodes connected through a link may form a relative input node and output node relationship. The concepts of input node and output node are relative, and any node in an output node relationship with one node may be in an input node relationship with another node, and vice versa. As described above, input node to output node relationships can be created around links. One or more output nodes can be connected to one input node through a link, and vice versa.

In a relationship between an input node and an output node connected through one link, the value of the data of the output node may be determined based on the data input to the input node. Here, the link connecting the input node and the output node may have a weight. Weights may be variable and may be varied by the user or an algorithm in order for the neural network to perform the desired function. For example, when one or more input nodes are connected to one output node by respective links, the value of the output node can be determined based on the values input to the input nodes connected to the output node and the weights set on the links corresponding to the respective input nodes.

As described above, in a neural network, one or more nodes are interconnected through one or more links to form input node and output node relationships within the neural network. The characteristics of the neural network may be determined according to the number of nodes and links within the neural network, the correlation between the nodes and links, and the value of the weight assigned to each link. For example, if two neural networks have the same number of nodes and links but different weight values for the links, the two neural networks may be recognized as different from each other.

A neural network may consist of a set of one or more nodes. A subset of the nodes that make up a neural network can form a layer. Some of the nodes constituting the neural network may form one layer based on their distances from the initial input node. For example, the set of nodes at distance n from the initial input node may constitute layer n. The distance from the initial input node can be defined as the minimum number of links that must be traversed to reach the node from the initial input node. However, this definition of a layer is arbitrary and given for explanation purposes, and the order of a layer within a neural network may be defined in a different way than described above. For example, a layer of nodes may be defined by distance from the final output node.
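The layer definition above (the minimum number of links from an initial input node) corresponds to a breadth-first search. The sketch below assumes the network is given as a dictionary of directed links from each node to its output nodes; that representation and the function name are illustrative choices.

```python
from collections import deque


def assign_layers(links, initial_input_nodes):
    """Map each reachable node to its layer index.

    links: dict mapping a node to the list of nodes it feeds (its output nodes).
    initial_input_nodes: nodes that receive data directly; they form layer 0.
    The layer index is the minimum number of links from any initial input node.
    """
    layer = {node: 0 for node in initial_input_nodes}
    queue = deque(initial_input_nodes)
    while queue:
        node = queue.popleft()
        for successor in links.get(node, []):
            # Breadth-first order guarantees the first visit uses the fewest links.
            if successor not in layer:
                layer[successor] = layer[node] + 1
                queue.append(successor)
    return layer
```

For a network with inputs `x1` and `x2`, hidden nodes `h1` and `h2`, and output `y`, the inputs land in layer 0, the hidden nodes in layer 1, and the output in layer 2.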

The initial input node may refer to one or more nodes in the neural network into which data is input directly, without passing through links in relationships with other nodes. Alternatively, in the link-based relationships between nodes within a neural network, it may refer to nodes that do not have other input nodes connected by links. Similarly, the final output node may refer to one or more nodes that do not have an output node in their relationships with other nodes among the nodes in the neural network. Additionally, hidden nodes may refer to the nodes constituting a neural network other than the initial input node and the final output node.

The neural network according to an embodiment of the present disclosure may be a neural network in which the number of nodes in the input layer is the same as the number of nodes in the output layer, and the number of nodes decreases and then increases again as it progresses from the input layer to the hidden layers. In addition, the neural network according to another embodiment of the present disclosure may be a neural network in which the number of nodes in the input layer is less than the number of nodes in the output layer, and the number of nodes decreases as it progresses from the input layer to the hidden layers. In addition, the neural network according to another embodiment of the present disclosure may be a neural network in which the number of nodes in the input layer is greater than the number of nodes in the output layer, and the number of nodes increases as it progresses from the input layer to the hidden layers. A neural network according to another embodiment of the present disclosure may be a neural network that is a combination of the above-described neural networks.

A deep neural network (DNN) may refer to a neural network that includes multiple hidden layers in addition to the input layer and output layer. Deep neural networks make it possible to identify latent structures in data. In other words, it is possible to identify the latent structure of a photo, text, video, voice, or music (e.g., what object is in the photo, what the content and emotion of the text are, what the content and emotion of the voice are, etc.). Deep neural networks may include convolutional neural networks (CNN), recurrent neural networks (RNN), autoencoders, generative adversarial networks (GAN), restricted Boltzmann machines (RBM), deep belief networks (DBN), Q networks, U networks, Siamese networks, and the like. The description of the deep neural network described above is only an example and the present disclosure is not limited thereto.

In one embodiment of the present disclosure, the network function may include an autoencoder. An autoencoder may be a type of artificial neural network that outputs data similar to its input data. The autoencoder may include at least one hidden layer, and an odd number of hidden layers may be placed between the input and output layers. The number of nodes in each layer may be reduced from the number of nodes in the input layer down to an intermediate layer called the bottleneck layer (encoding), and then expanded from the bottleneck layer to the output layer (which is symmetrical to the input layer), mirroring the reduction. Autoencoders can perform nonlinear dimensionality reduction. The number of nodes in the input and output layers may correspond to the dimension of the input data after preprocessing. In an autoencoder structure, the number of nodes in the hidden layers included in the encoder may decrease as the distance from the input layer increases. If the number of nodes in the bottleneck layer (the layer with the fewest nodes, located between the encoder and decoder) is too small, not enough information may be conveyed, so it may be kept at or above a certain number (e.g., at least half the number of nodes in the input layer).
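The symmetric layout described above can be made concrete with a small helper that mirrors the encoder's layer sizes to produce the decoder. This is a sketch of the sizing rule only, with a hypothetical function name and no actual network layers.

```python
def symmetric_autoencoder_sizes(input_dim, encoder_hidden, bottleneck_dim):
    """Return the node count of every layer, from input layer to output layer.

    encoder_hidden: node counts of the hidden layers between the input layer and
                    the bottleneck (may be empty).
    The decoder mirrors the encoder, so the output layer matches the input layer
    and the number of hidden layers is always odd.
    """
    encoder = [input_dim, *encoder_hidden, bottleneck_dim]
    # Node counts must shrink at every step toward the bottleneck.
    if any(wide <= narrow for wide, narrow in zip(encoder, encoder[1:])):
        raise ValueError("layer sizes must strictly decrease toward the bottleneck")
    # Mirror everything except the bottleneck itself.
    return encoder + encoder[-2::-1]
```

For example, `symmetric_autoencoder_sizes(8, [4], 2)` yields `[8, 4, 2, 4, 8]`: three hidden layers with the bottleneck in the middle.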

A neural network may be trained in at least one of supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning. Learning of a neural network may be a process of applying knowledge for the neural network to perform a specific operation to the neural network.

Neural networks can be trained to minimize output errors. Neural network learning is a process of repeatedly inputting learning data into the neural network, calculating the error between the output of the neural network and the target for the learning data, and backpropagating that error from the output layer of the neural network toward the input layer in the direction of reducing the error, thereby updating the weight of each node in the neural network. In the case of supervised learning, learning data in which each item is labeled with the correct answer is used (i.e., labeled learning data), and in the case of unsupervised learning, the correct answer may not be labeled in each item of learning data. For example, in the case of supervised learning on data classification, the training data may be data in which each item is labeled with a category. Labeled training data is input to the neural network, and the error can be calculated by comparing the output (category) of the neural network with the label of the training data. As another example, in the case of unsupervised learning on data classification, the error can be calculated by comparing the input training data with the neural network output. The calculated error is backpropagated in the reverse direction (i.e., from the output layer to the input layer) in the neural network, and the connection weight of each node in each layer of the neural network can be updated according to the backpropagation. The amount of change in the updated connection weight of each node may be determined according to the learning rate. The neural network's calculation on input data and the backpropagation of errors can constitute a learning cycle (epoch). The learning rate may be applied differently depending on the number of repetitions of the learning cycle of the neural network. For example, in the early stages of neural network training, a high learning rate can be used to increase efficiency by allowing the neural network to quickly achieve a certain level of performance, and in the later stages of training, a low learning rate can be used to increase accuracy.
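The idea of a learning rate that starts high and is lowered in later learning cycles can be shown on a toy problem. The sketch below is an illustration under stated assumptions, not a neural network: it fits a single weight by gradient descent on a squared error, and the step-decay schedule and all constants are arbitrary choices.

```python
def learning_rate(epoch, initial=0.1, decay=0.5, step=10):
    """Step decay: multiply the rate by `decay` after every `step` epochs."""
    return initial * decay ** (epoch // step)


def fit_single_weight(target=3.0, epochs=30):
    """Minimize (w - target)^2 by gradient descent with a decaying learning rate."""
    w = 0.0
    for epoch in range(epochs):
        gradient = 2.0 * (w - target)  # derivative of the squared error w.r.t. w
        # A larger rate early gives fast progress; a smaller rate later refines w.
        w -= learning_rate(epoch) * gradient
    return w
```

In a real network the same update is applied to every weight, with the gradients supplied by backpropagation.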

In the learning of neural networks, the training data can generally be a subset of real data (i.e., the data to be processed using the trained neural network), and thus there may be learning cycles in which the error on the training data decreases but the error on the real data increases. Overfitting is a phenomenon in which errors on actual data increase due to excessive learning on the training data. For example, a phenomenon in which a neural network that learned what a cat is by being shown yellow cats fails to recognize a non-yellow cat as a cat may be a type of overfitting. Overfitting can cause the errors of machine learning algorithms to increase. To prevent such overfitting, various optimization methods can be used, such as increasing the amount of learning data, regularization, dropout (disabling some of the network nodes during the learning process), and the use of a batch normalization layer.

Meanwhile, according to an embodiment of the present disclosure, a computer-readable medium storing a data structure is disclosed.

Data structure can refer to the organization, management, and storage of data to enable efficient access and modification of data. Data structure can refer to the organization of data to solve a specific problem (e.g., retrieving data, storing data, or modifying data in the shortest possible time). A data structure may be defined as a physical or logical relationship between data elements designed to support a specific data processing function. Logical relationships between data elements may include connection relationships between user-defined data elements. Physical relationships between data elements may include actual relationships between data elements that are physically stored in a computer-readable storage medium (e.g., a persistent storage device). A data structure may specifically include a set of data, relationships between data, and functions or instructions applicable to the data. Effectively designed data structures allow computing devices to perform computations while minimizing the use of the computing device's resources. Specifically, computing devices can increase the efficiency of operations, reading, insertion, deletion, comparison, exchange, and search through effectively designed data structures.

Data structures can be divided into linear data structures and non-linear data structures depending on the type of data structure. A linear data structure may be a structure in which only one piece of data is connected after another piece of data. Linear data structures may include a list, a stack, a queue, and a deque. A list can refer to a set of data that has an internal order. The list may include a linked list. A linked list may be a data structure in which each piece of data holds a pointer so that the data is connected in a single line. In a linked list, a pointer may contain connection information to the next or previous data. Depending on its form, a linked list can be a singly linked list, a doubly linked list, or a circular linked list. A stack may be a data listing structure that allows limited access to data. A stack can be a linear data structure in which data can be processed (for example, inserted or deleted) at only one end of the data structure. Data stored in a stack follows a last-in, first-out (LIFO) order, in which the later data enters, the sooner it comes out. A queue is also a data listing structure that allows limited access to data. Unlike a stack, it follows a first-in, first-out (FIFO) order, in which data stored later is released later. A deque can be a data structure that can process data at both ends of the data structure.
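The access rules of the stack, queue, and deque described above can be demonstrated with Python's built-in list and the standard-library `collections.deque`. The variable names are arbitrary, and the snippet only illustrates the insertion and removal order.

```python
from collections import deque

# Stack: insert and remove at the same end (last in, first out).
stack = []
stack.append("a")
stack.append("b")
stack.append("c")
last_in = stack.pop()  # "c", the most recently stored item

# Queue: insert at the back, remove from the front (first in, first out).
queue = deque()
queue.append("a")
queue.append("b")
queue.append("c")
first_in = queue.popleft()  # "a", the earliest stored item

# Deque: data can be processed at both ends.
both_ends = deque(["b"])
both_ends.appendleft("a")  # insert at the front
both_ends.append("c")      # insert at the back
front = both_ends.popleft()  # "a"
back = both_ends.pop()       # "c"
```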

A non-linear data structure may be a structure in which multiple pieces of data are connected behind one piece of data. Nonlinear data structures may include graph data structures. A graph data structure can be defined by vertices and edges, and an edge can include a line connecting two different vertices. Graph data structure may include a tree data structure. A tree data structure may be a data structure in which there is only one path connecting two different vertices among a plurality of vertices included in the tree. In other words, it may be a data structure that does not form a loop in the graph data structure.
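The tree property stated above (exactly one path between any two vertices, hence no loop) can be checked for an undirected graph as sketched below. The sketch relies on the standard fact that a graph with n vertices is a tree exactly when it is connected and has n - 1 edges. The function name and the edge-list representation are illustrative choices.

```python
def is_tree(vertices, edges):
    """Return True if the undirected graph (vertices, edges) is a tree.

    vertices: list of hashable vertex labels.
    edges: list of (u, v) pairs, each connecting two different vertices.
    """
    if len(edges) != len(vertices) - 1:
        return False  # too many edges implies a loop, too few implies disconnection
    adjacency = {vertex: [] for vertex in vertices}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    # Depth-first traversal from an arbitrary vertex to test connectivity.
    seen = set()
    pending = [vertices[0]]
    while pending:
        vertex = pending.pop()
        if vertex in seen:
            continue
        seen.add(vertex)
        pending.extend(adjacency[vertex])
    return len(seen) == len(vertices)
```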

Data structures may include neural networks. The data structure including the neural network may be stored in a computer-readable medium. The data structure including the neural network may also include data preprocessed for processing by the neural network, data input to the neural network, weights of the neural network, hyperparameters of the neural network, data acquired from the neural network, activation functions associated with each node or layer of the neural network, and a loss function for training the neural network. A data structure containing a neural network may include any of the components disclosed above. In other words, the data structure including the neural network may be composed of all or any combination of data preprocessed for processing by the neural network, data input to the neural network, weights of the neural network, hyperparameters of the neural network, data acquired from the neural network, activation functions associated with each node or layer of the neural network, and a loss function for training the neural network. In addition to the configurations described above, a data structure containing a neural network may include any other information that determines the characteristics of the neural network. Additionally, the data structure may include all types of data used or generated in the computational process of a neural network and is not limited to the above. Computer-readable media may include computer-readable recording media and/or computer-readable transmission media. A neural network can generally consist of a set of interconnected computational units, which can be referred to as nodes. These nodes may also be referred to as neurons. A neural network consists of at least one node.

The data structure may include data input to the neural network. A data structure containing data input to a neural network may be stored in a computer-readable medium. Data input to the neural network may include learning data input during the neural network learning process and/or input data input to the neural network on which training has been completed. Data input to the neural network may include data that has undergone pre-processing and/or data subject to pre-processing. Preprocessing may include a data processing process to input data into a neural network. Therefore, the data structure may include data subject to preprocessing and data generated by preprocessing. The above-described data structure is only an example and the present disclosure is not limited thereto.

The data structure may include the weights of the neural network. (In this specification, weights and parameters may be used with the same meaning.) The data structure including the weights of the neural network may be stored in a computer-readable medium. A neural network may include multiple weights. Weights may be variable and may be varied by the user or an algorithm in order for the neural network to perform the desired function. For example, when one or more input nodes are connected to one output node by respective links, the data value output from the output node can be determined based on the values input to the input nodes connected to the output node and the weights set on the links corresponding to the respective input nodes. The above-described data structure is only an example and the present disclosure is not limited thereto.

As an example and not a limitation, the weights may include weights that are changed during the neural network learning process and/or weights for which neural network learning has been completed. Weights that change during the neural network learning process may include weights that change at the start of the learning cycle and/or weights that change during the learning cycle. Weights for which neural network training has been completed may include weights for which a learning cycle has been completed. Therefore, a data structure including weights of a neural network may include weights that change during the neural network learning process and/or weights that have completed neural network learning. Therefore, the above-described weights and/or combinations of each weight are included in the data structure including the weights of the neural network. The above-described data structure is only an example and the present disclosure is not limited thereto.

The data structure including the weights of the neural network may be stored in a computer-readable storage medium (e.g., memory, hard disk) after going through a serialization process. Serialization can be the process of converting a data structure into a form that can be stored on the same or a different computing device and later reconstructed and used. Computing devices can transmit and receive data over a network by serializing data structures. A serialized data structure containing the weights of a neural network can be reconstructed on the same computing device or on a different computing device through deserialization. The data structure including the weights of the neural network is not limited to serialization. Furthermore, the data structure including the weights of the neural network may include a data structure for increasing computational efficiency while minimizing the use of computing device resources (e.g., among non-linear data structures, a B-Tree, Trie, m-way search tree, AVL tree, or Red-Black Tree). The foregoing is merely an example and the present disclosure is not limited thereto.
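Serialization and deserialization of a weight structure can be sketched with Python's standard `pickle` module. The nested-dictionary layout of `weights` and the layer names are hypothetical, and deep learning frameworks typically provide their own storage formats. Note that `pickle` should only be used to load data from trusted sources.

```python
import pickle

# Hypothetical weight structure: layer name -> matrix of link weights.
weights = {
    "hidden": [[0.1, -0.2], [0.3, 0.4]],
    "output": [[0.5], [-0.6]],
}

# Serialization: convert the structure to bytes that can be stored or transmitted.
serialized = pickle.dumps(weights)

# Deserialization: reconstruct an equal structure on the same or another device.
restored = pickle.loads(serialized)
```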

The data structure may include hyperparameters of a neural network. The data structure including the hyperparameters of the neural network can be stored in a computer-readable medium. A hyperparameter may be a variable that can be changed by the user. Hyperparameters may include, for example, the learning rate, the cost function, the number of learning cycle repetitions, weight initialization (e.g., setting the range of weight values subject to weight initialization), and the number of hidden units (e.g., the number of hidden layers and the number of nodes in the hidden layers). The above-described data structure is only an example and the present disclosure is not limited thereto.

Figure 7 is a brief, general conceptual diagram of an example computing environment in which embodiments of the present disclosure may be implemented.

Although the present disclosure has generally been described above as being capable of being implemented by a computing device, those skilled in the art will appreciate that the present disclosure can be implemented in combination with computer-executable instructions and/or other program modules that can be executed on one or more computers, and/or as a combination of hardware and software.

Typically, program modules include routines, programs, components, data structures, etc. that perform specific tasks or implement specific abstract data types. Additionally, those skilled in the art will appreciate that the methods of the present disclosure can be practiced with other computer system configurations, including single-processor or multiprocessor computer systems, minicomputers, and mainframe computers, as well as personal computers, handheld computing devices, microprocessor-based or programmable consumer electronics, and the like, each of which may operate in connection with one or more associated devices.

The described embodiments of the disclosure can also be practiced in distributed computing environments where certain tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

Computers typically include a variety of computer-readable media. Computer-readable media can be any medium that can be accessed by a computer, and such computer-readable media include volatile and non-volatile media, transitory and non-transitory media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may include computer-readable storage media and computer-readable transmission media. Computer-readable storage media include volatile and non-volatile media, transitory and non-transitory media, and removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer-readable storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital video disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be accessed by a computer and used to store desired information.

A computer-readable transmission medium typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes all information delivery media. The term modulated data signal refers to a signal in which one or more of the characteristics of the signal have been set or changed to encode information within the signal. By way of example, and not limitation, computer-readable transmission media include wired media, such as a wired network or direct-wired connection, and wireless media, such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above are also intended to be included within the scope of computer-readable transmission media.

An example environment 1100 that implements various aspects of the present disclosure is shown, including a computer 1102, which includes a processing unit 1104, a system memory 1106, and a system bus 1108. System bus 1108 couples system components, including but not limited to system memory 1106, to processing unit 1104. Processing unit 1104 may be any of a variety of commercially available processors. Dual processors and other multiprocessor architectures may also be used as processing unit 1104.

System bus 1108 may be any of several types of bus structures that may further be interconnected to a memory bus, a peripheral bus, and a local bus using any of a variety of commercial bus architectures. System memory 1106 includes read-only memory (ROM) 1110 and random access memory (RAM) 1112. A basic input/output system (BIOS) is stored in non-volatile memory 1110, such as ROM, EPROM, or EEPROM, and contains the basic routines that help transfer information between components within the computer 1102, such as during startup. RAM 1112 may also include high-speed RAM, such as static RAM, for caching data.

Computer 1102 may also include an internal hard disk drive (HDD) 1114 (e.g., EIDE, SATA), which may also be configured for external use within a suitable chassis (not shown); a magnetic floppy disk drive (FDD) 1116 (e.g., for reading from or writing to a removable diskette 1118); and an optical disk drive 1120 (e.g., for reading a CD-ROM disk 1122 or for reading from or writing to other high-capacity optical media such as DVDs). Hard disk drive 1114, magnetic disk drive 1116, and optical disk drive 1120 can be connected to system bus 1108 by hard disk drive interface 1124, magnetic disk drive interface 1126, and optical drive interface 1128, respectively. The interface 1124 for implementing an external drive includes at least one or both of Universal Serial Bus (USB) and IEEE 1394 interface technologies.

These drives and their associated computer-readable media provide non-volatile storage of data, data structures, computer-executable instructions, and the like. For computer 1102, the drives and media accommodate the storage of any data in a suitable digital format. Although the description of computer-readable media above refers to an HDD, a removable magnetic disk, and removable optical media such as CDs or DVDs, those skilled in the art will appreciate that other types of computer-readable media, such as zip drives, magnetic cassettes, flash memory cards, cartridges, and the like, may also be used in the example operating environment, and that any such media may contain computer-executable instructions for performing the methods of the present disclosure.

A number of program modules may be stored in the drive and RAM 1112, including an operating system 1130, one or more application programs 1132, other program modules 1134, and program data 1136. All or portions of the operating system, applications, modules and/or data may also be cached in RAM 1112. It will be appreciated that the present disclosure may be implemented on various commercially available operating systems or combinations of operating systems.

A user may enter commands and information into computer 1102 through one or more wired/wireless input devices, such as a keyboard 1138 and a pointing device such as mouse 1140. Other input devices (not shown) may include microphones, IR remote controls, joysticks, game pads, stylus pens, touch screens, etc. These and other input devices are often connected to the processing unit 1104 through an input device interface 1142 that is coupled to the system bus 1108, but may also be connected by other interfaces, such as a parallel port, an IEEE 1394 serial port, a game port, a USB port, an IR interface, and the like.

A monitor 1144 or other type of display device is also connected to system bus 1108 through an interface, such as a video adapter 1146. In addition to monitor 1144, computers typically include other peripheral output devices (not shown) such as speakers, printers, etc.

Computer 1102 may operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) 1148, via wired and/or wireless communications. Remote computer(s) 1148 may be a workstation, computing device, router, personal computer, portable computer, microprocessor-based entertainment device, peer device, or other common network node, and typically includes many or all of the components described with respect to computer 1102, although, for simplicity, only memory storage device 1150 is shown. The logical connections depicted include wired/wireless connections to a local area network (LAN) 1152 and/or a larger network, such as a wide area network (WAN) 1154. These LAN and WAN networking environments are common in offices and companies and facilitate enterprise-wide computer networks, such as intranets, all of which can be connected to a worldwide computer network, such as the Internet.

When used in a LAN networking environment, computer 1102 is connected to local network 1152 through a wired and/or wireless communication network interface or adapter 1156. Adapter 1156 may facilitate wired or wireless communication to LAN 1152, which may also include a wireless access point installed thereon for communicating with wireless adapter 1156. When used in a WAN networking environment, the computer 1102 may include a modem 1158, may be connected to a communications computing device on the WAN 1154, or may have other means for establishing communications over the WAN 1154, such as via the Internet. Modem 1158, which may be an internal or external device and a wired or wireless device, is coupled to system bus 1108 via serial port interface 1142. In a networked environment, program modules described for computer 1102, or portions thereof, may be stored in remote memory/storage device 1150. It will be appreciated that the network connections shown are exemplary and that other means of establishing a communications link between computers may be used.

Computer 1102 operates to communicate with any wireless device or object deployed and operating in wireless communication, such as a printer, scanner, desktop and/or portable computer, portable data assistant (PDA), communications satellite, any device or location associated with a wirelessly detectable tag, and a telephone. This includes at least Wi-Fi and Bluetooth wireless technologies. Accordingly, the communication may have a predefined structure, as in a conventional network, or may simply be ad hoc communication between at least two devices.

Wi-Fi (Wireless Fidelity) allows connection to the Internet and the like without wires. Wi-Fi is a wireless technology, similar to that used in a cell phone, that enables devices such as computers to send and receive data indoors and outdoors, anywhere within the range of a base station. Wi-Fi networks use radio technologies called IEEE 802.11 (a, b, g, etc.) to provide secure, reliable, and high-speed wireless connections. Wi-Fi can be used to connect computers to each other, to the Internet, and to wired networks (which use IEEE 802.3 or Ethernet). Wi-Fi networks can operate in the unlicensed 2.4 and 5 GHz radio bands, for example, at data rates of 11 Mbps (802.11b) or 54 Mbps (802.11a), or in products that include both bands (dual band).

Those skilled in the art will understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced in the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

Those skilled in the art will understand that the various illustrative logical blocks, modules, processors, means, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented by electronic hardware, various forms of program or design code (referred to herein, for convenience, as software), or a combination of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether this functionality is implemented as hardware or software depends on the specific application and the design constraints imposed on the overall system. A person skilled in the art of this disclosure may implement the described functionality in various ways for each specific application, but such implementation decisions should not be construed as departing from the scope of this disclosure.

The various embodiments presented herein may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques. The term article of manufacture includes a computer program, carrier, or media accessible from any computer-readable storage device. For example, computer-readable storage media include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic strips, etc.), optical disks (e.g., CDs, DVDs, etc.), smart cards, and flash memory devices (e.g., EEPROM, cards, sticks, key drives, etc.). Additionally, the various storage media presented herein include one or more devices and/or other machine-readable media for storing information.

It is to be understood that the specific order or hierarchy of steps in the processes presented is an example of illustrative approaches. It is to be understood that the specific order or hierarchy of steps in processes may be rearranged within the scope of the present disclosure, based on design priorities. The appended method claims present elements of the various steps in a sample order but are not meant to be limited to the particular order or hierarchy presented.

The description of the presented embodiments is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments without departing from the scope of the disclosure. Thus, the present disclosure is not limited to the embodiments presented herein but is to be interpreted in the broadest scope consistent with the principles and novel features presented herein.

As described above, the relevant content has been described in the best form for carrying out the invention.

Claims

A data augmentation method associated with a target protein, performed by a computing device, the method comprising:

obtaining a target protein included in learning data and indicator information associated with the target protein;

identifying a homologous protein of the target protein; and

augmenting the learning data by matching the indicator information associated with the target protein with the homologous protein.
The method of claim 1, wherein the indicator information associated with the target protein includes information on the affinity of the target protein for a drug.
The method of claim 2, further comprising: filtering the augmented learning data in consideration of the drug affinity information of the target protein and the drug affinity information of the homologous protein.
The method of claim 3, wherein the filtering step includes: comparing, in the batch currently being learned, the affinity information of the target protein for the drug, which is given from the learning data or predicted by the deep learning model, with the affinity information of the homologous protein for the drug, which is predicted by the deep learning model.
The method of claim 4, wherein the filtering step includes: filtering data on homologous proteins whose accuracy is higher than a certain rank among the accuracy values of the homologous proteins in the batch, wherein the accuracy values are generated based on a comparison between the affinity information of the homologous proteins for the drug, which is given from the learning data or predicted by the deep learning model, and the affinity information of the target protein for the drug, which is predicted by the deep learning model.
The method of claim 3, wherein the filtering step includes: performing filtering on the augmented learning data starting from the middle of the learning process of the deep learning model currently being learned.
The method of claim 1, wherein the step of identifying a homologous protein of the target protein further includes: performing multiple sequence alignment (MSA) on the target protein and a plurality of homologous proteins.
The method of claim 7, wherein the step of performing the multiple sequence alignment includes: performing a search, with respect to the target protein, for a plurality of homologous proteins that satisfy a preset identity ratio.
A computer program stored in a computer-readable storage medium, wherein the computer program, when executed on one or more processors, performs the following operations for data augmentation associated with a target protein, the operations comprising:

obtaining a target protein included in learning data and indicator information associated with the target protein;

identifying a homologous protein of the target protein; and

augmenting the learning data by matching the indicator information associated with the target protein with the homologous protein.
A computing device for data augmentation associated with a target protein, comprising:

at least one processor; and

a memory,

wherein the at least one processor is configured to:

obtain a target protein included in learning data and indicator information associated with the target protein;

identify a homologous protein of the target protein; and

augment the learning data by matching the indicator information associated with the target protein with the homologous protein.
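By way of a non-limiting illustration, the operations recited in the claims above (identifying homologous proteins by an identity ratio, augmenting the learning data with the target's affinity label, and filtering the augmented data within a batch) might be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the sequences are toy examples assumed to be already aligned to equal length, `toy_predict` is a hypothetical stand-in for a trained deep learning model, and retaining the `keep_top_k` augmented samples whose predictions best agree with the inherited label is only one possible reading of the rank-based filtering.

```python
from typing import Callable, Dict, List, NamedTuple


class Sample(NamedTuple):
    protein: str     # amino-acid sequence
    drug: str        # drug identifier, e.g. a SMILES string
    affinity: float  # indicator information (affinity label)
    augmented: bool  # True when the protein is a homolog of a target


def identity_ratio(a: str, b: str) -> float:
    """Fraction of identical positions in two pre-aligned, equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def find_homologs(target: str, candidates: List[str], min_identity: float) -> List[str]:
    """Select candidates that satisfy the preset identity ratio with the target."""
    return [c for c in candidates
            if c != target and identity_ratio(target, c) >= min_identity]


def augment(data: List[Sample], homologs: Dict[str, List[str]]) -> List[Sample]:
    """Match each target's affinity label with every homolog of that target."""
    out = list(data)
    for s in data:
        for h in homologs.get(s.protein, []):
            out.append(Sample(h, s.drug, s.affinity, True))
    return out


def filter_batch(batch: List[Sample],
                 predict: Callable[[str, str], float],
                 keep_top_k: int) -> List[Sample]:
    """Keep originals, plus the k augmented samples whose predicted affinity
    is closest to the label inherited from the target (an assumed criterion)."""
    originals = [s for s in batch if not s.augmented]
    ranked = sorted((s for s in batch if s.augmented),
                    key=lambda s: abs(predict(s.protein, s.drug) - s.affinity))
    return originals + ranked[:keep_top_k]


def toy_predict(protein: str, drug: str) -> float:
    """Hypothetical stand-in for a trained model; NOT a real affinity predictor."""
    return 5.0 + 0.5 * protein.count("K")


target = "MKTAYIAK"
candidates = ["MKTAYLAK", "MRTAYIAR", "MKTAYIAR", "GGGGGGGG"]
homologs = {target: find_homologs(target, candidates, min_identity=0.7)}

data = [Sample(target, "CCO", 6.0, False)]
augmented = augment(data, homologs)
kept = filter_batch(augmented, toy_predict, keep_top_k=2)
```

In this toy run, three of the four candidates satisfy the 0.7 identity ratio, so the single labeled sample grows to four, and the filter then retains the original plus the two homologs for which the stand-in model agrees most closely with the inherited label.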