WO2021089008A1 - Method and device for predicting intermolecular binding activity - Google Patents

Method and device for predicting intermolecular binding activity

Info

Publication number
WO2021089008A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature vector
original matrix
protein
binding activity
small molecule
Prior art date
Application number
PCT/CN2020/127249
Other languages
English (en)
French (fr)
Inventor
胡帆
蒋佳新
殷鹏
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院
Publication of WO2021089008A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C10/00: Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30: Prediction of properties of chemical compounds, compositions or mixtures
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/50: Molecular design, e.g. of drugs

Definitions

  • This application belongs to the field of data processing technology, and in particular relates to a method and device for predicting the binding activity between molecules.
  • Technologies for drug research and development through scientific and technological means include structure-based and ligand-based computer virtual screening.
  • The most widely used of these, with a comparatively high success rate, is molecular docking. Its core premise is the ability of a molecule to bind to the target protein, together with the specific biological activity the molecule exhibits through the protein's binding site.
  • The three-dimensional structure of the protein can be obtained from experimental data, homology modeling, or molecular dynamics simulation. Molecular docking and related techniques then match a large number of small molecules from a compound database against the binding site inferred on the target structure. The compounds are subsequently evaluated and scored according to certain rules and ranked by score.
  • The higher-ranked compounds are potential lead inhibitors of the protein target.
  • This approach is computationally slow and inefficient: it must run simulated scoring against a massive ligand database, which takes a long time.
  • After the software's preliminary screening, researchers must further select candidates manually and analyze the preliminary results visually.
  • This is inefficient, and accuracy fluctuates with the researchers' level of experience, so the approach still falls short of the research goal.
  • The embodiments of the present application provide a method and device for predicting the binding activity between molecules, which can solve the prior art's problems of slow computation, low efficiency, and long run times.
  • They also address the problem that accuracy fluctuates with the researchers' level of experience.
  • the embodiments of the present application provide a method for predicting the binding activity between molecules, including:
  • the first feature vector and the second feature vector are interlocked and calculated to obtain the prediction result of the binding activity between the protein and the small molecule output by the prediction model.
  • an intermolecular binding activity prediction device including:
  • an obtaining module, used to obtain the original matrix of proteins and the original matrix of small molecules;
  • the extraction module is used to extract the first feature vector corresponding to the original matrix of proteins and the second feature vector corresponding to the original matrix of small molecules;
  • the interlocking module is used for interlocking and calculating the first feature vector and the second feature vector to obtain the prediction result of the binding activity between the protein and the small molecule output by the prediction model.
  • an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the computer program, the method for predicting the intermolecular binding activity described in any one of the above-mentioned first aspects is realized.
  • an embodiment of the present application provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the method for predicting the binding activity between molecules described in any one of the above-mentioned first aspects is realized.
  • the embodiments of the present application provide a computer program product that, when the computer program product runs on a terminal device, causes the terminal device to execute the method for predicting the intermolecular binding activity of any one of the above-mentioned first aspects.
  • the embodiments of the application use a convolutional neural network model to extract features from the one-dimensional sequences of proteins and small molecules and obtain their binding activity, avoiding research errors that arise when the structure of a large molecule such as a protein is unclear.
  • This improves the efficiency of drug development through scientific and technological means, effectively shortens development time, and ensures the stability of the research process and results.
  • FIG. 1 is a schematic flowchart of a method for predicting intermolecular binding activity provided by an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a prediction model provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a feedforward fully connected layer provided by an embodiment of the present application.
  • FIG. 4 is a prediction effect diagram of the prediction model provided in an embodiment of the present application in the PDBbind database
  • FIG. 5 is a schematic structural diagram of an intermolecular binding activity prediction device provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a terminal device to which the method provided in an embodiment of the present application is applicable.
  • the term “if” can be construed as “when”, “once”, “in response to determining”, or “in response to detecting”, depending on the context.
  • the phrase “if determined” or “if [described condition or event] is detected” can be interpreted, depending on the context, as “once determined”, “in response to determining”, “once [described condition or event] is detected”, or “in response to detecting [described condition or event]”.
  • the method for predicting intermolecular binding activity can be applied to mobile phones, tablets, wearable devices, vehicle-mounted devices, notebook computers, Ultra-Mobile Personal Computers (UMPC), netbooks, and Personal Digital Assistants (PDA).
  • Fig. 1 shows a schematic flow chart of the method for predicting binding activity between molecules provided in the present application.
  • the method can be applied to any of the above-mentioned terminal devices.
  • the original matrix of the protein and the original matrix of the small molecule are convolved through convolution layers to obtain the first feature vector corresponding to the original matrix of the protein and the second feature vector corresponding to the original matrix of the small molecule.
  • the first feature vector and the second feature vector are interlocked, and the interlocked vectors are input into a configurable number of fully connected layers to obtain the prediction result, output by the prediction model, of the binding activity between the protein and the small molecule. Interlocking refers to the mutual restriction relationship established between the first feature vector and the second feature vector.
  • the configuration of the fully connected layers covers both the number of layers and the number of neurons per layer, and can be set according to the actual situation. For example, three fully connected layers may be used, with 2048, 512, and 64 neurons in turn.
  • step S101 includes:
  • the one-dimensional sequence of the protein is converted into the corresponding original matrix by the preset conversion method, and the one-dimensional sequence of the small molecule is converted into the corresponding original matrix; wherein, the preset conversion method includes one-hot encoding.
  • the preset conversion method includes but is not limited to one-hot encoding.
  • the one-dimensional sequences of the protein and the small molecule are converted into one-hot encoded matrices of sizes (P, 1200) and (C, 200), where P and C are the numbers of distinct characters in the protein and small-molecule sequences, respectively.
  • For example, if a protein contains 20 different amino acids (A, R, L, ...), then P is 20.
  • SMILES: simplified molecular-input line-entry system
  • step S102 includes:
  • the original matrix of the protein and the original matrix of the small molecule are each subjected to convolution processing to obtain the first feature vector corresponding to the original matrix of the protein and the second feature vector corresponding to the original matrix of the small molecule.
  • the feature extraction process is mainly: convolving the original matrix of the protein and the original matrix of the small molecule through convolution layers with a kernel size of 3*3 and a stride of 1.
  • two convolutional layers and one pooling layer are regarded as one convolution module, and the number of convolution modules can be set according to actual conditions.
  • For example, with a total of 3 convolution modules (that is, 6 convolution layers), the numbers of convolution kernels are 32, 32, 64, 64, 128, and 128 in order.
  • Figure 2 exemplarily shows a schematic structural diagram of a prediction model.
  • Because neural networks extract features automatically, and the prior art cannot specify which features a neural network extracts, a series of known or unknown features is mapped to a high-dimensional space; the result is the first feature vector or the second feature vector.
  • step S103 includes:
  • the fully connected layer performs fully connected processing on the interlocked first feature vector and second feature vector to determine whether the protein and the small molecule they represent have binding activity, and how strong that binding activity is.
  • Fig. 3 exemplarily shows a structural schematic diagram of a simple feedforward fully connected layer.
  • x is the input value;
  • W[1] and W[2] are the weight parameters from the input layer to the hidden layer and from the hidden layer to the output layer, respectively (obtained by pre-training the neural network);
  • σ is the activation function;
  • a[1] is the hidden-layer value after the activation transformation;
  • y is the predicted output value.
  • the input value x is the interlocked feature vector of the protein and the small molecule (2048 dimensions);
  • the two middle layers have 512 and 64 neurons, respectively, and the input of each layer is the output of the previous layer.
  • the activation function σ of the first two layers is ReLU;
  • the activation function of the last layer depends on the task: sigmoid for the classification task and linear for the regression task.
  • step S1032 includes:
  • the interlocked first feature vector and second feature vector are subjected to regression processing to obtain the prediction result, output by the prediction model, of the binding activity between the protein and the small molecule.
  • Specifically, the interlocked first and second feature vectors first undergo classification processing through the classification task, which predicts whether the small molecule binds to the protein. If the prediction is that the protein and the small molecule have binding activity, the interlocked vectors then undergo regression processing through the regression task, which predicts the strength of binding between the small molecule and the protein.
  • step S202 the method includes:
  • the sample data is processed through the loss function to realize the pre-training process of the prediction model and obtain the pre-trained prediction model; wherein, the loss function includes at least one of cross entropy and mean square error.
  • the output value of the classification task is 0 or 1. 0 means that there is no binding activity between the protein and small molecules, and 1 means that there is binding activity between the protein and small molecules.
  • the output of the regression task is a continuous value, such as 4.2, 1.6 or 8.9, which indicates the strength of the binding activity of the protein and the small molecule compound.
  • Cross entropy (binary cross entropy) is a loss function used to optimize the model: the optimal solution for the model's training weight parameters W is obtained by computing it.
  • Mean square error (MSE) is likewise a loss function used to optimize the model.
  • the training optimizer is set to Adam.
  • the learning rate, a hyperparameter of the neural network, is set to 0.0001, with beta1 set to 0.9 and beta2 set to 0.999.
  • PDBbind is a database containing tens of thousands of protein and small molecule binding structures and their binding activities. It is used to build and test a variety of virtual screening methods, and can be used to compare the performance of different virtual screening models side by side.
  • the root mean square error (RMSE) of the prediction model on the PDBbind training, validation, and test data sets is 0.930, 1.388, and 1.372, respectively, and the corresponding correlation coefficients are 0.87, 0.69, and 0.70.
  • the DUD-E database is a benchmark data set for evaluating virtual screening algorithms.
  • the prediction effect of the prediction model in the DUD-E database can reach 0.997.
  • Table 2 shows the prediction effects of traditional molecular docking methods such as Smina and AutoDock Vina, machine learning algorithm support vector machine methods, and prediction models in the DUD-E database.
  • This embodiment uses a convolutional neural network model to extract features from the one-dimensional sequences of proteins and small molecules and obtain their binding activity, avoiding research errors that arise when the structure of a large molecule such as a protein is unclear.
  • This improves the efficiency of drug research and development by scientific and technological means, effectively shortens research and development time, and ensures the stability of the research process and results.
  • FIG. 5 shows a structural block diagram of the intermolecular binding activity prediction device provided in an embodiment of the present application. Only the parts relevant to the embodiment are shown.
  • the intermolecular binding activity prediction device 200 includes:
  • the first obtaining module 101 is used to obtain the original matrix of proteins and the original matrix of small molecules;
  • the extraction module 102 is used to extract the first feature vector corresponding to the original matrix of proteins and the second feature vector corresponding to the original matrix of small molecules;
  • the interlocking module 103 is configured to interlock and calculate the first feature vector and the second feature vector to obtain the prediction result of the binding activity between the protein and the small molecule output by the prediction model.
  • the device for predicting the binding activity between molecules further includes:
  • the second acquisition module is used to acquire sample data
  • the pre-training module is used to pre-train the prediction model through sample data to obtain a pre-trained prediction model; wherein the prediction model includes a deep learning model.
  • This embodiment uses a convolutional neural network model to extract features from the one-dimensional sequences of proteins and small molecules and obtain their binding activity, avoiding research errors that arise when the structure of a large molecule such as a protein is unclear.
  • This improves the efficiency of drug research and development by scientific and technological means, effectively shortens research and development time, and ensures the stability of the research process and results.
  • FIG. 6 shows a block diagram of a part of the structure of a terminal device provided in an embodiment of the present application.
  • the terminal device includes components such as a radio frequency (RF) circuit 110, a memory 120, an input unit 130, a display unit 140, a sensor 150, an audio circuit 160, a wireless fidelity (WiFi) module 170, a processor 180, and a power supply 190.
  • the structure of the terminal device shown in FIG. 6 does not constitute a limitation on the terminal device, and may include more or fewer components than shown in the figure, or combine some components, or arrange different components.
  • the RF circuit 110 can be used for receiving and sending signals during information transmission or communication. In particular, downlink information received from the base station is passed to the processor 180 for processing; in addition, uplink data is sent to the base station.
  • the RF circuit includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like.
  • the RF circuit 110 may also communicate with the network and other devices through wireless communication.
  • the above-mentioned wireless communication can use any communication standard or protocol, including but not limited to Global System of Mobile Communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), etc.
  • the memory 120 may be used to store software programs and modules.
  • the processor 180 executes various functional applications and data processing of the terminal device by running the software programs and modules stored in the memory 120.
  • the memory 120 may mainly include a program storage area and a data storage area.
  • the program storage area may store an operating system, an application program required by at least one function (such as a sound playback function, an image playback function, etc.), etc.;
  • the data storage area may store data created through use of the terminal device (such as audio data, a phone book, etc.).
  • the memory 120 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.
  • the input unit 130 may be used to receive input digital or character information, and generate key signal input related to user settings and function control of the terminal device 100.
  • the input unit 130 may include a touch panel 131 and other input devices 132.
  • the touch panel 131, also known as a touch screen, can collect the user's touch operations on or near it (for example, operations performed on or near the touch panel 131 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program.
  • the touch panel 131 may include two parts: a touch detection device and a touch controller.
  • the touch detection device detects the user's touch position, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 180, and can receive and execute commands sent by the processor 180.
  • the touch panel 131 can be implemented in multiple types such as resistive, capacitive, infrared, and surface acoustic wave.
  • the input unit 130 may also include other input devices 132.
  • the other input device 132 may include, but is not limited to, one or more of a physical keyboard, function keys (such as volume control buttons, switch buttons, etc.), trackball, mouse, and joystick.
  • the display unit 140 may be used to display information input by the user or information provided to the user and various menus of the terminal device.
  • the display unit 140 may include a display panel 141.
  • the display panel 141 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), etc.
  • the touch panel 131 can cover the display panel 141. When the touch panel 131 detects a touch operation on or near it, it passes the operation to the processor 180 to determine the type of the touch event, and the processor 180 then provides the corresponding visual output on the display panel 141 according to that type.
  • Although the touch panel 131 and the display panel 141 are treated as two independent components that implement the input and output functions of the terminal device, in some embodiments they can be integrated to implement those functions.
  • the terminal device 100 may also include at least one sensor 150, such as a light sensor, a motion sensor, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor.
  • the ambient light sensor can adjust the brightness of the display panel 141 according to the brightness of the ambient light.
  • the proximity sensor can turn off the display panel 141 and/or the backlight when the terminal device is moved to the ear.
  • the accelerometer sensor can detect the magnitude of acceleration in various directions (usually three axes) and, when stationary, the magnitude and direction of gravity. It can be used in applications that recognize the posture of the terminal device (such as switching between landscape and portrait, related games, and magnetometer posture calibration) and in vibration-recognition functions (such as a pedometer or tap detection). The terminal device may also be configured with other sensors, such as a gyroscope, barometer, hygrometer, thermometer, or infrared sensor, which are not described again here.
  • the audio circuit 160, the speaker 161, and the microphone 162 can provide an audio interface between the user and the terminal device.
  • the audio circuit 160 can transmit the electrical signal converted from received audio data to the speaker 161, which converts it into a sound signal for output; conversely, the microphone 162 converts collected sound into an electrical signal, which the audio circuit 160 receives and converts into audio data. The audio data is then output to the processor 180 for processing and sent via the RF circuit 110 to, for example, another terminal device, or output to the memory 120 for further processing.
  • WiFi is a short-distance wireless transmission technology.
  • the terminal device can help users send and receive emails, browse web pages, and access streaming media through the WiFi module 170. It provides users with wireless broadband Internet access.
  • Although FIG. 6 shows the WiFi module 170, it is understood that it is not a necessary component of the terminal device 100 and can be omitted as needed without changing the essence of the invention.
  • the processor 180 is the control center of the terminal device. It uses various interfaces and lines to connect the parts of the entire terminal device, and it performs the functions of the terminal device and processes data by running or executing software programs and/or modules stored in the memory 120 and calling data stored in the memory 120, thereby monitoring the terminal device as a whole.
  • the processor 180 may include one or more processing units; preferably, the processor 180 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, and application programs, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 180.
  • the terminal device 100 also includes a power source 190 (such as a battery) for supplying power to various components.
  • the power source may be logically connected to the processor 180 through a power management system, so that charging, discharging, and power consumption are managed through the power management system.
  • the terminal device 100 may also include a camera.
  • the position of the camera on the terminal device 100 may be front or rear, which is not limited in the embodiment of the present application.
  • the terminal device 100 may include a single camera, a dual camera, or a triple camera, etc., which is not limited in the embodiment of the present application.
  • the terminal device 100 may include three cameras, of which one is a main camera, one is a wide-angle camera, and one is a telephoto camera.
  • all of the multiple cameras may be front-mounted, or all rear-mounted, or partly front-mounted and another part rear-mounted, which is not limited in the embodiment of the present application.
  • the terminal device 100 may also include a Bluetooth module, etc., which will not be repeated here.
  • An embodiment of the present application also provides a terminal device.
  • the terminal device includes: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor; when the processor executes the computer program, the steps in any of the foregoing method embodiments are implemented.
  • the embodiments of the present application also provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in each of the foregoing method embodiments can be realized.
  • the embodiments of the present application provide a computer program product.
  • When the computer program product runs on a mobile terminal, the mobile terminal can carry out the steps in the foregoing method embodiments.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the computer programs can be stored in a computer-readable storage medium.
  • the steps of the foregoing method embodiments can be implemented.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file, or some intermediate forms.
  • the computer-readable medium may at least include: any entity or device capable of carrying the computer program code to the photographing device/terminal device, a recording medium, computer memory, read-only memory (ROM), random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium.
  • Examples include a USB flash drive, a removable hard disk, a floppy disk, or a CD-ROM.
  • computer-readable media cannot be electrical carrier signals and telecommunication signals.
  • the disclosed apparatus/network equipment and method may be implemented in other ways.
  • the device/network device embodiments described above are only illustrative.
  • the division into modules or units is only a division by logical function; other divisions are possible in actual implementation. For example, multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
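The training and evaluation quantities named in the Definitions above (cross entropy and mean square error as loss functions, Adam with a learning rate of 0.0001, beta1 of 0.9 and beta2 of 0.999, and RMSE with a correlation coefficient for evaluation) can be illustrated with a minimal NumPy sketch. This is not the applicant's implementation. The function names, the `eps` values, and the choice of the Pearson coefficient are assumptions (the text says only "correlation coefficients"), and the `predicted` values are made up purely for illustration.

```python
import numpy as np


def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Classification loss; eps (an assumed value) keeps log() finite."""
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y_true, dtype=float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def mean_squared_error(y_true, y_pred):
    """Regression loss: mean of the squared differences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))


def rmse(y_true, y_pred):
    """Root mean square error, the evaluation metric quoted for PDBbind."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


def pearson_r(y_true, y_pred):
    """Pearson correlation (assumed to be the coefficient that is meant)."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


def adam_step(w, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with the stated hyperparameters; t is the 1-based step."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v


measured = [4.2, 1.6, 8.9]    # example regression outputs quoted in the text
predicted = [4.0, 2.0, 8.5]   # hypothetical model predictions
print(rmse(measured, predicted), pearson_r(measured, predicted))
```

In a two-stage setup like the one described, the cross entropy loss would drive the classification output (binding or no binding) and the mean square error would drive the regression output (binding strength).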

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Physics & Mathematics (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

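A method for predicting intermolecular binding activity, including: obtaining an original matrix of a protein and an original matrix of a small molecule (S101); extracting a first feature vector corresponding to the original matrix of the protein and a second feature vector corresponding to the original matrix of the small molecule (S102); and interlocking the first feature vector and the second feature vector and performing a calculation to obtain the prediction result, output by a prediction model, of the binding activity between the protein and the small molecule (S103). The method uses a convolutional neural network model to extract features from the one-dimensional sequences of the protein and the small molecule and obtain their binding activity. This avoids research errors that arise when the structure of a macromolecule such as a protein is unclear, improves the efficiency of drug development by scientific and technological means, effectively shortens development time, and ensures the stability of the research process and results.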

Description

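Method and apparatus for predicting intermolecular binding activity
Technical Field
The present application belongs to the technical field of data processing, and in particular relates to a method and apparatus for predicting binding activity between molecules.
Background
In recent years, as science and technology have advanced, using technological means for drug development has become a shared goal of society. Because developing a new drug consumes large amounts of money, labor, and time, how to speed up clinical drug research has become a major research direction.
At present, technologies for drug development by technological means include structure-based and ligand-based computer virtual screening, of which molecular docking is the most widely used and has a relatively high success rate. Its core idea concerns the ability of a molecule to bind the target protein and the specific biological activity that the molecule exhibits through the protein's binding site. The three-dimensional structure of the protein can be obtained from experimental data, homology modeling, molecular dynamics simulation, or similar methods. Techniques such as molecular docking are then used to match a large number of small molecules in a compound database against the binding site inferred from the target structure. The compounds are scored according to certain rules and ranked by score, and the highly ranked compounds are potential lead inhibitors of the protein target. However, this approach is computationally slow and inefficient: it must simulate and score against a massive ligand database, which takes a long time. Moreover, after the software's initial screening, researchers must manually select from and visually analyze the initial results, which is inefficient, and the accuracy fluctuates with the researchers' level of experience, so the research goal is still not met.
Summary
The embodiments of the present application provide a method and apparatus for predicting binding activity between molecules, which can solve the problems of the prior art: slow computation, low efficiency, long time consumption, and accuracy that fluctuates with the researchers' level of experience.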
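In a first aspect, an embodiment of the present application provides a method for predicting binding activity between molecules, including:
obtaining an original matrix of a protein and an original matrix of a small molecule;
extracting a first feature vector corresponding to the original matrix of the protein and a second feature vector corresponding to the original matrix of the small molecule;
interlocking the first feature vector and the second feature vector and performing computation on them to obtain a prediction result, output by a prediction model, of the binding activity between the protein and the small molecule.
In a second aspect, an embodiment of the present application provides an apparatus for predicting binding activity between molecules, including:
an obtaining module, configured to obtain an original matrix of a protein and an original matrix of a small molecule;
an extraction module, configured to extract a first feature vector corresponding to the original matrix of the protein and a second feature vector corresponding to the original matrix of the small molecule;
an interlocking module, configured to interlock the first feature vector and the second feature vector and perform computation on them to obtain a prediction result, output by a prediction model, of the binding activity between the protein and the small molecule.
In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method for predicting binding activity between molecules according to any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the method for predicting binding activity between molecules according to any implementation of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute the method for predicting binding activity between molecules according to any implementation of the first aspect.
It can be understood that, for the beneficial effects of the second to fifth aspects, reference may be made to the related description of the first aspect, which is not repeated here.
The embodiments of the present application use a convolutional neural network model to extract features from the one-dimensional sequences of the protein and the small molecule and thereby obtain their binding activity. This avoids research errors that arise when the structure of a macromolecule such as a protein is not well determined, improves the efficiency of technology-driven drug development, effectively shortens development time, and ensures the stability of the research process and its results.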
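Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the method for predicting binding activity between molecules provided by an embodiment of the present application;
FIG. 2 is a schematic structural diagram of the prediction model provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of the feedforward fully connected layer provided by an embodiment of the present application;
FIG. 4 is a diagram of the prediction performance of the prediction model on the PDBbind database provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of the apparatus for predicting binding activity between molecules provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a terminal device to which the method provided by an embodiment of the present application is applicable.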
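Detailed Description
In the following description, specific details such as particular system structures and technologies are set out for the purpose of illustration rather than limitation, so that the embodiments of the present application can be thoroughly understood. However, it should be clear to those skilled in the art that the present application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, apparatuses, circuits, and methods are omitted so that unnecessary detail does not obscure the description of the present application.
It should be understood that, when used in the specification and the appended claims of the present application, the term "comprising" indicates the presence of the described features, wholes, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the term "and/or" used in the specification and the appended claims of the present application refers to any combination of one or more of the associated listed items and all possible combinations, and includes these combinations.
As used in the specification and the appended claims of the present application, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
In addition, in the description of the specification and the appended claims of the present application, the terms "first", "second", "third", and so on are used only to distinguish descriptions and cannot be understood as indicating or implying relative importance.
References in the specification of the present application to "one embodiment" or "some embodiments" mean that one or more embodiments of the present application include a particular feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments", and the like appearing in different places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless specifically emphasized otherwise. The terms "comprising", "including", "having", and their variants all mean "including but not limited to", unless specifically emphasized otherwise.
The method for predicting binding activity between molecules provided by the embodiments of the present application can be applied to terminal devices such as mobile phones, tablet computers, wearable devices, in-vehicle devices, notebook computers, ultra-mobile personal computers (UMPC), netbooks, and personal digital assistants (PDA). The embodiments of the present application place no restriction on the specific type of terminal device.
FIG. 1 shows a schematic flowchart of the method for predicting binding activity between molecules provided by the present application. As an example and not a limitation, the method can be applied to any of the above terminal devices.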
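S101. Obtain an original matrix of a protein and an original matrix of a small molecule;
In specific applications, the one-dimensional sequences of the protein and the small-molecule compound are obtained and encoded by one-hot encoding to obtain the original matrix of the protein and the original matrix of the small molecule.
S102. Extract a first feature vector corresponding to the original matrix of the protein and a second feature vector corresponding to the original matrix of the small molecule;
In specific applications, the original matrix of the protein and the original matrix of the small molecule are convolved by convolutional layers to obtain the first feature vector corresponding to the original matrix of the protein and the second feature vector corresponding to the original matrix of the small molecule.
S103. Interlock the first feature vector and the second feature vector and perform computation on them to obtain a prediction result, output by a prediction model, of the binding activity between the protein and the small molecule.
In specific applications, the first feature vector and the second feature vector are interlocked, and the interlocked feature vectors are fed into a configurable number of fully connected layers to obtain the prediction result, output by the prediction model, of the binding activity between the protein and the small molecule. Interlocking refers to a mutual constraint relationship established between the first feature vector and the second feature vector.
Here, the number of fully connected layers refers to both the number of layers and the number of neurons, which can be set according to the actual situation. For example, the fully connected part may be set to 3 layers whose neuron counts are 2048, 512, and 64 in order.
In one embodiment, step S101 includes:
obtaining the one-dimensional sequences of the protein and the small molecule;
converting the one-dimensional sequence of the protein into its corresponding original matrix, and the one-dimensional sequence of the small molecule into its corresponding original matrix, by a preset conversion method, where the preset conversion method includes one-hot encoding.
In specific applications, the preset conversion method includes but is not limited to one-hot encoding.
In this embodiment, the one-dimensional sequences of the protein and the small molecule are converted into one-hot encoded matrices of size (P, 1200) and (C, 200), where P and C are the numbers of distinct characters for the protein and the small molecule, respectively. For example, if a protein has 20 distinct amino acids (A, R, L, …), then P is 20.
The character count for the small molecule is the number of distinct single characters in its Simplified Molecular-Input Line-Entry System (SMILES) string, for example the distinct single characters in CCCCCN(C(=O)[C@@H](NC(=O)[C@H](Cc1ccccc1)NC(=O)C)CCC.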
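The one-hot conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the applicants' implementation: the function name `one_hot_encode`, the 20-letter amino-acid alphabet, the example peptide, and the choice to truncate long sequences and zero-pad short ones are all assumptions, since the text gives only the matrix sizes (P, 1200) and (C, 200).

```python
from typing import List, Sequence


def one_hot_encode(sequence: str, alphabet: Sequence[str], max_len: int) -> List[List[int]]:
    """Return a (len(alphabet), max_len) one-hot matrix for `sequence`.

    Assumed conventions: sequences longer than max_len are truncated,
    shorter ones are padded with all-zero columns, and characters that
    are not in `alphabet` also map to an all-zero column.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = [[0] * max_len for _ in alphabet]
    for pos, ch in enumerate(sequence[:max_len]):
        row = index.get(ch)
        if row is not None:
            matrix[row][pos] = 1
    return matrix


AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, so P = 20
# Hypothetical short peptide, used only to show the resulting shape.
protein_matrix = one_hot_encode("MKTAYIAKQR", AMINO_ACIDS, max_len=1200)
print(len(protein_matrix), len(protein_matrix[0]))
```

The same function would apply to a SMILES string with `max_len=200` and an alphabet built from the distinct characters seen in the compound set.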
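In one embodiment, step S102 includes:
convolving the original matrix of the protein and the original matrix of the small molecule separately to obtain the first feature vector corresponding to the original matrix of the protein and the second feature vector corresponding to the original matrix of the small molecule.
In this embodiment, the feature extraction process is mainly as follows: the original matrix of the protein and the original matrix of the small molecule are convolved by convolutional layers with a 3*3 kernel and a stride of 1. Two convolutional layers and one pooling layer are treated as one convolution module, and the number of convolution modules can be set according to the actual situation.
In this embodiment, 3 convolution modules (that is, 6 convolutional layers) are used, with 32, 32, 64, 64, 128, and 128 convolution kernels in order. These 3 convolution modules extract features from the original matrix of the protein and the original matrix of the small molecule separately, finally yielding the first feature vector corresponding to the original matrix of the protein and the second feature vector corresponding to the original matrix of the small molecule.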
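To make the layer arrangement concrete, the sketch below traces tensor shapes through three conv-conv-pool modules. Only the 3*3 kernel, the stride of 1, and the kernel counts come from the text. The zero padding of 1, the 2*2 pooling window, and the treatment of the one-hot matrix as a single-channel image are assumptions made for illustration, as are the names `conv_out` and `trace_shapes`.

```python
from typing import List, Tuple

FILTERS = [32, 32, 64, 64, 128, 128]  # kernel counts of the six conv layers


def conv_out(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Output length of one spatial axis after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(height: int, width: int, pool: int = 2) -> List[Tuple[str, Tuple[int, int, int]]]:
    """Trace (channels, height, width) through three conv-conv-pool modules."""
    shapes = [("input", (1, height, width))]
    channels = 1
    for module in range(3):
        for layer in range(2):
            channels = FILTERS[2 * module + layer]
            height, width = conv_out(height), conv_out(width)
            shapes.append((f"conv{module + 1}_{layer + 1}", (channels, height, width)))
        # Assumed pool x pool pooling with a matching stride.
        height, width = height // pool, width // pool
        shapes.append((f"pool{module + 1}", (channels, height, width)))
    return shapes


for name, shape in trace_shapes(20, 1200):
    print(name, shape)
```

Under these assumptions a (20, 1200) protein matrix ends as 128 feature maps, which would then be flattened into the first feature vector; a different padding or pooling choice changes the final size.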
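FIG. 2 shows an exemplary schematic structural diagram of a prediction model.
It should be noted that a neural network extracts features automatically, and existing techniques cannot specify exactly what some of those features are. A series of known or unknown features can therefore be mapped into a high-dimensional space, and the result is the first feature vector or the second feature vector.
In one embodiment, step S103 includes:
S1031. Interlock the first feature vector and the second feature vector;
S1032. Perform fully connected processing on the interlocked first feature vector and second feature vector to obtain the prediction result, output by the prediction model, of the binding activity between the protein and the small molecule.
In specific applications, fully connected layers process the interlocked first feature vector and second feature vector to determine whether there is binding activity between the first feature vector and the second feature vector, and how strong it is.
FIG. 3 shows an exemplary schematic structural diagram of a simple feedforward fully connected layer.
Here, x is the input value, and W[1] and W[2] are the weight parameters from the input layer to the hidden layer and from the hidden layer to the output layer, respectively (obtained by pre-training the neural network). σ is the activation function, a[1] is the value after the hidden-layer activation, and y is the output prediction.
As an example, in a fully connected stack with 3 layers whose neuron counts are 2048, 512, and 64 in order, there are three weight matrices: W[1] (2048-512), W[2] (512-64), and W[3] (64-1). The input x is the interlocked feature vector of the protein and the small molecule (2048), the two intermediate layers have 512 and 64 neurons, and each layer takes the previous layer's output as its input. The activation function σ is ReLU for the first two layers; for the last layer it is sigmoid for the classification task and linear for the regression task.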
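The forward pass through such a fully connected stack can be sketched as follows. This is an illustrative sketch under stated assumptions: interlocking is modelled here as plain concatenation of the two feature vectors, the weights are random stand-ins for pre-trained parameters, and the layer sizes are shrunk to 8, 4, 2, 1 in place of 2048, 512, 64, 1 so that the example runs quickly. The names `init_layer`, `dense`, and `forward` are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (weights: one row per output neuron, biases)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Random small weights; a stand-in for pre-trained parameters."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x: Sequence[float], layer: Layer) -> Vector:
    """Affine map W x + b for one fully connected layer."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(z: float) -> float:
    # Numerically stable for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def forward(protein_vec: Vector, molecule_vec: Vector, layers: List[Layer], task: str) -> float:
    x = list(protein_vec) + list(molecule_vec)  # assumption: interlock = concatenate
    for layer in layers[:-1]:
        x = [max(0.0, a) for a in dense(x, layer)]  # ReLU on the hidden layers
    z = dense(x, layers[-1])[0]
    # Sigmoid output for classification, linear output for regression.
    return sigmoid(z) if task == "classification" else z


rng = random.Random(0)
sizes = [8, 4, 2, 1]  # small stand-ins for 2048 -> 512 -> 64 -> 1
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
protein_vec = [rng.random() for _ in range(4)]
molecule_vec = [rng.random() for _ in range(4)]
print(forward(protein_vec, molecule_vec, layers, "classification"))
print(forward(protein_vec, molecule_vec, layers, "regression"))
```

Replacing `sizes` with `[2048, 512, 64, 1]` gives the three weight matrices named in the example, at the cost of a slower pure-Python run.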
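In one embodiment, step S1032 includes:
performing classification processing on the interlocked first feature vector and second feature vector to obtain a prediction result, output by the prediction model, of whether there is binding activity between the protein and the small molecule;
if the prediction result is that there is binding activity between the protein and the small molecule, performing regression processing on the interlocked first feature vector and second feature vector to obtain a prediction result, output by the prediction model, of the strength of the binding activity between the protein and the small molecule.
In specific applications, a classification task processes the interlocked first feature vector and second feature vector to predict whether the small molecule binds the protein. If the prediction result is that there is binding activity between the protein and the small molecule, a regression task then processes the interlocked first feature vector and second feature vector to predict the binding strength between the small molecule and the protein.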
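The classify-then-regress flow can be sketched as a small gating function. The sketch assumes a decision threshold of 0.5 on the classifier output, which the text does not state, and uses toy stand-in models; `predict_binding`, `toy_classifier`, and `toy_regressor` are hypothetical names.

```python
from typing import Callable, Optional, Sequence

Scorer = Callable[[Sequence[float]], float]


def predict_binding(
    features: Sequence[float],
    classify: Scorer,
    regress: Scorer,
    threshold: float = 0.5,  # assumed cut-off, not given in the source
) -> Optional[float]:
    """Return the predicted binding strength, or None if no binding is predicted."""
    if classify(features) < threshold:
        return None  # classification says no binding activity: skip regression
    return regress(features)


# Toy stand-ins for the trained classification and regression heads.
def toy_classifier(features: Sequence[float]) -> float:
    return 1.0 if sum(features) > 1.0 else 0.0


def toy_regressor(features: Sequence[float]) -> float:
    return 2.0 * sum(features)


print(predict_binding([0.2, 0.3], toy_classifier, toy_regressor))
print(predict_binding([0.9, 0.8], toy_classifier, toy_regressor))
```

Running the regression head only on pairs that pass the classifier keeps the strength prediction restricted to pairs predicted to bind, which matches the conditional wording of the step.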
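In one embodiment, before S101, the method further includes:
S201. Obtain sample data;
S202. Pre-train the prediction model with the sample data to obtain a pre-trained prediction model, where the prediction model includes a deep learning model.
In one embodiment, step S202 includes:
processing the sample data with a loss function to carry out the pre-training of the prediction model and obtain the pre-trained prediction model, where the loss function includes at least one of cross entropy and mean square error.
In specific applications, the classification task outputs 0 or 1: 0 means there is no binding activity between the protein and the small molecule, and 1 means there is binding activity between them.
The regression task outputs a continuous value, such as 4.2, 1.6, or 8.9, representing the strength of the binding activity between the protein and the small-molecule compound.
Binary cross entropy is a loss function used to optimize the model: minimizing it yields the optimal solution for the model's trained weight parameters W. Mean square error is likewise a loss function used to optimize the model.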
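For reference, the two loss functions can be written out directly. The sketch below uses the standard definitions of binary cross entropy and mean square error; the clipping constant `eps` is an added safeguard against `log(0)` and is an assumption, not something the text specifies.

```python
import math
from typing import Sequence


def binary_cross_entropy(y_true: Sequence[float], y_pred: Sequence[float], eps: float = 1e-12) -> float:
    """Mean of -[t*log(p) + (1-t)*log(1-p)] over all samples."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(y_true)


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels and predictions, for illustration only.
print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.7]))
print(mean_squared_error([4.2, 1.6, 8.9], [4.0, 2.0, 8.5]))
```

Binary cross entropy fits the 0/1 classification output, while mean square error fits the continuous binding-strength output of the regression task.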
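In this embodiment, the training optimizer is set to Adam. The learning rate, a hyperparameter of the neural network, is set to 0.0001, with beta1 set to 0.9 and beta2 set to 0.999.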
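A single Adam update with these hyperparameters can be sketched as below, following the standard Adam update rule with bias correction. The epsilon value of 1e-8 is the common default and is an assumption here, since the text does not give it; the function name `adam_step` and the toy objective are illustrative.

```python
import math
from typing import List


def adam_step(
    params: List[float],
    grads: List[float],
    m: List[float],
    v: List[float],
    t: int,
    lr: float = 1e-4,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,  # assumed default, not stated in the source
) -> None:
    """Apply one in-place Adam update; t is the 1-based step count."""
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1.0 - beta1) * g        # first-moment estimate
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m[i] / (1.0 - beta1 ** t)              # bias correction
        v_hat = v[i] / (1.0 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


# Toy objective f(w) = (w - 3)^2, with gradient 2 * (w - 3).
w, m, v = [0.0], [0.0], [0.0]
for step in range(1, 1001):
    adam_step(w, [2.0 * (w[0] - 3.0)], m, v, step)
print(w[0])
```

With a learning rate of 0.0001 each step moves a parameter by roughly that amount at most, so the toy parameter is still far from its optimum after 1000 steps; small steps of this kind trade training speed for stability.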
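PDBbind is a database containing tens of thousands of protein and small-molecule binding structures together with their binding activities, and it is used to build and test a variety of virtual screening methods. PDBbind can be used for side-by-side comparison of the performance of different virtual screening models.
In experiments, the prediction performance of the prediction model on the PDBbind database is as shown in FIG. 4.
On the training, validation, and test sets of the PDBbind dataset, the root mean square error (RMSE) values of the prediction model are 0.930, 1.388, and 1.372, respectively, and the corresponding correlation coefficients are 0.87, 0.69, and 0.70.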
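The two reported metrics can be computed as sketched below. The text says only "correlation coefficient", so the use of the Pearson coefficient here is an assumption, and the sample values are hypothetical and unrelated to the reported results.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error between measured and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient (assumed to be the metric reported)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical measured and predicted binding strengths, for illustration only.
measured = [4.2, 1.6, 8.9, 6.3]
predicted = [4.0, 2.1, 8.1, 6.8]
print(rmse(measured, predicted), pearson_r(measured, predicted))
```

A lower RMSE and a correlation closer to 1 both indicate predictions that track the measured binding activities more closely.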
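Specifically, Table 1 shows the prediction performance on the PDBbind database of traditional machine learning algorithms (the support vector machine method and the random forest algorithm), existing structure-based deep neural prediction network models, and the prediction model.
Figure PCTCN2020127249-appb-000001
Table 1
The DUD-E database is a benchmark dataset for evaluating virtual screening algorithms, and the prediction performance of the prediction model on the DUD-E database can reach 0.997.
Table 2 shows the prediction performance on the DUD-E database of traditional molecular docking methods such as Smina and AutoDock Vina, the support vector machine machine-learning algorithm, and the prediction model.
Figure PCTCN2020127249-appb-000002
Table 2
This embodiment uses a convolutional neural network model to extract features from the one-dimensional sequences of the protein and the small molecule and thereby obtains their binding activity. This avoids research errors that arise when the structure of a macromolecule such as a protein is not well determined, improves the efficiency of technology-driven drug development, effectively shortens development time, and ensures the stability of the research process and its results.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.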
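Corresponding to the method for predicting binding activity between molecules described in the above embodiments, FIG. 5 shows a structural block diagram of the apparatus for predicting binding activity between molecules provided by an embodiment of the present application. For ease of description, only the parts related to the embodiments of the present application are shown.
Referring to FIG. 5, the apparatus 200 for predicting binding activity between molecules includes:
a first obtaining module 101, configured to obtain an original matrix of a protein and an original matrix of a small molecule;
an extraction module 102, configured to extract a first feature vector corresponding to the original matrix of the protein and a second feature vector corresponding to the original matrix of the small molecule;
an interlocking module 103, configured to interlock the first feature vector and the second feature vector and perform computation on them to obtain a prediction result, output by a prediction model, of the binding activity between the protein and the small molecule.
In one embodiment, the apparatus for predicting binding activity between molecules further includes:
a second obtaining module, configured to obtain sample data;
a pre-training module, configured to pre-train the prediction model with the sample data to obtain a pre-trained prediction model, where the prediction model includes a deep learning model.
This embodiment uses a convolutional neural network model to extract features from the one-dimensional sequences of the protein and the small molecule and thereby obtains their binding activity. This avoids research errors that arise when the structure of a macromolecule such as a protein is not well determined, improves the efficiency of technology-driven drug development, effectively shortens development time, and ensures the stability of the research process and its results.
It should be noted that the information exchange and execution processes between the above apparatuses/units are based on the same concept as the method embodiments of the present application. For their specific functions and technical effects, refer to the method embodiments; details are not repeated here.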
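FIG. 6 shows a block diagram of part of the structure of the terminal device provided by an embodiment of the present application. Referring to FIG. 6, the terminal device includes components such as a radio frequency (RF) circuit 110, a memory 120, an input unit 130, a display unit 140, a sensor 150, an audio circuit 160, a wireless fidelity (WiFi) module 170, a processor 180, and a power supply 190. Those skilled in the art will understand that the terminal device structure shown in FIG. 6 does not limit the terminal device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The components of the terminal device are described in detail below with reference to FIG. 6:
The RF circuit 110 may be used to receive and send signals while information is being sent and received or during a call. In particular, it receives downlink information from a base station and passes it to the processor 180 for processing, and it sends uplink-related data to the base station. Generally, the RF circuit includes but is not limited to an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 110 can communicate with networks and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System of Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and so on.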
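The memory 120 may be used to store software programs and modules. The processor 180 executes the various functional applications and data processing of the terminal device by running the software programs and modules stored in the memory 120. The memory 120 may mainly include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or an image playback function); the data storage area may store data created through use of the terminal device (such as audio data or a phone book). In addition, the memory 120 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
The input unit 130 may be used to receive input numeric or character information and to generate key signal inputs related to the user settings and function control of the terminal device 100. Specifically, the input unit 130 may include a touch panel 131 and other input devices 132. The touch panel 131, also called a touch screen, can collect the user's touch operations on or near it (such as operations performed on or near the touch panel 131 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connected apparatus according to a preset program. Optionally, the touch panel 131 may include two parts: a touch detection apparatus and a touch controller. The touch detection apparatus detects the user's touch position, detects the signal produced by the touch operation, and passes the signal to the touch controller; the touch controller receives the touch information from the touch detection apparatus, converts it into contact coordinates, sends them to the processor 180, and can receive and execute commands sent by the processor 180. In addition, the touch panel 131 can be implemented in multiple types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 131, the input unit 130 may also include other input devices 132. Specifically, the other input devices 132 may include but are not limited to one or more of a physical keyboard, function keys (such as volume control keys and a power key), a trackball, a mouse, a joystick, and the like.
The display unit 140 may be used to display information entered by the user or provided to the user, as well as the various menus of the terminal device. The display unit 140 may include a display panel 141, which may optionally be configured as a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 131 may cover the display panel 141. When the touch panel 131 detects a touch operation on or near it, it passes the operation to the processor 180 to determine the type of touch event, and the processor 180 then provides the corresponding visual output on the display panel 141 according to the type of touch event. Although in FIG. 6 the touch panel 131 and the display panel 141 are two independent components that implement the input and output functions of the terminal device, in some embodiments the touch panel 131 and the display panel 141 may be integrated to implement the input and output functions of the terminal device.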
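The terminal device 100 may also include at least one sensor 150, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor. The ambient light sensor can adjust the brightness of the display panel 141 according to the ambient light, and the proximity sensor can turn off the display panel 141 and/or the backlight when the terminal device is moved to the ear. As one kind of motion sensor, an accelerometer can detect the magnitude of acceleration in each direction (usually three axes), can detect the magnitude and direction of gravity when stationary, and can be used in applications that recognize the posture of the terminal device (such as switching between landscape and portrait, related games, and magnetometer posture calibration) and in vibration-recognition functions (such as a pedometer or tap detection). Other sensors that the terminal device may also be equipped with, such as a gyroscope, barometer, hygrometer, thermometer, and infrared sensor, are not described further here.
The audio circuit 160, a speaker 161, and a microphone 162 can provide an audio interface between the user and the terminal device. The audio circuit 160 can convert received audio data into an electrical signal and transmit it to the speaker 161, which converts it into a sound signal for output. In the other direction, the microphone 162 converts a collected sound signal into an electrical signal, which the audio circuit 160 receives and converts into audio data; the audio data is then output to the processor 180 for processing and sent through the RF circuit 110 to, for example, another terminal device, or output to the memory 120 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 170, the terminal device can help the user send and receive e-mail, browse web pages, access streaming media, and so on, providing the user with wireless broadband Internet access. Although FIG. 6 shows the WiFi module 170, it can be understood that it is not a necessary component of the terminal device 100 and can be omitted as needed without changing the essence of the invention.
The processor 180 is the control center of the terminal device. It connects the various parts of the whole terminal device through various interfaces and lines, and it performs the various functions of the terminal device and processes data by running or executing the software programs and/or modules stored in the memory 120 and calling the data stored in the memory 120, thereby monitoring the terminal device as a whole. Optionally, the processor 180 may include one or more processing units. Preferably, the processor 180 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 180.
The terminal device 100 also includes a power supply 190 (such as a battery) that supplies power to the components. Preferably, the power supply may be logically connected to the processor 180 through a power management system, so that functions such as charging management, discharging management, and power consumption management are implemented through the power management system.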
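Although not shown, the terminal device 100 may also include a camera. Optionally, the camera may be front-facing or rear-facing on the terminal device 100, which is not limited in the embodiments of the present application.
Optionally, the terminal device 100 may include a single camera, dual cameras, or triple cameras, which is not limited in the embodiments of the present application.
For example, the terminal device 100 may include three cameras: a main camera, a wide-angle camera, and a telephoto camera.
Optionally, when the terminal device 100 includes multiple cameras, the cameras may all be front-facing, all be rear-facing, or be partly front-facing and partly rear-facing, which is not limited in the embodiments of the present application.
In addition, although not shown, the terminal device 100 may also include a Bluetooth module and the like, which are not described further here.
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the above division into functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed; that is, the internal structure of the apparatus can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for ease of distinguishing them from one another and are not used to limit the protection scope of the present application. For the specific working process of the units and modules in the above system, refer to the corresponding process in the foregoing method embodiments; details are not repeated here.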
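An embodiment of the present application also provides a terminal device, including at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, where the processor, when executing the computer program, implements the steps in any of the above method embodiments.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the steps in each of the above method embodiments.
An embodiment of the present application provides a computer program product which, when run on a mobile terminal, causes the mobile terminal to implement the steps in each of the above method embodiments.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present application can be completed by a computer program instructing the relevant hardware. The computer program can be stored in a computer-readable storage medium and, when executed by a processor, can implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or apparatus capable of carrying the computer program code to a photographing apparatus/terminal device, a recording medium, computer memory, read-only memory (ROM), random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium. Examples include a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In some jurisdictions, according to legislation and patent practice, computer-readable media may not be electrical carrier signals or telecommunication signals.
In the above embodiments, the description of each embodiment has its own emphasis. For parts that are not detailed or recorded in one embodiment, refer to the related descriptions of other embodiments.
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the apparatus/network device embodiments described above are only illustrative. The division into modules or units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and they should all fall within the protection scope of the present application.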

Claims (10)

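  1. A method for predicting binding activity between molecules, characterized by comprising:
    obtaining an original matrix of a protein and an original matrix of a small molecule;
    extracting a first feature vector corresponding to the original matrix of the protein and a second feature vector corresponding to the original matrix of the small molecule;
    interlocking the first feature vector and the second feature vector and performing computation on them to obtain a prediction result, output by a prediction model, of the binding activity between the protein and the small molecule.
  2. The method for predicting binding activity between molecules according to claim 1, characterized in that obtaining the original matrix of the protein and the original matrix of the small molecule comprises:
    obtaining the one-dimensional sequences of the protein and the small molecule;
    converting the one-dimensional sequence of the protein into its corresponding original matrix, and the one-dimensional sequence of the small molecule into its corresponding original matrix, by a preset conversion method, wherein the preset conversion method includes one-hot encoding.
  3. The method for predicting binding activity between molecules according to claim 1, characterized in that extracting the first feature vector corresponding to the original matrix of the protein and the second feature vector corresponding to the original matrix of the small molecule comprises:
    convolving the original matrix of the protein and the original matrix of the small molecule separately to obtain the first feature vector corresponding to the original matrix of the protein and the second feature vector corresponding to the original matrix of the small molecule.
  4. The method for predicting binding activity between molecules according to claim 1, characterized in that interlocking the first feature vector and the second feature vector and performing computation on them to obtain the prediction result, output by the prediction model, of the binding activity between the protein and the small molecule comprises:
    interlocking the first feature vector and the second feature vector;
    performing fully connected processing on the interlocked first feature vector and second feature vector to obtain the prediction result, output by the prediction model, of the binding activity between the protein and the small molecule.
  5. The method for predicting binding activity between molecules according to claim 4, characterized in that performing fully connected processing on the interlocked first feature vector and second feature vector to obtain the prediction result, output by the prediction model, of the binding activity between the protein and the small molecule comprises:
    performing classification processing on the interlocked first feature vector and second feature vector to obtain a prediction result, output by the prediction model, of whether there is binding activity between the protein and the small molecule;
    if the prediction result is that there is binding activity between the protein and the small molecule, performing regression processing on the interlocked first feature vector and second feature vector to obtain a prediction result, output by the prediction model, of the strength of the binding activity between the protein and the small molecule.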
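  6. The method for predicting binding activity between molecules according to claim 1, characterized in that, before obtaining the original matrix of the protein and the original matrix of the small molecule, the method further comprises:
    obtaining sample data;
    pre-training the prediction model with the sample data to obtain a pre-trained prediction model, wherein the prediction model includes a deep learning model.
  7. The method for predicting binding activity between molecules according to claim 6, characterized in that pre-training the prediction model with the sample data to obtain the pre-trained prediction model comprises:
    processing the sample data with a loss function to carry out the pre-training of the prediction model and obtain the pre-trained prediction model, wherein the loss function includes at least one of cross entropy and mean square error.
  8. An apparatus for predicting binding activity between molecules, characterized by comprising:
    a first obtaining module, configured to obtain an original matrix of a protein and an original matrix of a small molecule;
    an extraction module, configured to extract a first feature vector corresponding to the original matrix of the protein and a second feature vector corresponding to the original matrix of the small molecule;
    an interlocking module, configured to interlock the first feature vector and the second feature vector and perform computation on them to obtain a prediction result, output by a prediction model, of the binding activity between the protein and the small molecule.
  9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.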
PCT/CN2020/127249 2019-11-08 2020-11-06 Method and apparatus for predicting intermolecular binding activity WO2021089008A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911090145.4A CN110910964A (zh) 2019-11-08 2019-11-08 Method and apparatus for predicting intermolecular binding activity
CN201911090145.4 2019-11-08

Publications (1)

Publication Number Publication Date
WO2021089008A1 true WO2021089008A1 (zh) 2021-05-14

Family

ID=69817102

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/127249 WO2021089008A1 (zh) 2019-11-08 2020-11-06 Method and apparatus for predicting intermolecular binding activity

Country Status (2)

Country Link
CN (1) CN110910964A (zh)
WO (1) WO2021089008A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
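CN110910964A (zh) * 2019-11-08 2020-03-24 深圳先进技术研究院 Method and apparatus for predicting intermolecular binding activity
CN111627493A (zh) * 2020-05-29 2020-09-04 北京晶派科技有限公司 Selectivity prediction method for kinase inhibitors and computing device
CN112086145B (zh) * 2020-09-02 2024-04-16 腾讯科技(深圳)有限公司 Compound activity prediction method and apparatus, electronic device, and storage medium
CN112420124B (zh) * 2021-01-19 2021-04-13 腾讯科技(深圳)有限公司 Data processing method and apparatus, computer device, and storage medium
CN112786120B (zh) * 2021-01-26 2022-07-05 云南大学 Method for neural-network-assisted synthesis of chemical materials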

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160196480A1 (en) * 2014-05-05 2016-07-07 Atomwise Inc. Systems and methods for applying a convolutional network to spatial data
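CN109887541A (zh) * 2019-02-15 2019-06-14 张海平 Method and system for predicting binding between target proteins and small molecules
CN110444250A (zh) * 2019-03-26 2019-11-12 广东省微生物研究所(广东省微生物分析检测中心) High-throughput virtual drug screening system based on molecular fingerprints and deep learning
CN110910964A (zh) * 2019-11-08 2020-03-24 深圳先进技术研究院 Method and apparatus for predicting intermolecular binding activity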

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6671348B2 (ja) * 2014-05-05 2020-03-25 アトムワイズ,インコーポレイテッド Binding affinity prediction system and method
CN107742061B (zh) * 2017-09-19 2021-06-01 中山大学 Protein interaction prediction method, system, and apparatus

Also Published As

Publication number Publication date
CN110910964A (zh) 2020-03-24

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20885121

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20885121

Country of ref document: EP

Kind code of ref document: A1