WO2021031549A1 - Method for establishing molecular structure and activity database - Google Patents

Method for establishing molecular structure and activity database


Publication number
WO2021031549A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
database
activity
module
molecular
Prior art date
Application number
PCT/CN2020/077657
Other languages
French (fr)
Chinese (zh)
Inventor
牛春意
方磊
徐旻
温晓明
齐珍珍
张佩宇
马健
温书豪
赖力鹏
Original Assignee
深圳晶泰科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳晶泰科技有限公司
Priority to PCT/CN2020/077657
Publication of WO2021031549A1


Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction


Abstract

A method for establishing a molecular structure and activity database, comprising: searching a compound database to obtain all compounds related to selected targets, recording relevant information of the compounds, and converting external data into a standardized format according to requirements; checking the data to ensure the accuracy of the data; uploading the checked data to a MongoDB database via a stored temporary file; a user sending a retrieval request to a data retrieval module via an SDK, selecting a specific target according to requirements of the user, and extracting all data including the target; calling a structure-activity analysis module in Jupyter, and according to a core structure inputted by a user and a similarity requirement, performing sub-structure matching and similarity comparison calculations on the structure and structures in the database. The method is suitable for computer-aided drug design and drug screening such as virtual screening, and implements semi-automated data collection and data cleaning to generate a standardized database.

Description

Method for establishing a molecular structure and activity database
Technical field
The invention belongs to the technical field of data processing and specifically relates to a method for establishing a molecular structure and activity database. It is mainly applied in new drug research and development, and provides data support for computer-aided drug design and virtual screening.
Background art
Drug screening is the initial stage and a key step of drug discovery, and it occupies an important position in the discovery of new drugs. However, traditional screening experiments are often time-consuming and costly. With the development of computer technology, virtual screening has therefore gradually emerged. Developing and optimizing virtual screening methods and applying them to real scenarios requires a large amount of high-quality data, including diverse compound structures and unified, accurate activity data. The commonly used databases containing such data are mainly the public molecular database ChEMBL and paid databases. At the same time, structure-activity analysis across different compounds for the same target plays an important role in drug design. At present, however, a large number of patents and literature reports often provide compound structure and activity data for the same target. Analyzing and organizing these data is laborious, and the market lacks suitable analysis software that can analyze and interpret them quickly.
Existing databases often have the following drawbacks:
(1) Public databases are not updated in a timely manner. New drug development changes continuously, so a data delay of one or two years may miss very important information, which often affects the accuracy of calculations.
(2) Paid databases are updated more promptly than public ones, but their data often carry too many parameters to be used directly and require further cleaning.
(3) Data collected from different sources often differ in format, so merging them requires extensive data cleaning and organization, which costs a great deal of time and labor.
(4) A single database offers no way to verify its data, so accuracy is difficult to ensure.
(5) Existing databases lack structure-activity relationship analysis of different drug molecules for the same target, which hinders later use of such data.
Summary of the invention
In view of the above technical problems, the present invention provides a method for establishing a molecular structure and activity database, applied to data collection and cleaning in the drug design process of new drug development. The method mainly includes collecting data from existing databases to build the data source to be used, and then extracting the useful data from the data source to be cleaned by means of tool scripts. On the basis of the established database, the data for a single target are extracted, and a simple structure-activity analysis is performed by calling Jupyter scripts with user input, providing analysis ideas for subsequent drug design work.
The technical solution adopted is as follows:
The method for establishing a molecular structure and activity database includes the following steps:
(1) Data collection
Compound databases are searched to obtain all compounds related to the selected target, and the relevant information for each compound is recorded. Data are collected in two ways, automatic collection and active upload, and the collected data are uploaded to a temporary file.
(1.1) Automatic collection mainly draws on the open-source database ChEMBL. The UniProt ID of the selected target is determined first; the ID pins down an accurate and unique target. Python web crawler technology is then used to collect the data automatically and generate the raw data.
(1.2) Active upload is mainly intended for paid databases. Such databases cannot be accessed with Python web crawler technology, so the data must be downloaded manually and then uploaded from the local machine.
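The two collection routes above both end in the same temporary file. The following is a minimal sketch of that staging step, assuming each collected record is a dictionary and that the temporary file is a JSON-lines file. The function names, field names, and file name are hypothetical and are not taken from the patent; the crawler and the manual download themselves are outside the sketch.

```python
import json
from pathlib import Path


def stage_records(records, source, staging_dir):
    """Append raw records to a temporary JSON-lines file, tagging each with its source.

    `source` is a free-form label such as "auto" (crawled) or "manual" (uploaded).
    The file name "raw_records.jsonl" is an assumption made for this sketch.
    """
    path = Path(staging_dir) / "raw_records.jsonl"
    with path.open("a", encoding="utf-8") as handle:
        for record in records:
            # The source tag is written last so it cannot be overwritten by the record.
            handle.write(json.dumps({**record, "source": source}) + "\n")
    return path


def load_staged(path):
    """Read every staged record back as a list of dictionaries."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Keeping the source label on every record is what later allows a different interpreter to be chosen per source during cleaning.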
(2) Data cleaning
Whether data are collected automatically or uploaded actively, different sources produce different parameters. In addition, not all collected data are needed, and the data may contain errors. The data are therefore cleaned to obtain unified, standardized data. The data cleaning module converts external data into a standardized format as required. The main cleaning rules are:
A. Different data cleaning modules are called according to the raw data obtained from different databases. The data cleaning module calls the corresponding interpreter according to the data content and tag type.
B. The interpreters include a molecular structure data interpreter, a molecular experimental activity data interpreter, and others.
C. A filtering module is called through Jupyter to filter out molecules that do not meet the criteria. The screening criteria mainly include the activity test method (enzyme activity or cell activity), the way the activity is expressed (whether it is an exact value), and the data source.
D. The interpreter matches the data item by item against the specified standardized format. If the match succeeds, the data are stored in the corresponding in-memory data structure.
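Rules A to D can be illustrated with a small dispatch table. This is a sketch under stated assumptions, not the patent's implementation: the two raw layouts, every field name, and the unit conversion are hypothetical, chosen only to show one interpreter per source feeding a common standardized record that is then filtered by assay type and by whether the value is exact.

```python
ALLOWED_ASSAYS = {"enzyme", "cell"}  # assumed labels for enzyme and cell activity


def interpret_public(raw):
    """Hypothetical layout of a row collected from a public database."""
    return {
        "molecule_id": raw["id"],
        "smiles": raw["smiles"],
        "assay_type": raw["assay"].lower(),
        "relation": raw["relation"],
        "value_nM": float(raw["value"]),
        "source": "public",
    }


def interpret_paid(raw):
    """Hypothetical paid-database export that reports IC50 in micromolar."""
    return {
        "molecule_id": raw["compound"],
        "smiles": raw["structure"],
        "assay_type": raw["test"].lower(),
        "relation": raw.get("qualifier", "="),
        "value_nM": float(raw["ic50_uM"]) * 1000.0,
        "source": "paid",
    }


INTERPRETERS = {"public": interpret_public, "paid": interpret_paid}


def passes_filter(record):
    """Keep only known assay types whose activity is an exact value."""
    return record["assay_type"] in ALLOWED_ASSAYS and record["relation"] == "="


def clean(tagged_rows):
    """Convert (source, raw_row) pairs into standardized records, dropping bad rows."""
    cleaned = []
    for source, raw in tagged_rows:
        interpreter = INTERPRETERS.get(source)
        if interpreter is None:
            continue  # unknown source: no interpreter to call
        try:
            record = interpreter(raw)
        except (KeyError, ValueError):
            continue  # the row does not match the expected format
        if passes_filter(record):
            cleaned.append(record)
    return cleaned
```

Rows that fail to match the expected format are skipped rather than repaired, which mirrors rule D: only successfully matched data reach the in-memory structure.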
(3) Data verification
Because most data in existing databases are themselves obtained by extracting information from the literature through image or keyword recognition, errors may also arise during data generation and storage. Data from different databases are therefore cross-checked to ensure accuracy.
(3.1) After data cleaning, the data verification module is called, and the data to be verified are passed from the cleaning module to the verification module.
(3.2) In the verification module, the data are checked record by record. The data type is identified first, and the verification rules for that type are loaded. When the same molecule has several records for the same activity test type, the records are averaged if the difference between them does not exceed the specified range. If the difference exceeds the specified range, a prompt is issued and the source literature is downloaded and output for manual inspection.
(3.3) The data to be verified are matched one by one against the verification rules. After verification, the data that pass are persisted by the module to the temporary file system.
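The rule in (3.2) can be sketched as follows. The patent does not state how the "specified range" is defined, so this sketch assumes an absolute difference between the largest and smallest value, passed as the hypothetical parameter `max_difference`; the record fields are likewise assumptions. Downloading the source literature for flagged entries is left out.

```python
from collections import defaultdict
from statistics import mean


def verify(records, max_difference):
    """Average repeated measurements that agree; flag the ones that do not.

    Records are grouped by (molecule_id, assay_type). A group is accepted when
    max(values) - min(values) <= max_difference, otherwise it is flagged for
    manual inspection.
    """
    groups = defaultdict(list)
    for record in records:
        key = (record["molecule_id"], record["assay_type"])
        groups[key].append(record["value_nM"])

    accepted, flagged = {}, {}
    for key, values in groups.items():
        if max(values) - min(values) <= max_difference:
            accepted[key] = mean(values)
        else:
            flagged[key] = sorted(values)
    return accepted, flagged
```

Activity values often span orders of magnitude, so a ratio or a difference in log units may be a more practical threshold than an absolute difference; that choice is outside what the text specifies.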
(4) Data retrieval
The temporary files that passed verification are uploaded to a MongoDB database for later use. Users can send a retrieval request to the data retrieval module through the SDK. The request includes the data table, molecular structure, fields, and query conditions to be queried. The data retrieval module converts the request into a query statement the database can recognize and accesses the database to obtain the result. The result is returned to the data retrieval module and then passed to the user's SDK, which completes the retrieval.
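The conversion of a request into a database query can be sketched as below. The request layout (keys `table`, `molecular_structure`, `fields`, `conditions`) and the stored field name `smiles` are assumptions made for illustration; only the MongoDB comparison operators (`$lt`, `$lte`, `$gt`, `$gte`, `$eq`) and the `{field: 1}` projection form are standard MongoDB query syntax.

```python
OPERATORS = {"<": "$lt", "<=": "$lte", ">": "$gt", ">=": "$gte", "=": "$eq"}


def build_query(request):
    """Translate an SDK-style request into (collection_name, filter, projection)."""
    filter_doc = {}
    for field, op, value in request.get("conditions", []):
        if op not in OPERATORS:
            raise ValueError(f"unsupported operator: {op}")
        filter_doc.setdefault(field, {})[OPERATORS[op]] = value

    # Exact structure match on the stored SMILES string (assumed field name).
    if "molecular_structure" in request:
        filter_doc["smiles"] = request["molecular_structure"]

    projection = {name: 1 for name in request.get("fields", [])}
    return request["table"], filter_doc, projection
```

With a driver such as pymongo, the returned filter and projection could be passed to `collection.find(filter, projection)` on the named collection. An exact string match on SMILES is only a placeholder for the fingerprint-based structure lookup the method describes.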
(5) Structure-activity analysis
Using the retrieval method above, a specific target can be selected according to the user's needs and all data for that target extracted. The structure-activity analysis module in Jupyter is then called. Based on the core (scaffold) structure and the similarity requirement entered by the user, substructure matching and similarity calculations are performed between that structure and the structures in the database.
(5.1) Substructure matching is performed on the molecules in the database: the substructure matching module in RDKit is called to find all molecules that contain the structure as a substructure.
(5.2) The matched molecular structures are converted into molecular fingerprints, and their Tanimoto similarity is calculated and compared with the user's requirement.
(5.3) Among the compounds that meet the matching requirement, the side-chain replacement module and the substituent conversion module of the RDKit cheminformatics toolkit are used to cleave, convert, and classify the substituent groups and substitution sites. Finally, a SAR list is produced so that the user can compare and analyze structure and activity.
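The similarity step in (5.2) rests on the Tanimoto coefficient, T(A, B) = |A ∩ B| / |A ∪ B|, where A and B are the sets of "on" bits of two fingerprints. The sketch below computes it in plain Python on sets of integers, as a stand-in for fingerprints produced by a toolkit such as RDKit. The function names and the `library` layout are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    # Two empty fingerprints share nothing measurable; report 0.0 by convention here.
    return len(a & b) / union if union else 0.0


def select_similar(query_bits, library, threshold):
    """Return (molecule_id, similarity) pairs at or above the threshold, best first.

    `library` maps a molecule ID to its on-bit indices.
    """
    hits = []
    for molecule_id, bits in library.items():
        score = tanimoto(query_bits, bits)
        if score >= threshold:
            hits.append((molecule_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

In practice the threshold would come from the similarity requirement the user enters alongside the core structure.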
The method for establishing a molecular structure and activity database provided by the present invention has the following technical effects:
The present invention provides a complete, standardized method for establishing an activity database of small molecule inhibitors. It is suitable for drug screening fields such as computer-aided drug design and virtual screening. It achieves semi-automatic data collection and data cleaning to generate a standardized database, and it can quickly produce SAR summaries for a large number of molecules against the same target, which accelerates the entire drug discovery process. It has the following technical advantages:
(1) It combines active and automatic data collection. Compared with existing databases, it covers more literature and a larger volume of data, and can provide more data resources.
(2) It automatically integrates and cross-verifies information from multiple databases and adds a further manual proofreading step, so the data are more accurate than those in existing databases.
(3) It is the first to add a structure-activity relationship analysis module for compounds to the database, which reduces the time users spend analyzing large amounts of data.
Description of the drawings
Figure 1 is a flowchart of the present invention.
Detailed description
The technical solution of the present invention is described in further detail below with reference to the drawing and the examples.
Example 1
This example illustrates the method by building an activity database of small molecule inhibitors of isocitrate dehydrogenase 1 (IDH1). IDH1 oxidizes isocitrate to oxalosuccinate, which is then converted into α-ketoglutarate; IDH1 thereby participates in the tricarboxylic acid cycle and regulates energy metabolism in the body. Studies have shown that IDH1 mutations are closely associated with glioma, paraganglioma, and acute myeloid leukemia. Developing small molecule inhibitors against IDH1 is therefore essential for treating these cancers. Following the process shown in Figure 1, the method of establishing the database mainly includes the following steps:
Step S01: The UniProt ID of IDH1 is determined to be O75874. The raw data on existing molecular structures and activities are collected from the open-source database ChEMBL using Python web crawler technology. A total of 35948 molecules and 37932 activity data records are obtained.
Step S02: The data cleaning module is called to clean and classify the raw data, finally yielding 31267 molecules. The cleaning process includes:
(2.1) The molecular fingerprint string is obtained through the molecular structure interpreter. The main steps are:
a. Read the molecular structure M (usually represented as SMILES) and convert it into a 3D mol structure using the Chem.MolFromSmiles() module in RDKit.
b. Calculate the Morgan-type molecular fingerprint of the mol structure using GetMorganFingerprint() in RDKit. This yields the molecular fingerprint string and the corresponding molecular ID.
(2.2) Using the molecular activity data interpreter, the data are classified by test method (in this example, mainly the IC50 for in vitro inhibition of IDH1 enzyme activity and the IC50 for growth inhibition of IDH1-mutant cell lines).
(2.3) The cleaning module is called to clean the classified and stored data. This mainly involves removing data that do not meet the criteria and removing duplicate data.
Step S03: The data verification module is called to verify the data further. The molecular IDs and the corresponding molecular fingerprint strings are extracted and cross-checked between the different databases. A data error threshold is specified. Data that exceed the threshold are reported and then confirmed by manually checking the original text; data within the threshold are averaged to obtain the final data.
Step S04: The cleaned and verified data are stored from the temporary file into the database. The storage method is as follows:
The fingerprint strings of the molecular structures are compared to obtain the similarity of the molecular fingerprints. Molecules with high similarity are placed in the same result set, and their corresponding activity data are stored in a subset of that result set, and so on. The result sets are then uploaded to the MongoDB database.
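The text does not say how "high similarity" is decided or how result sets are formed, so the following is only one possible reading: a greedy grouping in which each molecule joins the first result set whose first member (the "leader") is at least `threshold` similar by Tanimoto coefficient, and otherwise starts a new set. The threshold, the document layout, and all names are assumptions made for this sketch.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient on collections of fingerprint on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def group_by_similarity(fingerprints, activities, threshold=0.7):
    """Group molecules into result sets; nest each molecule's activity data inside it.

    `fingerprints` maps molecule ID -> on-bit indices.
    `activities` maps molecule ID -> list of activity records.
    Returns a list of plain dictionaries that can be serialized as documents.
    """
    result_sets = []
    for molecule_id, bits in fingerprints.items():
        entry = {"molecule_id": molecule_id,
                 "activities": activities.get(molecule_id, [])}
        for result_set in result_sets:
            if tanimoto(bits, result_set["leader_bits"]) >= threshold:
                result_set["molecules"].append(entry)
                break
        else:
            # No existing set is similar enough: this molecule leads a new one.
            result_sets.append({"leader_bits": sorted(bits), "molecules": [entry]})
    return result_sets
```

Because the grouping depends on the order in which molecules arrive, a production system would more likely use an established clustering method; the nested layout is the point being illustrated.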
Step S05: Data search. The retrieval request uploaded by the user through the SDK is converted into a query language the database can recognize. The database is then searched to obtain the required result, which is returned to the user.
To identify a molecular structure, the molecular structure interpreter converts the structure into a molecular fingerprint string, and the atom types and bond connectivity are then compared in turn. This yields the result set containing the molecular structure and its unique molecular ID. According to the user's selection, the enzyme activity, cell activity, and other properties associated with that ID are retrieved, and the corresponding results are output.
Step S06: Structure-activity analysis of the compounds. Structure-activity analysis is performed selectively according to the user's needs. Batch structure-activity analysis of the compounds is carried out by calling the structure-activity analysis module written in Jupyter.
(6.1) The user enters a structure of interest. For example, for the SMILES O=C1CCCN1 (1), substructure matching selects all compounds containing this structure and finds 398 molecules in total; for the SMILES C1=CC=NN1 (2), substructure matching finds 2323 molecules in total.
Figure PCTCN2020077657-appb-000001
(6.2) For the compounds containing this substructure, RDKit commands such as Chem.ReplaceCore() and Chem.GetMolFrags() are used to cut off the substituents, convert them, and classify the substitution sites.
(6.3) Each compound is annotated with its substitution site, substitution type, structure, activity, and other data, and the SAR analysis list is generated.
(6.4) Once the preliminary list has been reviewed, the core structure can be refined further by repeating the above process, which yields a further refined SAR analysis list.
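The substructure selection in step (6.1) can be sketched as follows. This is a minimal illustration rather than the applicant's module: the helper name `select_by_substructure` and the four-compound library are hypothetical, and only the core SMILES O=C1CCCN1 comes from the text. It relies on the RDKit calls `Chem.MolFromSmiles()` and `Mol.HasSubstructMatch()`.

```python
from rdkit import Chem


def select_by_substructure(smiles_list, core_smiles):
    """Return the SMILES strings whose molecules contain the core (hypothetical helper)."""
    core = Chem.MolFromSmiles(core_smiles)
    if core is None:
        raise ValueError(f"cannot parse core structure: {core_smiles}")
    hits = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        # Skip entries RDKit cannot parse instead of failing the whole batch.
        if mol is not None and mol.HasSubstructMatch(core):
            hits.append(smi)
    return hits


# Hypothetical library; "O=C1CCCN1" is core structure (1) from step (6.1).
library = ["O=C1CCCN1C", "O=C1CCCN1c1ccccc1", "c1ccccc1", "CCO"]
hits = select_by_substructure(library, "O=C1CCCN1")
print(len(hits), "molecules contain the core")
```

Counting the hits, as in the last line, is the step that would produce figures such as the 398 and 2323 molecules reported above when run against the full database.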
Example 2
This example takes the establishment of an activity database for small-molecule inhibitors of poly(ADP-ribose) polymerase 1 (PARP1) as an example. PARP1 is a nuclear enzyme found in eukaryotic cells that catalyzes poly(ADP-ribosyl)ation, one of the important post-translational modifications of proteins. PARP1 accounts for more than 80% of intracellular PARP activity, is widely present in organisms, and plays an important role in physiological processes such as DNA damage repair, gene transcription and expression, and apoptosis. PARP inhibitors block DNA replication mainly through a synthetic lethality mechanism and are currently used mainly in BRCA1/2-mutated tumors and platinum-sensitive recurrent tumors. Following the workflow shown in Figure 1, the database is established through the following main steps:
Step S01: the UniProt ID of PARP1 is determined to be P09874. Raw data on existing molecular structures and activities are collected from the open-source database ChEMBL using Python web crawling, giving 3331 molecules and 4439 activity records. A further 6784 molecules with 10283 activity records are obtained from paid databases.
Step S02: the data cleaning module is called to clean and classify the raw data, finally yielding 4324 molecules. The cleaning process includes:
(2.1) The molecular structure interpreter produces the molecular fingerprint string. The main steps are:
a. Read the molecular structure M (usually given as a SMILES string) and convert it into a mol structure using Chem.MolFromSmiles() in RDKit.
b. Compute the Morgan-type molecular fingerprint of the mol structure using GetMorganFingerprint() in RDKit. This gives the molecular fingerprint string and the corresponding molecular ID.
(2.2) The molecular activity data interpreter classifies the data by assay method (in this example, mainly the IC50 for inhibition of PARP1 enzyme activity in vitro and the IC50 for growth inhibition of BRCA1/2-mutated tumor cell lines).
(2.3) The cleaning module is called to clean the classified and stored data, mainly by removing data that do not meet the standards and duplicate data.
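Step (2.1) can be illustrated with a short sketch. The text names GetMorganFingerprint(); the sketch below instead uses the bit-vector variant `AllChem.GetMorganFingerprintAsBitVect()`, because it yields a fixed-length object that converts directly to a string. That substitution, the helper name `fingerprint_string`, and the radius and length values are assumptions, not details from the application. Recent RDKit releases may emit a deprecation notice for this call and also offer a generator-based fingerprint API.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def fingerprint_string(smiles, radius=2, n_bits=2048):
    """Convert a SMILES string to a Morgan fingerprint bit string (hypothetical helper).

    radius and n_bits are illustrative defaults, not values from the source.
    Returns None when RDKit cannot parse the input.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return fp.ToBitString()


# Two SMILES spellings of the same molecule give the same fingerprint string,
# which is what makes the string usable for duplicate detection.
fs_a = fingerprint_string("O=C1CCCN1")
fs_b = fingerprint_string("C1CCNC1=O")
print(fs_a == fs_b)
```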
Step S03: the data verification module is called to verify the data further. The molecular IDs and corresponding molecular fingerprint strings are extracted, and data from different database sources are cross-checked against a specified error threshold. Data exceeding the threshold are reported and then confirmed by manually checking the original literature; for data within the threshold, the final value is obtained by averaging.
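The threshold rule of step S03 might look like the sketch below. The record layout, the function name `verify`, the `tolerance` parameter, and all numeric values are hypothetical; the text only states that values within a specified range are averaged and that values outside it are reported for manual checking.

```python
from statistics import mean


def verify(records, tolerance):
    """Cross-check activity values reported for the same molecule and assay.

    records: iterable of (molecule_id, assay_type, value) tuples (assumed layout).
    tolerance: largest accepted spread between values, in the same unit as value.
    Returns (accepted, flagged): averaged values, and raw values needing manual review.
    """
    grouped = {}
    for mol_id, assay, value in records:
        grouped.setdefault((mol_id, assay), []).append(value)

    accepted, flagged = {}, {}
    for key, values in grouped.items():
        if max(values) - min(values) <= tolerance:
            accepted[key] = mean(values)  # within the range: take the average
        else:
            flagged[key] = values         # outside the range: report for review
    return accepted, flagged


# Hypothetical IC50 values in nM gathered from two sources.
records = [
    ("M1", "enzyme_IC50", 10.0),
    ("M2", "enzyme_IC50", 5.0),
    ("M1", "enzyme_IC50", 14.0),
    ("M2", "enzyme_IC50", 500.0),
    ("M3", "cell_IC50", 50.0),
]
accepted, flagged = verify(records, tolerance=20.0)
```

An absolute difference is used here because the text speaks of a difference between values; for IC50 data a fold-change criterion would be an equally plausible reading.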
Step S04: the cleaned and verified data are stored from the temporary file into the database. The storage method is as follows:
The fingerprint strings of the molecular structures are compared to obtain the fingerprint similarity. Molecules with high similarity are placed in the same result set, their corresponding activity data are stored in subsets of that set, and so on; the result is then uploaded to the MongoDB database.
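One way to picture this storage layout is the sketch below. It assumes Tanimoto similarity on the bit strings and a simple first-match grouping against each set's first member; neither the grouping algorithm nor the 0.7 threshold is specified in the text, and the document layout and field names are hypothetical. The resulting list of dictionaries is the kind of structure a MongoDB driver could insert as documents.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two equal-length fingerprint bit strings."""
    on_a = {i for i, bit in enumerate(fp_a) if bit == "1"}
    on_b = {i for i, bit in enumerate(fp_b) if bit == "1"}
    union = on_a | on_b
    return len(on_a & on_b) / len(union) if union else 0.0


def group_by_similarity(entries, threshold=0.7):
    """Place each molecule in the first result set whose representative is similar enough.

    entries: list of dicts with 'molecule_id', 'fingerprint', 'activities' (assumed layout).
    """
    result_sets = []
    for entry in entries:
        member = {"molecule_id": entry["molecule_id"], "activities": entry["activities"]}
        for result_set in result_sets:
            if tanimoto(entry["fingerprint"], result_set["representative"]) >= threshold:
                result_set["molecules"].append(member)
                break
        else:
            # No similar set exists yet, so this molecule starts a new one.
            result_sets.append({"representative": entry["fingerprint"], "molecules": [member]})
    return result_sets


# Toy 8-bit fingerprints and hypothetical activity values.
entries = [
    {"molecule_id": "M1", "fingerprint": "11110000", "activities": {"enzyme_IC50_nM": 12.0}},
    {"molecule_id": "M2", "fingerprint": "11100000", "activities": {"enzyme_IC50_nM": 30.0}},
    {"molecule_id": "M3", "fingerprint": "00001111", "activities": {"cell_IC50_nM": 50.0}},
]
result_sets = group_by_similarity(entries)
```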
Step S05, data search: the search request submitted by the user through the SDK is converted into a query language the database can recognize. The database is then searched, and the results are returned to the user.
To identify a molecular structure, the molecular structure interpreter converts the structure into a molecular fingerprint string, and the atom types and bond connectivity are then compared on that basis. This yields the result set containing the structure and its unique molecular ID. According to the user's selection, the enzyme activity, cell activity, and other properties associated with that ID are then retrieved, and the corresponding results are output.
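The translation of a user request into a database query could be sketched as follows. The request format, the function name `build_query`, and every field name are hypothetical, since the application does not describe the SDK's wire format; the only fixed points are MongoDB's comparison operators (`$lt`, `$gte`, and so on) and the convention of passing a filter and a projection to a find operation.

```python
# Map of request operators to MongoDB comparison operators.
OPERATORS = {"lt": "$lt", "lte": "$lte", "gt": "$gt", "gte": "$gte", "eq": "$eq", "ne": "$ne"}


def build_query(request):
    """Translate a request dict into (collection name, filter, projection).

    The request keys 'table', 'molecule_id', 'fields', 'conditions' are assumptions.
    """
    query_filter = {}
    if "molecule_id" in request:
        # The molecule ID is assumed to have been resolved from the fingerprint already.
        query_filter["molecule_id"] = request["molecule_id"]
    for field, (op, value) in request.get("conditions", {}).items():
        if op not in OPERATORS:
            raise ValueError(f"unsupported operator: {op}")
        query_filter[field] = {OPERATORS[op]: value}
    projection = {field: 1 for field in request.get("fields", [])}
    return request["table"], query_filter, projection


# Hypothetical request: enzyme IC50 below 100 nM for one molecule.
request = {
    "table": "parp1",
    "molecule_id": "M42",
    "fields": ["smiles", "enzyme_ic50_nM"],
    "conditions": {"enzyme_ic50_nM": ("lt", 100)},
}
collection_name, query_filter, projection = build_query(request)
```

Keeping the translation in a pure function like this makes it testable without a running database; the returned filter and projection would then be handed to the driver's find call.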
Step S06, structure-activity relationship (SAR) analysis of the compounds. This analysis is optional and is performed according to the user's needs. The SAR analysis module written in Jupyter is called to analyze the compounds in batches.
(6.1) The user enters a structure of interest. For example, for structure (1), given by the SMILES O=C1NN=CC2=C1CCCC2, substructure matching selects all compounds containing it and finds 623 such molecules; for structure (2), given by the SMILES C12=C[N]N=C1C=CC=C2, substructure matching finds 482 such molecules.
Figure PCTCN2020077657-appb-000002
(6.2) For the compounds containing this substructure, RDKit commands such as Chem.ReplaceCore() and Chem.GetMolFrags() are used to cut off the substituents, convert them, and classify the substitution sites.
(6.3) Each compound is annotated with its substitution site, substitution type, structure, activity, and other data, and the SAR analysis list is generated.
(6.4) Once the preliminary list has been reviewed, the core structure can be refined further by repeating the above process, which yields a further refined SAR analysis list.
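The substituent cutting in step (6.2) can be sketched with the two RDKit calls named in the text. The helper name `side_chains` and the methyl-substituted test compound are hypothetical; the core SMILES is structure (1) of this example. `Chem.ReplaceCore()` removes the core and marks each attachment point with a dummy atom, and `Chem.GetMolFrags(..., asMols=True)` splits what remains into separate substituents.

```python
from rdkit import Chem


def side_chains(smiles, core_smiles):
    """Return the substituents left after removing the core, as SMILES (hypothetical helper).

    Each substituent carries a dummy atom ('*') marking where it was attached.
    Returns an empty list when the molecule does not contain the core.
    """
    mol = Chem.MolFromSmiles(smiles)
    core = Chem.MolFromSmiles(core_smiles)
    if mol is None or core is None or not mol.HasSubstructMatch(core):
        return []
    stripped = Chem.ReplaceCore(mol, core)
    if stripped is None:
        return []
    fragments = Chem.GetMolFrags(stripped, asMols=True)
    return sorted(Chem.MolToSmiles(frag) for frag in fragments)


core = "O=C1NN=CC2=C1CCCC2"               # core structure (1) from step (6.1)
compound = "O=C1NN=C(C)C2=C1CCCC2"        # hypothetical methyl-substituted analogue
chains = side_chains(compound, core)
print(chains)
```

Pairing each returned substituent with the compound's activity values gives one row of the SAR list described in step (6.3).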

Claims (5)

  1. A method for establishing a molecular structure and activity database, characterized in that it comprises the following steps:
    (1) Data collection
    Compound databases are searched to obtain all compounds related to the selected target, the relevant information for each compound is recorded, and the collected data are uploaded to a temporary file;
    (2) Data cleaning
    The data cleaning module converts external data into a standardized format as required;
    (3) Data verification
    The accuracy of the data is ensured by cross-checking data from different databases;
    (4) Data retrieval
    The verified data stored in the temporary file are uploaded to the MongoDB database for subsequent use;
    The user sends a retrieval request to the data retrieval module through the SDK, the request including the data table, molecular structure, fields, and query conditions to be queried;
    The data retrieval module converts the request into a recognizable statement and accesses the database to obtain the result;
    The result is returned to the data retrieval module and then passed to the user's SDK, completing the retrieval;
    (5) Structure-activity analysis
    According to the user's needs, a specific target is selected through the above data retrieval and all data for that target are extracted; the structure-activity analysis module in Jupyter is then called and, according to the core structure and similarity requirement entered by the user, substructure matching and similarity calculation are performed between that structure and the structures in the database.
  2. The method for establishing a molecular structure and activity database according to claim 1, characterized in that, in step (1), data are collected mainly in two ways, automatic collection and active upload:
    (1.1) Automatic collection is mainly from the open-source database ChEMBL: the UniProt ID of the selected target is first determined, which identifies the target accurately and uniquely, and Python web crawling is then used to collect the data automatically and generate the raw data;
    (1.2) Active upload is mainly for paid databases, which cannot be accessed with Python web crawling; the data are downloaded manually and then uploaded from the local machine.
  3. The method for establishing a molecular structure and activity database according to claim 1, characterized in that the main cleaning criteria in step (2) are:
    A. Different data cleaning modules are called according to the raw data obtained from different databases; the data cleaning module calls the corresponding interpreter according to the data content and label type;
    B. The interpreters include a molecular structure data interpreter and a molecular experimental activity data interpreter;
    C. The screening module is called through Jupyter to filter out molecules that do not meet the standards; the screening criteria mainly include the activity assay method, the way the activity is expressed, and the data source;
    D. The interpreter matches the data item by item against the specified standardized format; when a match succeeds, the data are stored in the corresponding in-memory data structure.
  4. The method for establishing a molecular structure and activity database according to claim 1, characterized in that the data verification of step (3) mainly includes the following steps:
    (3.1) After data cleaning, the data verification module is called, and the data to be verified are passed from the cleaning module to the data verification module;
    (3.2) In the verification module, the data are verified record by record; the data type is determined first, and the verification rules for that type are read; where the same molecule has multiple records of the same activity assay type, the average is taken if the difference between the values does not exceed the specified range; if the difference exceeds the specified range, a prompt is output and the source literature for the data is downloaded and output for manual inspection;
    (3.3) The data to be verified are matched one by one against the verification rules; once verification is complete, the data that pass are persisted by the module to the temporary file system.
  5. The method for establishing a molecular structure and activity database according to claim 1, characterized in that step (5) mainly includes the following steps:
    (5.1) Substructure matching is performed on the molecules in the database by calling the substructure matching module in rdkit, which matches all molecules containing the substructure;
    (5.2) The matched molecular structures are converted into molecular fingerprints, and their Tanimoto similarity is then calculated and compared with the user's requirement;
    (5.3) For the compounds that meet the matching requirement, the side-chain replacement module and the substituent conversion module of the rdkit chemistry toolkit are used to cut, convert, and classify the substituent groups and substitution sites; finally, an SAR list is produced so that the user can compare and analyze structure and activity.
PCT/CN2020/077657 2020-03-03 2020-03-03 Method for establishing molecular structure and activity database WO2021031549A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/077657 WO2021031549A1 (en) 2020-03-03 2020-03-03 Method for establishing molecular structure and activity database

Publications (1)

Publication Number Publication Date
WO2021031549A1 true WO2021031549A1 (en) 2021-02-25

Family

ID=74660084

Country Status (1)

Country Link
WO (1) WO2021031549A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117219193A (en) * 2023-09-22 2023-12-12 宁波甬恒瑶瑶智能科技有限公司 Supermolecule database retrieval method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020194201A1 (en) * 2001-06-05 2002-12-19 Wilbanks John Thompson Systems, methods and computer program products for integrating biological/chemical databases to create an ontology network
CN104750761A (en) * 2013-12-31 2015-07-01 上海致化化学科技有限公司 Method for creating molecular structure databases and method for searching same
CN110021367A (en) * 2018-10-16 2019-07-16 中国人民解放军军事科学院军事医学研究院 Drug integrated information database building method and system based on drug and target information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ASPIRIN: "RDKit | Substructure search and MCS algorithm", 23 October 2019 (2019-10-23), XP055781756, Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/87355064> *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20853730

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 30/01/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20853730

Country of ref document: EP

Kind code of ref document: A1