WO2023221186A1 - Method, device, system and storage medium for compound clustering - Google Patents

Method, device, system and storage medium for compound clustering

Info

Publication number
WO2023221186A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
compound
legend
identified
clustering
Prior art date
Application number
PCT/CN2022/096714
Other languages
English (en)
French (fr)
Inventor
金羽童
潘麓蓉
Original Assignee
慧壹科技(上海)有限公司
香港圆壹智慧有限公司
美国圆壹智慧科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 慧壹科技(上海)有限公司, 香港圆壹智慧有限公司, 美国圆壹智慧科技有限公司
Publication of WO2023221186A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/763 Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics

Definitions

  • This application relates to the field of information processing technology, and specifically to a method, device, system and storage medium for compound clustering.
  • Clustering is used to subdivide large compound data sets into small groups of similar compounds. It is commonly used to analyze the results of high-throughput screening, virtual screening, or docking studies. Traditional clustering methods based on cheminformatics have low recognition efficiency and slow recognition speed. Even when the similarity of compound fingerprint features is used for identification, the demand for computing and storage space is excessive, which limits the number of compounds that can be identified.
  • Embodiments of this specification provide a method, device, system and storage medium for compound clustering, which are used in the small molecule compound clustering process.
  • the embodiments of this specification provide a method for compound clustering, including:
  • the target identification result corresponding to the compound sample to be identified is obtained; wherein the identification label includes the initial identification label.
  • the embodiments of this specification also provide a device for compound clustering, including:
  • An acquisition module used to acquire a sample of a compound to be identified, and segment the sample of a compound to be identified into a sample subset containing an initial identification label
  • An output module is configured to obtain a target identification result corresponding to the compound sample to be identified based on the sample legend and identification label; wherein the identification label includes the initial identification label.
  • Embodiments of this specification also provide a system for compound clustering, including: a memory, a processor, and a computer program.
  • The computer program is stored in the memory, and the processor runs the computer program to perform the following steps: obtain a sample of a compound to be identified, and segment the compound sample to be identified into a sample subset containing an initial identification label; obtain a sample legend based on the sample subset; obtain the target identification result corresponding to the compound sample to be identified based on the sample legend and the identification label; wherein the identification label includes the initial identification label.
  • Embodiments of this specification also provide a readable storage medium, in which a computer program is stored.
  • When executed by a processor, the computer program is used to implement the following steps: obtain a sample of a compound to be identified, and segment the compound sample to be identified into a sample subset containing an initial identification label; obtain a sample legend according to the sample subset; based on the sample legend and the identification label, obtain a target identification result corresponding to the compound sample to be identified; wherein the identification label includes the initial identification label.
  • The beneficial effects that can be achieved by at least one of the above technical solutions adopted in the embodiments of this specification include at least the following: a sample of a compound to be identified is obtained and segmented into sample subsets containing initial identification labels; a sample legend is obtained according to the sample subset; and the target identification result corresponding to the compound sample to be identified is obtained according to the sample legend and the identification label.
  • Adding compound image recognition and detection on top of the coarse-grained identification of small molecule compound clusters can improve the accuracy of small molecule compound clustering, reduce the processing space required for clustering, and break through the limitations of small molecule clustering, making the processing of small molecule compound clustering more efficient and accurate.
  • Figure 1 is a schematic diagram of the application of compound clustering provided by the embodiment of this specification.
  • Figure 2 is a flow chart 1 of a compound clustering method provided by the embodiment of this specification.
  • Figure 3 is a flow chart 2 of a compound clustering method provided by the embodiment of this specification.
  • Figure 4 is a schematic diagram of a device for compound clustering provided by the embodiment of this specification.
  • Figure 5 is a schematic structural diagram of a compound clustering system provided by an embodiment of this specification.
  • Small molecule compound clustering is often used to analyze the results of high-throughput screening, virtual screening, or docking studies.
  • the traditional clustering method based on chemical informatics has low recognition efficiency and slow recognition speed. Even if the similarity of compound fingerprint characteristics is used for identification, the demand for computing and storage space is too much, resulting in a limited number of compounds that can be identified.
  • Figure 1 is a schematic diagram of the application of compound clustering provided by the embodiment of this specification.
  • A sample 11 of a compound to be identified is included, such as a large compound.
  • The method can be executed by an entity such as the server 10, where the server includes a terminal device capable of running software, including but not limited to a computer, a tablet, a mobile phone, etc.
  • In the compound clustering method proposed in the embodiment of this specification, a sample of a compound to be identified is obtained and segmented into a sample subset containing an initial identification label; a sample legend is obtained according to the sample subset; and the target identification result corresponding to the compound sample to be identified is obtained according to the sample legend and the identification label.
  • Adding compound image processing on top of coarse-grained clustering methods avoids much of the molecular feature extraction process by using a minimal set of features, and greatly improves computational efficiency.
  • Compound graph processing is used to complete small molecule compound clustering based on the preliminary clustering results, achieving a more efficient small molecule compound clustering process that handles larger data volumes and yields more accurate results.
  • FIG. 2 is a flow chart 1 of a compound clustering method provided in the embodiment of this specification. As shown in Figure 2, the method may include steps S210 to S230. In step S210, a sample of a compound to be identified is obtained, and the sample of a compound to be identified is divided into sample subsets containing initial identification labels.
  • the sample of compounds to be identified includes large compounds, and the purpose is to subdivide them into a single group of similar small molecule compounds through a special clustering method.
  • the compound sample to be identified adopts the SMILES sequence text representation format, which can display the chemical characteristics of the compound (for example, including molecular characteristics and atomic characteristics).
  • the sample of the compound to be identified is initially divided into multiple sample subsets, so that the complex sample of the compound to be identified is initially segmented to facilitate subsequent processing of the sample of the compound to be identified.
  • Small molecule compounds usually refer to biologically functional molecules with a molecular weight of less than 1,000 Daltons; from a biological perspective, they generally refer to biologically active small peptides, oligopeptides, oligosaccharides, oligonucleotides, vitamins, minerals, small-molecular-cluster water, etc.; from a nutritional perspective, small molecules can also be divided into proteins, fats, sugars, etc.
  • the existing inventory of compound samples to be identified can be clustered and identified according to the compound attribute characteristics.
  • the initial identification label in the segmentation process can also be obtained (that is, the representation of the type and number of a single group of similar small molecule compounds identified by clustering the sample of the compound to be identified).
  • simple dimensionality reduction is performed to facilitate more rapid identification of the sample of the compound to be identified and obtain the final target identification result.
  • Statistical clustering methods include K-means (the k-means clustering algorithm) and OPTICS (Ordering Points To Identify the Clustering Structure, a density-based clustering algorithm).
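The coarse segmentation into sample subsets can be sketched with a small, dependency-free K-means. This is a minimal illustration rather than the implementation described in the specification: the two-dimensional feature vectors, the value of k, the deterministic initialization from the first k points, and the mapping of cluster indices to labels such as Ti1 and Ti2 are all assumptions made for the example.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain K-means on numeric feature vectors.

    Centroids are initialized from the first k points (an assumption that
    keeps the sketch deterministic; real code would usually randomize).
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical, already-scaled feature vectors for four compounds.
features = [(1.0, 1.2), (8.0, 8.5), (1.3, 0.9), (7.6, 9.0)]
labels, centroids = kmeans(features, k=2)

# Map cluster indices to initial identification labels (naming is illustrative).
initial_labels = [f"Ti{label + 1}" for label in labels]
print(initial_labels)
```

Each resulting label identifies one coarse sample subset; the later graph-based comparison then only has to run inside a subset instead of across the whole data set.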
  • Step S220 Obtain a sample legend according to the sample subset.
  • The sample legend is a discrete mathematical graph obtained by converting each compound sample to be identified according to its representation format. Based on their attribute characteristics, the compounds to be identified are converted into discrete mathematical graphs that are easier to identify. In the discrete mathematical graph, the fingerprint features of each compound to be identified are more prominent, which makes it easier to identify the most similar feature parts among multiple compounds and improves the accuracy of small molecule compound clustering.
  • Step S230 Obtain the target identification result corresponding to the compound sample to be identified based on the sample legend and identification tag.
  • The identification label is used to identify the category of a compound according to the characteristics of the compound, and includes at least one label.
  • the sample legend identifies the category of a single group of similar small molecule compounds through identification labels, thereby obtaining the target identification results of all sample legends corresponding to the compound samples to be identified in the sample subset.
  • the identification tags include the type and number of small molecule compounds.
  • The identification label includes a legend label, where the legend label can further highlight the fingerprint features corresponding to each type of similar small molecule compound. Therefore, the sample of the compound to be identified is initially divided into multiple coarse-grained sample subsets, and each compound sample to be identified that is expressed in sequence text format in the sample subset is converted into a sample legend.
  • When the target identification result corresponding to the compound sample to be identified is obtained based on the sample legend, the judgment of the category of the compound to be identified is not limited to the initial identification label, which is based on the molecular or atomic features of the compound. This not only improves the accuracy of small molecule compound clustering, but also, through the coarse-grained preliminary identification and segmentation of the compound samples to be identified, divides a large volume of compound sample data into relatively small ranges for processing. This achieves data dimensionality reduction and frees up processing space for the subsequent sample legend identification process. That is, data dimensionality reduction improves the utilization of data processing space, thereby speeding up small molecule compound clustering.
  • The legend labels are obtained, where the legend labels highlight the specific range of fingerprint features corresponding to the connected space of each type of small molecule compound, and are used to cluster compounds with the most similar substructure into the same class.
  • the most similar substructure represents the connected space (connected components) of the small molecule compound graph, which may include one or more identical atoms and chemical bonds.
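The notion of a connected space (connected component) of a molecular graph can be illustrated with a breadth-first search. This is a generic sketch, not code from the specification: the node indexing and the edge list are hypothetical, and atom or bond attributes are omitted for brevity.

```python
from collections import deque


def connected_components(num_nodes, edges):
    """Return the connected components of an undirected graph.

    Nodes are integers 0..num_nodes-1 and edges are (i, j) pairs.
    """
    adjacency = {i: set() for i in range(num_nodes)}
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)

    seen = set()
    components = []
    for start in range(num_nodes):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in sorted(adjacency[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components


# Hypothetical graph: atoms 0-1-2 are bonded in a chain, atoms 3-4 form a separate fragment.
components = connected_components(5, [(0, 1), (1, 2), (3, 4)])
print(components)
```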
  • Image feature points are extracted from the sample legends. Each sample legend corresponds to a molecule in the compounds to be identified, and the topological comparison of the molecules is combined with the fingerprint features corresponding to the compounds to obtain the most similar substructure (subgraph) features in the image features. That is, the unique discriminative features that stand out in the image features can be used to find the most similar substructure features between compounds.
  • the similarity score is calculated to determine whether they belong to the same category based on the threshold.
  • If the similarity score reaches the threshold, the two sample legends are determined to be of the same category, so that the compounds whose sample legends in the sample subset share the most similar substructure are assigned to the same category, and the legend labels corresponding to that category are obtained.
  • the legend label includes the most similar substructure and the threshold corresponding to the similarity score.
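The score-and-threshold decision can be sketched as follows. The similarity measure here is a stand-in chosen for brevity, a Jaccard-style overlap of labelled bonds, and is not the similarity formula of the specification; the graph encoding as a `(nodes, edges)` pair and the threshold of 0.5 are likewise assumptions.

```python
from collections import Counter


def edge_features(nodes, edges):
    """Count bonds as (atomic number, bond order, atomic number) triples."""
    features = Counter()
    for (i, j), order in edges.items():
        low, high = sorted((nodes[i], nodes[j]))
        features[(low, order, high)] += 1
    return features


def similarity(graph_a, graph_b):
    """Multiset Jaccard overlap of labelled bonds, between 0 and 1."""
    feats_a = edge_features(*graph_a)
    feats_b = edge_features(*graph_b)
    shared = sum((feats_a & feats_b).values())
    total = sum((feats_a | feats_b).values())
    return shared / total if total else 1.0


def same_category(graph_a, graph_b, threshold=0.5):
    """Two graphs fall into the same category when the score reaches the threshold."""
    return similarity(graph_a, graph_b) >= threshold


# Heavy-atom graphs: nodes hold atomic numbers, edges map (i, j) to bond order.
ethanol = ([6, 6, 8], {(0, 1): 1, (1, 2): 1})                  # CCO
propanol = ([6, 6, 6, 8], {(0, 1): 1, (1, 2): 1, (2, 3): 1})   # CCCO
acetonitrile = ([6, 6, 7], {(0, 1): 1, (1, 2): 3})             # CC#N

print(same_category(ethanol, propanol))      # score 2/3
print(same_category(ethanol, acetonitrile))  # score 1/3
```

A bond-overlap score ignores topology, so it only approximates a true common-substructure comparison; it is used here because it keeps the thresholding logic easy to follow.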
  • Obtaining the target identification result corresponding to the compound sample to be identified according to the sample legend and the identification label includes: according to the sample legend, the initial identification label and the legend label, clustering the compounds to be identified in the sample legend that satisfy the initial identification label and the legend label into the same category; and, based on the initial identification labels and legend labels corresponding to different categories of compounds, obtaining the identification categories corresponding to all the compound samples to be identified.
  • the identification tags in this embodiment include initial identification tags and legend identification tags.
  • Each category of compounds corresponds to a set of initial identification labels and legend identification labels. Therefore, when clustering small molecule compounds in the sample legend, the initial identification label (for example, Ti1) and the legend identification label (for example, Pt1) are combined, and the compounds to be identified in the sample legends of the sample subset that satisfy both labels are clustered into the same category (for example, compound 1). Then, based on the initial identification labels and legend labels corresponding to different categories of compounds (such as [Ti1, Pt1], [Ti2, Pt2], ..., [Ti5, Pt5], ...), the identification categories corresponding to all the compound samples to be identified are obtained (for example, compound 1, compound 2).
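Combining the two labels into a final category amounts to grouping compounds by the pair of initial identification label and legend label. The sketch below assumes both labels have already been assigned; the compound names and label values are hypothetical.

```python
from collections import defaultdict


def assign_categories(labelled_compounds):
    """Group compounds by their (initial label, legend label) pair.

    Each input item is (compound_name, initial_label, legend_label).
    """
    categories = defaultdict(list)
    for name, initial_label, legend_label in labelled_compounds:
        categories[(initial_label, legend_label)].append(name)
    return dict(categories)


# Hypothetical label assignments for four compounds.
labelled = [
    ("mol_a", "Ti1", "Pt1"),
    ("mol_b", "Ti1", "Pt1"),
    ("mol_c", "Ti2", "Pt2"),
    ("mol_d", "Ti1", "Pt2"),
]
categories = assign_categories(labelled)
for key, members in categories.items():
    print(key, members)
```

Note that `mol_d` shares its initial label with `mol_a` but not its legend label, so it ends up in a category of its own; this is the refinement the legend label adds on top of the coarse clustering.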
  • Clustering the compounds to be identified in the sample legend that satisfy the initial identification label and the legend label into the same category includes: obtaining the sample standard chart corresponding to the initial identification label and the legend label that reach a preset threshold; and performing similarity calculations between all sample legends and the sample standard chart. If a sample legend matches the sample standard chart, the compound to be identified corresponding to that sample legend is clustered into the same category.
  • Before the compounds to be identified in the sample legend that satisfy both the initial identification label and the legend label are clustered into the same category, it is necessary to obtain the sample standard chart corresponding to the group of initial identification label and legend label that reaches the preset threshold.
  • reaching the preset threshold for the same group of initial identification labels and legend labels may include that the sum of the corresponding weights of the initial identification labels and legend labels reaches the preset threshold.
  • The clustering of small molecule compounds can be achieved more accurately based on the most similar substructures contained in the legend labels. Therefore, after obtaining the sample standard chart corresponding to the initial identification label and the legend label that reach the preset threshold, similarity calculations are performed between all sample legends and the sample standard chart corresponding to each category.
  • the compounds to be identified corresponding to the sample legend will be clustered into the same category as the compounds corresponding to the sample standard chart.
  • the following formula 1 can be used to calculate the match between the sample legend and the sample standard chart:
  • G 1 and G 2 are the input sample legend and sample standard diagram respectively. Whether they match can be obtained by using the similarity algorithm of Formula 2.
  • a and B represent the number of nodes of the compared sample legend and sample standard graph respectively, and
  • Obtaining the sample legend according to the sample subset includes:
  • converting each compound sample to be identified in the sample subset to obtain a corresponding sample legend.
  • The compound to be identified is represented as processable data (for example, a compound represented by SMILES sequence text) and converted into a data graph in which atoms are nodes and chemical bonds are edges.
  • nodes include atomic number attributes
  • edges include single, double, and triple bond attributes.
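The conversion from SMILES text to a graph with atomic-number nodes and bond-order edges can be sketched as below. This is a deliberately tiny parser written for illustration, not the conversion used in the specification and not a substitute for a cheminformatics toolkit: it only handles a subset of SMILES (unbracketed atoms from a small table, `-`, `=`, `#` bonds, parenthesized branches and single-digit ring closures), and the `MolGraph` type is a hypothetical helper.

```python
from dataclasses import dataclass, field

# Supported element symbols and bond characters (a small, assumed subset).
ATOMIC_NUMBER = {"B": 5, "C": 6, "N": 7, "O": 8, "F": 9,
                 "P": 15, "S": 16, "Cl": 17, "Br": 35, "I": 53}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


@dataclass
class MolGraph:
    nodes: list = field(default_factory=list)  # atomic number per atom index
    edges: dict = field(default_factory=dict)  # (i, j) with i < j -> bond order


def smiles_to_graph(smiles):
    graph = MolGraph()
    prev = None    # atom that the next atom will be bonded to
    order = 1      # bond order waiting to be applied
    branches = []  # stack of branch points opened by "("
    rings = {}     # open ring-closure digit -> (atom index, bond order)
    pos = 0
    while pos < len(smiles):
        ch = smiles[pos]
        if ch in BOND_ORDER:
            order = BOND_ORDER[ch]
        elif ch == "(":
            branches.append(prev)
        elif ch == ")":
            prev = branches.pop()
        elif ch.isdigit():
            if ch in rings:
                # Second occurrence of the digit closes the ring.
                other, ring_order = rings.pop(ch)
                key = (min(other, prev), max(other, prev))
                graph.edges[key] = max(order, ring_order)
            else:
                rings[ch] = (prev, order)
            order = 1
        else:
            two = smiles[pos:pos + 2]
            symbol = two if two in ATOMIC_NUMBER else ch
            if symbol not in ATOMIC_NUMBER:
                raise ValueError(f"unsupported SMILES token: {ch!r}")
            graph.nodes.append(ATOMIC_NUMBER[symbol])
            index = len(graph.nodes) - 1
            if prev is not None:
                graph.edges[(prev, index)] = order
            prev = index
            order = 1
            pos += len(symbol) - 1
        pos += 1
    return graph


acetic_acid = smiles_to_graph("CC(=O)O")
print(acetic_acid.nodes)
print(acetic_acid.edges)
```

Hydrogens are left implicit, as in the SMILES text itself, so the graph contains heavy atoms only.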
  • The method further includes: outputting the target identification result, and storing the compound sample to be identified corresponding to the target identification result, so that subsequent downstream applications or data storage of small molecule compound clustering can be realized. Therefore, the compound clustering method of the present invention can not only realize small molecule compound clustering quickly and accurately, but also improve the efficiency and accuracy of associated small molecule compound clustering applications. Below, some examples are used for schematic description.
  • Figure 3 is a flow chart 2 of a compound clustering method provided by an embodiment of the present invention.
  • The preferred steps adopted in the embodiment of the present invention are as follows: input the sample of the compound to be identified, represented in SMILES sequence text format; segment the sample of the compound to be identified into sample subsets containing the initial identification label by statistical clustering; convert the compound samples to be identified in SMILES sequence text format in the sample subset into discrete mathematical graphs; calculate the most similar structures through topological comparison of the discrete mathematical graphs in the sample subset, and obtain the legend label; obtain the target identification results corresponding to the sample of the compound to be identified according to the sample legend (i.e., the discrete mathematical graph), the initial identification label and the legend label; and output the clustering results of the compound samples to be identified as SMILES sequences.
  • Step 1 Obtain the sample of the compound to be identified represented by SMILES sequence text.
  • Step 2 Obtain a sample subset corresponding to a small molecule compound cluster through statistical clustering, for example, the initial identification label is Ti2.
  • Statistical clustering methods include K-means and OPTICS.
  • Step 3 Convert the compounds to be identified represented by SMILES sequence text in the sample subset into corresponding sample legends.
  • Step 4 Obtain the target identification result corresponding to the compound sample to be identified according to the sample legend and identification label. For example, compounds that do not belong to the Ti2 category are eliminated from the sample subset based on the preliminary identification label (Ti2).
  • The method further includes obtaining, according to the sample legend, the initial identification label and the legend label, a sample standard chart corresponding to the initial identification label and the legend label that reach a preset threshold; and performing similarity calculations between all sample legends and the sample standard chart. If a sample legend matches the sample standard chart, the compound to be identified corresponding to that sample legend is clustered into the same category. For example, the compounds belonging to the Ti2 category in the sample subset are identified, so that compounds that do not belong to the Ti2 category are ultimately rejected more accurately. Thus, the target identification results corresponding to all the compounds to be identified are obtained.
  • Step 5 Output the clustered sample subset for subsequent downstream analysis applications or data storage.
  • Step 1 Obtain the sample of the compound to be identified represented by SMILES sequence text.
  • Step 2 Use functions in the RDKit package for Python to calculate chemical property features, and obtain the sample subsets corresponding to at least two small molecule compound clusters through statistical clustering; for example, the initial identification labels are at least two of Ti1, Ti2, Ti3 and Ti4.
  • Statistical clustering methods include K-means and OPTICS.
  • Step 3 Convert the compounds to be identified represented by SMILES sequence text in the sample subset into corresponding sample legends.
  • Step 4: Obtain the target identification result corresponding to the compound sample to be identified according to the sample legend and identification label.
  • The method further includes clustering the compounds to be identified in the sample legend that satisfy the initial identification label and the legend label into the same category according to the sample legend, the initial identification label and the legend label; and, according to the initial identification labels and legend labels corresponding to different categories of compounds, obtaining the identification categories corresponding to all the compound samples to be identified.
  • Step 5 Output the clustered sample subset for subsequent downstream analysis applications or data storage.
  • Step 1 Obtain the sample of the compound to be identified represented by SMILES sequence text.
  • Step 2 Obtain the features of the compound to be identified, such as Morgan fingerprints (molecular fingerprints) or vector (embedding) variables trained by deep learning; use the PCA algorithm to perform dimensionality reduction on the data; and, after reducing the data features to 10-100 dimensions, obtain the sample subsets corresponding to at least two small molecule compound clusters through statistical clustering; for example, the initial identification labels are at least two of Ti1, Ti2, Ti3, and Ti4.
  • Statistical clustering methods include, for example, K-means.
  • Step 3. Convert the compounds to be identified represented by SMILES sequence text in the sample subset into corresponding sample legends.
  • Step 4 Obtain the target identification result corresponding to the compound sample to be identified according to the sample legend and identification label. For example, based on the preliminary identification labels (Ti1, Ti2, and Ti3), the compounds corresponding to the four categories of Ti1, Ti2, Ti4, and Ti5 were finally accurately identified in the sample subset.
  • The method further includes clustering the compounds to be identified in the sample legend that satisfy the initial identification label and the legend label into the same category according to the sample legend, the initial identification label and the legend label; and, according to the initial identification labels and legend labels corresponding to different categories of compounds, obtaining the identification categories corresponding to all the compound samples to be identified.
  • Step 5 Output the clustered sample subset for subsequent downstream analysis applications or data storage.
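The PCA-based dimensionality reduction used in Step 2 of this example can be sketched without external libraries by power iteration on the covariance matrix. This is an illustrative sketch under stated assumptions: the function name, the toy two-dimensional input and the iteration count are hypothetical, and a real pipeline reducing fingerprints to 10-100 dimensions would normally rely on a numerical library.

```python
import math


def pca_project(rows, n_components, iterations=100):
    """Project rows onto their leading principal components.

    Components are found one at a time by power iteration, followed by
    deflation of the covariance matrix. The fixed start vector is an
    assumption; it fails if it is orthogonal to the leading component.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]

    components = []
    for _ in range(n_components):
        v = [1.0 / math.sqrt(d)] * d
        for _step in range(iterations):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # no variance left in the deflated matrix
            v = [x / norm for x in w]
        eigenvalue = sum(
            v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)
        )
        if eigenvalue < 1e-12:
            break
        components.append(v)
        # Deflation: remove the variance explained by this component.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]

    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]


# Toy data lying on the line y = 2x, so one component captures all variance.
rows = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
projected = pca_project(rows, n_components=1)
print(projected)
```

The reduced vectors would then be passed to the statistical clustering step in place of the original high-dimensional features.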
  • Figure 4 is a schematic diagram of a device for compound clustering provided by the embodiment of this specification. As shown in Figure 4, the device 30 includes:
  • the acquisition module 31 is used to acquire a sample of a compound to be identified, and segment the sample of a compound to be identified into a sample subset containing an initial identification label;
  • the obtaining module 32 is used to obtain a sample legend according to the sample subset;
  • the output module 33 is configured to obtain the target identification result corresponding to the compound sample to be identified according to the sample legend and identification label; wherein the identification label includes the initial identification label.
  • the device of the embodiment shown in Figure 4 can be used to perform the steps in the method embodiment shown in Figure 2.
  • the implementation principles and technical effects are similar and will not be described again here.
  • Figure 5 is a schematic structural diagram of a compound clustering system provided by the embodiment of this specification. As shown in Figure 5, the system 40 includes: a processor 41, a memory 42 and a computer program; wherein
  • Memory 42 is used to store the computer program. This memory may also be flash memory.
  • the computer program is, for example, an application program, functional module, etc. that implements the above method.
  • the processor 41 is used to execute the computer program stored in the memory to implement each step performed by the device in the above method. For details, please refer to the relevant descriptions in the previous method embodiments.
  • the memory 42 can be independent or integrated with the processor 41 .
  • the device may also include:
  • Bus 43 is used to connect the memory 42 and the processor 41 .
  • the present invention also provides a readable storage medium.
  • a computer program is stored in the readable storage medium.
  • When the computer program is executed by a processor, the computer program is used to implement the methods provided by the above-mentioned various embodiments.
  • the readable storage medium may be a computer storage medium or a communication medium.
  • Communication media includes any medium that facilitates transfer of a computer program from one place to another.
  • Computer storage media can be any available media that can be accessed by a general purpose or special purpose computer.
  • a readable storage medium is coupled to a processor such that the processor can read information from the readable storage medium and write information to the readable storage medium.
  • the readable storage medium may also be an integral part of the processor.
  • The processor and readable storage medium may be located in an Application Specific Integrated Circuit (ASIC). Additionally, the ASIC can be located in the user equipment.
  • the processor and the readable storage medium may also exist as discrete components in the communication device.
  • Readable storage media can be read-only memory (ROM), random-access memory (RAM), CD-ROM, tapes, floppy disks, optical data storage devices, etc.
  • the present invention also provides a program product.
  • the program product includes execution instructions, and the execution instructions are stored in a readable storage medium.
  • At least one processor of the device can read the execution instruction from the readable storage medium, and at least one processor executes the execution instruction to cause the device to implement the methods provided by the various embodiments described above.
  • The processor can be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), etc.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc. The steps of the method disclosed in the present invention can be directly implemented by a hardware processor, or can be executed by a combination of hardware and software modules in the processor.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Chemical & Material Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

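The present application provides a method, apparatus, system, and storage medium for compound clustering. Compound samples to be identified are acquired and split into sample subsets that carry initial identification labels; sample graphs are obtained from the sample subsets; and a target identification result for the compound samples to be identified is obtained from the sample graphs and identification labels, where the identification labels include the initial identification labels. Building on statistical compound clustering, the invention provides an efficient, fast, and accurate method for clustering small-molecule compounds. It improves the accuracy of small-molecule compound clustering, reduces the processing space the clustering requires, and overcomes the limitations of small-molecule clustering, so that small-molecule compound clustering is processed more efficiently and accurately.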

Description

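Method, apparatus, system, and storage medium for compound clustering
Technical Field
The present application relates to the field of information processing technology, and in particular to a method, apparatus, system, and storage medium for compound clustering.
Background
Molecules composed of a few or a few dozen atoms are generally called small molecules; at room temperature such substances may be solid, gaseous, or liquid. Common small organic molecules include ethanol, glucose, and methane.
Clustering subdivides a large compound data set into small groups of similar compounds. It is commonly used to analyze the results of high-throughput screening, virtual screening, or docking studies. Traditional cheminformatics-based clustering methods are inefficient and slow. Even when the similarity of compound fingerprint features is used, the computation and storage requirements are so high that only a limited number of compounds can be identified.
A new solution is therefore needed.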
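Summary of the Invention
In view of this, the embodiments of this specification provide a method, apparatus, system, and storage medium for compound clustering, for use in clustering small-molecule compounds.
The embodiments of this specification provide the following technical solutions:
An embodiment of this specification provides a compound clustering method, comprising:
acquiring compound samples to be identified, and splitting the compound samples to be identified into sample subsets that carry initial identification labels;
obtaining sample graphs from the sample subsets;
obtaining, from the sample graphs and identification labels, a target identification result corresponding to the compound samples to be identified, wherein the identification labels include the initial identification labels.
An embodiment of this specification further provides a compound clustering apparatus, comprising:
an acquisition module, configured to acquire compound samples to be identified and split the compound samples to be identified into sample subsets that carry initial identification labels;
an obtaining module, configured to obtain sample graphs from the sample subsets;
an output module, configured to obtain, from the sample graphs and identification labels, a target identification result corresponding to the compound samples to be identified, wherein the identification labels include the initial identification labels.
An embodiment of this specification further provides a compound clustering system, comprising a memory, a processor, and a computer program, the computer program being stored in the memory, and the processor running the computer program to perform the following steps: acquiring compound samples to be identified, and splitting the compound samples to be identified into sample subsets that carry initial identification labels; obtaining sample graphs from the sample subsets; and obtaining, from the sample graphs and identification labels, a target identification result corresponding to the compound samples to be identified, wherein the identification labels include the initial identification labels.
An embodiment of this specification further provides a readable storage medium storing a computer program which, when executed by a processor, implements the following steps: acquiring compound samples to be identified, and splitting the compound samples to be identified into sample subsets that carry initial identification labels; obtaining sample graphs from the sample subsets; and obtaining, from the sample graphs and identification labels, a target identification result corresponding to the compound samples to be identified, wherein the identification labels include the initial identification labels.
Compared with the prior art, the beneficial effects achievable by at least one of the above technical solutions adopted in the embodiments of this specification include at least the following. Compound samples to be identified are acquired and split into sample subsets that carry initial identification labels; sample graphs are obtained from the sample subsets; and a target identification result corresponding to the compound samples to be identified is obtained from the sample graphs and identification labels. Adding recognition and detection of compound images on top of coarse-grained clustering of small-molecule compounds improves the accuracy of small-molecule compound clustering, reduces the processing space the clustering requires, and overcomes the limitations of small-molecule clustering, so that small-molecule compound clustering is processed more efficiently and accurately.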
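Brief Description of the Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application of compound clustering provided by an embodiment of this specification;
Fig. 2 is a first flowchart of a compound clustering method provided by an embodiment of this specification;
Fig. 3 is a second flowchart of a compound clustering method provided by an embodiment of this specification;
Fig. 4 is a schematic diagram of a compound clustering apparatus provided by an embodiment of this specification;
Fig. 5 is a schematic structural diagram of a compound clustering system provided by an embodiment of this specification.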
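Detailed Description
The embodiments of the present application are described in detail below with reference to the drawings.
The embodiments of the present application are illustrated below through specific examples, and those skilled in the art can readily understand other advantages and effects of the present application from the content disclosed in this specification. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. The present application can also be implemented or applied through other, different specific embodiments, and the details in this specification can be modified or changed in various ways, based on different viewpoints and applications, without departing from the spirit of the present application. It should be noted that, where no conflict arises, the following embodiments and the features in them may be combined with one another. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the present application without creative effort fall within the scope of protection of the present application.
It should be noted that the drawings provided in the following embodiments illustrate the basic concept of the present application only schematically. They show only the components relevant to the present application and are not drawn according to the number, shape, and size of the components in an actual implementation; in an actual implementation the form, number, and proportion of each component may vary freely, and the component layout may also be more complex.
In addition, specific details are provided in the following description to facilitate a thorough understanding of the examples. However, those skilled in the art will understand that the described aspects may be practiced without these specific details.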
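Small-molecule compound clustering is commonly used to analyze the results of high-throughput screening, virtual screening, or docking studies. Traditional cheminformatics-based clustering methods are inefficient and slow. Even when the similarity of compound fingerprint features is used, the computation and storage requirements are so high that only a limited number of compounds can be identified.
In view of this, the inventors found that in the prior art, machine-learning clustering of small-molecule compounds often produces vague and inaccurate results, so that fixed types in the compound information cannot be identified and the results have no practical value. Even when the similarity of compound fingerprint features is used, the processing and storage footprint is too high, so that small-molecule compound clustering cannot scale to compound libraries of 100,000 compounds or more, which limits small-molecule compound clustering.
On this basis, the embodiments of this specification propose a processing scheme for compound clustering. Fig. 1 is a schematic diagram of an application of compound clustering provided by an embodiment of this specification. As shown in Fig. 1, the scheme involves compound samples to be identified 11, for example large compounds. The compound samples to be identified 11 are split into sample subsets that carry initial identification labels; sample graphs are obtained from the sample subsets; and a target identification result corresponding to the compound samples to be identified (for example, class 1, class 2, ..., class n-1, and class n) is obtained from the sample graphs and identification labels.
In a specific implementation, the scheme may be executed by a single entity, for example a server 10, where the server includes terminal devices capable of running software, including but not limited to computers, tablets, and mobile phones.
In the compound clustering method proposed by the embodiments of this specification, compound samples to be identified are acquired and split into sample subsets that carry initial identification labels; sample graphs are obtained from the sample subsets; and a target identification result corresponding to the compound samples to be identified is obtained from the sample graphs and identification labels. Adding compound image processing on top of a coarse-grained clustering method uses the fewest possible features, greatly shortens the molecular feature extraction process, and substantially improves computational efficiency. Furthermore, completing small-molecule compound clustering by processing compound graphs on the basis of the preliminary clustering result achieves a clustering process that is more efficient, handles larger data volumes, and yields more accurate results.
The above application scenario is shown only to aid understanding of the present application, and the embodiments of this specification are not limited in this respect. On the contrary, the embodiments of this specification can be applied to any applicable scenario.
The technical solutions provided by the embodiments of the present application are described below with reference to the drawings.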
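Fig. 2 is a first flowchart of a compound clustering method provided by an embodiment of this specification. As shown in Fig. 2, the method may include steps S210 to S230. Step S210: acquire compound samples to be identified, and split the compound samples to be identified into sample subsets that carry initial identification labels.
In this embodiment, the compound samples to be identified include large compounds, and the aim is to subdivide them, by a special clustering method, into small groups of similar small-molecule compounds. In some embodiments, the compound samples to be identified are represented as SMILES sequence text, a format that can express the chemical characteristics of a compound (for example, molecular features and atomic properties). When the data volume is large, the compound samples to be identified are first divided into multiple sample subsets, so that the large and heterogeneous sample set is preliminarily split and is easier to process in later steps. From a chemical point of view, a small-molecule compound usually means a biologically functional molecule with a molecular weight below 1000 daltons; from a biological point of view, it generally means biologically active small peptides, oligopeptides, oligosaccharides, oligonucleotides, vitamins, minerals, small-cluster water, and the like; and from a nutritional point of view, small molecules can be divided into proteins, fats, sugars, and the like.
In some embodiments, a general statistical clustering method can be applied to the compound samples to be identified that are in the existing inventory, clustering them according to compound attribute features. This not only splits the compound samples to be identified into multiple sample subsets but also yields the initial identification labels produced during the split (that is, a representation of the types and numbers of the groups of similar small-molecule compounds identified by clustering the samples). It amounts to a simple dimensionality reduction in preparation for more precise compound clustering, so that the compound samples to be identified can be identified more quickly to obtain the final target identification result. The statistical clustering methods include K-means (the k-means clustering algorithm) and OPTICS (Ordering Points To Identify the Clustering Structure, a density-based clustering algorithm). In other embodiments, where there is a large volume of compound samples to be identified that have not undergone preliminary clustering, all of the compound samples to be identified must first be divided into multiple sample subsets by a statistical clustering method, with initial identification labels produced during the split; the implementation is similar to the K-means or OPTICS procedure above and is not repeated here.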
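The coarse split in step S210 can be pictured with a minimal, dependency-free K-means sketch. This is an illustration under stated assumptions, not the patent's implementation: the two descriptors per compound (molecular weight and logP), the function names `kmeans` and `split_into_subsets`, and the deterministic start from the first k points are all choices made here for clarity. The returned cluster index plays the role of the initial identification label.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain k-means; returns one cluster index per point."""
    # Assumption: start from the first k points so the run is reproducible.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def split_into_subsets(smiles_list, features, k):
    """Group SMILES strings by cluster index (the 'initial label')."""
    subsets = {}
    for smi, lab in zip(smiles_list, kmeans(features, k)):
        subsets.setdefault(lab, []).append(smi)
    return subsets


# Illustrative descriptors only: (molecular weight, logP) per compound.
smiles = ["CCO", "CCCO", "c1ccccc1", "Cc1ccccc1"]
features = [(46.07, -0.31), (60.10, 0.25), (78.11, 2.13), (92.14, 2.73)]
subsets = split_into_subsets(smiles, features, k=2)
print(subsets)
```

In practice a library implementation with better initialisation and feature scaling would normally replace this loop; the sketch only shows how a cluster index per compound yields labelled subsets.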
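Step S220: obtain sample graphs from the sample subsets.
Following the above embodiment, after the compound samples to be identified have been split into sample subsets, the sample graphs corresponding to all compound samples to be identified in each sample subset must be obtained. A sample graph is produced by converting the representation format of a compound sample to be identified into the corresponding discrete mathematical graph. A compound to be identified, given in a processable form, is thus converted according to its attribute features into a discrete mathematical graph that is easier to recognize. In this graph the fingerprint features of each compound to be identified stand out more clearly, which makes it easier to identify the most similar feature parts across multiple compounds and improves the precision of small-molecule compound clustering.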
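Step S230: obtain, from the sample graphs and identification labels, the target identification result corresponding to the compound samples to be identified.
Specifically, the identification labels are used to identify the class of a compound based on its properties; there is at least one such label, and there may be several. The sample graphs are matched against the identification labels to identify the class of each group of similar small-molecule compounds, which gives the target identification result for the compound samples to be identified that correspond to all sample graphs in the sample subset. The identification labels include the types and numbers of small-molecule compounds.
To improve the accuracy of small-molecule compound clustering, accurate identification labels must be obtained for the preliminarily split sample subsets, so that a more accurate target identification result is obtained for all compound samples to be identified.
In some embodiments, the identification labels include graph labels, which bring out more clearly the fingerprint features corresponding to each class of similar small-molecule compounds. The compound samples to be identified are first split into multiple coarse-grained sample subsets, each compound sample in a subset is converted from its sequence-text representation into a sample graph, and the target identification result is then obtained from the sample graphs. In this process, classification is not limited to judging the class of a compound from the initial identification label based on molecular features or atomic properties, which improves the accuracy of small-molecule compound clustering. In addition, the coarse-grained preliminary split divides a large volume of compound samples into comparatively small ranges of data, which reduces the dimensionality of the data and frees up processing space for the subsequent sample-graph identification. In other words, dimensionality reduction improves the utilization of data processing space and so speeds up small-molecule compound clustering.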
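In some embodiments, graph labels are obtained by extracting features from the sample graphs, combining them with the fingerprint features of the compounds, and training on large amounts of data. A graph label brings out more clearly the fingerprint features within a specific range of the connected space of each class of small-molecule compound, and is used to cluster compounds that share the most similar substructure into the same class. Here the most similar substructure denotes connected components of the small-molecule compound graph, which may include, for example, one or more identical atoms and chemical bonds.
Specifically, image feature points are extracted from the sample graphs, and every molecule among the compounds to be identified that correspond to the sample graphs is topologically compared with every other molecule. Combined with the fingerprint features of the compounds, this yields the most similar substructure (subgraph) features among the image features; that is, the uniquely discriminating features highlighted by image feature recognition are used to find the most similar substructure features between compounds. A similarity score is computed at the same time, and a threshold on that score decides whether two compounds belong to the same class. After training on a large amount of data, if the similarity score of the most similar substructure features of any two sample graphs satisfies the threshold, the two sample graphs are determined to be of the same class. All compounds in the sample subset whose sample graphs share a common most similar substructure are thus assigned to the same class, and the graph label for that class is obtained. A graph label includes the most similar substructure and the threshold for the similarity score.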
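The pairwise thresholding described above can be sketched as follows. The patent does not specify how classes are formed once pairs pass the threshold; this sketch assumes transitive merging with a union-find structure, and it uses a toy score over hypothetical substructure keys in place of a real most-similar-substructure comparison. The names `group_by_similarity` and `shared_fraction` are illustrative only.

```python
def group_by_similarity(items, similarity, threshold):
    """Merge items into classes whenever a pair scores at or above threshold."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the class representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once; O(n^2) comparisons inside one subset.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if similarity(items[i], items[j]) >= threshold:
                parent[find(j)] = find(i)

    classes = {}
    for index, item in enumerate(items):
        classes.setdefault(find(index), []).append(item)
    return list(classes.values())


def shared_fraction(a, b):
    """Toy score: shared substructure keys divided by all keys seen."""
    return len(a & b) / len(a | b)


# Hypothetical substructure keys for three compounds.
cresol_like = frozenset({"ring6", "OH", "CH3"})
phenol_like = frozenset({"ring6", "OH"})
acid_like = frozenset({"COOH", "CH2"})
classes = group_by_similarity(
    [cresol_like, phenol_like, acid_like], shared_fraction, threshold=0.6
)
print(classes)
```

Because the comparison is quadratic in the number of compounds, running it only inside each coarse subset, as the text describes, keeps the number of pairs manageable.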
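In some embodiments, obtaining the target identification result corresponding to the compound samples to be identified from the sample graphs and identification labels comprises: clustering into the same class, according to the sample graphs, the initial identification labels, and the graph labels, the compounds to be identified in the sample graphs that satisfy both the initial identification label and the graph label; and obtaining, according to the initial identification labels and graph labels of the different classes of compounds, the identification class of every compound sample to be identified.
The identification labels in this embodiment include initial identification labels and graph identification labels. Each class of compound corresponds to one pair of an initial identification label and a graph identification label. Therefore, when clustering the sample graphs, the initial identification label (for example Ti1) and the graph identification label (for example Pt1) are combined, and the compounds to be identified in the sample graphs of a sample subset that satisfy both labels are clustered into the same class (for example compound 1). Then, according to the initial identification labels and graph labels of the different classes of compounds (for example [Ti1, Pt1], [Ti2, Pt2], ..., [Ti5, Pt5], ...), the identification class of every compound sample to be identified is obtained (for example compound 1, compound 2, ...).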
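As a small illustration of pairing the two labels, the sketch below keys each class by an (initial label, graph label) pair such as ("Ti1", "Pt1"). The record layout, the function name `assign_classes`, and the label assignments in the sample data are hypothetical and are not taken from the patent.

```python
from collections import defaultdict


def assign_classes(records):
    """records: iterable of (smiles, initial_label, graph_label) tuples."""
    classes = defaultdict(list)
    for smiles, initial_label, graph_label in records:
        # A class is keyed by the label pair, e.g. ("Ti1", "Pt1").
        classes[(initial_label, graph_label)].append(smiles)
    return dict(classes)


# Hypothetical label assignments, for illustration only.
records = [
    ("CCO", "Ti1", "Pt1"),
    ("CCCO", "Ti1", "Pt1"),
    ("c1ccccc1", "Ti2", "Pt2"),
    ("CC(=O)O", "Ti1", "Pt3"),
]
classes = assign_classes(records)
print(classes)
```

Two compounds that share an initial label but differ in graph label end up in different classes, which is how the graph label can refine the coarse split.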
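In some embodiments, clustering into the same class the compounds to be identified in the sample graphs that satisfy the initial identification label and the graph label comprises: obtaining a reference standard graph for which the initial identification label and the graph label reach a preset threshold; and performing a similarity calculation between every sample graph and the reference standard graph, and, if a sample graph matches the reference standard graph, clustering the compound to be identified that corresponds to that sample graph into the same class.
Following the above embodiment, when the compounds to be identified in the sample graphs that satisfy both the initial identification label and the graph label are clustered into the same class, a reference standard graph must be obtained for which one pair of initial identification label and graph label reaches a preset threshold. A pair reaching the preset threshold may mean that the sum of the weights assigned to the initial identification label and to the graph label reaches the preset threshold. The most similar substructure contained in the graph label also allows small-molecule compound clustering to be carried out more accurately. Therefore, after the reference standard graph has been obtained, every sample graph is compared for similarity with the reference standard graph of each class; if a sample graph matches a reference standard graph, the compound to be identified that corresponds to the sample graph is clustered into the same class as the compound corresponding to that reference standard graph. Whether a sample graph matches a reference standard graph can be computed using Formula 1:
$$S = G_1 \cap G_2 \qquad \text{(Formula 1)}$$
where $G_1$ and $G_2$ are the input sample graph and the reference standard graph, respectively. Whether the two match can then be determined with the similarity algorithm of Formula 2.
(Formula 2; original equation image: Figure PCTCN2022096714-appb-000001)
where $A$ and $B$ denote the nodes of the sample graph and of the reference standard graph being compared, and $|A \cap B|$ is the number of nodes that graph $A$ and graph $B$ have in common. When $J(A,B)$ equals the preset threshold, the sample graph matches the reference standard graph. The compound to be identified that corresponds to the sample graph is then clustered into the same class as the compound corresponding to the reference standard graph.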
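Formula 2 survives in this text only as an image reference, so its exact form is not known here. Because the text names the score J(A,B) and defines |A∩B| as the shared node count, one plausible reading is the Jaccard index, J(A,B) = |A∩B| / |A∪B|. The sketch below assumes that reading. The node labels, the 0.7 threshold, and the use of "at or above" rather than strict equality for the match test are further assumptions made for illustration.

```python
def common_nodes(nodes_a, nodes_b):
    """Formula 1 restricted to nodes: the intersection of the two node sets."""
    return set(nodes_a) & set(nodes_b)


def jaccard(nodes_a, nodes_b):
    """Assumed form of Formula 2: |A ∩ B| / |A ∪ B| over node sets."""
    a, b = set(nodes_a), set(nodes_b)
    union = a | b
    if not union:
        return 0.0  # two empty graphs: define the score as 0
    return len(a & b) / len(union)


def matches(sample_nodes, reference_nodes, threshold=0.7):
    """Assumption: a match means the score reaches the threshold."""
    return jaccard(sample_nodes, reference_nodes) >= threshold


# Hypothetical node labels shared across graphs after alignment.
sample = {"C1", "C2", "O1", "N1"}
reference = {"C1", "C2", "O1"}
score = jaccard(sample, reference)
print(sorted(common_nodes(sample, reference)), score, matches(sample, reference))
```

Comparing node sets presupposes that nodes in the two graphs have already been put into correspondence (for example through the shared substructure); how that alignment is done is outside this sketch.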
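In some embodiments, obtaining sample graphs from the sample subsets comprises:
converting each compound sample to be identified in a sample subset into its corresponding sample graph according to the attribute features of the compounds in the sample subset.
Specifically, the conversion draws on the attribute features of the compounds to be identified in the sample subset, such as molecular features, logP (the oil-water partition coefficient), ring count, and atomic features. A compound to be identified that is given in a processable data representation (for example a compound represented as SMILES sequence text) is converted into a data graph with atoms as nodes and chemical bonds as edges. In some examples, nodes carry an atomic-number attribute and edges carry a single, double, or triple bond attribute.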
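The atoms-as-nodes, bonds-as-edges conversion can be sketched with a deliberately small SMILES reader. This is a simplified illustration, not a full SMILES parser: it assumes only unbracketed organic atoms, branches, single-digit ring closures, and the bond symbols `-`, `=`, `#`; it treats any implicit bond between two aromatic atoms as aromatic; and it marks aromatic bonds with order 1.5, a convention chosen here. A real pipeline would normally rely on a cheminformatics toolkit instead.

```python
ATOMIC_NUMBER = {"B": 5, "C": 6, "N": 7, "O": 8, "F": 9, "P": 15,
                 "S": 16, "Cl": 17, "Br": 35, "I": 53}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}
AROMATIC_ATOMS = "cnos"  # lowercase symbols accepted as aromatic atoms


def smiles_to_graph(smiles):
    """Return (nodes, edges): nodes[i] is an atomic number; edges maps a
    sorted (i, j) atom-index pair to a bond order (1.5 marks aromatic)."""
    nodes, aromatic, edges = [], [], {}
    prev = None        # atom the next atom will bond to
    pending = None     # explicit bond symbol waiting for its second atom
    branches = []      # atoms to return to when a branch closes
    ring_open = {}     # ring-closure digit -> (atom index, bond symbol)

    def add_edge(i, j, symbol):
        if symbol is not None:
            order = BOND_ORDER[symbol]
        elif aromatic[i] and aromatic[j]:
            order = 1.5
        else:
            order = 1
        edges[(min(i, j), max(i, j))] = order

    pos = 0
    while pos < len(smiles):
        ch, two = smiles[pos], smiles[pos:pos + 2]
        symbol, is_aromatic, step = None, False, 1
        if two in ("Cl", "Br"):
            symbol, step = two, 2
        elif ch in ATOMIC_NUMBER:
            symbol = ch
        elif ch in AROMATIC_ATOMS:
            symbol, is_aromatic = ch.upper(), True

        if symbol is not None:            # atom: add node, bond to previous
            nodes.append(ATOMIC_NUMBER[symbol])
            aromatic.append(is_aromatic)
            cur = len(nodes) - 1
            if prev is not None:
                add_edge(prev, cur, pending)
            prev, pending = cur, None
        elif ch in BOND_ORDER:            # explicit bond symbol
            pending = ch
        elif ch == "(":                   # branch opens at current atom
            branches.append(prev)
        elif ch == ")":                   # branch closes: go back
            prev = branches.pop()
        elif ch.isdigit():                # ring closure opens or closes
            if ch in ring_open:
                start, opened_with = ring_open.pop(ch)
                add_edge(start, prev, pending or opened_with)
            else:
                ring_open[ch] = (prev, pending)
            pending = None
        else:
            raise ValueError(f"unsupported SMILES token: {ch!r}")
        pos += step
    return nodes, edges


nodes, edges = smiles_to_graph("CC(=O)O")  # acetic acid
print(nodes, edges)
```

The node list and edge dictionary can be handed directly to a graph-comparison step; hydrogens are left implicit, as they are in the SMILES input.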
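In some embodiments, after the target identification result corresponding to the compound samples to be identified has been obtained, the method further comprises: outputting the target identification result, and storing the compound samples to be identified that correspond to the target identification result. This supports downstream applications of small-molecule compound clustering or data storage. The compound clustering method of the invention therefore not only clusters small-molecule compounds quickly and accurately but also improves the efficiency and precision of applications that depend on small-molecule compound clustering. Some embodiments are described below by way of illustration.
Fig. 3 is a second flowchart of a compound clustering method provided by an embodiment of the invention. As shown in Fig. 3, the preferred steps of this embodiment are as follows: input compound samples to be identified in SMILES sequence text format; split the compound samples to be identified into sample subsets that carry initial identification labels by statistical clustering; convert the compound samples to be identified in each sample subset from SMILES sequence text format into discrete mathematical graphs; topologically compare the discrete mathematical graphs in each sample subset, compute the most similar structure, and obtain the graph labels; obtain the target identification result corresponding to the compound samples to be identified from the sample graphs (that is, the discrete mathematical graphs), the initial identification labels, and the graph labels; and output the clustering result for the SMILES-sequence compound samples to be identified.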
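Example 1:
Step 1: acquire compound samples to be identified, represented as SMILES sequence text.
Step 2: obtain, by statistical clustering, a sample subset corresponding to one small-molecule compound cluster, for example with initial identification label Ti2. The statistical clustering methods include K-means, OPTICS, and the like.
Step 3: convert the compounds to be identified in the sample subset from SMILES sequence text into the corresponding sample graphs.
Step 4: obtain the target identification result corresponding to the compound samples to be identified from the sample graphs and identification labels. For example, compounds that do not belong to class Ti2 are removed from the sample subset according to the preliminary identification label (Ti2). This step further includes obtaining, from the sample graphs, the initial identification label, and the graph label, a reference standard graph for which the initial identification label and the graph label reach a preset threshold; performing a similarity calculation between every sample graph and the reference standard graph; and, if a sample graph matches the reference standard graph, clustering the compound to be identified that corresponds to that sample graph into the same class, for example identifying the compounds in the sample subset that belong to class Ti2. Compounds that do not belong to class Ti2 are thereby removed more accurately, and the target identification result for all compounds to be identified is obtained.
Step 5: output the clustered sample subset for subsequent downstream analysis applications or data storage.
Example 2:
Step 1: acquire compound samples to be identified, represented as SMILES sequence text.
Step 2: compute chemical property features with functions from the RDKit package for Python (a programming language), and obtain by statistical clustering the sample subsets corresponding to at least two small-molecule compound clusters, for example with at least two of the initial identification labels Ti1, Ti2, Ti3, and Ti4. The statistical clustering methods include K-means, OPTICS, and the like.
Step 3: convert the compounds to be identified in the sample subsets from SMILES sequence text into the corresponding sample graphs.
Step 4: obtain the target identification result corresponding to the compound samples to be identified from the sample graphs and identification labels. For example, starting from the preliminary identification labels (for example Ti1, Ti2, and Ti3), the compounds in the sample subsets are finally and accurately identified as belonging to the three classes Ti1, Ti2, and Ti4. This step further includes clustering into the same class, according to the sample graphs, the initial identification labels, and the graph labels, the compounds to be identified in the sample graphs that satisfy both the initial identification label and the graph label; and obtaining, according to the initial identification labels and graph labels of the different classes of compounds, the identification class of every compound sample to be identified.
Step 5: output the clustered sample subsets for subsequent downstream analysis applications or data storage.
Example 3:
Step 1: acquire compound samples to be identified, represented as SMILES sequence text.
Step 2: obtain features of the compounds to be identified, for example Morgan fingerprints (molecular fingerprints) or embedding vectors from a trained deep-learning model; apply the PCA algorithm to reduce the data to features of 10 to 100 dimensions; then obtain by statistical clustering the sample subsets corresponding to at least two small-molecule compound clusters, for example with at least two of the initial identification labels Ti1, Ti2, Ti3, and Ti4. The statistical clustering method is, for example, K-means.
Step 3: convert the compounds to be identified in the sample subsets from SMILES sequence text into the corresponding sample graphs.
Step 4: obtain the target identification result corresponding to the compound samples to be identified from the sample graphs and identification labels. For example, starting from the preliminary identification labels (Ti1, Ti2, and Ti3), the compounds in the sample subsets are finally and accurately identified as belonging to the four classes Ti1, Ti2, Ti4, and Ti5. This step further includes clustering into the same class, according to the sample graphs, the initial identification labels, and the graph labels, the compounds to be identified in the sample graphs that satisfy both the initial identification label and the graph label; and obtaining, according to the initial identification labels and graph labels of the different classes of compounds, the identification class of every compound sample to be identified.
Step 5: output the clustered sample subsets for subsequent downstream analysis applications or data storage.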
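Fig. 4 is a schematic diagram of a compound clustering apparatus provided by an embodiment of this specification. As shown in Fig. 4, the apparatus 30 comprises:
an acquisition module 31, configured to acquire compound samples to be identified and split the compound samples to be identified into sample subsets that carry initial identification labels;
an obtaining module 32, configured to obtain sample graphs from the sample subsets;
an output module 33, configured to obtain, from the sample graphs and identification labels, a target identification result corresponding to the compound samples to be identified, wherein the identification labels include the initial identification labels.
The apparatus of the embodiment shown in Fig. 4 can be used to perform the steps of the method embodiment shown in Fig. 2; its implementation principle and technical effects are similar and are not repeated here.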
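Fig. 5 is a schematic structural diagram of a compound clustering system provided by an embodiment of this specification. As shown in Fig. 5, the system 40 comprises a processor 41, a memory 42, and a computer program, wherein:
the memory 42 is configured to store the computer program, and may also be a flash memory. The computer program is, for example, an application program or functional module that implements the above method.
The processor 41 is configured to execute the computer program stored in the memory so as to implement the steps performed by the device in the above method. For details, see the related description in the preceding method embodiments.
Optionally, the memory 42 may be either independent of or integrated with the processor 41.
When the memory 42 is a component independent of the processor 41, the device may further comprise:
a bus 43, configured to connect the memory 42 and the processor 41.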
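The invention further provides a readable storage medium storing a computer program which, when executed by a processor, implements the methods provided by the various embodiments described above.
The readable storage medium may be a computer storage medium or a communication medium. Communication media include any medium that facilitates the transfer of a computer program from one place to another. A computer storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. For example, a readable storage medium is coupled to a processor so that the processor can read information from, and write information to, the readable storage medium. The readable storage medium may of course also be an integral part of the processor. The processor and the readable storage medium may be located in an application-specific integrated circuit (ASIC). In addition, the ASIC may be located in user equipment. The processor and the readable storage medium may of course also exist as discrete components in a communication device. The readable storage medium may be a read-only memory (ROM), a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
The invention further provides a program product that includes execution instructions stored in a readable storage medium. At least one processor of a device can read the execution instructions from the readable storage medium, and execution of the instructions by the at least one processor causes the device to implement the methods provided by the various embodiments described above.
In the above device embodiments, it should be understood that the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the invention may be carried out directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The embodiments in this specification are described in a progressive manner; for parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the product embodiments described later correspond to the method and are therefore described relatively briefly; for related details, see the corresponding description of the system embodiments.
The above are only specific embodiments of the present application, but the scope of protection of the present application is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall fall within the scope of protection of the present application. The scope of protection of the present application shall therefore be determined by the scope of protection of the claims.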

Claims (10)

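  1. A compound clustering method, characterized in that the method comprises:
    acquiring compound samples to be identified, and splitting the compound samples to be identified into sample subsets that carry initial identification labels;
    obtaining sample graphs from the sample subsets;
    obtaining, from the sample graphs and identification labels, a target identification result corresponding to the compound samples to be identified, wherein the identification labels include the initial identification labels.
  2. The method according to claim 1, characterized in that the identification labels include graph labels.
  3. The method according to claim 2, characterized in that the method further comprises:
    obtaining the graph labels by training, through feature extraction on the sample graphs combined with the fingerprint features of the compounds.
  4. The method according to claim 2, characterized in that obtaining, from the sample graphs and identification labels, the target identification result corresponding to the compound samples to be identified comprises:
    clustering into the same class, according to the sample graphs, the initial identification labels, and the graph labels, the compounds to be identified in the sample graphs that satisfy both the initial identification label and the graph label;
    obtaining, according to the initial identification labels and graph labels of the different classes of compounds, the identification class of every compound sample to be identified.
  5. The method according to claim 4, characterized in that clustering into the same class the compounds to be identified in the sample graphs that satisfy both the initial identification label and the graph label comprises:
    obtaining a reference standard graph for which the initial identification label and the graph label reach a preset threshold;
    performing a similarity calculation between every sample graph and the reference standard graph, and, if a sample graph matches the reference standard graph, clustering the compound to be identified that corresponds to that sample graph into the same class.
  6. The method according to claim 1, characterized in that obtaining sample graphs from the sample subsets comprises:
    converting each compound sample to be identified in a sample subset into its corresponding sample graph according to the attribute features of the compounds in the sample subset.
  7. The method according to claim 1, characterized in that, after the target identification result corresponding to the compound samples to be identified has been obtained, the method further comprises:
    outputting the target identification result, and storing the compound samples to be identified that correspond to the target identification result.
  8. A compound clustering apparatus, characterized in that the apparatus comprises:
    an acquisition module, configured to acquire compound samples to be identified and split the compound samples to be identified into sample subsets that carry initial identification labels;
    an obtaining module, configured to obtain sample graphs from the sample subsets;
    an output module, configured to obtain, from the sample graphs and identification labels, a target identification result corresponding to the compound samples to be identified, wherein the identification labels include the initial identification labels.
  9. A compound clustering system, characterized by comprising a memory, a processor, and a computer program, the computer program being stored in the memory, and the processor running the computer program to perform the compound clustering method according to any one of claims 1 to 7.
  10. A readable storage medium, characterized in that the readable storage medium stores a computer program which, when executed by a processor, implements the compound clustering method according to any one of claims 1 to 7.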
PCT/CN2022/096714 2022-05-17 2022-06-01 Method, apparatus, system, and storage medium for compound clustering WO2023221186A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210537018.XA CN115049866A (zh) 2022-05-17 2022-05-17 Method, apparatus, system, and storage medium for compound clustering
CN202210537018.X 2022-05-17

Publications (1)

Publication Number Publication Date
WO2023221186A1 true WO2023221186A1 (zh) 2023-11-23

Family

ID=83159080

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/096714 WO2023221186A1 (zh) 2022-05-17 2022-06-01 Method, apparatus, system, and storage medium for compound clustering

Country Status (3)

Country Link
US (1) US20230376794A1 (zh)
CN (1) CN115049866A (zh)
WO (1) WO2023221186A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180018316A1 (en) * 2016-07-15 2018-01-18 At&T Intellectual Property I, Lp Data analytics system and methods for text data
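CN110516558A (zh) * 2019-08-01 2019-11-29 仲恺农业工程学院 Sample data acquisition method, apparatus, computer device, and storage medium
CN111582185A (zh) * 2020-05-11 2020-08-25 北京百度网讯科技有限公司 Method and apparatus for recognizing images
CN113159072A (zh) * 2021-04-22 2021-07-23 中国人民解放军国防科技大学 Online extreme learning machine target recognition method and system based on consistency regularization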

Also Published As

Publication number Publication date
CN115049866A (zh) 2022-09-13
US20230376794A1 (en) 2023-11-23

Similar Documents

Publication Publication Date Title
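WO2022078346A1 Text intent recognition method, apparatus, electronic device, and storage medium
WO2019080411A1 Electronic apparatus, face image cluster search method, and computer-readable storage medium
US9471886B2 Class discriminative feature transformation
CN109189892B Recommendation method and apparatus based on article comments
CN109857957B Method for building a label library, electronic device, and computer storage medium
CN112434194A Knowledge-graph-based similar user identification method, apparatus, device, and medium
CN113127605A Method, system, electronic device, and medium for building a target recognition model
CN111738009B Entity word label generation method, apparatus, computer device, and readable storage medium
WO2023221186A1 Method, apparatus, system, and storage medium for compound clustering
CN110209895B Vector retrieval method, apparatus, and device
CN110377721B Automatic question answering method, apparatus, storage medium, and electronic device
CN111488479B Hypergraph construction method and apparatus, computer system, and medium
CN111523309A Method, apparatus, storage medium, and electronic device for drug information normalization
US20230186613A1 Sample Classification Method and Apparatus, Electronic Device and Storage Medium
CN114691907B Cross-modal retrieval method, device, and medium
CN112989040B Dialogue text annotation method, apparatus, electronic device, and storage medium
CN113362809B Speech recognition method, apparatus, and electronic device
CN114139530A Synonym extraction method, apparatus, electronic device, and storage medium
CN114091458A Entity recognition method and system based on model fusion
CN113688243A Method, apparatus, device, and storage medium for annotating entities in sentences
CN113505257A Image retrieval method, trademark retrieval method, electronic device, and storage medium
CN113313126A Method, computing device, and computer storage medium for image recognition
CN111708884A Text classification method, apparatus, and electronic device
CN112417131A Information recommendation method and apparatus
CN114388084A Human phenotype ontology term extraction system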

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22942220

Country of ref document: EP

Kind code of ref document: A1