WO2022096861A2 - Computational drug target selection - Google Patents

Computational drug target selection

Info

Publication number
WO2022096861A2
Authority
WO
WIPO (PCT)
Prior art keywords
publication
documents
drug
data
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/GB2021/052813
Other languages
English (en)
French (fr)
Other versions
WO2022096861A3 (en)
Inventor
Daniel James CROWTHER
David NARGANES-CARLÓN
Guillermo SERRANO-NÁJERA
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Exscientia Ltd
Original Assignee
Exscientia Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Exscientia Ltd
Priority to KR1020237017962A (KR20230128266A)
Priority to CN202180074273.9A (CN116508017A)
Priority to JP2023550727A (JP2023547964A)
Priority to EP21884122.9A (EP4238097A2)
Publication of WO2022096861A2
Publication of WO2022096861A3
Priority to US18/138,705 (US20230352193A1)
Anticipated expiration
Ceased (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/091 Active learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00 ICT specially adapted for the handling or processing of medical references
    • G16H70/20 ICT specially adapted for the handling or processing of medical references relating to practices or guidelines
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00 ICT specially adapted for the handling or processing of medical references
    • G16H70/40 ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3325 Reformulation based on results of preceding query
    • G06F16/3326 Reformulation based on results of preceding query using relevance feedback from the user, e.g. relevance feedback on documents, documents sets, document terms or passages
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Creation or modification of classes or clusters
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/382 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using citations
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • the invention relates to methods and systems for the computational selection of target molecules or genes, e.g. drug targets, with which molecules, e.g. drugs, are to be designed to interact in an optimal manner.
  • Drug discovery is the process of identifying candidate compounds for progression to the next stage of drug development, e.g. pre-clinical trials. Such candidate compounds are required to satisfy certain criteria for further development.
  • Modern drug discovery involves the identification and optimisation of initial screening ‘hit’ compounds.
  • such compounds need to be optimised relative to required criteria, which can include the optimisation of a number of different properties.
  • the properties to be optimised can include, for instance: activity against a desired biological target; selectivity against non-desired biological targets; low probability of toxicity; and good drug metabolism and pharmacokinetic properties (ADME). Only compounds satisfying the specified requirements become candidate compounds that can continue to the drug development process.
  • a drug target is something - typically a protein or nucleic acid, for instance - that exists in a living organism and with which a drug interacts, e.g. to which it binds. Such interaction with a drug causes a change in behaviour of the drug target.
  • a promising drug target may be one that has an association with a particular disease under consideration, e.g. the drug target modifies the disease or plays a role in the pathophysiology of the disease.
  • an optimal selection of drug target for a particular drug discovery project can increase the probability of identifying a candidate compound in less time, i.e. in fewer design cycles of the project. In turn, this reduces the time and/or cost associated with the particular project.
  • the present invention provides an improved method of identifying biological targets for drugs, possibly in association with particular diseases, in order to reduce the overall time and/or cost associated with the drug discovery process, e.g. to increase the efficiency of identifying a candidate compound as part of a particular drug discovery project.
  • the invention provides methods for drug discovery.
  • the methods may comprise undertaking a drug discovery project based on said at least one drug target; and optionally selecting and/or synthesising and/or testing potential therapeutic compounds against the at least one selected drug target.
  • the method includes ingesting publication data from at least one publication data source and relating to a plurality of publication documents including historical publication documents and current publication documents.
  • the method includes searching the publication data to provide an indication, for each of the publication documents, as to whether the respective publication document is associated with one or more drug targets.
  • the method includes determining an expected publication parameter for each of the one or more drug targets based on the searched publication data from the historical publication documents, and determining an actual publication parameter for each of the one or more drug targets based on the searched publication data from the current publication documents.
  • the method includes evaluating each of the one or more drug targets for selection based on its actual publication parameter relative to its expected publication parameter.
  • the method may comprise defining, for each drug target, one or more character expressions as referring to the drug target, and wherein searching the publication data comprises searching the publication data for the one or more character expressions for each drug target.
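A minimal sketch of the search step described above, assuming each drug target has a small table of character expressions and that a document counts as a match when any expression appears as a whole token. The table `TARGET_EXPRESSIONS`, the function names and the case-insensitive matching are illustrative assumptions, not details taken from the source.

```python
import re

# Hypothetical synonym table: target identifier -> character expressions.
TARGET_EXPRESSIONS = {
    "EGFR": ["EGFR", "epidermal growth factor receptor", "ERBB1"],
    "TP53": ["TP53", "p53"],
}


def compile_patterns(expressions_by_target):
    """Build one case-insensitive, whole-token regex per target."""
    patterns = {}
    for target, expressions in expressions_by_target.items():
        # Longest alternatives first so longer expressions are tried first.
        alternatives = sorted(
            (re.escape(e) for e in expressions), key=len, reverse=True
        )
        patterns[target] = re.compile(
            r"(?<![A-Za-z0-9])(?:" + "|".join(alternatives) + r")(?![A-Za-z0-9])",
            re.IGNORECASE,
        )
    return patterns


def targets_in_document(text, patterns):
    """Return the set of targets with at least one expression found in text."""
    return {target for target, pattern in patterns.items() if pattern.search(text)}


PATTERNS = compile_patterns(TARGET_EXPRESSIONS)
example = targets_in_document("Mutant p53 alters EGFR signalling.", PATTERNS)
```

The look-behind and look-ahead guards keep an expression such as `ERBB1` from matching inside a longer token such as `ERBB10`; whether matching should be case-sensitive is a design choice the source leaves open.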
  • the method may comprise classifying, for each drug target, each of the one or more character expressions to be a safe character expression or an unsafe character expression.
  • the classification may be based on a likelihood that an instance of the character expression in the publication data refers to the drug target.
  • the publication document is determined to be associated with the drug target.
  • One or more character expressions may be user-defined to be classified as safe character expressions.
  • One or more character expression unsafe characteristics may be user-defined to indicate that a corresponding character expression is unsafe. Character expressions in the searched publication data that exhibit one or more of the character expression unsafe characteristics may be classified as unsafe character expressions.
  • the one or more user-defined character expression unsafe characteristics may include one or more of: a character expression corresponding to a word in a particular natural language; a character expression having fewer than a prescribed number of characters, optionally wherein the prescribed number is three; and, a character expression that is defined to refer to at least two different drug targets.
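The three unsafe characteristics listed above can be sketched as simple rules. The word list `COMMON_WORDS` is a tiny stand-in for a real natural-language dictionary, and lower-casing expressions before comparison is an assumption made for illustration only.

```python
from collections import defaultdict

# Stand-in for a natural-language dictionary (hypothetical contents).
COMMON_WORDS = {"cat", "was", "set", "impact", "rest"}
MIN_LENGTH = 3  # the source gives three as an optional prescribed number


def classify_expressions(expressions_by_target,
                         common_words=COMMON_WORDS,
                         min_length=MIN_LENGTH):
    """Label each expression 'safe' or 'unsafe' using the three rules."""
    owners = defaultdict(set)
    for target, expressions in expressions_by_target.items():
        for expression in expressions:
            owners[expression.lower()].add(target)

    labels = {}
    for expression, targets in owners.items():
        unsafe = (
            expression in common_words      # ordinary word in a natural language
            or len(expression) < min_length  # fewer than the prescribed characters
            or len(targets) >= 2             # refers to at least two targets
        )
        labels[expression] = "unsafe" if unsafe else "safe"
    return labels


labels = classify_expressions({
    "CAT": ["CAT", "catalase"],
    "TBXT": ["T", "TBXT"],
    "IL12B": ["p40"],
    "LAMTOR5": ["p40"],
})
```

In this toy input, `cat` fails the dictionary rule, `t` fails the length rule and `p40` fails the shared-expression rule, while `catalase` and `tbxt` remain safe.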
  • One or more character expression ambiguity characteristics may be defined to ascribe an ambiguity score to one or more of the character expressions.
  • Each of the character expressions may be classified to be a safe character expression or an unsafe character expression based on the correspondingly ascribed ambiguity score.
  • One or more of the character expression ambiguity characteristics may be user-defined.
  • the character expression may be classified as an unsafe character expression if its ambiguity score is greater than a prescribed threshold ambiguity score.
  • the one or more character expression ambiguity characteristics may include one or more of, for each drug target: a total number of publication documents in the publication data that include the defined one or more character expressions referring to the drug target; a number of publication documents in the publication data that includes one of the defined character expressions referring to the drug target, relative to the total number of publication documents in the publication data that include the defined one or more character expressions referring to the drug target; a number of characters in one of the defined character expressions referring to the drug target; a frequency with which each character in one of the defined character expressions referring to the drug target occurs in the publication data, optionally a sum of the frequency for each of the characters in the one character expression, optionally a logarithm of the sum; a number of the defined character expressions for the one or more drug targets that include the one defined character expression; a probability that a publication document in the publication data that includes one of the defined character expressions, other than a selected character expression that is a safe character expression from the defined character expressions referring to the drug
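As a rough illustration, three of the ambiguity characteristics listed above (expression length, the logarithm of the summed character frequencies, and the number of other defined expressions that contain the expression) could be computed as follows. The function names, the feature names and the decision to ignore whitespace and case are assumptions; the remaining characteristics are not covered.

```python
import math
from collections import Counter


def character_frequencies(documents):
    """Relative frequency of each non-space character across the corpus."""
    counts = Counter(
        ch for doc in documents for ch in doc.lower() if not ch.isspace()
    )
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def ambiguity_features(expression, all_expressions, char_freq):
    """Compute a small feature dictionary for one character expression."""
    expr = expression.lower()
    freq_sum = sum(char_freq.get(ch, 0.0) for ch in expr if not ch.isspace())
    # Count the other defined expressions that contain this one as a substring.
    containing = sum(
        1 for other in all_expressions
        if other.lower() != expr and expr in other.lower()
    )
    return {
        "n_characters": len(expression),
        "log_char_frequency_sum": math.log(freq_sum) if freq_sum > 0 else float("-inf"),
        "n_containing_expressions": containing,
    }


char_freq = character_frequencies(["ab", "b"])
features = ambiguity_features("ab", ["ab", "abc", "x"], char_freq)
```

Feature dictionaries of this kind could then be passed to whichever scoring model is chosen; the sketch stops short of ascribing the ambiguity score itself.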
  • the method may comprise applying a machine learning algorithm to ascribe the ambiguity score to each of the one or more character expressions based on the one or more character expression ambiguity characteristics.
  • the machine learning algorithm may use the one or more character expression unsafe characteristics to ascribe the ambiguity score to each of the one or more of the character expressions.
  • the machine learning algorithm may comprise a positive-unlabelled learning technique.
  • the machine learning algorithm may comprise application of a random forest classifier.
  • a subset of the ascribed ambiguity scores are inspected by a user to determine whether to manually change any of the subset of ascribed ambiguity scores.
  • the subset may correspond to a prescribed number of the character expressions having the highest ascribed ambiguity scores.
  • the method may comprise determining, for each identified pair of publication documents, a co-citation value representative of a number of publication documents that cite both of the pair of publication documents.
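The co-citation value described above can be sketched as a pair count over reference lists, assuming the citation data is available as a mapping from each citing document to the documents it cites. The mapping `CITATIONS` and the document identifiers are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def co_citation_counts(citations):
    """Count, for each pair of documents, how many documents cite both.

    citations: citing document id -> iterable of cited document ids.
    """
    counts = Counter()
    for cited in citations.values():
        # Each unordered pair in one reference list is co-cited once.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts


# Hypothetical citation data.
CITATIONS = {
    "D1": ["A", "B", "C"],
    "D2": ["A", "B"],
    "D3": ["B", "C"],
    "D4": ["A"],
}
CO_CITATION = co_citation_counts(CITATIONS)
```

Sorting each pair gives a canonical key, so `("A", "B")` and `("B", "A")` are never counted separately.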
  • the method may comprise assigning pairs of publication documents to one of a plurality of communities of publication documents based on their determined co-citation value and on the publication documents that cite the pairs of publication documents.
  • assigning pairs of publication documents to one of the plurality of communities includes application of a greedy optimisation algorithm.
  • the method may comprise determining, for each of the plurality of communities of publication documents, whether to associate the community with one of the drug targets.
  • the determination may comprise determining which of the defined character expressions referring to the one drug target are present in the publication data of each of the publication documents in the community.
  • the determination may comprise determining a proportion of the publication documents in the community that include at least one safe character expression in their publication data. In some embodiments, it is determined to associate the community with the one of the drug targets if the proportion is greater than a prescribed threshold proportion.
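A minimal sketch of the proportion test described above, assuming each community is a list of document texts and that plain substring matching is an acceptable stand-in for the real expression search. The threshold value of 0.5 is purely illustrative; the source only says the threshold is prescribed.

```python
def safe_proportion(community_texts, safe_expressions):
    """Fraction of documents containing at least one safe expression."""
    if not community_texts:
        return 0.0
    hits = sum(
        1 for text in community_texts
        if any(expr.lower() in text.lower() for expr in safe_expressions)
    )
    return hits / len(community_texts)


def associate_community(community_texts, safe_expressions, threshold=0.5):
    """Associate the community with the target if the proportion exceeds threshold."""
    return safe_proportion(community_texts, safe_expressions) > threshold


community = [
    "Catalase activity in hepatocytes",
    "Oxidative stress and catalase expression",
    "A survey of feline behaviour",
]
proportion = safe_proportion(community, ["catalase"])
associated = associate_community(community, ["catalase"])
```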
  • searching for the pairs of publication documents includes searching for pairs of publication documents that each include at least one of the character expressions defined as referring to one of the drug targets.
  • the publication data for at least some of the publication documents does not include citation data.
  • the method may comprise determining whether to assign the publication document to one of the communities associated with one of the drug targets based on its publication data, in particular on one or more of the defined character expressions referring to the drug target in its publication data.
  • if the publication data of the publication document includes at least one instance of a safe character expression, then it is determined to assign the publication document to the one of the communities associated with the one of the drug targets.
  • the determination whether to assign the publication document to one of the communities associated with one of the drug targets is performed using a machine learning algorithm.
  • the machine learning algorithm may comprise a positive-unlabelled learning technique.
  • the machine learning algorithm may comprise application of a machine learning classifier, optionally at least one of: a logistic regression classifier; an extra tree classifier; a Gaussian process classifier; a k-nearest neighbour classifier; a ridge classifier; a random forest classifier; and, a support vector machine classifier.
  • the expected publication parameter may be an expected number of publication documents associated with the drug target.
  • the actual publication parameter may be an actual number of publication documents associated with the drug target.
  • the expected publication parameter may be one of: an expected number of clinical trials associated with the drug target; an expected number of review publication documents associated with the drug target; and, an expected number of publication documents linked to a defined size of company.
  • the actual publication parameter may be one of: an actual number of clinical trials associated with the drug target; an actual number of review publication documents associated with the drug target; and, an actual number of publication documents linked to the defined size of company, respectively.
  • determining the expected publication parameter comprises using a machine learning algorithm trained using the searched publication data from the historical publication documents.
  • the machine learning algorithm may be a recurrent neural network algorithm.
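To make the idea of an expected publication parameter concrete, the sketch below swaps the recurrent neural network for a much simpler stand-in: an ordinary least-squares trend fitted to yearly historical counts and extrapolated to a later year. This is not the model described in the source; the function name and the yearly-count input format are assumptions.

```python
def expected_count(yearly_counts, target_year):
    """Extrapolate a linear trend through historical yearly counts.

    yearly_counts: year -> number of documents associated with a target.
    """
    years = sorted(yearly_counts)
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(yearly_counts[y] for y in years) / n
    sxx = sum((y - mean_x) ** 2 for y in years)
    if sxx == 0:
        # Only one distinct year: fall back to the mean count.
        return mean_y
    slope = sum(
        (y - mean_x) * (yearly_counts[y] - mean_y) for y in years
    ) / sxx
    return mean_y + slope * (target_year - mean_x)


history = {2015: 10, 2016: 12, 2017: 14}
prediction = expected_count(history, 2018)
```

A trained sequence model would replace `expected_count` in practice; the rest of the comparison against the actual count is unaffected by that choice.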
  • evaluating the drug targets for selection comprises ranking the drug targets based on a comparison of their respective actual and expected publication parameters.
  • the drug targets may be ranked according to a parameter indicative of a difference between their respective actual and expected publication parameters.
  • the method may comprise determining a target-target co-occurrence parameter between pairs of the drug targets, the target-target co-occurrence parameter being determined based on the indication from the searched publication data as to which publication documents both drug targets in a pair are associated with.
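The ranking step can be sketched as a sort on the difference between actual and expected values. Using the raw difference, rather than a ratio or a normalised score, is one possible reading of "a parameter indicative of a difference"; the target names and numbers are invented.

```python
def rank_targets(actual, expected):
    """Rank targets by actual minus expected, largest surplus first."""
    scored = [
        (target, actual[target] - expected[target])
        for target in actual.keys() & expected.keys()
    ]
    # Sort by descending difference, then by name for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


actual_counts = {"G1": 120, "G2": 40, "G3": 75}
expected_counts = {"G1": 100.0, "G2": 45.0, "G3": 30.0}
ranking = rank_targets(actual_counts, expected_counts)
```

Targets at the top of the list are those whose current publication activity most exceeds what the historical data suggested.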
  • Each target-target co-occurrence parameter may be indicative of the number of publication documents in which both of the drug targets in a pair appear.
  • the method may comprise evaluating the one or more drug targets for selection based on the determined target-target co-occurrence parameters.
  • the method may comprise searching the publication data to provide an indication, for each of the publication documents, as to whether the respective publication document is associated with one or more diseases.
  • the method may comprise defining, for each disease, one or more character expressions as referring to the disease.
  • Searching the publication data may comprise searching the publication data for the one or more character expressions for each disease.
  • the method may comprise determining a target-disease co-occurrence parameter between each of the drug targets and each of the diseases.
  • the target-disease co-occurrence parameter may be determined based on the indication from the searched publication data as to which publication documents each drug target and each disease are associated with.
  • Each target-disease co-occurrence parameter may be indicative of the number of publication documents in which one of the drug targets and one of the diseases appear.
  • the method may comprise evaluating the one or more drug targets for selection based on the determined target-disease co-occurrence parameters.
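A short sketch of the target-disease co-occurrence count, assuming the earlier search steps have produced two mappings from document identifier to the targets and diseases associated with that document. Both mappings and all identifiers are hypothetical.

```python
from collections import Counter
from itertools import product


def target_disease_cooccurrence(doc_targets, doc_diseases):
    """Count documents in which a target and a disease both appear."""
    counts = Counter()
    for doc_id, targets in doc_targets.items():
        diseases = doc_diseases.get(doc_id, ())
        # Every (target, disease) combination in one document counts once.
        for pair in product(sorted(set(targets)), sorted(set(diseases))):
            counts[pair] += 1
    return counts


DOC_TARGETS = {"D1": ["EGFR", "TP53"], "D2": ["EGFR"], "D3": ["TP53"]}
DOC_DISEASES = {"D1": ["lung cancer"], "D2": ["lung cancer", "glioma"]}
COOCCURRENCE = target_disease_cooccurrence(DOC_TARGETS, DOC_DISEASES)
```

The resulting counts can serve as edge weights in a co-occurrence network of the kind Figure 5 is said to illustrate.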
  • the method may comprise applying a topic modelling algorithm to the publication data for the publication documents associated with each of the drug targets to obtain one or more topics associated with each drug target.
  • the method may comprise evaluating the one or more drug targets for selection based on the obtained one or more topics.
  • the method may comprise determining, for each drug target, errors in association of one or more publication documents with the drug target based on the obtained one or more topics.
  • the topic modelling algorithm may include at least one of: a latent Dirichlet allocation algorithm; and, a non-negative matrix factorisation algorithm.
  • the publication data relating to one or more of the publication documents may include one or more of: a title of the publication document; an abstract of the publication document; and, one or more keywords associated with the publication document.
  • the publication data may include a publication date for each of the plurality of publication documents.
  • the publication date may define whether each of the publication documents is a historical publication document or a current publication document.
  • publication documents having a publication date prior to a predefined threshold date are defined to be historical publication documents.
  • publication documents having a publication date after the predefined threshold date are defined to be current publication documents.
  • publication documents having a publication date in a predefined threshold date range are defined to be current publication documents.
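The date-based split described above can be sketched as follows. Treating a document published exactly on the threshold date as current, and the optional `current_end` bound for the current date range, are assumptions made for this illustration.

```python
from datetime import date


def split_by_date(publication_dates, threshold, current_end=None):
    """Split document ids into historical and current lists.

    publication_dates: document id -> publication date.
    Documents published after current_end (if given) fall in neither list.
    """
    historical, current = [], []
    for doc_id, published in publication_dates.items():
        if published < threshold:
            historical.append(doc_id)
        elif current_end is None or published <= current_end:
            current.append(doc_id)
    return historical, current


DATES = {
    "D1": date(2015, 6, 1),
    "D2": date(2019, 1, 15),
    "D3": date(2020, 3, 9),
    "D4": date(2022, 7, 30),
}
historical_ids, current_ids = split_by_date(
    DATES, threshold=date(2019, 1, 1), current_end=date(2020, 12, 31)
)
```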
  • the at least one publication data source may include at least one online publication data source.
  • the one or more drug targets may include one or more genes, optionally one or more human genes, optionally one or more proteins encoded by such genes.
  • the method may comprise using the evaluation of the one or more drug targets to inform selection of at least one of the drug targets for use in a drug discovery project.
  • the method may comprise designing the drug discovery project by selecting at least one of the drug targets for use in the drug discovery project based on the evaluation.
  • the method may comprise undertaking the drug discovery project using the at least one selected drug target.
  • undertaking the drug discovery project includes selecting, optionally synthesising, and testing (in silico, in vitro and/or in vivo) compounds against the at least one selected drug target.
  • a method for identifying a drug / compound having binding affinity for a drug target / target molecule comprising undertaking a drug discovery project (e.g. based on a method for identifying a drug target according to aspects and embodiments disclosed herein) and optionally selecting and/or synthesising and/or testing compounds against the at least one selected drug target to identify a compound having therapeutic activity against the drug target; wherein ‘therapeutic activity’ may include, without limitation, a desirable binding characteristic (e.g. affinity, selectivity); inhibition characteristic; agonist or antagonist characteristics.
  • a non-transitory, computer-readable storage medium storing instructions thereon that, when executed by a computer processor, cause the computer processor to perform the method described above.
  • a computer device for drug target selection.
  • the computer device is configured to ingest, receive or download publication data from at least one publication data source and relating to a plurality of publication documents including historical publication documents and current publication documents.
  • the computer device is configured to search the publication data to provide an indication, for each of the publication documents, as to whether the respective publication document is associated with one or more drug targets.
  • the computer device is configured to determine an expected publication parameter for each of the one or more drug targets based on the searched publication data from the historical publication documents, and determine an actual publication parameter for each of the one or more drug targets based on the searched publication data from the current publication documents.
  • the computer device is configured to evaluate each of the one or more drug targets for selection based on its actual publication parameter relative to its expected publication parameter.
  • Figure 1 summarises the steps of a computational drug selection method in accordance with the invention;
  • Figure 2 illustrates a graph database showing relationships between publication documents determined to be associated with a particular gene of interest using the method of Figure 1;
  • Figure 3 illustrates a comparison of predicted versus real publication dynamics associated with different genes, determined using the method of Figure 1;
  • Figure 4 illustrates predicted relative to real publication dynamics for different genes associated with different groups of diseases, determined using the method of Figure 1;
  • Figure 5 illustrates a co-occurrence network showing gene-gene connections and gene-disease connections determined using the method of Figure 1;
  • Figure 6 illustrates timelines for the number of publications mentioning different genes of interest for different groups of publications each related to different extracted topics as determined using the method of Figure 1.
  • Molecular or drug design can be considered a multi-dimensional optimisation problem that uses the hypothesis generation and experimentation cycle to advance knowledge.
  • Each compound design can be considered a hypothesis which is falsified in experimentation.
  • the experimental results are represented as structure-activity relationships, which construct a landscape of hypotheses as to which chemical structure is likely to contain the desired characteristics.
  • the process of drug design is also an optimisation problem as each project needs a defined product profile - i.e. drug target function - of desired, specified attributes against which hit compounds are analysed.
  • the drug discovery process is typically performed in iterations known as design cycles. At each iteration a set of molecules or compounds is synthesised, and their biological properties are measured. The activities are analysed, and a new set of compounds is proposed, based on what has been learned from previous iterations. This process is repeated until a clinical candidate is found. As well as activity, the measured biological properties can include one or more of selectivity, toxicity, absorption, distribution, metabolism, and excretion.
  • the drug discovery process is generally time consuming and expensive. Efficiencies that can be found at any stage of the process can therefore help in reducing the time and cost associated with a drug discovery project. Pharmaceutical companies are actively looking for ways to reduce their attrition rates, the time taken for drug development, and the associated development costs.
  • the present invention recognises that computational methods can be utilised to identify trends in the published literature in respect of potential drug targets, e.g. genes, which can be used to inform selection of drug targets for particular drug discovery projects, for instance.
  • the invention is advantageous in that it provides a computational method for drug target selection that can detect changing trends in relation to certain genes, for instance, which is interpreted to indicate a change in the fundamental assumptions about a particular gene, e.g. a scientific breakthrough regarding a particular gene.
  • a first step of the computational drug target selection method comprises ingesting publication data from at least one publication data source.
  • the publication data source may be an online publication data source, and may include a database such as PubMed® that has access to millions of publication documents in the form of academic or journal articles in the particular field of interest.
  • Publication data may additionally or alternatively be ingested from other sources such as published clinical trial report data, published patent document data and/or published preprints of articles. It will be understood that publication data may be obtained and ingested from any suitable source of such publication data, and from any number of these suitable sources.
  • the information included in the ingested publication data for a given publication document may depend on the particular type of document or the particular source from which the publication data is obtained.
  • the publication data may be restricted to the data available as open source data for a given publication document, with further information only being accessible behind a paywall.
  • the ingested publication data relates to a plurality of different publication documents, i.e. each publication document has publication data associated with it.
  • the publication documents for which publication data is ingested may be regarded as being split into historical publication documents and current publication documents.
  • the publication data associated with at least some of the plurality of publication documents may include a publication date of, or associated with, those publication documents. This may be used to determine or define which documents are identified as historical publication documents and which are identified as current publication documents.
  • the publication date could be a specific day, month or year of publication of the relevant document, for instance.
  • publication documents having a publication date prior to a predefined threshold date may be defined to be historical publication documents
  • publication documents having a publication date after the predefined threshold date may be defined to be current publication documents.
  • the publication date of documents may be used in any suitable way to define whether they are historical or current documents, e.g. publication documents having a publication date in a predefined threshold date range may be defined to be current publication documents.
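  • The threshold-date split described above can be sketched as follows. This is a minimal illustration only: the record layout (a dict with `id` and `publication_date` keys), the function name `split_by_threshold`, and the choice to treat a document published exactly on the threshold date as current are all assumptions, not details taken from the description.

```python
from datetime import date

def split_by_threshold(documents, threshold):
    """Split documents into (historical, current) lists around a threshold date.

    Documents dated before the threshold are historical; documents dated on or
    after it are treated as current (the on-the-day case is an assumption).
    """
    historical, current = [], []
    for doc in documents:
        published = doc.get("publication_date")
        if published is None:
            continue  # no publication date available, so the document is skipped
        if published < threshold:
            historical.append(doc)
        else:
            current.append(doc)
    return historical, current

# Hypothetical records for illustration.
docs = [
    {"id": "doc-1", "publication_date": date(2015, 6, 1)},
    {"id": "doc-2", "publication_date": date(2020, 3, 15)},
    {"id": "doc-3", "publication_date": None},
]
historical, current = split_by_threshold(docs, threshold=date(2019, 1, 1))
```

  • A date-range definition of current documents would only need a second bound in the comparison.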
  • a next step of the present invention includes searching the ingested publication data to provide an indication, for each of the publication documents, as to whether the respective publication document is associated with one or more drug targets, e.g. genes. That is, the information included in the publication data of each publication document is searched to identify potential associations or links that each publication document has to one or more drug targets. Such a search may be performed in any suitable manner. One option is to search for mentions of the drug targets of interest in the publication data.
  • the publication data may include data such as one or more of a title, an abstract, and one or more keywords associated with the publication document.
  • Such information is generally readily available as open source information from online publication databases storing journal articles, for instance, and as such this information may be readily ingested as part of the publication data associated with different publication documents.
  • the names of one or more drug targets of interest, e.g. genes of interest, may be defined, and the content of the publication data is then automatically searched for the defined drug target names. For instance, if the defined name of one of the drug targets is found in the publication data associated with a particular publication document, then that drug target may be regarded as being associated with, or linked to, the particular publication document.
  • the defined name for a drug target may for example be an approved symbol, e.g. an approved gene symbol, according to an accepted nomenclature.
  • ‘gene symbol’ is used to refer to the approved symbol for a particular gene from any of the 19084 human, protein-coding genes accepted by the HUGO Gene Nomenclature Committee; however, it will be understood that this is purely for illustrative purposes and is non-limiting.
  • a significant obstacle for the automatic analysis of the biomedical literature by computational methods is in the use of non-redundant alternative gene (drug target) synonyms, symbols, and acronyms from different competing sources that can have other meanings in other areas of research. That is, in the literature a single drug target, such as a gene, can be referred to in a number of different ways that are accepted in the field. It can also be the case that one or more of the synonyms for a particular drug target coincides with, or is part of, the term or expression for an entirely different concept, or have an entirely different meaning, in a different context. These factors make it difficult to unambiguously determine via automatic (computational) analysis which publication documents that include references coinciding with a name for a drug target do in fact refer to that drug target.
  • the method may include defining, for each drug target, one or more character expressions or synonyms as referring to that drug target.
  • These character expressions may be defined by a user, and may include any suitable characters that can be searched computationally. For instance, suitable characters may include letters used in one or more natural languages, or other types of symbols.
  • searching the publication data may include searching the publication data for the one or more defined character expressions or synonyms for each drug target.
  • a ‘gene synonym’ may refer to any of the possible gene name variations by which the scientific community refers to, or has referred to, a given gene. Approved gene symbols - as defined above - are also included in the gene synonyms.
  • ‘EGFR’ is the approved gene symbol whereas ‘EGFR’, ‘Epidermal Growth Factor Receptor’, ‘ERBB1’, ‘ErbB-1’, ‘c-erbB1’, ‘HER1’, and ‘ERBB’ may be the gene synonyms.
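  • A search of title or abstract text for the defined character expressions of each drug target might look like the sketch below. The synonym dictionary, the function names, the case-insensitive matching, and the rule that a match must not be flanked by letters or digits are illustrative assumptions; the description does not prescribe a matching strategy.

```python
import re

# Hypothetical synonym dictionary, loosely based on the EGFR example.
SYNONYMS = {
    "EGFR": ["EGFR", "Epidermal Growth Factor Receptor", "ERBB1", "HER1"],
}

def compile_patterns(synonyms):
    """Build one regular expression per drug target from its synonyms."""
    patterns = {}
    for symbol, names in synonyms.items():
        # Longest names first, so a longer synonym is tried before a shorter one.
        ordered = sorted(names, key=len, reverse=True)
        alternatives = "|".join(re.escape(name) for name in ordered)
        # Assumption: a match must not be directly flanked by a letter or digit.
        patterns[symbol] = re.compile(
            r"(?<![A-Za-z0-9])(?:" + alternatives + r")(?![A-Za-z0-9])",
            re.IGNORECASE,
        )
    return patterns

def find_targets(text, patterns):
    """Return the set of drug targets with at least one synonym in the text."""
    return {symbol for symbol, pattern in patterns.items() if pattern.search(text)}

PATTERNS = compile_patterns(SYNONYMS)
hits = find_targets("Mutations in the epidermal growth factor receptor are common.", PATTERNS)
```

  • A match found this way only indicates a potential association, because the matched expression may still be ambiguous.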
  • the different character expressions defined as potentially referring to a particular drug target may be obtained from different sources.
  • the various character expressions, i.e. synonyms, for different human genes may be gathered from different sources to sample the potential publication documents mentioning human gene names.
  • each defined character expression for a drug target may be regarded as having a different level of ambiguity associated with it. That is, some synonyms that are found in the literature that have meanings outside of a particular gene of interest, for instance, can be regarded as having a greater level of ambiguity associated with them as there is a greater likelihood that instances of those synonyms in the literature are referring to something different from the gene of interest. On the other hand, when a particular character expression or synonym does not have a (common) meaning outside of referring to a particular gene of interest, then such a synonym may be regarded as having a low level of ambiguity in that instances of that synonym in the literature are therefore likely to be referring to the gene of interest.
  • a level of ambiguity associated with a gene synonym may arise for different reasons.
  • a so-called ‘promiscuous gene name (homonym)’ may be regarded as any gene name that is a synonym of more than one gene. This could include previous official gene symbols (according to an accepted nomenclature) as these will not have been expunged from the literature.
  • ‘CDH3’ and ‘cadherin3’ are promiscuous for gene symbols ‘CDH15’ and ‘CDH3’.
  • ‘ARP1’ is a gene synonym for the gene symbols ‘NR2F2’, ‘ACTR1A’, ‘ACTR1B’, ‘ANGPTL1’, ‘APOBEC2’, ‘ARFRP1’, and ‘PITX2’.
  • a so-called ‘nested gene synonym’ may be regarded as a gene synonym that is part of another gene synonym.
  • ‘insulin’ is a nested gene synonym of ‘insulin receptor’.
  • ‘TNF’ is a nested gene synonym of ‘TNF Receptor Superfamily Member 1A’ (gene symbol ‘TNFRSF1A’) and ‘TNF Receptor Associated Factor 2’ (gene symbol ‘TRAF2’).
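  • Promiscuous and nested synonyms, as defined above, can be found mechanically from a synonym dictionary. The sketch below assumes a hypothetical mapping from gene symbol to synonyms and uses case-insensitive substring containment as the test for nesting; both are assumptions made for illustration.

```python
from collections import defaultdict

def promiscuous_synonyms(synonyms_by_gene):
    """Return synonyms that are listed for more than one gene."""
    genes_by_synonym = defaultdict(set)
    for gene, synonyms in synonyms_by_gene.items():
        for synonym in synonyms:
            genes_by_synonym[synonym].add(gene)
    return {s: genes for s, genes in genes_by_synonym.items() if len(genes) > 1}

def nested_synonyms(synonyms_by_gene):
    """Map each synonym to the longer synonyms that contain it."""
    all_synonyms = {s for synonyms in synonyms_by_gene.values() for s in synonyms}
    nested = defaultdict(set)
    for inner in all_synonyms:
        for outer in all_synonyms:
            # Assumption: nesting is case-insensitive substring containment.
            if inner.lower() != outer.lower() and inner.lower() in outer.lower():
                nested[inner].add(outer)
    return dict(nested)

# Small illustrative dictionary, not a complete synonym list.
synonyms_by_gene = {
    "INS": ["insulin"],
    "INSR": ["insulin receptor"],
    "NR2F2": ["ARP1"],
    "ACTR1A": ["ARP1"],
}
promiscuous = promiscuous_synonyms(synonyms_by_gene)
nested = nested_synonyms(synonyms_by_gene)
```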
  • the method of the invention may therefore include classifying, for each drug target (e.g. gene), each of the one or more defined character expressions (e.g. gene synonyms) found in the publication data to be a safe character expression or an unsafe character expression.
  • the classification is based on a likelihood that an instance of the character expression in the publication data refers to the drug target, i.e. a level of ambiguity associated with the character expression.
  • a safe character expression may have a relatively low level of ambiguity associated with it, whereas an unsafe character expression may have a relatively high level of ambiguity associated with it.
  • an ‘unsafe gene synonym’ may include a gene synonym that has a different meaning in other areas of research or in a different context, e.g. a word that appears in the English dictionary.
  • the ‘STAR’ gene symbol may be regarded as being an unsafe character expression as opposed to its gene synonym 'Steroidogenic Acute Regulatory Protein'.
  • ‘CCP4’ may be regarded as being unsafe as it is both a gene synonym and the name for crystallography software.
  • if the searched publication data from one of the publication documents is determined to include a safe character expression (for a particular drug target), then that publication document may be determined to be associated with that drug target. That is, that publication document is regarded as being linked to, or related to, the particular drug target under consideration, e.g. gene of interest.
  • at least some of the defined character expressions that potentially refer to the drug target, i.e. in that they appear in the literature, may be regarded as definitely being safe character expressions.
  • one or more of the character expressions may be user-defined to be classified as safe character expressions.
  • One or more characteristics of character expressions may be defined, e.g. by a user, to be indicative that character expressions exhibiting such characteristics have a high level of ambiguity such that they are unsafe.
  • character expressions in the searched publication data that exhibit such ‘character expression unsafe characteristics’ may be classified automatically as unsafe character expressions.
  • An example of a user-defined character expression unsafe characteristic may be any character expression corresponding to a word in a particular natural language, e.g. a word in the English dictionary (see the ‘STAR’ example above).
  • Another example of a user-defined character expression unsafe characteristic may be a character expression having fewer than a prescribed number of characters. For instance, the prescribed number may be three, or any other suitably-defined number.
  • a further example of a user-defined character expression unsafe characteristic may be a character expression that is defined to refer to at least two different drug targets (see ‘promiscuous gene name’ mentioned above). In this way, character expressions including at least one of the defined unsafe characteristics are regarded as being definitely unsafe. It will be understood that any suitable characteristics of a character expression may be defined as being indicative that the character expression is highly ambiguous, and so unsafe.
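  • The three example unsafe characteristics (dictionary word, too few characters, reference to more than one drug target) can be expressed as simple rules. In this sketch the function name, the reason labels, and the tiny stand-in word list are hypothetical; the minimum length of three follows the example given above.

```python
def unsafe_reasons(expression, dictionary_words, gene_count, min_length=3):
    """Return the unsafe characteristics an expression exhibits (empty if none).

    dictionary_words: lower-cased natural-language words.
    gene_count: number of drug targets the expression is defined to refer to.
    """
    reasons = []
    if len(expression) < min_length:
        reasons.append("too_short")
    if expression.lower() in dictionary_words:
        reasons.append("dictionary_word")
    if gene_count > 1:
        reasons.append("promiscuous")
    return reasons

# Stand-in for an English dictionary, for illustration only.
english_words = {"star", "cat", "rest"}

star_reasons = unsafe_reasons("STAR", english_words, gene_count=1)
arp1_reasons = unsafe_reasons("ARP1", english_words, gene_count=7)
long_name_reasons = unsafe_reasons(
    "Steroidogenic Acute Regulatory Protein", english_words, gene_count=1
)
```

  • An expression with an empty reason list is not thereby safe; it is simply left for the ambiguity scoring applied to the remaining expressions.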
  • a level of ambiguity associated with the remaining character expressions may be determined or calculated in order to classify these character expressions as being safe or unsafe.
  • One option to determine which character expressions or synonyms have a potentially high level of ambiguity - such that they are regarded as being unsafe - is to perform feature engineering to obtain variables that characterise unsafe synonyms and then to ascribe a level of ambiguity to each of the synonyms based on the obtained variables. For instance, longer gene names may be less likely to be ambiguous.
  • character expression ambiguity characteristics may be defined in any suitable manner to ascribe an ambiguity score to one or more of the character expressions. This may be by user definition, e.g. based on the feature engineering, or otherwise. Each of these character expressions may then be classified to be a safe character expression or an unsafe character expression based on the correspondingly ascribed ambiguity score. For instance, a character expression may be classified as an unsafe character expression if its ambiguity score is greater than a prescribed threshold ambiguity score.
  • a machine learning algorithm may be applied to ascribe the ambiguity score to each of the character expressions or synonyms based on the obtained character expression ambiguity characteristics, e.g. by feature engineering.
  • the machine learning algorithm may use the character expression unsafe characteristics to ascribe the ambiguity score to each of the character expressions not yet classified as safe or unsafe. That is, an ambiguity score is ascribed to synonyms in an unlabelled set of synonyms, i.e. those not in a set of synonyms previously labelled as safe or a set of synonyms previously labelled as unsafe. The ambiguity score is then used to label the as yet unlabelled synonyms as safe or unsafe.
  • the machine learning algorithm may include application of a positive-unlabelled learning technique, e.g. positive-unlabelled bagging strategy, and a classification scheme, e.g. a random forest classifier.
  • the machine learning algorithm may be run as an iterative process. After each iteration of the algorithm, a subset of the ascribed ambiguity scores may be inspected by a user to determine whether to manually change any of them, i.e. to correct classifications made by the algorithm in order to train the algorithm and increase its accuracy for subsequent iterations. For instance, the subset may correspond to a prescribed number of the synonyms or character expressions having the highest ascribed ambiguity scores (and therefore considered by the algorithm to be the most unsafe).
  • the character expression ambiguity characteristics - obtained by feature engineering, for instance - may include a total number of publication documents in the publication data that include a defined expression referring to a specific drug target.
  • the ambiguity characteristics may include a number of publication documents in the publication data that include one of the defined character expressions referring to said drug target, relative to the total number of publication documents in the publication data that include the defined character expressions referring to a specific drug target.
  • the ambiguity characteristics may include a number of characters in one of the defined character expressions referring to a drug target. For instance, shorter expressions may in general be regarded as more ambiguous than longer ones.
  • the ambiguity characteristics may include a frequency with which each character in one of the defined character expressions referring to the drug target occurs in the publication data.
  • a sum of the frequency for each of the characters in a particular character expression may be considered, i.e. a frequency score for the entire expression. Any suitable metric using this overall frequency score may be used, e.g. a logarithm of this overall score. For instance, synonyms or character expressions including less common characters may be less ambiguous than those synonyms composed entirely of characters commonly found in the publication data (or more generally considered as being common).
  • a further ambiguity characteristic for a particular character expression or synonym may be based on a number of the defined character expressions for the drug targets that include the particular defined character expression. In other words, an ambiguity characteristic may be based on the number of nested synonyms (as defined above) relevant to a particular synonym, i.e.
  • Another ambiguity characteristic may be a probability that a publication document in the publication data that includes one of the defined character expressions, other than a selected safe character expression, also includes the selected safe character expression.
  • an ambiguity characteristic may be the conditional probability of finding the gene synonym of interest in the publication data for a specific publication document given that one of the (other) gene synonyms for the same gene symbol (as defined above) appears in the text.
  • a further ambiguity characteristic may be in essence the ‘reverse probability’ of the above, i.e. a probability that a publication document in the publication data that includes the selected character expression (i.e. the synonym under consideration) also includes another one of the defined character expressions for that drug target.
  • an ambiguity characteristic may be the conditional probability of finding one of the (other) gene synonyms for the same gene symbol in the publication data for a specific publication document given that the gene synonym of interest appears in the text.
  • an ambiguity characteristic may be based on whether the character expression under consideration is the accepted character expression for a particular drug target, e.g. whether the gene synonym under consideration is the gene symbol.
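  • Several of the ambiguity characteristics listed above can be computed directly from the searched text. The sketch below is one possible feature-engineering step under stated assumptions: documents are plain strings, an expression is detected by case-insensitive substring search (which would over-match in practice), and the function and feature names are hypothetical.

```python
import math
from collections import Counter

def ambiguity_features(expression, other_expressions, approved_symbol, texts):
    """Compute illustrative ambiguity features for one character expression."""
    # Relative frequency of every character across the publication text.
    char_counts = Counter("".join(texts).lower())
    total_chars = sum(char_counts.values())
    freq_sum = sum(char_counts[c] / total_chars for c in expression.lower())

    # Which documents mention the expression, and which mention another synonym.
    has_expr = [expression.lower() in t.lower() for t in texts]
    has_other = [
        any(o.lower() in t.lower() for o in other_expressions) for t in texts
    ]
    n_expr, n_other = sum(has_expr), sum(has_other)
    n_both = sum(a and b for a, b in zip(has_expr, has_other))

    return {
        "length": len(expression),
        "log_char_frequency": math.log(freq_sum) if freq_sum > 0 else float("-inf"),
        # P(expression present | another synonym of the same gene present)
        "p_expr_given_other": n_both / n_other if n_other else 0.0,
        # P(another synonym present | expression present)
        "p_other_given_expr": n_both / n_expr if n_expr else 0.0,
        "is_approved_symbol": expression == approved_symbol,
    }

# Hypothetical texts for illustration.
texts = [
    "STAR is regulated by steroidogenic acute regulatory protein expression",
    "A star cluster survey",
    "Steroidogenic acute regulatory protein in adrenal cells",
]
features = ambiguity_features(
    "STAR", ["Steroidogenic Acute Regulatory Protein"], "STAR", texts
)
```

  • Feature vectors of this kind could then be passed to whichever classifier is used to ascribe the ambiguity score.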
  • the method of the invention may use the labelled character expressions - i.e. labelled safe or unsafe depending on the associated ambiguity of the expression - to unambiguously associate or link each drug target, e.g. human gene, to a subset of the publication documents for which publication data has been ingested.
  • an approach based on a network of co-citations may be used. That is, the citations of a publication document may be used to more accurately determine whether mentions of character expressions for a particular drug target in the publication data of that publication document actually mean that the particular drug target is linked to the publication document (or whether the character expression is being mentioned in a different context such that it does not actually refer to the drug target).
  • the co-citation approach - described in more detail below - may be used to reduce or eliminate ‘false positives’ from the searched publication data, i.e. publication documents whose publication data mentions a defined character expression (gene synonym) - which is indicative that a publication document may be associated with the drug target (gene) relevant to the defined character expression - but which are in fact not associated with or linked to the drug target.
  • This approach may be regarded as being based on an assumption that publication documents including ‘false positives’ will tend to belong to different communities of publications relating to different research fields from publication documents including ‘true positives’, i.e. publication documents including defined character expressions in the text that do in fact refer to a drug target of interest. In this way, an identified community of publication documents may be determined (as a whole) to be linked to, or to not be linked to, a gene of interest.
  • the ingested publication data for at least some of the publication documents may include citation data indicative of citations made by one publication document to one or more other publication documents from the plurality of publication documents.
  • the method may involve identifying so-called ‘co-citations’ in the publication data.
  • a co-citation may be regarded as an occurrence of two publication documents both being cited by a third document. That is, if ‘Publication A’ and ‘Publication B’ are in the list of references of ‘Publication C’, then there is a co-citation between ‘Publication A’ and ‘Publication B’.
  • the step of searching the publication data may therefore include identifying, using the ingested citation data, pairs of (first and second) publication documents that have been cited by the same (third) publication document.
  • this step of searching for the pairs of publication documents includes searching for pairs of publication documents that each include at least one of the character expressions (gene synonyms) defined as referring to one of the drug targets (genes).
  • a co-citation network may be obtained using the identified co-citations, i.e. pairs of publication documents. For each identified pair of publication documents, a co-citation value representative of a number of (different) publication documents that cite both of the pair of publication documents may be determined. That is, a weighted co-citation graph may be obtained where the weight of the edges represents the frequency of two publications being cited simultaneously (co-cited) by a third publication.
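  • The weighted co-citation graph can be built by counting, for every pair of documents, how many reference lists contain both. The sketch below assumes citation data is available as a hypothetical mapping from citing document to its list of cited documents; the optional `candidate_documents` filter stands in for restricting the pairs to documents that mention a synonym of the drug target.

```python
from collections import Counter
from itertools import combinations

def cocitation_weights(references_by_document, candidate_documents=None):
    """Count, for each pair of documents, how many documents cite both.

    Returns a Counter keyed by (doc_a, doc_b) with doc_a < doc_b, i.e. the
    edge weights of an undirected co-citation graph.
    """
    weights = Counter()
    for references in references_by_document.values():
        cited = set(references)
        if candidate_documents is not None:
            cited &= candidate_documents  # keep only documents of interest
        for pair in combinations(sorted(cited), 2):
            weights[pair] += 1
    return weights

# Hypothetical citation data: C cites A and B; D cites A, B and E; F cites A.
references_by_document = {
    "C": ["A", "B"],
    "D": ["A", "B", "E"],
    "F": ["A"],
}
weights = cocitation_weights(references_by_document)
```

  • The resulting edge weights are the input a community detection technique would operate on.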
  • pairs of publication documents are to be assigned into different communities of publication documents.
  • Each community includes publication documents that include instances of defined character expressions for a particular drug target; however, it may be the case that not all of the communities include publication documents that are in fact associated with the particular drug target, i.e. some communities may be composed of documents whose instances of the character expressions are in a context different from the particular drug target.
  • the method may therefore include assigning pairs of publication documents to one of a plurality of communities of publication documents based on their determined co-citation value and on the publication documents that cite those pairs of publication documents. This may be performed automatically using an appropriate community detection technique. For instance, assigning pairs of publication documents to one of the communities may include application of a (fast) greedy optimisation algorithm.
  • the method may include determining, for each of the plurality of communities of publication documents, whether to associate that community with one of the drug targets.
  • the relative ‘safety’ of character expressions (determined as described above) that are present in the publication documents of a particular community may be used. This can involve determining or identifying which of the defined character expressions referring to a particular drug target are present in the publication data of each of the publication documents in a particular community. A determination as to how many safe character expressions are in a community may be used to determine whether that community is associated with the relevant drug target.
  • a proportion of the publication documents in the community under consideration that include at least one safe character expression in their publication data may be determined. It may then be determined to associate that community with the drug target of interest if the determined proportion is greater than a prescribed threshold proportion.
  • one or more of the communities having the highest proportions of safe character expressions may be regarded as being associated with the relevant drug target.
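  • Deciding whether a community is associated with a drug target from the proportion of its documents containing a safe character expression could be sketched as below. The data layout, the function names, and the threshold of 0.5 are assumptions chosen for illustration; the description leaves the threshold open.

```python
def safe_proportion(community, expressions_by_document, safe_expressions):
    """Fraction of documents in a community containing a safe expression."""
    if not community:
        return 0.0
    hits = sum(
        1
        for doc in community
        if expressions_by_document.get(doc, set()) & safe_expressions
    )
    return hits / len(community)

def communities_for_target(communities, expressions_by_document,
                           safe_expressions, threshold=0.5):
    """Return the names of communities whose safe proportion exceeds the threshold."""
    return [
        name
        for name, members in communities.items()
        if safe_proportion(members, expressions_by_document, safe_expressions) > threshold
    ]

# Hypothetical communities for the 'STAR' example.
communities = {"biology": {"d1", "d2", "d3"}, "astronomy": {"d4", "d5"}}
expressions_by_document = {
    "d1": {"STAR", "Steroidogenic Acute Regulatory Protein"},
    "d2": {"Steroidogenic Acute Regulatory Protein"},
    "d3": {"STAR"},
    "d4": {"STAR"},
    "d5": {"STAR"},
}
safe = {"Steroidogenic Acute Regulatory Protein"}
linked = communities_for_target(communities, expressions_by_document, safe)
```

  • Here document d3 mentions only the unsafe symbol, yet it is linked to the gene through its community, whereas d4 and d5 are not.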
  • publication data of some of the publication documents may not include citation data, i.e. details of citations made by a particular publication document. This may be a particular issue in a case in which publication data of open-access publications is ingested as citation data is often not available from such sources.
  • the method may include determining whether to assign the publication document to one of the communities associated with one of the drug targets based on its publication data, in particular on one or more of the defined character expressions referring to that drug target in its publication data. For instance, if the publication data of a publication document includes at least one instance of a safe character expression, then it may be determined to assign the publication document to one of the communities associated with the relevant drug target. On the other hand, if the publication data of a publication document does not include a safe character expression, then the determination whether to assign the publication document to one of the communities associated with the relevant drug target may be performed using a machine learning algorithm, e.g. a positive-unlabelled learning technique.
  • the machine learning algorithm may apply a machine learning classifier, such as one of a logistic regression classifier, an extra tree classifier, a gaussian process classifier, a k-nearest neighbour classifier, a ridge classifier, a random forest classifier, and a support vector machine classifier. That is, a positive-unlabelled bagging approach may be used to train multiple classifiers to associate the disconnected publications (without citation data) with the previously computed co-citation network components using the words / expressions contained in the publication data, e.g. title, abstract, etc.
  • a next step of the method therefore includes determining an expected publication parameter and an actual publication parameter for each drug target of interest based on the searched publication data.
  • the expected publication parameter is determined based on publication data from the historical publication documents. Specifically, historical publication dynamics for a particular gene, for example, are calculated using the historical publication documents, and then these historical publication dynamics are used to determine or predict the expected publication parameter, e.g. by extrapolation.
  • the publication dynamics for a given gene for each of a number of successive years may be calculated using the historical publication data, e.g. using publication dates associated with historical publication documents in the publication data, and these calculated (historical) publication dynamics can be used to predict current publication dynamics for that given gene.
  • Determining the expected publication parameter may be performed using a machine learning algorithm trained using the searched publication data from the historical publication documents, e.g. a recurrent neural network algorithm.
  • the actual publication parameter is determined based on publication data from the current publication documents.
  • the expected and actual publication parameters can be a measure or indication of any one or more aspects of the publication dynamics associated with a particular drug target.
  • the expected and actual publication parameters may be an expected and actual number of publication documents, respectively, e.g. the number of publication documents in a given year.
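  • As a minimal illustration of determining an expected publication parameter by extrapolation, the sketch below fits a straight line to yearly publication counts from historical documents and evaluates it for a later year. Ordinary least squares is used here purely as a simple stand-in for the trained model mentioned above (e.g. a recurrent neural network); the function name and the example counts are hypothetical.

```python
def linear_extrapolation(counts_by_year, target_year):
    """Predict a count for target_year from a least-squares line fit.

    counts_by_year: mapping of year -> number of publication documents
    associated with one drug target in the historical data.
    """
    years = sorted(counts_by_year)
    n = len(years)
    if n < 2:
        raise ValueError("at least two historical years are needed")
    mean_x = sum(years) / n
    mean_y = sum(counts_by_year[y] for y in years) / n
    sxx = sum((y - mean_x) ** 2 for y in years)
    sxy = sum((y - mean_x) * (counts_by_year[y] - mean_y) for y in years)
    slope = sxy / sxx
    return mean_y + slope * (target_year - mean_x)

# Hypothetical historical counts for one gene.
historical_counts = {2016: 10, 2017: 12, 2018: 14, 2019: 16}
expected_2020 = linear_extrapolation(historical_counts, 2020)
```

  • The actual publication parameter for the same year is then simply the count observed in the current publication documents.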
  • the expected and actual publication parameters may include an expected and actual number of clinical trials associated with the particular drug target under consideration, an expected and actual number of review publication documents associated with the particular drug target, and an expected and actual number of publication documents linked to a defined size of company.
  • the relevant information needs to be available in the ingested publication data in order to determine the relevant parameter.
  • the publication data for some publication documents may indicate whether the publication document is associated with a large- or medium-sized pharmaceutical company.
  • in order to detect incoming trends in the literature, the method then includes evaluating each of the drug targets of interest for selection based on its actual publication parameter relative to its expected publication parameter.
  • evaluating the drug targets for selection may include ranking the drug targets in a (prioritised) list based on a comparison of their respective actual and expected publication parameters.
  • a drug target may be considered as potentially interesting for selection if there is a (significant) difference between its respective actual and expected publication parameters, as this may mean that there has been a step change in interest in the drug target relative to what may be expected according to historical publication data.
  • the described method produces accurate predictions of the publication dynamics, i.e. the actual publication parameter is generally in line with the expected publication parameter.
  • the actual or real number of publications or citations may be significantly higher than expected.
  • where the actual number of publications or citations exceeds the predictions, this may be interpreted as meaning that the publication dynamics have changed substantially in a way that cannot be explained simply by the publication history of a gene of interest, for instance, implying that a meaningful discovery in the field may have recently occurred.
  • ‘trendiness’ may be defined as the probability of a fold-change between the predicted and real number of publications and citations for a given gene. This metric can be used to identify the ‘trendiest’ genes in the academic community (using all publications), or in the pharmaceutical industry (using publications coming from pharmaceutical companies).
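  • A ranking of drug targets by the gap between actual and expected publication counts could be sketched as follows. Note that this uses a plain log2 fold-change with a pseudocount as the score, which is a simplification and not the probability-based trendiness metric itself; the function names, the pseudocount, and the placeholder gene labels are assumptions, and expected counts are assumed to be non-negative.

```python
import math

def log2_fold_change(actual, expected, pseudocount=1.0):
    """Log2 ratio of actual to expected counts, smoothed by a pseudocount."""
    return math.log2((actual + pseudocount) / (expected + pseudocount))

def rank_targets(actual_by_target, expected_by_target):
    """Return (target, score) pairs sorted from largest to smallest fold-change."""
    scores = {
        target: log2_fold_change(actual_by_target[target], expected)
        for target, expected in expected_by_target.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Placeholder targets and counts for illustration.
actual = {"GENE_A": 63, "GENE_B": 20, "GENE_C": 7}
expected = {"GENE_A": 15, "GENE_B": 20, "GENE_C": 15}
ranking = rank_targets(actual, expected)
```

  • In this example GENE_A, with four times the expected number of publications after smoothing, would head the prioritised list.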
  • the method may include using the evaluation of the drug targets to inform selection of at least one of the drug targets for use in a drug discovery project, for instance based on a ranked list of the trendiest genes.
  • the method could include designing the drug discovery project by selecting at least one of the drug targets for use in the drug discovery project based on the evaluation.
  • the method may include undertaking the drug discovery project using the at least one drug target as selected, at least in part, based on the above evaluation.
  • Such a drug discovery project can include selecting compounds and testing them against the at least one selected drug target, for instance, e.g. to identify a compound potentially having therapeutic activity against a disease target.
  • the methods of this disclosure may then involve synthesising at least one compound potentially having binding activity against the selected drug target.
  • It can also be informative to consider the relationship between two different drug targets when analysing particular drug targets for selection. In particular, it may be found that genes of interest may cluster together in association networks. Hence it can be informative to have an insight into the association that one gene has with other genes in the literature as this could mean that identification of one gene whose publication dynamics has undergone a fold change may lead to one or more further genes whose publication dynamics have changed in a manner that means they are of interest.
  • analysis of the publication dynamics of drug targets may include determining a target-target co-occurrence parameter between pairs of the drug targets.
  • a target-target co-occurrence parameter may be determined based on the indication from the searched publication data which publication documents both drug targets in a pair are associated with, i.e. publication documents with which two different drug targets are associated.
  • Each target-target co-occurrence parameter may be indicative of the number of publication documents in which both of a pair of drug targets appear.
  • the evaluation of drug targets for selection may then be based on the determined target-target co-occurrence parameters.
  • Drug targets of potential interest to the pharmaceutical industry may be drug targets that can be associated with particular diseases.
  • the described method of searching publication data to associate drug targets with publications can also be applicable to associate particular diseases with publications.
  • the method may include searching the ingested publication data to provide an indication, for each of the publication documents, as to whether the respective publication document is associated with one or more diseases.
  • this may involve defining, for each disease, one or more character expressions as referring to said disease, and searching the publication data for the character expressions for each disease.
  • disease names and their synonyms may be obtained from the Medical Subject Headings (MeSH) ontology at the Bioportal.
  • The MeSH ontology contains 4818 different disease nodes at different levels of the ontology.
  • a dictionary for each disease may be created with the preferred and alternative names.
  • the diseases can then be disambiguated in the publication data (e.g. title, abstract, etc.) using corresponding techniques described above for genes.
  • the method may then include determining a target-disease co-occurrence parameter between each of the drug targets and each of the diseases.
  • a target-disease co-occurrence parameter may be determined based on the indication from the searched publication data which publication documents each drug target and each disease are associated with, each target-disease co-occurrence parameter being indicative of the number of publication documents in which one of the drug targets and one of the diseases appear.
  • the evaluation of drug targets for selection may then be based on the determined target-disease co-occurrence parameters.
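The target-target and target-disease co-occurrence parameters described above can be illustrated with a small counting routine. This is a minimal sketch, assuming the search step has already produced a mapping from each publication document to the set of targets and diseases found in it; the function name, document IDs and annotations are hypothetical and not taken from the source.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(doc_tags):
    """Count, for every pair of tags, the number of documents carrying both."""
    counts = Counter()
    for tags in doc_tags.values():
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical annotations: document ID -> drug targets and diseases found in it.
doc_tags = {
    "doc1": {"CD274", "PDCD1", "melanoma"},
    "doc2": {"CD274", "PDCD1"},
    "doc3": {"CD274", "melanoma"},
    "doc4": {"TREM2"},
}

counts = cooccurrence_counts(doc_tags)
print(counts[("CD274", "PDCD1")])     # target-target co-occurrence
print(counts[("CD274", "melanoma")])  # target-disease co-occurrence
```

Pairs that never share a document simply return a count of zero, so the result can be used directly as edge weights in an association network.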
  • the method may include applying a topic modelling algorithm to the publication data for the publication documents associated with a drug target of interest to obtain one or more topics associated with the drug target, and the evaluation of the drug targets for selection may then be based on the obtained topics.
  • a topic may be regarded as a collection of similar words, specific to a group of documents.
  • Non-negative matrix factorisation may be used to generate a set of latent topics for each query.
  • the topic modelling algorithm may include a latent Dirichlet allocation algorithm and/or a non-negative matrix factorisation algorithm.
  • the topic detection can also be used to determine, for a drug target, errors in association of one or more publication documents with the drug target according to the searched publication data based on the obtained one or more topics, and so can further aid the accuracy of drug target association in the literature.
  • Figure 1 summarises the steps of a computational drug target selection method 10 according to the invention.
  • publication data is received, ingested or downloaded from at least one publication data source, such as an online database storing publication documents, e.g. articles, journal papers, etc.
  • the publication documents include historical publication documents and current publication documents.
  • the publication data can include publication dates, authorship, titles, abstracts, keywords, citations, etc. in connection with publication documents.
  • the received publication data is searched to provide an indication, for each of the publication documents, as to whether the respective publication document may be associated with one or more potential drug targets, e.g. genes.
  • this can involve searching the publication data for mentions or instances of one or more defined character expressions for each drug target.
  • a mention of one of these character expressions in the publication data for one of the publication documents indicates that the publication document may be associated with the particular potential drug target.
  • Further steps may be performed to determine whether the publication document is in fact associated with the potential drug target. For instance, a relative ‘safety’ of the character expression in the publication data may be established (as described above) to indicate a confidence that the character expression in the publication data does indeed refer to the potential drug target of interest.
  • Further steps to cluster the publication documents into communities based on the searched publication data may be performed to determine whether clusters of publication documents are in fact associated with, or linked to, the drug target of interest.
  • searching the publication data is used to establish a group of publication documents that are linked to each of one or more potential drug targets.
  • an expected publication parameter for each potential drug target is determined based on the searched publication data from (i.e. connected to) the historical publication documents. Also, an actual or real publication parameter for each potential drug target is determined based on the searched publication data from the current publication documents.
  • the publication parameters may be any suitable parameters for describing publication dynamics (over time) for each potential drug target. For instance, the publication parameters may indicate the number of publication documents each calendar year that are associated with a particular drug target.
  • the expected or predicted publication parameter may be determined by determining the historical publication dynamics for a drug target based on the historical publication data, and extrapolating these dynamics to predict the present or future publication dynamics.
  • each potential drug target may be evaluated for selection based on its actual publication parameter relative to its expected publication parameter.
  • differences between the expected and actual parameters for a given potential drug target may be indicative of a change in assumptions about the drug target, and may indicate interest in further investigation for selection as a drug target.
  • the evaluation can involve creating target lists of potential drug targets based on the above analysis (i.e. differences between predictions and actual values, and also perhaps based on the confidence of the predictions) in order to prioritise potential drug targets for selection, for instance in association with any disease or biological mechanism of choice.
  • the evaluation can inform selection of drug targets for various applications, e.g. designing and performing a particular drug discovery project.
  • the method of the invention may be implemented on any suitable computing device, for instance by one or more functional units or modules implemented on one or more computer processors.
  • Such functional units may be provided by suitable software running on any suitable computing substrate using conventional or custom processors and memory.
  • the one or more functional units may use a common computing substrate (for example, they may run on the same server) or separate substrates, or one or both may themselves be distributed between multiple computing devices.
  • a computer memory may store instructions for performing the method, and the processor(s) may execute the stored instructions to perform the method.
  • PubMed® baseline released in December 2019 contains more than 30 million publications, around 170 million citations from open source data, almost 9 million authors, and almost 300 million MeSH annotations. PubMed® can be converted into a graph database using the graph database platform Neo4J to efficiently query for relationships such as authorship, references or annotations.
  • the resulting database contained five different node types: publications; authors; human protein-coding genes; human diseases; and, Medical Subject Headings (MeSH) terms.
  • the publication nodes have multiple attributes that were extracted from the PubMed® baseline: PubMed ID; title; abstract; keywords; authors; affiliations; the date of publication; the journal; and, the article type (e.g. article, review or clinical trial).
  • An attribute aggregating affiliation data was also included to know whether a pharmaceutical company was involved in the authorship of the publications.
  • There are five types of relationships (edges): cited by (from publication to publication); published (from authors to publications); MeSH annotation (from MeSH terms to publications); gene annotation (from genes to publications); and, disease annotation (from diseases to publications).
  • a disambiguation pipeline to unequivocally link human protein-coding gene symbols and human diseases to individual publications was implemented.
  • Human gene synonyms were gathered from different sources (Ensembl, UniProt, HGNC, Entrez and OpenTargets) to sample the potential publications mentioning human gene names. It is noted that human genes have around 10 synonyms each on average, and many of these synonyms are ambiguous (when considered out of context).
  • More than 30% of gene symbols have at least one promiscuous synonym, around 10% of the gene symbols have another meaning in a different context and have at least one gene synonym in the English dictionary, and almost 50% of gene symbols have a nested synonym. Combining these problems, almost 60% of the 19082 gene symbols have at least one of these types of ambiguity.
  • feature engineering is performed to obtain variables that characterise unsafe synonyms (e.g. longer gene names are less likely to be ambiguous).
  • PU: positive-unlabelled (bagging)
  • HGNC: HUGO Gene Nomenclature Committee
  • PU learning is a form of semi-supervised learning which iteratively finds positive examples within a priori unlabelled data.
  • the PU learning was run for five iterations with a random forest classifier.
  • the pure positive class (unsafe) was constructed combining gene synonyms present in the English dictionary, gene synonyms with fewer than three characters, and promiscuous gene synonyms.
  • the top 1000 examples with the highest probability of being unsafe were manually relabelled if they were wrongly classified.
  • true positive unsafe synonyms like gene families (e.g. 'G protein coupled receptor'), phenotypes (e.g. 'Williams Beuren Syndrome') and other biological entities (e.g. 'Cell surface antigen') were included in the true positive set for the next iteration.
  • False positives like 'thymopoietin' or 'tubulin alpha-1 C chain' were included into a new true negative class for the remaining iterations.
  • a gene synonym was considered unsafe if: (i) it is included in the English dictionary; (ii) it is a word with fewer than three characters; (iii) the predicted score for the random forest classifier was higher than 0.5; and, (iv) it is a promiscuous gene synonym.
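The unsafe-synonym rule can be sketched as a simple predicate. The sketch below reads the four listed criteria as alternatives (any one of them marks a synonym as unsafe), which is an interpretation rather than something the text states explicitly. The function name, the word lists and the classifier scores are hypothetical placeholders.

```python
def is_unsafe(synonym, english_words, promiscuous, classifier_score):
    """Return True if a gene synonym should be treated as ambiguous (unsafe).

    Assumption: meeting any single criterion is enough.
    """
    return (
        synonym.lower() in english_words  # (i) ordinary English word
        or len(synonym) < 3               # (ii) fewer than three characters
        or classifier_score > 0.5         # (iii) random forest score above 0.5
        or synonym in promiscuous         # (iv) shared by several genes
    )


# Hypothetical lookup data for illustration only.
english_words = {"cat", "impact"}
promiscuous = {"p40"}

print(is_unsafe("CAT", english_words, promiscuous, 0.1))    # dictionary word
print(is_unsafe("LRWD1", english_words, promiscuous, 0.2))  # no criterion met
```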
  • co-citation networks were used, i.e. a weighted graph where the weight of the edges represents the frequency of two publications being cited simultaneously (co-cited) by a third publication.
  • the fast greedy modularity algorithm from iGraph was used to determine communities in the co-citation network, and communities of publications focusing on the gene of interest were distinguished by detecting the presence of ‘safe gene synonyms’ in their titles and abstracts.
  • Each of the publications in a community was labelled with the gene symbol of interest if the ratio of publications mentioning at least one safe synonym with respect to publications that only mention unsafe synonyms was higher than 0.1%.
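The co-citation weighting and the community labelling rule can be sketched as follows. Community detection itself is not shown; the sketch assumes communities have already been obtained. It also assumes that every publication in a community mentions at least one synonym of the gene, so that publications without a safe synonym count as "unsafe only". All function names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_weights(references):
    """references: citing document -> set of documents it cites.

    Returns pair -> number of documents that cite both members of the pair.
    """
    weights = Counter()
    for cited in references.values():
        for pair in combinations(sorted(cited), 2):
            weights[pair] += 1
    return weights


def community_is_on_target(community, safe_mentions, threshold=0.001):
    """Apply the 0.1% rule: safe-synonym documents / unsafe-only documents."""
    safe = sum(1 for doc in community if safe_mentions.get(doc, 0) > 0)
    unsafe_only = len(community) - safe
    if unsafe_only == 0:
        return safe > 0
    return safe / unsafe_only > threshold


# Hypothetical reference lists and safe-synonym counts.
references = {"r1": {"a", "b", "c"}, "r2": {"a", "b"}}
weights = cocitation_weights(references)
print(weights[("a", "b")])  # co-cited by both r1 and r2

safe_mentions = {"a": 2}  # document "a" mentions two safe synonyms
print(community_is_on_target(["a", "b", "c"], safe_mentions))
print(community_is_on_target(["x", "y"], safe_mentions))
```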
  • Each corpus was pre-processed by: (i) removal of non-alphanumeric characters; (ii) tokenisation or split by whitespace; (iii) deletion of stop words from NLTK (natural language toolkit); (iv) lower case conversion; (v) deletion of tokens whose length is less than three characters; (vi) deletion of tokens representing integers; and, (vii) stemming (e.g. 'disambiguated', 'disambiguations', 'disambiguating' are converted to 'disambiguat').
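The pre-processing steps can be sketched with the standard library alone. To keep the example self-contained, the NLTK stop-word list is replaced by a tiny hypothetical set and the stemmer by a crude suffix stripper; both are stand-ins, not the components named in the text. Lower-casing is applied before stop-word removal here so that capitalised stop words are also caught, which differs slightly from the listed order.

```python
import re

# Stand-ins (assumptions): a tiny stop-word set and a naive suffix list.
STOP_WORDS = {"the", "and", "was", "were", "for", "with", "that", "this"}
SUFFIXES = ("ions", "ion", "ing", "ed", "es", "s")


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)        # non-alphanumerics
    tokens = text.lower().split()                      # lower case, whitespace split
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [t for t in tokens if len(t) >= 3 and not t.isdigit()]
    return [crude_stem(t) for t in tokens]


tokens = preprocess("The 3 synonyms were disambiguated; disambiguating 2019 gene!")
print(tokens)
```

A production pipeline would more likely use an established stemmer, since a naive suffix stripper over-stems many ordinary words.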
  • the disease synonyms were obtained from the 'Concept List Terms' field in the ontology to gather the preferred and the alternative ways of denoting the disease. Further synonyms of the diseases were generated by reversing the order of synonyms with commas: 'Insipidus, Diabetes' to 'Diabetes Insipidus'.
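The comma-reversal step for disease synonyms can be sketched as below. The function name is hypothetical, and reversing all comma-separated parts for names with more than one comma is an assumption.

```python
def expand_comma_synonyms(synonyms):
    """Add a reversed form for every synonym written as 'Head, Modifier'."""
    expanded = set(synonyms)
    for name in synonyms:
        parts = [part.strip() for part in name.split(",")]
        if len(parts) > 1:
            expanded.add(" ".join(reversed(parts)))
    return expanded


names = expand_comma_synonyms(["Insipidus, Diabetes", "Psoriasis"])
print(sorted(names))
```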
  • Co-occurrence of genes and diseases was computed using the simultaneous occurrence of gene/disease tags in publications after disambiguation, normalised by the total number of publications presenting those tags.
  • Mutual information metrics for gene-gene and gene-disease associations were also computed.
  • Each MeSH term was associated with its lowest ancestor in the MeSH ontology under the node Disease. After computing the gene-disease co-occurrence, each gene was linked with the most frequent ancestor disease term.
  • the model produces accurate predictions of the publication dynamics, but for a small subset of genes the real number of publications or citations is significantly higher than expected.
  • the trendiness of a gene can be regarded as the probability of observing the magnitude of fold-change between the predicted and the real number of publications for that given gene. The error in the predictions is inevitably higher with genes associated with small numbers of publications.
  • five bins were generated based on the initial number of publications (percentiles 20, 40, 60, 80 and 100).
  • the area under the obtained probability density function is equal to 1.
  • Trendiness is the area of the right tail of the probability density function bounded to the left by the observed fold change. This provides an estimate of how extreme the fold change was for that gene in a specific bin.
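The binning and right-tail calculation can be illustrated with an empirical distribution. This is a sketch under stated assumptions: the fitted probability density function is replaced by the empirical share of fold changes in the same bin that are at least as large as the observed one, and bins are assigned by rank quintile of the initial publication count. The function names and all numbers are hypothetical.

```python
def quintile_bins(initial_counts):
    """Assign each gene to one of five bins by rank of its initial publication count."""
    order = sorted(initial_counts, key=initial_counts.get)
    n = len(order)
    return {gene: min(4, rank * 5 // n) for rank, gene in enumerate(order)}


def right_tail(observed, fold_changes):
    """Empirical right-tail area: share of fold changes >= the observed value."""
    return sum(1 for fc in fold_changes if fc >= observed) / len(fold_changes)


# Hypothetical genes with 1..10 initial publications.
initial = {f"g{i}": i + 1 for i in range(10)}
bins = quintile_bins(initial)

# Hypothetical fold changes for the genes sharing one bin.
bin_fold_changes = [0.8, 0.9, 1.0, 1.1, 4.0]
print(right_tail(4.0, bin_fold_changes))  # small area: an extreme fold change
print(right_tail(1.0, bin_fold_changes))  # larger area: an unremarkable one
```

Under this reading, a smaller tail area corresponds to a more extreme, and hence trendier, gene within its bin.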
  • Time-series data from 1980 to 2013 was used to predict the per gene publication dynamics in each category between 2014 and 2019 using a Recurrent Neural Network model with an encoder-decoder architecture preceded by an attention layer, where both the encoder and decoder are composed of five hidden layers of Gated Recurrent Units (GRU).
  • GRU: Gated Recurrent Unit
  • the model was implemented in Keras using the Tensorflow-GPU backend. Min-max normalisation was used to rescale the time series before training.
  • the optimiser was RMSprop and the loss was computed as the log error. 30% of the time series was reserved for validation during the training.
  • Input data was in both forms: cumulative and differential. Multiple normalisations were used ('none', 'minmax', 'log', 'standard', and combinations of them). Similar results were obtained with different normalisations and minmax was finally selected.
  • Multiple Recurrent Neural Network (RNN) architectures were used (GRU, LSTM) in the form of encoder-decoder, with different numbers of neurons (1, 5, 10, 20, 50). Models were compared with the Mean Absolute Scaled Error (MASE), an unbiased method to compare time-series prediction models by comparing how much each model outperforms a naive model that repeats the last value. The 5-neuron-GRU was selected because it was the most parsimonious model with the smallest MASE.
  • MASE: Mean Absolute Scaled Error
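The model comparison can be illustrated with a short MASE calculation. This sketch follows the description in the text, scaling the model's mean absolute error by that of a naive forecast that repeats the last observed value over the prediction horizon. The more common textbook definition scales by the in-sample one-step naive error instead, so this variant is an assumption. The series values are hypothetical.

```python
def mean_abs_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mase(history, actual, predicted):
    """Model error relative to a naive forecast repeating the last observed value."""
    naive = [history[-1]] * len(actual)
    return mean_abs_error(actual, predicted) / mean_abs_error(actual, naive)


# Hypothetical yearly publication counts for one gene.
history = [10, 12, 15]    # training period
actual = [18, 21, 24]     # held-out period
predicted = [17, 21, 26]  # model output for the held-out period

score = mase(history, actual, predicted)
print(round(score, 3))  # values below 1 mean the model beats the naive forecast
```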
  • a topic detection pipeline was implemented as an automatic, fast discovery tool to study groups of publications that mention the gene of interest.
  • topic modelling algorithms were used.
  • a topic is a collection of similar words, specific to a group of documents.
  • Two different topic detection algorithms were used: Latent Dirichlet Allocation (LDA), and Non-Negative Matrix Factorisation (NMF).
  • Both algorithms factor a non-negative matrix 'A' with size NxM, where N is the number of publications and M is the dimension of the TF-IDF vector obtained for Named Entity Recognition, into non-negative factors: matrix W of size NxK and matrix H of size KxM, where WxH is an approximation of matrix A.
  • the matrix W contains the strength of the association of a given publication to belong to a latent topic while H contains the strength of the association between a latent topic and a given n-gram.
  • Scikit-learn implementations for both algorithms were used to generate 'K' number of topics defined by the user with the default parameters until convergence (tolerance of 1e-12). Topic timelines were obtained by calculating the mean and standard deviations of the topic probabilities for all publications mentioning the gene of interest per calendar year.
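The topic timeline calculation can be sketched independently of the topic model. The example assumes that each publication already has a row of K topic probabilities (for instance a normalised row of W) and a publication year; the population standard deviation is used, which is an assumption. The function name, document IDs and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def topic_timelines(doc_years, doc_topic_probs):
    """Return year -> list of (mean, standard deviation) per topic."""
    by_year = defaultdict(list)
    for doc, year in doc_years.items():
        by_year[year].append(doc_topic_probs[doc])
    timelines = {}
    for year, rows in sorted(by_year.items()):
        columns = list(zip(*rows))  # one column of probabilities per topic
        timelines[year] = [(mean(col), pstdev(col)) for col in columns]
    return timelines


# Hypothetical topic probabilities for three publications and two topics.
doc_years = {"d1": 2013, "d2": 2013, "d3": 2014}
doc_topic_probs = {"d1": [0.8, 0.2], "d2": [0.6, 0.4], "d3": [0.1, 0.9]}

timelines = topic_timelines(doc_years, doc_topic_probs)
for year, stats in timelines.items():
    print(year, [(round(m, 2), round(s, 2)) for m, s in stats])
```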
  • a review recommender system to accelerate the screening of the publications that cover most of the information in a network can also be designed.
  • the algorithm aggregates both topic and network information from the citation subgraph of the publications that mention the gene of interest to obtain the most query-centric reviews.
  • the topic information comes from the latent topics obtained from the topic detection algorithm.
  • the topic probability of the publications and an aggregated PageRank score of the citation networks were used.
  • the network information was captured by the PageRank scores of the subgraph.
  • the user can select an interval number of reviews (R) that they are willing to read: between 2-3 or 3-50.
  • three matrices are defined for each group of publications: (i) a binary, sparse matrix of size NxR with N publications and R reviews that comprises the citation adjacency network; (ii) an Nx1 weight matrix that comprises the PageRank scores; and (iii) an NxK matrix with the topic probabilities for N publications and K user-defined topics.
  • the score for each review was defined as the sum of the PageRank scores of its references while the score for a combination of reviews is defined as the row sum of the indexed NxR matrix multiplied by the Nx1 PageRank vector and the sum of the obtained vector. Results were later normalised by the total maximum score, defined as a hypothetical review citing all gene publications.
  • the best reviews are the ones that cite the publications with highest PageRank scores.
  • the combination that simultaneously maximises the cumulative PageRank score and minimises the overlapping of their combined citations was found. This way, a small set of reviews covering the main topics and publications in the field can be obtained.
  • This recommender system can be used to select the optimal subset of reviews to assess why genes might be trendy.
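The PageRank-based part of the review scoring can be sketched as follows. This is one possible reading, not the described implementation: a combination of reviews is scored by the PageRank mass of the publications cited by at least one of them, so overlapping citations add nothing, and the result is normalised by the score of a hypothetical review citing every publication. Topic information is omitted, the search is exhaustive and only suitable for small candidate sets, and all names and numbers are hypothetical.

```python
from itertools import combinations


def combo_score(reviews, cites, pagerank):
    """Normalised PageRank mass covered by a combination of reviews."""
    covered = set().union(*(cites[r] for r in reviews))
    return sum(pagerank[p] for p in covered) / sum(pagerank.values())


def best_reviews(cites, pagerank, k):
    """Exhaustively pick the k reviews with the highest combined coverage."""
    return max(
        combinations(sorted(cites), k),
        key=lambda combo: combo_score(combo, cites, pagerank),
    )


# Hypothetical PageRank scores and review reference lists.
pagerank = {"p1": 0.4, "p2": 0.3, "p3": 0.2, "p4": 0.1}
cites = {"rA": {"p1", "p2"}, "rB": {"p1", "p2", "p3"}, "rC": {"p4"}}

best = best_reviews(cites, pagerank, 2)
print(best, round(combo_score(best, cites, pagerank), 2))
```

In this toy example the pair rB and rC is preferred over rA and rB, because rA adds no publication that rB does not already cite.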
  • the described example method includes downloading publication data from PubMed® baseline and creating a graph database with the acquired information.
  • a comprehensive collection of human coding gene names and synonyms is acquired, and the method involves automatic determination of potentially ambiguous (unsafe) gene names.
  • the graph database is annotated with unambiguous gene symbols by combining co-citation network topology and binary classifiers.
  • the method then involves prediction of per-gene publication trends using a recurrent neural network. When a gene has significantly more publications or citations than expected by the model it is considered to be ‘trendy’.
  • the method optionally involves automatic topic detection of collections of publications, and this algorithm was used to quantify the evolution of topics in trendy gene publications over time.
  • a review recommender system that uses information from the citation network and topic detection to recommend the most efficient set of reviews to explore the literature can be implemented.
  • Figure 2 illustrates an example of the created graph database for a particular gene when different ones of the above-described techniques or steps are used.
  • Figure 2(a) illustrates a citation network for a subset of publication documents from PubMed® mentioning any of the gene synonyms of the gene symbol LRWD1, including ORCA.
  • the nodes represent publication documents and the size of the nodes represents the number of citations.
  • the edges indicate citations between documents, including the direction of the citation.
  • Figure 2(b) illustrates a co-citation network of the same subset of publication documents as in Figure 2(a). The thickness of the edges represents the number of times a pair of documents have been co-cited.
  • Figure 2(c) illustrates different communities of publication documents obtained by using iGraph’s fast greedy algorithm, as described above. Each community is associated with different obtained topics. For instance, there is a so-called ‘killer whale’ community 201, an ‘orca plant’ cluster or community 202, an ‘LRWD1 in Drosophila’ community 203, and an ‘LRWD1 in heterochromatin’ community 204.
  • Figure 2(d) indicates the number of safe synonyms in the title or abstract of each publication document in the same co-citation network.
  • Figure 2(e) illustrates the citation network with review documents added to show citations by the review documents to any of the publication documents.
  • Figure 2(f) illustrates review information as defined by the recommender system scaled from 0 to 1.
  • Figure 3 illustrates the detection of trends of different genes, and of gene-gene-disease co-occurrence.
  • Figure 3(a) shows a logarithmic scatter plot of the predicted number of publications against the real number of publications in the year 2019 for different genes.
  • Figures 3(b), 3(c) and 3(d) respectively show the predicted number of review documents, citations, and citations from ‘big’ pharmaceutical companies against the actual number of review documents, citations, and citations from 'big' pharmaceutical companies in the year 2019 for different genes.
  • Those genes whose real number of publications, for instance, is greater than the predicted value i.e. whose node is above a line indicating a log linear relationship
  • Figure 4 illustrates the trendiness - namely, log2(predicted/real) - for different genes associated with different groups of diseases (according to MeSH parent categories).
  • Figure 4(a) illustrates average trendiness of publications, reviews, citations and citations from reviews for all (general) publication documents.
  • Figure 4(b) illustrates average trendiness of publications, reviews, citations and citations from reviews originating from big and medium sized pharmaceutical companies.
  • Figure 5 illustrates a gene-gene-disease co-occurrence network of the first neighbours of CD274.
  • Disease and gene nodes are labelled with their defined name, and the size of the gene nodes represents their 'trendiness' according to the defined metric.
  • the edges indicate gene-disease and gene-gene associations, with the width of the edges reflecting the number of co-occurrences in each case.
  • Figure 6 illustrates topic timelines for the number of publications mentioning different genes of interest, i.e. the evolution of the topics associated with some of the trendiest genes is explored.
  • Figures 6(a), 6(b) and 6(c) respectively show topic timelines for publications mentioning any of the genes for the immune checkpoint inhibitor, necroptosis, and pyroptosis pathways.
  • four topic timelines are shown.
  • the four latent topics were obtained using Non-Negative Matrix Factorisation of all publications annotated with the genes after disambiguation. All timelines show a rising topic after 2013 that represents the reason why these genes became 'trendy'.
  • the topic timeline suggests that there was a rapid decrease in the likelihood of publications - indicated by the topic timeline labelled 601 - discussing the biological role of these immune checkpoint inhibitors since 2010, which coincides with a notable increase in topics - labelled 602 - that discuss cancer therapies and monoclonal antibodies that target these four different transmembrane immunoglobulins.
  • the topic-detection pipeline is able to capture the evolution of the research from its biological description to the clinical application.
  • the topic timeline of the members of the necroptosis pathway suggests that in the last decade there has been a decrease in the likelihood of publications discussing these genes in the context of apoptosis - indicated by the topic timeline labelled 611 - in favour of publications that focus on the newly discovered form of cell death, the necroptotic pathway, as well as the translational medicine perspective of this pathway, as is suggested by words like mouse, treatment and activity or cancer (indicated by the topic timelines labelled 612).
  • the topic timeline of the members of the pyroptosis pathway shows a fast increase from 2013 of publications discussing the therapeutic opportunity in cancer immunotherapy with agonists for TMEM173 (indicated by the topic timeline labelled 621), while again, the remaining topics seemed to contain information on the biochemistry and biological role of the genes.
  • Immune checkpoint inhibitors: CTLA4, CD274, PDCD1, TIGIT. CTLA4, PDCD1 (PD-1), CD274 (PD-L1) and TIGIT are among the trendiest genes in academia and pharma in 2019.
  • CTLA4, PDCD1, CD274 and TIGIT genes encode four different transmembrane immunoglobulins that act as co-inhibitory receptors: checkpoints or 'brakes' for the adaptive immune response that prevent T cells from exerting their functions.
  • CTLA4 competes with its analogue CD28 for CD80 and CD86 to prevent a premature activation of T cells.
  • PDCD1-CD274 interaction counters the positive signals that may have already activated T effector cells.
  • TIGIT interacts with CD155 to down-regulate natural killer cells and T lymphocytes. Cancer cells attempt to impair these checkpoints and currently there are 7 FDA approved monoclonal antibodies that target three of these proteins (CTLA4: Ipilimumab; PDCD1: Nivolumab, Pembrolizumab, Cemiplimab; CD274: Atezolizumab, Avelumab) and multiple candidates targeting TIGIT (BGB-A1217, OMP-313M32, MTIG7192A, AB154).
  • C9orf72 encodes a guanine nucleotide exchange factor involved in endosomal trafficking and autophagy. Hexanucleotide repeat expansions in promoter or intronic regions of C9orf72 are some of the major causes of sporadic and familial forms of both amyotrophic lateral sclerosis and frontotemporal dementia. Antisense oligonucleotides are being used to impede the transcription of C9orf72, and the CRISPR-Cas9 system is being used to target the GGGGCC repeat in the DNA or RNA.
  • TREM2 gene encodes a transmembrane immunoglobulin receptor expressed in macrophages, osteoclasts, dendritic cells, and brain microglia. TREM2 variants have been associated with Nasu-Hakola disease, late-onset Alzheimer’s disease, frontotemporal dementia, amyotrophic lateral sclerosis and Parkinson’s disease. TREM2 activates a pathway - through TYROBP/DAP12 - that promotes inflammation and promotes phagocytosis of cellular waste, remains of apoptotic cells, and pathogens.
  • two independent groups have generated anti-TREM2 antibodies to stimulate microglia to remove amyloid plaques.
  • Alector, in collaboration with AbbVie, has entered Phase I clinical trials.
  • DNA sensing by cGAS-STING: cGAS, TMEM173, GSDMD, GSDMA.
  • the cytosolic nucleic acid-sensing pathway leads to pyroptosis, a lytic pro-inflammatory type of cell death involved in antiviral, antibacterial, and anticancer response.
  • cGAS is a nucleotidyltransferase that catalyses production of cyclic GMP-AMP (cGAMP) upon the recognition of double-stranded DNA.
  • TMEM173 (STING) binds to cGAMP and promotes the activation of both TBK1 and IRF3, increasing the transcription of genes encoding type I interferons.
  • GSDMA and GSDMD are effector proteins that form pores in the plasma membrane to release proinflammatory interleukins like IL-1β and IL-18.
  • the cGAS-STING pathway has been associated with multiple autoimmune and chronic inflammatory diseases like non-alcoholic fatty liver disease, systemic lupus erythematosus, vascular and pulmonary syndrome, macular degeneration, Bloom syndrome, Aicardi-Goutières syndrome, cancer, DNA damage, neurodegeneration and beyond.
  • TMEM173 and GSDMD Although there are no reported trials for GSDMA nor cGAS.
  • RIPK1, RIPK3 and MLKL form part of the tumour necrosis factor-induced necroptosis pathway. This pathway has been associated with multiple pathologies: systemic inflammatory response syndrome, ulcerative colitis, psoriasis, rheumatoid arthritis, neurodegenerative diseases and even cancer.
  • TNFR1, FasL, TRAIL, and TLR can all activate RIPK1 to decide the cell’s fate: inflammation, apoptosis or necrosis. If caspase-8 is inhibited, RIPK1 and RIPK3 form the necrosome that subsequently phosphorylates MLKL.
  • MLKL forms homo-trimers, migrates to the plasma membrane, binds to highly phosphorylated inositol phosphates, creates pores in the membrane and disrupts the cell integrity.
  • RIPK1 dates back to 1995. Since then, four inhibitor programs have progressed through human phase II safety trials. The first publication mentioning MLKL is more recent and, despite the lack of kinase activity, pharmaceutical companies have cited its publications 60 times more since 2013. Although there are no clinical trials yet, there are at least three known different chemical inhibitors.
  • Mechanobiology: YAP1/WWTR1, PIEZO1 and PIEZO2.
  • YAP1/WWTR1 are transcriptional co-activators and mechanotransducers. YAP/TAZ is hyperactivated in cancers, its inhibition reduces atherogenesis and fibrosis, it triggers pulmonary hypertension, and it is necessary for epithelial regeneration in the intestine.
  • PIEZO1 and PIEZO2 are two mechano-sensitive cation channels that play a key role in cell number regulation and migration, hearing, neural and vascular development, somatosensory functions, proprioception and beyond.
  • Piezo channels have been recently associated with multiple pathologies like arthrogryposis, apnea, congenital lymphatic dysplasia, hyperalgesia, malaria, pancreatitis, xerocytosis, Gordon syndrome, Marden-Walker Syndrome, and Distal Arthrogryposis Type 5.
  • the discovery of mechanotransduction signalling pathways has received notable attention in recent years and may open the door to new therapeutic strategies to treat these diseases.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Chemical & Material Sciences (AREA)
  • Primary Health Care (AREA)
  • Bioethics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Toxicology (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Library & Information Science (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
PCT/GB2021/052813 2020-10-29 2021-10-29 Computational drug target selection Ceased WO2022096861A2 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
KR1020237017962A KR20230128266A (ko) 2020-10-29 2021-10-29 Computational drug target selection
CN202180074273.9A CN116508017A (zh) 2020-10-29 2021-10-29 Computational drug target selection
JP2023550727A JP2023547964A (ja) 2020-10-29 2021-10-29 Computational drug target selection
EP21884122.9A EP4238097A2 (en) 2020-10-29 2021-10-29 Computational drug target selection
US18/138,705 US20230352193A1 (en) 2020-10-29 2023-04-24 Computational Drug Target Selection

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB2017177.3 2020-10-29
GB2017177.3A GB2600687A (en) 2020-10-29 2020-10-29 Computational drug target selection

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/138,705 Continuation US20230352193A1 (en) 2020-10-29 2023-04-24 Computational Drug Target Selection

Publications (2)

Publication Number Publication Date
WO2022096861A2 (en) 2022-05-12
WO2022096861A3 (en) 2022-08-25

Family

ID=73776466

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2021/052813 Ceased WO2022096861A2 (en) 2020-10-29 2021-10-29 Computational drug target selection

Country Status (7)

Country Link
US (1) US20230352193A1
EP (1) EP4238097A2
JP (1) JP2023547964A
KR (1) KR20230128266A
CN (1) CN116508017A
GB (1) GB2600687A
WO (1) WO2022096861A2

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024194479A1 (en) 2023-03-23 2024-09-26 Exscientia Ai Limited Computational drug target selection

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20250045561A (ko) 2023-09-25 2025-04-02 LG Energy Solution, Ltd. Charging management apparatus and operating method thereof

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5347878B2 (ja) * 2009-09-29 2013-11-20 Fujitsu Limited Inter-document relationship analysis apparatus, program therefor, and method therefor
US10592541B2 (en) * 2015-05-29 2020-03-17 Intel Corporation Technologies for dynamic automated content discovery
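CN109074420B (zh) * 2016-05-12 2022-03-08 F. Hoffmann-La Roche AG System for predicting the efficacy of targeted drug treatment of disease
CN108427702B (zh) * 2017-10-23 2021-02-09 Ping An Technology (Shenzhen) Co., Ltd. Target document acquisition method and application server
JP7237574B2 (ja) * 2018-12-27 2023-03-13 Omron Healthcare Co., Ltd. Blood pressure measurement device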
US11721441B2 (en) * 2019-01-15 2023-08-08 Merative Us L.P. Determining drug effectiveness ranking for a patient using machine learning

Also Published As

Publication number Publication date
GB202017177D0 (en) 2020-12-16
JP2023547964A (ja) 2023-11-14
GB2600687A (en) 2022-05-11
US20230352193A1 (en) 2023-11-02
KR20230128266A (ko) 2023-09-04
CN116508017A (zh) 2023-07-28
WO2022096861A3 (en) 2022-08-25
EP4238097A2 (en) 2023-09-06

Similar Documents

Publication Publication Date Title
Shao et al. DTI-HETA: prediction of drug–target interactions based on GCN and GAT on heterogeneous graph
Ata et al. Recent advances in network-based methods for disease gene prediction
Zhang et al. NCBRPred: predicting nucleic acid binding residues in proteins based on multilabel learning
Min et al. TargetNet: functional microRNA target prediction with deep neural networks
Le Machine learning-based approaches for disease gene prediction
Seoane et al. A pathway-based data integration framework for prediction of disease progression
Shehu et al. A survey of computational methods for protein function prediction
Elbasir et al. DeepCrystal: a deep learning framework for sequence-based protein crystallization prediction
Peng et al. EnANNDeep: an ensemble-based lncRNA–protein interaction prediction framework with adaptive k-nearest neighbor classifier and deep models
Li et al. PGCN: Disease gene prioritization by disease and gene embedding through graph convolutional neural networks
Chen et al. DIFFUSE: predicting isoform functions from sequences and expression profiles via deep learning
Park et al. Tissue-aware data integration approach for the inference of pathway interactions in metazoan organisms
Li et al. Evaluating disease similarity based on gene network reconstruction and representation
Jung et al. TimesVector: a vectorized clustering approach to the analysis of time series transcriptome data from multiple phenotypes
Alborzi et al. PPIDomainMiner: Inferring domain-domain interactions from multiple sources of protein-protein interactions
US20230352193A1 (en) Computational Drug Target Selection
Dabydeen et al. Unbiased Boolean analysis of public gene expression data for cell cycle gene identification
Miranda-Escalada et al. Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical–protein relations
Serrano Nájera et al. TrendyGenes, a computational pipeline for the detection of literature trends in academia and drug discovery
Hong et al. CrepHAN: cross-species prediction of enhancers by using hierarchical attention networks
Wang et al. Self-supervised graph representation learning integrates multiple molecular networks and decodes gene-disease relationships
Moler et al. Integrating naive Bayes models and external knowledge to examine copper and iron homeostasis in S. cerevisiae
Wang et al. Deep fusion learning facilitates anatomical therapeutic chemical recognition in drug repurposing and discovery
Blei et al. Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center bibliography for genes related to life span
Zhang et al. Prediction of gene co-expression from chromatin contacts with graph attention network

Legal Events

Date Code Title Description
WWE WIPO information: entry into national phase
Ref document number: 2023550727; Country of ref document: JP
Ref document number: 202180074273.9; Country of ref document: CN

NENP Non-entry into the national phase
Ref country code: DE

ENP Entry into the national phase
Ref document number: 2021884122; Country of ref document: EP; Effective date: 20230530

121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 21884122; Country of ref document: EP; Kind code of ref document: A2