WO2011151500A1 - Arrangement and method for finding relationships among data

Publication number
WO2011151500A1
WO2011151500A1 PCT/FI2010/050441 FI2010050441W WO2011151500A1 WO 2011151500 A1 WO2011151500 A1 WO 2011151500A1 FI 2010050441 W FI2010050441 W FI 2010050441W WO 2011151500 A1 WO2011151500 A1 WO 2011151500A1
Authority
WO
WIPO (PCT)
Prior art keywords
nodes
subgraph
query
graph
node
Prior art date
Application number
PCT/FI2010/050441
Inventor
Lauri Eronen
Atte Hinkka
Petteri Hintsanen
Melissa Kasari
Kimmo Kulovesi
Laura Langohr
Petteri Sevon
Hannu Toivonen
Original Assignee
Helsingin Yliopisto
Priority date
Filing date
Publication date
Application filed by Helsingin Yliopisto
Priority to PCT/FI2010/050441
Publication of WO2011151500A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/20: Heterogeneous data integration
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology
    • G16B5/00: ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G16B5/20: Probabilistic models
    • G16B50/10: Ontologies; Annotations

Definitions

  • the invention pertains to information and computer sciences.
  • the invention concerns the creation of a searchable aggregate knowledge base from a number of information sources.
  • different biological databases may be separately searched e.g. via web interfaces using a plurality of search terms such that the top priority search results finally incorporate a maximum number of any or all of them.
  • a gene mapping for a particular phenotype could be considered. The mapping may have resulted in a large set of candidate genes.
  • the researchers may first compare the candidates in the light of what is disclosed about them in the public databases and literature, hoping to be able to concentrate the efforts and resources on the most promising candidates. This slow and laborious task is mostly done manually by browsing the databases, which inevitably limits the extent and coverage of the search and easily lowers the quality of obtained results.
  • the objective is to alleviate one or more of the aforesaid defects present in prior art solutions and to provide at least a feasible alternative for finding information.
  • Certain embodiments of the present solution may be applied to implement a search engine regarding a predetermined domain or domains such as biological, biomedical and/or medical domains.
  • An index may be pre-constructed on the basis of a plurality of data sources such as databases relating to the domain.
  • the data sources may include heterogeneous data.
  • the index integrates data from the sources into a local repository represented by a weighted, such as probabilistic, model, advantageously a graph, wherein nodes represent data records and edges represent their interrelations (based on associations between the data records indicated e.g. in the data sources). Thereafter, the user may query the index, e.g.
  • the obtained subgraph may contain only a single node, while in the maximum case it may contain all the nodes of the original source graph.
  • the solution is applicable for exposing indirect and therefore commonly unknown associations between two or more data records and related concepts, for instance.
  • the proposed solution may be utilized for determining the subgraph and/or path best connecting two or more query nodes according to a number of predetermined criteria, such as maximization of subgraph or path reliability (probability).
  • an electronic arrangement such as a server arrangement, comprises:
  • -a communication entity configured to receive data records from a number of, such as plurality of, data sources such as databases relating to a predetermined application domain such as biological or biomedical domain
  • -an indexing entity configured to populate an index for the data records contained in said number of data sources, said index further comprising associations between said records based on indications in the data sources
  • the utilized data model comprises a weighted graph, such as a probabilistic graph, having a plurality of nodes and edges, a node corresponding to one or multiple, aggregated, data records relating to the same concept and an edge representing a relationship (association) between two nodes, an edge being associated with a weight
  • -a search interface entity configured to receive a user-defined query relating to a number of concepts of the domain and to associate the concepts with the corresponding one or more query nodes of the graph
  • -a search engine entity configured to determine a subgraph from the graph associated with said one or more query nodes according to a number of predetermined criteria utilizing the weights, edge type and/or other characteristic of the edges, and a predetermined subgraph extraction technique
  • -a visualization entity configured to control the graphical visualization of the subgraph on a display device, wherein said one or more query nodes, a number of related other nodes and edges of the subgraph are illustrated so as to facilitate finding, verifying and understanding indirect associations among said one or more query nodes and/or associations between one or more graph elements such as one or more other nodes and said one or more query nodes, such as facilitating verifying a hypothesis between a phenotype and a related gene.
  • the arrangement may be configured to utilize a proprietary or public database, e.g. a loadable file version thereof, as a data source.
  • the database may be a scientific, engineering, or some other type of a database.
  • a biological database e.g. EntrezGene, UniProt, Swiss-Prot, EntrezProtein, GO(A) (Gene Ontology (Annotation)), Interpro, STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) or (O)MIM ((Online) Mendelian Inheritance in Man)
  • a data record of a data source such as of a database may be turned into a node of the applied probabilistic model, whereas a cross-reference between records may be turned into an edge thereof.
  • a node and/or an edge of the data model (graph) may be associated with at least one characteristic type and/or other characteristic attribute.
  • an edge type may generally confer similarity, interaction, and/or causal relationship between the end nodes thereof.
  • a node may be of gene or protein type in a biological context, for example. Generic types incorporating a plurality of sub-types may be provided. For instance, a sequence may subsume both gene and protein. The aforesaid weight attribute may be used to indicate the estimated edge strength and/or probability of the edge's actual existence, for instance.
  • the arrangement may be configured to identify common concepts from a plurality of data sources such as databases such that potential naming practice differences relative to the same concept are detected, on the basis of e.g. synonym information and/or semantic analysis, and the related data are combined to the same record associated e.g. with a certain node of the established index, for example.
  • a weight may be assigned to an edge by applying a function of a number of aspects.
  • the weight may indicate edge goodness on the basis of e.g. a) predetermined data reliability factor associated with each data source (e.g. an edge derived on the basis of data source a receives a reliability factor x, an edge derived on the basis of data source b receives a reliability factor y, etc.), b) relevance of an edge type assigned e.g. query-specifically (may be user-controllable), and/or c) rarity associated with e.g. the degree of the incident node of an edge (lower degree of node -> more specific and potentially more informative link behind the related edge, for example).
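The three aspects listed above could be combined as in the following minimal sketch. The source only names the aspects; the multiplicative combination, the reliability values, the source names and the `rarity` formula are all assumptions made for illustration.

```python
# Hypothetical reliability factor per data source (illustrative values only).
SOURCE_RELIABILITY = {"source_a": 0.9, "source_b": 0.6}


def rarity(degree_u: int, degree_v: int, alpha: float = 0.5) -> float:
    """Assumed rarity term in (0, 1]: lower end-node degree gives a value closer to 1."""
    return max(degree_u, degree_v, 1) ** -alpha


def edge_weight(source: str, relevance: float, degree_u: int, degree_v: int) -> float:
    """Combine (a) source reliability, (b) edge-type relevance and (c) rarity.

    Multiplying the three factors is an assumption of this sketch,
    not a formula given in the source.
    """
    weight = SOURCE_RELIABILITY[source] * relevance * rarity(degree_u, degree_v)
    return min(max(weight, 0.0), 1.0)  # keep the result usable as a probability


if __name__ == "__main__":
    # An edge between two low-degree nodes outweighs the same edge between hubs.
    print(edge_weight("source_a", 1.0, 1, 1))
    print(edge_weight("source_a", 1.0, 100, 50))
```

With this choice the weight stays in [0, 1], so it can also be read as an edge probability in the probabilistic graph.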
  • indexing may be performed periodically, e.g. in a timed manner, or upon some more specific triggering condition, e.g. detection of change(s) in a data source or in the indexed data, and/or by external trigger, e.g. by any of the data sources or a client device.
  • the query may directly include or at least implicitly indicate only a single query node of the graph.
  • the subgraph may include so-called neighbor nodes best relating to the queried node according to a number of predetermined criteria, for example.
  • a path strength at least partly defined by the product of path edge weights (path probability in the case of given edge probabilities) may be used as criterion for determining the connection between the query node and some other node.
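As an illustration of path strength defined by the product of edge weights, the sketch below computes the strength of a given path and searches for the strongest path between two nodes. The Dijkstra-style search is a technique chosen for this example, not one named by the source; it relies on all weights lying in (0, 1]. The graph layout and node names are hypothetical.

```python
import heapq
from math import prod

# node -> {neighbor: edge probability}; undirected edges are stored in both directions.
Graph = dict[str, dict[str, float]]


def path_strength(graph: Graph, path: list[str]) -> float:
    """Product of the edge weights along the path (path probability)."""
    return prod(graph[u][v] for u, v in zip(path, path[1:]))


def strongest_path(graph: Graph, source: str, target: str) -> tuple[list[str], float]:
    """Return the path with the maximum weight product, or ([], 0.0) if none exists."""
    best = {source: 1.0}
    previous: dict[str, str] = {}
    heap = [(-1.0, source)]  # max-heap emulated with negated strengths
    done: set[str] = set()
    while heap:
        neg_strength, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, weight in graph.get(u, {}).items():
            strength = -neg_strength * weight
            if strength > best.get(v, 0.0):
                best[v] = strength
                previous[v] = u
                heapq.heappush(heap, (-strength, v))
    if target not in best:
        return [], 0.0
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return path, best[target]


if __name__ == "__main__":
    g = {
        "A": {"B": 0.9, "C": 0.5},
        "B": {"A": 0.9, "D": 0.8},
        "C": {"A": 0.5, "D": 0.9},
        "D": {"B": 0.8, "C": 0.9},
    }
    print(strongest_path(g, "A", "D"))
```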
  • the search interface entity may be configured to utilize a user-defined search query term (concept) directly as a query node (or other distinguishable member) of the graph.
  • determination of the target subgraph may be executed using at least one subgraph extraction precept selected from the group consisting of: a subgraph spanned by most likely acyclic paths (strongest paths connecting the query nodes may be selected until a desired size of the subgraph is reached, for example), subgraph spanned by minimum paths (paths the subpaths of which do not connect given nodes may be selected until a desired size is reached), most reliable subgraph (a subgraph of selected size may be determined such that it, as a whole, connects the query nodes as effectively as possible, i.e.
  • PC (Path Covering)
  • PC may be implemented as a two-phase process.
  • in a path sampling phase, a relatively small set C of candidate paths may be gathered from the set of all s-t paths in a selected graph.
  • the aim may be to choose an optimal subset 'F of the candidate paths in C, according to e.g. edge budget.
  • a modification of the PC method may be applied to cases with two or more query nodes.
  • Some PC principles may remain unchanged, such as the two phases and use of Monte Carlo simulation.
  • the proposed method advantageously considers spanning trees connecting the k query nodes, with 2 ≤ k ≤
  • set S in the objective function to be reviewed hereinlater consists of spanning trees instead of paths as in PC.
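The Monte Carlo simulation mentioned above can be illustrated by estimating how reliably a candidate subgraph connects the query nodes: each edge is kept with its probability, and the fraction of random realizations in which all query nodes end up connected is reported. This is only a sketch of the reliability estimate under assumed names and data layout; it is not the PC method itself, nor its candidate selection phase.

```python
import random

Edge = tuple[str, str, float]  # (node_u, node_v, probability that the edge exists)


def all_connected(edges: list[tuple[str, str]], terminals: list[str]) -> bool:
    """Check by depth-first search that every terminal is reachable from the first one."""
    adjacency: dict[str, set[str]] = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    start = terminals[0]
    seen, stack = {start}, [start]
    while stack:
        for neighbor in adjacency.get(stack.pop(), ()):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return all(t in seen for t in terminals)


def estimate_reliability(edges: list[Edge], terminals: list[str],
                         samples: int = 10_000, seed: int = 0) -> float:
    """Fraction of sampled graph realizations in which all terminals are connected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        realized = [(u, v) for u, v, p in edges if rng.random() < p]
        hits += all_connected(realized, terminals)
    return hits / samples


if __name__ == "__main__":
    # Two parallel two-edge routes between hypothetical query nodes "s" and "t".
    subgraph = [("s", "a", 0.9), ("a", "t", 0.8), ("s", "b", 0.5), ("b", "t", 0.9)]
    print(estimate_reliability(subgraph, ["s", "t"]))
```

The same function accepts more than two terminals, which corresponds to the case of k query nodes.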
  • the obtained (sub)graph may be cultivated by aggregating (grouping and/or combining) nodes having e.g. an identical or substantially similar neighborhood (such as similar connections with common end points) according to a number of predetermined criteria together so as to elevate the informative value of the subgraph from the standpoint of the user.
  • other techniques for grouping and potentially combining nodes and/or (sub-)graphs may be applied.
  • edge and/or path strength of sufficient (predetermined level) between nodes may be utilized as a feasible criterion in addition to or instead of other criteria, whereby e.g. clusters and/or cliques may be determined accordingly.
  • the graphical visualization of the subgraph may be performed utilizing a Java-based entity such as application or applet.
  • the graphical visualization may preferably include node labels (identifiers) and/or node and/or edge types, preferably also supplementary or alternative information such as various attributes, e.g. edge weights.
  • the graphical visualization is interactive.
  • the illustrated graph may be zoomed, rotated and/or otherwise controlled.
  • An element such as a node or edge may be selected by the user via the user interface (UI) to activate an associated feature.
  • the selection feature may trigger a function selected from the group consisting of: visualization of the element detail(s), access to the corresponding data source using e.g. HTTP/WWW, change of a related type or attribute, change of a query criterion or other parameter, editing of the element, at least visually expanding the element (if e.g. a subgraph is represented by a single entity), at least visually collapsing the element (vice versa, a number of nodes and/or edges may be represented by a node and a number of related edges), and addition of supplementary information.
  • Selection may trigger execution of a related function automatically or after a second action such as additional verification, selection, or e.g. click action in the case of a double-click type selection action.
  • the change of the query may indicate narrowing, expanding or otherwise changing the query in terms of the related index search criteria.
  • the arrangement may be configured to automatically hide details, such as nodes, edges, node-related information, and/or edge-related information, deemed as uninteresting from the standpoint of the user according to predetermined criterion such as rules based on query terms (directly user-defined and/or related concepts).
  • the arrangement may be configured to determine a number of representative nodes in a graph such as the original source graph (index) or obtained subgraph thereof.
  • the given nodes are first clustered utilizing e.g. a desired similarity measure and then a representative is preferably selected from each cluster.
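A minimal sketch of the cluster-then-select idea follows. The greedy threshold clustering, the choice of the representative as the member most similar to the rest of its cluster, and the example similarity values are assumptions for illustration; the source leaves the similarity measure and the clustering method open.

```python
from typing import Callable


def cluster_and_pick(nodes: list[str],
                     similarity: Callable[[str, str], float],
                     threshold: float) -> tuple[list[list[str]], list[str]]:
    """Greedily cluster nodes, then pick one representative per cluster."""
    clusters: list[list[str]] = []
    for node in nodes:
        for cluster in clusters:
            # Join the first cluster whose members are all similar enough.
            if all(similarity(node, member) >= threshold for member in cluster):
                cluster.append(node)
                break
        else:
            clusters.append([node])
    representatives = [
        # Representative: the member with the highest total similarity to the others.
        max(cluster, key=lambda n: sum(similarity(n, m) for m in cluster if m != n))
        for cluster in clusters
    ]
    return clusters, representatives


if __name__ == "__main__":
    # Hypothetical pairwise similarities; unlisted pairs default to 0.1.
    table = {frozenset("ab"): 0.9, frozenset("ac"): 0.8, frozenset("bc"): 0.7}
    sim = lambda x, y: table.get(frozenset((x, y)), 0.1)
    print(cluster_and_pick(["a", "b", "c", "d"], sim, threshold=0.6))
```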
  • the user may be allowed to specify query nodes (concepts), whereupon the arrangement is configured to find other nodes (concepts) that are relevant with respect to the query nodes, but advantageously non-redundant with respect to each other.
  • Such embodiment may be motivated by biological graphs, for example, where similar problems arise in connection with summarizing existing knowledge concisely, as well as in discovering and highlighting a selected set of potentially novel, less obviously related concepts.
  • the arrangement is configured to utilize a probabilistic proximity measure for defining a relevance function for a node with respect to (positive) query nodes.
  • the measure is expanded to take negative query nodes to be avoided into account.
  • the arrangement may be generally configured to measure the relevance and non-redundancy of a set of retrieved nodes.
  • the arrangement may be configured to estimate, in addition to or instead of relationship strength (e.g. path strength) between e.g. two nodes/concepts, also the unexpectedness of such a relationship.
  • the resulting subgraph might be constructed so as to include and/or highlight nodes the connection of which is strong, but inevident.
  • the used criterion may be based on link types, for example.
  • a method for finding relationships among data comprising -receiving data records from a number of, such as plurality of, data sources such as databases relating to a predetermined application domain such as biological or biomedical domain,
  • the utilized data model comprises a weighted graph having a plurality of nodes and edges, a node corresponding to one or multiple, aggregated, data records relating to the same concept and an edge representing a relationship between two nodes, an edge being associated with a weight,
  • the utility of the present invention follows from a plurality of issues.
  • the invention may be applied as a highly automatic technical aid for digging up new information on the basis of found implicit, indirect, often therefore previously unknown or unrecognized, connection(s), i.e. paths or "chains" of associations, between given concepts in a certain domain for visually verifying, producing and/or ranking related hypothesis from the standpoint of a life scientist, for example.
  • the suggested PC method is efficient and scalable in connection with large source graphs, for example.
  • the suggested modified PC method for reliable subgraph extraction in connection with two or more query nodes is also both effective and extraordinarily scalable in comparison with prior art methods.
  • the invention may provide information on which concepts are related, how they are related, and/or how strong the connections are. Single paths, complexes and distant connections relative to the queried concepts may be identified. The concepts and/or their interrelations may be then prioritized for future research, for example.
  • the domain may preferably include abundant data, uncertain/weighted direct relations among data, some incompleteness (e.g. in the light of relations), and/or heterogeneity for maximum applicability in view of the present invention.
  • the present invention may traverse whole network(s) of relations on the basis of a formal, probabilistic model for the purpose. Finding and exploring non-trivial, possibly surprising, associations is especially important in many scientific and engineering fields, where large bodies of domain knowledge are available, and research and development has a fast pace and is very competitive. The biomedical domain is a very promising example of such a domain.
  • a biomedical researcher may construct a hypothesis or guesstimate around a phenotype and potential genes related thereto, whereupon the hypothesis may be conveniently verified using the suggested arrangement and method.
  • Representative concepts may be determined on the basis of a larger initial group for reducing redundancy and facilitating identifying complementary or alternative entities, for example.
  • the used data sources are preferably structured, but additionally or alternatively only partly structured or even unstructured sources, e.g. books, journals and/or web pages, may be applied at the cost of the additional processing likely required for mining them, e.g. via textual/semantic analysis, and the related uncertainty.
  • the present invention may be used to graphically illustrate the found links between interesting concepts preferably with interactivity and/or editability, which facilitates understanding the found relationships instead of merely numerical or textual output.
  • Additional query formats may further facilitate providing new information to the user. For example, a query may be constructed for finding other relevant concepts related to the queried concept, to determine the surprising ones according to the used criterion, and/or for finding analogical/similar relations.
  • Fig. 1a illustrates an embodiment of the arrangement in accordance with the present invention.
  • Fig. 1b is a block diagram of an embodiment of the arrangement.
  • Fig. 2 illustrates a first example of a subgraph obtained utilizing an embodiment of the present invention.
  • Fig. 3 illustrates a second example of a subgraph obtained utilizing an embodiment of the present invention.
  • Fig. 4 visualizes test results in connection with an embodiment of a modified path covering method for discovering the most reliable sub-network in the light of two or more queried concepts.
  • Fig. 5a illustrates an embodiment of finding representative nodes in a graph in accordance with the present invention.
  • Fig. 5b illustrates an embodiment of finding nodes particularly relevant to the query node(s).
  • Fig. 5c is another illustration of the embodiment for finding nodes particularly relevant to the query node(s).
  • Fig. 6 discloses a flow diagram of an embodiment of a method according to the present invention.
  • Figure 1a illustrates the concept of the present invention in accordance with an embodiment thereof.
  • a terminal 102 such as a desktop, laptop or mobile (hand-held) computer, is functionally connected to a server arrangement 104 e.g. via at least one communication network 106 such as the Internet.
  • the server arrangement 104 offers a search service, e.g. a web service, to its clients such that it constructs an index preferably representable in the form of a weighted (e.g. probabilistic) graph 105 based on data records available in a plurality of optionally heterogeneous, external data sources 108a, 108b and 108c, and serves a search query provided by the user 102a via the terminal 102 through the provision of appropriate subgraph 103 and/or other data in return.
  • the server arrangement 104 may be implemented, as a logical entity, also directly in the utilizing terminal device 102, being then typically best suitable for local use.
  • the broken arrows between the data sources 108a, 108b, 108c represent the provision of data to the arrangement 104 for indexing. Data may be transferred also in the reverse direction.
  • the arrangement 104 may poll a source 108a, 108b, 108c upon need, or the source 108a, 108b, 108c may be configured to send the data to the arrangement 104 automatically.
  • Figure 1b represents, by way of example only, a block diagram of the server arrangement 104, terminal 102 or other similar arrangement incorporating one or more devices and configured to provide the search service in accordance with the present invention.
  • the entity in question is typically provided with one or more processing devices capable of processing instructions and other data, such as one or more microprocessors, micro-controllers, DSPs (digital signal processor), programmable logic chips, etc.
  • the processing entity 110 may thus, as a functional entity, physically comprise a plurality of mutually co-operating processors and/or a number of sub-processors connected to a central processing unit, for instance.
  • the processing entity 110 may be configured to execute the code stored in a memory 116, which may refer to search engine software 118 in accordance with the present invention and other applicable software applications.
  • the logic for the search engine and related functionalities may indeed be implemented as software stored in the memory entity 116 and executed by the processing entity 110.
  • the search entity may include a number of applications, modules, and/or other software preferably having at least a functional interconnection.
  • Software 118 may utilize a dedicated or a shared processor for executing the tasks thereof.
  • the memory entity 116 may be divided between one or more physical memory chips or other memory elements.
  • the memory 116 may further refer to and include other storage media such as a preferably detachable memory card, a floppy disc, a CD-ROM, or a fixed storage medium such as a hard drive.
  • the memory 116 may be non-volatile, e.g. ROM (Read Only Memory), and/or volatile, e.g. RAM (Random Access Memory), by nature.
  • the UI (user interface) 114 may comprise a display, e.g. an (O)LED (Organic LED) or LCD (liquid crystal display) display, and/or a connector to an external display or other type of a display device such as a data projector, and a keyboard/keypad/mouse/touchpad and/or other applicable control input means (e.g. touch screen or voice control input, or separate keys/buttons/knobs/switches) configured to provide the user of the entity with practicable data, e.g. graph, visualization and control means.
  • the UI 114 may further include one or more loudspeakers and associated circuitry such as D/A (digital-to-analogue) converters for sound output, and a microphone with A/D converter for sound input.
  • the entity may comprise an interface 112 such as at least one transceiver incorporating e.g. a radio part including a wireless transceiver, such as a WLAN (Wireless LAN), Bluetooth or cellular, e.g. GSM (Global System for Mobile Communications)/UMTS (Universal Mobile Telecommunication System), transceiver, for general communications with external devices and/or a network infrastructure, and/or other wireless or wired data connectivity means such as one or more wired interfaces (e.g. Firewire, LAN such as Ethernet, or USB (Universal Serial Bus)) for communication with other devices such as terminal devices, control devices, server devices, peripheral devices such as external sensors, and/or network infrastructure(s).
  • a carrier medium such as an optical disk, a floppy disk, a memory card, a hard disk, a memory chip, or a memory stick may be configured to comprise computer code, e.g. a computer program product, for performing at least part of the tasks described herein.
  • the program code and/or related data such as model or representation data may be provided on a signal carrier.
  • the code and/or the data may be at least partially encrypted using a selected encryption method such as AES (Advanced Encryption Standard).
  • the entity may be self-contained and include all the necessary functionality from obtaining the data from a number of sources to providing related search services for finding indirect relationships.
  • tasks may be shared and distributed among available devices 102, 104, and/or optionally further devices embodiment-specifically as understood by a skilled person.
  • the server arrangement 104 may in practice contain a single (computer) device or a plurality of at least functionally interconnected devices fulfilling the required functions as an aggregate server entity.
  • the I/O entity 130 may take care of data input and output between external entities and the suggested solution.
  • a control block 120 may generally manage the execution of tasks according to requests received via the I/O entity 130 and internal logic, for example.
  • Indexing entity 122 takes care of producing the integrated, searchable index from a number of, likely a plurality of, data sources such as databases.
  • Search interface entity 124 may turn a received search query into an index query, which may refer to receiving the search query and identifying the nodes of the index (graph) either directly or implicitly identified therein.
  • a user may construct a search query including a term not directly identified as such as a searchable node (and/or edge/edge type and/or characteristic) or other member of the index (graph), whereupon the entity 124 may be configured to provide a list of suggestions for the query node based on inherent estimation logic such as semantic logic trying to associate the search term with a searchable entity of the index.
  • a data structure such as a table may be applied to contain synonyms of different concepts for rapid concept term-to-graph element (e.g. node) term transformation.
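The synonym table and the suggestion list described above might look like the following sketch. The table contents, the node identifier format (`Gene:DCDC2`, `Phenotype:AD`) and the use of string similarity from Python's standard `difflib` module in place of the "semantic logic" are assumptions made for illustration.

```python
import difflib
from typing import Optional

# Hypothetical synonym table: normalized concept term -> graph node identifier.
SYNONYMS = {
    "dcdc2": "Gene:DCDC2",
    "alzheimer disease": "Phenotype:AD",
    "alzheimer's disease": "Phenotype:AD",
    "ad": "Phenotype:AD",
}


def resolve(term: str) -> Optional[str]:
    """Map a query term directly to a node identifier, or None if it is unknown."""
    return SYNONYMS.get(term.strip().lower())


def suggest(term: str, limit: int = 3) -> list[str]:
    """Suggest node identifiers whose known names resemble an unrecognized term."""
    close = difflib.get_close_matches(term.strip().lower(), list(SYNONYMS),
                                      n=limit, cutoff=0.6)
    # Several names may point to the same node; keep each node once, in rank order.
    return list(dict.fromkeys(SYNONYMS[name] for name in close))


if __name__ == "__main__":
    for query_term in ["DCDC2", "alzheimers disease"]:
        node = resolve(query_term)
        print(query_term, "->", node if node else suggest(query_term))
```

A plain dictionary gives constant-time lookups for exact terms, while the suggestion step is only invoked for terms that fail to resolve.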
  • Visualization entity 128 may be configured to control the visualization (views) of the graphs, subgraphs, and related entities to the user 102a. It may be configured to create related visualization data, such as graphical image or video (animation) data and/or related instructions, to be provided to the terminal 102 for illustration via an available data visualization device such as a display.
  • Search engine/subgraph formation entity 126 may be configured to traverse the index database graph and extract a target subgraph therefrom for future visualization and/or other purposes.
  • Any aforesaid entity may be implementation-specifically realized as divided into a number of sub-entities or integrated with some other entity.
  • the user 102a may be provided with various means by the UI 114 for controlling the searches and/or visualization of the search result.
  • edge type, node type and/or data source-specific weighting may be user-controllable or -selectable in connection with executing searches, as may be the size of the search result (subgraph), desired node types (from which representatives may be selected and/or that should be located at the border of the network), and/or other node and/or edge characteristics affecting the search.
  • a (purely) textual result e.g. in hypertext (e.g. HTML) or XML (Extensible Markup Language) format may be provided.
  • Figure 2 illustrates a first example of a subgraph 202 obtained utilizing an embodiment of the present invention.
  • a fictional example of a subgraph summarizing the link between a gene and a phenotype has been depicted.
  • the user has specified the search concepts (source/origin node: Gene(S) and destination node: Phenotype(T)) used as query nodes, and the class of path types of interest (path types suggesting a causal relationship). Indeed, instead of all paths connecting two nodes, the user may be interested in paths with specific semantics, e.g. paths that confer similarity or paths that suggest a causal relationship.
  • a path type is the string of node and/or edge types on a path.
  • a path class corresponding to a type of relationship may be defined as the set of path types that suggest that type of link between the end nodes.
  • the subgraph results from a query for causal links from a given gene to a given phenotype.
  • a context-free grammar may be used as a means of defining path classes.
  • a CFG may define a language of strings (path types in our framework) of terminal symbols (edge and vertex types) that can be derived from a distinguished (starting) non-terminal symbol using a set of production rules and a set of other non-terminal symbols. Each non-terminal symbol defines a path class, and paths of a given class are queried by specifying the non-terminal corresponding to that class as the starting symbol.
  • compared with regular expressions, CFGs are more expressive and provide a natural means of naming path classes.
  • a subgraph querying system may rely on a background CFG, defining a comprehensive domain-specific list of path classes.
  • the complexity of using CFGs may be hidden from the end-user: subgraphs can be queried simply by giving either one or two (sets of) query nodes (vertices) and the path class of interest. Alternatively, the user can pose more complex queries by manually defining the top level production rules and/or any auxiliary production rules.
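  • As an illustration of how a path class defined by a CFG could be checked, the following minimal Python sketch tests whether a path type (a string of node and edge types) can be derived from a given non-terminal. The grammar, the type names (Gene, codes_for, Protein, ...) and the function name in_path_class are hypothetical and are not taken from the described system; the sketch assumes a grammar without empty productions and without cyclic unit productions.

```python
from functools import lru_cache

# Hypothetical grammar: non-terminals map to lists of right-hand sides.
# Any symbol that is not a key of GRAMMAR is treated as a terminal (a type).
GRAMMAR = {
    "Causal": [
        ("Gene", "codes_for", "Protein", "affects", "Phenotype"),
        ("Gene", "codes_for", "Protein", "Interaction", "affects", "Phenotype"),
    ],
    "Interaction": [
        ("interacts_with", "Protein"),
        ("interacts_with", "Protein", "Interaction"),
    ],
}


def in_path_class(start, path_type, grammar=GRAMMAR):
    """Return True if the sequence path_type can be derived from start."""
    tokens = tuple(path_type)

    @lru_cache(maxsize=None)
    def derives(symbol, i, j):
        # Terminal: must match exactly one token.
        if symbol not in grammar:
            return j - i == 1 and tokens[i] == symbol
        return any(match(rhs, i, j) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def match(rhs, i, j):
        if len(rhs) == 1:
            return derives(rhs[0], i, j)
        # Every symbol must cover at least one token, so the first symbol
        # may end anywhere that leaves enough tokens for the rest.
        return any(
            derives(rhs[0], i, k) and match(rhs[1:], k, j)
            for k in range(i + 1, j - len(rhs) + 2)
        )

    return derives(start, 0, len(tokens))


CAUSAL_PATH = ["Gene", "codes_for", "Protein", "interacts_with", "Protein",
               "affects", "Phenotype"]
NON_CAUSAL_PATH = ["Gene", "affects", "Phenotype"]
```

  • In this sketch, the end-user would only name the path class ("Causal"); the production rules stay in the background grammar.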
  • the implemented system may support queries such as: 1) connection subgraph queries between two given nodes (or sets of nodes), returning the subgraph linking the nodes together, and 2) neighborhood queries, returning the subgraph induced by the set of paths starting from a given node and matching the query.
  • Fig. 3 illustrates a second example of a subgraph obtained utilizing an embodiment of the present invention.
  • a life scientist may have been studying the etiology of Alzheimer disease (AD) for drug development or diagnostic purposes, for example.
  • DCDC2 gene could be associated with AD.
  • a related query of two concepts (nodes) "DCDC2 AD" may have resulted in the shown graph.
  • Each unit of information in the graph may, in principle, be previously known from a data source or another, but the chains of relations may be far from obvious even to a skilled person.
  • Advantageously, only the strongest, preferably indirect, relationships between the query nodes are shown to the user to improve the clarity of the obtained subgraph.
  • weights e.g. probabilities
  • the arrangement may be configured to produce a number of preferably user-selectable views on the (sub-)graphs. For example, colors and/or other type of highlighting may be applied to distinguish between different properties of nodes and/or edges.
  • the view may be generated on the basis of user-provided configuration. For example, gene expression information may be used for color (tone) scaling.
  • edge visualization, such as color, style, size and/or shape, may indicate an attribute of the edge, such as its type and/or weight, whereupon textual descriptions are not necessary.
  • the weights may be determined using a selected method. For example, three factors may be utilized: data reliability, relevance, and rarity.
  • the data reliabilities of edges may be defined using a set of rules, such as: if the edge is derived from a predetermined data source A, e.g. Swiss-Prot, then its reliability is X (∈ [0, 1]), whereas if the edge is derived e.g. from the computer-annotated TrEMBL database, then its reliability may be Y (lower than X, for example), etc.
  • the interpretation of edge reliability is the degree of belief the investigator has for the edge being correctly annotated.
  • the value can be transformed into a [0, 1]-similarity value.
  • the similarity of nodes u and v is the probability that any relationship between u and a third node t is also true for v and t, the similarity can be multiplied into the reliability of the edge.
  • the relevance of an edge type may be defined as the degree of the user's belief that edges of that type represent a relevant connection with respect to the query.
  • the user may have a basic configuration (a set of default relevance values for edge types), and only a few adjustments are needed for a typical query.
  • the relevance values may sometimes be easier to give in terms of node types instead of edge types.
  • relevance q(x) for a node type x can be decomposed into coefficients for edge types by multiplying all edge types with one end-node of type x by sqrt(q(x)), and edge types with both end-vertices of type x by q(x).
  • since a path relevance may be defined as a product of the included edge relevances, this gives the desired outcome: the relevance of any path visiting a node of type x is multiplied by q(x).
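  • The decomposition can be checked numerically with a minimal Python sketch. It assumes the square-root reading of the rule (each of the two path edges incident on a visited node contributes sqrt(q(x)), so the node contributes q(x) in total); the edge type names, the example relevance value and the function names are hypothetical.

```python
import math


def edge_type_coefficients(edge_types, node_relevance):
    """Turn node-type relevances into per-edge-type coefficients.

    edge_types maps an edge type name to the pair of node types it connects.
    Each end-node of type x contributes sqrt(q(x)); node types without an
    explicit relevance are treated as having relevance 1.
    """
    coefficients = {}
    for name, (type_a, type_b) in edge_types.items():
        coefficients[name] = (math.sqrt(node_relevance.get(type_a, 1.0))
                              * math.sqrt(node_relevance.get(type_b, 1.0)))
    return coefficients


def path_relevance(edge_type_sequence, coefficients):
    """Path relevance as the product of the coefficients of its edges."""
    return math.prod(coefficients[name] for name in edge_type_sequence)


# Hypothetical edge types and a user-given relevance for one node type.
EDGE_TYPES = {
    "codes_for": ("Gene", "Protein"),
    "interacts_with": ("Protein", "Protein"),
    "affects": ("Protein", "Phenotype"),
}
COEFFS = edge_type_coefficients(EDGE_TYPES, {"Protein": 0.25})
ONE_PROTEIN = path_relevance(["codes_for", "affects"], COEFFS)
TWO_PROTEINS = path_relevance(["codes_for", "interacts_with", "affects"], COEFFS)
```

  • Under these assumptions a path visiting one Protein node is multiplied by q(Protein), and a path visiting two Protein nodes by q(Protein) squared.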
  • d(v) is the probability that any two edges incident on v are related to each other and represent a meaningful path.
  • d(v) = (deg(v) + 1)^(-α)
  • the parameter α determines how steeply the penalty increases with the degree.
  • with α = 1, rarity d(v) = 1/(deg(v) + 1)
  • although lower values of α do not give equally attractive interpretations as random walk probabilities, they can be useful in practice to give relevant penalties for node degree that reward parallel paths more than a standard random walker.
  • the maximum value of d(v) for a non-terminal node v of a path is 3^(-α). Rarity values of the terminal nodes may be ignored; they would only add a constant factor to all paths.
  • the values of α could be set separately for each node type, but in this exemplary embodiment a single value for all nodes is used.
  • the rarity values are decomposed into edge-specific coefficients by taking the square root of them. Ideally, in the context of analysis of connection subgraphs, the relatedness of edges incident on a node should be tested for each pair of edges separately and independently. With the rarity values of nodes decomposed on the incident edges, this is clearly not the case. The approximation is used in order to avoid the quadratic computational cost for each node. It has no effect on evaluation of the goodness of a single path.
  • the probabilities may be converted into distances by taking the negative logarithm of the goodness so that a selected discovery method for finding shortest paths may be utilized.
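  • The conversion can be illustrated with a minimal Python sketch: edge probabilities are turned into distances with the negative logarithm, and a standard shortest-path search (Dijkstra's algorithm is used here as one possible choice, not necessarily the method of the embodiment) then returns the most probable path. The graph, node names and function name are hypothetical; probabilities are assumed to lie in (0, 1].

```python
import heapq
import math


def best_path(graph, source, target):
    """Most probable path between source and target.

    graph maps a node to a dict {neighbor: probability}. Since
    -log(p1 * p2) = -log(p1) - log(p2), the shortest path in the
    -log(p) metric is the path with the largest probability product.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, p in graph.get(u, {}).items():
            nd = d - math.log(p)
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], math.exp(-dist[target])


# Hypothetical undirected graph, stored with both directions.
GRAPH = {
    "s": {"a": 0.9, "t": 0.5},
    "a": {"s": 0.9, "t": 0.9},
    "t": {"a": 0.9, "s": 0.5},
}
PATH, PROB = best_path(GRAPH, "s", "t")
```

  • In this example the two-edge path via node a (0.9 × 0.9 = 0.81) is preferred over the direct edge (0.5), showing that the best path is not necessarily the one with the fewest edges.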
  • a publication "Link Discovery in Graphs Derived from Biological Databases" (Petteri Sevon, Lauri Eronen, Petteri Hintsanen, Kimmo Kulovesi, Hannu Toivonen. 3rd International Workshop on Data Integration in the Life Sciences 2006 (DILS'06), LNBI 4705, 35-49, Hinxton, UK, July 2006. Springer), which is incorporated herein by reference in its entirety, describes, see particularly chapters 3.2-3.4 (pages 7-10), a few applicable methods for finding and evaluating links in connection with large graphs.
  • One feasible subgraph extraction method relies on the principle of determining the most reliable subgraph of selected size constructed such that it, as a whole, connects the query nodes as effectively as possible, i.e. it is maximally likely that there is a connection between the query nodes.
  • Applicable algorithms e.g. for the scenarios of two query nodes include "Best Paths Incremental" (BPI) and "Series-Parallel Augmentation" (SPA), which are described in the publication "Finding Reliable Subgraphs from Large Probabilistic Graphs" (Petteri Hintsanen, Hannu Toivonen. Data Mining and Knowledge Discovery 17 (1): 3-23. 2008. Springer) incorporated herein by reference in its entirety, see especially pages 5 and 7-15.
  • the BPI algorithm is based on the following idea: find the most probable, or "best", paths between the terminal nodes s and t, and let them span a subgraph.
  • the BPI algorithm adds best paths to the solution until it has at least
  • the SPA algorithm is based on direct optimization of the reliability of the result subgraph H in a greedy, iterative manner. The cost of evaluating reliability is greatly reduced by constructing series-parallel graphs, a restricted class of graphs for which the reliability can be evaluated efficiently.
  • Fig. 4 visualizes a few, merely exemplary, test results in connection with an embodiment of a novel, modified PC method for discovering the most reliable subnetwork especially in scenarios with two or more target concepts (query nodes).
  • Let G = (V, E) be an undirected graph where V is the set of nodes and E the set of edges.
  • G is a Bernoulli random graph where each edge e has an associated probability p_e.
  • edge e ∈ E exists with probability p_e, and conversely e does not exist, or is not true, with probability 1 − p_e.
  • given the edge probabilities, the states of edges are mutually independent. Nodes are static.
  • the network reliability R(G, Q) of G is defined as the probability that Q is connected, i.e., that any node in Q can be reached from any other node in Q.
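  • The definition of network reliability can be made concrete with a minimal Monte Carlo sketch in Python: each edge is realized independently with its probability, and the fraction of realizations in which all query nodes are connected estimates the reliability. The function names, the example graph and the sample count are hypothetical; this is a plain estimator, not the Path Covering algorithm itself.

```python
import random


def query_nodes_connected(query_nodes, edges):
    """Check with union-find whether all query nodes share one component."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(q) for q in query_nodes}) == 1


def estimate_reliability(edge_probs, query_nodes, samples=10000, seed=0):
    """Estimate the probability that the query nodes are connected.

    edge_probs maps an edge (u, v) to its existence probability p_e.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        realized = [e for e, p in edge_probs.items() if rng.random() < p]
        hits += query_nodes_connected(query_nodes, realized)
    return hits / samples


# Hypothetical graph: a two-edge path s-a-t in parallel with a direct edge s-t.
EDGES = {("s", "a"): 0.9, ("a", "t"): 0.9, ("s", "t"): 0.5}
ESTIMATE = estimate_reliability(EDGES, {"s", "t"}, samples=20000, seed=1)
```

  • For this small graph the exact two-terminal reliability is 1 − (1 − 0.81)(1 − 0.5) = 0.905, which the estimate should approach as the number of samples grows.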
  • a Monte-Carlo sampling based algorithm Path Covering may be applied for the two-terminal case, for example.
  • the algorithm has two phases: a path sampling phase and a subgraph construction phase.
  • in the path sampling phase, the goal is to identify a small set C of paths that have high probabilities and are relatively independent of each other.
  • PC does not maximize R(G(S)) directly, but works on its lower bound Pr(S) instead.
  • PC generates S iteratively by choosing at each iteration the path P* by the objective function, which gives the maximal per-edge increase to the (estimated) probability Pr(S), that is
  • H = G(S) is the result subgraph being constructed.
  • paths that become included into H are removed from C.
  • > B are also removed.
  • the above-introduced PC algorithm is reviewed in more detail.
  • the first phase i.e. path sampling phase
  • an iterative approach may be applied for constructing the set C of candidate paths efficiently between terminals (query nodes) s and t.
  • PC gathers a relatively small set C of candidate paths from the set of all s-t paths in G.
  • P the probability that at least one of the paths in is true.
  • the path sampling phase scales to large inputs with exponentially many paths
  • the subgraph construction phase produces a better optimized subgraph G(P) with a larger computational cost per path.
  • the most probable, or best, s-t path is used as the initial candidate path.
  • C is augmented in each iteration with a path P such that $\Pr(C \lor P)$ is approximately maximized. Let $\bar{C}$ denote the event where none of the paths in C exists. Since
  • $\Pr(C \lor P) = \Pr(C \lor (\bar{C} \land P)) = \Pr(C) + \Pr(\bar{C} \land P)$,
  • edges are realized until all paths in C have been decided, even if some paths are found to exist. Then, if some paths exist, we iteratively and greedily fail the edge e which intersects the largest number of true paths in C until no true paths remain in C. If there is more than one such edge, the one with the smallest probability p(e) may be chosen. This modification may be implemented by removing line 11 of the path sampling algorithm and adding the cut sampler (Table 2) just before line 12.
  • Table 3 discloses an exemplary algorithm for subgraph construction phase.
  • the executing element first obtains the set C of candidate paths generated in the first phase, chooses a subset $\mathcal{P} \subseteq C$ having at most B unique edges in total, and returns the subgraph $G(\mathcal{P}) \subseteq G$ induced by them.
  • the objective is to choose the set of paths $\mathcal{P}$ such that the reliability $R(G(\mathcal{P}))$ is maximized.
  • Exhaustive search and evaluation of all feasible subsets is often intractable, even though the number of candidate paths in C is assumed to be relatively small.
  • $\Pr(\mathcal{P}) = \Pr(\bigvee_{P \in \mathcal{P}} P)$ is maximized instead. It is a lower bound of $R(G(\mathcal{P}))$ and is easier to evaluate, but still requires exponential time in the worst case.
  • the path selection problem reduces to an instance of a specialized SET COVER problem (hence the name PATH COVERING), where the goal is to choose a set of paths $\mathcal{P}$ such that $|C(\mathcal{P})|$ is maximized and
  • This problem differs from the ordinary SET COVER in three ways: it does not require the entire universe (the set of all positive realizations) to be covered, it is weighted (via budget B), and the weights are dynamic (different choices of paths affect the cost of individual paths).
  • the executing entity may use a greedy approach where one path is added at a time to an initially empty set $\mathcal{P}$ (lines 5-12). The best possible addition may always be chosen from C, until the budget B has been exhausted.
  • "best possible" means the path adding the most Monte Carlo realizations to the cover per edge added to $\mathcal{P}$.
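  • The greedy selection can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the algorithm of Table 3: candidate paths are given as sets of edges, Monte Carlo realizations as sets of existing edges, a path "covers" a realization when all of its edges exist in it, and the names greedy_path_cover, CANDIDATES and REALIZATIONS are hypothetical.

```python
def greedy_path_cover(candidates, realizations, budget):
    """Greedily pick paths that cover most realizations per newly added edge.

    candidates: list of frozensets of edges (candidate paths).
    realizations: list of sets of edges that exist in each sampled graph.
    budget: maximum number of unique edges in the result subgraph.
    Returns (indices of chosen paths, edge set, covered fraction).
    """
    covers = [frozenset(i for i, r in enumerate(realizations) if path <= r)
              for path in candidates]
    remaining = set(range(len(candidates)))
    chosen, edges, covered = [], set(), set()
    while remaining:
        # Paths already contained in the subgraph cost nothing: absorb them.
        for i in [i for i in sorted(remaining) if candidates[i] <= edges]:
            chosen.append(i)
            covered |= covers[i]
            remaining.discard(i)
        # Drop paths that would exceed the edge budget.
        remaining = {i for i in remaining
                     if len(edges | candidates[i]) <= budget}
        if not remaining:
            break

        def gain(i):
            # New realizations covered per new edge added.
            return len(covers[i] - covered) / len(candidates[i] - edges)

        best = max(sorted(remaining), key=gain)
        if not covers[best] - covered:
            break  # no remaining path improves the cover
        chosen.append(best)
        edges |= candidates[best]
        covered |= covers[best]
        remaining.discard(best)
    return chosen, edges, len(covered) / len(realizations)


CANDIDATES = [frozenset({"e1", "e2"}), frozenset({"e3"}),
              frozenset({"e1", "e4", "e5"})]
REALIZATIONS = [{"e1", "e2", "e3"}, {"e3"}, {"e1", "e2"}, {"e1", "e4", "e5"}]
CHOSEN, SUBGRAPH_EDGES, COVERAGE = greedy_path_cover(CANDIDATES, REALIZATIONS, 3)
```

  • With a budget of three edges the sketch first takes the single-edge path (two realizations per edge) and then the two-edge path, while the three-edge path no longer fits the budget.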
  • the algorithm extracts a set of trees from the original graph G.
  • Each of the trees connects the given k query nodes; by construction, they are spanning trees having the query nodes as leaves.
  • these trees are used as building blocks to construct the result of the algorithm just like PC uses paths as its building blocks.
  • the algorithm outputs a set C of trees such that each tree connects the query nodes. These trees are called herein candidate trees.
  • C is used as an input in the second phase of the algorithm, hereinafter Algorithm 2 (see Table 5).
  • Edge sampling The algorithm is stochastic. At each iteration round, it randomly decides, according to the probabilities pe, which edges exist and which do not (Line 5). Only edges that are included in at least one candidate tree are decided. All other edges are considered to exist. The next step is to determine if any of the previous candidate trees exist in the current graph realization (Line 9). If one does not exist a new candidate tree is generated (Lines 14-15). If a previously discovered tree exists, the first such tree is taken into examination (Line 10). If the tree is complete, the algorithm proceeds directly to the next iteration. Otherwise the tree is extended (Lines 17-21) before continuing to the next iteration.
  • Tree construction a new tree is formed by the best path connecting two query nodes (Line 15).
  • a previously established incomplete tree is extended by connecting a new query node to it with the best path between some node in the tree and the new query node (Lines 18-20).
  • the probabilities of all edges in the tree are set to 1 prior to the search of the best path (Line 17), while the probabilities of other edges remain the same.
  • the new branch is formed by the best path between the new query node and the tree. Edges that do not exist at the iteration are not used. All edge weights are set to their original values before proceeding to the next iteration (Line 21).
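  • The tree extension step can be sketched in Python as below. Instead of temporarily setting the probabilities of tree edges to 1, the sketch starts a shortest-path search simultaneously from all tree nodes at distance 0, which is assumed here to be an equivalent formulation for a connected tree; the function name extend_tree, the data layout and the example graph are hypothetical. Edge probabilities are assumed to be positive and self-loops absent.

```python
import heapq
import math


def extend_tree(probs, realized, tree_nodes, new_query):
    """Return the edges of the best branch joining new_query to the tree.

    probs: dict mapping frozenset({u, v}) to the edge probability.
    realized: set of edge keys that exist in the current realization;
              edges outside this set are not used.
    tree_nodes: nodes of the incomplete tree.
    """
    adjacency = {}
    for edge in realized:
        u, v = tuple(edge)
        adjacency.setdefault(u, []).append((v, edge))
        adjacency.setdefault(v, []).append((u, edge))
    # Multi-source search: every tree node starts at distance 0.
    dist = {n: 0.0 for n in tree_nodes}
    prev = {}
    heap = [(0.0, n) for n in sorted(tree_nodes)]
    heapq.heapify(heap)
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == new_query:
            break
        for v, edge in adjacency.get(u, []):
            nd = d - math.log(probs[edge])
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = (u, edge)
                heapq.heappush(heap, (nd, v))
    if new_query not in dist:
        return None  # the query node cannot be reached in this realization
    branch, node = [], new_query
    while node in prev:
        node, edge = prev[node]
        branch.append(edge)
    return branch


def E(u, v):
    """Helper building an undirected edge key."""
    return frozenset((u, v))


# Hypothetical tree on nodes s, a, t and a new query node q.
PROBS = {E("a", "q"): 0.8, E("s", "q"): 0.5, E("t", "b"): 0.9, E("b", "q"): 0.95}
BRANCH_ALL = extend_tree(PROBS, set(PROBS), {"s", "a", "t"}, "q")
BRANCH_CUT = extend_tree(PROBS, set(PROBS) - {E("t", "b")}, {"s", "a", "t"}, "q")
```

  • When all edges exist, the branch via node b (0.9 × 0.95 = 0.855) is chosen; when the edge between t and b does not exist in the realization, the direct edge from a (0.8) is used instead.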
  • Discovering strong trees The collection C of candidate trees is organized in a queue, i.e., the oldest candidate trees are always considered first. This drives the algorithm to complete some trees first (the oldest ones) rather than extending them in random order and not necessarily up to a completion.
  • the stochasticity of the algorithm favors strong trees: they are more likely to be true at any given iteration and thus more likely to be extended.
  • the algorithm also has a tendency to avoid similar trees: when two (partial) trees are true at the same time, only the oldest one is potentially extended.
  • Stopping condition The number
  • the number of iterations would be another alternative. Using the number of trees seems a better choice than using the number of iterations, since the minimum number of iterations needed to produce a single complete tree increases when the number of query nodes increases.
  • a large index of various interlinked public biological databases such as EntrezGene, UniProt, InterPro, GO, and STRING, was constructed in accordance with the principles of the present invention described herein.
  • the index enables representing and viewing the contents of the aforesaid databases as a large, heterogeneous biological graph. Nodes in this graph represent biological entities (records) in the original databases, and edges represent their annotated relationships. Edges have weights interpreted as probabilities.
  • the proposed method was evaluated using six source graphs of varying sizes (Table 6) and a set of up to ten query nodes. They were obtained as follows.
  • the largest subgraph consisting of approximately 5000 edges and 1500 nodes, was retrieved from the index database using Crawler, a proprietary subgraph retrieval component described in more detail hereinafter.
  • Crawler a proprietary subgraph retrieval component described in more detail hereinafter.
  • the query node identifiers are EntrezGene:348, EntrezGene:29244, EntrezGene:6376, EntrezGene:4137, UniProt:P51149, UniProt:Q91ZX7, EntrezGene:14810, UniProt:P49769, EntrezGene:11810, and UniProt:P98156.
  • Third, smaller subgraphs were retrieved with Crawler by a sequence of subgraph retrievals, always extracting the next smaller subgraph from the previous subgraph, using the ten query nodes given above.
  • Crawler The subgraph retrieval component of the suggested indexing system, "Crawler”, was used to extract the source graphs, and it will also be used below in a comparative experiment to assess the effectiveness of the proposed algorithm.
  • Stopping condition: The number of complete candidate trees generated was used as the stopping condition for Algorithm 1. Another alternative would have been the number of iterations. Neither condition is perfect: for instance, the number of query nodes has a strong effect on the number of trees needed to find a good subgraph. On the other hand, the number of query nodes also has a strong effect on the number of iterations needed to produce a sufficient number of trees.
  • a single fixed number of candidate paths is a suitable stopping condition for the two-terminal case but it is problematic in the k-terminal case where the building blocks are trees consisting of multiple branches. For the experiments conducted, however, a fixed number of candidate trees gave a fair impression of the performance of the method.
  • Results At 402 of Figure 4, it is illustrated how the proposed method succeeded in extracting a reliable subgraph.
  • a subgraph of relatively low number of edges such as 20-30 edges, managed to capture about 80% of the reliability of the original source graph of 500 edges.
  • the problem seems more challenging.
  • Larger subgraphs are preferably needed for a larger number of query nodes, if the reliability is to be preserved.
  • the number of candidate trees produced in the first phase of the algorithm has an effect on the reliability of the extracted subgraph, but sampling a relatively small number of trees is enough to produce good subgraphs (approximately 50 trees for four query nodes; results not shown).
  • An experimental analysis of the running time indicates that the method scales linearly with respect to the number of candidate trees generated.
  • the scalability of the proposed algorithm to large source graphs may be considered as superior to previous methods.
  • Source graphs of thousands of edges may be handled within a second or two.
  • Scalability is close to linear, which was expected: the running time of the algorithm is dominated by Monte Carlo simulation, whose complexity grows linearly with respect to the input graph size and the number of iterations. Limiting the length of tree branches may shorten the running times in some cases.
  • the relative difference in reliability is less than 20% in all cases, emphasizing the ability of the algorithm to preserve strong connectivity between the query nodes.
  • the proposed method was briefly compared against the aforesaid Crawler using four query nodes.
  • the proposed method reached 80% of the original reliability with only 30 edges whereas the Crawler needed 60 edges for the same.
  • one or more generic sets performing reliably over a wider range of input graphs and query nodes may be determined.
  • the solution may be further tailored for use with directed graphs.
  • Figure 5a illustrates an embodiment of finding representative nodes.
  • the arrangement 104 may be configured to extract and/or (visually) highlight a number of representative concepts (nodes) in a subject graph, such as the original index graph or a search result subgraph, so that the representative concepts are (maximally) relevant to the search, but mutually (maximally) different.
  • a compact and at the same time extensive representation may be obtained and visualized.
  • Identification of few representative nodes may be considered as one applicable approach to help users make sense of large graphs.
  • the visualization/perception problems typically start already with only dozens of nodes.
  • link discovery Given a large number of predicted links, it would be useful to present only a small number of representative ones to the user.
  • the representatives could be used to abstract a large set of nodes, e.g. all nodes fulfilling some user-specified criteria of relevance, into a smaller but representative sample.
  • wet lab techniques are often used for identifying numerous genes, proteins, and/or something else as potentially interesting, e.g., by the statistical significance of their expression, or association with a phenotype such as disease. Finding representative genes among the potentially interesting ones would be useful in several ways. First, it could be used to remove redundancy, when several genes are closely related and showing all of them adds no value. Second, representatives might be helpful in identifying complementary or alternative components in biological mechanisms.
  • Probability of a path: Given a path P consisting of edges e1,...,ek, the probability p(P) of the path may be defined as the product p(e1)·...·p(ek), as mentioned hereinearlier. This corresponds to the probability that the path exists, i.e., that all of its edges exist. Probability of the best path: Given two nodes u, v ∈ V, a measure of their connectedness or similarity may be defined as the probability of the best path connecting them: s(u, v) = max p(P), where the maximum is taken over all paths P connecting u and v.
  • this is not necessarily the path with the least number of edges.
  • this kind of similarity function s( ) may be used for finding representatives.
  • finding representatives in networks incorporates clustering the given nodes, using the similarity measure defined above, and then selecting a representative from each cluster.
  • the method execution can be characterized as follows:
  • the aim is to have representatives that are similar to the nodes they represent (i.e., to other members of the cluster), and also to have diverse representatives (from different clusters).
  • clustering k-medoids or hierarchical clustering may be applied, for example.
  • k-medoids is similar to the k-means method, but better suited for clustering nodes in a graph. Given k, the number of clusters to be constructed, the k-medoids method iteratively chooses cluster centers (medoids) and assigns all nodes to the cluster identified by the nearest medoid.
  • instead of using the mean value of the objects within a cluster as the cluster center, k-medoids uses the best object as a cluster center. This is a practical approach when working with graphs, since there is no well-defined mean for a set of nodes. The k-medoids method also immediately gives the representatives. For very large graphs, a straightforward implementation of k-medoids is not necessarily the most efficient. Tools to facilitate faster clustering may be utilized.
  • an embodiment of the method may proceed as follows. First, the index database may be queried for a graph G of at most e.g. 1000 (or other predetermined number) nodes cross-connecting nodes in S as strongly as possible. The pairwise similarities between nodes may be then calculated as the best path probabilities in G.
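  • A straightforward k-medoids iteration on pairwise similarities can be sketched in Python as follows. The similarity function would in practice return best path probabilities computed in G; here a hypothetical toy function is used, and the names k_medoids, assign and similarity, as well as the choice of initial medoids, are assumptions made for illustration only.

```python
def assign(nodes, medoids, sim):
    """Assign every node to the cluster of its most similar medoid."""
    clusters = {m: [m] for m in medoids}
    for n in nodes:
        if n not in clusters:
            clusters[max(medoids, key=lambda m: sim(n, m))].append(n)
    return clusters


def k_medoids(nodes, initial_medoids, sim, max_iter=100):
    """Alternate between assigning nodes and re-choosing the medoids.

    The medoid of a cluster is the member with the largest total
    similarity to the other members. Returns {medoid: members}.
    """
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(nodes, medoids, sim)
        updated = [
            max(members,
                key=lambda c: sum(sim(c, o) for o in members if o != c))
            for members in clusters.values()
        ]
        if updated == medoids:
            break
        medoids = updated
    return assign(nodes, medoids, sim)


def similarity(u, v):
    """Hypothetical similarity: nodes with the same first letter are close."""
    return 0.9 if u[0] == v[0] else 0.1


NODES = ["a1", "a2", "a3", "b1", "b2", "b3"]
CLUSTERS = k_medoids(NODES, ["a1", "a2"], similarity)
```

  • Even with both initial medoids taken from the same group, the iteration moves one medoid to the other group, and the medoids returned serve directly as the representatives.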
  • the genes belong to three known groups, each group of three genes being associated to the same phenotype.
  • the three OMIM phenotypes used in the example are a pigmentation phenotype (MIM:227220), lactase persistence (MIM:223100), and Alzheimer disease (MIM:104300).
  • Clusters (diamonds, boxes, ellipses) and representatives (double borders) 504 of nine given nodes, and some connecting nodes 506 (circles) on best paths between them. Lines represent edges between two nodes, dotted lines represent best paths with several nodes.
  • the clustering produced the expected partitioning: each gene was assigned to a cluster close to its corresponding phenotype with the exception of EntrezGene:1627.
  • the three representatives (medoids) are genes assigned to different phenotypes. Hence, the medoids can be considered representative for the nine genes.
  • Alternatively, hierarchical clustering could be applied. The k-medoids approach may discover star-shaped clusters, where cluster members are connected mainly through the medoid. To give more weight to cluster coherence, we may use the average linkage method, as follows. In the practical implementation, we again start by querying the index database for a graph G of at most 1000 nodes connecting the given nodes S, and compute similarities of nodes in S as the probabilities of the best paths connecting them in G. The hierarchical clustering proceeds in the standard, iterative manner, starting with each node in a cluster of its own.
  • those two clusters are merged that give the best merged cluster as a result, measured by the average similarity of nodes in the merged cluster.
  • the clustering is finished when exactly k clusters remain. After the clusters have been identified, a medoid may be found in each cluster (as in the k-medoids method) and be returned as a representative.
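  • The average linkage procedure described above can be sketched in Python as follows. The similarity function is a hypothetical stand-in for best path probabilities, and the names average_similarity and hierarchical_representatives are assumptions; the sketch is a direct quadratic implementation, not an optimized one.

```python
from itertools import combinations


def average_similarity(cluster, sim):
    """Average pairwise similarity of the nodes in a cluster (size >= 2)."""
    pairs = list(combinations(cluster, 2))
    return sum(sim(u, v) for u, v in pairs) / len(pairs)


def hierarchical_representatives(nodes, sim, k):
    """Merge clusters until k remain, then return (medoid, members) pairs."""
    clusters = [[n] for n in nodes]
    while len(clusters) > k:
        # Merge the pair whose union has the best average similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_similarity(
                clusters[ij[0]] + clusters[ij[1]], sim),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]

    def medoid(cluster):
        return max(cluster,
                   key=lambda m: sum(sim(m, o) for o in cluster if o != m))

    return [(medoid(c), c) for c in clusters]


def similarity(u, v):
    """Hypothetical similarity: nodes with the same first letter are close."""
    return 0.9 if u[0] == v[0] else 0.1


RESULT = hierarchical_representatives(["a1", "a2", "b1", "b2", "a3"],
                                      similarity, 2)
```

  • Each returned pair holds the representative (medoid) and the members of one of the k final clusters.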
  • Figure 5b illustrates an embodiment of finding and visualizing nodes (and associated target subgraph(s)) deemed as particularly relevant to the query nodes included in a weighted source graph.
  • a user may initially specify some query nodes (concepts), whereupon the arrangement is configured to identify a number of other nodes (concepts) that are relevant with respect to the query nodes, but still preferably non-redundant with respect to each other.
  • a user who wants to know how query nodes, such as Barcelona 510 and Helsinki 512, are related might already know that Barcelona is a city in Spain, that Helsinki is the capital of Finland, and that both Spain and Finland are in Europe and also are members of the European Union (EU).
  • the user might not have known that architects Antoni Gaudi (who lived in Barcelona) and Alvar Aalto (who lived in Helsinki) both have exhibited at a world's fair (also known as Expo).
  • Given Barcelona and Helsinki as query nodes, the goal is to identify a non-redundant set of nodes relevant to both cities.
  • a relevant node typically is a central node for the (indirect) relation between the query nodes.
  • a non-redundant set of relevant nodes then highlights distinct relations between the query nodes. For more than two query nodes, we may consider a node to be more relevant if it is relevant to all the query nodes, i.e., if it helps to connect all of them. For a single query node the relevance may be directly related to proximity to the query node and may seem less interesting. However, finding a set of non-redundant nodes may help in giving a summary of the neighborhood. For Barcelona, for instance, a relatively non-redundant set of relevant nodes consists of Spain, FC Barcelona, and Antoni Gaudi.
  • a weight associated with an edge e is its probability p(e) (or can be at least interpreted as a probability): edge e exists with probability p(e), and conversely e does not exist, or is not true, with probability 1 − p(e). Edges are assumed mutually independent. Probability of a path may be still considered as the product of associated edge probabilities. Best path may maximize the path probability between two query nodes (refer to the equation 5 set forth hereinearlier, for example). Probability of the best path may be utilized as a natural measure of the nodes' proximity (s(·)).
  • the relevance of node u ∈ V with respect to a single query node q is simply s(u, q). In other words, relevance directly depends on proximity.
  • for several query nodes, the relevance of u is the product of its proximities to the query nodes: rel(u) = ∏ s(u, q), the product being taken over the query nodes q.
  • nodes may have equal relevances. This happens, in particular, when there are two query nodes: all nodes on the best path between the query nodes have identical relevance, equal to the probability of the best path. For instance, Spain, EU and Finland all have relevance 0.4608 in the figure. Nodes that are roughly equally far from each query node in terms of relevance (proximity) could be preferred. Such ties may thus be solved by giving highest priority to nodes with the smallest sum of squared proximities to the query nodes, i.e., by ranking the tied nodes u in ascending order by Σ s(u, q)², the sum being taken over the query nodes q.
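  • The relevance measure and the tie-breaking rule can be sketched in Python as follows. The proximity values and node names are hypothetical and not taken from the figure; relevance is assumed to be the product of proximities to the query nodes, and ties are resolved by the smallest sum of squared proximities, as stated in the text. Relevances are rounded before comparison so that floating-point noise does not hide a tie.

```python
import math

# Hypothetical proximities s(u, q) of three nodes to two query nodes q1, q2.
PROXIMITY = {
    "X": {"q1": 0.8, "q2": 0.45},
    "Y": {"q1": 0.6, "q2": 0.6},
    "Z": {"q1": 0.9, "q2": 0.2},
}


def relevance(u, query_nodes, proximity=PROXIMITY):
    """Relevance of u: product of its proximities to the query nodes."""
    return math.prod(proximity[u][q] for q in query_nodes)


def sum_of_squares(u, query_nodes, proximity=PROXIMITY):
    """Tie-breaker: smaller values mean a more balanced position."""
    return sum(proximity[u][q] ** 2 for q in query_nodes)


def rank_by_relevance(nodes, query_nodes):
    """Sort by descending relevance, ties by ascending sum of squares."""
    return sorted(
        nodes,
        key=lambda u: (-round(relevance(u, query_nodes), 12),
                       sum_of_squares(u, query_nodes)),
    )


RANKING = rank_by_relevance(["X", "Y", "Z"], ["q1", "q2"])
```

  • Nodes X and Y have the same relevance (0.36), but Y is equally close to both query nodes and therefore has the smaller sum of squares (0.72 versus 0.8425), so it is ranked first.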
  • Figure 5c illustrates an extreme case.
  • the provided arrangement may further support negative search terms (concepts), i.e.
  • Negative query nodes may be applied in the definition of relevance, to specify which neighborhoods are less relevant. Noting that redundancy between nodes is based on a similar effect of repellence as that of negative query nodes, it is proposed, as one feasible option, to treat these effects technically in the same way.
  • the result is a relatively simple function that tries to find a balance between relevance with respect to positive query nodes, avoidance of negative query nodes, and mutual non-redundancy of nodes in the result, by treating them all in a somewhat uniform manner.
  • the relevance in the presence of a negative query node may be defined as the reverse of their mutual proximity: the relevance of node u ∈ V in the presence of a negative query node q ∈ V is 1 − s(u, q).
  • a query may be converted into a task of finding a diverse set of relevant nodes.
  • the problem may be determined formally.
  • QP positive
  • QN negative
  • An alternative formulation that may be preferable in practice, is to rank the given nodes in V instead.
  • the goal would be that any top k nodes would constitute a good solution to the original problem, whatever the value of k is. The user could then explore the top results and set the cut-off after seeing the results. At the same time it is obvious that with a single ranking the solutions cannot be optimal for all values of k.
  • the first form highlights how relevance and non-redundancy are defined independently.
  • the second form shows, in turn, how relevances and redundancies are handled in a uniform way. For example, consider Barcelona and Helsinki as positive query nodes and Europe as a negative query node. A set R of k nodes that maximizes the relevance and non-redundancy measure (equation 11) would be {Jari Litmanen, world's fair, architect}, where the nodes are relevant to both Barcelona and Helsinki, irrelevant to Europe, and non-redundant with respect to each other. Regarding the potential algorithms for finding relevant and non-redundant nodes, two examples are provided hereinafter.
  • Algorithm 1 see Table 7, produces a ranked list of nodes in an incremental, greedy fashion. In each iteration, it finds the currently most representative node (Line 3) and outputs it. In case of a tie, the node minimizing Equation 4 is prioritized (Line 7).
  • One key idea of the algorithm is to treat nodes selected and output during previous iterations as negative query nodes. This leads to the desired property that the i-th node output by Algorithm 1 is non-redundant with respect to the first i − 1 nodes already output. In each iteration, Algorithm 1 actually makes an optimal choice with respect to Equation 11, given the previously selected nodes. As a preprocessing step, it first computes all proximities s(·) in a single batch (Line 1).
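  • The incremental ranking idea can be sketched in Python as follows. This is not the listing of Table 7: the scoring function is assumed, for illustration, to be the product of proximities s(u, q) to positive query nodes and of 1 − s(u, q) to negative query nodes and to nodes already output, and the names rank_nodes, PROX and s as well as all proximity values are hypothetical.

```python
import math

# Hypothetical symmetric proximities; missing pairs are treated as 0.
PROX = {
    frozenset(("Q", "A")): 0.9,
    frozenset(("Q", "A2")): 0.85,
    frozenset(("Q", "B")): 0.6,
    frozenset(("A", "A2")): 0.95,
    frozenset(("A", "B")): 0.1,
    frozenset(("A2", "B")): 0.1,
}


def s(u, v):
    """Proximity of two nodes (1.0 for a node with itself)."""
    return 1.0 if u == v else PROX.get(frozenset((u, v)), 0.0)


def rank_nodes(candidates, positives, negatives, top=None):
    """Greedy ranking: each output node repels later candidates.

    Previously selected nodes are treated like negative query nodes, so
    the i-th node is non-redundant with respect to the first i - 1 nodes.
    """
    selected = []
    pool = list(candidates)
    while pool and (top is None or len(selected) < top):
        repel = list(negatives) + selected

        def score(u):
            attract = math.prod(s(u, q) for q in positives)
            avoid = math.prod(1.0 - s(u, q) for q in repel)
            return attract * avoid

        # Ties: prefer the smallest sum of squared proximities.
        best = max(pool, key=lambda u: (round(score(u), 12),
                                        -sum(s(u, q) ** 2 for q in positives)))
        selected.append(best)
        pool.remove(best)
    return selected


RANKED = rank_nodes(["A", "A2", "B"], ["Q"], [])
RANKED_AVOIDING_A = rank_nodes(["A", "A2", "B"], ["Q"], ["A"])
```

  • Although A2 is the second most relevant node on its own, it is nearly redundant with A and is therefore output only after the less relevant but distinct node B.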
  • Algorithm 2 in turn, produces a non-redundant set of k relevant nodes, where k is given as a parameter, see Table 8.
  • the algorithm also takes k nodes as input, used as an initial solution that is then iteratively improved in the algorithm. In each iteration, the algorithm takes one of the k nodes and replaces it by the optimal one, given the k-1 other current nodes. When no improvements can be achieved, the algorithm stops.
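  • The iterative improvement can be sketched in Python as follows. This is not the listing of Table 8: the per-node score is assumed, for illustration, to be the product of proximities to positive query nodes and of 1 − s(u, v) to negative query nodes and to the other k − 1 current nodes, and the names improve_set, PROX and s and all proximity values are hypothetical. A node is replaced only on strict improvement, and a round limit guards against endless loops.

```python
import math

# Hypothetical symmetric proximities; missing pairs are treated as 0.
PROX = {
    frozenset(("Q", "A")): 0.9,
    frozenset(("Q", "A2")): 0.85,
    frozenset(("Q", "B")): 0.6,
    frozenset(("A", "A2")): 0.95,
    frozenset(("A", "B")): 0.1,
    frozenset(("A2", "B")): 0.1,
}


def s(u, v):
    """Proximity of two nodes (1.0 for a node with itself)."""
    return 1.0 if u == v else PROX.get(frozenset((u, v)), 0.0)


def improve_set(initial, candidates, positives, negatives, max_rounds=100):
    """Replace one node at a time by the best node given the others."""
    current = list(initial)
    for _ in range(max_rounds):
        changed = False
        for i in range(len(current)):
            others = current[:i] + current[i + 1:]
            repel = list(negatives) + others

            def score(u):
                attract = math.prod(s(u, q) for q in positives)
                avoid = math.prod(1.0 - s(u, q) for q in repel)
                return attract * avoid

            pool = [u for u in candidates if u not in others]
            best = max(pool, key=score)
            if score(best) > score(current[i]) + 1e-12:
                current[i] = best
                changed = True
        if not changed:
            break  # no single replacement improves the solution
    return current


IMPROVED = improve_set(["A", "A2"], ["A", "A2", "B"], ["Q"], [])
```

  • Starting from the redundant pair A and A2, the sketch ends with the pair A and B, after which no single replacement improves the solution.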
  • G = (V, E), a weighted graph.
  • V′ ⊂ V, a set of admissible nodes.
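The replace-one-node-at-a-time loop of Algorithm 2 could look like the sketch below. Table 8 is not legible in the source, so this is an illustration under stated assumptions: `set_score` is a hypothetical set objective (relevance to positive query nodes, minus proximity to negative query nodes, minus pairwise proximity inside the set), and the names `improve`, `initial` and `admissible` are invented here to mirror the inputs described above.

```python
def prox(s, u, v):
    """Symmetric lookup of a precomputed proximity; unknown pairs count as 0."""
    return s.get((u, v), s.get((v, u), 0.0))


def set_score(s, members, positives, negatives):
    """Assumed objective: relevance minus irrelevance minus internal redundancy."""
    relevance = sum(prox(s, u, q) for u in members for q in positives)
    irrelevance = sum(prox(s, u, n) for u in members for n in negatives)
    redundancy = sum(prox(s, u, v)
                     for i, u in enumerate(members) for v in members[i + 1:])
    return relevance - irrelevance - redundancy


def improve(s, initial, admissible, positives, negatives, eps=1e-12):
    """Iteratively replace one node by the best admissible alternative."""
    current = list(initial)
    improved = True
    while improved:
        improved = False
        for i in range(len(current)):
            others = current[:i] + current[i + 1:]
            # Candidates for slot i: any admissible node not already in the set.
            options = [v for v in admissible if v not in others]

            def value(v):
                return set_score(s, others + [v], positives, negatives)

            best = max(options, key=value)
            # Accept only a strict gain, so the loop is guaranteed to stop.
            if value(best) > value(current[i]) + eps:
                current[i] = best
                improved = True
    return current


# Toy proximities (illustrative values only).
toy = {("a", "q"): 0.9, ("b", "q"): 0.8, ("c", "q"): 0.5,
       ("a", "b"): 0.7, ("b", "c"): 0.1}
print(sorted(improve(toy, ["a", "b"], ["a", "b", "c"], ["q"], [])))  # -> ['a', 'c']
```

Each accepted replacement strictly increases the objective, and there are finitely many k-node sets, so the loop terminates. As with any local search, the result depends on the initial solution and need not be globally optimal.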
  • finding relevant and non-redundant target nodes relative to query nodes may form the main objective of the query, whereas the extracted subgraph then mainly serves for visualizing the target nodes' relationships with the query nodes.
  • Figure 6 discloses a flow diagram of an embodiment of a method in accordance with the present invention.
  • the applied arrangement and the associated device(s) thereof and optionally connected thereto, such as a single, self-contained device, a terminal device and a server device, or a server entity comprising a plurality of at least functionally interconnected devices, may be obtained and configured, for example, via installation and execution of related software.
  • data is obtained from a number of, typically a plurality of, data sources such as databases for constructing an index.
  • the index is created by preferably establishing a weighted (e.g.
  • Actions behind items 604 and 606 may be later at least partially re-executed upon need, e.g. in connection with updating the data in the data source(s) and/or adding/modifying/removing data relative to the index.
  • Data retrieval from each data source may be executed in a timed manner, for example.
  • the user and/or operator of the index may update data therein.
  • Supplementary data that is not common to all uses such as personal annotations relative to graph information like nodes, edges, or complete (sub-)graphs may be stored with reference to the index so that the arrangement may manage and visualize the user data flexibly together with the generally available data, if needed.
  • A dotted horizontal line is used in the figure to highlight the logical division between the preparatory actions and the actual subgraph determination.
  • a query is received from the user and converted into a format applicable for traversing the index using a number of suitable methods as reviewed hereinbefore.
  • the queried concepts (query terms) provided by the user may be directly suitable for association with the query nodes, or optional logics may be first used according to the guidelines set forth in this text, for example. E.g. misspelling, synonym, and general availability checks may be executed.
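The optional checks applied when associating query terms with query nodes (synonym, availability and misspelling checks) might be sketched as below. The function name `resolve_term`, the label and synonym dictionaries, the node identifiers and the similarity cutoff are all assumptions made for illustration. Only `difflib.get_close_matches` from the Python standard library is relied upon, and the source does not state which matching technique the arrangement uses.

```python
import difflib


def resolve_term(term, node_labels, synonyms, cutoff=0.8):
    """Map a user-supplied query term to a node identifier, or None.

    node_labels: dict from lower-case label to node identifier (hypothetical).
    synonyms:    dict from lower-case synonym to canonical label (hypothetical).
    """
    key = term.strip().lower()
    key = synonyms.get(key, key)        # synonym check
    if key in node_labels:              # availability check
        return node_labels[key]
    # Misspelling check: closest known label above the similarity cutoff.
    close = difflib.get_close_matches(key, list(node_labels), n=1, cutoff=cutoff)
    return node_labels[close[0]] if close else None


labels = {"brca1": "node-1", "breast cancer": "node-2"}
aliases = {"breast carcinoma": "breast cancer"}
for query in ["BRCA1", "breast carcinoma", "brest cancer", "zzz"]:
    print(query, "->", resolve_term(query, labels, aliases))
```

Terms that resolve to `None` would be the ones to report back to the user, for instance with a suggestion list, before the query is run against the index.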
  • At 610 at least one subgraph fulfilling the search criteria is determined.
  • the emphasis may be on finding the relationships between query nodes or on finding the relevant and non-redundant target nodes relative to the query node(s), for example. Nevertheless, the relevant subgraph(s) is advantageously provided as output.
  • the output is visualized to the user preferably in conjunction with the provision of interactive controls enabling the user to change the visualization details (zoom, angle of view, rotation, panning, shown node/edge details, etc.), revise query details, add/remove/change visualized information, and/or export result data, for example.
  • the arrangement controls the visualization by providing related visualization data, e.g. (graphical) image data or related instructions, to a local display device or an external client device comprising or being connected with a display, for instance.
  • one or more visualization aspects may be client terminal/display device -specific, whereupon the arrangement and/or the display device itself may adapt the display data according to these aspects such as display properties, e.g. maximum or preferred resolution.
  • the broken loop-back arrow depicts the potentially repetitive nature of the various method items. Multiple queries may be sequentially or even simultaneously served depending on the computational power and memory capability of the executing device, and the index database may be flexibly updated at intervals or not until actual need, for instance.
  • the arrangement may be configured to provide at least one additional feature selected from the group consisting of: user or user group-specific log-in functionality, support for importing user data to be included in the subgraph extraction and/or related analysis, store functionality for personal (search) history and/or results, annotation and/or editing functionality of history data, automatic notices such as e-mail notices relative to predetermined features like availability of new information in view of a query stored by a user (e.g. the query result could/would change), publication of results to the public or selected parties, optionally with editing and/or execution rights, using a selected domain such as a predetermined server like the index server, team work support (e.g. result, annotation and/or settings adoption between team members), and social networking support (e.g. discussion forum, messaging, and/or contact board).
  • the arrangement 104 may be configured to estimate, in addition to or instead of relationship strength (e.g. path strength) between two or more (query) nodes, also the unexpectedness of such a relationship as mentioned hereinbefore.
  • the resulting subgraph might be adapted so as to include and/or highlight nodes whose connection is strong but not evident in the light of the used criteria such as link types. For instance, it is not generally surprising that a gene may code a protein, but some other link type could imply a less obvious relationship instead. Text mining of available data sources, e.g. literature on the domain, may be further used to determine the obviousness of connections. Accordingly, finding truly surprising information and associations could be facilitated.
  • the obtained subgraph may be generally determined or modified so as to exclude uninteresting and/or obvious border nodes/edges according to a number of predetermined criteria.
  • An interesting path may include an obvious intermediate sub-path, which should advantageously still remain in the subgraph for clarity/visualization purposes due to its role as a connecting entity between the interesting entities, but e.g. an obvious sub-path at the end of an interesting path could be omitted from the subgraph and/or visualization thereof.
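The pruning rule just described, keeping obvious intermediate sub-paths but dropping obvious sub-paths at the ends of an interesting path, can be illustrated with a short sketch. The edge representation `(source, target, is_obvious)`, the function name `trim_obvious_ends` and the `protected` parameter (for example, query nodes that must remain visible) are assumptions of this example. How obviousness is decided is left open here, as the text only names link types and text mining as possible criteria.

```python
def trim_obvious_ends(path, protected=frozenset()):
    """Drop obvious edges from both ends of a path, keeping interior ones.

    path: list of (source, target, is_obvious) edges in path order.
    protected: nodes that must stay in the result (hypothetical parameter).
    """
    start, end = 0, len(path)
    # Trim leading obvious edges unless they start at a protected node.
    while start < end and path[start][2] and path[start][0] not in protected:
        start += 1
    # Trim trailing obvious edges unless they end at a protected node.
    while end > start and path[end - 1][2] and path[end - 1][1] not in protected:
        end -= 1
    return path[start:end]


example = [("a", "b", True), ("b", "c", False), ("c", "d", True),
           ("d", "e", False), ("e", "f", True)]
print(trim_obvious_ends(example))
# -> [('b', 'c', False), ('c', 'd', True), ('d', 'e', False)]
```

The obvious edge between `c` and `d` survives because it connects two interesting edges, whereas the obvious edges at either end are removed.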

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Molecular Biology (AREA)
  • Physiology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an electronic arrangement, such as a computer apparatus, and a related method for finding relationships among data. The proposed solution comprises: receiving data records from a plurality of data sources, such as databases relating to a predetermined application domain such as a biological or biomedical domain (604); indexing data records contained in said plurality of data sources, said index also comprising associations between said records based on the indications in the data sources, the utilized data model comprising a weighted graph with a plurality of nodes and edges, wherein a node corresponds to one or more aggregated data records relating to the same concept and an edge represents a relationship between two nodes, an edge being associated with a weight (606); receiving a user-defined query defining a number of concepts of the domain and associating the concepts with the corresponding query node(s) of the graph (608); determining a subgraph from the graph by optimally associating the query node(s) with a number of other nodes according to a number of predetermined criteria, utilizing the weights and a predetermined subgraph extraction technique (610); and constructing a graphical visualization of the subgraph to be illustrated on a display, wherein said one or more query nodes, a number of other nodes and related edges of the subgraph are illustrated so as to facilitate finding, verifying and understanding indirect associations among said one or more query nodes and/or associations between one or more graph elements, such as a number of other nodes, and said one or more query nodes (612), such as verifying a hypothesis between a phenotype and a related gene.
PCT/FI2010/050441 2010-05-31 2010-05-31 Arrangement and method for finding relationships among data WO2011151500A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/FI2010/050441 WO2011151500A1 (fr) 2010-05-31 2010-05-31 Arrangement and method for finding relationships among data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/FI2010/050441 WO2011151500A1 (fr) 2010-05-31 2010-05-31 Arrangement and method for finding relationships among data

Publications (1)

Publication Number Publication Date
WO2011151500A1 true WO2011151500A1 (fr) 2011-12-08

Family

ID=45066226

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2010/050441 WO2011151500A1 (fr) 2010-05-31 2010-05-31 Arrangement and method for finding relationships among data

Country Status (1)

Country Link
WO (1) WO2011151500A1 (fr)

Non-Patent Citations (12)

* Cited by examiner, † Cited by third party
Title
HINTSANEN P ET AL.: "Finding Reliable Subgraphs from Large Probabilistic Graphs", DATA MINING AND KNOWLEDGE DISCOVERY, vol. 17, no. 1, 2008, pages 3 - 23, Retrieved from the Internet <URL:http://www.cs.helsinki.fi/research/discovery> [retrieved on 20110303] *
HINTSANEN P: "The most reliable subgraph problem", PROCEEDINGS OF THE 11TH EUROPEAN CONFERENCE ON PRINCIPLES AND PRACTICE OF KNOWLEDGE DISCOVERY IN DATABASES, 2007, pages 471 - 478, Retrieved from the Internet <URL:http://www.springerlink.com/content/g467551350004536> [retrieved on 20110315] *
LANGOHR L ET AL.: "Finding representative nodes in probabilistic graphs", WORKSHOP ON EXPLORATIVE ANALYTICS OF INFORMATION NETWORKS AT ECML PKDD, 65-76, September 2009 (2009-09-01), BLED, SLOVENIA, pages 65 - 76, Retrieved from the Internet <URL:http://cs.helsinki.fi/hannu.toivonen/pubs/Langohr_WEAIN-PKDD2009.pdf> [retrieved on 20110309] *
LANGOHR L: "Finding a diverse set of nodes in probabilistic graphs", SEMINAR: GRAPH MINING (3 CR), SPRING 2010, 21 May 2010 (2010-05-21), Retrieved from the Internet <URL:http://www.cs.helsinki.fi/ulhtoivone/teachinglseminarS10> [retrieved on 20110309] *
NIKRAVESH M: "Concept-based search and questionnaire system", SOFT COMPUTING, vol. 12, 2008, pages 301 - 314, Retrieved from the Internet <URL:http://www.springerlink.com/content/n77654811673qq77> [retrieved on 20110302] *
SEVON P ET AL.: "Link Discovery in Graphs Derived from Biological Databases", DATA INTEGRATION IN THE LIFE SCIENCES, LECTURE NOTES IN COMPUTER SCIENCE, vol. 4075, 2006, pages 35 - 49, Retrieved from the Internet <URL:http://www.springerlink.com/content/j615318046155613> [retrieved on 20110303] *
SEVON P ET AL.: "Subgraph queries by context-free grammar", JOURNAL OF INTEGRATIVE BIOINFORMATICS, vol. 5, no. 2, 2008, pages 100, Retrieved from the Internet <URL:http://journal.imbio.de/article.php?aid=100> [retrieved on 20110315] *
TOIVONEN H ET AL.: "A framework for path-oriented network simplification", THE NINTH INTERNATIONAL SYMPOSIUM ON INTELLIGENT DATA ANALYSIS (IDA), 19 May 2010 (2010-05-19) - 21 May 2010 (2010-05-21), TUCSON, ARIZONA, US, Retrieved from the Internet <URL:http://www.cs.helsinki.fi/ulhtoivone/pubs> [retrieved on 20110311] *
TOIVONEN H: "Biomine search engine for probabilistic graphs", 3 February 2010 (2010-02-03), Retrieved from the Internet <URL:http://videolectures.net/solomon_toivonen_bsepg> [retrieved on 20110303] *
TOMIYAMA T ET AL.: "Concept-based Web communities for Google search engine", THE 12TH IEEE INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS, 2003. FUZZ, 2003, pages 1122 - 1128, Retrieved from the Internet <URL:http://ieeexplore.ieee.org/xplslabsall.jsp?arnumber=1206589&tag=1> [retrieved on 20110302] *
WANG X ET AL.: "A Study of Methods for Negative Relevance Feedback", SIGIR '08: PROCEEDINGS OF THE 31ST ANNUAL INTERNATIONAL ACM, SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2008, NEW YORK, NY, USA, pages 219 - 226, Retrieved from the Internet <URL:http://portal.acm.org/citation.cfm?id=1390374> [retrieved on 20110311] *
ZHOU F ET AL.: "Review of Network Abstraction Techniques", WORKSHOP ON EXPLORATIVE ANALYTICS OF INFORMATION NETWORKS AT ECML PKDD, September 2009 (2009-09-01), BLED, SLOVENIA, pages 50 - 63, Retrieved from the Internet <URL:http://www.cs.helsinki.filu/htoivone/pubs> [retrieved on 20110315] *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10852456

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10852456

Country of ref document: EP

Kind code of ref document: A1