CN114201618A - Drug development literature visualization interpretation method and system - Google Patents

Publication number
CN114201618A
Authority
CN
China
Prior art keywords
entity
entities
determining
unit
node
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210147101.6A
Other languages
Chinese (zh)
Other versions
CN114201618B (en)
Inventor
丁红霞
伍星
吴忠毅
余志颖
徐更惟
李靖
李琪
廖宛玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingwei Jingwei Information Technology Beijing Co ltd
Original Assignee
Jingwei Jingwei Information Technology Beijing Co ltd
Application filed by Jingwei Jingwei Information Technology Beijing Co ltd
Priority to CN202210147101.6A
Publication of CN114201618A
Application granted
Publication of CN114201618B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for the visual interpretation of drug development literature. The method comprises the following steps: determining each text unit in a document and its position; recognizing the text units to obtain the entities they contain, wherein the entities comprise drugs, targets, indications, and companies; normalizing and merging the entities to obtain an entity table corresponding to the document, wherein the entity table comprises the entities together with their positions and frequencies of occurrence; determining the relationships between different entities in the entity table; and generating a knowledge graph corresponding to the document according to the relationships between the entities. The invention makes it easier for a user reading drug development literature to find key information and the associations within it, and provides a basis for subsequent systems to analyze and process documents in batches.

Description

Drug development literature visualization interpretation method and system
Technical Field
The invention relates to the technical field of information processing, in particular to a visualized interpretation method and system for drug research and development literature.
Background
New drug development generates a large volume of literature. This literature is typically complex and spans multiple disciplines, so interpreting it is difficult and inefficient.
Disclosure of Invention
The invention provides a visualized interpretation method and a visualized interpretation system for drug development documents, which are convenient for users to interpret the drug development documents so as to find key information and association thereof.
Therefore, the invention provides the following technical scheme:
a method of visual interpretation of drug development literature, the method comprising:
determining each text unit and the position thereof in the literature;
identifying the text unit to obtain each entity in the text unit, wherein the entities comprise: drugs, targets, indications, companies;
and carrying out normalization and merging processing on the entities to obtain an entity table corresponding to the document, wherein the entity table comprises: entities and their locations and frequency of occurrence;
determining the relationship between different entities in the entity table;
and generating a knowledge graph corresponding to the literature according to the relation between the entities.
Optionally, the determining each text unit and its position in the document includes:
determining the position of each text unit in the document according to the chapter keywords and the chapter division characteristics;
and splitting the document to obtain each text unit and the position thereof in the document.
Optionally, the chapter division features include any one or more of: keywords, fonts, font sizes, and line breaks.
Optionally, the determining a relationship between different entities in the entity table includes:
selecting one entity from the entities in the entity table as a start node, and taking the other entities as target nodes;
and determining, by a shortest path algorithm, the paths from the start node to each target node.
Optionally, the selecting a start node from the entities of the entity table includes:
determining the weight of each entity in the entity table;
and selecting the entity with the maximum weight as the starting node.
Optionally, the determining the weight of each entity in the entity table includes:
determining the type of the entity according to a knowledge base;
and determining the weight of the entity according to the type, the position and the occurrence frequency of the entity.
Optionally, the method further comprises:
querying the knowledge base to obtain associated entities of different entity nodes in the knowledge graph;
adding the associated entities to the knowledge-graph.
Optionally, the method further comprises:
and displaying the knowledge graph, and adopting different display forms for the entity node and the associated entity.
Optionally, the method further comprises:
if the document contains a structural formula picture, identifying the structural formula picture to obtain an entity corresponding to the structural formula picture;
and adding the entity corresponding to the structural formula picture into the entity table.
A system for visual interpretation of drug development literature, the system comprising:
the text unit determining module is used for determining each text unit and the position thereof in the document;
a text recognition module, configured to recognize the text unit to obtain entities in the text unit, where the entities include: drugs, targets, indications, companies;
a normalization processing module, configured to perform normalization and merging processing on the entities to obtain an entity table corresponding to the document, where the entity table includes: entities and their locations and frequency of occurrence;
the relation determining module is used for determining the relation between different entities in the entity table;
and the knowledge graph generating module is used for generating a knowledge graph corresponding to the literature according to the relation between the entities.
Optionally, the text unit determining module includes:
the position determining unit is used for determining the position of each text unit in the document according to the chapter keywords and the chapter division features;
and the splitting unit is used for splitting the literature to obtain each text unit and the position thereof in the literature.
Optionally, the relationship determination module includes:
a node determining unit, configured to select one entity from the entities in the entity table as a start node, and use the other entities as target nodes;
and the path determining unit is used for determining, by a shortest path algorithm, the paths from the start node to each target node.
Optionally, the node determining unit includes:
the weight calculation unit is used for determining the weight of each entity in the entity table;
and the node selection unit is used for selecting the entity with the largest weight as the starting node.
Optionally, the node determining unit further includes: the type determining unit is used for determining the type of the entity according to a knowledge base;
the weight calculation unit is specifically configured to determine the weight of the entity according to the type, the location, and the frequency of occurrence of the entity.
Optionally, the system further comprises:
and the query module is used for querying the knowledge base to obtain associated entities of different entity nodes in the knowledge graph and adding the associated entities into the knowledge graph.
Optionally, the system further comprises:
and the display module is used for displaying the knowledge graph and adopting different display forms for the entity nodes and the associated entities.
Optionally, the system further comprises:
the structural formula processing module is used for identifying the structural formula picture under the condition that the document contains the structural formula picture to obtain an entity corresponding to the structural formula picture;
and the normalization processing module is also used for adding the entity corresponding to the structural formula picture into the entity table.
Optionally, the structural formula processing module comprises:
the conversion module is used for converting the document into pictures;
the detection module is used for detecting the pictures and determining whether a structural formula picture exists;
the segmentation module is used for segmenting the picture, when a structural formula picture exists, to obtain each structural formula picture;
the structural formula identification module is used for identifying the structural formula picture to obtain an identification result;
the file generation module is used for generating a structure description table according to the identification result;
and the substance entity determining module is used for determining the corresponding substance entity according to the structure description table.
Optionally, the structural formula identification module includes:
the picture conversion unit is used for converting the structural formula picture into a binary image;
and the image identification unit is used for carrying out character identification and image type identification on the binary image to obtain atoms and chemical bonds corresponding to the structural formula picture.
The method and system for the visual interpretation of drug development literature provided by the embodiments of the invention first determine each text unit in a document and its position, and then recognize each text unit to obtain the entities it contains. The recognized entities are normalized and merged to obtain an entity table corresponding to the document, the relationships among different entities in the entity table are determined, and a knowledge graph corresponding to the document is generated according to those relationships. The knowledge graph displays the key information in the document in a more intuitive, visual manner, which helps the user read the document quickly and find the key information and its associations, and provides a basis for subsequent systems to analyze and process documents in batches.
Drawings
FIG. 1 is a flowchart of a method for the visual interpretation of drug development literature according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the connecting paths between different entities in an embodiment of the present invention;
FIG. 3 is an example of a knowledge graph obtained for a document in an embodiment of the present invention;
FIG. 4 is another example of a knowledge graph obtained for a document in an embodiment of the present invention;
FIG. 5 is a flow chart of identifying a structural formula picture in a document according to an embodiment of the present invention;
FIG. 6 is an example of a structural formula in an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a visualized interpretation system of drug development documents according to an embodiment of the present invention;
FIG. 8 is another schematic structural diagram of a visualization interpretation system for drug development literature according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a specific configuration of the structural formula processing module in the system of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the scheme of the embodiments of the invention, the embodiments are described in further detail below with reference to the drawings and implementations.
In view of the characteristics of drug development literature, the embodiments of the invention provide a method and a system for the visual interpretation of such literature. The literature is analyzed and processed to form a more intuitive knowledge graph, which helps the user discover key information and its associations.
FIG. 1 is a flowchart of the method for the visual interpretation of drug development literature, which includes the following steps:
Step 101, determining each text unit and the position thereof in the document.
Specifically, the position of each text unit in the document can be determined according to the chapter keywords and the chapter division characteristics, and then the document is split to obtain each text unit and the position thereof in the document.
Chapter keywords are, for example: Abstract, Keywords, Introduction, Conclusion, and the like. The chapter division features include, but are not limited to, any one or more of: keywords, fonts, font sizes, line breaks, etc.
It should be noted that, if the document is in a picture format, text recognition needs to be performed on the picture first to obtain each text unit.
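As an illustration of Step 101, the following is a minimal sketch that splits a plain-text document into text units using chapter keywords and line breaks only. The function name `split_text_units`, the "Front matter" label, and the rule that a heading is any line beginning with a chapter keyword are assumptions made for this example; fonts and font sizes are not available in plain text and are ignored here.

```python
import re

# Chapter keywords taken from the examples in the text; a real list would be longer.
CHAPTER_KEYWORDS = ["Abstract", "Keywords", "Introduction", "Conclusion"]

# Assumption: a line that starts with a chapter keyword opens a new text unit.
HEADING = re.compile(
    r"^(%s)\b" % "|".join(map(re.escape, CHAPTER_KEYWORDS)), re.IGNORECASE
)


def split_text_units(document: str) -> list[dict]:
    """Split a document into text units, each labelled with its position (chapter)."""
    units = []
    current = {"position": "Front matter", "lines": []}
    for raw_line in document.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines carry no content
        match = HEADING.match(line)
        if match:
            if current["lines"]:
                units.append(current)
            current = {"position": match.group(1).capitalize(), "lines": []}
            # Keep any text that follows the keyword on the same line.
            rest = line[match.end():].lstrip(" :.-")
            if rest:
                current["lines"].append(rest)
        else:
            current["lines"].append(line)
    if current["lines"]:
        units.append(current)
    return [{"position": u["position"], "text": " ".join(u["lines"])} for u in units]
```

Each returned unit records its position, which later steps can use when scoring entities.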
Step 102, identifying the text unit to obtain each entity in the text unit, where the entities include but are not limited to: drugs, targets, indications, companies, and the like.
The text units are identified, mainly the entities contained in each text unit are identified. Specifically, each sentence in the text unit may be segmented, and the entity in the sentence may be determined by means of a corresponding dictionary or other tools.
Aiming at the characteristics of entities involved in drug development documents, in the embodiment of the invention, the entities to be identified mainly comprise: drugs, targets, indications, companies. Of course, there may be other entities according to different application requirements, and the embodiment of the present invention is not limited thereto.
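The dictionary-based recognition described in Step 102 could look roughly like the sketch below. The dictionary contents, the entity type labels, and the function name `recognize_entities` are hypothetical; the text only says that sentences are segmented and entities are found "by means of a corresponding dictionary or other tools".

```python
import re

# Hypothetical mini-dictionary: lower-cased surface form -> entity type.
ENTITY_DICTIONARY = {
    "celecoxib": "drug",
    "imrecoxib": "drug",
    "cox-2": "target",
    "osteoarthritis": "indication",
}


def recognize_entities(text_unit: str) -> list[tuple[str, str]]:
    """Return (surface form, entity type) pairs found in a text unit."""
    found = []
    # Naive sentence segmentation on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text_unit):
        for term, entity_type in ENTITY_DICTIONARY.items():
            # Match the term only when it is not part of a longer word.
            pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
            for match in re.finditer(pattern, sentence, re.IGNORECASE):
                found.append((match.group(0), entity_type))
    return found
```

A production system would more likely use a trained named entity recognition model, but the dictionary lookup is enough to show the data that flows into the normalization step.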
Step 103, performing normalization and merging processing on the entities to obtain an entity table corresponding to the document, wherein the entity table comprises: entities and their locations and frequency of occurrence.
In the medical field, the same entity often has a number of different forms of expression. For example, for a chemical drug, a small molecule compound can be expressed in the following ways:
1) Name of substance
For example, Celecoxib. A name may also have aliases and versions in other languages, such as the Chinese name 塞来昔布. In addition, a small molecule drug goes by different names at different stages, such as research and development codes, drug names, and trade names. For example, celecoxib is also known as: TPI-336 (crystalline), AI-525, CEP-33222, SC-58635, DRGT-47, DRGT-46, AD-2111, DFD-07, YM-177, F14, Onsenal, Celebrex, Celebra, Solexa, Celecox, Nu-Celecoxib, Elyxyb, Niflam, セレコキシブ and the like.
2) Chemical name
For example, the IUPAC name 4-[5-(4-methylphenyl)-3-(trifluoromethyl)pyrazol-1-yl]benzenesulfonamide, or the CAS number 169590-42-5, and the like.
3) Structural formula
For example, the structural formula for Celecoxib is as follows:
[Figure: structural formula of Celecoxib; image not reproduced in this text]
That is, in drug development literature, one entity may appear in many different forms of expression. In the embodiment of the present invention, these different forms therefore need to be normalized and merged. For example, when Celecoxib, 4-[5-(4-methylphenyl)-3-(trifluoromethyl)pyrazol-1-yl]benzenesulfonamide, 169590-42-5 (CAS number), or the corresponding structural formula appears in the document text, each of them is normalized to a single code in the knowledge base, where the code is the unique identifier of the corresponding entity.
The above describes normalization for small molecule drugs. Normalization of targets and other entities is similar and is not illustrated further.
After all entity data are normalized and combined, a structured table, namely an entity table corresponding to the document is generated, wherein the entity table is used for recording the characteristic information of entities (each entity corresponds to a code) in the document, and the entity table comprises: entities and their locations and frequency of occurrence.
For example:
TABLE 1
[Table 1: example entity table; image not reproduced in this text]
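The normalization and merging of Step 103 can be sketched as follows. The alias table and the codes `D0001` and `T0001` are invented for illustration (the text only says each entity maps to a unique code in the knowledge base), and the function name `build_entity_table` is likewise an assumption.

```python
from collections import defaultdict

# Hypothetical alias table: every known surface form maps to one knowledge-base code.
ALIAS_TO_CODE = {
    "celecoxib": "D0001",
    "celebrex": "D0001",
    "sc-58635": "D0001",
    "169590-42-5": "D0001",  # CAS number of celecoxib
    "cox-2": "T0001",
}


def build_entity_table(mentions):
    """Merge (surface form, position) mentions into a per-code entity table.

    Each row records the positions where the entity occurs and how often.
    Mentions that cannot be normalized are skipped in this sketch.
    """
    table = defaultdict(lambda: {"positions": set(), "frequency": 0})
    for surface_form, position in mentions:
        code = ALIAS_TO_CODE.get(surface_form.strip().lower())
        if code is None:
            continue
        table[code]["positions"].add(position)
        table[code]["frequency"] += 1
    return {
        code: {"positions": sorted(row["positions"]), "frequency": row["frequency"]}
        for code, row in table.items()
    }
```

With this layout, three differently written mentions of celecoxib collapse into a single row whose frequency is 3.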
Step 104, determining the relationships among different entities in the entity table.
Specifically, one entity is selected from the entities in the entity table as the start node, and the other entities are taken as target nodes; then the paths from the start node to each target node are determined by a shortest path algorithm.
When determining the start node, the entity with the largest weight may be selected as the start node according to the weight of each entity.
The weight of each entity can be determined according to the type, the position, the frequency of occurrence and the like of the entity, wherein the type of the entity can be determined according to a corresponding knowledge base, and the position and the frequency of occurrence of the entity can be obtained by the corresponding steps.
For example, the weight score W for each entity may be calculated as follows:
First, an intermediate value W' is calculated from the entity type T, the position P, and the frequency F:
[Formula for W': image not reproduced in this text]
To bring the value of W into the range (0, 1), the intermediate value W' is then converted into the weight value W:
[Formula for W: image not reproduced in this text]
For different positions of the entity, the corresponding value P can be as shown in Table 2 below:
TABLE 2
[Table 2: values of P by entity position; image not reproduced in this text]
For different types of entities, the corresponding value T can be as shown in Table 3 below:
TABLE 3
[Table 3: values of T by entity type; image not reproduced in this text]
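The formulas and the tables of P and T values are images that are not available in this text, so the sketch below does not reproduce them. It only illustrates the shape of the computation under stated assumptions: the score tables are hypothetical, the intermediate value is assumed to be the product T * P * F, and the mapping W = W' / (1 + W') is one of several functions that send a positive number into (0, 1).

```python
# Hypothetical scores; the actual values are given in Tables 2 and 3 of the source.
POSITION_SCORE = {"title": 3.0, "abstract": 2.0, "body": 1.0}
TYPE_SCORE = {"drug": 3.0, "target": 2.0, "indication": 1.5, "company": 1.0}


def weight(entity_type, positions, frequency):
    """Return a weight in (0, 1) from type T, position P and frequency F.

    Assumptions: P is the best score among the positions where the entity occurs,
    W' = T * P * F, and W = W' / (1 + W').
    """
    p = max(POSITION_SCORE[position] for position in positions)
    w_prime = TYPE_SCORE[entity_type] * p * frequency
    return w_prime / (1.0 + w_prime)


def pick_start_node(entity_table):
    """Select the entity with the largest weight as the start node."""
    return max(entity_table, key=lambda code: weight(**entity_table[code]))
```

Whatever the exact formula, the selection rule stays the same: compute W for every row of the entity table and take the maximum.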
Assume that the entity node with the largest weight in the whole document is A, i.e., the start node is A, and that the other entities B, C, and D are the target nodes. The shortest paths A-B, A-C, and A-D are each calculated from the corresponding knowledge base. Some nodes may appear in more than one of these paths; after the repeated nodes are merged, the connecting paths among the multiple points are obtained, as shown in FIG. 2.
The knowledge base is established in advance. It comprises a medical entity library and a knowledge graph that is built from the entity relationships and corresponds to the medical entity library, wherein the knowledge graph takes core entities as key nodes and general entities as ordinary nodes. The medical entity library is a knowledge library built by extracting medical entities from medical data in Chinese and other languages.
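The path computation and merging can be sketched as below. The text does not name a specific shortest path algorithm; breadth-first search is used here because the toy graph is unweighted. The graph contents and the function names are assumptions for illustration only.

```python
from collections import deque

# Hypothetical knowledge-base graph as an undirected adjacency list.
KNOWLEDGE_BASE = {
    "Celecoxib": ["COX-2"],
    "Imrecoxib": ["COX-2"],
    "COX-2": ["Celecoxib", "Imrecoxib", "Osteoarthritis"],
    "Osteoarthritis": ["COX-2"],
}


def shortest_path(graph, start, target):
    """Breadth-first search; returns the node list of a shortest path, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def connect(graph, start, targets):
    """Union of the edges on the shortest paths from start to every target.

    Storing edges in a set merges nodes and edges shared by several paths.
    """
    edges = set()
    for target in targets:
        path = shortest_path(graph, start, target)
        if path:
            edges.update(frozenset(pair) for pair in zip(path, path[1:]))
    return edges
```

In this toy graph both paths from Celecoxib pass through COX-2, so the merged result contains the Celecoxib to COX-2 edge only once.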
Step 105, generating a knowledge graph corresponding to the document according to the relationships between the entities.
Further, in another non-limiting embodiment of the method of the present invention, the associated entities of different entity nodes in the knowledge-graph may also be obtained by querying the knowledge base, and the associated entities are added to the knowledge-graph.
As shown in FIG. 3, Celecoxib and Imrecoxib act on the same target. Querying the knowledge base along the shortest path yields the corresponding entity node, namely COX-2, so the graph can be completed with information from the knowledge base and the user obtains a clear picture.
In order to distinguish the entities that appear in the document from other entities, as shown in FIG. 3, entities appearing in the document may be drawn with solid lines and entities supplemented from the knowledge base with dotted lines. Of course, if the target COX-2 also appears in the document, the dotted line in FIG. 3 becomes a solid line, and the corresponding entity nodes are associated directly without being supplemented from the knowledge base.
Furthermore, the knowledge base can be used to run an extended query on some or all of the entity nodes in the acquired knowledge graph to obtain the corresponding associated entities. For example, an extended query on the entity node Celecoxib in FIG. 3 yields the associated entities related to Celecoxib, as shown in FIG. 4.
Further, in another embodiment of the method of the present invention, the knowledge-graph may also be presented, and the entity node and the associated entity may adopt different presentation forms. For example, different colors, lines, and the like are used to distinguish the entity nodes and the associated entities, and the embodiment of the present invention is not limited thereto.
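One way to realize the different display forms is to emit Graphviz DOT text, drawing document entities with solid outlines and knowledge-base entities with dashed outlines. This is only a sketch: the text does not prescribe a rendering tool, and the function name `to_dot` is an assumption.

```python
def to_dot(edges, document_entities):
    """Render an undirected graph as Graphviz DOT text.

    Nodes found in the document are solid; nodes supplied only by the
    knowledge base are dashed.
    """
    nodes = sorted({node for edge in edges for node in edge})
    lines = ["graph knowledge_graph {"]
    for node in nodes:
        style = "solid" if node in document_entities else "dashed"
        lines.append(f'  "{node}" [style={style}];')
    # Sort the endpoints of each edge so the output is deterministic.
    for a, b in sorted(tuple(sorted(edge)) for edge in edges):
        lines.append(f'  "{a}" -- "{b}";')
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be passed to any DOT renderer; colors or line widths could be added through further node attributes in the same way.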
The method for the visual interpretation of drug development literature provided by the embodiment of the invention first determines each text unit in a document and its position, and then recognizes each text unit to obtain the entities it contains. The recognized entities are normalized and merged to obtain an entity table corresponding to the document, the relationships between different entities in the entity table are determined, and a knowledge graph corresponding to the document is generated according to those relationships. The knowledge graph displays the key information in the document in a more intuitive, visual manner, which helps the user read the document quickly and find the key information and its associations.
For the drug development literature, a structural formula picture is sometimes included, and the analysis and interpretation of the structural formula picture can enable a user to better understand relevant information. Therefore, in another embodiment of the method, a structural formula picture in a document can be analyzed and identified to obtain an entity corresponding to the structural formula picture; and adding the entity corresponding to the structural formula picture into the entity table.
FIG. 5 is a flowchart for identifying a structural formula picture in a document in the embodiment of the present invention, which includes the following steps:
Step 501, converting the document into a picture.
The document may be a pdf file or a file with other format, and the embodiment of the present invention is not limited thereto.
Step 502, detecting the picture to determine whether a structural formula picture exists. If so, step 503 is performed.
Specifically, a machine vision object detection technique can be adopted, using a Mask R-CNN model for detection. Mask R-CNN is built by adding a branch network to the Faster R-CNN model, and it can complete two tasks at the same time: object detection (bounding boxes) and pixel-level segmentation. The Mask R-CNN model can be trained on a manually annotated data set of structural formula pictures (the coordinate positions of structural formula pictures in different scenes need to be annotated so that the model can extract features), thereby adapting the model to the chemical domain.
Step 503, segmenting the picture to obtain each structural formula picture.
Step 504, identifying the structural formula picture to obtain an identification result.
Specifically, the structural formula picture can be converted into a binary image, and then character recognition and image type recognition are performed on the binary image to obtain atoms and chemical bonds corresponding to the structural formula picture.
The chemical structural formula is composed of atoms/groups and chemical bonds, and character recognition (atoms/groups) and image type recognition (chemical bonds: single bonds, double bonds, triple bonds and special bonds) are adopted during recognition. The atoms/groups are connected through chemical bonds to form the whole chemical structure.
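The text does not say how the binary image is produced. A minimal sketch, assuming a grayscale image stored as nested lists of 0 to 255 values and a fixed global threshold (both assumptions), is:

```python
def binarize(gray_image, threshold=128):
    """Map each grayscale pixel to 1 (ink) if darker than the threshold, else 0.

    gray_image is a list of rows, each a list of integers in the range 0-255.
    The fixed threshold is an assumption; adaptive methods are also common.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray_image]
```

Character recognition (for atoms and groups) and line-type recognition (for bonds) would then operate on the resulting 0/1 grid.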
Step 505, generating a structure description table according to the identification result.
For example, a structure description table shown in the following table 4 can be obtained corresponding to the structural formula shown in fig. 6.
TABLE 4
[Table 4: structure description table for the structural formula of FIG. 6; image not reproduced in this text]
The fields in table 4 are illustrated as follows:
sequence number: the row number in the table; for the corresponding structural formula see FIG. 6;
atom/group number: C is 6, N is 7, O is 8; 201 represents a sulfonamide group (group numbers may be user-defined);
number of chemical bonds: the number of chemical bonds attached to the current atom/group;
description of the chemical bonds: a description of each chemical bond attached to the atom/group with the current sequence number, in the form [a, b, c], where a is the bond order (single bond 1, double bond 2, etc.), b is the chemical bond type number (normal 0, wedge 1, dashed 2, etc.), and c is the number of the other atom/group to which the bond connects.
Step 506, determining a corresponding substance entity according to the structure description table.
According to the structure description table, the structure can be converted into a typical compound file format, such as *.mol or *.sdf, and the mol or sdf file can then be used to query a public chemical substance database service to obtain the substance entity corresponding to the structural formula.
It should be noted that, among the compound representation methods already used in the industry, the widely used mol file likewise describes the structure of a compound with a connection table. A mol file is divided into four parts: a counts part, an atom description part, a chemical bond description part, and an attribute description part. In the embodiment of the invention, the file format conversion can be implemented by mapping the data in Table 4 onto the public standard for mol files.
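The mapping from the structure description table to the atom and bond parts of a connection table can be sketched as follows. The row layout (sequence number, atom/group number, number of bonds, list of [a, b, c] descriptions) follows the field descriptions above, with c read as the sequence number of the neighboring atom, which is an interpretation. The example rows are a hypothetical fragment, not the contents of Table 4, and writing the exact mol file text is left out of this sketch.

```python
# Atomic numbers mentioned in the text; group codes such as 201 are kept as labels.
SYMBOLS = {6: "C", 7: "N", 8: "O"}


def table_to_connection(rows):
    """Convert structure description rows into atom and bond lists.

    rows: iterable of (sequence_number, atom_or_group_number, bond_count, bonds),
    where bonds is a list of [order, bond_type, other_sequence_number].
    Each bond is listed by both of its atoms, so it is deduplicated here.
    """
    atoms = []
    bonds = {}
    for seq, number, bond_count, bond_list in rows:
        if bond_count != len(bond_list):
            raise ValueError(f"row {seq}: bond count does not match bond list")
        atoms.append((seq, SYMBOLS.get(number, f"G{number}")))
        for order, bond_type, other in bond_list:
            key = (min(seq, other), max(seq, other))
            bonds[key] = (order, bond_type)
    bond_rows = [
        (a, b, order, bond_type) for (a, b), (order, bond_type) in sorted(bonds.items())
    ]
    return atoms, bond_rows
```

The lengths of the two returned lists give the atom and bond totals needed for the counts part, and the lists themselves correspond to the atom description and chemical bond description parts.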
Further, some drug development documents may also include tables, and for the contents in the tables, the contents may be used as ordinary texts for entity identification and processing, and the specific processing process may refer to the foregoing description, and is not repeated herein.
It should be noted that in determining the weight scores of each entity, higher scores may be given to the entity positions P of the entities in the structural formula and the table.
Correspondingly, the embodiment of the invention also provides a system for the visual interpretation of drug development literature. FIG. 7 is a schematic structural diagram of the system.
In this embodiment, the system includes the following modules:
a text unit determining module 701, configured to determine each text unit in the document and a position of the text unit;
a text recognition module 702, configured to recognize the text unit to obtain entities in the text unit, where the entities include: drugs, targets, indications, companies;
a normalization processing module 703, configured to perform normalization and merging processing on the entities to obtain an entity table corresponding to the document, where the entity table includes: entities and their locations and frequency of occurrence;
a relationship determining module 704, configured to determine relationships between different entities in the entity table;
and a knowledge graph generating module 705, configured to generate a knowledge graph corresponding to the document according to the relationship between the entities.
A specific structure of the text unit determining module 701 may include the following units:
the position determining unit is used for determining the position of each text unit in the document according to the chapter keywords and the chapter division features; chapter keywords are, for example: Abstract, Keywords, Introduction, Conclusion, and the like; the chapter division features include, but are not limited to, any one or more of: keywords, fonts, font sizes, line breaks, and the like;
and the splitting unit is used for splitting the literature to obtain each text unit and the position thereof in the literature.
One specific structure of the relationship determining module 704 may include the following units:
a node determining unit, configured to select one entity from the entities in the entity table as a start node, and use the other entities as target nodes;
and the path determining unit is used for determining, by a shortest path algorithm, the paths from the start node to each target node.
The node determining unit may select, according to the weight of each entity, the entity with the largest weight as the start node. Correspondingly, the node determining unit may specifically include: a weight calculation unit and a node selection unit. The weight calculation unit is used for determining the weight of each entity in the entity table; the node selection unit is used for selecting the entity with the largest weight as the starting node.
Further, the node determining unit may also include a type determining unit, configured to determine the type of each entity according to the knowledge base. Accordingly, the weight calculation unit may determine the weight of the entity according to its type, location, and frequency of occurrence.
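A minimal sketch of the weight calculation and node selection follows. The source names the three factors (type, location, frequency of occurrence) but gives neither numeric values nor a formula in this passage, so the scores in `TYPE_SCORE` and `LOCATION_SCORE` and the multiplicative combination are assumptions chosen purely for illustration.

```python
# Hypothetical scores; the source does not give numeric values.
TYPE_SCORE = {"drug": 3.0, "target": 2.0, "indication": 1.5, "company": 1.0}
LOCATION_SCORE = {"Abstract": 2.0, "Conclusion": 1.5}  # other units score 1.0

def entity_weight(entity_type, locations, frequency):
    """Combine type, best location, and frequency into one weight (assumed product)."""
    location_score = max(
        (LOCATION_SCORE.get(loc, 1.0) for loc in locations), default=1.0
    )
    return TYPE_SCORE.get(entity_type, 1.0) * location_score * frequency

def select_start_node(entity_table):
    """Return the name of the entity with the largest weight.

    entity_table: name -> {"type": str, "locations": list, "frequency": int}.
    """
    return max(
        entity_table,
        key=lambda name: entity_weight(
            entity_table[name]["type"],
            entity_table[name]["locations"],
            entity_table[name]["frequency"],
        ),
    )
```

With these assumed scores, a drug mentioned three times including once in the Abstract outweighs a company mentioned four times in the Introduction only.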
In another non-limiting embodiment of the system of the present invention, as shown in FIG. 8, the system may further comprise a query module 801, configured to query a knowledge base to obtain associated entities of different entity nodes in the knowledge graph, and to add the associated entities to the knowledge graph.
Further, the system may also include a display module (not shown), configured to display the knowledge graph, using different display forms for the entity nodes and the associated entities.
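The query and display behavior can be sketched as below. The in-memory `KNOWLEDGE_BASE`, the `kind` attribute, and the `STYLE` table are hypothetical stand-ins: the source does not describe the knowledge base interface or the concrete display forms, only that associated entities are added and shown differently from entity nodes extracted from the document.

```python
# Hypothetical knowledge base: entity -> associated entities.
KNOWLEDGE_BASE = {
    "aspirin": ["COX-2", "Bayer"],
    "COX-1": ["ibuprofen"],
}

# Hypothetical display forms, keyed by node kind.
STYLE = {
    "document": {"shape": "ellipse", "line": "solid"},
    "associated": {"shape": "box", "line": "dashed"},
}

def expand_graph(nodes, edges, knowledge_base):
    """Add associated entities from the knowledge base to the graph.

    nodes: name -> attribute dict; edges: set of (source, target) pairs.
    Nodes extracted from the document are tagged "document"; nodes that
    exist only in the knowledge base are tagged "associated".
    """
    for name in nodes:
        nodes[name].setdefault("kind", "document")
    for name in list(nodes):  # snapshot, since new nodes are added below
        if nodes[name]["kind"] != "document":
            continue
        for related in knowledge_base.get(name, []):
            if related not in nodes:
                nodes[related] = {"kind": "associated"}
            edges.add((name, related))
    return nodes, edges
```

A renderer could then look up `STYLE[nodes[name]["kind"]]` for each node, so that associated entities are visually distinct from entities found in the document itself.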
The visualized interpretation system for drug development literature first determines each text unit and its position in the document. It then recognizes each text unit to obtain the entities it contains, performs normalization and merging on the recognized entities to obtain an entity table corresponding to the document, determines the relationships between different entities in the entity table, and generates a knowledge graph corresponding to the document according to those relationships. The knowledge graph presents the key information in the document in a more intuitive, visual manner, which helps the user read the document quickly and find the key information and its associations.
Drug development literature sometimes includes structural formula pictures, and analyzing and interpreting these pictures can help the user better understand the relevant information. Therefore, in another embodiment of the system of the present invention, the system may further include a structural formula processing module, configured to identify the structural formula picture when the document includes one, so as to obtain an entity corresponding to the structural formula picture. Accordingly, in this embodiment, the normalization processing module is further configured to add the entity corresponding to the structural formula picture to the entity table.
FIG. 9 is a schematic diagram of a specific structure of the structural formula processing module in the system of the present invention, which includes the following modules:
a conversion module 901, configured to convert a document into a picture;
a detection module 902, configured to detect the picture and determine whether a structural formula picture is present; the detection may be performed using a Mask R-CNN model;
a dividing module 903, configured to divide the picture, when a structural formula picture is present, to obtain each individual structural formula picture;
the structural formula identification module 904 is used for identifying the structural formula picture to obtain an identification result;
the file generation module 905 is configured to generate a structure description table according to the identification result;
a material entity determining module 906, configured to determine a corresponding material entity according to the structure description table.
The structural formula identification module 904 may specifically include the following units:
the picture conversion unit is used for converting the structural formula picture into a binary image;
and the image identification unit is used for carrying out character identification and image type identification on the binary image to obtain atoms and chemical bonds corresponding to the structural formula picture.
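The first of these two units, conversion to a binary image, can be sketched with a fixed threshold. The threshold value of 128, the list-of-rows image representation, and the function name `to_binary_image` are assumptions; the source does not state how the binarization is performed, and the subsequent character and image type recognition is not covered by this sketch.

```python
def to_binary_image(gray, threshold=128):
    """Convert a grayscale image to a binary image.

    gray: list of rows, each a list of intensities in 0..255.
    Returns rows of 0/1, where 1 marks a dark ("ink") pixel.
    The fixed threshold is an assumption; an adaptive method could be used instead.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]
```

For example, `to_binary_image([[255, 30], [10, 200]])` marks the two dark pixels as ink and the two light pixels as background.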
The identification of structural formulas thus provides further help for the user's understanding of drug development literature.
In addition, the information obtained by processing drug development literature under the scheme of the invention provides an important basis for digitizing such literature, helps users subsequently analyze and interpret large volumes of related literature, and provides the basic capability for subsequent batch analysis and processing of literature by a system, enabling wider application.
It should be noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of the present invention and the above-described drawings, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. Furthermore, the above-described system embodiments are merely illustrative, wherein modules and units illustrated as separate components may or may not be physically separate, i.e., may be located on one network element, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The present invention has been described in detail with reference to the embodiments, and the description of the embodiments is provided to facilitate the understanding of the method and apparatus of the present invention, and is intended to be a part of the embodiments of the present invention rather than the whole embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without any creative effort shall fall within the protection scope of the present invention, and the content of the present description shall not be construed as limiting the present invention. Therefore, any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (15)

1. A method for visual interpretation of drug development literature, the method comprising:
determining each text unit and the position thereof in the literature;
identifying the text unit to obtain each entity in the text unit, wherein the entities comprise: drugs, targets, indications, companies;
and carrying out normalization and merging processing on the entities to obtain an entity table corresponding to the document, wherein the entity table comprises: entities and their locations and frequency of occurrence;
determining the relationship between different entities in the entity table;
and generating a knowledge graph corresponding to the literature according to the relation between the entities.
2. The method of claim 1, wherein determining each text unit and its location in the document comprises:
determining the position of each text unit in the document according to the chapter keywords and the chapter division characteristics;
and splitting the document to obtain each text unit and the position thereof in the document.
3. The method of claim 1, wherein determining relationships between different entities in the entity table comprises:
selecting one entity from the entities in the entity table as a starting node, and taking the other entities as target nodes;
and determining, by a shortest path algorithm, the paths from the starting node to each target node.
4. The method of claim 3, wherein selecting a starting node from the entities of the entity table comprises:
determining the weight of each entity in the entity table;
and selecting the entity with the maximum weight as the starting node.
5. The method of claim 4, wherein determining the weight of each entity in the entity table comprises:
determining the type of the entity according to a knowledge base;
and determining the weight of the entity according to the type, the position and the occurrence frequency of the entity.
6. The method of claim 5, further comprising:
querying the knowledge base to obtain associated entities of different entity nodes in the knowledge graph;
adding the associated entities to the knowledge-graph.
7. The method of claim 6, further comprising:
and displaying the knowledge graph, and adopting different display forms for the entity node and the associated entity.
8. The method according to any one of claims 1 to 7, further comprising:
if the document contains a structural formula picture, identifying the structural formula picture to obtain an entity corresponding to the structural formula picture;
and adding the entity corresponding to the structural formula picture into the entity table.
9. A system for visual interpretation of drug development literature, the system comprising:
the text unit determining module is used for determining each text unit and the position thereof in the document;
a text recognition module, configured to recognize the text unit to obtain entities in the text unit, where the entities include: drugs, targets, indications, companies;
a normalization processing module, configured to perform normalization and merging processing on the entities to obtain an entity table corresponding to the document, where the entity table includes: entities and their locations and frequency of occurrence;
the relation determining module is used for determining the relation between different entities in the entity table;
and the knowledge graph generating module is used for generating a knowledge graph corresponding to the literature according to the relation between the entities.
10. The system of claim 9, wherein the text unit determination module comprises:
the position determining unit is used for determining the position of each text unit in the document according to the chapter key words and the chapter division characteristics;
and the splitting unit is used for splitting the literature to obtain each text unit and the position thereof in the literature.
11. The system of claim 9, wherein the relationship determination module comprises:
a node determining unit, configured to select one entity from the entities in the entity table as a starting node, and use the other entities as target nodes;
and the path determining unit is used for determining paths from the starting node to each target node by a shortest path algorithm.
12. The system according to claim 11, wherein the node determining unit comprises:
the weight calculation unit is used for determining the weight of each entity in the entity table;
and the node selection unit is used for selecting the entity with the largest weight as the starting node.
13. The system of claim 12, wherein the node determining unit further comprises: a type determining unit, configured to determine the type of the entity according to a knowledge base;
the weight calculation unit is specifically configured to determine the weight of the entity according to the type, the location, and the frequency of occurrence of the entity.
14. The system of claim 13, further comprising:
and the query module is used for querying the knowledge base to obtain associated entities of different entity nodes in the knowledge graph and adding the associated entities into the knowledge graph.
15. The system of any one of claims 9 to 14, further comprising:
the structural formula processing module is used for identifying the structural formula picture under the condition that the document contains the structural formula picture to obtain an entity corresponding to the structural formula picture;
and the normalization processing module is also used for adding the entity corresponding to the structural formula picture into the entity table.
CN202210147101.6A 2022-02-17 2022-02-17 Drug development literature visualization interpretation method and system Active CN114201618B (en)

Publications (2)

Publication Number Publication Date
CN114201618A (en) 2022-03-18
CN114201618B (en) 2022-09-13

Family

ID=80645600

Country Status (1)

Country Link
CN (1) CN114201618B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657067A (en) * 2018-11-19 2019-04-19 平安科技(深圳)有限公司 Methods of exhibiting, device, computer equipment and the storage medium of knowledge mapping
CN111309824A (en) * 2020-02-18 2020-06-19 中国工商银行股份有限公司 Entity relationship map display method and system
CN111737492A (en) * 2020-06-23 2020-10-02 安徽大学 Autonomous robot task planning method based on knowledge graph technology
WO2020233261A1 (en) * 2019-07-12 2020-11-26 之江实验室 Natural language generation-based knowledge graph understanding assistance system
CN112347204A (en) * 2021-01-08 2021-02-09 药渡经纬信息科技(北京)有限公司 Method and device for constructing drug research and development knowledge base
CN112395871A (en) * 2020-12-02 2021-02-23 华中科技大学 Collocation configuration type automatic acquisition method and system and visualization method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant