CN111797296B - Method and system for mining poison-target literature knowledge based on network crawling - Google Patents

Publication number
CN111797296B
Authority
CN
China
Prior art keywords
poison
target
literature
information
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010654561.9A
Other languages
Chinese (zh)
Other versions
CN111797296A (en
Inventor
周文霞
韩露
张永祥
肖智勇
黄晏
刘港
高圣乔
罗丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Academy of Military Medical Sciences AMMS of PLA
Original Assignee
Academy of Military Medical Sciences AMMS of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Academy of Military Medical Sciences AMMS of PLA
Priority to CN202010654561.9A
Publication of CN111797296A
Application granted
Publication of CN111797296B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method and system for mining poison-target literature knowledge based on network crawling. The method comprises the following steps: acquiring and processing poison and target data information to establish a comprehensive data set; developing a web crawler tool; based on the comprehensive data set, crawling poison and target document text information with the web crawler tool and processing it to establish a document text database; based on the document text database, determining potential poison-target action relations using natural language processing technology to form a poison-target relation knowledge base; and mining poison-target literature knowledge using the document text database and the poison-target relation knowledge base. The method and system offer high efficiency, good accuracy and a high degree of intelligence.

Description

Method and system for mining poison-target literature knowledge based on network crawling
Technical Field
The invention relates to the technical field of Internet, in particular to a poison-target literature knowledge mining method and system based on network crawling.
Background
With the rapid development of toxicology, molecular biology and related disciplines, a large number of data sets related to poisons and targets have appeared on the Internet. At present, however, these data sets are stored in scattered resources and heterogeneous formats, a large amount of redundant information is duplicated across different poison and target data sets, and poison names and aliases are inconsistent and lack a unified naming standard. Although these data sets provide a reference for poison researchers, the lack of uniform data standards and of the necessary data filtering and quality control mechanisms leads to inefficient manual searching and querying, excessive redundant information and difficulty in knowledge discovery. It is therefore particularly important to integrate the existing poison and target data sets in an orderly and efficient way, and to build a comprehensive poison-target data set covering the currently known data sets.
Existing literature retrieval systems for poisons and targets perform information retrieval and fuzzy matching in a back-end literature database based on keywords entered by the user, such as a poison name, a target name or a combination of the two, and return the documents with the highest similarity. This mode of retrieval remains at the level of static content search and matching, so hidden knowledge in the literature is difficult to obtain and knowledge mining from massive biomedical literature is harder still, which seriously affects the efficiency of poison research and development and of researchers' work.
Therefore, there is an urgent need for an efficient, accurate and intelligent poison-target literature knowledge mining method and system.
Disclosure of Invention
(I) Technical problem to be solved
In view of the foregoing, it is a primary object of the present invention to provide a poison-target literature knowledge mining method and system based on web crawling, with the aim of at least partially solving at least one of the above-mentioned technical problems.
(II) Technical solution
According to one aspect of the present invention, there is provided a poison-target literature knowledge mining method based on web crawling, including:
acquiring and processing poison and target data information to establish a comprehensive data set;
developing a web crawler tool;
based on the comprehensive data set, crawling poison and target document text information by utilizing the web crawler tool and processing the information to establish a document text database;
determining potential action relations of the poison-target by utilizing a natural language processing technology based on the document text database to form a knowledge base of the poison-target relation;
and carrying out poison-target literature knowledge mining by utilizing the literature text database and the poison-target relation knowledge base.
Further, acquiring and processing poison and target data information to create a comprehensive data set, comprising:
acquiring known poison and target data information;
and carrying out information deduplication, data filtering and standard processing on the known poison and target data information, and establishing a comprehensive data set of the poison and the target.
Further, the integrated dataset includes information of poisons and information of targets; the information of the poison comprises basic information of a poison name, a CAS number, a chemical structure and a molecular formula, and the information of the target comprises information of a target name and a Uniprot sequence number.
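As an illustration only, the fields listed above could be held in two simple record types. The class names, field names and the optional alias list below are assumptions made for this sketch and are not defined in the patent; the chemical structure is left empty rather than guessed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PoisonRecord:
    """One poison entry of the comprehensive data set (hypothetical layout)."""
    name: str
    cas_number: str
    molecular_formula: str
    chemical_structure: str = ""  # e.g. a structure string; encoding is an assumption
    aliases: List[str] = field(default_factory=list)  # assumed extra field


@dataclass
class TargetRecord:
    """One target entry of the comprehensive data set (hypothetical layout)."""
    name: str
    uniprot_id: str


# Example values chosen for illustration; they do not come from the patent.
atropine = PoisonRecord(
    name="Atropine",
    cas_number="51-55-8",
    molecular_formula="C17H23NO3",
)
chrm1 = TargetRecord(name="Muscarinic acetylcholine receptor M1", uniprot_id="P11229")
```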
Further, developing a web crawler tool comprises: developing the web crawler tool based on the Python language and the Scrapy framework.
Further, based on the integrated data set, crawling poison and target document text information with the web crawler tool and processing to build a document text database, comprising:
based on the poison names and the target names in the comprehensive data set, automatically crawling poison and target document text information from a document website Pubmed by utilizing the web crawler tool;
and cleaning the data of the poison and target document text information, and establishing a document text database of the poison and target by using the cleaned document text information.
Further, based on the document text database, determining a potential action relationship of the poison-target by using a natural language processing technology to form a knowledge base of the poison-target relationship, comprising: determining the relation of the literature poison-target based on similarity analysis, cluster analysis, topic mining, entity relation extraction and deep learning algorithm, and forming a knowledge base of the poison-target relation.
Further, determining a literature poison-target relationship based on similarity analysis, cluster analysis, topic mining, entity relationship extraction and a deep learning algorithm, and forming a poison-target relationship knowledge base, comprising:
processing the data in the crawled literature and the built literature text database based on similarity analysis, cluster analysis and topic mining to obtain multidimensional literature quantitative data;
establishing, based on entity relation extraction, a statistical analysis model for determining the relationship between the poison and the target by adopting a data mining technology;
optimizing the model parameters by adopting a word vector and multi-layer neural network technology based on a deep learning algorithm;
based on multidimensional literature quantitative data, determining a literature poison-target relationship by using an optimized statistical analysis model, and forming a poison-target relationship knowledge base.
Further, the method for mining the poison-target literature knowledge by using the literature text database and the poison-target relation knowledge base comprises the following steps:
performing retrieval and targeted literature queries using the literature text database and the poison-target relation knowledge base, so as to mine the poison-target literature knowledge.
Further, searching and targeted document query are performed by using the document text database and the poison-target relation knowledge base, so as to mine the poison-target document knowledge, which comprises the following steps:
retrieving poison from the literature text database to obtain the name, CAS number, chemical structure and molecular formula basic information of the poison;
retrieving the target in the document text database to obtain the name, DNA name, receptor name, uniprot sequence number, molecular weight, alias, gene sequence and protein ID information of the target;
and obtaining potential action relations of the poison and the target on target proteins, receptors or human target organs by carrying out poison and target retrieval in a poison-target relation knowledge base.
According to another aspect of the present invention, there is provided a web-crawling-based poison-target document knowledge mining system comprising a processor for performing the method.
(III) Beneficial effects
According to the technical scheme, the poison-target literature knowledge mining method and system based on network crawling have at least one of the following beneficial effects:
(1) The invention processes known poison and target data to establish the comprehensive data set, which resolves the duplication, redundancy, inconsistent information and non-standard naming found in existing data sets, and helps improve mining efficiency and accuracy.
(2) The invention automatically crawls massive biomedical literature based on the poison and target information in the comprehensive data set and establishes the document text database, so that hidden knowledge in the literature can be obtained and knowledge mining from massive biomedical literature is facilitated.
(3) The invention uses machine learning and deep learning algorithms to mine potential poison-target action relations from massive literature, which helps improve the degree of intelligence and the effectiveness of mining.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a schematic diagram of the poison-target literature knowledge mining system of the present invention.
Detailed Description
The present invention will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.
The invention provides a poison-target literature knowledge mining method based on network crawling, which comprises the following steps:
acquiring and processing poison and target data information to establish a comprehensive data set;
developing a web crawler tool;
based on the comprehensive data set, crawling poison and target document text information by utilizing the web crawler tool and processing the information to establish a document text database;
determining potential action relations of the poison-target by utilizing a natural language processing technology based on the document text database to form a knowledge base of the poison-target relation;
and carrying out poison-target literature knowledge mining by utilizing the literature text database and the poison-target relation knowledge base.
The acquired poison and target data information is raw, known poison and target data (the poison and target data in existing databases suffer from duplication, redundancy, inconsistency, non-standard naming and similar problems). The comprehensive data set is formed by processing this raw data, where processing means deduplication, filtering, standardization, unification and the like. By processing known poison and target data to establish the comprehensive data set, the invention resolves the duplication, redundancy, inconsistent information and non-standard naming of data in existing data sets, and helps improve mining efficiency and accuracy.
Specifically, acquiring and processing poison and target data information to create a comprehensive data set, comprising: acquiring known poison and target data information; and performing information deduplication, data filtering and standard processing on the known poison and target data information, and establishing a comprehensive data set of the poison and the target.
Developing a web crawler tool comprises: developing the web crawler tool based on the Python language and the Scrapy framework.
Based on the integrated dataset, crawling poison and target document text information with the web crawler tool and processing to build a document text database, comprising: based on the poison names and the target names in the comprehensive data set, automatically crawling poison and target document text information from a document website Pubmed by utilizing the web crawler tool; and cleaning the data of the poison and target document text information, and establishing a document text database of the poison and target by using the cleaned document text information.
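A minimal sketch of how poison and target names from the data set might be turned into Pubmed search requests is given below. It assumes the public NCBI E-utilities `esearch` endpoint is queried instead of crawling HTML pages; the patent only states that the crawler collects text from Pubmed, so the endpoint, the `[Title/Abstract]` restriction and the function name `build_search_urls` are assumptions of this example. No network request is made here.

```python
from itertools import product
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here; not named in the patent).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_urls(poisons, targets, retmax=20):
    """Build one search URL per (poison, target) name pair."""
    urls = []
    for poison, target in product(poisons, targets):
        # Require both names to occur in the title or abstract.
        term = f'"{poison}"[Title/Abstract] AND "{target}"[Title/Abstract]'
        params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
        urls.append(f"{ESEARCH_URL}?{urlencode(params)}")
    return urls


urls = build_search_urls(["atropine"], ["CHRM1", "CHRM2"])
```

Each URL could then be handed to the crawler as a start request; pairing every poison with every target grows quadratically, so a real run would probably need batching and rate limiting.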
Determining potential poison-target action relations using natural language processing technology based on the document text database, to form a poison-target relation knowledge base, comprises: determining literature poison-target relations based on similarity analysis, cluster analysis, topic mining, entity relation extraction and deep learning algorithms, and forming the poison-target relation knowledge base. The invention uses machine learning and deep learning algorithms to mine potential poison-target action relations from massive literature, which helps improve the degree of intelligence and the effectiveness of mining.
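The patent does not specify how similarity analysis is computed. One common choice, shown here purely as an assumed example, is cosine similarity between TF-IDF vectors of abstracts; the whitespace tokenization and the smoothed IDF formula are simplifications chosen for this sketch.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using smoothed TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


docs = [
    "atropine blocks muscarinic receptors",
    "atropine antagonizes muscarinic receptors",
    "lead accumulates in bone",
]
vectors = tfidf_vectors(docs)
```

With these three toy abstracts, the first two share most of their terms and score higher against each other than either does against the third.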
And carrying out poison-target literature knowledge mining by using the literature text database and the poison-target relation knowledge base, wherein the method comprises the following steps of: retrieving poison from the literature text database to obtain the name, CAS number, chemical structure and molecular formula basic information of the poison; retrieving the target in the document text database to obtain the name, DNA name, receptor name, uniprot sequence number, molecular weight, alias, gene sequence and protein ID information of the target; and obtaining potential action relations of the poison and the target on target proteins, receptors or human target organs by carrying out poison and target retrieval in a poison-target relation knowledge base.
Compared with the method for directly acquiring the poison-target literature knowledge through the scattered data resources on the existing network, the method for acquiring the poison-target literature knowledge processes the known poison and target data information to form a comprehensive data set, builds a literature text database and a poison-target relation knowledge base, and performs poison and/or target retrieval in the literature text database and the poison-target relation knowledge base, so that the efficiency is higher and the accuracy is better.
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The poison-target literature knowledge mining method based on network crawling in the embodiment comprises the following steps:
Step 1: acquiring poison and target data information and establishing a comprehensive data set;
Specifically, data information related to known poisons and targets is collected, organized and mined, and a comprehensive data set containing basic information on poisons and targets is established. More specifically, various existing databases are searched and queried, including but not limited to the TOXNET database, the DrugBank database, the TDD database, the NRDB drug-protein database, drug-target interaction information databases and SuperTarget. The data information on all currently known poisons and targets is collected, organized and mined, then processed by information deduplication, data filtering and the like, to establish a comprehensive data set of basic poison and target information. The comprehensive data set comprises poison information and target information: poison information includes basic information such as the name, CAS number, chemical structure and molecular formula, and target information includes basic information such as the target name and Uniprot sequence number. The resulting comprehensive data set provides an accurate and effective information basis for crawling and mining literature text in the next step.
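The deduplication, filtering and standardization of step 1 might look roughly like the following sketch. It assumes that the CAS number is used as the merge key and that records without a well-formed CAS number are filtered out; the patent does not state which key or filter rules are used, and the record layout and function names are hypothetical.

```python
import re
from typing import Dict, Iterable

# CAS Registry Numbers have the form 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^\d{2,7}-\d{2}-\d$")


def normalize_name(name: str) -> str:
    """Standardize a name: collapse whitespace and lowercase it."""
    return " ".join(name.split()).lower()


def build_dataset(records: Iterable[dict]) -> Dict[str, dict]:
    """Merge raw poison records from several sources into one entry per CAS number."""
    merged: Dict[str, dict] = {}
    for rec in records:
        cas = rec.get("cas", "").strip()
        if not CAS_PATTERN.match(cas):
            continue  # data filtering: drop records without a usable key
        name = normalize_name(rec["name"])
        entry = merged.setdefault(cas, {"cas": cas, "name": name, "aliases": set()})
        if name != entry["name"]:
            entry["aliases"].add(name)  # keep differing names as aliases
    return merged


# Toy input imitating overlapping records from different source databases.
raw = [
    {"name": "Atropine", "cas": "51-55-8", "source": "A"},
    {"name": " atropine ", "cas": "51-55-8", "source": "B"},
    {"name": "DL-Hyoscyamine", "cas": "51-55-8", "source": "C"},
    {"name": "Unknown compound", "cas": "n/a", "source": "B"},
]
dataset = build_dataset(raw)
```

Here the three records sharing a CAS number collapse into one entry with one alias, and the record without a valid CAS number is dropped.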
Step 2: web crawler tool development, poison-target document crawling and text information preprocessing, and document text database is established;
Specifically, a web crawler tool is developed to automatically acquire literature text related to poisons and targets from the literature website Pubmed: the poison and target names in the comprehensive data set are used to automatically crawl the related literature text, and all crawled literature text is initially sorted and collected to establish a literature text database. More specifically, automatic crawling of literature text for all poisons and targets on the Pubmed website can be implemented by developing a literature collection site management module, a literature collection template management module, a literature collection module and a collection monitoring module, yielding a literature text data set of poisons and targets. This data set then undergoes data cleaning such as data extraction, transformation and loading, which implements the structured-storage preprocessing of the literature text data and establishes the literature text database of poisons and targets.
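The cleaning step is described only in general terms. A minimal sketch, assuming each crawled record is a dictionary with `pmid`, `title` and `abstract` keys (a hypothetical layout), could strip markup, decode HTML entities, normalize whitespace and discard incomplete records before storage:

```python
import html
import re
from typing import Optional

TAG_RE = re.compile(r"<[^>]+>")


def clean_record(raw: dict) -> Optional[dict]:
    """Return a structured, cleaned record, or None if it is unusable."""
    # Remove tags first, then decode entities such as &amp;.
    abstract = html.unescape(TAG_RE.sub(" ", raw.get("abstract") or ""))
    abstract = " ".join(abstract.split())
    title = " ".join((raw.get("title") or "").split())
    if not raw.get("pmid") or not abstract:
        return None  # drop records without an identifier or abstract text
    return {"pmid": str(raw["pmid"]), "title": title, "abstract": abstract}


# The identifier and text below are made-up placeholders.
raw_record = {
    "pmid": 12345,
    "title": "  Example   title ",
    "abstract": "<p>Atropine  binds\n muscarinic receptors &amp; others.</p>",
}
cleaned = clean_record(raw_record)
```

The returned dictionary is the kind of structured record that could then be loaded into the literature text database.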
Preferably, the web crawler tool is developed based on the Python language and the Scrapy framework, with specific functions developed for specific websites. The functions of each module of the Scrapy-based web crawler tool are as follows:
Engine (Scrapy Engine): responsible for communication, signaling and data transfer among the Spider, Item Pipeline, Downloader and Scheduler;
Scheduler: responsible for receiving the Requests sent by the Engine, ordering and enqueuing them in a certain way, and returning them to the Engine when the Engine needs them;
Downloader: responsible for downloading all Requests sent by the Scrapy Engine and returning the acquired Responses to the Scrapy Engine, which hands them to the Spider for processing;
Spider: responsible for processing all Responses, parsing and extracting data from them to obtain the data required by the Item fields, and submitting the URLs to be followed to the Engine, from which they enter the Scheduler again;
Item Pipeline: responsible for processing the Items obtained by the Spider and for post-processing (detailed analysis, filtering, storage and the like);
Downloader Middlewares: components that can be customized to extend the download function;
Spider Middlewares: components that can be customized to extend and operate on the communication between the Engine and the Spider (for example, Responses going into the Spider and Requests coming out of the Spider).
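The division of labour among these components can be imitated with a toy loop in plain Python. The sketch below does not use Scrapy itself and leaves out the middlewares; the in-memory `FAKE_WEB` dictionary stands in for the network, and all names are invented for illustration.

```python
from collections import deque

# Stand-in for the network: URL -> (page text, links to follow). Entirely made up.
FAKE_WEB = {
    "page/1": ("atropine abstract", ["page/2"]),
    "page/2": ("paraoxon abstract", []),
}


def downloader(url):
    """Downloader: turns a request (here just a URL) into a response."""
    text, links = FAKE_WEB[url]
    return {"url": url, "text": text, "links": links}


def spider(response):
    """Spider: yields one item plus follow-up requests parsed from a response."""
    yield {"item": {"url": response["url"], "text": response["text"]}}
    for link in response["links"]:
        yield {"request": link}


def pipeline(item, store):
    """Item Pipeline: post-processes an item and stores it."""
    item["text"] = item["text"].upper()
    store.append(item)


def engine(start_urls):
    """Engine: moves requests, responses and items between the components."""
    scheduler = deque(start_urls)  # Scheduler: queue of pending requests
    seen = set(start_urls)
    store = []
    while scheduler:
        response = downloader(scheduler.popleft())
        for out in spider(response):
            if "item" in out:
                pipeline(out["item"], store)
            elif out["request"] not in seen:
                seen.add(out["request"])
                scheduler.append(out["request"])
    return store


items = engine(["page/1"])
```

In real Scrapy the same flow is asynchronous and the scheduler also handles request priorities and duplicate filtering; the loop above only shows who hands what to whom.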
Step 3: the method comprises the steps of document text data mining processing and knowledge base establishment, specifically, determining potential action relations of poison-targets by utilizing a natural language processing technology based on the document text database to form a poison-target relation knowledge base;
more specifically, based on similarity analysis, cluster analysis, topic mining, entity relation extraction and other data mining and deep learning algorithms, determining the relation of literature poison-target to form a poison-target relation knowledge base, wherein the process is as follows:
first, the crawled literature and the established literature database are processed to obtain multidimensional quantitative literature data;
then, a statistical analysis model of the poison-target relationship is established using data mining technology, the model parameters are optimized using word vectors and multi-layer neural network (deep learning) technology, and the results are verified and tested against an existing literature relation database; preferably, the invention adopts deep learning based on a bidirectional GRU and a two-layer attention mechanism, which improves precision and gives better results;
finally, based on the optimized and tested model, poison-target relationships are constructed from the quantitative literature data to form the poison-target relationship knowledge base.
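The patent names a bidirectional GRU with a two-layer attention mechanism but gives no equations. For orientation, the standard textbook formulation of a GRU cell, its bidirectional concatenation and one additive attention layer is reproduced below. The symbols (weight matrices W, U, biases b, context vector u_a) are the usual ones and are not taken from the patent, and reading "two-layer" as applying such an attention layer at two levels (for example over words and then over sentences) is an assumption.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t \\
h_t^{\mathrm{bi}} &= \left[\overrightarrow{h}_t \, ; \, \overleftarrow{h}_t\right] \\
u_t &= \tanh\left(W_a h_t^{\mathrm{bi}} + b_a\right) \\
\alpha_t &= \frac{\exp\left(u_t^{\top} u_a\right)}{\sum_{k} \exp\left(u_k^{\top} u_a\right)} \\
s &= \sum_{t} \alpha_t \, h_t^{\mathrm{bi}}
\end{aligned}
```

Here x_t is the word vector at position t, z_t and r_t are the update and reset gates, and s is the attention-weighted summary that a classifier could use to predict the relation type.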
Step 4: poison-target literature knowledge mining;
The literature text database and the poison-target relation knowledge base are used to mine poison-target literature knowledge; specifically, poison-target relations and literature text information are retrieved, and targeted literature queries are performed, based on the literature text database and the poison-target relation knowledge base, thereby mining the poison-target literature knowledge.
The invention also provides a poison-target literature knowledge mining system based on network crawling, which comprises a processor, wherein the processor is used for executing the poison-target literature knowledge mining method based on network crawling. The established comprehensive data set of poison and target information, the literature text database, the poison-target relation knowledge base, the web-based literature crawler tool, the literature data mining algorithms and the like are seamlessly integrated to build an online poison-target literature knowledge mining system that combines poison and target information retrieval, literature text retrieval and targeted literature queries, and that is extensible, cross-platform and multi-user.
The structure of the poison-target literature knowledge mining system of the present invention is shown in fig. 1.
The human-machine interaction of the system comprises two parts: a monitoring interface for the crawler and literature knowledge mining, and a retrieval interface for the poison-target knowledge base. Referring to fig. 1, the knowledge mining system uses a four-layer architecture of browser, service, intelligent engine and data layer, and includes: an interface layer, which uses a web browser and a standard MVC framework (e.g., Vue.js); a service layer, which adopts a RESTful service interface supporting standard CRUD operations (e.g., Spring Boot); an intelligent engine, developed in the Python language with the NumPy, SciPy, NLTK, spaCy and Gensim packages and the TensorFlow and Keras platforms, which implements the web crawler and data mining (deep learning); a data layer, which adopts MongoDB and Neo4j to implement the literature database and the knowledge base respectively; and a platform layer, comprising the underlying network and OS platform. The system is developed and run on the Windows platform and is easy to deploy, maintain and operate.
With continued reference to fig. 1, existing poison and target data from different sources undergo data integration, data cleaning and deduplication to establish the comprehensive data set of poisons and targets. This data set serves as the data input of the crawler engine in fig. 1, which automatically obtains relevant literature information from the Pubmed biomedical website (though it is not limited to this site) and, after preprocessing, stores it in a local structured literature database (the literature text database of poisons and targets). Word frequency analysis is first performed on the local structured literature database to obtain a high-frequency literature data set. A sentence segmenter and a tokenizer then split each abstract into a list of word tuples, which is part-of-speech tagged to obtain the part-of-speech information of all words. After processing by a chunker, named entity recognition and extraction of relations between entities are performed, finally generating the poison-target literature knowledge base (also called the poison-target relation knowledge base).
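A heavily simplified stand-in for this pipeline is sketched below using only the standard library. It keeps sentence segmentation, tokenization, entity recognition and relation extraction, but replaces part-of-speech tagging and chunking with plain dictionary lookups, so it is not the method of the patent; the lexicons and trigger verbs are hypothetical.

```python
import re

# Hypothetical lexicons; a real system would draw these from the comprehensive data set.
POISONS = {"atropine", "paraoxon"}
TARGETS = {"acetylcholinesterase", "chrm1"}
TRIGGERS = {"inhibits", "blocks", "activates", "binds"}


def split_sentences(text):
    """Sentence segmenter: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Tokenizer: lowercase alphanumeric tokens, keeping hyphens."""
    return re.findall(r"[a-z0-9\-]+", sentence.lower())


def extract_relations(abstract):
    """Return (poison, trigger, target) triples that co-occur in one sentence."""
    relations = []
    for sentence in split_sentences(abstract):
        tokens = tokenize(sentence)
        poisons = [t for t in tokens if t in POISONS]
        targets = [t for t in tokens if t in TARGETS]
        triggers = [t for t in tokens if t in TRIGGERS]
        for poison in poisons:
            for target in targets:
                for trigger in triggers:
                    relations.append((poison, trigger, target))
    return relations


text = (
    "Paraoxon inhibits acetylcholinesterase. "
    "Atropine was given later. "
    "Atropine blocks CHRM1 signalling."
)
relations = extract_relations(text)
```

Sentence-level co-occurrence of this kind over-generates when a sentence mentions several entities, which is one reason the full pipeline adds tagging, chunking and a trained relation model.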
Based on the local structured literature database and the poison-target relation knowledge base, fast full-text retrieval can be performed for poisons and targets. The invention provides retrieval over the local literature database and the poison-target relation knowledge base for different types of items, such as poisons, targets, pathways and poison-target action relations, and the retrieval and display modes differ by type.
When searching for a poison in the local structured literature database, the invention performs a full-text search for all literature containing the poison name or its aliases, and the resulting literature list is sorted by the weight of the keywords in each document. Clicking on a poison shows its basic information, such as the name, CAS number, chemical structure and molecular formula, together with links to common external poison references.
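The patent does not define how the keyword weight is calculated. As one assumed interpretation, the weight can be taken as the fraction of tokens in an abstract that match the poison name or one of its aliases; the document layout and function names below are hypothetical, and multi-word aliases are not handled.

```python
def keyword_weight(text, keywords):
    """Fraction of tokens in the text that match one of the keywords."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.strip(".,;:()") in keywords)
    return hits / len(tokens)


def search(documents, name, aliases=()):
    """Return ids of documents mentioning the name or an alias, best match first."""
    keywords = {name.lower(), *(alias.lower() for alias in aliases)}
    scored = [(keyword_weight(doc["abstract"], keywords), doc) for doc in documents]
    hits = [(weight, doc) for weight, doc in scored if weight > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [doc["id"] for _, doc in hits]


# Made-up documents for illustration.
documents = [
    {"id": "doc-b", "abstract": "The effect of DL-hyoscyamine on heart rate was measured in rats."},
    {"id": "doc-a", "abstract": "Atropine and atropine analogues were compared."},
    {"id": "doc-c", "abstract": "Lead exposure in children."},
]
ranked = search(documents, "atropine", aliases=["DL-hyoscyamine"])
```

A production system would more likely rely on the full-text index of the database, but the ordering principle is the same.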
When searching for a target in the local structured literature database, the invention performs a full-text search for all literature containing the target name, DNA name or receptor name. The resulting literature is likewise sorted by the weight of the target, and clicking on a target shows the related target name, Uniprot sequence number, gene name, molecular weight, aliases, gene sequence, protein ID and links to common external target resources.
When a poison or target is searched in the poison-target relation knowledge base, the system returns the potential action relations, obtained by data mining over massive online biomedical literature, between that poison or target and other target proteins, receptors or human target organs, together with links to the related literature so that the detailed literature information can be viewed.
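Retrieval from the relation knowledge base can be pictured with a small in-memory index, used here as a stand-in for the graph database. The class name, method names and edge layout are assumptions of this sketch.

```python
from collections import defaultdict


class RelationKB:
    """Tiny in-memory poison-target relation store (illustrative only)."""

    def __init__(self):
        self._by_poison = defaultdict(list)
        self._by_target = defaultdict(list)

    def add(self, poison, relation, target, doc_id):
        """Store one relation together with the document that supports it."""
        edge = {"poison": poison, "relation": relation, "target": target, "doc_id": doc_id}
        self._by_poison[poison.lower()].append(edge)
        self._by_target[target.lower()].append(edge)

    def targets_of(self, poison):
        """All relations recorded for a poison, with literature links."""
        return list(self._by_poison.get(poison.lower(), []))

    def poisons_for(self, target):
        """All relations recorded for a target, with literature links."""
        return list(self._by_target.get(target.lower(), []))


kb = RelationKB()
# The document id is a made-up placeholder.
kb.add("Paraoxon", "inhibits", "acetylcholinesterase", "doc-1")
hits = kb.targets_of("paraoxon")
```

Indexing each relation under both the poison and the target lets either kind of query return the relation and its supporting document in one lookup.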
The present invention has been described in detail with reference to the accompanying drawings. The present invention should be clearly recognized by those skilled in the art in light of the above description.
It should be noted that, in the drawings or the text of the specification, implementations not shown or described are all forms known to those of ordinary skill in the art, and not described in detail. Furthermore, the above definitions of the elements are not limited to the specific structures, shapes or modes mentioned in the embodiments, and may be simply modified or replaced by those of ordinary skill in the art.
Of course, according to actual needs, the present invention may also include other parts, and since the parts are irrelevant to the innovations of the present invention, the details are not repeated here.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Furthermore, in the drawings and the description, like or identical parts are given the same reference numerals. Where no conflict arises, features of the embodiments illustrated in the description may be freely combined to form new solutions. In addition, each claim may stand alone as an embodiment, or features of the claims may be combined to form a new embodiment. In the drawings, the shape or thickness of the embodiments may be enlarged, and may be labeled in a simplified or convenient manner. Elements or implementations not shown or described in the drawings are of a form known to those of ordinary skill in the art. Additionally, although examples of parameters with particular values may be provided herein, the parameters need not exactly equal those values and may approximate them within acceptable error margins or design constraints.
The various embodiments of the invention described above may be freely combined to form further embodiments, provided that there is no technical obstacle or contradiction, and all such embodiments fall within the scope of the invention.
Although the present invention has been described with reference to the accompanying drawings, the examples disclosed in the drawings are intended to illustrate preferred embodiments of the invention and are not to be construed as limiting the invention. The dimensional proportions in the drawings are illustrative only and should not be construed as limiting the invention.
Although a few embodiments of the present general inventive concept have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles and spirit of the general inventive concept, the scope of which is defined in the claims and their equivalents.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (9)

1. A method for mining poison-target literature knowledge based on network crawling, characterized by comprising the following steps:
acquiring and processing poison and target data information to establish a comprehensive data set;
developing a web crawler tool;
based on the comprehensive data set, crawling poison and target literature text information with the web crawler tool and processing that information to establish a literature text database;
determining, based on the literature text database, potential poison-target action relationships by natural language processing to form a poison-target relationship knowledge base;
performing retrieval and queries on poison-target relationships and literature text information using the literature text database and the poison-target relationship knowledge base, thereby mining poison-target literature knowledge;
wherein determining the potential poison-target action relationships by natural language processing based on the literature text database to form the poison-target relationship knowledge base comprises:
processing the crawled literature and the data in the established literature text database by similarity analysis, cluster analysis and topic mining to obtain multidimensional quantitative literature data;
establishing, by a data mining technique based on entity relation extraction, a statistical analysis model for determining the relationship between poison and target;
optimizing the model parameters using word vectors and a multi-layer neural network based on a deep learning algorithm;
determining, based on the multidimensional quantitative literature data, the poison-target relationships in the literature using the optimized statistical analysis model, and forming the poison-target relationship knowledge base.
2. The method of claim 1, wherein acquiring and processing poison and target data information to establish a comprehensive data set comprises:
acquiring known poison and target data information;
and performing de-duplication, data filtering and standardization on the known poison and target data information to establish a comprehensive data set of poisons and targets.
3. The method of claim 1, wherein the comprehensive data set includes poison information and target information; the poison information comprises basic information including the poison name, CAS number, chemical structure and molecular formula, and the target information comprises the target name and UniProt sequence number.
4. The method of claim 1, wherein developing a web crawler tool comprises: developing the web crawler tool based on the Python language and the Scrapy framework.
5. The method of claim 1, wherein crawling poison and target literature text information with the web crawler tool based on the comprehensive data set and processing that information to establish a literature text database comprises:
based on the poison names and the target names in the comprehensive data set, automatically crawling poison and target literature text information from the literature website PubMed with the web crawler tool;
and cleaning the poison and target literature text information, and establishing a literature text database of poisons and targets from the cleaned literature text information.
6. The method of claim 1, wherein determining potential poison-target action relationships by natural language processing based on the literature text database to form a poison-target relationship knowledge base comprises: determining the poison-target relationships in the literature based on similarity analysis, cluster analysis, topic mining, entity relation extraction and a deep learning algorithm, and forming the poison-target relationship knowledge base.
7. The method of claim 1, wherein using the literature text database and the poison-target relationship knowledge base for poison-target literature knowledge mining comprises:
performing retrieval and targeted literature queries using the literature text database and the poison-target relationship knowledge base, thereby mining poison-target literature knowledge.
8. The method of claim 7, wherein performing retrieval and targeted literature queries using the literature text database and the poison-target relationship knowledge base to mine poison-target literature knowledge comprises:
retrieving a poison in the literature text database to obtain basic information on the poison, including its name, CAS number, chemical structure and molecular formula;
retrieving a target in the literature text database to obtain information on the target, including its name, DNA name, receptor name, UniProt sequence number, molecular weight, alias, gene sequence and protein ID;
and performing poison and target retrieval in the poison-target relationship knowledge base to obtain the potential action relationships of the poison and the target with target proteins, receptors or human target organs.
9. A web-crawling-based poison-target literature knowledge mining system, comprising a processor for performing the method of any one of claims 1 to 8.
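For readers who want a concrete picture of two of the claimed steps, the following Python sketch shows (a) the de-duplication, filtering and standardization of claim 2 and (b) a very simple stand-in for relationship determination. It is an assumption-laden illustration and not the claimed method: plain co-occurrence counting over abstracts replaces the statistical analysis model, word vectors and multi-layer neural network recited in claim 1, and the function names `build_dataset` and `cooccurrence`, the entity names and the sample abstracts are all hypothetical.

```python
import re
from collections import Counter


def build_dataset(records):
    """De-duplicate, filter and standardize raw name records (cf. claim 2)."""
    seen, clean = set(), []
    for name in records:
        # Standardize: collapse whitespace, trim, lower-case.
        norm = re.sub(r"\s+", " ", name).strip().lower()
        if not norm:        # filter out empty entries
            continue
        if norm in seen:    # drop duplicates
            continue
        seen.add(norm)
        clean.append(norm)
    return clean


def cooccurrence(abstracts, poisons, targets):
    """Count abstracts that mention both a poison and a target.

    Swapped-in simplification: crude substring matching and counting,
    used here only to show where candidate poison-target pairs come from.
    """
    counts = Counter()
    for text in abstracts:
        low = text.lower()
        for p in poisons:
            if p not in low:
                continue
            for t in targets:
                if t in low:
                    counts[(p, t)] += 1
    return counts


# Invented sample data for illustration only.
poisons = build_dataset(["Poison-A", "  poison-a ", "", "Poison-B"])
targets = build_dataset(["Receptor-X", "receptor-x"])
abstracts = [
    "Poison-A binds Receptor-X in vitro.",
    "Poison-A was also observed together with receptor-X.",
    "Poison-B shows no binding.",
]
counts = cooccurrence(abstracts, poisons, targets)
print(counts)
```

Pairs with a non-zero count could then be passed to a more capable relation-extraction model; the counting step only narrows down the candidates.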
CN202010654561.9A 2020-07-08 2020-07-08 Method and system for mining poison-target literature knowledge based on network crawling Active CN111797296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010654561.9A CN111797296B (en) 2020-07-08 2020-07-08 Method and system for mining poison-target literature knowledge based on network crawling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010654561.9A CN111797296B (en) 2020-07-08 2020-07-08 Method and system for mining poison-target literature knowledge based on network crawling

Publications (2)

Publication Number Publication Date
CN111797296A CN111797296A (en) 2020-10-20
CN111797296B true CN111797296B (en) 2024-04-09

Family

ID=72811357

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010654561.9A Active CN111797296B (en) 2020-07-08 2020-07-08 Method and system for mining poison-target literature knowledge based on network crawling

Country Status (1)

Country Link
CN (1) CN111797296B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114927168B (en) * 2022-05-31 2023-08-29 四川大学 Construction method of biomechanical regulation and control bone reconstruction text mining interaction website
CN114996465A (en) * 2022-08-01 2022-09-02 中国传媒大学 Information propagation dynamics document classification knowledge base establishing method, system and equipment
CN115827948B (en) * 2023-02-09 2023-05-02 中南大学 Single-reflection intelligent agent for crawling literature data and literature data crawling method

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521337A (en) * 2011-12-08 2012-06-27 华中科技大学 Academic community system based on massive knowledge network
CN104572709A (en) * 2013-10-18 2015-04-29 北京中海纪元数字技术发展股份有限公司 Data mining system used for enterprise innovation system
JP2016192198A (en) * 2015-03-30 2016-11-10 国立研究開発法人情報通信研究機構 Argument-sharing discriminator learning apparatus, language knowledge collecting device, and anaphor/abbreviation analyzer
CN106156286A (en) * 2016-06-24 2016-11-23 广东工业大学 Type extraction system and method towards technical literature knowledge entity
CN106156335A (en) * 2016-07-07 2016-11-23 苏州大学 A kind of discovery and arrangement method and system of teaching material knowledge point
CN106649272A (en) * 2016-12-23 2017-05-10 东北大学 Named entity recognizing method based on mixed model
CN108984761A (en) * 2018-07-19 2018-12-11 南昌工程学院 A kind of information processing system driven based on model and domain knowledge
CN110309393A (en) * 2019-03-28 2019-10-08 平安科技(深圳)有限公司 Data processing method, device, equipment and readable storage medium storing program for executing
CN110334220A (en) * 2019-07-15 2019-10-15 中国人民解放军战略支援部队航天工程大学 A kind of knowledge mapping construction method based on multi-data source
CN110347894A (en) * 2019-05-31 2019-10-18 平安科技(深圳)有限公司 Knowledge mapping processing method, device, computer equipment and storage medium based on crawler
CN110347844A (en) * 2019-07-15 2019-10-18 中国人民解放军战略支援部队航天工程大学 A kind of space object knowledge map construction system
CN110489395A (en) * 2019-07-27 2019-11-22 西南电子技术研究所(中国电子科技集团公司第十研究所) Automatically the method for multi-source heterogeneous data knowledge is obtained
CN110929165A (en) * 2019-12-17 2020-03-27 云南大学 JAVA Doc knowledge graph-based multidimensional evaluation recommendation method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030220860A1 (en) * 2002-05-24 2003-11-27 Hewlett-Packard Development Company,L.P. Knowledge discovery through an analytic learning cycle

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Data-Driven Statistical Approach for Monitoring and Analysis of Large Industrial Processes; A. Montazeri et al.; 《IFAC-PapersOnLine》; 2354-2359 *
Research progress on entity relation extraction based on deep learning frameworks; 李枫林 et al.; 《情报科学》; 169-176 *

Also Published As

Publication number Publication date
CN111797296A (en) 2020-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant