CN111797296B - Method and system for mining poison-target literature knowledge based on network crawling

Info

Publication number: CN111797296B
Application number: CN202010654561.9A
Authority: CN
Inventors: 周文霞; 韩露; 张永祥; 肖智勇; 黄晏; 刘港; 高圣乔; 罗丹
Original assignee: Academy of Military Medical Sciences AMMS of PLA
Current assignee: Academy of Military Medical Sciences AMMS of PLA
Priority date: 2020-07-08
Filing date: 2020-07-08
Publication date: 2024-04-09
Anticipated expiration: 2040-07-08
Also published as: CN111797296A

Abstract

The invention provides a poison-target literature knowledge mining method and system based on network crawling. The method comprises the following steps: acquiring and processing poison and target data information to establish a comprehensive data set; developing a web crawler tool; based on the comprehensive data set, crawling poison and target document text information with the web crawler tool and processing that information to establish a document text database; based on the document text database, determining potential poison-target action relations using natural language processing technology to form a poison-target relation knowledge base; and carrying out poison-target literature knowledge mining using the document text database and the poison-target relation knowledge base. The method and system are efficient, accurate and highly intelligent.

Description

Method and system for mining poison-target literature knowledge based on network crawling

Technical Field

The invention relates to the technical field of Internet, in particular to a poison-target literature knowledge mining method and system based on network crawling.

Background

With the rapid development of toxicology, molecular biology and related disciplines, a large number of data sets on poisons and targets have appeared on the Internet. At present, however, these data sets are stored in scattered resources and heterogeneous formats, a large amount of redundant information is repeated across different poison and target data sets, poison names and aliases are inconsistent, and a unified naming standard is lacking. Although these data sets provide a reference for poison researchers, the lack of uniform data standards and of the necessary data filtering and quality control mechanisms leads to inefficient manual search and query, excessive redundant information and difficult knowledge discovery. It is therefore particularly important to integrate the existing poison and target data sets in an orderly and efficient way, and to build a comprehensive poison-target data set that covers the currently known data sets.

Existing document retrieval systems for poisons and targets perform information retrieval and fuzzy matching in a back-end document database based on keywords entered by the user, such as a poison name, a target name or a combination of the two, find documents with high similarity and return them to the user. This retrieval mode remains at the level of static content search and matching, so knowledge hidden in the documents is difficult to obtain and knowledge mining from massive biomedical literature is even more difficult, which seriously affects the efficiency of poison research and development and of scientific researchers.

Therefore, there is an urgent need for a poison-target literature knowledge mining method and system that is efficient, accurate and intelligent.

Disclosure of Invention

(I) technical problem to be solved

In view of the foregoing, it is a primary object of the present invention to provide a poison-target literature knowledge mining method and system based on web crawling, with the aim of at least partially solving at least one of the above-mentioned technical problems.

(II) technical scheme

According to one aspect of the present invention, there is provided a poison-target literature knowledge mining method based on web crawling, including:

acquiring and processing poison and target data information to establish a comprehensive data set;

developing a web crawler tool;

based on the comprehensive data set, crawling poison and target document text information by utilizing the web crawler tool and processing the information to establish a document text database;

determining potential action relations of the poison-target by utilizing a natural language processing technology based on the document text database to form a knowledge base of the poison-target relation;

and carrying out poison-target literature knowledge mining by utilizing the literature text database and the poison-target relation knowledge base.

Further, acquiring and processing poison and target data information to create a comprehensive data set, comprising:

acquiring known poison and target data information;

and carrying out information deduplication, data filtering and standard processing on the known poison and target data information, and establishing a comprehensive data set of the poison and the target.

Further, the integrated dataset includes information of poisons and information of targets; the information of the poison comprises basic information of a poison name, a CAS number, a chemical structure and a molecular formula, and the information of the target comprises information of a target name and a Uniprot sequence number.

Further, developing a web crawler tool comprises: developing the web crawler tool based on the Python language and the Scrapy framework.

Further, based on the integrated data set, crawling poison and target document text information with the web crawler tool and processing to build a document text database, comprising:

based on the poison names and the target names in the comprehensive data set, automatically crawling poison and target document text information from the literature website PubMed by utilizing the web crawler tool;

and cleaning the data of the poison and target document text information, and establishing a document text database of the poison and target by using the cleaned document text information.

Further, based on the document text database, determining a potential action relationship of the poison-target by using a natural language processing technology to form a knowledge base of the poison-target relationship, comprising: determining the relation of the literature poison-target based on similarity analysis, cluster analysis, topic mining, entity relation extraction and deep learning algorithm, and forming a knowledge base of the poison-target relation.

Further, determining a literature poison-target relationship based on similarity analysis, cluster analysis, topic mining, entity relationship extraction and a deep learning algorithm, and forming a poison-target relationship knowledge base, comprising:

processing the data in the crawled literature and the built literature text database based on similarity analysis, cluster analysis and topic mining to obtain multidimensional literature quantitative data;

establishing, based on entity relation extraction, a statistical analysis model for determining the relationship between the poison and the target by adopting a data mining technology;

optimizing the parameters of the statistical analysis model by adopting word vectors and a multi-layer neural network technology based on a deep learning algorithm;

based on multidimensional literature quantitative data, determining a literature poison-target relationship by using an optimized statistical analysis model, and forming a poison-target relationship knowledge base.

Further, the method for mining the poison-target literature knowledge by using the literature text database and the poison-target relation knowledge base comprises the following steps:

performing retrieval and directed literature queries by using the literature text database and the poison-target relation knowledge base, so as to mine the poison-target literature knowledge.

Further, performing retrieval and directed literature queries by using the literature text database and the poison-target relation knowledge base, so as to mine the poison-target literature knowledge, comprises the following steps:

retrieving poison from the literature text database to obtain the name, CAS number, chemical structure and molecular formula basic information of the poison;

retrieving the target in the document text database to obtain the name, DNA name, receptor name, uniprot sequence number, molecular weight, alias, gene sequence and protein ID information of the target;

and obtaining potential action relations of the poison and the target on target proteins, receptors or human target organs by carrying out poison and target retrieval in a poison-target relation knowledge base.

According to another aspect of the present invention, there is provided a web-crawling-based poison-target document knowledge mining system comprising a processor for performing the method.

(III) beneficial effects

According to the technical scheme, the poison-target literature knowledge mining method and system based on network crawling have at least one of the following beneficial effects:

(1) The invention processes the known poison and target data based on the known poison and target data to establish the comprehensive data set, solves the problems of repeated redundancy, disordered information and irregular naming of the data in the existing data set, and is beneficial to improving the mining efficiency and accuracy.

(2) According to the invention, the automatic crawling of the massive biomedical documents is performed based on the poison target information in the comprehensive data set, the document text database is established, the hidden knowledge in the documents can be obtained, and the knowledge mining from the massive biomedical documents is facilitated.

(3) The method is based on machine learning and deep learning algorithms to mine potential action relations of poison targets from mass documents, and is beneficial to improving the intelligent degree and effectiveness of mining.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:

FIG. 1 is a schematic diagram of the poison-target literature knowledge mining system of the present invention.

Detailed Description

The present invention will be further described in detail below with reference to specific embodiments and with reference to the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent.

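The invention provides a poison-target literature knowledge mining method based on network crawling, which comprises the following steps:

acquiring and processing poison and target data information to establish a comprehensive data set;

developing a web crawler tool;

based on the comprehensive data set, crawling poison and target document text information by utilizing the web crawler tool and processing the information to establish a document text database;

determining potential action relations of the poison-target by utilizing a natural language processing technology based on the document text database to form a knowledge base of the poison-target relation;

and carrying out poison-target literature knowledge mining by utilizing the literature text database and the poison-target relation knowledge base.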

The acquired poison and target data information is the raw, known poison and target data; the poison and target data in existing databases suffer from repetition, redundancy, confusion, non-standard naming and similar problems. The comprehensive data set is the data set formed by processing this raw data, where processing means de-duplication, filtering, standardization, unification and the like. By building the comprehensive data set from the known poison and target data, the invention solves the problems of repeated and redundant data, disordered information and irregular naming in the existing data sets, which helps improve mining efficiency and accuracy.

Specifically, acquiring and processing poison and target data information to create a comprehensive data set, comprising: acquiring known poison and target data information; and performing information deduplication, data filtering and standard processing on the known poison and target data information, and establishing a comprehensive data set of the poison and the target.

Developing a web crawler tool comprises: developing the web crawler tool based on the Python language and the Scrapy framework.

Crawling poison and target document text information with the web crawler tool based on the comprehensive data set, and processing it to build a document text database, comprises: based on the poison names and the target names in the comprehensive data set, automatically crawling poison and target document text information from the literature website PubMed by utilizing the web crawler tool; and cleaning the poison and target document text information, and establishing a document text database of the poisons and targets from the cleaned document text information.

Determining potential action relations of the poison-target by utilizing a natural language processing technology based on the document text database to form a knowledge base of the poison-target relation, wherein the knowledge base comprises the following steps: determining the relation of the literature poison-target based on similarity analysis, cluster analysis, topic mining, entity relation extraction and deep learning algorithm, and forming a knowledge base of the poison-target relation. The method is based on machine learning and deep learning algorithms to mine potential action relations of poison targets from mass documents, and is beneficial to improving the intelligent degree and effectiveness of mining.

And carrying out poison-target literature knowledge mining by using the literature text database and the poison-target relation knowledge base, wherein the method comprises the following steps of: retrieving poison from the literature text database to obtain the name, CAS number, chemical structure and molecular formula basic information of the poison; retrieving the target in the document text database to obtain the name, DNA name, receptor name, uniprot sequence number, molecular weight, alias, gene sequence and protein ID information of the target; and obtaining potential action relations of the poison and the target on target proteins, receptors or human target organs by carrying out poison and target retrieval in a poison-target relation knowledge base.

Compared with the method for directly acquiring the poison-target literature knowledge through the scattered data resources on the existing network, the method for acquiring the poison-target literature knowledge processes the known poison and target data information to form a comprehensive data set, builds a literature text database and a poison-target relation knowledge base, and performs poison and/or target retrieval in the literature text database and the poison-target relation knowledge base, so that the efficiency is higher and the accuracy is better.

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The poison-target literature knowledge mining method based on network crawling in the embodiment comprises the following steps:

step 1: acquiring poison and target data information and establishing a comprehensive data set;

Specifically, the data information related to known poisons and targets is collected, organized and mined, and a comprehensive data set containing the basic information of the poisons and targets is established. More specifically, various existing databases are searched and queried, including but not limited to the TOXNET database, the DrugBank database, the TDD database, the NRDB drug protein database, the drug-target interaction information database and SuperTarget. The data information related to all currently known poisons and targets is collected, organized and mined, and then processed by information deduplication, data filtering and the like to establish a comprehensive data set of the basic information of the poisons and targets. The comprehensive data set comprises poison information and target information: the poison information comprises basic information such as the name, CAS number, chemical structure and molecular formula, and the target information comprises basic information such as the target name and Uniprot sequence number. The comprehensive data set provides an accurate and effective information basis for the crawling and mining of document text information in the next step.
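The deduplication, filtering and standardization of Step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout, the choice of the CAS number as the merge key, the CAS check-digit filter and the sample rows are all assumptions introduced here.

```python
import re
from dataclasses import dataclass, field

# Hypothetical record layout for one poison in the comprehensive data set.
@dataclass
class PoisonRecord:
    name: str
    cas: str
    formula: str = ""
    aliases: set = field(default_factory=set)

CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")

def is_valid_cas(cas: str) -> bool:
    """Data filtering: check the CAS number format and its check digit."""
    match = CAS_PATTERN.match(cas.strip())
    if not match:
        return False
    # Weight the digits 1, 2, 3, ... from right to left; the sum mod 10
    # must equal the final check digit.
    digits = (match.group(1) + match.group(2))[::-1]
    total = sum((i + 1) * int(d) for i, d in enumerate(digits))
    return total % 10 == int(match.group(3))

def normalize_name(name: str) -> str:
    """Standardization: collapse whitespace and lowercase the name."""
    return " ".join(name.split()).lower()

def build_dataset(raw_rows):
    """Deduplication: merge rows that share a CAS number (assumed merge key)."""
    merged = {}
    for row in raw_rows:
        cas = row["cas"].strip()
        if not is_valid_cas(cas):
            continue  # drop rows that fail the filter
        name = normalize_name(row["name"])
        record = merged.get(cas)
        if record is None:
            merged[cas] = PoisonRecord(name=name, cas=cas,
                                       formula=row.get("formula", ""))
        elif name != record.name:
            record.aliases.add(name)  # keep differing names as aliases
    return merged

# Illustrative rows, as if collected from several source databases.
RAW_ROWS = [
    {"name": "Arsenic trioxide", "cas": "1327-53-3", "formula": "As2O3"},
    {"name": "  arsenic   trioxide ", "cas": "1327-53-3"},
    {"name": "Arsenous oxide", "cas": "1327-53-3"},
    {"name": "Broken entry", "cas": "1327-53-9"},  # wrong check digit
]
dataset = build_dataset(RAW_ROWS)
```

Keying on the CAS number lets differently spelled names collapse into one record while the variants are kept as aliases, which is useful later when the crawler searches by name.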

Step 2: developing the web crawler tool, crawling poison-target documents, preprocessing the text information and establishing the document text database;

Specifically, a web crawler tool is developed, and the poison and target names in the comprehensive data set are used to automatically crawl literature text information related to the poisons and targets from the literature website PubMed; all crawled literature texts are initially sorted and collected, and a literature text database is established. More specifically, automatic crawling of literature text for all poisons and targets on the PubMed website can be realized by developing a literature collection site management module, a literature collection template management module, a literature collection module and a collection monitoring module, thereby obtaining a literature text data set of the poisons and targets. Data cleaning processes such as data extraction, transformation and loading are then carried out on this data set to realize structured storage and preprocessing of the literature text data, and the literature text database of the poisons and targets is established.

Preferably, the web crawler tool is developed based on the Python language and the Scrapy framework, with specific functions developed for specific websites. The functions of the modules of the Scrapy-based web crawler tool are as follows:

Engine (Scrapy Engine): responsible for communication, signals and data transfer among the Spider, Item Pipeline, Downloader and Scheduler;

Scheduler: responsible for receiving the Requests sent by the engine, arranging and enqueuing them in a certain order, and returning them to the engine when the engine needs them;

Downloader: responsible for downloading all Requests sent by the Scrapy Engine and returning the acquired Responses to the Scrapy Engine, which hands them over to the Spider for processing;

Spider (crawler): responsible for processing all Responses, parsing and extracting data from them to obtain the data required by the Item fields, and submitting the URLs that need to be followed to the engine, from which they enter the Scheduler again;

Item Pipeline: responsible for processing the Items obtained from the Spider and performing post-processing (detailed analysis, filtering, storage and the like);

Downloader Middlewares (download middleware): components that can be customized to extend the download function;

Spider Middlewares (Spider middleware): components that can be customized to extend and operate on the communication between the engine and the Spider (e.g., Responses entering the Spider and Requests leaving the Spider).
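The division of labour among these components can be illustrated with a toy, standard-library simulation. It does not use Scrapy's real API; the class and function names, the in-memory "web" and the page contents are all hypothetical and only mirror the roles described above.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, links to follow).
FAKE_WEB = {
    "page/1": ("arsenic trioxide degrades PML ", ["page/2"]),
    "page/2": ("paraquat activates NRF2", ["page/1"]),
}

class Scheduler:
    """Queues requests from the engine and drops duplicates."""
    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_request(self):
        return self.queue.popleft() if self.queue else None

def downloader(url):
    """Fetches a request and returns a response (here: a dict lookup)."""
    text, links = FAKE_WEB[url]
    return {"url": url, "text": text, "links": links}

def spider(response):
    """Parses a response into one item plus follow-up requests."""
    yield {"item": {"url": response["url"], "abstract": response["text"]}}
    for link in response["links"]:
        yield {"request": link}

class ItemPipeline:
    """Post-processes and stores items produced by the spider."""
    def __init__(self):
        self.stored = []

    def process(self, item):
        item["abstract"] = item["abstract"].strip()  # simple cleaning step
        self.stored.append(item)

def engine(start_urls):
    """Routes data between scheduler, downloader, spider and pipeline."""
    scheduler, pipeline = Scheduler(), ItemPipeline()
    for url in start_urls:
        scheduler.enqueue(url)
    while (url := scheduler.next_request()) is not None:
        response = downloader(url)
        for output in spider(response):
            if "item" in output:
                pipeline.process(output["item"])
            else:
                scheduler.enqueue(output["request"])
    return pipeline.stored

items = engine(["page/1"])
```

The two pages link to each other, so the duplicate check in the scheduler is what stops the loop; the middleware layers are omitted from this sketch.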

Step 3: mining the document text data and establishing the knowledge base; specifically, determining potential poison-target action relations by utilizing a natural language processing technology based on the document text database to form a poison-target relation knowledge base;

More specifically, the literature poison-target relation is determined based on data mining and deep learning algorithms such as similarity analysis, cluster analysis, topic mining and entity relation extraction, forming a poison-target relation knowledge base. The process is as follows:

first, the crawled literature and the data in the established literature library are processed to obtain multidimensional quantitative literature data;

next, a statistical analysis model of the relation between the poison and the target is established by adopting a data mining technology, the model parameters are optimized by adopting word vectors and a multi-layer neural network (deep learning) technology, and the results are verified and tested against an existing literature relation database; preferably, the invention adopts deep learning based on a bidirectional GRU and a double-layer Attention mechanism, which improves the precision and gives a better effect;

finally, based on the optimized and tested model, the poison-target relation is constructed from the quantitative literature data, forming the poison-target relation knowledge base.
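The patent names a bidirectional GRU with a double-layer Attention mechanism but gives no equations. One common reading, stated here as an assumption, is a GRU encoder run in both directions with word-level attention inside each sentence and sentence-level attention across the sentences that mention the same poison-target pair. All symbols below are introduced for illustration: $x_t$ is the word vector at position $t$, $h_t$ the hidden state, $s_i$ the vector of sentence $i$, and $W$, $U$, $b$, $u_w$, $u_s$ are learned parameters.

```latex
% GRU cell (applied once left-to-right and once right-to-left)
z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t = \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) \\
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

% Bidirectional state: concatenation of the two directions
h_t = [\overrightarrow{h}_t \,;\, \overleftarrow{h}_t]

% Layer 1: word-level attention yields the sentence vector s
u_t = \tanh(W_w h_t + b_w), \qquad
\alpha_t = \frac{\exp(u_t^{\top} u_w)}{\sum_k \exp(u_k^{\top} u_w)}, \qquad
s = \sum_t \alpha_t h_t

% Layer 2: sentence-level attention over sentences s_i of one poison-target pair
v_i = \tanh(W_s s_i + b_s), \qquad
\beta_i = \frac{\exp(v_i^{\top} u_s)}{\sum_j \exp(v_j^{\top} u_s)}, \qquad
d = \sum_i \beta_i s_i

% Relation prediction
p = \operatorname{softmax}(W_o d + b_o)
```

Under this reading, the word-level weights $\alpha_t$ highlight the words that signal an interaction, and the sentence-level weights $\beta_i$ down-weight sentences that merely mention both entities without relating them.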

Step 4: poison-target literature knowledge mining;

The poison-target literature knowledge mining is carried out by utilizing the literature text database and the poison-target relation knowledge base; specifically, retrieval of poison-target relations and document text information and directed literature queries are performed based on the document text database and the poison-target relation knowledge base, so as to mine the poison-target literature knowledge.

The invention also provides a poison-target literature knowledge mining system based on network crawling, which comprises a processor configured to execute the poison-target literature knowledge mining method based on network crawling. The established comprehensive data set of poison and target information, the literature text database, the poison-target relation knowledge base, the web-based literature crawler tool, the literature data mining algorithms and the like are seamlessly integrated, so that an online poison-target literature knowledge mining system integrating poison and target information retrieval, literature text information retrieval and directed literature queries is established, which is extensible, cross-platform and multi-user.

The structure of the poison-target literature knowledge mining system of the present invention is shown in fig. 1.

The man-machine interaction of the system comprises two parts: a crawler and literature knowledge mining monitoring interface, and a poison-target knowledge base retrieval interface. Referring to fig. 1, the knowledge mining system has a four-layer architecture of browser, service, intelligent engine and data layer, running on a platform layer:

interface layer: a web browser, using a standard MVC framework (e.g., Vue.js);

service layer: RESTful service interfaces supporting standard CRUD operations (e.g., Spring Boot);

intelligent engine: developed in the Python language with the NumPy, SciPy, NLTK, spaCy and gensim packages and the TensorFlow and Keras platforms, implementing the web crawler and the data mining (deep learning);

data layer: MongoDB and Neo4j, implementing the document library and the knowledge base respectively;

platform layer: the basic network and OS platform; the system is developed and run on the Windows platform and is easy to deploy, maintain and operate.

With continued reference to fig. 1, the existing poison and target data from different sources undergo data integration, cleaning and deduplication, after which the comprehensive data set of poisons and targets is established. The comprehensive data set serves as the data input of the crawler engine in fig. 1. According to this input, the crawler engine automatically obtains relevant literature information from the PubMed biomedical website (though it is of course not limited to this site) and, after preprocessing the literature data, stores it in a local structured document library (the literature text database of poisons and targets). Word frequency analysis is first carried out on the local structured document library to obtain a high-frequency document data set. A sentence segmenter and a word tokenizer then split the document abstracts into a list of word tuples, which is part-of-speech tagged to obtain the part-of-speech information of all the words. After processing by a chunker, named entity recognition and extraction of the relations between entities are carried out, and finally the poison-target document knowledge base (also called the poison-target relation knowledge base) is generated.
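The path from abstracts to candidate poison-target relations can be sketched with the standard library alone. This is a deliberately simplified stand-in: regular expressions replace the sentence segmenter and tokenizer, a fixed lexicon lookup replaces part-of-speech tagging, chunking and named entity recognition, and sentence-level co-occurrence counting replaces the trained relation extractor. The lexicons and abstracts are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical entity lexicons, as if taken from the comprehensive data set.
POISONS = {"arsenic trioxide", "paraquat"}
TARGETS = {"pml", "nrf2"}

def split_sentences(text):
    """Naive sentence segmenter: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Naive tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def find_entities(sentence, lexicon):
    """Lexicon lookup standing in for named entity recognition."""
    padded = " " + " ".join(tokenize(sentence)) + " "
    return {term for term in lexicon if f" {term} " in padded}

def extract_relations(abstracts):
    """Count sentences in which a poison and a target co-occur."""
    counts = Counter()
    for text in abstracts:
        for sentence in split_sentences(text):
            for poison in find_entities(sentence, POISONS):
                for target in find_entities(sentence, TARGETS):
                    counts[(poison, target)] += 1
    return counts

# Illustrative abstracts only.
ABSTRACTS = [
    "Arsenic trioxide induces degradation of PML. Paraquat was not studied.",
    "Paraquat exposure activates NRF2 signalling. "
    "Arsenic trioxide also binds PML directly.",
]
relations = extract_relations(ABSTRACTS)
```

Co-occurrence counts only suggest candidate relations; deciding whether a sentence actually asserts an interaction is the job of the relation extraction model described in Step 3.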

Based on the local structured document library and the poison-target relation knowledge base, fast full-text retrieval can be performed on poisons and targets. The invention provides retrieval of the local document database and of the poison-target relation knowledge base for different types of query, such as poisons, targets, pathways and poison-target action relations, and the retrieval and display modes differ between types.

When a poison is searched for in the local structured document library, the invention performs a full-text search for all document information containing the poison or its aliases, and the displayed list of documents is sorted by the weight of the keywords in each document. Clicking on a poison shows its basic information, such as name, CAS number, chemical structure and molecular formula, together with commonly used external reference links for the poison.
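Sorting results by keyword weight can be sketched as follows. The patent does not define the weight; this sketch assumes it is the fraction of a document's tokens covered by the poison name or its aliases. The function names and sample documents are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def count_phrase(tokens, phrase_tokens):
    """Number of positions where the phrase occurs in the token list."""
    n = len(phrase_tokens)
    return sum(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))

def keyword_weight(text, keywords):
    """Assumed weight: share of tokens that belong to a keyword match."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    matched = 0
    for keyword in keywords:
        phrase = tokenize(keyword)
        matched += count_phrase(tokens, phrase) * len(phrase)
    return matched / len(tokens)

def rank_documents(documents, keywords):
    """Document IDs sorted by descending keyword weight."""
    return sorted(documents,
                  key=lambda doc_id: keyword_weight(documents[doc_id], keywords),
                  reverse=True)

# Illustrative documents; "gramoxone" is treated as an alias of paraquat.
DOCS = {
    "doc-a": "Paraquat toxicity in lung tissue",
    "doc-b": "Paraquat and gramoxone poisoning: paraquat levels",
    "doc-c": "Arsenic trioxide in leukemia",
}
ranking = rank_documents(DOCS, ["paraquat", "gramoxone"])
```

A production system would more likely rely on the scoring of its full-text index; the proportional weight here is only the simplest measure consistent with the description.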

When a target is searched for in the local structured document library, the invention performs a full-text search for all document information containing the target name, DNA name or receptor name, and the displayed document information is likewise sorted by the weight of the target. Clicking on a target shows the related target name, Uniprot sequence number, gene name, molecular weight, aliases, gene sequence, protein ID and commonly used external target links.

When a poison or target is searched for in the poison-target relation knowledge base, the potential action relations of that poison or target with other target proteins, receptors or human target organs, obtained by data mining of the massive biomedical literature on the network, are presented together with links to the related literature so that the detailed literature information can be checked.

The present invention has been described in detail with reference to the accompanying drawings. The present invention should be clearly recognized by those skilled in the art in light of the above description.

It should be noted that, in the drawings or the text of the specification, implementations not shown or described are all forms known to those of ordinary skill in the art, and not described in detail. Furthermore, the above definitions of the elements are not limited to the specific structures, shapes or modes mentioned in the embodiments, and may be simply modified or replaced by those of ordinary skill in the art.

Of course, according to actual needs, the present invention may also include other parts, and since the parts are irrelevant to the innovations of the present invention, the details are not repeated here.

Similarly, it should be appreciated that in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the method of the invention should not be interpreted as reflecting the intention: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Furthermore, in the drawings or description, like or identical parts are provided with the same reference numerals. Features of the embodiments illustrated in the description may be combined freely to form new solutions without conflict, in addition, each claim may be used alone as one embodiment or features of the claims may be combined as a new embodiment, and in the drawings, the shape or thickness of the embodiments may be enlarged and labeled in a simplified or convenient manner. Furthermore, elements or implementations not shown or described in the drawings are of a form known to those of ordinary skill in the art. Additionally, although examples of parameters including particular values may be provided herein, it should be appreciated that the parameters need not be exactly equal to the corresponding values, but may be approximated to the corresponding values within acceptable error margins or design constraints.

The various embodiments of the invention described above may be freely combined to form further embodiments, unless otherwise technically impaired or contradictory, which are all within the scope of the invention.

Although the present invention has been described with reference to the accompanying drawings, the examples disclosed in the drawings are intended to illustrate preferred embodiments of the invention and are not to be construed as limiting the invention. The dimensional proportions in the drawings are illustrative only and should not be construed as limiting the invention.

Although a few embodiments of the present general inventive concept have been shown and described, it would be appreciated by those skilled in the art that changes may be made in these embodiments without departing from the principles and spirit of the general inventive concept, the scope of which is defined in the claims and their equivalents.

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims

1. The poison-target literature knowledge mining method based on network crawling is characterized by comprising the following steps of:

developing a web crawler tool;

searching and inquiring the poison-target relation and the document text information by using the document text database and the poison-target relation knowledge base, so as to carry out poison-target document knowledge mining;

the method for determining potential action relation of poison-target based on the document text database by utilizing natural language processing technology to form a knowledge base of poison-target relation comprises the following steps:

optimizing model parameters by adopting word vectors and a multi-layer neural network technology based on a deep learning algorithm;

2. The method of claim 1, wherein acquiring poison and target data information and processing to create a comprehensive dataset comprises:

acquiring known poison and target data information;

3. The method of claim 1, wherein the integrated dataset includes information of poisons and information of targets; the information of the poison comprises basic information of a poison name, a CAS number, a chemical structure and a molecular formula, and the information of the target comprises information of a target name and a Uniprot sequence number.

4. The method of claim 1, wherein developing a web crawler tool comprises: developing the web crawler tool based on the Python language and the Scrapy framework.

5. The method of claim 1, wherein crawling and processing poison and target document text information using the web crawler tool based on the integrated dataset to build a document text database comprises:

6. The method of claim 1, wherein determining poison-target potential effect relationships using natural language processing techniques based on the document text database to form a knowledge base of poison-target relationships comprises: determining the relation of the literature poison-target based on similarity analysis, cluster analysis, topic mining, entity relation extraction and deep learning algorithm, and forming a knowledge base of the poison-target relation.

7. The method of claim 1, wherein utilizing the literature text database and poison-target relationship knowledge base for poison-target literature knowledge mining comprises:

8. The method of claim 7, wherein searching and targeting literature queries using the literature text database and a poison-target relationship knowledge base to mine poison-target literature knowledge comprises:

9. A web-crawling-based poison-target literature knowledge mining system, comprising a processor for performing the method of any one of claims 1 to 8.