CN112380345B - COVID-19 scientific literature fine-grained classification method based on GNN - Google Patents

COVID-19 scientific literature fine-grained classification method based on GNN

Info

Publication number
CN112380345B
Authority
CN
China
Prior art keywords
scientific
scientific literature
knowledge
designing
chinese medicine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011313700.8A
Other languages
Chinese (zh)
Other versions
CN112380345A (en)
Inventor
杨帅
王小红
赵志刚
窦方坤
曹皓伟
潘景山
魏志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Computer Science Center National Super Computing Center in Jinan
Original Assignee
Shandong Computer Science Center National Super Computing Center in Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Computer Science Center National Super Computing Center in Jinan
Priority to CN202011313700.8A
Publication of CN112380345A
Application granted
Publication of CN112380345B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/232 - Orthographic correction, e.g. spell checking or vowelisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/258 - Heading extraction; Automatic titling; Numbering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a GNN-based fine-grained classification method for COVID-19 scientific literature, comprising the following steps: a) constructing a COVID-19 scientific literature knowledge graph: a-1) dividing the knowledge categories; a-2) designing the entities in the scientific literature; a-3) designing the relationships in the scientific literature; a-4) constructing the COVID-19 scientific literature knowledge graph; b) constructing a western medicine treatment knowledge graph; c) constructing a traditional Chinese medicine treatment knowledge graph; d) building a graph neural network model; e) text classification. The method gives medical workers an effective way to screen and classify a large body (usually more than 10,000 documents) of COVID-19-related scientific literature and quickly find documents of the knowledge category they need; it has obvious beneficial effects and is suitable for application and popularization.

Description

COVID-19 scientific literature fine-grained classification method based on GNN
Technical Field
The invention relates to a scientific literature fine-grained classification method, in particular to a COVID-19 scientific literature fine-grained classification method based on GNN.
Background
Text classification is a common task of natural language processing, and performs automatic classification and labeling on a text set according to a certain classification system or standard. The text classification method is divided into three categories: rule-based methods, machine learning-based methods (data-driven methods), and hybrid methods. Rule-based text classification methods (Rule-based methods) require manual involvement in formulating rules, and are often relatively accurate. However, when the rules are changed or updated, the rules need to be manually re-summarized, and the maintenance cost is high. Moreover, when there are many rules, there is a possibility that the rules may conflict with each other, which makes maintenance difficult. In terms of extensibility, a given rule is difficult to extend into other scenarios, and new scenarios often require rewriting of the rule.
Machine-learning-based text classification methods, also called data-driven methods, can be subdivided into two categories: traditional machine learning methods and deep-learning-based methods. Traditional machine learning methods include SVM, GBDT and the like; their features must be specified manually, which entails a large amount of data analysis and feature engineering, and because the feature engineering is tied to the business scenario, these methods are difficult to generalize to other scenarios. Deep-learning-based text classification automatically learns the relationship between data and labels through a deep learning model and needs no manual intervention apart from labeling. Hybrid methods are generally adopted in industry: several deep learning models such as attention, CNN, LSTM and BERT are used in combination, with rule-based methods for pre-filtering and as a final fallback, which is simple to implement and consumes few resources. For complex, large-scale learning tasks, deep learning models train better.
Classic text classification models include feed-forward neural networks, DAN (deep averaging network), fastText, Tree-LSTM, Multi-Timescale LSTM (MT-LSTM), TopicRNN, dynamic CNN, Kim-CNN, capsule neural networks, Transformers, and the like. Common data sets for text classification tasks can be grouped by task: sentiment classification, news classification, topic classification, question answering (QA) and natural language inference (NLI). Classic data sets in each group include: (1) sentiment classification: Yelp, IMDb, Movie Review, SST, MPQA, Amazon, Aspect-Based Sentiment Analysis; (2) news classification: AG News, 20 Newsgroups, Sogou News, Reuters News, Bing News, NYTimes, BBC, Google News; (3) topic classification: DBpedia, Ohsumed, EUR-Lex, WOS, PubMed 200k RCT, Irony (composed of annotated comments from the social news website Reddit), a Twitter data set for topic classification of tweets, and the arXiv collection; (4) QA data sets: SQuAD, MS MARCO, TREC-QA, WikiQA, Quora, Situations With Adversarial Generations (SWAG), SelQA; (5) NLI data sets: SNLI, Multi-NLI, SICK, MSRP, Semantic Textual Similarity (STS), RTE, SciTail.
The existing text classification method achieves good effects, but the problems of lack of data sets of complex text scenes, poor interpretability, poor model design, small sample learning and the like still exist.
(1) Complex text classification scenarios lack a data set. Although the text classification algorithm based on deep learning has a good effect on many data sets, in some complex text classification scenes, the data sets are still lacked, and the effect of the current model cannot be verified, such as QA, multi-step text inference, multi-language text classification and the like in the complex scenes.
(2) Knowledge modeling of text information. The knowledge contained in text needs to be modeled, for example by building a knowledge base or a knowledge graph, so that analysis and inference can be performed on that knowledge.
(3) Deep learning is poorly interpretable. In many real business scenarios, business parties are more concerned about the interpretability problem, but the interpretability of current deep learning is not strong.
(4) Smaller, more efficient models. Since the appearance of models such as BERT, models have grown larger and larger and training consumes more and more resources; training time under limited resources is very long, so industry needs smaller models that can be trained efficiently.
(5) Small-sample learning (zero-shot and few-shot learning). Current deep learning models rely too heavily on large amounts of labeled data; whether knowledge can be introduced to compensate for the shortage of labeled samples remains an open question.
A knowledge graph is a semantic network that reveals the relationships between entities. According to its Wikipedia entry, the Knowledge Graph is the knowledge base that Google uses to enhance its search engine. Essentially, a knowledge graph describes the entities or concepts that exist in the real world and the relationships between them, forming a huge semantic network graph in which nodes represent entities or concepts and edges represent attributes or relationships. Today the term is used broadly to refer to a variety of large-scale knowledge bases.
A knowledge graph is composed of entities, semantic classes (concepts), content, attributes (values), and relationships. An entity is something distinguishable that exists independently, such as a person, a city or a commodity. A semantic class (concept) is a collection of entities with common characteristics, such as countries or ethnic groups. Content typically takes the form of names, descriptions and explanations of entities and semantic classes, and may be expressed as text, images, audio or video. Attributes (values) are the different kinds of properties an entity has. Relationships are the connections between entities. At present, well-known large-scale knowledge graphs include Freebase, the Google Knowledge Graph, DBpedia, Wikipedia, the Baidu knowledge graph and the Sogou Knowledge Cube.
Novel coronavirus pneumonia was named coronavirus disease 2019 (COVID-19) by the World Health Organization and is the pneumonia caused by infection with the 2019 novel coronavirus. In the face of such a severe epidemic, a COVID-19 knowledge base is urgently needed to provide data support for scientific epidemic prevention.
Specifically, no method currently exists for directly performing fine-grained classification of COVID-19 scientific literature. On the one hand, the information in COVID-19 scientific literature needs to be modeled at a fine granularity; on the other hand, problems such as model design and small-sample learning need to be solved. At the beginning of the COVID-19 outbreak the body of literature was still small and knowledge classification could be done by hand, but the COVID-19 scientific literature has since grown to more than ten thousand documents. At this scale manual classification is no longer feasible, so a machine learning method for fine-grained classification of COVID-19 scientific literature is urgently needed.
Disclosure of Invention
To overcome the technical shortcomings described above, the invention provides a GNN-based fine-grained classification method for COVID-19 scientific literature.
The invention discloses a GNN-based COVID-19 scientific literature fine-grained classification method, which is characterized by comprising the following steps of:
a) constructing a knowledge graph of COVID-19 scientific literature;
a-1) dividing the knowledge categories: acquiring a certain number of COVID-19-related scientific documents from medical journals, extracting the basic information of each document, and assigning each document to one of the knowledge categories of virus naming, virus detection, origin and variation, virus transmission, inter-species transmission, pathogenic molecular mechanism, pathogenic mechanism, human immunity, vaccine development, vaccine treatment, novel therapy, clinical drug application, drug development, drug treatment or clinical research, with documents that belong to none of these categories assigned to other categories;
a-2) designing the entities in the scientific literature: organizing the scientific literature data into 5 types of entities, namely scientific literature, knowledge categories, scientific researchers, academic journals and scientific research institutions; giving the scientific literature entity 10 attributes: paper number, publication time, title, paper address, information address, abstract, content brief, DOI (digital object identifier), keywords and remarks; giving the knowledge category entity, according to the division in step a-1), 2 attributes: knowledge category number and knowledge category name; giving the scientific researcher entity 2 attributes: scientific researcher number and scientific researcher name; giving the academic journal entity 2 attributes: academic journal number and academic journal name; and giving the scientific research institution entity 2 attributes: scientific research institution number and scientific research institution name;
a-3) designing the relationships in the scientific literature: organizing the scientific literature data into 5 types of relationships, namely 'scientific research institution-scientific researcher', 'scientific literature-scientific research institution', 'scientific literature-scientific researcher', 'scientific literature-knowledge category' and 'scientific literature-academic journal'; giving the 'scientific research institution-scientific researcher' relationship 3 attributes: scientific researcher number, scientific research institution number and relation type; giving the 'scientific literature-scientific research institution' relationship 3 attributes: scientific literature number, scientific research institution number and relation type; giving the 'scientific literature-scientific researcher' relationship 3 attributes: scientific literature number, scientific researcher number and relation type; giving the 'scientific literature-knowledge category' relationship 3 attributes: scientific literature number, knowledge category number and relation type; and giving the 'scientific literature-academic journal' relationship 3 attributes: scientific literature number, academic journal number and relation type;
a-4) constructing a COVID-19 scientific literature knowledge graph, and analyzing, denoising and normalizing the data of the scientific literature entities and the relationship constructed in the steps a-2) and a-3) to form the scientific literature knowledge graph;
b) constructing a western medicine treatment knowledge graph;
b-1) designing the entities in the western medicine treatment data: organizing the western medicine treatment data into 3 types of entities, namely western medicines, western-medicine-related papers and knowledge categories; giving the western medicine entity 9 attributes: western medicine number, western medicine name, record number, drug category, description, extension number, CAS number, InChI code and SMILES code; and giving the western-medicine-related paper entity 8 attributes: paper number, paper title, publication time, paper link, abstract, DOI, remarks and keywords;
b-2) designing the relationships in the western medicine treatment data: organizing the western medicine treatment relationships into 2 types, namely 'knowledge category-western medicine' and 'western medicine-western-medicine-related paper'; giving the 'knowledge category-western medicine' relationship 3 attributes: knowledge category number, western medicine number and relation type; and giving the 'western medicine-western-medicine-related paper' relationship 3 attributes: western medicine number, western-medicine-related paper number and relation type;
b-3) constructing the western medicine treatment knowledge graph: analyzing, denoising and normalizing the western medicine entity and relationship data designed in steps b-1) and b-2) to form the western medicine treatment knowledge graph;
c) constructing a traditional Chinese medicine treatment knowledge graph;
c-1) designing the entities in the traditional Chinese medicine treatment data: organizing the traditional Chinese medicine treatment data into 5 types of entities, namely traditional Chinese medicine prescriptions, traditional Chinese medicine scientific documents, medicinal materials, active ingredients and knowledge categories; giving the traditional Chinese medicine prescription entity 2 attributes: prescription number and prescription name; giving the traditional Chinese medicine scientific document entity 10 attributes: document number, title, knowledge link, traditional Chinese medicine prescription, medicinal materials, active ingredients, objective, methods, results and conclusions; giving the medicinal material entity 2 attributes: medicinal material number and medicinal material name; giving the active ingredient entity 2 attributes: compound number and compound name; and setting the knowledge category to drug treatment;
c-2) designing the relationships in the traditional Chinese medicine treatment data: organizing the traditional Chinese medicine treatment data into 4 types of entity relationships, namely 'traditional Chinese medicine scientific document-traditional Chinese medicine prescription', 'traditional Chinese medicine prescription-active ingredient', 'traditional Chinese medicine prescription-medicinal material' and 'traditional Chinese medicine prescription-knowledge category'; giving the 'traditional Chinese medicine scientific document-traditional Chinese medicine prescription' relationship 3 attributes: document number, prescription number and relation type; giving the 'traditional Chinese medicine prescription-active ingredient' relationship 3 attributes: compound number, prescription number and relation type; giving the 'traditional Chinese medicine prescription-medicinal material' relationship 3 attributes: medicinal material number, prescription number and relation type; and giving the 'traditional Chinese medicine prescription-knowledge category' relationship 3 attributes: knowledge category number, prescription number and relation type;
c-3) constructing the traditional Chinese medicine treatment knowledge graph: analyzing, denoising and normalizing the traditional Chinese medicine entity and relationship data designed in steps c-1) and c-2) to form the traditional Chinese medicine treatment knowledge graph;
d) constructing a graph neural network model, fusing the COVID-19 scientific literature knowledge graph constructed in the step a), the western medicine treatment knowledge graph constructed in the step b) and the traditional Chinese medicine treatment knowledge graph constructed in the step c) to form a COVID-19 knowledge graph, constructing a COVID-19 scientific literature fine-grained classification data set based on the COVID-19 knowledge graph, and constructing a COVID-19 scientific literature text classification model (CTGC) based on GNN;
e) text classification: for a scientific document to be classified, inputting its title, abstract and keywords into the trained COVID-19 scientific literature text classification model CTGC, which outputs the category of the document.
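As a rough illustration of steps a) to e), the following minimal Python sketch stores each knowledge graph as attribute dictionaries plus (head, tail, relation type) triples, fuses several graphs into one, and derives (text, label) training pairs from the 'scientific literature-knowledge category' relationship. All class names, function names, identifiers and sample values here are hypothetical and chosen only for illustration; the patent does not specify a data structure, and this is not the CTGC model itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    head: str      # number of the source entity, e.g. a paper number
    tail: str      # number of the target entity, e.g. a knowledge category number
    rel_type: str  # the "relation type" attribute


@dataclass
class KnowledgeGraph:
    nodes: dict = field(default_factory=dict)     # entity number -> attribute dict
    relations: set = field(default_factory=set)   # set of Relation triples

    def add_node(self, node_id, **attrs):
        self.nodes.setdefault(node_id, {}).update(attrs)

    def add_relation(self, head, tail, rel_type):
        self.relations.add(Relation(head, tail, rel_type))


def fuse(*graphs):
    """Union of nodes and relations; entities sharing a number are merged."""
    merged = KnowledgeGraph()
    for g in graphs:
        for node_id, attrs in g.nodes.items():
            merged.add_node(node_id, **attrs)
        merged.relations |= g.relations
    return merged


def classification_dataset(graph, rel_type="paper-category"):
    """Build (title + abstract + keywords, category name) pairs."""
    samples = []
    for r in sorted(graph.relations, key=lambda r: (r.head, r.tail)):
        if r.rel_type != rel_type:
            continue
        paper = graph.nodes[r.head]
        text = " ".join(paper.get(k, "") for k in ("title", "abstract", "keywords"))
        samples.append((text.strip(), graph.nodes[r.tail]["name"]))
    return samples


# Toy data, invented for this sketch only.
literature = KnowledgeGraph()
literature.add_node("P1", title="toy title", abstract="toy abstract", keywords="toy keywords")
literature.add_node("K7", name="pathogenic mechanism")
literature.add_relation("P1", "K7", "paper-category")

western = KnowledgeGraph()
western.add_node("K14", name="drug treatment")
western.add_node("D1", name="drug A")
western.add_relation("K14", "D1", "category-drug")

covid_graph = fuse(literature, western)
dataset = classification_dataset(covid_graph)
```

In this sketch the fused graph keeps every entity once, so a knowledge category shared by the literature graph and a treatment graph becomes the point where the graphs join.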
In the GNN-based COVID-19 scientific literature fine-grained classification method of the invention, the basic information of the scientific literature is extracted by the following steps:
1) word segmentation: the text of the scientific literature is first split into words or phrases; because the COVID-19 scientific literature is in English, the title and abstract data are segmented with the English tokenization tool nltk;
2) stop-word removal: nltk is used to remove stop words such as a, an, the, above and after, which appear frequently in the text but have little influence on text classification;
3) lowercasing: the English text is uniformly converted to lower case;
4) noise removal: special symbols and punctuation are removed from the text;
5) spell checking: the text is checked for spelling errors, which are corrected;
6) slang and abbreviations: abbreviations are expanded to their full form, and colloquial expressions are converted to written language;
7) stemming and lemmatization: English words are reduced to their base form;
8) word frequency statistics and filtering: after the training text has been processed, the frequency of each remaining word is counted, low-frequency words with a frequency lower than 5 are filtered out, and words whose frequency reaches the threshold are kept.
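The preprocessing steps above can be sketched in a few lines of Python. The text names nltk for tokenization and stop-word removal; to keep the sketch runnable without downloading nltk corpora, a whitespace tokenizer and a very small hand-written stop-word list stand in for it here, and spell checking, abbreviation expansion and stemming are omitted. The function names and the stop-word list are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny stand-in for a real stop-word list (assumption, not the nltk list).
STOP_WORDS = {"a", "an", "the", "above", "after", "of", "in", "and", "is"}


def preprocess(text):
    """Lowercase, strip punctuation, tokenize, and drop stop words."""
    text = text.lower()                          # lowercasing
    text = re.sub(r"[^a-z0-9\s-]", " ", text)    # noise removal (keeps hyphens, e.g. covid-19)
    tokens = text.split()                        # word segmentation (whitespace stand-in)
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal


def filter_by_frequency(token_lists, min_count=5):
    """Keep only words whose corpus-wide frequency is at least min_count."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return [[t for t in tokens if counts[t] >= min_count] for tokens in token_lists]


docs = ["The virus spreads after contact!", "A virus, the VIRUS."]
tokens = [preprocess(d) for d in docs]
# A threshold of 3 is used only because this toy corpus is tiny; the text uses 5.
filtered = filter_by_frequency(tokens, min_count=3)
```

Lowercasing is applied before stop-word removal in this sketch so that capitalized stop words such as "The" are matched; the order in the list above is otherwise followed.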
In the GNN-based COVID-19 scientific literature fine-grained classification method of the invention, the graph neural network GNN in step d) consists of a convolution layer, a pooling layer and a fully connected layer, the fully connected layer being implemented with a softmax function; the batch size is 256, the number of threads used for loading data is 4, the regularization coefficient of the loss function is 1e-4, the word-embedding dimension is 100, the dropout rate is 0.5, the number of classification categories is 34, and learning-rate decay is used with a decay rate of 0.95 and a decay interval of 1.
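The hyperparameters listed above can be collected in a small configuration object, shown here as a hedged Python sketch. It reads the value 256 as a mini-batch size and 1e-4 as a regularization (weight decay) coefficient, both of which are interpretations of the translated text; the initial learning rate is not stated in the source, so the value used below is purely an assumption, as are all field and function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CTGCConfig:
    batch_size: int = 256          # read as mini-batch size (interpretation)
    num_workers: int = 4           # threads used for loading data
    weight_decay: float = 1e-4     # read as regularization coefficient (interpretation)
    embedding_dim: int = 100       # word-embedding dimension
    dropout: float = 0.5           # dropout rate
    num_classes: int = 34          # number of classification categories
    lr_decay_rate: float = 0.95    # learning-rate decay rate
    lr_decay_interval: int = 1     # decay applied every `lr_decay_interval` epochs
    initial_lr: float = 1e-3       # ASSUMPTION: not given in the source text


def learning_rate(cfg, epoch):
    """Step-wise exponential decay: lr = initial_lr * rate ** (epoch // interval)."""
    return cfg.initial_lr * cfg.lr_decay_rate ** (epoch // cfg.lr_decay_interval)


cfg = CTGCConfig()
schedule = [learning_rate(cfg, epoch) for epoch in range(3)]
```

With a decay interval of 1, the learning rate under this reading shrinks by a factor of 0.95 after every epoch.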
In the GNN-based COVID-19 scientific literature fine-grained classification method of the invention, the COVID-19 scientific literature knowledge graph of step a), the western medicine treatment knowledge graph of step b) and the traditional Chinese medicine treatment knowledge graph of step c) are all constructed with the Neo4j database tool.
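One way entities and relationships might be loaded into Neo4j is through parameterized Cypher MERGE statements. The Python sketch below only builds the query strings and parameter dictionaries and does not connect to a database; the node labels (Paper, Category), the property name categoryId and the relationship type BELONGS_TO are hypothetical choices for illustration and are not taken from the patent.

```python
def merge_node(label, key, props):
    """Return (cypher, params) that creates or updates one node keyed on `key`."""
    cypher = f"MERGE (n:{label} {{{key}: $key_value}}) SET n += $props"
    return cypher, {"key_value": props[key], "props": props}


def merge_relation(head_label, head_key, tail_label, tail_key, rel_type):
    """Return a Cypher statement linking two existing nodes; expects $head and $tail."""
    return (
        f"MATCH (a:{head_label} {{{head_key}: $head}}), "
        f"(b:{tail_label} {{{tail_key}: $tail}}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )


node_query, node_params = merge_node("Paper", "paperId", {"paperId": "P1", "paperTitle": "toy title"})
rel_query = merge_relation("Paper", "paperId", "Category", "categoryId", "BELONGS_TO")
```

Using MERGE rather than CREATE means that re-running the import does not duplicate an entity that already exists, which suits the batch updates of preprints described elsewhere in the document.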
The beneficial effects of the invention are as follows. The GNN-based COVID-19 scientific literature fine-grained classification method first divides the acquired scientific literature into knowledge categories and constructs a scientific literature knowledge graph from the attributes of the scientific literature entities and the entity relationships; it then builds knowledge graphs for the western medicine data and the traditional Chinese medicine data from their respective entities and entity relationships; finally, it constructs a scientific literature text classification model (CTGC) using a graph neural network (GNN). For a scientific document to be classified, the trained CTGC model outputs its category; the classification is accurate and fine-grained. The method thus gives medical workers an effective way to screen and classify a massive body (usually more than 10,000 documents) of COVID-19-related scientific literature and quickly find documents of the knowledge category they need (such as virus naming, virus detection, origin and variation, virus transmission, inter-species transmission, pathogenic molecular mechanism, pathogenic mechanism, human immunity, vaccine development, vaccine treatment, novel therapy, clinical drug application, drug development, drug treatment or clinical research); it has obvious beneficial effects and is suitable for application and popularization.
Drawings
FIG. 1 is a schematic representation of the COVID-19 scientific literature knowledge graph constructed in the present invention;
FIG. 2 is a schematic diagram of the western medicine treatment knowledge graph constructed in the present invention;
FIG. 3 is a schematic diagram of the traditional Chinese medicine treatment knowledge graph constructed in the present invention;
FIG. 4 is a schematic diagram of the COVID-19 knowledge graph formed by fusing the COVID-19 scientific literature knowledge graph, the western medicine treatment knowledge graph and the traditional Chinese medicine treatment knowledge graph constructed in the invention.
FIGS. 1 to 4 are only schematic illustrations of the constructed knowledge graphs, intended to give an intuitive impression and aid understanding; the completeness and sufficiency of disclosure of the GNN-based COVID-19 scientific literature fine-grained classification method described in the invention are not affected even without FIGS. 1 to 4.
Detailed Description
The invention is further described with reference to the following figures and examples.
The COVID-19 knowledge graph is constructed mainly from the latest published scientific literature, and the key step in its construction is fine-grained classification of the literature by research topic, which is tedious and labor-intensive work. At the beginning of the COVID-19 outbreak the body of literature was still small and knowledge classification could be done by hand, but the COVID-19 scientific literature has since grown to more than ten thousand documents. At this scale manual classification is no longer feasible, so a machine learning method for fine-grained classification of COVID-19 scientific literature is urgently needed.
A total of 491 scientific documents were obtained, covering virus origin, transmission, isolation and control, naming, clinical research, human immunity, drug treatment, clinical drug application, vaccine development, pathogenic mechanism and pathogenic molecular mechanism. Knowledge on drug treatment is divided into western medicine treatment and traditional Chinese medicine treatment: the western medicine part comprises 23 drugs and 122 related scientific documents; the traditional Chinese medicine part comprises 40 traditional Chinese medicine prescriptions for treating COVID-19 drawn from 49 scientific documents, together with the 112 medicinal materials and 86 active ingredients involved in those prescriptions. Table 1 gives the statistics of the sources of the scientific literature:
TABLE 1
[Table 1 appears as an image in the original publication and is not reproduced here.]
The scientific literature includes a large number of preprints that have not yet been peer reviewed; these come mainly from bioRxiv, medRxiv and other preprint servers in the biological and medical fields, and they can be updated in batches once the papers are formally published. Leading international journals and medical journals (such as Cell, Nature, Science, JAMA, Lancet, The New England Journal of Medicine and British Medical Journal) account for as much as 56% of the documents, so the data quality is very high. In addition, the knowledge coverage of the scientific literature is comprehensive and fine-grained: the COVID-19 knowledge graph construction process not only extracts the basic information of each document but also classifies each document by topic, such as virus detection, virus origin, transmission, clinical research, human immunity, clinical drug application, vaccine development and pathogenic mechanism.
The GNN-based COVID-19 scientific literature fine-grained classification method of the invention was applied to the 491 scientific documents obtained and is implemented by the following steps:
a) constructing a knowledge graph of COVID-19 scientific literature;
a-1) dividing the knowledge categories: acquiring a certain number of COVID-19-related scientific documents from medical journals, extracting the basic information of each document, and assigning each document to one of the knowledge categories of virus naming, virus detection, origin and variation, virus transmission, inter-species transmission, pathogenic molecular mechanism, pathogenic mechanism, human immunity, vaccine development, vaccine treatment, novel therapy, clinical drug application, drug development, drug treatment or clinical research, with documents that belong to none of these categories assigned to other categories;
Table 2 gives the statistics of the knowledge categories of the scientific literature:
TABLE 2
Knowledge category                 Number of documents   Proportion
Virus naming                       3                     1%
Virus detection                    8                     3%
Origin and variation               26                    8%
Virus transmission                 45                    14%
Inter-species transmission         7                     2%
Pathogenic molecular mechanism     20                    6%
Pathogenic mechanism               46                    14%
Human immunity                     13                    4%
Vaccine development                9                     3%
Vaccine treatment                  2                     1%
Novel therapy                      7                     2%
Clinical drug application          18                    6%
Drug development                   7                     2%
Drug treatment                     12                    4%
Clinical research                  72                    23%
The remaining 19 categories        25                    7%
It can be seen that clinical research (13%), drug therapy (14%), COVID-19 pathogenesis (20%), and virus transmission and variation (24%) have been the hot spots of scientific research so far.
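The proportions in Table 2 follow directly from the per-category counts. The following is a minimal sketch of that arithmetic, assuming the counts exactly as listed in Table 2 (they sum to 320); the variable and function names are illustrative and are not part of the invention.

```python
# Per-category document counts, copied from Table 2.
category_counts = {
    "Virus nomenclature": 3,
    "Virus detection": 8,
    "Origin and variation": 26,
    "Viral transmission": 45,
    "Inter-species spread": 7,
    "Molecular mechanisms of pathogenesis": 20,
    "Pathogenic mechanism": 46,
    "Human immunity": 13,
    "Vaccine development": 9,
    "Vaccine treatment": 2,
    "Novel therapy": 7,
    "Clinical application of medicine": 18,
    "Drug development": 7,
    "Medical treatment": 12,
    "Clinical research": 72,
    "Other (remaining 19 categories)": 25,
}


def category_shares(counts):
    """Return each category's share of all counted documents, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


shares = category_shares(category_counts)

# Print the categories from largest to smallest share.
for name, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {pct:.1f}%")
```

Grouping related categories (for example, the two pathogenesis categories) is then a matter of summing the corresponding counts before dividing by the total.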
a-2) designing the entities in the scientific literature: the scientific literature data are designed as 5 types of entities, namely scientific literature, knowledge category, scientific researcher, academic journal and scientific research institution; the scientific literature entity has 10 attributes: paper number, publication time, title, paper address, information address, abstract, content introduction, DOI (digital object identifier), keywords and remarks; the knowledge category entity has 2 attributes, knowledge category number and knowledge category name, following the division in step a-1); the scientific researcher entity has 2 attributes, researcher number and researcher name; the academic journal entity has 2 attributes, journal number and journal name; and the scientific research institution entity has 2 attributes, institution number and institution name;
The field design of the scientific literature entity is shown in Table 3:
TABLE 3
Scientific literature entity attribute | Field design
Paper number | paperId
Publication time | paperPubtime
Title | paperTitle
Paper address | paperLink
Information address | paperNewsLink
Abstract | paperAbstract
Content introduction | paperContent
DOI | paperDOI
Keywords | paperKeywords
Remarks | paperNotes
The knowledge category entity has 2 attributes, the knowledge category number and the knowledge category name; the field design is shown in Table 4:
TABLE 4
Knowledge category entity attribute | Field design
Knowledge category number | CategoryId
Knowledge category name | CategoryName
The scientific researcher entity has 2 attributes, the researcher number and the researcher name; the field design is shown in Table 5:
TABLE 5
Scientific researcher entity attribute | Field design
Researcher number | peopleId
Researcher name | peopleName
The academic journal entity has 2 attributes, the journal number and the journal name; the field design is shown in Table 6:
TABLE 6
Academic journal entity attribute | Field design
Academic journal number | journalId
Academic journal name | journalName
The scientific research institution entity has 2 attributes, the institution number and the institution name; the field design is shown in Table 7:
TABLE 7
Scientific research institution entity attribute | Field design
Scientific research institution number | institutionId
Scientific research institution name | institutionName
a-3) designing the relations in the scientific literature: the scientific literature data are designed as 5 types of relations, namely "scientific research institution-scientific researcher", "scientific literature-scientific research institution", "scientific literature-scientific researcher", "scientific literature-knowledge category" and "scientific literature-academic journal"; the "scientific research institution-scientific researcher" relation has 3 attributes: researcher number, institution number and relation type; the "scientific literature-scientific research institution" relation has 3 attributes: scientific literature number, institution number and relation type; the "scientific literature-scientific researcher" relation has 3 attributes: scientific literature number, researcher number and relation type; the "scientific literature-knowledge category" relation has 3 attributes: scientific literature number, knowledge category number and relation type; and the "scientific literature-academic journal" relation has 3 attributes: scientific literature number, academic journal number and relation type;
The "scientific research institution-scientific researcher" relation describes which institution a researcher works in; its field design is shown in Table 8:
TABLE 8
"Scientific research institution-scientific researcher" relation attribute | Field design
Researcher number | peopleId
Scientific research institution number | institutionId
Relation type | workin
The "scientific literature-scientific research institution" relation describes the relation between a scientific document and a scientific research institution; its field design is shown in Table 9:
TABLE 9
"Scientific literature-scientific research institution" relation attribute | Field design
Scientific literature number | paperId
Scientific research institution number | institutionId
Relation type | platformin
The "scientific literature-scientific researcher" relation describes which researchers published a scientific document; its field design is shown in Table 10:
TABLE 10
"Scientific literature-scientific researcher" relation attribute | Field design
Scientific literature number | paperId
Researcher number | peopleId
Relation type | author
The "scientific literature-knowledge category" relation describes the knowledge category of each scientific document; its field design is shown in Table 11:
TABLE 11
"Scientific literature-knowledge category" relation attribute | Field design
Scientific literature number | paperId
Knowledge category number | CategoryId
Relation type | belongto
The "scientific literature-academic journal" relation describes the journal in which each scientific document was published; its field design is shown in Table 12:
TABLE 12
"Scientific literature-academic journal" relation attribute | Field design
Scientific literature number | paperId
Academic journal number | journalId
Relation type | publishin
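The relation tables above all share the same shape: two entity numbers plus a relation type. The following is a minimal sketch of how one document's paper-centred relations (Tables 9 to 12) could be represented as records; the `Relation` class, the `paper_relations` helper and all identifier values are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One edge of the graph: source number, relation type, target number."""
    source_id: str
    relation_type: str
    target_id: str


def paper_relations(paper_id, people_ids, institution_ids, category_id, journal_id):
    """Build the paper-centred relations of Tables 9 to 12 for one document."""
    relations = [Relation(paper_id, "author", p) for p in people_ids]
    relations += [Relation(paper_id, "platformin", i) for i in institution_ids]
    relations.append(Relation(paper_id, "belongto", category_id))
    relations.append(Relation(paper_id, "publishin", journal_id))
    return relations


# Hypothetical identifiers, for illustration only.
edges = paper_relations(
    "paper-330", ["people-1"], ["institution-1"], "category-9", "journal-1"
)
for edge in edges:
    print(edge)
```

Keeping the relation type as a plain string mirrors the field design in the tables, where the relation type is a fixed value per relation table.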
a-4) constructing a COVID-19 scientific literature knowledge graph, and analyzing, denoising and normalizing the data of the scientific literature entities and the relationship constructed in the steps a-2) and a-3) to form the scientific literature knowledge graph;
With this design, the metadata are parsed, denoised and normalized and then imported into a Neo4j database; the resulting knowledge graph is shown in FIG. 1.
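The source does not describe how the import into Neo4j is carried out. One possible approach, sketched below under that assumption, is to generate parameterised Cypher `MERGE` statements from the entity and relation records; the node labels `Paper` and `Category`, the helper names and the sample values are hypothetical, and the statements would still have to be executed with a Neo4j client of choice.

```python
def node_statement(label, key, properties):
    """Return a parameterised Cypher MERGE for one node, plus its parameters."""
    query = f"MERGE (n:{label} {{{key}: ${key}}}) SET n += $props"
    return query, {key: properties[key], "props": properties}


def relation_statement(src_label, src_key, dst_label, dst_key, rel_type):
    """Return a Cypher statement linking two existing nodes by their keys."""
    return (
        f"MATCH (a:{src_label} {{{src_key}: $src}}), "
        f"(b:{dst_label} {{{dst_key}: $dst}}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )


# Hypothetical sample record using the field names of Table 3.
query, params = node_statement(
    "Paper", "paperId", {"paperId": 330, "paperTitle": "example title"}
)
link = relation_statement("Paper", "paperId", "Category", "CategoryId", "belongto")
print(query)
print(link)
```

Using `MERGE` rather than `CREATE` keeps the import idempotent, which matters when normalized records are re-imported after further cleaning.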
b) Constructing a Western medicine treatment knowledge graph;
b-1) designing the entities in the Western medicine treatment data: the Western medicine treatment data are designed as 3 types of entities, namely Western medicine, Western-medicine-related paper and knowledge category; the Western medicine entity has 9 attributes: Western medicine number, Western medicine name, record number, medicine category, description, extension number, CAS number, InChI code and SMILES code; the Western-medicine-related paper entity has 8 attributes: paper number, paper title, publication time, paper link, paper abstract, DOI, remarks and keywords;
The Western-medicine-related paper entity refers to a paper related to a Western medicine entity. It has 8 attributes: paper number, paper title, publication time, paper link, paper abstract, DOI, remarks (PMID, PMCID) and keywords; the field design is shown in Table 13:
TABLE 13
Western-medicine-related paper entity attribute | Field design
Western-medicine-related paper number | drugPapersID
Paper title | drugPapersTitle
Publication time | drugPapersPublishTime
Paper link | drugPapersLink
Paper abstract | drugPapersAbstract
DOI | drugPapersDOI
Remarks (PMID, PMCID) | drugPapersNotes
Keywords | drugPapersKeywords
b-2) designing the relations in the Western medicine treatment data: the Western medicine treatment data are designed as 2 types of relations, namely "knowledge category-Western medicine" and "Western medicine-Western-medicine-related paper"; the "knowledge category-Western medicine" relation has 3 attributes: knowledge category number, Western medicine number and relation type; the "Western medicine-Western-medicine-related paper" relation has 3 attributes: Western medicine number, Western-medicine-related paper number and relation type;
The "knowledge category-Western medicine" relation assigns the 23 Western medicines to their knowledge category (medication); its field design is shown in Table 14:
TABLE 14
"Knowledge category-Western medicine" relation attribute | Field design
Knowledge category number | CategoryId
Western medicine number | drugID
Relation type | include
The "Western medicine-Western-medicine-related paper" relation links each Western medicine to its related papers; its field design is shown in Table 15:
TABLE 15
"Western medicine-Western-medicine-related paper" relation attribute | Field design
Western medicine number | drugID
Western-medicine-related paper number | drugPapersID
Relation type | related
b-3) constructing the Western medicine treatment knowledge graph: the Western medicine entity and relation data constructed in steps b-1) and b-2) are parsed, denoised and normalized to form the Western medicine treatment knowledge graph;
With this design, the metadata are parsed, denoised and normalized and then imported into a Neo4j database; the resulting knowledge graph is shown in FIG. 2.
c) Constructing a traditional Chinese medicine treatment knowledge graph;
c-1) designing the entities in the traditional Chinese medicine treatment data: the traditional Chinese medicine treatment data are designed as 5 types of entities, namely traditional Chinese medicine prescription, traditional Chinese medicine scientific literature, medicinal material, active ingredient and knowledge category; the prescription entity has 2 attributes, prescription number and prescription name; the traditional Chinese medicine scientific literature entity has 10 attributes: literature number, title, knowledge link, prescription, medicinal materials, active ingredients, purpose, method, result and conclusion; the medicinal material entity has 2 attributes, medicinal material number and medicinal material name; the active ingredient entity has 2 attributes, compound number and compound name; and the knowledge category entity takes the value "medication";
The data comprise 40 traditional Chinese medicine prescriptions for treating COVID-19 and involve 49 scientific documents selected from Chinese herbal medicine journals. On the basis of the 40 prescriptions, this part of the data also contains 112 Chinese medicinal materials and 86 active ingredients, which support research such as active-ingredient mining and exploration of the mechanisms of action of traditional Chinese medicine.
The traditional Chinese medicine prescription entity covers the 40 prescriptions for treating COVID-19. It has 2 attributes, the prescription number and the prescription name; the field design is shown in Table 16:
TABLE 16
Traditional Chinese medicine prescription entity attribute | Field design
Prescription number | herbPrescriptionId
Prescription name | herbPrescriptionName
The traditional Chinese medicine scientific literature entity covers the 49 scientific documents. It has 10 attributes: literature number, title, knowledge link, prescription, medicinal materials, active ingredients, purpose, method, result and conclusion; the field design is shown in Table 17:
TABLE 17
Traditional Chinese medicine scientific literature entity attribute | Field design
Literature number | herbPaperId
Title | herbPaperTitle
Knowledge link | herbNewsLink
Prescription | herbPrescription
Medicinal materials | herbComponent
Active ingredients | herbCompound
Purpose | goal
Method | method
Result | result
Conclusion | conclusion
The medicinal material entity covers the 112 Chinese medicinal materials. It has 2 attributes, the medicinal material number and the medicinal material name; the field design is shown in Table 18:
TABLE 18
Medicinal material entity attribute | Field design
Medicinal material number | herbId
Medicinal material name | herbName
The active ingredient entity covers the 86 active compounds in the 40 prescriptions. It has 2 attributes, the compound number and the compound name; the field design is shown in Table 19:
TABLE 19
Active ingredient entity attribute | Field design
Compound number | compoundId
Compound name | compoundName
The knowledge category entity has the value of 'medication', and is designed to be associated with the knowledge category of the scientific literature.
c-2) designing the relations in the traditional Chinese medicine treatment data: the traditional Chinese medicine treatment data are designed as 4 types of relations, namely "traditional Chinese medicine scientific literature-prescription", "prescription-active ingredient", "prescription-medicinal material" and "prescription-knowledge category"; the "traditional Chinese medicine scientific literature-prescription" relation has 3 attributes: literature number, prescription number and relation type; the "prescription-active ingredient" relation has 3 attributes: compound number, prescription number and relation type; the "prescription-medicinal material" relation has 3 attributes: medicinal material number, prescription number and relation type; and the "prescription-knowledge category" relation has 3 attributes: knowledge category number, prescription number and relation type;
The "traditional Chinese medicine scientific literature-prescription" relation describes which prescriptions each document introduces; its field design is shown in Table 20:
TABLE 20
"Traditional Chinese medicine scientific literature-prescription" relation attribute | Field design
Literature number | herbPaperId
Prescription number | herbPrescriptionId
Relation type | Introduce
The "prescription-active ingredient" relation identifies the active compounds in each prescription for treating COVID-19; its field design is shown in Table 21:
TABLE 21
"Prescription-active ingredient" relation attribute | Field design
Compound number | compoundId
Prescription number | herbPrescriptionId
Relation type | refine
The "prescription-medicinal material" relation identifies the medicinal materials that make up each prescription; its field design is shown in Table 22:
TABLE 22
"Prescription-medicinal material" relation attribute | Field design
Medicinal material number | herbId
Prescription number | herbPrescriptionId
Relation type | consistof
The "prescription-knowledge category" relation places the 40 prescriptions for treating COVID-19 under the "medication" knowledge category so that they can be fused with the scientific literature knowledge graph; its field design is shown in Table 23:
TABLE 23
"Prescription-knowledge category" relation attribute | Field design
Knowledge category number | CategoryId
Prescription number | herbPrescriptionId
Relation type | belongto
c-3) constructing the traditional Chinese medicine treatment knowledge graph: the traditional Chinese medicine entity and relation data constructed in steps c-1) and c-2) are parsed, denoised and normalized to form the traditional Chinese medicine treatment knowledge graph;
With this design, the metadata are parsed, denoised and normalized and then imported into a Neo4j database; the resulting knowledge graph is shown in FIG. 3.
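Once the prescription relations are available, questions such as "which medicinal materials do two prescriptions share" reduce to set operations. The following is a minimal sketch, assuming relation rows shaped like Tables 21 and 22; the identifiers are placeholders rather than real data, and the helper name is hypothetical.

```python
from collections import defaultdict

# Rows shaped like Tables 21 and 22: (source number, relation type, prescription number).
# All identifiers are placeholders for illustration.
rows = [
    ("herb-1", "consistof", "prescription-1"),
    ("herb-2", "consistof", "prescription-1"),
    ("compound-1", "refine", "prescription-1"),
    ("herb-2", "consistof", "prescription-2"),
]


def index_by_prescription(relation_rows):
    """Group source numbers by prescription and then by relation type."""
    index = defaultdict(lambda: defaultdict(set))
    for source_id, relation_type, prescription_id in relation_rows:
        index[prescription_id][relation_type].add(source_id)
    return index


index = index_by_prescription(rows)

# Medicinal materials common to both prescriptions.
shared = index["prescription-1"]["consistof"] & index["prescription-2"]["consistof"]
print(shared)
```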
d) Constructing a graph neural network model: the COVID-19 scientific literature knowledge graph constructed in step a), the Western medicine treatment knowledge graph constructed in step b) and the traditional Chinese medicine treatment knowledge graph constructed in step c) are fused to form the COVID-19 knowledge graph; a fine-grained classification data set of COVID-19 scientific literature is built from the COVID-19 knowledge graph, and a GNN-based COVID-19 scientific literature text classification model (CTGC) is constructed;
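The fusion procedure itself is not spelled out in the source. One plausible reading, sketched below as an assumption, is that the three graphs are joined on the knowledge category entity they share, so that taking the union of their edges is enough; the node keys, labels and identifiers are hypothetical.

```python
def fuse(*graphs):
    """Union the edge sets; nodes with the same (label, id) key coincide."""
    fused = set()
    for edges in graphs:
        fused |= set(edges)
    return fused


def sources_of(edges, target):
    """Return all source nodes that point at the given target node."""
    return {src for src, _, dst in edges if dst == target}


# Hypothetical identifiers, for illustration only.
medication = ("Category", "medication")
literature_graph = {(("Paper", "paper-1"), "belongto", medication)}
western_graph = {(medication, "include", ("Drug", "drug-1"))}
tcm_graph = {(("Prescription", "prescription-1"), "belongto", medication)}

covid_graph = fuse(literature_graph, western_graph, tcm_graph)
linked = sources_of(covid_graph, medication)
print(linked)
```

Because every graph identifies the category node by the same key, papers, Western medicines and prescriptions become reachable from one another through that node after the union.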
Graph neural networks (GNNs) are a natural generalization of convolutional neural networks to non-Euclidean data domains. Briefly, a GNN combines graph data with a neural network and performs end-to-end computation on the graph; it supports representation learning and knowledge reasoning over graph-structured data and offers strong interpretability. These characteristics make GNNs particularly suitable for machine learning tasks based on knowledge graph data.
The GNN-based COVID-19 scientific literature text classification model of the invention, CTGC (COVID-19 Text GCN Classification), consists of a convolution layer, a pooling layer and a fully connected layer; the fully connected layer performs the multi-class classification of COVID-19 scientific literature through a softmax function. The hyper-parameter settings of the model are shown in Table 24:
TABLE 24
Hyper-parameter | Description | Value
batch_size | Batch size | 256
num_workers | Number of threads used to load data | 4
weight_decay | Regularization of the loss function | 1e-4
embed_size | Word embedding dimension | 100
drop_prop | Dropout rate | 0.5
classes | Number of classification categories | 34
use_lrdecay | Whether learning rate decay is used | True
lr_decay | Decay rate | 0.95
n_epoch | Decay interval | 1
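The decay-related settings can be read as an exponential learning rate schedule. The following is a minimal sketch under two assumptions that the source does not state: that the learning rate is multiplied by `lr_decay` once every `n_epoch` epochs, and that some initial learning rate is supplied by the caller (the table gives none, so the value used here is purely illustrative).

```python
# Hyper-parameters copied from Table 24.
config = {
    "batch_size": 256,
    "num_workers": 4,
    "weight_decay": 1e-4,
    "embed_size": 100,
    "drop_prop": 0.5,
    "classes": 34,
    "use_lrdecay": True,
    "lr_decay": 0.95,
    "n_epoch": 1,
}


def learning_rate(initial_lr, epoch, cfg):
    """Learning rate at a given epoch under stepwise exponential decay (assumed)."""
    if not cfg["use_lrdecay"]:
        return initial_lr
    steps = epoch // cfg["n_epoch"]
    return initial_lr * cfg["lr_decay"] ** steps


# The initial value 1e-3 is an assumption; Table 24 does not list it.
for epoch in range(3):
    print(epoch, learning_rate(1e-3, epoch, config))
```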
e) Text classification: for a scientific document to be classified, its title, abstract and keywords are input into the trained COVID-19 scientific literature text classification model CTGC, which outputs the knowledge category of the document.
The text of the scientific literature is preprocessed in the following steps:
1) word segmentation: the text of each scientific document is first split into words or phrases; because the COVID-19 scientific literature is in English, the title and abstract data are segmented with the English tokenization tool nltk;
2) stop-word removal: stop words such as a, an, the, above and after, which occur frequently in the text but have little influence on text classification, are removed with nltk;
3) lower-casing: the English text is uniformly converted to lower case;
4) noise removal: special symbols and punctuation are removed from the text;
5) spell checking: the text is checked for spelling errors, which are corrected;
6) slang and abbreviations: abbreviations are expanded to their full forms, and colloquial expressions are converted to written language;
7) stemming and lemmatization: English words are reduced to their base forms;
8) word-frequency statistics and filtering: after the text training set has been processed, the frequency of the remaining words is counted; low-frequency words with a frequency below 5 are filtered out, and words at or above the threshold are kept.
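The steps above can be sketched in a few lines. The version below is a deliberately simplified illustration using only the Python standard library rather than nltk: the stop-word set is a tiny hypothetical subset, and spell checking, abbreviation expansion and stemming (steps 5 to 7) are omitted.

```python
import re
from collections import Counter

# Tiny illustrative stop-word subset; a real run would use a full list such as nltk's.
STOP_WORDS = {"a", "an", "the", "above", "after", "in", "by", "of", "and", "to"}


def tokenize(text):
    """Lower-case the text, drop punctuation, split into tokens, remove stop words."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def filter_low_frequency(documents, min_count=5):
    """Keep only tokens whose corpus-wide frequency is at least min_count."""
    counts = Counter(token for doc in documents for token in doc)
    return [[token for token in doc if counts[token] >= min_count] for doc in documents]


tokens = tokenize(
    "A SARS-CoV-2 infection model in mice demonstrates protection by neutralizing antibodies"
)
print(tokens)
```

The regular expression keeps hyphenated terms such as `sars-cov-2` as single tokens, which is a design choice made here so that virus names are not split apart.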
Common performance metrics in the field of text classification are: accuracy and error rate (which sum to 1), precision/recall/F-measure, Exact Match (EM) and Mean Reciprocal Rank (MRR). Exact Match (EM) is mainly used for QA systems to measure how well a prediction matches the true answer and is a commonly used evaluation method in the SQuAD competition. Mean Reciprocal Rank (MRR) is mainly used to evaluate the quality of rankings, such as query-document ranking and QA answer ranking. The invention uses the F-measure as the model performance metric.
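For reference, the weighted F-measure averages the per-class F1 scores, weighting each class by its share of the true labels. The following is a minimal pure-Python sketch of that definition; the function name and the toy labels are illustrative only.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = true_pos / n_pred if n_pred else 0.0
        recall = true_pos / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += f1 * n_true / total
    return score


# Toy labels, for illustration only.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
print(weighted_f1(y_true, y_pred))
```

Weighting by support means that frequent categories dominate the score, which is relevant for a data set whose categories are as unevenly populated as those in Table 2.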
The invention compares the performance of the CTGC model with that of common text classification models (such as FastText, TextCNN, RCNN, LSTM and DPCNN). Each model is trained with its default hyper-parameters, and the weighted f1-score of each model on the test set is shown in Table 25:
TABLE 25
Model | Test set weighted f1-score | Remarks
FastText | 92.6% | Baseline
TextCNN | 93.8% |
RCNN | 93% |
Multi-layer bidirectional LSTM | 94.4% |
Multi-layer bidirectional LSTM with Attention | 94.7% |
DPCNN | 95.4% |
CTGC | 95.5% |
The experimental results are analyzed as follows:
(1) the FastText model is the experimental baseline, with a weighted f1-score of 92.6% on the test set;
(2) among the comparison models, DPCNN achieves the best performance on the test set, with a weighted f1-score of 95.4%;
(3) the Attention mechanism has no obvious effect on this classification problem: the multi-layer bidirectional LSTM performs almost the same with and without Attention;
(4) the RCNN model (RNN combined with CNN) does not achieve the expected performance, improving only slightly on FastText;
(5) TextCNN is the most efficient model: its structure is simple, it trains quickly, and its results are good.
The above results are all based on default hyper-parameters; tuning each model individually would improve them further. Compared with CNN models, RNN-based models are harder to train and take longer, without any obvious advantage in the results. Gradient explosion occurs easily when training RNN-based models, so the related training settings and hyper-parameters need attention. Deep networks (DPCNN) are harder to train than shallow networks (TextCNN), but their results are significantly better; in addition, deep networks need residual connections to mitigate vanishing gradients.
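The source does not say which countermeasure was used against gradient explosion. A common remedy, shown here purely as an illustration and not as the method of the invention, is to clip the gradient by its global norm before each parameter update; the function name and the threshold are hypothetical.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Rescale the gradient vector so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]


# A gradient of norm 5 is scaled down to norm 1; its direction is unchanged.
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(clipped)
```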
After the model has been trained, fine-grained knowledge classification can be performed on COVID-19 scientific literature and the corresponding knowledge category predicted, in the following three steps:
(1) Extract the text data of the abstract, keywords, authors and other information of the English-language COVID-19 scientific document.
(2) In the CTGC model file directory, run the main function of the trained model, enter the text data extracted in step (1), and press Enter to execute it.
(3) The CTGC model directly returns the knowledge category of the scientific document according to its abstract, keywords and author information.
The implementation process based on the CTGC model is illustrated by an example. A scientific document published on June 10, 2020 is selected; its basic data in the COVID-19-v5.0 knowledge graph are shown in Table 26:
TABLE 26
Attribute name | Attribute value
Number | 330
Publication date | 2020.06.10
Knowledge category | Vaccine development
Title | A SARS-CoV-2 infection model in mice demonstrates protection by neutralizing antibodies
Document address | https://www.cell.com/cell/fulltext/S0092-8674(20)30742-X
Information address | https://mp.weixin.qq.com/s/grELLG8bhayXw-s8w_Tdjg
Abstract | Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused a pandemic with millions of human infections. One limitation to the evaluation of potential therapies and vaccines to inhibit SARS-CoV-2 infection and ameliorate disease is the lack of susceptible small animals in large numbers. Commercially available laboratory strains of mice are not readily infected by SARS-CoV-2 because of species-specific differences in their angiotensin-converting enzyme 2 (ACE2) receptors. Here, we transduced replication-defective adenoviruses encoding human ACE2 via intranasal administration into BALB/c mice and established receptor expression in lung tissues. hACE2-transduced mice were productively infected with SARS-CoV-2, and this resulted in high viral titers in the lung, lung pathology, and weight loss. Passive transfer of a neutralizing monoclonal antibody reduced viral burden in the lung and mitigated inflammation and weight loss. The development of an accessible mouse model of SARS-CoV-2 infection and pathogenesis will expedite the testing and deployment of therapeutics and vaccines.
Content introduction | On June 10, 2020, Michael S. Diamond of the University of Washington School of Medicine published online in Cell a study entitled "A SARS-CoV-2 infection model in mice demonstrates protection by neutralizing antibodies". In the study, a replication-defective adenovirus encoding human ACE2 was transduced into BALB/c mice by intranasal administration, and receptor expression was established in lung tissue. The hACE2-transduced mice were efficiently infected with SARS-CoV-2, which resulted in high viral titers in the lung, lung pathology and weight loss. Passive transfer of a neutralizing monoclonal antibody reduced the viral burden in the lung and mitigated inflammation and weight loss. This mouse model of SARS-CoV-2 infection and pathogenesis will expedite the testing and deployment of therapeutics and vaccines.
Scientific research institution | University OF WASHINGTON
Research team | Michael S. Diamond
DOI | https://doi.org/10.1016/j.cell.2020.06.011
Journal | Cell
Keywords | None
Remarks | None
The invention inputs the title, abstract and keyword data of this scientific document into the trained CTGC model, and the model outputs the category of the document according to the input data: vaccine development. This completes the fine-grained classification of the COVID-19 scientific document; the method can replace manual work and enables fast, large-scale fine-grained classification of COVID-19 scientific literature.
In summary, the GNN-based fine-grained classification method for COVID-19 scientific literature of the invention: 1. constructs a COVID-19 knowledge graph with high data quality and fine knowledge granularity; 2. designs a fine-grained classification model for COVID-19 scientific literature that has high accuracy, addresses the problem of learning from small samples, and offers strong interpretability; 3. uses graph neural network technology to solve the practical problem of fine-grained classification of COVID-19 scientific literature, thereby supporting science-based epidemic prevention.

Claims (4)

1. A COVID-19 scientific literature fine-grained classification method based on GNN is characterized by comprising the following steps:
a) constructing a knowledge graph of COVID-19 scientific literature;
a-1) dividing the knowledge categories: acquiring a certain number of COVID-19-related scientific documents from medical journals, extracting the basic information of each scientific document, and assigning each document to one of the knowledge categories of virus nomenclature, virus detection, origin and variation, viral transmission, inter-species spread, molecular mechanisms of pathogenesis, pathogenic mechanism, human immunity, vaccine development, vaccine treatment, novel therapy, clinical application of medicine, drug development, medical treatment or clinical research; scientific documents that belong to none of these knowledge categories are classified as other;
a-2) designing the entities in the scientific literature: the scientific literature data are designed as 5 types of entities, namely scientific literature, knowledge category, scientific researcher, academic journal and scientific research institution; the scientific literature entity has 10 attributes: paper number, publication time, title, paper address, information address, abstract, content introduction, DOI (digital object identifier), keywords and remarks; the knowledge category entity has 2 attributes, knowledge category number and knowledge category name, following the division in step a-1); the scientific researcher entity has 2 attributes, researcher number and researcher name; the academic journal entity has 2 attributes, journal number and journal name; and the scientific research institution entity has 2 attributes, institution number and institution name;
a-3) designing the relations in the scientific literature: the scientific literature data are designed as 5 types of relations, namely "scientific research institution-scientific researcher", "scientific literature-scientific research institution", "scientific literature-scientific researcher", "scientific literature-knowledge category" and "scientific literature-academic journal"; the "scientific research institution-scientific researcher" relation has 3 attributes: researcher number, institution number and relation type; the "scientific literature-scientific research institution" relation has 3 attributes: scientific literature number, institution number and relation type; the "scientific literature-scientific researcher" relation has 3 attributes: scientific literature number, researcher number and relation type; the "scientific literature-knowledge category" relation has 3 attributes: scientific literature number, knowledge category number and relation type; and the "scientific literature-academic journal" relation has 3 attributes: scientific literature number, academic journal number and relation type;
a-4) constructing the COVID-19 scientific literature knowledge graph: analyzing, denoising and normalizing the scientific literature entity and relation data designed in steps a-2) and a-3) to form the scientific literature knowledge graph;
b) constructing a western medicine treatment knowledge graph;
b-1) designing the entities in the western medicine treatment data: designing the western medicine treatment data into 3 types of entities, namely western medicine, western medicine related papers and knowledge categories; designing the western medicine entity with 9 attributes, namely western medicine number, western medicine name, recording number, medicine category, description, extension number, CAS number, InChI code and SMILES code; and designing the western medicine related paper entity with 8 attributes, namely western medicine related paper number, paper name, paper publication time, paper link, paper abstract, DOI number, remarks and keywords;
b-2) designing the relations in the western medicine treatment data: designing the western medicine treatment data into 2 types of relations, namely 'knowledge category-western medicine' and 'western medicine-western medicine related paper'; designing the 'knowledge category-western medicine' relation with 3 attributes, namely knowledge category number, western medicine number and relation type; and designing the 'western medicine-western medicine related paper' relation with 3 attributes, namely western medicine number, western medicine related paper number and relation type;
b-3) constructing the western medicine treatment knowledge graph: analyzing, denoising and normalizing the western medicine entity and relation data designed in steps b-1) and b-2) to form the western medicine treatment knowledge graph;
c) constructing a traditional Chinese medicine treatment knowledge graph;
c-1) designing the entities in the traditional Chinese medicine treatment data: designing the traditional Chinese medicine treatment data into 5 types of entities, namely traditional Chinese medicine prescriptions, traditional Chinese medicine scientific literature, medicinal materials, effective components and knowledge categories; designing the traditional Chinese medicine prescription entity with 2 attributes, namely traditional Chinese medicine prescription number and traditional Chinese medicine prescription name; designing the traditional Chinese medicine scientific literature entity with 10 attributes, namely traditional Chinese medicine scientific literature number, title, knowledge link, traditional Chinese medicine prescription, medicinal materials, effective components, purpose, method, result and conclusion; designing the medicinal material entity with 2 attributes, namely medicinal material number and medicinal material name; designing the effective component entity with 2 attributes, namely compound number and compound name; and designing the knowledge category as medicine treatment;
c-2) designing the relations in the traditional Chinese medicine treatment data: designing the traditional Chinese medicine treatment data into 4 types of entity relations, namely 'traditional Chinese medicine scientific literature-traditional Chinese medicine prescription', 'traditional Chinese medicine prescription-effective component', 'traditional Chinese medicine prescription-medicinal material' and 'traditional Chinese medicine prescription-knowledge category'; designing the 'traditional Chinese medicine scientific literature-traditional Chinese medicine prescription' relation with 3 attributes, namely traditional Chinese medicine scientific literature number, traditional Chinese medicine prescription number and relation type; designing the 'traditional Chinese medicine prescription-effective component' relation with 3 attributes, namely compound number, traditional Chinese medicine prescription number and relation type; designing the 'traditional Chinese medicine prescription-medicinal material' relation with 3 attributes, namely medicinal material number, traditional Chinese medicine prescription number and relation type; and designing the 'traditional Chinese medicine prescription-knowledge category' relation with 3 attributes, namely knowledge category number, traditional Chinese medicine prescription number and relation type;
c-3) constructing the traditional Chinese medicine treatment knowledge graph: analyzing, denoising and normalizing the traditional Chinese medicine entity and relation data designed in steps c-1) and c-2) to form the traditional Chinese medicine treatment knowledge graph;
d) constructing a graph neural network model: fusing the COVID-19 scientific literature knowledge graph constructed in step a), the western medicine treatment knowledge graph constructed in step b) and the traditional Chinese medicine treatment knowledge graph constructed in step c) to form a COVID-19 knowledge graph, constructing a COVID-19 scientific literature fine-grained classification data set based on the COVID-19 knowledge graph, and constructing a GNN-based COVID-19 scientific literature text classification model (CTGC);
e) text classification: for the scientific literature to be classified, inputting its title, abstract and keywords into the trained COVID-19 scientific literature text classification model CTGC, and outputting the category of the scientific literature.
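The entity and relation design of steps a) to c) and the fusion in step d) can be illustrated with a minimal Python sketch. It is not the patented implementation: the class names `Entity`, `Relation` and `KnowledgeGraph`, the function `fuse`, the sample numbers `P1`, `K1` and `D1`, and the relation type strings are all hypothetical, and merging graphs on a shared knowledge category number is an assumption about how the three graphs connect.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    kind: str          # e.g. "scientific_literature", "knowledge_category"
    number: str        # the "number" attribute every entity carries in the claims
    attrs: tuple = ()  # remaining attributes as (name, value) pairs


@dataclass(frozen=True)
class Relation:
    head: tuple        # (kind, number) of the first entity
    tail: tuple        # (kind, number) of the second entity
    rel_type: str      # the "relation type" attribute


@dataclass
class KnowledgeGraph:
    entities: dict = field(default_factory=dict)   # (kind, number) -> Entity
    relations: set = field(default_factory=set)

    def add_entity(self, entity):
        self.entities[(entity.kind, entity.number)] = entity

    def add_relation(self, relation):
        # Both endpoints must already exist in the graph.
        for endpoint in (relation.head, relation.tail):
            if endpoint not in self.entities:
                raise KeyError(f"unknown entity {endpoint}")
        self.relations.add(relation)


def fuse(*graphs):
    """Merge graphs; entities with the same (kind, number) collapse into one node."""
    merged = KnowledgeGraph()
    for graph in graphs:
        merged.entities.update(graph.entities)
        merged.relations |= graph.relations
    return merged


# Hypothetical sample data: one paper and one drug sharing a knowledge category.
literature_kg = KnowledgeGraph()
literature_kg.add_entity(Entity("scientific_literature", "P1", (("title", "Example paper"),)))
literature_kg.add_entity(Entity("knowledge_category", "K1", (("name", "medicine treatment"),)))
literature_kg.add_relation(
    Relation(("scientific_literature", "P1"), ("knowledge_category", "K1"), "belongs_to"))

western_kg = KnowledgeGraph()
western_kg.add_entity(Entity("knowledge_category", "K1", (("name", "medicine treatment"),)))
western_kg.add_entity(Entity("western_medicine", "D1", (("name", "Example drug"),)))
western_kg.add_relation(
    Relation(("knowledge_category", "K1"), ("western_medicine", "D1"), "includes"))

covid_kg = fuse(literature_kg, western_kg)
```

In this sketch the knowledge category `K1` appears in both source graphs but only once in `covid_kg`, which is what links the paper to the drug after fusion.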
2. The fine-grained classification method for COVID-19 scientific literature based on GNN according to claim 1, wherein the basic information extraction step of scientific literature is as follows:
1) word segmentation: dividing the text of the scientific literature into words or phrases; because the COVID-19 scientific literature is in English, the title and abstract data are segmented with the English word segmentation tool nltk;
2) stop-word removal: using nltk to remove stop words such as a, an, the, above and after, which appear in the text in large numbers but have little influence on text classification;
3) lowercasing: uniformly converting the English text to lower case;
4) noise removal: removing special symbols and punctuation from the text;
5) spell checking: checking for spelling errors and correcting them;
6) slang and abbreviations: expanding abbreviations to their full form or converting colloquial expressions into written language;
7) stemming and lemmatization: converting English words into their most basic form;
8) word frequency statistics and filtering: after the text training set is processed, counting the word frequency of the remaining words, filtering out low-frequency words whose word frequency is lower than 5, and keeping the words whose word frequency is at or above this threshold.
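The preprocessing steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the claimed pipeline: the claim names nltk, whereas this sketch uses only the standard library so that it runs without downloading corpora; the stop-word set `STOP_WORDS` is a tiny hypothetical stand-in for a full stop-word list; and spell checking, abbreviation expansion and stemming (steps 5 to 7) are left out.

```python
import re
from collections import Counter

# Tiny stand-in for a full English stop-word list (assumption, not nltk's list).
STOP_WORDS = {"a", "an", "the", "above", "after", "of", "in", "and", "is"}

# Keeps alphanumeric tokens and hyphenated terms such as "covid-19";
# punctuation and other special symbols never match, so they are dropped.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def preprocess(text):
    """Lowercase, tokenize, drop noise characters and remove stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def build_vocabulary(documents, min_frequency=5):
    """Count word frequencies over a training set and drop words seen fewer
    than min_frequency times (the claim filters words with frequency below 5)."""
    counts = Counter(token for doc in documents for token in preprocess(doc))
    return {word for word, count in counts.items() if count >= min_frequency}


sample_tokens = preprocess("The spread of COVID-19, after a lockdown!")
sample_vocabulary = build_vocabulary(["virus vaccine"] * 5 + ["vaccine trial"])
```

Here `sample_tokens` keeps only the content words of the sentence, and `sample_vocabulary` drops `trial` because it occurs once in the toy training set.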
3. The GNN-based COVID-19 scientific literature fine-grained classification method according to claim 1 or 2, wherein the graph neural network GNN in step d) is composed of a convolutional layer, a pooling layer and a fully connected layer, the fully connected layer is implemented with a softmax function, the batch size is 256, the number of threads used for loading data is 4, the regularization coefficient of the loss function is 1e-4, the word embedding dimension is 100, the dropout rate is 0.5, the number of classification categories is 34, and learning rate decay is used with a decay rate of 0.95 and a decay interval of 1.
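The hyperparameters listed in claim 3 can be collected in a small configuration object, as in the hedged sketch below. Several points are assumptions rather than facts from the claim: the value 256 is read as a batch size and the value 1e-4 as a regularization (weight decay) coefficient; the decay is modeled as a stepwise exponential schedule applied every `lr_decay_interval` epochs; and `base_lr` is a hypothetical starting learning rate, since the claim does not give one. The names `CTGCConfig`, `learning_rate` and `softmax` are illustrative only.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CTGCConfig:
    batch_size: int = 256         # claim value 256, read as a batch size (assumption)
    num_workers: int = 4          # threads used for loading data
    weight_decay: float = 1e-4    # claim value 1e-4, read as regularization (assumption)
    embedding_dim: int = 100      # word embedding dimension
    dropout: float = 0.5          # discarding (dropout) rate
    num_classes: int = 34         # number of classification categories
    lr_decay_rate: float = 0.95   # attenuation (decay) rate
    lr_decay_interval: int = 1    # attenuation (decay) interval, in epochs here
    base_lr: float = 1e-3         # hypothetical: not stated in the claim


def learning_rate(config, epoch):
    """Stepwise exponential decay: multiply by the decay rate once per interval."""
    steps = epoch // config.lr_decay_interval
    return config.base_lr * config.lr_decay_rate ** steps


def softmax(logits):
    """Numerically stable softmax, as applied by the final classification layer."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


config = CTGCConfig()
uniform_probabilities = softmax([0.0] * config.num_classes)
```

With equal logits, `uniform_probabilities` assigns the same probability to each of the 34 categories, and `learning_rate(config, 2)` equals the base rate multiplied by 0.95 twice.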
4. The fine-grained classification method for COVID-19 scientific literature based on GNN according to claim 1 or 2, wherein the construction of the knowledge graph of COVID-19 scientific literature in step a), the knowledge graph of western medicine treatment in step b) and the knowledge graph of traditional Chinese medicine treatment in step c) is realized by a Neo4j database tool.
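As a rough illustration of how entities and relations of this kind might be loaded into Neo4j, the sketch below builds parameterized Cypher `MERGE` statements as plain strings. The node labels, relationship type names and the `number` property key are hypothetical choices, not taken from the patent, and no database connection is made. The helper names `merge_node` and `merge_relation` are likewise illustrative.

```python
import re

# Labels and relationship types cannot be passed as Cypher parameters,
# so they are validated before being interpolated into the query text.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _checked(name):
    if not _IDENTIFIER.match(name):
        raise ValueError(f"not a safe Cypher identifier: {name!r}")
    return name


def merge_node(label, number, **properties):
    """Return (query, parameters) that create or update one entity node."""
    query = f"MERGE (n:{_checked(label)} {{number: $number}}) SET n += $props"
    return query, {"number": number, "props": properties}


def merge_relation(head_label, head_number, rel_type, tail_label, tail_number):
    """Return (query, parameters) that link two existing nodes."""
    query = (
        f"MATCH (a:{_checked(head_label)} {{number: $head}}), "
        f"(b:{_checked(tail_label)} {{number: $tail}}) "
        f"MERGE (a)-[:{_checked(rel_type)}]->(b)"
    )
    return query, {"head": head_number, "tail": tail_number}


node_query, node_params = merge_node("ScientificLiterature", "P1", title="Example paper")
rel_query, rel_params = merge_relation(
    "ScientificLiterature", "P1", "BELONGS_TO", "KnowledgeCategory", "K1")
```

Each returned pair could then be handed to a driver call such as `session.run(query, params)` in the official Neo4j Python driver. Using `MERGE` rather than `CREATE` keeps repeated loads from duplicating nodes that share a number.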
CN202011313700.8A 2020-11-20 2020-11-20 COVID-19 scientific literature fine-grained classification method based on GNN Active CN112380345B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011313700.8A CN112380345B (en) 2020-11-20 2020-11-20 COVID-19 scientific literature fine-grained classification method based on GNN

Publications (2)

Publication Number Publication Date
CN112380345A CN112380345A (en) 2021-02-19
CN112380345B CN112380345B (en) 2022-03-29

Family

ID=74587186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011313700.8A Active CN112380345B (en) 2020-11-20 2020-11-20 COVID-19 scientific literature fine-grained classification method based on GNN

Country Status (1)

Country Link
CN (1) CN112380345B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221705B (en) * 2021-04-30 2024-01-09 平安科技(深圳)有限公司 Automatic classification method, device, equipment and storage medium for electronic documents
CN113505583B (en) * 2021-05-27 2023-07-18 山东交通学院 Emotion reason clause pair extraction method based on semantic decision graph neural network
CN113190691B (en) * 2021-05-28 2022-09-23 齐鲁工业大学 Link prediction method and system of knowledge graph
CN113609251A (en) * 2021-06-29 2021-11-05 中国科学院微生物研究所 Method and device for processing coronavirus associated data

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284391A (en) * 2018-12-07 2019-01-29 吉林大学 A kind of document automatic classification method
CN109726298A (en) * 2019-01-08 2019-05-07 上海市研发公共服务平台管理中心 Knowledge mapping construction method, system, terminal and medium suitable for scientific and technical literature
CN110717049A (en) * 2019-08-29 2020-01-21 四川大学 Text data-oriented threat information knowledge graph construction method
CN110807101A (en) * 2019-10-15 2020-02-18 中国科学技术信息研究所 Scientific and technical literature big data classification method
CN110827966A (en) * 2019-11-11 2020-02-21 重庆亚德科技股份有限公司 Regional single disease supervision system
CN110990589A (en) * 2019-12-14 2020-04-10 周世海 Knowledge graph automatic generation method based on deep reinforcement learning
CN111125347A (en) * 2019-12-27 2020-05-08 山东省计算中心(国家超级计算济南中心) Knowledge graph 3D visualization method based on unity3D
CN111222681A (en) * 2019-11-05 2020-06-02 量子数聚(北京)科技有限公司 Data processing method, device, equipment and storage medium for enterprise bankruptcy risk prediction
CN111402971A (en) * 2020-03-06 2020-07-10 浙江大学医学院附属第一医院 Big data-based method and system for quickly identifying adverse drug reactions

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3061717A1 (en) * 2018-11-16 2020-05-16 Royal Bank Of Canada System and method for a convolutional neural network for multi-label classification with partial annotations

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
COVID-19 Literature Knowledge Graph Construction and Drug Repurposing Report Generation; Qingyun Wang et al.; 《NAACL 1 July 2020 Computer Science》; 20200706; full text *
KG-COVID-19: a framework to produce customized knowledge graphs for COVID-19; Justin Reese et al.; 《Patterns》; 20200818; full text *
Construction of a biomedical knowledge graph based on Neo4j; 曹皓伟, 徐建良, 窦方坤; 《计算机时代》; 20200615 (No. 6); full text *

Also Published As

Publication number Publication date
CN112380345A (en) 2021-02-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant