CN112380345B

CN112380345B - COVID-19 scientific literature fine-grained classification method based on GNN

Info

Publication number: CN112380345B
Application number: CN202011313700.8A
Authority: CN
Inventors: 杨帅; 王小红; 赵志刚; 窦方坤; 曹皓伟; 潘景山; 魏志强
Original assignee: Shandong Computer Science Center National Super Computing Center in Jinan
Current assignee: Shandong Computer Science Center National Super Computing Center in Jinan
Priority date: 2020-11-20
Filing date: 2020-11-20
Publication date: 2022-03-29
Anticipated expiration: 2040-11-20
Also published as: CN112380345A

Abstract

The invention discloses a GNN-based fine-grained classification method for COVID-19 scientific literature, which comprises the following steps: a) constructing a knowledge graph of COVID-19 scientific literature; a-1) dividing the knowledge categories; a-2) designing the entities in the scientific literature; a-3) designing the relationships in the scientific literature; a-4) constructing the knowledge graph of COVID-19 scientific literature; b) constructing a western medicine treatment knowledge graph; c) constructing a traditional Chinese medicine treatment knowledge graph; d) building a graph neural network model; e) text classification. The method gives medical workers an effective way to screen and classify a large body (usually more than 10,000 documents) of COVID-19-related scientific literature so that they can quickly find literature of the knowledge category they need; it has obvious beneficial effects and is suitable for application and popularization.

Description

COVID-19 scientific literature fine-grained classification method based on GNN

Technical Field

The invention relates to a scientific literature fine-grained classification method, in particular to a COVID-19 scientific literature fine-grained classification method based on GNN.

Background

Text classification is a common task of natural language processing, and performs automatic classification and labeling on a text set according to a certain classification system or standard. The text classification method is divided into three categories: rule-based methods, machine learning-based methods (data-driven methods), and hybrid methods. Rule-based text classification methods (Rule-based methods) require manual involvement in formulating rules, and are often relatively accurate. However, when the rules are changed or updated, the rules need to be manually re-summarized, and the maintenance cost is high. Moreover, when there are many rules, there is a possibility that the rules may conflict with each other, which makes maintenance difficult. In terms of extensibility, a given rule is difficult to extend into other scenarios, and new scenarios often require rewriting of the rule.

Machine learning-based text classification methods, also called data-driven methods, can be subdivided into two categories: traditional machine learning methods and deep learning-based methods. Traditional machine learning methods include SVM, GBDT and the like; their features must be specified manually, which entails a large amount of data analysis and feature engineering, and because the feature engineering is tied to the business scenario, these methods are difficult to generalize to other scenarios. Deep learning-based text classification automatically learns the relationship between data and labels through a deep learning model and needs no manual intervention other than labeling. Hybrid methods are generally adopted in industry: several deep learning models such as attention, CNN, LSTM and BERT are used together, with rule-based methods for pre-filtering and as a fallback; this approach is simple to implement and consumes few resources. For complex and large-scale learning tasks, deep learning models train better.

Classic text classification models include feed-forward neural networks, DAN (deep averaging network), fastText, Tree-LSTM, Multi-Timescale LSTM (MT-LSTM), TopicRNN, dynamic CNN, Kim-CNN, capsule neural networks, Transformers, and the like. The data sets commonly used in text classification tasks can be grouped by task: sentiment classification, news classification, topic classification, QA data sets, NLI (natural language inference) data sets and the like. The classic data sets of each group are: (1) sentiment classification: Yelp, IMDb, Movie Review, SST, MPQA, Amazon, Aspect-Based Sentiment Analysis; (2) news classification: AG News, 20 Newsgroups, Sogou News, Reuters News, Bing News, NYTimes, BBC, Google News; (3) topic classification: DBpedia, Ohsumed, EUR-Lex, WOS, PubMed 200k RCT, Irony (composed of annotated comments from the social news website Reddit), a Twitter data set for topic classification of tweets, and the arXiv collection; (4) QA data sets: SQuAD, MS MARCO, TREC-QA, WikiQA, Quora, Situations With Adversarial Generations (SWAG), SelQA; (5) NLI data sets: SNLI, Multi-NLI, SICK, MSRP, Semantic Textual Similarity (STS), RTE, SciTail.

The existing text classification method achieves good effects, but the problems of lack of data sets of complex text scenes, poor interpretability, poor model design, small sample learning and the like still exist.

(1) Complex text classification scenarios lack a data set. Although the text classification algorithm based on deep learning has a good effect on many data sets, in some complex text classification scenes, the data sets are still lacked, and the effect of the current model cannot be verified, such as QA, multi-step text inference, multi-language text classification and the like in the complex scenes.

(2) Knowledge modeling of text information. The knowledge in text information needs to be modeled, for example by building a knowledge base or a knowledge graph, and analysis and inference are then performed on that knowledge.

(3) Deep learning is poorly interpretable. In many real business scenarios, business parties are more concerned about the interpretability problem, but the interpretability of current deep learning is not strong.

(4) Smaller, more efficient models. With the appearance of models such as BERT, models have become larger and larger and training consumes more and more resources; training time under limited resources is very long, so industry has a greater need for smaller models that can be trained efficiently.

(5) Small-sample learning (zero-shot and few-shot learning). Current deep learning models depend too heavily on large amounts of labeled data; an open question is whether knowledge can be introduced to address the shortage of labeled samples.

A knowledge graph is a semantic network that reveals the relationships between entities. According to the Wikipedia entry, the Knowledge Graph is the knowledge base that Google uses to enhance its search engine. Essentially, a knowledge graph describes the various entities or concepts that exist in the real world and their relationships, and constitutes a huge semantic network graph in which nodes represent entities or concepts and edges represent attributes or relationships. Today, the term knowledge graph is used broadly to refer to a variety of large-scale knowledge bases.

A knowledge graph is composed of entities, semantic classes (concepts), content, attributes (values), and relationships. An entity is something that is distinguishable and exists independently, such as a person, a city or a commodity. A semantic class (concept) is a collection of entities with common characteristics, such as countries or ethnic groups. Content is typically the name, description or interpretation of an entity or semantic class, and may be expressed as text, images, audio, video, etc. Attributes (values) are the different types of properties that an entity has. Relationships are the connections between entities. At present, the better-known large-scale knowledge graphs include Freebase, Google Knowledge Graph, DBpedia, Wikipedia, Baidu Knowledge Graph, Sogou Knowledge Cube, etc.

Novel coronavirus pneumonia (Corona Virus Disease 2019, COVID-19), called new coronavirus pneumonia for short and named coronavirus disease 2019 by the World Health Organization, is pneumonia caused by infection with the 2019 novel coronavirus. In the face of such a severe epidemic, a COVID-19 knowledge base is urgently needed to provide data support for science-based epidemic prevention.

Specifically, in the face of the data of the COVID-19 scientific literature, no method for directly classifying the fine granularity of the COVID-19 scientific literature exists at present. On one hand, fine-grained modeling needs to be carried out on COVID-19 scientific literature information; on the other hand, the problems of model design, small sample learning and the like need to be solved. At the beginning of the outbreak of the COVID-19 epidemic, the original literature is not yet large, and the work of knowledge classification can be completed by manpower, but at present, the overall scale of the COVID-19 scientific literature reaches more than ten thousand. In the face of such a huge scale, manual classification is no longer feasible, so that a machine learning method is urgently needed to realize fine-grained classification of the COVID-19 scientific literature so as to solve the practical problem.

Disclosure of Invention

To overcome the shortcomings of the prior art described above, the invention provides a COVID-19 scientific literature fine-grained classification method based on GNN.

The invention discloses a GNN-based COVID-19 scientific literature fine-grained classification method, which is characterized by comprising the following steps of:

a) constructing a knowledge graph of COVID-19 scientific literature;

a-1) dividing the knowledge categories: acquiring a certain number of medical journal scientific documents related to COVID-19, extracting the basic information of each scientific document, and assigning each scientific document to one of the knowledge categories of virus naming, virus detection, origin and variation, virus transmission, cross-species transmission, pathogenic molecular mechanism, pathogenic mechanism, human immunity, vaccine development, vaccine treatment, novel therapy, clinical application of drugs, drug development, drug treatment or clinical research, with scientific documents that belong to none of these knowledge categories placed in other categories;

a-2) designing the entities in the scientific literature: designing the scientific literature data as 5 types of entities, namely scientific literature, knowledge category, scientific researcher, academic journal and scientific research institution; designing the scientific literature entity with 10 attributes, namely paper number, publication time, title, paper address, information address, abstract, content summary, DOI (digital object identifier), keywords and remarks; designing the knowledge category entity, according to the division in step a-1), with 2 attributes, namely knowledge category number and knowledge category name; designing the scientific researcher entity with 2 attributes, namely scientific researcher number and scientific researcher name; designing the academic journal entity with 2 attributes, namely academic journal number and academic journal name; and designing the scientific research institution entity with 2 attributes, namely scientific research institution number and scientific research institution name;

a-3) designing the relationships in the scientific literature: designing the scientific literature data as 5 types of relationships, namely scientific research institution-scientific researcher, scientific literature-scientific research institution, scientific literature-scientific researcher, scientific literature-knowledge category and scientific literature-academic journal; designing the scientific research institution-scientific researcher relationship with 3 attributes, namely scientific researcher number, scientific research institution number and relationship type; designing the scientific literature-scientific research institution relationship with 3 attributes, namely scientific literature number, scientific research institution number and relationship type; designing the scientific literature-scientific researcher relationship with 3 attributes, namely scientific literature number, scientific researcher number and relationship type; designing the scientific literature-knowledge category relationship with 3 attributes, namely scientific literature number, knowledge category number and relationship type; and designing the scientific literature-academic journal relationship with 3 attributes, namely scientific literature number, academic journal number and relationship type;

a-4) constructing a COVID-19 scientific literature knowledge graph, and analyzing, denoising and normalizing the data of the scientific literature entities and the relationship constructed in the steps a-2) and a-3) to form the scientific literature knowledge graph;

b) constructing a western medicine treatment knowledge graph;

b-1) designing the entities in the western medicine treatment data: designing the western medicine treatment data as 3 types of entities, namely western medicine, western medicine related thesis and knowledge category; designing the western medicine entity with 9 attributes, namely western medicine number, western medicine name, recording number, medicine category, description, extension number, CAS number, InChI code and SMILES code; and designing the western medicine related thesis entity with 8 attributes, namely western medicine related thesis number, thesis name, thesis publication time, thesis link, thesis abstract, DOI number, remarks and keywords;

b-2) designing the relation in the western medicine treatment data, namely designing the western medicine treatment relation into 2 types of relations of 'knowledge category-western medicine' and 'western medicine-western medicine related thesis', designing the 'knowledge category-western medicine' into 3 attributes of knowledge category number, western medicine number and relation type, and designing the 'western medicine-western medicine related thesis' into 3 attributes of western medicine number, western medicine related thesis number and relation type;

b-3) constructing the western medicine treatment knowledge graph: analyzing, denoising and normalizing the western medicine entity and relationship data constructed in steps b-1) and b-2) to form the western medicine treatment knowledge graph;

c) constructing a traditional Chinese medicine treatment knowledge graph;

c-1) designing the entities in the traditional Chinese medicine treatment data: designing the traditional Chinese medicine treatment data as 5 types of entities, namely traditional Chinese medicine prescription, traditional Chinese medicine scientific document, medicinal material, effective component and knowledge category; designing the traditional Chinese medicine prescription entity with 2 attributes, namely traditional Chinese medicine prescription number and traditional Chinese medicine prescription name; designing the traditional Chinese medicine scientific document entity with 10 attributes, namely traditional Chinese medicine scientific document number, title, knowledge link, traditional Chinese medicine prescription, medicinal material, effective component, purpose, method, result and conclusion; designing the medicinal material entity with 2 attributes, namely medicinal material number and medicinal material name; designing the effective component entity with 2 attributes, namely compound number and compound name; and designing the knowledge category as drug treatment;

c-2) designing the relationships in the traditional Chinese medicine treatment data: designing the traditional Chinese medicine treatment data as 4 types of entity relationships, namely 'traditional Chinese medicine scientific document-traditional Chinese medicine prescription', 'traditional Chinese medicine prescription-effective component', 'traditional Chinese medicine prescription-medicinal material' and 'traditional Chinese medicine prescription-knowledge category'; designing the 'traditional Chinese medicine scientific document-traditional Chinese medicine prescription' relationship with 3 attributes, namely traditional Chinese medicine scientific document number, traditional Chinese medicine prescription number and relationship type; designing the 'traditional Chinese medicine prescription-effective component' relationship with 3 attributes, namely compound number, traditional Chinese medicine prescription number and relationship type; designing the 'traditional Chinese medicine prescription-medicinal material' relationship with 3 attributes, namely medicinal material number, traditional Chinese medicine prescription number and relationship type; and designing the 'traditional Chinese medicine prescription-knowledge category' relationship with 3 attributes, namely knowledge category number, traditional Chinese medicine prescription number and relationship type;

c-3) constructing a traditional Chinese medicine treatment knowledge graph, and analyzing, denoising and normalizing the data of the traditional Chinese medicine entities and the relationship constructed in the steps c-1) and c-2) to form the traditional Chinese medicine treatment knowledge graph;

d) constructing a graph neural network model, fusing the COVID-19 scientific literature knowledge graph constructed in the step a), the western medicine treatment knowledge graph constructed in the step b) and the traditional Chinese medicine treatment knowledge graph constructed in the step c) to form a COVID-19 knowledge graph, constructing a COVID-19 scientific literature fine-grained classification data set based on the COVID-19 knowledge graph, and constructing a COVID-19 scientific literature text classification model (CTGC) based on GNN;

e) text classification: for a scientific document to be classified, inputting its title, abstract and keywords into the trained COVID-19 scientific literature text classification model CTGC, and outputting the classification of the scientific document.
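Steps d) and e) can be illustrated with a minimal in-memory sketch of fusing the separate knowledge graphs and deriving labeled samples from the fused graph. This is not the patented implementation: the `KnowledgeGraph` class, the `fuse` and `build_dataset` helpers, the node identifiers, the `drugName` field and all example values are hypothetical, and the sketch assumes that graphs are fused by taking the union of nodes and edges and that each `belongto` edge yields one (text, label) sample.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    # node id -> attribute dict; edges are (source id, relation type, target id)
    nodes: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)

    def add_node(self, node_id, **attrs):
        self.nodes.setdefault(node_id, {}).update(attrs)

    def add_edge(self, src, rel, dst):
        self.edges.add((src, rel, dst))


def fuse(*graphs):
    """Union the graphs; nodes that share an id are merged (assumed fusion rule)."""
    fused = KnowledgeGraph()
    for g in graphs:
        for node_id, attrs in g.nodes.items():
            fused.add_node(node_id, **attrs)
        fused.edges |= g.edges
    return fused


def build_dataset(graph):
    """Turn every paper -belongto-> category edge into a (text, label) pair."""
    samples = []
    for src, rel, dst in sorted(graph.edges):
        if rel != "belongto":
            continue
        paper = graph.nodes[src]
        parts = [paper.get("paperTitle", ""),
                 paper.get("paperAbstract", ""),
                 paper.get("paperKeywords", "")]
        text = " ".join(p for p in parts if p)
        samples.append((text, graph.nodes[dst]["CategoryName"]))
    return samples


# Placeholder records, made up purely for illustration.
literature = KnowledgeGraph()
literature.add_node("paper:1", paperTitle="example title",
                    paperAbstract="example abstract",
                    paperKeywords="example keyword")
literature.add_node("category:4", CategoryName="Virus transmission")
literature.add_edge("paper:1", "belongto", "category:4")

western = KnowledgeGraph()
western.add_node("category:14", CategoryName="Drug treatment")
western.add_node("drug:1", drugName="example drug")  # hypothetical field name
western.add_edge("category:14", "include", "drug:1")

covid_graph = fuse(literature, western)
dataset = build_dataset(covid_graph)
print(dataset)
```

The resulting (text, label) pairs correspond to the title, abstract and keyword inputs named in step e); how the actual CTGC model consumes the graph structure is not specified at this level of the description.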

In the GNN-based COVID-19 scientific literature fine-grained classification method of the invention, the extraction of the basic information of the scientific literature comprises the following steps:

1) word segmentation: first dividing the text of the scientific literature into words or phrases; because the COVID-19 scientific literature is in English, the title and abstract data are segmented with the English tokenization tool NLTK;

2) stop word removal: using NLTK to remove stop words such as a, an, the, above and after, which appear in the text in large numbers but have little influence on text classification;

3) lowercasing: uniformly converting the English text into lower case;

4) noise removal: removing special symbols and punctuation from the text;

5) spell checking: checking for spelling errors and correcting them;

6) slang and abbreviations: expanding abbreviations to their full form or converting colloquial expressions into written language;

7) stemming and lemmatization: converting English words into their base form;

8) word frequency statistics and filtering: after the text training set has been processed, counting the frequency of the remaining words, filtering out low-frequency words whose frequency is lower than 5, and keeping the words whose frequency is not lower than the threshold.
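A minimal sketch of steps 1) to 4) and 8) is given below for orientation. To stay self-contained it swaps NLTK for a regular-expression tokenizer and a tiny hand-written stop-word set; both are stand-ins, not what the method itself uses. Steps 5) to 7) (spell checking, abbreviation expansion, stemming and lemmatization) are left out because they need external dictionaries. The function names are hypothetical.

```python
import re
from collections import Counter

# Small stand-in stop-word set (assumption); the method uses NLTK's list.
STOP_WORDS = {"a", "an", "the", "above", "after", "of", "in", "and", "is", "to"}

# Keeps hyphenated terms such as "covid-19" together as one token.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def preprocess(text):
    """Steps 1-4: lowercase, tokenize, strip punctuation/symbols, drop stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def filter_low_frequency(docs, min_count=5):
    """Step 8: drop words whose frequency over all documents is below min_count."""
    freq = Counter(token for doc in docs for token in doc)
    return [[t for t in doc if freq[t] >= min_count] for doc in docs]


tokens = preprocess("The Spread of COVID-19, after an Outbreak!")
filtered = filter_low_frequency([["virus", "virus", "rare"],
                                 ["virus", "virus", "virus"]])
print(tokens)
print(filtered)
```

The threshold of 5 comes from step 8); treating a frequency of exactly 5 as kept is an interpretation of "lower than 5" being filtered.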

In the GNN-based COVID-19 scientific literature fine-grained classification method of the invention, the graph neural network GNN in step d) is composed of a convolution layer, a pooling layer and a fully connected layer, the fully connected layer being implemented with a softmax function; the batch size is 256, the number of threads used for loading data is 4, the regularization coefficient of the loss function is 1e-4, the word embedding dimension is 100, the dropout rate is 0.5, the number of classification categories is 34, and learning rate decay is used with a decay rate of 0.95 and a decay interval of 1.
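The settings above can be collected in a small configuration object, sketched here together with a stepwise learning rate decay and a softmax function. The field names are hypothetical, the reading of 256 as a batch size and of 1e-4 as a regularization coefficient is an interpretation of the translated text, the initial learning rate is not given in the source and is an assumed placeholder, and the stepwise decay rule shown is one common form that the source does not confirm.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CTGCConfig:
    batch_size: int = 256          # read from "size of a data set is 256" (interpretation)
    num_workers: int = 4           # threads used for loading data
    weight_decay: float = 1e-4     # regularization of the loss (interpretation)
    embed_dim: int = 100           # word embedding dimension
    dropout: float = 0.5           # dropout ("discarding") rate
    num_classes: int = 34          # 15 named categories plus 19 remaining ones
    lr_decay_rate: float = 0.95
    lr_decay_interval: int = 1     # assumed to be counted in epochs
    initial_lr: float = 1e-3       # NOT in the source; placeholder assumption


def decayed_lr(cfg, epoch):
    """Stepwise decay: multiply by the decay rate once per decay interval."""
    steps = epoch // cfg.lr_decay_interval
    return cfg.initial_lr * cfg.lr_decay_rate ** steps


def softmax(logits):
    """Numerically stable softmax over one vector of class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


cfg = CTGCConfig()
print([decayed_lr(cfg, epoch) for epoch in range(3)])
print(softmax([0.0] * cfg.num_classes)[0])
```

With a decay rate of 0.95 and an interval of 1, the learning rate after n epochs is the initial rate times 0.95 to the power n under this assumed schedule.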

In the GNN-based COVID-19 scientific literature fine-grained classification method of the invention, the COVID-19 scientific literature knowledge graph in step a), the western medicine treatment knowledge graph in step b) and the traditional Chinese medicine treatment knowledge graph in step c) are constructed with the Neo4j database tool.

The invention has the following beneficial effects: in the GNN-based COVID-19 scientific literature fine-grained classification method of the invention, the acquired scientific literature is first divided into knowledge categories, and a scientific literature knowledge graph is constructed from the attributes of the scientific literature entities and the entity relationships; knowledge graphs of the western medicine data and the traditional Chinese medicine data are then established from their respective entities and entity relationships; finally, a scientific literature text classification model (CTGC) is constructed with a graph neural network (GNN). For a scientific document to be classified, its classification is obtained with the trained CTGC model; the classification is accurate and fine-grained. The method thus gives medical workers an effective way to screen and classify a massive body (usually more than 10,000 documents) of COVID-19-related scientific literature and quickly find literature of the knowledge category they need (such as virus naming, virus detection, origin and variation, virus transmission, cross-species transmission, pathogenic molecular mechanism, pathogenic mechanism, human immunity, vaccine development, vaccine treatment, novel therapy, clinical application of drugs, drug development, drug treatment or clinical research); it has obvious beneficial effects and is suitable for application and popularization.

Drawings

FIG. 1 is a schematic representation of the COVID-19 scientific literature knowledge graph constructed in the present invention;

FIG. 2 is a schematic diagram of a western medicine treatment knowledge map constructed in the present invention;

FIG. 3 is a schematic diagram of a knowledge graph of Chinese medicine treatment constructed in the present invention;

FIG. 4 is a schematic diagram of the COVID-19 knowledge graph formed by fusing the COVID-19 scientific literature knowledge graph, the western medicine treatment knowledge graph and the traditional Chinese medicine treatment knowledge graph constructed in the invention.

FIGS. 1 to 4 are only schematic diagrams of the constructed knowledge graphs, provided to give an intuitive impression and aid understanding; the completeness and sufficiency of disclosure of the GNN-based COVID-19 scientific literature fine-grained classification method described in the invention are not affected even without FIGS. 1 to 4.

Detailed Description

The invention is further described with reference to the following figures and examples.

The COVID-19 knowledge graph is constructed mainly from the most recently published scientific literature, and the key step in its construction is fine-grained classification of the scientific literature according to its research focus, which is tedious and laborious work. At the beginning of the COVID-19 outbreak the body of literature was still small and knowledge classification could be done manually, but the overall scale of the COVID-19 scientific literature has now reached more than ten thousand documents. At such a scale manual classification is no longer feasible, so a machine learning method is urgently needed to achieve fine-grained classification of the COVID-19 scientific literature and solve this practical problem.

A total of 491 scientific documents were obtained, whose knowledge content covers virus origin, transmission, isolation and control, naming, clinical research, human immunity, drug treatment, clinical application of drugs, vaccine development, pathogenic mechanism and pathogenic molecular mechanism. The drug treatment knowledge is divided into two parts, western medicine treatment and traditional Chinese medicine treatment: the western medicine part includes 23 medicines and 122 related scientific documents; the traditional Chinese medicine part includes 40 traditional Chinese medicine prescriptions for treating COVID-19 drawn from 49 scientific documents, together with the 112 Chinese medicinal materials and 86 effective components involved in those prescriptions. Table 1 gives the statistics of the scientific literature sources:

TABLE 1

The scientific literature contains a large number of preprints that have not yet been peer reviewed; the preprint data mainly come from bioRxiv, medRxiv and other sources in the biomedical field, and can be updated in batches once the documents are formally published. International top-tier journals and medical journals (such as Cell, Nature, Science, JAMA, Lancet, The New England Journal of Medicine and British Medical Journal) account for as much as 56%, so the data quality is very high. In addition, the knowledge coverage of the scientific literature is comprehensive and fine-grained: the COVID-19 knowledge graph construction process not only extracts the basic information of each scientific document but also classifies each document, for example into virus detection, virus origin, transmission, clinical research, human immunity, clinical application of drugs, vaccine development, pathogenic mechanism and other aspects.

The COVID-19 scientific literature fine-grained classification method based on GNN of the invention obtains 491 scientific literature in total, and is realized by the following steps:

a) constructing a knowledge graph of COVID-19 scientific literature;

Table 2 gives the statistics of the knowledge categories of the scientific literature:

TABLE 2

Knowledge category	Number of documents (share)
Virus nomenclature	3 (1%)
Virus detection	8 (3%)
Origin and variation	26 (8%)
Viral transmission	45 (14%)
Inter-species spread	7 (2%)
Molecular mechanisms of pathogenesis	20 (6%)
Pathogenic mechanism	46 (14%)
Human immunity	13 (4%)
Vaccine development	9 (3%)
Vaccine treatment	2 (1%)
Novel therapy	7 (2%)
Clinical application of medicine	18 (6%)
Drug development	7 (2%)
Medical treatment	12 (4%)
Clinical research	72 (23%)
The remaining 19 categories	25 (7%)

As can be seen, clinical research (23%), drug treatment (14%), COVID-19 pathogenesis (20%), and virus transmission and variation (24%) have been the hot spots of scientific research in the recent period.
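The shares in Table 2 can be re-derived from the document counts with a short calculation, sketched below. The counts are copied from Table 2; the `share_percent` helper and the half-up rounding rule are assumptions, since the source does not say how the percentages were rounded.

```python
# Document counts per knowledge category, copied from Table 2.
COUNTS = {
    "Virus nomenclature": 3,
    "Virus detection": 8,
    "Origin and variation": 26,
    "Viral transmission": 45,
    "Inter-species spread": 7,
    "Molecular mechanisms of pathogenesis": 20,
    "Pathogenic mechanism": 46,
    "Human immunity": 13,
    "Vaccine development": 9,
    "Vaccine treatment": 2,
    "Novel therapy": 7,
    "Clinical application of medicine": 18,
    "Drug development": 7,
    "Medical treatment": 12,
    "Clinical research": 72,
    "The remaining 19 categories": 25,
}


def share_percent(count, total):
    """Whole-number percentage of total, rounded half up (integer arithmetic)."""
    return (200 * count + total) // (2 * total)


total = sum(COUNTS.values())
shares = {name: share_percent(n, total) for name, n in COUNTS.items()}

print(total)
for name, pct in shares.items():
    print(f"{name}: {COUNTS[name]} ({pct}%)")
```

The counts sum to 320, which equals the 491 documents obtained minus the 122 western medicine and 49 traditional Chinese medicine documents; the source does not state this relationship explicitly, so it is an inference. Under half-up rounding every row reproduces the listed share except the last, which computes to 8% against the listed 7%, possibly so that the column sums to 100%.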

The field design of the scientific literature entity is shown in Table 3:

TABLE 3

Scientific literature entity attribute	Field design
Paper number	paperId
Publication time	paperPubtime
Title	paperTitle
Paper address	paperLink
Information address	paperNewsLink
Abstract	paperAbstract
Content summary	paperContent
DOI number	paperDOI
Keywords	paperKeywords
Remarks	paperNotes

The knowledge category entity has 2 attributes, the knowledge category number and the knowledge category name; the specific field design is shown in Table 4:

TABLE 4

Knowledge category entity attribute	Field design
Knowledge category number	CategoryId
Knowledge category name	CategoryName

The scientific researcher entity has 2 attributes, the scientific researcher number and the scientific researcher name; the specific field design is shown in Table 5:

TABLE 5

Scientific researcher entity attribute	Field design
Scientific researcher number	peopleId
Scientific researcher name	peopleName

The academic journal entity has 2 attributes, the academic journal number and the academic journal name; the specific field design is shown in Table 6:

TABLE 6

Academic journal entity attribute	Field design
Academic journal number	journalId
Academic journal name	journalName

The scientific research institution entity has 2 attributes, the scientific research institution number and the scientific research institution name; the specific field design is shown in Table 7:

TABLE 7

Scientific research institution entity attribute	Field design
Scientific research institution number	institutionId
Scientific research institution name	institutionName

The "scientific research institution-scientific researcher" relationship describes the affiliation between scientific research institutions and scientific researchers; the field design of this relationship is shown in Table 8:

TABLE 8

"Scientific research institution-scientific researcher" relationship attribute	Field design
Scientific researcher number	peopleId
Scientific research institution number	institutionId
Relationship type	workin

The "scientific literature-scientific research institution" relationship describes the relationship between scientific literature and scientific research institutions; the field design of this relationship is shown in Table 9:

TABLE 9

"Scientific literature-scientific research institution" relationship attribute	Field design
Scientific literature number	paperId
Scientific research institution number	institutionId
Relationship type	platformin

The "scientific literature-scientific researcher" relationship describes which scientific researchers published each scientific document; the field design of this relationship is shown in Table 10:

TABLE 10

"Scientific literature-scientific researcher" relationship attribute	Field design
Scientific literature number	paperId
Scientific researcher number	peopleId
Relationship type	author

The "scientific literature-knowledge category" relationship describes the knowledge category of each scientific document; the field design of this relationship is shown in Table 11:

TABLE 11

"Scientific literature-knowledge category" relationship attribute	Field design
Scientific literature number	paperId
Knowledge category number	CategoryId
Relationship type	belongto

The "scientific literature-academic journal" relationship describes the journal in which each scientific document was published; the field design of this relationship is shown in Table 12:

TABLE 12

"Scientific literature-academic journal" relationship attribute	Field design
Scientific literature number	paperId
Academic journal number	journalId
Relationship type	publishin

With the above design, the metadata is parsed, denoised and normalized and then imported into the Neo4j database; the resulting knowledge graph is shown in FIG. 1.
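As an illustration of how entity and relationship records with the field names from Tables 3 to 12 might be turned into Neo4j import statements, the sketch below builds parameterised Cypher `MERGE` queries as plain strings. The node labels (`Paper`, `Category`), the helper names, the direction of the relationship and the example values are all hypothetical; running the queries would additionally need a Neo4j driver session, which is not shown.

```python
def node_statement(label, key, record):
    """Return a parameterised Cypher MERGE (query, params) for one entity record."""
    query = f"MERGE (n:{label} {{{key}: $key}}) SET n += $props"
    return query, {"key": record[key], "props": record}


def relation_statement(src, dst, rel_type, src_id, dst_id):
    """Return a Cypher MERGE for one relationship record.

    src and dst are (label, key field) pairs; rel_type is a value from the
    "Relationship type" rows of the tables, e.g. "belongto".
    """
    (src_label, src_key), (dst_label, dst_key) = src, dst
    query = (
        f"MATCH (a:{src_label} {{{src_key}: $src}}), "
        f"(b:{dst_label} {{{dst_key}: $dst}}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )
    return query, {"src": src_id, "dst": dst_id}


# Placeholder records, made up purely for illustration.
paper_query, paper_params = node_statement(
    "Paper", "paperId", {"paperId": "P1", "paperTitle": "example title"})
rel_query, rel_params = relation_statement(
    ("Paper", "paperId"), ("Category", "CategoryId"), "belongto", "P1", "C4")

print(paper_query)
print(rel_query)
```

Using `MERGE` rather than `CREATE` keeps repeated imports from duplicating nodes or relationships, which fits the normalization step but is a design choice of this sketch rather than something the source prescribes.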

b) Constructing a western medicine treatment knowledge map;

The "western medicine related paper" entity refers to a paper related to a western medicine entity. The entity has 8 attributes: western medicine related paper number, paper title, publication time, paper link, paper abstract, DOI number, remarks (PMID, PMCID) and keywords. The specific field design is shown in Table 13:

TABLE 13

Western medicine related paper entity attribute	Field design
Western medicine related paper number	drugPapersID
Paper title	drugPapersTitle
Publication time	drugPapersPublishTime
Paper link	drugPapersLink
Paper abstract	drugPapersAbstract
DOI number	drugPapersDOI
Remarks (PMID, PMCID)	drugPapersNotes
Keywords	drugPapersKeywords

The "knowledge category-western medicine" relationship assigns the 23 western medicines to their knowledge category ("medication"); the field design of this relationship is shown in Table 14:

TABLE 14

"Knowledge category-western medicine" relationship attribute	Field design
Knowledge category number	CategoryId
Western medicine number	drugID
Type of relationship	include

The "western medicine-western medicine related paper" relationship links each western medicine to its related papers; the field design of this relationship is shown in Table 15:

TABLE 15

"Western medicine-western medicine related paper" relationship attribute	Field design
Western medicine number	drugID
Western medicine related paper number	drugPapersID
Type of relationship	related

Based on this design, the metadata is parsed, denoised and normalized and then imported into a Neo4j database; the resulting knowledge graph is shown in Fig. 2.

c) Constructing a Chinese medicine treatment knowledge map;

This knowledge graph comprises 40 Chinese medicinal formulas for treating COVID-19, drawn from 49 selected scientific documents on Chinese herbal medicines. On the basis of the 40 formulas, the data also include 112 Chinese medicinal materials and 86 effective components of Chinese medicines, which can support research such as mining effective components and exploring the mechanisms of action of Chinese medicines.

The Chinese medicinal formula entity comprises 40 Chinese medicinal formulas for treating COVID-19. Its attributes are the formula number and the formula name; the specific field design is shown in Table 16:

TABLE 16

Chinese medicinal formula entity attribute	Field design
Chinese medicinal formula number	herbPrescriptionId
Chinese medicinal formula name	herbPrescriptionName

The Chinese medicine science literature entity comprises 49 pieces of scientific literature data. The entity has 10 attributes: Chinese medicine science literature number, title, knowledge link, Chinese medicinal formula, medicinal materials, active ingredients, purpose, method, results and conclusion. The specific field design is shown in Table 17:

TABLE 17

Chinese medicine science literature entity attribute	Field design
Chinese medicine science literature number	herbPaperId
Title	herbPaperTitle
Knowledge link	herbNewsLink
Chinese medicinal formula	herbPrescription
Medicinal materials	herbComponent
Active ingredient	herbCompound
Purpose	goal
Method	method
Results	result
Conclusion	conclusion

The medicinal material entity comprises 112 traditional Chinese medicinal materials. Its attributes are the medicinal material number and the medicinal material name; the specific field design is shown in Table 18:

TABLE 18

Medicinal material entity attribute	Field design
Medicinal material number	herbId
Medicinal material name	herbName

The effective component entity comprises the 86 effective compound components in the 40 Chinese medicinal formulas. Its attributes are the compound number and the compound name; the specific field design is shown in Table 19:

TABLE 19

Effective component entity attribute	Field design
Compound number	compoundId
Compound name	compoundName

The knowledge category entity has the value of 'medication', and is designed to be associated with the knowledge category of the scientific literature.

The "Chinese medicine science literature-Chinese medicinal formula" relationship records which formulas each piece of literature introduces; the field design of this relationship is shown in Table 20:

TABLE 20

"Chinese medicine science literature-Chinese medicinal formula" relationship attribute	Field design
Chinese medicine science literature number	herbPaperId
Chinese medicinal formula number	herbPrescriptionId
Type of relationship	Introduce

The "Chinese medicinal formula-active ingredient" relationship clarifies the effective compounds in each formula for treating COVID-19, and the field design of such relationship is shown in Table 21:

TABLE 21

"Chinese medicinal formula-active ingredient" relationship attribute	Field design
Compound number	compoundId
Chinese medicinal formula number	herbPrescriptionId
Type of relationship	refine

The relationship between Chinese medicinal formula and medicinal material clarifies the medicinal material components of each formula, and the field design of the relationship is shown in Table 22:

TABLE 22

"Chinese medicinal formula-medicinal material" relationship attribute	Field design
Medicinal material number	herbId
Chinese medicinal formula number	herbPrescriptionId
Type of relationship	consistof

The "Chinese medicinal formula-knowledge class" relationship puts the data of 40 Chinese medicinal formulas for treating COVID-19 into the "medication" knowledge class for fusion with (1), and the fields of the relationship are designed as shown in Table 23:

TABLE 23

"Chinese medicinal formula-knowledge category" relationship attribute	Field design
Knowledge category number	CategoryId
Chinese medicinal formula number	herbPrescriptionId
Type of relationship	belongto

Based on this design, the metadata is parsed, denoised and normalized and then imported into a Neo4j database; the resulting knowledge graph is shown in Fig. 3.
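To make the structure of the Chinese medicine treatment graph more concrete, the sketch below holds a few relationships as in-memory triples and looks up everything linked to one formula. The relationship types (`Introduce`, `consistof`, `refine`, `belongto`) come from Tables 20 to 23; all identifiers are placeholders, the relationship directions are assumed, and the helper `neighbours` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (start id, relationship type, end id); every identifier is a placeholder.
TRIPLES: List[Tuple[str, str, str]] = [
    ("herbPaper-1", "Introduce", "prescription-1"),
    ("prescription-1", "consistof", "herb-1"),
    ("prescription-1", "consistof", "herb-2"),
    ("prescription-1", "refine", "compound-1"),
    ("prescription-1", "belongto", "category-medication"),
]


def neighbours(triples: List[Tuple[str, str, str]], start_id: str) -> Dict[str, List[str]]:
    """Group the nodes reachable from start_id by relationship type."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for start, rel_type, end in triples:
        if start == start_id:
            grouped[rel_type].append(end)
    return dict(grouped)


print(neighbours(TRIPLES, "prescription-1"))
```

In a graph database the same lookup would be a one-hop pattern match starting from the formula node; the triple list only mirrors that idea without requiring a database.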

Graph Neural Networks (GNNs) are a natural generalization of convolutional neural networks to non-Euclidean data domains. Briefly, a GNN combines graph data with a neural network and performs end-to-end computation on the graph data; it can perform representation learning and knowledge reasoning on graph-structured data and has strong interpretability, which is an advantage of GNNs. These characteristics make GNNs particularly suitable for machine learning tasks based on knowledge graph data.
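To illustrate what combining graph data with a neural network can look like, the following sketch implements a single graph-convolution step in plain Python using the widely used normalized propagation rule H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W). This rule is an assumption: the patent does not state which propagation rule its model uses. The toy adjacency matrix, features and weights are invented for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two dense matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj: Matrix) -> Matrix:
    """Return D^-1/2 (A + I) D^-1/2, adding self-loops before normalizing."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj: Matrix, features: Matrix, weights: Matrix) -> Matrix:
    """One graph-convolution step followed by ReLU."""
    propagated = matmul(matmul(normalize_adjacency(adj), features), weights)
    return [[max(0.0, value) for value in row] for row in propagated]


# Toy graph: three nodes in a chain 0 - 1 - 2 (illustrative values only).
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0], [-1.0]]
hidden = gcn_layer(adj, features, weights)
print(hidden)
```

Each output row mixes a node's own features with those of its neighbours, which is how graph structure enters the learned representation.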

The invention provides a GNN-based COVID-19 scientific literature text classification model, CTGC (COVID-19 Text GCN Classification). The model consists of a convolution layer, a pooling layer and a fully connected layer; the fully connected layer performs multi-class classification of COVID-19 scientific literature through a softmax function. The hyper-parameter settings of the model are shown in Table 24:

TABLE 24

Hyper-parameter	Description	Value
batch_size	Batch size	256
num_workers	Number of threads used to load data	4
weight_decay	Regularization coefficient of the loss function	1e-4
embed_size	Word embedding dimension	100
drop_prop	Dropout rate	0.5
classes	Number of classification categories	34
use_lrdecay	Whether learning rate decay is used	True
lr_decay	Decay rate	0.95
n_epoch	Decay interval	1
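The settings in table 24 can be collected in a small configuration object. The sketch below also shows one plausible reading of `use_lrdecay`, `lr_decay` and `n_epoch`: the learning rate is multiplied by 0.95 once every epoch. The class name `CTGCConfig`, the helper `learning_rate_at` and the initial learning rate of 1e-3 are assumptions, since the source does not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CTGCConfig:
    """Hyper-parameter values from table 24 (the class name is hypothetical)."""
    batch_size: int = 256
    num_workers: int = 4
    weight_decay: float = 1e-4
    embed_size: int = 100
    drop_prop: float = 0.5
    classes: int = 34
    use_lrdecay: bool = True
    lr_decay: float = 0.95
    n_epoch: int = 1


def learning_rate_at(epoch: int, base_lr: float, cfg: CTGCConfig) -> float:
    """Step decay: multiply base_lr by lr_decay once every n_epoch epochs."""
    if not cfg.use_lrdecay:
        return base_lr
    return base_lr * cfg.lr_decay ** (epoch // cfg.n_epoch)


cfg = CTGCConfig()
base_lr = 1e-3  # assumed value; the source does not give the initial learning rate
for epoch in range(3):
    print(epoch, learning_rate_at(epoch, base_lr, cfg))
```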

The basic information extraction steps of the scientific literature are as follows:

3) lowercasing: uniformly converting English text to lower case;

4) noise removal: removing special symbols and punctuation from the text;
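Steps 3) and 4) can be sketched with the Python standard library as follows. The function names and the exact character set treated as noise are assumptions; the source only says that special symbols and punctuation are removed.

```python
import re


def lowercase(text: str) -> str:
    """Step 3): convert English text to lower case."""
    return text.lower()


def remove_noise(text: str) -> str:
    """Step 4): replace punctuation and special symbols with spaces, then squeeze whitespace."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


sample = "A SARS-CoV-2 infection model in mice (BALB/c)!"
print(remove_noise(lowercase(sample)))
# a sars cov 2 infection model in mice balb c
```

Replacing symbols with spaces rather than deleting them keeps hyphenated terms such as "SARS-CoV-2" from collapsing into a single token; whether that is desirable depends on the tokenizer used downstream.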

Common model performance metrics in the field of text classification are: accuracy and error rate (the two sum to 1), precision/recall/F-measure, Exact Match (EM) and Mean Reciprocal Rank (MRR). Exact Match (EM) is mainly used for QA systems to measure how well the prediction matches the true answer, and is the evaluation method commonly used in the SQuAD competition. Mean Reciprocal Rank (MRR) is mainly used to evaluate ranking quality, such as query-document ranking and QA answer ranking. In the invention, F-measure is used as the model performance metric.
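For a multi-class task, the F-measure is usually reported as a weighted F1 score: the F1 of each class is averaged with weights proportional to the number of true samples of that class. The sketch below computes it in plain Python; the function name and the toy labels are illustrative and are not data from the invention.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * count / total
    return score


y_true = ["vaccine", "vaccine", "medication", "diagnosis"]
y_pred = ["vaccine", "medication", "medication", "diagnosis"]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.75
```

Weighting by class support keeps rare knowledge categories from dominating the score while still penalizing errors on them.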

The invention compares the performance of the CTGC model with that of common text classification models (such as FastText, TextCNN, RCNN, LSTM and DPCNN). Each model is trained with its default hyper-parameters, and the weighted f1-score of each model on the test set is shown in Table 25:

TABLE 25

Model	Test set weighted f1-score	Remarks
FastText	92.6%	Baseline
TextCNN	93.8%
RCNN	93%
Multi-layer bidirectional LSTM	94.4%
Multi-layer bidirectional LSTM with Attention	94.7%
DPCNN	95.4%
CTGC	95.5%

The results of the experiment were analyzed as follows:

(1) the FastText model is the experimental baseline, with a weighted f1-score of 92.6% on the test set;

(2) among the comparison models, the DPCNN model achieves the best performance on the test set, with a weighted f1-score of 95.4%;

(3) the effect of the Attention mechanism on this classification problem is not obvious, and the performance of the multi-layer bidirectional LSTM with and without Attention is almost the same (94.7% versus 94.4%);

(4) the RCNN model (RNN combined with CNN) did not achieve the expected performance, improving only slightly over FastText (93% versus 92.6%);

(5) TextCNN is the most efficient model: its structure is simple, its training speed is high, and its experimental effect is good.

The above experimental results are all based on default hyper-parameters; if each model were tuned individually, the results would be expected to improve. Compared with CNN models, RNN-based models are more difficult to train and take longer, and their experimental results show no obvious advantage. Gradient explosion occurs easily when training RNN-based models, so care is needed in choosing the related methods and parameters. Deep networks (DPCNN) are more difficult to train than shallow networks (TextCNN), but their experimental effect is significantly better (95.4% versus 93.8%); in addition, deep networks need residual connections to mitigate the vanishing-gradient problem.

After the model training is finished, the COVID-19 scientific literature can be subjected to fine-grained knowledge classification, and the corresponding knowledge class is predicted, wherein the specific method comprises the following three steps:

(1) Extract the text data of the COVID-19 English scientific literature, such as the abstract, keywords and authors.

(2) Run the main function of the trained model under the CTGC model file directory, supplying the text data extracted in step (1) as input.

(3) The CTGC model directly returns the knowledge category of the scientific literature according to the abstract, keywords and author information of the COVID-19 scientific literature.
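The three steps above can be sketched as follows. The trained CTGC model itself is not reproduced here: the `logits` list stands in for whatever class scores the model would return, only three illustrative labels are used instead of the 34 categories, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def build_input_text(record: Dict[str, str]) -> str:
    """Step (1): join the abstract, keywords and authors into one text string."""
    parts = (record.get(key, "") for key in ("abstract", "keywords", "authors"))
    return " ".join(part for part in parts if part)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over class scores."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def predict_category(logits: Sequence[float], labels: Sequence[str]) -> str:
    """Step (3): return the label with the highest softmax probability."""
    probabilities = softmax(logits)
    best = max(range(len(labels)), key=probabilities.__getitem__)
    return labels[best]


labels = ["Medication", "Vaccine development", "Other"]  # illustrative subset
record = {"abstract": "example abstract text", "keywords": "", "authors": "example author"}
text = build_input_text(record)
logits = [0.1, 2.0, -1.0]  # placeholder for the model output on `text` (step (2))
print(predict_category(logits, labels))
```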

Specifically, the implementation process based on the CTGC model is illustrated by an example. First, a scientific literature published on June 10, 2020 is selected; its basic data sample in the COVID-19-v5.0 knowledge graph is shown in Table 26:

TABLE 26

Attribute name	Attribute value
Number	330
Date of release	2020.06.10
Knowledge category	Vaccine development
Title	A SARS-CoV-2 infection model in mice demonstrates protection by neutralizing antibodies
Document address	https://www.cell.com/cell/fulltext/S0092-8674(20)30742-X
Information address	https://mp.weixin.qq.com/s/grELLG8bhayXw-s8w_Tdjg
Abstract	Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused a pandemic with millions of human infections. One limitation to the evaluation of potential therapies and vaccines to inhibit SARS-CoV-2 infection and ameliorate disease is the lack of susceptible small animals in large numbers. Commercially available laboratory strains of mice are not readily infected by SARS-CoV-2 because of species-specific differences in their angiotensin-converting enzyme 2 (ACE2) receptors. Here, we transduced replication-defective adenoviruses encoding human ACE2 via intranasal administration into BALB/c mice and established receptor expression in lung tissues. hACE2-transduced mice were productively infected with SARS-CoV-2, and this resulted in high viral titers in the lung, lung pathology, and weight loss. Passive transfer of a neutralizing monoclonal antibody reduced viral burden in the lung and mitigated inflammation and weight loss. The development of an accessible mouse model of SARS-CoV-2 infection and pathogenesis will expedite the testing and deployment of therapeutics and vaccines.
Content introduction	On June 10, 2020, Michael S. Diamond of the University of Washington School of Medicine published online in Cell a study entitled "A SARS-CoV-2 infection model in mice demonstrates protection by neutralizing antibodies". In this study, a replication-deficient adenovirus encoding human ACE2 was transduced into BALB/c mice by intranasal administration, and receptor expression was established in lung tissue. The hACE2-transduced mice were efficiently infected with SARS-CoV-2, which resulted in high pulmonary viral titers, lung pathology and weight loss. Passive transfer of neutralizing monoclonal antibodies reduced the pulmonary viral burden, inflammation and weight loss. This mouse model of SARS-CoV-2 infection and pathogenesis will expedite the testing and deployment of therapeutics and vaccines.
Scientific research institution	University OF WASHINGTON
Research team	Michael S. Diamond
DOI	DOI:https://doi.org/10.1016/j.cell.2020.06.011
Periodical	Cell
Keyword	None
Remarks	None

The invention inputs the "title, abstract and keyword" data of the scientific literature into the trained CTGC model, and the model outputs the classification of the scientific literature according to the input data: "Vaccine development". This completes the fine-grained classification of the COVID-19 scientific literature; the method can replace manual work and enables rapid, large-scale fine-grained classification of COVID-19 scientific literature.

Therefore, the GNN-based COVID-19 scientific literature fine-grained classification method of the invention has the following characteristics: 1. a COVID-19 knowledge graph is constructed, with high data quality and fine knowledge granularity; 2. a COVID-19 scientific literature fine-grained classification model is designed, which has high accuracy, addresses the small-sample learning problem and is strongly interpretable; 3. graph neural network technology is used to solve the practical problem of fine-grained classification of COVID-19 scientific literature, assisting scientific epidemic prevention.

Claims

1. A COVID-19 scientific literature fine-grained classification method based on GNN is characterized by comprising the following steps:

a) constructing a knowledge graph of COVID-19 scientific literature;

b) constructing a western medicine treatment knowledge map;

c) constructing a Chinese medicine treatment knowledge map;

2. The fine-grained classification method for COVID-19 scientific literature based on GNN according to claim 1, wherein the basic information extraction step of scientific literature is as follows:

3) lowercasing: uniformly converting English text to lower case;

4) noise removal: removing special symbols and punctuation from the text;

5) spell checking: checking whether spelling errors exist or not, and correcting errors;

3. The GNN-based COVID-19 scientific literature fine-grained classification method according to claim 1 or 2, wherein the graph neural network GNN in step d) is composed of a convolutional layer, a pooling layer and a fully connected layer, the fully connected layer performs classification through a softmax function, the batch size is 256, the number of threads used for loading data is 4, the regularization coefficient of the loss function is 1e-4, the word embedding dimension is 100, the dropout rate is 0.5, the number of classification categories is 34, learning rate decay is used, the decay rate is 0.95, and the decay interval is 1.

4. The fine-grained classification method for COVID-19 scientific literature based on GNN according to claim 1 or 2, wherein the construction of the knowledge graph of COVID-19 scientific literature in step a), the knowledge graph of western medicine treatment in step b) and the knowledge graph of traditional Chinese medicine treatment in step c) is realized by a Neo4j database tool.