CN117312493A - Multi-strategy knowledge extraction system - Google Patents

Info

Publication number
CN117312493A
Authority
CN
China
Prior art keywords
model
layer
data
evidence
knowledge extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311159972.0A
Other languages
Chinese (zh)
Inventor
杨硕
王海洋
张君冬
隋明爽
陈琦
尹仁芳
李芹
王越
叶雅欣
初杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute Of Information On Traditional Chinese Medicine Cacms
Original Assignee
Institute Of Information On Traditional Chinese Medicine Cacms
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute Of Information On Traditional Chinese Medicine Cacms
Priority to CN202311159972.0A
Publication of CN117312493A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00 ICT specially adapted for the handling or processing of medical references
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Epidemiology (AREA)
  • Mathematical Analysis (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a multi-strategy knowledge extraction system and relates to the technical field of knowledge systems. The technical scheme is as follows: the multi-strategy knowledge extraction system comprises a data layer, a preprocessing layer, a model layer, a service layer and an application layer. With the multi-strategy knowledge extraction model built by the invention, different extraction strategies can be formulated for documents of different classifications, large amounts of data can be labeled by machine, and the method is applicable to long texts and sparse words. Based on a knowledge graph research framework and taking traditional Chinese medicine gynecological evidence-based clinical literature as an example, a multi-strategy knowledge extraction framework based on deep learning and other methods is constructed, entity extraction of knowledge units in the traditional Chinese medicine evidence-based field is realized, the key technologies for building a traditional Chinese medicine gynecological evidence-based knowledge system are studied, and the best clinical evidence is provided for traditional Chinese medicine gynecological clinical research at home and abroad.

Description

Multi-strategy knowledge extraction system
Technical Field
The invention relates to the technical field of knowledge systems, in particular to a multi-strategy knowledge extraction system.
Background
Evidence-based medicine is an emerging interdisciplinary branch of clinical medicine that developed in the early 1990s; it is a method of applying the best available evidence appropriately on the basis of clinical practice. Evidence-based traditional Chinese medicine draws on the theory and methods of evidence-based medicine to collect, evaluate, produce and translate evidence of the effectiveness, safety and economy of traditional Chinese medicine. It can reveal the clinical action characteristics and rules of traditional Chinese medicine and guide the formulation of clinical guidelines, clinical pathways and health decisions, and it is an important branch of evidence-based medical research. In recent years the number of clinical evidence-based research documents on traditional Chinese medicine has kept increasing, but clinicians seldom apply the diagnosis and treatment methods described in this literature. On the one hand, the conclusions in the literature cannot solve the complex problems met in clinical practice, and a clinician searching for evidence to support a clinical decision may abandon the evidence because it forms only a single evidence chain. On the other hand, clinical practice literature is not directly converted into medical decision advice and cannot be consulted accurately in real time, which greatly hinders the operability of evidence-based literature in clinical decisions and practice.
Current research on domain-specific knowledge and term extraction focuses mainly on three families of methods: methods based on domain linguistic rules, methods based on statistics and machine learning, and methods based on deep learning. In practice, several methods are usually combined to obtain better recognition results. (1) Term extraction based on domain linguistic rules relies mainly on the linguistic feature rules of domain terms or on matching against terms in a dictionary. An entity dictionary or rule dictionary is constructed first; the text is then segmented into words and tagged with parts of speech, and entities are extracted by pattern matching. Common matching methods include the maximum matching algorithm and regular expression matching; rule dictionaries are built by manual summarization, automatic statistical summarization and similar means, and content that matches a rule is a candidate term. Current research mainly analyzes the word-formation patterns of terms in a given industry to realize term extraction in different fields, for example by formulating extended rules and combining them with statistical features to extract terms, or by combining statistical features with qualitative and quantitative linguistic rule analysis to extract cross-domain compound terms. This method requires the background knowledge of domain experts for support and maintenance and cannot easily be migrated to other domains.
(2) Term extraction based on statistics and machine learning can be realized in two ways: the first completes two tasks in sequence, entity boundary recognition and then entity type prediction; the second treats named entity recognition (NER) as a sequence labeling task, builds a sequence labeling model to predict the label at each position of the sequence, and determines the boundary and category of each entity from the labels. Common statistical machine learning models include conditional random fields and support vector machines. Feature engineering is the key activity in building a statistical machine learning model: researchers manually select valuable features and construct suitable feature templates according to the characteristics of traditional Chinese medicine entities and texts in order to improve the model's recognition performance. Current research mainly uses N-Gram statistical language models combined with extended statistical features; the statistical features commonly used for term extraction include term frequency (TF), cohesion (PMI), degree of freedom (DF) and C-value. Term extraction based on statistical methods works well for high-frequency, high-quality terms but is less effective for low-frequency and sparse terms. (3) Deep learning methods mainly solve the sequence labeling problem through end-to-end learning: once the raw data is fed into the model, the model automatically completes the feature learning and label prediction tasks.
Typically, a deep learning model comprises an embedding layer, an encoding layer and a decoding layer. The embedding layer obtains an embedded representation of the text sequence; the encoding layer performs text feature extraction and label prediction, using neural network models such as the bidirectional long short-term memory network (BiLSTM) or Bidirectional Encoder Representations from Transformers (BERT); the decoding layer resolves the optimal label sequence. However, NER research under supervised learning depends on large-scale, high-quality annotated corpora, while corpus resources in the traditional Chinese medicine field are scarce and costly to acquire.
The existing single knowledge extraction methods cannot efficiently complete the knowledge extraction task in the traditional Chinese medicine field: dictionary- and rule-based methods rely on an expert dictionary, while machine learning and deep learning methods perform well but are not suitable for long texts or sparse words and require a large amount of labeled data.
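To make the sequence labeling formulation above concrete, the following minimal sketch shows how entity boundaries and categories can be read off a predicted label sequence. It assumes the common BIO tagging scheme, which the text does not name; the function name `decode_bio` and the tag names `HERB` and `DISEASE` are hypothetical.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Turn per-token BIO tags into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; flush the one in progress, if any.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            # Continuation of the entity currently being built.
            current.append(token)
        else:
            # "O" tag or an inconsistent "I-" tag closes any open entity.
            if current:
                entities.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append(("".join(current), current_type))
    return entities


# Character-level example (Chinese text is usually labeled per character).
tokens = list("服用当归治疗痛经")
tags = ["O", "O", "B-HERB", "I-HERB", "O", "O", "B-DISEASE", "I-DISEASE"]
entities = decode_bio(tokens, tags)
print(entities)
```

The boundary of each entity comes from the B-/I- prefixes and its category from the tag suffix, which is what "determining the boundary and category from the labels" amounts to under this scheme.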
Accordingly, the present invention is directed to a multi-strategy knowledge extraction system that addresses the above-mentioned related problems.
Disclosure of Invention
The invention aims to provide a multi-strategy knowledge extraction system to solve the following problems: the existing single knowledge extraction methods cannot efficiently complete the knowledge extraction task in the traditional Chinese medicine field; dictionary- and rule-based methods depend on an expert dictionary; and machine learning and deep learning methods perform well but are not suitable for long texts or sparse words and require a large amount of labeled data.
The technical aim of the invention is realized by the following technical scheme: a multi-strategy knowledge extraction system comprising a data layer, a preprocessing layer, a model layer, a service layer and an application layer;
the data layer is used for receiving and storing data and providing data support for the model layer, the service layer and the application layer;
the preprocessing layer is used for converting document formats, cleaning the data and performing manual format correction;
the model layer is used for constructing a grammar model, a domain dictionary rule model and a deep learning model aiming at the characteristics of different entity types;
the service layer is used for packaging the grammar model, the domain dictionary rule model and the deep learning model to obtain a multi-strategy knowledge extraction model, and extracting conventional article concepts, evidence-based entities and traditional Chinese medicine domain entities based on the multi-strategy knowledge extraction model;
the application layer is used for carrying out automatic literature knowledge extraction, literature knowledge online processing and literature user management.
The invention is further configured as follows: the data layer comprises a document library, a rule library, a domain dictionary, a stop-word library and a noise library.
The invention is further configured as follows: the preprocessing layer comprises:
a format conversion module, used for converting an input document in PDF format into an editable text form, such as TXT format, using OCR technology;
a data cleaning module, used for cleaning the input data, which comprises checking the consistency of the input data, processing invalid data and deleting data.
The invention is further configured as follows: the model layer comprises:
a grammar pattern model construction module, used for extracting the entity features of conventional concepts and constructing a grammar pattern model through statistics and generalization of the entity features;
a domain dictionary rule model construction module, used for constructing a domain dictionary rule model from an extensible domain dictionary combined with a similarity matching algorithm;
a deep learning model construction module, used for constructing a BERT-BiLSTM-CRF model, namely the deep learning model, by combining an embedding layer formed by the BERT model, a bidirectional LSTM layer and a CRF layer.
The invention is further configured as follows: extracting conventional article concepts, evidence-based entities and traditional Chinese medicine domain entities based on the multi-strategy knowledge extraction model specifically comprises: the grammar model extracts conventional concepts, the domain dictionary rule model extracts evidence-based entity concepts, and the deep learning model extracts medical entity concepts.
The invention is further configured as follows: the grammar model extracts the conventional concepts by constructing joint rules according to the position, context and keyword features of each entity, wherein the conventional concepts comprise titles, authors, abstracts, keywords and references.
The invention is further configured as follows: the domain dictionary rule model extracts evidence-based concepts by keyword and pattern similarity according to preset rules describing how evidence-based concepts occur; the evidence-based concepts include evidence-based medicine, randomized control, envelope method, random number table method, diagnostic criteria, total number of cases, inclusion criteria, exclusion criteria, grouping, test group, control group, medication method, meta-analysis, systematic review, RCT, randomized controlled trial, sample size, risk of bias, fixed effect model, random effect model, pooled effect size analysis, relative risk RR, OR, 95% confidence interval, and CI.
The invention is further configured as follows: the deep learning model extracts medical entity concepts by selecting annotated entities in the traditional Chinese medicine field for training, thereby obtaining a deep learning model for extracting medical entity concepts.
The invention also provides an information data processing terminal which is used for realizing the multi-strategy knowledge extraction system.
In summary, the invention has the following beneficial effects: with the multi-strategy knowledge extraction model built by the invention, different extraction strategies can be formulated for documents of different classifications, large amounts of data can be labeled by machine, and the method is applicable to long texts and sparse words. Based on a knowledge graph research framework and taking traditional Chinese medicine gynecological evidence-based clinical literature as an example, a multi-strategy knowledge extraction framework based on deep learning and other methods is constructed, entity extraction of knowledge units in the traditional Chinese medicine evidence-based field is realized, the key technologies for building a traditional Chinese medicine gynecological evidence-based knowledge system are studied, and the best clinical evidence is provided for traditional Chinese medicine gynecological clinical research at home and abroad.
Drawings
FIG. 1 is a system configuration diagram of a multi-strategy knowledge extraction system according to embodiment 1 of the present invention;
FIG. 2 is a diagram of different types of entity extraction policies of a multi-policy knowledge extraction system according to embodiment 1 of the invention;
FIG. 3 is a schematic diagram showing the effect of document preprocessing in a multi-strategy knowledge extraction system according to embodiment 1 of the present invention;
FIG. 4 is a schematic diagram of domain model thesaurus construction of a multi-strategy knowledge extraction system according to embodiment 1 of the present invention;
FIG. 5 is a schematic diagram of the structure of the BERT-BiLSTM-CRF model in a multi-strategy knowledge extraction system according to embodiment 1 of the present invention;
FIG. 6 is a schematic diagram of the knowledge overview home page in the multi-strategy knowledge extraction system according to embodiment 2 of the present invention;
FIG. 7 is a diagram of the knowledge graph home page in a multi-strategy knowledge extraction system according to embodiment 2 of the present invention;
FIG. 8 is a diagram of a document management page in a multi-strategy knowledge extraction system according to embodiment 2 of the present invention;
FIG. 9 is a schematic diagram of a knowledge extraction page in a multi-strategy knowledge extraction system according to embodiment 2 of the present invention;
FIG. 10 is a schematic diagram of an entity modification page in the multi-strategy knowledge extraction system according to embodiment 2 of the present invention;
FIG. 11 is a schematic diagram of an entity addition page in the multi-strategy knowledge extraction system according to embodiment 2 of the present invention;
FIG. 12 is a schematic diagram of an entity word addition page in the multi-strategy knowledge extraction system according to embodiment 2 of the present invention.
Detailed Description
The invention is described in further detail below with reference to fig. 1-12.
Example 1
As shown in FIGS. 1-5, the multi-strategy knowledge extraction system comprises a data layer, a preprocessing layer, a model layer, a service layer and an application layer;
the data layer is used for receiving and storing data and providing data support for the model layer, the service layer and the application layer; the data layer comprises a document library, a rule library, a domain dictionary, a stop-word library and a noise library.
The preprocessing layer is used for converting document formats, cleaning the data and performing manual format correction;
the preprocessing layer comprises:
a format conversion module, used for converting an input document in PDF format into an editable text form, such as TXT format, using OCR technology;
a data cleaning module, used for cleaning the input data, which comprises checking the consistency of the input data, processing invalid data and deleting data.
In this embodiment, for an input document in PDF format, OCR technology is used to convert the image information into an editable, usable text form, which facilitates later processing; data containing charts or disordered typesetting cannot be recognized well and must additionally be cleaned and tidied by hand.
In this embodiment, an end-to-end image sequence recognition technique is adopted: a neural network architecture integrating feature extraction, sequence modeling and transcription is built, and two networks, a CNN and an RNN, are trained together. The algorithm naturally handles sequences of arbitrary length, involves neither character segmentation nor horizontal scale normalization, performs excellently on both lexicon-free and lexicon-based scene text recognition tasks, and can recognize Chinese and English text with a recognition accuracy of 95%; the effect is shown in FIG. 3.
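As an illustration of the transcription stage, the sketch below assumes that it is implemented with greedy CTC decoding, as is common in CNN+RNN text recognizers; the text itself does not name CTC, so this is an assumption. The function name `ctc_greedy_decode`, the blank index and the toy alphabet are hypothetical.

```python
from typing import List, Sequence

BLANK = 0  # index reserved for the CTC blank symbol (assumption)


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Pick the best class per frame, collapse repeats, then drop blanks."""
    best_path: List[int] = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    chars: List[str] = []
    previous = BLANK
    for index in best_path:
        # Emit a character only when it is not blank and differs from the previous frame.
        if index != BLANK and index != previous:
            chars.append(alphabet[index - 1])
        previous = index
    return "".join(chars)


# Six frames over the classes (blank, "a", "b"); the blank separates the two "a"s.
frames = [
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.9, 0.05, 0.05],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.3, 0.6],
]
decoded = ctc_greedy_decode(frames, "ab")
print(decoded)
```

Because the output length is decided by the collapse step rather than by a fixed character grid, a decoder of this kind needs no character segmentation and accepts frame sequences of any length.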
The model layer is used for constructing a grammar model, a domain dictionary rule model and a deep learning model aiming at the characteristics of different entity types;
the model layer comprises:
a grammar pattern model construction module, used for extracting the entity features of conventional concepts and constructing a grammar pattern model through statistics and generalization of the entity features;
a domain dictionary rule model construction module, used for constructing a domain dictionary rule model from an extensible domain dictionary combined with a similarity matching algorithm;
a deep learning model construction module, used for constructing a BERT-BiLSTM-CRF model, namely the deep learning model, by combining an embedding layer formed by the BERT model, a bidirectional LSTM layer and a CRF layer.
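The role of the CRF layer on top of the BiLSTM can be illustrated with a small Viterbi decoder. This is a minimal sketch, assuming the CRF layer scores a label sequence as the sum of per-token emission scores and label-to-label transition scores; the function name `viterbi_decode`, the three-label set and all numeric scores are hypothetical and not taken from the patent.

```python
from typing import List


def viterbi_decode(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label sequence.

    emissions[t][j] is the score of label j at position t;
    transitions[i][j] is the score of moving from label i to label j.
    """
    num_labels = len(transitions)
    score = list(emissions[0])
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_score: List[float] = []
        pointers: List[int] = []
        for j in range(num_labels):
            # Best previous label for reaching label j at position t.
            best_i = max(range(num_labels), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(range(num_labels), key=score.__getitem__)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Labels: 0 = O, 1 = B, 2 = I. The transition O -> I is heavily penalized.
emissions = [[2.0, 1.0, 0.0], [0.0, 0.5, 2.0], [1.5, 0.0, 0.0]]
transitions = [[0.0, 0.0, -100.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
best_path = viterbi_decode(emissions, transitions)
print(best_path)
```

Picking the best label at each position independently would give O, I, O here, which is not a valid entity; the transition scores steer the decoder to B, I, O instead, which is the benefit a CRF layer adds over per-token classification.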
The service layer is used for packaging the grammar model, the domain dictionary rule model and the deep learning model to obtain a multi-strategy knowledge extraction model, and extracting conventional article concepts, evidence-based entities and traditional Chinese medicine domain entities based on the multi-strategy knowledge extraction model. Extracting conventional article concepts, evidence-based entities and traditional Chinese medicine domain entities based on the multi-strategy knowledge extraction model comprises: the grammar model extracts conventional concepts, the domain dictionary rule model extracts evidence-based entity concepts, and the deep learning model extracts medical entity concepts.
The grammar model extracts the conventional concepts by constructing joint rules according to the position, context and keyword features of each entity; the conventional concepts comprise titles, authors, abstracts, keywords and references.
In this embodiment, the conventional concepts in the article structure, such as title, author, abstract, keywords and references, are parsed with a pattern extraction method based on keywords and prior rules. Here the pattern mainly refers to grammar rules, which are used to construct joint rules that extract the conventional concepts from features such as the position, context and keywords of each entity.
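A joint rule of this kind can be sketched with a few regular expressions. The example below is only an illustration under stated assumptions: the cue words, the assumption that the first two non-empty lines hold the title and the authors, and the function name `extract_conventional_concepts` are all hypothetical and are not specified in the patent.

```python
import re
from typing import Dict, List

# Keyword rules: a cue word at the start of a line introduces the field (hypothetical cues).
FIELD_CUES = {
    "abstract": re.compile(r"(?:Abstract|摘要)\s*[:：]\s*(.+)"),
    "keywords": re.compile(r"(?:Keywords|Key words|关键词)\s*[:：]\s*(.+)"),
}
REFERENCE_HEADING = re.compile(r"(?:References|参考文献)\s*[:：]?")
LIST_SEPARATOR = re.compile(r"\s*[,，;；、]\s*")


def extract_conventional_concepts(lines: List[str]) -> Dict[str, object]:
    lines = [line.strip() for line in lines if line.strip()]
    concepts: Dict[str, object] = {}
    # Position rules: first non-empty line is the title, second is the author line.
    if lines:
        concepts["title"] = lines[0]
    if len(lines) > 1:
        concepts["authors"] = LIST_SEPARATOR.split(lines[1])
    for index, line in enumerate(lines[2:], start=2):
        if REFERENCE_HEADING.fullmatch(line):
            # Context rule: everything after the heading is the reference list.
            concepts["references"] = lines[index + 1:]
            break
        for field, pattern in FIELD_CUES.items():
            match = pattern.match(line)
            if match and field not in concepts:
                value = match.group(1).strip()
                concepts[field] = LIST_SEPARATOR.split(value) if field == "keywords" else value
    return concepts


doc_lines = [
    "Acupuncture for primary dysmenorrhea: a randomized controlled trial",
    "Zhang San, Li Si",
    "Abstract: Objective: to observe the effect of acupuncture.",
    "Keywords: acupuncture; dysmenorrhea; RCT",
    "Main text of the article.",
    "References",
    "[1] Example reference one.",
]
concepts = extract_conventional_concepts(doc_lines)
print(concepts)
```

Each field is found by combining where it sits in the document, what precedes it and which cue word introduces it, which is the sense in which the rule is "joint".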
A scientific document is composed of a number of relatively independent knowledge elements that are linked by certain semantic associations. To help the user quickly understand the main content of a document, this embodiment constructs a knowledge element ontology model of scientific literature at the micro level, which uniformly describes and represents the composition of the unstructured knowledge elements of scientific literature and the semantic relations among them. The ontology model may be formally represented as follows:
$KE_{UA} = (G, M, C, (sp_1, sp_2, sp_3, \ldots), (cw_1, cw_2, cw_3, \ldots), Ti)$
where $KE_{UA}$ represents the unstructured knowledge element ontology of the scientific document; $G$ represents unstructured abstract elements; $M$ represents a method element in the body; $C$ represents the result/conclusion element in the body; $(sp_1, sp_2, sp_3, \ldots)$ represents the set of sentence patterns extracted from the unstructured text; $(cw_1, cw_2, cw_3, \ldots)$ represents the set of cue words; and $Ti$ represents the title of the source document.
The text is processed according to the title and body text together with the sentence patterns, cue words and position information of each part of the text; the identified sentences are labeled and finally stored as an XML document. The algorithm is as follows:
Algorithm: extracting the elements in the text
Input: an Excel document storing all titles and body paragraphs
Output: XML documents
Begin
1) Read the Excel document;
2) Traverse the title and the body text, split the text into sentences at periods, and store the sentences as a list;
3) Match each sentence against the method cue words;
4) if the match succeeds, label the sentence as "method", record the list index of the matched sentence as m, label the sentences before m as "purpose", and match the following sentences against the conclusion cue words;
if the match succeeds, record the list index of the matched sentence as n, label the sentences from m to n as "method" and the sentences after n as "conclusion", then break;
5) else match the sentences against the conclusion cue words;
if the match succeeds, record the list index of the matched sentence as n, label the sentences before n as "method" and the sentences after n as "conclusion", then break;
6) Create a DOM tree object, add the corresponding nodes to the DOM object to store the sentences, and output the XML document.
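The steps above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: reading the Excel file is omitted and the title and text are passed in as strings; the cue-word lists are hypothetical; the sentence that matches a conclusion cue word is assumed to begin the "conclusion" part; sentences are labeled "unlabeled" when no cue word matches (a case the algorithm does not cover); and `xml.etree.ElementTree` from the standard library is used in place of a DOM object.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Optional, Sequence

METHOD_CUES = ("method", "randomly divided")    # hypothetical cue words
CONCLUSION_CUES = ("conclusion", "results show")  # hypothetical cue words


def first_match(sentences: List[str], cues: Sequence[str], start: int = 0) -> Optional[int]:
    """Index of the first sentence at or after `start` that contains a cue word."""
    for index in range(start, len(sentences)):
        lowered = sentences[index].lower()
        if any(cue in lowered for cue in cues):
            return index
    return None


def label_sentences(sentences: List[str]) -> List[str]:
    total = len(sentences)
    m = first_match(sentences, METHOD_CUES)
    if m is not None:
        # Step 4: purpose before m, method from m up to n, conclusion from n onward.
        n = first_match(sentences, CONCLUSION_CUES, start=m + 1)
        end_of_method = total if n is None else n
        return (["purpose"] * m
                + ["method"] * (end_of_method - m)
                + ["conclusion"] * (total - end_of_method))
    n = first_match(sentences, CONCLUSION_CUES)
    if n is not None:
        # Step 5: no method cue found; method before n, conclusion from n onward.
        return ["method"] * n + ["conclusion"] * (total - n)
    return ["unlabeled"] * total


def to_xml(title: str, text: str) -> str:
    # Step 2: split at periods (this naive split also breaks decimals such as 0.05).
    sentences = [s.strip() for s in re.split(r"[。.]\s*", text) if s.strip()]
    root = ET.Element("document")
    ET.SubElement(root, "title").text = title
    # Step 6: one node per sentence, named after its label.
    for sentence, label in zip(sentences, label_sentences(sentences)):
        ET.SubElement(root, label).text = sentence
    return ET.tostring(root, encoding="unicode")


title = "Acupuncture for dysmenorrhea"
text = ("To evaluate acupuncture for dysmenorrhea. "
        "Patients were randomly divided into two groups. "
        "The test group received acupuncture. "
        "Results show that pain scores decreased.")
xml_text = to_xml(title, text)
print(xml_text)
```

For the sample text, the first sentence is labeled as purpose, the next two as method and the last as conclusion, and each becomes a child node of the `document` element.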
The domain dictionary rule model extracts evidence-based concepts by keyword and pattern similarity according to preset rules describing how evidence-based concepts occur; the evidence-based concepts include evidence-based medicine, randomized control, envelope method, random number table method, diagnostic criteria, total number of cases, inclusion criteria, exclusion criteria, grouping, test group, control group, medication method, meta-analysis, systematic review, RCT, randomized controlled trial, sample size, risk of bias, fixed effect model, random effect model, pooled effect size analysis, relative risk RR, OR, 95% confidence interval, and CI.
In this embodiment, because evidence-based concept entities are long, do not conform to the conventional definition of an entity and have strong domain characteristics, they are better suited to a pattern extraction method based on keywords and pattern similarity. Research has found that the entity types of greatest interest to experts and scholars in the field of evidence-based medicine fall mainly into the following two categories; extracting these entities plays a key role in assessing the evidence grade of evidence-based literature.
1) RCT literature: evidence-based medicine, randomized control, envelope method, random number table method, diagnostic criteria, total number of cases, inclusion criteria, exclusion criteria, grouping, test group, control group, and medication method.
2) Meta-analysis literature: meta-analysis, systematic review, RCT, randomized control, sample size, risk of bias, fixed effect model, random effect model, pooled effect size analysis, relative risk RR, OR, 95% confidence interval, CI.
To extract evidence-based concepts, rules describing how the concepts occur must first be formulated; the concepts are then extracted according to keyword and pattern similarity. Using a dictionary of related concept words as an aid can greatly improve the extraction precision.
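Dictionary-assisted similarity matching can be sketched as below. The patent does not specify the similarity measure, so this example assumes the ratio computed by `difflib.SequenceMatcher` from the Python standard library; the dictionary entries, the 0.9 threshold and the function name `match_concepts` are hypothetical, and English text is used only to keep the example readable.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# A tiny stand-in for the domain dictionary (hypothetical entries).
EVIDENCE_DICTIONARY = [
    "random number table method",
    "inclusion criteria",
    "exclusion criteria",
    "control group",
]


def match_concepts(text: str, dictionary: List[str],
                   threshold: float = 0.9) -> List[Tuple[str, str, float]]:
    """Return (dictionary_term, matched_fragment, similarity) for close matches."""
    words = text.lower().split()
    hits: List[Tuple[str, str, float]] = []
    for term in dictionary:
        size = len(term.split())
        best_fragment, best_ratio = "", 0.0
        # Slide a window with the same number of words as the dictionary term.
        for start in range(len(words) - size + 1):
            fragment = " ".join(words[start:start + size])
            ratio = SequenceMatcher(None, term, fragment).ratio()
            if ratio > best_ratio:
                best_fragment, best_ratio = fragment, ratio
        if best_ratio >= threshold:
            hits.append((term, best_fragment, round(best_ratio, 2)))
    return hits


sample = ("Patients were grouped by the random numbers table method "
          "and the inclusion criterion was applied")
hits = match_concepts(sample, EVIDENCE_DICTIONARY)
for term, fragment, ratio in hits:
    print(term, "<-", fragment, ratio)
```

Matching by similarity rather than exact string equality lets surface variants such as "random numbers table method" map to the dictionary entry, while the threshold keeps near misses such as "exclusion criteria" from being reported when only "inclusion criterion" appears.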
Taking three evidence-based entities (randomization, allocation concealment and blinding) as examples, the extraction rule patterns found in the literature are as follows:
Table 1. Pattern examples
For entities such as evaluation indicators, an expert dictionary is constructed with a term generation algorithm based on improved frequent itemsets, and similar concepts are extracted based on this dictionary.
To extract evidence-based concepts that are internally cohesive and externally free to combine from traditional Chinese medicine evidence-based literature, the module uses a multi-strategy frequent pattern extraction algorithm. Building on an N-Gram statistical language model, the algorithm weighs terms with an index, FP-value, that combines the strengths of word frequency, degree of freedom, cohesion (solidification degree) and C-value, and then filters the candidates with stop words and matching rules for common-word prefixes and suffixes. The term generation algorithm based on improved frequent itemsets comprises the following steps:
Step 1, text preprocessing: remove e-mail addresses, telephone numbers, mobile phone numbers, dates, URLs and the like from the text, and replace punctuation marks with spaces.
Step 2, candidate phrase generation: collect statistics over the text corpus with an N-Gram statistical language model and filter out text fragments below the word-length threshold to obtain candidate text fragments.
Step 3, term quality scoring: first, the word frequency tf, cohesion pmi, degree of freedom df and C-value cval are calculated for each candidate text fragment; each feature is then normalized with a Sigmoid function; finally, the feature values are combined into the index FP-value (FPDC). The feature weights are initialized uniformly (0.25 each). To account for the head words of nested multi-word terms in the traditional Chinese medicine field, the word-frequency weight is penalized by 0.15 and the C-value weight is rewarded by 0.15. Candidates are then screened against a threshold to obtain the candidate evidence-based concepts. The index is computed as:
$$\text{FP-value}(C_1 \ldots C_n) = 0.25\,\mathrm{pmi}(C_1 \ldots C_n) + 0.1\,\mathrm{tf}(C_1 \ldots C_n) + 0.4\,\mathrm{cval}(C_1 \ldots C_n) + 0.25\,\mathrm{df}(C_1 \ldots C_n)$$
where C1...Cn denotes a candidate text fragment composed of several words.
Step 4, linguistic rule filtering: the candidate evidence-based concepts are filtered by stop-word filtering and by language rules that reject candidates with common words as prefixes or suffixes.
Step 5, semantic feature similarity matching: for each feature seed word, the candidates are traversed to find the candidate term with the highest similarity; when the similarity exceeds the set threshold, the candidate term is considered similar to the seed word and is added to the result term set. To account for connectivity between candidate terms, the similarity threshold is decayed exponentially as the number of connected words increases, with MinSim set as the minimum similarity threshold below which words are treated as unrelated; the threshold is given by the following formula:
Step 6, result ranking and output: the similarity matching results obtained by matching each feature seed word against the candidate terms are sorted from high to low similarity, the final evidence-based concept term extraction results are output, and the expert lexicon is updated.
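The term quality scoring of Step 3 can be sketched as follows. This is an illustrative sketch under stated assumptions: the weights follow the textual description (four features starting at 0.25, word frequency penalized by 0.15, C-value rewarded by 0.15), leaving the degree-of-freedom weight at 0.25 is an assumption, and the feature values, the threshold of 0.6 and the function names are hypothetical.

```python
import math

# Weights per the description: uniform 0.25, then tf - 0.15 and cval + 0.15.
# Keeping df at 0.25 is an assumption made for this sketch.
WEIGHTS = {"pmi": 0.25, "tf": 0.10, "cval": 0.40, "df": 0.25}


def sigmoid(x):
    """Squash a raw feature value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fp_value(features):
    """Weighted sum of Sigmoid-normalized features for one candidate."""
    return sum(weight * sigmoid(features[name]) for name, weight in WEIGHTS.items())


def select_candidates(candidates, threshold):
    """Keep candidates whose FP-value reaches `threshold`, best first."""
    scored = [(term, fp_value(features)) for term, features in candidates.items()]
    kept = [(term, score) for term, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical raw feature values for two candidate fragments.
candidates = {
    "random number table": {"tf": 1.2, "pmi": 2.0, "df": 1.5, "cval": 2.5},
    "number table was": {"tf": 1.0, "pmi": -1.0, "df": -2.0, "cval": -1.5},
}
for term, score in select_candidates(candidates, threshold=0.6):
    print(f"{term}: {score:.3f}")
```

Because the weights sum to 1 and each normalized feature lies in (0, 1), the score also lies in (0, 1), which makes a single screening threshold easy to interpret.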
The deep learning model extracts medical entity concepts as follows: annotated entities from the traditional Chinese medicine field are selected for training to obtain the deep learning model, which then extracts the medical entity concepts.
In this embodiment, relatively mature extraction algorithms already exist for the five entity types of diseases, syndromes, symptoms, prescriptions and traditional Chinese medicines. Given the naming characteristics of entities in the traditional Chinese medicine field, the module selects annotated entities from this field for training and uses a BERT-BiLSTM-CRF model for concept extraction.
In the BERT-BiLSTM-CRF model, the BERT model forms the Embedding layer (mainly character vectors, word vectors and some additional features), the bidirectional LSTM layer predicts a label for each character, and the CRF layer selects the optimal result among the label sequences. The CRF takes the label path as its prediction target and, on top of the logits, can add constraints to the final predicted label sequence to ensure that the predicted entity label sequence is valid; the CRF layer learns these constraints automatically from the training data set during training. In practice, the final output can yield many label sequence combinations. Three label paths are shown in fig. 5: the red path with label sequence [B-C, I-C, ..., B-P, I-P, ..., O], the blue path with [O, B-P, ..., I-P, O, ..., O], and the green path with [I-C, O, ..., O, I-P, ..., B-C]. The red path is the correct one, and the other two are paths that the prediction might produce.
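The role of the CRF constraints can be illustrated with a small decoding sketch. It is a simplification, not the trained model: a real CRF layer learns real-valued transition scores, whereas here transitions are reduced to allowed or forbidden under the usual BIO rule, and the label set, emission scores and function names are assumptions introduced for illustration.

```python
LABELS = ["O", "B-C", "I-C", "B-P", "I-P"]
NEG_INF = float("-inf")


def allowed(prev, curr):
    """BIO rule: I-X may only follow B-X or I-X (prev=None marks the start)."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def is_valid(sequence):
    """Check a whole label sequence against the BIO rule."""
    prev = None
    for label in sequence:
        if not allowed(prev, label):
            return False
        prev = label
    return True


def constrained_decode(emissions):
    """Viterbi decoding over per-token label scores, skipping forbidden transitions.

    `emissions` is a list with one dict per token, mapping label -> score.
    """
    best = {lab: (emissions[0][lab] if allowed(None, lab) else NEG_INF)
            for lab in LABELS}
    back = []
    for scores in emissions[1:]:
        step, pointers = {}, {}
        for curr in LABELS:
            top_score, top_prev = max(
                (best[prev], prev) for prev in LABELS if allowed(prev, curr))
            step[curr] = top_score + scores[curr]
            pointers[curr] = top_prev
        best = step
        back.append(pointers)
    path = [max(best, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical emission scores for three tokens.
emissions = [
    {"O": 0.5, "B-C": 1.5, "I-C": 2.0, "B-P": 0.0, "I-P": 0.0},
    {"O": 1.0, "B-C": 0.0, "I-C": 2.0, "B-P": 0.0, "I-P": 0.0},
    {"O": 2.0, "B-C": 0.0, "I-C": 0.0, "B-P": 0.0, "I-P": 0.0},
]
greedy = [max(scores, key=scores.get) for scores in emissions]
print("greedy:", greedy, is_valid(greedy))
print("constrained:", constrained_decode(emissions))
```

Picking the best label per token independently starts the sequence with an I- label, which is invalid; decoding over whole paths with the constraint in place returns a valid sequence instead.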
Annotation and training were carried out on 1000 domain documents, and the final F-score was 93.97%. Data set example:
The application layer is used for automatic literature knowledge extraction, online processing of literature knowledge, and literature user management.
Example 2
This example provides a specific system implementation based on the idea of the above embodiment, as follows:
1. Introduction to system functions
Amount of literature data: Chinese: 239 meta-analyses and 1604 RCT studies; English: 8 meta-analyses and 187 RCT studies. Corpus size: the number of entities is as follows, including 19743 semantic relationships.
The system is divided into three modules: a knowledge overview module, a literature information module and a knowledge extraction module. The knowledge overview module mainly displays the overall extraction results for the system's data. The upper part of the display shows the total number of documents in the system, the number of extracted entities and the number of relations formed between entities (the system mainly uses entity co-occurrence relations). The lower part is mainly used to manage the status of individual entities and lists the name, source document, modification time, entity type, entity status and available operations for each entity. Clicking Edit modifies an entity; clicking Delete deletes the entity and also deletes the corresponding entity and relations in the document extraction results.
The literature information module mainly handles the uploading, management and status of documents, as shown in fig. 8. Clicking Upload Documents lets the user select local PDF documents and upload them in batches; after automatic processing, the uploaded documents are passed to the knowledge extraction module, where they can be viewed.
The knowledge extraction module uses multiple strategies to extract and display the different types of entities in traditional Chinese medicine evidence-based medical literature. As shown in fig. 9, the system displays the extraction results: conventional concept entities are marked in the full text, while medical concept entities and evidence-based concept entities are listed on the right, and adding, deleting, modifying and searching entities are supported.
For an entity that needs to be modified or deleted, look it up in the list on the right and modify or delete it there, as shown in fig. 10. When an entity is modified or deleted, the relations of the original entity are modified or deleted accordingly, and the modification or deletion record is synchronized to the knowledge overview on the home page.
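The cascading behavior described here, in which editing or deleting an entity also updates its co-occurrence relations and records the change, might look roughly like the following in-memory sketch. The `KnowledgeStore` class, its method names and the sample entities are all hypothetical; the source does not describe the actual storage design.

```python
from itertools import combinations


class KnowledgeStore:
    """Hypothetical in-memory store of entities and co-occurrence relations."""

    def __init__(self):
        self.entities = {}      # entity name -> entity type
        self.relations = set()  # each relation is frozenset({name_a, name_b})
        self.log = []           # change records shown in the knowledge overview

    def add_document(self, entities):
        """Register one document's entities and relate every co-occurring pair."""
        self.entities.update(entities)
        for a, b in combinations(sorted(entities), 2):
            self.relations.add(frozenset((a, b)))

    def delete_entity(self, name):
        """Delete an entity together with every relation that mentions it."""
        del self.entities[name]
        self.relations = {rel for rel in self.relations if name not in rel}
        self.log.append(("delete", name))

    def rename_entity(self, old, new):
        """Rename an entity and rewrite its relations to use the new name."""
        if new in self.entities:
            raise ValueError(f"entity already exists: {new}")
        self.entities[new] = self.entities.pop(old)
        self.relations = {
            frozenset(new if member == old else member for member in rel)
            for rel in self.relations
        }
        self.log.append(("rename", old, new))


store = KnowledgeStore()
store.add_document({"cough": "symptom", "ephedra": "medicine", "asthma": "disease"})
store.rename_entity("ephedra", "ma huang")
store.delete_entity("cough")
print(sorted(store.entities), len(store.relations), store.log)
```

Keeping the relation update inside the same method as the entity change is one simple way to ensure the overview never shows a relation whose entity no longer exists.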
For an entity that needs to be added, search for it and add it in the list area to the right of the article: edit the entity's type label and name, confirm, and save. The added record is synchronized to the knowledge overview on the home page. The added entity is shown in fig. 11.
In addition, online word annotation is supported: select the words with the mouse, right-click to choose the corresponding entity type, confirm, and save. The added record is synchronized to the knowledge overview on the home page. The added entity is shown in fig. 12.
This embodiment merely explains the present invention and does not limit it. After reading this specification, those skilled in the art may, as needed, make modifications to this embodiment that involve no creative contribution, and all such modifications are protected by patent law as long as they fall within the scope of the claims of the present invention.

Claims (9)

1. A multi-strategy knowledge extraction system, characterized by comprising a data layer, a preprocessing layer, a model layer, a service layer and an application layer;
the data layer is used for receiving and storing data and providing data support for the model layer, the service layer and the application layer;
the preprocessing layer is used for converting document formats, cleaning the data and performing manual format correction;
the model layer is used for constructing a grammar model, a domain dictionary rule model and a deep learning model aiming at the characteristics of different entity types;
the service layer is used for packaging the grammar model, the domain dictionary rule model and the deep learning model to obtain a multi-strategy knowledge extraction model, and extracting conventional article concepts, evidence-based entities and traditional Chinese medicine domain entities based on the multi-strategy knowledge extraction model;
the application layer is used for carrying out automatic literature knowledge extraction, literature knowledge online processing and literature user management.
2. The multi-strategy knowledge extraction system of claim 1 wherein said data layer comprises a document library, a rule library, a domain dictionary, a stop word library, and a noise library.
3. The multi-strategy knowledge extraction system of claim 1, wherein the preprocessing layer comprises:
a format conversion module, used for converting an input document in PDF format into an editable text form, such as TXT format, using OCR technology;
a data cleaning module, used for cleaning the input data, including checking the consistency of the input data, handling invalid data and deleting data.
4. The multi-strategy knowledge extraction system of claim 1, wherein the model layer comprises:
the grammar pattern model construction module is used for extracting entity characteristics of the conventional concepts and constructing a grammar pattern model through statistics and generalization of the entity characteristics;
the domain dictionary rule model construction module is used for constructing a domain dictionary rule model through an extensible domain dictionary and a combination similarity matching algorithm;
the deep learning model construction module is used for constructing a BERT-BiLSTM-CRF model, namely the deep learning model, by combining an Embedding layer formed by the BERT model, a bidirectional LSTM layer and a CRF layer.
5. The system of claim 4, wherein the conventional article concepts, evidence-based entities and traditional Chinese medicine domain entities are extracted based on the multi-strategy knowledge extraction model as follows: the grammar model extracts conventional concepts, the domain dictionary rule model extracts evidence-based entity concepts, and the deep learning model extracts medical entity concepts.
6. The multi-strategy knowledge extraction system according to claim 5, wherein said grammar model extracts conventional concepts including title, author, abstract, keywords and references by constructing joint rules based on locations, contexts and keyword features where entities appear.
7. The multi-strategy knowledge extraction system as claimed in claim 5, wherein the domain dictionary rule model extracts evidence-based concepts by keyword and pattern similarity according to preset rules describing how evidence-based concepts occur; the evidence-based concepts include evidence-based medicine, randomized control, envelope method, random number table method, diagnostic criteria, total number of cases, inclusion criteria, exclusion criteria, grouping, test group, control group, medication method, meta-analysis, systematic review, RCT, randomized controlled trial, sample size, risk of bias, fixed-effect model, random-effects model, pooled effect-size analysis, relative risk (RR), OR, 95% confidence interval, and CI.
8. The multi-strategy knowledge extraction system according to claim 5, wherein the deep learning model extracts medical entity concepts as follows: annotated entities from the field of traditional Chinese medicine are selected for training to obtain the deep learning model, which then extracts the medical entity concepts.
9. An information data processing terminal, characterized in that the information data processing terminal is arranged to implement a multi-strategy knowledge extraction system as claimed in any one of claims 1-8.
CN202311159972.0A 2023-09-08 2023-09-08 Multi-strategy knowledge extraction system Pending CN117312493A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311159972.0A CN117312493A (en) 2023-09-08 2023-09-08 Multi-strategy knowledge extraction system

Publications (1)

Publication Number Publication Date
CN117312493A true CN117312493A (en) 2023-12-29

Family

ID=89254463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311159972.0A Pending CN117312493A (en) 2023-09-08 2023-09-08 Multi-strategy knowledge extraction system

Country Status (1)

Country Link
CN (1) CN117312493A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776711A (en) * 2016-11-14 2017-05-31 浙江大学 A kind of Chinese medical knowledge mapping construction method based on deep learning
CN113505244A (en) * 2021-09-10 2021-10-15 中国人民解放军总医院 Knowledge graph construction method, system, equipment and medium based on deep learning
CN116541472A (en) * 2023-03-22 2023-08-04 麦博(上海)健康科技有限公司 Knowledge graph construction method in medical field
CN116628172A (en) * 2023-07-24 2023-08-22 北京酷维在线科技有限公司 Dialogue method for multi-strategy fusion in government service field based on knowledge graph

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
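Zhang Yipin et al.: "Research on Traditional Chinese Medicine Entity Extraction Methods Based on Deep Learning", Journal of Medical Informatics (医学信息学杂志), vol. 40, no. 2, 25 February 2019 (2019-02-25), pages 58 - 63 *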

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination