CN117312493A - Multi-strategy knowledge extraction system - Google Patents
- Publication number
- CN117312493A CN202311159972.0A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H70/00—ICT specially adapted for the handling or processing of medical references
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a multi-strategy knowledge extraction system and relates to the technical field of knowledge systems. The technical scheme is as follows: the multi-strategy knowledge extraction system comprises a data layer, a preprocessing layer, a model layer, a service layer and an application layer. With the constructed multi-strategy knowledge extraction model, different extraction strategies can be formulated for documents of different classes, a large amount of data can be labeled automatically, and the method is applicable to long texts and sparse words. Based on a knowledge graph research framework and taking evidence-based clinical documents on traditional Chinese medicine (TCM) gynecology as an example, a multi-strategy knowledge extraction framework based on deep learning and other methods is constructed. The framework realizes entity extraction for knowledge units in the evidence-based TCM field, investigates the key technologies for building an evidence-based knowledge system for TCM gynecology, and provides the best clinical evidence for TCM gynecological clinical research in China and abroad.
Description
Technical Field
The invention relates to the technical field of knowledge systems, in particular to a multi-strategy knowledge extraction system.
Background
Evidence-based medicine is an emerging interdisciplinary clinical discipline that developed in the early 1990s; it is a method of reasonably applying the best evidence on the basis of clinical practice. Evidence-based traditional Chinese medicine draws on the theory and methods of evidence-based medicine to collect, evaluate, produce and translate evidence on the effectiveness, safety and economy of traditional Chinese medicine. It can reveal the characteristics and rules of the clinical action of traditional Chinese medicine and guide the formulation of clinical guidelines, clinical pathways and health decisions, and it is an important branch of evidence-based medicine research. In recent years, the number of evidence-based clinical research documents on traditional Chinese medicine has kept increasing, but clinicians seldom apply the diagnosis and treatment methods described in them. On the one hand, the conclusions in the literature cannot resolve complex clinical problems, and a clinician searching for evidence to support a clinical decision may abandon that evidence because it forms only a single evidence chain. On the other hand, clinical practice literature is not directly converted into medical decision advice and cannot be consulted accurately in real time, which greatly hinders the operability of evidence-based literature in clinical decision-making and practice.
Current research on domain-specific knowledge and term extraction focuses mainly on three kinds of methods: methods based on domain linguistic rules, methods based on statistics and machine learning, and methods based on deep learning. In practice, several methods are usually combined to obtain a better recognition effect. (1) Term extraction based on domain linguistic rules relies mainly on the linguistic feature rules of domain terms, or on matching domain terms against the terms in a dictionary. In this process, an entity dictionary or a rule dictionary is constructed first, the text is then segmented into words and tagged with parts of speech, and entities are extracted by pattern matching. Common matching methods include the maximum matching algorithm and regular expression matching; rule dictionaries are built by manual summarization, automatic statistical summarization and similar means, and content that matches is taken as a candidate term. Current research mainly analyzes the word-formation patterns of terms in a given industry to realize term extraction in different fields. For example, some studies formulate extended rules and combine them with statistical features to extract terms, while others combine statistical features with qualitative and quantitative linguistic rule analysis to extract cross-domain compound terms. This kind of method needs the background knowledge of domain experts for support and maintenance and cannot easily be migrated to other domains.
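As an illustration of the dictionary matching described in (1), the following is a minimal sketch of forward maximum matching in Python. The dictionary contents and the function name are hypothetical and chosen only for this example; the source does not specify which variant of maximum matching is used.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy longest-match lookup; returns (start, end, term) spans."""
    if max_len is None:
        max_len = max((len(term) for term in dictionary), default=0)
    spans = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = (i, i + size, candidate)
                break
        if match:
            spans.append(match)
            i = match[1]  # continue after the matched term
        else:
            i += 1        # no term starts here; advance one character
    return spans


# Hypothetical toy dictionary (assumption, not taken from the source).
TOY_DICTIONARY = {"random", "random number table", "control group"}
candidates = forward_max_match(
    "a random number table assigned the control group", TOY_DICTIONARY
)
```

Because the longest candidate is tried first, "random number table" is preferred over the shorter entry "random" that shares its prefix.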
(2) Term extraction based on statistics and machine learning can be realized in two ways. The first completes two tasks in sequence: entity boundary recognition and entity type prediction. The second treats named entity recognition (NER) as a sequence labeling task: a sequence labeling model is constructed to predict the label at each position of the sequence, and the boundary and category of each entity are determined from the labels. Common statistical machine learning models include conditional random fields and support vector machines. Feature engineering is the key engineering activity in constructing a statistical machine learning model: researchers manually select valuable features and construct suitable feature templates according to the characteristics of traditional Chinese medicine entities and texts so as to improve the recognition effect of the model. Current research mainly adopts N-gram statistical language modeling and extracts terms with the help of extended statistical features; the common statistical features include term frequency (TF), cohesion (pointwise mutual information, PMI), degree of freedom (DF) and C-value. Term extraction based on statistical methods is suitable for high-frequency, high-quality terms but is less effective for low-frequency and sparse terms. (3) Methods based on deep learning mainly solve the sequence labeling problem by end-to-end learning; that is, after the original data are input into the model, the model completes feature learning and label prediction automatically.
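To make the sequence labeling view concrete, the sketch below shows how entity boundaries and categories can be recovered from per-token labels. It assumes the widely used BIO tagging scheme; the source does not name a tagging scheme, and the tag names GROUP and DRUG are hypothetical.

```python
def bio_to_entities(tokens, tags):
    """Collect (entity_text, entity_type) pairs from parallel token/BIO-tag lists."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that is still open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag closes any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Toy example with hypothetical tag names.
tokens = ["The", "control", "group", "received", "Guizhi", "Fuling", "pills"]
tags = ["O", "B-GROUP", "I-GROUP", "O", "B-DRUG", "I-DRUG", "I-DRUG"]
entities = bio_to_entities(tokens, tags)
```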
Typically, a deep learning model comprises an embedding layer, an encoding layer and a decoding layer. The embedding layer obtains an embedded representation of the text sequence; the encoding layer performs text feature extraction and label prediction, using neural network models such as the bidirectional long short-term memory network (BiLSTM) or Bidirectional Encoder Representations from Transformers (BERT); the decoding layer resolves the optimal label sequence. However, NER research under supervised learning depends on large-scale, high-quality labeled corpora, whereas corpus resources in the traditional Chinese medicine field are scarce and corpora are costly to acquire.
Existing single knowledge extraction methods cannot efficiently complete knowledge extraction tasks in the traditional Chinese medicine field: dictionary- and rule-based methods rely on expert dictionaries, while machine learning and deep learning methods perform well but are not suitable for long texts or sparse words and need a large amount of labeled data.
Accordingly, the present invention is directed to a multi-strategy knowledge extraction system that addresses the above-mentioned related problems.
Disclosure of Invention
The invention aims to provide a multi-strategy knowledge extraction system that solves the following problems: existing single knowledge extraction methods cannot efficiently complete knowledge extraction tasks in the traditional Chinese medicine field; dictionary- and rule-based methods depend on expert dictionaries; and machine learning and deep learning methods perform well but are not suitable for long texts or sparse words and need a large amount of labeled data.
The technical aim of the invention is realized by the following technical scheme: a multi-strategy knowledge extraction system comprising a data layer, a preprocessing layer, a model layer, a service layer and an application layer;
the data layer is used for receiving and storing data and providing data support for the model layer, the service layer and the application layer;
the preprocessing layer is used for converting document formats, cleaning the data and manually correcting the format;
the model layer is used for constructing a grammar model, a domain dictionary rule model and a deep learning model aiming at the characteristics of different entity types;
the service layer is used for packaging the grammar model, the domain dictionary rule model and the deep learning model to obtain a multi-strategy knowledge extraction model, and extracting conventional article concepts, evidence-based entities and traditional Chinese medicine domain entities based on the multi-strategy knowledge extraction model;
the application layer is used for carrying out automatic literature knowledge extraction, literature knowledge online processing and literature user management.
The invention is further configured as follows: the data layer comprises a document library, a rule library, a domain dictionary, a stop-word library and a noise library.
The invention is further configured as follows: the preprocessing layer comprises:
a format conversion module, used for converting an input document in PDF format into an editable text form, such as TXT format, using OCR technology;
a data cleaning module, used for cleaning the input data, including checking the consistency of the input data, processing invalid data and deleting data.
The invention is further provided with: the model layer includes:
the grammar pattern model construction module is used for extracting entity characteristics of the conventional concepts and constructing a grammar pattern model through statistics and generalization of the entity characteristics;
the domain dictionary rule model construction module is used for constructing a domain dictionary rule model from an extensible domain dictionary combined with a similarity matching algorithm;
the deep learning model construction module is used for constructing a BERT-BiLSTM-CRF model, namely the deep learning model, by combining an embedding layer formed by the BERT model, a bidirectional LSTM layer and a CRF layer.
The invention is further configured as follows: extracting conventional article concepts, evidence-based entities and traditional Chinese medicine domain entities based on the multi-strategy knowledge extraction model specifically comprises: the grammar model extracts conventional concepts, the domain dictionary rule model extracts evidence-based entity concepts, and the deep learning model extracts medical entity concepts.
The invention is further configured as follows: the grammar model extracts the conventional concepts by constructing a joint rule from the position, context and keyword features of each entity, wherein the conventional concepts comprise titles, authors, abstracts, keywords and references.
The invention is further configured as follows: the domain dictionary rule model extracts evidence-based concepts by keyword and pattern similarity, according to preset rules describing how evidence-based concepts occur; the evidence-based concepts include evidence-based medicine, randomized control, envelope method, random number table method, diagnostic criteria, total number of cases, inclusion criteria, exclusion criteria, grouping, test group, control group, medication method, meta-analysis, systematic review, RCT, randomized controlled trial, sample size, risk of bias, fixed-effect model, random-effects model, pooled effect size analysis, relative risk (RR), OR, 95% confidence interval and CI.
The invention is further configured as follows: the deep learning model extracts medical entity concepts by selecting labeled entities in the traditional Chinese medicine field for training, thereby obtaining a deep learning model for extracting medical entity concepts.
The invention also provides an information data processing terminal which is used for realizing the multi-strategy knowledge extraction system.
In summary, the invention has the following beneficial effects: with the constructed multi-strategy knowledge extraction model, different extraction strategies can be formulated for documents of different classes, a large amount of data can be labeled automatically, and the method is applicable to long texts and sparse words. Based on a knowledge graph research framework and taking evidence-based clinical documents on traditional Chinese medicine (TCM) gynecology as an example, a multi-strategy knowledge extraction framework based on deep learning and other methods is constructed. The framework realizes entity extraction for knowledge units in the evidence-based TCM field, investigates the key technologies for building an evidence-based knowledge system for TCM gynecology, and provides the best clinical evidence for TCM gynecological clinical research in China and abroad.
Drawings
FIG. 1 is a system configuration diagram of a multi-strategy knowledge extraction system according to embodiment 1 of the present invention;
FIG. 2 is a diagram of different types of entity extraction policies of a multi-policy knowledge extraction system according to embodiment 1 of the invention;
FIG. 3 is a schematic diagram showing the effect of document preprocessing in a multi-strategy knowledge extraction system according to embodiment 1 of the present invention;
FIG. 4 is a schematic diagram of domain model thesaurus construction of a multi-strategy knowledge extraction system according to embodiment 1 of the present invention;
FIG. 5 is a schematic diagram of the structure of the BERT-BiLSTM-CRF model in a multi-strategy knowledge extraction system according to embodiment 1 of the present invention;
FIG. 6 is a schematic diagram of a first page of the knowledge overview in the multi-strategy knowledge extraction system according to embodiment 2 of the present invention;
FIG. 7 is a diagram of a first page of a knowledge graph in a multi-strategy knowledge extraction system according to embodiment 2 of the present invention;
FIG. 8 is a diagram of a document management page in a multi-strategy knowledge extraction system according to embodiment 2 of the present invention;
FIG. 9 is a schematic diagram of a knowledge extraction page in a multi-strategy knowledge extraction system according to embodiment 2 of the present invention;
FIG. 10 is a schematic diagram of an entity modification page in the multi-strategy knowledge extraction system according to embodiment 2 of the present invention;
FIG. 11 is a schematic diagram of an entity addition page in the multi-strategy knowledge extraction system according to embodiment 2 of the present invention;
FIG. 12 is a schematic diagram of an entity word addition page in the multi-strategy knowledge extraction system according to embodiment 2 of the present invention.
Detailed Description
The invention is described in further detail below with reference to fig. 1-12.
Example 1
As shown in FIGS. 1-5, the multi-strategy knowledge extraction system comprises a data layer, a preprocessing layer, a model layer, a service layer and an application layer;
the data layer is used for receiving and storing data and providing data support for the model layer, the service layer and the application layer; the data layer comprises a document library, a rule library, a domain dictionary, a stop-word library and a noise library.
The preprocessing layer is used for converting document formats, cleaning the data and manually correcting the format;
the preprocessing layer comprises:
a format conversion module, used for converting an input document in PDF format into an editable text form, such as TXT format, using OCR technology;
a data cleaning module, used for cleaning the input data, including checking the consistency of the input data, processing invalid data and deleting data.
In this embodiment, for a document input in PDF format, OCR technology is used to convert the image information into editable, usable text, which facilitates later processing. Data containing charts or disordered typesetting cannot be recognized well and must be cleaned and tidied with manual assistance.
In this embodiment, an end-to-end image sequence recognition technology is adopted: a neural network architecture integrating feature extraction, sequence modeling and transcription is built, and a combination of two networks, CNN + RNN, is used for training. The algorithm naturally handles sequences of arbitrary length, involves no character segmentation or horizontal scale normalization, performs well in both lexicon-free and lexicon-based scene text recognition tasks, and can recognize Chinese and English text with a recognition accuracy of 95%; the effect is shown in FIG. 3.
The model layer is used for constructing a grammar model, a domain dictionary rule model and a deep learning model aiming at the characteristics of different entity types;
the model layer comprises:
the grammar pattern model construction module is used for extracting entity characteristics of the conventional concepts and constructing a grammar pattern model through statistics and generalization of the entity characteristics;
the domain dictionary rule model construction module is used for constructing a domain dictionary rule model from an extensible domain dictionary combined with a similarity matching algorithm;
the deep learning model construction module is used for constructing a BERT-BiLSTM-CRF model, namely the deep learning model, by combining an embedding layer formed by the BERT model, a bidirectional LSTM layer and a CRF layer.
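The role of the CRF layer on top of the bidirectional LSTM can be illustrated with Viterbi decoding, the standard way to find the best label sequence given per-token scores and label transition scores. The following is a pure-Python sketch with made-up scores: in a trained model the emission scores would come from the BiLSTM output and the transition scores would be learned, and neither is given in the source.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag path.

    emissions:   list of {tag: score}, one dict per token
    transitions: {(previous_tag, next_tag): score}; missing pairs score 0.0
    """
    best = {tag: emissions[0][tag] for tag in tags}
    backpointers = []
    for scores in emissions[1:]:
        step, pointers = {}, {}
        for cur in tags:
            # Best previous tag for reaching `cur` at this position.
            prev = max(tags, key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            step[cur] = best[prev] + transitions.get((prev, cur), 0.0) + scores[cur]
            pointers[cur] = prev
        backpointers.append(pointers)
        best = step
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[tag])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Made-up scores (assumption): the transition O -> I is heavily penalized.
TAGS = ["O", "B", "I"]
TRANSITIONS = {("O", "I"): -10.0}
EMISSIONS = [
    {"O": 1.0, "B": 0.8, "I": 0.0},
    {"O": 0.1, "B": 0.3, "I": 1.0},
    {"O": 1.0, "B": 0.0, "I": 0.2},
]
best_path = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)
```

Picking the top-scoring tag at each position independently would give O, I, O, which starts an entity with an inside tag; the transition penalty steers the decoder to B, I, O instead.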
The service layer is used for packaging the grammar model, the domain dictionary rule model and the deep learning model to obtain a multi-strategy knowledge extraction model, and for extracting conventional article concepts, evidence-based entities and traditional Chinese medicine domain entities based on the multi-strategy knowledge extraction model. This extraction specifically comprises: the grammar model extracts conventional concepts, the domain dictionary rule model extracts evidence-based entity concepts, and the deep learning model extracts medical entity concepts.
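One way to picture the packaging step is a thin wrapper that registers each strategy behind a common interface and runs them all on a document. The sketch below is an assumption about how such a wrapper could look, not the patent's implementation; the class name, the stub extractors and the entity type labels are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Entity = Tuple[str, str]                 # (entity text, entity type)
Extractor = Callable[[str], List[Entity]]


class MultiStrategyExtractor:
    """Hypothetical wrapper that packages several extraction strategies."""

    def __init__(self) -> None:
        self._strategies: Dict[str, Extractor] = {}

    def register(self, name: str, extractor: Extractor) -> None:
        self._strategies[name] = extractor

    def extract(self, text: str) -> Dict[str, List[Entity]]:
        # Run every registered strategy and keep the results separate.
        return {name: run(text) for name, run in self._strategies.items()}


def grammar_stub(text: str) -> List[Entity]:
    # Stand-in for the grammar model: the first line is taken as the title.
    return [(text.strip().splitlines()[0], "title")]


def dictionary_stub(text: str) -> List[Entity]:
    # Stand-in for the domain dictionary rule model.
    return [(t, "evidence_based") for t in ("RCT", "control group") if t in text]


def deep_learning_stub(text: str) -> List[Entity]:
    # Placeholder for the deep learning model; returns nothing here.
    return []


pipeline = MultiStrategyExtractor()
pipeline.register("grammar", grammar_stub)
pipeline.register("dictionary", dictionary_stub)
pipeline.register("deep_learning", deep_learning_stub)
result = pipeline.extract("A toy report\nThis RCT compared a test group with a control group.")
```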
The grammar model extracts the conventional concepts by constructing a joint rule from the position, context and keyword features of each entity; the conventional concepts comprise titles, authors, abstracts, keywords and references.
In this embodiment, the conventional concepts in the article structure, such as the title, authors, abstract, keywords and references, are parsed with a pattern extraction method based on keywords and prior rules. Here the pattern mainly refers to grammar rules: a joint rule is constructed from features such as the position, context and keywords of an entity in order to extract the conventional concepts.
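A joint rule of this kind might combine a position feature (the first non-empty line is the title) with keyword features (a line that starts with "Abstract" or "Keywords"). The Python sketch below uses rules invented for illustration; the actual rules of the grammar model are not given in the source, and the sample text is a toy input.

```python
import re

FIELD_RULE = re.compile(r"^(abstract|keywords?)\s*[:：]\s*(.+)$", re.IGNORECASE)


def extract_conventional_concepts(text):
    """Apply simple position and keyword rules to a plain-text article."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    concepts = {}
    if lines:
        concepts["title"] = lines[0]  # position rule: first non-empty line
    for line in lines:
        match = FIELD_RULE.match(line)
        if not match:
            continue
        field_name, value = match.group(1).lower(), match.group(2)
        if field_name.startswith("keyword"):
            # Keyword rule: split the list on common separators.
            concepts["keywords"] = [k.strip() for k in re.split(r"[;,；，]", value) if k.strip()]
        else:
            concepts["abstract"] = value
    return concepts


SAMPLE = """A toy trial report
Abstract: To observe the effect of a treatment.
Keywords: dysmenorrhea; RCT
"""
concepts = extract_conventional_concepts(SAMPLE)
```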
A scientific document is composed of a number of relatively independent knowledge elements that are semantically associated with one another. To help users quickly grasp the main content of a document, this embodiment constructs a knowledge element ontology model of scientific documents at the micro level, which uniformly describes and represents the composition of the unstructured knowledge elements of a scientific document and the semantic relations among them. The ontology model may be formally represented as follows:
$KE_{UA} = (G, M, C, (sp_1, sp_2, sp_3, \ldots), (cw_1, cw_2, cw_3, \ldots), Ti)$
where $KE_{UA}$ denotes the unstructured knowledge element ontology of a scientific document; $G$ denotes the unstructured abstract element; $M$ denotes the method element in the body text; $C$ denotes the result/conclusion element in the body text; $(sp_1, sp_2, sp_3, \ldots)$ denotes the set of sentence patterns extracted from the unstructured text; $(cw_1, cw_2, cw_3, \ldots)$ denotes the set of cue words; and $Ti$ denotes the title of the source document.
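For readers who prefer a data structure to a tuple, the six components of the ontology model could be held in a small record type. The sketch below is only one possible representation; the class and field names are hypothetical and the example values are toy data.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class KnowledgeElement:
    """Hypothetical container mirroring the six components of the formal model."""
    abstract: str                                               # G
    method: str                                                 # M
    conclusion: str                                             # C
    sentence_patterns: List[str] = field(default_factory=list)  # (sp1, sp2, ...)
    cue_words: List[str] = field(default_factory=list)          # (cw1, cw2, ...)
    title: str = ""                                             # Ti

    def as_tuple(self) -> Tuple:
        # Same ordering as in the formal representation.
        return (self.abstract, self.method, self.conclusion,
                tuple(self.sentence_patterns), tuple(self.cue_words), self.title)


example = KnowledgeElement(
    abstract="toy abstract",
    method="toy method sentence",
    conclusion="toy conclusion sentence",
    cue_words=["method", "conclusion"],
    title="Toy title",
)
```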
The text is processed, the identified sentences are labeled, and the processed title and body text, together with the sentence patterns, cue words and position information of each part of the text, are finally stored as an XML document. The algorithm is as follows:
Algorithm: extracting the elements in the text
Input: an Excel document storing all titles and body paragraphs
Output: XML documents
Begin
1) Read the Excel document;
2) Traverse the title and body text, split the body text into sentences at periods, and store the sentences as a list;
3) Match each sentence against the method cue words;
4) if the match succeeds, record the list index of the matched sentence as m, label the sentences before m as "purpose", and match the subsequent sentences against the conclusion cue words;
if that match succeeds, record the list index of the matched sentence as n, label the sentences from m to n as "method", label the sentences after n as "conclusion", and break;
5) else match the sentences against the conclusion cue words;
if the match succeeds, record the list index of the matched sentence as n, label the sentences before n as "method", label the sentences after n as "conclusion", and break;
6) Create a DOM tree object, add the corresponding nodes to the DOM object to store the sentences, and output the XML document.
End
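The labeling steps above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the cue-word lists are hypothetical (the text does not list them), the Excel input is replaced by plain strings, the sentence that matches a conclusion cue word is assumed to belong to the conclusion, and the handling of a text with a method cue but no conclusion cue is also an assumption.

```python
import re
from xml.dom.minidom import Document

# Hypothetical cue words; the source does not enumerate them.
METHOD_CUES = ("we propose", "method", "approach")
CONCLUSION_CUES = ("results show", "we conclude", "in conclusion")


def split_sentences(text):
    """Split body text into sentences at periods (step 2)."""
    return [s.strip() for s in re.split(r"[.。]", text) if s.strip()]


def first_match(sentences, cues, start=0):
    """Return the index of the first sentence containing a cue word, or None."""
    for i in range(start, len(sentences)):
        if any(cue in sentences[i].lower() for cue in cues):
            return i
    return None


def label_sentences(sentences):
    """Assign 'purpose', 'method' or 'conclusion' to each sentence (steps 3-5)."""
    total = len(sentences)
    labels = ["unlabeled"] * total
    m = first_match(sentences, METHOD_CUES)
    if m is not None:
        labels[:m] = ["purpose"] * m
        n = first_match(sentences, CONCLUSION_CUES, m + 1)
        # Assumption: without a conclusion cue, everything from m on is "method".
        end = total if n is None else n
        labels[m:end] = ["method"] * (end - m)
        labels[end:] = ["conclusion"] * (total - end)
    else:
        n = first_match(sentences, CONCLUSION_CUES)
        if n is not None:
            labels[:n] = ["method"] * n
            labels[n:] = ["conclusion"] * (total - n)
    return labels


def to_xml(title, sentences, labels):
    """Store the title and labeled sentences in a DOM tree and serialize it (step 6)."""
    doc = Document()
    root = doc.createElement("document")
    doc.appendChild(root)
    title_node = doc.createElement("title")
    title_node.appendChild(doc.createTextNode(title))
    root.appendChild(title_node)
    for sentence, label in zip(sentences, labels):
        node = doc.createElement(label)
        node.appendChild(doc.createTextNode(sentence))
        root.appendChild(node)
    return doc.toxml()
```

Splitting on periods alone also splits decimal numbers and abbreviations, so a production version would likely need a more careful sentence splitter.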
The domain dictionary rule model extracts evidence-based concepts by keyword and pattern similarity, according to preset rules describing how evidence-based concepts occur. The evidence-based concepts include evidence-based medicine, randomized control, envelope method, random number table method, diagnostic criteria, total number of cases, inclusion criteria, exclusion criteria, grouping, test group, control group, medication method, META analysis, systematic review, RCT, randomized controlled trial, sample size, risk of bias, fixed effect model, random effect model, pooled effect size analysis, relative risk RR, OR, 95% confidence interval, and CI.
In this embodiment, evidence-based concept entities are long, do not fit the conventional definition of an entity, and have strong domain characteristics, so they are better suited to a pattern extraction method based on keywords and pattern similarity. Research has found that the entity types of greatest interest to experts and practitioners in evidence-based medicine mainly fall into the following two categories, and that extracting these entities plays a key role in assessing the evidence grade of evidence-based literature.
1) RCT literature: evidence-based medicine, randomized control, envelope method, random number table method, diagnostic criteria, total number of cases, inclusion criteria, exclusion criteria, grouping, test group, control group, and medication method.
2) META analysis literature: META analysis, systematic review, RCT, randomized control, sample size, risk of bias, fixed effect model, random effect model, pooled effect size analysis, relative risk RR, OR, 95% confidence interval, CI.
To extract evidence-based concepts, rules describing how the concepts occur must first be formulated, and the concepts are then extracted by keyword and pattern similarity. Using a dictionary of related concept words as an aid can greatly improve the extraction precision.
Taking three evidence-based entities (randomization, allocation concealment and blinding) as examples, the extraction rule patterns found in the literature are as follows:
Table 1. Pattern examples
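As a purely illustrative sketch of keyword-plus-pattern rules for these three entities, the following Python snippet matches sentences against regular expressions. The patterns are hypothetical English examples written for this illustration; they are not the entries of Table 1, and the rule names are assumptions.

```python
import re

# Hypothetical rule patterns for three evidence-based entities.
RULES = {
    "randomization": re.compile(
        r"\b(random number table|envelope method|randomi[sz]ed|randomly)\b", re.I
    ),
    "allocation_concealment": re.compile(
        r"\b(sealed envelopes?|allocation concealment|concealed allocation)\b", re.I
    ),
    "blinding": re.compile(
        r"\b((?:single|double|triple)[- ]blind(?:ed)?|blinding)\b", re.I
    ),
}


def extract_concepts(sentence):
    """Return (concept, matched text) pairs for every rule that fires."""
    hits = []
    for concept, pattern in RULES.items():
        for match in pattern.finditer(sentence):
            hits.append((concept, match.group(0)))
    return hits
```

For example, `extract_concepts("Patients were randomly assigned using a random number table in a double-blind design.")` reports two randomization hits and one blinding hit. A domain dictionary could be used to generate or extend such patterns.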
For entities such as evaluation indexes, an expert dictionary is built with a term generation algorithm based on improved frequent item sets, and similar concepts are then extracted on the basis of the dictionary.
To extract evidence-based concepts in the traditional Chinese medicine evidence-based literature that are internally cohesive and externally free to combine, the module designs a multi-strategy frequent pattern extraction algorithm. Built on the N-Gram statistical language model, the algorithm scores terms with the index FP-value, which combines the strengths of term frequency, degree of freedom, cohesion and C-value, and filters candidates with stop words and matching rules for common-word prefixes and suffixes. The term generation algorithm based on improved frequent item sets comprises the following steps:
Step 1: Text preprocessing. Remove e-mail addresses, telephone numbers, mobile phone numbers, dates, URLs and the like from the text, and replace punctuation marks with spaces.
Step 2: Candidate phrase generation. Count the text corpus using the N-Gram statistical language model and filter out text fragments below the word-length threshold to obtain candidate text fragments.
Step 3: Term quality scoring. First, compute the term frequency tf, cohesion pmi, degree of freedom df and C-value cval of each candidate text fragment; then normalize each feature with the Sigmoid function; finally, combine the feature values into the index FPDC (written FP-value in the formula below). The features are initialized with evenly distributed weights; to account for head words nested in multi-word terms in the traditional Chinese medicine field, the weight of term frequency is penalized by 0.15 and the weight of C-value is rewarded by 0.15. Candidate evidence-based concepts are obtained by screening against a threshold, as shown in the formula.
$$\text{FP-value}(C_1 \ldots C_n) = 0.25\,pmi(C_1 \ldots C_n) + 0.25\,df(C_1 \ldots C_n) + 0.4\,cval(C_1 \ldots C_n) + 0.1\,tf(C_1 \ldots C_n)$$
where $C_1 \ldots C_n$ denotes a candidate text fragment composed of several words.
Step 4: Language rule filtering. Filter the candidate evidence-based concepts by stop-word filtering and by language rules on common words used as prefixes and suffixes.
Step 5: Semantic feature similarity matching and traversal. For each feature seed word, find the candidate term with the greatest similarity to it; when the similarity exceeds a set threshold, the candidate term is considered similar to the seed word and is added to the result term set. Taking the connectivity between candidate terms into account, the similarity threshold is decayed exponentially as the number of connected words increases, with MinSim set as the minimum similarity threshold for separating words from one another; the threshold is formulated as follows:
Step 6: Result ranking and output. Match the semantic feature similarity of each feature seed word against the candidate terms to obtain the similarity matching results, rank them from high to low similarity, output the final evidence-based concept term extraction results, and update the expert lexicon.
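Steps 1 to 3 above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the regular expressions and the thresholds `max_n` and `min_chars` are hypothetical, the input is assumed to be already tokenized, and the raw features pmi, df, cval and tf are taken as given rather than computed. The weights 0.25, 0.25, 0.4 and 0.1 follow from the uniform initialization with a 0.15 penalty on term frequency and a 0.15 reward on C-value described in Step 3; the weight of 0.25 for the degree of freedom is inferred from that description.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Step 1 (partial): drop e-mail addresses and URLs, replace punctuation with spaces."""
    text = re.sub(r"\S+@\S+", " ", text)
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"[^\w\s]", " ", text)


def candidate_ngrams(tokens, max_n=3, min_chars=4):
    """Step 2: count 1..max_n-grams and drop fragments shorter than min_chars."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if len(gram) >= min_chars:
                counts[gram] += 1
    return counts


def sigmoid(x):
    """Numerically stable logistic function used to normalize each feature."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fp_value(pmi, df, cval, tf):
    """Step 3: weighted sum of the sigmoid-normalized features."""
    return (0.25 * sigmoid(pmi) + 0.25 * sigmoid(df)
            + 0.40 * sigmoid(cval) + 0.10 * sigmoid(tf))
```

Candidates whose `fp_value` falls below a chosen threshold would then be discarded before the rule filtering of Step 4. For Chinese text, `tokens` would typically be characters or segmented words rather than whitespace-separated tokens.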
The deep learning model extracts medical entity concepts as follows: labeled entities in the traditional Chinese medicine field are selected for training to obtain the deep learning model, which then extracts the medical entity concepts.
In this embodiment, relatively mature extraction algorithms already exist for five types of entities: diseases, syndromes, symptoms, prescriptions and traditional Chinese medicines. In view of the naming features of entities in the traditional Chinese medicine field, the module selects labeled entities from that field for training and uses a BERT-BiLSTM-CRF model for concept extraction.
In the BERT-BiLSTM-CRF model, the BERT model forms the Embedding layer (mainly comprising character vectors, word vectors and some additional features), a bidirectional LSTM layer predicts the label of each word, and the CRF layer selects the optimal result from the label sequences. The CRF takes label paths as its prediction target: on top of the logits, it can add constraints to the final predicted label sequence to ensure that the predicted entity label sequence is valid, and the CRF layer learns these constraints automatically from the training data set during training. In practice the final output may yield many label sequence combinations. Three label path combinations are shown in Fig. 5: the red path has the label sequence [B-C, I-C,, B-P, I-P,, O], the blue path [O, B-P,, I-P, O,, O] and the green path [I-C, O,, O, I-P,, B-C]. The red path is the correct one, and the other two are paths that the prediction might produce.
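To illustrate the kind of constraint a CRF layer can learn, the following Python sketch checks whether a label sequence is valid under the common BIO convention, in which an I-X label may only follow B-X or I-X. This is an assumption about the tagging scheme, and the function is a hand-written rule check, not the CRF itself; the sequences in the usage note are shortened, hypothetical versions of the three paths with the gaps omitted.

```python
def is_valid_bio(labels):
    """Return True if every I-X label continues an entity of the same type X."""
    prev = "O"
    for label in labels:
        if label.startswith("I-"):
            entity_type = label[2:]
            if prev not in ("B-" + entity_type, "I-" + entity_type):
                return False
        prev = label
    return True
```

Under this check, `["B-C", "I-C", "B-P", "I-P", "O"]` and `["O", "B-P", "I-P", "O", "O"]` are both well-formed, while `["I-C", "O", "O", "I-P", "B-C"]` is rejected because it starts with an I- label. A well-formed sequence can still be wrong, which is why the CRF scores whole paths rather than only rejecting invalid ones.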
Annotation and training were carried out on 1000 domain documents, and the final F value was 93.97%. Data set example:
The application layer performs automatic literature knowledge extraction, online processing of literature knowledge, and literature user management.
Example 2
This example provides a specific system implementation based on the idea of the embodiment above, as follows:
1. Introduction to system functionality
Volume of literature data: Chinese: 239 meta-analyses, 1604 RCT studies; English: 8 meta-analyses, 187 RCT studies. Corpus size: the entity counts are as follows, including 19743 semantic relations.
The system is divided into three modules: a knowledge overview module, a literature information module and a knowledge extraction module. The knowledge overview module mainly displays the overall extraction results of the system data. The upper part shows the total number of documents included in the system, the number of extracted entities and the number of relations formed among entities (the product relies mainly on entity co-occurrence relations). The lower part is mainly used to manage the state of individual entities and lists the name, source document, modification time, entity type, entity state and available operations of each entity. Clicking Edit modifies an entity; clicking Delete deletes the entity together with the corresponding entity and relations in the document extraction results.
The literature information module mainly handles the uploading, management and status of documents, as shown in Fig. 8. Clicking Upload Document lets the user select local PDF documents and upload them in batches; after automatic processing, the uploaded documents are passed to the knowledge extraction module and can be viewed there.
The knowledge extraction module uses multiple strategies to extract and display the different types of entities in traditional Chinese medicine evidence-based medical documents, as shown in Fig. 9. The system displays the extraction results: conventional concept entities are marked in the full text, and medical concept entities and evidence-based concept entities are listed on the right, with support for adding, deleting, modifying and looking up entities.
For entities that need to be modified or deleted, look up the entity in the list on the right and modify or delete it, as shown in Fig. 10. When an entity is modified or deleted, the relations of the original entity are modified or deleted accordingly, and the modification or deletion record is synchronized to the knowledge overview on the home page.
For entities that need to be added, use the search-and-add function in the list area to the right of the article. Edit the type label and entity name, confirm, and save. The added record is synchronized to the knowledge overview on the home page. The added entities are shown in Fig. 11.
In addition, online word annotation is supported: select the words with the mouse, right-click to choose the corresponding entity type, confirm, and save. The added record is synchronized to the knowledge overview on the home page. The added entities are shown in Fig. 12.
This embodiment merely explains the present invention and does not limit it. After reading this specification, those skilled in the art may, as needed, make modifications to this embodiment that involve no creative contribution, and all such modifications are protected by patent law as long as they fall within the scope of the claims of the present invention.
Claims (9)
1. The multi-strategy knowledge extraction system is characterized by comprising a data layer, a preprocessing layer, a model layer, a service layer and an application layer;
the data layer is used for receiving and storing data and providing data support for the model layer, the service layer and the application layer;
the preprocessing layer is used for carrying out format conversion on the document format, carrying out data cleaning on the data and carrying out manual format correction;
the model layer is used for constructing a grammar model, a domain dictionary rule model and a deep learning model aiming at the characteristics of different entity types;
the service layer is used for packaging the grammar model, the domain dictionary rule model and the deep learning model to obtain a multi-strategy knowledge extraction model, and extracting conventional article concepts, evidence-based entities and traditional Chinese medicine domain entities based on the multi-strategy knowledge extraction model;
the application layer is used for carrying out automatic literature knowledge extraction, literature knowledge online processing and literature user management.
2. The multi-strategy knowledge extraction system of claim 1 wherein said data layer comprises a document library, a rule library, a domain dictionary, a stop word library, and a noise library.
3. The multi-strategy knowledge extraction system of claim 1, wherein the data preprocessing layer comprises:
a format conversion module, used for converting an input document in PDF format into an editable text form, such as TXT format, using OCR technology;
and the data cleaning module is used for cleaning the data of the input data, and comprises the steps of checking the consistency of the input data, processing invalid data and deleting data.
4. The multi-strategy knowledge extraction system of claim 1, wherein the model layer comprises:
the grammar pattern model construction module is used for extracting entity characteristics of the conventional concepts and constructing a grammar pattern model through statistics and generalization of the entity characteristics;
the domain dictionary rule model construction module is used for constructing a domain dictionary rule model through an extensible domain dictionary combined with a similarity matching algorithm;
the deep learning model construction module is used for constructing a BERT-BiLSTM-CRF model, namely the deep learning model, by combining an Embedding layer formed by the BERT model, a bidirectional LSTM layer and a CRF layer.
5. The system of claim 4, wherein the conventional article concepts, evidence-based entities and traditional Chinese medicine domain entities are extracted based on the multi-strategy knowledge extraction model as follows: the grammar model extracts conventional concepts, the domain dictionary rule model extracts evidence-based entity concepts, and the deep learning model extracts medical entity concepts.
6. The multi-strategy knowledge extraction system according to claim 5, wherein said grammar model extracts conventional concepts including title, author, abstract, keywords and references by constructing joint rules based on locations, contexts and keyword features where entities appear.
7. The multi-strategy knowledge extraction system as claimed in claim 5, wherein the domain dictionary rule model extracts evidence-based concepts by keyword and pattern similarity according to preset rules describing how evidence-based concepts occur; the evidence-based concepts include evidence-based medicine, randomized control, envelope method, random number table method, diagnostic criteria, total number of cases, inclusion criteria, exclusion criteria, grouping, test group, control group, medication method, META analysis, systematic review, RCT, randomized controlled trial, sample size, risk of bias, fixed effect model, random effect model, pooled effect size analysis, relative risk RR, OR, 95% confidence interval, and CI.
8. The multi-strategy knowledge extraction system according to claim 5, wherein the deep learning model extracts medical entity concepts as follows: labeled entities in the field of traditional Chinese medicine are selected for training to obtain the deep learning model, which extracts the medical entity concepts.
9. An information data processing terminal, characterized in that the information data processing terminal is arranged to implement a multi-strategy knowledge extraction system as claimed in any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311159972.0A CN117312493A (en) | 2023-09-08 | 2023-09-08 | Multi-strategy knowledge extraction system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117312493A true CN117312493A (en) | 2023-12-29 |
Family
ID=89254463
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311159972.0A Pending CN117312493A (en) | 2023-09-08 | 2023-09-08 | Multi-strategy knowledge extraction system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117312493A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106776711A (en) * | 2016-11-14 | 2017-05-31 | 浙江大学 | A kind of Chinese medical knowledge mapping construction method based on deep learning |
CN113505244A (en) * | 2021-09-10 | 2021-10-15 | 中国人民解放军总医院 | Knowledge graph construction method, system, equipment and medium based on deep learning |
CN116541472A (en) * | 2023-03-22 | 2023-08-04 | 麦博(上海)健康科技有限公司 | Knowledge graph construction method in medical field |
CN116628172A (en) * | 2023-07-24 | 2023-08-22 | 北京酷维在线科技有限公司 | Dialogue method for multi-strategy fusion in government service field based on knowledge graph |
Non-Patent Citations (1)
Title |
---|
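Zhang Yipin et al.: "Research on Traditional Chinese Medicine Entity Extraction Methods Based on Deep Learning", Journal of Medical Informatics (医学信息学杂志), vol. 40, no. 2, 25 February 2019 (2019-02-25), pages 58 - 63 * |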
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||