CN115510245A - Unstructured data oriented domain knowledge extraction method - Google Patents
- Publication number: CN115510245A
- Application number: CN202211259591.5A
- Authority: CN (China)
- Prior art keywords: entity, knowledge, extraction model, relation, unstructured data
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/367: Information retrieval of unstructured textual data; creation of semantic tools; ontology
- G06F16/335: Information retrieval of unstructured textual data; querying; filtering based on additional data, e.g. user or group profiles
- G06F16/355: Information retrieval of unstructured textual data; clustering/classification; class or cluster creation or modification
- G06F40/216: Natural language analysis; parsing using statistical methods
- G06F40/295: Natural language analysis; recognition of textual entities; named entity recognition
- G06N3/049: Neural network architectures; temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08: Neural networks; learning methods
- Y02P90/30: Climate change mitigation technologies in the production or processing of goods; computing systems specially adapted for manufacturing
Abstract
The invention discloses a domain knowledge extraction method for unstructured data, comprising the following steps: establishing an entity extraction model based on a bidirectional long short-term memory (BiLSTM) neural network and a conditional random field (CRF), establishing a relation extraction model based on an attention mechanism, and training the two models separately; extracting the unstructured data to be processed with the trained entity extraction model to obtain domain entities, which are stored in tabular form as a domain entity table; extracting relations with the trained relation extraction model and, on the basis of the domain entity table, obtaining an entity-relation table; performing knowledge fusion based on semantic similarity over all extracted entities and relations to obtain a fused entity-relation table, and establishing a knowledge graph in the neo4j graph database. The invention addresses the problems that domain knowledge acquisition is currently mostly manual, management efficiency is low, and domain knowledge systems are incomplete, and realizes knowledge extraction from unstructured data.
Description
Technical Field
The invention belongs to the technical field of knowledge extraction, and particularly relates to a domain knowledge extraction method oriented to unstructured data.
Background
Domain knowledge is characterized by high specialization, diverse knowledge carriers, and a complex knowledge system. Against the background of intelligent manufacturing, product research and development place increasingly urgent demands on domain knowledge; establishing a complete system for acquiring, managing, and sharing domain knowledge can effectively improve product research and development efficiency, and a domain knowledge graph is the key to realizing this goal. A knowledge graph is essentially a large-scale semantic network that describes real-world concepts and events as entities, with edges representing the interrelationships between them. The core of a knowledge graph is the triple composed of entities, attributes, and relations. Structurally, a knowledge graph can be divided into a mode layer (schema layer) and a data layer: the mode layer consists of concept ontologies and relations and describes the structure of the knowledge graph, while the data layer is the instantiated knowledge graph constructed from specific data under the guidance of the mode layer.
The domain knowledge graph is an important means of managing domain knowledge and its relationships; through it, the various kinds of knowledge in a domain can be managed uniformly. The construction process of the knowledge graph is therefore important. First, the data sources of the knowledge graph must be identified. In knowledge graph construction, data sources are divided into structured, semi-structured, and unstructured data; extraction from structured and semi-structured data is mature, while extraction from unstructured data is still in a development stage. In practical application, knowledge graph construction still relies mainly on manual work, and automatic construction mainly handles structured and semi-structured data. The process domain therefore needs an automatic knowledge extraction method for unstructured data, which helps manage complex, multi-source heterogeneous domain knowledge and supports design and decision-making in the domain.
Knowledge extraction from unstructured data can be decomposed into two parts: entity extraction and relation extraction.
In entity extraction, the development of natural language processing (NLP) technology has produced many deep-learning-based entity recognition algorithms. The recurrent neural network (RNN) processes sequence data and is suited to unstructured data consisting mainly of text. On this basis, the long short-term memory network (LSTM) was developed to avoid the gradient explosion problem, the bidirectional LSTM (BiLSTM) was developed to accelerate training, and a conditional random field (CRF) can be added to define the loss function and further improve extraction accuracy.
For relation extraction, current methods include the pipeline method and end-to-end (end2end) methods. The former first uses an entity extractor to identify the entities in a sentence, then pairs the extracted entities and feeds each pair, together with the original sentence, to a relation recognizer that identifies the relation between the two input entities. The latter, also called end-to-end relation extraction, extracts triples directly by processing each sentence. With the development of deep learning, relation extraction models based on convolutional neural networks (CNN) and on attention mechanisms have emerged in the field.
However, the entity and relation extraction methods above are currently used mainly in the general knowledge domain, which is characterized by wide coverage and large data volume; general-domain knowledge graphs are therefore usually constructed bottom-up, extracting information from large amounts of data to build the entities and relations of the graph. Domain knowledge differs from general knowledge: it places more weight on the specialization of knowledge and so requires a more rigorous structure. A domain knowledge graph must be constructed top-down: the mode layer of the graph is designed first, and the information belonging to the domain is determined according to the mode layer. At present, however, domain knowledge graph construction is still mainly manual, management efficiency is low, the processed data are mostly structured and semi-structured, and a systematic method for knowledge extraction from unstructured data is still lacking.
Disclosure of Invention
In view of the above, the invention provides a domain knowledge extraction method oriented to unstructured data, which addresses the problems that domain knowledge acquisition is currently mostly manual, management efficiency is low, and domain knowledge systems are incomplete, and realizes knowledge extraction from unstructured data.
The invention is realized by the following technical scheme:
A domain knowledge extraction method oriented to unstructured data, wherein unstructured data refers to data whose structure is irregular or incomplete, which has no predefined data model, and which is inconvenient to represent in a two-dimensional database logic table;
the extraction method comprises the following specific steps:
S1, combing the domain knowledge concept entities and relationships to establish a domain knowledge graph mode layer;
s2, preprocessing unstructured data to obtain manually marked text data;
S3, establishing an entity extraction model based on the bidirectional long short-term memory neural network and the conditional random field, establishing a relation extraction model based on an attention mechanism, and training the entity extraction model and the relation extraction model with their corresponding data sets;
S4, extracting the unstructured data to be processed with the trained entity extraction model to obtain domain entities, stored in tabular form as a domain entity table; extracting relations with the trained relation extraction model and, on the basis of the domain entity table, obtaining an entity-relation table in which entities and relations correspond one to one;
and performing knowledge fusion based on semantic similarity according to all the extracted entities and relations to obtain an entity-relation table after knowledge fusion, and establishing a knowledge graph in the neo4j graph database according to the entity-relation table.
Further, the specific steps of step S1 are as follows:
S1-1, combing the multi-scenario domain knowledge concepts and relations according to the purpose of knowledge extraction;
and S1-2, defining a knowledge structure according to the domain knowledge concept entity and the relationship, and establishing a domain knowledge graph mode layer.
Further, the specific steps of step S2 are as follows:
s2-1, analyzing unstructured data into a txt file by using a text analysis tool;
s2-2, performing word segmentation on the text file by using a Jieba word segmentation tool;
s2-3, performing stop word removal processing on the text after word segmentation;
and S2-4, manually labeling the text data based on a BIO labeling method or a BIOES labeling method.
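The preprocessing steps S2-2 through S2-4 can be sketched in Python. In practice the patent uses the Jieba segmenter and manual labeling; the stop-word list, pre-tokenized input, and entity lexicon below are illustrative stand-ins, not part of the patent:

```python
# Minimal sketch of steps S2-2..S2-4: stop-word removal after segmentation,
# then character-level BIO labeling. The stop words and entity types here are
# illustrative only.

STOP_WORDS = {"的", "了", "在", "是"}  # illustrative stop-word list

def remove_stop_words(tokens):
    """Step S2-3: drop stop words from a segmented token list."""
    return [t for t in tokens if t not in STOP_WORDS]

def bio_label(chars, entities):
    """Step S2-4: derive character-level BIO labels from entity annotations.

    `entities` maps an entity surface string to its type tag,
    e.g. {"柴油机": "WOR"} for a workpiece entity.
    """
    labels = ["O"] * len(chars)
    text = "".join(chars)
    for surface, etype in entities.items():
        start = text.find(surface)
        while start != -1:
            labels[start] = f"B-{etype}"          # entity beginning
            for i in range(start + 1, start + len(surface)):
                labels[i] = f"I-{etype}"          # entity inside
            start = text.find(surface, start + 1)
    return labels
```

In a real pipeline the token list would come from `jieba` rather than being supplied by hand.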
Further, the specific steps of step S3 are as follows:
s3-1, forming a training set and a testing set for training an entity extraction model and a relationship extraction model according to the manually marked data;
S3-2, establishing an entity extraction model based on the bidirectional long short-term memory neural network and the conditional random field, and training it with its data set; establishing a relation extraction model based on an attention mechanism, and training it with its data set;
S3-3, evaluating the training effect of the entity extraction model by precision, recall, and F1 score; and evaluating the training effect of the relation extraction model by accuracy.
Further, in step S3-2, the entity extraction model is established as follows: the output dimension of the bidirectional long short-term memory (BiLSTM) layer equals the number of label types, so for each input w_i the network outputs a probability value p_ij for each candidate label j, giving the network output matrix P, i.e., the label probability values for every input. The conditional random field (CRF) then computes the labeling probability under conditional constraints. Let x be the input text sequence, y a predicted labeling sequence, and let y' range over all candidate labeling sequences; then

P(y|x) = exp(Score(x, y)) / Σ_{y'} exp(Score(x, y'))

where P(y|x) is the probability of the output P after the conditional random field constraint. The score is calculated by the following formula:

Score(x, y) = Σ_i ψ_i(x, y)

where ψ_i(x, y) is the feature score at position i (combining the emission probabilities from P with the label transition constraints).

When training the entity extraction model, the goal is to maximize the probability P(y|x); taking the log-likelihood gives

log P(y|x) = Score(x, y) − log Σ_{y'} exp(Score(x, y'))

The loss function is defined as −log P(y|x) and is minimized by an optimization algorithm, completing the training of the BiLSTM-CRF entity extraction model.
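The CRF probability P(y|x) above can be illustrated in Python by brute-force enumeration of every candidate label sequence y'. The `score` argument stands in for the patent's BiLSTM-CRF Score(x, y); real implementations use the forward algorithm rather than enumeration, which is feasible here only for short sequences:

```python
import itertools
import math

def crf_log_prob(score, x, y, labels):
    """log P(y|x) = Score(x, y) - log sum over all y' of exp(Score(x, y')).

    `score(x, y)` is any user-supplied scoring function, `labels` is the
    label inventory, and the partition term enumerates every candidate
    sequence y' of the same length as y.
    """
    log_z = math.log(sum(
        math.exp(score(x, y_prime))
        for y_prime in itertools.product(labels, repeat=len(y))))
    return score(x, y) - log_z
```

Training then minimizes the negative of this quantity, matching the loss −log P(y|x) defined above.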
Further, in step S3-2, when the relation extraction model is established, the vector representation of the text is first output by a bidirectional long short-term memory (BiLSTM) layer, and the relation is then classified by an attention mechanism layer to obtain the relations between entities, establishing the relation extraction model.

When the relation extraction model is trained, its input unit is the sentence. Given a sentence S containing T characters, S = {x_1, x_2, ..., x_T}, where x_i denotes each character, the output of the BiLSTM layer is H = {h_1, h_2, ..., h_T}. With a trainable parameter vector w of dimension d_w, where d_w is the word-embedding dimension, the attention layer satisfies:

M = tanh(H)
α = softmax(w^T M)
r = H α^T

where α is the attention weight coefficient and r is the result of the weighted sum of the BiLSTM outputs H.

A characterization vector is finally generated through a nonlinear function: h* = tanh(r).

The characterization vector h* is mapped onto the class-label vector through a fully connected network. For an input sentence S, the predicted relation classification probability is output through softmax, and the predicted label is obtained by argmax:

p(y|S) = softmax(W h* + b)
ŷ = argmax_y p(y|S)

where W and b are the parameter matrix and the bias, respectively.

The negative log-likelihood defines the loss function as:

J(θ) = −Σ_{i=1}^{m} t_i log(y_i) + λ ||θ||²

where t ∈ R^m is the one-hot ground-truth representation, y ∈ R^m is the estimated probability of each relation type output by softmax, λ is a regularization hyperparameter, and θ denotes the model parameters of the relation extraction model.

The loss function J(θ) is minimized by an optimization algorithm, completing the training of the relation extraction model.
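The attention-layer equations M = tanh(H), α = softmax(w^T M), r = Hα^T, h* = tanh(r) can be sketched with NumPy (assumed available); H and w would come from the trained BiLSTM, not be toy values as here:

```python
import numpy as np

def attention_pool(H, w):
    """Attention layer of the relation extraction model.

    H: (d, T) matrix of BiLSTM outputs, one column h_t per character.
    w: trainable (d,)-dimensional attention parameter vector.
    Returns the characterization vector h* = tanh(H @ alpha) and alpha.
    """
    M = np.tanh(H)                          # M = tanh(H), shape (d, T)
    scores = w @ M                          # w^T M, shape (T,)
    alpha = np.exp(scores - scores.max())   # numerically stable softmax
    alpha = alpha / alpha.sum()             # attention weights over positions
    r = H @ alpha                           # r = H alpha^T, weighted sum of h_t
    return np.tanh(r), alpha                # h* and the attention weights
```

The classification head p(y|S) = softmax(W h* + b) would follow as one further matrix multiply.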
Further, in step S4, knowledge fusion based on semantic similarity calculation proceeds as follows:

(1) Semantic similarity calculation: compute the similarity between concepts, attributes, and structural relations in the process knowledge using the Jaccard similarity coefficient, and classify the similarities to provide a basis for semantic space model fusion;

(2) Semantic space model fusion: according to the fusion operation rules, perform fusion operations on domain knowledge at different similarity levels, eliminating redundancy among similar knowledge and resolving conflicts and contradictions;

(3) Entity linking: link newly added domain knowledge to the existing graph using a graph-based joint linking model, compute the compatibility and dependency between entities, disambiguate the new knowledge according to the results, and merge it into the knowledge graph.
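The Jaccard similarity coefficient used in step (1) is, for two sets A and B, |A ∩ B| / |A ∪ B|; a minimal sketch:

```python
def jaccard(a, b):
    """Jaccard similarity coefficient |A ∩ B| / |A ∪ B| between two sets.

    Two identical sets score 1.0, disjoint sets 0.0; two empty sets are
    treated as identical (1.0) to avoid division by zero.
    """
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0
```

In the fusion step, concept or attribute term sets whose coefficient exceeds a chosen threshold would be merged as semantically redundant.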
Beneficial effects:
(1) The invention provides a domain knowledge extraction method for unstructured data, involving knowledge modeling and natural language processing technologies. First, domain knowledge concepts and relations are combed and a domain knowledge graph mode layer is established; then unstructured data are preprocessed, the data set is manually labeled, and training and test sets are created; next, the deep-learning-based named entity recognition model BiLSTM-CRF is trained on the data, its training effect is evaluated by indices such as precision, recall, and F1 score, and the attention-based relation extraction model is trained. During knowledge extraction, the trained model extracts entities from unstructured data, the attention-based relation extraction model extracts relations to form an entity-relation table, knowledge fusion based on semantic similarity is performed on all extracted entities and relations, and the resulting knowledge graph is stored in the graph database neo4j. Domain knowledge is highly specialized, has diverse carriers, and forms a complex system; the method suits the demands of product research, development, and manufacturing for domain knowledge, and establishing a complete system for acquiring, managing, and sharing domain knowledge can effectively improve product research and development efficiency.
(2) An entity extraction model is established based on the bidirectional long short-term memory neural network (BiLSTM) and the conditional random field (CRF) to realize entity extraction from unstructured data, and a relation extraction model is established based on an attention mechanism to realize relation extraction from unstructured data. The combination of the two models automatically extracts the process entities and relations in unstructured data, and training on large data sets yields high extraction accuracy.
(3) Establishing the entity extraction model with the bidirectional long short-term memory neural network (BiLSTM) and the conditional random field (CRF) avoids the gradient explosion problem that may occur in a conventional recurrent neural network (RNN) and accelerates training.
(4) The invention performs knowledge fusion based on semantic similarity, merging knowledge with identical or highly similar semantics among all extracted entities and relations; the semantic similarity calculation method adopted is simple and reliable.
Drawings
FIG. 1 is a schematic flow chart of an implementation of a domain knowledge extraction method for unstructured data.
FIG. 2 is a schematic diagram of the structure of the BiLSTM model.
FIG. 3 is a schematic diagram of the attention-based long short-term memory neural network model.
FIG. 4 is a schematic diagram of a semantic similarity calculation process.
FIG. 5 is a schematic diagram of a semantic space model fusion process.
Fig. 6 is a schematic diagram of the process knowledge graph mode layer established in example 2.
FIG. 7 is a BIO labeling diagram.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
Example 1:
the embodiment provides a domain knowledge extraction method for unstructured data, wherein the unstructured data refer to data which are irregular or incomplete in data structure, have no predefined data model and are inconvenient to express by a database two-dimensional logic table, and are mainly represented by text type data.
The extraction method comprises the following specific steps:
step S1, mode layer construction:
step S1-1, combing the domain concept and the relationship: combing the knowledge concepts and relations in the multi-scene domain according to the purpose of knowledge extraction;
s1-2, constructing a domain knowledge graph mode layer: defining a knowledge structure according to the domain knowledge concept entities and the relationship, and establishing a domain knowledge graph mode layer;
s2, performing data preprocessing on the unstructured data:
step S2-1, analyzing into a txt file: analyzing the unstructured data into a txt file by using a text analysis tool;
step S2-2, word segmentation: utilizing a Jieba word segmentation tool to segment words of the text file;
s2-3, removing stop words: performing stop word removal processing on the text after word segmentation;
step S2-4, manual labeling: manually labeling the text data based on a BIO labeling method;
step S3, model training is carried out:
step S3-1, training and testing set: forming a training set and a test set for training an entity extraction model and a relationship extraction model according to the manually marked data;
step S3-2, training the entity extraction model: establishing an entity extraction model based on the bidirectional long short-term memory neural network (BiLSTM) and the conditional random field (CRF), and training the model with the corresponding data set;
s3-3, evaluating an entity extraction model: evaluating the training effect of the entity extraction model according to the accuracy rate, the recall rate and the F1 value;
s3-4, training a relation extraction model: establishing a relation extraction model based on an attention mechanism, and training the model by using a corresponding data set;
s3-5, evaluating a relation extraction model: evaluating the training effect of the relation extraction model according to the accuracy rate;
wherein the order of steps S3-4 and S3-5 may be exchanged with that of steps S3-2 and S3-3;
s4, constructing a domain knowledge graph:
step S4-1, domain entity extraction: extracting the unstructured data to be processed with the trained entity extraction model to obtain domain entities;
step S4-2, domain entity table: storing the domain entities extracted by the entity extraction model in tabular form as a domain entity table;
step S4-3, entity-relation table: extracting relations with the trained relation extraction model and, on the basis of the domain entity table, obtaining an entity-relation table in which entities and relations correspond one to one;
s4-4, knowledge fusion: performing knowledge fusion based on semantic similarity according to all the entities and relations obtained by extraction, and combining the knowledge with the same or highly similar semantics;
step S4-5, knowledge graph: establishing a knowledge graph in the neo4j graph database according to the entity-relation table after knowledge fusion.
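Step S4-5 loads the fused entity-relation table into neo4j. A minimal sketch that only renders Cypher MERGE statements; actually executing them would require a neo4j driver session, which is not shown, and the `Entity` node label is an illustrative choice rather than the patent's schema:

```python
def triple_to_cypher(head, relation, tail, label="Entity"):
    """Render one (head, relation, tail) row of the entity-relation table
    as a Cypher MERGE statement. MERGE (rather than CREATE) keeps nodes
    and edges deduplicated, matching the knowledge-fusion goal."""
    return (
        f'MERGE (h:{label} {{name: "{head}"}}) '
        f'MERGE (t:{label} {{name: "{tail}"}}) '
        f'MERGE (h)-[:`{relation}`]->(t)'
    )
```

Each row of the fused entity-relation table would be passed through this function and the resulting statements run against the graph database.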
Example 2:
in this embodiment, on the basis of embodiment 1, a paper related to a diesel engine process is taken as an example to extract process knowledge, that is, the unstructured data is a paper related to a diesel engine process, and an implementation flow of the extraction method is shown in fig. 1; the method comprises the following specific implementation steps:
step S1, mode layer construction:
step S1-1, combing process concepts and relations: combing the multi-scenario process knowledge concepts and relations according to the purpose of knowledge extraction. The diesel engine process knowledge can be combed along three dimensions: the process ontology, the workpiece ontology, and the equipment ontology. The process ontology can be divided into machining, assembly, and casting; the workpiece ontology covers the constituent structures and parts of the diesel engine; and the equipment ontology covers the equipment used in processing;
s1-2, constructing a process knowledge map pattern layer: defining a knowledge structure according to the process knowledge concept entity and the relationship, and establishing a process knowledge graph mode layer;
in this embodiment, the specific method for establishing the process knowledge map pattern layer is as follows:
(1) Defining a process knowledge map application scene, and determining a process knowledge concept ontology;
(2) Determining the relationships between the process knowledge concept ontologies. For example, in the diesel engine process knowledge, the relationship between the process ontology and the workpiece ontology is "acts on", the relationship between the equipment ontology and the process ontology is "realizes", and the relationship between the equipment ontology and the workpiece ontology is "processes", as shown in fig. 6;
s2, performing data preprocessing on the unstructured data:
s2-1, analyzing into a txt file: analyzing the unstructured data into a txt file by using a text analysis tool;
step S2-2, word segmentation: utilizing a Jieba word segmentation tool to segment words of the text file;
step S2-3, removing stop words: performing stop word removal processing on the text after word segmentation;
step S2-4, manual labeling: manually labeling the text data based on the BIO labeling method, wherein process entities are labeled B-TEC and I-TEC, workpiece entities B-WOR and I-WOR, and equipment entities B-EQU and I-EQU, with all other characters labeled O; partial labeling results are shown in Table 1;
table 1 partial entity annotation results
The BIO labeling method can be replaced by the BIOES labeling method, in which B marks the beginning of an entity, I the inside, E the end, S a single-character entity, and O everything else. The labeling method is not unique: different labeling methods can be chosen for different entity extraction requirements without affecting model training.
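The relationship between the two labeling schemes can be sketched as a BIO-to-BIOES conversion; this simple version assumes well-formed BIO input and does not handle type changes between directly adjacent entities:

```python
def bio_to_bioes(tags):
    """Convert a BIO tag sequence to BIOES: single-token entities become
    S- tags, and the final token of a multi-token entity becomes E-."""
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag.startswith("B-"):
            # B- stays B- only if the entity continues; otherwise S-
            out.append(("B-" if nxt.startswith("I-") else "S-") + tag[2:])
        elif tag.startswith("I-"):
            # I- stays I- only if the entity continues; otherwise E-
            out.append(("I-" if nxt.startswith("I-") else "E-") + tag[2:])
        else:
            out.append(tag)
    return out
```

Because the conversion is deterministic, a corpus labeled in BIO can be retagged as BIOES without re-annotation.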
Step S3, model training is carried out:
step S3-1, training and testing set: forming a training set and a test set for entity extraction model training according to the manually marked data;
S3-2, training an entity extraction model: an entity extraction model is established based on a bidirectional long short-term memory neural network (BiLSTM) and a conditional random field (CRF), and the model is trained with the data set;
in this embodiment, building the entity extraction model on a BiLSTM and a CRF avoids the gradient explosion problem that may occur in a conventional recurrent neural network (RNN) and speeds up training; the specific method for establishing and training the model is as follows:
in an LSTM, interconnected memory cells replace the recurrent units of an ordinary RNN; besides the recurrent connections between memory cells, there is also recurrence inside each memory cell. The input to each memory cell is controlled by an input gate: if the input gate allows it, the value is accumulated into the cell state; a forget gate controls the weight of the previous state, and an output gate controls whether the output is emitted or suppressed;
(1) The input gate is updated as follows:
i_t = σ_g(W_i x_t + U_i h_{t-1} + b_i)

wherein i_t is the input gate at time t; W_i is the input weight matrix; U_i is the recurrent weight matrix of the input gate; b_i is the bias; the sigmoid activation function σ_g squashes W_i x_t + U_i h_{t-1} + b_i so that the output i_t lies between 0 and 1; x_t is the t-th character of the input variable (a sentence), and h_{t-1} is the hidden state of the LSTM at time t-1;
(2) The forgetting gate is updated in the following way:
f_t = σ_g(W_f x_t + U_f h_{t-1} + b_f)

wherein f_t is the forget gate at time t; W_f is the input weight matrix; U_f is the recurrent weight matrix of the forget gate; b_f is the bias; the sigmoid activation function σ_g squashes W_f x_t + U_f h_{t-1} + b_f so that the output f_t lies between 0 and 1;
(3) The output gate is updated as follows:
o_t = σ_g(W_o x_t + U_o h_{t-1} + b_o)

wherein o_t is the output gate at time t; W_o is the input weight matrix; U_o is the recurrent weight matrix of the output gate; b_o is the bias; the sigmoid activation function σ_g squashes W_o x_t + U_o h_{t-1} + b_o so that the output o_t lies between 0 and 1;
(4) The memory cell c_t is updated by:

c̃_t = σ_h(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

wherein c_t is the memory cell at time t, c_{t-1} is the memory cell at time t-1, and c̃_t is an intermediate candidate value; W_c is the input weight matrix; U_c is the recurrent weight matrix of the memory cell; b_c is the bias, and the candidate is produced through the tanh activation function σ_h. It can be seen that the forget gate f_t determines how much of the data from the previous memory cell is retained, while the input gate i_t determines how much of the current input enters the memory cell.
Hidden state h of LSTM t Co-determination by output gate and memory cell:
h_t = o_t ⊙ σ_h(c_t)
wherein h is t Is the hidden state of LSTM at time t;
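As an illustrative sketch of the gate and state updates above, one LSTM step can be written in scalar form. The toy weights and the parameter-dictionary layout are assumptions for illustration only, not trained values from the embodiment.

```python
import math

def sigmoid(z):
    """Sigmoid activation sigma_g, squashing to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step following the gate equations above.
    p holds toy weights W_*, U_* and biases b_* (hypothetical)."""
    i_t = sigmoid(p["W_i"] * x_t + p["U_i"] * h_prev + p["b_i"])  # input gate
    f_t = sigmoid(p["W_f"] * x_t + p["U_f"] * h_prev + p["b_f"])  # forget gate
    o_t = sigmoid(p["W_o"] * x_t + p["U_o"] * h_prev + p["b_o"])  # output gate
    c_tilde = math.tanh(p["W_c"] * x_t + p["U_c"] * h_prev + p["b_c"])
    c_t = f_t * c_prev + i_t * c_tilde   # memory-cell update
    h_t = o_t * math.tanh(c_t)           # hidden state h_t = o_t * sigma_h(c_t)
    return h_t, c_t

params = {k: 0.5 for k in
          ("W_i", "U_i", "b_i", "W_f", "U_f", "b_f",
           "W_o", "U_o", "b_o", "W_c", "U_c", "b_c")}
h, c = lstm_step(1.0, 0.0, 0.0, params)
```

A real implementation works on vectors and matrices, but each coordinate follows exactly this scalar recurrence.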
although its memory cells allow an LSTM to capture long-distance dependencies, a unidirectional LSTM propagates information forward only: the output at one step can be computed only from earlier steps. In named entity recognition, however, the input is the word-vector sequence of a sentence, and a named entity depends semantically on nearby words on both sides; recognising an entity is often influenced not only by the preceding words but also by the following ones. Since a unidirectional long short-term memory network cannot take the content after the current moment into account, a bidirectional long short-term memory network (BiLSTM) is adopted for named entity recognition; the model structure is shown in figs. 2-3.
The bidirectional long short-term memory network consists of an input layer, a forward hidden layer, a backward hidden layer and an output layer. The input layer receives the sequence data; the forward hidden layer computes forward features and the backward hidden layer computes backward features, so the forward pass remembers information before the current moment while the backward pass remembers information after it. Concatenating the outputs of the forward and backward hidden layers yields the bidirectional LSTM, i.e. the BiLSTM network.
Finally, the output is fed to a softmax layer to predict the named-entity label. For a named entity recognition task with k labels, label = {label_1, label_2, ..., label_k}, and an input sequence of length n, w = {w_1, w_2, ..., w_n}, the BiLSTM produces for each input w_t a score p_{t,j} for each label_j; the scores p_{t,j} over the n characters of the whole sequence form a matrix P, where a larger score means the corresponding label is closer to the true label.
In Chinese named entity recognition, an entity is usually composed of several Chinese characters, which are labeled with the BIO scheme, the same scheme used to label the training data: B marks the starting character of a named entity, I marks its middle and ending characters, and O marks non-entity characters. In the labeled example of fig. 7, entities fall into three categories, Workpiece, Equipment and Technique: B-Workpiece marks the first character of the workpiece entity 'engine cylinder block', and I-Workpiece marks its remaining characters; Equipment and Technique entities are labeled in the same way.
It can be seen that for the input of the chinese sequence, the output labels have certain constraints:
(1) The starting tag of an entity must be "B-"; a tag "I-" may only follow "B-" or another "I-", and "O" cannot appear immediately before "I-";
(2) The tag type within an entity must stay consistent; for example, "B-Workpiece" must be followed by "I-Workpiece", not "I-Equipment";
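The two constraints can be expressed as a small validity check over a tag sequence. This helper is illustrative only (the patent enforces the constraints through the CRF layer, not through such a rule-based filter):

```python
def valid_bio(tags):
    """Check the output constraints above: every 'I-X' must follow
    'B-X' or another 'I-X' of the same type; entities start with 'B-'."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            etype = tag[2:]
            if prev not in ("B-" + etype, "I-" + etype):
                return False          # 'O' before 'I-', or type mismatch
        prev = tag
    return True
```

Enumerating only the sequences this check accepts is effectively what the CRF's transition matrix learns to prefer.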
BiLSTM does not enforce these constraints, so this embodiment uses a conditional random field (CRF) to further constrain the output of the network and achieve higher accuracy. The conditional random field is a probabilistic graphical model; such models divide into directed graphical models, including Bayesian networks and hidden Markov models, and undirected graphical models, including conditional random fields.
Conditional random fields (CRF) are now widely applied in natural language processing; a CRF is a conditional probability distribution model that introduces feature functions on top of the hidden Markov model (HMM).
The transfer (transition) matrix in the CRF captures the association between output labels at adjacent moments, so this embodiment places a CRF layer after the BiLSTM layer: the BiLSTM layer extracts features from context and predicts the entity category of the input text, while the CRF layer scores the current output state and further constrains the output, improving prediction accuracy.
The output dimension of the BiLSTM layer equals the number of label types: for each input w_i the network outputs a probability value p_ij for each tag j, giving the network output P, i.e. the labeling probability value of every tag for every input. The CRF then calculates the labeling probability under the transition constraints. Let y be a predicted labeling sequence, x the text input sequence, and y' any candidate labeling sequence; then

P(y|x) = exp(Score(x, y)) / Σ_{y'} exp(Score(x, y'))

wherein P(y|x) is the probability value of the output after the constraint of the conditional random field; the Score can be calculated by the following formula:

Score(x, y) = Σ_i ψ_i(x, y)

wherein the feature terms ψ_i(x, y) combine the emission score p_{i,y_i} from P with the transition score between y_{i-1} and y_i from the transfer matrix. The goal of training the model is therefore to maximize the probability P(y|x), whose logarithm is obtained by log-likelihood:

log P(y|x) = Score(x, y) − log Σ_{y'} exp(Score(x, y'))
and defining a loss function as-log (P (y | x)), and optimizing the loss function-log (P (y | x)) through an optimization algorithm to realize the training of the entity extraction model BiLSTM-CRF.
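The normalisation P(y|x) above can be checked on a toy example by brute-force enumeration over all label sequences. The emission and transition scores below are invented for illustration; a real CRF computes the denominator with the forward algorithm rather than enumeration.

```python
import itertools
import math

def score(emissions, transitions, seq):
    """Score(x, y): sum of emission scores plus transition scores,
    matching the formula above."""
    s = sum(emissions[i][tag] for i, tag in enumerate(seq))
    s += sum(transitions[a][b] for a, b in zip(seq, seq[1:]))
    return s

def crf_prob(emissions, transitions, seq):
    """P(y|x) by brute-force normalisation over every label sequence."""
    n, k = len(emissions), len(emissions[0])
    z = sum(math.exp(score(emissions, transitions, y))
            for y in itertools.product(range(k), repeat=n))
    return math.exp(score(emissions, transitions, seq)) / z

emissions = [[1.0, 0.2], [0.1, 1.5]]       # toy BiLSTM scores P: 2 steps x 2 tags
transitions = [[0.5, -0.5], [-0.5, 0.5]]   # toy transfer matrix A
total = sum(crf_prob(emissions, transitions, y)
            for y in itertools.product(range(2), repeat=2))
```

Minimising −log P(y|x) for the gold sequence, as the embodiment does, pushes probability mass toward constraint-respecting labelings.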
Step S3-3, entity extraction model evaluation: the training effect of the entity extraction model is evaluated by the precision rate, recall rate and F1 value, where Precision = TP / (TP + FP), Recall = TP / (TP + FN) and F1 = 2 · Precision · Recall / (Precision + Recall), with TP, FP and FN the numbers of true-positive, false-positive and false-negative entity predictions.
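The three evaluation metrics can be computed directly from entity-level counts; a minimal sketch with hypothetical counts (not results from the embodiment):

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and
    false negative counts, per the standard definitions."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical test-set counts for illustration.
p, r, f = prf1(tp=8, fp=2, fn=2)
```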
S3-4, training a relation extraction model: establishing a relation extraction model based on an attention mechanism, and training the model by using a corresponding data set;
in this embodiment, an attention-based mechanism is adopted to establish a relationship extraction model, and the specific method for establishing and training the model is as follows:
A relation extraction model is then trained to extract the relations between entities from the unstructured data to be extracted. The relation extraction model first encodes the text into vector form through a BiLSTM layer, and then performs relation classification through an attention mechanism layer to obtain the relations between entities.
(1) Input and word embedding layer: the model input is one sentence per sample. The word embedding layer characterises the input sentence: given a sentence S containing T characters, S = {x_1, x_2, ..., x_T}, where x_i denotes each character.
(2) BiLSTM: the BiLSTM structure is the same as in step S3-2, and the LSTM unit can be represented by the following formulas:

c_t = i_t ⊙ g_t + f_t ⊙ c_{t-1}
h_t = o_t ⊙ tanh(c_t)

wherein g_t is the candidate input at time t. The model output includes a forward result h→_t and a backward result h←_t; splicing them, h_t = [h→_t ; h←_t], gives the final BiLSTM output.
(3) The attention structure: because the LSTM assigns the same "degree of influence" to the output at every time step, a weighting scheme is introduced in the relation classification to highlight the output results that matter most for the classification; the attention mechanism is essentially a weighted sum.
The model input is one sentence at a time, and the output of the BiLSTM layer is H = {h_1, h_2, ..., h_T}. With a trainable parameter vector w ∈ R^{d_w}, where R denotes the set of real numbers and d_w the word-embedding dimension, the attention layer satisfies:

M = tanh(H)
α = softmax(w^T M)
r = H α^T

wherein M is an intermediate quantity, α is the vector of attention weight coefficients, and r is the result of weighting and summing the LSTM outputs H; finally the characterisation vector h* = tanh(r) is generated through the nonlinear function.
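The attention computation above can be sketched in pure Python for a tiny H. The matrix values and dimensions are toy assumptions; a real implementation would use a tensor library.

```python
import math

def attention(H, w):
    """Attention over BiLSTM outputs per the formulas above:
    M = tanh(H), alpha = softmax(w^T M), r = H alpha^T, h* = tanh(r).
    H is a d_w x T matrix stored as a list of rows."""
    T = len(H[0])
    M = [[math.tanh(v) for v in row] for row in H]
    scores = [sum(w[d] * M[d][t] for d in range(len(w))) for t in range(T)]
    m = max(scores)                              # stabilised softmax
    exp = [math.exp(s - m) for s in scores]
    alpha = [e / sum(exp) for e in exp]          # attention weights, sum to 1
    r = [sum(alpha[t] * H[d][t] for t in range(T)) for d in range(len(H))]
    return alpha, [math.tanh(v) for v in r]      # h* = tanh(r)

H = [[0.1, 0.9, 0.3],    # toy 2 x 3 BiLSTM output
     [0.4, 0.2, 0.8]]
alpha, h_star = attention(H, w=[1.0, 0.5])
```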
(4) Loss function: the characterisation vector h* is mapped onto the class-label vector through a fully connected network; for an input sentence S, the predicted relation classification probability is output through softmax,

p̂(y|S) = softmax(W h* + b)

and the predicted label is obtained by argmax,

ŷ = argmax_y p̂(y|S)

wherein W and b are the parameter matrix and the bias, respectively.

The negative log-likelihood is used to define the loss function J(θ) as:

J(θ) = −Σ_{i=1}^{m} t_i log(y_i) + λ‖θ‖_F²

wherein t ∈ R^m is the one-hot representation of the true relation label, y ∈ R^m is the estimated probability of each relation type output by softmax, λ is a regularisation hyper-parameter, θ denotes the model parameters of the relation extraction model, including W and b, and ‖·‖_F is the Frobenius norm;
the loss function J (theta) is optimized through an optimization algorithm, and then the training of the relation extraction model can be achieved.
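The loss J(θ) can be evaluated for a toy prediction; the probabilities, one-hot target and parameter values below are illustrative numbers only, not taken from the embodiment.

```python
import math

def relation_loss(y_pred, t_onehot, params, lam):
    """Negative log-likelihood with L2 (Frobenius) regularisation,
    matching J(theta) above: -sum t_i log(y_i) + lam * ||theta||^2."""
    nll = -sum(t * math.log(y) for t, y in zip(t_onehot, y_pred) if t > 0)
    reg = lam * sum(p * p for p in params)
    return nll + reg

loss = relation_loss(y_pred=[0.7, 0.2, 0.1],   # softmax output y
                     t_onehot=[1, 0, 0],       # one-hot true label t
                     params=[0.5, -0.5],       # flattened theta (toy)
                     lam=0.01)
```

Minimising this quantity with a gradient-based optimizer is what "optimizing the loss function through an optimization algorithm" amounts to.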
S3-5, evaluating a relation extraction model: evaluating the training effect of the relation extraction model according to the accuracy rate;
wherein the order of steps S3-4 and S3-5 relative to steps S3-2 and S3-3 is interchangeable.
S4, establishing a process knowledge graph:
s4-1, extracting a process entity: extracting unstructured data to be extracted by using a trained entity extraction model to obtain a process entity;
step S4-2, process entity table: the process entities extracted by the entity extraction model are stored in tabular form as a process entity table; part of the process entity table is shown in table 2:
Table 2 Partial process entity table

| ID | Name | Label |
|---|---|---|
| 001 | Cylinder head | Workpiece |
| 002 | Oil sprayer sheath | Workpiece |
| 003 | Valve seat | Workpiece |
| 004 | Pressure test | Process |
Step S4-3, entity-relation table: the relations are extracted with the trained relation extraction model, and on the basis of the process entity table an entity-relation table is obtained in which entities and relations correspond one to one; a partial entity-relation table is shown in table 3;
TABLE 3 Partial entity-relation table

| Start_Name | Relation | End_Name |
|---|---|---|
| Pressure test | Acts on | Cylinder head |
| Overall cleaning | Acts on | Cylinder head |
| Press-fitting assembly station | Realizes | Sheath assembly |
S4-4, knowledge fusion: knowledge fusion is performed on all extracted entities and relations by a method based on semantic similarity calculation, merging knowledge whose semantics are identical or highly similar, with reference to figs. 4-5. The semantic similarity calculation can be replaced by other measures, such as the inner product method, the cosine method, the Dice coefficient method, and the like.
In this embodiment, a specific method for performing knowledge fusion by using a semantic similarity calculation method is as follows:
(1) Semantic similarity calculation: the similarity between concepts, attributes and structural relations in the process knowledge is calculated through the Jaccard similarity coefficient and classified, providing the basis for semantic space model fusion;
(2) And (3) semantic space model fusion: according to the fusion operation rule, performing fusion operation on the domain knowledge with different similarities, and eliminating similar redundancy or conflict contradiction between the domain knowledge;
(3) Entity linking: and (3) linking the newly added domain knowledge with the existing map by using a combined link model based on the map, calculating the compatibility and the dependency among entities, disambiguating the newly added knowledge according to the calculation result, and merging the newly added knowledge into the knowledge map.
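The Jaccard coefficient used in step (1) above is intersection size over union size. A minimal sketch on two hypothetical near-duplicate entity names, comparing their character sets:

```python
def jaccard(a, b):
    """Jaccard similarity coefficient between two collections,
    computed as |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Two hypothetical entity names sharing a common core (character sets).
sim = jaccard("气缸盖", "气缸盖总成")
```

Entities whose similarity exceeds a chosen threshold would be merged during fusion; the threshold itself is a design choice not fixed by the method.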
Step S4-5, knowledge graph: establishing a knowledge graph in a neo4j graph database according to the entity-relation table after knowledge fusion; after the process knowledge graph is constructed, process designers can design the process by using the process knowledge graph, and can upload new knowledge on the basis of the process knowledge graph to realize the updating and sharing of the knowledge.
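The final step turns the fused entity-relation table into a graph; in the embodiment this is done in a neo4j graph database, but the mapping from table rows to nodes and edges can be sketched in memory. The rows below are hypothetical examples in the style of Table 3.

```python
def build_graph(rows):
    """Build an adjacency-list graph from (start, relation, end)
    rows of the entity-relation table. In practice each row would
    become a Cypher CREATE/MERGE statement in neo4j; this is an
    in-memory sketch only."""
    graph = {}
    for start, rel, end in rows:
        graph.setdefault(start, []).append((rel, end))  # outgoing edge
        graph.setdefault(end, [])                       # ensure node exists
    return graph

rows = [("Pressure test", "Acts on", "Cylinder head"),
        ("Press-fitting assembly station", "Realizes", "Sheath assembly")]
g = build_graph(rows)
```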
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (7)
1. A domain knowledge extraction method oriented to unstructured data, the unstructured data referring to data whose structure is irregular or incomplete, which has no predefined data model and is inconvenient to express with the two-dimensional logic table of a database;
the extraction method is characterized by comprising the following specific steps:
S1, combing the domain knowledge concept entities and relationships to establish a domain knowledge graph mode layer;
s2, preprocessing unstructured data to obtain manually marked text data;
s3, establishing an entity extraction model based on the bidirectional long-short time memory neural network and the conditional random field, establishing a relation extraction model based on an attention mechanism, and training the entity extraction model and the relation extraction model respectively by using corresponding data sets;
s4, extracting unstructured data to be extracted by using the trained entity extraction model to obtain a field entity, and storing the field entity in a table form as a field entity table; extracting the relationship by using the trained relationship extraction model, and obtaining an entity-relationship table in which the entities and the relationships are in one-to-one correspondence on the basis of the field entity table;
and performing knowledge fusion based on semantic similarity according to all the extracted entities and relations to obtain an entity-relation table after the knowledge fusion, and establishing a knowledge graph in the neo4j graph database according to the entity-relation table.
2. The method for extracting the domain knowledge of the unstructured data according to claim 1, wherein the step S1 comprises the following steps:
s1-1, combing the knowledge concepts and relations in the multi-scene field according to the purpose of knowledge extraction;
and S1-2, defining a knowledge structure according to the domain knowledge concept entity and the relationship, and establishing a domain knowledge graph mode layer.
3. The method for extracting domain knowledge of unstructured data according to claim 1, wherein the step S2 comprises the following steps:
s2-1, analyzing unstructured data into a txt file by using a text analysis tool;
s2-2, performing word segmentation on the text file by using a Jieba word segmentation tool;
s2-3, performing stop word removal processing on the text after word segmentation;
and S2-4, manually labeling the text data based on a BIO labeling method or a BIOES labeling method.
4. The method for extracting the domain knowledge of the unstructured data according to any one of claims 1 to 3, wherein the specific steps of the step S3 are as follows:
s3-1, forming a training set and a test set for training an entity extraction model and a relation extraction model according to manually marked data;
s3-2, establishing an entity extraction model based on the bidirectional long-short time memory neural network and the conditional random field, and training the model by using a corresponding data set; establishing a relation extraction model based on an attention mechanism, and training the model by using a corresponding data set;
s3-3, evaluating the training effect of the entity extraction model according to the accuracy rate, the recall rate and the F1 value; and evaluating the training effect of the relation extraction model according to the accuracy rate.
5. The method for extracting domain knowledge from unstructured data according to claim 4, wherein in step S3-2, when building the entity extraction model: the output dimension of the BiLSTM layer of the bidirectional long short-term memory neural network BiLSTM equals the number of label types, and for each input w_i the network outputs a probability value p_ij for its corresponding tag j, finally obtaining the network output P, i.e. the labeling probability value corresponding to each tag for each input; the conditional random field CRF calculates the labeling probability value under the condition constraint, and if y is the predicted labeling sequence, x is the text input sequence, and y' is any candidate labeling sequence, then

P(y|x) = exp(Score(x, y)) / Σ_{y'} exp(Score(x, y'))

wherein P(y|x) is the probability value of the output P after the constraint of the conditional random field; the Score can be calculated by the following formula:

Score(x, y) = Σ_i ψ_i(x, y)

wherein ψ_i(x, y) is a feature vector;

when the entity extraction model is trained, the aim is to maximize the probability P(y|x), whose logarithm is obtained through log-likelihood:

log P(y|x) = Score(x, y) − log Σ_{y'} exp(Score(x, y'))
and defining a loss function as-log (P (y | x)), and optimizing the loss function-log (P (y | x)) through an optimization algorithm to realize the training of the entity extraction model BiLSTM-CRF.
6. The method for extracting domain knowledge of unstructured data according to claim 4, wherein in step S3-2,
when the relation extraction model is established, the text is first output in vector form through a BiLSTM layer of the bidirectional long short-term memory neural network BiLSTM, and the relations are then classified through an attention mechanism layer to obtain the relations between entities, thereby establishing the relation extraction model;
when the relation extraction model is trained, the input is one sentence at a time; given a sentence S containing T characters, S = {x_1, x_2, ..., x_T}, where x_i denotes each character, the output through the BiLSTM layer is H = {h_1, h_2, ..., h_T}; with a trainable matrix parameter w ∈ R^{d_w}, where d_w denotes the word-embedding dimension, the following are satisfied:
M=tanh(H)
α=softmax(w T M)
r=Hα T
wherein α is the attention weight coefficient, and r is the result of weighting and summing the output H of the BiLSTM layer;
finally generating a characterization vector h through a nonlinear function * =tanh(r);
the characterisation vector h* is mapped onto the class-label vector through a fully connected network; for an input sentence S, the predicted relation classification probability is output through softmax, p̂(y|S) = softmax(W h* + b), and the predicted label is obtained by argmax, ŷ = argmax_y p̂(y|S);
wherein W and b are the parameter matrix and the bias, respectively;
the negative log-likelihood is used to define the loss function as:

J(θ) = −Σ_{i=1}^{m} t_i log(y_i) + λ‖θ‖_F²

wherein t ∈ R^m is the one-hot representation, y ∈ R^m is the estimated probability of each relation type output by softmax, λ is a regularisation hyper-parameter, and θ denotes the model parameters of the relation extraction model;
the loss function J (theta) is optimized through an optimization algorithm, and then the training of the relation extraction model can be achieved.
7. The method for extracting the domain knowledge of the unstructured data according to any one of claims 1 to 3, wherein in step S4, a method based on semantic similarity calculation is adopted to perform knowledge fusion as follows:
(1) Semantic similarity calculation: calculating the similarity among concepts, attributes and structural relations in the process knowledge through the Jaccard similarity coefficient, classifying the similarities and providing a basis for the fusion of semantic space models;
(2) And (3) semantic space model fusion: according to the fusion operation rule, performing fusion operation on the domain knowledge with different similarities, and eliminating similar redundancy or conflict contradiction between the domain knowledge;
(3) Entity linking: and linking the newly added domain knowledge with the existing map by using a combined link model based on the map, calculating the compatibility and the dependency among entities, disambiguating the newly added knowledge according to the calculation result, and integrating the newly added knowledge into the knowledge map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211259591.5A CN115510245B (en) | 2022-10-14 | 2022-10-14 | Unstructured data-oriented domain knowledge extraction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115510245A true CN115510245A (en) | 2022-12-23 |
CN115510245B CN115510245B (en) | 2024-05-14 |
Family
ID=84510722
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211259591.5A Active CN115510245B (en) | 2022-10-14 | 2022-10-14 | Unstructured data-oriented domain knowledge extraction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115510245B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117033527A (en) * | 2023-10-09 | 2023-11-10 | 之江实验室 | Knowledge graph construction method and device, storage medium and electronic equipment |
CN117909492A (en) * | 2024-03-19 | 2024-04-19 | 国网山东省电力公司信息通信公司 | Method, system, equipment and medium for extracting unstructured information of power grid |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109800411A (en) * | 2018-12-03 | 2019-05-24 | 哈尔滨工业大学(深圳) | Clinical treatment entity and its attribute extraction method |
CN110598000A (en) * | 2019-08-01 | 2019-12-20 | 达而观信息科技(上海)有限公司 | Relationship extraction and knowledge graph construction method based on deep learning model |
CN110795543A (en) * | 2019-09-03 | 2020-02-14 | 腾讯科技(深圳)有限公司 | Unstructured data extraction method and device based on deep learning and storage medium |
CN110866121A (en) * | 2019-09-26 | 2020-03-06 | 中国电力科学研究院有限公司 | Knowledge graph construction method for power field |
US20200097601A1 (en) * | 2018-09-26 | 2020-03-26 | Accenture Global Solutions Limited | Identification of an entity representation in unstructured data |
KR102223382B1 (en) * | 2019-11-14 | 2021-03-08 | 숭실대학교산학협력단 | Method and apparatus for complementing knowledge based on multi-type entity |
WO2021082366A1 (en) * | 2019-10-28 | 2021-05-06 | 南京师范大学 | Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus |
WO2021196520A1 (en) * | 2020-03-30 | 2021-10-07 | 西安交通大学 | Tax field-oriented knowledge map construction method and system |
WO2021212682A1 (en) * | 2020-04-21 | 2021-10-28 | 平安国际智慧城市科技股份有限公司 | Knowledge extraction method, apparatus, electronic device, and storage medium |
CN114077673A (en) * | 2021-06-21 | 2022-02-22 | 南京邮电大学 | Knowledge graph construction method based on BTBC model |
CN114911945A (en) * | 2022-04-13 | 2022-08-16 | 浙江大学 | Knowledge graph-based multi-value chain data management auxiliary decision model construction method |
CN114925689A (en) * | 2022-05-24 | 2022-08-19 | 淮阴工学院 | Medical text classification method and device based on BI-LSTM-MHSA |
CN115062109A (en) * | 2022-06-16 | 2022-09-16 | 沈阳航空航天大学 | Entity-to-attention mechanism-based entity relationship joint extraction method |
CN115168606A (en) * | 2022-07-01 | 2022-10-11 | 北京理工大学 | Mapping template knowledge extraction method for semi-structured process data |
Non-Patent Citations (2)
Title |
---|
Yang Xiuzhang et al.: "Entity recognition and alignment of APT attacks based on BERT and BiLSTM-CRF", Journal on Communications, 21 June 2022 (2022-06-21) *
Huang Peixin; Zhao Xiang; Fang Yang; Zhu Huiming; Xiao Weidong: "End-to-end joint extraction of knowledge triples with adversarial training", Journal of Computer Research and Development, no. 12, 15 December 2019 (2019-12-15) *
Also Published As
Publication number | Publication date |
---|---|
CN115510245B (en) | 2024-05-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107992597B (en) | Text structuring method for power grid fault case | |
CN110598005B (en) | Public safety event-oriented multi-source heterogeneous data knowledge graph construction method | |
CN104699763B (en) | The text similarity gauging system of multiple features fusion | |
CN115510245B (en) | Unstructured data-oriented domain knowledge extraction method | |
CN111626063A (en) | Text intention identification method and system based on projection gradient descent and label smoothing | |
CN109376242A (en) | Text classification algorithm based on Recognition with Recurrent Neural Network variant and convolutional neural networks | |
CN109726745B (en) | Target-based emotion classification method integrating description knowledge | |
CN108875809A (en) | The biomedical entity relationship classification method of joint attention mechanism and neural network | |
CN108733647B (en) | Word vector generation method based on Gaussian distribution | |
CN113255366B (en) | Aspect-level text emotion analysis method based on heterogeneous graph neural network | |
CN113705238B (en) | Method and system for analyzing aspect level emotion based on BERT and aspect feature positioning model | |
CN112925904B (en) | Lightweight text classification method based on Tucker decomposition | |
CN116245107B (en) | Electric power audit text entity identification method, device, equipment and storage medium | |
CN116521882A (en) | Domain length text classification method and system based on knowledge graph | |
She et al. | Joint learning with BERT-GCN and multi-attention for event text classification and event assignment | |
CN111984791A (en) | Long text classification method based on attention mechanism | |
CN117171333A (en) | Electric power file question-answering type intelligent retrieval method and system | |
CN113590827B (en) | Scientific research project text classification device and method based on multiple angles | |
CN116522165B (en) | Public opinion text matching system and method based on twin structure | |
CN116701665A (en) | Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method | |
CN114357166B (en) | Text classification method based on deep learning | |
CN113064967B (en) | Complaint reporting credibility analysis method based on deep migration network | |
Wang et al. | Sentiment analysis based on attention mechanisms and bi-directional LSTM fusion model | |
Wang et al. | Event extraction via dmcnn in open domain public sentiment information | |
Baes et al. | RDF triples extraction from company web pages: comparison of state-of-the-art Deep Models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||