CN115510245A - Unstructured data oriented domain knowledge extraction method


Info

Publication number
CN115510245A
Authority
CN
China
Prior art keywords
entity
knowledge
extraction model
relation
unstructured data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211259591.5A
Other languages
Chinese (zh)
Other versions
CN115510245B (en)
Inventor
王儒
孙延劭
华益威
魏竹琴
王国新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology (BIT)
Priority to CN202211259591.5A
Publication of CN115510245A
Application granted
Publication of CN115510245B
Legal status: Active


Classifications

    • G06F 16/367: Ontology (creation of semantic tools, e.g. ontology or thesauri, for information retrieval from unstructured textual data)
    • G06F 16/335: Filtering based on additional data, e.g. user or group profiles (querying of unstructured textual data)
    • G06F 16/355: Class or cluster creation or modification (clustering; classification of unstructured textual data)
    • G06F 40/216: Parsing using statistical methods (natural language analysis)
    • G06F 40/295: Named entity recognition (recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking)
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs (computing arrangements based on biological models)
    • G06N 3/08: Learning methods (neural networks)
    • Y02P 90/30: Computing systems specially adapted for manufacturing (enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a domain knowledge extraction method oriented to unstructured data, which comprises the following steps: establishing an entity extraction model based on a bidirectional long short-term memory (BiLSTM) neural network and a conditional random field (CRF), establishing a relation extraction model based on an attention mechanism, and training the two models separately; extracting domain entities from the unstructured data to be processed with the trained entity extraction model and storing them in table form as a domain entity table; extracting relations with the trained relation extraction model and, on the basis of the domain entity table, obtaining an entity-relation table; performing knowledge fusion based on semantic similarity over all extracted entities and relations to obtain a fused entity-relation table, and building a knowledge graph in a neo4j graph database. The invention addresses the problems that domain knowledge acquisition is currently largely manual, management efficiency is low, and domain knowledge systems are incomplete, and realizes knowledge extraction from unstructured data.

Description

Unstructured data oriented domain knowledge extraction method
Technical Field
The invention belongs to the technical field of knowledge extraction, and particularly relates to a domain knowledge extraction method oriented to unstructured data.
Background
Domain knowledge is highly specialized, carried in diverse forms, and organized into complex knowledge systems. Against the background of intelligent manufacturing, product research and development place increasingly urgent demands on domain knowledge; establishing a complete system for acquiring, managing, and sharing domain knowledge effectively improves product R&D efficiency, and a domain knowledge graph is the key to realizing this goal. A knowledge graph is essentially a large-scale semantic network that describes real-world concepts and events as entities, with edges representing the interrelationships between them. The core of a knowledge graph is the triple composed of entities, attributes, and relations; structurally it divides into a schema layer and a data layer, where the schema layer consists of concept ontologies and relations and describes the structure of the knowledge graph, and the data layer is the instantiated knowledge graph constructed from concrete data under the guidance of the schema layer.
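To make this two-layer triple structure concrete, a minimal Python sketch follows; the Triple class and the example values are illustrative assumptions rather than part of the invention (the instance echoes the entity-relation table shown later in Table 3).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    head: str      # entity (or a concept ontology at the schema layer)
    relation: str  # relation or attribute name
    tail: str      # entity (or a concept ontology / attribute value)

# Schema layer: a concept-level triple that constrains what instances may look like.
schema = Triple("Equipment", "realizes", "Process")

# Data layer: an instantiated triple built under the guidance of the schema layer.
instance = Triple("press-mounting assembly station", "realizes", "sheath assembly")
```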
A domain knowledge graph is an important means of managing domain knowledge and relations; through it, the various kinds of knowledge in a domain can be managed uniformly, so the construction process of the knowledge graph matters. First, the data sources for the graph must be identified. In knowledge graph construction, data sources divide into structured, semi-structured, and unstructured data; extraction from structured and semi-structured data is mature, while extraction from unstructured data is still developing. In practical applications, knowledge graph construction still relies mainly on manual work, and automatic construction mainly targets structured and semi-structured data. Process-oriented fields therefore need an automatic knowledge extraction method for unstructured data, which helps manage multi-source heterogeneous knowledge in complex domains and supports domain design and decision-making.
The method for extracting knowledge from unstructured data can be decomposed into two parts: entity extraction and relation extraction.
In entity extraction, the development of natural language processing (NLP) has produced many entity recognition algorithms based on deep learning. The recurrent neural network (RNN) is a neural network for processing sequence data and suits unstructured data consisting mainly of text; on this basis, the long short-term memory network (LSTM) was developed to mitigate the gradient explosion problem, the bidirectional LSTM (BiLSTM) was developed to exploit context in both directions and speed up training, and a conditional random field (CRF) defining the loss function was added to further improve extraction precision.
For relation extraction, current methods include the pipeline method and end-to-end (end2end) methods. The former first uses an entity extractor to identify the entities in a sentence, then pairs the extracted entities and feeds each pair together with the original sentence to a relation recognizer that identifies the relation between the two entities; the latter, also called end-to-end relation extraction, processes each sentence and extracts the triples directly. With the development of deep learning, relation extraction models based on convolutional neural networks (CNN) and attention mechanisms have appeared in this field.
However, the entity and relation extraction methods above are currently applied mostly to general knowledge, which is characterized by wide coverage and large data volume; general-domain knowledge graphs are therefore usually constructed bottom-up, extracting the entities and relations of the graph from large amounts of data. Domain knowledge differs from general knowledge: it values the specialization of knowledge and thus requires a more rigorous structure. A domain knowledge graph must be constructed top-down: the schema layer is designed first, and the information belonging to the domain knowledge is determined according to it. At present, however, domain knowledge graph construction is still mainly manual, management efficiency is low, the processed data are mostly structured or semi-structured, and a systematic method for knowledge extraction from unstructured data is still lacking.
Disclosure of Invention
In view of the above, the invention provides a domain knowledge extraction method oriented to unstructured data, which addresses the problems that domain knowledge acquisition is currently largely manual, management efficiency is low, and domain knowledge systems are incomplete, and realizes knowledge extraction from unstructured data.
The invention is realized by the following technical scheme:
a field knowledge extraction method facing unstructured data refers to data which are irregular or incomplete in data structure, free of predefined data models and inconvenient to express by a database two-dimensional logic table;
the extraction method comprises the following specific steps:
s1, combing a domain knowledge concept entity and relationship combing to establish a domain knowledge graph mode layer;
s2, preprocessing unstructured data to obtain manually marked text data;
s3, establishing an entity extraction model based on the bidirectional long-short time memory neural network and the conditional random field, establishing a relation extraction model based on an attention mechanism, and training the entity extraction model and the relation extraction model respectively by using corresponding data sets;
s4, extracting unstructured data to be extracted by using the trained entity extraction model to obtain a field entity, and storing the field entity in a table form as a field entity table; extracting the relationship by using the trained relationship extraction model, and obtaining an entity-relationship table with one-to-one correspondence of the entities and the relationships on the basis of the field entity table;
and performing knowledge fusion based on semantic similarity according to all the extracted entities and relations to obtain an entity-relation table after knowledge fusion, and establishing a knowledge graph in the neo4j graph database according to the entity-relation table.
Further, the specific steps of step S1 are as follows:
S1-1, sorting out the knowledge concepts and relations of the multi-scenario domain according to the purpose of knowledge extraction;
S1-2, defining a knowledge structure according to the domain knowledge concept entities and relations, and establishing the domain knowledge graph schema layer.
Further, the specific steps of step S2 are as follows:
S2-1, parsing the unstructured data into a txt file with a text parsing tool;
S2-2, segmenting the text file into words with the Jieba word segmentation tool;
S2-3, removing stop words from the segmented text;
S2-4, manually labeling the text data with the BIO or BIOES labeling scheme.
Further, the specific steps of step S3 are as follows:
S3-1, forming a training set and a test set for training the entity extraction model and the relation extraction model from the manually labeled data;
S3-2, establishing an entity extraction model based on the bidirectional long short-term memory network and the conditional random field, and training it with the corresponding data set; establishing an attention-based relation extraction model, and training it with the corresponding data set;
S3-3, evaluating the training effect of the entity extraction model by precision, recall, and F1 value, and evaluating the training effect of the relation extraction model by accuracy.
Further, in step S3-2, when the entity extraction model is established: the output dimension of the BiLSTM layer of the bidirectional long short-term memory network equals the number of tag types; for each input $w_i$, the network outputs a probability value $p_{ij}$ for each candidate tag $j$, finally yielding the network output $P$, i.e. the tag probability values corresponding to each input. The conditional random field (CRF) computes the labeling probability under conditional constraints. Let $y$ be the predicted tag sequence, $x$ the text input sequence, and $y'$ any candidate tag sequence; then

$$P(y \mid x) = \frac{\exp\big(\mathrm{Score}(x, y)\big)}{\sum_{y'} \exp\big(\mathrm{Score}(x, y')\big)}$$

where $P(y \mid x)$ is the probability of the output $P$ after the conditional random field constraint; the score is computed as

$$\mathrm{Score}(x, y) = \sum_i \mathbf{w}^{\top} \boldsymbol{\psi}_i(x, y)$$

where $\boldsymbol{\psi}_i(x, y)$ is a feature vector.

When the entity extraction model is trained, the goal is to maximize the probability $P(y \mid x)$, obtained through the log-likelihood

$$\log P(y \mid x) = \mathrm{Score}(x, y) - \log \sum_{y'} \exp\big(\mathrm{Score}(x, y')\big)$$

The loss function is defined as $-\log P(y \mid x)$ and optimized with an optimization algorithm, which realizes the training of the entity extraction model BiLSTM-CRF.
Further, in step S3-2,
when the relation extraction model is established, the text is first encoded into vector form by the BiLSTM layer of the bidirectional long short-term memory network, and the relation is then classified by an attention layer to obtain the relations between entities, establishing the relation extraction model;
when the relation extraction model is trained, the input is sentence by sentence. Given a sentence $S$ containing $T$ characters, $S = \{x_1, x_2, \ldots, x_T\}$, where $x_i$ denotes each character, the output of the BiLSTM layer is $H = \{h_1, h_2, \ldots, h_T\}$. With a trainable parameter vector $\mathbf{w} \in \mathbb{R}^{d_w}$, where $d_w$ denotes the word embedding dimension, the attention layer satisfies

$$M = \tanh(H)$$

$$\alpha = \mathrm{softmax}(\mathbf{w}^{\top} M)$$

$$r = H \alpha^{\top}$$

where $\alpha$ is the attention weight vector and $r$ is the result of weighting and summing the BiLSTM outputs $H$;
the representation vector $h^* = \tanh(r)$ is finally generated through a nonlinear function;
the representation vector $h^*$ is mapped onto the class label vector through a fully connected network; for an input sentence $S$, the predicted relation classification probability is output through softmax and the predicted label $\hat{y}$ is obtained by argmax:

$$p(y \mid S) = \mathrm{softmax}\big(W h^* + b\big)$$

$$\hat{y} = \arg\max_{y} \, p(y \mid S)$$

where $W$ and $b$ are the parameter matrix and bias, respectively;
the negative log-likelihood defines the loss function as

$$J(\theta) = -\sum_{j=1}^{m} t_j \log y_j + \lambda \lVert \theta \rVert_F^2$$

where $t \in \mathbb{R}^m$ is the one-hot representation, $y \in \mathbb{R}^m$ is the estimated probability of each relation type output by softmax, $\lambda$ is a regularization hyperparameter, $\theta$ denotes the model parameters of the relation extraction model, and $\lVert \cdot \rVert_F$ is the Frobenius norm;
optimizing the loss function $J(\theta)$ with an optimization algorithm realizes the training of the relation extraction model.
Further, in step S4, the specific method of knowledge fusion based on semantic similarity calculation is as follows:
(1) Semantic similarity calculation: compute the similarity among the concepts, attributes, and structural relations in the process knowledge through the Jaccard similarity coefficient, and classify the similarities to provide a basis for semantic space model fusion;
(2) Semantic space model fusion: according to the fusion operation rules, perform fusion operations on domain knowledge of different similarity levels, eliminating redundancy among similar knowledge and resolving conflicts and contradictions;
(3) Entity linking: link the newly added domain knowledge to the existing graph with a graph-based joint linking model, compute the compatibility and dependency among entities, disambiguate the new knowledge according to the results, and merge it into the knowledge graph.
Beneficial effects:
(1) The invention provides a domain knowledge extraction method for unstructured data involving knowledge modeling and natural language processing. It first sorts out the concepts and relations of the domain knowledge and establishes the schema layer of the domain knowledge graph; it then preprocesses the unstructured data, manually labels a data set, and creates training and test sets; it next trains the deep-learning named entity recognition model BiLSTM-CRF, evaluating the training effect by precision, recall, F1 value, and similar indicators, and then trains an attention-based relation extraction model. At extraction time, the trained entity model extracts entities from unstructured data and the attention-based relation model extracts relations to form an entity-relation table; knowledge fusion based on semantic similarity is performed on all extracted entities and relations; and the resulting knowledge graph is stored with the graph database neo4j. Domain knowledge is highly specialized, carried in diverse forms, and organized in complex systems; the method suits the demands that product research, development, and manufacturing place on domain knowledge, and establishing a complete system for acquiring, managing, and sharing domain knowledge can effectively improve product R&D efficiency.
(2) The method establishes an entity extraction model based on a bidirectional long short-term memory network (BiLSTM) and a conditional random field (CRF) to realize entity extraction from unstructured data, and an attention-based relation extraction model to realize relation extraction from unstructured data. Combining the two models automates the extraction of process entities and relations from unstructured data, and training on large data sets yields high extraction accuracy.
(3) Building the entity extraction model on a bidirectional long short-term memory network (BiLSTM) and a conditional random field (CRF) avoids the gradient explosion problem that can occur in a conventional recurrent neural network (RNN) and accelerates training.
(4) The invention performs knowledge fusion based on semantic similarity, merging knowledge with identical or highly similar semantics across all extracted entities and relations; the adopted semantic similarity calculation method is simple and reliable.
Drawings
FIG. 1 is a schematic flow chart of the implementation of the domain knowledge extraction method for unstructured data.
FIG. 2 is a schematic diagram of the structure of the BiLSTM model.
FIG. 3 is a schematic diagram of the attention-based long short-term memory network model.
FIG. 4 is a schematic diagram of the semantic similarity calculation process.
FIG. 5 is a schematic diagram of the semantic space model fusion process.
FIG. 6 is a schematic diagram of the process knowledge graph schema layer established in Example 2.
FIG. 7 is a BIO labeling diagram.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
Example 1:
the embodiment provides a domain knowledge extraction method for unstructured data, wherein the unstructured data refer to data which are irregular or incomplete in data structure, have no predefined data model and are inconvenient to express by a database two-dimensional logic table, and are mainly represented by text type data.
The extraction method comprises the following specific steps:
step S1, mode layer construction:
step S1-1, combing the domain concept and the relationship: combing the knowledge concepts and relations in the multi-scene domain according to the purpose of knowledge extraction;
s1-2, constructing a domain knowledge graph mode layer: defining a knowledge structure according to the domain knowledge concept entities and the relationship, and establishing a domain knowledge graph mode layer;
s2, performing data preprocessing on the unstructured data:
step S2-1, analyzing into a txt file: analyzing the unstructured data into a txt file by using a text analysis tool;
step S2-2, word segmentation: utilizing a Jieba word segmentation tool to segment words of the text file;
s2-3, removing stop words: performing stop word removal processing on the text after word segmentation;
step S2-4, manual labeling: manually labeling the text data based on a BIO labeling method;
step S3, model training is carried out:
step S3-1, training and testing set: forming a training set and a test set for training an entity extraction model and a relationship extraction model according to the manually marked data;
step S3-2, training an entity extraction model: establishing an entity extraction model based on a bidirectional long-and-short time memory neural network (BilSTM) and a Conditional Random Field (CRF), and training the model by using a corresponding data set;
s3-3, evaluating an entity extraction model: evaluating the training effect of the entity extraction model according to the accuracy rate, the recall rate and the F1 value;
s3-4, training a relation extraction model: establishing a relation extraction model based on an attention mechanism, and training the model by using a corresponding data set;
s3-5, evaluating a relation extraction model: evaluating the training effect of the relation extraction model according to the accuracy rate;
wherein steps S3-4 and S3-5 are interchangeable with steps S3-2 and S3-3;
s4, constructing a domain knowledge graph:
step S4-1, field entity extraction: extracting unstructured data to be extracted by using a trained entity extraction model to obtain a field entity;
step S4-2, field entity table: storing the field entities extracted according to the entity extraction model in a table form as a field entity table;
step S4-3, entity relation table: extracting the relationship by using the trained relationship extraction model, and obtaining an entity-relationship table with one-to-one correspondence of the entities and the relationship on the basis of the field entity table;
s4-4, knowledge fusion: performing knowledge fusion based on semantic similarity according to all the entities and relations obtained by extraction, and combining the knowledge with the same or highly similar semantics;
step S4-5, knowledge graph: and establishing a knowledge graph in the neo4j graph database according to the entity-relation table after knowledge fusion.
Example 2:
in this embodiment, on the basis of embodiment 1, a paper related to a diesel engine process is taken as an example to extract process knowledge, that is, the unstructured data is a paper related to a diesel engine process, and an implementation flow of the extraction method is shown in fig. 1; the method comprises the following specific implementation steps:
step S1, mode layer construction:
step S1-1, combing process concepts and relations: combing the multi-scene process knowledge concepts and relations according to the purpose of knowledge extraction; the technological knowledge of the diesel engine can be combed according to three dimensions of a technological body, a workpiece body and an equipment body, the technological body can be divided into machining, assembling and casting, the workpiece body is each constituent structure and parts of the diesel engine, and the equipment body is each equipment used in processing;
s1-2, constructing a process knowledge map pattern layer: defining a knowledge structure according to the process knowledge concept entity and the relationship, and establishing a process knowledge graph mode layer;
in this embodiment, the specific method for establishing the process knowledge map pattern layer is as follows:
(1) Defining a process knowledge map application scene, and determining a process knowledge concept ontology;
(2) Determining the relationship between the process knowledge concept bodies, for example, in the diesel engine process knowledge, the relationship between the process body and the workpiece body is 'acting', the relationship between the equipment body and the process body is 'realizing', and the relationship between the equipment body and the workpiece body is 'processing', as shown in fig. 6;
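The schema layer of FIG. 6 can be written down directly as concept-level triples; the sketch below is illustrative, with English glosses for the ontology and relation names.

```python
# Schema layer of the diesel engine process knowledge graph (per FIG. 6);
# names are English glosses of the ontologies and relations in the text.
schema_layer = [
    ("Process ontology",   "acts on",   "Workpiece ontology"),
    ("Equipment ontology", "realizes",  "Process ontology"),
    ("Equipment ontology", "processes", "Workpiece ontology"),
]

# Sub-concepts sorted out in step S1-1.
process_subtypes = ["machining", "assembling", "casting"]
```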
s2, performing data preprocessing on the unstructured data:
s2-1, analyzing into a txt file: analyzing the unstructured data into a txt file by using a text analysis tool;
step S2-2, word segmentation: utilizing a Jieba word segmentation tool to segment words of the text file;
step S2-3, removing stop words: performing stop word removal processing on the text after word segmentation;
step S2-4, manual labeling: manually labeling the text data based on a BIO labeling method, wherein process entities are labeled as B-TEC and I-TEC, workpiece entities are labeled as B-WOR and I-WOR, equipment entities are labeled as B-EQU and I-EQU, and the other entities are labeled as O, and partial labeling results are shown in a table 1;
table 1 partial entity annotation results
The BIO labeling scheme can be replaced by the BIOES scheme, where B is the beginning of an entity, I the inside, E the end, S a single-character entity, and O everything else; the labeling scheme is not unique, and different schemes can be chosen for different entity extraction requirements without affecting model training. A tagging sketch follows below.
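As an illustration of how a known entity span maps to BIO tags, here is a minimal sketch; the helper and the example strings are hypothetical, not taken from the patent's data set.

```python
def bio_tags(chars, entity, label):
    """Tag the characters of a sentence with B-/I-/O for one known entity."""
    tags = ["O"] * len(chars)
    start = "".join(chars).find(entity)  # index-safe: every item is one character
    if start >= 0:
        tags[start] = f"B-{label}"
        for i in range(start + 1, start + len(entity)):
            tags[i] = f"I-{label}"
    return tags

# e.g. bio_tags(list("气缸盖压力试验"), "气缸盖", "WOR")
# -> ['B-WOR', 'I-WOR', 'I-WOR', 'O', 'O', 'O', 'O']
```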
Step S3, model training:
Step S3-1, training and test sets: form a training set and a test set for entity extraction model training from the manually labeled data;
Step S3-2, entity extraction model training: establish an entity extraction model based on a bidirectional long short-term memory network (BiLSTM) and a conditional random field (CRF), and train it with the data set;
In this embodiment, building the entity extraction model on a BiLSTM and a CRF avoids the gradient explosion problem that can occur in a conventional recurrent neural network (RNN) and accelerates training; the specific method of establishing and training the model is as follows:
in the LSTM, memory cells are interconnected, replacing the circulation unit in the general RNN, and in addition to having a circulation connection structure between memory cells, there is also circulation inside each memory cell; the input to each memory cell is controlled by an input gate, if the input gate allows, the value can be accumulated to the state, the weight of the state is controlled by a forgetting gate, and the output can be controlled by an output gate to be closed or not;
(1) The input gate is updated as follows:

$$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)$$

where $i_t$ is the input gate at time $t$, $W_i$ is the input weight matrix, $U_i$ is the recurrent weight matrix of the input gate, and $b_i$ is the bias; the sigmoid activation function $\sigma_g$ squashes $W_i x_t + U_i h_{t-1} + b_i$ so that the output $i_t$ lies between 0 and 1; $x_t$ is each character of the input variable, i.e. a sentence, and $h_{t-1}$ is the hidden state of the LSTM at time $t-1$;
(2) The forget gate is updated as follows:

$$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)$$

where $f_t$ is the forget gate at time $t$, $W_f$ is the input weight matrix, $U_f$ is the recurrent weight matrix of the forget gate, and $b_f$ is the bias; the sigmoid activation function $\sigma_g$ sets the output $f_t$ between 0 and 1;
(3) The output gate is updated as follows:

$$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)$$

where $o_t$ is the output gate at time $t$, $W_o$ is the input weight matrix, $U_o$ is the recurrent weight matrix of the output gate, and $b_o$ is the bias; the sigmoid activation function $\sigma_g$ sets the output $o_t$ between 0 and 1;
(4) The memory cell $c_t$ is updated as follows:

$$\tilde{c}_t = \sigma_h(W_c x_t + U_c h_{t-1} + b_c)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

where $c_t$ is the memory cell at time $t$, $c_{t-1}$ is the memory cell at time $t-1$, $\tilde{c}_t$ is an intermediate quantity, $W_c$ is the input weight matrix, $U_c$ is the recurrent weight matrix of the memory cell, and $b_c$ is the bias; the tanh activation function $\sigma_h$ produces the output $\tilde{c}_t$. It can be seen that the forget gate $f_t$ determines the data carried over from the previous memory cell, while the input gate $i_t$ determines the data currently entering the memory cell.
The hidden state $h_t$ of the LSTM is jointly determined by the output gate and the memory cell:

$$h_t = o_t \odot \sigma_h(c_t)$$

where $h_t$ is the hidden state of the LSTM at time $t$.
although LSTM can solve the problem of long distance dependence by memory cells, LSTM is a forward propagation algorithm, the output of one state can only be calculated from its previous state. However, in the problem of named entity recognition, a word vector of a text sentence is input, a named entity has semantic dependence on words nearby, in order to recognize a certain entity, the named entity is often influenced by not only previous words but also following words, and a unidirectional long-short term memory neural network cannot perform named entity recognition and other work in combination with the content behind the current moment, so that a bidirectional long-short term memory neural network model (BiLSTM) is adopted for named entity recognition, and the model structure is shown in fig. 2-3.
The bidirectional long short-term memory network consists of an input layer, a forward hidden layer, a backward hidden layer, and an output layer. The input layer feeds in sequence data, the forward hidden layer computes forward features, and the backward hidden layer computes backward features: the forward hidden layer remembers the information before the current moment, while the backward hidden layer remembers the information after it. Splicing the outputs of the forward and backward hidden layers yields the bidirectional LSTM, i.e. the BiLSTM network.
Finally, the output is fed to a softmax layer that predicts the classification labels of the named entities. For a named entity recognition task with $k$ labels, $\mathrm{label} = \{\mathrm{label}_1, \mathrm{label}_2, \ldots, \mathrm{label}_k\}$, and an input sequence of length $n$, $w = \{w_1, w_2, \ldots, w_n\}$, the BiLSTM produces for each input $w_t$ a score $p_{t,j}$ for each $\mathrm{label}_j$; the scores $p_{t,j}$ over the $n$ characters of the whole sequence form a matrix $P$, where a larger score means the corresponding label is closer to the true label.
In Chinese named entity recognition, an entity is usually a combination of several Chinese characters, labeled character by character with the BIO scheme, the same scheme used to label the training data: B marks the first character of a named entity, I the middle and final characters, and O the non-entity characters. In the labeled example of FIG. 7, entities fall into the three categories Workpiece, Equipment, and Technics: B-Workpiece marks the first character of the workpiece entity 'engine block' (发动机缸体), i.e. '发', while I-Workpiece marks its remaining characters '动', '机', '缸', and '体'. Equipment and Technics entities are labeled in the same way.
For a Chinese input sequence, the output labels obey certain constraints:
(1) The first tag of an entity must be 'B-'; an 'I-' tag must follow a 'B-' tag, and 'O' cannot appear immediately before 'I-';
(2) The tag type within an entity must stay consistent; for example, 'B-Workpiece' must be followed by 'I-Workpiece', not 'I-Equipment';
The BiLSTM does not enforce these constraints, so this embodiment uses a conditional random field (CRF) to further constrain the network output and achieve higher accuracy. The conditional random field is a probabilistic graphical model; such models divide into directed graphical models, including Bayesian networks and hidden Markov models, and undirected graphical models, including conditional random fields.
Conditional random fields (CRF) are now widely applied in natural language processing; they are conditional probability distribution models that introduce feature functions on the basis of the hidden Markov model (HMM).
The transition matrix in the CRF captures the association between the output labels at adjacent moments, so this embodiment places a CRF layer after the BiLSTM layer: the BiLSTM layer extracts features from context and predicts the entity category of the input text, while the CRF layer scores the current output state and further constrains the output, improving prediction accuracy.
The output dimension of the BiLSTM layer equals the number of tag types: for each input $w_i$ the network outputs a probability value $p_{ij}$ for each tag $j$, giving the network output $P$, i.e. the tag probability values corresponding to each input. The CRF computes the labeling probability under conditional constraints. Let $y$ be the predicted tag sequence, $x$ the text input sequence, and $y'$ any candidate tag sequence; then

$$P(y \mid x) = \frac{\exp\big(\mathrm{Score}(x, y)\big)}{\sum_{y'} \exp\big(\mathrm{Score}(x, y')\big)}$$

where $P(y \mid x)$ is the probability of the output $P$ after the conditional random field constraint; the score is computed as

$$\mathrm{Score}(x, y) = \sum_i \mathbf{w}^{\top} \boldsymbol{\psi}_i(x, y)$$

where $\boldsymbol{\psi}_i(x, y)$ is a feature vector. The goal of training the model is therefore to maximize the probability $P(y \mid x)$, obtained through the log-likelihood

$$\log P(y \mid x) = \mathrm{Score}(x, y) - \log \sum_{y'} \exp\big(\mathrm{Score}(x, y')\big)$$

The loss function is defined as $-\log P(y \mid x)$ and optimized with an optimization algorithm, which realizes the training of the entity extraction model BiLSTM-CRF; at prediction time, the highest-scoring tag sequence is decoded, as sketched below.
Step S3-3, entity extraction model evaluation: evaluate the training effect of the entity extraction model by precision, recall, and F1 value, where

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

with $TP$, $FP$, and $FN$ the counts of true positives, false positives, and false negatives;
Step S3-4, relation extraction model training: establish an attention-based relation extraction model and train it with the corresponding data set;
In this embodiment, the attention-based relation extraction model is established and trained as follows:
Because the LSTM treats the output information of every time step with the same 'degree of influence', the idea of weighting is introduced in relation classification to highlight the importance of particular outputs to the classification; the attention mechanism is essentially a weighted sum.
The trained relation extraction model extracts the relations between entities from the unstructured data to be processed: the model first outputs the vector form of the text through the BiLSTM layer and then performs relation classification through the attention layer.
(1) Input and word embedding layer: the model input is a sample in sentence units. The word embedding layer characterizes the input sentence: given a sentence $S$ containing $T$ characters, $S = \{x_1, x_2, \ldots, x_T\}$, where $x_i$ represents each character.
(2) BiLSTM: the structure of the BiLSTM is the same as in step S3-2, and the LSTM unit can be expressed by the following formulas:

$$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)$$

$$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)$$

$$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)$$

$$g_t = \sigma_h(W_g x_t + U_g h_{t-1} + b_g)$$

$$c_t = i_t \odot g_t + f_t \odot c_{t-1}$$

$$h_t = o_t \odot \tanh(c_t)$$
the output of the model includes the forward direction
Figure BDA0003890630420000112
And backward direction
Figure BDA0003890630420000113
Two results, by splicing
Figure BDA0003890630420000114
As the final BiLSTM output.
(3) The attention structure: because the LSTM treats the output information of every time step with the same 'degree of influence', weighting is introduced in relation classification to highlight the importance of particular outputs to the classification; the attention mechanism is essentially a weighted sum.
The model input is in sentence units, and the output through the BiLSTM layer is $H = \{h_1, h_2, \ldots, h_T\}$, with the trainable parameter vector $\mathbf{w} \in \mathbb{R}^{d_w}$, where $\mathbb{R}$ denotes the set of real numbers and $d_w$ the word embedding dimension, satisfying

$$M = \tanh(H)$$

$$\alpha = \mathrm{softmax}(\mathbf{w}^{\top} M)$$

$$r = H \alpha^{\top}$$

where $M$ is an intermediate quantity, $\alpha$ is the attention weight vector, and $r$ is the result of weighting and summing the BiLSTM outputs $H$; the representation vector $h^* = \tanh(r)$ is finally generated through a nonlinear function.
(4) Loss function: the representation vector $h^*$ is mapped onto the class label vector through a fully connected network; for an input sentence $S$, the predicted relation classification probability is output through softmax and the predicted label $\hat{y}$ is obtained by argmax:

$$p(y \mid S) = \mathrm{softmax}\big(W h^* + b\big)$$

$$\hat{y} = \arg\max_{y} \, p(y \mid S)$$

where $W$ and $b$ are the parameter matrix and bias, respectively.
The negative log-likelihood defines the loss function $J(\theta)$ as

$$J(\theta) = -\sum_{j=1}^{m} t_j \log y_j + \lambda \lVert \theta \rVert_F^2$$

where $t \in \mathbb{R}^m$ is the one-hot representation, $y \in \mathbb{R}^m$ is the estimated probability of each relation type output by softmax, $\lambda$ is a regularization hyperparameter, $\theta$ denotes the model parameters of the relation extraction model (including $W$ and $b$), and $\lVert \cdot \rVert_F$ is the Frobenius norm.
Optimizing the loss function $J(\theta)$ with an optimization algorithm realizes the training of the relation extraction model.
Step S3-5, relation extraction model evaluation: evaluate the training effect of the relation extraction model by accuracy;
wherein steps S3-4 and S3-5 may be interchanged with steps S3-2 and S3-3.
Step S4, process knowledge graph construction:
Step S4-1, process entity extraction: extract the unstructured data to be processed with the trained entity extraction model to obtain process entities;
Step S4-2, process entity table: store the process entities extracted by the entity extraction model in table form as the process entity table, part of which is shown in Table 2:
table 2 partial process entity tables
ID Name Label
001 Cylinder cover Workpiece
002 Oil sprayer sheath Workpiece
003 Valve seat Workpiece
004 Pressure test Process for the preparation of a coating
Step S4-3, entity-relation table: extract relations with the trained relation extraction model and, on the basis of the process entity table, obtain an entity-relation table in which entities and relations correspond one to one; a partial entity-relation table is shown in Table 3;

Table 3 Partial entity-relation table

Start_Name                      Relation  End_Name
Pressure test                   acts on   Cylinder cover
Overall cleaning                acts on   Cylinder cover
Press-mounting assembly station realizes  Sheath assembly
Step S4-4, knowledge fusion: perform knowledge fusion over all extracted entities and relations with a method based on semantic similarity calculation, merging knowledge with identical or highly similar semantics, with reference to FIGS. 4-5; the semantic-similarity-based method can be replaced by other methods, such as the inner product method, the cosine method, or the Dice coefficient method.
In this embodiment, the specific method of knowledge fusion based on semantic similarity calculation is as follows (see the sketch after this list):
(1) Semantic similarity calculation: compute the similarity among the concepts, attributes, and structural relations in the process knowledge through the Jaccard similarity coefficient, and classify the similarities to provide a basis for semantic space model fusion;
(2) Semantic space model fusion: according to the fusion operation rules, perform fusion operations on domain knowledge of different similarity levels, eliminating redundancy among similar knowledge and resolving conflicts and contradictions;
(3) Entity linking: link the newly added domain knowledge to the existing graph with a graph-based joint linking model, compute the compatibility and dependency among entities, disambiguate the new knowledge according to the results, and merge it into the knowledge graph.
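Step (1) rests on the Jaccard similarity coefficient, |A ∩ B| / |A ∪ B|; a minimal sketch follows, comparing, for example, the character sets of two entity names (the example strings are illustrative).

```python
def jaccard(a, b):
    """Jaccard similarity coefficient between two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# e.g. jaccard("气缸盖", "缸盖") compares the character sets of two entity names
```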
Step S4-5, knowledge graph: establish the knowledge graph in the neo4j graph database according to the fused entity-relation table; once the process knowledge graph is constructed, process designers can use it for process design and upload new knowledge onto it, realizing knowledge updating and sharing. A sketch of the import step follows below.
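The import into neo4j is sketched below with the official neo4j Python driver; the URI, the credentials, the node label Entity, and the Cypher pattern are assumptions for illustration, not details given in the patent.

```python
from neo4j import GraphDatabase

def build_graph(rows, uri="bolt://localhost:7687", auth=("neo4j", "password")):
    """rows: entity-relation table entries of the form (start_name, relation, end_name)."""
    driver = GraphDatabase.driver(uri, auth=auth)
    with driver.session() as session:
        for start, rel, end in rows:
            # MERGE keeps fused entities unique and links them by the extracted relation
            session.run(
                "MERGE (a:Entity {name: $s}) "
                "MERGE (b:Entity {name: $e}) "
                "MERGE (a)-[:REL {type: $r}]->(b)",
                s=start, e=end, r=rel,
            )
    driver.close()

# e.g. build_graph([("Pressure test", "acts on", "Cylinder cover")])
```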
In summary, the above is only a preferred embodiment of the invention and is not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (7)

1. A domain knowledge extraction method oriented to unstructured data, where unstructured data refers to data whose structure is irregular or incomplete, that has no predefined data model, and that cannot conveniently be expressed in the two-dimensional logical tables of a database;
the extraction method being characterized by comprising the following specific steps:
S1, sorting out domain knowledge concept entities and relations, and establishing a domain knowledge graph schema layer;
S2, preprocessing the unstructured data to obtain manually labeled text data;
S3, establishing an entity extraction model based on a bidirectional long short-term memory network and a conditional random field, establishing an attention-based relation extraction model, and training the entity extraction model and the relation extraction model with their corresponding data sets;
S4, extracting the unstructured data to be processed with the trained entity extraction model to obtain domain entities, stored in table form as a domain entity table; extracting relations with the trained relation extraction model and, on the basis of the domain entity table, obtaining an entity-relation table in which entities and relations correspond one to one;
and performing knowledge fusion based on semantic similarity over all extracted entities and relations to obtain a fused entity-relation table, and establishing a knowledge graph in the neo4j graph database according to it.
2. The unstructured-data-oriented domain knowledge extraction method according to claim 1, characterized in that the specific steps of step S1 are as follows:
S1-1, sorting out the knowledge concepts and relations of the multi-scenario domain according to the purpose of knowledge extraction;
S1-2, defining a knowledge structure according to the domain knowledge concept entities and relations, and establishing the domain knowledge graph schema layer.
3. The unstructured-data-oriented domain knowledge extraction method according to claim 1, characterized in that the specific steps of step S2 are as follows:
S2-1, parsing the unstructured data into a txt file with a text parsing tool;
S2-2, segmenting the text file into words with the Jieba word segmentation tool;
S2-3, removing stop words from the segmented text;
S2-4, manually labeling the text data with the BIO or BIOES labeling scheme.
4. The unstructured-data-oriented domain knowledge extraction method according to any one of claims 1 to 3, characterized in that the specific steps of step S3 are as follows:
S3-1, forming a training set and a test set for training the entity extraction model and the relation extraction model from the manually labeled data;
S3-2, establishing an entity extraction model based on the bidirectional long short-term memory network and the conditional random field, and training it with the corresponding data set; establishing an attention-based relation extraction model, and training it with the corresponding data set;
S3-3, evaluating the training effect of the entity extraction model by precision, recall, and F1 value, and evaluating the training effect of the relation extraction model by accuracy.
5. The unstructured-data-oriented domain knowledge extraction method according to claim 4, characterized in that in step S3-2, when the entity extraction model is established: the output dimension of the BiLSTM layer of the bidirectional long short-term memory network equals the number of tag types; for each input $w_i$, the network outputs a probability value $p_{ij}$ for each candidate tag $j$, finally yielding the network output $P$, i.e. the tag probability values corresponding to each input; the conditional random field CRF computes the labeling probability under conditional constraints: let $y$ be the predicted tag sequence, $x$ the text input sequence, and $y'$ any candidate tag sequence; then

$$P(y \mid x) = \frac{\exp\big(\mathrm{Score}(x, y)\big)}{\sum_{y'} \exp\big(\mathrm{Score}(x, y')\big)}$$

where $P(y \mid x)$ is the probability of the output $P$ after the conditional random field constraint; the score is computed as

$$\mathrm{Score}(x, y) = \sum_i \mathbf{w}^{\top} \boldsymbol{\psi}_i(x, y)$$

where $\boldsymbol{\psi}_i(x, y)$ is a feature vector;
when the entity extraction model is trained, the goal is to maximize the probability $P(y \mid x)$, obtained through the log-likelihood

$$\log P(y \mid x) = \mathrm{Score}(x, y) - \log \sum_{y'} \exp\big(\mathrm{Score}(x, y')\big)$$

and the loss function is defined as $-\log P(y \mid x)$ and optimized with an optimization algorithm, realizing the training of the entity extraction model BiLSTM-CRF.
6. The unstructured-data-oriented domain knowledge extraction method according to claim 4, characterized in that in step S3-2,
when the relation extraction model is established, the text is first encoded into vector form by the BiLSTM layer of the bidirectional long short-term memory network, and the relation is then classified by an attention layer to obtain the relations between entities, establishing the relation extraction model;
when the relation extraction model is trained, the input is sentence by sentence: given a sentence $S$ containing $T$ characters, $S = \{x_1, x_2, \ldots, x_T\}$, where $x_i$ denotes each character, the output of the BiLSTM layer is $H = \{h_1, h_2, \ldots, h_T\}$, and the trainable parameter vector is $\mathbf{w} \in \mathbb{R}^{d_w}$, where $d_w$ denotes the word embedding dimension, satisfying

$$M = \tanh(H)$$

$$\alpha = \mathrm{softmax}(\mathbf{w}^{\top} M)$$

$$r = H \alpha^{\top}$$

where $\alpha$ is the attention weight vector and $r$ is the result of weighting and summing the BiLSTM outputs $H$;
the representation vector $h^* = \tanh(r)$ is finally generated through a nonlinear function;
the representation vector $h^*$ is mapped onto the class label vector through a fully connected network; for an input sentence $S$, the predicted relation classification probability is output through softmax and the predicted label $\hat{y}$ is obtained by argmax:

$$p(y \mid S) = \mathrm{softmax}\big(W h^* + b\big)$$

$$\hat{y} = \arg\max_{y} \, p(y \mid S)$$

where $W$ and $b$ are the parameter matrix and bias, respectively;
the negative log-likelihood defines the loss function as

$$J(\theta) = -\sum_{j=1}^{m} t_j \log y_j + \lambda \lVert \theta \rVert_F^2$$

where $t \in \mathbb{R}^m$ is the one-hot representation, $y \in \mathbb{R}^m$ is the estimated probability of each relation type output by softmax, $\lambda$ is a regularization hyperparameter, $\theta$ denotes the model parameters of the relation extraction model, and $\lVert \cdot \rVert_F$ is the Frobenius norm;
optimizing the loss function $J(\theta)$ with an optimization algorithm realizes the training of the relation extraction model.
7. The unstructured-data-oriented domain knowledge extraction method according to any one of claims 1 to 3, characterized in that in step S4, the knowledge fusion based on semantic similarity calculation proceeds as follows:
(1) Semantic similarity calculation: compute the similarity among the concepts, attributes, and structural relations in the process knowledge through the Jaccard similarity coefficient, and classify the similarities to provide a basis for semantic space model fusion;
(2) Semantic space model fusion: according to the fusion operation rules, perform fusion operations on domain knowledge of different similarity levels, eliminating redundancy among similar knowledge and resolving conflicts and contradictions;
(3) Entity linking: link the newly added domain knowledge to the existing graph with a graph-based joint linking model, compute the compatibility and dependency among entities, disambiguate the new knowledge according to the results, and merge it into the knowledge graph.
CN202211259591.5A 2022-10-14 2022-10-14 Unstructured data-oriented domain knowledge extraction method Active CN115510245B

Priority Applications (1)

Application Number: CN202211259591.5A; Priority Date: 2022-10-14; Filing Date: 2022-10-14; Title: Unstructured data-oriented domain knowledge extraction method (granted as CN115510245B)

Applications Claiming Priority (1)

Application Number: CN202211259591.5A; Priority Date: 2022-10-14; Filing Date: 2022-10-14; Title: Unstructured data-oriented domain knowledge extraction method (granted as CN115510245B)

Publications (2)

Publication Number  Publication Date
CN115510245A        2022-12-23
CN115510245B        2024-05-14

Family

ID=84510722

Family Applications (1)

Application Number: CN202211259591.5A; Title: Unstructured data-oriented domain knowledge extraction method; Priority Date: 2022-10-14; Filing Date: 2022-10-14; Status: Active (CN115510245B)

Country Status (1)

Country Link
CN (1) CN115510245B

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117033527A (en) * 2023-10-09 2023-11-10 之江实验室 Knowledge graph construction method and device, storage medium and electronic equipment
CN117909492A (en) * 2024-03-19 2024-04-19 国网山东省电力公司信息通信公司 Method, system, equipment and medium for extracting unstructured information of power grid

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800411A (en) * 2018-12-03 2019-05-24 哈尔滨工业大学(深圳) Clinical treatment entity and its attribute extraction method
CN110598000A (en) * 2019-08-01 2019-12-20 达而观信息科技(上海)有限公司 Relationship extraction and knowledge graph construction method based on deep learning model
CN110795543A (en) * 2019-09-03 2020-02-14 腾讯科技(深圳)有限公司 Unstructured data extraction method and device based on deep learning and storage medium
CN110866121A (en) * 2019-09-26 2020-03-06 中国电力科学研究院有限公司 Knowledge graph construction method for power field
US20200097601A1 (en) * 2018-09-26 2020-03-26 Accenture Global Solutions Limited Identification of an entity representation in unstructured data
KR102223382B1 (en) * 2019-11-14 2021-03-08 숭실대학교산학협력단 Method and apparatus for complementing knowledge based on multi-type entity
WO2021082366A1 (en) * 2019-10-28 2021-05-06 南京师范大学 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
WO2021196520A1 (en) * 2020-03-30 2021-10-07 西安交通大学 Tax field-oriented knowledge map construction method and system
WO2021212682A1 (en) * 2020-04-21 2021-10-28 平安国际智慧城市科技股份有限公司 Knowledge extraction method, apparatus, electronic device, and storage medium
CN114077673A (en) * 2021-06-21 2022-02-22 南京邮电大学 Knowledge graph construction method based on BTBC model
CN114911945A (en) * 2022-04-13 2022-08-16 浙江大学 Knowledge graph-based multi-value chain data management auxiliary decision model construction method
CN114925689A (en) * 2022-05-24 2022-08-19 淮阴工学院 Medical text classification method and device based on BI-LSTM-MHSA
CN115062109A (en) * 2022-06-16 2022-09-16 沈阳航空航天大学 Entity-to-attention mechanism-based entity relationship joint extraction method
CN115168606A (en) * 2022-07-01 2022-10-11 北京理工大学 Mapping template knowledge extraction method for semi-structured process data

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200097601A1 (en) * 2018-09-26 2020-03-26 Accenture Global Solutions Limited Identification of an entity representation in unstructured data
CN109800411A (en) * 2018-12-03 2019-05-24 哈尔滨工业大学(深圳) Clinical treatment entity and its attribute extraction method
CN110598000A (en) * 2019-08-01 2019-12-20 达而观信息科技(上海)有限公司 Relationship extraction and knowledge graph construction method based on deep learning model
CN110795543A (en) * 2019-09-03 2020-02-14 腾讯科技(深圳)有限公司 Unstructured data extraction method and device based on deep learning and storage medium
CN110866121A (en) * 2019-09-26 2020-03-06 中国电力科学研究院有限公司 Knowledge graph construction method for power field
WO2021082366A1 (en) * 2019-10-28 2021-05-06 南京师范大学 Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus
KR102223382B1 (en) * 2019-11-14 2021-03-08 숭실대학교산학협력단 Method and apparatus for complementing knowledge based on multi-type entity
WO2021196520A1 (en) * 2020-03-30 2021-10-07 西安交通大学 Tax field-oriented knowledge map construction method and system
WO2021212682A1 (en) * 2020-04-21 2021-10-28 平安国际智慧城市科技股份有限公司 Knowledge extraction method, apparatus, electronic device, and storage medium
CN114077673A (en) * 2021-06-21 2022-02-22 南京邮电大学 Knowledge graph construction method based on BTBC model
CN114911945A (en) * 2022-04-13 2022-08-16 浙江大学 Knowledge graph-based multi-value chain data management auxiliary decision model construction method
CN114925689A (en) * 2022-05-24 2022-08-19 淮阴工学院 Medical text classification method and device based on BI-LSTM-MHSA
CN115062109A (en) * 2022-06-16 2022-09-16 沈阳航空航天大学 Entity-to-attention mechanism-based entity relationship joint extraction method
CN115168606A (en) * 2022-07-01 2022-10-11 北京理工大学 Mapping template knowledge extraction method for semi-structured process data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
杨秀璋 et al., "Research on APT attack entity recognition and alignment based on BERT and BiLSTM-CRF", Journal on Communications (《通信学报》), 21 June 2022 (2022-06-21) *
黄培馨; 赵翔; 方阳; 朱慧明; 肖卫东, "End-to-end joint extraction of knowledge triples integrating adversarial training", Journal of Computer Research and Development (计算机研究与发展), no. 12, 15 December 2019 (2019-12-15) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117033527A (en) * 2023-10-09 2023-11-10 之江实验室 Knowledge graph construction method and device, storage medium and electronic equipment
CN117033527B (en) * 2023-10-09 2024-01-30 之江实验室 Knowledge graph construction method and device, storage medium and electronic equipment
CN117909492A (en) * 2024-03-19 2024-04-19 国网山东省电力公司信息通信公司 Method, system, equipment and medium for extracting unstructured information of power grid

Also Published As

Publication number Publication date
CN115510245B 2024-05-14

Similar Documents

Publication Publication Date Title
CN107992597B (en) Text structuring method for power grid fault case
CN110598005B (en) Public safety event-oriented multi-source heterogeneous data knowledge graph construction method
CN104699763B (en) The text similarity gauging system of multiple features fusion
CN115510245B (en) Unstructured data-oriented domain knowledge extraction method
CN111626063A (en) Text intention identification method and system based on projection gradient descent and label smoothing
CN109376242A (en) Text classification algorithm based on Recognition with Recurrent Neural Network variant and convolutional neural networks
CN109726745B (en) Target-based emotion classification method integrating description knowledge
CN108875809A (en) The biomedical entity relationship classification method of joint attention mechanism and neural network
CN108733647B (en) Word vector generation method based on Gaussian distribution
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN113705238B (en) Method and system for analyzing aspect level emotion based on BERT and aspect feature positioning model
CN112925904B (en) Lightweight text classification method based on Tucker decomposition
CN116245107B (en) Electric power audit text entity identification method, device, equipment and storage medium
CN116521882A (en) Domain length text classification method and system based on knowledge graph
She et al. Joint learning with BERT-GCN and multi-attention for event text classification and event assignment
CN111984791A (en) Long text classification method based on attention mechanism
CN117171333A (en) Electric power file question-answering type intelligent retrieval method and system
CN113590827B (en) Scientific research project text classification device and method based on multiple angles
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN116701665A (en) Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method
CN114357166B (en) Text classification method based on deep learning
CN113064967B (en) Complaint reporting credibility analysis method based on deep migration network
Wang et al. Sentiment analysis based on attention mechanisms and bi-directional LSTM fusion model
Wang et al. Event extraction via dmcnn in open domain public sentiment information
Baes et al. RDF triples extraction from company web pages: comparison of state-of-the-art Deep Models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant