CN115510245A - Unstructured data oriented domain knowledge extraction method - Google Patents
- Publication number: CN115510245A
- Application number: CN202211259591.5A
- Authority: CN (China)
- Prior art keywords: entity, knowledge, extraction model, relation, unstructured data
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/367: Information retrieval of unstructured textual data; creation of semantic tools; ontology
- G06F16/335: Information retrieval of unstructured textual data; querying; filtering based on additional data, e.g. user or group profiles
- G06F16/355: Information retrieval of unstructured textual data; clustering/classification; class or cluster creation or modification
- G06F40/216: Natural language analysis; parsing using statistical methods
- G06F40/295: Natural language analysis; recognition of textual entities; named entity recognition
- G06N3/049: Neural network architectures; temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08: Neural networks; learning methods
- Y02P90/30: Climate change mitigation technologies in the production or processing of goods; computing systems specially adapted for manufacturing
Abstract
The invention discloses a domain knowledge extraction method for unstructured data, comprising the following steps: establishing an entity extraction model based on a bidirectional long short-term memory (BiLSTM) neural network and a conditional random field (CRF), establishing a relation extraction model based on an attention mechanism, and training the two models separately; extracting the unstructured data to be processed with the trained entity extraction model to obtain domain entities, which are stored in tabular form as a domain entity table; extracting relations with the trained relation extraction model and, on the basis of the domain entity table, obtaining an entity-relation table; performing knowledge fusion based on semantic similarity over all extracted entities and relations to obtain a fused entity-relation table, and establishing a knowledge graph in the neo4j graph database. The invention addresses the problems that domain knowledge acquisition is currently mostly manual, management efficiency is low, and domain knowledge systems are incomplete, and realizes knowledge extraction from unstructured data.
Description
Technical Field
The invention belongs to the technical field of knowledge extraction, and particularly relates to a domain knowledge extraction method oriented to unstructured data.
Background
Domain knowledge is characterized by high specialization, diverse knowledge carriers, and a complex knowledge system. Against the background of intelligent manufacturing, product research and development place increasingly urgent demands on domain knowledge; establishing a complete system for acquiring, managing, and sharing domain knowledge can effectively improve product research and development efficiency, and a domain knowledge graph is the key to realizing this goal. A knowledge graph is essentially a large-scale semantic network that describes real-world concepts and events as entities, with edges representing the interrelationships between them. The core of a knowledge graph is the triple composed of entities, attributes, and relations. Structurally, a knowledge graph can be divided into a mode layer (schema layer) and a data layer: the mode layer consists of concept ontologies and relations and describes the structure of the knowledge graph, while the data layer is the instantiated knowledge graph constructed from specific data under the guidance of the mode layer.
The domain knowledge graph is an important means of managing domain knowledge and its relationships; through it, the various kinds of knowledge in a domain can be managed uniformly. The construction process of the knowledge graph is therefore important. First, the data sources of the knowledge graph must be identified. In knowledge graph construction, data sources are divided into structured, semi-structured, and unstructured data; extraction from structured and semi-structured data is mature, while extraction from unstructured data is still in a development stage. In practical application, knowledge graph construction still relies mainly on manual work, and automatic construction mainly handles structured and semi-structured data. The process domain therefore needs an automatic knowledge extraction method for unstructured data, which helps manage complex, multi-source heterogeneous domain knowledge and supports design and decision-making in the domain.
Knowledge extraction from unstructured data can be decomposed into two parts: entity extraction and relation extraction.
In entity extraction, the development of natural language processing (NLP) technology has produced many deep-learning-based entity recognition algorithms. The recurrent neural network (RNN) processes sequence data and is suited to unstructured data consisting mainly of text. On this basis, the long short-term memory network (LSTM) was developed to avoid the gradient explosion problem, the bidirectional LSTM (BiLSTM) was developed to accelerate training, and a conditional random field (CRF) can be added to define the loss function and further improve extraction accuracy.
For relation extraction, current methods include the pipeline method and end-to-end (end2end) methods. The former first uses an entity extractor to identify the entities in a sentence, then pairs the extracted entities and feeds each pair, together with the original sentence, to a relation recognizer that identifies the relation between the two input entities. The latter, also called end-to-end relation extraction, extracts triples directly by processing each sentence. With the development of deep learning, relation extraction models based on convolutional neural networks (CNN) and on attention mechanisms have emerged in the field.
However, the entity and relation extraction methods above are currently used mainly in the general knowledge domain, which is characterized by wide coverage and large data volume; general-domain knowledge graphs are therefore usually constructed bottom-up, extracting information from large amounts of data to build the entities and relations of the graph. Domain knowledge differs from general knowledge: it places more weight on the specialization of knowledge and so requires a more rigorous structure. A domain knowledge graph must be constructed top-down: the mode layer of the graph is designed first, and the information belonging to the domain is determined according to the mode layer. At present, however, domain knowledge graph construction is still mainly manual, management efficiency is low, the processed data are mostly structured and semi-structured, and a systematic method for knowledge extraction from unstructured data is still lacking.
Disclosure of Invention
In view of the above, the invention provides a domain knowledge extraction method oriented to unstructured data, which addresses the problems that domain knowledge acquisition is currently mostly manual, management efficiency is low, and domain knowledge systems are incomplete, and realizes knowledge extraction from unstructured data.
The invention is realized by the following technical scheme:
A domain knowledge extraction method oriented to unstructured data, wherein unstructured data refers to data whose structure is irregular or incomplete, which has no predefined data model, and which is inconvenient to represent in a two-dimensional database logic table;
the extraction method comprises the following specific steps:
S1, combing the domain knowledge concept entities and relationships to establish a domain knowledge graph mode layer;
s2, preprocessing unstructured data to obtain manually marked text data;
S3, establishing an entity extraction model based on the bidirectional long short-term memory neural network and the conditional random field, establishing a relation extraction model based on an attention mechanism, and training the entity extraction model and the relation extraction model with their corresponding data sets;
S4, extracting the unstructured data to be processed with the trained entity extraction model to obtain domain entities, stored in tabular form as a domain entity table; extracting relations with the trained relation extraction model and, on the basis of the domain entity table, obtaining an entity-relation table in which entities and relations correspond one to one;
and performing knowledge fusion based on semantic similarity according to all the extracted entities and relations to obtain an entity-relation table after knowledge fusion, and establishing a knowledge graph in the neo4j graph database according to the entity-relation table.
Further, the specific steps of step S1 are as follows:
S1-1, combing the multi-scenario domain knowledge concepts and relations according to the purpose of knowledge extraction;
and S1-2, defining a knowledge structure according to the domain knowledge concept entity and the relationship, and establishing a domain knowledge graph mode layer.
Further, the specific steps of step S2 are as follows:
s2-1, analyzing unstructured data into a txt file by using a text analysis tool;
s2-2, performing word segmentation on the text file by using a Jieba word segmentation tool;
s2-3, performing stop word removal processing on the text after word segmentation;
and S2-4, manually labeling the text data based on a BIO labeling method or a BIOES labeling method.
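The preprocessing steps S2-2 through S2-4 can be sketched in Python. In practice the patent uses the Jieba segmenter and manual labeling; the stop-word list, pre-tokenized input, and entity lexicon below are illustrative stand-ins, not part of the patent:

```python
# Minimal sketch of steps S2-2..S2-4: stop-word removal after segmentation,
# then character-level BIO labeling. The stop words and entity types here are
# illustrative only.

STOP_WORDS = {"的", "了", "在", "是"}  # illustrative stop-word list

def remove_stop_words(tokens):
    """Step S2-3: drop stop words from a segmented token list."""
    return [t for t in tokens if t not in STOP_WORDS]

def bio_label(chars, entities):
    """Step S2-4: derive character-level BIO labels from entity annotations.

    `entities` maps an entity surface string to its type tag,
    e.g. {"柴油机": "WOR"} for a workpiece entity.
    """
    labels = ["O"] * len(chars)
    text = "".join(chars)
    for surface, etype in entities.items():
        start = text.find(surface)
        while start != -1:
            labels[start] = f"B-{etype}"          # entity beginning
            for i in range(start + 1, start + len(surface)):
                labels[i] = f"I-{etype}"          # entity inside
            start = text.find(surface, start + 1)
    return labels
```

In a real pipeline the token list would come from `jieba` rather than being supplied by hand.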
Further, the specific steps of step S3 are as follows:
s3-1, forming a training set and a testing set for training an entity extraction model and a relationship extraction model according to the manually marked data;
S3-2, establishing an entity extraction model based on the bidirectional long short-term memory neural network and the conditional random field, and training it with its data set; establishing a relation extraction model based on an attention mechanism, and training it with its data set;
S3-3, evaluating the training effect of the entity extraction model by precision, recall, and F1 score; and evaluating the training effect of the relation extraction model by accuracy.
Further, in step S3-2, the entity extraction model is established as follows: the output dimension of the bidirectional long short-term memory (BiLSTM) layer equals the number of label types, so for each input w_i the network outputs a probability value p_ij for each candidate label j, giving the network output matrix P, i.e., the label probability values for every input. The conditional random field (CRF) then computes the labeling probability under conditional constraints. Let x be the input text sequence, y a predicted labeling sequence, and let y' range over all candidate labeling sequences; then

P(y|x) = exp(Score(x, y)) / Σ_{y'} exp(Score(x, y'))

where P(y|x) is the probability of the output P after the conditional random field constraint. The score is calculated by the following formula:

Score(x, y) = Σ_i ψ_i(x, y)

where ψ_i(x, y) is the feature score at position i (combining the emission probabilities from P with the label transition constraints).

When training the entity extraction model, the goal is to maximize the probability P(y|x); taking the log-likelihood gives

log P(y|x) = Score(x, y) − log Σ_{y'} exp(Score(x, y'))

The loss function is defined as −log P(y|x) and is minimized by an optimization algorithm, completing the training of the BiLSTM-CRF entity extraction model.
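The CRF probability P(y|x) above can be illustrated in Python by brute-force enumeration of every candidate label sequence y'. The `score` argument stands in for the patent's BiLSTM-CRF Score(x, y); real implementations use the forward algorithm rather than enumeration, which is feasible here only for short sequences:

```python
import itertools
import math

def crf_log_prob(score, x, y, labels):
    """log P(y|x) = Score(x, y) - log sum over all y' of exp(Score(x, y')).

    `score(x, y)` is any user-supplied scoring function, `labels` is the
    label inventory, and the partition term enumerates every candidate
    sequence y' of the same length as y.
    """
    log_z = math.log(sum(
        math.exp(score(x, y_prime))
        for y_prime in itertools.product(labels, repeat=len(y))))
    return score(x, y) - log_z
```

Training then minimizes the negative of this quantity, matching the loss −log P(y|x) defined above.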
Further, in step S3-2, when the relation extraction model is established, the vector representation of the text is first output by a bidirectional long short-term memory (BiLSTM) layer, and the relation is then classified by an attention mechanism layer to obtain the relations between entities, establishing the relation extraction model.

When the relation extraction model is trained, its input unit is the sentence. Given a sentence S containing T characters, S = {x_1, x_2, ..., x_T}, where x_i denotes each character, the output of the BiLSTM layer is H = {h_1, h_2, ..., h_T}. With a trainable parameter vector w of dimension d_w, where d_w is the word-embedding dimension, the attention layer satisfies:

M = tanh(H)
α = softmax(w^T M)
r = H α^T

where α is the attention weight coefficient and r is the result of the weighted sum of the BiLSTM outputs H.

A characterization vector is finally generated through a nonlinear function: h* = tanh(r).

The characterization vector h* is mapped onto the class-label vector through a fully connected network. For an input sentence S, the predicted relation classification probability is output through softmax, and the predicted label is obtained by argmax:

p(y|S) = softmax(W h* + b)
ŷ = argmax_y p(y|S)

where W and b are the parameter matrix and the bias, respectively.

The negative log-likelihood defines the loss function as:

J(θ) = −Σ_{i=1}^{m} t_i log(y_i) + λ ||θ||²

where t ∈ R^m is the one-hot ground-truth representation, y ∈ R^m is the estimated probability of each relation type output by softmax, λ is a regularization hyperparameter, and θ denotes the model parameters of the relation extraction model.

The loss function J(θ) is minimized by an optimization algorithm, completing the training of the relation extraction model.
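The attention-layer equations M = tanh(H), α = softmax(w^T M), r = Hα^T, h* = tanh(r) can be sketched with NumPy (assumed available); H and w would come from the trained BiLSTM, not be toy values as here:

```python
import numpy as np

def attention_pool(H, w):
    """Attention layer of the relation extraction model.

    H: (d, T) matrix of BiLSTM outputs, one column h_t per character.
    w: trainable (d,)-dimensional attention parameter vector.
    Returns the characterization vector h* = tanh(H @ alpha) and alpha.
    """
    M = np.tanh(H)                          # M = tanh(H), shape (d, T)
    scores = w @ M                          # w^T M, shape (T,)
    alpha = np.exp(scores - scores.max())   # numerically stable softmax
    alpha = alpha / alpha.sum()             # attention weights over positions
    r = H @ alpha                           # r = H alpha^T, weighted sum of h_t
    return np.tanh(r), alpha                # h* and the attention weights
```

The classification head p(y|S) = softmax(W h* + b) would follow as one further matrix multiply.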
Further, in step S4, knowledge fusion based on semantic similarity calculation proceeds as follows:

(1) Semantic similarity calculation: compute the similarity between concepts, attributes, and structural relations in the process knowledge using the Jaccard similarity coefficient, and classify the similarities to provide a basis for semantic space model fusion;

(2) Semantic space model fusion: according to the fusion operation rules, perform fusion operations on domain knowledge at different similarity levels, eliminating redundancy among similar knowledge and resolving conflicts and contradictions;

(3) Entity linking: link newly added domain knowledge to the existing graph using a graph-based joint linking model, compute the compatibility and dependency between entities, disambiguate the new knowledge according to the results, and merge it into the knowledge graph.
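The Jaccard similarity coefficient used in step (1) is, for two sets A and B, |A ∩ B| / |A ∪ B|; a minimal sketch:

```python
def jaccard(a, b):
    """Jaccard similarity coefficient |A ∩ B| / |A ∪ B| between two sets.

    Two identical sets score 1.0, disjoint sets 0.0; two empty sets are
    treated as identical (1.0) to avoid division by zero.
    """
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0
```

In the fusion step, concept or attribute term sets whose coefficient exceeds a chosen threshold would be merged as semantically redundant.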
Beneficial effects:
(1) The invention provides a domain knowledge extraction method for unstructured data, involving knowledge modeling and natural language processing technologies. First, domain knowledge concepts and relations are combed and a domain knowledge graph mode layer is established; then unstructured data are preprocessed, the data set is manually labeled, and training and test sets are created; next, the deep-learning-based named entity recognition model BiLSTM-CRF is trained on the data, its training effect is evaluated by indices such as precision, recall, and F1 score, and the attention-based relation extraction model is trained. During knowledge extraction, the trained model extracts entities from unstructured data, the attention-based relation extraction model extracts relations to form an entity-relation table, knowledge fusion based on semantic similarity is performed on all extracted entities and relations, and the resulting knowledge graph is stored in the graph database neo4j. Domain knowledge is highly specialized, has diverse carriers, and forms a complex system; the method suits the demands of product research, development, and manufacturing for domain knowledge, and establishing a complete system for acquiring, managing, and sharing domain knowledge can effectively improve product research and development efficiency.
(2) An entity extraction model is established based on the bidirectional long short-term memory neural network (BiLSTM) and the conditional random field (CRF) to realize entity extraction from unstructured data, and a relation extraction model is established based on an attention mechanism to realize relation extraction from unstructured data. The combination of the two models automatically extracts the process entities and relations in unstructured data, and training on large data sets yields high extraction accuracy.
(3) Establishing the entity extraction model with the bidirectional long short-term memory neural network (BiLSTM) and the conditional random field (CRF) avoids the gradient explosion problem that may occur in a conventional recurrent neural network (RNN) and accelerates training.
(4) The invention performs knowledge fusion based on semantic similarity, merging knowledge with identical or highly similar semantics among all extracted entities and relations; the semantic similarity calculation method adopted is simple and reliable.
Drawings
FIG. 1 is a schematic flow chart of an implementation of a domain knowledge extraction method for unstructured data.
FIG. 2 is a schematic diagram of the structure of the BiLSTM model.
FIG. 3 is a schematic diagram of the attention-based long short-term memory neural network model.
FIG. 4 is a schematic diagram of a semantic similarity calculation process.
FIG. 5 is a schematic diagram of a semantic space model fusion process.
Fig. 6 is a schematic diagram of the process knowledge graph mode layer established in example 2.
FIG. 7 is a BIO labeling diagram.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
Example 1:
the embodiment provides a domain knowledge extraction method for unstructured data, wherein the unstructured data refer to data which are irregular or incomplete in data structure, have no predefined data model and are inconvenient to express by a database two-dimensional logic table, and are mainly represented by text type data.
The extraction method comprises the following specific steps:
step S1, mode layer construction:
step S1-1, combing the domain concept and the relationship: combing the knowledge concepts and relations in the multi-scene domain according to the purpose of knowledge extraction;
s1-2, constructing a domain knowledge graph mode layer: defining a knowledge structure according to the domain knowledge concept entities and the relationship, and establishing a domain knowledge graph mode layer;
s2, performing data preprocessing on the unstructured data:
step S2-1, analyzing into a txt file: analyzing the unstructured data into a txt file by using a text analysis tool;
step S2-2, word segmentation: utilizing a Jieba word segmentation tool to segment words of the text file;
s2-3, removing stop words: performing stop word removal processing on the text after word segmentation;
step S2-4, manual labeling: manually labeling the text data based on a BIO labeling method;
step S3, model training is carried out:
step S3-1, training and testing set: forming a training set and a test set for training an entity extraction model and a relationship extraction model according to the manually marked data;
step S3-2, training the entity extraction model: establishing an entity extraction model based on the bidirectional long short-term memory neural network (BiLSTM) and the conditional random field (CRF), and training the model with the corresponding data set;
s3-3, evaluating an entity extraction model: evaluating the training effect of the entity extraction model according to the accuracy rate, the recall rate and the F1 value;
s3-4, training a relation extraction model: establishing a relation extraction model based on an attention mechanism, and training the model by using a corresponding data set;
s3-5, evaluating a relation extraction model: evaluating the training effect of the relation extraction model according to the accuracy rate;
wherein the order of steps S3-4 and S3-5 may be exchanged with that of steps S3-2 and S3-3;
s4, constructing a domain knowledge graph:
step S4-1, domain entity extraction: extracting the unstructured data to be processed with the trained entity extraction model to obtain domain entities;
step S4-2, domain entity table: storing the domain entities extracted by the entity extraction model in tabular form as a domain entity table;
step S4-3, entity-relation table: extracting relations with the trained relation extraction model and, on the basis of the domain entity table, obtaining an entity-relation table in which entities and relations correspond one to one;
s4-4, knowledge fusion: performing knowledge fusion based on semantic similarity according to all the entities and relations obtained by extraction, and combining the knowledge with the same or highly similar semantics;
step S4-5, knowledge graph: establishing a knowledge graph in the neo4j graph database according to the entity-relation table after knowledge fusion.
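Step S4-5 loads the fused entity-relation table into neo4j. A minimal sketch that only renders Cypher MERGE statements; actually executing them would require a neo4j driver session, which is not shown, and the `Entity` node label is an illustrative choice rather than the patent's schema:

```python
def triple_to_cypher(head, relation, tail, label="Entity"):
    """Render one (head, relation, tail) row of the entity-relation table
    as a Cypher MERGE statement. MERGE (rather than CREATE) keeps nodes
    and edges deduplicated, matching the knowledge-fusion goal."""
    return (
        f'MERGE (h:{label} {{name: "{head}"}}) '
        f'MERGE (t:{label} {{name: "{tail}"}}) '
        f'MERGE (h)-[:`{relation}`]->(t)'
    )
```

Each row of the fused entity-relation table would be passed through this function and the resulting statements run against the graph database.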
Example 2:
in this embodiment, on the basis of embodiment 1, a paper related to a diesel engine process is taken as an example to extract process knowledge, that is, the unstructured data is a paper related to a diesel engine process, and an implementation flow of the extraction method is shown in fig. 1; the method comprises the following specific implementation steps:
step S1, mode layer construction:
step S1-1, combing process concepts and relations: combing the multi-scenario process knowledge concepts and relations according to the purpose of knowledge extraction. The diesel engine process knowledge can be combed along three dimensions: the process ontology, the workpiece ontology, and the equipment ontology. The process ontology can be divided into machining, assembly, and casting; the workpiece ontology covers the constituent structures and parts of the diesel engine; and the equipment ontology covers the equipment used in processing;
s1-2, constructing a process knowledge map pattern layer: defining a knowledge structure according to the process knowledge concept entity and the relationship, and establishing a process knowledge graph mode layer;
in this embodiment, the specific method for establishing the process knowledge map pattern layer is as follows:
(1) Defining a process knowledge map application scene, and determining a process knowledge concept ontology;
(2) Determining the relationships between the process knowledge concept ontologies. For example, in the diesel engine process knowledge, the relationship between the process ontology and the workpiece ontology is "acts on", the relationship between the equipment ontology and the process ontology is "realizes", and the relationship between the equipment ontology and the workpiece ontology is "processes", as shown in fig. 6;
s2, performing data preprocessing on the unstructured data:
s2-1, analyzing into a txt file: analyzing the unstructured data into a txt file by using a text analysis tool;
step S2-2, word segmentation: utilizing a Jieba word segmentation tool to segment words of the text file;
step S2-3, removing stop words: performing stop word removal processing on the text after word segmentation;
step S2-4, manual labeling: manually labeling the text data based on the BIO labeling method, wherein process entities are labeled B-TEC and I-TEC, workpiece entities B-WOR and I-WOR, and equipment entities B-EQU and I-EQU, with all other characters labeled O; partial labeling results are shown in Table 1;
table 1 partial entity annotation results
The BIO labeling method can be replaced by the BIOES labeling method, in which B marks the beginning of an entity, I the inside, E the end, S a single-character entity, and O everything else. The labeling method is not unique: different labeling methods can be chosen for different entity extraction requirements without affecting model training.
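The relationship between the two labeling schemes can be sketched as a BIO-to-BIOES conversion; this simple version assumes well-formed BIO input and does not handle type changes between directly adjacent entities:

```python
def bio_to_bioes(tags):
    """Convert a BIO tag sequence to BIOES: single-token entities become
    S- tags, and the final token of a multi-token entity becomes E-."""
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag.startswith("B-"):
            # B- stays B- only if the entity continues; otherwise S-
            out.append(("B-" if nxt.startswith("I-") else "S-") + tag[2:])
        elif tag.startswith("I-"):
            # I- stays I- only if the entity continues; otherwise E-
            out.append(("I-" if nxt.startswith("I-") else "E-") + tag[2:])
        else:
            out.append(tag)
    return out
```

Because the conversion is deterministic, a corpus labeled in BIO can be retagged as BIOES without re-annotation.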
Step S3, model training is carried out:
step S3-1, training and testing set: forming a training set and a test set for entity extraction model training according to the manually marked data;
S3-2, training an entity extraction model: an entity extraction model is established based on a bidirectional long short-term memory neural network (BiLSTM) and a conditional random field (CRF), and the model is trained with the data set;
in this embodiment, building the entity extraction model on a BiLSTM and a CRF avoids the gradient explosion problem that may occur in a conventional recurrent neural network (RNN) and speeds up training; the specific method for establishing and training the model is as follows:
in an LSTM, interconnected memory cells replace the recurrent units of an ordinary RNN; besides the recurrent connections between memory cells, there is also recurrence inside each memory cell. The input to each memory cell is controlled by an input gate: if the input gate allows it, the value is accumulated into the cell state; a forget gate controls the weight of the previous state, and an output gate controls whether the output is emitted or suppressed;
(1) The input gate is updated as follows:
i_t = σ_g(W_i x_t + U_i h_{t-1} + b_i)

wherein i_t is the input gate at time t; W_i is the input weight matrix; U_i is the recurrent weight matrix of the input gate; b_i is the bias; the sigmoid activation function σ_g squashes W_i x_t + U_i h_{t-1} + b_i so that the output i_t lies between 0 and 1; x_t is the t-th character of the input variable (a sentence), and h_{t-1} is the hidden state of the LSTM at time t-1;
(2) The forgetting gate is updated in the following way:
f_t = σ_g(W_f x_t + U_f h_{t-1} + b_f)

wherein f_t is the forget gate at time t; W_f is the input weight matrix; U_f is the recurrent weight matrix of the forget gate; b_f is the bias; the sigmoid activation function σ_g squashes W_f x_t + U_f h_{t-1} + b_f so that the output f_t lies between 0 and 1;
(3) The output gate is updated as follows:
o_t = σ_g(W_o x_t + U_o h_{t-1} + b_o)

wherein o_t is the output gate at time t; W_o is the input weight matrix; U_o is the recurrent weight matrix of the output gate; b_o is the bias; the sigmoid activation function σ_g squashes W_o x_t + U_o h_{t-1} + b_o so that the output o_t lies between 0 and 1;
(4) The memory cell c_t is updated by:

c̃_t = σ_h(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t

wherein c_t is the memory cell at time t, c_{t-1} is the memory cell at time t-1, and c̃_t is an intermediate candidate value; W_c is the input weight matrix; U_c is the recurrent weight matrix of the memory cell; b_c is the bias, and the candidate is produced through the tanh activation function σ_h. It can be seen that the forget gate f_t determines how much of the data from the previous memory cell is retained, while the input gate i_t determines how much of the current input enters the memory cell.
Hidden state h of LSTM t Co-determination by output gate and memory cell:
h_t = o_t ⊙ σ_h(c_t)
wherein h is t Is the hidden state of LSTM at time t;
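As an illustrative sketch of the gate and state updates above, one LSTM step can be written in scalar form. The toy weights and the parameter-dictionary layout are assumptions for illustration only, not trained values from the embodiment.

```python
import math

def sigmoid(z):
    """Sigmoid activation sigma_g, squashing to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step following the gate equations above.
    p holds toy weights W_*, U_* and biases b_* (hypothetical)."""
    i_t = sigmoid(p["W_i"] * x_t + p["U_i"] * h_prev + p["b_i"])  # input gate
    f_t = sigmoid(p["W_f"] * x_t + p["U_f"] * h_prev + p["b_f"])  # forget gate
    o_t = sigmoid(p["W_o"] * x_t + p["U_o"] * h_prev + p["b_o"])  # output gate
    c_tilde = math.tanh(p["W_c"] * x_t + p["U_c"] * h_prev + p["b_c"])
    c_t = f_t * c_prev + i_t * c_tilde   # memory-cell update
    h_t = o_t * math.tanh(c_t)           # hidden state h_t = o_t * sigma_h(c_t)
    return h_t, c_t

params = {k: 0.5 for k in
          ("W_i", "U_i", "b_i", "W_f", "U_f", "b_f",
           "W_o", "U_o", "b_o", "W_c", "U_c", "b_c")}
h, c = lstm_step(1.0, 0.0, 0.0, params)
```

A real implementation works on vectors and matrices, but each coordinate follows exactly this scalar recurrence.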
although its memory cells allow an LSTM to capture long-distance dependencies, a unidirectional LSTM propagates information forward only: the output at one step can be computed only from earlier steps. In named entity recognition, however, the input is the word-vector sequence of a sentence, and a named entity depends semantically on nearby words on both sides; recognising an entity is often influenced not only by the preceding words but also by the following ones. Since a unidirectional long short-term memory network cannot take the content after the current moment into account, a bidirectional long short-term memory network (BiLSTM) is adopted for named entity recognition; the model structure is shown in figs. 2-3.
The bidirectional long short-term memory network consists of an input layer, a forward hidden layer, a backward hidden layer and an output layer. The input layer receives the sequence data; the forward hidden layer computes forward features and the backward hidden layer computes backward features, so the forward pass remembers information before the current moment while the backward pass remembers information after it. Concatenating the outputs of the forward and backward hidden layers yields the bidirectional LSTM, i.e. the BiLSTM network.
Finally, the output is fed to a softmax layer to predict the named-entity label. For a named entity recognition task with k labels, label = {label_1, label_2, ..., label_k}, and an input sequence of length n, w = {w_1, w_2, ..., w_n}, the BiLSTM produces for each input w_t a score p_{t,j} for each label_j; the scores p_{t,j} over the n characters of the whole sequence form a matrix P, where a larger score means the corresponding label is closer to the true label.
In Chinese named entity recognition, an entity is usually composed of several Chinese characters, which are labeled with the BIO scheme, the same scheme used to label the training data: B marks the starting character of a named entity, I marks its middle and ending characters, and O marks non-entity characters. In the labeled example of fig. 7, entities fall into three categories, Workpiece, Equipment and Technique: B-Workpiece marks the first character of the workpiece entity 'engine cylinder block', and I-Workpiece marks its remaining characters; Equipment and Technique entities are labeled in the same way.
It can be seen that for the input of the chinese sequence, the output labels have certain constraints:
(1) The starting tag of an entity must be "B-"; a tag "I-" may only follow "B-" or another "I-", and "O" cannot appear immediately before "I-";
(2) The tag type within an entity must stay consistent; for example, "B-Workpiece" must be followed by "I-Workpiece", not "I-Equipment";
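The two constraints can be expressed as a small validity check over a tag sequence. This helper is illustrative only (the patent enforces the constraints through the CRF layer, not through such a rule-based filter):

```python
def valid_bio(tags):
    """Check the output constraints above: every 'I-X' must follow
    'B-X' or another 'I-X' of the same type; entities start with 'B-'."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            etype = tag[2:]
            if prev not in ("B-" + etype, "I-" + etype):
                return False          # 'O' before 'I-', or type mismatch
        prev = tag
    return True
```

Enumerating only the sequences this check accepts is effectively what the CRF's transition matrix learns to prefer.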
BiLSTM does not enforce these constraints, so this embodiment uses a conditional random field (CRF) to further constrain the output of the network and achieve higher accuracy. The conditional random field is a probabilistic graphical model; such models divide into directed graphical models, including Bayesian networks and hidden Markov models, and undirected graphical models, including conditional random fields.
Conditional random fields (CRF) are now widely applied in natural language processing; a CRF is a conditional probability distribution model that introduces feature functions on top of the hidden Markov model (HMM).
The transfer (transition) matrix in the CRF captures the association between output labels at adjacent moments, so this embodiment places a CRF layer after the BiLSTM layer: the BiLSTM layer extracts features from context and predicts the entity category of the input text, while the CRF layer scores the current output state and further constrains the output, improving prediction accuracy.
The output dimension of the BiLSTM layer equals the number of label types: for each input w_i the network outputs a probability value p_ij for each tag j, giving the network output P, i.e. the labeling probability value of every tag for every input. The CRF then calculates the labeling probability under the transition constraints. Let y be a predicted labeling sequence, x the text input sequence, and y' any candidate labeling sequence; then

P(y|x) = exp(Score(x, y)) / Σ_{y'} exp(Score(x, y'))

wherein P(y|x) is the probability value of the output after the constraint of the conditional random field; the Score can be calculated by the following formula:

Score(x, y) = Σ_i ψ_i(x, y)

wherein the feature terms ψ_i(x, y) combine the emission score p_{i,y_i} from P with the transition score between y_{i-1} and y_i from the transfer matrix. The goal of training the model is therefore to maximize the probability P(y|x), whose logarithm is obtained by log-likelihood:

log P(y|x) = Score(x, y) − log Σ_{y'} exp(Score(x, y'))
and defining a loss function as-log (P (y | x)), and optimizing the loss function-log (P (y | x)) through an optimization algorithm to realize the training of the entity extraction model BiLSTM-CRF.
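The normalisation P(y|x) above can be checked on a toy example by brute-force enumeration over all label sequences. The emission and transition scores below are invented for illustration; a real CRF computes the denominator with the forward algorithm rather than enumeration.

```python
import itertools
import math

def score(emissions, transitions, seq):
    """Score(x, y): sum of emission scores plus transition scores,
    matching the formula above."""
    s = sum(emissions[i][tag] for i, tag in enumerate(seq))
    s += sum(transitions[a][b] for a, b in zip(seq, seq[1:]))
    return s

def crf_prob(emissions, transitions, seq):
    """P(y|x) by brute-force normalisation over every label sequence."""
    n, k = len(emissions), len(emissions[0])
    z = sum(math.exp(score(emissions, transitions, y))
            for y in itertools.product(range(k), repeat=n))
    return math.exp(score(emissions, transitions, seq)) / z

emissions = [[1.0, 0.2], [0.1, 1.5]]       # toy BiLSTM scores P: 2 steps x 2 tags
transitions = [[0.5, -0.5], [-0.5, 0.5]]   # toy transfer matrix A
total = sum(crf_prob(emissions, transitions, y)
            for y in itertools.product(range(2), repeat=2))
```

Minimising −log P(y|x) for the gold sequence, as the embodiment does, pushes probability mass toward constraint-respecting labelings.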
Step S3-3, entity extraction model evaluation: the training effect of the entity extraction model is evaluated by the precision rate, recall rate and F1 value, where Precision = TP / (TP + FP), Recall = TP / (TP + FN) and F1 = 2 · Precision · Recall / (Precision + Recall), with TP, FP and FN the numbers of true-positive, false-positive and false-negative entity predictions.
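The three evaluation metrics can be computed directly from entity-level counts; a minimal sketch with hypothetical counts (not results from the embodiment):

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and
    false negative counts, per the standard definitions."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical test-set counts for illustration.
p, r, f = prf1(tp=8, fp=2, fn=2)
```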
S3-4, training a relation extraction model: establishing a relation extraction model based on an attention mechanism, and training the model by using a corresponding data set;
in this embodiment, an attention-based mechanism is adopted to establish a relationship extraction model, and the specific method for establishing and training the model is as follows:
A relation extraction model is then trained to extract the relations between entities from the unstructured data to be extracted. The relation extraction model first encodes the text into vector form through a BiLSTM layer, and then performs relation classification through an attention mechanism layer to obtain the relations between entities.
(1) Input and word embedding layer: the model input is one sentence per sample. The word embedding layer characterises the input sentence: given a sentence S containing T characters, S = {x_1, x_2, ..., x_T}, where x_i denotes each character.
(2) BiLSTM: the BiLSTM structure is the same as in step S3-2, and the LSTM unit can be represented by the following formulas:

c_t = i_t ⊙ g_t + f_t ⊙ c_{t-1}
h_t = o_t ⊙ tanh(c_t)

wherein g_t is the candidate input at time t. The model output includes a forward result h→_t and a backward result h←_t; splicing them, h_t = [h→_t ; h←_t], gives the final BiLSTM output.
(3) The attention structure: because the LSTM assigns the same "degree of influence" to the output at every time step, a weighting scheme is introduced in the relation classification to highlight the output results that matter most for the classification; the attention mechanism is essentially a weighted sum.
The model input is one sentence at a time, and the output of the BiLSTM layer is H = {h_1, h_2, ..., h_T}. With a trainable parameter vector w ∈ R^{d_w}, where R denotes the set of real numbers and d_w the word-embedding dimension, the attention layer satisfies:

M = tanh(H)
α = softmax(w^T M)
r = H α^T

wherein M is an intermediate quantity, α is the vector of attention weight coefficients, and r is the result of weighting and summing the LSTM outputs H; finally the characterisation vector h* = tanh(r) is generated through the nonlinear function.
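The attention computation above can be sketched in pure Python for a tiny H. The matrix values and dimensions are toy assumptions; a real implementation would use a tensor library.

```python
import math

def attention(H, w):
    """Attention over BiLSTM outputs per the formulas above:
    M = tanh(H), alpha = softmax(w^T M), r = H alpha^T, h* = tanh(r).
    H is a d_w x T matrix stored as a list of rows."""
    T = len(H[0])
    M = [[math.tanh(v) for v in row] for row in H]
    scores = [sum(w[d] * M[d][t] for d in range(len(w))) for t in range(T)]
    m = max(scores)                              # stabilised softmax
    exp = [math.exp(s - m) for s in scores]
    alpha = [e / sum(exp) for e in exp]          # attention weights, sum to 1
    r = [sum(alpha[t] * H[d][t] for t in range(T)) for d in range(len(H))]
    return alpha, [math.tanh(v) for v in r]      # h* = tanh(r)

H = [[0.1, 0.9, 0.3],    # toy 2 x 3 BiLSTM output
     [0.4, 0.2, 0.8]]
alpha, h_star = attention(H, w=[1.0, 0.5])
```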
(4) Loss function: the characterisation vector h* is mapped onto the class-label vector through a fully connected network; for an input sentence S, the predicted relation classification probability is output through softmax,

p̂(y|S) = softmax(W h* + b)

and the predicted label is obtained by argmax,

ŷ = argmax_y p̂(y|S)

wherein W and b are the parameter matrix and the bias, respectively.

The negative log-likelihood is used to define the loss function J(θ) as:

J(θ) = −Σ_{i=1}^{m} t_i log(y_i) + λ‖θ‖_F²

wherein t ∈ R^m is the one-hot representation of the true relation label, y ∈ R^m is the estimated probability of each relation type output by softmax, λ is a regularisation hyper-parameter, θ denotes the model parameters of the relation extraction model, including W and b, and ‖·‖_F is the Frobenius norm;
the loss function J (theta) is optimized through an optimization algorithm, and then the training of the relation extraction model can be achieved.
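The loss J(θ) can be evaluated for a toy prediction; the probabilities, one-hot target and parameter values below are illustrative numbers only, not taken from the embodiment.

```python
import math

def relation_loss(y_pred, t_onehot, params, lam):
    """Negative log-likelihood with L2 (Frobenius) regularisation,
    matching J(theta) above: -sum t_i log(y_i) + lam * ||theta||^2."""
    nll = -sum(t * math.log(y) for t, y in zip(t_onehot, y_pred) if t > 0)
    reg = lam * sum(p * p for p in params)
    return nll + reg

loss = relation_loss(y_pred=[0.7, 0.2, 0.1],   # softmax output y
                     t_onehot=[1, 0, 0],       # one-hot true label t
                     params=[0.5, -0.5],       # flattened theta (toy)
                     lam=0.01)
```

Minimising this quantity with a gradient-based optimizer is what "optimizing the loss function through an optimization algorithm" amounts to.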
S3-5, evaluating a relation extraction model: evaluating the training effect of the relation extraction model according to the accuracy rate;
wherein the order of steps S3-4 and S3-5 relative to steps S3-2 and S3-3 is interchangeable.
S4, establishing a process knowledge graph:
s4-1, extracting a process entity: extracting unstructured data to be extracted by using a trained entity extraction model to obtain a process entity;
step S4-2, process entity table: the process entities extracted by the entity extraction model are stored in tabular form as a process entity table; part of the process entity table is shown in table 2:
Table 2 Partial process entity table

| ID | Name | Label |
|---|---|---|
| 001 | Cylinder head | Workpiece |
| 002 | Oil sprayer sheath | Workpiece |
| 003 | Valve seat | Workpiece |
| 004 | Pressure test | Process |
Step S4-3, entity-relation table: the relations are extracted with the trained relation extraction model, and on the basis of the process entity table an entity-relation table is obtained in which entities and relations correspond one to one; a partial entity-relation table is shown in table 3;
TABLE 3 Partial entity-relation table

| Start_Name | Relation | End_Name |
|---|---|---|
| Pressure test | Acts on | Cylinder head |
| Overall cleaning | Acts on | Cylinder head |
| Press-fitting assembly station | Realizes | Sheath assembly |
S4-4, knowledge fusion: knowledge fusion is performed on all extracted entities and relations by a method based on semantic similarity calculation, merging knowledge whose semantics are identical or highly similar, with reference to figs. 4-5. The semantic similarity calculation can be replaced by other measures, such as the inner product method, the cosine method, the Dice coefficient method, and the like.
In this embodiment, a specific method for performing knowledge fusion by using a semantic similarity calculation method is as follows:
(1) Semantic similarity calculation: the similarity between concepts, attributes and structural relations in the process knowledge is calculated through the Jaccard similarity coefficient and classified, providing the basis for semantic space model fusion;
(2) And (3) semantic space model fusion: according to the fusion operation rule, performing fusion operation on the domain knowledge with different similarities, and eliminating similar redundancy or conflict contradiction between the domain knowledge;
(3) Entity linking: and (3) linking the newly added domain knowledge with the existing map by using a combined link model based on the map, calculating the compatibility and the dependency among entities, disambiguating the newly added knowledge according to the calculation result, and merging the newly added knowledge into the knowledge map.
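The Jaccard coefficient used in step (1) above is intersection size over union size. A minimal sketch on two hypothetical near-duplicate entity names, comparing their character sets:

```python
def jaccard(a, b):
    """Jaccard similarity coefficient between two collections,
    computed as |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Two hypothetical entity names sharing a common core (character sets).
sim = jaccard("气缸盖", "气缸盖总成")
```

Entities whose similarity exceeds a chosen threshold would be merged during fusion; the threshold itself is a design choice not fixed by the method.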
Step S4-5, knowledge graph: establishing a knowledge graph in a neo4j graph database according to the entity-relation table after knowledge fusion; after the process knowledge graph is constructed, process designers can design the process by using the process knowledge graph, and can upload new knowledge on the basis of the process knowledge graph to realize the updating and sharing of the knowledge.
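The final step turns the fused entity-relation table into a graph; in the embodiment this is done in a neo4j graph database, but the mapping from table rows to nodes and edges can be sketched in memory. The rows below are hypothetical examples in the style of Table 3.

```python
def build_graph(rows):
    """Build an adjacency-list graph from (start, relation, end)
    rows of the entity-relation table. In practice each row would
    become a Cypher CREATE/MERGE statement in neo4j; this is an
    in-memory sketch only."""
    graph = {}
    for start, rel, end in rows:
        graph.setdefault(start, []).append((rel, end))  # outgoing edge
        graph.setdefault(end, [])                       # ensure node exists
    return graph

rows = [("Pressure test", "Acts on", "Cylinder head"),
        ("Press-fitting assembly station", "Realizes", "Sheath assembly")]
g = build_graph(rows)
```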
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (7)
1. A domain knowledge extraction method oriented to unstructured data, the unstructured data referring to data whose structure is irregular or incomplete, which has no predefined data model and is inconvenient to express with the two-dimensional logic table of a database;
the extraction method is characterized by comprising the following specific steps:
S1, combing the domain knowledge concept entities and relationships to establish a domain knowledge graph mode layer;
s2, preprocessing unstructured data to obtain manually marked text data;
s3, establishing an entity extraction model based on the bidirectional long-short time memory neural network and the conditional random field, establishing a relation extraction model based on an attention mechanism, and training the entity extraction model and the relation extraction model respectively by using corresponding data sets;
s4, extracting unstructured data to be extracted by using the trained entity extraction model to obtain a field entity, and storing the field entity in a table form as a field entity table; extracting the relationship by using the trained relationship extraction model, and obtaining an entity-relationship table in which the entities and the relationships are in one-to-one correspondence on the basis of the field entity table;
and performing knowledge fusion based on semantic similarity according to all the extracted entities and relations to obtain an entity-relation table after the knowledge fusion, and establishing a knowledge graph in the neo4j graph database according to the entity-relation table.
2. The method for extracting the domain knowledge of the unstructured data according to claim 1, wherein the step S1 comprises the following steps:
s1-1, combing the knowledge concepts and relations in the multi-scene field according to the purpose of knowledge extraction;
and S1-2, defining a knowledge structure according to the domain knowledge concept entity and the relationship, and establishing a domain knowledge graph mode layer.
3. The method for extracting domain knowledge of unstructured data according to claim 1, wherein the step S2 comprises the following steps:
s2-1, analyzing unstructured data into a txt file by using a text analysis tool;
s2-2, performing word segmentation on the text file by using a Jieba word segmentation tool;
s2-3, performing stop word removal processing on the text after word segmentation;
and S2-4, manually labeling the text data based on a BIO labeling method or a BIOES labeling method.
4. The method for extracting the domain knowledge of the unstructured data according to any one of claims 1 to 3, wherein the specific steps of the step S3 are as follows:
s3-1, forming a training set and a test set for training an entity extraction model and a relation extraction model according to manually marked data;
s3-2, establishing an entity extraction model based on the bidirectional long-short time memory neural network and the conditional random field, and training the model by using a corresponding data set; establishing a relation extraction model based on an attention mechanism, and training the model by using a corresponding data set;
s3-3, evaluating the training effect of the entity extraction model according to the accuracy rate, the recall rate and the F1 value; and evaluating the training effect of the relation extraction model according to the accuracy rate.
5. The method for extracting domain knowledge from unstructured data according to claim 4, wherein in step S3-2, when building the entity extraction model: the output dimension of the BiLSTM layer of the bidirectional long short-term memory neural network BiLSTM equals the number of label types, and for each input w_i the network outputs a probability value p_ij for its corresponding tag j, finally obtaining the network output P, i.e. the labeling probability value corresponding to each tag for each input; the conditional random field CRF calculates the labeling probability value under the condition constraint, and if y is the predicted labeling sequence, x is the text input sequence, and y' is any candidate labeling sequence, then

P(y|x) = exp(Score(x, y)) / Σ_{y'} exp(Score(x, y'))

wherein P(y|x) is the probability value of the output P after the constraint of the conditional random field; the Score can be calculated by the following formula:

Score(x, y) = Σ_i ψ_i(x, y)

wherein ψ_i(x, y) is a feature vector;

when the entity extraction model is trained, the aim is to maximize the probability P(y|x), whose logarithm is obtained through log-likelihood:

log P(y|x) = Score(x, y) − log Σ_{y'} exp(Score(x, y'))
and defining a loss function as-log (P (y | x)), and optimizing the loss function-log (P (y | x)) through an optimization algorithm to realize the training of the entity extraction model BiLSTM-CRF.
6. The method for extracting domain knowledge of unstructured data according to claim 4, wherein in step S3-2,
when the relation extraction model is established, the text is first output in vector form through a BiLSTM layer of the bidirectional long short-term memory neural network BiLSTM, and the relations are then classified through an attention mechanism layer to obtain the relations between entities, thereby establishing the relation extraction model;
when the relation extraction model is trained, the input is one sentence at a time; given a sentence S containing T characters, S = {x_1, x_2, ..., x_T}, where x_i denotes each character, the output through the BiLSTM layer is H = {h_1, h_2, ..., h_T}; with a trainable matrix parameter w ∈ R^{d_w}, where d_w denotes the word-embedding dimension, the following are satisfied:
M=tanh(H)
α=softmax(w T M)
r=Hα T
wherein α is the attention weight coefficient, and r is the result of weighting and summing the output H of the BiLSTM layer;
finally generating a characterization vector h through a nonlinear function * =tanh(r);
the characterisation vector h* is mapped onto the class-label vector through a fully connected network; for an input sentence S, the predicted relation classification probability is output through softmax, p̂(y|S) = softmax(W h* + b), and the predicted label is obtained by argmax, ŷ = argmax_y p̂(y|S);
wherein W and b are the parameter matrix and the bias, respectively;
the negative log-likelihood is used to define the loss function as:

J(θ) = −Σ_{i=1}^{m} t_i log(y_i) + λ‖θ‖_F²

wherein t ∈ R^m is the one-hot representation, y ∈ R^m is the estimated probability of each relation type output by softmax, λ is a regularisation hyper-parameter, and θ denotes the model parameters of the relation extraction model;
the loss function J (theta) is optimized through an optimization algorithm, and then the training of the relation extraction model can be achieved.
7. The method for extracting the domain knowledge of the unstructured data according to any one of claims 1 to 3, wherein in step S4, a method based on semantic similarity calculation is adopted to perform knowledge fusion as follows:
(1) Semantic similarity calculation: calculating the similarity among concepts, attributes and structural relations in the process knowledge through the Jaccard similarity coefficient, classifying the similarities and providing a basis for the fusion of semantic space models;
(2) And (3) semantic space model fusion: according to the fusion operation rule, performing fusion operation on the domain knowledge with different similarities, and eliminating similar redundancy or conflict contradiction between the domain knowledge;
(3) Entity linking: and linking the newly added domain knowledge with the existing map by using a combined link model based on the map, calculating the compatibility and the dependency among entities, disambiguating the newly added knowledge according to the calculation result, and integrating the newly added knowledge into the knowledge map.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211259591.5A CN115510245B (en) | 2022-10-14 | 2022-10-14 | Unstructured data-oriented domain knowledge extraction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115510245A true CN115510245A (en) | 2022-12-23 |
CN115510245B CN115510245B (en) | 2024-05-14 |
Family
ID=84510722
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211259591.5A Active CN115510245B (en) | 2022-10-14 | 2022-10-14 | Unstructured data-oriented domain knowledge extraction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115510245B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117033527A (en) * | 2023-10-09 | 2023-11-10 | 之江实验室 | Knowledge graph construction method and device, storage medium and electronic equipment |
CN117909492A (en) * | 2024-03-19 | 2024-04-19 | 国网山东省电力公司信息通信公司 | Method, system, equipment and medium for extracting unstructured information of power grid |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109800411A (en) * | 2018-12-03 | 2019-05-24 | 哈尔滨工业大学(深圳) | Clinical treatment entity and its attribute extraction method |
CN110598000A (en) * | 2019-08-01 | 2019-12-20 | 达而观信息科技(上海)有限公司 | Relationship extraction and knowledge graph construction method based on deep learning model |
CN110795543A (en) * | 2019-09-03 | 2020-02-14 | 腾讯科技(深圳)有限公司 | Unstructured data extraction method and device based on deep learning and storage medium |
CN110866121A (en) * | 2019-09-26 | 2020-03-06 | 中国电力科学研究院有限公司 | Knowledge graph construction method for power field |
US20200097601A1 (en) * | 2018-09-26 | 2020-03-26 | Accenture Global Solutions Limited | Identification of an entity representation in unstructured data |
KR102223382B1 (en) * | 2019-11-14 | 2021-03-08 | 숭실대학교산학협력단 | Method and apparatus for complementing knowledge based on multi-type entity |
WO2021082366A1 (en) * | 2019-10-28 | 2021-05-06 | 南京师范大学 | Interactive and iterative learning-based intelligent construction method for geographical name tagging corpus |
WO2021196520A1 (en) * | 2020-03-30 | 2021-10-07 | 西安交通大学 | Tax field-oriented knowledge map construction method and system |
WO2021212682A1 (en) * | 2020-04-21 | 2021-10-28 | 平安国际智慧城市科技股份有限公司 | Knowledge extraction method, apparatus, electronic device, and storage medium |
CN114077673A (en) * | 2021-06-21 | 2022-02-22 | 南京邮电大学 | Knowledge graph construction method based on BTBC model |
CN114911945A (en) * | 2022-04-13 | 2022-08-16 | 浙江大学 | Knowledge graph-based multi-value chain data management auxiliary decision model construction method |
CN114925689A (en) * | 2022-05-24 | 2022-08-19 | 淮阴工学院 | Medical text classification method and device based on BI-LSTM-MHSA |
CN115062109A (en) * | 2022-06-16 | 2022-09-16 | 沈阳航空航天大学 | Entity-to-attention mechanism-based entity relationship joint extraction method |
CN115168606A (en) * | 2022-07-01 | 2022-10-11 | 北京理工大学 | Mapping template knowledge extraction method for semi-structured process data |
Non-Patent Citations (2)
Title |
---|
Yang Xiuzhang et al.: "Entity recognition and alignment of APT attacks based on BERT and BiLSTM-CRF", Journal on Communications, 21 June 2022 (2022-06-21) *
Huang Peixin; Zhao Xiang; Fang Yang; Zhu Huiming; Xiao Weidong: "End-to-end joint extraction of knowledge triples with adversarial training", Journal of Computer Research and Development, no. 12, 15 December 2019 (2019-12-15) *
Also Published As
Publication number | Publication date |
---|---|
CN115510245B (en) | 2024-05-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107992597B (en) | Text structuring method for power grid fault case | |
CN110598005B (en) | Public safety event-oriented multi-source heterogeneous data knowledge graph construction method | |
CN104699763B (en) | The text similarity gauging system of multiple features fusion | |
CN115510245B (en) | Unstructured data-oriented domain knowledge extraction method | |
CN111626063A (en) | Text intention identification method and system based on projection gradient descent and label smoothing | |
CN109376242A (en) | Text classification algorithm based on Recognition with Recurrent Neural Network variant and convolutional neural networks | |
CN109726745B (en) | Target-based emotion classification method integrating description knowledge | |
CN108875809A (en) | The biomedical entity relationship classification method of joint attention mechanism and neural network | |
CN108733647B (en) | Word vector generation method based on Gaussian distribution | |
CN113255366B (en) | Aspect-level text emotion analysis method based on heterogeneous graph neural network | |
CN113705238B (en) | Method and system for analyzing aspect level emotion based on BERT and aspect feature positioning model | |
CN112925904B (en) | Lightweight text classification method based on Tucker decomposition | |
CN116245107B (en) | Electric power audit text entity identification method, device, equipment and storage medium | |
CN116521882A (en) | Domain length text classification method and system based on knowledge graph | |
She et al. | Joint learning with BERT-GCN and multi-attention for event text classification and event assignment | |
CN111984791A (en) | Long text classification method based on attention mechanism | |
CN117171333A (en) | Electric power file question-answering type intelligent retrieval method and system | |
CN113590827B (en) | Scientific research project text classification device and method based on multiple angles | |
CN116522165B (en) | Public opinion text matching system and method based on twin structure | |
CN116701665A (en) | Deep learning-based traditional Chinese medicine ancient book knowledge graph construction method | |
CN114357166B (en) | Text classification method based on deep learning | |
CN113064967B (en) | Complaint reporting credibility analysis method based on deep migration network | |
Wang et al. | Sentiment analysis based on attention mechanisms and bi-directional LSTM fusion model | |
Wang et al. | Event extraction via dmcnn in open domain public sentiment information | |
Baes et al. | RDF triples extraction from company web pages: comparison of state-of-the-art Deep Models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||