CN115510245B - Unstructured data-oriented domain knowledge extraction method - Google Patents


Info

Publication number: CN115510245B
Authority: CN (China)
Prior art keywords: entity, knowledge, extraction model, model, relationship
Legal status: Active (granted)
Application number: CN202211259591.5A
Other languages: Chinese (zh)
Other versions: CN115510245A
Inventors: 王儒, 孙延劭, 华益威, 魏竹琴, 王国新
Current Assignee: Beijing Institute of Technology BIT
Original Assignee: Beijing Institute of Technology BIT
Application filed by Beijing Institute of Technology BIT
Priority/filing date: 2022-10-14 (priority to CN202211259591.5A)
Publication of CN115510245A: 2022-12-23
Application granted; publication of CN115510245B: 2024-05-14

Classifications

    • G06F16/367 Ontology (creation of semantic tools; information retrieval of unstructured textual data)
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles (querying)
    • G06F16/355 Class or cluster creation or modification (clustering; classification)
    • G06F40/216 Parsing using statistical methods (natural language analysis)
    • G06F40/295 Named entity recognition (phrasal analysis)
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods (neural networks)
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management
    • Y02P90/30 Computing systems specially adapted for manufacturing


Abstract

The invention discloses a domain knowledge extraction method for unstructured data, comprising the following steps: establishing an entity extraction model based on a bidirectional long short-term memory (BiLSTM) neural network and a conditional random field (CRF), establishing a relation extraction model based on an attention mechanism, and training the two models separately; applying the trained entity extraction model to the unstructured data to be processed to obtain domain entities, which are stored in tabular form as a domain entity table; applying the trained relation extraction model to extract relations and, on the basis of the domain entity table, obtaining an entity-relation table; carrying out knowledge fusion based on semantic similarity over all extracted entities and relations to obtain a fused entity-relation table, and establishing a knowledge graph in a neo4j graph database. The method addresses the problems that domain knowledge acquisition in the prior art is mainly manual, management efficiency is low, and domain knowledge systems are incomplete, and it realizes knowledge extraction from unstructured data.

Description

Unstructured data-oriented domain knowledge extraction method
Technical Field
The invention belongs to the technical field of knowledge extraction, and particularly relates to a domain knowledge extraction method for unstructured data.
Background
Domain knowledge is highly specialized, carried by diverse media, and organized in complex knowledge systems. Against the background of intelligent manufacturing, product development and manufacturing depend ever more urgently on domain knowledge, and establishing a complete system for acquiring, managing, and sharing domain knowledge can effectively improve the efficiency of product development; the domain knowledge graph is the key to achieving this goal. A knowledge graph is essentially a large-scale semantic network that describes concepts and events in the real world in terms of entities and represents their interrelationships. Its core is the triplet composed of entities, attributes, and relations. Structurally, a knowledge graph can be divided into a schema layer and a data layer: the schema layer consists of concept ontologies and relations and describes the structure of the knowledge graph, while the data layer is the instantiated knowledge graph constructed from concrete data under the guidance of the schema layer.
The domain knowledge graph is an important means of managing domain knowledge and relations, allowing the various kinds of knowledge in a domain to be managed uniformly, so the construction process of the knowledge graph matters. The data sources for construction must first be identified; they are divided into structured, semi-structured, and unstructured data. Extraction from structured and semi-structured data is mature, while extraction from unstructured data is still developing. In practical applications, knowledge graph construction is still mostly manual, and automatic construction mainly handles structured and semi-structured data. The field therefore needs an automatic knowledge extraction method for unstructured data, which would help manage multi-source, heterogeneous knowledge in complex domains and facilitate design and decision-making in those domains.
Extracting knowledge from unstructured data can be decomposed into two parts: entity extraction and relation extraction.
In entity extraction, the development of natural language processing (NLP) has produced a variety of deep-learning-based entity recognition algorithms. The recurrent neural network (RNN) is a class of neural networks for processing sequence data and is suited to unstructured data consisting mainly of text. On this basis, the long short-term memory network (LSTM) was developed to avoid the gradient explosion problem, the bidirectional network (BiLSTM) was developed to exploit context in both directions, and a conditional random field (CRF) defining the loss function was added to further improve extraction accuracy.
In relation extraction, there are pipeline and end-to-end (end2end) methods. The former uses an entity extractor to identify the entities in each sentence, then pairs the extracted entities and feeds each pair, together with the original sentence, to a relation classifier that identifies the relation between the two entities. The latter, also called end-to-end relation extraction, processes each sentence directly and extracts triples. With the development of deep learning, relation extraction models based on convolutional neural networks (CNN) and on attention mechanisms have emerged.
However, the entity extraction and relation extraction methods above are currently used mostly in the general knowledge domain. General knowledge is broad in coverage and large in volume, so general-domain knowledge graphs are usually constructed bottom-up, extracting entities and relations from massive data. Domain knowledge differs: it emphasizes expertise and therefore requires a stricter structure. A domain knowledge graph must be constructed top-down: its schema layer is designed first, and the schema layer determines which information counts as domain knowledge. In practice, however, domain knowledge graph construction is still mostly manual, management efficiency is low, the data processed are mainly structured and semi-structured, and a systematic method for extracting knowledge from unstructured data is still lacking.
Disclosure of Invention
In view of the above, the invention provides a domain knowledge extraction method for unstructured data, which addresses the problems that existing domain knowledge acquisition is mainly manual, management efficiency is low, and domain knowledge systems are incomplete, and which realizes knowledge extraction from unstructured data.
The invention is realized by the following technical scheme:
The unstructured data are data whose structure is irregular or incomplete, that have no predefined data model, and that are inconvenient to represent in a two-dimensional logical table of a database;
The extraction method comprises the following specific steps:
Step S1, organizing the domain knowledge concept entities and relations to establish the schema layer of the domain knowledge graph;
Step S2, preprocessing unstructured data to obtain manually annotated text data;
Step S3, establishing an entity extraction model based on a bidirectional long short-term memory neural network and a conditional random field, establishing a relation extraction model based on an attention mechanism, and training the entity extraction model and the relation extraction model with their respective data sets;
Step S4, extracting the unstructured data to be processed with the trained entity extraction model to obtain domain entities, which are stored in tabular form as a domain entity table; extracting relations with the trained relation extraction model and, on the basis of the domain entity table, obtaining an entity-relation table in which entities and relations correspond one to one;
and carrying out knowledge fusion based on semantic similarity over all extracted entities and relations to obtain a fused entity-relation table, and establishing a knowledge graph in a neo4j graph database according to the entity-relation table.
Further, the specific steps of step S1 are as follows:
Step S1-1, organizing the multi-scenario domain knowledge concepts and relations according to the purpose of knowledge extraction;
and Step S1-2, defining a knowledge structure according to the domain knowledge concept entities and relations, and establishing the schema layer of the domain knowledge graph.
Further, the specific steps of step S2 are as follows:
S2-1, parsing the unstructured data into txt files with a text parsing tool;
S2-2, segmenting the text files with the Jieba word segmentation tool;
S2-3, removing stop words from the segmented text (see the sketch following step S2-4);
and S2-4, manually labeling the text data based on the BIO or BIOES labeling scheme.
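A minimal sketch of steps S2-1 to S2-3 follows, assuming the document has already been parsed to a txt file and that a plain-text stopword list is available; the file names are illustrative, not part of the patent. Jieba is the real Chinese segmentation library named above.

```python
# Minimal sketch of the preprocessing pipeline (steps S2-1 to S2-3).
# Assumes the unstructured source was already parsed to paper.txt by a
# text-parsing tool; paper.txt and stopwords.txt are illustrative names.
import jieba

def preprocess(txt_path: str, stopword_path: str) -> list[str]:
    with open(txt_path, encoding="utf-8") as f:
        text = f.read()
    with open(stopword_path, encoding="utf-8") as f:
        stopwords = {line.strip() for line in f if line.strip()}
    # Step S2-2: segment the Chinese text with the Jieba tokenizer.
    tokens = jieba.lcut(text)
    # Step S2-3: drop stopwords and whitespace-only tokens.
    return [t for t in tokens if t.strip() and t not in stopwords]

tokens = preprocess("paper.txt", "stopwords.txt")
```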
Further, the specific steps of step S3 are as follows:
S3-1, forming a training set and a test set for training the entity extraction model and the relation extraction model from the manually annotated data;
S3-2, establishing the entity extraction model based on a bidirectional long short-term memory neural network and a conditional random field, and training it with the corresponding data set; establishing the relation extraction model based on an attention mechanism, and training it with the corresponding data set;
S3-3, evaluating the training effect of the entity extraction model by precision, recall, and F1 value; and evaluating the training effect of the relation extraction model by accuracy.
Further, in step S3-2, when the entity extraction model is established: the output dimension of the BiLSTM layer of the bidirectional long short-term memory network equals the number of label types; for each input w_i the network outputs a probability value P_ij for label j, yielding the network output P, i.e. a labeling probability value for every label at every input. The conditional random field CRF computes the labeling probability under conditional constraints: let y be a predicted label sequence, x the text input sequence, and y' range over the candidate label sequences; then
P(y|x) = exp(Score(x, y)) / Σ_{y'} exp(Score(x, y'))
wherein P(y|x) is the probability of the output P after the conditional random field constraint; the Score may be computed by
Score(x, y) = Σ_i ψ_i(x, y)
wherein ψ_i(x, y) is a feature vector;
When training the entity extraction model, the objective is to maximize the probability P(y|x), obtained through the log likelihood
log P(y|x) = Score(x, y) - log Σ_{y'} exp(Score(x, y'))
The loss function is defined as -log(P(y|x)) and is optimized by an optimization algorithm to train the entity extraction model BiLSTM-CRF.
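The loss above can be computed exactly without enumerating all label sequences by running the forward algorithm in log space. The PyTorch sketch below is an illustration under the common assumption that Score(x, y) decomposes into per-position emission scores (the BiLSTM output P) plus label-transition scores; it is not the patent's own code.

```python
# Sketch of the CRF negative log-likelihood -log P(y|x).
import torch

def crf_nll(emissions, transitions, tags):
    # emissions: (T, K) scores P_ij from the BiLSTM layer
    # transitions: (K, K) learned score of moving from label a to label b
    # tags: (T,) gold label indices y
    T, K = emissions.shape
    # Score(x, y): emissions along the gold path plus its transitions.
    score = emissions[0, tags[0]]
    for t in range(1, T):
        score = score + transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    # log Σ_{y'} exp(Score(x, y')) via the forward recursion in log space.
    alpha = emissions[0]                                   # (K,)
    for t in range(1, T):
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    log_z = torch.logsumexp(alpha, dim=0)
    return log_z - score                                   # = -log P(y|x)
```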
Further, in step S3-2,
When the relation extraction model is established, the vector form of the text is first output through the BiLSTM layer of the bidirectional long short-term memory network BiLSTM, and the relations are then classified through an attention mechanism layer to obtain the relations between entities, establishing the relation extraction model;
when the relation extraction model is trained, its input takes sentences as the unit: given a sentence S containing T characters, S = {x_1, x_2, ..., x_T}, where x_i denotes each character, the output of the BiLSTM layer is H = {h_1, h_2, ..., h_T}, and the matrix parameter to be trained is w ∈ R^{d_w}, where d_w denotes the dimension of the word embedding, satisfying:
M = tanh(H)
α = softmax(w^T M)
r = H α^T
wherein α is the attention weight coefficient and r is the weighted sum of the BiLSTM outputs H;
finally, a characterization vector h* = tanh(r) is generated through a nonlinear function;
the characterization vector h* is mapped to the class vector through a fully connected network, and for the input sentence S the predicted probability of each relation class is output through softmax, p(y|S) = softmax(W h* + b), with the predicted label obtained by argmax, ŷ = argmax_y p(y|S),
wherein W and b are a parameter matrix and a bias, respectively;
the negative log likelihood defines the loss function as
J(θ) = -Σ_{i=1}^{m} t_i log(y_i) + λ ||θ||_F^2
wherein t ∈ R^m is the one-hot representation of the true relation, y ∈ R^m is the estimated probability of each relation class output through softmax, λ is the regularization hyper-parameter, and θ denotes the model parameters of the relation extraction model;
the loss function J(θ) is optimized by an optimization algorithm to train the relation extraction model.
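J(θ) transcribes directly into code; in the sketch below the small epsilon guarding the logarithm is an added numerical-stability assumption, not part of the patent.

```python
import torch

def relation_loss(y, t, params, lam=1e-4):
    # y: softmax output over m relation classes; t: one-hot target (both (m,))
    nll = -(t * torch.log(y + 1e-12)).sum()       # cross-entropy term
    l2 = sum((p ** 2).sum() for p in params)      # squared-norm penalty on theta
    return nll + lam * l2
```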
Further, in step S4, the specific method of knowledge fusion based on semantic similarity calculation is as follows:
(1) Semantic similarity calculation: computing the similarity between concepts, attributes, and structural relations in the domain knowledge through the Jaccard similarity coefficient and classifying it, providing the basis for semantic space model fusion;
(2) Semantic space model fusion: carrying out fusion operations on domain knowledge of different similarities according to the fusion operation rules, eliminating redundancy among similar items and conflicts among contradictory ones;
(3) Entity linking: linking the newly added domain knowledge with the existing graph using a graph-based joint linking model, computing the compatibility and dependence among entities, disambiguating the newly added knowledge according to the results, and merging it into the knowledge graph.
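As an illustration of step (1): the Jaccard coefficient of two knowledge items is the size of the intersection of their feature sets divided by the size of the union. The sketch below compares entity names at character granularity, which is one plausible choice; the threshold and the granularity are assumptions, not values fixed by the patent.

```python
# Hedged sketch of Jaccard-based merge-candidate detection.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def merge_candidates(entities, threshold=0.6):
    """Pair up entities whose character-set similarity exceeds a threshold."""
    pairs = []
    for i, e1 in enumerate(entities):
        for e2 in entities[i + 1:]:
            if jaccard(e1, e2) >= threshold:
                pairs.append((e1, e2))
    return pairs

# "气缸盖" (cylinder head) and "缸盖" share 2 of 3 characters -> 0.67 >= 0.6.
print(merge_candidates(["气缸盖", "缸盖", "气门座"]))
```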
The beneficial effects are that:
(1) The invention provides a domain knowledge extraction method for unstructured data, involving knowledge modeling and natural language processing techniques. The method first organizes domain knowledge concepts and relations and establishes the schema layer of the domain knowledge graph; it then preprocesses the unstructured data, creates training and test sets by manually annotating the data, trains the deep-learning-based named entity recognition model BiLSTM-CRF, evaluates the training effect by precision, recall, F1 value, and similar indicators, and trains the attention-based relation extraction model. During knowledge extraction, the trained models extract entities from unstructured data and extract relations with the attention-based relation extraction model, forming an entity-relation table; knowledge fusion based on semantic similarity is carried out over all extracted entities and relations, and the resulting knowledge graph is stored in the neo4j graph database. Domain knowledge is highly specialized, carried by diverse media, and organized in complex systems; the method suits the needs of product development and manufacturing for such knowledge, and establishing a complete system for acquiring, managing, and sharing domain knowledge can effectively improve the efficiency of product development.
(2) The invention establishes an entity extraction model based on a bidirectional long short-term memory neural network (BiLSTM) and a conditional random field (CRF) to realize entity extraction from unstructured data, and a relation extraction model based on an attention mechanism to realize relation extraction from unstructured data; the combination of the two models finally extracts the entities and relations in unstructured data automatically, and training on large data sets can achieve a high extraction accuracy.
(3) Building the entity extraction model on a bidirectional long short-term memory neural network (BiLSTM) and a conditional random field (CRF) avoids the gradient explosion problem that can occur in a traditional recurrent neural network (RNN) while also improving training speed.
(4) The invention performs knowledge fusion based on semantic similarity, merging knowledge whose semantics are identical or highly similar across all extracted entities and relations; the approach is simple and reliable.
Drawings
Fig. 1 is a schematic flow chart of an implementation of a domain knowledge extraction method for unstructured data.
FIG. 2 is a schematic diagram of a BiLSTM model structure.
Fig. 3 is a schematic diagram of the long short-term memory neural network model based on an attention mechanism.
FIG. 4 is a schematic diagram of a semantic similarity calculation process.
FIG. 5 is a schematic diagram of a semantic space model fusion process.
Fig. 6 is a schematic diagram of the process knowledge schema layer established in example 2.
FIG. 7 is a schematic diagram of BIO labeling.
Detailed Description
The invention will now be described in detail by way of example with reference to the accompanying drawings.
Example 1:
The embodiment provides a domain knowledge extraction method for unstructured data, where the unstructured data are data whose structure is irregular or incomplete, that have no predefined data model, and that are inconvenient to represent in a two-dimensional logical table of a database, consisting mainly of text data.
The extraction method comprises the following specific steps:
Step S1, constructing the schema layer:
Step S1-1, organizing domain concepts and relations: according to the purpose of knowledge extraction, organizing the multi-scenario domain knowledge concepts and relations;
Step S1-2, constructing the schema layer of the domain knowledge graph: defining a knowledge structure according to the domain knowledge concept entities and relations, and establishing the schema layer of the domain knowledge graph;
Step S2, carrying out data preprocessing on unstructured data:
Step S2-1, parsing the file into txt: parsing the unstructured data into txt files with a text parsing tool;
Step S2-2, word segmentation: utilizing Jieba word segmentation tools to segment the text file;
Step S2-3, removing stop words: removing stop words from the segmented text;
Step S2-4, manual labeling: manually labeling the text data based on a BIO labeling method;
step S3, performing model training:
Step S3-1, training set and test set: forming a training set and a testing set for training the entity extraction model and the relation extraction model according to the manually marked data;
Step S3-2, training an entity extraction model: establishing an entity extraction model based on a bidirectional long and short term memory neural network (BiLSTM) and a Conditional Random Field (CRF), and training the model by using a corresponding data set;
step S3-3, entity extraction model evaluation: evaluating the training effect of the entity extraction model according to the accuracy rate, the recall rate and the F1 value;
Step S3-4, training a relation extraction model: establishing a relation extraction model based on an attention mechanism, and training the model by utilizing a corresponding data set;
Step S3-5, evaluating a relation extraction model: evaluating the training effect of the relation extraction model according to the accuracy rate;
wherein the order of steps S3-4 and S3-5 may be exchanged with that of steps S3-2 and S3-3;
Step S4, building a domain knowledge graph:
Step S4-1, extracting the domain entity: extracting unstructured data to be extracted by using a trained entity extraction model to obtain a domain entity;
Step S4-2, field entity table: according to the domain entity extracted by the entity extraction model, storing the domain entity in a form of a table as a domain entity table;
Step S4-3, entity-relation table: extracting relations with the trained relation extraction model and, on the basis of the domain entity table, obtaining an entity-relation table in which entities and relations correspond one to one;
Step S4-4, knowledge fusion: according to all the extracted entities and relations, carrying out knowledge fusion based on semantic similarity, and combining knowledge with the same or highly similar semantics;
step S4-5, knowledge graph: and establishing a knowledge graph in the neo4j graph database according to the entity-relation table after knowledge fusion.
Example 2:
In this embodiment, building on embodiment 1, process knowledge is extracted from papers related to diesel engine processes, i.e. the unstructured data are diesel engine process papers; the implementation flow of the extraction method is shown in fig. 1. The specific implementation steps are as follows:
Step S1, a mode layer is constructed:
Step S1-1, organizing process concepts and relations: according to the purpose of knowledge extraction, organizing the multi-scenario process knowledge concepts and relations. Diesel engine process knowledge can be organized along three dimensions: process ontology, workpiece ontology, and equipment ontology. The process ontology divides into machining, assembly, and casting; the workpiece ontology covers the component structures and parts of the diesel engine; and the equipment ontology covers the various kinds of equipment used in processing;
Step S1-2, building the schema layer of the process knowledge graph: defining the knowledge structure according to the process knowledge concept entities and relations, and establishing the schema layer of the process knowledge graph;
in this embodiment, the specific method for establishing the schema layer of the process knowledge graph is as follows:
(1) Defining the application scenario of the process knowledge graph and determining the process knowledge concept ontologies;
(2) Determining the relations between the process knowledge concept ontologies: in the diesel process knowledge, the relation between the process ontology and the workpiece ontology is "acts on", the relation between the equipment ontology and the process ontology is "implements", and the relation between the equipment ontology and the workpiece ontology is "processes", as shown in fig. 6;
Step S2, carrying out data preprocessing on unstructured data:
Step S2-1, parsing the file into txt: parsing the unstructured data into txt files with a text parsing tool;
Step S2-2, word segmentation: utilizing Jieba word segmentation tools to segment the text file;
Step S2-3, removing stop words: removing stop words from the segmented text;
Step S2-4, manual labeling: manually labeling the text data based on the BIO labeling scheme, where process entities are labeled B-TEC and I-TEC, workpiece entities B-WOR and I-WOR, equipment entities B-EQU and I-EQU, and all other characters O; partial labeling results are shown in Table 1;
Table 1. Partial entity labeling results
The BIO labeling scheme can be replaced by the BIOES scheme, in which B marks the beginning of an entity, I its inside, E its end, S a single-character entity, and O everything else; the labeling scheme is not unique, and different schemes can be chosen for different entity extraction requirements without affecting model training.
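The following fragment illustrates what character-level BIO tags look like under the label set above, using the hypothetical snippet 气缸盖压力测试 ("cylinder head pressure test"), where 气缸盖 (cylinder head) is a workpiece entity and 压力测试 (pressure test) a process entity:

```python
# Illustrative character-level BIO tags (hypothetical example sentence).
chars = ["气", "缸", "盖", "压", "力", "测", "试"]
tags  = ["B-WOR", "I-WOR", "I-WOR", "B-TEC", "I-TEC", "I-TEC", "I-TEC"]
for c, t in zip(chars, tags):
    print(f"{c}\t{t}")
```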
Step S3, performing model training:
Step S3-1, training set and test set: forming a training set and a testing set for training the entity extraction model according to the manually marked data;
Step S3-2, training the entity extraction model: establishing the entity extraction model based on a bidirectional long short-term memory neural network (BiLSTM) and a conditional random field (CRF), and training the model with the data set;
this embodiment builds the entity extraction model from a bidirectional long short-term memory neural network (BiLSTM) and a conditional random field (CRF), which avoids the gradient explosion problem that can occur in a traditional recurrent neural network (RNN) and improves training speed; the model is built and trained as follows:
In LSTM, memory cells are connected to each other in place of the recurrent units of an ordinary RNN; besides the recurrent connections between memory cells, there is also recurrence inside each memory cell. The input of each memory cell is controlled by an input gate: if the input gate allows it, the input value is accumulated into the cell state, whose weight is controlled by a forget gate, while an output gate controls whether the output is emitted;
(1) The input gate is updated by:
i_t = σ_g(W_i x_t + U_i h_{t-1} + b_i)
wherein i_t is the input gate at time t, W_i is its input weight matrix, U_i its recurrent weight matrix, and b_i its bias; the sigmoid activation function σ_g squashes W_i x_t + U_i h_{t-1} + b_i so that the output i_t lies between 0 and 1; x_t is the input variable, i.e. each character in a sentence, and h_{t-1} is the hidden state of the LSTM at time t-1;
(2) The forget gate is updated by:
f_t = σ_g(W_f x_t + U_f h_{t-1} + b_f)
wherein f_t is the forget gate at time t, W_f is its input weight matrix, U_f its recurrent weight matrix, and b_f its bias; the sigmoid activation function σ_g squashes W_f x_t + U_f h_{t-1} + b_f so that the output f_t lies between 0 and 1;
(3) The output gate is updated by:
o_t = σ_g(W_o x_t + U_o h_{t-1} + b_o)
wherein o_t is the output gate at time t, W_o is its input weight matrix, U_o its recurrent weight matrix, and b_o its bias; the sigmoid activation function σ_g squashes W_o x_t + U_o h_{t-1} + b_o so that the output o_t lies between 0 and 1;
(4) The memory cell c_t is refreshed by:
c_t = f_t c_{t-1} + i_t σ_h(W_c x_t + U_c h_{t-1} + b_c)
wherein c_t is the memory cell at time t, c_{t-1} the memory cell at time t-1, W_c the input weight matrix, U_c the recurrent weight matrix of the memory cell, b_c the bias, and σ_h the tanh activation function. It can be seen that the forget gate f_t determines how much of the previous memory cell is carried over, and the input gate i_t determines how much of the current input enters the memory cell.
The hidden state h_t of the LSTM is determined jointly by the output gate and the memory cell:
h_t = o_t σ_h(c_t)
wherein h_t is the hidden state of the LSTM at time t;
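The four update equations transcribe directly into code. The following NumPy sketch is purely didactic (random weights and illustrative dimensions, not a trained model), showing one time step of the cell described above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold per-gate parameters keyed by gate name (i, f, o, c).
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    c = f * c_prev + i * np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])
    h = o * np.tanh(c)                                     # hidden state
    return h, c

d_in, d_h = 4, 8
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(d_h, d_in)) for k in "ifoc"}
U = {k: rng.normal(size=(d_h, d_h)) for k in "ifoc"}
b = {k: np.zeros(d_h) for k in "ifoc"}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
```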
Although the LSTM solves the long-range dependence problem through its memory cells, it is a forward propagation algorithm: the output of a state can be computed only from the states before it. In named entity recognition, however, the input is the word vector of a text sentence, and a named entity has semantic dependences on the words around it; recognizing an entity is often influenced not only by the preceding words but also by the following ones, and a unidirectional long short-term memory network cannot use the content after the current moment. A bidirectional long short-term memory neural network model (BiLSTM) is therefore adopted for named entity recognition; the model structure is shown in figs. 2-3.
The bidirectional long short-term memory network consists of an input layer, a forward hidden layer, a backward hidden layer, and an output layer. The input layer feeds in the sequence data; the forward hidden layer computes forward features and memorizes information before the current moment, while the backward hidden layer computes backward features and memorizes information after it. Splicing the outputs of the forward and backward hidden layers yields the bidirectional LSTM, i.e. the BiLSTM network.
Finally, the output is fed to a softmax layer that predicts the classification labels of the named entities. For a named entity recognition task with k labels, label = {label_1, label_2, ..., label_k}, and an input sequence of length n, w = {w_1, w_2, ..., w_n}, BiLSTM yields a score P_{t,j} for each label j at each input w_t; the scores for the n characters of the whole sequence form the matrix P, and the larger a score, the closer the corresponding label is to the true label.
In Chinese named entity recognition, an entity is usually a combination of several Chinese characters, which are labeled with the BIO scheme in the same way as the training data: B marks the beginning character of a named entity, I marks its middle and ending characters, and O marks non-entity characters. In the labeling example of FIG. 7, entities are divided into the Workpiece, Equipment, and Technic categories; B-Workpiece marks the beginning character of the workpiece entity "engine block" (发动机缸体), i.e. "发", and I-Workpiece marks its middle and ending characters. The same applies to Equipment and Technic entities.
It can be seen that for the input of chinese sequences, there is a certain constraint on the label to be output:
(1) The starting tag of an entity must be "B-"; "I-" must follow "B-" (or another "I-"), and "O" cannot occur immediately before "I-";
(2) The label type of an entity needs to be kept consistent, e.g. "B-Workpiece" followed by "I-Workpiece", but not "I-Equipment";
BiLSTM does not impose these constraints, so this embodiment employs a conditional random field (CRF) to further constrain the network output and achieve higher accuracy. The conditional random field is a probabilistic graphical model; such models divide into directed graphical models, including the Bayesian network and the hidden Markov model, and undirected graphical models, including the conditional random field.
The conditional random field (CRF), now widely applied in natural language processing, is a conditional probability distribution model that introduces feature functions on the basis of the hidden Markov model (HMM).
The transition matrix in the CRF takes into account the association between output labels at successive moments, so this embodiment places a CRF layer after the BiLSTM layer: the BiLSTM layer extracts features from the context and predicts entity types for the input text, while the CRF layer scores the current output state and further constrains the output, improving prediction accuracy.
The output dimension of the BiLSTM layer equals the number of label types; for each input w_i the network outputs a probability value P_ij for label j, yielding the network output P, i.e. a labeling probability value for every label at every input. The CRF computes the labeling probability under conditional constraints: let y be a predicted label sequence, x the text input sequence, and y' range over the candidate label sequences; then
P(y|x) = exp(Score(x, y)) / Σ_{y'} exp(Score(x, y'))
wherein P(y|x) is the probability of the output P after the conditional random field constraint; the Score may be computed by
Score(x, y) = Σ_i ψ_i(x, y)
wherein ψ_i(x, y) is a feature vector. The goal of training the model is therefore to maximize the probability P(y|x), obtained through the log likelihood
log P(y|x) = Score(x, y) - log Σ_{y'} exp(Score(x, y'))
The loss function is defined as -log(P(y|x)) and is optimized by an optimization algorithm to train the entity extraction model BiLSTM-CRF.
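One plausible assembly of this BiLSTM-CRF extractor in PyTorch is sketched below; the hyperparameters are illustrative, and the loss reuses the crf_nll helper sketched in the disclosure above rather than any code from the patent itself.

```python
import torch
import torch.nn as nn

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_labels, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, num_labels)   # emission scores P
        # Label-transition scores used by the CRF constraint.
        self.transitions = nn.Parameter(torch.zeros(num_labels, num_labels))

    def emissions(self, token_ids):
        out, _ = self.bilstm(self.emb(token_ids))
        return self.proj(out)                           # (batch, T, K)

model = BiLSTMCRF(vocab_size=5000, num_labels=7)        # B/I for 3 types + O
emis = model.emissions(torch.randint(0, 5000, (1, 20)))
loss = crf_nll(emis[0], model.transitions, torch.zeros(20, dtype=torch.long))
loss.backward()                                         # train with any optimizer
```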
Step S3-3, entity extraction model evaluation: evaluating the training effect of the entity extraction model by precision, recall, and F1 value, where, with TP, FP, and FN the numbers of true positives, false positives, and false negatives,
P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2PR / (P + R)
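Computed over predicted and gold entity spans, these metrics reduce to a few lines; the span encoding (label, start, end) below is an illustrative choice, not fixed by the patent.

```python
def prf1(pred_spans: set, gold_spans: set):
    # Entity-level precision, recall, and F1 from span sets.
    tp = len(pred_spans & gold_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(prf1({("WOR", 0, 3), ("TEC", 3, 7)}, {("WOR", 0, 3)}))
# -> (0.5, 1.0, 0.666...)
```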
Step S3-4, training a relation extraction model: establishing a relation extraction model based on an attention mechanism, and training the model by utilizing a corresponding data set;
The embodiment adopts a relationship extraction model established based on an attention mechanism, and the specific method for establishing and training the model is as follows:
Because the LSTM weights its output information equally at every point in time, while relation classification needs to emphasize the importance of particular outputs to the classification, an attention mechanism, which is essentially a weighted summation, is introduced.
The relation extraction model is trained to process the unstructured data to be extracted and obtain the relations between entities. The model first outputs the vector form of the text through the BiLSTM layer and then classifies the relations through the attention mechanism layer, yielding the relations between the entities.
(1) Input and word embedding layer: the model input is a sample in sentence units. The word embedding layer characterizes the input sentence: given a sentence S containing T characters, S = {x_1, x_2, ..., x_T}, where x_i denotes each character.
(2) BiLSTM: biLSTM is identical in structure to step S3-2, and the LSTM unit can be represented by the following formula:
ct=itgt+ftct-1
ht=ottanh(ct)
the output of the model includes forward direction And backward/>Two results by stitching/>As the final BiLSTM output.
(3) Attention structure: as noted above, the attention mechanism is essentially a weighted summation that lets the model emphasize the outputs most relevant to the classification.
The model input takes sentences as the unit; the output through the BiLSTM layer is H = {h_1, h_2, ..., h_T}, and the matrix parameter to be trained is w ∈ R^{d_w}, where R denotes the set of real numbers and d_w the dimension of the word embedding, satisfying:
M = tanh(H)
α = softmax(w^T M)
r = H α^T
wherein M is an intermediate quantity with no standalone meaning, α is the attention weight coefficient, and r is the weighted sum of the LSTM outputs H; finally, a characterization vector h* = tanh(r) is generated through a nonlinear function.
(4) Loss function: the characterization vector h* is mapped to the class vector through a fully connected network, and for the input sentence S the predicted probability of each relation class is output through softmax, p(y|S) = softmax(W h* + b), with the predicted label obtained by argmax, ŷ = argmax_y p(y|S),
where W and b are the parameter matrix and the bias, respectively.
The negative log likelihood defines the loss function J(θ) as
J(θ) = -Σ_{i=1}^{m} t_i log(y_i) + λ ||θ||_F^2
where t ∈ R^m is the one-hot representation of the true relation, y ∈ R^m is the estimated probability of each relation class output through softmax, λ is the regularization hyper-parameter, θ denotes the model parameters of the relation extraction model, including W and b, and ||·||_F is the Frobenius norm;
the loss function J(θ) is optimized by an optimization algorithm to train the relation extraction model.
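The attention layer and classifier above map onto a compact PyTorch module; the sketch below follows the formulas M = tanh(H), α = softmax(w^T M), r = H α^T, h* = tanh(r) with illustrative dimensions, and is an assumption-level sketch rather than the patent's code.

```python
import torch
import torch.nn as nn

class AttentionRelationHead(nn.Module):
    def __init__(self, d_w, num_relations):
        super().__init__()
        self.w = nn.Parameter(torch.randn(d_w))        # attention vector w
        self.fc = nn.Linear(d_w, num_relations)        # W and b of the classifier

    def forward(self, H):                              # H: (d_w, T) BiLSTM output
        M = torch.tanh(H)
        alpha = torch.softmax(self.w @ M, dim=-1)      # (T,) attention weights
        r = H @ alpha                                  # weighted sum of columns
        h_star = torch.tanh(r)                         # characterization vector h*
        return torch.softmax(self.fc(h_star), dim=-1)  # p(y|S)

head = AttentionRelationHead(d_w=256, num_relations=3)
p = head(torch.randn(256, 30))        # a 30-character sentence
pred = p.argmax()                     # predicted relation label
```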
Step S3-5, evaluating a relation extraction model: evaluating the training effect of the relation extraction model according to the accuracy rate;
Wherein the order of steps S3-4 and S3-5 may be exchanged with that of steps S3-2 and S3-3.
S4, building a process knowledge graph:
step S4-1, extracting process entities: extracting unstructured data to be extracted by using a trained entity extraction model to obtain a process entity;
Step S4-2, process entity table: the process entities extracted by the entity extraction model are stored in tabular form as a process entity table, part of which is shown in Table 2:
TABLE 2. Partial process entity table
ID  | Name            | Label
001 | Cylinder head   | Workpiece
002 | Injector sheath | Workpiece
003 | Valve seat      | Workpiece
004 | Pressure test   | Process
Step S4-3, entity-relation table: extracting relations with the trained relation extraction model and, on the basis of the process entity table, obtaining an entity-relation table with one-to-one correspondence, part of which is shown in Table 3;
TABLE 3. Partial entity-relation table
Start_Name                      | Relation   | End_Name
Pressure test                   | Acts on    | Cylinder head
Overall cleaning                | Acts on    | Cylinder head
Press-mounting assembly station | Implements | Sheath assembly
Step S4-4, knowledge fusion: according to all the extracted entities and relations, carrying out knowledge fusion with a method based on semantic similarity calculation (see figs. 4-5), merging knowledge whose semantics are identical or highly similar; the semantic similarity method can be replaced by other methods, such as the inner product method, the cosine method, or the Dice coefficient method.
In this embodiment, the specific method for performing knowledge fusion by using the semantic similarity calculation method is as follows:
(1) Semantic similarity calculation: calculating the similarity among concepts, attributes and structural relations in the process knowledge through Jaccard similarity coefficients, classifying, and providing a basis for semantic space model fusion;
(2) Semantic space model fusion: according to the fusion operation rule, carrying out fusion operation on domain knowledge with different similarities, and eliminating similar redundancy or conflict contradiction between the domain knowledge;
(3) Entity linking: and linking the newly added domain knowledge with the existing map by using a joint link model based on the map, calculating the compatibility and the dependence among the entities, disambiguating the newly added knowledge according to the calculation result, and merging the newly added knowledge into the knowledge map.
Step S4-5, knowledge graph: establishing the knowledge graph in the neo4j graph database according to the entity-relation table after knowledge fusion. Once the process knowledge graph is constructed, process designers can use it to design processes and upload new knowledge on that basis, realizing the updating and sharing of knowledge.
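Loading the fused entity-relation table into neo4j can be done with the official Python driver; the sketch below uses the Table 3 rows as sample data, and the URI, credentials, and the Entity/REL schema names are placeholder assumptions rather than values given by the patent.

```python
from neo4j import GraphDatabase

rows = [
    ("Pressure test", "Acts on", "Cylinder head"),
    ("Overall cleaning", "Acts on", "Cylinder head"),
]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for start, rel, end in rows:
        # MERGE makes the load idempotent: nodes and edges are created once.
        session.run(
            "MERGE (a:Entity {name: $start}) "
            "MERGE (b:Entity {name: $end}) "
            "MERGE (a)-[:REL {type: $rel}]->(b)",
            start=start, end=end, rel=rel,
        )
driver.close()
```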
In summary, the above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (4)

1. A domain knowledge extraction method for unstructured data, wherein the unstructured data are data whose structure is irregular or incomplete, that have no predefined data model, and that are inconvenient to represent in a two-dimensional logical table of a database;
the extraction method is characterized by comprising the following specific steps of:
Step S1, organizing the domain knowledge concept entities and relations to establish the schema layer of the domain knowledge graph;
Step S2, preprocessing unstructured data to obtain manually annotated text data;
Step S3, establishing an entity extraction model based on a bidirectional long short-term memory neural network and a conditional random field, establishing a relation extraction model based on an attention mechanism, and training the entity extraction model and the relation extraction model with their respective data sets;
Step S4, extracting the unstructured data to be processed with the trained entity extraction model to obtain domain entities, which are stored in tabular form as a domain entity table; extracting relations with the trained relation extraction model and, on the basis of the domain entity table, obtaining an entity-relation table in which entities and relations correspond one to one;
carrying out knowledge fusion based on semantic similarity over all extracted entities and relations to obtain a fused entity-relation table, and establishing a knowledge graph in a neo4j graph database according to the entity-relation table;
The specific steps of step S3 are as follows:
S3-1, forming a training set and a test set for training the entity extraction model and the relation extraction model from the manually annotated data;
S3-2, establishing the entity extraction model based on a bidirectional long short-term memory neural network and a conditional random field, and training it with the corresponding data set; establishing the relation extraction model based on an attention mechanism, and training it with the corresponding data set;
S3-3, evaluating the training effect of the entity extraction model by precision, recall, and F1 value; evaluating the training effect of the relation extraction model by accuracy;
In step S3-2, when the entity extraction model is established: the output dimension of the BiLSTM layer of the bidirectional long short-term memory network equals the number of label types; for each input w_i the network outputs a probability value P_ij for label j, yielding the network output P, i.e. a labeling probability value for every label at every input; the conditional random field CRF computes the labeling probability under conditional constraints: let y be a predicted label sequence, x the text input sequence, and y' range over the candidate label sequences; then
P(y|x) = exp(Score(x, y)) / Σ_{y'} exp(Score(x, y'))
wherein P(y|x) is the probability of the output P after the conditional random field constraint; the Score may be computed by
Score(x, y) = Σ_i ψ_i(x, y)
wherein ψ_i(x, y) is a feature vector;
when training the entity extraction model, the objective is to maximize the probability P(y|x), obtained through the log likelihood
log P(y|x) = Score(x, y) - log Σ_{y'} exp(Score(x, y'))
the loss function is defined as -log(P(y|x)) and is optimized by an optimization algorithm to train the entity extraction model BiLSTM-CRF;
In step S3-2,
when the relation extraction model is established, the vector form of the text is first output through the BiLSTM layer of the bidirectional long short-term memory network BiLSTM, and the relations are then classified through an attention mechanism layer to obtain the relations between entities, establishing the relation extraction model;
when the relation extraction model is trained, its input takes sentences as the unit: given a sentence S containing T characters, S = {x_1, x_2, ..., x_T}, where x_i denotes each character, the output of the BiLSTM layer is H = {h_1, h_2, ..., h_T}, and the matrix parameter to be trained is w ∈ R^{d_w}, where d_w denotes the dimension of the word embedding, satisfying:
M = tanh(H)
α = softmax(w^T M)
r = H α^T
wherein α is the attention weight coefficient and r is the weighted sum of the BiLSTM outputs H;
finally, a characterization vector h* = tanh(r) is generated through a nonlinear function;
the characterization vector h* is mapped onto the class vector through a fully connected network, and for the input sentence S the predicted probability of each relation class is output through softmax, p(y|S) = softmax(W h* + b), with the predicted label obtained by argmax, ŷ = argmax_y p(y|S),
wherein W and b are a parameter matrix and a bias, respectively;
the negative log likelihood defines the loss function as
J(θ) = -Σ_{i=1}^{m} t_i log(y_i) + λ ||θ||_F^2
wherein t ∈ R^m is the one-hot representation of the true relation, y ∈ R^m is the estimated probability of each relation class output through softmax, λ is the regularization hyper-parameter, and θ denotes the model parameters of the relation extraction model;
the loss function J(θ) is optimized by an optimization algorithm to train the relation extraction model.
2. The method for domain knowledge extraction for unstructured data according to claim 1, wherein the specific steps of step S1 are as follows:
Step S1-1, organizing the multi-scenario domain knowledge concepts and relations according to the purpose of knowledge extraction;
and Step S1-2, defining a knowledge structure according to the domain knowledge concept entities and relations, and establishing the schema layer of the domain knowledge graph.
3. The method for domain knowledge extraction for unstructured data according to claim 1, wherein the specific steps of step S2 are as follows:
S2-1, analyzing unstructured data into txt files by using a text analysis tool;
S2-2, utilizing Jieba word segmentation tools to segment the text file;
s2-3, removing stop word processing is carried out on the text after word segmentation;
and S2-4, manually labeling the text data based on the BIO labeling method or BIOES labeling method.
4. A method for domain knowledge extraction for unstructured data according to any of claims 1-3, wherein in step S4, the specific method for knowledge fusion by using a method based on semantic similarity calculation is as follows:
(1) Semantic similarity calculation: computing the similarity between concepts, attributes, and structural relations in the domain knowledge through the Jaccard similarity coefficient and classifying it, providing the basis for semantic space model fusion;
(2) Semantic space model fusion: according to the fusion operation rule, carrying out fusion operation on domain knowledge with different similarities, and eliminating similar redundancy or conflict contradiction between the domain knowledge;
(3) Entity linking: and linking the newly added domain knowledge with the existing map by using a joint link model based on the map, calculating the compatibility and the dependence among the entities, disambiguating the newly added knowledge according to the calculation result, and merging the newly added knowledge into the knowledge map.
CN202211259591.5A (priority 2022-10-14, filed 2022-10-14) Unstructured data-oriented domain knowledge extraction method, Active, granted as CN115510245B

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211259591.5A CN115510245B (en) 2022-10-14 2022-10-14 Unstructured data-oriented domain knowledge extraction method


Publications (2)

Publication Number | Publication Date
CN115510245A | 2022-12-23
CN115510245B | 2024-05-14






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant